Timestamp difference functions in Spark
Exploring some lesser-known timestamp difference functions in Spark SQL
Most real-world datasets contain timestamps. For example:
- Transactions in e-commerce, ticketing systems, etc.
- Clickstream data generated as a user navigates a mobile app or a website
- Events from IoT devices
While analyzing these datasets, we eventually perform one or more operations that involve calculating timestamp differences in specific units. Some examples:
- The time taken to process a payment for an e-commerce transaction, typically expressed in seconds or minutes
- The time spent on a particular screen, or navigating from one screen to another, typically expressed in seconds
- Days, weeks, or months since the user's last activity or purchase
- Months, quarters, or years to represent the user's tenure on the platform
Timestamps can be used to gain valuable insights into user behavior, system performance, and other important metrics. By analyzing the time differences between events, we can identify patterns, trends, and anomalies that might not be apparent otherwise. For example, we might discover that users tend to abandon a particular screen after a certain amount of time, or that a particular payment gateway is slower than others.
Apache Spark has provided the following functions for a long time (since v1.5, per the docs):
- datediff: compute the difference between two dates
- months_between: compute the difference in months between two dates
For other scenarios, such as computing time differences in seconds, minutes, or hours, one had to convert the timestamps to unix timestamps (using unix_timestamp), compute the difference in seconds, and then convert that difference to the appropriate time unit. Similarly, for weeks or quarters, one would compute the difference in days and then convert it to weeks or quarters, as sketched below.
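A minimal sketch of that pre-3.3 approach (the inline VALUES and column names here are purely illustrative):

```sql
-- Pre-Spark 3.3 approach: go through unix_timestamp (seconds since epoch)
-- and scale the difference to the unit you need.
SELECT
  unix_timestamp(end_ts) - unix_timestamp(start_ts)          AS diff_seconds,
  (unix_timestamp(end_ts) - unix_timestamp(start_ts)) / 60   AS diff_minutes,
  (unix_timestamp(end_ts) - unix_timestamp(start_ts)) / 3600 AS diff_hours,
  datediff(to_date(end_ts), to_date(start_ts)) / 7           AS diff_weeks
FROM VALUES
  (TIMESTAMP'2024-03-01 10:00:00', TIMESTAMP'2024-03-18 13:30:00') AS t(start_ts, end_ts);
```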
From Spark 3.3.0 onwards, a new generic function timestampdiff (SPARK-38284) has been introduced, along with the shorter alias datediff (SPARK-38389).
The function signature is timestampdiff(unit, startTimestamp, endTimestamp). It returns the difference between the timestamps endTimestamp and startTimestamp in the specified unit, truncating the fractional part; the return type is bigint (LongType).
It is important to highlight that the three-argument alias datediff is fundamentally different from the traditional two-argument datediff in Spark. Beyond the obvious difference in the number of arguments, the order of the arguments also differs between the two variants. The traditional signature is datediff(endDate, startDate); to get the same result with the three-argument form you need datediff(day, startDate, endDate), as shown below.
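A quick illustration of the two variants side by side (the date literals are arbitrary examples):

```sql
-- Traditional 2-argument form: (endDate, startDate)
SELECT datediff(DATE'2024-03-15', DATE'2024-03-01');       -- 14

-- 3-argument form (alias of timestampdiff): (unit, startDate, endDate)
SELECT datediff(DAY, DATE'2024-03-01', DATE'2024-03-15');  -- 14
```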
As I put this article together, it has been more than 1.5 years since the release of Spark 3.3.0, and these functions are still not documented in the Spark SQL functions reference.
As of Spark 3.5, these functions are also not yet available in the Scala and PySpark APIs.
The supported values for the unit argument are: YEAR, QUARTER, MONTH, WEEK, DAY, HOUR, MINUTE, SECOND, MILLISECOND, and MICROSECOND.
Let's calculate the difference between timestamps in a few of these units: microseconds, minutes, weeks, and quarters.
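Below is a small sketch with made-up timestamp literals for each of these units; the results in the comments follow the truncation-to-whole-units behavior described above:

```sql
SELECT timestampdiff(MICROSECOND, TIMESTAMP'2024-03-01 10:00:00.000001',
                                  TIMESTAMP'2024-03-01 10:00:00.000009');  -- 8

SELECT timestampdiff(MINUTE,  TIMESTAMP'2024-03-01 10:00:00',
                              TIMESTAMP'2024-03-01 11:29:59');             -- 89 (89 min 59 s, truncated)

SELECT timestampdiff(WEEK,    TIMESTAMP'2024-03-01 00:00:00',
                              TIMESTAMP'2024-03-20 00:00:00');             -- 2 (19 days)

SELECT timestampdiff(QUARTER, TIMESTAMP'2023-01-15 00:00:00',
                              TIMESTAMP'2024-03-20 00:00:00');             -- 4
```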
A peek under the hood
The implementation of timestampdiff can be found in DateTimeUtils.scala. It uses Java's java.time.temporal API to compute the unit-wise difference via the between method. From the javadocs of the between method:
The calculation returns a whole number, representing the number of complete units between the two temporals. For example, the amount in hours between the times 11:30 and 13:29 will only be one hour as it is one minute short of two hours.
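The same behavior can be reproduced directly in Spark SQL (timestamp literals chosen to mirror the javadoc example):

```sql
-- 11:30 to 13:29 is one minute short of two hours,
-- so only one complete hour is counted.
SELECT timestampdiff(HOUR, TIMESTAMP'2024-03-01 11:30:00',
                           TIMESTAMP'2024-03-01 13:29:00');  -- 1
```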
Add different units to your timestamps
Along very similar lines, Spark 3.3 has introduced generic functions to add various time units to timestamps: timestampadd (SPARK-38195) and its alias dateadd (SPARK-38332). The function signature is timestampadd(unit, quantity, timestamp), which adds the specified number of units to the given timestamp.
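A brief sketch of timestampadd and its alias, again with purely illustrative literals; note that negative quantities subtract units:

```sql
SELECT timestampadd(HOUR,  3, TIMESTAMP'2024-03-01 10:00:00');  -- 2024-03-01 13:00:00
SELECT timestampadd(DAY,  -7, TIMESTAMP'2024-03-01 10:00:00');  -- 2024-02-23 10:00:00
SELECT dateadd(MONTH, 2, TIMESTAMP'2024-03-01 10:00:00');       -- 2024-05-01 10:00:00
```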
In conclusion, timestamps are a crucial component of many datasets, and analyzing them can provide valuable insights into user behavior, system performance, and other important metrics. By calculating the difference between timestamps in specific units, we can identify patterns, trends, and anomalies that might not be apparent otherwise.
I hope this article added a few new tricks up your sleeve for computing timestamp differences using Spark SQL.