Spark/Hive Hours Between Two Datetimes - hadoop

I would like to know how to precisely get the number of hours between two datetimes in Spark.
There is a function called datediff which I could use to get the number of days and then convert to hours; however, this is less precise than I'd like.
Example of what I want, modeled after datediff:
>>> df = sqlContext.createDataFrame([('2016-04-18 21:18:18','2016-04-19 19:15:00')], ['d1', 'd2'])
>>> df.select(hourdiff(df.d2, df.d1).alias('diff')).collect()
[Row(diff=22)]

Try using a UDF. Here is the sample code; you can modify the UDF to return whatever granularity you want.
from pyspark.sql.functions import udf, col
from datetime import datetime, timedelta
from pyspark.sql.types import LongType

def timediff_x():
    def _timediff_x(date1, date2):
        date11 = datetime.strptime(date1, '%Y-%m-%d %H:%M:%S')
        date22 = datetime.strptime(date2, '%Y-%m-%d %H:%M:%S')
        # returns the difference in whole days; adjust for other granularities
        return (date11 - date22).days
    return udf(_timediff_x, LongType())
df = sqlContext.createDataFrame([('2016-04-18 21:18:18','2016-04-25 19:15:00')], ['d1', 'd2'])
df.select(timediff_x()(col("d2"), col("d1"))).show()
+----------------------------+
|PythonUDF#_timediff_x(d2,d1)|
+----------------------------+
|                           6|
+----------------------------+
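For the hours the question asks for, a minimal variant of the same UDF could look like this (a sketch: the hourdiff name and the rounding choice are mine, not from the original answer):
from pyspark.sql.functions import udf, col
from datetime import datetime
from pyspark.sql.types import LongType

def hourdiff():
    def _hourdiff(date1, date2):
        d1 = datetime.strptime(date1, '%Y-%m-%d %H:%M:%S')
        d2 = datetime.strptime(date2, '%Y-%m-%d %H:%M:%S')
        # round to the nearest hour; use // 3600 instead if you want truncation
        return int(round((d1 - d2).total_seconds() / 3600))
    return udf(_hourdiff, LongType())

df = sqlContext.createDataFrame([('2016-04-18 21:18:18', '2016-04-19 19:15:00')], ['d1', 'd2'])
df.select(hourdiff()(col("d2"), col("d1")).alias("diff")).show()  # 21h 56m rounds to 22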

If your columns are of type TimestampType(), you can use the answer at the following question:
Spark Scala: DateDiff of two columns by hour or minute
However, if your columns are of type StringType(), you have an option that is easier than defining a UDF: using the built-in functions.
from pyspark.sql.functions import *

df = sqlContext.createDataFrame([('2016-04-18 21:18:18', '2016-04-19 19:15:00')], ['d1', 'd2'])
# difference in seconds between the two timestamp strings
diffCol = unix_timestamp(col('d1'), 'yyyy-MM-dd HH:mm:ss') - unix_timestamp(col('d2'), 'yyyy-MM-dd HH:mm:ss')
df2 = df.withColumn('diff_secs', diffCol)
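To turn the seconds into hours, divide diff_secs by 3600; a small follow-up sketch (not from the original answer; the diff_hours column names are illustrative):
from pyspark.sql.functions import col, floor

# fractional hours (negative here, since d1 is earlier than d2 in the sample row)
df3 = df2.withColumn('diff_hours', col('diff_secs') / 3600)
# whole hours, truncated towards negative infinity
df3 = df3.withColumn('diff_hours_floor', floor(col('diff_secs') / 3600))
df3.show()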

Related

Polars string column to pl.datetime in Polars: conversion issue

I am working with a CSV file with the following schema:
'Ticket ID': polars.datatypes.Int64,
..
'Created time': polars.datatypes.Utf8,
'Due by Time': polars.datatypes.Utf8,
..
Converting to Datetime:
df = (
    df.lazy()
    .select(list_cols)
    .with_columns([
        pl.col(convert_to_date).str.strptime(pl.Date, fmt='%d-%m-%Y %H:%M', strict=False).alias("Create_date")  # .cast(pl.Datetime)
    ])
)
Here is the output. 'Created time' is the original str and 'Create_date' is the conversion:
+------------------+-------------+
| Created time     | Create_date |
| str              | date        |
+------------------+-------------+
| 04-01-2021 10:26 | 2021-01-04  |
| 04-01-2021 10:26 | 2021-01-04  |
| 04-01-2021 10:26 | 2021-01-04  |
| 04-01-2021 11:48 | 2021-01-05  |
| ...              | ...         |
| 22-09-2022 22:44 | null        |
| 22-09-2022 22:44 | null        |
| 22-09-2022 22:44 | null        |
| 22-09-2022 22:47 | null        |
+------------------+-------------+
I'm getting a bunch of nulls, and some of the date conversions seem to be incorrect (see the 4th row in the output above). Also, how may I keep the time values?
I'm sure I am doing something wrong; any help would be greatly appreciated.
import polars as pl
from datetime import datetime
from datetime import date, timedelta
import pyarrow as pa
import pandas as pd

convert_to_date = ['Created time','Due by Time','Resolved time','Closed time','Last update time','Initial response time']
url = 'https://raw.githubusercontent.com/DOakville/PolarsDate/main/3000265945_tickets-Dates.csv'

df = pl.read_csv(url, parse_dates=True)

df = df.with_column(
    pl.col(convert_to_date).str.strptime(pl.Date, fmt='%d-%m-%Y %H:%M', strict=False).alias("Create_date")  # .cast(pl.Datetime)
)
Ahhh... I think I see what is happening - your with_columns expression is successfully converting all of the columns given in the "convert_to_date" list, but assigning the result of each conversion to the same name: "Create_date".
So, the values you finally get are coming from the last column to be converted ("Initial response time"), which does have nulls where you see them.
If you want each column to be associated with a separate date-converted entry, you can use the suffix expression to ensure that each conversion is mapped to its own distinct output column (based on the original name).
For example:
df.with_columns(
    pl.col(convert_to_date).str.strptime(
        datatype = pl.Date,
        fmt = '%d-%m-%Y %H:%M',
    ).suffix(" date")  # << adds " date" to the existing column name
)
Or, if you prefer to overwrite the existing columns with the converted ones, you could keep the existing column names:
df.with_columns(
    pl.col(convert_to_date).str.strptime(
        datatype = pl.Date,
        fmt = '%d-%m-%Y %H:%M'
    ).keep_name()  # << keeps original name (effectively overwriting it)
)
Finally, if you actually want datetimes (not dates), just change the value of the datatype param in the strptime expression to pl.Datetime.
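Putting that together for the question's goal of keeping the time values, here is a hedged sketch assuming the same convert_to_date list and the older polars API used above (fmt= and .suffix()):
import polars as pl

df = df.with_columns(
    pl.col(convert_to_date).str.strptime(
        pl.Datetime,            # parse to datetime so the time component is kept
        fmt='%d-%m-%Y %H:%M',
        strict=False,           # leave unparsable values as null instead of raising
    ).suffix(" date")           # e.g. "Created time" -> "Created time date"
)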

Lambda Expression + pySpark

I am trying to compare a column in a Spark DataFrame against a given date: if the column's date is less than the given date, add n hours, else add x hours. Something like this pseudocode:
addhours = lambda x, y: x + 14hrs if (x < y) else x + 10hrs
where y will hold a static date that I specify, and then apply this to a DataFrame column,
something like
df = df.withColumn("newDate", checkDate(df.Time, F.lit('2015-01-01') ))
Here is a sample for df:
from pyspark.sql import functions as F
import datetime
df = spark.createDataFrame([('America/NewYork', '2020-02-01 10:00:00'),('Africa/Nairobi', '2020-02-01 10:00:00')],["OriginTz", "Time"])
I am a bit new to Spark DataFrames :)
Use a when + otherwise statement instead of a UDF.
Example:
from pyspark.sql import functions as F
# casting to timestamp and date so that we can compare them in when()
df = spark.createDataFrame([('America/NewYork', '2020-02-01 10:00:00'), ('Africa/Nairobi', '2003-02-01 10:00:00')], ["OriginTz", "Time"]).\
    withColumn("literal", F.lit('2015-01-01').cast("date")).\
    withColumn("Time", F.col("Time").cast("timestamp"))
df.show()
#+---------------+-------------------+----------+
#| OriginTz| Time| literal|
#+---------------+-------------------+----------+
#|America/NewYork|2020-02-01 10:00:00|2015-01-01|
#| Africa/Nairobi|2003-02-01 10:00:00|2015-01-01|
#+---------------+-------------------+----------+
#using unix_timestamp function converting to epoch time then adding 10*3600 -> 10 hrs finally converting to timestamp format
df.withColumn("new_date",F.when(F.col("Time") > F.col("literal"),F.to_timestamp(F.unix_timestamp(F.col("Time"),'yyyy-MM-dd HH:mm:ss') + 10 * 3600)).\
otherwise(F.to_timestamp(F.unix_timestamp(F.col("Time"),'yyyy-MM-dd HH:mm:ss') + 14 * 3600))).\
show()
#+---------------+-------------------+----------+-------------------+
#| OriginTz| Time| literal| new_date|
#+---------------+-------------------+----------+-------------------+
#|America/NewYork|2020-02-01 10:00:00|2015-01-01|2020-02-01 20:00:00|
#| Africa/Nairobi|2003-02-01 10:00:00|2015-01-01|2003-02-02 00:00:00|
#+---------------+-------------------+----------+-------------------+
In case you don't want to add the literal value as a DataFrame column:
lit_val = '2015-01-01'
df = spark.createDataFrame([('America/NewYork', '2020-02-01 10:00:00'), ('Africa/Nairobi', '2003-02-01 10:00:00')], ["OriginTz", "Time"]).\
    withColumn("Time", F.col("Time").cast("timestamp"))
df.withColumn("new_date", F.when(F.col("Time") > F.lit(lit_val).cast("date"), F.to_timestamp(F.unix_timestamp(F.col("Time"), 'yyyy-MM-dd HH:mm:ss') + 10 * 3600)).\
    otherwise(F.to_timestamp(F.unix_timestamp(F.col("Time"), 'yyyy-MM-dd HH:mm:ss') + 14 * 3600))).\
    show()
#+---------------+-------------------+-------------------+
#|       OriginTz|               Time|           new_date|
#+---------------+-------------------+-------------------+
#|America/NewYork|2020-02-01 10:00:00|2020-02-01 20:00:00|
#| Africa/Nairobi|2003-02-01 10:00:00|2003-02-02 00:00:00|
#+---------------+-------------------+-------------------+
You can also do this using .expr and interval; this way you do not have to convert to another format. (Note this assumes the static date already exists as a date column named y; see the sketch after the output below.)
from pyspark.sql import functions as F
df.withColumn("new_date", F.expr("""IF(Time<y, Time + interval 14 hours, Time + interval 10 hours)""")).show()
#+---------------+-------------------+----------+-------------------+
#| OriginTz| Time| y| new_date|
#+---------------+-------------------+----------+-------------------+
#|America/NewYork|2020-02-01 10:00:00|2020-01-01|2020-02-01 20:00:00|
#| Africa/Nairobi|2020-02-01 10:00:00|2020-01-01|2020-02-01 20:00:00|
#+---------------+-------------------+----------+-------------------+
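As noted above, the .expr example presumes a date column named y on the DataFrame. A hedged sketch of how that column could be added, building on the question's sample df (the df2 name is mine):
from pyspark.sql import functions as F

df2 = (df
       .withColumn("y", F.lit("2020-01-01").cast("date"))   # the static comparison date
       .withColumn("Time", F.col("Time").cast("timestamp")))
df2.withColumn(
    "new_date",
    F.expr("IF(Time < y, Time + interval 14 hours, Time + interval 10 hours)")
).show()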

Sql Window function on whole dataframe in spark

I am working on a Spark Streaming project which consumes data from Kafka every 3 minutes. I want to calculate a moving sum of a value. Below is the sample logic for an RDD, which works well. I want to know whether this logic will work for Spark Streaming. I read in some docs that you have to assign a range of data, e.g. Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1), but I want to calculate the logic over the whole DataFrame. Does the logic below work on the whole DataFrame, or will it take only a range of values of the DataFrame?
val customers = spark.sparkContext.parallelize(List(
    ("Alice", "2016-05-01", 50.00),
    ("Alice", "2016-05-03", 45.00),
    ("Alice", "2016-05-04", 55.00),
    ("Bob", "2016-05-01", 25.00),
    ("Bob", "2016-05-04", 29.00),
    ("Bob", "2016-05-06", 27.00)))
  .toDF("name", "date", "amountSpent")

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val wSpec1 = Window.partitionBy("name").orderBy("date")

customers.withColumn("movingSum", sum(customers("amountSpent")).over(wSpec1)).show()
output
+-----+----------+-----------+---------+
| name| date|amountSpent|movingSum|
+-----+----------+-----------+---------+
| Bob|2016-05-01| 25.0| 25.0|
| Bob|2016-05-04| 29.0| 54.0|
| Bob|2016-05-06| 27.0| 81.0|
|Alice|2016-05-01| 50.0| 50.0|
|Alice|2016-05-03| 45.0| 95.0|
|Alice|2016-05-04| 55.0| 150.0|
+-----+----------+-----------+---------+
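On the question of whether the window covers the whole DataFrame: an ordered window without an explicit frame defaults to everything from the start of the partition up to the current row, which is why the output above is a cumulative sum per name. Here is a hedged PySpark restatement with the frame written out explicitly (assuming the same sample data; not part of the original post):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

customers = spark.createDataFrame(
    [("Alice", "2016-05-01", 50.00), ("Alice", "2016-05-03", 45.00),
     ("Alice", "2016-05-04", 55.00), ("Bob", "2016-05-01", 25.00),
     ("Bob", "2016-05-04", 29.00), ("Bob", "2016-05-06", 27.00)],
    ["name", "date", "amountSpent"])

# Equivalent to the default frame of an ordered window:
# from the start of the partition up to the current row.
w = (Window.partitionBy("name").orderBy("date")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

customers.withColumn("movingSum", F.sum("amountSpent").over(w)).show()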

spark sql distance to nearest holiday

In pandas I have a function similar to
indices = df.dateColumn.apply(holidays.index.searchsorted)
df['nextHolidays'] = holidays.index[indices]
df['previousHolidays'] = holidays.index[indices - 1]
which calculates the distance to the nearest holiday and stores that as a new column.
searchsorted (http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.Series.searchsorted.html) was a great solution for pandas, as it gives me the index of the next holiday without high algorithmic complexity; e.g. this approach (see "Parallelize pandas apply") was a lot quicker than parallel looping.
How can I achieve this in spark or hive?
This can be done using aggregations, but that method would have higher complexity than the pandas method. You can, however, achieve similar performance using UDFs. It won't be as elegant as pandas, but:
Assuming this dataset of holidays:
holidays = ['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03']
index = spark.sparkContext.broadcast(sorted(holidays))
And a dataset of the dates of 2016 in a DataFrame:
from datetime import datetime, timedelta
dates_array = [(datetime(2016, 1, 1) + timedelta(i)).strftime('%Y-%m-%d') for i in range(366)]
from pyspark.sql import Row
df = spark.createDataFrame([Row(date=d) for d in dates_array])
The UDF could use pandas' searchsorted, but that would require pandas to be installed on the executors. Instead, you can use plain Python like this:
def nearest_holiday(date):
    last_holiday = index.value[0]
    for next_holiday in index.value:
        if next_holiday >= date:
            break
        last_holiday = next_holiday
    if last_holiday > date:
        last_holiday = None
    if next_holiday < date:
        next_holiday = None
    return (last_holiday, next_holiday)
from pyspark.sql.types import *
return_type = StructType([StructField('last_holiday', StringType()), StructField('next_holiday', StringType())])
from pyspark.sql.functions import udf
nearest_holiday_udf = udf(nearest_holiday, return_type)
And it can be used with withColumn:
df.withColumn('holiday', nearest_holiday_udf('date')).show(5, False)
+----------+-----------------------+
|date |holiday |
+----------+-----------------------+
|2016-01-01|[null,2016-01-03] |
|2016-01-02|[null,2016-01-03] |
|2016-01-03|[2016-01-03,2016-01-03]|
|2016-01-04|[2016-01-03,2016-03-03]|
|2016-01-05|[2016-01-03,2016-03-03]|
+----------+-----------------------+
only showing top 5 rows
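Since the broadcast holiday list is sorted, Python's standard-library bisect can stand in for pandas' searchsorted and replace the linear scan with a binary search. A hedged sketch (the function name is mine; it is written to match the semantics of the output above, where a date that is itself a holiday gets that holiday as both values):
from bisect import bisect_left

def nearest_holiday_bisect(date):
    hols = index.value                        # broadcast list, already sorted
    i = bisect_left(hols, date)               # position of the first holiday >= date
    next_holiday = hols[i] if i < len(hols) else None
    if next_holiday == date:
        last_holiday = date                   # a holiday counts as its own last holiday
    else:
        last_holiday = hols[i - 1] if i > 0 else None
    return (last_holiday, next_holiday)

# Wrap it exactly like the linear-scan version:
# nearest_holiday_udf = udf(nearest_holiday_bisect, return_type)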

Using JodaTime to compare time without date

I'm trying to compare two times (in LocalTime format) in order to use them as part of an if statement. I have done some research, but all I can find is for using a date without a time, not the other way around. I am trying to compare a time against the system time with the following code:
import org.joda.time.*;
import org.joda.time.format.DateTimeFormat;
import org.joda.time.format.DateTimeFormatter;
import org.joda.time.LocalDate;
LocalTime startTime2;
LocalTime airTime2;
LocalTime foamTime2;
LocalTime scTime22;
firstTime = airTime2;
secondTime = localTime;
return firstTime.compareTo(secondTime);
This should return the larger value. toLocalTime does not seem to be supported by JodaTime; does anyone know what the alternative would be?
I had adapted the code from:
LocalDate firstDate = date1.toLocalDate();
LocalDate secondDate = date2.toLocalDate();
return firstDate.compareTo(secondDate);
It seems to work pretty straightforwardly (JodaTime 2.9.1):
import org.joda.time.LocalTime;
LocalTime earlier = new LocalTime("23:00:00");
LocalTime later = new LocalTime("23:12:34");
System.out.println(earlier.compareTo(later)); // -1
System.out.println(later.compareTo(earlier)); // 1
System.out.println(earlier.compareTo(earlier)); // 0
