Lambda Expression + pySpark - azure-databricks

I am trying to compare a column in a Spark DataFrame against a given date: if the column's date is less than the given date, add n hours, else add x hours.
something like
addhours = lambda x, y: x + 14hrs if (x < y) else x + 10hrs
where y holds a specified static date, then apply it to the DataFrame column
something like
df = df.withColumn("newDate", checkDate(df.Time, F.lit('2015-01-01') ))
here is a sample df:
from pyspark.sql import functions as F
import datetime
df = spark.createDataFrame([('America/NewYork', '2020-02-01 10:00:00'),('Africa/Nairobi', '2020-02-01 10:00:00')],["OriginTz", "Time"])
I'm a bit new to Spark DataFrames :)

Use a when + otherwise statement instead of a udf.
Example:
from pyspark.sql import functions as F
#we are casting to timestamp and date so that we can compare in when
df = spark.createDataFrame([('America/NewYork', '2020-02-01 10:00:00'),('Africa/Nairobi', '2003-02-01 10:00:00')],["OriginTz", "Time"]).\
withColumn("literal",F.lit('2015-01-01').cast("date")).\
withColumn("Time",F.col("Time").cast("timestamp"))
df.show()
#+---------------+-------------------+----------+
#| OriginTz| Time| literal|
#+---------------+-------------------+----------+
#|America/NewYork|2020-02-01 10:00:00|2015-01-01|
#| Africa/Nairobi|2003-02-01 10:00:00|2015-01-01|
#+---------------+-------------------+----------+
#using unix_timestamp to convert to epoch time, then adding 10*3600 (10 hrs) or 14*3600 (14 hrs), and finally converting back to timestamp format
df.withColumn("new_date",F.when(F.col("Time") > F.col("literal"),F.to_timestamp(F.unix_timestamp(F.col("Time"),'yyyy-MM-dd HH:mm:ss') + 10 * 3600)).\
otherwise(F.to_timestamp(F.unix_timestamp(F.col("Time"),'yyyy-MM-dd HH:mm:ss') + 14 * 3600))).\
show()
#+---------------+-------------------+----------+-------------------+
#| OriginTz| Time| literal| new_date|
#+---------------+-------------------+----------+-------------------+
#|America/NewYork|2020-02-01 10:00:00|2015-01-01|2020-02-01 20:00:00|
#| Africa/Nairobi|2003-02-01 10:00:00|2015-01-01|2003-02-02 00:00:00|
#+---------------+-------------------+----------+-------------------+
If you don't want to add the literal value as a DataFrame column:
lit_val='2015-01-01'
df = spark.createDataFrame([('America/NewYork', '2020-02-01 10:00:00'),('Africa/Nairobi', '2003-02-01 10:00:00')],["OriginTz", "Time"]).\
withColumn("Time",F.col("Time").cast("timestamp"))
df.withColumn("new_date",F.when(F.col("Time") > F.lit(lit_val).cast("date"),F.to_timestamp(F.unix_timestamp(F.col("Time"),'yyyy-MM-dd HH:mm:ss') + 10 * 3600)).\
otherwise(F.to_timestamp(F.unix_timestamp(F.col("Time"),'yyyy-MM-dd HH:mm:ss') + 14 * 3600))).\
show()
#+---------------+-------------------+-------------------+
#|       OriginTz|               Time|           new_date|
#+---------------+-------------------+-------------------+
#|America/NewYork|2020-02-01 10:00:00|2020-02-01 20:00:00|
#| Africa/Nairobi|2003-02-01 10:00:00|2003-02-02 00:00:00|
#+---------------+-------------------+-------------------+

You can also do this using expr and interval arithmetic; this way you do not have to convert to another format. This assumes the comparison date is available as a column y (one way to set it up is sketched after the output below).
from pyspark.sql import functions as F
df.withColumn("new_date", F.expr("""IF(Time<y, Time + interval 14 hours, Time + interval 10 hours)""")).show()
#+---------------+-------------------+----------+-------------------+
#| OriginTz| Time| y| new_date|
#+---------------+-------------------+----------+-------------------+
#|America/NewYork|2020-02-01 10:00:00|2020-01-01|2020-02-01 20:00:00|
#| Africa/Nairobi|2020-02-01 10:00:00|2020-01-01|2020-02-01 20:00:00|
#+---------------+-------------------+----------+-------------------+
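For completeness, here is a minimal end-to-end sketch of the expr/interval approach, assuming the static comparison date is added as a column named y (as in the output above); the date value used here is just illustrative:
from pyspark.sql import functions as F

df = (spark.createDataFrame(
          [('America/NewYork', '2020-02-01 10:00:00'), ('Africa/Nairobi', '2020-02-01 10:00:00')],
          ["OriginTz", "Time"])
      .withColumn("Time", F.col("Time").cast("timestamp"))
      .withColumn("y", F.lit('2020-01-01').cast("date")))   # static comparison date

df.withColumn("new_date",
              F.expr("IF(Time < y, Time + interval 14 hours, Time + interval 10 hours)")).show()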

Related

Epoch time conversion to time in Splunk

I am uploading an XML file in which one of the fields is dailyTime. This dailyTime is an epoch time and I want to convert it into human-readable time.
<globalView id="108" version="17" recordClassName="NormalizedEvent" retention="0" hourly="-1" hourlyTime="1284336038994" daily="-1" dailyTime="1284336038994" intervalMilliseconds="60000" writeUniqueCountersTime="0">
<criteria bop="AND">
<left>
<expr>
<interval serialization="custom">
<com.q1labs.ariel.Interval>
<short>5000</short>
<boolean>true</boolean>
<short>5000</short>
<boolean>true</boolean>
</com.q1labs.ariel.Interval>
</interval>
</expr>
<key class
My props.conf is:
[XMLPARSING]
KV_MODE = xml
SHOULD_LINEMERGE = true
BREAK_ONLY_BEFORE = <globalView\s\w*=("\d\d\d")
MAX_EVENTS = 600
EXTRACT-dailyTime = ^(?:[^=\n]*=){8}"(\d+)
TIME_FORMAT=%s%3N
TIME_PREFIX=dailyTime=
Lookahead=13
TRUNCATE = 1000
category = Custom
disabled = false
pulldown_type = true
Typically, you'd convert from the timestamp (i.e. epoch time) to something human-readable in your search.
Like this:
index=ndx sourcetype=srctp earliest=-4h
| stats max(_time) as rtime min(_time) as etime by fieldA
| sort 0 - rtime + fieldA
| eval rtime=strftime(rtime,"%c"), etime=strftime(etime,"%c")
| rename rtime as "Most Recent" etime as "Earliest"
Splunk strftime docs: https://docs.splunk.com/Documentation/Splunk/latest/SearchReference/DateandTimeFunctions#strftime.28X.2CY.29
Further formatting info for strptime and strftime: https://strftime.org
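If you just want to sanity-check the value outside Splunk, here is a small Python sketch (assuming dailyTime really is epoch milliseconds, which is why TIME_FORMAT=%s%3N is used above):
from datetime import datetime, timezone

epoch_ms = 1284336038994  # dailyTime from the sample XML, epoch milliseconds
print(datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc))
# 2010-09-13 00:00:38.994000+00:00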

spark sql distance to nearest holiday

In pandas I have a function similar to
indices = df.dateColumn.apply(holidays.index.searchsorted)
df['nextHolidays'] = holidays.index[indices]
df['previousHolidays'] = holidays.index[indices - 1]
which calculates the distance to the nearest holiday and stores that as a new column.
searchsorted (http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.Series.searchsorted.html) was a great solution for pandas, as it gives me the index of the next holiday without high algorithmic complexity; e.g. this approach was a lot quicker than the parallel looping in "Parallelize pandas apply".
How can I achieve this in spark or hive?
This can be done using aggregations, but that method would have higher complexity than the pandas method. You can, however, achieve similar performance using UDFs. It won't be as elegant as pandas, but:
Assuming this dataset of holidays:
holidays = ['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03']
index = spark.sparkContext.broadcast(sorted(holidays))
And a DataFrame of the dates of 2016:
from datetime import datetime, timedelta
dates_array = [(datetime(2016, 1, 1) + timedelta(i)).strftime('%Y-%m-%d') for i in range(366)]
from pyspark.sql import Row
df = spark.createDataFrame([Row(date=d) for d in dates_array])
The UDF could use pandas searchsorted, but that would require installing pandas on the executors. Instead you can use plain Python like this:
def nearest_holiday(date):
    last_holiday = index.value[0]
    for next_holiday in index.value:
        if next_holiday >= date:
            break
        last_holiday = next_holiday
    # date falls outside the holiday range: no previous / no next holiday
    if last_holiday > date:
        last_holiday = None
    if next_holiday < date:
        next_holiday = None
    return (last_holiday, next_holiday)

from pyspark.sql.types import StructType, StructField, StringType
return_type = StructType([StructField('last_holiday', StringType()),
                          StructField('next_holiday', StringType())])

from pyspark.sql.functions import udf
nearest_holiday_udf = udf(nearest_holiday, return_type)
And it can be used with withColumn:
df.withColumn('holiday', nearest_holiday_udf('date')).show(5, False)
+----------+-----------------------+
|date |holiday |
+----------+-----------------------+
|2016-01-01|[null,2016-01-03] |
|2016-01-02|[null,2016-01-03] |
|2016-01-03|[2016-01-03,2016-01-03]|
|2016-01-04|[2016-01-03,2016-03-03]|
|2016-01-05|[2016-01-03,2016-03-03]|
+----------+-----------------------+
only showing top 5 rows
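Since the question asks for a distance, you can turn the returned struct into day counts with the built-in datediff; a short sketch (the column names days_since_last/days_to_next are just illustrative):
from pyspark.sql import functions as F

result = df.withColumn('holiday', nearest_holiday_udf('date'))
# days since the previous holiday / until the next one; stays null where the
# UDF returned null (dates before the first or after the last holiday)
result = (result
          .withColumn('days_since_last', F.datediff('date', F.col('holiday.last_holiday')))
          .withColumn('days_to_next', F.datediff(F.col('holiday.next_holiday'), 'date')))
result.show(5, False)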

Spark/Hive Hours Between Two Datetimes

I would like to know how to precisely get the number of hours between two datetimes in Spark.
There is a function called datediff which I could use to get the number of days and then convert to hours; however, this is less precise than I'd like.
Example of what I want, modeled after datediff:
>>> df = sqlContext.createDataFrame([('2016-04-18 21:18:18','2016-04-19 19:15:00')], ['d1', 'd2'])
>>> df.select(hourdiff(df.d2, df.d1).alias('diff')).collect()
[Row(diff=22)]
Try using a UDF. Here is sample code; you can modify the UDF to return whatever granularity you want.
from pyspark.sql.functions import udf, col
from datetime import datetime, timedelta
from pyspark.sql.types import LongType
def timediff_x():
    def _timediff_x(date1, date2):
        date11 = datetime.strptime(date1, '%Y-%m-%d %H:%M:%S')
        date22 = datetime.strptime(date2, '%Y-%m-%d %H:%M:%S')
        return (date11 - date22).days  # whole days only; swap in another granularity as needed
    return udf(_timediff_x, LongType())
df = sqlContext.createDataFrame([('2016-04-18 21:18:18','2016-04-25 19:15:00')], ['d1', 'd2'])
df.select(timediff_x()(col("d2"), col("d1"))).show()
+----------------------------+
|PythonUDF#_timediff_x(d2,d1)|
+----------------------------+
| 6|
+----------------------------+
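For example, here is a sketch of the same UDF changed to return whole hours via total_seconds() (hourdiff_x is an illustrative name, not part of the original answer):
from pyspark.sql.functions import udf, col
from datetime import datetime
from pyspark.sql.types import LongType

def hourdiff_x():
    def _hourdiff_x(date1, date2):
        d1 = datetime.strptime(date1, '%Y-%m-%d %H:%M:%S')
        d2 = datetime.strptime(date2, '%Y-%m-%d %H:%M:%S')
        # total_seconds() keeps the sub-day remainder that .days discards
        return int((d1 - d2).total_seconds() // 3600)
    return udf(_hourdiff_x, LongType())

df.select(hourdiff_x()(col("d2"), col("d1")).alias('diff_hours')).show()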
If your columns are of type TimestampType(), you can use the answer at the following question:
Spark Scala: DateDiff of two columns by hour or minute
However, if your columns are of type StringType(), you have an option that is easier than defining a UDF, using the built-in functions:
from pyspark.sql.functions import *
diffCol = unix_timestamp(col('d1'), 'yyyy-MM-dd HH:mm:ss') - unix_timestamp(col('d2'), 'yyyy-MM-dd HH:mm:ss')
df = sqlContext.createDataFrame([('2016-04-18 21:18:18','2016-04-19 19:15:00')], ['d1', 'd2'])
df2 = df.withColumn('diff_secs', diffCol)
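If you want the difference in hours rather than seconds, divide by 3600; a small sketch building on the snippet above (diff_hours is just an illustrative column name):
from pyspark.sql.functions import col, abs as sql_abs

# seconds -> hours; abs() guards against the operands being in either order
df2 = df2.withColumn('diff_hours', sql_abs(col('diff_secs')) / 3600)
df2.show()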

How to duplicate Sum(Sum(Fields!VarName.Value)) using Lookupsets and Custom Code in SSRS

I am fairly new to SSRS but am having a problem double-summing when using Lookupsets as output. I have the following table and query, which does work.
Query for Hours_DataSet
SELECT CallbackDate, SUM(TelemarketingHours) AS DailyHours,
(SELECT SUM(TelemarketingHours) AS Expr1
FROM CallbackTbl) AS HoursPTD
FROM CallbackTbl AS CallbackTbl_1
GROUP BY CallbackDate
Definition of Matrix
| [CallbackDate] | Weekly totals
________________________________________________________________
Hours | [Sum(DailyHours)] | Sum(Sum(DailyHours))
The output is this:
12/01/2014 | 12/02/2014 | 12/03/2014 | 12/04/2014 | 12/05/2014| Weekly totals|
28.75 | 42 | 42.25 | 40.25 | 37.50 | 190.75
In another table I need to calculate the appointments per hour and total appointments per hour for the week. So I set the main data-set to be the number of appointments and use lookupset and custom code to do the summing.
Everything works well for one level of sum. I need to recreate the 190.75 number and use it as the denominator in the calculation of the number of appointments per hour for the week.
Query for Positive_DataSet:
SELECT MainHistory_1.REALDATE, StatusTbl.Status, COUNT (MainHistory_1.DBRECID) AS Positives, StatusTbl.Code,
(SELECT COUNT(DBRECID) AS Expr1
FROM MainHistory
WHERE (REALDATE > CONVERT(DATETIME, @StartDate, 102)) AND (REALDATE < CONVERT(DATETIME, @EndDate, 102))) AS TotalCalls
FROM MainHistory AS MainHistory_1 INNER JOIN
StatusTbl ON MainHistory_1.STATUS = StatusTbl.Status
GROUP BY MainHistory_1.REALDATE, StatusTbl.Status, StatusTbl.Code
HAVING (StatusTbl.Code = 'P') AND (MainHistory_1.REALDATE > CONVERT(DATETIME, @StartDate, 102)) AND (MainHistory_1.REALDATE < CONVERT(DATETIME, @EndDate, 102))
My Matrix looks like this:
[REALDATE]| Weekly Totals
EXPR | EXPR
where the expressions are
FORMAT(Code.CalcPerHour(Lookupset(FORMAT(Fields!REALDATE.Value,"Long Date"),FORMAT(Fields!CallbackDate.Value,"Long Date"),Fields!DailyHours.Value,"Hours_DataSet"),SUM(Fields!Positives.Value)),"Fixed")
Sum(Sum(Fields!Positives.Value))/SUM(code.CalcPTD(Lookupset(FORMAT(Fields!REALDATE.Value,"Long Date"),FORMAT(Fields!CallbackDate.Value,"Long Date"),Fields!DailyHours.Value,"Hours_DataSet")))
My custom code is this:
PUBLIC SHARED FUNCTION CalcPerHour(Hours AS OBJECT, Totals AS OBJECT) AS DECIMAL
    DIM i AS INTEGER
    DIM PerHour AS DECIMAL
    FOR i = 0 TO UBOUND(Hours)
        IF CINT(Hours(i)) <> 0 THEN
            PerHour = PerHour + (CDEC(Totals) / CDEC(Hours(i)))
        END IF
    NEXT i
    RETURN PerHour
END FUNCTION

PUBLIC SHARED FUNCTION CalcPTD(LookupArray AS Array) AS DECIMAL
    DIM i AS INTEGER
    DIM Total AS DECIMAL
    Total = 0
    FOR i = 0 TO UBOUND(LookupArray)
        Total = Total + CDEC(LookupArray(i))
    NEXT i
    RETURN Total
END FUNCTION
My Output is this:
12/01/2014 | 12/02/2014 | 12/03/2014 | 12/04/2014 | 12/05/2014 | Weekly totals|
1.63 | 1.79 | 1.75 | 1.59 | 1.41 | .87
The numbers corresponding to the days of the week are correct.
The number I should be getting for a total is
313/190.75 = 1.64
If I break it down and just look at the sum like this:
sum(Code.CalcPTD(Lookupset(FORMAT(Fields!REALDATE.Value,"Long Date"),FORMAT(Fields!CallbackDate.Value,"Long Date"),Fields!DailyHours.Value,"Hours_DataSet")))
I get the result of 352.50
If I count the number of items like this:
Count(Code.CalcPTD(Lookupset(FORMAT(Fields!REALDATE.Value,"Long Date"),FORMAT(Fields!CallbackDate.Value,"Long Date"),Fields!DailyHours.Value,"Hours_DataSet")))
I get the result of 9
If I count distinct the number of items like this:
CountDistinct(Code.CalcPTD(Lookupset(FORMAT(Fields!REALDATE.Value,"Long Date"),FORMAT(Fields!CallbackDate.Value,"Long Date"),Fields!DailyHours.Value,"Hours_DataSet")))
I get the expected 5
I tried to write code for a distinct sum, but it wouldn't return a single result, just a series of 5 values corresponding to the days of the week, and I have to display it in a single cell.
Any help would be appreciated. I know it's kind of complicated. If you have questions or need further clarification, please let me know.
So I figured out the answer on my own. To get a grand total of a variable within a dataset you can use Sum(Fields!VarName.Value, "DataSet") and it does it for you.

How to do a customized "average" for pandas multilevel dataframe?

I have a pandas multilevel DataFrame df containing the quarterly financial report data for about 2000+ stocks from 2006 to 2012, and I am trying to figure out a way to quickly calculate the 'average' values for each data point.
demo_data() is the function to generate the demo data (df = demo_data(stk_qty=2000, col_num=200) can be used to simulate the financial report data):
def demo_data(stk_qty, col_num):
    ''' generate demo data, return multilevel dataframe '''
    import random
    import numpy as np
    import pandas as pd

    rpt_date_template = [(yr + qt) for yr in map(str, range(2006, 2013))
                         for qt in ['0331', '0630', '0930', '1231']]
    stk_id_list = ['STK' + str(x).zfill(3) for x in range(0, stk_qty)]
    stk_id_column, rpt_date_column = [], []
    for i in range(stk_qty):
        stk_rpt_date_list = rpt_date_template[random.randint(0, 8):]  # rpt dates with random start
        stk_id_column.extend([stk_id_list[i]] * len(stk_rpt_date_list))
        rpt_date_column.extend(stk_rpt_date_list)

    index_name = ['STK_ID', 'RPT_Date']
    col_name = ['COL' + str(x).zfill(3) for x in range(col_num)]
    first_level_dt = stk_id_column
    second_level_dt = rpt_date_column

    dt = pd.DataFrame(np.random.randn(len(stk_id_column), col_num), columns=col_name)
    dt[index_name[0]] = first_level_dt
    dt[index_name[1]] = second_level_dt
    multilevel_df = dt.set_index(index_name, drop=True, inplace=False)
    return multilevel_df
Here is some sample data. (Note: sw() is a method to display the four corner blocks of a big DataFrame; its source code is at "How to preview a part of a large pandas DataFrame?")
>>> df = demo_data(5,3)
>>> df.sw()
COL000 COL001 COL002
STK_ID RPT_Date
STK000 20060630 1.8196 0.9519 -1.0526
20060930 -0.4074 -0.9025 1.3562
20061231 -1.1750 0.4190 -1.2976
20070331 -0.5609 1.5190 0.4893
20070630 0.4580 -0.3804 0.3705
20070930 -0.4711 -1.1953 -0.0609
20071231 0.3363 1.1949 1.2802
20080331 1.6359 0.8355 -0.2763
20080630 0.2697 -0.8236 -1.7095
20080930 0.6178 -0.3742 -1.1646
.......................................
STK004 20111231 -0.3198 1.6972 -1.3281
20120331 -1.1905 -0.4597 0.3695
20120630 -0.8253 -0.0502 -0.2862
20120930 0.0059 -1.8535 -1.2107
20121231 0.5762 -0.2872 0.0993
Index : ['STK_ID', 'RPT_Date']
Column: COL000,COL001,COL002
row: 117 col: 3
The customized average function I want is named my_avg() and defined by the rules below:
1. Q1's average value is (Q4_of_previous_yr + Q1)/2
2. Q2's average value is (Q4_of_previous_yr + Q1 + Q2)/3
3. Q3's average value is (Q4_of_previous_yr + Q1 + Q2 + Q3)/4
4. Q4's average value is (Q4_of_previous_yr + Q1 + Q2 + Q3 + Q4)/5
5. if some of the data points are not provided, just calculate the normal average of available data points
so my_avg(df) will have the output below for each STK_ID:
STK_ID RPT_Date COL000 COL001 COL002
STK000 20060630 1.819619705 0.951918984 -1.052639309
20060930 0.706112476 0.024688028 0.151757352
20061231 0.079077767 0.156125083 -0.331359614
20070331 -0.867930112 0.969000466 -0.404129827
20070630 -0.425943376 0.519205768 -0.145929753
20070930 -0.437234418 0.090579744 -0.124681449
20071231 -0.282524858 0.3114374 0.156297097
20080331 0.986121631 1.015202552 0.501971496
.......................................
STK004 20111231 xxxxx xxxxxxx xxxxxxx
How can I write the code for my_avg()? (A sketch of one possible approach follows the reference below.)
Reference:
I tried to write a temp_solution_avg() function, but it has three issues:
1. The average calculation does not include the 'Q4_of_previous_yr' data point, so the result is not what I want.
2. The data's 'RPT_Date' must start with Q1 ('xxxx0331'), otherwise the first year's data is wrong.
3. The calculation speed is very, very slow.
In [3]: df = demo_data(500,100)
In [4]: timeit temp_solution_avg(df)
1 loops, best of 3: 66.3 s per loop
def temp_solution_avg(df):
    ''' return the average , Q1: not change, Q2 : (df.Q1 + df.Q2)/2 ,
        Q3: (df.Q1 + df.Q2 + df.Q3)/3, Q4 : (df.Q1 + df.Q2 + df.Q3 + df.Q4)/4
        data's 'RPT_Date' must start with Q1 ('xxxx0331'), otherwise first yr's
        data is wrong .
    '''
    dt = df.reset_index()
    dt['yr'] = dt['RPT_Date'].str[0:4]
    dt['temp_stk_id'] = dt['STK_ID']
    dt = dt.set_index(['STK_ID', 'RPT_Date'], drop=True, inplace=False)
    rst = dt.groupby(['temp_stk_id', 'yr']).transform(pd.expanding_mean)
    return rst
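For reference, here is one possible my_avg() sketch. It assumes each stock's rows are consecutive quarters sorted by RPT_Date (as demo_data() produces) and implements the rules above as a trailing mean whose window runs from the previous year's Q4 through the current quarter; rule 5 falls out because the window is simply clipped at the start of each stock's series. It is not vectorised, so treat it as a starting point rather than a tuned solution:
import numpy as np
import pandas as pd

def my_avg(df):
    def one_stock(g):
        # quarter number 1..4 from the month part of RPT_Date ('0331' -> 1, ..., '1231' -> 4)
        q = np.asarray(g.index.get_level_values('RPT_Date').str[4:6].astype(int)) // 3
        vals = np.asarray(g, dtype=float)
        out = np.empty_like(vals)
        for i in range(len(vals)):
            lo = max(0, i - q[i])                  # window: previous year's Q4 .. current quarter
            out[i] = vals[lo:i + 1].mean(axis=0)   # average whatever is available (rule 5)
        return pd.DataFrame(out, index=g.index, columns=g.columns)

    return df.groupby(level='STK_ID', group_keys=False).apply(one_stock)

result = my_avg(demo_data(5, 3))
Applied to the sample shown above, this matches the expected output; for example, the 20070331 row is the mean of the 20061231 and 20070331 rows.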
