PySpark round off timestamps to full hours? - time

I am interested in rounding off timestamps to full hours. What I got so far is to round to the nearest hour. For example with this:
df.withColumn("Full Hour", hour((round(unix_timestamp("Timestamp")/3600)*3600).cast("timestamp")))
But this "round" function uses HALF_UP rounding. This means 23:56 results in 00:00, but I would prefer 23:00. Is this possible? I didn't find an option to set the rounding behaviour of the function.

I think you're overcomplicating things. The hour function already returns the hour component of a timestamp, so no rounding is needed.
from pyspark.sql.functions import hour, to_timestamp
from pyspark.sql import Row
df = (sc
    .parallelize([Row(Timestamp='2016_08_21 11_59_08')])
    .toDF()
    # HH is the 24-hour field; hh would only parse hours 1-12
    .withColumn("parsed", to_timestamp("Timestamp", "yyyy_MM_dd HH_mm_ss")))
df2 = df.withColumn("Full Hour", hour("parsed"))
df2.show()
Output:
+-------------------+-------------------+---------+
|          Timestamp|             parsed|Full Hour|
+-------------------+-------------------+---------+
|2016_08_21 11_59_08|2016-08-21 11:59:08|       11|
+-------------------+-------------------+---------+
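The hour component alone drops the date. If the full timestamp truncated to the hour is wanted, one option is to swap the round in the question's expression for a floor. The sketch below shows that arithmetic in plain Python rather than PySpark; floor_to_hour is a hypothetical helper, and UTC is assumed so that epoch hours line up with clock hours. In PySpark the same idea would presumably use pyspark.sql.functions.floor in place of round, or date_trunc("hour", ...).

```python
from datetime import datetime, timezone

def floor_to_hour(epoch_seconds):
    """Hypothetical helper: drop the minutes and seconds of an epoch timestamp."""
    return (epoch_seconds // 3600) * 3600

# 23:56:08 is the kind of value the question wants mapped to 23:00, not 00:00.
ts = datetime(2016, 8, 21, 23, 56, 8, tzinfo=timezone.utc)
epoch = int(ts.timestamp())

floored = datetime.fromtimestamp(floor_to_hour(epoch), tz=timezone.utc)
rounded = datetime.fromtimestamp(round(epoch / 3600) * 3600, tz=timezone.utc)

print(floored)  # 2016-08-21 23:00:00+00:00
print(rounded)  # 2016-08-22 00:00:00+00:00 (the rounding behaviour from the question)
```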

Related

How can I read the ex-dividend date from the yfinance library?

This code should return the ex-dividend date:
import yfinance as yf
yf.Ticker('AGNC').info['exDividendDate']
but I get this as an output:
1661817600
I am wondering if there is a way to get the date from that number?
This number is a Unix timestamp, that is, the number of seconds since 1 January 1970 (UTC). To get the calendar date, you can use pd.to_datetime with unit='s'.
import pandas as pd
pd.to_datetime(1661817600, unit='s')
Out[6]: Timestamp('2022-08-30 00:00:00')
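Or you can use Python's built-in datetime module. Note that fromtimestamp converts to the machine's local time zone, which is why the output below (produced on a machine at UTC+8) differs from the pandas result.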
from datetime import datetime
print(datetime.fromtimestamp(1661817600))
2022-08-30 08:00:00
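To get the same result regardless of the machine's time zone, one option is to pass an explicit time zone to fromtimestamp. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
from datetime import datetime, timezone

ex_dividend_epoch = 1661817600  # value returned by yfinance in the question

# Passing tz makes the result independent of the machine's local time zone.
ex_dividend_utc = datetime.fromtimestamp(ex_dividend_epoch, tz=timezone.utc)

print(ex_dividend_utc)         # 2022-08-30 00:00:00+00:00
print(ex_dividend_utc.date())  # 2022-08-30
```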

Tibco Spotfire - time in seconds & milliseconds in Real, convert to a time of day

I have a list of time in a decimal format of seconds, and I know what time the series started. I would like to convert it to a time of day with the offset of the start time applied. There must be a simple way to do this that I am really missing!
Sample source data:
\Name of source file : 260521-11_58
\Recording from 26.05.2021 11:58
\Channels : 1
\Scan rate : 101 ms = 0.101 sec
\Variable 1: n1(rpm)
\Internal identifier: 63
\Information1:
\Information2:
\Information3:
\Information4:
0.00000 3722.35645
0.10100 3751.06445
0.20200 1868.33350
0.30300 1868.36487
0.40400 3722.39355
0.50500 3722.51831
0.60600 3722.50464
0.70700 3722.32446
0.80800 3722.34277
0.90900 3722.47729
1.01000 3722.74048
1.11100 3722.66650
1.21200 3722.39355
1.31300 3751.02710
1.41400 1868.27539
1.51500 3722.49097
1.61600 3750.93286
1.71700 1868.30334
1.81800 3722.29224
The Start time & date is 26.05.2021 11:58, and the LH column is elapsed time in seconds with the column name [Time] . So I just want to convert the decimal / real to a time or timespan and add the start time to it.
I have tried lots of ways that are really hacky, and ultimately flawed - the below works, but just ignores the milliseconds.
TimeSpan(0,0,0,Integer(Floor([Time])),[Time] - Integer(Floor([Time])))
The last part works to just get milli / micro seconds on its own, but not as part of the above.
Your formula isn't really ignoring the milliseconds. It passes the fractional part of the seconds (a value below 1) as the milliseconds argument, so the value being returned is smaller than the format mask can show.
You need to convert the fractional seconds to milliseconds, so something like this should work
TimeSpan(0,0,0,Integer(Floor([Time])),([Time] - Integer(Floor([Time]))) * 1000)
To add it to the start date, this would work. Note that the expression as written contains only the date, not the 11:58 start time.
DateAdd(Date("26-May-2021"),TimeSpan(0,0,0,Integer([Time]),([Time] - Integer([Time])) * 1000))
You will need to set the column format to
dd-MMM-yyyy HH:mm:ss:fff
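For comparison, the same calculation outside Spotfire can be sketched in Python: take the recording start, including the 11:58 time of day, and add the elapsed seconds as a time span. The helper name to_time_of_day is hypothetical, and the sample values are the first few from the source data.

```python
from datetime import datetime, timedelta

start = datetime(2021, 5, 26, 11, 58)  # "Recording from 26.05.2021 11:58"

def to_time_of_day(start, elapsed_seconds):
    """Hypothetical helper: add elapsed seconds (a float) to the start time."""
    # Round to whole milliseconds so float noise does not leak into the result.
    return start + timedelta(milliseconds=round(elapsed_seconds * 1000))

for t in [0.0, 0.101, 0.202, 1.111]:
    stamp = to_time_of_day(start, t)
    # Print hours:minutes:seconds:milliseconds, similar to the HH:mm:ss:fff mask.
    print(stamp.strftime("%H:%M:%S") + ":%03d" % (stamp.microsecond // 1000))
```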

Speed up Pandas DateTime variable

I have a number of quite large CSV files (1,000,000 rows each) which contain a DateTime column. I am using Pandas pivot tables to summarise them. Part of this involves splitting the DateTime variable into hours and minutes. I am using the following code, which works fine but takes quite a lot of time (around 4-5 minutes).
My question is: is this just because the files are so large or my laptop is too slow, or is there more efficient code for splitting out hours and minutes from a DateTime variable?
Thanks
df['hours'], df['minutes'] = pd.DatetimeIndex(df['DateTime']).hour, pd.DatetimeIndex(df['DateTime']).minute
If the dtype of the DateTime column is not datetime, first convert it with to_datetime. Then use dt.hour and dt.minute:
df['DateTime'] = pd.to_datetime(df['DateTime'])
df['hours'], df['minutes'] = df['DateTime'].dt.hour, df['DateTime'].dt.minute
Sample:
import pandas as pd
df = pd.DataFrame({'DateTime': ['2014-06-17 11:09:20', '2014-06-18 10:02:10']})
print (df)
              DateTime
0  2014-06-17 11:09:20
1  2014-06-18 10:02:10
print (df.dtypes)
DateTime    object
dtype: object
df['DateTime'] = pd.to_datetime(df['DateTime'])
df['hours'], df['minutes'] = df['DateTime'].dt.hour, df['DateTime'].dt.minute
print (df)
             DateTime  hours  minutes
0 2014-06-17 11:09:20     11        9
1 2014-06-18 10:02:10     10        2

Awk and calculating start time from end time and duration

I have a file with date, end time and duration in decimal format and I need to calculate the start time. The file looks like:
20140101;1212;1.5
20140102;1515;1.58
20140103;1759;.69
20140104;1100;12.5
...
The duration 1.5 for the time 12:12 means one and a half hours, so the start time would be 12:12 - 1:30 = 10:42. Likewise, 11:00 - 12.5 = 11:00 - 12:30 = 22:30 on the previous day. Is there an easy way of calculating such time differences in Awk, or is it the good ol' split-multiply-subtract-and-handle-the-day-break-yourself all over again?
Since the values are in hours and minutes, only the minutes matter and the seconds can be discarded. For example, duration 1.58 means 1:34, and the leftover 0.8 minutes (48 seconds) can be discarded.
I'm on GNU Awk 4.1.3
As you are using gawk, take advantage of its native time functions:
gawk -F\; '{tmst=sprintf("%s %s %s %s %s 00",\
substr($1,1,4),\
substr($1,5,2),\
substr($1,7,2),\
substr($2,1,2),\
substr($2,3,2))
t1=mktime(tmst)
seconds=sprintf("%f",$3)+0
seconds*=60*60
difference=strftime("%H%M",t1-seconds)
print $0""FS""difference}' file
Results:
20140101;1212;1.5;1042
20140102;1515;1.58;1340
20140103;1759;.69;1717
20140104;1100;12.5;2230
Check: https://www.gnu.org/software/gawk/manual/html_node/Time-Functions.html
Explanation:
tmst=sprintf(..) : builds a date string from the first two fields in the
datespec format that mktime expects: YYYY MM DD HH MM SS [DST].
t1=mktime(tmst) : turns the datespec into a timestamp that gawk can
handle (the number of seconds elapsed since 1 January 1970).
seconds=sprintf("%f",$3)+0 : converts the third field to a float.
seconds*=60*60 : converts hours (as a float) to seconds.
difference=strftime("%H%M",t1-seconds) : formats the difference in
human-readable form, as hours and minutes.
I highly recommend using a programming language that supports datetime calculations, because the details can get tricky, for example around daylight saving shifts. You can use Python, for example:
start_times.py:
import csv
from datetime import datetime, timedelta
with open('input.txt', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=';', quotechar='|')
    for row in reader:
        end_day = row[0]
        end_time = row[1]
        # Create a datetime object
        end = datetime.strptime(end_day + end_time, "%Y%m%d%H%M")
        # Translate duration into minutes
        duration = float(row[2]) * 60
        # Calculate start time
        start = end - timedelta(minutes=duration)
        # Column 3 is the start day (can differ from end day!)
        row.append(start.strftime("%Y%m%d"))
        # Column 4 is the start time
        row.append(start.strftime("%H%M"))
        print(';'.join(row))
Run:
python start_times.py
Output:
20140101;1212;1.5;20140101;1042
20140102;1515;1.58;20140102;1340
20140103;1759;.69;20140103;1717
20140104;1100;12.5;20140103;2230 <-- you see, the day matters!
The above example uses naive datetimes, so no time zone or daylight saving adjustment is applied. If the input data refers to a specific time zone, Python's datetime module allows you to specify it.
I would do something like this:
awk 'BEGIN{FS=OFS=";"}
{ h=substr($2,1,2); m=substr($2,3,2); mins=h*60 + m; diff=mins - $3*60;
print $0, int(diff/60) ":" int(diff%60)
}' file
That is, convert everything to minutes and then back to hours and minutes. Note that substr is 1-indexed in awk, and that this version does not handle a start time that falls on the previous day.
Test
$ awk 'BEGIN{FS=OFS=";"}{h=substr($2,1,2); m=substr($2,3,2); mins=h*60 + m; diff=mins - $3*60; print $0, int(diff/60) ":" int(diff%60)}' a
20140101;1212;1.5;10:42
20140102;1515;1.58;13:40
20140103;1759;.69;17:17
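The minutes-only approach can also handle the day break with modular arithmetic. The sketch below is in Python rather than awk, and start_time is a hypothetical helper. It assumes the duration is truncated to whole minutes, as the question describes, so for durations such as 1.58 it can differ by one minute from the answers above, which subtract the exact number of seconds.

```python
MINUTES_PER_DAY = 24 * 60

def start_time(end_hhmm, duration_hours):
    """Hypothetical helper: return (start as HH:MM, number of days before the end day)."""
    end_minutes = int(end_hhmm[:2]) * 60 + int(end_hhmm[2:])
    # Assumption: fractional minutes of the duration are discarded.
    total = end_minutes - int(duration_hours * 60)
    # divmod with a positive divisor always yields a minute of day in 0..1439.
    days, minute_of_day = divmod(total, MINUTES_PER_DAY)
    return "%02d:%02d" % (minute_of_day // 60, minute_of_day % 60), -days

print(start_time("1212", 1.5))   # ('10:42', 0)
print(start_time("1100", 12.5))  # ('22:30', 1)
```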

Smaller variation between times of different days

I have been working on an algorithm that selects a set of date/time objects with a certain characteristic, but with no success.
The data are in a list of lists of date/time objects,
e.g.:
lstDays[i][j], where i selects the day and j selects the time
What is the problem? I need a set of date/time objects whose times of day are as close together as possible. Each member of the set must come from a different day.
For example: [2012-09-09 12:00,2012-09-10 12:00, 2012-09-11 12:00]
This set is the best case because the difference between its times is zero.
Important
To give some context: I want to observe whether a phenomenon occurs at the same time on different days. If not, I want to evaluate whether the distance between the times is reasonable for my study.
I would like a generic algorithm for any number of days and times. It should return every set of datetime objects together with its time distance:
[2012-09-09 12:00,2012-09-10 12:00, 2012-09-11 12:00], 0
[2012-09-09 13:00,2012-09-10 13:00, 2012-09-11 13:05], 5
and so on.
:: "0", because the difference between the times of day on the first line is zero.
:: "5", because the difference between the times of day on the second line is five minutes.
Edit: Code here
for i in range(len(lstDays)):
    for j in range(len(lstDays[i])):
        print(lstDays[i][j])
Output:
2013-07-18 11:16:00
2013-07-18 12:02:00
2013-07-18 12:39:00
2013-07-18 13:14:00
2013-07-18 13:50:00
2013-07-19 11:30:00
2013-07-19 12:00:00
2013-07-19 12:46:00
2013-07-19 13:19:00
2013-07-22 11:36:00
2013-07-22 12:21:00
2013-07-22 12:48:00
2013-07-22 13:26:00
2013-07-23 11:18:00
2013-07-23 11:48:00
2013-07-23 12:30:00
2013-07-23 13:12:00
2013-07-24 11:18:00
2013-07-24 11:42:00
2013-07-24 12:20:00
2013-07-24 12:52:00
2013-07-24 13:29:00
Note: lstDays[i][j] is a datetime object.
lstDays = [ [/*datetime objects from a day i*/], [/*datetime objects from a day i+1*/], [/*datetime objects from a day i+2*/], ... ]
I am not worried about performance for now.
Hope that you can help me! (:
Generate a histogram:
hours = [0] * 24
for obj in objects: # whatever your objects are
    # assuming obj.date_time looks like '2013-07-18 10:55:00'
    hour = obj.date_time[11:13] # assuming the hour is in positions 11-12
    hours[int(hour)] += 1
for hour in range(24):
    print('%02d: %d' % (hour, hours[hour]))
You can always resort to collecting the times in a list, computing the differences between them, and grouping the objects whose differences are below a limit. Everything is packed into a dictionary with the timestamps as keys and the difference as the value. If this is not exactly what you need, it should be easy to select whatever result you need from it.
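import numpy
import datetime
# object1 ... objectN stand for the datetime objects being compared
times_list = [object1.time(), object2.time(), ..., objectN.time()]
# time objects cannot be subtracted directly, so anchor them all to the same dummy date
anchored = [datetime.datetime.combine(datetime.date.min, tt) for tt in times_list]
limit = 5 # limit of five seconds
groups = {}
for time in anchored:
    delta_times = numpy.asarray([(tt-time).total_seconds() for tt in anchored])
    whr = numpy.where(abs(delta_times) < limit)[0]
    similar = [str(times_list[ii]) for ii in whr]
    if len(similar) > 1:
        similar.sort()
        max_time = numpy.max(delta_times[whr]) # max? median? mean?
        groups[tuple(similar)] = max_time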
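Since performance is not a concern, a brute-force alternative is to enumerate every combination that takes one datetime from each day and rank the combinations by the spread of their times of day. The sketch below assumes the lstDays layout from the question (written lst_days here). The function names are hypothetical, and the spread is measured in seconds without wrapping around midnight.

```python
from datetime import datetime
from itertools import product

def seconds_since_midnight(dt):
    return dt.hour * 3600 + dt.minute * 60 + dt.second

def ranked_combinations(lst_days):
    """Return (spread_in_seconds, combination) pairs, smallest spread first."""
    results = []
    for combo in product(*lst_days):  # one datetime from each day
        offsets = [seconds_since_midnight(dt) for dt in combo]
        results.append((max(offsets) - min(offsets), combo))
    results.sort(key=lambda pair: pair[0])
    return results

# Example data taken from the two sets shown in the question.
lst_days = [
    [datetime(2012, 9, 9, 12, 0), datetime(2012, 9, 9, 13, 0)],
    [datetime(2012, 9, 10, 12, 0), datetime(2012, 9, 10, 13, 0)],
    [datetime(2012, 9, 11, 12, 0), datetime(2012, 9, 11, 13, 5)],
]

ranking = ranked_combinations(lst_days)
for spread, combo in ranking[:2]:
    print([str(dt) for dt in combo], spread)
```

The number of combinations is the product of the list lengths, so this only suits small inputs like the five days in the question.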

Resources