Speed up Pandas DateTime variable - performance

I have a number of quite large csv files (1,000,000 rows each) which contain a DateTime column. I am using Pandas pivot tables to summarise them. Part of this involves splitting the DateTime variable into hours and minutes. I am using the following code, which works fine, but it takes quite a lot of time (around 4-5 minutes).
My question is: is this just because the files are so large/my laptop is too slow, or is there more efficient code that would let me split hours and minutes out of a DateTime variable?
Thanks
df['hours'], df['minutes'] = pd.DatetimeIndex(df['DateTime']).hour, pd.DatetimeIndex(df['DateTime']).minute

If the dtype of the DateTime column is not datetime64, first convert it with to_datetime. Then use dt.hour and dt.minute:
df['DateTime'] = pd.to_datetime(df['DateTime'])
df['hours'], df['minutes'] = df['DateTime'].dt.hour, df['DateTime'].dt.minute
Sample:
import pandas as pd
df = pd.DataFrame({'DateTime': ['2014-06-17 11:09:20', '2014-06-18 10:02:10']})
print (df)
DateTime
0 2014-06-17 11:09:20
1 2014-06-18 10:02:10
print (df.dtypes)
DateTime object
dtype: object
df['DateTime'] = pd.to_datetime(df['DateTime'])
df['hours'], df['minutes'] = df['DateTime'].dt.hour, df['DateTime'].dt.minute
print (df)
DateTime hours minutes
0 2014-06-17 11:09:20 11 9
1 2014-06-18 10:02:10 10 2
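Since the question is about speed: if every timestamp shares one fixed layout, passing an explicit format to pd.to_datetime usually parses much faster than letting pandas infer the format per value. A minimal sketch, assuming the strings look like the sample above:
import pandas as pd

# Assumption: all values match the '2014-06-17 11:09:20' layout.
df['DateTime'] = pd.to_datetime(df['DateTime'], format='%Y-%m-%d %H:%M:%S')
df['hours'] = df['DateTime'].dt.hour
df['minutes'] = df['DateTime'].dt.minute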

Related

PySpark round off timestamps to full hours?

I am interested in rounding off timestamps to full hours. What I got so far is to round to the nearest hour. For example with this:
df.withColumn("Full Hour", hour((round(unix_timestamp("Timestamp")/3600)*3600).cast("timestamp")))
But this "round" function uses HALF_UP rounding. This means: 23:56 results in 00:00 but I would instead prefer to have 23:00. Is this possible? I didn't find an option field how to set the rounding behaviour in the function.
I think you're overcomplicating things. The hour function returns the hour component of a timestamp.
from pyspark.sql.functions import hour, to_timestamp, unix_timestamp
from pyspark.sql import Row

df = (sc
      .parallelize([Row(Timestamp='2016_08_21 11_59_08')])
      .toDF()
      .withColumn("parsed", to_timestamp("Timestamp", "yyyy_MM_dd HH_mm_ss")))
df2 = df.withColumn("Full Hour", hour(unix_timestamp("parsed").cast("timestamp")))
df2.show()
Output:
+-------------------+-------------------+---------+
| Timestamp| parsed|Full Hour|
+-------------------+-------------------+---------+
|2016_08_21 11_59_08|2016-08-21 11:59:08| 11|
+-------------------+-------------------+---------+
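If you want the full timestamp floored to the hour (so 23:56 becomes 23:00) rather than just the hour number, Spark 2.3+ also has date_trunc, which always rounds down. A short sketch built on the df above (the output column name is made up):
from pyspark.sql.functions import date_trunc

# date_trunc floors the timestamp to the start of the hour (no half-up rounding).
df3 = df.withColumn("Hour Start", date_trunc("hour", "parsed"))
df3.show()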

time data doesn't match format specified

I am trying to convert strings to datetime in Python. My data matches the format, but I still get
ValueError: time data '11 11' doesn't match format specified
I am not sure where the '11 11' in the error comes from.
My code is
train_df['date_captured1'] = pd.to_datetime(train_df['date_captured'], format="%Y-%m-%d %H:%M:%S")
Head of data is
print (train_df.date_captured.head())
0 2011-05-13 23:43:18
1 2012-03-17 03:48:44
2 2014-05-11 11:56:46
3 2013-10-06 02:00:00
4 2011-07-12 13:11:16
Name: date_captured, dtype: object
I tried the following, selecting just the first string and running the code with the same datetime format. Both work without a problem.
dt=train_df['date_captured']
dt1=dt[0]
date = datetime.datetime.strptime(dt1, "%Y-%m-%d %H:%M:%S")
print(date)
2011-05-13 23:43:18
and
dt1=pd.to_datetime(dt1, format='%Y-%m-%d %H:%M:%S')
print (dt1)
2011-05-13 23:43:18
But why, when I use the same format in pd.to_datetime to convert all the data in the column, does it come up with the error above?
Thank you.
I solved it.
train_df['date_time'] = pd.to_datetime(train_df['date_captured'], errors='coerce')
print (train_df[train_df.date_time.isnull()])
I found that in row 100372, the date_captured value is '11 11':
category_id date_captured ... height date_time
100372 10 11 11 ... 747 NaT
So with errors='coerce' the invalid parse is replaced with NaT.
Thank you.
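As a small follow-up sketch (not part of the original answer), the unparseable rows can then simply be dropped:
# Keep only rows whose date parsed successfully.
train_df = train_df.dropna(subset=['date_time'])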

Awk and calculating start time from end time and duration

I have a file with date, end time and duration in decimal format and I need to calculate the start time. The file looks like:
20140101;1212;1.5
20140102;1515;1.58
20140103;1759;.69
20140104;1100;12.5
...
The duration 1.5 for the time 12:12 means one and a half hours, and the start time would be 12:12 - 1:30 = 10:42, or 11:00 - 12.5 h = 11:00 - 12:30 = 22:30 on the previous day. Is there an easy way to calculate such time differences in Awk, or is it the good ol' split-multiply-subtract-and-handle-the-day-break-yourself all over again?
Since the values are in hours and minutes, only the minutes matter and leftover seconds can be discarded; for example, duration 1.58 means 1:34 and the leftover 0.8 minutes (48 seconds) can be discarded.
I'm on GNU Awk 4.1.3
As you are using gawk, take advantage of its native time functions:
gawk -F\; '{tmst=sprintf("%s %s %s %s %s 00",\
                substr($1,1,4),\
                substr($1,5,2),\
                substr($1,7,2),\
                substr($2,1,2),\
                substr($2,3,2))
            t1=mktime(tmst)
            seconds=sprintf("%f",$3)+0
            seconds*=60*60
            difference=strftime("%H%M",t1-seconds)
            print $0 FS difference}' file
Results:
20140101;1212;1.5;1042
20140102;1515;1.58;1340
20140103;1759;.69;1717
20140104;1100;12.5;2230
Check: https://www.gnu.org/software/gawk/manual/html_node/Time-Functions.html
Explanation:
tmst=sprintf(...): creates a date string from the file that conforms to the datespec of the mktime function: YYYY MM DD HH MM SS [DST].
t1=mktime(tmst): turns the datespec into a timestamp that gawk can handle (the number of seconds elapsed since 1 January 1970).
seconds=sprintf("%f",$3)+0: converts the third field to a float.
seconds*=60*60: converts hours (as a float) to seconds.
difference=strftime("%H%M",t1-seconds): formats the difference in a human-readable way, hours and minutes.
I highly recommend using a programming language that supports datetime calculations, because the details can get tricky due to daylight saving shifts. You can use Python, for example:
start_times.py:
import csv
from datetime import datetime, timedelta

with open('input.txt', 'rb') as csvfile:
    reader = csv.reader(csvfile, delimiter=';', quotechar='|')
    for row in reader:
        end_day = row[0]
        end_time = row[1]
        # Create a datetime object
        end = datetime.strptime(end_day + end_time, "%Y%m%d%H%M")
        # Translate duration into minutes
        duration = float(row[2]) * 60
        # Calculate start time
        start = end - timedelta(minutes=duration)
        # Column 3 is the start day (can differ from end day!)
        row.append(start.strftime("%Y%m%d"))
        # Column 4 is the start time
        row.append(start.strftime("%H%M"))
        print ';'.join(row)
Run:
python start_times.py
Output:
20140101;1212;1.5;20140101;1042
20140102;1515;1.58;20140102;1340
20140103;1759;.69;20140103;1717
20140104;1100;12.5;20140103;2230 <-- you see, the day matters!
The above example uses the system's timezone. If the input data refers to a different timezone, Python's datetime module allows you to specify it.
I would do something like this:
awk 'BEGIN{FS=OFS=";"}
     { h=substr($2,1,2); m=substr($2,3,2); mins=h*60 + m; diff=mins - $3*60;
       print $0, int(diff/60) ":" int(diff%60)
     }' file
That is, convert everything to minutes and then back to hours/minutes.
Test
$ awk 'BEGIN{FS=OFS=";"}{h=substr($2,1,2); m=substr($2,3,2); mins=h*60 + m; diff=mins - $3*60; print $0, int(diff/60) ":" int(diff%60)}' a
20140101;1212;1.5;10:42
20140102;1515;1.58;13:40
20140103;1759;.69;17:17
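For comparison, the same minutes arithmetic in Python, with zero-padded output and a guard for the day break (a sketch; the helper name is made up):
def start_time(end_hhmm, duration_hours):
    # HHMM -> minutes since midnight, subtract the duration, wrap past midnight.
    end_minutes = int(end_hhmm[:2]) * 60 + int(end_hhmm[2:])
    start = int(end_minutes - duration_hours * 60) % (24 * 60)
    return '%02d:%02d' % (start // 60, start % 60)

print(start_time('1212', 1.5))   # 10:42
print(start_time('1100', 12.5))  # 22:30, i.e. on the previous day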

python pandas index by time

I have a csv file that looks like this:
"06/09/2013 14:08:34.930","7.2680542849633447","1.6151231744362988","0","0","21","1546964992","15.772567829158248","1577332736","8360","21.400382061280961","0","15","0","685","0","0","0","0","0","0","0","4637","0"
the csv includes one month of daily values (24 hrs)
I need to load it into pandas and then get some stats on the data (min, max), but only for records during working hours (between 8:00 and 18:00), across all days.
I am very new to pandas library
Load your data:
import pandas as pd
from datetime import datetime
df = pd.read_csv('data.csv', header=None, index_col=0)
Filter your data for working hours from 8:00 to 18:00 (a list comprehension also works on Python 3, where map returns a lazy iterator that pandas can't use as a mask):
work_hours = lambda d: datetime.strptime(d, '%d/%m/%Y %H:%M:%S.%f').hour in range(8, 18)
df = df[[work_hours(d) for d in df.index]]
Get the min and max of the first data column:
col_min, col_max = df[1].min(), df[1].max()
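A more idiomatic alternative is to parse the dates while reading and filter on the DatetimeIndex directly, avoiding a strptime call per row. A sketch, assuming the first column parses cleanly with the day first, as in the sample:
import pandas as pd

# Parse column 0 into a DatetimeIndex at read time.
df = pd.read_csv('data.csv', header=None, index_col=0,
                 parse_dates=[0], dayfirst=True)
# Keep rows whose hour falls inside the 8:00-17:59 working window.
work = df[(df.index.hour >= 8) & (df.index.hour < 18)]
print(work[1].min(), work[1].max())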

Converting a difference of 2 24-hr times into time

Forgive my ignorance if this is too easy or not the right place to post. I have two 24-hour time strings, 1544 and 1458. Their difference should be 46 minutes, but when I subtract them directly it yields 86, as follows.
1544
-1458
-------
86
Can someone tell me how can I find a time difference of 2 24-hr times?
An hour doesn't have 100 minutes, unfortunately only 60. You need to do this:
 (15*60+44)
-(14*60+58)
-----------
        46
If you are working in C#, you can try to parse them to DateTime using
DateTime date1 = DateTime.ParseExact("1544","HHmm",CultureInfo.InvariantCulture);
DateTime date2 = DateTime.ParseExact("1458","HHmm",CultureInfo.InvariantCulture);
and then subtract them:
TimeSpan diff = date1 - date2;
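The same subtraction is just as easy in Python; a quick sketch using the two times from the question:
from datetime import datetime

# Both parse onto the same dummy date (1900-01-01), so subtraction is safe.
t1 = datetime.strptime('1544', '%H%M')
t2 = datetime.strptime('1458', '%H%M')
print((t1 - t2).seconds // 60)  # 46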
