Faster date formatting in R? - performance

I often need to convert (long) character strings into the date class in R. I notice that this step seems quite slow.
Example:
date <- c("5/31/2013 23:30", "5/31/2013 23:35", "5/31/2013 23:40", "5/31/2013 23:45", "5/31/2013 23:50", "5/31/2013 23:55")
Date <- as.POSIXct(date, format="%m/%d/%Y %H:%M")
This isn't a huge problem, but I wonder if I'm overlooking an easy route to increased efficiency. Any tips for speeding this up? Thanks.

Since I wrote this before it was pointed out this is a duplicate, I'll add it as an answer anyway. Basically package fasttime can help you IF you have dates AFTER 1970-01-01 00:00:00 AND they are GMT AND they are of the format year, month, day, hour, minute, second. If you can rewrite your dates to this format then fastPOSIXct will be quick:
# data
date <- c( "2013/5/31 23:30" , "2013/5/31 23:35" , "2013/5/31 23:40" , "2013/5/31 23:45" )
require(fasttime)
# fasttime function
dates.ft <- fastPOSIXct( date , tz = "GMT" )
# base function
dates <- as.POSIXct( date , format= "%Y/%m/%d %H:%M")
# rough comparison
require(microbenchmark)
microbenchmark( fastPOSIXct( date , tz = "GMT" ) , as.POSIXct( date , format= "%Y/%m/%d %H:%M") , times = 100L )
#Unit: microseconds
#                                        expr     min      lq  median       uq     max neval
#               fastPOSIXct(date, tz = "GMT")  19.598  21.699  24.148  25.5485 215.927   100
# as.POSIXct(date, format = "%Y/%m/%d %H:%M") 160.633 163.433 168.332 181.9800 278.220   100
But the question would be, is it quicker to transform your dates to a format fasttime can accept or just use as.POSIXct or buy a faster computer?!

Related

HIVE - How do I convert Timestamp and sum values

I have a timestamp value like
2021-09-01T00:16:18.971228-03:00
And I need to separate the hours and the date, and after that summarize the day values and the night values.
My problem is how can I do this with this kind of format?
Maybe this helps you, I created a data.frame just to show all variables together.
x <- "2021-09-01T00:16:18.971228-03:00"
library(lubridate)
library(dplyr)
tibble(x) %>%
  mutate(
    dttm = ymd_hms(x),
    date = as_date(dttm),
    hour = hour(dttm)
  )
# A tibble: 1 x 4
  x                                dttm                date        hour
  <chr>                            <dttm>              <date>     <int>
1 2021-09-01T00:16:18.971228-03:00 2021-09-01 03:16:18 2021-09-01     3
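For comparison, here is a minimal Python sketch of the same split using only the standard library. The helper name `split_timestamp` and the day/night boundary (06:00 to 17:59 in the timestamp's own offset) are assumptions for illustration; the question does not define them.

```python
from datetime import datetime, timezone

def split_timestamp(text):
    """Parse an ISO 8601 timestamp and return (date, hour, period) in its own offset."""
    dttm = datetime.fromisoformat(text)  # keeps the -03:00 offset
    # Assumption: "day" means 06:00-17:59 local time; adjust to your own definition.
    period = "day" if 6 <= dttm.hour < 18 else "night"
    return dttm.date(), dttm.hour, period

x = "2021-09-01T00:16:18.971228-03:00"
date, hour, period = split_timestamp(x)
# The same instant expressed in UTC, for comparison with the output above.
utc_hour = datetime.fromisoformat(x).astimezone(timezone.utc).hour
print(date, hour, period, utc_hour)
```

Note that ymd_hms() converts to UTC by default, which is why the tibble above reports hour 3, while the hour in the original -03:00 offset is 0. Which of the two you want to group by depends on how "day" and "night" are defined for your data.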

HIVE:how to calculate seconds difference of time format: yyyyMMdd HH:mm:ss

How to calculate seconds difference of time format:yyyyMMdd HH:mm:ss?
For example,calculate seconds difference of 20190102 00:01:05 and 20190102 02:14:18
Use UNIX_TIMESTAMP function to convert timestamps to seconds, then subtract:
select UNIX_TIMESTAMP('20190102 02:14:18','yyyyMMdd HH:mm:ss') -
UNIX_TIMESTAMP('20190102 00:01:05','yyyyMMdd HH:mm:ss');
Returns:
7993 seconds.
Difference in 'HH:mm:ss' format:
select from_unixtime(UNIX_TIMESTAMP('20190102 02:14:18','yyyyMMdd HH:mm:ss') -
UNIX_TIMESTAMP('20190102 00:01:05','yyyyMMdd HH:mm:ss'), 'HH:mm:ss');
Returns:
02:13:13
Also you can use solution how to format seconds in 'HH:mm:ss' using explicit math proposed in this answer: https://stackoverflow.com/a/57497316/2700344
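As a cross-check of the arithmetic, here is a small Python sketch that computes the same difference and formats it with explicit math. The function names `seconds_between` and `as_hms` are hypothetical helpers, not part of Hive.

```python
from datetime import datetime

FMT = "%Y%m%d %H:%M:%S"  # Python counterpart of the 'yyyyMMdd HH:mm:ss' pattern

def seconds_between(start, end, fmt=FMT):
    """Return the whole-second difference between two timestamp strings."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())

def as_hms(seconds):
    """Format a non-negative number of seconds as HH:MM:SS with explicit math."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

diff = seconds_between("20190102 00:01:05", "20190102 02:14:18")
print(diff, as_hms(diff))  # 7993 02:13:13
```

The explicit arithmetic does not wrap at 24 hours, which a time-of-day format pattern would, so it is the safer choice when differences can exceed one day.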

Awk and calculating start time from end time and duration

I have a file with date, end time and duration in decimal format and I need to calculate the start time. The file looks like:
20140101;1212;1.5
20140102;1515;1.58
20140103;1759;.69
20140104;1100;12.5
...
The duration 1.5 for the time 12:12 means one and a half hours, so the start time would be 12:12 - 1:30 = 10:42. Likewise, 11:00 - 12.5 = 11:00 - 12:30 = 22:30 on the previous day. Is there an easy way of calculating such time differences in Awk, or is it the good ol' split-multiply-subtract-and-handle-the-day-break-yourself all over again?
Since the values are in hours and minutes, only the minutes matter and the seconds can be discarded. For example, duration 1.58 means 1:34, and the leftover 0.8 minutes (48 seconds) can be discarded.
I'm on GNU Awk 4.1.3
As you are using gawk, take advantage of its native time functions:
gawk -F\; '{tmst=sprintf("%s %s %s %s %s 00",\
substr($1,1,4),\
substr($1,5,2),\
substr($1,7,2),\
substr($2,1,2),\
substr($2,3,2))
t1=mktime(tmst)
seconds=sprintf("%f",$3)+0
seconds*=60*60
difference=strftime("%H%M",t1-seconds)
print $0""FS""difference}' file
Results:
20140101;1212;1.5;1042
20140102;1515;1.58;1340
20140103;1759;.69;1717
20140104;1100;12.5;2230
Check: https://www.gnu.org/software/gawk/manual/html_node/Time-Functions.html
Explanation:
tmst=sprintf(..) : builds a date string from the input fields that conforms to the datespec expected by mktime: YYYY MM DD HH MM SS [DST].
t1=mktime(tmst) : turns the datespec into a timestamp that gawk can handle (the number of seconds elapsed since 1 January 1970).
seconds=sprintf("%f",$3)+0 : converts the third field to a float.
seconds*=60*60 : converts hours (as a float) to seconds.
difference=strftime("%H%M",t1-seconds) : formats the difference in human-readable form, as hours and minutes.
I highly recommend using a programming language that supports datetime calculations, because the details can be tricky, for example around daylight saving shifts. You can use Python, for example:
start_times.py:
import csv
from datetime import datetime, timedelta

with open('input.txt', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=';', quotechar='|')
    for row in reader:
        end_day = row[0]
        end_time = row[1]
        # Create a datetime object
        end = datetime.strptime(end_day + end_time, "%Y%m%d%H%M")
        # Translate duration into minutes
        duration = float(row[2]) * 60
        # Calculate start time
        start = end - timedelta(minutes=duration)
        # Column 3 is the start day (can differ from end day!)
        row.append(start.strftime("%Y%m%d"))
        # Column 4 is the start time
        row.append(start.strftime("%H%M"))
        print(';'.join(row))
Run:
python start_times.py
Output:
20140101;1212;1.5;20140101;1042
20140102;1515;1.58;20140102;1340
20140103;1759;.69;20140103;1717
20140104;1100;12.5;20140103;2230 <-- you see, the day matters!
The above example uses naive datetime objects, which carry no timezone information. If the input data refers to a specific timezone, Python's datetime module lets you specify it.
I would do something like this:
awk 'BEGIN{FS=OFS=";"}
     { h=substr($2,1,2); m=substr($2,3,2); mins=h*60 + m; diff=mins - $3*60;
       print $0, int(diff/60) ":" int(diff%60)
     }' file
That is, convert everything to minutes and then back to hours/minutes. Note that this does not handle a start time that falls on the previous day.
Test
$ awk 'BEGIN{FS=OFS=";"}{h=substr($2,1,2); m=substr($2,3,2); mins=h*60 + m; diff=mins - $3*60; print $0, int(diff/60) ":" int(diff%60)}' a
20140101;1212;1.5;10:42
20140102;1515;1.58;13:40
20140103;1759;.69;17:17
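The minutes-based idea can be extended to cover the day break. Below is a minimal Python sketch, assuming the start time never lies more than a few days back and that leftover seconds are simply dropped; `start_time` is a hypothetical helper name.

```python
import math

def start_time(end_hhmm, duration_hours):
    """Return (day_offset, 'HH:MM') for an end time minus a duration in decimal hours."""
    end_minutes = int(end_hhmm[:2]) * 60 + int(end_hhmm[2:4])
    # Drop leftover seconds by flooring to whole minutes.
    diff = math.floor(end_minutes - duration_hours * 60)
    # divmod with a positive divisor floors, so negative values wrap to the previous day.
    day_offset, minute_of_day = divmod(diff, 24 * 60)
    return day_offset, f"{minute_of_day // 60:02d}:{minute_of_day % 60:02d}"

for line in ["20140101;1212;1.5", "20140104;1100;12.5"]:
    day, end, duration = line.split(";")
    print(line, start_time(end, float(duration)))
```

A day offset of -1 means the start falls on the day before the end date, as in the 11:00 minus 12.5 hours case.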

Smaller variation between times of different days

I have been working on an algorithm that selects a set of date/time objects with a certain characteristic, but with no success.
The data to be used are in a list of lists of date/time objects,
e.g.:
lstDays[i][j], i <= day chooser, j <= time chooser
What is the problem? I need a set of nearest date/time objects. Each time of this set must come from different days.
For example: [2012-09-09 12:00,2012-09-10 12:00, 2012-09-11 12:00]
This set of date/time objects is the best case, because the time difference is minimized to zero.
Important
To give some context: I want to observe whether a phenomenon occurs at the same time on different days. If not, I want to evaluate whether the distance between the hours is reasonable for my study.
I would like a generic algorithm for any number of days and times. This algorithm should return every set of datetime objects together with its time distance:
[2012-09-09 12:00,2012-09-10 12:00, 2012-09-11 12:00], 0
[2012-09-09 13:00,2012-09-10 13:00, 2012-09-11 13:05], 5
and so on.
:: "0", because the difference between all times on the first line is zero.
:: "5", because the largest difference between the times on the second line is five minutes.
Edit: Code here
for i in range(len(lstDays)):
    for j in range(len(lstDays[i])):
        print(lstDays[i][j])
Output:
2013-07-18 11:16:00
2013-07-18 12:02:00
2013-07-18 12:39:00
2013-07-18 13:14:00
2013-07-18 13:50:00
2013-07-19 11:30:00
2013-07-19 12:00:00
2013-07-19 12:46:00
2013-07-19 13:19:00
2013-07-22 11:36:00
2013-07-22 12:21:00
2013-07-22 12:48:00
2013-07-22 13:26:00
2013-07-23 11:18:00
2013-07-23 11:48:00
2013-07-23 12:30:00
2013-07-23 13:12:00
2013-07-24 11:18:00
2013-07-24 11:42:00
2013-07-24 12:20:00
2013-07-24 12:52:00
2013-07-24 13:29:00
Note: lstDays[i][j] is a datetime object.
lstDays = [ [/*datetime objects from a day i*/], [/*datetime objects from a day i+1*/], [/*datetime objects from a day i+2*/], ... ]
And I am not worried about performance, a priori.
Hope that you can help me! (:
Generate a histogram:
hours = [0] * 24
for obj in objects:  # whatever your objects are
    # assuming obj.date_time looks like '2013-07-18 10:55:00'
    hour = obj.date_time[11:13]  # the hour is at string positions 11-12
    hours[int(hour)] += 1
for hour in range(24):
    print('%02d: %d' % (hour, hours[hour]))
You can always collect the times into a list, compute the pairwise differences, and group the objects whose difference is below a chosen limit. Everything is packed into a dictionary with the timestamps as keys and the difference as the value. If this is not exactly what you need, I'm pretty sure it should be easy to select whatever result you need from it.
import numpy

times_list = [object1, object2, ..., objectN]  # placeholder: your datetime objects
limit = 5  # limit of five seconds
# datetime.time objects cannot be subtracted, so compare seconds since midnight
secs = numpy.asarray([t.hour * 3600 + t.minute * 60 + t.second for t in times_list])
groups = {}
for s in secs:
    delta_times = secs - s
    whr = numpy.where(abs(delta_times) < limit)[0]
    similar = [str(times_list[ii]) for ii in whr]
    if len(similar) > 1:
        similar.sort()
        max_time = numpy.max(abs(delta_times[whr]))  # max? median? mean?
        groups[tuple(similar)] = max_time
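Since performance is not a concern, a brute-force enumeration is another option: take one datetime from each day, measure how far apart the times of day are, and rank the combinations. The sketch below is only one possible reading of the question. The helper names are hypothetical, the "distance" is assumed to be the spread (latest minus earliest time of day) in seconds, and times that straddle midnight are not handled. The sample data is taken from the example sets in the question.

```python
from datetime import datetime
from itertools import product

def seconds_of_day(dt):
    """Seconds elapsed since midnight for a datetime."""
    return dt.hour * 3600 + dt.minute * 60 + dt.second

def candidate_sets(lst_days):
    """Yield (combo, spread_seconds) for every choice of one datetime per day."""
    for combo in product(*lst_days):
        secs = [seconds_of_day(dt) for dt in combo]
        yield combo, max(secs) - min(secs)

lst_days = [
    [datetime(2012, 9, 9, 12, 0), datetime(2012, 9, 9, 13, 0)],
    [datetime(2012, 9, 10, 12, 0), datetime(2012, 9, 10, 13, 0)],
    [datetime(2012, 9, 11, 12, 0), datetime(2012, 9, 11, 13, 5)],
]
# Smallest spread first; each entry pairs a combination with its spread in seconds.
ranked = sorted(candidate_sets(lst_days), key=lambda item: item[1])
for combo, spread in ranked[:2]:
    print([dt.strftime("%Y-%m-%d %H:%M") for dt in combo], spread)
```

The number of combinations is the product of the list lengths, so this only stays practical for a modest number of days and times per day.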

How do I calculate the offset, in hours, of a given timezone from UTC in ruby?

I need to calculate the offset, in hours, of a given timezone from UTC in Ruby. This line of code had been working for me, or so I thought:
offset_in_hours = (TZInfo::Timezone.get(self.timezone).current_period.offset.utc_offset).to_f / 3600.0
But, it turns out that was returning to me the Standard Offset, not the DST offset. So for example, assume
self.timezone = "America/New_York"
If I run the above line, offset_in_hours = -5, not -4 as it should, given that the date today is April 1, 2012.
Can anyone advise me how to calculate offset_in_hours from UTC given a valid string TimeZone in Ruby that accounts for both standard time and daylight savings?
Thanks!
Update
Here is some output from IRB. Note that New York is 4 hours behind UTC, not 5, because of daylight savings:
>> require 'tzinfo'
=> false
>> timezone = "America/New_York"
=> "America/New_York"
>> offset_in_hours = TZInfo::Timezone.get(timezone).current_period.utc_offset / (60*60)
=> -5
>>
This suggests that there is a bug in TZInfo or it is not dst-aware
Update 2
Per joelparkerhender's comments, the bug in the above code is that I was using utc_offset, not utc_total_offset.
Thus, per my original question, the correct line of code is:
offset_in_hours = (TZInfo::Timezone.get(self.timezone).current_period.offset.utc_total_offset).to_f / 3600.0
Yes, use TZInfo like this:
require 'tzinfo'
tz = TZInfo::Timezone.get('America/Los_Angeles')
To get the current period:
current = tz.current_period
To find out if daylight savings time is active:
current.dst?
#=> true
To get the base offset of the timezone from UTC in seconds:
current.utc_offset
#=> -28800 which is -8 hours; this does NOT include daylight savings
To get the daylight savings offset from standard time:
current.std_offset
#=> 3600 which is 1 hour; this is because right now we're in daylight savings
To get the total offset from UTC:
current.utc_total_offset
#=> -25200 which is -7 hours
The total offset from UTC equals utc_offset + std_offset (here -28800 + 3600 = -25200).
It is the offset of local time from UTC, in seconds, with daylight savings applied when it is in effect.
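The relationship between the three values can be checked with plain arithmetic. This Python sketch only mirrors the numbers quoted above; `total_offset_hours` is a hypothetical helper and does not call TZInfo.

```python
def total_offset_hours(utc_offset, std_offset):
    """Combine the base UTC offset and the DST adjustment (both in seconds) into hours."""
    return (utc_offset + std_offset) / 3600.0

# Values from the America/Los_Angeles example above, during daylight savings.
print(total_offset_hours(-28800, 3600))  # -7.0
# America/New_York from the question: base offset of -5 hours plus 1 hour of DST.
print(total_offset_hours(-5 * 3600, 3600))  # -4.0
```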
