HIVE - How do I convert Timestamp and sum values - hadoop

I have a timestamp value like
2021-09-01T00:16:18.971228-03:00
I need to separate the date and the hour, and then sum the daytime values and the nighttime values.
My problem is: how can I do this with a timestamp in this format?

Maybe this helps. I created a tibble (data frame) just to show all the variables together.
x <- "2021-09-01T00:16:18.971228-03:00"
library(lubridate)
library(dplyr)
tibble(x) %>%
  mutate(
    dttm = ymd_hms(x),    # parsed and converted to UTC, hence 03:16:18 below
    date = as_date(dttm),
    hour = hour(dttm)
  )
# A tibble: 1 x 4
  x                                dttm                date        hour
  <chr>                            <dttm>              <date>     <int>
1 2021-09-01T00:16:18.971228-03:00 2021-09-01 03:16:18 2021-09-01     3
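If the local hour (0 here, not 3) is what should decide between day and night, the offset has to be kept while parsing. Below is a minimal Python sketch of the whole task, not a Hive solution: the sample rows, the helper names and the 06:00 to 18:00 definition of "day" are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows: (ISO 8601 timestamp, value to sum).
rows = [
    ("2021-09-01T00:16:18.971228-03:00", 10.0),
    ("2021-09-01T14:05:00.000000-03:00", 2.5),
    ("2021-09-01T21:40:00.000000-03:00", 4.0),
    ("2021-09-02T09:30:00.000000-03:00", 1.0),
]

DAY_START, DAY_END = 6, 18  # assumption: "day" means 06:00-17:59 local time


def split_timestamp(text):
    """Return (local date, local hour) for an ISO 8601 string with a UTC offset."""
    parsed = datetime.fromisoformat(text)  # keeps the -03:00 offset
    return parsed.date(), parsed.hour


def sum_day_night(records):
    """Sum the values per (date, 'day' or 'night') bucket."""
    totals = defaultdict(float)
    for text, value in records:
        day, hour = split_timestamp(text)
        period = "day" if DAY_START <= hour < DAY_END else "night"
        totals[(day, period)] += value
    return dict(totals)


totals = sum_day_night(rows)
for (day, period), total in sorted(totals.items()):
    print(day, period, total)
```

The first row lands in the "night" bucket of 2021-09-01 because its local hour is 0; converting to UTC first would move it to hour 3 instead.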

Related

PySpark round off timestamps to full hours?

I am interested in rounding off timestamps to full hours. What I got so far is to round to the nearest hour. For example with this:
df.withColumn("Full Hour", hour((round(unix_timestamp("Timestamp")/3600)*3600).cast("timestamp")))
But this "round" function uses HALF_UP rounding. This means 23:56 results in 00:00, whereas I would prefer to get 23:00. Is this possible? I couldn't find an option to set the rounding behaviour of the function.
I think you're overcomplicating things. The hour function already returns the hour component of a timestamp, which is the timestamp rounded down to the full hour.
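from pyspark.sql import Row
from pyspark.sql.functions import hour, to_timestamp, unix_timestamp

# `sc` is the SparkContext provided by the PySpark shell
df = (sc
    .parallelize([Row(Timestamp='2016_08_21 11_59_08')])
    .toDF()
    # HH is the 24-hour clock; hh would be the 12-hour clock
    .withColumn("parsed", to_timestamp("Timestamp", "yyyy_MM_dd HH_mm_ss")))
# hour() only extracts the hour component, so 11:59:08 yields 11
df2 = df.withColumn("Full Hour", hour(unix_timestamp("parsed").cast("timestamp")))
df2.show()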
Output:
+-------------------+-------------------+---------+
|          Timestamp|             parsed|Full Hour|
+-------------------+-------------------+---------+
|2016_08_21 11_59_08|2016-08-21 11:59:08|       11|
+-------------------+-------------------+---------+
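To make the difference between rounding down and HALF_UP rounding concrete, here is a small plain-Python sketch (no Spark needed). The helper names are made up for illustration. In Spark itself, swapping round for floor in the question's expression, or using date_trunc("hour", ...) from pyspark.sql.functions (Spark 2.3+), should give the rounded-down timestamp, though I have not run that against the asker's data.

```python
from datetime import datetime, timedelta


def floor_to_hour(ts):
    """Drop minutes, seconds and microseconds (always rounds down)."""
    return ts.replace(minute=0, second=0, microsecond=0)


def round_to_hour(ts):
    """Round half up to the nearest hour, like the round() approach in the question."""
    return floor_to_hour(ts + timedelta(minutes=30))


stamp = datetime(2016, 8, 21, 23, 56, 0)
print(floor_to_hour(stamp))  # 2016-08-21 23:00:00
print(round_to_hour(stamp))  # 2016-08-22 00:00:00
```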

How can I retain NA's in timestamp when using dapply

I'm trying to convert many character date columns to timestamps using dapply. However, values that are empty strings are being converted to the origin date "1970-01-01".
df <- data.frame(a = c("12/31/2016", "12/31/2016", "12/31/2016"),
                 b = c("01/01/2016", "01/01/2017", ""))
ddf <- as.DataFrame(df)
schema <- structType(
  structField("a", "timestamp"),
  structField("b", "timestamp"))
converted_dates <- dapply(ddf,
  function(x) { as.data.frame(lapply(x, function(y) as.POSIXct(y, format = "%m/%d/%Y"))) },
  schema)
head(converted_dates)
           a          b
1 2016-12-31 2016-01-01
2 2016-12-31 2017-01-01
3 2016-12-31 1970-01-01
In contrast, running the same function directly on an R data.frame keeps the NA result for the empty date value:
as.data.frame(lapply(df, function(y) as.POSIXct(y, format = "%m/%d/%Y")))
           a          b
1 2016-12-31 2016-01-01
2 2016-12-31 2017-01-01
3 2016-12-31       <NA>
I am using Spark 2.0.1.

Smaller variation between times of different days

I have been working on an algorithm that selects a set of date/time objects with a certain characteristic, but without success.
The data to be used are in a list of lists of date/time objects,
e.g.:
lstDays[i][j], where i selects the day and j selects the time within that day
What is the problem? I need the set of date/time objects whose times of day are closest to each other. Each time in this set must come from a different day.
For example: [2012-09-09 12:00, 2012-09-10 12:00, 2012-09-11 12:00]
This set is the best possible case because the difference between its times is zero.
Important:
To give some context: I want to observe whether a phenomenon occurs at the same time on different days. If not, I want to evaluate whether the distance between the times is reasonable for my study.
I would like a generic algorithm that works for any number of days and times. The algorithm should return every set of datetime objects together with its time distance:
[2012-09-09 12:00,2012-09-10 12:00, 2012-09-11 12:00], 0
[2012-09-09 13:00,2012-09-10 13:00, 2012-09-11 13:05], 5
and so on.
:: "0", because the difference between the times of day on the first line is zero.
:: "5", because the difference between the times of day on the second line is five minutes (13:00 versus 13:05).
Edit: Code here
for day in lstDays:
    for dt in day:
        print(dt)
Output:
2013-07-18 11:16:00
2013-07-18 12:02:00
2013-07-18 12:39:00
2013-07-18 13:14:00
2013-07-18 13:50:00
2013-07-19 11:30:00
2013-07-19 12:00:00
2013-07-19 12:46:00
2013-07-19 13:19:00
2013-07-22 11:36:00
2013-07-22 12:21:00
2013-07-22 12:48:00
2013-07-22 13:26:00
2013-07-23 11:18:00
2013-07-23 11:48:00
2013-07-23 12:30:00
2013-07-23 13:12:00
2013-07-24 11:18:00
2013-07-24 11:42:00
2013-07-24 12:20:00
2013-07-24 12:52:00
2013-07-24 13:29:00
Note: lstDays[i][j] is a datetime object.
lstDays = [ [/* datetime objects from day i */], [/* datetime objects from day i+1 */], [/* datetime objects from day i+2 */], ... ]
I am not worried about performance for now.
Hope that you can help me! (:
Generate a histogram:
hours = [0] * 24
for obj in objects:  # whatever your objects are
    # assuming obj.date_time looks like '2013-07-18 10:55:00'
    hour = obj.date_time[11:13]  # the hour occupies positions 11-12
    hours[int(hour)] += 1
for hour in range(24):
    print('%02d: %d' % (hour, hours[hour]))
You can always collect the times into a list, compute the differences between them, and group the objects whose difference is below a limit. Everything is packed into a dictionary with the group of timestamps as the key and the difference as the value. If this is not exactly what you need, I'm pretty sure it should be easy to select whatever result you need from it.
import datetime
import numpy

def seconds_of_day(t):
    # datetime.time objects cannot be subtracted, so compare seconds since midnight
    return t.hour * 3600 + t.minute * 60 + t.second

times_list = [object1.time(), object2.time(), ..., objectN.time()]  # placeholder datetime objects
limit = 5  # limit of five seconds
groups = {}
for time in times_list:
    delta_times = numpy.asarray([seconds_of_day(tt) - seconds_of_day(time) for tt in times_list])
    whr = numpy.where(abs(delta_times) < limit)[0]
    similar = [str(times_list[ii]) for ii in whr]
    if len(similar) > 1:
        similar.sort()
        max_time = numpy.max(delta_times[whr])  # max? median? mean?
        groups[tuple(similar)] = max_time
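Since the question asks for exactly one time per day and says performance is not a concern, a brute-force variant is also possible. The following is only a sketch under my own assumptions: the "distance" of a set is taken to be the spread (latest minus earliest time of day, in seconds), midnight wrap-around is ignored, and the function names and the shortened sample data are made up for illustration.

```python
from datetime import datetime
from itertools import product


def seconds_of_day(dt):
    """Seconds elapsed since midnight for a datetime."""
    return dt.hour * 3600 + dt.minute * 60 + dt.second


def rank_combinations(days):
    """Return (combination, spread_in_seconds) pairs, smallest spread first.

    Each combination takes exactly one datetime from every day.
    """
    ranked = []
    for combo in product(*days):
        offsets = [seconds_of_day(dt) for dt in combo]
        ranked.append((combo, max(offsets) - min(offsets)))
    ranked.sort(key=lambda pair: pair[1])
    return ranked


# Shortened sample taken from the output listed in the question.
lst_days = [
    [datetime(2013, 7, 18, 11, 16), datetime(2013, 7, 18, 12, 2)],
    [datetime(2013, 7, 19, 11, 30), datetime(2013, 7, 19, 12, 0)],
    [datetime(2013, 7, 22, 11, 36), datetime(2013, 7, 22, 12, 21)],
]

best_combo, best_spread = rank_combinations(lst_days)[0]
print([str(dt) for dt in best_combo], best_spread)
```

The number of combinations is the product of the list lengths, so this only stays practical for a handful of days with a few times each.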

Faster date formatting in R?

I often need to convert (long) character strings into the date class in R. I notice that this step seems quite slow.
Example:
date <- c("5/31/2013 23:30", "5/31/2013 23:35", "5/31/2013 23:40", "5/31/2013 23:45", "5/31/2013 23:50", "5/31/2013 23:55")
Date <- as.POSIXct(date, format="%m/%d/%Y %H:%M")
This isn't a huge problem, but I wonder if I'm overlooking an easy route to increased efficiency. Any tips for speeding this up? Thanks.
Since I wrote this before it was pointed out this is a duplicate, I'll add it as an answer anyway. Basically package fasttime can help you IF you have dates AFTER 1970-01-01 00:00:00 AND they are GMT AND they are of the format year, month, day, hour, minute, second. If you can rewrite your dates to this format then fastPOSIXct will be quick:
# data
date <- c( "2013/5/31 23:30" , "2013/5/31 23:35" , "2013/5/31 23:40" , "2013/5/31 23:45" )
require(fasttime)
# fasttime function
dates.ft <- fastPOSIXct( date , tz = "GMT" )
# base function
dates <- as.POSIXct( date , format= "%Y/%m/%d %H:%M")
# rough comparison
require(microbenchmark)
microbenchmark( fastPOSIXct( date , tz = "GMT" ) , as.POSIXct( date , format= "%Y/%m/%d %H:%M") , times = 100L )
#Unit: microseconds
#                                        expr     min      lq  median       uq     max neval
#               fastPOSIXct(date, tz = "GMT")  19.598  21.699  24.148  25.5485 215.927   100
# as.POSIXct(date, format = "%Y/%m/%d %H:%M") 160.633 163.433 168.332 181.9800 278.220   100
But the question would be, is it quicker to transform your dates to a format fasttime can accept or just use as.POSIXct or buy a faster computer?!

Number of days between two Time instances

How can I determine the number of days between two Time instances in Ruby?
> earlyTime = Time.at(123)
> laterTime = Time.now
> time_difference = laterTime - earlyTime
I'd like to determine the number of days in time_difference (I'm not worried about fractions of days. Rounding up or down is fine).
Difference of two times is in seconds. Divide it by number of seconds in 24 hours.
(t1 - t2).to_i / (24 * 60 * 60)
require 'date'
days_between = (Date.parse(laterTime.to_s) - Date.parse(earlyTime.to_s)).round
Edit ...or more simply...
require 'date'
(laterTime.to_date - earlyTime.to_date).round
earlyTime = Time.at(123)
laterTime = Time.now
time_difference = laterTime - earlyTime
time_difference_in_days = time_difference / 1.day # just divide by 1.day (needs ActiveSupport, e.g. in Rails)
[1] pry(main)> earlyTime = Time.at(123)
=> 1970-01-01 01:02:03 +0100
[2] pry(main)> laterTime = Time.now
=> 2014-04-15 11:13:40 +0200
[3] pry(main)> (laterTime.to_date - earlyTime.to_date).to_i
=> 16175
To account for DST (Daylight Saving Time), you'd have to count it by the days. Note that this assumes less than a day is counted as 1 (rounded up):
num = 0
cur = start_time
while cur < end_time
  num += 1
  cur = cur.advance(:days => 1)
end
return num
Here is a simple answer that works across DST:
numDays = ((laterTime - earlyTime)/(24.0*60*60)).round
60*60 is the number of seconds in an hour.
24.0 is the nominal number of hours in a day. It's a float because, across a DST change, some days are a little longer than 24 hours and some are a little shorter. So when we divide by the number of seconds in a day we still have a float, and round will round it to the closest integer.
So if we cross a DST change, in either direction, we'll still round to the closest day, even in some unusual time zone that shifts by more than an hour for DST.
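To see why rounding matters here, the same arithmetic can be checked in a few lines of Python (the Ruby expression above behaves the same way). This is only an illustration: the two fixed UTC offsets stand in for a real time zone before and after a spring clock change, and the dates are arbitrary.

```python
from datetime import datetime, timedelta, timezone

SECONDS_PER_DAY = 24 * 60 * 60

# Fixed offsets stand in for a real time zone: UTC-5 before the spring
# clock change and UTC-4 after it (illustrative values).
before_change = timezone(timedelta(hours=-5))
after_change = timezone(timedelta(hours=-4))

early_time = datetime(2021, 3, 10, 10, 0, tzinfo=before_change)
later_time = datetime(2021, 3, 17, 10, 0, tzinfo=after_change)

elapsed = (later_time - early_time).total_seconds()  # one week minus one hour

truncated_days = int(elapsed // SECONDS_PER_DAY)              # integer division
rounded_days = round(elapsed / SECONDS_PER_DAY)               # float division, then round
calendar_days = (later_time.date() - early_time.date()).days  # date difference

print(truncated_days, rounded_days, calendar_days)  # 6 7 7
```

Integer division loses a day because the week is one hour short, while rounding and the date-based difference both report 7.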
in_days (Rails 6.1+)
Rails 6.1 introduces new ActiveSupport::Duration conversion methods like in_seconds, in_minutes, in_hours, in_days, in_weeks, in_months, and in_years.
As a result, now, your problem can be solved as:
date_1 = Time.parse('2020-10-18 00:00:00 UTC')
date_2 = Time.parse('2020-08-13 03:35:38 UTC')
(date_2 - date_1).seconds.in_days.to_i.abs
# => 65
None of these answers will actually work if you don't want to estimate and you want to take daylight saving time into account.
For instance, between 10 AM on the Wednesday before the fall clock change and 10 AM on the Wednesday after it, the elapsed time is 1 week and 1 hour. In the spring it would be 1 week minus 1 hour.
To get the accurate number of days you can use the following code:
def self.days_between_two_dates(later_time, early_time)
  days_between = (later_time.to_date - early_time.to_date).to_f
  later_time_of_day_in_seconds = later_time.hour * 3600 + later_time.min * 60 + later_time.sec
  early_time_of_day_in_seconds = early_time.hour * 3600 + early_time.min * 60 + early_time.sec
  # add the wall-clock difference as a fraction of a day (86,400 seconds)
  days_between + (later_time_of_day_in_seconds - early_time_of_day_in_seconds) / 86_400.0
end
