Is there a way to do datetime addition in Pig?

What I would like to do in Pig is something that is very common in SQL.
I have a date field of the form yyyy-MM-dd HH:mm:ss and another field containing an integer that represents an amount of hours. Is there a way to easily add the integer to the datetime field so that the result wraps the way we expect with clock math?
Example: the date is 2013-06-01 23:12:12.
Then I add 2 hours.
I should get 2013-06-02 01:12:12.

With the latest release of Pig (0.11.0) it should be possible, but the amount of hours (the duration) must be given in ISO 8601 duration format, e.g. PT2H for two hours. Pig provides the AddDuration function, which adds a Duration to a DateTime object. You can find more about AddDuration in the Pig built-in function documentation.
Edit:
Yes, you can add negative hours. I tried this on my Ubuntu box:
Input:
2009-01-07T01:07:01.000Z,PT1S
2008-02-06T02:06:02.000Z,PT1M
2007-03-05T03:05:03.000Z,PT-1H
Query:
grunt> a = LOAD '/pig.txt' USING PigStorage(',') AS (dt:datetime, dr:chararray);
grunt> b = FOREACH a GENERATE AddDuration(dt, dr) AS dt1;
grunt> dump b;
Output:
(2009-01-07T01:07:02.000Z)
(2008-02-06T02:07:02.000Z)
(2007-03-05T02:05:03.000Z)
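Applied to the original question (a chararray timestamp plus an integer hours column), a minimal sketch could build the ISO 8601 duration string on the fly. Untested; the file name, field names, and the nested two-argument CONCAT calls are assumptions:
a = LOAD 'events.csv' USING PigStorage(',') AS (ts:chararray, hrs:int);
-- parse the timestamp, then build 'PT<hrs>H' and add it
b = FOREACH a GENERATE AddDuration(ToDate(ts, 'yyyy-MM-dd HH:mm:ss'),
        CONCAT(CONCAT('PT', (chararray)hrs), 'H')) AS ts_plus_hrs;
dump b;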

Related

Apache Pig: Calculate number of days between a date and current date

I have a list of movies in the form (#,title,year,rating,duration):
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu: Original Version,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
...
I have the year in each tuple, which I need to treat as the 1st of January of that year.
I need to calculate the number of days between this date and the current date.
My approach:
movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration);
daysbetween_data = foreach movies generate DaysBetween(ToDate(year,'<WHAT FORMAT TO GIVE HERE>'), ToDate(<CURRENT DATE HERE>));
Any idea how to do this?
Load the year into a chararray field, use CONCAT to prepend '01-01-' to the year field so that you get the format 'MM-dd-yyyy', and then use ToDate and DaysBetween.
movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id:int,name:chararray,year:chararray,rating:double,duration:int);
daysbetween_data = foreach movies generate DaysBetween(ToDate(CONCAT('01-01-',year),'MM-dd-yyyy'),CurrentTime());
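Alternatively, since Pig's ToDate is backed by Joda-Time, which defaults a missing month and day to January 1st, parsing the bare year may work directly (a sketch, untested):
daysbetween_alt = foreach movies generate DaysBetween(ToDate(year,'yyyy'), CurrentTime());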

Epoch time difference in Pig

I have 3 columns which contain start_time, end_time and tags. Times are represented in epoch time format as shown in the example below. I want to find the rows which have a 1 hour time difference between them.
Example:
Start_time End_Time Tags
1235000081 1235000501 "Answered"
1235000081 1235000551 "Answered"
I need to fetch the tags column if the time diff is less than an hour.
I want to do it in Pig - can anyone kindly help?
input.txt
1235000081 1235000501 Answered
1235000081 1235000551 Answered
Pig script
A = LOAD '/home/kishore/input.txt' AS (col1:long, col2:long, col3:chararray);
B = FOREACH A GENERATE ToDate(col1 * 1000) AS startdate, ToDate(col2 * 1000) AS enddate, col3;
C = FILTER B BY GetHour(enddate) - GetHour(startdate) == 1;
DUMP C;
You can filter the rows based on whatever condition you need, like >, < or ==.
In case you want to keep the date fields as timestamps, the solution is the following:
data = LOAD '/path/to/your/input' AS (Start_Time:long, End_Time:long, Tags:chararray);
data_proc = FOREACH data GENERATE *, ToDate(Start_Time * 1000) AS Start_DT, ToDate(End_Time * 1000) AS End_DT;
filtered = FILTER data_proc BY GetHour(End_DT) - GetHour(Start_DT) == 1;
DUMP filtered;
The one crucial thing is that the Pig ToDate UDF needs a timestamp with millisecond precision, so you simply have to multiply your epoch-seconds fields by 1000 before using this UDF (the Start_DT/End_DT aliases also avoid a clash with the loaded field names).
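Note that GetHour compares only the hour-of-day component, so 01:59 and 02:01 count as one hour apart even though they differ by two minutes. If the requirement is literally a difference of less than one hour, the epoch seconds can be compared directly (a sketch, untested, reusing the data relation above):
under_hour = FILTER data BY (End_Time - Start_Time) < 3600;
tags_only = FOREACH under_hour GENERATE Tags;
DUMP tags_only;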

Pig Help: Splitting a Field into Multiple Fields

Hi, I am playing around with Pig for the first time and am curious how to split a field into multiple other fields.
I have a bag, A, like the one below:
grunt> Dump A;
(text, text, Mon Mar 07 12:00:00 CDT 2016)
What I'd like to do is split the Date-Time field into multiple fields so that I can explore the distribution of the data set and do group bys on the Day of Week, Month, Year, etc.
I have been looking at TOKENIZE but am unsure it meets my needs, as I need/want to have field names added to the bag or to create a nested bag.
Any ideas?
Assuming that the value is already of datatype datetime, you could use the following functions to extract the individual elements. See the DateTime functions in the Pig built-in function reference.
B = FOREACH A GENERATE f1,f2,
GetDay(f3) as f3_Day,
GetMonth(f3) as f3_Month,
GetYear(f3) as f3_Year,
GetHour(f3) as f3_Hour,
GetMinute(f3) as f3_Minute,
GetSecond(f3) as f3_Second;
If the datatype is chararray then use the ToDate() function to convert it to datetime and extract the date parts.
B = FOREACH A GENERATE f1,f2,ToDate(f3,'choose your datetime format') as f3_Date;
C = FOREACH B GENERATE f1,f2,
GetDay(f3_Date) as f3_Day,
GetMonth(f3_Date) as f3_Month,
GetYear(f3_Date) as f3_Year,
GetHour(f3_Date) as f3_Hour,
GetMinute(f3_Date) as f3_Minute,
GetSecond(f3_Date) as f3_Second;
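One caveat for the stated goal of grouping by day of week: GetDay returns the day of the month, and there is no day-of-week extractor among the built-ins listed above. A sketch of one workaround (untested; assumes ToString accepts a Joda-Time pattern, where 'EEEE' is the full weekday name):
D = FOREACH B GENERATE f1, f2, ToString(f3_Date, 'EEEE') AS f3_DayOfWeek;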

Hadoop, how to normalize multiple columns data?

I have a .txt file like this
1036177 19459.7356 17380.3761 18084.1440
1045709 19674.2457 17694.8674 18700.0120
1140443 19772.0645 17760.0904 19456.7521
where the first column represents the key and the others are the values.
I would like to normalize (min-max) each column and after that sum up the columns.
Can someone give me some advice on how to do that in MapReduce?
From an algorithmic perspective you'll need to:
Mapper
Parse / tokenize each input line by its delimiter (space?)
Emit a single constant key (a Text object will do) so that every row ends up in the same reducer group
Either create a custom value class to encapsulate the other fields or use an ArrayWritable wrapper
Output this key / value pair from your Mapper
Reducer
All values arrive grouped under that single key, so you just need to process each input value and track the min, max and sum for each column
Finally output your result
You might want to look at using Apache Pig, which should make this task much easier (untested):
grunt> A = LOAD '/path/to/data.txt' USING PigStorage(' ')
           AS (key:chararray, fld1:float, fld2:float, fld3:float);
grunt> GRP = GROUP A ALL;
grunt> B = FOREACH GRP GENERATE MIN(A.fld1), MAX(A.fld1), SUM(A.fld1),
           MIN(A.fld2), MAX(A.fld2), SUM(A.fld2),
           MIN(A.fld3), MAX(A.fld3), SUM(A.fld3);
grunt> STORE B INTO '/path/to/output' USING PigStorage('\t', '-schema');
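To finish the job (min-max scale each value and then sum the scaled columns per row), the global stats can be projected back onto every row as scalars. A sketch under the same schema (untested; assumes Pig 0.8+ scalar projection and that no column is constant, so max - min is never zero):
grunt> STATS = FOREACH GRP GENERATE
           MIN(A.fld1) AS min1, MAX(A.fld1) AS max1,
           MIN(A.fld2) AS min2, MAX(A.fld2) AS max2,
           MIN(A.fld3) AS min3, MAX(A.fld3) AS max3;
grunt> NORM = FOREACH A GENERATE key,
           ((fld1 - STATS.min1) / (STATS.max1 - STATS.min1)) +
           ((fld2 - STATS.min2) / (STATS.max2 - STATS.min2)) +
           ((fld3 - STATS.min3) / (STATS.max3 - STATS.min3)) AS total;
grunt> DUMP NORM;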

How can I compare 2 dates in a Where statement while using NoRM for MongoDB on C#?

I have a table in Mongo. One of the fields is a DateTime. I would like to be able to get all of the records that are only for a single day (i.e. 9/3/2011).
If I do something like this:
var list = (from c in col
where c.PublishDate == DateTime.Now
select c).ToList();
Then it doesn't work because it is using the time in the comparison. Normally I would just compare the ToShortDateString() but NoRM does not allow me to use this.
Thoughts?
David
The best way to handle this is normally to calculate the start datetime and end datetime for the date in question and then query for values in that range.
var start = DateTime.Now.Date;
var end = start.AddDays(1);
...
But you'd also be well advised to switch to the official C# driver now. You should also use UTC datetimes in your database (but that gets more complicated).
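For completeness, a minimal sketch of the full range query (untested; assumes NoRM's LINQ provider translates >= and < comparisons on DateTime fields):
var start = DateTime.Now.Date;   // midnight at the start of today
var end = start.AddDays(1);      // midnight at the start of tomorrow
var list = (from c in col
            where c.PublishDate >= start && c.PublishDate < end
            select c).ToList();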
