Apache Pig: Calculate number of days between a date and current date - hadoop

I have a list of movies in the form (#,title,year,rating,duration):
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333
7,Muriel's Wedding,1994,3.5,6323
8,Mother's Boys,1994,3.4,5733
9,Nosferatu: Original Version,1929,3.5,5651
10,Nick of Time,1995,3.4,5333
...
Each tuple has a year, which I need to treat as 1 January of that year.
I need to calculate the number of days between that date and the current date.
My approach:
movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id,name,year,rating,duration);
daysbetween_data = foreach movies generate DaysBetween(ToDate(year,'<WHAT FORMAT TO GIVE HERE>'), ToDate(<CURRENT DATE HERE>));
Any idea how to do this?

Load the year as a chararray field, use CONCAT to prepend '01-01-' so that the value matches the format 'MM-dd-yyyy', and then apply ToDate and DaysBetween. DaysBetween subtracts its second argument from its first, so pass the current time first to get a positive count.
movies = LOAD 'movies_data.csv' USING PigStorage(',') as (id:int,name:chararray,year:chararray,rating:double,duration:int);
daysbetween_data = foreach movies generate DaysBetween(CurrentTime(),ToDate(CONCAT('01-01-',year),'MM-dd-yyyy'));
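As a quick cross-check of the arithmetic outside Pig, the same calculation can be sketched in Python. This is only an illustration: the function name days_since_jan_first and its optional today parameter are assumptions made for the example, not part of the answer above.

```python
from datetime import date


def days_since_jan_first(year, today=None):
    """Days elapsed from 1 January of `year` up to `today`.

    `today` defaults to the current date; it is a parameter only so the
    result can be reproduced with a fixed reference date.
    """
    if today is None:
        today = date.today()
    # int() lets the year arrive as text, as it does when loaded as chararray
    return (today - date(int(year), 1, 1)).days


# Fixed reference date so the output does not depend on when this runs
print(days_since_jan_first(1993, today=date(1994, 1, 1)))  # 365
```

Subtracting the earlier date from the later one gives a positive count, which is the ordering to aim for in the Pig expression as well.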

Related

Epoch time difference in Pig

I have 3 columns which contain start_time, end_time and tags. Times are in epoch format, as shown in the example below. I want to find the rows where the difference between the two times is within 1 hour.
Example:
Start_time End_Time Tags
1235000081 1235000501 "Answered"
1235000081 1235000551 "Answered"
I need to fetch the tags column if the time difference is less than an hour.
I want to do it in Pig. Can anyone kindly help?
input.txt
1235000081 1235000501 Answered
1235000081 1235000551 Answered
pig script
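A = Load '/home/kishore/input.txt' as (col1:long, col2:long, col3:chararray);
-- ToDate(long) expects milliseconds, so scale the epoch seconds by 1000
B = Foreach A generate ToDate(col1*1000) as startdate,ToDate(col2*1000) as enddate,col3;
-- compare elapsed time rather than hour-of-day values
C = Filter B by MinutesBetween(enddate,startdate) < 60;
Dump C;
You can adjust the filter condition with operators such as >, < or ==.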
If you want to keep the original timestamp fields alongside the converted dates, the solution is the following:
data = LOAD '/path/to/your/input' as (Start_Time:long, End_Time:long, Tags:chararray);
data_proc = FOREACH data GENERATE *, ToDate(Start_Time*1000) as Start_Date,ToDate(End_Time*1000) as End_Date;
result = FILTER data_proc BY MinutesBetween(End_Date,Start_Date) < 60;
Dump result;
The crucial point is that Pig's ToDate UDF expects a timestamp with millisecond precision, so you simply have to multiply your epoch-second fields by 1000 before passing them to it.
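The two points made above (compare elapsed time, and mind seconds versus milliseconds) can be checked with a small Python sketch. It is an illustration under stated assumptions: the helper name tags_within_one_hour and the constant ONE_HOUR_SECONDS are invented for the example, and the rows are the two sample rows from the question.

```python
from datetime import datetime, timezone

ONE_HOUR_SECONDS = 3600


def tags_within_one_hour(rows):
    """Return the tag of every row whose end is less than one hour after its start."""
    return [tag for start, end, tag in rows if 0 <= end - start < ONE_HOUR_SECONDS]


# The two sample rows from the question (epoch seconds)
rows = [
    (1235000081, 1235000501, "Answered"),
    (1235000081, 1235000551, "Answered"),
]
print(tags_within_one_hour(rows))

# Reading epoch seconds as if they were milliseconds lands in January 1970,
# which is why the values are scaled by 1000 before a millisecond-based conversion.
as_seconds = datetime.fromtimestamp(1235000081, tz=timezone.utc)
as_millis = datetime.fromtimestamp(1235000081 / 1000, tz=timezone.utc)
print(as_seconds.year, as_millis.year)  # 2009 1970
```

Since both columns are plain epoch seconds, subtracting them directly already gives the elapsed time; converting to datetime is only needed when calendar fields are required.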

Pig case statement for finding no. of events in a specific period of time

There is a dataset, similar to a movie database, containing movies with their rating, duration and year of release.
How do you find the number of movies released during a 10-year span?
The dataset is comma-separated.
Movie = load '/home/movie/movies.txt' using PigStorage(',') as (movieid:int, moviename:chararray, yearofrelease:int, ratingofmovie:float, moviedurationinsec:float);
movies_released_between_2000_2010 = filter Movie by yearofrelease > 2000 and yearofrelease < 2010;
result = foreach movies_released_between_2000_2010 generate moviename,yearofrelease;
dump result;
year_count = FOREACH Movie GENERATE (case when yearofrelease > 2000 and yearofrelease < 2010 then 1 else 0 end) as year_flag,moviename;
year_grp = GROUP year_count BY year_flag;
movie_count_out = FOREACH year_grp GENERATE group,COUNT(year_count);
The example above should give you an understanding of the solution, though there might be some syntax errors. If you need to group by decade, you can apply a substring function to the year to derive the range.
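The decade grouping mentioned above can be sketched in Python to make the idea concrete. This is an illustrative sketch only: the function name movies_per_decade is an assumption, integer division is used here instead of the substring approach, and the sample years are borrowed from the movie list in the first question on this page.

```python
from collections import Counter


def movies_per_decade(years):
    """Count releases per decade, keyed by the first year of the decade."""
    # 1993 // 10 * 10 == 1990, so every year maps to the start of its decade
    return Counter((year // 10) * 10 for year in years)


# Release years taken from the sample movie rows shown earlier on this page
sample_years = [1993, 1932, 1921, 1991, 1963, 1985, 1994, 1994, 1929, 1995]
counts = movies_per_decade(sample_years)
for decade in sorted(counts):
    print(decade, counts[decade])
```

The same bucketing expression can serve as a grouping key, which avoids writing one case branch per decade.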

Pig Help: Splitting a Field into Multiple Fields

Hi I am playing around with Pig for the first time and am curious how to deal with splitting up a field into multiple other fields.
I have a bag, A, like the one below:
grunt> Dump A;
(text, text, Mon Mar 07 12:00:00 CDT 2016)
What I'd like to do is split the Date-Time field into multiple fields so that I can explore the distribution of the data set and do group bys on the Day of Week, Month, Year, etc.
I have been looking at tokenize but am unsure this meets my needs as I need/want to have field names added to the bag or create a nested bag.
Any ideas?
Assuming that the value is already of datatype datetime, you can use the following functions to extract the individual elements (see the DateTime functions section of the Pig built-in function reference).
B = FOREACH A GENERATE f1,f2,
GetDay(f3) as f3_Day,
GetMonth(f3) as f3_Month,
GetYear(f3) as f3_Year,
GetHour(f3) as f3_Hour,
GetMinute(f3) as f3_Minute,
GetSecond(f3) as f3_Second;
If the datatype is chararray then use the ToDate() function to convert it to datetime and extract the date parts.
B = FOREACH A GENERATE f1,f2,ToDate(f3,'choose your datetime format') as f3_Date;
C = FOREACH B GENERATE f1,f2,
GetDay(f3_Date) as f3_Day,
GetMonth(f3_Date) as f3_Month,
GetYear(f3_Date) as f3_Year,
GetHour(f3_Date) as f3_Hour,
GetMinute(f3_Date) as f3_Minute,
GetSecond(f3_Date) as f3_Second;
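To show what the resulting named fields would look like for the sample value in the question, here is a small Python sketch. It rests on assumptions: the function name split_date_fields and the dictionary keys are invented for the example, English day and month abbreviations are assumed, and the time zone abbreviation is kept as plain text rather than interpreted.

```python
from datetime import datetime


def split_date_fields(text):
    """Split a value like 'Mon Mar 07 12:00:00 CDT 2016' into named parts."""
    parts = text.split()  # ['Mon', 'Mar', '07', '12:00:00', 'CDT', '2016']
    zone = parts.pop(4)   # set the zone abbreviation aside; it is not parsed
    parsed = datetime.strptime(" ".join(parts), "%a %b %d %H:%M:%S %Y")
    return {
        "day_of_week": parsed.strftime("%A"),
        "day": parsed.day,
        "month": parsed.month,
        "year": parsed.year,
        "hour": parsed.hour,
        "minute": parsed.minute,
        "second": parsed.second,
        "zone": zone,
    }


fields = split_date_fields("Mon Mar 07 12:00:00 CDT 2016")
print(fields["day_of_week"], fields["month"], fields["year"])
```

Each key corresponds to one of the columns you would group on, such as day of week, month or year.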

Is there a way to do datetime addition in pig?

What I would like to do in Pig is something that is very common in SQL.
I have a date field of the form yyyy-mm-dd hh:mm:ss and another field that contains an integer representing a number of hours. Is there an easy way to add the integer to the datetime field so that the result follows normal clock arithmetic?
Example: the date is 2013-06-01 23:12:12.
Then I add 2 hours.
I should get 2013-06-02 01:12:12.
With the latest release of Pig (0.11.0) this should be possible, but the number of hours must be given in ISO 8601 duration format (for example PT2H for 2 hours). Pig provides the AddDuration function, which adds a duration to a DateTime object. You can find more about AddDuration in the Pig documentation.
Edit :
Yes, you can add negative hours. I tried this on my Ubuntu box :
Input :
2009-01-07T01:07:01.000Z,PT1S
2008-02-06T02:06:02.000Z,PT1M
2007-03-05T03:05:03.000Z,PT-1H
Query :
grunt> a = LOAD '/pig.txt' USING PigStorage(',') AS (dt:datetime, dr:chararray);
grunt> b = FOREACH a GENERATE AddDuration(dt, dr) AS dt1;
grunt> dump b;
Output :
(2009-01-07T01:07:02.000Z)
(2008-02-06T02:07:02.000Z)
(2007-03-05T02:05:03.000Z)
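The expected clock arithmetic, including the rollover past midnight from the question, can be verified with a short Python sketch. The helper name add_hours and its fmt parameter are assumptions made for the illustration; it takes a plain integer number of hours rather than an ISO 8601 duration string.

```python
from datetime import datetime, timedelta


def add_hours(timestamp, hours, fmt="%Y-%m-%d %H:%M:%S"):
    """Add a (possibly negative) whole number of hours to a formatted timestamp."""
    shifted = datetime.strptime(timestamp, fmt) + timedelta(hours=hours)
    return shifted.strftime(fmt)


# The example from the question: adding 2 hours crosses midnight
print(add_hours("2013-06-01 23:12:12", 2))   # 2013-06-02 01:12:12
# The negative-duration row from the answer (PT-1H)
print(add_hours("2007-03-05 03:05:03", -1))  # 2007-03-05 02:05:03
```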

Get the most recent and the one before the most recent item

I have a table with many anniversaries: Date + Name.
I want to display the next anniversary and the one after it with LINQ.
How can I build the query?
I use EF.
Thanks
John
Just order by date and then use the .Take(n) functionality
Example with a list of some objects assuming you want to order by Date then Name:
List<Anniversaries> annivDates = GetAnnivDates();
List<Anniversaries> recentAnniv = annivDates.OrderBy(d => d.Date).ThenBy(d => d.Name).Take(2).ToList();
If the anniversaries are stored in regular DateTime structs, they may have the 'wrong' year set (e.g. the wedding or birth year). I suggest writing a function which calculates the next date for an anniversary (based on the current day), like:
static DateTime CalcNext(DateTime anniversary) {
    // Move the anniversary into the current year
    DateTime newDate = new DateTime(DateTime.Now.Year, anniversary.Month, anniversary.Day);
    // If it has already passed this year, use next year's date
    if (newDate < DateTime.Now.Date)
        newDate = newDate.AddYears(1);
    return newDate;
}
Then you proceed with sorting the dates and taking the first two values, as described in the other postings:
(from e in anniversaries orderby CalcNext(e.Date) select e).Take(2)
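The "next occurrence, then sort, then take two" idea can be sketched independently of LINQ. The Python below is an illustration under stated assumptions: the function names, the (date, name) tuple layout and the sample data are all hypothetical, and mapping 29 February to 28 February in non-leap years is a choice made for this sketch rather than something the answers specify.

```python
from datetime import date


def _in_year(anniversary, year):
    """Move an anniversary into `year`; 29 Feb falls back to 28 Feb (an assumption)."""
    try:
        return anniversary.replace(year=year)
    except ValueError:
        return date(year, 2, 28)


def next_occurrence(anniversary, today):
    """Next date on or after `today` on which the anniversary falls."""
    candidate = _in_year(anniversary, today.year)
    if candidate < today:
        candidate = _in_year(anniversary, today.year + 1)
    return candidate


def next_two(anniversaries, today):
    """The two upcoming (date, name) entries, ordered by their next occurrence."""
    return sorted(anniversaries, key=lambda item: next_occurrence(item[0], today))[:2]


# Hypothetical sample data
sample = [
    (date(1990, 12, 25), "Christmas wedding"),
    (date(2001, 3, 1), "First day at work"),
    (date(1985, 7, 4), "Birthday"),
]
print([name for _, name in next_two(sample, today=date(2024, 6, 1))])
```

Passing today in explicitly keeps the result reproducible; in real use it would be the current date.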
