I have three columns containing start_time, end_time, and tags. Times are represented in epoch format, as shown in the example below. I want to find the rows whose start and end times differ by less than an hour.
Example:
Start_Time End_Time Tags
1235000081 1235000501 "Answered"
1235000081 1235000551 "Answered"
I need to fetch the Tags column when the time difference is less than an hour.
I want to do it in Pig. Can anyone kindly help?
input.txt
1235000081 1235000501 Answered
1235000081 1235000551 Answered
Pig script
A = LOAD '/home/kishore/input.txt' AS (col1:long, col2:long, col3:chararray);
-- ToDate expects milliseconds, so multiply the epoch seconds by 1000
B = FOREACH A GENERATE ToDate(col1*1000) AS startdate, ToDate(col2*1000) AS enddate, col3;
C = FILTER B BY GetHour(enddate) - GetHour(startdate) == 1;
DUMP C;
You can filter the rows based on your condition using the comparison operators >, <, or ==.
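Alternatively, since the input times are already epoch seconds, the "less than an hour" check from the question can be done directly on the raw values, without any date conversion. A minimal sketch:
A = LOAD '/home/kishore/input.txt' AS (col1:long, col2:long, col3:chararray);
-- 3600 seconds = 1 hour; keep rows whose start and end differ by less than an hour
C = FILTER A BY (col2 - col1) < 3600;
D = FOREACH C GENERATE col3 AS tag;
DUMP D;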
In case you want to keep the date fields as timestamps, the solution is the following:
data = LOAD '/path/to/your/input' AS (Start_Time:long, End_Time:long, Tags:chararray);
-- new aliases (Start_DT, End_DT) avoid clashing with the original column names
data_proc = FOREACH data GENERATE *, ToDate(Start_Time*1000) AS Start_DT, ToDate(End_Time*1000) AS End_DT;
filtered = FILTER data_proc BY GetHour(End_DT) - GetHour(Start_DT) == 1;
DUMP filtered;
The one crucial thing is that the Pig ToDate UDF needs a timestamp with millisecond precision, so you simply have to multiply your date fields by 1000 before passing them to this UDF.
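For example, with the sample value from the question (times in UTC):
-- interpreted as milliseconds, 1235000081 would land around 1970-01-15;
-- multiplied by 1000 it becomes the intended date:
ToDate(1235000081L * 1000)  -- 2009-02-18T23:34:41.000Z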
Related
Greetings.
I have one datetime column in my table, named start_time, and a variable (process_time) that contains the number of seconds a process takes to run. How do I count the number of rows that satisfy the following condition:
start_time + process_time < present time
You can utilize the DB::raw() helper and add the SQL equivalent of what you want to achieve, using DATE_ADD and INTERVAL.
$processTime = 15; // seconds
Model::where(DB::raw("DATE_ADD(start_time, INTERVAL $processTime second)"), '<', now())->count();
Pig CASE statement for finding the number of events in a specific period of time.
There is a dataset which is like a movie database, containing movie ID, movie name, year of release, rating, and duration.
The question is: how do you find the number of movies released during a 10-year span?
The dataset is comma-separated.
Movie = LOAD '/home/movie/movies.txt' USING PigStorage(',') AS (movieid:int, moviename:chararray, yearofrelease:int, ratingofmovie:float, moviedurationinsec:float);
movies_released_between_2000_2010 = FILTER Movie BY yearofrelease > 2000 AND yearofrelease < 2010;
result = FOREACH movies_released_between_2000_2010 GENERATE moviename, yearofrelease;
DUMP result;
year_count = FOREACH Movie GENERATE (CASE WHEN yearofrelease > 2000 AND yearofrelease < 2010 THEN 1 ELSE 0 END) AS year_flag, moviename;
year_grp = GROUP year_count BY year_flag;
movie_count_out = FOREACH year_grp GENERATE group, COUNT(year_count);
The above example should give you an understanding of the solution, though there might be some syntax errors. If you need to group on the basis of decade, you can use a substring function on top of the year and get the specific range.
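For example, a minimal sketch of decade-based grouping (reusing the Movie alias from above; integer arithmetic is used instead of a substring since yearofrelease is an int):
-- 1987 -> 1980, 2004 -> 2000, etc. (integer division truncates)
by_decade = FOREACH Movie GENERATE moviename, (yearofrelease / 10) * 10 AS decade;
decade_grp = GROUP by_decade BY decade;
decade_counts = FOREACH decade_grp GENERATE group AS decade, COUNT(by_decade) AS num_movies;
DUMP decade_counts;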
I am still pretty new to PIG but I understand the basic idea of map/reduce jobs. I am trying to figure out some statistics for a user based on some simple logs. We have a utility that parses out fields from the log and I am using DataFu to figure out the variance and quartiles.
My script is as follows:
-- Assumes the DataFu jar and the custom UDF jars are registered; the paths below are placeholders
REGISTER /path/to/datafu.jar;
DEFINE VAR datafu.pig.stats.VAR();
DEFINE Quartile datafu.pig.stats.Quantile('0.25');
log = LOAD '$data' USING SieveLoader('node', 'uid', 'long_timestamp');
log_map = FILTER log BY $0 IS NOT NULL AND $0#'uid' IS NOT NULL;
--Find all users
SPLIT log_map INTO cloud IF $0#'node' MATCHES '.*mis01.*', dev OTHERWISE; -- MATCHES takes a Java regex, hence '.*' rather than '*'
--For the real cloud
cloud = FOREACH cloud GENERATE $0#'uid' AS uid, $0#'long_timestamp' AS long_timestamp:long, 'cloud' AS domain, '192.168.0.231' AS ldap_server;
dev = FOREACH dev GENERATE $0#'uid' AS uid, $0#'long_timestamp' AS long_timestamp:long, 'dev' AS domain, '10.0.0.231' AS ldap_server;
modified_logs = UNION dev, cloud;
--Calculate user times
user_times = FOREACH modified_logs GENERATE *, ToDate((long)long_timestamp) as date;
--Based on weekday/weekend
aliased_user_times = FOREACH user_times GENERATE *, GetYear(date) AS year:int, GetMonth(date) AS month:int, GetDay(date) AS day:int, GetWeekOrWeekend(date) AS day_of_week, long_timestamp % (24*60*60*1000) AS milliseconds_into_day;
--Based on actual day of week
--aliased_user_times = FOREACH user_times GENERATE *, GetYear(date) AS year:int, GetMonth(date) AS month:int, GetDay(date) AS day:int, GetDayOfWeek(date) AS day_of_week, long_timestamp % (24*60*60*1000) AS milliseconds_into_day;
user_days = GROUP aliased_user_times BY (uid, ldap_server, domain, year, month, day, day_of_week);
some_times_by_day = FOREACH user_days GENERATE FLATTEN(group) AS (uid, ldap_server, domain, year, month, day, day_of_week), MAX(aliased_user_times.milliseconds_into_day) AS max, MIN(aliased_user_times.milliseconds_into_day) AS min;
times_by_day = FOREACH some_times_by_day GENERATE *, max-min AS time_on;
times_by_day_of_week = GROUP times_by_day BY (uid, ldap_server, domain, day_of_week);
STORE times_by_day_of_week INTO '/data/times_by_day_of_week';
--New calculation, mean, var, std_d, (min, 25th quartile, 50th (aka median), 75th quartile, max)
averages = FOREACH times_by_day_of_week GENERATE FLATTEN(group) AS (uid, ldap_server, domain, day_of_week), 'USER' as type, AVG(times_by_day.min) AS start_avg, VAR(times_by_day.min) AS start_var, SQRT(VAR(times_by_day.min)) AS start_std, Quartile(times_by_day.min) AS start_quartiles;
--AVG(times_by_day.max) AS end_avg, VAR(times_by_day.max) AS end_var, SQRT(VAR(times_by_day.max)) AS end_std, Quartile(times_by_day.max) AS end_quartiles, AVG(times_by_day.time_on) AS hours_avg, VAR(times_by_day.time_on) AS hours_var, SQRT(VAR(times_by_day.time_on)) AS hours_std, Quartile(times_by_day.time_on) AS hours_quartiles ;
STORE averages INTO '/data/averages';
I've seen that other people have had problems with DataFu calculating multiple quantiles at once, so I am only trying to calculate one at a time. The custom loader loads one line at a time and passes it through a utility that converts it into a map, and a small UDF checks whether a date falls on a weekday or a weekend. (Originally we wanted statistics based on the actual day of the week, but loading enough data to get interesting quartiles was killing the map/reduce tasks.)
Using Pig 0.11
It looks like my specific problem was due to trying to calculate the min and the max in one Pig Latin statement. Splitting the work into two separate commands and then joining the results seems to have fixed my memory problem.
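A minimal sketch of that workaround (reusing the user_days grouping from the script above; the aliases mins, maxes, and joined are illustrative):
-- compute MIN and MAX in separate passes instead of one FOREACH
mins = FOREACH user_days GENERATE FLATTEN(group) AS (uid, ldap_server, domain, year, month, day, day_of_week), MIN(aliased_user_times.milliseconds_into_day) AS min;
maxes = FOREACH user_days GENERATE FLATTEN(group) AS (uid, ldap_server, domain, year, month, day, day_of_week), MAX(aliased_user_times.milliseconds_into_day) AS max;
-- join the two passes back together on the grouping keys
joined = JOIN mins BY (uid, year, month, day), maxes BY (uid, year, month, day);
times_by_day = FOREACH joined GENERATE mins::uid AS uid, mins::day_of_week AS day_of_week, mins::min AS min, maxes::max AS max, maxes::max - mins::min AS time_on;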
What I would like to do in Pig is something that is very common in SQL.
I have a date field of the form yyyy-MM-dd HH:mm:ss and another field that contains an integer representing a number of hours. Is there a way to easily add the integer to the datetime field so that we get the result we'd expect from clock math?
Example: the date is 2013-06-01 23:12:12.
Then I add 2 hours
I should get 2013-06-02 01:12:12.
With the latest release of Pig (0.11.0) it should be possible, but the amount of hours (the duration) has to be in ISO 8601 duration format. Pig 0.11 provides the AddDuration UDF, which allows us to add a Duration object to a DateTime object. You can find more about AddDuration in the Pig documentation.
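Since the question stores the hours as a plain integer, one sketch of the conversion (assuming a tab-separated input with a datetime string and an int column; all names here are illustrative) is to build the ISO 8601 duration string before calling AddDuration:
-- hypothetical input line: 2013-06-01 23:12:12<TAB>2
a = LOAD '/input.txt' USING PigStorage('\t') AS (dt_str:chararray, hrs:int);
b = FOREACH a GENERATE ToDate(dt_str, 'yyyy-MM-dd HH:mm:ss') AS dt,
    CONCAT('PT', CONCAT((chararray)hrs, 'H')) AS dr; -- 2 -> 'PT2H'
c = FOREACH b GENERATE AddDuration(dt, dr);
DUMP c;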
Edit:
Yes, you can add negative hours. I tried this on my Ubuntu box:
Input:
2009-01-07T01:07:01.000Z,PT1S
2008-02-06T02:06:02.000Z,PT1M
2007-03-05T03:05:03.000Z,PT-1H
Query:
grunt> a = LOAD '/pig.txt' USING PigStorage(',') AS (dt:datetime, dr:chararray);
grunt> b = FOREACH a GENERATE AddDuration(dt, dr) AS dt1;
grunt> dump b;
Output:
(2009-01-07T01:07:02.000Z)
(2008-02-06T02:07:02.000Z)
(2007-03-05T02:05:03.000Z)
I'm trying to develop a sample program using Pig to analyze some log files. I want to analyze the running time of different jobs. When I read in the log file of a job, I get its start time and end time, like this:
(Wed,03/20/13,01:03:37,EDT)
(Wed,03/20/13,01:05:00,EDT)
Now, to calculate the elapsed time, I need to subtract these two timestamps, but since both timestamps are in the same bag, I'm not sure how to compare them. So I'm looking for ideas on how to do this. Thanks!
Is there a unique ID for the job that appears in both log lines? Also, is there something to indicate which event is the start and which is the end?
If so, you could read the dataset twice, once for start events and once for end events, and join the two together. Then you'll have one record with both events in it.
so:
A = FOREACH logline GENERATE id, type, timestamp;
START = FILTER A BY (type == 'start');
END = FILTER A BY (type == 'end');
JOINED = JOIN START BY id, END BY id;
DIFF = FOREACH JOINED GENERATE (END::timestamp - START::timestamp) AS elapsed; -- or whatever
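Note that the timestamps in the question are strings, so before subtracting they would need to be parsed into datetimes. A sketch, assuming the date and time fields have been concatenated into a single chararray like '03/20/13 01:03:37':
-- MilliSecondsBetween is built into Pig 0.11+
PARSED = FOREACH JOINED GENERATE START::id AS id,
    ToDate(START::timestamp, 'MM/dd/yy HH:mm:ss') AS start_dt,
    ToDate(END::timestamp, 'MM/dd/yy HH:mm:ss') AS end_dt;
ELAPSED = FOREACH PARSED GENERATE id, MilliSecondsBetween(end_dt, start_dt) / 1000 AS elapsed_seconds;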