Storing time in Pig - hadoop

I have data like this:
Start_time End_time
12:10:30 13:10:00
I want to store this in Pig and calculate the elapsed time.
How can I do this in Pig?
I simply wrote Start_time - End_time, but the result is blank.

The query will be similar to this:
time = LOAD '/user/name/input_folder/file_name' USING PigStorage() AS (st:chararray, et:chararray);
A = FOREACH time GENERATE st, et, GetHour(ToDate(st,'HH:mm:ss')) AS hour1, GetHour(ToDate(et,'HH:mm:ss')) AS hour2;
B = FOREACH A GENERATE (hour2 - hour1) AS time_elapsed;
DUMP B;
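GetHour compares only the hour components, so minutes and seconds are lost. If you need the full elapsed time, a minimal sketch using the builtin SecondsBetween (same two-column load as above):
-- elapsed time in whole seconds between the two parsed times
elapsed = FOREACH time GENERATE st, et, SecondsBetween(ToDate(et,'HH:mm:ss'), ToDate(st,'HH:mm:ss')) AS seconds_elapsed;
DUMP elapsed;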

Related

Compare tuples on basis of a field in pig

(ABC,****,tool1,12)
(ABC,****,tool1,10)
(ABC,****,tool1,13)
(ABC,****,tool2,101)
(ABC,****,tool3,11)
Above is the input data for my dataset in Pig.
The schema is: Username, ip, tool, duration.
I want to add up the durations for the same tool.
Output
(ABC,****,tool1,35)
(ABC,****,tool2,101)
(ABC,****,tool3,11)
Use GROUP BY on (Username, ip, tool) and SUM the duration.
A = LOAD 'data.csv' USING PigStorage(',') AS (Username:chararray,ip:chararray,tool:chararray,duration:int);
B = GROUP A BY (Username,ip,tool);
C = FOREACH B GENERATE FLATTEN(group) AS (Username,ip,tool), SUM(A.duration) AS duration;
DUMP C;
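With the sample rows above, DUMP C should yield the desired output:
(ABC,****,tool1,35)
(ABC,****,tool2,101)
(ABC,****,tool3,11)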

How to find average of a column and average of subtraction of two columns in Pig?

I am new to scripting in Pig Latin. I am stuck writing a Pig script that finds the average of a column and also the average of the difference between two columns.
I am reading the data from a CSV file with starttime and endtime columns, as below:
"starttime","endtime",
"23","46",
"32","49",
"54","59"
The code that I have tried so far is below:
file = LOAD '/project/timestamp.csv' USING PigStorage(',') AS (st:int, et:int);
start_ts = FOREACH file GENERATE st;
grouped = GROUP start_ts BY st;
ILLUSTRATE grouped;
The ILLUSTRATE output I am getting is below, and I am not able to apply the AVG function on it.
-------------------------------------------------------------------------------------
| grouped | group:int | file:bag{:tuple(st:int,et:int)}                             |
-------------------------------------------------------------------------------------
|         |           | {(, ), (, )}                                                |
-------------------------------------------------------------------------------------
Can anybody please help me get the average of starttime, which would be (23 + 32 + 54) / 3?
Some ideas on how to compute (endtime - starttime) / number of records (i.e. 3 in this case) would also be a great help to get me started.
Thanks.
First ensure that you are loading the data correctly. It looks like you have double quotes (") around your data. Load the fields as chararray, strip the double quotes with REPLACE, then cast to int, and finally apply the AVG function to starttime. For the average of endtime - starttime, generate the per-row difference first and then apply AVG to it.
A = LOAD '/project/timestamp.csv' USING PigStorage(',') AS (st:chararray, et:chararray);
B = FOREACH A GENERATE (int)REPLACE(st,'\\"','') AS st, (int)REPLACE(et,'\\"','') AS et;
C = FOREACH B GENERATE st, et, (et - st) AS diff;
D = GROUP C ALL;
E = FOREACH D GENERATE AVG(C.st), AVG(C.diff);
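With the three sample rows, DUMP E should print roughly (36.333333333333336,15.0): the starttime average is (23 + 32 + 54) / 3, and the per-row differences 23, 17 and 5 average to 15.0.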
Try this:
file = LOAD '/project/timestamp.csv' USING PigStorage(',') AS (st:int, et:int);
grouped = GROUP file BY 1;
AVG = FOREACH grouped GENERATE AVG(file.st);
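Grouping by the constant 1 puts every row into a single group; GROUP ... ALL is the idiomatic equivalent (note that, as in the first answer, the quoted values must be cleaned before they will cast to int):
grouped = GROUP file ALL;
avg_st = FOREACH grouped GENERATE AVG(file.st);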
Thanks to inquisitive_mind. My answer is largely based on his, with a little tweak.
This is only for the average of one column.
file = LOAD '/project/timestamp.csv' Using PigStorage(',') AS (st:chararray, et:chararray);
cols = FOREACH file GENERATE (int)REPLACE(st, '"', '') as st, (int)REPLACE(et, '"', '') as et;
grp_cols = GROUP cols all;
avg = FOREACH grp_cols GENERATE AVG(cols.st);
DUMP avg;

How to Perform Roundup of Date in Pig

I want to write a filter condition in Pig that keeps only the data belonging to the current date, the current hour, or the current week.
My input data looks like this:
2016-01-05 16:59:50,text11
2016-01-05 17:59:50,text11
My load and filter so far are:
A = LOAD '/hoursbetween-poc/input/' using PigStorage(',') as (time:chararray,colval:chararray) ;
G = FILTER A BY HoursBetween(CurrentTime(),ToDate(time, 'yyyy-MM-dd HH:mm:ss'))<1;
dump G;
But this subtracts 60 minutes from the current time, whereas I want all records belonging to the current hour.
For example, if the current time is 6:30, the code filters out everything before 5:30; I want it to round down and filter out only records before 5:00.
How do I achieve this in Pig?
Input:
2016-01-05 10:00:50,text1
2016-01-05 10:59:50,text2
2016-01-05 11:10:50,text3
2016-01-05 09:00:50,text4
Pig Script:
A = LOAD 'a.csv' USING PigStorage(',') AS (time:chararray,colval:chararray) ;
B = FOREACH A GENERATE GetHour(CurrentTime()) AS cur_hr, GetHour(ToDate(time, 'yyyy-MM-dd HH:mm:ss')) AS act_hr, time, colval;
C = FILTER B BY (cur_hr - act_hr) <= 1;
DUMP C;
Output:
(11,10,2016-01-05 10:00:50,text1)
(11,10,2016-01-05 10:59:50,text2)
(11,11,2016-01-05 11:10:50,text3)
The script was executed at 2016-01-05 11:40; as seen in the output, it selected records from 10:00 onwards.
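Comparing bare hour numbers breaks across midnight, and the <= 1 condition also admits records from the previous hour. If you want only rows from the current clock hour, a minimal sketch is to compare formatted date-and-hour strings with the builtin ToString:
A = LOAD 'a.csv' USING PigStorage(',') AS (time:chararray, colval:chararray);
-- keep only rows whose date-and-hour matches the current date-and-hour
B = FILTER A BY ToString(ToDate(time, 'yyyy-MM-dd HH:mm:ss'), 'yyyy-MM-dd HH') == ToString(CurrentTime(), 'yyyy-MM-dd HH');
DUMP B;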

How to get serial number in Pig Script based on column?

Currently my data comes in as shown below, but I want a rank that resets whenever the Product_id field changes. My script is below. I have tried the RANK and DENSE RANK operators, but still get no desired output.
trans_c1 = LOAD '/mypath/data_file.csv' using PigStorage(',') as (date,Product_id);
(date,Product_id)
(2015-01-13T18:00:40.622+05:30,B00XT)
(2015-01-13T18:00:40.622+05:30,B00XT)
(2015-01-13T18:00:40.622+05:30,B00XT)
(2015-01-13T18:00:40.622+05:30,B00XT)
(2015-01-13T18:00:40.622+05:30,B00OZ)
(2015-01-13T18:00:40.622+05:30,B00OZ)
(2015-01-13T18:00:40.622+05:30,B00OZ)
(2015-01-13T18:00:40.622+05:30,B00VB)
(2015-01-13T18:00:40.622+05:30,B00VB)
(2015-01-13T18:00:40.622+05:30,B00VB)
(2015-01-13T18:00:40.622+05:30,B00VB)
The final output should look like this, where the rank sequence changes with each change in Product_id and resets to 1. Is it possible to do that in Pig?
(1,2015-01-13T18:00:40.622+05:30,B00XT)
(2,2015-01-13T18:00:40.622+05:30,B00XT)
(3,2015-01-13T18:00:40.622+05:30,B00XT)
(4,2015-01-13T18:00:40.622+05:30,B00XT)
(1,2015-01-13T18:00:40.622+05:30,B00OZ)
(2,2015-01-13T18:00:40.622+05:30,B00OZ)
(3,2015-01-13T18:00:40.622+05:30,B00OZ)
(1,2015-01-13T18:00:40.622+05:30,B00VB)
(2,2015-01-13T18:00:40.622+05:30,B00VB)
(3,2015-01-13T18:00:40.622+05:30,B00VB)
(4,2015-01-13T18:00:40.622+05:30,B00VB)
This can be solved with the Piggybank functions Stitch and Over, or with DataFu's Enumerate function.
Script using Piggybank functions:
REGISTER <path to piggybank folder>/piggybank.jar;
DEFINE Stitch org.apache.pig.piggybank.evaluation.Stitch;
DEFINE Over org.apache.pig.piggybank.evaluation.Over('int');
input_data = LOAD 'data_file.csv' USING PigStorage(',') AS (date:chararray, pid:chararray);
group_data = GROUP input_data BY pid;
rank_grouped_data = FOREACH group_data GENERATE FLATTEN(Stitch(input_data, Over(input_data, 'row_number')));
display_data = FOREACH rank_grouped_data GENERATE stitched::result AS rank_number, stitched::date AS date, stitched::pid AS pid;
DUMP display_data;
Script using dataFu's Enumerate function:
REGISTER <path to pig libraries>/datafu-1.2.0.jar;
DEFINE Enumerate datafu.pig.bags.Enumerate('1');
input_data = LOAD 'data_file.csv' USING PigStorage(',') AS (date:chararray, pid:chararray);
group_data = GROUP input_data BY pid;
data = FOREACH group_data GENERATE FLATTEN(Enumerate(input_data));
display_data = FOREACH data GENERATE $2, $0, $1; -- Enumerate appends the index as the last field ($2)
DUMP display_data;
The DataFu jar can be downloaded from the Maven repository: http://search.maven.org/#search%7Cga%7C1%7Cg%3a%22com.linkedin.datafu%22
Output (note that grouping changes the order of the product ids relative to the input):
(1,2015-01-13T18:00:40.622+05:30,B00OZ)
(2,2015-01-13T18:00:40.622+05:30,B00OZ)
(3,2015-01-13T18:00:40.622+05:30,B00OZ)
(1,2015-01-13T18:00:40.622+05:30,B00VB)
(2,2015-01-13T18:00:40.622+05:30,B00VB)
(3,2015-01-13T18:00:40.622+05:30,B00VB)
(4,2015-01-13T18:00:40.622+05:30,B00VB)
(1,2015-01-13T18:00:40.622+05:30,B00XT)
(2,2015-01-13T18:00:40.622+05:30,B00XT)
(3,2015-01-13T18:00:40.622+05:30,B00XT)
(4,2015-01-13T18:00:40.622+05:30,B00XT)
Ref:
Implementing row number function in apache pig
Usage of Apache Pig rank function

Equivalent of linux 'diff' in Apache Pig

I want to be able to do a standard diff on two large files. I've got something that will work but it's not nearly as quick as diff on the command line.
A = load 'A' as (line);
B = load 'B' as (line);
JOINED = join A by line full outer, B by line;
DIFF = FILTER JOINED by A::line is null or B::line is null;
DIFF2 = FOREACH DIFF GENERATE (A::line is null?B::line : A::line), (A::line is null?'REMOVED':'ADDED');
STORE DIFF2 into 'diff';
Anyone got any better ways to do this?
I use the following two approaches. (My JOIN approach is very similar to yours, but it does not replicate the behavior of diff on duplicated lines.) As this was asked some time ago, perhaps you were using only one reducer; Pig got an algorithm to adjust the number of reducers in 0.8.
Both approaches I use are within a few percent of each other in performance, but they do not treat duplicates the same:
The JOIN approach collapses duplicates, so if one file has more duplicates than the other, it will not output the extra duplicates.
The UNION approach works like the Unix diff(1) tool and will return the correct number of extra duplicates for the correct file.
Unlike the Unix diff(1) tool, order is not important (effectively the JOIN approach performs sort -u <foo.txt> | diff, while UNION performs sort <foo> | diff).
If you have an incredible (~thousands) number of duplicate lines, things will slow down due to the joins (if your use case allows, perform a DISTINCT on the raw data first).
If your lines are very long (e.g. >1KB in size), it is recommended to use the DataFu MD5 UDF to difference over hashes only, then JOIN with your original files to get the original rows back before outputting; see the sketch after this list.
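A minimal sketch of that hashing approach, assuming DataFu's datafu.pig.hash.MD5 UDF (the jar path and file names are placeholders):
REGISTER <path to pig libraries>/datafu-1.2.0.jar;
DEFINE MD5 datafu.pig.hash.MD5();
a = LOAD 'a.txt' AS (line:chararray);
b = LOAD 'b.txt' AS (line:chararray);
-- difference over the (much smaller) hashes only
a_h = FOREACH a GENERATE MD5(line) AS h;
b_h = FOREACH b GENERATE MD5(line) AS h;
joined = JOIN a_h BY h FULL OUTER, b_h BY h;
a_only_h = FOREACH (FILTER joined BY b_h::h IS NULL) GENERATE a_h::h AS h;
-- join back against the original file to recover the full rows
a_keyed = FOREACH a GENERATE MD5(line) AS h, line;
first_only = FOREACH (JOIN a_only_h BY h, a_keyed BY h) GENERATE a_keyed::line;
STORE first_only INTO 'first_only' USING PigStorage();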
Using JOIN:
SET job.name 'Diff(1) Via Join';
-- Erase Outputs
rmf first_only
rmf second_only
-- Process Inputs
a = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS First: chararray;
b = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Second: chararray;
-- Combine Data
combined = JOIN a BY First FULL OUTER, b BY Second;
-- Output Data
SPLIT combined INTO first_raw IF Second IS NULL,
second_raw IF First IS NULL;
first_only = FOREACH first_raw GENERATE First;
second_only = FOREACH second_raw GENERATE Second;
STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();
Using UNION:
SET job.name 'Diff(1)';
-- Erase Outputs
rmf first_only
rmf second_only
-- Process Inputs
a_raw = LOAD 'a.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Row: chararray;
b_raw = LOAD 'b.csv.lzo' USING com.twitter.elephantbird.pig.load.LzoPigStorage('\n') AS Row: chararray;
a_tagged = FOREACH a_raw GENERATE Row, (int)1 AS File;
b_tagged = FOREACH b_raw GENERATE Row, (int)2 AS File;
-- Combine Data
combined = UNION a_tagged, b_tagged;
c_group = GROUP combined BY Row;
-- Find Unique Lines
%declare NULL_BAG 'TOBAG(((chararray)\'place_holder\',(int)0))'
counts = FOREACH c_group {
    firsts  = FILTER combined BY File == 1;
    seconds = FILTER combined BY File == 2;
    -- emit the placeholder when the counts match; otherwise emit the surplus duplicates from the larger side
    GENERATE FLATTEN(
        (COUNT(firsts) - COUNT(seconds) == (long)0 ? $NULL_BAG :
            (COUNT(firsts) - COUNT(seconds) > 0 ?
                TOP((int)(COUNT(firsts) - COUNT(seconds)), 0, firsts) :
                TOP((int)(COUNT(seconds) - COUNT(firsts)), 0, seconds)))
    ) AS (Row, File);
};
-- Output Data
SPLIT counts INTO first_only_raw IF File == 1,
second_only_raw IF File == 2;
first_only = FOREACH first_only_raw GENERATE Row;
second_only = FOREACH second_only_raw GENERATE Row;
STORE first_only INTO 'first_only' USING PigStorage();
STORE second_only INTO 'second_only' USING PigStorage();
Performance
It takes roughly 10 minutes to difference 200GB (1,055,687,930 rows) of LZO-compressed input on 18 nodes.
Each approach takes only one Map/Reduce cycle.
This works out to roughly 1.8GB diffed per node per minute (not a great throughput, but on my system diff(1) seems to operate only in memory, while Hadoop leverages streaming from disk).
