How to Perform Roundup of Date in Pig - hadoop

I want to write a FILTER condition in Pig that pulls out the data belonging to the current date, the current hour, or the current week.
My input data looks like this:
2016-01-05 16:59:50,text11
2016-01-05 17:59:50,text11
I load it like this:
A = LOAD '/hoursbetween-poc/input/' using PigStorage(',') as (time:chararray,colval:chararray) ;
G = FILTER A BY HoursBetween(CurrentTime(),ToDate(time, 'yyyy-MM-dd HH:mm:ss'))<1;
dump G;
But it subtracts 60 minutes from the current time, whereas I want to filter all records belonging to the current hour. For example, if the current time is 6:30, the code filters everything before 5:30; I want to round the cutoff to the hour and filter only what falls before 5:00. How do I achieve this in Pig?

Input:
2016-01-05 10:00:50,text1
2016-01-05 10:59:50,text2
2016-01-05 11:10:50,text3
2016-01-05 09:00:50,text4
Pig Script:
A = LOAD 'a.csv' USING PigStorage(',') AS (time:chararray,colval:chararray) ;
B = FOREACH A GENERATE GetHour(CurrentTime()) AS cur_hr, GetHour(ToDate(time, 'yyyy-MM-dd HH:mm:ss')) AS act_hr, time, colval;
C = FILTER B BY (cur_hr - act_hr) <= 1;
DUMP C;
Output:
(11,10,2016-01-05 10:00:50,text1)
(11,10,2016-01-05 10:59:50,text2)
(11,11,2016-01-05 11:10:50,text3)
The script was executed at 2016-01-05 11:40; as seen in the output, it has selected records from 10:00 onwards.
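One caveat on this approach: comparing GetHour values only works within a single day (just after midnight, cur_hr - act_hr goes negative and the filter breaks). A day-safe sketch of the same idea, assuming the same input layout: floor the current time to the top of the hour, back it up one hour, and compare full timestamps.
A = LOAD 'a.csv' USING PigStorage(',') AS (time:chararray, colval:chararray);
B = FOREACH A GENERATE ToDate(time, 'yyyy-MM-dd HH:mm:ss') AS ts, time, colval;
-- ToString(..., 'yyyy-MM-dd HH') drops minutes and seconds, so re-parsing it yields the top of the current hour;
-- SubtractDuration then backs that up by one hour ('PT1H' is an ISO-8601 duration)
C = FILTER B BY MilliSecondsBetween(ts, SubtractDuration(ToDate(ToString(CurrentTime(), 'yyyy-MM-dd HH'), 'yyyy-MM-dd HH'), 'PT1H')) >= 0;
DUMP C;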

Related

Compare tuples on basis of a field in pig

(ABC,****,tool1,12)
(ABC,****,tool1,10)
(ABC,****,tool1,13)
(ABC,****,tool2,101)
(ABC,****,tool3,11)
The above is the input data for my dataset in Pig.
The schema is: Username, ip, tool, duration.
I want to add up the durations of the same tools.
Output
(ABC,****,tool1,35)
(ABC,****,tool2,101)
(ABC,****,tool3,11)
Use GROUP BY and use SUM on the duration.
A = LOAD 'data.csv' USING PigStorage(',') AS (Username:chararray,ip:chararray,tool:chararray,duration:int);
B = GROUP A BY (Username,ip,tool);
C = FOREACH B GENERATE FLATTEN(group) AS (Username,ip,tool), SUM(A.duration) AS duration;
DUMP C;
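As a design note, FLATTEN(group) is what unpacks the composite (Username,ip,tool) grouping key back into three plain columns. Roughly what DESCRIBE would report for the two relations (a sketch, assuming the LOAD above):
B: {group: (Username: chararray,ip: chararray,tool: chararray), A: {(Username: chararray,ip: chararray,tool: chararray,duration: int)}}
C: {Username: chararray,ip: chararray,tool: chararray,duration: long}
SUM over an int column returns a long, which is why duration widens in C.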

How to find average of a column and average of subtraction of two columns in Pig?

I am new to scripting in Pig Latin. I am stuck writing a Pig script that finds the average of a column value and also the average of the differences between two columns.
I am reading the data from a csv file having starttime and endtime columns as below:
"starttime","endtime",
"23","46",
"32","49",
"54","59"
The code that I have tried so far is as below :
file = LOAD '/project/timestamp.csv' Using PigStorage(',') AS (st:int, et:int);
start_ts = FOREACH file GENERATE st;
grouped = group start_ts by st;
ILLUSTRATE grouped;
The ILLUSTRATE output I am getting is as below and I am not able to apply the AVG function.
-------------------------------------------------------------------------------------
| grouped | group:int | file:bag{:tuple(st:int,et:int)} |
-------------------------------------------------------------------------------------
|         |           | {(, ), (, )} |
-------------------------------------------------------------------------------------
Can anybody please help me get the average of starttime, which would be (23 + 32 + 54) / 3? Also, some ideas on how to code (endtime - starttime) / no. of records (i.e. 3 in this case) would be of great help to get me started. Thanks.
First ensure that you are loading the data correctly. It looks like you have double quotes (") around your data. Load the fields as chararray, strip the double quotes with REPLACE, and then cast them to int; finally apply the AVG function to starttime. For the average of endtime - starttime, subtract the two fields per row and apply AVG to the difference.
A = LOAD '/project/timestamp.csv' Using PigStorage(',') AS (st:chararray, et:chararray);
B = FOREACH A GENERATE (int)REPLACE(st,'\\"','') as st,(int)REPLACE(et,'\\"','') as et;
-- generate the per-row difference before grouping; AVG cannot subtract one bag from another
B2 = FOREACH B GENERATE st, et, (et - st) as diff;
C = GROUP B2 ALL;
D = FOREACH C GENERATE AVG(B2.st),AVG(B2.diff);
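With the sample rows above, DUMP D; should print approximately (36.33333333333333,15.0): the starttime average is (23 + 32 + 54) / 3 and the difference average is ((46-23) + (49-32) + (59-54)) / 3.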
Try this (note: if the values are quoted as in the sample, clean them first or they will load as null):
file = LOAD '/project/timestamp.csv' Using PigStorage(',') AS (st:int, et:int);
grouped = group file by 1;
AVG = foreach grouped generate AVG(file.st);
Thanks to inquisitive_mind; my answer is largely based on theirs, with a little tweak. This is only for the average of one column.
file = LOAD '/project/timestamp.csv' Using PigStorage(',') AS (st:chararray, et:chararray);
cols = FOREACH file GENERATE (int)REPLACE(st, '"', '') as st, (int)REPLACE(et, '"', '') as et;
grp_cols = GROUP cols all;
avg = FOREACH grp_cols GENERATE AVG(cols.st);
DUMP avg;
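Since the question also asked for the average of (endtime - starttime), the same per-row trick extends this script; a minimal sketch reusing the relations above (cols2, grp_all, avgs, and diff are new, hypothetical names):
cols2 = FOREACH cols GENERATE st, et, (et - st) AS diff;
grp_all = GROUP cols2 ALL;
avgs = FOREACH grp_all GENERATE AVG(cols2.st) AS avg_start, AVG(cols2.diff) AS avg_diff;
DUMP avgs;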

PIG- Aggregations based on multiple columns

My input data set has 3 columns, and the schema looks like this:
ActivityDate, EventId, EventDate
Now, using Pig, I need to derive multiple variables like the below in one output file:
1) All Event Ids where ActivityDate >= EventDate - 30 days
2) All Event Ids where ActivityDate >= EventDate - 60 days
3) All Event Ids where ActivityDate >= EventDate - 90 days
I have more than 30 variables like this. If it were just one variable, we could use a simple FILTER to select the data.
I am thinking about a UDF implementation which takes a bag as input and returns the count of Event IDs matching the above criteria for each parameter.
What is the best way to aggregate the data on multiple columns in pig ?
I would suggest creating another file with all of your thresholds and cross-joining it with your data.
So you would have a file containing:
30
60
90
and so on.
Read it like this:
grouping = load 'grouping.txt' using PigStorage(',') as (groups:double);
Then do:
data_with_grouping = cross data, grouping;
Then have this binary condition:
data_with_binary_condition = foreach data_with_grouping generate ActivityDate, EventId, EventDate, groups, (ActivityDate >= EventDate - groups ? 1 : 0) as binary_condition;
Now you will have one column with the threshold and one column with a binary variable that tells you whether the ID satisfies the condition.
You can filter out all of the zeros from binary_condition and then group on the groups column:
data_with_binary_condition_filtered = filter data_with_binary_condition by (binary_condition != 0);
grouped_by_threshold = group data_with_binary_condition_filtered by groups;
-- note the bag inside grouped_by_threshold is named after the filtered relation
count_of_IDS = foreach grouped_by_threshold generate group, COUNT(data_with_binary_condition_filtered.EventId);
I hope this works. Obviously, I didn't debug it for you since I don't have your files.
This code will take a tad more time to run, but it will produce the output you need without a UDF.
If I understand your question correctly, you want to divide the difference between EventDate and ActivityDate into 30-day blocks (e.g. 1 to 30, 31 to 60, 61 to 90, and so on) and then count the frequency of each block.
In that case, I would just rearrange the above equation to create a 'range' variable, as below:
-- assuming input contains 3 columns: ActivityDate, EventId, EventDate
-- divide the difference between ED and AD by 30 and cast to int, so that each 30-day block is represented by one integer
input1 = FOREACH input GENERATE (int)((EventDate - ActivityDate) / 30) as range;
output1 = GROUP input1 BY range;
output2 = FOREACH output1 GENERATE group AS range, COUNT(input1) as count;
Hope this helps.
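For example, an event whose EventDate is 45 days after its ActivityDate falls into block (int)(45 / 30) = 1, i.e. the 31-to-60-day bucket.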

Storing time in Pig

I have data like this:
Start_time End_time
12:10:30 13:10:00
I want to store this in Pig and calculate the elapsed time.
How can I do this in Pig?
I simply wrote Start_time - End_time, but the result is blank.
The query will be similar to this:
time = LOAD '/user/name/input_folder/file_name' USING PigStorage() AS (sd:chararray, ed:chararray, t1:chararray, t2:chararray);
A = FOREACH time GENERATE $0, $1, GetHour(ToDate(t1,'HH:mm:ss')) as hour1, GetHour(ToDate(t2,'HH:mm:ss')) as hour2;
B = FOREACH A GENERATE ($3 - $2) as time_elapsed;
dump B;
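GetHour only captures whole hours, so minutes and seconds are lost. If the file really has just the two whitespace-separated time columns shown in the question, a minimal sketch using Pig's SecondsBetween gives the exact elapsed time (the aliases and column names here are assumptions):
times = LOAD '/user/name/input_folder/file_name' USING PigStorage() AS (start_t:chararray, end_t:chararray);
-- parse both clock times and take the difference in seconds
elapsed = FOREACH times GENERATE start_t, end_t, SecondsBetween(ToDate(end_t, 'HH:mm:ss'), ToDate(start_t, 'HH:mm:ss')) AS elapsed_seconds;
DUMP elapsed;
On the sample row this yields (12:10:30,13:10:00,3570), i.e. 59 minutes 30 seconds.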

Calculate Average using PIG

I am new to Pig and want to calculate the average of my one-column data, which looks like this:
0
10.1
20.1
30
40
50
60
70
80.1
I wrote this Pig script:
dividends = load 'myfile.txt' as (A);
dump dividends;
grouped = group dividends by A;
avg = foreach grouped generate AVG(grouped.A);
dump avg;
It parses data as
(0)
(10.1)
(20.1)
(30)
(40)
(50)
(60)
(70)
(80.1)
but it gives this error for the average:
2013-03-04 15:10:58,289 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<file try.pig, line 4, column 41> Invalid scalar projection: grouped
Details at logfile: /Users/PreetiGupta/Documents/CMPS290S/project/pig_1362438645642.log
Any idea?
The AVG built-in function takes a bag as input. In your GROUP statement, you are currently grouping elements by the value of A, but what you really want to do is group all the elements into one bag.
Pig's GROUP ALL is what you want to use:
dividends = load 'myfile.txt' as (A);
dump dividends;
grouped = group dividends all;
avg = foreach grouped generate AVG(dividends.A);
dump avg;
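With the sample numbers above, the final dump should print a single tuple of approximately (40.03333333333333), i.e. (0 + 10.1 + 20.1 + 30 + 40 + 50 + 60 + 70 + 80.1) / 9.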
The below will also work for calculating the average:
dividends = load 'myfile.txt' as (A);
grouped = GROUP dividends all;
avg = foreach grouped generate AVG(dividends);
dump avg;
You have to use the original relation name instead of the group alias: in the FOREACH line, use AVG(dividends.A) instead of AVG(grouped.A). (Note that because this still groups by A, it yields one average per distinct value of A; for a single overall average, use the GROUP ... ALL form above.) Here is the solution script:
dividends = load 'myfile.txt' as (A);
dump dividends;
grouped = group dividends by A;
avg = foreach grouped generate AVG(dividends.A);
dump avg;
