PIG - Aggregations based on multiple columns - Hadoop

My input data set has 3 columns and its schema looks like this:
ActivityDate, EventId, EventDate
Now, using Pig, I need to derive multiple variables like the ones below in one output file:
1) All Event IDs where ActivityDate >= EventDate - 30 days
2) All Event IDs where ActivityDate >= EventDate - 60 days
3) All Event IDs where ActivityDate >= EventDate - 90 days
I have more than 30 variables like this. If it were just one variable, we could use a simple FILTER to subset the data.
I am thinking about a UDF implementation that takes a bag as input and returns the count of Event IDs matching each of the above criteria.
What is the best way to aggregate data on multiple columns in Pig?
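For context, a minimal sketch of the single-threshold FILTER case mentioned above (the input path, the alias data, and numeric day-offset date fields are assumptions for illustration):
-- load the three columns; dates assumed numeric (e.g. day offsets)
data = LOAD 'input' USING PigStorage(',') AS (ActivityDate:int, EventId:chararray, EventDate:int);
-- keep events where the activity falls within 30 days before the event
recent = FILTER data BY ActivityDate >= EventDate - 30;
all_grp = GROUP recent ALL;
cnt = FOREACH all_grp GENERATE COUNT(recent.EventId);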

I would suggest creating another file with all of your thresholds and cross-joining it with your data.
So you would have a file containing:
30
60
90
etc
read it like this:
grouping = load 'grouping.txt' using PigStorage(',') as (groups:double);
Then do:
data_with_grouping = cross data, grouping;
Then have this binary condition:
data_with_binary_condition = foreach data_with_grouping generate ActivityDate, EventId, EventDate, groups, (ActivityDate >= EventDate - groups ? 1 : 0) as binary_condition;
Now you will have one column with the threshold and one column with a binary variable that tells you whether the ID follows the condition or not.
You can then filter out all of the zeros on binary_condition and group on the groups column:
data_with_binary_condition_filtered = filter data_with_binary_condition by (binary_condition != 0);
grouped_by_threshold = group data_with_binary_condition_filtered by groups;
count_of_IDS = foreach grouped_by_threshold generate group, COUNT(data_with_binary_condition_filtered.EventId);
I hope this works. Obviously, I didn't debug it for you since I don't have your files.
This code will take a tad more time to run, but it will produce the output you need without a UDF.
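To illustrate with hypothetical numbers (assuming the dates are numeric day offsets): the rows (100, e1, 120) and (80, e2, 120) crossed with the threshold 30 give binary conditions (100 >= 120 - 30) = 1 and (80 >= 90) = 0, so only e1 counts there, while both count under 60 and 90. The final relation has one (threshold, count) row per line of grouping.txt, e.g.:
(30.0,1)
(60.0,2)
(90.0,2)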

If I understand your question correctly, you want to divide the difference between EventDate and ActivityDate into 30-day blocks (e.g. 1 to 30, 31 to 60, 61 to 90 and so on) and then count the frequency of each block.
In this case, I would just rearrange the above equation to create a 'range' variable, as below:
-- assuming the relation events contains the 3 columns ActivityDate, EventId, EventDate
-- divide the difference between ED and AD by 30 and cast it to int, so that each block is represented by one integer
input1 = FOREACH events GENERATE (int)((EventDate - ActivityDate) / 30) as range;
output1 = GROUP input1 BY range;
output2 = FOREACH output1 GENERATE group AS range, COUNT(input1) as count;
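As a quick sanity check on the arithmetic (hypothetical numeric dates): a difference of 25 days maps to (int)(25 / 30) = 0, 45 days to 1, and 75 days to 2. Note that an exact multiple of 30 lands in the next block, e.g. (int)(30 / 30) = 1, so the blocks are effectively 0-29, 30-59, 60-89, and so on.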
Hope this helps.

Related

How to compare two tuples in PIG?

I want to filter the records of data set A whose flight_delay_time is less than some specific value (x).
But I get the value of x from another Pig query, so x is a tuple.
Using the following statement throws an error:
B = FILTER A by flight_delay_time < x;
dump B;
The data in file A looks like this:
ravi,savings,avinash,2,char,33,F,22,44,12,13,33,44,22,11,10,22,26
avinash,current,sandeep,3,char,44,M,33,11,10,12,33,22,39,12,23,19,35
supreeth,savings,prabhash,4,char,55,F,22,12,23,12,44,56,7,88,34,23,68
lavi,current,nirmesh,5,char,33,M,11,10,33,34,56,78,54,23,445,66,77
Venkat,savings,bunny,6,char,11,F,99,12,34,55,33,23,45,66,23,23,28
The value of x is (40), which is stored as a tuple.
The last column in the above data denotes the flight_delay_time.
I am extracting the value of x in the following way.
Following is the data stored in C_CONTROL_BATCH.txt:
25
35
40
15
I used the following code to extract the value of x.
control_batch = LOAD 'C_CONTROL_BATCH.txt' AS (start:int);
variable = ORDER control_batch BY start DESC;
X = LIMIT variable 1;
Here is the solution:
INPUT
We have two input files:
airlinesdata.txt - containing the raw data:
ravi,savings,avinash,2,char,33,F,22,44,12,13,33,44,22,11,10,22,26
avinash,current,sandeep,3,char,44,M,33,11,10,12,33,22,39,12,23,19,35
supreeth,savings,prabhash,4,char,55,F,22,12,23,12,44,56,7,88,34,23,68
lavi,current,nirmesh,5,char,33,M,11,10,33,34,56,78,54,23,445,66,77
Venkat,savings,bunny,6,char,11,F,99,12,34,55,33,23,45,66,23,23,28
x.txt - containing the data from which we get the value of x:
20
30
35
38
37
40
29
The flight_delay_time column is the last column in the relation below and is of type int.
Note - if you don't declare the type here, the program will throw an exception that it can't cast from bytearray to int when you filter at the end.
rawdata = LOAD 'airlinesdata.txt' USING PigStorage(',') AS (field1:chararray,field2:chararray,field3:chararray,field4:chararray,field5:chararray,field6:chararray,field7:chararray,field8:chararray,field9:chararray,field10:chararray,field11:chararray,field12:chararray,field13:chararray,field14:chararray,field15:chararray,field16:chararray,field17:chararray,flight_delay_time:int);
x_data = LOAD 'x.txt' USING PigStorage() AS (x_val:int);
order_x_data = ORDER x_data BY x_val desc;
max_value = LIMIT order_x_data 1;
Here we are again casting the value to int for the filter condition to work.
max_value_casted = FOREACH max_value GENERATE $0 as (maxval:int);
Finally we can issue the filter query to get the results.
Note how the maxval is accessed below by using the . operator from the max_value_casted relation.
output_data = FILTER rawdata BY flight_delay_time < max_value_casted.maxval;
DUMP output_data;
OUTPUT - values smaller than the max value of x (40):
(ravi,savings,avinash,2,char,33,F,22,44,12,13,33,44,22,11,10,22,26)
(avinash,current,sandeep,3,char,44,M,33,11,10,12,33,22,39,12,23,19,35)
(Venkat,savings,bunny,6,char,11,F,99,12,34,55,33,23,45,66,23,23,28)
Hope it helps :)

Pig: Counting the occurrence of a grouped column

In this raw data we have info of baseball players, the schema is:
name:chararray, team:chararray, position:bag{t:(p:chararray)}, bat:map[]
Using the following script we are able to list out players and the different positions they have played. How do we get a count of how many players have played a particular position?
E.G. How many players were in the 'Designated_hitter' position?
A single position can't appear multiple times in the position bag for a player.
Pig Script and output for the sample data is listed below.
--pig script
players = load 'baseball' as (name:chararray, team:chararray,position:bag{t:(p:chararray)}, bat:map[]);
pos = foreach players generate name, flatten(position) as position;
groupbyposition = group pos by position;
dump groupbyposition;
--dump groupbyposition (output of one position i.e Designated_hitter)
(Designated_hitter,{(Michael Young,Designated_hitter)})
From what I can tell you've already done all of the 'grunt' (ha! Pig joke) work. All that's left to do is use COUNT on the output of the GROUP BY. Something like:
groupbyposition = group pos by position ;
pos_count = FOREACH groupbyposition GENERATE group AS position, COUNT(pos) ;
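With the sample dump above (one player, Michael Young, at Designated_hitter), this should produce a row like:
(Designated_hitter,1)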
Note: Using UDFs you may be able to get a more efficient solution. If you only care about counting a certain few positions, it should be more efficient to filter the position bag beforehand (this is why I said UDF; I forgot you could just use a nested FILTER). For example:
pos = FOREACH players {
-- you can also add the DISTINCT that alexeipab points out here
-- make sure to change position in the FILTER to dist!
-- dist = DISTINCT position ;
filt = FILTER position BY p MATCHES 'Designated_hitter|etc.' ;
GENERATE name, FLATTEN(filt) ;
}
If none of the positions you want appear in position, it will create an empty bag. When empty bags are FLATTENed, the row is discarded. This means you'll be FLATTENing bags of N or fewer elements (where N is the number of positions you want) instead of 7-15 (didn't really look at the data that closely), and the GROUP will run on significantly less data.
Notes: I'm not sure if this will be significantly faster (if at all). Also, using a UDF to perform the nested FILTER may be faster.
You can use a nested DISTINCT to get the list of players and then count it.
players = load 'baseball' as (name:chararray, team:chararray,position:bag{t:(p:chararray)}, bat:map[]);
pos = foreach players generate name, flatten(position) as position;
groupbyposition = group pos by position;
pos_count = foreach groupbyposition {
players = DISTINCT pos.name;
generate group, COUNT(players) as num, pos;
};
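With the sample data above, this should give something like:
(Designated_hitter,1,{(Michael Young,Designated_hitter)})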

Pig de-duplicate events occurring within 1 minute of each other

We are using pig-0.11.0-cdh4.3.0 with a CDH4 cluster and we need to de-duplicate some web logs. The solution idea (expressed in SQL) is something like this:
SELECT
T1.browser,
T1.click_type,
T1.referrer,
T1.datetime,
T2.datetime
FROM
My_Table T1
INNER JOIN My_Table T2 ON
T2.browser = T1.browser AND
T2.click_type = T1.click_type AND
T2.referrer = T1.referrer AND
T2.datetime > T1.datetime AND
T2.datetime <= DATEADD(mi, 1, T1.datetime)
I grabbed the above from here: SQL find duplicate records occuring within 1 minute of each other. I am hoping I can implement a similar solution in Pig, but I am finding that Pig apparently does not support JOIN via an expression (only by fields), as the above join requires. Do you know how to de-duplicate events that occur within 1 minute of each other in Pig? Thanks!
One approach is to group by the required parameters, then filter and limit inside a nested FOREACH (assuming the raw relation is called records):
grpd = group records by (browser, click_type, referrer);
top3 = foreach grpd {
sorted = filter records by time < 60;
top = limit sorted 2;
generate group, flatten(top);
};
Another approach:
records_group = group records by (browser, click_type, referrer);
with_min = FOREACH records_group GENERATE FLATTEN(records), MAX(records.datetime) as maxDt;
filterRecords = filter with_min by (maxDt - $2) < 60;
$2 is the datetime position; change it accordingly.
Off the top of my head, something like this could work, but it needs testing:
view = FOREACH logs GENERATE browser, click_type, referrer, datetime, GetYear(datetime) as year, GetMonth(datetime) as month, GetDay(datetime) as day, GetHour(datetime) as hour, GetMinute(datetime) as minute;
grp = GROUP view BY (browser, click_type, referrer, year, month, day, hour, minute);
uniq = FOREACH grp {
top = LIMIT view 1;
GENERATE FLATTEN(top.(browser, click_type, referrer, datetime));
}
Of course, if one event is at 12:03:45 and another at 12:03:59, these would fall in the same group, while 12:04:45 and 12:05:00 would fall in different groups despite being only 15 seconds apart.
To get an exact 60-second difference you would need to write a UDF that iterates over a bag sorted within each (browser, click_type, referrer) group and removes the unwanted rows.
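A sketch of how such a UDF might be wired up on the Pig side (DedupWithinMinute and the my.udfs package are hypothetical names for illustration, not an existing library class; logs is the raw relation from the snippet above):
-- hypothetical UDF that walks a datetime-sorted bag and drops events
-- falling within 60 seconds of the previously kept event
DEFINE DedupWithinMinute my.udfs.DedupWithinMinute();
grp2 = GROUP logs BY (browser, click_type, referrer);
deduped = FOREACH grp2 {
ordered = ORDER logs BY datetime;
GENERATE group, FLATTEN(DedupWithinMinute(ordered));
};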
Aleks and Marq,
records_group = group records by (browser, click_type, referrer);
with_min = FOREACH records_group GENERATE FLATTEN(records), MAX(records.datetime) as max;
with_min = FOREACH with_min GENERATE browser, click_type, referrer, ABS(max - datetime) as maxDtgroup;
regroup = group with_min by (browser, click_type, referrer, maxDtgroup);
Re-group with maxDtgroup as part of the key, then filter the top 1 record.

hadoop cascading how to get top N tuples

New to Cascading; I'm trying to find a way to get the top N tuples based on a sort/order. For example, I'd like to know the top 100 first names people are using.
Here's the equivalent in Teradata SQL:
select top 100 first_name, num_records
from
(select first_name, count(1) as num_records
from table_1
group by first_name) a
order by num_records DESC
Here's the equivalent in Hadoop Pig:
a = load 'table_1' as (first_name:chararray, last_name:chararray);
b = foreach (group a by first_name) generate group as first_name, COUNT(a) as num_records;
c = order b by num_records DESC;
d = limit c 100;
It seems very easy to do in SQL or Pig, but I'm having a hard time finding a way to do it in Cascading. Please advise!
Assuming you just need the Pipe setup for how to do this, in Cascading 2.1.6:
Pipe firstNamePipe = new GroupBy("topFirstNames", InPipe,
new Fields("first_name"));
firstNamePipe = new Every(firstNamePipe, new Fields("first_name"),
new Count(new Fields("num_records")), Fields.ALL);
firstNamePipe = new GroupBy(firstNamePipe,
new Fields("first_name"),
new Fields("num_records"),
true); // where true is descending order
firstNamePipe = new Every(firstNamePipe, new Fields("first_name", "num_records"),
new First(Fields.ARGS, 100), Fields.ALL);
Where InPipe is formed from your incoming tap holding the tuple data referenced above, namely "first_name"; "num_records" is created when new Count() is called.
If you have the "num_records" and "first_name" data in separate taps (tables or files), you can set up two pipes pointing at those two Tap sources and join them using CoGroup.
The definitions I used are from Cascading 2.1.6:
GroupBy(String groupName, Pipe pipe, Fields groupFields, Fields sortFields, boolean reverseOrder)
Count(Fields fieldDeclaration)
First(Fields fieldDeclaration, int firstN)
Method 1
Use a GroupBy and group on the required columns. You can make use of the secondary sorting that Cascading provides; by default it sorts in ascending order, and if you want descending order you can use reverseOrder().
To get the top N tuples or rows:
It's quite simple: use a static counter in a FILTER and increment it by 1 for each tuple, checking whether it has exceeded N. Return true (i.e., remove the tuple) once the count exceeds N, otherwise return false.
This will produce output with the first N tuples.
Method 2
Cascading provides an inbuilt assembly, Unique, which uses a FirstNBuffer internally; see the link below:
http://docs.cascading.org/cascading/2.2/javadoc/cascading/pipe/assembly/Unique.html

How can I select record with minimum value in pig latin

I have timestamped samples and I'm processing them using Pig. I want to find, for each day, the minimum value of the sample and the time of that minimum. So I need to select the record that contains the sample with the minimum value.
In the following for simplicity I'll represent time in two fields, the first is the day and the second the "time" within the day.
1,1,4.5
1,2,3.4
1,5,5.6
To find the minimum the following works:
samples = LOAD 'testdata' USING PigStorage(',') AS (day:int, time:int, samp:float);
g = GROUP samples BY day;
dailyminima = FOREACH g GENERATE group as day, MIN(samples.samp) as samp;
But then I've lost the exact time at which the minimum happened. I hoped I could use nested expressions. I tried the following:
dailyminima = FOREACH g {
minsample = MIN(samples.samp);
mintuple = FILTER samples BY samp == minsample;
GENERATE group as day, mintuple.time, mintuple.samp;
};
But with that I receive the error message:
2012-11-12 12:08:40,458 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000:
<line 5, column 29> Invalid field reference. Referenced field [samp] does not exist in schema: .
Details at logfile: /home/hadoop/pig_1352722092997.log
If I set minsample to a constant, it doesn't complain:
dailyminima = FOREACH g {
minsample = 3.4F;
mintuple = FILTER samples BY samp == minsample;
GENERATE group as day, mintuple.time, mintuple.samp;
};
And indeed produces a sensible result:
(1,{(2)},{(3.4)})
While writing this I thought of using a separate JOIN:
dailyminima = FOREACH g GENERATE group as day, MIN(samples.samp) as minsamp;
dailyminima = JOIN samples BY (day, samp), dailyminima BY (day, minsamp);
That works, but (in the real case) it results in a join over two large data sets instead of a search through a single day's values, which doesn't seem healthy.
In the real case I actually want to find max and min and associated times. I hoped that the nested expression approach would allow me to do both at once.
Suggestions of ways to approach this would be appreciated.
Thanks to alexeipab for the link to another SO question.
One working solution (finding both min and max and the associated time) is:
dailyminima = FOREACH g {
minsamples = ORDER samples BY samp;
minsample = LIMIT minsamples 1;
maxsamples = ORDER samples BY samp DESC;
maxsample = LIMIT maxsamples 1;
GENERATE group as day, FLATTEN(minsample), FLATTEN(maxsample);
};
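With the sample data above, this should yield (1,1,2,3.4,1,5,5.6): the group key, followed by the full min tuple (day 1, time 2, samp 3.4) and the full max tuple (day 1, time 5, samp 5.6).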
Another way to do it, which has the advantage that it doesn't sort the entire relation, and only keeps the (potential) min in memory, is to use the PiggyBank ExtremalTupleByNthField. This UDF implements Accumulator and Algebraic and is pretty efficient.
Your code would look something like this:
DEFINE TupleByNthField org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField('3', 'min');
samples = LOAD 'testdata' USING PigStorage(',') AS (day:int, time:int, samp:float);
g = GROUP samples BY day;
bagged = FOREACH g GENERATE TupleByNthField(samples);
flattened = FOREACH bagged GENERATE FLATTEN($0);
min_result = FOREACH flattened GENERATE $1 .. ;
Keep in mind that the choice to compare on the samp field is made in the DEFINE statement, by passing '3' (the 1-based index of samp) as the first parameter.
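With the sample data above, the extremal tuple for day 1 is (1,2,3.4); after the final $1 .. projection, min_result should hold (2,3.4), i.e. the time and value of the minimum.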
