Compare tuples on basis of a field in pig - hadoop

(ABC,****,tool1,12)
(ABC,****,tool1,10)
(ABC,****,tool1,13)
(ABC,****,tool2,101)
(ABC,****,tool3,11)
The above is my input dataset in Pig.
The schema is: Username, ip, tool, duration
I want to add up the duration for rows that have the same tool.
Expected output:
(ABC,****,tool1,35)
(ABC,****,tool2,101)
(ABC,****,tool3,11)

Group by Username, ip and tool, then apply SUM to the duration:
A = LOAD 'data.csv' USING PigStorage(',') AS (Username:chararray,ip:chararray,tool:chararray,duration:int);
B = GROUP A BY (Username,ip,tool);
C = FOREACH B GENERATE FLATTEN(group) AS (Username,ip,tool), SUM(A.duration) AS total_duration;
DUMP C;

Related

Get value for unique record using Pig

Below is the input data set.
col1,col2,col3,col4,col5
key1,111,1,12/11/2016,10
key2,111,1,12/11/2016,10
key3,111,1,12/11/2016,10
key4,222,2,12/22/2016,10
key5,222,2,12/22/2016,10
key6,333,3,12/30/2016,10
key7,111,0,12/11/2016,10
The combination of col2, col3 and col4 identifies a unique record. For each such combination I need to pick any one value of col1 and populate it as a new field, say col6. The expected output is below.
col1,col2,col3,col4,col5,col6
key1,111,1,12/11/2016,10,key3
key2,111,1,12/11/2016,10,key3
key3,111,1,12/11/2016,10,key3
key4,222,2,12/22/2016,10,key5
key5,222,2,12/22/2016,10,key5
key6,333,3,12/30/2016,10,key6
key7,111,0,12/11/2016,10,key7
Below is my script, which fails with the error that follows it.
A = load 'test1.csv' using PigStorage(',');
B = GROUP A by ($1,$2,$3);
C = FOREACH B GENERATE FLATTEN(group), MAX(A.$0);
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2106: Error executing an algebraic function
This looks like a good use case for a nested FOREACH.
Ref : https://pig.apache.org/docs/r0.14.0/basic.html#foreach
Input :
key1,111,1,12/11/2016,10
key2,111,1,12/11/2016,10
key3,111,1,12/11/2016,10
key4,222,2,12/22/2016,10
key5,222,2,12/22/2016,10
key6,333,3,12/30/2016,10
key7,111,0,12/11/2016,10
PigScript
A = load 'input.csv' using PigStorage(',') AS (col1:chararray,col2:chararray,col3:chararray,col4:chararray,col5:chararray);
B = FOREACH (GROUP A BY (col2,col3,col4)) {
    ordered = ORDER A BY col1 DESC;
    latest = LIMIT ordered 1;
    GENERATE FLATTEN(A) AS (col1:chararray,col2:chararray,col3:chararray,col4:chararray,col5:chararray), FLATTEN(latest.col1) AS col6:chararray;
};
DUMP B;
Output :
(key1,111,1,12/11/2016,10,key3)
(key2,111,1,12/11/2016,10,key3)
(key3,111,1,12/11/2016,10,key3)
(key4,222,2,12/22/2016,10,key5)
(key5,222,2,12/22/2016,10,key5)
(key6,333,3,12/30/2016,10,key6)
(key7,111,0,12/11/2016,10,key7)
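If any one value of col1 will do, a shorter variant may also work once the columns are typed. The original script most likely failed because, without a schema, $0 is a bytearray and MAX then tries to treat values such as key1 as numbers. The following is only a sketch: it assumes your Pig version's MAX accepts chararray, and the aliases B and C are illustrative.

```pig
-- Declare chararray types so MAX compares the keys as strings
-- (assumption: MAX supports chararray in the Pig version in use).
A = LOAD 'input.csv' USING PigStorage(',')
    AS (col1:chararray, col2:chararray, col3:chararray, col4:chararray, col5:chararray);
B = GROUP A BY (col2, col3, col4);
-- FLATTEN(A) emits every row of the group; MAX(A.col1) is repeated on each of them.
C = FOREACH B GENERATE FLATTEN(A), MAX(A.col1) AS col6;
DUMP C;
```

Like the ORDER ... DESC / LIMIT 1 version above, this picks the lexicographically largest col1 in each group.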

How to get DISTINCT values of a group of fields in PIG?

Is it possible to get the following output in Pig? Can I group by the 1st and 2nd fields and then apply DISTINCT to the 3rd field?
For example
I have input data
12345|9658965|52145
12345|9658965|52145
12345|9658965|52145
23456|8541232|96589
23456|8541232|96585
I want output something like
12345|9658965|52145
23456|8541232|96589
23456|8541232|96585
Approach 1 : Using DISTINCT
Ref : http://pig.apache.org/docs/r0.12.0/basic.html#distinct
DISTINCT operator should help
test = LOAD 'test.csv' USING PigStorage('|');
distinct_recs = DISTINCT test;
DUMP distinct_recs;
Approach 2 : GROUP BY all fields
test = LOAD 'test.csv' USING PigStorage('|');
grp_all_fields = GROUP test BY ($0,$1,$2);
uniq_recs = FOREACH grp_all_fields GENERATE FLATTEN(group);
DUMP uniq_recs;
Both approaches give the expected output for the input shared.
Try this; it's pretty similar:
A = LOAD 'test.csv' USING PigStorage('|') as (a1,a2,a3);
unique =
    FOREACH (GROUP A BY a3) {
        b = A.(a1,a2);
        s = DISTINCT b;
        GENERATE FLATTEN(s), group AS a4;
    };
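The question asks specifically about grouping by the 1st and 2nd fields and applying DISTINCT to the 3rd. A minimal sketch of that variant is below; the field names a1, a2, a3 and the aliases thirds, uniq and by_pair are assumptions made for illustration.

```pig
A = LOAD 'test.csv' USING PigStorage('|') AS (a1:chararray, a2:chararray, a3:chararray);
by_pair = FOREACH (GROUP A BY (a1, a2)) {
    thirds = A.a3;           -- project only the 3rd field of the group's bag
    uniq = DISTINCT thirds;  -- drop duplicate 3rd-field values within the group
    GENERATE FLATTEN(group) AS (a1, a2), FLATTEN(uniq) AS a3;
};
DUMP by_pair;
```

For the sample input this should produce the same three rows as a plain DISTINCT over the whole relation; the nested form is mainly useful when you also want per-group aggregates in the same FOREACH.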

how to do any mathematical calculation on stitch..over column in pig

I am trying to calculate YoY growth on my raw data. Using Stitch with Over('lead'), I am able to get last year's data alongside the current year's data, but I am not able to do any calculation on the column returned by the Stitch/Over clause. Below is what I have tried so far:
grunt> data = LOAD 'loan_pig' USING org.apache.hive.hcatalog.pig.HCatLoader();
grunt> grp1 = group data by issue_yr;
grunt> tot_loan = foreach grp1{ cnt = COUNT(data.id); generate FLATTEN(group) as issue_yr,cnt as ln_cnt;};
grunt> grp2 = group tot_loan all;
grunt> loan_yr = foreach grp2{ srt = order tot_loan by issue_yr desc; generate FLATTEN(Stitch(srt, Over(srt.ln_cnt,'lead',0,1,1,0)));};
grunt> final = foreach loan_yr generate issue_yr,ln_cnt,$2;
grunt> describe final;
When I run describe on final, it shows
final: {stitched::issue_yr: int,stitched::ln_cnt: long,NULL}
that is, NULL for the 'lead' value column.
When I try to do any mathematical calculation on this column, it throws the error below:
grunt> final1 = foreach loan_yr generate issue_yr,ln_cnt,$2 as pr_yr;
grunt> fn = foreach final1 generate issue_yr,ln_cnt-pr_yr;
2016-06-24 11:23:42,118 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1052: Cannot cast bytearray to long
Can anyone please let me know whether there is a way to do calculations with columns returned from Stitch...Over? Is that even possible in Pig?

Equivalent of Union_map in pig

I have been trying to find the equivalent of union_map() in Pig. I know that the TOMAP function produces a MAP datatype.
But the requirement is to combine all the MAPs for a given id, as shown below.
select I1,UNION_MAP(MAP(Key,Val)) as new_val group by I1;
Sample Input and result is provided below.
Input
ID,Key,Val
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
select ID,UNION_MAP(TO_MAP(Key,VAL)) from table group by ID;
Result
ID1,(K1#V7,K2#V4)
ID2,(K1#V2,K3#V3)
I would like to get similar output in Pig.
Download piggybank.jar from this link http://www.java2s.com/Code/Jar/p/Downloadpiggybankjar.htm, set it in your classpath, and try the approach below.
input
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
PigScript:
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input' USING PigStorage(',') AS (ID:chararray,Key:chararray,Val:chararray);
B = RANK A;
C = GROUP B BY (ID,Key);
D = FOREACH C {
    sortByRank = ORDER B BY rank_A DESC;
    top1 = LIMIT sortByRank 1;
    GENERATE FLATTEN(top1);
}
E = GROUP D BY top1::ID;
F = FOREACH E {
    ToMap = FOREACH D GENERATE TOMAP(top1::Key,top1::Val);
    GENERATE group,BagToTuple(ToMap) AS myMap;
}
DUMP F;
Output:
(ID1,([K1#V7],[K2#V4]))
(ID2,([K1#V2],[K3#V3]))

Count and find maximum number in Hadoop using pig

I have a table containing sample CDR data, in which column A holds the calling person's mobile number and column B holds the called person's mobile number.
I need to find who made the maximum number of calls (column A),
and also which number (column B) was called most.
The table structure is as below:
calling called
889578226 77382596
889582256 77382596
889582256 7736368296
7785978214 782987522
In the above table, 889578226 has the most outgoing calls and 77382596 is the most called number; I need to get the output in that form.
In Hive I run the query below:
SELECT calling_a,called_b, COUNT(called_b) FROM cdr_data GROUP BY calling_a,called_b;
What would be the equivalent code for the above query in Pig?
Anas, could you please let me know whether this is what you are expecting, or something different?
input.txt
a,100
a,101
a,101
a,101
a,103
b,200
b,201
b,201
c,300
c,300
c,301
d,400
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray,phone:long);
B = GROUP A BY (name,phone);
C = FOREACH B GENERATE FLATTEN(group),COUNT(A) AS cnt;
D = GROUP C BY $0;
E = FOREACH D {
    SortedList = ORDER C BY cnt DESC;
    top = LIMIT SortedList 1;
    GENERATE FLATTEN(top);
}
DUMP E;
Output:
(a,101,3)
(b,201,2)
(c,300,2)
(d,400,1)
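The script above returns, for each name, the phone value that occurs most often. If what you need is the single number with the most outgoing calls and the single most-called number overall, a sketch along the following lines may help. The file name cdr_data.csv, the comma delimiter, and all alias names are assumptions, not taken from the question.

```pig
-- Assumed input: a comma-delimited file with two columns, calling and called.
cdr = LOAD 'cdr_data.csv' USING PigStorage(',') AS (calling:chararray, called:chararray);

-- Number of calls made per calling number, largest first.
by_calling = GROUP cdr BY calling;
calling_cnt = FOREACH by_calling GENERATE group AS calling, COUNT(cdr) AS calls;
calling_sorted = ORDER calling_cnt BY calls DESC;
top_calling = LIMIT calling_sorted 1;

-- Number of calls received per called number, largest first.
by_called = GROUP cdr BY called;
called_cnt = FOREACH by_called GENERATE group AS called, COUNT(cdr) AS calls;
called_sorted = ORDER called_cnt BY calls DESC;
top_called = LIMIT called_sorted 1;

DUMP top_calling;
DUMP top_called;
```

If several numbers tie for the highest count, LIMIT 1 keeps only one of them; raise the limit or filter on the maximum count if you need all of them.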
