PIG Script to group and aggregate data - hadoop

I have a file with data similar to the one below:
(1,11)
(1,111)
(2,22)
(2,222)
How do I generate the output below?
(1,11,111)
(2,22,222)
Thanks in advance!!!

The BagToString() function will help with your use case.
Ref : http://pig.apache.org/docs/r0.11.0/api/org/apache/pig/builtin/BagToString.html
Input :
1,11
1,111
2,22
2,222
Pig Script :
inp_data = LOAD 'input_data.csv' USING PigStorage(',') AS (id:long,value:long);
inp_grp_id = GROUP inp_data BY id;
req_stats = FOREACH inp_grp_id GENERATE group AS id, BagToString(inp_data.value,',') AS values;
DUMP req_stats;
Output :
(1,11,111)
(2,22,222)
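Note that BagToString returns a chararray, so each output tuple is really two fields, the id and a delimited string; DUMP just happens to print it as (1,11,111). A quick sanity check, assuming the script above:
DESCRIBE req_stats;
-- expected: req_stats: {id: long,values: chararray}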

Related

Get value for unique record using Pig

Below is the input data set.
col1,col2,col3,col4,col5
key1,111,1,12/11/2016,10
key2,111,1,12/11/2016,10
key3,111,1,12/11/2016,10
key4,222,2,12/22/2016,10
key5,222,2,12/22/2016,10
key6,333,3,12/30/2016,10
key7,111,0,12/11/2016,10
The combination of col2, col3, and col4 identifies a unique record. I need to get any one value from col1 for each unique combination and populate it as a new field, say col6. The expected output is below:
col1,col2,col3,col4,col5,col6
key1,111,1,12/11/2016,10,key3
key2,111,1,12/11/2016,10,key3
key3,111,1,12/11/2016,10,key3
key4,222,2,12/22/2016,10,key5
key5,222,2,12/22/2016,10,key5
key6,333,3,12/30/2016,10,key6
key7,111,0,12/11/2016,10,key7
Below is my script; I am getting an error.
A = load 'test1.csv' using PigStorage(',');
B = GROUP A by ($1,$2,$3);
C = FOREACH B GENERATE FLATTEN(group), MAX(A.$0);
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2106: Error executing an algebraic function
Looks like a good use case for a nested FOREACH.
Ref : https://pig.apache.org/docs/r0.14.0/basic.html#foreach
Input :
key1,111,1,12/11/2016,10
key2,111,1,12/11/2016,10
key3,111,1,12/11/2016,10
key4,222,2,12/22/2016,10
key5,222,2,12/22/2016,10
key6,333,3,12/30/2016,10
key7,111,0,12/11/2016,10
Pig Script :
A = load 'input.csv' using PigStorage(',') AS (col1:chararray,col2:chararray,col3:chararray,col4:chararray,col5:chararray);
B = FOREACH (GROUP A BY (col2,col3,col4)) {
    ordered = ORDER A BY col1 DESC;
    latest = LIMIT ordered 1;
    GENERATE FLATTEN(A) AS (col1:chararray,col2:chararray,col3:chararray,col4:chararray,col5:chararray), FLATTEN(latest.col1) AS col6:chararray;
};
DUMP B;
Output :
(key1,111,1,12/11/2016,10,key3)
(key2,111,1,12/11/2016,10,key3)
(key3,111,1,12/11/2016,10,key3)
(key4,222,2,12/22/2016,10,key5)
(key5,222,2,12/22/2016,10,key5)
(key6,333,3,12/30/2016,10,key6)
(key7,111,0,12/11/2016,10,key7)
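As an aside, the ERROR 2106 in the original script most likely comes from applying MAX to an untyped (bytearray) field, since the load declares no schema. Assuming that diagnosis, here is a minimal sketch that declares a schema (MAX works on chararrays as well as numbers) and reproduces the expected output:
A = LOAD 'test1.csv' USING PigStorage(',') AS (col1:chararray,col2:chararray,col3:chararray,col4:chararray,col5:chararray);
B = GROUP A BY (col2,col3,col4);
-- FLATTEN(A) keeps every original row; MAX(A.col1) repeats the per-group maximum on each of them
C = FOREACH B GENERATE FLATTEN(A), MAX(A.col1) AS col6;
DUMP C;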

How to get DISTINCT values of a group of fields in PIG?

Is it possible to get the following output in Pig? Will I be able to group by the 1st and 2nd fields and then do a DISTINCT on the 3rd field?
For example
I have input data
12345|9658965|52145
12345|9658965|52145
12345|9658965|52145
23456|8541232|96589
23456|8541232|96585
I want output like this:
12345|9658965|52145
23456|8541232|96589
23456|8541232|96585
Approach 1 : Using DISTINCT
Ref : http://pig.apache.org/docs/r0.12.0/basic.html#distinct
The DISTINCT operator should help:
test = LOAD 'test.csv' USING PigStorage('|');
distinct_recs = DISTINCT test;
DUMP distinct_recs;
Approach 2 : GROUP BY all fields
test = LOAD 'test.csv' USING PigStorage('|');
grp_all_fields = GROUP test BY ($0,$1,$2);
uniq_recs = FOREACH grp_all_fields GENERATE FLATTEN(group);
DUMP uniq_recs;
Both approaches give the expected output for the input shared.
Try this, it's pretty similar:
A = LOAD 'test.csv' USING PigStorage('|') as (a1,a2,a3);
unique = FOREACH (GROUP A BY a3) {
    b = A.(a1,a2);
    s = DISTINCT b;
    GENERATE FLATTEN(s), group AS a4;
};
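For completeness, the literal phrasing in the question (group by the first two fields, then DISTINCT on the third) works too and yields the same rows here. A sketch, reusing the load with aliases a1,a2,a3 from the previous answer:
grp = GROUP A BY (a1,a2);
uniq3 = FOREACH grp {
    -- collect the third field per (a1,a2) group and deduplicate it
    thirds = A.a3;
    uniq_thirds = DISTINCT thirds;
    GENERATE FLATTEN(group) AS (a1,a2), FLATTEN(uniq_thirds) AS a3;
};
DUMP uniq3;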

Compare tuples on basis of a field in pig

(ABC,****,tool1,12)
(ABC,****,tool1,10)
(ABC,****,tool1,13)
(ABC,****,tool2,101)
(ABC,****,tool3,11)
Above is the input data for my dataset in Pig.
Schema : Username, ip, tool, duration
I want to add up the durations for the same tool.
Expected output :
(ABC,****,tool1,35)
(ABC,****,tool2,101)
(ABC,****,tool3,11)
Use GROUP BY and then SUM on the duration:
A = LOAD 'data.csv' USING PigStorage(',') AS (Username:chararray,ip:chararray,tool:chararray,duration:int);
B = GROUP A BY (Username,ip,tool);
C = FOREACH B GENERATE FLATTEN(group) AS (Username,ip,tool), SUM(A.duration) AS duration;
DUMP C;
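For the sample input, DUMP C should print:
(ABC,****,tool1,35)
(ABC,****,tool2,101)
(ABC,****,tool3,11)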

Apache Pig GROUP BY ,ORDER BY

I have a bag containing tuples of playerName, gameName, score.
I first GROUP the bag BY game and put the result in another bag. Now I want the tuple with the highest score for each game in another bag. How should I do this?
Input :
jon,mario,2345
joe,minesweeper,234
peter,mario,112
lisa,minesweeper,900
Pig Script :
game_data = LOAD 'game_data.csv' USING PigStorage(',') AS (player:chararray, game:chararray, score:long);
game_data_grp_by_game = GROUP game_data BY game;
game_kpis = FOREACH game_data_grp_by_game {
    ord_game_data_by_score = ORDER game_data BY score DESC;
    max_score_record = LIMIT ord_game_data_by_score 1;
    GENERATE group AS game, FLATTEN(max_score_record.player) AS player_name, FLATTEN(max_score_record.score) AS score;
};
Output of DUMP game_kpis :
(mario,jon,2345)
(minesweeper,lisa,900)
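A shorter alternative is the built-in TOP function: TOP(n, field, bag) returns the top-n tuples of a bag ordered by the given 0-based field index. A sketch; note the columns come out in the original tuple order (player, game, score):
-- field 2 is score, so TOP(1, 2, ...) keeps the highest-scoring tuple per game
top_scores = FOREACH game_data_grp_by_game GENERATE FLATTEN(TOP(1, 2, game_data)) AS (player_name:chararray, game:chararray, score:long);
DUMP top_scores;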

How to get serial number in Pig Script based on column?

Currently my data is coming in as shown below, but I want it to show a rank that resets whenever the pid field changes. My script is this (I have tried the RANK and DENSE RANK operators, but still no desired output):
trans_c1 = LOAD '/mypath/data_file.csv' using PigStorage(',') as (date,Product_id);
(DATE, Product_id)
(2015-01-13T18:00:40.622+05:30,B00XT)
(2015-01-13T18:00:40.622+05:30,B00XT)
(2015-01-13T18:00:40.622+05:30,B00XT)
(2015-01-13T18:00:40.622+05:30,B00XT)
(2015-01-13T18:00:40.622+05:30,B00OZ)
(2015-01-13T18:00:40.622+05:30,B00OZ)
(2015-01-13T18:00:40.622+05:30,B00OZ)
(2015-01-13T18:00:40.622+05:30,B00VB)
(2015-01-13T18:00:40.622+05:30,B00VB)
(2015-01-13T18:00:40.622+05:30,B00VB)
(2015-01-13T18:00:40.622+05:30,B00VB)
The final output should look like this, where the rank sequence resets to 1 whenever Product_id changes. Is it possible to do that in Pig?
(1,2015-01-13T18:00:40.622+05:30,B00XT)
(2,2015-01-13T18:00:40.622+05:30,B00XT)
(3,2015-01-13T18:00:40.622+05:30,B00XT)
(4,2015-01-13T18:00:40.622+05:30,B00XT)
(1,2015-01-13T18:00:40.622+05:30,B00OZ)
(2,2015-01-13T18:00:40.622+05:30,B00OZ)
(3,2015-01-13T18:00:40.622+05:30,B00OZ)
(1,2015-01-13T18:00:40.622+05:30,B00VB)
(2,2015-01-13T18:00:40.622+05:30,B00VB)
(3,2015-01-13T18:00:40.622+05:30,B00VB)
(4,2015-01-13T18:00:40.622+05:30,B00VB)
This can be solved with the Piggybank functions Stitch and Over, or with DataFu's Enumerate function. Plain RANK does not help here because it assigns a single global sequence over the whole relation; the row number has to be generated inside each group's bag.
Script using Piggybank functions:
REGISTER <path to piggybank folder>/piggybank.jar;
DEFINE Stitch org.apache.pig.piggybank.evaluation.Stitch;
DEFINE Over org.apache.pig.piggybank.evaluation.Over('int');
input_data = LOAD 'data_file.csv' USING PigStorage(',') AS (date:chararray, pid:chararray);
group_data = GROUP input_data BY pid;
rank_grouped_data = FOREACH group_data GENERATE FLATTEN(Stitch(input_data, Over(input_data, 'row_number')));
display_data = FOREACH rank_grouped_data GENERATE stitched::result AS rank_number, stitched::date AS date, stitched::pid AS pid;
DUMP display_data;
Script using dataFu's Enumerate function:
REGISTER <path to pig libraries>/datafu-1.2.0.jar;
DEFINE Enumerate datafu.pig.bags.Enumerate('1');
input_data = LOAD 'data_file.csv' USING PigStorage(',') AS (date:chararray, pid:chararray);
group_data = GROUP input_data BY pid;
data = FOREACH group_data GENERATE FLATTEN(Enumerate(input_data));
display_data = FOREACH data GENERATE $2, $0, $1;
DUMP display_data;
The DataFu jar can be downloaded from the Maven repository: http://search.maven.org/#search%7Cga%7C1%7Cg%3a%22com.linkedin.datafu%22
Output:
(1,2015-01-13T18:00:40.622+05:30,B00OZ)
(2,2015-01-13T18:00:40.622+05:30,B00OZ)
(3,2015-01-13T18:00:40.622+05:30,B00OZ)
(1,2015-01-13T18:00:40.622+05:30,B00VB)
(2,2015-01-13T18:00:40.622+05:30,B00VB)
(3,2015-01-13T18:00:40.622+05:30,B00VB)
(4,2015-01-13T18:00:40.622+05:30,B00VB)
(1,2015-01-13T18:00:40.622+05:30,B00XT)
(2,2015-01-13T18:00:40.622+05:30,B00XT)
(3,2015-01-13T18:00:40.622+05:30,B00XT)
(4,2015-01-13T18:00:40.622+05:30,B00XT)
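Note that the rows come back grouped by pid rather than in the original file order (the sample output starts with B00OZ, not B00XT); GROUP BY does not preserve input order, so add an explicit ORDER BY afterwards if a particular ordering matters.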
Ref:
Implementing row number function in apache pig
Usage of Apache Pig rank function
