How to get serial number in Pig Script based on column? - hadoop

Currently My data is coming in this way but i want my data to show RANK with respect to pid fields changing sequence.My script is this.I have tried rank operator and dense rank operator but still no desired output.
trans_c1 = LOAD '/mypath/data_file.csv' using PigStorage(',') as (date,Product_id);
(DATE,Product id)
(2015-01-13T18:00:40.622+05:30,B00XT)
(2015-01-13T18:00:40.622+05:30,B00XT)
(2015-01-13T18:00:40.622+05:30,B00XT)
(2015-01-13T18:00:40.622+05:30,B00XT)
(2015-01-13T18:00:40.622+05:30,B00OZ)
(2015-01-13T18:00:40.622+05:30,B00OZ)
(2015-01-13T18:00:40.622+05:30,B00OZ)
(2015-01-13T18:00:40.622+05:30,B00VB)
(2015-01-13T18:00:40.622+05:30,B00VB)
(2015-01-13T18:00:40.622+05:30,B00VB)
(2015-01-13T18:00:40.622+05:30,B00VB)
The final output should look like this where the rank sequence changes with the change in (Product_id) and resets by 1.Is it possible in pig to do that?
(1,2015-01-13T18:00:40.622+05:30,B00XT)
(2,2015-01-13T18:00:40.622+05:30,B00XT)
(3,2015-01-13T18:00:40.622+05:30,B00XT)
(4,2015-01-13T18:00:40.622+05:30,B00XT)
(1,2015-01-13T18:00:40.622+05:30,B00OZ)
(2,2015-01-13T18:00:40.622+05:30,B00OZ)
(3,2015-01-13T18:00:40.622+05:30,B00OZ)
(1,2015-01-13T18:00:40.622+05:30,B00VB)
(2,2015-01-13T18:00:40.622+05:30,B00VB)
(3,2015-01-13T18:00:40.622+05:30,B00VB)
(4,2015-01-13T18:00:40.622+05:30,B00VB)

This question can be solved by using piggybank functions Stitch and Over. It can also be solved by using dataFu's Enumerate function.
Script using Piggybank functions:
REGISTER <path to piggybank folder>/piggybank.jar;
DEFINE Stitch org.apache.pig.piggybank.evaluation.Stitch;
DEFINE Over org.apache.pig.piggybank.evaluation.Over('int');
input_data = LOAD 'data_file.csv' USING PigStorage(',') AS (date:chararray, pid:chararray);
group_data = GROUP input_data BY pid;
rank_grouped_data = FOREACH group_data GENERATE FLATTEN(Stitch(input_data, Over(input_data, 'row_number')));
display_data = FOREACH rank_grouped_data GENERATE stitched::result AS rank_number, stitched::date AS date, stitched::pid AS pid;
DUMP display_data;
Script using dataFu's Enumerate function:
REGISTER <path to pig libraries>/datafu-1.2.0.jar;
DEFINE Enumerate datafu.pig.bags.Enumerate('1');
input_data = LOAD 'data_file.csv' USING PigStorage(',') AS (date:chararray, pid:chararray);
group_data = GROUP input_data BY pid;
data = FOREACH group_data GENERATE FLATTEN(Enumerate(input_data));
display_data = FOREACH data GENERATE $2, $0, $1;
DUMP display_data;
DataFu jar file can be downloaded from Maven repository: http://search.maven.org/#search%7Cga%7C1%7Cg%3a%22com.linkedin.datafu%22
Output:
(1,2015-01-13T18:00:40.622+05:30,B00OZ)
(2,2015-01-13T18:00:40.622+05:30,B00OZ)
(3,2015-01-13T18:00:40.622+05:30,B00OZ)
(1,2015-01-13T18:00:40.622+05:30,B00VB)
(2,2015-01-13T18:00:40.622+05:30,B00VB)
(3,2015-01-13T18:00:40.622+05:30,B00VB)
(4,2015-01-13T18:00:40.622+05:30,B00VB)
(1,2015-01-13T18:00:40.622+05:30,B00XT)
(2,2015-01-13T18:00:40.622+05:30,B00XT)
(3,2015-01-13T18:00:40.622+05:30,B00XT)
(4,2015-01-13T18:00:40.622+05:30,B00XT)
Ref:
Implementing row number function in apache pig
Usage of Apache Pig rank function

Related

Compare tuples on basis of a field in pig

(ABC,****,tool1,12)
(ABC,****,tool1,10)
(ABC,****,tool1,13)
(ABC,****,tool2,101)
(ABC,****,tool3,11)
Above is input data
Following is my dataset in pig.
Schema is : Username,ip,tool,duration
I want to add duration of same tools
Output
(ABC,****,tool1,35)
(ABC,****,tool2,101)
(ABC,****,tool3,11
Use GROUP BY and use SUM on the duration.
A = LOAD 'data.csv' USING PigStorage(',') AS (Username:chararray,ip:chararray,tool:chararray,duration:int);
B = GROUP A BY (Username,ip,tool);
C = FOREACH B GENERATE FLATTEN(group) AS (Username,ip,tool),SUM(A.duration);
DUMP C;

Pig: is it possible to write a loop over variables in a list?

I have to loop over 30 variables in a list
[var1,var2, ... , var30]
and for each variable I use some PIG group by statement such as
grouped = GROUP data by var1;
data_var1 = FOREACH grouped{
GENERATE group as mygroup,
COUNT(data) as count;
};
Is there a way to loop over the list of variables or I am forced to repeat the code above manually 30 times in my code?
Thanks!
I think what you're looking for is the pig macro
Create a relation for your 30 variables, and iterate on them by foreach, and call a macro which get 2 params: your data relation and the var you want to group by.
Just check the example in the link the macro is really similar what you'd like to do.
UPDATE & code
So here's the macro you can use:
DEFINE my_cnt(data, group_field) RETURNS C {
$C = FOREACH (GROUP $data by $group_field) GENERATE
group AS mygroup,
COUNT($data) AS count;
};
Use the macro:
IMPORT 'cnt.macro';
data = LOAD 'data.txt' USING PigStorage(',') AS (field:chararray, value:chararray);
DESCRIBE data;
e = my_cnt(data,'the_field_you_group_by');
DESCRIBE e;
DUMP e;
I'm still thinking on how can you iterate through on your fields you'd like to group by. My original suggestion to foreach through a relation what contains the filed names not correct. (To create a UDF for this always works.) Let me think about it.
But this macro works as is if you call by all the filed name you want to group.

PIG Script to group and aggregate data

I have a file with data similar to the below one
(1,11)
(1,111)
(2,22)
(2,222)
How do I generate the output below?
(1,11,111)
(2,22,222)
Thanks in advance!!!
BagToString() function will help for your use case.
Ref : http://pig.apache.org/docs/r0.11.0/api/org/apache/pig/builtin/BagToString.html
Input :
1,11
1,111
2,22
2,222
Pig Script :
inp_data = LOAD 'input_data.csv' USING PigStorage(',') AS (id:long,value:long);
inp_grp_id = GROUP inp_data BY id;
req_stats = FOREACH inp_grp_id GENERATE group AS id, BagToString(inp_data.value,',') AS values;
Output :
(1,11,111)
(2,22,222)

How do I get the matching values inside a for loop using FILTER in PIG?

Consider this as my input,
Input (File1):
12345;11
34567;12
.
.
Input (File2):
11;(1,2,3,4,5,6,7,8,9)
12;(9,8,7,6,5,4,3,2,1)
.
.
I would like to get the output as follows:
Output:
(1,2,3,4,5,6,7,8,9)
(9,8,7,6,5,4,3,2,1)
Here's the sample code which I have tried using FILTER and I face some errors with this. Please suggest me some other options.
data1 = load '/File1' using PigStorage(';') as (id,number);
data2 = load '/File2' using PigStorage(';') as (numberInfo, collection);
out = foreach data1{
Data_filter = FILTER data2 by (numberInfo matches CONCAT(number,''));
generate Data_filter;
}
Is it possible do this inside a for loop ? Please let me know. Thanks in advance !
There are no for loops in Apache Pig, if you need to iterate through each row of the data for some specific purpose you need to implement your own UDF. The foreach keyword is not used to create a loop, it is used to transform your data based on your columns, applying UDFs to it. You can also use a nested foreach, where you perform operations over each group in your relation.
However, your syntax is wrong. You are trying to use a nested foreach without grouping your data first. What a nested foreach does, is perform the operations you define in the block of code over a grouped relation. Therefore, the only way your code could work is by grouping the data first:
data1 = load '/File1' using PigStorage(';') as (id,number);
data2 = load '/File2' using PigStorage(';') as (numberInfo, collection);
data1 = group data1 by id;
out = foreach data1{
Data_filter = FILTER data2 by (numberInfo matches CONCAT(number,''));
generate Data_filter;
}
However, this won't work because inside a nested foreach you cannot refer to a different relation like data2.
What you really want, is a JOIN operation over both relations using number for data1 and numberInfo for data2. This will give you this:
joined_data = join data1 by number, data2 by numberInfo;
dump joined_data;
(12345,11,11,(1,2,3,4,5,6,7,8,9))
(34567,12,12,(9,8,7,6,5,4,3,2,1))
In your question you said you only wanted as output the last column, so now you can use a foreach to generate the column you want:
final_data = foreach joined_data generate data2::collection;
dump final_data;
((1,2,3,4,5,6,7,8,9))
((9,8,7,6,5,4,3,2,1))

Error when using MAX in Apache Pig (Hadoop)

I am trying to calculate maximum values for different groups in a relation in Pig. The relation has three columns patientid, featureid and featurevalue (all int).
I group the relation based on featureid and want to calculate the max feature value of each group, heres the code:
grpd = GROUP features BY featureid;
DUMP grpd;
temp = FOREACH grpd GENERATE $0 as featureid, MAX($1.featurevalue) as val;
Its giving me Invalid scalar projection: grpd Exception. I read on different forums that MAX takes in a "bag" format for such functions, but when I take the dump of grpd, it shows me a bag format. Here's a small part of the output from the dump:
(5662,{(22579,5662,1)})
(5663,{(28331,5663,1),(2624,5663,1)})
(5664,{(27591,5664,1)})
(5665,{(30217,5665,1),(31526,5665,1)})
(5666,{(27783,5666,1),(30983,5666,1),(32424,5666,1),(28064,5666,1),(28932,5666,1)})
(5667,{(31257,5667,1),(27281,5667,1)})
(5669,{(31041,5669,1)})
Whats the issue ?
The issue was with column addressing, heres the correct working code:
grpd = GROUP features BY featureid;
temp = FOREACH grpd GENERATE group as featureid, MAX(features.featurevalue) as val;

Resources