How to get DISTINCT values of a group of fields in PIG?

How to get DISTINCT values of a group of fields in PIG? - elasticsearch

Is it Possible to get the following output in PIG ? Will i be able to use Group by 1st and 2nd field and then do DISTINCT on 3rd field ?
For example
I have input data
12345|9658965|52145
12345|9658965|52145
12345|9658965|52145
23456|8541232|96589
23456|8541232|96585
I want output something like
12345|9658965|52145
23456|8541232|96589
23456|8541232|96585

Approach 1 : Using DISTINCT
Ref : http://pig.apache.org/docs/r0.12.0/basic.html#distinct
DISTINCT operator should help
test = LOAD 'test.csv' USING PigStorage('|');
distinct_recs = DISTINCT test;
DUMP distinct_recs;
Approach 2 : GROUP BY all fields
test = LOAD 'test.csv' USING PigStorage('|');
grp_all_fields = GROUP test BY ($0,$1,$2);
uniq_recs = FOREACH grp_all_fields GENERATE FLATTEN(group);
DUMP uniq_recs;
Both approaches are giving the expected output for the input shared.

Try this , its pretty similar :
A = LOAD 'test.csv' USING PigStorage('|') as (a1,a2,a3);
unique =
FOREACH (GROUP A BY a3) {
b = A.(a1,a2);
s = DISTINCT b;
GENERATE FLATTEN(s), group AS a4;
};

Related

removing empty rows in Pig

I have a data set like blow:
1,abc,10000
,zxcv,2000
, , ,
4,xyz,50000
I want output like:
1,abc,10000
zxcv,2000
4,xyz,50000
How can I achieve this task?
I.e I want to remove the empty rows and null values.

Try Filter Operator :
A = LOAD 'your_data' USING PigStorage() AS (your fields);
B = FILTER A BY your_fields IS NOT NULL;

Distinct records based on multiple fields using Pig

Input data set:
field1,field2,field3,field4,field5
101,a1,a11,a111,a1111
102,a1,a11,a111,a1111
103,a1,a11,a111,a1111
201,b1,b11,b111,b1111
202,b1,b11,b111,b1111
Below query will give distinct records in Pig.
details = load 'emp.csv' using PigStorage(',') AS (field1:chararray,field2:chararray,field3:chararray,field4:chararray,field5:chararray);
distinct_detials = DISTINCT details;
I have a use case where I need to get distinct records based on field2,field3,field4.
Expected output is
101,a1,a11,a111,a1111
202,b1,b11,b111,b1111

You can use a nested foreach to accomplish what you want:
details = load 'emp.csv' using PigStorage(',') AS (field1:chararray,field2:chararray,field3:chararray,field4:chararray,field5:chararray);
distinct_detials = foreach (GROUP details by (field2, field3, field4) ) {
temp_rel = details.(field1, field5);
temp_limit = LIMIT temp_rel 1;
generate FLATTEN(temp_limit) as (field1, field5), FLATTEN(group) as (field2, field3, field4);
}
DUMP distinct_details;
This will give the following output:
(103,a1111,a1,a11,a111)
(202,b1111,b1,b11,b111)
You can further use a foreach on distinct_details to bring the fields in order.

Get value for unique record using Pig

Below is the input data set.
col1,col2,col3,col4,col5
key1,111,1,12/11/2016,10
key2,111,1,12/11/2016,10
key3,111,1,12/11/2016,10
key4,222,2,12/22/2016,10
key5,222,2,12/22/2016,10
key6,333,3,12/30/2016,10
key7,111,0,12/11/2016,10
Based on col2,col3,col4 will give unique record, I need to get any one value from col1 for the unique record, and populate as new field say col6. The expected output below
col1,col2,col3,col4,col5,col6
key1,111,1,12/11/2016,10,key3
key2,111,1,12/11/2016,10,key3
key3,111,1,12/11/2016,10,key3
key4,222,2,12/22/2016,10,key5
key5,222,2,12/22/2016,10,key5
key6,333,3,12/30/2016,10,key6
key7,111,0,12/11/2016,10,key7
Below is the script, I am getting error.
A = load 'test1.csv' using PigStorage(',');
B = GROUP A by ($1,$2,$3);
C = FOREACH B GENERATE FLATTEN(group), MAX(A.$0);
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2106: Error executing an algebraic function

Looks like a good use case to use Nested Foreach
Ref : https://pig.apache.org/docs/r0.14.0/basic.html#foreach
Input :
key1,111,1,12/11/2016,10
key2,111,1,12/11/2016,10
key3,111,1,12/11/2016,10
key4,222,2,12/22/2016,10
key5,222,2,12/22/2016,10
key6,333,3,12/30/2016,10
key7,111,0,12/11/2016,10
PigScript
A = load 'input.csv' using PigStorage(',') AS (col1:chararray,col2:chararray,col3:chararray,col4:chararray,col5:chararray);
B = FOREACH(GROUP A BY (col2,col3,col4)) {
ordered = ORDER A BY col1 DESC;
latest = LIMIT ordered 1;
GENERATE FLATTEN(A) AS (col1:chararray,col2:chararray,col3:chararray,col4:chararray,col5:chararray), FLATTEN(latest.col1) AS col6:chararray;
};
DUMP B;
Output :
(key1,111,1,12/11/2016,10,key3)
(key2,111,1,12/11/2016,10,key3)
(key3,111,1,12/11/2016,10,key3)
(key4,222,2,12/22/2016,10,key5)
(key5,222,2,12/22/2016,10,key5)
(key6,333,3,12/30/2016,10,key6)
(key7,111,0,12/11/2016,10,key7)

Union Two files by column using pig

I want to Union/Merge two files using pig. But, this is a different union than a usual union. Following are my files (h* are header of files) :
F1 :
h1,h2,h3,h4
a01,a02,a03,a04
a11,a12,a13,a14
F2 :
h3,h4,h5,h6
a23,a24,b01,b02
a33,a34,b11,b12
The resulting output must be a Union of these files like this :
FR :
h1,h2,h3,h4,h5,h6
a01,a02,a03,a04,,
a11,a12,a13,a14,,
,,a23,a24,b01,b02
,,a33,a34,b11,b12
One more difficulty is I want to make it generic so that it works for dynamic number of common columns. Currently there are two common columns, it could have 3 or 1 common column or even no common column at all. For example :
F1 :
h1,h2,h3,h4
a1,a2,a3,a4
F2
h5,h6,h7,h8
b1,b2,b3,b4
FR
a1,a2,a3,a4
,,,,b1,b2,b3,b4
Any hint/help is appreciable.

Here is how you can do it statically:
F1full = FOREACH F1 GENERATE h1,h2,h3,h4, NULL as h5, NULL as h6;
F2full = FOREACH F2 GENERATE NULL as h1,NULL as h2,h3,h4, h5, h6;
FR = F1full UNION F2full;
Pig is not very flexible, so I don't think it is possible to generate this dynamically/for the generic case.
If you would want a solution for the generic case, you could use a language like python to build the required command based on metadata of stored tables/files.

I tried to solve the problem using following approach :
1) Load both of the files.
2) Add counter to generate a unique field (ID).
3) Start the counter for file B where counter for A ended.
4) Cogroup both files with common columns, including counteer.
5) Take all group columns in a different schema.
6) Generate uncommon columns from both files, along with the counter.
7) First join uncommon columns from file A with group columns on counter.
8) Join the result of step 7 with uncommon columns from file B on counter.
Following is the pig script to do the same. As this script is generic, I have mentioned what all parameters will be required before running the script.
-- Parameters required : $file1_path, $file2_path, $file1_schema, $file2_schema, $COUNT_A (number of rows in file A), $CMN_COLUMN_A (common columns in A), $CMN_COLUMN_B, $UNCMN_COLUMN_A(Unique columns in file A), $UNCMN_COLUMN_B.
A = LOAD '$file1_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as ($file1_schema);
B = LOAD '$file2_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as ($file2_schema);
RANK_A = RANK A;
RANK_B = RANK B;
COUNT_RANK_B = FOREACH RANK_B GENERATE ($0+(long)'$COUNT_A') as rank_B, $1 ..;
COGRP_RANK_AB = COGROUP RANK_A BY($CMN_COLUMN_A), COUNT_RANK_B BY ($CMN_COLUMN_B);
CMN_COGRP_RANK_AB = FOREACH COGRP_RANK_AB GENERATE FLATTEN(group) AS ($CMN_COLUMN_A);
UNCMN_RB = FOREACH COUNT_RANK_B GENERATE $UNCMN_COLUMN_B;
JOIN_CMN_UNCMN_A = JOIN CMN_COGRP_RANK_AB BY(rank_A) LEFT OUTER, UNCMN_RA by rank_A;
JOIN_CMN_UNCMN_B = JOIN JOIN_CMN_UNCMN_A BY(CMN_COGRP_RANK_AB::rank_A) LEFT OUTER, UNCMN_RB by rank_B;
STORE FINAL_DATA INTO '$store_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');

Compare tuples on basis of a field in pig

(ABC,****,tool1,12)
(ABC,****,tool1,10)
(ABC,****,tool1,13)
(ABC,****,tool2,101)
(ABC,****,tool3,11)
Above is input data
Following is my dataset in pig.
Schema is : Username,ip,tool,duration
I want to add duration of same tools
Output
(ABC,****,tool1,35)
(ABC,****,tool2,101)
(ABC,****,tool3,11

Use GROUP BY and use SUM on the duration.
A = LOAD 'data.csv' USING PigStorage(',') AS (Username:chararray,ip:chararray,tool:chararray,duration:int);
B = GROUP A BY (Username,ip,tool);
C = FOREACH B GENERATE FLATTEN(group) AS (Username,ip,tool),SUM(A.duration);
DUMP C;

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

How to get DISTINCT values of a group of fields in PIG? - elasticsearch

Try this , its pretty similar : A = LOAD 'test.csv' USING PigStorage('|') as (a1,a2,a3); unique = FOREACH (GROUP A BY a3) { b = A.(a1,a2); s = DISTINCT b; GENERATE FLATTEN(s), group AS a4; };

Related

removing empty rows in Pig

Distinct records based on multiple fields using Pig

Get value for unique record using Pig

Union Two files by column using pig

Compare tuples on basis of a field in pig

Categories

Resources