Related
I want to Union/Merge two files using pig. But, this is a different union than a usual union. Following are my files (h* are header of files) :
F1 :
h1,h2,h3,h4
a01,a02,a03,a04
a11,a12,a13,a14
F2 :
h3,h4,h5,h6
a23,a24,b01,b02
a33,a34,b11,b12
The resulting output must be a Union of these files like this :
FR :
h1,h2,h3,h4,h5,h6
a01,a02,a03,a04,,
a11,a12,a13,a14,,
,,a23,a24,b01,b02
,,a33,a34,b11,b12
One more difficulty is I want to make it generic so that it works for dynamic number of common columns. Currently there are two common columns, it could have 3 or 1 common column or even no common column at all. For example :
F1 :
h1,h2,h3,h4
a1,a2,a3,a4
F2
h5,h6,h7,h8
b1,b2,b3,b4
FR
a1,a2,a3,a4
,,,,b1,b2,b3,b4
Any hint/help is appreciable.
Here is how you can do it statically:
F1full = FOREACH F1 GENERATE h1,h2,h3,h4, NULL as h5, NULL as h6;
F2full = FOREACH F2 GENERATE NULL as h1,NULL as h2,h3,h4, h5, h6;
FR = F1full UNION F2full;
Pig is not very flexible, so I don't think it is possible to generate this dynamically/for the generic case.
If you would want a solution for the generic case, you could use a language like python to build the required command based on metadata of stored tables/files.
I tried to solve the problem using following approach :
1) Load both of the files.
2) Add counter to generate a unique field (ID).
3) Start the counter for file B where counter for A ended.
4) Cogroup both files with common columns, including counteer.
5) Take all group columns in a different schema.
6) Generate uncommon columns from both files, along with the counter.
7) First join uncommon columns from file A with group columns on counter.
8) Join the result of step 7 with uncommon columns from file B on counter.
Following is the pig script to do the same. As this script is generic, I have mentioned what all parameters will be required before running the script.
-- Parameters required : $file1_path, $file2_path, $file1_schema, $file2_schema, $COUNT_A (number of rows in file A), $CMN_COLUMN_A (common columns in A), $CMN_COLUMN_B, $UNCMN_COLUMN_A(Unique columns in file A), $UNCMN_COLUMN_B.
A = LOAD '$file1_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as ($file1_schema);
B = LOAD '$file2_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as ($file2_schema);
RANK_A = RANK A;
RANK_B = RANK B;
COUNT_RANK_B = FOREACH RANK_B GENERATE ($0+(long)'$COUNT_A') as rank_B, $1 ..;
COGRP_RANK_AB = COGROUP RANK_A BY($CMN_COLUMN_A), COUNT_RANK_B BY ($CMN_COLUMN_B);
CMN_COGRP_RANK_AB = FOREACH COGRP_RANK_AB GENERATE FLATTEN(group) AS ($CMN_COLUMN_A);
UNCMN_RB = FOREACH COUNT_RANK_B GENERATE $UNCMN_COLUMN_B;
JOIN_CMN_UNCMN_A = JOIN CMN_COGRP_RANK_AB BY(rank_A) LEFT OUTER, UNCMN_RA by rank_A;
JOIN_CMN_UNCMN_B = JOIN JOIN_CMN_UNCMN_A BY(CMN_COGRP_RANK_AB::rank_A) LEFT OUTER, UNCMN_RB by rank_B;
STORE FINAL_DATA INTO '$store_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');
I am a student learning how to use Pig script using the hortonworks sandbox. My problem is that I am not able to use the SUM function properly. I have successfully separated the fields of a firewall log and I am able to do perform several queries and use the count function... but no luck with the SUM function which I really need in one case. This code I used below:
A = FOREACH logs_base GENERATE device_id,src,src_port,dst,dst_port,tran_ip,tran_port,service,duration,sent,rcvd,sent_pkt,rcvd_pkt,SN,user,group1, REGEX_EXTRACT(date, '\\d{3}-(\\d{2})-\\d{2}', 1) AS(month:chararray);
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
counter = foreach grpd1 {
sum1 = SUM(A.rcvd);
sum2 = SUM(A.sent);
generate sum1, sum2;
};
dump counter;
C = foreach F1 generate rcvd, sent;
dump C;
When I dump just the variable C I get a result displaying many records indicating the amount of data received/sent for the filter applied. eg:
(223,123)
(334,444)
(21,12344)
(...,...)
All I really want to do is add all those records together and show that total amount of received and sent: (?,?).
Note: I have tried changing the variable type to int, long, and chararray with no success either.
Some of the errors I am getting while trying to solve this are:
Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
First make sure that the fields that you are summing up are of type int
Use - DESCRIBE A; to check the data type
After that, I think since you have used filter condition and then used group by on F1 -
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
So, while summing up you should use F1 instead of A -
counter = foreach grpd1 {
sum1 = SUM(F1.rcvd);
sum2 = SUM(F1.sent);
generate sum1, sum2;
};
Use DESCRIBE grpd1; and you will understand what I am trying to say, there will be no 'A'
I guess this should solve the error. Finally, check the logic of what you want in the result I have not checked that. Hope this helps.
PS - I am also a student and new to PIG.
A lucky guess here, I'm new to Pig too :)
I'm not sure if SUM can be casted to chararray(that would explain the error), so make rcvd and sent type:int and then generate the 2 sums for grpd1 bag:
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
C1 = foreach grpd1 generate SUM(F1.rcvd);
dump C1;
C2 = foreach grpd1 generate SUM(F1.sent);
dump C2;
NOTE: More info here.
Hope I helped a little!
Please try the following
A = FOREACH logs_base GENERATE device_id,src,src_port,dst,dst_port,tran_ip,tran_port,service,duration,sent,rcvd,sent_pkt,rcvd_pkt,SN,user,group1, REGEX_EXTRACT(date, '\\d{3}-(\\d{2})-\\d{2}', 1) AS(month:chararray);
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
C = foreach F1 generate group,SUM(F1.rcvd), SUM(F1.sent);
dump C;
I have a file as:
1,Mary,5
1,Tom,5
2,Bill,5
2,Sue,4
2,Theo,5
3,Mary,5
3,Cindy,5
4,Andrew,4
4,Katie,4
4,Scott,5
5,Jeff,3
5,Sara,4
5,Ryan,5
6,Bob,5
6,Autumn,4
7,Betty,5
7,Janet,5
7,Scott,5
8,Andrew,4
8,Katie,4
8,Scott,5
9,Mary,5
9,Tom,5
10,Bill,5
10,Sue,4
10,Theo,5
11,Mary,5
11,Cindy,5
12,Andrew,4
12,Katie,4
12,Scott,5
13,Jeff,3
13,Sara,4
13,Ryan,5
14,Bob,5
14,Autumn,4
15,Betty,5
15,Janet,5
15,Scott,5
16,Andrew,4
16,Katie,4
16,Scott,5
I want the answer with names most appeared i.e max
(Scott,6)
There's some ambiguity in your question.
What exactly do you want.
Do you want a list of user count in descending order?
OR
Do you want just (scott,6) i.e. only one user with maximum count?
I have successfully solved both the things,on the sample data which you gave.
If the question is of first type then,
a = load '/file.txt' using PigStorage(',') as (id:int,name:chararray,number:int);
g = group a by name;
g1 = foreach g{
generate group as g , COUNT(a) as cnt;
};
toptemp = group g1 all;
final = foreach toptemp{
sorted = order g1 by cnt desc;
GENERATE flatten(sorted);
};
This will give you a list of users in descending order as,
(Scott,6)
(Katie,4)
(Andrew,4)
(Mary,4)
(Bob,2)
(Sue,2)
(Tom,2)
(Bill,2)
(Jeff,2)
(Ryan,2)
(Sara,2)
(Theo,2)
(Betty,2)
(Cindy,2)
(Janet,2)
(Autumn,2)
If the question is of second type then,
a = load '/file.txt' using PigStorage(',') as (id:int,name:chararray,number:int);
g = group a by name;
g1 = foreach g{
generate group as g , COUNT(a) as cnt;
};
toptemp = group g1 all;
final = foreach toptemp{
sorted = order g1 by cnt desc;
top = limit sorted 1;
GENERATE flatten(top);
};
This gives us only one result ,
(Scott,6)
Thanks.I Hope it helps.
I've got an example where we're trying to do what appears to be a simple join:
A = load 'data6' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
B = load 'data7' as ( v:chararray, r:chararray );
grunt> cat data1
'item1' 111 { ('thing1', 222, {('value1'),('value2')}) }
grunt> cat data2
'value1' 'result1'
'value2' 'result2'
We want to join the 'result1', 'result2' data of data2 into the entry in data1, on the obvious value field.
We managed to flatten it:
A = load 'data6' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
B = load 'data7' as ( v:chararray, r:chararray );
F1 = foreach A generate item, d, flatten(things);
F2 = foreach F1 generate item..d1, flatten(values);
Then we joined the 2nd dataset in:
J = join F2 by v, B by v
J1 = foreach J generate item as item, d as d, thing as thing, d1 as d1, F2::things::values::v as v, r as r; --Remove duplicate field & clean up naming
dump J1
('item1',111,'thing1',222,'value1','result1')
('item1',111,'thing1',222,'value2','result2')
Now we need to call a UDF function once for each item, so we need to re-group those 2 levels of bags. Each item has 0 or more things, and each thing has 0 or more values, and the values now may or may not have a result.
How do we get back to:
('item1', 111, { 'thing1', 222, { ('value1, 'result1'), ('value2', 'result2') }
All of my attempts at grouping and re-joining have exploded in complexity, failed to produce the correct result, and run in 4+ mapreduce jobs what should be 1 mapreduce job in Hadoop.
The following code may work, R2 is the final result:
group_by_item_d_thing_d1 = group J1 by item, d, thing, d1;
R1 = foreach group_by_item_d_thing_d1 generate group.item, group.d, group.thing, group.d1, J1;
group_by_item_d = group R1 by item, d;
R2 = foreach group_by_item_d generate group.item, group.d, R1;
I have a question on Pig when performing what seems like two levels of groupings. As an example, let's say I had some example input data like:
email_id:chararray from:chararray to:bag{recipients:tuple(recipient:chararray)}
e1 user1#example.com {(friend1#example.com),(friend2#example.com),(friend3#myusers.com)}
e2 user1#example.com {(friend1#example.com),(friend4#example.com)}
e3 user1#example.com {(friend5#example.com)}
e4 user2#example.com {(friend2#example.com),(friend4#example.com)}
So each line is an email from a user "from" to user(s) "to".
And I ultimately want a list of all senders and all the people they've sent emails to, including the # of emails sent for each person, sorted from highest to lowest, for example:
user1#example.com {(friend1#example.com, 2), (friend2#example.com, 1), (friend3#example.com, 1), (friend4#example.com, 1), (friend5#example.com, 1)}
user2#example.com {(friend2#example.com, 1), (friend4#example.com, 1)}
Ideas on the best way to tackle this in Pig would be appreciated!
Here is one version of the script:
inpt = load '/pig_data/pig_fun/input/from_senders.txt' as (email_id:chararray, from:chararray, to:bag{recipients:tuple(recipient:chararray)});
pivot = foreach inpt generate from, FLATTEN(to);
pivot = foreach pivot generate from, to::recipient as recipient;
dump pivot;
/*
(user1#example.com,friend1#example.com)
(user1#example.com,friend2#example.com)
(user1#example.com,friend3#myusers.com)
(user1#example.com,friend1#example.com)
(user1#example.com,friend4#example.com)
(user1#example.com,friend5#example.com)
(user2#example.com,friend2#example.com)
(user2#example.com,friend4#example.com)
*/
grp = group pivot by (from, recipient);
with_count = foreach grp generate FLATTEN(group), COUNT(pivot) as count;
dump with_count;
/*
(user1#example.com,friend1#example.com,2)
(user1#example.com,friend2#example.com,1)
(user1#example.com,friend3#myusers.com,1)
(user1#example.com,friend4#example.com,1)
(user1#example.com,friend5#example.com,1)
(user2#example.com,friend2#example.com,1)
(user2#example.com,friend4#example.com,1)
*/
to_bag = group with_count by from;
result = foreach to_bag {
order_by_count = order with_count by count desc;
generate group as from, order_by_count.(recipient, count);
};
dump result;
/*
(user1#example.com,{(friend1#example.com,2),(friend2#example.com,1),(friend3#myusers.com,1),(friend4#example.com,1),(friend5#example.com,1)})
(user2#example.com,{(friend2#example.com,1),(friend4#example.com,1)})
*/
Hope it helps.