Pig: How to flatten & re-join bags within bags - hadoop

I've got an example where we're trying to do what appears to be a simple join:
A = load 'data6' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
B = load 'data7' as ( v:chararray, r:chararray );
grunt> cat data6
'item1' 111 { ('thing1', 222, {('value1'),('value2')}) }
grunt> cat data7
'value1' 'result1'
'value2' 'result2'
We want to join the 'result1', 'result2' data of data7 into the entry in data6, on the obvious value field.
We managed to flatten it:
A = load 'data6' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
B = load 'data7' as ( v:chararray, r:chararray );
F1 = foreach A generate item, d, flatten(things);
F2 = foreach F1 generate item..d1, flatten(values);
Then we joined the 2nd dataset in:
J = join F2 by v, B by v;
J1 = foreach J generate item as item, d as d, thing as thing, d1 as d1, F2::things::values::v as v, r as r; --Remove duplicate field & clean up naming
dump J1;
('item1',111,'thing1',222,'value1','result1')
('item1',111,'thing1',222,'value2','result2')
Now we need to call a UDF once for each item, so we need to re-group those 2 levels of bags. Each item has 0 or more things, each thing has 0 or more values, and each value may or may not have a result.
How do we get back to:
('item1', 111, { ('thing1', 222, { ('value1', 'result1'), ('value2', 'result2') }) })
All of my attempts at grouping and re-joining have exploded in complexity, failed to produce the correct result, and taken 4+ MapReduce jobs for what should be a single MapReduce job in Hadoop.

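The following code may work; R2 is the final result:
-- Group keys with more than one field must be wrapped in parentheses.
group_by_item_d_thing_d1 = group J1 by (item, d, thing, d1);
-- Keep only (v, r) in the inner bag so the group keys are not repeated.
R1 = foreach group_by_item_d_thing_d1 generate group.item as item, group.d as d, group.thing as thing, group.d1 as d1, J1.(v, r) as values;
group_by_item_d = group R1 by (item, d);
R2 = foreach group_by_item_d generate group.item as item, group.d as d, R1.(thing, d1, values) as things;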
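To check the target shape independently of Pig, the two-level regrouping can be modeled in a few lines of Python. This is only an illustrative sketch, not part of the Pig script: the function name regroup is made up here, and bags are represented as Python lists. It rebuilds the nested structure from the two flat rows dumped as J1.

```python
from collections import OrderedDict


def regroup(rows):
    """Nest flat (item, d, thing, d1, v, r) rows back into two levels of bags."""
    items = OrderedDict()
    for item, d, thing, d1, v, r in rows:
        # Outer level: one entry per (item, d).
        things = items.setdefault((item, d), OrderedDict())
        # Inner level: one entry per (thing, d1), holding its (v, r) pairs.
        things.setdefault((thing, d1), []).append((v, r))
    return [
        (item, d, [(thing, d1, values) for (thing, d1), values in things.items()])
        for (item, d), things in items.items()
    ]


# The two rows shown in the dump of J1.
j1 = [
    ("item1", 111, "thing1", 222, "value1", "result1"),
    ("item1", 111, "thing1", 222, "value2", "result2"),
]
nested = regroup(j1)
print(nested)
```

Each item ends up as a single record, which is the shape needed to call a UDF once per item.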

Related

Discarding nulls after full outer join in PIG

I need help discarding nulls in the result of a full outer join in Pig Latin. Below are the two data sets:
A:
(BOS,2)
(BUR,81)
(LAS,8)
B:
(BUR,56)
(EWR,2)
(LAS,88)
After the full outer join:
C:
(BOS,2,,)
(BUR,81,BUR,56)
(,,EWR,2)
(LAS,8,LAS,88)
I need to get the output in the format below:
(BOS,2)
(BUR,137)
(EWR,2)
(LAS,96)
I tried different combinations of GROUP BY, FLATTEN and BagToTuple, but was not able to figure out the solution. Many thanks for any help.
airline = load '/demo/data/airline/airline.csv' using PigStorage(',') as (Origin: chararray, Dest: chararray);
traffic_in = GROUP airline by Origin;
traffic_in_count= FOREACH traffic_in generate group as Origin , COUNT(airline) as count ;
traffic_out = GROUP airline by Dest;
traffic_out_count = FOREACH traffic_out generate group as Dest ,COUNT (airline) as count;
traffic_top = JOIN traffic_in_count by Origin FULL OUTER , traffic_out_count by Dest ;
EDIT
Instead of using OUTER JOIN use UNION and then SUM the 2nd column values.
A = LOAD 'test1.txt' using PigStorage(',') as (A1:chararray, A2:int);
B = LOAD 'test2.txt' using PigStorage(',') as (B1:chararray, B2:int);
C = UNION A,B;
D = GROUP C BY $0;
E = FOREACH D GENERATE group,SUM(C.$1);
DUMP E;
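The reason UNION followed by SUM avoids the null problem can be seen in a small Python model of the same steps. This is a sketch for illustration only; the helper name union_and_sum is made up, and the input rows are the A and B relations from the question.

```python
from collections import defaultdict

# Relations A and B from the question.
a = [("BOS", 2), ("BUR", 81), ("LAS", 8)]
b = [("BUR", 56), ("EWR", 2), ("LAS", 88)]


def union_and_sum(*relations):
    """Concatenate the relations (UNION), then sum the counts per key (GROUP + SUM)."""
    totals = defaultdict(int)
    for relation in relations:
        for key, count in relation:
            totals[key] += count
    return sorted(totals.items())


e = union_and_sum(a, b)
print(e)
```

Keys that occur in only one relation simply contribute a single value to the sum, so no null columns are ever produced.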

Pig - MAX is not working after grouping

I am working with Pig 0.12.1 and MapR. I am trying to find the max of a field after grouping the relation on another field. The Pig script is below, with the structure of each relation in the comments:
r1 = foreach SomeRelation generate flatten(group) as (c1 , c2);
-- r1: {c1: biginteger,c2: biginteger}
r2 = group r1 by c1;
-- r2: {group: chararray,r1: {(c1: chararray,c2: biginteger)}}
DUMP r2;
/* output -
1234|{(1234,9876)}
2345|{(2345,8765)}
3456|{(3456,7654)}
4567|{(4567,6543)}
*/
r3 = foreach r2 generate group as c1, MAX(r1.c2) as c2;
I am getting the following error:
Could not infer the matching function for org.apache.pig.builtin.MAX as multiple or none of them fit. Please use an explicit cast.
Script explained: I am flattening the group of SomeRelation into c1 and c2, and then regrouping on c1 to generate the max of c2 for each c1 group.
Please suggest.
I'm not sure you can use the group keyword inside flatten. Also, have you considered tokenizing the group before flattening it? See this example:
load_data = LOAD '/PIG_TESTS_ALL/WordCount' as (line);
tokenizing_data = FOREACH load_data generate flatten(TOKENIZE(line)) as word;
group_data = GROUP tokenizing_data by word;
Result = FOREACH group_data generate group,COUNT(tokenizing_data);
dump Result;
This is actually a word count, but you can probably build on it to find the max value, depending on what you want to do.
Well, it looks like the problem is that Pig doesn't allow MAX (or, for that matter, other aggregate functions such as SUM) on biginteger. I had to use long as the datatype for this to work:
r1 = foreach SomeRelation generate flatten(group) as (c1 , c2:long);
-- r1: {c1: biginteger,c2: long}
Strangely, no documentation highlights this; the biginteger and bigdecimal datatypes seem to be barely documented at all.
We now know the problem was the inability of MAX to handle biginteger.
You should be able to group and get the max like this, and compare the result with a combination of ORDER + LIMIT:
r1 = FOREACH SomeRelation GENERATE FLATTEN(group) AS (c1, c2);
r3 = FOREACH (group r1 by c1) {
-- you may want to apply a function on a single column
-- or compare sort + limit to MAX
list = ORDER $1 BY c2 DESC;
list_max = LIMIT list 1;
GENERATE group AS c1, MAX(r1.c2) AS c2, list_max;
}

Equivalent of Union_map in pig

I have been trying to find the equivalent of union_map() in Pig. I know that the TOMAP function produces the MAP datatype.
But the requirement is to combine all the MAPs for a given ID, as shown below.
select I1,UNION_MAP(MAP(Key,Val)) as new_val group by I1;
Sample Input and result is provided below.
Input
ID,Key,Val
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
select ID,UNION_MAP(TO_MAP(Key,VAL)) from table group by ID;
Result
ID1,(K1#V7,K2#V4)
ID2,(K1#V2,K3#V3)
I would like to get similar output in Pig.
Download piggybank.jar from http://www.java2s.com/Code/Jar/p/Downloadpiggybankjar.htm, add it to your classpath, and try the approach below.
input
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
PigScript:
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input' USING PigStorage(',') AS (ID:chararray,Key:chararray,Val:chararray);
B = RANK A;
C = GROUP B BY (ID,Key);
D = FOREACH C {
sortByRank = ORDER B BY rank_A DESC;
top1 = LIMIT sortByRank 1;
GENERATE FLATTEN(top1);
}
E = GROUP D BY top1::ID;
F = FOREACH E {
ToMap = FOREACH D GENERATE TOMAP(top1::Key,top1::Val);
GENERATE group,BagToTuple(ToMap) AS myMap;
}
DUMP F;
Output:
(ID1,([K1#V7],[K2#V4]))
(ID2,([K1#V2],[K3#V3]))
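The rule the script implements (for each ID, the last value seen for a key wins) can be stated compactly in Python. This is an illustrative sketch only; the function name union_map is borrowed from the question for readability. Note one difference: the Pig script above returns a tuple of single-entry maps per ID, whereas this sketch merges them into one map per ID.

```python
# Input rows from the question: (ID, Key, Val), in file order.
rows = [
    ("ID1", "K1", "V1"),
    ("ID2", "K1", "V2"),
    ("ID2", "K3", "V3"),
    ("ID1", "K2", "V4"),
    ("ID1", "K1", "V7"),
]


def union_map(rows):
    """Merge all (key, value) pairs per ID; later rows overwrite earlier ones."""
    merged = {}
    for id_, key, val in rows:
        merged.setdefault(id_, {})[key] = val
    return merged


result = union_map(rows)
print(result)
```

This mirrors the RANK / ORDER BY rank DESC / LIMIT 1 steps, which keep the highest-ranked (last) row for each (ID, Key) pair.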

Count and find maximum number in Hadoop using pig

I have a table containing sample CDR data, in which column A holds the calling person's mobile number and column B holds the called person's mobile number.
I need to find who made the maximum number of calls (column A),
and also which number (column B) was called most.
The table structure is as follows:
calling called
889578226 77382596
889582256 77382596
889582256 7736368296
7785978214 782987522
In the above table, 889582256 has the most outgoing calls and 77382596 is the most-called number; that is the kind of output I need.
In Hive I run the following:
SELECT calling_a,called_b, COUNT(called_b) FROM cdr_data GROUP BY calling_a,called_b;
What would be the equivalent of the above query in Pig?
Anas, could you please let me know whether this is what you are expecting, or something different?
input.txt
a,100
a,101
a,101
a,101
a,103
b,200
b,201
b,201
c,300
c,300
c,301
d,400
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray,phone:long);
B = GROUP A BY (name,phone);
C = FOREACH B GENERATE FLATTEN(group),COUNT(A) AS cnt;
D = GROUP C BY $0;
E = FOREACH D {
SortedList = ORDER C BY cnt DESC;
top = LIMIT SortedList 1;
GENERATE FLATTEN(top);
}
DUMP E;
Output:
(a,101,3)
(b,201,2)
(c,300,2)
(d,400,1)
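The same count-then-top-1 logic can be cross-checked with a short Python sketch. It is only an illustration of what the script computes, using the rows of input.txt above; when two phones tie for a name, this sketch keeps the first one encountered, which is an assumption and not necessarily what Pig's LIMIT would pick.

```python
from collections import Counter

# Rows of input.txt: (name, phone).
rows = [
    ("a", 100), ("a", 101), ("a", 101), ("a", 101), ("a", 103),
    ("b", 200), ("b", 201), ("b", 201),
    ("c", 300), ("c", 300), ("c", 301),
    ("d", 400),
]

# GROUP BY (name, phone) + COUNT.
counts = Counter(rows)

# Per name, keep the (name, phone, cnt) triple with the highest count.
best = {}
for (name, phone), cnt in counts.items():
    if name not in best or cnt > best[name][2]:
        best[name] = (name, phone, cnt)

top = [best[name] for name in sorted(best)]
print(top)
```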

select count distinct using pig latin

I need help with this Pig script. I am getting only a single record. I am selecting 2 columns and doing a count(distinct) on another, while also using a where-like clause to find a particular description (desc).
Here's the SQL I am trying to code in Pig.
/*
For example in sql:
select domain, count(distinct(segment)) as segment_cnt
from table
where desc='ABC123'
group by domain
order by segment_cnt desc;
*/
A = LOAD 'myoutputfile' USING PigStorage('\u0005')
AS (
domain:chararray,
segment:chararray,
desc:chararray
);
B = filter A by (desc=='ABC123');
C = foreach B generate domain, segment;
D = DISTINCT C;
E = group D all;
F = foreach E generate group, COUNT(D) as segment_cnt;
G = order F by segment_cnt DESC;
You could GROUP on each domain and then count the number of distinct elements in each group with a nested FOREACH syntax:
D = group C by domain;
E = foreach D {
unique_segments = DISTINCT C.segment;
generate group, COUNT(unique_segments) as segment_cnt;
};
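To make the intended result concrete, here is a small Python model of the filter, group-by-domain and distinct-count steps. It is a sketch under stated assumptions: the sample rows and the domain and segment values are entirely hypothetical, since the question does not show its data.

```python
# Hypothetical sample rows: (domain, segment, desc).
rows = [
    ("a.com", "s1", "ABC123"),
    ("a.com", "s1", "ABC123"),  # duplicate segment, counted once
    ("a.com", "s2", "ABC123"),
    ("b.com", "s1", "ABC123"),
    ("b.com", "s3", "XYZ"),     # filtered out by desc
]

# where desc = 'ABC123', then collect the distinct segments per domain.
segments = {}
for domain, segment, desc in rows:
    if desc == "ABC123":
        segments.setdefault(domain, set()).add(segment)

# count(distinct segment) per domain, ordered by the count descending.
segment_cnt = sorted(
    ((domain, len(segs)) for domain, segs in segments.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(segment_cnt)
```

Grouping by domain rather than with GROUP ... ALL is what yields one record per domain instead of a single record overall.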
You can better define this as a macro:
DEFINE DISTINCT_COUNT(A, c) RETURNS dist {
temp = FOREACH $A GENERATE $c;
dist = DISTINCT temp;
groupAll = GROUP dist ALL;
$dist = FOREACH groupAll GENERATE COUNT(dist);
}
Usage:
X = LOAD 'data' AS (x: int);
Y = DISTINCT_COUNT(X, x);
If you need to use it in a FOREACH instead then the easiest way is something like:
...GENERATE COUNT(Distinct(x))...
Tested on Pig 0.12.
If you don't want to count per group, you can use this:
G = FOREACH (GROUP A ALL){
unique = DISTINCT A.field;
GENERATE COUNT(unique) AS ct;
};
This will just give you a number.

Resources