How does Pig's COGROUP operator work? - hadoop

How does the COGROUP operator work here?
How and why do we get empty bags in the last two lines of the output? (No website explains in detail how COGROUP arranges the data.)
A = load 'student' as (name:chararray, age:int, gpa:float);
B = load 'student' as (name:chararray, age:int, gpa:float);
dump B;
(joe,18,2.5)
(sam,,3.0)
(bob,,3.5)
X = cogroup A by age, B by age;
dump X;
(18,{(joe,18,2.5)},{(joe,18,2.5)})
(,{(sam,,3.0),(bob,,3.5)},{})
(,{},{(sam,,3.0),(bob,,3.5)})

There is a very clear example in Hadoop: The Definitive Guide. I hope the snippet below helps you understand the COGROUP concept.
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
grunt> D = COGROUP A BY $0, B BY $1;
grunt> DUMP D;
(0,{},{(Ali,0)})
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Joe,2),(Hank,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})
COGROUP generates a tuple for each unique grouping key. The first field of each tuple
is the key, and the remaining fields are bags of tuples from the relations with a matching
key. The first bag contains the matching tuples from relation A with the same key.
Similarly, the second bag contains the matching tuples from relation B with the same
key.
If for a particular key a relation has no matching tuples, then the bag for that relation is empty. For example, since no one has bought a scarf (with ID 1), the second bag in the tuple for that row is empty. This is an example of an outer join, which is the default type for COGROUP.
As for the two empty bags in your output: when COGROUP operates on multiple relations, tuples with a null key from different relations are not treated as equal. The null-age tuples from A and the null-age tuples from B therefore land in separate groups, each with an empty bag for the other relation.
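If you would rather not see rows with empty bags at all, COGROUP accepts the INNER keyword per relation; a minimal sketch against the same A and B:
E = COGROUP A BY $0 INNER, B BY $1 INNER;
-- keeps only keys present in both relations; the (0,...) and (1,...) rows disappear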

Related

Inserting tuples inside an inner bag using Pig Latin - Hadoop

I am trying to create the following format of relation using Pig Latin:
userid, day, {(pid,fulldate, x,y),(pid,fulldate, x,y), ...}
Relation description: each user (userid) on each day (day) has purchased multiple products (pid).
I am loading the data with:
A= LOAD '**from a HDFS URL**' AS (pid: chararray,userid:
chararray,day:int,fulldate: chararray,x: chararray,y:chararray);
B= GROUP A BY (userid, day);
Describe B;
B: {group: (userid: chararray,day: int),A: {(pid: chararray,day: int,fulldate: chararray,x: chararray,userid: chararray,y: chararray)}}
C= FOREACH B FLATTEN(B) AS (userid,day), $1.pid, $1.fulldate,$1.x,$1.y;
Describe C;
C: {userid: chararray,day: int,{(pid: chararray)},{(fulldate: chararray)},{(x: chararray)},{(y: chararray)}}
The result of DESCRIBE C does not give the format I want! What am I doing wrong?
You are correct up to the GROUP BY part. After that, however, you are trying to do something messy; I'm actually not sure what is happening in your alias C. To arrive at the format you are looking for, you will need a nested FOREACH.
C = FOREACH B {
-- project only the wanted fields out of the grouped bag A
data = A.(pid, fulldate, x, y);
GENERATE FLATTEN(group), data;
}
This allows C to have one record for each (userid, day) and all the corresponding (pid,fulldate, x, y) tuples in a bag.
You can read more about nested foreach here: https://www.safaribooksonline.com/library/view/programming-pig/9781449317881/ch06.html (Search for nested foreach in that link).
My understanding is that B is almost what you're looking for, except you would like the tuple containing userid and day to be flattened, and you would like only pid, fulldate, x, and y to appear in the bag.
First, you want to flatten the tuple group which has fields userid and day, not the bag A which contains multiple tuples. Flattening group unnests the tuple, which only has 1 set of unique values for each row, whereas flattening the bag A would effectively ungroup your previous GROUP BY statement since the values in the bag A are not unique. So the first part should read C = FOREACH B GENERATE FLATTEN(group) AS (userid, day);
Next, you want to keep pid, fulldate, x, and y in separate tuples for each record, but the way you've selected them essentially makes a bag of all the pid values, a bag of all the fulldate values, etc. Instead, try selecting these fields in a way that keeps the tuples nested in the bag:
C = FOREACH B GENERATE
FLATTEN(group) AS (userid, day),
A.(pid, fulldate, x, y) AS A;
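If the load schema above holds, DESCRIBE C should now come out roughly like this (my expectation, not a captured run):
C: {userid: chararray,day: int,A: {(pid: chararray,fulldate: chararray,x: chararray,y: chararray)}}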

Pig join two Relations only with join partner

I'm new to programming in Pig Latin and I have a question.
Let's say I have the following two relations (A and B):
Relation A: http://i.stack.imgur.com/Aa5Rd.png
Relation B: http://i.stack.imgur.com/m467q.png
Now the relations should be joined, but only where a key (id) exists in A; otherwise not. So the result should look like:
Relation Result: i.stack.imgur.com/3elgh.png (I cannot post more than 2 links)
How can I solve that?
My approach, result = JOIN A BY id, B BY id;, does not seem right, because it creates a result relation with all ids & texts :/
Thank you very much in advance,
Stefanos
Your approach is right. I got the correct output as you described, so I am not sure why you didn't. Can you cross-check your Pig script against the one below?
input1:
1
4
6
input2:
1,peter
2,jay
3,dan
4,knut
5,Gnu
6,rafael
7,hans
PigScript:
A = LOAD 'input1' AS (id:int);
B = LOAD 'input2' USING PigStorage(',') AS (id:int,text:chararray);
C = JOIN A BY id,B BY id;
D = FOREACH C GENERATE A::id AS id,B::text as text;
DUMP D;
Output:
(1,peter)
(4,knut)
(6,rafael)
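For what it is worth, the same "only ids that exist in A" result can also be had with a COGROUP plus a filter on the A-side bag, using Pig's built-in IsEmpty; a sketch assuming the same two inputs (fresh aliases so the C and D above are not clobbered):
C2 = COGROUP A BY id, B BY id;
D2 = FILTER C2 BY NOT IsEmpty(A);
E2 = FOREACH D2 GENERATE group AS id, FLATTEN(B.text) AS text;
DUMP E2;
-- should likewise yield (1,peter), (4,knut), (6,rafael)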

Count and find maximum number in Hadoop using pig

I have a table which contains sample CDR data, in which column A and column B hold the calling person's and the called person's mobile numbers.
I need to find who made the maximum number of calls (column A),
and also which number (column B) was called the most.
The table structure is like below:
calling called
889578226 77382596
889582256 77382596
889582256 7736368296
7785978214 782987522
In the above table, 889578226 has the most outgoing calls and 77382596 is the most-called number; that is the kind of output I need to get.
In Hive I run it like below:
SELECT calling_a,called_b, COUNT(called_b) FROM cdr_data GROUP BY calling_a,called_b;
What might be the equivalent code for the above query in Pig?
Anas, could you please let me know whether this is what you are expecting, or something different?
input.txt
a,100
a,101
a,101
a,101
a,103
b,200
b,201
b,201
c,300
c,300
c,301
d,400
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray,phone:long);
B = GROUP A BY (name,phone);
C = FOREACH B GENERATE FLATTEN(group),COUNT(A) AS cnt;
D = GROUP C BY $0;
E = FOREACH D {
SortedList = ORDER C BY cnt DESC;
top = LIMIT SortedList 1;
GENERATE FLATTEN(top);
}
DUMP E;
Output:
(a,101,3)
(b,201,2)
(c,300,2)
(d,400,1)
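Note that the script above answers a per-caller question: for each caller, the number they called most. If you literally want the two figures from the original question (the caller with the most outgoing calls, and the number called most overall), a sketch along the same lines (file name and comma delimiter are assumptions):
calls = LOAD 'cdr_data' USING PigStorage(',') AS (calling:chararray,called:chararray);
by_caller = GROUP calls BY calling;
caller_cnt = FOREACH by_caller GENERATE group AS calling, COUNT(calls) AS cnt;
caller_sorted = ORDER caller_cnt BY cnt DESC;
top_caller = LIMIT caller_sorted 1; -- caller with the most outgoing calls
by_called = GROUP calls BY called;
called_cnt = FOREACH by_called GENERATE group AS called, COUNT(calls) AS cnt;
called_sorted = ORDER called_cnt BY cnt DESC;
top_called = LIMIT called_sorted 1; -- most frequently called number
DUMP top_caller;
DUMP top_called;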

how to normalize a tuple of maps in apache pig?

I have the following relation in a pig script:
my_relation: {entityId: chararray,attributeName: chararray,bytearray}
(++JIYMIS2D,timeseries,([value#50.0,timestamp#1388675231000]))
(++JRGOCZQD,timeseries,([value#50.0,timestamp#1388592317000],[value#25.0,timestamp#1388682237000]))
(++GCYI1OO4,timeseries,())
(++JYY0LOTU,timeseries,())
There can be any number of value/timestamp pairs in the bytearray column (even zero).
I would like to transform this relation into this (one row for each entityId, attributeName, value, timestamp quartet):
++JIYMIS2D,timeseries,50.0,1388675231000
++JRGOCZQD,timeseries,50.0,1388592317000
++JRGOCZQD,timeseries,25.0,1388682237000
++GCYI1OO4,timeseries,,
++JYY0LOTU,timeseries,,
Alternatively, this would be fine too (I am not interested in the rows that have no value/timestamp):
++JIYMIS2D,timeseries,50.0,1388675231000
++JRGOCZQD,timeseries,50.0,1388592317000
++JRGOCZQD,timeseries,25.0,1388682237000
Any ideas? Basically I want to normalize the tuple of maps in the bytearray column so that the schema is like this:
my_relation: {entityId: chararray,
attributeName: chararray,
value: float,
timestamp: int}
I am a Pig beginner, so sorry if this is obvious! Do I need a UDF to do this?
This question is similar but has no answers so far: How do I split in Pig a tuple of many maps into different rows
I am running Apache Pig version 0.12.0-cdh5.1.2
EDIT - adding details of what I've done so far.
Here's a pig script snippet, with output below:
-- StateVectorFileStorage is a LoadStoreFunc and AttributeData is a UDF, both java.
ts_to_average = LOAD 'StateVector' USING StateVectorFileStorage();
ts_to_average = LIMIT ts_to_average 10;
ts_to_average = FOREACH ts_to_average GENERATE entityId, FLATTEN(AttributeData(*));
a = FOREACH ts_to_average GENERATE entityId, $1 as attributeName:chararray, $2#'value';
b = foreach a generate entityId, attributeName, FLATTEN($2);
c_no_flatten = foreach b generate
$0 as entityId,
$1 as attributeName,
TOBAG($2 ..);
c = foreach b generate
$0 as entityId,
$1 as attributeName,
FLATTEN(TOBAG($2 ..));
d = foreach c generate
entityId,
attributeName,
(float)$2#'value' as value,
(int)$2#'timestamp' as timestamp;
dump a;
describe a;
dump b;
describe b;
dump c_no_flatten;
describe c_no_flatten;
dump c;
describe c;
dump d;
describe d;
Output follows. Notice how in the relation 'c', the second value/timestamp pair [value#52.0,timestamp#1388683516000] is lost.
(++JIYMIS2D,RechargeTimeSeries,([value#50.0,timestamp#1388675231000],[value#52.0,timestamp#1388683516000]))
(++JRGOCZQD,RechargeTimeSeries,([value#50.0,timestamp#1388592317000]))
(++GCYI1OO4,RechargeTimeSeries,())
a: {entityId: chararray,attributeName: chararray,bytearray}
(++JIYMIS2D,RechargeTimeSeries,[value#50.0,timestamp#1388675231000],[value#52.0,timestamp#1388683516000])
(++JRGOCZQD,RechargeTimeSeries,[value#50.0,timestamp#1388592317000])
(++GCYI1OO4,RechargeTimeSeries)
b: {entityId: chararray,attributeName: chararray,bytearray}
(++JIYMIS2D,RechargeTimeSeries,{([value#50.0,timestamp#1388675231000])})
(++JRGOCZQD,RechargeTimeSeries,{([value#50.0,timestamp#1388592317000])})
(++GCYI1OO4,RechargeTimeSeries,{()})
c_no_flatten: {entityId: chararray,attributeName: chararray,{(bytearray)}}
(++JIYMIS2D,RechargeTimeSeries,[value#50.0,timestamp#1388675231000])
(++JRGOCZQD,RechargeTimeSeries,[value#50.0,timestamp#1388592317000])
(++GCYI1OO4,RechargeTimeSeries,)
c: {entityId: chararray,attributeName: chararray,bytearray}
(++JIYMIS2D,RechargeTimeSeries,50.0,1388675231000)
(++JRGOCZQD,RechargeTimeSeries,50.0,1388592317000)
(++GCYI1OO4,RechargeTimeSeries,,)
d: {entityId: chararray,attributeName: chararray,value: float,timestamp: int}
This should do the trick. First, flatten the tuple of maps to get rid of the encapsulating tuple:
b = foreach a generate entityId, attributeName, FLATTEN($2);
Now we can convert everything but the first two fields into a bag. The bag can be flattened (see http://pig.apache.org/docs/r0.12.0/basic.html#flatten) to get rows for each value/timestamp pair:
c = foreach b generate
$0 as entityId,
$1 as attributeName,
FLATTEN(TOBAG($2 ..));
Lastly, get the values you need out of the map:
d = foreach c generate
entityId,
attributeName,
(float)$2#'value' as value,
(long)$2#'timestamp' as timestamp; -- long rather than int: millisecond timestamps overflow a 32-bit int
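And since you mentioned you are not interested in the rows that have no value/timestamp pairs, one more line should drop them:
e = FILTER d BY value IS NOT NULL AND timestamp IS NOT NULL;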
Update:
Some other options to make a bag of maps out of the tuple of maps:
DataFu's TransposeTupleToBag: http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/util/TransposeTupleToBag.html
The foo() Python UDF in this answer: Pig - how to iterate on a bag of maps

Extract matching tuples in bag in PIG

I have raw data in bags:
{(id,35821),(lang,en-US),(pf_1,us)}
{(path,/ybe/wer),(id,23481),(lang,en-US),(intl,us),(pf_1,yahoo),(pf_3,test)}
{(id,98234),(lang,ir-IL),(pf_1,il),(pf_2,werasdf|dfsas)}
How could I extract the tuples whose first column matches id or pf_*?
The output I want:
{(id,35821),(pf_1,us)}
{(id,23481),(pf_1,yahoo),(pf_3,test)}
{(id,98234),(pf_1,il),(pf_2,werasdf|dfsas)}
Any suggestion would be appreciated. Thanks!
In order to process the inner bag (a bag in a format like OUTER_BAG: {INNER_BAG: {(e:int)}}), you are going to have to use a nested FOREACH. This will allow you to perform operations over the tuples in the inner bag.
For example, you are going to want to do something like:
-- A: {inner_bag: {(val1: chararray, val2: chararray)}}
B = FOREACH A {
filtered_bags = FILTER inner_bag BY val1 matches '^(id|pf_).*' ;
GENERATE filtered_bags ;
}
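For completeness, a hypothetical LOAD declaring the schema that snippet assumes (the file name and field names are inventions for illustration):
A = LOAD 'raw_data' AS (inner_bag: {t: (val1: chararray, val2: chararray)});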
