Inserting tuples inside an inner bag using Pig Latin - Hadoop

I am trying to create the following format of relation using Pig Latin:
userid, day, {(pid,fulldate, x,y),(pid,fulldate, x,y), ...}
Relation description: Each user (userid) in each day (day) has purchased multiple products (pid)
I am loading the data with:
A = LOAD '**from a HDFS URL**' AS (pid: chararray, userid: chararray, day: int, fulldate: chararray, x: chararray, y: chararray);
B= GROUP A BY (userid, day);
Describe B;
B: {group: (userid: chararray,day: int),A: {(pid: chararray,day: int,fulldate: chararray,x: chararray,userid: chararray,y: chararray)}}
C= FOREACH B FLATTEN(B) AS (userid,day), $1.pid, $1.fulldate,$1.x,$1.y;
Describe C;
C: {userid: chararray,day: int,{(pid: chararray)}},{(fulldate: chararray)},{(x: chararray)},{(y: chararray)}}
The result of Describe C does not give the format I want! What am I doing wrong?

You are correct till the GROUP BY part. After that however you are trying to do something messy. I'm actually not sure what is happening for your alias C. To arrive at the format you are looking for, you will need a nested foreach.
C = FOREACH B {
    -- project the four fields together so each inner tuple stays intact
    data = A.(pid, fulldate, x, y);
    GENERATE FLATTEN(group), data;
}
This allows C to have one record for each (userid, day) and all the corresponding (pid,fulldate, x, y) tuples in a bag.
You can read more about nested foreach here: https://www.safaribooksonline.com/library/view/programming-pig/9781449317881/ch06.html (Search for nested foreach in that link).

My understanding is that B is almost what you're looking for, except you would like the tuple containing userid and day to be flattened, and you would like only pid, fulldate, x, and y to appear in the bag.
First, you want to flatten the tuple group which has fields userid and day, not the bag A which contains multiple tuples. Flattening group unnests the tuple, which only has 1 set of unique values for each row, whereas flattening the bag A would effectively ungroup your previous GROUP BY statement since the values in the bag A are not unique. So the first part should read C = FOREACH B GENERATE FLATTEN(group) AS (userid, day);
Next, you want to keep pid, fulldate, x, and y in separate tuples for each record, but the way you've selected them essentially makes a bag of all the pid values, a bag of all the fulldate values, etc. Instead, try selecting these fields in a way that keeps the tuples nested in the bag:
C = FOREACH B GENERATE
FLATTEN(group) AS (userid, day),
A.(pid, fulldate, x, y) AS A;
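To see why the two projections differ, here is a rough Python model of the grouped relation. Lists and tuples stand in for Pig bags and tuples; the sample rows and the function names are made up for illustration, and this is only a sketch of the data shapes, not of how Pig executes the script.

```python
from collections import defaultdict

# Hypothetical sample rows: (pid, userid, day, fulldate, x, y)
rows = [
    ("p1", "u1", 1, "2015-01-01", "a", "b"),
    ("p2", "u1", 1, "2015-01-01", "c", "d"),
    ("p3", "u2", 2, "2015-01-02", "e", "f"),
]

def group_by_user_day(rows):
    """Model of GROUP A BY (userid, day): key -> bag of full rows."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row[1], row[2])].append(row)
    return groups

def separate_bags(groups):
    """Model of projecting A.pid, A.fulldate, A.x, A.y: one bag per column."""
    return [
        (userid, day,
         [(r[0],) for r in bag], [(r[3],) for r in bag],
         [(r[4],) for r in bag], [(r[5],) for r in bag])
        for (userid, day), bag in groups.items()
    ]

def tuple_bag(groups):
    """Model of projecting A.(pid, fulldate, x, y): one bag of 4-field tuples."""
    return [
        (userid, day, [(r[0], r[3], r[4], r[5]) for r in bag])
        for (userid, day), bag in groups.items()
    ]

groups = group_by_user_day(rows)
wrong = separate_bags(groups)   # four parallel bags, tuples are split up
wanted = tuple_bag(groups)      # one bag, tuples stay together
```

In `wanted`, each (userid, day) record carries a single bag whose tuples keep pid, fulldate, x and y side by side, which is the shape asked for in the question.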

Related

How to get a SQL like GROUP BY using Apache Pig?

I have the following input called movieUserTagFltr:
(260,{(260,starwars),(260,George Lucas),(260,sci-fi),(260,cult classic),(260,Science Fiction),(260,classic),(260,supernatural powers),(260,nerdy),(260,Science Fiction),(260,critically acclaimed),(260,Science Fiction),(260,action),(260,script),(260,"imaginary world),(260,space),(260,Science Fiction),(260,"space epic),(260,Syfy),(260,series),(260,classic sci-fi),(260,space adventure),(260,jedi),(260,awesome soundtrack),(260,awesome),(260,coming of age)})
(858,{(858,Katso Sanna!)})
(924,{(924,slow),(924,boring)})
(1256,{(1256,Marx Brothers)})
It follows the schema: (movieId:int, tags:bag{(movieId:int, tag:chararray),...})
Basically the first number represents a movie id, and the subsequent bag holds all the keywords associated with that movie. I would like to group those key words in such way that I would have an output something like this:
(260,{(1,starwars),(1,George Lucas),(1,sci-fi),(1,cult classic),(4,Science Fiction),(1,classic),(1,supernatural powers),(1,nerdy),(1,critically acclaimed),(1,action),(1,script),(1,"imaginary world),(1,space),(1,"space epic),(1,Syfy),(1,series),(1,classic sci-fi),(1,space adventure),(1,jedi),(1,awesome soundtrack),(1,awesome),(1,coming of age)})
(858,{(1,Katso Sanna!)})
(924,{(1,slow),(1,boring)})
(1256,{(1,Marx Brothers)})
Note that the tag Science Fiction has appeared 4 times for the movie with id 260. Using GROUP BY and COUNT, I managed to count the distinct keywords for each movie with the following script:
sum = FOREACH group_data {
unique_tags = DISTINCT movieUserTagFltr.tags::tag;
GENERATE group, COUNT(unique_tags) as tag;
};
But that only returns a global count, and I want a local count. So the logic I was thinking of was:
result = iterate over each tuple of group_data {
generate a tuple with $0, and a bag with {
foreach distinct tag that group_data has on its $1 variable do {
generate a tuple like: (tag_name, count of how many times that tag appeared on $1)
}
}
}
You can flatten out your original input so that each movieID and tag are their own record. Then group by movieID and tag to get a count for each combination. Finally, group by movieID so that you end up with a bag of tags and counts for each movie.
Let's say you start with movieUserTagFltr with the schema you described:
A = FOREACH movieUserTagFltr GENERATE FLATTEN(tags) AS (movieID, tag);
B = GROUP A BY (movieID, tag);
C = FOREACH B GENERATE
FLATTEN(group) AS (movieID, tag),
COUNT(A) AS movie_tag_count;
D = GROUP C BY movieID;
Your final schema is:
D: {group: int,C: {(movieID: int,tag: chararray,movie_tag_count: long)}}
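The flatten, group, count, regroup steps can be checked on a small sample with a Python sketch like the one below. The sample data is a shortened, made-up version of the input, and `tag_counts_per_movie` is a hypothetical helper name; it only mirrors the logic of the Pig script, not its execution.

```python
from collections import Counter, defaultdict

# Hypothetical input mirroring movieUserTagFltr: (movieId, bag of (movieId, tag))
movie_user_tag_fltr = [
    (260, [(260, "starwars"), (260, "Science Fiction"), (260, "Science Fiction")]),
    (924, [(924, "slow"), (924, "boring")]),
]

def tag_counts_per_movie(relation):
    # A: flatten the bag so each (movieID, tag) pair is its own record
    flattened = [pair for _, tags in relation for pair in tags]
    # B and C: group by (movieID, tag) and count each combination
    counts = Counter(flattened)
    # D: regroup by movieID to collect (count, tag) tuples per movie
    per_movie = defaultdict(list)
    for (movie_id, tag), n in counts.items():
        per_movie[movie_id].append((n, tag))
    return dict(per_movie)

result = tag_counts_per_movie(movie_user_tag_fltr)
```

Each movie ends up with one bag of (count, tag) pairs, so a tag that occurs several times for the same movie is counted locally rather than across all movies.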

Filter inner bag in Pig

The data looks like this:
22678, {(112),(110),(2)}
656565, {(110), (109)}
6676, {(2),(112)}
This is the data structure:
(id:chararray, event_list:{innertuple:(innerfield:chararray)})
I want to filter those rows where event_list contains 2. I thought initially to flatten the data and then filter those rows that have 2. Somehow flatten doesn't work on this dataset.
Can anyone please help?
There might be a simpler way of doing this, like a bag lookup etc. Otherwise with basic pig one way of achieving this is:
data = load 'data.txt' AS (id:chararray, event_list:bag{});
-- flatten bag, in order to transpose each element to a separate row.
flattened = foreach data generate id, flatten(event_list);
-- keep only those rows where the value is 2.
filtered = filter flattened by (int) $1 == 2;
-- keep only distinct ids.
dist = distinct (foreach filtered generate $0 as (id:chararray));
-- join distinct ids to the original relation (loaded above as data)
jnd = join data by id, dist by id;
-- remove extra fields, keep original fields.
result = foreach jnd generate data::id, data::event_list;
dump result;
(22678,{(112),(110),(2)})
(6676,{(2),(112)})
You can filter the bag and project a boolean that says whether 2 is present in the bag. Then filter the rows on that projection.
So:
-- input and output are reserved words in Pig, so the aliases use other names
events = LOAD 'data.txt' AS (id:chararray, event_list:bag{t:(val:chararray)});
events_flagged = FOREACH events {
    -- keep only the inner tuples whose value is 2
    bag_filter = FILTER event_list BY val matches '2';
    GENERATE
        id,
        event_list,
        (IsEmpty(bag_filter) ? false : true) AS is_2_present:boolean;
};
result = FILTER events_flagged BY is_2_present;
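The idea of filtering the inner bag and keeping a row only when the filtered bag is non-empty can be sketched in Python as follows. The rows copy the sample data from the question; `rows_containing` is a made-up helper name, and this is only a model of the logic, not Pig itself.

```python
# Rows mirroring the question's data: (id, event_list)
rows = [
    ("22678", [("112",), ("110",), ("2",)]),
    ("656565", [("110",), ("109",)]),
    ("6676", [("2",), ("112",)]),
]

def rows_containing(rows, wanted):
    """Keep rows whose bag holds at least one tuple equal to (wanted,)."""
    kept = []
    for row_id, event_list in rows:
        # Plays the role of the nested FILTER over the inner bag
        bag_filter = [t for t in event_list if t[0] == wanted]
        # Plays the role of the "bag is not empty" flag
        if bag_filter:
            kept.append((row_id, event_list))
    return kept

result = rows_containing(rows, "2")
```

Unlike the flatten-and-join approach, this never splits the bag into separate rows, so the original event_list is passed through untouched.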

How Pig's COGROUP operator works?

How does the COGROUP operator work here?
How and why are we getting an empty bag in the last two lines of the output? (No website explains the data arrangement in COGROUP in detail.)
A = load 'student' as (name:chararray, age:int, gpa:float);
B = load 'student' as (name:chararray, age:int, gpa:float);
dump B;
(joe,18,2.5)
(sam,,3.0)
(bob,,3.5)
X = cogroup A by age, B by age;
dump X;
(18,{(joe,18,2.5)},{(joe,18,2.5)})
(,{(sam,,3.0),(bob,,3.5)},{})
(,{},{(sam,,3.0),(bob,,3.5)})
There is a very clear example in the Definitive Guide book. I hope the snippet below helps you understand the COGROUP concept.
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
grunt> D = COGROUP A BY $0, B BY $1;
grunt> DUMP D;
(0,{},{(Ali,0)})
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Joe,2),(Hank,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})
COGROUP generates a tuple for each unique grouping key. The first field of each tuple
is the key, and the remaining fields are bags of tuples from the relations with a matching
key. The first bag contains the matching tuples from relation A with the same key.
Similarly, the second bag contains the matching tuples from relation B with the same
key.
If for a particular key a relation has no matching key, then the bag for that relation is
empty. For example, since no one has bought a scarf (with ID 1), the second bag in the
tuple for that row is empty. This is an example of an outer join, which is the default
type for COGROUP.
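As a rough model of the behaviour described above, here is a small Python sketch of a two-input cogroup. The `cogroup` function is written for illustration only. It also assumes, based on my understanding of Pig's documented null handling, that null keys coming from different inputs are never grouped together, which would explain the two separate rows with an empty bag in the student example.

```python
def cogroup(a, key_a, b, key_b):
    """Return (key, bag_from_a, bag_from_b) for every key seen in either input.

    None models a Pig null key; null keys from different inputs are kept apart
    (an assumption about Pig's behaviour, see the text above).
    """
    out = {}
    for side, (rel, key_fn) in enumerate(((a, key_a), (b, key_b))):
        for row in rel:
            k = key_fn(row)
            # Tag null keys with their input so they never share a group.
            slot = (k, side) if k is None else (k, None)
            out.setdefault(slot, (k, [], []))[1 + side].append(row)
    return list(out.values())

# The example from the book
A = [(2, "Tie"), (4, "Coat"), (3, "Hat"), (1, "Scarf")]
B = [("Joe", 2), ("Hank", 4), ("Ali", 0), ("Eve", 3), ("Hank", 2)]
D = sorted(cogroup(A, lambda r: r[0], B, lambda r: r[1]))

# The example from the question, with None standing in for the missing age
students = [("joe", 18, 2.5), ("sam", None, 3.0), ("bob", None, 3.5)]
X = cogroup(students, lambda r: r[1], students, lambda r: r[1])
```

With these inputs, `D` has one entry per key from 0 to 4, and `X` has one entry for age 18 plus two entries with a null key, one per input, each with an empty bag on the other side.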

Extract matching tuples in bag in PIG

I have raw data in bag:
{(id,35821),(lang,en-US),(pf_1,us)}
{(path,/ybe/wer),(id,23481),(lang,en-US),(intl,us),(pf_1,yahoo),(pf_3,test)}
{(id,98234),(lang,ir-IL),(pf_1,il),(pf_2,werasdf|dfsas)}
How could I extract the tuples whose column 1 matches id and pf_*?
The output I want:
{(id,35821),(pf_1,us)}
{(id,23481),(pf_1,yahoo),(pf_3,test)}
{(id,98234),(pf_1,il),(pf_2,werasdf|dfsas)}
Any suggestion would be appreciated. Thanks!
In order to process the inner bag (a bag in a format like OUTER_BAG: {INNER_BAG: {(e:int)}}) you are going to have to use a nested FOREACH. This will allow you to perform operations over the tuples in the inner bag.
For example, you are going to want to do something like:
-- A: {inner_bag: {(val1: chararray, val2: chararray)}}
B = FOREACH A {
filtered_bags = FILTER inner_bag BY val1 matches '^(id|pf_).*' ;
GENERATE filtered_bags ;
}
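To sanity-check the pattern against the sample data, the same filter can be modelled in Python. This is just a sketch: `filter_bag` is a made-up helper, and `fullmatch` is used because Pig's `matches` has to match the whole string.

```python
import re

# Same pattern as the Pig filter; fullmatch mirrors Pig's whole-string `matches`.
PATTERN = re.compile(r"^(id|pf_).*")

def filter_bag(inner_bag):
    """Keep only the tuples whose first field is id or starts with pf_."""
    return [t for t in inner_bag if PATTERN.fullmatch(t[0])]

# The three bags from the question
bags = [
    [("id", "35821"), ("lang", "en-US"), ("pf_1", "us")],
    [("path", "/ybe/wer"), ("id", "23481"), ("lang", "en-US"),
     ("intl", "us"), ("pf_1", "yahoo"), ("pf_3", "test")],
    [("id", "98234"), ("lang", "ir-IL"), ("pf_1", "il"), ("pf_2", "werasdf|dfsas")],
]

filtered = [filter_bag(bag) for bag in bags]
```

Note that the pattern also accepts any key that merely starts with id (for example a hypothetical `identity`); tighten it to `^(id|pf_.*)$` if only the exact key id should pass.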

hadoop cascading how to get top N tuples

I'm new to Cascading and trying to find a way to get the top N tuples based on a sort order. For example, I'd like to know the top 100 first names people are using.
Here's something similar that I can do in Teradata SQL:
select top 100 first_name, num_records
from
(select first_name, count(1) as num_records
from table_1
group by first_name) a
order by num_records DESC
Here's something similar in Hadoop Pig:
a = load 'table_1' as (first_name:chararray, last_name:chararray);
b = foreach (group a by first_name) generate group as first_name, COUNT(a) as num_records;
c = order b by num_records DESC;
d = limit c 100;
It seems very easy to do in SQL or Pig, but I'm having a hard time trying to find a way to do it in Cascading. Please advise!
Assuming you just need the Pipe setup for this:
In Cascading 2.1.6,
// group the incoming tuples by first name
Pipe firstNamePipe = new GroupBy("topFirstNames", InPipe,
new Fields("first_name"));
// count the tuples in each group into a new field, num_records
firstNamePipe = new Every(firstNamePipe, new Fields("first_name"),
new Count(new Fields("num_records")), Fields.ALL);
// regroup with a secondary sort on num_records
firstNamePipe = new GroupBy(firstNamePipe,
new Fields("first_name"),
new Fields("num_records"),
true); // true sorts in descending order
// keep the first 100 tuples
firstNamePipe = new Every(firstNamePipe, new Fields("first_name", "num_records"),
new First(Fields.ARGS, 100), Fields.ALL);
Where InPipe is formed with your incoming tap that holds the tuple data that you are referencing above. Namely, "first_name". "num_records" is created when new Count() is called.
If you have the "num_records" and "first_name" data in separate taps (tables or files) then you can set up two pipes that point to those two Tap sources and join them using CoGroup.
The definitions I used are from Cascading 2.1.6:
GroupBy(String groupName, Pipe pipe, Fields groupFields, Fields sortFields, boolean reverseOrder)
Count(Fields fieldDeclaration)
First(Fields fieldDeclaration, int firstN)
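Independent of Cascading, the computation being asked for is: count records per first name, sort by that count in descending order, and keep the first N. A small Python sketch of that target result can be handy for checking a pipeline's output on a sample. The sample names and the `top_first_names` helper are made up for illustration, and the tie-breaking by name is my own choice, not something the SQL or Pig versions guarantee.

```python
from collections import Counter

def top_first_names(records, n):
    """records: iterable of (first_name, last_name); returns the top n (name, count)."""
    counts = Counter(first for first, _ in records)
    # Sort by count descending; ties are broken by name for a deterministic result.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Hypothetical sample data
records = [
    ("ann", "x"), ("bob", "y"), ("ann", "z"),
    ("cat", "w"), ("bob", "q"), ("ann", "r"),
]
top2 = top_first_names(records, 2)
```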
Method 1
Use a GroupBy on the required columns and make use of the secondary sorting that Cascading provides. By default the sort is ascending; for descending order, enable reverse order (the reverseOrder argument of GroupBy).
To get the top N tuples (rows), use a static counter variable in a Filter: increment it by 1 for each tuple and check whether it is greater than N. Return true when the count is greater than N, otherwise return false (in a Cascading Filter, returning true removes the tuple). This yields an output with only the first N tuples.
Method 2
Cascading provides a built-in Unique assembly, which uses FirstNBuffer. See the link below:
http://docs.cascading.org/cascading/2.2/javadoc/cascading/pipe/assembly/Unique.html

Resources