How to get a SQL like GROUP BY using Apache Pig? - hadoop

I have the following input called movieUserTagFltr:
(260,{(260,starwars),(260,George Lucas),(260,sci-fi),(260,cult classic),(260,Science Fiction),(260,classic),(260,supernatural powers),(260,nerdy),(260,Science Fiction),(260,critically acclaimed),(260,Science Fiction),(260,action),(260,script),(260,"imaginary world),(260,space),(260,Science Fiction),(260,"space epic),(260,Syfy),(260,series),(260,classic sci-fi),(260,space adventure),(260,jedi),(260,awesome soundtrack),(260,awesome),(260,coming of age)})
(858,{(858,Katso Sanna!)})
(924,{(924,slow),(924,boring)})
(1256,{(1256,Marx Brothers)})
It follows the schema: (movieId:int, tags:bag{(movieId:int, tag:chararray),...})
Basically the first number represents a movie id, and the subsequent bag holds all the keywords associated with that movie. I would like to group those keywords in such a way that I get output like this:
(260,{(1,starwars),(1,George Lucas),(1,sci-fi),(1,cult classic),(4,Science Fiction),(1,classic),(1,supernatural powers),(1,nerdy),(1,critically acclaimed),(1,action),(1,script),(1,"imaginary world),(1,space),(1,"space epic),(1,Syfy),(1,series),(1,classic sci-fi),(1,space adventure),(1,jedi),(1,awesome soundtrack),(1,awesome),(1,coming of age)})
(858,{(1,Katso Sanna!)})
(924,{(1,slow),(1,boring)})
(1256,{(1,Marx Brothers)})
Note that the tag Science Fiction appears 4 times for the movie with id 260. Using GROUP BY and COUNT I managed to count the distinct keywords for each movie with the following script:
sum = FOREACH group_data {
    unique_tags = DISTINCT movieUserTagFltr.tags::tag;
    GENERATE group, COUNT(unique_tags) AS tag;
};
But that only returns a global count; I want a per-movie count. So the logic I had in mind was:
result = iterate over each tuple of group_data {
    generate a tuple with $0, and a bag with {
        foreach distinct tag that group_data has in its $1 variable do {
            generate a tuple like: (tag_name, count of how many times that tag appeared in $1)
        }
    }
}

You can flatten out your original input so that each movieID and tag are their own record. Then group by movieID and tag to get a count for each combination. Finally, group by movieID so that you end up with a bag of tags and counts for each movie.
Let's say you start with movieUserTagFltr with the schema you described:
A = FOREACH movieUserTagFltr GENERATE FLATTEN(tags) AS (movieID, tag);
B = GROUP A BY (movieID, tag);
C = FOREACH B GENERATE
    FLATTEN(group) AS (movieID, tag),
    COUNT(A) AS movie_tag_count;
D = GROUP C BY movieID;
Your final schema is:
D: {group: int,C: {(movieID: int,tag: chararray,movie_tag_count: long)}}
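The flatten → group → count → regroup dataflow above can be sketched in plain Python to check the expected counts. This is only an illustration of the logic, not Pig itself; the function name and sample data are made up:

```python
from collections import Counter

def count_tags_per_movie(movie_user_tag_fltr):
    """Mimic the Pig flow: flatten each (movieID, bag-of-(movieID, tag)) record,
    count each (movieID, tag) pair, then regroup by movieID."""
    result = {}
    for movie_id, tag_bag in movie_user_tag_fltr:
        counts = Counter(tag for _, tag in tag_bag)
        # one (count, tag) tuple per distinct tag, like the desired output
        result[movie_id] = {(n, tag) for tag, n in counts.items()}
    return result

data = [(924, [(924, "slow"), (924, "boring"), (924, "slow")])]
print(count_tags_per_movie(data))  # {924: {(2, 'slow'), (1, 'boring')}}
```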

Related

Inserting tuples inside an inner bag using Pig Latin - Hadoop

I am trying to create the following format of relation using Pig Latin:
userid, day, {(pid,fulldate, x,y),(pid,fulldate, x,y), ...}
Relation description: Each user (userid) in each day (day) has purchased multiple products (pid)
I am loading the data with:
A= LOAD '**from a HDFS URL**' AS (pid: chararray,userid:
chararray,day:int,fulldate: chararray,x: chararray,y:chararray);
B= GROUP A BY (userid, day);
Describe B;
B: {group: (userid: chararray,day: int),A: {(pid: chararray,day: int,fulldate: chararray,x: chararray,userid: chararray,y: chararray)}}
C= FOREACH B FLATTEN(B) AS (userid,day), $1.pid, $1.fulldate,$1.x,$1.y;
Describe C;
C: {userid: chararray,day: int,{(pid: chararray)}},{(fulldate: chararray)},{(x: chararray)},{(y: chararray)}}
The result of DESCRIBE C does not give the format I want! What am I doing wrong?
You are correct up to the GROUP BY part. After that, however, you are trying to do something messy. I'm actually not sure what is happening with your alias C. To arrive at the format you are looking for, you will need a nested foreach.
C = FOREACH B {
    data = A.(pid, fulldate, x, y);
    GENERATE FLATTEN(group), data;
};
This allows C to have one record for each (userid, day) and all the corresponding (pid,fulldate, x, y) tuples in a bag.
You can read more about nested foreach here: https://www.safaribooksonline.com/library/view/programming-pig/9781449317881/ch06.html (Search for nested foreach in that link).
My understanding is that B is almost what you're looking for, except you would like the tuple containing userid and day to be flattened, and you would like only pid, fulldate, x, and y to appear in the bag.
First, you want to flatten the tuple group which has fields userid and day, not the bag A which contains multiple tuples. Flattening group unnests the tuple, which only has 1 set of unique values for each row, whereas flattening the bag A would effectively ungroup your previous GROUP BY statement since the values in the bag A are not unique. So the first part should read C = FOREACH B GENERATE FLATTEN(group) AS (userid, day);
Next, you want to keep pid, fulldate, x, and y in separate tuples for each record, but the way you've selected them essentially makes a bag of all the pid values, a bag of all the fulldate values, etc. Instead, try selecting these fields in a way that keeps the tuples nested in the bag:
C = FOREACH B GENERATE
    FLATTEN(group) AS (userid, day),
    A.(pid, fulldate, x, y) AS A;
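For intuition, the nested-bag shape this produces can be sketched in plain Python (illustrative only; the function name and sample rows are made up):

```python
from collections import defaultdict

def group_purchases(records):
    """Group (pid, userid, day, fulldate, x, y) records by (userid, day),
    keeping only (pid, fulldate, x, y) tuples in the inner bag."""
    grouped = defaultdict(list)
    for pid, userid, day, fulldate, x, y in records:
        grouped[(userid, day)].append((pid, fulldate, x, y))
    # flatten the group key, like FLATTEN(group) in Pig
    return [(userid, day, bag) for (userid, day), bag in grouped.items()]

rows = [("p1", "u1", 3, "2020-01-03", "a", "b"),
        ("p2", "u1", 3, "2020-01-03", "c", "d")]
print(group_purchases(rows))
# [('u1', 3, [('p1', '2020-01-03', 'a', 'b'), ('p2', '2020-01-03', 'c', 'd')])]
```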

Filter inner bag in Pig

The data looks like this:
22678, {(112),(110),(2)}
656565, {(110), (109)}
6676, {(2),(112)}
This is the data structure:
(id:chararray, event_list:{innertuple:(innerfield:chararray)})
I want to filter the rows where event_list contains 2. My initial thought was to flatten the data and then filter the rows that contain 2, but somehow FLATTEN doesn't work on this dataset.
Can anyone please help?
There might be a simpler way of doing this, like a bag lookup etc. Otherwise, with basic Pig, one way of achieving this is:
data = LOAD 'data.txt' AS (id:chararray, event_list:bag{});
-- flatten the bag, so each element becomes a separate row
flattened = FOREACH data GENERATE id, FLATTEN(event_list);
-- keep only those rows where the value is 2
filtered = FILTER flattened BY (int)$1 == 2;
-- keep only distinct ids
dist = DISTINCT (FOREACH filtered GENERATE $0 AS id);
-- join the distinct ids back to the original relation
jnd = JOIN data BY id, dist BY id;
-- drop the extra fields, keeping the original ones
result = FOREACH jnd GENERATE data::id, data::event_list;
dump result;
(22678,{(112),(110),(2)})
(6676,{(2),(112)})
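The same "keep rows whose bag contains 2" logic can be sketched in plain Python (illustration only; the function name is made up, the sample rows come from the question):

```python
def filter_rows_with_event(rows, target="2"):
    """Keep only rows whose event bag contains the target value,
    mirroring the flatten/filter/distinct/join pipeline above."""
    return [(row_id, events) for row_id, events in rows if target in events]

data = [("22678", ["112", "110", "2"]),
        ("656565", ["110", "109"]),
        ("6676", ["2", "112"])]
print(filter_rows_with_event(data))
# [('22678', ['112', '110', '2']), ('6676', ['2', '112'])]
```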
You can filter the bag and project a boolean that says whether 2 is present in it. Then filter the rows where that projection is true.
So:
-- note: input and output are reserved words in Pig, hence the alias names
inpt = LOAD 'data.txt' AS (id:chararray, event_list:{t:(event:chararray)});
inpt_filt = FOREACH inpt {
    bag_filter = FILTER event_list BY event matches '2';
    GENERATE
        id,
        event_list,
        (IsEmpty(bag_filter) ? false : true) AS is_2_present:boolean;
};
filtered = FILTER inpt_filt BY is_2_present;

Grouping a databag by identical values in pig

I created the following Pig script to filter the sentences from a collection of web documents (Common Crawl) that mention a movie title (from a predefined data file of movie titles), apply sentiment analysis to those sentences, and group the resulting sentiments by movie.
register ../commoncrawl-examples/lib/*.jar;
set mapred.task.timeout= 1000;
register ../commoncrawl-examples/dist/lib/commoncrawl-examples-1.0.1-HM.jar;
register ../dist/lib/movierankings-1.jar
register ../lib/piggybank.jar;
register ../lib/stanford-corenlp-full-2014-01-04/stanford-corenlp-3.3.1.jar;
register ../lib/stanford-corenlp-full-2014-01-04/stanford-corenlp-3.3.1-models.jar;
register ../lib/stanford-corenlp-full-2014-01-04/ejml-0.23.jar;
register ../lib/stanford-corenlp-full-2014-01-04/joda-time.jar;
register ../lib/stanford-corenlp-full-2014-01-04/jollyday.jar;
register ../lib/stanford-corenlp-full-2014-01-04/xom.jar;
DEFINE IsNotWord com.moviereviewsentimentrankings.IsNotWord;
DEFINE IsMovieDocument com.moviereviewsentimentrankings.IsMovieDocument;
DEFINE ToSentenceMoviePairs com.moviereviewsentimentrankings.ToSentenceMoviePairs;
DEFINE ToSentiment com.moviereviewsentimentrankings.ToSentiment;
DEFINE MoviesInDocument com.moviereviewsentimentrankings.MoviesInDocument;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
-- LOAD pages, movies and words
pages = LOAD '../data/textData-*' USING SequenceFileLoader as (url:chararray, content:chararray);
movies_fltr_grp = LOAD '../data/movie_fltr_grp_2/part-*' as (group: chararray,movies_fltr: {(movie: chararray)});
-- FILTER pages containing movie
movie_pages = FILTER pages BY IsMovieDocument(content, movies_fltr_grp.movies_fltr);
-- SPLIT pages containing movie in sentences and create movie-sentence pairs
movie_sentences = FOREACH movie_pages GENERATE flatten(ToSentenceMoviePairs(content, movies_fltr_grp.movies_fltr)) as (content:chararray, movie:chararray);
-- Calculate sentiment for each movie-sentence pair
movie_sentiment = FOREACH movie_sentences GENERATE flatten(ToSentiment(movie, content)) as (movie:chararray, sentiment:int);
-- GROUP movie-sentiment pairs by movie
movie_sentiment_grp_tups = GROUP movie_sentiment BY movie;
-- Reformat and print movie-sentiment pairs
movie_sentiment_grp = FOREACH movie_sentiment_grp_tups GENERATE group, movie_sentiment.sentiment AS sentiments:{(sentiment: int)};
describe movie_sentiment_grp;
Test runs on a small subset of the web crawl successfully gave me pairs of a movie title and a databag of integers (from 1 to 5, representing very negative, negative, neutral, positive and very positive). As a last step I would like to transform this data into pairs of a movie title and a databag containing tuples of each distinct integer occurring for that movie title together with its count. The describe movie_sentiment_grp at the end of the script returns:
movie_sentiment_grp: {group: chararray,sentiments: {(sentiment: int)}}
So basically I probably need to FOREACH over each element of movie_sentiment_grp and GROUP the sentiments databag into groups of identical values and then use the COUNT() function to get the number of elements in each group. I was however not able to find anything on how to group a databag of integers into groups of identical values. Does anyone know how to do this?
Dummy solution:
movie_sentiment_grp_cnt = FOREACH movie_sentiment_grp{
sentiments_grp = GROUP sentiments BY ?;
}
Check out the CountEach UDF from Apache DataFu. Given a bag it will produce a new bag of the distinct tuples, with the count appended to each corresponding tuple.
Example from the documentation should make this clear:
DEFINE CountEachFlatten datafu.pig.bags.CountEach('flatten');
-- input:
-- ({(A),(A),(C),(B)})
data = LOAD 'input' AS (B: bag {T: tuple(alpha:CHARARRAY)});
-- output_flatten:
-- ({(A,2),(C,1),(B,1)})
output_flatten = FOREACH data GENERATE CountEachFlatten(B);
For your case:
DEFINE CountEachFlatten datafu.pig.bags.CountEach('flatten');
movie_sentiment_grp_cnt = FOREACH movie_sentiment_grp GENERATE
    group,
    CountEachFlatten(sentiments);
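CountEach's behavior can be approximated in plain Python with collections.Counter, which may help verify what to expect from it (a hedged sketch, not DataFu's actual implementation):

```python
from collections import Counter

def count_each(bag):
    """Like DataFu's CountEach('flatten'): produce the distinct tuples
    of a bag with the count appended to each one."""
    counts = Counter(bag)
    return [t + (n,) for t, n in counts.items()]

print(count_each([("A",), ("A",), ("C",), ("B",)]))
# [('A', 2), ('C', 1), ('B', 1)]
```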
You were on the right track. movie_sentiment_grp is in the right format, and a nested FOREACH would be correct, except you cannot use GROUP inside it. The solution is to use a UDF. Something like this:
myudfs.py
#!/usr/bin/python

@outputSchema('sentiments: {(sentiment:int, count:int)}')
def count_sentiments(bag):
    # each element of the bag is a one-field tuple (sentiment,)
    res = {}
    for (sentiment,) in bag:
        if sentiment in res:
            res[sentiment] += 1
        else:
            res[sentiment] = 1
    return res.items()
This UDF is used like:
REGISTER 'myudfs.py' USING jython AS myfuncs;
movie_sentiment_grp_cnt = FOREACH movie_sentiment_grp
    GENERATE group, myfuncs.count_sentiments(sentiments);

Query individual field after group by

I have a relation
A =
(John,19,SF)
(Mary,20,NY)
(Bill,23,SF)
(Joe,25,SF)
The schema is (name, age, city)
B = foreach (group A by city) {
    sorted = ORDER A BY age;
    info = LIMIT sorted 10;
    GENERATE group, info.name;
}
Pig complains that "Scalar has more than one row in the output" for GENERATE group, info.name;
How to query individual field in the bag after group by?
Thanks.
For me the above code works, and the output of 'Dump B;' is
(NY,{(Mary)})
(SF,{(John),(Bill),(Joe)})
As far as querying an individual field after GROUP BY is concerned, you will have to refer to it the same way you are doing now: Alias.fieldName.
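The group/ORDER/LIMIT pattern above is a top-N-per-group operation, sketched here in plain Python for intuition (illustration only; the function name is made up, the rows come from the question):

```python
from collections import defaultdict

def top_n_names_by_city(people, n=10):
    """Group (name, age, city) rows by city, sort each group by age,
    and keep the first n names -- the ORDER ... LIMIT pattern above."""
    by_city = defaultdict(list)
    for name, age, city in people:
        by_city[city].append((age, name))
    return {city: [name for _, name in sorted(rows)[:n]]
            for city, rows in by_city.items()}

people = [("John", 19, "SF"), ("Mary", 20, "NY"),
          ("Bill", 23, "SF"), ("Joe", 25, "SF")]
print(top_n_names_by_city(people))
# {'SF': ['John', 'Bill', 'Joe'], 'NY': ['Mary']}
```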

Hadoop Pig GROUP by id, get owner_id?

In Hadoop I have many records that look like this:
(item_id,owner_id,counter) - there can be duplicates, but a given item_id ALWAYS has the same owner_id!
I want to get the SUM of the counter for each item_id so I have the following script:
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id);
data = FOREACH group_by_item GENERATE group AS item_id, OWNER_ID_COLUMN_SOMEHOW, SUM(known_items.counter) AS items_count;
The problem is that in the FOREACH, known_items.owner_id is a bag holding the owner_id of every grouped row for that item_id. What would be the most efficient way to take just the first of the owners?
The simplest solution gives you the right answer if your assumption that each item_id has the same owner_id is correct, and will let you know if it is not: include the owner_id as part of the group.
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id, owner_id);
data = FOREACH group_by_item GENERATE FLATTEN(group), SUM(known_items.counter) AS items_count;
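Grouping by the composite (item_id, owner_id) key and summing the counters can be sketched in plain Python (illustration only; the function name and sample rows are made up):

```python
from collections import defaultdict

def sum_counters(records):
    """Group (item_id, owner_id, counter) rows by (item_id, owner_id)
    and sum the counters -- the composite-key GROUP above."""
    totals = defaultdict(int)
    for item_id, owner_id, counter in records:
        if owner_id > 0:  # mirrors the FILTER on owner_id
            totals[(item_id, owner_id)] += counter
    return [(item, owner, total) for (item, owner), total in totals.items()]

rows = [(1, 10, 5), (1, 10, 3), (2, 20, 7), (3, 0, 9)]
print(sum_counters(rows))
# [(1, 10, 8), (2, 20, 7)]
```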
