How to order items in tuples? - hadoop

I have pairs of numbers and I want to sort them.
grunt> dump unordered
(11,22)
(88,33)
(55,66)
How do I sort them to:
(11,22)
(33,88)
(55,66)
Tried to use bags:
grunt> bag_of_pairs = foreach unordered generate TOBAG(TOTUPLE($0),TOTUPLE($1));
grunt> ordered = foreach bag_of_pairs {o1 = order $0 by $0; generate o1;}
And ended up with this ordered, but over-wrapped list, that I don't know how to simplify:
grunt> dump ordered
({(11),(22)})
({(33),(88)})
({(55),(66)})
Thanks

You need a UDF to convert the bag to a tuple. However, since you only need to order two items this can also be done with a bincond.
ordered = FOREACH unordered GENERATE ($0<$1?$0:$1), ($0<$1?$1:$0) ;
NOTE: I can't test this right now, but this should also work.
ordered = FOREACH unordered GENERATE FLATTEN(($0<$1?($0,$1):($1,$0)) ;

Related

Order of Apache Pig Transformations

I am reading through Pig Programming by Alan Gates.
Consider the code:
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS
(userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage ('|') AS
(movieID:int, movieTitle:chararray, releaseDate:chararray, imdbLink: chararray);
nameLookup = FOREACH metadata GENERATE
movieID, movieTitle, ToDate(releaseDate, 'dd-MMM-yyyy') AS releaseYear;
nameLookupYear = FOREACH nameLookup GENERATE
movieID, movieTitle, GetYear(releaseYear) AS finalYear;
filterMovies = FILTER nameLookupYear BY finalYear < 1982;
groupedMovies = GROUP filterMovies BY finalYear;
orderedMovies = FOREACH groupedMovies {
sortOrder = ORDER metadata by finalYear DESC;
GENERATE GROUP, finalYear;
};
DUMP orderedMovies;
It states that
"Sorting by maps, tuples or bags produces error".
I want to know how I can sort the grouped results.
Do the transformations need to follow a certain sequence for them to work?
Since you are trying to sort the grouped results, you do not need a nested foreach. You would use the nested foreach if you were trying to, for example, sort each movie within the year by title or release date. Try ordering as usual (refer to finalYear as group since you grouped by finalYear in the previous line):
orderedMovies = ORDER groupedMovies BY group ASC;
DUMP orderedMovies;
If you are looking to sort the grouped values then you will have to use nested foreach. This will sort the years in descending order within a group.
orderedMovies = FOREACH groupedMovies {
sortOrder = ORDER metadata by GetYear(ToDate(releaseDate, 'dd-MMM-yyyy')) DESC;
GENERATE GROUP, movieID, movieTitle;
};

"Flattening" a databag in Pig

Suppose I have a bunch of databags generated from a Pig UDF that holds several tuples of Strings. How can I pull all of them out of the databags and simple make each String its own "row" of data.
databags = FOREACH data GENERATE pigUdfThatMakesDataBags(data::someText);
strings = FOREACH databags { ??? };
databags = FOREACH data GENERATE pigUdfThatMakesDataBags(data::someText);
datatuples = FOREACH databags FLATTEN($0); -- Bag to Tuples
strings = FOREACH datatuples FLATTEN(TOBAG(*)); -- Tuples to Tokens'
DUMP strings;
Am I understand it right that you're looking for the FLATTEN?

extracting a tuple from a bag

I have a relation of bags of tuples which looks like this. The tuples in the bag come preordered.
{(123,1383313457523,1,US),(123,1383313457543,2,US),(123,1383313457553,3,US)}
{(456,1383313457623,1,UK),(456,1383313457643,2,UK),(456,1383313457653,3,UK)}
{(789,1383313457723,1,UK),(789,1383313457743,2,UK),(789,1383313457753,3,UK)}
Where the tuple is: (id:chararray,time:long,event:chararray,location,chararray)
I want to get the first element of each bag. So my expected output would be:
(123,1383313457523,1,US)
(456,1383313457623,1,UK)
(789,1383313457723,1,UK)
I tried this:
data = load 'mydata.txt' USING PigStorage('\t');
A = FOREACH data GENERATE $0;
dump A;
Which produces the same list of data bags as I had originally.
Alternatively trying to extract just the ids
data = load 'mydata.txt' USING PigStorage('\t');
A = FOREACH data GENERATE $0.$0;
dump A;
I expect:
(123)
(456)
(789)
but I get
{(123),(123),(123)}
{(456),(456),(456)}
{(789),(789),(789)}
How do I adjust my script to get the data that I want.
Use LIMIT inside a nested foreach:
A = FOREACH data { first = LIMIT $0 1; GENERATE FLATTEN(first); }
You cannot count on the tuples in your bag being ordered, since by definition a bag is unordered. However, you can also put an ORDER BY in a nested foreach:
A = FOREACH data { ord = ORDER $0 BY $1; first = LIMIT ord 1; GENERATE FLATTEN(first); }
I find these to be more readable if they are split up onto multiple lines:
A =
FOREACH data {
ord = ORDER $0 BY $1;
first = LIMIT ord 1;
GENERATE
FLATTEN(first);
};
I'm assuming that the bag is ordered by the second field of each tuple ($1).

Extract matching tuples in bag in PIG

I have raw data in bag:
{(id,35821),(lang,en-US),(pf_1,us)}
{(path,/ybe/wer),(id,23481),(lang,en-US),(intl,us),(pf_1,yahoo),(pf_3,test)}
{(id,98234),(lang,ir-IL),(pf_1,il),(pf_2,werasdf|dfsas)}
How could I extract the tuples whose column 1 matches id and pf_*?
The output I want:
{(id,35821),(pf_1,us)}
{(id,23481),(pf_1,yahoo),(pf_3,test)}
{(id,98234),(pf_1,il),(pf_2,werasdf|dfsas)}
Any suggestion would be appreciated. Thanks!
In order to process the inner bag (a bag in a format like OUTER_BAG: {INNER_BAG: {(e:int)}}) you are going to have to use a nested FOREACH. This will allow you to preform operations over the tuples in the inner bag.
For example, you are going to want to do something like:
-- A: {inner_bag: {(val1: chararray, val2: chararray)}}
B = FOREACH A {
filtered_bags = FILTER inner_bag BY val1 matches '^(id|pf_).*' ;
GENERATE filtered_bags ;
}

Grouping a databag by identical values in pig

I created the following Pig script to filter the sentences from a collection of web documents (Common Crawl) that mention a movie title (from a predefined data file of movie titles), apply sentiment analysis on those sentences and group those sentiments by movie.
register ../commoncrawl-examples/lib/*.jar;
set mapred.task.timeout= 1000;
register ../commoncrawl-examples/dist/lib/commoncrawl-examples-1.0.1-HM.jar;
register ../dist/lib/movierankings-1.jar
register ../lib/piggybank.jar;
register ../lib/stanford-corenlp-full-2014-01-04/stanford-corenlp-3.3.1.jar;
register ../lib/stanford-corenlp-full-2014-01-04/stanford-corenlp-3.3.1-models.jar;
register ../lib/stanford-corenlp-full-2014-01-04/ejml-0.23.jar;
register ../lib/stanford-corenlp-full-2014-01-04/joda-time.jar;
register ../lib/stanford-corenlp-full-2014-01-04/jollyday.jar;
register ../lib/stanford-corenlp-full-2014-01-04/xom.jar;
DEFINE IsNotWord com.moviereviewsentimentrankings.IsNotWord;
DEFINE IsMovieDocument com.moviereviewsentimentrankings.IsMovieDocument;
DEFINE ToSentenceMoviePairs com.moviereviewsentimentrankings.ToSentenceMoviePairs;
DEFINE ToSentiment com.moviereviewsentimentrankings.ToSentiment;
DEFINE MoviesInDocument com.moviereviewsentimentrankings.MoviesInDocument;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
-- LOAD pages, movies and words
pages = LOAD '../data/textData-*' USING SequenceFileLoader as (url:chararray, content:chararray);
movies_fltr_grp = LOAD '../data/movie_fltr_grp_2/part-*' as (group: chararray,movies_fltr: {(movie: chararray)});
-- FILTER pages containing movie
movie_pages = FILTER pages BY IsMovieDocument(content, movies_fltr_grp.movies_fltr);
-- SPLIT pages containing movie in sentences and create movie-sentence pairs
movie_sentences = FOREACH movie_pages GENERATE flatten(ToSentenceMoviePairs(content, movies_fltr_grp.movies_fltr)) as (content:chararray, movie:chararray);
-- Calculate sentiment for each movie-sentence pair
movie_sentiment = FOREACH movie_sentences GENERATE flatten(ToSentiment(movie, content)) as (movie:chararray, sentiment:int);
-- GROUP movie-sentiment pairs by movie
movie_sentiment_grp_tups = GROUP movie_sentiment BY movie;
-- Reformat and print movie-sentiment pairs
movie_sentiment_grp = FOREACH movie_sentiment_grp_tups GENERATE group, movie_sentiment.sentiment AS sentiments:{(sentiment: int)};
describe movie_sentiment_grp;
Test runs on a small subset of the web crawl showed to be successfully give me pairs of a movie title with a databag of integers (from 1 to 5, representing very negative, negative, neutral, positive and very positive). As a last step I would like to transform this data into pairs movie title and a databag containing tuples with all distinct integers existing for this movie title and their count. The describe movie_sentiment_grp at the end of the script returns:
movie_sentiment_grp: {group: chararray,sentiments: {(sentiment: int)}}
So basically I probably need to FOREACH over each element of movie_sentiment_grp and GROUP the sentiments databag into groups of identical values and then use the COUNT() function to get the number of elements in each group. I was however not able to find anything on how to group a databag of integers into groups of identical values. Does anyone know how to do this?
Dummy solution:
movie_sentiment_grp_cnt = FOREACH movie_sentiment_grp{
sentiments_grp = GROUP sentiments BY ?;
}
Check out the CountEach UDF from Apache DataFu. Given a bag it will produce a new bag of the distinct tuples, with the count appended to each corresponding tuple.
Example from the documentation should make this clear:
DEFINE CountEachFlatten datafu.pig.bags.CountEach('flatten');
-- input:
-- ({(A),(A),(C),(B)})
input = LOAD 'input' AS (B: bag {T: tuple(alpha:CHARARRAY, numeric:INT)});
-- output_flatten:
-- ({(A,2),(C,1),(B,1)})
output_flatten = FOREACH input GENERATE CountEachFlatten(B);
For your case:
DEFINE CountEachFlatten datafu.pig.bags.CountEach('flatten');
movie_sentiment_grp_cnt = FOREACH movie_sentiment_grp GENERATE
group,
CountEach(sentiments);
You were on the right track. movie_sentiment_grp is in the right format, and a nested FOREACH would be correct, except you can not use a GROUP in it. The solution is to use a UDF. Something like this:
myudfs.py
#!/usr/bin/python
#outputSchema('sentiments: {(sentiment:int, count:int)}')
def count_sentiments(BAG):
res = {}
for s in BAG:
if s in res:
res[s] += 1
else:
res[s] = 1
return res.items()
This UDF is used like:
Register 'myudfs.py' using jython as myfuncs;
movie_sentiment_grp_cnt = FOREACH movie_sentiment_grp
GENERATE group, myfuncs.count_sentiments(sentiments) ;

Resources