Filter inner bag in Pig - hadoop

The data looks like this:
22678, {(112),(110),(2)}
656565, {(110), (109)}
6676, {(2),(112)}
This is the data structure:
(id:chararray, event_list:{innertuple:(innerfield:chararray)})
I want to filter those rows where event_list contains 2. I thought initially to flatten the data and then filter those rows that have 2. Somehow flatten doesn't work on this dataset.
Can anyone please help?

There might be a simpler way of doing this, like a bag lookup, but with basic Pig one way of achieving it is:
data = load 'data.txt' AS (id:chararray, event_list:bag{});
-- flatten bag, in order to transpose each element to a separate row.
flattened = foreach data generate id, flatten(event_list);
-- keep only those rows where the value is 2.
filtered = filter flattened by (int) $1 == 2;
-- keep only distinct ids.
dist = distinct (foreach filtered generate $0 as (id:chararray));
-- join distinct ids back to the original relation.
jnd = join data by id, dist by id;
-- remove extra fields, keep only the original fields.
result = foreach jnd generate data::id, data::event_list;
dump result;
(22678,{(112),(110),(2)})
(6676,{(2),(112)})

You can filter the bag and project a boolean that says whether 2 is present in it, then filter the rows where that projection is true.
So:
-- note: 'input' and 'output' are reserved words in Pig, so use different aliases.
in_data = LOAD 'data.txt' AS (id:chararray, event_list:bag{});
in_filt = FOREACH in_data {
    bag_filter = FILTER event_list BY ((chararray)$0 matches '2');
    GENERATE
        id,
        event_list,
        (IsEmpty(bag_filter) ? false : true) AS is_2_present:boolean;
};
result = FILTER in_filt BY is_2_present;
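With the sample data above, dumping result should give something like (the boolean column comes from the projection generated above):
(22678,{(112),(110),(2)},true)
(6676,{(2),(112)},true)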

Related

How do I get the matching values inside a for loop using FILTER in PIG?

Consider this as my input,
Input (File1):
12345;11
34567;12
.
.
Input (File2):
11;(1,2,3,4,5,6,7,8,9)
12;(9,8,7,6,5,4,3,2,1)
.
.
I would like to get the output as follows:
Output:
(1,2,3,4,5,6,7,8,9)
(9,8,7,6,5,4,3,2,1)
Here's the sample code I have tried using FILTER; I get some errors with it. Please suggest some other options.
data1 = load '/File1' using PigStorage(';') as (id,number);
data2 = load '/File2' using PigStorage(';') as (numberInfo, collection);
out = foreach data1{
Data_filter = FILTER data2 by (numberInfo matches CONCAT(number,''));
generate Data_filter;
}
Is it possible to do this inside a for loop? Please let me know. Thanks in advance!
There are no for loops in Apache Pig; if you need to iterate through each row of the data for some specific purpose, you need to implement your own UDF. The foreach keyword is not used to create a loop; it is used to transform your data based on your columns, applying UDFs to it. You can also use a nested foreach, where you perform operations over each group in your relation.
However, your syntax is wrong. You are trying to use a nested foreach without grouping your data first. What a nested foreach does is perform the operations you define in the block of code over a grouped relation. Therefore, the only way your code could work is by grouping the data first:
data1 = load '/File1' using PigStorage(';') as (id,number);
data2 = load '/File2' using PigStorage(';') as (numberInfo, collection);
data1 = group data1 by id;
out = foreach data1{
Data_filter = FILTER data2 by (numberInfo matches CONCAT(number,''));
generate Data_filter;
}
However, this won't work because inside a nested foreach you cannot refer to a different relation like data2.
What you really want is a JOIN operation over both relations, using number for data1 and numberInfo for data2. That gives you this:
joined_data = join data1 by number, data2 by numberInfo;
dump joined_data;
(12345,11,11,(1,2,3,4,5,6,7,8,9))
(34567,12,12,(9,8,7,6,5,4,3,2,1))
In your question you said you only wanted the last column as output, so now you can use a foreach to generate just the column you want:
final_data = foreach joined_data generate data2::collection;
dump final_data;
((1,2,3,4,5,6,7,8,9))
((9,8,7,6,5,4,3,2,1))

Extract matching tuples in bag in PIG

I have raw data in bag:
{(id,35821),(lang,en-US),(pf_1,us)}
{(path,/ybe/wer),(id,23481),(lang,en-US),(intl,us),(pf_1,yahoo),(pf_3,test)}
{(id,98234),(lang,ir-IL),(pf_1,il),(pf_2,werasdf|dfsas)}
How could I extract the tuples whose first field is id or matches pf_*?
The output I want:
{(id,35821),(pf_1,us)}
{(id,23481),(pf_1,yahoo),(pf_3,test)}
{(id,98234),(pf_1,il),(pf_2,werasdf|dfsas)}
Any suggestion would be appreciated. Thanks!
In order to process the inner bag (a bag in a format like OUTER_BAG: {INNER_BAG: {(e:int)}}), you are going to have to use a nested FOREACH. This will allow you to perform operations over the tuples in the inner bag.
For example, you are going to want to do something like:
-- A: {inner_bag: {(val1: chararray, val2: chararray)}}
B = FOREACH A {
    filtered_bags = FILTER inner_bag BY val1 matches '^(id|pf_).*';
    GENERATE filtered_bags;
};
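For the sample bags in the question, dumping B should then give something along these lines (each output row is a tuple wrapping the filtered bag):
({(id,35821),(pf_1,us)})
({(id,23481),(pf_1,yahoo),(pf_3,test)})
({(id,98234),(pf_1,il),(pf_2,werasdf|dfsas)})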

Grouping a databag by identical values in pig

I created the following Pig script to filter the sentences from a collection of web documents (Common Crawl) that mention a movie title (from a predefined data file of movie titles), apply sentiment analysis on those sentences and group those sentiments by movie.
register ../commoncrawl-examples/lib/*.jar;
set mapred.task.timeout 1000;
register ../commoncrawl-examples/dist/lib/commoncrawl-examples-1.0.1-HM.jar;
register ../dist/lib/movierankings-1.jar;
register ../lib/piggybank.jar;
register ../lib/stanford-corenlp-full-2014-01-04/stanford-corenlp-3.3.1.jar;
register ../lib/stanford-corenlp-full-2014-01-04/stanford-corenlp-3.3.1-models.jar;
register ../lib/stanford-corenlp-full-2014-01-04/ejml-0.23.jar;
register ../lib/stanford-corenlp-full-2014-01-04/joda-time.jar;
register ../lib/stanford-corenlp-full-2014-01-04/jollyday.jar;
register ../lib/stanford-corenlp-full-2014-01-04/xom.jar;
DEFINE IsNotWord com.moviereviewsentimentrankings.IsNotWord;
DEFINE IsMovieDocument com.moviereviewsentimentrankings.IsMovieDocument;
DEFINE ToSentenceMoviePairs com.moviereviewsentimentrankings.ToSentenceMoviePairs;
DEFINE ToSentiment com.moviereviewsentimentrankings.ToSentiment;
DEFINE MoviesInDocument com.moviereviewsentimentrankings.MoviesInDocument;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
-- LOAD pages, movies and words
pages = LOAD '../data/textData-*' USING SequenceFileLoader as (url:chararray, content:chararray);
movies_fltr_grp = LOAD '../data/movie_fltr_grp_2/part-*' as (group: chararray,movies_fltr: {(movie: chararray)});
-- FILTER pages containing movie
movie_pages = FILTER pages BY IsMovieDocument(content, movies_fltr_grp.movies_fltr);
-- SPLIT pages containing movie in sentences and create movie-sentence pairs
movie_sentences = FOREACH movie_pages GENERATE flatten(ToSentenceMoviePairs(content, movies_fltr_grp.movies_fltr)) as (content:chararray, movie:chararray);
-- Calculate sentiment for each movie-sentence pair
movie_sentiment = FOREACH movie_sentences GENERATE flatten(ToSentiment(movie, content)) as (movie:chararray, sentiment:int);
-- GROUP movie-sentiment pairs by movie
movie_sentiment_grp_tups = GROUP movie_sentiment BY movie;
-- Reformat and print movie-sentiment pairs
movie_sentiment_grp = FOREACH movie_sentiment_grp_tups GENERATE group, movie_sentiment.sentiment AS sentiments:{(sentiment: int)};
describe movie_sentiment_grp;
Test runs on a small subset of the web crawl successfully gave me pairs of a movie title with a databag of integers (from 1 to 5, representing very negative, negative, neutral, positive and very positive). As a last step I would like to transform this data into pairs of a movie title and a databag containing tuples of every distinct integer occurring for that movie title together with its count. The describe movie_sentiment_grp at the end of the script returns:
movie_sentiment_grp: {group: chararray,sentiments: {(sentiment: int)}}
So basically I probably need to FOREACH over each element of movie_sentiment_grp and GROUP the sentiments databag into groups of identical values and then use the COUNT() function to get the number of elements in each group. I was however not able to find anything on how to group a databag of integers into groups of identical values. Does anyone know how to do this?
Dummy solution:
movie_sentiment_grp_cnt = FOREACH movie_sentiment_grp{
sentiments_grp = GROUP sentiments BY ?;
}
Check out the CountEach UDF from Apache DataFu. Given a bag it will produce a new bag of the distinct tuples, with the count appended to each corresponding tuple.
Example from the documentation should make this clear:
DEFINE CountEachFlatten datafu.pig.bags.CountEach('flatten');
-- input:
-- ({(A),(A),(C),(B)})
data = LOAD 'input' AS (B: bag {T: tuple(alpha:CHARARRAY, numeric:INT)}); -- 'input' is a reserved word in Pig, so use a different alias
-- output_flatten:
-- ({(A,2),(C,1),(B,1)})
output_flatten = FOREACH data GENERATE CountEachFlatten(B);
For your case:
DEFINE CountEachFlatten datafu.pig.bags.CountEach('flatten');
movie_sentiment_grp_cnt = FOREACH movie_sentiment_grp GENERATE
    group,
    CountEachFlatten(sentiments);
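Note that the DataFu jar has to be registered before the DEFINE; the exact path and version depend on your installation, e.g. something like:
register ../lib/datafu-1.2.0.jar; -- placeholder path/version, adjust to your install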
You were on the right track. movie_sentiment_grp is in the right format, and a nested FOREACH would be correct, except you cannot use a GROUP inside it. The solution is to use a UDF. Something like this:
myudfs.py
#!/usr/bin/python

@outputSchema('sentiments: {(sentiment:int, count:int)}')
def count_sentiments(bag):
    # Each element of the bag arrives as a tuple like (sentiment,), so unpack it.
    res = {}
    for tup in bag:
        sentiment = tup[0]
        res[sentiment] = res.get(sentiment, 0) + 1
    return res.items()
This UDF is used like:
Register 'myudfs.py' using jython as myfuncs;
movie_sentiment_grp_cnt = FOREACH movie_sentiment_grp
GENERATE group, myfuncs.count_sentiments(sentiments) ;
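Given the schema declared in @outputSchema, describe movie_sentiment_grp_cnt should then report something like:
movie_sentiment_grp_cnt: {group: chararray,sentiments: {(sentiment: int,count: int)}}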

How would I make a pig script that only returns fields with entries over a certain length?

The data I have is already fielded; I just want a document that contains two of the fields, and even then it should only contain an entry if the title field is over a certain length. This is what I have so far.
records = LOAD '$INPUT' USING PigStorage('\t') AS (url:chararray, title:chararray, meta:chararray, copyright:chararray, aboutUSLink:chararray, aboutTitle:chararray, aboutMeta:chararray, contactUSLink:chararray, contactTitle:chararray, contactMeta:chararray, phones:chararray);
E = FOREACH records IF SIZE(title)>10 GENERATE url,title;
STORE E INTO '$OUTPUT/phoneNumbersAndTitles';
Why does the code exit at IF?
You should use FILTER, which selects tuples from a relation based on some condition:
filtered = FILTER records BY SIZE(title) > 10;
E = FOREACH filtered GENERATE url,title;
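SIZE on a chararray returns the number of characters (as a long), so the comparison works as written. If you prefer a single statement, recent Pig versions also accept the filter inline (same result, purely a stylistic choice):
E = FOREACH (FILTER records BY SIZE(title) > 10) GENERATE url, title;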

Hadoop Pig GROUP by id, get owner_id?

In Hadoop I have many records that look like this:
(item_id,owner_id,counter) - there can be duplicates, but a given item_id ALWAYS has the same owner_id!
I want to get the SUM of the counter for each item_id so I have the following script:
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id);
data = FOREACH group_by_item GENERATE group AS item_id, OWNER_ID_COLUMN_SOMEHOW, SUM(known_items.counter) AS items_count;
The problem is that in the FOREACH, if I take known_items.owner_id, I get a bag containing the owner_id of every row in the group. What would be the most efficient way to get just the first of the owners?
The simplest solution gives you the right answer if your assumption that each item_id has the same owner_id is correct, and will let you know if it is not: include the owner_id as part of the group.
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id, owner_id);
data = FOREACH group_by_item GENERATE FLATTEN(group), SUM(known_items.counter) AS items_count;
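Since the group key is now a tuple of (item_id, owner_id), FLATTEN(group) expands it back into two top-level columns; if you want a named schema you can spell that out (a small optional variation):
data = FOREACH group_by_item GENERATE
    FLATTEN(group) AS (item_id, owner_id),
    SUM(known_items.counter) AS items_count;
If any item_id turns up with more than one owner_id, it will now appear as multiple output rows instead of being silently collapsed, which is the "will let you know" part above.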
