Extract matching tuples in bag in PIG - hadoop

I have raw data in bag:
{(id,35821),(lang,en-US),(pf_1,us)}
{(path,/ybe/wer),(id,23481),(lang,en-US),(intl,us),(pf_1,yahoo),(pf_3,test)}
{(id,98234),(lang,ir-IL),(pf_1,il),(pf_2,werasdf|dfsas)}
How could I extract the tuples whose first field matches id or pf_*?
The output I want:
{(id,35821),(pf_1,us)}
{(id,23481),(pf_1,yahoo),(pf_3,test)}
{(id,98234),(pf_1,il),(pf_2,werasdf|dfsas)}
Any suggestion would be appreciated. Thanks!

In order to process the inner bag (a bag with a schema like OUTER_BAG: {INNER_BAG: {(e:int)}}) you are going to have to use a nested FOREACH. This will allow you to perform operations over the tuples in the inner bag.
For example, you are going to want to do something like:
-- A: {inner_bag: {(val1: chararray, val2: chararray)}}
B = FOREACH A {
    filtered_bags = FILTER inner_bag BY val1 matches '^(id|pf_).*' ;
    GENERATE filtered_bags ;
};
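A fuller sketch, assuming the data lives in a hypothetical 'data.txt' and using the schema from the comment above:
A = LOAD 'data.txt' AS (inner_bag: bag{t: (val1:chararray, val2:chararray)});
B = FOREACH A {
    filtered_bags = FILTER inner_bag BY val1 matches '^(id|pf_).*' ;
    GENERATE filtered_bags ;
};
dump B;
-- for the sample input this should print ({(id,35821),(pf_1,us)}) and so on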

Related

How to get a SQL like GROUP BY using Apache Pig?

I have the following input called movieUserTagFltr:
(260,{(260,starwars),(260,George Lucas),(260,sci-fi),(260,cult classic),(260,Science Fiction),(260,classic),(260,supernatural powers),(260,nerdy),(260,Science Fiction),(260,critically acclaimed),(260,Science Fiction),(260,action),(260,script),(260,"imaginary world),(260,space),(260,Science Fiction),(260,"space epic),(260,Syfy),(260,series),(260,classic sci-fi),(260,space adventure),(260,jedi),(260,awesome soundtrack),(260,awesome),(260,coming of age)})
(858,{(858,Katso Sanna!)})
(924,{(924,slow),(924,boring)})
(1256,{(1256,Marx Brothers)})
it follows the schema: (movieId:int, tags:bag{(movieId:int, tag:chararray),...})
Basically the first number represents a movie id, and the subsequent bag holds all the keywords associated with that movie. I would like to group those keywords in such a way that I would have an output something like this:
(260,{(1,starwars),(1,George Lucas),(1,sci-fi),(1,cult classic),(4,Science Fiction),(1,classic),(1,supernatural powers),(1,nerdy),(1,critically acclaimed),(1,action),(1,script),(1,"imaginary world),(1,space),(1,"space epic),(1,Syfy),(1,series),(1,classic sci-fi),(1,space adventure),(1,jedi),(1,awesome soundtrack),(1,awesome),(1,coming of age)})
(858,{(1,Katso Sanna!)})
(924,{(1,slow),(1,boring)})
(1256,{(1,Marx Brothers)})
Note that the tag Science Fiction has appeared 4 times for the movie with id 260. Using GROUP BY and COUNT I managed to count the distinct keywords for each movie using the following script:
sum = FOREACH group_data {
    unique_tags = DISTINCT movieUserTagFltr.tags::tag;
    GENERATE group, COUNT(unique_tags) as tag;
};
But that only returns a global count per movie; I want a local count per tag. So the logic of what I was thinking was:
result = iterate over each tuple of group_data {
generate a tuple with $0, and a bag with {
foreach distinct tag that group_data has on its $1 variable do {
generate a tuple like: (tag_name, count of how many times that tag appeared on $1)
}
}
}
You can flatten out your original input so that each movieID and tag are their own record. Then group by movieID and tag to get a count for each combination. Finally, group by movieID so that you end up with a bag of tags and counts for each movie.
Let's say you start with movieUserTagFltr with the schema you described:
A = FOREACH movieUserTagFltr GENERATE FLATTEN(tags) AS (movieID, tag);
B = GROUP A BY (movieID, tag);
C = FOREACH B GENERATE
    FLATTEN(group) AS (movieID, tag),
    COUNT(A) AS movie_tag_count;
D = GROUP C BY movieID;
Your final schema is:
D: {group: int,C: {(movieID: int,tag: chararray,movie_tag_count: long)}}
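If you want the inner bag to hold (count, tag) pairs exactly as in the desired output, one more projection along these lines should work (E is just an illustrative alias):
E = FOREACH D GENERATE group AS movieID, C.(movie_tag_count, tag) AS tags;
For movie 260 that bag would then contain (4,Science Fiction) once, alongside the tags that appear a single time.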

Inserting tuples inside an inner bag using Pig Latin - Hadoop

I am trying to create the following format of relation using Pig Latin:
userid, day, {(pid,fulldate, x,y),(pid,fulldate, x,y), ...}
Relation description: Each user (userid) on each day (day) has purchased multiple products (pid).
I am loading the data with:
A = LOAD '**from a HDFS URL**' AS (pid: chararray, userid: chararray, day:int, fulldate: chararray, x: chararray, y:chararray);
B= GROUP A BY (userid, day);
Describe B;
B: {group: (userid: chararray,day: int),A: {(pid: chararray,day: int,fulldate: chararray,x: chararray,userid: chararray,y: chararray)}}
C= FOREACH B FLATTEN(B) AS (userid,day), $1.pid, $1.fulldate,$1.x,$1.y;
Describe C;
C: {userid: chararray,day: int,{(pid: chararray)},{(fulldate: chararray)},{(x: chararray)},{(y: chararray)}}
The result of Describe C does not give the format I want! What am I doing wrong?
You are correct up until the GROUP BY part. After that, however, you are trying to do something messy; I'm actually not sure what is happening with your alias C. To arrive at the format you are looking for, you will need a nested foreach.
C = FOREACH B {
    data = A.(pid, fulldate, x, y);
    GENERATE FLATTEN(group), data;
};
This allows C to have one record for each (userid, day) and all the corresponding (pid,fulldate, x, y) tuples in a bag.
You can read more about nested foreach here: https://www.safaribooksonline.com/library/view/programming-pig/9781449317881/ch06.html (Search for nested foreach in that link).
My understanding is that B is almost what you're looking for, except you would like the tuple containing userid and day to be flattened, and you would like only pid, fulldate, x, and y to appear in the bag.
First, you want to flatten the tuple group, which has the fields userid and day, not the bag A, which contains multiple tuples. Flattening group simply unnests the tuple, which holds exactly one (userid, day) pair per row, whereas flattening the bag A would effectively undo your previous GROUP BY statement, since the values in the bag A are not unique. So the first part should read C = FOREACH B GENERATE FLATTEN(group) AS (userid, day);
Next, you want to keep pid, fulldate, x, and y in separate tuples for each record, but the way you've selected them essentially makes a bag of all the pid values, a bag of all the fulldate values, etc. Instead, try selecting these fields in a way that keeps the tuples nested in the bag:
C = FOREACH B GENERATE
    FLATTEN(group) AS (userid, day),
    A.(pid, fulldate, x, y) AS A;
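If this does what you want, DESCRIBE C should then report a schema roughly like the following (inner alias names may differ):
C: {userid: chararray,day: int,A: {(pid: chararray,fulldate: chararray,x: chararray,y: chararray)}}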

Filter inner bag in Pig

The data looks like this:
22678, {(112),(110),(2)}
656565, {(110), (109)}
6676, {(2),(112)}
This is the data structure:
(id:chararray, event_list:{innertuple:(innerfield:chararray)})
I want to filter those rows where event_list contains 2. I initially thought I would flatten the data and then filter the rows that have 2, but somehow flatten doesn't work on this dataset.
Can anyone please help?
There might be a simpler way of doing this, like a bag lookup etc. Otherwise with basic pig one way of achieving this is:
data = load 'data.txt' AS (id:chararray, event_list:bag{});
-- flatten the bag, in order to transpose each element to a separate row.
flattened = foreach data generate id, flatten(event_list);
-- keep only those rows where the value is 2.
filtered = filter flattened by (int) $1 == 2;
-- keep only distinct ids.
ids = foreach filtered generate id;
dist = distinct ids;
-- join the distinct ids back to the original relation.
jnd = join data by id, dist by id;
-- remove the extra fields, keeping only the original fields.
result = foreach jnd generate data::id, data::event_list;
dump result;
(22678,{(112),(110),(2)})
(6676,{(2),(112)})
You can filter the bag and project a boolean which says whether 2 is present in the bag or not. Then filter the rows where that boolean is true.
So..
-- note: 'input' and 'output' are reserved words in Pig, so different alias names are used here
data = LOAD 'data.txt' AS (id:chararray, event_list:bag{t:(val_0:chararray)});
data_filt = FOREACH data {
    bag_filter = FILTER event_list BY (val_0 matches '2');
    GENERATE
        id,
        event_list,
        (IsEmpty(bag_filter) ? false : true) AS is_2_present:boolean;
};
result = FILTER data_filt BY is_2_present;
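FILTER keeps all three columns, including the helper boolean, so if you want the original two-column shape back, a final projection (a sketch reusing the aliases above) would be:
trimmed = FOREACH result GENERATE id, event_list;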

extracting a tuple from a bag

I have a relation of bags of tuples which looks like this. The tuples in the bag come preordered.
{(123,1383313457523,1,US),(123,1383313457543,2,US),(123,1383313457553,3,US)}
{(456,1383313457623,1,UK),(456,1383313457643,2,UK),(456,1383313457653,3,UK)}
{(789,1383313457723,1,UK),(789,1383313457743,2,UK),(789,1383313457753,3,UK)}
Where the tuple is: (id:chararray,time:long,event:chararray,location:chararray)
I want to get the first element of each bag. So my expected output would be:
(123,1383313457523,1,US)
(456,1383313457623,1,UK)
(789,1383313457723,1,UK)
I tried this:
data = load 'mydata.txt' USING PigStorage('\t');
A = FOREACH data GENERATE $0;
dump A;
Which produces the same list of data bags as I had originally.
Alternatively trying to extract just the ids
data = load 'mydata.txt' USING PigStorage('\t');
A = FOREACH data GENERATE $0.$0;
dump A;
I expect:
(123)
(456)
(789)
but I get
{(123),(123),(123)}
{(456),(456),(456)}
{(789),(789),(789)}
How do I adjust my script to get the data that I want?
Use LIMIT inside a nested foreach:
A = FOREACH data { first = LIMIT $0 1; GENERATE FLATTEN(first); }
You cannot count on the tuples in your bag being ordered, since by definition a bag is unordered. However, you can also put an ORDER BY in a nested foreach:
A = FOREACH data { ord = ORDER $0 BY $1; first = LIMIT ord 1; GENERATE FLATTEN(first); }
I find these to be more readable if they are split up onto multiple lines:
A =
    FOREACH data {
        ord = ORDER $0 BY $1;
        first = LIMIT ord 1;
        GENERATE
            FLATTEN(first);
    };
I'm assuming that the bag is ordered by the second field of each tuple ($1).
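If you then want just the ids from the question's second attempt, you can project the first field of the flattened tuple (a sketch based on the schema above):
ids = FOREACH A GENERATE $0 AS id;
dump ids;
(123)
(456)
(789)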

pig how to concat columns into single string

I have a column of strings that I load using Pig:
A
B
C
D
How do I convert this column into a single string like this?
A,B,C,D
You are going to have to first GROUP ALL to put everything into one bag, then join the contents of the bag together using a UDF. Something like this:
myudfs.py:
#!/usr/bin/python
@outputSchema('concated: chararray')
def concat_bag(BAG):
    # the bag arrives as a list of single-field tuples, so pull out each field before joining
    return ','.join([t[0] for t in BAG])
Register 'myudfs.py' using jython as myfuncs;
A = LOAD 'myfile.txt' AS (letter:chararray) ;
B = GROUP A ALL ;
C = FOREACH B GENERATE myfuncs.concat_bag(A.letter) AS all_letters ;
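For the four single-letter rows above, dumping C should then give something like:
dump C;
(A,B,C,D)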
If your file/schema contains multiple columns, you are probably going to want to project out the column you want to generate the string for. Something like:
A0 = LOAD 'myfile.txt' AS (letter:chararray, val:int, extra:chararray) ;
A = FOREACH A0 GENERATE letter ;
This way you are not keeping around extra columns that will slow down an already expensive operation.
