Extracting a tuple from a bag - Hadoop

I have a relation of bags of tuples which looks like this. The tuples in the bag come preordered.
{(123,1383313457523,1,US),(123,1383313457543,2,US),(123,1383313457553,3,US)}
{(456,1383313457623,1,UK),(456,1383313457643,2,UK),(456,1383313457653,3,UK)}
{(789,1383313457723,1,UK),(789,1383313457743,2,UK),(789,1383313457753,3,UK)}
Where each tuple is: (id:chararray, time:long, event:chararray, location:chararray)
I want to get the first element of each bag. So my expected output would be:
(123,1383313457523,1,US)
(456,1383313457623,1,UK)
(789,1383313457723,1,UK)
I tried this:
data = load 'mydata.txt' USING PigStorage('\t');
A = FOREACH data GENERATE $0;
dump A;
Which produces the same list of data bags as I had originally.
Alternatively trying to extract just the ids
data = load 'mydata.txt' USING PigStorage('\t');
A = FOREACH data GENERATE $0.$0;
dump A;
I expect:
(123)
(456)
(789)
but I get
{(123),(123),(123)}
{(456),(456),(456)}
{(789),(789),(789)}
How do I adjust my script to get the data that I want?

Use LIMIT inside a nested foreach:
A = FOREACH data { first = LIMIT $0 1; GENERATE FLATTEN(first); }
You cannot count on the tuples in your bag being ordered, since by definition a bag is unordered. However, you can also put an ORDER BY in a nested foreach:
A = FOREACH data { ord = ORDER $0 BY $1; first = LIMIT ord 1; GENERATE FLATTEN(first); }
I find these to be more readable if they are split up onto multiple lines:
A =
    FOREACH data {
        ord = ORDER $0 BY $1;
        first = LIMIT ord 1;
        GENERATE
            FLATTEN(first);
    };
I'm assuming that the bag is ordered by the second field of each tuple ($1).
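If you only want the ids (the second attempt in the question), project just the first field after the LIMIT; a small sketch along the same lines, assuming the same ordering:
A = FOREACH data {
    ord = ORDER $0 BY $1;
    first = LIMIT ord 1;
    GENERATE FLATTEN(first.$0) AS id;
};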

Related

How to get a SQL like GROUP BY using Apache Pig?

I have the following input called movieUserTagFltr:
(260,{(260,starwars),(260,George Lucas),(260,sci-fi),(260,cult classic),(260,Science Fiction),(260,classic),(260,supernatural powers),(260,nerdy),(260,Science Fiction),(260,critically acclaimed),(260,Science Fiction),(260,action),(260,script),(260,"imaginary world),(260,space),(260,Science Fiction),(260,"space epic),(260,Syfy),(260,series),(260,classic sci-fi),(260,space adventure),(260,jedi),(260,awesome soundtrack),(260,awesome),(260,coming of age)})
(858,{(858,Katso Sanna!)})
(924,{(924,slow),(924,boring)})
(1256,{(1256,Marx Brothers)})
it follows the schema: (movieId:int, tags:bag{(movieId:int, tag:chararray),...})
Basically the first number represents a movie id, and the subsequent bag holds all the keywords associated with that movie. I would like to group those keywords in such a way that I would have output something like this:
(260,{(1,starwars),(1,George Lucas),(1,sci-fi),(1,cult classic),(4,Science Fiction),(1,classic),(1,supernatural powers),(1,nerdy),(1,critically acclaimed),(1,action),(1,script),(1,"imaginary world),(1,space),(1,"space epic),(1,Syfy),(1,series),(1,classic sci-fi),(1,space adventure),(1,jedi),(1,awesome soundtrack),(1,awesome),(1,coming of age)})
(858,{(1,Katso Sanna!)})
(924,{(1,slow),(1,boring)})
(1256,{(1,Marx Brothers)})
Note that the tag Science Fiction appeared 4 times for the movie with id 260. Using GROUP BY and COUNT I managed to count the distinct keywords for each movie using the following script:
sum = FOREACH group_data {
    unique_tags = DISTINCT movieUserTagFltr.tags::tag;
    GENERATE group, COUNT(unique_tags) as tag;
};
But that only returns a global count; I want a local count. So the logic of what I was thinking was:
result = iterate over each tuple of group_data {
    generate a tuple with $0, and a bag with {
        foreach distinct tag that group_data has in its $1 variable do {
            generate a tuple like: (tag_name, count of how many times that tag appeared in $1)
        }
    }
}
You can flatten out your original input so that each movieID and tag are their own record. Then group by movieID and tag to get a count for each combination. Finally, group by movieID so that you end up with a bag of tags and counts for each movie.
Let's say you start with movieUserTagFltr with the schema you described:
A = FOREACH movieUserTagFltr GENERATE FLATTEN(tags) AS (movieID, tag);
B = GROUP A BY (movieID, tag);
C = FOREACH B GENERATE
    FLATTEN(group) AS (movieID, tag),
    COUNT(A) AS movie_tag_count;
D = GROUP C BY movieID;
Your final schema is:
D: {group: int,C: {(movieID: int,tag: chararray,movie_tag_count: long)}}
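If you want the inner bags shaped like the expected output above (count first, then tag, per movie), one more FOREACH can project just those two fields from each movie's bag; a small sketch, assuming the relation names from the steps above (the output alias is illustrative):
E = FOREACH D GENERATE group AS movieID, C.(movie_tag_count, tag) AS tag_counts;
DUMP E;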

Filter inner bag in Pig

The data looks like this:
22678, {(112),(110),(2)}
656565, {(110), (109)}
6676, {(2),(112)}
This is the data structure:
(id:chararray, event_list:{innertuple:(innerfield:chararray)})
I want to filter those rows where event_list contains 2. My initial thought was to flatten the data and then filter the rows that have 2, but somehow flatten doesn't work on this dataset.
Can anyone please help?
There might be a simpler way of doing this, like a bag lookup etc. Otherwise, with basic Pig, one way of achieving this is:
data = load 'data.txt' AS (id:chararray, event_list:bag{});
-- flatten the bag, in order to transpose each element to a separate row.
flattened = foreach data generate id, flatten(event_list);
-- keep only those rows where the value is 2.
filtered = filter flattened by (int) $1 == 2;
-- keep only distinct ids.
ids = foreach filtered generate $0 as id;
dist = distinct ids;
-- join the distinct ids back to the original relation.
jnd = join data by id, dist by id;
-- remove extra fields, keep only the original fields.
result = foreach jnd generate data::id, data::event_list;
dump result;
(22678,{(112),(110),(2)})
(6676,{(2),(112)})
You can filter the bag and project a boolean that says whether 2 is present in it, then filter out the rows where that projection is false.
So..
input_data = LOAD 'data.txt' AS (id:chararray, event_list:bag{innertuple:(innerfield:chararray)});
input_filt = FOREACH input_data {
    bag_filter = FILTER event_list BY (innerfield matches '2');
    GENERATE
        id,
        event_list,
        (IsEmpty(bag_filter) ? false : true) AS is_2_present:boolean;
};
output_data = FILTER input_filt BY is_2_present;
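If you don't want to carry the helper boolean into the final result, one more FOREACH drops it (a small follow-up sketch, assuming the aliases above):
final_rows = FOREACH output_data GENERATE id, event_list;
DUMP final_rows;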

How to count the occurrences of each value of a specific field using PIG?

The data set is in the form: FIELD_A--FIELD_B
Example:
XYZ--1
XYZ--2
XYZ--8
ABC--4
ABC--3
PQR--5
Expected output:
XYZ-3
ABC-2
PQR-1
data = LOAD 'dataset' USING PigStorage('--');
field1 = FOREACH data GENERATE $0;
grouped = GROUP field1 BY $0;
counts = FOREACH grouped GENERATE group, COUNT(field1);
I don't see why you need FIELD_B; just discard it at the beginning.
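Note that PigStorage expects a single-character delimiter, so depending on your Pig version '--' may be rejected. If so, one workaround is to load each line whole and split it yourself; a hedged sketch (field names are illustrative):
raw = LOAD 'dataset' AS (line:chararray);
split_f = FOREACH raw GENERATE FLATTEN(STRSPLIT(line, '--')) AS (field_a:chararray, field_b:chararray);
grouped = GROUP split_f BY field_a;
counts = FOREACH grouped GENERATE group, COUNT(split_f);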

Extract matching tuples in bag in PIG

I have raw data in bags:
{(id,35821),(lang,en-US),(pf_1,us)}
{(path,/ybe/wer),(id,23481),(lang,en-US),(intl,us),(pf_1,yahoo),(pf_3,test)}
{(id,98234),(lang,ir-IL),(pf_1,il),(pf_2,werasdf|dfsas)}
How can I extract the tuples whose first column matches id or pf_*?
The output I want:
{(id,35821),(pf_1,us)}
{(id,23481),(pf_1,yahoo),(pf_3,test)}
{(id,98234),(pf_1,il),(pf_2,werasdf|dfsas)}
Any suggestion would be appreciated. Thanks!
In order to process the inner bag (a bag in a format like OUTER_BAG: {INNER_BAG: {(e:int)}}) you are going to have to use a nested FOREACH. This will allow you to perform operations over the tuples in the inner bag.
For example, you are going to want to do something like:
-- A: {inner_bag: {(val1: chararray, val2: chararray)}}
B = FOREACH A {
    filtered_bags = FILTER inner_bag BY val1 matches '^(id|pf_).*';
    GENERATE filtered_bags;
};
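If the inner tuples have no declared field names (as in the raw bags shown above), the same nested FILTER works with positional references; a small sketch under that assumption:
B = FOREACH A {
    filtered_bags = FILTER inner_bag BY $0 matches '^(id|pf_).*';
    GENERATE filtered_bags;
};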

How to order items in tuples?

I have pairs of numbers and I want to sort the numbers within each pair.
grunt> dump unordered
(11,22)
(88,33)
(55,66)
How do I sort them to:
(11,22)
(33,88)
(55,66)
Tried to use bags:
grunt> bag_of_pairs = foreach unordered generate TOBAG(TOTUPLE($0),TOTUPLE($1));
grunt> ordered = foreach bag_of_pairs {o1 = order $0 by $0; generate o1;}
And ended up with this ordered but over-wrapped list that I don't know how to simplify:
grunt> dump ordered
({(11),(22)})
({(33),(88)})
({(55),(66)})
Thanks
You need a UDF to convert the bag to a tuple. However, since you only need to order two items, this can also be done with a bincond.
ordered = FOREACH unordered GENERATE ($0<$1?$0:$1), ($0<$1?$1:$0) ;
NOTE: I can't test this right now, but this should also work.
ordered = FOREACH unordered GENERATE FLATTEN(($0<$1 ? ($0,$1) : ($1,$0)));
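If the inline tuple syntax inside the bincond doesn't parse in your Pig version, TOTUPLE can build the tuple explicitly; an untested variant of the same idea:
ordered = FOREACH unordered GENERATE FLATTEN(($0 < $1 ? TOTUPLE($0, $1) : TOTUPLE($1, $0)));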
