"Flattening" a databag in Pig - hadoop

Suppose I have a bunch of databags generated by a Pig UDF, each holding several tuples of Strings. How can I pull all of them out of the databags and simply make each String its own "row" of data?
databags = FOREACH data GENERATE pigUdfThatMakesDataBags(data::someText);
strings = FOREACH databags { ??? };

databags = FOREACH data GENERATE pigUdfThatMakesDataBags(data::someText);
datatuples = FOREACH databags GENERATE FLATTEN($0); -- Bag to Tuples
strings = FOREACH datatuples GENERATE FLATTEN(TOBAG(*)); -- Tuples to Tokens
DUMP strings;

Do I understand correctly that you're looking for FLATTEN?
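To make the two FLATTEN steps concrete, here is a small plain-Python model of what they do to the data. It is only a sketch and not Pig code: a bag is modeled as a list of tuples, and the sample strings are made up for illustration.

```python
# Plain-Python model of the two FLATTEN steps (an illustration, not Pig).
# A bag is modeled as a list of tuples; the sample strings are hypothetical.
databags = [
    [("a", "b"), ("c",)],   # bag produced by the UDF for the first input row
    [("d", "e", "f")],      # bag produced by the UDF for the second input row
]

# Step 1, FLATTEN($0): every tuple in every bag becomes its own row.
datatuples = [tup for bag in databags for tup in bag]

# Step 2, FLATTEN(TOBAG(*)): every field of every tuple becomes its own row.
strings = [(field,) for tup in datatuples for field in tup]

for row in strings:
    print(row)
```

Each printed row holds exactly one string, which is the shape the question asks for.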

Related

Order of Apache Pig Transformations

I am reading through Programming Pig by Alan Gates.
Consider the code:
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS
(userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage ('|') AS
(movieID:int, movieTitle:chararray, releaseDate:chararray, imdbLink: chararray);
nameLookup = FOREACH metadata GENERATE
movieID, movieTitle, ToDate(releaseDate, 'dd-MMM-yyyy') AS releaseYear;
nameLookupYear = FOREACH nameLookup GENERATE
movieID, movieTitle, GetYear(releaseYear) AS finalYear;
filterMovies = FILTER nameLookupYear BY finalYear < 1982;
groupedMovies = GROUP filterMovies BY finalYear;
orderedMovies = FOREACH groupedMovies {
sortOrder = ORDER metadata by finalYear DESC;
GENERATE GROUP, finalYear;
};
DUMP orderedMovies;
It states that
"Sorting by maps, tuples or bags produces error".
I want to know how I can sort the grouped results.
Do the transformations need to follow a certain sequence for them to work?
Since you are trying to sort the grouped results, you do not need a nested foreach. You would use the nested foreach if you were trying to, for example, sort each movie within the year by title or release date. Try ordering as usual (refer to finalYear as group since you grouped by finalYear in the previous line):
orderedMovies = ORDER groupedMovies BY group ASC;
DUMP orderedMovies;
If you are looking to sort the values inside each group, then you will have to use a nested foreach. Inside the block, the grouped rows are available as a bag named after the grouped relation (filterMovies), so order that bag and emit it together with the group key. This sorts the movies by title within each year:
orderedMovies = FOREACH groupedMovies {
    sortOrder = ORDER filterMovies BY movieTitle ASC;
    GENERATE group, sortOrder;
};
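The two orderings discussed in these answers differ in what gets sorted: the groups themselves, or the rows inside each group. The following plain-Python sketch models both. It is not Pig, and the sample rows and the helper name group_by_year are made up for illustration.

```python
from itertools import groupby

# Hypothetical (movieID, movieTitle, finalYear) rows, not taken from the dataset.
movies = [
    (3, "Gamma", 1977),
    (1, "Beta", 1979),
    (2, "Delta", 1975),
    (4, "Alpha", 1977),
]

def group_by_year(rows):
    """Model of GROUP ... BY finalYear: returns (group key, bag) pairs."""
    keyed = sorted(rows, key=lambda r: r[2])
    return [(year, list(bag)) for year, bag in groupby(keyed, key=lambda r: r[2])]

grouped = group_by_year(movies)

# Model of ORDER groupedMovies BY group ASC: sort the groups themselves.
ordered_groups = sorted(grouped, key=lambda g: g[0])

# Model of a nested ORDER: sort the rows inside each group's bag, here by title.
ordered_within = [(year, sorted(bag, key=lambda r: r[1])) for year, bag in grouped]

for year, bag in ordered_within:
    print(year, bag)
```

In the model, as in Pig, the outer sort only needs the group key, while the inner sort has to work on the bag that the grouping produced.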

Filter inner bag in Pig

The data looks like this:
22678, {(112),(110),(2)}
656565, {(110), (109)}
6676, {(2),(112)}
This is the data structure:
(id:chararray, event_list:{innertuple:(innerfield:chararray)})
I want to filter those rows where event_list contains 2. I thought initially to flatten the data and then filter those rows that have 2. Somehow flatten doesn't work on this dataset.
Can anyone please help?
There might be a simpler way of doing this, like a bag lookup etc. Otherwise with basic pig one way of achieving this is:
data = load 'data.txt' AS (id:chararray, event_list:bag{});
-- flatten bag, in order to transpose each element to a separate row.
flattened = foreach data generate id, flatten(event_list);
-- keep only those rows where the value is 2.
filtered = filter flattened by (int) $1 == 2;
-- keep only distinct ids.
dist = distinct (foreach filtered generate $0 as (id:chararray));
-- join distinct ids to original relation
jnd = join data by id, dist by id;
-- remove extra fields, keep original fields.
result = foreach jnd generate data::id, data::event_list;
dump result;
(22678,{(112),(110),(2)})
(6676,{(2),(112)})
You can filter the bag inside a nested foreach and project a boolean that says whether 2 is present in it. Then keep only the rows for which that projection is true.
So:
-- 'input' and 'output' are reserved words in Pig, so other alias names are used here.
rows = LOAD 'data.txt' AS (id:chararray, event_list:bag{});
rows_filt = FOREACH rows {
    -- the bag is loaded without a schema, so its single field is referenced by position
    bag_filter = FILTER event_list BY ((chararray) $0 matches '2');
    GENERATE
        id,
        event_list,
        (IsEmpty(bag_filter) ? false : true) AS is_2_present:boolean;
};
result = FILTER rows_filt BY is_2_present;
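Both answers compute the same thing: keep a row when its bag contains the value 2. As a cross-check of the expected result, here is a plain-Python model of that rule using the sample data from the question. It is a sketch rather than Pig, and the helper name contains_event is made up for illustration.

```python
# Sample rows from the question, modeled as (id, bag) with a bag as a list of 1-tuples.
rows = [
    ("22678", [("112",), ("110",), ("2",)]),
    ("656565", [("110",), ("109",)]),
    ("6676", [("2",), ("112",)]),
]

def contains_event(event_list, wanted):
    """Return True if any tuple in the bag holds the wanted value."""
    return any(field == wanted for (field,) in event_list)

# Keep only the rows whose bag contains "2"; the bag itself is left untouched.
result = [(row_id, events) for row_id, events in rows if contains_event(events, "2")]

for row in result:
    print(row)
```

The surviving ids are 22678 and 6676, which matches the output shown for the join-based answer.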

How do I get the matching values inside a for loop using FILTER in PIG?

Consider this as my input,
Input (File1):
12345;11
34567;12
.
.
Input (File2):
11;(1,2,3,4,5,6,7,8,9)
12;(9,8,7,6,5,4,3,2,1)
.
.
I would like to get the output as follows:
Output:
(1,2,3,4,5,6,7,8,9)
(9,8,7,6,5,4,3,2,1)
Here's the sample code which I have tried using FILTER and I face some errors with this. Please suggest me some other options.
data1 = load '/File1' using PigStorage(';') as (id,number);
data2 = load '/File2' using PigStorage(';') as (numberInfo, collection);
out = foreach data1{
Data_filter = FILTER data2 by (numberInfo matches CONCAT(number,''));
generate Data_filter;
}
Is it possible to do this inside a for loop? Please let me know. Thanks in advance!
There are no for loops in Apache Pig. If you need to iterate through each row of the data for some specific purpose, you need to implement your own UDF. The foreach keyword is not used to create a loop; it is used to transform your data based on your columns, optionally applying UDFs to it. You can also use a nested foreach, where you perform operations over each group in your relation.
However, your syntax is wrong. You are trying to use a nested foreach without grouping your data first. A nested foreach performs the operations you define in the block of code over a grouped relation. Therefore, the only way your code could work is by grouping the data first:
data1 = load '/File1' using PigStorage(';') as (id,number);
data2 = load '/File2' using PigStorage(';') as (numberInfo, collection);
data1 = group data1 by id;
out = foreach data1{
Data_filter = FILTER data2 by (numberInfo matches CONCAT(number,''));
generate Data_filter;
}
However, this won't work because inside a nested foreach you cannot refer to a different relation like data2.
What you really want is a JOIN operation over both relations, using number for data1 and numberInfo for data2. This will give you:
joined_data = join data1 by number, data2 by numberInfo;
dump joined_data;
(12345,11,11,(1,2,3,4,5,6,7,8,9))
(34567,12,12,(9,8,7,6,5,4,3,2,1))
In your question you said you only wanted as output the last column, so now you can use a foreach to generate the column you want:
final_data = foreach joined_data generate data2::collection;
dump final_data;
((1,2,3,4,5,6,7,8,9))
((9,8,7,6,5,4,3,2,1))
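For readers who think in terms of loops, the join and the final projection can be modeled in plain Python as a dictionary lookup. This is only an illustrative sketch, not Pig: it uses the two sample rows from the question, and the variable names are made up.

```python
# Sample rows from the question: data1 is (id, number), data2 is (numberInfo, collection).
data1 = [("12345", "11"), ("34567", "12")]
data2 = [
    ("11", (1, 2, 3, 4, 5, 6, 7, 8, 9)),
    ("12", (9, 8, 7, 6, 5, 4, 3, 2, 1)),
]

# Index data2 by its join key; a key may map to several collections.
lookup = {}
for number_info, collection in data2:
    lookup.setdefault(number_info, []).append(collection)

# Model of: join data1 by number, data2 by numberInfo (an inner join).
joined_data = [
    (id_, number, number, collection)
    for id_, number in data1
    for collection in lookup.get(number, [])
]

# Model of: foreach joined_data generate data2::collection
final_data = [(row[3],) for row in joined_data]

for row in final_data:
    print(row)
```

Rows of data1 whose number has no match in data2 simply produce nothing, which is the inner-join behaviour of the Pig statement.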

extracting a tuple from a bag

I have a relation of bags of tuples which looks like this. The tuples in the bag come preordered.
{(123,1383313457523,1,US),(123,1383313457543,2,US),(123,1383313457553,3,US)}
{(456,1383313457623,1,UK),(456,1383313457643,2,UK),(456,1383313457653,3,UK)}
{(789,1383313457723,1,UK),(789,1383313457743,2,UK),(789,1383313457753,3,UK)}
Where the tuple is: (id:chararray,time:long,event:chararray,location:chararray)
I want to get the first element of each bag. So my expected output would be:
(123,1383313457523,1,US)
(456,1383313457623,1,UK)
(789,1383313457723,1,UK)
I tried this:
data = load 'mydata.txt' USING PigStorage('\t');
A = FOREACH data GENERATE $0;
dump A;
Which produces the same list of data bags as I had originally.
Alternatively trying to extract just the ids
data = load 'mydata.txt' USING PigStorage('\t');
A = FOREACH data GENERATE $0.$0;
dump A;
I expect:
(123)
(456)
(789)
but I get
{(123),(123),(123)}
{(456),(456),(456)}
{(789),(789),(789)}
How do I adjust my script to get the data that I want?
Use LIMIT inside a nested foreach:
A = FOREACH data { first = LIMIT $0 1; GENERATE FLATTEN(first); }
You cannot count on the tuples in your bag being ordered, since by definition a bag is unordered. However, you can also put an ORDER BY in a nested foreach:
A = FOREACH data { ord = ORDER $0 BY $1; first = LIMIT ord 1; GENERATE FLATTEN(first); }
I find these to be more readable if they are split up onto multiple lines:
A =
    FOREACH data {
        ord = ORDER $0 BY $1;
        first = LIMIT ord 1;
        GENERATE
            FLATTEN(first);
    };
I'm assuming that the bag is ordered by the second field of each tuple ($1).
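The ORDER followed by LIMIT 1 amounts to taking the tuple with the smallest time from each bag. A plain-Python model of that idea, using the question's sample tuples in deliberately shuffled order, might look like this. It is a sketch, not Pig, and it carries over the answer's assumption that the second field is the sort key.

```python
# Bags from the question, shuffled on purpose to show that order is not relied on.
bags = [
    [("123", 1383313457543, "2", "US"), ("123", 1383313457523, "1", "US"),
     ("123", 1383313457553, "3", "US")],
    [("456", 1383313457653, "3", "UK"), ("456", 1383313457623, "1", "UK"),
     ("456", 1383313457643, "2", "UK")],
    [("789", 1383313457723, "1", "UK"), ("789", 1383313457753, "3", "UK"),
     ("789", 1383313457743, "2", "UK")],
]

# Model of ORDER $0 BY $1 then LIMIT 1 then FLATTEN: pick the tuple with the
# smallest time per bag. Empty bags are skipped, since flattening an empty bag
# yields no row.
first_rows = [min(bag, key=lambda t: t[1]) for bag in bags if bag]

for row in first_rows:
    print(row)
```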

How to order items in tuples?

I have pairs of numbers and I want to sort them.
grunt> dump unordered
(11,22)
(88,33)
(55,66)
How do I sort them to:
(11,22)
(33,88)
(55,66)
Tried to use bags:
grunt> bag_of_pairs = foreach unordered generate TOBAG(TOTUPLE($0),TOTUPLE($1));
grunt> ordered = foreach bag_of_pairs {o1 = order $0 by $0; generate o1;}
And ended up with this ordered, but over-wrapped list, that I don't know how to simplify:
grunt> dump ordered
({(11),(22)})
({(33),(88)})
({(55),(66)})
Thanks
You need a UDF to convert the bag to a tuple. However, since you only need to order two items this can also be done with a bincond.
ordered = FOREACH unordered GENERATE ($0<$1?$0:$1), ($0<$1?$1:$0) ;
NOTE: I can't test this right now, but this should also work.
ordered = FOREACH unordered GENERATE FLATTEN(($0<$1?($0,$1):($1,$0)));
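The bincond is just a conditional swap of the two fields. As a quick sanity check of the logic, here it is modeled in plain Python (not Pig) on the sample pairs from the question:

```python
# Sample pairs from the question.
unordered = [(11, 22), (88, 33), (55, 66)]

# Same rule as the bincond: keep the pair if $0 < $1, otherwise swap the fields.
ordered = [(a, b) if a < b else (b, a) for a, b in unordered]

for pair in ordered:
    print(pair)
```

Note that the result stays a flat two-field row per input row, so no bag or extra wrapping is introduced.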
