Pig 0.12 Nested Foreach not working properly - filter

I've been trying to do this for a while and can't figure it out, and it's hard to search for a fix for this particular problem.
I have a relation that I previously grouped by user_id and listing_id; after generating and flattening the output I got this:
test: {user_id: bytearray,listing_id: bytearray,hotness: long}
So my next step is to group by user, order by hotness and limit the amount of listings per user to 20.
grped = GROUP test BY user_id;
grped_sorted = FOREACH grped {
sorted = order test BY hotness desc;
top1 = limit sorted 20;
listings = FOREACH top1 GENERATE FLATTEN((bytearray)top1.listing_id) as listing_id;
GENERATE group as user_id, FLATTEN(listings.($0)) as listing_ids;
};
But this seems to be getting me the error, with information that was previously stripped from the listings details:
Scalar has more than one row in the output.
Please, I need help on this.
Is there a way to do this? Can I use some UDF from DataFu?
Creating my own UDF is out of the question.
Thanks in advance.

I think it should work if the code looks like this
grped = GROUP test BY user_id;
grped_sorted = FOREACH grped {
sorted = order test BY hotness desc;
top1 = limit sorted 20;
GENERATE group as user_id, top1.listing_id as listing_ids;
};
Output of this would be something like
grped_sorted: {user_id: bytearray,listing_ids: {(listing_id: bytearray)}}
Not sure if this is what you want though.

Related

why does my pig query return the wrong values

I am trying to work with the following dataset in pig
https://www.kaggle.com/zynicide/wine-reviews/version/4?
My query has been returning the wrong values. The only reason I can think of is that it has to do with missing data in the dataset,
but I don't know if that's it, or exactly why I get the wrong values.
allWines = LOAD 'winemag-data_first150k.csv' USING PigStorage(',') AS (id:chararray, country:chararray, description:chararray, designation:chararray, points:chararray, price:chararray, province:chararray, region_2:chararray, region_1:chararray, variety:chararray, winery:chararray);
allWinesNotNull = FILTER allWines BY price is not null;
allWinesNotNull2 = FILTER allWinesNotNull BY points is not null;
allWinesPriceSorted = ORDER allWinesNotNull2 BY price;
allWinesPriceTop5Sorted = LIMIT allWinesPriceSorted 5;
allWinesPricePoints = FOREACH allWinesPriceTop5Sorted GENERATE id, price;
DUMP allWinesPricePoints;
DESCRIBE allWinesPricePoints;
The actual results I get are
(56203, buttered toast and spice flavors that are wrapped into a creamy texture. Should hold for a year or two.")
(61341, sweet tannins. The fresh acidity gives it an extra boost. Give it time. Best 2007–2012.")
(16417, Chardonnay is also known)
(115384, almonds and vanilla)
(136804, almonds and vanilla)
I think the output should be
(56203, 23)
(61341, 30)
(16417, 16)
(115384, 250)
(136804, 250)
I would have expected the second value to be numeric and in the price column
Proceed as follows:
allWines = LOAD 'winemag-data_first150k.csv' USING PigStorage(',') AS (id:chararray, country:chararray, description:chararray, designation:chararray, points:chararray, price:chararray, province:chararray, region_2:chararray, region_1:chararray, variety:chararray, winery:chararray);
--comments
--add below foreach to generate the values this will help you out to parse data correctly
--generate column in the same order as it is in the text file
allWines= FOREACH allWines GENERATE
id AS id,
country AS country,
description AS description,
designation AS designation,
points AS points,
price AS price,
province AS province,
region_2 AS region_2,
region_1 AS region_1,
variety AS variety,
winery AS winery;
allWinesNotNull = FILTER allWines BY price is not null;
allWinesNotNull2 = FILTER allWinesNotNull BY points is not null;
allWinesPriceSorted = ORDER allWinesNotNull2 BY price;
allWinesPriceTop5Sorted = LIMIT allWinesPriceSorted 5;
allWinesPricePoints = FOREACH allWinesPriceTop5Sorted GENERATE id, price;
DUMP allWinesPricePoints;
DESCRIBE allWinesPricePoints;
Hope this helps.
Let me know if you have any concerns.
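A possible cause that the answer above does not address (an assumption based on the shape of the output, not something confirmed in the thread): the description column in this CSV is quoted and contains commas, and a plain split on ',' does not honor quotes, so every later column shifts to the right. The Python sketch below illustrates the effect on a made-up row, not on the real file. In Pig itself, the PiggyBank loader CSVExcelStorage is intended for quoted CSV and may be worth trying.

```python
import csv
import io

# Hypothetical row shaped like the wine data (id, country, description,
# designation, points, price, province). The description is quoted and
# contains commas. This is made-up sample data, not taken from the file.
row = '7,US,"Ripe fruit, vanilla, buttered toast, and spice flavors.",Reserve,91,23,California'

# Naive split on every comma, ignoring quotes.
naive = row.split(',')

# Quote-aware parsing keeps the description in a single field.
parsed = next(csv.reader(io.StringIO(row)))

# Index 5 is where the price column is expected.
print(len(naive), repr(naive[5]))
print(len(parsed), repr(parsed[5]))
```

With the naive split, position 5 holds a fragment of the description, which resembles the unexpected output in the question. With quote-aware parsing it holds the price.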

Order of Apache Pig Transformations

I am reading through Programming Pig by Alan Gates.
Consider the code:
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS
(userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage ('|') AS
(movieID:int, movieTitle:chararray, releaseDate:chararray, imdbLink: chararray);
nameLookup = FOREACH metadata GENERATE
movieID, movieTitle, ToDate(releaseDate, 'dd-MMM-yyyy') AS releaseYear;
nameLookupYear = FOREACH nameLookup GENERATE
movieID, movieTitle, GetYear(releaseYear) AS finalYear;
filterMovies = FILTER nameLookupYear BY finalYear < 1982;
groupedMovies = GROUP filterMovies BY finalYear;
orderedMovies = FOREACH groupedMovies {
sortOrder = ORDER metadata by finalYear DESC;
GENERATE GROUP, finalYear;
};
DUMP orderedMovies;
It states that
"Sorting by maps, tuples or bags produces error".
I want to know how I can sort the grouped results.
Do the transformations need to follow a certain sequence for them to work?
Since you are trying to sort the grouped results, you do not need a nested foreach. You would use the nested foreach if you were trying to, for example, sort each movie within the year by title or release date. Try ordering as usual (refer to finalYear as group since you grouped by finalYear in the previous line):
orderedMovies = ORDER groupedMovies BY group ASC;
DUMP orderedMovies;
If you are looking to sort the values inside each group, then you will have to use a nested foreach. Inside it, refer to the bag by the name of the grouped relation (filterMovies) and use only fields that still exist at that point. For example, this sorts the movies by title within each year:
orderedMovies = FOREACH groupedMovies {
sortOrder = ORDER filterMovies BY movieTitle ASC;
GENERATE group AS finalYear, sortOrder.(movieID, movieTitle);
};

Pig de-duplicate events occurring within 1 minute of each other

We are using pig-0.11.0-cdh4.3.0 with a CDH4 cluster and we need to de-duplicate some web logs. The solution idea (expressed in SQL) is something like this:
SELECT
T1.browser,
T1.click_type,
T1.referrer,
T1.datetime,
T2.datetime
FROM
My_Table T1
INNER JOIN My_Table T2 ON
T2.browser = T1.browser AND
T2.click_type = T1.click_type AND
T2.referrer = T1.referrer AND
T2.datetime > T1.datetime AND
T2.datetime <= DATEADD(mi, 1, T1.datetime)
I grabbed the above from the question "SQL find duplicate records occuring within 1 minute of each other". I am hoping to implement a similar solution in Pig, but apparently Pig does not support JOIN on an expression (only on fields), which the join above requires. Do you know how to de-duplicate events that occur within 1 minute of each other in Pig? Thanks!
One approach is to group by the required parameters and then filter and limit inside a nested foreach, like this:
top3 = foreach grpd {
sorted = filter records by time < 60;
top = limit sorted 2;
generate group, flatten(top);
};
Another approach would be:
records_group = group records by (browser, click_type, referrer);
with_min = FOREACH records_group
GENERATE
FLATTEN(records), MAX(records.datetime) as maxDt ;
filterRecords = filter with_min by (maxDt - $2 ) <60;
Here $2 is the position of the datetime field; change it accordingly.
Off the top of my head, something like this could work, but it needs testing:
view = FOREACH input GENERATE browser, click_type, referrer, datetime, GetYear(datetime) as year, GetMonth(datetime) as month, GetDay(datetime) as day, GetHour(datetime) as hour, GetMinute(datetime) as minute;
grp = GROUP view BY (browser, click_type, referrer, year, month, day, hour, minute);
uniq = FOREACH grp {
top = LIMIT view 1;
GENERATE FLATTEN(top.(browser, click_type, referrer, datetime));
};
Of course, with this approach an event at 12:03:45 and another at 12:03:59 would fall in the same group, while 12:04:45 and 12:05:00 would fall in different groups.
To get the exact 60 seconds difference you would need to write a UDF which would iterate over a sorted bag grouped on (browser, click_type, referrer) and remove unwanted rows.
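The kind of loop such a UDF might run over one sorted group can be sketched in Python. This is only an illustration under stated assumptions: timestamps are epoch seconds, the function name dedupe_within_window is hypothetical, and "duplicate" is taken to mean "at most 60 seconds after the most recently kept event", which is one reading of the SQL condition in the question.

```python
def dedupe_within_window(timestamps, window_seconds=60):
    """Keep an event only if it is more than window_seconds after the last kept one.

    timestamps: epoch seconds for a single (browser, click_type, referrer) group.
    """
    kept = []
    for ts in sorted(timestamps):
        # The first event is always kept; later ones must fall outside the window.
        if not kept or ts - kept[-1] > window_seconds:
            kept.append(ts)
    return kept


# Hypothetical group: three clicks close together, then one much later.
example = [1000, 1030, 1061, 2000]
result = dedupe_within_window(example)
print(result)
```

Comparing against the last kept event rather than the last seen event is a design choice: a long chain of clicks 30 seconds apart then yields one kept event per minute instead of collapsing to a single event.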
Aleks and Marq,
records_group = group records by (browser, click_type, referrer);
with_min = FOREACH records_group
GENERATE FLATTEN(records), MAX(records.datetime) as max;
with_min = FOREACH with_min GENERATE browser, click_type, referrer,
ABS(max - datetime) as maxDtGroup;
regroup = group with_min by (browser, click_type, referrer, maxDtGroup);
Regrouping with maxDtGroup in the key is the important step; then keep the top 1 record of each group.

Hadoop Pig GROUP by id, get owner_id?

In Hadoop I have many records that look like this:
(item_id,owner_id,counter) - there could be duplicates but ALWAYS the item_id has the same owner_id!
I want to get the SUM of the counter for each item_id so I have the following script:
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id);
data = FOREACH group_by_item GENERATE group AS item_id, OWNER_ID_COLUMN_SOMEHOW, SUM(known_items.counter) AS items_count;
The problem is that in the FOREACH if I want to take known_items.owner_id - that would be a tuple that has the sum of all grouped item_id. What would be the most efficient way to get the first one of the owners?
The simplest solution gives you the right answer if your assumption that each item_id has the same owner_id is correct, and will let you know if it is not: include the owner_id as part of the group.
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id, owner_id);
data = FOREACH group_by_item GENERATE FLATTEN(group), SUM(known_items.counter) AS items_count;
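Why grouping on the composite key also exposes a broken assumption can be seen in a small Python sketch. The records below are hypothetical, and item 2 deliberately has two owners; an item whose owner is not unique shows up as more than one output row.

```python
from collections import defaultdict

# Hypothetical (item_id, owner_id, counter) records. Item 2 breaks the
# one-owner-per-item assumption on purpose.
records = [(1, 10, 5), (1, 10, 7), (2, 20, 1), (2, 21, 4)]

# Sum the counter per (item_id, owner_id), like grouping on both fields.
totals = defaultdict(int)
for item_id, owner_id, counter in records:
    totals[(item_id, owner_id)] += counter

# An item that appears under more than one owner violates the assumption.
owners_per_item = defaultdict(set)
for item_id, owner_id in totals:
    owners_per_item[item_id].add(owner_id)

inconsistent = sorted(i for i, owners in owners_per_item.items() if len(owners) > 1)
print(dict(totals))
print(inconsistent)
```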

How can I select record with minimum value in pig latin

I have timestamped samples and I'm processing them using Pig. I want to find, for each day, the minimum value of the sample and the time of that minimum. So I need to select the record that contains the sample with the minimum value.
In the following for simplicity I'll represent time in two fields, the first is the day and the second the "time" within the day.
1,1,4.5
1,2,3.4
1,5,5.6
To find the minimum the following works:
samples = LOAD 'testdata' USING PigStorage(',') AS (day:int, time:int, samp:float);
g = GROUP samples BY day;
dailyminima = FOREACH g GENERATE group as day, MIN(samples.samp) as samp;
But then I've lost the exact time at which the minimum happened. I hoped I could use nested expressions. I tried the following:
dailyminima = FOREACH g {
minsample = MIN(samples.samp);
mintuple = FILTER samples BY samp == minsample;
GENERATE group as day, mintuple.time, mintuple.samp;
};
But with that I receive the error message:
2012-11-12 12:08:40,458 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000:
<line 5, column 29> Invalid field reference. Referenced field [samp] does not exist in schema: .
Details at logfile: /home/hadoop/pig_1352722092997.log
If I set minsample to a constant, it doesn't complain:
dailyminima = FOREACH g {
minsample = 3.4F;
mintuple = FILTER samples BY samp == minsample;
GENERATE group as day, mintuple.time, mintuple.samp;
};
And indeed produces a sensible result:
(1,{(2)},{(3.4)})
While writing this I thought of using a separate JOIN:
dailyminima = FOREACH g GENERATE group as day, MIN(samples.samp) as minsamp;
dailyminima = JOIN samples BY (day, samp), dailyminima BY (day, minsamp);
That works, but results (in the real case) in a join over two large data sets instead of a search through a single day's values, which doesn't seem healthy.
In the real case I actually want to find max and min and associated times. I hoped that the nested expression approach would allow me to do both at once.
Suggestions of ways to approach this would be appreciated.
Thanks to alexeipab for the link to another SO question.
One working solution (finding both min and max and the associated time) is:
dailyminima = FOREACH g {
minsamples = ORDER samples BY samp;
minsample = LIMIT minsamples 1;
maxsamples = ORDER samples BY samp DESC;
maxsample = LIMIT maxsamples 1;
GENERATE group as day, FLATTEN(minsample), FLATTEN(maxsample);
};
Another way to do it, which has the advantage that it doesn't sort the entire relation, and only keeps the (potential) min in memory, is to use the PiggyBank ExtremalTupleByNthField. This UDF implements Accumulator and Algebraic and is pretty efficient.
Your code would look something like this:
DEFINE TupleByNthField org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField('3', 'min');
samples = LOAD 'testdata' USING PigStorage(',') AS (day:int, time:int, samp:float);
g = GROUP samples BY day;
bagged = FOREACH g GENERATE TupleByNthField(samples);
flattened = FOREACH bagged GENERATE FLATTEN($0);
min_result = FOREACH flattened GENERATE $1 .. ;
Keep in mind that the fact we are sorting based on the samp field is defined in the DEFINE statement by passing 3 as the first param.
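The idea of keeping only the current extremal tuple, instead of sorting, can be illustrated with a short Python sketch. This is not the UDF's actual implementation; extremal_rows is a hypothetical helper, and it tracks both the minimum and the maximum in one pass because the question asks for both. Python indexes fields from 0, so index 2 here is the third field (samp).

```python
def extremal_rows(rows, field_index):
    """Return (min_row, max_row) compared on rows[i][field_index], in one pass."""
    min_row = max_row = None
    for row in rows:
        # Only the current candidates are kept in memory; nothing is sorted.
        if min_row is None or row[field_index] < min_row[field_index]:
            min_row = row
        if max_row is None or row[field_index] > max_row[field_index]:
            max_row = row
    return min_row, max_row


# The (day, time, samp) sample data from the question, for day 1.
samples = [(1, 1, 4.5), (1, 2, 3.4), (1, 5, 5.6)]
lowest, highest = extremal_rows(samples, 2)
print(lowest, highest)
```

Each returned row still carries its time field, which is what a plain MIN or MAX on samp loses.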
