Query individual field after group by - hadoop

I have a relation
A =
(John,19,SF)
(Mary,20,NY)
(Bill,23,SF)
(Joe,25,SF)
The schema is (name, age, city)
B = foreach (group A by city)
{
sorted = ORDER A BY age;
info = LIMIT sorted 10;
GENERATE group, info.name;
}
Pig complains that "Scalar has more than one row in the output" for GENERATE group, info.name;
How to query individual field in the bag after group by?
Thanks.

For me the above code works; the output of 'DUMP B;' is
(NY,{(Mary)})
(SF,{(John),(Bill),(Joe)})
As far as querying an individual field after GROUP BY is concerned, you refer to it the same way you are doing now: Alias.fieldName.
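To pull more than one field from the nested bag, project each field (or a tuple of fields) the same way; a minimal sketch against the schema above:

```pig
B = FOREACH (GROUP A BY city) {
    sorted = ORDER A BY age;
    info = LIMIT sorted 10;
    -- info.name is a bag of single-field tuples;
    -- info.(name, age) projects a bag of (name, age) tuples
    GENERATE group, info.(name, age);
};
```

Projecting with Alias.(f1, f2) keeps related fields together in one bag instead of producing parallel bags.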

Related

Order of Apache Pig Transformations

I am reading through Pig Programming by Alan Gates.
Consider the code:
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS
(userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage('|') AS
(movieID:int, movieTitle:chararray, releaseDate:chararray, imdbLink:chararray);
nameLookup = FOREACH metadata GENERATE
movieID, movieTitle, ToDate(releaseDate, 'dd-MMM-yyyy') AS releaseYear;
nameLookupYear = FOREACH nameLookup GENERATE
movieID, movieTitle, GetYear(releaseYear) AS finalYear;
filterMovies = FILTER nameLookupYear BY finalYear < 1982;
groupedMovies = GROUP filterMovies BY finalYear;
orderedMovies = FOREACH groupedMovies {
sortOrder = ORDER metadata by finalYear DESC;
GENERATE GROUP, finalYear;
};
DUMP orderedMovies;
It states that
"Sorting by maps, tuples or bags produces error".
I want to know how I can sort the grouped results.
Do the transformations need to follow a certain sequence for them to work?
Since you are trying to sort the grouped results, you do not need a nested foreach. You would use the nested foreach if you were trying to, for example, sort each movie within the year by title or release date. Try ordering as usual (refer to finalYear as group since you grouped by finalYear in the previous line):
orderedMovies = ORDER groupedMovies BY group ASC;
DUMP orderedMovies;
If you want to sort the values within each group, you will have to use a nested FOREACH. Note that the inner ORDER must reference the grouped alias (filterMovies, which carries finalYear), not metadata, and that group is lowercase. This sorts the records inside each group by year in descending order:
orderedMovies = FOREACH groupedMovies {
sortOrder = ORDER filterMovies BY finalYear DESC;
GENERATE group, sortOrder.(movieID, movieTitle);
};

How to get a SQL like GROUP BY using Apache Pig?

I have the following input called movieUserTagFltr:
(260,{(260,starwars),(260,George Lucas),(260,sci-fi),(260,cult classic),(260,Science Fiction),(260,classic),(260,supernatural powers),(260,nerdy),(260,Science Fiction),(260,critically acclaimed),(260,Science Fiction),(260,action),(260,script),(260,"imaginary world),(260,space),(260,Science Fiction),(260,"space epic),(260,Syfy),(260,series),(260,classic sci-fi),(260,space adventure),(260,jedi),(260,awesome soundtrack),(260,awesome),(260,coming of age)})
(858,{(858,Katso Sanna!)})
(924,{(924,slow),(924,boring)})
(1256,{(1256,Marx Brothers)})
it follows the schema: (movieId:int, tags:bag{(movieId:int, tag:chararray),...})
Basically the first number represents a movie id, and the subsequent bag holds all the keywords associated with that movie. I would like to group those key words in such way that I would have an output something like this:
(260,{(1,starwars),(1,George Lucas),(1,sci-fi),(1,cult classic),(4,Science Fiction),(1,classic),(1,supernatural powers),(1,nerdy),(1,critically acclaimed),(1,action),(1,script),(1,"imaginary world),(1,space),(1,"space epic),(1,Syfy),(1,series),(1,classic sci-fi),(1,space adventure),(1,jedi),(1,awesome soundtrack),(1,awesome),(1,coming of age)})
(858,{(1,Katso Sanna!)})
(924,{(1,slow),(1,boring)})
(1256,{(1,Marx Brothers)})
Note that the tag Science Fiction has appeared 4 times for the movie with id 260. Using GROUP BY and COUNT, I managed to count the distinct keywords for each movie using the following script:
sum = FOREACH group_data {
unique_tags = DISTINCT movieUserTagFltr.tags::tag;
GENERATE group, COUNT(unique_tags) as tag;
};
But that only returns a global count; I want a local count. So the logic I was thinking of was:
result = iterate over each tuple of group_data {
generate a tuple with $0, and a bag with {
foreach distinct tag that group_data has in its $1 variable do {
generate a tuple like: (tag_name, count of how many times that tag appeared on $1)
}
}
}
You can flatten out your original input so that each movieID and tag are their own record. Then group by movieID and tag to get a count for each combination. Finally, group by movieID so that you end up with a bag of tags and counts for each movie.
Let's say you start with movieUserTagFltr with the schema you described:
A = FOREACH movieUserTagFltr GENERATE FLATTEN(tags) AS (movieID, tag);
B = GROUP A BY (movieID, tag);
C = FOREACH B GENERATE
FLATTEN(group) AS (movieID, tag),
COUNT(A) AS movie_tag_count;
D = GROUP C BY movieID;
Your final schema is:
D: {group: int,C: {(movieID: int,tag: chararray,movie_tag_count: long)}}

PIG: Select records from a relation only if it is present in another relation

I have the following data set for a movie database:
Ratings: UserID, MovieID, Rating
Movies: MovieID, Genre
I filtered out the movies with Genres as "Action" or "War" using:
movie_filter = filter Movies by (genre matches '.*Action.*') OR (genre matches '.*War.*');
Now, I have to calculate the average ratings for War or Action movies. But the ratings is present in the Ratings file. To do this, I use the query:
movie_groups = GROUP movie_filter BY MovieID;
result = FOREACH movie_groups GENERATE Ratings.MovieID, AVG(Ratings.rating);
Then I store the result in a directory location. But when I run the program, I get the following error:
Could not infer the matching function for org.apache.pig.builtin.AVG as multiple or none of them fit. Please use an explicit cast.
Can anyone tell me what I'm doing wrong? Thanks in advance.
It looks like you're missing a join statement, which would join your two data sets (ratings & movies) on the MovieID column. I've mocked up some test data, and provided some example code below.
movie_avg.pig
ratings = LOAD 'movie_ratings.txt' USING PigStorage(',') AS (user_id:chararray, movie_id:chararray, rating:int);
movies = LOAD 'movie_data.txt' USING PigStorage(',') AS (movie_id:chararray,genre:chararray);
movies_filter = FILTER movies BY (genre MATCHES '.*Action.*' OR genre MATCHES '.*War.*');
movies_join = JOIN movies_filter BY movie_id, ratings BY movie_id;
movies_cleanup = FOREACH movies_join GENERATE movies_filter::movie_id AS movie_id, ratings::rating AS rating;
movies_group = GROUP movies_cleanup by movie_id;
data = FOREACH movies_group GENERATE group, AVG(movies_cleanup.rating);
dump data;
Output of movie_avg.pig
(Jarhead,3.0)
(Platoon,4.333333333333333)
(Die Hard,3.0)
(Apocolypse Now,4.5)
(Last Action Hero,2.0)
(Lethal Weapon,4.0)
movie_data.txt
Scrooged,Comedy
Apocolypse Now,War
Platoon,War
Guess Whos Coming To Dinner,Drama
Jarhead,War
Last Action Hero,Action
Die Hard,Action
Lethal Weapon,Action
My Fair Lady,Musical
Frozen,Animation
movie_ratings.txt
12345,Scrooged,4
12345,Frozen,4
12345,My Fair Lady,5
12345,Guess Whos Coming To Dinner,5
12345,Platoon,3
12345,Jarhead,2
23456,Platoon,5
23456,Apocolypse Now,4
23456,Die Hard,3
23456,Last Action Hero,2
34567,Lethal Weapon,4
34567,Jarhead,4
34567,Apocolypse Now,5
34567,Platoon,5
34567,Frozen,5

Pig de-duplicate events occurring within 1 minute of each other

We are using pig-0.11.0-cdh4.3.0 with a CDH4 cluster and we need to de-duplicate some web logs. The solution idea (expressed in SQL) is something like this:
SELECT
T1.browser,
T1.click_type,
T1.referrer,
T1.datetime,
T2.datetime
FROM
My_Table T1
INNER JOIN My_Table T2 ON
T2.browser = T1.browser AND
T2.click_type = T1.click_type AND
T2.referrer = T1.referrer AND
T2.datetime > T1.datetime AND
T2.datetime <= DATEADD(mi, 1, T1.datetime)
I grabbed the above from SQL find duplicate records occuring within 1 minute of each other . I was hoping to implement a similar solution in Pig, but it appears that Pig does not support JOINs on expressions (only on fields), which the above join requires. Do you know how to de-duplicate events that are within 1 minute of each other with Pig? Thanks!
One approach: group by the required parameters, then work inside a nested FOREACH (assuming grpd is the result of that grouping):
grpd = GROUP records BY (browser, click_type, referrer);
top3 = FOREACH grpd {
sorted = FILTER records BY time < 60;
top = LIMIT sorted 2;
GENERATE group, FLATTEN(top);
};
Another approach:
records_group = GROUP records BY (browser, click_type, referrer);
with_min = FOREACH records_group
GENERATE
FLATTEN(records), MAX(records.datetime) AS maxDt;
filterRecords = FILTER with_min BY (maxDt - $2) < 60;
$2 is the position of datetime; change it accordingly.
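If the schema is known, a named reference avoids the positional $2 bookkeeping. A sketch under the assumption that the field keeps the name records::datetime after the FLATTEN:

```pig
-- same grouping and FOREACH as above; reference the flattened field by name
filterRecords = FILTER with_min BY (maxDt - records::datetime) < 60;
```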
Off the top of my head, something like this could work, but it needs testing:
view = FOREACH input GENERATE browser, click_type, referrer, datetime, GetYear(datetime) as year, GetMonth(datetime) as month, GetDay(datetime) as day, GetHour(datetime) as hour, GetMinute(datetime) as minute;
grp = GROUP view BY (browser, click_type, referrer, year, month, day, hour, minute);
uniq = FOREACH grp {
top = LIMIT view 1;
GENERATE FLATTEN(top.(browser, click_type, referrer, datetime));
};
Of course, if one event is at 12:03:45 and another at 12:03:59, they fall into the same group, while 12:04:45 and 12:05:00 fall into different groups.
To get an exact 60-second difference you would need to write a UDF that iterates over a sorted bag grouped on (browser, click_type, referrer) and removes unwanted rows.
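As a hedged sketch, such a (hypothetical) UDF, say com.example.DedupeWithinWindow, taking a datetime-sorted bag and returning only the rows to keep, might be wired in like this (the jar name and class are assumptions, not an existing library):

```pig
REGISTER 'dedupe-udf.jar';  -- hypothetical jar containing the UDF

grp = GROUP input BY (browser, click_type, referrer);
deduped = FOREACH grp {
    sorted = ORDER input BY datetime;
    -- the UDF walks the sorted bag and drops any row within
    -- 60 seconds of the previously kept row
    GENERATE group, FLATTEN(com.example.DedupeWithinWindow(sorted));
};
```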
@Aleks and @Marq:
records_group = GROUP records BY (browser, click_type, referrer);
with_min = FOREACH records_group
GENERATE FLATTEN(records), MAX(records.datetime) AS max;
with_min = FOREACH with_min GENERATE browser, click_type, referrer,
ABS(max - datetime) AS maxDtGroup;
regroup = GROUP with_min BY (browser, click_type, referrer, maxDtGroup);
Re-group with maxDtGroup as part of the key, then filter the top record in each group.

Hadoop Pig GROUP by id, get owner_id?

In Hadoop I have many records that look like this:
(item_id,owner_id,counter) - there could be duplicates, but a given item_id ALWAYS has the same owner_id!
I want to get the SUM of the counter for each item_id so I have the following script:
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id);
data = FOREACH group_by_item GENERATE group AS item_id, OWNER_ID_COLUMN_SOMEHOW, SUM(known_items.counter) AS items_count;
The problem is that in the FOREACH, if I reference known_items.owner_id, I get a bag containing the owner_id of every row in the group. What would be the most efficient way to get the first of the owners?
The simplest solution gives you the right answer if your assumption that each item_id has the same owner_id holds, and will let you know if it does not: include owner_id as part of the group key.
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id, owner_id);
data = FOREACH group_by_item GENERATE FLATTEN(group), SUM(known_items.counter) AS items_count;
