Apache Pig unable to perform grouping and counting - Hadoop

I am kind of a newbie at Pig scripting, so please help me out with this issue; I have no clue where I am going wrong.
My data:
(catA,myid_1,2014,store1,appl)
(catA,myid_2,2014,store1,milk)
(catA,myid_3,2014,store1,appl)
(catA,myid_4,2014,store1,milk)
(catA,myid_5,2015,store1,milk)
(catB,myid_6,2014,store2,milk)
(catB,myid_7,2014,store2,appl)
Below is the expected result:
(catA,2014,milk,2)
(catA,2014,apple,2)
(catA,2015,milk,1)
(catB,2014,milk,1)
(catB,2014,apple,1)
I need to count the number of each food item based on category and year.
Below is my Pig script:
list = LOAD 'shop' USING PigStorage(',') AS (category:chararray,id:chararray,mdate:chararray,my_store:chararray,item:chararray);
list_of = FOREACH list GENERATE category,SUBSTRING(mdate,0,4) as my_date,my_store,item;
StoreG = GROUP list_of BY (category,my_date,my_store);
result = FOREACH StoreG
{
food_list = FOREACH list_of GENERATE item;
food_count = DISTINCT food_list;
GENERATE FLATTEN(group) AS (category,my_date,my_store),COUNT(food_count);
}
DUMP result;
My output for the above script is below
(catA,2014,store1,2)
(catA,2015,store1,1)
(catB,2014,store2,2)
Could anyone please let me know where I am going wrong in my script?
Thanks

One way to do it. Not the most elegant, but a working example:
list = LOAD 'shop' USING PigStorage(',') AS (category:chararray,id:chararray,mdate:chararray,my_store:chararray,item:chararray);
list_of = FOREACH list GENERATE category,SUBSTRING(mdate,0,4) AS my_date,my_store,item;
StoreG = GROUP list_of BY (category,my_date,my_store,item);
result = FOREACH StoreG GENERATE
group.category AS category,
group.my_date AS my_date,
group.my_store AS my_store,
group.item AS item,
COUNT(list_of.item) AS nb_items;
DUMP result;
Adding the alias item to the GROUP BY statement is basically the same as finding the distinct items and then counting them (as you already did inside the nested FOREACH).
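For reference, running this on the sample data from the question should produce one row per (category, year, store, item) combination, roughly:
(catA,2014,store1,appl,2)
(catA,2014,store1,milk,2)
(catA,2015,store1,milk,1)
(catB,2014,store2,appl,1)
(catB,2014,store2,milk,1)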
If you still want to use your own code, you simply add the projection food_list.item to the GENERATE, as below:
result = FOREACH StoreG
{
food_list = FOREACH list_of GENERATE item;
food_count = DISTINCT food_list;
GENERATE FLATTEN(group) AS (category,my_date,my_store),food_list.item,COUNT(food_count);
}

StoreG = GROUP list_of BY (category,my_date,my_store);
should be
StoreG = GROUP list_of BY (category,my_date,item);
since your expected results group by item, not by store; a sketch of the full script with that change follows.
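A minimal sketch, assuming the same LOAD statement as in the question (here COUNT simply counts the rows in each (category, year, item) group, so no nested DISTINCT is needed):
list = LOAD 'shop' USING PigStorage(',') AS (category:chararray,id:chararray,mdate:chararray,my_store:chararray,item:chararray);
list_of = FOREACH list GENERATE category, SUBSTRING(mdate,0,4) AS my_date, item;
StoreG = GROUP list_of BY (category,my_date,item);
result = FOREACH StoreG GENERATE FLATTEN(group) AS (category,my_date,item), COUNT(list_of) AS nb_items;
DUMP result;
This should line up with the expected (category, year, item, count) rows in the question.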

Related

Order of Apache Pig Transformations

I am reading through Pig Programming by Alan Gates.
Consider the code:
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS
(userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage ('|') AS
(movieID:int, movieTitle:chararray, releaseDate:chararray, imdbLink: chararray);
nameLookup = FOREACH metadata GENERATE
movieID, movieTitle, ToDate(releaseDate, 'dd-MMM-yyyy') AS releaseYear;
nameLookupYear = FOREACH nameLookup GENERATE
movieID, movieTitle, GetYear(releaseYear) AS finalYear;
filterMovies = FILTER nameLookupYear BY finalYear < 1982;
groupedMovies = GROUP filterMovies BY finalYear;
orderedMovies = FOREACH groupedMovies {
sortOrder = ORDER metadata by finalYear DESC;
GENERATE GROUP, finalYear;
};
DUMP orderedMovies;
It states that
"Sorting by maps, tuples or bags produces error".
I want to know how I can sort the grouped results.
Do the transformations need to follow a certain sequence for them to work?
Since you are trying to sort the grouped results, you do not need a nested FOREACH. You would use a nested FOREACH if you were trying to, for example, sort the movies within each year by title or release date. Try ordering as usual (refer to finalYear as group, since you grouped by finalYear in the previous line):
orderedMovies = ORDER groupedMovies BY group ASC;
DUMP orderedMovies;
If you are looking to sort the values within each group, then you do need a nested FOREACH. Note that only the grouped relation (filterMovies here) and its fields are visible inside the nested block, so the sort key has to come from it; this sorts the movies within each year group by title, in descending order:
orderedMovies = FOREACH groupedMovies {
sortedMovies = ORDER filterMovies BY movieTitle DESC;
GENERATE group AS finalYear, sortedMovies.(movieID, movieTitle);
};
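If the end goal is just the number of pre-1982 movies per year, ordered with the newest year first, a hedged sketch reusing the groupedMovies and filterMovies aliases above would be:
movieCounts = FOREACH groupedMovies GENERATE group AS finalYear, COUNT(filterMovies) AS nbMovies;
sortedCounts = ORDER movieCounts BY finalYear DESC;
DUMP sortedCounts;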

Pig: How do I return a row count from one alias to another alias

REGISTER 'udf.py' USING jython AS myfunc;
loadhtml = LOAD './assignment/crawler' USING PigStorage('\u0001') AS (id1:chararray,url:chararray,domain:chararray,content:chararray,source:chararray,date:chararray);
loadhtml_content = FOREACH loadhtml GENERATE content;
words = FOREACH loadhtml_content GENERATE FLATTEN(TOKENIZE(content)) AS word;
grouped = GROUP words BY word;
word_count = FOREACH grouped GENERATE $0, COUNT($1);
log = FOREACH word_count GENERATE myfunc.nLog($0,$1,**<I need to return the row count of loadhtml_content here>**);
I am trying to pass the row count of loadhtml_content into another alias, but I can't think of a way to do it.
I believe this is the exact feature you are looking for: https://issues.apache.org/jira/browse/PIG-1434.
It essentially lets you use a single-tuple relation as a scalar constant wherever one is needed.
Something along the following lines should address your question:
loadhtml_content = FOREACH loadhtml generate content;
content_rows = FOREACH (GROUP loadhtml_content ALL) GENERATE
COUNT(loadhtml_content);
log = FOREACH word_count GENERATE myfunc.nLog($0,$1,content_rows.$0);
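As a hedged follow-up sketch (reusing the word_count and content_rows aliases above), the same scalar can be used anywhere an expression expects a single value, for example to turn the raw counts into relative frequencies:
freq = FOREACH word_count GENERATE $0 AS word, (double)$1 / (double)content_rows.$0 AS rel_freq;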

Finding Unique visitors to a webpage

I want to write a Pig script that finds the number of unique userids that visit a particular webpage.
Table definition: a = (userid:chararray, otherid:chararray, webpage:chararray)
This is what I wrote, but it doesn't work:
a = (userid:chararray, otherid:chararray, webpage:chararray)
group_by_page = GROUP a by webpage ;
count_d = FOREACH group_by_page GENERATE group, count(distinct(a.userid));
You need to use DISTINCT inside a nested FOREACH; it's an operator, not a UDF. This should get you where you need to go:
a = LOAD 'input' AS (userid:chararray, otherid:chararray, webpage:chararray);
group_by_page = GROUP a by webpage;
count_d = FOREACH group_by_page {
uniq = DISTINCT a.userid;
GENERATE group, COUNT(uniq);
};
See the Pig documentation on nested FOREACH to learn more.

How to count the number of unique users with PIG

The following piece of code doesn't return what I am trying to compute: the number of unique users. Any ideas?
data = LOAD 'input_initial' AS (user_id,item_id,rating,timestamp);
data = FOREACH data GENERATE user_id,item_id;
STORE data INTO 'input_final';
data_users = FOREACH data GENERATE user_id;
group_users = GROUP data_users BY user_id;
count_users = FOREACH group_users GENERATE COUNT(data_users);
STORE count_users INTO 'count_users';
You need to add a GROUP ... ALL on top of the per-user grouping and then count the groups (one group per distinct user):
group_users = GROUP data_users BY user_id;
grp_all = GROUP group_users ALL;
count_users = FOREACH grp_all GENERATE COUNT(group_users);
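An equivalent and slightly more direct sketch (assuming the same data_users alias from the question) is to take the DISTINCT user ids first and then count the single resulting bag:
uniq_users = DISTINCT data_users;
grp_all = GROUP uniq_users ALL;
count_users = FOREACH grp_all GENERATE COUNT(uniq_users) AS nb_users;
STORE count_users INTO 'count_users';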

Calculate count of distinct values of a field using pig script

For a file of the form
A B user1
C D user2
A D user3
A D user1
I want to calculate the count of distinct values of field 3, i.e. count(distinct(user1, user2, user3, user1)) = 3.
I am doing this using the following Pig script:
A = load 'myTestData' using PigStorage('\t') as (a1,a2,a3);
user_list = foreach A GENERATE $2;
unique_users = DISTINCT user_list;
unique_users_group = GROUP unique_users ALL;
uu_count = FOREACH unique_users_group GENERATE COUNT(unique_users);
store uu_count into 'output';
Is there a better way to get count of distinct values of a field?
A more up-to-date way to do this:
user_data = LOAD 'myTestData' USING PigStorage('\t') AS (a1,a2,a3);
users = FOREACH user_data GENERATE a3;
uniq_users = DISTINCT users;
grouped_users = GROUP uniq_users ALL;
uniq_user_count = FOREACH grouped_users GENERATE COUNT(uniq_users);
DUMP uniq_user_count;
This will leave the value (3) in your log.
I have one here which is a little more concise. You might want to check which one runs faster.
A = LOAD 'myTestData' USING PigStorage('\t') AS (a1,a2,a3);
unique_users_group = GROUP A ALL;
uu_count = FOREACH unique_users_group { users = A.a3; uniq = DISTINCT users; GENERATE COUNT(uniq); };
STORE uu_count INTO 'output';
