Finding Unique visitors to a webpage - hadoop

I want to write a pig script that find number of unique userid that visiots a particluar webpage.
table definition :a = (userid:chararray, otherid:chararray, webpage:chararray)
This is what I wrote but it doesn't work
a = (userid:chararray, otherid:chararray, webpage:chararray)
group_by_page = GROUP a by webpage ;
count_d = FOREACH group_by_page GENERATE group, count(distinct(a.userid));

You need to use the DISTINCT inside a nested foreach; it's not a UDF. This should get you where you need to go:
a = LOAD 'input' AS (userid:chararray, otherid:chararray, webpage:chararray);
group_by_page = GROUP a by webpage;
count_d = FOREACH group_by_page { uniq = DISTINCT a.userid; GENERATE group, COUNT(uniq); };
Go here to learn more about nested foreach.

Related

Order of Apache Pig Transformations

I am reading through Pig Programming by Alan Gates.
Consider the code:
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS
(userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage ('|') AS
(movieID:int, movieTitle:chararray, releaseDate:chararray, imdbLink: chararray);
nameLookup = FOREACH metadata GENERATE
movieID, movieTitle, ToDate(releaseDate, 'dd-MMM-yyyy') AS releaseYear;
nameLookupYear = FOREACH nameLookup GENERATE
movieID, movieTitle, GetYear(releaseYear) AS finalYear;
filterMovies = FILTER nameLookupYear BY finalYear < 1982;
groupedMovies = GROUP filterMovies BY finalYear;
orderedMovies = FOREACH groupedMovies {
sortOrder = ORDER metadata by finalYear DESC;
GENERATE GROUP, finalYear;
};
DUMP orderedMovies;
It states that
"Sorting by maps, tuples or bags produces error".
I want to know how I can sort the grouped results.
Do the transformations need to follow a certain sequence for them to work?
Since you are trying to sort the grouped results, you do not need a nested foreach. You would use the nested foreach if you were trying to, for example, sort each movie within the year by title or release date. Try ordering as usual (refer to finalYear as group since you grouped by finalYear in the previous line):
orderedMovies = ORDER groupedMovies BY group ASC;
DUMP orderedMovies;
If you are looking to sort the grouped values then you will have to use nested foreach. This will sort the years in descending order within a group.
orderedMovies = FOREACH groupedMovies {
sortOrder = ORDER metadata by GetYear(ToDate(releaseDate, 'dd-MMM-yyyy')) DESC;
GENERATE GROUP, movieID, movieTitle;
};

PIG How do I return a rowcount from 1 alias to another alias

REGISTER 'udf.py' using jython as myfunc;
loadhtml = load './assignment/crawler' using PigStorage('\u0001') as (id1:chararray,url:chararray,domain:chararray,content:chararray,source:chararray,date:chararray);
loadhtml_content = FOREACH loadhtml generate content;
flatten = FOREACH loadhtml_content generate flatten(TOKENIZE(line)) as word;
group = GROUP flatten by word;
count = FOREACH group1 generate $0, COUNT($1);
log = FOREACH count GENERATE myfunc.nLog($0,$1,**<I need to return the row count of loadhtml_content here>**);
I am trying to return a row count of loadhtml_content into another alias. I cant think of another idea to do it.
log = FOREACH count GENERATE myfunc.nLog($0,$1,(I need to return the row count of loadhtml_content here) );
I believe this is the exact feature you are looking for: https://issues.apache.org/jira/browse/PIG-1434.
It essentially allows us to use single-tupled relations as constants wherever needed.
Something along the following lines should address your question:
loadhtml_content = FOREACH loadhtml generate content;
content_rows = FOREACH (GROUP loadhtml_content ALL) GENERATE
COUNT(loadhtml_content);
log = FOREACH count GENERATE myfunc.nLog($0,$1,content_rows.$0);

Pig: is it possible to write a loop over variables in a list?

I have to loop over 30 variables in a list
[var1,var2, ... , var30]
and for each variable I use some PIG group by statement such as
grouped = GROUP data by var1;
data_var1 = FOREACH grouped{
GENERATE group as mygroup,
COUNT(data) as count;
};
Is there a way to loop over the list of variables or I am forced to repeat the code above manually 30 times in my code?
Thanks!
I think what you're looking for is the pig macro
Create a relation for your 30 variables, and iterate on them by foreach, and call a macro which get 2 params: your data relation and the var you want to group by.
Just check the example in the link the macro is really similar what you'd like to do.
UPDATE & code
So here's the macro you can use:
DEFINE my_cnt(data, group_field) RETURNS C {
$C = FOREACH (GROUP $data by $group_field) GENERATE
group AS mygroup,
COUNT($data) AS count;
};
Use the macro:
IMPORT 'cnt.macro';
data = LOAD 'data.txt' USING PigStorage(',') AS (field:chararray, value:chararray);
DESCRIBE data;
e = my_cnt(data,'the_field_you_group_by');
DESCRIBE e;
DUMP e;
I'm still thinking on how can you iterate through on your fields you'd like to group by. My original suggestion to foreach through a relation what contains the filed names not correct. (To create a UDF for this always works.) Let me think about it.
But this macro works as is if you call by all the filed name you want to group.

apache pig unable to perform grouping and counting

I am kind of newbie for Pig scripting . Please help me out with this issue.
I have no clue as to where I am going wrong.
My data
(catA,myid_1,2014,store1,appl)
(catA,myid_2,2014,store1,milk)
(catA,myid_3,2014,store1,appl)
(catA,myid_4,2014,store1,milk)
(catA,myid_5,2015,store1,milk)
(catB,myid_6,2014,store2,milk)
(catB,myid_7,2014,store2,appl)
The below is the result expected
(catA,2014,milk,2)
(catA,2014,apple,2)
(catA,2015,milk,1)
(catB,2014,milk,1)
(catB,2014,apple,1)
Need to count the number of food item based on the category,year.
below is my pig script
list = LOAD 'shop' USING PigStorage(',') AS (category:chararray,id:chararray,mdate:chararray,my_store:chararray,item:chararray);
list_of = FOREACH list GENERATE category,SUBSTRING(mdate,0,4) as my_date,my_store,item;
StoreG = GROUP list_of BY (category,my_date,my_store);
result = FOREACH StoreG
{
food_list = FOREACH list_of GENERATE item;
food_count = DISTINCT food_list;
GENERATE FLATTEN(group) AS (category,my_date,my_store),COUNT(food_count);
}
DUMP result;
My output for the above script is below
(catA,2014,store1,2)
(catA,2015,store1,1)
(catB,2014,store2,2)
Could anyone please let me know as to where I am wrong in my script
Thanks
One way to do it.Not the most elegant but working example:
list = LOAD 'shop' USING PigStorage(',') AS (category:chararray,id:chararray,mdate:chararray,my_store:chararray,item:chararray);
list_of = FOREACH list GENERATE category,SUBSTRING(mdate,0,4) AS my_date,my_store,item;
StoreG = GROUP list_of BY (category,my_date,my_store,item);
result = FOREACH StoreG GENERATE
group.category AS category,
group.my_date AS my_date,
group.my_store AS mys_store,
group.item AS item,
COUNT(list_of.item) AS nb_items;
DUMP result;
When we add alias item to the GROUP BY statement is basically the same as to find distinct items and then count them(as you already did in the parenthesis).
If you still want to use your code you simply add a relation food_list.item to the code below :
result = FOREACH StoreG
{
food_list = FOREACH list_of GENERATE item;
food_count = DISTINCT food_list;
GENERATE FLATTEN(group) AS (category,my_date,my_store),food_list.item,COUNT(food_count);
}
StoreG = GROUP list_of BY (category,my_date,my_store);
should be
StoreG = GROUP list_of BY (category,my_date,item);
since your expected results are grouping by item not store.

How do I get the matching values inside a for loop using FILTER in PIG?

Consider this as my input,
Input (File1):
12345;11
34567;12
.
.
Input (File2):
11;(1,2,3,4,5,6,7,8,9)
12;(9,8,7,6,5,4,3,2,1)
.
.
I would like to get the output as follows:
Output:
(1,2,3,4,5,6,7,8,9)
(9,8,7,6,5,4,3,2,1)
Here's the sample code which I have tried using FILTER and I face some errors with this. Please suggest me some other options.
data1 = load '/File1' using PigStorage(';') as (id,number);
data2 = load '/File2' using PigStorage(';') as (numberInfo, collection);
out = foreach data1{
Data_filter = FILTER data2 by (numberInfo matches CONCAT(number,''));
generate Data_filter;
}
Is it possible do this inside a for loop ? Please let me know. Thanks in advance !
There are no for loops in Apache Pig, if you need to iterate through each row of the data for some specific purpose you need to implement your own UDF. The foreach keyword is not used to create a loop, it is used to transform your data based on your columns, applying UDFs to it. You can also use a nested foreach, where you perform operations over each group in your relation.
However, your syntax is wrong. You are trying to use a nested foreach without grouping your data first. What a nested foreach does, is perform the operations you define in the block of code over a grouped relation. Therefore, the only way your code could work is by grouping the data first:
data1 = load '/File1' using PigStorage(';') as (id,number);
data2 = load '/File2' using PigStorage(';') as (numberInfo, collection);
data1 = group data1 by id;
out = foreach data1{
Data_filter = FILTER data2 by (numberInfo matches CONCAT(number,''));
generate Data_filter;
}
However, this won't work because inside a nested foreach you cannot refer to a different relation like data2.
What you really want, is a JOIN operation over both relations using number for data1 and numberInfo for data2. This will give you this:
joined_data = join data1 by number, data2 by numberInfo;
dump joined_data;
(12345,11,11,(1,2,3,4,5,6,7,8,9))
(34567,12,12,(9,8,7,6,5,4,3,2,1))
In your question you said you only wanted as output the last column, so now you can use a foreach to generate the column you want:
final_data = foreach joined_data generate data2::collection;
dump final_data;
((1,2,3,4,5,6,7,8,9))
((9,8,7,6,5,4,3,2,1))

Resources