Pig Conditional Statements - hadoop

I think I already know the answer to this, but I just wanted to check here before I give up and do something ugly.
I have a query that needs to count total clicks, and also total distinct users. Total clicks would just be this code without the distinct:
report = FOREACH report GENERATE user, genre, title;
report = DISTINCT report;
report = GROUP report BY (genre, title);
My question is essentially: is there any way to write a conditional statement that would skip the DISTINCT step in this process? Pseudo:
report = FOREACH report GENERATE user, genre, title;
if $report_type == 'users':
report = DISTINCT report;
end if
report = GROUP report BY (genre, title);
I'd rather not maintain two separate files, and so far the only solutions I can find involve a wrapper (Python or similar) that generates the script dynamically. I'd rather keep everything in a single .pig file, but I can't find a way to do it.

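One option is to build both the distinct and the non-distinct bag for each group, then pick one with a conditional (bincond) driven by a parameter. Can you check this against your input?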
input:
user1,action,aa
user2,comedy,cc
user3,drama,dd
user1,action,aa
user1,action,aa
user2,comedy,cc
PigScript:
A = LOAD 'input' USING PigStorage(',') AS (user, genre, title);
B = FOREACH A GENERATE user, genre, title;
C = GROUP B BY (genre, title);
D = FOREACH C {
noDistValue = FOREACH B GENERATE user,genre,title;
distValue = DISTINCT B;
GENERATE $0 AS grp,noDistValue,distValue;
}
E = FOREACH D GENERATE grp,(('$report_type' == 'users')?distValue:noDistValue) AS mybag;
DUMP E;
Output1:
>>pig -x local -param "report_type=users" test.pig
((action,aa),{(user1,action,aa)})
((comedy,cc),{(user2,comedy,cc)})
((drama,dd),{(user3,drama,dd)})
Output2:
>>pig -x local -param "report_type=nonusers" test.pig
((action,aa),{(user1,action,aa),(user1,action,aa),(user1,action,aa)})
((comedy,cc),{(user2,comedy,cc),(user2,comedy,cc)})
((drama,dd),{(user3,drama,dd)})
If you want the count, project relation E and apply COUNT to mybag. You can adapt the script above to your needs.
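As a sketch of that counting step, the variant below computes both totals in one nested FOREACH and then selects one with the same report_type parameter. It assumes the comma-separated input file and the field names from the example above. The aliases distUsers, clicks, users, and total are made up for illustration.

```pig
-- Assumption: same 'input' file and -param report_type=... as in the script above.
A = LOAD 'input' USING PigStorage(',') AS (user:chararray, genre:chararray, title:chararray);
C = GROUP A BY (genre, title);
D = FOREACH C {
    -- Distinct users inside each (genre, title) group.
    distUsers = DISTINCT A.user;
    GENERATE FLATTEN(group) AS (genre, title),
             COUNT(A) AS clicks,
             COUNT(distUsers) AS users;
};
-- Keep only the total that matches the requested report type.
E = FOREACH D GENERATE genre, title,
    (('$report_type' == 'users') ? users : clicks) AS total;
DUMP E;
```

With the sample input, the action/aa group has 3 clicks and 1 distinct user, comedy/cc has 2 clicks and 1 user, and drama/dd has 1 of each. Running with report_type=users should therefore report 1 for every group.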

Related

Pig: is it possible to write a loop over variables in a list?

I have to loop over 30 variables in a list
[var1,var2, ... , var30]
and for each variable I use some PIG group by statement such as
grouped = GROUP data by var1;
data_var1 = FOREACH grouped{
GENERATE group as mygroup,
COUNT(data) as count;
};
Is there a way to loop over the list of variables or I am forced to repeat the code above manually 30 times in my code?
Thanks!
I think what you're looking for is a Pig macro.
Create a relation for your 30 variables, iterate over it with FOREACH, and call a macro that takes two parameters: your data relation and the variable you want to group by.
Check the linked example; the macro there is very similar to what you'd like to do.
UPDATE & code
So here's the macro you can use:
DEFINE my_cnt(data, group_field) RETURNS C {
$C = FOREACH (GROUP $data by $group_field) GENERATE
group AS mygroup,
COUNT($data) AS count;
};
Use the macro:
IMPORT 'cnt.macro';
data = LOAD 'data.txt' USING PigStorage(',') AS (field:chararray, value:chararray);
DESCRIBE data;
e = my_cnt(data,'the_field_you_group_by');
DESCRIBE e;
DUMP e;
I'm still thinking about how you can iterate over the fields you'd like to group by. My original suggestion, to FOREACH through a relation that contains the field names, is not correct. (Writing a UDF for this always works.) Let me think about it.
But this macro works as is if you call it with each field name you want to group by.
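Pig Latin itself has no loop construct, so one way to read the advice above is to invoke the macro once per field. This is a minimal sketch, assuming the macro is defined inline rather than imported, and that var1 and var2 are hypothetical column names standing in for the 30 variables. The field names are passed unquoted, following the macro examples in the Pig documentation.

```pig
-- Same macro as above, defined inline so the script is self-contained.
DEFINE my_cnt(data, group_field) RETURNS C {
    $C = FOREACH (GROUP $data BY $group_field) GENERATE
        group AS mygroup,
        COUNT($data) AS count;
};

-- Assumption: var1 and var2 are placeholder field names.
data = LOAD 'data.txt' USING PigStorage(',') AS (var1:chararray, var2:chararray);

-- One macro call per grouping field.
cnt_var1 = my_cnt(data, var1);
cnt_var2 = my_cnt(data, var2);

DUMP cnt_var1;
DUMP cnt_var2;
```

For 30 fields the calls still have to be written out (or generated by a wrapper script), but each one is a single line.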

How to divide numbers from different tables in pig

I am trying to join two tables and divide a number from one table by a number from the other. I have tried doing the division directly on the joined relation and also on a new relation generated with the same values, but I get the same error both times, which is extra confusing to me.
--get the data
lines = LOAD '/historicaldata.csv' USING PigStorage(' ') AS (ticker:chararray, date:long, open:long, high:long, low:long, close:long, volume:long);
--limit it between the dates we want
specDates = FILTER lines BY (date<=20000103 and date>=19900101);
--sort by ticker symbol
companies = GROUP specDates BY ticker;
--sort DESC and get the top to get the ending date
sorted_end = FOREACH companies {
sorted1 = ORDER specDates BY date DESC;
endDate = LIMIT sorted1 1;
GENERATE endDate.ticker AS ticker, endDate.open AS open, endDate.close AS close;
}
--sort ASC and get the top to get the starting date
sorted_begin = FOREACH companies {
sorted2 = ORDER specDates BY date ASC;
startDate = LIMIT sorted2 1;
GENERATE startDate.ticker AS ticker, startDate.open AS open, startDate.close AS close;
}
joined = JOIN sorted_end BY ticker, sorted_begin BY ticker;
final = FOREACH joined GENERATE sorted_end::ticker as ticker, sorted_begin::open as open, sorted_end::close as close;
final2 = FOREACH final GENERATE ticker as ticker, (float)(close/open) as growth_factor;
The error I keep getting is:
(Name: Divide Type: null Uid: null)incompatible types in Divide Operator left hand side:bag :tuple(close:float) right hand side:bag :tuple(open:float)
Both are floats so I am not sure why they are "incompatible types" other than that they come from different bags, but adding them to "final" and trying to do it from there doesn't work.
The data is in the form:
AA,20140131,11.60,11.80,11.45,11.48,33014100
AA,20140130,12.05,12.07,11.83,11.92,23223500
AA,20140129,11.64,12.23,11.58,11.96,44433000
Every entry includes all columns, and the values are well-formatted, non-zero numbers.
Based on your query, I created a dummy table on my system and ran the same kind of join and division. I found no issue, and the division completed successfully. Here are the sample statements I ran in Pig:
A = LOAD '/home/training/716391/pig/pigdata.csv' USING PigStorage(',') AS (ID:int, name:chararray, GPC:float);
B = LOAD '/home/training/716391/pig/pigdata2.csv' USING PigStorage(',') AS (ID:int, name:chararray, GPC:float);
C = JOIN A BY ID, B BY ID;
D = FOREACH C generate A::ID as IDA, A::name as NAMEA, A::GPC as GPCA, B::ID as IDB, B::name as NAMEB, B::GPC as GPCB;
E = FOREACH D GENERATE IDA, (FLOAT)(GPCA/GPCB) AS VALUE;
Can you confirm that the divisor in your case is never null or 0?
Could you also share the load statements for sorted_end and sorted_begin?
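One possible reading of the error message: in the question's nested FOREACH, endDate.close and startDate.open are projections of a bag (the result of LIMIT), so they remain bags even though each holds a single tuple, and Pig cannot divide one bag by another. The sketch below shows one way around that, assuming comma-separated input with float prices (as the sample rows and the error message suggest, unlike the LOAD in the question). FLATTEN turns the one-row bags into scalars before the join. The aliases end_close, begin_open, and growth are illustrative.

```pig
-- Assumption: comma-delimited file with float prices, matching the sample rows.
lines = LOAD '/historicaldata.csv' USING PigStorage(',')
        AS (ticker:chararray, date:long, open:float, high:float,
            low:float, close:float, volume:long);
specDates = FILTER lines BY (date <= 20000103 AND date >= 19900101);
companies = GROUP specDates BY ticker;

-- Latest row per ticker; FLATTEN unwraps the one-tuple bag into a scalar.
sorted_end = FOREACH companies {
    sorted1 = ORDER specDates BY date DESC;
    endDate = LIMIT sorted1 1;
    GENERATE group AS ticker, FLATTEN(endDate.close) AS end_close;
};

-- Earliest row per ticker, unwrapped the same way.
sorted_begin = FOREACH companies {
    sorted2 = ORDER specDates BY date ASC;
    startDate = LIMIT sorted2 1;
    GENERATE group AS ticker, FLATTEN(startDate.open) AS begin_open;
};

joined = JOIN sorted_end BY ticker, sorted_begin BY ticker;
-- Both operands are now floats, so the division type-checks.
growth = FOREACH joined GENERATE sorted_end::ticker AS ticker,
         sorted_end::end_close / sorted_begin::begin_open AS growth_factor;
DUMP growth;
```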

Aggregate data grouping by two columns in Pig

I have this data that I need to group by two columns and then sum two other fields.
Suppose the names of these four columns are OS, device, view, click. For each (OS, device) pair, I want to know how many views and how many clicks it has.
(2,3346,1,)
(3,3953,1,1)
(25,4840,1,1)
(2,94840,1,1)
(14,0526,1,1)
(37,4864,1,)
(2,7353,1,)
This is what I have so far
A is data: OS,device,view,click
B = GROUP A BY (OS,device);
Result = FOREACH B {
GENERATE group AS OS,device, SUM(view) AS visits, SUM(click) AS clicks;};
dump Result;
This one won't work, error message is: Projected field [OS] does not exist in schema: group:tuple(OS:int,device:long),B:bag{:tuple(OS:int,device:long,view:int,click:int)}.
Here is tested code. You are missing FLATTEN, and the summed fields must be referenced through the bag (A.view, A.click):
A = LOAD '/user/root/pig_data' using PigStorage(',') AS (OS, device, view, click);
B = GROUP A BY (OS, device);
RESULT = FOREACH B GENERATE FLATTEN(group) AS (OS, device), SUM(A.view) as views, SUM(A.click) as clicks;
dump RESULT;
I think you meant B in your example instead of J2 or J3, which may be in your actual code. Try:
B = GROUP A BY (OS, device);
Result = FOREACH B GENERATE
group.OS AS OS:int,
group.device AS device:long,
SUM(A.view) AS visits:long,
SUM(A.click) AS clicks:long;
dump Result;

Finding Unique visitors to a webpage

I want to write a Pig script that finds the number of unique userids that visit a particular webpage.
Table definition: a = (userid:chararray, otherid:chararray, webpage:chararray)
This is what I wrote, but it doesn't work:
a = (userid:chararray, otherid:chararray, webpage:chararray)
group_by_page = GROUP a by webpage ;
count_d = FOREACH group_by_page GENERATE group, count(distinct(a.userid));
You need to use the DISTINCT inside a nested foreach; it's not a UDF. This should get you where you need to go:
a = LOAD 'input' AS (userid:chararray, otherid:chararray, webpage:chararray);
group_by_page = GROUP a by webpage;
count_d = FOREACH group_by_page { uniq = DISTINCT a.userid; GENERATE group, COUNT(uniq); };
The Pig documentation on nested FOREACH has more detail.

Hadoop Pig GROUP by id, get owner_id?

In Hadoop I have many records that look like this:
(item_id,owner_id,counter) - there could be duplicates but ALWAYS the item_id has the same owner_id!
I want to get the SUM of the counter for each item_id so I have the following script:
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id);
data = FOREACH group_by_item GENERATE group AS item_id, OWNER_ID_COLUMN_SOMEHOW, SUM(known_items.counter) AS items_count;
The problem is that if I use known_items.owner_id in the FOREACH, I get a bag with one owner_id per grouped record rather than a single value. What would be the most efficient way to get just the first owner?
The simplest solution is to include the owner_id as part of the group. It gives you the right answer if your assumption that each item_id always has the same owner_id is correct, and it will let you know if it is not:
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id, owner_id);
data = FOREACH group_by_item GENERATE FLATTEN(group), SUM(known_items.counter) AS items_count;
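To make the "will let you know" part concrete: if an item_id ever has more than one owner_id, it shows up in more than one output row. The sketch below looks for such cases. It assumes tab-separated records loaded with the default loader instead of the custom loader D from the question, and the aliases per_item, owner_counts, and violations are hypothetical.

```pig
-- Assumption: default tab-separated loader instead of the custom loader D.
alldata = LOAD '/path/to/data/*' AS (item_id:long, owner_id:long, counter:long);
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id, owner_id);
data = FOREACH group_by_item GENERATE
    FLATTEN(group) AS (item_id, owner_id),
    SUM(known_items.counter) AS items_count;

-- An item_id with several rows in data has several distinct owner_ids.
per_item = GROUP data BY item_id;
owner_counts = FOREACH per_item GENERATE group AS item_id, COUNT(data) AS owners;
violations = FILTER owner_counts BY owners > 1L;
DUMP violations;
```

An empty result means the one-owner-per-item assumption holds for the loaded data.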
