Hadoop Pig GROUP by id, get owner_id? - hadoop

In Hadoop I have many records that look like this:
(item_id,owner_id,counter) - there could be duplicates but ALWAYS the item_id has the same owner_id!
I want to get the SUM of the counter for each item_id so I have the following script:
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id);
data = FOREACH group_by_item GENERATE group AS item_id, OWNER_ID_COLUMN_SOMEHOW, SUM(known_items.counter) AS items_count;
The problem is that in the FOREACH, known_items.owner_id is a bag holding the owner_id of every record grouped under that item_id, not a single value. What would be the most efficient way to get just the first of those owners?

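The simplest solution gives you the right answer if your assumption that each item_id has the same owner_id is correct, and will let you know if it is not: include the owner_id as part of the group key. If an item_id ever appears with two different owners, it will show up as two rows in the output.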
alldata = LOAD '/path/to/data/*' USING D; -- D describes the structure
known_items = FILTER alldata BY owner_id > 0L;
group_by_item = GROUP known_items BY (item_id, owner_id);
data = FOREACH group_by_item GENERATE FLATTEN(group), SUM(known_items.counter) AS items_count;
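To make the effect of the composite group key concrete, here is a minimal plain-Python model of the same idea. It is only a sketch of the semantics, not Pig code; the sample rows and the function name sum_by_item_and_owner are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: (item_id, owner_id, counter)
rows = [
    (1, 10, 5),
    (1, 10, 7),
    (2, 20, 1),
    (3, 30, 2),
    (3, 31, 4),  # deliberately breaks the "one owner per item" assumption
]

def sum_by_item_and_owner(rows):
    """Model GROUP BY (item_id, owner_id) followed by SUM(counter)."""
    totals = defaultdict(int)
    for item_id, owner_id, counter in rows:
        if owner_id > 0:  # plays the role of the FILTER step
            totals[(item_id, owner_id)] += counter
    return sorted((item, owner, total) for (item, owner), total in totals.items())

result = sum_by_item_and_owner(rows)
print(result)
```

Item 3 comes out as two rows, one per owner, which is how a violated assumption would become visible in the output.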

Related

Filter inner bag in Pig

The data looks like this:
22678, {(112),(110),(2)}
656565, {(110), (109)}
6676, {(2),(112)}
This is the data structure:
(id:chararray, event_list:{innertuple:(innerfield:chararray)})
I want to filter those rows where event_list contains 2. I thought initially to flatten the data and then filter those rows that have 2. Somehow flatten doesn't work on this dataset.
Can anyone please help?
There might be a simpler way of doing this, such as a bag lookup. Otherwise, with basic Pig, one way of achieving this is:
data = load 'data.txt' AS (id:chararray, event_list:bag{});
-- flatten bag, in order to transpose each element to a separate row.
flattened = foreach data generate id, flatten(event_list);
-- keep only those rows where the value is 2.
filtered = filter flattened by (int) $1 == 2;
-- keep only distinct ids.
dist = distinct (foreach filtered generate $0 as (id:chararray));
-- join distinct ids to original relation
jnd = join data by id, dist by id;
-- remove extra fields, keep original fields.
result = foreach jnd generate data::id, data::event_list;
dump result;
(22678,{(112),(110),(2)})
(6676,{(2),(112)})
You can filter the bag inside a nested FOREACH and project a boolean that says whether 2 is present in the bag. Then keep only the rows where that projection is true:
-- name the inner field so the nested FILTER can refer to it
data = LOAD 'data.txt' AS (id:chararray, event_list:bag{t:tuple(val_0:chararray)});
data_filt = FOREACH data {
bag_filter = FILTER event_list BY (val_0 matches '2');
GENERATE
id,
event_list,
(IsEmpty(bag_filter) ? false : true) AS is_2_present:boolean;
};
result = FILTER data_filt BY is_2_present;
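Both answers boil down to a membership test on the inner bag. The following plain-Python sketch models that on the sample data from the question; it is not Pig code, and the function name rows_with_event is hypothetical.

```python
# Sample data from the question: (id, event_list), all values as strings
rows = [
    ("22678", ["112", "110", "2"]),
    ("656565", ["110", "109"]),
    ("6676", ["2", "112"]),
]

def rows_with_event(rows, wanted):
    """Keep the rows whose inner bag contains the wanted value."""
    return [(row_id, events) for row_id, events in rows if wanted in events]

matching = rows_with_event(rows, "2")
print(matching)
```

Note that the comparison is an exact match on the whole value, so "112" does not count as containing "2".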

How to divide numbers from different tables in pig

I am trying to join two tables and divide a number from one table by a number from another table. I have attempted to do it in the original and generate a new table with the same values but I get the same error both times which is extra confusing to me.
--get the data
lines = LOAD '/historicaldata.csv' USING PigStorage(' ') AS (ticker:chararray, date:long, open:long, high:long, low:long, close:long, volume:long);
--limit it between the dates we want
specDates = FILTER lines BY (date<=20000103 and date>=19900101);
--sort by ticker symbol
companies = GROUP specDates BY ticker;
--sort DESC and get the top to get the ending date
sorted_end = FOREACH companies {
sorted1 = ORDER specDates BY date DESC;
endDate = LIMIT sorted1 1;
GENERATE endDate.ticker AS ticker, endDate.open AS open, endDate.close AS close;
}
--sort ASC and get the top to get the starting date
sorted_begin = FOREACH companies {
sorted2 = ORDER specDates BY date ASC;
startDate = LIMIT sorted2 1;
GENERATE startDate.ticker AS ticker, startDate.open AS open, startDate.close AS close;
}
joined = JOIN sorted_end BY ticker, sorted_begin BY ticker;
final = FOREACH joined GENERATE sorted_end::ticker as ticker, sorted_begin::open as open, sorted_end::close as close;
final2 = FOREACH final GENERATE ticker as ticker, (float)(close/open) as growth_factor;
The error I keep getting is:
(Name: Divide Type: null Uid: null)incompatible types in Divide Operator left hand side:bag :tuple(close:float) right hand side:bag :tuple(open:float)
Both are floats so I am not sure why they are "incompatible types" other than that they come from different bags, but adding them to "final" and trying to do it from there doesn't work.
The data is in the form:
AA,20140131,11.60,11.80,11.45,11.48,33014100
AA,20140130,12.05,12.07,11.83,11.92,23223500
AA,20140129,11.64,12.23,11.58,11.96,44433000
Every entry includes all columns and are well formatted, non-zero numbers
Based on your query, I created a dummy table on my system and generated the result. I found no issue, and the division completed successfully. Below are the sample statements I ran in Pig:
A = LOAD '/home/training/716391/pig/pigdata.csv' USING PigStorage(',') as (ID:INT, name:CHARARRAY, GPC:FLOAT);
B = LOAD '/home/training/716391/pig/pigdata2.csv' USING PigStorage(',') as (ID:INT, name:CHARARRAY, GPC:FLOAT);
C = join A by ID, B by ID;
D = FOREACH C generate A::ID as IDA, A::name as NAMEA, A::GPC as GPCA, B::ID as IDB, B::name as NAMEB, B::GPC as GPCB;
E = FOREACH D GENERATE IDA, (FLOAT)(GPCA/GPCB) AS VALUE;
Can you please confirm that the divisor in your case is never null or 0?
Could you also share the schemas of sorted_end and sorted_begin (for example, the output of DESCRIBE)?
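The error message in the question reports both operands of the division as bags. A likely cause, given the script shown there, is that a projection such as endDate.open inside the nested FOREACH yields a bag even when LIMIT keeps a single row, so the values need to be turned into scalars before dividing (in Pig, typically with FLATTEN in the GENERATE). The plain-Python sketch below models that distinction with one-element lists standing in for bags; it is an illustration under that assumption, not Pig code, and it skips the date filter.

```python
# (ticker, date, open, close) taken from the sample rows in the question
rows = [
    ("AA", 20140131, 11.60, 11.48),
    ("AA", 20140130, 12.05, 11.92),
    ("AA", 20140129, 11.64, 11.96),
]

def first_and_last(rows):
    """Return one-element lists, like the bags produced by ORDER + LIMIT 1."""
    ordered = sorted(rows, key=lambda r: r[1])
    return ordered[:1], ordered[-1:]

start_bag, end_bag = first_and_last(rows)

# Dividing the "bags" directly fails, much like the Divide operator error.
try:
    [r[3] for r in end_bag] / [r[2] for r in start_bag]
    bag_division_failed = False
except TypeError:
    bag_division_failed = True

# Unpacking the single row (the role FLATTEN plays) gives plain numbers.
(start_row,) = start_bag
(end_row,) = end_bag
growth_factor = end_row[3] / start_row[2]  # closing price / opening price
print(bag_division_failed, growth_factor)
```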

Pig - Get Top n and group rest in 'other'

I have data that I have grouped and aggregated, looks like this-
Date Country Browser Count
---- ------- ------- -----
2015-07-11,US,Chrome,13
2015-07-11,US,Opera Mini,1
2015-07-11,US,Firefox,2
2015-07-11,US,IE,1
2015-07-11,US,Safari,1
...
2015-07-11,UK,Chrome Mobile,1026
2015-07-11,UK,IE,455
2015-07-11,UK,Mobile Safari,4782
2015-07-11,UK,Mobile Firefox,40
...
2015-07-11,DE,Android browser,1316
2015-07-11,DE,Opera Mini,3
2015-07-11,DE,PS4 Web browser,11
I want to get the top n browsers (by count) per country and aggregate the rest under 'Other'. I looked into Pig's built-in TOP function, but how would I group the remainder into 'Other'? The result I want, for example (n = 2) ->
2015-07-11,US,Chrome,13
2015-07-11,US,Firefox,2
2015-07-11,US,Other,3
What would be the best way to go about this?
OK, this is a nice requirement.
I am simply using your input in the LOAD statement of the Pig script.
Input:
2015-07-11,US,Chrome,13
2015-07-11,US,Opera Mini,1
2015-07-11,US,Firefox,2
2015-07-11,US,IE,1
2015-07-11,US,Safari,1
2015-07-11,UK,Chrome Mobile,1026
2015-07-11,UK,IE,455
2015-07-11,UK,Mobile Safari,4782
2015-07-11,UK,Mobile Firefox,40
2015-07-11,DE,Android browser,1316
2015-07-11,DE,Opera Mini,3
2015-07-11,DE,PS4 Web browser,11
Expected output for US (n = 2):
2015-07-11,US,Chrome,13
2015-07-11,US,Firefox,2
2015-07-11,US,Other,3
Below is the code for this.
You can pass n to the Pig script as a parameter; in the code below I have hardcoded n = 2 in the LIMIT statement.
records = LOAD '/user/cloudera/inputfiles/entries.txt' USING PigStorage(',') as (dt:chararray,country:chararray,browser:chararray,count:int);
records_each = FOREACH(GROUP records BY (dt,country,browser)) GENERATE flatten(group) AS (dt,country,browser), MAX(records.count) as counts;
records_grp_order = ORDER records_each BY dt ASC , country ASC , counts DESC;
records_grp = GROUP records_grp_order BY (dt, country);
rec_each = FOREACH records_grp {
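-- re-sort inside the group; the earlier ORDER is not guaranteed to survive the GROUP
sorted_recs = ORDER records_grp_order BY counts DESC;
top_2_recs = LIMIT sorted_recs 2;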
generate MAX(top_2_recs.dt) AS temp_dt, MAX(top_2_recs.country) AS temp_country, flatten(top_2_recs.browser) AS temp_browser;
};
rec_join = JOIN records_each BY (dt,country,browser) left outer , rec_each BY (temp_dt,temp_country,temp_browser);
rec_join_each = FOREACH rec_join generate dt,country, (temp_browser is not null ? browser : 'OTHERS') AS browser, counts AS counts;
rec_final_grp = GROUP rec_join_each BY (dt,country,browser);
final_output = FOREACH rec_final_grp generate flatten(group) AS (dt,country,browser), SUM(rec_join_each.counts) AS total_counts;
sorted_output = ORDER final_output BY dt ASC , country ASC, total_counts DESC;
dump sorted_output;
Output:
(2015-07-11,DE,Android browser,1316)
(2015-07-11,DE,PS4 Web browser,11)
(2015-07-11,DE,OTHERS,3)
(2015-07-11,UK,Mobile Safari,4782)
(2015-07-11,UK,Chrome Mobile,1026)
(2015-07-11,UK,OTHERS,495)
(2015-07-11,US,Chrome,13)
(2015-07-11,US,OTHERS,3)
(2015-07-11,US,Firefox,2)
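The logic of the script (rank within each date and country, keep the top n, sum the rest under one label) can be modelled in a few lines of plain Python. This is a sketch of the semantics only, not Pig code; the function name top_n_with_others is made up, and the input is the sample data from the question.

```python
from collections import defaultdict

rows = [
    ("2015-07-11", "US", "Chrome", 13),
    ("2015-07-11", "US", "Opera Mini", 1),
    ("2015-07-11", "US", "Firefox", 2),
    ("2015-07-11", "US", "IE", 1),
    ("2015-07-11", "US", "Safari", 1),
    ("2015-07-11", "UK", "Chrome Mobile", 1026),
    ("2015-07-11", "UK", "IE", 455),
    ("2015-07-11", "UK", "Mobile Safari", 4782),
    ("2015-07-11", "UK", "Mobile Firefox", 40),
    ("2015-07-11", "DE", "Android browser", 1316),
    ("2015-07-11", "DE", "Opera Mini", 3),
    ("2015-07-11", "DE", "PS4 Web browser", 11),
]

def top_n_with_others(rows, n=2, other_label="OTHERS"):
    """Keep the n largest counts per (date, country); sum the rest under one label."""
    by_group = defaultdict(list)
    for dt, country, browser, count in rows:
        by_group[(dt, country)].append((browser, count))

    result = []
    for (dt, country), items in sorted(by_group.items()):
        ranked = sorted(items, key=lambda bc: bc[1], reverse=True)
        merged = ranked[:n]
        rest = sum(count for _, count in ranked[n:])
        if rest:
            merged.append((other_label, rest))
        # Final ordering by count, as in the ORDER ... total_counts DESC step
        merged.sort(key=lambda bc: bc[1], reverse=True)
        result.extend((dt, country, browser, count) for browser, count in merged)
    return result

result = top_n_with_others(rows, n=2)
for row in result:
    print(row)
```

Because the final sort is by count, the OTHERS row can land above a top-n browser, as it does for US in the output above.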

Min max group wise and filter without join in Pig

I am trying to find (max+min)/2 for each group. The following is my schema
UrlXpathsCount: {url: chararray,leafpathstr: chararray,urlpath_count: long}
and I am trying to group it by url field
byUrl = GROUP UrlXpathsCount by url;
I am trying to find (max+min)/2 in the following way.
midRangeByUrl = FOREACH byUrl{
urls_desc = order UrlXpathsCount by urlpath_count desc;
urls_max = limit urls_desc 1;
urls_asc = order UrlXpathsCount by urlpath_count asc;
urls_min = limit urls_asc 1;
GENERATE FLATTEN(urls_max),FLATTEN(urls_min);
};
The following is the schema for midRangeByUrl
midRangeByUrl: {urls_max::url: chararray,urls_max::leafpathstr: chararray,urls_max::urlpath_count: long,urls_min::url: chararray,urls_min::leafpathstr: chararray,urls_min::urlpath_count: long}
The problem I am facing now is that adding FLATTEN(group), FLATTEN(urls_max), FLATTEN(urls_min) gives me a lot of combinations that I don't want.
I would like to get (max + min)/2 for each group.
To do this, I am projecting the urlpath_count of both max and min as follows:
computeMidRange = FOREACH midRangeByUrl generate urls_max::url as mid_url,((DOUBLE)urls_max::urlpath_count+(DOUBLE) urls_min::urlpath_count)/2 as midRange;
And I am joining the two tables by the following
/* Join computeMidRange and UrlXpathsCount */
midRangeJoin = join UrlXpathsCount by url , computeMidRange by mid_url using 'replicated';
midRangeOut = FOREACH midRangeJoin GENERATE UrlXpathsCount::url as url,UrlXpathsCount::leafpathstr as leafpathstr,
UrlXpathsCount::urlpath_count as urlpath_count,computeMidRange::midRange as midRange;
and then applying the filter
templates = FILTER midRangeOut by urlpath_count > midRange;
I would like to avoid the midRangeJoin by somehow computing midRangeByUrl and projecting the fields url, urlpath_count, leafpathstr and (min+max)/2 without the join.
Please help me figure this out.
Thanks
You could instead use the built-in MAX and MIN functions:
UrlXpathsCount = load 'your_data' using PigStorage(',') as (url: chararray,leafpathstr: chararray,urlpath_count: long);
B = GROUP UrlXpathsCount by url;
C = foreach B generate group as url, MAX(UrlXpathsCount.urlpath_count) as max_count,
MIN(UrlXpathsCount.urlpath_count) as min_count;
D = foreach C generate url, ((double)max_count + (double)min_count)/2 as val;
This computes (max+min)/2 per url without nested FOREACHs or joins. I split the calculation into C and D to avoid an extremely long line, but you could do it in a single statement too. Just remember to cast the values to double: urlpath_count is a long, so without the cast the division is done on integers and you lose the decimals.
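The effect of the cast is easy to see in a small plain-Python model of the same computation. This is a sketch, not Pig code; the rows and the function name mid_range_by_url are hypothetical, and integer division (//) stands in for dividing two longs.

```python
# Hypothetical rows: (url, leafpathstr, urlpath_count)
rows = [
    ("a.com", "/x", 1),
    ("a.com", "/y", 4),
    ("b.com", "/z", 10),
]

def mid_range_by_url(rows, as_double=True):
    """Model GROUP BY url with MIN/MAX, then (max + min) / 2."""
    bounds = {}
    for url, _, count in rows:
        lo, hi = bounds.get(url, (count, count))
        bounds[url] = (min(lo, count), max(hi, count))
    if as_double:
        return {url: (lo + hi) / 2 for url, (lo, hi) in bounds.items()}
    # Integer division: what you get when neither operand is cast
    return {url: (lo + hi) // 2 for url, (lo, hi) in bounds.items()}

with_cast = mid_range_by_url(rows)
without_cast = mid_range_by_url(rows, as_double=False)
print(with_cast, without_cast)
```

For a.com the cast version gives 2.5 while the integer version truncates to 2, which would change which rows pass a filter such as urlpath_count > midRange.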

Pig 0.12 Nested Foreach not working properly

I've been trying to do this for a while and can't figure it out, and it's hard to search for a fix for my problem.
I have a relation that I previously grouped by user_id and listing_id; after generating and flattening the output I got this:
test: {user_id: bytearray,listing_id: bytearray,hotness: long}
So my next step is to group by user, order by hotness and limit the amount of listings per user to 20.
grped = GROUP test BY user_id;
grped_sorted = FOREACH grped {
sorted = order test BY hotness desc;
top1 = limit sorted 20;
listings = FOREACH top1 GENERATE FLATTEN((bytearray)top1.listing_id) as listing_id;
GENERATE group as user_id, FLATTEN(listings.($0)) as listing_ids;
};
But this gives me the following error (I have stripped out the listing details):
Scalar has more than one row in the output.
Please, I need help on this.
Is there a way to do this? Can I use some UDF from DataFu?
Creating my own UDF is out of the question.
Thanks in advance.
I think it should work if the code looks like this
grped = GROUP test BY user_id;
grped_sorted = FOREACH grped {
sorted = order test BY hotness desc;
top1 = limit sorted 20;
GENERATE group as user_id, top1.listing_id as listing_ids;
};
Output of this would be something like
grped_sorted: {user_id: bytearray,listing_ids: {(listing_id: bytearray)}}
Not sure if this is what you want though.
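The shape of that result (one row per user, carrying a bag of at most 20 listing ids ordered by hotness) can be modelled in plain Python as follows. This is only a sketch of the semantics, not Pig code; the sample rows and the function name top_listings_per_user are made up, and the limit is set to 2 to keep the example small.

```python
from collections import defaultdict

# Hypothetical rows: (user_id, listing_id, hotness)
rows = [
    ("u1", "a", 5),
    ("u1", "b", 9),
    ("u1", "c", 1),
    ("u2", "d", 3),
]

def top_listings_per_user(rows, limit=20):
    """Model GROUP BY user_id, then ORDER BY hotness DESC and LIMIT per group."""
    by_user = defaultdict(list)
    for user_id, listing_id, hotness in rows:
        by_user[user_id].append((hotness, listing_id))
    return {
        user: [
            listing
            for _, listing in sorted(items, key=lambda x: x[0], reverse=True)[:limit]
        ]
        for user, items in by_user.items()
    }

top = top_listings_per_user(rows, limit=2)
print(top)
```

Each user keeps a single row whose value is a list; flattening that list would instead produce one row per (user, listing) pair.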
