Distinct records based on multiple fields using Pig - hadoop

Input data set:
field1,field2,field3,field4,field5
101,a1,a11,a111,a1111
102,a1,a11,a111,a1111
103,a1,a11,a111,a1111
201,b1,b11,b111,b1111
202,b1,b11,b111,b1111
The query below returns distinct records in Pig:
details = load 'emp.csv' using PigStorage(',') AS (field1:chararray,field2:chararray,field3:chararray,field4:chararray,field5:chararray);
distinct_details = DISTINCT details;
I have a use case where I need distinct records based only on field2, field3 and field4.
The expected output is:
101,a1,a11,a111,a1111
202,b1,b11,b111,b1111

You can use a nested foreach to accomplish what you want:
details = load 'emp.csv' using PigStorage(',') AS (field1:chararray,field2:chararray,field3:chararray,field4:chararray,field5:chararray);
distinct_details = foreach (GROUP details by (field2, field3, field4) ) {
temp_rel = details.(field1, field5);
temp_limit = LIMIT temp_rel 1;
generate FLATTEN(temp_limit) as (field1, field5), FLATTEN(group) as (field2, field3, field4);
}
DUMP distinct_details;
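This will give output like the following (LIMIT without a preceding ORDER keeps an arbitrary record from each group, so the field1 value may differ):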
(103,a1111,a1,a11,a111)
(202,b1111,b1,b11,b111)
You can further use a foreach on distinct_details to bring the fields in order.
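A minimal sketch of that reordering step, reusing the aliases from the script above. It also sorts inside the nested block before applying LIMIT, on the assumption that you want a predictable record per group rather than an arbitrary one. The aliases sorted_rel and ordered_details are illustrative names, not part of the original answer.

```pig
details = LOAD 'emp.csv' USING PigStorage(',')
    AS (field1:chararray, field2:chararray, field3:chararray, field4:chararray, field5:chararray);

distinct_details = FOREACH (GROUP details BY (field2, field3, field4)) {
    -- sort first so LIMIT keeps the smallest field1 in each group
    -- (field1 is a chararray here, so the comparison is lexicographic)
    sorted_rel = ORDER details BY field1 ASC;
    temp_limit = LIMIT sorted_rel 1;
    GENERATE FLATTEN(temp_limit.(field1, field5)) AS (field1, field5),
             FLATTEN(group) AS (field2, field3, field4);
};

-- put the columns back in their original order
ordered_details = FOREACH distinct_details GENERATE field1, field2, field3, field4, field5;
DUMP ordered_details;
```

Use DESC instead of ASC if the record with the largest field1 should be kept.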

Related

Order of Apache Pig Transformations

I am reading through Programming Pig by Alan Gates.
Consider the code:
ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS
(userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage ('|') AS
(movieID:int, movieTitle:chararray, releaseDate:chararray, imdbLink: chararray);
nameLookup = FOREACH metadata GENERATE
movieID, movieTitle, ToDate(releaseDate, 'dd-MMM-yyyy') AS releaseYear;
nameLookupYear = FOREACH nameLookup GENERATE
movieID, movieTitle, GetYear(releaseYear) AS finalYear;
filterMovies = FILTER nameLookupYear BY finalYear < 1982;
groupedMovies = GROUP filterMovies BY finalYear;
orderedMovies = FOREACH groupedMovies {
sortOrder = ORDER metadata by finalYear DESC;
GENERATE GROUP, finalYear;
};
DUMP orderedMovies;
It states that
"Sorting by maps, tuples or bags produces error".
I want to know how I can sort the grouped results.
Do the transformations need to follow a certain sequence for them to work?
Since you are trying to sort the grouped results, you do not need a nested foreach. You would use the nested foreach if you were trying to, for example, sort each movie within the year by title or release date. Try ordering as usual (refer to finalYear as group since you grouped by finalYear in the previous line):
orderedMovies = ORDER groupedMovies BY group ASC;
DUMP orderedMovies;
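If you are looking to sort the values inside each group, then you will have to use a nested foreach. Every record in a group already shares the same finalYear, so sort on another field such as movieTitle. Inside the block, refer to the bag by the name of the grouped relation (filterMovies) and to the key as group:
orderedMovies = FOREACH groupedMovies {
sortOrder = ORDER filterMovies BY movieTitle ASC;
GENERATE group AS finalYear, sortOrder.(movieID, movieTitle);
};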

How to get DISTINCT values of a group of fields in PIG?

Is it possible to get the following output in Pig? Will I be able to group by the 1st and 2nd fields and then do a DISTINCT on the 3rd field?
For example
I have input data
12345|9658965|52145
12345|9658965|52145
12345|9658965|52145
23456|8541232|96589
23456|8541232|96585
I want output something like
12345|9658965|52145
23456|8541232|96589
23456|8541232|96585
Approach 1 : Using DISTINCT
Ref : http://pig.apache.org/docs/r0.12.0/basic.html#distinct
DISTINCT operator should help
test = LOAD 'test.csv' USING PigStorage('|');
distinct_recs = DISTINCT test;
DUMP distinct_recs;
Approach 2 : GROUP BY all fields
test = LOAD 'test.csv' USING PigStorage('|');
grp_all_fields = GROUP test BY ($0,$1,$2);
uniq_recs = FOREACH grp_all_fields GENERATE FLATTEN(group);
DUMP uniq_recs;
Both approaches give the expected output for the input shared.
Try this, it's pretty similar:
A = LOAD 'test.csv' USING PigStorage('|') as (a1,a2,a3);
unique =
FOREACH (GROUP A BY a3) {
b = A.(a1,a2);
s = DISTINCT b;
GENERATE FLATTEN(s), group AS a4;
};
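The question asks specifically about grouping by the first two fields and taking distinct values of the third. A minimal sketch of that variant, assuming the same pipe-delimited test.csv; the aliases uniq, thirds and d are illustrative names.

```pig
A = LOAD 'test.csv' USING PigStorage('|') AS (a1:chararray, a2:chararray, a3:chararray);

uniq = FOREACH (GROUP A BY (a1, a2)) {
    -- project the third field, then de-duplicate it within the group
    thirds = A.a3;
    d = DISTINCT thirds;
    GENERATE FLATTEN(group) AS (a1, a2), FLATTEN(d) AS a3;
};
DUMP uniq;
```

Because all three fields end up in the result, this should produce the same set of rows as a plain DISTINCT over the whole relation. The nested form becomes useful when the relation has further columns that should not take part in the comparison.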

How to divide numbers from different tables in pig

I am trying to join two tables and divide a number from one table by a number from another table. I have attempted to do it in the original and generate a new table with the same values but I get the same error both times which is extra confusing to me.
--get the data
lines = LOAD '/historicaldata.csv' USING PigStorage(' ') AS (ticker:chararray, date:long, open:long, high:long, low:long, close:long, volume:long);
--limit it between the dates we want
specDates = FILTER lines BY (date<=20000103 and date>=19900101);
--group by ticker symbol
companies = GROUP specDates BY ticker;
--sort DESC and get the top to get the ending date
sorted_end = FOREACH companies {
sorted1 = ORDER specDates BY date DESC;
endDate = LIMIT sorted1 1;
GENERATE endDate.ticker AS ticker, endDate.open AS open, endDate.close AS close;
}
--sort ASC and get the top to get the starting date
sorted_begin = FOREACH companies {
sorted2 = ORDER specDates BY date ASC;
startDate = LIMIT sorted2 1;
GENERATE startDate.ticker AS ticker, startDate.open AS open, startDate.close AS close;
}
joined = JOIN sorted_end BY ticker, sorted_begin BY ticker;
final = FOREACH joined GENERATE sorted_end::ticker as ticker, sorted_begin::open as open, sorted_end::close as close;
final2 = FOREACH final GENERATE ticker as ticker, (float)(close/open) as growth_factor;
The error I keep getting is:
(Name: Divide Type: null Uid: null)incompatible types in Divide Operator left hand side:bag :tuple(close:float) right hand side:bag :tuple(open:float)
Both are floats so I am not sure why they are "incompatible types" other than that they come from different bags, but adding them to "final" and trying to do it from there doesn't work.
The data is in the form:
AA,20140131,11.60,11.80,11.45,11.48,33014100
AA,20140130,12.05,12.07,11.83,11.92,23223500
AA,20140129,11.64,12.23,11.58,11.96,44433000
Every entry includes all columns, and the values are well-formatted, non-zero numbers.
Based on your query, I created a dummy table on my system and generated the result. I found no issue, and the division completed successfully. Please find below some sample queries that I ran in Pig:
A = LOAD '/home/training/716391/pig/pigdata.csv' USING PigStorage(',') as (ID:INT, name:CHARARRAY, GPC:FLOAT);
B = LOAD '/home/training/716391/pig/pigdata2.csv' USING PigStorage(',') as (ID:INT, name:CHARARRAY, GPC:FLOAT);
C = join A by ID, B by ID;
D = FOREACH C generate A::ID as IDA, A::name as NAMEA, A::GPC as GPCA, B::ID as IDB, B::name as NAMEB, B::GPC as GPCB;
E = FOREACH D GENERATE IDA, (FLOAT)(GPCA/GPCB) AS VALUE;
Can you please confirm that the divisor in your case is never null or 0?
Could you please share the load statements for sorted_end and sorted_begin?
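The error message in the question reports both operands of the division as bags (bag :tuple(close:float) and bag :tuple(open:float)). That is consistent with endDate and startDate being bags, since LIMIT returns a bag and a projection such as endDate.close is therefore a bag too. A minimal sketch of one way around this is to FLATTEN the one-record bag into plain fields before joining. The load schema below is an assumption: it uses a comma delimiter and float prices to match the sample rows shown in the question, and the alias growth is an illustrative name.

```pig
-- assumption: comma-delimited input with float prices, as in the sample rows
lines = LOAD '/historicaldata.csv' USING PigStorage(',')
    AS (ticker:chararray, date:long, open:float, high:float, low:float, close:float, volume:long);
specDates = FILTER lines BY (date <= 20000103 AND date >= 19900101);
companies = GROUP specDates BY ticker;

sorted_end = FOREACH companies {
    sorted1 = ORDER specDates BY date DESC;
    endDate = LIMIT sorted1 1;
    -- FLATTEN turns the one-tuple bag into plain scalar fields
    GENERATE FLATTEN(endDate.(ticker, close)) AS (ticker, close);
};
sorted_begin = FOREACH companies {
    sorted2 = ORDER specDates BY date ASC;
    startDate = LIMIT sorted2 1;
    GENERATE FLATTEN(startDate.(ticker, open)) AS (ticker, open);
};

joined = JOIN sorted_end BY ticker, sorted_begin BY ticker;
-- both operands are now floats, so the division type-checks
growth = FOREACH joined GENERATE sorted_end::ticker AS ticker,
    (sorted_end::close / sorted_begin::open) AS growth_factor;
DUMP growth;
```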

Nested FILTER in PIG

I want to perform a nested filter statement in Pig. For Example:
Query:
select trim(udc1.drky) drky,
trim(udc1.drsy) drsy,
trim(udc1.drrt) drrt,
trim(udc1.drdl01) drld01,
'Fixed' as AssetType
from f0005 udc1
where trim(udc1.drsy) = '12'
and trim(udc1.drrt) = 'C2'
and trim(udc1.drky) not in (
select trim(drky)
from f0005
where trim(drsy) = '57' and trim(drrt) = 'AC'
)
I need to convert the above query to a Pig script. However, I don't know how to take the filters from the inner query and associate them with the outer query. I can write a Pig UDF as a last option but would rather implement a solution in native Pig.
Please help me with the above issue.
Let's say the input below follows this layout:
drky, drsy, drtt, drld01
1,57,AC,999
2,57,AC,899
2,12,C2,799
1,12,C2,699
4,57,BC,990
5,12,C3,998
6,12,C2,997
As per your query, the expected output is:
6,12,C2,997
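In Pig you can achieve this with a join: left outer join the records against the keys returned by the inner query, then keep only the rows that found no match. Please look at the code below: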
records = LOAD '/user/user/inputfiles/assets.txt' USING PigStorage(',') AS(drky:chararray,drsy:chararray,drtt:chararray,drld01:chararray);
records_filter = FILTER records BY drsy == '57' AND drtt == 'AC';
records_each = FOREACH records_filter GENERATE drky as drky_temp;
records_join = JOIN records BY drky LEFT OUTER, records_each BY drky_temp;
records_join_filter = FILTER records_join BY drky_temp is null and drsy == '12' AND drtt == 'C2';
records_output = FOREACH records_join_filter GENERATE drky, drsy, drtt, drld01, 'FIXED' AS asset_type;
dump records_output;
Output of the above Pig script:
6,12,C2,997,FIXED
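As an alternative to the left outer join, the same NOT IN condition can be sketched with COGROUP and the built-in IsEmpty function. This is only an illustration under the same assumed input file and schema as the script above; the aliases candidates, excluded, grouped and kept are made-up names.

```pig
records = LOAD '/user/user/inputfiles/assets.txt' USING PigStorage(',')
    AS (drky:chararray, drsy:chararray, drtt:chararray, drld01:chararray);

-- outer query rows and inner (excluded) query rows
candidates = FILTER records BY drsy == '12' AND drtt == 'C2';
excluded = FILTER records BY drsy == '57' AND drtt == 'AC';

-- one row per drky, with a bag of matching rows from each side
grouped = COGROUP candidates BY drky, excluded BY drky;

-- keep keys that never appear on the excluded side (the NOT IN condition)
kept = FILTER grouped BY IsEmpty(excluded);

-- flattening an empty candidates bag yields no rows, so keys that exist
-- only on the excluded side cannot leak into the result
records_output = FOREACH kept GENERATE FLATTEN(candidates), 'FIXED' AS asset_type;
DUMP records_output;
```

The SQL in the question also trims each column; if the data carries padding, wrapping the compared fields in Pig's built-in TRIM would be needed in either version.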

Pig - Get Top n and group rest in 'other'

I have data that I have grouped and aggregated, looks like this-
Date Country Browser Count
---- ------- ------- -----
2015-07-11,US,Chrome,13
2015-07-11,US,Opera Mini,1
2015-07-11,US,Firefox,2
2015-07-11,US,IE,1
2015-07-11,US,Safari,1
...
2015-07-11,UK,Chrome Mobile,1026
2015-07-11,UK,IE,455
2015-07-11,UK,Mobile Safari,4782
2015-07-11,UK,Mobile Firefox,40
...
2015-07-11,DE,Android browser,1316
2015-07-11,DE,Opera Mini,3
2015-07-11,DE,PS4 Web browser,11
I want to get the top n browsers (by count) per country and aggregate the rest under 'Other'. I looked into Pig's built-in TOP function, but how would I group the remaining rows into 'Other'? The result I want, for example (n = 2):
2015-07-11,US,Chrome,13
2015-07-11,US,Firefox,2
2015-07-11,US,Other,3
What would be the best way to go about this?
OK, this is a nice requirement.
I am simply using your input in the LOAD statement of the Pig script.
Input :
2015-07-11,US,Chrome,13
2015-07-11,US,Opera Mini,1
2015-07-11,US,Firefox,2
2015-07-11,US,IE,1
2015-07-11,US,Safari,1
2015-07-11,UK,Chrome Mobile,1026
2015-07-11,UK,IE,455
2015-07-11,UK,Mobile Safari,4782
2015-07-11,UK,Mobile Firefox,40
2015-07-11,DE,Android browser,1316
2015-07-11,DE,Opera Mini,3
2015-07-11,DE,PS4 Web browser,11
Below is the code for this.
You can pass n to the Pig script as a parameter; in the code below I have hardcoded n = 2 in the LIMIT statement.
records = LOAD '/user/cloudera/inputfiles/entries.txt' USING PigStorage(',') as (dt:chararray,country:chararray,browser:chararray,count:int);
records_each = FOREACH(GROUP records BY (dt,country,browser)) GENERATE flatten(group) AS (dt,country,browser), MAX(records.count) as counts;
records_grp_order = ORDER records_each BY dt ASC , country ASC , counts DESC;
records_grp = GROUP records_grp_order BY (dt, country);
rec_each = FOREACH records_grp {
top_2_recs = LIMIT records_grp_order 2;
generate MAX(top_2_recs.dt) AS temp_dt, MAX(top_2_recs.country) AS temp_country, flatten(top_2_recs.browser) AS temp_browser;
};
rec_join = JOIN records_each BY (dt,country,browser) left outer , rec_each BY (temp_dt,temp_country,temp_browser);
rec_join_each = FOREACH rec_join generate dt,country, (temp_browser is not null ? browser : 'OTHERS') AS browser, counts AS counts;
rec_final_grp = GROUP rec_join_each BY (dt,country,browser);
final_output = FOREACH rec_final_grp generate flatten(group) AS (dt,country,browser), SUM(rec_join_each.counts) AS total_counts;
sorted_output = ORDER final_output BY dt ASC , country ASC, total_counts DESC;
dump sorted_output;
Output:
(2015-07-11,DE,Android browser,1316)
(2015-07-11,DE,PS4 Web browser,11)
(2015-07-11,DE,OTHERS,3)
(2015-07-11,UK,Mobile Safari,4782)
(2015-07-11,UK,Chrome Mobile,1026)
(2015-07-11,UK,OTHERS,495)
(2015-07-11,US,Chrome,13)
(2015-07-11,US,OTHERS,3)
(2015-07-11,US,Firefox,2)
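The answer mentions passing n as a parameter instead of hardcoding it. A minimal sketch of just the top-n step using Pig's parameter substitution follows. It also sorts inside the nested block, on the assumption that the order established before a GROUP should not be relied on inside the grouped bags. The file name top_n.pig, the field name cnt and the aliases sorted_recs, top_recs and top_n are illustrative.

```pig
-- run with: pig -param n=2 top_n.pig   (top_n.pig is a hypothetical file name)
records = LOAD '/user/cloudera/inputfiles/entries.txt' USING PigStorage(',')
    AS (dt:chararray, country:chararray, browser:chararray, cnt:int);

top_n = FOREACH (GROUP records BY (dt, country)) {
    -- sort within each (dt, country) group, then keep the first $n rows
    sorted_recs = ORDER records BY cnt DESC;
    top_recs = LIMIT sorted_recs $n;
    GENERATE FLATTEN(top_recs);
};
DUMP top_n;
```

The remaining steps (joining back and relabelling everything else as 'OTHERS') can stay as in the script above, with top_n taking the place of rec_each.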
