Replace NULL values with the average (AVG) in Pig (Hadoop)

Here is my code:
claims = LOAD 'Darshan/automobile_insurance_claims.csv' USING PigStorage(',') AS (claim_id:chararray, policy_master_id:chararray, registration_no:chararray, engine_no:chararray, chassis_no:chararray, customer_id:int, Col6:int,first_name:chararray, last_name:chararray,street:chararray,address:chararray, city:chararray, zip:long,gender:chararray, claim_date:chararray, garage_city:chararray, bill_no:long, claim_amount:double, garage_name:chararray,claim_status:chararray);
grp_all = group claims all;
avg = foreach grp_all generate AVG(claims.Col6);
grp = group claims by claim_id;
m = foreach grp generate group, ((Col6 IS NULL) ? avg : Col6);
Result: dump avg; returns 33.45
I get the following error while replacing NULL values in Col6 (i.e. age):
Caused by:
Invalid scalar projection: avg : A column needs to be projected from a relation for it to be used as a scalar
at org.apache.pig.parser.LogicalPlanGenerator.var_expr(LogicalPlanGenerator.java:10947)
at org.apache.pig.parser.LogicalPlanGenerator.expr(LogicalPlanGenerator.java:10164)
at org.apache.pig.parser.LogicalPlanGenerator.bin_expr(LogicalPlanGenerator.java:11992)
at org.apache.pig.parser.LogicalPlanGenerator.projectable_expr(LogicalPlanGenerator.java:11104)
at org.apache.pig.parser.LogicalPlanGenerator.var_expr(LogicalPlanGenerator.java:10815)
at org.apache.pig.parser.LogicalPlanGenerator.expr(LogicalPlanGenerator.java:10164)
at org.apache.pig.parser.LogicalPlanGenerator.flatten_generated_item(LogicalPlanGenerator.java:7493)
at org.apache.pig.parser.LogicalPlanGenerator.generate_clause(LogicalPlanGenerator.java:17595)
at org.apache.pig.parser.LogicalPlanGenerator.foreach_plan(LogicalPlanGenerator.java:15987)
at org.apache.pig.parser.LogicalPlanGenerator.foreach_clause(LogicalPlanGenerator.java:15854)
at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1933)
at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
... 17 more
2016-08-08 05:51:07,297 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
Invalid scalar projection: avg : A column needs to be projected from a relation for it to be used as a scalar.
Line 11 is: m = foreach grp generate group, ((Col6 IS NULL) ? avg : Col6);

Darshan, the logic you are attempting is fine: you can replace NULLs with the average. The issue here is the projection of a column.
If you revisit your code, you will find that the average lives in one relation while you are trying to access it from a different relation.
In your code, "avg" is a relation, not a column. If I understand correctly, you can fix this in the FOREACH that follows your first GROUP statement: where you generate AVG, generate the other columns as well, so that the average and Col6 end up in the same relation.
Load your data.
Group your data as needed.
Calculate AVG and generate the other columns.
If you want, you can apply the replacement logic in the same FOREACH.
Please let me know if you still face any issue.
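The steps above can be sketched roughly as follows. This is a minimal sketch, not the answerer's exact code: the aliases with_avg and filled and the field names avg_col6 and Col6_filled are made up for illustration, and the replacement is done in a second FOREACH to keep the sketch simple. Note that GROUP ... ALL collects every row into a single bag, so this may not scale to very large inputs.

```pig
claims = LOAD 'Darshan/automobile_insurance_claims.csv' USING PigStorage(',') AS
    (claim_id:chararray, policy_master_id:chararray, registration_no:chararray,
     engine_no:chararray, chassis_no:chararray, customer_id:int, Col6:int,
     first_name:chararray, last_name:chararray, street:chararray, address:chararray,
     city:chararray, zip:long, gender:chararray, claim_date:chararray,
     garage_city:chararray, bill_no:long, claim_amount:double,
     garage_name:chararray, claim_status:chararray);

grp_all = GROUP claims ALL;

-- FLATTEN(claims) restores one row per claim; the average is repeated on every row,
-- so the average and Col6 now sit in the same relation.
with_avg = FOREACH grp_all GENERATE FLATTEN(claims), AVG(claims.Col6) AS avg_col6;

-- AVG returns a double, so Col6 is cast to double to keep both branches the same type.
filled = FOREACH with_avg GENERATE claim_id,
         ((Col6 IS NULL) ? avg_col6 : (double)Col6) AS Col6_filled;

DUMP filled;
```

AVG ignores NULL values, so the average is computed only over the rows where Col6 is present.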

You are getting the error because avg is a relation, and you need to use a column of that relation. Correct your last Pig statement to refer to the first column in the relation avg, like this:
m = foreach grp generate group, ((claims.Col6 IS NULL) ? (double)avg.$0 : claims.Col6);
Alternatively, you can name the column and refer to it by name, like this:
avg = foreach grp_all generate AVG(claims.Col6) AS AVG_Col6;
grp = group claims by claim_id;
m = foreach grp generate group, ((claims.Col6 IS NULL) ? (double)avg.AVG_Col6 : claims.Col6);

Here is the final code for my query:
claims = LOAD 'Darshan/automobile_insurance_claims.csv' USING PigStorage(',') AS
(claim_id:chararray, policy_master_id:chararray, registration_no:chararray,
engine_no:chararray, chassis_no:chararray, customer_id:int, Col6:int,
first_name:chararray, last_name:chararray,street:chararray,address:chararray,
city:chararray, zip:long,gender:chararray, claim_date:chararray,
garage_city:chararray, bill_no:long, claim_amount:double,
garage_name:chararray,claim_status:chararray);
grp_all = group claims all;
avg = foreach grp_all generate AVG(claims.Col6);
grp = group claims by claim_id;
result = foreach grp {
    val = foreach claims generate ((Col6 IS NULL) ? avg.$0 : Col6);
    generate group, val;
};
Here is the link to dataset automobile_insurance_claims.csv

Related

Get value for unique record using Pig

Below is the input data set.
col1,col2,col3,col4,col5
key1,111,1,12/11/2016,10
key2,111,1,12/11/2016,10
key3,111,1,12/11/2016,10
key4,222,2,12/22/2016,10
key5,222,2,12/22/2016,10
key6,333,3,12/30/2016,10
key7,111,0,12/11/2016,10
The combination of col2, col3, and col4 identifies a unique record. For each such record I need to take any one value of col1 and populate it as a new field, say col6. The expected output is below.
col1,col2,col3,col4,col5,col6
key1,111,1,12/11/2016,10,key3
key2,111,1,12/11/2016,10,key3
key3,111,1,12/11/2016,10,key3
key4,222,2,12/22/2016,10,key5
key5,222,2,12/22/2016,10,key5
key6,333,3,12/30/2016,10,key6
key7,111,0,12/11/2016,10,key7
Below is my script, which fails with an error.
A = load 'test1.csv' using PigStorage(',');
B = GROUP A by ($1,$2,$3);
C = FOREACH B GENERATE FLATTEN(group), MAX(A.$0);
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2106: Error executing an algebraic function
This looks like a good use case for a nested FOREACH.
Ref : https://pig.apache.org/docs/r0.14.0/basic.html#foreach
Input:
key1,111,1,12/11/2016,10
key2,111,1,12/11/2016,10
key3,111,1,12/11/2016,10
key4,222,2,12/22/2016,10
key5,222,2,12/22/2016,10
key6,333,3,12/30/2016,10
key7,111,0,12/11/2016,10
Pig script:
A = load 'input.csv' using PigStorage(',') AS (col1:chararray,col2:chararray,col3:chararray,col4:chararray,col5:chararray);
B = FOREACH(GROUP A BY (col2,col3,col4)) {
    ordered = ORDER A BY col1 DESC;
    latest = LIMIT ordered 1;
    GENERATE FLATTEN(A) AS (col1:chararray,col2:chararray,col3:chararray,col4:chararray,col5:chararray), FLATTEN(latest.col1) AS col6:chararray;
};
DUMP B;
Output:
(key1,111,1,12/11/2016,10,key3)
(key2,111,1,12/11/2016,10,key3)
(key3,111,1,12/11/2016,10,key3)
(key4,222,2,12/22/2016,10,key5)
(key5,222,2,12/22/2016,10,key5)
(key6,333,3,12/30/2016,10,key6)
(key7,111,0,12/11/2016,10,key7)
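For comparison, here is a sketch of the original GROUP plus MAX approach with a declared schema. A likely reason for ERROR 2106 in the question is that the data was loaded without a schema, so $0 is a bytearray, and MAX treats bytearray input as numeric, which fails on values such as key1. This is an assumption based on the script shown, not a confirmed diagnosis. The sketch also assumes the header row is not part of test1.csv.

```pig
-- Declaring col1 as chararray lets MAX compare the keys as strings.
A = LOAD 'test1.csv' USING PigStorage(',')
    AS (col1:chararray, col2:chararray, col3:chararray, col4:chararray, col5:chararray);

B = GROUP A BY (col2, col3, col4);

-- FLATTEN(A) emits one row per original record; MAX(A.col1) is repeated on each of them.
C = FOREACH B GENERATE FLATTEN(A), MAX(A.col1) AS col6;

DUMP C;
```

Since any one value of col1 is acceptable, MAX is just a convenient way to pick one; MIN would serve equally well.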

How to do a mathematical calculation on a Stitch/Over column in Pig

I am trying to calculate YoY growth on my raw data. Using Stitch with Over ('lead'), I am able to get last year's data alongside the current year's data, but I am not able to do any calculation on the column returned by the Stitch/Over clause. Below is what I have tried so far:
grunt> data = LOAD 'loan_pig' USING org.apache.hive.hcatalog.pig.HCatLoader();
grunt> grp1 = group data by issue_yr;
grunt> tot_loan = foreach grp1{ cnt = COUNT(data.id); generate FLATTEN(group) as issue_yr,cnt as ln_cnt;};
grunt> grp2 = group tot_loan all;
grunt> loan_yr = foreach grp2{ srt = order tot_loan by issue_yr desc; generate FLATTEN(Stitch(srt, Over(srt.ln_cnt,'lead',0,1,1,0)));};
grunt> final = foreach loan_yr generate issue_yr,ln_cnt,$2;
grunt> describe final;
When I run describe on final, it shows:
final: {stitched::issue_yr: int,stitched::ln_cnt: long,NULL}
That is, the 'lead' value column is typed NULL.
When I try to do any mathematical calculation on this column, it throws the error below:
grunt> final1 = foreach loan_yr generate issue_yr,ln_cnt,$2 as pr_yr;
grunt> fn = foreach final1 generate issue_yr,ln_cnt-pr_yr;
2016-06-24 11:23:42,118 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1052: Cannot cast bytearray to long
Can anyone please tell me whether there is a way to do calculations with columns returned from Stitch/Over? Is that even possible in Pig?
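One possible direction, offered as an assumption rather than a confirmed fix: the NULL in the describe output suggests that Over is not declaring a return type. As far as I recall, piggybank's Over accepts an optional return-type string in its constructor, which can be supplied through DEFINE. Please check the Over Javadoc for your Pig version before relying on this. The alias OverLong, the jar path, and the 0L default are illustrative choices, not part of the original script.

```pig
-- Assumption: piggybank is available; the jar path below is hypothetical.
REGISTER piggybank.jar;
DEFINE Stitch   org.apache.pig.piggybank.evaluation.Stitch;
-- Assumption: this Over build accepts a return-type string in its constructor.
DEFINE OverLong org.apache.pig.piggybank.evaluation.Over('long');

data = LOAD 'loan_pig' USING org.apache.hive.hcatalog.pig.HCatLoader();
grp1 = GROUP data BY issue_yr;
tot_loan = FOREACH grp1 GENERATE FLATTEN(group) AS issue_yr, COUNT(data.id) AS ln_cnt;
grp2 = GROUP tot_loan ALL;

loan_yr = FOREACH grp2 {
    srt = ORDER tot_loan BY issue_yr DESC;
    -- Same 'lead' call as in the question; the default is written as 0L so that it
    -- matches the declared long return type.
    GENERATE FLATTEN(Stitch(srt, OverLong(srt.ln_cnt, 'lead', 0, 1, 1, 0L)));
};

final1 = FOREACH loan_yr GENERATE issue_yr, ln_cnt, $2 AS pr_yr;
fn = FOREACH final1 GENERATE issue_yr, ln_cnt - pr_yr AS diff;
DESCRIBE fn;
```

If the assumption holds, describe should report a long for the stitched column instead of NULL, and the subtraction should pass type checking.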

How to divide numbers from different tables in pig

I am trying to join two tables and divide a number from one table by a number from the other. I have tried doing the division directly on the joined relation and also after generating a new relation with the same values, but I get the same error both times, which is extra confusing to me.
--get the data
lines = LOAD '/historicaldata.csv' USING PigStorage(' ') AS (ticker:chararray, date:long, open:long, high:long, low:long, close:long, volume:long);
--limit it between the dates we want
specDates = FILTER lines BY (date<=20000103 and date>=19900101);
--group by ticker symbol
companies = GROUP specDates BY ticker;
--sort DESC and get the top to get the ending date
sorted_end = FOREACH companies {
sorted1 = ORDER specDates BY date DESC;
endDate = LIMIT sorted1 1;
GENERATE endDate.ticker AS ticker, endDate.open AS open, endDate.close AS close;
}
--sort ASC and get the top to get the starting date
sorted_begin = FOREACH companies {
sorted2 = ORDER specDates BY date ASC;
startDate = LIMIT sorted2 1;
GENERATE startDate.ticker AS ticker, startDate.open AS open, startDate.close AS close;
}
joined = JOIN sorted_end BY ticker, sorted_begin BY ticker;
final = FOREACH joined GENERATE sorted_end::ticker as ticker, sorted_begin::open as open, sorted_end::close as close;
final2 = FOREACH final GENERATE ticker as ticker, (float)(close/open) as growth_factor;
The error I keep getting is:
(Name: Divide Type: null Uid: null)incompatible types in Divide Operator left hand side:bag :tuple(close:float) right hand side:bag :tuple(open:float)
Both are floats so I am not sure why they are "incompatible types" other than that they come from different bags, but adding them to "final" and trying to do it from there doesn't work.
The data is in the form:
AA,20140131,11.60,11.80,11.45,11.48,33014100
AA,20140130,12.05,12.07,11.83,11.92,23223500
AA,20140129,11.64,12.23,11.58,11.96,44433000
Every entry includes all columns, and the values are well-formatted, non-zero numbers.
Based on your query, I created a dummy table on my system and generated the result. I found no issue, and the division completed successfully. Please find below some sample queries that I ran in Pig:
A = LOAD '/home/training/716391/pig/pigdata.csv' USING PigStorage(',') as (ID:INT, name:CHARARRAY, GPC:FLOAT);
B = LOAD '/home/training/716391/pig/pigdata2.csv' USING PigStorage(',') as (ID:INT, name:CHARARRAY, GPC:FLOAT);
C = join A by ID, B by ID;
D = FOREACH C generate A::ID as IDA, A::name as NAMEA, A::GPC as GPCA, B::ID as IDB, B::name as NAMEB, B::GPC as GPCB;
E = FOREACH D GENERATE IDA, (FLOAT)(GPCA/GPCB) AS VALUE;
Can you please confirm that the divisor in your case is never NULL or 0?
Could you please share the load statements for sorted_end and sorted_begin?
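The error message in the question reports both operands as bags, which suggests that projections such as endDate.open inside the nested FOREACH produce one-element bags rather than plain values. A minimal sketch of one way around this is to FLATTEN the projected fields so the relations carry scalar columns. Several details here are assumptions: a comma delimiter and float prices (based on the sample rows and the error message), and trade_date as a renamed date field.

```pig
lines = LOAD '/historicaldata.csv' USING PigStorage(',')
        AS (ticker:chararray, trade_date:long, open:float, high:float,
            low:float, close:float, volume:long);
specDates = FILTER lines BY (trade_date <= 20000103 AND trade_date >= 19900101);
companies = GROUP specDates BY ticker;

-- Latest row per ticker; FLATTEN turns the one-row bag into plain columns.
sorted_end = FOREACH companies {
    sorted1 = ORDER specDates BY trade_date DESC;
    endDate = LIMIT sorted1 1;
    GENERATE FLATTEN(endDate.(ticker, open, close))
             AS (ticker:chararray, open:float, close:float);
};

-- Earliest row per ticker, flattened the same way.
sorted_begin = FOREACH companies {
    sorted2 = ORDER specDates BY trade_date ASC;
    startDate = LIMIT sorted2 1;
    GENERATE FLATTEN(startDate.(ticker, open, close))
             AS (ticker:chararray, open:float, close:float);
};

joined = JOIN sorted_end BY ticker, sorted_begin BY ticker;

-- Both operands are now floats, so the division type-checks.
final2 = FOREACH joined GENERATE sorted_end::ticker AS ticker,
         sorted_end::close / sorted_begin::open AS growth_factor;
DUMP final2;
```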

Pig - Get Top n and group rest in 'other'

I have data that I have grouped and aggregated; it looks like this:
Date Country Browser Count
---- ------- ------- -----
2015-07-11,US,Chrome,13
2015-07-11,US,Opera Mini,1
2015-07-11,US,Firefox,2
2015-07-11,US,IE,1
2015-07-11,US,Safari,1
...
2015-07-11,UK,Chrome Mobile,1026
2015-07-11,UK,IE,455
2015-07-11,UK,Mobile Safari,4782
2015-07-11,UK,Mobile Firefox,40
...
2015-07-11,DE,Android browser,1316
2015-07-11,DE,Opera Mini,3
2015-07-11,DE,PS4 Web browser,11
I want to get the top n browsers (by count) per country and aggregate the rest under 'Other'. I looked into Pig's built-in TOP function, but how would I group the remainder into 'Other'? The result I want, for example with n = 2, is:
2015-07-11,US,Chrome,13
2015-07-11,US,Firefox,2
2015-07-11,US,Other,3
What would be the best way to go about this?
OK, this is a nice requirement.
I am simply using your input in the LOAD statement of the Pig script.
Input:
2015-07-11,US,Chrome,13
2015-07-11,US,Opera Mini,1
2015-07-11,US,Firefox,2
2015-07-11,US,IE,1
2015-07-11,US,Safari,1
2015-07-11,UK,Chrome Mobile,1026
2015-07-11,UK,IE,455
2015-07-11,UK,Mobile Safari,4782
2015-07-11,UK,Mobile Firefox,40
2015-07-11,DE,Android browser,1316
2015-07-11,DE,Opera Mini,3
2015-07-11,DE,PS4 Web browser,11
2015-07-11,US,Chrome,13
2015-07-11,US,Firefox,2
2015-07-11,US,Other,3
Below is the code for this.
You can pass a value for the n parameter to the Pig script. In the code below I have hardcoded n = 2 in the LIMIT statement itself.
records = LOAD '/user/cloudera/inputfiles/entries.txt' USING PigStorage(',') as (dt:chararray,country:chararray,browser:chararray,count:int);
records_each = FOREACH(GROUP records BY (dt,country,browser)) GENERATE flatten(group) AS (dt,country,browser), MAX(records.count) as counts;
records_grp_order = ORDER records_each BY dt ASC , country ASC , counts DESC;
records_grp = GROUP records_grp_order BY (dt, country);
rec_each = FOREACH records_grp {
    top_2_recs = LIMIT records_grp_order 2;
    generate MAX(top_2_recs.dt) AS temp_dt, MAX(top_2_recs.country) AS temp_country, flatten(top_2_recs.browser) AS temp_browser;
};
rec_join = JOIN records_each BY (dt,country,browser) left outer , rec_each BY (temp_dt,temp_country,temp_browser);
rec_join_each = FOREACH rec_join generate dt,country, (temp_browser is not null ? browser : 'OTHERS') AS browser, counts AS counts;
rec_final_grp = GROUP rec_join_each BY (dt,country,browser);
final_output = FOREACH rec_final_grp generate flatten(group) AS (dt,country,browser), SUM(rec_join_each.counts) AS total_counts;
sorted_output = ORDER final_output BY dt ASC , country ASC, total_counts DESC;
dump sorted_output;
Output:
(2015-07-11,DE,Android browser,1316)
(2015-07-11,DE,PS4 Web browser,11)
(2015-07-11,DE,OTHERS,3)
(2015-07-11,UK,Mobile Safari,4782)
(2015-07-11,UK,Chrome Mobile,1026)
(2015-07-11,UK,OTHERS,495)
(2015-07-11,US,Chrome,13)
(2015-07-11,US,OTHERS,3)
(2015-07-11,US,Firefox,2)
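A variant sketch that takes n from the command line and sorts inside the nested FOREACH, so that the top-n selection does not depend on the order in which rows arrive in each group. It assumes the input already has one row per (dt, country, browser), as the question states. The aliases, the field name cnt, and the script name are made up for illustration.

```pig
-- Run with, for example: pig -param n=2 top_n_other.pig   (script name is hypothetical)
records = LOAD '/user/cloudera/inputfiles/entries.txt' USING PigStorage(',')
          AS (dt:chararray, country:chararray, browser:chararray, cnt:int);

-- Sort the browsers inside each (dt, country) group and keep the first $n.
top_n = FOREACH (GROUP records BY (dt, country)) {
    sorted  = ORDER records BY cnt DESC;
    first_n = LIMIT sorted $n;
    GENERATE FLATTEN(first_n.(dt, country, browser))
             AS (t_dt:chararray, t_country:chararray, t_browser:chararray);
};

-- Rows that have no match in top_n are relabelled 'Other'.
joined = JOIN records BY (dt, country, browser) LEFT OUTER,
              top_n BY (t_dt, t_country, t_browser);
labelled = FOREACH joined GENERATE records::dt AS dt, records::country AS country,
           ((top_n::t_browser IS NOT NULL) ? records::browser : 'Other') AS browser,
           records::cnt AS cnt;

-- Sum the counts so that all 'Other' rows collapse into one row per (dt, country).
result = FOREACH (GROUP labelled BY (dt, country, browser))
         GENERATE FLATTEN(group) AS (dt, country, browser), SUM(labelled.cnt) AS total;
DUMP result;
```

If two browsers tie at the n-th position, LIMIT keeps whichever the sort happens to place first, so add a secondary sort key if the choice matters.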

Error when using MAX in Apache Pig (Hadoop)

I am trying to calculate maximum values for different groups in a relation in Pig. The relation has three columns patientid, featureid and featurevalue (all int).
I group the relation by featureid and want to calculate the max feature value of each group. Here's the code:
grpd = GROUP features BY featureid;
DUMP grpd;
temp = FOREACH grpd GENERATE $0 as featureid, MAX($1.featurevalue) as val;
It gives me an "Invalid scalar projection: grpd" exception. I read on different forums that functions such as MAX take a bag as input, but when I dump grpd, it does show a bag. Here's a small part of the output from the dump:
(5662,{(22579,5662,1)})
(5663,{(28331,5663,1),(2624,5663,1)})
(5664,{(27591,5664,1)})
(5665,{(30217,5665,1),(31526,5665,1)})
(5666,{(27783,5666,1),(30983,5666,1),(32424,5666,1),(28064,5666,1),(28932,5666,1)})
(5667,{(31257,5667,1),(27281,5667,1)})
(5669,{(31041,5669,1)})
What's the issue?
The issue was with column addressing. Here's the correct working code:
grpd = GROUP features BY featureid;
temp = FOREACH grpd GENERATE group as featureid, MAX(features.featurevalue) as val;
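A small sketch showing where the names in the working code come from: after GROUP, the key field is called group and the bag carries the name of the grouped relation, which DESCRIBE makes visible. The file name and delimiter are hypothetical; the schema follows the question.

```pig
-- Hypothetical path and delimiter; the three int columns are taken from the question.
features = LOAD 'features.csv' USING PigStorage(',')
           AS (patientid:int, featureid:int, featurevalue:int);

grpd = GROUP features BY featureid;

-- Prints the schema of grpd: a key field named 'group' and a bag named 'features'.
DESCRIBE grpd;

-- Project featurevalue out of the 'features' bag and take its maximum per group.
temp = FOREACH grpd GENERATE group AS featureid, MAX(features.featurevalue) AS val;
DUMP temp;
```

Running DESCRIBE before writing the FOREACH is a quick way to confirm which field names are available.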
