Calculate Average using PIG - hadoop

I am new to Pig and want to calculate the average of my one-column data, which looks like this:
0
10.1
20.1
30
40
50
60
70
80.1
I wrote this Pig script:
dividends = load 'myfile.txt' as (A);
dump dividends;
grouped = group dividends by A;
avg = foreach grouped generate AVG(grouped.A);
dump avg;
It parses the data as:
(0)
(10.1)
(20.1)
(30)
(40)
(50)
(60)
(70)
(80.1)
but gives this error for the average:
2013-03-04 15:10:58,289 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
<file try.pig, line 4, column 41> Invalid scalar projection: grouped
Details at logfile: /Users/PreetiGupta/Documents/CMPS290S/project/pig_1362438645642.log
Any idea?

The built-in AVG function takes a bag as input. In your GROUP statement, you are currently grouping elements by the value of A, but what you really want is to group all the elements into one bag.
Pig's GROUP ALL is what you want to use:
dividends = load 'myfile.txt' as (A);
dump dividends;
grouped = group dividends all;
avg = foreach grouped generate AVG(dividends.A);
dump avg;

The following will work for calculating the average:
dividends = load 'myfile.txt' as (A);
grouped = GROUP dividends all;
avg = foreach grouped generate AVG(dividends);
dump avg;

You have to use the original relation name instead of the group variable: in the FOREACH line, use AVG(dividends.A) instead of AVG(grouped.A). Note that because this script groups by A, it produces one average per distinct value of A; for a single overall average, use GROUP dividends ALL as in the answers above. Here is the corrected script:
dividends = load 'myfile.txt' as (A);
dump dividends;
grouped = group dividends by A;
avg = foreach grouped generate AVG(dividends.A);
dump avg;

Related

Compare tuples on basis of a field in pig

(ABC,****,tool1,12)
(ABC,****,tool1,10)
(ABC,****,tool1,13)
(ABC,****,tool2,101)
(ABC,****,tool3,11)
The above is my input data in Pig.
The schema is: Username, ip, tool, duration.
I want to add up the durations for the same tool.
Output
(ABC,****,tool1,35)
(ABC,****,tool2,101)
(ABC,****,tool3,11)
Use GROUP BY and then SUM on the duration:
A = LOAD 'data.csv' USING PigStorage(',') AS (Username:chararray,ip:chararray,tool:chararray,duration:int);
B = GROUP A BY (Username,ip,tool);
C = FOREACH B GENERATE FLATTEN(group) AS (Username,ip,tool),SUM(A.duration);
DUMP C;

I want to replace NULL values by AVG in PIG

Here is my Code :
claims = LOAD 'Darshan/automobile_insurance_claims.csv' USING PigStorage(',') AS (claim_id:chararray, policy_master_id:chararray, registration_no:chararray, engine_no:chararray, chassis_no:chararray, customer_id:int, Col6:int,first_name:chararray, last_name:chararray,street:chararray,address:chararray, city:chararray, zip:long,gender:chararray, claim_date:chararray, garage_city:chararray, bill_no:long, claim_amount:double, garage_name:chararray,claim_status:chararray);
grp_all = group claims all;
avg = foreach grp_all generate AVG(claims.Col6);
grp = group claims by claim_id;
m = foreach grp generate group, ((Col6 IS NULL) ? avg : Col6);
Results: dump avg; #33.45
Showing the following error while replacing NULL values in Col6 (i.e. Age):
Caused by:
Invalid scalar projection: avg : A column needs to be projected from a relation for it to be used as a scalar
at org.apache.pig.parser.LogicalPlanGenerator.var_expr(LogicalPlanGenerator.java:10947)
at org.apache.pig.parser.LogicalPlanGenerator.expr(LogicalPlanGenerator.java:10164)
at org.apache.pig.parser.LogicalPlanGenerator.bin_expr(LogicalPlanGenerator.java:11992)
at org.apache.pig.parser.LogicalPlanGenerator.projectable_expr(LogicalPlanGenerator.java:11104)
at org.apache.pig.parser.LogicalPlanGenerator.var_expr(LogicalPlanGenerator.java:10815)
at org.apache.pig.parser.LogicalPlanGenerator.expr(LogicalPlanGenerator.java:10164)
at org.apache.pig.parser.LogicalPlanGenerator.flatten_generated_item(LogicalPlanGenerator.java:7493)
at org.apache.pig.parser.LogicalPlanGenerator.generate_clause(LogicalPlanGenerator.java:17595)
at org.apache.pig.parser.LogicalPlanGenerator.foreach_plan(LogicalPlanGenerator.java:15987)
at org.apache.pig.parser.LogicalPlanGenerator.foreach_clause(LogicalPlanGenerator.java:15854)
at org.apache.pig.parser.LogicalPlanGenerator.op_clause(LogicalPlanGenerator.java:1933)
at org.apache.pig.parser.LogicalPlanGenerator.general_statement(LogicalPlanGenerator.java:1102)
at org.apache.pig.parser.LogicalPlanGenerator.statement(LogicalPlanGenerator.java:560)
at org.apache.pig.parser.LogicalPlanGenerator.query(LogicalPlanGenerator.java:421)
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:188)
... 17 more
2016-08-08 05:51:07,297 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
Invalid scalar projection: avg : A column needs to be projected from a relation for it to be used as a scalar.
Line 11 is: m = foreach grp generate group, ((Col6 IS NULL) ? avg : Col6);
Darshan, this doesn't look like an issue with the logic you are trying; you can replace NULLs with the AVG. The issue is the projection of a column.
To solve this, revisit your code and you will find that the AVG is in one relation while you are accessing it from another.
In your code, avg is a relation, not a column. After your first GROUP statement, where you generate the AVG, generate the other columns as well; that way you will have the average and Col6 in the same relation.
Load your data
Group your data as per your need
Calculate AVG and generate the other columns
If you want, you can apply the replace logic in the same FOREACH.
Please let me know if you still face any issue.
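A minimal sketch of that approach, put together from the code in the question (untested; it relies on Pig's scalar projection, available since Pig 0.8, to read the single-tuple relation avg inside a FOREACH over claims):

```pig
-- schema copied from the question
claims = LOAD 'Darshan/automobile_insurance_claims.csv' USING PigStorage(',') AS
    (claim_id:chararray, policy_master_id:chararray, registration_no:chararray,
     engine_no:chararray, chassis_no:chararray, customer_id:int, Col6:int,
     first_name:chararray, last_name:chararray, street:chararray, address:chararray,
     city:chararray, zip:long, gender:chararray, claim_date:chararray,
     garage_city:chararray, bill_no:long, claim_amount:double,
     garage_name:chararray, claim_status:chararray);

-- one-tuple relation holding the overall average of Col6
grp_all = GROUP claims ALL;
avg = FOREACH grp_all GENERATE AVG(claims.Col6) AS avg_col6;

-- avg.avg_col6 is projected as a scalar, so the average and Col6
-- can be used together in one FOREACH over the original relation
filled = FOREACH claims GENERATE claim_id,
    ((Col6 IS NULL) ? avg.avg_col6 : (double)Col6) AS Col6;
```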
You are getting the error because avg is a relation, and you need to reference a column in the relation avg. Correct your last Pig statement to refer to the first column of avg, like this:
m = foreach grp generate group, ((claims.Col6 IS NULL) ? (double)avg.$0 : claims.Col6);
Alternatively, you can name the column and refer to it by name:
avg = foreach grp_all generate AVG(claims.Col6) AS AVG_Col6;
grp = group claims by claim_id;
m = foreach grp generate group, ((claims.Col6 IS NULL) ? (double)avg.AVG_Col6 : claims.Col6);
Here is the final code for my query:
claims = LOAD 'Darshan/automobile_insurance_claims.csv' USING PigStorage(',') AS
(claim_id:chararray, policy_master_id:chararray, registration_no:chararray,
engine_no:chararray, chassis_no:chararray, customer_id:int, Col6:int,
first_name:chararray, last_name:chararray,street:chararray,address:chararray,
city:chararray, zip:long,gender:chararray, claim_date:chararray,
garage_city:chararray, bill_no:long, claim_amount:double,
garage_name:chararray,claim_status:chararray);
grp_all = group claims all;
avg = foreach grp_all generate AVG(claims.Col6);
grp = group claims by claim_id;
result = foreach grp {
    val = foreach claims generate ((Col6 IS NULL) ? avg.$0 : Col6);
    generate group, val;
};
Here is the link to dataset automobile_insurance_claims.csv

How to find average of a column and average of subtraction of two columns in Pig?

I am new to scripting in Pig Latin. I am stuck writing a Pig script that finds the average of a column and also the average of the difference between two columns.
I am reading the data from a csv file having starttime and endtime columns as below:
"starttime","endtime",
"23","46",
"32","49",
"54","59"
The code that I have tried so far is as below :
file = LOAD '/project/timestamp.csv' Using PigStorage(',') AS (st:int, et:int);
start_ts = FOREACH file GENERATE st;
grouped = group start_ts by st;
ILLUSTRATE grouped;
The ILLUSTRATE output I am getting is shown below, and I am not able to apply the AVG function:
-------------------------------------------------------------------------------------
| grouped | group:int | file:bag{:tuple(st:int,et:int)} |
-------------------------------------------------------------------------------------
| | | {(, ), (, )} |
-------------------------------------------------------------------------------------
Can anybody please help me get the average of starttime, which would be (23 + 32 + 54)/3?
Also, some ideas on how to compute (endtime - starttime)/no. of records (i.e. 3 in this case) would be a great help to get me started.
Thanks.
First, ensure that you are loading the data correctly. It looks like you have double quotes (") around your data. Load the fields as chararray, strip the double quotes, cast to int, and finally apply the AVG function to starttime. For the average of endtime - starttime, subtract the two fields per record first, then apply AVG.
A = LOAD '/project/timestamp.csv' Using PigStorage(',') AS (st:chararray, et:chararray);
B = FOREACH A GENERATE (int)REPLACE(st,'\\"','') as st,(int)REPLACE(et,'\\"','') as et;
-- compute the difference per record first; AVG cannot subtract two bags
B2 = FOREACH B GENERATE st, et, (et - st) as diff;
C = GROUP B2 ALL;
D = FOREACH C GENERATE AVG(B2.st), AVG(B2.diff);
Try this:
file = LOAD '/project/timestamp.csv' Using PigStorage(',') AS (st:int, et:int);
grouped = group file by 1;
avg = foreach grouped generate AVG(file.st);
Thanks to inquisitive_mind; my answer is largely based on his with a little tweak.
This is only for the average of one column.
file = LOAD '/project/timestamp.csv' Using PigStorage(',') AS (st:chararray, et:chararray);
cols = FOREACH file GENERATE (int)REPLACE(st, '"', '') as st, (int)REPLACE(et, '"', '') as et;
grp_cols = GROUP cols all;
avg = FOREACH grp_cols GENERATE AVG(cols.st);
DUMP avg;

Error when using MAX in Apache Pig (Hadoop)

I am trying to calculate the maximum values for different groups in a relation in Pig. The relation has three columns: patientid, featureid and featurevalue (all int).
I group the relation by featureid and want to calculate the max feature value of each group; here's the code:
grpd = GROUP features BY featureid;
DUMP grpd;
temp = FOREACH grpd GENERATE $0 as featureid, MAX($1.featurevalue) as val;
It's giving me an Invalid scalar projection: grpd exception. I read on different forums that MAX takes a "bag" as input for such functions, but when I dump grpd, it does show a bag format. Here's a small part of the output from the dump:
(5662,{(22579,5662,1)})
(5663,{(28331,5663,1),(2624,5663,1)})
(5664,{(27591,5664,1)})
(5665,{(30217,5665,1),(31526,5665,1)})
(5666,{(27783,5666,1),(30983,5666,1),(32424,5666,1),(28064,5666,1),(28932,5666,1)})
(5667,{(31257,5667,1),(27281,5667,1)})
(5669,{(31041,5669,1)})
What's the issue?
The issue was with column addressing; here's the correct working code:
grpd = GROUP features BY featureid;
temp = FOREACH grpd GENERATE group as featureid, MAX(features.featurevalue) as val;

How can I select record with minimum value in pig latin

I have timestamped samples and I'm processing them using Pig. I want to find, for each day, the minimum value of the sample and the time of that minimum. So I need to select the record that contains the sample with the minimum value.
In the following for simplicity I'll represent time in two fields, the first is the day and the second the "time" within the day.
1,1,4.5
1,2,3.4
1,5,5.6
To find the minimum the following works:
samples = LOAD 'testdata' USING PigStorage(',') AS (day:int, time:int, samp:float);
g = GROUP samples BY day;
dailyminima = FOREACH g GENERATE group as day, MIN(samples.samp) as samp;
But then I've lost the exact time at which the minimum happened. I hoped I could use nested expressions. I tried the following:
dailyminima = FOREACH g {
    minsample = MIN(samples.samp);
    mintuple = FILTER samples BY samp == minsample;
    GENERATE group as day, mintuple.time, mintuple.samp;
};
But with that I receive the error message:
2012-11-12 12:08:40,458 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000:
<line 5, column 29> Invalid field reference. Referenced field [samp] does not exist in schema: .
Details at logfile: /home/hadoop/pig_1352722092997.log
If I set minsample to a constant, it doesn't complain:
dailyminima = FOREACH g {
    minsample = 3.4F;
    mintuple = FILTER samples BY samp == minsample;
    GENERATE group as day, mintuple.time, mintuple.samp;
};
And indeed produces a sensible result:
(1,{(2)},{(3.4)})
While writing this I thought of using a separate JOIN:
dailyminima = FOREACH g GENERATE group as day, MIN(samples.samp) as minsamp;
dailyminima = JOIN samples BY (day, samp), dailyminima BY (day, minsamp);
That works, but (in the real case) it results in a join over two large data sets instead of a search through a single day's values, which doesn't seem healthy.
In the real case I actually want to find max and min and associated times. I hoped that the nested expression approach would allow me to do both at once.
Suggestions of ways to approach this would be appreciated.
Thanks to alexeipab for the link to another SO question.
One working solution (finding both min and max and the associated time) is:
dailyminima = FOREACH g {
    minsamples = ORDER samples BY samp;
    minsample = LIMIT minsamples 1;
    maxsamples = ORDER samples BY samp DESC;
    maxsample = LIMIT maxsamples 1;
    GENERATE group as day, FLATTEN(minsample), FLATTEN(maxsample);
};
Another way to do it, which has the advantage that it doesn't sort the entire relation, and only keeps the (potential) min in memory, is to use the PiggyBank ExtremalTupleByNthField. This UDF implements Accumulator and Algebraic and is pretty efficient.
Your code would look something like this:
DEFINE TupleByNthField org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField('3', 'min');
samples = LOAD 'testdata' USING PigStorage(',') AS (day:int, time:int, samp:float);
g = GROUP samples BY day;
bagged = FOREACH g GENERATE TupleByNthField(samples);
flattened = FOREACH bagged GENERATE FLATTEN($0);
min_result = FOREACH flattened GENERATE $1 .. ;
Keep in mind that the choice of the samp field as the sort key is set in the DEFINE statement by passing '3' (the 1-based index of samp) as the first parameter.
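Since the original question asked for both the min and the max along with their times, the same UDF can be defined twice, once with 'min' and once with 'max' (a sketch along the same lines; the aliases minByNthField and maxByNthField are mine, untested):

```pig
DEFINE minByNthField org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField('3', 'min');
DEFINE maxByNthField org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField('3', 'max');
samples = LOAD 'testdata' USING PigStorage(',') AS (day:int, time:int, samp:float);
g = GROUP samples BY day;
-- one tuple per day carrying both extremal records
extremes = FOREACH g GENERATE group AS day,
    FLATTEN(minByNthField(samples)), FLATTEN(maxByNthField(samples));
```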
