I am using the ternary operator to conditionally include values in a SUM() operation. Here is how I am doing it:
GROUPED = GROUP ALL_MERGED BY (fld1, fld2, fld3);
REPORT_DATA = FOREACH GROUPED
{ GENERATE group,
SUM(GROUPED.fld4 == 'S' ? GROUPED.fld5 : 0) AS sum1,
SUM(GROUPED.fld4 == 'S' ? GROUPED.fld5 : (GROUPED.fld5 * -1)) AS sum2;
}
The schema for ALL_MERGED is:
{ALL_MERGED: {fld1:chararray, fld2:chararray, fld3:chararray, fld4:chararray, fld5:int}}
When I execute this, it gives me the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: SUM in {group: (fld1:chararray, fld2:chararray, fld3:chararray), ALL_MERGED: {fld1:chararray, fld2:chararray, fld3:chararray, fld4:chararray: fld5:int}}
What am I doing wrong here?
SUM is a UDF which takes a bag as input. What you are doing has a number of problems, and I suspect it would help you to review a good reference on Pig. I recommend Programming Pig, available for free online. To begin with, GROUPED has two fields: a tuple called group and a bag called ALL_MERGED, which is what the error message is trying to tell you. (I say "trying" because Pig error messages are often quite cryptic.)
Also, you cannot pass expressions to UDFs like you wish to do. Instead you will have to GENERATE these fields and then pass them afterward. Try this:
ALL_MERGED_2 =
FOREACH ALL_MERGED
GENERATE
fld1 .. fld5,
((fld4 == 'S') ? fld5 : 0) AS sum_me1,
((fld4 == 'S') ? fld5 : fld5*-1) AS sum_me2;
GROUPED = GROUP ALL_MERGED_2 BY (fld1, fld2, fld3);
DATA =
FOREACH GROUPED
GENERATE
group,
SUM(ALL_MERGED_2.sum_me1) AS sum1,
SUM(ALL_MERGED_2.sum_me2) AS sum2;
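For readers who want to check the expected numbers outside of Pig, here is a rough Python model of the same data flow: compute the two conditional values per row first, then sum them per (fld1, fld2, fld3) group. The sample rows and the function name conditional_sums are made up for illustration; this is only a sketch of the logic, not a replacement for the Pig script.

```python
from collections import defaultdict

# Hypothetical sample rows shaped like ALL_MERGED: (fld1, fld2, fld3, fld4, fld5)
rows = [
    ("a", "b", "c", "S", 10),
    ("a", "b", "c", "T", 4),
    ("x", "y", "z", "S", 7),
]

def conditional_sums(rows):
    """Return {(fld1, fld2, fld3): (sum1, sum2)} for the given rows."""
    sums = defaultdict(lambda: [0, 0])
    for fld1, fld2, fld3, fld4, fld5 in rows:
        # Same per-row expressions as sum_me1 and sum_me2 in the Pig script
        sum_me1 = fld5 if fld4 == "S" else 0
        sum_me2 = fld5 if fld4 == "S" else -fld5
        totals = sums[(fld1, fld2, fld3)]
        totals[0] += sum_me1
        totals[1] += sum_me2
    return {key: tuple(value) for key, value in sums.items()}

report = conditional_sums(rows)
```

The key point carried over from the Pig version is the order of operations: the conditional is evaluated per row, and only the resulting column is aggregated.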
I need the following output.
NE 50
SE 80
I am using pig query to count the country based on zone.
c1 = group country by zone;
c2 = foreach c1 generate COUNT(country.zone), (
case country.zone
when 1 then 'NE'
else 'SE'
);
But I am not able to achieve my output. I am getting error like the following:
2016-03-30 13:57:16,569 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1039: (Name: Equal Type: null Uid: null)incompatible types in Equal Operator left hand side:bag :tuple(zone:int) right hand side:int
Details at logfile: /home/cloudera/pig_1459370643493.log
However, the following query works:
c2 = foreach c1 generate group, COUNT(country.zone);
It gives the following output:
(1,50)
(2,80)
How can I output NE instead of 1 and SE instead of 2? I thought using CASE would help, but I am getting an error. Can anyone help?
EDIT
Pig 0.12.0 and later support the CASE expression.
c2 = FOREACH c1 GENERATE (CASE group
WHEN 1 THEN 'NE'
WHEN 2 THEN 'SE'
WHEN 3 THEN 'AE'
ELSE 'VR' END), COUNT(country.zone);
Older Pig Versions
Older Pig versions do not have a CASE expression. Your best option is to use a UDF. If the group values are limited to only two, you can use the bincond operator to check the value:
c2 = foreach c1 generate (group == 1 ? 'NE' : 'SE'), COUNT(country.zone);
If you have more than two values, nest the bincond operators:
c2 = FOREACH c1 GENERATE (group == 1 ? 'NE' :
(group == 2 ? 'SE' :
(group == 3 ? 'AE' : 'VR'))), COUNT(country.zone);
In Pig 0.12 and later, you can use a CASE expression.
In your case, country.zone is a bag, and you can't compare a bag to an int. Compare the group key instead.
With the answer posted above, I was getting this error:
mismatched input ')' expecting END.
So here is the working code:
c2 = FOREACH c1 GENERATE (CASE group
WHEN 1 THEN 'NE'
WHEN 2 THEN 'SE'
WHEN 3 THEN 'AE'
ELSE 'VR' END), COUNT(country.zone);
Output:
(NE, 50)
(SE, 80)
(AE, 30)
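As a sanity check on the mapping, the CASE / nested bincond logic can be modelled in plain Python: count rows per zone code, then translate each code to a label with a default. This is only an illustrative sketch; ZONE_NAMES and zone_counts are hypothetical names, and the input list is fabricated to match the counts shown above.

```python
from collections import Counter

# Zone code -> label, mirroring WHEN 1 THEN 'NE' ... ELSE 'VR' END
ZONE_NAMES = {1: "NE", 2: "SE", 3: "AE"}

def zone_counts(zones):
    """Group by zone code, count, and replace the code with its label."""
    counts = Counter(zones)
    # .get(..., "VR") plays the role of the ELSE branch
    return [(ZONE_NAMES.get(zone, "VR"), n) for zone, n in sorted(counts.items())]

# Hypothetical data: 50 rows in zone 1, 80 in zone 2, 30 in zone 3
zones = [1] * 50 + [2] * 80 + [3] * 30
result = zone_counts(zones)
```

Note that the label is derived from the group key (a single value per group), not from the bag of rows, which is the same reason the Pig version has to test group rather than country.zone.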
Does Pig script support if-else statements?
Here is what I want to do:
if($NAME=='Joey')
Do something
else
Do something
Is that doable?
Thanks
It's called the "bincond" operator.
Expressions like this are also valid:
(Price > 75 ? 'High':'Low')
For handling null records:
((Name is null or IsEmpty(Name)) ? {('unknown')} : Name)
Use them in a foreach statement, with an alias, alongside the other fields. For example:
A = load 'x/y/Price.csv' as (Name, Product, Price);
B = foreach A generate Name, Product, Price, (Price > 75 ? 'High':'Low') as Indicator;
dump B;
You can use the conditional operator. For example
(Name=='Joey'? 'Yes':'No')
If I understand correctly (I started Pig Latin yesterday), Pig doesn't have if-else or for statements; you have to use Python or Java for that. See here: http://chimera.labs.oreilly.com/books/1234000001811/ch09.html
I'm learning Apache Pig and have run into a problem I can't work out how to solve.
I have this relation (after doing a GROUP BY):
MLSET_1: {group: chararray,MLSET: {(key: chararray, text: chararray)}}
I'd like to GENERATE a key only when a certain pattern (PATTERN_A) appears in the text field AND another pattern (PATTERN_B) does not appear in the text field for that key.
I know that I can use MLSET.text to get a bag of all text values for a specific key, but then I still have the same issue: how do I filter on the items in that bag?
Here's an example:
(key_A,{(key_A,start),(key_A,stop),(key_A,unknown),(key_A,whatever)})
(key_B,{(key_B,stop),(key_B,whatever)})
(key_C,{(key_C,start),(key_C,stop),(key_C,whatever)})
I'd like to get the keys for lines where "start" appears and "unknown" does not appear. In this example I should get only key_C as a result.
Thanks in advance for your help !
Here's some code that might help you out. The solution is a nested foreach here:
C = FOREACH MLSET_1 {
    F1 = FILTER MLSET BY (text == PATTERN_A);
    F2 = FILTER MLSET BY (text == PATTERN_B);
    GENERATE group, COUNT(F1) AS cnt1, COUNT(F2) AS cnt2;
};
D = FILTER C BY (cnt1 > 0 AND cnt2 == 0);
You will probably have to adapt the comparison in the nested filters, for example by using MATCHES if the patterns are regular expressions.
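The intended rule (PATTERN_A present, PATTERN_B absent for the same key) can also be checked against the sample data with a small Python model. The function name keys_with_a_without_b is hypothetical and the patterns are treated as exact strings, which is an assumption; this is a sketch of the logic rather than Pig code.

```python
def keys_with_a_without_b(grouped, pattern_a, pattern_b):
    """Return keys whose bag contains pattern_a and does not contain pattern_b."""
    result = []
    for key, bag in grouped:
        # Equivalent of projecting MLSET.text for this group
        texts = [text for _, text in bag]
        if pattern_a in texts and pattern_b not in texts:
            result.append(key)
    return result

# The three grouped rows from the question
grouped = [
    ("key_A", [("key_A", "start"), ("key_A", "stop"),
               ("key_A", "unknown"), ("key_A", "whatever")]),
    ("key_B", [("key_B", "stop"), ("key_B", "whatever")]),
    ("key_C", [("key_C", "start"), ("key_C", "stop"), ("key_C", "whatever")]),
]

matches = keys_with_a_without_b(grouped, "start", "unknown")
```

key_A is rejected because "unknown" is present, and key_B because "start" is missing, so only key_C should remain.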
Here is another approach:
C = FOREACH MLSET_1 GENERATE $0,$1,BagToString(MLSET.(key,text));
D = FILTER C BY ($2 MATCHES '.*start.*') AND NOT($2 MATCHES '.*unknown.*');
E = FOREACH D GENERATE $0,$1;
DUMP E;
Output for the above input:
(key_C,{(key_C,start),(key_C,stop),(key_C,whatever)})
I am learning how to use Hadoop Pig.
Suppose I have an input file like this:
a,b,c,true
s,c,v,false
a,s,b,true
...
The last field is the one I need to count, so I want to know how many 'true' and 'false' values are in this file.
I tried:
records = LOAD 'test/input.csv' USING PigStorage(',');
boolean = foreach records generate $3;
groups = group boolean all;
Now I am stuck. I want to use:
count = foreach groups generate count('true');
to get the number of "true" values, but I always get the error:
2013-08-07 16:32:36,677 [main] ERROR org.apache.pig.tools.grunt.Grunt
- ERROR 1070: Could not resolve count using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /etc/pig/pig_1375911119028.log
Can anybody tell me where the problem is?
Two things. Firstly, count should actually be COUNT. Function names in Pig are case-sensitive, and this built-in is spelled in all caps.
Secondly, COUNT counts the number of tuples in a bag; it does not count occurrences of a particular value. Therefore, you should group by true/false, then COUNT:
boolean = FOREACH records GENERATE $3 AS trueORfalse ;
groups = GROUP boolean BY trueORfalse ;
counts = FOREACH groups GENERATE group AS trueORfalse, COUNT(boolean) ;
So now the output of a DUMP for counts will look something like:
(true, 2)
(false, 1)
If you want the counts of true and false in their own relations then you can FILTER the output of counts. However, it would probably be better to SPLIT boolean, then do two separate counts:
boolean = FOREACH records GENERATE $3 AS trueORfalse ;
SPLIT boolean INTO alltrue IF trueORfalse == 'true',
allfalse IF trueORfalse == 'false' ;
tcount = FOREACH (GROUP alltrue ALL) GENERATE COUNT(alltrue) ;
fcount = FOREACH (GROUP allfalse ALL) GENERATE COUNT(allfalse) ;
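If it helps to verify the expected numbers, the group-then-count step can be reproduced in Python on the three sample rows from the question. The function name count_last_field is hypothetical, and the input is inlined as a string rather than read from test/input.csv; treat this as a sketch for checking results, not as part of the Pig solution.

```python
import csv
import io
from collections import Counter

# The three complete sample rows from the question
sample = "a,b,c,true\ns,c,v,false\na,s,b,true\n"

def count_last_field(text):
    """Count how often each value appears in the fourth column ($3 in Pig)."""
    reader = csv.reader(io.StringIO(text))
    # Skip blank lines; index 3 corresponds to $3
    return Counter(row[3] for row in reader if row)

counts = count_last_field(sample)
```

Counter does the grouping and the counting in one pass, which is the same idea as GROUP ... BY followed by COUNT on each group's bag.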
I am trying to perform a calculation. I have a donations (d) table that contains the quantity needed (d.QtyNeeded), and I need to determine the number of items still needed by pulling the quantity filled (QtyFilled) from the donors table. Not every donation has a donor, so I need a conditional statement to handle nulls so the sum will work. When I try to compile, I get an error: Operator '-' cannot be applied to operands of type 'int' and 'bool'. I am not great at LINQ yet. What am I missing, or is there a better way?
QtyOpen = d.QtyNeeded - (from dv in db.Donors
select dv).All(v => v.QtyFilled == null)
? 0
: (from dv in db.Donations
select dv.QtyFilled).Sum()
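The problem is not the LINQ statement but operator precedence: subtraction binds more tightly than the conditional operator, so the compiler tries to subtract the bool returned by All(...) from d.QtyNeeded before it evaluates the ?: expression. Consider this example: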
int result = quantity - true ? 0 : otherValue;
This fails to compile for exactly the same reason: Operator '-' cannot be applied to operands of type 'int' and 'bool'. The solution is simply to wrap the conditional expression in parentheses:
int result = quantity - (true ? 0 : otherValue);
So your example should compile once you add parentheses around the entire conditional expression.
Try filtering out the nulls within the initial select by adding a where in the select:
QtyOpen = d.QtyNeeded - (from dv in db.Donors
where dv.QtyFilled != null
select dv.QtyFilled).Sum();