Does Pig support an if-else statement?
Here is what I want to do:
if ($NAME == 'Joey')
    do something
else
    do something else
Is that doable?
Thanks
It's called the "bincond" operator.
Expressions like:
(Price > 75 ? 'High' : 'Low')
are also valid.
For handling null records:
((Name is null or IsEmpty(Name)) ? {('unknown')} : Name)
Use them in a FOREACH statement, giving the expression an alias alongside the other fields, e.g.:
A = LOAD 'x/y/Price.csv' USING PigStorage(',') AS (Name:chararray, Product:chararray, Price:int);
B = FOREACH A GENERATE Name, Product, Price, (Price > 75 ? 'High' : 'Low') AS Indicator;
DUMP B;
You can use the conditional (bincond) operator. For example:
(Name == 'Joey' ? 'Yes' : 'No')
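Used inside a FOREACH, this answers the original question directly. A minimal sketch, assuming a hypothetical people.csv with a Name field:
-- hypothetical input: one comma-separated record per person
people = LOAD 'people.csv' USING PigStorage(',') AS (Name:chararray, Age:int);
-- branch on Name with the bincond operator
flagged = FOREACH people GENERATE Name, (Name == 'Joey' ? 'is_joey' : 'not_joey') AS flag;
DUMP flagged;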
If I understand correctly (I started Pig Latin yesterday), Pig doesn't have if-else or for statements; you have to use Python or Java for that. See here: http://chimera.labs.oreilly.com/books/1234000001811/ch09.html
I have to loop over 30 variables in a list:
[var1, var2, ... , var30]
and for each variable I use a Pig GROUP BY statement such as:
grouped = GROUP data BY var1;
data_var1 = FOREACH grouped GENERATE
    group AS mygroup,
    COUNT(data) AS count;
Is there a way to loop over the list of variables, or am I forced to repeat the code above manually 30 times?
Thanks!
I think what you're looking for is a Pig macro.
Create a relation holding your 30 variable names, iterate over it with a FOREACH, and call a macro that takes two parameters: your data relation and the variable you want to group by.
Just check the example in the link; the macro there is really similar to what you'd like to do.
UPDATE & code
So here's the macro you can use:
DEFINE my_cnt(data, group_field) RETURNS C {
    $C = FOREACH (GROUP $data BY $group_field) GENERATE
         group AS mygroup,
         COUNT($data) AS count;
};
Use the macro:
IMPORT 'cnt.macro';
data = LOAD 'data.txt' USING PigStorage(',') AS (field:chararray, value:chararray);
DESCRIBE data;
e = my_cnt(data, field);  -- pass the field to group by, without quotes
DESCRIBE e;
DUMP e;
I'm still thinking about how you can iterate over the fields you'd like to group by. My original suggestion to FOREACH through a relation that contains the field names is not correct. (Creating a UDF for this would always work.) Let me think about it.
But this macro works as is if you call it once for each field name you want to group by.
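For example, with the sample data relation above, a sketch of calling the macro once per field (the grouping is still written out explicitly):
IMPORT 'cnt.macro';
data = LOAD 'data.txt' USING PigStorage(',') AS (field:chararray, value:chararray);
-- one macro call per field to group by
by_field = my_cnt(data, field);
by_value = my_cnt(data, value);
DUMP by_field;
DUMP by_value;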
I need the following output:
NE 50
SE 80
I am using a Pig query to count countries by zone:
c1 = group country by zone;
c2 = foreach c1 generate COUNT(country.zone), (
case country.zone
when 1 then 'NE'
else 'SE'
);
But I am not able to achieve my output. I am getting an error like the following:
2016-03-30 13:57:16,569 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1039: (Name: Equal Type: null Uid: null)incompatible types in Equal Operator left hand side:bag :tuple(zone:int) right hand side:int
Details at logfile: /home/cloudera/pig_1459370643493.log
But I was able to do it using the following query:
c2 = foreach c1 generate group, COUNT(country.zone);
This gives the following output:
(1,50)
(2,80)
How can I get NE instead of 1 and SE instead of 2? I thought using CASE would help, but I am getting an error. Can anyone help?
EDIT
Pig 0.12.0 and later support the CASE expression:
c2 = FOREACH c1 GENERATE (CASE group
WHEN 1 THEN 'NE'
WHEN 2 THEN 'SE'
WHEN 3 THEN 'AE'
ELSE 'VR' END), COUNT(country.zone);
Older Pig Versions
Older versions of Pig do not have a CASE statement. Your best option is to use a UDF. If the group values are limited to only two, you can use the bincond operator to check the value:
c2 = foreach c1 generate (group == 1 ? 'NE' : 'SE'), COUNT(country.zone);
If you have multiple values, use nested bincond operators:
c2 = FOREACH c1 GENERATE (group == 1 ? 'NE' :
(group == 2 ? 'SE' :
(group == 3 ? 'AE' : 'VR'))), COUNT(country.zone);
In Pig 0.12 and later, you can use the CASE statement.
In your case, country.zone is a bag, and you can't compare it to an int.
With the answer posted above I was getting this error:
mismatched input ')' expecting END
So here is the updated, working code:
c2 = FOREACH c1 GENERATE (CASE group
WHEN 1 THEN 'NE'
WHEN 2 THEN 'SE'
WHEN 3 THEN 'AE'
ELSE 'VR' END), COUNT(country.zone);
Output:
(NE, 50)
(SE, 80)
(AE, 30)
I have numerous queries which may contain syntax errors (and no unit tests, but that's another problem), and I'd like to check them all in bulk for errors.
To do that, I first tried the following:
String q = ...; // some query
try (PreparedStatement stmt = connection.prepareStatement(q)) {
    final ParameterMetaData pmd = stmt.getParameterMetaData();
    for (int i = 1; i <= pmd.getParameterCount(); ++i) {
        stmt.setNull(i, java.sql.Types.NULL);
    }
    stmt.execute();
} catch (SQLException e) {
    ...
} finally {
    connection.rollback();
}
It works, but then I ran into errors like this one: http://www.oracle-error.com/11g/ORA-30081.html
Basically, somewhere in my query, I have that:
select *
from table T
where id = ? or ( ? - INTERVAL '1' DAY ) between date_start and date_end
If I execute the same query in TOAD, replacing ? with NULL, I get the same error.
ParameterMetaData does not help either, because it doesn't hold the information I want (e.g. what Oracle expects for each parameter).
Is there some way to compile the query syntactically and semantically (to check for missing columns, etc.) while ignoring the parameters along the way?
As of now, I am replacing each ? with NULL, except when the "?" is followed by something date-related, in which case I use sysdate.
e.g.:
select *
from table T
where id = NULL or ( sysdate - INTERVAL '1' DAY ) between date_start and date_end
Not directly through JDBC, but you can do it indirectly; heavily inspired by this, you can do:
String q = ...; // some query
String plsql =
    "declare c integer; " +
    "begin " +
    "  c := dbms_sql.open_cursor; " +
    "  dbms_sql.parse(c, ?, dbms_sql.native); " +
    "  dbms_sql.close_cursor(c); " +
    "end;";
try (PreparedStatement stmt = connection.prepareStatement(plsql)) {
    stmt.setString(1, q.replace("?", ":b0"));
    stmt.execute();
} catch (SQLException e) {
    ...
}
The statement you prepare is now an anonymous block, and the single bind variable is now your original query to validate. You don't need to know anything about the query's parameters. The replace converts the JDBC ? placeholders to generic :b0 bind variables so the parser doesn't object to them.
You could be more advanced and replace each placeholder with a different bind variable (:b0, :b1) etc. but I don't think it will generally be necessary. This crude replace would also potentially modify string literals though, of course, which may be something you need to consider; a regular expression approach would be more robust.
One other option to try might be to use the EXPLAIN PLAN statement available in Oracle and in some other DBMSes (possibly in a slightly different form). Prepend 'EXPLAIN PLAN FOR ' to your statement and execute() (no need to prepare). The original statement won't actually run, but it will be parsed and compiled, and you don't need to bind any parameters.
Proof.
It may still choke on untyped parameter markers in some cases though.
I'm learning Apache Pig and have run into an issue trying to achieve what I want.
I have this schema (after doing a GROUP BY):
MLSET_1: {group: chararray, MLSET: {(key: chararray, text: chararray)}}
I'd like to GENERATE key only when a certain pattern (PATTERN_A) appears in text AND another pattern (PATTERN_B) does not appear in the text field for a given key.
I know that I can use MLSET.text to get a bag of all text values for a specific key, but then I still have the same issue of how to filter on that list of items.
Here's an example:
(key_A,{(key_A,start),(key_A,stop),(key_A,unknown),(key_A,whatever)})
(key_B,{(key_B,stop),(key_B,whatever)})
(key_C,{(key_C,start),(key_C,stop),(key_C,whatever)})
I'd like to get the keys for lines where "start" appears and "unknown" does not appear. In this example I would get only key_C as a result.
Thanks in advance for your help!
Here's some code that might help you out. The solution here is a nested FOREACH:
C = FOREACH MLSET_1 {
    F1 = FILTER MLSET BY (text == PATTERN_A);
    F2 = FILTER MLSET BY (text == PATTERN_B);
    GENERATE group, COUNT(F1) AS cnt1, COUNT(F2) AS cnt2;
};
D = FILTER C BY (cnt1 > 0 AND cnt2 == 0);
You'll probably have to adapt the comparisons in the nested filters to your actual patterns (e.g. use MATCHES instead of == if they are regular expressions).
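For the sample data above, a concrete sketch with the literal values 'start' and 'unknown' would be:
C = FOREACH MLSET_1 {
    started = FILTER MLSET BY (text == 'start');
    unknown = FILTER MLSET BY (text == 'unknown');
    GENERATE group, COUNT(started) AS cnt_start, COUNT(unknown) AS cnt_unknown;
};
D = FILTER C BY (cnt_start > 0 AND cnt_unknown == 0);
E = FOREACH D GENERATE group;
DUMP E;  -- (key_C) for the sample data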
Here is another approach:
C = FOREACH MLSET_1 GENERATE $0,$1,BagToString(MLSET.(key,text));
D = FILTER C BY ($2 MATCHES '.*start.*') AND NOT($2 MATCHES '.*unknown.*');
E = FOREACH D GENERATE $0,$1;
DUMP E;
Output for the above input:
(key_C,{(key_C,start),(key_C,stop),(key_C,whatever)})
I am using the ternary operator to conditionally include values in a SUM() operation. Here is how I am doing it:
GROUPED = GROUP ALL_MERGED BY (fld1, fld2, fld3);
REPORT_DATA = FOREACH GROUPED
{ GENERATE group,
SUM(GROUPED.fld4 == 'S' ? GROUPED.fld5 : 0) AS sum1,
SUM(GROUPED.fld4 == 'S' ? GROUPED.fld5 : (GROUPED.fld5 * -1)) AS sum2;
}
The schema for ALL_MERGED is:
{ALL_MERGED: {fld1:chararray, fld2:chararray, fld3:chararray, fld4:chararray, fld5:int}}
When I execute this, it gives me the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: SUM in {group: (fld1:chararray, fld2:chararray, fld3:chararray), ALL_MERGED: {fld1:chararray, fld2:chararray, fld3:chararray, fld4:chararray: fld5:int}}
What am I doing wrong here?
SUM is a UDF which takes a bag as input. What you are doing has a number of problems, and I suspect it would help you to review a good reference on Pig. I recommend Programming Pig, available for free online. To begin with, GROUPED has two fields: a tuple called group and a bag called ALL_MERGED, which is what the error message is trying to tell you. (I say "trying" because Pig error messages are often quite cryptic.)
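You can check this yourself with DESCRIBE; a sketch of roughly what it prints for the GROUPED relation above:
DESCRIBE GROUPED;
-- GROUPED: {group: (fld1: chararray,fld2: chararray,fld3: chararray),
--           ALL_MERGED: {(fld1: chararray,fld2: chararray,fld3: chararray,fld4: chararray,fld5: int)}}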
Also, you cannot pass expressions to UDFs like you wish to do. Instead you will have to GENERATE these fields and then pass them afterward. Try this:
ALL_MERGED_2 =
FOREACH ALL_MERGED
GENERATE
fld1 .. fld5,
((fld4 == 'S') ? fld5 : 0) AS sum_me1,
((fld4 == 'S') ? fld5 : fld5*-1) AS sum_me2;
GROUPED = GROUP ALL_MERGED_2 BY (fld1, fld2, fld3);
DATA =
FOREACH GROUPED
GENERATE
group,
SUM(ALL_MERGED_2.sum_me1) AS sum1,
SUM(ALL_MERGED_2.sum_me2) AS sum2;
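If you DESCRIBE the final relation, you should see something along these lines (a sketch; SUM over an int field yields a long):
DESCRIBE DATA;
-- DATA: {group: (fld1: chararray,fld2: chararray,fld3: chararray),sum1: long,sum2: long}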