Hadoop Pig count number - hadoop

I am learning how to use Hadoop Pig.
If I have an input file like this:
a,b,c,true
s,c,v,false
a,s,b,true
...
The last field is the one I need to count: I want to know how many 'true' and 'false' values are in this file.
I tried:
records = LOAD 'test/input.csv' USING PigStorage(',');
boolean = foreach records generate $3;
groups = group boolean all;
Now I am stuck. I want to use:
count = foreach groups generate count('true');
to get the number of "true" values, but I always get this error:
2013-08-07 16:32:36,677 [main] ERROR org.apache.pig.tools.grunt.Grunt
- ERROR 1070: Could not resolve count using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /etc/pig/pig_1375911119028.log
Can anybody tell me where the problem is?

Two things. Firstly, count should actually be COUNT. In Pig, function names are case-sensitive, and most built-in functions are written in all caps.
Secondly, COUNT counts the number of tuples in a bag; it does not count occurrences of a particular value. Therefore, you should group by the true/false field, then COUNT each group:
boolean = FOREACH records GENERATE $3 AS trueORfalse ;
groups = GROUP boolean BY trueORfalse ;
counts = FOREACH groups GENERATE group AS trueORfalse, COUNT(boolean) ;
So now the output of a DUMP for counts will look something like:
(true, 2)
(false, 1)
If you want the counts of true and false in their own relations then you can FILTER the output of counts. However, it would probably be better to SPLIT boolean, then do two separate counts:
boolean = FOREACH records GENERATE $3 AS trueORfalse ;
SPLIT boolean INTO alltrue IF trueORfalse == 'true',
allfalse IF trueORfalse == 'false' ;
tcount = FOREACH (GROUP alltrue ALL) GENERATE COUNT(alltrue) ;
fcount = FOREACH (GROUP allfalse ALL) GENERATE COUNT(allfalse) ;
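To sanity-check the expected counts on a small sample outside Pig, the same logic can be modeled in plain Python. This is only a sketch of what GROUP ... BY followed by COUNT computes, not Pig code; the function name count_field and the inline sample rows are made up for illustration.

```python
from collections import Counter

def count_field(lines, field_index=3, delimiter=","):
    """Count how often each value occurs in one field (default: $3)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(delimiter)
        counts[fields[field_index]] += 1
    return dict(counts)

# Sample rows taken from the question.
sample = ["a,b,c,true", "s,c,v,false", "a,s,b,true"]
print(count_field(sample))
```

Each key of the returned dictionary plays the role of the group field in the Pig script, and each value is the COUNT for that group.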

Related

Filter inner bag in Pig

The data looks like this:
22678, {(112),(110),(2)}
656565, {(110), (109)}
6676, {(2),(112)}
This is the data structure:
(id:chararray, event_list:{innertuple:(innerfield:chararray)})
I want to keep only the rows where event_list contains 2. My first idea was to flatten the data and then filter for the rows that have 2, but for some reason flatten doesn't work on this dataset.
Can anyone please help?
There might be a simpler way of doing this, such as a bag lookup. Otherwise, with basic Pig, one way of achieving this is:
data = load 'data.txt' AS (id:chararray, event_list:bag{});
-- flatten bag, in order to transpose each element to a separate row.
flattened = foreach data generate id, flatten(event_list);
-- keep only those rows where the value is 2.
filtered = filter flattened by (int) $1 == 2;
-- keep only distinct ids.
dist = distinct (foreach filtered generate $0 as (id:chararray));
-- join distinct ids to original relation
jnd = join data by id, dist by id;
-- remove extra fields, keep original fields.
result = foreach jnd generate data::id, data::event_list;
dump result;
(22678,{(112),(110),(2)})
(6676,{(2),(112)})
You can filter the bag inside a nested FOREACH and project a boolean that says whether 2 is present in the bag. Then keep only the rows where that boolean is true.
For example (the aliases avoid input and output, which are reserved words in Pig):
data = LOAD 'data.txt' AS (id:chararray, event_list:bag{t:(innerfield:chararray)});
data_flagged = FOREACH data {
    -- keep only the inner tuples whose value is 2
    bag_filter = FILTER event_list BY innerfield == '2';
    GENERATE
        id,
        event_list,
        (IsEmpty(bag_filter) ? false : true) AS is_2_present:boolean;
};
result = FILTER data_flagged BY is_2_present;
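To check which rows should survive the filter, the intended result can be modeled on the sample data in plain Python. This is a sketch of the semantics only, not Pig code; the function name rows_with_event is hypothetical, and the bags are represented as plain lists of strings.

```python
def rows_with_event(rows, wanted="2"):
    """Keep (id, event_list) rows whose event_list contains `wanted`."""
    return [(row_id, events) for row_id, events in rows if wanted in events]

# Sample rows taken from the question.
rows = [
    ("22678", ["112", "110", "2"]),
    ("656565", ["110", "109"]),
    ("6676", ["2", "112"]),
]

for row in rows_with_event(rows):
    print(row)
```

Both Pig answers should return the same two ids as this sketch, 22678 and 6676.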

PIG: Filter a string on a basis of a phrase

I was wondering if it is possible to filter a string on the basis of a phrase. For example, I want to count the number of times ps3 (or ps 3) appears in the query. I am not sure how to avoid an exact match in the filter condition for "ps 3", because I do not know how to put a tab inside it. My code so far is:
data = LOAD '/user/cloudera/' using PigStorage(',') as (text:chararray);
filtered_data = FILTER data BY (text matches '.*ps3.*') OR (text == 'ps 3');
Res = FOREACH (GROUP filtered_data ALL) GENERATE COUNT(filtered_data);
DUMP Res;
So obviously the code fails to count queries like "ps 3 today". Is there a way to handle this?
Try this -
A = LOAD 'input.csv' USING PigStorage(',') AS (text:chararray);
B = FILTER A BY (LOWER(text) MATCHES '.*ps 3.*' OR LOWER(text) MATCHES '.*ps3.*');
DUMP B;
Output:
(ps 3 today)
(ps 3)
(ps3)
(PS3TODAY)
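The patterns need the leading and trailing .* because MATCHES in Pig has to match the whole string, not just a part of it. The Python sketch below models that with re.fullmatch. It is an illustration, not Pig code: the function name count_ps3 is hypothetical, and the combined pattern "ps ?3" (optional space) is an assumption standing in for the two separate patterns in the answer.

```python
import re

# Whole-string match, like Pig's MATCHES: ".*" on both sides lets the
# phrase appear anywhere; " ?" makes the space between "ps" and "3" optional.
PS3_PATTERN = re.compile(r".*ps ?3.*")

def count_ps3(queries):
    """Count queries that mention ps3 or ps 3, ignoring case."""
    return sum(1 for q in queries if PS3_PATTERN.fullmatch(q.lower()))

queries = ["ps 3 today", "ps 3", "ps3", "PS3TODAY", "xbox one"]
print(count_ps3(queries))
```

Without the surrounding .*, only the queries that are exactly "ps3" or "ps 3" would match, which is the problem described in the question.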

Apache Pig: filter based on tuple member content

I'm learning Apache Pig and have run into a problem getting the result I want.
I've this object (after doing a GROUP BY):
MLSET_1: {group chararray,MLSET: {(key: chararray, text: chararray)}}
I'd like to GENERATE key only when a certain pattern (PATTERN_A) appears in text AND another pattern (PATTERN_B) does not appear in the text field for one key.
I know that I can use MLSET.text to get a bag of all text values for a specific key, but then I still have the same problem: how to filter on the items in that list.
Here's an example:
(key_A,{(key_A,start),(key_A,stop),(key_A,unknown),(key_A,whatever)})
(key_B,{(key_B,stop),(key_B,whatever)})
(key_C,{(key_C,start),(key_C,stop),(key_C,whatever)})
I'd like to get the keys for lines where "start" appears and "unknown" does not appear. In this example I would get only key_C as a result.
Thanks in advance for your help !
Here's some code that might help you out. The solution is a nested foreach here:
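C = FOREACH MLSET_1 {
    F1 = FILTER MLSET BY text == 'start';   -- PATTERN_A in the example
    F2 = FILTER MLSET BY text == 'unknown'; -- PATTERN_B in the example
    GENERATE group, COUNT(F1) AS cnt1, COUNT(F2) AS cnt2;
};
-- keep keys where PATTERN_A occurs at least once and PATTERN_B never occurs
D = FILTER C BY (cnt1 > 0 AND cnt2 == 0);
You'll probably have to adapt the comparisons in the nested filters, for example by using MATCHES if the patterns are regular expressions.
Here is another approach: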
C = FOREACH MLSET_1 GENERATE $0,$1,BagToString(MLSET.(key,text));
D = FILTER C BY ($2 MATCHES '.*start.*') AND NOT($2 MATCHES '.*unknown.*');
E = FOREACH D GENERATE $0,$1;
DUMP E;
Output for the above input:
(key_C,{(key_C,start),(key_C,stop),(key_C,whatever)})
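The selection rule itself ("PATTERN_A present and PATTERN_B absent, per key") can be checked against the sample data with a small Python sketch. This models the intended result only and is not Pig code; the function name keys_with_a_without_b is hypothetical, and each group is represented as a key plus a list of its text values.

```python
def keys_with_a_without_b(grouped, pattern_a, pattern_b):
    """Return keys whose texts include pattern_a but never pattern_b."""
    selected = []
    for key, texts in grouped:
        if pattern_a in texts and pattern_b not in texts:
            selected.append(key)
    return selected

# Sample groups taken from the question.
grouped = [
    ("key_A", ["start", "stop", "unknown", "whatever"]),
    ("key_B", ["stop", "whatever"]),
    ("key_C", ["start", "stop", "whatever"]),
]

print(keys_with_a_without_b(grouped, "start", "unknown"))
```

key_A is dropped because it contains "unknown", and key_B is dropped because it never contains "start".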

How would I make a pig script that only returns fields with entries over a certain length?

The data I have is already split into fields. I just want an output file that contains two of those fields, and only for records whose title field is over a certain length. This is what I have so far.
records = LOAD '$INPUT' USING PigStorage('\t') AS (url:chararray, title:chararray, meta:chararray, copyright:chararray, aboutUSLink:chararray, aboutTitle:chararray, aboutMeta:chararray, contactUSLink:chararray, contactTitle:chararray, contactMeta:chararray, phones:chararray);
E = FOREACH records IF SIZE(title)>10 GENERATE url,title;
STORE E INTO '$OUTPUT/phoneNumbersAndTitles';
Why does the code exit at IF?
You should use FILTER, which selects tuples from a relation based on some condition:
filtered = FILTER records BY SIZE(title) > 10;
E = FOREACH filtered GENERATE url,title;

Conditional SUM in Pig

I am using the ternary operator to include values in SUM() operation conditionally. Here is how I am doing it.
GROUPED = GROUP ALL_MERGED BY (fld1, fld2, fld3);
REPORT_DATA = FOREACH GROUPED
{ GENERATE group,
SUM(GROUPED.fld4 == 'S' ? GROUPED.fld5 : 0) AS sum1,
SUM(GROUPED.fld4 == 'S' ? GROUPED.fld5 : (GROUPED.fld5 * -1)) AS sum2;
}
Schema for ALL_MERGED is
{ALL_MERGED: {fld1:chararray, fld2:chararray, fld3:chararray, fld4:chararray: fld5:int}}
When I execute this, it gives me following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: SUM in {group: (fld1:chararray, fld2:chararray, fld3:chararray), ALL_MERGED: {fld1:chararray, fld2:chararray, fld3:chararray, fld4:chararray: fld5:int}}
What am I doing wrong here?
SUM is a UDF which takes a bag as input. What you are doing has a number of problems, and I suspect it would help you to review a good reference on Pig. I recommend Programming Pig, available for free online. To begin with, GROUPED has two fields: a tuple called group and a bag called ALL_MERGED, which is what the error message is trying to tell you. (I say "trying" because Pig error messages are often quite cryptic.)
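Also, SUM needs a bag of numbers, and you cannot evaluate a per-row ternary expression on a bag projection such as GROUPED.fld5 inside the call. Instead, GENERATE the conditional values as fields before grouping, and then SUM those fields. Try this: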
ALL_MERGED_2 =
FOREACH ALL_MERGED
GENERATE
fld1 .. fld5,
((fld4 == 'S') ? fld5 : 0) AS sum_me1,
((fld4 == 'S') ? fld5 : fld5*-1) AS sum_me2;
GROUPED = GROUP ALL_MERGED_2 BY (fld1, fld2, fld3);
DATA =
FOREACH GROUPED
GENERATE
group,
SUM(ALL_MERGED_2.sum_me1) AS sum1,
SUM(ALL_MERGED_2.sum_me2) AS sum2;
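The two-step idea (compute the conditional value per row, then sum per group) can be sketched in plain Python to check the expected numbers on a small sample. This is an illustration of the logic, not Pig code; the function name conditional_sums and the sample rows are hypothetical.

```python
from collections import defaultdict

def conditional_sums(rows):
    """Per (fld1, fld2, fld3) group, return (sum1, sum2).

    sum1 adds fld5 only when fld4 == 'S'.
    sum2 adds fld5 when fld4 == 'S' and subtracts it otherwise.
    """
    sums = defaultdict(lambda: [0, 0])
    for fld1, fld2, fld3, fld4, fld5 in rows:
        key = (fld1, fld2, fld3)
        sums[key][0] += fld5 if fld4 == "S" else 0
        sums[key][1] += fld5 if fld4 == "S" else -fld5
    return {key: tuple(values) for key, values in sums.items()}

# Hypothetical rows: (fld1, fld2, fld3, fld4, fld5)
rows = [
    ("a", "b", "c", "S", 10),
    ("a", "b", "c", "X", 4),
    ("a", "b", "d", "S", 3),
]

print(conditional_sums(rows))
```

The per-row conditionals inside the loop correspond to sum_me1 and sum_me2, and the dictionary keyed by (fld1, fld2, fld3) corresponds to the GROUP step.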

Resources