In PIG how to project disambiguited field present in bag? - hadoop

I have something like this :
joined = JOIN A BY F1, B BY F1 ;
joinOutput = FOREACH joined GENERATE A::f3 AS f3, A::f4 AS f4, B::f5 AS f5 ;
grouped = GROUP joinOutput BY f3 ;
countOutput = FOREACH grouped FLATTEN(joinOutput) , count(f5) as COUNT ;
if I do """ DESCRIBE countOutput """ then I get following:
countOutput = { joinOutput::f3 :chararray, joinOutput::f4 :int, COUNT :int }
Now if I try to reference f3 with respect to "countOutput" i.e. countOutput.f3 I get error saying invalid field projection.
So my question is how do I project field f3 with respect to countOutput.
I haven't tried this is yet if this is correct but I could think of following ways -
countOutput.joinOutput::f3
Not sure though if this is correct way.
Any help is appreciated.

ok, found solution after trying out few things. I found that you can specify schema explicitly when you FLATTEN.
So this particular step can be re-written as follows :
countOutput = FOREACH grouped FLATTEN(joinOutput) AS ( f3 :chararray, f4: int) , count(f5) as COUNT ;
Now I can directly reference flattened fields with respect to outer relation.
Hope this helps if someone runs into same problem.

Related

Pig - MAX is not working after grouping

I am working with Pig 0.12.1 and Map-R. I am trying to find max of a field after grouping the relation on some other field. Refer the following pig script and structure of relation in comments-
r1 = foreach SomeRelation generate flatten(group) as (c1 , c2);
-- r1: {c1: biginteger,c2: biginteger}
r2 = group r1 by c1;
-- r2: {group: chararray,r1: {(c1: chararray,c2: biginteger)}}
DUMP r2;
/* output -
1234|{(1234,9876)}
2345|{(2345,8765)}
3456|{(3456,7654)}
4567|{(4567,6543)}
*/
r3 = foreach r2 generate group as c1, MAX(r1.c2) as c2;
I am getting the following error
Could not infer the matching function for org.apache.pig.builtin.MAX as multiple or none of them fit. Please use an explicit cast.
Script Explained-
I am flattening group of SomeRelation into c1, c2 and then regrouping
on c1 to generate max of c2 with each c1 group.
Please suggest.
I'm not sure if you can use the group keyword under the flatten. Also, have you considered tokenizing the group before flattening it. See this for example:
load_data = LOAD '/PIG_TESTS_ALL/WordCount' as (line);
tokenizing_data = FOREACH load_data generate flatten(TOKENIZE(line)) as word;
group_data = GROUP tokenizing_data by word;
Result = FOREACH group_data generate group,COUNT(tokenizing_data);
dump Result;
This is actually for word count, You can probably build on this to find max value based on what you want to do.
Well it looks like the problem is that Pig doesn't allow MAX(or for that matter aggregate functions like SUM etc) on biginteger. Had to use long as a datatype for this to work. Refer the following-
r1 = foreach SomeRelation generate flatten(group) as (c1 , c2:long);
-- r1: {c1: biginteger,c2: long}
Strangely, there's no documentation highlighting this almost like datatypes biginteger and bigdecimal.
We now know the problem was the unability of MAX to handle biginteger.
You should be able to group and get the max like this, and compare results with combination of order + limit :
r1 = FOREACH SomeRelation GENERATE FLATTEN(group) AS (c1, c2);
r3 = FOREACH (group r1 by c1) {
-- you may want to apply a function on a single column
-- or compare sort + limit to MAX
list = ORDER $1 BY c2 DESC;
list_max = LIMIT list 1;
GENERATE group AS c1, MAX(r1.c2) AS c2, list_max;
}

Apache Pig: filter based on tupple member content

I'm learning Apache Pig and have encountered an issue to realise what I wish.
I've this object (after doing a GROUP BY):
MLSET_1: {group chararray,MLSET: {(key: chararray, text: chararray)}}
I'd like to GENERATE key only when a certain pattern (PATTERN_A) appears in text AND another pattern (PATTERN_B) does not appear in the text field for one key.
I know that I can use MLSET.text to get a tupple of all text values for a specific key but then I'm still having the same issue on how to filter on the list of items from a tuple.
Here's an example:
(key_A,{(key_A,start),(key_A,stop),(key_A,unknown),(key_A,whatever)})
(key_B,{(key_B,stop),(key_B,whatever)})
(key_C,{(key_C,start),(key_C,stop),(key_C,whatever)})
I'd like to get keys for lines where "start" appears and "unknown" does not appears. In this example I will get only key_C as a result.
Thanks in advance for your help !
Here's some code that might help you out. The solution is a nested foreach here:
C = FOREACH MLSET_1 {F1 = FILTER MLSET BY (text == PATTERN_A); F2 = FILTER MLSET BY (text != PATTERN_B); GENERATE group, COUNT(F1) AS cnt1, COUNT(F2) AS cnt2;};
D = FILTER C BY (cnt1 > 1 AND cnt2 == 0);
you'll probably have to adapt the comparison in the nested filter.
Here the another approach
C = FOREACH MLSET_1 GENERATE $0,$1,BagToString(MLSET.(key,text));
D = FILTER C BY ($2 MATCHES '.*start.*') AND NOT($2 MATCHES '.*unknown.*');
E = FOREACH D GENERATE $0,$1;
DUMP E;
Output for the above input:
(key_c,{(key_c,start),(key_c,stop),(key_c,whatever)})

SUM function in Pig script

I am a student learning how to use Pig script using the hortonworks sandbox. My problem is that I am not able to use the SUM function properly. I have successfully separated the fields of a firewall log and I am able to do perform several queries and use the count function... but no luck with the SUM function which I really need in one case. This code I used below:
A = FOREACH logs_base GENERATE device_id,src,src_port,dst,dst_port,tran_ip,tran_port,service,duration,sent,rcvd,sent_pkt,rcvd_pkt,SN,user,group1, REGEX_EXTRACT(date, '\\d{3}-(\\d{2})-\\d{2}', 1) AS(month:chararray);
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
counter = foreach grpd1 {
sum1 = SUM(A.rcvd);
sum2 = SUM(A.sent);
generate sum1, sum2;
};
dump counter;
C = foreach F1 generate rcvd, sent;
dump C;
When I dump just the variable C I get a result displaying many records indicating the amount of data received/sent for the filter applied. eg:
(223,123)
(334,444)
(21,12344)
(...,...)
All I really want to do is add all those records together and show that total amount of received and sent: (?,?).
Note: I have tried changing the variable type to int, long, and chararray with no success either.
Some of the errors I am getting while trying to solve this are:
Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.
First make sure that the fields that you are summing up are of type int
Use - DESCRIBE A; to check the data type
After that, I think since you have used filter condition and then used group by on F1 -
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
So, while summing up you should use F1 instead of A -
counter = foreach grpd1 {
sum1 = SUM(F1.rcvd);
sum2 = SUM(F1.sent);
generate sum1, sum2;
};
Use DESCRIBE grpd1; and you will understand what I am trying to say, there will be no 'A'
I guess this should solve the error. Finally, check the logic of what you want in the result I have not checked that. Hope this helps.
PS - I am also a student and new to PIG.
A lucky guess here, I'm new to Pig too :)
I'm not sure if SUM can be casted to chararray(that would explain the error), so make rcvd and sent type:int and then generate the 2 sums for grpd1 bag:
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
C1 = foreach grpd1 generate SUM(F1.rcvd);
dump C1;
C2 = foreach grpd1 generate SUM(F1.sent);
dump C2;
NOTE: More info here.
Hope I helped a little!
Please try the following
A = FOREACH logs_base GENERATE device_id,src,src_port,dst,dst_port,tran_ip,tran_port,service,duration,sent,rcvd,sent_pkt,rcvd_pkt,SN,user,group1, REGEX_EXTRACT(date, '\\d{3}-(\\d{2})-\\d{2}', 1) AS(month:chararray);
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
C = foreach F1 generate group,SUM(F1.rcvd), SUM(F1.sent);
dump C;

pig how to concat columns into single string

I have a column of strings that I load using Pig:
A
B
C
D
how do I convert this column into a single string like this?
A,B,C,D
You are going to have to first GROUP ALL to put everything into one bag, then join the contents of the bag together using a UDF. Something like this:
-- myudfs.py
-- #!/usr/bin/python
--
-- #outputSchema('concated: string')
-- def concat_bag(BAG):
-- return ','.join(BAG)
Register 'myudfs.py' using jython as myfuncs;
A = LOAD 'myfile.txt' AS (letter:chararray) ;
B = GROUP A ALL ;
C = FOREACH B GENERATE myfuncs.concat_bag(A.letter) AS all_letters ;
If your file/schema contains multiple columns, you are probably going to want to project out the column you want to generate the string for. Something like:
A0 = LOAD 'myfile.txt' AS (letter:chararray, val:int, extra:chararray) ;
A = FOREACH A0 GENERATE letter ;
This way you are not keeping around extra columns that will slow down an already expensive operation.

linq to sql "All" operator type query

String[] codes = new[] {"C1", "C2" }
Below is the schema for SomeEntity
fkey code
--------------
f1 C1
f1 C2
f2 C2
f2 C3
f2 C4
f3 C5
I have to write queries that can get all the entities based on 2 condition -
1) does the entity have any of the codes
2) does the entity have all the codes
Below is what i wrote for 1st condition
-----------------------------------------
from f in dc.GetTable<SomeEntity>
where codes.Contains(f.code)
select f;
I tried to write 2nd codition by using All operator but "All" operator is not valid for linq to sql.
Question: How to write a query that will return only those entities that have all the codes ?
var codeCount = dc.GetTable<SomeEntity>.Distinct(e => e.Code);
var matching = (from f in dc.GetTable<SomeEntity>
group f by f.Key into grouped
select new { f.Key, values = grouped}).Where(v => v.values.Count() == codeCount);
Probably a hideous query, a stored proc mapped to a function on the data context might be a better solution to be honest
You could probably do the codeCount as let binding in the query and carry on the linq statement rather than close out to the .Where call. Not sure if it'd make it more performant but would save an extra trip to the DB
Try this.
Group by fkey (which will create distinct fkeys)
Filter the groups where
All members of the list of codes are contained within the codes of that group
See below
var matching = dc.GetTable<SomeEntity>()
.GroupBy(e => e.fkey)
.Where(group => codes.All(c =>
group.Select(g => g.code).Contains(code)))
.Select(group => group.Key);

Resources