Pig Latin Remove Tuple in Data Bag - hadoop

Here is my code leading up to my issue:
a = LOAD 'tellers' using TextLoader() AS line;
-- convert a to chararray
b = foreach a generate (chararray)line;
-- run through my UDF to create tuples
c = foreach b generate myudfs.TellerParser5(line); -- ({(20),(5),(5),(10),(1),(1),(1),(1),(1),(5),(10),(10),(10)})....
d = foreach c generate flatten(number);
e = group d by number; -- {group: chararray,d: {(number: chararray)}}
f = foreach e generate group, COUNT(d); -- f: {group: chararray,long}
In relation f, there is a tuple with an empty group key, (,1), that I'd like to filter out.
dump f;
(,1)
(1,97)
(5,49)
(10,87)
(20,24)
describe f;
f: {group: chararray,long}
I've tried this with no success (makes no change):
remove_tuple = filter f BY group is not null;

group is a Pig keyword. The filter might work if the field is given a different name.

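Nulls can be filtered out with a != 'null' condition. Note that this compares against the string 'null'; it drops null rows only because a comparison with a null operand evaluates to null, which FILTER does not pass. I used the following as input: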
(,1)
(1,97)
(5,49)
(10,87)
(20,24)
The nulls can be filtered like this:
A = LOAD 'file' using PigStorage(',') AS (a:chararray,b:long);
B = FILTER A BY a!='null';
DUMP B;
So for your script, the line would be something like:
remove_tuple = filter f BY group!='null';
Output:
(1,97)
(5,49)
(10,87)
(20,24)

I solved it by adding a step that casts the value to int. Here are the steps:
e = foreach d generate (int)$0 as number; -- this is the key added step
f = group e by number; -- {group: int,e: {(number: int)}}
g = foreach f generate group, COUNT(e); -- g: {group: int,long}
h = foreach f generate group, SUM(e);
i = filter g by $0 is not null;
dump i;
(1,97)
(5,49)
(10,87)
(20,24)
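One likely explanation for why the original "is not null" filter changed nothing is that the UDF emitted an empty string rather than a null, and that the cast to int turns that empty string into a null. The sketch below models this in plain Python, not Pig: cast_to_int is a hypothetical stand-in for Pig's (int) cast, the empty string in the input is an assumption, and the counts are taken from the dump above.

```python
from collections import Counter


def cast_to_int(value):
    """Hypothetical stand-in for Pig's (int) cast: unparsable input becomes None."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def count_non_null(values):
    """Group and count, then keep only groups whose key is not None (null)."""
    counts = Counter(values)
    return {key: n for key, n in counts.items() if key is not None}


# Assumption: the stray value is an empty string, not a null.
numbers = ["1"] * 97 + ["5"] * 49 + ["10"] * 87 + ["20"] * 24 + [""]

# Without the cast, the empty-string group survives the null filter.
raw_counts = count_non_null(numbers)

# With the cast, the empty string becomes None and is filtered out.
cast_counts = count_non_null(cast_to_int(n) for n in numbers)

print(raw_counts)
print(cast_counts)
```

If the UDF really does emit empty strings, filtering on an empty-string comparison before grouping would be another route, but that is untested here.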

Related

Aggregate values in Pig Latin

After performing multi-level filtering inside Pig, I get the below results -
(2343433,Argentina,2015,Sci-Fi)
(2343433,France,2015,Sci-Fi)
(2343433,Germany,2015,Sci-Fi)
(2343433,Netherlands,2015,Sci-Fi)
(2343433,Argentina,2015,Drama)
(2343433,France,2015,Drama)
(2343433,Germany,2015,Drama)
(2343433,Netherlands,2015,Drama)
(2343433,Argentina,2015,Family)
(2343433,France,2015,Family)
(2343433,Germany,2015,Family)
(2343433,Netherlands,2015,Family)
The column names are movieid,country,year and genre respectively. I need to aggregate these results and produce something like this -
(2343433,France,2015,Sci-Fi,Drama,Family)
(2343433,Germany,2015,Sci-Fi,Drama,Family)
(2343433,Netherlands,2015,Sci-Fi,Drama,Family)
(2343433,Argentina,2015,Sci-Fi,Drama,Family)
Either that or something like this -
(2343433,France,Germany,Netherlands,Argentina,2015,Sci-Fi,Drama,Family)
Below is my code to get the above results -
A = LOAD '/user/a1.csv' USING PigStorage('|') as (movie_id,movie_name,prod_year);
B = LOAD '/user/a2.csv' USING PigStorage('|') as (g_movieid,genres);
C = LOAD '/user/a3.csv' USING PigStorage('|') as (c_movieid,country_released);
D = JOIN A by movie_id, B by g_movieid;
E = JOIN D by g_movieid, C by c_movieid;
F = FOREACH E GENERATE movie_id, country_released AS country, prod_year AS year, genres AS genre;
Any idea on how to achieve this using Pig?
Try this:
Dump F;
(2343433,Argentina,2015,Sci-Fi)
(2343433,France,2015,Sci-Fi)
(2343433,Germany,2015,Sci-Fi)
(2343433,Netherlands,2015,Sci-Fi)
(2343433,Argentina,2015,Drama)
(2343433,France,2015,Drama)
(2343433,Germany,2015,Drama)
(2343433,Netherlands,2015,Drama)
(2343433,Argentina,2015,Family)
(2343433,France,2015,Family)
(2343433,Germany,2015,Family)
(2343433,Netherlands,2015,Family)
G = GROUP F BY (movie_id, country, year);
H = foreach G generate FLATTEN(group) as (movie_id, country, year), $1.$3 AS (genre:{T:(value:chararray)});
I = foreach H generate movie_id, country, year, FLATTEN(BagToTuple(genre.value));
Dump I;
(2343433,France,2015,Sci-Fi,Drama,Family)
(2343433,Germany,2015,Sci-Fi,Drama,Family)
(2343433,Argentina,2015,Sci-Fi,Drama,Family)
(2343433,Netherlands,2015,Sci-Fi,Drama,Family)
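For readers who want to check the grouping logic outside Pig, here is a minimal plain-Python model of the GROUP, BagToTuple and FLATTEN steps. It is a sketch, not Pig code: collect_genres is a hypothetical helper, and the rows are rebuilt from the dump of F shown above.

```python
def collect_genres(rows):
    """Group rows by (movie_id, country, year) and append each group's genres."""
    grouped = {}
    for movie_id, country, year, genre in rows:
        grouped.setdefault((movie_id, country, year), []).append(genre)
    # Flatten each key together with its collected genres into one tuple.
    return [key + tuple(genres) for key, genres in grouped.items()]


# Same rows as the dump of F: genres vary slowest, countries fastest.
countries = ["Argentina", "France", "Germany", "Netherlands"]
genres = ["Sci-Fi", "Drama", "Family"]
rows = [(2343433, c, 2015, g) for g in genres for c in countries]

result = collect_genres(rows)
for row in result:
    print(row)
```

The Python version keeps first-seen order; Pig makes no such guarantee about the order of tuples inside a bag or of groups in the output, so the genre order may differ there.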

PIG : How to exclude first n lines while Loading

Is there a way to exclude the first n lines of a file while loading data in Pig?
I have a CSV file that I would like to load, but I have to ignore the first 3 lines.
One option is to rank the rows and filter on the rank:
A = LOAD 'input' <schema>;
B = RANK A;
C = FILTER B BY $0 > 3;
D = FOREACH C GENERATE $1..;
DUMP D;
If you defined a schema in your LOAD statement, use the defined names instead of positional notation ($0, $1, etc.). It will be more readable.
Try the following code:
abt = LOAD 'act.psv' using PigStorage('|')
as (r1:chararray,r2:chararray);
r = rank abt;
n = filter r by ($0 > 3);
p = foreach n generate r1,r2;
dump p;
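The rank-then-filter idea used in both answers can be modelled in a few lines of plain Python. This is only a sketch of the logic, not Pig code: skip_first is a hypothetical helper and the sample rows are made up for illustration.

```python
def skip_first(rows, n):
    """Number rows from 1 (like RANK), keep those ranked above n, drop the rank."""
    ranked = enumerate(rows, start=1)
    return [row for rank, row in ranked if rank > n]


# Made-up sample: three header lines followed by two data rows.
rows = [
    ("header", "line 1"),
    ("header", "line 2"),
    ("header", "line 3"),
    ("a", "b"),
    ("c", "d"),
]

kept = skip_first(rows, 3)
print(kept)
```

This relies on the rank following the original line order of the file, which is the same assumption the Pig answers make.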

Equivalent of Union_map in pig

I have been trying to find an equivalent of union_map() in Pig. I know that the TOMAP function produces a map.
But the requirement is to merge all the maps for a given ID, as shown below.
select I1,UNION_MAP(MAP(Key,Val)) as new_val group by I1;
Sample Input and result is provided below.
Input
ID,Key,Val
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
select ID,UNION_MAP(TO_MAP(Key,VAL)) from table group by ID;
Result
ID1,(K1#V7,K2#V4)
ID2,(K1#V2,K3#V3)
I would like to get the similar output in pig.
Download piggybank.jar from http://www.java2s.com/Code/Jar/p/Downloadpiggybankjar.htm, add it to your classpath, and try the approach below.
input
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
PigScript:
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input' USING PigStorage(',') AS (ID:chararray,Key:chararray,Val:chararray);
B = RANK A;
C = GROUP B BY (ID,Key);
D = FOREACH C {
    sortByRank = ORDER B BY rank_A DESC;
    top1 = LIMIT sortByRank 1;
    GENERATE FLATTEN(top1);
}
E = GROUP D BY top1::ID;
F = FOREACH E {
    ToMap = FOREACH D GENERATE TOMAP(top1::Key,top1::Val);
    GENERATE group,BagToTuple(ToMap) AS myMap;
}
DUMP F;
Output:
(ID1,([K1#V7],[K2#V4]))
(ID2,([K1#V2],[K3#V3]))
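The script keeps the latest row for each (ID, Key) pair and then collects the pairs per ID. As a cross-check of that "last value wins" behaviour, here is a minimal plain-Python model; it is a sketch rather than Pig code, union_maps is a hypothetical helper, and the rows are the sample input from the question.

```python
def union_maps(rows):
    """Merge (id, key, value) rows into one dict per id; later rows win."""
    merged = {}
    for id_, key, val in rows:
        merged.setdefault(id_, {})[key] = val
    return merged


rows = [
    ("ID1", "K1", "V1"),
    ("ID2", "K1", "V2"),
    ("ID2", "K3", "V3"),
    ("ID1", "K2", "V4"),
    ("ID1", "K1", "V7"),
]

result = union_maps(rows)
print(result)
```

One difference worth noting: the Python model yields a single merged map per ID, whereas the Pig output above is a tuple of single-entry maps per ID.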

Pig: How to flatten & re-join bags within bags

I've got an example where we're trying to do what appears to be a simple join:
A = load 'data6' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
B = load 'data7' as ( v:chararray, r:chararray );
grunt> cat data1
'item1' 111 { ('thing1', 222, {('value1'),('value2')}) }
grunt> cat data2
'value1' 'result1'
'value2' 'result2'
We want to join the 'result1', 'result2' data of data2 into the entry in data1, on the obvious value field.
We managed to flatten it:
A = load 'data6' as ( item:chararray, d:int, things:bag{(thing:chararray, d1:int, values:bag{(v:chararray)})} );
B = load 'data7' as ( v:chararray, r:chararray );
F1 = foreach A generate item, d, flatten(things);
F2 = foreach F1 generate item..d1, flatten(values);
Then we joined the 2nd dataset in:
J = join F2 by v, B by v;
J1 = foreach J generate item as item, d as d, thing as thing, d1 as d1, F2::things::values::v as v, r as r; --Remove duplicate field & clean up naming
dump J1;
('item1',111,'thing1',222,'value1','result1')
('item1',111,'thing1',222,'value2','result2')
Now we need to call a UDF function once for each item, so we need to re-group those 2 levels of bags. Each item has 0 or more things, and each thing has 0 or more values, and the values now may or may not have a result.
How do we get back to:
('item1', 111, { ('thing1', 222, { ('value1', 'result1'), ('value2', 'result2') }) })
All of my attempts at grouping and re-joining have exploded in complexity, failed to produce the correct result, and taken 4+ MapReduce jobs for what should be a single MapReduce job in Hadoop.
The following code may work, R2 is the final result:
group_by_item_d_thing_d1 = group J1 by (item, d, thing, d1);
R1 = foreach group_by_item_d_thing_d1 generate group.item, group.d, group.thing, group.d1, J1;
group_by_item_d = group R1 by (item, d);
R2 = foreach group_by_item_d generate group.item, group.d, R1;
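The target shape of the two-level regrouping can be sketched in plain Python to make the intended nesting explicit. This is a model of the desired result, not Pig code: regroup is a hypothetical helper, and the input rows are the two tuples from the dump of J1.

```python
def regroup(rows):
    """Rebuild item -> things -> (value, result) nesting from flat joined rows."""
    items = {}
    for item, d, thing, d1, v, r in rows:
        things = items.setdefault((item, d), {})
        things.setdefault((thing, d1), []).append((v, r))
    return [
        (item, d, [(thing, d1, values) for (thing, d1), values in things.items()])
        for (item, d), things in items.items()
    ]


flat = [
    ("item1", 111, "thing1", 222, "value1", "result1"),
    ("item1", 111, "thing1", 222, "value2", "result2"),
]

nested = regroup(flat)
print(nested)
```

Items or things that had no values never appear in the flattened rows, so neither this sketch nor a regrouping of J1 can bring them back; they would have to be carried through separately.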

select count distinct using pig latin

I need help with this Pig script; it returns only a single record. I am selecting two columns and doing a count(distinct) on another, while also using a WHERE-like clause to find a particular description (desc).
Here is the SQL I am trying to express in Pig:
/*
For example in sql:
select domain, count(distinct(segment)) as segment_cnt
from table
where desc='ABC123'
group by domain
order by segment_cnt desc;
*/
A = LOAD 'myoutputfile' USING PigStorage('\u0005')
AS (
domain:chararray,
segment:chararray,
desc:chararray
);
B = filter A by (desc=='ABC123');
C = foreach B generate domain, segment;
D = DISTINCT C;
E = group D all;
F = foreach E generate group, COUNT(D) as segment_cnt;
G = order F by segment_cnt DESC;
You could GROUP on each domain and then count the number of distinct elements in each group with a nested FOREACH syntax:
D = group C by domain;
E = foreach D {
    unique_segments = DISTINCT C.segment;
    generate group, COUNT(unique_segments) as segment_cnt;
};
It may be cleaner to define this as a macro:
DEFINE DISTINCT_COUNT(A, c) RETURNS dist {
    temp = FOREACH $A GENERATE $c;
    dist = DISTINCT temp;
    groupAll = GROUP dist ALL;
    $dist = FOREACH groupAll GENERATE COUNT(dist);
}
Usage:
X = LOAD 'data' AS (x: int);
Y = DISTINCT_COUNT(X, x);
If you need to use it in a FOREACH instead then the easiest way is something like:
...GENERATE COUNT(Distinct(x))...
Tested on Pig 0.12.
If you don't need to count per group, use this:
G = FOREACH (GROUP A ALL){
    unique = DISTINCT A.field;
    GENERATE COUNT(unique) AS ct;
};
This will just give you a number.
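To see how the per-domain count differs from the single total, here is a minimal plain-Python model of the SQL in the question: filter on desc, count distinct segments per domain, and sort by the count in descending order. It is a sketch, not Pig code; distinct_segments_per_domain is a hypothetical helper and the sample rows are made up.

```python
def distinct_segments_per_domain(rows, wanted_desc):
    """Count distinct segments per domain for rows matching wanted_desc."""
    segments = {}
    for domain, segment, desc in rows:
        if desc == wanted_desc:
            segments.setdefault(domain, set()).add(segment)
    counts = [(domain, len(found)) for domain, found in segments.items()]
    # Largest count first, mirroring ORDER ... BY segment_cnt DESC.
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


# Made-up sample rows: (domain, segment, desc).
rows = [
    ("a.com", "s1", "ABC123"),
    ("a.com", "s1", "ABC123"),
    ("a.com", "s2", "ABC123"),
    ("b.com", "s1", "ABC123"),
    ("b.com", "s9", "OTHER"),
]

per_domain = distinct_segments_per_domain(rows, "ABC123")
print(per_domain)
```

Grouping everything together, as GROUP ... ALL does, collapses these rows into one total, which matches the single record described in the question.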
