Equivalent of Union_map in pig - hadoop

I have been trying to find the union_map() equivalent in pig. I know for sure that TOMAP function brings in MAP datatype.
But the requirement is to bring all the MAPs for a given id as shown below.
select I1,UNION_MAP(MAP(Key,Val)) as new_val group by I1;
Sample Input and result is provided below.
Input
ID,Key,Val
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
select ID,UNION_MAP(TO_MAP(Key,VAL)) from table group by ID;
Result
ID1,(K1#V7,K2#V4)
ID2,(K1#V2,K3#V3)
I would like to get the similar output in pig.

Download the piggybank.jar from this link http://www.java2s.com/Code/Jar/p/Downloadpiggybankjar.htm and set it in your classpath and try the below approach.
input
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
PigScript:
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input' USING PigStorage(',') AS (ID:chararray,Key:chararray,Val:chararray);
B = RANK A;
C = GROUP B BY (ID,Key);
D = FOREACH C {
sortByRank = ORDER B BY rank_A DESC;
top1 = LIMIT sortByRank 1;
GENERATE FLATTEN(top1);
}
E = GROUP D BY top1::ID;
F = FOREACH E {
ToMap = FOREACH D GENERATE TOMAP(top1::Key,top1::Val);
GENERATE group,BagToTuple(ToMap) AS myMap;
}
DUMP F;
Output:
(ID1,([K1#V7],[K2#V4]))
(ID2,([K1#V2],[K3#V3]))

Related

Discarding nulls after full outer join in PIG

Need help with discarding nulls in the result of full outer join in pig Latin. Below are two data sets :
A:
(BOS,2)
(BUR,81)
(LAS,8)
B:
(BUR,56)
(EWR,2)
(LAS,88)
After full outer join :
C :
(BOS,2,,)
(BUR,81,BUR,56)
(,,EWR,2)
(LAS,8,LAS,88)
I need to get the output in below format :
(BOS,2)
(BUR,137)
(EWR,2)
(LAS,96)
Tried different combinations of group by , flatten , bagtotuple ... but was not able to figure out the solution . Many thanks for help.
airline = load '/demo/data/airline/airline.csv' using PigStorage(',') as (Origin: chararray, Dest: chararray);
traffic_in = GROUP airline by Origin;
traffic_in_count= FOREACH traffic_in generate group as Origin , COUNT(airline) as count ;
traffic_out = GROUP airline by Dest;
traffic_out_count = FOREACH traffic_out generate group as Dest ,COUNT (airline) as count;
traffic_top = JOIN traffic_in_count by Origin FULL OUTER , traffic_out_count by Dest ;
EDIT
Instead of using OUTER JOIN use UNION and then SUM the 2nd column values.
A = LOAD 'test1.txt' using PigStorage(',') as (A1:chararray, A2:int);
B = LOAD 'test2.txt' using PigStorage(',') as (B1:chararray, B2:int);
C = UNION A,B;
D = GROUP C BY $0;
E = FOREACH D GENERATE group,SUM(C.$1);
DUMP E;
Output

Pig Latin Remove Tuple in Data Bag

Here is my code leading up to my issue:
a = LOAD 'tellers' using TextLoader() AS line;
# convert a to charrarry
b = foreach a generate (chararray)line;
# run through my UDF to create tuples
c = foreach b generate myudfs.TellerParser5(line); # ({(20),(5),(5),(10)(1),(1),(1),(1),(1),(5),(10),(10),(10)})....
d = foreach c generate flatten(number);
e = group d by number; #{group: chararray,d: {(number: chararray)}}
f = foreach e generate group, COUNT(d); # f: {group: chararray,long}
In databag f, I have an empty tuple (,1) I'd like to filter/remove.
dump f;
(,1)
(1,97)
(5,49)
(10,87)
(20,24)
describe f;
f: {group: chararray,long}
I've tried this with no success (makes no change):
remove_tuple = filter f BY group is not null;
Group is a pig keyword. Hope this should work when some other word is used for the tuple name.
NULL can be filtered by using !='null' as a condition. I have taken below as the input.
(,1)
(1,97)
(5,49)
(10,87)
(20,24)
Below is how we can filter NULL's.
A = LOAD 'file' using PigStorage(',') AS (a:chararray,b:long);
B = FILTER A BY a!='null';
DUMP B;
So for your script the line will be something like
remove_tuple = filter f BY group!='null';
Output:
(1,97)
(5,49)
(10,87)
(20,24)
I solved by adding a step and casting as an int. Here are the steps:
e = foreach d generate (int)$0; # this is the key added step
f = group e by number; #{group: chararray,d: {(number: chararray)}}
g = foreach f generate group, COUNT(e); # f: {group: chararray,long}
h = foreach f generate group, SUM(e);
i = filter g by $0 is not null;
dump i;
(1,97)
(5,49)
(10,87)
(20,24)

Aggregate values in Pig Latin

After performing multi-level filtering inside Pig, I get the below results -
(2343433,Argentina,2015,Sci-Fi)
(2343433,France,2015,Sci-Fi)
(2343433,Germany,2015,Sci-Fi)
(2343433,Netherlands,2015,Sci-Fi)
(2343433,Argentina,2015,Drama)
(2343433,France,2015,Drama)
(2343433,Germany,2015,Drama)
(2343433,Netherlands,2015,Drama)
(2343433,Argentina,2015,Family)
(2343433,France,2015,Family)
(2343433,Germany,2015,Family)
(2343433,Netherlands,2015,Family)
The column names are movieid,country,year and genre respectively. I need to aggregate these results and produce something like this -
(2343433,France,2015,Sci-Fi,Drama,Family)
(2343433,Germany,2015,Sci-Fi,Drama,Family)
(2343433,Netherlands,2015,Sci-Fi,Drama,Family)
(2343433,Argentina,2015,Sci-Fi,Drama,Family)
Either that or something like this -
(2343433,France,Germany,Netherlands,Argentina,2015,Sci-Fi,Drama,Family)
Below is my code to get the above results -
A = LOAD '/user/a1.csv' USING PigStorage('|') as (movie_id,movie_name,prod_year);
B = LOAD '/user/a2.csv' USING PigStorage('|') as (g_movieid,genres);
C = LOAD '/user/a3.csv' USING PigStorage('|') as (c_movieid,country_released);
D = JOIN A by movie_id, B by g_movieid;
E = JOIN D by g_movieid, C by c_movieid;
F = FOREACH E GENERATE movie_id,country,year,genre;
Any idea on how to achieve this using Pig?
try this,
Dump F;
(2343433,Argentina,2015,Sci-Fi)
(2343433,France,2015,Sci-Fi)
(2343433,Germany,2015,Sci-Fi)
(2343433,Netherlands,2015,Sci-Fi)
(2343433,Argentina,2015,Drama)
(2343433,France,2015,Drama)
(2343433,Germany,2015,Drama)
(2343433,Netherlands,2015,Drama)
(2343433,Argentina,2015,Family)
(2343433,France,2015,Family)
(2343433,Germany,2015,Family)
(2343433,Netherlands,2015,Family)
G = GROUP F BY (movie_id, country, year);
H = foreach G generate FLATTEN(group) as (movie_id, country, year), $1.$3 AS (genre:{T:(value:chararray)});
I = foreach H generate movie_id, country, year, FLATTEN(BagToTuple(genre.value));
Dump I;
(2343433,France,2015,Sci-Fi,Drama,Family)
(2343433,Germany,2015,Sci-Fi,Drama,Family)
(2343433,Argentina,2015,Sci-Fi,Drama,Family)
(2343433,Netherlands,2015,Sci-Fi,Drama,Family)

Count and find maximum number in Hadoop using pig

I have a table which contain sample CDR data in that column A and column B having calling person and called person mobile number
I need to find whose having maximum number of calls made(column A)
and also need to find to which number(column B) called most
the table structure is like below
calling called
889578226 77382596
889582256 77382596
889582256 7736368296
7785978214 782987522
in the above table 889578226 have most number of outgoing calls and 77382596 is most called number in such a way need to get the output
in hive i run like below
SELECT calling_a,called_b, COUNT(called_b) FROM cdr_data GROUP BY calling_a,called_b;
what might be the equalent code for the above query in pig?
Anas, Could you please let me know this is what you are expecting or something different?
input.txt
a,100
a,101
a,101
a,101
a,103
b,200
b,201
b,201
c,300
c,300
c,301
d,400
PigScript:
A = LOAD 'input.txt' USINg PigStorage(',') AS (name:chararray,phone:long);
B = GROUP A BY (name,phone);
C = FOREACH B GENERATE FLATTEN(group),COUNT(A) AS cnt;
D = GROUP C BY $0;
E = FOREACH D {
SortedList = ORDER C BY cnt DESC;
top = LIMIT SortedList 1;
GENERATE FLATTEN(top);
}
DUMP E;
Output:
(a,101,3)
(b,201,2)
(c,300,2)
(d,400,1)

select count distinct using pig latin

I need help with this pig script. I am just getting a single record. I am selecting 2 columns and doing a count(distinct) on another while also using a where like clause to find a particular description (desc).
Here's my sql with pig I am trying to code.
/*
For example in sql:
select domain, count(distinct(segment)) as segment_cnt
from table
where desc='ABC123'
group by domain
order by segment_count desc;
*/
A = LOAD 'myoutputfile' USING PigStorage('\u0005')
AS (
domain:chararray,
segment:chararray,
desc:chararray
);
B = filter A by (desc=='ABC123');
C = foreach B generate domain, segment;
D = DISTINCT C;
E = group D all;
F = foreach E generate group, COUNT(D) as segment_cnt;
G = order F by segment_cnt DESC;
You could GROUP on each domain and then count the number of distinct elements in each group with a nested FOREACH syntax:
D = group C by domain;
E = foreach D {
unique_segments = DISTINCT C.segment;
generate group, COUNT(unique_segments) as segment_cnt;
};
You can better define this as a macro:
DEFINE DISTINCT_COUNT(A, c) RETURNS dist {
temp = FOREACH $A GENERATE $c;
dist = DISTINCT temp;
groupAll = GROUP dist ALL;
$dist = FOREACH groupAll GENERATE COUNT(dist);
}
Usage:
X = LOAD 'data' AS (x: int);
Y = DISTINCT_COUNT(X, x);
If you need to use it in a FOREACH instead then the easiest way is something like:
...GENERATE COUNT(Distinct(x))...
Tested on Pig 12.
If you don't want to count on any group, you use this:
G = FOREACH (GROUP A ALL){
unique = DISTINCT A.field;
GENERATE COUNT(unique) AS ct;
};
This will just give you a number.

Resources