Count and find maximum number in Hadoop using pig - hadoop

I have a table containing sample CDR data. Column A holds the calling person's mobile number and column B holds the called person's mobile number.
I need to find which number made the most calls (column A),
and also which number (column B) was called the most.
The table structure is as follows:
calling called
889578226 77382596
889582256 77382596
889582256 7736368296
7785978214 782987522
In the above table, 889582256 has the most outgoing calls and 77382596 is the most-called number. That is the output I need.
In Hive I run the following:
SELECT calling_a,called_b, COUNT(called_b) FROM cdr_data GROUP BY calling_a,called_b;
What would be the equivalent of the above query in Pig?

Anas, could you please let me know whether this is what you are expecting, or something different?
input.txt
a,100
a,101
a,101
a,101
a,103
b,200
b,201
b,201
c,300
c,300
c,301
d,400
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, phone:long);
-- Count occurrences of each (name, phone) pair
B = GROUP A BY (name, phone);
C = FOREACH B GENERATE FLATTEN(group), COUNT(A) AS cnt;
-- For each name, keep the pair with the highest count
D = GROUP C BY $0;
E = FOREACH D {
    SortedList = ORDER C BY cnt DESC;
    top = LIMIT SortedList 1;
    GENERATE FLATTEN(top);
}
DUMP E;
Output:
(a,101,3)
(b,201,2)
(c,300,2)
(d,400,1)
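Note that this script returns the most frequent (name, phone) pair for each name, whereas the question asks for two independent results: the number that placed the most calls and the number that received the most. As a cross-check of what those two results should be for the sample table in the question, here is a minimal plain-Python sketch. It is not Pig, and the function name most_common_value is made up for illustration. It mirrors a GROUP BY, COUNT, ORDER ... DESC, LIMIT 1 pipeline applied to each column separately:

```python
from collections import Counter

# Sample CDR rows from the question: (calling, called)
cdr = [
    ("889578226", "77382596"),
    ("889582256", "77382596"),
    ("889582256", "7736368296"),
    ("7785978214", "782987522"),
]

def most_common_value(rows, column):
    """Return (value, count) for the most frequent value in one column."""
    counts = Counter(row[column] for row in rows)
    return counts.most_common(1)[0]

top_caller = most_common_value(cdr, 0)  # most outgoing calls
top_called = most_common_value(cdr, 1)  # most incoming calls
print(top_caller)
print(top_called)
```

In Pig the same shape would presumably be two separate GROUP/COUNT/ORDER/LIMIT chains, one keyed on each column.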

Related

Hadoop Pig Max Command

I have a file that contains airport data for countries all over the world.
I want to find the country that has the most airports.
I have written the code below:
A = LOAD 'airports.dat' USING PigStorage(',') AS (AirportID:int,Name:chararray,City:chararray,Country:chararray,IATA:chararray,IATAothers:chararray,Latitude:float,Longitude:float,Altitude:float,Timezone:float,DST:chararray,Zone:chararray);
B = GROUP A BY Country;
C = FOREACH B GENERATE A.Country, COUNT(A) AS Count;
But after this I don't know how to find the maximum.
Can anybody please help?

You have already computed the number of airports per country. What you need now is the row with the highest count:
D = order C by $1 DESC;
E = limit D 1;
dump E;
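One caveat with ORDER ... LIMIT 1: if two countries share the highest count, only one of them is returned. The plain-Python sketch below shows the alternative of computing the maximum first and then keeping every row that matches it. It is illustrative only: the country names and counts are made-up sample data, and countries_with_max is a hypothetical helper.

```python
def countries_with_max(counts):
    """Return every (country, count) pair whose count equals the maximum."""
    if not counts:
        return []
    highest = max(count for _, count in counts)
    return [(country, count) for country, count in counts if count == highest]

# Made-up per-country airport counts, shaped like relation C above
sample = [("Aland", 3), ("Borduria", 7), ("Syldavia", 7)]
print(countries_with_max(sample))
```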

PIG replace for multiple columns

I have about 150 columns in total and want to find every \t and replace it with a space.
A = LOAD 'db.table' USING org.apache.hcatalog.pig.HCatLoader();
B = GROUP A ALL;
C = FOREACH B GENERATE REPLACE(B, '\\t', ' ');
STORE C INTO 'location';
This produces only the word ALL as output.
Is there a better way to replace across all columns at once?
Thank you
Nivi

You could do this with a Python UDF. Say you had some data like this with tabs in it:
Data:
hi there friend,whats up,nothing much
yo yo yo,green eggs, ham
You could write this in Python
UDF:
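@outputSchema("datums:{(no_tabs:chararray)}")
def remove_tabs(columns):
    # columns is the bag of all tuples produced by GROUP ... ALL
    try:
        # Replace tabs with spaces in every field of every tuple
        out = [tuple(map(lambda s: s.replace("\t", " "), x)) for x in columns]
        return out
    except:
        # On any failure, return a bag holding one single-field null tuple
        return [(None,)]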
and then in Pig
Query:
REGISTER 'remove_tabs.py' USING jython AS udf;
data = LOAD 'toy_data' USING PigStorage(',') AS (col0:chararray,
    col1:chararray, col2:chararray);
grpd = GROUP data all;
A = FOREACH grpd GENERATE FLATTEN(udf.remove_tabs(data));
DUMP A;
Output:
(hi there friend,whats up,nothing much)
(yo yo yo,green eggs,ham)
Obviously you have more than three columns, but since you are grouping by ALL and the UDF maps over every field of each tuple, the script should generalize to any number of columns.
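The body of the UDF is ordinary Python, so its logic can be checked locally before registering it with Pig. A minimal sketch, assuming the bag arrives as a list of tuples of strings. The outputSchema annotation is left out because it is only available inside Pig's Jython runtime, and the None check is an addition for null fields that the UDF above does not have:

```python
def remove_tabs(columns):
    """Replace tabs with spaces in every field of every tuple in the bag."""
    return [
        tuple(None if field is None else field.replace("\t", " ") for field in row)
        for row in columns
    ]

# Toy bag with embedded tabs and one null field (made-up test data)
bag = [
    ("hi\tthere\tfriend", "whats\tup", "nothing\tmuch"),
    ("yo\tyo\tyo", "green\teggs", None),
]
print(remove_tabs(bag))
```

Handling nulls per field keeps one bad value from discarding the whole bag, which is what the catch-all except in the UDF would do.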

Equivalent of Union_map in pig

I have been trying to find the equivalent of union_map() in Pig. I know that the TOMAP function produces a MAP datatype.
But the requirement is to combine all the maps for a given ID, as shown below.
select I1,UNION_MAP(MAP(Key,Val)) as new_val group by I1;
Sample input and the expected result are provided below.
Input
ID,Key,Val
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
select ID,UNION_MAP(TO_MAP(Key,VAL)) from table group by ID;
Result
ID1,(K1#V7,K2#V4)
ID2,(K1#V2,K3#V3)
I would like to get similar output in Pig.

Download piggybank.jar from http://www.java2s.com/Code/Jar/p/Downloadpiggybankjar.htm, add it to your classpath, and try the approach below.
input
ID1,K1,V1
ID2,K1,V2
ID2,K3,V3
ID1,K2,V4
ID1,K1,V7
PigScript:
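REGISTER '/tmp/piggybank.jar';
A = LOAD 'input' USING PigStorage(',') AS (ID:chararray, Key:chararray, Val:chararray);
-- RANK prepends a row number (rank_A) reflecting the order in which rows were read
B = RANK A;
-- For each (ID, Key), keep only the row with the highest rank, i.e. the last one read
C = GROUP B BY (ID, Key);
D = FOREACH C {
    sortByRank = ORDER B BY rank_A DESC;
    top1 = LIMIT sortByRank 1;
    GENERATE FLATTEN(top1);
}
-- Build one single-entry map per surviving row, collected per ID
E = GROUP D BY top1::ID;
F = FOREACH E {
    ToMap = FOREACH D GENERATE TOMAP(top1::Key, top1::Val);
    GENERATE group, BagToTuple(ToMap) AS myMap;
}
DUMP F;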
Output:
(ID1,([K1#V7],[K2#V4]))
(ID2,([K1#V2],[K3#V3]))
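For reference, the behaviour this script emulates (one map per ID, with the last value read winning when a key repeats) can be written in a few lines of plain Python. This is only a sketch of the intended semantics, not Pig code, and union_map here is a hypothetical helper named after the Hive function:

```python
def union_map(rows):
    """Merge (id, key, value) rows into one dict per id; later rows win."""
    merged = {}
    for row_id, key, value in rows:
        merged.setdefault(row_id, {})[key] = value
    return merged

# Sample input from the question
rows = [
    ("ID1", "K1", "V1"),
    ("ID2", "K1", "V2"),
    ("ID2", "K3", "V3"),
    ("ID1", "K2", "V4"),
    ("ID1", "K1", "V7"),
]
print(union_map(rows))
```

Note that the Pig output above is a tuple of single-entry maps per ID rather than one merged map, so it matches the requested result in content but not exactly in type.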

Why is my PIG script failing?

I'm running a Pig script in which I join two datasets. I verified individually that both datasets have the expected schema, and when I dump them to output I can see their contents.
The strange issue is that my Pig job fails 9 out of 10 times but succeeds once in a while. The only error message I get is "Internal error creating job configuration", and it appears when I try to perform the JOIN. Unfortunately, I don't have CLI access to the cluster, so I can't get a more detailed error message (jobs are submitted through a REST API). The Pig version is 0.8.
Both datasets are fairly small and can be held in memory, so I'm using the replicated option. I also tried without the replicated option, and it fails as well.
Any idea what is happening and how I can solve this problem?
SET mapred.map.tasks.speculative.execution false;
X = LOAD 'data1' using ....; -- using custom loader
P = LOAD 'data2' using ....; -- using custom loader
-- dataset1
Y = FILTER X BY type MATCHES 'Info' AND name MATCHES 'someregex';
Z = GROUP Y BY name;
A = FOREACH Z GENERATE group, $1.data;
B = FOREACH A GENERATE $0, FLATTEN(udf1($1));
C = GROUP B ALL;
D = FOREACH C GENERATE FLATTEN(udf2($1)) as (metric1:chararray, count:double, avg: double);
--dataset2
Q = FILTER P BY type MATCHES 'Info' AND name MATCHES 'someother_regex';
R = FOREACH Q GENERATE REPLACE(name,'Percent','') , udf3(records);
S = GROUP R BY $0;
T = FOREACH S GENERATE $0, FLATTEN(udf4(R.$1)) as (q1: double, q2: double, q3: double, q4: double);
-- Each tuple in T is like this: (metric2:chararray, q1: double, q2: double, q3: double, q4: double)
----join the two dataset
I = JOIN D BY $0, T BY $0 USING 'replicated';
STORE I into '$output' using PigStorage(',');

get things out of bag in pig

In the pig example:
A = LOAD 'student.txt' AS (name:chararray, term:chararray, gpa:float);
DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
B = GROUP A BY name;
DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
C = FOREACH B GENERATE A.name, AVG(A.gpa);
DUMP C;
({(John),(John),(John),(John)},3.850000023841858)
({(Mary),(Mary),(Mary),(Mary)},3.925000011920929)
In the last output, A.name is a bag. How can I get the values out of the bag, so that the result looks like this?
(John, 3.850000023841858)
(Mary, 3.925000011920929)

GROUP creates a special field called group, which holds the value you grouped on. It exists for exactly this purpose.
B = GROUP A BY name;
C = FOREACH B GENERATE group AS name, AVG(A.gpa);
Run DESCRIBE B; and you'll see that group is in there. It is a single value (or a tuple, if you grouped on several fields) that represents what was in the BY ... part of the GROUP.
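The same idea in plain Python, as a rough analogy rather than Pig code: the dictionary key plays the role of group, and the list of values plays the role of the bag A.gpa. The name average_by_name is illustrative only.

```python
from collections import defaultdict

def average_by_name(records):
    """Group (name, term, gpa) records by name and average the gpa values."""
    grouped = defaultdict(list)  # key acts like `group`, value like the bag
    for name, _term, gpa in records:
        grouped[name].append(gpa)
    return {name: sum(gpas) / len(gpas) for name, gpas in grouped.items()}

# Student records from the example above
students = [
    ("John", "fl", 3.9), ("John", "wt", 3.7), ("John", "sp", 4.0), ("John", "sm", 3.8),
    ("Mary", "fl", 3.8), ("Mary", "wt", 3.9), ("Mary", "sp", 4.0), ("Mary", "sm", 4.0),
]
print(average_by_name(students))
```

The extra trailing digits in the Pig output (3.850000023841858 rather than 3.85) most likely come from gpa being declared as a 32-bit float; the Python version uses 64-bit floats, so its results will differ slightly in the last digits.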

Resources