Count frequency of values in a column in PIG? - hadoop

I have something like this:
ColA ColB
a xxx
b yyy
c xxx
d yyy
e xxx
I need to find out the number of times each value of ColB occurs.
Output:
xxx 3
yyy 2
Here's what I've been trying:
Considering A has my data,
grunt> B = GROUP A by ColB;
grunt> DESCRIBE B;
B: {group: chararray,A: {(ColA: chararray,ColB: chararray)}}
Now I'm confused, do I do something like this?
grunt> C = FOREACH B GENERATE COUNT(B.ColB)
So I need the output to be like this,
xxx 3
yyy 2

I figured it out.
C = FOREACH B GENERATE GROUP AS ColB, COUNT(A) as count;

Use lower-case for 'group as'; it works for me:
C = FOREACH B GENERATE group as ColB, COUNT(A) as count;
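For reference, a minimal end-to-end sketch (assuming the two columns are tab-separated in a file called input.txt; the path, the delimiter, and the alias cnt, used to sidestep any clash with the COUNT keyword, are my own assumptions):
A = LOAD 'input.txt' USING PigStorage('\t') AS (ColA:chararray, ColB:chararray);
B = GROUP A BY ColB;
C = FOREACH B GENERATE group AS ColB, COUNT(A) AS cnt;
DUMP C;
-- expected from the sample data: (xxx,3) and (yyy,2)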

Related

Get week using date data and do some calculation in pig

My data is something like this:
(201601030637,2,64.001213)
(201601030756,3,63.5869656667)
(201601040220,2,62.758471)
where the first column is the year (2016), month (01), day (03), hour (06), and minutes (37) concatenated together.
I want to sum the values of the third column by week. How can I group them to get 52 different groups for the entire year? Can anyone help?
Thanks!
Use ToDate to convert your date string to the datetime type. Then use GetWeek to get the week number. Finally, use GROUP to group by weeknum and SUM.
A = LOAD '/path_to_data/data' USING PigStorage(',') as (c1: chararray, c2: int, c3: float);
B = FOREACH A GENERATE GetWeek(ToDate(c1,'yyyyMMddHHmm')) as weeknum, c1, c2, c3;
C = FOREACH (GROUP B BY weeknum) GENERATE group as weeknum, SUM(B.c3) as c3_sum;
DUMP C;
Use GetWeek to create a new column from the first column. Then group by the new column and use SUM. This assumes you have already loaded the data into a relation A with the schema above.
B = FOREACH A GENERATE $0, $1, $2, GetWeek(ToDate($0,'yyyyMMddHHmm')) as week_of_year;
C = GROUP B BY week_of_year;
D = FOREACH C GENERATE group, SUM(B.$2);
DUMP D;

Filter records in Pig

Below is the data
col1,col2,col3,col4,col5
------------------------
10,20,30,40,dollar
20,30,40,50,dollar
20,30,10,50,dollar
61,62,63,64,dollar
61,62,63,64,pound
col1, col2, col3 together form the unique key combination. The use case is to filter the data based on col5.
For each unique key combination, we need to filter out the record whose col5 value is "dollar", but only if the same combination also has a "pound" record.
The expected output is
col1,col2,col3,col4,col5
------------------------
10,20,30,40,dollar
20,30,40,50,dollar
20,30,10,50,dollar
61,62,63,64,pound
How do I proceed further, since there are no special operators in Pig like in Hive?
A = load 'test1.csv' using PigStorage(',') as (col1:int,col2:int,col3:int,col4:int,col5:chararray);
B = FILTER A BY col5 == 'pound';
Get all the records with 'pound', then get all the 'dollar' records whose key combination does not match any 'pound' record. Finally, marry them off with UNION.
B = FILTER A BY col5 == 'pound';                                   -- all 'pound' records
C = JOIN A BY (col1,col2,col3) LEFT OUTER, B BY (col1,col2,col3);
D = FILTER C BY (B::col1 is null);                                 -- rows of A whose key has no 'pound' match
E = FOREACH D GENERATE A::col1,A::col2,A::col3,A::col4,A::col5;
F = UNION B,E;                                                     -- 'pound' records plus the unmatched rows
DUMP F;

Pig script to find the max, min, avg, sum of Salary in each department

I get stuck after grouping the data by department number. The steps I followed:
grunt> A = load '/home/cloudera/naveen1/hive_data/emp_data.txt' using PigStorage(',') as (eno:int,ename:chararray,job:chararray,sal:float,comm:float,dno:int);
grunt> B = group A by dno;
grunt> describe B;
B: {group: int,A: {(eno: int,ename: chararray,job: chararray,sal: float,comm: float,dno: int)}}
Please let me know the steps after this. I am a bit confused about the nested FOREACH statement execution.
The data contains eno, ename, sal, job, commission, deptno, and I want to extract the max sal in each dept and the employee getting the highest salary.
Similarly for the min sal.
Use the aggregate functions after grouping.
C = FOREACH B GENERATE group,MAX(A.sal),MIN(A.sal),AVG(A.sal),SUM(A.sal);
DUMP C;
To get the name, eno and max sal in each dept, sort the records and get the top row:
C = FOREACH B {
    max_sal = ORDER A BY sal DESC;
    max_limit = LIMIT max_sal 1;
    GENERATE FLATTEN(max_limit);
}
DUMP C;
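Similarly, a minimal sketch for the minimum salary per department, reusing the relation B grouped above (the aliases C_min, min_sal, min_limit are just illustrative names), simply sorts ascending instead:
C_min = FOREACH B {
    min_sal = ORDER A BY sal ASC;
    min_limit = LIMIT min_sal 1;
    GENERATE FLATTEN(min_limit);
}
DUMP C_min;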

Combination of Union and Join in Apache Pig

I have two files in hdfs containing data as follows, File1:
id,name,age
1,x1,15
2,x2,14
3,x3,16
File2:
id,name,grades
1,x1,A
2,x2,B
4,y1,A
5,y2,C
I want to produce the following output :
id,name,age,grades
1,x1,15,A
2,x2,14,B
3,x3,16,
4,y1,,A
5,y2,,C
I am using Apache Pig to perform the operation; is it possible to get the above output in Pig? This is kind of a union and a join both.
As you can do unions and joins in Pig, this is of course possible.
Without digging into the exact syntax, I can tell you this should work (I have used similar solutions in the past).
Suppose we have A and B.
Take the first two columns of A and B to be A2 and B2.
Union A2 and B2 into M2.
Distinct M2.
Now you have your 'index' matrix, and we just need to add the extra columns.
Left join M2 with A and B.
Generate the relevant columns.
That's it!
A = load 'pdemo/File1' using PigStorage(',') as(id:int,name:chararray,age:chararray);
B = load 'pdemo/File2' using PigStorage(',') as(id:int,name:chararray,grades:chararray);
lj = join A by id left outer,B by id;
rj = join A by id right outer,B by id;
lj1 = foreach lj generate A::id as id,A::name as name,A::age as age,B::grades as grades;
rj1 = foreach rj generate B::id as id,B::name as name,A::age as age,B::grades as grades;
res = union lj1,rj1;
FinalResult = distinct res;
The 2nd approach is better in terms of performance:
A1 = foreach A generate id,name;
B1 = foreach B generate id,name;
M2 = union A1,B1;
M2 = distinct M2;
M2A = JOIN M2 by id left outer,A by id;
M2AB = JOIN M2A by M2::id left outer, B by id;
Res = foreach M2AB generate M2A::M2::id as id,M2A::M2::name as name,M2A::A::age as age,B::grades as grades;
Hope this will help!!
u1 = load 'PigDir/u1' using PigStorage(',') as (id:int,name:chararray,age:int);
u2 = load 'PigDir/u2' using PigStorage(',') as (id:int,name:chararray,grades:chararray);
uj = join u2 by id full outer, u1 by id;
-- after the full outer join the positional fields are:
-- $0=u2::id, $1=u2::name, $2=u2::grades, $3=u1::id, $4=u1::name, $5=u1::age
uif = foreach uj generate ($0 is null ? $3 : $0) as id, ($1 is null ? $4 : $1) as name, $5 as age, $2 as grades;

Top 3 records inside a group-by query in Pig

Sample data for my problem :
1 12 1234
2 12 1233
1 13 5555
1 15 4444
2 34 2222
7 89 1111
Field description:
col1 = cust_id, col2 = zip_code, col3 = transaction_id.
Using Pig scripting, I need to answer the following:
for each cust_id, I need to find the zip code most used in the last 3 transactions.
The approach I used so far:
1) Group records by cust_id:
(1,{(1,12,1234),(1,13,5555),(1,15,4444),(1,12,3333),(1,13,2323),(1,13,3434),(1,13,5755),(1,18,4424),(1,12,3383),(1,13,2823)})
(2,{(2,34,2222),(2,12,1233),(2,34,6666),(2,34,6666),(2,34,2422)})
(6,{(6,14,2312),(6,15,8888),(6,14,4634),(6,14,2712),(6,15,8288)})
(7,{(7,45,4244),(7,89,1111),(7,45,4544),(7,89,1121)})
2) Sort them and restrict to the latest 3 transactions.
Using a nested foreach, I sorted by transaction id and limited that to 3:
nested = foreach group_by { sor = order zip by $2 desc ; limi = limit sor 3 ; generate limi; };
After grouping, the data is:
({(1,12,1234),(1,13,2323),(1,13,2823)})
({(2,12,1233),(2,34,2222),(2,34,2422)})
({(6,14,2312),(6,14,2712),(6,14,4634)})
({(7,89,1111),(7,89,1121),(7,45,4244)})
Why is my data above not getting sorted in descending order?
It is not even in ascending order. Now how do I find the most used zip code for the last 3 transactions?
The result should be:
1) 13
2) 34
3) 14
4) 89
Can you try this?
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS (CustomerId:int,ZipCode:int,TransactionId:int);
B = GROUP A BY CustomerId;
-- keep only the last 3 transactions (highest TransactionId) per customer
C = FOREACH B {
    SortTxnId = ORDER A BY $2 DESC;
    TxnIdLimit = LIMIT SortTxnId 3;
    GENERATE group, TxnIdLimit;
}
D = FOREACH C GENERATE FLATTEN($1);
-- count how often each (CustomerId, ZipCode) pair occurs among those transactions
E = GROUP D BY ($0,$1);
F = FOREACH E GENERATE group, COUNT(D);
-- per customer, keep the (CustomerId, ZipCode) pair with the highest count
G = GROUP F BY group.$0;
I = FOREACH G {
    SortZipCode = ORDER F BY $1 DESC;
    ZipCodeLimit = LIMIT SortZipCode 1;
    GENERATE FLATTEN(ZipCodeLimit.group);
}
-- project just the ZipCode from the winning pair
J = FOREACH I GENERATE FLATTEN($0.TxnIdLimit::ZipCode);
DUMP J;
Output:
(13)
(34)
(14)
(89)
input.txt
1,12,1234
1,13,5555
1,15,4444
1,12,3333
1,13,5755
1,18,4424
2,34,2222
2,12,1233
2,33,6666
2,34,6666
2,34,2422
6,14,2312
6,15,8888
6,14,4634
6,14,2712
7,45,4244
7,89,1111
7,89,3111
7,89,1121
