Count the 1s and 0s by group in Pig - hadoop

How can I count how many 1s and 0s there are for each type of event? I'm doing all of this in Pig, and the second field only ever contains 1 or 0.
The data looks like this:
(pageLoad,1)
(pageLoad,0)
(pageLoad,1)
(appLaunch,1)
(appLaunch,0)
(otherEvent,1)
(otherEvent,0)
(event,1)
(event,1)
(event,0)
(somethingelse,0)
The output should look something like this:
pageLoad 1:234 0:2359
appLaunch 1:54 0:111
event 1:345 0:0
or
type 1 0
pageLoad 21 345
appLaunch 0 123
event 234 12
Thanks everyone.

Input :
pageLoad,1
pageLoad,0
pageLoad,1
appLaunch,1
appLaunch,0
otherEvent,1
otherEvent,0
event,1
event,1
event,0
somethingelse,0
Pig Script :
A = LOAD 'input.csv' USING PigStorage(',') AS (event_type:chararray, status:int);
B = GROUP A BY event_type;
req = FOREACH B {
    -- split each group's bag by status value, then count each sub-bag
    event_type_1 = FILTER A BY status == 1;
    event_type_0 = FILTER A BY status == 0;
    GENERATE group AS event_type,
             COUNT(event_type_1) AS event_type_1_count,
             COUNT(event_type_0) AS event_type_0_count;
};
DUMP req;
Output :
(event,2,1)
(pageLoad,2,1)
(appLaunch,1,1)
(otherEvent,1,1)
(somethingelse,0,1)
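A variation worth noting: since the question says the second field only contains 1 or 0, SUM over that field counts the 1s directly, avoiding the nested FILTERs. A minimal sketch, assuming the same input.csv layout as above:
-- SUM of an indicator column counts the 1s; the 0s are the remaining rows
A = LOAD 'input.csv' USING PigStorage(',') AS (event_type:chararray, status:int);
B = GROUP A BY event_type;
req = FOREACH B GENERATE
    group AS event_type,
    SUM(A.status) AS ones_count,
    COUNT(A) - SUM(A.status) AS zeros_count;
DUMP req;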

Related

How to identify customers who didn't make/receive incoming calls, outgoing calls, or use the internet during the churn phase?

I'm trying to solve a problem with the data set below:
Cust_Id period Total_Incoming_Call Total_outgoing_call Net_uses
123 09/01/2018 0 0 2
234 09/02/2018 0 0 0
345 09/03/2018 1 40 1
abc1 09/04/2018 0 0 0
I'd like to get the output in below:
Cust_Id Period Total_Incoming_call Total_outgoing_call Net_uses
234 09/02/2018 0 0 0
abc1 09/04/2018 0 0 0
I know how to filter on one column of a pandas DataFrame, but I'm not sure how to filter on multiple columns at once so I can tag those customers as churned.
cust = pd.read_csv(....../.csv)
cust = cust[cust.net_uses == 0]
cust = cust[cust.Total_incoming_call ==0]
Should I use the line below, or is there a better method?
cust = cust[(cust.total_incoming_call==0)&(cust.net_uses ==0)]
cust = cust[(cust.total_incoming_call == 0) & (cust.net_uses == 0)] works just fine.
You can also use .loc for the same purpose:
cust = cust.loc[(cust.total_incoming_call == 0) & (cust.net_uses == 0), :]
In case you want to keep the DataFrame's shape and just replace the values for which the condition is False (with NaN):
cust = cust.where((cust.total_incoming_call == 0) & (cust.net_uses == 0))

Reshape data in pig - change row values to column names

Is there a way to reshape the data in pig?
The data looks like this -
id | p1 | count
1 | "Accessory" | 3
1 | "clothing" | 2
2 | "Books" | 1
I want to reshape the data so that the output would look like this--
id | Accessory | clothing | Books
1 | 3 | 2 | 0
2 | 0 | 0 | 1
Can anyone suggest a way to do this?
If it's a fixed set of product lines, the code below might help; otherwise you can write a custom UDF to achieve the objective.
Input : a.csv
1|Accessory|3
1|Clothing|2
2|Books|1
Pig Snippet :
test = LOAD 'a.csv' USING PigStorage('|') AS (product_id:long, product_name:chararray, rec_cnt:long);
req_stats = FOREACH (GROUP test BY product_id) {
    accessory = FILTER test BY product_name == 'Accessory';
    clothing = FILTER test BY product_name == 'Clothing';
    books = FILTER test BY product_name == 'Books';
    GENERATE group AS product_id,
             (IsEmpty(accessory) ? '0' : BagToString(accessory.rec_cnt)) AS a_cnt,
             (IsEmpty(clothing) ? '0' : BagToString(clothing.rec_cnt)) AS c_cnt,
             (IsEmpty(books) ? '0' : BagToString(books.rec_cnt)) AS b_cnt;
};
DUMP req_stats;
Output :
(1,3,2,0)
(2,0,0,1)
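If you would rather keep the pivoted counts numeric instead of building strings with BagToString, the same pivot can be done by summing per-row indicator values. A sketch, assuming Pig 0.10+ (for the nested FOREACH) and the same a.csv layout:
test = LOAD 'a.csv' USING PigStorage('|') AS (product_id:long, product_name:chararray, rec_cnt:long);
req_stats = FOREACH (GROUP test BY product_id) {
    -- rec_cnt where the product line matches, 0 otherwise; SUM keeps the type numeric
    flags = FOREACH test GENERATE
        (product_name == 'Accessory' ? rec_cnt : 0L) AS a,
        (product_name == 'Clothing' ? rec_cnt : 0L) AS c,
        (product_name == 'Books' ? rec_cnt : 0L) AS b;
    GENERATE group AS product_id, SUM(flags.a) AS a_cnt, SUM(flags.c) AS c_cnt, SUM(flags.b) AS b_cnt;
};
DUMP req_stats;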

Advice to make my Pig code below simpler

Here is my code; it does two GROUP ALL operations and works. My goal is to generate the unique user count of all students together with their total scores, plus the unique user count of students located in CA. Is there a way to simplify the code to use only one GROUP operation, or any other constructive ideas to make it simpler, for example using only one FOREACH? Thanks.
student_all = group student all;
student_all_summary = FOREACH student_all GENERATE COUNT_STAR(student) as uu_count, SUM(student.mathScore) as count1,SUM(student.verbScore) as count2;
student_CA = filter student by LID==1;
student_CA_all = group student_CA all;
student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA);
Sample input (student ID, location ID, mathScore, verbScore):
1 1 10 20
2 1 20 30
3 1 30 40
4 2 30 50
5 2 30 50
6 3 30 50
Sample output (unique user count, unique user count in CA, sum of mathScore of all students, sum of verbScore of all students):
6 3 150 240
thanks in advance,
Lin
You might be looking for this.
data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int,lid:int, ms:int, vs:int);
gdata = group data all;
result = foreach gdata {
    student_CA = filter data by lid == 1;
    student_CA_sum = SUM(student_CA.sid);
    student_CA_count = COUNT(student_CA.sid);
    mathScore = SUM(data.ms);
    verbScore = SUM(data.vs);
    GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore as mathScore, verbScore as verbScore;
};
Output is:
grunt> dump result
(6,3,150,240)
grunt> describe result
result: {student_CA_sum: long,student_CA_count: long,mathScore: long,verbScore: long}
First load the file (student) into the Hadoop file system, then perform the actions below.
split student into student_CA if locationId == 1, student_Other if locationId != 1;
student_CA_all = group student_CA all;
student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA) as uu_count,COUNT_STAR(student_CA)as locationCACount, SUM(student_CA.mathScore) as mScoreCount,SUM(student_CA.verbScore) as vScoreCount;
student_Other_all = group student_Other all;
student_Other_all_summary = FOREACH student_Other_all GENERATE COUNT_STAR(student_Other) as uu_count,0 as locationOtherCount:long, SUM(student_Other.mathScore) as mScoreCount,SUM(student_Other.verbScore) as vScoreCount;
student_CAandOther_all_summary = UNION student_CA_all_summary, student_Other_all_summary;
student_summary_all = group student_CAandOther_all_summary all;
student_summary = foreach student_summary_all generate SUM(student_CAandOther_all_summary.uu_count) as studentIdCount, SUM(student_CAandOther_all_summary.locationCACount) as locationCount, SUM(student_CAandOther_all_summary.mScoreCount) as mathScoreCount , SUM(student_CAandOther_all_summary.vScoreCount) as verbScoreCount;
output:
dump student_summary;
(6,3,150,240)
Hope this helps :)
While solving your problem, I also encountered an issue with Pig, which I assume is caused by improper exception handling in the UNION command: if you execute that command, it can hang your command-line prompt without a proper error message. If you want, I can share the snippet for that.
The accepted answer has a logical error.
Try it with the input file below:
1 1 10 20
2 1 20 30
3 1 30 40
4 2 30 50
5 2 30 50
6 3 30 50
7 1 10 10
The output will be
(13,4,160,250)
The output should be
(7,4,160,250)
I have modified the script to work correctly:
data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int,lid:int, ms:int, vs:int);
gdata = group data all;
result = foreach gdata {
    -- despite the name, this counts ALL students, not just CA
    student_CA_sum = COUNT(data.sid);
    student_CA = filter data by lid == 1;
    student_CA_count = COUNT(student_CA.sid);
    mathScore = SUM(data.ms);
    verbScore = SUM(data.vs);
    GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore as mathScore, verbScore as verbScore;
};
Output
(7,4,160,250)
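One caveat that applies to both scripts: COUNT(data.sid) counts rows, not distinct students, so it only gives a truly unique user count if sid never repeats. If it can repeat, a DISTINCT inside the nested block fixes that; a sketch, assuming the same aliases as above:
result = foreach gdata {
    -- repeated sids are counted once thanks to DISTINCT
    uniq_ids = DISTINCT data.sid;
    student_CA = filter data by lid == 1;
    GENERATE COUNT(uniq_ids) as uu_count,
             COUNT(student_CA) as ca_count,
             SUM(data.ms) as mathScore,
             SUM(data.vs) as verbScore;
};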

Query regarding Pig - how to put an if-like condition in a FOREACH

I have a question about writing a Pig script.
RESULT_SOMETYPE = FOREACH SOMETYPE_DATA_GROUPED GENERATE flatten(group) , SUM(SOMETYPEDATA.DURATION) as duration, COUNT(SOMETYPEDATA.DURATION) as cnt;
Here I want to replace SUM(SOMETYPEDATA.DURATION) with a bucket number, like:
if (0 < Sum < 1000) then put 1
if (1001 < Sum < 2000) then put 2
if (2001 < Sum < 3000) then put 3
How to achieve this in Pig?
Please suggest.
SPLIT will do that, but not inside the FOREACH. Pig also has a ternary-like operator, but that will not help for storing the result in a variable. Here is how you can use SPLIT to achieve something close to your requirement:
A = LOAD '/home/vignesh/a.dat' using PigStorage(',') as (a:int,b:int,c:int);
SPLIT A INTO B IF (a > 0 AND a < 1000), C IF (a > 1001 AND a<2000), D IF (a > 2001 AND a < 3000);
We can use either the bincond operator (?:) or a CASE statement (from Pig 0.12 onwards) to achieve the objective.
RESULT_SOMETYPE = FOREACH SOMETYPE_DATA_GROUPED GENERATE flatten(group) AS grp_name , SUM(SOMETYPEDATA.DURATION) as duration_sum, COUNT(SOMETYPEDATA.DURATION) as cnt;
result_required = FOREACH RESULT_SOMETYPE GENERATE grp_name,
    (duration_sum > 0 AND duration_sum < 1000 ? 1 :
        (duration_sum > 1001 AND duration_sum < 2000 ? 2 :
            (duration_sum > 2001 AND duration_sum < 3000 ? 3 : 9999)
        )
    ) AS duration, cnt;
Refer : http://pig.apache.org/docs/r0.12.0/basic.html#arithmetic
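Since the CASE statement is mentioned above but not shown, here is a sketch of the same bucketing written with CASE (Pig 0.12+). As in the bincond version, the boundary values (exactly 1000, 1001, 2000, and so on) fall through to the 9999 default:
result_required = FOREACH RESULT_SOMETYPE GENERATE grp_name,
    (CASE
        WHEN duration_sum > 0 AND duration_sum < 1000 THEN 1
        WHEN duration_sum > 1001 AND duration_sum < 2000 THEN 2
        WHEN duration_sum > 2001 AND duration_sum < 3000 THEN 3
        ELSE 9999
    END) AS duration, cnt;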

Split file into 4 equal parts using apache Pig

I want to split a file into 4 equal parts using Apache Pig. For example, if a file has 100 lines, the first 25 should go to the 1st output file, and so on; the last 25 lines should go to the 4th output file. Can someone help me achieve this? I am using Apache Pig because the number of records in the file will be in the millions, and the previous steps that generate the file to be split already use Pig.
I did a bit of digging on this, because it comes up in the Hortonworks sample exam for Hadoop. It doesn't seem to be well documented, but it's quite simple really. In this example I was using the Country sample database offered for download on dev.mysql.com:
grunt> storeme = order data by $0 parallel 3;
grunt> store storeme into '/user/hive/countrysplit_parallel';
Then if we have a look at the directory in hdfs:
[root@sandbox arthurs_stuff]# hadoop fs -ls /user/hive/countrysplit_parallel
Found 4 items
-rw-r--r-- 3 hive hdfs 0 2016-04-08 10:19 /user/hive/countrysplit_parallel/_SUCCESS
-rw-r--r-- 3 hive hdfs 3984 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00000
-rw-r--r-- 3 hive hdfs 4614 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00001
-rw-r--r-- 3 hive hdfs 4768 2016-04-08 10:19 /user/hive/countrysplit_parallel/part-r-00002
Hope that helps.
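If you specifically need four parts, the same trick with PARALLEL 4 should yield four part files; a sketch reusing the data alias from the session above (the output path here is just a hypothetical name, and the parts are balanced by the reducers' key distribution rather than an exact 25% line split):
storeme = order data by $0 parallel 4;
store storeme into '/user/hive/countrysplit_parallel_4';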
You can use some of the Pig features below to achieve your desired result:
SPLIT function http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#SPLIT
MultiStorage class : https://pig.apache.org/docs/r0.10.0/api/org/apache/pig/piggybank/storage/MultiStorage.html
Write custom PIG storage : https://pig.apache.org/docs/r0.7.0/udf.html#Store+Functions
You have to provide some condition based on your data.
This would do it, but there may be a better option.
A = LOAD 'file' using PigStorage() as (line:chararray);
B = RANK A;
C = FILTER B BY rank_A >= 1 and rank_A <= 25;
D = FILTER B BY rank_A > 25 and rank_A <= 50;
E = FILTER B BY rank_A > 50 and rank_A <= 75;
F = FILTER B BY rank_A > 75 and rank_A <= 100;
store C into 'file1';
store D into 'file2';
store E into 'file3';
store F into 'file4';
My requirement changed a bit: I have to store only the first 25% of the data in one file and the rest in another. Here is the Pig script that worked for me.
ip_file = LOAD 'input file' using PigStorage('|');
rank_file = RANK ip_file by $2;
rank_group = GROUP rank_file ALL;
with_max = FOREACH rank_group GENERATE COUNT(rank_file),FLATTEN(rank_file);
top_file = filter with_max by $1 <= $0/4;
rest_file = filter with_max by $1 > $0/4;
sort_top_file = order top_file by $1 parallel 1;
store sort_top_file into 'output file 1' using PigStorage('|');
store rest_file into 'output file 2' using PigStorage('|');
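For the original four-way requirement, the same COUNT-plus-RANK idea extends naturally. A hedged sketch, assuming the '|' delimiter from the answer above and hypothetical output names; $0 is the prepended total row count, $1 is the rank, and integer division leaves any remainder rows in the later parts:
ip_file = LOAD 'input file' USING PigStorage('|');
rank_file = RANK ip_file;
rank_group = GROUP rank_file ALL;
-- prepend the total row count to every row, then bucket rows by rank quartile
with_max = FOREACH rank_group GENERATE COUNT(rank_file), FLATTEN(rank_file);
q1 = FILTER with_max BY $1 <= $0/4;
q2 = FILTER with_max BY $1 > $0/4 AND $1 <= $0/2;
q3 = FILTER with_max BY $1 > $0/2 AND $1 <= 3*$0/4;
q4 = FILTER with_max BY $1 > 3*$0/4;
STORE q1 INTO 'part1' USING PigStorage('|');
STORE q2 INTO 'part2' USING PigStorage('|');
STORE q3 INTO 'part3' USING PigStorage('|');
STORE q4 INTO 'part4' USING PigStorage('|');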
