advice to make my below Pig code simple - hadoop

Here is my code and I do two group all operations and my code works. My purpose is to generate all student unique user count with their total scores, student located in CA unique user count. Wondering if good advice to make my code simple to use only one group operation, or any constructive ideas to make code simple, for example using only one FOREACH operation? Thanks.
student_all = group student all;
student_all_summary = FOREACH student_all GENERATE COUNT_STAR(student) as uu_count, SUM(student.mathScore) as count1,SUM(student.verbScore) as count2;
student_CA = filter student by LID==1;
student_CA_all = group student_CA all;
student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA);
Sample input (student ID, location ID, mathScore, verbScore),
1 1 10 20
2 1 20 30
3 1 30 40
4 2 30 50
5 2 30 50
6 3 30 50
Sample output (unique user, unique user in CA, sum of mathScore of all students, sum of verb Score of all students),
7 3 150 240
thanks in advance,
Lin

You might be looking for this.
data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int,lid:int, ms:int, vs:int);
gdata = group data all;
result = foreach gdata {
student_CA = filter data by lid == 1;
student_CA_sum = SUM( student_CA.sid ) ;
student_CA_count = COUNT( student_CA.sid ) ;
mathScore = SUM(data.ms);
verbScore = SUM(data.vs);
GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore as mathScore, verbScore as verbScore;
};
Output is:
grunt> dump result
(6,3,150,240)
grunt> describe result
result: {student_CA_sum: long,student_CA_count: long,mathScore: long,verbScore: long}

first load the file(student)in hadoop file system. The perform the below action.
split student into student_CA if locationId == 1, student_Other if locationId != 1;
student_CA_all = group student_CA all;
student_CA_all_summary = FOREACH student_CA_all GENERATE COUNT_STAR(student_CA) as uu_count,COUNT_STAR(student_CA)as locationCACount, SUM(student_CA.mathScore) as mScoreCount,SUM(student_CA.verbScore) as vScoreCount;
student_Other_all = group student_Other all;
student_Other_all_summary = FOREACH student_Other_all GENERATE COUNT_STAR(student_Other) as uu_count,0 as locationOtherCount:long, SUM(student_Other.mathScore) as mScoreCount,SUM(student_Other.verbScore) as vScoreCount;
student_CAandOther_all_summary = UNION student_CA_all_summary, student_Other_all_summary;
student_summary_all = group student_CAandOther_all_summary all;
student_summary = foreach student_summary_all generate SUM(student_CAandOther_all_summary.uu_count) as studentIdCount, SUM(student_CAandOther_all_summary.locationCACount) as locationCount, SUM(student_CAandOther_all_summary.mScoreCount) as mathScoreCount , SUM(student_CAandOther_all_summary.vScoreCount) as verbScoreCount;
output:
dump student_summary;
(6,3,150,240)
Hope this helps :)
While solving your problem, I also encountered an issue with PIG. I assume it is because of improper exception handling done in UNION command. Actually, it can hang you command line prompt, if you execute that command, without proper error message. If you want I can share you the snippet for that.

The answer accepted has an logical error.
Try to have the below input file
1 1 10 20
2 1 20 30
3 1 30 40
4 2 30 50
5 2 30 50
6 3 30 50
7 1 10 10
The output will be
(13,4,160,250)
The output should be
(7,4.170,260)
I have modified the script to work correct.
data = load '/tmp/temp.csv' USING PigStorage(' ') as (sid:int,lid:int, ms:int, vs:int);
gdata = group data all;
result = foreach gdata {
student_CA_sum = COUNT( data.sid ) ;
student_CA = filter data by lid == 1;
student_CA_count = COUNT( student_CA.sid ) ;
mathScore = SUM(data.ms);
verbScore = SUM(data.vs);
GENERATE student_CA_sum as student_CA_sum, student_CA_count as student_CA_count, mathScore as mathScore, verbScore as verbScore;
};
Output
(7,4,160,250)

Related

netcool omnibus probe rules

I have following probable values for $6 coming to netcool omnibus rules.
I would like to extract the InstanceName from $6
Eg: SQLSERVER1
SQL2012TESTRTM
SQL2012TESTSTD1
SQL2014STD
MSSQLSERVER
Below are $6 values
6 = "Microsoft.SQLServer.DBEngine:TM-B33F-FAD4.cap.dev.net;SQLSERVER1:1"
6 = "Microsoft.SQLServer.2012.Agent:TM-B33F-FAD4.cap.dev.net;SQL2012TESTRTM;SQLAgent$SQL2012TESTRTM:1"
6 = "Microsoft.SQLServer.2012.Agent:TM-B33F-FAD4.cap.dev.net;SQL2012TESTRTM;SQLAgent$SQL2012TESTRTM:1"
6 = "Microsoft.SQLServer.2012.Agent:TM-B33F-FAD4.cap.dev.net;SQL2012TESTSTD1;SQLAgent$SQL2012TESTSTD1:1"
6 = "Microsoft.SQLServer.Database:TM-B33F-FAD4.cap.dev.net;SQL2012TESTSTD1;DB2:1"
6 = "Microsoft.SQLServer.2012.Agent:TM-B33F-FAD4.cap.dev.net;SQL2012TESTRTM;SQLAgent$SQL2012TESTRTM:1"
6 = "Microsoft.SQLServer.2014.Agent:TM-B33F-FAD4.cap.dev.net;SQL2014STD;SQLAgent$SQL2014STD:1"
6 = "Microsoft.SQLServer.Database:TM-B33F-FAD4.cap.dev.net;SQL2012TESTSTD1;DB2:1"
6 = "Microsoft.SQLServer.2014.DBEngine:TM-B33F-FAD4.cap.dev.net;SQL2014STD:1"
6 = "Microsoft.SQLServer.Database:TM-B33F-FAD4.cap.dev.net;SQL2012TESTSTD1;DB2:1"
6 = "Microsoft.SQLServer.2014.Agent:TM-B33F-FAD4.cap.dev.net;SQL2014STD;SQLAgent$SQL2014STD:1"
6 = "Microsoft.SQLServer.Database:TM-B33F-FAD4.cap.dev.net;SQL2012TURKSTD1;DB1:1"
6 = "Microsoft.SQLServer.2014.DBFile:CTNTV01;MSSQLSERVER;SPOT;1;35:1"
6 = "Microsoft.SQLServer.Library.EventLogCollectionTarget:TM-B33F-FAD4.cap.dev.net:1"
I have tried below code to extract, it works for most of them above.
#temp = extract($6, ";([^\:]+)\:")
if (regmatch(#temp, "[\;]"))
{
#temp = extract(#temp, "([^\:]+)\;")
}
But it does not work for
Microsoft.SQLServer.2014.DBFile:CTNTV01;MSSQLSERVER;SPOT;1;35:1
I believe the second extract inside if statement needs to be corrected little more.
It extracts until MSSQLSERVER;SPOT;1, however I only want MSSQLSERVER from it.
Can you please help in correcting this.
Try with below.
#temp = extract($6, ";([^\:]+)\:")
if (regmatch(#temp, "[\;]"))
{
#temp = extract(#temp, "([^\;]+)\;")
}

Count the 1 and 0 by group in Pig

How to count how many 1 and 0 here for each type of events? I'm doing all this in pig and there's only 1 and 0 in the second field.
The data looks like this:
(pageLoad,1)
(pageLoad,0)
(pageLoad,1)
(appLaunch,1)
(appLaunch,0)
(otherEvent,1)
(otherEvent,0)
(event,1)
(event,1)
(event,0)
(somethingelse,0)
The output will be something like this
pageLoad 1:234 0:2359
appLaunch 1:54 0:111
event 1:345 0:0
or
type 1 0
pageLoad 21 345
appLaunch 0 123
event 234 12
Thanks everyone.
Input :
pageLoad,1
pageLoad,0
pageLoad,1
appLaunch,1
appLaunch,0
otherEvent,1
otherEvent,0
event,1
event,1
event,0
somethingelse,0
Pig Script :
A = LOAD 'input.csv' USING PigStorage(',') AS (event_type:chararray,status:int);
B = GROUP A BY event_type;
req = FOREACH B {
event_type_1 = FILTER A BY status==1;
event_type_0 = FILTER A BY status==0;
GENERATE group AS event_type, COUNT(event_type_1) AS event_type_1_count, COUNT(event_type_0) AS event_type_0_count;
};
DUMP req;
Output :
(event,2,1)
(pageLoad,2,1)
(appLaunch,1,1)
(otherEvent,1,1)
(somethingelse,0,1)

Query Regarding PIG- How to put a if like condition in ForEach

I have a query wrt writing pig script
RESULT_SOMETYPE = FOREACH SOMETYPE_DATA_GROUPED GENERATE flatten(group) , SUM(SOMETYPEDATA.DURATION) as duration, COUNT(SOMETYPEDATA.DURATION) as cnt;
Here I want to replace SUM(SOMETYPEDATA.DURATION) with some number like
if(0>Sum > 1000) then put 1
if(1001> Sum > 2000 ) then put 2
if(2001> Sum > 3000 ) then put 3
How to acheive this in pig
Please suggest
SPLIT will do that but not inside the FOREACH loop. Pig also has a ternary operator kind of thing but that will not be helpful to store the result in a variable. Here is how you can use SPLIT to achieve something close to your requirement.
A = LOAD '/home/vignesh/a.dat' using PigStorage(',') as (a:int,b:int,c:int);
SPLIT A INTO B IF (a > 0 AND a < 1000), C IF (a > 1001 AND a<2000), D IF (a > 2001 AND a < 3000);
We can use either bincond operator (?:) or CASE statement (from Pig Version : 0.12 on wards) to achieve the objective.
RESULT_SOMETYPE = FOREACH SOMETYPE_DATA_GROUPED GENERATE flatten(group) AS grp_name , SUM(SOMETYPEDATA.DURATION) as duration_sum, COUNT(SOMETYPEDATA.DURATION) as cnt;
result_required = FOREACH RESULT_SOMETYPE GENEATE grp_name,
(duration_sum > 0 AND duration_sum < 1000 ? 1 :
(duration_sum > 1001 AND duration_sum < 2000 ? 2 :
(duration_sum > 2001 AND duration_sum < 3000 ? 3 : 9999)
)
) AS duration, cnt;
Refer : http://pig.apache.org/docs/r0.12.0/basic.html#arithmetic

Pig: Group By, Average, and Order By

I am new to pig and I have a text file where each line contains a different record of information in the following format:
name, year, count, uniquecount
For example:
Zverkov winced_VERB 2004 8 8
Zverkov winced_VERB 2008 4 4
Zverkov winced_VERB 2009 1 1
zvlastni _ADV_ 1913 1 1
zvlastni _ADV_ 1928 2 2
zvlastni _ADV_ 1929 3 2
I want to group all the records by their unique names, then for each unique name calculate count/uniquecount, and finally sort the output by this calculated value.
Here is what I have been trying:
bigrams = LOAD 'input/bigram/zv.gz' AS (bigram:chararray, year:int, count:float, books:float);
group_bigrams = GROUP bigrams BY bigram;
average_bigrams = FOREACH group_bigrams GENERATE group, SUM(bigrams.count) / SUM(bigrams.books) AS average;
sorted_bigrams = ORDER average_bigrams BY average;
It seems my original code does produce the desired output with one minor change:
bigrams = LOAD 'input/bigram/zv.gz' AS (bigram:chararray, year:int, count:float, books:float);
group_bigrams = GROUP bigrams BY bigram;
average_bigrams = FOREACH group_bigrams GENERATE group, SUM(bigrams.count)/SUM(bigrams.books) AS average;
sorted_bigrams = ORDER average_bigrams BY average DESC, group ASC;

Pig 0.11.1 - Count groups in a time range

I have a dataset, A, that has timestamp, visitor, URL:
(2012-07-21T14:00:00.000Z, joe, hxxp:///www.aaa.com)
(2012-07-21T14:01:00.000Z, mary, hxxp://www.bbb.com)
(2012-07-21T14:02:00.000Z, joe, hxxp:///www.aaa.com)
I want to measure number of visits per user per URL in a time window of say, 10 minutes, but as a rolling window that increments by the minute. Output would be:
(2012-07-21T14:00 to 2012-07-21T14:10, joe, hxxp://www.aaa.com, 2)
(2012-07-21T14:01 to 2012-07-21T14:11, joe, hxxp://www.aaa.com, 1)
To make the arithmetic easy, I change the timestamp to minute of the day, as:
(840, joe, hxxp://www.aaa.com) /* 840 = 14:00 hrs x 60 + 00 mins) */
To iterate over 'A' by a moving time window, I create a dataset B of minutes in the day:
(0)
(1)
(2)
.
.
.
.
(1440)
Ideally, I want to do something like:
A = load 'dataset1' AS (ts, visitor, uri)
B = load 'dataset2' as (minute)
foreach B {
C = filter A by ts > minute AND ts < minute + 10;
D = GROUP C BY (visitor, uri);
foreach D GENERATE group, count(C) as mycnt;
}
DUMP B;
I know "GROUP" isn't allowed inside a "FOREACH" loop but is there a workaround to achieve the same result?
Thanks!
Maybe you can do something like this?
NOTE: This is dependent on the minutes you create for the logs being integers. If they are not then you can round to the nearest minute.
myudf.py
#!/usr/bin/python
#outputSchema('expanded: {(num:int)}')
def expand(start, end):
return [ (x) for x in range(start, end) ]
myscript.pig
register 'myudf.py' using jython as myudf ;
-- A1 is the minutes. Schema:
-- A1: {minute: int}
-- A2 is the logs. Schema:
-- A2: {minute: int,name: chararray}
-- These schemas should change to fit your needs.
B = FOREACH A1 GENERATE minute,
FLATTEN(myudf.expand(minute, minute+10)) AS matchto ;
-- B is in the form:
-- 1 1
-- 1 2
-- ....
-- 2 2
-- 2 3
-- ....
-- 100 100
-- 100 101
-- etc.
-- Now we join on the minute in the second column of B with the
-- minute in the log, then it is just grouping by the minute in
-- the first column and name and counting
C = JOIN B BY matchto, A2 BY minute ;
D = FOREACH (GROUP C BY (B::minute, name))
GENERATE FLATTEN(group), COUNT(C) as count ;
I'm a little worried about speed for larger logs, but it should work. Let me know if you need me to explain anything.
A = load 'dataSet1' as (ts, visitor, uri);
houred = FOREACH A GENERATE user, org.apache.pig.tutorial.ExtractHour(time) as hour, uri;
hour_frequency1 = GROUP houred BY (hour, user);
Something like this should help
ExtractHour is a UDF, you could create something similar for your required Duration.
Then grouping by Hour and then User
Your can use the GENERATE to do a count.
http://pig.apache.org/docs/r0.7.0/tutorial.html

Resources