SUM function in Pig script - hadoop

I am a student learning how to use Pig script using the hortonworks sandbox. My problem is that I am not able to use the SUM function properly. I have successfully separated the fields of a firewall log and I am able to do perform several queries and use the count function... but no luck with the SUM function which I really need in one case. This code I used below:
A = FOREACH logs_base GENERATE device_id,src,src_port,dst,dst_port,tran_ip,tran_port,service,duration,sent,rcvd,sent_pkt,rcvd_pkt,SN,user,group1, REGEX_EXTRACT(date, '\\d{3}-(\\d{2})-\\d{2}', 1) AS(month:chararray);
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
counter = foreach grpd1 {
sum1 = SUM(A.rcvd);
sum2 = SUM(A.sent);
generate sum1, sum2;
};
dump counter;
C = foreach F1 generate rcvd, sent;
dump C;
When I dump just the variable C I get a result displaying many records indicating the amount of data received/sent for the filter applied. eg:
(223,123)
(334,444)
(21,12344)
(...,...)
All I really want to do is add all those records together and show that total amount of received and sent: (?,?).
Note: I have tried changing the variable type to int, long, and chararray with no success either.
Some of the errors I am getting while trying to solve this are:
Could not infer the matching function for org.apache.pig.builtin.SUM as multiple or none of them fit. Please use an explicit cast.

First make sure that the fields that you are summing up are of type int
Use - DESCRIBE A; to check the data type
After that, I think since you have used filter condition and then used group by on F1 -
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
So, while summing up you should use F1 instead of A -
counter = foreach grpd1 {
sum1 = SUM(F1.rcvd);
sum2 = SUM(F1.sent);
generate sum1, sum2;
};
Use DESCRIBE grpd1; and you will understand what I am trying to say, there will be no 'A'
I guess this should solve the error. Finally, check the logic of what you want in the result I have not checked that. Hope this helps.
PS - I am also a student and new to PIG.

A lucky guess here, I'm new to Pig too :)
I'm not sure if SUM can be casted to chararray(that would explain the error), so make rcvd and sent type:int and then generate the 2 sums for grpd1 bag:
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
C1 = foreach grpd1 generate SUM(F1.rcvd);
dump C1;
C2 = foreach grpd1 generate SUM(F1.sent);
dump C2;
NOTE: More info here.
Hope I helped a little!

Please try the following
A = FOREACH logs_base GENERATE device_id,src,src_port,dst,dst_port,tran_ip,tran_port,service,duration,sent,rcvd,sent_pkt,rcvd_pkt,SN,user,group1, REGEX_EXTRACT(date, '\\d{3}-(\\d{2})-\\d{2}', 1) AS(month:chararray);
F1 = FILTER A BY user == 'PR11MS1120' and month == '10';
grpd1 = group F1 by user;
C = foreach F1 generate group,SUM(F1.rcvd), SUM(F1.sent);
dump C;

Related

Google Spreadsheet Loop & IF statement: Loop through co-worker list and add them to a rotation schedule

EDIT - SOLVED
I have to make a rotation schedule for coworkers for the next year. Some coworkers have standard days off and I do not want to schedule them on those days.
This is the manual outcome I would like to get.
Example: Consultant A does not work on mondays, so I do not want Consultant A to be added to the schedule on a monday.
I then want consultant B to be added to the schedule as a fill-up. Consultant A would be next in line on a tuesday etc. Next would be consultant C but consultant C does not work on wednesdays. Therefore, we need to take consultant D for wednesday and consultant C on a thursday, and so on. When we are at the last consultant of the F column, it needs to start again at consultant A.
I have tried all kinds of formulas, like if statements and arrayformula. But there is no way that I know of to loop through the F column just with formulas.
I am not sure if this is at all clear what I want to achieve here, I am stuck 😄
I am using additionally an add-on to send the schedule to everyone's agenda, thats also the reason why i'd love to automate this, because it would help me SO much.
I did try myself on some coding, but I am no coder and I am not sure if it would be helpful at all to share my failure 😄 But this is what I've tried so far:
function Loop() {
var ss = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
var EndRow = ss.getLastRow();
for (var i = 2; i <= EndRow; i++) {
var Day = ss.getRange(i,2).getValue();
var Consultants = ss.getRange(i,6).getValue();
var Off = ss.getRange(i,7).getValue ();
var Count = ss.getRange(i,8).getValue();
if(Day == Off){
ss.getRange(i, 3).setValue(Consultants)
}else{
ss.getRange(i, 3).setValue(Consultants)
}
}
}
EDIT:
I found a way without using apps scripts, costs me some more work manually and first tried it with a shorter team list.
The highlighted yellow cells are the cells in which the day off was identical to the work-day cell. So they got switched.
I did have to copy paste my input list of consultants but if this is the only manual way, its fine :)
Try this:
Code:
function myFunction() {
var ss = SpreadsheetApp.getActiveSpreadsheet();
var sh = ss.getSheetByName("Sheet1");
var dayCol = sh.getRange("B2:B343").getValues().flat(); //get day column values and convert it to 1d array
var dayOffCol = sh.getRange("G2:G9").getValues().flat(); //get day off values and convert it to 1d array
var dayOffColCopy; //initialize copy
var consCol = sh.getRange("F2:F9").getValues().flat(); //get consultants values or column F
var consColCopy; //initialize copy
var tempArray = []; //storage of final value for column C
for(var i = 0; i < dayCol.length; i++){ //loop through dayCol values
var ctr = (i % consCol.length); //used modulo as counter. the value will return to 0 if the value of i is divisible to the length of consCol or in example 8
//The if statement below will help up reset the value of
//consColCopy and dayOffColCopy once the values are emptied because of the splice()
if(ctr == 0){
consColCopy = consCol.slice();
dayOffColCopy = dayOffCol.slice();
}
//the loop below will get the first non-matching values of dayCol and dayOffColCopy,
//the first non-matching values will be removed to the copy variables using splice()
//and insert it to tempArray using push()
for(var j = 0; j < dayOffColCopy.length; j++){ //loop through dayOffColCopy values
if(dayCol[i] != dayOffColCopy[j]){
tempArray.push(consColCopy.splice(j, 1));
dayOffColCopy.splice(j, 1);
break; //exit loop
}
}
}
sh.getRange(2, 3, tempArray.length, 1).setValues(tempArray); //set the values of temp array to column C
}
Example Data & Output:
Note: Make sure to use the cell that has data in your range and change the sheet name. I also added comments in my code to explain the process.
References:
Array.prototype.push()
Array.prototype.slice()
Array.prototype.splice()
Class Range

PIG How do I return a rowcount from 1 alias to another alias

REGISTER 'udf.py' using jython as myfunc;
loadhtml = load './assignment/crawler' using PigStorage('\u0001') as (id1:chararray,url:chararray,domain:chararray,content:chararray,source:chararray,date:chararray);
loadhtml_content = FOREACH loadhtml generate content;
flatten = FOREACH loadhtml_content generate flatten(TOKENIZE(line)) as word;
group = GROUP flatten by word;
count = FOREACH group1 generate $0, COUNT($1);
log = FOREACH count GENERATE myfunc.nLog($0,$1,**<I need to return the row count of loadhtml_content here>**);
I am trying to return a row count of loadhtml_content into another alias. I cant think of another idea to do it.
log = FOREACH count GENERATE myfunc.nLog($0,$1,(I need to return the row count of loadhtml_content here) );
I believe this is the exact feature you are looking for: https://issues.apache.org/jira/browse/PIG-1434.
It essentially allows us to use single-tupled relations as constants wherever needed.
Something along the following lines should address your question:
loadhtml_content = FOREACH loadhtml generate content;
content_rows = FOREACH (GROUP loadhtml_content ALL) GENERATE
COUNT(loadhtml_content);
log = FOREACH count GENERATE myfunc.nLog($0,$1,content_rows.$0);

How to divide numbers from different tables in pig

I am trying to join two tables and divide a number from one table by a number from another table. I have attempted to do it in the original and generate a new table with the same values but I get the same error both times which is extra confusing to me.
--get the data
lines = LOAD '/historicaldata.csv' USING PigStorage(' ') AS (ticker:chararray, date:long, open:long, high:long, low:long, close:long, volume:long);
--limit it between the dates we want
specDates = FILTER lines BY (date<=20000103 and date>=19900101);
--sort by ticker symbol
companies = GROUP specDates BY ticker;
--sort DESC and get the top to get the ending date
sorted_end = FOREACH companies {
sorted1 = ORDER specDates BY date DESC;
endDate = LIMIT sorted1 1;
GENERATE endDate.ticker AS ticker, endDate.open AS open, endDate.close AS close;
}
--sort ASC and get the top to get the starting date
sorted_begin = FOREACH companies {
sorted2 = ORDER specDates BY date ASC;
startDate = LIMIT sorted2 1;
GENERATE startDate.ticker AS ticker, startDate.open AS open, startDate.close AS close;
}
joined = JOIN sorted_end BY ticker, sorted_begin BY ticker;
final = FOREACH joined GENERATE sorted_end::ticker as ticker, sorted_begin::open as open, sorted_end::close as close;
final2 = FOREACH final GENERATE ticker as ticker, (float)(close/open) as growth_factor;
The error I keep getting is:
(Name: Divide Type: null Uid: null)incompatible types in Divide Operator left hand side:bag :tuple(close:float) right hand side:bag :tuple(open:float)
Both are floats so I am not sure why they are "incompatible types" other than that they come from different bags, but adding them to "final" and trying to do it from there doesn't work.
The data is in the form:
AA,20140131,11.60,11.80,11.45,11.48,33014100
AA,20140130,12.05,12.07,11.83,11.92,23223500
AA,20140129,11.64,12.23,11.58,11.96,44433000
Every entry includes all columns and are well formatted, non-zero numbers
Based on your query, I tried to create a dummy table on my system and generate the result. I found no issue and the division operation was completed successfully. PFB some sample queries which I fired on Pig:-
A = LOAD '/home/training/716391/pig/pigdata.csv' USING PigStorage(',') as (ID:INT, name:CHARARRAY, GPC:FLOAT)
B = LOAD '/home/training/716391/pig/pigdata2.csv' USING PigStorage(',') as (ID:INT, name:CHARARRAY, GPC:FLOAT)
C = join A by ID, B by ID
D = FOREACH C generate A::ID as IDA, A::name as NAMEA, A::GPC as GPCA, B::ID as IDB, B::name as NAMEB, B::GPC as GPCB;
E = FOREACH D GENERATE IDA, (FLOAT)(GPCA/GPCB) AS VALUE;
Can you please confirm, if the divisor value in your case has no Null value or 0?
Could you please share the load statements for sorted_end and sorted_begin?

Pig - MAX is not working after grouping

I am working with Pig 0.12.1 and Map-R. I am trying to find max of a field after grouping the relation on some other field. Refer the following pig script and structure of relation in comments-
r1 = foreach SomeRelation generate flatten(group) as (c1 , c2);
-- r1: {c1: biginteger,c2: biginteger}
r2 = group r1 by c1;
-- r2: {group: chararray,r1: {(c1: chararray,c2: biginteger)}}
DUMP r2;
/* output -
1234|{(1234,9876)}
2345|{(2345,8765)}
3456|{(3456,7654)}
4567|{(4567,6543)}
*/
r3 = foreach r2 generate group as c1, MAX(r1.c2) as c2;
I am getting the following error
Could not infer the matching function for org.apache.pig.builtin.MAX as multiple or none of them fit. Please use an explicit cast.
Script Explained-
I am flattening group of SomeRelation into c1, c2 and then regrouping
on c1 to generate max of c2 with each c1 group.
Please suggest.
I'm not sure if you can use the group keyword under the flatten. Also, have you considered tokenizing the group before flattening it. See this for example:
load_data = LOAD '/PIG_TESTS_ALL/WordCount' as (line);
tokenizing_data = FOREACH load_data generate flatten(TOKENIZE(line)) as word;
group_data = GROUP tokenizing_data by word;
Result = FOREACH group_data generate group,COUNT(tokenizing_data);
dump Result;
This is actually for word count, You can probably build on this to find max value based on what you want to do.
Well it looks like the problem is that Pig doesn't allow MAX(or for that matter aggregate functions like SUM etc) on biginteger. Had to use long as a datatype for this to work. Refer the following-
r1 = foreach SomeRelation generate flatten(group) as (c1 , c2:long);
-- r1: {c1: biginteger,c2: long}
Strangely, there's no documentation highlighting this almost like datatypes biginteger and bigdecimal.
We now know the problem was the unability of MAX to handle biginteger.
You should be able to group and get the max like this, and compare results with combination of order + limit :
r1 = FOREACH SomeRelation GENERATE FLATTEN(group) AS (c1, c2);
r3 = FOREACH (group r1 by c1) {
-- you may want to apply a function on a single column
-- or compare sort + limit to MAX
list = ORDER $1 BY c2 DESC;
list_max = LIMIT list 1;
GENERATE group AS c1, MAX(r1.c2) AS c2, list_max;
}

Pig Conditional Statements

I think I already know the answer to this, but I just wanted to check here before I give up and do something ugly.
I have a query that needs to count total clicks, and also total distinct users. Total clicks would just be this code without the distinct:
report = FOREACH report GENERATE user, genre, title;
report = DISTINCT report;
report = GROUP report BY (genre, title);
My question is essentially: is there any way to write a conditional statement that would skip the DISTINCT step in this process? Pseudo:
report = FOREACH report GENERATE user, genre, title;
if $report_type == 'users':
report = DISTINCT report;
end if
report = GROUP report BY (genre, title);
I'd rather not have two separate files, and up to this point the only solutions I can find involve using a Python, etc. wrapper to dynamically deal with it. I'd rather keep everything in a simple .pig file, but can't find a way to do it.
One option could be you can try something like this. Can you check with your input?
input:
user1,action,aa
user2,comedy,cc
user3,drama,dd
user1,action,aa
user1,action,aa
user2,comedy,cc
PigScript:
A = LOAD 'input' USING PigStorage(',') AS (user, genre, title);
B = FOREACH A GENERATE user, genre, title;
C = GROUP B BY (genre, title);
D = FOREACH C {
noDistValue = FOREACH B GENERATE user,genre,title;
distValue = DISTINCT B;
GENERATE $0 AS grp,noDistValue,distValue;
}
E = FOREACH D GENERATE grp,(('$report_type' == 'users')?distValue:noDistValue) AS mybag;
DUMP E;
Output1:
>>pig -x local -param "report_type=users" test.pig
((action,aa),{(user1,action,aa)})
((comedy,cc),{(user2,comedy,cc)})
((drama,dd),{(user3,drama,dd)})
Output2:
>>pig -x local -param "report_type=nonusers" test.pig
((action,aa),{(user1,action,aa),(user1,action,aa),(user1,action,aa)})
((comedy,cc),{(user2,comedy,cc),(user2,comedy,cc)})
((drama,dd),{(user3,drama,dd)})
In case if you want to calculate the Count then project the relation E and also you can modify the above script according to your need.

Resources