Summing up values in Pig - hadoop

I'm trying to produce output that, for each group, sums the last two fields (count and books) and divides one by the other (count/books). Currently I have the grouping code, which groups by the first field, but I'm not sure how to compute the sums of the last two fields. I've posted the code I have so far. Thanks in advance!
bigrams = LOAD 'txt' AS (bigram:chararray, year:int, count:int, books:int);
grouping = group bigrams by bigram;
STORE grouping INTO 's3://cse6242vrv3/output1.txt';

It's not completely clear what output you are expecting, so I'll assume you just want to know how to do aggregations in Pig. Let us know if you are looking for something different.
bigrams = LOAD 'txt' AS (bigram:chararray, year:int, count:int, books:int);
grouping = FOREACH (GROUP bigrams BY bigram) GENERATE group AS bigram,
    SUM(bigrams.count) AS sum_count,
    SUM(bigrams.books) AS sum_books,
    (double)SUM(bigrams.count) / SUM(bigrams.books) AS ratio;  -- cast so the ratio isn't truncated by integer division
STORE grouping INTO 's3://cse6242vrv3/output1.txt';
You can find more details about Pig aggregation here:
https://pig.apache.org/docs/r0.15.0/basic.html#group
Another Pig feature you might be interested in is nested blocks, which can be used for more complex calculations inside a group-by (see the sketch below):
https://pig.apache.org/docs/r0.15.0/basic.html#nestedblock
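For example, a nested block lets you work on each group's bag before generating output. A minimal hedged sketch, where the filter threshold is made up purely for illustration:

-- nested block sketch: for each bigram, keep only rows with count >= 10
-- (an arbitrary, illustrative threshold) before aggregating
grouped_bigrams = GROUP bigrams BY bigram;
per_bigram = FOREACH grouped_bigrams {
    frequent = FILTER bigrams BY count >= 10;
    GENERATE group AS bigram,
             SUM(frequent.count) AS sum_count,
             SUM(frequent.books) AS sum_books;
};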

Related

How to get the sum of values of a column in tmap?

I have two columns: Matches (Integer) and Accounts_type (String). I want to create a third column containing the proportion of matches played by each account type. I am new to Talend and have been stuck on this for the past two days; I did a lot of research, but to no avail. Please help.
You can do it like this:
You need to read your source data twice (I used tFixedFlowInput_1 and tFixedFlowInput_2 with the same data). The idea is to calculate the total of your matches in tAggregateRow_1, which simply sums all Matches without a group-by column, and then use that total as a lookup.
The tMap then joins your source data with the calculated total. Since the total will always be one record, you don't need any join column. You then simply divide Matches by Total as required.
This assumes you have unique values in Accounts_type; if you don't, you need to add another tAggregateRow between your source and tMap_1, in order to get the sum of Matches for each Accounts_type (group by Accounts_type).
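For readers following along in Pig rather than Talend, the same "compute a grand total, attach it to every row, then divide" pattern can be sketched roughly as below; the relation name, path, and field names are made up for illustration:

-- hypothetical Pig sketch of the same grand-total-as-lookup pattern
matches = LOAD 'matches' USING PigStorage(',') AS (Accounts_type:chararray, Matches:int);
-- GROUP ALL collapses everything into one group, so SUM yields a single-row total
total = FOREACH (GROUP matches ALL) GENERATE SUM(matches.Matches) AS total_matches;
-- total has exactly one row, so the CROSS just attaches it to every record
with_total  = CROSS matches, total;
proportions = FOREACH with_total
              GENERATE Accounts_type, Matches,
                       (double)Matches / total_matches AS proportion;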

Pig SUM a column until it reaches a certain value and return the rows

Can someone help me calculate the sum of a column until it reaches a certain value? Use case: the top products that produced 50% of the revenue.
Is there a library, like piggybank, that can do this? I couldn't find it in piggybank.
I am trying to implement a UDF, but I am worried that is the only way :(.
Here is what the data structure looks like:
productId, totalProfitByProduct, totalProfitByCompany, totalRevenueOfCompany.
Data is in descending order on totalProfitByProduct.
totalProfitByCompany, totalRevenueOfCompany remains same for every row.
Now I want to apply a running sum over totalProfitByProduct from the top down and get the top products which together generated more than 50% of totalProfitByCompany (or totalRevenueOfCompany).
piggybank has a percentile UDF, which can be used for your requirement. A Pig script along with that UDF can help you achieve it.
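If you'd rather avoid a UDF, one inefficient but pure-Pig alternative is to build a running total with RANK and a self cross product. This is only a hedged, untested sketch under the data layout described above, and the relation and field aliases are assumptions:

-- hedged sketch: running total of totalProfitByProduct via RANK plus CROSS;
-- O(n^2) in the number of products, so only reasonable for modest data sizes
products = LOAD 'products' USING PigStorage(',')
           AS (productId:chararray, totalProfitByProduct:double,
               totalProfitByCompany:double, totalRevenueOfCompany:double);
-- the input is already sorted descending by totalProfitByProduct,
-- so a plain RANK yields row numbers in that order
ranked = RANK products;
a = FOREACH ranked GENERATE rank_products AS r, productId, totalProfitByProduct,
                            totalProfitByCompany;
b = FOREACH ranked GENERATE rank_products AS r2, totalProfitByProduct AS p2;
pairs = CROSS a, b;
-- pair every product with all products ranked at or above it
upto = FOREACH (FILTER pairs BY r2 <= r)
       GENERATE r, productId, totalProfitByProduct AS profit,
                totalProfitByCompany AS companyProfit, p2;
cum  = FOREACH (GROUP upto BY (r, productId, profit, companyProfit))
       GENERATE FLATTEN(group) AS (r, productId, profit, companyProfit),
                SUM(upto.p2) AS runningProfit;
-- keep a product if the running total before it was still under 50 percent,
-- i.e. the product is needed to reach the 50 percent mark
top_products = FILTER cum BY (runningProfit - profit) < 0.5 * companyProfit;
result = FOREACH (ORDER top_products BY r) GENERATE productId, runningProfit;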

How to rename the fields in a relation

Consider the following code :
ebook = LOAD '$ebook' USING PigStorage AS (line:chararray);
ranked = RANK ebook;
The relation ranked has two fields: the line number and the text. The text is called line and can be referred to by that alias, but the line number generated by RANK has no alias. As a consequence, the only way I can refer to it is as $0.
How can I give $0 a name, so that I can refer to it more easily once it has been joined to another data set and is no longer $0?
What you want to do is define a schema for your data. The easiest way to do so is to use the AS keyword, just like you're doing with LOAD.
You can define a schema with three operators: LOAD, STREAM, and FOREACH.
Here, the easiest way to do so would be the following:
ebook = LOAD '$ebook' USING PigStorage AS (line:chararray);
ranked = RANK ebook;
renamed_ranked = FOREACH ranked GENERATE $0 AS line_number, $1;
You may find more information in the associated documentation.
It is also good to know that this operation won't add an iteration to your script. As @ArnonRotem-Gal-Oz said:
Pig doesn't perform the action in a serial manner, i.e. it doesn't do all the ranking and then do another iteration over all the records. The Pig optimizer will do the rename when it assigns the rank. You can see similar behaviour explained in the Pig cookbook.
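Since the original motivation was referring to the line number after a join, here is a minimal hedged sketch of that follow-up step; the second relation, its parameter, and its fields are invented purely for illustration:

-- hypothetical second data set, keyed by the same line number
other  = LOAD '$other' USING PigStorage AS (line_number:long, author:chararray);
joined = JOIN renamed_ranked BY line_number, other BY line_number;
-- the renamed field can now be referenced by name instead of $0
by_name = FOREACH joined GENERATE renamed_ranked::line_number, author;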
You can add a projection with FOREACH:
named_ranked = FOREACH ranked GENERATE $0 AS r, *;

Intersection of Intervals in Apache Pig

In Hadoop I have a collection of datapoints, each including a "startTime" and "endTime" in milliseconds. I want to group on one field then identify each place in the bag where one datapoint overlaps another in the sense of start/end time. For example, here's some data:
0,A,0,1000
1,A,1500,2000
2,A,1900,3000
3,B,500,2000
4,B,3000,4000
5,B,3500,5000
6,B,7000,8000
which I load and group as follows:
inputdata = LOAD 'inputdata' USING PigStorage(',')
AS (id:long, where:chararray, start:long, end:long);
grouped = GROUP inputdata BY where;
The ideal result here would be
(1,2)
(4,5)
I have written some bad code to generate an individual tuple for each second with some rounding, then do a set intersection, but this seems hideously inefficient, and in fact it still doesn't quite work. Rather than debug a bad approach, I want to work on a good approach.
How can I reasonably efficiently get tuples like (id1,id2) for the overlapping datapoints?
I am thoroughly comfortable writing a Java UDF to do the work for me, but it seems as though Pig should be able to do this without needing to resort to a custom UDF.
This is not an efficient solution, and I recommend writing a UDF to do this.
Self-join the dataset with itself to get a cross product of all the combinations. In Pig, it's difficult to join something with itself, so you just act as if you are loading two separate datasets. After the cross product, you end up with data like:
1,A,1500,2000,1,A,1500,2000
1,A,1500,2000,2,A,1900,3000
.....
At this point, you need to satisfy four conditions:
the "where" fields match
ids one and two from the self-join don't match (so you don't get back the same ID intersecting with itself)
the start time from the second copy is greater than or equal to the start time from the first copy
the start time from the second copy is less than or equal to the end time from the first copy
This code should work; it might have a syntax error somewhere as I couldn't test it, but it should help you write what you need.
inputdataone = LOAD 'inputdata' USING PigStorage(',')
AS (id:long, where:chararray, start:long, end:long);
inputdatatwo = LOAD 'inputdata' USING PigStorage(',')
AS (id:long, where:chararray, start:long, end:long);
crossProduct = CROSS inputdataone, inputdatatwo;
crossProduct = FOREACH crossProduct GENERATE
    inputdataone::id AS id_one,
    inputdatatwo::id AS id_two,
    -- overlap test: same "where", different ids, and the second start falls inside the first interval
    ((inputdatatwo::start - inputdataone::start >= 0
      AND inputdatatwo::start - inputdataone::end <= 0
      AND inputdataone::where == inputdatatwo::where
      AND inputdataone::id != inputdatatwo::id) ? 1 : 0) AS intersect;
find_intersect = FILTER crossProduct BY intersect == 1;
final = FOREACH find_intersect GENERATE id_one, id_two;
Crossing large sets inflates the data.
A naive solution without crossing would be to partition the intervals and check for intersections within each interval.
I am working on a similar problem and will provide a code sample when I am done.
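In the same spirit of avoiding the full cross product, a hedged, untested sketch (assuming you only need pairs within the same "where" value) is to self-join on the grouping key, so only rows sharing that key get paired:

-- hedged sketch: join on "where" instead of CROSS, then keep genuinely overlapping pairs
a = LOAD 'inputdata' USING PigStorage(',') AS (id:long, where:chararray, start:long, end:long);
b = LOAD 'inputdata' USING PigStorage(',') AS (id:long, where:chararray, start:long, end:long);
paired = JOIN a BY where, b BY where;
-- a::id < b::id avoids self-pairs and emits each pair only once;
-- the two range checks are the standard interval-overlap test
overlaps = FILTER paired BY a::id < b::id
           AND a::start <= b::end AND b::start <= a::end;
result = FOREACH overlaps GENERATE a::id AS id_one, b::id AS id_two;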

Problems generating ordered files in Pig

I am facing two issues:
Report Files
I'm generating a Pig report whose output goes into several files: part-r-00000, part-r-00001, ... (This results from the same relation; multiple mappers are producing the data, hence the multiple files.):
B = FOREACH A GENERATE col1,col2,col3;
STORE B INTO $output USING PigStorage(',');
I'd like all of these to end up in one report, so what I end up doing, before storing the result using HBaseStorage, is sorting with PARALLEL 1: report = ORDER report BY col1 PARALLEL 1. In other words, I am forcing the number of reducers to 1, and therefore generating a single file, as follows:
B = FOREACH A GENERATE col1,col2,col3;
B = ORDER B BY col1 PARALLEL 1;
STORE B INTO $output USING PigStorage(',');
Is there a better way of generating a single file output?
Group By
I have several reports that perform a group-by: grouped = GROUP data BY col. Unless I specify PARALLEL 1, Pig sometimes decides to use several reducers to group the result, and when I sum or count the data I get incorrect results. For example:
Instead of seeing this:
part-r-00000:
grouped_col_val_1, 5, 6
grouped_col_val_2, 1, 1
part-r-00001:
grouped_col_val_1, 3, 4
grouped_col_val_2, 5, 5
I should be seeing:
part-r-00000:
grouped_col_val_1, 8, 10
grouped_col_val_2, 6, 6
So I end up doing my group as follows: grouped = GROUP data BY col PARALLEL 1
then I see the correct result.
I have a feeling I'm missing something.
Here is pseudo-code for how I am doing the grouping:
raw = LOAD '$path' USING PigStorage...
row = FOREACH raw GENERATE id, val;
grouped = GROUP row BY id;
report = FOREACH grouped GENERATE group AS id, SUM(row.val);
STORE report INTO '$outpath' USING PigStorage...
EDIT, new answers based on the extra details you provided:
1) No, the way you describe it is the only way to do it in Pig. If you want to download the (sorted) files, it is as simple as doing an hdfs dfs -cat or hdfs dfs -getmerge. For HBase, however, you shouldn't need to do extra sorting if you use the -loadKey=true option of HBaseStorage. I haven't tried this, but please try it and let me know if it works.
2) PARALLEL 1 should not be needed. If this is not working for you, I suspect your pseudocode is incomplete. Are you using a custom partitioner? That is the only explanation I can find for your results, because the default partitioner used by GROUP BY sends all instances of a key to the same reducer, thus giving you the results you expect.
OLD ANSWERS:
1) You can use a merge join instead of just one reducer. From the Apache Pig documentation:
Often user data is stored such that both inputs are already sorted on the join key. In this case, it is possible to join the data in the map phase of a MapReduce job. This provides a significant performance improvement compared to passing all of the data through unneeded sort and shuffle phases.
The way to do this is as follows:
C = JOIN A BY a1, B BY b1 USING 'merge';
2) You shouldn't need to use PARALLEL 1 to get your desired result. The GROUP should work fine, regardless of the number of reducers you are using. Can you please post the code of the script you use for Case 2?
