Looping within the results of a Pig Group By - hadoop

Let's say I have a game with player ids. Each id can have multiple character names (playerNames) and we have a score for each of those names. I would like to total all the scores per playerName, and calculate the percentage score per player name per id.
So, for instance:
id playerName playerScore
01 Test 45
01 Test2 15
02 Joe 100
would output
id {(playerName, playerScore, percentScore)}
01 {(Test, 45, .75), (Test2, 15, .25)}
02 {(Joe, 100, 1.0)}
Here's how I did it:
data = LOAD 'someData.data' AS (id:int, playerName:chararray, playerScore:int);
grouped = GROUP data BY id;
withSummedScore = FOREACH grouped GENERATE SUM(data.playerScore) AS summedPlayerScore, FLATTEN(data);
withPercentScore = FOREACH withSummedScore GENERATE data::id AS id, data::playerName AS playerName, (playerScore/summedPlayerScore) AS percentScore;
percentScoreIdroup = GROUP withPercentScore By id;
Currently, I do this with 2 GROUP BY statements, and I was curious if they were both necessary, or if there's a more efficient way to do this. Can I reduce this to a single GROUP BY? Or, is there a way I can iterate over the bag of tuples and add percentScore to all of them without flattening the data?

No, you can not do this without 2 GROUP, and the reason is more fundamental than just Pig:
To get the total number of points you need a linear pass through the player's scores.
Then, you need another linear pass over the player's scores to calculate the fraction. You can not do this before you know the sum.
Having said that, if the player's number of playerNames is small, I'd write a UDF that takes a bag of player scores and outputs a bag of score-per-playerName tuples, since each GROUP will generate a reducer and the process becomes ridiculously slow. A UDF that takes the bag would have to do those 2 linear passes as well, but if the bags are small enough, it won't matter and it'll certainly be an order of magnitude faster than creating another reducer.

Related

Wrong sorting while using Query function

I've been trying to do a report about the quantity of breakdonws of products in our company. The problem is that the QUERY function is operating as normal, but the sorting order is well - a bit strange.
The data I'm trying to sort are as follows (quantities are blacked out since I cannot share those informations):
Raw data
First column - name of the product, second, it's EAN code, third, breakdown rate for last year, last column - average breakdown rate. "b/d" means "brak danych" or no data.
What I want to achieve is to get the end table with values sorted by average breakdown rate.
My query is as follows:
=query(Robocze!A2:D;"select A where A is not null and NOT D contains 'b/d' order by D desc")
Final result
As You can see, we have descending order, but there are strange artifacts - like the 33.33% after 4,00% and before 3,92%.
Why is that!?
try:
=INDEX(LAMBDA(x; SORT(x; INDEX(x;; 4)*1; 0))
(QUERY(Robocze!A2:D; "where A is not null and NOT D contains 'b/d'"; 0));; 4)

Creating portfolios depending on 2 variables with SAS

I am actually new to SAS and would like form portfolios between the intersection of 2 variables from my spreadsheet.
Basically, I have an excel file called 'Up' with variables in it like 'month, company, BM, market cap usd)
I would like to sort for each month my data: the size (descending) and then BM (descending). I would like to create 4 size portfolios according to P25, P50 and P75 with the first size portfolio being above P75 (for each month) and so on. Then for each size portfolio that was create recreating 4 new portfolios in function of 'BM' and also with P25, P50, and P75.
Could someone help me and display me the SAS code and the way to add it to my existing 'Up' file (name of the sheet is also named 'up')
So I agree with the comment, this is not asked well. However, it is a common problem to solve and somewhat fun. So here goes:
First I'm going to just make up some data. Google search how to read Excel in SAS. It's easy.
1000 companies with a random SIZE and BM value.
data companies(drop=c);
format company $12.;
do c=1 to 1000;
company = catt("C_",put(c,z4.));
size = ceil(100*ranuni(1));
BM = ceil(100*ranuni(1));
output;
end;
run;
So I'm assuming you just want equal amounts in these 4 groups. You don't want to estimate percentiles based on a distribution or KDE. For this, PROC RANK works well.
proc rank data=companies out=companies descending groups=4;
var size;
ranks p_size;
run;
We now have a variable P_SIZE that is values 0,1,2,3 based on the descending order of SIZE.
Sort the portfolios by that P_SIZE value.
proc sort data=companies;
by p_size;
run;
Now run PROC RANK again, this time using a BY statement with P_SIZE, ranking on BM, and creating P_SIZE_BM.
proc rank data=companies out=companies descending groups=4;
var bm;
by p_size;
ranks p_size_bm;
run;
P_SIZE_BM now contains values 0,1,2,3 for EACH value of P_SIZE.
Sort the data and see how it comes out:
proc sort data=companies;
by p_size p_size_bm;
run;

Map IDs to matrix rows in Hadoop/MapReduce

I have data about users buying products. I want to create a binary matrix of size |users| x |products| such that the element (i,j) in the matrix is 1 iff user_i has bought product_j, else the value is 0.
Now, my data looks something like
userA, productX
userB, productY
userA, productZ
...
UserIds and productIds are all strings. My problem is, how to map these IDs to row indices (for users) and column indices (for products) in the matrix.
There are over a million unique userIds and roughly 3 million productIds.
To make the problem well defined: given the user1, product1 like input above, how do I convert it to something like
1,1
2,2
1,3
where userA is mapped to row 0 of the matrix, userB is mapped to row 1, productX is mapped to column 0 and so on.
Given the size of data, I would have to use Hadoop Map-Reduce but can't think of a foolproof way of efficiently doing this.
This can be solved if we can do the following:
Dump unique userIds.
Dump unique productIds.
Map each unique userId in (1) to a row index.
Map each unique productId in (2) to a column index.
I can do (1) and (2) easily but having trouble coming up with an efficient approach to solve (3) (4 will be solved if we solve 3).
I have a couple of solutions but they are not foolproof.
Solution 1 (naive) for step 3 above
Map all userIds and emit the same key (say "1") for all map tasks.
Have a long counter initialized to 0 in setup() of the reducer.
In the reduce(), emit the counter value along with the input userId and increment the counter by 1.
This would be very inefficient since all 100 million userIds would be processed by a single reducer.
Solution 2 for step 3 above
While mapping userIds, emit each userId against a key which is an integer uniformly sampled from 1,2,3....N (where N is configurable. N = 100 for example). In a way, we are partitioning the input set.
Within the mapper, use Hadoop counters to count the number of userIds assigned to each of those random partitions.
In the reducer setup, first access the counters in the mapping stage to determine how many IDs were assigned to each partition. Use these counters to determine the start and end values for that partition.
Iterate (while counting) over each userId in reduce and generate matrix rowId as start_of_partition + counter.
context.write(userId, matrix row Id)
This method should work but I am not sure how to handle cases when reducer tasks failed/killed.
I believe there should be ways of doing this which I am not aware of. Can we use hashing/modulo to achieve this? How would we handle collisions at scale?

olap4J - calculations on member grouping

I'm trying to write an olap4j (Mondrian) query that will group the rows by ranges.
Assume we have counts of cards per child and the children ages.
i want to sum the cards amount by age ranges, so i will have counts for ages 0-5,5-10,10-15 and so on.
Is this can be done with olap4j?
You need to define calculated members for that:
With member [Age].[0-4] as [Age].[0]:[Age].[4]
member [Age].[5-9] as [Age].[5]:[Age].[9]
etc.
Alternatively, you may want to re-design your dimension table. I'm guessing you have age as a degenerate dimension in the fact table. I suggest creating a separate dimension dim_age with a structure like this:
age_id, age, age_group
0, null, null
1, 0, 0-4
2, 1, 0-4
(...)
Then it's easy to define a first level on the dimension based on the age_group.

R - Sorting and Sub-setting Maximum Values within Columns

I am trying to iteratively sort data within columns to extract N maximum values.
My data is set up with the first and second columns containing occupation titles and codes, and all of the rest of the columns containing comparative values (in this case location quotients that had to be previously calculated for each city) for those occupations for various cities:
*occ_code city1 ... city300*
occ1 5 ... 7
occ2 20 ... 22
. . . .
. . . .
occ800 20 ... 25
For each city I want to sort by the maximum values, select a subset of those maximum values matched by their respective occupations titles and titles. I thought it would be relatively trivial but...
edit for clarification: I want end to with a sorted subset of the data for analysis.
occ_code city1
occ200 10
occ90 8
occ20 2
occ95 1.5
At the same time I want to be able to repeat the sort column-wise (so I've tried lots of order commands through calling columns directly: data[,2]; just to be able to run the same analysis functions over the entire dataset.
I've been messing with plyr for the past 3 days and I feel like the setup of my dataset is just not conducive to how plyer was meant to be used.
I'm not exactly sure what your desired output is according to your example snippit. Here's how you could get a data frame like that for every city using plyr and reshape
#using the same df from nico's answer
library(reshape)
df.m <- melt(df, id = 1)
a.cities <- cast(df.m, codes ~ . | variable)
library(plyr)
a.cities.max <- aaply(a.cities, 1, function(x) arrange(x, desc(`(all)`))[1:4,])
Now, a.cities.max is an array of data frames, with the 4 largest values for each city in each data frame. To get one of these data frames, you can index it with
a.cities.max$X13
I don't know exactly what you'll be doing with this data, but you might want it back in data frame format.
df.cities.max <- adply(a.cities.max, 1)
One way would be to use order with ddply from the package plyr
> library(plyr)
> d<-data.frame(occu=rep(letters[1:5],2),city=rep(c('A','B'),each=5),val=1:10)
> ddply(d,.(city),function(x) x[order(x$val,decreasing=TRUE)[1:3],])
order can sort on multiple columns if you want that.
This will output the max for each city. Similar results can be obtained using sort or order
# Generate some fake data
codes <- paste("Code", 1:100, sep="")
values <- matrix(0, ncol=20, nrow=100)
for (i in 1:20)
values[,i] <- sample(0:100, 100, replace=T)
df <- data.frame(codes, values)
names(df) <- c("Code", paste("City", 1:20, sep=""))
# Now for each city we get the maximum
maxval <- apply(df[2:21], 2, which.max)
# Output the max for each city
print(cbind(paste("City", 1:20), codes[maxval]))

Resources