Hadoop Pig ordered results; find order position?

I want to sort my pig results, and then be able to determine where certain items are in my ordered results. Example:
mydata = LOAD 'mydata.txt' AS (label:chararray, rank_score:float);
ranked_data = ORDER mydata BY rank_score DESC;
ranked_positions = FOREACH ranked_data GENERATE label, AUTO_INCREMENT_ID;
results = FILTER ranked_positions BY label == 'item1' OR label == 'item2';
DUMP results;
AUTO_INCREMENT_ID would auto-increment in my perfect world. Given how mappers/reducers are independent from each other, I'm guessing Pig/Hadoop may not support this. If not, can you think of another way to generate my end result?
Example input:
item1 34.33
item2 48.39
item3 93.3
Desired output:
item1 3
item2 2

If you set the parallelism of the ORDER to 1, you can do the auto-increment yourself in a UDF; of course, that has the potentially undesired effect of using only one reducer to do your sorting.
(Also, I am not sure how you got your example output -- the input seems to be already ordered, so item1 should have id 1 and item2 should have id 2, right? Did you mean to order by rank_score desc?)
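A minimal sketch of that approach, assuming a Jython UDF; the file name rank_udf.py and the function row_number are made up for illustration, not Pig built-ins:

REGISTER 'rank_udf.py' USING jython AS myudfs;
mydata = LOAD 'mydata.txt' AS (label:chararray, rank_score:float);
-- PARALLEL 1 forces the sort onto a single reducer so one counter can see every row
ranked_data = ORDER mydata BY rank_score DESC PARALLEL 1;
ranked_positions = FOREACH ranked_data GENERATE label, myudfs.row_number() AS pos;
results = FILTER ranked_positions BY label == 'item1' OR label == 'item2';
DUMP results;

And rank_udf.py, which simply increments a counter (only valid as long as the rows flow through a single task, which is what the single-reducer ORDER is meant to ensure):

from pig_util import outputSchema

counter = [0]

@outputSchema('pos:long')
def row_number():
    counter[0] += 1
    return counter[0]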

How to create dataframe from ordered dictionary?

I have an ordered dictionary which has 4 keys and multiple values. I tried to create the dataframe like this
df = pd.DataFrame(items, index=[0])
print('\ndf is ',df)
But this raises a ValueError, because the list values in the dictionary don't fit the single-row index=[0].
The ordered dictionary is below:
OrderedDict([('Product', 'DASXZSDASXZS'), ('Region', ['A', 'B', 'C']), ('Items', ['1', '2', '3']), ('Order', ['123', '456', '789'])])
I want the dataframe format to be like:
Product Region Items Order
DASXZSDASXZS A 1 123
DASXZSDASXZS B 2 456
...
How can I achieve this format for the dataframe?
Not enough rep to comment. Why do you try to specify index=[0]?
Simply doing
df = pd.DataFrame(items)
works; if you want to change the index, you can set it later with df.set_index(...)
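With the dictionary from the question, a quick sketch of what that gives (the scalar 'Product' value is broadcast across the rows):

import pandas as pd
from collections import OrderedDict

items = OrderedDict([('Product', 'DASXZSDASXZS'),
                     ('Region', ['A', 'B', 'C']),
                     ('Items', ['1', '2', '3']),
                     ('Order', ['123', '456', '789'])])

df = pd.DataFrame(items)
print(df)
#         Product Region Items Order
# 0  DASXZSDASXZS      A     1   123
# 1  DASXZSDASXZS      B     2   456
# 2  DASXZSDASXZS      C     3   789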
@viktor_dmitry your comment to @Battleman links to external data; here's a solution.
In https://www.codepile.net/pile/GY336DYN you have a list of OrderedDict entries, in the example above you just had 1 OrderedDict. Each needs to be treated as a separate DataFrame construction. From the resulting list you use concat to get a final DataFrame:
ods = [OrderedDict([('MaterialNumber', '2XV9450-1AR24'), ('ForCountry'...]),
OrderedDict([('MaterialNumber', ...),
...]
new_df = pd.concat([pd.DataFrame(od) for od in ods])
# new_df has 4 columns and many rows
Note also that one of your example items is invalid; you'd need to filter it out. The rest appear to be fine:
ods[21]
OrderedDict([('MaterialNumber', '4MC9672')]) # lacks the rest of the columns!
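A self-contained sketch of that filter-and-concat pattern (the rows below are made-up placeholders, since the real entries sit behind the link above; only 'MaterialNumber' and 'ForCountry' come from your data):

import pandas as pd
from collections import OrderedDict

# hypothetical stand-ins for the linked data
ods = [OrderedDict([('MaterialNumber', '2XV9450-1AR24'),
                    ('ForCountry', ['DE', 'FR']),
                    ('SomeColumn', ['x', 'y']),
                    ('OtherColumn', ['1', '2'])]),
       OrderedDict([('MaterialNumber', '4MC9672')])]  # invalid: lacks the other columns

expected = {'MaterialNumber', 'ForCountry', 'SomeColumn', 'OtherColumn'}
frames = [pd.DataFrame(od) for od in ods if expected.issubset(od)]
new_df = pd.concat(frames, ignore_index=True)
print(new_df)  # 4 columns, one row per list entry of each valid OrderedDict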

Pig latin join by field

I have a Pig Latin related problem:
I have this data below (in one row):
A = LOAD 'records' AS (f1:chararray, f2:chararray,f3:chararray, f4:chararray,f5:chararray, f6:chararray);
DUMP A;
(FITKA,FINVA,FINVU,FEEVA,FETKA,FINVA)
Now I have another dataset:
B = LOAD 'values' AS (f1:chararray, f2:chararray);
Dump B;
(FINVA,0.454535)
(FITKA,0.124411)
(FEEVA,0.123133)
And I would like to get those two datasets joined: take the corresponding value from dataset B and place it beside the matching value from dataset A. The expected output is below:
FITKA 0.123133, FINVA 0.454535 and so on ..
(They can also be like: FITKA, 0.123133, FINVA, 0.454535 and so on ..)
Then I would be able to multiply the values (0.123133 x 0.454535 .. and so on) because they are on the same row now, and this is what I want.
Of course I can join column by column, but then the values appear at the end of the row and I have to clean them up with another FOREACH ... GENERATE. I want a simpler solution without too many joins, which may cause performance issues.
Dataset A is text (a sentence, in a way).
So what are my options to achieve this?
Any help would be nice.
A sentence can be represented as a tuple containing a bag of (word, count) tuples.
Therefore, I suggest you change the way you store your data to the following format:
sentence:tuple(words:bag{wordcount:tuple(word, count)})
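As a rough sketch of getting each word next to its value with a single JOIN, one option (a variation on the restructuring above, using the TOBAG and FLATTEN built-ins to reshape A into one record per word) would be:

A = LOAD 'records' AS (f1:chararray, f2:chararray, f3:chararray, f4:chararray, f5:chararray, f6:chararray);
-- reshape the single row into one record per word
words = FOREACH A GENERATE FLATTEN(TOBAG(f1, f2, f3, f4, f5, f6)) AS word;
B = LOAD 'values' AS (word:chararray, val:float);
joined = JOIN words BY word, B BY word;
pairs = FOREACH joined GENERATE words::word AS word, B::val AS val;
DUMP pairs;
-- e.g. (FITKA,0.124411), (FINVA,0.454535), ... ready to be multiplied downstream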

how to create set of values, after group function in Pig (Hadoop)

Let's say I have a set of values in file.txt
a,b,c
a,b,d
k,l,m
k,l,n
k,l,o
And my code is:
file = LOAD 'file.txt' using PigStorage(',');
events = foreach file generate session_id, user_id, code, type;
gr = group events by (session_id, user_id);
and I get this set of values:
((a,b),{(a,b,c),(a,b,d)})
((k,l),{(k,l,m),(k,l,n),(k,l,o)})
And I'd like to have:
(a,b,(c,d))
(k,l,(m,n,o))
Have you got any idea how to do it?
Regards
Pawel
Note: you are inconsistent in your question. You say session_id, user_id, code, type in the FOREACH line, but your LOAD with PigStorage does not define those field names. Also, that FOREACH has 4 values, while your sample data only has 3. I'll assume that type doesn't exist in order to answer your question.
After your gr relation, you are left with the group-by key (in this case (session_id, user_id)) in an automatically generated tuple called group.
So, first step: gr2 = FOREACH gr GENERATE FLATTEN(group);
This will give you the tuples (a,b) and (k,l). You need to use FLATTEN because group is a tuple and you are asking for session_id and user_id to be individual columns. FLATTEN does that for you.
Ok, so now modify the gr2 line to also use a projection to tease out the third value:
gr2 = FOREACH gr GENERATE FLATTEN(group), events.code;
events.code creates a bag out of all the code values. events is the name of the bag of grouped tuples (it's named after the original relation).
This should give you:
(a,b,{(c),(d)})
(k,l,{(m),(n),(o)})
It's very important to note that the values in the list are in a bag not a tuple, like you asked for. Keeping it in a bag is the right idea because the bag is a variable list, while a tuple is not.
Additional advice: Understanding how GROUP BY outputs data is something I see a lot of people struggle with when first using Pig. If you think my answer doesn't make much sense, I'd recommend spending some time to really get to understand GROUP BY. Understanding versus thinking it is magic will pay off in the long run.
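Putting those steps together as a runnable sketch (declaring the three columns explicitly and dropping type, per the assumption above):

file = LOAD 'file.txt' USING PigStorage(',') AS (session_id:chararray, user_id:chararray, code:chararray);
events = FOREACH file GENERATE session_id, user_id, code;
gr = GROUP events BY (session_id, user_id);
gr2 = FOREACH gr GENERATE FLATTEN(group), events.code;
DUMP gr2;
-- (a,b,{(c),(d)})
-- (k,l,{(m),(n),(o)})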

Looping within the results of a Pig Group By

Let's say I have a game with player ids. Each id can have multiple character names (playerNames), and we have a score for each of those names. I would like to total the scores per id and calculate each playerName's percentage of that total.
So, for instance:
id playerName playerScore
01 Test 45
01 Test2 15
02 Joe 100
would output
id {(playerName, playerScore, percentScore)}
01 {(Test, 45, .75), (Test2, 15, .25)}
02 {(Joe, 100, 1.0)}
Here's how I did it:
data = LOAD 'someData.data' AS (id:int, playerName:chararray, playerScore:int);
grouped = GROUP data BY id;
withSummedScore = FOREACH grouped GENERATE SUM(data.playerScore) AS summedPlayerScore, FLATTEN(data);
withPercentScore = FOREACH withSummedScore GENERATE data::id AS id, data::playerName AS playerName, ((float)data::playerScore / summedPlayerScore) AS percentScore;
percentScoreIdGroup = GROUP withPercentScore BY id;
Currently, I do this with 2 GROUP BY statements, and I was curious if they were both necessary, or if there's a more efficient way to do this. Can I reduce this to a single GROUP BY? Or, is there a way I can iterate over the bag of tuples and add percentScore to all of them without flattening the data?
No, you cannot do this without 2 GROUPs, and the reason is more fundamental than just Pig:
To get the total number of points, you need a linear pass through the player's scores.
Then you need another linear pass over the player's scores to calculate the fraction. You cannot do this before you know the sum.
Having said that, if the number of playerNames per player is small, I'd write a UDF that takes a bag of player scores and outputs a bag of score-per-playerName tuples, since each GROUP generates another reduce phase and the process becomes ridiculously slow. A UDF that takes the bag would have to do those 2 linear passes as well, but if the bags are small enough it won't matter, and it will certainly be an order of magnitude faster than creating another reducer.
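A sketch of what that single-GROUP version could look like in the script; PercentScores is a hypothetical UDF (not something that ships with Pig) that takes the bag of (playerName, playerScore) tuples and emits (playerName, playerScore, percentScore) tuples:

REGISTER 'myudfs.jar';
data = LOAD 'someData.data' AS (id:int, playerName:chararray, playerScore:int);
grouped = GROUP data BY id;
result = FOREACH grouped GENERATE group AS id, com.example.PercentScores(data.(playerName, playerScore)) AS scores;
DUMP result;
-- (1,{(Test,45,0.75),(Test2,15,0.25)})
-- (2,{(Joe,100,1.0)})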

Max/Min for whole sets of records in PIG

I have a set of records that I am loading from a file, and the first thing I need to do is get the max and min of a column.
In SQL I would do this with a subquery like this:
select c.state, c.population,
(select max(c.population) from state_info c) as max_pop,
(select min(c.population) from state_info c) as min_pop
from state_info c
I assume there must be an easy way to do this in Pig as well, but I'm having trouble finding it. Pig has MAX and MIN functions, but when I tried the following it didn't work:
records=LOAD '/Users/Winter/School/st_incm.txt' AS (state:chararray, population:int);
with_max = FOREACH records GENERATE state, population, MAX(population);
This didn't work. I had better luck adding an extra column with the same value to each row, grouping on that column, and then getting the max of that new group. This seems like a convoluted way of getting what I want, so I thought I'd ask if anyone knows a simpler way.
Thanks in advance for the help.
As you said, you need to group all the data together, but no extra column is required if you use GROUP ALL.
Pig
records = LOAD 'states.txt' AS (state:chararray, population:int);
records_group = GROUP records ALL;
with_max = FOREACH records_group GENERATE
    FLATTEN(records.(state, population)), MAX(records.population);
Input
CA 10
VA 5
WI 2
Output
(CA,10,10)
(VA,5,10)
(WI,2,10)
