pig distinct atom

Suppose my data looks like this with columns named food, action, and population:
pizzas eatenBy humans
pizzas eatenBy collegeKids
pizzas eatenBy everyOne
pizzas grownBy farmers
sprouts grownBy sproutFarmers
sprouts grownBy humans
How can I write a Pig Latin script that produces only one row per unique (food, action) pair, with any valid population taken from that pair's group?
I.e., the only output I'd like from the above data would be this (though the population on the 1st and 3rd lines could be different):
pizzas eatenBy everyOne
pizzas grownBy farmers
sprouts grownBy sproutFarmers
Thank you,

I don't know how you'd do this with DISTINCT (which would be more efficient than what I'm about to suggest), but you could do this:
food = LOAD 'foodInput' AS (foodType, action, population);
-- One group per unique (foodType, action) pair
foodGrouped = GROUP food BY (foodType, action);
foodLimited = FOREACH foodGrouped {
    -- Keep a single, arbitrary row from each group
    limited = LIMIT food 1;
    GENERATE FLATTEN(limited.(foodType, action, population));
};
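
For readers who want to check the expected result outside Pig, the same "group, then keep one row per group" idea can be sketched in plain Python. This is only an illustration: the function name `one_row_per_group` is made up here, and it keeps the first row seen for each pair, whereas Pig's LIMIT makes no promise about which row it returns.

```python
def one_row_per_group(rows):
    """Keep the first row seen for each (food, action) pair."""
    picked = {}
    for food, action, population in rows:
        # setdefault stores the row only if the key is new,
        # which mirrors taking one row per group.
        picked.setdefault((food, action), (food, action, population))
    return list(picked.values())


rows = [
    ("pizzas", "eatenBy", "humans"),
    ("pizzas", "eatenBy", "collegeKids"),
    ("pizzas", "eatenBy", "everyOne"),
    ("pizzas", "grownBy", "farmers"),
    ("sprouts", "grownBy", "sproutFarmers"),
    ("sprouts", "grownBy", "humans"),
]

result = one_row_per_group(rows)
for row in result:
    print(*row)
```

The output has three rows, one per distinct (food, action) pair, which matches the shape of the output asked for in the question.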

Related

Calculate percentage when an item can be assigned to multiple categories

This might be a stupid question, but I'm blanking at the moment and can't find the answer on Google.
I have the following example table:
customer bought
A food
B food,drink
C drink
D drink
Now, how do I calculate the percentage of customers that bought food/drink over total customers? What would be the best way to calculate this?
Solution 1 - count each customer once
% food = #customers who bought food / #total unique customers = 2 / 4 = 50%
% drink = #customers who bought drink / #total unique customers = 3 / 4 = 75%
The problem here is that the total % exceeds 100%.
Solution 2 - count customer B twice, once for drink and once for food
% food = #customers who bought food / #total customers = 2 / 5 = 40%
% drink = #customers who bought drink / #total customers = 3 / 5 = 60%
The total is 100%, but is this the correct way to calculate % in this case?
The issue is obviously that one customer can buy both food and drink, and I'm not sure how to handle this case. Any help would be appreciated. Thank you!
UPDATE:
Thanks for the answer. It makes sense to remove altogether the customers who bought both products. But now I'm wondering what happens if there are more than 2 categories.
An example follows; now we have an additional product (ice cream) in the mix:
customer bought
A food
B food,drink
C drink
D drink
E drink,ice cream
F ice cream
G food,drink,ice cream
I guess with the same logic we can remove customer G since they bought all the products? And how should we handle customer E and B?

I think what you are looking for is simply to exclude the customers who purchased both products.
If you are looking for customers who purchased only one specific product, then:
3 unique customers bought a single unique item
1/3 food => 33%
2/3 drink => 67%
The ones who bought both products don't matter in these calculations, since they add the same percentage to both cases.
EDIT:
I used Excel to help me out; I hope that is fine.
I added your data to an Excel sheet and created a crosstab pivot table.
I've set the Customers as columns and the Products as rows, so that the values field Count of Product shows what each individual customer has purchased.
I've changed Count of Product to be shown as % of Total, so that for each product I can see the percentage split between the customers. For example, if customer B bought both drink and food, he will be shown as 50% in both corresponding rows. If he bought all 3, then 33%.
The grand total column will contain the final percentage for each product row. Excel calculates it the same way as stated in some comments: it counts the products total instead of the customers total, which makes sense since that is the data set being consumed, and not the customers themselves.
If we swap the rows and columns around and do the same calculations, we see individual percentages for each customer (how much of each product, in %, they individually purchased), and we can sum them up for each product. The result is the same.
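
The three ways of counting discussed in this thread (per unique customer, per purchase, and single-item customers only) can be compared side by side with a short Python sketch. The function names are invented for this illustration, and the data is the four-customer table from the question; none of this is tied to Excel's pivot-table implementation.

```python
# Four-customer example from the question: customer -> set of categories bought
purchases = {
    "A": {"food"},
    "B": {"food", "drink"},
    "C": {"drink"},
    "D": {"drink"},
}


def share_of_customers(purchases, category):
    """Buyers of the category divided by unique customers (can sum past 100%)."""
    buyers = sum(1 for items in purchases.values() if category in items)
    return buyers / len(purchases)


def share_of_purchases(purchases, category):
    """Purchases of the category divided by all purchases (sums to 100%)."""
    total = sum(len(items) for items in purchases.values())
    hits = sum(1 for items in purchases.values() if category in items)
    return hits / total


def share_of_single_item_customers(purchases, category):
    """Same ratio, restricted to customers who bought exactly one category."""
    singles = [items for items in purchases.values() if len(items) == 1]
    hits = sum(1 for items in singles if category in items)
    return hits / len(singles)


for category in ("food", "drink"):
    print(
        category,
        f"{share_of_customers(purchases, category):.0%}",
        f"{share_of_purchases(purchases, category):.0%}",
        f"{share_of_single_item_customers(purchases, category):.0%}",
    )
```

Which denominator is "correct" depends on the question being asked: the first answers "what fraction of customers bought this?", the second answers "what fraction of purchases were this?".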

rasa-core Map entities to other entities

I am building a simple food ordering bot. It has a take_order intent from which two entities are extracted, food_item and quantity. Both of these entities are stored in slots of type list. For example, if a user message like this comes in:
I would like to have [one] (quantity) [chicken burger] (food_item) and [two] (quantity) [fries] (food_item)
the slots for this example will be: slot{"quantity": ["one", "two"], "food_item": ["chicken burger", "fries"]}
In the action user_take_order I will multiply the quantity of each item by its price and give the total bill to the user.
But I have a problem in a more complex case. When the user does not provide a quantity for a food_item, I assume a default quantity of one. The problem occurs when the user orders three items and omits the quantity for only the second item, for example:
I would like to have [one] (quantity) [chicken burger] (food_item), [fries] (food_item) and [two] (quantity) [soft drinks] (food_item)
In this example no quantity is provided for the fries, and the slots are filled as: slot{"quantity": ["one", "two"], "food_item": ["chicken burger", "fries", "soft drinks"]}
In the action user_take_order I would like to do this:
1 x price_of_chicken_burger
1 x price_of_fries
2 x price_of_soft_drinks
But the problem is that the quantity slot holds only the quantities of the chicken burger and the soft drinks, and I have no way of knowing that the user did not mention a quantity for the fries (I want to set the quantity of fries to 1 as the default case).
Have I chosen the wrong types for the slots quantity and food_item?
slots:
  food_item:
    type: list
  quantity:
    type: list

One possible solution is to extract the quantity and the food item as one entity:
I would like to have [one chicken burger] (quantity_food_item), [fries] (quantity_food_item) and [two soft drinks] (quantity_food_item)
then split each one into quantity and food item inside an action.
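
A minimal sketch of the splitting step, written as plain Python so it does not depend on any Rasa API. Everything here is an assumption made for illustration: the helper name `split_quantity_food_item`, the small `NUMBER_WORDS` table, and the prices. Inside a real custom action, the list of entity strings would come from the tracker.

```python
# Hypothetical lookup of spelled-out quantities; extend as needed.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

# Hypothetical price list, used only to show the bill calculation.
PRICES = {"chicken burger": 5.0, "fries": 2.0, "soft drinks": 1.5}


def split_quantity_food_item(text, default_quantity=1):
    """Split 'two soft drinks' into (2, 'soft drinks'); default the quantity to 1."""
    words = text.strip().split()
    if len(words) > 1:
        first = words[0].lower()
        if first in NUMBER_WORDS:
            return NUMBER_WORDS[first], " ".join(words[1:])
        if first.isdigit():
            return int(first), " ".join(words[1:])
    # No leading quantity found: treat the whole text as the item name.
    return default_quantity, " ".join(words)


entities = ["one chicken burger", "fries", "two soft drinks"]
order = [split_quantity_food_item(entity) for entity in entities]
total = sum(quantity * PRICES[item] for quantity, item in order)

print(order)
print(total)
```

Because each quantity travels with its own item, a missing quantity can only affect that one item, which is the ambiguity the two separate list slots could not resolve.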

Complicated Cube Query

I'm working on a fairly complicated view, which calculates the total cost of a guest's stay based on data pulled from four different tables. The output, however, is not exactly what I want. My code is:
CREATE OR REPLACE VIEW Price AS
SELECT UNIQUE
    Booking.Booking_ID AS "Booking",
    Booking.GuestID AS "Guest ID",
    Room.Room_Price * (Booking.CheckOutDate - Booking.CheckInDate) AS "Room Price",
    Add_Ons.Price AS "Add ons Price",
    Room.Room_Price * (Booking.CheckOutDate - Booking.CheckInDate) + (Add_Ons.Price) AS "Total Price"
FROM Booking
JOIN Room ON Room.Room_Num = Booking.Room_Num
JOIN Booking_Add_Ons ON Booking.Booking_ID = Booking_Add_Ons.Booking_ID
JOIN Add_Ons ON Booking_Add_Ons.Add_On_ID = Add_Ons.Add_On_ID
ORDER BY Booking.Booking_ID;
Now, I'm trying to get this to return the total cost of all add-ons plus the cost of the hotel room as the total price. However, it returns the cost of the room plus each of the add-ons on a separate line. As follows:
My question is, is it possible to use something like CUBE, or SUM to add up the rows, so that there is only one entry for each of the Bookings with the total price of all add-ons accounted for?
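
One way to see what SUM with GROUP BY does to this kind of join is a small, self-contained experiment in Python's built-in sqlite3 module. This is a sketch under loud assumptions, not the original Oracle schema: the sample rows are invented, and a hypothetical `Nights` column stands in for `CheckOutDate - CheckInDate` because SQLite does not subtract dates the way Oracle does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Room (Room_Num INTEGER PRIMARY KEY, Room_Price REAL);
-- Nights is a stand-in for CheckOutDate - CheckInDate (assumption)
CREATE TABLE Booking (Booking_ID INTEGER PRIMARY KEY, GuestID INTEGER,
                      Room_Num INTEGER, Nights INTEGER);
CREATE TABLE Add_Ons (Add_On_ID INTEGER PRIMARY KEY, Price REAL);
CREATE TABLE Booking_Add_Ons (Booking_ID INTEGER, Add_On_ID INTEGER);

-- Invented sample data
INSERT INTO Room VALUES (101, 100.0), (102, 150.0);
INSERT INTO Booking VALUES (1, 7, 101, 2), (2, 8, 102, 1);
INSERT INTO Add_Ons VALUES (1, 10.0), (2, 25.0);
INSERT INTO Booking_Add_Ons VALUES (1, 1), (1, 2), (2, 2);
""")

# Grouping by booking collapses the one-row-per-add-on join result
# into one row per booking, with the add-on prices summed.
rows = conn.execute("""
SELECT b.Booking_ID,
       b.GuestID,
       r.Room_Price * b.Nights                AS room_price,
       SUM(a.Price)                           AS add_ons_price,
       r.Room_Price * b.Nights + SUM(a.Price) AS total_price
FROM Booking b
JOIN Room r ON r.Room_Num = b.Room_Num
JOIN Booking_Add_Ons ba ON ba.Booking_ID = b.Booking_ID
JOIN Add_Ons a ON a.Add_On_ID = ba.Add_On_ID
GROUP BY b.Booking_ID, b.GuestID, r.Room_Price, b.Nights
ORDER BY b.Booking_ID
""").fetchall()

for row in rows:
    print(row)
```

Note that the inner joins drop any booking that has no add-ons at all; whether that matters depends on the data.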

Looping within the results of a Pig Group By

Let's say I have a game with player ids. Each id can have multiple character names (playerNames) and we have a score for each of those names. I would like to total all the scores per playerName, and calculate the percentage score per player name per id.
So, for instance:
id playerName playerScore
01 Test 45
01 Test2 15
02 Joe 100
would output
id {(playerName, playerScore, percentScore)}
01 {(Test, 45, .75), (Test2, 15, .25)}
02 {(Joe, 100, 1.0)}
Here's how I did it:
data = LOAD 'someData.data' AS (id:int, playerName:chararray, playerScore:int);
grouped = GROUP data BY id;
withSummedScore = FOREACH grouped GENERATE SUM(data.playerScore) AS summedPlayerScore, FLATTEN(data);
-- Cast to double so the division is not truncated to an integer
withPercentScore = FOREACH withSummedScore GENERATE data::id AS id, data::playerName AS playerName, data::playerScore AS playerScore, ((double)data::playerScore / (double)summedPlayerScore) AS percentScore;
percentScoreIdGroup = GROUP withPercentScore BY id;
Currently, I do this with 2 GROUP BY statements, and I was curious if they were both necessary, or if there's a more efficient way to do this. Can I reduce this to a single GROUP BY? Or, is there a way I can iterate over the bag of tuples and add percentScore to all of them without flattening the data?

No, you cannot do this without two GROUPs, and the reason is more fundamental than just Pig:
To get the total number of points you need a linear pass through the player's scores.
Then you need another linear pass over the player's scores to calculate the fractions. You cannot do this before you know the sum.
Having said that, if the number of playerNames per player is small, I'd write a UDF that takes a bag of player scores and outputs a bag of score-per-playerName tuples, since each GROUP adds a reduce phase and the process becomes ridiculously slow. A UDF that takes the bag would have to do those two linear passes as well, but if the bags are small enough it won't matter, and it will certainly be an order of magnitude faster than creating another reducer.
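
The two linear passes described above can be sketched in plain Python to show the logic such a UDF would carry out on one bag. This is not a Pig UDF and does not use any Pig API; the function name `percent_scores` and the dictionary standing in for the grouped relation are assumptions made for illustration.

```python
def percent_scores(bag):
    """Return (playerName, playerScore, percentScore) for every tuple in the bag."""
    # Pass 1: the sum must be known before any fraction can be computed.
    total = sum(score for _, score in bag)
    # Pass 2: divide each score by the total (guarding against a zero total).
    return [
        (name, score, score / total if total else 0.0)
        for name, score in bag
    ]


# Stand-in for the result of GROUP data BY id: id -> bag of (playerName, playerScore)
by_id = {
    "01": [("Test", 45), ("Test2", 15)],
    "02": [("Joe", 100)],
}

result = {player_id: percent_scores(bag) for player_id, bag in by_id.items()}
for player_id, scored in result.items():
    print(player_id, scored)
```

Both passes run inside a single call on an in-memory bag, which is why this only makes sense when each bag is small.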

Pig: Pulling individual fields out after a GROUP

In Pig Latin, I want to pull the other fields out of a record that I select because of an aggregate, such as MAX.
I'm having trouble explaining the problem, so here is an example. Let's say I want to grab the name of the oldest person at a household:
Relation A has four columns: (name, address, zipcode, age)
B = GROUP A BY (address, zipcode); -- group by the address
-- generate the address and the max age, but how do I grab that person's name?
C = FOREACH B GENERATE FLATTEN(group), MAX(A.age), ??? Name ???;
How do I generate the name of the person with the MAX age?

The problem with your logic is that there can be more than one person with the MAX(age). Then you have to GROUP BY (name, address, age). But to give you a quick answer, I will write something that gets only one of the people with the max age. (I am not sure it's the optimal way, though.)
C = FOREACH B {
    DA = ORDER A BY age DESC;
    DB = LIMIT DA 1;
    GENERATE FLATTEN(group), FLATTEN(DB.age), FLATTEN(DB.name);
}

Be careful with frail's accepted answer, as it has undesirable behavior if the number in the LIMIT command is higher than 1. In that case the output is a cross-product between all ages and names, because of the last two FLATTEN calls. So if the value in the LIMIT is N, there are N^2 output rows instead of the intended N.
It is much safer to do the following in the GENERATE line, which gives exactly the same result as the accepted answer when 'LIMIT 1' is used:
GENERATE FLATTEN(group) AS (address, zipcode), FLATTEN(DB.(age, name)) AS (age, name);
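
The N^2 versus N difference can be reproduced with a small Python sketch. It only models the row counts: `itertools.product` stands in for what two separate FLATTEN calls in one GENERATE produce, and the two sample people are invented.

```python
from itertools import product

# Hypothetical result of ORDER ... DESC followed by LIMIT 2: (age, name) tuples
top = [(70, "Ann"), (68, "Bob")]

# Flattening the age bag and the name bag separately pairs every age
# with every name, like a cross-product.
ages = [age for age, _ in top]
names = [name for _, name in top]
separate = list(product(ages, names))

# Flattening the (age, name) tuples together keeps each pair intact.
together = list(top)

print(len(separate), separate)
print(len(together), together)
```

With N = 2 the separate version yields 4 rows, including mismatched pairs such as (70, "Bob"), while the combined version yields the intended 2.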
