Grouping by bag value in Pig - user-defined-functions

I've been stuck on this question for a while. I have a data file that looks like this:
2012/01/01 Name1 "Category1,Category2,Category3"
2012/01/01 Name2 "Category2,Category3"
2012/01/01 Name3 "Category1,Category5"
Each item is associated with a comma-separated list of categories. I would like to be able to group by category name, to get output like this:
Category1 Name1, Name3
Category2 Name1, Name2
...
Category5 Name3
(Even more specifically, I don't need the names of the items; just a count of the items in each category would do.)
I ended up writing a UDF to take the comma-separated category field, and convert it to a Pig bag. My data schema is now something like this:
{date: chararray, name: chararray, categories: {t: (category:chararray)}}
I am stuck on the next step: actually grouping by the nested bag's values. I have tried variations of a nested FOREACH statement without any luck. For example:
x = FOREACH myData
{
categoryNames = FOREACH categories GENERATE category;
GENERATE myData.Name, categoryNames;
}
My thought was that this kind of syntax could generate tuples of (Name, category), which I can run a GROUP over. However, the actual output is the whole bag, taking me back to square 1. I am out of ideas on how to proceed - help/feedback would be most appreciated. Thanks!

Assuming each name is unique in your data file, you could FLATTEN the bag of categories, then GROUP by category and COUNT the number of names per category.
e.g.
name_category = FOREACH data GENERATE
    name,
    FLATTEN(categories) AS category;

category_group = GROUP name_category BY category;

category_count = FOREACH category_group GENERATE
    FLATTEN(group) AS category,
    COUNT(name_category) AS count;
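As a side note, if you would rather skip the custom UDF, Pig 0.10 and later let TOKENIZE take a custom delimiter, so the bag can be built inline. A sketch, assuming the file is space-delimited and the category list is wrapped in literal double quotes:

raw = LOAD 'mydata.txt' USING PigStorage(' ')
    AS (date:chararray, name:chararray, cats:chararray);

-- strip the surrounding quotes, then split the list on commas into a bag
data = FOREACH raw GENERATE
    date,
    name,
    TOKENIZE(REPLACE(cats, '"', ''), ',') AS categories;

From there the FLATTEN/GROUP/COUNT pipeline above works unchanged.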

Related

Iterate on 2 Data Sources in PIG

I have 2 data sources
1) Params.txt which has the following content
item1
item2
item2
.
.
.
itemN
2) Data.txt which has the following content
The names (aliases) of relations A, B, and C are case sensitive.
The names (aliases) of fields f1, f2, and f3 are case sensitive.
Function names PigStorage and COUNT are case sensitive.
Keywords LOAD, USING, AS, GROUP, BY, FOREACH, GENERATE,
and DUMP are case insensitive. They can also be written
The task is to see whether each of the N items in the param file exists in each line of the data file.
This is the pseudocode for it:
FOREACH d IN data:
FOREACH PARAM IN PARAMS:
IF PARAM IN d:
GENERATE PARAM,1
Is something of this sort possible in Pig scripting? If yes, could you please point me in that direction.
Thanks
This is possible in Pig, but Pig is perhaps an unusual language to solve the problem!
I would approach the problem like this:
Load in Params.txt
Load in Data.txt and tokenise each line (assuming you're happy to split the text on spaces - you might need to think about what to do with punctuation)
Flatten the bag from tokenise to get one "word" per record in the relation.
Join the Params and Data relations. An inner join would give you words that are only in both.
Group the data and then count the occurrence of each word.
params = LOAD 'Params.txt' USING PigStorage() AS (param_word:chararray);
data = LOAD 'Data.txt' USING PigStorage() AS (line:chararray);

token_data = FOREACH data GENERATE TOKENIZE(line) AS words:{(word:chararray)};
token_flat = FOREACH token_data GENERATE FLATTEN(words) AS (word);

joined = JOIN params BY param_word, token_flat BY word;

word_count = FOREACH (GROUP joined BY params::param_word) GENERATE
    group AS param_word,
    COUNT(joined) AS param_word_count;
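The count above is over the whole file. If you also need to know which lines each param matched (your pseudocode checks every param against every line), one variant is to carry the original line through the pipeline; a sketch reusing the relations loaded above, with new aliases so it can sit alongside the script as written:

-- keep the source line next to each tokenized word
token_with_line = FOREACH data GENERATE line, FLATTEN(TOKENIZE(line)) AS word;

-- the inner join keeps only (param, line) pairs where the param appears in the line
line_matches = JOIN params BY param_word, token_with_line BY word;

per_line = FOREACH line_matches GENERATE
    params::param_word AS param_word,
    token_with_line::line AS line;

Add a DISTINCT on per_line if a word repeated within a single line should only count once.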

join two tables in linq with special conditions

I hope someone can help me; I am new to LINQ.
I have 2 tables named tblCart and tblOrderDetail.
I'll just show some of the fields in these two tables to illustrate my problem:
tblCart:
ID,
CartID,
Barcode,
and tblOrderDetail:
ID,
CartID,
IsCompleted
Barcode
When someone places an order, before they confirm the request, one row is temporarily inserted into tblCart.
Then, if they confirm the request, another row is inserted into tblOrderDetail.
Now I don't want to show the rows that have been inserted into tblOrderDetail (I want to show just the temporary rows that exist only in tblCart).
In other words, if there is a row in tblCart with CartID=1 and at the same time a row with CartID=1 exists in tblOrderDetail, then I don't want that row.
All in all, I want just the rows that are not in tblOrderDetail, and the field to match on is CartID.
I should mention that I set IsCompleted=true, so that flag could also be used to exclude the rows we do not want.
I did this:
var cartItems = context.tblCarts
.Join(context.tblSiteOrderDetails,
w => w.CartID,
orderDetail => orderDetail.cartID,
(w,orderDetail) => new{w,orderDetail})
.Where(a=>a.orderDetail.cartID !=a.w.CartID)
.ToList()
However, it doesn't work.
One example:
tblCart:
ID=1
CartID=1213
Barcode=4567
ID=2
CartID=1214
Barcode=4567
ID=3
CartID=1215
Barcode=6576
tblOrderDetail:
ID=2
CartID=1213
Barcode=4567
IsCompleted=true
With this data it should show just the last two rows in tblCart, I mean:
ID=2
CartID=1214
Barcode=4567
ID=3
CartID=1215
Barcode=6576
This sounds like a case for WHERE NOT EXISTS in SQL.
Roughly translated, it would be something like this in LINQ:
var cartItems = context.tblCarts.Where(crt => !context.tblSiteOrderDetails.Any(od => od.cartID == crt.CartID));
If you have a navigation property on cart to reference details (I'll assume it's called Details), then:
var results=context.tblCarts.Where(c=>!c.Details.Any(d=>d.IsCompleted));

Pig - how to select only some values from the list (not just simple distinct)?

Let's say I have input_file.txt (user_id, event_code, event_date):
1,a,1
1,b,2
2,a,3
2,b,4
2,b,5
2,b,6
2,c,7
2,b,8
as you can see, user_id = 2, has events like this: abbbcb
I'd like to have a result like this:
1,{(a,1),(b,2)}
2,{(a,2),(b,6),(c,7),(b,8)}
So when there are several events in a row with the same code, I'd like to keep only the last one of that run.
Can you please share any hints?
Regards
Pawel
The main thing you are describing is what GROUP BY does.
In this case:
B = GROUP A BY user_id;
Gets your records together by user_id. Each group now holds every event for that user, so your data will look like this:
1,{(1,a,1),(1,b,2)}
2,{(2,a,3),(2,b,4),(2,b,5),(2,b,6),(2,c,7),(2,b,8)}
You say you only want the last one (I assume you mean the one with the greatest event_date). To do this, you can do a nested FOREACH with an ORDER BY to sort by date, and then take the first one with LIMIT. Note that this has arbitrary behavior when there are ties.
C = FOREACH B {
DA = ORDER A BY event_date DESC;
DB = LIMIT DA 1;
GENERATE FLATTEN(group), FLATTEN(DB.event_code), FLATTEN(DB.event_date);
}
Your data should now look like this:
1,b,2
2,b,8
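For what it's worth, on Pig 0.8 and later the built-in TOP function can replace the ORDER/LIMIT pair. Since each tuple in A already carries user_id, flattening its result gives the same three columns; a sketch under that assumption:

-- TOP(1, 2, A) keeps the single tuple with the largest value in field 2 (event_date)
C = FOREACH B GENERATE FLATTEN(TOP(1, 2, A));

As with ORDER/LIMIT, ties on event_date are broken arbitrarily.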
Another option would be to use a UDF to write some custom behavior on the groups given by GROUP BY:
B = GROUP A BY user_id;
C = FOREACH B GENERATE YourUDFThatYouBuilt(group, A);
In that UDF you'd write whatever custom behavior you want (in this case return the tuple with the greatest date)
It seems like you could use the DistinctBy UDF from Apache DataFu to achieve this. This UDF, given a bag, returns the first instance found for a given field. In your case the field you care about is event_code. But we have to reverse the order, as you actually want the last instance.
One clarification though. Correct me if I'm wrong, but I think the intended output is:
1,{(a,1),(b,2)}
2,{(a,3),(b,6),(c,7),(b,8)}
That is, the (a,3) event belongs to user 2; there is no (a,2) event anywhere in the input (user 1 has (a,1) and (b,2)).
Here's how you can do it:
-- pass in 1 because we want distinct by event code (position 1)
define DistinctBy datafu.pig.bags.DistinctBy('1');
result = FOREACH (GROUP A BY user_id) {
-- reverse so we can take the last event code occurrence
A_reversed = ORDER A BY event_date DESC;
-- use DistinctBy to get the first tuple having an occurrence of a field value
A_distinct_by_code = DistinctBy(A_reversed);
-- put back in order again
A_ordered = ORDER A_distinct_by_code BY event_date ASC;
GENERATE group as user_id, A_ordered.(event_code,event_date);
}
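One practical note: the DEFINE above assumes the DataFu jar is already on Pig's classpath. If it is not, register it first; the path and version below are placeholders for whatever you have installed:

REGISTER /path/to/datafu-1.2.0.jar;  -- placeholder jar location
-- after this, the DEFINE of datafu.pig.bags.DistinctBy above resolves normally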

Linq datatable to get unique rows and their count

I have a data table like:
country
China
India
Thailand
India
china
china
Thailand
Hong kong
India
Can I get my output as shown below using LINQ?
Country Count
India 3
China 2
Thailand 2
Hong kong 1
As Ben Allred pointed out, what you're likely looking for is the LINQ GroupBy method.
Using query syntax, it may look something like this:
var query = from tuple in table
group tuple by tuple.Country into g
select new { Country = g.Key, Count = g.Count() };
query now contains an IEnumerable collection of anonymous objects which have as members the string Country and the integer Count representing the number of occurrences of that country in the table.
You can now of course iterate over these objects as such:
foreach (var item in query)
{
Console.WriteLine("Country : {0} - Count : {1}", item.Country, item.Count);
}
For more examples, I strongly suggest the 101 LINQ Samples.
It's also worth pointing out, if you haven't used LINQ before, that the processing is deferred, meaning the query isn't actually executed until you try to access its items, for example in the foreach statement above. If reading from the table is expensive and you intend to use the results more than once, you can call ToList() on query to materialize them into a concrete collection.

Hadoop Pig ordered results; find order position?

I want to sort my pig results, and then be able to determine where certain items are in my ordered results. Example:
mydata = LOAD 'mydata.txt' AS (label:chararray, rank_score:float);
ranked_data = ORDER mydata BY rank_score DESC;
ranked_positions = FOREACH ranked_data GENERATE label, AUTO_INCREMENT_ID;
results = FILTER ranked_positions BY label == 'item1' OR label == 'item2';
DUMP results;
AUTO_INCREMENT_ID would auto-increment in my perfect world. Given how mappers/reducers are independent from each other, I'm guessing Pig/Hadoop may not support this. If not, can you think of another way to generate my end result?
Example input:
item1 34.33
item2 48.39
item3 93.3
Desired output:
item1 3
item2 2
If you set the parallelism of ORDER to 1, you can do the auto-increment yourself in a UDF; of course, that has the potentially undesired effect of using only one reducer to do your sorting.
(Note that because you order by rank_score DESC, item3 comes first, so item1 lands at position 3 and item2 at position 2, which matches your desired output.)
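For completeness, if you are on Pig 0.11 or later, the built-in RANK operator produces this kind of position column directly, without a UDF and without forcing the sort through a single reducer; a sketch along the lines of your example:

mydata = LOAD 'mydata.txt' AS (label:chararray, rank_score:float);
ranked_positions = RANK mydata BY rank_score DESC;  -- prepends a rank column
results = FILTER ranked_positions BY label == 'item1' OR label == 'item2';
DUMP results;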
