Separate tuple of tuples in Pig - Hadoop

I have a result in the form of a tuple of tuples. I need to put the data from each inner tuple into its own column, but I don't know how to achieve that. Below is my data.
Sample:
((completed,completed,completed,completed,completed,completed,completed,completed,completed,completed,completed,completed,completed,completed,completed,completed,completed,completed,completed),(10160-0),(20140403,20151207,20160313,20101225,20100420,20110208,20100419,20110310,20100412,20120130,20110729),(20160306,20110822,20110822,20110822,20110321,20110608,20110822,20120326,20110822),(24,12,24,24,7,24,8,8,7,24,24,24,24,6),(h,h,h,h,d,h,h,h,d,h,h,h,h,h),(1,30,1,1,1,1,1,1,1,30,1,1,1,1,0.0,1,1,1),(1,Ointment,Tablet,Tablet,1,Tablet,1,1,Cream,Ointment,1,Tablet_ER_24HR,1,1,Tablet,Cream,Tablet_Disperse,Teaspoon(s))
Expected output:
completed,10160-0,20140403,20160306,24,h,1,1
completed,10160-0,20151207,20110822,12,h,30,Ointment
completed,10160-0,20160313,20110822,24,h,1,Tablet
completed,10160-0,20101225,20110822,24,h,1,1
completed,10160-0,20100420,20110321,7,h,1,1
completed,10160-0,20110208,20110608,24,h,1,Cream
completed,10160-0,20100419,20110822,8,h,1,Ointment
completed,10160-0,20110310,20120326,8,d,1,1
completed,10160-0,20100412,20110822,7,h,30,Tablet_ER_24HR

TOBAG should do it. Assuming you have a relation A containing the tuple of tuples:
B = FOREACH A GENERATE TOBAG(*);

Related

Simpler alternative to simultaneously Sort and Filter by column in Google Spreadsheets

I have a spreadsheet (here's a copy) with the following (headered) columns:
A: Indices for a list of groceries;
B: Names for the groceries to be indexed by column A;
C: Check column with "x" for inactive items in column B, empty otherwise;
D: Sorting indices that I want to apply to column B;
Currently, I am getting the sorted AND filtered result with this formula:
=SORT(FILTER(B2:B; C2:C = ""); FILTER(D2:D; C2:C = ""); TRUE)
The problem is that I need to apply the filter twice, once for the items and once for the indices; otherwise I get a mismatch between the elements passed to the SORT function.
I feel that this doesn't scale well since it creates duplication.
Is there a way to get the same results with a simpler formula or another arrangement of columns?
=SORT(FILTER({Itens!B2:B\Itens!G2:G}; Itens!D2:D=""))
=SORT(FILTER({Itens!B2:B\Itens!G2:G}; Itens!D2:D="");2;1)
or maybe: =SORT(FILTER(Itens!B2:B; Itens!D2:D="");2;1)

Map IDs to matrix rows in Hadoop/MapReduce

I have data about users buying products. I want to create a binary matrix of size |users| x |products| such that the element (i,j) in the matrix is 1 iff user_i has bought product_j, else the value is 0.
Now, my data looks something like
userA, productX
userB, productY
userA, productZ
...
UserIds and productIds are all strings. My problem is how to map these IDs to row indices (for users) and column indices (for products) in the matrix.
There are over a million unique userIds and roughly 3 million productIds.
To make the problem well defined: given input like the (user, product) pairs above, how do I convert it to something like
1,1
2,2
1,3
where userA is mapped to row 1 of the matrix, userB is mapped to row 2, productX is mapped to column 1, and so on.
Given the size of the data, I would have to use Hadoop MapReduce, but I can't think of a foolproof way of doing this efficiently.
This can be solved if we can do the following:
Dump unique userIds.
Dump unique productIds.
Map each unique userId in (1) to a row index.
Map each unique productId in (2) to a column index.
I can do (1) and (2) easily, but I'm having trouble coming up with an efficient approach for (3) ((4) follows once (3) is solved).
I have a couple of solutions but they are not foolproof.
Solution 1 (naive) for step 3 above
Map all userIds and emit the same key (say "1") for all map tasks.
Have a long counter initialized to 0 in setup() of the reducer.
In the reduce(), emit the counter value along with the input userId and increment the counter by 1.
This would be very inefficient since all 100 million userIds would be processed by a single reducer.
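For illustration, a minimal sketch of that single reducer might look like the following. It assumes the mapper has already emitted each unique userId under the constant key "1"; the class and field names are invented here, not taken from the original post.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Single reducer that assigns sequential row indices to the (already unique) userIds.
public class AssignRowIdReducer extends Reducer<Text, Text, Text, LongWritable> {

    private long counter;   // next row index to hand out

    @Override
    protected void setup(Context context) {
        counter = 0L;       // initialized once, as described in the solution
    }

    @Override
    protected void reduce(Text key, Iterable<Text> userIds, Context context)
            throws IOException, InterruptedException {
        // Every userId arrives under the same constant key, so this single loop
        // sees the whole ID set, which is exactly why this approach is slow.
        for (Text userId : userIds) {
            context.write(new Text(userId), new LongWritable(counter));
            counter++;
        }
    }
}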
Solution 2 for step 3 above
While mapping userIds, emit each userId against a key which is an integer uniformly sampled from 1, 2, 3, ..., N (where N is configurable; N = 100, for example). In a way, we are partitioning the input set.
Within the mapper, use Hadoop counters to count the number of userIds assigned to each of those random partitions.
In the reducer setup, first access the counters from the mapping stage to determine how many IDs were assigned to each partition. Use these counters to determine the start and end values for that partition.
Iterate (while counting) over each userId in reduce and generate matrix rowId as start_of_partition + counter.
context.write(userId, matrixRowId)
This method should work, but I am not sure how to handle cases where reducer tasks fail or are killed.
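As a sketch only, the mapper side of Solution 2 with per-partition counters could look roughly like this. The counter group, the userid.partitions property, and the class name are invented for illustration; reading those counters back in the reducer and coping with re-executed reduce attempts (the open issue above) is not shown.
import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Assigns each userId a uniformly random partition 1..N and counts per-partition sizes.
// The input is assumed to be one unique userId per line.
public class PartitionUserIdMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private static final String COUNTER_GROUP = "userid-partitions";  // invented name
    private final Random random = new Random();
    private int numPartitions;

    @Override
    protected void setup(Context context) {
        // N is configurable, e.g. passed as -D userid.partitions=100
        numPartitions = context.getConfiguration().getInt("userid.partitions", 100);
    }

    @Override
    protected void map(LongWritable offset, Text userId, Context context)
            throws IOException, InterruptedException {
        int partition = random.nextInt(numPartitions) + 1;   // uniform in 1..N
        context.write(new IntWritable(partition), userId);
        // The reducer needs these totals to compute the start offset of its partition.
        context.getCounter(COUNTER_GROUP, String.valueOf(partition)).increment(1);
    }
}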
I believe there are ways of doing this that I am not aware of. Can we use hashing/modulo to achieve this? How would we handle collisions at scale?

Pig Latin join by field

I have a Pig Latin-related problem:
I have this data below (in one row):
A = LOAD 'records' AS (f1:chararray, f2:chararray,f3:chararray, f4:chararray,f5:chararray, f6:chararray);
DUMP A;
(FITKA,FINVA,FINVU,FEEVA,FETKA,FINVA)
Now I have another dataset:
B = LOAD 'values' AS (f1:chararray, f2:chararray);
Dump B;
(FINVA,0.454535)
(FITKA,0.124411)
(FEEVA,0.123133)
And I would like to join those two datasets: take the corresponding value from dataset B and place it beside the value from dataset A. The expected output is below:
FITKA 0.123133, FINVA 0.454535 and so on ..
(They can also be like: FITKA, 0.123133, FINVA, 0.454535 and so on .. )
Then I would be able to multiply the values (0.123133 x 0.454535, and so on) because they are on the same row, which is what I want.
Of course I can join column by column, but then the values end up at the end of the row and I have to clean them up with another FOREACH ... GENERATE. I want a simpler solution without too many joins, which may cause performance issues.
Dataset A is text (a sentence, in a way).
So what are my options to achieve this?
Any help would be nice.
A sentence can be represented as a tuple containing a bag of (word, count) tuples.
Therefore, I suggest you change the way you store your data to the following format:
sentence:tuple(words:bag{wordcount:tuple(word, count)})

Hadoop: how to normalize data in multiple columns?

I have a .txt file like this:
1036177 19459.7356 17380.3761 18084.1440
1045709 19674.2457 17694.8674 18700.0120
1140443 19772.0645 17760.0904 19456.7521
where the first column represents the key and the others are the values.
I would like to min-max normalize each column and then sum up the columns.
Can someone give me some advice on how to do that in MapReduce?
From an algorithmic perspective you'll need to:
Mapper
Parse / tokenize each input line by its delimiter (space?)
Use a Text object to encapsulate the key field
Either create a custom value class to encapsulate the other fields or use an ArrayWritable wrapper
Output this Key / Value from your Mapper
Reducer
All values will be grouped by the same key, so here you'll just need to process each input value and calculate the min, max and sum for each column
Finally output your result
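For illustration only, a rough single-reducer sketch of that idea follows. Note it diverges from the step above that keys by the first field: to get column-wide minima and maxima in one job, this sketch routes every row to one constant key and buffers the rows in memory, which only works while a single reducer can hold the data. All class and field names are invented.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NormalizeColumns {

    // Routes every input row to the same group so one reducer sees all of them.
    public static class RowMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final Text ALL = new Text("all");
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(ALL, line);
        }
    }

    // Finds each column's min/max, then emits the per-row sum of the normalized columns.
    public static class NormalizeReducer extends Reducer<Text, Text, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> rows, Context context)
                throws IOException, InterruptedException {
            List<String[]> buffered = new ArrayList<>();
            double[] min = {Double.MAX_VALUE, Double.MAX_VALUE, Double.MAX_VALUE};
            double[] max = {-Double.MAX_VALUE, -Double.MAX_VALUE, -Double.MAX_VALUE};

            // First pass: buffer the rows and track each value column's min and max.
            for (Text row : rows) {
                String[] fields = row.toString().split("\\s+");   // key + 3 value columns
                buffered.add(fields);
                for (int c = 0; c < 3; c++) {
                    double v = Double.parseDouble(fields[c + 1]);
                    min[c] = Math.min(min[c], v);
                    max[c] = Math.max(max[c], v);
                }
            }

            // Second pass: min-max normalize each column and sum the columns per row.
            for (String[] fields : buffered) {
                double sum = 0.0;
                for (int c = 0; c < 3; c++) {
                    double v = Double.parseDouble(fields[c + 1]);
                    double range = max[c] - min[c];
                    sum += range == 0.0 ? 0.0 : (v - min[c]) / range;
                }
                context.write(new Text(fields[0]), new DoubleWritable(sum));
            }
        }
    }
}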
You might want to look at using Apache Pig, which should make this task much easier (untested):
grunt> A = LOAD '/path/to/data.txt' USING PigStorage(' ')
       AS (key:chararray, fld1:float, fld2:float, fld3:float);
grunt> GRP = GROUP A BY key;
grunt> B = FOREACH GRP GENERATE group, MIN(A.fld1), MAX(A.fld1), SUM(A.fld1),
       MIN(A.fld2), MAX(A.fld2), SUM(A.fld2),
       MIN(A.fld3), MAX(A.fld3), SUM(A.fld3);
grunt> STORE B INTO '/path/to/output' USING PigStorage('\t', '-schema');

How to create a set of values after the GROUP function in Pig (Hadoop)

Let's say I have this set of values in file.txt:
a,b,c
a,b,d
k,l,m
k,l,n
k,l,o
And my code is:
file = LOAD 'file.txt' using PigStorage(',');
events = foreach file generate session_id, user_id, code, type;
gr = group events by (session_id, user_id);
and I get this set of values:
((a,b),{(a,b,c),(a,b,d)})
((k,l),{(k,l,m),(k,l,n),(k,l,o)})
And I'd like to have:
(a,b,(c,d))
(k,l,(m,n,o))
Have you got any idea how to do it?
Regards
Pawel
Note: you are inconsistent in your question. You say session_id, user_id, code, type in the FOREACH line, but your LOAD with PigStorage does not provide a schema with those names. Also, that FOREACH has 4 values, while your sample data only has 3. I'll assume that type doesn't exist in order to answer your question.
After your gr relation, you are left with the group-by key (in this case (session_id, user_id)) in an automatically generated tuple called group.
So, first step: gr2 = FOREACH gr GENERATE FLATTEN(group);
This will give you the tuples (a,b) and (k,l). You need to use FLATTEN because group is a tuple and you are asking for session_id and user_id to be individual columns. FLATTEN does that for you.
Ok, so now modify the gr2 line to also use a projection to tease out the third value:
gr2 = FOREACH gr GENERATE FLATTEN(group), events.code;
events.code creates a bag out of all the code values. events is the name of the bag of grouped tuples (it's named after the original relation).
This should give you:
(a, b, {c, d})
(k, l, {m, n, o})
It's very important to note that the values in the list are in a bag, not a tuple as you asked for. Keeping them in a bag is the right idea because a bag is a variable-length list, while a tuple is not.
Additional advice: understanding how GROUP BY outputs data is something I see a lot of people struggle with when first using Pig. If you think my answer doesn't make much sense, I'd recommend spending some time really getting to understand GROUP BY. Understanding it, rather than treating it as magic, will pay off in the long run.
