Extract ordered tuple values from a bag - hadoop

In Pig I massaged my data into something like:
(a,{(b,c),(d,e),(f,g)})
(h,{(i,j),(k,l)})
where the first item is the group and the bag contains the other values related to the group. I would like to get it into the following format:
(a,b,c,d,e,f,g)
(h,i,j,k,l)
I got to where I am now with
grunt> j = foreach G {
>> o = order myvar by second;
>> generate group, o.(first,second);
>> };
So the tuples in the bag are currently ordered. If I do something like mystuff = foreach j generate group, flatten($1); I get it all flattened and un-grouped.
Is this possible in Pig, and if so, what command should I be looking at?

There is no way I know of that can do what you want out of the box. You really need to use a user-defined function for this. I know it sucks because you have to write Java or Python code, but you'll find several situations where Pig just doesn't go far enough. Pig can be considered a data flow language and not so much a programming language, which is why UDFs play such an important role: they bridge the gap.
My suggestion is that you write a UDF that takes in the group and the bag of values as parameters, and do both the ordering/sorting and the flattening inside the UDF.
The other thing you want to be careful about is that your rows will now have different numbers of columns, and Pig doesn't really like this. If you are just immediately outputting the result, you can probably get away with it. Otherwise, you might want to consider having your UDF write out the list as a tab-delimited string or something similarly preformatted. This isn't that big of a deal... feel free to ignore my advice here.
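For illustration, here is a minimal sketch of such a UDF in Python (Pig can run Python UDFs through Jython). All names here are hypothetical, and the exact bag-to-list conversion can vary by Pig version, so treat it as a starting point rather than a drop-in implementation:
# myudfs.py -- hypothetical module name
# In a Jython UDF, Pig typically hands a bag over as a list of tuples.
@outputSchema("flat:chararray")
def sort_and_flatten(grp, bag):
    ordered = sorted(bag, key=lambda t: t[1])  # order by the 'second' field
    values = [grp]
    for t in ordered:
        values.extend(t)
    # Emit one preformatted, tab-delimited string so every output row
    # has exactly one column no matter how many items the bag holds.
    return "\t".join(str(v) for v in values)
On the Pig side this would be registered and invoked along the lines of REGISTER 'myudfs.py' USING jython AS myudfs; and result = FOREACH j GENERATE myudfs.sort_and_flatten(group, $1);.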

In Wolfram Mathematica, how do I query the result of a Counts operation efficiently and conveniently?

EDIT: At the suggestion of @HighPerformanceMark, I've moved the question to mathematica.stackexchange.com (see my question there), so I attempted to close the question here. But SO doesn't allow me to do it properly, hence this up-front warning.
Setup
Say, I'm given a dataset, like the one below:
titanic = ExampleData[{"Dataset", "Titanic"}]; titanic
which answers with a table view of the dataset (the rendered output is omitted here).
And I want to count the occurrences of any combination between { "1st", "2nd"} and {"female", "male"}, using the Counts operator on the dataset, like:
genderclasscounts = titanic[All, {"class", "sex"}][Counts]
Problem statement
This is not a "flat" dataset and I don't have a clue how to query it in the usual way, like:
genderclasscounts[Select[ ... ], ...]
The resulting dataset doesn't provide "column" names that can be used as parameters in the Select, nor can I refer to the number representing the count by a name.
And I have no clue how to express an Association as a value in a Select!?
Furthermore, try genderclasscounts[Print]; this demonstrates that the values presented to the operation over this dataset are just numbers!
An unsatisfactory attempt
Of course, I can "flatten" the Counts result, by doing something horrific and inefficient like:
temp = Dataset[
  (row \[Function] AssociationThread[{"class", "sex", "count"} -> row]) /@
   (Nest[Normal, genderclasscounts, 3] /.
    Rule[{Rule["class", class_], Rule["sex", sex_]}, count_] -> {class, sex, count})]
In this form it is easy to query a count result:
First@temp[Select[#class == "1st" \[And] #sex == "female" &], "count"]
Question
So, my questions are:
How can I query the (immediate) result of the Counts operation in a convenient and efficient fashion, for example using a Select operation on the resulting dataset? Or, if that is not possible:
Is there an efficient and convenient transformation of the Counts result dataset that facilitates such a query? By "convenient" I mean, for example, that you just provide the dataset and the transformation handles the rest. So, not something like I've shown above in my unsatisfactory "solution" ;-)
Thanks for reading this far; I'm looking forward to answers and inspiration.
/@ nanitous

Logic to compare rows in pig

I need logic for the scenario below, which needs to be implemented using Pig scripts. Can anyone please help by providing some ideas on how to do this?
The input contains a column groupName with placeholder values such as others and unknown. These values need to be replaced with the groupName of the previous record.
Input:
id,groupName
123,casc0001
124,casc0002
125,sale0001
126,unknown
127,nave9876
128,casc0001
129,sale0002
130,others
131,casc0004
132,unknown
133,unknown
134,others
135,nave1234
output:
123,casc0001
124,casc0002
125,sale0001
126,sale0001
127,nave9876
128,casc0001
129,sale0002
130,sale0002
131,casc0004
132,casc0004
133,casc0004
134,casc0004
135,nave1234
In the above input, 126,unknown is to be replaced using 125,sale0001; 130,others needs to be replaced using 129,sale0002; and 132,unknown, 133,unknown, and 134,others are to be replaced using 131,casc0004.
--Edit--
I tried the LEAD function in Pig, but it only compares a fixed window of n rows at a time, which cannot solve this completely.
Another logic that works, though I am looking for a more optimized one:
- COGROUP the data set with itself (like Dataset and Dataset_self).
- FILTER on Dataset.id = Dataset_self.id OR Dataset_self.groupName = 'others' OR Dataset_self.groupName = 'unknown'.
- GENERATE an idDiff like (Dataset_self.id - Dataset.id), and a CASE: when id = id then (id, group) else (id_self, group).
- FOREACH (group by id) {
    ordered = ORDER BY id, diff, group;
    limited = ordered LIMIT 1;
    GENERATE limited;
  }
This is going to be a complicated problem on a distributed system like Hadoop, especially since your file is going to be split between nodes. In your case, what if 126 happens to be the first record in a new split? Then you would need to trace back to the previous file split, which is most likely on a different node. Let's say you come up with a MapReduce program to do this; in all likelihood it would be an extremely slow and inefficient way to do it. The solution might be simpler if you are on a single-node system where the splittable property of your input format is false and the number of reducers is set to 1.
In that case you could almost make the argument that a traditional database like Oracle or Teradata might be a better fit for your problem, as you have LEAD and LAG functions readily available, which could be used to do exactly what you need.
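To make that lead/lag-style logic concrete, here is a minimal single-process Python sketch of the replacement rule (names are hypothetical). It assumes the records arrive already sorted by id, which is precisely the guarantee that is hard to get once the file is split across nodes:
PLACEHOLDERS = {"unknown", "others"}

def forward_fill(records):
    # records: an iterable of (id, groupName) pairs, already ordered by id
    last_valid = None
    for rec_id, group in records:
        if group in PLACEHOLDERS:
            if last_valid is not None:
                group = last_valid  # substitute the previous valid groupName
        else:
            last_valid = group      # remember the latest valid groupName
        yield rec_id, group
Run over the sample input above, this yields the expected output, e.g. (126, 'unknown') becomes (126, 'sale0001').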

Pig: FLATTEN keyword

I am a little confused by the use of the FLATTEN keyword in Pig.
Consider the below dataset:
tuple_record: {details: (firstname: chararray,lastname: chararray,age: int,sex: chararray)}
Without using the FLATTEN I can access a field (suppose firstname) like this:
display_firstname = FOREACH tuple_record GENERATE details.firstname;
Now, using the FLATTEN keyword:
flatten_record = FOREACH tuple_record GENERATE FLATTEN(details);
DESCRIBE gives me this:
flatten_record: {details::firstname: chararray,details::lastname: chararray,details::age: int,details::sex: chararray}
And hence I can access the fields present directly without dereferencing like this:
display_record = FOREACH flatten_record GENERATE firstname;
My questions related to this FLATTEN keyword are:
1) Which of the two ways (i.e. with or without FLATTEN) is the more optimized way of achieving the same output?
2) Are there special scenarios where the desired output can't be achieved without using the FLATTEN keyword?
I'm totally confused; please clarify its use and the scenarios in which I should use it.
Sometimes you have data in a bag or a tuple and you want to remove that level of nesting. For example, when you want to switch around your data on the fly and group by a particular field, you need a way to pull those entries out of the bag.
As per Pig documentation:
The FLATTEN operator looks like a UDF syntactically, but it is actually an operator that changes the structure of tuples and bags in a way that a UDF cannot. Flatten un-nests tuples as well as bags. The idea is the same, but the operation and result is different for each type of structure.
For more details check this link; they have explained the usage of FLATTEN clearly, with examples.
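As a rough analogy in plain Python (not Pig, and with made-up values), the difference the documentation describes looks like this:
# FLATTEN on a bag: one output row per element of the bag
row = ("a", [("b", "c"), ("d", "e")])
grp, bag = row
flattened_bag = [(grp,) + t for t in bag]  # [('a', 'b', 'c'), ('a', 'd', 'e')]

# FLATTEN on a tuple: the nesting is removed within the same row
row2 = (("john", "doe", 30, "m"),)         # one nested 'details'-style tuple
flattened_tuple = row2[0]                  # ('john', 'doe', 30, 'm')
Flattening a bag multiplies rows, while flattening a tuple only widens one row; that asymmetry is why the operation and result differ per structure.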

Condense nested for loop to improve processing time with text analysis python

I am working on an untrained classifier model in Python 2.7, and I have a loop that looks like this:
features = [0 for i in xrange(len(dictionary))]
for bgrm in new_scored:
    for i in xrange(len(dictionary)):
        if bgrm[0] == dictionary[i]:
            features[i] = int(bgrm[1])
            break
I have a "dictionary" of bigrams that I have collected from a data set containing customer reviews and I would like to construct feature arrays of each review corresponding to the dictionary I have created. It would contain the frequencies of the bigrams found within the review of the features in the dictionary (I hope that makes sense). new_scored is a list of tuples which contains the bigrams found within a particular review paired with their relative frequency of occurrence in that review. The final feature arrays will be the same length as the original dictionary with few non zero entries.
The above works fine but I am looking at a data set of 13000 reviews, for each review to loop through this code is going to take for eeever (if my computer doesnt run out of RAM first). I have been sitting with it for a while and cannot see how I can condense it.
I am very new to python so I was hoping a more experienced could help with condensing it or perhaps point me in the right direction towards a library that will contain the function I need.
Thank you in advance!
Consider making dictionary an actual dict object (or some fancier subclass of dict if it better suits your needs), as opposed to an iterable (list or tuple seems like what it is now). dictionary could map bigrams as keys to an integer identifier that would identify a feature position.
If you refactor dictionary that way, then the loop can be rewritten as:
features = [0 for key in dictionary]
for bgram in new_scored:
    try:
        features[dictionary[bgram[0]]] = int(bgram[1])
    except KeyError:
        # skip (or log) bigrams that are missing from the dictionary
        pass
This should convert what was an O(n) traversal through dictionary into an O(1) hash lookup.
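For completeness, a minimal sketch of that refactoring, assuming your current dictionary is a list of bigrams (bigram_list is a hypothetical name for it):
# Build the bigram -> feature-position mapping once, up front.
dictionary = {bigram: i for i, bigram in enumerate(bigram_list)}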
Hope this helps.

Hadoop Pig Latin Tuples: How to pass them to UDFs?

My goal is to pass every field in the input to a UDF as follows:
A = LOAD './input/file1' USING PigStorage(' ') AS (f1:chararray, f2:chararray);
B = FOREACH A GENERATE com.mycompany.udf.FAKEUDF(tuple(*));
NOTE: I am using Cloudera's version 0.12.0-cdh5.0.0.
The above FOREACH is just one of my many attempts. I have seen examples like
...FAKEUDF(*)
And so forth.
The main question is, what is the correct syntax? And has the syntax changed from earlier versions?
Here is a link which shows the lone asterisk syntax:
Chapter 10: Writing Evaluation & Filter Functions
It depends on how you are processing your requirement. The argument can be the name of one or more columns, like FAKEUDF(column1, column2, ...); for all of the columns you can also specify *, like FAKEUDF(*); or you can specify a relation name. Inside the UDF, you have to take the column values out of the tuple, like tuple.get(index). You have to be careful about what you send as an argument, because the processing happens based on that. It can even be a DataBag.
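If you are open to Python instead of Java, a Jython UDF makes the argument handling easy to see. This is a minimal sketch with hypothetical names; in a Java EvalFunc you would instead pull the values out with tuple.get(index), as described above:
# fakeudf.py -- hypothetical module name
@outputSchema("joined:chararray")
def fake_udf(*fields):
    # Called as myudfs.fake_udf(*), each column of A arrives as a
    # separate positional argument, so fields == (f1, f2) here.
    return "|".join(str(f) for f in fields)
It would be registered and called along the lines of REGISTER 'fakeudf.py' USING jython AS myudfs; and B = FOREACH A GENERATE myudfs.fake_udf(*);.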
