I am a little confused with the use of FLATTEN keyword in PIG.
Consider the below dataset:
tuple_record: {details: (firstname: chararray,lastname: chararray,age: int,sex: chararray)}
Without using the FLATTEN I can access a field (suppose firstname) like this:
display_firstname = FOREACH tuple_record GENERATE details.firstname;
Now, using the FLATTEN keyword:
flatten_record = FOREACH tuple_record GENERATE FLATTEN(details);
DESCRIBE gives me this:
flatten_record: {details::firstname: chararray,details::lastname: chararray,details::age: int,details::sex: chararray}
And hence I can access the fields present directly without dereferencing like this:
display_record = FOREACH flatten_record GENERATE firstname;
My questions related to this FLATTEN keyword is:
1) Which way among the two (i.e. with or without using FLATTEN) is the optimized way of achieving the same output?
2) Any special scenarios where without using the FLATTEN keywords, the desired output cant be achieved?
Totally confused; please clarify its use and in which all scenarios I shall use it.
Sometimes you have data in a bag or a tuple and you want to remove that level of nesting.
when you want to switch around your data on the fly and group by a particular field, you need a way to pull those entries out of the bag.
As per Pig documentation:
The FLATTEN operator looks like a UDF syntactically, but it is
actually an operator that changes the structure of tuples and bags in
a way that a UDF cannot. Flatten un-nests tuples as well as bags. The
idea is the same, but the operation and result is different for each
type of structure.
For more details check this link they have explained the usage of FLATTEN clearly with examples
Related
EDIT At the suggestion of #HighPerformanceMark, I've moved the question to mathematica.stackexchange.com: my question, so I attempted to close the question here. But SO doesn't allow me to do it properly, hence this up-front warning.
Setup
Say, I'm given a dataset, like the one below:
titanic = ExampleData[{"Dataset", "Titanic"}]; titanic
Answering with:
And I want to count the occurrences of any combination between { "1st", "2nd"} and {"female", "male"}, using the Counts operator on the dataset, like:
genderclasscounts = titanic[All, {"class", "sex"}][Counts]
Problem statement
This is not a "flat" dataset and I don't have a clue how to query in the usual way, like:
genderclasscount[Select[ ... ], ...]
The resulting dataset doesn't provide "column" names to be used as parameters in the Select nor can I refer to the number representing the count by a name.
And I've no clue how to express an Association as a value in a Select!?
Furthermore, try genderclasscount[Print], this demonstrates the values presented to the operation over this dataset are just numbers!
An unsatisfactory attempt
Of course, I can "flatten" the Counts result, by doing something horrific and inefficient like:
temp = Dataset[(row \[Function]
AssociationThread[{"class", "sex", "count"} -> row]) /# (Nest[
Normal, genderclasscounts, 3] /.
Rule[{Rule["class", class_], Rule["sex", sex_]},
count_] -> {class, sex, count})]
In this form it is easy to query a count result:
First#temp[Select[#class == "1st" \[And] #sex == "female" &], "count"]
Question
So, my questions are
How can I query the (immediate) result of the Count operation in a convenient and efficient fashion, like using a Select operation on the resulting dataset? Or, if that is not possible;
Is there an efficient and convenient transformation of the Counts result dataset possible facilitating such a query? With "convenient" I mean, for example, that you just provide the dataset and the transformation handles the rest. So, not something like I've shown above in my unsatisfactory "solution" ;-)
Thanks for reading this far and I'm looking forward to anwsers and inspiration.
/#nanitous
Is there any way to ignore "stop words" while sorting.
For example:
I have words like
dixit
singla
the marklogic
On sorting in descending order the result should be
singla, the marklogic, dixit
As in the above example the is ignored.
Any way to achieve this?
Update:
Stop word can occur at any place.
for example
the MarkLogic
MarkLogic is the best
the MarkLogic is awesome
while sorting should not consider any stop word in the text.
Above is just a small example to describe the problem.
In actual I am using search:search API.
For sorting, I am using sort-order search options.
The element on which I have to perform sorting is dynamic. There are approx 30-35 elements.
Is there any way to customize the collation at this level like to configure some words (stop words) which will be ignored while sorting.
There is no standard collation URI that is going to do this for you (at least none that I've ever seen). You can do it dynamically, of course, by sorting on the result of a function invocation, but if you want it done efficiently at scale (and available to search:search), then you need to materialize the sortable string into your document. I've often done this as an attribute on the element:
<title sortable="Great Gatsby, The">The Great Gatsby</title>
Then you put a range index on the title/#sortable attribute.
You can also use the "envelope pattern" where materialized metadata like this is maintained in its own section of the document with the original kept in its own section. For things like this, I think it's a bit more elegant to decorate the elements directly, to keep the context.
If I understand your question correctly you're trying to get rid of the definite article when sorting your result-set.
In order to do this you need to use some additional functions and create a 'sort' criteria. My solution would look like this (I'm also including some sample documents so that you can test this just by copy-pasting):
(:
xdmp:document-insert("/peter.xml", <person><firstName>Peter</firstName><lastName>O'Toole</lastName><age>60</age></person>);
xdmp:document-insert("/john.xml", <person><firstName>John</firstName><lastName>Adams</lastName><age>18</age></person>);
xdmp:document-insert("/simon.xml", <person><firstName>Simon</firstName><lastName>Petrov</lastName><age>22</age></person>);
xdmp:document-insert("/mark.xml", <person><firstName>Mark</firstName><lastName>the Lord</lastName><age>25</age></person>);
:)
for $person in /person
let $sort := fn:reverse(fn:tokenize($person/lastName, ' '))[1]
order by $sort
(: return $person :)
return $person/lastName/text()
Notice that now the sort order is going to be
- Adams
- the Lord
- O'Toole
- Petrov
I hope this will help.
I am trying NEST out and it seems very nice, but I am having some trouble understanding some things.
The response is serialized to an hierarchy of objects. I would like to iterate over it and to create my own structure.
I would be able to do somethings like this (thanks to #Martijn Laarman, who helped me in the GitHub page):
var buckets = result.Aggs.Terms("level_1");
var term = buckets.Items[0].Terms("level_2");
It works, but I would like to have a generic algorithm that parses the response. To do that, I would like to get content independently of the query (if it used terms, range, etc). So I would like to do things like:
var buckets = result.Aggregrations["level_1"];
var term = buckets.Items[0].Aggreggation["level_2"];
Unfortunately the Aggregations collection returns Nest.Bucket and I can't do anything from there.
Is there any way that I can iterate over the result independently on how the query was formed?
Thanks!
For sake of completeness, I was not able to find any way to do so.
I created a parser using a mix of JObects and Dictionary and operated over the JSON response to generate the output I wanted.
In pig I massaged my data into something like:
(a,{(b,c),(d,e),(f,g)})
(h,{(i,j),(k,l)})
where the first item is the group and the bag are other values related to the group. I would like to get it into the following format:
(a,b,c,d,e,f,g)
(h,i,j,k,l)
I got to where I am now with
grunt> j = foreach G {
>> o = order myvar by second;
>> generate group, o.(first,second);
>> };
So the tuples in the bag are currently ordered. If I do something like mystuff = foreach j generate group, flatten($1); I get it all flattened and un-grouped.
Is this possible in pig, and if so what command should I be looking at?
There is no way I can that can do what you want out of the box. You really need to use a user-defined function for this. I know it sucks because you have to write Java or Python code, but you'll find several situations where Pig just doesn't go far enough. Pig can be considered a data flow language and not so much of a programming language, which is why UDFs play such an important role: they bridge the gap.
My suggestion is you write a UDF that takes in the group and value bag as parameters. Do the ordering/sorting in the UDF and also the flattening.
The other thing you want to be careful about is that now your rows will have different numbers of columns and Pig doesn't really like this. If you are just immediately outputting it, you can probably get away with this. You might want to consider having your UDF write out the list in a tab-delimited string or something that is preformatted. This isn't that big of a deal... feel free to ignore my advice here.
We have a posting analyzing requirement, that is, for a specific post, we need to return a list of posts which are mostly related to it, the logic is comparing the count of common tags in the posts. For example:
postA = {"author":"abc",
"title":"blah blah",
"tags":["japan","japanese style","england"],
}
there are may be other posts with tags like:
postB:["japan", "england"]
postC:["japan"]
postD:["joke"]
so basically, postB gets 2 counts, postC gets 1 counts when comparing to the tags in the postA. postD gets 0 and will not be included in the result.
My understanding for now is to use map/reduce to produce the result, I understand the basic usage of map/reduce, but I can't figure out a solution for this specific purpose.
Any help? Or is there a better way like custom sorting function to work it out? I'm currently using the pymongodb as I'm python developer.
You should create an index on tags:
db.posts.ensure_index([('tags', 1)])
and search for posts that share at least one tag with postA:
posts = list(db.posts.find({_id: {$ne: postA['_id']}, 'tags': {'$in': postA['tags']}}))
and finally, sort by intersection in Python:
key = lambda post: len(tag for tag in post['tags'] if tag in postA['tags'])
posts.sort(key=key, reverse=True)
Note that if postA shares at least one tag with a large number of other posts this won't perform well, because you'll send so much data from Mongo to your application; unfortunately there's no way to sort and limit by the size of the intersection using Mongo itself.