MongoDB MapReduce: Emit key from array based on condition - performance

I am new to MongoDB, so excuse me if this is rather trivial. I would really appreciate the help.
The idea is to generate a histogram over some specific values, in this case the MIME types of some files. For that I am using a map-reduce job.
I have a MongoDB collection with documents in the following form:
{
"_id" : ObjectId("4fc5ed3e67960de6794dd21c"),
"name" : "some name",
"uid" : "some app specific uid",
"collection" : "some name",
"metadata" : [
{
"key" : "key1",
"value" : "Plain text",
"status" : "SINGLE_RESULT",
},
{
"key" : "key2",
"value" : "text/plain",
"status" : "SINGLE_RESULT",
},
{
"key" : "key3",
"value" : 3469,
"status" : "OK",
}
]
}
Note that almost every document has more metadata key/value entries than shown here.
Map Reduce job
I tried doing the following:
function map() {
    var mime = "";
    this.metadata.forEach(function (m) {
        if (m.key === "key2") {
            mime = m.value;
        }
    });
    emit(mime, {count: 1});
}
// reduce receives the emitted key and the array of values emitted for it
function reduce(key, values) {
    var res = {count: 0};
    values.forEach(function (v) { res.count += v.count; });
    return res;
}
db.collection.mapReduce(map, reduce, {out: {inline: 1}})
This seems to work for a small number of documents (~15K), but iterating through all metadata key/value entries takes a lot of time during the map phase. When running this on more documents (~1 million), the operation takes forever.
So my question is:
Is there some way to emit the MIME type (the value) directly instead of iterating through all keys and selecting it? Or is there a better way to write the map-reduce functions?
Something like emit (this.metadata.value {$where this.metadata.key:"key2"}) or similar...
Thanks for your help!

Two thoughts ...
First thought: How attached are you to this document schema? Could you instead have the metadata field value as an embedded document rather than an embedded array, like so:
{
"_id" : ObjectId("4fc5ed3e67960de6794dd21c"),
"name" : "some name",
"uid" : "some app specific uid",
"collection" : "some name",
"metadata" : {
"key1" : {
"value" : "Plain text",
"status" : "SINGLE_RESULT"
},
"key2": {
"value" : "text/plain",
"status" : "SINGLE_RESULT"
},
"key3" : {
"value" : 3469,
"status" : "OK"
}
}
}
Then your map step does away with the loop entirely:
function map() {
emit( this.metadata["key2"].value, { count : 1 } );
}
At that point, you might even be able to cast this as a "group" command rather than a "mapReduce".
Second thought: Absent a schema change like that, particularly if "key2" appears early in the metadata array, you could at least exit the loop eagerly once the key is found to save yourself some iterations, like so:
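function map() {
    var mime = "";
    // forEach cannot be exited with break, so use a plain for loop
    for (var i = 0; i < this.metadata.length; i++) {
        if (this.metadata[i].key === "key2") {
            mime = this.metadata[i].value;
            break;
        }
    }
    emit(mime, {count: 1});
}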
Not sure if either path is the key to victory, but hopefully helpful thoughts. Best of luck!
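To check the map and reduce logic without a database, the functions can be run against a few in-memory documents. The following is a minimal sketch, not how MongoDB itself executes mapReduce: runMapReduce is a hypothetical local stand-in, the sample documents are made up, and emit is installed as a global only so that the map function can call it the way it would in the mongo shell.

```javascript
// Sample documents in the array-based schema from the question (values are made up).
const docs = [
  { metadata: [{ key: "key1", value: "Plain text" }, { key: "key2", value: "text/plain" }] },
  { metadata: [{ key: "key2", value: "image/png" }, { key: "key3", value: 3469 }] },
  { metadata: [{ key: "key2", value: "text/plain" }] },
  { metadata: [{ key: "key1", value: "no mime type here" }] },
];

// Map function with an early exit once "key2" is found.
function map() {
  let mime = "";
  for (const m of this.metadata) {
    if (m.key === "key2") {
      mime = m.value;
      break;
    }
  }
  emit(mime, { count: 1 });
}

// Reduce function: sum the counts emitted for one key.
function reduce(key, values) {
  const res = { count: 0 };
  values.forEach((v) => { res.count += v.count; });
  return res;
}

// Hypothetical local stand-in for mapReduce: collect emitted values per key,
// then reduce each group. Unlike MongoDB, it also reduces single-value groups.
function runMapReduce(documents, mapFn, reduceFn) {
  const groups = new Map();
  globalThis.emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  documents.forEach((doc) => mapFn.call(doc));
  const out = {};
  for (const [key, values] of groups) {
    out[key] = reduceFn(key, values);
  }
  return out;
}

const histogram = runMapReduce(docs, map, reduce);
console.log(histogram);
```

Documents without a "key2" entry end up under the empty-string key, which matches the behavior of the map function in the question.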

Related

How to insert an element into already present list in elastic search

Say I have documents stored like below.
document 1
{
id : '1',
title : "This is a test document1",
list : ["value1" , "value2"],
...
}
document 2
{
id : '2',
title : "This is a test document2",
valueList : ["value1" , "value2"],
...
}
I need to append more elements to valueList for a given list of document ids, using the bulk API. The result should look like this:
document 1
{
id : '1',
title : "This is a test document1",
list : ["value1" , "value2", "value3"],
...
}
document 2
{
id : '2',
title : "This is a test document2",
valueList : ["value1" , "value2" , "value3"],
...
}
What can I do to achieve this?
I tried using scripts, but that only updates a single document.
Sorry, I am really new to Elasticsearch, so this may be a basic question. Please bear with me and help me understand.
See Updating Document. It should be straightforward. You need to use _update with the changed fields wrapped in a doc object. The documentation is nearly perfect, but just to give you an idea, it could look like this:
POST /your_index/your_type/document1/_update
{
    "doc" : {
        "list" : ["value1", "value2", "value3"]
    }
}
This will update document1.
In case of bulk updates you should read Batch Processing and have a look at the Bulk API.
From the docs:
POST /your_index/your_type/_bulk
{ "update" : {"_id" : "document1", "_type" : "your_type", "_index" : "your_index"}}
{ "doc" : {"myfield" : "newvalue"} }
{ "update" : {"_id" : "document2", "_type" : "your_type", "_index" : "your_index"}}
{ "doc" : {"myfield" : "newvalue"} }
Please note that you can just use _update for Partial Updates.
The simplest form of the update request accepts a partial document as
the doc parameter, which just gets merged with the existing document.
Objects are merged together, existing scalar fields are overwritten,
and new fields are added.
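Since a doc merge overwrites the whole list, appending a value to many documents is usually done with a scripted update per document. The following is a minimal sketch that only builds the newline-delimited request body for POST /your_index/your_type/_bulk. The helper name buildBulkAppend is hypothetical, and the script assumes a version of Elasticsearch that supports Painless and the "source" script key; older releases use a different script language and key, so check the documentation for your version.

```javascript
// Build the newline-delimited body for a _bulk request that appends one value
// to a list field in each of the given documents.
function buildBulkAppend(ids, field, value) {
  const lines = [];
  for (const id of ids) {
    // Action line: which document to update (index and type come from the URL).
    lines.push(JSON.stringify({ update: { _id: id } }));
    // Payload line: a scripted update instead of a partial doc.
    lines.push(JSON.stringify({
      script: {
        // Assumption: Painless is available; adjust for older versions.
        source: "ctx._source[params.field].add(params.value)",
        lang: "painless",
        params: { field: field, value: value },
      },
    }));
  }
  // The bulk API expects every line, including the last one, to end with a newline.
  return lines.join("\n") + "\n";
}

const body = buildBulkAppend(["1", "2"], "valueList", "value3");
console.log(body);
```

The resulting string can be sent as the request body with any HTTP client. The script assumes the list field already exists in every targeted document.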

Performance with nested data in a script field

I am wondering whether there is a more performant way to run a calculation on nested data in a script field, or a better way to organize the data. In the mapping below, the related field will contain values for 50 states and/or other regions. Each user is tied to an area, so the script below checks whether the average_value in their area is at or above a certain threshold and returns true or false for each matching document.
Mapping
{
"mydata" : {
"properties" : {
...some fields,
"related" : {
"type" : "nested",
"properties" : {
"average_value" : {
"type" : "integer"
},
"state" : {
"type" : "string"
}
}
}
}
}
}
Script
"script_fields" : {
"inBudget" : {
"script" : {
"inline" : "_source.related.find { it.state == default_area && it.average_value >= min_amount } != null",
"params" : {
"min_amount" : 100,
"default_area" : "CA"
}
}
}
}
I have a working solution using the above method, but it slows my query down and I am curious whether there is a better solution. I have been toying with the idea of using an inner object with a key, like related_CA, and keeping each state's data in a separate object. For flexibility, however, I would rather not have to pre-define each region in a mapping (as I may not have them all ahead of time). I feel like I might be missing a simpler or better way, and I am open to reorganizing the data/mapping and/or changing the script.

How can I ignore certain attributes when comparing two json files in ruby?

I'm looking for a parallel approach to solving a problem. One approach I posted in "How can I subract two json files in ruby." Another approach is this.
I'm using this nifty json_diff.rb program to compare two similarly-structured json files. How can I skip certain attributes that can be nested inside other attributes?
For example, I have file1.json
{
"id" : "file1",
"att1" : {
"attA" : {
"free_mem" : "1234",
"buff_mem" : "5678"
},
"attB" : {
"name" : "Joe",
"location" : "Lab"
}
}
}
and file2.json
{
"id" : "file2",
"att1" : {
"attA" : {
"free_mem" : "5555",
"buff_mem" : "6666"
},
"attB" : {
"name" : "John",
"location" : "Lab"
}
}
}
I want to ignore attA. Note that these files are just examples, and real json files I have will have more attributes to ignore, which may be nested deeper inside other attributes.
I've done this in RSpec with json_spec (https://github.com/collectiveidea/json_spec) with good success, but that is specific to RSpec rather than plain Ruby.

Converting MongoQuery to C# statement

I'm searching for a value in an array of sub-arrays. What would the equivalent code look like in C#?
db.File.find({
Properties: {
$elemMatch: {
$elemMatch: {
$in:
['AWS-Uploaded']
}
}
}
});
A simplified version of the documents looks like this:
{
"_id" : ObjectId("4f3b83acec76021c6827769e"),
"Extension" : ".mov",
"Length" : NumberLong(7910975),
"Properties" : [
["MediaId", "20898180"],
["AWS-Uploaded", "11/08/2013 16:15:50"]
]
}
To build a query object that returns documents where an array element exists, this should do the job:
Query.ElemMatch("Properties",
    Query.Exists("AWS-Uploaded", true)
);
To use ElemMatch with other query operators, for example where MediaId = 20898180, it would be:
Query.ElemMatch("Properties",
    Query.EQ("MediaId", 20898180)
);
Hope that helps

Check for id existence in param Array with Elasticsearch custom script field

Is it possible to add a custom script field that is a Boolean and returns true if the document's id exists in an array that is sent as a param?
Something like this https://gist.github.com/2437370
What would be the correct way to do this with mvel?
Update:
Having trouble getting it to work as specified in Imotov's answer.
Mapping:
place: {
properties: {
_id: { index: "not_analyzed", store: "yes" },
}
}
Sort:
:sort=>{:_script=>{:script=>"return friends_visits_ids.contains(_fields._id.value)", :type=>"string", :params=>{:friends_visits_ids=>["4f8d425366eaa71471000011"]}, :order=>"asc"}}}
I don't get any errors; the documents just don't get sorted correctly.
Update 2
Oh, and I do get this back on the documents:
"sort"=>["false"]
You were on the right track. If the list of ids is large, it may be more efficient to pass it as a map instead of an array, so that the script can use a key lookup:
"sort" : {
"_script" : {
"script" : "return friends_visits_ids.containsKey(_fields._id.value)",
"type" : "string",
"params": {
"friends_visits_ids": { "1" : {}, "2" : {}, "4" : {}}
}
}
}
Make sure that the _id field is stored. Otherwise _fields._id.value will return null for all records.
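If the client already holds the ids as an array, the map form of the parameter can be derived from it before the request is sent. The following is a minimal sketch; the helper name idsToLookup and the second id are made up for illustration, and Object.fromEntries requires a reasonably recent JavaScript runtime.

```javascript
// Turn an array of ids into an object keyed by id, so that the script
// can call containsKey instead of scanning a list.
function idsToLookup(ids) {
  return Object.fromEntries(ids.map((id) => [id, {}]));
}

const params = {
  friends_visits_ids: idsToLookup([
    "4f8d425366eaa71471000011",
    "4f8d425366eaa71471000012", // hypothetical second id
  ]),
};
console.log(JSON.stringify(params, null, 2));
```

The resulting params object can be placed under "params" in the sort script request.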

Resources