Difference with count result in Mongo group by query with Ruby/JavaScript

I'm using Mongoid to get a count of certain types of records in a Mongo database. When running the query with the JavaScript method:
db.tags.group({
  cond: { tag: { $ne: 'donotwant' } },
  key: { tag: true },
  reduce: function(doc, out) { out.count += 1; },
  initial: { count: 0 }
});
I get the following results:
[
  { "tag" : "thing", "count" : 4 },
  { "tag" : "something", "count" : 1 },
  { "tag" : "test", "count" : 1 }
]
That does exactly what I want. However, when I use the corresponding Mongoid code to perform the same query:
Tag.collection.group(
  :cond => { :tag => { :$ne => 'donotwant' } },
  :key => [:tag],
  :reduce => "function(doc, out) { out.count += 1 }",
  :initial => { :count => 0 }
)
the count parameters are (seemingly) selected as floats instead of integers:
[
  { "tag" => "thing", "count" => 4.0 },
  { "tag" => "something", "count" => 1.0 },
  { "tag" => "test", "count" => 1.0 }
]
Am I misunderstanding what's going on behind the scenes? Do I need to (can I?) cast those counts, or is the JavaScript result just showing them without the .0?

JavaScript doesn't distinguish between floats and ints; it has a single Number type, implemented as a double. So what you are seeing in Ruby is correct: the mongo shell output follows JavaScript printing conventions and displays Numbers that don't have a decimal component without the '.0'.
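If integer counts are wanted on the Ruby side, casting after the fact is safe because the values are whole numbers; a minimal sketch, using the result array shown above as literal data:

```ruby
# Group results as returned by the driver, with Float counts.
results = [
  { "tag" => "thing",     "count" => 4.0 },
  { "tag" => "something", "count" => 1.0 },
  { "tag" => "test",      "count" => 1.0 }
]

# Cast each count to an Integer; no information is lost since
# the doubles coming back from Mongo hold whole numbers here.
casted = results.map { |r| r.merge("count" => r["count"].to_i) }
# casted => [{"tag"=>"thing", "count"=>4}, {"tag"=>"something", "count"=>1}, {"tag"=>"test", "count"=>1}]
```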

Related

Matching by array elements in Elasticsearch

I have to construct quite a non-trivial (or so it now seems) query in Elasticsearch.
Suppose I have a couple of entities, each with an array element consisting of strings:
1). ['A', 'B']
2). ['A', 'C']
3). ['A', 'E']
4). ['A']
The mapping for the array element is as follows (using dynamic templates):
{
  "my_array_of_strings": {
    "path_match": "stringArray*",
    "mapping": {
      "type": "string",
      "index": "not_analyzed"
    }
  }
}
The JSON representation of an entity looks like this:
{
  "stringArray": [
    "A",
    "B"
  ]
}
Then I have user input:
['A', 'B', 'C'].
What I want to achieve is to find entities which contain only elements specified in the input - the expected results are:
['A', 'B'], ['A', 'C'], ['A'] but NOT ['A', 'E'] (because 'E' is not present in the user input).
Can this scenario be implemented with Elasticsearch?
UPDATE:
Apart from the solution using scripts - which should work nicely, but will most likely slow the query down considerably when many records match - I have devised another one. Below I will try to explain its main idea, without a code implementation.
One considerable condition that I failed to mention (and which might have given other users a valuable hint) is that the arrays consist of enumerated elements, i.e. there is a finite number of possible elements. This allows such an array to be flattened into separate fields of an entity.
Let's say there are 5 possible values: 'A', 'B', 'C', 'D', 'E'. Each of these values becomes a boolean field - true if the array version would contain this element, and false otherwise.
Then each of the entities could be rewritten as follows:
1).
A = true
B = true
C = false
D = false
E = false
2).
A = true
B = false
C = true
D = false
E = false
3).
A = true
B = false
C = false
D = false
E = true
4).
A = true
B = false
C = false
D = false
E = false
With the user input of ['A', 'B', 'C'], all I would need to do is:
a) take all possible values (['A', 'B', 'C', 'D', 'E']) and subtract the user input from them -> the result will be ['D', 'E'];
b) find records where each of the resulting elements is false, i.e. 'D = false AND E = false'.
This would give records 1, 2 and 4, as expected. I am still experimenting with the code implementation of this approach, but so far it looks quite promising. It has yet to be tested, but I think it might perform faster, and be less resource demanding, than using scripts in the query.
To optimize this a little further, it might be possible not to store the 'false' fields at all, and modify the previous query to 'D = not exists AND E = not exists' - the result should be the same.
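The subtraction step above can be sketched in plain Ruby; the query hash at the end is only a hypothetical shape for the 'not exists' variant and would need adapting to whatever Elasticsearch client is actually in use:

```ruby
# All values the array field can take (a known, finite set).
all_values = %w[A B C D E]
user_input = %w[A B C]

# Fields that must be absent for a document to match.
excluded = all_values - user_input   # => ["D", "E"]

# Hypothetical bool filter built from the excluded set: every
# excluded field must not exist on a matching document.
query = {
  bool: {
    must_not: excluded.map { |v| { exists: { field: v } } }
  }
}
```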
You can achieve this with scripting. This is how it looks:
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "terms": {
                "name": [ "A", "B", "C" ]
              }
            },
            {
              "script": {
                "script": "if(user_input.containsAll(doc['name'].values)){return true;}",
                "params": {
                  "user_input": [ "A", "B", "C" ]
                }
              }
            }
          ]
        }
      }
    }
  }
}
This Groovy script checks whether the document's list contains anything apart from ['A', 'B', 'C'] and returns false if it does, so it won't return ['A', 'E']; it is simply checking for a sublist match. The script might take a couple of seconds on a large index. You would need to enable dynamic scripting, and the syntax might differ for ES 2.x - let me know if it does not work.
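The script's subset check can be mimicked in plain Ruby to see which of the example arrays would pass (a sketch of the logic only, not the Groovy that actually runs inside Elasticsearch):

```ruby
user_input = %w[A B C]
docs = [%w[A B], %w[A C], %w[A E], %w[A]]

# A document passes when its array is a subset of the user input,
# which is what containsAll checks on the Elasticsearch side.
matches = docs.select { |values| (values - user_input).empty? }
# matches == [["A", "B"], ["A", "C"], ["A"]]
```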
EDIT 1
I have put both conditions inside the filter only. First, only those documents that have either A, B or C are returned, and then the script is applied to only those documents, so this should be faster than the previous version. More on filter ordering can be found in the Elasticsearch documentation.
Hope this helps!!
For a similar case I followed these steps:
First of all I deleted the index, so I could redefine the analyzer/settings with the Sense plugin.
DELETE my_index
Then I defined a custom analyzer for my_index:
PUT my_index
{
  "index" : {
    "analysis" : {
      "tokenizer" : {
        "comma" : {
          "type" : "pattern",
          "pattern" : ","
        }
      },
      "analyzer" : {
        "comma" : {
          "type" : "custom",
          "tokenizer" : "comma"
        }
      }
    }
  }
}
Then I defined the mapping properties inside my code, but you can also do that with Sense; both are the same.
PUT /my_index/_mapping/my_type
{
  "properties" : {
    "conduct_days" : {
      "type" : "string",
      "analyzer" : "comma"
    }
  }
}
Then, for testing, run the steps below:
PUT /my_index/my_type/1
{
  "conduct_days" : "1,2,3"
}
PUT /my_index/my_type/2
{
  "conduct_days" : "3,4"
}
PUT /my_index/my_type/3
{
  "conduct_days" : "1,6"
}
GET /my_index/_search
{
  "query": { "match_all": {} }
}
GET /my_index/_search
{
  "filter": {
    "or" : [
      { "term": { "conduct_days": "6" } },
      { "term": { "conduct_days": "3" } }
    ]
  }
}
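What the comma analyzer buys you can be simulated in plain Ruby: each stored string is split on commas into exact terms, and the "or" filter matches any document whose terms include "6" or "3" (a sketch of the matching logic only, not of Elasticsearch itself):

```ruby
# The three test documents, keyed by id.
docs = {
  1 => "1,2,3",
  2 => "3,4",
  3 => "1,6"
}

# The comma tokenizer turns each stored value into exact terms.
terms = docs.transform_values { |v| v.split(",") }

# An "or" of two term filters: match docs containing "6" or "3".
matching_ids = terms.select { |_id, t| t.include?("6") || t.include?("3") }.keys
# matching_ids == [1, 2, 3]
```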

how to filter for results for which a property has a value contained in array X

Say I've got a dynamic array A of values [x,y,z].
I want to return all results for which property P has a value that exists in A.
I could write some recursive filter that concatenates 'or's for each value in A, but it's extremely clunky.
Any other out-of-the-box way to do this?
You can use the filter command in conjunction with the reduce and contains commands to accomplish this.
Example
Let's say you have the following documents:
{
  "id": "41e352d0-f543-4731-b427-6e16a2f6fb92",
  "property": [ 1, 2, 3 ]
}, {
  "id": "a4030671-7ad9-4ab9-a21f-f77cba9bfb2a",
  "property": [ 5, 6, 7 ]
}, {
  "id": "b0694948-1fd7-4293-9e11-9e5c3327933e",
  "property": [ 2, 3, 4 ]
}, {
  "id": "4993b81b-912d-4bf7-b7e8-e46c7c825793",
  "property": [ "b", "c" ]
}, {
  "id": "ce441f1e-c7e9-4a7f-9654-7b91579029be",
  "property": [ "a", "b", "c" ]
}
From this sequence, you want to get all documents that have either "a" or 1 in their property property. You can write a query that builds a chained contains statement using reduce.
r.table('30510212')
  // Filter documents
  .filter(function (row) {
    // Array of properties you want to filter for
    return r.expr([ 1, 'a' ])
      // Insert `false` as the first value in the array
      // in order to make it the first value in the reduce's left
      .insertAt(0, false)
      // Chain up the `contains` statements
      .reduce(function (left, right) {
        return left.or(row('property').contains(right));
      });
  })
Update: a better way to do it
Actually, you can use two contains calls to execute the same query. This is shorter and probably a bit easier to understand.
r.table('30510212')
  .filter(function (row) {
    return row('property').contains(function (property) {
      return r.expr([ 1, 'a' ]).contains(property);
    });
  })

How do I use Spring Mongo to group two fields and get an array of one field based on the other?

Let me give an example here:
Two entries in the collection Author:
{
  "name" : "Joe",
  "Book" : "A"
},
{
  "name" : "Joe",
  "Book" : "B"
}
Now, if I use the aggregation function in Mongo via Spring Mongo, basically just to grab the books with name Joe, it could be coded like:
Aggregation agg = newAggregation(Map.class, group("name", "Book"));
AggregationResults<Map> results = mongoTemplate.aggregate(agg, "Author", Map.class);
Obviously I get two Maps this way: one has the entry {"name" : "Joe", "Book" : "A"}, the other {"name" : "Joe", "Book" : "B"}.
But what if I want to get ONLY one result back, with one entry:
{"name" : "Joe", "Books" : ["A", "B"]}?
I'm not sure if this is achievable with just one query. It could certainly be done in multiple steps, which I'd hate to do...
You need to use the $addToSet operator in your $group pipeline. This will return an array of all unique values ["A", "B"] that results from applying the $group expression to each document in a group of documents that share the same group-by key "name". So in the mongo shell you have:
db.author.aggregate([
  { $group: {
    _id: '$name',
    Books: { $addToSet: '$Book' }
  }}
]);
which brings back the desired result:
{
  "result" : [
    {
      "_id" : "Joe",
      "Books" : [ "B", "A" ]
    }
  ],
  "ok" : 1
}
The equivalent Spring aggregation:
Aggregation agg = newAggregation(Map.class, group("name").addToSet("Book").as("Books"));
AggregationResults<Map> results = mongoTemplate.aggregate(agg, "Author", Map.class);

Selecting age count without intervals

So what I am trying to do is write a query that will return a count of people at each age - not in intervals. That is, the count of people that have been alive for 1, 2, 3, ... 67, ... 99, ... years.
I am not familiar with NoSQL, but I know that because time is ongoing, the age counts will have to be periodically updated/refreshed. What I was thinking was to have a collection or something keyed by age, with the value being the number of people of that age. When a new person is created, it would increment the count for his or her age - and then, as I said earlier, something would update it.
What I am trying to figure out is whether there is a way to actively fetch the number of people of each age (in real time) without keeping a counter. Or, if I must use a counter, how can I have the database increment it automatically so I don't need to handle it in the program?
You can achieve this by using MongoDB's aggregation framework. In order to keep it up to date in real time, what you need to do is the following:
Project an ageMillis field by subtracting the date of birth (dob) from the current date. You will get an age value in milliseconds.
Divide ageMillis by the number of milliseconds in a year (in JavaScript it is 31536000000) and project this onto an ageDecimal field. You don't want to use this age to group because it contains a decimal.
Project the ageDecimal field and a decimal field containing the decimal portion of the age. You are able to do this using the $mod operator.
Subtract decimal from ageDecimal and project it to an age field. This gives you the age value in years.
Group by the age field and keep track of the count using $sum. Basically you add 1 for every document you see for that age.
If needed, sort by age field.
The command in the mongo shell would look something like the command below, using JavaScript's Date() object to get the current date. If you want to do this in Ruby, you would have to change that bit of code and make sure that for the rest, you follow the syntax for the Ruby driver.
db.collection.aggregate([
  { "$project" : {
    "ageMillis" : { "$subtract" : [ new Date(), "$dob" ] }
  }},
  { "$project" : {
    "ageDecimal" : { "$divide" : [ "$ageMillis", 31536000000 ] }
  }},
  { "$project" : {
    "ageDecimal" : "$ageDecimal",
    "decimal" : { "$mod" : [ "$ageDecimal", 1 ] }
  }},
  { "$project" : {
    "age" : { "$subtract" : [ "$ageDecimal", "$decimal" ] }
  }},
  { "$group" : {
    "_id" : { "age" : "$age" },
    "count" : { "$sum" : 1 }
  }},
  { "$sort" : { "_id.age" : 1 } }
]);
This should give you the results that you want. Note that the aggregate() method returns a cursor. You will have to iterate through it to get the results.
The aggregation framework is the best approach for this. Mongoid exposes the lower level collection object through a .collection accessor. This allows the native driver implementation of aggregate to be used.
The basic math here is:
truncated result of:
( difference from date of birth to now in milliseconds /
  number of milliseconds in a year )
Feed the current Time value into your aggregation statement to get the current age:
res = Model.collection.aggregate([
  { "$group" => {
    "_id" => {
      "$subtract" => [
        { "$divide" => [
          { "$subtract" => [ Time.now, "$dob" ] },
          31536000000
        ]},
        { "$mod" => [
          { "$divide" => [
            { "$subtract" => [ Time.now, "$dob" ] },
            31536000000
          ]},
          1
        ]}
      ]
    },
    "count" => { "$sum" => 1 }
  }},
  { "$sort" => { "_id" => -1 } }
])

pp res

Query Mongo Embedded Documents with a size

I have a ruby on rails app using Mongoid and MongoDB v2.4.6.
I have the following MongoDB structure, a record which embeds_many fragments:
{
  "_id" : "76561198045636214",
  "fragments" : [
    {
      "id" : 76561198045636215,
      "source_id" : "source1"
    },
    {
      "id" : 76561198045636216,
      "source_id" : "source2"
    },
    {
      "id" : 76561198045636217,
      "source_id" : "source2"
    }
  ]
}
I am trying to find all records in the database that contain fragments with duplicate source_ids.
I'm pretty sure I need to use $elemMatch as I need to query embedded documents.
I have tried
Record.elem_match(fragments: {source_id: 'source2'})
which works but doesn't restrict to duplicates.
I then tried
Record.elem_match(fragments: {source_id: 'source2', :source_id.with_size => 2})
which returns no results (but is a valid query). The query Mongoid produces is:
selector: {"fragments"=>{"$elemMatch"=>{:source_id=>"source2", "source_id"=>{"$size"=>2}}}}
Once that works, I need to update it so that $size is > 1.
Is this possible? It feels like I'm very close. This is a one-off cleanup operation so query performance isn't too much of an issue (however we do have millions of records to update!)
Any help is much appreciated!
I have been able to achieve the desired outcome, but in testing it's far too slow (it would take many weeks to run across our production system). The problem is the double query per record (we have ~30 million records in production).
Record.where('fragments.source_id' => 'source2').each do |record|
  query = record.fragments.where(source_id: 'source2')
  if query.count > 1
    # contains duplicates, delete all but latest
    query.desc(:updated_at).skip(1).delete_all
  end
  # needed to trigger after_save filters
  record.save!
end
The problem with the current approach here is that the standard MongoDB query forms do not actually "filter" the nested array documents in any way, which is essentially what you need in order to "find the duplicates" within your documents.
For this, MongoDB provides the aggregation framework, which is probably the best approach. There is no direct "mongoid" style for these queries, as Mongoid's helpers are geared towards the existing "rails" style of dealing with relational documents.
You can, however, access the "moped" form through the .collection accessor on your class model:
Record.collection.aggregate([
  # Find arrays of two elements or more as possibles
  { "$match" => {
    "$and" => [
      { "fragments" => { "$not" => { "$size" => 0 } } },
      { "fragments" => { "$not" => { "$size" => 1 } } }
    ]
  }},
  # Unwind the arrays to "de-normalize" as documents
  { "$unwind" => "$fragments" },
  # Group back and get counts of the "key" values
  { "$group" => {
    "_id" => { "_id" => "$_id", "source_id" => "$fragments.source_id" },
    "fragments" => { "$push" => "$fragments.id" },
    "count" => { "$sum" => 1 }
  }},
  # Match the keys found more than once
  { "$match" => { "count" => { "$gte" => 2 } } }
])
That would return you results like this:
{
  "_id" : { "_id" : "76561198045636214", "source_id" : "source2" },
  "fragments" : [ 76561198045636216, 76561198045636217 ],
  "count" : 2
}
That at least gives you something to work with on how to deal with the "duplicates" here.
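The unwind/group/match stages can be mimicked in plain Ruby for a single record, which makes it easier to see what the pipeline computes (a sketch using the example document from the question):

```ruby
record = {
  "_id" => "76561198045636214",
  "fragments" => [
    { "id" => 76561198045636215, "source_id" => "source1" },
    { "id" => 76561198045636216, "source_id" => "source2" },
    { "id" => 76561198045636217, "source_id" => "source2" }
  ]
}

# Group fragment ids by source_id (the $unwind + $group stages),
# then keep only groups with more than one member (the final $match).
duplicates = record["fragments"]
  .group_by { |f| f["source_id"] }
  .select { |_source_id, frags| frags.length > 1 }
  .transform_values { |frags| frags.map { |f| f["id"] } }
# duplicates == { "source2" => [76561198045636216, 76561198045636217] }
```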
