Query Mongo Embedded Documents with a size - ruby

I have a Ruby on Rails app using Mongoid and MongoDB v2.4.6.
I have the following MongoDB structure: a record which embeds_many fragments:
{
"_id" : "76561198045636214",
"fragments" : [
{
"id" : 76561198045636215,
"source_id" : "source1"
},
{
"id" : 76561198045636216,
"source_id" : "source2"
},
{
"id" : 76561198045636217,
"source_id" : "source2"
}
]
}
I am trying to find all records in the database that contain fragments with duplicate source_ids.
I'm pretty sure I need to use $elemMatch as I need to query embedded documents.
I have tried
Record.elem_match(fragments: {source_id: 'source2'})
which works but doesn't restrict to duplicates.
I then tried
Record.elem_match(fragments: {source_id: 'source2', :source_id.with_size => 2})
which returns no results (but is a valid query). The query Mongoid produces is:
selector: {"fragments"=>{"$elemMatch"=>{:source_id=>"source2", "source_id"=>{"$size"=>2}}}}
Once that works I need to change it so that $size is greater than 1.
Is this possible? It feels like I'm very close. This is a one-off cleanup operation so query performance isn't too much of an issue (however we do have millions of records to update!)
Any help is much appreciated!
I have been able to achieve the desired outcome, but in testing it's far too slow (it would take many weeks to run across our production system). The problem is the double query per record (we have ~30 million records in production).
Record.where('fragments.source_id' => 'source2').each do |record|
query = record.fragments.where(source_id: 'source2')
if query.count > 1
# contains duplicates, delete all but latest
query.desc(:updated_at).skip(1).delete_all
end
# needed to trigger after_save filters
record.save!
end

The problem with the current approach is that the standard MongoDB query forms do not actually "filter" the nested array documents in any way, which is essentially what you need in order to "find the duplicates" within your documents.
For this, the aggregation framework is probably the best approach. There is no direct "mongoid" style approach to this kind of query, as those are geared towards the existing "rails" style of dealing with relational documents.
You can access the "moped" form, though, through the .collection accessor on your model class:
Record.collection.aggregate([
# Find arrays with two or more elements as candidates
{ "$match" => {
"$and" => [
{ "fragments" => { "$not" => { "$size" => 0 } } },
{ "fragments" => { "$not" => { "$size" => 1 } } }
]
}},
# Unwind the arrays to "de-normalize" as documents
{ "$unwind" => "$fragments" },
# Group back and get counts of the "key" values
{ "$group" => {
"_id" => { "_id" => "$_id", "source_id" => "$fragments.source_id" },
"fragments" => { "$push" => "$fragments.id" },
"count" => { "$sum" => 1 }
}},
# Match the keys found more than once
{ "$match" => { "count" => { "$gte" => 2 } } }
])
That would return you results like this:
{
"_id" : { "_id": "76561198045636214", "source_id": "source2" },
"fragments": ["76561198045636216","76561198045636217"],
"count": 2
}
That at least gives you something to work with in deciding how to deal with the "duplicates" here.
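To make the pipeline's effect concrete, here is a minimal pure-Ruby sketch of what the $unwind, $group and final $match stages compute. It runs against in-memory hashes rather than a live collection, so it is an illustration of the logic only. The helper name duplicate_fragment_groups is hypothetical, and the sample record is the one from the question.

```ruby
# Hypothetical helper: mirrors the $unwind / $group / $match stages in plain
# Ruby so the logic can be checked without a MongoDB server.
def duplicate_fragment_groups(records)
  records.flat_map do |record|
    record.fetch("fragments", [])
          .group_by { |fragment| fragment["source_id"] }  # $unwind + $group
          .select { |_source_id, group| group.size >= 2 } # $match on count >= 2
          .map do |source_id, group|
            {
              "_id"       => { "_id" => record["_id"], "source_id" => source_id },
              "fragments" => group.map { |fragment| fragment["id"] },
              "count"     => group.size
            }
          end
  end
end

# Sample record taken from the question.
records = [
  {
    "_id" => "76561198045636214",
    "fragments" => [
      { "id" => 76561198045636215, "source_id" => "source1" },
      { "id" => 76561198045636216, "source_id" => "source2" },
      { "id" => 76561198045636217, "source_id" => "source2" }
    ]
  }
]

duplicates = duplicate_fragment_groups(records)
duplicates.each { |group| p group }
```

Each returned group lists the fragment ids that share a source_id within one record. Deciding which of them to keep (for example the latest by updated_at) would need that field carried through the $push as well, which the pipeline above does not do.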

Related

Search result fluctuations

I have a bunch of collections with documents and I have encountered something strange. When I execute the same request a few times in a row, the results alternate between two counts.
It would be fine if the fluctuations were small, but the count of results changes by ~75000 documents.
So my question is: what is going on?
My request is:
POST mycollection/mytype/_search
{
"fields": ["timestamp", "bool_field"],
"filter" : {
"terms":{
"bool_field" : [true]
}
}
}
The results go like this:
=> 148866
=> 75381
=> 148866
=> 75381
=> 148866
=> 75381
=> 148866
When the count is 148k, I see some records with bool_field: "False" in Sense.

Selecting age count without intervals

What I am trying to do is write a query that returns a count of people at each individual age, not in increments. So the count of people that have been alive for 1, 2, 3, ... 67 ... 99, ... years.
I am not familiar with NoSQL, but I know that because time is ongoing, the age counts will have to be periodically updated/refreshed. What I was thinking was to have a collection or something similar where the key is the age and the value is the number of people of that age. When a new person is created, it would increment the number of people at his or her age, and then, as I said earlier, something would update it periodically.
What I am trying to figure out is whether there is a way to fetch the number of people of each age in real time without having a counter. Or, if I must use a counter, how can I have the database automatically increment the counter so I don't need to do it from the program?
You can achieve this with MongoDB's aggregation framework. To keep the counts up to date in real time, compute each age from the date of birth at query time:
1. Project an ageMillis field by subtracting the date of birth (dob) from the current date. This gives the age in milliseconds.
2. Divide ageMillis by the number of milliseconds in a year (31536000000 for a 365-day year) and project the result onto an ageDecimal field. You don't want to group on this value because it still has a fractional part.
3. Project the ageDecimal field along with a decimal field holding the fractional part of the age, which you get with the $mod operator.
4. Subtract decimal from ageDecimal and project the result onto an age field. This gives the age in whole years.
5. Group by the age field and keep the count using $sum, adding 1 for every document seen for that age.
6. If needed, sort by the age field.
The command in the mongo shell would look something like the command below, using JavaScript's Date() object to get the current date. If you want to do this in Ruby, you would have to change that bit of code and make sure that for the rest, you follow the syntax for the Ruby driver.
db.collection.aggregate([
{ "$project" :
{
"ageMillis" : { "$subtract" : [ new Date(), "$dob" ]}
}
},
{ "$project" :
{
"ageDecimal" : { "$divide" : [ "$ageMillis", 31536000000 ]}
}
},
{ "$project" :
{
"ageDecimal" : "$ageDecimal",
"decimal" : { "$mod" : [ "$ageDecimal", 1 ]}
}
},
{ "$project" :
{
"age" : { "$subtract" : [ "$ageDecimal", "$decimal" ]}
}
},
{ "$group" :
{
"_id" : { "age" : "$age" },
"count" : { "$sum" : 1 }
}
},
{ "$sort" :
{
"_id.age" : 1
}
}
]);
This should give you the results that you want. Note that the aggregate() method returns a cursor. You will have to iterate through it to get the results.
The aggregation framework is the best approach for this. Mongoid exposes the lower level collection object through a .collection accessor. This allows the native driver implementation of aggregate to be used.
The basic math here is:
Whole-number part (floor) of:
( difference from date of birth to now in milliseconds /
number of milliseconds in a year )
Feed the current Time value into your aggregation statement to get the current age:
res = Model.collection.aggregate([
{ "$group" => {
"_id" => {
"$subtract" => [
{ "$divide" => [
{ "$subtract" => [ Time.now, "$dob" ] },
31536000000
]},
{ "$mod" => [
{ "$divide" => [
{ "$subtract" => [ Time.now, "$dob" ] },
31536000000
]},
1
]}
]
},
"count" => { "$sum" => 1 }
}},
{ "$sort" => { "_id" => -1 } }
])
pp res
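As a worked check of the arithmetic, the following is a small pure-Ruby sketch of the same subtract, divide and $mod steps, applied to in-memory Time values instead of documents. The helper names age_in_years and age_counts and the sample birth dates are assumptions made for illustration.

```ruby
# Hypothetical helper: same arithmetic as the $subtract / $divide / $mod stages.
def age_in_years(dob, now)
  ms_per_year = 31_536_000_000              # 365 days, the constant used above
  age_millis  = (now - dob) * 1000.0        # Time#- returns seconds as a Float
  age_decimal = age_millis / ms_per_year
  (age_decimal - (age_decimal % 1)).to_i    # drop the fractional part
end

# Count people per age, like $group with "$sum" => 1.
def age_counts(dobs, now)
  dobs.each_with_object(Hash.new(0)) do |dob, counts|
    counts[age_in_years(dob, now)] += 1
  end
end

now  = Time.utc(2020, 1, 1)
dobs = [Time.utc(1990, 1, 1), Time.utc(1990, 6, 15), Time.utc(2000, 1, 1)]

p age_counts(dobs, now)
```

Because the constant assumes 365-day years, leap days make the result drift slightly: with these helpers, someone born on 1990-01-01 is already counted as 30 on 2019-12-26, a few days before the actual birthday. For a one-off report that is usually acceptable, but it is worth knowing when the counts are compared against exact birthdays.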

How to map / query this data with ElasticSearch?

I'm using ElasticSearch along with the tire gem to power the search
functionality of my site. I'm having trouble figuring out how to map and
query the data to get the results I need.
Relevant code is below. I will explain the desired output below that as
well.
# models/product.rb
class Product < ActiveRecord::Base
include Tire::Model::Search
include Tire::Model::Callbacks
has_many :categorizations
has_many :categories, :through => :categorizations
has_many :product_traits
has_many :traits, :through => :product_traits
mapping do
indexes :id, type: 'integer'
indexes :name, boost: 10
indexes :description, analyzer: 'snowball'
indexes :categories do
indexes :id, type: 'integer'
indexes :name, type: 'string', index: 'not_analyzed'
end
indexes :product_traits, type: 'string', index: 'not_analyzed'
end
def self.search(params={})
out = tire.search(page: params[:page], per_page: 12, load: true) do
query do
boolean do
must { string params[:query], default_operator: "OR" } if params[:query].present?
must { term 'categories.id', params[:category_id] } if params[:category_id].present?
# if we aren't browsing a category, search results are "drill-down"
unless params[:category_id].present?
must { term 'categories.name', params[:categories] } if params[:categories].present?
end
params.select { |p| p[0,2] == 't_' }.each do |name,value|
must { term :product_traits, "#{name[2..-1]}##{value}" }
end
end
end
# don't show the category facets if we are browsing a category
facet("categories") { terms 'categories.name', size: 20 } unless params[:category_id].present?
facet("traits") {
terms :product_traits, size: 1000 #, all_terms: true
}
# raise to_curl
end
# process the trait facet results into a hash of arrays
if out.facets['traits']
facets = {}
out.facets['traits']['terms'].each do |f|
split = f['term'].partition('#')
facets[split[0]] ||= []
facets[split[0]] << { 'term' => split[2], 'count' => f['count'] }
end
out.facets['traits']['terms'] = facets
end
out
end
def to_indexed_json
{
id: id,
name: name,
description: description,
categories: categories.all(:select => 'categories.id, categories.name, categories.keywords'),
product_traits: product_traits.includes(:trait).collect { |t| "#{t.trait.name}##{t.value}" }
}.to_json
end
end
As you can see above, I'm doing some pre/post processing of the data
to/from elasticsearch in order to get what I want from the
'product_traits' field. This is what doesn't feel right and where my
questions originate.
I have a large catalog of products, each with a handful of 'traits' such
as color, material and brand. Since these traits are so varied, I
modeled the data to include a Trait model which relates to the Product
model via a ProductTrait model, which holds the value of the trait for
the given product.
First question is: How can I create the elasticsearch mapping to index
these traits properly? I assume that this involves a nested type but I
can't make enough sense of the docs to figure it out.
Second question: I want the facets to come back in groups (in the
manner that I am processing them at the end of the search method
above) but with counts that reflect how many matches there are without
taking into account the currently selected value for each trait. For
example: If the user searches for 'Glitter' and then clicks the link
corresponding to the 'Blue Color' facet, I want all the 'Color' facets
to remain visible and show counts corresponding to the query results
without the 'Blue Color' filter. I hope that is a good explanation,
sorry if it needs more clarification.
If you index your traits as:
[
{
trait: 'color',
value: 'green'
},
{
trait: 'material',
value: 'plastic'
}
]
this would be indexed internally as:
{
trait: ['color', 'material' ],
value: ['green', 'plastic' ]
}
which means that you could only ever query for docs which have a trait field containing 'color' and a value field containing 'green'. The relationship between each trait and its own value is lost.
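The loss of pairing can be shown with a short pure-Ruby sketch. It only imitates the flattening described above on in-memory hashes and does not talk to elasticsearch; the helper name flatten_traits is hypothetical.

```ruby
# Flatten an array of {trait, value} hashes the way the answer describes:
# one list of values per field name, with the pairing discarded.
def flatten_traits(traits)
  traits.each_with_object({}) do |pair, flat|
    pair.each { |field, value| (flat[field] ||= []) << value }
  end
end

traits = [
  { trait: 'color',    value: 'green' },
  { trait: 'material', value: 'plastic' }
]

flat = flatten_traits(traits)
p flat

# Against the flattened form, "trait is material AND value is green" matches,
# even though no single trait/value pair says so.
flat_match = flat[:trait].include?('material') && flat[:value].include?('green')

# Checking pair by pair (the relationship a nested type preserves) does not match.
pair_match = traits.any? { |t| t[:trait] == 'material' && t[:value] == 'green' }

p flat_match
p pair_match
```

The false positive in flat_match is exactly the problem the options below are meant to avoid.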
You have a few choices to solve this problem.
As single terms
The first you are already doing, and it is a good solution, i.e. storing the traits as single terms like:
['color#green','material#plastic']
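A minimal sketch of this single-term approach in plain Ruby is shown here, following the same '#' convention and the same partition call as the question's code. The helper names encode_trait and group_trait_facets and the counts in the sample data are made up for illustration.

```ruby
# Join a trait name and value into one term, e.g. "color#green".
def encode_trait(name, value)
  "#{name}##{value}"
end

# Regroup flat facet terms into a hash keyed by trait name.
def group_trait_facets(terms)
  terms.each_with_object({}) do |facet, grouped|
    # partition splits on the first '#' only, so values may contain '#'
    # but trait names must not.
    name, _separator, value = facet['term'].partition('#')
    (grouped[name] ||= []) << { 'term' => value, 'count' => facet['count'] }
  end
end

# Illustrative facet terms; the counts are invented sample data.
terms = [
  { 'term' => encode_trait('color', 'green'),      'count' => 3 },
  { 'term' => encode_trait('color', 'blue'),       'count' => 1 },
  { 'term' => encode_trait('material', 'plastic'), 'count' => 4 }
]

grouped = group_trait_facets(terms)
p grouped
```

The trade-off is that every trait lives in one field, so per-trait facets have to be regrouped on the client as above.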
As objects
An alternative (assuming you have a limited number of trait names) would be to store them as:
{
traits: {
color: 'green',
material: 'plastic'
}
}
Then you could run queries against traits.color or traits.material.
As nested
If you want to keep your array structure, then you can use the nested type eg:
{
"mappings" : {
"product" : {
"properties" : {
... other fields ...
"traits" : {
"type" : "nested",
"properties" : {
"trait" : {
"index" : "not_analyzed",
"type" : "string"
},
"value" : {
"index" : "not_analyzed",
"type" : "string"
}
}
}
}
}
}
}
Each trait/value pair would be indexed internally as a separate (but related) document, meaning that there would be a relationship between the trait and its value. You'd need to use nested queries or nested filters to query them, eg:
curl -XGET 'http://127.0.0.1:9200/test/product/_search?pretty=1' -d '
{
"query" : {
"filtered" : {
"query" : {
"text" : {
"name" : "my query terms"
}
},
"filter" : {
"nested" : {
"path" : "traits",
"filter" : {
"and" : [
{
"term" : {
"trait" : "color"
}
},
{
"term" : {
"value" : "green"
}
}
]
}
}
}
}
}
}
'
Combining facets, filtering and nested docs
You state that, when a user filters on eg color == green you want to show results only where color == green, but you still want to show the counts for all colors.
To do that, you need to use the filter param to the search API rather than a filtered query. A filtered query filters out the results BEFORE calculating the facets. The filter param is applied to query results AFTER calculating facets.
Here's an example where the final query results are limited to docs where color == green but the facets are calculated for all colors:
curl -XGET 'http://127.0.0.1:9200/test/product/_search?pretty=1' -d '
{
"query" : {
"text" : {
"name" : "my query terms"
}
},
"filter" : {
"nested" : {
"path" : "traits",
"filter" : {
"and" : [
{
"term" : {
"trait" : "color"
}
},
{
"term" : {
"value" : "green"
}
}
]
}
}
},
"facets" : {
"color" : {
"nested" : "traits",
"terms" : { "field" : "value" },
"facet_filter" : {
"term" : {
"trait" : "color"
}
}
}
}
}
'
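The ordering described above (facets are counted over the query hits, and the top-level filter only narrows the returned results) can be imitated in plain Ruby. This is an in-memory sketch with invented sample products, not a call to elasticsearch.

```ruby
# Invented sample data for illustration.
products = [
  { name: 'glitter pen',   color: 'green' },
  { name: 'glitter glue',  color: 'blue' },
  { name: 'glitter paper', color: 'green' },
  { name: 'plain paper',   color: 'blue' }
]

# 1. The query selects the candidate documents.
hits = products.select { |product| product[:name].include?('glitter') }

# 2. Facets are counted over the query hits, before the top-level filter.
color_facet = hits.each_with_object(Hash.new(0)) do |product, counts|
  counts[product[:color]] += 1
end

# 3. The top-level filter narrows only the results that are returned.
results = hits.select { |product| product[:color] == 'green' }

p color_facet
p results.map { |product| product[:name] }
```

Here the color facet still reports blue even though only green products are returned, which is the behaviour asked for in the question. With a filtered query, step 3 would run before step 2 and the blue count would disappear.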

Using upsert with push to an array option on Ruby driver

I'm trying to do an upsert with the Ruby driver for MongoDB.
If the document exists, I wish to push new data to an array; otherwise, create a new document with one item in the array.
When I run it on mongodb it looks like that:
db.events.update( { "_id" : ObjectId("4f0ef9171d41c85a1b000001")},
{ $push : { "events" : { "field_a" : 1 , "field_b" : "2"}}}, true)
And it works.
When I run it on ruby it looks like that:
@col_events.update( { "_id" => BSON::ObjectId.from_string("4f0ef9171d41c85a1b000001")},
{ :$push => { "events" => { "field_a" => 1 , "field_b" => "2"}}}, :$upsert=>true)
And it doesn't work. I don't get an error but I don't see new rows either.
I would appreciate help in understanding what I am doing wrong.
So a couple of issues.
In Ruby, the option should be :upsert => true. Note that there is no $. This is covered in the Ruby driver docs.
You are not running the query with :safe=>true. This means that some exceptions will not fire. So you could be causing an exception on the server, but you are not waiting for the server to acknowledge the write.
Just adding some code for Gates VP's excellent answer:
require 'rubygems'
require 'mongo'
@col_events = Mongo::Connection.new()['test']['events']
# safe mode enabled
@col_events.update(
{ "_id" => BSON::ObjectId.from_string("4f0ef9171d41c85a1b000001")},
{ "$push" => { "events" => { "field_a" => 1, "field_b" => "2"}}},
:upsert => true, :safe => true
)
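For readers who want to see the intended upsert-plus-$push behaviour without a server, here is a rough in-memory model in plain Ruby. It is a stand-in that imitates the outcome only, not the driver's implementation; the helper name upsert_push and the second pushed item are assumptions for illustration.

```ruby
# In-memory stand-in for a collection, keyed by _id. This models the
# behaviour only; it is not how the Ruby driver or the server works internally.
def upsert_push(collection, id, field, item)
  document = (collection[id] ||= { "_id" => id }) # create if missing (upsert)
  (document[field] ||= []) << item                # append to the array ($push)
  document
end

events = {}
id = "4f0ef9171d41c85a1b000001"

# First call: no document yet, so one is created with a single-item array.
upsert_push(events, id, "events", { "field_a" => 1, "field_b" => "2" })

# Second call: the document exists, so the item is appended.
upsert_push(events, id, "events", { "field_a" => 3, "field_b" => "4" })

p events
```

After both calls there is still one document, and its events array holds two items.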

Difference with count result in Mongo group by query with Ruby/Javascript

I'm using Mongoid to get a count of certain types of records in a Mongo database. When running the query with the javascript method:
db.tags.group({
cond : { tag: {$ne:'donotwant'} },
key: { tag: true },
reduce: function(doc, out) { out.count += 1; },
initial: { count: 0 }
});
I get the following results:
[
{"tag" : "thing", "count" : 4},
{"tag" : "something", "count" : 1},
{"tag" : "test", "count" : 1}
]
This does exactly what I want it to do. However, when I use the corresponding Mongoid code to perform the same query:
Tag.collection.group(
:cond => {:tag => {:$ne => 'donotwant'}},
:key => [:tag],
:reduce => "function(doc, out) { out.count += 1 }",
:initial => { :count => 0 },
)
the count parameters are (seemingly) selected as floats instead of integers:
[
{"tag"=>"thing", "count"=>4.0},
{"tag"=>"something", "count"=>1.0},
{"tag"=>"test", "count"=>1.0}
]
Am I misunderstanding what's going on behind the scenes? Do I need to (can I?) cast those counts or is the javascript result just showing it without the .0?
JavaScript doesn't distinguish between floats and ints. It has one Number type that is implemented as a double. So what you are seeing in Ruby is correct; the mongo shell output follows JavaScript printing conventions and displays Numbers that don't have a fractional component without the '.0'.
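If integer counts are wanted on the Ruby side, the values can simply be converted after the query returns. A minimal sketch, using the result rows from the question as plain hashes:

```ruby
# Result rows as returned to Ruby in the question.
results = [
  { "tag" => "thing",     "count" => 4.0 },
  { "tag" => "something", "count" => 1.0 },
  { "tag" => "test",      "count" => 1.0 }
]

# Convert each Float count to an Integer without mutating the original rows.
normalized = results.map { |row| row.merge("count" => row["count"].to_i) }

p normalized
```

Since the counts are produced by adding 1 repeatedly, they are whole numbers and to_i loses nothing here.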
