Mongoid each + set vs Criteria#set vs update_all + $addToSet - ruby

I was wondering which is better performance- and memory-wise: iterating over all objects in a collection and calling set/add_to_set on each, calling set/add_to_set directly on the Criteria, or using update_all with set/add_to_set.
# update_all
User.where(some_query).update_all(
  {
    '$addToSet': {
      :'some.field.value' => :value_to_add
    }
  }
)
# each do + add_to_set
User.where(some_query).each do |user|
  user.add_to_set(:'some.field.value' => :value_to_add)
end
# Criteria#add_to_set
User.where(some_query).add_to_set(
  :'some.field.value' => :value_to_add
)
Any input is appreciated. Thanks!

I started the MongoDB server with the verbose flag. Here is what I got.
Option 1. update_all applied on a selector
2017-04-25 COMMAND command production_v3.$cmd command: update { update: "products", updates: [ { q: { ... }, u: { $addToSet: { test_field: "value_to_add" } }, multi: true, upsert: false } ], ordered: true }
I removed some output so that it is easier to read. The flow is:
Mongoid generates a single command with the query and update specified.
The MongoDB server gets the command. It goes through the collection and updates each match, all within a single server-side operation.
Note! You can verify this in the source code or take it on faith. Since Mongoid generates the command to send in step 1, it does not check your models. E.g. if 'some.field.value' is not one of the fields in your User model, the command will still go through and persist on the DB.
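For reference, that logged command is roughly what you would get by running the following directly in the mongo shell (collection and field names taken from the log above; the selector itself is elided):
db.products.update(
  { /* ... the selector built from some_query ... */ },
  { $addToSet: { test_field: "value_to_add" } },
  { multi: true, upsert: false }
)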
Option 2. each on a selector
I get find commands like the one below, followed by multiple getMore commands:
2017-04-25 COMMAND command production_v3.products command: find { find: "products", filter: { ... } } 0ms
I also get a huge number of update-s:
2017-04-25 COMMAND command production_v3.$cmd command: update { update: "products", updates: [ { q: { _id: ObjectId('52a6db196c3f4f422500f255') }, u: { $addToSet: { test_field: { $each: [ "value_to_add" ] } } }, multi: false, upsert: false } ], ordered: true } 0ms
The flow is radically different from the 1st option:
Mongoid sends a simple query to the MongoDB server. Provided your collection is large enough and the query matches a material chunk of it, the following happens in a loop:
[loop] MongoDB responds with a subset of all matches and leaves the rest for the next iteration.
[loop] Mongoid gets an array of matching documents in Hash format, parses each entry and instantiates a User object for it. That's an expensive operation!
[loop] For each User instance from the previous step, Mongoid generates an update command and sends it to the server. Sockets are expensive too.
[loop] MongoDB gets the command, goes through the collection until the first match, and updates it. Each update is quick, but it adds up inside a loop.
[loop] Mongoid parses the response and updates its User instance accordingly. Expensive and unnecessary.
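Each iteration of that loop boils down to a single-document update, roughly the mongo shell equivalent of the logged command above (names taken from the log):
db.products.update(
  { _id: ObjectId('52a6db196c3f4f422500f255') },
  { $addToSet: { test_field: { $each: [ "value_to_add" ] } } },
  { multi: false, upsert: false }
)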
Option 3. add_to_set applied on a selector
Under the hood it is equivalent to Option 1. Its CPU and memory overhead is immaterial for the purposes of this question.
Conclusion.
Option 2 is so much slower that there is no point in benchmarking it. In the particular case I tried, it resulted in thousands of requests to MongoDB and thousands of User instantiations. Options 1 and 3 resulted in a single request to MongoDB and relied on MongoDB's highly optimized engine.

Related

2 indexes with same alias but only one is marked as write index

I have 2 indexes in Elasticsearch that have the same alias name, but is_write_index is true for only one of them. Now, when I try to write to the indexes by the alias name, I am getting the following error:
Alias [alias_name] has more than one indices associated with it [[indexname1, indexname2]], can't execute a single index op
This is what I am doing, which is failing:
let doc = {
  script: {
    source: ...,
    lang: "painless",
    params: {
      ...document
    }
  },
  upsert: {
    ...document
  }
};
await client.update({
  index: alias_name,
  id: docId,
  body: doc
});
What could be the issue here? My thinking was that if only one of them is marked as the write index, it would just write to that one and writes would not fail (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html#aliases-write-index). Am I understanding this wrong?
The problem is that an update is a combo operation: the document is first retrieved using a GET, then updated using the script and finally reindexed using the index operation.
What doesn't work in your case is the first operation: a GET cannot be done over an alias that spans multiple indexes. Even if only one of them has the is_write_index flag set to true, there's no guarantee that the document to be updated is actually in that index.
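One possible workaround (a sketch only, assuming the JavaScript client and variable names from the question) is to resolve which concrete index is the write index and send the update to that index name instead of the alias:
// Resolve the concrete write index behind the alias, then update against it.
const aliasInfo = await client.indices.getAlias({ name: alias_name });
const indices = aliasInfo.body || aliasInfo; // newer clients wrap the result in .body
const writeIndex = Object.keys(indices).find(
  (name) => indices[name].aliases[alias_name].is_write_index
);
await client.update({
  index: writeIndex,
  id: docId,
  body: doc
});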

How can I filter if any value of an array is contained in another array in rethinkdb/reql?

I want to find any user who is a member of a group I can manage (using the web interface / JavaScript):
Users:
{
  "id": 1,
  "member_in_groups": ["all", "de-south"]
},
{
  "id": 2,
  "member_in_groups": ["all", "de-north"]
}
I tried:
r.db('mydb').table('users').filter(r.row('member_in_groups').map(function(p) {
return r.expr(['de-south']).contains(p);
}))
but both users are always returned. Which command do I have to use, and how can I use an index for this? (I read about multi indexes in https://rethinkdb.com/docs/secondary-indexes/python/#multi-indexes, but there only one value is searched for.)
I got the correct answer in the Slack channel, so I'm posting it here in case anyone else comes to this thread through googling:
First create a multi index as described in
https://rethinkdb.com/docs/secondary-indexes/javascript/, e.g.
r.db('<db-name>').table('<table-name>').indexCreate('<some-index-name>', {multi: true}).run()
(you can omit .run() if using the webadmin)
Then query the data with
r.db('<db-name>').table('<table-name>').getAll('de-north', 'de-west', {index:'<some-index-name>'}).distinct()
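For reference, a filter-only variant (no index) is also possible; this is just a sketch using the table and group names from the question. The original attempt matched everything because map returns an array, which filter does not treat as false:
// Keep users whose member_in_groups shares at least one element with the given groups.
r.db('mydb').table('users').filter(function(user) {
  return user('member_in_groups').setIntersection(['de-south']).count().gt(0);
})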

ES keeps returning every document

I recently inherited an ES instance and made sure I read an entire book on ES cover to cover before posting this; however, I'm afraid I'm unable to get even simple examples to work.
I have an index in our staging environment which exhibits behavior where every document is returned no matter what, while a similar index in our QA environment works as I would expect it to. For example, I am running the following query against http://staging:9200/people_alias/_search?explain:
{ "query" :
{ "filtered" :
{ "query" : { "match_all" : {} },
"filter" : { "term" : { "_id" : "34414405382" } } } } }
What I noticed in this staging environment is that the score of every document is 1 and it returns EVERY document in my index no matter what value I specify. Using ?explain I see the following:
_explanation: {
  value: 1
  description: ConstantScore(*:*), product of:
  details: [
    {
      value: 1,
      description: boost
    },
    {
      value: 1,
      description: queryNorm
    }
  ]
}
In my QA environment, which correctly returns only one record, I observe the following for ?explain:
_explanation: {
  value: 1
  description: ConstantScore(cache(_uid:person#34414405382)), product of:
  details: [
    {
      value: 1,
      description: boost
    },
    {
      value: 1,
      description: queryNorm
    }
  ]
}
The mappings are almost identical on both indices. The only difference is that I removed the manual field-level boost values on some fields, as I read that field-level boosting is not recommended in favor of query-time boosting; however, this should not affect the behavior of filtering on the document ID (right?)
Is there any clue I can glean from the differences in the explain output, or should I post the index mappings? Are there any server-level settings I should consider checking? It doesn't matter what query I use on Staging: I can use match queries and exact-match lookups on other fields, and Staging just keeps returning every result with score 1.0.
I feel like I'm doing something glaringly and obviously wrong in my Staging environment. Could someone please explain the presence of ConstantScore, boost and queryNorm? I thought from looking at examples in other literature that I would see things like term frequency, etc.
EDIT: I am issuing the query from the Elasticsearch Head plugin.
In your HEAD plugin, you need to use POST in order to send the query in the payload, otherwise the _search endpoint is hit without any constraints.
In your browser, if you open the developer tools and look at the networking tab, you'll see that nothing is sent in the payload when using GET.
It's a common mistake people make. Some HTTP clients (like curl) do send a payload using GET, but others (like /head/) don't. Sense will warn you if you use GET instead of POST when sending a payload and will automatically force POST instead of GET.
So to sum it up, it's best to always use POST whenever you wish to send some payload to your servers, so you don't have to care about the behavior of the HTTP client you're using.
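For illustration, a minimal sketch of sending the query from the question as a POST payload (fetch is just one possible HTTP client; run it inside an async context):
const response = await fetch("http://staging:9200/people_alias/_search", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    query: {
      filtered: {
        query: { match_all: {} },
        filter: { term: { _id: "34414405382" } }
      }
    }
  })
});
console.log(await response.json());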

Setting Elastic search limit to "unlimited"

How can I get all the results from Elasticsearch, given that the results are limited to 10 by default? I have got a query like:
@data = Athlete.search :load => true do
  size 15
  query do
    boolean do
      must { string q, {:fields => ["name", "other_names", "nickname", "short_name"], :phrase_slop => 5} }
      unless conditions.blank?
        conditions.each do |condition|
          must { eval(condition) }
        end
      end
      unless excludes.blank?
        excludes.each do |exclude|
          must_not { eval(exclude) }
        end
      end
    end
  end
  sort do
    by '_score', "desc"
  end
end
I have set the limit to 15, but I want to make it unlimited so that I can get all the data.
I can't hardcode the limit, as my data keeps changing and I want to get all of it.
You can use the from and size parameters to page through all your data. This could be very slow depending on your data and how much is in the index.
http://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html
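A rough sketch of from/size paging over HTTP (the index name is hypothetical and any HTTP client will do; run inside an async function):
const size = 100;
for (let from = 0; ; from += size) {
  const res = await fetch("http://localhost:9200/athletes/_search", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ from: from, size: size, query: { match_all: {} } })
  });
  const hits = (await res.json()).hits.hits;
  // ... process this page of hits ...
  if (hits.length < size) break; // no more pages
}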
Another approach is to first do a searchType: 'count', and then do a normal search with size set to results.count.
The advantage here is it avoids depending on a magic number for UPPER_BOUND as suggested in this similar SO question, and avoids the extra overhead of building too large of a priority queue that Shay Banon describes here. It also lets you keep your results sorted, unlike scan.
The biggest disadvantage is that it requires two requests. Depending on your circumstance, this may be acceptable.
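A sketch of that two-request idea, here using the _count endpoint to get the total (the index name is hypothetical; the original answer used the older searchType: 'count'):
const q = { query: { match_all: {} } };
const total = (await fetch("http://localhost:9200/athletes/_count", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(q)
}).then(r => r.json())).count;
const results = await fetch("http://localhost:9200/athletes/_search", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ size: total, query: q.query })
}).then(r => r.json());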
From the docs, "Note that from + size can not be more than the index.max_result_window index setting which defaults to 10,000". So my admittedly very ad-hoc solution is to just pass size: 10000 or 10,000 minus from if I use the from argument.
Note that, following Matt's comment below, the proper way to do this if you have a larger number of documents is to use the scroll API. I have used this successfully, but only with the Python interface.
Use the scan method, e.g.
curl -XGET 'localhost:9200/_search?search_type=scan&scroll=10m&size=50' -d '
{
  "query" : {
    "match_all" : {}
  }
}'
See here.
You can use search_after to paginate, and the Point in Time API to avoid having your data change while you paginate. Example with elasticsearch-dsl for Python:
from typing import Any, List

from elasticsearch_dsl import Search
from elasticsearch_dsl.connections import connections
from elasticsearch_dsl.utils import AttrDict

# Set up paginated query with search_after and a fixed point_in_time
elasticsearch = connections.create_connection(hosts=[elastic_host])
pit = elasticsearch.open_point_in_time(index=MY_INDEX, keep_alive="3m")
pit_id = pit["id"]
query_size = 500
search_after = [0]
hits: List[AttrDict[str, Any]] = []
while query_size:
    if hits:
        search_after = hits[-1].meta.sort
    search = (
        Search()
        .extra(size=query_size)
        .extra(pit={"id": pit_id, "keep_alive": "5m"})
        .extra(search_after=search_after)
        .filter(filter_)
        .sort("url.keyword")  # Note you need a unique field to sort on or it may never advance
    )
    response = search.execute()
    hits = [hit for hit in response]
    pit_id = response.pit_id
    query_size = len(hits)
    for hit in hits:
        ...  # Do work with each hit

Extremely slow performance of some MongoDB queries

I have a collection of about 30K items, all of which have a field called Program. "Program" is the first part of a compound index, so looking up an item with a specific Program value is very fast. It is also fast to run range queries, e.g.:
db.MyCollection.find(
{ $and: [ { Program: { "$gte" : "K", "$lt" : "L" } },
{ Program: { "$gte" : "X", "$lt" : "Y" } } ] }).count();
The query above does not return any results because I am querying for the overlap of two non-overlapping ranges, (K-L) and (X-Y). The left range (K-L) contains about 7K items.
However, if I replace the second $and clause with a $where expression, the query execution takes ages:
db.MyCollection.find(
{ $and: [ { Program: { "$gte" : "K", "$lt" : "L" } }, { "$where" : "this.Program == \"Z\"" } ] }).count();
As you can see, the query above should also return an empty result set (the range K-L is combined with Program == "Z"). I am aware of the slow performance of $where, but shouldn't Mongo first reduce the potential result set by evaluating the left clause (which would yield about 7K items) and only then apply the $where check? And if it does, shouldn't processing a few thousand items take seconds rather than the minutes it takes on my machine, with the Mongo service consuming about 3 GB of RAM while performing this operation? That looks too heavy for a relatively small collection.
There are a few things that you can do:
Use explain() to see what is happening with your query. explain() is described here. Use the $explain operator to return a document that describes the process and indexes used to return the query results. For example:
db.collection.find(query).explain()
If that doesn't return enough information, you can look at using the Database Profiler. However, please bear in mind that this is not free and adds load itself. Within this page, you can also find some basic notes about optimising the query performance.
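For instance, a quick sketch of enabling the profiler in the mongo shell (the 100 ms threshold is just an example) and inspecting the most recent captured operations:
db.setProfilingLevel(1, 100)  // profile operations slower than 100 ms
db.system.profile.find().sort({ ts: -1 }).limit(5).pretty()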
However, in your case, it all boils down to the $where operator:
$where evaluates JavaScript and cannot take advantage of indexes. Therefore, query performance improves when you express your query using the standard MongoDB operators (e.g., $gt, $in).
In general, you should use $where only when you can’t express your query using another operator. If you must use $where, try to include at least one other standard query operator to filter the result set. Using $where alone requires a table scan. $where, like Map Reduce, limits your concurrency.
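In this particular case, the $where clause can be expressed with standard operators, which lets the index on Program do the work; a sketch of the rewritten query from the question:
db.MyCollection.find(
  { $and: [ { Program: { "$gte" : "K", "$lt" : "L" } },
            { Program: "Z" } ]  // instead of { "$where" : "this.Program == \"Z\"" }
  }).count();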
As an FYI, here are a couple of things to note about the output from explain():
ntoreturn - Number of objects the client requested for return from a query. For example, findOne() sets ntoreturn to 1; limit() sets the appropriate limit. Zero indicates no limit.
query - Details of the query spec.
nscanned - Number of objects scanned in executing the operation.
reslen - Query result length in bytes.
nreturned - Number of objects returned from the query.
