MongoDB extremely slow at counting null values (or {$exists: false}) - performance

I have a Mongo server running on a VPS with 16GB of memory (though probably with slow I/O, since it uses magnetic disks).
I have a collection of around 35 million records which doesn't fit into main memory (db.stats() reports a size of 35GB and a storageSize of 14GB); however, the 1.7GB reported for totalIndexSize should fit comfortably in memory.
There is a particular field, bg, that I'm querying over, which can either be present with the value true or absent entirely (please no discussions about whether this is the best data representation – I still think Mongo is behaving weirdly). This field is covered by a non-sparse index with a reported size of 146MB.
I'm using the WiredTiger storage engine with a default cache size (so it should be around 8GB).
I'm trying to count the number of records missing the bg field.
Counting true values is tolerably fast (a few seconds):
> db.entities.find({bg: true}).count()
8300677
However the query for missing values is extremely slow (around 5 minutes):
> db.entities.find({bg: null}).count()
27497706
To my eyes, explain() looks ok:
> db.entities.find({bg: null}).explain()
{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "testdb.entities",
        "indexFilterSet" : false,
        "parsedQuery" : {
            "bg" : {
                "$eq" : null
            }
        },
        "winningPlan" : {
            "stage" : "FETCH",
            "filter" : {
                "bg" : {
                    "$eq" : null
                }
            },
            "inputStage" : {
                "stage" : "IXSCAN",
                "keyPattern" : {
                    "bg" : 1
                },
                "indexName" : "bg_1",
                "isMultiKey" : false,
                "direction" : "forward",
                "indexBounds" : {
                    "bg" : [
                        "[null, null]"
                    ]
                }
            }
        },
        "rejectedPlans" : [ ]
    },
    "serverInfo" : {
        "host" : "mongo01",
        "port" : 27017,
        "version" : "3.0.3",
        "gitVersion" : "b40106b36eecd1b4407eb1ad1af6bc60593c6105"
    },
    "ok" : 1
}
However the query remains stubbornly slow, even after repeated calls. Other count queries for different values are fast:
> db.entities.find({bg: "foo"}).count()
0
> db.entities.find({}).count()
35798383
I find this kind of strange, since my understanding is that missing fields in non-sparse indexes are simply stored as null, so the count query with null should be similar to counting an actual value (or perhaps up to roughly three times slower, since there are about three times as many matching index entries to count). Indeed, this answer reports vast speed improvements over similar queries involving null values and .count(). The only point of differentiation I can think of is WiredTiger.
Can anyone explain why my query to count null values is so slow, or what I can do to fix it (apart from the obvious subtraction of the true count from the total, which would work fine but wouldn't satisfy my curiosity)?

This is expected behavior; see https://jira.mongodb.org/browse/SERVER-18653. It seems like a strange call to me too, but there you go; I'm sure the programmers responsible know more about MongoDB than I do.
You will need to use a different value to mean null. I guess this will depend on what you use the field for. In my case it is a foreign reference, so I'm just going to start using false to mean null. If you are using it to store a boolean value then you may need to use "null", -1, 0, etc.
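If all you need is the count of documents missing bg, the subtraction already mentioned in the question sidesteps the slow null scan entirely. A minimal mongo shell sketch, using the collection and field names from the question:
> var total = db.entities.find({}).count()           // fast, served without scanning the bg index
> var withBg = db.entities.find({bg: true}).count()  // fast, served from the bg_1 index
> total - withBg                                     // number of documents missing bg
27497706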

Related

ElasticSearch too_many_nested_clauses Query contains too many nested clauses; maxClauseCount is set to 1024

we are trying to run a very simple Lucene (ver 9.3.0) query using ElasticSearch (ver 8.4.1) in Elastic cloud. Our index mapping has around 800 fields.
GET index-name/_search
{
    "query": {
        "query_string": {
            "query": "Abby OR Alta"
        }
    }
}
However we are getting back an exception:
{
    "error" : {
        "root_cause" : [
            {
                "type" : "too_many_nested_clauses",
                "reason" : "Query contains too many nested clauses; maxClauseCount is set to 1024"
            }
        ],
        "type" : "search_phase_execution_exception",
        "reason" : "all shards failed",
        "phase" : "query",
        "grouped" : true,
    },
    "status" : 500
}
Now, from what I've read in this article (link), there was a breaking change in Lucene 9, which Elastic 8.4 uses.
The behavior changed dramatically in how this max clause value is counted. It went from
num_terms = max_num_clauses
to
num_terms = max_num_clauses * num_fields_in_index
So in our case it would be 800 * 2 = 1600 > 1024.
Now, what I don't understand is why such a limitation was introduced, and to what value we should actually change this setting.
An A OR B query against an index with 800 fields doesn't strike me as unusual or problematic from a performance perspective.
The "easy" way out is to increase the indices.query.bool.max_clause_count limit in the configuration file to a higher value. It used to be 1024 in ES 7 and now in ES 8 it has been raised to 4096. Just be aware, though, that doing so might harm the performance of your cluster and even bring nodes down depending on your data volume.
Here is some interesting background information on how that the "ideal" value is calculated based on the hardware configuration as of ES 8.
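For reference, raising that limit amounts to a single line in elasticsearch.yml on each node (4096 here simply mirrors the ES 8 default mentioned above; pick whatever value suits your cluster):
indices.query.bool.max_clause_count: 4096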
A better way forward is to "know your data" and identify the fields to be used in your query_string query and either specify those fields in the query_string.fields array or modify your index settings to specify them as default fields to be searched on when no fields are specified in your query_string query:
PUT index/_settings
{
    "index.query.default_field": [
        "description",
        "title",
        ...
    ]
}
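The other option, restricting the fields directly in the query itself, would look roughly like this (the field names are placeholders for whatever your documents actually use):
GET index-name/_search
{
    "query": {
        "query_string": {
            "query": "Abby OR Alta",
            "fields": [ "description", "title" ]
        }
    }
}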

Elasticsearch: count rows in a table

I have a big table (15000 x 2000 entries). In this table, I need to count rows with certain properties, like "all rows that have a 1 or 2 in column 5 and a 0 in column 6". I will call this type of operation a count operation. For my use case, the count operations need to be very fast, as I am executing several hundred of them.
I tried to do this with Elasticsearch, but the performance seems to be very bad (around 10 seconds for 180 count operations). I was wondering if I am building my queries the wrong way, or if maybe Elasticsearch is the wrong technology for this?
My queries are all of the same form. I create them in Java, so it's hard to post exactly what they look like, but I'll do my best to explain.
I build each single count operation as a BoolQuery. For the example above it would be a query that looks similar to this (don't blame me if it's wrong, as I cannot copy the exact query since it is built in Java):
"query": {
"bool" : {
"must" : [
"should" : [
{ "column 5" : "1" },
{ "column 5" : "2" }
],
"should" : [
{ "column 6" : "0" }
],
"minimum_should_match" : 1
],
"boost" : 1.0
}
}
The many bool queries of this form are then grouped into a MultiSearchRequest. I use the option "fetchSource = false" to prevent Elasticsearch from loading the entities themselves.
Please tell me if you need any further information, or if it is unclear what I am trying to do!
I just fixed the problem myself. For anyone with a similar question, here is how:
I changed the SearchSourceBuilder so that it now uses a ValueCountAggregator. This counts the values and allows me to set SearchSourceBuilder.size() to 0. In this way I get rid of the hits themselves and retrieve only the aggregation values.
Requests that took 4 seconds before are now executed in less than 100ms.
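For anyone composing the request body by hand instead of through the Java SearchSourceBuilder, the equivalent of the fix above would look roughly like this: "size": 0 drops the hits, and the value_count aggregation returns only the count (field names are placeholders matching the example columns):
GET index-name/_search
{
    "size": 0,
    "query": {
        "bool": {
            "must": [
                { "term": { "column 6": "0" } }
            ]
        }
    },
    "aggs": {
        "row_count": {
            "value_count": { "field": "column 6" }
        }
    }
}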

Elasticsearch performs slowly when data size increased

We have a cluster with following details:
1. OS: Windows 7 (64 bit)
2. Number of nodes: 2 (i7 processor, 8Gb RAM)
3. ES version: 2.4.4
We have created an index with following details:
1. Index size: 86 GB
2. Number of shards: 12
3. Number of replicas: none
4. Number of documents: 140 million
5. Number of fields: 15
6. For most of the fields we have set "index": "not_analyzed"
7. For a few of the fields we have set "index": "no"
8. We are not executing any full-text search, aggregation or sorting
9. For 2 fields we are using fuzziness of edit distance 1
12 shards are evenly distributed on 2 nodes (6 shards each). We are running multi-search query on this cluster where each multi-search request consist of 6 individual queries.
Our queries are taking too much time to execute. From the "took" field we can see that each individual query takes in the range of 3-8 seconds; only rarely do they execute in milliseconds.
The average record count returned in a result set is around 800 (max 10k records, min 10 records).
When we ran the same test on a relatively small set of data (10 million records, 7 GB in size), each individual query took in the range of 50-200 milliseconds.
Could someone suggest what might be causing our queries to run slow when index size increases?
Update after xeraa's response:
Maybe you are also using spinning disk?
Yes
800 documents (or more) sounds like a lot. Do you always need that many?
Not all, but a few of the individual queries return a lot of docs, and we do need all of them.
Did you set the heap size to 4GB (half the memory available)?
Yes
Why 12 shards? If you only have 2 nodes this sounds a bit too much (but will probably not make a huge difference).
So that we can add more nodes later (without needing to reindex) as the data grows.
Maybe you can show a query? It sounds costly with your 6 individual queries
Following are the 2 sample queries that are used. A total of 6 similar queries are wrapped in multi-search.
POST /_msearch
{"index" : "school"}
{
    "query": {
        "bool" : {
            "must" : [
                {
                    "bool" : {
                        "should" : {
                            "range" : {
                                "marks" : {
                                    "from" : "100000000",
                                    "to" : "200000000",
                                    "include_lower" : true,
                                    "include_upper" : true
                                }
                            }
                        }
                    }
                },
                {
                    "nested" : {
                        "query" : {
                            "match" : {
                                "query" : "25 ",
                                "fields" : [ "subject.chapter" ]
                            }
                        },
                        "path" : "subject"
                    }
                }
            ]
        }
    }
}
{"index" : "school"}
{
    "query": {
        "bool" : {
            "must" : {
                "nested" : {
                    "query" : {
                        "match" : {
                            "query" : "A100123",
                            "fields" : [ "student.id" ],
                            "fuzziness" : "1"
                        }
                    },
                    "path" : "student"
                }
            }
        }
    }
}
If 140 million documents mean 86GB of data, then I guess 10 million documents translate to less than 8GB. So the smaller dataset can be served from memory (at least mostly, with your two 8GB nodes), while the larger dataset needs to be served from disk. Maybe you are also using spinning disks? In any case, the laws of physics will make your full dataset slower than the smaller one.
Various things you could look into:
800 documents (or more) sounds like a lot. Do you always need that many?
Did you set the heap size to 4GB (half the memory available)?
Why 12 shards? If you only have 2 nodes this sounds a bit too much (but will probably not make a huge difference).
Maybe you can show a query? It sounds costly with your 6 individual queries

Performance tuning MongoDB query/update?

So I have a MongoDB instance where I am trying to update data in one collection with data from another collection. The two collections are participants with about 180k documents and questions with about 95k documents.
Documents in participants typically look something like this:
{
    "_id" : ObjectId("52f90b8bbab16dd8594b82b4"),
    "answers" : [
        {
            "_id" : ObjectId("52f90b8bbab16dd8594b82b9"),
            "question_id" : 2081,
            "sub_id" : null,
            "values" : [
                "Yes"
            ]
        },
        {
            "_id" : ObjectId("52f90b8bbab16dd8594b82b8"),
            "question_id" : 2082,
            "sub_id" : 123,
            "values" : [
                "Would prefer to go alone"
            ]
        },
        {
            "_id" : ObjectId("52f90b8bbab16dd8594b82b7"),
            "question_id" : 2082,
            "sub_id" : 456,
            "values" : [
                "Yes"
            ]
        }
    ],
    "created" : ISODate("2012-03-01T17:40:21Z"),
    "email" : "anonymous",
    "id" : 65,
    "survey" : ObjectId("52f41d579af1ff4221399a7b"),
    "survey_id" : 374
}
I am using the query below to perform the update:
db.participants.ensureIndex({"answers.question_id": 1, "answers.sub_id": 1});
print("created index for answer arrays!")
db.questions.find().forEach(function(doc){
    db.participants.update(
        {
            "answers.question_id": doc.id,
            "answers.sub_id": doc.sub_id
        },
        {
            $set:
            {
                "answers.$.question": doc._id
            }
        },
        false,  // upsert
        true    // multi
    );
});
db.participants.dropIndex({"answers.question_id": 1, "answers.sub_id": 1});
But this takes about 20 minutes to run. I was hoping that adding the index would help with the performance, but it is still pretty slow. Is this index setup correctly considering that I am indexing fields in an array of objects? Can anyone see anything that I am doing that would cause the slowness? Suggestions on where to start looking to improve the performance of this query?
I think you need to consider what you are actually doing here in order to understand why the index is not helping and indeed why this operation takes so long.
The first part of the answer is explained by what you are doing here:
db.questions.find()
Now that part alone basically says that you are asking to retrieve every document in your questions collection. And that is exactly what you want to do, since you want to update that content into your participants collection, specifically the document _id for the "question". But, by definition of getting all documents, no index will be used here.
So what you are doing is looping over every document in questions and then asking your update operation to match the participants records with data from the "question". What that means is you are pulling "over the wire" all of your 95K documents and sending back "over the wire" your update operation, 95K times. This is not happening on the server; there is network traffic between your application and your MongoDB.
The index itself is not going to do much other than improve the search of each participants record, which is better than scanning, and you should be getting the match. But that's not the part that's taking the time; it's the fetching of the questions that will be the largest issue. Also note that if you were updating
So if it's possible to run your update process on a machine that is as close as possible, in networking terms, to the MongoDB server, then that is going to be your best performance improvement. You could also wind back your write concern if you want to be a little daring and/or can live with checking the integrity in another operation; that will reduce your network traffic and the waiting for a response to the update (which is actually still happening) if you put it in "fire and forget" mode.
Also see the guide if you are not sure of the concepts:
http://docs.mongodb.org/manual/core/write-concern/
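As a rough sketch of the batching and write-concern advice above (assuming a MongoDB 2.6+ shell where the unordered bulk API is available; the original setup may be on an older version):
var bulk = db.participants.initializeUnorderedBulkOp();
db.questions.find().forEach(function(doc){
    // Queue the update locally instead of sending one round trip per question.
    bulk.find({
        "answers.question_id": doc.id,
        "answers.sub_id": doc.sub_id
    }).update({
        $set: { "answers.$.question": doc._id }
    });
});
// { w: 0 } is the "fire and forget" mode mentioned above: no per-write acknowledgement.
bulk.execute({ w: 0 });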
In case anyone is interested, I was able to take the run time of this update query from 20 minutes down to about a minute and a half by using projection when selecting the questions documents. Since I am only using the _id, id and sub_id fields, I was able to do the following:
db.questions.find({},{_id: 1, id: 1, sub_id: 1}).forEach(function(doc){
....
Which drastically improved performance. Hope this helps someone!

How do I troubleshoot and improve this slow running query?

I'm looking to fine-tune a string search query that I am using on Mongo. In the SQL Server world, I'd like to believe I have a decent understanding of how indexes work and how to build proper indexes. I tried giving it a shot with Mongo, but I don't believe I'm going about it the right way.
My collection has roughly 4.3 million documents. The document structure looks like this:
{
    "_id":{
        "$oid":"527027456239d1212c07a621"
    },
    "ReleaseId":2451,
    "Status":"Accepted",
    "Title":"Hard Rhythmic Motions",
    "Country":"US",
    "MasterId":"35976",
    "Images":[
        {
            "Type":"primary",
            "URI":"http://api.discogs.com/image/R-2451-1117047026.jpg",
            "URI150":"http://api.discogs.com/image/R-150-2451-1117047026.jpg",
            "Height":307,
            "Width":307
        },
        {
            "Type":"secondary",
            "URI":"http://api.discogs.com/image/R-2451-1117047033.jpg",
            "URI150":"http://api.discogs.com/image/R-150-2451-1117047033.jpg",
            "Height":307,
            "Width":307
        }
    ],
    "Artists":[
        {
            "_id":2894,
            "Name":"DJ Hyperactive"
        }
    ],
    "Formats":[
        {
            "Name":null,
            "Quantity":1
        }
    ],
    "Genres":[
        "Electronic"
    ],
    "Styles":[
        "Hardcore",
        "Acid"
    ]
}
I am executing a case insensitive search on one of the top-level document properties and on one of the nested document properties:
db.releases.find({$or: [{Title: new RegExp('.*mozart.*',"i")},{'Artists.Name': new RegExp('.*mozart.*',"i")}]})
I tried creating an index; when I execute .getIndexes() I can see the index I created:
{
    "v" : 1,
    "key" : {
        "Title" : 1,
        "Artists.Name" : 1
    },
    "ns" : "discogs.releases",
    "name" : "Title_1_Artists.Name_1"
}
At this point I thought that I would be all set. However, the query ends up taking between 28 and 32 seconds to execute. I tried calling .explain() to get a little more insight:
{
    "cursor" : "BasicCursor",
    "isMultiKey" : false,
    "n" : 4098,
    "nscannedObjects" : 4292400,
    "nscanned" : 4292400,
    "nscannedObjectsAllPlans" : 4292400,
    "nscannedAllPlans" : 4292400,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 29,
    "nChunkSkips" : 0,
    "millis" : 29958,
    "indexBounds" : {},
    "server" : "lambic:27017"
}
From my limited knowledge of Mongo, this looks like a table scan, which is why the query isn't performing very well. However, I don't know how to make this query better! I would expect the index that I created to cover this query, but that must not be the case.
Now, the last thing I want to point out is that this is certainly not on the most robust server. The hardware specs (including CPU and RAM) are very limited. However, if my analysis is correct and I'm doing a table scan, there must be some performance improvements I can make on the Mongo side.
A fulltext index is probably what you need. You could also parse the document before inserting it, put the keywords in an array inside the document, and index that array.
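A minimal sketch of the fulltext route in the mongo shell, assuming MongoDB 2.6+ where text indexes and the $text operator are available (field names taken from the document above):
// One text index can cover both the title and the nested artist names.
db.releases.ensureIndex({ Title: "text", "Artists.Name": "text" });

// Text search is case-insensitive by default, so it replaces the two regex clauses.
db.releases.find({ $text: { $search: "mozart" } });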
Thanks everyone for the responses. I wanted to follow up on this question since it had a few votes, and to make sure anyone who stumbles upon this page in the future knows what I ended up doing.
The fulltext index sounds like a great solution. However, because this is only a small side project of mine, I'm not willing to throw more hardware at the architecture (the fulltext index requires a good amount of disk space for 4 million records).
What I ended up doing is flattening my data structures to make them easier to query and removing the wildcard search so that my indexes on the new structure can actually be used. By doing this I can achieve an indexOnly query (and although the performance still isn't amazing, I find it adequate given my weak hardware stack).
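The flattened structure itself isn't shown, but the idea presumably looks something like this sketch (TitleLower is a hypothetical pre-lowercased copy of the title stored at insert time):
// Index the pre-lowercased field.
db.releases.ensureIndex({ TitleLower: 1 });

// An anchored prefix regex on an already-lowercased field can use the index, and
// projecting only indexed fields while excluding _id allows a covered (indexOnly) query.
db.releases.find(
    { TitleLower: /^mozart/ },
    { TitleLower: 1, _id: 0 }
);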
