We have a cluster with the following details:
1. OS: Windows 7 (64-bit)
2. Number of nodes: 2 (i7 processor, 8 GB RAM each)
3. ES version: 2.4.4
We have created an index with the following details:
1. Index size: 86 GB
2. Number of shards: 12
3. Number of replicas: 0
4. Number of documents: 140 million
5. Number of fields: 15
6. For most of the fields we have set "index": "not_analyzed"
7. For a few of the fields we have set "index": "no"
8. We are not executing any full-text search, aggregation, or sorting
9. For 2 fields we are using fuzziness with an edit distance of 1
The 12 shards are evenly distributed across the 2 nodes (6 shards each). We are running multi-search queries on this cluster, where each multi-search request consists of 6 individual queries.
Our queries are taking too long to execute. From the "took" field we can see that each individual query takes between 3 and 8 seconds; only rarely do they finish in milliseconds.
The average record count returned per result set is around 800 (max 10k records, min 10 records).
When we ran the same test on a relatively small dataset (10 million records, 7 GB in size), each individual query took between 50 and 200 milliseconds.
Could someone suggest what might be causing our queries to slow down as the index size increases?
Update after xeraa's response:
Maybe you are also using spinning disk?
Yes
800 documents (or more) sounds like a lot. Do you always need that many?
Not all, but a few of the individual queries return a lot of docs, and we do need all of them.
Did you set the heap size to 4GB (half the memory available)?
Yes
Why 12 shards? If you only have 2 nodes this sounds a bit too much (but will probably not make a huge difference).
So that we can add more nodes later (without needing to reindex) as the data grows.
Maybe you can show a query? It sounds costly with your 6 individual queries
Following are 2 of the sample queries used. A total of 6 similar queries are wrapped in one multi-search request.
POST /_msearch
{"index" : "school"}
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": {
              "range": {
                "marks": {
                  "from": "100000000",
                  "to": "200000000",
                  "include_lower": true,
                  "include_upper": true
                }
              }
            }
          }
        },
        {
          "nested": {
            "path": "subject",
            "query": {
              "match": {
                "subject.chapter": {
                  "query": "25"
                }
              }
            }
          }
        }
      ]
    }
  }
}
{"index" : "school"}
{
  "query": {
    "bool": {
      "must": {
        "nested": {
          "path": "student",
          "query": {
            "match": {
              "student.id": {
                "query": "A100123",
                "fuzziness": "1"
              }
            }
          }
        }
      }
    }
  }
}
If 140 million documents mean 86GB of data, then I guess 10 million documents translate to less than 8GB of data. So the smaller dataset can be served from memory (at least mostly, with your two 8GB nodes), while the larger dataset needs to be served from disk. Maybe you are also using spinning disk? In any case, the laws of physics will make your full dataset slower than the smaller one.
Various things you could look into:
800 documents (or more) sounds like a lot. Do you always need that many?
Did you set the heap size to 4GB (half the memory available)?
Why 12 shards? If you only have 2 nodes this sounds a bit too much (but will probably not make a huge difference).
Maybe you can show a query? It sounds costly with your 6 individual queries
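A quick way to sanity-check the heap and shard layout on 2.4 is the _cat API; the exact column names below are from memory, so adjust them if your version complains:
GET _cat/nodes?v&h=host,heap.percent,heap.max,ram.percent
GET _cat/shards/school?v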
Related
We are trying to run a very simple Lucene (ver. 9.3.0) query using Elasticsearch (ver. 8.4.1) in Elastic Cloud. Our index mapping has around 800 fields.
GET index-name/_search
{
"query": {
"query_string": {
"query": "Abby OR Alta"
}
}
}
However we are getting back an exception:
{
"error" : {
"root_cause" : [
{
"type" : "too_many_nested_clauses",
"reason" : "Query contains too many nested clauses; maxClauseCount is set to 1024"
}
],
"type" : "search_phase_execution_exception",
"reason" : "all shards failed",
"phase" : "query",
"grouped" : true,
},
"status" : 500
}
Now, from what I've read in this article (link), there was a breaking change in Lucene 9, which Elastic 8.4 uses.
The behavior of how this max clause value is counted changed dramatically. It went from
num_terms = max_num_clauses
to
num_terms = max_num_clauses * num_fields_in_index
So in our case it would be 800 * 2 = 1600 > 1024
Now, what I don't understand is why such a limitation was introduced, and to what value we should actually change this setting.
An "A OR B" query with 800 fields in the index doesn't strike me as something unusual or problematic from a performance perspective.
The "easy" way out is to increase the indices.query.bool.max_clause_count limit in the configuration file to a higher value. It used to be 1024 in ES 7 and now in ES 8 it has been raised to 4096. Just be aware, though, that doing so might harm the performance of your cluster and even bring nodes down depending on your data volume.
Here is some interesting background information on how the "ideal" value is calculated based on the hardware configuration as of ES 8.
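If you do take the "easy" way, indices.query.bool.max_clause_count is, as far as I know, a static node setting, so a minimal sketch would be adding it to elasticsearch.yml on every node and restarting (4096 comfortably covers the ~1600 clauses computed above):
# elasticsearch.yml - raising the clause limit trades safety for convenience
indices.query.bool.max_clause_count: 4096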
A better way forward is to "know your data": identify the fields actually needed by your query_string query, and either specify those fields in the query_string.fields array or modify your index settings to declare them as the default fields to be searched when the query_string query doesn't specify any:
PUT index/_settings
{
"index.query.default_field": [
"description",
"title",
...
]
}
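For the first option, here is a minimal sketch of pinning the query_string to explicit fields, reusing the description and title field names from the settings example above:
GET index-name/_search
{
  "query": {
    "query_string": {
      "query": "Abby OR Alta",
      "fields": [ "description", "title" ]
    }
  }
}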
I am performing a refactor of the code to query an ES index, and I was wondering if there is any difference between the two snippets below:
"bool" : {
"should" : [ {
"terms" : {
"myType" : [ 1 ]
}
}, {
"terms" : {
"myType" : [ 2 ]
}
}, {
"terms" : {
"myType" : [ 4 ]
}
} ]
}
and
"terms" : {
"myType" : [ 1, 2, 4 ]
}
Please check this blog post from the Elastic discuss page, which will answer your question. Copying it here for quick reference:
There are a few differences:
1. The simplest to see is the verbosity - terms queries just list an array, while term queries require more JSON.
2. terms queries do not score matches based on IDF (the rareness) of matched terms - the term query does.
3. term queries can only have up to 1024 values due to Boolean's max clause count.
4. terms queries can have more terms - by default, Elasticsearch limits the terms query to a maximum of 65,536 terms, and you can change this limit using the index.max_terms_count setting.
Which of them is going to be faster? Is speed also related to the number of terms?
It depends; they execute differently. term queries do more expensive scoring, but do so lazily. They may "skip" over docs during execution because other, more selective criteria may advance the stream of matching docs being considered.
The terms query doesn't do expensive scoring, but it is more eager: it creates the equivalent of a single bitset, with a one or zero for every doc, by ORing all the potential matching docs up front. Many terms can share the same bitset, which is what provides the scalability in term numbers.
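If you don't need relevance scoring at all, another variant worth knowing (a sketch, not part of the quoted answer; my-index is a placeholder index name) is to run the terms query in filter context, which skips scoring entirely:
# my-index below is a placeholder index name
GET my-index/_search
{
  "query": {
    "bool": {
      "filter": {
        "terms": { "myType": [ 1, 2, 4 ] }
      }
    }
  }
}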
I have a Mongo server running on a VPS with 16GB of memory (although probably with slow IO, using magnetic disks).
I have a collection of around 35 million records which doesn't fit into main memory (db.stats() reports a size of 35GB and a storageSize of 14GB); however, the 1.7GB reported for totalIndexSize should comfortably fit there.
There is a particular field, bg, that I'm querying over, which can be present with the value true or absent entirely (please, no discussions about whether this is the best data representation - I still think Mongo is behaving weirdly). This field is indexed with a non-sparse index with a reported size of 146MB.
I'm using the WiredTiger storage engine with a default cache size (so it should be around 8GB).
I'm trying to count the number of records missing the bg field.
Counting true values is tolerably fast (a few seconds):
> db.entities.find({bg: true}).count()
8300677
However the query for missing values is extremely slow (around 5 minutes):
> db.entities.find({bg: null}).count()
27497706
To my eyes, explain() looks ok:
> db.entities.find({bg: null}).explain()
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "testdb.entities",
"indexFilterSet" : false,
"parsedQuery" : {
"bg" : {
"$eq" : null
}
},
"winningPlan" : {
"stage" : "FETCH",
"filter" : {
"bg" : {
"$eq" : null
}
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"bg" : 1
},
"indexName" : "bg_1",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"bg" : [
"[null, null]"
]
}
}
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"host" : "mongo01",
"port" : 27017,
"version" : "3.0.3",
"gitVersion" : "b40106b36eecd1b4407eb1ad1af6bc60593c6105"
},
"ok" : 1
}
However the query remains stubbornly slow, even after repeated calls. Other count queries for different values are fast:
> db.entities.find({bg: "foo"}).count()
0
> db.entities.find({}).count()
35798383
I find this kind of strange, since my understanding is that missing fields in non-sparse indexes are simply stored as null, so the count query with null should behave like counting an actual value (or maybe take up to three times as long, since it matches roughly three times as many index entries). Indeed, this answer reports vast speed improvements for similar queries involving null values and .count(). The only point of differentiation I can think of is WiredTiger.
Can anyone explain why my query to count null values is so slow, or what I can do to fix it (apart from the obvious subtraction of the true count from the total, which would work fine but wouldn't satisfy my curiosity)?
This is expected behavior; see https://jira.mongodb.org/browse/SERVER-18653. It seems like a strange call to me too, but there you go; I'm sure the programmers responsible for it know more about MongoDB than I do.
You will need to use a different value to mean null. I guess this will depend on what you use the field for. In my case it is a foreign reference, so I'm just going to start using false to mean null. If you are using it to store a boolean value then you may need to use "null", -1, 0, etc.
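For what it's worth, here is a quick sketch of both workarounds in the mongo shell, using the collection from this question (the count shown assumes the totals above haven't changed):
> // derive the "null" count from the totals instead of scanning the index for nulls
> db.entities.find({}).count() - db.entities.find({bg: true}).count()
27497706
> // or backfill a sentinel value so the field is always explicitly present
> db.entities.update({bg: null}, {$set: {bg: false}}, {multi: true})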
The term filter that is used:
curl -XGET 'http://localhost:9200/my-index/my-doc-type/_search' -d '{
  "filter": {
    "term": {
      "void": false
    }
  },
  "fields": [
    "user_id1",
    "user_name",
    "date",
    "status",
    "q1",
    "q1_unique_code",
    "q2",
    "q3"
  ],
  "size": 50000,
  "sort": [
    "date_value"
  ]
}'
The void field is a boolean field.
The index store size is 504MB.
The Elasticsearch setup consists of only a single node, and the index consists of only a single shard and 0 replicas. The version of Elasticsearch is 0.90.7.
The fields listed above are only the first 8; the actual query we execute requests 350 fields.
We noticed the memory spiking by about 2-3GB even though the store size is only 504MB.
Running the query multiple times seems to continuously increase the memory usage.
Could someone explain why this memory spike occurs?
A few observations:
1. It's quite an old version of Elasticsearch.
2. You're returning 50,000 records in one get.
3. You're sorting those 50k records.
4. Your documents are pretty big - 350 fields each.
Could you instead return a smaller number of records and then page through them? Scan and Scroll could help you; see the sketch below.
It's not clear whether you've indexed individual fields - this could help, as the _source being read from disk may be incurring a memory overhead.
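A rough sketch of scan and scroll on a release that old, from memory, so double-check the parameters against the 0.90 docs before relying on it:
# open a scan: hits come back unsorted, in batches of "size" per shard
curl -XGET 'http://localhost:9200/my-index/my-doc-type/_search?search_type=scan&scroll=1m' -d '{
  "query": {
    "constant_score": {
      "filter": { "term": { "void": false } }
    }
  },
  "fields": ["user_id1", "user_name", "date", "status"],
  "size": 500
}'
# keep requesting with the scroll_id returned by each response until no hits come back
curl -XGET 'http://localhost:9200/_search/scroll?scroll=1m&scroll_id=<scroll_id_from_previous_response>'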
We have a two-node cluster (VMs in a private cloud, 64GB of RAM and an 8-core CPU per node, CentOS), a few small indices (~1 mil documents) and one big index with ~220 mil docs (2 shards, 170GB of space). 24GB of memory is allocated to Elasticsearch on each box.
Document structure:
{
'article_id': {
'index': 'not_analyzed',
'store': 'yes',
'type': 'long'
},
'feed_id': {
'index': 'not_analyzed',
'store': 'yes',
'type': 'string'
},
'title': {
'index': 'analyzed',
'type': 'string'
},
'content': {
'index': 'analyzed',
'type': 'string'
},
'lang': {
'index': 'not_analyzed',
'type': 'string'
}
}
It takes about 1-2 seconds to run the following query:
{
"query" : {
"multi_match" : {
"query" : "some search term",
"fields" : [ "title", "content" ],
"type": "phrase_prefix"
}
},
"size": 20,
"fields" :["article_id", "feed_id"]
}
Are we hitting hardware limits at this point or are there ways to optimize the query or data structure to increase performance?
Thanks in advance!
It's possible you are hitting the limits of your hardware, but there are a few things you can do to your query first to help optimize it.
Max Expansions
The first thing I would do is limit max_expansions. The way the prefix-queries work is by generating a list of prefixes that match the last token in your query. In your search query "some search term", the last token "term" would be expanded using "term" as the prefix seed. You may generate a list like this:
term
terms
terminate
terminator
termite
The prefix expansion process runs through your posting list looking for any word which matches the seed prefix. By default, this list is unbounded, which means you can generate a very large list of expansions.
The second phase rewrites your original query into a series of term queries using the expansions. The bigger the expansion list, the more terms are evaluated against your index and a corresponding decrease in speed.
If you limit the expansion process to something reasonable, you can maintain speed and still usually get good prefix matching:
{
"query" : {
"multi_match" : {
"query" : "some search term",
"fields" : [ "title", "content" ],
"type": "phrase_prefix",
"max_expansions" : 100
}
},
"size": 20,
"fields" :["article_id", "feed_id"],
}
You'll have to play with how many expansions you want. It is a tradeoff between speed and recall.
Filtering
In general, the other thing you can add is filtering. If there is some kind of criteria you can filter on, you can potentially improve speed drastically. Currently, your query executes against the entire index (~220m documents), which is a lot to evaluate. If you can add a filter that cuts that number down, you will see much improved latency.
At the end of the day, the fewer documents a query has to evaluate, the faster it will run. Filters decrease the number of docs a query will see, they are cached, they operate very quickly, and so on.
Your situation may not have any applicable filters, but if it does, they can really help!
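For example, if many searches only need one language, a filter on the lang field from the mapping above could look roughly like this ("en" is just a placeholder value, and the filtered query shown is the old pre-5.x syntax, so use whatever filter syntax your version supports):
{
  "query": {
    "filtered": {
      "query": {
        "multi_match": {
          "query": "some search term",
          "fields": [ "title", "content" ],
          "type": "phrase_prefix",
          "max_expansions": 100
        }
      },
      "filter": {
        "term": { "lang": "en" }
      }
    }
  },
  "size": 20,
  "fields": [ "article_id", "feed_id" ]
}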
File System Caching
This advice is entirely dependent on the rest of the system. If you aren't fully utilizing your heap (24GB) because you are doing simple search and filtering (e.g. not faceting / geo / heavy sorts / scripts), you may be able to reallocate some of your heap to the file system cache.
For example, if your max heap usage peaks at 12GB, it may make sense to decrease the heap size to 15GB. The 9GB you freed will go back to the OS and help cache segments, which will boost search performance simply because more operations become diskless.
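A sketch of how you might check and then adjust this on an older release (ES_HEAP_SIZE was the usual way to size the heap before jvm.options existed; verify against the docs for your version):
# check current JVM heap usage per node
curl -XGET 'http://localhost:9200/_nodes/stats/jvm?pretty'
# if heap usage stays well below the limit, restart each node with a smaller heap,
# e.g. via the environment variable older versions read at startup
export ES_HEAP_SIZE=15g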