ElasticSearch too_many_nested_clauses Query contains too many nested clauses; maxClauseCount is set to 1024 - elasticsearch

We are trying to run a very simple Lucene (ver 9.3.0) query using Elasticsearch (ver 8.4.1) in Elastic Cloud. Our index mapping has around 800 fields.
GET index-name/_search
{
"query": {
"query_string": {
"query": "Abby OR Alta"
}
}
}
However we are getting back an exception:
{
"error" : {
"root_cause" : [
{
"type" : "too_many_nested_clauses",
"reason" : "Query contains too many nested clauses; maxClauseCount is set to 1024"
}
],
"type" : "search_phase_execution_exception",
"reason" : "all shards failed",
"phase" : "query",
"grouped" : true,
},
"status" : 500
}
Now, from what I've read in this article (link), there was a breaking change in Lucene 9, which Elastic 8.4 uses.
The way this maximum clause count is computed changed dramatically, from previously being
num_terms = max_num_clauses
to
num_terms = max_num_clauses * num_fields_in_index
So in our case it would be 800 fields * 2 terms = 1600 > 1024.
Now, what I don't understand is why such a limitation was introduced, and to what value we should actually change this setting?
An "A OR B" query against an index with 800 fields doesn't strike me as unusual or problematic from a performance perspective.

The "easy" way out is to increase the indices.query.bool.max_clause_count limit in the configuration file to a higher value. It used to be 1024 in ES 7 and now in ES 8 it has been raised to 4096. Just be aware, though, that doing so might harm the performance of your cluster and even bring nodes down depending on your data volume.
Here is some interesting background information on how the "ideal" value is calculated based on the hardware configuration as of ES 8.
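If you do go that route, note that this is a static node-level setting: it goes into elasticsearch.yml on every node and only takes effect after a restart. A minimal sketch (the value 8192 is purely illustrative, not a recommendation):
indices.query.bool.max_clause_count: 8192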
A better way forward is to "know your data": identify the fields to be used in your query_string query, and either specify those fields in the query_string.fields array or modify your index settings to declare them as the default fields to be searched when no fields are specified in the query:
PUT index/_settings
{
"index.query.default_field": [
"description",
"title",
...
]
}
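Alternatively, the fields can be passed directly in the request. A sketch of the original query restricted to a couple of fields (the field names are placeholders for whatever your mapping actually contains):
GET index-name/_search
{
"query": {
"query_string": {
"query": "Abby OR Alta",
"fields": [ "description", "title" ]
}
}
}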

Related

When trying to index in elasticsearch 7.8.1, an error occurs saying "field" is too large, must be <= 32766. Is there a solution?

When trying to index in elasticsearch 7.8.1, an error occurs saying "testField" is too large, must be <= 32766. Is there a solution?
Field Info
"testField":{
"type": "keyword",
"index": false
}
It is a known issue, and it is not yet clear how best to solve it. Lucene enforces a maximum term length of 32766 bytes, beyond which the document is rejected.
Until this gets solved, there are two immediate options you can choose from:
A. Use a script ingest processor to truncate the value to at most 32766 bytes.
PUT _ingest/pipeline/truncate-pipeline
{
"description": "truncate testField to Lucene's maximum term length",
"processors": [
{
"script": {
"source": """
// Only truncate when the value is present and longer than the limit,
// otherwise substring() would throw for shorter values.
// Note: 32766 is a byte limit, so multi-byte text may need a smaller cut-off.
if (ctx.testField != null && ctx.testField.length() > 32766) {
  ctx.testField = ctx.testField.substring(0, 32766);
}
"""
}
}
]
}
PUT my-index/_doc/123?pipeline=truncate-pipeline
{ "testField": "hgvuvhv....sjdhbcsdc" }
B. Use a text field with an appropriate analyzer that would truncate the value, but you'd lose the ability to aggregate and sort on that field.
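For completeness, a rough sketch of what option B could look like, using the built-in truncate token filter on top of a keyword tokenizer (the analyzer and filter names below are made up for illustration; also note the filter counts characters while Lucene's limit is in bytes, so multi-byte text may need a smaller length):
PUT my-index
{
"settings": {
"analysis": {
"filter": {
"term_length_cap": { "type": "truncate", "length": 32766 }
},
"analyzer": {
"truncating_keyword": {
"type": "custom",
"tokenizer": "keyword",
"filter": [ "term_length_cap" ]
}
}
}
},
"mappings": {
"properties": {
"testField": { "type": "text", "analyzer": "truncating_keyword" }
}
}
}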
If you want to keep your field as a keyword, I'd go with option A.

Elasticsearch fuzziness with multi_match and bool_prefix type

I have a set of search_as_you_type fields I need to search against. Here is my mapping:
"mappings" : {
"properties" : {
"description" : {
"type" : "search_as_you_type",
"doc_values" : false,
"max_shingle_size" : 3
},
"questions" : {
"properties" : {
"content" : {
"type" : "search_as_you_type",
"doc_values" : false,
"max_shingle_size" : 3
},
"tags" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
}
}
},
"title" : {
"type" : "search_as_you_type",
"doc_values" : false,
"max_shingle_size" : 3
}
}
}
I am using a multi_match query with bool_prefix type.
"query": {
"multi_match": {
"query": "triangle",
"type": "bool_prefix",
"fields": [
"title",
"title._2gram",
"title._3gram",
"description",
"description._2gram",
"description._3gram",
"questions.content",
"questions.content._2gram",
"questions.content._3gram",
"questions.tags",
"questions.tags._2gram",
"questions.tags._3gram"
]
}
}
So far this works fine. Now I want to add typo tolerance, i.e. fuzziness in ES. However, it looks like bool_prefix conflicts with it: if I modify my query to add "fuzziness": "AUTO" and make an error in a word ("triangle" -> "triangld"), I don't get any results.
However, if I am searching for a phrase like "right triangle", I see different behavior:
even if no typo is made, I get more results with "fuzziness": "AUTO" than without it (1759 vs 1267)
if I add a typo to the 2nd word ("right triangdd"), it seems to work, but it now pushes results containing "right" without "triangle" ("The Bill of Rights", "Due process and right to privacy", etc.) to the front
if I make a typo in the 1st word ("righd triangle") or in both ("righd triangdd"), the results seem to be just fine, so this is probably the only correct behavior
I've seen a couple of articles and even GitHub issues saying that fuzziness does not work properly with a multi_match query of type bool_prefix, but I can't find a workaround. I've tried changing the query type, but it looks like bool_prefix is the only one that supports search-as-you-type, and I need to return results as the user starts typing.
Since all requests to ES are made from our backend, I can also manipulate the query string to build different query types if needed, for example one type for single-word searches and another for multi-word ones. But I basically need to maintain the current behavior.
I've also tried appending "~" or "~1"/"~2" to the string, which seems to be another way of specifying fuzziness, but the results are rather unclear and performance (search speed) seems to be worse.
My questions are:
How can I achieve fuzziness for single-word searches, so that the query "triangld" returns documents containing "triangle", etc.?
How can I get correct results when the typo is in the 2nd (last?) word of the query? As mentioned above it works, but see point 2 above.
Why does just adding fuzziness (see point 1) return more results even when the phrase is spelled correctly?
Is there anything I need to change in my analyzers, etc.?
To achieve the desired behavior, we did the following (a sketch of the resulting query follows this list):
changed the query type to query_string
added query-string preprocessing on the backend: we split the query string on whitespace and append "~1" or "~2" to each word whose length exceeds 4 or 8 characters respectively ("~" is the fuzziness syntax in ES). However, we don't add this to the word currently being typed until the user types a whitespace. For example, while the user types [t, tr, tri, ..., triangle] => no fuzziness, but once they type "triangle " => "triangle~2". Otherwise there are unexpected results while the last word has fuzziness applied.
also removed all n-gram fields from the searched fields, since we get the same results and performance is a bit better
added "default_operator": "AND" to the query to constrain the results to one field for phrase queries

Elasticsearch Field expansion matches too many fields

I am getting the following error when running my Elasticsearch queries.
Field expansion matches too many fields, got: 1775. This will be limited starting with version 7.0 of Elasticsearch. The limit will be determined by the indices.query.bool.max_clause_count setting which is currently set to 1024. You should look at lowering the maximum number of fields targeted by a query or increase the above limit while being aware that this can negatively affect your clusters performance.
Here is an example of one of my queries that throws this error is:
def searchQuery = [
"query" : [
"bool" : [
"must" : [
[ "match" : ["bodyContent_o.item.component.objectId" : objectId] ],
]
]
]
]
My understanding is that by using "match" I'm targeting a specific field and it shouldn't trigger this error, so what am I doing wrong?
Any direction or clarification is much appreciated.

Elasticsearch performs slowly when data size increased

We have a cluster with the following details:
1. OS: Windows 7 (64 bit)
2. Number of nodes: 2 (i7 processor, 8 GB RAM each)
3. ES version: 2.4.4
We have created an index with the following details:
1. Index size: 86 GB
2. Number of shards: 12
3. Number of replicas: none
4. Number of documents: 140 million
5. Number of fields: 15
6. For most of the fields we have set "index": "not_analyzed"
7. For a few of the fields we have set "index": "no"
8. We are not executing any full-text search, aggregation, or sorting
9. For 2 fields we are using fuzziness with edit distance 1
The 12 shards are evenly distributed across the 2 nodes (6 shards each). We are running multi-search queries on this cluster, where each multi-search request consists of 6 individual queries.
Our queries are taking too much time to execute. From the "took" field we can see that each individual query takes 3-8 seconds; rarely do they execute in milliseconds.
The average record count returned in a result set is around 800 (max 10k records, min 10 records).
When we ran the same test on a relatively small set of data (10 million records, about 7 GB in size), each individual query took 50-200 milliseconds.
Could someone suggest what might be causing our queries to run slowly as the index size increases?
Update after xeraa's response:
Maybe you are also using spinning disk?
Yes
800 documents (or more) sounds like a lot. Do you always need that many?
Not all, but a few of the individual queries return a lot of docs, and we do need all of them.
Did you set the heap size to 4GB (half the memory available)?
Yes
Why 12 shards? If you only have 2 nodes this sounds a bit too much (but will probably not make a huge difference).
So that we can add more nodes later (without needing to reindex) as the data grows.
Maybe you can show a query? It sounds costly with your 6 individual queries
Following are the 2 sample queries that are used; a total of 6 similar queries are wrapped in the multi-search request.
POST /_msearch
{"index" : "school"}
{
  "query": {
    "bool": {
      "must": [
        {
          "bool": {
            "should": {
              "range": {
                "marks": {
                  "from": "100000000",
                  "to": "200000000",
                  "include_lower": true,
                  "include_upper": true
                }
              }
            }
          }
        },
        {
          "nested": {
            "query": {
              "match": {
                "query": "25 ",
                "fields": [ "subject.chapter" ]
              }
            },
            "path": "subject"
          }
        }
      ]
    }
  }
}
{"index" : "school"}
{
  "query": {
    "bool": {
      "must": {
        "nested": {
          "query": {
            "match": {
              "query": "A100123",
              "fields": [ "student.id" ],
              "fuzziness": "1"
            }
          },
          "path": "student"
        }
      }
    }
  }
}
140 million documents mean 86 GB of data, so I guess 10 million documents translate to less than 8 GB of data. That means the smaller dataset can be served (at least mostly) from memory on your two 8 GB nodes, while the larger dataset needs to be served from disk. Maybe you are also using spinning disk? In any case the laws of physics will make your full dataset slower than the smaller one.
Various things you could look into:
800 documents (or more) sounds like a lot. Do you always need that many?
Did you set the heap size to 4GB (half the memory available)?
Why 12 shards? If you only have 2 nodes this sounds a bit too much (but will probably not make a huge difference).
Maybe you can show a query? It sounds costly with your 6 individual queries
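On the heap-size point: on ES 2.x the heap is typically set through the ES_HEAP_SIZE environment variable before the node is started (the exact steps depend on whether it runs as a Windows service). A minimal sketch for your 8 GB machines, following the half-the-RAM rule:
set ES_HEAP_SIZE=4g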

ElasticSearch + Kibana - Unique count using pre-computed hashes

Update: added the Elasticsearch query and stack trace below.
I want to perform a unique count on my ElasticSearch cluster.
The cluster contains about 50 million records.
I've tried the following methods:
First method
Mentioned in this section:
Pre-computing hashes is usually only useful on very large and/or high-cardinality fields as it saves CPU and memory.
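For reference, the unique count that Kibana issues boils down to a cardinality aggregation on the hashed sub-field, roughly like this (a sketch; the index and aggregation names are placeholders):
POST my_index/_search
{
"size": 0,
"aggs": {
"unique_my_prop": {
"cardinality": {
"field": "my_prop.hash"
}
}
}
}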
Second method
Mentioned in this section:
Unless you configure Elasticsearch to use doc_values as the field data format, the use of aggregations and facets is very demanding on heap space.
My property mapping
"my_prop": {
"index": "not_analyzed",
"fielddata": {
"format": "doc_values"
},
"doc_values": true,
"type": "string",
"fields": {
"hash": {
"type": "murmur3"
}
}
}
The problem
When I use unique count on my_prop.hash in Kibana I receive the following error:
Data too large, data for [my_prop.hash] would be larger than limit
ElasticSearch has a 2g heap size.
The above also fails for a single index with 4 million records.
My questions
Am I missing something in my configuration?
Should I scale up my machine? This does not seem like a scalable solution.
ElasticSearch query
Was generated by Kibana:
http://pastebin.com/hf1yNLhE
ElasticSearch Stack trace
http://pastebin.com/BFTYUsVg
That error says you don't have enough memory (more specifically, memory for fielddata) to store all the values of hash, so you need to take them out of the heap and put them on disk, meaning using doc_values.
Since you are already using doc_values for my_prop, I suggest doing the same for my_prop.hash (and, no, the settings from the main field are not inherited by the sub-fields): "hash": { "type": "murmur3", "index": "no", "doc_values": true }.
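Spelled out against the mapping from the question, that would look roughly like this (index and type names are placeholders; depending on your version you may need to reindex for the change to apply to existing documents):
PUT my_index/_mapping/my_type
{
"properties": {
"my_prop": {
"type": "string",
"index": "not_analyzed",
"doc_values": true,
"fielddata": { "format": "doc_values" },
"fields": {
"hash": {
"type": "murmur3",
"index": "no",
"doc_values": true
}
}
}
}
}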
