Elasticsearch Field expansion matches too many fields - elasticsearch

I am getting the following error when running my Elasticsearch queries:
Field expansion matches too many fields, got: 1775. This will be limited starting with version 7.0 of Elasticsearch. The limit will be detemined by the indices.query.bool.max_clause_count setting which is currently set to 1024. You should look at lowering the maximum number of fields targeted by a query or increase the above limit while being aware that this can negatively affect your clusters performance.
Here is an example of one of my queries that throws this error:
def searchQuery = [
    "query" : [
        "bool" : [
            "must" : [
                [ "match" : ["bodyContent_o.item.component.objectId" : objectId] ],
            ]
        ]
    ]
]
My understanding is that by using "match" I'm targeting a specific field and it shouldn't trigger this error, so what am I doing wrong?
Any direction or clarification is much appreciated.

Related

ElasticSearch too_many_nested_clauses Query contains too many nested clauses; maxClauseCount is set to 1024

We are trying to run a very simple Lucene (version 9.3.0) query using Elasticsearch (version 8.4.1) in Elastic Cloud. Our index mapping has around 800 fields.
GET index-name/_search
{
  "query": {
    "query_string": {
      "query": "Abby OR Alta"
    }
  }
}
However we are getting back an exception:
{
"error" : {
"root_cause" : [
{
"type" : "too_many_nested_clauses",
"reason" : "Query contains too many nested clauses; maxClauseCount is set to 1024"
}
],
"type" : "search_phase_execution_exception",
"reason" : "all shards failed",
"phase" : "query",
"grouped" : true,
},
"status" : 500
}
From what I've read in this article, there has been a breaking change since Lucene 9, which Elasticsearch 8.4 uses.
The way this max clause value is counted changed dramatically, from
num_terms = max_num_clauses
to
num_terms = max_num_clauses * num_fields_in_index
So in our case it would be 800 * 2 = 1600 > 1024
What I don't understand is why such a limitation was introduced and what value we should actually set this to.
An A OR B query against an index with 800 fields doesn't strike me as unusual or problematic from a performance perspective.
The "easy" way out is to increase the indices.query.bool.max_clause_count limit in the configuration file to a higher value. It used to be 1024 in ES 7 and now in ES 8 it has been raised to 4096. Just be aware, though, that doing so might harm the performance of your cluster and even bring nodes down depending on your data volume.
Here is some interesting background information on how the "ideal" value is calculated based on the hardware configuration as of ES 8.
A better way forward is to "know your data" and identify the fields to be used in your query_string query and either specify those fields in the query_string.fields array or modify your index settings to specify them as default fields to be searched on when no fields are specified in your query_string query:
PUT index/_settings
{
  "index.query.default_field": [
    "description",
    "title",
    ...
  ]
}
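As a rough illustration of the first option (listing fields in the query itself), the sketch below builds a query_string body restricted to an explicit fields list and compares a back-of-the-envelope clause estimate against the limit. The helper names are hypothetical, and the "terms times fields" estimate is only the rule of thumb quoted in the question, not Elasticsearch's exact accounting.

```python
def build_query_string(query: str, fields: list[str]) -> dict:
    """Build a query_string body that only expands over the given fields."""
    return {"query": {"query_string": {"query": query, "fields": fields}}}


def estimated_clauses(num_terms: int, num_fields: int) -> int:
    """Rough estimate (assumption): each term is expanded against each targeted field."""
    return num_terms * num_fields


body = build_query_string("Abby OR Alta", ["title", "description"])

# Two terms over all 800 fields versus the same two terms over two named fields.
print(estimated_clauses(2, 800))  # 1600
print(estimated_clauses(2, 2))    # 4
```

With only two named fields the estimate stays far below 1024, which is why narrowing the field list tends to be preferable to raising the limit.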

Is there a difference between using search terms and should when querying Elasticsearch

I am performing a refactor of the code to query an ES index, and I was wondering if there is any difference between the two snippets below:
"bool" : {
  "should" : [ {
    "terms" : {
      "myType" : [ 1 ]
    }
  }, {
    "terms" : {
      "myType" : [ 2 ]
    }
  }, {
    "terms" : {
      "myType" : [ 4 ]
    }
  } ]
}
and
"terms" : {
  "myType" : [ 1, 2, 4 ]
}
Please check this post from the Elastic discuss forum, which answers your question. Copying it here for quick reference:
There's a few differences.
The simplest to see is the verbosity: terms queries just list an array while term queries require more JSON.
terms queries do not score matches based on IDF (the rareness) of matched terms; the term query does.
term queries can only have up to 1024 values due to Boolean's max clause count.
terms queries can have more terms. By default, Elasticsearch limits the terms query to a maximum of 65,536 terms. You can change this limit using the index.max_terms_count setting.
Which of them is going to be faster? Is speed also related to the number of terms?
It depends. They execute differently. term queries do more expensive scoring but do so lazily. They may "skip" over docs during execution because other, more selective criteria may advance the stream of matching docs considered.
The terms query doesn't do expensive scoring but is more eager: it creates the equivalent of a single bitset with a one or zero for every doc by ORing all the potential matching docs up front. Many terms can share the same bitset, which is what provides the scalability in the number of terms.
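For the refactor in the question, a small helper along these lines could collapse the verbose form into the compact one. This is only a sketch: the function name is hypothetical, it handles just the exact shape shown in the question (a bool with nothing but should clauses, each a terms query on the same field), and it returns anything else unchanged. The merged query should match the same documents, but scores may differ, as the answer above notes.

```python
def collapse_should_terms(query: dict) -> dict:
    """Merge bool.should clauses that are all `terms` on one field into one terms query.

    Returns the input unchanged if it does not have exactly that shape.
    """
    bool_body = query.get("bool", {})
    if set(bool_body) != {"should"}:
        return query  # must/filter/minimum_should_match etc. change the semantics

    field = None
    values = []
    for clause in bool_body["should"]:
        terms = clause.get("terms")
        if len(clause) != 1 or not terms or len(terms) != 1:
            return query
        name, vals = next(iter(terms.items()))
        if field is not None and name != field:
            return query  # clauses target different fields
        field = name
        for v in vals:
            if v not in values:  # keep order, drop duplicates
                values.append(v)

    if field is None:
        return query
    return {"terms": {field: values}}


verbose = {"bool": {"should": [
    {"terms": {"myType": [1]}},
    {"terms": {"myType": [2]}},
    {"terms": {"myType": [4]}},
]}}
print(collapse_should_terms(verbose))  # {'terms': {'myType': [1, 2, 4]}}
```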

Elasticsearch fuzziness with multi_match and bool_prefix type

I have a set of search_as_you_type fields I need to search against. Here is my mapping:
"mappings" : {
  "properties" : {
    "description" : {
      "type" : "search_as_you_type",
      "doc_values" : false,
      "max_shingle_size" : 3
    },
    "questions" : {
      "properties" : {
        "content" : {
          "type" : "search_as_you_type",
          "doc_values" : false,
          "max_shingle_size" : 3
        },
        "tags" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword"
            }
          }
        }
      }
    },
    "title" : {
      "type" : "search_as_you_type",
      "doc_values" : false,
      "max_shingle_size" : 3
    }
  }
}
I am using a multi_match query with bool_prefix type.
"query": {
  "multi_match": {
    "query": "triangle",
    "type": "bool_prefix",
    "fields": [
      "title",
      "title._2gram",
      "title._3gram",
      "description",
      "description._2gram",
      "description._3gram",
      "questions.content",
      "questions.content._2gram",
      "questions.content._3gram",
      "questions.tags",
      "questions.tags._2gram",
      "questions.tags._3gram"
    ]
  }
}
So far this works fine. Now I want to add typo tolerance, which is fuzziness in ES. However, it looks like bool_prefix conflicts with it: if I modify my query to add "fuzziness": "AUTO" and make an error in a word ("triangle" -> "triangld"), I get no results.
However, if I search for the phrase "right triangle", I see different behavior:
Even if no typos are made, I get more results just by adding "fuzziness": "AUTO" (1759 vs 1267).
If I add a typo to the second word ("right triangdd"), it seems to work, but results containing "right" without "triangle" ("The Bill of Rights", "Due process and right to privacy", etc.) are now pushed to the front.
If I make a typo in the first word ("righd triangle") or in both ("righd triangdd"), the results seem to be just fine. So this is probably the only correct behavior.
I've seen a couple of articles and even GitHub issues saying that fuzziness does not work properly with a multi_match query of type bool_prefix, but I can't find a workaround. I've tried changing the query type, but it looks like bool_prefix is the only one that supports search-as-you-type, and I need to return results as soon as the user starts typing.
Since I make all the requests to ES from our backend, I can also manipulate the query string to build different search query types if needed, for example one type for single-word searches and another for multi-word searches. But I basically need to maintain the current behavior.
I've also tried appending "~" or "~1[2]" to the string, which seems to be another way of specifying fuzziness, but the results are rather unclear and performance (search speed) seems to be worse.
My questions are:
How can I achieve fuzziness for one-word searches, so that the query "triangld" returns documents containing "triangle", etc.?
How can I get correct search results when the typo is in the second (last?) word of the query? As mentioned above it works, but see the second point above.
Why does just adding fuzziness (see the first point) return more results even if the phrase is correct?
Is there anything I need to change in my analyzers, etc.?
To achieve the desired behavior, we did the following:
Changed the query type to "query_string".
Added query string preprocessing on the backend. We split the query string on whitespace and append "~1" or "~2" to each word if its length is more than 4 or 8 characters respectively ("~" is the fuzziness syntax in ES). However, we don't add this to the word currently being typed until the user types a space. For example, while the user types [t, tr, tri, ... triangle] there is no fuzziness, but once they type "triangle " it becomes "triangle~2". This is because fuzziness on the last word gives unexpected results.
Removed all ngram fields from the search fields, as we get the same results but performance is a bit better.
Added "default_operator": "AND" to the query to contain the results from one field for phrase queries.
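The preprocessing step described above might look roughly like the following sketch. The function name is hypothetical and the exact thresholds are an assumption: the description says "more than 4 or 8 characters", but the example turns the 8-character word "triangle" into "triangle~2", so this version uses "longer than 4" for ~1 and "8 or more" for ~2. It also does not escape query_string reserved characters, which a real backend would likely need to do.

```python
def add_fuzziness(raw_query: str) -> str:
    """Append ~1 / ~2 to completed words; leave the word still being typed untouched."""
    words = raw_query.split()
    if not words:
        return ""

    # A trailing whitespace character means the user has finished the last word.
    last_word_complete = raw_query[-1].isspace()

    result = []
    for index, word in enumerate(words):
        is_last = index == len(words) - 1
        if is_last and not last_word_complete:
            result.append(word)          # still typing: no fuzziness yet
        elif len(word) >= 8:             # assumed threshold, see note above
            result.append(word + "~2")
        elif len(word) > 4:
            result.append(word + "~1")
        else:
            result.append(word)
    return " ".join(result)


print(add_fuzziness("triangle"))         # triangle
print(add_fuzziness("triangle "))        # triangle~2
print(add_fuzziness("right triangle "))  # right~1 triangle~2
```

The returned string would then be passed as the "query" value of the query_string query, together with "default_operator": "AND".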

Elasticsearch: count rows in a table

I have a big table (15000 x 2000 entries). In this table, I need to count rows with certain properties, like "all rows that have a 1 or 2 in column 5 and a 0 in column 6". I will call this type of operation a count operation. For my use case, the count operation needs to be very fast, as I am executing several hundred of them.
I tried to do this with Elasticsearch, but the performance seems to be very bad (around 10 seconds for 180 count operations). I was wondering whether I am building my queries the wrong way, or whether Elasticsearch is the wrong technology for this.
My queries are all of the same form. I create them with Java, so it's hard to post exactly what they look like, but I'll do my best to explain.
I build each single count operation as a BoolQuery. For the example above it would be a query that looks similar to this (don't blame me if it's wrong, I cannot copy the correct query, as it is built in Java; the term leaf queries in particular are approximate):
"query": {
  "bool": {
    "must": [
      {
        "bool": {
          "should": [
            { "term": { "column 5": "1" } },
            { "term": { "column 5": "2" } }
          ],
          "minimum_should_match": 1
        }
      },
      {
        "bool": {
          "should": [
            { "term": { "column 6": "0" } }
          ],
          "minimum_should_match": 1
        }
      }
    ],
    "boost": 1.0
  }
}
The many bool queries of this form are then grouped into a MultiSearchRequest. I use the option "fetchSource = false" to prevent Elasticsearch from loading the entities themselves.
Please tell me, if you need any further information, or if it is unclear, what I am trying to do!
I just fixed the problem myself. For all with a similar question, here is how:
I changed the SearchSourceBuilder, so that it now uses a ValueCountAggregator. This one counts the values and allows me to set the SearchSourceBuilder.size() to 0. In this way I get rid of the hits themselves and retrieve only the aggregation values.
Requests that took 4 seconds before are now executed in less than 100ms.
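Expressed as a request body rather than Java builder calls, the fix might look something like the sketch below. The helper name, the column field names, and the use of filter context are my assumptions, not part of the original solution; the essential points taken from the answer are "size": 0 and a value_count aggregation. Note that value_count counts field values, so the counted field should hold exactly one value per row.

```python
def build_count_request(allowed_values: dict, count_field: str) -> dict:
    """Build a body that returns no hits, only a value_count aggregation.

    allowed_values maps each column to the values it may take;
    a row must satisfy every column condition to be counted.
    """
    filters = [
        {"terms": {column: list(values)}}
        for column, values in allowed_values.items()
    ]
    return {
        "size": 0,  # skip fetching hits entirely
        "query": {"bool": {"filter": filters}},  # filter context: no scoring (assumption)
        "aggs": {"row_count": {"value_count": {"field": count_field}}},
    }


# "All rows that have a 1 or 2 in column 5 and a 0 in column 6".
body = build_count_request({"column_5": [1, 2], "column_6": [0]}, "column_5")
print(body)
```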

Finding fields Elasticsearch has matched on

I am using Elasticsearch to search for a group a user should join. I have the user data nested into the search query. On return I get back the closest matched group that user should be in.
The field I am searching on is a nested field as follows:
`{"interests": [
{"topics":["python", "stackoverflow", "elasticsearch"]},
{"topics":["arts", "textiles"]}
]}`
However, if you want to understand why something matched, how do you do this?
Elasticsearch does have an explain function that shows what the score is made up of using TF-IDF, but not specifically which terms were used.
For example, if I search for 'Textile', the doc should match on 'textiles'. Thus I want the term 'textiles' to be returned in explain or some other way.
The only way I see to meet this need is to store the search and the retrieved document and then process both to discover the words ES most likely matched on.
EDIT - for some more clarity of the question
An example of a group in my index has "interests": ['arts', 'fine arts', 'art painting', 'arts and crafts', 'sports']
In my search, I am looking for Arts and many other things. The term I am searching for comes up in this list many times, and thus should always be a contributor.
What I want in the response is to say that these words were matched, ['arts', 'fine arts', 'art painting', 'arts and crafts'], along with the degree to which they match, i.e. 'arts' should be higher than the others, but all the others are also relevant.
Elasticsearch allows you to specify the _name field for all queries and
filters. This means that you can separate your query into different parts with
separate names, which will allow you to determine which parts matched.
For example:
{
  "query" : {
    "bool" : {
      "should" : [
        {"match" : { "interests.topics" : {"query" : "python", "_name" : "py-topic"} }},
        {"match" : { "interests.topics" : {"query" : "arts", "_name" : "arts-topic"} }}
      ]
    }
  }
}
Then, in your response, you will get back an array of which queries (or
filters) matched (the matched_queries array on each hit), and you can determine
whether the py-topic query and/or the arts-topic query above matched.
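On the client side, reading those names back could look like this sketch. The helper name and the sample response are hypothetical and hand-written for illustration; only the matched_queries key on each hit comes from Elasticsearch itself.

```python
def matched_names(response: dict) -> dict:
    """Map each hit's _id to the list of named queries that matched it."""
    hits = response.get("hits", {}).get("hits", [])
    return {hit["_id"]: hit.get("matched_queries", []) for hit in hits}


# Hypothetical response fragment, trimmed to the relevant keys.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "group-1", "matched_queries": ["arts-topic"]},
            {"_id": "group-2", "matched_queries": ["py-topic", "arts-topic"]},
        ]
    }
}

print(matched_names(sample_response))
```

Giving each search term its own named clause is what makes it possible to report per document which terms contributed.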

Resources