I have 2 queries:
GET _search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "idpays": 250
        }
      }
    }
  }
}
and
GET _search
{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": {
            "term": {
              "idpays": 250
            }
          }
        }
      }
    }
  }
}
These two queries return the same results.
Which one performs better: the first one, or the second one with bool and must?
Regards
Since Elasticsearch uses Lucene under the hood, all queries are rewritten into simpler Lucene queries before they are executed. If you use an overly complicated query for a simple task, Elasticsearch will spend more time rewriting it into a simpler one.
Add "profile": true at the root of your query to get a detailed analysis of the performance of each query component, and look at the rewrite time.
The larger that time, the more complex the query is. A quick look suggests the second one should be slower, but you should analyze the results yourself.
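For example, profiling the first query from the question would look something like this; the response then contains a profile section with a rewrite_time per shard that you can compare between the two variants:

GET _search
{
  "profile": true,
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "idpays": 250
        }
      }
    }
  }
}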
Related
We use Elasticsearch 7.2 and we've been observing something weird lately.
We tried executing the following two queries:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "customer(keyword_field)": "big_customer"
          }
        }
      ]
    }
  }
}
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "customer(keyword_field)": "big_customer"
          }
        }
      ]
    }
  }
}
This matches around 1 million documents. The first one was faster than the second (10 times faster!). I expected the first to be slower because of scoring.
Also, when I added sorting, the first became as slow as the second, while the second remained the same.
I have a suspicion that filter at the top level looks through all documents, whereas term (or range for dates, match, etc.) works against the indexed values. I spotted something similar at a new client and was baffled as to why they were using filter at the top level rather than range or match.
Could be wrong here, by the way, so try it on your systems first.
If you wrap a bool query in a constant_score query, does it still calculate scores for the internal queries? Is there another easy way to disable scoring?
Hi, I have an update. I have a query where no scoring is required, so I wrote it in two forms and did load testing with 10,000 documents.
These are the two query structures I used for the load testing:
{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "must": [
              { bool: ....... }
            ]
          }
        }
      ]
    }
  }
}
And the second one is:
{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "filter": [
              { bool: ....... }
            ]
          }
        }
      ]
    }
  }
}
What I found was that the first query took almost double the time of the second query. I would like to know why this happened.
Also, do the internal bool queries inside filter in the first example run in query context or filter context?
I have read the Elasticsearch documentation and cannot find references or details on how this works internally.
Thanks in advance!
A query can run in two types of context in Elasticsearch: query context and filter context. Query context determines how well a document matches the query, i.e. it calculates a score, whereas filter context only determines whether a document matches the query, and no scoring is done.
To answer your question: if you don't want scoring for a bool query, simply put it in filter context. More info on query context can be found here.
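As a sketch (the status and category fields and their values are made up for illustration), putting the whole bool query inside the filter clause of an outer bool runs it in filter context, so nothing inside it contributes to scoring; wrapping it in constant_score has the same effect and just assigns a fixed score:

{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "must": [
              { "term": { "status": "active" } },
              { "term": { "category": "books" } }
            ]
          }
        }
      ]
    }
  }
}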
Caching is probably why. See the documentation: "Frequently used filters will be cached automatically by Elasticsearch, to speed up performance."
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html
I run the query below on a large Elasticsearch cluster, and the cluster becomes unresponsive:
{
  "size": 10000,
  "query": {
    "bool": {
      "must": [
        {
          "regexp": {
            "message": {
              "value": ".*exception.*"
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "term": {
                  "beat.hostname": "ip-xxx-xx-xx-xx"
                }
              }
            ]
          }
        },
        {
          "range": {
            "@timestamp": {
              "lt": 1518459660000,
              "format": "epoch_millis",
              "gte": 1518459600000
            }
          }
        }
      ]
    }
  }
}
When I remove the wildcarded .*exception.* and replace it with a non-wildcarded string like xyz, it returns fast. Although the query uses a wildcarded expression, it also restricts the search to a small time range and a specific host, so I would think this is a very simple query. Any reason why the Elasticsearch server can't handle it? The cluster has 10 nodes and 20 TB of data.
See the documentation for Regexp Query. It clearly states the following:
Note: The performance of a regexp query heavily depends on the regular expression chosen. Matching everything like .* is very slow.
What would be ideal is to change the text analysis on the message field with a WordDelimiterTokenFilter and set split_on_case_change to true. Then something like NullPointerException will get indexed as three separate tokens (Null, Pointer, Exception). This lets you search on exception without using a regex. The caveat is that you need to reindex all your documents.
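A sketch of what that could look like on a recent Elasticsearch version (the index, analyzer, and filter names are made up for illustration; the corresponding filter type in Elasticsearch is word_delimiter, with word_delimiter_graph preferred on newer versions):

PUT my-logs
{
  "settings": {
    "analysis": {
      "filter": {
        "split_on_case": {
          "type": "word_delimiter_graph",
          "split_on_case_change": true
        }
      },
      "analyzer": {
        "message_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["split_on_case", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "message_analyzer"
      }
    }
  }
}

With a mapping along these lines, a plain match query for exception on the message field would find NullPointerException without any regexp.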
Another quick thing to try might be to keep your filter conditions on the hostname and timestamp in a filter context, which will prefilter documents before running your regexp query. This may be a short-term solution for you until you fix the text analysis.
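A rough rewrite of the query from the question along those lines, with the single-clause bool/should flattened into a plain term filter (same fields and values as above):

{
  "size": 10000,
  "query": {
    "bool": {
      "must": [
        {
          "regexp": {
            "message": {
              "value": ".*exception.*"
            }
          }
        }
      ],
      "filter": [
        {
          "term": {
            "beat.hostname": "ip-xxx-xx-xx-xx"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": 1518459600000,
              "lt": 1518459660000,
              "format": "epoch_millis"
            }
          }
        }
      ]
    }
  }
}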
Consider the following query in Elasticsearch:
GET nyc_visionzero/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "fuzzy": {
            "on_street_name": "AVENUE"
          }
        }
      ],
      "filter": {
        "term": {
          "borough": "MANHATTAN"
        }
      }
    }
  }
}
Is the filter part executed first and then the fuzzy query, or is it the other way around? What if I want to change the order of their execution? How can I do that?
This question relates to the query vs. filter context topic. Everything in the query context (here query.bool.must) counts towards the score of a document, whereas the conditions in the filter context (here query.bool.filter) are a yes/no decision.
So from a performance perspective, filters are faster and can be cached. On the other hand, queries allow for some fuzziness.
There is a much more detailed explanation of this in the Elasticsearch docs on query and filter context.
This query takes 200+ ms every time it is executed:
{
  "filter": {
    "term": {
      "id": "123456",
      "_cache": true
    }
  }
}
but this one only takes 2-3 ms every time it is executed after the first query:
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "term": {
          "id": "123456"
        }
      }
    }
  }
}
Note the same ID values in both queries. It looks like the second query uses cached results from the first query. But why can't the first query use the cached results itself? Removing "_cache": true from the first query doesn't change anything.
And when I execute the second query with some other ID, it takes ~40 ms the first time and 2-3 ms every time after that. So the second query not only works faster, it also caches the results and uses that cache for subsequent calls.
Is there an explanation for all this?
The top-level filter element in the first request has a very special function in Elasticsearch. It's used to filter search results without affecting facets. In order to avoid interfering with facets, this filter is applied during collection of results rather than during searching, which is what causes its slow performance. Using a top-level filter without facets makes very little sense, because filtered and constant_score queries typically provide much better performance. If the verbosity of a filtered query with match_all bothers you, you can rewrite your second request into an equivalent constant_score query:
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "id": "123456"
        }
      }
    }
  }
}