Performance of elastic queries

This query takes 200+ ms every time it is executed:
{
"filter": {
"term": {
"id": "123456",
"_cache": true
}
}
}
but this one only takes 2-3 ms every time it is executed after the first query:
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"term": {
"id": "123456"
}
}
}
}
}
Note the same ID values in both queries. It looks like the second query uses cached results from the first query. But why can't the first query use the cached results itself? Removing "_cache": true from the first query doesn't change anything.
And when I execute the second query with some other ID, it takes ~ 40 ms to execute it for the first time and 2-3 ms every time after that. So the second query not only works faster but it also caches the results and uses the cache for subsequent calls.
Is there an explanation for all this?

The top-level filter element in the first request has a very special function in Elasticsearch. It is used to filter search results without affecting facets. To avoid interfering with facets, this filter is applied during collection of results rather than during searching, which is why it is slow. Using a top-level filter without facets makes very little sense, because filtered and constant_score queries typically perform much better. If the verbosity of a filtered query with match_all bothers you, you can rewrite your second request as an equivalent constant_score query:
{
"query": {
"constant_score": {
"filter": {
"term": {
"id": "123456"
}
}
}
}
}

Related

Elastic search wildcard query crashes cluster

I run the query below on a large Elasticsearch cluster. The cluster becomes unresponsive.
{
"size": 10000,
"query": {
"bool": {
"must": [
{
"regexp": {
"message": {
"value": ".*exception.*"
}
}
},
{
"bool": {
"should": [
{
"term": {
"beat.hostname": "ip-xxx-xx-xx-xx"
}
}
]
}
},
{
"range": {
"#timestamp": {
"lt": 1518459660000,
"format": "epoch_millis",
"gte": 1518459600000
}
}
}
]
}
}
}
When I remove the wildcarded .*exception.* and replace it with any non wildcarded string like xyz it returns fast. Though the query uses a wildcarded expression, it also looks for a small time range and a specific host. I would think this is a very simple query. Any reason why elasticsearch server can't handle this query? The cluster has 10 nodes and 20 TB of data.
See the documentation for Regexp Query. It clearly states the following:
Note: The performance of a regexp query heavily depends on the regular
expression chosen. Matching everything like .* is very slow
Ideally, you would change the text analysis on the message field to use a WordDelimiterTokenFilter with split_on_case_change set to true. Then something like NullPointerException gets indexed as three separate tokens [Null, Pointer, Exception], which lets you search on exception without using a regex. The caveat is that you need to reindex all your documents.
Another quick thing to try might be to keep your filter conditions on the hostname and timestamp in a filter context, which will prefilter documents before running your regexp query. This may be a short-term solution for you until you fix the text analysis.
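To make the filter-context suggestion concrete, here is a minimal sketch in Python that builds such a request body. The function name and its parameters are my own (hypothetical); the field names and values are taken from the question, and the timestamp field is passed in as an argument because I only know it as written there. Whether this is fast enough on your cluster is something you would have to measure.

```python
import json


def build_prefiltered_regexp_query(hostname, gte_ms, lt_ms, pattern,
                                   timestamp_field="#timestamp", size=10000):
    """Build a bool query that keeps host and time range in filter context.

    Hypothetical helper: only the regexp clause stays in "must";
    the yes/no conditions move to "filter".
    """
    return {
        "size": size,
        "query": {
            "bool": {
                # Filter context: no scoring, plain yes/no matching.
                "filter": [
                    {"term": {"beat.hostname": hostname}},
                    {
                        "range": {
                            timestamp_field: {
                                "gte": gte_ms,
                                "lt": lt_ms,
                                "format": "epoch_millis",
                            }
                        }
                    },
                ],
                # Query context: the expensive regexp clause.
                "must": [
                    {"regexp": {"message": {"value": pattern}}}
                ],
            }
        },
    }


if __name__ == "__main__":
    body = build_prefiltered_regexp_query(
        "ip-xxx-xx-xx-xx", 1518459600000, 1518459660000, ".*exception.*"
    )
    print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the search request body in place of the original one.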

Search within the results got from elasticsearch

Is it possible to search within the results that I get from elasticsearch?
To achieve that currently I need to run & wait for two searches on elasticsearch: the first search is
{ "match": { "title": "foo" } }
It takes 5 seconds and returns 500 docs. And then a second search
{
"bool": {
"must": [
{ "match": { "title": "foo" } },
{ "match": { "title": "bar" } }
]
}
}
It takes another 5 seconds and returns 200 docs, which basically has nothing to do with the first search from elasticsearch's perspective.
Instead of doing it this way, I'd like to offer a "search further within the result" option to my users. With this option, users could run a second search with additional keywords against the results returned by the first search.
So my scenario is that a user makes a first search with keyword "foo", and gets 500 results on the webpage, and then selects "search further within the result", to make a second search within the 500 results, and hope to get some refined results really quick.
How can I achieve it? Thanks!
What you could do is use the IDS query. Collect all document IDs from the first request, and then post them with a new Bool query that includes an IDS query in a must clause next to the original query. You could efficiently collect the IDs in the first request using the Scroll API. Since you will return the second result sorted anyway, it does not make sense to do any sorting in the first request, so you can speed up the first request.
See:
Scroll API: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
IDS Query: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-ids-query.html
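A rough sketch of the second request described above, in Python. The helper name is made up for illustration, and it assumes you have already collected the document IDs from the first search (for example via the Scroll API) into a list.

```python
import json


def build_search_within_results(first_result_ids, refinement_query):
    """Build a bool query restricted to the IDs of an earlier result set.

    Hypothetical helper: the ids clause sits in "must" next to the
    refinement query, as described in the answer.
    """
    return {
        "query": {
            "bool": {
                "must": [
                    # Only documents from the first search can match.
                    {"ids": {"values": list(first_result_ids)}},
                    # The additional keyword the user typed.
                    refinement_query,
                ]
            }
        }
    }


if __name__ == "__main__":
    # Placeholder IDs standing in for the results of the "foo" search.
    ids_from_first_search = ["1", "2", "3"]
    body = build_search_within_results(
        ids_from_first_search, {"match": {"title": "bar"}}
    )
    print(json.dumps(body, indent=2))
```

Since every document in the ID list already matched "foo", the second request only needs the new keyword. The ids clause does not need to contribute to scoring, so moving it into a filter clause of the bool query is an option too.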
post_filter is a way to search within the results of another search.
In your case:
GET _search
{
"query": {
"match": {
"title": "foo"
}
},
"post_filter": {
"match": {
"title": "bar"
}
}
}
post_filter will be executed on the query result.

How to specify the execution order of filter and query in an Elasticsearch query

Consider the following query in Elasticsearch:
GET nyc_visionzero/_search
{
"query": {
"bool": {
"must": [{
"fuzzy": {
"on_street_name": "AVENUE"
}
}
],
"filter": {
"term": {
"borough": "MANHATTAN"
}
}
}
}
}
Is the filter part executed first and then the fuzzy query, or is it the other way around? And if I want to change the order of their execution, how can I do that?
This question relates to the query vs. filter context topic. Everything in the query context (here query.bool.must) contributes to the score of a document, whereas the conditions in the filter context (here query.bool.filter) are a yes/no decision.
So from a performance perspective, filters are faster and can be cached. On the other hand, queries allow for some fuzziness.
There is a much more detailed explanation on this in the elasticsearch docs on query and filter context.

How to get only x results from elastic and then stop searching?

My whole index is about 700M docs, this query:
{
"query": {
"term": {
"SOME_FIELD": "SOME_TERM"
}
},
"size": 10
}
applies to ca 5M docs. "Some_field" is indexed, not analysed.
Query takes ca 1s on average hetzner. Too slow :) I don't care about pagination or sorting or scoring. I just want 10 first "random" matching docs.
Is there a way to do it with scoring disabled, in the "mysql way"?
filter or constant_score do not help.
If you go with filters, that will remove the score computation and should provide faster query speeds:
{
"query": {
"bool": {
"filter": {
"term": {
"SOME_FIELD": "SOME_TERM"
}
}
}
},
"size": 10
}
If that's still too slow, you could consider using document routing, but it may not be a viable option for you as you might have just 1 shard or very few terms for SOME_FIELD.
I also suggest you go over the production deployment document by Elastic. It gives you an overview of how to configure your cluster optimally and can produce a serious performance boost if your cluster is currently misconfigured, e.g. running on a strong machine but keeping the default ES_HEAP_SIZE value.
The option I was looking for is "terminate_after". Unfortunately it is not very well documented, see:
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/query-dsl-limit-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-count.html#_request_parameters
so, my query looks like this:
{
"query": {
"term": {
"SOME_FIELD": "SOME_TERM"
}
},
"size": 10,
"terminate_after": 10
}
Don't use "10" instead of 10. Elastic does not cast it to an integer and ignores the parameter.
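If the request body is built in code, the string-vs-integer pitfall can be caught before the request is sent. A small sketch in Python, with a hypothetical helper name; the type check is my own addition, not something Elasticsearch requires.

```python
import json


def build_first_n_query(field, term, n):
    """Build a term query that stops collecting after n documents.

    Hypothetical helper: rejects non-integer n so that "10" is never
    sent where 10 is expected.
    """
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(n, bool) or not isinstance(n, int):
        raise TypeError("n must be an int, got %r" % (n,))
    return {
        "query": {"term": {field: term}},
        "size": n,
        "terminate_after": n,
    }


if __name__ == "__main__":
    print(json.dumps(build_first_n_query("SOME_FIELD", "SOME_TERM", 10), indent=2))
```

Note that terminate_after limits the number of documents collected on each shard, so size is still what caps the total number of hits returned.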

performance query in elasticsearch

I have 2 queries:
GET _search
{
"query": {
"constant_score": {
"filter": {
"term": {
"idpays": 250
}
}
}
}
}
and
GET _search
{
"query": {
"constant_score": {
"filter": {
"bool": {
"must": {
"term": {
"idpays": 250
}
}
}
}
}
}
}
These 2 queries return the same results.
Which one has the best performance? The first one or the second one with bool and must?
Regards
Since Elasticsearch uses Lucene under the hood, all queries are rewritten into simpler Lucene queries before they are executed. If you use an overly complicated query for a simple task, Elasticsearch needs more time to rewrite it into a simpler one.
Add "profile": true at the root of your request to get a detailed breakdown of the query's performance, and take a look at the rewrite time.
The larger the rewrite time, the more complex the query is. A quick look tells me the second one should be slower, but you should analyze the results yourself.
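A minimal sketch in Python for comparing the rewrite time of the two requests. Both helper names are hypothetical. The response layout read here (profile.shards[].searches[].rewrite_time) is how I understand the Profile API output, so check it against a real response from your version before relying on it.

```python
def with_profile(request_body):
    """Return a copy of the request body with profiling switched on."""
    profiled = dict(request_body)
    profiled["profile"] = True
    return profiled


def total_rewrite_time(response):
    """Sum rewrite_time over all shards and searches in a profiled response.

    Assumes the layout profile -> shards -> searches -> rewrite_time;
    missing keys are treated as zero rather than raising.
    """
    total = 0
    for shard in response.get("profile", {}).get("shards", []):
        for search in shard.get("searches", []):
            total += search.get("rewrite_time", 0)
    return total


# The two request bodies from the question.
simple_query = {
    "query": {"constant_score": {"filter": {"term": {"idpays": 250}}}}
}
nested_query = {
    "query": {
        "constant_score": {
            "filter": {"bool": {"must": {"term": {"idpays": 250}}}}
        }
    }
}

profiled_simple = with_profile(simple_query)
profiled_nested = with_profile(nested_query)
```

Send profiled_simple and profiled_nested with whatever client you use, pass each response to total_rewrite_time, and compare the two numbers over several runs rather than a single one.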
