Elastic Search Filter performing much slower than Query - elasticsearch

As my ES index/cluster has scaled up (~2 billion docs now), I have noticed a more significant performance loss, so I started experimenting with my queries to see if I could squeeze some performance out of them.
While doing this, I noticed that when I put a Boolean query in my filter, the results take about 3.5-4 seconds to come back. If I put the same clauses in my query instead, it takes more like 10-20 ms.
Here are the 2 queries:
Using a filter
POST /backup/entity/_search?routing=39cd0b95-efc3-4eee-93d1-93e6f5837d6b
{
"query": {"bool":{"should":[],"must":[{"match_all":{}}]}},
"filter": {
"bool": {
"must": [
{
"term": {
"serviceId": "39cd0b95-efc3-4eee-93d1-93e6f5837d6b"
}
},
{
"term": {
"subscriptionId": "3eb5021e-2f1d-4292-9fd5-95788ebfafa0"
}
},
{
"term": {
"subscriptionType": 0
}
},
{
"terms": {
"entityType": [
"4"
]
}
}
]
}
}
}
Using a query
POST /backup/entity/_search?routing=39cd0b95-efc3-4eee-93d1-93e6f5837d6b
{
"query": {"bool":{"should":[],"must":[
{
"term": {
"serviceId": "39cd0b95-efc3-4eee-93d1-93e6f5837d6b"
}
},
{
"term": {
"subscriptionId": "3eb5021e-2f1d-4292-9fd5-95788ebfafa0"
}
},
{
"term": {
"subscriptionType": 0
}
},
{
"terms": {
"entityType": [
"4"
]
}
}
]}}
}
Like I said, the second method where I don't use a Filter at all takes mere milliseconds, while the first query takes almost 4 seconds. This seems completely backwards from what the documentation says. They say that the Filter should actually be very quick and the Query should be the one that takes longer. So why am I seeing the exact opposite here?
Could it be something with my index mapping? If anyone has any idea why this is happening I would love to hear suggestions.
Thanks

The root-level filter element is actually another name for the post_filter element. The top-level filter was supposed to be removed in ES 1.1, but it slipped through and still exists in the 2.x versions.
It is removed completely in ES 5, though.
So, your first query is not a "filter" query. It's a query whose results are first used in aggregations (if there are any), and only then is the post_filter/filter applied to those results. So you basically have a two-step process there: https://www.elastic.co/guide/en/elasticsearch/reference/1.5/search-request-post-filter.html
More about its performance here:
While we have gained cacheability of the tag filter, we have potentially increased the cost of scoring significantly. Post filters are useful when you need aggregations to be unfiltered, but hits to be filtered. You should not be using post_filter (or its deprecated top-level synonym filter) if you do not have facets or aggregations.
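The two-step flow described above can be pictured with a toy model. This is only a sketch in plain Python, not how Elasticsearch is implemented: the list DOCS, the helper names, and the sample field values are made up for illustration. It counts how many documents the query stage has to produce in each arrangement.

```python
# Toy model only: a plain Python list stands in for an index.
DOCS = [
    {"id": 1, "serviceId": "a", "entityType": "4"},
    {"id": 2, "serviceId": "a", "entityType": "7"},
    {"id": 3, "serviceId": "b", "entityType": "4"},
    {"id": 4, "serviceId": "b", "entityType": "7"},
]


def matches_filter(doc):
    """Hypothetical stand-in for the bool/term filter in the question."""
    return doc["serviceId"] == "a" and doc["entityType"] == "4"


def search_with_post_filter(docs):
    """Step 1: match_all produces every doc. Step 2: the hits are filtered."""
    query_hits = list(docs)  # match_all: every document is a hit
    hits = [d for d in query_hits if matches_filter(d)]
    return hits, len(query_hits)


def search_with_filtered_query(docs):
    """The filter narrows the candidates before the query stage runs."""
    candidates = [d for d in docs if matches_filter(d)]
    query_hits = list(candidates)  # match_all over the candidates only
    return query_hits, len(query_hits)


if __name__ == "__main__":
    print(search_with_post_filter(DOCS))
    print(search_with_filtered_query(DOCS))
```

Both functions return the same hits; what differs is the number of documents the query stage has to handle, which is the whole index in the post_filter arrangement.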
A proper filter query is the following:
{
"query": {
"filtered": {
"query": {
"bool": {
"should": [],
"must": [
{
"match_all": {}
}
]
}
},
"filter": {
"bool": {
"must": [
{
"term": {
"serviceId": "39cd0b95-efc3-4eee-93d1-93e6f5837d6b"
}
},
{
"term": {
"subscriptionId": "3eb5021e-2f1d-4292-9fd5-95788ebfafa0"
}
},
{
"term": {
"subscriptionType": 0
}
},
{
"terms": {
"entityType": [
"4"
]
}
}
]
}
}
}
}
}
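Note that the filtered query itself was deprecated in ES 2.0 and removed in 5.0. On those versions the same intent is expressed with a bool query whose clauses sit under filter. A minimal sketch that builds such a request body as a Python dict follows; the function name build_filter_only_body is hypothetical, and the field names and values are taken from the question.

```python
import json


def build_filter_only_body(service_id, subscription_id, subscription_type, entity_types):
    """Build a search body in which every clause runs in filter context."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"serviceId": service_id}},
                    {"term": {"subscriptionId": subscription_id}},
                    {"term": {"subscriptionType": subscription_type}},
                    {"terms": {"entityType": list(entity_types)}},
                ]
            }
        }
    }


if __name__ == "__main__":
    body = build_filter_only_body(
        "39cd0b95-efc3-4eee-93d1-93e6f5837d6b",
        "3eb5021e-2f1d-4292-9fd5-95788ebfafa0",
        0,
        ["4"],
    )
    # The printed JSON is what would be sent as the _search request body.
    print(json.dumps(body, indent=2))
```

No match_all is needed here: a bool query that contains only filter clauses matches exactly the documents that pass all of them.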

A filter is faster. Your problem is that you include the match_all query in your filter case. This matches on all 2 billion of your documents. A set operation has to then be done against the filter to cull the set. Omit the query portion in your filter test and you'll see that the results are much faster.

Related

Elasticsearch "boost" not working when inside "filter"

I'm trying to boost matches on a certain field over another.
This works fine:
{
"query": {
"bool": {
"should": [
{
"terms": {
"boost": 2,
"mainField": "foo"
}
},
{
"terms": {
"otherField": "foo"
}
}
]
}
}
}
When I look at the documents matched on mainField, I see they have a _score of 2.0, as expected.
But when i wrap this same query in a filter:
{
"query": {
"bool": {
"filter": [
{
"bool": {
"should": [
{
"terms": {
"boost": 2,
"mainField": "foo"
}
},
{
"terms": {
"otherField": "foo"
}
}
]
}
}
]
}
}
}
The _score for all documents is 0.0.
The same thing happens for multi_match. By itself (e.g. directly inside a query) it works fine, but inside a bool + filter it doesn't.
Can someone explain why this is the case? I need to wrap the query in a filter because of the way my app composes queries.
Some context might also help: I'm trying to return documents that match on either mainField or otherField, but sort the ones matching on mainField first, so I figured boost would be the most appropriate choice here. Let me know if there is a better way.
Queries placed inside a filter clause are always executed in filter context: they return a score of zero and only contribute to filtering documents.
See the Elasticsearch documentation on filter context for more details.
That is why you do not get a _score of 2.0 in the second query, even after applying boost.
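If the app can place the scoring clauses next to the filter rather than inside it, the boost is kept. A minimal sketch, assuming that is possible in your query composition: the function name build_boosted_body and the extra_filters parameter are hypothetical. The should clauses stay in query context, and minimum_should_match is set explicitly because should clauses become optional once a bool query also has a filter clause.

```python
def build_boosted_body(value, extra_filters=()):
    """Score mainField matches higher while other clauses only filter."""
    return {
        "query": {
            "bool": {
                # Clauses here only include/exclude documents; no scoring.
                "filter": list(extra_filters),
                # Clauses here run in query context, so boost applies.
                "should": [
                    {"terms": {"mainField": [value], "boost": 2}},
                    {"terms": {"otherField": [value]}},
                ],
                # Require at least one should clause to match.
                "minimum_should_match": 1,
            }
        }
    }


if __name__ == "__main__":
    print(build_boosted_body("foo"))
```

Sorting by _score (the default) would then be expected to list the mainField matches first.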

Difference between elasticsearch queries

I'm having a hard time trying to figure out why these two queries do not return the same number of results (I'm using elasticsearch 2.4.1):
{
"nested": {
"path": "details",
"filter": [
{ "match": { "details.id": "color" } },
{ "match": { "details.value_str": "red" } }
]
}
}
{
"nested": {
"path": "details",
"filter": {
"bool": {
"must": [
{ "match": { "details.id": "color" } },
{ "match": { "details.value_str": "red" } }
]
}
}
}
}
The first query has more results.
My guess was that the filter clause in the first query was working like an or/should, but if I replace the must in the second query with a should, the query yields more results than either of those two.
How does the meaning of those queries differ?
I'm afraid I have no knowledge of the structure of the indexed documents; all I know is how many rows each query returns.
The first query is wrong: the nested filter cannot be an array. I suspect ES doesn't parse it correctly and only applies one of the match clauses instead of both, which is probably why it returns more results than the second one.
The second query is correct in terms of nested filter and yields exactly what you expect.
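The suspected behaviour can be illustrated with a toy model. This is a sketch under assumptions, not a reproduction of how ES 2.4.1 parses the array: the sample documents are invented, and exact string equality stands in for match. It compares three readings of the nested clause: both conditions on the same nested object (must), only one condition applied, and either condition (should).

```python
# Invented sample data; each doc has a list of nested "details" objects.
DOCS = [
    {"id": 1, "details": [{"id": "color", "value_str": "red"}]},
    {"id": 2, "details": [{"id": "color", "value_str": "blue"},
                          {"id": "size", "value_str": "red"}]},
    {"id": 3, "details": [{"id": "size", "value_str": "red"}]},
    {"id": 4, "details": [{"id": "color", "value_str": "green"}]},
]


def is_color(detail):
    return detail["id"] == "color"


def is_red(detail):
    return detail["value_str"] == "red"


def count_nested(docs, predicate):
    """Count docs having at least one nested object that satisfies predicate."""
    return sum(1 for doc in docs if any(predicate(d) for d in doc["details"]))


# must: a single nested object has to satisfy both conditions.
must_both = count_nested(DOCS, lambda d: is_color(d) and is_red(d))
# Only one of the two conditions is applied (the suspected array behaviour).
only_one = count_nested(DOCS, is_color)
# should: a nested object satisfying either condition is enough.
either = count_nested(DOCS, lambda d: is_color(d) or is_red(d))

if __name__ == "__main__":
    print(must_both, only_one, either)
```

With this data the counts are ordered must < single clause < should, the same ordering reported in the question.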

Compare query with and without score calculation

I would like to know if it is possible to disable score calculation for should types of queries or maybe it is possible to have an OR for filter context?
ES version: 6+
For example:
This query matches documents on either records OR voIds and calculates a score:
POST customers/_search
{
"size": 10000,
"version": true,
"query": {
"bool": {
"should": [
{
"terms": {
"voIds": [
78031203, ...
]
}
},
{
"terms": {
"records.keyword": [
"S3G82U", ....
]
}
}
]
}
}
}
This query filters documents that match on both records AND voIds and does not calculate a score. It is not what I need, because it uses AND:
POST customers/_search
{
"size": 10000,
"version": true,
"query": {
"bool": {
"filter": [
{
"terms": {
"voIds": [
78031203
]
}
},
{
"terms": {
"records.keyword": [
"S3G82U"
]
}
}
]
}
}
}
My goal is to troubleshoot the performance of the same query with and without scoring. The first query has scoring; how do I write the second one without it?
Thanks.
This is not possible, and I don't see much of a use case for it functionality-wise. Are you seeing slowness in Elasticsearch in general or in the query itself?
You can't disable scoring completely, but you can disable query coordination. I'm not sure how much that helps performance-wise, if at all.
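One shape that may still be worth timing: any query placed under a bool filter clause runs in filter context, including a nested bool with should clauses, which gives OR semantics with a _score of 0.0 (the "boost not working inside filter" question above reports exactly that for this shape). A minimal sketch that builds such a body as a Python dict follows; the function name build_unscored_or_body is hypothetical, and the field names come from the question.

```python
def build_unscored_or_body(vo_ids, records, size=10000):
    """OR over two terms clauses, wrapped in a filter so nothing is scored."""
    return {
        "size": size,
        "version": True,
        "query": {
            "bool": {
                "filter": [
                    {
                        # A bool with only should clauses matches when
                        # at least one of them matches (OR).
                        "bool": {
                            "should": [
                                {"terms": {"voIds": list(vo_ids)}},
                                {"terms": {"records.keyword": list(records)}},
                            ]
                        }
                    }
                ]
            }
        },
    }


if __name__ == "__main__":
    print(build_unscored_or_body([78031203], ["S3G82U"]))
```

Running this body and the scored should query from the question against the same data would give the with-score and without-score timings side by side.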

Elasticsearch terms query on array of values

I have data on ElasticSearch index that looks like this
{
"title": "cubilia",
"people": [
"Ling Deponte",
"Dana Madin",
"Shameka Woodard",
"Bennie Craddock",
"Sandie Bakker"
]
}
Is there a way for me to search for all the people whose name starts with "ling" (case-insensitively) and get the distinct terms properly cased, i.e. "Ling Deponte", not "ling deponte"?
I am fine with changing the mappings on the index in any way.
Edit: this does what I want, but it is a really bad query:
{
"size": 0,
"aggs": {
"person": {
"filter": {
"bool":{
"should":[
{"regexp":{
"people.raw":"(.* )?[lL][iI][nN][gG].*"
}}
]}
},
"aggs": {
"top-colors": {
"terms": {
"size":10,
"field": "people.raw",
"include":
{
"pattern": ["(.* )?[lL][iI][nN][gG].*"]
}
}
}
}
}
}
}
people.raw is not_analyzed
Yes, and you can do it without a regular expression by taking advantage of Elasticsearch's full text capabilities.
GET /test/_search
{
"query": {
"match_phrase": {
"people": "Ling"
}
}
}
Note: This could also be match or match_phrase_prefix in this case. The match_phrase* queries imply an order of the values in the text. match simply looks for any of the values. Since you only have one value, it's pretty much irrelevant.
The problem is that you cannot limit the document responses to just that name because the search API returns documents. With that said, you can use nested documents and get the desired behavior via inner_hits.
You want to avoid wildcard prefixing whenever possible because it simply does not work at scale. To put it in SQL terms, that's like doing a full table scan; you effectively lose the benefit of the inverted index because it has to walk the index entirely to find the actual start.
Combining the two should work pretty well, though. Here, I use the query to whittle the results down to what you are interested in, then I use your inner aggregation to include terms based only on the value.
{
"size": 0,
"query": {
"match_phrase": {
"people": "Ling"
}
},
"aggs": {
"person": {
"terms": {
"size":10,
"field": "people.raw",
"include": {
"pattern": ["(.* )?[lL][iI][nN][gG].*"]
}
}
}
}
}
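To see which terms the include pattern would keep, the pattern can be tried out locally. This is a rough sketch: Python's re module is not Lucene's regular expression engine, but for this simple pattern re.fullmatch approximates Lucene's behaviour of matching the whole term. The first five names come from the question; the last two are made up to show a later word matching and a non-match.

```python
import re

# Same pattern as in the aggregation's include clause.
INCLUDE = re.compile(r"(.* )?[lL][iI][nN][gG].*")

PEOPLE = [
    "Ling Deponte",
    "Dana Madin",
    "Shameka Woodard",
    "Bennie Craddock",
    "Sandie Bakker",
    "Anna Lingard",   # hypothetical: second word starts with "Ling"
    "Darling Smith",  # hypothetical: "ling" occurs mid-word only
]


def included_terms(terms):
    """Keep the terms that the pattern matches in full."""
    return [t for t in terms if INCLUDE.fullmatch(t)]


if __name__ == "__main__":
    print(included_terms(PEOPLE))
```

The pattern keeps a term when any word in it starts with "ling" in any casing, and the term is returned with its original casing because people.raw is not analyzed.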
Hi, please find the query below; it may help with your request.
GET skills/skill/_search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{
"wildcard": {
"skillNames.raw": "jav*"
}
}
]
}
}
}
}
}
My intention is to find documents starting with "jav".

Using geo_shape filter inside bool filter

I'm trying to combine a geo_shape Elasticsearch filter with a basic term filter within a bool filter, so I can attempt to improve performance of our elasticsearch query, with little success.
This query is used over a set of polygons in Elasticsearch, to determine which shapes the specified point is in.
It seems as though, unless I have the wrong end of the stick, geo_shape filters can't be included inside a bool filter collection like this:
{
"size": 1000,
"fields": [],
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"bool": {
"must": [
{
"geo_shape": {
"deliveryAreas.area": {
"shape": {
"coordinates": [
-0.126208,
51.430874
],
"type": "point"
}
}
}
},
{
"term": {
"restaurantState": 3
}
}
]
}
}
}
}
}
The query above runs, but returns 0 results. Using the geo_shape query outside the bool works fine, but the combination of the two seems to fail. I assume it must be a syntax error, as the ElasticSearch docs recommend this approach to make the expensive geo calls cheaper, but no luck so far.
