elastic search function_score query performance - elasticsearch

I'm running function_score queries in Elasticsearch.
The boost weights of the query are determined ad hoc (and differ between users). The terms being queried also differ between users depending on context. An example query might look like this:
{
  "query": {
    "function_score": {
      "filter": {
        "term": { "in_stock": true },
        ... more filters ...
      },
      "functions": [
        {
          "filter": { "term": { "color": "red" } },
          "weight": 2
        },
        {
          "filter": { "term": { "style": "elegant" } },
          "weight": 1
        },
        {
          "filter": { "term": { "length": "long" } },
          "weight": 3
        }
      ],
      "score_mode": "sum"
    }
  }
}
The documents are simple and look something like this:
{
  "product_id" : "abc",
  "name" : "blah blah",
  "price" : 10,
  "in_stock" : true,
  "color" : "red",
  "style" : "elegant",
  "length" : "long",
  ... more attributes ...
}
The mapping types of the filtered fields are keyword and boolean; I'm not doing any free-text search anywhere.
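For reference, a mapping along these lines would match that description (a sketch; the index name is hypothetical and the field list is assumed from the sample document above; name could equally be text, but no free-text search is done here):
PUT products
{
  "mappings": {
    "properties": {
      "product_id": { "type": "keyword" },
      "name":       { "type": "keyword" },
      "price":      { "type": "integer" },
      "in_stock":   { "type": "boolean" },
      "color":      { "type": "keyword" },
      "style":      { "type": "keyword" },
      "length":     { "type": "keyword" }
    }
  }
}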
The query performance is reasonable until the index size becomes large (around 1 million documents in the index). At that point the query will take multiple seconds to complete.
Index configuration:
I've played around with limiting shard size; currently the shards are capped at 1 million items each, because beyond that performance seems to get even worse. Replication is at 5. The index is read-only.
Since the weights and the terms differ between queries, I'm not sure whether it's possible to pre-sort the index in a way that would speed up the query.
I'm also not sure how (or if) Elasticsearch can cache results, scores, and ordering in the case of weighted queries.

Related

Is it possible to affect execution order of filters in Elasticsearch?

We have a query of the form:
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "userId": {
              "value": "a_user_id",
              "boost": 1
            }
          }
        },
        {
          "range": {
            "date": {
              "from": 1648598400000,
              "to": 1648684799999,
              "boost": 1
            }
          }
        },
        {
          "query_string": {
            "query": "*MyQuery*",
            "fields": [
              "aField^1.0",
              "anotherField^1.0",
              "thirdField^1.0"
            ],
            "boost": 1
          }
        }
      ],
      "boost": 1
    }
  }
}
If we remove the third filter (the query_string one), performance improves dramatically (typically from around 2000 ms to 20 ms) for different variants of the above query.
The thing is, the first two filters (on userId and the date range) will always result in only a handful of search hits (say 50 or so).
So, if it was possible to hint that to Elasticsearch, or otherwise affect the query plan, it could solve our issue.
In old (1.x) versions of ES this was apparently affected by the order of the filters. From Elasticsearch: Order of filters for best performance:
"The order of filters in a bool clause is important for performance. More-specific filters should be placed before less-specific filters in order to exclude as many documents as possible, as early as possible. If Clause A could match 10 million documents, and Clause B could match only 100 documents, then Clause B should be placed before Clause A."
But newer versions are smarter - https://www.elastic.co/blog/elasticsearch-query-execution-order:
Q: Does the order in which I put my queries/filters in the query DSL matter?
A: No, because they will be automatically reordered anyway based on their respective costs and match costs.
But is it still possible to reach the desired outcome here by modifying the ES search request somehow?
Your query should look like the one below, so that the filters run first and select only the ~50 or so matching documents, and then your costly query_string (costly because of the leading wildcard) runs only on those 50 docs.
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "*MyQuery*",
            "fields": [
              "aField^1.0",
              "anotherField^1.0",
              "thirdField^1.0"
            ],
            "boost": 1
          }
        }
      ],
      "filter": [
        {
          "term": {
            "userId": {
              "value": "a_user_id",
              "boost": 1
            }
          }
        },
        {
          "range": {
            "date": {
              "from": 1648598400000,
              "to": 1648684799999,
              "boost": 1
            }
          }
        }
      ],
      "boost": 1
    }
  }
}
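If you want to verify where the time goes after restructuring, you can profile the request; "profile": true is a standard search-body option (the index name here is hypothetical):
GET my-index/_search
{
  "profile": true,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "*MyQuery*",
            "fields": [ "aField", "anotherField", "thirdField" ]
          }
        }
      ],
      "filter": [
        { "term": { "userId": "a_user_id" } },
        { "range": { "date": { "gte": 1648598400000, "lte": 1648684799999 } } }
      ]
    }
  }
}
The response breaks down per-clause timings, so you can see whether the query_string clause is still dominating.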

How to correctly query inside of terms aggregate values in elasticsearch, using include and regex?

How do you filter out/search in aggregate results efficiently?
Imagine you have 1 million documents in Elasticsearch. In those documents, you have a multi-field (keyword, text) called tags:
{
  ...
  "tags": ["Race", "Racing", "Mountain Bike", "Horizontal"],
  ...
},
{
  ...
  "tags": ["Tracey Chapman", "Silverfish", "Blue"],
  ...
},
{
  ...
  "tags": ["Surfing", "Race", "Disgrace"],
  ...
}
You can use these values as filters (facets) against a query to pull only the documents that contain a given tag:
...
"filter": [
  {
    "terms": {
      "tags": [ "Race" ]
    }
  },
  ...
]
But you want the user to be able to query for possible tag filters. So if the user types race, the result should show (from the previous example) ['Race', 'Tracey Chapman', 'Disgrace']. That way, the user can query for a filter to use. In order to accomplish this, I had to use aggregations:
{
  "aggs": {
    "topics": {
      "terms": {
        "field": "tags",
        "include": ".*[Rr][Aa][Cc][Ee].*", // I have to dynamically form this
        "size": 6
      }
    }
  },
  "size": 0
}
This gives me exactly what I need! But it is slow, very slow. I've tried adding the execution_hint; it does not help me.
You may think, "Just use a query before the aggregation!" But the issue is that it'll pull all tag values from all documents matching that query. Meaning, you can be displaying tags that are completely unrelated. If I queried for race before the aggregation, and did not use the include regex, I would end up with all those other values, like 'Horizontal', etc...
How can I rewrite this aggregation to work faster? Is there a better way to write it? Do I really have to make a separate index just for the values? (sad face) It seems like this would be a common issue, but I have found no answers through documentation and googling.
You certainly don't need a separate index just for the values...
Here's my take on it:
What you're doing with the regex is essentially what should've been done by a tokenizer -- i.e. constructing substrings (or N-grams) such that they can be targeted later.
This means that the keyword Race will need to be tokenized into the n-grams ["rac", "race", "ace"]. (It doesn't really make sense to go any lower than 3 characters -- most autocomplete libraries choose to ignore fewer than 3 characters because the possible matches balloon too quickly.)
Elasticsearch offers the N-gram tokenizer but we'll need to increase the default index-level setting called max_ngram_diff from 1 to (arbitrarily) 10 because we want to catch as many ngrams as is reasonable:
PUT tagindex
{
  "settings": {
    "index": {
      "max_ngram_diff": 10
    },
    "analysis": {
      "analyzer": {
        "my_ngrams_analyzer": {
          "tokenizer": "my_ngrams",
          "filter": [ "lowercase" ]
        }
      },
      "tokenizer": {
        "my_ngrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  },
  "mappings": { ... }   --> see below
}
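To double-check what the analyzer emits, you can run a sample tag through the _analyze API against the index defined above:
POST tagindex/_analyze
{
  "analyzer": "my_ngrams_analyzer",
  "text": "Race"
}
This should return the lowercased 3-to-10-character grams (rac, race, ace).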
When your tags field is a list of keywords, it's simply not possible to aggregate on that field without resorting to the include option, which can take either exact matches or a regex (which you're already using). Now, we cannot guarantee exact matches, but we also don't want the regex! So that's why we need to use a nested list, which will treat each tag separately.
Now, nested lists are expected to contain objects so
{
  "tags": ["Race", "Racing", "Mountain Bike", "Horizontal"]
}
will need to be converted to
{
  "tags": [
    { "tag": "Race" },
    { "tag": "Racing" },
    { "tag": "Mountain Bike" },
    { "tag": "Horizontal" }
  ]
}
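If the original docs already live in another index, one way to perform this conversion during reindexing is an ingest pipeline with a small script (a sketch, assuming ES 5.x+; the pipeline name is made up):
PUT _ingest/pipeline/tags_to_nested
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "def out = []; for (def t : ctx.tags) { out.add(['tag': t]); } ctx.tags = out;"
      }
    }
  ]
}
You would then reference it from _reindex via "dest": { "index": "tagindex", "pipeline": "tags_to_nested" }.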
After that we'll proceed with the multi field mapping, keeping the original tags intact but also adding a .tokenized field to search on and a .keyword field to aggregate on:
"index": { ... },
"analysis": { ... },
"mappings": {
"properties": {
"tags": {
"type": "nested",
"properties": {
"tag": {
"type": "text",
"fields": {
"tokenized": {
"type": "text",
"analyzer": "my_ngrams_analyzer"
},
"keyword": {
"type": "keyword"
}
}
}
}
}
}
}
We'll then add our adjusted tags docs:
POST tagindex/_doc
{"tags":[{"tag":"Race"},{"tag":"Racing"},{"tag":"Mountain Bike"},{"tag":"Horizontal"}]}
POST tagindex/_doc
{"tags":[{"tag":"Tracey Chapman"},{"tag":"Silverfish"},{"tag":"Blue"}]}
POST tagindex/_doc
{"tags":[{"tag":"Surfing"},{"tag":"Race"},{"tag":"Disgrace"}]}
and apply a nested filter terms aggregation:
GET tagindex/_search
{
  "aggs": {
    "topics_parent": {
      "nested": {
        "path": "tags"
      },
      "aggs": {
        "topics": {
          "filter": {
            "term": {
              "tags.tag.tokenized": "race"
            }
          },
          "aggs": {
            "topics": {
              "terms": {
                "field": "tags.tag.keyword",
                "size": 100
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}
yielding
{
  ...
  "topics_parent" : {
    ...
    "topics" : {
      ...
      "topics" : {
        ...
        "buckets" : [
          {
            "key" : "Race",
            "doc_count" : 2
          },
          {
            "key" : "Disgrace",
            "doc_count" : 1
          },
          {
            "key" : "Tracey Chapman",
            "doc_count" : 1
          }
        ]
      }
    }
  }
}
Caveats
in order for this to work, you'll have to reindex
n-grams will increase the storage footprint; depending on how many tags per doc you have, this may become a concern
nested fields are internally treated as "separate documents", so this affects disk space too
P.S.: This is an interesting use case. Let me know how the implementation went!

ElasticSearch more_like_this - Are options run on the source or destination index?

A useful feature of the more_like_this query in ES is the ability to search across different indices, assuming field names and mappings correspond.
One thing that has me confused is how the Term Selection Parameters are applied in these situations.
Consider:
max_doc_freq
The maximum document frequency above which the terms will be ignored from the input document. This could be useful in order to ignore highly frequent words such as stop words. Defaults to unbounded (Integer.MAX_VALUE, which is 2^31-1 or 2147483647).
Is this the document frequency on the source document index? Or will it be applied to the index we are querying?
Example:
GET index_a/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "should": [
            {
              "more_like_this": {
                "boost": 1,
                "fields": [
                  "text"
                ],
                "include": true,
                "like": [
                  {
                    "_id": "tI2N_24BFVRF37fDxSTT",
                    "_index": "index_b"
                  }
                ],
                "max_doc_freq": 50000,
                "max_query_terms": 50,
                "min_term_freq": 1,
                "min_word_length": 4,
                "minimum_should_match": "1%",
                "stop_words": []
              }
            }
          ]
        }
      },
      "script_score": {
        "script": "1.0"
      }
    }
  }
}
max_doc_freq in this case is set to 50,000. But is this applied on index_a or on index_b?
That's considered in the rewrite phase of the query, so index_b. The rewrite phase rewrites the MLT query into a bool query.
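If you want to see what the MLT query gets rewritten into (and hence which terms survived the frequency cut-offs), one option is the validate API with rewriting enabled; the response contains the rewritten bool query:
GET index_a/_validate/query?rewrite=true
{
  "query": {
    "more_like_this": {
      "fields": [ "text" ],
      "like": [ { "_id": "tI2N_24BFVRF37fDxSTT", "_index": "index_b" } ],
      "max_doc_freq": 50000,
      "max_query_terms": 50,
      "min_term_freq": 1
    }
  }
}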

elasticsearch parent child extremely inefficient has_child query

I have a parent-child relationship in an ES index. The distribution in terms of the number of documents is around 20% for the parents (200M docs) and 80% children (1B docs). ES cluster has 5 nodes, each with 20GB RAM and 4 CPU cores. ES version is 1.5.2. We use 5 shards per index and 0 replication.
When I query it using has_child, processing is extremely slow: 170 seconds. However, when I just run over the parents, it takes less than a second.
This query takes far too long to return and causes timeouts within the application. I really care about the aggregations and time range filter.
I believe what is happening is that the query runs over every child first to do the filtering. In reality, I would like it to run over the parents first, check whether a single matching child document exists, and only then apply the filter on the children.
Setup
The _parent is an action that looks like this
{
"a": "m_field",
"b": "b_field",
"c": "c_field",
"d": "d_field"
}
The _child is a timestamp when that action has occurred
{
"date": "2016-07-07T11:11:11Z"
}
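For context, a 1.x parent/child mapping along these lines produces that setup (a sketch; the index name and the parent type name "action" are assumed, while the child type "evt" is taken from the query below):
PUT actions_2016_07
{
  "mappings": {
    "action": {
      "properties": {
        "a": { "type": "string" },
        "b": { "type": "string" },
        "c": { "type": "string" },
        "d": { "type": "string" }
      }
    },
    "evt": {
      "_parent": { "type": "action" },
      "properties": {
        "date": { "type": "date" }
      }
    }
  }
}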
These are typically stored in time-series indices; indices are split by month. An index usually takes around 70GB total on disk. We run the query against an alias, which combines all or some of the most recent indices.
Query
When I query I do a query_string on the _parent document to search for the keyword and a Range filter on the child, using the has_child query.
This looks like the following.
{
  "size": 0,
  "aggs": {
    "base_aggs": {
      "cardinality": {
        "field": "a"
      }
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "filtered": {
            "query": {
              "query_string": {
                "query": "*",
                "fields": [
                  "a",
                  "b",
                  "c",
                  "d",
                  "e"
                ],
                "default_operator": "and",
                "allow_leading_wildcard": true,
                "lowercase_expanded_terms": true
              }
            },
            "filter": {
              "has_child": {
                "type": "evt",
                "min_children": 1,
                "max_children": 1,
                "filter": {
                  "range": {
                    "date": {
                      "lte": "2016-07-06T23:59:59.000",
                      "gte": "2016-06-07T00:00:00.000"
                    }
                  }
                }
              }
            }
          }
        }
      ],
      "must_not": [
        {
          "term": {
            "b": {
              "value": ""
            }
          }
        },
        {
          "term": {
            "b": {
              "value": "__"
            }
          }
        }
      ]
    }
  }
}
So the query should match on my query_string with the entry "*" and have children that are between the two dates provided. Because I only care about the aggregations I do not return any documents, and I only need to match on a single child document.
Question
How can I improve the speed of the query?
"The performance of a has_child query or filter with the min_children or max_children parameters is much the same as a has_child query with scoring enabled."
https://www.elastic.co/guide/en/elasticsearch/guide/2.x/has-child.html#min-max-children
So I guess you would have to drop those parameters to speed up the query.
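That is, the has_child filter would become simply (same range filter, minus the min_children/max_children parameters):
"has_child": {
  "type": "evt",
  "filter": {
    "range": {
      "date": {
        "lte": "2016-07-06T23:59:59.000",
        "gte": "2016-06-07T00:00:00.000"
      }
    }
  }
}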

Terms aggregation (to achieve hierarchical faceting) query performance slow

I am indexing metric names in Elasticsearch. Metric names are of the form foo.bar.baz.aux. Here are the index settings I use:
{
  "index": {
    "analysis": {
      "analyzer": {
        "prefix-test-analyzer": {
          "filter": "dotted",
          "tokenizer": "prefix-test-tokenizer",
          "type": "custom"
        }
      },
      "filter": {
        "dotted": {
          "patterns": [
            "([^.]+)"
          ],
          "type": "pattern_capture"
        }
      },
      "tokenizer": {
        "prefix-test-tokenizer": {
          "delimiter": ".",
          "type": "path_hierarchy"
        }
      }
    }
  }
}
{
  "metrics": {
    "_routing": {
      "required": true
    },
    "properties": {
      "tenantId": {
        "type": "string",
        "index": "not_analyzed"
      },
      "unit": {
        "type": "string",
        "index": "not_analyzed"
      },
      "metric_name": {
        "index_analyzer": "prefix-test-analyzer",
        "search_analyzer": "keyword",
        "type": "string"
      }
    }
  }
}
The above analysis setup creates the following terms for the metric name foo.bar.baz:
foo
bar
baz
foo.bar
foo.bar.baz
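As a quick sanity check, you can run the analyzer directly (in 1.x/2.x the analyze API accepts the text as the request body):
curl -XGET 'http://localhost:9200/metrics_alias/_analyze?analyzer=prefix-test-analyzer&pretty' -d 'foo.bar.baz'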
If I have a bunch of metrics, like below,
a.b.c.d.e
a.b.c.d
a.b.m.n
x.y.z
I have to write a query to grab the nth level of tokens. In the example above
for level = 0, I should get [a, x]
for level = 1, with 'a' as first token I should get [b]
with 'x' as first token I should get [y]
for level = 2, with 'a.b' as first token I should get [c, m]
I couldn't think of any other way than to write a terms aggregation. To figure out the level-2 tokens of a.b, here is the query I came up with:
time curl -XGET http://localhost:9200/metrics_alias/metrics/_search\?pretty\&routing\=12345 -d '{
  "size": 0,
  "query": {
    "term": {
      "tenantId": "12345"
    }
  },
  "aggs": {
    "metric_name_tokens": {
      "terms": {
        "field": "metric_name",
        "include": "a[.]b[.][^.]*",
        "execution_hint": "map",
        "size": 0
      }
    }
  }
}'
This would result in the following buckets. I parse the output and grab [c, m] from there.
"buckets" : [ {
"key" : "a.b.c",
"doc_count" : 2
}, {
"key" : "a.b.m",
"doc_count" : 1
} ]
So far so good. The query works great for most tenants (notice the tenantId term query above). For certain tenants which have large amounts of data (around 1 million metrics), the performance is really slow. I am guessing all the terms aggregation work takes time.
I am wondering whether a terms aggregation is the right choice for this kind of data, and I'm also looking for other possible kinds of queries.
Some suggestions:
"mirror" the filter at the aggregations level in the query part as well. So, for a.b. matching, use the following as a query and keep the same aggs section:
"bool": {
"must": [
{
"term": {
"tenantId": 123
}
},
{
"prefix": {
"metric_name": {
"value": "a.b."
}
}
}
]
}
or even use a regexp query with the same regular expression as in the aggregation part. This way, the aggregation will have to evaluate fewer buckets, because fewer documents reach the aggregation phase.
You mentioned that regexp is working better for you; my initial guess was that the prefix query would perform better.
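A sketch of that regexp variant, reusing the aggregation's pattern in the query:
"bool": {
  "must": [
    { "term":   { "tenantId": 123 } },
    { "regexp": { "metric_name": "a[.]b[.][^.]*" } }
  ]
}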
change "size": 0 from aggregations to "size": 100. After testing you mentioned this doesn't make any difference
remove "execution_hint": "map" and let Elasticsearch use the defaults. After testing you mentioned that the default execution_hint was performing far worse.
the only other thing I could think of is to relieve the pressure at searching time by moving it at indexing time. What I mean by that: at indexing time, in your own application or whatever indexing method you are using, split the text to be indexed programaticaly (not ES doing it) and index each element in the hierarchy in a separate field. For example a.b in field2, a.b.c in field3 and so on. This for the same document. Then, at search time, you look at specific fields depending on what the search text is. This whole idea, though, requires some additional work outside ES.
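For example, a document indexed under that last approach might look like this (field names are illustrative):
{
  "tenantId": "12345",
  "metric_name": "a.b.c.d",
  "field1": "a",
  "field2": "a.b",
  "field3": "a.b.c",
  "field4": "a.b.c.d"
}
Finding the level-2 tokens under a.b then becomes a plain term filter on field2 plus a terms aggregation on field3, with no regex involved.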
Of all the suggestions above, the first one had the greatest impact: query response times improved from 23 seconds to 11 seconds.
