Elasticsearch script score impact on search performance

I'm in the process of optimizing search query performance, following the recommendations from https://www.elastic.co/guide/en/elasticsearch/reference/7.7/tune-for-search-speed.html
The query does the following:
Filters by multiple date fields
Optionally filters by category_ids
Is wrapped in a function score, where one of the functions is a script score
One of the cheapest optimizations suggested is rounding dates to improve query caching. I've rounded time down to minutes at the application level.
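For reference, the same rounding can also be done with date math in the query itself; the trailing "/m" rounds down to the minute, so queries issued within the same minute produce an identical, cacheable filter (a sketch, not my exact query):
{
  "range": {
    "created_at": {
      "gte": "now-1h/m"
    }
  }
}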
Another cheap optimization was mapping identifiers as keywords.
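For illustration, this is the kind of mapping change I tested (a sketch; the field name matches the query below):
{
  "mappings": {
    "properties": {
      "catalog_ids": { "type": "keyword" }
    }
  }
}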
I've tried both, and neither made a significant difference: looking at application performance metrics and the query slow logs, the change was negligible.
Mapping identifiers as keywords actually turned out slightly slower. However, I also ran a test where I eliminated all the functions and reran all the queries, and there the keyword identifiers outperformed the numeric identifiers.
The very same article suggests avoiding scripts, which I'll be doing next.
The fact that keyword identifiers did better than numeric identifiers without the functions, but worse with them, is suspicious, and I cannot explain it.
So in what way does a script score (function_score) impact the performance of the other queries?
This is a trimmed version of the query:
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "filter": [
            {
              "range": {
                "created_at": {
                  "gte": "2020-06-26T17:22:00"
                }
              }
            },
            {
              "terms": {
                "catalog_ids": [4, 178, 222, 532, 1078, 1131]
              }
            }
          ]
        }
      },
      "functions": [
        {
          "script_score": {
            "script": {
              "source": "1 / ln(now - doc['created_at'].value + 1)",
              "lang": "expression",
              "params": { "now": 1593184920000 }
            }
          }
        },
        {
          "filter": {
            "range": {
              "boost_until": {
                "gte": "2020-06-26T17:22:00"
              }
            }
          },
          "weight": 15.15
        }
      ],
      "score_mode": "multiply",
      "boost_mode": "sum"
    }
  }
}
Query duration differences with/without the function score:
These are all tests on a single-node cluster with 5M documents. Queries are taken from the slow query log.
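For context, the script in the query above is a reciprocal-log decay on document age. The script-free alternative I plan to try next is one of the built-in decay functions, e.g. replacing the script_score entry with something like this (a sketch, parameters still to be tuned):
{
  "gauss": {
    "created_at": {
      "origin": "now",
      "scale": "10d",
      "decay": 0.5
    }
  }
}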

Related

Elasticsearch: How to write an 'OR' clause in filter context?

I'm looking for syntax/examples compatible with ES version 6.7.
I have seen the docs, but I don't see any examples for this, and the explanation isn't clear enough for me. I have tried writing a query accordingly, but I keep getting syntax errors. I have already seen the questions below on SO, but they don't help me:
Filter context for should in bool query (Elasticsearch)
It doesn't have any example.
Multiple OR filter in Elasticsearch
I get a syntax error:
"type": "parsing_exception",
"reason": "no [query] registered for [filtered]",
"line": 1,
"col": 31
Maybe it's for a different version of ES.
All I need is a simple example with two ORed conditions (mine are one range and one term, but I guess that shouldn't matter much), both of which I would like to have in filter context (I don't care about scores or text search).
If you really need them, I can show my attempts (I'd need to remove some 'sensitive' (duh) parts before posting), but they produce parsing/syntax errors, so I don't think there is any sense in them. I am aware that questions which don't show any effort are considered bad for SO, but I don't see the logic in showing attempts that aren't even parsed successfully, and any example would help me understand the syntax.
You need to wrap your should clause in a bool query inside the filter clause:
{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              { // Query 1 },
              { // Query 2 }
            ]
          }
        }
      ]
    }
  }
}
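For instance, with one range and one term condition (field names here are placeholders), the ORed version looks like:
{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              { "range": { "my_date": { "gte": "now-1d" } } },
              { "term": { "my_field": "my_value" } }
            ]
          }
        }
      ]
    }
  }
}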
I had a similar scenario (even the same range and match filters), with one more nesting level: two conditions to be ORed (as in your case) and another condition to be logically ANDed with their result. As @Pierre-Nicolas Mougel suggested in another answer, I nested bool clauses with one more level around the should clause.
{
  "_source": [
    "my_field"
  ],
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must": [
            {
              "bool": {
                "should": [
                  {
                    "range": {
                      "start": {
                        "gt": "1558878457851",
                        "lt": "1557998559147"
                      }
                    }
                  },
                  {
                    "range": {
                      "stop": {
                        "gt": "1558898457851",
                        "lt": "1558899559147"
                      }
                    }
                  }
                ]
              }
            },
            {
              "match": {
                "my_id": "<My_Id>"
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "from": 0,
  "size": -1,
  "sort": [],
  "aggs": {}
}
I read in the docs that minimum_should_match can also be used to force at least one of the should clauses to match. That might help if this query doesn't work.
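For completeness, a sketch of that variant; minimum_should_match: 1 requires at least one of the ORed clauses to match even inside a larger bool:
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "should": [
            { "range": { "start": { "gt": "1558878457851" } } },
            { "match": { "my_id": "<My_Id>" } }
          ],
          "minimum_should_match": 1
        }
      }
    }
  }
}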

Elasticsearch Query for good title keyword results

We have an Elasticsearch index containing a catalog of products that we want to search by title and description.
We want it to satisfy the following constraints:
We are searching title and description for occurrences (matches in title should be twice as important as matches in description).
We want it to have a very light fuzzy search (but still accurate results).
Results not matching the search term should not be filtered out, but only shown later (so matching results should be on top and worse results at the bottom).
category_id should filter products out (so no results from other categories should be shown).
The created_at attribute should also be weighted heavily in sorting: products should lose score the "older" they get. (This is very important, because they lose importance with every day.)
I have tried to create a query like that, but the results are far from accurate, sometimes including completely unrelated items. I think that's because of the wildcard queries. Also, I think there must be a more elegant solution for the created_at scoring, right?
I am using Elasticsearch 6.2.
This is my current code. I would be happy to see a more elegant solution for this:
{
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    }
  ],
  "min_score": 0.3,
  "size": 12,
  "from": 0,
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "category_id": [
            "212",
            "213"
          ]
        }
      },
      "should": [
        {
          "match": {
            "title_completion": {
              "query": "Development",
              "boost": 20
            }
          }
        },
        {
          "wildcard": {
            "title": {
              "value": "*Development*",
              "boost": 1
            }
          }
        },
        {
          "wildcard": {
            "title_completion": {
              "value": "*Development*",
              "boost": 10
            }
          }
        },
        {
          "match": {
            "title": {
              "query": "Development",
              "operator": "and",
              "fuzziness": 1
            }
          }
        },
        {
          "range": {
            "created_at": {
              "gte": 1563264817998,
              "boost": 11
            }
          }
        },
        {
          "range": {
            "created_at": {
              "gte": 1563264040398,
              "boost": 4
            }
          }
        },
        {
          "range": {
            "created_at": {
              "gte": 1563256264398,
              "boost": 1
            }
          }
        }
      ]
    }
  }
}
First of all, building a request that returns relevant results is usually a difficult task; it can't be done without knowing the content of the documents. That said, I can give you hints to fulfill your requirements and avoid irrelevant results.
We are searching title and description for occurrences (matches in title should be twice as important as matches in description)
You can use boost as you did in your query to give more importance to matches on title compared to description.
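For example, a multi_match query supports per-field boosts directly (a sketch; it assumes a description field exists alongside title):
{
  "multi_match": {
    "query": "Development",
    "fields": ["title^2", "description"]
  }
}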
We want it to have a very light fuzzy search result (but still accurate results)
You should use the AUTO value for the fuzziness field to get different amounts of fuzziness depending on the length of the term. E.g., by default, terms of fewer than 3 letters (the most common terms, where a one-letter change can produce a different word) will not allow any changes. Terms with more than 3 letters will allow one change, and terms with more than 5 letters will allow 2 changes. You can tune this behavior depending on your tests.
Results not matching the search term should not be filtered out, but only shown later (so matching results should be on top and worse results at the bottom)
Use a should clause in the bool statement. Clauses in a should statement do not filter documents (unless specified otherwise); the queries in a should clause are only used to increase the score.
category_id should filter products out (so no results from other categories should be shown)
Use a must or filter clause in the bool statement to ensure that all documents satisfy a constraint. If you don't want the subqueries to contribute to the score (I believe that's your case), use filter instead of must, because filter will be able to cache the results. Your query is OK for this requirement.
The created_at attribute should also be weighted heavily in sorting: products should lose score the "older" they get. (This is very important, because they lose importance with every day.)
You should use a function score with a decay function. If decay functions are not clear to you, you can skip the equations in the documentation and jump to the figures, which are self-explanatory. The following query is an example using a gauss decay function.
{
  "function_score": {
    // Name of the decay function
    "gauss": {
      // Field to use
      "created_at": {
        "origin": "now", // "now" is the default so you can omit this field
        "offset": "1d",  // Values less than 1 day old will not be impacted
        "scale": "10d",  // Duration over which the scores decay following a gauss function
        "decay": 0.01    // Score for values further away than scale
      }
    }
  }
}
Hints for writing queries
Avoid wildcard queries: if you use *, they are not efficient and will consume a lot of memory. If you want to be able to search within part of a term (finding "penthouse" when the user searches for "house"), you should create a subfield using an ngram tokenizer and write a standard match query against the subfield (see the mapping sketch after these hints).
Avoid setting a minimum score: the score is a relative value. A small or a high score does not by itself mean that the document is relevant or not. You can read this article about the subject.
Be careful with fuzzy queries: fuzziness can generate a lot of noise and confuse users. In general, I would recommend increasing the default AUTO thresholds for fuzziness and accepting that some queries with misspellings won't return good results. Usually, it is easier for a user to spot a misspelling in their input than to understand why they got completely unrelated results.
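As referenced above, a sketch of a mapping with an ngram subfield (analyzer names and the min/max gram sizes are illustrative and need tuning for your data):
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title_completion": {
          "type": "text",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "my_ngram_analyzer"
            }
          }
        }
      }
    }
  }
}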
Example query
This is just an example that you will need to adapt to your data.
{
  "size": 12,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "filter": {
            "terms": {
              "category_id": <CATEGORY_IDS>
            }
          },
          "should": [
            {
              "match": {
                "title": {
                  "query": <QUERY>,
                  "fuzziness": "AUTO:4,12",
                  "boost": 3
                }
              }
            },
            {
              "match": {
                "title_completion": {
                  "query": <QUERY>,
                  "boost": 1
                }
              }
            },
            {
              "match": {
                // title_completion subfield with the ngram tokenizer
                "title_completion.ngram": {
                  "query": <QUERY>,
                  // Use a lower boost because it matches only partially
                  "boost": 0.5
                }
              }
            }
          ]
        }
      },
      // Name of the decay function
      "gauss": {
        // Field to use
        "created_at": {
          "origin": "now", // "now" is the default so you can omit this field
          "offset": "1d",  // Values less than 1 day old will not be impacted
          "scale": "10d",  // Duration over which the scores decay following a gauss function
          "decay": 0.01    // Score for values further away than scale
        }
      }
    }
  }
}

Filter/aggregate one elasticsearch index of time series data by timestamps found in another index

The Data
So I have reams of different types of time series data. Currently I've chosen to put each type of data into its own index because, with the exception of 4 fields, the data is all very different. Also, the data is sampled at different rates and is not guaranteed to have common timestamps across the same sub-second window, so fusing it all into one large document is not a trivial task either.
The Goal
One of our common use cases that I'm trying to solve entirely in Elasticsearch is to return an aggregation result from one index based on the time windows returned by a query on another index. Pictorially, this is what I want to accomplish.
Some Considerations
For small enough signal transitions in the "condition" data, I can just use a date histogram and some combination of a top_hits sub-aggregation, but this quickly breaks down when I have 10,000s or 100,000s of occurrences of "the condition". Further, this is just one case; I have hundreds of sets of similar situations that I'd like to get the overall min/max from.
The comparisons are basically among what I would consider sibling-level documents or indices, so there doesn't seem to be any obvious parent->child relationship that would be flexible enough over the long run, at least with how the data is currently structured.
It feels like there should be an elegant solution, instead of brute-force building the date ranges outside of Elasticsearch from the results of one query and feeding hundreds of time ranges into another query.
Looking through the documentation, it feels like some combination of Elasticsearch scripting and some of the pipeline aggregations is what I want, but no definitive solution is jumping out at me. I could really use some pointers in the right direction from the community.
Thanks.
I found a "solution" that worked for me for this problem. No answers or even comments from anyone yet, but I'll post my solution in case someone else comes along looking for something like this. I'm sure there is a lot of opportunity for improvement and optimization, and if I discover such a solution (likely through a scripted aggregation) I'll come back and update it here.
It may not be the optimal solution, but it works for me. The key was to leverage the top_hits, serial_diff and bucket_selector aggregations.
The "solution"
# 'es' is assumed to be a standard Elasticsearch client instance
from elasticsearch import Elasticsearch

es = Elasticsearch()

def time_edges(index, must_terms=[], should_terms=[], filter_terms=[], data_sample_accuracy_window=200):
    """
    Find the affected flights and date ranges where a specific set of terms occurs in a particular ES index.

    index: the Elasticsearch index to search
    must_terms / should_terms / filter_terms: lists of query dictionaries of the form {"term": {"<termname>": <value>}}
    data_sample_accuracy_window: minimum gap (ms) between samples for them to count as separate occurrences
    """
    query = {
        "size": 0,
        "timeout": "5s",
        "query": {
            "constant_score": {
                "filter": {
                    "bool": {
                        "must": must_terms,
                        "should": should_terms,
                        "filter": filter_terms
                    }
                }
            }
        },
        "aggs": {
            "by_flight_id": {
                "terms": {"field": "flight_id", "size": 1000},
                "aggs": {
                    "last": {
                        "top_hits": {
                            "sort": [{"@timestamp": {"order": "desc"}}],
                            "size": 1,
                            "script_fields": {
                                "timestamp": {
                                    "script": "doc['@timestamp'].value"
                                }
                            }
                        }
                    },
                    "first": {
                        "top_hits": {
                            "sort": [{"@timestamp": {"order": "asc"}}],
                            "size": 1,
                            "script_fields": {
                                "timestamp": {
                                    "script": "doc['@timestamp'].value"
                                }
                            }
                        }
                    },
                    "time_edges": {
                        "histogram": {
                            "min_doc_count": 1,
                            "interval": 1,
                            "script": {
                                "inline": "doc['@timestamp'].value",
                                "lang": "painless"
                            }
                        },
                        "aggs": {
                            "timestamps": {
                                "max": {"field": "@timestamp"}
                            },
                            "timestamp_diff": {
                                "serial_diff": {
                                    "buckets_path": "timestamps",
                                    "lag": 1
                                }
                            },
                            "time_delta_filter": {
                                "bucket_selector": {
                                    "buckets_path": {
                                        "timestampDiff": "timestamp_diff"
                                    },
                                    "script": "if (params != null && params.timestampDiff != null) { params.timestampDiff > " + str(data_sample_accuracy_window) + " } else { false }"
                                }
                            }
                        }
                    }
                }
            }
        }
    }
    return es.search(index=index, body=query)
Breaking things down
Filter the results by "Index 2"
"query": {
"constant_score": {
"filter": {
"bool": {
"must": must_terms,
"should": should_terms,
"filter": filter_terms
}
}
}
},
must_terms contains the required conditions for getting all the results for "the condition" stored in "Index 2".
For example, to limit results to the last 10 days and to condition values of 10 or 12, we add the following must_terms:
must_terms = [
    {
        "range": {
            "@timestamp": {
                "gte": "now-10d",
                "lte": "now"
            }
        }
    },
    {
        "terms": {"condition": [10, 12]}
    }
]
This returns a reduced set of documents that we can then pass into our aggregations to figure out where our "samples" are.
Aggregations
For my use case we have the notion of "flights" for our aircraft, so I wanted to group the returned results by their id and then "break up" all the occurrences into buckets.
"aggs": {
"by_flight_id": {
"terms": {"field": "flight_id", "size": 1000},
...
}
}
}
You can get the rising edge of the first occurrence and the falling edge of the last occurrence using the top_hits aggregation:
"last": {
"top_hits": {
"sort": [{"#timestamp": {"order": "desc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
"first": {
"top_hits": {
"sort": [{"#timestamp": {"order": "asc"}}],
"size": 1,
"script_fields": {
"timestamp": {
"script": "doc['#timestamp'].value"
}
}
}
},
You can get the samples in between using a histogram on the timestamp. This breaks up your returned results into buckets for every unique timestamp. It is a costly aggregation, but worth it. Using an inline script lets us use the timestamp value as the bucket name.
"time_edges": {
"histogram": {
"min_doc_count": 1,
"interval": 1,
"script": {
"inline": "doc['#timestamp'].value",
"lang": "painless",
}
},
...
}
By default, the histogram aggregation returns a set of buckets with only the document count for each bucket, but the serial_diff aggregation needs a metric value to work on, so we have to add a token max aggregation on the results to get a value returned.
"aggs": {
"timestamps": {
"max": {"field": "#timestamp"}
},
"timestamp_diff": {
"serial_diff": {
"buckets_path": "timestamps",
"lag": 1
}
},
...
}
We use the results of the serial_diff to determine whether or not two buckets are approximately adjacent. We then discard samples that are adjacent to each other and create a combined time range for our condition using the bucket_selector aggregation. This throws out buckets whose time delta is smaller than our data_sample_accuracy_window; this value depends on your dataset.
"aggs": {
...
"time_delta_filter": {
"bucket_selector": {
"buckets_path": {
"timestampDiff": "timestamp_diff"
},
"script": "if (params != null && params.timestampDiff != null) { params.timestampDiff > " + str(data_sample_accuracy_window) + "} else { false }"
}
}
}
The serial_diff results are also critical for determining how long our condition was set. The timestamps of our buckets end up representing the "rising" edge of our condition signal, so the falling edge is unknown without some post-processing; we use the timestampDiff value to figure out where the falling edge is.
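If you want ES to hand you the falling edge as well, one untested option (a sketch, not part of my working solution) would be a bucket_script pipeline aggregation alongside the bucket_selector, subtracting the diff from the rising-edge timestamp:
"falling_edge": {
    "bucket_script": {
        "buckets_path": {
            "rising": "timestamps",
            "diff": "timestamp_diff"
        },
        "script": "params.rising - params.diff"
    }
}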

elasticsearch parent child extremely inefficient has_child query

I have a parent-child relationship in an ES index. The distribution in terms of the number of documents is around 20% parents (200M docs) and 80% children (1B docs). The ES cluster has 5 nodes, each with 20GB RAM and 4 CPU cores. The ES version is 1.5.2. We use 5 shards per index and 0 replicas.
When I query it using has_child, processing is extremely slow: 170 seconds. However, when I just run over the parents it takes less than a second.
This query takes far too long to return and causes timeouts within the application. I really care about the aggregations and time range filter.
I believe what is happening is that the query runs over every child first to do the filtering. In reality, I would like it to run over the parents first, check whether there is at least a single matching child, and only then apply the filter on the children.
Setup
The _parent is an action that looks like this:
{
    "a": "m_field",
    "b": "b_field",
    "c": "c_field",
    "d": "d_field"
}
The _child is a timestamp of when that action occurred:
{
    "date": "2016-07-07T11:11:11Z"
}
These are typically stored in time series indices, one index per month. An index usually takes around 70GB total on disk. We choose to run queries over an alias, which combines all or some of the most recent indices.
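For reference, the alias is maintained with the standard aliases API, along these lines (the index and alias names here are illustrative):
POST /_aliases
{
  "actions": [
    { "add": { "index": "actions-2016.06", "alias": "actions-recent" } },
    { "add": { "index": "actions-2016.07", "alias": "actions-recent" } }
  ]
}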
Query
When I query, I do a query_string on the _parent document to search for the keyword, and a range filter on the child via the has_child query.
It looks like the following:
{
  "size": 0,
  "aggs": {
    "base_aggs": {
      "cardinality": {
        "field": "a"
      }
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "filtered": {
            "query": {
              "query_string": {
                "query": "*",
                "fields": [
                  "a",
                  "b",
                  "c",
                  "d",
                  "e"
                ],
                "default_operator": "and",
                "allow_leading_wildcard": true,
                "lowercase_expanded_terms": true
              }
            },
            "filter": {
              "has_child": {
                "type": "evt",
                "min_children": 1,
                "max_children": 1,
                "filter": {
                  "range": {
                    "date": {
                      "lte": "2016-07-06T23:59:59.000",
                      "gte": "2016-06-07T00:00:00.000"
                    }
                  }
                }
              }
            }
          }
        }
      ],
      "must_not": [
        {
          "term": {
            "b": {
              "value": ""
            }
          }
        },
        {
          "term": {
            "b": {
              "value": "__"
            }
          }
        }
      ]
    }
  }
}
So the query should match my query_string with the entry "*" and have children between the two dates provided. Because I only care about the aggregations, I do not return any documents, and I only need to match a single child document.
Question
How can I improve the speed of the query?
The performance of a has_child query or filter with the min_children or max_children parameters is much the same as a has_child query with scoring enabled.
https://www.elastic.co/guide/en/elasticsearch/guide/2.x/has-child.html#min-max-children
So I guess you would have to drop those parameters to speed up the query.
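Concretely, that means reducing the filter to a plain has_child (a sketch of just the changed fragment):
"filter": {
  "has_child": {
    "type": "evt",
    "filter": {
      "range": {
        "date": {
          "lte": "2016-07-06T23:59:59.000",
          "gte": "2016-06-07T00:00:00.000"
        }
      }
    }
  }
}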

elasticsearch function score, boost weight of "number of matched terms in query" (coordination)

I want to use the Elasticsearch function score for customized scoring, and these are my priorities for ranking:
Number of terms in common with the query (for example, a document which has 3 of 4 query terms should be ranked higher than a document which has 2 of 4, no matter how high the tf/idf score of each term is). In the Elastic documentation this is called the coordination factor.
Sum of the relevancy of terms (tf/idf).
Document popularity (number of votes for each document, as described in boosting by popularity).
This is the request body currently used for Elasticsearch:
body = {
    "query": {
        "function_score": {
            "query": {"match": {"text": query}},
            "functions": [
                {
                    "field_value_factor": {
                        "field": "ducoumnet_popularity"
                    }
                }
            ]
        }
    }
}
The problem is that the first priority is not satisfied by this request. For example, document A could have fewer terms in common with the query than document B, but because its common terms have a higher tf/idf score, document A is ranked higher than document B.
To prevent this, I think the best way is to boost the score of documents by the coordination factor. Is there any way to do this? Something similar to this request:
body = {
    "query": {
        "function_score": {
            "query": {"match": {"text": query}},
            "functions": [
                {
                    "field_value_factor": {
                        "field": "ducoumnet_popularity"
                    }
                },
                {
                    "field_value_factor": {
                        "field": "_coordination",
                        "weight": 10
                    }
                }
            ]
        }
    }
}
I didn't find an exact answer to this question, but it may help someone to know that you can enforce a minimum precision for documents in the result using minimum_should_match.
{
  "query": {
    "match": {
      "content": {
        "query": "quick brown dog",
        "minimum_should_match": "75%"
      }
    }
  }
}
It accepts many different configurations. More explanation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-minimum-should-match.html
