How does Elasticsearch aggregate or weight scores from two sub queries ("bool query" and "decay function")

I have a complicated Elasticsearch query like the following example. This query has two sub queries: a weighted bool query and a decay function. I am trying to understand how Elasticsearch aggregates the scores from the two sub queries. If I run the first sub query alone (the weighted bool query), my top score is 20. If I run the second sub query alone (the decay function), my score is 1. However, if I run both sub queries together, my top score is 15. Can someone explain this?
My second, related question: how can I weight the scores from the two sub queries?
query = {
    "function_score": {
        "query": {
            "bool": {
                "should": [
                    {"match": {"title": {"query": "Quantum computing", "boost": 1}}},
                    {"match": {"author": {"query": "Richard Feynman", "boost": 2}}}
                ]
            }
        },
        "functions": [
            {
                "exp": {  # a built-in exponential decay function
                    "publication_date": {
                        "origin": "2000-01-01",
                        "offset": "7d",
                        "scale": "180d",
                        "decay": 0.5
                    }
                }
            }
        ]
    }
}

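I found the answer myself by reading the Elasticsearch documentation on function_score. function_score has a parameter, boost_mode, that specifies how the query score and the function score are combined. By default, boost_mode is set to multiply, so each document's query score is multiplied by its own decay value; the combined top score (15) is lower than the query-only top score (20) whenever the relevant documents have a decay value below 1.
Besides the default multiply method, we could also set boost_mode to avg and add a weight parameter to the decay function exp above. The weight multiplies the function's score, and avg then takes the plain average of the two values, so the combined score will be: ( the_bool_query_score + the_decay_function_score * weight ) / 2. (A weighted average that divides by the sum of the weights is what score_mode avg does when combining several functions with each other.)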
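As a rough illustration of the combination step, here is a minimal Python sketch of how boost_mode merges a query score with a function score. The helper name combine and the decay value 0.75 are assumptions chosen only to show how a query score of 20 could end up as 15. To my understanding, avg averages just two numbers (the query score and the already weighted function score), which is why the divisor here is 2.

```python
def combine(query_score, func_score, boost_mode="multiply"):
    """Combine a query score and a function score as boost_mode describes."""
    if boost_mode == "multiply":
        return query_score * func_score
    if boost_mode == "sum":
        return query_score + func_score
    if boost_mode == "avg":
        return (query_score + func_score) / 2
    if boost_mode == "replace":
        return func_score
    raise ValueError(f"unsupported boost_mode: {boost_mode}")


# Hypothetical document: bool query score 20, decay value 0.75.
print(combine(20.0, 0.75))  # 15.0

# A weight on the decay function scales its score before combining.
weight = 3.0
print(combine(20.0, 0.75 * weight, "avg"))  # 11.125
```

Raising the weight gives the decay function more influence under sum and avg; under multiply it only rescales every document by the same factor, so the ranking does not change.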


Elasticsearch "size" value not working in terms aggregation with partitions

I am trying to paginate over a specific field using the terms aggregation with partitions.
The problem is that the number of returned terms for each partition is not equal to the size parameter that I set.
These are the steps that I am doing:
Retrieve the number of different unique values for the field with "cardinality" aggregation.
In my data, the result is 21.
From the web page, the user wants to display a table with 10 items per page.
if unique_values % page_size != 0:
    partitions_number = (unique_values // page_size) + 1
else:
    partitions_number = unique_values // page_size
Then I run this simple query:
POST my_index/_search?pretty
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "field_to_paginate": "foo"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_pchostname": {
      "terms": {
        "size": 10,
        "field": "field_to_paginate",
        "include": {
          "partition": 0,
          "num_partitions": 3
        }
      }
    }
  }
}
I am expecting to retrieve 10 results. But if I run the query I have only 7 results.
What am I missing here? Do I need to use a different solution here?
As a side note, I can't use composite aggregation because I need to sort results by doc_count over the whole dataset.
Partitions in a terms aggregation divide the values into roughly equal chunks.
In your case num_partitions is 3, so each partition holds about 21 / 3 = 7 terms.
Partitions are meant for fields with a large number of values, on the order of thousands.
You may be able to leverage the shard_size parameter. My suggestion is to read the part of the manual that covers it and experiment with shard_size.
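The arithmetic behind the 7 results can be sketched in a few lines of Python. The function name partition_plan is hypothetical, and the per-partition figure is only an expected average: as far as I know, terms are assigned to partitions by hashing, so individual partitions can hold slightly more or fewer terms.

```python
import math


def partition_plan(unique_values, page_size):
    """Return (num_partitions, expected average number of terms per partition)."""
    num_partitions = math.ceil(unique_values / page_size)
    expected_terms = unique_values / num_partitions
    return num_partitions, expected_terms


# 21 unique values with a page size of 10, as in the question.
print(partition_plan(21, 10))  # (3, 7.0)
```

So the size of 10 acts as an upper bound on what a partition returns, not as a guarantee that each partition contains 10 terms.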
Terms aggregation does not allow pagination. Use composite aggregation instead (requires ES >= 6.1.0). Below is the quote from reference docs:
If you want to retrieve all terms or all combinations of terms in a
nested terms aggregation you should use the Composite aggregation
which allows to paginate over all possible terms rather than setting a
size greater than the cardinality of the field in the terms
aggregation. The terms aggregation is meant to return the top terms
and does not allow pagination.

Can Elasticsearch do a decay search on the log of a value?

I store a number, views, in Elasticsearch. I want to find documents "closest" to it on a logarithmic scale, so that 10k and 1MM are the same distance (and get scored the same) from 100k views. Is that possible?
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#exp-decay describes field value factor and decay functions but can they be "stacked"? Is there another approach?
I'm not sure if you can achieve this directly with decay, but you could easily do it with the script_score function. The example below uses dynamic scripting, but please be aware that using file-based scripts is the recommended, far more secure approach.
In the query below, the offset parameter is set to 100,000, and documents with that value in their 'views' field score the highest. The score decays as the logarithm of views moves away from the logarithm of offset. Per your example, documents with 1,000,000 or 10,000 views get identical scores (0.30279312 with this formula, taking _score as 1).
You can invert the order of these results by changing the script to multiply _score by the denominator instead of dividing by it.
$ curl -XPOST localhost:9200/somestuff/_search -d '{
  "size": 100,
  "query": {
    "bool": {
      "must": [
        {
          "function_score": {
            "functions": [
              {
                "script_score": {
                  "params": {
                    "offset": 100000
                  },
                  "script": "_score / (1 + ((log(offset) - log(doc[\"views\"].value)).abs()))"
                }
              }
            ]
          }
        }
      ]
    }
  }
}'
Note: you may want to account for the possibility of 'views' being null, depending on your data.
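To check that the formula treats 10k and 1MM symmetrically around 100k, the script's arithmetic can be reproduced in plain Python. This is only a sketch of the math, not of how Elasticsearch evaluates scripts; the function name log_distance_score is hypothetical, and base_score stands in for _score, assumed to be 1 here.

```python
import math


def log_distance_score(views, offset=100_000, base_score=1.0):
    """Mirror the script: _score / (1 + |log(offset) - log(views)|), natural log."""
    return base_score / (1 + abs(math.log(offset) - math.log(views)))


for views in (10_000, 100_000, 1_000_000):
    print(views, round(log_distance_score(views), 4))
# 10000 0.3028
# 100000 1.0
# 1000000 0.3028
```

Because only the difference of logarithms enters the formula, the choice of log base changes how fast the score falls off but not the symmetry.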

Boosting only results with a near-identical score in Elasticsearch

I'm using the following query to search through a database of names, allowing fuzzy matching but giving preference to exact matches.
"query": {
  "bool": {
    "should": [
      {
        "match": {
          "name": {
            "query": "x",
            "operator": "and",
            "boost": 10
          }
        }
      },
      {
        "match": {
          "name": {
            "query": "x",
            "fuzziness": "AUTO",
            "operator": "and"
          }
        }
      },
      {
        "match": {
          "altname": {
            "query": "x",
            "fuzziness": "AUTO",
            "operator": "and"
          }
        }
      }
    ]
  }
}
The database contains entries with identical names. If that happens, I would like to boost those entries by a second field, let's call it weight. However, I only want the boost to be applied between the subset of results with a (near) identical score, not to all of the results.
This is further complicated by the fact that results with an identical name may receive a slightly different score, as they are influenced by the relevancy on the altname field.
For example, querying for dog could give 3 results:
Dog [id 1, score 2.3, weight 10]
Dog [id 2, score 2.2, weight 20]
Doge [id 3, score 1, weight 100]
I'm looking for a query that would boost the result with id 2 to the top score. The result with id 3 should always stay at the bottom due to its poor relevancy, regardless of its weight. Ideally with tunable parameters to tweak the factor of the score vs. the factor of the weight.
Any way to do this in a single pass in Elasticsearch, of course without ruining performance?
Looks like I figured it out.
First, I realised that the example in my original question was more complex than necessary. I narrowed it down to: "How to compose a query for 'blub' that returns the following documents in the order 2, 1, 3"
id: 1
name: blub
weight: 0.01
---
id: 2
name: blub
weight: 0.1
---
id: 3
name: blub stuff
weight: 1
Thus: for the two documents with an identical (or very similar) score, the weight should be used as a tie-breaker. But documents with a significantly lower score should never be allowed to trump other results, regardless of their weight.
I loaded the data in the excellent Play tool: https://www.found.no/play/gist/edd93c69c015d4c62366#search and started experimenting.
Turned out the log2p modifier did exactly what I expected. Repeated it on a real-world dataset and everything looks exactly as expected.
function_score:
  query:
    match:
      name: blub
  field_value_factor:
    field: weight
    modifier: log2p
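A quick way to see why log2p behaves like a gentle tie-breaker is to compute the factor it produces for the three example weights. The sketch below assumes the documented definition of log2p (add 2 to the field value, then take the base-10 logarithm) and the default boost_mode of multiply; the helper name log2p is just a local function for illustration.

```python
import math


def log2p(value):
    """field_value_factor's log2p modifier: log10(value + 2)."""
    return math.log10(value + 2)


for weight in (0.01, 0.1, 1):
    print(weight, round(log2p(weight), 4))
# 0.01 0.3032
# 0.1 0.3222
# 1 0.4771
```

A hundredfold difference in weight changes the multiplier by a factor of only about 1.6, so weight reorders documents whose query scores are close together but cannot lift a document whose query score is far lower.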

Elasticsearch - Nested Query Boost in function_score?

My question is about the boost function in Elasticsearch (I've read their docs, and it's still quite unclear). Will the following "boost_mode" : "sum" apply to the boosts within the matches? Or, since it's outside the enclosure, is it just the sum of the final result, which is the same as the default? I've got many fields and a vector of values, and I want the scoring to be additive and not multiplicative. If the following does not work, any suggestions or pointers would be appreciated. Thanks!
"""
  | "query": {
  |   "function_score": {
  |     "boost_mode": "sum",
  |     "query": {
  |       "bool": {
  |         "should": [
  |           { "match": { "someField": { "query": "someValue", "boost": 2 } } },
  |           { "match": { "someOtherField": { "query": "someOtherValue", "boost": 3 } } }
  |         ]
  |       }
  |     }
  |   }
  | }
"""
The way the sum boost mode works is that it computes the score according to the following formula:
queryBoost * (queryScore + Math.min(funcScore, maxBoost))
where:
queryBoost is the value of the boost parameter inside your function score, since there is none, it defaults to 1.0f
queryScore is the normal score of the query, in your case it's variable and depends on the searched terms and the additional boost you're setting in your match queries
funcScore is the result of the multiplication of the score of each of your filter functions, defaults to 1.0f
maxBoost is the value of the max_boost parameter inside your function score, since there is none, it defaults to Float.MAX_VALUE
Also worth noting: since you have no filter functions, there is no funcScore to compute and the overall score is simply the queryScore. So, based on the above, the formula can be simplified to
queryScore
which means that, in the end, your overall score is just your query score.
It is also a good idea to pass ?explain=true in your query to get more insight into how the score was computed. In your case, since you have no filter functions, the boost_mode is simply not used at all and the query score is returned instead.
If you were to add a functions parameter with one or more score functions, then the result would be different as a funcScore could be computed.
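The formula above can be written out as a small Python sketch. The function name function_score_sum and the sample scores are assumptions for illustration; the no-functions branch follows the answer's statement that the query score is returned unchanged, and multiple function scores are multiplied together as described for funcScore.

```python
import math


def function_score_sum(query_score, func_scores=(), query_boost=1.0,
                       max_boost=float("inf")):
    """queryBoost * (queryScore + min(funcScore, maxBoost)) for boost_mode sum."""
    if not func_scores:
        # No functions: boost_mode is not used, the query score is returned as is.
        return query_score
    func_score = math.prod(func_scores)  # function scores are multiplied together
    return query_boost * (query_score + min(func_score, max_boost))


print(function_score_sum(2.5))                             # 2.5
print(function_score_sum(2.5, [0.5, 3.0]))                 # 4.0
print(function_score_sum(2.5, [0.5, 3.0], max_boost=1.0))  # 3.5
```

The boosts inside the individual match clauses only affect query_score; boost_mode decides how that single number is merged with the function score afterwards.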

Constant Score Query elasticsearch boosting

My understanding of Constant Score Query in elasticsearch is that boost factor would be assigned as score for every matching query. The documentation says:
A query that wraps a filter or another query and simply returns a constant score equal to the query boost for every document in the filter.
However, when I send this query:
"query": {
  "constant_score": {
    "filter": {
      "term": {
        "source": "BBC"
      }
    },
    "boost": 3
  }
},
"fields": ["title", "source"]
all the matching documents are given a score of 1?! I cannot figure out what I am doing wrong, and had also tried with query instead of filter in constant_score.
Scores are only meant to be relative to all other scores in a given result set, so a result set where everything has the score of 3 is the same as a result set where everything has the score of 1.
Really, the only purpose of the relevance _score is to sort the results of the current query in the correct order. You should not try to compare the relevance scores from different queries. - Elasticsearch Guide
Either the constant score is being ignored because it's not being combined with another query, or it's being normalized. As @keety said, check the output of explain to see exactly what's going on.
A constant score query gives an equal score to every matching document, irrespective of scoring factors such as TF and IDF. Use it when you don't care how well a document matched, only whether it matched, but still want a score assigned, unlike a plain filter.
If you want the score to be literally 3 for all documents matching a particular query, then you should use a function score query, something like
"query": {
  "function_score": {
    "functions": [
      {
        "filter": { "term": { "source": "BBC" } },
        "weight": 3
      }
    ]
  }
  ...
}
