Elasticsearch FunctionScore Query ScoreMode isn't working as expected

We have a function score query with about 50 functions. Each function has a filter and a script_score. We have set score_mode to sum.
Mappings:
"keywords": {
"type": "nested",
"include_in_parent": true,
"properties": {
"id": {
"type": "string",
"index_name": "id",
"analyzer": "standard"
},
"name": {
"type": "string",
"index_name": "name"
},
"score": {
"type": "double",
"index_name": "keywordScore"
}
}
}
Example Query:
{
"query": {
"bool": {
"should": {
"nested": {
"query": {
"function_score": {
"functions": [
{
"filter": {
"term": {
"keywords.id": "np14y9393"
}
},
"script_score": {
"script": {
"inline": "(doc['keyword.score'].value*log(0.138317))+100"
}
}
},
{
"filter": {
"term": {
"keywords.id": "ny6579591"
}
},
"script_score": {
"script": {
"inline": "(doc['keyword.score'].value*log(0.0631535))+100"
}
}
}
],
"score_mode": "sum",
"boost_mode": "sum"
}
},
"path": "keywords"
}
}
}
}
}
Issues:
The formula in each script_score deals with probabilities ranging from 0 to 1, so the output of script_score will always be less than 1, for example 0.00456. In this case Elasticsearch ignores the score coming from script_score. When I add 100 in my script so that it returns 100.00456, the scores do show up in the final score. Maybe Elasticsearch has some precision cutoff that causes this behavior.
Even though sum is specified as the score mode, Elasticsearch appears to be averaging the scores internally. As I said, I will have 50 functions in the query. If 10 keywords match, the score should be around 1000, but the resulting score is around 80. So how is the score mode used? How do I tell Elasticsearch not to normalize the score and to use the one I specified?
The Explain API is not much use here. It does not show the score at each function level or how the scores are combined.

Assume you have a set of 5 documents in your index. When you run the query, it is evaluated against each document one by one. Let's dry-run the query on the first indexed document.
The final _score for the first doc will be:
_score = es_score ([0-1]) + function_score;
es_score lies between 0 and 1 inclusive.
All 50 of your functions are based on a keywords.id filter, and the script_score of each function is almost the same. Assuming x function filters matched:
_score = es_score + function_score(func1) + .... + function_score(funcx);
_score = es_score + [(doc['keyword.score'].value*log(0.138317))+100] + .... + [(doc['keyword.score'].value*log(0.138317))+100];
_score = es_score + [-value1 + 100] + .... + [-valueX + 100];
So the _score of the document depends on the values of the computed logs. Since every probability is below 1, every log is negative, so each matched function contributes less than 100 whenever the stored keyword score is positive.
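To make the arithmetic above concrete, here is a minimal Python sketch of what score_mode sum adds up for the matched functions. It assumes that log in the script is the natural logarithm, that each matched keyword has a stored score of 1.0, and that es_score is 1.0. All three are illustrative assumptions, not values taken from the question.

```python
import math

def summed_function_score(keyword_scores, probabilities, offset=100.0):
    """Sum of (keyword_score * ln(p)) + offset over the matched functions."""
    return sum(s * math.log(p) + offset
               for s, p in zip(keyword_scores, probabilities))

# Two matched functions, using the probabilities from the example query.
probabilities = [0.138317, 0.0631535]
keyword_scores = [1.0, 1.0]  # assumed values of the stored keyword score

function_score = summed_function_score(keyword_scores, probabilities)
es_score = 1.0  # assumed score of the wrapped query
final_score = es_score + function_score  # boost_mode "sum"
print(round(final_score, 2))
```

With these assumptions each matched function lands a little below 100, so ten matches would be expected to give a total somewhat under 1000, not around 80.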

How to sum the size of documents within a time interval?

I'm attempting to estimate the total size of n documents across an index using the query below:
GET /events/_search
{
"query": {
"bool":{
"must": [
{"range": {"ts": {"gte": "2022-10-10T00:00:00Z", "lt": "2022-10-21T00:00:00Z"}}}
]
}
},
"aggs": {
"total_size": {
"sum": {
"field": "doc['_source'].bytes"
}
}
}
}
This returns documents, but the value of the aggregation is 0:
"aggregations" : {
"total_size" : {
"value" : 0.0
}
}
How can I sum the size of documents within a time interval?
The best way to achieve what you want is to actually add another field that contains the real source size at indexing time.
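As a rough illustration of the indexing-time approach, the Python sketch below computes the byte length of a document's JSON form before it is sent to Elasticsearch and stores it in an extra field. The field name source_size_bytes and the helper with_source_size are hypothetical, and the compact JSON encoding used here is an assumption; the number only matches the real request body if your client serializes the same way.

```python
import json

def with_source_size(doc, field="source_size_bytes"):
    """Return a copy of doc with the UTF-8 byte length of its compact JSON form."""
    payload = json.dumps(doc, separators=(",", ":"), ensure_ascii=False)
    enriched = dict(doc)  # leave the caller's document untouched
    enriched[field] = len(payload.encode("utf-8"))
    return enriched

event = {"ts": "2022-10-12T08:00:00Z", "message": "hello"}
print(with_source_size(event))
```

A plain sum aggregation on that numeric field would then give the total size for any time range.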
However, if you want to run it once to see what it looks like, you can leverage runtime fields to compute this at search time. Just be aware that it can put a heavy burden on your cluster. Since the Painless scripting language doesn't yet provide a way to transform the source document back into the same JSON you sent at indexing time, we can only approximate the value you're looking for by stringifying the _source HashMap, yielding this:
GET /events/_search
{
"runtime_mappings": {
"source.size": {
"type": "double",
"script": """
def size = params._source.toString().length() * 8;
emit(size);
"""
}
},
"query": {
"bool":{
"must": [
{"range": {"ts": {"gte": "2022-10-10T00:00:00Z", "lt": "2022-10-21T00:00:00Z"}}}
]
}
},
"aggs": {
"size": {
"sum": {
"field": "source.size"
}
}
}
}
Another way is to install the Mapper size plugin so that you can make use of the _size field computed at indexing time.
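For the plugin route, the request bodies could look roughly like the Python dictionaries below. This is a sketch that assumes the mapper-size plugin is installed on every node and that the index is created with _size enabled; documents indexed before that would not carry a _size value. The variable names are only for illustration.

```python
import json

# Index creation body: turn on the _size metadata field (requires the mapper-size plugin).
create_index_body = {"mappings": {"_size": {"enabled": True}}}

# Search body: same time range as in the question, summing _size instead of a script.
search_body = {
    "size": 0,  # only the aggregation is needed, not the hits
    "query": {
        "bool": {
            "must": [
                {"range": {"ts": {"gte": "2022-10-10T00:00:00Z",
                                  "lt": "2022-10-21T00:00:00Z"}}}
            ]
        }
    },
    "aggs": {"total_size": {"sum": {"field": "_size"}}},
}

print(json.dumps(search_body, indent=2))
```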

ElasticSearch _knn_search query on multiple fields

I'm using ES 8.2. I'd like to use the approximate method of _knn_search on more than one vector. Below I've attached my current code, which searches on a single vector. As far as I've read, _knn_search does not support searching on nested fields.
Alternatively, I could use a multi-index search: one index, one vector, one search, then sum up all the results. However, I need to store all these vectors together in one index because I also need to filter on other fields besides the vectors used for the kNN search.
So the question is: is there a workaround that lets me perform _knn_search on more than one vector?
import numpy as np

search_vector = np.zeros(512).tolist()
es_query = {
"knn": {
"field": "feature_vector_1.vector",
"query_vector": search_vector,
"k": 100,
"num_candidates": 1000
},
"filter": [
{
"range": {
"feature_vector_1.match_prc": {
"gt": 10
}
}
}
],
"_source": {
"excludes": ["feature_vector_1.vector", "feature_vector_2.vector"]
}
}
The last working query that I've ended up with is:
es_query = {
"knn": {
"field": "feature_vector_1.vector",
"query_vector": search_vector,
"k": 1000,
"num_candidates": 1000
},
"filter": [
{
"function_score": {
"query": {
"match_all": {}
},
"script_score": {
"script": {
"source": """
double value = dotProduct(params.queryVector, 'feature_vector_2.vector');
return 100 * (1 + value) / 2;
""",
"params": {
"queryVector": search_vector
}
},
}
}
}
],
"_source": {
"excludes": ["feature_vector_1.vector", "feature_vector_2.vector"]
}
}
However, it is not true approximate kNN on two vectors, but it is still a workable option if the performance of such a query meets your expectations.
The query below seems to work for me for combining kNN searches by taking the average of multiple cosine similarity scores. Note that this is a little different from the original request, since it performs a brute-force search, but you can still filter the results up front by replacing the match_all part.
GET my-index/_search
{
"query": {
"script_score": {
"query": {
"match_all": {}
},
"script": {
"source": "(cosineSimilarity(params.vector1, 'my-vector1') + cosineSimilarity(params.vector2, 'my-vector2'))/2 + 1.0",
"params": {
"vector1": [
1.3012068271636963,
...
0.23468133807182312
],
"vector2": [
-0.49404603242874146,
...
-0.15835021436214447
]
}
}
}
}
}
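To see what that script computes per document, here is a small pure-Python sketch of the same formula: the average of two cosine similarities, shifted by 1.0 so the result cannot be negative. The function names and the tiny 2-dimensional vectors are illustrative assumptions; Elasticsearch evaluates the equivalent expression on the stored dense vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def combined_score(query1, doc1, query2, doc2):
    """Average of the two similarities plus 1.0, as in the script above."""
    return (cosine_similarity(query1, doc1) + cosine_similarity(query2, doc2)) / 2 + 1.0

# Both vector pairs point the same way, so each similarity is 1.
print(combined_score([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]))
```

The result ranges from 0.0 (both pairs opposite) to 2.0 (both pairs aligned), which keeps the score valid for script_score.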

ElasticSearch max score

I'm trying to solve a performance issue we have when querying Elasticsearch for several thousand results. The basic idea is that we do some post-query processing and only show the top X results (a query may have ~100000 results while we only need the top 100 according to our score mechanics).
The basic mechanics are as follows:
The Elasticsearch score is normalized to the range 0..1 (score/max(score)); we add our ranking score (also normalized to 0..1) and divide by 2.
What I'd like to do is move this logic into Elasticsearch using custom scoring (or, well, anything that works): https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#function-script-score
The problem I'm facing is that with score scripts / score functions I can't seem to find a way to do something like max(_score) to normalize the score between 0 and 1.
"script_score" : {
"script" : "(_score / max(_score) + doc['some_normalized_field'].value)/2"
}
Any ideas are welcome.
You cannot get max_score before the _score has actually been generated for all the matching documents. A script_score query first generates the _score for every matching document, and only then is max_score reported by Elasticsearch.
As far as I understand your problem, you want to preserve the max_score that was generated by the original query before you applied script_score. You can get the required result by doing some computation on the front end. In short, apply your formula on the front end and then sort the results.
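The front-end computation described here can be sketched in Python as follows. It assumes each hit is a dictionary carrying the Elasticsearch _score and an already normalized ranking value; the key names ranking and combined, and the function rerank, are hypothetical.

```python
def rerank(hits, max_score, top_n=100):
    """Average the normalized _score with the ranking value and keep the best top_n."""
    rescored = []
    for hit in hits:
        normalized = hit["_score"] / max_score if max_score else 0.0
        combined = (normalized + hit["ranking"]) / 2
        rescored.append({**hit, "combined": combined})
    rescored.sort(key=lambda h: h["combined"], reverse=True)
    return rescored[:top_n]

hits = [
    {"_id": "a", "_score": 8.0, "ranking": 0.1},
    {"_id": "b", "_score": 4.0, "ranking": 0.9},
    {"_id": "c", "_score": 2.0, "ranking": 0.2},
]
top = rerank(hits, max_score=8.0, top_n=2)
print([h["_id"] for h in top])
```

Note that this still requires fetching every candidate hit, so it reproduces the current logic without addressing the cost of pulling ~100000 results.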
You can return your factor alongside the results using script_fields:
{
"explain": true,
"query": {
"match_all": {}
},
"script_fields": {
"total_goals": {
"script": {
"lang": "painless",
"source": """
int total = 0;
for (int i = 0; i < doc['goals'].length; ++i) {
total += doc['goals'][i];
}
return total;
""",
"params":{
"last" : "any parameters required"
}
}
}
}
}
I am not sure that I understand your question. Do you want to limit the number of results?
Have you tried this?
{
"from" : 0, "size" : 10,
"query" : {
"term" : { "name" : "dennis" }
}
}
You can use sort to define the sort order; by default, results are sorted by the score of the main query.
You can also use aggregations (with or without function_score):
{
"query": {
"function_score": {
"functions": [
{
"gauss": {
"date": {
"scale": "3d",
"offset": "7d",
"decay": 0.1
}
}
},
{
"gauss": {
"priority": {
"origin": "0",
"scale": "100"
}
}
}
],
"query": {
"match" : { "body" : "dennis" }
}
}
},
"aggs": {
"hits": {
"top_hits": {
"size": 10
}
}
}
}
According to this GitHub ticket, it is simply not possible to normalize the score, and they suggest using the boolean similarity as a workaround.

Elastic Search - Sort By Doc Type

I have an Elasticsearch index with 2 different doc types: 'a' and 'b'. I would like to sort my results by type and give preference to type='b' (even if it has a low score). I had been consuming the results of the search below on the client side and sorting them there, but I've realized that this approach does not work well, since I only inspect the first 10 results, which often do not contain any b's. Increasing the number of returned results is not ideal. I'd like to get Elasticsearch to do the work.
http://<server>:9200/my_index/_search?q=london
You would need to play with function_score and, depending on how you already score your documents, test some weight values, boost_modes and score_modes for each type. For example:
GET /some_index/a,b/_search
{
"query": {
"function_score": {
"query": {
# your query here
},
"functions": [
{
"filter": {
"type": {
"value": "b"
}
},
"weight": 3
},
{
"filter": {
"type": {
"value": "a"
}
},
"weight": 1
}
],
"score_mode": "first",
"boost_mode": "multiply"
}
}
}
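As a worked example of how these settings combine, the Python sketch below mimics the scoring for a single document: score_mode first picks the weight of the first function whose filter matches, and boost_mode multiply multiplies the query score by it. The function name boosted_score and the sample scores are assumptions for illustration.

```python
def boosted_score(query_score, doc_type, weights):
    """weights: ordered (type, weight) pairs, mirroring the functions array."""
    for filter_type, weight in weights:
        if doc_type == filter_type:      # score_mode "first": first matching filter wins
            return query_score * weight  # boost_mode "multiply"
    return query_score                   # no filter matched: function score is neutral (1)

weights = [("b", 3), ("a", 1)]
print(boosted_score(0.5, "b", weights))  # a weak 'b' hit
print(boosted_score(1.2, "a", weights))  # a stronger 'a' hit
```

Here the weak 'b' hit ends up ahead of the stronger 'a' hit, but a multiplicative weight does not guarantee that every 'b' ranks above every 'a'. That is why the weights need testing against your real score distribution.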
This works for me. Run the command below at the command prompt:
curl -XGET 'localhost:9200/index_v1,index_v2/_search?pretty' -d @boost.json
boost.json:
{
"indices_boost" : {
"index_v2" : 1.4,
"index_v1" : 1.3
}
}

elasticsearch scoring unique terms vs ngram terms

I've figured out how to return results for a partial word using ngrams, but now I'd like to arrange (score or sort) my results so that full-term matches come first and partial-term matches come after.
For example, the user searches a movie db for 'we'. I want 'we are marshall' and similar titles to show up at the top, and not 'north by northwest' (the 'we' is in 'northwest').
Currently this is my mapping for the title field:
"title": {
"type": "string",
"analyzer": "ngramAnalyzer",
"fields": {
"term": {
"type": "string",
"analyzer": "fullTermCaseInsensitive"
},
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
I've created a multifield where ngramAnalyzer is a custom ngram analyzer, term uses a keyword tokenizer with a standard filter, and raw is not_analyzed.
My query is as follows:
"query": {
"function_score": {
"functions": [
{
"script_score": {
"script": "_score * (1+ (1 / doc['salesrank'].value) )"
}
}
],
"query": {
"bool": {
"must": [
{
"match_phrase": {
"title": {
"query": "we",
"max_expansions": 10
}
}
}
],
"should":{
"term" : {
"title.term" : {
"value" : "we",
"boost" : 10
}
}
}
}
}
}
}
I'm basically requiring that the ngram must match, and that the term 'we' should match and be boosted if it does.
This isn't working, of course.
Any ideas?
Edit:
To add further complexity: how would I match first on the exact title, then on a custom score?
I've taken some stabs at it, but it doesn't seem to work.
for example:
input: 'game'
results should be ordered by exact match 'game'
followed by a custom score based on a sales rank (integer)
so that the next results after 'game' might be something like 'hunger games'
What about a bool combination of boosted queries, where the first clause matches the full term with a 10x boost factor and another clause matches the ngram field with the standard boost factor?
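One possible reading of that suggestion, written as a Python dictionary you could pass to a client as the search body, is sketched below. It assumes the title and title.term fields from the mapping in the question; the helper name exact_then_ngram_query, the use of match clauses, and minimum_should_match are illustrative choices, not something the answer specifies.

```python
import json

def exact_then_ngram_query(text, exact_boost=10):
    """bool/should query: boosted full-term clause plus an unboosted ngram clause."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Full-term field, boosted so these matches score higher.
                    {"match": {"title.term": {"query": text, "boost": exact_boost}}},
                    # Ngram-analyzed field with the default boost.
                    {"match": {"title": {"query": text}}},
                ],
                "minimum_should_match": 1,
            }
        }
    }

print(json.dumps(exact_then_ngram_query("we"), indent=2))
```

The sales-rank script_score from the question could then wrap this bool query inside function_score in place of the original must/should combination.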
