Can Elasticsearch do a decay search on the log of a value?

I store a number, views, in Elasticsearch. I want to find documents "closest" to it on a logarithmic scale, so that 10k and 1MM are the same distance (and get scored the same) from 100k views. Is that possible?
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#exp-decay describes field value factor and decay functions but can they be "stacked"? Is there another approach?

I'm not sure whether you can achieve this directly with the decay functions, but you can do it easily with the script_score function. The example below uses dynamic (inline) scripting; be aware that file-based scripts are the recommended and far more secure approach.
In the query below, the offset parameter is set to 100,000, and documents whose 'views' field has that value score highest. The score decays with the logarithmic distance between views and offset. Per your example, documents with 1,000,000 and 10,000 views get identical scores (0.30279312 with this formula).
You can invert the order of these results by changing the script to multiply _score by the expression instead of dividing by it.
$ curl -XPOST localhost:9200/somestuff/_search -d '{
  "size": 100,
  "query": {
    "bool": {
      "must": [
        {
          "function_score": {
            "functions": [
              {
                "script_score": {
                  "params": {
                    "offset": 100000
                  },
                  "script": "_score / (1 + ((log(offset) - log(doc[\"views\"].value)).abs()))"
                }
              }
            ]
          }
        }
      ]
    }
  }
}'
Note: you may want to account for the possibility of 'views' being null, depending on your data.
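To sanity-check the formula outside Elasticsearch, here is a minimal Python sketch of the same arithmetic. The function name log_distance_score and the choice to return 0 for missing or non-positive views are my own assumptions, not something the script above does; it also assumes the base _score is 1.

```python
import math

def log_distance_score(views, offset=100_000, base_score=1.0):
    """Same arithmetic as the script: _score / (1 + |ln(offset) - ln(views)|)."""
    # Assumption: a missing or non-positive views value scores 0.
    if views is None or views <= 0:
        return 0.0
    return base_score / (1.0 + abs(math.log(offset) - math.log(views)))

for v in (10_000, 100_000, 1_000_000):
    print(v, round(log_distance_score(v), 6))
```

10,000 and 1,000,000 are both a factor of 10 away from the offset, so they land on the same score, while 100,000 itself keeps the full base score.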

Related

Elasticsearch "size" value not working in terms aggregation with partitions

I am trying to paginate over a specific field using the terms aggregation with partitions.
The problem is that the number of returned terms for each partition is not equal to the size parameter that I set.
These are the steps that I am doing:
Retrieve the number of different unique values for the field with "cardinality" aggregation.
In my data, the result is 21.
From the web page, the user wants to display a table with 10 items per page.
if unique_values % page_size != 0:
partitions_number = (unique_values // page_size) + 1
else:
partitions_number = (unique_values // page_size)
Then I run this simple query:
POST my_index/_search?pretty
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "field_to_paginate": "foo"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_pchostname": {
      "terms": {
        "size": 10,
        "field": "field_to_paginate",
        "include": {
          "partition": 0,
          "num_partitions": 3
        }
      }
    }
  }
}
I am expecting to retrieve 10 results, but when I run the query I get only 7.
What am I missing here? Do I need to use a different solution here?
As a side note, I can't use composite aggregation because I need to sort results by doc_count over the whole dataset.
Partitions in a terms aggregation divide the field's unique values into roughly equal chunks.
In your case num_partitions is 3, so each partition holds about 21 / 3 = 7 terms.
Partitions are meant for working through fields with a very large number of unique values, on the order of thousands.
You may be able to leverage the shard_size parameter. My suggestion is to read the relevant part of the manual and experiment with shard_size.
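The arithmetic can be sketched in Python as follows. The helper names are hypothetical, and the even split is only an approximation: the real number of terms in a given partition can differ slightly from the average.

```python
def partitions_needed(unique_values, page_size):
    """Ceiling division: smallest partition count with about page_size terms each."""
    return -(-unique_values // page_size)

def average_terms_per_partition(unique_values, num_partitions):
    """Terms are spread roughly evenly, so this is the expected size of one partition."""
    return unique_values / num_partitions

n = partitions_needed(21, 10)
print(n, average_terms_per_partition(21, n))
```

With 21 unique values and a page size of 10 this gives 3 partitions of about 7 terms each, which is why "size": 10 never fills up.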
Terms aggregation does not allow pagination. Use composite aggregation instead (requires ES >= 6.1.0). Below is the quote from reference docs:
If you want to retrieve all terms or all combinations of terms in a
nested terms aggregation you should use the Composite aggregation
which allows to paginate over all possible terms rather than setting a
size greater than the cardinality of the field in the terms
aggregation. The terms aggregation is meant to return the top terms
and does not allow pagination.
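For reference, a composite request body could be built like the hedged sketch below. The aggregation name paged_terms and the source name term_value are placeholders I made up; the "after" value is assumed to be the after_key returned with the previous page.

```python
import json

def composite_page_body(field, page_size, after_key=None):
    """Build a search body that pages over the terms of `field`."""
    composite = {
        "size": page_size,
        "sources": [{"term_value": {"terms": {"field": field}}}],
    }
    if after_key is not None:
        # Resume right after the last bucket of the previous page.
        composite["after"] = after_key
    return {"size": 0, "aggs": {"paged_terms": {"composite": composite}}}

first_page = composite_page_body("field_to_paginate", 10)
next_page = composite_page_body("field_to_paginate", 10, {"term_value": "foo"})
print(json.dumps(next_page, indent=2))
```

Keep in mind that composite buckets come back ordered by the source values, not by doc_count, so this does not cover the sorting requirement mentioned in the question.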

What is the difference between must and filter in Query DSL in elasticsearch?

I am new to Elasticsearch and I am confused about the difference between must and filter. I want to perform an AND operation between my terms, so I did this:
POST /xyz/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "city": "city1"
          }
        },
        {
          "term": {
            "saleType": "sale_type1"
          }
        }
      ]
    }
  }
}
which gave me the required results matching both the terms, and on using filter like this
POST /xyz/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "city": "city1"
          }
        }
      ],
      "filter": {
        "term": {
          "saleType": "sale_type1"
        }
      }
    }
  }
}
I get the same result, so when should I use must and when should I use filter? What is the difference?
must contributes to the score. In filter, the score of the query is ignored.
In both must and filter, the clause (query) must appear in matching documents. This is why you get the same results.
You may check this link
Score
The relevance score of each document is represented by a positive floating-point number called the _score. The higher the _score, the more relevant the document.
A query clause generates a _score for each document.
To learn how the score is calculated, refer to this link
must returns a score for every matching document. This score helps you rank the matching documents, and compare the relative relevance between documents (using the magnitude of the score of each document).
With this, one can say how many times more relevant Doc 1 is than Doc 2, or that Docs 1 to 7 are much more relevant than Doc 8 onwards.
For how the relative score is determined, you can refer to the references below.
Briefly, it is related to the number of term occurrences in the document, the document length, and the average number of term occurrences in your database index.
filter doesn't return a score. All one can say is that every matching document is relevant, but it won't help in evaluating whether one is more relevant than another. You can think of filter as a must with only two scores, zero or non-zero, where all zero-scored documents are dropped.
filter is helpful if you just want to whitelist or blacklist documents, e.g., all documents belonging to the topic "pets".
In summary, there are 3 points that will help you decide when to use which:
must is your only choice when comparing/ranking documents by relevance
filter excludes all documents that don't match
filter is usually faster because Elasticsearch doesn't need to compute a relevance score and can cache the results of frequently used filters
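As a toy illustration of the points above (this is not how Lucene computes relevance, and the clause weights 0.5 and 0.7 are made up), the following Python sketch treats each clause as a function that returns 0 for "no match" and a positive score otherwise:

```python
def bool_query(docs, must=(), filters=()):
    """Return {doc_id: score} for docs matching every must and filter clause."""
    results = {}
    for doc_id, doc in docs.items():
        must_scores = [clause(doc) for clause in must]
        filter_scores = [clause(doc) for clause in filters]
        if all(s > 0 for s in must_scores + filter_scores):
            # Only must clauses contribute; filter scores are discarded.
            results[doc_id] = sum(must_scores)
    return results

def term(field, value, weight):
    """Hypothetical term clause: `weight` if the field matches, else 0."""
    return lambda doc: weight if doc.get(field) == value else 0.0

docs = {
    1: {"city": "city1", "saleType": "sale_type1"},
    2: {"city": "city1", "saleType": "sale_type2"},
}
city = term("city", "city1", 0.5)
sale = term("saleType", "sale_type1", 0.7)

both_must = bool_query(docs, must=[city, sale])
must_and_filter = bool_query(docs, must=[city], filters=[sale])
print(both_must, must_and_filter)
```

Both variants return the same set of documents (only document 1), but the score differs because the filter clause adds nothing to it.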
References:
Query vs Filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html
Computation of Relevance: https://www.infoq.com/articles/similarity-scoring-elasticsearch/

change _score in elasticsearch to make equal to doc's score field

I have a score (integer) field in my data. I'm getting the data from an API and posting it directly to localhost:9200//listings/
I want each item's _score to be equal to the score field in the data.
For now, my workaround is to add ?sort=score:desc to the URL.
One solution is to use a function_score query, where you replace the default _score using a field_value_factor score function. It goes like this:
curl -XPOST localhost:9200/listings/_search -d '{
  "query": {
    "function_score": {
      "functions": [
        {
          "field_value_factor": {
            "field": "score",
            "factor": 1,
            "missing": 1
          }
        }
      ],
      "query": {
        "match_all": {}
      },
      "boost_mode": "replace"
    }
  }
}'
Here, field tells the function to use the score field, factor: 1 keeps its value unchanged (the field is multiplied by 1), missing: 1 uses 1 as the score if a document has no score field (you can change that to whatever makes more sense in your case), and boost_mode: replace replaces the default _score with the result.
UPDATE
According to your comment, you need the _score to be multiplied by the document's score field. You can achieve that simply by removing the boost_mode parameter: the default boost_mode multiplies the _score by whatever value comes out of the field_value_factor function.
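The difference between the two modes can be sketched in a few lines of Python. This only models the arithmetic for the "replace" and "multiply" modes with no modifier; the function names are mine, not part of any Elasticsearch client.

```python
def field_value_factor(doc, field, factor=1.0, missing=1.0):
    """factor * field value, falling back to `missing` when the field is absent."""
    return factor * doc.get(field, missing)

def combine(query_score, function_value, boost_mode="multiply"):
    """Combine the query's _score with the function result."""
    if boost_mode == "replace":
        return function_value
    if boost_mode == "multiply":
        return query_score * function_value
    raise ValueError("boost_mode not covered by this sketch: " + boost_mode)

doc = {"score": 7}
value = field_value_factor(doc, "score")
print(combine(0.5, value, "replace"))   # _score is ignored
print(combine(0.5, value))              # default: _score times the field value
```

With a query score of 0.5 and a score field of 7, replace yields 7 and the default multiply yields 3.5.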
If you need to completely replace the default scoring mechanism to be based on your score field instead, there's a more complex way using the similarity module, where you can define another similarity algorithm solely for your score field. There is a great blog post explaining the nitty gritty details of the similarity module.

Can I rescore on the basis of max score and/or max value of my custom field in elasticsearch?

All I want to do is re-score my query according to this formula,
NEW SCORE = OLD SCORE/max(OLD SCORES) + doc.value['custom']/max(doc.value['custom'])
Is this possible? I am able to rescore using the following code
{
  "query": {
    "function_score": {
      "query": {},
      "script_score": {
        "script": "_score * doc['custom'].value"
      }
    }
  }
}
Also, it would be great if someone could tell how to use values of one script in another.
I'm not sure it would fit your needs, but you could flatten the score so that its maximum is forced to 1, assuming you can find a score value (score_thresh) above which results are good anyway, and then let your field doc.value['custom'] do the ranking.
A method such as:
1 / (score_thresh - min(_score, score_thresh - 1))
would do the trick, though it is a bit harsh on your score curve. It works for me because all I want from my old score is to select documents, with a rescore query taking care of the ranking.
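To see how the flattening behaves, here is the same expression as a small Python function. The threshold of 5 is an arbitrary value picked for illustration.

```python
def flatten_score(score, score_thresh):
    """1 / (score_thresh - min(score, score_thresh - 1)); caps at 1."""
    return 1.0 / (score_thresh - min(score, score_thresh - 1))

for s in (0.5, 2, 4, 10):
    print(s, flatten_score(s, 5))
```

Every score at or above score_thresh - 1 maps to exactly 1, and lower scores fall off quickly, which is the "harsh" part of the curve.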

Constant Score Query elasticsearch boosting

My understanding of the Constant Score Query in Elasticsearch is that the boost factor is assigned as the score of every matching document. The documentation says:
A query that wraps a filter or another query and simply returns a constant score equal to the query boost for every document in the filter.
However when I send this query:
"query": {
  "constant_score": {
    "filter": {
      "term": {
        "source": "BBC"
      }
    },
    "boost": 3
  }
},
"fields": ["title", "source"]
all the matching documents are given a score of 1?! I cannot figure out what I am doing wrong, and had also tried with query instead of filter in constant_score.
Scores are only meant to be relative to all other scores in a given result set, so a result set where everything has the score of 3 is the same as a result set where everything has the score of 1.
Really, the only purpose of the relevance _score is to sort the results of the current query in the correct order. You should not try to compare the relevance scores from different queries. - Elasticsearch Guide
Either the constant score is being ignored because it's not combined with another query, or it's being normalized. As @keety said, check the output of explain to see exactly what's going on.
A constant score query gives an equal score to every matching document, irrespective of scoring factors such as TF and IDF. It can be used when you don't care how well a document matched, only whether it matched, and still want a score, unlike with filter.
If you want the score to be literally 3 for all documents matching a particular query, you should use a function score query, something like:
"query": {
  "function_score": {
    "functions": [
      {
        "filter": { "term": { "source": "BBC" } },
        "weight": 3
      }
    ]
  }
  ...
}
