How to boost most popular (top quartile) in elasticsearch query results (outliers) - elasticsearch

I have an elasticsearch query that includes bool - must / should sections that I have refined to match search terms and boost for terms in priority fields, phrase match, etc.
I would like to boost documents that are the most popular. The documents include a field "popularity" that indicates the number of times the document was viewed.
Preferably, I would like to boost any documents in the result set that are outliers - meaning that the popularity score is perhaps 2 standard deviations from the average in the result set.
I see aggregations but I'm interested in boosting results in a query, not a report/dashboard.
I also noted the new rank_feature query in ES 7 (I am still on 6.8 but could upgrade). It looks like the rank_feature query looks across all documents, not the result set.
Is there a way to do this?

I think you want to use a rank or a range query in a "rescore query".
If your need is too specific for the standard queries, you can use a "function_score" query in your rescore and use a script to write your own score calculation.
https://www.elastic.co/guide/en/elasticsearch/reference/7.9/filter-search-results.html
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-rescore.html
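As a rough sketch of the rescore idea above: the request bodies below are built as plain Python dicts. It assumes a two-step approach that the answer does not spell out: first run the same query with an extended_stats aggregation on "popularity" to get the average and standard deviation for the result set, then rescore with a function_score script that multiplies the score of documents above the threshold. The function names, the aggregation name "popularity_stats", the boost of 2.0 and the window size are all made up for illustration, and the script assumes every document has a "popularity" value.

```python
import json


def build_stats_body(base_query):
    """First request: no hits, only popularity statistics for the result set."""
    return {
        "size": 0,
        "query": base_query,
        "aggs": {
            # "popularity_stats" is an arbitrary aggregation name
            "popularity_stats": {"extended_stats": {"field": "popularity"}}
        },
    }


def outlier_threshold(avg, std_deviation, num_std=2.0):
    """Popularity value at or above which a document counts as an outlier."""
    return avg + num_std * std_deviation


def build_rescore_body(base_query, threshold, boost=2.0, window_size=100):
    """Second request: rescore the top window_size hits per shard and
    multiply the score of outlier documents by boost."""
    return {
        "query": base_query,
        "rescore": {
            "window_size": window_size,
            "query": {
                # combine original and rescore scores by multiplication
                "score_mode": "multiply",
                "rescore_query": {
                    "function_score": {
                        "query": {"match_all": {}},
                        "script_score": {
                            "script": {
                                "source": (
                                    "doc['popularity'].value >= params.threshold"
                                    " ? params.boost : 1.0"
                                ),
                                "params": {"threshold": threshold, "boost": boost},
                            }
                        },
                        # use the script result as the rescore score
                        "boost_mode": "replace",
                    }
                },
            },
        },
    }


# Hypothetical values, as they might come back from the stats request
base_query = {"match": {"title": "elasticsearch"}}
threshold = outlier_threshold(avg=100.0, std_deviation=25.0)
rescore_body = build_rescore_body(base_query, threshold)
print(json.dumps(rescore_body, indent=2))
```

Note that a rescore only touches the top window_size hits on each shard, so a very popular document that ranks far down in the original results would not be lifted by this.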

Related

What is the boost factor of a should in a bool_query?

I'm a newbie to Elasticsearch and couldn't find the answer to this in the online docs.
For Elasticsearch 7.6 I have a bool with a must and a should. As I wanted, the should part causes matching results to be boosted.
However, what I can't find is what the default boost factor is when a should is matched? Is it possible to tweak this default boost factor?
Thank you
Elasticsearch calculates a score for each individual query in the should clause and sums those scores to get a document's overall score. As a result, a document matching all should clauses will get a higher score and will be ranked at the top.
You can boost individual clauses (queries) in the should clause.
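A minimal sketch of per-clause boosting, with the request body written as a Python dict. The field names "title" and "body" and the boost values are invented for illustration. A clause without an explicit boost uses the default of 1.0.

```python
def boosted_bool_query(text, title_boost=3.0, phrase_boost=2.0):
    """Bool query whose should clauses carry their own boost values."""
    return {
        "query": {
            "bool": {
                # must: the document has to match this clause
                "must": [{"match": {"body": text}}],
                # should: optional clauses; each match adds to the score,
                # scaled by that clause's boost
                "should": [
                    {"match": {"title": {"query": text, "boost": title_boost}}},
                    {"match_phrase": {"body": {"query": text, "boost": phrase_boost}}},
                ],
            }
        }
    }


body = boosted_bool_query("samsung galaxy")
```

Raising title_boost relative to phrase_boost makes a title match count for more than a phrase match in the body, and vice versa.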

ElasticSearch scoring / number of doc

I have small (max 50 char) keywords stored in text field in ElasticSearch index. I noticed that if I clear the index and add only 1 document, let's say "samsung galaxy", the score when I match the document is like 0.95.
But when I add 500k other docs and I make the same query, the score is like 20. I would like to set a min_score for this query because I need a certain level of relevancy.
But the score depends on the doc count, so I can't set a min_score, as the number of docs in the index will constantly evolve.
I already looked for solutions like constant_score but I need the power of Elastic to give me a score (and not 1 or 0).
1) Does this behavior come from the IDF method or not only from it?
2) Is there a way to keep the current search algorithm (or just without the term frequency) and always have the same score for a query, without the doc count dependency? This would allow me to set a min_score.

Compare Elasticsearch query score across multiple queries

I'm trying to query and compare two MLT queries scores but am a bit confused based on what I read here
https://www.elastic.co/guide/en/elasticsearch/guide/current/practical-scoring-function.html
Even though the intent of the query norm is to make results from
different queries comparable, it doesn’t work very well. The only
purpose of the relevance _score is to sort the results of the current
query in the correct order. You should not try to compare the
relevance scores from different queries.
If I run an MLT query and document 'A' is similar to document 'B', the score is 0.4; conversely,
running the MLT query where document 'B' is similar to document 'A' gives a score of 2.4.
I would expect the score to be the same based on the tokens matched in the MLT, but that's not the case.
Also,
if I run an MLT query and document 'A' is similar to document 'B', the score is 0.6, and
running another MLT query where document 'C' is similar to document 'A' gives a score of 4.7.
So my questions are:
Does this imply that C is much more similar to A than B?
Also, what's the best way for me to compare multiple queries in elasticsearch when the scores are different?
Thanks,
- Phil
1.
No, it doesn't. As you noted in your question, you should not compare the scores of different queries. If you want to get a meaningful result of which documents are most similar to C, you should generate an MLT query for document C, and search with that.
This is made doubly true by how MLT queries work. MLT attempts to generate a list of interesting terms to search for from your document (based on the library of terms in the index), and searches for them. The set of terms generated from document A may be much different from that generated from document B, hence the wildly different scores when finding A from B and vice versa, even though the documents themselves will obviously have the same overlap.
2.
Don't. Listen to the docs. Scores are only designed to rank how well documents match the query that generated them. Using them outside that context is not meaningful. Rethink what you are trying to accomplish.
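To illustrate point 1, here is a small sketch that builds one MLT query for a single source document and then reads only the ordering of the hits, not the raw scores. The index name, document id and field names are placeholders, and on older Elasticsearch versions the "like" item may also need a "_type" entry.

```python
def mlt_query(index, doc_id, fields, max_query_terms=25):
    """more_like_this query that uses an indexed document as the example."""
    return {
        "query": {
            "more_like_this": {
                "fields": fields,
                "like": [{"_index": index, "_id": doc_id}],
                "min_term_freq": 1,
                "max_query_terms": max_query_terms,
            }
        }
    }


def ranked_ids(response):
    """Document ids in the order Elasticsearch returned them.

    Only the order is meaningful; it is valid within this one query.
    """
    return [hit["_id"] for hit in response["hits"]["hits"]]


query_for_c = mlt_query("docs", "C", ["title", "body"])

# Hand-written response fragment standing in for a real search result
sample_response = {
    "hits": {
        "max_score": 4.7,
        "hits": [
            {"_id": "A", "_score": 4.7},
            {"_id": "B", "_score": 0.6},
        ],
    }
}
print(ranked_ids(sample_response))
```

Within this single response, A ranks above B for the query generated from C. That ordering says nothing about scores produced by a query generated from a different document.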

Elasticsearch scoring

I'm using elasticsearch to find similar documents to a given document using the "more like this" query.
Is there an easy way to get the elasticsearch score between 0 and 1 (using cosine similarity)?
Thanks!
You may want to look into the Function Score functionality of Elasticsearch, more specifically the script_score and field_value_factor functions. These allow you to take the score from default scoring (_score) and enhance or replace it in other ways. It really depends on what sort of boosting or transformation you'd like. The default scoring model takes the vector space model into account, but other things as well.
I don't think that's possible to retrieve directly.
But perhaps this workaround would make sense?
Elasticsearch returns max_score in the hits section of the response.
You can divide each document's _score by max_score. The document with the highest score will then score 1, and documents that are less like the given one will score less.
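The workaround above could look roughly like this on the client side. The response dict is a hand-written stand-in for a real search response, and the function name is made up. Keep in mind that the result is a score relative to the best hit of this one query, not a cosine similarity.

```python
def normalize_scores(response):
    """Map each hit's _id to _score / max_score for a single response."""
    hits = response["hits"]
    max_score = hits.get("max_score")
    # max_score can be missing or null (for example when there are no hits)
    if not max_score:
        return {}
    return {hit["_id"]: hit["_score"] / max_score for hit in hits["hits"]}


sample_response = {
    "hits": {
        "max_score": 4.0,
        "hits": [
            {"_id": "doc-1", "_score": 4.0},
            {"_id": "doc-2", "_score": 1.0},
        ],
    }
}
normalized = normalize_scores(sample_response)
print(normalized)  # {'doc-1': 1.0, 'doc-2': 0.25}
```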
Elasticsearch uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse document frequency and the vector space model but adds more-modern features like a coordination factor, field length normalization, and term or query clause boosting.

Difference between Elasticsearch Range Query and Range Filter

I want to query elasticsearch documents within a date range. I have two options now, both work fine for me. Have tested both of them.
1. Range Query
2. Range Filter
Since I have a small data set for now, I am unable to test the performance of both of them. What is the difference between these two? And which one would result in faster retrieval of documents and a faster response?
The main difference between queries and filters has to do with scoring. Queries return documents with a relative ranked score for each document. Filters do not. This difference allows a filter to be faster for two reasons. First, it does not incur the cost of calculating the score for each document. Second, it can cache the results as it does not have to deal with possible changes in the score from moment to moment - it's just a boolean really, does the document match or not?
From the documentation:
Filters are usually faster than queries because:
they don’t have to calculate the relevance _score for each document: the answer is just a boolean “Yes, the document matches the filter” or “No, the document does not match the filter”.
the results from most filters can be cached in memory, making subsequent executions faster.
As a practical matter, the question is: do you use the relevance score in any way? If not, filters are the way to go. If you do, filters may still be of use but should be used where they make sense. For instance, if you had a language field (let's say language: "EN" as an example) in your documents and wanted to query by language along with a relevance score, you would combine a query for the text search with a filter for language. The filter would cache the document ids for all documents in English and then the query could be applied to that subset.
I'm oversimplifying a bit, but that's the basics. Good places to read up on this:
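As a sketch of the language example, the body below combines a scored text query with non-scoring filters. It is written with the bool query's filter clause, which is the form to use on Elasticsearch versions where the filtered query is no longer available (it was removed in 5.0). The field names "content", "language" and "published" and the date values are placeholders.

```python
def search_with_filters(text, language, date_from, date_to):
    """Scored match on text, restricted by filters that do not affect _score."""
    return {
        "query": {
            "bool": {
                # query context: contributes to the relevance score
                "must": [{"match": {"content": text}}],
                # filter context: yes/no matching only, no scoring
                "filter": [
                    {"term": {"language": language}},
                    {"range": {"published": {"gte": date_from, "lte": date_to}}},
                ],
            }
        }
    }


body = search_with_filters("relevance scoring", "EN", "2020-01-01", "2020-12-31")
```

Putting the date range in the filter list is the "range filter" variant from the question; moving the same range clause into must would make it the "range query" variant.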
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/0.90/query-dsl-filtered-query.html
http://exploringelasticsearch.com/searching_data.html
http://elasticsearch-users.115913.n3.nabble.com/Filters-vs-Queries-td3219558.html
Filters are cached so they are faster!
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/filter-caching.html
