I am upgrading Elasticsearch from 2.2 to 7.1. I am keeping both instances running and comparing the results of the old and new versions by running the same search queries against each.
Note: I have not changed the mappings, settings, or querying logic.
My results are almost the same but vary a little in scoring. Is this expected, even though the documents, mappings, settings, and query logic are the same?
Elasticsearch 2.x uses TF/IDF for scoring, and this ES doc explains it in detail.
ES 7.x uses the improved BM25 algorithm for score calculation. This is another nice article from ES which explains it in detail.
In short, yes: the scoring formula changed significantly between ES 2.x and 7.x because the underlying algorithm itself changed. Even with identical documents, mappings, settings, and queries, you will still get different scores.
You can use the explain API on your query to understand the score of documents returned by the query.
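As a rough sketch of what that looks like in practice, the snippet below builds a search request body with "explain" set to true, which asks Elasticsearch to attach a score breakdown to every hit. The field name "title" and the query text are placeholders, not taken from the question; the body would be sent to the _search endpoint of your own index on both clusters.

```python
import json

def build_explain_search(field, text, size=5):
    """Build a _search request body that asks for a score explanation per hit."""
    return {
        "size": size,
        "explain": True,  # each hit gains an "_explanation" tree
        "query": {"match": {field: text}},
    }

# "title" and the query text are hypothetical placeholders.
body = build_explain_search("title", "quick brown fox")
print(json.dumps(body, indent=2))
```

Running the same body against the 2.2 and 7.1 instances and comparing the "_explanation" of a given document should show which parts of the formula differ.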
Related
Does Elasticsearch take into account how frequently a document has been searched for previously? For example, there are documents A and B. Both have similar scores in terms of edit distance and other metrics, but document A is searched for very frequently and B is not. Will Elasticsearch score A higher than B? If not, how can this be achieved?
Elasticsearch does not change scores based on previous searches in its default scoring algorithm. In fact, this is really a question about Lucene scoring, since Elasticsearch uses it for all of the actual search logic.
I think you may be looking at this from the wrong viewpoint. Users search with a query, and Elasticsearch recommends documents. You have no way of knowing if the document it recommended was valid or not just based on the search. I think your question should really be, "How can I tune Search relevance in an intelligent way based on user data?".
Now, there are a number of ways you can achieve this, but they require you to gather user data and build the model yourself. So unfortunately, there is no easy way.
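One possible shape for this, as a sketch only: if you record how often each document is clicked and store that count in the index yourself, a function_score query with field_value_factor can add it to the text relevance. The field name "click_count" and the searched field "title" are hypothetical; Elasticsearch does not maintain such a counter for you.

```python
def build_popularity_query(field, text, popularity_field="click_count"):
    """Combine text relevance with a popularity counter you maintain yourself."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {field: text}},
                "field_value_factor": {
                    "field": popularity_field,  # hypothetical numeric field
                    "modifier": "log1p",        # dampen very large counts
                    "missing": 0,               # documents with no counter yet
                },
                # Add the popularity value to the text score, so documents
                # with zero clicks keep their original score.
                "boost_mode": "sum",
            }
        }
    }

popularity_body = build_popularity_query("title", "laptop")
```

With this approach document A would rank above B when their text scores are close, but only because your application keeps "click_count" up to date.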
However, I would recommend taking a look at https://www.elastic.co/app-search/, which offers a managed solution with a lot of custom relevance tuning and may save you a lot of time, depending on your use case.
I wonder whether the relevance score in Elasticsearch differs from the one in Couchbase?
As per this 2019 Couchbase thread, it looks like they are still using TF/IDF for scoring. Elasticsearch used to use the same algorithm but switched to BM25 as its default for score calculation as of 5.0.
Note: TF/IDF is a very popular algorithm for calculating the relevance score, based on term frequency and inverse document frequency, while BM25 is a newer, improved form based on probabilistic scoring. More details about them can be found here and here.
Note: The question does not mention why you are comparing the relevance of the two systems. My two cents: if you are building a full-blown search system and relevance matters to you, you should choose Elasticsearch, whose primary function is search and which offers a lot of flexibility in choosing different algorithms and different ways to define the scoring mechanism. That flexibility is not present in a NoSQL solution like Couchbase.
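To make the difference concrete, here is a simplified single-term sketch of both formulas. It follows the textbook shape of Lucene's classic TF/IDF and of BM25 with the usual defaults k1 = 1.2 and b = 0.75, and it leaves out details such as queryNorm, coord, and norm encoding, so the numbers will not match what either engine actually reports. All input values are made up for illustration.

```python
import math

def classic_tfidf(freq, doc_freq, num_docs, field_len):
    """Simplified classic TF/IDF for one term (no queryNorm or coord)."""
    tf = math.sqrt(freq)
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))
    norm = 1.0 / math.sqrt(field_len)
    return tf * idf * idf * norm

def bm25(freq, doc_freq, num_docs, field_len, avg_len, k1=1.2, b=0.75):
    """Textbook BM25 for one term."""
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    tf = freq * (k1 + 1) / (freq + k1 * (1 - b + b * field_len / avg_len))
    return idf * tf

# Made-up corpus statistics: 1000 docs, term appears in 50 of them,
# field length equal to the average length of 100 terms.
for freq in (1, 2, 4, 8, 16):
    print(freq,
          round(classic_tfidf(freq, 50, 1000, 100), 4),
          round(bm25(freq, 50, 1000, 100, 100), 4))
```

The main thing to notice is saturation: the TF/IDF score keeps growing with the square root of the term frequency, while the BM25 term-frequency factor levels off and never exceeds k1 + 1.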
With what function does Elasticsearch 7.5 calculate the default score? I found an explanation here (https://www.compose.com/articles/how-scoring-works-in-elasticsearch/), but as I understand it, it only applies to old versions, because query norm was removed in Lucene 7.0.
Prior to Lucene 6.x, ES used TF/IDF as its default scoring algorithm. The default changed to BM25 once ES started using Lucene 6.x and higher.
ES 7.5.1 uses Lucene 8.3.1 and they are using BM25 as their default scoring algorithm.
More details about the announcement of this change and other important links are below:
BM25 announcement: https://www.elastic.co/elasticon/conf/2016/sf/improved-text-scoring-with-bm25
BM25 details and internals: https://speakerdeck.com/elastic/improved-text-scoring-with-bm25
How to configure a different scoring algorithm: https://www.elastic.co/guide/en/elasticsearch/reference/current/similarity.html
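For reference, a custom similarity is declared in the index settings and then referenced from a field mapping. The sketch below builds such an index-creation body for 7.x; the similarity name "my_bm25" and the field "title" are made-up names, and the k1 and b values shown are simply the BM25 defaults.

```python
def build_index_with_similarity(k1=1.2, b=0.75):
    """Index-creation body declaring a tuned BM25 similarity and using it on one field."""
    return {
        "settings": {
            "index": {
                "similarity": {
                    # "my_bm25" is a hypothetical name chosen for this sketch.
                    "my_bm25": {"type": "BM25", "k1": k1, "b": b}
                }
            }
        },
        "mappings": {
            "properties": {
                # "title" is a placeholder text field.
                "title": {"type": "text", "similarity": "my_bm25"}
            }
        },
    }

index_body = build_index_with_similarity()
```

Changing k1 or b here only affects fields that reference "my_bm25"; other fields keep the default BM25 similarity.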
I have over 33m records in my Elasticsearch 7.1 index and when I query it, I limit the result size to 20. However, ES still scores the records internally. But this isn't important for me and in fact, I want any 20 results. So for example I don't care if some of the results are more relevant.
My question is, is there a way to turn this behaviour off, and if so, will it improve the performance?
Kind regards,
R.
You can use _doc as a sort field. This makes ES return the documents in index order, and because the results are no longer sorted by relevance, it does not need to compute scores.
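A minimal sketch of such a request body, assuming you really do want any 20 documents (the match_all query is a placeholder for whatever query or filter you already use):

```python
def build_unscored_search(query=None, size=20):
    """Search body sorted by _doc so hits come back in index order without scoring."""
    return {
        "size": size,
        "query": query if query is not None else {"match_all": {}},
        "sort": ["_doc"],  # index order; relevance is not used for ranking
    }

unscored_body = build_unscored_search()
```

Whether this gives a measurable speed-up on a 33m-record index depends on the query, so it is worth timing both variants on your own data.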
Here is a thread from the forums that explains more:
https://discuss.elastic.co/t/most-efficient-way-to-query-without-a-score/57457/4
I am currently using Elasticsearch and I have noticed a few things about the ranking of the search results, which led me to wonder whether there is a way to create plugins or scripts for ES that modify the current scoring algorithm.
You can either write a custom Java plugin for that, use function score queries, or use scripted similarity (which just came out this week).
If you can, I would use the latter two methods; writing a custom plugin should only be required very rarely.
You can refer to the blog post A Gentle Intro to Function Scoring, which describes ranking videos on a website using a combination of textual relevance and the video's popularity on the site.
To modify scoring, Elasticsearch provides script_score, function_score, and decay functions.
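As an illustration of two of these options, the sketch below builds a function_score query with a gauss decay function and a script_score query as available in 7.x. The field names "title", "published_at", and "likes" are hypothetical, as are the scale and the script itself; treat them as a starting point rather than a recommended formula.

```python
def build_decay_query(text):
    """function_score with a gauss decay: newer documents keep more of their score."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {"title": text}},
                "functions": [
                    # Score is multiplied by 0.5 for documents 30 days from now.
                    {"gauss": {"published_at": {"origin": "now",
                                                "scale": "30d",
                                                "decay": 0.5}}}
                ],
                "boost_mode": "multiply",
            }
        }
    }

def build_script_score_query(text):
    """script_score: rescale the text score with a numeric field."""
    return {
        "query": {
            "script_score": {
                "query": {"match": {"title": text}},
                "script": {
                    # "likes" is a hypothetical numeric field.
                    "source": "_score * Math.log(2 + doc['likes'].value)"
                },
            }
        }
    }

decay_body = build_decay_query("cats")
script_body = build_script_score_query("cats")
```

Both bodies leave the underlying similarity (BM25 by default) untouched and only reshape the final score, which is usually enough before reaching for a custom plugin.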