New approach to indexing in Elasticsearch

I want to define a new approach to indexing in Elasticsearch, so I need to modify the TF-IDF method.
Where can I find the TF-IDF implementation used by Elasticsearch?
Which packages in the Elasticsearch source code do I need to change to implement the new approach?

The TF/IDF similarity algorithm is implemented in Lucene. However, you can configure a different similarity algorithm in Elasticsearch via the similarity module. The following similarities are currently supported (Classic similarity is the TF/IDF-based one):
BM25
Classic similarity
DFR similarity
DFI similarity
IB similarity
LM Dirichlet similarity
LM Jelinek Mercer similarity
Each of them has different parameters that you can tune. Maybe it'd be a good idea to test each of them before venturing into creating your own.
More info about the available Lucene similarity algorithms: https://lucene.apache.org/core/6_5_0/core/org/apache/lucene/search/similarities/Similarity.html
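As a minimal sketch of what "tuning" looks like before you touch any source code: the snippet below only builds the JSON body you would send when creating an index. The similarity name my_bm25 and the field name title are made up for illustration, and the mapping layout assumes Elasticsearch 7 or later (older versions also need a mapping type name).

```python
import json


def build_index_body(field: str = "title", k1: float = 1.2, b: float = 0.75) -> dict:
    """Build a create-index body that registers a tuned BM25 similarity
    and assigns it to one text field.

    "my_bm25" and the default field name are hypothetical names.
    """
    return {
        "settings": {
            "index": {
                "similarity": {
                    # A named similarity with its tunable parameters.
                    "my_bm25": {"type": "BM25", "k1": k1, "b": b}
                }
            }
        },
        "mappings": {
            "properties": {
                # The field opts in to the named similarity.
                field: {"type": "text", "similarity": "my_bm25"}
            }
        },
    }


body = build_index_body()
print(json.dumps(body, indent=2))
```

You would send this body with a PUT request to the index URL. Swapping "BM25" for another supported type (with that type's own parameters) lets you compare the built-in similarities without writing a plugin.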

Related

Semantic Similarity using Elastic Search

I went through some blogs that say the Universal Sentence Encoder (USE) is used in Elasticsearch for semantic similarity. Can we use BERT instead of USE? They also say the embedding search has to go through all the documents. Can it be optimised?
https://www.elastic.co/blog/text-similarity-search-with-vectors-in-elasticsearch
Sure, you can use BERT. However, transforming the data into vector embeddings will take longer at runtime. You could also explore other similarity search alternatives, such as pinecone.io, which offers a managed vector search service.
Absolutely! You'll just have to use the dense_vector field type to store and search the vectors, which is the kind of output BERT produces.
For more information on dense vectors:
https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
For more information on how to optimize embeddings search, you can check out https://www.gsitechnology.com/sites/default/files/AppNotes/GSIT-Elasticsearch-Plugin-AppBrief.pdf
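A rough sketch of how this could look, assuming a field called text_vector (a made-up name) and 768 dimensions, which is the output size of BERT-base. The code only builds the request bodies; producing the embeddings and sending the requests is left out. The exact script syntax for cosineSimilarity differs between Elasticsearch versions (older 7.x releases pass doc['text_vector'] instead of the field name), so treat the script line as an assumption to check against your version.

```python
from typing import List, Optional


def build_vector_mapping(dims: int = 768) -> dict:
    """Mapping with a dense_vector field next to the raw text.

    "text_vector" is a hypothetical field name.
    """
    return {
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "text_vector": {"type": "dense_vector", "dims": dims},
            }
        }
    }


def build_cosine_query(query_vector: List[float],
                       filter_query: Optional[dict] = None,
                       size: int = 10) -> dict:
    """script_score query that ranks documents by cosine similarity.

    Only documents matching the inner query are scored, so a narrower
    filter_query reduces how many vectors have to be compared.
    """
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": filter_query or {"match_all": {}},
                "script": {
                    # +1.0 keeps the score non-negative.
                    "source": "cosineSimilarity(params.query_vector, 'text_vector') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
    }


mapping = build_vector_mapping()
query = build_cosine_query([0.0] * 768, filter_query={"match": {"text": "elasticsearch"}})
```

With match_all as the inner query, every document is scored, which is the "goes through all the documents" behaviour the question mentions. Passing a cheaper text query as filter_query is one simple way to cut that down.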

How does calculating relevance scoring in Elasticsearch differ from Couchbase?

I wonder whether relevance scoring in Elasticsearch differs from relevance scoring in Couchbase.
According to this 2019 Couchbase thread, Couchbase still appears to use TF/IDF for scoring. Elasticsearch used to use the same algorithm, but switched to BM25 as its default for score calculation as of version 5.0.
Note: TF/IDF is a very popular algorithm for calculating the relevance score, based on term frequency and inverse document frequency, while BM25 is a newer, improved form based on probabilistic scoring. More details about them can be found here and here.
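To make the difference concrete, here is a simplified per-term sketch of both formulas. It is not the exact Lucene code: the constants in the idf terms have changed between Lucene versions, and query-side factors and boosts are left out. The parameter names are my own.

```python
import math


def classic_tfidf(freq: float, doc_count: int, doc_freq: int, doc_len: int) -> float:
    """Simplified classic TF/IDF score for one term in one document."""
    tf = math.sqrt(freq)                                       # grows without bound
    idf = 1.0 + math.log((doc_count + 1.0) / (doc_freq + 1.0))  # rarer terms score higher
    norm = 1.0 / math.sqrt(doc_len)                            # shorter documents score higher
    return tf * idf * norm


def bm25(freq: float, doc_count: int, doc_freq: int, doc_len: int,
         avg_doc_len: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Simplified BM25 score for one term in one document."""
    idf = math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term frequency saturates: this factor never exceeds k1 + 1.
    length_factor = 1.0 - b + b * doc_len / avg_doc_len
    tf = freq * (k1 + 1.0) / (freq + k1 * length_factor)
    return idf * tf


if __name__ == "__main__":
    for freq in (1, 2, 5, 20, 100):
        print(freq,
              round(classic_tfidf(freq, 1000, 10, 100), 3),
              round(bm25(freq, 1000, 10, 100, 100.0), 3))
```

The main thing to look at when you run it is how the two columns grow with freq: the TF/IDF score keeps increasing with the square root of the term frequency, while the BM25 score flattens out once a term has occurred a handful of times.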
Note: The question doesn't say why you are comparing the relevance of the two systems. My two cents: if you are building a full-blown search system and relevance matters to you, you should choose Elasticsearch. Its primary function is search, and it gives you a lot of flexibility in choosing between algorithms and in defining the scoring mechanism, which a NoSQL solution like Couchbase does not offer.

How cosine similarity differs from Okapi BM25?

I'm conducting research using Elasticsearch. I was planning to use cosine similarity, but I noticed that it is unavailable and that BM25 is the default scoring function instead.
Is there a reason for that? Is cosine similarity improper for querying documents? Why was BM25 chosen as default?
Thanks
For a long time Elasticsearch used the TF/IDF algorithm to score how similar documents are to a query. A number of versions ago the default was changed to BM25, which is considered more effective. You can read about it in the documentation. There is also a good article that explains what Elasticsearch is and how similarity works in ES.
You can also write a custom similarity algorithm for Elasticsearch. Here is a good article about how to do that.

Custom plugin for Elasticsearch to change the default relevancy

I am currently using Elasticsearch and have noticed a few things about how the search results are ranked. This led me to wonder whether there is a way to create a plugin or script for ES that modifies the current scoring algorithm.
You can either write a custom Java plugin for that, use function score queries, or use scripted similarity (which just came out this week).
If you can, I would use the latter two methods; writing a custom plugin should only rarely be required.
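For the scripted similarity route, a small sketch of the index settings could look like this. The script is a Painless re-implementation of a TF/IDF-style score along the lines of the example in the Elasticsearch similarity documentation; the names scripted_tfidf and body are assumptions, and the mapping layout assumes Elasticsearch 7 or later.

```python
import json

# Painless source for the similarity. doc.freq, doc.length, field.docCount,
# term.docFreq and query.boost are the statistics exposed to the script.
SCRIPT_SOURCE = (
    "double tf = Math.sqrt(doc.freq); "
    "double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; "
    "double norm = 1/Math.sqrt(doc.length); "
    "return query.boost * tf * idf * norm;"
)

# "scripted_tfidf" and "body" are hypothetical names.
scripted_index_body = {
    "settings": {
        "index": {
            "similarity": {
                "scripted_tfidf": {
                    "type": "scripted",
                    "script": {"source": SCRIPT_SOURCE},
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "body": {"type": "text", "similarity": "scripted_tfidf"}
        }
    },
}

print(json.dumps(scripted_index_body, indent=2))
```

Changing the ranking then means editing the script string and reindexing, rather than building and deploying a Java plugin.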
You can refer to the blog post A Gentle Intro to Function Scoring, which describes ranking videos on a website using a combination of textual relevance and each video's popularity on the site.
To modify the scoring, Elasticsearch provides script_score, function_score, and decay functions.
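As an illustration of how these pieces can be combined, here is a sketch of a function_score query that mixes text relevance with a popularity field and a recency decay. The field names title, views and published are invented for the example, and the scale and decay values are arbitrary starting points, not recommendations.

```python
def build_function_score_query(text: str,
                               text_field: str = "title",
                               popularity_field: str = "views",
                               date_field: str = "published") -> dict:
    """Combine the text score with popularity and recency.

    All field names are hypothetical.
    """
    return {
        "query": {
            "function_score": {
                # The base relevance score comes from a normal match query.
                "query": {"match": {text_field: text}},
                "functions": [
                    # Popularity: log1p dampens very large counts.
                    {
                        "field_value_factor": {
                            "field": popularity_field,
                            "modifier": "log1p",
                            "missing": 1,
                        }
                    },
                    # Recency: Gaussian decay, halving the factor 30 days from now.
                    {
                        "gauss": {
                            date_field: {"origin": "now", "scale": "30d", "decay": 0.5}
                        }
                    },
                ],
                "score_mode": "sum",       # how the function values are combined
                "boost_mode": "multiply",  # how that result is combined with the text score
            }
        }
    }


fs_query = build_function_score_query("elasticsearch scoring")
```

score_mode and boost_mode are the two knobs that usually need experimenting with; the right choice depends on how strongly popularity should be allowed to override textual relevance.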

Simple explanation of different ElasticSearch similarity algorithms

I am looking into the different similarity algorithms which define how the score of each document is computed during search. The available algorithms are listed here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-similarity.html
My problem is that I have trouble understanding them when digging through the Wikipedia articles or the class descriptions in the Lucene API documentation. I really like the answer explaining the TF/IDF similarity algorithm (the default in ElasticSearch) here: What is the reasoning behind the ranking of this ElasticSearch query? (so this one I understand to a certain extent).
Can somebody provide similarly simple explanations of the other algorithms outlined there? These include:
bm25 similarity
dfr similarity
ib similarity
Thank you in advance.
The problem you run into here is that, by the description given in the linked answer, Lucene's default similarity and bm25 are fundamentally identical, in that they both factor in:
more occurrences in the document are preferred
terms rarer in the corpus are preferred
shorter documents are more heavily weighted
other functions used to adjust score, boosts, etc.
dfr actually encompasses 7 different base models alone, each using a different scoring algorithm, followed by two highly configurable normalization steps. Many of the possible configurations fit the very general points above; some diverge from them.
Similarly, ib allows significant configuration as well, but generally hits the same high points: favoring higher term frequency, favoring matches on terms that are rarer (by some description), and adjusting for document length, boosts, and other possible normalizations.
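To give a feel for how configurable dfr and ib are, here is a sketch of index settings that declare one of each. The similarity names my_dfr and my_ib are made up, and the parameter values are just one valid combination taken as an assumption; the set of accepted basic_model values depends on your Elasticsearch/Lucene version.

```python
import json

# Hypothetical similarity names; parameter values are one example combination.
dfr_ib_settings = {
    "settings": {
        "index": {
            "similarity": {
                "my_dfr": {
                    "type": "DFR",
                    "basic_model": "g",          # base model of term occurrence
                    "after_effect": "l",         # first normalization step
                    "normalization": "h2",       # length normalization
                    "normalization.h2.c": "3.0",
                },
                "my_ib": {
                    "type": "IB",
                    "distribution": "ll",        # assumed term distribution
                    "lambda": "df",              # how the distribution parameter is estimated
                    "normalization": "h2",       # same length normalization family as DFR
                },
            }
        }
    }
}

print(json.dumps(dfr_ib_settings, indent=2))
```

Each of these options swaps out one of the general ingredients listed above (term frequency, term rarity, length normalization), which is why it is hard to give a single simple explanation of dfr or ib.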

Resources