Do MLT queries in Elasticsearch use term vectors? - elasticsearch

Do more like this queries in Elasticsearch make use of term vectors if these are activated?

Yes.
The underlying Lucene MLT implementation also offers setMaxNumTokensParsed, which gives some control over performance when term vectors are not available.
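As a rough sketch of what "activated" means in practice, the request bodies below store term vectors for a field and then run a more_like_this query against a document that is already indexed. The field name `body`, the index name `articles` and the document id are hypothetical, and the typeless mapping layout assumes Elasticsearch 7.x.

```python
import json

# Hypothetical names, not taken from the question above.
FIELD = "body"
INDEX = "articles"


def mlt_mapping(field=FIELD):
    """Index-creation body that stores term vectors for `field`."""
    return {
        "mappings": {
            "properties": {
                field: {"type": "text", "term_vector": "yes"}
            }
        }
    }


def mlt_query(doc_id, index=INDEX, field=FIELD,
              min_term_freq=1, max_query_terms=12):
    """more_like_this body that uses an indexed document as the example."""
    return {
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": [{"_index": index, "_id": doc_id}],
                "min_term_freq": min_term_freq,
                "max_query_terms": max_query_terms,
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(mlt_mapping(), indent=2))
    print(json.dumps(mlt_query("1"), indent=2))
```

Without `term_vector` in the mapping the same query should still work; the terms of the example document then have to be re-analyzed at query time.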

Related

Semantic Similarity using Elastic Search

I went through some blog posts that say the Universal Sentence Encoder (USE) is used in Elasticsearch for semantic similarity. Can we use BERT instead of USE? They also say the embedding search has to go through all the documents. Can it be optimised?
https://www.elastic.co/blog/text-similarity-search-with-vectors-in-elasticsearch
Sure, you can use BERT. However, it will take longer to transform the data into vector embeddings. By the way, you should explore other similarity search alternatives, such as pinecone.io, which offers a managed vector search service.
Absolutely! You'll just have to make use of dense_vector fields in order to search for vectors, which is what BERT works with.
For more information on dense vectors:
https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html
For more information on how to optimize embeddings search, you can check out https://www.gsitechnology.com/sites/default/files/AppNotes/GSIT-Elasticsearch-Plugin-AppBrief.pdf
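A minimal sketch of the dense_vector approach, assuming a hypothetical field called `text_vector` and 768 dimensions (the output size of BERT-base). The script uses the `cosineSimilarity(params.query_vector, 'field')` form; some earlier 7.x releases expect `doc['field']` as the second argument instead, so check the documentation for your version.

```python
def vector_mapping(field="text_vector", dims=768):
    """Index-creation body with a dense_vector field for the embeddings."""
    return {
        "mappings": {
            "properties": {
                field: {"type": "dense_vector", "dims": dims}
            }
        }
    }


def cosine_query(query_vector, field="text_vector", size=10, filter_query=None):
    """script_score body ranking documents by cosine similarity.

    `filter_query` is an optional inner query; the script only runs on the
    documents it matches. With the default match_all, every document is scored.
    """
    inner = filter_query if filter_query is not None else {"match_all": {}}
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": inner,
                "script": {
                    # +1.0 keeps the score non-negative, as Elasticsearch requires
                    "source": f"cosineSimilarity(params.query_vector, '{field}') + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        },
    }
```

Passing a narrower `filter_query` (for example a term filter on a category) is one simple way to avoid scoring every document, which is the cost the question is asking about.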

Issue with results after upgrading elastic search

I am upgrading Elasticsearch from 2.2 to 7.1. I am maintaining both instances and comparing the results on the new and old versions by running the same search queries.
Note: I have not changed the mappings, settings or querying logic.
My results are almost the same but vary a little in scoring. Is that expected, even though the documents, mappings, settings and query logic are the same?
Elasticsearch 2.x uses TF/IDF for scoring, and this ES doc explains it in detail.
ES 7.x uses the improved BM25 algorithm for score calculation, and this is another nice article from ES which explains it in detail.
In short, yes: the scoring formula changed significantly between ES 2.x and 7.x because the underlying algorithm itself changed. Even though everything else (documents, mappings, settings and query) is the same, you will still get different scores.
You can use the explain API on your query to understand the score of documents returned by the query.
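When comparing the two clusters it may be more useful to check whether the ranking is the same rather than the raw scores. The helpers below are a small sketch: `with_explain` adds the `explain` flag to a search body, and `same_order` compares the document order of two search responses. The response dicts are assumed to have the usual `hits.hits[]._id` layout.

```python
def with_explain(query):
    """Wrap a query so each hit carries a score explanation."""
    return {"query": query, "explain": True}


def ranked_ids(response):
    """Document ids in the order the cluster returned them."""
    return [hit["_id"] for hit in response["hits"]["hits"]]


def same_order(old_response, new_response):
    """True if both clusters rank the same documents in the same order."""
    return ranked_ids(old_response) == ranked_ids(new_response)


if __name__ == "__main__":
    # Hand-written stand-ins for responses from the 2.2 and 7.1 clusters.
    old = {"hits": {"hits": [{"_id": "a", "_score": 3.2}, {"_id": "b", "_score": 1.1}]}}
    new = {"hits": {"hits": [{"_id": "a", "_score": 5.7}, {"_id": "b", "_score": 2.0}]}}
    print(same_order(old, new))
```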

Elastic search or Trie for search/autocomplete?

My understanding of how autocomplete/search for text or items works at a high level in any scalable product like Amazon eCommerce or Google was:
Elasticsearch (ES) based approach
Documents are stored in a DB. Once persisted, they are given to Elasticsearch, which creates the index and stores the index/document (based on the tokenizer) in memory or on disk, depending on configuration.
Once the user types, say, 3 characters, it searches all the indexes under ES (which can be configured to index even ngrams), ranks the matches by weight and returns them to the user.
But after reading a couple of resources on Google about trie-based search,
it looks like some scalable products also use the trie data structure to do prefix-based search.
My question is: can a trie-based approach be a good alternative to ES, or does ES internally use a trie, or am I missing something completely here?
ES autocompletion can be achieved in three ways:
using prefix queries
using (edge-)ngrams
using the completion suggester
The first option is the poor man's completion feature. I'm mentioning it because it can be useful in certain situations, but you should avoid it if you have a substantial number of documents.
The second option uses the conventional ES indexing features, i.e. it will tokenize the text, all (edge-)ngrams will be indexed, and then you can search for any prefix/infix/suffix that has been indexed.
The third option uses a different approach and is optimized for speed. Basically, when indexing a field of type completion, ES will create a "finite state transducer" and store it in memory for ultra fast access.
A finite state transducer is close to a trie in terms of implementation. You can check this excellent article which shows how a trie compares to a finite state transducer.
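To make the second option more concrete, here is a plain-Python sketch of what edge n-gram indexing amounts to: every prefix of every token becomes a key, so a prefix lookup is a single dictionary access. This only illustrates the idea; it is not how Elasticsearch stores its index, and the whitespace tokenizer and the `min_gram`/`max_gram` defaults are assumptions.

```python
from collections import defaultdict


def edge_ngrams(token, min_gram=1, max_gram=10):
    """All prefixes of `token` whose length is between min_gram and max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def build_index(titles, min_gram=1, max_gram=10):
    """Map each edge n-gram to the set of titles containing a matching token."""
    index = defaultdict(set)
    for title in titles:
        for token in title.lower().split():
            for gram in edge_ngrams(token, min_gram, max_gram):
                index[gram].add(title)
    return index


def suggest(index, prefix):
    """Titles that have at least one token starting with `prefix`."""
    return sorted(index.get(prefix.lower(), set()))


if __name__ == "__main__":
    idx = build_index(["Elastic Search", "Elasticsearch Guide", "Trie Basics"])
    print(suggest(idx, "ela"))
    print(suggest(idx, "tri"))
```

The trade-off is the usual one: lookups are cheap, but the index grows with the number of prefixes stored, which is why `max_gram` is normally capped.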
UPDATE (June 25th, 2019):
ES 7.2 introduced a new data type called search_as_you_type that allows this kind of behavior natively. Read more at: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-as-you-type.html
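A minimal sketch of how the search_as_you_type field is typically used, following the pattern in the linked reference page: the field is queried together with its `._2gram` and `._3gram` subfields through a multi_match query of type `bool_prefix`. The field name `title` is a placeholder.

```python
def sayt_mapping(field="title"):
    """Index-creation body with a search_as_you_type field."""
    return {
        "mappings": {
            "properties": {
                field: {"type": "search_as_you_type"}
            }
        }
    }


def sayt_query(text, field="title"):
    """Query body for the text the user has typed so far."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "bool_prefix",
                # root field plus the shingle subfields created by the mapping
                "fields": [field, f"{field}._2gram", f"{field}._3gram"],
            }
        }
    }
```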

Does BM25 use query coordinator?

In Lucene's practical scoring function there is a query coordinator which punishes documents that fail to match all the query terms. Does Okapi BM25 use the same trick?
The reason I'm curious is that I'm using Elasticsearch with the BM25 similarity module, and sometimes I feel this algorithm does not favor documents with more matches. There are cases where a document that contains one or two terms many times outscores a document containing all the query terms.
Yes and no.
No, it doesn't use a coord factor as described by the old Lucene default similarity (note: Lucene core now uses BM25 by default, as well).
Yes, it does weigh hits on more of the query terms more heavily than a bunch of hits on the same term. It does this with better term saturation, making the old coord factor effectively obsolete.
It is, however, always possible that many hits on fewer terms will outscore few hits on more terms using either algorithm.
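The saturation effect can be checked with a few lines of arithmetic. The sketch below compares the term-frequency component of classic TF/IDF (the square root of the frequency) with the textbook BM25 component, using k1 = 1.2 and b = 0.75. It assumes every term has the same IDF and every document has average length, and it leaves out the coord factor; actual Lucene versions differ in constant factors and length-norm encoding, so treat the numbers as illustrative only.

```python
import math


def classic_tf(freq):
    """Term-frequency component of the classic TF/IDF similarity."""
    return math.sqrt(freq)


def bm25_tf(freq, k1=1.2, b=0.75, doc_len=100, avg_doc_len=100):
    """Textbook BM25 term-frequency component; approaches k1 + 1 as freq grows."""
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return freq * (k1 + 1) / (freq + norm)


def score(term_freqs, tf):
    """Sum the tf component over matching terms (IDF assumed to be 1)."""
    return sum(tf(f) for f in term_freqs if f > 0)


if __name__ == "__main__":
    one_term_often = [5, 0]   # one query term, five occurrences
    both_terms_once = [1, 1]  # both query terms, one occurrence each
    for name, tf in (("classic", classic_tf), ("bm25", bm25_tf)):
        print(name, score(one_term_often, tf), score(both_terms_once, tf))
    # A very frequent single term can still edge ahead under BM25.
    print("bm25", score([20, 0], bm25_tf), score(both_terms_once, bm25_tf))
```

With these assumptions, five hits on one term beat one hit on each of two terms under the classic component (before coord is applied) but lose under BM25. Because a single term's BM25 contribution is bounded by k1 + 1 = 2.2, twenty hits on one term can still narrowly beat the two-term document.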

Custom plugin for Elasticsearch to change the default relevancy

I am currently using Elasticsearch and there are a few things I have noticed about the ranking of the search results, which led me to wonder whether there is a way to create a plugin or script for ES that can be used to modify the current scoring algorithm.
You can either write a custom Java plugin for that, use function score queries, or scripted similarity (which just came out this week).
If you can, I would use the latter two methods; writing a custom plugin should only be required very rarely.
You can refer to the blog post A Gentle Intro to Function Scoring, which describes ranking videos on a website using a combination of textual relevance and the video's popularity on the site.
To modify the scoring algorithm, Elasticsearch provides script_score, function_score and decay functions.
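As a rough illustration of the function_score route, the request body below combines text relevance with a popularity field and a recency decay, in the spirit of the video-ranking example. The field names `title`, `views` and `published_at`, and the 30-day scale, are made up for the sketch.

```python
def popularity_boosted(text, text_field="title",
                       popularity_field="views", date_field="published_at"):
    """function_score body: text match boosted by popularity and recency."""
    return {
        "query": {
            "function_score": {
                "query": {"match": {text_field: text}},
                "functions": [
                    # log1p dampens very large view counts
                    {
                        "field_value_factor": {
                            "field": popularity_field,
                            "modifier": "log1p",
                            "missing": 0,
                        }
                    },
                    # the gauss value halves for documents 30 days from now
                    {
                        "gauss": {
                            date_field: {
                                "origin": "now",
                                "scale": "30d",
                                "decay": 0.5,
                            }
                        }
                    },
                ],
                "score_mode": "sum",       # how the function values combine
                "boost_mode": "multiply",  # how they combine with the text score
            }
        }
    }
```

Changing `score_mode` and `boost_mode` is usually the first thing to experiment with, since they decide how much the functions can override the text match.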
