Disable IDF calculation - elasticsearch

In my particular use case, the IDF factor that gets calculated as part of the TF-IDF algorithm messes up the scoring for my queries. Basically, I want the queries to take only the term frequency into account. Is it possible to disable the IDF factor, i.e. set it to 1, for a particular index? I have looked into the similarity module (in version 0.90.X), but haven't found anything that could help; the same goes for the function_score query. Do I need to write a custom Similarity class in Java? Or is there a plugin for what I'm trying to achieve?

What about a constant_score query?
See http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/ignoring-tfidf.html
Don't hesitate to use ?explain=true to see how the scoring works.
[Screenshot: explain output without the constant_score query]
[Screenshot: explain output with the constant_score query wrapping your real query]
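As a rough sketch of the wrapping described above, the request body can be built as a plain Python dict. The field name "description" and the search text "wifi" are made up for illustration. Note two assumptions: recent Elasticsearch releases expect the wrapped clause under "filter", while older ones (such as the guide linked above) used "query", so the key is left configurable; and constant_score discards term frequency along with IDF, so it gives every match the same score rather than a TF-only score.

```python
import json


def constant_score_search(inner_query, boost=1.0, wrapper_key="filter"):
    """Wrap inner_query so that every matching document gets the same score.

    wrapper_key is an assumption about the target version:
    "filter" for recent releases, "query" for older ones.
    """
    return {
        # Same effect as ?explain=true: return the score breakdown per hit.
        "explain": True,
        "query": {
            "constant_score": {
                wrapper_key: inner_query,
                "boost": boost,
            }
        },
    }


# Hypothetical field and search text, for illustration only.
body = constant_score_search({"match": {"description": "wifi"}})
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request with whatever client you already use.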

Related

Does elastic search use previous search frequencies?

Does Elasticsearch take into account how often a document has been searched for before? For example, take document A and document B. Both have similar scores in terms of edit distance and other metrics, but document A is searched for very frequently and B is not. Will Elasticsearch score A higher than B? If not, how can I achieve this?

Elasticsearch does not change score based on previous searches in its default scoring algorithm. In fact, this is really a question about Lucene scoring, since Elasticsearch uses it for all of the actual Search logic.
I think you may be looking at this from the wrong viewpoint. Users search with a query, and Elasticsearch recommends documents. You have no way of knowing if the document it recommended was valid or not just based on the search. I think your question should really be, "How can I tune Search relevance in an intelligent way based on user data?".
Now, there are a number of ways you can achieve this, but they require you to gather user data and build the model yourself. So unfortunately, there is no easy way.
However, I would recommend taking a look at https://www.elastic.co/app-search/, which offers a managed solution with lots of custom relevance tuning that may save you a lot of time, depending on your use case.
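If you do gather the data yourself, one possible shape for the query side is sketched below. It assumes you maintain a per-document counter, here a hypothetical "click_count" field that your own application updates; Elasticsearch does not populate anything like it on its own. The function_score query with field_value_factor then adds a popularity bonus on top of the text score.

```python
import json


def popularity_boosted_query(text, text_field="title", counter_field="click_count"):
    """Combine text relevance with a popularity counter you maintain yourself.

    text_field and counter_field are hypothetical names for illustration.
    """
    return {
        "query": {
            "function_score": {
                "query": {"match": {text_field: text}},
                "field_value_factor": {
                    "field": counter_field,
                    # Dampen large counts so popular documents do not swamp relevance.
                    "modifier": "log1p",
                    "factor": 1.0,
                    # Documents without the counter are treated as never clicked.
                    "missing": 0,
                },
                # Add the bonus to the text score, so documents with a zero
                # counter keep their original score instead of dropping to 0.
                "boost_mode": "sum",
            }
        }
    }


body = popularity_boosted_query("elasticsearch scoring")
print(json.dumps(body, indent=2))
```

Whether "sum" or another boost_mode is right depends on how strongly popularity should outweigh text relevance, which is something to tune against your own data.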

Is there a way to show at what percentage a selected document is similar to others on ElasticSearch?

I need to build a search engine using Elasticsearch and the steps will be as following:
Search on the search engine with a search string.
The relevant results will display and I can click on these documents.
If I select a document, I will be redirected to another page where I will see all the details of the documents and will have an option "More Like This" (which will return documents similar to the selected document). I know that this is done using the MLT query.
Now my question is: Except for returning documents similar to the selected one, how can I also return at what percentage the documents are similar to the selected one?

There are a couple of things you can do.
using function_score query
A more_like_this query is essentially a full-text search, and it returns documents ordered by their relevance score. It is possible to convert the score directly to a percentage, but this is not advised.
Instead, one can define a custom score with the help of a function_score query, which can be designed so that it returns a meaningful percentage.
This, of course, comes at the cost of additional complexity, and the definition of "similarity" becomes more of an art than a science.
using dense_vector
One may opt to use the (still experimental) dense_vector data type, which allows storing and comparing dense vectors (that is, fixed-size arrays of numbers). Here's an article that describes this approach very well: Text similarity search with vector fields.
In this case the definition of similarity is as precise as it can possibly be: a distance of two vectors in a multidimensional space, which can be computed via, for instance, cosine similarity.
However, such dense vectors have to be somehow computed, and the quality of said vectors will equal the quality of the similarity itself.
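To make the vector approach concrete, here is a small, self-contained sketch of cosine similarity turned into a percentage. The mapping from the cosine range [-1, 1] to [0, 100] is my own assumption for illustration, not something Elasticsearch defines, and the vectors are toy values rather than real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def similarity_percent(a, b):
    """Map cosine similarity from [-1, 1] onto [0, 100] (an assumed convention)."""
    return 50.0 * (1.0 + cosine_similarity(a, b))


# Toy vectors, for illustration only.
selected_doc = [3.0, 4.0]
other_doc = [4.0, 3.0]
print(round(similarity_percent(selected_doc, other_doc), 1))
```

With real embeddings the same function applies unchanged; only the way the vectors are produced differs.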
As the bottom line I must say that to make this work with Elasticsearch a bunch of computation and logic should be added outside, either in form of pre-computed models, or custom curated scoring algorithms. Elasticsearch out of the box does not seem to be a good percentage-similarity kind of deal.
Hope that helps!

If you're going the route of semantic search via dense_vector, as Nikolay mentioned, I would recommend NBoost. NBoost has good out-of-the-box systems for improving Elasticsearch results with SOTA models.

Custom plugin for Elasticsearch to change the default relevancy

I am currently using Elasticsearch, and I have noticed a few things about the ranking of the search results that made me wonder: is there a way to create a plugin or script for ES that modifies the current scoring algorithm?

You can either write a custom Java plugin for that, use function score queries, or scripted similarity (which just came out this week).
If you can, I would use the latter two methods; writing a custom plugin should only be required very rarely.

You can refer to the blog post A Gentle Intro to Function Scoring, which describes ranking videos on a website using a combination of textual relevance and the video's popularity on the site.
To modify the scoring algorithm, Elasticsearch provides script_score, function_score, and decay functions.
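As an illustration of the decay functions mentioned above, the sketch below builds a function_score query whose gauss decay lowers the score of older documents. The field names "title" and "published_at" and the 30-day scale are hypothetical; pick values that match your own mapping and data.

```python
import json


def recency_decay_query(text, text_field="title", date_field="published_at", scale="30d"):
    """Multiply the text score by a Gaussian decay on document age.

    Field names and the scale are hypothetical, chosen for illustration.
    """
    return {
        "query": {
            "function_score": {
                "query": {"match": {text_field: text}},
                "functions": [
                    {
                        "gauss": {
                            date_field: {
                                "origin": "now",  # decay is measured from the current time
                                "scale": scale,   # distance at which the factor equals "decay"
                                "decay": 0.5,
                            }
                        }
                    }
                ],
                # The decay factor scales the text relevance score.
                "boost_mode": "multiply",
            }
        }
    }


body = recency_decay_query("cat videos")
print(json.dumps(body, indent=2))
```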

How do normalization and internal optimization of boosting work? And how does that affect the relevance?

I'm new to Elasticsearch. I'm having trouble understanding the calibration and scaling of boost values for the fields in a document, that is, how we should decide the boost value for each field so that it works as expected. I've gone through some online blogs and the ES docs as well, and they say that ES does normalization and internal optimization of boost values. How does that work?
E.g.: If we have tags, title, name and text fields in our doc, how should we decide the boosting values for these?

Elasticsearch uses a boolean model to match documents, and then a scoring model to determine relevance (i.e. ranking). The scoring model utilizes a TF/IDF score, coupled with some additional features. Those TF/IDF scores are calculated for each matching field within a query, and then aggregated to produce an overall score for a document. To dig into this process, I suggest running explain on your query to see how the score of each field is influencing the overall relevance of your document.
As the expert on your data, you're in the best position to determine which fields should most heavily influence the relevance of your document. Finding the right boost value for a field is about adjusting the levers until you find a formula that best suits your desired outcome. (Also, if you have users, A/B testing can help here.)
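For the tags, title, name and text example from the question, a starting point might look like the sketch below. The boost values are arbitrary placeholders, not recommendations; the point is only to show where the levers are and to turn on explain so you can see each field's contribution.

```python
import json


def boosted_multi_match(text, boosts):
    """Build a multi_match query with per-field boosts using the field^boost syntax."""
    fields = [f"{name}^{boost}" for name, boost in boosts.items()]
    return {
        # Ask for a per-hit score breakdown to see how each field contributes.
        "explain": True,
        "query": {"multi_match": {"query": text, "fields": fields}},
    }


# Placeholder boost values: adjust them while watching the explain output.
body = boosted_multi_match(
    "elasticsearch scoring",
    {"title": 3, "tags": 2, "name": 2, "text": 1},
)
print(json.dumps(body, indent=2))
```

Changing one boost at a time and comparing the explain output for a handful of known queries is usually easier to reason about than tuning everything at once.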

Performance of "function_score"

I'm working on a solution for custom score boosting in Elasticsearch.
I wanted to ask whether using function_score is a good idea. The index is large, but the result of the query should not be that big.
Does function_score work on the query result, or is it part of the query logic? If the former, it should be fast; is it?
PS. Initially the query boost operator seemed like the best option, but I can't get it to raise the score of one of the matches much above the normal range. I've checked the _explain API, and it says that queryNorm normalizes my boost, so I still get values below the normal range (0.1 .. 4).

In principle, yes, it will slow down the search. Of course, the real penalty will depend on the complexity of your script. It runs during the so-called 'search' phase, which means it is applied to all matched docs.
You could make your logic faster, if your case is suitable for it, by using the rescoring functionality, because it is applied only to the top N results (N is configurable in the rescore API).
More information about rescoring - https://www.elastic.co/guide/en/elasticsearch/guide/current/_improving_performance.html#rescore-api
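A minimal sketch of the rescoring idea, with made-up field names and weights: a cheap query selects the candidates, and the more expensive function_score query is only run on the top window_size hits from each shard.

```python
import json


def rescored_search(cheap_query, expensive_query, window_size=50):
    """Run expensive_query only on the top window_size hits of cheap_query (per shard)."""
    return {
        "query": cheap_query,
        "rescore": {
            "window_size": window_size,
            "query": {
                "rescore_query": expensive_query,
                # Weights are illustrative; they control how the two scores are combined.
                "query_weight": 1.0,
                "rescore_query_weight": 2.0,
            },
        },
    }


# Hypothetical fields: "title" for the text match, "popularity" for the boost.
cheap = {"match": {"title": "elasticsearch"}}
expensive = {
    "function_score": {
        "query": {"match_all": {}},
        "field_value_factor": {"field": "popularity", "missing": 1},
    }
}
body = rescored_search(cheap, expensive, window_size=100)
print(json.dumps(body, indent=2))
```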
