I'm working on a solution for custom score boosting in Elasticsearch.
I wanted to ask whether using function_score is a good idea: the index is large, but the result set of the query should be fairly small.
Does function_score work on the query result, or is it part of the query logic? If the former, it should be fast. Is it?
PS. Initially the query boost operator seemed like the best option, but I can't get it to raise the score of one of the matches much above the normal range. I've checked the _explain API, and it says that queryNorm normalizes my boost, so I still get values no higher than the normal range (0.1 .. 4).
In principle, yes: it will slow the search down. The actual penalty depends on the complexity of your script. function_score runs during the so-called 'search' phase, which means it is applied to all matched docs.
If your case is suitable for the rescoring functionality, you could make your logic faster, because rescoring is applied only to the top N results (N is configurable in the rescore API).
More information about rescoring - https://www.elastic.co/guide/en/elasticsearch/guide/current/_improving_performance.html#rescore-api
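To make the rescoring idea concrete, here is a minimal sketch of a search body that runs a cheap match query first and applies function_score only inside the rescore window. The field names `title` and `popularity`, the helper name and the weights are assumptions for illustration, not something from the question. Note that window_size applies per shard.

```python
import json

def build_rescore_request(user_query, window_size=50):
    """Build a search body: cheap first-pass query, costly scoring on top hits only."""
    return {
        # First pass: an ordinary match query scored for every matching doc.
        "query": {"match": {"title": user_query}},  # "title" is a hypothetical field
        "rescore": {
            # Only the top `window_size` hits per shard are rescored.
            "window_size": window_size,
            "query": {
                "rescore_query": {
                    "function_score": {
                        "query": {"match_all": {}},
                        # "popularity" is a hypothetical numeric field.
                        "field_value_factor": {
                            "field": "popularity",
                            "modifier": "log1p",
                            "missing": 1,
                        },
                    }
                },
                # How the original score and the rescore score are combined.
                "query_weight": 1.0,
                "rescore_query_weight": 2.0,
            },
        },
    }

body = build_rescore_request("custom score boosting")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a _search request. Whether this gives the ordering you want depends on how many of the hits you care about fall inside the window.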
I need to build a search engine using Elasticsearch, and the steps will be as follows:
Search on the search engine with a search string.
The relevant results will display and I can click on these documents.
If I select a document, I will be redirected to another page where I will see all the details of the documents and will have an option "More Like This" (which will return documents similar to the selected document). I know that this is done using the MLT query.
Now my question is: besides returning documents similar to the selected one, how can I also return the percentage by which each document is similar to the selected one?
There are a couple of things you can do.
using function_score query
The more_like_this query is essentially a full-text search, and it returns documents ordered by their relevance score. It could be possible to convert the score directly to a percentage, but it is not advised (here and more specifically here).
Instead one can define a custom score with help of a function_score query, which can be designed so it returns a meaningful percentage.
This, of course, comes with additional cost of complexity, and the definition of "similarity" becomes more of an art than of science.
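As one possible shape for such a custom score, the sketch below wraps a more_like_this query in function_score and replaces the relevance score with a sum of filter weights that add up to 100. The fields `title`, `body`, `category` and `author`, the 60/40 split and the helper name are all hypothetical; picking them is exactly the "art" part.

```python
import json

def percent_similarity_query(index, doc_id, category, author):
    """Build a query whose score is a sum of hand-picked weights (at most 100)."""
    return {
        "query": {
            "function_score": {
                # Candidate selection still comes from more_like_this.
                "query": {
                    "more_like_this": {
                        "fields": ["title", "body"],  # hypothetical text fields
                        "like": [{"_index": index, "_id": doc_id}],
                        "min_term_freq": 1,
                    }
                },
                # Each matching filter contributes its weight to the score.
                "functions": [
                    {"filter": {"term": {"category": category}}, "weight": 60},
                    {"filter": {"term": {"author": author}}, "weight": 40},
                ],
                "score_mode": "sum",      # add up the weights of matching functions
                "boost_mode": "replace",  # discard the more_like_this relevance score
            }
        }
    }

body = percent_similarity_query("articles", "42", "search", "nikolay")
print(json.dumps(body, indent=2))
```

With this design a document matching both filters scores 100 and a document matching only one scores 60 or 40. How documents that match none of the filters are scored is worth checking with explain before relying on the numbers.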
using dense_vector
One may opt to use the (still experimental) dense_vector data type, which allows storing and comparing dense vectors (that is, fixed-size arrays of numbers). Here's an article that describes this approach very well: Text similarity search with vector fields.
In this case the definition of similarity is as precise as it can possibly be: the distance between two vectors in a multidimensional space, which can be computed via, for instance, cosine similarity.
However, such dense vectors have to be somehow computed, and the quality of said vectors will equal the quality of the similarity itself.
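For illustration, here is a small sketch of cosine similarity between two vectors and one way to turn it into a percentage. The linear mapping from [-1, 1] to [0, 100] is an assumption made here, not something Elasticsearch does for you, and the example vectors are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

def similarity_percent(a, b):
    """Map cosine similarity from [-1, 1] onto [0, 100] (an assumed convention)."""
    return 50.0 * (cosine_similarity(a, b) + 1.0)

selected_doc = [0.2, 0.7, 0.1]  # made-up embedding of the selected document
candidate = [0.25, 0.6, 0.2]    # made-up embedding of a candidate document
print(round(similarity_percent(selected_doc, candidate), 1))
```

Identical directions give 100, orthogonal vectors give 50 and opposite directions give 0. If your embeddings never produce negative cosines, multiplying the cosine by 100 directly may be the more natural choice.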
The bottom line is that to make this work with Elasticsearch, a fair amount of computation and logic has to be added outside of it, either in the form of pre-computed models or custom-curated scoring algorithms. Out of the box, Elasticsearch does not seem to be a good fit for percentage-similarity use cases.
Hope that helps!
If you're going the route of using semantic search via dense_vector, as Nikolay mentioned, I would recommend NBoost. NBoost has good out-of-the-box systems for improving Elasticsearch results with SOTA models.
In Lucene's practical scoring function there is a query coordination factor which punishes documents that fail to match all the query terms. Does Okapi BM25 use the same trick?
The reason I'm curious about it is that I'm using Elasticsearch with the BM25 similarity module, and sometimes I feel this algorithm does not favor documents with more matches. There are cases where a document that contains one or two of the terms many times outscores a document containing all the query terms.
Yes and no.
No, it doesn't use a coord factor as described by the old Lucene default similarity (note: Lucene core now uses BM25 by default, as well).
Yes, it does weigh hits on more of the query terms more heavily than a bunch of hits on the same term. It does this with better term saturation, making the old coord factor effectively obsolete.
It is, however, always possible that many hits on fewer terms will outscore a few hits on more terms using either algorithm.
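A quick numeric sketch of the saturation effect, under simplifying assumptions: idf is treated as 1 for every term, length normalization is left out, k1 = 1.2, and the textbook BM25 term-frequency formula is used (recent Lucene versions drop the constant k1 + 1 factor, which does not change the ordering). The classic similarity is represented by its sqrt(tf) factor only, without coord or norms.

```python
import math

def bm25_tf(tf, k1=1.2):
    """Textbook BM25 term-frequency component without length normalization."""
    return tf * (k1 + 1) / (tf + k1)

def classic_tf(tf):
    """Term-frequency factor of the old Lucene TF-IDF similarity."""
    return math.sqrt(tf)

# Document A matches two query terms once each.
# Document B matches a single query term `hits` times.
for hits in (5, 100):
    print(
        hits,
        "classic A/B:", 2 * classic_tf(1), round(classic_tf(hits), 3),
        "bm25 A/B:", 2 * bm25_tf(1), round(bm25_tf(hits), 3),
    )
```

With 5 hits on one term, the classic factor already puts B ahead of A, while BM25 keeps A ahead. With 100 hits, B wins under both, but under BM25 only by a small margin, because the term-frequency component can never exceed k1 + 1.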
In my particular use case, the IDF factor that gets calculated as part of the TF-IDF algorithm messes up the scoring for my queries. Basically, I want the queries to take only the term frequency into account. Is it possible to disable the IDF factor, i.e. set it to 1, for a particular index? I have looked into the similarity module (in version 0.90.X), but haven't really found anything that could help; the same goes for the function_score query. Do I need to write a custom Similarity class in Java? Or is there a plugin for what I'm trying to achieve?
What about constant_score query?
See http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/ignoring-tfidf.html
Don't hesitate to use ?explain=true to see how scoring is working.
As you can see here, without the constant_score query:
And with the constant_score query (which wraps your real query):
Screenshots made with https://beemapp.me
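Since the screenshots are not reproduced here, the following sketch shows what wrapping the real query in constant_score can look like. The field name `description`, the terms and the helper name are assumptions. Be aware that constant_score gives every matching document the same score, so it removes term frequency as well as IDF, which is more than the question asked for.

```python
import json

def count_matching_terms_query(field, terms):
    """One constant_score clause per term; each matching clause adds its boost."""
    return {
        "query": {
            "bool": {
                "should": [
                    {
                        "constant_score": {
                            # The wrapped query only decides whether a doc matches.
                            "filter": {"match": {field: term}},
                            "boost": 1.0,
                        }
                    }
                    for term in terms
                ]
            }
        }
    }

body = count_matching_terms_query("description", ["wifi", "garden", "pool"])
print(json.dumps(body, indent=2))
```

With a query like this the score should come out roughly as the number of terms that matched, which ?explain=true lets you confirm. Older versions accepted "query" instead of "filter" inside constant_score.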
Is there an equivalence between the scores that Oracle Text would calculate and the ones Lucene calculates?
Would you be able to mix the two sources to get one unified result set ordered by score?
Scores are not comparable between queries or across data changes in Lucene, much less comparable to those of another technology. The Lucene score of the same document can change dramatically when other documents are added to or removed from the index. Scoring as a percentage of the maximum looks like the obvious solution, but the same problems remain, and the algorithms of another technology will likely produce a different score distribution. You can read about why you should not compare scores like this here and here.
A way I managed to lash something similar together was to fetch matches from the other data source, and create a temporary index in a RAMDirectory, and then search again incorporating it with a MultiSearcher. That way everything is getting scored on a single, cohesive data set, within a single search. Scoring should be reasonable enough, though this isn't exactly the most efficient way to search.
The problem is that one of our terms could be very common (for example, the number "3"). In that case I would like to limit the number of search results that get scored while Lucene is running the query. Is that even possible?
Just to emphasize: I don't just want to limit Lucene's search results (that could easily be done using the second, numeric parameter of the IndexSearcher.Search method). I want to tell Lucene something like: don't spend too much time searching for hits on that specific term. Once you have found, let's say, 1,000,000 hits, stop looking and move on to the other terms.
No, you can't. As you might know, absolute scores are meaningless in Lucene, so there's no support for them.
Because the term is really common, its document frequency will be high (and its idf correspondingly low), so it will probably be relatively inconsequential due to Lucene's pruning algorithms. You can always lower the boost to make it matter even less, but I'd double-check that this is really your performance bottleneck.
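To put rough numbers on how little a very common term contributes, here is a sketch using the idf formula of Lucene's classic TF-IDF similarity, 1 + ln(numDocs / (docFreq + 1)). The index size and document frequencies below are made up, and other similarities (BM25, for instance) use a different idf formula.

```python
import math

def classic_idf(doc_freq, num_docs):
    """idf as defined by Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

num_docs = 2_000_000                         # made-up index size
common = classic_idf(1_000_000, num_docs)    # a term like "3", in half of all docs
rare = classic_idf(10, num_docs)             # a term found in only 10 docs
print(round(common, 3), round(rare, 3))
```

Under these assumptions the rare term's idf is several times larger than the common term's, so the common term moves the final ranking far less even before any boost is lowered.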