Elasticsearch similarity discount_overlaps

I'm using Elasticsearch 5.3.1 and I'm evaluating BM25 and Classic TF/IDF.
I came across the discount_overlaps property, which is optional:
Determines whether overlap tokens (Tokens with 0 position increment)
are ignored when computing norm. By default this is true, meaning
overlap tokens do not count when computing norms.
Can someone explain what the above means, with an example if possible?

First off, the norm is calculated as boost / √length, and this value is stored at index time. This causes matches on shorter fields to get a higher score (because 1 in 10 is generally a better match than 1 in 1000).
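To make that concrete: with the default boost of 1.0, a 10-term field stores a norm of 1/√10 ≈ 0.32, while a 1000-term field stores 1/√1000 ≈ 0.03, so a single term match in the short field contributes roughly ten times as much to the score. (Lucene encodes the norm into a single byte at index time, so the stored value is a lossy approximation.)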
For an example, let's say we have a synonym filter on our analyzer that is going to index a bunch of synonyms in the indexed form of our field. Then we index this text:
The man threw a frisbee
Once the analyzer adds all the synonyms to the field, the indexed tokens look something like this, with each synonym stacked at the same position (position increment 0) as its original token (the exact synonyms depend on the filter; this illustrative expansion yields 12 tokens across 5 positions):

    The [man, dude, guy, fellow] [threw, pitched, tossed] a [frisbee, disc, saucer]
Now when we search for "The dude pitched a disc", we'll get a match.
The question is: for the purposes of the norm calculation above, what is the length?
if discount_overlaps = false, then length = 12
if discount_overlaps = true, then length = 5
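For reference, here is a minimal sketch of how you would set the option in ES 5.x (the index, similarity, and field names are hypothetical): define a custom similarity in the index settings and assign it to a field in the mapping:

```json
PUT /my_index
{
  "settings": {
    "index": {
      "similarity": {
        "my_bm25": {
          "type": "BM25",
          "discount_overlaps": false
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": { "type": "text", "similarity": "my_bm25" }
      }
    }
  }
}
```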

Related

IDF recalculation for existing documents in index?

I have gone through [Theory behind relevance scoring][1] and have two related questions.
Q1: The IDF formula is idf(t) = 1 + log(numDocs / (docFreq + 1)), where numDocs is the total number of documents in the index. Does this mean that each time a new document is added to the index, we need to recalculate the IDF for every word in all existing documents?
Q2: The link contains the statement quoted below. Is there any reason why the TF/IDF score is calculated against each field instead of the complete document?
When we refer to documents in the preceding formulae, we are actually
talking about a field within a document. Each field has its own
inverted index and thus, for TF/IDF purposes, the value of the field
is the value of the document.
The score is only calculated at query time, not at insert time. Lucene maintains the right statistics to make this a fast calculation, and the values are always fresh.
The frequency only really makes sense against a single field, since you are interested in the values for that specific field. Assume we have multiple fields and we search a single one; then we're only interested in the frequency within that one. When searching multiple fields, you still want control over the individual fields (such as boosting "title" over "body") or want to define how to combine them. If you have a use case where this doesn't make sense (not sure I have a good example right now; it's IMO far less common), then you could combine multiple fields into one with copy_to and search on that, as sketched below.
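A minimal sketch of that copy_to approach (the index, type, and field names here are hypothetical): "title" and "body" are both copied into a combined "all_text" field, which can then be searched as a single field:

```json
PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "title":    { "type": "text", "copy_to": "all_text" },
        "body":     { "type": "text", "copy_to": "all_text" },
        "all_text": { "type": "text" }
      }
    }
  }
}
```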

Configuring ElasticSearch relevance score to prefer a match on all words over a match with some words?

For example, with a search for "stack overflow" I want a document containing both "stack" and "overflow" to have a higher score than a document containing only one of those words.
Right now, I am seeing cases where a document that contains "stack" 0 times and "overflow" 50 times gets ranked above a document that contains "stack" 1 time and "overflow" 1 time.
A secondary concern is ranking documents higher that have the exact word as opposed to a word variant. For example, a document containing "stack" should be ranked higher than a document containing "stacking".
A third concern is ranking documents higher that have the words adjacent. For example, a document "How to use stack overflow" should be ranked higher than a document "The stack of papers caused the inbox to overflow."
If you put those three concerns together, here is an example of the desired rank of results for "stack overflow":
Is it possible to configure an index or a query to calculate score this way?
Here you are trying to achieve multiple things in a single query. First, you should understand how ES ranks the results.
The document containing "overflow" 50 times gets ranked above the document that contains "stack" once and "overflow" once because the ES score is based on a TF/IDF calculation, and in that document "overflow" occurs 50 times, which contributes far more to the score than the combined frequencies of the two terms in the other document.
Note: you can disable this calculation, as mentioned in the link:
If you don’t care about how often a term appears in a field and all
you care about is that the term is present, then you can disable term
frequencies in the field mapping:
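The mapping the quote refers to looks something like this (a sketch with a hypothetical field name): setting index_options to docs records only whether a term appears in the field, not how often:

```json
PUT /my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "text": {
          "type": "text",
          "index_options": "docs"
        }
      }
    }
  }
}
```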
You are getting results containing the term "stacking" due to stemming. If you don't want documents containing "stacking" to appear in the search results, then don't index the field in stemmed form, or do some post-processing after getting the results from ES to reduce their score (not sure if ES provides that out of the box).
The third thing you want is a phrase search.
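A common way to combine these (a sketch; the index and field names are hypothetical) is a bool query that requires the terms with a match clause and adds an optional match_phrase clause, so documents where the words appear next to each other score higher:

```json
GET /my_index/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { "text": "stack overflow" }
      },
      "should": {
        "match_phrase": { "text": "stack overflow" }
      }
    }
  }
}
```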
Also, use the explain API to understand how ES calculates the score of a document for your query; it will help you construct the right query for your requirements.
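For example, setting "explain": true on a search request makes ES return the full score breakdown for every hit:

```json
GET /my_index/_search
{
  "explain": true,
  "query": {
    "match": { "text": "stack overflow" }
  }
}
```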

Search Without Field Length Normalization

How can I omit the field length norm at search time? I have some documents I want to search, and I want extraneous strings in my query to be ignored so that things like "parrot" match "multiple parrots."
So, how can I ignore the field length norm at search time?
I had the same problem, though I think for efficiency reasons this is normally done at index time. Instead of using TF/IDF I used BM25 (which is supposedly better). BM25 has a coefficient (the b parameter) on the field-norm term which can be set to 0 so it doesn't affect the score:
https://stackoverflow.com/a/38362244/3071643
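Following that answer, a minimal sketch for ES 5.x (the index, similarity, and field names are hypothetical): BM25's b parameter controls length normalization, so setting it to 0 removes the field-length effect for that field:

```json
PUT /my_index
{
  "settings": {
    "index": {
      "similarity": {
        "bm25_no_length_norm": {
          "type": "BM25",
          "b": 0
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "description": { "type": "text", "similarity": "bm25_no_length_norm" }
      }
    }
  }
}
```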

Elasticsearch scoring based on how close a number is to a query

I want to score my documents based on how close a number is to a query. Given two documents, document1.field = 1 and document2.field = 10, and a query field = 3, I want document1._score > document2._score. In other words, I want something like a fuzzy query against a number. How would I achieve this? The use case is that I want to support price queries (exact or range) but also want to rank items that aren't exactly within the boundaries.
You are looking for Decay functions:
Decay functions score a document with a function that decays depending on the distance of a numeric field value of the document from a user given origin. This is similar to a range query, but with smooth edges instead of boxes.
It can also be implemented using the custom_score query, where a script determines the boost depending on the absolute value of the difference between the exact price and the desired price. The desired price should be passed to the script as a parameter to avoid script recompilation for every request.
Alternatively, it can be implemented using the custom_filters_score query. The filters here would contain different ranges around the desired price; smaller ranges would get a higher boost and appear higher in the list than larger ranges.
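Note that custom_score and custom_filters_score are from older Elasticsearch versions; in 5.x the same idea is expressed with the function_score query. A sketch using a gauss decay on a hypothetical price field, centered on the desired price of 3:

```json
GET /products/_search
{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "gauss": {
        "price": {
          "origin": 3,
          "scale": 5,
          "decay": 0.5
        }
      }
    }
  }
}
```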

elasticsearch fuzzy matching max_expansions & min_similarity

I'm using fuzzy matching in my project, mainly to find misspellings and different spellings of the same names. I need to understand exactly how Elasticsearch's fuzzy matching works and how it uses the two parameters mentioned in the title.
As I understand it, min_similarity is a percentage by which the queried string matches the string in the database. I couldn't find an exact description of how this value is calculated.
max_expansions, as I understand it, is the Levenshtein distance within which a search should be executed. If this actually were the Levenshtein distance, it would have been the ideal solution for me. Anyway, it's not working that way.
For example, I have the word "Samvel":

    queryStr     max_expansions   matches?
    samvel       0                error: "Should not be 0" (but the Levenshtein distance can be 0!)
    samvel       1                Yes
    samvvel      1                Yes
    samvvell     1                Yes (but it shouldn't have)
    samvelll     1                Yes (but it shouldn't have)
    saamvelll    1                No (but for some weird reason it matches Samvelian)
    saamvelll    >1               No
The documentation says something I actually do not understand:
Add max_expansions to the fuzzy query allowing to control the maximum number
of terms to match. Default to unbounded (or bounded by the max clause count in
boolean query).
So can anyone please explain how exactly these parameters affect the search results?
The min_similarity is a value between zero and one. From the Lucene docs:
For example, for a minimumSimilarity of 0.5 a term of the same length
as the query term is considered similar to the query term if the edit
distance between both terms is less than length(term)*0.5
The 'edit distance' that is referred to is the Levenshtein distance.
The way this query works internally is:
1. it finds all terms that exist in the index that could match the search term, taking min_similarity into account, and
2. then it searches for all of those terms.
You can imagine how heavy this query could be!
To combat this, you can set the max_expansions parameter to specify the maximum number of matching terms that should be considered.
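For what it's worth, min_similarity was later replaced by the fuzziness parameter, which takes the maximum edit distance directly. A sketch of a fuzzy query in more recent versions (the index and field names are hypothetical):

```json
GET /people/_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "samvel",
        "fuzziness": 2,
        "max_expansions": 50
      }
    }
  }
}
```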
