Search Without Field-Length Normalization - elasticsearch

How can I omit the field-length norm at search time? I have some documents I want to search, and I want extra words in a document to be ignored, so that a query like "parrot" matches "multiple parrots" without penalty.
So, how can I ignore the field length norm at search time?

I had the same problem, though I think for efficiency reasons this is normally handled at index time. Instead of using TF-IDF I used BM25 (which is supposedly better). BM25 has a coefficient on the field-norm term which can be set to 0 so it doesn't affect the score; see the sketch after the link below.
https://stackoverflow.com/a/38362244/3071643
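A minimal sketch of that approach, assuming a local cluster and the official Python client (8.x); the index and field names here are made up for illustration. A custom BM25 similarity with b set to 0 is defined in the index settings and attached to a text field, so field length no longer influences that field's score:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Hypothetical index: a BM25 similarity with b=0 disables the
# field-length component of the score for fields that use it.
es.indices.create(
    index="parrots",
    settings={
        "similarity": {
            "no_length_norm": {"type": "BM25", "b": 0}
        }
    },
    mappings={
        "properties": {
            "description": {"type": "text", "similarity": "no_length_norm"}
        }
    },
)
```

Note this is still configured at index/mapping time; as the answer says, there is no per-query switch for the length norm.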

Related

IDF recalculation for existing documents in an index?

I have gone through the Theory behind relevance scoring guide and have two related questions.
Q1: The IDF formula is idf(t) = 1 + log(numDocs / (docFreq + 1)), where numDocs is the total number of documents in the index. Does that mean that each time a new document is added to the index, we need to recalculate the IDF of every term for all existing documents in the index?
Q2: The linked guide makes the statement quoted below. Is there any reason why the TF/IDF score is calculated against each field instead of the complete document?
When we refer to documents in the preceding formulae, we are actually
talking about a field within a document. Each field has its own
inverted index and thus, for TF/IDF purposes, the value of the field
is the value of the document.
You only calculate the score at query time and not at insert time. Lucene has the right statistics to make this a fast calculation and the values are always fresh.
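As a quick illustration of why this is cheap, here is that IDF formula evaluated by hand in plain Python (the numbers are made up; Lucene's classic similarity uses the natural logarithm):

```python
import math

def idf(num_docs: int, doc_freq: int) -> float:
    # idf(t) = 1 + log(numDocs / (docFreq + 1))
    return 1 + math.log(num_docs / (doc_freq + 1))

# "parrot" appears in 9 of 1,000 documents
print(idf(1000, 9))   # ~5.61

# After indexing 1,000 more documents without "parrot", the next
# query simply reads the fresh counts and gets a new value.
print(idf(2000, 9))   # ~6.30
```

Nothing stored per document has to be touched; only the per-term document frequency and the total document count are needed, and Lucene already maintains both.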
Frequency only really makes sense against a single field, since you are interested in the values for that specific field. Assume we have multiple fields and we search a single one: then we're only interested in the frequency for that one. When searching multiple fields you still want control over the individual fields (such as boosting "title" over "body"), or you want to define how to combine them. If you have a use case where this doesn't make sense (not sure I have a good example right now; in my opinion it's far less common), you can combine multiple fields into one with copy_to and search on that, as sketched below.
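A minimal sketch of the copy_to idea, assuming the Python client and hypothetical field names: title and body are copied into a combined full_text field that can be searched as a single field.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

es.indices.create(
    index="articles",
    mappings={
        "properties": {
            "title": {"type": "text", "copy_to": "full_text"},
            "body": {"type": "text", "copy_to": "full_text"},
            # TF/IDF statistics for full_text are computed over the
            # combined content, as if it were one field.
            "full_text": {"type": "text"},
        }
    },
)
```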

Is there any way to increase maximum fuzziness on a fuzzy query in elasticsearch?

I'm trying to use Elasticsearch to do a fuzzy query for strings. According to this link (https://www.elastic.co/guide/en/elasticsearch/reference/1.6/common-options.html#fuzziness), the maximum fuzziness allowed is 2, so the query will only return results that are 2 edits away using the Levenshtein edit distance. The site says that the Fuzzy Like This Query supports fuzzy searches with a fuzziness greater than 2, but so far using the Fuzzy Like This Query has only allowed me to search for results within two edits of the search term. Is there any workaround for this constraint?
It looks like this was a bug which was fixed quite a while back. Which Elasticsearch version are you using?
For context, the reason why Edit Distance is now limited to [0,1,2] for most Fuzzy operations has to do with a massive performance improvement of fuzzy/wildcard/regexp matching in Lucene 4 using Finite State Transducers.
Executing a fuzzy query via an FST requires knowing the desired edit distance at the time the transducer is constructed (at index time). This was likely capped at an edit distance of 2 to keep the FST size requirements manageable, but also possibly because, for many applications, an edit distance greater than 2 introduces a whole lot of noise.
The previous fuzzy query implementation required visiting each document to calculate edit distance at query-time and was impractical for large collections.
It sounds like Elasticsearch (1.x) is still using the original (non-performant) implementation for the FuzzyLikeThisQuery, which is why the edit-distance can increase beyond 2. However, FuzzyLikeThis has been deprecated as of 1.6 and won't be supported in 2.0.
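For reference, on current versions the fuzziness of a match or fuzzy query is simply capped at 2, or set to "AUTO", which picks 0, 1 or 2 based on term length. A hedged sketch with the Python client and made-up index/field names:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

resp = es.search(
    index="people",
    query={
        "match": {
            "name": {
                "query": "samvel",
                # "AUTO" allows 0, 1 or 2 edits depending on term length;
                # an explicit integer above 2 is rejected.
                "fuzziness": "AUTO",
            }
        }
    },
)
print(resp["hits"]["total"])
```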

Norms, Document Frequency and Suggestions in Elasticsearch

If I have a field called name and I use the suggest API to get suggestions for misspellings, do I need to have document frequencies or norms enabled in order to get accurate suggestions? My assumption is yes, but I am curious if maybe there is a separate suggestions index in Lucene that handles frequency and/or norms even if I have them disabled for the field in my main index.
I doubt the suggester can work accurately without field-length normalization. Disabling norms means you are only recording a binary value, whether the term is present in the document field or not, which in turn has an impact on the similarity score of each document.
These three factors—term frequency, inverse document frequency, and field-length norm—are calculated and stored at index time. Together, they are used to calculate the weight of a single term in a particular document.
"but I am curious if maybe there is a separate suggestions index in lucene that handles frequency and/or norms even if I have it disabled for the field in my main index."
Any suggester will, by default, use the vector space model to calculate cosine similarity, which in turn relies on the tf-idf-norm based scoring calculated for each term during indexing to rank the suggestions. So I doubt the suggester can score documents accurately without field norms.
Theory behind relevance scoring:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scoring-theory.html#field-norm
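For context, a spelling-correction request of the kind the question describes might look like the sketch below (Python client; the index, field and text are made up). The term suggester generates candidates from the field's terms and scores them by edit distance and document frequency, which are exactly the statistics discussed above:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

resp = es.search(
    index="products",
    suggest={
        "name-suggestion": {
            "text": "parot",                # misspelled input
            "term": {"field": "name"},      # candidates come from the "name" field
        }
    },
)
for option in resp["suggest"]["name-suggestion"][0]["options"]:
    print(option["text"], option["score"], option["freq"])
```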

Elasticsearch scoring

I'm using Elasticsearch to find documents similar to a given document using the "more like this" query.
Is there an easy way to get the Elasticsearch score normalized to between 0 and 1 (using cosine similarity)?
Thanks!
You may want to look into the Function Score functionality of Elasticsearch, more specifically the script_score and field_value_factor functions. This will allow you to take the score from default scoring (_score) and enhance or replace it in other ways. It really depends on what sort of boosting or transformation you'd like. The default scoring model takes into account the vector space model, but other things as well.
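A hedged sketch of what reshaping the score could look like, using the Python client and hypothetical index/field names. Here a more_like_this query is wrapped in function_score and its _score is multiplied by a numeric field; script_score could be used instead for an arbitrary formula:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

resp = es.search(
    index="docs",
    query={
        "function_score": {
            "query": {
                "more_like_this": {
                    "fields": ["body"],
                    "like": "text of the document to compare against",
                }
            },
            # Multiply the default _score by a numeric field as one example
            # of transforming the score.
            "field_value_factor": {
                "field": "popularity",
                "missing": 1,
            },
        }
    },
)
```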
I don't think that's possible to retrieve directly.
But perhaps this workaround would make sense?
Elasticsearch always brings back max_score in the hits section of the response.
You can divide each document's _score by max_score: the document with the highest score comes out as 1, and documents that are less like the given one score less.
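That workaround is straightforward to do client-side; a sketch with the Python client (the index and query text are made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

resp = es.search(
    index="docs",
    query={
        "more_like_this": {
            "fields": ["body"],
            "like": "text of the document to compare against",
        }
    },
)

max_score = resp["hits"]["max_score"]
for hit in resp["hits"]["hits"]:
    # Relative score in [0, 1]; note this is NOT a true cosine similarity,
    # just the raw _score scaled by the best hit in this result set.
    print(hit["_id"], hit["_score"] / max_score)
```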
Elasticsearch uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse document frequency and the vector space model, but adds more modern features like a coordination factor, field-length normalization, and term or query clause boosting.
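If you want to see exactly which of those factors contribute to a given _score, the explain flag attaches a scoring breakdown to each hit (sketch with the Python client; index and query are made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

resp = es.search(
    index="docs",
    query={"match": {"body": "parrot"}},
    explain=True,  # attach a scoring breakdown to each hit
)
for hit in resp["hits"]["hits"]:
    # _explanation is a nested tree of the factors (tf, idf, norms, boosts)
    # that produced this hit's _score.
    print(hit["_id"], hit["_explanation"]["description"])
```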

elasticsearch fuzzy matching max_expansions & min_similarity

I'm using fuzzy matching in my project mainly to find misspellings and different spellings of the same names. I need to understand exactly how Elasticsearch's fuzzy matching works and how it uses the two parameters mentioned in the title.
As I understand it, min_similarity is a percentage by which the queried string must match the string in the database. I couldn't find an exact description of how this value is calculated.
As I understand it, max_expansions is the Levenshtein distance by which a search should be executed. If it actually were the Levenshtein distance, it would have been the ideal solution for me. Anyway, it's not working.
For example, I have the word "Samvel":
queryStr     max_expansions   matches?
samvel       0                Error: "should not be 0" (but the Levenshtein distance can be 0!)
samvel       1                Yes
samvvel      1                Yes
samvvell     1                Yes (but it shouldn't have)
samvelll     1                Yes (but it shouldn't have)
saamvelll    1                No (but for some weird reason it matches "Samvelian")
saamvelll    > 1              No
The documentation says something I actually do not understand:
Add max_expansions to the fuzzy query allowing to control the maximum number
of terms to match. Default to unbounded (or bounded by the max clause count in
boolean query).
So can anyone please explain to me how exactly these parameters affect the search results?
The min_similarity is a value between zero and one. From the Lucene docs:
For example, for a minimumSimilarity of 0.5 a term of the same length
as the query term is considered similar to the query term if the edit
distance between both terms is less than length(term)*0.5
The 'edit distance' that is referred to is the Levenshtein distance.
The way this query works internally is:
1. It finds all terms that exist in the index that could match the search term, taking the min_similarity into account.
2. Then it searches for all of those terms.
You can imagine how heavy this query could be!
To combat this, you can set the max_expansions parameter to specify the maximum number of matching terms that should be considered.
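Putting both knobs together, here is a sketch of a fuzzy query with the Python client (index, field and values are made up; on recent versions min_similarity has been replaced by fuzziness):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

resp = es.search(
    index="people",
    query={
        "fuzzy": {
            "name": {
                "value": "samvel",
                "fuzziness": 2,        # maximum edit distance per term
                "max_expansions": 50,  # cap on how many index terms the
                                       # fuzzy term may expand into
            }
        }
    },
)
print([hit["_source"] for hit in resp["hits"]["hits"]])
```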
