Norms, Document Frequency and Suggestions in Elasticsearch - elasticsearch

If I have a field called name and I use the suggest API to get suggestions for misspellings, do I need document frequencies or norms enabled on that field in order to get accurate suggestions? My assumption is yes, but I am curious whether there is a separate suggestions index in Lucene that handles frequency and/or norms even if I have them disabled for the field in my main index.

I doubt the suggester can work accurately without field-length normalization: disabling norms reduces the field to a binary signal of whether a term is present or not, which in turn affects the similarity score of each document.
These three factors—term frequency, inverse document frequency, and field-length norm—are calculated and stored at index time. Together, they are used to calculate the weight of a single term in a particular document.
"but I am curious if maybe there is a separate suggestions index in lucene that handles frequency and/or norms even if I have it disabled for the field in my main index."
Any suggester uses the Vector Space Model by default to calculate cosine similarity, which in turn relies on the tf-idf-norm based scoring computed at index time for each term to rank the suggestions, so I doubt the suggester can score documents accurately without the field norm.
Theory behind relevance scoring:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scoring-theory.html#field-norm
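For reference, a minimal sketch of what disabling norms on the field looks like in the mapping; recent Elasticsearch versions use the norms parameter shown here, older releases used omit_norms, and the index and field names are just placeholders:
PUT my-index
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "norms": false
      }
    }
  }
}
With this mapping, field-length information is simply not stored for name, so any scoring against this field can only rely on the statistics that remain.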

Related

Does Elasticsearch score different length shingles with the same IDF?

In Elasticsearch 5.6.5 I'm searching against a field with the following filter applied:
"filter_shingle":{
"max_shingle_size":"4",
"min_shingle_size":"2",
"output_unigrams":"true",
"type":"shingle"
}
When I perform a search for depreciation tax against a document with that exact text, I see the following explanation of the score:
weight(Synonym(content:depreciation content:depreciation tax)) .... [7.65]
weight(content:tax) ... [6.02]
If I change the search to depreciation taffy against the exact same document with depreciation tax in the content I get this explanation:
weight(Synonym(content:depreciation content:depreciation taffy)) .... [7.64]
This is not what I expected. I thought a match on the bigram token for depreciation tax would score much higher than a match on the unigram depreciation, but this scoring seems to reflect a simple unigram match. The difference is extremely small; digging further, it comes from termFreq=28 under the depreciation taffy match versus termFreq=29 under the depreciation tax match. I'm also not sure how this relates, since I imagine the shard holding this document has very different counts for depreciation, depreciation tax and depreciation taffy.
Is this expected behavior? Is ES treating all the different-sized shingles, including unigrams, with the same IDF value? Do I need to split out each shingle size into different subfields with different analyzers to get the behavior I expect?
TL;DR
Shingles and Synonyms are broken in Elastic/Lucene and a lot of hacks need to be applied until a fix is released (accurate as of ES 6).
Put unigrams, bigrams and so on in individual subfields and search them separately, combining the scores for an overall match. Don't use a single shingle filter that emits multiple n-gram sizes on one field.
Don't combine a synonym and a shingle filter on the same field.
In my case I do a must match with synonyms on a unigram field, then a series of should matches to boost the score on shingles of each size, without synonyms (see the sketch below).
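A hedged sketch of that query shape, with made-up subfield names (content.unigrams for the synonym-analyzed unigram field, content.shingles_2 and content.shingles_3 for the per-size shingle fields); the synonym expansion lives in the analyzer of the unigram subfield, not in the query itself:
GET my-index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "content.unigrams": "depreciation tax" } }
      ],
      "should": [
        { "match": { "content.shingles_2": "depreciation tax" } },
        { "match": { "content.shingles_3": "depreciation tax" } }
      ]
    }
  }
}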
Details
I got an answer on the elastic support forums:
https://discuss.elastic.co/t/does-elasticsearch-score-different-length-shingles-with-the-same-idf/126653/2
Yep, this is mostly expected.
It's not really the shingles causing the scoring oddness, but the fact that SynonymQueries do the frequency blending behavior that you're seeing. They use frequency of the original token for all the subsequent 'synonym' tokens, as a way to help prevent skewing the score results. Synonyms are often relatively rare, and would drastically affect the scoring if they each used their individual df's.
From the Lucene docs:
For scoring purposes, this query tries to score the terms as if you had indexed them as one term: it will match any of the terms but only invoke the similarity a single time, scoring the sum of all term frequencies for the document.
The SynonymQuery also sets the docFrequency to the maximum docFrequency of the terms in the document. So for example, if "depreciation" df == 5, "depreciation tax" df == 2, and "depreciation taffy" df == 1, it will use 5 as the docFrequency for scoring purposes.
The bigger issue is that Lucene doesn't have a way to differentiate shingles from synonyms... they both use tokens that overlap the position of other tokens in the token stream. So if unigrams are mixed with bi-(or larger)-grams, Lucene is tricked into thinking it's actually a synonym situation.
The fix is to keep your unigrams and bi-plus-grams in different fields. That way Lucene won't attempt to use SynonymQueries in these situations, because the positions won't be overlapping anymore.
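A minimal mapping sketch of that fix, assuming a content field with a dedicated bigram subfield (all names are made up; a trigram subfield would look the same with its own shingle filter), so that no single field mixes unigrams with larger shingles:
PUT my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "bigram_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false
        }
      },
      "analyzer": {
        "bigram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "bigram_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "shingles_2": {
            "type": "text",
            "analyzer": "bigram_analyzer"
          }
        }
      }
    }
  }
}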
Here's another related question I asked, about how actual synonyms also get mangled when combined with shingles: https://discuss.elastic.co/t/es-5-4-synonyms-and-shingles-dont-seem-to-work-together/127552
Elastic/Lucene expands the synonym set, injects the synonyms into the token stream, then creates shingles. E.g. query "econ supply and demand" => econ, economics, supply, demand. Document "... econ foo ..." => econ, foo. Now the query produces the shingle "econ economics", and somehow this matches the document. I have no idea why, since I only applied synonyms to the query, not the document, so I don't see where the match comes from. The way the shingles are created from the query is wrong, too.
This is a known problem, and it is still not fully resolved. A number of Lucene filters can't consume graphs as their inputs. There is currently active work being done on developing a fixed shingle filter, and also an idea to have a sub-field for indexing shingles.

"Term Vector API" clarification required

I'm not sure if I've understood the Term Vectors API correctly.
The document starts by saying:
Returns information and statistics on terms in the fields of a particular document. The document could be stored in the index or artificially provided by the user. Term vectors are realtime by default, not near realtime. This can be changed by setting realtime parameter to false.
I'm guessing term here refers to what some other people would call a token? Or has term been defined by the time we get here in the documentation and I've missed it?
Then the document continues by saying there are three sections to the return value: term information, term statistics, and field statistics. I take this to mean that term information and statistics are not the only things this API returns, correct?
Then Term information includes a field called payloads, which is not defined and I have no idea what it means.
Then in Field statistics, there is sum of document frequencies and sum of total term frequencies with a rather confusing explanation:
Setting field_statistics to false (default is true) will omit:
document count (how many documents contain this field)
sum of document frequencies (the sum of document frequencies for all terms in this field)
sum of total term frequencies (the sum of total term frequencies of each term in this field)
I guess they are simply the sum over their corresponding values reported in term statistics?
Then in the section Behavior it says:
The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures whereas the absolute numbers have no meaning in this context. By default, when requesting term vectors of artificial documents, a shard to get the statistics from is randomly selected. Use routing only to hit a particular shard.
So which one is it? Realtime or not? Or is it that term information is realtime and term statistics and field statistics are merely an approximation of the reality?
I'm guessing term here refers to what some other people would call a token? Or has term been defined by the time we get here in the documentation and I've missed it?
term and token are synonyms and simply mean whatever came out of the analysis process and has been indexed in the Lucene inverted index.
Then the document continues by saying there are three sections to the return value: term information, term statistics, and field statistics. I take this to mean that term information and statistics are not the only things this API returns, correct?
By default, the call returns term information and field statistics, but term statistics have to be requested explicitly with &term_statistics=true.
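For example, a request along these lines (index name and document id are placeholders, and the exact URL shape varies a bit across versions) should return all three sections:
GET my-index/_termvectors/1?fields=name&term_statistics=true&field_statistics=true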
Then Term information includes a field called payloads, which is not defined and I have no idea what it means.
Payload is a Lucene concept, which is pretty well explained here. Term payloads are not available unless you have a custom analyzer that makes use of a delimited-payload token filter to extract them.
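A hedged sketch of such a setup, with made-up names: a custom analyzer that runs the delimited_payload token filter (called delimited_payload_filter in older releases) and a field whose term vectors store positions and payloads:
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "payload_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["delimited_payload"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "payload_analyzer",
        "term_vector": "with_positions_payloads"
      }
    }
  }
}
Indexing a value like "brown|2.5 fox|1.0" would then attach the numbers after the | delimiter as payloads, and they show up in the term vectors response when payloads=true is requested.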
Then in Field statistics, there is sum of document frequencies and sum of total term frequencies with a rather confusing explanation:
[...]
I guess they are simply the sum over their corresponding values reported in term statistics?
The sum of "document frequencies" is the number of times each term present in the field appears in the same document. So if the field contains "big brown fox", it will count the number of times "big" appears in the same document, the number of times "brown" appears in the same document and the same for "fox".
The sum of "total term frequencies" is the number of times each term present in this field appears in all documents present in the Lucene index (which is located on a single shard of an ES index). So if the field contains "big brown fox", it will count the number of times "big" appears in all documents, the number of times "brown" appears in all documents and the same for "fox".
So which one is it? Realtime or not? Or is it that term information is realtime and term statistics and field statistics are merely an approximation of the reality?
It is realtime by default, which means that a refresh call is made when issuing the _termvectors call in order to get fresh information from the Lucene index. However, statistics are gathered only from a single shard, which does not give an overall view of the statistics of the whole ES index (potentially made of several shards, hence several Lucene indexes).
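For instance, an artificial-document request that skips the refresh and pins the statistics to a particular shard might look like this (index, routing value and document content are placeholders):
POST my-index/_termvectors?realtime=false&routing=user-1
{
  "doc": {
    "name": "big brown fox"
  },
  "term_statistics": true
}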

Search Without Field Length Normalization

How can I omit the field-length norm at search time? I have some documents I want to search, and I want to ignore extraneous strings so that things like "parrot" match "multiple parrots."
So, how can I ignore the field length norm at search time?
I had the same problem, though I think for efficiency reasons this is normally done at index time. Instead of using tf-idf I used BM25 (which is supposedly better). BM25 has a coefficient (b) on the field-norm term which can be set to 0 so that length normalization doesn't affect the score...
https://stackoverflow.com/a/38362244/3071643
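A minimal sketch of that approach for a new index (the similarity name and field are made up): a custom BM25 similarity with b set to 0 removes the length-normalization component for every field that references it.
PUT my-index
{
  "settings": {
    "index": {
      "similarity": {
        "no_length_norm": {
          "type": "BM25",
          "b": 0
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "similarity": "no_length_norm"
      }
    }
  }
}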

Elasticsearch: Modifying Field Normalization at Query Time (omit_norms in queries)

Elasticsearch takes the length of a document into account when ranking (they call this field normalization). The default behavior is to rank shorter matching documents higher than longer matching documents.
Is there any way to turn off or modify field normalization at query time? I am aware of the index-time omit_norms option, but I would prefer not to reindex everything to try this out.
Also, instead of simply turning off field normalization, I want to try out a few things. I would like to take field length into account, but not as heavily as Elasticsearch currently does. With the default behavior, a document will rank twice as high as a document that is twice as long. I want to try a non-linear relationship between ranking and length.

Elasticsearch scoring

I'm using Elasticsearch to find documents similar to a given document using the "more like this" query.
Is there an easy way to get the Elasticsearch score between 0 and 1 (using cosine similarity)?
Thanks!
You may want to look into the Function Score functionality of Elasticsearch, more specifically the script_score and field_value_factor functions. This will allow you to take the score from default scoring (_score) and enhance or replace it in other ways. It really depends on what sort of boosting or transformation you'd like. The default scoring model takes the vector space model into account, but other things as well.
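As a hedged illustration of the mechanics only (this is a monotonic squashing of _score into the 0 to 1 range, not a true cosine similarity), a function_score query can wrap the more_like_this query and rescore it with a script; the index, field, example text and exact more_like_this/script syntax are assumptions that vary by version:
GET my-index/_search
{
  "query": {
    "function_score": {
      "query": {
        "more_like_this": {
          "fields": ["content"],
          "like": "text of the document to compare against"
        }
      },
      "script_score": {
        "script": {
          "source": "_score / (_score + 1.0)"
        }
      }
    }
  }
}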
I don't think that's possible to retrieve directly.
But perhaps this workaround would make sense?
Elasticsearch always brings back max_score in the hits section.
You can potentially divide each document's _score by max_score. The document with the highest value will score 1, and documents that are less like the given one will score lower.
Elasticsearch uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse document frequency and the vector space model, but adds more modern features like a coordination factor, field-length normalization, and term or query clause boosting.
