I'm not sure if I've understood the Term Vectors API correctly.
The document starts by saying:
Returns information and statistics on terms in the fields of a particular document. The document could be stored in the index or artificially provided by the user. Term vectors are realtime by default, not near realtime. This can be changed by setting realtime parameter to false.
I'm guessing "term" here refers to what some other people would call a token? Or is "term" defined by the time we get here in the documentation and I've missed it?
Then the document continues by saying there are three sections to the return value: Term information, Term statistics, and Field statistics. I guess that means term information and statistics are not the only things this API returns, correct?
Then Term information includes a field called payloads, which is not defined and I have no idea what it means.
Then in Field statistics, there is sum of document frequencies and sum of total term frequencies with a rather confusing explanation:
Setting field_statistics to false (default is true) will omit:
document count (how many documents contain this field)
sum of document frequencies (the sum of document frequencies for all terms in this field)
sum of total term frequencies (the sum of total term frequencies of each term in this field)
I guess they are simply the sum over their corresponding values reported in term statistics?
Then in the section Behavior it says:
The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures whereas the absolute numbers have no meaning in this context. By default, when requesting term vectors of artificial documents, a shard to get the statistics from is randomly selected. Use routing only to hit a particular shard.
So which one is it? Realtime or not? Or is it that term information is realtime and term statistics and field statistics are merely an approximation of the reality?
I'm guessing "term" here refers to what some other people would call a token? Or is "term" defined by the time we get here in the documentation and I've missed it?
"term" and "token" are synonyms here: both simply mean whatever came out of the analysis process and was indexed into the Lucene inverted index.
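If you want to see exactly which terms an analyzer produces, the _analyze API shows the output of the analysis process. For example:

    POST _analyze
    {
      "analyzer": "standard",
      "text": "The Brown Foxes"
    }

This returns the terms "the", "brown" and "foxes", which is what would end up in the inverted index for a field using the standard analyzer.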
Then the document continues by saying there are three sections to the return value: Term information, Term statistics, and Field statistics. I guess that means term information and statistics are not the only things this API returns, correct?
By default, the call returns term information and field statistics, but term statistics have to be requested explicitly with &term_statistics=true.
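As a sketch (the index name, document id and field name are made up), a request for all three sections could look like this:

    GET /my-index/_termvectors/1?fields=text&term_statistics=true&field_statistics=true

On older versions the path includes the mapping type, e.g. /my-index/_doc/1/_termvectors.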
Then Term information includes a field called payloads, which is not defined and I have no idea what it means.
payload is a Lucene concept, which is pretty well explained here. Term payloads are not available unless you have a custom analyzer that uses a delimited-payload token filter to extract them.
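As a rough sketch of what such an analyzer could look like (the index, filter and analyzer names are made up; on older versions the filter type was called delimited_payload_filter):

    PUT /my-index
    {
      "settings": {
        "analysis": {
          "filter": {
            "my_payloads": {
              "type": "delimited_payload",
              "delimiter": "|",
              "encoding": "float"
            }
          },
          "analyzer": {
            "payload_analyzer": {
              "type": "custom",
              "tokenizer": "whitespace",
              "filter": ["lowercase", "my_payloads"]
            }
          }
        }
      }
    }

With this analyzer, indexing the text "fox|2.0" would store the term "fox" with a payload of 2.0.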
Then in Field statistics, there is sum of document frequencies and sum of total term frequencies with a rather confusing explanation:
[...]
I guess they are simply the sum over their corresponding values reported in term statistics?
The sum of "document frequencies" is the number of times each term present in the field appears in the same document. So if the field contains "big brown fox", it will count the number of times "big" appears in the same document, the number of times "brown" appears in the same document and the same for "fox".
The sum of "total term frequencies" is the number of times each term present in this field appears in all documents present in the Lucene index (which is located on a single shard of an ES index). So if the field contains "big brown fox", it will count the number of times "big" appears in all documents, the number of times "brown" appears in all documents and the same for "fox".
So which one is it? Realtime or not? Or is it that term information is realtime and term statistics and field statistics are merely an approximation of the reality?
It is realtime by default, which means that a refresh call is made when issuing the _termvectors call in order to get fresh information from the Lucene index. However, statistics are gathered only from a single shard, which does not give an overall view of the statistics of the whole ES index (potentially made of several shards, hence several Lucene indexes).
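If you can live with near-realtime results and want to avoid the implicit refresh, you can disable it per request (index name and document id are made up):

    GET /my-index/_termvectors/1?realtime=false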
Related
For example, with a search for "stack overflow" I want a document containing both "stack" and "overflow" to have a higher score than a document containing only one of those words.
Right now, I am seeing cases where a document that contains "stack" 0 times and "overflow" 50 times gets ranked above a document that contains "stack" 1 time and "overflow" 1 time.
A secondary concern is ranking documents higher that have the exact word as opposed to a word variant. For example, a document containing "stack" should be ranked higher than a document containing "stacking".
A third concern is ranking documents higher that have the words adjacent. For example a document "How to use stack overflow" should be ranked higher than a document "The stack of papers caused the inbox to overflow."
If you put those three concerns together, here is an example of the desired rank of results for "stack overflow":
Is it possible to configure an index or a query to calculate score this way?
Here you are trying to achieve multiple things in a single query. First, you should try to understand how ES scores the results it returns.
A document containing "overflow" 50 times gets ranked above a document that contains "stack" once and "overflow" once because ES's score calculation is based on TF/IDF. In this case "overflow" occurs 50 times, which is far higher than the combined frequency of the two terms in the other document.
Note: you can disable this calculation, as mentioned in the link.
If you don't care about how often a term appears in a field and all you care about is that the term is present, then you can disable term frequencies in the field mapping:
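The mapping the docs go on to show sets index_options to docs, which records only whether a term occurs, dropping frequencies and positions. Roughly (index and field names are made up):

    PUT /my-index
    {
      "mappings": {
        "properties": {
          "text": {
            "type": "text",
            "index_options": "docs"
          }
        }
      }
    }

Be aware that dropping positions also breaks phrase queries on that field.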
You are getting results containing the term "stacking" due to stemming. If you don't want documents containing "stacking" to appear in the search results, then don't index the field in stemmed form, or do some post-processing after getting the results from ES and reduce their score; I'm not sure if ES provides that out of the box.
The third thing you want is a phrase search.
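A hedged sketch (index and field names are made up): combine a regular match with a match_phrase in a bool query, so documents where the words are adjacent get an extra boost on top of the normal match score:

    GET /my-index/_search
    {
      "query": {
        "bool": {
          "must": { "match": { "title": "stack overflow" } },
          "should": { "match_phrase": { "title": "stack overflow" } }
        }
      }
    }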
Also, use the Explain API to understand how ES calculates the score of each document for your query; it will help you construct the right query for your requirements.
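You can either call the Explain API for a single document, or simply set explain to true on the search request (again with made-up names):

    GET /my-index/_search
    {
      "explain": true,
      "query": { "match": { "title": "stack overflow" } }
    }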
I intend to use the terminate_after feature of Elasticsearch in order to reduce the result set.
The question is: are the documents retrieved when using terminate_after ranked among the complete set of documents, or just among the reduced returned set?
terminate_after limits the number of search hits collected per shard, so a document that would have matched later could well have had a higher score than the highest-ranked document returned, since the score used for ranking is computed independently of the other hits.
So yes, the documents are ranked only among the returned result set, but this does not affect how each score is calculated, which takes all the documents into account.
Wanting a reduced result set and wanting it to be ranked depending on all the hits that may have occurred is a contradiction in itself.
terminate_after is generally used for filter-type queries where the score of all returned docs is the same, so ranking doesn't matter.
For match-type queries, ES uses pagination, so it's already quite efficient and you don't really need to restrict the document set anyway.
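As a sketch of such a filter-type query (index and field names are made up), terminate_after is just a top-level parameter of the search body; the response carries a terminated_early flag when the limit kicked in:

    GET /my-index/_search
    {
      "terminate_after": 100,
      "query": {
        "bool": {
          "filter": { "term": { "status": "published" } }
        }
      }
    }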
I'm trying to understand boosting and if boosting is the answer to my problem.
I have an index that has different types of data.
E.g. an index Animals. One of the fields is animaltype; its value can be Carnivorous, Herbivorous, etc.
Now when we query in search, I want to show results of type Carnivorous at the top, and then the Herbivorous type.
Also would it be possible to show only say top 3 results from a type and then remaining from other types?
Let's assume for the Herbivorous type we have a field named vegetables. This will have values only when animaltype is Herbivorous.
Now, can it be possible to have boosting rules specified as follows:
Boost Levels:
animaltype:Carnivorous
then animaltype:Herbivorous and vegetables:spinach
then animaltype:Herbivorous and vegetables:carrot
etc. Basically, boosting on various fields at various levels. I'm new to this concept. It would be really helpful to get some inputs/guidance.
Thanks,
Kasturi Chavan
Your example is closer to sorting than boosting, as you have a priority list for how important each document is - while boosting (in Solr) is usually applied a bit more fluidly, meaning that there is no hard line between documents of type X and type Y.
However - boosting with appropriately large values will in effect give you the same result, putting the documents into different score "areas" which will then give you the sort order you're looking for. You can see the score contributed by each term by appending debugQuery=true to your query. Boosting says that 'a document with this value is z times more important than those with a different value', but if the document only contains low scoring tokens from the search (usually words that are very common), while other documents contain high scoring tokens (words that are infrequent), the latter document might still be considered more important.
Example: searching for "city paris", where most documents contain the word "city", but only a few contain the word "paris" (and do not contain "city"). Even if you boost all documents assigned to country "germany", the score contributed by "city" might still be lower, even with the boost factor, than what "paris" contributes alone. This might not occur in real life, but you should know what the boost actually changes.
Using the edismax handler, you can apply the boost in two different ways - one is to use boost=, which is multiplicative, or to use either bq= or bf=, which are additive. The difference is how the boost contributes to the end score.
For your example, the easiest way to get something similar to what you're asking, is to use bq (boost query):
bq=animaltype:Carnivorous^1000&
bq=animaltype:Herbivorous^10
These boosts will probably be large enough to move all documents matching these queries into their own buckets, without moving between groups. To create "different levels" as your example shows, you'll need to tweak these values (and remember, multiple boosts can be applied to the same document if something is both herbivorous and eats spinach).
A different approach would be to create a function query using query, if and similar functions to result in a single integer value that you can use as a sorting value. You can also calculate this value when indexing the document if it's static (which your example is), and then sort by that field instead. It will require you to reindex your documents if the sorting values change, but it might be an easy and effective solution.
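As a rough sketch of such a function query used for sorting (assuming animaltype and vegetables are indexed with those exact terms, which depends on your analysis chain; split across lines here for readability):

    sort=if(termfreq(animaltype,'Carnivorous'),3,
         if(termfreq(vegetables,'spinach'),2,
          if(termfreq(vegetables,'carrot'),1,0))) desc, score desc

termfreq returns the raw term frequency, which if() treats as true when non-zero, so each document collapses into one of four levels before falling back to the normal score.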
To achieve the "Top 3 results from a type" you're probably going to want to look at Result grouping support - which makes it possible to get "x documents" for each value in a single field. There is, as far as I know, no way to say "I want three of these at the top, then the rest from other values", except for doing multiple queries (and excluding the three you've already retrieved from the second query). Usually issuing multiple queries works just as fine (or better) performance wise.
Elasticsearch takes the length of a document into account when ranking (they call this field normalization). The default behavior is to rank shorter matching documents higher than longer matching documents.
Is there any way to turn off or modify field normalization at query time? I am aware of the index-time omit_norms option, but I would prefer not to reindex everything to try this out.
Also, instead of simply turning off field normalization, I wanted to try out a few things. I would like to take field length into account, but not as heavily as Elasticsearch currently does. With the default behavior, a document will rank two times higher than a document that is two times longer. I wanted to try a non-linear relationship between ranking and length.
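The index-time option I'm referring to looks roughly like this in a recent mapping (newer versions spell it norms: false, older ones omit_norms: true; index and field names are made up):

    PUT /my-index
    {
      "mappings": {
        "properties": {
          "title": {
            "type": "text",
            "norms": false
          }
        }
      }
    }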
If I have a field called name and I use the suggest API to get suggestions for misspellings, do I need to have document frequencies or norms enabled in order to get accurate suggestions? My assumption is yes, but I am curious whether there is a separate suggestions index in Lucene that handles frequency and/or norms even if I have them disabled for the field in my main index.
I doubt the suggester can work without field-length normalization, as disabling norms means you only get a binary signal of whether the term is present in the document field, which in turn impacts the similarity score of each document.
These three factors—term frequency, inverse document frequency, and field-length norm—are calculated and stored at index time. Together, they are used to calculate the weight of a single term in a particular document.
"but I am curious if maybe there is a separate suggestions index in lucene that handles frequency and/or norms even if I have it disabled for the field in my main index."
Any suggester will use the Vector Space Model by default to calculate cosine similarity, which in turn uses the tf-idf-norm based scoring calculated during indexing for each term to rank the suggestions, so I doubt the suggester can score documents accurately without field norms.
theory behind relevance scoring:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scoring-theory.html#field-norm