I understand that there is a term vectors API in Elasticsearch which can give word positions and other stats.
Can the percolator give word positions in the documents that are being searched on?
I understand that the documents are not indexed and only the percolator queries are indexed. I see the below:
If the requested information wasn’t stored in the index, it will be computed on the fly if possible. Additionally, term vectors could be computed for documents not even existing in the index, but instead provided by the user.
in - https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html
So I am interested to know: can Elasticsearch calculate the word positions on the fly?
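For reference, this is the kind of on-the-fly call I mean (just a sketch; the cluster URL, index name and field are made up, and it assumes the elasticsearch-py 8.x client):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

# Term vectors for an "artificial" document that is not stored in the index;
# positions and offsets are computed on the fly from the supplied text.
resp = es.termvectors(
    index="my-index",             # made-up index name
    doc={"text": "Hello World"},  # artificial document
    fields=["text"],
    positions=True,
    offsets=True,
)
for term, info in resp["term_vectors"]["text"]["terms"].items():
    print(term, [t["position"] for t in info["tokens"]])
```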
Any leads are appreciated. Thanks for reading.
@Kaveh
Thanks for taking the time, but I'm sorry, I don't see how this (https://stackoverflow.com/a/67926555/4068218) is related, because using artificial documents I can already get the stats - https://www.elastic.co/guide/en/elasticsearch/reference/6.8/docs-termvectors.html
but what I have is percolator - https://www.youtube.com/watch?v=G2Ru2KV0DZg
So even if I get the term vectors on the fly using artificial documents or via /_analyze, it does not matter, as they will not give me the position of terms in the percolator results.
E.g., with the percolator I am trying to find the word "Hello".
My document has the below field and value
"text": "Hello World"
If I use artificial documents or /_analyze, it will say 0 - Hello, 1 - World, but when I percolate I will only get the
percolator query that found the word "Hello". I want to combine both, and I want the percolator to tell me
"I found Hello in position 0".
As you can see in the term vectors documentation, if you store _source, Elasticsearch can calculate the term vectors on the fly. It will analyze your text based on the source and aggregate it with the existing term vectors of the index.
If you want to get the result for a term, you can always get the analyzed data for a list of terms; for more information see here.
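For example, something along these lines with the _analyze API (an untested sketch; the connection, index and field names are placeholders, and it assumes the elasticsearch-py 8.x client):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

# Run the text through the same analyzer as the mapped field;
# every token comes back with its position and character offsets.
resp = es.indices.analyze(index="my-index", field="text", text="Hello World")
for token in resp["tokens"]:
    print(token["token"], token["position"], token["start_offset"])
```

You can then look up the positions of whatever terms your percolator queries matched in that token list.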
Elasticsearch uses an inverted index, which is totally understandable because it returns all the documents containing the word we searched for.
But I do not understand where we would use a forward index. We don't usually search for a document and expect the words contained in that particular document.
Is there any practical use case for a forward index? Is any company using it for its product?
As mentioned in this SO answer, there is no technical difference between the forward index and the inverted index. A forward index is a list of the terms contained within a particular document. The inverted index is a list of the documents containing a given term.
Please go through this blog, where it is clearly mentioned that a forward index is pretty fast when indexing but has less efficient queries,
whereas an inverted index has slower indexing but fast queries. To get a detailed explanation of the inverted index, you can refer to this article and this blog.
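To make the difference concrete, here is a toy illustration (plain Python, nothing Elasticsearch-specific) of the two structures built from the same documents:

```python
from collections import defaultdict

docs = {
    "doc1": ["stack", "overflow", "questions"],
    "doc2": ["stack", "of", "papers"],
}

# Forward index: document -> the terms it contains (cheap to build while indexing).
forward_index = {doc_id: set(terms) for doc_id, terms in docs.items()}

# Inverted index: term -> the documents containing it (cheap to query).
inverted_index = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        inverted_index[term].add(doc_id)

print(forward_index["doc1"])    # terms of a given document
print(inverted_index["stack"])  # documents containing a given term
```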
For example, with a search for "stack overflow" I want a document containing both "stack" and "overflow" to have a higher score than a document containing only one of those words.
Right now, I am seeing cases where a document that contains "stack" 0 times and "overflow" 50 times gets ranked above a document that contains "stack" 1 time and "overflow" 1 time.
A secondary concern is ranking documents higher that have the exact word as opposed to a word variant. For example, a document containing "stack" should be ranked higher than a document containing "stacking".
A third concern is ranking documents higher that have the words adjacent. For example, a document "How to use stack overflow" should be ranked higher than a document "The stack of papers caused the inbox to overflow."
If you put those three concerns together, here is an example of the desired rank of results for "stack overflow":
Is it possible to configure an index or a query to calculate score this way?
Here you are trying to achieve multiple things in a single query. First, you should try to understand how ES is returning the results to you.
The document containing "overflow" 50 times gets ranked above the document that contains "stack" 1 time and "overflow" 1 time because the ES score calculation is based on TF/IDF. In this case, "overflow" appears 50 times, which is much higher than the combined frequency of the two
terms in the other document.
Note: You can disable this calculation, as mentioned in the link:
If you don’t care about how often a term appears in a field and all
you care about is that the term is present, then you can disable term
frequencies in the field mapping:
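In mapping terms that is the index_options setting. A sketch with the Python client (assumes the elasticsearch-py 8.x client; the index and field names are made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

# "index_options": "docs" indexes only the document IDs for this field, so how
# often a term occurs in the field no longer influences the score.
# Note: positions are dropped too, so phrase queries won't work on this field.
es.indices.create(
    index="my-index",  # made-up index name
    mappings={
        "properties": {
            "text": {"type": "text", "index_options": "docs"}
        }
    },
)
```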
You are getting results containing the term "stacking" due to stemming. If you don't want documents containing "stacking" to come up in the search results, then don't index the field in stemmed form, or do some post-processing after getting the results from ES and reduce their score; I'm not sure if ES provides that out of the box.
The third thing you want is a phrase search.
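A minimal sketch of that (made-up index and field names, elasticsearch-py 8.x assumed); it still matches documents containing either word, but boosts those where the words are adjacent:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

query = {
    "bool": {
        "must": [{"match": {"title": "stack overflow"}}],
        "should": [
            {"match_phrase": {"title": {"query": "stack overflow", "boost": 2}}}
        ],
    }
}
resp = es.search(index="my-index", query=query)
```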
Also, use the Explain API to understand how ES calculates the score of a document for your query; it will help you construct the right query according to your requirements.
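For example (again just a sketch with placeholder names, elasticsearch-py 8.x):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

# Explains how the score of document "1" is computed for this query
# (term frequencies, IDF, field norms, boosts, and so on).
resp = es.explain(
    index="my-index",  # made-up index name
    id="1",            # made-up document id
    query={"match": {"title": "stack overflow"}},
)
print(resp["explanation"])
```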
I'm not sure if I've understood the Term Vectors API correctly.
The document starts by saying:
Returns information and statistics on terms in the fields of a particular document. The document could be stored in the index or artificially provided by the user. Term vectors are realtime by default, not near realtime. This can be changed by setting realtime parameter to false.
I'm guessing "term" here refers to what some other people would call a token, maybe? Or is "term" defined by the time we get here in the documentation and I've missed it?
Then the document continues by saying there are three sections to the return value: Term information, Term Statistics, and Field statistics. I guess meaning that term information and statistics is not the only thing this API returns, correct?
Then Term information includes a field called payloads, which is not defined and I have no idea what it means.
Then in Field statistics, there is sum of document frequencies and sum of total term frequencies with a rather confusing explanation:
Setting field_statistics to false (default is true) will omit :
document count (how many documents contain this field)
sum of document frequencies (the sum of document frequencies for all terms in this field)
sum of total term frequencies (the sum of total term frequencies of each term in this field)
I guess they are simply the sum over their corresponding values reported in term statistics?
Then in the section Behavior it says:
The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures whereas the absolute numbers have no meaning in this context. By default, when requesting term vectors of artificial documents, a shard to get the statistics from is randomly selected. Use routing only to hit a particular shard.
So which one is it? Realtime or not? Or is it that term information is realtime and term statistics and field statistics are merely an approximation of the reality?
I'm guessing "term" here refers to what some other people would call a token, maybe? Or is "term" defined by the time we get here in the documentation and I've missed it?
term and token are synonyms and simply mean whatever came out of the analysis process and has been indexed in the Lucene inverted index.
Then the document continues by saying there are three sections to the return value: Term information, Term Statistics, and Field statistics. I guess meaning that term information and statistics is not the only thing this API returns, correct?
By default, the call returns term information and field statistics, but term statistics have to be requested explicitly with &term_statistics=true.
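With the Python client the call would look roughly like this (placeholder index, field and document id, assuming the elasticsearch-py 8.x client):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

resp = es.termvectors(
    index="my-index",       # made-up index name
    id="1",                 # made-up document id
    fields=["text"],
    term_statistics=True,   # has to be requested explicitly
    field_statistics=True,  # returned by default anyway
)
```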
Then Term information includes a field called payloads, which is not defined and I have no idea what it means.
payload is a Lucene concept, which is pretty well explained here. Term payloads are not available unless you have a custom analyzer that makes use of a delimited-payload token filter to extract them.
Then in Field statistics, there is sum of document frequencies and sum of total term frequencies with a rather confusing explanation:
[...]
I guess they are simply the sum over their corresponding values reported in term statistics?
The sum of "document frequencies" is the number of times each term present in the field appears in the same document. So if the field contains "big brown fox", it will count the number of times "big" appears in the same document, the number of times "brown" appears in the same document and the same for "fox".
The sum of "total term frequencies" is the number of times each term present in this field appears in all documents present in the Lucene index (which is located on a single shard of an ES index). So if the field contains "big brown fox", it will count the number of times "big" appears in all documents, the number of times "brown" appears in all documents and the same for "fox".
So which one is it? Realtime or not? Or is it that term information is realtime and term statistics and field statistics are merely an approximation of the reality?
It is realtime by default, which means that a refresh call is made when issuing the _termvectors call in order to get fresh information from the Lucene index. However, statistics are gathered only from a single shard, which does not give an overall view of the statistics of the whole ES index (potentially made of several shards, hence several Lucene indexes).
I have a corpus of documents indexed. I also stored the term vectors when indexing. Now I want to retrieve the term vectors of all documents satisfying some filtering options.
I was able to get the term vectors for a single document or for a set of documents by providing the document IDs. But is there a way to get term vectors for all documents without providing document IDs?
Eventually what I want to do is to get the frequency counts of all the terms in a field, for all documents in an index (i.e., a bag of words matrix).
I am using elasticsearch-py as a client.
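Here is roughly what I am doing today (a sketch; the index and field names are made up, and it assumes the 8.x client):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder connection

# Collect the IDs of the matching documents first ...
ids = [
    hit["_id"]
    for hit in helpers.scan(es, index="my-index", query={"query": {"match_all": {}}})
]

# ... then fetch their term vectors in batches by ID.
resp = es.mtermvectors(
    index="my-index",
    ids=ids[:100],  # first batch only, for illustration
    fields=["text"],
    term_statistics=True,
)
for doc in resp["docs"]:
    print(doc["_id"], list(doc["term_vectors"]["text"]["terms"]))
```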
Appreciate any pointers. Thanks!
I'm using elasticsearch to find similar documents to a given document using the "more like this" query.
Is there an easy way to get the elasticsearch scoring between 0 and 1 (using cosine similarity) ?
Thanks!
You may want to look into the Function Score functionality of Elasticsearch, more specifically the script_score and field_value_factor functions. This will allow you to take the score from the default scoring (_score) and enhance or replace it in other ways. It really depends on what sort of boosting or transformation you'd like. The default scoring model takes into account the vector space model but other things as well.
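For example, something like this squashes the unbounded _score into the (0, 1) range (just one possible transformation, not a true cosine similarity; the index, field and reference text are placeholders, elasticsearch-py 8.x assumed):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

query = {
    "function_score": {
        "query": {
            "more_like_this": {
                "fields": ["body"],  # made-up field
                "like": "text of the reference document",
            }
        },
        # Map the unbounded _score into (0, 1).
        "script_score": {"script": {"source": "_score / (_score + 1)"}},
        "boost_mode": "replace",
    }
}
resp = es.search(index="my-index", query=query)
```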
I don't think that's possible to retrieve directly.
But perhaps this workaround would make sense?
Elasticsearch always brings back max_score in the hits section of the response.
You can potentially divide each document's _score by max_score. The hit with the highest value will score 1, and documents that are less like the given one will score less.
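A minimal sketch of that client-side normalization (placeholder names, assuming the elasticsearch-py 8.x client):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder connection

resp = es.search(
    index="my-index",  # made-up index name
    query={
        "more_like_this": {"fields": ["body"], "like": "text of the reference document"}
    },
)

max_score = resp["hits"]["max_score"]
for hit in resp["hits"]["hits"]:
    print(hit["_id"], hit["_score"] / max_score)  # best hit scores 1.0, others less
```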
Elasticsearch uses the Boolean model to find matching documents, and a formula called the practical scoring function to calculate relevance. This formula borrows concepts from term frequency/inverse document frequency and the vector space model, but adds more modern features like a coordination factor, field length normalization, and term or query clause boosting.