I have an indexed corpus of documents, and I stored the term vectors at indexing time. Now I want to retrieve the term vectors of all documents that satisfy some filter.
I was able to get the term vector for a single document, or for a set of documents, by providing the document IDs. But is there a way to get term vectors for all the documents without providing document IDs?
Eventually what I want to do is to get the frequency counts of all the terms in a field, for all documents in an index (i.e., a bag of words matrix).
I am using elasticsearch-py as a client.
Appreciate any pointers. Thanks!
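One possible approach, sketched here with elasticsearch-py, is to collect the IDs of the matching documents first and then call the multi term vectors API in batches. This is only a sketch under assumptions: `bag_of_words` and `term_counts` are hypothetical helper names, the `body=` argument follows the 7.x client style (newer clients may prefer keyword arguments), and the index and field names are placeholders.

```python
from collections import Counter


def term_counts(doc, field):
    """Return {term: term_freq} for one entry of a _mtermvectors response."""
    terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
    return Counter({term: info["term_freq"] for term, info in terms.items()})


def bag_of_words(client, index, field, doc_ids, batch_size=100):
    """Build {doc_id: Counter} by calling _mtermvectors on batches of IDs."""
    doc_ids = list(doc_ids)
    matrix = {}
    for start in range(0, len(doc_ids), batch_size):
        batch = doc_ids[start:start + batch_size]
        body = {
            "ids": batch,
            # Only term frequencies are needed, so skip positions and offsets.
            "parameters": {"fields": [field], "positions": False, "offsets": False},
        }
        response = client.mtermvectors(index=index, body=body)
        for doc in response["docs"]:
            matrix[doc["_id"]] = term_counts(doc, field)
    return matrix


# Usage sketch (needs a live cluster; "my-index" and "text" are placeholders):
# from elasticsearch import Elasticsearch, helpers
# es = Elasticsearch("http://localhost:9200")
# query = {"query": {"match_all": {}}, "_source": False}  # put your filter here
# ids = (hit["_id"] for hit in helpers.scan(es, index="my-index", query=query))
# matrix = bag_of_words(es, "my-index", "text", ids)
```

The scan helper is used only to enumerate IDs for whatever filter query you need; the term vectors themselves still come from `_mtermvectors`.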
Related
How do I dump out the term dictionary from Elasticsearch?
I just want to look for pathological cases in indexing, so I want to see what is actually getting put into the index.
These are text fields, so I can't just do a terms aggregation.
_termvectors only works for single or multiple documents. I want top terms in the index, like the terms component in Solr.
I understand that Elasticsearch has term vectors, which can give the word positions and other stats.
Can the percolator give the word positions in the documents that are being searched?
I understand that the documents are not indexed and only the percolator queries are indexed. I see the following
If the requested information wasn’t stored in the index, it will be computed on the fly if possible. Additionally, term vectors could be computed for documents not even existing in the index, but instead provided by the user.
in - https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html
So I am interested to know whether Elasticsearch can calculate the word positions on the fly.
Any leads are appreciated. Thanks for reading.
@Kaveh
Thanks for taking the time, but I'm sorry, I don't see how this (https://stackoverflow.com/a/67926555/4068218) is related. Using artificial documents I can get the stats (https://www.elastic.co/guide/en/elasticsearch/reference/6.8/docs-termvectors.html),
but what I have is percolator - https://www.youtube.com/watch?v=G2Ru2KV0DZg
So even if I get the term vector on the fly, using artificial documents or /_analyze, it does not help, because neither gives me the position of the terms that the percolator matched.
Example: with the percolator, I am trying to find the word Hello.
My document has the following field and value:
"text": "Hello World"
If I use artificial documents or /_analyze, the result says 0 - Hello, 1 - World, but when I percolate I only get the percolate query that found the word Hello. I want to combine both, so that the percolator tells me:
"I found Hello in position 0"
As you can see in the documentation for term vectors, if you store _source, Elasticsearch can calculate the term vector on the fly. It will analyze your text based on the source and aggregate it with the existing term vectors of the index.
If you only need the result for specific terms, you can always get the analyzed data for a list of terms; for more information, see here.
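As a rough illustration of combining the two steps, the sketch below looks up the positions of given terms in a term vectors response, such as one requested for an artificial document with a body like `{"doc": {"text": "Hello World"}, "fields": ["text"], "positions": true}`. The helper name `term_positions` is hypothetical, the sample response is hand-written rather than captured from a cluster, and the terms are assumed to be lowercased by the analyzer.

```python
def term_positions(termvectors_response, field, wanted_terms):
    """Return {term: [positions]} for the wanted terms found in the field."""
    terms = (
        termvectors_response.get("term_vectors", {})
        .get(field, {})
        .get("terms", {})
    )
    found = {}
    for term in wanted_terms:
        tokens = terms.get(term, {}).get("tokens", [])
        if tokens:
            found[term] = [token["position"] for token in tokens]
    return found


# Hand-written sample in the shape of a _termvectors response for "Hello World".
sample = {
    "term_vectors": {
        "text": {
            "terms": {
                "hello": {
                    "term_freq": 1,
                    "tokens": [{"position": 0, "start_offset": 0, "end_offset": 5}],
                },
                "world": {
                    "term_freq": 1,
                    "tokens": [{"position": 1, "start_offset": 6, "end_offset": 11}],
                },
            }
        }
    }
}

print(term_positions(sample, "text", ["hello"]))  # {'hello': [0]}
```

The percolate response itself only identifies the matching queries, so the positions have to come from a second call such as this one on the same document.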
Elasticsearch uses an inverted index, which is totally understandable, because it returns all the documents containing the word we searched for.
But I do not understand where we would use a forward index. We don't search for a document and expect the words contained in that particular document.
Is there any practical use case for a forward index? Is any company using it in its product?
As mentioned in this SO answer, there is no technical difference between the forward index and the inverted index. A forward index is a list of the terms contained within a particular document. An inverted index is a list of the documents containing a given term.
Please go through this blog, where it is clearly mentioned that a forward index is fast to build at indexing time but makes queries less efficient,
whereas an inverted index is slower to build but fast to query. For a detailed explanation of the inverted index, you can refer to this article and this blog.
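To make the two definitions concrete, here is a minimal in-memory sketch of both structures built from the same toy documents. The function names and sample data are made up for illustration and do not reflect how Elasticsearch stores either structure on disk.

```python
from collections import defaultdict


def build_forward_index(docs):
    """Map each document ID to the list of terms it contains."""
    return {doc_id: text.lower().split() for doc_id, text in docs.items()}


def build_inverted_index(forward_index):
    """Map each term to the sorted list of document IDs containing it."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward_index):
        for term in set(forward_index[doc_id]):
            inverted[term].append(doc_id)
    return dict(inverted)


docs = {1: "red bacon", 2: "blue bacon eggs", 3: "red eggs"}
forward = build_forward_index(docs)
inverted = build_inverted_index(forward)

print(forward[2])         # ['blue', 'bacon', 'eggs']
print(inverted["bacon"])  # [1, 2]
```

Building the forward index is a single pass per document, while answering "which documents contain bacon?" from it would require scanning every document; the inverted index answers that with one lookup.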
If I have a field called name and I use the suggest API to get suggestions for misspellings, do I need to have document frequencies or norms enabled in order to get accurate suggestions? My assumption is yes, but I am curious whether there is a separate suggestions index in Lucene that handles frequencies and/or norms even if I have them disabled for the field in my main index.
I doubt that the suggester can work without field-length normalization. Disabling norms means you are only looking at a binary value, whether the term is present in the document field or not, which in turn affects the similarity score of each document.
These three factors—term frequency, inverse document frequency, and field-length norm—are calculated and stored at index time. Together, they are used to calculate the weight of a single term in a particular document.
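For reference, the three factors can be computed as below, using the classic Lucene TF/IDF formulas described in the scoring theory guide linked at the end of this answer. Treat this as a simplified sketch: the function names are made up, the product in `term_weight` ignores boosts and query normalization, and recent Elasticsearch versions default to BM25 rather than this similarity.

```python
import math


def term_frequency(freq):
    """tf: square root of how often the term appears in the field."""
    return math.sqrt(freq)


def inverse_document_frequency(num_docs, doc_freq):
    """idf: rarer terms (low doc_freq) get a higher value."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_length_norm(num_terms):
    """norm: shorter fields get a higher value."""
    return 1.0 / math.sqrt(num_terms)


def term_weight(freq, num_docs, doc_freq, num_terms):
    """Simplified weight of one term in one document's field."""
    return (
        term_frequency(freq)
        * inverse_document_frequency(num_docs, doc_freq)
        * field_length_norm(num_terms)
    )


# A term occurring 4 times in a 16-term field, present in 9 of 1000 documents.
print(term_weight(freq=4, num_docs=1000, doc_freq=9, num_terms=16))
```

With norms disabled the third factor is effectively constant, which is the effect the answer is worried about.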
"but I am curious if maybe there is a separate suggestions index in lucene that handles frequency and/or norms even if I have it disabled for the field in my main index."
Any suggester will use the Vector Space Model by default to calculate the cosine similarity, which in turn uses the tf-idf-norm scoring calculated during indexing for each term to rank the suggestions, so I doubt that a suggester can score documents accurately without field norms.
Theory behind relevance scoring:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scoring-theory.html#field-norm
From a data structure point of view how does Lucene (Solr/ElasticSearch) so quickly do filtered term counts? For example given all documents that contain the word "bacon" find the counts for all words in those documents.
First, for background, I understand that Lucene relies upon a compressed bit array data structure akin to CONCISE. Conceptually this bit array holds a 0 for every document that doesn't match a term and a 1 for every document that does match a term. But the cool/awesome part is that this array can be highly compressed and is very fast at boolean operations. For example if you want to know which documents contain the terms "red" and "blue" then you grab the bit array corresponding to "red" and the bit array corresponding to "blue" and AND them together to get a bit array corresponding to matching documents.
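The mental model described above can be illustrated with plain Python integers used as uncompressed bit arrays. This is purely a conceptual sketch of the question's premise, with made-up helper names and data; it makes no claim about Lucene's actual data structures and does no compression.

```python
def to_bitset(doc_ids):
    """Set bit i for every document ID i in doc_ids."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits


def to_doc_ids(bits):
    """Return the sorted document IDs whose bit is set."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


red = to_bitset([0, 2, 5])
blue = to_bitset([2, 3, 5])

# Documents containing both "red" and "blue": a single bitwise AND.
print(to_doc_ids(red & blue))  # [2, 5]
```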
But how then does Lucene quickly determine counts for all words in documents that match "bacon"? In my naive understanding, Lucene would have to take the bit array associated with bacon and AND it with the bit arrays for every single other word. Am I missing something? I don't understand how this can be efficient. Also, do these bit arrays have to be pulled off of disk? That sounds all the worse!
How's the magic work?
You may already know this, but it won't hurt to say that Lucene uses inverted indexing. In this indexing technique, a dictionary of every word occurring in all documents is built, and against each word, information about that word's occurrences is stored. Something like this image
To achieve this, Lucene stores documents, indexes and their metadata in different file formats. Follow this link for file details: http://lucene.apache.org/core/3_0_3/fileformats.html#Overview
If you read the document numbers section, each document is given an internal ID, so when documents are found with the word 'consign', the Lucene engine has a reference to their metadata.
Refer to the overview section to see what data gets saved in the different Lucene index files.
Now that we have a pointer to the stored documents, Lucene may be getting the counts in one of the following ways:
Actually counting the words, if the document is stored
Using the term dictionary, frequency, and proximity data to get the count
Finally, which API are you using to "quickly determine counts for all words"?
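The first option can be sketched on toy data: find the documents matching the filter term through an inverted index, then count the terms of only those documents. The function name and data are hypothetical, and this is an illustration of the idea rather than Lucene's implementation.

```python
from collections import Counter


def filtered_term_counts(docs, filter_term):
    """Count all terms across the documents that contain filter_term."""
    # Inverted index: term -> list of document IDs containing it.
    inverted = {}
    for doc_id, terms in docs.items():
        for term in set(terms):
            inverted.setdefault(term, []).append(doc_id)

    # Visit only the matching documents and tally their terms.
    counts = Counter()
    for doc_id in inverted.get(filter_term, []):
        counts.update(docs[doc_id])
    return counts


docs = {
    1: ["bacon", "eggs"],
    2: ["bacon", "bacon", "toast"],
    3: ["eggs", "toast"],
}

print(filtered_term_counts(docs, "bacon"))
```

The work is proportional to the size of the matching documents, not to the number of distinct terms in the whole index.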
Image credit to http://leanjavaengineering.wordpress.com/
For the index file format, see http://lucene.apache.org/core/8_2_0/core/org/apache/lucene/codecs/lucene80/package-summary.html#package.description
There are no bitsets involved: it's an inverted index. Each term maps to a list of documents. In Lucene the algorithms work on iterators over these "lists", so items from the iterator are read on demand, not all at once.
This diagram shows a very simple conjunction algorithm that just uses a next() operation: http://nlp.stanford.edu/IR-book/html/htmledition/processing-boolean-queries-1.html
Behind the scenes, Lucene works much like this diagram. Our lists are delta-encoded and bit-packed, and augmented with a skip list, which allows us to intersect more efficiently (via the additional advance() operation) than the above algorithm.
DocIdSetIterator is this "enumerator" in Lucene. It has two main methods, next() (nextDoc() in current Lucene) and advance(). And yes, it is true that you can decide to read the entire list, convert it into a bitset in memory, and implement this iterator over that in-memory bitset. This is what happens if you use CachingWrapperFilter.
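A minimal sketch of that kind of merge-style conjunction over two sorted postings lists follows. It only ever steps forward by one entry, which corresponds to using next() alone; the function name and sample lists are made up for illustration.

```python
def intersect(postings_a, postings_b):
    """Intersect two sorted lists of document IDs by stepping one entry at a time."""
    answer = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            answer.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1  # advance the list that is behind
        else:
            j += 1
    return answer


print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))  # [2, 31]
```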
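Delta encoding can be illustrated in a few lines: store the gap to the previous document ID instead of the ID itself, so the numbers stay small and need fewer bits when packed. This is a conceptual sketch with hypothetical helper names; it does not reproduce Lucene's actual block format or its skip list.

```python
def delta_encode(doc_ids):
    """Turn sorted document IDs into gaps from the previous ID."""
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def delta_decode(gaps):
    """Rebuild the document IDs by accumulating the gaps."""
    doc_ids = []
    current = 0
    for gap in gaps:
        current += gap
        doc_ids.append(current)
    return doc_ids


def bits_per_value(values):
    """Bits needed to store every value at a fixed width."""
    return max(value.bit_length() for value in values)


doc_ids = [3, 7, 8, 20]
gaps = delta_encode(doc_ids)
print(gaps)                                            # [3, 4, 1, 12]
print(bits_per_value(doc_ids), bits_per_value(gaps))   # 5 4
```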
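The iterator idea can be sketched in Python as follows. The class below is a hypothetical stand-in, not Lucene's API: it wraps an in-memory sorted list, and a binary search plays the role that the skip list plays for advance(). The `NO_MORE_DOCS` sentinel mirrors the convention of returning a maximal value when the iterator is exhausted.

```python
import bisect

NO_MORE_DOCS = 2**31 - 1  # sentinel returned once the iterator is exhausted


class ListDocIdIterator:
    """Iterates over a sorted list of document IDs with next_doc() and advance()."""

    def __init__(self, doc_ids):
        self._docs = list(doc_ids)
        self._pos = -1

    def _current(self):
        if self._pos < len(self._docs):
            return self._docs[self._pos]
        return NO_MORE_DOCS

    def next_doc(self):
        """Move to the next document ID."""
        self._pos += 1
        return self._current()

    def advance(self, target):
        """Jump to the first document ID >= target, skipping entries in between."""
        self._pos = bisect.bisect_left(self._docs, target, max(self._pos, 0))
        return self._current()


def conjunction(a, b):
    """Intersect two iterators, letting the one that is behind leap forward."""
    result = []
    doc_a, doc_b = a.next_doc(), b.next_doc()
    while doc_a != NO_MORE_DOCS and doc_b != NO_MORE_DOCS:
        if doc_a == doc_b:
            result.append(doc_a)
            doc_a, doc_b = a.next_doc(), b.next_doc()
        elif doc_a < doc_b:
            doc_a = a.advance(doc_b)
        else:
            doc_b = b.advance(doc_a)
    return result


bacon = ListDocIdIterator([1, 2, 4, 11, 31, 45, 173, 174])
eggs = ListDocIdIterator([2, 31, 54, 101])
print(conjunction(bacon, eggs))  # [2, 31]
```

Because only the iterator interface is used, the same `conjunction` would work unchanged over an iterator backed by an in-memory bitset.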