How are fields associated with terms in the inverted index in Elasticsearch? - elasticsearch

As I understand it, Elasticsearch uses a structure called an inverted index to provide full-text search. It is clear that the inverted index holds terms and the IDs of the documents that contain each term. But a document can have any number of fields, and a field name can be given at query time to search only that field. In that case, how does Elasticsearch restrict the search to a particular field? I would like to know whether the inverted index contains a field name or field ID along with the terms and document IDs.
Something similar happens when you sort on a field, so there must be a way to associate terms with field names. Please help me understand the intricacies involved here.
Thanks in advance.

I would like to know whether the inverted index contains a field name or field ID
along with the terms and document IDs.
Quoting from the Lucene docs:
The same string in two different fields is considered a different term. Thus terms are represented as a pair of strings, the first naming the field, and the second naming text within the field.
In that case, how does Elasticsearch restrict the search to a
particular field?
Each segment index can maintain term vectors: for each field in each document, the term vector may be stored. A term vector consists of term text and term frequency.
Hence, the indexes are maintained per field, which is what lets a query be limited to one field.
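To make the "pair of strings" idea from the Lucene quote concrete, here is a minimal toy sketch in Python. It is not how Lucene is implemented; the tokenizer, the function names, and the sample documents are all made up for illustration. The only point is that keying the postings by (field, text) makes a field-restricted lookup a plain dictionary access.

```python
from collections import defaultdict


def tokenize(text):
    # Hypothetical, very rough analyzer: lowercase and split on whitespace.
    return text.lower().split()


def build_index(docs):
    # A term is the pair (field name, token text), so the same word
    # in two different fields produces two different postings lists.
    postings = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for token in tokenize(text):
                postings[(field, token)].add(doc_id)
    return postings


def search(postings, field, word):
    # Restricting the search to one field is just part of the lookup key.
    return sorted(postings.get((field, word.lower()), set()))


# Made-up sample documents.
docs = {
    1: {"title": "wood cabin", "body": "a fox lives here"},
    2: {"title": "red fox", "body": "made of wood"},
}
postings = build_index(docs)
print(search(postings, "title", "wood"))  # [1]
print(search(postings, "body", "wood"))   # [2]
```

The word "wood" appears in both documents, but each lookup only sees the documents where it occurs in the requested field.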

There is one inverted index per field per index.
There is also the fielddata cache (or doc values), which is effectively the inverse of the inverted index: it maps each document to its field values. All document-to-field-value lookups, such as those needed for sorting, happen there.
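As a rough picture of that document-to-value direction, here is a toy Python sketch. The "price" field, its values, and the function name are assumptions for illustration only; real doc values are an on-disk columnar structure, not a dictionary.

```python
# Toy "doc values": one column per field, mapping doc ID -> field value.
# The field name and the values are hypothetical.
doc_values = {
    "price": {1: 30, 2: 10, 3: 20},
}


def sort_hits(hits, field, doc_values):
    # Sorting needs doc -> value lookups, which the inverted index
    # (term -> docs) cannot answer efficiently.
    column = doc_values[field]
    return sorted(hits, key=lambda doc_id: column[doc_id])


print(sort_hits([1, 2, 3], "price", doc_values))  # [2, 3, 1]
```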

I had the same question, so I can share my understanding.
Elasticsearch creates an inverted index for each full-text field of a document. So if an index has 10 fields that allow full-text search, Elasticsearch creates 10 separate inverted indices, one per field, and stores the analyzer output for each field in its own inverted index.
When you run a search and specify which fields to search, Elasticsearch looks only in the inverted indices of those fields.
To summarize, an inverted index is created at the field level.
I hope that helps.
Thanks

Related

How to search for an exact word in a text in Elasticsearch

Let's say I have two texts:
Text 1 - "The fox has been living in the wood cabin for days."
Text 2 - "The wooden hammer is a dangerous weapon."
I would like to search for the word "wood" without matching "wooden hammer". How would I do that in Elasticsearch or NEST?
The term query is used for exact-match searches. However, using it against text fields is not recommended. The term query documentation says:
To better search text fields, the match query also analyzes your
provided search term before performing a search. This means the match
query can search text fields for analyzed tokens rather than an exact
term.
The term query does not analyze the search term. The term query only
searches for the exact term you provide. This means the term query may
return poor or no results when searching text fields.
The problem with text exact matches, as described in the Term query documentation:
By default, Elasticsearch changes the values of text fields as part of
analysis. This can make finding exact matches for text field values
difficult.
So, the document data is modified (i.e., analyzed) before indexing. How it is modified depends on the index mapping definition for each field, which falls back to the default index analyzer or the standard analyzer.
But the default standard analyzer will not change the token "Wooden" to "Wood"; that would happen only if you used stemming for this field.
This means that, unless you use a different analyzer or stemming, querying for "Wood" shouldn't match the "Wooden" token.
To summarize: data is analyzed before indexing (based on the field mapping definition). The match query analyzes the search query, while the term query doesn't. So you have to choose the field mapping and the query type to suit your use case.
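Here is a small Python sketch of that behaviour, using the two texts from the question. The analyzer is only a rough stand-in for the standard analyzer (split on non-word characters, lowercase, no stemming), and the function names are made up. It is meant to illustrate the analyzed vs. non-analyzed distinction, not to reproduce Elasticsearch exactly.

```python
import re


def standard_like_analyze(text):
    # Rough approximation: split on non-word characters and lowercase.
    # No stemming, so "wooden" stays "wooden".
    return [t.lower() for t in re.findall(r"\w+", text)]


def match(query, text):
    # Match-style lookup: the query is analyzed the same way as the text.
    doc_tokens = set(standard_like_analyze(text))
    return any(tok in doc_tokens for tok in standard_like_analyze(query))


def term(query, text):
    # Term-style lookup: the query is compared as-is against indexed tokens.
    return query in set(standard_like_analyze(text))


text1 = "The fox has been living in the wood cabin for days."
text2 = "The wooden hammer is a dangerous weapon."

print(match("wood", text1))  # True
print(match("wood", text2))  # False: the indexed token is "wooden"
print(term("Wood", text1))   # False: the index holds "wood", query is not lowercased
print(match("Wood", text1))  # True: the query is lowercased before lookup
```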
For some use cases, like storing email addresses, phone numbers, or keyword fields that always have the same value, consider using the keyword type, which is suitable for exact matches. However, ES recommends:
Avoid using keyword fields for full-text search. Use the text field
type instead.
So, to get a more practical solution for your use case, it would help to describe the field mapping you use and what you want to achieve.

How to calculate Elasticsearch Field Size

Is there a way to find the most expensive field in an Elasticsearch index?
The aim is to calculate and compare the storage and index size of two fields in an Elasticsearch index.
Also, is it wise to use dual-type fields?
For example, a string in Elasticsearch gets a text field, which is searchable, and a .keyword field, which is aggregatable.
Will it use double the storage and index space?
Is it wise to use dual-type fields, like a string in Elasticsearch having a text field, which is searchable, and a .keyword field, which is aggregatable?
It depends entirely on the use case. Maintain both keyword and text representations of a field value if:
a) You need advanced search capability on the field
b) Your current or future requirements call for sorting or aggregating on the field.
In practice, I have seen that for short text fields like 'name', 'business-name', 'tag', etc. it makes sense to maintain both. But for larger texts, e.g. a description, I don't think there are generally use cases for aggregation and sorting.
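As a sketch of what that looks like in a mapping, the Python snippet below builds an index-creation body in which the short 'name' field has both a text representation and a .keyword sub-field, while 'description' is text only. The field names are taken from the examples above, and the ignore_above value of 256 mirrors what dynamic mapping typically generates; adjust both to your data.

```python
import json

# Sketch of an index-creation body; field names are illustrative.
mapping = {
    "mappings": {
        "properties": {
            # Short field: searchable as text, sortable/aggregatable as name.keyword.
            "name": {
                "type": "text",
                "fields": {
                    "keyword": {"type": "keyword", "ignore_above": 256}
                },
            },
            # Long field: full-text search only, no keyword sub-field.
            "description": {"type": "text"},
        }
    }
}

print(json.dumps(mapping, indent=2))
```

With a mapping like this, only 'name' pays the extra index and storage cost of the second representation.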

Screen out document results that share the same property value, except the first one

I have a DB of documents. Every document has a property (keyword) called index (nothing to do with the Elasticsearch index) and a property (keyword) named superIndex. There can be multiple documents with the same index and multiple documents with the same superIndex in the DB; these fields are not unique.
I run a compound query searching free text on the text content of these documents, with sorting, and get the results I want. However, I get many documents having the same index and/or superIndex. Currently I filter the result list programmatically and take only the first result for each index and superIndex. My requirement is that at the end I'm left with the top results from the sort, the first for each index and superIndex.
Can this be done with an Elasticsearch query? If so, how?
Field collapsing allows you to collapse all search results having the same value in a field (e.g. index). (See Elasticsearch Reference: Field Collapsing)
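A minimal sketch of a search body that uses collapse, written as a Python dict. The "content" field name and the query text are assumptions, since the question does not show the mapping. As far as I know, collapse works on a single field, so removing duplicates on superIndex as well may still need a post-processing step like the one you already have.

```python
import json

# Sketch of a search request body using field collapsing.
search_body = {
    # "content" is a hypothetical name for the free-text field.
    "query": {"match": {"content": "some free text"}},
    # Keep only the top-sorted hit for each distinct value of "index".
    "collapse": {"field": "index"},
    "sort": [{"_score": "desc"}],
}

print(json.dumps(search_body, indent=2))
```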

Is there a way to denote in search query that "omit length of the field" in elasticsearch?

The only solution I know of for omitting the length of the fields is to change the mapping and re-index all the data, which I do not want to do.
Is there a way to ignore length of the field while querying?
If the length filtering is done by the Length Token Filter within an analyzer, then the terms that were filtered out won't be in your inverted index.
So you will not be able to search for the missing terms using a query - you'd need to update your analyzer and re-index.
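A toy Python illustration of why that is. The function names and the sample text are made up; the filter only mimics the idea of dropping tokens outside a minimum and maximum length at index time.

```python
def length_filter(tokens, min_len, max_len):
    # Drop tokens whose length falls outside [min_len, max_len].
    return [t for t in tokens if min_len <= len(t) <= max_len]


def index_terms(text, min_len, max_len):
    # Only tokens that survive the filter ever reach the index.
    return set(length_filter(text.lower().split(), min_len, max_len))


terms = index_terms("a quick brown fox", min_len=2, max_len=4)
print(sorted(terms))     # ['fox']
print("quick" in terms)  # False: no query can find a term that was never indexed
```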

Elasticsearch primary key and secondary key

I have an index in Elasticsearch with products. Every product has an article number in the form of a GUID. To show these products in a webshop I don't want to show a GUID (too long); I want an integer number.
Now I have two keys: one to look up the web request (the integer) and one to update the product (the GUID).
I know I can search on a field in Elasticsearch. But is an exact-match search on a field slower than an exact match on a key (_id)? I don't want to do a lookup from one key to the other, because that is another operation.
The _id field is just the primary key for documents. It will be stored separately. Yes, there will be some lag, but you'll find it's not much. If you want a field to be searchable as fast as the _id field, then in the mapping, store the field externally. Refer to the store attribute for a field.
Like other fields, it's also stored in ES. By default _id is not analyzed. If you define a field as not_analyzed (a keyword field in current versions), it's as fast as the _id field. ES indexes every field the same way.
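For illustration, an exact-match lookup on the integer key could be expressed as a term query like the one built below. The field name "webshop_number" and the value are hypothetical; the sketch only constructs the request body and does not talk to a cluster.

```python
import json


def exact_match_body(field, value):
    # Term query: no analysis of the value, exact lookup in that field's index.
    return {"query": {"term": {field: value}}, "size": 1}


# "webshop_number" is a made-up name for the integer key field.
body = exact_match_body("webshop_number", 12345)
print(json.dumps(body))
```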
