What is the replacement for FielddataLoading.Eager option in Elasticsearch mapping? - elasticsearch

I am upgrading an app from Elasticsearch 2.3 to 7.9. I'm using NEST client version 7.11.1, which is listed as compatible with ES 7.9. We are using 7.9 because that is the latest version available on the AWS server we are working with.
The old application has the following field mapping:
.String(s => s
.Name(f => f.PartDescription)
.Analyzer(Analyzers.DescriptionAnalyzer)
.Fielddata(descriptor => descriptor.Loading(FielddataLoading.Eager)));
I am using the following mapping to replace this in the new version:
.Text(t => t
.Name(ep => ep.PartDescription)
.Analyzer(Analyzer.DescriptionAnalyzer)
.Fielddata(true))
I see that in the new version the only option for Fielddata is a boolean. The Eager and other options are missing.
Is Fielddata(true) a suitable equivalent for the upgrade?

The boolean on fielddata determines whether fielddata is enabled for the field. fielddata is used when performing aggregations, sorting and for scripting, and is loaded into the heap, into the fielddata cache, on demand (not eagerly loaded).
Typically for text datatype fields, you don't want fielddata; text data types undergo analysis and the resulting tokens are stored in the inverted index. When fielddata is set to true, the inverted index is uninverted on demand to produce a columnar structure that is loaded into the heap to serve aggregations, sorting and scripting on text fields. Text analysis often produces many tokens that serve the purpose of full-text search well but don't serve the purpose of aggregation, sorting and scripting well. With many tokens and many concurrent aggregations, heap memory can grow quickly, exerting GC pressure. So, the default for text datatype fields is to have fielddata be false, and to set it to true if you know what you're doing.
Instead of setting fielddata to true on a text datatype field, a good approach is to use multi-fields and also map the field as a keyword datatype if the field is one that you want to use for aggregations, sorting and scripting, and target the keyword multi field for this purpose.
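The multi-field approach described above can be sketched as a mapping body. This is only an illustration built as a plain Python dictionary: the field name part_description, the sub-field name keyword, the ignore_above value and the analyzer name description_analyzer are assumptions made for the example, not names taken from the original application.

```python
import json


def build_part_mapping(analyzer: str = "description_analyzer") -> dict:
    """Build an ES 7-style mappings body with a text field and a keyword sub-field.

    The text field serves full-text search; the keyword sub-field is backed by
    doc values and is the one to target for aggregations, sorting and scripting.
    """
    return {
        "mappings": {
            "properties": {
                "part_description": {  # hypothetical field name
                    "type": "text",
                    "analyzer": analyzer,  # hypothetical analyzer name
                    "fields": {
                        "keyword": {"type": "keyword", "ignore_above": 256}
                    },
                }
            }
        }
    }


mapping = build_part_mapping()
print(json.dumps(mapping, indent=2))
```

With a mapping like this, aggregations and sorts would target part_description.keyword, and fielddata stays disabled on the text field.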

Related

Using stored_fields for retrieving a subset of the fields in Elastic Search

The documentation and recommendation for using stored_fields feature in ElasticSearch has been changing. In the latest version (7.9), stored_fields is not recommended - https://www.elastic.co/guide/en/elasticsearch/reference/7.9/search-fields.html
Is there a reason for this?
Whereas in version 7.4.0, there is no such negative comment - https://www.elastic.co/guide/en/elasticsearch/reference/7.4/mapping-store.html
What is the guidance in using this feature? Is using _source filtering a better option? I ask because in some other doc, _source filtering is supposed to kill performance - https://www.elastic.co/blog/found-optimizing-elasticsearch-searches
If you use _source or _fields you will quickly kill performance. They access the stored fields data structure, which is intended to be used when accessing the resulting hits, not when processing millions of documents.
What is the best way to filter fields and not kill performance with Elastic Search?
Source filtering is the recommended way to fetch fields. The blog post is the source of the confusion, but it describes a different use case. Read the statement below carefully:
_source is intended to be used when accessing the resulting hits, not when processing millions of documents.
By default, Elasticsearch returns only 10 hits per search, which can be changed with the size parameter. If you want only a few field values in your search results, source filtering makes perfect sense, because it is applied to the final result set (not to every document matching the query).
If, on the other hand, you use a script that reads field values from _source in order to filter the search results, the query has to load the source of every candidate document. That is the second part of the statement above (not when processing millions of documents).
Apart from that, all field values are already stored as part of the _source field, which is enabled by default, so you need not allocate extra space by explicitly marking fields as stored (store is disabled by default to keep the index small) just to retrieve their values.
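As a rough sketch of source filtering on the final hits, the search body below asks for only two fields from the top results. The field names title and price and the match query are made up for the example; only the _source and size parameters come from the discussion above.

```python
import json


def build_filtered_search(query: dict, fields: list, size: int = 10) -> dict:
    """Build a search body that returns only the listed fields from _source.

    Filtering is applied to the hits that are returned (at most `size`),
    not to every document that matches the query.
    """
    return {"size": size, "_source": list(fields), "query": query}


body = build_filtered_search(
    query={"match": {"title": "bolt"}},  # hypothetical query
    fields=["title", "price"],           # hypothetical field names
)
print(json.dumps(body, indent=2))
```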

Elasticsearch 7 - Sort on custom field of multi-field property

I am working on upgrading a system at work from using ES1 to ES7.
Part of the ES1 implementation included a custom plugin to add an analyzer for custom sorting. The custom sorting behavior we have is similar to "natural sort", but extended to deal with legal codes. For example, it will sort 1.1.1 before 1.10.1. We've been calling this "legal sort". We used this plugin to add an extra .legalsort field to multi-field properties in our index, and then we would sort based on this field when searching.
I am currently trying to adapt the main logic for indexing and searching to ES7. I am not trying to replace the "legal sort" plugin yet. When trying to implement sorting for searches, I ran into the error Fielddata is disabled on text fields by default. The solution I've seen suggested for that is to add a .keyword field for any text properties, which will be used for sorting and aggregation. This "works", but I don't see how I can then apply our old logic of sorting based on a .legalsort field.
Is there a way to sort on a field other than .keyword, which can use a custom analyzer, like we were able to in ES1?
The important aspect is not the name of your field (like *.keyword), but the type of field. For exact match searches, sorting and aggregation the type of the field should be “keyword“.
If you only use the legalsort field for display, sorting, aggregations or exact match, simply change the type from “text” to “keyword”.
If you want to use the same information for both purposes, it’s recommended to make it a multi-field by itself. Use the “keyword”-type field for sorting, aggregations and exact match search and use the “text”-type field for full-text search.
Having 2 types available for the 2 purposes is a significant improvement over the single string type you had in ES 1.0. When you sorted in ES 1.0, the information stored in the inverted index had to be uninverted and was kept in RAM. This data structure is called fielddata. It was unbounded and often caused out-of-memory exceptions. Newer versions of Lucene introduced an alternative data structure which resides on disk (and in the file system cache) as a “replacement” for the “fielddata” data structure. It’s named doc-values and allows sorting on huge amounts of data without consuming a significant amount of heap RAM. The only drawback: doc-values are not available for analyzed text (fields of type text), hence the need for a field of type keyword.
You could also set the mapping parameter “fielddata” to true for your legalsort field, enabling fielddata for this particular field to get back the previous behaviour with all its drawbacks.
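One possible way to keep a custom ordering on a keyword field, without fielddata and without the old plugin, is to compute a sort key in the application at index time and store it in a separate keyword field. The sketch below is not the plugin's logic; it is a swapped-in technique (zero-padding each run of digits so that plain lexicographic order matches numeric order), and the function name legal_sort_key, the field names code and code_legalsort and the padding width are all assumptions.

```python
import re


def legal_sort_key(code: str, width: int = 8) -> str:
    """Zero-pad every run of digits so string order matches numeric order."""
    return re.sub(r"\d+", lambda m: m.group(0).zfill(width), code)


def build_document(code: str) -> dict:
    """Build a document with the raw code plus a precomputed sort key.

    code_legalsort (hypothetical name) would be mapped as type keyword,
    so sorting on it uses doc values.
    """
    return {"code": code, "code_legalsort": legal_sort_key(code)}


codes = ["1.10.1", "1.2.3", "1.1.1"]
print(sorted(codes, key=legal_sort_key))  # ['1.1.1', '1.2.3', '1.10.1']
print(build_document("1.10.1"))
```

The trade-off is that the key has to be recomputed and re-indexed whenever the ordering rules change, and numbers longer than the chosen width would not sort correctly.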

What will be the effect of fielddata=true when querying a ~10M document index, and more questions

I have an index of ~10M docs. In each document I have a 'text' field where I put a string, and in the end I want to aggregate all the terms inside this field. When I try to do that, I only get the entire string.
I heard only bad things about using fielddata=true.
For this amount of documents, is it really such a bad practice to use fielddata=true in terms of memory?
Is there a difference (in terms of performance) between using an analyzer in the indexing pipeline (just set an analyzer on a specific field) to using an analyzer as a function (run analyzer on a string, get the results and put them in a document)?
Synonyms: I have defined a list of synonyms. I believe I already know the answer, but I'll ask anyway: is it possible to simply update the list of synonyms and be done, or is it mandatory to re-index after updating it?
On fielddata: yes, memory can be an issue, but you should test to find out how much memory you actually need. 10M documents is not that many for a 32 GB heap limit.
On the analyzer question: I didn't understand the question.
On synonyms: when creating the index you point to the file containing the list of synonyms. After that you can update the list without re-indexing. The exception is simple contraction, for which you do have to re-index. https://www.elastic.co/guide/en/elasticsearch/guide/current/synonyms-expand-or-contract.html

Can ElasticSearch be used purely for aggregations?

In my current usecase, I'm using ElasticSearch as a document store, over which I am building a faceted search feature.
The docs state the following:
Sorting, aggregations, and access to field values in scripts requires a different data access pattern.
Doc values are the on-disk data structure, built at document index time, which makes this data access pattern possible. They store the same values as the _source but in a column-oriented fashion that is way more efficient for sorting and aggregations.
Does this imply that the aggregations are not dependent on the index? If so, is it advisable to prevent the fields from being indexed altogether by setting {"index": "no"} ?
This is a small deviation, but where does the setting enabled come in? How is it different from index?
On a broader note, should I be using ElasticSearch if aggregations is all I'm looking for? Should I opt for other solutions like MongoDB? If so, what are the performance considerations?
HELP!
It is definitely possible to use Elasticsearch for the sole purpose of aggregating data. I've seen such setups a few times. For instance, in one past project, we'd index data but we'd only run aggregations in order to build financial reports, and we rarely needed to get documents/hits. 99% of the use cases were simply aggregating data.
If you have such a use case, then you can tune your mapping to
The role of enabled is to decide whether your data is indexed or not. It is true by default, but if you set it to false, your data will simply be stored (in _source) and otherwise ignored: it won't be analyzed, tokenized or indexed, so you'll be able to retrieve it from the _source but not search on it. If you need to use aggregations, then enabled needs to be true (the default value).
The store parameter is to decide whether you want to store the field or not. By default, the field value is indexed, but not stored as it is already stored with the _source itself and you can retrieve it using source filtering. For aggregations, this parameter doesn't play any role.
If your use case is only about aggregations, you might be tempted to set _source: false, i.e. not store the _source at all, since all you need is to index the field values in order to aggregate them, but this is rarely a good idea for various reasons.
So, to answer your main question, aggregations do depend on the index, but the (doc-)values used for aggregations are written in dedicated files, whose inner structure is much more performant and optimal than accessing the data from the index in order to build aggregations.
If you're using ES 1.x, make sure to set doc_values to true for all the fields you'll want to aggregate on (except analyzed strings and boolean fields).
If you're using ES 2.x, doc_values is true by default, so you don't need to do anything special.
Update:
It is worth noting that aggregations are dependent on doc_values (i.e. Per Document Values .dvd and .dvm Lucene files), which basically contains the same info as in the inverted index, but organized in a column-oriented fashion, which makes it much more efficient for aggregations.
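For an aggregation-only workload like the one described in this answer, a request usually sets size to 0 so that no hits are fetched and only the aggregation results come back. The sketch below builds such a body; the field name category and the aggregation name by_category are assumptions for the example, and the field is assumed to have doc values (for instance a keyword field).

```python
import json


def build_terms_aggregation(field: str, agg_name: str, buckets: int = 10) -> dict:
    """Build a search body that returns only a terms aggregation, no hits."""
    return {
        "size": 0,  # skip fetching hits; only aggregation results are returned
        "aggs": {
            agg_name: {"terms": {"field": field, "size": buckets}}
        },
    }


body = build_terms_aggregation(field="category", agg_name="by_category")
print(json.dumps(body, indent=2))
```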

Elasticsearch store field vs _source

Using Elasticsearch 1.4.3
I'm building a sort of "reporting" system, and the client can pick and choose which fields they want returned in their result.
In 90% of the cases the client will never pick all the fields, so I figured I can disable _source field in my mapping to save space. But then I learned that
GET myIndex/myType/_search/
{
"fields": ["field1", "field2"]
...
}
Does not return the fields.
So I assume I have to then use "store": true for each field. From what I read this will be faster for searches, but I guess space wise it will be the same as _source or we still save space?
The _source field stores the JSON you send to Elasticsearch and you can choose to only return certain fields if needed, which is perfect for your use case. I have never heard that the stored fields will be faster for searches. The _source field could be bigger on disk space, but if you have to store every field there is no need to use stored fields over the _source field. If you do disable the source field it will mean:
You won’t be able to do partial updates
You won’t be able to re-index your data from the JSON in your Elasticsearch cluster, you’ll have to re-index from the data source (which is usually a lot slower).
By default in elasticsearch, the _source (the document one indexed) is stored. This means when you search, you can get the actual document source back. Moreover, elasticsearch will automatically extract fields/objects from the _source and return them if you explicitly ask for it (as well as possibly use it in other components, like highlighting).
You can specify that a specific field is also stored. This means that the data for that field will be stored on its own. Meaning that if you ask for field1 (which is stored), elasticsearch will identify that its stored, and load it from the index instead of getting it from the _source (assuming _source is enabled).
When do you want to enable storing specific fields? Most times, you don't. Fetching the _source is fast and extracting it is fast as well. If you have very large documents, where the cost of storing the _source, or the cost of parsing the _source is high, you can explicitly map some fields to be stored instead.
Note, there is a cost of retrieving each stored field. So, for example, if you have a json with 10 fields with reasonable size, and you map all of them as stored, and ask for all of them, this means loading each one (more disk seeks), compared to just loading the _source (which is one field, possibly compressed).
This answer was originally given by shay.banon; the whole thread it comes from is worth reading to get a good understanding of the topic.
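For the rarer case where individual fields are stored explicitly, a minimal sketch looks like the following. It uses ES 7-style typeless mappings and the stored_fields search parameter rather than the 1.x fields parameter from the question; the field names summary and body_text are assumptions for the example.

```python
import json


def build_stored_field_mapping(stored: list, not_stored: list) -> dict:
    """Build a mappings body where only the `stored` fields set store: true."""
    properties = {name: {"type": "keyword", "store": True} for name in stored}
    properties.update({name: {"type": "text"} for name in not_stored})
    return {"mappings": {"properties": properties}}


def build_stored_fields_search(fields: list) -> dict:
    """Build a search body that loads explicitly stored fields for each hit."""
    return {"stored_fields": list(fields), "query": {"match_all": {}}}


print(json.dumps(build_stored_field_mapping(["summary"], ["body_text"]), indent=2))
print(json.dumps(build_stored_fields_search(["summary"]), indent=2))
```

As the answer notes, each stored field is loaded separately, so this mainly pays off for very large documents where only a small field is needed.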
Clinton Gormley says in the link below
https://groups.google.com/forum/#!topic/elasticsearch/j8cfbv-j73g/discussion
by default ES stores your JSON doc in the _source field, which is set to "stored"
by default, the fields in your JSON doc are set to NOT be "stored" (ie stored as a separate field)
so when ES returns your doc (search or get) it just load the _source field and returns that, ie a single disk seek
Some people think that by storing individual fields, it will be faster than loading the whole JSON doc from the _source field. What they don't realise is that each stored field requires a disk seek (10ms each seek!), and that the sum of those seeks far outweighs the cost of just sending the _source field.
In other words, it is almost always a false optimization.
Enabling _source will store the entire JSON document in the index while store will only store individual fields that are marked so. So using store might be better than using _source if you want to save disk space.
As a reference for ES 7.3, the answer becomes clearer. DO NOT try to optimize before you have strong testing reasons UNDER REALISTIC PRODUCTION CONDITIONS.
To quote the documentation for the _source field:
Users often disable the _source field without thinking about the consequences, and then live to regret it. If the _source field isn't available then a number of features are not supported:
The update, update_by_query, and reindex APIs.
On the fly highlighting.
The ability to reindex from one Elasticsearch index to another, either to change mappings or analysis, or to upgrade an index to a new major version.
The ability to debug queries or aggregations by viewing the original document used at index time.
Potentially in the future, the ability to repair index corruption automatically.
TIP: If disk space is a concern, rather increase the compression level instead of disabling the _source.
Besides, there are no obvious advantages to using stored_fields, contrary to what you might have expected.
If you only want to retrieve the value of a single field or of a few fields, instead of the whole _source, then this can be achieved with source filtering.
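The compression tip in the quoted documentation can be sketched as an index-creation body that keeps _source enabled and switches the index codec to best_compression. index.codec is a static setting, so it is set when the index is created; the field names in the usage example are assumptions.

```python
import json


def build_index_body(properties: dict) -> dict:
    """Build an index-creation body that keeps _source and compresses harder."""
    return {
        "settings": {
            "index": {"codec": "best_compression"}  # trades some CPU for disk space
        },
        "mappings": {"properties": properties},
    }


body = build_index_body({
    "field1": {"type": "keyword"},  # hypothetical fields
    "field2": {"type": "text"},
})
print(json.dumps(body, indent=2))
```

Individual fields can then still be selected per request with source filtering, so nothing needs to be marked as stored.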
