What will be the effect of fielddata=true when querying a ~10M document index, and more questions - elasticsearch

I have an index of ~10M docs. Each document has a 'text' field holding a string, and in the end I want to aggregate all the terms inside this field. When I try to do that, I only get the entire string back.
I heard only bad things about using fielddata=true.
For this amount of documents, is it really such a bad practice to use fielddata=true in terms of memory?
Is there a difference (in terms of performance) between using an analyzer in the indexing pipeline (just setting an analyzer on a specific field) and using an analyzer as a function (running the analyzer on a string, getting the results, and putting them in a document)?
Synonyms: I have defined a list of synonyms. I believe I already know the answer, but I'll give it a try anyway: is it possible to simply update the list of synonyms and be done, or is it mandatory to re-index after updating it?

Yes, running out of memory is a risk, but you should test to find out how much memory you actually need. 10M documents is not too many for a heap at the 32 GB limit.
I didn't understand the question.
When creating the index you should point to the list (file) of synonyms. After that you can update the list without re-indexing, except for simple contraction (for that you should re-index). https://www.elastic.co/guide/en/elasticsearch/guide/current/synonyms-expand-or-contract.html
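For the first question, a minimal sketch of what enabling fielddata and aggregating on the analyzed terms could look like. It only builds the request bodies as Python dicts; the field name `text` comes from the question, while the aggregation name `top_terms` and the helper names are made up for illustration. The bodies would be sent to the index's `_mapping` and `_search` endpoints respectively. Measuring heap usage on a test copy of the index, as the answer suggests, is still the only reliable way to judge the memory cost.

```python
import json

# Field name taken from the question; adjust to your own mapping.
FIELD = "text"


def fielddata_mapping(field=FIELD):
    """Mapping body that enables fielddata on an analyzed text field."""
    return {"properties": {field: {"type": "text", "fielddata": True}}}


def terms_aggregation(field=FIELD, top_n=20):
    """Search body that returns only the most frequent terms, no hits."""
    return {
        "size": 0,  # skip the hits, we only want the buckets
        "aggs": {"top_terms": {"terms": {"field": field, "size": top_n}}},
    }


if __name__ == "__main__":
    print(json.dumps(fielddata_mapping(), indent=2))
    print(json.dumps(terms_aggregation(), indent=2))
```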

Related

Using stored_fields for retrieving a subset of the fields in Elastic Search

The documentation and recommendations for the stored_fields feature in Elasticsearch have been changing. In the latest version (7.9), stored_fields is not recommended: https://www.elastic.co/guide/en/elasticsearch/reference/7.9/search-fields.html
Is there a reason for this?
Whereas in version 7.4.0 there is no such negative comment: https://www.elastic.co/guide/en/elasticsearch/reference/7.4/mapping-store.html
What is the guidance on using this feature? Is _source filtering a better option? I ask because another doc says that _source filtering kills performance: https://www.elastic.co/blog/found-optimizing-elasticsearch-searches
If you use _source or _fields you will quickly kill performance. They access the stored fields data structure, which is intended to be used when accessing the resulting hits, not when processing millions of documents.
What is the best way to filter fields and not kill performance with Elastic Search?
Source filtering is the recommended way to fetch fields. The blog post is confusing you because it is easy to miss the use case its statement applies to. Please read the statement below carefully:
_source is intended to be used when accessing the resulting hits, not when processing millions of documents.
By default, Elasticsearch returns only 10 hits per search, which can be changed with the size parameter. If you only want to fetch the values of a few fields in your search results, source filtering makes perfect sense, because it is applied to the final result set (not to all the documents matching the search).
If, on the other hand, you use a script that reads field values from _source in order to filter the search results, the query has to scan the whole index. That is the second part of the statement above (not when processing millions of documents).
Apart from that, all field values are already stored as part of the _source field, which is enabled by default, so you do not need to spend extra space on explicitly marking fields as stored (store is disabled by default to keep the index small) just to retrieve their values.
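As an illustration of the point above, here is a small sketch of a search body that uses source filtering on the returned hits. The helper name `filtered_search` and the example query are hypothetical; the `_source` and `size` request parameters are the ones discussed in the answer.

```python
import json


def filtered_search(query, fields, size=10):
    """Search body that returns only `fields` from each of the top `size` hits.

    The filter is applied to the hits being returned, not to every
    matching document, which is why it stays cheap.
    """
    return {"size": size, "_source": list(fields), "query": query}


if __name__ == "__main__":
    # Hypothetical query and field names, for illustration only.
    body = filtered_search({"match": {"bio": "engineer"}}, ["name", "bio"])
    print(json.dumps(body, indent=2))
```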

Are there downsides of indexing every field in Elasticsearch index? [closed]

I found information about how the number of shards and the number of fields in the mapping affect the performance of Elasticsearch. But I could not find any information about how indexing or not indexing a field affects cluster performance.
Imagine I have a document like:
{
"name":"John Doe",
"photoPath":"/var/www/public/images/i1/j/d/123456.jpg",
"passwordHash":"$12$3b$abc...354",
"bio":"..."
}
I need to put 10 to 100 such documents into the cluster each second. When I put such a document in the index, I am pretty sure I'd need full-text search on name and on bio. I will never search by photoPath and I will never need full-text search on the password hash.
When I do my mapping I have several options:
make all fields text and analyze them with simple analyzer (i.e. tokenize by any not-character) - in that case I will have terms like "i1", "3b" or "123456" in my index
make name and bio text, make password hash keyword and make photoPath non-indexed
So my questions are:
In what ways, if any, am I improving performance in case I use the second option with custom tailored field types?
Am I correct in my assumption that having fewer fields indexed helps performance?
Am I correct in my assumption that indexing fewer fields will improve indexing performance?
Am I correct in my assumption that actual search will be faster if I index only what I need?
Here we go with the answers:
In what ways, if any, am I improving performance in case I use the second option with custom tailored field types? --> see detailed explanation below
Am I correct in my assumption that having fewer fields indexed helps performance? --> Yes
Am I correct in my assumption that indexing fewer fields will improve indexing performance? --> Yes
Am I correct in my assumption that actual search will be faster if I index only what I need? --> Most likely
Detailed explanation:
Every index comes with a mapping in which you specify not just what data should get indexed but also in how many fields your data is stored and how to process the data before storing it. In its default configuration Elasticsearch will dynamically create this mapping for you based on the type of data you send to it.
Every single entry in your mapping consumes some bytes, which add to the size of the cluster state (the data structure that contains all the meta-information about your cluster, such as information about nodes, indices, fields, shards, etc., and lives in RAM). For some users the cluster state simply got too big, which severely affected performance. As a safety measure, Elasticsearch by default does not allow you to have more than 1000 fields in a single index.
Depending on the mapping type and optional mapping parameters, Elasticsearch will create one or more data structures for every single field you store. Some types are more "expensive" than others: keyword and boolean are rather "cheap" types, whereas text (for full-text search) is a rather expensive type, as it also requires preprocessing (analysis) of your strings. By default Elasticsearch maps strings to a multifield made up of 2 fields: one that goes by <fieldname>, which is of type text and supports full-text search, and one that goes by <fieldname>.keyword, of type keyword, which only supports exact-match search. On top of that, keyword fields and some other field types allow you to do analytics and use them for sorting. If you don't need one or the other, please customize your mapping so that each field is stored only for the use case you need. It makes a huge difference whether you only need to display a field (no need to create any additional data structures for that field), whether you need to be able to search in a field (requiring specific data structures like inverted indices), or whether you need to do analytics on a field or sort by that field (requiring specific data structures like doc_values). Besides the type you specify for each field in your mapping, you can also control which data structures get created with mapping parameters such as index, doc_values, and enabled (just to name a few).
At search time it also makes a difference over how many fields you are searching and how big your index is. The fewer fields, the smaller the index, the better for fast search requests.
Conclusion:
So, your idea to customize your mapping by only storing some fields as keyword fields, some as text fields, some as multifields makes perfect sense!
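A minimal sketch of the second option from the question, expressed as a mapping body built in Python. The field names come from the question's example document; switching off `doc_values` on photoPath in addition to `index` is an extra assumption on my part, based on the statement that the field is never searched (and presumably never sorted or aggregated on).

```python
import json


def tailored_mapping():
    """Mapping body for the question's example document (option 2)."""
    return {
        "properties": {
            # Full-text searchable fields.
            "name": {"type": "text"},
            "bio": {"type": "text"},
            # Exact-match only, no analysis.
            "passwordHash": {"type": "keyword"},
            # Still returned via _source, but neither an inverted index
            # nor doc_values are built for it.
            "photoPath": {"type": "keyword", "index": False, "doc_values": False},
        }
    }


if __name__ == "__main__":
    print(json.dumps(tailored_mapping(), indent=2))
```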
As the question has several parts, I will try to answer them with official Elasticsearch (ES) resources. Before that, let's break down what the OP has in the ES index and the use case of every field:
{
"name":"John Doe", // used for full-text search
"photoPath":"/var/www/pub/images/i1/j/d/123.jpg", // not used for search or retrieval
"passwordHash":"$12$3b$abc...354", // not used for search or retrieval
"bio":"..." // used for full-text search
}
As the OP only mentioned that photoPath and passwordHash aren't used for full-text search, I am assuming that these fields will not be used for retrieval either.
So first, we need to understand the difference between indexing a field and storing a field, which is explained very well in this and this article. In short, if you have _source disabled (it is enabled by default), you will not be able to retrieve a field unless it is stored.
Now for the optimization and performance part. It's very simple: if you send or store more data than you actually need, you are wasting resources (network, CPU, memory, disk). ES is no different here.
Now coming to the OP's assumptions/questions:
In what ways, if any, am I improving performance in case I use the second option with custom-tailored field types? This option is definitely better than the first, as you are not indexing fields which you don't need for search. There is still room for optimization, though: if you don't need to retrieve them either, it's better not to store them at all and to remove them from the index mapping.
Am I correct in my assumption that having fewer fields indexed helps performance? Yes, because your inverted index will be smaller, more of it will fit in the file system cache, and searching a smaller amount of data is always faster. Apart from that, not indexing fields that are unnecessary for your search helps improve the relevance of your results.
Am I correct in my assumption that indexing fewer fields will improve indexing performance? Explained in the previous answer.
Am I correct in my assumption that the actual search will be faster if I index only what I need? It improves not only search speed but also indexing speed (as there will be fewer segments, and merging them takes less time).
I could add a lot more information, but I wanted to keep this short and simple. Let me know if anything isn't clear and I would be happy to add more.
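One possible way to express "don't index them and don't store them" is sketched below: only the searchable fields are mapped, `dynamic` is set to false so unmapped fields are not indexed, and the unused fields are excluded from `_source`. This is an assumption about how the advice could be applied, not the answerer's own configuration, and the helper name is hypothetical. Note that fields excluded from `_source` cannot be recovered from the index later, for example when re-indexing.

```python
import json


def search_only_mapping(searchable, excluded_from_source):
    """Mapping body that indexes only `searchable` fields as text and
    drops `excluded_from_source` fields from the stored _source."""
    return {
        "dynamic": False,  # unmapped fields are accepted but not indexed
        "_source": {"excludes": list(excluded_from_source)},
        "properties": {name: {"type": "text"} for name in searchable},
    }


if __name__ == "__main__":
    body = search_only_mapping(["name", "bio"], ["photoPath", "passwordHash"])
    print(json.dumps(body, indent=2))
```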

Elasticsearch lucene, understand code path for search

I want to understand how each of the Lucene index files (nvd, dvd, tim, doc; mainly these four) is used in an ES query.
E.g. say my index has ten docs and I am running an aggregation query. I would like to understand how ES/Lucene accesses these four files for a single query.
I am trying to see if I can optimize my system, which is mostly disk-heavy, to speed up query performance.
I looked at the ES code and understand that the QueryPhase is the most expensive and that it seems to do a lot of random disk access for the log-oriented data I have.
I now want to dive deeper at the Lucene level as well, and possibly debug the code and see it in action. The Lucene code has zero log messages for IndexReader-related classes. Also, debugging the Lucene code directly seems unhelpful, since the unit tests don't create indexes with tim, doc, nvd, dvd files.
Any pointers?
As far as I know, ES itself doesn't do much at the level of search details. If you want to optimize search, in my experience you should optimize your data layout. Here is a description of some important Lucene files
(see http://lucene.apache.org/core/7_2_1/core/org/apache/lucene/codecs/lucene70/package-summary.html#package.description):
Term Index (.tip) # IN MEMORY.
Term Dictionary (.tim) # ON DISK.
Frequencies (.doc) # ON DISK.
Per-Document Values (.dvd, .dvm), very useful for aggregations. # ON DISK.
Field Index (.fdx) # IN MEMORY.
Field Data (.fdt), where the data is finally fetched from disk. # ON DISK.
And here are some points that can improve performance:
try to use small data types, for example INTEGER or LONG values instead of STRING.
DISABLE DocValues on fields that don't need them, and enable DocValues on the fields you want to sort or aggregate on.
include only the necessary fields in the source, like "_source": { "includes": ["some_necessary_field"]}.
index only the fields that you need, using ES-defined mappings.
split your data across multiple indices.
add SSDs.
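To see which of the files listed above dominate on disk for a given shard, a rough sketch like the following can help. It simply sums file sizes per extension in one directory. The directory you point it at (a shard's Lucene index directory) is an assumption about your installation, and small segments may bundle these files into compound files, in which case the individual extensions will not show up.

```python
from collections import defaultdict
from pathlib import Path

# Descriptions copied from the list above; anything else is shown as "other".
LUCENE_FILES = {
    ".tip": "term index",
    ".tim": "term dictionary",
    ".doc": "frequencies",
    ".dvd": "per-document values (data)",
    ".dvm": "per-document values (metadata)",
    ".fdx": "field index",
    ".fdt": "field data",
}


def size_by_extension(index_dir):
    """Return total bytes per file extension for the files in index_dir."""
    totals = defaultdict(int)
    for path in Path(index_dir).iterdir():
        if path.is_file():
            totals[path.suffix] += path.stat().st_size
    return dict(totals)


def report(index_dir):
    """Return tab-separated lines: extension, bytes, description; largest first."""
    totals = size_by_extension(index_dir)
    ordered = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        f"{ext or '(none)'}\t{size}\t{LUCENE_FILES.get(ext, 'other')}"
        for ext, size in ordered
    ]


if __name__ == "__main__":
    import sys

    # Usage: python lucene_sizes.py /path/to/shard/index  (path is hypothetical)
    for line in report(sys.argv[1]):
        print(line)
```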

Elasticsearch store field vs _source

Using Elasticsearch 1.4.3
I'm building a sort of "reporting" system, and the client can pick and choose which fields they want returned in their result.
In 90% of the cases the client will never pick all the fields, so I figured I can disable _source field in my mapping to save space. But then I learned that
GET myIndex/myType/_search/
{
"fields": ["field1", "field2"]
...
}
does not return the fields.
So I assume I then have to use "store": true for each field. From what I read this will be faster for searches, but I guess space-wise it will be the same as _source, or do we still save space?
The _source field stores the JSON you send to Elasticsearch and you can choose to only return certain fields if needed, which is perfect for your use case. I have never heard that stored fields are faster for searches. The _source field could take more disk space, but if you have to store every field anyway there is no need to use stored fields over the _source field. If you do disable the source field it will mean:
You won’t be able to do partial updates
You won’t be able to re-index your data from the JSON in your Elasticsearch cluster, you’ll have to re-index from the data source (which is usually a lot slower).
By default in elasticsearch, the _source (the document one indexed) is stored. This means when you search, you can get the actual document source back. Moreover, elasticsearch will automatically extract fields/objects from the _source and return them if you explicitly ask for it (as well as possibly use it in other components, like highlighting).
You can specify that a specific field is also stored. This means that the data for that field will be stored on its own. Meaning that if you ask for field1 (which is stored), elasticsearch will identify that it's stored, and load it from the index instead of getting it from the _source (assuming _source is enabled).
When do you want to enable storing specific fields? Most times, you don't. Fetching the _source is fast and extracting it is fast as well. If you have very large documents, where the cost of storing the _source, or the cost of parsing the _source is high, you can explicitly map some fields to be stored instead.
Note, there is a cost of retrieving each stored field. So, for example, if you have a json with 10 fields with reasonable size, and you map all of them as stored, and ask for all of them, this means loading each one (more disk seeks), compared to just loading the _source (which is one field, possibly compressed).
I got this answer from a thread answered by shay.banon; you can read the whole thread to get a good understanding of it.
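The two retrieval paths described above can be sketched as request bodies. This is only an illustration with hypothetical helper names; the field names field1 and field2 come from the question. In recent Elasticsearch versions the request parameter for stored fields is `stored_fields`; in the 1.x series used in the question it was called `fields`.

```python
import json


def stored_field_mapping(stored, other):
    """Mapping body where `stored` fields are stored on their own
    and `other` fields are only available through _source."""
    props = {name: {"type": "keyword", "store": True} for name in stored}
    props.update({name: {"type": "keyword"} for name in other})
    return {"properties": props}


def stored_fields_search(fields):
    """Loads each requested field individually from the stored fields."""
    return {"stored_fields": list(fields), "query": {"match_all": {}}}


def source_filter_search(fields):
    """Loads _source once and extracts the requested fields from it."""
    return {"_source": list(fields), "query": {"match_all": {}}}


if __name__ == "__main__":
    print(json.dumps(stored_field_mapping(["field1"], ["field2"]), indent=2))
    print(json.dumps(stored_fields_search(["field1"]), indent=2))
    print(json.dumps(source_filter_search(["field1", "field2"]), indent=2))
```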
Clinton Gormley says in the link below
https://groups.google.com/forum/#!topic/elasticsearch/j8cfbv-j73g/discussion
by default ES stores your JSON doc in the _source field, which is set to "stored"
by default, the fields in your JSON doc are set to NOT be "stored" (ie stored as a separate field)
so when ES returns your doc (search or get) it just load the _source field and returns that, ie a single disk seek
Some people think that by storing individual fields, it will be faster than loading the whole JSON doc from the _source field. What they don't realise is that each stored field requires a disk seek (10ms each seek!), and that the sum of those seeks far outweighs the cost of just sending the _source field.
In other words, it is almost always a false optimization.
Enabling _source will store the entire JSON document in the index while store will only store individual fields that are marked so. So using store might be better than using _source if you want to save disk space.
As a reference for ES 7.3, the answer becomes clearer. DO NOT try to optimize before you have strong testing reasons UNDER REALISTIC PRODUCTION CONDITIONS.
I will just quote from the _source documentation:
Users often disable the _source field without thinking about the consequences, and then live to regret it. If the _source field isn't available then a number of features are not supported:
The update, update_by_query, and reindex APIs.
On the fly highlighting.
The ability to reindex from one Elasticsearch index to another, either to change mappings or analysis, or to upgrade an index to a new major version.
The ability to debug queries or aggregations by viewing the original document used at index time.
Potentially in the future, the ability to repair index corruption automatically.
TIP: If disk space is a concern, rather increase the compression level instead of disabling the _source.
Besides, there are no obvious advantages to using stored_fields, contrary to what you might have thought.
If you only want to retrieve the value of a single field or of a few fields, instead of the whole _source, then this can be achieved with source filtering.
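The tip about raising the compression level instead of disabling _source could look roughly like this at index creation time. This is a sketch under assumptions: the helper name and the example fields are hypothetical, and `index.codec` with the value `best_compression` is the setting I understand the tip to refer to. The body uses the typeless mapping layout of 7.x.

```python
import json


def index_body(field_types):
    """Index creation body that keeps _source enabled but uses the
    higher-compression codec for stored data."""
    return {
        "settings": {"index": {"codec": "best_compression"}},
        "mappings": {
            "properties": {
                name: {"type": field_type}
                for name, field_type in field_types.items()
            }
        },
    }


if __name__ == "__main__":
    # Hypothetical fields, for illustration only.
    print(json.dumps(index_body({"field1": "keyword", "field2": "text"}), indent=2))
```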

Can ElasticSearch create/store just the indexes while leaving the source document where it is?

Assuming I already have a set of documents living in some document store can I have ElasticSearch create its indexes and store them in its various replicated nodes while leaving the documents themselves where they are? In other words can I use ES just for search and not for storage? (I understand this might not be ideal but assume there are good reasons I need to keep the documents themselves where they are).
If I take this approach does it remove any functionality from search, for example showing where in a document the search term was found?
Thanks.
The link Konstantin referenced should show you how to disable _source.
There is another way to store fields (store=true). You are better off using _source and excluding any specific fields you don't want stored as part of _source, though.
Functionality removed:
Viewing fields that are returned from search
Highlighting
Easily rebuilding an index from _source. Probably not an issue, since data is stored elsewhere
There are probably other features I am missing.
The only case I've come across where I really don't need _source is when building an analytics engine where I am only returning aggregates (term and histogram).
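For the aggregates-only case mentioned above, a sketch of an index with _source disabled and a search that returns only buckets might look like this. The field names `status` and `timestamp`, the aggregation names, and the helper names are hypothetical. `calendar_interval` is the parameter name in recent 7.x versions; older versions use `interval`.

```python
import json


def analytics_index_body():
    """Index creation body with _source disabled; documents can still be
    searched and aggregated, but the original JSON cannot be returned."""
    return {
        "mappings": {
            "_source": {"enabled": False},
            "properties": {
                "status": {"type": "keyword"},
                "timestamp": {"type": "date"},
            },
        }
    }


def aggregates_only_search():
    """Search body returning a terms and a histogram aggregation, no hits."""
    return {
        "size": 0,
        "aggs": {
            "by_status": {"terms": {"field": "status"}},
            "per_day": {
                "date_histogram": {"field": "timestamp", "calendar_interval": "day"}
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(analytics_index_body(), indent=2))
    print(json.dumps(aggregates_only_search(), indent=2))
```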

Resources