Are there downsides of indexing every field in Elasticsearch index? [closed] - elasticsearch

I found information about how the number of shards and the number of fields in a mapping affect Elasticsearch performance. But I could not find any information about how indexing or not indexing a field affects cluster performance.
Imagine I have document like:
{
"name":"John Doe",
"photoPath":"/var/www/public/images/i1/j/d/123456.jpg",
"passwordHash":"$12$3b$abc...354",
"bio":"..."
}
I need to put 10 to 100 such documents into the cluster each second. When I put such a document into the index, I am pretty sure I'll need full-text search on name and on bio. I will never search on photoPath, and I will never need full-text search on the password hash.
When I do my mapping I have several options:
make all fields text and analyze them with simple analyzer (i.e. tokenize by any not-character) - in that case I will have terms like "i1", "3b" or "123456" in my index
make name and bio text, make password hash keyword and make photoPath non-indexed
So my questions are:
In what ways, if any, am I improving performance in case I use the second option with custom tailored field types?
Am I correct in my assumption that having less fields indexed helps performance?
Am I correct in my assumption that indexing fewer fields will improve indexing performance?
Am I correct in my assumption that actual search will be faster if I index only what I need?

Here we go with the answers:
In what ways, if any, am I improving performance in case I use the second option with custom tailored field types? --> see detailed explanation below
Am I correct in my assumption that having less fields indexed helps performance? --> Yes
Am I correct in my assumption that indexing fewer fields will improve indexing performance? --> Yes
Am I correct in my assumption that actual search will be faster if I index only what I need? --> Most likely
Detailed explanation:
Every index comes with a mapping in which you specify not just what data should get indexed, but also how many fields your data is stored in and how to process the data before storing it. In its default configuration, Elasticsearch dynamically creates this mapping for you based on the type of data you send to it.
Every single entry in your mapping consumes some bytes, which adds to the size of the cluster state (the data structure that contains all the meta-information about your cluster, such as information about nodes, indices, fields, shards, etc., and lives in RAM). For some users the cluster state simply got too big, which severely affected performance. As a safety measure, Elasticsearch by default does not allow more than 1000 fields in a single index.
Depending on the mapping type and optional mapping parameters, Elasticsearch creates one or more data structures for every field you store. Some types are cheaper than others: keyword and boolean are rather cheap, whereas text (for full-text search) is rather expensive, because it also requires preprocessing (analysis) of your strings. By default Elasticsearch maps strings to a multi-field made up of two fields: <fieldname>, of type text, which supports full-text search, and <fieldname>.keyword, of type keyword, which only supports exact-match search. On top of that, keyword fields and some other field types let you run analytics on them and use them for sorting.
If you don't need one or the other, customize your mapping so that each field is stored only for the use case you need. It makes a huge difference whether you only need to display a field (no additional data structures are needed for it), need to search in it (requiring data structures such as inverted indices), or need to run analytics on it or sort by it (requiring data structures such as doc_values). Besides the field type, you can also control which data structures get created with mapping parameters such as index, doc_values, and enabled (just to name a few).
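As a rough sketch of the second option from the question, the mapping could look like the following. It is written as a Python dict so it can be checked without a cluster; the same body would be sent as JSON when creating the index. Switching off doc_values for photoPath is an assumption here, based on the question saying the field is never searched, and the helper function is purely illustrative.

```python
import json

# Hypothetical mapping for the document shown in the question.
user_mapping = {
    "mappings": {
        "properties": {
            # Full-text search is needed: analyzed text fields.
            "name": {"type": "text"},
            "bio": {"type": "text"},
            # Never full-text searched: one un-analyzed term per value.
            "passwordHash": {"type": "keyword"},
            # Display only (assumption): no inverted index and no doc_values.
            # The value is still returned as part of _source.
            "photoPath": {"type": "keyword", "index": False, "doc_values": False},
        }
    }
}


def searchable_fields(body):
    """Return the names of the fields that still get indexed for search."""
    props = body["mappings"]["properties"]
    return sorted(name for name, cfg in props.items() if cfg.get("index", True))


print(json.dumps(user_mapping, indent=2))
print(searchable_fields(user_mapping))
```

With this mapping only three of the four fields produce searchable structures, and only two of them go through analysis.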
At search time it also makes a difference over how many fields you are searching and how big your index is. The fewer fields, the smaller the index, the better for fast search requests.
Conclusion:
So, your idea to customize your mapping by only storing some fields as keyword fields, some as text fields, some as multifields makes perfect sense!

As the question has several parts, I will try to answer them with official Elasticsearch (ES) resources. Before that, let's break down what OP has in the ES index and the use case of every field:
{
"name":"John Doe", //used for full text search
"photoPath":"/var/www/pub/images/i1/j/d/123.jpg", // not used for search or retrieval.
"passwordHash":"$12$3b$abc...354", // not used for search or retrieval.
"bio":"..." //used for full text search
}
Now as OP just mentioned photoPath and passwordHash aren't used for full-text search, I am assuming that these fields will not be used even for retrieval purposes.
So first, we need to understand the difference between indexing a field and storing a field, which is explained very well in this and this article. In short, if you have _source disabled (it is enabled by default), you will not be able to retrieve a field unless it is stored.
Now for the optimization and performance part. Put simply, if you send or store more data than you actually need, you are wasting resources (network, CPU, memory, disk). ES is no different here.
Now coming to OP assumptions/questions:
In what ways, if any, am I improving performance in case I use the second option with custom-tailored field types? This option is definitely better than the first, as you are not indexing fields that you don't need for search. There is still room for optimization, though: if you don't need to retrieve them either, it's better not to store them at all and to remove them from the index mapping.
Am I correct in my assumption that having fewer fields indexed helps performance? Yes. Your inverted index will be smaller, so more of it fits in the file system cache, and searching a smaller amount of data is always faster. Apart from that, not indexing fields that are unnecessary for your search helps the relevance of your results.
Am I correct in my assumption that indexing fewer fields will improve indexing performance? Explained in the previous answer.
Am I correct in my assumption that the actual search will be faster if I index only what I need? It improves not only search speed but also indexing speed (there will be fewer segments, and merging them takes less time).
I can add a lot more information but I wanted to keep this short and simple. Let me know if anything isn't clear and would be happy to add more information.

Related

What will be the effect of fielddata=true when querying a ~10M document index, and more questions

I have an index of ~10M docs. Each document has a 'text' field holding a string, and in the end I want to aggregate over all the terms inside this field. When I try to do that, I only get the entire string back.
I heard only bad things about using fielddata=true.
For this amount of documents, is it really such a bad practice to use fielddata=true in terms of memory?
Is there a difference (in terms of performance) between using an analyzer in the indexing pipeline (just set an analyzer on a specific field) to using an analyzer as a function (run analyzer on a string, get the results and put them in a document)?
Synonyms: I have defined a list of synonyms. I believe I already know the answer, but I'll ask anyway: is it possible to simply update the list and be done, or is it mandatory to re-index after updating it?
Yes, running out of memory is a risk, but you should test to find out how much memory you actually need. 10M docs is not too many for a 32G heap memory limit.
I didn't understand the question.
When you create the index, you point it at a list (file) of synonyms. After that you can update the list without re-indexing, except when you use simple contraction, which does require a re-index. https://www.elastic.co/guide/en/elasticsearch/guide/current/synonyms-expand-or-contract.html
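For context on the first question, this is roughly what enabling fielddata and aggregating on the analyzed terms looks like. The field name text comes from the question; the aggregation name and the bucket size are made up for illustration, and the bodies are plain Python dicts that would be sent as JSON.

```python
import json

# Mapping update that enables fielddata on the analyzed field from the question.
# Fielddata is built on the heap, which is where the memory concern comes from.
fielddata_mapping = {
    "properties": {
        "text": {"type": "text", "fielddata": True}
    }
}

# Terms aggregation over the individual analyzed terms of that field.
terms_agg_request = {
    "size": 0,  # only the aggregation buckets are wanted, not the hits
    "aggs": {
        "all_terms": {  # hypothetical aggregation name
            "terms": {"field": "text", "size": 100}  # bucket count is arbitrary
        }
    },
}

print(json.dumps(fielddata_mapping))
print(json.dumps(terms_agg_request))
```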

ElasticSearch Edge NGram vs Prefix query

Let's say we have a text field that is relatively short, let's say maximum 10 characters and is saved as a keyword.
I want my users to be able to prefix-search this field (not autocomplete / search-as-you-type).
I have read on Elastic's documentation that the prefix query scales poorly, and they give a couple of examples to demonstrate it.
When is it ok to use prefix search, and when should I use index-time edge-ngrams? Building and storing index-time edge-ngrams of this field sounds excessive, but maybe I'm missing something.
First things first:
Keyword fields are only searchable by their exact value
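That sentence describes how full-text queries such as match behave: a keyword field is not analyzed, so each value is indexed as one single term. The prefix query is a term-level query, so it can still be run against a keyword field, where it matches the beginning of that whole value.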
In general, prefix search has relatively poor performance. On a text field you can speed it up with the index_prefixes mapping parameter, but even then it is a relatively expensive operation, so if you plan to scale, index-time edge n-grams are the more resource-efficient choice.
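The trade-off behind index-time edge n-grams can be sketched in a few lines of plain Python. This is only an illustration of the idea (pay at index time, look up at search time), not how Elasticsearch stores terms; the function names and sample values are made up.

```python
def edge_ngrams(term, min_gram=1, max_gram=10):
    """Return the leading substrings of term, from min_gram to max_gram characters."""
    upper = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, upper + 1)]


def build_prefix_index(values, min_gram=1, max_gram=10):
    """Map every edge n-gram to the set of values that start with it."""
    index = {}
    for value in values:
        for gram in edge_ngrams(value, min_gram, max_gram):
            index.setdefault(gram, set()).add(value)
    return index


prefix_index = build_prefix_index(["kibana", "kafka", "lucene"])

# A prefix search is now a single exact lookup instead of a scan over terms.
print(sorted(prefix_index.get("k", set())))
print(sorted(prefix_index.get("luc", set())))
```

For a field of at most 10 characters, each value adds at most 10 extra terms, which gives a feel for how much larger the index gets.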

Elasticsearch store field vs _source

Using Elasticsearch 1.4.3
I'm building a sort of "reporting" system, and the client can pick and choose which fields they want returned in their result.
In 90% of the cases the client will never pick all the fields, so I figured I can disable _source field in my mapping to save space. But then I learned that
GET myIndex/myType/_search/
{
"fields": ["field1", "field2"]
...
}
Does not return the fields.
So I assume I then have to use "store": true for each field. From what I read, this will be faster for searches, but will it take the same space as _source, or do we still save space?
The _source field stores the JSON you send to Elasticsearch and you can choose to only return certain fields if needed, which is perfect for your use case. I have never heard that the stored fields will be faster for searches. The _source field could be bigger on disk space, but if you have to store every field there is no need to use stored fields over the _source field. If you do disable the source field it will mean:
You won’t be able to do partial updates.
You won’t be able to re-index your data from the JSON in your Elasticsearch cluster; you’ll have to re-index from the data source (which is usually a lot slower).
By default in elasticsearch, the _source (the document one indexed) is stored. This means when you search, you can get the actual document source back. Moreover, elasticsearch will automatically extract fields/objects from the _source and return them if you explicitly ask for it (as well as possibly use it in other components, like highlighting).
You can specify that a specific field is also stored. This means that the data for that field will be stored on its own. Meaning that if you ask for field1 (which is stored), elasticsearch will identify that its stored, and load it from the index instead of getting it from the _source (assuming _source is enabled).
When do you want to enable storing specific fields? Most times, you don't. Fetching the _source is fast and extracting it is fast as well. If you have very large documents, where the cost of storing the _source, or the cost of parsing the _source is high, you can explicitly map some fields to be stored instead.
Note, there is a cost of retrieving each stored field. So, for example, if you have a json with 10 fields with reasonable size, and you map all of them as stored, and ask for all of them, this means loading each one (more disk seeks), compared to just loading the _source (which is one field, possibly compressed).
I took this answer from a thread answered by shay.banon; the whole thread is worth reading for a good understanding of the topic.
Clinton Gormley says in the link below
https://groups.google.com/forum/#!topic/elasticsearch/j8cfbv-j73g/discussion
by default ES stores your JSON doc in the _source field, which is set to "stored"
by default, the fields in your JSON doc are set to NOT be "stored" (ie stored as a separate field)
so when ES returns your doc (search or get) it just load the _source field and returns that, ie a single disk seek
Some people think that by storing individual fields, it will be faster than loading the whole JSON doc from the _source field. What they don't realise is that each stored field requires a disk seek (10ms each seek!), and that the sum of those seeks far outweighs the cost of just sending the _source field.
In other words, it is almost always a false optimization.
Enabling _source will store the entire JSON document in the index while store will only store individual fields that are marked so. So using store might be better than using _source if you want to save disk space.
As a reference for ES 7.3, the answer becomes clearer. DO NOT try to optimize before you have strong testing reasons UNDER REALISTIC PRODUCTION CONDITIONS.
To quote the _source field documentation:
Users often disable the _source field without thinking about the consequences, and then live to regret it. If the _source field isn't available then a number of features are not supported:
The update, update_by_query, and reindex APIs.
On the fly highlighting.
The ability to reindex from one Elasticsearch index to another, either to change mappings or analysis, or to upgrade an index to a new major version.
The ability to debug queries or aggregations by viewing the original document used at index time.
Potentially in the future, the ability to repair index corruption automatically.
TIP: If disk space is a concern, rather increase the compression level instead of disabling the _source.
Besides, stored_fields has no obvious advantages of the kind you might have expected.
If you only want to retrieve the value of a single field or of a few fields, instead of the whole _source, then this can be achieved with source filtering.
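A minimal sketch of source filtering for the reporting use case, using the field1 and field2 names from the question. The search body is a plain dict that would be sent as JSON; filter_source is a hypothetical local helper that only imitates the effect on a flat document and is not part of any Elasticsearch client.

```python
import json

# Search body that asks for only two fields out of _source.
search_body = {
    "query": {"match_all": {}},
    "_source": ["field1", "field2"],
}


def filter_source(doc, includes):
    """Imitate source filtering on a flat document: keep only the listed top-level keys."""
    return {key: value for key, value in doc.items() if key in includes}


sample_doc = {"field1": "a", "field2": "b", "field3": "c"}  # made-up document
print(json.dumps(search_body))
print(filter_source(sample_doc, search_body["_source"]))
```

The document is still stored once in _source; only the response is trimmed to the requested fields.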

Elasticsearch questions: search, performance and caching

I'm new to elasticsearch, have been reading their API and some things are not clear to me
1) It is said that filters are cached. What does that mean? If I send a query with a filter on it, what gets cached? The results of that query? If I send a different query with the same filter, will the cache help me somehow?
I know the question is kinda vague, but so is ElasticSearch's documentation for this.
2) Is there a real performance difference between a query matching a term X to the "_all" field or to a specific field? As far as I understand, both queries will be compared against all documents that contain X in one of their fields, and the only difference is in how many fields will be matched against X in these documents. Is that correct?
1) For your first question take a look at this link.
To quote from the post
"Filters don’t score documents – they simply include or exclude. If a document matches a filter, it is represented with a one in the BitSet; otherwise a zero. This means that Elasticsearch can store an entire segment’s filter state (“who matches this particular filter?”) in a single, compact BitSet.
The first time Elasticsearch executes a filter, it parses Lucene segment data structures to determine what matches your filter. Instead of throwing away this information, it caches it inside a BitSet.
The next time the same filter is executed, Elasticsearch can reference the compact BitSet instead of the Lucene segments. This has huge performance benefits."
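To make the quoted idea concrete, here is a toy sketch of caching filter results as a bitset. It is not how Elasticsearch or Lucene implement it; the class, the cache key, and the sample documents are all made up for illustration.

```python
class FilterCache:
    """Toy cache: one bitset (stored as a Python int) per filter key."""

    def __init__(self, docs):
        self.docs = docs
        self.bitsets = {}
        self.scans = 0  # how many times the documents were actually scanned

    def matches(self, key, predicate):
        if key not in self.bitsets:
            self.scans += 1
            bits = 0
            for position, doc in enumerate(self.docs):
                if predicate(doc):
                    bits |= 1 << position  # 1 = document matches the filter
            self.bitsets[key] = bits
        bits = self.bitsets[key]
        return [doc for position, doc in enumerate(self.docs) if (bits >> position) & 1]


docs = [{"status": "open"}, {"status": "closed"}, {"status": "open"}]
cache = FilterCache(docs)


def is_open(doc):
    return doc["status"] == "open"


first = cache.matches("status:open", is_open)
second = cache.matches("status:open", is_open)  # answered from the cached bitset
print(len(first), cache.scans)
```

A different query that reuses the same filter key would also be answered from the cached bitset, which is the point raised in the question.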
2) "The idea of the _all field is that it includes the text of one or more other fields within the document indexed. It can come very handy especially for search requests, where we want to execute a search query against the content of a document, without knowing which fields to search on. This comes at the expense of CPU cycles and index size." (link)
So if you know which fields you are going to query, search on those specific fields.

In Elasticsearch, what happens if I set 'store' to yes on a few fields, but _source to false?

We're building a "unified" search across a lot of different resources in our system. Our index schema includes about 10 generic fields that are indexed, plus 5 which are required to identify the appropriate resource location in our system when results are returned.
The indexed fields often contain sensitive data, so we don't want them stored at all, only indexed for matching, thus we set the _source to FALSE.
I do however want the 5 ident fields returned, so is it possible to set the ident fields to store = yes, but the overall index _source to FALSE and get what I'm looking for in the results?
Have a look at this other answer as well. As mentioned there, in most of the cases the _source field helps a lot. Even though it might seem like a waste because elasticsearch effectively stores the whole document that comes in, that's really handy (e.g. when needing to update documents without sending the whole updated document). At the end of the day it hides a lucene implementation detail, the fact that you need to explicitly store fields if you want to get them back, while users usually expect to get back what they sent to the search engine. Surprisingly, the _source helps performance wise too, as it requires a single disk seek instead of more disk seeks that might be caused by retrieving multiple stored fields. At the end of the day the _source field is just a big lucene stored field containing json, which can be parsed in order to get to specific fields and do some work with them, without needing to store them separately.
That said, depending on your use case (how many fields you retrieve), it might be useful to have a look at source include/exclude at the bottom of the _source field reference, which lets you prevent parts of the source field (e.g. the sensitive parts of your documents) from being stored. That would be useful if you want to keep relying on the _source but don't want a part of the input documents to be returned, while still being able to search against those fields, as they are going to be indexed (but not stored!) in the underlying lucene index.
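A sketch of the exclude approach, written as a Python dict that would be sent as JSON when creating the index. The field names are hypothetical: content stands in for one of the sensitive indexed fields and resource_id for one of the identifier fields.

```python
import json

# Mapping that keeps _source but leaves the sensitive field out of it.
excludes_mapping = {
    "mappings": {
        "_source": {"excludes": ["content"]},
        "properties": {
            "content": {"type": "text"},         # searchable, but not kept in _source
            "resource_id": {"type": "keyword"},  # kept in _source and returned with hits
        },
    }
}

print(json.dumps(excludes_mapping, indent=2))
```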
In both cases (either you disable the _source completely or exclude some parts), if you plan to update your documents keep in mind that you'll need to send the whole updated document using the index api. In fact you cannot rely on partial updates provided with the update api as you don't have in the index the complete document that you indexed in the first place, which you would need to apply changes to.
Yes, stored fields do not rely on the _source field, or vice-versa. They are separate, and changing or disabling one shouldn't impact the other.
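The setup from the question could be sketched like this, as Python dicts that would be sent as JSON. The field names content and resource_id are hypothetical stand-ins, and the syntax ("store": true rather than "yes", plus stored_fields in the search body) assumes a recent Elasticsearch version rather than the one the question was written for.

```python
import json

# Mapping: no _source at all, only the identifier field is stored on its own.
stored_only_mapping = {
    "mappings": {
        "_source": {"enabled": False},
        "properties": {
            "content": {"type": "text"},                        # indexed for matching only
            "resource_id": {"type": "keyword", "store": True},  # stored separately
        },
    }
}

# Search that matches on the indexed field and returns only the stored one.
stored_search = {
    "query": {"match": {"content": "example"}},  # made-up search term
    "stored_fields": ["resource_id"],
}

print(json.dumps(stored_only_mapping, indent=2))
print(json.dumps(stored_search))
```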
