Do two equal documents in Elasticsearch double the needed disk space?

When I save the same document, say, 10 times, does it need ten times as much disk space? Or are the individual field values stored once in an index, so that documents only reference that index when several of them have the same value for a field?

The answer is yes and no.
By default the data is stored in an aggregated data structure, the Lucene inverted index.
In addition, the data you submitted for indexing is also stored in a field called _source. So the data is kept in two different forms: the inverted index is used only for searching, while the actual data is retrieved from _source.
So if _source is explicitly disabled, you won't see linear growth in disk usage (provided the segments have been merged into a single segment).
If it is not disabled, the data has to be stored both in _source (as raw JSON) and in the inverted index (tokenized).
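A toy model can make the two storage paths concrete. The sketch below is not how Lucene works internally (real segments are compressed and merged); it assumes a plain whitespace tokenizer and ordinary Python containers, and the names index_documents, inverted_index and source_store are hypothetical.

```python
import json
from collections import defaultdict


def index_documents(docs):
    """Toy model: return (inverted_index, source_store) for a list of dicts."""
    inverted_index = defaultdict(list)  # (field, term) -> list of doc ids
    source_store = {}                   # doc id -> raw JSON string
    for doc_id, doc in enumerate(docs):
        # The raw document is kept once per document, like _source.
        source_store[doc_id] = json.dumps(doc)
        for field, value in doc.items():
            # Assumption: lowercase + whitespace split stands in for analysis.
            for term in set(str(value).lower().split()):
                inverted_index[(field, term)].append(doc_id)
    return inverted_index, source_store


doc = {"title": "quick brown fox", "author": "alice"}
one_index, one_source = index_documents([doc])
ten_index, ten_source = index_documents([doc] * 10)

# The number of distinct terms stays the same; only the postings lists grow.
print(len(one_index), len(ten_index))
# The raw JSON copies grow linearly with the number of documents.
print(len(one_source), len(ten_source))
```

In this model the term dictionary is shared by all ten copies, while the _source-like store holds ten separate JSON strings.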

Related

Elastic Search Monthly Rolling index with custom routing

I am trying to figure out how to create a monthly rolling index with custom routing (multi-tenancy scenario), with these requirements:
WRITE flow: Each document will have a timestamp, and the document should be indexed into the appropriate backing index based on that timestamp, not into the latest index. Also, write requests will have a custom routing key (e.g. customerId) so they hit a specific shard.
READ flow: Requests must be routed to all backing indexes. Requests will have a custom routing key specified (e.g. customerId), and results must be aggregated and returned.
Index creation: Rolling the index should be automated. Each index should have a custom routing key (e.g. customerId).
What options are available?
This very feature, called time-series data stream (TSDS), is coming in the ES 8.5 release.
The big difference between normal data streams and time-series data stream is that all backing indexes of TSDS are sorted by timestamp and all documents will be written in the right backing index for the given time frame of the document, even if that backing index is not the current write index, which means if your data source lags (even by a few hours), the data will still land in the right index. Also all documents related to the same dimension (i.e. customerId in your case) will end up on the same shard.
Another difference is that the ID of each document is computed as a function of the timestamp and the dimension(s) contained in the document, which means there can only be a single occurrence for a given timestamp/dimension pair (i.e. no duplicates).
Technically, you can already achieve pretty much the same with normal data streams, however, the underlying optimizations related to storing docs in the same shard and the ability to write documents to older backing indexes won't be possible since you can only index documents in the current write index.
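The write behaviour described above (a late document still lands in the backing index that covers its timestamp) can be pictured with a small toy model. This is only a sketch of the idea, not Elasticsearch code: the BackingIndex type, the pick_backing_index helper and the index names are all hypothetical, and shard routing by dimension is not modelled.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Optional


class BackingIndex(NamedTuple):
    name: str
    start: datetime  # inclusive lower bound of the covered time range
    end: datetime    # exclusive upper bound of the covered time range


def pick_backing_index(indices, timestamp) -> Optional[str]:
    """Return the name of the backing index whose range covers the timestamp."""
    for index in indices:
        if index.start <= timestamp < index.end:
            return index.name
    return None  # no backing index accepts this timestamp


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


# Hypothetical monthly backing indexes.
indices = [
    BackingIndex("events-2022.09", utc(2022, 9, 1), utc(2022, 10, 1)),
    BackingIndex("events-2022.10", utc(2022, 10, 1), utc(2022, 11, 1)),
]

# A document that arrives late is still placed by its own timestamp,
# not by the index that happens to be the newest one.
late_doc = {"@timestamp": utc(2022, 9, 30), "customerId": "c-42"}
target = pick_backing_index(indices, late_doc["@timestamp"])
print(target)
```

With a normal data stream the equivalent of pick_backing_index would always return the current write index, which is the limitation the answer points out.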

Elasticsearch/Lucene null handling in doc values

I'm planning to use Elasticsearch mostly for data analytics. I have large documents with many, mostly numeric (up to 4 bytes) attributes. Most fields in my documents are populated in only about 30% of cases. If I understand correctly, I can take advantage of the Doc Values feature, which is similar to the columnar data layout found in some databases. I was wondering how Elasticsearch/Lucene will store this data. Is any compression used (e.g. run length), or is it a dense data layout where nulls take the same space on storage as the values?
The default behavior of Elasticsearch is not to add the field at all when the value is NULL. You can force the field to be indexed with the null_value mapping parameter, but the replacement value must match the field's type. For example, a long field cannot be mapped with a string null_value.
So to address the question, Elasticsearch will not allocate default space for fields missing from the document. But you may run into a MissingFieldException if you query on a field which never had a value. To avoid this, map your fields explicitly before indexing. If you map explicitly, make sure to set the null_value property of the field to a value outside the range of your input data.
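As an illustration, an explicit mapping with null_value could look like the following request body, built here as a Python dict. The field names and the sentinel value -1 are assumptions made for the example (pick a sentinel that cannot occur in your data), and the top-level "mappings" layout assumes a typeless (7.x or later) index.

```python
import json

# Hypothetical fields; -1 is assumed to lie outside the real data range.
mapping = {
    "properties": {
        # The null_value must have the same type as the field itself.
        "temperature": {"type": "integer", "null_value": -1},
        "status": {"type": "keyword", "null_value": "NULL"},
    }
}

# Body that could be sent when creating the index.
create_index_body = {"mappings": mapping}
print(json.dumps(create_index_body, indent=2))
```

Note that null_value only changes what is indexed for an explicit null; the _source document itself is left as it was sent.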

Index type in elasticsearch

I am trying to understand and effectively use the index type available in elasticsearch.
However, I am still not clear how _type meta field is different from any regular field of an index in terms of storage/implementation. I do understand avoiding_type_gotchas
For example, if I have 1 million records (say posts) and each post has a creation_date. How will things play out if one of my index types is creation_date itself (leading to ~ 1 million types)? I don't think it affects the way Lucene stores documents, does it?
In what way will my elasticsearch query performance be affected if I use creation_date as the index type instead of a namesake type, say 'post'?
I got the answer on elastic forum.
https://discuss.elastic.co/t/index-type-effective-utilization/58706
Pasting the response as is -
"While elasticsearch is scalable in many dimensions there is one where it is limited. This is the metadata about your indices which includes the various indices, doc types and fields they contain.
These "mappings" exist in memory and are updated and shared around all nodes with every change. For this reason it does not make sense to endlessly grow the list of indices, types (and therefore fields) that exist in this cluster state. A type-per-document-creation-date registers a million on the one-to-ten scale of bad design decisions" - Mark_Harwood

Elasticsearch store field vs _source

Using Elasticsearch 1.4.3
I'm building a sort of "reporting" system, and the client can pick and choose which fields they want returned in their result.
In 90% of the cases the client will never pick all the fields, so I figured I can disable the _source field in my mapping to save space. But then I learned that
GET myIndex/myType/_search/
{
"fields": ["field1", "field2"]
...
}
does not return the fields.
So I assume I then have to use "store": true for each field. From what I read this will be faster for searches, but space-wise, will it be the same as _source or do we still save space?
The _source field stores the JSON you send to Elasticsearch and you can choose to only return certain fields if needed, which is perfect for your use case. I have never heard that the stored fields will be faster for searches. The _source field could be bigger on disk space, but if you have to store every field there is no need to use stored fields over the _source field. If you do disable the source field it will mean:
You won't be able to do partial updates.
You won't be able to re-index your data from the JSON in your Elasticsearch cluster; you'll have to re-index from the data source (which is usually a lot slower).
By default in elasticsearch, the _source (the document one indexed) is stored. This means when you search, you can get the actual document source back. Moreover, elasticsearch will automatically extract fields/objects from the _source and return them if you explicitly ask for it (as well as possibly use it in other components, like highlighting).
You can specify that a specific field is also stored. This means that the data for that field will be stored on its own. Meaning that if you ask for field1 (which is stored), elasticsearch will identify that its stored, and load it from the index instead of getting it from the _source (assuming _source is enabled).
When do you want to enable storing specific fields? Most times, you don't. Fetching the _source is fast and extracting it is fast as well. If you have very large documents, where the cost of storing the _source, or the cost of parsing the _source is high, you can explicitly map some fields to be stored instead.
Note, there is a cost of retrieving each stored field. So, for example, if you have a json with 10 fields with reasonable size, and you map all of them as stored, and ask for all of them, this means loading each one (more disk seeks), compared to just loading the _source (which is one field, possibly compressed).
The answer above comes from a thread answered by shay.banon; the whole thread is worth reading to get a good understanding of the topic.
Clinton Gormley says in the link below
https://groups.google.com/forum/#!topic/elasticsearch/j8cfbv-j73g/discussion
by default ES stores your JSON doc in the _source field, which is set to "stored"
by default, the fields in your JSON doc are set to NOT be "stored" (ie stored as a separate field)
so when ES returns your doc (search or get) it just load the _source field and returns that, ie a single disk seek
Some people think that by storing individual fields, it will be faster than loading the whole JSON doc from the _source field. What they don't realise is that each stored field requires a disk seek (10ms each seek!), and that the sum of those seeks far outweighs the cost of just sending the _source field.
In other words, it is almost always a false optimization.
Enabling _source will store the entire JSON document in the index while store will only store individual fields that are marked so. So using store might be better than using _source if you want to save disk space.
As a reference for ES 7.3, the answer becomes clearer. DO NOT try to optimize before you have strong testing reasons UNDER REALISTIC PRODUCTION CONDITIONS.
To quote from the _source field documentation:
Users often disable the _source field without thinking about the consequences, and then live to regret it. If the _source field isn't available then a number of features are not supported:
The update, update_by_query, and reindex APIs.
On the fly highlighting.
The ability to reindex from one Elasticsearch index to another, either to change mappings or analysis, or to upgrade an index to a new major version.
The ability to debug queries or aggregations by viewing the original document used at index time.
Potentially in the future, the ability to repair index corruption automatically.
TIP: If disk space is a concern, rather increase the compression level instead of disabling the _source.
Besides, stored_fields offers no obvious advantage of the kind you might have expected.
If you only want to retrieve the value of a single field or of a few fields, instead of the whole _source, then this can be achieved with source filtering.
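A source-filtered search body for the field names used in the question could be sketched as follows. The helper build_search_body is hypothetical and only assembles the JSON; sending it to a cluster is left out, and the match_all default query is an assumption for the example.

```python
import json


def build_search_body(fields, query=None):
    """Build a search body that returns only the given fields from _source."""
    return {
        # Assumption: match everything when no query is supplied.
        "query": query or {"match_all": {}},
        # Source filtering: only these fields are returned from _source.
        "_source": list(fields),
    }


body = build_search_body(["field1", "field2"])
print(json.dumps(body, indent=2))
```

This keeps _source enabled, so updates, reindexing and highlighting keep working, while each hit only carries the requested fields.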

ElasticSearch: Impact of setting a "not_analyzed" field as "store":"yes"?

Suppose I have a string field specified as not_analyzed in the mapping. If I then add "store":"yes" to the mapping, will ElasticSearch duplicate the storage? My understanding of not_analyzed fields is that they are not run through an Analyzer, indexed as is, but a client is able to match against it. So, if a field is both not_analyzed and store:yes, this could cause ElasticSearch to keep two copies of the string.
My question:
If a string field is stored as both not_analyzed and store:yes, will there be duplicate storage of the string?
I hope that's clear enough. Thanks!
You're mixing up the concept of indexed field and stored field in lucene, the library that elasticsearch is built on top of.
A field is indexed when it goes within the inverted index, the data structure that lucene uses to provide its great and fast full text search capabilities. If you want to search on a field, you do have to index it. When you index a field you can decide whether you want to index it as it is, or you want to analyze it, which means deciding a tokenizer to apply to it, which will generate a list of tokens (words) and a list of token filters that can modify the generated tokens (even add or delete some). The way you index a field affects how you can search on it. If you index a field but don't analyze it, and its text is composed of multiple words, you'll be able to find that document only searching for that exact specific text, whitespaces included.
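The difference between indexing a field as it is and analyzing it first can be shown with a toy inverted index. This is a simplified sketch, not Lucene's implementation: the analyzer is assumed to be just lowercasing plus whitespace splitting, and build_inverted_index is a hypothetical helper.

```python
from collections import defaultdict


def build_inverted_index(docs, analyze):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Analyzed: one term per word. Not analyzed: the whole text is one term.
        terms = text.lower().split() if analyze else [text]
        for term in terms:
            index[term].add(doc_id)
    return index


docs = {1: "Quick brown fox", 2: "Lazy brown dog"}

analyzed = build_inverted_index(docs, analyze=True)
not_analyzed = build_inverted_index(docs, analyze=False)

# A single word matches both documents in the analyzed index.
print(sorted(analyzed["brown"]))
# In the not-analyzed index the same word matches nothing.
print(sorted(not_analyzed.get("brown", set())))
# Only the exact original text, whitespace included, finds the document.
print(sorted(not_analyzed["Quick brown fox"]))
```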
A field is stored when you want to be able to retrieve it. Let's say Lucene provides some kind of storage too, which doesn't have anything to do with the inverted index itself.
When you search using lucene you get back a list of document ids that match. Then you can retrieve some text from their stored fields, which is what you literally show as search results. If you don't store a field you'll never be able to get it back from lucene (this is not true for elasticsearch though, as I'm going to explain below).
You can have fields that you only want to search on, and never show: indexed and not stored (default in lucene).
You can have fields that you want to search on and also retrieve: indexed and stored.
You can have fields that you don't want to search on, but do want to retrieve and show: stored and not indexed.
Therefore the two data structures are not related to each other. If you both index and store a field in lucene, its content will not be present twice in the same form. Stored fields are stored as they are, as you send them to lucene, while indexed fields might be analyzed and will be part of the inverted index, which is something else. Stored fields are made to be retrieved for a specific document (by lucene document id), while indexed fields are made to search, in such a structure that literally inverts the text having as a result each term as key, together with a list of document ids that contain it (the postings list).
When it comes to elasticsearch things change a little though. When you don't configure a field as stored in your mapping (default is store:no) you are able to retrieve it anyway by default. This happens because elasticsearch always stores in lucene the whole source document that you send to it (unless you disable this feature) within a special lucene field, called _source.
When you search using elasticsearch you get back by default the whole source field, but you can also ask for specific fields. What happens in that case is that elasticsearch checks whether those specific fields are stored or not in lucene. If they are the content will be retrieved from lucene, otherwise the _source stored field will be retrieved from lucene, parsed as json (pull parsing) and those specific fields will be extracted. In the first case it might be a little faster, but not necessarily. If your source is really big and you only want to load a couple of fields, configuring them as stored in lucene would probably make the loading process faster; on the other hand, if your _source is not that big and you want to load many fields, then it's probably better to load only one stored field (the _source), which would lead to a single disk seek, parse it etc. In most of the cases using the _source field works just fine.
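The lookup order described above can be sketched as a small function. It is a toy model under stated assumptions: stored fields are represented by a plain dict, _source by a JSON string, and fetch_fields is a hypothetical name rather than an Elasticsearch API.

```python
import json


def fetch_fields(stored_fields, source_json, wanted):
    """Return the wanted fields, preferring stored fields over parsing _source."""
    result = {}
    parsed_source = None
    for name in wanted:
        if name in stored_fields:
            # The field was mapped as stored: read it directly.
            result[name] = stored_fields[name]
            continue
        if parsed_source is None:
            # Fall back to _source, parsing it at most once per document.
            parsed_source = json.loads(source_json)
        if name in parsed_source:
            result[name] = parsed_source[name]
    return result


source_json = json.dumps({"title": "Quick brown fox", "views": 7, "body": "long text"})
stored = {"title": "Quick brown fox"}  # only "title" is mapped as stored

print(fetch_fields(stored, source_json, ["title"]))
print(fetch_fields(stored, source_json, ["title", "views"]))
```

The first call never touches the JSON, which is the case where stored fields can help with a very large _source; the second call has to parse it anyway.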
To answer your question: inverted index and lucene storage are two completely different things. You end up having two copies of the same data in lucene only if you decide to store a field (store:yes in the mapping), since elasticsearch keeps that same content within the json _source, but this doesn't have anything to do with the fact that you're indexing or analyzing the field.
