What is the difference between node.disk_used vs. index.store_size? - elasticsearch

What is the difference between node.disk_used vs. the index data/store size?
How can the total index size be bigger than the disk used?

In Elasticsearch, store_size is the size of the data stored by an index, counting both primary and replica shards, while disk_used is the disk space used on the node as a whole. Thus, node.disk_used tells you how full the node's disk is, while store_size is derived from the collection of documents an index holds. A single node can also hold shards of multiple indices, and because store_size sums primaries and replicas across the whole cluster, the total index size can end up larger than the disk used on any single node. In relation to the second part of your question, this is an interesting overview of the problem you are having.
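If it helps, you can put the two numbers side by side with the _cat APIs; a rough sketch (the host and port are assumptions, adjust them to your cluster):
curl -s 'localhost:9200/_cat/nodes?v&h=name,disk.used,disk.avail,disk.total,disk.used_percent'
curl -s 'localhost:9200/_cat/indices?v&h=index,pri,rep,store.size,pri.store.size'
The first request reports filesystem usage per node (everything on that disk, not only Elasticsearch data), while the second reports store.size per index, summed over primary and replica shards across the whole cluster; pri.store.size is the primaries-only figure.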

Related

How to know the amount of heap used by an ElasticSearch shard?

I have a cluster where several indices in the warm nodes don't contain any documents. As they are read-only, they can't get any more documents, so I want to know if removing them will do any good to the heap usage of my ElasticSearch nodes.
I tried using /_cat/shards to get some heap information per shard for those indices, but I couldn't find it. I then looked at the values of the following metrics (a request along those lines is sketched below), but even for 20GiB shards the values are so small that I think I'm looking at the wrong metrics (sample values for a 33GiB shard in parentheses):
fm: fielddata.memory_size (0b)
qcm: query_cache.memory_size (0b)
sm: segments.memory (44.6mb)
siwm: segments.index_writer_memory (0b)
svmm: segments.version_map_memory (0b)
sfbm: segments.fixed_bitset_memory (0b)
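For reference, a /_cat/shards request along these lines returns the columns above (host, port, and the index pattern are placeholders):
curl -s 'localhost:9200/_cat/shards/my-warm-index-*?v&h=index,shard,prirep,store,fielddata.memory_size,query_cache.memory_size,segments.memory,segments.index_writer_memory,segments.version_map_memory,segments.fixed_bitset_memory'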
So my questions are:
Is there a way to know how much host memory is being used by each shard?
Does an empty shard consume any heap?
(The bottom line is that my warm nodes are showing high heap usage, so I'm considering removing all the shards of these empty indices, but only if it would have any benefit at all.)
Thanks.
So, after gathering some information through the comment section, here are my two cents:
First of all, I am not aware of an API that lets you inspect the heap memory per shard. Maybe xpack monitoring and/or the elasticsearch metricbeat module can do that for you.
However since you asked:
... As they are read-only, they can't get any more documents, so I want to know if removing them will do any good to the heap usage of my ElasticSearch nodes.
Elasticsearch is built on top of Lucene. A Lucene index is made up of one or more immutable index segments, each of which is essentially a "mini-index". Elasticsearch tries to keep frequently requested segments (for indexing and search requests) in heap memory in order to serve those requests quickly. If no more segments can be loaded into the heap (because the heap size has reached its limit), segments get flushed to disk and other segments get loaded from disk into the heap. This is a constant process (take a look at this blog post for reference).
As you have stated the indices that you consider "problematic" are read-only (meaning no indexing-requests can be performed) and also contain no documents (meaning no search-requests will be executed against the particular segments).
To sum it up: The "problematic" indices will most likely not be in the heap anyways.
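If you want to double-check how much heap the segments on your warm nodes actually take, a _cat/nodes request roughly like the following (valid for pre-8.x versions; host and port are assumptions) puts the total segment memory next to the heap usage per node:
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,segments.count,segments.memory'
A shard of an empty index contributes essentially nothing to segments.memory, which is in line with the conclusion above.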
Yet these indices still claim some disk space simply because they exist. Furthermore, Elasticsearch must manage their shards and has to (re-)allocate them to nodes, e.g. in the case of a shard recovery.
Q: So why are you hesitant to do it (delete the empty indices)?
A: ... I need to modify my maintenance application's code...
With Elasticsearch version 6.6 came index lifecycle management (ILM). Its aim is to automate the management/maintenance of indices by defining policies. These policies contain so-called phases (hot, warm, cold, delete) and corresponding actions (rollover, read-only, freeze, shrink, delete, ...). They can also be set up and modified through Kibana.
You may want to take a look at ILM and replace your application with it (no offense, just a hint).
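As a rough sketch of what such a policy could look like (the policy name and the 30-day threshold are placeholders, not something prescribed by ILM), a minimal delete-only policy would be:
curl -s -X PUT 'localhost:9200/_ilm/policy/delete-old-indices' -H 'Content-Type: application/json' -d '
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}'
The policy is then attached to indices (or to an index template) via the index.lifecycle.name setting.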
I hope I could help you!

Max value of number_of_routing_shards in Elasticsearch 6.x

What is the max recommended value of number_of_routing_shards for an index?
Can I specify a very high value like 30000? What are the side effects if I do so?
Shards are "slices" of an index that Elasticsearch creates so it has the flexibility to distribute the indexed data, for example among several data nodes.
At a low level, shards are independent sets of Lucene segments that work autonomously and can be queried independently. This is what makes high performance possible, because search operations can be split into independent processes.
The more shards you have, the more flexible the storage assignment for a given index becomes. This obviously has some caveats.
Distributed searches must wait for each other in order to merge the per-shard results into a consistent response. If there are many shards, the query must be sliced into more parts (which has a computing overhead). The query is distributed to each shard whose routing hashes match the current search (not all shards are necessarily hit by every query), therefore the busiest (slowest) shard will define the overall performance of your search.
It's better to have a balanced number of indices. Each index has a memory footprint that is stored in the cluster state. The more indices you have, the bigger the cluster state and the more time it takes to be shared among all cluster nodes.
The more shards an index has, the more complex it becomes; therefore the size needed to serialize it into the cluster state grows, slowing things down globally.
Setting number_of_routing_shards to 30,000 will give you an index that can be split into as many as 30,000 shards (according to https://www.elastic.co/guide/en/elasticsearch/reference/6.x/indices-split-index.html), which is ... useless.
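For illustration, in 6.x the setting is fixed at index creation time and is only consumed later by the split API; a sketch with placeholder index names and a modest value:
curl -s -X PUT 'localhost:9200/my-index' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_routing_shards": 30
  }
}'
# the source index must be made read-only before it can be split
curl -s -X PUT 'localhost:9200/my-index/_settings' -H 'Content-Type: application/json' -d '
{ "index.blocks.write": true }'
curl -s -X POST 'localhost:9200/my-index/_split/my-split-index' -H 'Content-Type: application/json' -d '
{ "settings": { "index.number_of_shards": 30 } }'
The target shard count has to divide number_of_routing_shards, so the value you pick at creation time bounds how far the index can ever be split.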
As with all software tuning, recommended values vary with your:
use case
hardware (VM / network / disk ...)
metrics

How do partition size affect read/write performances in Cassandra?

I can partition my table into a small number of bigger partitions or into many smaller partitions, but in my use case even the big partition is still small in size; it will never exceed 100MB. There will be millions of users reading from this table, so is there a risk of congestion when having so many users reading from a single partition?
I can imagine that splitting the read queries between several physical nodes is faster than reading from a single physical node, but does splitting read queries between several virtual nodes improve performance in the same way? The number of big partitions will exceed the number of physical nodes, so will spreading the data further through the virtual nodes with smaller partitions improve the read performance? Is the answer any different for updating partitions of counter tables?
So basically, what I need to know is if millions of users reading from the same partition (that is below 100MB in size) will introduce congestion. This is the answer that actually matters for my project. But I also want to know if spreading the data further (regular and counter tables), beyond the number of physical nodes through smaller partitions will increase the read/write performance.
Any reference links would be extremely appreciated since I'll be writing a report and referencing an article, journal or documentation is always preferred.
In my opinion, accessing the same partition (we are actually talking about a "row" in Cassandra 3.0) is not a problem. If the load on your cluster increases, then you just need to add more nodes; this is the no-single-point-of-failure principle. Each node in your cluster is able to fulfil the user request (depending on your replication factor and read consistency).
Also, if you know that a partition key is going to be accessed a lot, you can play with the key cache and row cache functionality of your table so that you avoid any disk access.
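As a sketch of the caching part in CQL (the keyspace, table name, and rows_per_partition value are placeholders):
ALTER TABLE my_keyspace.my_table
  WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'};
Note that the row cache only takes effect if row_cache_size_in_mb is set to a non-zero value in cassandra.yaml.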

How to determine RAM used per index in Elasticsearch?

I've created several indices through the CouchDB River plugin on Elasticsearch 1.7. I have node stats but can't determine the amount of RAM used per index. I want to use this data to see if I can get rid of indices using large amounts of RAM.
Technically speaking, the memory usage of an index basically has two parts: one that is "static" and represents the memory used by the data itself, and another that depends more or less on search usage (caches, buffers, dynamic memory structures).
You need to look at the indices stats to see this usage: https://www.elastic.co/guide/en/elasticsearch/reference/1.7/indices-stats.html
And from there, you search for the index you are interested in and you look at these sections: filter_cache, id_cache, fielddata, percolate, completion, segments (this is the "static" usage I mentioned above), query_cache.
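For example, on 1.7 you can restrict the stats call to exactly those sections (host and port are assumptions):
curl -s 'localhost:9200/_stats/fielddata,filter_cache,id_cache,percolate,completion,segments,query_cache?level=indices&human'
Summing those sections per index gives a reasonable approximation of its RAM footprint.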

Elasticsearch 2.3.2 limits on resources

Is there a list of hard limits for Elasticsearch?
I am particularly interested if there is a theoretical limit on the number of indices one can create, and the number of records an index can have.
There is a limit of roughly 2 billion docs per shard, which is a hard Lucene limit.
The actual value is Integer.MAX_VALUE - 128 which is 2147483519 documents per shard.
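If you want to keep an eye on how close your shards get to that limit, a _cat/shards request such as the following (host and port are assumptions) lists the document count and store size per shard:
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,docs,store'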
While asking about the number of documents can be a realistic question, asking about the maximum number of indices is the wrong question. There is probably a JVM limit: assuming some kind of array or ArrayList holds these indices (or their mappings, i.e. the cluster state), the limit would be the maximum size of that array, ArrayList, HashSet, Map, etc.
Way before reaching that theoretical limit, your cluster would probably be dead or not even able to start. Each shard uses resources, and the cluster state would be huge, even with only one primary shard per index. The correct question to ask would be about performance issues with your cluster, not about the maximum number of indices in a cluster.
