How to determine RAM used per index in Elasticsearch?

I've created several indices through CouchDB River plugin on Elasticsearch 1.7. I have node stats but can't determine the amount of RAM used per index. I want to use this data to see if I can get rid of indices using large amounts of RAM.

Technically speaking, the memory usage of an index basically has two parts: a "static" part that represents the memory used by the data itself, and a part that depends more or less on search usage (caches, buffers, dynamic memory structures).
You need to look at the indices stats API to see this usage: https://www.elastic.co/guide/en/elasticsearch/reference/1.7/indices-stats.html
From there, find the index you are interested in and look at these sections: filter_cache, id_cache, fielddata, percolate, completion, segments (this is the "static" usage mentioned above) and query_cache.
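A minimal sketch of pulling those numbers together per index, assuming a cluster reachable on http://localhost:9200, Java 11+ and the Jackson databind library on the classpath (the URL and the idea of summing exactly the sections listed above are illustrative, not an official recipe):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Iterator;
import java.util.Map;

public class IndexMemory {
    // Sections of the index stats that report memory usage (field names as in the 1.7 stats output).
    private static final String[][] SECTIONS = {
        {"filter_cache", "memory_size_in_bytes"},
        {"id_cache", "memory_size_in_bytes"},
        {"fielddata", "memory_size_in_bytes"},
        {"percolate", "memory_size_in_bytes"},
        {"completion", "size_in_bytes"},
        {"segments", "memory_in_bytes"},
        {"query_cache", "memory_size_in_bytes"},
    };

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("http://localhost:9200/_stats")).GET().build();
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

        // Walk every index in the stats response and sum its memory-related sections.
        JsonNode indices = new ObjectMapper().readTree(body).path("indices");
        for (Iterator<Map.Entry<String, JsonNode>> it = indices.fields(); it.hasNext(); ) {
            Map.Entry<String, JsonNode> index = it.next();
            JsonNode total = index.getValue().path("total");
            long bytes = 0;
            for (String[] section : SECTIONS) {
                bytes += total.path(section[0]).path(section[1]).asLong(0);
            }
            System.out.printf("%-30s %,d bytes%n", index.getKey(), bytes);
        }
    }
}
```

Sorting that output then shows which indices hold the most memory and are the best candidates for removal.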

Related

How to know the amount of heap used by an ElasticSearch shard?

I have a cluster where several indices in the warm nodes don't contain any documents. As they are read-only, they can't get any more documents, so I want to know if removing them will do any good to the heap usage of my ElasticSearch nodes.
I tried using /_cat/shards to get some heap information per shard of said indices, but I couldn't find it. I tried looking at the values of the following metrics, but even for 20 GiB shards the values are so small that I think I'm looking at the wrong metrics (in parentheses, the sample value for a 33 GiB shard):
fm: fielddata.memory_size (0b)
qcm: query_cache.memory_size (0b)
sm: segments.memory (44.6mb)
siwm: segments.index_writer_memory (0b)
svmm: segments.version_map_memory (0b)
sfbm: segments.fixed_bitset_memory (0b)
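(These abbreviations map to full column names that can be requested explicitly from the cat shards API; a minimal retrieval sketch, assuming a cluster on localhost:9200 and Java 11+, follows.)

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CatShardMemory {
    public static void main(String[] args) throws Exception {
        // Ask the cat shards API for exactly the memory-related columns listed above.
        String url = "http://localhost:9200/_cat/shards"
                + "?v&h=index,shard,prirep,store,"
                + "fielddata.memory_size,query_cache.memory_size,segments.memory,"
                + "segments.index_writer_memory,segments.version_map_memory,"
                + "segments.fixed_bitset_memory";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        System.out.println(client.send(request, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```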
So my questions are:
Is there a way to know how much host memory is being used by each shard?
Does an empty shard consume any heap?
(The bottom line is that my warm nodes are showing high heap usage, so I'm considering removing all the shards of these empty indices, but only if it would have any benefit at all.)
Thanks.
So after gathering some information through the comment section, here are my two cents:
First of all, I am not aware of an API that lets you inspect the heap memory per shard. Maybe X-Pack monitoring and/or the Elasticsearch Metricbeat module can do that for you.
However since you asked:
... As they are read-only, they can't get any more documents, so I want to know if removing them will do any good to the heap usage of my ElasticSearch nodes.
Elasticsearch is built on top of Lucene. A Lucene index is made up of one or more immutable index segments, each of which is essentially a "mini-index". Elasticsearch tries to keep frequently requested segments (for indexing and search requests) in heap memory in order to serve these requests quickly. If no more segments can be loaded into the heap (because the heap size has reached its limit), segments get flushed to disk and other segments get loaded from disk into the heap. This is a constant process (take a look at this blog post for reference).
As you have stated, the indices that you consider "problematic" are read-only (meaning no indexing requests can be performed) and also contain no documents (meaning no search requests will be executed against those segments).
To sum it up: The "problematic" indices will most likely not be in the heap anyways.
Yet these indices still claim some disk space simply because they exist. Furthermore, Elasticsearch must manage their shards and has to (re-)allocate them to nodes, e.g. in case of a shard recovery.
Q: So why are you hesitant to do it (deleting the empty indices)?
A: ... I need to modify my maintenance application's code...
Elasticsearch version 6.6 introduced index lifecycle management (ILM). Its aim is to automate the management/maintenance of indices by defining policies. These policies contain so-called phases (hot, warm, cold, delete) and corresponding actions (rollover, read-only, freeze, shrink, delete, ...). They can also be set up and modified through Kibana.
You may want to take a look at ILM and replace your application with it (no offense, just a hint).
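As a rough illustration of what such a policy looks like, here is a sketch that creates a hypothetical policy via the ILM REST API (the policy name, phase timings and actions are made up for the example; assumes a cluster on localhost:9200 and Java 15+ for the text block):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class IlmPolicySketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical policy: roll over while hot, make indices read-only and
        // shrink them in the warm phase, and delete them after 60 days.
        String policy = """
            {
              "policy": {
                "phases": {
                  "hot":    { "actions": { "rollover": { "max_size": "50gb", "max_age": "7d" } } },
                  "warm":   { "min_age": "7d",  "actions": { "readonly": {}, "shrink": { "number_of_shards": 1 } } },
                  "delete": { "min_age": "60d", "actions": { "delete": {} } }
                }
              }
            }
            """;
        HttpRequest request = HttpRequest
                .newBuilder(URI.create("http://localhost:9200/_ilm/policy/cleanup-empty-indices"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(policy))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

The policy is then attached to indices (or to an index template) via the index.lifecycle.name setting.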
I hope I could help you!

What is loaded in memory except inverted index in Elasticsearch which makes it so fast in search?

What is kept in memory by Elasticsearch that makes search so fast?
Are all the JSON documents themselves in memory, or are only the inverted index and the mapping kept in memory 24/7?
It is a good question, and the answer in short is:
It is not only data being stored in-memory that makes Elasticsearch searches so fast
Inverted indexes are not guaranteed to be always stored in memory. I didn't manage to find a direct proof, so I infer this from the following:
index segments may not be loaded in memory completely (see _cat/segments output parameter size.memory)
the very first advice in Tune for search speed is:
Give memory to the filesystem cache
This means that Elasticsearch also stores index data on disk in quite a smart way, so the filesystem itself helps with frequently accessed data.
One such "life-hack" is that each field in the mapping gets its own inverted index, which will be small enough to be efficiently cached by the filesystem if queried frequently (and fields you never query will just occupy disk space).
So does Elasticsearch store original JSONs in memory?
No, it stores them on disk in a special field called _source. Retrieving it is not fast, which is why scripts that access _source may be slow to execute.
Are there other data structures that make Elasticsearch fast?
Yes, for example, those ones that are used for aggregations:
doc_values, which are a column-oriented storage for exact-value fields (this feature makes Elasticsearch a little bit of a columnar DB); again, they are not kept in memory by default and get "cached" upon frequent use;
fielddata, which does a similar job but for text fields; it is actually stored in heap memory, but it is not efficient and is turned off by default.
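To make the difference concrete, here is a sketch of a hypothetical index mapping where an exact-value field relies on doc_values and a text field explicitly opts into fielddata (the index name and fields are made up; assumes a cluster on localhost:9200 and Java 15+):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AggregationFieldsSketch {
    public static void main(String[] args) throws Exception {
        // "price" relies on doc_values (the default for keyword/numeric fields),
        // while "title" opts into fielddata, which lives on the heap and is
        // disabled by default for text fields.
        String mapping = """
            {
              "mappings": {
                "properties": {
                  "price": { "type": "long", "doc_values": true },
                  "title": { "type": "text", "fielddata": true }
                }
              }
            }
            """;
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:9200/products"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(mapping))
                .build();
        System.out.println(HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```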
What else does Elasticsearch do to speed up the search?
It uses more caching: the shard request cache and the node query cache. As you can see, it is not as simple as "just put data in memory".
Hope that helps!

What is the difference between node.disk_used vs. index.store_size?

What is the difference between node.disk_used vs. index data/store size?
How can index size total be bigger than disk used?
In Elasticsearch, store_size is the on-disk size of an index summed over its primary and replica shards, while disk_used is the disk space used on a node. So node.disk_used represents everything stored on that node's disk (not just Elasticsearch data), whereas store_size counts every shard copy of an index across the whole cluster. A node can hold shards of many indices, and the shards of one index can be spread over several nodes. That is also how a total index size can end up bigger than the disk used on one node: replicas are counted in the total, and the copies live on more than one machine.
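A small sketch that puts the two views side by side, assuming a cluster on localhost:9200 and Java 11+: _cat/allocation shows per-node disk usage, _cat/indices shows per-index store sizes:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DiskVsStoreSize {
    static String get(HttpClient client, String url) throws Exception {
        return client.send(HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Per node: disk used by everything on the filesystem vs. disk used by shards on that node.
        System.out.println(get(client,
                "http://localhost:9200/_cat/allocation?v&h=node,disk.indices,disk.used,disk.avail,disk.total"));
        // Per index: total store size (all shard copies, cluster-wide) vs. primaries only.
        System.out.println(get(client,
                "http://localhost:9200/_cat/indices?v&h=index,pri,rep,store.size,pri.store.size"));
    }
}
```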

What are the drawbacks of using a Lucene directory as a primary file store?

I want to use a Lucene MMapDirectory as a primary file store. Each file would be stored in a separate document as a byte array in a StoredField. All file properties that should be searchable, like file name, size etc., would be stored in indexable fields in the same document.
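For concreteness, a minimal sketch of the approach described above (the file body in a StoredField next to indexable metadata), using the Lucene Java API; the index path and field names are illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.MMapDirectory;

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LuceneFileStoreSketch {
    public static void main(String[] args) throws Exception {
        Path indexPath = Paths.get("/tmp/lucene-file-store"); // illustrative location
        Path file = Paths.get(args[0]);                       // the file to store

        try (MMapDirectory dir = new MMapDirectory(indexPath);
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

            byte[] content = Files.readAllBytes(file);

            Document doc = new Document();
            // searchable/indexed metadata
            doc.add(new StringField("name", file.getFileName().toString(), Field.Store.YES));
            doc.add(new LongPoint("size", content.length));
            doc.add(new StoredField("size", content.length));
            // the file body itself: stored, but not indexed
            doc.add(new StoredField("content", content));

            writer.addDocument(doc);
            writer.commit();
        }
    }
}
```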
My questions would be:
What are the drawbacks of using Lucene directories for storing files, especially with regards to indexing and search performance and memory (RAM) consumption?
If this is not a "no-go", is there a better/faster way of storing files in the directory than as a byte array?
Short Answer
I really love Lucene and consider it to be the best open-source library, but I'm afraid it's not a good decision to use it as a primary file store due to:
high CPU/memory overhead
slow indexing/query performance
high HDD utilization and doubled index size
weak recovery capabilities
Long Answer
Under the hood, Lucene uses the following files to keep all stored fields in one segment:
the fields index file (.fdx),
the fields data file (.fdt).
You can read more about how it works in Lucene50StoredFieldsFormat’s docs.
This means that in case of any I/O issue it is almost impossible to restore an individual file.
In order to return one file, Lucene has to read and decompress binary data from disk block by block. This means high CPU overhead for decompression and a high memory footprint to keep the whole file in Java heap space. Streaming is also not available, compared to file and network storage.
The maximum document size is limited by the codec implementation: 2 GB per document.
Lucene has a unique write-once segmented architecture: recently indexed documents are written to a new self-contained segment, in append-only, write-once fashion: once written, those segment files will never again change. This happens either when too much RAM is being used to hold recently indexed documents, or when you ask Lucene to refresh your searcher so you can search all recently indexed documents. Over time, smaller segments are merged away into bigger segments, and the index has a logarithmic "staircase" structure of active segment files at any time. This architecture becomes a big problem for file storage:
you cannot delete a file, only mark it as deleted
a merge operation requires 2x the disk space and consumes a lot of resources and disk throughput: it creates a new .fdt file and copies the content of the other .fdt files through Java code and Java heap memory
So you won't be using MMapDirectory as such, but an actual Lucene index.
I have had good experiences using Lucene as the primary data store for some projects.
Just be sure to also include a generated/natural unique ID, because the document IDs are not constant or reliable.
Also make sure you use a Directory implementation that fits your use case. I have switched to the normal RandomAccess implementation in the low-load case, since it uses less memory and is almost as fast.
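For completeness, a sketch of reading a file back out of such an index (Lucene 8/9-style API; the query term and field names match the hypothetical write sketch above). FSDirectory.open picks a platform-appropriate implementation, which is one way to experiment with alternatives to MMapDirectory:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class LuceneFileReadSketch {
    public static void main(String[] args) throws Exception {
        // FSDirectory.open picks MMapDirectory on most 64-bit systems, NIOFSDirectory otherwise.
        try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/lucene-file-store"));
             DirectoryReader reader = DirectoryReader.open(dir)) {

            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("name", "report.pdf")), 1);

            if (hits.totalHits.value > 0) {
                Document doc = searcher.doc(hits.scoreDocs[0].doc);
                BytesRef ref = doc.getBinaryValue("content");
                byte[] content = Arrays.copyOfRange(ref.bytes, ref.offset, ref.offset + ref.length);
                Files.write(Paths.get("report.pdf"), content); // write the recovered file back out
            }
        }
    }
}
```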

Does Cassandra use heap memory to store Bloom filters, and how much space do they consume for 100 GB of data?

I have come to know that Cassandra uses Bloom filters for performance, and that it stores the filter data in physical memory.
1) Where does Cassandra store these filters? (in heap memory?)
2) How much memory do these filters consume?
When running, the Bloom filters must be held in memory, since their whole purpose is to avoid disk IO.
However, each filter is saved to disk with the other files that make up each SSTable - see http://wiki.apache.org/cassandra/ArchitectureSSTable
The filters are typically a very small fraction of the data size, though the actual ratio seems to vary quite a bit. On the test node I have handy here, the biggest filter I can find is 3.3MB, which is for 1GB of data. For another 1.3GB data file, however, the filter is just 93KB...
If you are running Cassandra, you can check the size of your filters yourself by looking in the data directory for files named *-Filter.db
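A quick sketch of doing exactly that from code, assuming the typical /var/lib/cassandra/data data directory (pass your own path as the first argument otherwise):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class BloomFilterSizes {
    public static void main(String[] args) throws IOException {
        // Typical package-install data directory; override via the first argument.
        Path dataDir = Paths.get(args.length > 0 ? args[0] : "/var/lib/cassandra/data");

        try (Stream<Path> paths = Files.walk(dataDir)) {
            long total = paths
                    .filter(p -> p.getFileName().toString().endsWith("-Filter.db"))
                    .peek(p -> System.out.printf("%,12d  %s%n", p.toFile().length(), p))
                    .mapToLong(p -> p.toFile().length())
                    .sum();
            System.out.printf("Total Bloom filter size on disk: %,d bytes%n", total);
        }
    }
}
```

Depending on the Cassandra version, nodetool cfstats also reports a per-table "Bloom filter space used" figure.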
