Elasticsearch: What if size of index is larger than available RAM? - elasticsearch

Assuming a single machine system with an in-memory indexing schema.
I am not able to find this info in the ES docs. Does ES start swapping out the overflowing data, load it when needed, and continue working, or does it give an error?

In-memory indices provide better performance at the cost of limiting the index size to the amount of available physical memory.
Via the 1.7 documentation. Memory stores are no longer available in 2.0+.
Under the hood it uses the Lucene RAMDirectory, which will just consume RAM (and eventually swap) until either you hit Java heap limits and ES crashes with out-of-memory errors, or the system gives up and oomkills the Elasticsearch process. Don't use in-memory indexes for large indexes, or for any situation where persistence is important.

Related

AWS ElasticSearch Java Process Limit

AWS documentation makes clear the following:
Java Process Limit
Amazon ES limits Java processes to a heap size of 32 GB. Advanced users can specify the percentage of the heap used for field data. For more information, see Configuring Advanced Options and JVM OutOfMemoryError.
Elastic search instance types span right up to 500GB memory - so my question (as a Java / JVM amateur) is how many Java processes does ElasticSearch run? I assume a 500GB ElasticSearch instance (r4.16xlarge.elasticsearch) is somehow going to make use of more than 32GB + any host system overhead?
Elasticsearch uses one java process (per node).
Indeed, as quoted, it is advised not to give the heap more than about 32GB, for efficiency reasons: above that threshold the JVM can no longer use compressed object pointers and falls back to full 64-bit pointers, which wastes memory and decreases performance.
Another recommendation is to leave memory for the file system cache, which Lucene uses heavily to load doc values and other index data from disk into memory.
Depending on your workload, it can be better to run multiple VMs on a single 500GB server. Use 64GB-128GB VMs, each divided between 31GB for the Elasticsearch heap and the rest for the file system cache.
Running multiple VMs on one server means that each VM is an Elasticsearch node.
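To make the arithmetic in this answer concrete, here is a minimal sketch of the split it describes. The function names are hypothetical, and the "at most half of the RAM" rule is an assumption taken from the commonly cited guideline rather than from the AWS documentation; the 31GB cap is the figure given above. It ignores host and hypervisor overhead.

```python
def plan_node_memory(vm_ram_gb: float, heap_cap_gb: float = 31.0) -> dict:
    """Split one VM's RAM between the Elasticsearch heap and the file system cache.

    Assumption: the heap gets at most half of the RAM and never more than
    heap_cap_gb, so the JVM can keep using compressed object pointers.
    """
    if vm_ram_gb <= 0:
        raise ValueError("vm_ram_gb must be positive")
    heap_gb = min(vm_ram_gb / 2, heap_cap_gb)
    return {"heap_gb": heap_gb, "fs_cache_gb": vm_ram_gb - heap_gb}


def split_server(server_ram_gb: float, vm_ram_gb: float) -> list:
    """Plan as many equally sized VMs (one Elasticsearch node each) as fit on the server."""
    vm_count = int(server_ram_gb // vm_ram_gb)
    return [plan_node_memory(vm_ram_gb) for _ in range(vm_count)]
```

For example, `split_server(500, 128)` plans three nodes, each with a 31GB heap and 97GB left for the file system cache.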

Elasticsearch config tweaking with limited memory

I have following scenario:
A single machine with 32GB of ram runs Elasticsearch 2.4, there is one index with 5 shards that is 25gb in size.
On that index we are constantly indexing new data, plus doing full-text search queries that check about 95% of the documents, with no aggregations. The instance generates a lot of CPU load, but there is no swapping.
My question is: how should I tweak elasticsearch memory usage? (I don't have an option to add another machine at this moment)
Should I assign more memory to the ES heap, such as 25GB (going over the 50% of memory that the readme advises not to exceed), or should I assign a minimal heap, such as 1GB-2GB, and assume Lucene will cache the whole index in memory, since these are full-text searches?
Right now 50% of server memory, so 16GB in this case, seems to work best for us.

How does ElasticSearch and Lucene share the memory

I have one question about the following quote from the official ES docs:
But if you give all available memory to Elasticsearch’s heap,
there won’t be any left over for Lucene.
This can seriously impact the performance of full-text search.
If my server has 80G memory, I issued the following command to start ES node: bin/elasticsearch -xmx 30g
That means I give the ES process at most 30g of memory. How can Lucene use the remaining 50G, since Lucene runs inside the ES process and is just part of it?
The Xmx parameter simply indicates how much heap you allocate to the ES Java process. But allocating RAM to the heap is not the only way to use the available memory on a server.
Lucene does indeed run inside the ES process, but Lucene doesn't only make use of the allocated heap, it also uses memory by heavily leveraging the file system cache for managing index segment files.
There are two great blog posts (this one and this other one) from Lucene's main committer that explain in greater detail how Lucene leverages all the remaining available memory.
The bottom line is to allocate 30GB heap to the ES process (using -Xmx30g) and then Lucene will happily consume whatever is left to do what needs to be done.
Lucene uses off-heap memory via the OS. This is described in the Elasticsearch guide, in the section about heap sizing and swapping.
Lucene is designed to leverage the underlying OS for caching in-memory data structures. Lucene segments are stored in individual files. Because segments are immutable, these files never change. This makes them very cache friendly, and the underlying OS will happily keep hot segments resident in memory for faster access.
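As an analogy only (this is not Lucene's code), the sketch below reads a file through a memory mapping in Python. The mapped pages are served from the OS page cache rather than copied into a heap the process manages itself, which is the same mechanism that lets a process benefit from RAM beyond its configured heap. The function names and the temporary "segment" file are hypothetical.

```python
import mmap
import os
import tempfile


def read_via_mmap(path: str, offset: int, length: int) -> bytes:
    """Read a slice of a file through a read-only memory mapping.

    The OS decides which pages of the file stay resident in its page cache;
    repeated reads of hot pages do not need to touch the disk again.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


def demo() -> bytes:
    """Write a small stand-in for an immutable segment file and read it back."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"immutable segment data " * 1000)
        path = tmp.name
    try:
        return read_via_mmap(path, 0, 9)
    finally:
        os.remove(path)
```

Because the file never changes after it is written, the cached pages never need to be invalidated, which is why immutable segments are described above as cache friendly.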

Mongodb - make inmemory or use cache

I will be creating a 5-node MongoDB cluster. It will be more read-heavy than write-heavy, and I have a question about which design would bring better performance. These nodes will be dedicated to MongoDB only. For the sake of an example, say each node will have 64GB of RAM.
The MongoDB docs state:
MongoDB automatically uses all free memory on the machine as its cache
Does this mean as long as my data is smaller than the available ram it will be like having an in-memory database?
I also read that it is possible to implement mongodb purely in memory
http://edgystuff.tumblr.com/post/49304254688/how-to-use-mongodb-as-a-pure-in-memory-db-redis
If my data was quite dynamic (it can range from 50GB to 75GB every few hours), would it theoretically perform better to let MongoDB manage itself with its cache (the default setup), or to put MongoDB into memory initially and, if the data grows beyond the size of RAM, use swap space (SSD)?
MongoDB's default storage engine maps the data files into memory. This provides an efficient way to access the data while avoiding double caching (i.e. the MongoDB cache is actually the page cache of the OS).
Does this mean as long as my data is smaller than the available ram it will be like having an in-memory database?
For read traffic, yes. For write traffic, it is different, since MongoDB may have to journal the write operation (depending on the configuration) and maintain the oplog.
Is it better to run MongoDB from memory only (leveraging tmpfs)?
For read traffic, it should not be better. Putting the files on tmpfs will also avoid double caching (which is good), but the data can still be paged out. Using a regular filesystem instead will be as fast once the data have been paged in.
For write traffic, it is faster, provided the journal and oplog are also put on tmpfs. Note that in that case, a system crash will result in total data loss. Usually, the performance gain is not worth the risk.

Does Cassandra use heap memory to store Bloom filters, and how much space do they consume for 100GB of data?

I have learned that Cassandra uses Bloom filters for performance and stores the filter data in physical memory.
1) Where does Cassandra store these filters (in heap memory)?
2) How much memory do these filters consume?
When running, the Bloom filters must be held in memory, since their whole purpose is to avoid disk IO.
However, each filter is saved to disk with the other files that make up each SSTable - see http://wiki.apache.org/cassandra/ArchitectureSSTable
The filters are typically a very small fraction of the data size, though the actual ratio seems to vary quite a bit. On the test node I have handy here, the biggest filter I can find is 3.3MB, which is for 1GB of data. For another 1.3GB data file, however, the filter is just 93KB...
If you are running Cassandra, you can check the size of your filters yourself by looking in the data directory for files named *-Filter.db
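The check described above can be scripted. This is a minimal sketch, assuming the filters are on-disk files matching `*-Filter.db` somewhere under the data directory, as the answer says; the function names are hypothetical, and the data directory path must be supplied for your installation.

```python
from pathlib import Path


def bloom_filter_sizes(data_dir: str) -> dict:
    """Map each *-Filter.db file found under data_dir to its size in bytes."""
    return {
        str(path): path.stat().st_size
        for path in Path(data_dir).rglob("*-Filter.db")
        if path.is_file()
    }


def total_bloom_filter_bytes(data_dir: str) -> int:
    """Sum the on-disk size of all Bloom filter files under data_dir."""
    return sum(bloom_filter_sizes(data_dir).values())
```

Calling `total_bloom_filter_bytes` with the path to your data directory gives a rough idea of how much memory the filters need once loaded. The on-disk size is only an approximation of the in-memory footprint.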
