AWS ElasticSearch Java Process Limit

AWS documentation makes clear the following:
Java Process Limit
Amazon ES limits Java processes to a heap size of 32 GB. Advanced users can specify the percentage of the heap used for field data. For more information, see Configuring Advanced Options and JVM OutOfMemoryError.
Elasticsearch instance types span right up to 500GB of memory - so my question (as a Java / JVM amateur) is: how many Java processes does Elasticsearch run? I assume a 500GB Elasticsearch instance (r4.16xlarge.elasticsearch) is somehow going to make use of more than 32GB plus any host system overhead?

Elasticsearch uses one Java process per node.
Indeed, as quoted, it is advised not to go over 32GB of heap for performance reasons (beyond that point the JVM must use uncompressed 64-bit object pointers, which decreases performance).
Another recommendation is to leave memory for the file system cache, which Lucene uses heavily to load doc values and other index data from disk into memory.
Depending on your workload, it can be better to run multiple VMs on a single 500GB server: for example, 64GB-128GB VMs, each split between ~31GB for the Elasticsearch heap and the rest for the file system cache.
Multiple VMs on a server means that each VM is an Elasticsearch node.
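For a self-managed node (the AWS-managed service sets the heap for you and doesn't expose these settings), here is a minimal sketch of what that split looks like; the 31g figure and the paths are illustrative:

    # config/jvm.options (Elasticsearch 5.x+): keep the heap at or below ~31GB so the
    # JVM can keep using compressed object pointers, and set min = max to avoid resize pauses
    -Xms31g
    -Xmx31g

    # or, for a one-off start, via the environment:
    export ES_JAVA_OPTS="-Xms31g -Xmx31g"
    ./bin/elasticsearch

The rest of the VM's RAM is deliberately left unallocated so the OS file system cache can keep the hot Lucene segment files in memory.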

Related

How to limit memory usage of Elasticsearch in Ubuntu 17.10?

My Elasticsearch service is consuming around 1GB.
My total memory is 2GB. The Elasticsearch service keeps getting shut down. I guess the reason is the high memory consumption. How can I limit the usage to just 512MB?
Compared to the memory usage before starting Elasticsearch, memory consumption jumps after running sudo service elasticsearch start.
I appreciate any help! Thanks!
From the official doc
The default installation of Elasticsearch is configured with a 1 GB heap. For just about every deployment, this number is usually too small. If you are using the default heap values, your cluster is probably configured incorrectly.
So you can change it like this
There are two ways to change the heap size in Elasticsearch. The easiest is to set an environment variable called ES_HEAP_SIZE. When the server process starts, it will read this environment variable and set the heap accordingly. As an example, you can set it via the command line as follows: export ES_HEAP_SIZE=512m
But it's not recommended. You just can't run Elasticsearch in an optimal way with so little RAM available.
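For reference, a minimal sketch of capping the heap at 512MB anyway, assuming the Debian/Ubuntu package layout (the paths are illustrative, and 5.x+ packages ignore ES_HEAP_SIZE in favour of jvm.options):

    # Older releases (1.x/2.x) read an environment variable:
    export ES_HEAP_SIZE=512m

    # 5.x+ packages read /etc/elasticsearch/jvm.options instead; set both values there:
    #   -Xms512m
    #   -Xmx512m

    sudo service elasticsearch restart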

Memory Management in H2O

I am curious to know how memory is managed in H2O.
Is it completely 'in-memory', or does it allow swapping in case memory consumption goes beyond the available physical memory? Can I set the -mapperXmx parameter to 350GB if I have a total of 384GB of RAM on a node? I do realise that the node won't be able to handle anything other than the H2O cluster in this case.
Any pointers are much appreciated, Thanks.
H2O-3 stores data completely in memory, in a distributed, column-compressed key-value store.
No swapping to disk is supported.
Since you are alluding to mapperXmx, I assume you are talking about running H2O in a YARN environment. In that case, the total YARN container size allocated per node is:
mapreduce.map.memory.mb = mapperXmx * (1 + extramempercent/100)
extramempercent is another (rarely used) command-line parameter to h2odriver.jar. Note the default extramempercent is 10 (percent).
mapperXmx is the size of the Java heap, and the extra memory referred to above is for additional overhead of the JVM implementation itself (e.g. the C/C++ heap).
YARN is extremely picky about this, and if your container tries to use even one byte over its allocation (mapreduce.map.memory.mb), YARN will immediately terminate the container. (And for H2O-3, since it's an in-memory processing engine, the loss of one container terminates the entire job.)
You can set mapperXmx and extramempercent to as large a value as YARN has room to start containers for.
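As a rough worked example against the 384GB node in the question (the sizes and output path are illustrative, and your YARN container limits must also allow an allocation this large): with the default extramempercent of 10, -mapperXmx 350g would ask YARN for 350 * 1.10 = 385GB per container, which is more than the node has, so a somewhat smaller heap is needed:

    # hypothetical launch; -nodes, -mapperXmx, -extramempercent and -output are standard
    # h2odriver flags, but the sizes and the HDFS output path are only an example
    hadoop jar h2odriver.jar \
        -nodes 1 \
        -mapperXmx 320g \
        -extramempercent 10 \
        -output /tmp/h2o_out    # 320 * 1.10 = 352GB container request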

MemSQL performance issues

I have a single node MemSQL install with one master aggregator and two leaves (all on a single box). The machine has 2 cores, 16GB RAM, and the MemSQL columnstore data is ~7GB (coming from a 21GB CSV). When running queries on the data, memory usage caps at ~2150MB (11GB sitting free). I've configured both leaves to have maximum_memory = 7000 in the memsql.cnf files for both nodes (memsql-optimize does similar). During query execution, the master aggregator sits at 100% CPU, with the leaves at 0-8% CPU.
This does not seem like an efficient use of system resources, but I'm not sure what I can do to configure the system or MemSQL to make more efficient use of CPU or memory. Any help would be greatly appreciated!
If during query execution your machine is at 100% CPU (on all cores), it doesn't really matter which MemSQL node it is; your workload throughput is still bottlenecked on CPU. However, for most queries you wouldn't expect most of the CPU use to be on the aggregator, so you may want to take a look at the EXPLAIN or PROFILE output of your queries (sketched below).
Columnstore data is cached in memory as part of the OS file cache - it isn't counted as memory reserved by MemSQL, which is why your memory usage is less than the size of the columnstore data.
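As mentioned above, here is a sketch of profiling a query through any MySQL-compatible client (MemSQL speaks the MySQL protocol; the host, credentials, database, and query below are purely illustrative):

    # SHOW PROFILE reports per-operator statistics for the preceding PROFILE-d statement
    mysql -h 127.0.0.1 -P 3306 -u root -p -D mydb -e "
      EXPLAIN SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id;
      PROFILE SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id;
      SHOW PROFILE;"

If the plan shows most of the work landing on the aggregator (for example a large re-aggregation or sort there), that would explain 100% aggregator CPU with near-idle leaves.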
My database was coming from somewhere other than the current MemSQL install (perhaps an older cluster configuration), despite there being only a single MemSQL cluster on the machine. The Databases section of the Web UI displayed no databases/tables, yet my queries succeeded with the expected answers.
Dropping the database and reloading from CSV remedied the situation. All cores are now used during query execution.

Elasticsearch: What if size of index is larger than available RAM?

Assuming a single-machine system with an in-memory indexing schema.
I am not able to find this info in the ES docs. Does ES start swapping out the overflowing data, loading it when needed, and continue working, or does it give an error?
In-memory indices provide better performance at the cost of limiting the index size to the amount of available physical memory.
Via the 1.7 documentation. Memory stores are no longer available in 2.0+.
Under the hood it uses the Lucene RAMDirectory, which will just consume RAM (and eventually swap) until either you hit Java heap limits and ES crashes with out-of-memory errors, or the system gives up and OOM-kills the Elasticsearch process. Don't use in-memory indexes for large indexes, or for any situation where persistence is important.
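For completeness, a hedged sketch of how such an index was declared on 1.x (the index name is illustrative; this store type was removed in 2.0, so it won't work on current versions):

    # Elasticsearch 1.x only -- the "memory" store type no longer exists in 2.0+
    curl -XPUT 'http://localhost:9200/my_index' -d '{
      "settings": { "index.store.type": "memory" }
    }'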

How do Elasticsearch and Lucene share memory?

I have one question about the following quote from the ES official doc:
But if you give all available memory to Elasticsearch’s heap,
there won’t be any left over for Lucene.
This can seriously impact the performance of full-text search.
If my server has 80GB of memory and I issued the following command to start the ES node: bin/elasticsearch -xmx 30g
That means I only give the ES process a maximum of 30GB of memory. How can Lucene use the remaining 50GB? Lucene runs inside the ES process; it's just part of that process.
The Xmx parameter simply indicates how much heap you allocate to the ES Java process. But allocating RAM to the heap is not the only way to use the available memory on a server.
Lucene does indeed run inside the ES process, but Lucene doesn't only make use of the allocated heap, it also uses memory by heavily leveraging the file system cache for managing index segment files.
There were two great blog posts (this one and this other one) from Lucene's main committer which explain in greater detail how Lucene leverages all the available remaining memory.
The bottom line is to allocate 30GB heap to the ES process (using -Xmx30g) and then Lucene will happily consume whatever is left to do what needs to be done.
Lucene uses the off heap memory via the OS. It is described in the Elasticsearch guide in the section about Heap sizing and swapping.
Lucene is designed to leverage the underlying OS for caching in-memory data structures. Lucene segments are stored in individual files. Because segments are immutable, these files never change. This makes them very cache friendly, and the underlying OS will happily keep hot segments resident in memory for faster access.
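Putting the two answers together, a minimal sketch for the 80GB server in the question (the 30GB figure is illustrative, and ES_JAVA_OPTS / bootstrap.memory_lock assume a 5.x+ install; older releases used ES_HEAP_SIZE and bootstrap.mlockall):

    # ~30GB heap for Elasticsearch, leaving roughly 50GB for the OS file system cache
    # that Lucene relies on for its segment files
    export ES_JAVA_OPTS="-Xms30g -Xmx30g"

    # optionally, in elasticsearch.yml, keep the heap from being swapped out
    # (requires an adequate memlock ulimit):
    #   bootstrap.memory_lock: true

    ./bin/elasticsearch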
