Cassandra java process using more memory than its allocated max heap size (Xmx) - memory-management

We have our cassandra cluster which runs Apache Cassandra 3.11.4 in set of unix hosts (18). each of these host has 96G of RAM and we have configured heap size to -Xms=64G -Xmx=64G but top command (top -M) on hosts shows the actual memory utilization is ~85G on average i.e. much higher than allocated heap (64G).
the trends of memory usage are like, during startup of cassandra daemon, top -M show the process has already occupied ~75G which (75G-64G)=9G more than allocated heap size, and this memory utilization increases over time and reaches to max 85G in just 3-4 hours and remains at that stage throughout the time, while the heap utilization (~40-50%) is normal, GS activities are usual, minor GC kicks in as usual.
have confirmed that the total off-heap memory utilized by all the keyspaces are below 2G on each hosts.
We are unable to trace what else is consuming the RAM in addition to the allocated heap.

Besides the heap memory, Cassandra uses also the off-heap memory, for example for keeping compression metadata, bloom filters, and some other things. From documentation (1, 2):
Compression metadata is stored off-heap and scales with data on disk. This often requires 1-3GB of off-heap RAM per terabyte of data on disk, though the exact usage varies with chunk_length_in_kb and compression ratios.
Bloom filters are stored in RAM, but are stored offheap, so operators should not consider bloom filters when selecting the maximum heap size.
You can monitor heap & offheap memory usage using the JMX, for example. (I've seen setups, where bloom filter alone occupied ~40Gb of RAM, but it was heavily dependent on the number of the unique partition keys)
Too big heaps are usually not recommended because they can use long pauses, etc. It of course depends on the workload, but you can try 31Gb or lower (or just use default settings). Plus, you need to leave the memory for Linux file buffers so it will cache often used files. That is the reason why by default Cassandra allocates only 1/4th of system memory for heap.

Related

why elasticsearch cluster's heap usage so mutch?

why my elasticsearch cluster's heap usage so mutch. while there just a littile request?
Elasticsearch uses memory for few different purposes, all in order to provide faster search times and high indexing throughput.
The heap usage you pasted doesn't look high. Elasticsearch runs its Java process in the JVM. The JVM manages the heap and clears memory through garbage collection. looking at heap usage graph you will usually see the usage increasing constantly, and then sudden drops - which happen when the GC (Garbage collector) releases the unused memory.
Do you have a specific issue with memory? it looks ok - ranging between 40%-70% is perfectly normal.

statement_mem seems to limit the node memory instead of the segment memory

According to the GreenPlum documentation, GUCs such as statement_mem, gp_vmem_protect_limit should work at segment level. Same thing should happen with a resource queue memory allowance.
On our system we have 8 primary segments per node. So if I set the statement_mem of a query to 2GB I would expect the query to consume (if needed) up to 2GB x 8 = 16GBs of RAM. But it seems that it would only use 2GBs total per node before starting to write into disk (that's it 2GB/8 per segment). I tried with different statement_values and same thing.
max_statement_mem or gp_vmem_protect_limit limits are never reached. RAM usage on nodes have been monitored using various tools (from GP command center to top, free, all the way across Pivotal suggested session_level_memory_consumption view).
EDITED FROM HERE
ADDED two documentation sources where statement_mem is defined per segment and not per host. (#Jon Roberts)
On the GP best practices guide, beginning of page 32, it clearly says that if the statement_mem is 125MB and we have 8 segments on the server, each query will get 1GB allocated per server.
https://www.google.es/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0ahUKEwi6sOTx8O3KAhVBKg4KHTwICX0QFggmMAE&url=http%3A%2F%2Fgpdb.docs.pivotal.io%2F4300%2Fpdf%2FGPDB43BestPractices.pdf&usg=AFQjCNGkTqa6143fvJUztYISWAiVyj62dA&sig2=D2ZcJwLDqN0qBzU73NjXNg&bvm=bv.113943164,d.ZWU&cad=rja
On the https://support.pivotal.io/hc/en-us/articles/201947018-Pivotal-Greenplum-GPDB-Memory-Configuration it seems to use statement_mem as segment memory and not host memory. It keeps interrelating statement_mem with the memory limit of the resource queues as well as with the gp_vmem_protect_limit (both parameters defined per segment basis).
This is why I'm getting confused about how to properly manage the memory resources.
Thanks
I incorrectly stated that statement_mem is on a per host and that is not the case. This link is talking about the memory on a segment level:
http://gpdb.docs.pivotal.io/4370/guc_config-statement_mem.html#statement_mem
With the default of "eager_free" gp_resqueue_memory_policy, memory gets re-used so the aggregate amount of memory used may look low for a particular query execution. If you change it to "auto" where the memory isn't re-used, the memory usage is more noticeable.
Run an "explain analyze" of your query and see the slices that are used. With eager_free, the memory gets re-used so you may only have a single slice wanting more memory than available such as this one:
(slice18) * Executor memory: 10399K bytes avg x 2 workers, 10399K bytes max (seg0). Work_mem: 8192K bytes max, 13088K bytes wanted.
And for your question on how to manage the resources, most people don't change the default values. A query that spills to disk is usually an indication that the query needs to be revised or the data model needs some work.

How much virtual memory does a 30.5Gb heap(256 Gb memory in total) for Elasticsearch support?

Assume I have a machine with 256gb memory and 12TB SSD. Indexed document size is 100TB. I assign 30.5 GB to Elasticsearch heap. The remaining is for Lucene and OS.
My question is, how much virtual memory does Elasticsearch support? To put it in another way, how many indexed documents can I put into the virtual memory for each machine?
Thanks
The amount of virtual memory ES can use is defined by the value of the vm.max_map_count setting in /etc/sysctl.conf. By default it is set at 262144, but you can change this value using:
sysctl -w vm.max_map_count=262144
From the linux documentation:
This file contains the maximum number of memory map areas a process
may have. Memory map areas are used as a side-effect of calling
malloc, directly by mmap and mprotect, and also when loading shared
libraries.
While most applications need less than a thousand maps, certain
programs, particularly malloc debuggers, may consume lots of them,
e.g., up to one or two maps per allocation.
The default value is 65536.
So this setting doesn't impose a specific size available to ES/Lucene, but a number of discrete memory areas that a given process can use. How much memory is used exactly will depend on the size of the memory chunks being allocated by ES/Lucene. By default, Lucene uses
1<<30 = 1,073,741,824 ~= 1GB chunks on a 64 bit JRE and
1<<28 = 268,435,456 ~= 256MB chunks on 32 bit JRE
So if you do the math, the default value of vm.max_map_count is probably good enough for your case, if not you can tune it and monitor your virtual memory usage.

How much memory is available for database use in memsql

I have created memsql cluster on 7 machines. One of the machine shows that out of 62.86 GB only 2.83 is used. So here I am assuming that around 60 GB
memory is available to store data.
But my top command tell another story
Here we can see that about 21.84 GB memory is getting used and free memory is 41 GB.
So
1> How much exact memory is available for database? Is it 60 Gb as per cluster URL or 42 Gb as per top command
Note that:
1>memsql-op is consuming aroung 13.5 g virtual memory.
2> as per 'top' if we subtract buffered and cached memory's total size from used memory, then it comes to 2.83GB which is used memory as per cluster URL
To answer your question, you currently have about 60GB of memory free to be used by any process on your machine including the MemSQL database. Note that MemSQL has some overhead and by default reserves a small percentage of the total memory for overhead. If you visit the status page in the MemSQL Ops UI and view the "Leaf Table Memory" card, you will discover the amount of memory that can be used for data storage within the leaf nodes of your MemSQL cluster.
MemSQL Ops is written in Python which is then embedded into a "single binary" via a packaging tool. Because of this it exhibits a couple of oddities including high VM use. Note that this should not affect the amount of data you can store, as Ops is only consuming 308MB of resident memory on your machine. It should stay relatively constant based on the size of your cluster.

why the memory fragmentation is less than 1 in Redis

Redis support 3 memory allocator: libc, jemalloc, tcmalloc. When i do memory usage test, i find that mem_fragmentation_ratio in INFO MEMORY could be less than 1 with libc allocator. With jemalloc or tcmalloc, this value is greater or equal than 1 as it should be.
Could anyone explain why mem_fragmentation_ratio is less than 1 with libc?
Redis version:2.6.12. CentOS 6
Update:
I forgot to mention that one possible reason is that swap happens and mem_fragmentation_ratio will be < 1.
But when i do my test, i adjust swapiness, even turn swap off. The result is the same. And my redis instance actually do not cost too much memory.
Generally, you will have less fragmentation with jemalloc or tcmalloc, than with libc malloc. This is due to 4 factors:
more granular allocation classes for jemalloc and tcmalloc. It reduces internal fragmentation, especially when Redis has to allocate a lot of very small objects.
better algorithms and data structures to prevent external fragmentation (especially for jemalloc). Obviously, the gain depends on your long term memory allocation patterns.
support of "malloc size". Some allocators offer an API to return the size of allocated memory. With glibc (Linux), malloc does not have this capability, so it is emulated by explicitly adding an extra prefix to each allocated memory block. It increases internal fragmentation. With jemalloc and tcmalloc (or with the BSD libc malloc), there is no such
overhead.
jemalloc (and tcmalloc with some setting changes) can be more aggressive than glibc to release memory to the OS - but again, it depends on the allocation patterns.
Now, how is it possible to get inconsistent values for mem_fragmentation_ratio?
As stated in the INFO documentation, the mem_fragmentation_ratio value is calculated as the ratio between memory resident set size of the process (RSS, measured by the OS), and the total number of bytes allocated by Redis using the allocator.
Now, if more memory is allocated with libc (compared to jemalloc,tcmalloc), or if more memory is used by some other processes on your system during your benchmarks, Redis memory may be swapped out by the OS. It will reduce the RSS (since a part of Redis memory will not be in main memory anymore). The resulting fragmentation ratio will be less than 1.
In other words, this ratio is only relevant if you are sure Redis memory has not been swapped out by the OS (if it is not the case, you will have performance issues anyway).
Other than swap, I know 2 ways to make "memory fragmentation ratio" to be less than 1:
Have a redis instance with little or no data, but thousands of idling client connections. From my testing, it looks like redis will have to allocate about 20 KB of memory for each client connections, but most of it won't actually be used (i.e. won't appear in RSS) until later.
Have a master-slave setup with let's say 8 GB of repl-backlog-size. The 8 GB will be allocated as soon as the replication starts (on master only for version <4.0, on both master and slave otherwise), but the memory will only be used as we start writing to the master. So the ratio will be way below 1 initially, and then get closer and closer to 1 as the replication backlog get filled.

Resources