statement_mem seems to limit the node memory instead of the segment memory - greenplum

According to the Greenplum documentation, GUCs such as statement_mem and gp_vmem_protect_limit should apply at the segment level. The same should hold for a resource queue's memory allowance.
On our system we have 8 primary segments per node. So if I set the statement_mem of a query to 2GB, I would expect the query to consume (if needed) up to 2GB x 8 = 16GB of RAM. But it seems that it only uses 2GB in total per node before it starts writing to disk (that is, 2GB/8 per segment). I tried different statement_mem values and saw the same thing.
The max_statement_mem and gp_vmem_protect_limit limits are never reached. RAM usage on the nodes has been monitored with various tools (from GP Command Center to top and free, all the way to the session_level_memory_consumption view suggested by Pivotal).
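The arithmetic behind that expectation, and behind what was observed instead, can be written out as a small sketch. The function names are made up for illustration and are not Greenplum settings; the numbers are the ones from the question.

```python
def expected_per_host_mb(statement_mem_mb: int, primary_segments: int) -> int:
    """Per-host memory if statement_mem applies to each segment separately."""
    return statement_mem_mb * primary_segments


def observed_per_segment_mb(per_host_mb: int, primary_segments: int) -> float:
    """Per-segment share if the whole host only uses per_host_mb in total."""
    return per_host_mb / primary_segments


# Expectation: 2GB per segment on a host with 8 primary segments.
print(expected_per_host_mb(2048, 8))     # 16384
# Observation: 2GB used per host in total, split across 8 segments.
print(observed_per_segment_mb(2048, 8))  # 256.0
```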
EDITED FROM HERE
ADDED two documentation sources where statement_mem is defined per segment and not per host. (#Jon Roberts)
The GP best practices guide, at the beginning of page 32, clearly says that if statement_mem is 125MB and there are 8 segments on the server, each query will get 1GB allocated per server.
http://gpdb.docs.pivotal.io/4300/pdf/GPDB43BestPractices.pdf
The article at https://support.pivotal.io/hc/en-us/articles/201947018-Pivotal-Greenplum-GPDB-Memory-Configuration also appears to treat statement_mem as segment memory and not host memory. It repeatedly relates statement_mem to the memory limit of the resource queues and to gp_vmem_protect_limit (both of which are defined on a per-segment basis).
This is why I'm getting confused about how to properly manage the memory resources.
Thanks

I incorrectly stated that statement_mem is per host, and that is not the case. This link describes the memory at the segment level:
http://gpdb.docs.pivotal.io/4370/guc_config-statement_mem.html#statement_mem
With the default of "eager_free" gp_resqueue_memory_policy, memory gets re-used so the aggregate amount of memory used may look low for a particular query execution. If you change it to "auto" where the memory isn't re-used, the memory usage is more noticeable.
Run an "explain analyze" of your query and see the slices that are used. With eager_free, the memory gets re-used so you may only have a single slice wanting more memory than available such as this one:
(slice18) * Executor memory: 10399K bytes avg x 2 workers, 10399K bytes max (seg0). Work_mem: 8192K bytes max, 13088K bytes wanted.
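To scan a long EXPLAIN ANALYZE output for slices like this one, a rough sketch could pull out the "max" and "wanted" Work_mem figures and report the shortfall. The regular expression is an assumption based only on the single line quoted above; other Greenplum versions may format the line differently.

```python
import re

# Pattern inferred from the one sample line above; treat it as an assumption.
WORK_MEM_RE = re.compile(
    r"Work_mem:\s*(?P<max>\d+)K bytes max,\s*(?P<wanted>\d+)K bytes wanted"
)


def work_mem_shortfall_kb(line: str):
    """Return wanted - max in KB, or None if the line has no Work_mem figures."""
    match = WORK_MEM_RE.search(line)
    if match is None:
        return None
    return int(match.group("wanted")) - int(match.group("max"))


line = (
    "(slice18) * Executor memory: 10399K bytes avg x 2 workers, "
    "10399K bytes max (seg0). Work_mem: 8192K bytes max, 13088K bytes wanted."
)
print(work_mem_shortfall_kb(line))  # 4896
```

A positive result means the slice wanted more work memory than it was given, which is where spilling to disk is most likely.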
And for your question on how to manage the resources, most people don't change the default values. A query that spills to disk is usually an indication that the query needs to be revised or the data model needs some work.

Related

relationship between container_memory_working_set_bytes and process_resident_memory_bytes and total_rss

I'm trying to understand the relationship between container_memory_working_set_bytes, process_resident_memory_bytes, and total_rss (container_memory_rss) + file_mapped, so that I can better equip the system to alert on a possible OOM.
What I see goes against my understanding (and puzzles me), given that the container/pod runs a single process executing a compiled program written in Go.
Why is container_memory_working_set_bytes so much larger (nearly 10 times) than process_resident_memory_bytes?
The relationship between container_memory_working_set_bytes and container_memory_rss + file_mapped is also odd here. It is not what I expected after reading the following:
The total amount of anonymous and swap cache memory (it includes transparent hugepages), and it equals to the value of total_rss from memory.status file. This should not be confused with the true resident set size or the amount of physical memory used by the cgroup. rss + file_mapped will give you the resident set size of cgroup. It does not include memory that is swapped out. It does include memory from shared libraries as long as the pages from those libraries are actually in memory. It does include all stack and heap memory.
So the cgroup's total resident set size is rss + file_mapped. How can this value be less than container_memory_working_set_bytes for a container running in that cgroup?
This makes me feel that I am misreading something in these stats.
The following PromQL queries were used to build the graph:
process_resident_memory_bytes{container="sftp-downloader"}
container_memory_working_set_bytes{container="sftp-downloader"}
go_memstats_heap_alloc_bytes{container="sftp-downloader"}
container_memory_mapped_file{container="sftp-downloader"} + container_memory_rss{container="sftp-downloader"}
So the relationship seems to be this:
container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file
container_memory_usage_bytes, as its name implies, is the total memory used by the container. Because it also includes file cache (including inactive_file, which the OS can release under memory pressure), subtracting inactive_file gives container_memory_working_set_bytes.
The relationship between container_memory_rss and container_memory_working_set_bytes can be summed up with the following expression:
container_memory_usage_bytes = container_memory_cache + container_memory_rss
cache reflects data stored on disk that is currently cached in memory. It contains active + inactive file (mentioned above).
This explains why container_memory_working_set_bytes was higher: it still counts the active file cache, which container_memory_rss does not.
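The two expressions above can be checked with a small sketch. All byte counts below are hypothetical and chosen only to show the effect. The function names are illustrative, and clamping at zero is a choice made here so the result is never negative.

```python
MIB = 1024 * 1024


def usage_bytes(cache: int, rss: int) -> int:
    """container_memory_usage_bytes = container_memory_cache + container_memory_rss."""
    return cache + rss


def working_set_bytes(usage: int, total_inactive_file: int) -> int:
    """container_memory_working_set_bytes = usage - total_inactive_file (floored at 0)."""
    return max(usage - total_inactive_file, 0)


# Hypothetical readings for a container that reads a lot of files.
rss = 20 * MIB
active_file = 150 * MIB
inactive_file = 50 * MIB
mapped_file = 5 * MIB

cache = active_file + inactive_file          # cache = active + inactive file
usage = usage_bytes(cache, rss)
working_set = working_set_bytes(usage, inactive_file)

print(working_set // MIB)          # 170
print((rss + mapped_file) // MIB)  # 25
```

With these made-up numbers the working set is far larger than rss + mapped_file, because the active file cache is counted in the former but not in the latter.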
Ref #1
Ref #2
Not really an answer, but still two assorted points.
Does this help to make sense of the chart?
Here at my $dayjob, we have faced various issues with how tools external to the Go runtime count and display the memory usage of a process executing a program written in Go.
Couple that with the fact that Go's GC on Linux does not actually release freed memory pages to the kernel but merely madvise(2)s them as MADV_FREE, and a GC cycle that freed quite a hefty amount of memory does not result in any noticeable change in the "process RSS" readings taken by external tooling (usually cgroup stats).
Hence we export our own metrics, obtained by periodically calling runtime.ReadMemStats (and runtime/debug.ReadGCStats), in every major service written in Go, with the help of a simple package written specifically for that. These readings reflect the Go runtime's own view of the memory under its control.
By the way, the NextGC field of the memory stats is super useful to watch if you have memory limits set for your containers because once that reading reaches or surpasses your memory limit, the process in the container is surely doomed to be eventually shot down by the oom_killer.

Cassandra java process using more memory than its allocated max heap size (Xmx)

We have a Cassandra cluster running Apache Cassandra 3.11.4 on a set of 18 Unix hosts. Each host has 96G of RAM, and we have configured the heap size with -Xms64G -Xmx64G, but the top command (top -M) on the hosts shows actual memory utilization of ~85G on average, i.e. much higher than the allocated heap (64G).
The trend in memory usage is as follows: during startup of the Cassandra daemon, top -M shows that the process has already occupied ~75G, which is 75G - 64G = 11G more than the allocated heap size. This utilization increases over time, reaches a maximum of 85G in just 3-4 hours, and stays there, while heap utilization (~40-50%) is normal and GC activity is as usual, with minor GCs kicking in as expected.
We have confirmed that the total off-heap memory used by all keyspaces is below 2G on each host.
We are unable to trace what else is consuming the RAM in addition to the allocated heap.
Besides heap memory, Cassandra also uses off-heap memory, for example for compression metadata, bloom filters, and some other things. From the documentation (1, 2):
Compression metadata is stored off-heap and scales with data on disk. This often requires 1-3GB of off-heap RAM per terabyte of data on disk, though the exact usage varies with chunk_length_in_kb and compression ratios.
Bloom filters are stored in RAM, but are stored offheap, so operators should not consider bloom filters when selecting the maximum heap size.
You can monitor heap and off-heap memory usage via JMX, for example. (I've seen setups where bloom filters alone occupied ~40GB of RAM, but that depended heavily on the number of unique partition keys.)
Very large heaps are usually not recommended because they can cause long GC pauses, etc. It depends on the workload, of course, but you can try 31GB or lower (or just use the default settings). You also need to leave memory for the Linux file buffers so that frequently used files are cached. That is why, by default, Cassandra allocates only 1/4 of system memory to the heap.
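As a rough sketch of the numbers in this thread, the snippet below computes how much of the process footprint lies outside the heap and applies the documentation's 1-3GB-per-terabyte rule of thumb for compression metadata. The function names are illustrative, and the 2 TB of data on disk is a hypothetical figure, since the question does not give one.

```python
from typing import Tuple


def non_heap_gb(process_memory_gb: float, max_heap_gb: float) -> float:
    """Memory the process holds beyond its maximum heap size."""
    return process_memory_gb - max_heap_gb


def compression_metadata_range_gb(data_on_disk_tb: float) -> Tuple[float, float]:
    """Rule-of-thumb range from the quoted docs: 1-3GB off-heap per TB on disk."""
    return (1.0 * data_on_disk_tb, 3.0 * data_on_disk_tb)


# Figures from the question: ~85G seen in top, 64G heap.
print(non_heap_gb(85, 64))                 # 21
# Hypothetical 2 TB of data on disk per host.
print(compression_metadata_range_gb(2.0))  # (2.0, 6.0)
```

This is only an estimate of one off-heap consumer. Bloom filters and other native allocations come on top of it, so JMX remains the more reliable source.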

How to configure HAWQ memory?

Can I configure the amount of memory available to all HAWQ segment instances and the amount of memory available for each segment?
This is covered in the system requirements.
In Apache HAWQ, virtual segments are used as containers for the executors. As a result, the memory used by queries is controlled by the number of virtual segments.
You can use the GUC hawq_rm_memory_limit_perseg to control the total memory size of each host (segment instance). To control the memory size of a virtual segment, you can create your own resource queue with a specified memory size for each container (256M by default).
The virtual segment memory usage is set by hawq_rm_stmt_vseg_memory, which covers the total memory for all forked QEs. Since different query statements may reside on the same vseg, hawq_rm_stmt_vseg_memory will be shared across different queries.
hawq_rm_stmt_vseg_memory is the memory quota (size) of one virtual segment; the default value is 128mb. That is, each virtual segment gets 128mb, and one query may request many virtual segments.
#Wen Lin, you mentioned "one query may request many virtual segments", so one virtual segment is also shared by many queries, right? Then all shared queries will share the memory quota, which is 128mb by default?
#ztao, virtual segments are not shared by queries. A query requests virtual segments from the RM and returns them to the RM when it finishes.

How much memory is available for database use in memsql

I have created a MemSQL cluster on 7 machines. One of the machines shows that out of 62.86 GB only 2.83 GB is used, so I am assuming that around 60 GB of memory is available to store data.
But my top command tells another story.
Here we can see that about 21.84 GB of memory is in use and free memory is 41 GB.
So
1> Exactly how much memory is available for the database? Is it 60 GB as per the cluster URL or about 41 GB as per the top command?
Note that:
1> memsql-op is consuming around 13.5 GB of virtual memory.
2> As per 'top', if we subtract the total size of buffered and cached memory from used memory, we get 2.83 GB, which is the used memory as per the cluster URL.
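The subtraction in note 2 can be sketched as follows. The total, used, and resulting figures are the ones from the question. The split between buffers and cached memory is hypothetical, because the top output itself is not reproduced here, and the function name is illustrative.

```python
def used_excluding_cache_gb(used_gb: float, buffers_gb: float, cached_gb: float) -> float:
    """Memory in use once reclaimable buffers and page cache are subtracted."""
    return round(used_gb - buffers_gb - cached_gb, 2)


total_gb = 62.86
used_gb = 21.84
# Hypothetical split of the 19.01 GB of buffers + cache implied by the question.
buffers_gb = 0.50
cached_gb = 18.51

really_used = used_excluding_cache_gb(used_gb, buffers_gb, cached_gb)
print(really_used)                       # 2.83
print(round(total_gb - really_used, 2))  # 60.03
```

Under these assumptions the two views agree: top counts reclaimable cache as used, while the cluster page does not.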
To answer your question, you currently have about 60GB of memory free to be used by any process on your machine including the MemSQL database. Note that MemSQL has some overhead and by default reserves a small percentage of the total memory for overhead. If you visit the status page in the MemSQL Ops UI and view the "Leaf Table Memory" card, you will discover the amount of memory that can be used for data storage within the leaf nodes of your MemSQL cluster.
MemSQL Ops is written in Python which is then embedded into a "single binary" via a packaging tool. Because of this it exhibits a couple of oddities including high VM use. Note that this should not affect the amount of data you can store, as Ops is only consuming 308MB of resident memory on your machine. It should stay relatively constant based on the size of your cluster.

What is Peak Working Set in windows task manager

I'm confused about the Windows Task Manager memory overview.
In the general memory overview it shows "in use" 7.9 GB (in my sample).
I've used Process Explorer to sum up the used memory and it shows me the following:
Since this is the nearest number to the 7.9 GB in Task Manager, I guess this is the value shown there.
Now my question:
What is the Peak working set?
If I hover over the column in Task Manager, it says:
and the Microsoft help says: Maximum amount of working set memory used by the process.
Is it the memory effectively in use by all processes, or is it the maximum amount of memory that was used by all processes?
The number you refer to is "Memory used by processes, drivers and the operating system" [source].
This is an easy but somewhat vague description. A somewhat similar description would be the total amount of memory that is not free, or part of the buffer cache, or part of the standby list.
It is not the maximum memory used at some time ("peak"), it's a coincidence that you have roughly the same number there. It is the presently used amount (used by "everyone", that is all programs and the OS).
The peak working set is a different thing. The working set is the amount of memory in a process (or, if you consider several processes, in all these processes) that is currently in physical memory. The peak working set is, consequently, the maximum value so far seen.
A process may allocate more memory than it actually ever commits ("uses"), and most processes will commit more memory than they have in their working set at one time. This is perfectly normal. Pages are moved in and out of working sets (and into the standby list) to assure that the computer, which has only a finite amount of memory, always has enough reserves to satisfy any memory needs.
The memory figures in question aren't actually a reliable indicator of how much memory a process is using.
A brief explanation of each of the memory relationships:
Private Bytes are what the process is allocated, also with pagefile usage.
Working Set is the non-paged Private Bytes plus memory-mapped files.
Virtual Bytes are the Working Set plus paged Private Bytes and standby list.
In answer to your question the peak working set is the maximum amount of physical RAM that was assigned to the process in question.
~ Update ~
Available memory is defined as the sum of the standby list plus free memory. There is far more to total memory usage than the sum of all process working sets. Because of this, and because of memory sharing, this value is not generally very useful.
The virtual size of a process is the portion of a process virtual address space that has been allocated for use. There is no relationship between this and physical memory usage.
Private bytes is the portion of a process's virtual address space that has been allocated for private use. It does not include shared memory or memory used for code. There is no relationship between this value and physical memory usage either.
Working set is the amount of physical memory in use by a process. Due to memory sharing there will be some double counting in this value.
The terms mentioned above aren't really going to mean very much until you understand the basic concepts in Windows memory management. Have a look HERE for some further reading.

Resources