relationship between container_memory_working_set_bytes and process_resident_memory_bytes and total_rss - go

I'm trying to understand the relationship between container_memory_working_set_bytes, process_resident_memory_bytes, and total_rss (container_memory_rss) + file_mapped, so as to be better equipped to alert on a possible OOM.
The numbers go against my understanding (which is puzzling me right now), given that the container/pod is running a single process executing a compiled Go program.
Why is container_memory_working_set_bytes so much bigger (nearly 10 times) than process_resident_memory_bytes?
Also, the relationship between container_memory_working_set_bytes and container_memory_rss + file_mapped is strange here, something I did not expect after reading this:
The total amount of anonymous and swap cache memory (it includes transparent hugepages), and it equals the value of total_rss from the memory.stat file. This should not be confused with the true resident set size or the amount of physical memory used by the cgroup. rss + file_mapped will give you the resident set size of the cgroup. It does not include memory that is swapped out. It does include memory from shared libraries as long as the pages from those libraries are actually in memory. It does include all stack and heap memory.
So the cgroup's total resident set size is rss + file_mapped; how can this value be less than container_memory_working_set_bytes for a container running in the given cgroup?
Which makes me feel that something about my understanding of these stats is not correct.
Following are the PromQL queries used to build the above graph:
process_resident_memory_bytes{container="sftp-downloader"}
container_memory_working_set_bytes{container="sftp-downloader"}
go_memstats_heap_alloc_bytes{container="sftp-downloader"}
container_memory_mapped_file{container="sftp-downloader"} + container_memory_rss{container="sftp-downloader"}

So the relationship seems to be like this:
container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file
container_memory_usage_bytes, as its name implies, is the total memory used by the container. Since it also includes file cache (i.e. inactive_file, which the OS can release under memory pressure), subtracting inactive_file gives container_memory_working_set_bytes.
The relationship between container_memory_rss and container_memory_working_set_bytes can be summed up with the following expression:
container_memory_usage_bytes = container_memory_cache + container_memory_rss
cache reflects data stored on disk that is currently cached in memory; it contains active_file + inactive_file (mentioned above).
This explains why container_memory_working_set_bytes was higher.
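As a sanity check outside Prometheus, here is a minimal Go sketch (assuming cgroup v1 mounted at /sys/fs/cgroup/memory inside the container; paths and field names differ under cgroup v2) that recomputes the working set as described above, i.e. usage minus total_inactive_file, floored at zero:
package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

// readUint reads a single decimal number from a cgroup file.
func readUint(path string) uint64 {
    b, err := os.ReadFile(path)
    if err != nil {
        panic(err)
    }
    v, err := strconv.ParseUint(strings.TrimSpace(string(b)), 10, 64)
    if err != nil {
        panic(err)
    }
    return v
}

func main() {
    usage := readUint("/sys/fs/cgroup/memory/memory.usage_in_bytes")

    // Pull total_inactive_file out of memory.stat.
    f, err := os.Open("/sys/fs/cgroup/memory/memory.stat")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    var inactiveFile uint64
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        fields := strings.Fields(sc.Text())
        if len(fields) == 2 && fields[0] == "total_inactive_file" {
            inactiveFile, _ = strconv.ParseUint(fields[1], 10, 64)
        }
    }

    // working_set = usage - inactive_file, never below zero.
    var workingSet uint64
    if inactiveFile < usage {
        workingSet = usage - inactiveFile
    }
    fmt.Printf("usage=%d inactive_file=%d working_set=%d\n", usage, inactiveFile, workingSet)
}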

Not really an answer, but still two assorted points.
Does this help to make sense of the chart?
Here at my $dayjob we have faced various issues with how tools external to the Go runtime count and display the memory usage of a process executing a Go program.
Coupled with the fact that Go's GC on Linux does not actually release freed memory pages to the kernel but merely madvise(2)s it that such pages are MADV_FREE, a GC cycle which has freed quite a hefty amount of memory does not result in any noticeable change in the "process RSS" readings taken by external tooling (usually cgroups stats).
Hence we're exporting our own metrics, obtained by periodically calling runtime.ReadMemStats (and runtime/debug.ReadGCStats), in any major service written in Go, with the help of a simple package written specifically for that. These readings reflect the Go runtime's true idea of the memory under its control.
By the way, the NextGC field of the memory stats is super useful to watch if you have memory limits set for your containers, because once that reading reaches or surpasses your memory limit, the process in the container is surely doomed to eventually be shot down by the oom_killer.
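A minimal sketch of that approach, assuming a made-up 512 MiB container limit and plain log output instead of a real metrics exporter:
package main

import (
    "log"
    "runtime"
    "time"
)

// watchMemory periodically samples runtime.MemStats and warns once NextGC
// reaches the configured container memory limit.
func watchMemory(limitBytes uint64, interval time.Duration) {
    var ms runtime.MemStats
    for range time.Tick(interval) {
        runtime.ReadMemStats(&ms)
        log.Printf("heap_alloc=%d heap_sys=%d next_gc=%d", ms.HeapAlloc, ms.HeapSys, ms.NextGC)
        if ms.NextGC >= limitBytes {
            log.Printf("NextGC (%d) reached the memory limit (%d): an OOM kill is likely", ms.NextGC, limitBytes)
        }
    }
}

func main() {
    go watchMemory(512<<20, 30*time.Second) // hypothetical 512 MiB limit
    select {}                               // stand-in for the service's real work
}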

Related

Do job objects work with memory mapped files?

I am using memory-mapped files for a set of very large datasets (each ~150GB). The total amount of memory consumed while processing the data is around 50TB, because there are many algorithm stages, each of which processes another 150GB dataset.
Memory-mapped files seem to balance quite well if there is little memory load and the memory manager has enough time to balance the memory.
High Memory Load
The problem, though, is that when there is a high memory load and a lot of data is written into the memory-mapped files, the current process working set grows to the limit of the RAM. You can see that the pages are written to the hard disk, but when I reach the maximum RAM, Windows stalls completely and never returns until everything is written.
Another problem is that the memory manager only writes the pages once it reaches the maximum RAM limit; before that, nothing is written to disk.
I can accept that my process writing the data and the memory page writer are two independent processes that don't know each other, so while I'm producing data, the page writer tries its best to write the pages to disk but cannot keep up, since writing to disk takes more time than allocating memory.
Obviously it is never a problem if the memory load of Windows is below the maximum RAM; as soon as I come near the maximum RAM, things fall apart.
First Attempt: SetWorkingSetSize
My first attempt was to use SetProcessWorkingSetSize, setting the maximum allowed working set size to a strict limit (e.g. 5GB). What happens then is that the memory manager starts writing pages to disk when the process reaches the 5GB working set size, and the process never exceeds the 5GB in this example. Nevertheless my process still writes data, and what I can see in Task Manager is that the amount of memory in use is still growing, whereas the process itself stays at 5GB. So the same situation as before exists: RAM usage grows until the maximum RAM is reached and Windows stalls again.
I was hoping that Windows would throttle my memory allocations, i.e. that there is a mechanism in Windows that defers memory allocation so my application automatically slows down, but that doesn't seem to be the case.
Second Attempt: Job Objects
So I thought Job Objects could be a solution, as it seemed to me that SetProcessWorkingSetSize does not work as expected and is only a suggestion to the OS. So I used JOBOBJECT_EXTENDED_LIMIT_INFORMATION:
typedef struct _JOBOBJECT_EXTENDED_LIMIT_INFORMATION {
    JOBOBJECT_BASIC_LIMIT_INFORMATION BasicLimitInformation;
    IO_COUNTERS IoInfo;
    SIZE_T ProcessMemoryLimit;
    SIZE_T JobMemoryLimit;
    SIZE_T PeakProcessMemoryUsed;
    SIZE_T PeakJobMemoryUsed;
} JOBOBJECT_EXTENDED_LIMIT_INFORMATION, *PJOBOBJECT_EXTENDED_LIMIT_INFORMATION;
with SetInformationJobObject, but it seems that it doesn't do anything in terms of limiting the process's working set, whereas SetProcessWorkingSetSize does limit it.
So my questions are:
Should Job Objects limit the working set, and should they work like SetProcessWorkingSetSize? (I understand that Job Objects are for more than this, but I'm asking only about this special case.)
Should Windows throttle memory allocation itself, or do I have to do it myself?
If Windows does not throttle, what would be the best approach to throttle my application (perhaps also Job Objects, using notifications)?

How could I make a Go program use more memory? Is that recommended?

I'm looking for an option similar to -Xmx in Java, that is, a way to assign the maximum runtime memory my Go application can utilise. I was checking the runtime package, but I'm not entirely sure that is the way to go.
I tried setting something like this with func SetMaxStack() (likely very stupid):
debug.SetMaxStack(5000000000) // bytes
model.ExcelCreator()
The reason I am looking to do this is that there is currently an ample amount of RAM available, but the application won't consume more than 4-6% of it. I might be wrong here, but that could be forcing GC to happen much more often than needed, leading to a performance issue.
What I'm doing
Getting a large dataset from an RDBMS and processing it to write out to Excel.
Another reason I am looking for such an option is to limit the maximum usage of RAM on the server where the application will ultimately be deployed.
Any hints on this would be greatly appreciated.
The current stable Go release (1.10) has only a single knob which may be used to trade memory for lower CPU usage by the garbage collection the Go runtime performs.
This knob is called GOGC, and its description reads:
The GOGC variable sets the initial garbage collection target percentage. A collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. The default is GOGC=100. Setting GOGC=off disables the garbage collector entirely. The runtime/debug package's SetGCPercent function allows changing this percentage at run time. See https://golang.org/pkg/runtime/debug/#SetGCPercent.
So basically setting it to 200 would supposedly double the amount of memory the Go runtime of your running process may use.
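The same trade-off can also be made at run time via runtime/debug.SetGCPercent; a minimal sketch (the value 200 is only illustrative):
package main

import (
    "fmt"
    "runtime/debug"
)

func main() {
    // Equivalent to running with GOGC=200: freshly allocated data may now
    // reach 200% of the live heap before the next collection is triggered.
    old := debug.SetGCPercent(200)
    fmt.Printf("previous GC percentage: %d\n", old)
    // ... the rest of the program now trades higher memory use for fewer GC cycles.
}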
Having said that, I'd note that the Go runtime actually tries to adjust the behaviour of its garbage collector to the workload of your running program and the CPU processing power at hand.
I mean that normally there's nothing wrong with your program not consuming lots of RAM: if the collector happens to sweep the garbage fast enough without hampering performance in a significant way, I see no reason to worry; Go's GC is one of the most intensely fine-tuned parts of the runtime, and it works very well in fact.
Hence you may try to take another route:
1. Profile memory allocations of your program.
2. Analyze the profile and try to figure out where the hot spots are, and whether (and how) they can be optimized. You might start here and continue with the gazillion other intros to this stuff.
3. Optimize. Typically this amounts to making certain buffers reusable across different calls to the same function(s) consuming them, preallocating slices instead of growing them gradually, using sync.Pool where deemed useful, etc. (see the sketch below). Such measures may actually increase the memory truly used (that is, by live objects, as opposed to garbage) but may lower the pressure on the GC.
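For illustration, here is a minimal sketch of the buffer-reuse point from item 3; the render function and its data are made up, but the sync.Pool and preallocation patterns are the ones meant above:
package main

import (
    "bytes"
    "fmt"
    "sync"
)

// A pool of bytes.Buffer values reused across calls instead of allocating
// a fresh buffer (and producing garbage) on every call.
var bufPool = sync.Pool{
    New: func() interface{} { return new(bytes.Buffer) },
}

func render(rows []string) string {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset()
    defer bufPool.Put(buf)

    for _, r := range rows {
        buf.WriteString(r)
        buf.WriteByte('\n')
    }
    // String copies the bytes, so returning the buffer to the pool is safe.
    return buf.String()
}

func main() {
    // Preallocate the slice with its known capacity instead of growing it gradually.
    rows := make([]string, 0, 3)
    rows = append(rows, "a", "b", "c")
    fmt.Print(render(rows))
}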

How to measure the performance of the Erlang Garbage Collector?

I have started programming in Erlang recently and there are a few things I want to understand regarding garbage collection (GC). As far as I understand, there is a generational GC for the private heap of each process and a reference counting GC for the global shared heap.
What I would like to know is if there is anyway to get:
How many collection cycles have occurred?
How many bytes are allocated and deallocated, at the global level or the process level?
What are the private heap and shared heap sizes? And can we define these as GC parameters?
How long does it take to collect garbage? What percentage of time is needed?
Is there a way to run a program without GC?
Is there a way to get this kind of information, either with code or using some commands when I run an Erlang program?
Thanks.
To get information for a single process, you can call erlang:process_info(Pid). This will yield (as of Erlang 18.0) the following fields:
> erlang:process_info(self()).
[{current_function,{erl_eval,do_apply,6}},
{initial_call,{erlang,apply,2}},
{status,running},
{message_queue_len,0},
{messages,[]},
{links,[<0.27.0>]},
{dictionary,[]},
{trap_exit,false},
{error_handler,error_handler},
{priority,normal},
{group_leader,<0.26.0>},
{total_heap_size,4184},
{heap_size,2586},
{stack_size,24},
{reductions,3707},
{garbage_collection,[{min_bin_vheap_size,46422},
{min_heap_size,233},
{fullsweep_after,65535},
{minor_gcs,7}]},
{suspending,[]}]
The number of collection cycles for the process is available in the field minor_gcs under the section garbage_collection.
Per Process
The current heap size for the process is available in the field heap_size from the results above (measured in words; a word is 4 bytes on a 32-bit VM and 8 bytes on a 64-bit VM). The total memory consumption of the process can be obtained by calling erlang:process_info(Pid, memory), which returns, for example, {memory,34312} for the above process. This includes the call stack, heap and internal structures.
Deallocations (and allocations) can be traced using erlang:trace/3. If the trace flag is garbage_collection you will receive messages of the form {trace, Pid, gc_start, Info} and {trace, Pid, gc_end, Info}. The Info field of the gc_start message contains such things as heap_size and old_heap_size.
Per System
Top level statistics of the system can be obtained by erlang:memory/0:
> erlang:memory().
[{total,15023008},
{processes,4215272},
{processes_used,4215048},
{system,10807736},
{atom,202481},
{atom_used,187597},
{binary,325816},
{code,4575293},
{ets,234816}]
Garbage collection statistics can be obtained via erlang:statistics(garbage_collection) which yields:
> statistics(garbage_collection).
{85,23961,0}
Where (as of Erlang 18.0) the first field is the total number of garbage collections performed by the VM and the second field is the total number of words reclaimed.
The heap sizes for a process are available under the fields total_heap_size (all heap fragments and stack) and heap_size (the size of the youngest heap generation) from the process info above.
They can be controlled via spawn options, specifically min_heap_size which sets the initial heap size for a process.
To set it for all processes, erlang:system_flag(min_heap_size, MinHeapSize) can be called.
You can also control global VM memory allocation via the +M... options to the Erlang VM. The flags are described here. However, this requires extensive knowledge about the internals of the Erlang VM and its allocators and using them should not be taken lightly.
This can be obtained via the garbage_collection tracing described above (under Per Process). If you use the timestamp option when tracing, you will receive a timestamp with each trace message that can be used to calculate the total GC time.
Short answer: no.
Long answer: Maybe. You can control the initial heap size (via min_heap_size) which will affect when garbage collection will occur the first time. You can also control when a full sweep will be performed with the fullsweep_after option.
More information can be found in the Academic and Historical Questions and Processes section of the Efficiency Guide.
The most practical way of introspecting Erlang memory usage at runtime is via the Recon library, as Steve Vinoski mentioned.

What is Peak Working Set in windows task manager

I'm confused about the Windows Task Manager memory overview.
In the general memory overview it shows 7.9 GB "in use" (in my sample).
I've used Process Explorer to sum up the used memory. Since the result is the nearest number to the 7.9 GB in Task Manager, I guess that is the value shown there.
Now my question:
What is the Peak working set?
If I hover over the column in Task Manager it shows a tooltip, and the Microsoft help says: "Maximum amount of working set memory used by the process."
Is it the memory currently used by all processes, or the maximum amount of memory that was ever used by all processes?
The number you refer to is "Memory used by processes, drivers and the operating system" [source].
This is an easy but somewhat vague description. A somewhat similar description would be the total amount of memory that is not free, or part of the buffer cache, or part of the standby list.
It is not the maximum memory used at some time ("peak"), it's a coincidence that you have roughly the same number there. It is the presently used amount (used by "everyone", that is all programs and the OS).
The peak working set is a different thing. The working set is the amount of memory in a process (or, if you consider several processes, in all these processes) that is currently in physical memory. The peak working set is, consequently, the maximum value so far seen.
A process may allocate more memory than it actually ever commits ("uses"), and most processes will commit more memory than they have in their working set at one time. This is perfectly normal. Pages are moved in and out of working sets (and into the standby list) to assure that the computer, which has only a finite amount of memory, always has enough reserves to satisfy any memory needs.
The memory figures in question aren't actually a reliable indicator of how much memory a process is using.
A brief explanation of each of the memory relationships:
Private Bytes are what the process has allocated, including pagefile usage.
Working Set is the non-paged Private Bytes plus memory-mapped files.
Virtual Bytes are the Working Set plus paged Private Bytes and the standby list.
In answer to your question the peak working set is the maximum amount of physical RAM that was assigned to the process in question.
~ Update ~
Available memory is defined as the sum of the standby list plus free memory. There is far more to total memory usage than the sum of all process working sets. Because of this, and due to memory sharing, this value is not generally very useful.
The virtual size of a process is the portion of the process's virtual address space that has been allocated for use. There is no relationship between this and physical memory usage.
Private bytes is the portion of a process's virtual address space that has been allocated for private use. It does not include shared memory or memory used for code. There is no relationship between this value and physical memory usage either.
Working set is the amount of physical memory in use by a process. Due to memory sharing there will be some double counting in this value.
The terms mentioned above aren't really going to mean very much until you understand the basic concepts in Windows memory management. Have a look HERE for some further reading.

R and shared memory for parallel::mclapply

I am trying to take advantage of a quad-core machine by parallelizing a costly operation that is performed on a list of about 1000 items.
I am using R's parallel::mclapply function currently:
res = rbind.fill(parallel::mclapply(lst, fun, mc.cores=3, mc.preschedule=T))
Which works. The problem is, any additional subprocess that is spawned has to allocate a large chunk of memory.
Ideally, I would like each core to access shared memory from the parent R process, so that as I increase the number of cores used in mclapply, I don't hit RAM limitations before core limitations.
I'm currently at a loss on how to debug this issue. All of the large data structures that each process accesses are globals (currently). Is that somehow the issue?
I did increase my shared memory max setting for the OS to 20 GB (available RAM):
$ cat /etc/sysctl.conf
kern.sysv.shmmax=21474836480
kern.sysv.shmall=5242880
kern.sysv.shmmin=1
kern.sysv.shmmni=32
kern.sysv.shmseg=8
kern.maxprocperuid=512
kern.maxproc=2048
I thought that would fix things, but the issue still occurs.
Any other ideas?
Just a tip about what might have been going on:
R-devel Digest, Vol 149, Issue 22
Radford Neal's answer from Jul 26, 2015:
When mclapply forks to start a new process, the memory is initially shared with the parent process. However, a memory page has to be copied whenever either process writes to it. Unfortunately, R's garbage collector writes to each object to mark and unmark it whenever a full garbage collection is done, so it's quite possible that every R object will be duplicated in each process, even though many of them are not actually changed (from the point of view of the R programs).
Linux and macOS have a copy-on-write mechanism for forking: the pages of memory are not actually copied, but shared until the first write.
mclapply is based on fork(), so probably (unless you write to your big shared data) the memory that you see reported by your process lister is not memory that is actually allocated separately for each process.
But, when collecting the results, the master process will have to allocate memory for each returned result of the mclapply.
To help you further, we would need to know more about your fun function.
I guess I would have thought this would not have used extra memory because of the copy-on-write functionality. I take it the elements of the list are large? Perhaps when R passes the elements to fun() it is actually making a copy of the list item instead of using copy on write. If so, the following might work better:
fun <- function(itemNumber){
  # look the item up inside the child via the shared global lst,
  # so only the index (not the item itself) is passed to the worker
  myitem <- lst[[itemNumber]]
  # now do your computations on myitem
}
res = rbind.fill(parallel::mclapply(1:length(lst), fun, mc.cores=3, mc.preschedule=T))
Or use lst[[itemNumber]] directly in your function. If R/Linux/macOS isn't smart enough to use copy-on-write with the function as you wrote it, it may be with this modified approach.
Edit: I assume you are not modifying the items in the list. If you do, R will make copies of the data.
