R and shared memory for parallel::mclapply - macOS

I am trying to take advantage of a quad-core machine by parallelizing a costly operation that is performed on a list of about 1000 items.
I am using R's parallel::mclapply function currently:
res = rbind.fill(parallel::mclapply(lst, fun, mc.cores=3, mc.preschedule=T))
This works, but the problem is that every additional subprocess that is spawned has to allocate a large chunk of memory.
Ideally, I would like each core to access shared memory from the parent R process, so that as I increase the number of cores used in mclapply, I don't hit RAM limitations before core limitations.
I'm currently at a loss on how to debug this issue. All of the large data structures that each process accesses are globals (currently). Is that somehow the issue?
I did increase my shared memory max setting for the OS to 20 GB (available RAM):
$ cat /etc/sysctl.conf
kern.sysv.shmmax=21474836480
kern.sysv.shmall=5242880
kern.sysv.shmmin=1
kern.sysv.shmmni=32
kern.sysv.shmseg=8
kern.maxprocperuid=512
kern.maxproc=2048
I thought that would fix things, but the issue still occurs.
Any other ideas?

Just a hint at what might have been going on:
R-devel Digest, Vol 149, Issue 22
Radford Neal's answer from Jul 26, 2015:
When mclapply forks to start a new process, the memory is initially
shared with the parent process. However, a memory page has to be
copied whenever either process writes to it. Unfortunately, R's
garbage collector writes to each object to mark and unmark it whenever
a full garbage collection is done, so it's quite possible that every R
object will be duplicated in each process, even though many of them
are not actually changed (from the point of view of the R programs).

Linux and macOS have a copy-on-write mechanism when forking: the pages of memory are not actually copied, but shared until the first write.
mclapply is based on fork(), so probably (unless you write to your big shared data) the memory that your process lister reports is not memory that each child has actually allocated separately.
But when collecting the results, the master process will have to allocate memory for each result returned by mclapply.
To help you further, we would need to know more about your fun function.
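The mechanism these answers rely on can be sketched outside R with plain fork(2) (a hypothetical C++ illustration, not the poster's code): after the fork, the child reads the parent's large object without any pages being copied; only a write would trigger a copy.

#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>
#include <vector>

int main() {
    // Parent builds a large object before forking (analogous to R globals), ~80 MB.
    std::vector<double> big(10000000, 1.0);

    pid_t pid = fork();
    if (pid == 0) {                        // child: pages are shared copy-on-write
        double sum = 0;
        for (double x : big) sum += x;     // reads alone do not copy any pages
        // big[0] = 2.0;                   // a write like this would copy the touched page
        std::printf("child sum = %f\n", sum);
        _exit(0);
    }
    waitpid(pid, nullptr, 0);              // parent waits for the child to finish
    return 0;
}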

I guess I would have thought this would not have used extra memory because of the copy-on-write functionality. I take it the elements of the list are large? Perhaps when R passes the elements to fun() it is actually making a copy of the list item instead of using copy on write. If so, the following might work better:
fun <- function(itemNumber){
  myitem <- lst[[itemNumber]]
  # now do your computations
}
res = rbind.fill(parallel::mclapply(1:length(lst), fun, mc.cores=3, mc.preschedule=T))
Or use lst[[itemNumber]] directly in your function. If R/Linux/macOS isn't smart enough to use copy-on-write with the function as you wrote it, it may be with this modified approach.
Edit: I assume you are not modifying the items in the list. If you do, R is going to make copies of the data.

Related

relationship between container_memory_working_set_bytes and process_resident_memory_bytes and total_rss

I'm looking to understand the relationship between
container_memory_working_set_bytes vs process_resident_memory_bytes vs total_rss (container_memory_rss) + file_mapped, so as to be better equipped to alert on possible OOM.
What I see goes against my understanding (which is puzzling me right now), given that the container/pod is running a single process executing a compiled program written in Go.
Why is container_memory_working_set_bytes so much bigger (nearly 10 times more) than process_resident_memory_bytes?
Also, the relationship between container_memory_working_set_bytes and container_memory_rss + file_mapped is strange here, something I did not expect after reading here:
The total amount of anonymous and swap cache memory (it includes transparent hugepages), and it equals to the value of total_rss from memory.status file. This should not be confused with the true resident set size or the amount of physical memory used by the cgroup. rss + file_mapped will give you the resident set size of cgroup. It does not include memory that is swapped out. It does include memory from shared libraries as long as the pages from those libraries are actually in memory. It does include all stack and heap memory.
So the cgroup's total resident set size is rss + file_mapped; how can this value be less than container_memory_working_set_bytes for a container running in the given cgroup?
Which makes me feel there is something about these stats that I'm not getting right.
Following are the PromQL queries used to build the above graph:
process_resident_memory_bytes{container="sftp-downloader"}
container_memory_working_set_bytes{container="sftp-downloader"}
go_memstats_heap_alloc_bytes{container="sftp-downloader"}
container_memory_mapped_file{container="sftp-downloader"} + container_memory_rss{container="sftp-downloader"}
So the relationship seems to be like this:
container_memory_working_set_bytes = container_memory_usage_bytes - total_inactive_file
container_memory_usage_bytes, as its name implies, is the total memory used by the container; but since it also includes file cache (i.e. inactive_file, which the OS can release under memory pressure), subtracting inactive_file gives container_memory_working_set_bytes.
The relationship between container_memory_rss and the working set can be summed up using the following expression:
container_memory_usage_bytes = container_memory_cache + container_memory_rss
cache reflects data stored on disk that is currently cached in memory; it contains active + inactive file (mentioned above).
This explains why container_memory_working_set_bytes was higher.
Ref #1
Ref #2
Does this help to make sense of the chart?
Not really an answer, but still two assorted points.
Here at my $dayjob, we have faced various issues with how tools external to the Go runtime count and display the memory usage of a process executing a program written in Go.
Coupled with the fact that Go's GC on Linux does not actually release freed memory pages to the kernel but merely madvise(2)s it that such pages are MADV_FREE, a GC cycle which has freed quite a hefty amount of memory does not result in any noticeable change in the "process RSS" readings taken by the external tooling (usually cgroups stats).
Hence we're exporting our own metrics, obtained by periodically calling runtime.ReadMemStats (and runtime/debug.ReadGCStats) in any major service written in Go, with the help of a simple package written specifically for that. These readings reflect the true idea of the Go runtime about the memory under its control.
By the way, the NextGC field of the memory stats is super useful to watch if you have memory limits set for your containers because once that reading reaches or surpasses your memory limit, the process in the container is surely doomed to be eventually shot down by the oom_killer.

Memory is not being released

This is a simple program to decompress a string. I'm just running a loop to show that memory usage increases and the memory used never gets released.
Memory is not getting released even after 8 hours.
Package for decompressing string: https://github.com/Albinzr/lzGo - (simple lz string algorithm)
I'm adding a gist link since the string used for decompressing is large
Source Code: (gist link)
Activity Monitor: (screenshot)
I'm completely new to Go. Can anyone tell me how I can solve the memory issue?
UPDATE Jul 15 20
The app still crashes when the memory limit is reached. Since it only uses 12 MB - 15 MB, this should not happen!
There is a lot going on here.
First, using Go version 1.14.2 your program works fine for me. It does not appear to be leaking memory.
Second, even when I purposely created a memory leak by increasing the loop size to 100 and saving the results in an array, I only used about 100 MB of memory.
Which gets us to the third point: you should not be using Activity Monitor or any other operating-system-level tool to check for memory leaks in a Go program. Operating system memory management is a painfully complex topic, and the OS tools are designed to help you determine how a program is affecting the whole system, not what is going on within the program.
Specifically, macOS "Real Memory" (analogous to RSS, Resident Set Size) includes memory the program is no longer using but the OS has not taken back yet. When the garbage collector frees up memory and tells the OS it does not need that memory anymore, the OS does not immediately take it back. (Why it works that way is way beyond the scope of this answer.)
Also, if the OS is under Memory Pressure, it can take back not only memory the program has freed, but it can also take back (temporarily) memory the program is still using but has not accessed "recently" so that another program that urgently needs memory can use it. In this case, "Real Memory" will be reduced even if the process is not actually using less memory. There is no statistic reported by the operations system that will help you here.
You need to use native Go settings like GODEBUG=gctrace=1 or tools like expvar and expvarmon to see what the garbage collector is doing.
As for why your program ran out of memory when you limited it, keep in mind that by default Go builds a dynamically linked executable and just reading in all the shared libraries can take up a lot of memory. Try building your application with static linking using CGO_ENABLED=0 and see if that helps. See how much memory it uses when you only run 1 iteration of the loop.

How smart is mmap?

mmap can be used to share read-only memory between processes, reducing the memory foot print:
process P1 mmaps a file, uses the mapped memory -> data gets loaded into RAM
process P2 mmaps a file, uses the mapped memory -> OS re-uses the same memory
But how about this:
process P1 mmaps a file, loads it into memory, then exits.
another process P2 mmaps the same file, accesses the memory that is still hot from P1's access.
Is the data loaded again from disk? Is the OS smart enough to re-use the virtual memory even if "mmap count" dropped to zero temporarily?
Does the behaviour differ between different OS? (I'm mostly interested in Linux/OS X)
EDIT: In case the OS is not smart enough -- would it help if there is one "background process", keeping the file mmaped, so it never leaves the address space of at least one process?
I am of course interested in performance when I mmap and munmap the same file successively and rapidly, possibly (but not necessarily) within the same process.
EDIT2: I see answers describing completely irrelevant points at great length. To reiterate the point -- can I rely on Linux/OS X to not re-load data that already resides in memory, from previous page hits within mmaped memory segments, even though the particular region is no longer mmaped by any process?
The presence or absence of the contents of a file in memory is much less coupled to mmap system calls than you think. When you mmap a file, it doesn't necessarily load it into memory. When you munmap it (or if the process exits), it doesn't necessarily discard the pages.
There are many different things that could trigger the contents of a file to be loaded into memory: mapping it, reading it normally, executing it, attempting to access memory that is mapped to the file. Similarly, there are different things that could cause the file's contents to be removed from memory, mostly related to the OS deciding it wants the memory for something more important.
In the two scenarios from your question, consider inserting a step between steps 1 and 2:
1.5. another process allocates and uses a large amount of memory -> the mmaped file is evicted from memory to make room.
In this case the file's contents will probably have to get reloaded into memory if they are mapped again and used again in step 2.
versus:
1.5. nothing happens -> the contents of the mmaped file hang around in memory.
In this case the file's contents don't need to be reloaded in step 2.
In terms of what happens to the contents of your file, your two scenarios aren't much different. It's something like this step 1.5 that would make a much more important difference.
As for a background process that is constantly accessing the file in order to ensure it's kept in memory (for example, by scanning the file and then sleeping for a short amount of time in a loop), this would of course force the file to remain in memory, but you're probably better off just letting the OS make its own decision about when to evict the file and when not to evict it.
The second process likely finds the data from the first process in the buffer cache. So in most cases the data will not be loaded again from disk. But since the buffer cache is a cache, there are no guarantees that the pages don't get evicted in between.
You could start a third process and use mmap(2) and mlock(2) to pin the pages in RAM. But this will probably cause more trouble than it is worth.
Linux replaced the UNIX buffer cache with a page cache, but the principle is still the same. The Mac OS X equivalent is called the Unified Buffer Cache (UBC).
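If you do want to go the third-process route, here is a minimal sketch (hypothetical file path; plain POSIX calls, so it applies to both Linux and OS X): map the file and mlock the mapping so its pages stay resident.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char *path = "/tmp/hot_data.bin";   // hypothetical file to keep resident
    int fd = open(path, O_RDONLY);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { std::perror("fstat"); return 1; }
    size_t len = st.st_size;

    // Map the whole file read-only; the mapping is backed by the page cache.
    void *p = mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { std::perror("mmap"); return 1; }

    // Pin the pages in RAM so they are not evicted (needs a sufficient
    // RLIMIT_MEMLOCK or privileges; fails with ENOMEM/EPERM otherwise).
    if (mlock(p, len) != 0) { std::perror("mlock"); return 1; }

    pause();   // keep the mapping and the lock alive indefinitely
    return 0;
}

As the answers point out, though, you are usually better off letting the kernel decide what stays cached.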

What's the difference between memory allocation and garbage collection, please?

I understand that 'Garbage Collection' is a form of memory management and that it's a way to automatically reclaim unused memory.
But what is 'memory allocation' and the conceptual difference from 'Garbage Collection'?
They are polar opposites. So yeah, pretty big difference.
Allocating memory is the process of claiming a memory space to store things.
Garbage Collection (or freeing of memory) is the process of releasing that memory back to the pool of available memory.
Many newer languages perform both of these steps in the background for you when variables are declared/initialized, and fall out of scope.
Memory allocation is the act of asking the system for some memory in order to use it for something.
Garbage collection is a process that checks whether some memory that was previously allocated is no longer really in use (i.e. is no longer accessible from the program) so that it can be freed automatically.
A subtle point is that the objective of garbage collection is not actually "freeing objects that are no longer used", but to emulate a machine with infinite memory, allowing you to continue to allocate memory without caring about deallocating it; for this reason, it's not a substitute for the management of other kinds of resources (e.g. file handles, database connections, ...).
A simple pseudo-code example:
void myFoo()
{
    LinkedList<int> myList = new LinkedList<int>();
    return;
}
This will request enough new space on the heap to store the LinkedList object.
However, when the function body is over, myList disappears and you no longer have any way of knowing where this LinkedList is stored (its memory address). Hence, there is absolutely no way to tell the system to free that memory and make it available to you again later.
The Java garbage collector will do that for you automatically, at the cost of some performance, and while also introducing a little non-determinism (you cannot really tell when the GC will be called).
In C++ there is no native garbage collector (yet?). But the correct way of managing memory is by the use of smart pointers (e.g. std::auto_ptr (deprecated in C++11), std::shared_ptr), etc.
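A minimal C++ sketch of the smart-pointer approach (standard library only): ownership is tied to scope, so the memory is released deterministically without a garbage collector.

#include <list>
#include <memory>

void myFoo() {
    // unique_ptr owns the heap-allocated list; no explicit delete and no GC needed.
    auto myList = std::make_unique<std::list<int>>();
    myList->push_back(42);
    // ... use the list ...
}   // myList goes out of scope here and the list is freed immediately

int main() {
    myFoo();
    return 0;
}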
You want a book. You go to the library and request the book you want. The library checks to see if they have the book (in which case they do) and you gladly take it and know you must return it later.
You go home, sit down, read the book and finish it. You return the book back to the library the next day because you are finished with it.
That is a simple analogy for memory allocation and garbage collection. Computers have limited memory, just like libraries have limited copies of books. When you want to allocate memory you need to make a request and if the computer has sufficient memory (the library has enough copies for you) then what you receive is a chunk of memory. Computers need memory for storing data.
Since computers have limited memory, you need to return the memory, otherwise you will run out (just as the library would have nothing left if no one returned its books; the computer will explode and burn furiously before your very eyes if it runs out of memory... not really). Garbage collection is the term for checking whether memory that has been previously allocated is no longer in use, so it can be returned and reused for other purposes.
Memory allocation asks the computer for some memory, in order to store data. For example, in C++:
int* myInts = new int[howManyIntsIWant];
tells the computer to allocate me enough memory to store some number of integers.
Another way of doing the same thing would be:
int myInts[6];
The difference here is that in the second example, we know when the code is written and compiled exactly how much space we need - it's 6 * the size of one int. This lets us do static memory allocation (which uses memory on what's called the "stack").
In the first example we don't know how much space is needed when the code is compiled, we only know it when the program is running and we have the value of howManyIntsIWant. This is dynamic memory allocation, which gets memory on the "heap".
Now, with static allocation we don't need to tell the computer when we're finished with the memory. This relates to how the stack works; the short version is that once we've left the function where we created that static array, the memory is swallowed up straight away.
With dynamic allocation, this doesn't happen so the memory has to be cleaned up some other way. In some languages, you have to write the code to deallocate this memory, in other it's done automatically. This is garbage collection - some automatic process built into the language that will sweep through all of the dynamically allocated memory on the heap, work out which bits aren't being used and deallocate them (i.e. free them up for other processes and programs).
So: memory allocation = asking for memory for your program. Garbage collection = where the programming language itself works out what memory isn't being used any more and deallocates it for you.
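For the dynamic example above, in a language without garbage collection the cleanup has to be written by hand; a minimal C++ sketch (howManyIntsIWant is the run-time value from the example):

#include <cstddef>

int main() {
    std::size_t howManyIntsIWant = 6;          // in general only known at run time
    int *myInts = new int[howManyIntsIWant];   // dynamic allocation on the heap
    // ... use myInts ...
    delete[] myInts;   // explicit cleanup; with a GC this step would be automatic
    return 0;
}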

Does calling free or delete ever release memory back to the "system"

Here's my question: does calling free or delete ever release memory back to the "system"? By system I mean: does it ever reduce the data segment of the process?
Let's consider the memory allocator on Linux, i.e. ptmalloc.
From what I know (please correct me if I am wrong), ptmalloc maintains a free list of memory blocks and, when a request for memory allocation comes in, it tries to allocate a memory block from this free list (I know the allocator is much more complex than that, but I am just putting it in simple words). If, however, it fails, it gets the memory from the system using, say, the sbrk or brk system calls. When memory is free'd, that block is placed in the free list.
Now consider this scenario: at peak load, a lot of objects have been allocated on the heap. Now when the load decreases, the objects are free'd. So my question is: once an object is free'd, will the allocator do some calculations to decide whether it should just keep this block in the free list, or, depending on the current size of the free list, give that memory back to the system, i.e. decrease the data segment of the process using sbrk or brk?
The documentation of glibc tells me that if the allocation request is much larger than the page size, it will be allocated using mmap and will be directly released back to the system once free'd. Cool. But let's say I never ask for an allocation larger than, say, 50 bytes, and I ask for a lot of such 50-byte objects at peak load. Then what?
From what I know (please correct me), memory allocated with malloc will never be released back to the system until the process ends, i.e. the allocator will simply keep it in the free list if I free it. But the question that is troubling me is this: if I use a tool to see the memory usage of my process (I am using pmap on Linux; what do you use?), it should then always show the memory used at peak load (since the memory is never given back to the system, except when allocated using mmap)? That is, the memory used by the process should never decrease (except for stack memory)? Is that right?
I know I am missing something, so please shed some light on all this.
Experts, please clear my concepts regarding this. I will be grateful. I hope I was able to explain my question.
There isn't much overhead for malloc, so you are unlikely to achieve any run-time savings. There is, however, a good reason to implement an allocator on top of malloc, and that is to be able to trace memory leaks. For example, you can free all memory allocated by the program when it exits, and then check to see if your memory allocator calls balance (i.e. same number of calls to allocate/deallocate).
For your specific implementation, there is no reason to free() since the malloc won't release to system memory and so it will only release memory back to your own allocator.
Another reason for using a custom allocator is that you may be allocating many objects of the same size (i.e. you have some data structure that you are allocating a lot). You may want to maintain a separate free list for this type of object, and free/allocate only from this special list. The advantage of this is that it will avoid memory fragmentation.
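A minimal sketch of such a per-type free list (the Node type and sizes are hypothetical): freed nodes are pushed onto a singly linked list and handed back out on the next allocation instead of going through malloc again.

#include <cstdlib>

// Hypothetical fixed-size object that the program allocates a lot of.
struct Node {
    int   value;
    Node *next;   // doubles as the free-list link while the node is unused
};

static Node *free_list = nullptr;   // freed nodes waiting to be reused

Node *node_alloc() {
    if (free_list != nullptr) {     // reuse a previously freed node if we have one
        Node *n = free_list;
        free_list = n->next;
        return n;
    }
    return static_cast<Node *>(std::malloc(sizeof(Node)));   // otherwise fall back to malloc
}

void node_free(Node *n) {
    n->next = free_list;            // push onto the free list; never handed back to malloc
    free_list = n;
}

int main() {
    Node *a = node_alloc();
    a->value = 1;
    node_free(a);                   // goes onto the free list...
    Node *b = node_alloc();         // ...and is handed straight back out, no malloc call
    b->value = 2;
    node_free(b);
    return 0;
}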
No.
It's actually a bad strategy for a number of reasons, so it doesn't happen -- except, as you note, there can be an exception for large allocations that can be made directly in pages.
It increases internal fragmentation and therefore can actually waste memory. (You can only return aligned pages to the OS, so pulling aligned pages out of a block will usually create two guaranteed-to-be-small blocks --smaller than a page, anyway-- to either side of the block. If this happens a lot you end up with the same total amount of usefully-allocated memory plus lots of useless small blocks.)
A kernel call is required, and kernel calls are slow, so it would slow down the program. It's much faster to just throw the block back into the heap.
Almost every program will either converge on a steady-state memory footprint or it will have an increasing footprint until exit. (Or, until near-exit.) Therefore, all the extra processing needed by a page-return mechanism would be completely wasted.
It is entirely implementation dependent. On Windows, VC++ programs can return memory back to the system if the corresponding memory pages contain only free'd blocks.
I think that you have all the information you need to answer your own question. pmap shows the memory that is currently being used by the process. So, if you call pmap before the process reaches peak memory, it will not show peak memory. If you call pmap just before the process exits, it will show peak memory for a process that does not use mmap. If the process uses mmap, then if you call pmap at the point where maximum memory is being used, it will show peak memory usage, but this point may not be at the end of the process (it could occur anywhere).
This applies only to your current system (i.e. based on the documentation you have provided for free and mmap and malloc), but as the previous poster has stated, the behavior of these is implementation dependent.
This varies a bit from implementation to implementation.
Think of your memory as a massive long block; when you allocate from it you take a bit out of your memory (labeled '1' below):
111
If I allocate more memory with malloc, it gets some from the system:
1112222
If I now free '1':
___2222
It won't be returned to the system, because '2' is in front of it (and memory is given back as a contiguous block). However, if the end of the memory is freed, then that memory is returned to the system. If I freed '2' instead of '1', I would get:
111
the bit where '2' was would be returned to the system.
The main benefit of freeing memory is that that bit can then be reallocated, as opposed to getting more memory from the system, e.g.:
33_2222
I believe that the memory allocator in glibc can return memory back to the system, but whether it will or not depends on your memory allocation patterns.
Let's say you do something like this:
void *pointers[10000];
for (int i = 0; i < 10000; i++)
    pointers[i] = malloc(1024);
for (int i = 0; i < 9999; i++)
    free(pointers[i]);
The only part of the heap that can be safely returned to the system is the "wilderness chunk", which is at the end of the heap. This can be returned to the system using another sbrk system call, and the glibc memory allocator will do that when the size of this last chunk exceeds some threshold.
The above program would make 10000 small allocations, but only free the first 9999 of them. The last one should (assuming nothing else has called malloc, which is unlikely) be sitting right at the end of the heap. This would prevent the allocator from returning any memory to the system at all.
If you were to free the remaining allocation, glibc's malloc implementation should be able to return most of the pages allocated back to the system.
If you're allocating and freeing small chunks of memory, a few of which are long-lived, you could end up in a situation where you have a large chunk of memory allocated from the system, but you're only using a tiny fraction of it.
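If you do free everything (or at least the chunks at the top of the heap), glibc also lets you request the trim explicitly with malloc_trim(3). A glibc-specific sketch, extending the example above; it is not guaranteed to release anything if the heap top is pinned by a live allocation:

#include <malloc.h>   // malloc_trim() is glibc-specific
#include <cstdlib>

int main() {
    void *pointers[10000];
    for (int i = 0; i < 10000; i++)
        pointers[i] = std::malloc(1024);
    for (int i = 0; i < 10000; i++)   // free all of them, including the topmost chunk
        std::free(pointers[i]);

    // Ask glibc to return as much of the heap top to the kernel as it can.
    malloc_trim(0);
    return 0;
}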
Here are some "advantages" to never releasing memory back to the system:
Having already used a lot of memory makes it very likely you will do so again, and
when you release memory the OS has to do quite a bit of paperwork
when you need it again, your memory allocator has to re-initialise all its data structures in the region it just received
Freed memory that isn't needed gets paged out to disk where it doesn't actually make that much difference
Often, even if you free 90% of your memory, fragmentation means that very few pages can actually be released, so the effort required to look for empty pages isn't terribly well spent
Many memory managers can perform TRIM operations where they return entirely unused blocks of memory to the OS. However, as several posts here have mentioned, it's entirely implementation dependent.
But lets say I never ask for allocation of size greater than say 50 bytes and I ask a lot of such 50 byte objects on peak load on the system. Then what ?
This depends on your allocation pattern. Do you free ALL of the small allocations? If so, and if the memory manager has special handling for small block allocations, then this may be possible. However, if you allocate many small items and then only free all but a few scattered items, you may fragment memory and make it impossible to TRIM blocks, since each block will have only a few straggling allocations. In this case, you may want to use a different allocation scheme for the temporary allocations and the persistent ones, so you can return the temporary allocations back to the OS.
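One way to implement that split (a sketch with hypothetical sizes, using POSIX mmap/munmap only): put the short-lived objects in their own mmap-backed arena and unmap the whole arena when the burst is over, so those pages go straight back to the OS regardless of what malloc keeps.

#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

// A throwaway bump arena for short-lived allocations.
struct Arena {
    char        *base;
    std::size_t  size;
    std::size_t  used;
};

bool arena_init(Arena &a, std::size_t size) {
    void *p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return false;
    a.base = static_cast<char *>(p);
    a.size = size;
    a.used = 0;
    return true;
}

void *arena_alloc(Arena &a, std::size_t n) {
    n = (n + 15) & ~static_cast<std::size_t>(15);   // keep 16-byte alignment
    if (a.used + n > a.size) return nullptr;        // arena exhausted
    void *p = a.base + a.used;
    a.used += n;
    return p;
}

void arena_destroy(Arena &a) {
    munmap(a.base, a.size);   // these pages go back to the OS immediately
    a.base = nullptr;
    a.size = a.used = 0;
}

int main() {
    Arena tmp;
    if (!arena_init(tmp, 64 * 1024 * 1024)) return 1;   // 64 MB for the peak-load burst
    for (int i = 0; i < 100000; i++)
        arena_alloc(tmp, 50);         // the many small, short-lived objects
    arena_destroy(tmp);               // everything is returned to the system at once
    std::puts("arena released");
    return 0;
}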
