Memory used by a long running process on OS X - macos

I want to ensure that a long-running number crunching algorithm doesn't use too much memory. The algorithm is written in C++ and runs on OS X. A drastically simplified version is:
int main() {
while (someCondition) {
// notice nothing is allocated on the heap
vector<int> v(10, 0);
}
}
I've profiled the code using Instruments (allocations and leaks). I don't see any leaks. And while the "live bytes" count looks fine (hovers around 20 MB) the "overall bytes" count keeps growing. What concerned me is when the "overall count" reached about 80 GB I received an OS X warning about lack of hard disk space (I have a 120 GB solid state disk). I don't know much about OS/process interaction so I thought I'd ask:
Is memory used by a long running process on a UNIX-based OS available to other processes before the first process is killed or no longer running?
Edit: Looks like I'm misinterpreting the "overall bytes" number in Instruments:Instruments ObjectAlloc: Explanation of Live Bytes & Overall Bytes. When I check out the process in Activity Monitor the "real memory" is essentially constant.

The reason you get a disk space warning is probably related to virtual memory allocation. Every time your process (or the OS) requests memory it is usually first "allocated" in backing-store - swap.
Total virtual memory is size of available swap plus RAM. I do not have access to OSX, and I know it plays by its own rules, but there must be a command that shows swap usage
swap -l (Solaris)
swap -s (Solaris)
free (linux)
The only command I came up with is vm_stat, plus top - it appears top is probably the closest to what I am talking about.

Related

How to get detailed memory breakdown in the TensorFlow profiler?

I'm using the new TensorFlow profiler to profile memory usage in my neural net, which I'm running on a Titan X GPU with 12GB RAM. Here's some example output when I profile my main training loop:
==================Model Analysis Report======================
node name | requested bytes | ...
Conv2DBackpropInput 10227.69MB (100.00%, 35.34%), ...
Conv2D 9679.95MB (64.66%, 33.45%), ...
Conv2DBackpropFilter 8073.89MB (31.21%, 27.90%), ...
Obviously this adds up to more than 12GB, so some of these matrices must be in main memory while others are on the GPU. I'd love to see a detailed breakdown of what variables are where at a given step. Is it possible to get more detailed information on where various parameters are stored (main or GPU memory), either with the profiler or otherwise?
"Requested bytes" shows a sum over all memory allocations, but that memory can be allocated and de-allocated. So just because "requested bytes" exceeds GPU RAM doesn't necessarily mean that memory is being transferred to CPU.
In particular, for a feedforward neural network, TF will normally keep around the forward activations, to make backprop efficient, but doesn't need to keep the intermediate backprop activations, i.e. dL/dh at each layer, so it can just throw away these intermediates after it's done with these. So I think in this case what you care about is the memory used by Conv2D, which is less than 12 GB.
You can also use the timeline to verify that total memory usage never exceeds 12 GB.

discrepancy between htop and golang readmemstats

My program loads a lot of data at start up and then calls debug.FreeOSMemory() so that any extra space is given back immediately.
loadDataIntoMem()
debug.FreeOSMemory()
after loading into memory , htop shows me the following for the process
VIRT RES SHR
11.6G 7629M 8000
But a call to runtime.ReadMemStats shows me the following
Alloc 5593336608 5.3G
BuckHashSys 1574016 1.6M
HeapAlloc 5593336610 5.3G
HeapIdle 2607980544 2.5G
HeapInuse 7062446080 6.6G
HeapReleased 2607980544 2.5G
HeapSys 9670426624 9.1G
MCacheInuse 9600 9.4K
MCacheSys 16384 16K
MSpanInuse 106776176 102M
MSpanSys 115785728 111M
OtherSys 25638523 25M
StackInuse 589824 576K
StackSys 589824 576K
Sys 10426738360 9.8G
TotalAlloc 50754542056 48G
Alloc is the amount obtained from system and not yet freed ( This is
resident memory right ?) But there is a big difference between the two.
I rely on HeapIdle to kill my program i.e if HeapIdle is more than 2 GB, restart - in this case it is 2.5, and isn't going down even after a while. Golang should use from heap idle when allocating more in the future, thus reducing heap idle right ?
If assumption 1 is wrong, which stat can accurately tell me what the RES value in htop is.
What can I do to reduce the value of HeapIdle ?
This was tried on go 1.4.2, 1.5.2 and 1.6.beta1
The effective memory consumption of your program will be Sys-HeapReleased. This still won't be exactly what the OS reports, because the OS can choose to allocate memory how it sees fit based on the requests of the program.
If your program runs for any appreciable amount of time, the excess memory will be offered back to the OS so there's no need to call debug.FreeOSMemory(). It's also not the job of the garbage collector to keep memory as low as possible; the goal is to use memory as efficiently as possible. This requires some overhead, and room for future allocations.
If you're having trouble with memory usage, it would be a lot more productive to profile your program and see why you're allocating more than expected, instead of killing your process based on incorrect assumptions about memory.

Need bash script that constantly uses high memory but low cpu?

I am running few experiments to see changes in system behavior under different memory and cpu loads. I was wondering is there a bash script which constantly uses high memory but low CPU?
For the purpose of simulating CPU/memory/IO load, most *NIX systems (Linux included) provide handy tool called stress.
The tool varies from OS to OS. On Linux, to take up 512MB of RAM with low CPU load:
stress --vm 1 --vm-bytes 512M --vm-hang 100
(The invocation means: start one memory thread (--vm 1), allocate/free 512MB of memory in every thread, sleep before freeing memory 100 seconds.)
This is silly, and can't be reasonably expected to provide data which will be useful in any real-world scenario. However, to generate at least the amount of memory consumption associated with a given power-of-two bytes:
build_string() {
local pow=$1
local dest=$2
s=' '
for (( i=0; i<pow; i++ )); do
s+="$s"
done
printf -v "$dest" %s "$s"
}
build_string 10 kilobyte # build a string of length 1024
echo "Kilobyte string consumes ${#kilobyte} bytes"
build_string 20 megabyte # build a string of length 1048576
echo "Megabyte string consumes ${#megabyte} bytes"
Note that transiently, during construction, at least 2x the requested space will be required (for the local); a version that didn't have this behavior would either be using namevars (depending on bash 4.3) or eval (depending on the author's willingness to do evil).

How to get memory usage high water mark on OSX

I'd like to be able to test some guesses about memory complexity of various command line utilities.
Taking as a simple example
grep pattern file
I'd like to see how memory usage varies with the size of pattern and the size of file.
For time complexity, I'd make a guess, then run
time grep pattern file
on various sized inputs to see if my guess seems to be borne out in reality, but I don't know how to do this for memory.
One possibility would be a wrapper script that initiates the job and samples memory usage periodically, but this seems inelegant and unlikely to give the real high watermark.
I've seen time -v suggested, but don't have that flag available on my machine (running bash on OSX) and don't know where to find a version that supports it.
I've also seen that on Linux this information is available through the proc filesystem, but again, it's not available to me in my context.
I'm wondering if dtrace might be an appropriate tool, but again am concerned that a simple sample-based figure might not be the true high watermark?
Does anyone know of a tool or approach that would be appropriate on OSX?
Edit
I removed two mentions of disk usage, which were just asides and perhaps distracted from the main thrust of the question.
Your question is interesting because, without the application source code, you need to make a few assumptions about what constitutes memory use. Even if you were to use procfs, the results will be misleading: both the resident set size and the total virtual address space will be over-estimates since they will include extraneous data such as the program text.
Particularly for small commands, it would be easier to track individual allocations, although even there you need to be sure to include all the possible sources. In addition to malloc() etc., a process can extend its heap with brk() or obtain anonymous memory using mmap().
Here's a DTrace script that traces malloc(); you can extend it to include other allocating functions. Note that it isn't suitable for multi-threaded programs as it uses some non-atomic variables.
bash-3.2# cat hwm.d
/* find the maximum outstanding allocation provided by malloc() */
size_t total, high;
pid$target::malloc:entry
{
self->size = arg0;
}
pid$target::malloc:return
/arg1/
{
total += self->size;
allocation[arg1] = self->size;
high = (total > high) ? total : high;
}
pid$target::free:entry
/allocation[arg0]/
{
total -= allocation[arg0];
allocation[arg0] = 0;
}
END
{
printf("High water mark was %d bytes.\n", high);
}
bash-3.2# dtrace -x evaltime=exec -qs hwm.d -c 'grep maximum hwm.d'
/* find the maximum outstanding allocation provided by malloc() */
High water mark was 62485 bytes.
bash-3.2#
A much more comprehensive discussion of memory allocators is contained in this article by Brendan Gregg. It provides a much better answer than my own to your question. In particular, it includes a link to a script called memleak.d; modify this to include time stamps for the allocations & deallocations, so that you can sort its output by time. Then, perhaps using the accompanying script as an example, use perl to track the current outstanding total allocation and high water mark. Such a DTrace/perl combination would be suitable for tracing multi-threaded processes.
You can use /usr/bin/time -l (which is not the time builtin in macos) and read the "maximum resident set size", which is not precisely high water mark but might give you some idea.
$ /usr/bin/time -l ls
...
0.00 real 0.00 user 0.00 sys
925696 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
239 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
3 voluntary context switches
1 involuntary context switches
The meaning of this field is explained here.
Tried getrusage(). Inaccurate results. Tried Instruments. Pain in the arse.
Best solution by far: valgrind + massif.
command-line based: easy to run, script and automate; no apps to open, menus to click, blah blah; can run in background etc
provides a visual graph-- in your terminal-- of memory usage over time
valgrind --tool=massif /path/to/my_program arg1 ...
ms_print `ls -r massif.out.* | head -1` | grep Detailed -B50
To view more details, run ms_print `ls -r massif.out.* | head -1`

Unexpected page handling (also, VirtualLock = no op?)

This morning I stumbled across a surprising number of page faults where I did not expect them. Yes, I probably should not worry, but it still strikes me odd, because in my understanding they should not happen. And, I'd like better if they didn't.
The application (under WinXP Pro 32bit) reserves a larger section (1GB) of address space with VirtualAlloc(MEM_RESERVE) and later allocates moderately large blocks (20-50MB) of memory with VirtualAlloc(MEM_COMMIT). This is done in a worker ahead of time, the intent being to stall the main thread as little as possible. Obviously, you cannot ever assure that no page faults happen unless the memory region is currently locked, but a few of them are certainly tolerable (and unavoidable). Surprisingly every single page faults. Always.
The assumption was thus that the system only creates pages lazily after allocating them, which somehow makes sense too (although the documentation suggests something different). Fair enough, my bad.
The obvious workaround is therefore VirtualLock/VirtualUnlock, which forces the system to create those pages, as they must exist after VirtualLock returns. Surprisingly, still every single page faults.
So I wrote a little test program which did all above steps in sequence, sleeping 5 seconds in between each, to rule out something was wrong in the other code. The results were:
MEM_RESERVE 1GB ---> success, zero CPU, zero time, nothing happens
MEM_COMMIT 1 GB ---> success, zero CPU, zero time, working set increases by 2MB, 512 page faults (respectively 8 bytes of metadata allocated in user space per page)
for(... += 128kB) { VirtualLock(128kB); VirtualUnlock(128kB); } ---> success, zero CPU, zero time, nothing happens
for(... += 4096) *addr = 0; ---> 262144 page faults, about 0.25 seconds (~95% kernel time). 1GB increase for both "working set" and "physical" inside Process Explorer
VirtualFree ---> zero CPU, zero time, both "working set" and "physical" instantly go * poof *.
My expectation was that since each page had been locked once, it must physically exist at least after that. It might of course still be moved in and out of the WS as the quota is exceeded (merely changing one reference as long as sufficient RAM is available). Yet, neither the execution time, nor the working set, nor the physical memory metrics seem to support this. Rather, as it looks, each single accessed page is created upon faulting, even if it had been locked previously. Of course I can touch every page manually in a worker thread, but there must be a cleaner way too?
Am I making a wrong assumption about what VirtualLock should do or am I not understanding something right about virtual memory? Any idea about how to tell the OS in a "clean, legitimate, working" way that I'll be wanting memory, and I'll be wanting it for real?
UPDATE:
In reaction to Harry Johnston's suggestion, I tried the somewhat problematic approach of actually calling VirtualLock on a gigabyte of memory. For this to succeed, you must first set the process' working set size accordingly, since the default quotas are 200k/1M, which means VirtualLock cannot possibly lock a region larger than 200k (or rather, it cannot lock more than 200k alltogether, and that is minus what is already locked for I/O or for another reason).
After setting a minimum working set size of 1GB and a maximum of 2GB, all the page faults happen the moment VirtualAlloc(MEM_COMMIT) is called. "Virtual size" in Process Explorer jumps up by 1GB instantly. So far, it looked really, really good.
However, looking closer, "Physical" remains as it is, actual memory is really only used the moment you touch it.
VirtualLock remains a no-op (fault-wise), but raising the minimum working set size kind of got closer to the goal.
There are two problems with tampering the WS size, however. First, you're generally not meant to have a gigabyte of minimum working set in a process, because the OS tries hard to keep that amount of memory locked. This would be acceptable in my case (it's actually more or less just what I ask for).
The bigger problem is that SetProcessWorkingSetSize needs the the PROCESS_SET_QUOTA access right, which is no problem as "administrator", but it fails when you run the program as a restricted user (for a good reason), and it triggers the "allow possibly harmful program?" alert of some well-known Russian antivirus software (for no good reason, but alas, you can't turn it off).
Technically VirtualLock is a hint, and so the OS is allowed to ignore it. It's backed by the NtLockVirtualMemory syscall which on Reactos/Wine is implemented as a no-op, however Windows does back the syscall with real work (MiLockVadRange).
VirtualLock isn't guarranteed to succeed. Calls to this function require the SE_LOCK_MEMORY_PRIVILEGE to work, and the addresses must fulfil security and quota restrictions. Additionally after a VirtualUnlock, the kernel is no longer obliged to keep your page in memory, so a page fault after that is a valid action.
And as Raymond Chen points out, when you unlock the memory it can formally release the page. This means that the next VirtualLock on the next page might obtain that very same page again, so when you touch the original page you'll still get a page-fault.
VirtualLock remains a no-op (fault-wise)
I tried to reproduce this, but it worked as one might expect. Running the example code shown at the bottom of this post:
start application (523 page faults)
adjust the working set size (21 page faults)
VirtualAlloc with MEM_COMMIT 2500 MB of RAM (2 page faults)
VirtualLock all of that (about 641,250 page faults)
perform writes to all of this RAM in an infinite loop (zero page faults)
This all works pretty much as expected. 2500 MB of RAM is 640,000 pages. The numbers add up. Also, as far as the OS-wide RAM counters go, commit charge goes up at VirtualAlloc, while physical memory usage goes up at VirtualLock.
So VirtualLock is most definitely not a no-op on my Win7 x64 machine. If I don't do it, the page faults, as expected, shift to where I start writing to the RAM. They still total just over 640,000. Plus, the first time the memory is written to takes longer.
Rather, as it looks, each single accessed page is created upon faulting, even if it had been locked previously.
This is not wrong. There is no guarantee that accessing a locked-then-unlocked page won't fault. You lock it, it gets mapped to physical RAM. You unlock it, and it's free to be unmapped instantly, making a fault possible. You might hope it will stay mapped, but no guarantees...
For what it's worth, on my system with a few gigabytes of physical RAM free, it works the way you were hoping for: even if I follow my VirtualLock with an immediate VirtualUnlock and set the minimum working set size back to something small, no further page faults occur.
Here's what I did. I ran the test program (below) with and without the code that immediately unlocks the memory and restores a sensible minimum working set size, and then forced physical RAM to run out in each scenario. Before forcing low RAM, neither program gets any page faults. After forcing low RAM, the program that keeps the memory locked retains its huge working set and has no further page faults. The program that unlocked the memory, however, starts getting page faults.
This is easiest to observe if you suspend the process first, since otherwise the constant memory writes keep it all in the working set even if the memory isn't locked (obviously a desirable thing). But suspend the process, force low RAM, and watch the working set shrink only for the program that has unlocked the RAM. Resume the process, and witness an avalanche of page faults.
In other words, at least in Win7 x64 everything works exactly as you expected it to, using the code supplied below.
There are two problems with tampering the WS size, however. First, you're generally not meant to have a gigabyte of minimum working set in a process
Well... if you want to VirtualLock, you are already tampering with it. The only thing that SetProcessWorkingSetSize does is allow you to tamper with it. It doesn't degrade performance by itself; it's VirtualLock that does - but only if the system actually runs low on physical RAM.
Here's the complete program:
#include <stdio.h>
#include <tchar.h>
#include <Windows.h>
#include <iostream>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
SIZE_T chunkSize = 2500LL * 1024LL * 1024LL; // 2,626,568,192 = 640,000 pages
int sleep = 5000;
Sleep(sleep);
cout << "Setting working set size... ";
if (!SetProcessWorkingSetSize(GetCurrentProcess(), chunkSize + 5001001L, chunkSize * 2))
return -1;
cout << "done" << endl;
Sleep(sleep);
cout << "VirtualAlloc... ";
UINT8* data = (UINT8*) VirtualAlloc(NULL, chunkSize, MEM_COMMIT, PAGE_READWRITE);
if (data == NULL)
return -2;
cout << "done" << endl;
Sleep(sleep);
cout << "VirtualLock... ";
if (VirtualLock(data, chunkSize) == 0)
return -3;
//if (VirtualUnlock(data, chunkSize) == 0) // enable or disable to experiment with unlocks
// return -3;
//if (!SetProcessWorkingSetSize(GetCurrentProcess(), 5001001L, chunkSize * 2))
// return -1;
cout << "done" << endl;
Sleep(sleep);
cout << "Writes to the memory... ";
while (true)
{
int* end = (int*) (data + chunkSize);
for (int* d = (int*) data; d < end; d++)
*d = (int) d;
cout << "done ";
}
return 0;
}
Note that this code puts the thread to sleep after VirtualLock. According to a 2007 post by Raymond Chen, the OS is free to page it all out of physical RAM at this point and until the thread wakes up again. Note also that MSDN claims otherwise, saying that this memory will not be paged out, regardless of whether all threads are sleeping or not. On my system, they certainly remain in the physical RAM while the only thread is sleeping. I suspect Raymond's advice applied in 2007, but is no longer true in Win7.
I don't have enough reputation to comment, so I'll have to add this as an answer.
Note that this code puts the thread to sleep after VirtualLock. According to a 2007 post by Raymond Chen, the OS is free to page it all out of physical RAM at this point and until the thread wakes up again [...] I suspect Raymond's advice applied in 2007, but is no longer true in Win7.
What romkyns said has been confirmed by Raymond Chen in 2014. That is, when you lock memory with VirtualĀ­Lock, it will remain locked even if all your threads are blocked. He also says the fact that pages remain locked, may be just an implementation detail and not contractual.
This is probably not the case, because according to msdn, it is contractual
Pages that a process has locked remain in physical memory until the process unlocks them or terminates. These pages are guaranteed not to be written to the pagefile while they are locked.

Resources