How to get memory usage high water mark on OSX

I'd like to be able to test some guesses about memory complexity of various command line utilities.
Taking as a simple example
grep pattern file
I'd like to see how memory usage varies with the size of pattern and the size of file.
For time complexity, I'd make a guess, then run
time grep pattern file
on various sized inputs to see if my guess seems to be borne out in reality, but I don't know how to do this for memory.
One possibility would be a wrapper script that initiates the job and samples memory usage periodically, but this seems inelegant and unlikely to give the real high watermark.
I've seen time -v suggested, but don't have that flag available on my machine (running bash on OSX) and don't know where to find a version that supports it.
I've also seen that on Linux this information is available through the proc filesystem, but again, it's not available to me in my context.
I'm wondering if dtrace might be an appropriate tool, but again I'm concerned that a simple sample-based figure might not capture the true high water mark.
Does anyone know of a tool or approach that would be appropriate on OSX?
Edit
I removed two mentions of disk usage, which were just asides and perhaps distracted from the main thrust of the question.

Your question is interesting because, without the application source code, you need to make a few assumptions about what constitutes memory use. Even if you were to use procfs, the results will be misleading: both the resident set size and the total virtual address space will be over-estimates since they will include extraneous data such as the program text.
Particularly for small commands, it would be easier to track individual allocations, although even there you need to be sure to include all the possible sources. In addition to malloc() etc., a process can extend its heap with brk() or obtain anonymous memory using mmap().
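For illustration, here is a minimal C sketch (illustration only, sizes chosen arbitrarily) of an allocation that a malloc()-only trace would never see, because the memory comes straight from the kernel via mmap():
/* illustration only: anonymous memory obtained with mmap(); no malloc() probe
   ever fires for this region, so a malloc-only tracer would miss it */
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1 << 20;   /* one megabyte, chosen arbitrarily */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);   /* MAP_ANON is the BSD/OS X spelling */
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* ... use the memory ... */
    munmap(p, len);
    return 0;
}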
Here's a DTrace script that traces malloc(); you can extend it to include other allocating functions. Note that it isn't suitable for multi-threaded programs as it uses some non-atomic variables.
bash-3.2# cat hwm.d
/* find the maximum outstanding allocation provided by malloc() */
size_t total, high;

pid$target::malloc:entry
{
        /* remember the requested size for this thread */
        self->size = arg0;
}

pid$target::malloc:return
/arg1/
{
        /* on a successful return, record the allocation against the returned
           pointer and update the running total and high water mark */
        total += self->size;
        allocation[arg1] = self->size;
        high = (total > high) ? total : high;
}

pid$target::free:entry
/allocation[arg0]/
{
        /* subtract the size originally recorded for this pointer */
        total -= allocation[arg0];
        allocation[arg0] = 0;
}

END
{
        printf("High water mark was %d bytes.\n", high);
}
bash-3.2# dtrace -x evaltime=exec -qs hwm.d -c 'grep maximum hwm.d'
/* find the maximum outstanding allocation provided by malloc() */
High water mark was 62485 bytes.
bash-3.2#
A much more comprehensive discussion of memory allocators is contained in this article by Brendan Gregg. It provides a much better answer than my own to your question. In particular, it includes a link to a script called memleak.d; modify this to include time stamps for the allocations & deallocations, so that you can sort its output by time. Then, perhaps using the accompanying script as an example, use perl to track the current outstanding total allocation and high water mark. Such a DTrace/perl combination would be suitable for tracing multi-threaded processes.

You can use /usr/bin/time -l (which is not the shell's time builtin) on macOS and read the "maximum resident set size", which is not precisely the high water mark but might give you some idea.
$ /usr/bin/time -l ls
...
0.00 real 0.00 user 0.00 sys
925696 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
239 page reclaims
0 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
3 voluntary context switches
1 involuntary context switches
The meaning of this field is explained here.
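If you want to read the same counter from inside a program, here is a minimal sketch using getrusage(); ru_maxrss is the peak resident set size, which macOS reports in bytes (Linux reports it in kilobytes):
/* minimal sketch: grow the process a little, then print its peak RSS */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

int main(void)
{
    size_t bytes = 50 * 1024 * 1024;       /* arbitrary amount of memory to touch */
    char *block = malloc(bytes);
    if (block)
        memset(block, 1, bytes);           /* touch the pages so they become resident */

    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0)
        printf("maximum resident set size: %ld\n", (long)ru.ru_maxrss);

    free(block);
    return 0;
}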

Tried getrusage(). Inaccurate results. Tried Instruments. Pain in the arse.
Best solution by far: valgrind + massif.
command-line based: easy to run, script and automate; no apps to open, menus to click, blah blah; can run in background etc
provides a visual graph, in your terminal, of memory usage over time
valgrind --tool=massif /path/to/my_program arg1 ...
ms_print `ls -r massif.out.* | head -1` | grep Detailed -B50
To view more details, run ms_print `ls -r massif.out.* | head -1`

Related

Memory usage in MiB/KiB by PID

I need a command to get a process's memory usage in MiB or KiB by PID. I prefer the number I see in gnome-system-monitor in the "Memory" column. I have tried a lot of different commands, but all of them give me some other number.
For example, I've tried ps with different options. The most interesting was
ps aux. The RSS column is close, but the number there is greater than the one in gnome-system-monitor. A lot of answers to questions similar to mine suggest the top command, but it gives memory usage as a percentage.
cat /proc/YourPid/status
ps aux YourPid
top -p YourPid
pmap -x YourPid
All of them give process memory usage; top gives both a percentage and an absolute value (RES, the resident size).
Look at Understanding Process memory and How to measure actual memory usage of an application or process? for more details about memory usage and monitoring.
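If you would rather read the numbers programmatically than via those commands, here is a small Linux-specific sketch that prints the VmRSS (current) and VmHWM (peak) lines from /proc/<pid>/status:
/* sketch: print current and peak resident set size for a PID (or "self") */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    char path[64], line[256];
    snprintf(path, sizeof path, "/proc/%s/status", argc > 1 ? argv[1] : "self");

    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return 1; }

    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmRSS:", 6) == 0 || strncmp(line, "VmHWM:", 6) == 0)
            fputs(line, stdout);           /* both values are reported in kB */
    }
    fclose(f);
    return 0;
}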

Need bash script that constantly uses high memory but low cpu?

I am running a few experiments to see changes in system behavior under different memory and CPU loads. I was wondering: is there a bash script that constantly uses a lot of memory but little CPU?
For the purpose of simulating CPU/memory/IO load, most *NIX systems (Linux included) provide a handy tool called stress.
The tool varies from OS to OS. On Linux, to take up 512MB of RAM with low CPU load:
stress --vm 1 --vm-bytes 512M --vm-hang 100
(The invocation means: start one memory worker (--vm 1), have it allocate 512MB (--vm-bytes 512M), and sleep for 100 seconds before freeing the memory (--vm-hang 100).)
Doing this in pure bash is silly and can't reasonably be expected to provide data that will be useful in any real-world scenario. However, to consume at least a given power-of-two number of bytes:
build_string() {
    local pow=$1
    local dest=$2
    local i s=' '                      # start from a single-byte string
    for (( i=0; i<pow; i++ )); do
        s+="$s"                        # double the string on each pass
    done
    printf -v "$dest" %s "$s"          # store the result in the variable named by $dest
}
build_string 10 kilobyte # build a string of length 1024
echo "Kilobyte string consumes ${#kilobyte} bytes"
build_string 20 megabyte # build a string of length 1048576
echo "Megabyte string consumes ${#megabyte} bytes"
Note that, transiently, during construction at least 2x the requested space will be required (the local s and the destination variable both hold a copy); a version without this behavior would either use namevars (requiring bash 4.3) or eval (depending on the author's willingness to do evil).
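If pure bash feels too contrived, a rough C equivalent of what stress --vm does might look like the sketch below; the 512 MiB figure is an arbitrary choice, and touching every page is what actually makes the memory resident:
/* sketch: hold a fixed amount of memory while using almost no CPU */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    size_t bytes = 512UL * 1024 * 1024;    /* arbitrary: 512 MiB */
    char *block = malloc(bytes);
    if (!block) { perror("malloc"); return 1; }

    memset(block, 1, bytes);               /* touch every page so it is really resident */

    for (;;)
        sleep(60);                         /* keep the memory while staying idle */
}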

Is it possible to know the address of a cache miss?

Whenever a cache miss occurs, is it possible to know the address of that missed cache line? Are there any hardware performance counters in modern processors that can provide such information?
Yes, on modern Intel hardware there are precise memory sampling events that track not only the address of the instruction, but the data address as well. These events also include a great deal of other information, such as the level of the cache hierarchy the memory access was satisfied in, the total latency, and so on.
You can use perf mem to sample this information and produce a report.
For example, the following program:
#include <stddef.h>

#define SIZE (100 * 1024 * 1024)

int p[SIZE] = {1};

void do_writes(volatile int *p) {
    for (size_t i = 0; i < SIZE; i += 5) {
        p[i] = 42;
    }
}

void do_reads(volatile int *p) {
    volatile int sink;
    for (size_t i = 0; i < SIZE; i += 5) {
        sink = p[i];
    }
}

int main(int argc, char **argv) {
    do_writes(p);
    do_reads(p);
}
compiled with:
g++ -g -O1 -march=native perf-mem-test.cpp -o perf-mem-test
and run with:
sudo perf mem record -U ./perf-mem-test && sudo perf mem report
Produces a report of memory accesses sorted by latency.
The Data Symbol column shows which address the load was targeting; most entries here show up as something like p+0xa0658b4, meaning an offset of 0xa0658b4 from the start of p, which makes sense as the code is reading and writing p. The list is sorted by "local weight", which is the access latency in reference cycles1.
Note that the information recorded is only a sample of memory accesses: recording every miss would usually be way too much information. Furthermore, it only records loads with a latency of 30 cycles or more by default, but you can apparently tweak this with command line arguments.
If you're only interested in accesses that miss in all levels of cache, you're looking for the "Local RAM hit" lines2. Perhaps you can restrict your sampling to only cache misses - I'm pretty sure the Intel memory sampling stuff supports that, and I think you can tell perf mem to look at only misses.
Finally, note that here I'm using the -U argument after record, which instructs perf mem to record only userspace events. By default it will include kernel events, which may or may not be useful for you. For the example program, there are many kernel events associated with copying the p array from the binary into writable process memory.
Keep in mind that I specifically arranged my program so that the global array p ended up in the initialized .data section (the binary is ~400 MB!), so that it shows up with the right symbol in the listing. The vast majority of the time your process is going to be accessing dynamically allocated or stack memory, which will just give you a raw address. Whether you can map this back to a meaningful object depends on whether you track enough information to make that possible.
1 I think it's in reference cycles, but I could be wrong; the kernel may have already converted it to nanoseconds.
2 The "Local" and "hit" parts here refer to the fact that we hit the RAM attached to the current core, i.e., we didn't have to go to the RAM associated with another socket in a multi-socket NUMA configuration.
If you want to know the exact virtual or physical address of every cache miss on a particular processor, that would be very hard and sometimes impossible. But you are more likely to be interested in expensive memory access patterns: patterns that incur large latencies because they miss in one or more levels of the cache subsystem. Keep in mind that a cache miss on one processor might be a cache hit on another, depending on the design details of each processor and also on the operating system.
There are several ways to find such patterns; two are commonly used. One is to use a simulator such as gem5 or Sniper. The other is to use hardware performance events. Events that represent cache misses are available, but they do not provide any details on why or where a miss occurred. However, using a profiler, you can approximately associate cache misses, as reported by the corresponding hardware performance events, with the instructions that caused them, which in turn can be mapped back to locations in the source code using debug information. Examples of such profilers include Intel VTune Amplifier and AMD CodeXL. The results produced by simulators and profilers may not be accurate, so you have to be careful when interpreting them.

Memory used by a long running process on OS X

I want to ensure that a long-running number crunching algorithm doesn't use too much memory. The algorithm is written in C++ and runs on OS X. A drastically simplified version is:
int main() {
    while (someCondition) {
        // notice nothing is allocated on the heap
        vector<int> v(10, 0);
    }
}
I've profiled the code using Instruments (allocations and leaks). I don't see any leaks. And while the "live bytes" count looks fine (it hovers around 20 MB), the "overall bytes" count keeps growing. What concerned me is that when the "overall bytes" count reached about 80 GB, I received an OS X warning about lack of hard disk space (I have a 120 GB solid state disk). I don't know much about OS/process interaction, so I thought I'd ask:
Is memory used by a long running process on a UNIX-based OS available to other processes before the first process is killed or no longer running?
Edit: Looks like I was misinterpreting the "overall bytes" number in Instruments; see Instruments ObjectAlloc: Explanation of Live Bytes & Overall Bytes. When I check the process in Activity Monitor, the "real memory" figure is essentially constant.
The reason you get a disk space warning is probably related to virtual memory allocation. Every time your process (or the OS) requests memory, it is usually first "allocated" in the backing store, i.e. swap.
Total virtual memory is the size of available swap plus RAM. I do not have access to OS X, and I know it plays by its own rules, but there must be a command that shows swap usage:
swap -l (Solaris)
swap -s (Solaris)
free (linux)
The only commands I came up with are vm_stat, plus top; it appears top is probably the closest to what I am talking about. (On OS X, sysctl vm.swapusage also reports current swap usage.)
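If you would rather query the process itself than read Activity Monitor, here is a hedged sketch using the Mach task_info() call; it assumes the mach_task_basic_info flavour available on recent OS X releases and only inspects the calling process (inspecting another process needs task_for_pid() and extra privileges):
/* sketch: print the calling process's virtual, resident and peak resident size */
#include <mach/mach.h>
#include <stdio.h>

int main(void)
{
    mach_task_basic_info_data_t info;
    mach_msg_type_number_t count = MACH_TASK_BASIC_INFO_COUNT;

    kern_return_t kr = task_info(mach_task_self(), MACH_TASK_BASIC_INFO,
                                 (task_info_t)&info, &count);
    if (kr != KERN_SUCCESS) {
        fprintf(stderr, "task_info failed: %d\n", (int)kr);
        return 1;
    }

    printf("virtual size:       %llu bytes\n", (unsigned long long)info.virtual_size);
    printf("resident size:      %llu bytes\n", (unsigned long long)info.resident_size);
    printf("peak resident size: %llu bytes\n", (unsigned long long)info.resident_size_max);
    return 0;
}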

Which (OS X) dtrace probe fires when a page is faulted in from disk?

I'm writing up a document about page faulting and am trying to get some concrete numbers to work with, so I wrote up a simple program that reads 12*1024*1024 bytes of data. Easy:
#include <stdio.h>

int main()
{
    FILE *in = fopen("data.bin", "rb");
    int i;
    int total = 0;
    for (i = 0; i < 1024*1024*12; i++)
        total += fgetc(in);
    printf("%d\n", total);
}
So yes, it goes through and reads the entire file. The issue is that I need the dtrace probe that is going to fire 1536 times during this process (12M/8k). Even if I count all of the fbt:mach_kernel:vm_fault*: probes and all of the vminfo::: probes, I don't hit 500, so I know I'm not finding the right probes.
Anyone know where I can find the dtrace probes that fire when a page is faulted in from disk?
UPDATE:
On the off chance that the issue was that there was some intelligent pre-fetching going on in the stdio functions, I tried the following:
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main()
{
    int in = open("data.bin", O_RDONLY | O_NONBLOCK);
    int i;
    int total = 0;
    char buf[128];
    for (i = 0; i < 1024*1024*12; i++)
    {
        read(in, buf, 1);
        total += buf[0];
    }
    printf("%d\n", total);
}
This version takes MUCH longer to run (42s real time, 10s of which was user and the rest was system time - page faults, I'm guessing) but still generates one fifth as many faults as I would expect.
For the curious, the time increase is not due to loop overhead and casting (char to int): a version of the code that does just those actions takes 0.07 seconds.
Not a direct answer, but it seems you are equating disk reads and page faults. They are not necessarily the same. In your code you are reading data from a file into a small user memory chunk, so the I/O system can read the file into the buffer/VM cache in any way and size it sees fit. I might be wrong here, I don't know how Darwin does this.
I think the more reliable test would be to mmap(2) the whole file into process memory and then touch each page in that space, along the lines of the sketch below.
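Something like this (a sketch only; the file name matches the question, and the page size comes from sysconf() rather than being assumed):
/* sketch: map the file and read one byte from every page so each page
   has to be brought in from disk (modulo readahead/clustering) */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) { perror("data.bin"); return 1; }

    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    long page = sysconf(_SC_PAGESIZE);
    long total = 0;
    for (off_t off = 0; off < st.st_size; off += page)
        total += p[off];                   /* touching the page forces it to be faulted in */

    printf("%ld\n", total);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}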
I was down the same rathole recently. I don't have my DTrace scripts or test programs available just now, but I will give you the following advice:
1.) Get your hands on OS X Internals by Amit Singh and read section 8.3 on virtual memory (this will get you in the right frame of reference for selecting DTrace probes).
2.) Get your hands on Solaris Performance and Tools by Brendan Gregg / Jim Mauro. Read the section on virtual memory and pay close attention to the example DTrace scripts that make use of the vminfo provider.
3.) OS X is definitely prefetching large chunks of pages from the filesystem, and your test program is playing right into this optimization (since you're reading sequentially). Interestingly, this is not the case for Solaris. Try randomly accessing the big array to defeat the prefetch.
The assumption that the operating system will fault in each and every page that's being touched as a separate operation (and that, therefore, if you touch N pages, you'll see the DTrace probe fire N times) is flawed; most UN*Xes will perform some sort of readahead or pre-faulting, and you're very unlikely to see exactly as many faults as you have pages. This is so even if you use mmap() directly.
The exact ratio may also depend on the filesystem, as readahead and page clustering implementations and thresholds are unlikely to be the same for all of them.
You can probably force a per-page fault policy if you use mmap directly and then apply madvise(MADV_DONTNEED) or similar, and/or purge the entire range with msync(MS_INVALIDATE); see the sketch below.
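A speculative sketch of that idea, reusing the p, size and page variables from the mapping sketch above; both calls are only hints, may fail for some mapping types, and may or may not defeat clustering on OS X (return values are deliberately ignored here):
/* speculative: drop the pages of an existing mapping and touch them again,
   hoping each touch now becomes a separate fault (hints only; not guaranteed) */
#include <stddef.h>
#include <sys/mman.h>

void refault(char *p, size_t size, long page)
{
    msync(p, size, MS_INVALIDATE);     /* ask for cached copies of the pages to be invalidated */
    madvise(p, size, MADV_DONTNEED);   /* hint that the pages are no longer needed */

    for (size_t off = 0; off < size; off += (size_t)page)
        (void)*(volatile char *)(p + off);   /* touch each page again */
}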
