go tool pprof -inuse_space much smaller than linux top shows - go

My program runs in background. I use linux top command, it shows 16g memory. But when I want to use go pprof -inuse_space to check the point, I gives only 200M. Where do the other memory go?

Generally, the memory used in os(shown by top VIRT) is larger than pprof. One reason is gc will happen when heap size > ($GOGC% + 1) * (reachable nodes size): https://blog.golang.org/go15gc. By default, $GOGC is 100, that means the memory size will be twice of the heap size shown by pprof. But you seem not to be in this case.

Related

How to get detailed memory breakdown in the TensorFlow profiler?

I'm using the new TensorFlow profiler to profile memory usage in my neural net, which I'm running on a Titan X GPU with 12GB RAM. Here's some example output when I profile my main training loop:
==================Model Analysis Report======================
node name | requested bytes | ...
Conv2DBackpropInput 10227.69MB (100.00%, 35.34%), ...
Conv2D 9679.95MB (64.66%, 33.45%), ...
Conv2DBackpropFilter 8073.89MB (31.21%, 27.90%), ...
Obviously this adds up to more than 12GB, so some of these matrices must be in main memory while others are on the GPU. I'd love to see a detailed breakdown of what variables are where at a given step. Is it possible to get more detailed information on where various parameters are stored (main or GPU memory), either with the profiler or otherwise?
"Requested bytes" shows a sum over all memory allocations, but that memory can be allocated and de-allocated. So just because "requested bytes" exceeds GPU RAM doesn't necessarily mean that memory is being transferred to CPU.
In particular, for a feedforward neural network, TF will normally keep around the forward activations, to make backprop efficient, but doesn't need to keep the intermediate backprop activations, i.e. dL/dh at each layer, so it can just throw away these intermediates after it's done with these. So I think in this case what you care about is the memory used by Conv2D, which is less than 12 GB.
You can also use the timeline to verify that total memory usage never exceeds 12 GB.

Why does Unix block size increase with bigger memory size?

I am profiling binary data which has
increasing Unix block size (one got from stat > Blocks) when the number of events are increased as in the following figure
but the byte distance between events stay constant
I have noticed some changes in other fields of the file which may explain the increasing Unix block size
The unix block size is a dynamic measure.
I am interested in why it is increasing with bigger memory units in some systems.
I have had an idea that it should be constant.
I used different environments to provide the stat output:
Debian Linux 8.1 with its default stat
OSX 10.8.5 with Xcode 6 and its default stat
Greybeard's comment may have the answer to the blocks behaviour:
The stat (1) command used to be a thin CLI to the stat (2) system
call, which used to transfer relevant parts of a file's inode. Pretty
early on, the meaning of the st_blksize member of the C struct
returned by stat (2) was changed to "preferred" blocksize for
efficient file system I/O, which carries well to file systems with
mixed block sizes or non-block oriented allocation.
How can you measure the block size in case (1) and (2) separately?
Why can the Unix block size increase with bigger memory size?
"Stat blocks" is not a block size. It is number of blocks the file consists of. It is obvious that number of blocks is proportional to size. Size of block is constant for most file systems (if not all).

Go memory consumption with many goroutines

I was trying to check how Go will perform with 100,000 goroutines. I wrote a simple program to spawn that many routines which does nothing but print some announcements. I restricted the MaxStack size to just 512 bytes. But what I noticed was the program size doesn't decrease with that. It was consuming around 460 MB of memory and hence around 4 KB per goroutine. My question is, can we set max stack size lower than the "minimum" stack size (which may be 4 KB) for the goroutines. How can we set the minimum Stack size that Goroutine starts with ?
Below is sample code I used for the test:
package main
import "fmt"
import "time"
import "runtime/debug"
func main() {
fmt.Printf("%v\n", debug.SetMaxStack(512))
var i int
for i = 0; i < 100000; i++ {
go func(x int) {
for {
time.Sleep(10 * time.Millisecond)
//fmt.Printf("I am %v\n", x)
}
}(i)
}
fmt.Println("Done")
time.Sleep(999999999999)
}
There's currently no way to set the minimum stack size for goroutines.
Go 1.2 increased the minimum size from 4KB to 8KB
The docs say:
"In Go 1.2, the minimum size of the stack when a goroutine is created has been lifted from 4KB to 8KB. Many programs were suffering performance problems with the old size, which had a tendency to introduce expensive stack-segment switching in performance-critical sections. The new number was determined by empirical testing."
But they go on to say:
"Updating: The increased minimum stack size may cause programs with many goroutines to use more memory. There is no workaround, but plans for future releases include new stack management technology that should address the problem better."
So you may have more luck in the future.
See http://golang.org/doc/go1.2#stack_size for more info.
The runtime/debug.SetMaxStack function only determines a what point does go consider a program infinitely recursive, and terminate it. http://golang.org/pkg/runtime/debug/#SetMaxStack
Setting it absurdly low does nothing to the minimum size of stacks, and only limits the maximum size by virtue of your program crashing when any stack's in-use size exceeds the limit.
Technically the crash only happens when the stack must be grown, so your program will die when a stack needs more than 8KB (or 4KB prior to go 1.2).
The reason why your program uses a minimum of 4KB * nGoroutines is because stacks are page-aligned, so there can never be more than one stack on a VM page. Therefore your program will use at least nGoroutines worth of pages, and OSes usually only measure and allocate memory in page-sized increments.
The only way to change the starting (minimum) size of a stack is to modify and recompile the go runtime (and possibly the compiler too).
Go 1.3 will include contiguous stacks, which are generally faster than the split stacks in Go 1.2 and earlier, and which may also lead to smaller initial stacks in the future.
Just a note: in Go 1.4: the minimum size of the goroutine stack has decreased from 8Kb to 2Kb.
And as per Go 1.13, it's still same -https://github.com/golang/go/blob/bbd25d26c0a86660fb3968137f16e74837b7a9c6/src/runtime/stack.go#L72:
// The minimum size of stack used by Go code
_StackMin = 2048

Why does the speed of memcpy() drop dramatically every 4KB?

I tested the speed of memcpy() noticing the speed drops dramatically at i*4KB. The result is as follow: the Y-axis is the speed(MB/second) and the X-axis is the size of buffer for memcpy(), increasing from 1KB to 2MB. Subfigure 2 and Subfigure 3 detail the part of 1KB-150KB and 1KB-32KB.
Environment:
CPU : Intel(R) Xeon(R) CPU E5620 # 2.40GHz
OS : 2.6.35-22-generic #33-Ubuntu
GCC compiler flags : -O3 -msse4 -DINTEL_SSE4 -Wall -std=c99
I guess it must be related to caches, but I can't find a reason from the following cache-unfriendly cases:
Why is my program slow when looping over exactly 8192 elements?
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
Since the performance degradation of these two cases are caused by unfriendly loops which read scattered bytes into the cache, wasting the rest of the space of a cache line.
Here is my code:
void memcpy_speed(unsigned long buf_size, unsigned long iters){
struct timeval start, end;
unsigned char * pbuff_1;
unsigned char * pbuff_2;
pbuff_1 = malloc(buf_size);
pbuff_2 = malloc(buf_size);
gettimeofday(&start, NULL);
for(int i = 0; i < iters; ++i){
memcpy(pbuff_2, pbuff_1, buf_size);
}
gettimeofday(&end, NULL);
printf("%5.3f\n", ((buf_size*iters)/(1.024*1.024))/((end.tv_sec - \
start.tv_sec)*1000*1000+(end.tv_usec - start.tv_usec)));
free(pbuff_1);
free(pbuff_2);
}
UPDATE
Considering suggestions from #usr, #ChrisW and #Leeor, I redid the test more precisely and the graph below shows the results. The buffer size is from 26KB to 38KB, and I tested it every other 64B(26KB, 26KB+64B, 26KB+128B, ......, 38KB). Each test loops 100,000 times in about 0.15 second. The interesting thing is the drop not only occurs exactly in 4KB boundary, but also comes out in 4*i+2 KB, with a much less falling amplitude.
PS
#Leeor offered a way to fill the drop, adding a 2KB dummy buffer between pbuff_1 and pbuff_2. It works, but I am not sure about Leeor's explanation.
Memory is usually organized in 4k pages (although there's also support for larger sizes). The virtual address space your program sees may be contiguous, but it's not necessarily the case in physical memory. The OS, which maintains a mapping of virtual to physical addresses (in the page map) would usually try to keep the physical pages together as well but that's not always possible and they may be fractured (especially on long usage where they may be swapped occasionally).
When your memory stream crosses a 4k page boundary, the CPU needs to stop and go fetch a new translation - if it already saw the page, it may be cached in the TLB, and the access is optimized to be the fastest, but if this is the first access (or if you have too many pages for the TLBs to hold on to), the CPU will have to stall the memory access and start a page walk over the page map entries - that's relatively long as each level is in fact a memory read by itself (on virtual machines it's even longer as each level may need a full pagewalk on the host).
Your memcpy function may have another issue - when first allocating memory, the OS would just build the pages to the pagemap, but mark them as unaccessed and unmodified due to internal optimizations. The first access may not only invoke a page walk, but possibly also an assist telling the OS that the page is going to be used (and stores into, for the target buffer pages), which would take an expensive transition to some OS handler.
In order to eliminate this noise, allocate the buffers once, perform several repetitions of the copy, and calculate the amortized time. That, on the other hand, would give you "warm" performance (i.e. after having the caches warmed up) so you'll see the cache sizes reflect on your graphs. If you want to get a "cold" effect while not suffering from paging latencies, you might want to flush the caches between iteration (just make sure you don't time that)
EDIT
Reread the question, and you seem to be doing a correct measurement. The problem with my explanation is that it should show a gradual increase after 4k*i, since on every such drop you pay the penalty again, but then should enjoy the free ride until the next 4k. It doesn't explain why there are such "spikes" and after them the speed returns to normal.
I think you are facing a similar issue to the critical stride issue linked in your question - when your buffer size is a nice round 4k, both buffers will align to the same sets in the cache and thrash each other. Your L1 is 32k, so it doesn't seem like an issue at first, but assuming the data L1 has 8 ways it's in fact a 4k wrap-around to the same sets, and you have 2*4k blocks with the exact same alignment (assuming the allocation was done contiguously) so they overlap on the same sets. It's enough that the LRU doesn't work exactly as you expect and you'll keep having conflicts.
To check this, i'd try to malloc a dummy buffer between pbuff_1 and pbuff_2, make it 2k large and hope that it breaks the alignment.
EDIT2:
Ok, since this works, it's time to elaborate a little. Say you assign two 4k arrays at ranges 0x1000-0x1fff and 0x2000-0x2fff. set 0 in your L1 will contain the lines at 0x1000 and 0x2000, set 1 will contain 0x1040 and 0x2040, and so on. At these sizes you don't have any issue with thrashing yet, they can all coexist without overflowing the associativity of the cache. However, everytime you perform an iteration you have a load and a store accessing the same set - i'm guessing this may cause a conflict in the HW. Worse - you'll need multiple iteration to copy a single line, meaning that you have a congestion of 8 loads + 8 stores (less if you vectorize, but still a lot), all directed at the same poor set, I'm pretty sure there's are a bunch of collisions hiding there.
I also see that Intel optimization guide has something to say specifically about that (see 3.6.8.2):
4-KByte memory aliasing occurs when the code accesses two different
memory locations with a 4-KByte offset between them. The 4-KByte
aliasing situation can manifest in a memory copy routine where the
addresses of the source buffer and destination buffer maintain a
constant offset and the constant offset happens to be a multiple of
the byte increment from one iteration to the next.
...
loads have to wait until stores have been retired before they can
continue. For example at offset 16, the load of the next iteration is
4-KByte aliased current iteration store, therefore the loop must wait
until the store operation completes, making the entire loop
serialized. The amount of time needed to wait decreases with larger
offset until offset of 96 resolves the issue (as there is no pending
stores by the time of the load with same address).
I expect it's because:
When the block size is a 4KB multiple, then malloc allocates new pages from the O/S.
When the block size is not a 4KB multiple, then malloc allocates a range from its (already allocated) heap.
When the pages are allocated from the O/S then they are 'cold': touching them for the first time is very expensive.
My guess is that, if you do a single memcpy before the first gettimeofday then that will 'warm' the allocated memory and you won't see this problem. Instead of doing an initial memcpy, even writing one byte into each allocated 4KB page might be enough to pre-warm the page.
Usually when I want a performance test like yours I code it as:
// Run in once to pre-warm the cache
runTest();
// Repeat
startTimer();
for (int i = count; i; --i)
runTest();
stopTimer();
// use a larger count if the duration is less than a few seconds
// repeat test 3 times to ensure that results are consistent
Since you are looping many times, I think arguments about pages not being mapped are irrelevant. In my opinion what you are seeing is the effect of hardware prefetcher not willing to cross page boundary in order not to cause (potentially unnecessary) page faults.

Memory used by a long running process on OS X

I want to ensure that a long-running number crunching algorithm doesn't use too much memory. The algorithm is written in C++ and runs on OS X. A drastically simplified version is:
int main() {
while (someCondition) {
// notice nothing is allocated on the heap
vector<int> v(10, 0);
}
}
I've profiled the code using Instruments (allocations and leaks). I don't see any leaks. And while the "live bytes" count looks fine (hovers around 20 MB) the "overall bytes" count keeps growing. What concerned me is when the "overall count" reached about 80 GB I received an OS X warning about lack of hard disk space (I have a 120 GB solid state disk). I don't know much about OS/process interaction so I thought I'd ask:
Is memory used by a long running process on a UNIX-based OS available to other processes before the first process is killed or no longer running?
Edit: Looks like I'm misinterpreting the "overall bytes" number in Instruments:Instruments ObjectAlloc: Explanation of Live Bytes & Overall Bytes. When I check out the process in Activity Monitor the "real memory" is essentially constant.
The reason you get a disk space warning is probably related to virtual memory allocation. Every time your process (or the OS) requests memory it is usually first "allocated" in backing-store - swap.
Total virtual memory is size of available swap plus RAM. I do not have access to OSX, and I know it plays by its own rules, but there must be a command that shows swap usage
swap -l (Solaris)
swap -s (Solaris)
free (linux)
The only command I came up with is vm_stat, plus top - it appears top is probably the closest to what I am talking about.

Resources