managed heap fragmentation - debugging

I am trying to understand how heap fragmenation works. What does the following output tell me?
Is this heap overly fragmented?
I have 243010 "free objects" with a total of 53304764 bytes. Are those "free object" spaces in the heap that once contained object but that are now garabage collected?
How can I force a fragmented heap to clean up?
!dumpheap -type Free -stat
total 243233 objects
Statistics:
MT Count TotalSize Class Name
0017d8b0 243010 53304764 Free

It depends on how your heap is organized. You should have a look at how much memory in Gen 0,1,2 is allocated and how much free memory you have there compared to the total used memory.
If you have 500 MB managed heap used but and 50 MB is free then you are doing pretty well. If you do memory intensive operations like creating many WPF controls and releasing them you need a lot more memory for a short time but .NET does not give the memory back to the OS once you allocated it. The GC tries to recognize allocation patterns and tends to keep your memory footprint high although your current heap size is way too big until your machine is running low on physical memory.
I found it much easier to use psscor2 for .NET 3.5 which has some cool commands like ListNearObj where you can find out which objects are around your memory holes (pinned objects?). With the commands from psscor2 you have much better chances to find out what is really going on in your heaps. Most commands are also available in SOS.dll in .NET 4 as well.
To answer your original question: Yes free objects are gaps on the managed heap which can simply be the free memory block after your last allocated object on a GC segement. Or if you do !DumpHeap with the start address of a GC segment you see the objects allocated in that managed heap segment along with your free objects which are GC collected objects.
This memory holes do normally happen in Gen2. The object addresses before and after the free object do tell you what potentially pinned objects are around your hole. From this you should be able to determine your allocation history and optimize it if you need to.
You can find the addresses of the GC Heaps with
0:021> !EEHeap -gc
Number of GC Heaps: 1
generation 0 starts at 0x101da9cc
generation 1 starts at 0x10061000
generation 2 starts at 0x02aa1000
ephemeral segment allocation context: none
segment begin allocated size
02aa0000 02aa1000** 03836a30 0xd95a30(14244400)
10060000 10061000** 103b8ff4 0x357ff4(3506164)
Large object heap starts at 0x03aa1000
segment begin allocated size
03aa0000 03aa1000 03b096f8 0x686f8(427768)
Total Size: Size: 0x115611c (18178332) bytes.
------------------------------
GC Heap Size: Size: 0x115611c (18178332) bytes.
There you see that you have heaps at 02aa1000 and 10061000.
With !DumpHeap 02aa1000 03836a30 you can dump the GC Heap segment.
!DumpHeap 02aa1000 03836a30
Address MT Size
...
037b7b88 5b408350 56
037b7bc0 60876d60 32
037b7be0 5b40838c 20
037b7bf4 5b408350 56
037b7c2c 5b408728 20
037b7c40 5fe4506c 16
037b7c50 60876d60 32
037b7c70 5b408728 20
037b7c84 5fe4506c 16
037b7c94 00135de8 519112 Free
0383685c 5b408728 20
03836870 5fe4506c 16
03836880 608c55b4 96
....
There you find your free memory blocks which was an object which was already GCed. You can dump the surrounding objects (the output is sorted address wise) to find out if they are pinned or have other unusual properties.

You have 50MB of RAM as Free space. This is not good.
Having .NET allocating blocks of 16MB from process, we have a fragmentation issue indeed.
There are plenty of reasons to fragmentation to occure in .NET.
Have a look here and here.
In your case it is possibly a pinning. As 53304764 / 243010 makes 219.35 bytes per object - much lower then LOH objects.

Related

Windows program has big native heap, much larger than all allocations

We are running a mixed mode process (managed + unmanaged) on Win 7 64 bit.
Our process is using up too much memory (especially VM). Based on our analysis, the majority of the memory is used by a big native heap. Our theory is that the LFH is saving too many free blocks in committed memory, for future allocations. They sum to about 1.2 GB while our actual allocated native memory is only at most 0.6 GB.
These numbers are from a test run of the process. In production it sometimes exceeded 10 GB of VM - with maybe 6 GB unaccounted for by known allocations.
We'd like to know if this theory of excessive committed-but-free-for-allocation segments is true, and how this waste can be reduced.
Here's the details of our analysis.
First we needed to figure out what's allocated and rule out memory leaks. We ran the excellent Heap Inspector by Jelle van der Beek and we ruled out a leak and established that the known allocations are at most 0.6 deci-GB.
We took a full memory dump and opened in WinDbg.
Ran !heap -stat
It reports a big native heap with 1.83 deci-GB committed memory. Much more than the sum of our allocations!
_HEAP 000000001b480000
Segments 00000078
Reserved bytes 0000000072980000
Committed bytes 000000006d597000
VirtAllocBlocks 0000001e
VirtAlloc bytes 0000000eb7a60118
Then we ran !heap -stat -h 0000001b480000
heap # 000000001b480000
group-by: TOTSIZE max-display: 20
size #blocks total ( %) (percent of total busy bytes)
c0000 12 - d80000 (10.54)
b0000 d - 8f0000 (6.98)
e0000 a - 8c0000 (6.83)
...
If we add up all the 20 reported items, they add up to 85 deci-MB - much less than the 1.79 deci-GB we're looking for.
We ran !heap -h 1b480000
...
Flags: 00001002
ForceFlags: 00000000
Granularity: 16 bytes
Segment Reserve: 72a70000
Segment Commit: 00002000
DeCommit Block Thres: 00000400
DeCommit Total Thres: 00001000
Total Free Size: 013b60f1
Max. Allocation Size: 000007fffffdefff
Lock Variable at: 000000001b480208
Next TagIndex: 0000
Maximum TagIndex: 0000
Tag Entries: 00000000
PsuedoTag Entries: 00000000
Virtual Alloc List: 1b480118
Unable to read nt!_HEAP_VIRTUAL_ALLOC_ENTRY structure at 000000002acf0000
Uncommitted ranges: 1b4800f8
FreeList[ 00 ] at 000000001b480158: 00000000be940080 . 0000000085828c60 (9451 blocks)
...
When adding up up all the segment sizes in the report, we get:
Total Size = 1.83 deci-GB
Segments Marked Busy Size = 1.50 deci-GB
Segments Marked Busy and Internal Size = 1.37 deci-GB
So all the committed bytes in this report do add up to the total commit size. We grouped on block size and the most heavy allocations come from blocks of size 0x3fff0. These don't correspond to allocations that we know of. There were also mystery blocks of other sizes.
We ran !heap -p -all. This reports the LFH internal segments but we don't understand it fully. Those 3fff0 sized blocks in the previous report appear in the LFH report with an asterisk mark and are sometimes Busy and sometimes Free. Then inside them we see many smaller free blocks.
We guess these free blocks are legitimate. They are committed VM that the LFH reserves for future allocations. But why is their total size so much greater than sum of memory allocations, and can this be reduced?
Well, I can sort of answer my own question.
We had been doing lots and lots of tiny allocations and deallocations in our program. There was no leak, but it seems this somehow created a fragmentation of some sort. After consolidating and eliminating most of these allocations our software is running much better and using less peak memory. It is still a mystery why the peak committed memory was so much higher than the peak actually-used memory.

Where is the heap?

I understand that in Linux the mm_struct describes the memory layout of a process. I also understand that the start_brk and brk mark the start and end of the heap section of a process respectively.
Now, this is my problem: I have a process, for which I wrote the source code, that allocates 5.25 GB of heap memory using malloc. However, when I examine the process's mm_sruct using a kernel module I find the value of is equal to 135168. And this is different from what I expected: I expected brk - start_brk to be equal slight above 5.25 GB.
So, what is going on here?
Thanks.
I notice the following in the manpage for malloc(3):
Normally, malloc() allocates memory from the heap, and adjusts the size of the heap as required, using sbrk(2). When allocating blocks of memory larger than MMAP_THRESHOLD bytes, the glibc malloc() implementation allocates the memory as a private anonymous mapping using mmap(2). MMAP_THRESHOLD is 128 kB by default, but is adjustable using mallopt(3). Allocations performed using mmap(2) are unaffected by the RLIMIT_DATA resource limit (see getrlimit(2)).
So it sounds like mmap is used instead of the heap.

How do I increase memory limit (contiguous as well as overall) in Matlab r2012b?

I am using Matlab r2012b on win7 32-bit with 4GB RAM.
However, the memory limit on Matlab process is pretty low. On memory command, I am getting the following output:
Maximum possible array: 385 MB (4.038e+08 bytes) *
Memory available for all arrays: 1281 MB (1.343e+09 bytes) **
Memory used by MATLAB: 421 MB (4.413e+08 bytes)
Physical Memory (RAM): 3496 MB (3.666e+09 bytes)
* Limited by contiguous virtual address space available.
** Limited by virtual address space available.
I need to increase the limit to as much as possible.
System: Windows 7 32 bit
RAM: 4 GB
Matlab: r2012b
For general guidance with memory management in MATLAB, see this MathWorks article. Some specific suggestions follow.
Set the /3GB switch in the boot.ini to increase the memory available to MATLAB. Or set it with a properties dialog if you are averse to text editors. This is mentioned in this section of the above MathWorks page.
Also use pack to increase the Maximum possible array by compacting the memory. The 32-bit MATLAB memory needs blocks of contiguous free memory, which is where this first value comes from. The pack command saves all the variables, clears the workspace, and reloads them so that they are contiguous in memory.
More on overall memory, try disabling the virtual machine, closing programs, stopping unnecessary Windows services. No easy answer for this part.

Page boundaries, implementing memory pool

I have decided to reinvent the wheel for a millionth time and write my own memory pool. My only question is about page size boundaries.
Let's say GetSystemInfo() call tells me that the page size is 4096 bytes. Now, I want to preallocate a memory area of 1MB (could be smaller, or larger), and divide this area into 128 byte blocks. HeapAlloc()/VirtualAlloc() will have an overhead between 8 and 16 bytes I guess. Might be some more, I've read posts talking about 60 bytes.
Question is, do I need to pay attention to not to have one of my 128 byte blocks across page boundaries?
Do I simply allocate 1MB in one chunk and divide it into my block size?
Or should I allocate many blocks of, say, 4000 bytes (to take into account HeapAlloc() overhead), and sub-divide this 4000 bytes into 128 byte blocks (4000 / 128 = 31 blocks, 128 bytes each) and not use the remaining bytes at all (4000 - 31x128 = 32 bytes in this example)?
Having a block cross a page boundary isn't a huge deal. It just means that if you try to access that block and it's completely swapped out, you'll get two page faults instead of one. The more important thing to worry about is the alignment of the block.
If you're using your small block to hold a structure that contains native types longer than 1 byte, you'll want to align it, otherwise you face potentially abysmal performance that will outweigh any performance gains you may have made by pooling.
The Windows pooling function ExAllocatePool describes its behaviour as follows:
If NumberOfBytes is PAGE_SIZE or greater, a page-aligned buffer is
allocated. Memory allocations of PAGE_SIZE or less do not cross page
boundaries. Memory allocations of less than PAGE_SIZE are not
necessarily page-aligned but are aligned to 8-byte boundaries in
32-bit systems and to 16-byte boundaries in 64-bit systems.
That's probably a reasonable model to follow.
I'm generally of the idea that larger is better when it comes to a pool. Within reason, of course, and depending on how you are going to use it. I don't see anything wrong with allocating 1 MB at a time (I've made pools that grow in 100 MB chunks). You want it to be worthwhile to have the pool in the first place. That is, have enough data in the same contiguous region of memory that you can take full advantage of cache locality.
I've found out that if I used _align_malloc(), I wouldn't need to worry wether spreading my sub-block to two pages would make any difference or not. An answer by Freddie to another thread (How to Allocate memory from a new virtual page in C?) also helped. Thanks Harry Johnston, I just wanted to use it as a memory pool object.

Fortran array memory management

I am working to optimize a fluid flow and heat transfer analysis program written in Fortran. As I try to run larger and larger mesh simulations, I'm running into memory limitation problems. The mesh, though, is not all that big. Only 500,000 cells and small-peanuts for a typical CFD code to run. Even when I request 80 GB of memory for my problem, it's crashing due to insufficient virtual memory.
I have a few guesses at what arrays are hogging up all that memory. One in particular is being allocated to (28801,345600). Correct me if I'm wrong in my calculations, but a double precision array is 8 bits per value. So the size of this array would be 28801*345600*8=79.6 GB?
Now, I think that most of this array ends up being zeros throughout the calculation so we don't need to store them. I think I can change the solution algorithm to only store the non-zero values to work on in a much smaller array. However, I want to be sure that I'm looking at the right arrays to reduce in size. So first, did I correctly calculate the array size above? And second, is there a way I can have Fortran show array sizes in MB or GB during runtime? In addition to printing out the most memory intensive arrays, I'd be interested in seeing how the memory requirements of the code are changing during runtime.
Memory usage is a quite vaguely defined concept on systems with virtual memory. You can have large amounts of memory allocated (large virtual memory size) but only a small part of it actually being actively used (small resident set size - RSS).
Unix systems provide the getrusage(2) system call that returns information about the amount of system resources in use by the calling thread/process/process children. In particular it provides the maxmimum value of the RSS ever reached since the process was started. You can write a simple Fortran callable helper C function that would call getrusage(2) and return the value of the ru_maxrss field of the rusage structure.
If you are running on Linux and don't care about portability, then you may just open and read from /proc/self/status. It is a simple text pseudofile that among other things contains several lines with statistics about the process virtual memory usage:
...
VmPeak: 9136 kB
VmSize: 7896 kB
VmLck: 0 kB
VmHWM: 7572 kB
VmRSS: 6316 kB
VmData: 5224 kB
VmStk: 88 kB
VmExe: 572 kB
VmLib: 1708 kB
VmPTE: 20 kB
...
Explanation of the various fields - here. You are mostly interested in VmData, VmRSS, VmHWM and VmSize. You can open /proc/self/status as a regular file with OPEN() and process it entirely in your Fortran code.
See also what memory limitations are set with ulimit -a and ulimit -aH. You may be exceeding the hard virtual memory size limit. If you are submitting jobs through a distributed resource manager (e.g. SGE/OGE, Torque/PBS, LSF, etc.) check that you request enough memory for the job.

Resources