Windows program has big native heap, much larger than all allocations - windows

We are running a mixed mode process (managed + unmanaged) on Win 7 64 bit.
Our process is using up too much memory (especially VM). Based on our analysis, the majority of the memory is used by a big native heap. Our theory is that the LFH is saving too many free blocks in committed memory, for future allocations. They sum to about 1.2 GB while our actual allocated native memory is only at most 0.6 GB.
These numbers are from a test run of the process. In production it sometimes exceeded 10 GB of VM - with maybe 6 GB unaccounted for by known allocations.
We'd like to know if this theory of excessive committed-but-free-for-allocation segments is true, and how this waste can be reduced.
Here are the details of our analysis.
First we needed to figure out what's allocated and rule out memory leaks. We ran the excellent Heap Inspector by Jelle van der Beek, ruled out a leak, and established that the known allocations are at most 0.6 deci-GB (decimal GB, 10^9 bytes).
We took a full memory dump and opened it in WinDbg.
Ran !heap -stat
It reports a big native heap with 1.83 deci-GB committed memory. Much more than the sum of our allocations!
_HEAP 000000001b480000
Segments 00000078
Reserved bytes 0000000072980000
Committed bytes 000000006d597000
VirtAllocBlocks 0000001e
VirtAlloc bytes 0000000eb7a60118
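As a quick sanity check on the figures above, the hexadecimal counts can be converted to decimal gigabytes. This is only a minimal sketch: the helper name hex_bytes_to_gb is hypothetical, and it assumes the Reserved and Committed fields are plain byte counts.

```python
def hex_bytes_to_gb(hex_value: str) -> float:
    """Convert a hexadecimal byte count to decimal gigabytes (10^9 bytes)."""
    return int(hex_value, 16) / 1e9


# Values copied from the !heap output above.
reserved_gb = hex_bytes_to_gb("0000000072980000")
committed_gb = hex_bytes_to_gb("000000006d597000")

print(f"Reserved:  {reserved_gb:.2f} GB")
print(f"Committed: {committed_gb:.2f} GB")
```

Under that assumption the committed figure comes out at about 1.83 decimal GB, which matches the number quoted for this heap.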
Then we ran !heap -stat -h 0000001b480000
heap # 000000001b480000
group-by: TOTSIZE max-display: 20
size #blocks total ( %) (percent of total busy bytes)
c0000 12 - d80000 (10.54)
b0000 d - 8f0000 (6.98)
e0000 a - 8c0000 (6.83)
If we add up all the 20 reported items, they add up to 85 deci-MB - much less than the 1.79 deci-GB we're looking for.
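The rows of the !heap -stat table can be checked the same way. The sketch below covers only the three rows quoted above (the other 17 are not reproduced here), and the function name sum_stat_rows is hypothetical. It treats all three columns as hexadecimal, which the arithmetic bears out: 0xc0000 * 0x12 = 0xd80000.

```python
# (size, #blocks, total) exactly as printed by !heap -stat, all in hex.
rows = [
    ("c0000", "12", "d80000"),
    ("b0000", "d", "8f0000"),
    ("e0000", "a", "8c0000"),
]


def sum_stat_rows(stat_rows):
    """Check size * blocks == total for each row and return the summed total in bytes."""
    grand_total = 0
    for size_hex, blocks_hex, total_hex in stat_rows:
        size = int(size_hex, 16)
        blocks = int(blocks_hex, 16)
        reported = int(total_hex, 16)
        if size * blocks != reported:
            raise ValueError(f"row {size_hex} does not add up")
        grand_total += reported
    return grand_total


shown_total = sum_stat_rows(rows)
print(f"{shown_total / 1e6:.1f} MB in the three rows shown")
```

The three largest groups account for roughly 32.7 MB, so the busy blocks this command reports are a small fraction of the committed size.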
We ran !heap -h 1b480000
Flags: 00001002
ForceFlags: 00000000
Granularity: 16 bytes
Segment Reserve: 72a70000
Segment Commit: 00002000
DeCommit Block Thres: 00000400
DeCommit Total Thres: 00001000
Total Free Size: 013b60f1
Max. Allocation Size: 000007fffffdefff
Lock Variable at: 000000001b480208
Next TagIndex: 0000
Maximum TagIndex: 0000
Tag Entries: 00000000
PsuedoTag Entries: 00000000
Virtual Alloc List: 1b480118
Unable to read nt!_HEAP_VIRTUAL_ALLOC_ENTRY structure at 000000002acf0000
Uncommitted ranges: 1b4800f8
FreeList[ 00 ] at 000000001b480158: 00000000be940080 . 0000000085828c60 (9451 blocks)
When adding up all the segment sizes in the report, we get:
Total Size = 1.83 deci-GB
Segments Marked Busy Size = 1.50 deci-GB
Segments Marked Busy and Internal Size = 1.37 deci-GB
So all the committed bytes in this report do add up to the total commit size. We grouped on block size and the most heavy allocations come from blocks of size 0x3fff0. These don't correspond to allocations that we know of. There were also mystery blocks of other sizes.
We ran !heap -p -all. This reports the LFH internal segments but we don't understand it fully. Those 3fff0 sized blocks in the previous report appear in the LFH report with an asterisk mark and are sometimes Busy and sometimes Free. Then inside them we see many smaller free blocks.
We guess these free blocks are legitimate. They are committed VM that the LFH reserves for future allocations. But why is their total size so much greater than the sum of our memory allocations, and can this be reduced?

Well, I can sort of answer my own question.
We had been doing lots and lots of tiny allocations and deallocations in our program. There was no leak, but it seems this somehow created a fragmentation of some sort. After consolidating and eliminating most of these allocations our software is running much better and using less peak memory. It is still a mystery why the peak committed memory was so much higher than the peak actually-used memory.


Can all of L2/L3 cache be used by data? If so, why does the Graviton 3 bandwidth plot drop off after half the L2/L3 size, but only gradually?

Consider Graviton3, for example. It's a 64-core CPU with per-core caches 64KiB L1d and 1MiB L2. And a shared L3 of 64MiB across all cores. The RAM bandwidth per socket is 307GB/s (source).
In this plot (source),
we see that all-cores bandwidth drops off to roughly half, when the data exceeds 4MB. This makes sense: 64x 64KiB = 4 MiB is the size of the L1 data cache.
But why does the next cliff begin at 32MB? And why is the drop-off so gradual there? The private L2 caches of 64 cores is a total of 64 MiB, same as the shared L3 size.
It looks from the plot like they may not have tested any sizes between 32M and 64M. Looks like a straight line between those points on all 3 CPUs.
Since 64M is the total size of both L2 and L3, I'd expect a test like this to have slowed most of the way down at 64M. As Brendan says, page tables and a bit of code will take space, competing with the actual intended test data. If the benchmark loop is tight, stack won't come into play, except for interrupt handling.
Once you're evicting anything from a working set slightly larger than cache, you often evict almost everything before getting back to it, depending on pseudo-LRU luck. I'd expect a test size of 48 or even 56 MiB to be a lot closer to the 64 MiB data point than the 32 MiB data point.
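The "evict almost everything before getting back to it" effect is easy to reproduce in a toy model. The sketch below is not a model of Graviton 3: it assumes a fully associative cache with true LRU replacement and a purely sequential, repeated sweep over the working set, and the function name steady_state_hit_rate is made up for illustration. Real hardware is set-associative with pseudo-LRU, so the cliff is softer in practice.

```python
from collections import OrderedDict


def steady_state_hit_rate(cache_lines: int, working_set_lines: int, passes: int = 4) -> float:
    """Hit rate of repeated sequential sweeps over a working set, after a warm-up sweep."""
    cache = OrderedDict()  # keys are line numbers, ordered from oldest to newest use
    hits = accesses = 0
    for sweep in range(passes):
        for line in range(working_set_lines):
            hit = line in cache
            if sweep > 0:  # ignore the cold first sweep
                accesses += 1
                hits += hit
            if hit:
                cache.move_to_end(line)  # mark as most recently used
            else:
                if len(cache) >= cache_lines:
                    cache.popitem(last=False)  # evict the least recently used line
                cache[line] = True
    return hits / accesses


for working_set in (512, 1024, 1025, 2048):
    print(working_set, steady_state_hit_rate(1024, working_set))
```

With a 1024-line cache, a working set of 1024 lines hits every time, while 1025 lines already misses on every access: each line is evicted just before the sweep comes back to it.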
Can all of L2/L3 cache be used by data?
In theory, yes; but only if there's no "non-data" (code) in the cache, only if you count "all data" (and don't just count a process' data and ignore things like stack and page tables), and only if there aren't any aliasing problems.
But why does the next cliff begin at 32MB? And why is the drop-off so gradual there?
For a fully associative cache I'd expect a sudden drop off at/near 32 MiB. However, large caches are almost never fully associative as it costs way too much to find anything in the cache.
As associativity decreases the chance of conflicts increases. For example, for an 8-way associative 64 MiB cache the pathological case is that everything conflicts and you're only able to effectively use 8 MiB of it.
More specifically, for a 64 MiB cache (with unknown associativity), and an "assumed Linux" environment that lacks support for cache coloring, it's reasonable to expect a smooth drop off that ends at 64 MiB.
Just to be clear, on a running Graviton 3 in AWS, an lscpu gives me 32MiB for L3 and not 64 MiB.
Caches (sum of all):
L1d: 4 MiB (64 instances)
L1i: 4 MiB (64 instances)
L2: 64 MiB (64 instances)
L3: 32 MiB (1 instance)
The original question is assuming an L3 of 64 MiB across all cores.
But why does the next cliff begin at 32MB? And why is the drop-off so gradual there? The private L2 caches of 64 cores is a total of 64 MiB, same as the shared L3 size.

Relation between size of address bus and memory size; memory Segmentation in 8086

My question is related to memory segmentation in 8086. I learnt that,
8086 has a 20 bit address bus. And so it can address 2^20 different addresses. Which means it has a memory size of 2^20, i.e, 1MB.
I have a few doubts:
What I understand from the fact that 8086 has a 20 bit address bus is that it could have 2^20 different combinations of 0s and 1s, each of which represents one physical address. What I don't understand is that how does 2^20 different address locations mean 1 MB of addressable memory? How is total number of different addresses locations related to memory size (in Megabytes)?
Also, correct me if I'm wrong, the 16 bit segment registers in 8086 hold the starting address of the different segments in the memory (Code, Stack, Data, Extra). My question is, aren't the addresses in memory 20 bits long? Then how can a 16 bit register hold a 20 bit address? If it contains the upper 16 bits of the 20 bit address, how does the processor work out which exact address location it has to point to?
P.S: I am a beginner in microprocessors and totally reliant on self study, so kindly excuse me if my questions seem a bit silly.
Thanks in advance.
For this question, it's important to remember there is a difference between the number of possible memory addresses and the amount of actual memory (RAM) installed in the system. For the 8086, memory addresses are 20 bits long, as you note, so there are 2^20 possible memory addresses. Each address refers to one byte, so the address space is exactly 1 MiB in size (1 MiB is 1024, or 2^10, KiB and 1 KiB is 1024, or 2^10, bytes). This does NOT necessarily mean the system has 1 MiB worth of RAM. It very likely has less, but the most the 8086 could possibly address is 1 MiB; so if nothing but RAM was in the address space, the most RAM it could possibly have is 1 MiB. Frequently there are gaps in the address space not filled with anything, and some of the address space is used for ROM or other peripherals. So the size of the address space is 1 MiB, but that does not mean there is 1 MiB of RAM/memory in the system.
Correct, the segment registers are all 16 bits wide on the 8086. A memory address is created by combining the appropriate segment register with the argument (the argument being the result of whatever addressing mode the instruction uses): the argument is added to the segment register's value shifted left by 4 bits. So if, for example, ss is 0x1111, sp is at 0x2222 and you perform a push ax instruction, the 20-bit address to which the value is pushed is (ss << 4) + sp or 0x11110 + 0x02222 = 0x13332. More information can be found on Wikipedia under the Real Mode section.
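The segment:offset calculation described above can be written out as a few lines of code. This is just an illustrative sketch; the function name physical_address is hypothetical, and the mask reflects the fact that the 8086 has only 20 address lines, so any carry out of bit 19 is lost.

```python
ADDRESS_BITS = 20
ADDRESS_SPACE = 1 << ADDRESS_BITS  # 2^20 byte addresses = 1 MiB


def physical_address(segment: int, offset: int) -> int:
    """Combine a 16-bit segment and a 16-bit offset into a 20-bit physical address."""
    return ((segment << 4) + offset) & (ADDRESS_SPACE - 1)


# The example from the answer: ss = 0x1111, sp = 0x2222.
print(hex(physical_address(0x1111, 0x2222)))  # 0x13332
print(ADDRESS_SPACE // 1024, "KiB addressable")  # 1024 KiB addressable
```

Note that many different segment:offset pairs map to the same physical address; for instance 0x1333:0x0002 also gives 0x13332.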

Committed Bytes and Commit Limit - Memory Statistics

I'm trying to understand the actual difference between committed bytes and commit limit.
From the definitions below,
Commit Limit is the amount of virtual memory that can be committed
without having to extend the paging file(s). It is measured in bytes.
Committed memory is the physical memory which has space reserved on the disk paging files.
Committed Bytes is the amount of committed virtual memory, in bytes.
From my computer's configuration, I see that my Physical Memory is 1991 MB, Virtual Memory (total paging file size for all drives) is 1991 MB, and
Minimum Allowed is 16 MB, Recommended is 2986 MB and Currently Allocated is 1991 MB.
But when I open perfmon and monitor Committed Bytes and Commit Limit, the numbers differ a lot. So what exactly are Committed Bytes and Commit Limit, and where do they come from?
Right now in perfmon, Committed Bytes is running at 3041 MB (sometimes it goes to 4000 MB as well) and Commit Limit is 4177 MB. How are they calculated? Kindly explain. I've read a lot of documents but I still don't understand how this works.
Please help. Thanks.

How do I increase memory limit (contiguous as well as overall) in Matlab r2012b?

I am using Matlab r2012b on win7 32-bit with 4GB RAM.
However, the memory limit on the Matlab process is pretty low. Running the memory command gives the following output:
Maximum possible array: 385 MB (4.038e+08 bytes) *
Memory available for all arrays: 1281 MB (1.343e+09 bytes) **
Memory used by MATLAB: 421 MB (4.413e+08 bytes)
Physical Memory (RAM): 3496 MB (3.666e+09 bytes)
* Limited by contiguous virtual address space available.
** Limited by virtual address space available.
I need to increase the limit to as much as possible.
System: Windows 7 32 bit
Matlab: r2012b
For general guidance with memory management in MATLAB, see this MathWorks article. Some specific suggestions follow.
Set the /3GB switch in the boot.ini to increase the memory available to MATLAB. Or set it with a properties dialog if you are averse to text editors. This is mentioned in this section of the above MathWorks page.
Also use pack to increase the Maximum possible array by compacting the memory. The 32-bit MATLAB memory needs blocks of contiguous free memory, which is where this first value comes from. The pack command saves all the variables, clears the workspace, and reloads them so that they are contiguous in memory.
For more overall memory, try disabling the virtual machine, closing programs, and stopping unnecessary Windows services. There is no easy answer for this part.

managed heap fragmentation

I am trying to understand how heap fragmentation works. What does the following output tell me?
Is this heap overly fragmented?
I have 243010 "free objects" with a total of 53304764 bytes. Are those "free objects" spaces in the heap that once contained objects but have now been garbage collected?
How can I force a fragmented heap to clean up?
!dumpheap -type Free -stat
total 243233 objects
MT Count TotalSize Class Name
0017d8b0 243010 53304764 Free
It depends on how your heap is organized. You should have a look at how much memory in Gen 0,1,2 is allocated and how much free memory you have there compared to the total used memory.
If you have 500 MB of managed heap in use and 50 MB is free, then you are doing pretty well. If you do memory intensive operations like creating many WPF controls and releasing them, you need a lot more memory for a short time, but .NET does not give the memory back to the OS once you have allocated it. The GC tries to recognize allocation patterns and tends to keep your memory footprint high, even when your current heap size is far bigger than needed, until your machine is running low on physical memory.
I found it much easier to use psscor2 for .NET 3.5 which has some cool commands like ListNearObj where you can find out which objects are around your memory holes (pinned objects?). With the commands from psscor2 you have much better chances to find out what is really going on in your heaps. Most commands are also available in SOS.dll in .NET 4 as well.
To answer your original question: yes, free objects are gaps on the managed heap. A gap can simply be the free memory block after your last allocated object on a GC segment. If you run !DumpHeap with the start address of a GC segment, you see the objects allocated in that managed heap segment along with your free objects, which are the remains of GC collected objects.
These memory holes normally occur in Gen2. The object addresses before and after the free object tell you which potentially pinned objects are around your hole. From this you should be able to determine your allocation history and optimize it if you need to.
You can find the addresses of the GC Heaps with
0:021> !EEHeap -gc
Number of GC Heaps: 1
generation 0 starts at 0x101da9cc
generation 1 starts at 0x10061000
generation 2 starts at 0x02aa1000
ephemeral segment allocation context: none
segment begin allocated size
02aa0000 02aa1000** 03836a30 0xd95a30(14244400)
10060000 10061000** 103b8ff4 0x357ff4(3506164)
Large object heap starts at 0x03aa1000
segment begin allocated size
03aa0000 03aa1000 03b096f8 0x686f8(427768)
Total Size: Size: 0x115611c (18178332) bytes.
GC Heap Size: Size: 0x115611c (18178332) bytes.
There you see that you have heaps at 02aa1000 and 10061000.
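To see how the size column in the !EEHeap -gc output is derived, you can subtract the begin address from the allocated address of each segment. A minimal sketch, with the hypothetical helper segment_size and the addresses copied from the output above:

```python
# (begin, allocated) address pairs from the !EEHeap -gc output, in hex.
segments = [
    ("02aa1000", "03836a30"),  # generation 2 segment
    ("10061000", "103b8ff4"),  # ephemeral segment (generations 0 and 1)
    ("03aa1000", "03b096f8"),  # large object heap
]


def segment_size(begin_hex: str, allocated_hex: str) -> int:
    """Bytes in use on a segment: allocated end address minus begin address."""
    return int(allocated_hex, 16) - int(begin_hex, 16)


sizes = [segment_size(begin, allocated) for begin, allocated in segments]
total = sum(sizes)

for (begin, _), size in zip(segments, sizes):
    print(f"segment at {begin}: {size} bytes")
print(f"total: {total} bytes ({hex(total)})")
```

The three sizes add up to 18178332 bytes (0x115611c), the GC Heap Size reported at the bottom of the output.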
With !DumpHeap 02aa1000 03836a30 you can dump the GC Heap segment.
!DumpHeap 02aa1000 03836a30
Address MT Size
037b7b88 5b408350 56
037b7bc0 60876d60 32
037b7be0 5b40838c 20
037b7bf4 5b408350 56
037b7c2c 5b408728 20
037b7c40 5fe4506c 16
037b7c50 60876d60 32
037b7c70 5b408728 20
037b7c84 5fe4506c 16
037b7c94 00135de8 519112 Free
0383685c 5b408728 20
03836870 5fe4506c 16
03836880 608c55b4 96
There you find your free memory blocks, each of which was an object that has already been GCed. You can dump the surrounding objects (the output is sorted by address) to find out if they are pinned or have other unusual properties.
You have 50MB of RAM as Free space. This is not good.
Since .NET allocates memory from the process in blocks of 16MB, you do have a fragmentation issue.
There are plenty of reasons for fragmentation to occur in .NET.
Have a look here and here.
In your case it is possibly pinning, as 53304764 / 243010 comes to 219.35 bytes per object, much lower than the size of LOH objects.
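The arithmetic behind that estimate, as a small sketch. The variable names are made up for illustration; the 85,000-byte figure is the size at which .NET places objects on the large object heap.

```python
# Figures from the !dumpheap -type Free -stat output in the question.
free_count = 243010
free_total_bytes = 53304764

# Objects of 85,000 bytes or more are allocated on the large object heap (LOH).
LOH_THRESHOLD_BYTES = 85_000

average_free_bytes = free_total_bytes / free_count
print(f"average free block: {average_free_bytes:.2f} bytes")
print(f"free total: {free_total_bytes / 2**20:.1f} MiB")
print("typical gap is below LOH object size:", average_free_bytes < LOH_THRESHOLD_BYTES)
```

An average gap of about 219 bytes means the holes are small and numerous, which is what you would expect from gaps left between surviving objects on the small object heap rather than from freed large objects.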
