Discrepancy between htop and Go's runtime.ReadMemStats

My program loads a lot of data at startup and then calls debug.FreeOSMemory() so that any extra space is given back immediately.
loadDataIntoMem()
debug.FreeOSMemory()
After loading the data into memory, htop shows the following for the process:
VIRT RES SHR
11.6G 7629M 8000
But a call to runtime.ReadMemStats shows the following:
Alloc 5593336608 5.3G
BuckHashSys 1574016 1.6M
HeapAlloc 5593336610 5.3G
HeapIdle 2607980544 2.5G
HeapInuse 7062446080 6.6G
HeapReleased 2607980544 2.5G
HeapSys 9670426624 9.1G
MCacheInuse 9600 9.4K
MCacheSys 16384 16K
MSpanInuse 106776176 102M
MSpanSys 115785728 111M
OtherSys 25638523 25M
StackInuse 589824 576K
StackSys 589824 576K
Sys 10426738360 9.8G
TotalAlloc 50754542056 48G
Alloc is the amount obtained from the system and not yet freed (this is resident memory, right?). But there is a big difference between the two.
I rely on HeapIdle to decide when to kill my program, i.e. if HeapIdle is more than 2 GB, restart. In this case it is 2.5 GB, and it isn't going down even after a while. Shouldn't Go allocate out of HeapIdle when it needs more memory in the future, thus reducing HeapIdle?
If assumption 1 is wrong, which stat can accurately tell me what the RES value in htop is?
What can I do to reduce the value of HeapIdle?
This was tried on Go 1.4.2, 1.5.2 and 1.6beta1.

The effective memory consumption of your program will be Sys - HeapReleased. This still won't be exactly what the OS reports, because the OS can choose to allocate memory how it sees fit based on the requests of the program.
If your program runs for any appreciable amount of time, the excess memory will be offered back to the OS, so there's no need to call debug.FreeOSMemory(). It's also not the job of the garbage collector to keep memory usage as low as possible; the goal is to use memory as efficiently as possible. This requires some overhead, and room for future allocations.
If you're having trouble with memory usage, it would be a lot more productive to profile your program and see why you're allocating more than expected, instead of killing your process based on incorrect assumptions about memory.
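As an illustration, a minimal sketch of reading those fields and computing Sys - HeapReleased (this snippet is not part of the original answer):

package main

import (
    "fmt"
    "runtime"
)

func main() {
    var ms runtime.MemStats
    runtime.ReadMemStats(&ms)
    // Sys - HeapReleased approximates what the OS still holds for the
    // process; it will not match htop's RES exactly, as noted above.
    effective := ms.Sys - ms.HeapReleased
    fmt.Printf("Sys=%dM HeapReleased=%dM effective=%dM\n",
        ms.Sys>>20, ms.HeapReleased>>20, effective>>20)
}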

Related

Understanding the output of top on Mac

$ man top
CPU        Percentage of processor usage, broken into user, system, and idle components. The time period for which these percentages are calculated depends on the event counting mode.
Disks      Number and total size of disk reads and writes.
LoadAvg    Load average over 1, 5, and 15 minutes. The load average is the average number of jobs in the run queue.
MemRegions Number and total size of memory regions, and total size of memory regions broken into private (broken into non-library and library) and shared components.
Networks   Number and total size of input and output network packets.
PhysMem    Physical memory usage, broken into wired, active, inactive, used, and free components.
Procs      Total number of processes and number of processes in each process state.
SharedLibs Resident sizes of code and data segments, and link editor memory usage.
Threads    Number of threads.
Time       Time, in H:MM:SS format. When running in logging mode, Time is in YYYY/MM/DD HH:MM:SS format by default, but may be overridden with accumulative mode. When running in accumulative event counting mode, the Time is in HH:MM:SS since the beginning of the top process.
VirtMem    Total virtual memory, virtual memory consumed by shared libraries, and number of pageins and pageouts.
Swap       Swap usage: total size of swap areas, amount of swap space in use and amount of swap space available.
Purgeable  Number of pages purged and number of pages currently purgeable.
Below the global state fields, a list of processes is displayed. The fields that are displayed depend on the options that are set. The pid field displays the following for the architecture: + for 64-bit native architecture, - for 32-bit native architecture, or * for a non-native architecture.
I see the following output of top on Mac, and I don't quite understand it, as the manual is not very detailed.
For example, I only have 8GB of memory. Why does it show 15G PhysMem? What are the wired, active, inactive, used, and free components?
For Disks, are the numbers '21281572/769G read' the size of disk reads since the machine started?
For Networks, are the numbers since the machine started?
For VM, what are vsize, framework vsize, swapins, swapouts?
$ top -l 1 | head
Processes: 797 total, 4 running, 1 stuck, 792 sleeping, 1603 threads
2019/05/08 09:48:40
Load Avg: 54.32, 41.08, 34.69
CPU usage: 62.2% user, 36.89% sys, 1.8% idle
SharedLibs: 258M resident, 65M data, 86M linkedit.
MemRegions: 78888 total, 6239M resident, 226M private, 2045M shared.
PhysMem: 15G used (2220M wired), 785M unused.
VM: 3392G vsize, 1299M framework vsize, 0(0) swapins, 0(0) swapouts.
Networks: packets: 24484543/16G in, 24962180/7514M out.
Disks: 21281572/769G read, 20527776/242G written.
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/disk1s1 466G 444G 19G 97% /
/dev/disk1s4 466G 3.1G 19G 14% /private/var/vm
/dev/disk2s1 932G 546G 387G 59% /Volumes/usbhd
com.apple.TimeMachine.2019-05-06-225547#/dev/disk1s1 466G 441G 19G 96% /Volumes/com.apple.TimeMachine.localsnapshots/Backups.backupdb/py???s MacBook Air/2019-05-06-225547/Macintosh HD
com.apple.TimeMachine.2019-05-02-082105#/dev/disk1s1 466G 440G 19G 96% /Volumes/com.apple.TimeMachine.localsnapshots/Backups.backupdb/py???s MacBook Air/2019-05-02-082105/Macintosh HD
I only have 8GB of memory. Why does it show 15G PhysMem?
There is 15G available, which is above the 8GB in the machine, due to the use of virtual memory, where the OS can choose to swap (copy in/out) pages of memory to other storage (hard disk / SSD).
What are wired, active, inactive, used, and free components?
These are the states of physical pages of memory as used by Virtual Memory. So we have:
wired - Memory pages that are in use and can't be swapped (paged) out to disk (e.g. the OS itself)
active - Memory pages used for virtual memory that have been referenced recently. They are not likely to be paged out unless no other pages are available
inactive - Memory pages that are used for virtual memory but have not been referenced recently. They are likely to be swapped out if the need arises
used - Sometimes known as "speculative": physical memory that is speculatively mapped because the OS guesses it might be required, but it is not yet active
free - Physical memory pages not being used for virtual memory, instantly available
For Disks, are the numbers '21281572/769G read' the size of disk reads since the machine started?
For Networks, are the numbers since the machine started?
Yes, I believe that these are since rebooting the OS.
For VM, what are vsize, framework vsize, swapins, swapouts?
I expect these are:
vsize - the total amount of virtual memory space being used
framework vsize - No idea about this one!
swapins - the number of memory pages loaded from virtual memory (disk) into physical memory
swapouts - the number of memory pages written out from physical memory to virtual memory (disk)

Does the Meltdown mitigation, in combination with calloc()'s CoW "lazy allocation", imply a performance hit for calloc()-allocated memory?

So calloc() works by asking the OS for some virtual memory. The OS is working in cahoots with the MMU, and cleverly responds with a virtual memory address which actually maps to a copy-on-write, read-only page full of zeroes. When a program tries to write to anywhere in that page, a page fault occurs (because you cannot write to read-only pages), a copy of the page is created, and your program's virtual memory is mapped to this brand new copy of those zeroes.
Now that Meltdown is a thing, OSes have been patched so that it's no longer possible to speculatively execute across the kernel-user boundary. This means that whenever user code calls kernel code, it effectively causes a pipeline stall. Typically, when the pipeline stalls in a loop, it's devastating for performance, since the CPU ends up wasting time waiting for data, whether from cache or main memory.
Given such, what I want to know is:
When a program writes to a never-before-accessed page which was allocated with calloc(), and the remapping to the new CoW page occurs, is this executing kernel code?
Is the page fault copy-on-write functionality implemented at the OS level or the MMU level?
If I call calloc() to allocate 4GiB of memory, then initialize it with some arbitrary value (say, 0xFF instead of 0x00) in a tight loop, is my (Intel) CPU going to be hitting a speculation boundary every time it writes to a new page?
And finally, if it is real, is there any case where this effect is significant to real-world performance?
Your premise is wrong. Page faults were never pipelined / super-cheap. Meltdown (and Spectre) mitigation does make them more expensive, though, along with system calls and all other user->kernel transitions.
Speculative execution across the kernel/user boundary was never possible; Intel CPUs don't rename the privilege level, i.e. kernel/user transitions always required a full pipeline flush. I think you're misunderstanding Meltdown: it's caused purely by speculative execution in user-space and delayed handling of the privilege checks on TLB hits.
This is universal in CPU design, AFAIK. I'm not aware of any microarchitectures that rename the privilege level or otherwise speculate into kernel code, x86 or otherwise.
The cost added by Meltdown mitigation is that entering the kernel flushes the TLB. (Or on CPUs with TLB process-context ID support, the kernel can use PCIDs to make using separate page-tables for kernel vs. user-space much cheaper).
The kernel entry point (on Linux) becomes a trampoline that swaps page tables and jumps to the real kernel entry point, to avoid exposing the kernel ASLR offset to user-space. But other than that and an extra mov cr3, reg on entry and exit from the kernel (setting a new page table), nothing else is changed.
(Spectre mitigation is tricky, too, and required more changes like retpolines... and might also significantly increase the cost of user->kernel->user. IDK about page fault costs.)
@BeeOnRope reports (see the comments and his answer for full details) that without Spectre patches, with just the Meltdown patches applied but the nopti boot option used to "disable" them, the cost of a round trip to the kernel on a Skylake CPU (via syscall with a bogus RAX, returning -ENOSYS right away) went up from ~100 to ~300 cycles. So that's maybe the cost of the trampoline? And with actual page-table isolation enabled, it went up to ~700 cycles. That's without Spectre mitigation patches at all. (Also, that's the x86-64 syscall entry point, not the page-fault path. They're likely similar, though.)
Page fault exceptions:
CPUs don't predict page faults, so they couldn't speculatively execute the handler anyway. Prefetch or decode of the page fault entry point could maybe happen while the pipeline was flushing, but that process wouldn't start until the page-faulting instruction tried to retire. A faulting load/store is marked to take effect on retirement, and doesn't re-steer the front-end; the whole key to Meltdown is the lack of action on a faulting load until it reaches retirement.
Related: When an interrupt occurs, what happens to instructions in the pipeline?
Also: Out-of-order execution vs. speculative execution has some detail about what kind of speculation really causes Meltdown, and how CPUs handle faults.
When a program writes to a never-before-accessed page which was allocated with calloc(), and the remapping to the new CoW page occurs, is this executing kernel code?
Yes, page faults are handled by the kernel's page-fault handler. There's no pure-hardware handling for copy-on-write.
If I call calloc() to allocate 4GiB of memory, then initialize it with some arbitrary value (say, 0xFF instead of 0x00) in a tight loop, is my (Intel) CPU going to be hitting a speculation boundary every time it writes to a new page?
Yes. The kernel doesn't fault-around for zeroed pages (unlike for file-backed mappings when data is hot in the pagecache). So every new page touched causes a page fault, even for small 4k normal pages. (Thanks to @BeeOnRope for accurate info on this.) With anonymous hugepages, you'll only page-fault once per 2MiB (x86-64), which is tremendously better.
If you want to avoid per-page costs, allocate with mmap(MAP_POPULATE) to prefault all the pages into the HW page table, on a Linux system. I'm not sure if madvise can prefault pages for you, e.g. madvise(MADV_WILLNEED) on an already-mapped region. But madvise(MADV_HUGEPAGE) will encourage the kernel to use anonymous hugepages (and maybe to defrag physical memory to free up contiguous 2M blocks to enable that, if you don't have it configured to do that without madvise).
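A minimal sketch of the MAP_POPULATE idea, shown here in Go via the syscall package (Linux-only; the 1 GiB size is arbitrary). The same flags apply to a plain mmap(2) call from C:

package main

import (
    "fmt"
    "syscall"
)

func main() {
    const size = 1 << 30 // 1 GiB, just an example
    // MAP_POPULATE asks the kernel to fault in every page up front,
    // so later writes don't take one page fault per 4k page.
    mem, err := syscall.Mmap(-1, 0, size,
        syscall.PROT_READ|syscall.PROT_WRITE,
        syscall.MAP_PRIVATE|syscall.MAP_ANONYMOUS|syscall.MAP_POPULATE)
    if err != nil {
        panic(err)
    }
    defer syscall.Munmap(mem)
    mem[0] = 0xFF // the page is already resident: no fault taken here
    fmt.Println("mapped and prefaulted", len(mem), "bytes")
}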
Related: Two TLB-miss per mmap/access/munmap has some perf results on a Linux kernel with KPTI patches.
Yes, use of calloc()-allocated memory will suffer a performance degradation due to the Meltdown and Spectre patches.
In fact, calloc() isn't special here: malloc(), new, and more generally all allocated memory will probably suffer approximately the same performance impact. Both calloc() and malloc() are ultimately backed by pages returned by the OS (although the allocator will re-use them after they are freed). The only real difference is that a smart allocator, when it goes down the path of using new pages from the OS (rather than re-using a previously freed allocation), can omit the zeroing in the calloc() case, because the OS-provided pages are guaranteed to be zero. Other than that, the allocator behavior is largely the same, and the OS-level zeroing behavior is the same (there is usually no option to ask the OS for non-zero pages).
So the performance impact applies more broadly than you thought, but it is likely smaller than you suggest, since a page fault already does a lot of work anyway, so you aren't talking about an order-of-magnitude degradation. See Peter's answer on the reasons the performance impact is likely to be limited. I wrote this answer mostly because the answer to your headline question is still yes, as there is some impact.
To estimate the impact on a malloc-heavy workload, I tried running an allocation- and page-fault-heavy test on a current kernel (4.13.0-39-generic) with the Spectre and Meltdown mitigations, as well as on an older kernel prior to these mitigations.
The test code is very simple:
#include <stdlib.h>
#include <stdio.h>

#define SIZE (40 * 1024 * 1024)
#define PG_SIZE 4096

int main() {
    char *mem = malloc(SIZE);
    /* Touch one byte per page so each page costs exactly one fault. */
    for (volatile char *p = mem; p < mem + SIZE; p += PG_SIZE) {
        *p = 'z';
    }
    printf("pages touched: %d\npointer value : %p\n", SIZE / PG_SIZE, mem);
    return 0;
}
The results on the newer kernel were about ~3700 cycles per page fault, and on the older kernel without mitigations around ~3300 cycles. The overall regression (presumably) due to the mitigations was about 14%. Note that this is on Skylake hardware (i7-6700HQ), where some of the Spectre mitigations are somewhat cheaper, and the kernel supports PCID, which makes the KPTI Meltdown mitigations cheaper. The results might be worse on different hardware.
Oddly, the results on the new kernel with the Spectre and Meltdown mitigations disabled at boot (using spectre_v2=off nopti) were much worse than either the new kernel default or the old kernel, coming in at about 5050 cycles per page fault, something like a 35% regression over the same kernel with the mitigations enabled. So something is going really wrong, performance-wise, when the mitigations are disabled.
Full Results
Here is the full perf stat output for the two runs.
Old Kernel (4.10.0-42)
pages touched: 10240
pointer value : 0x7f7d2561e010
Performance counter stats for './pagefaults':
12.980048 task-clock (msec) # 0.976 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
10,286 page-faults # 0.792 M/sec
33,662,397 cycles # 2.593 GHz
27,230,864 instructions # 0.81 insn per cycle
4,535,443 branches # 349.417 M/sec
11,760 branch-misses # 0.26% of all branches
0.013293417 seconds time elapsed
New Kernel (4.13.0-39)
pages touched: 10240
pointer value : 0x7f306ad69010
Performance counter stats for './pagefaults':
14.789615 task-clock (msec) # 0.966 CPUs utilized
8 context-switches # 0.541 K/sec
0 cpu-migrations # 0.000 K/sec
10,288 page-faults # 0.696 M/sec
38,318,595 cycles # 2.591 GHz
28,796,523 instructions # 0.75 insn per cycle
4,693,944 branches # 317.381 M/sec
26,853 branch-misses # 0.57% of all branches
0.015312764 seconds time elapsed
New Kernel (4.13.0-39) spectre_v2=off nopti
pages touched: 10240
pointer value : 0x7ff079ede010
Performance counter stats for './pagefaults':
16.690621 task-clock (msec) # 0.982 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
10,286 page-faults # 0.616 M/sec
51,964,080 cycles # 3.113 GHz
28,602,441 instructions # 0.55 insn per cycle
4,699,608 branches # 281.572 M/sec
25,064 branch-misses # 0.53% of all branches
0.017001581 seconds time elapsed

Should the stats reported by Go's runtime.ReadMemStats approximately equal the resident set size reported by ps aux?

In Go, should the Sys stat (or any other stat or combination reported by runtime.ReadMemStats) approximately equal the resident set size reported by ps aux?
Alternatively, assuming some memory may be swapped out, should the Sys stat be approximately greater than or equal to the RSS?
We have a long-running web service that deals with a high frequency of requests and we are finding that the RSS quickly climbs up to consume virtually all of the 64GB memory on our servers. When it hits ~85% we begin to experience considerable degradation in our response times and in how many concurrent requests we can handle. The run I've listed below is after about 20 hours of execution, and is already at 51% memory usage.
I'm trying to determine if the likely cause is a memory leak (we make some calls to CGO). The data seems to indicate that it is, but before I go down that rabbit hole I want to rule out a fundamental misunderstanding of the statistics I'm using to make that call.
This is an amd64 build targeting linux and executing on CentOS.
Reported by runtime.ReadMemStats:
Alloc: 1294777080 bytes (1234.80MB) // bytes allocated and not yet freed
Sys: 3686471104 bytes (3515.69MB) // bytes obtained from system (sum of XxxSys below)
HeapAlloc: 1294777080 bytes (1234.80MB) // bytes allocated and not yet freed (same as Alloc above)
HeapSys: 3104931840 bytes (2961.09MB) // bytes obtained from system
HeapIdle: 1672339456 bytes (1594.87MB) // bytes in idle spans
HeapInuse: 1432592384 bytes (1366.23MB) // bytes in non-idle span
Reported by ps aux:
%CPU %MEM VSZ RSS
1362 51.3 306936436 33742120
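A minimal sketch (Linux-only, and my own illustration rather than anything from the question) of logging the runtime's numbers next to the kernel's VmRSS, which is the same value ps aux reports (in kilobytes), so the two views can be compared directly:

package main

import (
    "bufio"
    "fmt"
    "os"
    "runtime"
    "strings"
)

// vmRSS returns the VmRSS value from /proc/self/status, e.g. "33742120 kB".
func vmRSS() string {
    f, err := os.Open("/proc/self/status")
    if err != nil {
        return "unknown"
    }
    defer f.Close()
    s := bufio.NewScanner(f)
    for s.Scan() {
        if strings.HasPrefix(s.Text(), "VmRSS:") {
            return strings.TrimSpace(strings.TrimPrefix(s.Text(), "VmRSS:"))
        }
    }
    return "unknown"
}

func main() {
    var ms runtime.MemStats
    runtime.ReadMemStats(&ms)
    fmt.Printf("Sys=%d HeapReleased=%d Sys-HeapReleased=%d VmRSS=%s\n",
        ms.Sys, ms.HeapReleased, ms.Sys-ms.HeapReleased, vmRSS())
}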

Why does memset make the virtual memory so large?

I have a process that does a lot of lithography calculation, so I used mmap to allocate memory for a memory pool. When the process needs a large chunk of memory, I use mmap to allocate a chunk; after it is used, I put it into the memory pool, and if the same size of chunk is needed again, it is taken from the pool directly instead of being mapped again. (I do not allocate all the needed memory and put it into the pool at the beginning of the process.) Between the mmap calls there are some allocations that do not use mmap, such as malloc() or new.
Now the question is:
If I use memset() to set all the chunk data to ZERO before putting it into the memory pool, the process uses too much virtual memory, as shown below (the format is "mmap(size) = virtual address"):
mmap(4198400)=0x2aaab4007000
mmap(4198400)=0x2aaab940c000
mmap(8392704)=0x2aaabd80f000
mmap(8392704)=0x2aaad6883000
mmap(67112960)=0x2aaad7084000
mmap(8392704)=0x2aaadb085000
mmap(2101248)=0x2aaadb886000
mmap(8392704)=0x2aaadba89000
mmap(67112960)=0x2aaadc28a000
mmap(2101248)=0x2aaae028b000
mmap(2101248)=0x2aaae0c8d000
mmap(2101248)=0x2aaae0e8e000
mmap(8392704)=0x2aaae108f000
mmap(8392704)=0x2aaae1890000
mmap(4198400)=0x2aaae2091000
mmap(4198400)=0x2aaae6494000
mmap(8392704)=0x2aaaea897000
mmap(8392704)=0x2aaaeb098000
mmap(2101248)=0x2aaaeb899000
mmap(8392704)=0x2aaaeba9a000
mmap(2101248)=0x2aaaeca9c000
mmap(8392704)=0x2aaaec29b000
mmap(8392704)=0x2aaaecc9d000
mmap(2101248)=0x2aaaed49e000
mmap(8392704)=0x2aaafd6a7000
mmap(2101248)=0x2aacc5f8c000
The mmap last - first = 0x2aacc5f8c000 - 0x2aaab4007000 = 8.28G
But if I don't call memset before putting chunks in the memory pool:
mmap(4198400)=0x2aaab4007000
mmap(8392704)=0x2aaab940c000
mmap(8392704)=0x2aaad2480000
mmap(67112960)=0x2aaad2c81000
mmap(2101248)=0x2aaad6c82000
mmap(4198400)=0x2aaad6e83000
mmap(8392704)=0x2aaadb288000
mmap(8392704)=0x2aaadba89000
mmap(67112960)=0x2aaadc28a000
mmap(2101248)=0x2aaae0a8c000
mmap(2101248)=0x2aaae0c8d000
mmap(2101248)=0x2aaae0e8e000
mmap(8392704)=0x2aaae1890000
mmap(8392704)=0x2aaae108f000
mmap(4198400)=0x2aaae2091000
mmap(4198400)=0x2aaae6494000
mmap(8392704)=0x2aaaea897000
mmap(8392704)=0x2aaaeb098000
mmap(2101248)=0x2aaaeb899000
mmap(8392704)=0x2aaaeba9a000
mmap(2101248)=0x2aaaec29b000
mmap(8392704)=0x2aaaec49c000
mmap(8392704)=0x2aaaecc9d000
mmap(2101248)=0x2aaaed49e000
The mmap last - first = 0x2aaaed49e000 - 0x2aaab4007000 = 916M
So the first process runs out of memory and is killed.
In this process, an mmap'ed chunk may not be fully used, or even used at all, although it has been allocated. For example, before calibration the process may mmap 67112960 bytes (64M) and then never touch the region (no reads or writes in it), or use only the first 2M bytes, before putting it into the memory pool.
I know that mmap just returns a virtual address and that the physical memory is allocated lazily, when those addresses are first read or written.
But what confuses me is why the virtual address range grows so much. I am using CentOS 5.3 with kernel 2.6.18, and I tried the process with both libhoard and glibc (ptmalloc), with the same behavior in both cases.
Has anyone met the same issue before? What is the possible root cause?
Thanks.
VMAs (virtual memory areas, a.k.a. memory mappings) do not need to be contiguous. Your first example uses ~256 MB, the second ~246 MB.
Common malloc() implementations use mmap() automatically for large allocations (usually larger than 64KB), freeing the corresponding chunks with munmap(). So you do not need to mmap() manually for large allocations; your malloc() library will take care of that.
When mmap()ing, the kernel returns a COW copy of a special zero page, so it doesn't allocate memory until it's written to. Your zeroing is causing memory to be really allocated; better to just return the chunk to the allocator and request a new one when you need it.
Conclusion: don't write your own memory management unless the system one has proven inadequate for your needs, and then use your own memory management only when you have proven it noticeably better for your needs with a real-life load.
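To make the zero-page point concrete, a minimal sketch (in Go, Linux-only; observing the resident set via /proc/self/status is my illustrative choice, not something from the answer). The mapping alone adds almost nothing resident; it is the zeroing write that makes the memory real:

package main

import (
    "fmt"
    "os"
    "regexp"
    "syscall"
)

// rss returns the VmRSS line from /proc/self/status.
func rss() string {
    b, _ := os.ReadFile("/proc/self/status")
    return regexp.MustCompile(`VmRSS:.*`).FindString(string(b))
}

func main() {
    const size = 64 << 20 // one 64M chunk, as in the question
    mem, err := syscall.Mmap(-1, 0, size,
        syscall.PROT_READ|syscall.PROT_WRITE,
        syscall.MAP_PRIVATE|syscall.MAP_ANONYMOUS)
    if err != nil {
        panic(err)
    }
    fmt.Println("before zeroing:", rss()) // all pages still map the shared zero page
    for i := range mem {
        mem[i] = 0 // each first write to a page faults in a private copy
    }
    fmt.Println("after zeroing: ", rss()) // roughly 64M more resident now
    syscall.Munmap(mem)
}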

managed heap fragmentation

I am trying to understand how heap fragmentation works. What does the following output tell me?
Is this heap overly fragmented?
I have 243010 "free objects" with a total of 53304764 bytes. Are those "free objects" spaces in the heap that once contained objects but that have now been garbage collected?
How can I force a fragmented heap to clean up?
!dumpheap -type Free -stat
total 243233 objects
Statistics:
MT Count TotalSize Class Name
0017d8b0 243010 53304764 Free
It depends on how your heap is organized. You should have a look at how much memory is allocated in Gen 0/1/2 and how much free memory you have there compared to the total used memory.
If you have a 500 MB managed heap in use and 50 MB of it is free, then you are doing pretty well. If you do memory-intensive operations, like creating many WPF controls and releasing them, you need a lot more memory for a short time, but .NET does not give memory back to the OS once it has allocated it. The GC tries to recognize allocation patterns and tends to keep your memory footprint high, even though your current heap size is way too big, until your machine is running low on physical memory.
I found it much easier to use psscor2 for .NET 3.5, which has some cool commands like ListNearObj, where you can find out which objects are around your memory holes (pinned objects?). With the commands from psscor2 you have much better chances of finding out what is really going on in your heaps. Most of the commands are also available in SOS.dll in .NET 4.
To answer your original question: yes, free objects are gaps on the managed heap. They can simply be the free memory block after your last allocated object on a GC segment; or, if you do !DumpHeap with the start address of a GC segment, you see the objects allocated in that managed heap segment along with your free objects, which are GC-collected objects.
These memory holes normally happen in Gen 2. The object addresses before and after a free object tell you which potentially pinned objects are around your hole. From this you should be able to determine your allocation history and optimize it if you need to.
You can find the addresses of the GC Heaps with
0:021> !EEHeap -gc
Number of GC Heaps: 1
generation 0 starts at 0x101da9cc
generation 1 starts at 0x10061000
generation 2 starts at 0x02aa1000
ephemeral segment allocation context: none
segment begin allocated size
02aa0000 02aa1000 03836a30 0xd95a30(14244400)
10060000 10061000 103b8ff4 0x357ff4(3506164)
Large object heap starts at 0x03aa1000
segment begin allocated size
03aa0000 03aa1000 03b096f8 0x686f8(427768)
Total Size: Size: 0x115611c (18178332) bytes.
------------------------------
GC Heap Size: Size: 0x115611c (18178332) bytes.
There you see that you have heaps at 02aa1000 and 10061000.
With !DumpHeap 02aa1000 03836a30 you can dump the GC Heap segment.
!DumpHeap 02aa1000 03836a30
Address MT Size
...
037b7b88 5b408350 56
037b7bc0 60876d60 32
037b7be0 5b40838c 20
037b7bf4 5b408350 56
037b7c2c 5b408728 20
037b7c40 5fe4506c 16
037b7c50 60876d60 32
037b7c70 5b408728 20
037b7c84 5fe4506c 16
037b7c94 00135de8 519112 Free
0383685c 5b408728 20
03836870 5fe4506c 16
03836880 608c55b4 96
....
There you find your free memory blocks, which were objects that have already been GCed. You can dump the surrounding objects (the output is sorted address-wise) to find out whether they are pinned or have other unusual properties.
You have 50MB of RAM as free space. This is not good.
With .NET allocating blocks of 16MB from the process, we indeed have a fragmentation issue.
There are plenty of reasons for fragmentation to occur in .NET.
Have a look here and here.
In your case it is possibly pinning, as 53304764 / 243010 gives 219.35 bytes per object on average - much smaller than LOH objects.
