Memory management error with Mono?

I have a program that was developed with C#/.NET 4.5.2 and runs fine on Windows (7 x64). The program handles a large amount of data, so it is compiled for x64 only. My largest input dataset can be processed on a PC with 16GB RAM.
When I run it on Ubuntu 16.04 LTS (64-bit, 64 GB RAM installed), everything is fine for smaller datasets, but I started getting SIGSEGV signals with larger datasets. These errors do not always occur at the same position, and for intermediate-sized datasets they sometimes don't occur at all.
I upgraded my version of Mono so I am now running with 5.0.1.1:
TLS: __thread
SIGSEGV: altstack
Notifications: epoll
Architecture: amd64
Disabled: none
Misc: softdebug
LLVM: supported, not enabled.
GC: sgen (concurrent by default)
Upgrading replaced the error with a NullReferenceException:
Native stacktrace:
Unhandled Exception:
System.NullReferenceException: Object reference not set to an instance of an object
at (wrapper managed-to-native) System.Array:FastCopy (System.Array,int,System.Array,int,int)
at System.Array.Copy (System.Array sourceArray, System.Int32 sourceIndex, System.Array destinationArray, System.Int32 destinationIndex, System.Int32 length) [0x00068] in <a07d6bf484a54da2861691df910339b1>:0
at System.Collections.Generic.HashSet`1[T].SetCapacity (System.Int32 newSize, System.Boolean forceNewHashCodes) [0x0000f] in <26aedeede9534b948c539f8734c8492d>:0
at System.Collections.Generic.HashSet`1[T].IncreaseCapacity () [0x00025] in <26aedeede9534b948c539f8734c8492d>:0
at System.Collections.Generic.HashSet`1[T].AddIfNotPresent (T value) [0x000bc] in <26aedeede9534b948c539f8734c8492d>:0
at System.Collections.Generic.HashSet`1[T].Add (T item) [0x00000] in <26aedeede9534b948c539f8734c8492d>:0
at MyLib.MyClass.DoProcessing () [0x00a6e] in <66a585adc1684679bfec565c73eb94e4>:0
at MyLib.MyClass.SynchProcessing () [0x00000] in <66a585adc1684679bfec565c73eb94e4>:0
at MyApp.Program.Main (System.String[] args) [0x00139] in <e80b0468d8e642129fa7c39d5b2bb0a0>:0
In addition to HashSet<long>, it sometimes appears in Dictionary<long, Node> (where Node is a small class of two floats), but either way it always happens when adding an element, inside System.Array:FastCopy. The older version of Mono reported this same location when it raised the SIGSEGV. The errors occur in a single-threaded section of the code where the data is being read in and pre-processed (later it is multi-threaded, but that code has yet to produce the error, probably because all collections are already at their maximum size or shrinking).
So it looks like an internal pointer is being corrupted when a HashSet or Dictionary's underlying array is re-allocated.
Has anyone seen anything like this? It looks to me like a memory manager / garbage collector error, or even an underlying bug in Mono. However, searching Google and this site has not turned anything up. I see that Mono has a new GC, but mono -V says it is already running by default (and the SIGSEGV was being produced when the older version was being used).
Does anyone have any suggestions or solutions to try?
I don't know the heap and collection sizes when this error occurs. Next I'll add some diagnostics to help, although they will slow the program down and only give approximate values (e.g. a Console.WriteLine every 10,000 nodes).

Related

Visual Studio throws exception after hdd went out of space

I had Visual Studio running while my HDD was, unknown to me, running out of available space. I have since fixed that and the HDD now has multiple gigabytes free, but something got broken during the unstable state, and I am now receiving this error:
Recoverable
System.NullReferenceException: Object reference not set to an instance of an object.
at Microsoft.VisualStudio.ProjectSystem.ProjectSerialization.CachedMsBuildGlobWithGaps.IsMatch(String stringToMatch)
at System.Linq.ImmutableCollectionsExtensions.Any[TElement,TArg](ImmutableArray`1 immutableArray, Func`3 predicate, TArg arg)
at System.Linq.ImmutableCollectionsExtensions.Any[TElement,TArg](ImmutableArray`1 immutableArray, Func`3 predicate, TArg arg)
at Microsoft.VisualStudio.ProjectSystem.ProjectSerialization.ConstructionUtilities.<>c__DisplayClass16_0.<CheckIfProjectConeChangedOnDisk>b__1(FileSystemEntry& entry)
at Microsoft.IO.Enumeration.FileSystemEnumerator`1.MoveNext()
at System.Linq.Enumerable.FirstOrDefault[TSource](IEnumerable`1 source)
at Microsoft.VisualStudio.ProjectSystem.ProjectSerialization.ConstructionUtilities.CheckIfProjectConeChangedOnDisk(String projectPath, DateTime lastEvaluationTimeUtc, IEnumerable`1 globs, IProjectTelemetryService telemetryService, UnconfiguredProject project)
at Microsoft.VisualStudio.ProjectSystem.ProjectSerialization.ProjectCacheService.IsProjectCacheUpToDateSlow(ConfiguredProject configuredProject)

Difference of "Use of Stack Memory After Return" between native arm64 and native Intel/rosetta2 x86_64

I have an odd message using my code base (C/C++ & Swift).
The code itself is way too big to post, but I wanted to hear what people think could be the reason.
The same code runs natively on my M1 Apple Silicon chip without any issues, with all diagnostics turned on.
The fun begins when I use it on an Intel based Mac and/or under Rosetta2. (All systems are Big Sur).
Vithanco(83162,0x20400de00) malloc: enabling scribbling to detect mods to free blocks
Vithanco(83162,0x20400de00) malloc: nano zone abandoned due to inability to preallocate reserved vm space.
applicationDidFinishLaunching
objc[83162]: Class _NSZombie_NSSimpleRegularExpressionCheckingResult is implemented in both ?? (0x60400017ab90) and ?? (0x60400016ffd0). One of the two will be used. Which one is undefined.
=================================================================
==83162==ERROR: AddressSanitizer: stack-use-after-return on address 0x0001105fee00 at pc 0x000101fbd30f bp 0x000308d4eb70 sp 0x000308d4eb68
WRITE of size 8 at 0x0001105fee00 thread T0
==83162==WARNING: invalid path to external symbolizer!
==83162==WARNING: Failed to use and restart external symbolizer!
#0 0x101fbd30e in textfont_dict_open+0x44e (/Users/(deleted)/Library/Developer/Xcode/DerivedData/...-gwcenzuufsseezetprookmoioioy/Build/Products/Debug/.../Contents/MacOS/Vithanco:x86_64+0x1012b630e)
#1 0x1026f3036 in loadGraphvizLibraries+0x156 (/Users/(deleted)/Library/Developer/Xcode/DerivedData/Vithanco-gwcenzuufsseezetprookmoioioy/Build/Products/Debug/Vithanco.app/Contents/MacOS/Vithanco:x86_64+0x1019ec036)
#2 0x1026f618c in globalinit_33_2FCABEB9B9698DE37811B48DE0525A0F_func0+0xc (/Users/(deleted)/Library/Developer/Xcode/DerivedData/Vithanco-gwcenzuufsseezetprookmoioioy/Build/Products/Debug/Vithanco.app/Contents/MacOS/Vithanco:x86_64+0x1019ef18c)
#3 0x1102400af in _dispatch_client_callout+0x7 (/usr/lib/system/introspection/libdispatch.dylib:x86_64+0x40af)
There is a lot more to come on the error stack, but not much of use.
I was just wondering: what could be the cause? Why would the same code run into a Use of Stack Memory After Return on only one architecture? The same code ran previously on Intel. So is this a macOS issue, a compiler issue, or something else?

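The cause was a wrong declaration. I had used:
extern struct _dt_s textfont_dict_open(GVC_t * gvc);
instead of
extern struct _dt_s * textfont_dict_open(GVC_t * gvc);
Interesting how the two architectures led to very different outcomes, although I never used the return value of the function. A likely explanation, assuming struct _dt_s is too large to be returned in registers: on x86_64 the caller passes a hidden pointer to a stack temporary in the first argument register, so the callee mistakes that stack address for gvc and writes through it, whereas arm64 passes the hidden result pointer in a separate register (x8) and gvc still arrives where the callee expects it.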

What causes this error: "address already known to kernel for another [busy] synchronizer type"?

I have a customer who is getting their system log flooded with thousands of copies of this message:
Jul 25 11:21:33 athayer-mbp13 kernel[0]: PSYNCH: pid[52893]: address already known to kernel for another [busy] synchronizer type
The culprit is my app, but I can’t reproduce the problem and don’t have much of a clue to its cause. My app does disk searching, and this error happens about 15 hours into the life of the process. There is no excessive memory usage or file descriptor leakage. The app continues to operate normally, it’s just that these messages cause the system log to blow up to gigabyte proportions and fill up the boot disk.
I found the Darwin kernel code where the message is printed, but it’s only a clue, it doesn’t show the smoking gun:
http://opensource.apple.com//source/xnu/xnu-1699.32.7/bsd/kern/pthread_support.c
FAILEDUSERTEST("address already known to kernel for another (busy) synchronizer type\n");
It’s in this function:
/* find kernel waitqueue, if not present create one. Grants a reference */
int
ksyn_wqfind(user_addr_t mutex, uint32_t mgen, uint32_t ugen, uint32_t rw_wc, uint64_t tid, int flags, int wqtype, ksyn_wait_queue_t * kwqp)
Can anyone provide any insight into what’s going on?
Here’s the profile for the machine:
Model Name: MacBook Pro
Model Identifier: MacBookPro12,1
Processor Name: Intel Core i5
Processor Speed: 2.7 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 3 MB
Memory: 8 GB
Boot ROM Version: MBP121.0167.B16
SMC Version (system): 2.28f7
Hardware UUID: 9205D058-90BF-541E-8E61-E75259ABC11F
System Software Overview:
System Version: OS X 10.11.4 (15E65)
Kernel Version: Darwin 15.4.0
Boot Volume: Macintosh HD
Boot Mode: Normal
Computer Name: athayer-mbp13
User Name: System Administrator (root)
Secure Virtual Memory: Enabled
system_integrity: integrity_enabled
Time since boot: 9 days 18:55

Possible Explanation
It's possible that you're being affected by an old kernel bug. If a pthread condition variable (the main component of a standard pthread_mutex family object) is allocated, but never waited on, there is a situation in which its object is never removed from a pthreads-internal registry on OSX.
If that happens, and if another mutex is later allocated that happens to end up in the same space in memory, and if that mutex is waited on, this error can occur, since the new mutex's ID will not match the one already present in its space. This is distinct from a reallocation issue where garbled/meaningless info is found instead of a valid ID.
Workaround
The workaround is to ensure that you call a wait function on every mutex/condvar you create. Even a nanosecond wait on a no-longer-used mutex will trigger "correct" destruction when it completes. An example of the fix by the Chromium devs is linked below.
For example, you could wait one nanosecond/tick on a lock thus:
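/* The mutex must be locked by the calling thread before the wait. */
struct timespec timeout = { .tv_sec = 0, .tv_nsec = 1 };
pthread_cond_timedwait_relative_np(
    &some_condition_handle,
    &some_lock_handle,
    &timeout /* the relative timeout is passed by pointer */
);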
Confounding Factors
The kernel bug may not be the real issue. There are a lot of confounding factors here:
The kernel source hasn't been published for 10.10 or 10.11, so the code being called that generates that error may not be the code that you found online.
As a result of that, the kernel bug I mentioned may not still exist, or may not be reachable in the same way.
The error line you published has square brackets ([]) around the word "busy", but the source you found has parentheses (()). The places in the code that print the two different messages are distinct from each other, so the problem lines might not be the ones you pointed out in your question.
Relevant Links
Article by the first (only?) person who has diagnosed this issue: http://rayne3d.com/blog/02-27-2014-rayne-weekly-devblog-4
The problem gets exhibited in the pthread source (or it was, in pthread 105.1.4), visible at this link (search in the page for 13782056): https://opensource.apple.com/source/libpthread/libpthread-105.1.4/src/pthread_cond.c
An example fix like the workaround listed above was made by the Chromium team when they were affected by a similar (the same?) issue: https://codereview.chromium.org/1323293005
The original Apple Developer Forum link appears to be defunct, though I might just be unable to access it: https://devforums.apple.com/thread/220316?tstart=0

electric-fence segfaults in malloc

I've got a rather complicated program that does a lot of memory allocation, and today by surprise it started segfaulting in a weird way that gdb couldn't pin-point the location of. Suspecting memory corruption somewhere, I linked it against Electric Fence, but I am baffled as to what it is telling me:
ElectricFence Exiting: mprotect() failed:
Program received signal SIGSEGV, Segmentation fault.
__strlen_sse2 () at ../sysdeps/i386/i686/multiarch/strlen.S:99
99 ../sysdeps/i386/i686/multiarch/strlen.S: No such file or directory.
in ../sysdeps/i386/i686/multiarch/strlen.S
#0 __strlen_sse2 () at ../sysdeps/i386/i686/multiarch/strlen.S:99
#1 0xb7fd6f2d in ?? () from /usr/lib/libefence.so.0
#2 0xb7fd6fc2 in EF_Exit () from /usr/lib/libefence.so.0
#3 0xb7fd6b48 in ?? () from /usr/lib/libefence.so.0
#4 0xb7fd66c9 in memalign () from /usr/lib/libefence.so.0
#5 0xb7fd68ed in malloc () from /usr/lib/libefence.so.0
#6 <and above are frames in my program>
I'm calling malloc with a value of 36, so I'm pretty sure that shouldn't be a problem.
What I don't understand is how it is even possible that I could be trashing the heap in malloc. In reading the manual page a bit more, it appears that maybe I am writing to a free page, or maybe I'm underwriting a buffer. So, I have tried the following environment variables, together and by themselves:
EF_PROTECT_FREE=1
EF_PROTECT_BELOW=1
EF_ALIGNMENT=64
EF_ALIGNMENT=4096
The last two had absolutely no effect.
The first one changed the portions of the stack frame which are in my program (where in my program was executing when malloc was called fatally), but with identical frames once malloc was entered.
The second one changed a bit more; in addition to the crash occurring at a different place in my program, it also occurred in a call to realloc instead of malloc, although realloc is directly calling malloc and otherwise the back trace is identical to above.
I'm not explicitly linking against any other libraries besides fence.
Update: I found several places where it suggests that the message: " mprotect() failed: Cannot allocate memory" means that there is not enough memory on the machine. But I am not seeing the "Cannot allocate memory" part, and ps says I am only using 15% of memory. With such a small allocation (4k+32) could this really be the problem?

I just wasted several hours on the same problem.
It turns out that it is to do with the setting in
/proc/sys/vm/max_map_count
From the kernel documentation:
"This file contains the maximum number of memory map areas a process may have. Memory map areas are used as a side-effect of calling malloc, directly by mmap and mprotect, and also when loading shared libraries.
While most applications need less than a thousand maps, certain programs, particularly malloc debuggers, may consume lots of them, e.g., up to one or two maps per allocation."
So you can 'cat' that file to see what it is set to, and then you can 'echo' a bigger number into it. Like this: echo 165535 > /proc/sys/vm/max_map_count
For me, this allowed electric fence to get past where it was before, and start to find real bugs.
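To see how close a process is to that limit before raising it, the number of mappings it currently holds can be compared with the configured maximum. The following is only a sketch, assuming a Linux system where each line of /proc/self/maps describes one mapping; the helper names (read_long_from_file, count_lines, near_map_limit) are made up for this illustration and are not part of Electric Fence or the kernel.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Reads a single integer from a file such as /proc/sys/vm/max_map_count.
// Returns -1 if the file cannot be opened or does not start with a number.
long read_long_from_file(const std::string& path) {
    std::ifstream in(path);
    long value = -1;
    if (!(in >> value)) {
        return -1;
    }
    return value;
}

// Counts the lines in a file; each line of /proc/self/maps is one mapping.
// Returns -1 if the file cannot be opened.
long count_lines(const std::string& path) {
    std::ifstream in(path);
    if (!in) {
        return -1;
    }
    long n = 0;
    std::string line;
    while (std::getline(in, line)) {
        ++n;
    }
    return n;
}

// True when `used` mappings are at least `fraction` of the `limit`.
// Unknown values (negative `used`, non-positive `limit`) yield false.
bool near_map_limit(long used, long limit, double fraction) {
    assert(fraction > 0.0 && fraction <= 1.0);
    if (used < 0 || limit <= 0) {
        return false;
    }
    return static_cast<double>(used) >= fraction * static_cast<double>(limit);
}
```

Called from inside the program being debugged, near_map_limit(count_lines("/proc/self/maps"), read_long_from_file("/proc/sys/vm/max_map_count"), 0.9) would report whether the process has used 90% of its mapping budget, which is one way to check whether a failing mprotect() is plausibly caused by this limit rather than by real memory corruption.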

C++/msvc6 application crashes due to heap corruption, any hints?

About the application
It runs on Windows XP Professional SP2.
It's built with Microsoft Visual C++ 6.0 with Service Pack 6.
It's MFC based.
It uses several external dlls (e.g. Xerces, ZLib or ACE).
It has high performance requirements.
It does a lot of network and hard disk I/O, but it's also cpu intensive.
It has an exception handling mechanism which generates a minidump when an unhandled exception occurs.
UPDATE: It is a highly multithreaded application and we are using mutexes to protect concurrent access (of course, we might be failing at some place...)
Facts about the crash
It only happens on multiprocessor/multicore machines and under heavy loads of work.
It happens at random (neither we nor our client have found a pattern yet) after some hours of running.
We cannot reproduce the crash in our testing lab. It only happens on some production systems (but always on multicore machines).
It always ends up crashing at the same point, although the complete stack is not always the same. Let me add the stack of the crashing thread (obtained using WinDbg, sorry we don't have symbols)
Exception code: c0000005 ACCESS_VIOLATION
Address : 006a85b9
Access Type : write
Access Address : 2e020fff
Fault address: 006a85b9 01:002a75b9 C:\MyDir\MyApplication.exe
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
030af6c8 7c9206eb 77bfc3c9 01a80000 00224bc3 MyApplication+0x2a85b9
030af960 7c91e9c0 7c92901b 00000ab4 00000000 ntdll!RtlAllocateHeap+0xeac (FPO: [Non-Fpo])
030af98c 7c9205c8 00000001 00000000 00000000 ntdll!ZwWaitForSingleObject+0xc (FPO: [3,0,0])
030af9c0 7c920551 01a80898 7c92056d 313adfb0 ntdll!RtlpFreeToHeapLookaside+0x22 (FPO: [2,0,4])
030afa8c 4ba3ae96 000307da 00130005 00040012 ntdll!RtlFreeHeap+0x1e9 (FPO: [Non-Fpo])
030afacc 77bfc2e3 0214e384 3087c8d8 02151030 0x4ba3ae96
030afb00 7c91e306 7c80bfc1 00000948 00000001 msvcrt!free+0xc8 (FPO: [Non-Fpo])
030afb20 0042965b 030afcc0 0214d780 02151218 ntdll!ZwReleaseSemaphore+0xc (FPO: [3,0,0])
030afb7c 7c9206eb 02e6c471 02ea0000 00000008 MyApplication+0x2965b
030afe60 7c9205c8 02151248 030aff38 7c920551 ntdll!RtlAllocateHeap+0xeac (FPO: [Non-Fpo])
030afe74 7c92056d 0210bfb8 02151250 02151250 ntdll!RtlpFreeToHeapLookaside+0x22 (FPO: [2,0,4])
030aff38 77bfc2de 01a80000 00000000 77bfc2e3 ntdll!RtlFreeHeap+0x647 (FPO: [Non-Fpo])
7c92056d c5ffffff ce7c94be ff7c94be 00ffffff msvcrt!free+0xc3 (FPO: [Non-Fpo])
7c920575 ff7c94be 00ffffff 12000000 907c94be 0xc5ffffff
7c920579 00ffffff 12000000 907c94be 90909090 0xff7c94be
*** WARNING: Unable to verify checksum for xerces-c_2_7.dll
*** ERROR: Symbol file could not be found. Defaulted to export symbols for xerces-c_2_7.dll -
7c92057d 12000000 907c94be 90909090 8b55ff8b MyApplication+0xbfffff
7c920581 907c94be 90909090 8b55ff8b 08458bec xerces_c_2_7
7c920585 90909090 8b55ff8b 08458bec 04408b66 0x907c94be
7c920589 8b55ff8b 08458bec 04408b66 0004c25d 0x90909090
7c92058d 08458bec 04408b66 0004c25d 90909090 0x8b55ff8b
The address MyApplication+0x2a85b9 corresponds to a call to erase() of a std::list.
What I have tried so far
Reviewing all the code related to the point where the crash ends up happening.
Trying to enable pageheap in our testing lab, though nothing useful has been found so far.
We have replaced the std::list with a C array, and now it crashes in another part of the code (it is related code, but not the code where the old list resided). Coincidentally, it now crashes in another erase, this time of a std::multiset. Here is the stack contained in the dump:
ntdll.dll!_RtlpCoalesceFreeBlocks#16() + 0x124e bytes
ntdll.dll!_RtlFreeHeap#12() + 0x91f bytes
msvcrt.dll!_free() + 0xc3 bytes
MyApplication.exe!006a4fda()
[Frames below may be incorrect and/or missing, no symbols loaded for MyApplication.exe]
MyApplication.exe!0069f305()
ntdll.dll!_NtFreeVirtualMemory#16() + 0xc bytes
ntdll.dll!_RtlpSecMemFreeVirtualMemory#16() + 0x1b bytes
ntdll.dll!_ZwWaitForSingleObject#12() + 0xc bytes
ntdll.dll!_RtlpFreeToHeapLookaside#8() + 0x26 bytes
ntdll.dll!_RtlFreeHeap#12() + 0x114 bytes
msvcrt.dll!_free() + 0xc3 bytes
c5ffffff()
(12-Apr-2010) I've tried to enable heap free checking (using gflags) but it slows down the application a lot...
Possible solutions (that I'm aware of) which cannot be applied
"Migrate the application to a newer compiler": We are working on this, but it's not a solution at the moment.
"Enable pageheap (normal or full)": We can't enable pageheap on production machines as this affects performance heavily.
I think that's all I remember now, if I have forgotten something I'll add it asap. If you can give me some hint or propose some possible solution, don't hesitate to answer!

You can try peppering your code with calls to the debug heap checking routines to see if you can locate the corruption closer to the source (you're using the debug CRT to track down this problem, right?):
http://msdn.microsoft.com/en-us/library/aa271695(VS.60).aspx

Use Application Verifier from debugging tools for windows. Sometimes it helps.
Try to set up VS to download OS debug symbols and make sure that OMIT FRAME POINTERS is off in your application. Perhaps stack trace will be informative.
Highly multithreaded
A long time ago I discovered that there is a limit on the thread count per process in WinXP. My test snippet could create only a few thousand threads. The problem was resolved by using a thread pool.
EDIT:
For my purposes it was enough just to check the “Application Verifier” checkbox in gflags.exe. Unfortunately, I have no experience with the other options.
As for thread limit, test snippet was simple:
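#include <windows.h>
#include <process.h>  // _beginthreadex
#include <tchar.h>
#include <stdio.h>

unsigned __stdcall ThreadProc(LPVOID)
{
    _tprintf(_T("Thread started\n"));
    return 0;
}

int _tmain(int argc, _TCHAR* argv[])
{
    while (TRUE)
    {
        unsigned threadId = 0;
        _tprintf(_T("Start thread\n"));
        // The returned handle is never closed with CloseHandle,
        // so the handle count of the process keeps growing.
        _beginthreadex(NULL, 0, &ThreadProc, NULL, 0, &threadId);
    }
    return 0;
}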
I didn’t wait long this time, but the handle count in Task Manager was increasing very fast. My real-world application hit this effect only after 12 hours. I must say, though, that the symptom was not a crash; new threads simply were not created.
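The thread pool fix mentioned in this answer can be sketched as follows. This is not the original code, which is not shown; it is a minimal illustration in modern C++ (C++11 or later, so it would not build with VC6 as-is), and the names ThreadPool and run_counting_tasks are hypothetical. The point is that a fixed set of worker threads drains a queue of tasks, so the number of threads and handles stays constant no matter how many jobs are submitted.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool: a fixed number of workers share one task queue.
class ThreadPool {
public:
    explicit ThreadPool(std::size_t workers) {
        for (std::size_t i = 0; i < workers; ++i) {
            threads_.emplace_back([this] { run(); });
        }
    }

    // Lets the workers finish all queued tasks, then joins them.
    ~ThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::thread& t : threads_) {
            t.join();
        }
    }

    ThreadPool(const ThreadPool&) = delete;
    ThreadPool& operator=(const ThreadPool&) = delete;

    // Queues a task; one idle worker is woken to pick it up.
    void submit(std::function<void()> task) {
        assert(task && "task must be callable");
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) {
                    return;  // stopping and nothing left to do
                }
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so workers do not serialize
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> threads_;
    bool stopping_ = false;
};

// Runs `tasks` trivial jobs on `workers` threads and returns how many ran.
int run_counting_tasks(std::size_t workers, int tasks) {
    std::atomic<int> done{0};
    {
        ThreadPool pool(workers);
        for (int i = 0; i < tasks; ++i) {
            pool.submit([&done] { ++done; });
        }
    }  // the destructor drains the queue and joins the workers
    return done.load();
}
```

With this shape, run_counting_tasks(4, 10000) executes ten thousand jobs while only four threads ever exist, in contrast to the test snippet above, which starts a new thread per iteration.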

Can you post what exceptions you are getting?
If this is some memory corruption bug, then the crash occurs sometime after the memory corruption, so that will be challenging to track down the root cause. You should:
Travel (or remotely log on) to the production system, install Visual Studio, have the .pdb and .map files ready (and the Windows symbols as well), attach the debugger to the release build and wait for the crash. If you set things up correctly, though, you can instead use the minidump file on your dev machine, where you would already have your app and the Windows symbols set up. Then you can see which free call is throwing, and try to figure out which object is being freed, to see whether that object or nearby objects in memory are corrupted.
Somehow find a way to reproduce the bug in your office, can you create high enough volumes to duplicate what the customer is doing?
Your posted callstacks don't look particularly illuminating.
Since you are using VS 6 with SP6, then its STL is OK.
Can you tell if the app on the production system is leaking any resources? Running perfmon can help with this.
Another thing: you're not calling new/delete very frequently from different threads, are you? I've found that if you do this fast enough, you'll crash your app rather quickly (I did this on XP). I had to replace the new/delete calls in my app with VirtualAlloc (the Windows virtual memory API), and that worked great for me. Of course, the STL could be allocating from the heap as well.

Use a performance profiler that can hook into CPU events, such as VTune. Set it up in sampling mode and tell it to wait for events related to cache line sharing. These are identified by a HITM event from the SNOOP phase.
If you run this on a multi processor machine with a realistic workload then it will find places in your code where there is active contention between threads for a single piece of data. You will need to analyze the profiler hot spots found this way and try to find something that is not being wrapped in an appropriate mutex.
I'm not an expert on CPU architecture, but my understanding is this: when a CPU is about to access a piece of data, the system checks whether any other CPU is accessing the same piece of data. This is done by watching the memory fetches and writes coming out of each CPU, a process called snooping. Snooping makes sure that if two or more CPUs hold the same data in their caches, the duplicated copies are removed when one of them is modified. A HIT-Modified event means that the system detected this situation and had to flush one of the CPUs' cache lines.
See this document for more information on using VTune like this
http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-events-ratios-optimizing-applications/
I don't have a copy of VTune in front of me right now so maybe this won't work but it seems like the lowest impact way of getting some data. VTune in sampling mode should not cause a lot of problems with performance.

The key here is that this only happens on multiprocessor machines (Cores are the same as processors)
What happens when a threaded program runs on a single processor is that two threads never execute at the same time. The OS has to time-slice each processor to simulate threads.
In a multiprocessor system multiple threads can operate at the same time.
You are probably accessing shared resources from different threads at the same time now.
These resources can be connections to external systems, global variables and data structures, or even Singleton classes.
Unfortunately you now have one of the hardest problems to find.
If you can find the memory being corrupted then you need to find who else is using it on a different thread and then synchronize the memory (Semaphore or CriticalSection).
Unfortunately there is no easy way to find the problem.
You might be able to set the processor affinity temporarily to only run on one processor until you find the problem. See link
http://msdn.microsoft.com/en-us/library/ms684251(VS.85).aspx
Here is a method to set the affinity:
For Windows XP/Vista/7, access Affinity by opening the Windows Task Manager (Ctrl+Alt+Del, or right-click on the Task Bar), select the "Processes" tab, right-click the application process you wish to isolate, then select "Set Affinity." Inside the Processor Affinity dialog, un-check the CPUs/cores you do not need to use. This isolates the application to the selected CPUs/cores, which prevents cache spanning, reduces process switching and makes it simpler to supervise CPU/core allocation for multiple programs.
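Once the shared data has been found, the synchronization step described above amounts to routing every access through one lock. The sketch below is an illustration only, written in modern C++ (C++11 or later); with VC6 a Win32 CRITICAL_SECTION would play the role of std::mutex. GuardedList and stress_guarded_list are hypothetical names, and a std::list is used because the crash in the question was in a list erase.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared container: every access to the list takes the same
// mutex, so two threads can never modify its nodes at the same time.
class GuardedList {
public:
    void push(int value) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push_back(value);
    }

    // Erases the first element, if any; returns false when the list is empty.
    bool pop_front() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) {
            return false;
        }
        items_.erase(items_.begin());
        return true;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size();
    }

private:
    mutable std::mutex mutex_;
    std::list<int> items_;
};

// Each thread pushes `per_thread` items and then erases half of them again.
// Returns the number of elements left in the list.
std::size_t stress_guarded_list(int threads, int per_thread) {
    assert(threads >= 0 && per_thread >= 0);
    GuardedList list;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&list, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                list.push(i);
            }
            for (int i = 0; i < per_thread / 2; ++i) {
                list.pop_front();
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return list.size();
}
```

Without the lock, concurrent push_back and erase calls on the same std::list are a data race and can corrupt the heap in exactly the delayed, hard-to-reproduce way described in the question; with it, the final size is deterministic regardless of how the threads interleave.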

As your second stack trace shows, your application is corrupting the heap. The header of a heap block is written over and thus the crash occurs in the heap manager when coalescing free blocks, or when going through the free list (in the first stack trace).
The code you identified that is currently freeing memory may be the victim of other code overflowing or underflowing a memory block.
The easiest way to debug this kind of crash is to use the debugging help from windows, through pageheap or appverifier, but depending on the application it may slow down too much, or grow the memory usage too high to be usable, which seems to be the case. You may try to use light pageheap, which will have less impact.
You need to identify what part of the application is overflowing. One way to do this is to look at the information contained in the overflown block. If you have a crash in RtlpCoalesceFreeBlocks, I think I remember one of the registers (#esi) is pointing to the start of the corrupted block (I am not on a windows system at the time of this writing and can not check that). Or if you have a dump, using windbg command !heap -a will dump all memory and display corrupted blocks (better log into a file, since the full heap listing can be long). Once corrupted blocks are known, their content may help to identify the code.
Another help can be to enable the stack backtraces (using gflags). This can be done in production as it is lighter than pageheap. It will add some information to heap blocks and may move the crash to another place in your application, but the stack traces will help to identify what code allocated the blocks that are overflowing.

I would focus on getting the issue to happen on a build for which you have proper debugging symbols, at least for your main application. You seem to gloss over this with "sorry we don't have symbols", but when symbols are applied, the stacktraces may show you more information.
What exactly does this mean: "We can't generate symbols because we're linking with a library which doesn't link if we're using them."? This seems odd.
