Random crashes in mshtml.dll - winapi

Recently we have been getting random crashes that are rather hard to reproduce. No common action in our app triggers them; sometimes they just happen when the app has been left idle for a long while. They have one thing in common, though: the top of the stack trace is always in mshtml!CDoc and looks like this:
[0x0] mshtml!CDoc::ReadOptionSettingsFromRegistry + 0xed
[0x1] mshtml!CDoc::UpdateFromRegistry + 0x123
[0x2] mshtml!CDoc::OnSettingsChange + 0xd0
[0x3] mshtml!OnSettingsChangeAllDocs + 0x8f
[0x4] mshtml!GlobalWndProc_SEH + 0x13b
[0x5] mshtml!GlobalWndProc + 0x2d
[0x6] user32!_InternalCallWinProc + 0x2b
[0x7] user32!UserCallWinProcCheckWow + 0x33a
[0x8] user32!DispatchClientMessage + 0xea
[0x9] user32!__fnINSTRINGNULL + 0x40
[0xa] ntdll!KiUserCallbackDispatcher + 0x4d
[0xb] user32!GetMessageW + 0x2e
The crashes are caused by access violations (c0000005, invalid pointer read) in mshtml!CDoc::ReadOptionSettingsFromRegistry.
No particular Windows message is being processed at that time; it might be anything. The message loop is just a regular
MSG msg;
while (::GetMessage(&msg, 0, 0, 0)) { ... }
I couldn't find any documentation for those CDoc functions. Does anyone have any idea what might cause these crashes or how to tackle this problem?
Two hints: 1) It might have something to do with copying/pasting HTML with the system clipboard. 2) We have both hosted IE browsers (legacy) and WebView2 browsers (already converted from IE) in our app. Perhaps they interfere with each other?

Mystery solved!
I was finally able to reproduce the crash occasionally on a VM with a debugger installed. Digging into the disassembly, I found that mshtml!GlobalWndProc was handling WM_SETTINGCHANGE every time it crashed (as indicated by the OnSettingsChange... functions in the call stack). This in turn enabled me to reproduce the crash consistently by first pasting anything from the clipboard into one of our HTML editors and then either broadcasting this message myself or changing any system setting (such as selecting a different system sound scheme). Narrowing it down further from there, the culprit turned out to be a missing virtual destructor in a clipboard handler class. This left an mshtml::IHTMLDocument2Ptr unreleased, which somehow led to the crash in MSHTML later while handling that Windows message.
In summary, the crash was caused by an unreleased smart pointer in our code, but was triggered by random system settings changes (probably scheduled by IT departments).
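To illustrate the failure mode, here is a minimal portable sketch. It is not our actual code: ClipboardHandlerBase, HtmlClipboardHandler and FakeDocument are hypothetical names, and std::shared_ptr merely stands in for the COM smart pointer mshtml::IHTMLDocument2Ptr. It shows the fixed shape: with a virtual destructor in the base class, deleting through a base pointer runs the derived destructor and releases the held reference. Without `virtual`, the same delete is undefined behavior and the member is typically never released.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for the COM document object.
struct FakeDocument {};

// Hypothetical base class. The virtual destructor is the fix: without it,
// deleting a derived object through a base pointer is undefined behavior,
// and the derived class's members are typically never destroyed.
class ClipboardHandlerBase {
public:
    virtual ~ClipboardHandlerBase() = default;
    virtual void OnPaste() = 0;
};

class HtmlClipboardHandler : public ClipboardHandlerBase {
public:
    explicit HtmlClipboardHandler(std::shared_ptr<FakeDocument> doc)
        : doc_(std::move(doc)) {}
    void OnPaste() override {}

private:
    // Stand-in for the smart pointer member that stayed unreleased.
    std::shared_ptr<FakeDocument> doc_;
};

// Creates a handler, destroys it through a base pointer, and returns how many
// references to the document remain afterwards.
long ReferencesAfterHandlerDestroyed() {
    auto doc = std::make_shared<FakeDocument>();
    ClipboardHandlerBase* handler = new HtmlClipboardHandler(doc);
    handler->OnPaste();
    delete handler;  // runs ~HtmlClipboardHandler because the base dtor is virtual
    return doc.use_count();
}
```

After the handler is destroyed, only the local `doc` reference should remain, so the function returns 1.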

Related

Delphi 11.1 TFileOpenDialog.Execute hangs when CoInitializeEx is used in initialization before Forms initialization

In a small-ish Delphi 11.1 32-bit app I put an instance of TFileOpenDialog on a form, added a couple of file types (*.zip and *.*), and called its Execute method in a button.OnClick handler.
At runtime, clicking the button resulted in nothing happening and the application hanging.
Call stack looked like this:
:76fe513c ntdll.NtWaitForMultipleObjects + 0xc
:75763ea4 ; C:\Windows\SysWOW64\combase.dll
:7575d017 ; C:\Windows\SysWOW64\combase.dll
:75761196 ; C:\Windows\SysWOW64\combase.dll
:7575e8c0 ; C:\Windows\SysWOW64\combase.dll
:75791354 ; C:\Windows\SysWOW64\combase.dll
:76ea6163 ; C:\Windows\SysWOW64\RPCRT4.dll
:76ea6e44 RPCRT4.NdrClientCall4 + 0x14
:6e7e241d ; C:\Windows\SysWOW64\OneCoreUAPCommonProxyStub.dll
Vcl.Dialogs.TCustomFileDialog.Execute(1115734)
After wasting a day googling and trying out various solutions, I finally (I know, I should have done this straight away) made a new VCL app with just a form, a button and a TFileOpenDialog. This worked as expected and displayed the File Open dialog.
Returning to my hanging app, I eventually spotted that the .dpr had uROCOMInit at the beginning of the uses clause, followed by Windows, Classes, Forms.
uROCOMInit is a RemObjects unit, which I am probably not allowed to share publicly, but in essence it does only one relevant thing: it calls CoInitializeEx(nil, 0) in its initialization section.
Owing to a lucky guess, I moved the uROCOMInit unit around in the uses clause and found the solution: when uROCOMInit appeared after Forms in the uses clause, TFileOpenDialog.Execute worked as expected.
So it looks like calling CoInitializeEx (or CoInitialize?) too early, i.e. before the initialization section of the Forms unit, breaks Delphi's TFileOpenDialog.
Can someone please explain what causes this behaviour?

SEH on Windows, call stack traceback is gone

I am reading this article about SEH on Windows, and here is the source code of myseh.cpp.
I debugged myseh.cpp with two breakpoints: one at printf("Hello from an exception handler\n"); on line 24 and one at DWORD handler = (DWORD)_except_handler; on line 36.
Then I ran it and it broke at line 36. I saw the stack trace as follows.
Continuing, an access violation occurred because of mov [eax], 1.
Then it broke at line 24. I saw the stack trace as follows.
It was the same thread, but the frame of main was gone! In its place was _except_handler. ESP also jumped from 0018f6c8 to 0018ef34, which is a big gap.
After the exception was handled.
I know that _except_handler must run in user mode rather than kernel mode.
After _except_handler returned, the thread switched to ring 0, the Windows kernel restored the CONTEXT with EAX set to &scratch, and then it returned to ring 3, so the thread continued running.
I am curious about the mechanism Windows uses to deal with exceptions:
Why was the frame calling main gone?
Why did ESP jump from 0018f6c8 to 0018ef34 (I mean, such a big gap)? Do those ESP addresses belong to the same thread's stack? Did the kernel play some tricks on ESP in ring 3? If so, why did it choose the address 0018ef34 for the handler callback's frame? Many thanks!
You are using the default debugger settings, not good enough to see all the details. They were chosen to help you focus on your own code and get the debug session started as quickly as possible.
The [External Code] block tells you that there are stack frames that do not belong to code that you have written. They belong to the operating system. Use Tools > Options > Debugging > General and untick the "Enable Just My Code" option.
The [Frames below might be incorrect...] warning tells you that the debugger doesn't have accurate PDBs to correctly walk the stack. Use Tools > Options > Debugging > Symbols, tick the "Microsoft Symbol Servers" option and choose a cache location. The debugger will now download the PDBs you need to debug through the operating system DLLs. This might take a while, but it is only done once.
You can reason out the big ESP change (0x794, i.e. 1940, bytes): the CONTEXT structure is quite large and takes up space on the stack.
After these changes you ought to now see something resembling:
ConsoleApplication1942.exe!_except_handler(_EXCEPTION_RECORD * ExceptionRecord, void * EstablisherFrame, _CONTEXT * ContextRecord, void * DispatcherContext) Line 22 C++
ntdll.dll!ExecuteHandler2@20() Unknown
ntdll.dll!ExecuteHandler@20() Unknown
ntdll.dll!_KiUserExceptionDispatcher@8() Unknown
ConsoleApplication1942.exe!main() Line 46 C++
ConsoleApplication1942.exe!invoke_main() Line 64 C++
ConsoleApplication1942.exe!__scrt_common_main_seh() Line 255 C++
ConsoleApplication1942.exe!__scrt_common_main() Line 300 C++
ConsoleApplication1942.exe!mainCRTStartup() Line 17 C++
kernel32.dll!@BaseThreadInitThunk@12() Unknown
ntdll.dll!__RtlUserThreadStart() Unknown
ntdll.dll!__RtlUserThreadStart@8() Unknown
Recorded on Win10 version 1607 and VS2015 Update 2. This isn't the correct way to write SEH handlers; find a better example in this post.

See what causes deadlock on pthread_mutex_lock

I have a Core Data iOS app that uses private queue concurrency in a background process. I'm getting a deadlock that makes the UI freeze up from time to time (fairly regularly, to be honest) - but all the info I get from the debugger (LLDB) is that it is stuck on pthread_mutex_lock. The stack trace is no longer than that, which makes debugging near on impossible:
thread #1: tid = 0x2503, 0x3b5060fc libsystem_kernel.dylib`__psynch_mutexwait + 24, stop reason = signal SIGSTOP
frame #0: 0x3b5060fc libsystem_kernel.dylib`__psynch_mutexwait + 24
frame #1: 0x3b44f128 libsystem_c.dylib`pthread_mutex_lock + 392
The Xcode process pane similarly shows only those two entries on the stack.
I'm quite new to this multithreading stuff so am at a total loss where to begin with fixing the issue. Any suggestions for how to go about debugging this?
Your stack is obviously longer than two frames; you can't start a thread with pthread_mutex_lock. So the truncation of the stack trace is pretty clearly just a bug in the lldb unwinder. If you have an ADC account, please file a bug about this at bugreporter.apple.com. Also, if you're not using the most recent version of lldb you can get your hands on, you might want to try that; maybe it fixes whatever bug you are seeing. You can install multiple versions of Xcode side by side, so you don't have to remove the one you are currently using to try a newer one.
You might also try another tool that will give you a backtrace (e.g. the Instruments time profiler) when your app gets into this state, since it uses a different unwinder. That will at least let you see what the full backtrace is.

What is 0x%08lx?

I've been getting a lot of blue screens on my XP box at work recently. So many, in fact, that I downloaded Debugging Tools for Windows (x86) and have been analyzing the crash dumps. I've also changed the dumps to mini only, or else I would probably end up losing half a work day each week just waiting for the blue screen to finish recording the detailed crash log.
Almost without exception every dump tells me that the cause of the blue screen is some kind of memory misallocation or misreference and the memory at 0x%08lx referenced 0x%08lx and could not be %s.
Out of idle curiosity I put "0x%08lx" into Google and found that quite a few crash dumps include this bizarre message. Am I to take it that 0x%08lx is a placeholder for something that should be meaningful? The "%s" in the concluding sentence "The memory could not be %s" definitely looks like it's missing a variable or something.
Does anyone know the provenance of this message? Is it actually supposed to be useful and what is it supposed to look like?
It's not a major thing I have always worked around it. It's just strange that so many people should see this in so many crash dumps and nobody ever says: "Oh the crash dump didn't complete that message properly it's supposed to read..."
I'm just curious as to whether anyone knows the purpose of this strange error message artefact.
0x%08lx and %s are almost certainly format specifiers for the C function sprintf. But it looks like the driver developers did as good a job in their error handling code as they did in the critical code, as you should never see these specifiers in the GUI; they should be replaced with meaningful values.
0x%08lx should turn into something like "0xe001d4ab", a hexadecimal 32-bit pointer value.
%s should be replaced by another string, in this case a description. Something like
the memory at 0xe001d4ab referenced 0xe005123f and could not be read.
Note that I made up the values. Basically, a kernel mode access violation occurred. Hopefully in the mini dumps you can see which module caused it and uninstall / update / whatever it.
I believe it is just the placeholder for the memory address. 0x is a literal prefix that tells the user the value is hexadecimal, while %08lx is the actual placeholder for an unsigned long (l) printed in hexadecimal (x), zero-padded to a width of 8 digits (08).
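To make the expansion concrete, here is a small sketch using std::snprintf. The function names FormatAddress and FormatFault are hypothetical, the addresses are the made-up ones from the answer above, and the template text simply mirrors the message quoted in the question rather than the exact string Windows uses. Note that %x prints lowercase hex digits.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Formats a single value with the "0x%08lx" specifier: a literal "0x" prefix
// followed by the value in hexadecimal, zero-padded to at least 8 digits.
std::string FormatAddress(unsigned long value) {
    char buffer[32];
    std::snprintf(buffer, sizeof buffer, "0x%08lx", value);
    return buffer;
}

// Fills in all three placeholders of the (assumed) message template.
std::string FormatFault(unsigned long at, unsigned long referenced,
                        const char* operation) {
    char buffer[128];
    std::snprintf(buffer, sizeof buffer,
                  "the memory at 0x%08lx referenced 0x%08lx and could not be %s",
                  at, referenced, operation);
    return buffer;
}
```

For example, FormatAddress(0x1Ful) yields "0x0000001f", and FormatFault(0xE001D4ABul, 0xE005123Ful, "read") yields the complete sentence that the crash dump should have shown.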

C++/msvc6 application crashes due to heap corruption, any hints?

About the application
It runs on Windows XP Professional SP2.
It's built with Microsoft Visual C++ 6.0 with Service Pack 6.
It's MFC based.
It uses several external dlls (e.g. Xerces, ZLib or ACE).
It has high performance requirements.
It does a lot of network and hard disk I/O, but it's also cpu intensive.
It has an exception handling mechanism which generates a minidump when an unhandled exception occurs.
UPDATE: It is a highly multithreaded application and we are using mutexes to protect concurrent access (of course, we might be failing at some place...)
Facts about the crash
It only happens on multiprocessor/multicore machines and under heavy loads of work.
It happens at random (neither we nor our client have found a pattern yet) after some hours of running.
We cannot reproduce the crash in our testing lab. It only happens on some production systems (but always on multicore machines).
It always ends up crashing at the same point, although the complete stack is not always the same. Let me add the stack of the crashing thread (obtained using WinDbg, sorry we don't have symbols)
Exception code: c0000005 ACCESS_VIOLATION
Address : 006a85b9
Access Type : write
Access Address : 2e020fff
Fault address: 006a85b9 01:002a75b9 C:\MyDir\MyApplication.exe
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
030af6c8 7c9206eb 77bfc3c9 01a80000 00224bc3 MyApplication+0x2a85b9
030af960 7c91e9c0 7c92901b 00000ab4 00000000 ntdll!RtlAllocateHeap+0xeac (FPO: [Non-Fpo])
030af98c 7c9205c8 00000001 00000000 00000000 ntdll!ZwWaitForSingleObject+0xc (FPO: [3,0,0])
030af9c0 7c920551 01a80898 7c92056d 313adfb0 ntdll!RtlpFreeToHeapLookaside+0x22 (FPO: [2,0,4])
030afa8c 4ba3ae96 000307da 00130005 00040012 ntdll!RtlFreeHeap+0x1e9 (FPO: [Non-Fpo])
030afacc 77bfc2e3 0214e384 3087c8d8 02151030 0x4ba3ae96
030afb00 7c91e306 7c80bfc1 00000948 00000001 msvcrt!free+0xc8 (FPO: [Non-Fpo])
030afb20 0042965b 030afcc0 0214d780 02151218 ntdll!ZwReleaseSemaphore+0xc (FPO: [3,0,0])
030afb7c 7c9206eb 02e6c471 02ea0000 00000008 MyApplication+0x2965b
030afe60 7c9205c8 02151248 030aff38 7c920551 ntdll!RtlAllocateHeap+0xeac (FPO: [Non-Fpo])
030afe74 7c92056d 0210bfb8 02151250 02151250 ntdll!RtlpFreeToHeapLookaside+0x22 (FPO: [2,0,4])
030aff38 77bfc2de 01a80000 00000000 77bfc2e3 ntdll!RtlFreeHeap+0x647 (FPO: [Non-Fpo])
7c92056d c5ffffff ce7c94be ff7c94be 00ffffff msvcrt!free+0xc3 (FPO: [Non-Fpo])
7c920575 ff7c94be 00ffffff 12000000 907c94be 0xc5ffffff
7c920579 00ffffff 12000000 907c94be 90909090 0xff7c94be
*** WARNING: Unable to verify checksum for xerces-c_2_7.dll
*** ERROR: Symbol file could not be found. Defaulted to export symbols for xerces-c_2_7.dll -
7c92057d 12000000 907c94be 90909090 8b55ff8b MyApplication+0xbfffff
7c920581 907c94be 90909090 8b55ff8b 08458bec xerces_c_2_7
7c920585 90909090 8b55ff8b 08458bec 04408b66 0x907c94be
7c920589 8b55ff8b 08458bec 04408b66 0004c25d 0x90909090
7c92058d 08458bec 04408b66 0004c25d 90909090 0x8b55ff8b
The address MyApplication+0x2a85b9 corresponds to a call to erase() of a std::list.
What I have tried so far
Reviewing all the code related to the point where the crash ends up happening.
Trying to enable pageheap in our testing lab, though nothing useful has been found so far.
We have replaced the std::list with a C array, and now it crashes in another part of the code (although it is related code, it's not the code where the old list resided). Coincidentally, it now crashes in another erase, this time of a std::multiset. Let me copy the stack contained in the dump:
ntdll.dll!_RtlpCoalesceFreeBlocks@16() + 0x124e bytes
ntdll.dll!_RtlFreeHeap@12() + 0x91f bytes
msvcrt.dll!_free() + 0xc3 bytes
MyApplication.exe!006a4fda()
[Frames below may be incorrect and/or missing, no symbols loaded for MyApplication.exe]
MyApplication.exe!0069f305()
ntdll.dll!_NtFreeVirtualMemory@16() + 0xc bytes
ntdll.dll!_RtlpSecMemFreeVirtualMemory@16() + 0x1b bytes
ntdll.dll!_ZwWaitForSingleObject@12() + 0xc bytes
ntdll.dll!_RtlpFreeToHeapLookaside@8() + 0x26 bytes
ntdll.dll!_RtlFreeHeap@12() + 0x114 bytes
msvcrt.dll!_free() + 0xc3 bytes
c5ffffff()
(12-Apr-2010) I've tried to enable heap free checking (using gflags) but it slows down the application a lot...
Possible solutions (that I'm aware of) which cannot be applied
"Migrate the application to a newer compiler": We are working on this, but it's not a solution at the moment.
"Enable pageheap (normal or full)": We can't enable pageheap on production machines as this affects performance heavily.
I think that's all I remember now, if I have forgotten something I'll add it asap. If you can give me some hint or propose some possible solution, don't hesitate to answer!
You can try peppering your code with calls to the debug heap checking routines to see if you can locate the corruption closer to the source (you're using the debug CRT to track down this problem, right?):
http://msdn.microsoft.com/en-us/library/aa271695(VS.60).aspx
Use Application Verifier from debugging tools for windows. Sometimes it helps.
Try to set up VS to download OS debug symbols and make sure that OMIT FRAME POINTERS is off in your application. Perhaps the stack trace will then be more informative.
Highly multithreaded
A long time ago I discovered that there is a limit on the thread count per process in WinXP. My test snippet could create only a few thousand threads. The problem was resolved by using a thread pool.
EDIT:
For my purposes it was enough just to check the “Application Verifier” checkbox in gflags.exe. Unfortunately, I have no experience with the other options.
As for the thread limit, the test snippet was simple:
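#include <windows.h>
#include <process.h>
#include <tchar.h>
#include <stdio.h>

// Thread body: print a message and exit immediately.
unsigned __stdcall ThreadProc(LPVOID)
{
    _tprintf(_T("Thread started\n"));
    return 0;
}

int _tmain(int argc, _TCHAR* argv[])
{
    // Start threads forever. The handle returned by _beginthreadex is never
    // closed, which is why the handle count keeps climbing.
    while (TRUE)
    {
        unsigned threadId = 0;
        _tprintf(_T("Start thread\n"));
        _beginthreadex(NULL, 0, &ThreadProc, NULL, 0, &threadId);
    }
    return 0;
}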
I didn’t wait long this time, but the handle count in Task Manager was increasing very fast. My real-world application only showed this effect after 12 hours. I must say, though, that the issue was not a crash; new threads simply were not created.
Can you post what exceptions you are getting?
If this is some memory corruption bug, then the crash occurs some time after the memory corruption, so it will be challenging to track down the root cause. You should:
Travel (or remotely log on) to the production system, install Visual Studio, have .pdb and .map files ready (and the Windows symbols as well), attach the debugger to the release build and wait for the crash. Though if you set it up correctly, you can use the minidump file on your dev machine, where you would already have your app and the Windows symbols set up. Then you can see which free call is throwing, and try to figure out which object is being freed to see whether that object, or nearby objects in memory, are corrupted somehow.
Somehow find a way to reproduce the bug in your office, can you create high enough volumes to duplicate what the customer is doing?
Your posted callstacks don't look particularly illuminating.
Since you are using VS 6 with SP6, then its STL is OK.
Can you tell if the app on the production system is leaking any resources? Running perfmon can help with this.
Another thing: you're not calling new/delete very frequently from different threads, are you? I've found that if you do this fast enough, you'll crash your app rather quickly (I did this on XP). I had to replace the new/delete calls in my app with VirtualAlloc (the Windows virtual memory API), which worked great for me. Of course, the STL could be allocating from the heap as well.
Use a performance profiler that can hook into CPU events, such as VTune. Set it up in sampling mode and tell it to wait for events related to cache line sharing. These are identified by a HITM event from the SNOOP phase.
If you run this on a multi processor machine with a realistic workload then it will find places in your code where there is active contention between threads for a single piece of data. You will need to analyze the profiler hot spots found this way and try to find something that is not being wrapped in an appropriate mutex.
I'm not an expert on CPU architecture or anything, but my understanding is that when a CPU is about to access a piece of data, the system checks whether any other CPUs are accessing the same piece of data. This is done by watching the memory fetches and writes coming out of each CPU, a process called snooping. Snooping makes sure that if two or more CPUs have the same data in their caches, the duplicated copies are removed when one of them is modified. A HIT-Modified event means that the system detected this situation and had to flush one of the CPUs' cache lines.
See this document for more information on using VTune like this
http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-events-ratios-optimizing-applications/
I don't have a copy of VTune in front of me right now so maybe this won't work but it seems like the lowest impact way of getting some data. VTune in sampling mode should not cause a lot of problems with performance.
The key here is that this only happens on multiprocessor machines (Cores are the same as processors)
What happens when a threaded program runs on a single processor is that two threads never execute at the same time. The OS has to time-slice each processor to simulate threads.
In a multiprocessor system multiple threads can operate at the same time.
You are probably accessing shared resources from different threads at the same time now.
These resources can be connections to external systems, global variables and data structures, and even Singleton classes.
Unfortunately you now have one of the hardest problems to find.
If you can find the memory being corrupted, then you need to find who else is using it on a different thread and then synchronize access to it (Semaphore or CriticalSection).
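As a rough illustration of what "synchronize" means here, the following portable sketch guards a shared std::list with a single lock. It is only a sketch under assumptions: GuardedList and RunWorkers are hypothetical names, and std::mutex and std::thread stand in for the CriticalSection and thread APIs a VC6 application would actually use. The point is that every access to the shared container, including erase(), goes through the same lock.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared container. Every member function takes the same mutex,
// so no two threads ever touch items_ at the same time.
class GuardedList {
public:
    void Push(int value) {
        std::lock_guard<std::mutex> lock(mutex_);
        items_.push_back(value);
    }

    // Erases the front element if there is one; returns whether it did.
    bool PopFront() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (items_.empty()) return false;
        items_.erase(items_.begin());
        return true;
    }

    std::size_t Size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return items_.size();
    }

private:
    mutable std::mutex mutex_;
    std::list<int> items_;
};

// Starts `threads` workers; each pushes `perThread` items and erases one item
// after every second push. Returns the number of items left at the end.
std::size_t RunWorkers(int threads, int perThread) {
    GuardedList list;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t) {
        workers.emplace_back([&list, perThread] {
            for (int i = 0; i < perThread; ++i) {
                list.Push(i);
                if (i % 2 == 0) list.PopFront();
            }
        });
    }
    for (auto& worker : workers) worker.join();
    return list.Size();
}
```

With 4 workers and 1000 pushes each, every worker erases 500 items, so 2000 items should remain. Removing the lock_guard lines would turn this into exactly the kind of unsynchronized access that tends to fail only on multicore machines.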
Unfortunately there is no easy way to find the problem.
You might be able to set the processor affinity temporarily to only run on one processor until you find the problem. See link
http://msdn.microsoft.com/en-us/library/ms684251(VS.85).aspx
Here is a method to set affinity on
For Windows XP/Vista/7, access Affinity by opening the Windows Task Manager (Ctrl+Alt+Del, or right-click on the taskbar), select the "Processes" tab, right-click the application process you wish to isolate, then select "Set Affinity." Inside the Processor Affinity dialog, uncheck the CPUs/cores you do not need to use. This effectively isolates that application to the selected CPUs/cores, preventing cache spanning, reducing process switching and simplifying your ability to supervise CPU/core allocation for multiple programs.
As your second stack trace shows, your application is corrupting the heap. The header of a heap block is written over and thus the crash occurs in the heap manager when coalescing free blocks, or when going through the free list (in the first stack trace).
The code you identified that is currently freeing memory may be a victim of other code overflowing or underflowing a memory block.
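The idea of catching an overflow closer to its source can be sketched with guard bytes around a buffer, similar in spirit to what debug heaps do. This is a conceptual illustration only, not how pageheap or the Windows heap manager is implemented; GuardedBuffer, WriteAndCheck, kGuardSize and kGuardByte are hypothetical names. The "overflow" here deliberately stays inside the backing storage, so the sketch itself has no undefined behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed guard layout: 4 bytes of a known pattern on each side of the payload.
const std::size_t kGuardSize = 4;
const unsigned char kGuardByte = 0xFD;

// Hypothetical buffer whose payload is surrounded by guard bytes, so a write
// that runs past either end can be detected afterwards.
class GuardedBuffer {
public:
    explicit GuardedBuffer(std::size_t payloadSize)
        : storage_(payloadSize + 2 * kGuardSize, kGuardByte),
          payloadSize_(payloadSize) {
        for (std::size_t i = 0; i < payloadSize_; ++i) {
            storage_[kGuardSize + i] = 0;
        }
    }

    unsigned char* Payload() { return storage_.data() + kGuardSize; }

    // Returns true if both guard regions still hold the original pattern.
    bool GuardsIntact() const {
        for (std::size_t i = 0; i < kGuardSize; ++i) {
            if (storage_[i] != kGuardByte) return false;
            if (storage_[kGuardSize + payloadSize_ + i] != kGuardByte) return false;
        }
        return true;
    }

private:
    std::vector<unsigned char> storage_;
    std::size_t payloadSize_;
};

// Writes `count` bytes starting at the payload and reports whether the guards
// survived. Counts beyond the payload size spill into the trailing guard; the
// count is clamped so the write never leaves the backing storage.
bool WriteAndCheck(std::size_t payloadSize, std::size_t count) {
    GuardedBuffer buffer(payloadSize);
    if (count > payloadSize + kGuardSize) count = payloadSize + kGuardSize;
    unsigned char* p = buffer.Payload();
    for (std::size_t i = 0; i < count; ++i) p[i] = 0xAB;
    return buffer.GuardsIntact();
}
```

Writing exactly 16 bytes into a 16-byte payload leaves the guards intact, while writing 17 bytes damages the trailing guard and is reported. Checking the guards right after suspicious code runs narrows down which writer is responsible, rather than waiting for a later free to crash.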
The easiest way to debug this kind of crash is to use the debugging help from windows, through pageheap or appverifier, but depending on the application it may slow down too much, or grow the memory usage too high to be usable, which seems to be the case. You may try to use light pageheap, which will have less impact.
You need to identify what part of the application is overflowing. One way to do this is to look at the information contained in the overflowed block. If you have a crash in RtlpCoalesceFreeBlocks, I think I remember that one of the registers (@esi) points to the start of the corrupted block (I am not on a Windows system at the time of this writing and cannot check that). Or, if you have a dump, the windbg command !heap -a will dump all memory and display corrupted blocks (better to log to a file, since the full heap listing can be long). Once the corrupted blocks are known, their content may help to identify the code.
Another help can be to enable the stack backtraces (using gflags). This can be done in production as it is lighter than pageheap. It will add some information to heap blocks and may move the crash to another place in your application, but the stack traces will help to identify what code allocated the blocks that are overflowing.
I would focus on getting the issue to happen on a build for which you have proper debugging symbols, at least for your main application. You seem to gloss over this with "sorry we don't have symbols", but when symbols are applied, the stacktraces may show you more information.
What exactly does this mean: "We can't generate symbols because we're linking with a library which doesn't link if we're using them."? This seems odd.

Resources