Call to ExAllocatePoolWithTag never returns - Windows

I am having some issues with my virtual HBA driver on Windows Server 2016. I ran the HLK crashdump support test; it failed 3 times out of 10. In those 3 failing runs, the crash dump hangs at 0% while taking a complete dump, kernel dump, or minidump.
By kernel debugging my code, I found that the call to ExAllocatePoolWithTag() for buffer allocation never actually returns.
Below is the statement that never returns.
pDeviceExtension->pcmdbuf = (struct mycmdrsp *)ExAllocatePoolWithTag(NonPagedPoolCacheAligned, pcmdqSignalSize, (ULONG)'TA1');
I searched the web for this; however, all the pages I found focus on this function returning NULL, whereas in my case it never returns at all.
Any help on how to move forward would be highly appreciated.
Thanks in advance.

You can't allocate memory in crash dump mode. You're running at HIGH_LEVEL with interrupts disabled and so you're calling this API at the wrong IRQL.
The typical solution for a hardware adapter is to set the RequestedDumpBufferSize in the PORT_CONFIGURATION_INFORMATION structure during the normal HwFindAdapter call. Then when you're called again in crash dump mode you use the CrashDumpRegion field to get your dump buffer allocation. You then need to write your own "crash dump mode only" allocator to allocate buffers out of this memory region.
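The dump-mode allocator itself can be a simple bump allocator over that region. A minimal sketch, assuming Base and Size were captured from the dump region the port driver hands you (the DUMP_HEAP name and layout are mine, not a WDK type; the standard kernel types come from the WDK headers the miniport already includes):

typedef struct _DUMP_HEAP {
    PUCHAR Base;     // start of the preallocated dump region
    SIZE_T Size;     // total bytes available
    SIZE_T Offset;   // next free byte
} DUMP_HEAP;

PVOID DumpHeapAlloc(DUMP_HEAP *Heap, SIZE_T Bytes)
{
    // Round up so allocations stay cache-line aligned, mirroring
    // NonPagedPoolCacheAligned.
    SIZE_T Aligned = (Bytes + 63) & ~((SIZE_T)63);
    PVOID Result;
    if (Heap->Offset + Aligned > Heap->Size)
        return NULL;                     // region exhausted
    Result = Heap->Base + Heap->Offset;
    Heap->Offset += Aligned;
    return Result;
}

In dump mode you typically never free; if you must reuse memory, resetting Offset between requests is usually enough.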
It's a huge pain, especially given that it's difficult or impossible to know how much memory you're ultimately going to need. I usually calculate some minimal configuration overhead (e.g. 1 channel, 8 I/O requests at a time, etc.) and then add in a registry-configurable slush. The only benefit is that the environment is stripped down, so you don't need to be in your all-singing, all-dancing configuration.

Related

Making a virtual IOPCIDevice with IOKit

I have managed to create a virtual IOPCIDevice which attaches to IOResources and basically does nothing. I'm able to get existing drivers to register and match against it.
However, when it comes to I/O handling, I have some trouble. I/O accesses through the functions described in the IOPCIDevice class (e.g. configRead, ioRead, configWrite, ioWrite) can be handled by my own code, but drivers that use memory mapping and IODMACommand are a problem.
There seem to be two things I need to manage: IODeviceMemory (described in the IOPCIDevice) and DMA transfers.
How could I create an IODeviceMemory that ultimately points at ordinary memory/RAM, so that when a driver tries to communicate with the PCI device, the access either does nothing or just moves the data to RAM, where my userspace client can handle it and act as an emulated PCI device?
And could DMA commands also be directed to my userspace client, without touching the source code of existing drivers that use IODMACommand?
Thanks!
Trapping memory accesses
So in theory, to achieve what you want, you would need to allocate a memory region, set its protection bits to read-only (or possibly neither read nor write if a read in the device you're simulating has side effects), and then trap any writes into your own handler function where you'd then simulate device register writes.
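The allocation and protection step is the easy part; a minimal sketch (assuming a 4096-byte page size, error handling omitted):

#include <sys/mman.h>

// Map one page of fake "device registers" without write permission,
// so every store into it raises a page protection fault.
static void *map_fake_device_page(void)
{
    return mmap(NULL, 4096, PROT_READ, MAP_ANON | MAP_PRIVATE, -1, 0);
}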
As far as I'm aware, you can do this sort of thing in macOS userspace using Mach exception handling. You'd need to set things up so that page protection fault exceptions from the process you're controlling get sent to a Mach port you control (the port setup is sketched after this list). In that port's message handler, you'd:
Check where the access was going.
If it's the device memory, suspend all the threads of the process.
Switch the thread the write is coming from to single-step, and temporarily allow writes to the memory region.
Resume the writer thread.
Trap the single-step message. Your "device memory" now contains the written value.
Perform your "device's" side effects.
Turn off single-step in the writer thread.
Resume all threads.
As I said, I believe this can be done in user space processes. It's not easy, and you can cobble together the Mach calls you need to use from various obscure examples across the web. I got something similar working once, but can't seem to find that code anymore, sorry.
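The exception-port setup is the best-documented piece. A minimal sketch, assuming target is the task whose faults you want to receive (error handling omitted):

#include <mach/mach.h>

// Create a receive port and route the target task's page protection
// faults (EXC_BAD_ACCESS) to it.
static mach_port_t install_bad_access_port(task_t target)
{
    mach_port_t exc_port = MACH_PORT_NULL;
    mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &exc_port);
    mach_port_insert_right(mach_task_self(), exc_port, exc_port,
                           MACH_MSG_TYPE_MAKE_SEND);
    task_set_exception_ports(target, EXC_MASK_BAD_ACCESS, exc_port,
                             EXCEPTION_DEFAULT, THREAD_STATE_NONE);
    return exc_port;
}

A handler thread then blocks in mach_msg() on exc_port and dispatches each message through exc_server(), which calls your catch_exception_raise() implementation; that's where the suspend/single-step/resume dance above lives.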
… in the kernel
Now, the other problem is you're trying to do this in the kernel. I'm not aware of any public KPIs that let you do anything like what I've described above. You could start looking for hacks in the following places:
You can quite easily make IOMemoryDescriptors backed by system memory. Don't worry about the IODeviceMemory terminology: these are just IOMemoryDescriptor objects; the IODeviceMemory class is a lie. Trapping accesses is another matter entirely. In principle, you can find out what virtual memory mappings of a particular MD exist using the "reference" flag to the createMappingInTask() function, and then call the redirect() method on the returned IOMemoryMap with a NULL backing memory argument. Unfortunately, this will merely suspend any thread attempting to access the mapping. You don't get a callback when this happens.
You could dig into the guts of the Mach VM memory subsystem, which mostly lives in the osfmk/vm/ directory of the xnu source. Perhaps there's a way to set custom fault handlers for a VM region there. You're probably going to have to get dirty with private kernel APIs though.
Why?
Finally, why are you trying to do this? Take a step back: what is it you're ultimately trying to achieve? Simulating a PCI device in this way doesn't seem like an end in itself, so is it really the only way to reach whatever greater goal you're after? See: XY problem

What happens in the kernel when the process accesses an address just allocated with brk/sbrk?

This is actually a theoretical question about memory management. Since different operating systems implement things differently, I'll have to satisfy my thirst for knowledge by asking how things work in only one of them :( preferably the open-source and widely used one: Linux.
Here is the list of things I know in the whole puzzle:
malloc() is user space. libc is responsible for the syscall job (calling brk/sbrk/mmap...). It manages to get big chunks of memory, described by ranges of virtual addresses. The library slices these chunks up and uses them to satisfy the user application's requests.
I know what the brk/sbrk syscalls do. I know what 'program break' means. These calls basically push the program break offset, and this is how libc gets its virtual memory chunks.
Now that the user application has a new virtual address to manipulate, it simply writes some value to it, like *allocated_integer = 5;. OK. Now what? If brk/sbrk only updates offsets in the process' entry in the process table, or whatever, how is the physical memory actually allocated?
I know about virtual memory, page tables, page faults, etc. But I want to know exactly how these things are related to the situation I depicted. For example: is the process' page table modified? How? When? Does a page fault occur? When? Why? With what purpose? When is the 'buddy algorithm' called, and the free_area data structure accessed? (http://www.tldp.org/LDP/tlk/mm/memory.html, section 3.4.1 Page Allocation)
Well, after finally finding an excellent guide (http://duartes.org/gustavo/blog/post/how-the-kernel-manages-your-memory/) and some hours digging through the Linux kernel, I found the answers...
Indeed, brk only pushes the virtual memory area.
When the user application hits *allocated_integer = 5;, a page fault occurs.
The page fault routine will search for the virtual memory area responsible for the address and then call the page table handler.
The page table handler goes through each level (2 levels in x86 and 4 levels in x86_64), allocating entries if they're not present (2nd, 3rd and 4th), and then finally calls the real handler.
The real handler actually calls the function responsible for allocating page frames.
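You can watch this from userspace. A small demo of my own (not from the guide above): the sbrk() call maps nothing, and the first store into the fresh page shows up as a minor page fault in the rusage counters:

#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>

int main(void)
{
    struct rusage before, after;
    char *p = (char *)sbrk(2 * 4096);   // push the break by two pages

    getrusage(RUSAGE_SELF, &before);
    p[2 * 4096 - 1] = 5;                // first touch of a brand-new page
    getrusage(RUSAGE_SELF, &after);

    printf("minor faults: %ld -> %ld\n", before.ru_minflt, after.ru_minflt);
    return 0;
}

Touching the last byte of the extension guarantees we hit a page that lies entirely beyond the old break, so it cannot have been faulted in before.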

get_user_pages -EFAULT error caused by VM_GROWSDOWN flag not set

I'm continuing my work on the FPGA driver.
Now I'm adding OpenCL support, so I have the following test.
It just issues NUM_OF_EXEC write and read requests of the same buffers and then waits for completion.
Each write/read request is serialized in the driver and executed sequentially as a DMA transaction. The DMA-related code can be viewed here.
So the driver takes a transaction, executes it (rsp_setup_dma and fpga_push_data_to_device), waits for an interrupt from the FPGA (fpga_int_handler), releases resources (fpga_finish_dma_write) and begins a new one. When NUM_OF_EXEC equals 1, all seems to work, but if I increase it, the problem appears. At some point get_user_pages (in rsp_setup_dma) returns -EFAULT. Debugging the kernel, I found out that the allocated vma doesn't have the VM_GROWSDOWN flag set (in find_extend_vma in mmap.c). But at this point I'm stuck, because I'm neither sure that I understand why this flag is needed, nor do I have an idea why it is not set. Why can get_user_pages fail with the above symptoms? How can I debug this?
On some architectures the stack grows up and on others the stack grows down. See hppa and hppa64 for the weirdos that created the need for such a flag.
So whenever you have to deal with setting up the stack for a kernel thread or process you'll have to provide the direction in which the stack grows as well.

WSASYSCALLFAILURE with overlapped IO on Windows XP

I hit a bug in my code which uses WSARecv and WSAGetOverlappedResult on an overlapped socket. Under heavy load, WSAGetOverlappedResult fails with WSASYSCALLFAILURE ('A system call that should never fail has failed') and my TCP stream is out of sync afterwards, causing mayhem in the upper levels of my program.
So far I have not been able to isolate it to a given set of hardware or drivers. Has somebody hit this issue as well, and found a solution or workaround?
How many connections, how many pending recvs, how many outstanding sends? What does perfmon or Task Manager say about the amount of non-paged pool used? How much memory is in the box? Does it go away if you run the program on Vista or above? Do you have any LSPs installed?
You could be exhausting non-paged pool and causing a badly written driver to misbehave when it fails to allocate memory. This issue is less likely to bite on Vista or later as the amount of non-paged pool available has increased dramatically (see http://www.lenholgate.com/blog/2009/03/excellent-article-on-non-paged-pool.html for details). Alternatively you might be hitting the "locked pages" limit (you can only lock a fixed number of pages in memory on the OS and each pending I/O operation locks one or more pages depending on buffer size and allocation alignment).
It seems I have solved this issue by sleeping 1 ms and retrying the WSAGetOverlappedResult call when it reports a WSASYSCALLFAILURE.
I had another issue, related to overlapped events firing even though there is no data, which I also had to solve first. The test has now been running for over an hour, with a few WSASYSCALLFAILUREs handled correctly. Hopefully the overnight test will succeed as well.
@Len: thanks again for your help.
EDIT: The overnight test was successful. My bug was caused by two interdependent issues:
Issue 1: WaitForMultipleObjects in ConnectionSet::select occasionally signals data on an empty socket, causing SocketConnection::readSync to deadlock.
Fix: Do a non-blocking read on the first byte of each packet; reset the ConnectionSet if the socket was empty.
Issue 2: WSAGetOverlappedResult occasionally returns WSASYSCALLFAILURE, causing the TCP stream to go out of sync.
Fix: Retry WSAGetOverlappedResult after a small sleep period.
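The retry from the second fix can live in a small helper. A sketch, on the assumption that WSASYSCALLFAILURE really is transient here (the helper name is mine):

#include <winsock2.h>

// Keep retrying after a 1 ms sleep while the reported error is
// WSASYSCALLFAILURE; any other failure is reported to the caller.
BOOL GetOverlappedResultWithRetry(SOCKET sock, WSAOVERLAPPED *ov,
                                  DWORD *transferred)
{
    DWORD flags = 0;
    while (!WSAGetOverlappedResult(sock, ov, transferred, FALSE, &flags))
    {
        if (WSAGetLastError() != WSASYSCALLFAILURE)
            return FALSE;   // genuine failure
        Sleep(1);           // transient: back off and retry
    }
    return TRUE;
}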
http://equalizer.svn.sourceforge.net/viewvc/equalizer?view=revision&revision=4649

C++/msvc6 application crashes due to heap corruption, any hints?

About the application
It runs on Windows XP Professional SP2.
It's built with Microsoft Visual C++ 6.0 with Service Pack 6.
It's MFC based.
It uses several external dlls (e.g. Xerces, ZLib or ACE).
It has high performance requirements.
It does a lot of network and hard disk I/O, but it's also CPU-intensive.
It has an exception handling mechanism which generates a minidump when an unhandled exception occurs.
UPDATE: It is a highly multithreaded application, and we are using mutexes to protect concurrent access (of course, we might have missed some place...)
Facts about the crash
It only happens on multiprocessor/multicore machines and under heavy workloads.
It happens at random (neither we nor our client have found a pattern yet) after some hours of running.
We cannot reproduce the crash in our testing lab. It only happens on some production systems (but always on multicore machines).
It always ends up crashing at the same point, although the complete stack is not always the same. Let me add the stack of the crashing thread (obtained using WinDbg; sorry, we don't have symbols):
Exception code: c0000005 ACCESS_VIOLATION
Address : 006a85b9
Access Type : write
Access Address : 2e020fff
Fault address: 006a85b9 01:002a75b9 C:\MyDir\MyApplication.exe
ChildEBP RetAddr Args to Child
WARNING: Stack unwind information not available. Following frames may be wrong.
030af6c8 7c9206eb 77bfc3c9 01a80000 00224bc3 MyApplication+0x2a85b9
030af960 7c91e9c0 7c92901b 00000ab4 00000000 ntdll!RtlAllocateHeap+0xeac (FPO: [Non-Fpo])
030af98c 7c9205c8 00000001 00000000 00000000 ntdll!ZwWaitForSingleObject+0xc (FPO: [3,0,0])
030af9c0 7c920551 01a80898 7c92056d 313adfb0 ntdll!RtlpFreeToHeapLookaside+0x22 (FPO: [2,0,4])
030afa8c 4ba3ae96 000307da 00130005 00040012 ntdll!RtlFreeHeap+0x1e9 (FPO: [Non-Fpo])
030afacc 77bfc2e3 0214e384 3087c8d8 02151030 0x4ba3ae96
030afb00 7c91e306 7c80bfc1 00000948 00000001 msvcrt!free+0xc8 (FPO: [Non-Fpo])
030afb20 0042965b 030afcc0 0214d780 02151218 ntdll!ZwReleaseSemaphore+0xc (FPO: [3,0,0])
030afb7c 7c9206eb 02e6c471 02ea0000 00000008 MyApplication+0x2965b
030afe60 7c9205c8 02151248 030aff38 7c920551 ntdll!RtlAllocateHeap+0xeac (FPO: [Non-Fpo])
030afe74 7c92056d 0210bfb8 02151250 02151250 ntdll!RtlpFreeToHeapLookaside+0x22 (FPO: [2,0,4])
030aff38 77bfc2de 01a80000 00000000 77bfc2e3 ntdll!RtlFreeHeap+0x647 (FPO: [Non-Fpo])
7c92056d c5ffffff ce7c94be ff7c94be 00ffffff msvcrt!free+0xc3 (FPO: [Non-Fpo])
7c920575 ff7c94be 00ffffff 12000000 907c94be 0xc5ffffff
7c920579 00ffffff 12000000 907c94be 90909090 0xff7c94be
*** WARNING: Unable to verify checksum for xerces-c_2_7.dll
*** ERROR: Symbol file could not be found. Defaulted to export symbols for xerces-c_2_7.dll -
7c92057d 12000000 907c94be 90909090 8b55ff8b MyApplication+0xbfffff
7c920581 907c94be 90909090 8b55ff8b 08458bec xerces_c_2_7
7c920585 90909090 8b55ff8b 08458bec 04408b66 0x907c94be
7c920589 8b55ff8b 08458bec 04408b66 0004c25d 0x90909090
7c92058d 08458bec 04408b66 0004c25d 90909090 0x8b55ff8b
The address MyApplication+0x2a85b9 corresponds to a call to erase() of a std::list.
What I have tried so far
Reviewing all the code related to the point where the crash ends up happening.
Trying to enable pageheap in our testing lab, though nothing useful has been found so far.
We have replaced the std::list with a C array, and then it crashes in another part of the code (although it is related code, it's not where the old list resided). Coincidentally, it now crashes in another erase, though this time of a std::multiset. Let me copy the stack contained in the dump:
ntdll.dll!_RtlpCoalesceFreeBlocks@16() + 0x124e bytes
ntdll.dll!_RtlFreeHeap@12() + 0x91f bytes
msvcrt.dll!_free() + 0xc3 bytes
MyApplication.exe!006a4fda()
[Frames below may be incorrect and/or missing, no symbols loaded for MyApplication.exe]
MyApplication.exe!0069f305()
ntdll.dll!_NtFreeVirtualMemory@16() + 0xc bytes
ntdll.dll!_RtlpSecMemFreeVirtualMemory@16() + 0x1b bytes
ntdll.dll!_ZwWaitForSingleObject@12() + 0xc bytes
ntdll.dll!_RtlpFreeToHeapLookaside@8() + 0x26 bytes
ntdll.dll!_RtlFreeHeap@12() + 0x114 bytes
msvcrt.dll!_free() + 0xc3 bytes
c5ffffff()
(12-Apr-2010) I've tried to enable heap free checking (using gflags) but it slows down the application a lot...
Possible solutions (that I'm aware of) which cannot be applied
"Migrate the application to a newer compiler": We are working on this but It's not a solution at the moment.
"Enable pageheap (normal or full)": We can't enable pageheap on production machines as this affects performance heavily.
I think that's all I remember now, if I have forgotten something I'll add it asap. If you can give me some hint or propose some possible solution, don't hesitate to answer!
You can try peppering your code with calls to the debug heap checking routines to see if you can locate the corruption closer to the source (you're using the debug CRT to track down this problem, right?):
http://msdn.microsoft.com/en-us/library/aa271695(VS.60).aspx
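For example (a sketch; this only does anything when you build against the debug CRT, /MTd or /MDd):

#include <crtdbg.h>

// Walk the debug heap and assert on overwritten guard bytes; drop calls
// like this around the suspect code paths to narrow down the corruption.
void CheckHeapHere(void)
{
    _ASSERTE(_CrtCheckMemory());
}

You can also make the CRT validate the heap on every allocation and free with _CrtSetDbgFlag(_CrtSetDbgFlag(_CRTDBG_REPORT_FLAG) | _CRTDBG_CHECK_ALWAYS_DF), though that is very slow.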
Use Application Verifier from the Debugging Tools for Windows. Sometimes it helps.
Try to set up VS to download the OS debug symbols, and make sure that OMIT FRAME POINTERS is off in your application. Perhaps the stack trace will be informative.
Highly multithreaded
A long time ago I discovered that there is a limit on the thread count per process in WinXP. My test snippet could create only a few thousand threads. The problem was resolved with a thread pool.
EDIT:
For my purposes it was enough just to check the "Application Verifier" checkbox in gflags.exe. Unfortunately, I have no experience with the other options.
As for the thread limit, the test snippet was simple:
unsigned __stdcall ThreadProc(LPVOID)
{
    _tprintf(_T("Thread started\n"));
    return 0;
}

int _tmain(int argc, _TCHAR* argv[])
{
    while (TRUE)
    {
        unsigned threadId = 0;
        _tprintf(_T("Start thread\n"));
        // Note: the returned handle is never closed, so the process
        // handle count climbs until thread creation starts failing.
        _beginthreadex(NULL, 0, &ThreadProc, NULL, 0, &threadId);
    }
    return 0;
}
I didn't wait long this time, but the handle count in Task Manager was increasing very fast. My real-world application only got this effect after 12 hours. But I must say the issue was not a crash; new threads were simply not created.
Can you post what exceptions you are getting?
If this is some memory corruption bug, then the crash occurs sometime after the memory corruption, so that will be challenging to track down the root cause. You should:
Travel (or remotely log on) to the production system, install Visual Studio, have the .pdb and .map files ready (and the Windows symbols as well), attach the debugger to the release build, and wait for the crash. If you set things up correctly, you can instead use the minidump file on your dev machine, where you would already have your app's and Windows' symbols set up. Then you can see which free call is failing, figure out which object is being freed, and check whether that object or the nearby objects in memory are corrupted somehow.
Somehow find a way to reproduce the bug in your office. Can you create high enough volumes to duplicate what the customer is doing?
Your posted callstacks don't look particularly illuminating.
Since you are using VS 6 with SP6, its STL is OK.
Can you tell if the app on the production system is leaking any resources? Running perfmon can help with this.
Another thing: you're not calling new/delete very frequently from different threads, are you? I've found that if you do this fast enough, you'll crash your app rather quickly (I saw this on XP). I had to replace the new/delete calls in my app with VirtualAlloc (the Windows virtual memory API), which worked great for me. Of course, the STL could be allocating from the heap as well.
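A sketch of that kind of replacement (the wrapper names are mine; it is wasteful for small objects since every allocation takes at least a page, but heap-header corruption by neighbours can't touch it):

#include <windows.h>

// Allocate whole pages straight from the VM system, bypassing the CRT heap.
void *PageAlloc(size_t bytes)
{
    return VirtualAlloc(NULL, bytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
}

void PageFree(void *p)
{
    VirtualFree(p, 0, MEM_RELEASE);
}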
Use a performance profiler that can hook into CPU events, such as VTune. Set it up in sampling mode and tell it to wait for events related to cache line sharing. These are identified by a HITM event from the SNOOP phase.
If you run this on a multi processor machine with a realistic workload then it will find places in your code where there is active contention between threads for a single piece of data. You will need to analyze the profiler hot spots found this way and try to find something that is not being wrapped in an appropriate mutex.
I'm not an expert on CPU architecture or anything, but my understanding is that when a CPU is about to access a piece of data, the system checks whether any other CPU is accessing the same data. This is done by watching the memory fetches and writes coming out of each CPU, a process called snooping. Snooping makes sure that if two or more CPUs have the same data in their caches, the duplicated copies are invalidated when one of them is modified. A HIT-Modified event means that the system detected this situation and had to flush one of the CPUs' cache lines.
See this document for more information on using VTune like this
http://software.intel.com/en-us/articles/using-intel-vtune-performance-analyzer-events-ratios-optimizing-applications/
I don't have a copy of VTune in front of me right now, so maybe this won't work, but it seems like the lowest-impact way of getting some data. VTune in sampling mode should not cause a lot of problems with performance.
The key here is that this only happens on multiprocessor machines (cores count as processors here).
When a threaded program runs on a single processor, two threads never execute at the same time; the OS has to time-slice the processor to simulate concurrency.
In a multiprocessor system, multiple threads can run at the same time.
You are probably now accessing shared resources from different threads at the same time.
These resources can be connections to external systems, global variables, data structures, even singleton classes.
Unfortunately you now have one of the hardest problems to find.
If you can find the memory being corrupted, then you need to find who else is using it from a different thread, and then synchronize access to it (with a semaphore or CRITICAL_SECTION).
Unfortunately there is no easy way to find the problem.
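Once you do find it, the synchronization itself is the easy part; a minimal sketch of the critical-section pattern (names illustrative):

#include <windows.h>

CRITICAL_SECTION g_lock;   // call InitializeCriticalSection(&g_lock) at startup

void UpdateShared(int *sharedValue, int v)
{
    EnterCriticalSection(&g_lock);
    *sharedValue = v;      // every reader and writer must take the same lock
    LeaveCriticalSection(&g_lock);
}

The hard part is not the lock but finding every unsynchronized access that needs to take it.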
You might be able to set the processor affinity temporarily so that the application only runs on one processor until you find the problem. See this link:
http://msdn.microsoft.com/en-us/library/ms684251(VS.85).aspx
Here is a method for setting affinity:
For Windows XP/Vista/7, access affinity by opening the Windows Task Manager (CTRL+ALT+DEL, or right-click the Task Bar), select the "Processes" tab, right-click the application process you wish to isolate, then select "Set Affinity." Inside the Processor Affinity dialog, un-check the CPUs/cores you do not want to use. This effectively isolates the application to the selected CPUs/cores, preventing cache spanning, reducing process switching, and simplifying your ability to supervise CPU/core allocation for multiple programs.
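The programmatic equivalent of those Task Manager steps, as a small diagnostic sketch:

#include <windows.h>

int main(void)
{
    // Pin the current process to CPU 0 while hunting the race.
    if (!SetProcessAffinityMask(GetCurrentProcess(), 1 /* mask: CPU 0 */))
        return (int)GetLastError();
    // ... run the normal workload from here ...
    return 0;
}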
As your second stack trace shows, your application is corrupting the heap. The header of a heap block is written over and thus the crash occurs in the heap manager when coalescing free blocks, or when going through the free list (in the first stack trace).
The code you identified that is currently freeing memory may be a victim of other code overflowing or underflowing a memory block.
The easiest way to debug this kind of crash is to use the debugging help from Windows, through pageheap or Application Verifier, but depending on the application it may slow things down too much, or grow memory usage too high to be usable, which seems to be the case here. You may try light pageheap, which has less impact.
You need to identify what part of the application is overflowing. One way to do this is to look at the information contained in the overflown block. If you have a crash in RtlpCoalesceFreeBlocks, I think I remember that one of the registers (esi) points to the start of the corrupted block (I am not on a Windows system at the time of this writing and cannot check that). Or, if you have a dump, the WinDbg command !heap -a will dump all memory and display corrupted blocks (better to log to a file, since the full heap listing can be long). Once the corrupted blocks are known, their content may help to identify the code.
Another aid is to enable stack backtraces (using gflags). This can be done in production, as it is lighter than pageheap. It will add some information to heap blocks and may move the crash to another place in your application, but the stack traces will help identify what code allocated the blocks that are overflowing.
I would focus on getting the issue to happen on a build for which you have proper debugging symbols, at least for your main application. You seem to gloss over this with "sorry, we don't have symbols", but when symbols are applied, the stack traces may show you more information.
What exactly does this mean: "We can't generate symbols because we're linking with a library which doesn't link if we're using them."? This seems odd.
