Limited stack trace in Process Explorer - Windows

I have a process running under Windows Server 2003 SP2. When I check the stack trace of one of its threads, it is always limited to 9 entries. Those entries are resolved correctly (I have PDBs in place), but the list is simply cut off in the middle.
Do you know of any limitation in Process Explorer?

I am assuming that you think the complete stack trace for this thread should have more than 9 entries. You don't mention whether this is a 32-bit or 64-bit OS, but I will assume a 32-bit OS and then cover 64-bit as an afterthought.
Sometimes when collecting a stack trace on 32 bit systems you cannot collect any items for the stack trace or you can only collect a limited amount of stack frame information even though you know the callstack is deeper. The reasons for this are:
Different calling conventions put data in different places on the stack, making it hard to walk the stack. I can think of 4 conventions, 3 in common use and one more exotic: cdecl, fastcall, stdcall, and naked (see the sketch after this list).
For release builds, the code optimizer may do away with the frame pointers using a technique known as Frame Pointer Omission (FPO). Without the FPO data (and sometimes even with FPO data in a PDB file) you cannot successfully walk the callstack.
Hooks - any helper DLLs, anti-virus, debugging hooks, instrumented code, malware, etc., may mess up the callstack at some point because they've inserted their own stub code onto the callstack, and that small section may not be walkable by the stack walker.
Bytecode virtual machines. Depending upon how the virtual machine is written, the VM may place trampolines on the callstack to aid its execution. These will make the stack hard to walk successfully.
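To make the calling-convention point concrete, here is a minimal sketch (MSVC syntax on 32-bit x86 is assumed; the function names are made up for illustration) of the same trivial function declared under each convention. Who removes the arguments from the stack, which registers carry them, and whether a conventional EBP frame exists all differ, which is exactly what makes a mixed 32-bit callstack hard to walk.

int __cdecl    add_cdecl(int a, int b)    { return a + b; }   // caller removes the arguments
int __stdcall  add_stdcall(int a, int b)  { return a + b; }   // callee removes the arguments (ret 8)
int __fastcall add_fastcall(int a, int b) { return a + b; }   // first two args passed in ECX/EDX

__declspec(naked) int __cdecl add_naked(int a, int b)
{
    // Naked: the compiler emits no prologue/epilogue at all, so there is
    // no standard frame for a stack walker to follow through this function.
    __asm {
        mov eax, [esp + 4]   // a (no frame pointer was set up)
        add eax, [esp + 8]   // b
        ret                  // cdecl-style: the caller removes the arguments
    }
}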
Because of the variety of calling conventions on 32 bit Windows (from both Microsoft and other vendors) it is hard to work out what to expect when you move from one frame to another.
For 64 bit systems there is one calling convention specified. That makes life a lot easier. That said, you still have the issues of helper DLLs and hooks doing their own thing with the stack and that may still cause you problems when walking the stack.
I doubt there is a limitation in Process Explorer. I think the issue is just that walking the callstack for that thread is problematic because of one of the reasons I've listed above.
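For reference, here is roughly the per-frame walk that a DbgHelp-based tool performs; it is a minimal sketch assuming a 32-bit target whose thread has already been suspended, with dbghelp.lib available. Every iteration of StackWalk64 has to recover a sane return address and frame pointer, so a single FPO-optimized frame or foreign hook stub is enough to end the walk early, which shows up as a truncated stack.

#include <windows.h>
#include <dbghelp.h>
#include <cstdio>
#pragma comment(lib, "dbghelp.lib")

void WalkSuspendedThread(HANDLE hProcess, HANDLE hThread)
{
    CONTEXT ctx = {};
    ctx.ContextFlags = CONTEXT_CONTROL;          // gives us EIP, EBP, ESP on x86
    if (!GetThreadContext(hThread, &ctx))
        return;

    STACKFRAME64 frame = {};
    frame.AddrPC.Offset    = ctx.Eip;  frame.AddrPC.Mode    = AddrModeFlat;
    frame.AddrFrame.Offset = ctx.Ebp;  frame.AddrFrame.Mode = AddrModeFlat;
    frame.AddrStack.Offset = ctx.Esp;  frame.AddrStack.Mode = AddrModeFlat;

    SymInitialize(hProcess, nullptr, TRUE);      // load symbols (PDBs) for loaded modules

    while (StackWalk64(IMAGE_FILE_MACHINE_I386, hProcess, hThread, &frame, &ctx,
                       nullptr, SymFunctionTableAccess64, SymGetModuleBase64, nullptr))
    {
        if (frame.AddrPC.Offset == 0)
            break;                               // the walker lost the chain: a short stack
        printf("0x%08llx\n", frame.AddrPC.Offset);
    }

    SymCleanup(hProcess);
}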

Related

How is table-based exception handling better than 32-bit Windows SEH?

In 32-bit Windows (at least with Microsoft compilers), exception handling is implemented using a stack of exception frames allocated dynamically on the call stack; the top of the exception stack is pointed to by a TIB entry. The runtime cost is a couple of PUSH/POP instructions per function that needs to handle exceptions, spilling the variables accessed by the exception handler onto the stack, and when handling an exception, a simple linked list walk.
In both 64-bit Windows and the Itanium / System V x86-64 ABI, unwinding instead uses a big sorted list describing all the functions in memory. The runtime cost is some tables for every function (not just the ones involved in exception handling), complications for dynamically generated code, and, when handling an exception, walking the function list once for every active function regardless of whether it has anything to do with exceptions or not.
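(For concreteness, here is a rough sketch of the two data structures involved, using the documented layouts; the struct names are illustrative, not the SDK ones.)

// 32-bit frame-based SEH: one of these records lives on the stack of every
// function that needs a handler, linked into a list rooted at fs:[0] (the TIB);
// dispatching an exception simply walks the linked list.
struct ExceptionRegistrationRecordSketch {
    ExceptionRegistrationRecordSketch* Next;    // the enclosing frame's record
    void*                              Handler; // this frame's handler routine
};

// 64-bit table-based unwinding: no per-frame records; each module carries a
// sorted .pdata table of entries like this, and the dispatcher searches it
// for the function containing the current RIP, one frame at a time.
struct RuntimeFunctionSketch {
    unsigned long BeginAddress; // RVA of the function start
    unsigned long EndAddress;   // RVA of the function end
    unsigned long UnwindInfo;   // RVA of unwind codes and the optional handler
};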
How is the latter better than the former? I understand why the Itanium model is cheaper in the common case than the traditional UNIX one based on setjmp/longjmp, but a couple of PUSHes and POPs plus some register spillage in 32-bit Windows doesn't seem that bad, for the (seemingly) much quicker and simpler handling that it provides. (IIRC, Windows API calls routinely consume Ks of stack space anyway, so it's not like we gain anything by forcing this data out into tables.)
In addition to optimizing the happy case, perhaps there was also a concern that buffer overflow vulnerabilities could tamper with the exception-handling information kept on the stack. If this information gets corrupted, it could seriously confuse the unwinder, or maybe even cause further errors (remember that std::terminate() is called if another exception gets thrown).
Source: http://www.osronline.com/article.cfm%5earticle=469.htm

Problems in reading memory while analyzing Windows kernel crash dumps

While analyzing Windows kernel crash dumps using WinDBG, I have often faced the problem of WinDBG not being able to read some memory locations. Recently, while analyzing a kernel crash dump (minidump file), I observed that of six stack variables (including two parameters), WinDBG successfully dumped the values of four, but it returned an error for the other two. I could not understand this, because all six variables were part of the same stack frame.
In addition to that, I noticed that when I tried to dump a global data structure, WinDBG returned me an error indicating "Unable to read memory at Address 0xfffff801139c50d0". I could not understand why WinDBG could not read a variable which had been defined globally in my driver.
I had loaded the symbols properly, including the PDB file of my driver. WinDBG did not give me any symbols related error.
I want to understand the reason for this behavior. Why does WinDBG fail to read the value of local and global variables? Can someone give me an explanation for this behavior?
Assuming you already have access to private symbols, this error is commonly caused by code optimization in the driver, where the PDB does not have enough information to determine the correct location of variables at all times.
Use !lmi <module name> and check whether the Characteristics field contains "perf" to determine if the code is optimized.
As advised in Debugging Performance Optimized Code: The resulting optimization reduces paging (and page faults), and increases spatial locality between code and data. It addresses a key performance bottleneck that would be introduced by poor positioning of the original code. A component that has gone through this optimization may have its code or data blocks within a function moved to different locations of the binary.
In modules that have been optimized by these techniques, the locations of code and data blocks will often be found at memory addresses different than the locations where they would reside after normal compilation and linking. Furthermore, functions may have been split into many non-contiguous blocks, in order that the most commonly-used code paths can be located close to each other on the same pages.
Therefore, a function (or any symbol) plus an offset will not necessarily have the same meaning it would have in non-optimized code. The rule of thumb when working with performance-optimized code is simply that you cannot perform reliable address arithmetic on optimized code.
You need to check the output of dv /V to determine where the debugger is actually looking for locals, and confirm this is correct.
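As a hypothetical illustration (not the asker's code) of why an optimized build can leave the debugger nothing to read: in a release build the compiler may keep a local in a register for its whole lifetime, or fold it away entirely, so dv /V reports a register location or no location at all even though the variable exists in the source.

// Illustrative only: in an optimized build 'total' and 'i' are typically
// enregistered, so there is no stack slot for the debugger to dump.
int Sum(const int* data, int n)
{
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += data[i];
    return total;
}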

Heap corruption caused by race condition - does not happen when application is slowed down. How to Debug?

We are experiencing a crash in a Windows C++ application right at startup. The crash currently happens only on our Windows 8.1 machine (the other development machines run Windows 7) and only in release builds. The stack trace is a bit different each time, but always related to memory allocation, so it's likely a heap corruption problem.
The problem is that, as soon as the application is slowed down a bit, the crash does not occur:
Debug builds do not crash.
If the release build application is linked against the debug crt (static or dynamic), the crash does not occur, so the CRT debug heap can't be used to track the problem.
If Application Verifier is hooked to the application and 'heap' tests are selected, the application does not crash.
Running the application through "Dr.Memory" also causes the crash to not happen.
In all these cases where the crash does not happen, the application is slightly slowed down and especially startup does take a bit longer, so my assumption is that it's a heap corruption caused by a race condition.
If we can't use the CRT debug heap or tools that slow down the app execution (because it does not crash then), what are good approaches to tracing down the circumstance under which the heap corrupts?
The behavior you describe might signal that your software has an issue with dynamic memory which is timing-sensitive. I would recommend a code review focused on variables that use dynamic allocation or reference dynamically allocated data: in particular, STL containers and any other objects allocated via new/malloc or similar. As a first step, you could find all such variables that are shared between different threads and analyze whether:
The variables are initialized before the first use.
The lifetime of the objects is longer than their use. For the data this means it shall not be used before it is allocated or after it is deallocated.
The variables are protected against simultaneous read/write from different threads.
The logical read/write sequence ensures that reading the variable is safe in the case where it has not yet been written by anyone.
If nothing is found, then perform some static code analysis (e.g. LINT or similar) and analyze all compiler warnings, if you have any.
Update: one more possibility is to define your own memory allocators that add head and tail guard areas to the allocated memory and, on every call, check whether those patterns have been corrupted. Once it happens you can at least dump the data and, together with the callstack, identify the place in the software first affected by the corruption. Then the analysis scope will be much reduced. But don't forget this might also change the timing so that the corruption won't happen.
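A minimal sketch of that idea, assuming MSVC, ignoring alignment, and omitting the array forms of new/delete that a real replacement would also have to provide: each allocation is bracketed by a known byte pattern that is verified on free, so the callstack at the point of detection is close to the victim of the corruption.

#include <cstdlib>
#include <cstring>
#include <cstdio>
#include <new>
#include <intrin.h>

static const size_t kGuardSize = 16;
static const unsigned char kGuardByte = 0xFD;

void* operator new(size_t size)
{
    // Layout: [stored size][head guard][user block][tail guard]
    unsigned char* raw = static_cast<unsigned char*>(
        std::malloc(sizeof(size_t) + kGuardSize + size + kGuardSize));
    if (!raw) throw std::bad_alloc();
    std::memcpy(raw, &size, sizeof(size_t));
    std::memset(raw + sizeof(size_t), kGuardByte, kGuardSize);                     // head guard
    std::memset(raw + sizeof(size_t) + kGuardSize + size, kGuardByte, kGuardSize); // tail guard
    return raw + sizeof(size_t) + kGuardSize;
}

void operator delete(void* p) noexcept
{
    if (!p) return;
    unsigned char* user = static_cast<unsigned char*>(p);
    unsigned char* raw  = user - kGuardSize - sizeof(size_t);
    size_t size = 0;
    std::memcpy(&size, raw, sizeof(size_t));
    for (size_t i = 0; i < kGuardSize; ++i) {
        if (raw[sizeof(size_t) + i] != kGuardByte || user[size + i] != kGuardByte) {
            std::fprintf(stderr, "guard bytes corrupted for block %p\n", p);
            __debugbreak();   // break here so the callstack shows who freed the corrupted block
            break;
        }
    }
    std::free(raw);
}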

How does the size of managed code affect memory footprint?

I have been tasked with reducing the memory footprint of a Windows CE 5.0 application. I came across Rob Tiffany's highly cited article, which recommends using a managed DLL to keep the code out of the process's slot. But there is something I don't understand.
The article says that
The JIT compiler is running in your slot and it pulls in IL from the 1 GB space as needed to compile the current call stack.
This means that all the code in the managed DLL can potentially end up in the process's slot. While this will help other processes by not loading the code into the common area, how does it help this process? FWIW, the article does mention that
It also reduces the amount of memory that has to be allocated inside your
My only thought is that just as the code is pulled into the slot it is also pushed/swapped out. But that is just a wild guess and probably completely false.
CF assemblies aren't loaded into the process slot like native DLLs are. They're actually accessed as memory-mapped files. This means that the size of the DLL is effectively irrelevant.
The managed heap also lies in shared memory, not your process slot, so object allocations are far less likely to cause process slot fragmentation or OOM's.
The JITter also doesn't just JIT and hold forever. It compiles what is necessary, and during a GC it may very well pitch compiled code that is not being used, or that hasn't been used in a while. You're never going to see an entire assembly JITted and pulled into the process slot (well, if it's a small assembly maybe, but it's certainly not typical).
Obviously some process slot memory has to be used to create some pointers, stack storage, etc etc, but by and large managed code has way less impact on the process slot limitations than native code. Of course you can still hit the limit with large stacks, P/Invokes, native allocations and the like.
In my experience, the area where people most often get into trouble with CF apps and memory is GDI objects and drawing. Bitmaps take up a lot of memory. Even though it's largely in shared memory, creating lots of them (along with brushes, pens, etc.) and not caching and reusing them is what most often gives a managed app a large memory footprint.
For a bit more detail this MSDN webcast on Compact Framework Memory Management, while old, is still very relevant.

Drawbacks of using /LARGEADDRESSAWARE for 32-bit Windows executables?

We need to link one of our executables with this flag as it uses lots of memory.
But why give one EXE file special treatment? Why not standardize on /LARGEADDRESSAWARE?
So the question is: is there anything wrong with using /LARGEADDRESSAWARE even if you don't need it? Why not use it as standard for all EXE files?
Blindly applying the LargeAddressAware flag to your 32-bit executable deploys a ticking time bomb!
By setting this flag you are testifying to the OS:
"Yes, my application (and all DLLs loaded during runtime) can cope with memory addresses up to 4 GB,
so don't restrict the VAS for the process to 2 GB but unlock the full range (of 4 GB)."
But can you really guarantee that?
Do you take responsibility for all the system DLLs, Microsoft redistributables, and 3rd-party modules your process may use?
Usually, memory allocation returns virtual addresses in low-to-high order, so unless your process consumes a lot of memory (or has a very fragmented virtual address space), it will never use addresses beyond the 2 GB boundary. This hides bugs related to high addresses.
If such bugs exist, they are hard to identify; they will show up sporadically "sooner or later". It's just a matter of time.
Luckily there is an extremely handy system-wide switch built into the Windows OS: for testing purposes, use the MEM_TOP_DOWN registry setting.
This forces all memory allocations to go from the top down, instead of the normal bottom up.
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management]
"AllocationPreference"=dword:00100000
(This is hex 0x100000; a Windows reboot is required, of course.)
With this switch enabled you will identify issues "sooner" rather than "later";
ideally you'll see them "right from the beginning".
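As a quick sanity check (a sketch, not required for the registry switch to work), you can also request top-down placement for a single allocation with the MEM_TOP_DOWN flag and confirm that a /LARGEADDRESSAWARE 32-bit process really receives addresses above the 2 GB boundary:

#include <windows.h>
#include <cstdio>

int main()
{
    // With LAA set (on a 64-bit OS, or a 32-bit OS booted with /3GB) this
    // address should be above 0x80000000; without LAA it stays below 2 GB.
    void* p = VirtualAlloc(nullptr, 1 << 20,
                           MEM_RESERVE | MEM_COMMIT | MEM_TOP_DOWN,
                           PAGE_READWRITE);
    std::printf("allocated at %p\n", p);
    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}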
Side note: for a first analysis I strongly recommend the tool VMMap (Sysinternals).
Conclusions:
When applying the LAA flag to your 32-bit executable it is mandatory to fully test it on an x64 OS with the top-down AllocationPreference switch set.
Issues in your own code you may be able to fix. To name one very obvious example: use unsigned instead of signed integer types for pointer values and memory sizes.
When encountering issues with 3rd-party modules you need to ask the author to fix their bugs; unless this is done, you had better remove the LargeAddressAware flag from your executable.
A note on testing:
The MEM_TOP_DOWN registry switch does not achieve the desired results for unit tests that are executed by a "test runner" that is itself not LAA-enabled.
See: Unit Testing for x86 LargeAddressAware compatibility
PS:
Also very "related" and quite interesting is the migration from 32-bit code to 64-bit.
For examples, see:
As a programmer, what do I need to worry about when moving to 64-bit windows?
https://www.sec.cs.tu-bs.de/pubs/2016-ccs.pdf (twice the bits, twice the trouble)
Because lots of legacy code is written with the expectation that "negative" pointers are invalid. Any address in the top two GB of a 32-bit process has the MSB set.
As such, it's far easier for Microsoft to play it safe and require applications that (a) need the full 4 GB and (b) have been developed and tested in a large-memory scenario to simply set the flag.
It's not - as you have noticed - that hard.
Raymond Chen - in his blog The Old New Thing - covers the issues with turning it on for all (32bit) applications.
No, "legacy code" in this context (C/C++) is not exclusively code that plays ugly tricks with the MSB of pointers.
It also includes all the code that uses 'int' to store the difference between two pointers, or the length of a memory area, instead of using the correct type 'size_t': 'int', being signed, has only 31 usable bits and cannot handle a value of more than 2 GB.
A way to cure a good part of your code is to go over it and correct all of those innocuous "mixing signed and unsigned" warnings. That should do a good part of the job, at least if you haven't defined functions where an argument of type int is actually a memory length.
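A made-up example of the kind of code those warnings point at (none of this is from the question): once LAA allocations can land above 0x80000000, anything that squeezes an address or a length into a signed int goes negative or overflows.

#include <windows.h>
#include <cstdio>

void Demo()
{
    char* base = static_cast<char*>(VirtualAlloc(nullptr, 64 * 1024,
                                                 MEM_COMMIT | MEM_RESERVE,
                                                 PAGE_READWRITE));
    char* end = base + 64 * 1024;

    int    bad_len  = static_cast<int>(end - base);    // overflows once a block exceeds 2 GB
    size_t good_len = static_cast<size_t>(end - base); // size_t covers the full range

    if (reinterpret_cast<INT_PTR>(base) < 0) {
        // With LAA this branch fires for perfectly valid pointers above 2 GB;
        // legacy code that treats "negative" addresses as invalid breaks here.
        std::printf("pointer above the 2 GB boundary: %p\n", base);
    }

    (void)bad_len; (void)good_len;
    VirtualFree(base, 0, MEM_RELEASE);
}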
However, that "legacy code" will probably appear to work right for quite a while, even if you correct nothing.
You'll only break when you allocate more than 2 GB in one block, or when you compare two unrelated pointers that are more than 2 GB apart.
As comparing unrelated pointers is technically undefined behaviour anyway, you won't encounter much code that does it (but you can never be sure).
And very frequently, even if in total you need more than 2 GB, your program never actually makes a single allocation larger than that. In fact, on Windows, even with LARGEADDRESSAWARE you won't by default be able to allocate that much given the way memory is organized; you'd need to shuffle the system DLLs around to get a contiguous block of more than 2 GB.
But Murphy's law says that kind of code will break one day; it's just that it will happen long after you've enabled LARGEADDRESSAWARE without checking, when nobody remembers that this has been done.

Resources