I've been getting a lot of blue screens on my XP box at work recently. So many, in fact, that I downloaded Debugging Tools for Windows (x86) and have been analyzing the crash dumps. So many, in fact, that I've changed the dumps to mini only, or else I would probably end up tanking half a work day each week just waiting for the blue screen to finish recording the detailed crash log.
Almost without exception every dump tells me that the cause of the blue screen is some kind of memory misallocation or misreference and the memory at 0x%08lx referenced 0x%08lx and could not be %s.
Out of idle curiosity I put "0x%08lx" into Google and found that quite a few crash dumps include this bizarre message. Am I to take it that 0x%08lx is a placeholder for something that should be meaningful? The "%s" in the concluding sentence "The memory could not be %s" definitely looks like it's missing a variable or something.
Does anyone know the provenance of this message? Is it actually supposed to be useful and what is it supposed to look like?
It's not a major thing; I have always worked around it. It's just strange that so many people should see this in so many crash dumps and nobody ever says: "Oh, the crash dump didn't complete that message properly, it's supposed to read..."
I'm just curious as to whether anyone knows the purpose of this strange error message artefact.
0x%08lx and %s are almost certainly format specifiers for a printf-family C function such as sprintf. It looks like the driver developers did as good a job in their error handling code as they did in the critical code, as you should never see these specifiers in the GUI -- they should be replaced with meaningful values.
0x%08lx should turn into something like "0xE001D4AB", a hexadecimal 32-bit pointer value.
%s should be replaced by another string, in this case a description. Something like
the memory at 0xE001D4AB referenced
0xE005123F and could not be read.
Note that I made up the values. Basically, a kernel mode access violation occurred. Hopefully in the mini dumps you can see which module caused it and uninstall / update / whatever it.
I believe it is just the placeholder for the memory address. 0x is a literal prefix that tells the user the value is hexadecimal, while %08lx is the actual placeholder: an unsigned long (l) printed in hexadecimal (x), zero-padded to a minimum width of 8 digits (08).
I've been doing some module work and I'm having crashes that occur randomly (usually within 10 hours after boot).
The kernel log messages can vary from one crash to the next, but in some cases I get this:
<4>huh, entered c90390a8 with preempt_count 0000010d, exited with c0340000?
The code that generates this log is from the 2.6.14 kernel, kernel/timer.c:
	int preempt_count = preempt_count();
	fn(data);
	if (preempt_count != preempt_count()) {
		printk(KERN_WARNING "huh, entered %p "
			"with preempt_count %08x, exited"
			" with %08x?\n",
			fn, preempt_count,
			preempt_count());
		BUG();
	}
For this condition to happen, what would have had to have occurred (obviously preempt_count changed, but what might cause that)?
The other symptom of the crash is that I'm seeing a "scheduling while atomic" message while doing i2c from a workqueue (which should certainly not be atomic, right?). What might cause this?
I figure this post is a long shot but I'm really just looking for anything to troubleshoot at this point.
Just answering off the top of my head: "preempt_count" is a 32-bit field, which is split up into sub-bit-fields for various purposes. The sub-bit-fields are detailed in O'Reilly's Understanding the Linux Kernel. Again, off the top of my head, I don't know what "c0340000" represents. But since you started with "0000010d", and should have ended up with "0000010d", whatever your timer code did is pretty messed up.
One common cause is if your timer code did something like spin_lock_bh() but forgot to do a spin_unlock_bh(). But that usually results in just a 1-bit difference between the starting and ending preempt_count value. But in your case, your starting and ending values show a massive change.
Michael
I just spent some time chasing down a bug that boiled down to the following. Code was erroneously overwriting the stack, and I think it wrote over the return address of the function call. Following the return, the program would crash and the stack would be corrupted. Running the program in valgrind would return an error such as:
vex x86->IR: unhandled instruction bytes: 0xEA 0x3 0x0 0x0
==9222== valgrind: Unrecognised instruction at address 0x4e925a8.
I figure this is because the return jumped to a random location containing bytes that were not valid x86 opcodes. (Though I am somewhat suspicious that this address 0x4e925a8 happened to be in an executable page. I imagine valgrind would throw a different error if this wasn't the case.)
I am certain that the problem was of the stack-overwriting type, and I've since fixed it. Now I am trying to think how I could catch errors like this more effectively. Obviously, valgrind can't warn me if I rewrite data on the stack, but maybe it can catch when someone writes over a return address on the stack. In principle, it can detect when something like 'push EIP' happens (so it can flag where the return addresses are on the stack).
I was wondering if anyone knows if Valgrind, or anything else, can do that? If not, can you comment on other suggestions for debugging errors of this type efficiently?
If the problem happens deterministically enough that you can point out a particular function that has its stack smashed (in one repeatable test case), you could, in gdb:
Break at entry to that function
Find where the return address is stored. On x86 it is relative to %ebp, which keeps the value of %esp at function entry; I am not sure what the offset is.
Add a watchpoint on that address. You have to issue the watch command with the calculated number, not an expression, because with an expression gdb would try to re-evaluate it after each instruction instead of setting up a trap, and that would be extremely slow.
Let the function run to completion.
I have not yet worked with the python support available in gdb7, but it should allow automating this.
In general, Valgrind detection of overflows in stack and global variables is weak to non-existent. Arguably, Valgrind is the wrong tool for that job.
If you are on one of the supported platforms, building with -fmudflap and linking with -lmudflap will give you much better results for these kinds of errors. Additional docs here.
Update:
Much has changed in the 6 years since this answer. On Linux, the tool to find stack (and heap) overflows is AddressSanitizer, supported by recent versions of GCC and Clang.
We have an older massive C++ application and we have been converting it to support Unicode as well as 64-bits. The following strange thing has been happening:
Calls to registry functions and windows creation functions, like the following, have been failing:
hWnd = CreateSysWindowExW( ExStyle, ClassNameW.StringW(), Label2.StringW(), Style,
Posn.X(), Posn.Y(),
Size.X(), Size.Y(),
hParentWnd, (HMENU)Id,
AppInstance(), NULL);
ClassNameW and Label2 are instances of our own Text class which essentially uses malloc to allocate the memory used to store the string.
Anyway, when the functions fail, and I call GetLastError it returns the error code for "invalid memory access" (though I can inspect and see the string arguments fine in the debugger). Yet if I change the code as follows then it works perfectly fine:
BSTR Label2S = SysAllocString(Label2.StringW());
BSTR ClassNameWS = SysAllocString(ClassNameW.StringW());
hWnd = CreateSysWindowExW( ExStyle, ClassNameWS, Label2S, Style,
Posn.X(), Posn.Y(),
Size.X(), Size.Y(),
hParentWnd, (HMENU)Id,
AppInstance(), NULL);
SysFreeString(ClassNameWS); ClassNameWS = 0;
SysFreeString(Label2S); Label2S = 0;
So what gives? Why would the original functions work fine with the arguments in local memory, but when used with Unicode, the registry functions require SysAllocString, and when used in 64-bit, the window creation functions also require SysAllocString'd string arguments? Our window procedure functions have all been converted to be Unicode, always, and yes, we use SetWindowLongW and call the correct default Unicode DefWindowProcW etc. That all seems to work fine and handles and draws Unicode properly etc.
The documentation at http://msdn.microsoft.com/en-us/library/ms632679%28v=vs.85%29.aspx does not say anything about this. While our application is massive we do use debug heaps and tools like Purify to check for and clean up any memory corruption. Also at the time of this failure, there is still only one main system thread. So it is not a thread issue.
So what is going on? I have read that if string arguments are marshalled anywhere or passed across process boundaries, then you have to use SysAllocString/BSTR, yet we call lots of API functions and there is lots of code out there which calls these functions just using plain local strings?
What am I missing? I have tried Googling this, as someone else must have run into this, but with little luck.
Edit 1: Our StringW function does not create any temporary objects which might go out of scope before the actual API call. The function is as follows:
class Text {
public:
    // Returns the stored pointer directly; no temporary is created.
    const wchar_t* StringW() const
    {
        return TextStartW;
    }
    wchar_t* TextStartW; // pointer to current start of text in DataArea
};
I have been running our application with the debug heap and memory checking and other diagnostic tools, and found no source of memory corruption, and looking at the assembly, there is no sign of temporary objects or invalid memory access.
BUT I finally figured it out:
We compile our code with /Zp1, which packs structure members on 1-byte boundaries, so our string data is not necessarily naturally aligned. SysAllocString (in 64-bit) always returns a pointer that is aligned on an 8-byte boundary. Presumably a 32-bit ANSI C++ application goes through an API layer to the underlying Unicode Windows DLLs, which would also align the pointer for you.
But if you use Unicode, you do not get that incidental pointer alignment that the conversion mapping layer gives you, and if you use 64-bits, of course the situation will get even worse.
I added a method to our Text class which shifts the string pointer so that it is aligned on an eight-byte boundary, and voilà, everything runs fine!!!
Of course the Microsoft people say it must be memory corruption and I am jumping to the wrong conclusion, but there is evidence it is not the case.
Also, if you use /Zp1 and include windows.h in a 64-bit application, the debugger will tell you sizeof(BITMAP)==28, but calling GetObject on a bitmap will fail and tell you it needs a 32-byte structure. So I suspect that some of Microsoft's API is inherently dependent on aligned pointers, and I also know that some optimized assembly (I have seen some from Fortran compilers) takes advantage of that and crashes badly if you ever give it unaligned pointers.
So the moral of all of this is, don't use "funky" compiler arguments like /Zp1. In our case we have to for historical reasons, but the number of times this has bitten us...
Someone please give me a "this is useful" tick on my answer please?
Using a bit of psychic debugging, I'm going to guess that the strings in your application are pooled in a read-only section.
It's possible that CreateSysWindowExW is attempting to write to the memory passed in for the window class or title. That would explain why the calls work when the strings are allocated on the heap (SysAllocString) but not when used as constants.
The easiest way to investigate this is to use a low level debugger like windbg - it should break into the debugger at the point where the access violation occurs which should help figure out the problem. Don't use Visual Studio, it has a nasty habit of being helpful and hiding first chance exceptions.
Another thing to try is to enable appverifier on your application - it's possible that it may show something.
Calling a Windows API function does not cross the process boundary, since the various Windows DLLs are loaded into your process.
It sounds like whatever pointer StringW() is returning isn't valid when Windows is trying to access it. I would look there - is it possible that the pointer returned is out of scope and deleted shortly after it is called?
If you share some more details about your string class, that could help diagnose the problem here.
How can I recognize that the callstack shown by the debugger when my program crashes may be wrong and misleading? For example, when the callstack says the following frames may be missing or incorrect, what does that actually mean? Also, what does the + number after the function call in the callstack mean:
kernel32!LoadLibrary + 0x100 bytes
Should this number be important to me, and is it true that if this number is big the callstack may be incorrect?
Sorry if I am asking something trivial and obvious
Thank you all
Generally, you can trust your callstack to be correct.
However, if you re-throw exceptions explicitly instead of allowing them to bubble up the callstack naturally, the actual error can be hidden from the stack trace.
To start with the 2nd one: kernel32!LoadLibrary + 0x100 bytes means that the call was from the function LoadLibrary (offset: +0x100 bytes); apparently there was no symbolic information exactly identifying the caller. This in itself is no reason to think the callstack is corrupted.
A call stack may be corrupted if functions overwrite values on the stack (e.g. by a buffer overflow). This would likely show up as something like '0x41445249' (if it were my name that overwrote it) in place of a calling function. That is something outside your program's memory ranges.
A way to diagnose the cause of your crash would be to set breakpoints on functions identified by the call stack. Or use your debugger to backtrace (depending on debugger & system). It is interesting to find out what arguments were included in the calls. Pointers are generally a good start (NULL pointers, uninitialized pointers). Good luck.
I am trying to figure out a crash in my application.
WinDbg tells me the following: (using dashes in place of underscores)
LAST-CONTROL-TRANSFER: from 005f5c7e to 6e697474
DEFAULT-BUCKET-ID: BAD_IP
BUGCHECK-STR: ACCESS-VIOLATION
It is obvious to me that 6e697474 is NOT a valid address.
I have three questions:
1) Does the "BAD_IP" bucket ID mean "Bad Instruction Pointer?"
2) This is a multi-threaded application so one consideration was that the object whose function I was attempting to call went out of scope. Does anyone know if that would lead to the same error message?
3) What else might cause an error like this? One of my co-workers suggested that it might be a stack overflow issue, but WinDBG in the past has proven rather reliable at detecting and pointing these out. (not that I'm sure about the voodoo it does in the background to diagnose that).
Bad-IP is Bad Instruction Pointer. From the description of your problem, I would assume it is a stack corruption instead of a stack overflow.
I can think of the following things that could cause a jump to an invalid address, in decreasing order of likelihood:
calling a member function on a deallocated object. (as you suspect)
calling a member function of a corrupted object.
calling a member function of an object with a corrupted vtable.
a rogue pointer overwriting code space.
I'd start debugging by finding the code at 005f5c7e and looking at what objects are being accessed around there.
It may be helpful to ask what could have written the string 'ttin' to this location (0x6e697474 is the bytes 74 74 69 6e in little-endian memory order). Often when you have bytes in the 0x41-0x5A, 0x61-0x7A ([a-zA-Z]) range, it indicates a string buffer overflow.
As to what was actually overwritten, it could be the return address, some other function pointer you're using, or occasionally a virtual function table pointer (vfptr) in an object that got overwritten to point to the middle of a string.