I need to find the point in my userland code that crash my kernel - debugging

I have big system that make my system crash hard. When I boot up, I don't even have
a coredump. If I log every line that
get executed until my system goes down. I will find that evil code.
Can I log every source code line in GDB to a file?
UPDATE:
ok, I found the bug. It was nasty. The application I started did not
take the system down. After learning about coredump inspection with mdb, and some gdb stepping I found out that the systemcall causing the dump, was not implemented. Updating the system to latest kernel will fix my problem. Thanks to all of you.
MY LESSON:
make sure you know what process causes the coredump. It's not always the one you started.

Sounds like a tricky little problem.
I often try to eliminate as many possible suspects as I can by commenting out large chunks of code, configuring the system to not run certain pieces (if it allows you to do that) etc. This amounts to doing an ad-hoc binary search on the problem, and is a surprisingly effective way of zooming in on offending code relatively quickly.
A potential problem with logging is that the log might not hit the disk before the system locks up - if you don't get a core dump, you might not get the log.
Speaking of core dumps, make sure you don't have a limit on your core dump size (man ulimit.)
You could try to obtain a list of all the functions in your code using objdump, process it a little bit and create a bunch of GDB trace statements on those functions - basically creating a GDB script automatically. If that turns out to be overkill, then a binary search on the code using tracepoints can also help you zoom in on the problem.
And don't panic. You're smarter than the bug - you'll find it.

You can not reasonably track every line of your source using GDB (too slow). Besides, a system crash is most likely a result of a system call, and libc is probably doing the system call on your behalf. Even if you find the line of the application that caused OS crash, you still don't really know anything.
You should start by clarifying which OS is crashing. For Linux, you can try the following approaches:
strace -fo trace.out /path/to/app
After reboot, trace.out will contain syscalls the application was doing just before the crash. If you are lucky, you'll see the last syscall-of-death, but I wouldn't count on it.
Alternatively, try to reproduce the crash on the user-mode Linux, or on kernel with KGDB compiled in.
These will tell you where the problem in the kernel is. Finding the matching system call in your application will likely be trivial.

Please clarify your problem: What part of the system is crashing?
Is it an application?
If so, which application? Is this an application which you have written yourself? Is this an application you have obtained from elsewhere? Can you obtain a clean interrupt if you use a debugger? Can you obtain a backtrace showing which functions are calling the section of code which crashes?
Is it a new hardware driver?
Is it based on an older driver? If so, what has changed? Is it based on a manufacturer's data sheet? Is that data sheet the latest and most correct?
Is it somewhere in the kernel? Which kernel?
What is the OS? I assume it is linux, seeing that you are using the GNU debugger. But of course, that is not necessarily so.
You say you have no coredump. Have you enabled coredumps on your machine? Most systems these days do not have coredumps enabled by default.
Regarding logging GDB output, you may have some success, but it depends where the problem is whether or not you will have the right output logged before the system crashes. There is plenty of delay in writing to disk. You may not catch it in time.

I'm not familiar with the gdb way of doing this, but with windbg the way to go is to have a debugger attached to the kernel and control the debugger remotely over a serial cable (or firewire) from a second debugger. I'm pretty sure gdb has similar capabilities, I could quickly find some hints here: http://www.digipedia.pl/man/gdb.4.html

Related

reason for crashing of the windows

I wrote some program which uses information about (reads via Windows) hardware of the current PC (big program, so I can't post here code) and sometimes my windows 7 crashes, the worst thing is that I have no idea why, and debug doesn't help me, is there any way to receive from windows 7 some kind of log, why it crashed? thanks in advance for any help
The correct (but somewhat ugly) answer:
Go to Computer->Properties, go to 'Advanced System Settings'.
Under startup and recovery, make sure it is set to "Kernel memory dump" and note the location of the dump file (on a completely default install, you are looking at C:\windows\memory.dmp)
You optimally want to install Windows Debugging tools (now in the Windows SDK) as well as setting the MS Symbol store in your symbol settings (http://msdn.microsoft.com/en-us/library/ff552208(v=vs.85).aspx)
Once youv'e done all that, wait for a crash and inspect memory.dmp in the debugger. Usually you will not see the exact crash because your driver vendors don't include symbols, but you will also generally get to see the DLL name that is involved in the crash, which should point you to what driver you are dealing with.
If you are not seeing a specific driver DLL name in the stack, it often indicates to me a hardware failure (like memory or overhead) that needs to be addressed.
MS has a good article here at technet that describes what I mentioned above (but step by step and in greater detail) http://blogs.technet.com/b/askcore/archive/2008/11/01/how-to-debug-kernel-mode-blue-screen-crashes-for-beginners.aspx
You can also look at the event log as someone else noted, but generally the information there is next to useless, beyond the actual kernel message (which can sometimes vaguely indicate whether the problem is driver or something else)

Difficult to debug embedded application

I'm trying to debug an application on an embedded device running an old version of Linux/Qtopia. I asked for help on QT forums but the people there don't know about old software and embedded systems. I'd really like some help with debug strategies.
My program will crash after the main window has been constructed, i.e. some time into the event loop. But depending on the order of functions in the constructor, sometimes it will run only from the console and sometimes it will only run from the icon. Despite my best efforts I can't narrow down what is causing the problem.
There is no seg fault or signal but my program does not continue and the destructor does not get called. It seems to me that one of the first things that would happen in the event loop is a resize event and when this is called could vary if you ran from the console or icon. Also, the various widgets in my GUI would be initialised and drawn so that is also a potential source of error, if I haven't set up something properly.
My debugging options are limited as the area where the crash actually occurs is not under my control. I tried logging to a file and printing to stderr but this was no help. When I got to the state where it runs from the icon but not console, I tried running in gdb and strace but it ran OK - the classic problem of debug software initialising differently.
My next thought is to try to force a core dump and then analyse that. How do I force a core dump ? Is there a better strategy ?
Logging to a file or to a communication port (serial port, etc.) is probably the simplest way to see what is happening and maintaining the normal runtime (i.e. not in a debugger).
You say that logging to a file and printing to stderr was no help. Why not? Are you printing relevant debugging information to the file? Are you using the Linux/Qtopia sources and adding debug logging?
Assuming you have sources for all of the code you are running, it should be just a matter of adding debug logging in the right places to pinpoint where the problem is occurring.

Can a simple program be responsible for a BSOD?

I've got a customer who told me that my program (simple user-land program, not a driver) is crashing his system with a Blue Screen Of Death (BSOD). He says he has never encountered that with other program and that he can reproduce it easily with mine.
The BSOD is of type CRITICAL_OBJECT_TERMINATION (0x000000F4) with object type 0x3 (process): A process or thread crucial to system operation has unexpectedly exited or been terminate.
Can a simple program be responsible for a BSOD (even on Vista...) or should he check the hardware or OS installation?
Just because your program isn't a driver doesn't mean it won't use a driver.
In theory, your code shouldn't be able to BSOD the computer. It's up to the OS to make sure that doesn't happen. By definition, that means there's a problem somewhere either in hardware or in code other than your program. That doesn't preclude there being a bug in your code as well though.
The easiest way to cause a BSOD with a user-space program is (afaik) to kill the Windows subsystem process (csrss.exe). This doesn't need faulty hardware nor a bug in the kernel or a driver, it only needs administrator privileges1.
What is your code exactly doing? The error message ("A process or thread crucial to system operation has unexpectedly exited or been terminate.") sounds like one of the essential system processes terminated. Maybe you are killing a process and unintentionally got the wrong process?
If somehow possible you could try to get a memory dump from that customer. Using the Debugging Tools for Windows you can then further analyze that dump as described here.
1Windows doesn't prevent you from doing so because it "keeps administrators in control of their computer". So this is by design and not a bug. Read Raymond's articles and you will see why.
Short answer is yes. Long answer depends on what is you program is suppose to do and how it does it?
Normally, it shouldn't. If it does, there must be either
A bug in the Windows kernel (possible but very unlikely)
A bug in a device driver (not necessarily in a device your program uses, this could get quite complicated)
A fault in the hardware
I would bet on option number two (device driver) but it would be interesting if you could get us a more detailed dump.
Well, yes it can - but for many different reasons.
That's why we test on different machines, operating systems, hardware etc..
Have you set some requirements for your program and is your user following them?
If you can't duplicate it yourself, and your program doesn't need admin to run, I'd be a bit suspicous about
The stability of that system's hardware
The virus/malware status of that system.
If you can get physical access to the client box, it might be worth running a full virus scan with an up-to-date scanner, and running a full memtest on it.
I had a system once that seemed stable, except that a certian few programs wouldn't run on it (and would sometimes crash the box). Memtest showed my RAM had some bad bits, but they were in higer sims, so they only got accessed if a program tried to use a lot of RAM.
No, and that is pretty much by definition. The worst thing that you can say is that a user-land application may have "triggered" a Windows bug or a driver bug. But a modern desktop Operating System is fully responsible for its own integrity; a BSOD is a failure of that integrity. Therefore the OS is responsible, and only the OS.
(Example of a BSOD bug that your application alone could expose: a virus scanner implemented as a driver, that crashes when executing a file from sector 0xFFFFFFFF, a sector that on this one machine just happens to contain one DLL of your application)
I had problems when exit my app without stopping all the processes and BD connections when the program ends (I crashed the entire IDE). I place the "stopping and disconnecting" code in the "Terminate" of "Form_Closed" event of my main form and the problem wa solved, I don't know it this is your situation.
Another problem can be if the user is trying to access the same resources your app is using (databases, hardware, sockets, etc). Ask him/her about what apps he/she is using when the BSOD happens.
A virus can't be discarded.

How do I analyse a BSOD and the error information it will provide me?

Well, fortunately I haven't written many applications that cause a BSOD but I just wonder about the usefullness of the information on this screen. Does it contain any useful information that could help me to find the error in my code? If so, what do I need, exactly?
And then, the system restarts and probably has written some error log or other information to the system somewhere. Where is it, what does it contain and how do I use it to improve my code?
I did get a BSOD regularly in the past when I was interacting with a PBX system where the amount of documentation of it's drivers were just absent, so I had to do some trial-and-error coding. Fortunately, I now work for a different company and don't see any BSOD's as a result of my code.
If you want a fairly easy way to find out what caused an OS crash that will work ~90% of the time - assuming you have a crash dump available - then try the following:
Download WinDbg as part of the Debugging tools for Windows package. Note, you only need to install the component called Debugging Tools for Windows.
Run WinDbg
Select "Open Crash Dump" from the file menu
When the dump file has loaded type analyze -v and press enter
WinDbg will do an automated analysis of the crash and will provide a huge amount of information on the system state at the time of the crash. It will usually be able to tell you which module was at fault and what type of error caused the crash. You should also get a stack trace that may or may not be helpful to you.
Another useful command is kbwhich prints out a stack trace. In that list, look for a line contains .sys. This is normally the driver which caused the crash.
Note that you will have to configure symbols in WinDbg if you want the stack trace to give you function names. To do this:
Create a folder such as C:\symbols
In WinDbg, open File -> Symbol File Path
Add: SRV*C:\symbols*http://msdl.microsoft.com/download/symbols
This will cache symbol files from Microsoft's servers.
If the automated analysis is not sufficient then there are a variety of commands that WinDbg provides to enable you to work out exactly what was happening at the time of the crash. The help file is a good place to start in this scenario.
Generally speaking, you cannot cause a OS crash or bug check from within your application code. That said, if you are looking for general tips and stuff, I recommend the NTDebugging blog. Most of the stuff is way over my head.
What happens when the OS crashes is it will write a kernel dump file, depending on the current flags and so on, you get more or less info in it. You can load up the dump file in windbg or some other debugger. Windbg has the useful !analyze command, which will examine the dump file and give you hints on the bucket the crash fell into, and the possible culprits. Also check the windbg documentation on the general cause of the bug check, and what you can do to resolve it.

Invoke Blue Screen of Death using Managed Code

Just curious here: is it possible to invoke a Windows Blue Screen of Death using .net managed code under Windows XP/Vista? And if it is possible, what could the example code be?
Just for the record, this is not for any malicious purpose, I am just wondering what kind of code it would take to actually kill the operating system as specified.
The keyboard thing is probably a good option, but if you need to do it by code, continue reading...
You don't really need anything to barf, per se, all you need to do is find the KeBugCheck(Ex) function and invoke that.
http://msdn.microsoft.com/en-us/library/ms801640.aspx
http://msdn.microsoft.com/en-us/library/ms801645.aspx
For manually initiated crashes, you want to used 0xE2 (MANUALLY_INITIATED_CRASH) or 0xDEADDEAD (MANUALLY_INITIATED_CRASH1) as the bug check code. They are reserved explicitly for that use.
However, finding the function may prove to be a bit tricky. The Windows DDK may help (check Ntddk.h) - I don't have it available at the moment, and I can't seem to find decisive info right now - I think it's in ntoskrnl.exe or ntkrnlpa.exe, but I'm not sure, and don't currently have the tools to verify it.
You might find it easier to just write a simple C++ app or something that calls the function, and then just running that.
Mind you, I'm assuming that Windows doesn't block you from accessing the function from user-space (.NET might have some special provisions). I have not tested it myself.
I do not know if it really works and I am sure you need Admin rights, but you could set the CrashOnCtrlScroll Registry Key and then use a SendKeys to send CTRL+Scroll Lock+Scroll Lock.
But I believe that this HAS to come from the Keyboard Driver, so I guess a simple SendKeys is not good enough and you would either need to somehow hook into the Keyboard Driver (sounds really messy) or check of that CrashDump has an API that can be called with P/Invoke.
http://support.microsoft.com/kb/244139
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters
Name: CrashOnCtrlScroll
Data Type: REG_DWORD
Value: 1
Restart
I would have to say no. You'd have to p/invoke and interact with a driver or other code that lives in kernel space. .NET code lives far removed from this area, although there has been some talk about managed drivers in future versions of Windows. Just wait a few more years and you can crash away just like our unmanaged friends.
As far as I know a real BSOD requires failure in kernel mode code. Vista still has BSOD's but they're less frequent because the new driver model has less drivers in kernel mode. Any user-mode failures will just result in your application being killed.
You can't run managed code in kernel mode. So if you want to BSOD you need to use PInvoke. But even this is quite difficult. You need to do some really fancy PInvokes to get something in kernel mode to barf.
But among the thousands of SO users there is probably someone who has done this :-)
You could use OSR Online's tool that triggers a kernel crash. I've never tried it myself but I imagine you could just run it via the standard .net Process class:
http://www.osronline.com/article.cfm?article=153
I once managed to generate a BSOD on Windows XP using System.Net.Sockets in .NET 1.1 irresponsibly. I could repeat it fairly regularly, but unfortunately that was a couple of years ago and I don't remember exactly how I triggered it, or have the source code around anymore.
Try live videoinput using directshow in directx8 or directx9, most of the calls go to kernel mode video drivers. I succeded in lots of blue screens when running a callback procedure from live videocaptureing source, particulary if your callback takes a long time, can halt the entire Kernel driver.
It's possible for managed code to cause a bugcheck when it has access to faulty kernel drivers. However, it would be the kernel driver that directly causes the BSOD (for example, uffe's DirectShow BSODs, Terence Lewis's socket BSODs, or BSODs seen when using BitTorrent with certain network adapters).
Direct user-mode access to privileged low-level resources may cause a bugcheck (for example, scribbling on Device\PhysicalMemory, if it doesn't corrupt your hard disk first; Vista doesn't allow user-mode access to physical memory).
If you just want a dump file, Mendelt's suggestion of using WinDbg is a much better idea than exploiting a bug in a kernel driver. Unfortunately, the .dump command is not supported for local kernel debugging, so you would need a second PC connected over serial or 1394, or a VM connected over a virtual serial port. LiveKd may be a single-PC option, if you don't need the state of the memory dump to be completely self-consistent.
This one doesn't need any kernel-mode drivers, just a SeDebugPrivilege. You can set your process critical by NtSetInformationProcess, or RtlSetProcessIsCritical and just kill your process. You will see same bugcheck code as you kill csrss.exe, because you set same "critical" flag on your process.
Unfortunately, I know how to do this as a .NET service on our server was causing a blue screen. (Note: Windows Server 2008 R2, not XP/Vista).
I could hardly believe a .NET program was the culprit, but it was. Furthermore, I've just replicated the BSOD in a virtual machine.
The offending code, causes a 0x00000f4:
string name = string.Empty; // This is the cause of the problem, should check for IsNullOrWhiteSpace
foreach (Process process in Process.GetProcesses().Where(p => p.ProcessName.StartsWith(name, StringComparison.OrdinalIgnoreCase)))
{
Check.Logging.Write("FindAndKillProcess THIS SHOULD BLUE SCREEN " + process.ProcessName);
process.Kill();
r = true;
}
If anyone's wondering why I'd want to replicate the blue screen, it's nothing malicious. I've modified our logging class to take an argument telling it to write direct to disk as the actions prior to the BSOD weren't appearing in the log despite .Flush() being called. I replicated the server crash to test the logging change. The VM duly crashed but the logging worked.
EDIT: Killing csrss.exe appears to be what causes the blue screen. As per comments, this is likely happening in kernel code.
I found that if you run taskkill /F /IM svchost.exe as an Administrator, it tries to kill just about every service host at once.

Resources