No "walkers" in mdb (Solaris modular debugger) - debugging

Every now and then I use the mdb debugger to examine core dumps on Solaris. One nice article that I looked at to get up to speed on the possibilities with mdb is http://blogs.oracle.com/ace/entry/mdb_an_introduction_drilling_sigsegv where the author performs a step-by-step examination of a SIGSEGV crash. In the article the author uses "walkers", which are a kind of add-on to mdb that can perform specific tasks.
My problem is that I don't have any of those walkers in my mdb. By using the "::walkers" command, all walkers available can be listed and my list is empty. So the question is, how can I install/add/load walkers such as the ones used in the above article? I don't really know where they are supposed to be loaded from, if you have to download and add them from somewhere or if it's a configuration step when installing Solaris?

mdb automatically loads walkers and dcmds appropriate for what you're debugging, usually from /usr/lib/mdb and similar directories (see mdb(1) for details). If you just run "mdb" by itself, you'll get almost nothing. If you run "mdb" on a userland process or core dump (e.g., "mdb $$"), you'll get walkers and dcmds appropriate for userland debugging. If you run "mdb" on the kernel (e.g., "mdb -k"), you'll get walkers and dcmds for kernel debugging.

Related

Windows 7 and VB6: Event Error ID 1000

I have a completely random error popping up on a particular piece of software out in the field. The application is a game written in VB6 and is running on Windows 7 64-bit. Every once in a while, the app crashes, with a generic "program.exe has stopped responding" message box. This game can run fine for days on end until this message appears, or within a matter of hours. No exception is being thrown.
We run this app in Windows 2000 compatibility mode (this was its original OS), with visual themes disabled, and as an administrator. The app itself is purposely simple in terms of using external components and API calls.
References:
Visual Basic for Applications
Visual Basic runtime objects and procedures
Visual Basic objects and procedures
OLE Automation
Microsoft DAO 3.51 Object Library
Microsoft Data Formatting Object Library
Components:
Microsoft Comm Control 6.0
Microsoft Windows Common Controls 6.0 (SP6)
Resizer XT
As you can see, these are pretty straightforward, Microsoft-standard tools, for the most part. The database components exist to interact with an Access database used for bookkeeping, and the Resizer XT was inserted to move this game more easily from its original 800x600 resolution to 1920x1080.
There is no networking enabled on the kiosks; no network drivers, and hence no connections to remote databases. Everything is encapsulated in a single box.
In the Windows Application event log, when this happens, there is an Event ID 1000 faulting a seemingly random module -- so far, either ntdll.dll or lpk.dll. In terms of API calls, I don't see any from ntdll.dll. We are using kernel32, user32, and winmm, for various file system and sound functions. I can't reproduce as it is completely random, so I don't even know where to start troubleshooting. Any ideas?
EDIT: A little more info. I've tried several different versions of Dependency Walker, at the suggestion of some other developers, and the latest version shows that I am missing IESHIMS.dll and GRPSVC.dll (these two seem to be well-known bugs in Depends.exe), and that I have missing symbols in COMCTL32.dll and IEFRAME.dll. Any clues there?
The message from the application event log isn't that useful - what you need is a post mortem process dump from your process - so you can see where in your code things started going wrong.
Every time I've seen one of these problems it has generally come down to a bad API parameter rather than something more exotic. Bad incoming data can trigger it, but usually it's a good ol' fashioned bug that causes the problem.
As you've probably figured already this isn't going to be easy to debug; ideally you'd have a repeatable failure case to debug, instead of relying on capturing dump files from a remote machine, but until you can make it repeatable remote dumps are the only way forwards.
Dr Watson used to do this, but is no longer shipped, so the alternatives are:
How to use the Userdump.exe tool to create a dump file
Sysinternals ProcDump
Collecting User-Mode dumps
What you need to get is a minidump. A minidump contains the important parts of the process space but excludes standard modules (e.g. Kernel32.dll), recording only their version information instead.
There are instructions for Automatically Capturing a Dump When a Process Crashes, which uses cdb.exe shipped with the debugging tools; the crucial item, however, is the registry key HKEY_LOCAL_MACHINE\Software\Microsoft\Windows NT\CurrentVersion\AeDebug
You can change your code to add better error handling. This is especially useful if you can narrow down the cause to a few procedures and use the techniques described in Using symbolic debug information to locate a program crash to directly process the map files.
Once you've got a minidump and the symbol files WinDbg is the tool of choice for digging into these dumps - however it can be a bit of a voyage to discover what the cause is.
The only other thing I'd consider, and this depends on your application structure, is to attempt to capture all input events for replay.
Another option is to find a copy of VMWare 7.1 which has replay debugging and use that as the first step in capturing a reproducible set of steps.
As a stopgap, right-click your executable and set its compatibility mode to Windows XP until you discover the source of the problem and can finally solve it.

How to extract stack traces from minidumps?

I've got a whole bunch of minidumps which were recorded during the runtime of an application through MiniDumpWriteDump. The minidumps were created on a machine with a different OS version than my development machine.
Now I'm trying to write a program to extract stack traces from the minidumps, using dbghelp.dll. I'm walking the MINIDUMP_MODULE_LIST and calling SymLoadModule64, but this fails to download the pdbs (kernel32 etc.) from the public symbol server. If I add "C:\Windows\System32" to the symbol path it finds the dlls and downloads the symbols, but of course they don't match the dlls from the minidump, so the results are useless.
So how do I tell dbghelp.dll to download and use the proper pdbs?
[edit]
I forgot to state that SymLoadModule64 only takes a filename and no version/checksum information, so obviously with SymLoadModule64 alone it's impossible for dbghelp to figure out which pdb to download.
The information is actually available in the MINIDUMP_MODULE_LIST but I don't know how to pass it back to the dbghelp API.
There is SymLoadModuleEx which takes additional parameters, but I have no idea if that's what I need or what I should pass for the additional parameters.
[edit]
No luck so far, though I've noticed there's also dbgeng.dll distributed together with dbghelp.dll in the debugging SDK. It looks quite well documented on MSDN, which says it's the same engine that WinDbg uses. Maybe I can use that to extract the stack traces.
If anyone can point me to some introduction to using dbgeng.dll to process minidumps that would probably help too, as MSDN documents only the individual components but not how they work together.
Just in case anyone else wants to automate extracting stack traces from dumps, here's what I ended up doing:
Like I mentioned in the update it's possible to use dbgeng.dll instead of dbghelp.dll, which seems to be the same engine WinDbg uses. After some trial and error here's how to get a good stack trace with the same symbol loading mechanism as WinDbg.
call DebugCreate to get an instance of the debug engine
query for IDebugClient4, IDebugControl4, IDebugSymbols3
use IDebugSymbols3.SetSymbolOptions to configure how symbols are loaded (see MSDN for the options WinDbg uses)
use IDebugSymbols3.SetSymbolPath to set the symbol path like you would do in WinDbg
use IDebugClient4.OpenDumpFileWide to open the dump
use IDebugControl4.WaitForEvent to wait until the dump is loaded
use IDebugSymbols3.SetScopeFromStoredEvent to select the exception stored in the dump
use IDebugControl4.GetStackTrace to fetch the last few stack frames
use IDebugClient4.SetOutputCallbacks to register a listener receiving the decoded stack trace
use IDebugControl4.OutputStackTrace to process the stack frames
use IDebugClient4.SetOutputCallbacks to unregister the callback
release the interfaces
The call to WaitForEvent seems to be important because without it the following calls fail to extract the stack trace.
There also still seems to be a memory leak in there. I can't tell if it's me not cleaning up properly or something internal to dbgeng.dll, but I can just restart the process every 20 dumps or so, so I didn't investigate further.
An easy way to automate the analysis of multiple minidump files is to use the scripts written by John Robbins in his article "Automating Analyzing Tons Of Minidump Files With WinDBG And PowerShell" (you can grab the code on GitHub).
This is easy to tweak to have it perform whatever WinDbg commands you'd like it to, if the default setup is not sufficient.
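If you would rather drive the console debugger yourself, here is a minimal sketch of a batch run. It is not John Robbins' script, just an illustration under some assumptions: cdb.exe from the Debugging Tools for Windows is on the PATH, the dumps end in .dmp, and the function names, the command string and the log naming are my own choices.

```python
from pathlib import Path
import subprocess


def build_cdb_command(dump_path, symbol_path, log_path, debugger="cdb.exe"):
    """Return the argument list for one non-interactive cdb run over a dump."""
    # Commands cdb executes after loading the dump; the trailing q quits cdb.
    commands = "!analyze -v; .ecxr; kb; q"
    return [
        debugger,
        "-z", str(dump_path),     # open a dump file instead of a live process
        "-y", symbol_path,        # symbol path, same syntax as in WinDbg
        "-logo", str(log_path),   # write all debugger output to this log file
        "-c", commands,
    ]


def analyze_dumps(dump_dir, symbol_path, run=subprocess.run):
    """Run cdb once per *.dmp file in dump_dir and return the log paths."""
    logs = []
    for dump in sorted(Path(dump_dir).glob("*.dmp")):
        log = dump.with_suffix(".log")
        run(build_cdb_command(dump, symbol_path, log), check=False)
        logs.append(log)
    return logs
```

Starting a fresh debugger process per dump also sidesteps the kind of leak mentioned above, at the cost of reloading symbols each time.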

Analysing crash dump in windbg

I am using a third party closed source API which throws an exception stating that "all named pipes are busy".
I would like to debug this further (rather than just stepping through) so I can actually learn what is happening under the covers.
I have taken a dump of this process using WinDbg. What commands should I now use to analyse this dump?
Thanks
You could start doing as follows to get an overview of the exception:
!analyze -v
Now you could load the exception context record:
.ecxr
And now... just take a look at the stack, registers, threads,...
kb ;will show you the stack trace of the crash.
dv ;local variables
Depending on the clues you get, you should follow a different direction. If you want a quick reference to WinDbg I'd recommend you this link.
I hope you find some of these commands and info useful.
In postmortem debugging with Windbg, it can be useful to run some general diagnostic commands before deciding where to dig deeper. These should be your first steps:
.logopen <filename> (See also .logappend)
.lastevent See why the process halted and on what thread
u List disassembly near $eip on offending thread
~ Status of all threads
kb List call stack, including parameters
.logclose
These commands typically give you an overview of what happened so you can dig further. In the case of dealing with libraries where you don't have source, sending the resulting log file to the vendor along with the build # of the binary library should be sufficient for them to trace it to a known issue if there is one.
This generally happens when a client calls CreateFile for an existing pipe and all the existing pipe instances are busy. At this point CreateFile returns an error and the error code is ERROR_PIPE_BUSY. The right thing at this point is to call WaitNamedPipe with a timeout value to wait for a pipe instance to become available.
The problem generally happens when more than one client tries to connect to the named pipe at the same time.
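The retry pattern can be sketched independently of the Win32 calls. In the sketch below, open_pipe and wait_for_pipe are hypothetical callables standing in for wrappers around CreateFile and WaitNamedPipe, and PipeBusyError is a made-up exception such a wrapper would raise on ERROR_PIPE_BUSY. Only the error code value comes from Windows itself.

```python
ERROR_PIPE_BUSY = 231  # Win32 error code meaning all pipe instances are busy


class PipeBusyError(OSError):
    """Hypothetical error raised by a CreateFile wrapper on ERROR_PIPE_BUSY."""


def connect_with_retry(open_pipe, wait_for_pipe, timeout_ms=5000, max_attempts=5):
    """Try to open the pipe, waiting for a free instance whenever it is busy.

    open_pipe() returns a handle or raises PipeBusyError.
    wait_for_pipe(timeout_ms) returns True once an instance is available and
    False if the timeout expired (mirroring what WaitNamedPipe reports).
    """
    for _ in range(max_attempts):
        try:
            return open_pipe()
        except PipeBusyError:
            # Another client may grab the freed instance first, so loop
            # back and try the open again after a successful wait.
            if not wait_for_pipe(timeout_ms):
                raise TimeoutError("no pipe instance became available")
    raise TimeoutError("pipe stayed busy after %d attempts" % max_attempts)
```

If the third-party library does its own connect without such a loop, there is little you can do from the outside except reduce the number of simultaneous clients.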
I assume that the 3rd party dll is native (Otherwise, just use Reflector)
Before using WinDbg to analyze the dump, try using Process Monitor (Sysinternals, freeware) to monitor your process's activity. If it fails because of a file system related issue, you can see exactly what caused the problem and what exactly it tried to do before failing.
If Process Monitor wasn't enough, then you can try to debug your process. But in order to see some meaningful information about the 3rd party dll you'll need its PDBs.
After setting the correct debug symbols, you can view the call stack by using the k command or one of its variations (again, I assume you're talking about native code). If your process is indeed crashing because of this dll, then examine the parameters that you pass to its function to ensure that the problem is not on your side. I guess that further down the call stack you reach some Win32 API; examine the parameters that the dll's function is passing and see if something "smells". If you have the dll's private symbols you can examine its function's local variables as well (dv), which can give you some more information.
I hope I gave you a good starting point.
This is an excellent resource for using WinDbg to analyze crashes that may be of some use: http://www.networkworld.com/article/3100370/windows/how-to-solve-windows-10-crashes-in-less-than-a-minute.html
The article is for Windows 10, but it contains links to similar information for earlier versions of Windows.

I need to find the point in my userland code that crashes my kernel

I have a big system that makes my machine crash hard. When I boot up again, I don't even have a coredump. If I log every line that gets executed until my system goes down, I will find that evil code.
Can I log every source code line in GDB to a file?
UPDATE:
OK, I found the bug. It was nasty. The application I started did not take the system down. After learning about coredump inspection with mdb, and some gdb stepping, I found out that the system call causing the dump was not implemented. Updating the system to the latest kernel will fix my problem. Thanks to all of you.
MY LESSON:
Make sure you know what process causes the coredump. It's not always the one you started.
Sounds like a tricky little problem.
I often try to eliminate as many possible suspects as I can by commenting out large chunks of code, configuring the system to not run certain pieces (if it allows you to do that) etc. This amounts to doing an ad-hoc binary search on the problem, and is a surprisingly effective way of zooming in on offending code relatively quickly.
A potential problem with logging is that the log might not hit the disk before the system locks up - if you don't get a core dump, you might not get the log.
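One way to narrow that window, sketched in Python purely as an illustration (the function name is mine), is to flush and fsync after every log line. This is slow, and it still cannot help if the machine dies inside the write or the drive keeps the data in its own cache, so treat it as an improvement rather than a guarantee.

```python
import os


def log_durably(path, message):
    """Append one line and ask the OS to push it to the disk before returning."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(message + "\n")
        log.flush()              # move the data from Python's buffer to the OS
        os.fsync(log.fileno())   # ask the OS to write its cached pages out
```

The same idea applies in C with fflush followed by fsync on the underlying file descriptor.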
Speaking of core dumps, make sure you don't have a limit on your core dump size (see man ulimit).
You could try to obtain a list of all the functions in your code using objdump, process it a little bit and create a bunch of GDB trace statements on those functions - basically creating a GDB script automatically. If that turns out to be overkill, then a binary search on the code using tracepoints can also help you zoom in on the problem.
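Here is a rough sketch of that idea in Python. It assumes the symbol table comes from `objdump -t yourprogram` in the usual GNU layout (address, flags, section, size, name) and it uses ordinary breakpoints with attached commands rather than true GDB tracepoints, since those are easier to use on a local process. The function names and the log file name are my own.

```python
def function_names(objdump_lines):
    """Pick the function symbols in .text out of `objdump -t` output lines."""
    names = []
    for line in objdump_lines:
        fields = line.split()
        if ".text" not in fields:
            continue
        section_at = fields.index(".text")
        flags = "".join(fields[1:section_at])
        # An uppercase F in the flag columns marks a function symbol.
        if "F" in flags and len(fields) > section_at + 2:
            names.append(fields[-1])
    return names


def gdb_trace_script(names, log_file="gdb-trace.log"):
    """Build a GDB command file that logs every hit of the given functions."""
    lines = [
        "set pagination off",
        "set logging file " + log_file,
        "set logging on",  # newer GDB versions prefer: set logging enabled on
    ]
    for name in names:
        # Print one stack frame at each hit, then resume automatically.
        lines += ["break " + name, "commands", "silent",
                  "backtrace 1", "continue", "end"]
    lines.append("run")
    return "\n".join(lines) + "\n"
```

Write the result to a file and start GDB with `gdb -x trace.gdb yourprogram`. Expect a heavy slowdown if hot functions are included, so it can pay to filter the list first.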
And don't panic. You're smarter than the bug - you'll find it.
You cannot reasonably track every line of your source using GDB (too slow). Besides, a system crash is most likely the result of a system call, and libc is probably making the system call on your behalf. Even if you find the line of the application that caused the OS crash, you still don't really know anything.
You should start by clarifying which OS is crashing. For Linux, you can try the following approaches:
strace -fo trace.out /path/to/app
After reboot, trace.out will contain syscalls the application was doing just before the crash. If you are lucky, you'll see the last syscall-of-death, but I wouldn't count on it.
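With -f the trace interleaves several processes, so it helps to pull out the last call seen per PID. A small sketch, assuming the usual `strace -f -o` line layout where every line starts with the PID; the function name is made up for this example, and after a hard crash the tail of the file may be truncated or missing entirely.

```python
import re

# A syscall entry line looks like: 1234 openat(AT_FDCWD, "/etc/passwd", ...) = 3
CALL = re.compile(r"^(\d+)\s+(\w+)\(")


def last_syscall_per_pid(trace_lines):
    """Map each PID to the (name, full line) of the last syscall it started."""
    last = {}
    for line in trace_lines:
        match = CALL.match(line)
        if match:  # skips signal, exit and "<... resumed>" lines
            pid, name = match.groups()
            last[int(pid)] = (name, line.rstrip("\n"))
    return last
```

Usage would be something like `last_syscall_per_pid(open("trace.out"))`, then looking for entries whose line has no return value yet.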
Alternatively, try to reproduce the crash under User-Mode Linux, or on a kernel with KGDB compiled in.
These will tell you where the problem in the kernel is. Finding the matching system call in your application will likely be trivial.
Please clarify your problem: What part of the system is crashing?
Is it an application?
If so, which application? Is this an application which you have written yourself? Is this an application you have obtained from elsewhere? Can you obtain a clean interrupt if you use a debugger? Can you obtain a backtrace showing which functions are calling the section of code which crashes?
Is it a new hardware driver?
Is it based on an older driver? If so, what has changed? Is it based on a manufacturer's data sheet? Is that data sheet the latest and most correct?
Is it somewhere in the kernel? Which kernel?
What is the OS? I assume it is Linux, seeing that you are using the GNU debugger. But of course, that is not necessarily so.
You say you have no coredump. Have you enabled coredumps on your machine? Most systems these days do not have coredumps enabled by default.
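On a Unix-like system you can check this from inside a process as well as with ulimit. A small sketch using Python's standard resource module (the function names are mine); note that this limit only governs core files of user processes, not kernel crash dumps.

```python
import resource


def core_limit_report():
    """Describe the current core file size limits of this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)

    def show(value):
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    # A soft limit of 0 means no core file will be written at all.
    return {"soft": show(soft), "hard": show(hard), "enabled": soft != 0}


def raise_core_limit():
    """Raise the soft limit to the hard limit, which needs no privileges."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)
```

Child processes inherit the limit, so raising it in a launcher before starting the suspect program is enough.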
Regarding logging GDB output, you may have some success, but it depends where the problem is whether or not you will have the right output logged before the system crashes. There is plenty of delay in writing to disk. You may not catch it in time.
I'm not familiar with the gdb way of doing this, but with windbg the way to go is to have a debugger attached to the kernel and control the debugger remotely over a serial cable (or firewire) from a second debugger. I'm pretty sure gdb has similar capabilities, I could quickly find some hints here: http://www.digipedia.pl/man/gdb.4.html

How does a debugger work?

I keep wondering how a debugger works, particularly one that can be 'attached' to an already running executable. I understand that the compiler translates code to machine language, but then how does the debugger 'know' what it is being attached to?
The details of how a debugger works will depend on what you are debugging, and what the OS is. For native debugging on Windows you can find some details on MSDN: Win32 Debugging API.
The user tells the debugger which process to attach to, either by name or by process ID. If it is a name then the debugger will look up the process ID, and initiate the debug session via a system call; under Windows this would be DebugActiveProcess.
Once attached, the debugger will enter an event loop much like for any UI, but instead of events coming from the windowing system, the OS will generate events based on what happens in the process being debugged – for example an exception occurring. See WaitForDebugEvent.
The debugger is able to read and write the target process' virtual memory, and even adjust its register values through APIs provided by the OS. See the list of debugging functions for Windows.
The debugger is able to use information from symbol files to translate from addresses to variable names and locations in the source code. The symbol file information is a separate set of APIs and isn't a core part of the OS as such. On Windows this is through the Debug Interface Access SDK.
If you are debugging a managed environment (.NET, Java, etc.) the process will typically look similar, but the details are different, as the virtual machine environment provides the debug API rather than the underlying OS.
As I understand it:
For software breakpoints on x86, the debugger replaces the first byte of the instruction with CC (int3). This is done with WriteProcessMemory on Windows. When the CPU gets to that instruction and executes the int3, it generates a breakpoint exception. The OS receives this exception, realizes the process is being debugged, and notifies the debugger process that the breakpoint was hit.
After the breakpoint is hit and the process is stopped, the debugger looks in its list of breakpoints and replaces the CC with the byte that was there originally. Because the int3 has already executed, the instruction pointer sits one byte past the breakpoint address, so the debugger moves it back by one. The debugger then sets TF, the Trap Flag in EFLAGS (by modifying the CONTEXT), and continues the process. The Trap Flag causes the CPU to automatically generate a single-step exception (INT 1) after the next instruction.
When the process being debugged stops the next time, the debugger again replaces the first byte of the breakpoint instruction with CC, and the process continues.
I'm not sure if this is exactly how it's implemented by all debuggers, but I've written a Win32 program that manages to debug itself using this mechanism. Completely useless, but educational.
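To make the byte patching concrete, here is a toy model in Python. It is not a real debugger: a bytearray stands in for the target's memory, an integer stands in for EFLAGS, and the class and method names are invented for this sketch. On Windows the real work would go through WriteProcessMemory and the thread CONTEXT as described above.

```python
INT3 = 0xCC        # the one-byte x86 int3 opcode
TRAP_FLAG = 0x100  # TF is bit 8 of EFLAGS


class ToyDebugger:
    def __init__(self, code):
        self.code = bytearray(code)  # stands in for the target's memory
        self.saved = {}              # breakpoint address -> original byte
        self.eflags = 0              # stands in for the thread's EFLAGS

    def set_breakpoint(self, address):
        """Remember the original byte and overwrite it with int3."""
        self.saved[address] = self.code[address]
        self.code[address] = INT3

    def on_breakpoint_hit(self, address):
        """Restore the original byte and request a single step."""
        self.code[address] = self.saved[address]
        self.eflags |= TRAP_FLAG

    def on_single_step(self, address):
        """Re-arm the breakpoint once the original instruction has run."""
        self.code[address] = INT3
        self.eflags &= ~TRAP_FLAG  # the toy model clears the flag itself
```

Walking through set_breakpoint, on_breakpoint_hit and on_single_step in that order reproduces the three stages of the answer above.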
In Linux, debugging a process begins with the ptrace(2) system call. This article has a great tutorial on how to use ptrace to implement some simple debugging constructs.
If you're on a Windows OS, a great resource for this would be "Debugging Applications for Microsoft .NET and Microsoft Windows" by John Robbins:
http://www.amazon.com/dp/0735615365
(or even the older edition: "Debugging Applications")
The book has a chapter on how a debugger works that includes code for a couple of simple (but working) debuggers.
Since I'm not familiar with details of Unix/Linux debugging, this stuff may not apply at all to other OS's. But I'd guess that as an introduction to a very complex subject the concepts - if not the details and APIs - should 'port' to most any OS.
I think there are two main questions to answer here:
1. How does the debugger know that an exception occurred?
When an exception occurs in a process that’s being debugged, the debugger gets notified by the OS before any user exception handlers defined in the target process are given a chance to respond to the exception. If the debugger chooses not to handle this (first-chance) exception notification, the exception dispatching sequence proceeds further and the target thread is then given a chance to handle the exception if it wants to do so. If the SEH exception is not handled by the target process, the debugger is then sent another debug event, called a second-chance notification, to inform it that an unhandled exception occurred in the target process. Source
2. How does the debugger know how to stop on a breakpoint?
The simplified answer is: when you put a breakpoint into the program, the debugger replaces your code at that point with an int3 instruction, which is a software interrupt. As a result the program is suspended and the debugger is called.
Another valuable source for understanding debugging is the Intel CPU manual (Intel® 64 and IA-32 Architectures Software Developer’s Manual). Volume 3A, chapter 16, introduces the hardware support for debugging, such as special exceptions and hardware debugging registers. The following is from that chapter:
T (trap) flag, TSS — Generates a debug exception (#DB) when an attempt is
made to switch to a task with the T flag set in its TSS.
I am not sure whether Windows or Linux uses this flag or not, but it is very interesting to read that chapter.
Hope this helps someone.
My understanding is that when you compile an application or DLL file, whatever it compiles to contains symbols representing the functions and the variables.
When you have a debug build, these symbols are far more detailed than when it's a release build, thus allowing the debugger to give you more information. When you attach the debugger to a process, it looks at which functions are currently being accessed and resolves all the available debugging symbols from there (since it knows what the internals of the compiled file look like, it can ascertain what might be in memory, with the contents of ints, floats, strings, etc.). Like the first poster said, this information and how these symbols work greatly depends on the environment and the language.
