Is creating a memory dump at customer environment good? - debugging

I am facing a severe problem with my program, which gets reproduced only in the customer place. Putting logs, are not helping as I doubt the failure is happening in a third party dll. For some reasons, I couldn't get help from the library provider. I am thinking of producing a dump at the point of failure, so that to analyze it offline. Is this a recommended practice? Or any alternatives?

Yes, this is something that every program should have and utilize as often as possible.
I suggest that you don't use third party libraries. Create your own dumps instead. It's very simple and straight forward. You basically need to do the following:
Your program needs to access dbghelp.dll. It's a windows dll that allows you to create human readable call stacks etc. The debugger uses this dll to display data in your process. It also handles post mortem debugging, i.e. dumps of some sort. This dll can safely be distributed with your software. I suggest that you download and install Debugging Tools for Windows. This will give you access to all sorts of tools and the best tool WinDbg.exe and the latest dbghelp.dll is also in that distribution.
In dbghelp.dll you call e.g. MiniDumpWriteDump(), which will create the dump file and that's more or less it. You're done. As soon as you have that file in your hands, you can start using it. Either in the Visual Studio Debugger, which probably even might be associated with the .dmp file extension, or in WinDbg.
Now, there are a few things to think of while you're at it. When checking dump files like this, you need to generate .pdb files when you compile and link your executable. Otherwise there's no chance of mapping the dump data to human readable data, e.g. to get good callstacks and values of variables etc. This also means that you have to save these .pdb files. You need to be able to match them exactly against that very release. Since the dump files are date stamped with the date stamp of the executable, the debugger needs the exact pdb files. It doesn't matter if your code hasn't changed a single bit, if the .pdb files belong to another compilation session, you're toast.
I encourage every windows win32 developer to check out Oleg Starodumov's site DebugInfo.com. It contains a lot of samples and tutorials and how you can configure and tune your dump file generation. There are of course a myriad of ways to exclude certain data, create your custom debug message to attach to the dump etc.
Keep in mind that minidumps will contain very limited information about the application state at exception time. The trade off is a small file (around 50-100 kB depending on your settings). But if you want, you can create a full dump, which will contain the state of the whole application, i.e. globals and even kernel objects. These files can be HUGE and should only be used at extreme cases.
If there are legal aspects, just make sure your customers are aware of what you're doing. I bet you already have some contract where you aren't supposed to reveal business secrets or other legal aspects. If customers complain, convince them how important it is to find bugs and that this will improve the quality of the software drastically. More or less higher quality at the cost of nothing. If it doesn't cost them anything, that's also a good argument :)
Finally, here's another great site if you want to read up more on crash dump analysis: dumpanalysis.org
Hope this helps. Please comment if you want me to explain more.
Cheers !
Edit:
Just wanted to add that MiniDumpWriteDump() requires that you have a pointer to a MINIDUMP-EXCEPTION-INFORMATION (with underscores) struct. But the GetExceptionInformation() macro provides this for you at time of exception in your exception handler (structured exception handling or SEH):
__try {
}
__except (YourHandlerFunction(GetExceptionInformation())) {
}
YourHandlerFunction() will be the one taking care of generating the minidump (or some other function down the call chain). Also, if you have custom errors in your program, e.g. something happens that should not happen but technically is not an exception, you can use RaiseException() to create your own.
GetExceptionInformation() can only be used in this context and nowhere else during program execution.

Crash dumps are a pretty common troubleshooting method and can be very effective, especially for problems that only reproduce at the customer's site.
Just make sure the customer/client understands what you're doing and that you have permission. It's possible that a crash dump can have sensitive information that a customer may not want (or be permitted) to let walk out the door or over the wire.

Better than that there are libraries that will upload crash data back you.
BugDump and BugSplat
And there's the Microsoft way:
http://msdn.microsoft.com/en-us/library/aa936273.aspx

Disclaimer: I am not a lawyer, nor do I pretend to be one, this is not legal advice.
The data you can include in logs and crash dumps also depend on what domain you are working in. For example, medical equipment and patient information systems often contain sensitive data about patients that should not be visible to unauthorized persons.
The HIPAA Privacy Rule regulates
the use and disclosure of certain
information held by "covered entities"
(...) It establishes regulations for
the use and disclosure of Protected
Health Information (PHI). PHI is any
information held by a covered entity
which concerns health status,
provision of health care, or payment
for health care that can be linked to
an individual.[10] This is interpreted
rather broadly and includes any part
of an individual's medical record or
payment history. --Wikipedia
It should not be possible to link health information to an individual. The crash dumps and logs should be anonymized and stripped of any sensitive information, or not sent at all.
Maybe this does not apply to your specific case, so this is more of a general note. I think it applies to other domains that handle sensitive information, such as military and financial, and so on.

Basically the easiest way to produce a dumpfile is by using adplus. You don't need to change your code.
Adplus is part of the debugging tools for windows, as mentioned in the article above.
Adplus is basically a huge vbscript automation of windbg.
What you have to do to use adplus:
Download and install Debugging tools for windows to c:\debuggers
start your application
open a commandline and navigate to c:\debuggers
run this line "adplus -crash your_exe.exe"
reproduce the crash
you'll get a minidump with all the information you need.
you can open the crash dump in your favorite debugger.
within windbg, the command "analyze -v" helped me in at least 40% of all the crashes that only happened at customer site and were not reproducible in house.

Related

Embed and execute native code from memory

I want to do these two things with my application (Windows only):
Allow user to insert (with a tool) a native code into my application before starting it.
Run this user-inserted code straight from memory during runtime.
Ideally, it would have to be easy for user to specify this code.
I have two ideas how to do this that I'm considering right now:
User would embed a native dll into application's resources. Application would load this dll straight from memory using techniques from this article.
Somehow copy assembly code of .dll method specified by user into my application resources, and execute this code from heap as described in this article.
Are there any better options to do this? If not, any thoughts on what might cause problems in those solutions?
EDIT
I specifically do not want to use LoadLibrary* calls as they require dll file to be already on hard drive which I'm trying to avoid. I'm also trying to make dissasembling harder.
EDIT
Some more details:
Application code is under my control and is native. I simply want to provide user with a way to embed his own customized functions after my application is compiled and deployed.
User code can have arbitrary restrictions placed on it by me, it is not a problem.
The aim is to allow third parties to statically link code into a native application.
The obvious way to do this is to supply the third parties with the application's object files and a linker. This could be wrapped up in a tool to make it easy to use.
As ever, the devil is in the detail. In addition to object files, applications contain manifests, resources, etc. You need to find a linker that you are entitled to distribute. You need to use a compiler that is compatible with said linker. And so on. But this is certainly feasible, and likely to be more reliable than trying to roll your own solution.
Your option 2 is pretty much intractable in my view. For small amounts of self-contained code it's viable. For any serious amount of code you cannot realistically hope for success without re-inventing the wheel that is your option 1.
For example, real code is going to link to Win32 functions, and how are you going to resolve those? You'd have to invent something just like a PE import table. So, why do so when DLLs already exist. If you invented your own PE-like file format for this code, how would anyone generate it? All the standard tools are in the business of making PE format DLLs.
As for option 1, loading a DLL from memory is not supported. So you have to do all the work that the loader would do for you if it were loading from file. So, if you want to load a DLL that is not present on the disk, then option 1 is your only choice.
Any half competent hacker will readily pull the DLL from the executing process though so don't kid yourself that running DLLs from memory will somehow protect your code from inspection.
This is something called "application virtualization", there are 3rd party tools for that, check them on google.
In a simple case, you may just load "DLL" into memory, apply relocs, setup imports and call entry point.

How to read some data from a Windows application memory?

I have an application, which displays me some data. I need to attach to this app's process, find the data I need in memory (one single number, actually), and save it somewhere. This application doesn't seem to use standard windows controls, so things aren't going to be as simple as reading controls data using AutoIt or something similar.
Currently I'm a self-learner database guy and have quite shallow knowledge about windows apps debugging. Not even sure if I asked my question correctly enough.
So, can you give me some starter guidelines about, say, what should I read first, and general directions I should work on?
Thanks.
To read memory of other application you need to open the process with respect of OpenProcess with at least PROCESS_VM_READ access rights and then use ReadProcessMemory to read any memory address from the process. If you are an administrator or have debug privilege you will be able to open any process with maximal access rights, you need only to enable SeDebugPrivilege before (see for example http://support.microsoft.com/kb/131065).
If you don't know a much about the memory of the destination process you can just enumerate the memory blocks with respect of VirtualQueryEx (see How does one use VirtualAllocEx do make room for a code cave? as an example where I examine the program code. The program data you can examine in the same way).
The most practical problem which I see is that you ask your question in too general way. If you explain more what kind of the data you are looking for I could probably suggest you a better way. For example if you could see the data somewhere you could examine the corresponding windows and controls with respect of Spy++ (a part of Visual Studio Tools). The most important are the class of windows (or controls) and the messages which will be send at the moment when the most interesting window are displayed. You can also use Process Monitor to trace all file and registry access at the time when the windows with the interesting information will be displayed. At least at the beginning you should examine the memory of the process with ReadProcessMemory at the moment when the data which you are looking for are displayed on the window.
If you will have no success in your investigations I'd recommend you to insert in your question more information.
My primary advice is: try to find any other method of integration than this. Even if you succeed, you'll be hostage to any kinds of changes in the target process, and possibly in the Windows O/S. What you are describing is behaviour most virus scanners should flag and hinder: if not now, then in the future.
That said, you can take a look at DLL injection. However, it sounds as if you're going to have to debug the heck out of the target process at the disassembly level: otherwise, how are you going to know what memory address to read?
I used to know the windows debugging API but it's long lost memory. How about using ollydbg:
http://www.ollydbg.de/
And controlling that with both ollydbg script and autoit?
Sounds interesting... but very difficult. Since you say this is a 'one-off', what about something like this instead?
Take a screenshot of this application.
Run the screenshot through an OCR program
If you are able to read the text you are looking for in a predictable way, you're halfway there!
So now if you can read a OCR'd screenshot of your application, it is a simple matter of writing a program that does the following:
Scripts the steps to get the data on the screen
Creates a screenshot of the data in question
Runs it through an OCR program like Microsoft Office Document Imaging
Extracts the relevant text and does 'whatever' with it.
I have done something like this before with pretty good results, but I would say it is a fragile solution. If the application changes, it stops working. If the OCR can't read the text, it stops working. If the OCR reads the wrong text, it might do worse things than stop working...
As the other posters have said, reaching into memory and pulling out data is a pretty advanced topic... kudos to you if you can figure out a way to do that!
I know this may not be a popular answer, due to the nature of what this software is used for, but programs like CheatEngine and ArtMoney allow you to search through all the memory reserved by a process for a given value, then refine the results till you find the address of the value you're looking for.
I learned this initially while trying to learn how to better protect my games after coming across a trainer for one of them, but have found the technique occasionally useful when debugging.
Here is an example of the technique described above in use: https://www.youtube.com/watch?v=Nv04gYx2jMw&t=265

"Works on my machine" - How to fix non-reproducible bugs?

Very occasionally, despite all testing efforts, I get hit with a bug report from a customer that I simply can't reproduce in the office.
(Apologies to Jeff for the 'borrowing' of the badge)
I have a few "tools" that I can use to try and locate and fix these, but it always feels a bit like I'm knife-and-forking it:-
Asking for more and more context from the customer: (systeminfo)
Log files from our application
Ad-hoc tests with the customer to attempt to change the behaviour
Providing customer with a new build with additional diagnostics
Thinking about the problem in the bath...
Site visit (assuming customer is somewhere warm and sunny)
Are there set procedures, or other techniques than anyone uses to resolve problems like this?
One of the attributes of good debuggers, I think is that they always have a lot of weapons in their toolkit. They never seem to get "stuck" for too long and there is always something else for them to try. Some of the things I've been known to do:
ask for memory dumps
install a remote debugger on a client machine
add tracing code to builds
add logging code for debugging purposes
add performance counters
add configuration parameters to various bits of suspicious code so I can turn on and off features
rewrite and refactor suspicious code
try to replicate the issue locally on a different OS or machine
use debugging tools such as application verifier
use 3rd party load generation tools
write simulation tools in-house for load generation when the above failed
use tools like Glowcode to analyse memory leaks and performance issues
reinstall the client machine from scratch
get registry dumps and apply them locally
use registry and file watcher tools
Eventually, I find the bug just gives up out of some kind of awe at my persistence. Or the client realises that it's probably a machine or client side install or configuration issue.
Extensive logging usually helps.
The easiest way is always to see the customer in action (assuming that its readily reproducible by the customer). Oftentimes, problems arise due to issues with the customer's computer environment, conflicts with other programs, etc - these are details which you will not be able to catch on your dev rig. So a site visit might be useful; but if that's not convenient, tools like RealVNC might help as well in letting you see the customer 'do their thing'.
(watching the customer in action also allows you to catch them out in any WTF moments that they might have)
Now, if the problem is intermittent, then things get somewhat more complicated. The best way to get around this problem would be to log useful information in places where you guess problems could occur and perhaps use a tool like Splunk to index the log files during analysis. A diagnostic build (i.e. with extra logging) might be useful in this case.
I'm just in the middle of implementing an automated error reporting system that sends back to me information (currently via email although you could use a webservice) from any exception encountered by the app.
That way I get (nearly) all the information that I would do if I was sitting in front of VS2008 and it really helps me to work out what the problem is.
The customers are also usually (sorta) impressed that I know about their problem as soon as they encounter it!
Also, if you use the Application.ThreadException error handler you can send back info on unexpected exceptions too!
We use all the methods you mention progressively starting with the easiest and proceeding to the harder.
However you forget that sometimes hardware is at fault. For example, memory could be malfunctioning and some computation-intensive code will behave strangely throwing exceptions with weird diagnostics. Of cource, it works on your machine, since you don't have faulty hardware.
Experience is needed to identify such errors and insist that customer tries to install the program on another machine or does hardware check. One thing that helps greatly is good error handling - when your code throws an exception it should provide details, not just indicate that something is bad. With good error indication it's easier to identify such suspicious issues related to faulty hardware.
I think one of the most important things is the ability to ask sensible questions around what the customer has reported... More often than not they're not mentioning something that they don't see as relevant, but is actually key.
Telepathy would also be useful...
We've had good success using EurekaLog with it posting directly to FogBugz. This gets us a bug report containing a call stack, along with related system info (other processes running, memory, network details etc) and a screen shot. Occasionally customers enter further info too, which is helpful. It's certainly, in most cases, made it much easier and quicker to fix bugs.
One technique I've found useful is building an application with an integrated "diagnostic" mode (enabled by a command line switch when you launch the app). That certainly avoids the need to create custom builds with additional logging.
Otherwise, it sounds like what you're doing is as good an approach as any.
Copilot (assuming customer is somewhere cold and rainy :)
The usual procedure for this is to expect something like this will happen and add a ton of logging information. Of course you don't enable it from the beginning, but only when this happens.
Usually customers don't like to have to install a new version or some diagnostic tools. It is not their job to do your debugging. And visiting a client for cases like these is rarely an option. You must involve the client as little as possible. Changing a switch and sending you a log file is OK - anything more than this is too much.
I like the alternative of thinking the problem at the bath. I will start from trying to find out the differences between my machine and the client's configuration.
As a software engineer doing webstuff (booking/shop/member systems etc) the most important thing for us is to get as much information from the customer as possible.
Going from
it's broke!
to
it's broke! & here are screenshots of
every option I picked whilst
generating this particular report
reduces the amount of time it takes us to reproduce and fix an issue no end.
It may be obvious, but it takes a fair amount of chasing to get this kind of information from our customers sometimes! But it's worth it just for those moments you find they're not actually doing what they say they are.
I had these problems also. My solution was to add lots of logging and give the customer a debug build with all the possible debug information. Then make sure dr Watson (it was on Windows NT) created a memory dump with enough information.
After loading the memory dump in the debugger I could find out where and why it crashed.
EDIT: Oh, this obviously only works if the application terminates violently...
I think following the trail of the actions user took can lead us to the reasons of failure or selective failures. But most of the times users are at loss to precisely describe the interactions with the applications, the automatic screenshot taking (if it is desktop app. for .net app you can check Jeff's UnhandledExceptionHandler). Logging all the important action which change state of the objects can also help us in understanding it.
I don't have this problem very often, but if I did, I would use a screen sharing or recorded application to watch the user in action without having to go there (unless, as you said, it's warm and sunny and the company pays the trip).
I have recently been investigating such an issue myself. Over the course of my carrier I have learnt that, while computer systems may be complex, they are predictable so have faith that you can find the problem. My approach to these kinds of issues two fold:
1) Gather as much detailed information as possible from the customer about their failure and analyse it meticulously for patterns. Gather multiple sets of data for multiple failure occurrences to build up a clearer picture.
2) Try and reproduce the failure in house. Continue to make your system more and more similar to the customers system until you can reproduce it, the system is identical or it becomes impractical to make it more similar.
While doing this consider:
1)What differences exist between this system and other working systems.
2)What has recently changed in your product or the customers configuration that has caused the problem to start occurring.
Regards
Depending on the issue you could get WinDbg dumps, they normally give a pretty good idea of what is going on. We have diagnosed quite a few problems that weren't crashed from minidumps.
For .Net apps we also was Trace.Writeline then we can get the user to fire up DbgView and send us the output.
Its very complicated issue . I was thinking writing some procedure for this . I just made some procedure for this non-reproducible bug . it might be helpful
When the Bug accorded .. There are several factors it might to occur.
I am Sure all bugs are reproducible . I always keep eye for these kind of issues..
Get the System Information
what other process the customer did before that.
Time period it occurs . its rare or frequent
its next action happened after the issue ( its always same or different )
Find the factors for this bug ( as developer )
Find the exact position where this issue happened .
Find ALL THE SYSTEM Factors on that time
check all memory leaks or user error issue or wrong condition in code
List out all facotrs to may cause this issue.
How the each factors are affected this and wat are the data is holding those factors
Check memeory issues happened
check the customer have the current update code like yours
check all log from atleast 1 month and find any upnormal operation happened . keep on note
Just a short anecdote (hence 'community wiki'): Last week I thought it was a clever idea in a Django app to import the module pprint for pretty printing Python data only if DEBUG was True:
if settings.DEBUG:
from pprint import pprint
Then I used here and there the pprint command as debugging statement:
pprint(somevar) # show somevar on the console
After finishing the work, I tested the app with setting DEBUG=False. You can guess what happened: The site broke with HTTP500 errors all over the place, and I did not know why, because there is no traceback if DEBUG is False. I was puzzled that the errors disappeared magically, if I switched back to debug mode.
It took me 1-2 hours of putting print statements all over the code to find that the code crashes at exactly the above pprint() line. Then it took me another half an hour to convince myself to stop banging my head on the table.
Now comes the moral of the story:
Not every thing that looks like a clever idea in the first view is such savvy in the end.
An important point to look at for debugging these errors are all configuration options and platform switches your code by itself makes. This can be quite a lot more than just some user preferences. Document good, if you make an assumption about the user's platform (e.g., if you test for Win/Mac/Linux only, will your code crash on BSD or Solaris?)
Cheers,
However tough a non-reproducible problem is - we can still have a structured and strategic approach to solve them - and I can say through experience that it requires out of box thinking in 50% of the cases. Generally speaking, one can categorize the problems into different types which helps to identify what tool to be used. For example if you have a non-reproducible application crash issue or a memory issue you can use profilers and nail down the issue caused in the particular functionality.
Also, one of the most important approach is inforamation rich logging. I also use a lot of enums to describe the state of the process depending on the scenario in question. for exampe, I used like Initiated, Triggered, Running, Waiting Repaired etc to describe the schedules states and saved them to DB at different stages.
Not mentioned yet, but "directed code review" is one good solution, especially if you didn't do a proper review (at least 1 hour per 100 lines of code) before release.
I have also seen impressive demos of AppSight Suite, which is basically an advanced environment monitoring and logging tool. It allows the customer to record what happens on his machine in an extensive but fairly compact log file which you can then replay.
As many have mentioned, extensive logging, and asking the client for the log files when something goes wrong. In addition, as I worked more with web apps, I'll also provide detailed, but succinct deployment documentation (e.g., deployment steps, environmental resources that need to be set up etc).
Here are common problems I've seen that lead to the types of problem you are describing:
Environment not set up properly (e.g., missing environment variables, data sources etc).
Application not fully deployed (e.g., database schema not deployed).
Difference in operating system configuration (default character encoding being the most common culprit for me).
Most of the time, these issues can be identified through the log content.
You can use tools like Microsoft SharedView or TeamViewer to connect to remote PC and inspect problem directly on site. Of course, you'll need cooperation with customer.

What can we do about a randomly crashing app without source code?

I am trying to help a client with a problem, but I am running out of ideas. They have a custom, written in house application that runs on a schedule, but it crashes. I don't know how long it has been like this, so I don't think I can trace the crashes back to any particular software updates. The most unfortunate part is there is no longer any source code for the VB6 DLL which contains the meat of the logic.
This VB6 DLL is kicked off by 2-3 function calls from a VB Script. Obviously, I can modify the VB Script to add error logging, but I'm not having much luck getting quality information to pinpoint the source of the crash. I have put logging messages on either side of all of the function calls and determined which of the calls is causing the crash. However, nothing is ever returned in the err object because the call is crashing wscript.exe.
I'm not sure if there is anything else I can do. Any ideas?
Edit: The main reason I care, even though I don't have the source code is that there may be some external factor causing the crash (insufficient credentials, locked file, etc). I have checked the log file that is created in drwtsn32.log as a result of wscript.exe crashing, and the only information I get is an "Access Violation".
I first tend to think this is something to do with security permissions, but couldn't this also be a memory access violation?
You may consider using one of the Sysinternals tools if you truly think this is a problem with the environment such as file permissions. I once used Filemon to figure out all the files my application was touching and discovered a problem that way.
You may also want to do a quick sanity check with Dependency Walker to make sure you are actually loading the DLL files you think you are. I have seen the wrong version of the C runtime being loaded and causing a mysterious crash.
Depending on the scope of the application, your client might want to consider a rewrite. Without source code, they will eventually be forced to do so anyway when something else changes.
It's always possible to use a debugger - either directly on the PC that's running the crashing app or on a memory dump - to determine what's happening to a greater or lesser extent. In this case, where the code is VB6, that may not be very helpful because you'll only get useful information at the Win32 level.
Ultimately, if you don't have the source code then will finding out where the bug is really help? You won't be able to fix it anyway unless you can avoid that code path for ever in the calling script.
You could use the debugging tools for windows. Which might help you pinpoint the error, but without the source to fix it, won't do you much good.
A lazier way would be to call the dll from code (not a script) so you can at least see what is causing the issue and inspect the err object. You still won't be able to fix it, unless the problem is that it is being called incorrectly.
The guy of Coding The Wheel has a pretty interesting series about building an online poker bot which is full of serious technical info, a lot of which is concerned with how to get into existing applications and mess with them, which is, in some way, what you want to do.
Specifically, he has an article on using WinDbg to get at important info, one on how to bend function calls to your own code and one on injecting DLLs in other processes. These techniques might help to find and maybe work around or fix the crash, although I guess it's still a tough call.
There are a couple of tools that may be helpful. First, you can use dependency walker to do a runtime profile of your app:
http://www.dependencywalker.com/
There is a profile menu and you probably want to make sure that the follow child processes option is checked. This will do two things. First, it will allow you to see all of the lib versions that get pulled in. This can be helpful for some problems. Second, the runtime profile uses the debug memory manager when it runs the child processes. So, you will be able to see if buffers are getting overrun and a little bit of information about that.
Another useful tool is process monitor from Mark Russinovich:
http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx
This tool will report all file, registry and thread operations. This will help you determine if any you are bumping into file or registry credential issues.
Process explorer gives you a lot of the same information:
http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx
This is also a Russinovich tool. I find that it is a bit easier to look at some data through this tool.
Finally, using debugging tools for windows or dev studio can give you some insight into where the errors are occurring.
Access violation is almost always a memory error - all the more likely in this case because its random crashing (permissions would likely be more obviously reproducible). In the case of a dll it could be either
There's an error in the code in the dll itself - this could be something like a memory allocation error or even a simple loop boundary condition error.
There's an error when the dll tries to link out to another dll on the system. This will generally be caused by a mismatch between dll versions on the machine.
Your first step should be to try and get a reproducible crash condition. If you don't have a set of circumstances that will crash the system then you cannot know when you have fixed it.
I would then install the system on a clean machine and attempt to reproduce the error on that. Run a monitor and check precisely what other files (dlls etc) are open when the program crashes. I have seen code that crashes on a hyperthreaded Pentium but not on an earlier one - so restoring an old machine as a testbed may be a good option to cover that one. Varying the amount of ram in the machine is also worthwhile.
Hopefully these steps might give you a clue. Hopefully it will be an environment problem and so can be avoided by using the right version of windows, dlls etc. However if you're still stuck with the crash at this point with no good clues then your options are either to rewrite or attempt to hunt down the problem further by debugging the dll at assembler lever or dissassembling it. If you are not familiar with assembly code then both of these are long-shots and it's difficult to see what you will gain - and either option is likely to be a massive time-sink. Myself I have in the past, when faced with a particularly low-level high intensity problem like this advertised on one of the 'coder for hire' websites and looked for someone with specialist knowledge. Again you will need a reproducible error to be able to do this.
In the long run a dll without source code will have to be replaced. Paying a specialist with assembly skills to analyse the functions and provide you with flowcharts may well be worthwhile considering. It is good business practice to do this sooner in a controlled manner than later - like after the machine it is running on has crashed and that version of windows is no longer easily available.
You may want to try using Resource Hacker you may have luck de-compiling the in house application. it may not give you the full source code but at least maybe some more info about what the app is doing, which also may help you determine your culrpit.
Add the maximum possible RAM to the machine
This simple and cheap hack has work for me in the past. Of course YMMV.
Reverse engineering is one possibility, although a tough one.
In theory you can decompile and even debug/trace a compiled VB6 application - this is the easy part, modifying it without source, in all but the most simple cases, is the hard part.
Free compilers/decompilers:
VB decompilers
VB debuggers
Rewrite would be, in most cases, a more successful and faster way to solve the problem.

How to consistently organize code for debugging?

When working in a big project that requires debugging (like every project) you realize how much people love "printf" before the IDE's built-in debugger. By this I mean
Sometimes you need to render variable values to screen (specially for interactive debugging).
Sometimes to log them in a file
Sometimes you have to change the visibility (make them public) just to another class to access it (a logger or a renderer for example).
Some other times you need to save the previous value in a member just to contrast it with the new during debugging
...
When a project gets huge with a lot of people working on it, all this debugging-specific code can get messy and hard to differentiate from normal code. This can be crazy for those who have to update/change someone else's code or to prepare it for a release.
How do you solve this?
It is always good to have naming standards and I guess that debug-coding standards should be quite useful (like marking every debug-variable with a _DBG sufix). But I also guess naming is just not enough. Maybe centralizing it into a friendly tracker class, or creating a robust base of macros in order to erase it all for the release. I don't know.
What design techniques, patterns and standards would you embrace if you are asked to write a debug-coding document for all others in the project to follow?
I am not talking about tools, libraries or IDE-specific commands, but for OO design decisions.
Thanks.
Don't commit debugging code, just debuggin tools.
Loggin OTOH has a natural place in execption handling routines and such. Also a few well placed logging statments in a few commonly used APIs can be good for debugging.
Like one log statment to log all SQL executed from the system.
My vote would be with what you described as a friendly tracker class. This class would keep all of that centralized, and potentially even allow you to change debug/logging strategies dynamically.
I would avoid things like Macros simply because that's a compiler trick, and not true OO. By abstracting the concept of debug/logging, you have the opportunity to do lots of things with it including making it a no-op if needed.
Logging or debugging? I believe that well-designed and properly unit-tested application should not need to be permanently instrumented for debugging. Logging on the other hand can be very useful, both in finding bugs and auditing program actions. Rather than cover a lot of information that you can get elsewhere, I would point you at logging.apache.org for either concrete implementations that you can use or a template for a reasonable design of a logging infrastructure.
I think it's particularly important to avoid using System.outs / printfs directly and instead use (even a custom) logging class. That at least gives you a centralized kill-switch for all the loggings (minus the call costs in Java).
It is also useful to have that log class have info/warn/error/caveat, etc.
I would be careful about error levels, user ids, metadata, etc. since people don't always add them.
Also, one of the most common problems that I've seen is that people put temporary printfs in the code as they debug something, and then forget where they put them. I use a tool that tracks everything that I do so I can quickly identify all my recent edits since an abstract checkpoint and delete them. In most cases, however, you may want to pose special rules on debug code that can be checked into your source control.
In VB6 you've got
Debug.Print
which sends output to a window in the IDE. It's bearable for small projects. VB6 also has
#If <some var set in the project properties>
'debugging code
#End If
I have a logging class which I declare at the top with
Dim Trc as Std.Traces
and use in various places (often inside #If/#End If blocks)
Trc.Tracing = True
Trc.Tracefile = "c:\temp\app.log"
Trc.Trace 'with no argument stores date stamp
Trc.Trace "Var=" & var
Yes it does get messy, and yes I wish there was a better way.
We routinely are beginning to use a static class that we write trace messages to. It is very basic and still requires a call from the executing method, but it serves our purpose.
In the .NET world, there is already a fair amount of built in trace information available, so we do not need to worry about which methods are called or what the execution time is. These are more for specific events which occur in the execution of the code.
If your language does not support, through its tracing constructs, categorization of messages, it should be something that you add to your tracing code. Something to the effect that will identify different levels of importance and/or functional areas is a great start.
Just avoid instrumenting your code by modifying it. Learn to use a debugger. Make logging and error handling easy. Have a look at Aspect Oriented Programming
Debugging/Logging code can indeed be intrusive. In our C++ projects, we wrap common debug/log code in macros - very much like asserts. I often find that logging is most usefull in the lower level components so it doesn't have to go everywhere.
There is a lot in the other answers to both agree and disagree with :) Having debug/logging code can be a very valuable tool for troubleshooting problems. In Windows, there are many techniques - the two major ones are:
Extensive use of checked (DBG) build asserts and lots of testing on DBG builds.
the use of ETW in what we call 'fre' or 'retail' builds.
Checked builds (what most ohter call DEBUG builds) are very helpfull for us as well. We run all our tests on both 'fre' and 'chk' builds (on x86 and AMD64 as well, all serever stuff runs on Itanium too...). Some people even self host (dogfood) on checked builds. This does two things
Finds lots of bugs that woldn't be found otherwise.
Quickly elimintes noisy or unnessary asserts.
In Windows, we use Event Tracing for Windows (ETW) extensively. ETW is an efficient static logging mechanism. The NT kernel and many components are very well instrumented. ETW has a lot of advantages:
Any ETW event provider can be dynamically enabled/disabled at run time - no rebooting or process restarts required. Most ETW providers provide granular control over individual events, or groups of events.
Events from any provider (most importantly the kernel) can be merged into a single trace so all events can be correlated.
A merged trace can be copied off the box and fully processed - with symbols.
The NT kernel sample pofile interrupt can generate an ETW event - this yeilds a very light weight sample profiler that can be used any time
On Vista and Windows Server 2008, logging an event is lock free and fully multi-core aware - a thread on each processor can independently log events with no synchronization needed between them.
This is hugly valuable for us, and can be for your Windows code as well - ETW is usuable by any component - including user mode, drivers and other kernel components.
One thing we often do is write a streaming ETW consumer. Instead of putting printfs in the code, I just put ETW events at intersting places. When my componetn is running, I can then just run my ETW watcher at any time - the watcher receivs the events and displays them, conts them, or does other interesting things with them.
I very much respectfully disagree with tvanfosson. Even the best code can benefit from well implemented logging. Well implimented static run-time logging can make finding many problems straight forward - without it, you have zero visiblilty into what is happening in your component. You can look at inputs, outputs and guess -that's about it.
They key here is the term 'well implimented'. Instrumentation must be in the right place. Like any thing else, this requries some thought and planning. If it is not in helpfull/intersting places, then it will not help you find problems in a a development, testing, or deployed scenario. You can also have too much instrumeation causing perf problems when it is on - or even off!
Of course, different software products or componetns will have different needs. Some things may need very little instrumenation. But a widely depolyed or critical component can greatly benefit from weill egineered instrumeantion.
Here is a scenario for you (note, this very well may not apply to you...:) ). Lets say you have a line-of-business app deployed on every desktop in your company - hundreds or thousands of users. What do you do when someone has a pbolem? Do yo stop by their office and hookup a debugger? If so, how do you know what version they have? Where do you get the right symbols? How do you get the debuger on their system? What if it only happens once every few hours or days? Are you going to let the system run with the debugger connected all that time?
As you can imagine - hooking up debugger in this scenario is disruptive.
If your component is instrumented with ETW, then you could ask your user to simply turn on tracing; continue to do his/her work; then hit the "WTF" button when the problem happens. Even better: your app may have be able to self log - detecting problems at run time and turning on logging auto-magicaly. It could even send you ETW files when problems occured.
These are just simple exmples - logging can be handled many different ways. My key recomendation here is to think about how loging might be able to help you find, debug, and fix problems in your componetns at dev time, test time, and after they are deployed.
I was burnt by the same issue in about every project I've been involved with, so now I have this habit that involves extensive use of logging libraries (whatever the language/platform provides) from the start. Any Log4X port is fine for me.
Building yourself some proper debug tools can be extremely valuable. For example in a 3D environment, you might have an option to display the octree, or to render planned AI paths, or to draw waypoints that are normally invisible. You'd probably also want some on-screen display to aid with profiling too: the current framerate, count of polygons on screen, texture memory usage, and so on.
Although this takes some time and effort to do, in the long run it can save you a lot of time and frustration.

Resources