What can we do about a randomly crashing app without source code? - vb6

I am trying to help a client with a problem, but I am running out of ideas. They have a custom, written in house application that runs on a schedule, but it crashes. I don't know how long it has been like this, so I don't think I can trace the crashes back to any particular software updates. The most unfortunate part is there is no longer any source code for the VB6 DLL which contains the meat of the logic.
This VB6 DLL is kicked off by 2-3 function calls from a VB Script. Obviously, I can modify the VB Script to add error logging, but I'm not having much luck getting quality information to pinpoint the source of the crash. I have put logging messages on either side of all of the function calls and determined which of the calls is causing the crash. However, nothing is ever returned in the err object because the call is crashing wscript.exe.
I'm not sure if there is anything else I can do. Any ideas?
Edit: The main reason I care, even though I don't have the source code is that there may be some external factor causing the crash (insufficient credentials, locked file, etc). I have checked the log file that is created in drwtsn32.log as a result of wscript.exe crashing, and the only information I get is an "Access Violation".
I first tend to think this is something to do with security permissions, but couldn't this also be a memory access violation?

You may consider using one of the Sysinternals tools if you truly think this is a problem with the environment such as file permissions. I once used Filemon to figure out all the files my application was touching and discovered a problem that way.
You may also want to do a quick sanity check with Dependency Walker to make sure you are actually loading the DLL files you think you are. I have seen the wrong version of the C runtime being loaded and causing a mysterious crash.

Depending on the scope of the application, your client might want to consider a rewrite. Without source code, they will eventually be forced to do so anyway when something else changes.

It's always possible to use a debugger - either directly on the PC that's running the crashing app or on a memory dump - to determine what's happening to a greater or lesser extent. In this case, where the code is VB6, that may not be very helpful because you'll only get useful information at the Win32 level.
Ultimately, if you don't have the source code then will finding out where the bug is really help? You won't be able to fix it anyway unless you can avoid that code path for ever in the calling script.

You could use the debugging tools for windows. Which might help you pinpoint the error, but without the source to fix it, won't do you much good.
A lazier way would be to call the dll from code (not a script) so you can at least see what is causing the issue and inspect the err object. You still won't be able to fix it, unless the problem is that it is being called incorrectly.

The guy of Coding The Wheel has a pretty interesting series about building an online poker bot which is full of serious technical info, a lot of which is concerned with how to get into existing applications and mess with them, which is, in some way, what you want to do.
Specifically, he has an article on using WinDbg to get at important info, one on how to bend function calls to your own code and one on injecting DLLs in other processes. These techniques might help to find and maybe work around or fix the crash, although I guess it's still a tough call.

There are a couple of tools that may be helpful. First, you can use dependency walker to do a runtime profile of your app:
http://www.dependencywalker.com/
There is a profile menu and you probably want to make sure that the follow child processes option is checked. This will do two things. First, it will allow you to see all of the lib versions that get pulled in. This can be helpful for some problems. Second, the runtime profile uses the debug memory manager when it runs the child processes. So, you will be able to see if buffers are getting overrun and a little bit of information about that.
Another useful tool is process monitor from Mark Russinovich:
http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx
This tool will report all file, registry and thread operations. This will help you determine if any you are bumping into file or registry credential issues.
Process explorer gives you a lot of the same information:
http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx
This is also a Russinovich tool. I find that it is a bit easier to look at some data through this tool.
Finally, using debugging tools for windows or dev studio can give you some insight into where the errors are occurring.

Access violation is almost always a memory error - all the more likely in this case because its random crashing (permissions would likely be more obviously reproducible). In the case of a dll it could be either
There's an error in the code in the dll itself - this could be something like a memory allocation error or even a simple loop boundary condition error.
There's an error when the dll tries to link out to another dll on the system. This will generally be caused by a mismatch between dll versions on the machine.
Your first step should be to try and get a reproducible crash condition. If you don't have a set of circumstances that will crash the system then you cannot know when you have fixed it.
I would then install the system on a clean machine and attempt to reproduce the error on that. Run a monitor and check precisely what other files (dlls etc) are open when the program crashes. I have seen code that crashes on a hyperthreaded Pentium but not on an earlier one - so restoring an old machine as a testbed may be a good option to cover that one. Varying the amount of ram in the machine is also worthwhile.
Hopefully these steps might give you a clue. Hopefully it will be an environment problem and so can be avoided by using the right version of windows, dlls etc. However if you're still stuck with the crash at this point with no good clues then your options are either to rewrite or attempt to hunt down the problem further by debugging the dll at assembler lever or dissassembling it. If you are not familiar with assembly code then both of these are long-shots and it's difficult to see what you will gain - and either option is likely to be a massive time-sink. Myself I have in the past, when faced with a particularly low-level high intensity problem like this advertised on one of the 'coder for hire' websites and looked for someone with specialist knowledge. Again you will need a reproducible error to be able to do this.
In the long run a dll without source code will have to be replaced. Paying a specialist with assembly skills to analyse the functions and provide you with flowcharts may well be worthwhile considering. It is good business practice to do this sooner in a controlled manner than later - like after the machine it is running on has crashed and that version of windows is no longer easily available.

You may want to try using Resource Hacker you may have luck de-compiling the in house application. it may not give you the full source code but at least maybe some more info about what the app is doing, which also may help you determine your culrpit.

Add the maximum possible RAM to the machine
This simple and cheap hack has work for me in the past. Of course YMMV.

Reverse engineering is one possibility, although a tough one.
In theory you can decompile and even debug/trace a compiled VB6 application - this is the easy part, modifying it without source, in all but the most simple cases, is the hard part.
Free compilers/decompilers:
VB decompilers
VB debuggers
Rewrite would be, in most cases, a more successful and faster way to solve the problem.

Related

Out of Memory Message box

I have an MFC application developed with VS2003
It is working fine in XP vista etc.
But when i have executed it in windows 8, and we use it for some time,
then no window is displayed. Instead of that the a MessageBox with a message 'Out of Memory' is displayed. And the Message box is Having the caption of my application.
This issue is rarely occurred in windows 7 too.
I have tried watching the handles using tools like processexplorer and it is not increasing.
Also many forums says that it is because of increase in unclosed handles or resources.
Can any one suggest how can i find where the issue is. Or any one provide possible reason for this.
I cant setup the devenv in the machine causing the issue. I am confused how to diagnose by executing a test build in that.
Please provide your findings.
Thanks in advance.
You clearly have a memory leak somewhere. It's hard to be any more specific without seeing the code.
A debugger is really the best way to solve this problem. If you can reproduce the problem on your development machine, that would be the easiest case. If not, you can attach a debugger to the running process on another machine, either locally or remotely.
The MFC libraries also support some basic memory leak detection, turned on by default for Debug builds and controllable for other builds using the AfxEnableMemoryTracking function. You can use this feature to obtain information about which blocks of memory were allocated but not properly deallocated (i.e. were leaked).
Like you mentioned, Process Explorer is another good way to track down resource leaks. Are you sure that the handle counts are remaining constant rather than trending upwards over time? If the values in the columns are never changing like the question suggests, then you are surely doing something wrong. Your application has to be creating objects in order to do its job. The point is to make sure that it disposes of them when it is finished.
If you can't reproduce the problem with the running application and have only the source code available, you'll need to go through the code and make sure that every use of new has a corresponding use of delete (and that new[] matches up with delete[]). And in general in C++, you should avoid explicit dynamic memory allocation wherever possible. Instead, use the container classes that are provided either by MFC or the standard library. For example, don't allocate arrays manually, use std::vector to do it for you. These container classes ensure that the memory is automatically deallocated in the destructor when the object goes out of scope.

Debugging multiple exe program

I have a project where I am required to fix this program that has the tendency to crash very non-deterministically. This piece of software performs lots of calculations and database calls and can have a very high load, meaning lots of clients.
It is a very critical component and without it nothing works. It needs to perform and be able to run without user interaction for long times.
It is actually a native C++/ATL project with COM for communication between its two executables.
I have spent a lot of time now actually studying the code and looking for obvious code flaws, such as not locking of shared variables (those that are obvious), exception handlers that don't do anything with an exception, besides 'return false', even if this could be a critical exception.
But I wanted to know if anyone has some tips for in regards to tackling a project like this, where many people have actually attempted to fix the issue and failed, and now you've taken the challenge and don't want to fail.
I am prepared to go far to fix this, however I need some guidance as to how to go about it in a good way?
My idea is to first set up a test environment and hope to collect as much information as possible about crashes that do occur, and then find, through logging, stack traces, etc, the points of the crashes. This may or may not be a good way to debug such a project.
Any input is appreciated?
It may be obvious, but my roadmap for such bugfixing task is :
Collect as many information as possible on crash source (users, developpers, etc).
Inspect documentation and dependencies.
Inspect source code.
Build an isolated test env and try to reproduce.
If you still can't find the source of the bug, try to sanitize the source code and to add a more verbose logging system.
Regards
Log, log, log, log.

"Works on my machine" - How to fix non-reproducible bugs?

Very occasionally, despite all testing efforts, I get hit with a bug report from a customer that I simply can't reproduce in the office.
(Apologies to Jeff for the 'borrowing' of the badge)
I have a few "tools" that I can use to try and locate and fix these, but it always feels a bit like I'm knife-and-forking it:-
Asking for more and more context from the customer: (systeminfo)
Log files from our application
Ad-hoc tests with the customer to attempt to change the behaviour
Providing customer with a new build with additional diagnostics
Thinking about the problem in the bath...
Site visit (assuming customer is somewhere warm and sunny)
Are there set procedures, or other techniques than anyone uses to resolve problems like this?
One of the attributes of good debuggers, I think is that they always have a lot of weapons in their toolkit. They never seem to get "stuck" for too long and there is always something else for them to try. Some of the things I've been known to do:
ask for memory dumps
install a remote debugger on a client machine
add tracing code to builds
add logging code for debugging purposes
add performance counters
add configuration parameters to various bits of suspicious code so I can turn on and off features
rewrite and refactor suspicious code
try to replicate the issue locally on a different OS or machine
use debugging tools such as application verifier
use 3rd party load generation tools
write simulation tools in-house for load generation when the above failed
use tools like Glowcode to analyse memory leaks and performance issues
reinstall the client machine from scratch
get registry dumps and apply them locally
use registry and file watcher tools
Eventually, I find the bug just gives up out of some kind of awe at my persistence. Or the client realises that it's probably a machine or client side install or configuration issue.
Extensive logging usually helps.
The easiest way is always to see the customer in action (assuming that its readily reproducible by the customer). Oftentimes, problems arise due to issues with the customer's computer environment, conflicts with other programs, etc - these are details which you will not be able to catch on your dev rig. So a site visit might be useful; but if that's not convenient, tools like RealVNC might help as well in letting you see the customer 'do their thing'.
(watching the customer in action also allows you to catch them out in any WTF moments that they might have)
Now, if the problem is intermittent, then things get somewhat more complicated. The best way to get around this problem would be to log useful information in places where you guess problems could occur and perhaps use a tool like Splunk to index the log files during analysis. A diagnostic build (i.e. with extra logging) might be useful in this case.
I'm just in the middle of implementing an automated error reporting system that sends back to me information (currently via email although you could use a webservice) from any exception encountered by the app.
That way I get (nearly) all the information that I would do if I was sitting in front of VS2008 and it really helps me to work out what the problem is.
The customers are also usually (sorta) impressed that I know about their problem as soon as they encounter it!
Also, if you use the Application.ThreadException error handler you can send back info on unexpected exceptions too!
We use all the methods you mention progressively starting with the easiest and proceeding to the harder.
However you forget that sometimes hardware is at fault. For example, memory could be malfunctioning and some computation-intensive code will behave strangely throwing exceptions with weird diagnostics. Of cource, it works on your machine, since you don't have faulty hardware.
Experience is needed to identify such errors and insist that customer tries to install the program on another machine or does hardware check. One thing that helps greatly is good error handling - when your code throws an exception it should provide details, not just indicate that something is bad. With good error indication it's easier to identify such suspicious issues related to faulty hardware.
I think one of the most important things is the ability to ask sensible questions around what the customer has reported... More often than not they're not mentioning something that they don't see as relevant, but is actually key.
Telepathy would also be useful...
We've had good success using EurekaLog with it posting directly to FogBugz. This gets us a bug report containing a call stack, along with related system info (other processes running, memory, network details etc) and a screen shot. Occasionally customers enter further info too, which is helpful. It's certainly, in most cases, made it much easier and quicker to fix bugs.
One technique I've found useful is building an application with an integrated "diagnostic" mode (enabled by a command line switch when you launch the app). That certainly avoids the need to create custom builds with additional logging.
Otherwise, it sounds like what you're doing is as good an approach as any.
Copilot (assuming customer is somewhere cold and rainy :)
The usual procedure for this is to expect something like this will happen and add a ton of logging information. Of course you don't enable it from the beginning, but only when this happens.
Usually customers don't like to have to install a new version or some diagnostic tools. It is not their job to do your debugging. And visiting a client for cases like these is rarely an option. You must involve the client as little as possible. Changing a switch and sending you a log file is OK - anything more than this is too much.
I like the alternative of thinking the problem at the bath. I will start from trying to find out the differences between my machine and the client's configuration.
As a software engineer doing webstuff (booking/shop/member systems etc) the most important thing for us is to get as much information from the customer as possible.
Going from
it's broke!
to
it's broke! & here are screenshots of
every option I picked whilst
generating this particular report
reduces the amount of time it takes us to reproduce and fix an issue no end.
It may be obvious, but it takes a fair amount of chasing to get this kind of information from our customers sometimes! But it's worth it just for those moments you find they're not actually doing what they say they are.
I had these problems also. My solution was to add lots of logging and give the customer a debug build with all the possible debug information. Then make sure dr Watson (it was on Windows NT) created a memory dump with enough information.
After loading the memory dump in the debugger I could find out where and why it crashed.
EDIT: Oh, this obviously only works if the application terminates violently...
I think following the trail of the actions user took can lead us to the reasons of failure or selective failures. But most of the times users are at loss to precisely describe the interactions with the applications, the automatic screenshot taking (if it is desktop app. for .net app you can check Jeff's UnhandledExceptionHandler). Logging all the important action which change state of the objects can also help us in understanding it.
I don't have this problem very often, but if I did, I would use a screen sharing or recorded application to watch the user in action without having to go there (unless, as you said, it's warm and sunny and the company pays the trip).
I have recently been investigating such an issue myself. Over the course of my carrier I have learnt that, while computer systems may be complex, they are predictable so have faith that you can find the problem. My approach to these kinds of issues two fold:
1) Gather as much detailed information as possible from the customer about their failure and analyse it meticulously for patterns. Gather multiple sets of data for multiple failure occurrences to build up a clearer picture.
2) Try and reproduce the failure in house. Continue to make your system more and more similar to the customers system until you can reproduce it, the system is identical or it becomes impractical to make it more similar.
While doing this consider:
1)What differences exist between this system and other working systems.
2)What has recently changed in your product or the customers configuration that has caused the problem to start occurring.
Regards
Depending on the issue you could get WinDbg dumps, they normally give a pretty good idea of what is going on. We have diagnosed quite a few problems that weren't crashed from minidumps.
For .Net apps we also was Trace.Writeline then we can get the user to fire up DbgView and send us the output.
Its very complicated issue . I was thinking writing some procedure for this . I just made some procedure for this non-reproducible bug . it might be helpful
When the Bug accorded .. There are several factors it might to occur.
I am Sure all bugs are reproducible . I always keep eye for these kind of issues..
Get the System Information
what other process the customer did before that.
Time period it occurs . its rare or frequent
its next action happened after the issue ( its always same or different )
Find the factors for this bug ( as developer )
Find the exact position where this issue happened .
Find ALL THE SYSTEM Factors on that time
check all memory leaks or user error issue or wrong condition in code
List out all facotrs to may cause this issue.
How the each factors are affected this and wat are the data is holding those factors
Check memeory issues happened
check the customer have the current update code like yours
check all log from atleast 1 month and find any upnormal operation happened . keep on note
Just a short anecdote (hence 'community wiki'): Last week I thought it was a clever idea in a Django app to import the module pprint for pretty printing Python data only if DEBUG was True:
if settings.DEBUG:
from pprint import pprint
Then I used here and there the pprint command as debugging statement:
pprint(somevar) # show somevar on the console
After finishing the work, I tested the app with setting DEBUG=False. You can guess what happened: The site broke with HTTP500 errors all over the place, and I did not know why, because there is no traceback if DEBUG is False. I was puzzled that the errors disappeared magically, if I switched back to debug mode.
It took me 1-2 hours of putting print statements all over the code to find that the code crashes at exactly the above pprint() line. Then it took me another half an hour to convince myself to stop banging my head on the table.
Now comes the moral of the story:
Not every thing that looks like a clever idea in the first view is such savvy in the end.
An important point to look at for debugging these errors are all configuration options and platform switches your code by itself makes. This can be quite a lot more than just some user preferences. Document good, if you make an assumption about the user's platform (e.g., if you test for Win/Mac/Linux only, will your code crash on BSD or Solaris?)
Cheers,
However tough a non-reproducible problem is - we can still have a structured and strategic approach to solve them - and I can say through experience that it requires out of box thinking in 50% of the cases. Generally speaking, one can categorize the problems into different types which helps to identify what tool to be used. For example if you have a non-reproducible application crash issue or a memory issue you can use profilers and nail down the issue caused in the particular functionality.
Also, one of the most important approach is inforamation rich logging. I also use a lot of enums to describe the state of the process depending on the scenario in question. for exampe, I used like Initiated, Triggered, Running, Waiting Repaired etc to describe the schedules states and saved them to DB at different stages.
Not mentioned yet, but "directed code review" is one good solution, especially if you didn't do a proper review (at least 1 hour per 100 lines of code) before release.
I have also seen impressive demos of AppSight Suite, which is basically an advanced environment monitoring and logging tool. It allows the customer to record what happens on his machine in an extensive but fairly compact log file which you can then replay.
As many have mentioned, extensive logging, and asking the client for the log files when something goes wrong. In addition, as I worked more with web apps, I'll also provide detailed, but succinct deployment documentation (e.g., deployment steps, environmental resources that need to be set up etc).
Here are common problems I've seen that lead to the types of problem you are describing:
Environment not set up properly (e.g., missing environment variables, data sources etc).
Application not fully deployed (e.g., database schema not deployed).
Difference in operating system configuration (default character encoding being the most common culprit for me).
Most of the time, these issues can be identified through the log content.
You can use tools like Microsoft SharedView or TeamViewer to connect to remote PC and inspect problem directly on site. Of course, you'll need cooperation with customer.

What is the most challenging development environment you've ever had to work in and what did you do to get around the limitations?

By 'challenging development environment' I don't mean you're on a small boat that's rocking up and down and someone is holding a gun to your head. I mean, are the tools at your disposal making the problem difficult?
Development is typically a cycle of code, run, observe the result, repeat. In some environments this is a very quick and painless process, but in others it's very difficult. We end up using little tricks to help us observe the result and run the code faster.
I was thinking of this because I just started using SSIS (an ETL tool included with SQL Server 2005/8). It's been quite challenging for me to make progress, mainly because there's no guidance on what all the dialogs mean and also because the errors are very cryptic and most of the time don't really tell you what the problem is.
I think the easiest environment I've had to work in was VB6 because there you can edit code while the application is running and it will continue running with your new code! You don't even have to run it again. This can save you a lot of time. Surprisingly, in Netbeans with Java code, you can do the same. It steps out of the method and re-runs the method with the new code.
In SQL Server 2000 when there is an error in a trigger you get no stack trace, which can make it really tricky to locate where the problem occurred since an insert can have a cascading effect and trigger many triggers. In Oracle you get a very nice little stack trace with line numbers so resolving the problem is very easy.
Some of the things that I see really help in locating problems:
Good error messages when things go wrong.
Providing a stack trace when a problem occurs.
Debug environment where you can pause, then output the value of variables and step to follow the execution path.
A graphical debug environment that shows the code as it's running.
A screen that can show the current values of variables so you can print to them.
Ability to turn on debug logging on a production system.
What is the worst you've seen and what can be done to get around these limitations?
EDIT
I really didn't intend for this to be flame bait. I'm really just looking for ideas to improve systems so that if I'm creating something I'll think about these things and not contribute to people's problems. I'm also looking for creative ways around these limitations that I can use if I find myself in this position.
I was working on making modifications to Magento for a client. There is very little information on how the Magento system is organized. There are hundreds of folders and files, and there are at least a thousand view files. There was little support available from Magento forums, and I suspect the main reason for this lack of information is because the creators of Magento want you to pay them to become a certified Magento developer. Also, at that time last year there was no StackOverflow :)
My first task was to figure out how the database schema worked and which table stored some attributes I was looking for. There are over 300 tables in Magento, and I couldn't find out how the SQL queries were being done. So I had just one option...
I exported the entire database (300+ tables, and at least 20,000 lines of SQL code) into a .sql file using PhpMyAdmin, and I 'committed' this file into the subversion repositry. Then, I made some changes to the database using the Magento administration panel, and redownloaded the .sql file. Then, I ran a DIFF using TortioseSvn, and scrolled through the 20k+ lines file to find which lines had changed, LOL. As crazy as it sounds, it did work, and I was able to figure out which tables I needed to access.
My 2nd problem was, because of the crazy directory structure, I had to ftp to about 3 folders at the same time for trivial changes. So I had to keep 3 windows of my ftp program open, switch between them and ftp each time.
The 3rd problem was figuring out how the url mapping worked and where some of the code I wanted was being stored. Here, by sheer luck, I managed to find the Model class I was looking for.
Mostly by sheer luck and other similar crazy adventures I managed to work my way through and complete the project. Since then, StackOverflow was started and by a helpful answer to this bounty question I was able to finally get enough information about Magento that I can do future projects in a less crazy manner (hopefully).
Try keypunching your card deck in Fortran, complete with IBM JCL (Job Control Language), handing it in at the data center window, coming back the next morning and getting an inch-thick stack of printer paper with the hex dump of your crash, and a list of the charges to your account.
Grows hair on your fingernails.
I guess that was an improvement on the prior method of sitting at the console, toggling switches and reading the lights.
Occam on a 400x transputer network. As there was only one transputer that could output to console debugging was a nightmare. Had to build a test harness on a Sun network.
I took a class once, that was loosely based on SICP, except it was taught in Dylan rather than Scheme. Actually, it was in the old Dylan syntax, the prefix one that was based on Scheme. But because there were no interpreters for that old version of Dylan, the professor wrote one. In Java. As an applet. Which meant that it had no access to the filesystem; you had to write all of your code in a separate text editor, and then paste it into the Dylan interpreter. Oh, and it had no debugging facilities, of course. And being a Dylan interpreter written in Java, and this was back in 2000, it was ungodly slow.
Print statement debugging, lots of copying and pasting, and an awful lot of cursing at the interpreter were involved.
Back in the 90's, I was developing applications in Clipper, a compilable dBase-like language. I don't remember if it came with a debugger, we often used a 3rd-party debugger called 'Mr Debug' (really!). Although Clipper was fast, some of our more intensive routines were written in C. If you prayed to the correct gods and uttered the necessary incantations, you could use Microsoft's CodeView debugger to debug the C code. But usually not for more than a few minutes, as the program usually didn't like to spend much time running with CodeView (usually memory problems).
I had a series of makefile switches that I used to stub out the sections of code that I didn't need to debug at the time. My debugging environment was very sparse so there was as much free memory as possible for the program. I also think I drank a lot more back then...
Some years ago I reverse engineered game copy protections. Because the protections was written in C or C++ they were fairly easy to disassemble and understand what was going on. But in some cases it got hairy when the copy protection took a detour into the kernel to obfuscate what was happening. A few of them also started to use of custom made virtual machines to make the problem less understandable. I spent hours writing hooks and debuggers to be able to trace into them. The environment really offered a competetive and innovative mind. I had everything at my disposal save time. Misstakes caused reboots and very little feedback what went wrong. I realized thinking before acting is often a better solution.
Today I dispise debuggers. If the problem is in code visible to me I find it easiest to use verbose logging. (Sometimes the error is not understanding the interface/environment then debuggers are good.) I have also realized time is of an essance. You need to have a good working environment with possibility to test your code instantly. If you compiler takes 15 sec, your environment takes 20 sec to update or your caches takes 5 minutes to clear find another way to test your code. Progress keeps me motivated and without a good working environment I get bored, angry and frustrated.
The last job I had I was a Sitecore Developer. Bugfixing can be very painful if the bug only occurs on the client's system, and they do not have Visual Studio installed on the system, with the remote debugging off, and the problem only happens on the production server (not the staging server).
The worst in recent memory was developing SSRS reports using Dundas controls. We were doing quite a bit with the grids which required coding. The pain was the bugginess of the controls, and the lack of debugging support.
I never got around the limitations, but just worked through them.

Best Ways to Debug a Release Mode Application

Im sure this has happened to folks before, something works in debug mode, you compile in release, and something breaks.
This happened to me while working on a Embedded XP environment, the best way i found to do it really was to write a log file to determine where it would go wrong.
What are your experiences/ discoveries trying to tackle an annoying Release-mode bug?
Make sure you have good debug symbols available (you can do this even with a release build, even on embedded devices). You should be able to get a stack trace and hopefully the values of some variables. A good knowledge of assembly language is probably also useful at this point.
My experience is that generally the bug is related to code that is near the area of breakage. That is to say, if you are seeing an issue arising in the function "LoadConfigInfoFromFile" then probably you should start by closely analysing that for issues, rather than "DrawControlsOnScreen", if you know what I mean. "Spooky action at a distance" type bugs do not tend to arise often (although when they do, they tend to be a major bear).
Tracefile is always a good idea.
When it's about crashes, I'm using adplus, which is part of debugging tools for windows. basically what adplus does, is, it attaches windbg to the executable you're monitoring. When the application crashes, you get a crash dump and a log file. You can load the crash dump in your preferred debugger and find out, which instruction lead to the crash.
As release builds are heavily optimized compared to debug builds, the way you compile your code affects its behaviour. This is basically true when crashes in multithreaded code happen in the release version but not the debug version. adplus and windbg helped me, to find out, where this happened.
ADPlus is explained here:
httx://support.microsoft.com/?scid=kb%3Ben-us%3B286350&x=15&y=12
Basically what you have to do is:
1. Download and install WinDbg into C:\debuggers
httx://www.microsoft.com/whdc/devtools/debugging/default.mspx
Start your application
open a cmd and cd to c:\debuggers
start adplus like this:
"adplus.bat -crash your_exe.exe"
reproduce the crash
analyze the crashdump in vs2005 or in windbg
If it's only a small portion of the application that needs debugging then you can change those source files only to be built without optimisations. Presumably you generate debug info for all builds, and so this makes the application run mostly as it would in release, but allows you to debug the interesting parts properly.
How about using Trace statements. They are there for Release mode value checking.
Trace.WriteLine(myVar);
I agree on log file debugging to narrow it down.
I've used "Entering FunctionName" "Leaving FunctionName" until I can find what method it enters before the crash. Then I add more log messages re-compile and re-release.
Besides playing with turning off optimization and/or turning on debug information for your Release build as pauldoo said, a log file will good data can really help. I once wrote a "trace" app that would capture trace logs for the app if it was running when the release build started (otherwise the results would go to the debugger's output window if running under the debugger). I was able to have end-users email me log files from them reproducing the bugs they were seeing, and it was the only way I would have found the problem in at least one case.
Though it's probably not usable in an embedded environment, I've had good luck with WinDbg for debugging release-mode Windows applications. Even if the application is not compiled with symbol information, you can at least get a usable stack trace and plenty of other useful crash information.
You could also copy your debug symbols to the production environment even if it's compiled in relase mode
Here's an article with more information
If you problem is synchronization related dumping log in the file might be problematic.
In this case i usually will use some big array of string and dump this to screen/file after the problem was reproduces.
This is of course depend on your memory restriction, sometime i use just few symbols and numbers to store in the array if the memory on the platform is limited. Reading such logs is not a big pleasure, but sometimes this is the only choice.

Resources