What are the different causes for error code 0x3? - windows

A couple of times now I have ended up with programs crashing with a return value of 3. Although I've always managed to solve it by rewriting or changing my code, I am now curious about the causes of this error code (and Google doesn't seem to have nearly as much data on it as on 0xc0000005). According to the Microsoft docs, it happens when the system cannot find the path specified.
Now, I might just be bad at understanding English, but I cannot understand how this error can possibly relate to my code (no files/pointers are involved or manipulated in these specific cases). My only guess is that the problem is related to some missing internal DLLs.
Could anybody please provide me with further explanation of this error code? I feel like I should understand the cause of a problem before trying to solve it, and it might help in future cases.
Thank you very much.

Related

More specific OpenGL error information

Is there a way to retrieve more detailed error information when OpenGL has flagged an error? I know there isn't in core OpenGL, but is there perhaps some common extension or platform- or driver-dependent way or anything at all?
My basic problem is that I have a game (written in Java with JOGL), and when people have trouble with it, which they do on certain hardware/software configurations, it can be quite hard to trace down where the root of the problem lies. For performance reasons, I can't keep calling glGetError for each command but only do so at a few points in the program, so it's hard to even find which command flagged the error to begin with. Even if I could, however, the extremely general error codes that OpenGL has don't really tell me all that much about what happened (seeing as the man pages on the commands even describe how the various error codes are reused for sometimes quite many different actual error conditions).
It would be tremendously helpful if there were a way to find out what OpenGL command actually flagged the error, and also more details about the error that was flagged (like, if I get GL_INVALID_VALUE, what value to what argument was invalid and why?).
It seems a bit strange that drivers wouldn't provide this information, even if in a completely custom way, but look as I have, I sure haven't found any way to get at it. If it really is the case that they don't, is there any good reason why that is so?
Actually, there is a feature in core OpenGL that will give you detailed debug information. But you are going to have to set your minimum version requirement pretty high to have this as a core feature.
Nevertheless, see this article -- even though it only went core in OpenGL 4.3, it existed in extension form for quite some time and it does not require any special hardware feature. So for the most part all you really need is a recent driver from NV or AMD.
I have an example of how to use this extension in an answer I wrote a while back, complete with a few utility functions to make the output easier to read. It is written in C, so I do not know how helpful it will be, but you might find something useful.
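That said, here is a rough, minimal sketch (not the code from that answer) of registering a debug callback with the core GL 4.3 / KHR_debug entry points; it assumes a debug-capable context and a loader such as GLEW or GLAD that exposes these functions:

#include <stdio.h>
#include <GL/glew.h>   /* any loader exposing the GL 4.3 debug entry points will do */

static void APIENTRY gl_debug_callback(GLenum source, GLenum type, GLuint id,
                                       GLenum severity, GLsizei length,
                                       const GLchar *message, const void *user)
{
    (void)source; (void)length; (void)user;
    fprintf(stderr, "GL debug [id=%u type=0x%x severity=0x%x]: %s\n",
            id, type, severity, message);
    /* With synchronous output enabled, a breakpoint placed here still has the
       offending GL call on the stack. */
}

void enable_gl_debug_output(void)
{
    glEnable(GL_DEBUG_OUTPUT);
    glEnable(GL_DEBUG_OUTPUT_SYNCHRONOUS);  /* deliver messages inside the faulty call */
    glDebugMessageCallback(gl_debug_callback, NULL);
}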
Here is the sort of output you can expect from this extension (AMD Catalyst):
OpenGL Error:
=============
Object ID: 102
Severity: Medium
Type: Performance
Source: API
Message: glDrawElements uses element index type 'GL_UNSIGNED_BYTE' that is not
optimal for the current hardware configuration; consider using
'GL_UNSIGNED_SHORT' instead.
Not only will it give you error information, but it will even give you things like performance warnings for doing something silly like using 8-bit vertex indices (which desktop GPUs do not like).
To answer another one of your questions: if you set the debug output to synchronous and set a breakpoint in your debug callback, you can easily make any debugger break on an OpenGL error. If you examine the call stack, you should be able to quickly identify exactly which API call generated the error in most cases.
Here are some suggestions.
According to the man pages, glGetError returns the value of the error flag and then resets it to GL_NO_ERROR. I would use this property to track down your bug - if nothing else you can switch up where you call it and do a binary search to find where the error occurs.
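For example, a small checking macro (a sketch, not anything OpenGL itself provides) makes it easy to move the check around and bisect down to the offending call:

#include <stdio.h>
#include <GL/gl.h>

#define GL_CHECK(label)                                              \
    do {                                                             \
        GLenum err;                                                  \
        while ((err = glGetError()) != GL_NO_ERROR)                  \
            fprintf(stderr, "GL error 0x%x %s (%s:%d)\n",            \
                    err, (label), __FILE__, __LINE__);               \
    } while (0)

/* Usage: GL_CHECK("after texture upload"); */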
I doubt calling glGetError will give you a performance hit. All it does is read back an error flag.
If you don't have the ability to test this on the specific hardware/software configurations those people have, it may be tricky. OpenGL drivers are implemented for specific devices, after all.
glGetError is good for basically saying that the previous line screwed up. That should give you a good starting point - you can look up in the man pages why that function will throw the error, rather than trying to figure it out based on its enum name.
There are other, more specific error functions to call, such as glGetProgramiv and glCheckFramebufferStatus, that you may want to check, as glGetError doesn't catch every type of error; i.e., just because it reads clean doesn't mean another error didn't happen.
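As a hedged sketch of those extra checks (assuming a GL 2.0+/3.0+ context and a loader providing these entry points):

#include <stdio.h>
#include <GL/glew.h>

void check_program_and_framebuffer(GLuint program, GLuint fbo)
{
    GLint linked = GL_FALSE;
    glGetProgramiv(program, GL_LINK_STATUS, &linked);  /* glGetError won't report link failures */
    if (!linked) {
        char log[1024];
        glGetProgramInfoLog(program, sizeof log, NULL, log);
        fprintf(stderr, "program link failed: %s\n", log);
    }

    glBindFramebuffer(GL_FRAMEBUFFER, fbo);
    GLenum status = glCheckFramebufferStatus(GL_FRAMEBUFFER);
    if (status != GL_FRAMEBUFFER_COMPLETE)
        fprintf(stderr, "framebuffer incomplete: 0x%x\n", status);
}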

What helps you improve your ability to find a bug?

I want to know if there are methods to quickly find bugs in a program.
It seems that the more you master the architecture of your software, the more quickly you can locate the bugs.
How do programmers improve their ability to find bugs?
Logging, and unit tests. The more information you have about what happened, the easier it is to reproduce it. The more modular you can make your code, the easier it is to check that it really is misbehaving where you think it is, and then check that your fix solves the problem.
Divide and conquer. Whenever you are debugging, you should be thinking about cutting down the possible locations of the problem. Every time you run the app, you should be trying to eliminate a possible source and zero in on the actual location. This can be done with logging, with a debugger, assertions, etc.
Here's a prophylactic method after you have found a bug: I find it really helpful to take a minute and think about the bug.
What exactly was the bug, in essence?
Why did it occur?
Could you have found it earlier, or more easily?
What else did you learn from the bug?
I find taking a minute to think about these things will make it far less likely that you will produce the same bug in the future.
I will assume you mean logic bugs. The best way I have found to capture logic bugs is to implement some sort of testing scheme. Check out JUnit as the standard. Pretty much, you define a set of accepted outputs for your methods. Every time you compile your system, it checks all of your test cases. If you have introduced new logic that breaks your tests, you will know about it instantly and know exactly what you have to fix.
Test driven design is a pretty big movement in programming right now. You will be hard pressed to find a language that doesn't support some kind of testing. Even JavaScript has a multitude of test suites.
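JUnit itself is Java-specific, but the idea is language-neutral; here is a minimal sketch in plain C, where the build runs a tiny assert-based test program and fails loudly when an expected output changes (the clamp function is purely hypothetical):

#include <assert.h>
#include <stdio.h>

/* hypothetical function under test */
static int clamp(int value, int lo, int hi)
{
    if (value < lo) return lo;
    if (value > hi) return hi;
    return value;
}

int main(void)
{
    assert(clamp(5, 0, 10) == 5);    /* in range: unchanged  */
    assert(clamp(-3, 0, 10) == 0);   /* below range: clamped */
    assert(clamp(42, 0, 10) == 10);  /* above range: clamped */
    puts("all clamp tests passed");
    return 0;
}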
Experience makes you a better debugger. Pay close attention to the bugs that you AND others commonly make. Try to figure out if/how these bugs apply to ALL code that affects you, not the single instance of where the bug was seen.
Raymond Chen is famous for his powers of psychic debugging.
Most of what looks like psychic debugging is really just knowing what people tend to get wrong.
That means that you don't necessarily have to be intimately familiar with the architecture / system. You just need enough knowledge to understand the types of bugs that apply and are easy to make.
I personally take the approach of thinking about where the bug may be in the code before actually opening up the code and taking a look. When you first start with this approach, it may not actually work very well, especially if you are pretty unfamiliar with the code base. However, over time someone will be able to tell you the behavior they are experiencing and you'll have a good idea where the problem is located or you may even know what to fix in the code to remedy the problem before even looking at the code.
I was on a project for several years that was maintained by a vendor. They were not very good debuggers, and most of the time it was up to us to point them to the area of the code that had the problem. What made our problem worse was that we didn't have a nice way to view the source code, so a lot of our "debugging" was done by feel.
Error checking and reporting. The #1 newbie coder debugging mistake is to turn off error reporting, avoid checking for whether what's going on makes sense, etc etc. In general, people feel like if they can't see anything going wrong then nothing is going wrong. Which of course could not be further from the case.
Instead, your code should be chock full of error conditions that will make lots of noise, with detailed reporting, someplace you will see it. (This doesn't mean inside a production web page.) Then, instead of having to trace an error all over the place because it got passed through sixteen layers of execution before it finally got someplace that broke, your errors start happening proximately to the actual issue.
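As a minimal sketch of that "make lots of noise" style (the CHECK macro here is a made-up helper, not from any particular library), every questionable condition is verified where it happens and reported with file and line, so the failure shows up next to its cause:

#include <stdio.h>
#include <stdlib.h>

#define CHECK(cond, msg)                                              \
    do {                                                              \
        if (!(cond)) {                                                \
            fprintf(stderr, "CHECK failed: %s (%s) at %s:%d\n",       \
                    (msg), #cond, __FILE__, __LINE__);                \
            abort();  /* fail loudly, close to the real problem */    \
        }                                                             \
    } while (0)

/* Usage (hypothetical):
     FILE *f = fopen(path, "r");
     CHECK(f != NULL, "config file could not be opened");             */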
It seems that the more you master the architecture of your software, the more quickly you can locate the bugs.
After understanding the architecture, one's ability to find bugs in the application increases with their ability to identify and write extensive tests.
Know your tools.
Make sure that you know how to use conditional breakpoints and watches in your debugger.
Use static analysis tools as well - they can point out the more obvious issues.
Sleep and rest.
Use programming methods that produce fewer bugs in the first place.
If implementing a single stand-alone functional requirement takes N separate point edits to the source code, the number of bugs introduced is roughly proportional to N, so find programming methods that minimize N. Ways to do this: DRY (don't repeat yourself), code generation, and DSLs (domain-specific languages).
Where bugs are likely, have unit tests.
Obviously. IMHO, the best unit tests are Monte Carlo.
Make intermediate results visible.
For example, compilers have intermediate representations, in the form of 4-tuples. If there is a bug, the intermediate code can be examined. That tells you whether the bug is in the first or the second half of the compiler.
P.S. Most programmers are not aware that they have a choice of how much data structure to use. The less data structure you use, the less are the chances for bugs (and performance issues) caused by it.
I find tracepoints to be an invaluable debugging tool. They are a bit like logging, except you create them during a debugging session to solve a particular issue, like breakpoints.
Printing the stacktrace in a tracepoint can be especially useful. For example, you can print the hash code and stacktrace in the constructor of an object, and then later on when the object is used again you can search for its hashcode to see which client code created it. Same for seeing who disposed it or called a certain method etc.
They are also great for debugging issues related to window focus changes and the like, where the debugger would interfere if you dropped into break mode.
Static code tools like FindBugs
Assertions, assertions, and assertions.
Some areas of our code have 4 or 5 assertions for each line of real code. When we get a bug report, the first thing that happens is that the customer data is processed in our debug build; 99 times out of a hundred an assert will fire near the cause of the bug.
Additionally, our debug build performs redundant calculations to ensure that an optimized algorithm is returning the correct result, and debug functions are used to examine the sanity of data structures.
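A hedged sketch of that pattern (the function and names here are hypothetical): assert the inputs, and in debug builds cross-check the optimized routine against a straightforward redundant calculation:

#include <assert.h>
#include <stddef.h>

/* optimized summation that the release build actually ships (unrolled by two) */
static long sum_fast(const int *v, size_t n)
{
    long total = 0;
    size_t i;
    for (i = 0; i + 1 < n; i += 2)
        total += v[i] + v[i + 1];
    if (i < n)
        total += v[i];
    return total;
}

long sum(const int *v, size_t n)
{
    assert(v != NULL || n == 0);          /* sanity-check the inputs */
    long total = sum_fast(v, n);
#ifndef NDEBUG
    {
        long check = 0;                   /* redundant calculation, debug builds only */
        size_t i;
        for (i = 0; i < n; i++)
            check += v[i];
        assert(total == check);
    }
#endif
    return total;
}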
The hardest thing new developers have to contend with is getting their code to survive the assertions of the code they are calling.
Additionally, we do not allow any code to be put back to the top level if it causes any integration or unit test to fail.
Stepping through the code, examining flow/state where unexpected behavior is occurring. (Then develop a test for it, of course).
Writing Debug.Write(message) in your code and using DebugView is another option. Then run your application to find out what is going on.
"Architecture" in software means something like:
Several components
The components interact across clearly-defined interfaces
Each component has a well-defined responsibility
The responsibility of one component is unlike the responsibilities of other components
So, as you said, the better the architecture the easier it is to find bugs.
First: knowing the bug, you can decide which functionality is broken, and therefore know which component implements that functionality. For example, if the bug is that something isn't being logged properly, this bug should be in one of three places:
In the component that's responsible for logging (your logging library)
Or, above that in the application code which is using this library
Or, below that in the system code which this library is using
Second: examine the data transfered across the interfaces between components. To continue the previous example above:
Set a debugger breakpoint on the application code which invokes the logger API, to verify whether the logger API is being used correctly (e.g. whether it's being invoked at all, whether parameters are as-expected, etc.).
Doing this tells you whether the bug is in the component above this interface, or in the component that's below this interface.
Repeat (perhaps using binary search if the call stack is very deep) until you've found which component is at fault.
When you come to the point where you think there must be a bug in the OS, check your assumptions -- and put them into the code with "assert" statements.
Conversely, as you are writing the code, think of the range of valid inputs for your algorithms and put in assertions to make sure you have what you think you have. Same goes for output: Check that you produced what you think you produced.
E.g. if you expect a non-empty list:
l = getList(input)
assert l, "List was empty for input: %s" % str(input)
I'm part of the QA team at work, and knowing the product and how it is developed helps a lot in finding bugs; also, when I make new QA tools I pass them to our dev team to test, because finding bugs in your own code is just plain hard!
Some people say programmers are tainted, so we cannot see bugs in our own product; we are not talking about code here, we are beyond that: usability and functionality itself.
Meanwhile, unit testing seems to be a nice solution for finding bugs in your own code, but it's totally pointless if you're wrong even before writing the unit test; how are you going to find the bugs then? You don't! Let your co-worker find them, or hire a QA guy.
Scientific debugging is what I always used, and it greatly helps.
Basically, if you can replicate a bug, you can track its origin. You should then experiment with some tests, observe the results, and infer hypotheses about why the bug happens.
Writing about all your hypotheses, attempts, expected results and observed results can help you track down the bugs, particularly if they're nasty.
There are automated tools that can help you with that process, particularly git-bisect (and similar bisection tools on other revision systems) to quickly find which change introduced the bug, unit testing to reproduce a bug and prevent regressions in your code (can be used in combination with bisect), and delta debugging to find the culprit in your code (similar to git-bisect but whereas git-bisect works on the code history, delta debugging works on the code directly).
But whatever the tools you are using, the most important benefit is in the scientific methodology, as this is the formalization of what most experienced debuggers do.

How detailed should error messages be?

I was wondering what the general consensus on error messages was. How detailed should they be?
I've worked on projects where there was a different error message for entering a number that was too big, too small, had a decimal, was a string, etc. That was quite nice for the user as they knew exactly where things went wrong, but the error handling code started to rival the actual business logic in size, and started to develop some of its own bugs.
On the other side I've worked on a project where you'd get very generic errors such as
COMPILE FAILED REASON 3
Which needless to say was almost entirely useless as reason 3 turned out to mean a link error.
So where is the middle ground? How do I know if I've added descriptive enough error messages? How do I know if the user will be able to understand where they've gone wrong?
There are two possible target audiences for an error message: the user and the developer.
One should generally have the message target the user.
o what the cause of the problem is
o why the program can't work around the problem
o what the user can do to work around the problem
o how to report the problem
If the problem is to be reported, the report should include as much program context information as possible.
o module name
o function name
o line number
o variables of interest in the general area of the problem
o maybe even a core dump.
Target the correct audience.
You should communicate what happened, and what the user's options are, in as few words as possible. The longer the error message is, the less likely the user is to read it. By the same token, short error messages are cryptic and useless. There's a sweet spot in terms of length, and it's different for every situation.
Too short:
Invalid input.
Too long:
Please enter a correctly formatted IP address, such as 192.168.0.1. An IP address is a number used to identify your computer on a network.
Just right:
Please enter a valid IP address.
As far as code bloat is concerned, if a little extra code will prevent a user from calling support or getting frustrated, then it's a good investment.
There are two types of error messages: Those that will be seen by the user and those that'll be seen by the programmer.
"How do I know if the user will be able to understand where they've gone wrong?"
I'm assuming that those messages are only going to be seen by the user, and not a very technical one, and COMPILE FAILED REASON 3 is not a typical end-user error message. It's something that the programmer will see (the user doesn't usually compile things).
So, if it's the user that'll see it:
Provide a short "this is an error message" line ("Oops! Something went wrong!", etc.)
Provide a small generic description of the error ("The site you're trying to connect to seems to be unavailable"/"You don't seem to have enough permissions to perform the XYZ task"/etc.)
Add a "Details>>" button, in case your user happens to understand computers well, including detailed information(exception stack trace, error code, etc.)
Finally, provide some simple and understandable commands for the user ("Try again", "Cancel", etc.)
The real question about error messages is if they should even be displayed. A lot of error messages are presented to a user but there is NO WAY for them to correct them.
As long as there is a way to correct the error, give the user enough information to allow them to correct it on their own. If they are not able to correct it on their own, is there any reason to tell them the technical reason for the crash? No; just log it to a file for troubleshooting later.
As detailed as they need to be ;)
I know it sounds like a smart ass answer but so much of this depends on your target audience and the type of error. For errors caused by invalid user entry, you could show them what constitutes a valid entry. For errors that the user can't control, a generic "we're working on it" type message might do.
I agree with Jon B's comments regarding length as well.
Error messages should be detailed, but clear. This can be achieved by combining error messages from multiple levels:
Failed to save the image
Permission denied: /foo.jpg
Here we have two levels. There can be more. First we tell the big picture and then we tell the details. The order is such that the part understood by most people comes first and the part fewer people understand comes second, but both can still be visible.
Additionally there could be a fix suggestion.
I would err on the side of more detail, but I think you answered your own question. To avoid the bloat in the code, provide useful information in the code/error message, and give more details in the documentation, help files, or an FAQ.
Having too little information is worse, in my opinion.
If you are using a language with rich introspection or other such capabilities, a log with the line that failed a check would be useful. The user can then forward it to tech support or otherwise get detailed information, and this is not additional code bloat, but using your own code to provide information.

How do you approach intermittent bugs? [closed]

Scenario
You've got several bug reports all showing the same problem. They're all cryptic with similar tales of how the problem occurred. You follow the steps but it doesn't reliably reproduce the problem. After some investigation and web searching, you suspect what might be going on and you are pretty sure you can fix it.
Problem
Unfortunately, without a reliable way to reproduce the original problem, you can't verify that it actually fixes the issue rather than having no effect at all or exacerbating and masking the real problem. You could just not fix it until it becomes reproducible every time, but it's a big bug and not fixing it would cause your users a lot of other problems.
Question
How do you go about verifying your change?
I think this is a very familiar scenario to anyone who has engineered software, so I'm sure there are a plethora of approaches and best practices to tackling bugs like this. We are currently looking at one of these problems on our project where I have spent some time determining the issue but have been unable to confirm my suspicions. A colleague is soak-testing my fix in the hopes that "a day of running without a crash" equates to "it's fixed". However, I'd prefer a more reliable approach and I figured there's a wealth of experience here on SO.
Bugs that are hard to reproduce are the hardest ones to solve. What you need is to make sure that you have found the root of the problem, even if the problem itself cannot be reproduced successfully.
The most common intermittent bugs are caused by race conditions - by eliminating the race, or ensuring that one side always wins, you have eliminated the root of the problem even if you can't successfully confirm it by testing the results. The only thing you can test is that the cause does not repeat itself.
Sometimes fixing what is seen as the root indeed solves a problem but not the right one - there is no avoiding it. The best way to avoid intermittent bugs is be careful and methodical with the system design and architecture.
You'll never be able to verify the fix without identifying the root cause and coming up with a reliable way to reproduce the bug.
For identifying the root cause: If your platform allows it, hook some post-mortem debugging into the problem.
For example, on Windows, get your code to create a minidump file (core dump on Unix) when it encounters this problem. You can then get the customer (or WinQual, on Windows) to send you this file. This should give you more information about how your code's gone wrong on the production system.
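A minimal sketch of the Windows side (assuming you link against dbghelp and are happy writing the dump next to the executable; the name write_minidump is made up for illustration):

#include <windows.h>
#include <dbghelp.h>    /* link with dbghelp.lib */

static LONG WINAPI write_minidump(EXCEPTION_POINTERS *info)
{
    HANDLE file = CreateFileA("crash.dmp", GENERIC_WRITE, 0, NULL,
                              CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file != INVALID_HANDLE_VALUE) {
        MINIDUMP_EXCEPTION_INFORMATION mei;
        mei.ThreadId = GetCurrentThreadId();
        mei.ExceptionPointers = info;
        mei.ClientPointers = FALSE;
        MiniDumpWriteDump(GetCurrentProcess(), GetCurrentProcessId(), file,
                          MiniDumpNormal, &mei, NULL, NULL);
        CloseHandle(file);
    }
    return EXCEPTION_EXECUTE_HANDLER;
}

/* Install once at startup: SetUnhandledExceptionFilter(write_minidump); */

The customer then sends you crash.dmp, which you can open in a debugger alongside the matching symbols.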
But without that, you'll still need to come up with a reliable way to reproduce the bug. Otherwise you'll never be able to verify that it's fixed.
Even with all of this information, you might end up fixing a bug that looks like, but isn't, the one that the customer is seeing.
Instrument the build with more extensive (possibly optional) logging and data saving that allows exact reproduction of the variable UI steps the users took before the crash occurred.
If that data does not reliably allow you to reproduce the issue then you've narrowed the class of bug. Time to look at sources of random behaviour, such as variations in system configuration, pointer comparisons, uninitialized data, etc.
Sometimes you "know" (or rather feel) that you can fix the issue without extensive testing or unit testing scaffolding, because you truly understand the issue. However, if you don't, it very often boils down to something like "we ran it 100 times and the error no longer occurred, so we'll consider it fixed until the next time it's reported.".
I use what I call "heavy-style defensive programming": add asserts in all the modules that seem linked to the problem. What I mean is, add A LOT of asserts: assert the obvious, assert the state of objects in all their members, assert the "environment" state, etc.
Asserts help you identify the code that is NOT linked to the problem.
Most of the time I find the origin of the problem just by writing the assertions, as it forces you to reread all the code and plunge into the guts of the application to understand it.
There is no one answer to this problem. Sometimes the solution you've found helps you figure out the scenario to reproduce the problem, in which case you can test that scenario before and after the fix. Sometimes, though, that solution you've found only fixes one of the problems but not all of them, or like you say masks a deeper problem. I wish I could say "do this, it works every time", but there isn't a "this" that fits that scenario.
You say in a comment that you think it is a race condition. If you think you know what "feature" of the code is generating the condition, you can write a test to try to force it.
Here is some risky code in C:
const int NITER = 1000;
int thread_unsafe_count = 0;
int thread_unsafe_tracker = 0;

void *thread_unsafe_plus(void *a)
{
    int i, local;
    (void)a;
    thread_unsafe_tracker++;
    for (i = 0; i < NITER; i++) {
        /* unsynchronized read-modify-write: the race we want to provoke */
        local = thread_unsafe_count;
        local++;
        thread_unsafe_count += local;
    }
    return NULL;
}

void *thread_unsafe_minus(void *a)
{
    int i, local;
    (void)a;
    thread_unsafe_tracker--;
    for (i = 0; i < NITER; i++) {
        local = thread_unsafe_count;
        local--;
        thread_unsafe_count += local;
    }
    return NULL;
}
which I can test (in a pthreads environment) with:
#include <pthread.h>
#include <stdio.h>

int main(void)
{
    pthread_t th1, th2;
    pthread_create(&th1, NULL, &thread_unsafe_plus, NULL);
    pthread_create(&th2, NULL, &thread_unsafe_minus, NULL);
    pthread_join(th1, NULL);
    pthread_join(th2, NULL);
    if (thread_unsafe_count != 0)
        printf("Ah ha!\n");   /* the race was observed */
    return 0;
}
In real life, you'll probably have to wrap your suspect code in some way to help the race hit more often.
If it works, adjust the number of threads and other parameters to make it hit most of the time, and now you have a chance.
First you need to get stack traces from your clients, that way you can actually do some forensics.
Next do fuzz tests with random input, and keep these tests running for long stretches; they're great at finding those irrational border cases that human programmers and testers can't find through use cases and understanding of the code.
In this situation, where nothing else works, I introduce additional logging.
I also add in email notifications that show me the state of the application when it breaks down.
Sometimes I add in performance counters... I put that data in a table and look at trends.
Even if nothing shows up, you are narrowing things down. One way or another, you will end up with useful theories.
These are horrible and almost always resistant to the 'fixes' the engineer thinks he is putting in, as they have a habit of coming back to bite months later. Be wary of any fixes made to intermittent bugs. Be prepared for a bit of grunt work and intensive logging as this sounds more of a testing problem than a development problem.
My own problem when overcoming bugs like these was that I was often too close to the problem, not standing back and looking at the bigger picture. Try and get someone else to look at how you approach the problem.
Specifically, my bug was to do with the setting of timeouts and various other magic numbers that in retrospect were borderline and so worked almost all of the time. The trick in my own case was to do a lot of experimentation with settings so that I could find out which values would 'break' the software.
Do the failures happen during specific time periods? If so, where and when? Is it only certain people that seem to reproduce the bug? What set of inputs seem to invite the problem? What part of the application does it fail on? Does the bug seem more or less intermittent out in the field?
When I was a software tester my main tools were a pen and paper to record notes of my previous actions - remembering a lot of seemingly insignificant details is vital. By observing and collecting little bits of data all the time, the bug will appear to become less intermittent.
For a difficult-to-reproduce error, the first step is usually documentation. In the area of the code that is failing, modify the code to be hyper-explicit: One command per line; heavy, differentiated exception handling; verbose, even prolix debug output. That way, even if you can't reproduce or fix the error, you can gain far more information about the cause the next time the failure is seen.
The second step is usually assertion of assumptions and bounds checking. Everything you think you know about the code in question, write .Asserts and checks. Specifically, check objects for nullity and (if your language is dynamic) existence.
Third, check your unit test coverage. Do your unit tests actually cover every fork in execution? If you don't have unit tests, this is probably a good place to start.
The problem with unreproducible errors is that they're only unreproducible to the developer. If your end users insist on reproducing them, leveraging the crash in the field is a valuable tool.
I've run into bugs on systems that seem to consistently cause errors, but when stepping through the code in a debugger the problem mysteriously disappears. In all of these cases the issue was one of timing.
When the system was running normally there was some sort of conflict for resources or taking the next step before the last one finished. When I stepped through it in the debugger, things were moving slowly enough that the problem disappeared.
Once I figured out it was a timing issue it was easy to find a fix. I'm not sure if this is applicable in your situation, but whenever bugs disappear in the debugger timing issues are my first suspects.
Once you fully understand the bug (and that's a big "once"), you should be able to reproduce it at will. When the reproduction code (automated test) is written, you fix the bug.
How to get to the point where you understand the bug?
Instrument the code (log like crazy). Work with your QA - they are good at re-creating the problem, and you need to arrange to have full dev toolkit available to you on their machines. Use automated tools for uninitialized memory/resources. Just plain stare at the code. No easy solution there.
Those types of bugs are very frustrating. Extrapolate it out to different machines with different types of custom hardware that might be in them (like at my company), and boy oh boy does it become a nightmare. I currently have several bugs like this at my job.
My rule of thumb: I don't fix it unless I can reproduce it myself or I'm presented with a log that clearly shows something wrong. Otherwise I cannot verify my change, nor can I verify that my change has not broken anything else. Of course, it's just a rule of thumb - I do make exceptions.
I think you're quite right to be concerned with your colleague's approach.
These problems have always been caused by:
Memory Problems
Threading Problems
To solve the problem, you should:
Instrument your code (Add log statements)
Code Review threading
Code Review memory allocation / dereferencing
The code reviews will most likely only happen if it is a priority, or if you have a strong suspicion about which code is shared by the multiple bug reports. If it's a threading issue, then check your thread safety - make sure variables accessible by both threads are protected. If it's a memory issue, then check your allocations and dereferences, and be especially suspicious of code that allocates and returns memory, or code that uses memory allocated by someone else who may be releasing it.
Some questions you could ask yourself:
When did this piece of code last work without a problem?
What has been done since it stopped working?
If the code has never worked the approach would be different naturally.
At least when many users change a lot of code all the time this is a very common scenario.
Specific scenario
While I don't want to concentrate on only the issue I am having, here are some details of the current issue we face and how I've tackled it so far.
The issue occurs when the user interacts with the user interface (a TabControl to be exact) at a particular phase of a process. It doesn't always occur and I believe this is because the window of time for the problem to be exhibited is small. My suspicion is that the initialization of a UserControl (we're in .NET, using C#) coincides with a state change event from another area of the application, which leads to a font being disposed. Meanwhile, another control (a Label) tries to draw its string with that font, and hence the crash.
However, actually confirming what leads to the font being disposed has proved difficult. The current fix has been to clone the font so that the drawing label still has a valid font, but this really masks the root problem which is the font being disposed in the first place. Obviously, I'd like to track down the full sequence, but that is proving very difficult and time is short.
Approach
My approach was first to look at the stack trace from our crash reports and examine the Microsoft code using Reflector. Unfortunately, this led to a GDI+ call with little documentation, which only returns a number for the error - .NET turns this into a pretty useless message indicating something is invalid. Great.
From there, I went to look at what call in our code leads to this problem. The stack starts with a message loop, not in our code, but I found a call to Update() in the general area under suspicion and, using instrumentation (traces, etc), we were able to confirm to about 75% certainty that this was the source of the paint message. However, it wasn't the source of the bug - asking the label to paint is no crime.
From there, I looked at each aspect of the paint call that was crashing (DrawString) to see what could be invalid and started to rule each one out until it fell on the disposable items. I then determined which ones we had control over and the font was the only one. So, I took a look at how we handled the font and under what circumstances we disposed it to identify any potential root causes. I was able to come up with a plausible sequence of events that fit the reports from users, and therefore able to code a low risk fix.
Of course, it crossed my mind that the bug was in the framework, but I like to assume we screwed up before passing the blame to Microsoft.
Conclusion
So, that's how I approached one particular example of this kind of problem. As you can see, it's less than ideal, but fits with what many have said.
Unless there are major time constraints, I don't start testing changes until I can reliably reproduce the problem.
If you really had to, I suppose you could write a test case that appears to sometimes trigger the problem, and add it to your automated test suite (you do have an automated test suite, right?), and then make your change and hope that test case never fails again, knowing that if you didn't really fix anything at least you now have more chance of catching it. But by the time you can write a test case, you almost always have things reduced down to the point where you're no longer dealing with such an (apparently) non-deterministic situation.
Simply: ask the user who reported it.
I just use one of the reporters as a verification system.
Usually the person who was willing to report a bug is more than happy to help you to solve her problem [1].
Just give her your version with a possible fix and ask if the problem is gone.
In cases where the bug is a regression, the same method can be used to bisect where the problem occurred by giving the user with the problem multiple versions to test.
In other cases the user can also help you to debug the problem by giving her a version with more debugging capabilities.
This will limit any negative effects from a possible fix to that person instead of guessing that something will fix the bug and then later on realising that you've just released a "bug fix" that has no effect or in worst case a negative effect for the system stability.
You can also limit the possible negative effects of the "bug fix" by giving the new version to a limited number of users (for example to all of the ones that reported the problem) and releasing the fix only after that.
Also, once she can confirm that the fix you've made works, it is easy to add tests that ensure your fix will stay in the code (at least at the unit-test level, if the bug is hard to reproduce at a higher system level).
Of course this requires that whatever you are working on supports this kind of approach. But if it doesn't I would really do whatever I can to enable it - end users are more satisfied and many of the hardest tech problems just go away and priorities come clear when development can directly interact with the system end users.
[1] If you have ever reported a bug, you most likely know that many times the response from the development/maintenance team is somehow negative from the end users point of view or there will be no response at all - especially in situations where the bug can not be reproduced by the development team.

Error Message Text - Best Practices

We are changing some of the text for our old, badly written error messages. What are some resources for best practices on writing good error messages (specifically for Windows XP/Vista).
In terms of wording your error messages, I recommend referring to the following style guides for Windows applications:
Windows user experience guidelines, and specifically the section on error messages here.
Microsoft Manual of Style
The ultimate best practice is to prevent the user from causing errors in the first place.
Don't tell users anything they don't care about; error code 5064 doesn't mean a thing to anyone. Don't tell them they did something wrong; disallow it in the first place. Don't blame them, especially not for mistakes your software made. Above all, when there is a problem, tell them how to fix it so they can move on and get some work done.
A good error message should:
Be unobtrusive (no blue-screen or yellow-screen of death)
Give the user direction to correct the problem (on their own if possible, or who to contact for help)
Hide useless, esoteric programmer nonsense (don't say, "a null reference exception occurred on line 45")
Be descriptive without being verbose. Just enough information to tell the user what they need to know and nothing more.
One thing I've started to do is to generate a unique number that I display in the error message and write to the log file so I can find the error in the log when the user sends me a screenshot or calls and says, "I got an error. It says my reference number is 0988-7634"
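A minimal sketch of that reference-number idea (the function names and log destination here are hypothetical): generate a short id, log the technical details against it, and show the user only the id:

#include <stdio.h>
#include <stdlib.h>

/* produce something like "0988-7634"; seed rand() once at startup */
static void make_reference(char *buf, size_t len)
{
    snprintf(buf, len, "%04d-%04d", rand() % 10000, rand() % 10000);
}

void report_error(const char *technical_details)
{
    char ref[16];
    make_reference(ref, sizeof ref);

    /* full details go to the log for the developer... */
    fprintf(stderr, "[ref %s] %s\n", ref, technical_details);

    /* ...while the user sees only the short reference */
    printf("Sorry, something went wrong. Reference number: %s\n", ref);
}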
For security reasons, don't provide internal system information that the user does not need.
Trivial example: when failing to login, don't tell the user if the username is wrong or the password is wrong; this will only help the attacker to brute force the system. Instead, just say "Username/Password combination is invalid" or something like that.
Always include suggestions to remedy the error.
Try to figure out a way to write your software so it corrects the problem for them.
For any user input (strings, filenames, values, etc), always display the erroneous value with delimiters around it (quotes, brackets, etc). e.g.
The filename you entered could not be found: "somefile.txt"
This helps to show any whitespace/carriage returns that may have sneaked in and greatly reduces troubleshooting and frustration.
Avoid identical error messages coming from different places; parametrize with file:line if possible, or use other context that lets you, the developer, uniquely identify where the error occurred.
Design the mechanism to allow easy localization, especially if it is a commercial product.
If the error messages are user-visible, make them complete, meaningful sentences that don't assume intimate knowledge of the code; remember, you're always too close to the problem -- the user is not. If possible, give the user guidance on how to proceed, who to contact, etc.
Every error should have a message if possible; if not, then try and make sure that all error-unwind paths eventually reach an error message that sheds light on what happened.
I'm sure there will be other good answers here...
Shorter messages may actually be read.
The longer your error message, the less the user will read. That being said, try to refactor the code so you can eliminate exceptions if there is an obvious response. Try to only have exceptions that happen based on things beyond your user or your code's control.
The best exception message is the one you never have to display.
Error handling is always better than error reporting, but since you are retrofitting the error messages and not necessarily the code here's a couple of suggestions:
Users want solutions, not problems. Help them know what to do after an error, even if the message is as simple as "Please close the current window and retry your action."
I am also a big fan of centralized logging of errors. Make sure the log is both human- and computer-scannable. Users don't always let you know what problems they are having, especially if they can be 'worked around', so the log can help you know what things need fixing.
If you can control the error dialog easily, having a dialog which shows a nice, readable message with a 'details' button to show the error number, trace, etc. can be a big help for real-time problem solving as well.
Support for multiple languages applies to all kinds of messages, but tends to be forgotten in the case of error messages.
I would second not telling the user useless esoteric information like numeric error codes. I would follow that up however by saying to definitely log that information for troubleshooting by more technically savvy people.

Resources