How do make your Ruby scripts robust with proper exception handling - ruby

I'm not asking about how to use the begin..rescue statement in Ruby. Rather, I'm asking how do you figure out what's worth worrying about. I know some hardcore people probably go through the documentation/code to find every single possible error, and have code that handles all cases gracefully. But that might be going too far if you're not writing something "mission-critical".
So how do you know what errors to look out for? I generally just run my code, and if it stops due to an error, I add a begin..rescue statement. I'm sure there has to be a better way though. I just ran a script to import data into a database for some files, and it stopped after an hour (lots of files, don't ask) due to some error that I couldn't know was going to occur. Quite annoying, and I'm sure it's my fault for not writing more robust code. How do I do that?

I'm not sure it's always possible to always forsee exceptions without first experiencing them. Now that you know that databases sometimes fail, you might add better exception handling for that in the future, but experience is the only real teacher in this sense.
However, I think you might be a bit too quick to write your own error handlers. For production code, there are obviously certain scenarios you want to be able to catch easily and show a good error message for (e.g. too many database connections), but sometimes allowing the error to float up is, in fact, the most robust thing a script can do. If there's something wrong with my rubygems setup that causes your gem to spit up errors, I don't want you to hide them behind "something went wrong!"; I'd rather see the error itself so I can get down to work.
Catch the errors that really matter, and that you can do something about, like to retry a HTTP connection a few times before giving up. Beyond that, however, if there's nothing you can do about the error, there's no shame in letting it float up.

Related

Replacing golang error codes with panic for all error handling

We have medium sized application written in go. From all the code lines about 60 percent goes to code error handling. Something like this:
if err != nil {
return err
}
After some time, writing this lines over and over again becomes tiresome and we are now thinking to replace all error codes with panics.
I know panics aren't meant to be used like that.
What can be potential pitfall and does anyone have experience with something similar ?
The main pitfall would be a widespread use of hammers to drive screws. Panic is for unrecoverable/unexpected errors, error return values are for recoverable/expected errors.
Replace the word "panic" with "crash", because that is conceptually what a panic is. Do you honestly want to write an application that by design crashes whenever anything goes in any way remotely wrong? That would be the most fragile application on Earth, the very antithesis of fault tolerance.
As all the comments suggest you could be signing up for a disaster and let me add, whosoever is going to maintain this code will go insane trying to figure out what exactly went wrong when things do go awry if you don't do proper error handling (even if that person is you, it's going to haunt you).
Just repeating what others have said - the problem is being unable to differentiate between these two
a predictable & possibly recoverable condition
perhaps what is unrecoverable (need to exit the application or be handled at a higher level - this may be coming from your own application's conditions or some third party library, but is forcing you out of your normal execution path)
Panics are reserved for the latter types.

How do you reproduce bugs that occur sporadically?

We have a bug in our application that does not occur every time and therefore we don't know its "logic". I don't even get it reproduced in 100 times today.
Disclaimer: This bug exists and I've seen it. It's not a pebkac or something similar.
What are common hints to reproduce this kind of bug?
Analyze the problem in a pair and pair-read the code. Make notes of the problems you KNOW to be true and try to assert which logical preconditions must hold true for this happen. Follow the evidence like a CSI.
Most people instinctively say "add more logging", and this may be a solution. But for a lot of problems this just makes things worse, since logging can change timing-dependencies sufficiently to make the problem more or less frequent. Changing the frequency from 1 in 1000 to 1 in 1,000,000 will not bring you closer to the true source of the problem.
So if your logical reasoning does not solve the problem, it'll probably give you a few specifics you could investigate with logging or assertions in your code.
There is no general good answer to the question, but here is what I have found:
It takes a talent for this kind of thing. Not all developers are best suited for it, even if they are superstars in other areas. So know your team, who has a talent for it, and hope you can give them enough candy to get them excited about helping you out, even if it isn't their area.
Work backwards, and treat it like a scientific investigation. Start with the bug, what you see is wrong. Develop hypotheses about what could cause it (this is the creative/imaginative part, the art that not everyone has the talent for) - and it helps a lot to know how the code works. For each of those hypotheses (preferably sorted by what you think is most likely - again pure gut feel here), develop a test that tries to eliminate it as the cause, and test the hypothesis. Any given failure to meet a prediction doesn't mean the hypothesis is wrong. Test the hypothesis until it is confirmed to be wrong (although as it gets less likely you may want to move on to another hypothesis first, just don't discount this one until you have a definitive failure).
Gather as much data as you can during this process. Extensive logging and whatever else is applicable. Do not discount a hypothesis because you lack the data, rather remedy the lack of data. Quite often the inspiration for the right hypothesis comes from examining the data. Noticing something off in a stack trace, weird issue in a log, something missing that should be there in a database, etc.
Double check every assumption. So many times I have seen an issue not get fixed quickly because some general method call was not further investigated, so the problem was just assumed to be not applicable. "Oh that, that should be simple." (See point 1).
If you run out of hypotheses, that is generally caused by insufficient knowledge of the system (this is true even if you wrote every line of code yourself), and you need to run through and review code and gain additional insight into the system to come up with a new idea.
Of course, none of the above guarantees anything, but that is the approach that I have found gets results consistently.
Add some sort of logging or tracing. For example log the last X actions the user committed before causing the bug (only if you can set a condition to match bug).
It's quite common for programmers not to be able to reiterate a user-experienced crash simply because you have developed a certain workflow and habits in using the application that obviously goes around the bug.
At this frequency of 1/100, I'd say that the first thing to do is to handle exceptions and log anything anywhere or you could be spending another week hunting this bug.
Also make a priority list of potentially sensitive articulations and features in your project. For example :
1 - Multithreading
2 - Wild pointers/ loose arrays
3 - Reliance on input devices
etc.
This will help you segment areas that you can brute-force-until-break-again as suggested by other posters.
Since this is language-agnostic, I'll mention a few axioms of debugging.
Nothing a computer ever does is random. A 'random occurrence' indicates a as-yet-undiscovered pattern. Debugging begins with isolating the pattern. Vary individual elements and assess what makes a change in the behaviour of the bug.
Different user, same computer?
Same user, different computer?
Is the occurrence strongly periodic? Does rebooting change the periodicity?
FYI- I once saw a bug that was experienced by a single person. I literally mean person, not a user account. User A would never see the problem on their system, User B would sit down at that workstation, signed on as User A and could immediately reproduce the bug. There should be no conceivable way for the app to know the difference between the physical body in the chair. However-
The users used the app in different ways. User A habitually used a hotkey to to invoke a action and User B used an on-screen control. The difference in the user behaviour would cascade into a visible error a few actions later.
ANY difference that effects the behaviour of the bug should be investigated, even if it makes no sense.
There's a good chance your application is MTWIDNTBMT (Multi Threaded When It Doesn't Need To Be Multi Threaded), or maybe just multi-threaded (to be polite). A good way to reproduce sporadic errors in multi-threaded applications is to sprinkle code like this around (C#):
Random rnd = new Random();
System.Threading.Thread.Sleep(rnd.Next(2000));
and/or this:
for (int i = 0; i < 4000000000; i++)
{
// tight loop
}
to simulate threads completing their tasks at different times than usual or tying up the processor for long stretches.
I've inherited many buggy, multi-threaded apps over the years, and code like the above examples usually makes the sporadic errors occur much more frequently.
Add verbose logging. It will take multiple -- sometimes dozen(s) -- iterations to add enough logging to understand the scenario.
Now the problem is that if the problem is a race condition, which is likely if it doesn't reproduce reliably, so logging can change timing and the problem will stop happening. In this case do not log to a file, but keep a rotating buffer of the log in memory and only dump it on disk when you detect that the problem has occurred.
Edit: a little more thoughts: if this is a gui application run tests with a qa automation tool which allows you to replay macros. If this is a service-type app, try to come up with at least a guess as to what is happening and then programmatically create 'freak' usage patterns which would exercise the code that you suspect. Create higher than usual loads etc.
What development environment?
For C++, your best bet may be VMWare Workstation record/replay, see:
http://stackframe.blogspot.com/2007/04/workstation-60-and-death-of.html
Other suggestions include inspecting the stack trace, and careful code overview... there is really no silver bullet :)
Try to add code in your app to trace the bug automatically once it happens (or even alert you via mail / SMS)
log whatever you can so when it happens you can catch the right system state.
Another thing- try applying automated testing that can cover more territory than human based testing in a formed manner.. it's a long shot, but a good practice in general.
all the above, plus throw some brute force soft-robot at it that is semi random, and scater a lot of assert/verify (c/c++, probably similar in other langs) through the code
Tons of logging and careful code review are your only options.
These can be especially painful if the app is deployed and you can't adjust the logging. At that point, your only choice is going through the code with a fine-tooth comb and trying to reason about how the program could enter into the bad state (scientific method to the rescue!)
Often these kind of bugs are related to corrupted memory and for that reason they might not appear very often. You should try to run your software with some kind of memory profiler e.g., valgrind, to see if something goes wrong.
Let’s say I’m starting with a production application.
I typically add debug logging around the areas where I think the bug is occurring. I setup the logging statements to give me insight into the state of the application. Then I have the debug log level turned on and ask the user/operator(s) notify me of the time of the next bug occurrence. I then analyze the log to see what hints it gives about the state of the application and if that leads to a better understanding of what could be going wrong.
I repeat step 1 until I have a good idea of where I can start debugging the code in the debugger
Sometimes the number of iterations of the code running is key but other times it maybe the interaction of a component with an outside system (database, specific user machine, operating system, etc.). Take some time to setup a debug environment that matches the production environment as closely as possible. VM technology is a good tool for solving this problem.
Next I proceed via the debugger. This could include creating a test harness of some sort that puts the code/components in the state I’ve observed from the logs. Knowing how to setup conditional break points can save a lot of time, so get familiar with that and other features within your debugger.
Debug, debug , debug. If you’re going nowhere after a few hours, take a break and work on something unrelated for awhile. Come back with a fresh mind and perspective.
If you have gotten nowhere by now, go back to step 1 and make another iteration.
For really difficult problems you may have to resort to installing a debugger on the system where the bug is occurring. That combined with your test harness from step 4 can usually crack the really baffling issues.
Unit Tests. Testing a bug in the app is often horrendous because there is so much noise, so many variable factors. In general the bigger the (hay)stack, the harder it is to pinpoint the issue. Creatively extending your unit test framework to embrace edge cases can save hours or even days of sifting
Having said that there is no silver bullet. I feel your pain.
Add pre and post condition check in methods related to this bug.
You may have a look at Design by contract
Along with a lot of patience, a quiet prayer & cursing you would need:
a good mechanism for logging the user actions
a good mechanism for gathering the data state when the user performs some actions (state in application, database etc.)
Check the server environment (e.g. an anti-virus software running at a particular time etc.) & record the times of the error & see if you can find any trends
some more prayers & cursing...
HTH.
Assuming you're on Windows, and your "bug" is a crash or some sort of corruption in unmanaged code (C/C++), then take a look at Application Verifier from Microsoft. The tool has a number of stops that can be enabled to verify things during runtime. If you have an idea of the scenario where your bug occurs, then try to run through the scenario (or a stress version of the scenario) with AppVerifer running. Make sure to either turn on pageheap in AppVerifier, or consider compiling your code with the /RTCcsu switch (see http://msdn.microsoft.com/en-us/library/8wtf2dfz.aspx for more information).
"Heisenbugs" require great skills to diagnose, and if you want help from people here you have to describe this in much more detail, and patiently listen to various tests and checks, report result here, and iterate this till you solve it (or decide it is too expensive in terms of resources).
You will probably have to tell us your actual situation, language, DB, operative system, workload estimate, time of the day it happened in the past, and a myriad of other things, list tests you did already, how they went, and be ready to do more and share the results.
And this will not guarantee that we collectively can find it, either...
I'd suggest to write down all things that user has been doing. If you have lets say 10 such bug reports You can try to find something that connects them.
Read the stack trace carefully and try to guess what could be happened;
then try to trace\log every line of code that potentially can cause trouble.
Keep your focus on disposing resources; many sneaky sporadical bugs i found were related to close\dispose things :).
For .NET projects You can use Elmah (Error Logging Modules and Handlers) to monitor you application for un-caught exceptions, it's very simple to install and provides a very nice interface to browse unknown errors
http://code.google.com/p/elmah/
This saved me just today in catching a very random error that was occuring during a registration process
Other than that I can only recommend trying to get as much information from your users as possible and having a thorough understanding of the project workflow
They mostly come out at night....
mostly
The team that I work with has enlisted the users in recording their time they spend in our app with CamStudio when we've got a pesky bug to track down. It's easy to install and for them to use, and makes reproducing those nagging bugs much easier, since you can watch what the users are doing. It also has no relationship to the language you're working in, since it's just recording the windows desktop.
However, this route seems to be viable only if you're developing corporate apps and have good relationships with your users.
This varies (as you say), but some of the things that are handy with this can be
immediately going into the debugger when the problem occurs and dumping all the threads (or the equivalent, such as dumping the core immediately or whatever.)
running with logging turned on but otherwise entirely in release/production mode. (This is possible in some random environments like c and rails but not many others.)
do stuff to make the edge conditions on the machine worse... force low memory / high load / more threads / serving more requests
Making sure that you're actually listening to what the users encountering the problem are actually saying. Making sure that they're actually explaining the relevant details. This seems to be the one that breaks people in the field a lot. Trying to reproduce the wrong problem is boring.
Get used to reading assembly that was produced by optimizing compilers. This seems to stop people sometimes, and it isn't applicable to all languages/platforms, but it can help
Be prepared to accept that it is your (the developer's) fault. Don't get into the trap of insisting the code is perfect.
sometimes you need to actually track the problem down on the machine it is happening on.
#p.marino - not enough rep to comment =/
tl;dr - build failures due to time of day
You mentioned time of day and that caught my eye. Had a bug once were someone stayed later at work on night, tried to build and commit before they left and kept getting a failure. They eventually gave up and went home. When they caught in the next morning it built fine, they committed (probably should have been more suspiscious =] ) and the build worked for everyone. A week or two later someone stayed late and had an unexpected build failure. Turns out there was a bug in the code that made any build after 7PM break >.>
We also found a bug in one seldom used corner of the project this january that caused problems marshalling between different schemas because we were not accounting for the different calendars being 0 AND 1 month based. So if no one had messed with that part of the project we wouldn't have possibly found the bug until jan. 2011
These were easier to fix than threading issues, but still interesting I think.
hire some testers!
This has worked for really weird heisenbugs.
(I'd also recommend getting a copy of "Debugging" by Dave Argans, these ideas are partly derived form using his ideas!)
(0) Check the ram of the system using something like Memtest86!
The whole system exhibits the problem, so make a test jig that exercises the whole thing.
Say it's a server side thing with a GUI, you run the whole thing with a GUI test framework doing the necessary input to provoke the problem.
It doesn't fail 100% of the time, so you have to make it fail more often.
Start by cutting the system in half ( binary chop)
worse case, you have to remove sub-systems one at a time.
stub them out if they can't be commented out.
See if it still fails. Does it fail more often ?
Keep proper test records, and only change one variable at a time!
Worst case you use the jig and you test for weeks to get meaningful statistics. This is HARD; but remember, the jig is doing the work.
I've got No threads and only one process, and I don't talk to hardware
If the system has no threads, no communicating processes and contacts no hardware; it's tricky; heisenbugs are generally synchronization, but in the no-thread no processes case it's more likely to be uninitialized data, or data used after being released, either on the heap or the stack. Try to use a checker like valgrind.
For threaded/multi-process problems:
Try running it on a different number of CPU's. If it's running on 1, try on 4! Try forcing a 4-computer system onto 1.
It'll mostly ensure things happen one at a time.
If there are threads or communicating processes this can shake out bugs.
If this is not helping but you suspect it's synchronization or threading, try changing the OS time-slice size.
Make it as fine as your OS vendor allows!
Sometimes this has made race conditions happen almost every time!
Obversely, try going slower on the timeslices.
Then you set the test jig running with debugger(s) attached all over the place and wait for the test jig to stop on a fault.
If all else fails, put the hardware in the freezer and run it there. The timing of everything will be shifted.
Debugging is hard and time consuming especially if you are unable to deterministically reproduce the problem. My advice to you is to find out the steps to reproduce it deterministically (not just sometimes).
There has been a lot of research in the field of failure reproduction in the past years and is still very active. Record&Replay techniques have been (so far) the research direction of most researchers. This is what you need to do:
1) Analyze the source code and determine what are the sources of non-determinism in the application, that is, what are the aspects that may take your application through different execution paths (e.g. user input, OS signals)
2) Log them in the next time you execute the application
3) When your application fails again, you have the steps-to-reproduce the failure in your log.
If your log still does not reproduce the failure, then you are dealing with a concurrency bug. In that case, you should take a look at how your application accesses shared variables. Do not attempt to record the accesses to shared variables, because you would be logging too much data, thereby causing severe slowdowns and large logs. Unfortunately, there is not much I can say that would help you to reproduce concurrency bugs, because research still has a long way to go in this subject. The best I can do is to provide a reference to the most recent advance (so far) in the topic of deterministic replay of concurrency bugs:
http://www.gsd.inesc-id.pt/~nmachado/software/Symbiosis_Tutorial.html
Best regards
Use an enhanced crash reporter. In the Delphi environment, we have EurekaLog and MadExcept. Other tools exist in other environments. Or you can diagnose the core dump. You're looking for the stack trace, which will show you where it's blowing up, how it got there, what's in memory, etc.. It's also useful to have a screenshot of the app, if it's a user-interaction thing. And info about the machine that it crashed on (OS version and patch, what else is running at the time, etc..) Both of the tools that I mentioned can do this.
If it's something that happens with a few users but you can't reproduce it, and they can, go sit with them and watch. If it's not apparent, switch seats - you "drive", and they tell you what to do. You'll uncover the subtle usability issues that way. double-clicks on a single-click button, for example, initiating re-entrancy in the OnClick event. That sort of thing. If the users are remote, use WebEx, Wink, etc., to record them crashing it, so you can analyze the playback.

PL SQL - Return SQLCODE as OUT parameter is accepted?

I have a procedure that returns an OUT parameter.
procedure foo (in_v IN INTEGER, out_v OUT integer)
BEGIN
...
EXCEPTION
WHEN OTHERS THEN
--sh*t happend
out_v := SQLCODE;
END
That parameter will be 0 if everything goes OK, and <> 0 if something ugly happened.
Now, if sh*t happens along the way, an exception will be thrown.
Is it ok to assing the SQLCODE value to the OUT parameter ? Or is this consideres a code smell, and I will be expelled from the programming community ?
Thanks in advance.
If there is no additional handling of the error, I would probably advise against it. This approach just makes it necessary for each caller to examine the value of the out parameter anyway. And if the caller forgets it, a serious problem may pass unnoticed at first and create a hard to debug problem elsewhere.
If you simply don't catch OTHERS here, you ensure that the caller has to explicitly catch it, which is a lot cleaner an easier to debug.
Whether it's OK or not depends on what you're trying to achieve with it, I suppose. What problem are you trying to solve, or what requirement are you trying to meet?
the usual behaviour with errors is to gracefully handle the ones that you expect might happen during normal functioning and to allow those that you do not expect to be raised, so this does look odd.
Its ok, but I don't recommend it. By doing it this way you're forcing the calling code to do non-standard error handling. This is fine if you are the only person ever calling the code and you remember to check the error code. However, if you're coding in a large system with multiple programmers I think you should be kind to your fellow programmers and follow the standard way of exception handling supported by the language.
If you do decide to go down that route, also pass back SQLERRM as without the error text you only have the error code to go by. After years of catching "-20001" for the hundreds of application errors in 3rd party software the SQLERRM is often important.
This coding style is not a good idea for a number of reasons.
First, it is bad practice to mix errors and exceptions, even worse practice to mix return values and exceptions. Exceptions are meant to track exceptional situations - things completely unexpected by normal programming such as out of memory issues, conversion problems and so on that you normally would not want to write handlers at every call site for but need to deal with at some level.
The second worrying aspect of the coding style is that your code is effectively saying when an exception is encountered, the code will signal to its caller that something bad has happened but will happily throw away ALL exception information except for the SQLCODE.

How do you approach intermittent bugs? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 6 years ago.
Improve this question
Scenario
You've got several bug reports all showing the same problem. They're all cryptic with similar tales of how the problem occurred. You follow the steps but it doesn't reliably reproduce the problem. After some investigation and web searching, you suspect what might be going on and you are pretty sure you can fix it.
Problem
Unfortunately, without a reliable way to reproduce the original problem, you can't verify that it actually fixes the issue rather than having no effect at all or exacerbating and masking the real problem. You could just not fix it until it becomes reproducible every time, but it's a big bug and not fixing it would cause your users a lot of other problems.
Question
How do you go about verifying your change?
I think this is a very familiar scenario to anyone who has engineered software, so I'm sure there are a plethora of approaches and best practices to tackling bugs like this. We are currently looking at one of these problems on our project where I have spent some time determining the issue but have been unable to confirm my suspicions. A colleague is soak-testing my fix in the hopes that "a day of running without a crash" equates to "it's fixed". However, I'd prefer a more reliable approach and I figured there's a wealth of experience here on SO.
Bugs that are hard to reproduce are the hardest one to solve. What you need to make sure that you have found the root of the problem, even if the problem itself cannot be reproduced successfully.
The most common intermittent bugs are caused by race-conditions - by eliminating the race, or ensuring that one side always wins you have eliminated the root of the problem even if you can't successfully confirm it by testing the results. The only thing you can test is that the cause does need repeat itself.
Sometimes fixing what is seen as the root indeed solves a problem but not the right one - there is no avoiding it. The best way to avoid intermittent bugs is be careful and methodical with the system design and architecture.
You'll never be able to verify the fix without identifying the root cause and coming up with a reliable way to reproduce the bug.
For identifying the root cause: If your platform allows it, hook some post-mortem debugging into the problem.
For example, on Windows, get your code to create a minidump file (core dump on Unix) when it encounters this problem. You can then get the customer (or WinQual, on Windows) to send you this file. This should give you more information about how your code's gone wrong on the production system.
But without that, you'll still need to come up with a reliable way to reproduce the bug. Otherwise you'll never be able to verify that it's fixed.
Even with all of this information, you might end up fixing a bug that looks like, but isn't, the one that the customer is seeing.
Instrument the build with more extensive (possibly optional) logging and data saving that allows exact reproduction of the variable UI steps the users took before the crash occurred.
If that data does not reliably allow you to reproduce the issue then you've narrowed the class of bug. Time to look at sources of random behaviour, such as variations in system configuration, pointer comparisons, uninitialized data, etc.
Sometimes you "know" (or rather feel) that you can fix the issue without extensive testing or unit testing scaffolding, because you truly understand the issue. However, if you don't, it very often boils down to something like "we ran it 100 times and the error no longer occurred, so we'll consider it fixed until the next time it's reported.".
I use what i call "heavy style defensive programming" : add asserts in all the modules that seems linked by the problem. What i mean is, add A LOT of asserts, asserts evidences, assert state of objects in all their memebers, assert "environnement" state, etc.
Asserts help you identify the code that is NOT linked to the problem.
Most of the time i find the origin of the problem just by writing the assertions as it forces you to reread all the code and plundge under the guts of the application to understand it.
There is no one answer to this problem. Sometimes the solution you've found helps you figure out the scenario to reproduce the problem, in which case you can test that scenario before and after the fix. Sometimes, though, that solution you've found only fixes one of the problems but not all of them, or like you say masks a deeper problem. I wish I could say "do this, it works every time", but there isn't a "this" that fits that scenario.
You say in a comment that you think it is a race condition. If you think you know what "feature" of the code is generating the condition, you can write a test to try to force it.
Here is some risky code in c:
const int NITER = 1000;
int thread_unsafe_count = 0;
int thread_unsafe_tracker = 0;
void* thread_unsafe_plus(void *a){
int i, local;
thread_unsafe_tracker++;
for (i=0; i<NITER; i++){
local = thread_unsafe_count;
local++;
thread_unsafe_count+=local;
};
}
void* thread_unsafe_minus(void *a){
int i, local;
thread_unsafe_tracker--;
for (i=0; i<NITER; i++){
local = thread_unsafe_count;
local--;
thread_unsafe_count+=local;
};
}
which I can test (in a pthreads enironment) with:
pthread_t th1, th2;
pthread_create(&th1,NULL,&thread_unsafe_plus,NULL);
pthread_create(&th2,NULL,&thread_unsafe_minus,NULL);
pthread_join(th1,NULL);
pthread_join(th2,NULL);
if (thread_unsafe_count != 0) {
printf("Ah ha!\n");
}
In real life, you'll probably have to wrap your suspect code in some way to help the race hit more ofter.
If it works, adjust the number of threads and other parameters to make it hit most of the time, and now you have a chance.
First you need to get stack traces from your clients, that way you can actually do some forensics.
Next do fuzz tests with random input, and keep these tests running for long stretches, they're great at finding those irrational border cases, that human programmers and testers can find through use cases and understanding of the code.
In this situation, where nothing else works, I introduce additional logging.
I also add in email notifications that show me the state of the application when it breaks down.
Sometimes I add in performance counters... I put that data in a table and look at trends.
Even if nothing shows up, you are narrowing things down. One way or another, you will end up with useful theories.
These are horrible and almost always resistant to the 'fixes' the engineer thinks he is putting in, as they have a habit of coming back to bite months later. Be wary of any fixes made to intermittent bugs. Be prepared for a bit of grunt work and intensive logging as this sounds more of a testing problem than a development problem.
My own problem when overcoming bugs like these was that I was often too close to the problem, not standing back and looking at the bigger picture. Try and get someone else to look at how you approach the problem.
Specifically my bug was to do with the setting of timeouts and various other magic numbers that in retrospect where borderline and so worked almost all of the time. The trick in my own case was to do a lot of experimentation with settings that I could find out which values would 'break' the software.
Do the failures happen during specific time periods? If so, where and when? Is it only certain people that seem to reproduce the bug? What set of inputs seem to invite the problem? What part of the application does it fail on? Does the bug seem more or less intermittent out in the field?
When I was a software tester my main tools where a pen and paper to record notes of my previous actions - remember a lot of seemingly insignificant details is vital. By observing and collecting little bits of data all the time the bug will appear to become less intermittent.
For a difficult-to-reproduce error, the first step is usually documentation. In the area of the code that is failing, modify the code to be hyper-explicit: One command per line; heavy, differentiated exception handling; verbose, even prolix debug output. That way, even if you can't reproduce or fix the error, you can gain far more information about the cause the next time the failure is seen.
The second step is usually assertion of assumptions and bounds checking. Everything you think you know about the code in question, write .Asserts and checks. Specifically, check objects for nullity and (if your language is dynamic) existence.
Third, check your unit test coverage. Do your unit tests actually cover every fork in execution? If you don't have unit tests, this is probably a good place to start.
The problem with unreproducible errors is that they're only unreproducible to the developer. If your end users insist on reproducing them, it's a valuable tool to leverage the crash in the field.
I've run into bugs on systems that seem to consistently cause errors, but when stepping through the code in a debugger the problem mysteriously disappears. In all of these cases the issue was one of timing.
When the system was running normally there was some sort of conflict for resources or taking the next step before the last one finished. When I stepped through it in the debugger, things were moving slowly enough that the problem disappeared.
Once I figured out it was a timing issue it was easy to find a fix. I'm not sure if this is applicable in your situation, but whenever bugs disappear in the debugger timing issues are my first suspects.
Once you fully understand the bug (and that's a big "once"), you should be able to reproduce it at will. When the reproduction code (automated test) is written, you fix the bug.
How to get to the point where you understand the bug?
Instrument the code (log like crazy). Work with your QA - they are good at re-creating the problem, and you need to arrange to have full dev toolkit available to you on their machines. Use automated tools for uninitialized memory/resources. Just plain stare at the code. No easy solution there.
Those types of bugs are very frustrating. Extrapolate it out to different machines with different types of custom hardware that might be in them (like at my company), and boy oh boy does it become a nightmare. I currently have several bugs like this at the moment at my job.
My rule of thumb: I don't fix it unless I can reproduce it myself or I'm presented with a log that clearly shows something wrong. Otherwise I cannot verify my change, nor can I verify that my change has not broken anything else. Of course, it's just a rule of thumb - I do make exceptions.
I think you're quite right to be concerned with your colleuge's approach.
These problems have always been caused by:
Memory Problems
Threading Problems
To solve the problem, you should:
Instrument your code (Add log statements)
Code Review threading
Code Review memory allocation / dereferencing
The code reviews will most likely only happen if it is a priority, or if you have a strong suspicion about which code is shared by the multiple bug reports. If it's a threading issue, then check your thread safety - make sure variables accessable by both threads are protected. If it's a memory issue, then check your allocations and dereferences and especially be suspicious of code that allocates and returns memory, or code that uses memory allocation by someone else who may be releasing it.
Some questions you could ask yourself:
When did this piece of code last work without problem.
What has been done since it stopped working.
If the code has never worked the approach would be different naturally.
At least when many users change a lot of code all the time this is a very common scenario.
Specific scenario
While I don't want to concentrate on only the issue I am having, here are some details of the current issue we face and how I've tackled it so far.
The issue occurs when the user interacts with the user interface (a TabControl to be exact) at a particular phase of a process. It doesn't always occur and I believe this is because the window of time for the problem to be exhibited is small. My suspicion is that the initialization of a UserControl (we're in .NET, using C#) coincides with a state change event from another area of the application, which leads to a font being disposed. Meanwhile, another control (a Label) tries to draw its string with that font, and hence the crash.
However, actually confirming what leads to the font being disposed has proved difficult. The current fix has been to clone the font so that the drawing label still has a valid font, but this really masks the root problem which is the font being disposed in the first place. Obviously, I'd like to track down the full sequence, but that is proving very difficult and time is short.
Approach
My approach was first to look at the stack trace from our crash reports and examine the Microsoft code using Reflector. Unfortunately, this led to a GDI+ call with little documentation, which only returns a number for the error - .NET turns this into a pretty useless message indicating something is invalid. Great.
From there, I went to look at what call in our code leads to this problem. The stack starts with a message loop, not in our code, but I found a call to Update() in the general area under suspicion and, using instrumentation (traces, etc), we were able to confirm to about 75% certainty that this was the source of the paint message. However, it wasn't the source of the bug - asking the label to paint is no crime.
From there, I looked at each aspect of the paint call that was crashing (DrawString) to see what could be invalid and started to rule each one out until it fell on the disposable items. I then determined which ones we had control over and the font was the only one. So, I took a look at how we handled the font and under what circumstances we disposed it to identify any potential root causes. I was able to come up with a plausible sequence of events that fit the reports from users, and therefore able to code a low risk fix.
Of course, it crossed my mind that the bug was in the framework, but I like to assume we screwed up before passing the blame to Microsoft.
Conclusion
So, that's how I approached one particular example of this kind of problem. As you can see, it's less than ideal, but fits with what many have said.
Unless there are major time constraints, I don't start testing changes until I can reliably reproduce the problem.
If you really had to, I suppose you could write a test case that appears to sometimes trigger the problem, and add it to your automated test suite (you do have an automated test suite, right?), and then make your change and hope that test case never fails again, knowing that if you didn't really fix anything at least you now have more chance of catching it. But by the time you can write a test case, you almost always have things reduced down to the point where you're no longer dealing with such an (apparently) non-deterministic situation.
Simply: ask the user who reported it.
I just use one of the reporters as a verification system.
Usually the person who was willing to report a bug is more than happy to help you to solve her problem [1].
Just give her your version with a possible fix and ask if the problem is gone.
In cases where the bug is a regression, the same method can be used to bisect where the problem occurred by giving the user with the problem multiple versions to test.
In other cases the user can also help you to debug the problem by giving her a version with more debugging capabilities.
This will limit any negative effects from a possible fix to that person instead of guessing that something will fix the bug and then later on realising that you've just released a "bug fix" that has no effect or in worst case a negative effect for the system stability.
You can also limit the possible negative effects of the "bug fix" by giving the new version to a limited number of users (for example to all of the ones that reported the problem) and releasing the fix only after that.
Also ones she can confirm that the fix you've made works, it is easy to add tests that ensures that your fix will stay in the code (at least on unit test level, if the bug is hard to reproduce on more higher system level).
Of course this requires that whatever you are working on supports this kind of approach. But if it doesn't I would really do whatever I can to enable it - end users are more satisfied and many of the hardest tech problems just go away and priorities come clear when development can directly interact with the system end users.
[1] If you have ever reported a bug, you most likely know that many times the response from the development/maintenance team is somehow negative from the end users point of view or there will be no response at all - especially in situations where the bug can not be reproduced by the development team.

Error Message Text - Best Practices

We are changing some of the text for our old, badly written error messages. What are some resources for best practices on writing good error messages (specifically for Windows XP/Vista).
In terms of wording your error messages, I recommend referring to the following style guides for Windows applications:
Windows user experience guidelines, and specifically the section on error messages here.
Microsoft Manual of Style
The ultimate best practice is to prevent the user from causing errors in the first place.
Don't tell users anything they don't care about; error code 5064 doesn't mean a thing to anyone. Don't tell them they did something wrong; disallow it in the first place. Don't blame them, especially not for mistakes your software made. Above all, when there is a problem, tell them how to fix it so they can move on and get some work done.
A good error message should:
Be unobtrusive (no blue-screen or yellow-screen of death)
Give the user direction to correct the problem (on their own if possible, or who to contact for help)
Hide useless, esoteric programmer nonsense ( don't say, "a null reference exception occurred on line 45")
Be descriptive without being verbose. Just enough information to tell the user what they need to know and nothing more.
One thing I've started to do is to generate a unique number that I display in the error message and write to the log file so I can find the error in the log when the user sends me a screenshot or calls and says, "I got an error. It says my reference number is 0988-7634"
For security reasons, don't provide internal system information that the user does not need.
Trivial example: when failing to login, don't tell the user if the username is wrong or the password is wrong; this will only help the attacker to brute force the system. Instead, just say "Username/Password combination is invalid" or something like that.
Always include suggestions to Remedy the error.
Try to figure out a way to write your software so it corrects the problem for them.
For any user input (strings, filenames, values, etc), always display the erroneous value with delimiters around it (quotes, brackets, etc). e.g.
The filename you entered could not be found: "somefile.txt"
This helps to show any whitespace/carriage returns that may have sneaked in and greatly reduces troubleshooting and frustration.
Avoid identical error messages coming from different places; parametrize with file:line if possible, or use other context that lets you, the developer, uniquely identify where the error occurred.
Design the mechanism to allow easy localization, especially if it is a commercial product.
If the error messages are user-visible, make them complete, meaningful sentences that don't assume intimate knowledge of the code; remember, you're always too close to the problem -- the user is not. If possible, give the user guidance on how to proceed, who to contact, etc.
Every error should have a message if possible; if not, then try and make sure that all error-unwind paths eventually reach an error message that sheds light on what happened.
I'm sure there will be other good answers here...
Shorter messages may actually be read.
The longer your error message, the less the user will read. That being said, try to refactor the code so you can eliminate exceptions if there is an obvious response. Try to only have exceptions that happen based on things beyond your user or your code's control.
The best exception message is the one you never have to display.
Error handling is always better than error reporting, but since you are retrofitting the error messages and not necessarily the code here's a couple of suggestions:
Users want solutions, not problems. Help them know what to do after an error, even if the message is as simple as "Please close the current window and retry your action."
I am also a big fan of centralized logging of errors. Make sure the log is both human and computer scanable. Users don't always let you know what problems they are having, especially if they can be 'worked around', so the log can help you know what things need fixed.
If you can control the error dialog easily, having a dialog which shows a nice, readable message with a 'details' button to show the error number, trace, etc. can be a big help for real-time problem solving as well.
Support for multilanguage applies for all kinds of messages, but tends to be forgotten in the case of error messages.
I would second not telling the user useless esoteric information like numeric error codes. I would follow that up however by saying to definitely log that information for troubleshooting by more technically savvy people.

Resources