Antivirus and file access conflict : good programming practices?

Antivirus and file access conflict : good programming practices? - windows

Sometimes, we experiment "access denied" errors due to the antivirus which handles the file at the same time our program wants to write/rename/copy it.
This happens rarely but makes me upset because I don't find the good way to deal with: technically our response is to change our source code to implement kind of retry mechanism... but we are not satisfied.. . that smells a little bit... we can't afford telling our customers "please turn off your antivirus, let our software work properly"...
So if your have already experimented such issues, please let me know how you dealt with.
Thanks!

There is really very little scope for saying "turn avs off". That just won't fly in a lot of offices so we've done exactly what you've said: build a retry-queue.
Files that are locked are added to the queue. When the original operation ends, we pause for 1 second and sequentially pop through the queue. Files that fail the second time are added to a second queue and after the first completes, we wait 3 seconds and pop through the second.
Files that fail the second queue (the third attempt) are reported.

Related

Google Colab - Completed but still running?

I have a large dataset (that I am trying to run. The cell has not generated an output; however, it currently says 'completed at [time]'. The cell still appears to be running and the is a message saying 'waiting for python 3 to Compute Engine Backend.'
does anyone know whether the the cell timed out? should I rerun, or should I leave it as is?

Some of my runs have also encountered exactly the same conditions as yours. Then, after acknowledging some sort of implementation that enforces interactive use from Google (reCAPTCHA - which was added earlier this year, pops up randomly during your runs, if you fail to click it, your runtime will be terminated within minutes), I come up with the following theory.
My guesses are that, when you are unable to click the captchas (lucky case where it doesn't get terminated), or you left the VM running for too long without any interactions with it (the latter has much higher chances), Colab may decide to silently terminate your runtime shortly. If no additional interactions are seen within some time later, it will perform the full termination (your code and results will be gone).
I have observed some of the times where I left it running and switched to another tab, when I switched back, it performed a reconnection and then showed the "completed at [time]" but the code was still running. Other times I observed no reconnection but the postconditions were the same as yours. The training results and other data on VM were left intact regardless. So, keep running it will cause no problems in my experience, no reruns are needed, although you should keep an eye for your VM when the condition occurs to avoid any possible problems.

What causes CreateDirectory to return ERROR_ACCESS_DENIED?

In another question, we established that yes, CreateDirectory occasionally fails with the undocumented GetLastError value of ERROR_ACCESS_DENIED, and that the right way to handle the situation is probably to try again a few times. It's easy to implement such an algorithm, but it's not so easy to test it when you don't know how to reproduce it.
I don't need theories for why this happens. It might be a bug in Windows, yeah. It might also be by design. Ultimately, at this point, it doesn't matter, because Microsoft shipped the behavior, and I must cope.
I also don't need an explanation of multi-tasking operating system theory and how Windows implements it in general. I write system software for a living. I understand little else.
What I need now is a reliable way to reproduce the failure so I can write a test case for the code which copes. Here's what I've tried so far:
I wrote test program P1, which slowly and repeatedly enumerates the contents of the would-be parent. As well, I wrote test program P2 which does nothing but repeatedly attempt to delete and create a directory in the would-be parent. I figured keeping an enumeration open a long time might make the problem more likely. Running P2 by itself produces an occasional period of failures (on the order of every several minutes for approximately 10 milliseconds). Running P1 and P2 at the same time does not seem to make the failures any more frequent or long.
I ran two instances of P2 at the same time, and that does not seem to make the failures any more frequent or long.
I modified P2 so that it can create files in addition to directories, and running that at the same time as P1 does not seem to make the failures any more frequent or long.
I ran P1 and multiple instances of P2 with different parameters all at the same time, and that does not seem to make the failures any more frequent or long.
I wrote test program P3 which moves items into and out of the would-be parent and ran P3 at the same time as P2, and that does not seem to make the failures any more frequent or long.
Any other ideas?

Let me start by double-checking that I understand the question. If you run something like the below snippet, you expect it to fail eventually, right?
while (true)
{
System.IO.Directory.CreateDirectory( ".\\FooDir" );
System.IO.Directory.Delete( ".\\FooDir" );
}
If your application is the only thing running on the system that has a handle open to that file, then this feels like a bug. So knowing the OS version would help.
On the other hand, if there is something else in the system that is keeping the handle open for just a little while, then whether this is a bug or not becomes a little more fuzzy. The number of things that try to blindly grok files and directories might surprise you. A naive indexer, for example, might be walking into that directory, enumerating it, looking for files to index and so on -- and if you collide with him, blammo. A similarly naive anti-virus filter, or some other file system filter, might be poking it as well (in this case, it still feels like a bug).
There are little things we've done in the OS to try and give services like these ways to get out of your way. Does it repro if you turn the indexer off, if you turn off any anti-virus, any anti-malware? We can go from there, and hopefully we will find that newer bits have it fixed already (that statement had a lot of assumptions in it, I know).
One other relatively interesting piece of trivia is that ERROR_ACCESS_DENIED is a Win32 error that is mapped from more than one underlying status in the system (see this article for example). So if we can dig a little deeper, we may be able to find out what the file system is trying to tell the app (if it's more than access denied).
We might end up getting into a conversation about whether you can, in the wild, assume that your app is the only thing poking at your files and directories. You can probably guess where that one will go.

I would wager a guess that your enumeration / deletion / creation is causing some synchronization problems with the handles. If CreateDirectory is anything like CreateFile (and I would assume the logic behind it would be shared), then you would see similar behaviour to CreateFile:
If you call CreateFile on a file that
is pending deletion as a result of a
previous call to DeleteFile, the
function fails. The operating system
delays file deletion until all handles
to the file are closed. GetLastError
returns ERROR_ACCESS_DENIED.

How do you reproduce bugs that occur sporadically?

We have a bug in our application that does not occur every time and therefore we don't know its "logic". I don't even get it reproduced in 100 times today.
Disclaimer: This bug exists and I've seen it. It's not a pebkac or something similar.
What are common hints to reproduce this kind of bug?

Analyze the problem in a pair and pair-read the code. Make notes of the problems you KNOW to be true and try to assert which logical preconditions must hold true for this happen. Follow the evidence like a CSI.
Most people instinctively say "add more logging", and this may be a solution. But for a lot of problems this just makes things worse, since logging can change timing-dependencies sufficiently to make the problem more or less frequent. Changing the frequency from 1 in 1000 to 1 in 1,000,000 will not bring you closer to the true source of the problem.
So if your logical reasoning does not solve the problem, it'll probably give you a few specifics you could investigate with logging or assertions in your code.

There is no general good answer to the question, but here is what I have found:
It takes a talent for this kind of thing. Not all developers are best suited for it, even if they are superstars in other areas. So know your team, who has a talent for it, and hope you can give them enough candy to get them excited about helping you out, even if it isn't their area.
Work backwards, and treat it like a scientific investigation. Start with the bug, what you see is wrong. Develop hypotheses about what could cause it (this is the creative/imaginative part, the art that not everyone has the talent for) - and it helps a lot to know how the code works. For each of those hypotheses (preferably sorted by what you think is most likely - again pure gut feel here), develop a test that tries to eliminate it as the cause, and test the hypothesis. Any given failure to meet a prediction doesn't mean the hypothesis is wrong. Test the hypothesis until it is confirmed to be wrong (although as it gets less likely you may want to move on to another hypothesis first, just don't discount this one until you have a definitive failure).
Gather as much data as you can during this process. Extensive logging and whatever else is applicable. Do not discount a hypothesis because you lack the data, rather remedy the lack of data. Quite often the inspiration for the right hypothesis comes from examining the data. Noticing something off in a stack trace, weird issue in a log, something missing that should be there in a database, etc.
Double check every assumption. So many times I have seen an issue not get fixed quickly because some general method call was not further investigated, so the problem was just assumed to be not applicable. "Oh that, that should be simple." (See point 1).
If you run out of hypotheses, that is generally caused by insufficient knowledge of the system (this is true even if you wrote every line of code yourself), and you need to run through and review code and gain additional insight into the system to come up with a new idea.
Of course, none of the above guarantees anything, but that is the approach that I have found gets results consistently.

Add some sort of logging or tracing. For example log the last X actions the user committed before causing the bug (only if you can set a condition to match bug).

It's quite common for programmers not to be able to reiterate a user-experienced crash simply because you have developed a certain workflow and habits in using the application that obviously goes around the bug.
At this frequency of 1/100, I'd say that the first thing to do is to handle exceptions and log anything anywhere or you could be spending another week hunting this bug.
Also make a priority list of potentially sensitive articulations and features in your project. For example :
1 - Multithreading
2 - Wild pointers/ loose arrays
3 - Reliance on input devices
etc.
This will help you segment areas that you can brute-force-until-break-again as suggested by other posters.

Since this is language-agnostic, I'll mention a few axioms of debugging.
Nothing a computer ever does is random. A 'random occurrence' indicates a as-yet-undiscovered pattern. Debugging begins with isolating the pattern. Vary individual elements and assess what makes a change in the behaviour of the bug.
Different user, same computer?
Same user, different computer?
Is the occurrence strongly periodic? Does rebooting change the periodicity?
FYI- I once saw a bug that was experienced by a single person. I literally mean person, not a user account. User A would never see the problem on their system, User B would sit down at that workstation, signed on as User A and could immediately reproduce the bug. There should be no conceivable way for the app to know the difference between the physical body in the chair. However-
The users used the app in different ways. User A habitually used a hotkey to to invoke a action and User B used an on-screen control. The difference in the user behaviour would cascade into a visible error a few actions later.
ANY difference that effects the behaviour of the bug should be investigated, even if it makes no sense.

There's a good chance your application is MTWIDNTBMT (Multi Threaded When It Doesn't Need To Be Multi Threaded), or maybe just multi-threaded (to be polite). A good way to reproduce sporadic errors in multi-threaded applications is to sprinkle code like this around (C#):
Random rnd = new Random();
System.Threading.Thread.Sleep(rnd.Next(2000));
and/or this:
for (int i = 0; i < 4000000000; i++)
{
// tight loop
}
to simulate threads completing their tasks at different times than usual or tying up the processor for long stretches.
I've inherited many buggy, multi-threaded apps over the years, and code like the above examples usually makes the sporadic errors occur much more frequently.

Add verbose logging. It will take multiple -- sometimes dozen(s) -- iterations to add enough logging to understand the scenario.
Now the problem is that if the problem is a race condition, which is likely if it doesn't reproduce reliably, so logging can change timing and the problem will stop happening. In this case do not log to a file, but keep a rotating buffer of the log in memory and only dump it on disk when you detect that the problem has occurred.
Edit: a little more thoughts: if this is a gui application run tests with a qa automation tool which allows you to replay macros. If this is a service-type app, try to come up with at least a guess as to what is happening and then programmatically create 'freak' usage patterns which would exercise the code that you suspect. Create higher than usual loads etc.

What development environment?
For C++, your best bet may be VMWare Workstation record/replay, see:
http://stackframe.blogspot.com/2007/04/workstation-60-and-death-of.html
Other suggestions include inspecting the stack trace, and careful code overview... there is really no silver bullet :)

Try to add code in your app to trace the bug automatically once it happens (or even alert you via mail / SMS)
log whatever you can so when it happens you can catch the right system state.
Another thing- try applying automated testing that can cover more territory than human based testing in a formed manner.. it's a long shot, but a good practice in general.

all the above, plus throw some brute force soft-robot at it that is semi random, and scater a lot of assert/verify (c/c++, probably similar in other langs) through the code

Tons of logging and careful code review are your only options.
These can be especially painful if the app is deployed and you can't adjust the logging. At that point, your only choice is going through the code with a fine-tooth comb and trying to reason about how the program could enter into the bad state (scientific method to the rescue!)

Often these kind of bugs are related to corrupted memory and for that reason they might not appear very often. You should try to run your software with some kind of memory profiler e.g., valgrind, to see if something goes wrong.

Let’s say I’m starting with a production application.
I typically add debug logging around the areas where I think the bug is occurring. I setup the logging statements to give me insight into the state of the application. Then I have the debug log level turned on and ask the user/operator(s) notify me of the time of the next bug occurrence. I then analyze the log to see what hints it gives about the state of the application and if that leads to a better understanding of what could be going wrong.
I repeat step 1 until I have a good idea of where I can start debugging the code in the debugger
Sometimes the number of iterations of the code running is key but other times it maybe the interaction of a component with an outside system (database, specific user machine, operating system, etc.). Take some time to setup a debug environment that matches the production environment as closely as possible. VM technology is a good tool for solving this problem.
Next I proceed via the debugger. This could include creating a test harness of some sort that puts the code/components in the state I’ve observed from the logs. Knowing how to setup conditional break points can save a lot of time, so get familiar with that and other features within your debugger.
Debug, debug , debug. If you’re going nowhere after a few hours, take a break and work on something unrelated for awhile. Come back with a fresh mind and perspective.
If you have gotten nowhere by now, go back to step 1 and make another iteration.
For really difficult problems you may have to resort to installing a debugger on the system where the bug is occurring. That combined with your test harness from step 4 can usually crack the really baffling issues.

Unit Tests. Testing a bug in the app is often horrendous because there is so much noise, so many variable factors. In general the bigger the (hay)stack, the harder it is to pinpoint the issue. Creatively extending your unit test framework to embrace edge cases can save hours or even days of sifting
Having said that there is no silver bullet. I feel your pain.

Add pre and post condition check in methods related to this bug.
You may have a look at Design by contract

Along with a lot of patience, a quiet prayer & cursing you would need:
a good mechanism for logging the user actions
a good mechanism for gathering the data state when the user performs some actions (state in application, database etc.)
Check the server environment (e.g. an anti-virus software running at a particular time etc.) & record the times of the error & see if you can find any trends
some more prayers & cursing...
HTH.

Assuming you're on Windows, and your "bug" is a crash or some sort of corruption in unmanaged code (C/C++), then take a look at Application Verifier from Microsoft. The tool has a number of stops that can be enabled to verify things during runtime. If you have an idea of the scenario where your bug occurs, then try to run through the scenario (or a stress version of the scenario) with AppVerifer running. Make sure to either turn on pageheap in AppVerifier, or consider compiling your code with the /RTCcsu switch (see http://msdn.microsoft.com/en-us/library/8wtf2dfz.aspx for more information).

"Heisenbugs" require great skills to diagnose, and if you want help from people here you have to describe this in much more detail, and patiently listen to various tests and checks, report result here, and iterate this till you solve it (or decide it is too expensive in terms of resources).
You will probably have to tell us your actual situation, language, DB, operative system, workload estimate, time of the day it happened in the past, and a myriad of other things, list tests you did already, how they went, and be ready to do more and share the results.
And this will not guarantee that we collectively can find it, either...

I'd suggest to write down all things that user has been doing. If you have lets say 10 such bug reports You can try to find something that connects them.

Read the stack trace carefully and try to guess what could be happened;
then try to trace\log every line of code that potentially can cause trouble.
Keep your focus on disposing resources; many sneaky sporadical bugs i found were related to close\dispose things :).

For .NET projects You can use Elmah (Error Logging Modules and Handlers) to monitor you application for un-caught exceptions, it's very simple to install and provides a very nice interface to browse unknown errors
http://code.google.com/p/elmah/
This saved me just today in catching a very random error that was occuring during a registration process
Other than that I can only recommend trying to get as much information from your users as possible and having a thorough understanding of the project workflow
They mostly come out at night....
mostly

The team that I work with has enlisted the users in recording their time they spend in our app with CamStudio when we've got a pesky bug to track down. It's easy to install and for them to use, and makes reproducing those nagging bugs much easier, since you can watch what the users are doing. It also has no relationship to the language you're working in, since it's just recording the windows desktop.
However, this route seems to be viable only if you're developing corporate apps and have good relationships with your users.

This varies (as you say), but some of the things that are handy with this can be
immediately going into the debugger when the problem occurs and dumping all the threads (or the equivalent, such as dumping the core immediately or whatever.)
running with logging turned on but otherwise entirely in release/production mode. (This is possible in some random environments like c and rails but not many others.)
do stuff to make the edge conditions on the machine worse... force low memory / high load / more threads / serving more requests
Making sure that you're actually listening to what the users encountering the problem are actually saying. Making sure that they're actually explaining the relevant details. This seems to be the one that breaks people in the field a lot. Trying to reproduce the wrong problem is boring.
Get used to reading assembly that was produced by optimizing compilers. This seems to stop people sometimes, and it isn't applicable to all languages/platforms, but it can help
Be prepared to accept that it is your (the developer's) fault. Don't get into the trap of insisting the code is perfect.
sometimes you need to actually track the problem down on the machine it is happening on.

#p.marino - not enough rep to comment =/
tl;dr - build failures due to time of day
You mentioned time of day and that caught my eye. Had a bug once were someone stayed later at work on night, tried to build and commit before they left and kept getting a failure. They eventually gave up and went home. When they caught in the next morning it built fine, they committed (probably should have been more suspiscious =] ) and the build worked for everyone. A week or two later someone stayed late and had an unexpected build failure. Turns out there was a bug in the code that made any build after 7PM break >.>
We also found a bug in one seldom used corner of the project this january that caused problems marshalling between different schemas because we were not accounting for the different calendars being 0 AND 1 month based. So if no one had messed with that part of the project we wouldn't have possibly found the bug until jan. 2011
These were easier to fix than threading issues, but still interesting I think.

hire some testers!

This has worked for really weird heisenbugs.
(I'd also recommend getting a copy of "Debugging" by Dave Argans, these ideas are partly derived form using his ideas!)
(0) Check the ram of the system using something like Memtest86!
The whole system exhibits the problem, so make a test jig that exercises the whole thing.
Say it's a server side thing with a GUI, you run the whole thing with a GUI test framework doing the necessary input to provoke the problem.
It doesn't fail 100% of the time, so you have to make it fail more often.
Start by cutting the system in half ( binary chop)
worse case, you have to remove sub-systems one at a time.
stub them out if they can't be commented out.
See if it still fails. Does it fail more often ?
Keep proper test records, and only change one variable at a time!
Worst case you use the jig and you test for weeks to get meaningful statistics. This is HARD; but remember, the jig is doing the work.
I've got No threads and only one process, and I don't talk to hardware
If the system has no threads, no communicating processes and contacts no hardware; it's tricky; heisenbugs are generally synchronization, but in the no-thread no processes case it's more likely to be uninitialized data, or data used after being released, either on the heap or the stack. Try to use a checker like valgrind.
For threaded/multi-process problems:
Try running it on a different number of CPU's. If it's running on 1, try on 4! Try forcing a 4-computer system onto 1.
It'll mostly ensure things happen one at a time.
If there are threads or communicating processes this can shake out bugs.
If this is not helping but you suspect it's synchronization or threading, try changing the OS time-slice size.
Make it as fine as your OS vendor allows!
Sometimes this has made race conditions happen almost every time!
Obversely, try going slower on the timeslices.
Then you set the test jig running with debugger(s) attached all over the place and wait for the test jig to stop on a fault.
If all else fails, put the hardware in the freezer and run it there. The timing of everything will be shifted.

Debugging is hard and time consuming especially if you are unable to deterministically reproduce the problem. My advice to you is to find out the steps to reproduce it deterministically (not just sometimes).
There has been a lot of research in the field of failure reproduction in the past years and is still very active. Record&Replay techniques have been (so far) the research direction of most researchers. This is what you need to do:
1) Analyze the source code and determine what are the sources of non-determinism in the application, that is, what are the aspects that may take your application through different execution paths (e.g. user input, OS signals)
2) Log them in the next time you execute the application
3) When your application fails again, you have the steps-to-reproduce the failure in your log.
If your log still does not reproduce the failure, then you are dealing with a concurrency bug. In that case, you should take a look at how your application accesses shared variables. Do not attempt to record the accesses to shared variables, because you would be logging too much data, thereby causing severe slowdowns and large logs. Unfortunately, there is not much I can say that would help you to reproduce concurrency bugs, because research still has a long way to go in this subject. The best I can do is to provide a reference to the most recent advance (so far) in the topic of deterministic replay of concurrency bugs:
http://www.gsd.inesc-id.pt/~nmachado/software/Symbiosis_Tutorial.html
Best regards

Use an enhanced crash reporter. In the Delphi environment, we have EurekaLog and MadExcept. Other tools exist in other environments. Or you can diagnose the core dump. You're looking for the stack trace, which will show you where it's blowing up, how it got there, what's in memory, etc.. It's also useful to have a screenshot of the app, if it's a user-interaction thing. And info about the machine that it crashed on (OS version and patch, what else is running at the time, etc..) Both of the tools that I mentioned can do this.
If it's something that happens with a few users but you can't reproduce it, and they can, go sit with them and watch. If it's not apparent, switch seats - you "drive", and they tell you what to do. You'll uncover the subtle usability issues that way. double-clicks on a single-click button, for example, initiating re-entrancy in the OnClick event. That sort of thing. If the users are remote, use WebEx, Wink, etc., to record them crashing it, so you can analyze the playback.

How to get users to read error messages? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 1 year ago.
Improve this question
If you program for a nontechnical audience, you find yourself at a high risk that users will not read your carefully worded and enlightening error messages, but just click on the first button available with a shrug of frustration.
So, I'm wondering what good practices you can recommend to help users actually read your error message, instead of simply waiving it aside. Ideas I can think of would fall along the lines of:
Formatting of course help; maybe a simple, short message, with a "learn more" button that leads to the longer, more detailed error message
Have all error messages link to some section of the user guide (somewhat difficult to achieve)
Just don't issue error messages, simply refuse to perform the task (a somewhat "Apple" way of handling user input)
Edit: the audience I have in mind is a rather broad user base that doesn't use the software too often and is not captive (i.e., no in-house software or narrow community). A more generic form of this question was asked on slashdot, so you may want to check there for some of the answers.

That is an excellent question worthy of a +1 from me. The question despite being simple, covers many aspects of the nature of end-users. It boils down to a number of factors here which would benefit you and the software itself, and of course for the end-users.
Do not place error messages in the status bar - they will never read them despite having it jazzed up with colours etc....they will always miss them! No matter how hard you'll try... At one stage during the Win 95 UI testing before it was launched, MS carried out an experiment to read the UI (ed - it should be noted that the message explicitly stated in the context of 'Look under the chair'), with a $100 dollar bill taped to the underside of the chair that the subjects were sitting on...no one spotted the message in the status bar!
Make the messages short, do not use intimidating words such as 'Alert: the system encountered a problem', the end-user is going to hit the panic button and will over-react...
No matter how hard you try, do not use colours to identify the message...psychologically, it's akin to waving a red-flag to the bull!
Use neutral sounding words to convey minimal reaction and how to proceed!
It may be better to show a dialog box listing the neutral error message and to include a checkbox indicating 'Do you wish to see more of these error messages in the future?', the last thing an end-user wants, is to be working in the middle of the software to be bombarded with popup messages, they will get frustrated and will be turned off by the application! If the checkbox was ticked, log it to a file instead...
Keep the end-users informed of what error messages there will be...which implies...training and documentation...now this is a tricky one to get across...you don't want them to think that there will be 'issues' or 'glitches' and what to do in the event of that...they must not know that there will be possible errors, tricky indeed.
Always, always, be not afraid to ask for feedback when the uneventful happens - such as 'When that error number 1304 showed up, how did you react? What was your interpretation' - the bonus with that, the end-user may be able to give you a more coherent explanation instead of 'Error 1304, database object lost!', instead they may be able to say 'I clicked on this so and so, then somebody pulled the network cable of the machine accidentally', this will clue you in on having to deal with it and may modify the error to say 'Ooops, Network connection disconnected'... you get the drift.
Last but not least, if you want to target international audiences, take into account of internationalization of the error messages - hence that's why to keep it neutral, because then it will be easier to translate, avoid synonyms, slang words, etc which would make the translation meaningless - for example, Fiat Ford, the motor car company was selling their brand Fiat Ford Pinto, but noticed no sales was happening in South America, it turned out, Pinto was a slang there for 'small penis' and hence no sales...
(ed)Document the list of error messages to be expected in a separate section of the documentation titled 'Error Messages' or 'Corrective Actions' or similar, listing the error numbers in the correct order with a statement or two on how to proceed...
(ed) Thanks to Victor Hurdugaci for his input, keep the messages polite, do not make the end-users feel stupid. This goes against the answer by Jack Marchetti if the user base is international...
Edit: A special word of thanks to gnibbler who mentioned another extremely vital point as well!
Allow the end-user to be able to select/copy the error message so that they can if they do so wish, to email to the help support team or development team.
Edit#2: My bad! Whoops, thanks to DanM who mentioned that about the car, I got the name mixed up, it was Ford Pinto...my bad...
Edit#3: Have highlighted by ed to indicate additionals or addendums and credited to other's for their inputs...
Edit#4: In response to Ken's comment - here's my take...
No it is not, use neutral standard Windows colours...do not go for flashy colours! Stick to the normal gray back-colour with black text, which is a normal standard GUI guideline in the Microsoft specifications..see UX Guidelines (ed).
If you insist on flashy colours, at least, take into account of potential colour-blind users i.e. accessibility which is another important factor for those that have a disability, screen magnification friendly error messages, colour-blindness, those that suffer with albino, they may be sensitive to flashy colours, and epileptics as well...who may suffer from a particular colours that could trigger a seizure...

Show them the message. Due dilligence and all, but log every error to a file. Users can't remember what they were doing or what the error message was seconds after the event, it's like eye-witness accounts of perpatrators.
Provide a good way to allow them to email or upload the log to you so that you can assist them in reconciling the issue. If it's a web application: even better, you can be receiving information about the situation ahead of anyone even reporting the problem.

Short answer: You can't.
Less short answer: Make them visible, relevant, and contextual (highlight what they messed up). But still, you're fighting a losing battle. People don't read on computer screens, they scan, and they've been trained to click the buttons until the dialog boxes go away.

We put a simple memorable graphic in the error box: not an icon, a fairly large bitmap, and nothing like the standard Windows message icons. Nobody can ever remember the wording of a messagebox (most won't even read it if the box has an "OK" button they can press), but most people DO remember the picture they saw. So our support people can ask the customer "did you see the coffee-drinking guy?" or "did you see the empty desk?". At least that way we know roughly what went wrong.

Depending on your user base, writing funny/rude/personal error messages can work great.
For instance, I wrote an application which allowed our HR people to better track the hire/fire dates of employees. [we were a small company, very laid back].
When they entered wrong dates I would write:
Hey dumb ass, learn how to enter a date!
EDIT: Of course a more helpful message is to say: "Please enter date as mm/dd/yyyy" or perhaps in code to try and figure out what they entered and if they entered "blahblah" to show an error. However, this was a very small application for an HR person I knew personally. Hence again people, read the first line of this post: Depending on your user base...
I recently worked on an Art Institute project, so the error messages were geared towards the audience, such as:
Most art before the Baroque period was
unsigned. However, we’re beyond the
Baroque period now, so all fields must
be completed.
Basically gear it to your audience if at all possible, and avoid boring as all unearthly general errors such as: "please enter email" or "please enter valid email".

Alerts/popups are annoying, that's why everyone hits the first button they see.
Make it less annoying. Example: if the user entered the date incorrectly, or entered a text where numbers are expected, then DON'T popup a message, just highlight the field and write a message somewhere around it.
Make a custom message box. Do not ever use the default message box of the system, for example Windows XP message boxes are annoying themselves. Make a new colored message box, with a different background color than system default.
Very Important: do not insist. Some message boxes use the Modal dialog and insist on making you read it, this is very annoying. If you can make the message box appear as a warning message it would be better, for example, Stack Overflow messages that appear right on the top of the page, informing but not annoying.
UPDATE
Make the message meaningful and helpful. For example, do not write something like, "No Keyboard found, press F1 to continue."

The best UI design will be where you virtually never show an error message. The software should adapt to the user. With that sort of a design, an error message will be novel and will grab the users attention. If you pepper the user with senseless dialogs like that you're explicitly training them to ignore your messages.

In my opinion and experience, it's the power users, who do not read error messages. The nontechnical audience I know reads every message on the screen most carefully and the problem at this point mostly is: They don't understand it.
This point may be the cause of your experience, because at some point they will stop reading them, because "they don't understand it anyway", so your task is easy:
Make the error message as easy to understand as possible and keep the technical part under the hood.
For example I transfer a message like this:
ORA-00237: snapshot operation disallowed: control file newly created
Cause: An attempt to invoke cfileMakeAndUseSnapshot with a currently mounted control file that was newly created with CREATE CONTROLFILE was made.
Action: Mount a current control file and retry the operation.
to something like:
This step could not be processed due to momentary problems with the database. Please contact (your admin|the helpdesk|anyone who can contact the developer or admin to solve the problem). Sorry for the inconvenience.

Show users that the error message has a meaning, and it's a way to provide assistance to them and they will read it. If it's just jargon-bable or generic nonsense message they will learn to dismiss them quicly.
I have learned that is very good practice to include an error dialog with default action to send (eg. via email) detailed diagnostic info, if you quickly respond to those emails with valuable information or workaround, they will worship you.
This is also a great learning tool. In future versions you can solve known-issues or at least provide in-place workaround info. Until then users will learn that this message is caused by X and the problem can by solved by Y - all because someone did explain it to them.
Of course this won't work on a large scale application, but works very well in enterprise applications with few hundred users, and in a lean agile, release early release often, environment.
EDIT:
Since you have a broad user base I recommend to provide software that does what users are/can expect it to do, eg. do not show them eroror message if phone number is not formatted well, reformat if for them.
I personally like software that does not make me think, and when occasionally there is nothing you (the developer) can do to interpret my intention, provide a very well written (and reviewed by actual users) messages.
It's common knowlege that people do not read documentation (did you read instructions back-to-back do when you did plugged in household appliance?), they try a way to get results quickly, when failed you have to grab their attention (eg. disable default button for a while) with meaningful and helpful info. They don't care about your sofware failure, they want to get results, now.

One good tip I've learned is that you should write a dialog box like a newspaper article. Not in the size-sense, but in the importance-sense. Let me explain.
You should write the most important things to read, first, and provide more detailed information second.
In other words, this is no good:
There was a problem loading the file, the file might have been deleted, or
it might be present on a network share that you don't have access to at
your present location.
Do you want to retry opening the file?
Instead, change the order:
Problem loading file, do you want to retry?
There was a problem loading the file, the file might have been deleted, or
it might be present on a network share that you don't have access to at
your present location.
This way, the user can read just as much as he wants, or bothers, and still have an idea about what's being asked.

To start, write error messages that users can actually understand. "Error: 1023" is not good example. I think better way is logging the error, than showing it to the user with some "fancy" code. Or if logging is not possible, give the users proper way to send the error details to the support department.
Also, be short and clear enough. Do not include some technical details. Do not show them information that they cannot use. If possible provide a workaround for the error. If not provide a default route, that should be taken.
If your application is a web app, designing custom error pages is a good idea. They stress users less, take SO for example. You can get some ideas how to design a good error page here: http://www.smashingmagazine.com/2007/07/25/wanted-your-404-error-pages/

Make them fun. (It seemed relevant, given the site we're on :) )

One thing I'd like to add.
Use verbs for your action buttons to close your error messages rather than exclamations, example don't use "Ok!" "Close" etc.

Unless you can provide the user some simple work-around, don't bother showing the user an error message at all. There is just no point, since 90% of users won't care what it says.
On the other hand If you CAN actually show the user a useful workaround, then one way to force them to read it is make the OK button become enabled after 10 seconds or so. Sort of how Firefox does it whenever you are trying to install a new plug-in.
If it is a total crash that you cannot gracefully recover from, then inform the user in very layman terms saying:
"I'm sorry we screwed up, we would like to send some information about this crash, will you allow us to do so? YES / NO"
In addition, try not to make your error messages longer than a sentence. When people (me included) see a whole paragraph talking about the error, my mind just shuts off.
With so much social media and information overload, people's mind freeze when they see a wall of text.
EDIT:
Someone once also recently suggested using comic strips along with whatever message you want to show. Such as something from Dilbert that may be close to the type of error you may have.

From my experience: you don't get users (especially non-technical ones) to read error messages. No matter how clear and understandable, bold, red and flashing the message is, that you display, most users will just click anything away that they're not used to, even if it's "Do you really want to delete everything?". I have seen users click the "window close"-icon instead of "OK" or "cancel" even though they didn't even know which option they chose by doing so ...
If you really need to force users to read what you're displaying, I'd suggest a JavaScript-Countdown until a button is clickable. That way the user will hopefully use the waiting time to really read what he's supposed to. Be careful though: most users will be even more annoyed by that :)
I furthermore like your idea of a "read more"-link, although I doubt that will get users more interested that just want to get rid of the message by all means ...
Just for the record: there are users that DO read error messages but are so afraid that they won't do anything with it. I once had a support call where the customer would read an error message to me, asking me, what he should do. "Well, what are your options?", I asked. "The window only has an 'OK'-button.", he replied. ... mmh, hard one :)

I often display the error in red (when the design allows it).
Red stands for "alert", etc. so it's more often read.

Well, to answer your question directly: Don't have your programmers write your error messages. If you follow this one piece of advice, you'd save, cumulatively, thousands of hours of user angst and productivity and millions of dollars in technical support costs.
The real goal, however, should be to design your application so users can't make mistakes. Don't let them take actions that lead to error massages and require them to back up. As a simple example, in a web form that requires all its fields to be filled in, instead of popping up an error message when users click on the Send button, don't enable the Send button until all the field contain valid content. It means more work on the back side, but it results in a better user experience.
Of course, that's a bit of an ideal world. Sometimes, program errors are unavoidable. When they do occur, you need to provide clear, complete, and useful information, and most importantly, don't expose the system to user and don't blame users for their actions.
A good error message should contain:
What the problem is and why it happened.
How to resolve the problem.
One of the worst things you can do is simply pass system error messages through to users. For example, when your Java program throws an exception, don't simply pass the programmer-ese up to the UI and expose it to the user. Catch it, and have a clear message created by your user assistance developer that you can present to your user.
I was lucky enough, on my last job, to work with a team of programmer who wouldn't think of writing their own error messages. Any time they found themselves in a situation where one was required and the program couldn't be designed to avoid it (often because of limited resources), they always came to me, explained what they needed, and let me create an error message that was clear and followed company style. If that was the default mindset of every programmer, the computing world would be a far, far better place.

Less errors
If an application throws vomit at you on a regular basis, you become immune to it, and errors become irritating background muzak. If an error is a rare event, it will garner more attention.
Quosh anything which isn't a major deal, throw out all those warnings, find ways of understanding user intent, take out the decisions wherever possible. I have a few apps which I continue to streamline in this way. Developers see every error as important, but this is not true from a user perspective. Look for the users' common response to a problem and capture that, deploy that as your response.
If you do need to raise an error: short, concise, low terror factor, no exclamation marks. Paragraphs are fail.
There's no silver bullet, but you need to socially engineer to make errors important.

We told users their manager had been contacted (which was a lie). It worked a little too well and had to be removed.

Adding an "Advanced" button that enables some more technical details will provide an incentive to read it for the part of the target audience that thinks itself as technical

I'd suggest that you give feedback (stating that the user made a mistake) immediately after the mistake is made. (For instance, when entering a value of a date field, check the value and, if it is wrong, make the input field visually different).
If there are errors on the page (I'm more into web development, hence I'm referring to it as a "page", but it can be also called "form"), show an "error summary", explaining that there were errors and a bulleted list of what exactly errors happened. However, if there are more than 5-6 words per message, those won't be read/understood.

How about making the button state "Click here to speak with a support technician who will assist you with this issue."
There are many websites that provide the option to speak with a real person.

I read a candidate for the most horrific solution on slashdot:
We have found that the only way to
make users take responsibility for
errors is to give them a penalty for
forcing the error to go away. For
starters, where possible, the error
wont actually close for them unless we
enter an admin password to make it go
away, and if they reboot to get rid of
it (Task Manager is disabled on all
client PC's) the machine will not open
the application that crashed for 15
minutes. Of course, this all depends
on the type of users you are dealing
with, as more technically adept users
wouldnt accept this kind of system,
but after trying for literally YEARS
to make users take responsibility for
crashes and making sure the IT
department is aware of them in order
to fix the issue before it gets too
hard to manage, these are the only
steps that worked. Now, all of our end
users are aware that if they ignore
errors, they are going to suffer for
it themselves.

"ATTENTION! ATTENTION! If you do not read error message you WILL DIE!"

Despite all the recommendations in the accepted answer, my users continued to click the first button they could find. So now I show this:
The user has to make a choice before the OK button appears
If he selects the 3rd option, he can continue, otherwise the application quits.

What's the toughest bug you ever found and fixed? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 11 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
What made it hard to find? How did you track it down?
Not close enough to close but see also
https://stackoverflow.com/questions/175854/what-is-the-funniest-bug-youve-ever-experienced

A jpeg parser, running on a surveillance camera, which crashed every time the company's CEO came into the room.
100% reproducible error.
I kid you not!
This is why:
For you who doesn't know much about JPEG compression - the image is kind of broken down into a matrix of small blocks which then are encoded using magic etc.
The parser choked when the CEO came into the room, because he always had a shirt with a square pattern on it, which triggered some special case of contrast and block boundary algorithms.
Truly classic.

This didn't happen to me, but a friend told me about it.
He had to debug a app which would crash very rarely. It would only fail on Wednesdays -- in September -- after the 9th. Yes, 362 days of the year, it was fine, and three days out of the year it would crash immediately.
It would format a date as "Wednesday, September 22 2008", but the buffer was one character too short -- so it would only cause a problem when you had a 2 digit DOM on a day with the longest name in the month with the longest name.

This requires knowing a bit of Z-8000 assembler, which I'll explain as we go.
I was working on an embedded system (in Z-8000 assembler). A different division of the company was building a different system on the same platform, and had written a library of functions, which I was also using on my project. The bug was that every time I called one function, the program crashed. I checked all my inputs; they were fine. It had to be a bug in the library -- except that the library had been used (and was working fine) in thousands of POS sites across the country.
Now, Z-8000 CPUs have 16 16-bit registers, R0, R1, R2 ...R15, which can also be addressed as 8 32-bit registers, named RR0, RR2, RR4..RR14 etc. The library was written from scratch, refactoring a bunch of older libraries. It was very clean and followed strict programming standards. At the start of each function, every register that would be used in the function was pushed onto the stack to preserve its value. Everything was neat & tidy -- they were perfect.
Nevertheless, I studied the assembler listing for the library, and I noticed something odd about that function --- At the start of the function, it had PUSH RR0 / PUSH RR2 and at the end to had POP RR2 / POP R0. Now, if you didn't follow that, it pushed 4 values on the stack at the start, but only removed 3 of them at the end. That's a recipe for disaster. There an unknown value on the top of the stack where return address needed to be. The function couldn't possibly work.
Except, may I remind you, that it WAS working. It was being called thousands of times a day on thousands of machines. It couldn't possibly NOT work.
After some time debugging (which wasn't easy in assembler on an embedded system with the tools of the mid-1980s), it would always crash on the return, because the bad value was sending it to a random address. Evidently I had to debug the working app, to figure out why it didn't fail.
Well, remember that the library was very good about preserving the values in the registers, so once you put a value into the register, it stayed there. R1 had 0000 in it. It would always have 0000 in it when that function was called. The bug therefore left 0000 on the stack. So when the function returned it would jump to address 0000, which just so happened to be a RET, which would pop the next value (the correct return address) off the stack, and jump to that. The data perfectly masked the bug.
Of course, in my app, I had a different value in R1, so it just crashed....

This was on Linux but could have happened on virtually any OS. Now most of you are probably familiar with the BSD socket API. We happily use it year after year, and it works.
We were working on a massively parallel application that would have many sockets open. To test its operation we had a testing team that would open hundreds and sometimes over a thousand connections for data transfer. With the highest channel numbers our application would begin to show weird behavior. Sometimes it just crashed. The other time we got errors that simply could not be true (e.g. accept() returning the same file descriptor on subsequent calls which of course resulted in chaos.)
We could see in the log files that something went wrong, but it was insanely hard to pinpoint. Tests with Rational Purify said nothing was wrong. But something WAS wrong. We worked on this for days and got increasingly frustrated. It was a showblocker because the already negotiated test would cause havoc in the app.
As the error only occured in high load situations, I double-checked everything we did with sockets. We had never tested high load cases in Purify because it was not feasible in such a memory-intensive situation.
Finally (and luckily) I remembered that the massive number of sockets might be a problem with select() which waits for state changes on sockets (may read / may write / error). Sure enough our application began to wreak havoc exactly the moment it reached the socket with descriptor 1025. The problem is that select() works with bit field parameters. The bit fields are filled by macros FD_SET() and friends which DON'T CHECK THEIR PARAMETERS FOR VALIDITY.
So everytime we got over 1024 descriptors (each OS has its own limit, Linux vanilla kernels have 1024, the actual value is defined as FD_SETSIZE), the FD_SET macro would happily overwrite its bit field and write garbage into the next structure in memory.
I replaced all select() calls with poll() which is a well-designed alternative to the arcane select() call, and high load situations have never been a problem everafter. We were lucky because all socket handling was in one framework class where 15 minutes of work could solve the problem. It would have been a lot worse if select() calls had been sprinkled all over of the code.
Lessons learned:
even if an API function is 25 years old and everybody uses it, it can have dark corners you don't know yet
unchecked memory writes in API macros are EVIL
a debugging tool like Purify can't help with all situations, especially when a lot of memory is used
Always have a framework for your application if possible. Using it not only increases portability but also helps in case of API bugs
many applications use select() without thinking about the socket limit. So I'm pretty sure you can cause bugs in a LOT of popular software by simply using many many sockets. Thankfully, most applications will never have more than 1024 sockets.
Instead of having a secure API, OS developers like to put the blame on the developer. The Linux select() man page says
"The behavior of these macros is
undefined if a descriptor value is
less than zero or greater than or
equal to FD_SETSIZE, which is normally
at least equal to the maximum number
of descriptors supported by the
system."
That's misleading. Linux can open more than 1024 sockets. And the behavior is absolutely well defined: Using unexpected values will ruin the application running. Instead of making the macros resilient to illegal values, the developers simply overwrite other structures. FD_SET is implemented as inline assembly(!) in the linux headers and will evaluate to a single assembler write instruction. Not the slightest bounds checking happening anywhere.
To test your own application, you can artificially inflate the number of descriptors used by programmatically opening FD_SETSIZE files or sockets directly after main() and then running your application.
Thorsten79

Mine was a hardware problem...
Back in the day, I used a DEC VaxStation with a big 21" CRT monitor. We moved to a lab in our new building, and installed two VaxStations in opposite corners of the room. Upon power-up,my monitor flickered like a disco (yeah, it was the 80's), but the other monitor didn't.
Okay, swap the monitors. The other monitor (now connected to my VaxStation) flickered, and my former monitor (moved across the room) didn't.
I remembered that CRT-based monitors were susceptable to magnetic fields. In fact, they were -very- susceptable to 60 Hz alternating magnetic fields. I immediately suspected that something in my work area was generating a 60 Hz alterating magnetic field.
At first, I suspected something in my work area. Unfortunately, the monitor still flickered, even when all other equipment was turned off and unplugged. At that point, I began to suspect something in the building.
To test this theory, we converted the VaxStation and its 85 lb monitor into a portable system. We placed the entire system on a rollaround cart, and connected it to a 100 foot orange construction extension cord. The plan was to use this setup as a portable field strength meter,in order to locate the offending piece of equipment.
Rolling the monitor around confused us totally. The monitor flickered in exactly one half of the room, but not the other side. The room was in the shape of a square, with doors in opposite corners, and the monitor flickered on one side of a diagnal line connecting the doors, but not on the other side. The room was surrounded on all four sides by hallways. We pushed the monitor out into the hallways, and the flickering stopped. In fact, we discovered that the flicker only occurred in one triangular-shaped half of the room, and nowhere else.
After a period of total confusion, I remembered that the room had a two-way ceiling lighting system, with light switches at each door. At that moment, I realized what was wrong.
I moved the monitor into the half of the room with the problem, and turned the ceiling lights off. The flicker stopped. When I turned the lights on, the flicker resumed. Turning the lights on or off from either light switch, turned the flicker on or off within half of the room.
The problem was caused by somebody cutting corners when they wired the ceiling lights. When wiring up a two-way switch on a lighting circuit, you run a pair of wires between the SPDT switch contacts, and a single wire from the common on one switch, through the lights, and over to the common on the other switch.
Normally, these wires are bundeled together. They leave as a group from one switchbox, run to the overhead ceiling fixture, and on to the other box. The key idea, is that all of the current-carrying wires are bundeled together.
When the building was wired, the single wire between the switches and the light was routed through the ceiling, but the wires travelling between the switches were routed through the walls.
If all of the wires ran close and parallel to each other, then the magnetic field generated by the current in one wire was cancelled out by the magnetic field generated by the equal and opposite current in a nearby wire. Unfortunately, the way that the lights were actually wired meant that one half of the room was basically inside a large, single-turn transformer primary. When the lights were on, the current flowed in a loop, and the poor monitor was basically sitting inside of a large electromagnet.
Moral of the story: The hot and neutral lines in your AC power wiring are next to each other for a good reason.
Now, all I had to do was to explain to management why they had to rewire part of their new building...

A bug where you come across some code, and after studying it you conclude, "There's no way this could have ever worked!" and suddenly it stops working though it always did work before.

One of the products I helped build at my work was running on a customer site for several months, collecting and happily recording each event it received to a SQL Server database. It ran very well for about 6 months, collecting about 35 million records or so.
Then one day our customer asked us why the database hadn't updated for almost two weeks. Upon further investigation we found that the database connection that was doing the inserts had failed to return from the ODBC call. Thankfully the thread that does the recording was separated from the rest of the threads, allowing everything but the recording thread to continue functioning correctly for almost two weeks!
We tried for several weeks on end to reproduce the problem on any machine other than this one. We never could reproduce the problem. Unfortunately, several of our other products then began to fail in about the same manner, none of which have their database threads separated from the rest of their functionality, causing the entire application to hang, which then had to be restarted by hand each time they crashed.
Weeks of investigation turned into several months and we still had the same symptoms: full ODBC deadlocks in any application that we used a database. By this time our products are riddled with debugging information and ways to determine what went wrong and where, even to the point that some of the products will detect the deadlock, collect information, email us the results, and then restart itself.
While working on the server one day, still collecting debugging information from the applications as they crashed, trying to figure out what was going on, the server BSoD on me. When the server came back online, I opened the minidump in WinDbg to figure out what the offending driver was. I got the file name and traced it back to the actual file. After examining the version information in the file, I figured out it was part of the McAfee anti-virus suite installed on the computer.
We disabled the anti-virus and haven't had a single problem since!!

I just want to point out a quite common and nasty bug that can happens in this google-area time:
code pasting and the infamous minus
That is when you copy paste some code with an minus in it, instead of a regular ASCII character hyphen-minus ('-').
Plus, minus(U+2212), Hyphen-Minus(U+002D)
Now, even though the minus is supposedly rendered as longer than the hyphen-minus, on certain editors (or on a DOS shell windows), depending on the charset used, it is actually rendered as a regular '-' hyphen-minus sign.
And... you can spend hours trying to figure why this code does not compile, removing each line one by one, until you find the actual cause!
May be not the toughest bug out there, but frustrating enough ;)
(Thank you ShreevatsaR for spotting the inversion in my original post - see comments)

The first was that our released product exhibited a bug, but when I tried to debug the problem, it didn't occur. I thought this was a "release vs. debug" thing at first -- but even when I compiled the code in release mode, I couldn't reproduce the problem. I went to see if any other developer could reproduce the problem. Nope. After much investigation (producing a mixed assembly code / C code listing) of the program output and stepping through the assembly code of the released product (yuck!), I found the offending line. But the line looked just fine to me! I then had to lookup what the assembly instructions did -- and sure enough the wrong assembly instruction was in the released executable. Then I checked the executable that my build environment produced -- it had the correct assembly instruction. It turned out that the build machine somehow got corrupt and produced bad assembly code for only one instruction for this application. Everything else (including previous versions of our product) produced identical code to other developers machines. After I showed my research to the software manager, we quickly re-built our build machine.

Somewhere deep in the bowels of a networked application was the line (simplified):
if (socket = accept() == 0)
return false;
//code using the socket()
What happened when the call succeeded? socket was set to 1. What does send() do when given a 1? (such as in:
send(socket, "mystring", 7);
It prints to stdout... this I found after 4 hours of wondering why, with all my printf()s taken out, my app was printing to the terminal window instead of sending the data over the network.

With FORTRAN on a Data General minicomputer in the 80's we had a case where the compiler caused a constant 1 (one) to be treated as 0 (zero). It happened because some old code was passing a constant of value 1 to a function which declared the variable as a FORTRAN parameter, which meant it was (supposed to be) immutable. Due to a defect in the code we did an assignment to the parameter variable and the compiler gleefully changed the data in the memory location it used for a constant 1 to 0.
Many unrelated functions later we had code that did a compare against the literal value 1 and the test would fail. I remember staring at that code for the longest time in the debugger. I would print out the value of the variable, it would be 1 yet the test 'if (foo .EQ. 1)' would fail. It took me a long time before I thought to ask the debugger to print out what it thought the value of 1 was. It then took a lot of hair pulling to trace back through the code to find when the constant 1 became 0.

I had a bug in a console game that occurred only after you fought and won a lengthy boss-battle, and then only around 1 time in 5. When it triggered, it would leave the hardware 100% wedged and unable to talk to outside world at all.
It was the shyest bug I've ever encountered; modifying, automating, instrumenting or debugging the boss-battle would hide the bug (and of course I'd have to do 10-20 runs to determine that the bug had hidden).
In the end I found the problem (a cache/DMA/interrupt race thing) by reading the code over and over for 2-3 days.

Not very tough, but I laughed a lot when it was uncovered.
When I was maintaining a 24/7 order processing system for an online shop, a customer complained that his order was "truncated". He claimed that while the order he placed actually contained N positions, the system accepted much less positions without any warning whatsoever.
After we traced order flow through the system, the following facts were revealed. There was a stored procedure responsible for storing order items in database. It accepted a list of order items as string, which encoded list of (product-id, quantity, price) triples like this:
"<12345, 3, 19.99><56452, 1,
8.99><26586, 2, 12.99>"
Now, the author of stored procedure was too smart to resort to anything like ordinary parsing and looping. So he directly transformed the string into SQL multi-insert statement by replacing "<" with "insert into ... values (" and ">" with ");". Which was all fine and dandy, if only he didn't store resulting string in a varchar(8000) variable!
What happened is that his "insert ...; insert ...;" was truncated at 8000th character and for that particular order the cut was "lucky" enough to happen right between inserts, so that truncated SQL remained syntactically correct.
Later I found out the author of sp was my boss.

This is back when I thought that C++ and digital watches were pretty neat...
I got a reputation for being able to solve difficult memory leaks. Another team had a leak they couldn't track down. They asked me to investigate.
In this case, they were COM objects. In the core of the system was a component that gave out many twisty little COM objects that all looked more or less the same. Each one was handed out to many different clients, each of which was responsible for doing AddRef() and Release() the same number of times.
There wasn't a way to automatically calculate who had called each AddRef, and whether they had Released.
I spent a few days in the debugger, writing down hex addresses on little pieces of paper. My office was covered with them. Finally I found the culprit. The team that asked me for help was very grateful.
The next day I switched to a GC'd language.*
(*Not actually true, but would be a good ending to the story.)

Bryan Cantrill of Sun Microsystems gave an excellent Google Tech Talk on a bug he tracked down using a tool he helped develop called dtrace.
The The Tech Talk is funny, geeky, informative, and very impressive (and long, about 78 minutes).
I won't give any spoilers here on what the bug was but he starts revealing the culprit at around 53:00.

While testing some new functionality that I had recently added to a trading application, I happened to notice that the code to display the results of a certain type of trade would never work properly. After looking at the source control system, it was obvious that this bug had existed for at least a year, and I was amazed that none of the traders had ever spotted it.
After puzzling for a while and checking with a colleague, I fixed the bug and went on testing my new functionality. About 3 minutes later, my phone rang. On the other end of the line was an irate trader who complained that one of his trades wasn’t showing correctly.
Upon further investigation, I realized that the trader had been hit with the exact same bug I had noticed in the code 3 minutes earlier. This bug had been lying around for a year, just waiting for a developer to come along and spot it so that it could strike for real.
This is a good example of a type of bug known as a Schroedinbug. While most of us have heard about these peculiar entities, it is an eerie feeling when you actually encounter one in the wild.

The two toughest bugs that come to mind were both in the same type of software, only one was in the web-based version, and one in the windows version.
This product is a floorplan viewer/editor. The web-based version has a flash front-end that loads the data as SVG. Now, this was working fine, only sometimes the browser would hang. Only on a few drawings, and only when you wiggled the mouse over the drawing for a bit. I narrowed the problem down to a single drawing layer, containing 1.5 MB of SVG data. If I took only a subsection of the data, any subsection, the hang didn't occur. Eventually it dawned on me that the problem probably was that there were several different sections in the file that in combination caused the bug. Sure enough, after randomly deleting sections of the layer and testing for the bug, I found the offending combination of drawing statements. I wrote a workaround in the SVG generator, and the bug was fixed without changing a line of actionscript.
In the same product on the windows side, written in Delphi, we had a comparable problem. Here the product takes autocad DXF files, imports them to an internal drawing format, and renders them in a custom drawing engine. This import routine isn't particularly efficient (it uses a lot of substring copying), but it gets the job done. Only in this case it wasn't. A 5 megabyte file generally imports in 20 seconds, but on one file it took 20 minutes, because the memory footprint ballooned to a gigabyte or more. At first it seemed like a typical memory leak, but memory leak tools reported it clean, and manual code inspection turned up nothing either. The problem turned out to be a bug in Delphi 5's memory allocator. In some conditions, which this particular file was duly recreating, it would be prone to severe memory fragmentation. The system would keep trying to allocate large strings, and find nowhere to put them except above the highest allocated memory block. Integrating a new memory allocation library fixed the bug, without changing a line of import code.
Thinking back, the toughest bugs seem to be the ones whose fix involves changing a different part of the system than the one where the problem occurs.

When the client's pet bunny rabbit gnawed partway through the ethernet cable. Yes. It was bad.

Had a bug on a platform with a very bad on device debugger.
We would get a crash on the device if we added a printf to the code. It then would crash at a different spot than the location of the printf. If we moved the printf, the crash would ether move or disappear. In fact, if we changed that code by reordering some simple statements, the crash would happen some where unrelated to the code we did change.
It turns out there was a bug in the relocator for our platform. the relocator was not zero initializing the ZI section but rather using the relocation table to initialze the values. So any time the relocation table changed in the binary the bug would move. So simply added a printf would change the relocation table an there for the bug.

This happened to me on the time I worked on a computer store.
One customer came one day into shop and tell us that his brand new computer worked fine on evenings and night, but it does not work at all on midday or late morning.
The trouble was that mouse pointer does not move at that times.
The first thing we did was changing his mouse by a new one, but the trouble were not fixed. Of course, both mouses worked on store with no fault.
After several tries, we found the trouble was with that particular brand and model of mouse.
Customer workstation was close to a very big window, and at midday the mouse was under direct sunlight.
Its plastic was so thin that under that circumstances, it became translucent and sunlight prevented optomechanical wheel for working :|

My team inherited a CGI-based, multi-threaded C++ web app. The main platform was Windows; a distant, secondary platform was Solaris with Posix threads. Stability on Solaris was a disaster, for some reason. We had various people who looked at the problem for over a year, off and on (mostly off), while our sales staff successfully pushed the Windows version.
The symptom was pathetic stability: a wide range of system crashes with little rhyme or reason. The app used both Corba and a home-grown protocol. One developer went so far as to remove the entire Corba subsystem as a desperate measure: no luck.
Finally, a senior, original developer wondered aloud about an idea. We looked into it and eventually found the problem: on Solaris, there was a compile-time (or run-time?) parameter to adjust the stack size for the executable. It was set incorrectly: far too small. So, the app was running out of stack and printing stack traces that were total red herrings.
It was a true nightmare.
Lessons learned:
Brainstorm, brainstorm, brainstorm
If something is going nuts on a different, neglected platform, it is probably an attribute of the environment platform
Beware of problems that are transferred from developers who leave the team. If possible, contact the previous people on personal basis to garner info and background. Beg, plead, make a deal. The loss of experience must be minimized at all costs.

Adam Liss's message above talking about the project we both worked on, reminded me of a fun bug I had to deal with. Actually, it wasn't a bug, but we'll get to that in a minute.
Executive summary of the app in case you haven't seen Adam message yet: sales-force automation software...on laptops...end of the day they dialed up ...to synchronize with the Mother database.
One user complained that every time he tried to dial in, the application would crash. The customer support folks went through all their usually over-the-phone diagnostic tricks, and they found nothing. So, they had to relent to the ultimate: have the user FedEx the laptop to our offices. (This was a very big deal, as each laptop's local database was customized to the user, so a new laptop had to be prepared, shipped to the user for him to use while we worked on his original, then we had to swap back and have him finally sync the data on first original laptop).
So, when the laptop arrived, it was given to me to figure out the problem. Now, syncing involved hooking up the phone line to the internal modem, going to the "Communication" page of our app, and selecting a phone number from a Drop-down list (with last number used pre-selected). The numbers in the DDL were part of the customization, and were basically, the number of the office, the number of the office prefixed with "+1", the number of the office prefixed with "9,,," in case they were calling from an hotel etc.
So, I click the "COMM" icon, and pressed return. It dialed in, it connected to a modem -- and then immediately crashed. I tired a couple more times. 100% repeatability.
So, a hooked a data scope between the laptop & the phone line, and looked at the data going across the line. It looked rather odd... The oddest part was that I could read it!
The user had apparently wanted to use his laptop to dial into a local BBS system, and so, change the configuration of the app to use the BBS's phone number instead of the company's. Our app was expecting our proprietary binary protocol -- not long streams of ASCII text. Buffers overflowed -- KaBoom!
The fact that a problem dialing in started immediately after he changed the phone number, might give the average user a clue that it was the cause of the problem, but this guy never mentioned it.
I fixed the phone number, and sent it back to the support team, with a note electing the guy the "Bonehead user of the week". (*)
(*) OkOkOk... There's probably a very good chance what actually happened in that the guy's kid, seeing his father dial in every night, figured that's how you dial into BBS's also, and changed the phone number sometime when he was home alone with the laptop. When it crashed, he didn't want to admit he touched the laptop, let alone broke it; so he just put it away, and didn't tell anyone.

It was during my diploma thesis. I was writing a program to simulate the effect of high intensity laser on a helium atom using FORTRAN.
One test run worked like this:
Calculate the intial quantum state using program 1, about 2 hours.
run the main simulation on the data from the first step, for the most simple cases about 20 to 50 hours.
then analyse the output with a third program in order to get meaningful values like energy, tork, momentum
These should be constant in total, but they weren't. They did all kinds of weird things.
After debugging for two weeks I went berserk on the logging and logged every variable in every step of the simulation including the constants.
That way I found out that I wrote over an end of an array, which changed a constant!
A friend said he once changed the literal 2 with such a mistake.

A deadlock in my first multi-threaded program!
It was very tough to find it because it happened in a thread pool. Occasionally a thread in the pool would deadlock but the others would still work. Since the size of the pool was much greater than needed it took a week or two to notice the first symptom: application completely hung.

I have spent hours to days debugging a number of things that ended up being fixable with literally just a couple characters.
Some various examples:
ffmpeg has this nasty habit of producing a warning about "brainfart cropping" (referring to a case where in-stream cropping values are >= 16) when the crop values in the stream were actually perfectly valid. I fixed it by adding three characters: "h->".
x264 had a bug where in extremely rare cases (one in a million frames) with certain options it would produce a random block of completely the wrong color. I fixed the bug by adding the letter "O" in two places in the code. Turned out I had mispelled the name of a #define in an earlier commit.

My first "real" job was for a company that wrote client-server sales-force automation software. Our customers ran the client app on their (15-pound) laptops, and at the end of the day they dialed up to our unix servers to synchronize with the Mother database. After a series of complaints, we found that an astronomical number of calls were dropping at the very beginning, during authentication.
After weeks of debugging, we discovered that the authentication always failed if the incoming call was answered by a getty process on the server whose Process ID contained an even number followed immediately by a 9. Turns out the authentication was a homebrew scheme that depended on an 8-character string representation of the PID; a bug caused an offending PID to crash the getty, which respawned with a new PID. The second or third call usually found an acceptable PID, and automatic redial made it unnecessary for the customers to intervene, so it wasn't considered a significant problem until the phone bills arrived at the end of the month.
The "fix" (ahem) was to convert the PID to a string representing its value in octal rather than decimal, making it impossible to contain a 9 and unnecessary to address the underlying problem.

Basically, anything involving threads.
I held a position at a company once in which I had the dubious distinction of being one of the only people comfortable enough with threading to debug nasty issues. The horror. You should have to get some kind of certification before you're allowed to write threaded code.

I heard about a classic bug back in high school; a terminal that you could only log into if you sat in the chair in front of it. (It would reject your password if you were standing.)
It reproduced pretty reliably for most people; you could sit in the chair, log in, log out... but if you stand up, you're denied, every time.
Eventually it turned out some jerk had swapped a couple of adjacent keys on the keyboard, E/R and C/V IIRC, and when you sat down, you touch-typed and got in, but when you stood, you had to hunt 'n peck, so you looked at the incorrent labels and failed.

While I don't recall a specific instance, the toughest category are those bugs which only manifest after the system has been running for hours or days, and when it goes down, leaves little or no trace of what caused the crash. What makes them particularly bad is that no matter how well you think you've reasoned out the cause, and applied the appropriate fix to remedy it, you'll have to wait for another few hours or days to get any confidence at all that you've really nailed it.

Our network interface, a DMA-capable ATM card, would very occasionally deliver corrupted data in received packets. The AAL5 CRC had checked out as correct when the packet came in off the wire, yet the data DMAd to memory would be incorrect. The TCP checksum would generally catch it, but back in the heady days of ATM people were enthused about running native applications directly on AAL5, dispensing with TCP/IP altogether. We eventually noticed that the corruption only occurred on some models of the vendor's workstation (who shall remain nameless), not others.
By calculating the CRC in the driver software we were able to detect the corrupted packets, at the cost of a huge performance hit. While trying to debug we noticed that if we just stored the packet for a while and went back to look at it later, the data corruption would magically heal itself. The packet contents would be fine, and if the driver calculated the CRC a second time it would check out ok.
We'd found a bug in the data cache of a shipping CPU. The cache in this processor was not coherent with DMA, requiring the software to explicitly flush it at the proper times. The bug was that sometimes the cache didn't actually flush its contents when told to do so.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio