Heap corruption in a released/production version. How can we prevent it? - heap-corruption

In our company we somehow got code into production that crashed (because the heap got corrupted somehow). Developers wrote it, testers then had their hands on it, and later it was released the usual way (monthly release). Everything was fine until it crashed... We tried to investigate it and found many places where we could be corrupting the heap... What could we do to prevent this kind of thing? Tighten our code reviews (we do them about 4 times out of 5, and only one developer reviews, without any help from the author of the code)? Change our policy so that memory is managed only through smart pointers, or something else? Any advice would be welcome!

Switch to a managed language (C#, Java, whatever) if you can. If you can't:
Use RAII.
Be very clear about who owns each bit of memory.
Create objects on the stack wherever possible.
Use smart pointers in a consistent way (see the sketch after this list).
Have your code reviewers watch out for the above points.
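To make those points concrete, here is a minimal C++ sketch (the Document/Page types are invented for the example) showing stack allocation, RAII, and single ownership via std::unique_ptr:
#include <memory>
#include <vector>
class Page { /* details omitted */ };
// The Document is the sole owner of its pages; nobody calls new/delete by hand.
class Document {
public:
    void addPage() {
        pages_.push_back(std::make_unique<Page>());  // ownership is clear: the vector owns it
    }
private:
    std::vector<std::unique_ptr<Page>> pages_;       // RAII: freed automatically in ~Document
};
int main() {
    Document doc;      // created on the stack: destroyed (and every page freed) on scope exit,
    doc.addPage();     // even if an exception is thrown
}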

Related

How do you fix a bug you can't replicate?

The question says it all. If you have a bug that multiple users report, but there is no record of the bug occurring in the log, nor can the bug be repeated, no matter how hard you try, how do you fix it? Or even can you?
I am sure this has happened to many of you out there. What did you do in this situation, and what was the final outcome?
Edit:
I am more interested in what was done about an unfindable bug, not an unresolvable bug. Unresolvable bugs are such that you at least know that there is a problem and have a starting point, in most cases, for searching for it. In the case of an unfindable one, what do you do? Can you even do anything at all?
Language
Different programming languages will have their own flavour of bugs.
C
Adding debug statements can make the problem impossible to duplicate, because the debug statement itself shifts pointers far enough to avoid a SEGFAULT (a so-called Heisenbug). Pointer issues are arduous to track and replicate, but debuggers can help (such as GDB and DDD).
Java
An application that has multiple threads might only show its bugs with a very specific timing or sequence of events. Improper concurrency implementations can cause deadlocks in situations that are difficult to replicate.
JavaScript
Some web browsers are notorious for memory leaks. JavaScript code that runs fine in one browser might cause incorrect behaviour in another browser. Using third-party libraries that have been rigorously tested by thousands of users can be advantageous to avoid certain obscure bugs.
Environment
Depending on the complexity of the environment in which the application (that has the bug) is running, the only recourse might be to simplify the environment. Does the application run:
on a server?
on a desktop?
in a web browser?
In what environment does the application produce the problem?
development?
test?
production?
Exit extraneous applications, kill background tasks, stop all scheduled events (cron jobs), eliminate plug-ins, and uninstall browser add-ons.
Networking
As networking is essential to so many applications:
Ensure stable network connections, including wireless signals.
Does the software reconnect after network failures robustly?
Do all connections get closed properly so as to release file descriptors?
Are people using the machine who shouldn't be?
Are rogue devices interacting with the machine's network?
Are there factories or radio towers nearby that can cause interference?
Do packet sizes and frequency fall within nominal ranges?
Are packets being monitored for loss?
Are all network devices adequate for heavy bandwidth usage?
Consistency
Eliminate as many unknowns as possible:
Isolate architectural components.
Remove non-essential, or possibly problematic (conflicting), elements.
Deactivate different application modules.
Remove all differences between production, test, and development. Use the same hardware. Follow the exact same steps, perfectly, to set up the computers. Consistency is key.
Logging
Use liberal amounts of logging to correlate the time events happened. Examine logs for any obvious errors, timing issues, etc.
Hardware
If the software seems okay, consider hardware faults:
Are the physical network connections solid?
Are there any loose cables?
Are chips seated properly?
Do all cables have clean connections?
Is the working environment clean and free of dust?
Have any hidden devices or cables been damaged by rodents or insects?
Are there bad blocks on drives?
Are the CPU fans working?
Can the motherboard power all components? (CPU, network card, video card, drives, etc.)
Could electromagnetic interference be the culprit?
And mostly for embedded:
Insufficient supply bypassing?
Board contamination?
Bad solder joints / bad reflow?
CPU not reset when supply voltages are out of tolerance?
Bad resets because supply rails are back-powered from I/O ports and don't fully discharge?
Latch-up?
Floating input pins?
Insufficient (sometimes negative) noise margins on logic levels?
Insufficient (sometimes negative) timing margins?
Tin whiskers?
ESD damage?
ESD upsets?
Chip errata?
Interface misuse (e.g. I2C off-board or in the presence of high-power signals)?
Race conditions?
Counterfeit components?
Network vs. Local
What happens when you run the application locally (i.e., not across the network)? Are other servers experiencing the same issues? Is the database remote? Can you use a local database?
Firmware
In between hardware and software is firmware.
Is the computer BIOS up-to-date?
Is the BIOS battery working?
Are the BIOS clock and system clock synchronized?
Time and Statistics
Timing issues are difficult to track:
When does the problem happen?
How frequently?
What other systems are running at that time?
Is the application time-sensitive (e.g., will leap days or leap seconds cause issues)?
Gather hard numerical data on the problem. A problem that might, at first, appear random, might actually have a pattern.
Change Management
Sometimes problems appear after a system upgrade.
When did the problem first start?
What changed in the environment (hardware and software)?
What happens after rolling back to a previous version?
What differences exist between the problematic version and good version?
Library Management
Different operating systems have different ways of distributing conflicting libraries:
Windows has DLL Hell.
Unix can have numerous broken symbolic links.
Java library files can be equally nightmarish to resolve.
Perform a fresh install of the operating system, and include only the supporting software required for your application.
Java
Make sure every library is used only once. Sometimes application containers have a different version of a library than the application itself. This might not be possible to replicate in the development environment.
Use a library management tool such as Maven or Ivy.
Debugging
Code a detection method that triggers a notification (e.g., log, e-mail, pop-up, pager beep) when the bug happens. Use automated testing to submit data into the application. Use random data. Use data that covers known and possible edge cases. Eventually the bug should reappear.
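A minimal C++ sketch of that idea (the "looks sane" check and the log file name are placeholders; substitute whatever condition characterizes your bug and whatever notification channel you prefer):
#include <cstdlib>
#include <ctime>
#include <fstream>
#include <string>
// Hypothetical invariant check: stands in for "the bug's corrupt state was detected".
bool recordLooksSane(const std::string& record) {
    return record.find('Q') == std::string::npos;   // placeholder condition for the demo
}
// Notify loudly the moment the bad state is seen, instead of waiting for a crash later.
void notifyBugDetected(const std::string& context) {
    std::ofstream log("bug-detector.log", std::ios::app);
    log << "BUG DETECTED: " << context << '\n';     // could also e-mail, page, or pop up a dialog
}
int main() {
    std::srand(static_cast<unsigned>(std::time(nullptr)));
    // Feed random data into the suspect code path until the detector fires.
    for (int i = 0; i < 100000; ++i) {
        std::string record(16, ' ');
        for (char& c : record) c = static_cast<char>('A' + std::rand() % 26);
        if (!recordLooksSane(record)) {
            notifyBugDetected("random input: " + record);
            break;
        }
    }
}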
Sleep
It is worth reiterating what others have mentioned: sleep on it. Spend time away from the problem, finish other tasks (like documentation). Be physically distant from computers and get some exercise.
Code Review
Walk through the code, line-by-line, and describe what every line does to yourself, a co-worker, or a rubber duck. This may lead to insights on how to reproduce the bug.
Cosmic Radiation
Cosmic rays can flip bits. This is not as big a problem as it was in the past, thanks to modern memory error checking. Software for hardware that leaves Earth's protection is subject to issues that simply cannot be replicated due to the randomness of cosmic radiation.
Tools
Sometimes, albeit infrequently, the compiler will introduce a bug, especially for niche tools (e.g. a C micro-controller compiler suffering from a symbol table overflow). Is it possible to use a different compiler? Could any other tool in the tool-chain be introducing issues?
If it's a GUI app, it's invaluable to watch the customer generate the error (or try to). They'll no doubt be doing something you'd never have guessed they were doing (not wrongly, just differently).
Otherwise, concentrate your logging in that area. Log almost everything (you can pull it out later) and get your app to dump its environment as well, e.g. machine type, VM type, and encoding used.
Does your app report a version number, a build number, etc.? You need this to determine precisely which version you're debugging (or not!).
If you can instrument your app (e.g. by using JMX if you're in the Java world) then instrument the area in question. Store stats e.g. requests+parameters, time made, etc. Make use of buffers to store the last 'n' requests/responses/object versions/whatever, and dump them out when the user reports an issue.
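The "last n requests" buffer is language-agnostic; here is a minimal C++ sketch of the idea (class and method names are invented for the example):
#include <deque>
#include <iostream>
#include <string>
// Keeps the last N requests in memory so they can be dumped when a user reports a problem.
class RecentRequests {
public:
    explicit RecentRequests(std::size_t capacity) : capacity_(capacity) {}
    void record(const std::string& request) {
        if (entries_.size() == capacity_) entries_.pop_front();  // evict the oldest entry
        entries_.push_back(request);
    }
    void dump(std::ostream& out) const {
        for (const auto& e : entries_) out << e << '\n';
    }
private:
    std::size_t capacity_;
    std::deque<std::string> entries_;
};
int main() {
    RecentRequests recent(3);
    recent.record("GET /a?id=1");
    recent.record("GET /b?id=2");
    recent.record("POST /c");
    recent.record("GET /d");        // the oldest entry is evicted
    recent.dump(std::cout);         // call this from your "user reported an issue" handler
}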
If you can't replicate it, you may fix it, but can't know that you've fixed it.
I've made my best explanation about how the bug was triggered (even if I didn't know how that situation could come about), fixed that, and made sure that if the bug surfaced again, our notification mechanisms would let a future developer know the things that I wish I had known. In practice, this meant adding log events when the paths which could trigger the bug were crossed, and metrics for related resources were recorded. And, of course, making sure that the tests exercised the code well in general.
Deciding what notifications to add is a feasibility and triage question. So is deciding how much developer time to spend on the bug in the first place. It can't be answered without knowing how important the bug is.
I've had good outcomes (didn't show up again, and the code was better for it), and bad (spent too much time not fixing the problem, whether the bug ended up fixed or not). That's what estimates and issue priorities are for.
Sometimes I just have to sit and study the code until I find the bug. Try to prove that the bug is impossible, and in the process you may figure out where you might be mistaken. If you actually succeed in convincing yourself it's impossible, assume you messed up somewhere.
It may help to add a bunch of error checking and assertions to confirm or deny your beliefs/assumptions. Something may fail that you'd never expect to.
It can be difficult, and sometimes nearly impossible. But my experience is that you will sooner or later be able to reproduce and fix the bug if you spend enough time on it (whether that time is worth spending is another matter).
General suggestions that might help in this situation.
Add more logging, if possible, so that you have more data the next time the bug appears.
Ask the users if they can replicate the bug. If yes, have them replicate it while you watch over their shoulder, and hopefully you'll find out what triggers the bug.
Make random changes until something works :-)
Assuming you have already added all the logging that you think would help and it didn't... two things spring to mind:
Work backwards from the reported symptom. Think to yourself: "If I wanted to produce the symptom that was reported, what bit of code would I need to be executing, how would I get to it, and how would I get to that?" D leads to C leads to B leads to A. Accept that if a bug is not reproducible, then normal methods won't help. I've had to stare at code for many hours with these kinds of thought processes going on to find some bugs. Usually it turns out to be something really stupid.
Remember Bob's first law of debugging: if you can't find something, it's because you're looking in the wrong place :-)
Think. Hard. Lock yourself away, admit no interruptions.
I once had a bug where the evidence was a hex dump of a corrupt database. The chains of pointers were systematically screwed up. All the user's programs, and our database software, worked faultlessly in testing. I stared at it for a week (it was an important customer), and after eliminating dozens of possible ideas, I realised that the data was spread across two physical files and the corruption occurred where the chains crossed file boundaries. I realized that if a backup/restore operation failed at a critical point, the two files could end up "out of sync", restored to different time points. If you then ran one of the customer's programs on the already-corrupt data, it would produce exactly the knotted chains of pointers I was seeing. I then demonstrated a sequence of events that reproduced the corruption exactly.
Modify the code where you think the problem is happening, so extra debug info is recorded somewhere. When it happens next time, you will have what you need to solve the problem.
There are two types of bugs you can't replicate. The kind you discovered, and the kind someone else discovered.
If you discovered the bug, you should be able to replicate it. If you can't replicate it, then you simply haven't considered all of the contributing factors leading towards the bug. This is why whenever you have a bug, you should document it. Save the log, get a screenshot, etc. If you don't, then how can you even prove the bug really exists? Maybe it's just a false memory?
If someone else discovered a bug, and you can't replicate it, obviously ask them to replicate it. If they can't replicate it, then you try to replicate it. If you can't replicate it quickly, ignore it.
I know that sounds bad, but I think it is justified. The amount of time it will take you to replicate a bug that someone else discovered is very large. If the bug is real, it will happen again naturally. Someone, maybe even you, will stumble across it again. If it is difficult to replicate, then it is also rare, and probably won't cause too much damage if it happens a few more times.
You can be a lot more productive if you spend your time actually working, fixing other bugs and writing new code, than you will be trying to replicate a mystery bug that you can't even guarantee actually exists. Just wait for it to appear again naturally, then you will be able to spend all your time fixing it, rather than wasting your time trying to reveal it.
Discuss the problem, read code, often quite a lot of it. Often we do it in pairs, because you can usually eliminate the possibilities analytically quite quickly.
Start by looking at what tools you have available to you. For example, crashes on a Windows platform go to WinQual, so if this is your case you now have crash dump information. Do you have static analysis tools that spot potential bugs, runtime analysis tools, or profiling tools?
Then look at the input and output. Anything similar about the inputs in situations when users report the error, or anything out of place in the output? Compile a list of reports and look for patterns.
Finally, as David stated, stare at the code.
Ask the user to give you remote access to their computer so you can see everything yourself. Or ask the user to make a small video of how they reproduce the bug and send it to you.
Sure, both are not always possible, but if they are, they may clarify some things. The common ways of finding bugs are still the same: separating out the parts that may cause the bug, trying to understand what's happening, and narrowing down the code that could cause it.
There are tools like gotomeeting.com that you can use to share a screen with your user and observe the behaviour. There could be many potential problems, such as the number of programs installed on their machine or some utility conflicting with your program. GoToMeeting is not the only solution of this kind, and be aware there can be timeout issues or slow internet connections.
Most of the time, I would say, software does not report the correct error messages. For example, in Java and C#, track every exception: don't catch them all indiscriminately, but keep a point where you can catch and log them. UI bugs are difficult to solve unless you use remote desktop tools. And much of the time it could even be a bug in third-party software.
If you work on a real significant sized application, you probably have a queue of 1,000 bugs, most of which are definitely reproducible.
Therefore, I'm afraid I'd probably close the bug as WORKSFORME (Bugzilla) and then get on fixing some more tangible bugs. Or doing whatever the project manager decides to do.
Certainly making random changes is a bad idea, even if they're localised, because you risk introducing new bugs.

What can we do about a randomly crashing app without source code?

I am trying to help a client with a problem, but I am running out of ideas. They have a custom application, written in house, that runs on a schedule, but it crashes. I don't know how long it has been like this, so I don't think I can trace the crashes back to any particular software updates. The most unfortunate part is that there is no longer any source code for the VB6 DLL which contains the meat of the logic.
This VB6 DLL is kicked off by 2-3 function calls from a VB Script. Obviously, I can modify the VB Script to add error logging, but I'm not having much luck getting quality information to pinpoint the source of the crash. I have put logging messages on either side of all of the function calls and determined which of the calls is causing the crash. However, nothing is ever returned in the err object because the call is crashing wscript.exe.
I'm not sure if there is anything else I can do. Any ideas?
Edit: The main reason I care, even though I don't have the source code is that there may be some external factor causing the crash (insufficient credentials, locked file, etc). I have checked the log file that is created in drwtsn32.log as a result of wscript.exe crashing, and the only information I get is an "Access Violation".
I first tend to think this is something to do with security permissions, but couldn't this also be a memory access violation?
You may consider using one of the Sysinternals tools if you truly think this is a problem with the environment such as file permissions. I once used Filemon to figure out all the files my application was touching and discovered a problem that way.
You may also want to do a quick sanity check with Dependency Walker to make sure you are actually loading the DLL files you think you are. I have seen the wrong version of the C runtime being loaded and causing a mysterious crash.
Depending on the scope of the application, your client might want to consider a rewrite. Without source code, they will eventually be forced to do so anyway when something else changes.
It's always possible to use a debugger - either directly on the PC that's running the crashing app or on a memory dump - to determine what's happening to a greater or lesser extent. In this case, where the code is VB6, that may not be very helpful because you'll only get useful information at the Win32 level.
Ultimately, if you don't have the source code then will finding out where the bug is really help? You won't be able to fix it anyway unless you can avoid that code path for ever in the calling script.
You could use the debugging tools for windows. Which might help you pinpoint the error, but without the source to fix it, won't do you much good.
A lazier way would be to call the dll from code (not a script) so you can at least see what is causing the issue and inspect the err object. You still won't be able to fix it, unless the problem is that it is being called incorrectly.
The guy behind Coding The Wheel has a pretty interesting series about building an online poker bot, which is full of serious technical info, a lot of which is concerned with how to get into existing applications and mess with them, which is, in some way, what you want to do.
Specifically, he has an article on using WinDbg to get at important info, one on how to bend function calls to your own code and one on injecting DLLs in other processes. These techniques might help to find and maybe work around or fix the crash, although I guess it's still a tough call.
There are a couple of tools that may be helpful. First, you can use dependency walker to do a runtime profile of your app:
http://www.dependencywalker.com/
There is a profile menu and you probably want to make sure that the follow child processes option is checked. This will do two things. First, it will allow you to see all of the lib versions that get pulled in. This can be helpful for some problems. Second, the runtime profile uses the debug memory manager when it runs the child processes. So, you will be able to see if buffers are getting overrun and a little bit of information about that.
Another useful tool is process monitor from Mark Russinovich:
http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx
This tool will report all file, registry and thread operations. This will help you determine if you are bumping into file or registry credential issues.
Process explorer gives you a lot of the same information:
http://technet.microsoft.com/en-us/sysinternals/bb896653.aspx
This is also a Russinovich tool. I find that it is a bit easier to look at some data through this tool.
Finally, using debugging tools for windows or dev studio can give you some insight into where the errors are occurring.
An access violation is almost always a memory error - all the more likely in this case because it's crashing randomly (permission problems would likely be more obviously reproducible). In the case of a DLL it could be either of the following:
There's an error in the code in the dll itself - this could be something like a memory allocation error or even a simple loop boundary condition error.
There's an error when the dll tries to link out to another dll on the system. This will generally be caused by a mismatch between dll versions on the machine.
Your first step should be to try and get a reproducible crash condition. If you don't have a set of circumstances that will crash the system then you cannot know when you have fixed it.
I would then install the system on a clean machine and attempt to reproduce the error on that. Run a monitor and check precisely what other files (dlls etc) are open when the program crashes. I have seen code that crashes on a hyperthreaded Pentium but not on an earlier one - so restoring an old machine as a testbed may be a good option to cover that one. Varying the amount of ram in the machine is also worthwhile.
Hopefully these steps might give you a clue. Hopefully it will be an environment problem and so can be avoided by using the right version of Windows, DLLs, etc. However, if you're still stuck with the crash at this point with no good clues, then your options are either to rewrite or to attempt to hunt down the problem further by debugging the DLL at assembler level or disassembling it. If you are not familiar with assembly code then both of these are long shots, it's difficult to see what you will gain, and either option is likely to be a massive time sink. In the past, when faced with a particularly low-level, high-intensity problem like this, I have advertised on one of the 'coder for hire' websites and looked for someone with specialist knowledge. Again, you will need a reproducible error to be able to do this.
In the long run a dll without source code will have to be replaced. Paying a specialist with assembly skills to analyse the functions and provide you with flowcharts may well be worthwhile considering. It is good business practice to do this sooner in a controlled manner than later - like after the machine it is running on has crashed and that version of windows is no longer easily available.
You may want to try using Resource Hacker; you may have luck decompiling the in-house application. It may not give you the full source code, but it may at least give you some more info about what the app is doing, which may also help you determine your culprit.
Add the maximum possible RAM to the machine
This simple and cheap hack has worked for me in the past. Of course YMMV.
Reverse engineering is one possibility, although a tough one.
In theory you can decompile and even debug/trace a compiled VB6 application - this is the easy part, modifying it without source, in all but the most simple cases, is the hard part.
Free compilers/decompilers:
VB decompilers
VB debuggers
A rewrite would, in most cases, be a more successful and faster way to solve the problem.

Can you start and stop BoundsChecker (DevPartner)?

I'm trying to use boundschecker to analyze a rather complex program. Running the program with boundschecker is almost too slow for it to be of any use since it takes me almost a day to run the program to the point in the code where I suspect the issue exists. Can anyone give me some ideas for how to inspect only certain parts of my software using boundschecker (DevPartner) in Visual Studio 2005?
Thanks in advance for all your help!
I last used BoundsChecker a few years ago, and had the same problems. With large projects, it makes everything run so slowly that it is useless. We ended up ditching it.
But we still needed some of its functionality, though, like you, not for the whole program. So we had to do it ourselves.
In our case, we mainly used it to try and track down memory leaks. If that's your objective as well, there are other options.
Visual Studio does a pretty good job of telling you about memory leaks when your program exits
It reports leaks in the order that they were created
It will tell you exactly where the leaked memory was created if your source files have this at the top
#ifdef _DEBUG
#undef THIS_FILE
static char THIS_FILE[]=__FILE__;
#define new DEBUG_NEW
#endif
Those help a lot, but it's often not enough. Adding that snippet everywhere isn't always feasible. If you use factory classes, knowing where memory was allocated doesn't help at all. So when all else fails, we take advantage of #2.
Add something like the following:
#define LEAK(str) {char *s = (char*)malloc(100); strcpy(s, str);}
Then, pepper your code with LEAK("leak1"); or whatever. Run the program, and exit it. Your new leaked strings will display in Visual Studio's leak dump surrounding the existing leaks. Keep adding/moving your LEAK statements and re-running the program to narrow your search until you've pinpointed the exact location. Then fix the leak, remove your debugging leaks, and you're all set!
BoundsChecker tracks all memory allocations and releases in extreme detail. It knows, for instance, that such and such a memory allocation was done from the C runtime heap, which in turn was taken from a Win32 heap, which in turn started life as memory allocated by VirtualAlloc. If the application was instrumented (FinalCheck), it also has detailed information as to which pointers reference the memory.
This is one reason (of many) why the thing is slow.
If BC were to connect to an application late, it would have none of this data built up, and would have either (1) dig it all up at once, or (2) start guessing about things. Neither solution is very practical.
One way to lighten up BoundsChecker is to exclude from instrumentation all but the few modules you are interested in. I know that's not great, because if you knew where the leak was you wouldn't need BoundsChecker. What I usually recommend is that you use BC's Active Check mode first with only Memory Tracking enabled. You miss the API validations, but you could always rerun those separately. After you run Active Check and get clues about which modules tend to be problematic, only then do you enable instrumentation for the module or modules of interest and their dependencies.
We know Final Check is annoyingly slow, but as Mistiano correctly states, with Final Check BC keeps not only a graph of all allocated blocks but also all pointers and contexts to them. Therein lies the magic of how Final Check can nail leaks and corruptions at the point of occurrence, not just on application shutdown or fault.
Shameless plug: I work on the DevPartner team. We are releasing DPS 10.5 on February 4, 2011 with x64 application support in BC. Unlike the relatively ancient and undersold BC64 for Itanium, which only provided Active Check, DPS 10.5 provides full Final Check support for x64 applications, both for pure C++ and for native modules running in .NET processes. See microfocus.com under MF Developer for details.

What do you do if you cannot resolve a bug? [closed]

Did you ever have a bug in your code that you could not resolve? I hope I'm not the only one out there who has had this experience...
There are some classes of bugs that are very hard to track down:
timing-related bugs (that occur during inter-process-communication for example)
memory-related bugs (most of you know appropriate examples, I guess !!!)
event-related bugs (hard to debug, because every break point you run into makes your IDE the target for mouse release/focus events ...)
OS-dependent bugs
hardware-dependent bugs (occur on the release machine, but not on the developer machine)
...
To be honest, from time to time I fail to fix such a bug on my own ... After debugging for hours (or sometimes even days) I feel very demoralized.
What do you do in this situation (apart from asking others for help which is not always possible)?
Do you
use pencil and paper instead of a debugger
move on to another task and return to this bug later
...?
Please let me know!
Some things that help:
1) Take a break, approach the bug from a different angle.
2) Get more aggressive with tracing and logging.
3) Have another pair of eyes look at it.
4) A usual last resort is to figure out a way to make the bug irrelevant by changing the fundamental conditions in which it occurs
5) Smash and break things. (Stress relief only!)
I once worked for a company that sold a client-server application that was basically a file transfer and synchronization tool. Both the client and the server were custom applications we had designed.
We had a persistent bug that was very hard to duplicate in the lab. Our server could only handle a certain number of incoming client connections per box, so many of our customers would "cluster" multiple servers together to handle large user populations. The back end data for the cluster was on a file server they all shared. In this cluster configuration there was a bug that would happen under load where we would get a low-level file system error code on a file sharing call involving one of the back end files. Nobody could get this to repeat reliably in the lab, and even when they could they couldn't narrow down what was happening.
(I forget the exact error, it was probably 59 ERROR_UNEXP_NET_ERR or maybe 65 ERROR_NETWORK_ACCESS_DENIED. As I recall it was not even one of the documented error codes you were supposed to be able to get from the API we were calling, which was usually a lock or unlock call on a file section).
Since it involved the communication between the server and the back-end file store, and I was the "network transport" guy, I was tasked with looking at it. Many others had looked at it with no luck.
The one solid thing I had was I knew where in the code the error was being detected, but not what to do about it. So I needed to find the root cause. So I set up an appropriate hardware environment to duplicate it, and I put a custom build of the server software that instrumented the section of code in question.
The instrumentation was as follows: I added a test for the troublesome error code, and had it call a piece of code to send a UDP packet to a predetermined network address when the error occurred. The UDP packet contained a unique string in it to key on.
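As an illustration of that kind of beacon (not the original code; the address, port, and marker string are placeholders, and this uses present-day Winsock calls):
// Link against ws2_32.lib. Called from the branch that detects the troublesome error code.
#include <winsock2.h>
#include <ws2tcpip.h>
// Fire a one-shot UDP "beacon" so a sniffer configured to stop on this payload
// ends up capturing the network traffic leading up to the error.
void sendBugBeacon() {
    WSADATA wsa;
    if (WSAStartup(MAKEWORD(2, 2), &wsa) != 0) return;
    SOCKET s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    if (s != INVALID_SOCKET) {
        sockaddr_in dest = {};
        dest.sin_family = AF_INET;
        dest.sin_port = htons(9999);                         // placeholder port
        inet_pton(AF_INET, "192.168.1.50", &dest.sin_addr);  // placeholder sniffer host
        const char marker[] = "BUG-TRIGGER-XYZZY";           // unique string the sniffer keys on
        sendto(s, marker, sizeof(marker), 0,
               reinterpret_cast<sockaddr*>(&dest), sizeof(dest));
        closesocket(s);
    }
    WSACleanup();
}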
I then set a packet sniffing tool on the network. (At the time I was using Microsoft Network Monitor). I positioned it where it would be able to "see" this UDP packet when it was sent as well as all the communication between the cluster servers and the file server.
Most good sniffers have a mode where you can have it capture until it sees a particular piece of traffic, then stop. I turned on that mode and set it to look for that UDP packet my code would send. The goal was to end up with a packet capture of all the file server traffic right before the bug occurred. The very last network packets to and from the system where the UDP packet originated would presumably be a big clue as to what was happening.
I set the "stress test" configuration going and went home for the weekend.
When I got back on Monday, lo and behold I had my data. The sniffer had stopped just as expected after many hours of running and contained a capture. After studying the capture, what I found was that the Server Message Block or SMB (aka CIFS aka SAMBA) connection between our server and the file server was actually timing out at the TCP level due to extreme loading on the server. Because all of Microsoft's stuff is heavily layered, it would percolate back up through the file sharing stack as an "unexpected" error instead of returning a more intelligible error code that said "hey, you lost your connection at the TCP level".
I did a little more research on the TCP settings for Windows, and lo and behold the defaults for the version of Windows we were using (probably NT 4 in that era) were none too generous. It was only allowing for a very small number of failures on the TCP connection and boom, you were dead. Once you lost your SMB connection to the file server, all your file locks were toast and there was no way to recover.
So I ended up writing an appendix to the user manual that explained how to alter the TCP settings in Windows to make your cluster server a bit more tolerant of high load situations. And that was it. The fix to the bug was zero change in code, merely some additional documentation on how to properly configure the OS for use by this product.
What have we learned?
Be prepared to run altered versions of your code to investigate the problem
Consider using non-traditional tools to solve the problem (sniffers)
Not all bug fixes require code changes
Sometimes you can diagnose a bug while at home having a beer
I do a number of different things:
throw out all my assumptions and start from scratch. Remember, a bug exists because something which appears correct is actually wrong. Even those lines or functions or classes that you are absolutely certain are correct may be incorrect. Until you can convince yourself of the correctness you can't assume anything is right.
keep putting in print statements and assert statements to eliminate things and allow me to reform new assumptions.
step through code in the debugger if the problem is a control flow problem. Don't step over functions. Step in them and go through all the detail of their execution to confirm they are working right. Confirm the arguments and return values.
If a line or function or class is suspect but I can't prove it in situ, then write a small test case that does what you think the problem construct does. This may locate the problem or give some insights as to where to look next.
stop for the day. It's amazing what kind of offline processing your brain will do overnight. Often the answer or a key insight will appear the next day while I'm doing something mindless like showering or driving.
Create an automated way to cause the bug. The worst bug to fix is one that takes hours to reproduce.
Quote taken from "The Cryptonomicon":
"Intuition, like a flash of lightning, lasts only for a second. It generally comes when one is tormented by a difficult decipherment and when one reviews in his mind the fruitless experiments already tried. Suddenly the light breaks through and one finds after a few minutes what previous days of labor were unable to reveal."
I usually ask someone else to take a look at the code. While I'm explaining what the code is supposed to do, I sometimes see the bug just as I talk.
When a bug is a tough one, I sit and work until I figure it out and solve the problem. Interestingly enough, there are times when catching a mysterious bug is more enjoyable than everything running smoothly. And the relief and feeling when a bug is resolved, well, not many other things can beat that (except the obvious ones).
If all else fails, don't tackle it directly. Rewrite the problem area code in a more refactored way.
I have definitely had bugs which I worked on for 4-5 days continuously before finding a solution. Other bugs have sat in the bug tracker for months, as I put in a few hours spread out over a long period of time. I think this sort of bug is inevitable in any complex software project.
Some stuff that works well for me:
binary search through the program flow with logging
use Trace statements along with DbgView to search for bugs which show up in release mode (see the sketch after this list)
find an alternate way to reproduce the bug without changing the code
(works against logic, but...) change the code so that the bug is more easily reproducible (the failure condition is more readily achieved)
sleep on it and try again tomorrow with a fresh pair of eyes :)
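For the DbgView point above, the mechanism behind it is OutputDebugString; a minimal sketch of a release-mode trace helper (the format strings and call sites are just examples):
#include <windows.h>
#include <cstdarg>
#include <cstdio>
// DebugView (DbgView) displays OutputDebugString output from a running process,
// even in release builds with no debugger attached.
void trace(const char* fmt, ...) {
    char buf[512];
    va_list args;
    va_start(args, fmt);
    vsnprintf(buf, sizeof(buf), fmt, args);
    va_end(args);
    OutputDebugStringA(buf);
}
int main() {
    trace("entering suspect section, value=%d\n", 42);
    // ... suspect code ...
    trace("leaving suspect section\n");
}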
The worst sort of bug in my opinion is a concurrency bug which disappears when logging is inserted.
Lots of great answers here. One thing that's worked for me in the past is to ask "what can I do to make it totally obvious when this problem has occurred?".
For example, if the problem is a corrupted value in a data structure, try building a consistency-check routine that you can run periodically. Also consider implementing all access to the shared data through a set of functions that log each change.
Or, if the problem is a "random" memory overwrite, use a replacement malloc()/free() implementation that traps writing to "free" memory (like electric fence or dmalloc).
Someone else mentioned automating the process of triggering the bug. This is great if you can do it. Even having a routine that randomly exercises the program might help in these cases.
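Returning to the consistency-check idea, here is a minimal C++ sketch (the Account type and its invariant are invented for the example; the point is that every mutation goes through one funnel that re-validates the invariant on the spot):
#include <cassert>
#include <vector>
// Hypothetical shared structure suspected of being corrupted.
struct Account {
    long long balanceCents = 0;
    std::vector<long long> postedCents;   // every posting that contributed to the balance
};
// Make corruption obvious the moment it exists, rather than as a mysterious symptom later.
void checkConsistency(const Account& a) {
    long long sum = 0;
    for (long long p : a.postedCents) sum += p;
    assert(sum == a.balanceCents && "Account invariant violated: balance != sum of postings");
}
// Funnel all changes through one function so each change can be logged and re-checked.
void post(Account& a, long long amountCents) {
    a.postedCents.push_back(amountCents);
    a.balanceCents += amountCents;
    checkConsistency(a);   // run after every change, or periodically from a timer
}
int main() {
    Account a;
    post(a, 1500);
    post(a, -300);
}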
Seriously? I do things in this order.
Go to bed
Ask a colleague
Rewrite so the area isn't affected.
Ask SO
Raise a support ticket with your 3rd party library vendor.
"What do you do in this situation (apart from asking others for help which is not always possible)?"
When is it not possible to ask for help?
There are always others you can turn to for assistance - your coworkers, your boss, friends here at Stack Overflow, etc.
Understanding when to seek help shouldn't be demoralizing!
There are a lot of good tips here.
One that I absolutely do not agree with is the concept of changing the code and hoping the bug will go away. First off, you are probably going to introduce new bugs. Second, you can easily change things enough to hide the bug, only to have it resurface again with the next patch.
Memory corruption bugs are especially likely to vanish as magically as they turn up. However, the memory corruption bug isn't fixed; it is only that non-fatal areas of memory are getting trashed.
1) Try a different debugger. For example, I use WinDbg more and more often. When you load a program in a debugger, the memory layout of your application will change slightly. Maybe a different debugger will cause the error to manifest slightly differently.
2) If you resort to changing code without knowing exactly what the problem is, then if the bug goes away, YOU MUST go back and understand why the change fixed the bug. Otherwise, you are probably just hiding the bug.
3) Talk to others about the bug, maybe they have seen different versions of the same problem (i.e. other ways to recreate it)
4) Logging.
I've had bugs that took weeks or months before a solution was found, but eventually all bugs do get fixed. Aside from the classical non-debugger bug-tracking techniques like disabling parts of the system until you get a minimal test case, I've used these techniques:
Looking for better debugging tools. A new perspective goes a long way. Xdebug is something I started using in PHP only because of a performance bug that I wasn't making headway on.
Studying the technology that the bug is located in. This has helped to debug an outlook add-in. It had random errors that made no sense and that google searches turned up zilch about. By researching outlook add-in best practices, COM and MAPI programming, we got a clearer picture of what could go wrong, and thought of new things to try to fix the bugs, which eventually did fix them.
Trying to exacerbate the problem. If there's an issue that only happens occasionally, I'll try to find ways to make it happen constantly. This has helped to track down errors in web apps under IE and also to narrow down a crashing bug in the flash plugin.
When all else failed, I've rewritten the subsystem that caused problems from scratch. This may take a few days, or even weeks, but if you're stuck on a bug, and can't resolve it, and customers won't take no for an answer, what else can you do? This doesn't always fix things, but if it doesn't, you usually get a clearer picture of what's going wrong.
I've noticed a few commonalities in these bugs that I get stuck on for weeks:
Asking 3rd parties for help rarely helps, and it's generally not a good idea to wait for someone else to come save the day.
Almost always the fault is inside some 3rd party closed source technology, especially when using obscure parts. IE had nasty bugs when trying to use client certificates. Flash didn't deal well with randomly generated drawing instructions (some of which were nonsensical). Outlook doesn't like it when you try to change form layout dynamically from code. These days I've learned to respect the "comfort zones" of proprietary tech.
I give it more time. I once had a bug (in a personal project) that I just could not figure out. I tried every debugging method I could think of, including Google, with no success. Six months later, I came back and found the bug within an hour or so. It wasn't something simple (something apparently undocumented was going on deep inside Swing), but I just looked at it in a way I hadn't before.
I've had this problem before; I believe everyone has. I have flat-out given up before: the bug was simply impossible to find, yet it kept crashing. When there's some kind of bug in the code, what I do is just sit down and concentrate on every bit of the code, little by little, until I find it. It's hard and it takes patience, but it's all you can do in such a situation.
Hope this helps.
I honestly cannot recall a bug that I couldn't fix. It may cause a lot of refactoring, or may take a while, but I've never had one that I can't get rid of. If it takes me more than an hour to track it down then it's almost always something really stupid and small like looking right past that : that should've been a ;, etc.
In python, if I'm using an editor that isn't mine, or maybe it's someone else's code, I use retab! in vim, or paste into something like pastie to check indentation (if I don't have vim available).
If it's not a crasher/deal breaker, then I move on and come back with a fresh pair of eyes.
Oh, and you can never, ever have too much logging.
I add as much debug as possible (write to log file, message boxes, etc.), and test.
I don't think this is the worst bug you can find. The worst ones are those you can't reproduce deterministically or in the testing environment.
I get a bit demoralized too when unable to solve a bug. Usually when I hit a wall with a bug, I just take note of my findings and stop working on it. I jump on another bug that is easier to solve and then come back to the original one. By doing this, I have a fresh mind and attitude in tackling the bug. Sometimes you have a tendency to overcomplicate things when spending too much time on a bug. Taking a break helps in breaking the wall.
RWendi
First off, is it reproducible? That's a HUGE plus if it is. I want bugs to either always happen or never happen... it's the intermittent ones that are the troublesome ones.
And it is going to depend on the problem, but at my shop we'll generally tag-team such a problem figuring that 2 heads (or 3 or 4) is better than 1.
Occasionally the bug won't even be in MY code, but it generally is. There have been issues where a 3rd party library was the culprit or a particular implementation on a particular platform was the cause - those stink.
I'll use anything and everything to at least track it down: debuggers, trace output, whatever.
Typically, if I can isolate it to a class or module I'll write a test harness to duplicate the real world and try to duplicate it there. I generally write my test code first, but sometimes legacy code (or other developer's code) exists that doesn't have tests already.
I generally will talk the design and problem through, out loud with the team and whiteboard anything that isn't clear. Often the solution will bubble to the surface once we talk about it as a group.
That's what I do.
I usually try hard to solve it. But if that is not possible within a reasonable window of time, I leave it for a while and let the brain cells solve it while I sleep ;) Sometimes it works...
I've considered asking for help on this website called StackOverflow that I've been frequenting lately...
This is what I did today...
I debug HW/SW interaction, and it's often the case that logging (instrumentation) changes or hides the bug. Hence tests are performed "at speed". I call these bugs "roaches", as they run away from any light I can shine on them.
So I have to:
Find the transaction that causes the bug. List the HW interaction via logging (this test passes, but it illustrates the flow).
Instrument before and after the bug to print state changes.
The bug I'm solving now is of course a worst case, as the HW locks up. The HW includes the CPU, so it's like being in a well-lit room when the power fails and it's pitch black.
I have a special backdoor view into memory, but of course this is locked up also. I tried power cycling in the hopes that the memory would stay non-volatile long enough to reenable the backdoor. No such luck. This is possible though.
I very very carefully wrote all the steps I went through to characterize this bug (what works, what fails etc). Sent this to developers with similar HW to verify it just wasn't me or my HW.
I took a few hours break to let this info settle and see if any lightbulbs lit elsewhere.
No replies, this bug is mine to solve...
This HW/SW interaction is a loop that does some setup, then enters a polling loop that reads when the transaction is finished. Many transactions should occur. Which transaction fails? Is it the first one (indicating I can debug the transaction and not some noise in the HW)? Is it always the Nth transaction? What makes the Nth different from the first or the (N-1)th? The SW is single-threaded and built to be predictable. No preemption, no interrupts enabled.
This SW has worked before, so what's new? All the HW is new. In this case all the silicon is new, as it's an ASIC. Even the embedded CPU is new and customized, so the ISA is new.
So I suspect everything and I'm blind. I'll have to sneak up on this roach.
I enabled just the log that reports how many times the SW polls the HW for completion. In this way the first transaction runs at speed, and I get an idea how often I touch the HW in a tight polling loop. The test passes. I know it's the Nth transaction, and I recorded the peak number of polls for all transactions (perhaps meaningless data).
After modifying anything, I have to put it back the way it was to verify the bug still exists. After all, the earth has rotated and the solar winds are not as strong ;)
Looked at all the checkins, saw a contractor changed some important setup parameters with no explanation. These (outsourced) people are still under evaluation. This will not help.
Found there was no spinwait in the polling loop. Bad for the loop timeout as without it the timeout depends on CPU speed. Added spinwait, still no happiness.
Limited the number of transactions to see where it fails, somewhere before 1000.
Setup the HW to run slower, still hangs.
Hate to leave anyone reading this hanging too, but this diatribe will have to wait till tomorrow.
There is no bug that can't be fixed, since there is no bug that can't be fixed with a total rewrite.
An unfixable bug is just a bug you aren't willing to replace.
For memory-related bugs, I have found that the memory profiling options of ANTS Profiler have helped me quite a bit in finding bugs.
Use more creative methods of tracking the bug down:
use remote debugging on the machine where it's reproducible.
use profiling tools.
introduce more logging in the app.
Going away for a while and then coming back to a problem is one common approach I do and have heard.
How easily the bug can be reproduced can be a factor as well, since if the error only occurs in one in a zillion runs of a program, fixing it could be considered a negligible gain, especially if you risk breaking something else in the process.
There is also the question of nailing down where the bug is: is it in some configuration, so that it occurs on a server but not on my local XP Pro machine running IIS 5.0? Other bugs may involve having to change the resolution of my machine; it can be annoying to try to reproduce a bug that others have reported.
You left out the "occurs under another O/S" category of bugs, where a web page that is fine in IE and Firefox on a PC may look like crap in Safari on a Mac. Do I get my hands dirty trying to fix a CSS issue, using my machine as a server and the Mac that is a row or two over in the cubicles on the floor to see the issue, or is it so low a priority that it gets swept under the rug? Alternatively, if a bug occurs on Linux and there aren't any Linux machines near me, what should I do?
I'm sorry to leave you with more questions, but these seem to be difficult questions for me at times.
In addition to the debugger, I've also used logging and old fashioned paper and pencil. On occasion I've found really hard bugs, like code that runs fine in debug mode, but breaks in release mode. I've even occasionally rewritten perfectly good code that for whatever reason, doesn't work reliably, figuring that it's better to be reliable than elegant.
I sometimes try to redefine what others term a bug as really being a feature, but that seldom works!
I have a bug that shows up every few months on a customer site. It usually happens at 3am and it's not discovered until early the next morning when the customer arrives at their site. And usually when they discover it, they want everything to get working immediately, so our support people generally just reboot the computer. It's been driving me nuts for years. It never happens on my test machine or in the QA lab, only at certain customer sites. Over time, I've
refactored some of the code that I thought was causing it
added more debugging printouts around where it appears to be crashing
redirected stdout so that next time I see it I can "kill -3" the process
given support some new tools to dump out the current state of database locks and the like.
added diagnostics to make it more obvious when it does happen
It hasn't happened in a few months, and I've got my fingers crossed that I might have fixed it this time, but I'm not counting on it.
If it's not critical, don't fix it, you'll just spend too much time!
Keep the bug open. Comment on it and work on it when you can. It might get fixed by accident (or by someone else) later on!
Sometimes it takes a little lateral thinking, but every bug is fixable. Sometimes you need to leave it and sleep over it, sometimes it's good to ask someone else to have a quick look (they may see something you haven't), but mostly it's about trying different things, calling up on previous experience. It can be frustrating, but the buzz you get when you do fix it, is like no other!

Finding GDI/User resource usage from a crash dump

I have a crash dump of an application that is supposedly leaking GDI. The app is running on XP and I have no problems loading it into WinDbg to look at it. Previously we have used the Gdikdx.dll extension to look at GDI information, but this extension is not supported on XP or Vista.
Does anyone have any pointers for finding GDI object usage in WinDbg?
Alternatively, I do have access to the failing program (and its stress testing suite) so I can reproduce on a running system if you know of any 'live' debugging tools for XP and Vista (or Windows 2000 though this is not our target).
I've spent the last week working on a GDI leak finder tool. We also perform regular stress testing, and it never lasted longer than a day's worth without stopping due to user/GDI object handle overconsumption.
My attempts have been pretty successful as far as I can tell. Of course, I spent some time beforehand looking for an alternative and quicker solution. It is worth mentioning that I had some previous, semi-lucky experience with the GDILeaks tool from the MSDN article mentioned in another answer. Not to mention that I had to solve a few problems to get it working at all, and this time it just didn't give me what I wanted, or how I wanted it. The downside of their approach is the heavyweight debugger interface (it slows down the target under investigation by orders of magnitude, which I found unacceptable). Another downside is that it did not work all the time; on some runs I simply could not get it to report/compute anything! Its complexity (judging by the amount of code) was another scare-away factor. I'm not a big fan of GUIs, as it is my belief that I'm more productive with no windows at all ;o). I also found it hard to make it find and use my symbols.
One more tool I used before setting out to write my own was the leakbrowser.
Anyway, I finally settled on an iterative approach to achieve the following goals:
minor performance penalties
implementation simplicity
non-invasiveness (used for multiple products)
relying as much as possible on what was already available
I used Detours (non-commercial use) for core functionality (it is an injectable DLL). I put JavaScript to use for automatic code generation (a 15K script to generate 100K of source code - no way I'd code this manually, and no C preprocessor involved!), plus a WinDbg extension for data analysis and snapshot/diff support.
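Not the author's actual tool, but as an illustration of the Detours approach, a minimal injectable DLL that hooks a single GDI allocator might look roughly like this (CreatePen only; a real leak finder would hook many more entry points and capture call stacks):
// Build as a DLL and link against detours.lib and gdi32.lib; inject into the target process.
#include <windows.h>
#include <detours.h>
static LONG g_penCount = 0;
// Pointer to the original CreatePen; DetourAttach rewrites it to point at a trampoline.
static HPEN (WINAPI* TrueCreatePen)(int, int, COLORREF) = CreatePen;
static HPEN WINAPI HookedCreatePen(int style, int width, COLORREF color) {
    InterlockedIncrement(&g_penCount);          // bookkeeping; a real tool would also record a stack trace
    return TrueCreatePen(style, width, color);  // forward to the real GDI call
}
BOOL WINAPI DllMain(HINSTANCE, DWORD reason, LPVOID) {
    if (reason == DLL_PROCESS_ATTACH) {
        DetourTransactionBegin();
        DetourUpdateThread(GetCurrentThread());
        DetourAttach(&(PVOID&)TrueCreatePen, (PVOID)HookedCreatePen);
        DetourTransactionCommit();
    } else if (reason == DLL_PROCESS_DETACH) {
        DetourTransactionBegin();
        DetourUpdateThread(GetCurrentThread());
        DetourDetach(&(PVOID&)TrueCreatePen, (PVOID)HookedCreatePen);
        DetourTransactionCommit();
    }
    return TRUE;
}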
To tell the long story short - after I was finished, it was a matter of a few hours to collect information during another stress test and another hour to analyze and fix the leaks.
I'll be more than happy to share my findings.
P.S. I did spend some time trying to improve on the previous work. My intention was to minimize false positives (I've seen just about too many of those while developing), so the tool also checks for allocation/release consistency and avoids taking into account allocations that are never leaked.
Edit: Find the tool here
There was an MSDN Magazine article from several years ago that talked about GDI leaks. It points to several different places with good information.
In WinDbg, you may also try the !poolused command for some information.
Finding resource leaks from a crash dump (post-mortem) can be difficult - if it was always the same place, using the same variable that leaks the memory, and you're lucky, you could see the last place where it will be leaked, etc. It would probably be much easier with a live program running under the debugger.
You can also try using Microsoft Detours, but the license doesn't always work out. It's also a bit more invasive and advanced.
I have created a WinDbg script for that. Look at the answer to
Command to get GDI handle count from a crash dump
To track the allocation stack you could set a ba (Break on Access) breakpoint past the last allocated GDICell object to break just at the point when another GDI allocation happens. That could be a bit complex because the address changes but it could be enough to find pretty much any leak.

Resources