How to diagnose emacs lisp problems involving indirect buffers? - elisp

I am using Dave Love's noweb-mode to edit a file that is a mix of LaTeX and C code. Love's mode uses his multi-mode to switch back and forth between modes.
This switching is accomplished via indirect buffers.
In Emacs 21, the mode appears to work well. But a forced upgrade to Emacs 23 has revealed problems:
When making a transition between modes, mark is lost.
When viewing the same buffer in two different visible windows, movement in window A occasionally causes window B to change what it is displaying, and window B's point moves as well.
I'm trying to diagnose and repair these faults. I managed to work around problem 1 by turning off all buffer/mode switching while (region-active-p). But problem 2 has me completely stumped. I don't even know how to diagnose it.
I am looking for any help, but especially answers to either of these two questions:
How should I try to diagnose this problem?
Where can I find a clear and more complete explanation of the semantics of indirect buffers? The GNU Emacs Lisp Reference Manual does not say much, and I'm not sure reading the source code is the best next step.

Semantics and Emacs are two very separate worlds, as you might imagine. Similarly there are no "clear and complete explanations", sadly. Basically, indirect buffers share their buffer text and text properties, as well as some internal variables, while they keep separate buffer-local variables and separate overlays. The division between what is shared and what isn't is largely arbitrary. To make things worse, indirect buffers are rarely used, so bugs and unwarranted assumptions inevitably creep in.
To track down your problems, the best approach is first to come up with reliably reproducible recipes.
I can reproduce some weird behavior w.r.t. the mark where the base buffer's mark ends up being a marker in the indirect buffer, so that looks like a plain bug somewhere (and it doesn't seem to be fixed in 24.1, sadly). Please M-x report-emacs-bug.

Related

Is it a good habit to run in debug mode?

I saw a comment on another question (I forget which one) encouraging the asker to avoid testing his/her code in the debug harness unless strictly necessary, citing something to the effect of it acting as a crutch. There's certainly something to be said for developing the skill to deduce the cause of bugs without "direct" evidence. I'm quite a fan of debuggers myself (in fact, I tend to run without one only when strictly necessary), but I got to thinking about the relative merits of each approach.
Debugger Pros
Starting with the obvious: it takes less time to zero in on faults, exceptions and crashes
Tracing provides a nice alternative to littering your code with commented-out print statements
Performance overhead can give you extra wiggle room, i.e. if your program is responsive while debugging, it will almost definitely be so in the wild
Debugger Cons
Performance overhead can make iterations slower
(Edit) Tunnel Vision: Debugging the symptom can distract you from deducing the cause when the crash occurs long after or far from the defect
It may "help" you by initializing variables or otherwise masking bugs, leading to surprises later on
Conversely, there's the odd bug that only crops up in a debug configuration; tracking it down may be a waste of effort (though, this is often indicative of a deeper, subtler problem that is worth fixing)
These are general, of course--it varies wildly with language, environment and situation--but what are some other considerations?
I've had this argument many times. The debugger is only a crutch if you use it like one. I've met people who refused to use a debugger even to get a stack trace of where a piece of code crashed, instead using printf bisection to find the crashing line of code (this would take a day or more... seriously, people?)
One problem you might encounter when using a debugger is tunnel vision. The debugger has a way of focusing your attention on the immediate area where the bug became apparent -- whether it's a crash, incorrect data, or otherwise -- at the expense of stealing your attention from other areas that might benefit from some investigation. On the other hand, actually watching code execute in a debugger can sometimes free you from your mental trap of thinking about the code the wrong way. You might swear it does X when it actually does Y -- seeing it do Y before your very eyes is sometimes a profound moment.
That said, I only fire up the debugger in two circumstances:
A bug has manifested whose cause I cannot guess after five minutes or so of thought
I'm trying to understand some code I'm not familiar with, and I want to watch it execute
Honestly, the time in the debugger is usually just a few minutes, then the problem is found. Fixing the problem is usually the hard part, and the debugger is of little use for that.
I think it's a mistake, not so much to always have a debugger at the ready, or to even run code always under the debugger, but to run a DEBUG BUILD. You already pointed out the worst of the problems with this. Memory allocations tend to happen differently, uninitialized data is filled with different values, etc. If the first time you fire up the release build is a few weeks before QA gets their hands on it (or, in a crazy shop, before you start shipping), you may be in for a world of serious pain.
I have only once seen a bug which only manifested in the debug build. A few people argued that it wasn't important because that isn't what we ship, but I looked into it anyway and found a REALLY bad problem.
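For illustration, here is a minimal C sketch (all names invented) of the classic debug-build trap mentioned above: reading an uninitialized local is undefined behavior, but debug builds often fill stack memory with zeros or a fixed pattern, so the bug can stay hidden until the first optimized release build runs.
    #include <stdio.h>

    static int sum_positive(const int *values, int count)
    {
        int total;                      /* BUG: never initialized */
        for (int i = 0; i < count; i++) {
            if (values[i] > 0)
                total += values[i];
        }
        return total;                   /* may happen to read as 0 in a debug build,
                                           garbage in an optimized one */
    }

    int main(void)
    {
        int data[] = { 3, -1, 4 };
        printf("%d\n", sum_positive(data, 3));  /* the fix is simply: int total = 0; */
        return 0;
    }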
Like any tool the debugger has appropriate and inappropriate uses. There are no bad tools.
If you're reasonably certain you have some bugs to deal with, running in debug mode tends to make finding them a bit faster. If you're at the point that you think the bugs are gone, you want to simulate the target environment as closely as possible, which usually means turning debug mode off.
Depending on your language, tools, etc., chances are pretty decent that you can also do something that's more or less a hybrid of the two: generate debugging information, but build everything else like release mode. This is often extremely helpful as well, so you can do debugging on the code after it's generated the way the customer will see it (but beware that optimization can produce oddities, such as changing the order of code...)
Ultimately, you should run your tests in the same configuration your code will be running in the wild.
Then if a test fails, you can drop back to debug mode, and if it still fails, track it down and fix it. If it "fixes" itself when run in debug mode, then be glad you found it now rather than when you shipped, and get to tracking down the root cause in other ways.
Especially with GCC, the compiler definitely likes to help you along, but honestly issues don't crop up very often.
I see nothing wrong with developing in debug mode; usually the only time you get odd behavior is when you aren't quite following the standards for the language anyway (not initializing variables, etc.).
Once you're ready to release, switch off debug and run the tests again, locate bugs, rinse and repeat. I have very rarely come in contact with "debug mode only" bugs.
I like to run in debug mode during development. One big reason is simply so I'm set up to run the debugger when needed.
When I was doing more C++/MFC programming, I would put ASSERTS all over the place (or Tracing, as you described it) and caught many bugs and found many wrong assumptions in the process.
Who cares if it runs a bit slower? I say develop in debug mode. It's extremely rare that I have errors that turn up when I switch to release builds. But I generally start running release builds when I'm nearing completion. If I have some testers, I'd obviously send them release builds.
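As a minimal sketch of that ASSERT habit in C (the function is invented): assert() documents assumptions during development and compiles away entirely when the release build defines NDEBUG, so the checks cost nothing in what you ship.
    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* Debug build:   cc -g file.c            -> asserts are live             */
    /* Release build: cc -O2 -DNDEBUG file.c  -> asserts compile to nothing   */
    static size_t copy_name(char *dst, size_t dst_size, const char *src)
    {
        assert(dst != NULL);            /* assumptions, checked only in debug */
        assert(src != NULL);
        assert(dst_size > 0);

        size_t len = strlen(src);
        if (len >= dst_size)
            len = dst_size - 1;         /* truncate rather than overflow      */
        memcpy(dst, src, len);
        dst[len] = '\0';
        return len;
    }

    int main(void)
    {
        char buf[8];
        copy_name(buf, sizeof buf, "diagnostics");
        return 0;
    }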
I found writing printf much superior to debugging in Python, simply because it is easier and you do not need to recompile. Watching variables, even with PyDev in Eclipse, is painful. One of the reasons is that I was always "debugging" code that was not much more than a few tens of lines long.
On the other hand, with larger projects in C I use the debugger. It is different in that in C there are pointers and arrays, you want to observe structs, and it is not simple to write a quick printf to see whether your serialization code works correctly.
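As a concrete illustration (the struct and its fields are invented), this is roughly the kind of dump helper you end up writing in C when a one-line printf isn't enough to see whether serialized data survived the round trip:
    #include <stdio.h>
    #include <stdint.h>

    struct packet_header {
        uint16_t type;
        uint16_t flags;
        uint32_t length;
    };

    /* Print every field once, so "before" and "after" dumps can be diffed. */
    static void dump_header(const char *label, const struct packet_header *h)
    {
        fprintf(stderr, "%s: type=%u flags=0x%04x length=%lu\n",
                label,
                (unsigned)h->type,
                (unsigned)h->flags,
                (unsigned long)h->length);
    }

    int main(void)
    {
        struct packet_header h = { 1, 0x0004, 128 };
        dump_header("before encode", &h);   /* compare against the decoded copy */
        return 0;
    }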
I always use the debugger, and I will always single-step new code line by line. There's a whole generation of kids that debug with print statements (especially for web development). IMHO that's why the software state of the art is lagging.
Of course, you have to run your unit tests with both debugging on and off, but I find true compiler bugs related to optimizations are rare.

Can this kernel function be more readable? (Ideas needed for academic research!)

Following my previous question regarding the rationale behind extremely long functions, I would like to present a specific question regarding a piece of code I am studying for my research. It's a function from the Linux kernel which is quite long (412 lines) and complicated (a McCabe cyclomatic complexity (MCC) index of 133). Basically, it's a long and nested switch statement.
Frankly, I can't think of any way to improve this mess. A dispatch table seems both huge and inefficient, and any subroutine call would require an inconceivable number of arguments in order to cover a large-enough segment of code.
Do you think of any way this function can be rewritten in a more readable way, without losing efficiency? If not, does the code seem readable to you?
Needless to say, any answer that will appear in my research will be given full credit - both here and in the submitted paper.
Link to the function in an online source browser
I don't think that function is a mess. I've had to write such a mess before.
That function is the translation into code of a table from a microprocessor manufacturer. It's very low-level stuff, copying the appropriate hardware registers for the particular interrupt or error reason. In this kind of code, you often can't touch registers which have not been filled in by the hardware - that can cause bus errors. This prevents the use of code that is more general (like copying all registers).
I did see what appeared to be some code duplication. However, at this level (operating at interrupt level), speed is more important. I wouldn't use Extract Method on the common code unless I knew that the extracted method would be inlined.
BTW, while you're in there (the kernel), be sure to capture the change history of this code. I have a suspicion that you'll find there have not been very many changes in here, since it's tied to hardware. The nature of the changes over time of this sort of code will be quite different from the nature of the changes experienced by most user-mode code.
This is the sort of thing that will change, for instance, when a new consolidated IO chip is implemented. In that case, the change is likely to be copy and paste and change the new copy, rather than to modify the existing code to accommodate the changed registers.
Utterly horrible, IMHO. The obvious first-order fix is to make each case in the switch a call to a function (see the sketch below). And before anyone starts mumbling about efficiency, let me just say one word - "inlining".
Edit: Is this code part of the Linux FPU emulator by any chance? If so, this is very old code that was a hack to get Linux to work on Intel chips like the 386, which didn't have an FPU. If it is, it's probably not a suitable study for academics, except for historians!
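A hedged sketch of what that first-order fix might look like in C (the state struct, case values, and handler names are invented, not taken from the kernel source). With small static inline helpers the compiler can fold the calls back in, so readability improves without adding a call in the hot path:
    struct fault_state {
        unsigned status;
        unsigned addr;
    };

    /* Each former case body becomes a small, named helper. */
    static inline int handle_divide_error(struct fault_state *s)
    {
        s->status |= 0x1;
        return 0;
    }

    static inline int handle_overflow(struct fault_state *s)
    {
        s->status |= 0x2;
        return 0;
    }

    /* The switch shrinks to a readable dispatcher. */
    int dispatch_fault(unsigned reason, struct fault_state *s)
    {
        switch (reason) {
        case 0x00: return handle_divide_error(s);
        case 0x04: return handle_overflow(s);
        default:   return -1;   /* unknown reason */
        }
    }

    int main(void)
    {
        struct fault_state s = { 0, 0 };
        return dispatch_fault(0x04, &s);
    }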
There's a kind of regularity here; I suspect that for a domain expert this actually feels very coherent.
Also, having the variations in close proximity allows immediate visual inspection.
I don't see a need to refactor this code.
I'd start by defining constants for the various classes. Coming into this code cold, it's a mystery what the switching is for; if the switching was against named constants, I'd have a starting point.
Update: You can get rid of about 70 lines where the cases return MAJOR_0C_EXCP; simply let them fall through to the end of the routine. Since this is kernel code I'll mention that there might be some performance issues with that, particularly if the case order has already been optimized, but it would at least reduce the amount of code you need to deal with.
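A small sketch of both suggestions combined (the enum values below are invented for illustration; only the MAJOR_0C_EXCP name comes from the function under discussion): switch on named constants instead of bare numbers, and let the many cases that return the exception code fall through to one common exit.
    enum fpu_op {
        OP_FADD = 0x00,
        OP_FMUL = 0x01,
        OP_FCOM = 0x02,
        OP_FDIV = 0x06
    };

    #define MAJOR_0C_EXCP (-1)

    int classify_op(int op)
    {
        switch (op) {
        case OP_FADD:
        case OP_FMUL:
            return 0;           /* handled inline                            */
        case OP_FCOM:           /* these cases all fall through ...          */
        case OP_FDIV:
            break;              /* ... to the single exception return below  */
        default:
            break;
        }
        return MAJOR_0C_EXCP;
    }

    int main(void)
    {
        return classify_op(OP_FADD);
    }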
I don't know much about kernels or about how re-factoring them might work.
The main thing that comes to my mind is taking that switch statement and breaking each sub-step out into a separate function with a name that describes what the section is doing. Basically, more descriptive names.
But I don't think this optimizes the function any further. It just breaks it into smaller functions, which might be helpful... I don't know.
That is my 2 cents.

Code reuse and refactoring [closed]

What's best practice for reuse of code versus copy/paste?
The problem with reuse can be that changing the reused code will affect many other pieces of functionality.
This is good & bad: good if the change is a bugfix or a useful enhancement, bad if other code that reuses it unexpectedly breaks because it relied on the old version (or because the new version has a bug).
In some cases it would seem that copy/paste is better - each user of the pasted code has a private copy which it can customize without consequences.
Is there a best practice for this problem; does reuse require watertight unit tests?
Every line of code has a cost.
Studies show that the cost is not linear with the number of lines of code; it's exponential.
Copy/paste programming is the most expensive way to reuse software.
"does reuse require watertight unit tests?"
No.
All code requires adequate unit tests. All code is a candidate for reuse.
It seems to me that a piece of code used in multiple places, which might need to change for one place but not for another, isn't following proper rules of scope. If the "same" method/class is needed by two different things to do two different jobs, then that method/class should be split up.
Don't copy/paste. If it does turn out that you need to modify the code for one place, then you can extend it, possibly through inheritance, overloading, or if you must, copying and pasting. But don't start out by copy-pasting similar segments.
Using copy and paste is almost always a bad idea. As you said, you can have tests to catch cases where you break something.
The point is, when you call a method, you shouldn't really care about how it works, but about what it does. If you change the method, changing what it does, then it should be a new method, or you should check wherever this method is called.
On the other hand, if the change doesn't modify WHAT the method does (only how), then you shouldn't have a problem elsewhere. If you do, you've done something wrong...
One very appropriate use of copy and paste is Triangulation. Write code for one case, see a second application that has some variation, copy & paste into the new context - but you're not done. It's if you stop at that point that you get into trouble. Having this code duplicated, perhaps with minor variation, exposes some common functionality that your code needs. Once it's in both places, tested, and working in both places, you should extract that commonality into a single place, call it from the two original places, and (of course) re-test.
If you have concerns that code which is called from multiple places is introducing risk of fragility, your functions are probably not fine-grained enough. Excessively coarse-grained functions, functions that do too much, are hard to reuse, hard to name, hard to debug. Find the atomic bits of functionality, name them, and reuse them.
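A minimal C sketch of that triangulation step (all names invented): two copy-pasted report functions turn out to differ only in their title, so the shared body is extracted once and both callers shrink to a single line.
    #include <stdio.h>

    /* The extracted commonality: one place to fix, one place to test. */
    static void print_report(const char *title, const int *totals, int n)
    {
        printf("== %s ==\n", title);
        for (int i = 0; i < n; i++)
            printf("  item %d: %d\n", i, totals[i]);
    }

    void print_daily_report(const int *totals, int n)
    {
        print_report("Daily totals", totals, n);
    }

    void print_weekly_report(const int *totals, int n)
    {
        print_report("Weekly totals", totals, n);
    }

    int main(void)
    {
        int totals[] = { 12, 7, 3 };
        print_daily_report(totals, 3);
        print_weekly_report(totals, 3);
        return 0;
    }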
So the consumer (reuser) code depends on the reused code; that's right.
You have to manage this dependency.
It is true for binary reuse (e.g. a DLL) and code reuse (e.g. a script library) as well.
The consumer should depend on a certain (known) version of the reused code/binary.
The consumer should keep a copy of the reused code/binary, but never directly modify it, only updating to a newer version when it is safe.
Think carefully when you modify the reused codebase. Branch for breaking changes.
If a consumer wants to update the reused code/binary, it first has to test whether the update is safe. If tests fail, the consumer can always fall back to the last known (and kept) good version.
So you can benefit from reuse (e.g. you only have to fix a bug in one place), and you're still in control of changes. But nothing saves you from testing whenever you update the reused code/binary.
Is there a best practice for this problem; does reuse require watertight unit tests?
Yes, and sort of yes. Rewriting code you have already gotten right once is never a good idea. If you never reuse code and just rewrite it, you are doubling your bug surface. As with many best-practice-type questions, Code Complete changed the way I do my work. Yes, unit test to the best of your ability; yes, reuse code; and get a copy of Code Complete and you will be all set.
Copy and pasting is never good practice. Sometimes it might seem better as a short-term fix in a pretty poor codebase, but in a well-designed codebase you will have the following, affording easy re-use:
encapsulation
well defined interfaces
loose-coupling between objects (few dependencies)
If your codebase exhibits these properties, copy and pasting will never look like the better option. And as S Lott says, there is a huge cost to unnecessarily increasing the size of your codebase.
Copy/Paste leads to divergent functionality. The code may start out the same but over time, changes in one copy don't get reflected in all the other copies where it should.
Also, copy/paste may seem "OK" in very simple cases but it also starts putting programmers into a mindset where copy/paste is fine. That's the "slippery slope". Programmers start using copy/paste when refactoring should be the right approach. You always have to be careful about setting precedent and what signals that sends to future developers.
There's even a quote about this from someone with more experience than I:
"If you use copy and paste while you're coding, you're probably committing a design error."-- David Parnas
You should be writing unit tests, and while yes, having cloned code can in some sense give you the feeling that your change isn't affecting a large number of other routines, it is probably a false sense of security. Basically, your sense of security comes from not knowing how the code is used. (Ignorance here isn't a pejorative; it's just the result of not being able to know everything about the codebase.) Get used to using your IDE to learn where the code is being used, and get used to reading code to know how it is being used.
Where you write:
The problem with reuse can be that changing the reused code will affect many other pieces of functionality. ... In some cases it would seem that copy/paste is better - each user of the pasted code has a private copy which it can customize without consequences.
I think you've reversed the concerns related to copy-paste. If you copy code to 10 places and then need to make a slight modification to behavior, will you remember to change it in all 10 places?
I've worked on an unfortunately large number of big, sloppy codebases and generally what you'll see is the results of this - 20 versions of the same 4 lines of code. Some (usually small) subset of them have 1 minor change, some other small (and only partially intersecting subset) have some other minor change, not because the variations are correct but because the code was copied and pasted 20 times and changes were applied almost, but not quite consistently.
When it gets to that point it's nearly impossible to tell which of those variations are there for a reason and which are there because of a mistake (and since it's more often a mistake of omission - forgetting to apply a patch rather than altering something - there's not likely to be any evidence or comments).
If you need different functionality call a different function. If you need the same functionality, please avoid copy paste for the sanity of those who will follow you.
There are metrics that can be used to measure your code, and it's up to you (or your development team) to decide on an adequate threshold. Ruby on Rails has the "Metric-Fu" Gem, which incorporates many tools that can help you refactor your code and keep it in tip-top shape.
I'm not sure what tools are available for other languages, but I believe there is one for .NET.
In general, copy and paste is a bad idea. However, like any rule, this has exceptions. Since the exceptions are less well-known than the rule I'll highlight what IMHO are some important ones:
You have a very simple design for something that you do not want to make more complicated with design patterns and OO stuff. You have two or three cases that vary in about a zillion subtle ways, i.e. a line here, a line there. You know from the nature of the problem that you won't likely ever have more than 2 or 3 cases. Sometimes it can be the lesser of two evils to just cut and paste rather than to engineer the hell out of the thing to solve a relatively simple problem like this. Code volume has its costs, but so does conceptual complexity.
You have some code that's very similar for now, but the project is rapidly evolving and you anticipate that the two instances will diverge significantly over time, to the point where trying to even identify reasonably large, factorable chunks of functionality that will stay common, let alone refactor these into reusable components, would be more trouble than it's worth. This applies when you believe that the probability of a divergent change to one instance is much greater than that of a change to common functionality.

Standard methods of debugging

What's your standard way of debugging a problem? This might seem like a pretty broad question with some of you replying 'It depends on the problem' but I think a lot of us debug by instinct and haven't actually tried wording our process. That's why we say 'it depends'.
I was sort of forced to word my process recently because a few developers and I were working on the same problem and we were debugging it in totally different ways. I wanted them to understand what I was trying to do and vice versa.
After some reflection I realized that my way of debugging is actually quite monotonous. I'll first try to reliably replicate the problem (especially on my local machine). Then, through a process of elimination (and this is where I think it's problem-dependent), I try to identify the problem.
The other guys were trying to do it in a totally different way.
So, just wondering what has been working for you guys out there? And what would you say your process is for debugging if you had to formalize it in words?
BTW, we still haven't found out our problem =)
My approach varies based on my familiarity with the system at hand. Typically I do something like:
Replicate the failure, if at all possible.
Examine the fail state to determine the immediate cause of the failure.
If I'm familiar with the system, I may have a good guess about the root cause. If not, I start to mechanically trace the data back through the software while challenging basic assumptions made by the software.
If the problem seems to have a consistent trigger, I may manually walk forward through the code with a debugger while challenging implicit assumptions that the code makes.
Tracing the root cause is, of course, where things can get hairy. This is where having a dump (or better, a live, broken process) can be truly invaluable.
I think that the key point in my debugging process is challenging pre-conceptions and assumptions. The number of times I've found a bug in that component that I or a colleague would swear is working fine is massive.
I've been told by my more intuitive friends and colleagues that I'm quite pedantic when they watch me debug or ask me to help them figure something out. :)
Consider getting hold of the book "Debugging" by David J Agans. The subtitle is "The 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems". His list of debugging rules — available in poster form at the web site (and there's a link for the book, too) — is:
Understand the system
Make it fail
Quit thinking and look
Divide and conquer
Change one thing at a time
Keep an audit trail
Check the plug
Get a fresh view
If you didn't fix it, it ain't fixed
The last point is particularly relevant in the software industry.
I picked these up from the web or from some book which I can't recall (it may have been CodingHorror...):
Debugging 101:
Reproduce
Progressively Narrow Scope
Avoid Debuggers
Change Only One Thing At a Time
Psychological Methods:
Rubber-duck debugging
Don't Speculate
Don't be too Quick to Blame the Tools
Understand Both Problem and Solution
Take a Break
Consider Multiple Causes
Bug Prevention Methods:
Monitor Your Own Fault Injection Habits
Introduce Debugging Aids Early
Loose Coupling and Information Hiding
Write a Regression Test to Prevent Recurrence
Technical Methods:
Inert Trace Statements (see the sketch after this list)
Consult the Log Files of Third Party Products
Search the web for the Stack Trace
Introduce Design By Contract
Wipe the Slate Clean
Intermittent Bugs
Exploit Locality
Introduce Dummy Implementations and Subclasses
Recompile / Relink
Probe Boundary Conditions and Special Cases
Check Version Dependencies (third party)
Check Code that Has Changed Recently
Don't Trust the Error Message
Graphics Bugs
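To illustrate the "inert trace statements" item above, here is one conventional way to do it in C (the TRACE macro and parse_record function are invented): the trace calls stay in the source permanently but expand to nothing unless the build defines TRACE_ENABLED.
    #include <stdio.h>

    #ifdef TRACE_ENABLED
    #define TRACE(fmt, ...) \
        fprintf(stderr, "%s:%d: " fmt "\n", __FILE__, __LINE__, __VA_ARGS__)
    #else
    #define TRACE(fmt, ...) ((void)0)   /* inert: compiles away entirely */
    #endif

    static int parse_record(const char *line)
    {
        TRACE("parsing line: %s", line);
        /* ... real parsing work would go here ... */
        return 0;
    }

    int main(void)
    {
        return parse_record("id=42,name=widget");
    }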
When I'm up against a bug that I can't seem to figure out, I like to make a model of the problem. Make a copy of the section of problem code, and start removing features from it, one at a time. Run a unit test against the code after every removal. Through this process you will either remove the feature with the bug (and hence locate the bug), or you will have isolated the bug down to a core piece of code that contains the essence of the problem. And once you figure out the essence of the problem, it's a lot easier to fix.
I normally start off by forming an hypothesis based on the information I have at hand. Once this is done, I work to prove it to be correct. If it proves to be wrong, I start off with a different hypothesis.
Most of the Multithreaded synchronization issues get solved very easily with this approach.
Also you need to have a good understanding of the debugger you are using and its features. I work on Windows applications and have found windbg to be extremely helpful in finding bugs.
Reducing the bug to its simplest form often leads to greater understanding of the issue as well adding the benefit of being able to involve others if necessary.
Setting up a quick reproduction scenario to allow for efficient use of your time to test any hypothesis you choose.
Creating tools to dump the environment quickly for comparisons.
Creating and reproducing the bug with logging turned up to the maximum level.
Examining the system logs for anything alarming.
Looking at file dates and timestamps to get a feeling if the problem could be a recent introduction.
Looking through the source repository for recent activity in the relevant modules.
Apply deductive reasoning and the principle of Ockham's Razor.
Be willing to step back and take a break from the problem.
I'm also a big fan of using the process of elimination. Ruling out variables tremendously simplifies the debugging task. It's often the very first thing that should be done.
Another really effective technique is to roll back to your last working version if possible and try again. This can be extremely powerful because it gives you solid footing to proceed more carefully. A variation on this is to get the code to a point where it is working, with less functionality, than not working with more functionality.
Of course, it's very important not to just try things. This increases your despair because it never works. I'd rather make 50 runs to gather information about the bug rather than take a wild swing and hope it works.
I find the best time to "debug" is while you're writing the code. In other words, be defensive. Check return values, liberally use assert, use some kind of reliable logging mechanism and log everything.
To more directly answer the question, the most efficient way for me to debug problems is to read code. Having a log helps you find the relevant code to read quickly. No logging? Spend the time putting it in. It may not seem like you're finding the bug, and you may not be. The logging might help you find another bug though, and eventually, once you've gone through enough code, you'll find it... faster than setting up debuggers and trying to reproduce the problem, single-stepping, etc.
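A small C sketch of that defensive style (the function and log format are invented): check the return value, assert the assumption, and log enough context that the log alone points you at the right code later.
    #include <assert.h>
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    static FILE *open_config(const char *path)
    {
        assert(path != NULL);                   /* assumption, checked in debug builds */

        FILE *f = fopen(path, "r");
        if (f == NULL) {                        /* return value, always checked        */
            fprintf(stderr, "open_config: fopen(%s) failed: %s\n",
                    path, strerror(errno));
            return NULL;
        }
        fprintf(stderr, "open_config: opened %s\n", path);  /* breadcrumb for the log  */
        return f;
    }

    int main(void)
    {
        FILE *f = open_config("example.conf");
        if (f != NULL)
            fclose(f);
        return 0;
    }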
While debugging I try to think of what the possible problems could be. I've come up with a fairly arbitrary classification system, but it works for me: all bugs fall into one of four categories. Keep in mind here that I'm talking about runtime problems, not compiler or linker errors. The four categories are:
dynamic memory allocation
stack overflow
uninitialized variable
logic bug
These categories have been most useful to me with C and C++, but I expect they apply pretty well elsewhere. The logic bug category is a big one (e.g. putting a < b when the correct thing was a <= b), and can include things like failing to synchronize access among threads.
Knowing what I'm looking for (one of these four things) helps a lot in finding it. Finding bugs always seems to be much harder than fixing them.
The actual mechanics for debugging are most often:
do I have an automated test that demonstrates the problem?
if not, add a test that fails
change the code so the test passes
make sure all the other tests still pass
check in the change
No automated testing in your environment? No time like the present to set it up. Too hard to organize things so you can test individual pieces of your program? Take the time to make it so. May make it take "too long" to fix this particular bug, but the sooner you start, the faster everything else'll go. Again, you might not fix the particular bug you're looking for but I bet you find and fix others along the way.
My method of debugging is different, probably because I am still a beginner.
When I encounter a logical bug I seem to end up adding more variables to see which values go where, and then I debug line by line through the piece of code that is causing the problem.
Replicating the problem and generating a repeatable test data set is definitely the first and most important step to debugging.
If I can identify a repeatable bug, I'll typically try and isolate the components involved until I locate the problem. Frequently I'll spend a little time ruling out cases so I can state definitively: The problem is not in component X (or process Y, etc.).
First I try to replicate the error, without being able to replicate the error it is basically impossible in a non-trivial program to guess the problem.
Then if possible, break out the code into a separate standalone project. There are several reasons for this: first, if the original project is big it is quite difficult to debug; second, it eliminates or highlights any assumptions about the code.
I normally always have another copy of VS open which I use for the debugging parts in mini projects and to test routines which I later add to the main project.
Once having reproduced the error in the separate module the battle is almost won.
Sometimes it is not easy to break out a piece of code, so in those cases I use different methods depending on how complex the issue is. In most cases assumptions about data seem to come back and bite me, so I try to add lots of asserts in the code in order to make sure my assumptions are correct. I also disable code using #ifdef until the error disappears, eliminating dependencies on other modules, etc. - sort of slowly circling in on the bug like a vulture...
I don't think I really have a conscious way of doing it; it varies quite a lot, but the general principle is to eliminate the noise around the issue until it is quite obvious what it is. Hope I didn't sound too confusing :)
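A hedged sketch of that narrowing style in C (the SKIP_POSTPROCESS macro, struct, and function are invented): asserts pin down the data assumptions, while an #ifdef lets you compile a suspect step out of the build to see whether the error disappears.
    #include <assert.h>

    struct record {
        int id;
        int value;
    };

    int process_record(const struct record *r)
    {
        assert(r != NULL);
        assert(r->id >= 0);          /* assumption about incoming data */

        int result = r->value * 2;

    #ifndef SKIP_POSTPROCESS         /* build with -DSKIP_POSTPROCESS to rule this step out */
        result += r->id;             /* suspect step, temporarily removable */
    #endif

        return result;
    }

    int main(void)
    {
        struct record r = { 7, 21 };
        return process_record(&r) > 0 ? 0 : 1;
    }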

Hardest types of bugs to track? [closed]

What are some of the nastiest, most difficult bugs you have had to track and fix and why?
I am both genuinely curious and knee-deep in the process as we speak. So as they say - misery loves company.
Heisenbugs:
A heisenbug (named after the Heisenberg Uncertainty Principle) is a computer bug that disappears or alters its characteristics when an attempt is made to study it.
Race conditions and deadlocks. I do a lot of multithreaded processes and that is the hardest thing to deal with.
Bugs that happen when compiled in release mode but not in debug mode.
Any bug based on timing conditions. These often come when working with inter-thread communication, an external system, reading from a network, reading from a file, or communicating with any external server or device.
Bugs that are not in your code per se, but rather in a vendor's module on which you depend. Particularly when the vendor is unresponsive and you are forced to hack a work-around. Very frustrating!
We were developing a database to hold words and definitions in another language. It turns out that this language had only recently been added to the Unicode standard (around 2005), and it didn't make it into SQL Server 2005's collations. This had a very frustrating effect when it came to collation.
Words and definitions went in just fine, I could see everything in Management Studio. But whenever we tried to find the definition for a given word, our queries returned nothing. After a solid 8 hours of debugging, I was at the point of thinking I had lost the ability to write a simple SELECT query.
That is, until I noticed English letters matched other English letters with any amount of foreign letters thrown in. For example, EnglishWord would match E!n#gl##$ish$&Word. (With !##$%^&* representing foreign letters).
When a collation doesn't know about certain characters, it can't sort them. If it can't sort them, it can't tell whether two strings match or not (a surprise for me). So frustrating, and a whole day down the drain for a stupid collation setting.
Threading bugs, especially race conditions. When you cannot stop the system (because the bug goes away), things quickly get tough.
The hardest ones I usually run into are ones that don't show up in any log trace. You should never silently eat an exception! The problem is that eating an exception often moves your code into an invalid state, where it fails later in another thread and in a completely unrelated manner.
That said, the hardest one I ever really ran into was a C program in a function call where the calling signature didn't exactly match the called signature (one was a long, the other an int). There were no errors at compile time or link time and most tests passed, but the stack was off by sizeof(int), so the variables after it on the stack would randomly have bad values, but most of the time it would work fine (the values following that bad parameter were generally being passed in as zero).
That was a BITCH to track.
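A hedged reconstruction of how such a mismatch slips through (the names, types, and files are invented; the actual code isn't shown above): with no shared header, each translation unit believes its own declaration, and a plain C linker matches only the symbol name.
    /* ---- logging.c (compiled on its own) ---- */
    #include <stdio.h>

    void log_event(long code, int detail)       /* definition takes a long */
    {
        printf("event %ld detail %d\n", code, detail);
    }

    /* ---- caller.c (compiled on its own) ---- */
    void log_event(int code, int detail);       /* stale local declaration: int! */

    void report(void)
    {
        log_event(7, 42);   /* passes an int where the callee expects a long;
                               whether that shows up as a stack offset (as in
                               the story above) or a garbled argument depends
                               on the calling convention */
    }
A prototype shared through a single header, plus warnings such as GCC's -Wmissing-prototypes (which flags definitions that lack a declaration from a header), is what keeps this class of mismatch from compiling silently.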
Memory corruption under load due to bad hardware.
Bugs that happen on one server and not another, and you don't have access to the offending server to debug it.
Bugs that have to do with threading.
The most frustrating for me have been compiler bugs, where the code is correct but I've hit an undocumented corner case or something where the compiler's wrong. I start with the assumption that I've made a mistake, and then spend days trying to find it.
Edit: The other most frustrating was the time I got the test case set slightly wrong, so my code was correct but the test wasn't. That took days to find.
In general, I guess the worst bugs I've had have been the ones that aren't my fault.
The hardest bugs to track down and fix are those that combine all the difficult cases:
reported by a third party but you can't reproduce it under your own testing conditions;
bug occurs rarely and unpredictably (e.g. because it's caused by a race condition);
bug is on an embedded system and you can't attach a debugger;
when you try to get logging information out the bug goes away;
bug is in third-party code such as a library ...
... to which you don't have the source code so you have to work with disassembly only;
and the bug is at the interface between multiple hardware systems (e.g. networking protocol bugs or bus contention bugs).
I was working on a bug with all these features this week. It was necessary to reverse engineer the library to find out what it was up to; then generate hypotheses about which two devices were racing; then make specially-instrumented versions of the program designed to provoke the hypothesized race condition; then once one of the hypotheses was confirmed it was possible to synchronize the timing of events so that the library won the race 100% of the time.
There was a project building a chemical engineering simulator using a Beowulf cluster. It so happened that the network cards would not transmit one particular sequence of bytes. If a packet contained that string, the packet would be lost. They solved the problem by replacing the hardware - finding it in the first place was much harder.
One of the hardest bugs I had to find was a memory corruption error that only occurred after the program had been running for hours. Because of the length of time it took to corrupt the data, we assumed hardware and tried two or three other computers first.
The bug would take hours to appear, and when it did appear it was usually only noticed quite some time later, when the program got so messed up that it started misbehaving. Narrowing down where in the code base the bug was occurring was very difficult, because the crashes due to corrupted memory never occurred in the function that corrupted the memory, and it took so damned long for the bug to manifest itself.
The bug turned out to be an off-by-one error in a rarely called piece of code that handled a data line that had something wrong with it (an invalid character encoding, if I recall correctly).
In the end the debugger proved next to useless because the crashes never occurred in the call tree of the offending function. A well-sequenced stream of fprintf(stderr, ...) calls in the code, with the output dumped to a file, was what eventually allowed us to identify the problem.
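The answer doesn't show the code, but a minimal C sketch of that kind of off-by-one (all names invented), together with the fprintf breadcrumb style, looks roughly like this; the bad write lands one byte past the buffer only when an unusually long field comes in, which is why it can surface so rarely.
    #include <stdio.h>
    #include <string.h>

    #define FIELD_MAX 8                  /* destination buffers are FIELD_MAX bytes */

    static void copy_field(char *dst, const char *src)
    {
        size_t len = strlen(src);
        if (len > FIELD_MAX)             /* BUG: should be len >= FIELD_MAX,  */
            len = FIELD_MAX;             /* clamping to FIELD_MAX - 1 instead */
        memcpy(dst, src, len);
        dst[len] = '\0';                 /* when len reaches FIELD_MAX, this  */
                                         /* writes one byte past the end      */
        fprintf(stderr, "copy_field: src=\"%s\" len=%zu\n", src, len);
    }

    int main(void)
    {
        char field[FIELD_MAX];
        copy_field(field, "short");      /* fine                               */
        copy_field(field, "ABCDEFGHIJ"); /* rare long input: silent corruption */
        return 0;
    }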
Concurrency bugs are quite hard to track, because reproducing them can be very hard when you do not yet know what the bug is. That's why, every time you see an unexplained stack trace in the logs, you should search for the reason of that exception until you find it. Even if it happens only one time in a million, that does not make it unimportant.
Since you can not rely on the tests to reproduce the bug, you must use deductive reasoning to find out the bug. That in turn requires a deep understanding of how the system works (for example how Java's memory model works and what are possible sources of concurrency bugs).
Here is an example of a concurrency bug in Guice 1.0 which I located just some days ago. You can test your bug-finding skills by trying to find out what bug is causing that exception. The bug is not too hard to find - I found its cause in some 15-30 minutes (the answer is here).
java.lang.NullPointerException
at com.google.inject.InjectorImpl.injectMembers(InjectorImpl.java:673)
at com.google.inject.InjectorImpl$8.call(InjectorImpl.java:682)
at com.google.inject.InjectorImpl$8.call(InjectorImpl.java:681)
at com.google.inject.InjectorImpl.callInContext(InjectorImpl.java:747)
at com.google.inject.InjectorImpl.injectMembers(InjectorImpl.java:680)
at ...
P.S. Faulty hardware might cause even nastier bugs than concurrency, because it may take a long time before you can confidently conclude that there is no bug in the code. Luckily hardware bugs are rarer than software bugs.
A friend of mine had this bug. He accidentally put a function argument in a C program in square brackets instead of parentheses, like this: foo[5] instead of foo(5). The compiler was perfectly happy, because the function name is a pointer, and there is nothing illegal about indexing off a pointer.
One of the most frustrating for me was when the algorithm was wrong in the software spec.
Probably not the hardest, but they are extremely common and not trivial:
Bugs concerning mutable state. It is hard to maintain invariants in a data structure if it has many mutable fields. And you have operation-order dependency - swap two lines and something bad occurs. One of my recent hard-to-find bugs was when I found that the previous developer of the system I maintained had used mutable data for hashtable keys - in some rare conditions it led to infinite loops.
Order of initialization bugs. Can be obvious when found, but not so when coding.
The hardest one ever was actually a bug I was helping a friend with. He was writing C in MS Visual Studio 2005, and forgot to include time.h. He then called time() without the required argument (usually NULL), and the missing header meant time was implicitly declared as int time(). This corrupted the stack in a completely unpredictable way. It was a large amount of code, and we didn't think to look at the time() call for quite some time.
Buffer overflows ( in native code )
Last year I spent a couple of months tracking a problem that ended up being a bug in a downstream system. The team lead from the offending system kept claiming that it must be something funny in our processing even though we passed the data just like they requested it from us. If the lead would have been a little more cooperative we might have nailed the bug sooner.
Uninitialized variables. (Or have modern languages done away with this?)
For embedded systems:
Unusual behaviour reported by customers in the field, but which we're unable to reproduce.
After that, bugs which turn out to be due to a freak series or concurrence of events. These are at least reproducible, but obviously they can take a long time - and a lot of experimentation - to make happen.
Difficulty of tracking:
off-by-one errors
boundary condition errors
Machine dependent problems.
I'm currently trying to debug why an application has an unhandled exception in a try{} catch{} block (yes, unhandled inside of a try / catch) that only manifests on certain OS / machine builds, and not on others.
Same version of software, same installation media, same source code, works on some - unhandled exception in what should be a very well handled part of code on others.
Gak.
When objects are cached and their equals and hashCode implementations are written so poorly that the hash code value isn't unique and equals returns true when the objects aren't actually equal.
Cosmetic web bugs involving styling in various browser O/S configurations, e.g. a page looks fine in Windows and Mac in Firefox and IE but on the Mac in Safari something gets messed up. These are annoying sometimes because they require so much attention to detail and making the change to fix Safari may break something in Firefox or IE so one has to tread carefully and realize that the styling may be a series of hacks to fix page after page. I'd say those are my nastiest ones that sometimes just don't get fixed as they aren't viewed as important.
WAY back in the days, memory leaks. Thankfully, there's a lot of tools to find them, these days.
Memory issues, particularly on older systems. We have some legacy 16-bit C software that must remain 16-bit for the time being. The 64K memory blocks are a royal pain to work with, and we constantly add statics or code logic that pushes us past the 64K group limits.
To make matters worse, memory errors usually don't cause the program to crash, but cause certain features to sporadically break (and not always the same features). Debugging is a non-option - the debugger doesn't have the same memory constraints so the programs always run fine in debug mode ... plus, we can't add inline printf statements for testing since that bumps the memory usage even higher.
As a result, we can sometimes spend DAYS trying to find a single block of code to rewrite, or hours moving static chars to files. Luckily the system is slowly being moved offline.
Multithreading, memory leaks, anything requiring extensive mocks, interfacing with third-party software.

Resources