Chain of events analysis and reasoning - algorithm

My boss said the logs in their current state are not acceptable for the customer. If there is a fault, a dozen different modules of the device report their own errors and they all land in the logs. The original cause of the fault may be buried somewhere in the middle of the list, may not appear on the list at all (the module in question being too damaged to report), or may appear long after everything else has finished reporting the problems that result from the original fault. Either way, few people outside the system developers can properly interpret the logs and work out what actually happened.
My current task is writing a module that does customer-friendly fault reporting. That is, gather all the events that were reported over the last ~3 seconds (which is about the maximum interval between the origin of the fault and the last resulting after-effects), do some magic processing of this data, and come up with one clear, friendly line stating what is broken and needs to be fixed.
The problem is the magic part: how, given a number of fault reports, to come up with the original source of the fault. There is no simple cause-and-effect list. There are just commonly occurring chains of events displaying certain regularities.
Examples:
A short circuit is detected, resulting in limited-operation mode; the limited operation does not remove the fault, so the emergency state escalates and total output power is disconnected.
The safety line got engaged. No module reported engaging it within 3 s of it being engaged, so "unknown source or interference" is attributed as the reason for the system halt.
Most output modules report no output voltage. About 1 s later the power supply monitoring module reports that power is out, which is the original reason.
An output module reports no output voltage on all of its output lines. No report from the power supply module. The reason is a power line disconnected from the module.
An output module reports no output voltage on one of its output lines. No other faults reported. The reason is a burnt fuse.
An output module did not report back with applying the received state. Shortly after, the control module reports an illegal state of the output lines (resulting from the output module really not updating the state in a timely manner). The cause is the output module (which introduced the fault), not the control module (which halted the system because it detected the fault).
A fault of an input module switches the device to backup-failsafe mode. An output module not used so far, which happens to be faulty, gets engaged in this mode, and the fault escalates to critical. The original reason is not the input module, which is allowed to report false positives concerning faults, but the broken backup output which aborted the operation.
There has been no activity of any kind from an output module for the last 2 seconds. This means it is broken and a fault mode must be entered.
There is no comprehensive list of rules as to what causes what. The rules will be added as new kinds of faults occur "in the wild" and are diagnosed and fixed. Some of them are heuristics: if this error is accompanied by these errors, then the fault is most likely this. Some faults will not be resolved; a plain list of module reports will have to suffice. Some answers will be ambiguous; one set of symptoms may suggest two different faults. This is more of a "best effort" than a "guaranteed solution" situation.
Now for the (overly general and vague) question: how do I solve this? Are there specific algorithms, methods or generalized solutions to this kind of problem? How do I write the generalized rulesets and match against them? How do I do the soft matching? (Say an input module breaks right in the middle of an emergency halt: that is a completely unrelated event to be ignored.) Help please?

In all honesty, I would just write a series of simple rules and be done with it. It will be a pain maintenance-wise, but getting this right may be time-consuming and brittle.
If you insist, I would approach this by having each error drop some sort of symbol/token for each error code - you'll make this much harder if you try to do some bag-of-words/keyword matching. You would then feed the emitted tokens into some sort of classifier.
At heart, you need some sort of rules engine - be it fuzzy or exact. The first thing that comes to mind is a hand-built Bayesian network. This would allow for fuzzy matching as you would calculate the most probable 'report' as a function of the tokens you receive. It also allows you to set a threshold for token groups that aren't really indicative of anything by specifying the minimum probability to return an answer.
You could also train a Bayes net or another type of classifier, but you'll need quite a bit of data that you've manually labeled (token1,token2,token3->faultxyz), and it might be more accurate to build it by hand yourself.
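To make the token-plus-rules idea concrete, here is a minimal sketch in C. The token names, the rule table, the weights and the threshold are all made-up placeholders, not anything from your system; a real version would load its rules from configuration and tune the weights as new faults get diagnosed in the field.

#include <stdio.h>
#include <stddef.h>

/* Hypothetical event tokens emitted by modules (one per error code). */
typedef enum {
    TOK_SHORT_CIRCUIT,
    TOK_LIMITED_MODE,
    TOK_EMERGENCY_ESCALATION,
    TOK_NO_OUTPUT_VOLTAGE,
    TOK_PSU_POWER_OUT,
    TOK_COUNT
} token_t;

/* A rule: a set of required tokens, a weight, and the friendly diagnosis. */
typedef struct {
    const token_t *required;
    size_t         n_required;
    double         weight;       /* prior confidence in this rule */
    const char    *diagnosis;    /* customer-friendly one-liner   */
} rule_t;

static const token_t r1_tokens[] = { TOK_SHORT_CIRCUIT, TOK_LIMITED_MODE, TOK_EMERGENCY_ESCALATION };
static const token_t r2_tokens[] = { TOK_NO_OUTPUT_VOLTAGE, TOK_PSU_POWER_OUT };
static const token_t r3_tokens[] = { TOK_NO_OUTPUT_VOLTAGE };

static const rule_t rules[] = {
    { r1_tokens, 3, 1.0, "Short circuit on the output stage; power was disconnected." },
    { r2_tokens, 2, 0.9, "Main power supply failure." },
    { r3_tokens, 1, 0.5, "Output fuse blown or output power line disconnected." },
};

/* Score = weight * (fraction of the rule's required tokens seen in the window). */
static double score_rule(const rule_t *r, const int seen[TOK_COUNT])
{
    size_t hits = 0;
    for (size_t i = 0; i < r->n_required; i++)
        if (seen[r->required[i]])
            hits++;
    return r->weight * (double)hits / (double)r->n_required;
}

/* Pick the best-matching rule above a threshold, or NULL for "unknown fault". */
static const char *diagnose(const int seen[TOK_COUNT], double threshold)
{
    const char *best = NULL;
    double best_score = threshold;
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        double s = score_rule(&rules[i], seen);
        if (s > best_score) {
            best_score = s;
            best = rules[i].diagnosis;
        }
    }
    return best;
}

int main(void)
{
    /* Tokens gathered over the last ~3 seconds. */
    int seen[TOK_COUNT] = { 0 };
    seen[TOK_NO_OUTPUT_VOLTAGE] = 1;
    seen[TOK_PSU_POWER_OUT] = 1;

    const char *msg = diagnose(seen, 0.6);
    printf("%s\n", msg ? msg : "Unknown fault - see raw module reports.");
    return 0;
}

The threshold is what gives you the "soft" behaviour: an unrelated token (your input module breaking mid-halt) simply fails to push any rule over the line, and a partial match still wins if enough of its symptoms are present.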


Grafcet - synchronous machine behaviour

My goal is to implement a control algorithm written in Grafcet on a PLC. I am struggling with the difference between Grafcet as a multi-process synchronous language and the single-core, sequential PLC. Below is an example. What is the outcome of the Grafcet in the first cycle after the upper transition has fired: (a=1, x=1) or (a=1, x=0)?
I know that in SFC, it depends on the implementation of the engineering tool (e.g. Codesys, Multiprog) how actions are evaluated, typically from left to right. So for an SFC, (a=1,x=1) would be the answer. But since everything happens at the same time in Grafcet, I do not know how to handle this case.
Bonus points if someone can point out how I can learn more about the challenges of implementing languages like Grafcet on sequential machines.
Not all Grafcet variants consider conditional actions, but when they do, the behavior goes like this: as long as the step is active, turn on x while a is on.
If that's what you meant (though we may never see a conditional action formatted the way you did), x will be turned on within an infinitesimally short time after the two simultaneous steps are activated - at least that's my understanding based on the Grafcet evolution rules. So the fact that the initial value of x is unpredictable, assuming the two concurrent steps are activated at the very same instant, should actually be no problem.
Moreover, as soon as the Grafcet is "implemented" in the real world (i.e. your single-core PLC), whether it's directly compiled by the engineering tool or converted into a ladder diagram, an order of evaluation is necessarily chosen, as you said, and everything becomes deterministic. So your question is not a real problem when it comes to "implementing languages like Grafcet on sequential machines". You may find the real challenges by studying the canonical procedure for converting SFCs to ladder logic (detailed documentation is easily found on the web).
PLCs are single-core, as you said, so... never two steps at the same moment in time.
What you have there is a simultaneous branch, so both steps WILL execute - but clearly one after another. By default, always the one on the left. Please note that some PLCs allow you to change the order (I have never tried it for simultaneous branches, but divergent branches certainly allow it, e.g. in RSLogix5000).
Simultaneous is like having an AND: you are telling the processor to execute the first step AND the second step. If you are familiar with Ladder Logic, I am sure this will be clear to you.
In the end, it should be a=1, x=1.
Also note that for other steps that are not simultaneous, there is a one-scan delay before the next transition is evaluated, which is a great thing. This is the most commonly omitted detail when implementing an SFC in Ladder (and it can lead to problems that are impossible to troubleshoot if you are not aware of it). I've seen this "bug" in about 50% of the ladder implementations I've encountered, across hundreds of projects so far. Example: if you have 10 consecutive transitions true, you go from step 1 to step 10 in a single scan. Good luck troubleshooting why the motor didn't start :)
Tip: you can always use dummy steps in simultaneous branches to delay by one scan. So, if you want the other outcome (a=1, x=0), you can put a dummy step before the left step.
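To see the scan-order effect in code, here is a toy C simulation of the scan cycle. This is not PLC code, and since the original diagram isn't reproduced here, which branch holds the stored action and which holds the conditional action is an assumption made purely for illustration. It shows the plain simultaneous branch giving (a=1, x=1) in the first scan, and a dummy step ahead of the conditional action delaying x by one scan, giving (a=1, x=0).

#include <stdio.h>
#include <stdbool.h>

/* Toy scan-cycle simulation of a simultaneous SFC/Grafcet branch.
 * Branch A: step SA with the stored action a := 1.
 * Branch B: step SB with the conditional action "x while a".
 * use_dummy inserts a dummy step ahead of SB, delaying its action by one scan. */
static void run(bool use_dummy)
{
    bool a = false, x = false;
    bool sa_active = true;                 /* both branches activated together */
    bool sb_dummy_active = use_dummy;      /* dummy step ahead of SB, if any   */
    bool sb_active = !use_dummy;

    for (int scan = 1; scan <= 2; scan++) {
        /* Actions are evaluated once per scan, in a fixed order. */
        if (sa_active) a = true;           /* branch A action                  */
        if (sb_active) x = a;              /* branch B conditional action      */
        printf("  scan %d: a=%d x=%d\n", scan, a, x);

        /* End-of-scan transition handling: the dummy step hands over to SB. */
        if (sb_dummy_active) { sb_dummy_active = false; sb_active = true; }
    }
}

int main(void)
{
    printf("plain simultaneous branch:\n");
    run(false);
    printf("with a dummy step ahead of the conditional action:\n");
    run(true);
    return 0;
}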

What is a debugger and how can it help me diagnose problems?

This is intended to be a general-purpose question to assist new programmers who have a problem with a program, but who do not know how to use a debugger to diagnose the cause of the problem.
This question covers three classes of more specific question:
When I run my program, it does not produce the output I expect for the input I gave it.
When I run my program, it crashes and gives me a stack trace. I have examined the stack trace, but I still do not know the cause of the problem because the stack trace does not provide me with enough information.
When I run my program, it crashes because of a segmentation fault (SEGV).
A debugger is a program that can examine the state of your program while your program is running. The technical means it uses for doing this are not necessary for understanding the basics of using a debugger. You can use a debugger to halt the execution of your program when it reaches a particular place in your code, and then examine the values of the variables in the program. You can use a debugger to run your program very slowly, one line of code at a time (called single stepping), while you examine the values of its variables.
Using a debugger is an expected basic skill
A debugger is a very powerful tool for helping diagnose problems with programs. And debuggers are available for all practical programming languages. Therefore, being able to use a debugger is considered a basic skill of any professional or enthusiast programmer. And using a debugger yourself is considered basic work you should do yourself before asking others for help. As this site is for professional and enthusiast programmers, and not a help desk or mentoring site, if you have a question about a problem with a specific program, but have not used a debugger, your question is very likely to be closed and downvoted. If you persist with questions like that, you will eventually be blocked from posting more.
How a debugger can help you
By using a debugger you can discover whether a variable has the wrong value, and where in your program its value changed to the wrong value.
Using single stepping you can also discover whether the control flow is as you expect. For example, whether an if branch executed when you expect it ought to be.
General notes on using a debugger
The specifics of using a debugger depend on the debugger and, to a lesser degree, the programming language you are using.
You can attach a debugger to a process already running your program. You might do it if your program is stuck.
In practice it is often easier to run your program under the control of a debugger from the very start.
You indicate where your program should stop executing by indicating the source code file and line number of the line at which execution should stop, or by indicating the name of the method/function at which the program should stop (if you want to stop as soon as execution enters the method). The technical means that the debugger uses to cause your program to stop is called a breakpoint and this process is called setting a breakpoint.
Most modern debuggers are part of an IDE and provide you with a convenient GUI for examining the source code and variables of your program, with a point-and-click interface for setting breakpoints, running your program, and single stepping it.
Using a debugger can be very difficult unless your program executable or bytecode files include debugging symbol information and cross-references to your source code. You might have to compile (or recompile) your program slightly differently to ensure that information is present. If the compiler performs extensive optimizations, those cross-references can become confusing. You might therefore have to recompile your program with optimizations turned off.
I want to add that a debugger isn't always the perfect solution, and shouldn't always be the go-to solution to debugging. Here are a few cases where a debugger might not work for you:
The part of your program which fails is really large (poor modularization, perhaps?) and you're not exactly sure where to start stepping through the code. Stepping through all of it might be too time-consuming.
Your program uses a lot of callbacks and other non-linear flow control methods, which makes the debugger confused when you step through it.
Your program is multi-threaded. Or even worse, your problem is caused by a race condition.
The code that has the bug in it runs many times before it bugs out. This can be particularly problematic in main loops, or worse yet, in physics engines, where the problem could be numerical. Even setting a breakpoint, in this case, would simply have you hitting it many times, with the bug not appearing.
Your program must run in real-time. This is a big issue for programs that connect to the network. If you set up a breakpoint in your network code, the other end isn't going to wait for you to step through, it's simply going to time out. Programs that rely on the system clock, e.g. games with frameskip, aren't much better off either.
Your program performs some form of destructive actions, like writing to files or sending e-mails, and you'd like to limit the number of times you need to run through it.
You can tell that your bug is caused by incorrect values arriving at function X, but you don't know where these values come from. Having to run through the program, again and again, setting breakpoints farther and farther back, can be a huge hassle. Especially if function X is called from many places throughout the program.
In all of these cases, either having your program stop abruptly could cause the end results to differ, or stepping through manually in search of the one line where the bug is caused is too much of a hassle. This can equally happen whether your bug is incorrect behavior, or a crash. For instance, if memory corruption causes a crash, by the time the crash happens, it's too far from where the memory corruption first occurred, and no useful information is left.
So, what are the alternatives?
Simplest is simply logging and assertions. Add logs to your program at various points, and compare what you get with what you're expecting. For instance, see if the function where you think there's a bug is even called in the first place. See if the variables at the start of a method are what you think they are. Unlike breakpoints, it's okay for there to be many log lines in which nothing special happens. You can simply search through the log afterward. Once you hit a log line that's different from what you're expecting, add more in the same area. Narrow it down farther and farther, until it's small enough to be able to log every line in the bugged area.
Assertions can be used to trap incorrect values as they occur, rather than once they have an effect visible to the end-user. The quicker you catch an incorrect value, the closer you are to the line that produced it.
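A minimal C illustration of the logging-plus-assertions approach; the function and the log format are made up for the example.

#include <assert.h>
#include <stdio.h>

/* Hypothetical function under suspicion. */
static double average(const double *values, int count)
{
    /* Assertion: trap a bad value at the point it arrives,
     * not when its effect becomes visible much later. */
    assert(values != NULL && count > 0);

    /* Log the inputs so the trace can be compared to expectations later. */
    fprintf(stderr, "[trace] average() called with count=%d first=%f\n",
            count, values[0]);

    double sum = 0.0;
    for (int i = 0; i < count; i++)
        sum += values[i];

    double result = sum / count;
    fprintf(stderr, "[trace] average() returning %f\n", result);
    return result;
}

int main(void)
{
    double samples[] = { 1.0, 2.0, 3.0 };
    printf("%f\n", average(samples, 3));
    return 0;
}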
Refactor and unit test. If your program is too big, it might be worthwhile to test it one class or one function at a time. Give it inputs, and look at the outputs, and see which are not as you're expecting. Being able to narrow down a bug from an entire program to a single function can make a huge difference in debugging time.
In case of memory leaks or memory stomping, use appropriate tools that are able to analyze and detect these at runtime. Being able to detect where the actual corruption occurs is the first step. After this, you can use logs to work your way back to where incorrect values were introduced.
Remember that debugging is a process going backward. You have the end result - a bug - and find the cause, which preceded it. It's about working your way backward and, unfortunately, debuggers only step forwards. This is where good logging and postmortem analysis can give you much better results.

Fail Fast vs. Robustness

Our product is a distributed system. The modules I work on are fairly new, quite rigorous, well tested. They were developed with recent best practices in mind. Other modules can be considered as legacy software.
While I'm vigilant about everything that happens within the modules I'm responsible for, I'm under constant pressure to work with bad data sent to me from the other modules. At heart, I'm a "fail fast" principle developer, and as a result, when problems arise I am usually able to eliminate the possibility of error in my modules. It's not so much about blame, just saving wasted effort in chasing bugs in the wrong places.
But the argument I keep coming up against is: "We can't let this stuff fail in production, the customer expects this to work, why don't you work around this problem". And this would be an argument for robustness: be liberal in what you accept, conservative in what you send.
I should also note that these are mostly intermittent problems. We see them in integration tests but they are hard to reproduce. Timing and concurrency are involved.
I'm having a hard time balancing between the two principles. Part of it is my worry that if I start allowing and propagating exceptional data, I'm inviting trouble and I won't have as much confidence in my system. But I can't argue against keeping the system working even if other modules are sending me wrong data. The reason other modules aren't getting fixed is that they are too complex and fragile, while mine still appear clear and safe. But if I don't resist the pressure, my modules will slowly be saddled with the same problems I've been rejecting until now.
I should say that the system is not "crashing" in production, but my module may simply display an error to the operator and ask them to contact support. A crash would be a big problem, but if I'm reporting the error clearly, then isn't this the right thing to do? I suspect that my peers just don't want the customer to see any problems, period. But my module is rejecting data from other modules within our product, not customer input. So it seems to me that we are just not tackling problems.
So, do I need to be more pragmatic or hold my ground?
I share the "fail fast" preference/principle. Don't think of this as a conflict of principles, though; it's more a conflict of understanding. Your counterpart has some unspoken requirement ("don't show the user a bad time") that implies a missed requirement. You did not have a chance to think about or implement this requirement beforehand, so it has left a bad taste in your mouth. Forget this viewpoint and re-approach it as a new project with a fixed requirement you can work against.
Maybe the best result is to give an error message like the one you displayed. But it sounds like you implemented it before having buy-in from your counterpart, when they should have had a chance to accept it. Earlier communication about what you were doing could have addressed something like that.
Be careful in how you present the ideas. Constantly referring to the other systems as "too complex and fragile" might be rubbing people the wrong way. Simply say the systems are new to you and take longer to understand. Do put the time into understanding them, so you do not reduce people's expectations of your capability.
I'd say that it depends on what happens if you don't halt. Does someone's paycheck get processed wrong? Does the wrong order get sent out? That would be worth stopping for.
If possible, have your cake and eat it too - don't report the error to the user, get the customer to agree to send diagnostic reports and report every failure back. Bug the developer(s) who own the faulting module(s) to fix them. And by bug I mean file a bug against them. Or, if management doesn't think it's worth the cost of fixing, don't.
I'd also write up unit tests against those modules that fail, especially if you can tell what the original input was that caused them to generate the wrong output.
What it really comes down to though is what the person who reviews your performance wants from you, especially after you explain the problem to them, via email.
Simply put, this sounds like a case of "don't check for something you can't handle". The fact that you're catching the error and able to report it means you're not propagating it. But it also means that since you can report it, you have some mechanism to trap the error, and therefore could potentially handle it yourself and correct it rather than report it.
Mind, I'm assuming that your error report is more interesting than a random exception you caught some place deep in the system. But even then, if it's an exception you're testing for and you're creating (i.e. you check if the denominator is zero and send an error rather than simply inadvertently dividing by zero and catching the exception higher up), then that suggests you may well have a way of correcting the problem.
Bottom line, you need both. You need to try to make the data as error free as practical, but also report the unexpected.
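As a sketch of that "do both" idea at a module boundary, in C. The message layout, the ranges and the correction policy are invented for illustration; the point is to normalize what is safely fixable (and log it), and to reject and report what is not, rather than silently propagating a guess.

#include <stdio.h>

/* Hypothetical message arriving from a legacy module. */
typedef struct {
    int    channel;      /* expected: 0..15    */
    double level;        /* expected: 0.0..1.0 */
} input_msg;

typedef enum { ACCEPTED, CORRECTED, REJECTED } verdict;

/* Validate at the boundary: fix what is safely fixable, reject the rest. */
static verdict accept_input(input_msg *m)
{
    verdict v = ACCEPTED;

    /* Slight overshoot is a known rounding artifact: clamp it and log it. */
    if (m->level > 1.0 && m->level < 1.01) {
        fprintf(stderr, "[warn] level %.4f clamped to 1.0 (channel %d)\n",
                m->level, m->channel);
        m->level = 1.0;
        v = CORRECTED;
    }

    /* Anything else out of range is not ours to guess: fail fast, loudly. */
    if (m->channel < 0 || m->channel > 15 || m->level < 0.0 || m->level > 1.0) {
        fprintf(stderr, "[error] rejecting message: channel=%d level=%f\n",
                m->channel, m->level);
        return REJECTED;
    }
    return v;
}

int main(void)
{
    input_msg ok  = { 3, 1.005 };   /* corrected */
    input_msg bad = { 42, 0.5 };    /* rejected  */
    printf("ok:  %d\n", accept_input(&ok));
    printf("bad: %d\n", accept_input(&bad));
    return 0;
}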
I don't think you can lock the door and cross your arms saying "it's not my problem". The fact that the data is coming from "old, fragile systems" is meaningless. YOUR code is not old and fragile, and it is clearly the efficient place, in terms of the entire integrated system, to "fix" the data once you've detected the problem. Yes, the old modules will continue to GIGO to other, lesser systems, but those legacy modules combined with your new module form a cohesive whole and thus make up "the system".
The typical real problem here is simply the time/value equation of writing all this fix up code vs new features. That's a different debate. But if you have time, and you know things that you can do to clean up incoming data, "be liberal in what you accept" is sound policy.
I won't get into the reasons, but you are right.
In my experience, PHBs are missing the part of the brain required to understand why fail fast has merit and why "robustness", defined as do-whatever-it-takes-and-eat-errors-if-necessary, is a bad idea. It is hopeless. They just don't have the hardware to grok it. They tend to say things like "OK, you make a good point, but what about the user" - it's just their version of "think of the children", and it signals the end of the conversation for me any time it's brought up.
My advice is to stand your ground. Eternally.
Thanks everyone. The case that prompted this question ended well, and partly thanks to insights I got from the answers above.
My initial reaction was to stick to fail fast, but I thought about this some more, and had reached the conclusion that one of the roles of my module is to provide a stabilizing anchor to the rest of the system. That does not necessarily mean accepting bad data, but surfacing problems, isolating them and handling them in a transparent manner until we find a solution.
I planned adding a new handler and code path for this case, which would properly execute as if it was a special use case that was previously undocumented.
We had a discussion where I reiterated the need to deal with the problem at the boundary, but was also willing to help. I outlined my plan to the other side, because I had a suspicion that my position was viewed as overly pedantic, and that the solution was perceived as me only having to turn off spurious validation of harmless data, even if it was incorrect. In reality though, the way I work is largely data driven, so I explained why it has to be correct and how behavior is driven by it and how in accommodating this data I will be implementing a special code path.
I think this gave weight to my position and it led to a more thorough discussion of the other side's aversion to fixing the data. It turned out that it was more of a weariness of dealing with an error prone legacy system than an actual obstacle. There was a relatively simple solution, it was just scary to make a change, a mindset that's fairly entrenched.
But having aired all challenges and possible solutions, we eventually agreed to fix the data, and so far it seems to have solved our problem. Our integration tests are now passing consistently, but we have also added logging and will continue to monitor it.
In summary, I think that for me, the synthesis of both principles is that fail fast is essential for surfacing problems. But once they do surface, robustness means providing a transparent path to continue operation in a way that does not compromise the system. I was able to offer that, and by doing so, won some goodwill from the other side and got the data fixed in the end.
Again, thanks to everyone that responded. I'm too new to rate comments, but I do appreciate all the perspectives presented.
That's a tricky one. If your module receives bad data and it's "OK" for you to just do nothing with it and return, then I would suggest writing to an error log instead of showing an error to the user.
It kind of depends on the class of error you are getting. If the way the system is breaking means you can keep going without feeding bad data to any other parts of the system, you should do everything in your power to work with whatever input is given.
To my mind, though, data purity trumps a working system: you cannot allow bad data to propagate elsewhere and corrupt other systems. To the extent that you can massage the data to be correct and then keep going, you should do so, on the theory that then the data is safe and you keep the system running...
I like to think of things in terms of data streams. Passing bad data along pollutes the whole stream, and that is bad, because just like real pollution a single drop can spoil a whole river of data (if one element is bad, what else can you trust?). But equally bad is blocking the flow, letting nothing pass because you spotted something you could easily remove. Filter it out, and if everyone at every stage is also filtering, you get clear, clean data out the other end, even if a few impurities got in somewhere in the middle.
The question from your peers is: "why don't you work around this problem"
You say that it's possible for you to detect the bad data and report an error to the user. This is the normal approach - once you know the data coming into your functions is bad, you should fail fast (and this is the recommendation in the other answers I have read here).
However, your question doesn't specify the domain in which your software is operating. If you know the data coming in is erroneous, is it possible for you to request that data again? Is it actually possible to recover from the situation?
I mentioned that the "domain" here is important. So if you have an app which displays streamed video data for example, and maybe your wireless signal is weak so the stream is corrupt, should the system "fail fast" and display an error message? Or should a poorer image be displayed, and an attempt to reconnect made if needed, depending on the magnitude of the problem?
Depending on your domain, it may be possible for you to detect bad data and make a second request for the data without inconveniencing the user. (This is clearly only relevant in cases where you'd expect the data to be better the second time, but you do say the issues you are experiencing are intermittent and possibly concurrency-related)...
So, fail-fast is good, and is definitely something you should do if you can't recover. And you should definitely not propagate bad data. But if you can recover, which in some domains you can, then failing straight away is not necessarily the best thing to do.

Performance optimization strategies of last resort [closed]

There are plenty of performance questions on this site already, but it occurs to me that almost all are very problem-specific and fairly narrow. And almost all repeat the advice to avoid premature optimization.
Let's assume:
the code already is working correctly
the algorithms chosen are already optimal for the circumstances of the problem
the code has been measured, and the offending routines have been isolated
all attempts to optimize will also be measured to ensure they do not make matters worse
What I am looking for here is strategies and tricks to squeeze out up to the last few percent in a critical algorithm when there is nothing else left to do but whatever it takes.
Ideally, try to make answers language agnostic, and indicate any down-sides to the suggested strategies where applicable.
I'll add a reply with my own initial suggestions, and look forward to whatever else the Stack Overflow community can think of.
OK, you're defining the problem to where it would seem there is not much room for improvement. That is fairly rare, in my experience. I tried to explain this in a Dr. Dobb's article in November 1993, by starting from a conventionally well-designed, non-trivial program with no obvious waste and taking it through a series of optimizations until its wall-clock time was reduced from 48 seconds to 1.1 seconds and the source code size was reduced by a factor of 4. My diagnostic tool was manual stack sampling (described further below). The sequence of changes was this:
The first problem found was use of list clusters (now called "iterators" and "container classes") accounting for over half the time. Those were replaced with fairly simple code, bringing the time down to 20 seconds.
Now the largest time-taker is more list-building. As a percentage, it was not so big before, but now it is because the bigger problem was removed. I find a way to speed it up, and the time drops to 17 seconds.
Now it is harder to find obvious culprits, but there are a few smaller ones that I can do something about, and the time drops to 13 sec.
Now I seem to have hit a wall. The samples are telling me exactly what it is doing, but I can't seem to find anything that I can improve. Then I reflect on the basic design of the program, on its transaction-driven structure, and ask if all the list-searching that it is doing is actually mandated by the requirements of the problem.
Then I hit upon a re-design, where the program code is actually generated (via preprocessor macros) from a smaller set of source, and in which the program is not constantly figuring out things that the programmer knows are fairly predictable. In other words, don't "interpret" the sequence of things to do, "compile" it.
That redesign is done, shrinking the source code by a factor of 4, and the time is reduced to 10 seconds.
Now, because it's getting so quick, it's hard to sample, so I give it 10 times as much work to do, but the following times are based on the original workload.
More diagnosis reveals that it is spending time in queue-management. In-lining these reduces the time to 7 seconds.
Now a big time-taker is the diagnostic printing I had been doing. Flush that - 4 seconds.
Now the biggest time-takers are calls to malloc and free. Recycle objects - 2.6 seconds.
Continuing to sample, I still find operations that are not strictly necessary - 1.1 seconds.
Total speedup factor: 43.6
Now no two programs are alike, but in non-toy software I've always seen a progression like this. First you get the easy stuff, and then the more difficult, until you get to a point of diminishing returns. Then the insight you gain may well lead to a redesign, starting a new round of speedups, until you again hit diminishing returns. Now this is the point at which it might make sense to wonder whether ++i or i++ or for(;;) or while(1) are faster: the kinds of questions I see so often on Stack Overflow.
P.S. It may be wondered why I didn't use a profiler. The answer is that almost every one of these "problems" was a function call site, which stack samples pinpoint. Profilers, even today, are just barely coming around to the idea that statements and call instructions are more important to locate, and easier to fix, than whole functions.
I actually built a profiler to do this, but for a real down-and-dirty intimacy with what the code is doing, there's no substitute for getting your fingers right in it. It is not an issue that the number of samples is small, because none of the problems being found are so tiny that they are easily missed.
ADDED: jerryjvl requested some examples. Here is the first problem. It consists of a small number of separate lines of code, together taking over half the time:
/* IF ALL TASKS DONE, SEND ITC_ACKOP, AND DELETE OP */
if (ptop->current_task >= ILST_LENGTH(ptop->tasklist)) {
. . .
/* FOR EACH OPERATION REQUEST */
for ( ptop = ILST_FIRST(oplist); ptop != NULL; ptop = ILST_NEXT(oplist, ptop)){
. . .
/* GET CURRENT TASK */
ptask = ILST_NTH(ptop->tasklist, ptop->current_task);
These were using the list cluster ILST (similar to a list class). They are implemented in the usual way, with "information hiding" meaning that the users of the class were not supposed to have to care how they were implemented. When these lines were written (out of roughly 800 lines of code) thought was not given to the idea that these could be a "bottleneck" (I hate that word). They are simply the recommended way to do things. It is easy to say in hindsight that these should have been avoided, but in my experience all performance problems are like that. In general, it is good to try to avoid creating performance problems. It is even better to find and fix the ones that are created, even though they "should have been avoided" (in hindsight). I hope that gives a bit of the flavor.
Here is the second problem, in two separate lines:
/* ADD TASK TO TASK LIST */
ILST_APPEND(ptop->tasklist, ptask);
. . .
/* ADD TRANSACTION TO TRANSACTION QUEUE */
ILST_APPEND(trnque, ptrn);
These are building lists by appending items to their ends. (The fix was to collect the items in arrays, and build the lists all at once.) The interesting thing is that these statements only cost (i.e. were on the call stack) 3/48 of the original time, so they were not in fact a big problem at the beginning. However, after removing the first problem, they cost 3/20 of the time and so were now a "bigger fish". In general, that's how it goes.
I might add that this project was distilled from a real project I helped on. In that project, the performance problems were far more dramatic (as were the speedups), such as calling a database-access routine within an inner loop to see if a task was finished.
REFERENCE ADDED:
The source code, both original and redesigned, can be found in www.ddj.com, for 1993, in file 9311.zip, files slug.asc and slug.zip.
EDIT 2011/11/26:
There is now a SourceForge project containing source code in Visual C++ and a blow-by-blow description of how it was tuned. It only goes through the first half of the scenario described above, and it doesn't follow exactly the same sequence, but still gets a 2-3 order of magnitude speedup.
Suggestions:
Pre-compute rather than re-calculate: for any loops or repeated calls that contain calculations with a relatively limited range of inputs, consider building a lookup (array or dictionary) that contains the result of that calculation for all values in the valid range of inputs. Then use a simple lookup inside the algorithm instead (a small sketch follows this list).
Down-sides: if few of the pre-computed values are actually used, this may make matters worse; the lookup may also take significant memory.
Don't use library methods: most libraries need to be written to operate correctly under a broad range of scenarios, and perform null checks on parameters, etc. By re-implementing a method you may be able to strip out a lot of logic that does not apply in the exact circumstance you are using it.
Down-sides: writing additional code means more surface area for bugs.
Do use library methods: to contradict myself, language libraries get written by people that are a lot smarter than you or me; odds are they did it better and faster. Do not implement it yourself unless you can actually make it faster (i.e.: always measure!)
Cheat: in some cases, although an exact calculation may exist for your problem, you may not need 'exact'; sometimes an approximation may be 'good enough' and a lot faster into the bargain. Ask yourself: does it really matter if the answer is out by 1%? 5%? Even 10%?
Down-sides: Well... the answer won't be exact.
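The sketch promised in the pre-compute item above: a made-up "expensive" function with a small input domain, computed once into a table, then looked up inside the hot loop. The function, the domain size and the loop are all invented for illustration.

#include <math.h>
#include <stdio.h>

/* Hypothetical "expensive" function with a small input domain (0..255). */
static double expensive(int x)
{
    return sin(x * 0.01) * exp(-x * 0.003);
}

static double table[256];

/* Pre-compute once, up front. */
static void build_table(void)
{
    for (int i = 0; i < 256; i++)
        table[i] = expensive(i);
}

int main(void)
{
    build_table();

    /* Inside the hot loop, the calculation becomes a simple array lookup. */
    double sum = 0.0;
    for (int i = 0; i < 1000000; i++)
        sum += table[i & 0xFF];     /* i & 0xFF keeps the index in 0..255 */

    printf("%f\n", sum);
    return 0;
}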
When you can't improve the performance any more - see if you can improve the perceived performance instead.
You may not be able to make your fooCalc algorithm faster, but often there are ways to make your application seem more responsive to the user.
A few examples:
anticipating what the user is going to request and starting to work on that before then
displaying results as they come in, instead of all at once at the end
an accurate progress meter
These won't make your program faster, but they might make your users happier with the speed you have.
I spend most of my life in just this place. The broad strokes are to run your profiler and get it to record:
Cache misses. The data cache is the #1 source of stalls in most programs. Improve the cache hit rate by reorganizing offending data structures to have better locality; pack structures and numerical types down to eliminate wasted bytes (and therefore wasted cache fetches); prefetch data wherever possible to reduce stalls (a small data-layout sketch follows this list).
Load-hit-stores. Compiler assumptions about pointer aliasing, and cases where data is moved between disconnected register sets via memory, can cause a certain pathological behavior that causes the entire CPU pipeline to clear on a load op. Find places where floats, vectors, and ints are being cast to one another and eliminate them. Use __restrict liberally to promise the compiler about aliasing.
Microcoded operations. Most processors have some operations that cannot be pipelined, but instead run a tiny subroutine stored in ROM. Examples on the PowerPC are integer multiply, divide, and shift-by-variable-amount. The problem is that the entire pipeline stops dead while this operation is executing. Try to eliminate use of these operations or at least break them down into their constituent pipelined ops so you can get the benefit of superscalar dispatch on whatever the rest of your program is doing.
Branch mispredicts. These too empty the pipeline. Find cases where the CPU is spending a lot of time refilling the pipe after a branch, and use branch hinting if available to get it to predict correctly more often. Or better yet, replace branches with conditional-moves wherever possible, especially after floating point operations because their pipe is usually deeper and reading the condition flags after fcmp can cause a stall.
Sequential floating-point ops. Make these SIMD.
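The data-layout sketch promised in the first item of the list: it contrasts an array-of-structs layout, where the hot loop drags cold fields through the cache, with a struct-of-arrays layout that touches only the bytes the loop needs. The types and field names are invented.

#include <stdio.h>

#define N 100000

/* Array-of-structs: updating x pulls every field of the struct into cache. */
typedef struct {
    float x, y, z;
    float mass;
    int   id;
    char  name[32];        /* cold data interleaved with hot data */
} particle_aos;

static particle_aos aos[N];

/* Struct-of-arrays: the hot loop streams through tightly packed x values. */
static struct {
    float x[N];
    float y[N];
    float z[N];
} soa;

static void step_aos(float dt) { for (int i = 0; i < N; i++) aos[i].x += dt; }
static void step_soa(float dt) { for (int i = 0; i < N; i++) soa.x[i] += dt; }

int main(void)
{
    step_aos(0.01f);
    step_soa(0.01f);
    printf("%f %f\n", aos[0].x, soa.x[0]);
    return 0;
}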
And one more thing I like to do:
Set your compiler to output assembly listings and look at what it emits for the hotspot functions in your code. All those clever optimizations that "a good compiler should be able to do for you automatically"? Chances are your actual compiler doesn't do them. I've seen GCC emit truly WTF code.
Throw more hardware at it!
More suggestions:
Avoid I/O: any I/O (disk, network, ports, etc.) is always going to be far slower than any code that is performing calculations, so get rid of any I/O that you do not strictly need.
Move I/O up-front: load all the data you are going to need for a calculation up-front, so that you do not have repeated I/O waits within the core of a critical algorithm (and maybe, as a result, repeated disk seeks, when loading all the data in one hit may avoid seeking).
Delay I/O: do not write out your results until the calculation is over; store them in a data structure and then dump them out in one go at the end when the hard work is done (a small sketch of these two items follows this list).
Threaded I/O: for those daring enough, combine 'Move I/O up-front' or 'Delay I/O' with the actual calculation by moving the loading into a parallel thread, so that while you are loading more data you can work on a calculation on the data you already have, or while you calculate the next batch of data you can simultaneously write out the results from the last batch.
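The sketch promised under 'Move I/O up-front' and 'Delay I/O': read a hypothetical input file in one go, compute entirely in memory, and write all results in a single pass at the end. The file names and the placeholder "calculation" are made up.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Move I/O up-front: slurp the whole input file into memory. */
    FILE *in = fopen("input.dat", "rb");
    if (!in) { perror("input.dat"); return 1; }
    fseek(in, 0, SEEK_END);
    long size = ftell(in);
    rewind(in);

    unsigned char *data = malloc(size);
    if (!data || fread(data, 1, size, in) != (size_t)size) {
        fprintf(stderr, "read failed\n");
        return 1;
    }
    fclose(in);

    /* Compute entirely in memory - no I/O waits inside the hot loop. */
    unsigned char *out = malloc(size);
    if (!out) { fprintf(stderr, "out of memory\n"); return 1; }
    for (long i = 0; i < size; i++)
        out[i] = data[i] ^ 0x5A;            /* placeholder "calculation" */

    /* Delay I/O: dump all results in a single write at the end. */
    FILE *dst = fopen("output.dat", "wb");
    if (!dst) { perror("output.dat"); return 1; }
    fwrite(out, 1, size, dst);
    fclose(dst);

    free(data);
    free(out);
    return 0;
}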
Since many of the performance problems involve database issues, I'll give you some specific things to look at when tuning queries and stored procedures.
Avoid cursors in most databases. Avoid looping as well. Most of the time, data access should be set-based, not record-by-record processing. This includes not reusing a single-record stored procedure when you want to insert 1,000,000 records at once.
Never use select *; only return the fields you actually need. This is especially true if there are any joins, as the join fields will be repeated and thus cause unnecessary load on both the server and the network.
Avoid the use of correlated subqueries. Use joins (including joins to derived tables where possible). (I know this is true for Microsoft SQL Server, but test the advice when using a different backend.)
Index, index, index. And get those stats updated if applicable to your database.
Make the query sargable, meaning avoid things that make it impossible to use the indexes, such as a wildcard as the first character of a LIKE clause, or a function applied in the join or to the left-hand side of a WHERE comparison.
Use correct data types. It is faster to do date math on a date field than to have to try to convert a string datatype to a date datatype, then do the calculation.
Never put a loop of any kind into a trigger!
Most databases have a way to check how the query execution will be done. In Microsoft SQL Server this is called an execution plan. Check those first to see where problem areas lie.
Consider how often the query runs as well as how long it takes to run when determining what needs to be optimized. Sometimes you can gain more performance from a slight tweak to a query that runs millions of times a day than from shaving time off a long-running query that only runs once a month.
Use some sort of profiler tool to find out what is really being sent to and from the database. I can remember one time in the past where we couldn't figure out why the page was so slow to load when the stored procedure was fast and found out through profiling that the webpage was asking for the query many many times instead of once.
The profiler will also help you find out who is blocking whom. Some queries that execute quickly while running alone may become really slow due to locks from other queries.
The single most important limiting factor today is limited memory bandwidth. Multicores are just making this worse, as the bandwidth is shared between cores. Also, the limited chip area devoted to implementing caches is divided among the cores and threads, worsening this problem even more. Finally, the inter-chip signalling needed to keep the different caches coherent also increases with the number of cores. This adds a further penalty.
These are the effects that you need to manage. Sometimes through micro managing your code, but sometimes through careful consideration and refactoring.
A lot of comments already mention cache friendly code. There are at least two distinct flavors of this:
Avoid memory fetch latencies.
Lower memory bus pressure (bandwidth).
The first problem specifically has to do with making your data access patterns more regular, allowing the hardware prefetcher to work efficiently. Avoid dynamic memory allocation which spreads your data objects around in memory. Use linear containers instead of linked lists, hashes and trees.
The second problem has to do with improving data reuse. Alter your algorithms to work on subsets of your data that do fit in available cache, and reuse that data as much as possible while it is still in the cache.
Packing data tighter, and making sure you use all the data in the cache lines touched by the hot loops, will help avoid these other effects and allow fitting more useful data in the cache.
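A minimal sketch of the second flavor (improving reuse by working on cache-sized subsets), using a classic blocked matrix transpose. The matrix size and the block size are arbitrary and would need tuning for a real target.

#include <stdio.h>

#define N     1024
#define BLOCK 32            /* chosen so a BLOCK x BLOCK tile fits in cache */

static double a[N][N], b[N][N];

/* Naive transpose: the writes stride across memory, evicting cache lines
 * long before they are reused. */
static void transpose_naive(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            b[j][i] = a[i][j];
}

/* Blocked transpose: each tile of a and b stays resident while it is reused. */
static void transpose_blocked(void)
{
    for (int ii = 0; ii < N; ii += BLOCK)
        for (int jj = 0; jj < N; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK; i++)
                for (int j = jj; j < jj + BLOCK; j++)
                    b[j][i] = a[i][j];
}

int main(void)
{
    a[3][5] = 42.0;
    transpose_naive();
    transpose_blocked();
    printf("%f\n", b[5][3]);   /* same result either way; only the memory traffic differs */
    return 0;
}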
What hardware are you running on? Can you use platform-specific optimizations (like vectorization)?
Can you get a better compiler? E.g. switch from GCC to Intel?
Can you make your algorithm run in parallel?
Can you reduce cache misses by reorganizing data?
Can you disable asserts?
Micro-optimize for your compiler and platform. In the style of, "at an if/else, put the most common statement first"
Although I like Mike Dunlavey's answer (it is a great answer indeed, with a supporting example), I think it could be expressed very simply thus:
Find out what takes the largest amounts of time first, and understand why.
Identifying the time hogs helps you understand where you must refine your algorithm. This is the only all-encompassing, language-agnostic answer I can find to a problem that's already supposed to be fully optimised. It also presumes you want to be architecture-independent in your quest for speed.
So while the algorithm may be optimised, the implementation of it may not be. The identification lets you know which part is which: algorithm or implementation. Whichever hogs the most time is your prime candidate for review. But since you say you want to squeeze out the last few %, you might also want to examine the lesser parts, the parts that you did not examine that closely at first.
Lastly a bit of trial and error with performance figures on different ways to implement the same solution, or potentially different algorithms, can bring insights that help identify time wasters and time savers.
You should probably consider the "Google perspective", i.e. determine how your application can become largely parallelized and concurrent, which will inevitably also mean, at some point, looking into distributing your application across different machines and networks, so that it can ideally scale almost linearly with the hardware you throw at it.
On the other hand, the Google folks are also known for throwing lots of manpower and resources at solving some of the issues in projects, tools and infrastructure they are using, such as for example whole program optimization for gcc by having a dedicated team of engineers hacking gcc internals in order to prepare it for Google-typical use case scenarios.
Similarly, profiling an application no longer means to simply profile the program code, but also all its surrounding systems and infrastructure (think networks, switches, server, RAID arrays) in order to identify redundancies and optimization potential from a system's point of view.
Inline routines (eliminate call/return and parameter pushing)
Try eliminating tests/switches with table look ups (if they're faster)
Unroll loops (Duff's device) to the point where they just fit in the CPU cache (a plain unrolling sketch follows this list)
Localize memory access so as not to blow your cache
Localize related calculations if the optimizer isn't already doing that
Eliminate loop invariants if the optimizer isn't already doing that
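The unrolling sketch promised above: a plain four-way unroll rather than an actual Duff's device, over a made-up summation workload. Whether it helps at all depends on the compiler and CPU, so as always, measure.

#include <stdio.h>

#define N 1000000
static float data[N];

/* Straightforward loop: one add, one compare, one branch per element. */
static float sum_plain(void)
{
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        s += data[i];
    return s;
}

/* Unrolled by four: fewer branches per element, and the four partial sums
 * give the compiler independent work to schedule. N is assumed divisible by 4. */
static float sum_unrolled(void)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < N; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    return s0 + s1 + s2 + s3;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        data[i] = 1.0f;
    printf("%f %f\n", sum_plain(), sum_unrolled());
    return 0;
}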
When you get to the point where you're using efficient algorithms, it's a question of what you need more: speed or memory. Use caching to "pay" in memory for more speed, or use calculations to reduce the memory footprint.
If possible (and if it's more cost-effective), throw hardware at the problem - a faster CPU, more memory or a faster disk could solve the problem more quickly than trying to code your way around it.
Use parallelization if possible - run parts of the code on multiple threads.
Use the right tool for the job. Some programming languages create more efficient code than others: using managed code (e.g. Java/.NET) speeds up development, but native languages produce faster-running code.
Micro-optimize. Only where applicable, you can use optimized assembly to speed up small pieces of code; using SSE/vector optimizations in the right places can greatly increase performance.
Divide and conquer
If the dataset being processed is too large, loop over chunks of it. If you've done your code right, implementation should be easy. If you have a monolithic program, now you know better.
First of all, as mentioned in several prior answers, learn what bites your performance - is it memory or processor or network or database or something else. Depending on that...
...if it's memory - find one of the books written a long time ago by Knuth, one of "The Art of Computer Programming" series. Most likely it's the one about sorting and searching - if my memory is wrong, you'll have to find out which one discusses dealing with slow tape data storage. Mentally transform his memory/tape pair into your cache/main memory pair (or your L1/L2 cache pair) respectively. Study all the tricks he describes - if you don't find something that solves your problem, then hire a professional computer scientist to conduct professional research. If your memory issue happens to be with FFT (cache misses at bit-reversed indexes when doing radix-2 butterflies), then don't hire a scientist - instead, manually optimize the passes one by one until you either win or hit a dead end. You mentioned squeezing out up to the last few percent, right? If it's truly a few, you'll most likely win.
...if it's the processor - switch to assembly language. Study the processor specification - what takes ticks, VLIW, SIMD. Function calls are most likely replaceable tick-eaters. Learn loop transformations - pipelining, unrolling. Multiplies and divisions might be replaceable with, or interpolated by, bit shifts (multiplies by small integers might be replaceable with additions). Try tricks with shorter data - if you're lucky, one instruction on 64 bits might turn out to be replaceable with two on 32, or even four on 16, or eight on 8 bits. Try longer data too - e.g. your float calculations might turn out slower than double ones on a particular processor. If you have trigonometric stuff, fight it with pre-calculated tables; also keep in mind that the sine of a small value might be replaced with that value itself, if the loss of precision is within the allowed limits.
...if it's network - think of compressing data you pass over it. Replace XML transfer with binary. Study protocols. Try UDP instead of TCP if you can somehow handle data loss.
...if it's database, well, go to any database forum and ask for advice. In-memory data-grid, optimizing query plan etc etc etc.
HTH :)
Caching! A cheap way (in programmer effort) to make almost anything faster is to add a caching abstraction layer to any data movement area of your program. Be it I/O or just passing/creation of objects or structures. Often it's easy to add caches to factory classes and reader/writers.
Sometimes the cache will not gain you much, but it's an easy method to just add caching all over and then disable it where it doesn't help. I've often found this to gain huge performance without having to micro-analyse the code.
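As a sketch of such a caching layer in C (the "expensive" lookup, the cache size and the enable switch are all made up): the point is that the cache can be bolted on around the slow call and turned off without touching the callers.

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical expensive operation (imagine a file read or a remote call). */
static long expensive_lookup(int key)
{
    /* ... slow work stands in here ... */
    return (long)key * key;
}

#define CACHE_SLOTS 64
static struct { int key; long value; bool valid; } cache[CACHE_SLOTS];
static bool cache_enabled = true;   /* easy to switch off where it doesn't help */

static long cached_lookup(int key)
{
    int slot = (unsigned)key % CACHE_SLOTS;
    if (cache_enabled && cache[slot].valid && cache[slot].key == key)
        return cache[slot].value;               /* hit: skip the expensive path */

    long value = expensive_lookup(key);
    if (cache_enabled) {
        cache[slot].key = key;                  /* store for next time */
        cache[slot].value = value;
        cache[slot].valid = true;
    }
    return value;
}

int main(void)
{
    printf("%ld\n", cached_lookup(12));   /* miss */
    printf("%ld\n", cached_lookup(12));   /* hit  */
    return 0;
}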
I think this has already been said in a different way, but when you're dealing with a processor-intensive algorithm, you should simplify everything inside the innermost loop at the expense of everything else.
That may seem obvious to some, but it's something I try to focus on regardless of the language I'm working with. If you're dealing with nested loops, for example, and you find an opportunity to take some code down a level, you can in some cases drastically speed up your code. As another example, there are little things to think about, like working with integers instead of floating-point variables whenever you can, and using multiplication instead of division whenever you can. Again, these are things that should be considered for your innermost loop.
Sometimes you may find benefit in performing your math operations on an integer inside the inner loop, and then scaling down to a floating-point variable you can work with afterwards. That's an example of sacrificing speed in one section to improve the speed in another, but in some cases the payoff can be well worth it.
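A sketch of that idea with made-up fixed-point sensor data: the innermost loop does integer adds only, and the conversion to floating point happens once, outside the loop. Whether this actually wins depends entirely on the target hardware, so measure.

#include <stdio.h>

#define N     100000
#define SCALE 1000           /* fixed-point scale: three decimal digits */

static int samples_milli[N]; /* readings already stored in thousandths */

int main(void)
{
    for (int i = 0; i < N; i++)
        samples_milli[i] = i % 500;     /* fake data */

    /* Inner loop: integer adds only, no float conversion, no division. */
    long long total_milli = 0;
    for (int i = 0; i < N; i++)
        total_milli += samples_milli[i];

    /* Scale down to a float once, outside the hot loop. */
    double average = (double)total_milli / ((double)N * SCALE);
    printf("average = %f\n", average);
    return 0;
}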
I've spent some time working on optimising client/server business systems operating over low-bandwidth and long-latency networks (e.g. satellite, remote, offshore), and been able to achieve some dramatic performance improvements with a fairly repeatable process.
Measure: Start by understanding the network's underlying capacity and topology. Talk to the relevant networking people in the business, and make use of basic tools such as ping and traceroute to establish (at a minimum) the network latency from each client location during typical operational periods. Next, take accurate time measurements of specific end-user functions that display the problematic symptoms. Record all of these measurements, along with their locations, dates and times. Consider building end-user "network performance testing" functionality into your client application, allowing your power users to participate in the process of improvement; empowering them like this can have a huge psychological impact when you're dealing with users frustrated by a poorly performing system.
Analyze: Use any and all logging methods available to establish exactly what data is being transmitted and received during the execution of the affected operations. Ideally, your application can capture data transmitted and received by both the client and the server. If these include timestamps as well, even better. If sufficient logging isn't available (e.g. a closed system, or inability to deploy modifications into a production environment), use a network sniffer and make sure you really understand what's going on at the network level.
Cache: Look for cases where static or infrequently changed data is being transmitted repetitively and consider an appropriate caching strategy. Typical examples include "pick list" values or other "reference entities", which can be surprisingly large in some business applications. In many cases, users can accept that they must restart or refresh the application to update infrequently updated data, especially if it can shave significant time from the display of commonly used user interface elements. Make sure you understand the real behaviour of the caching elements already deployed - many common caching methods (e.g. HTTP ETag) still require a network round-trip to ensure consistency, and where network latency is expensive, you may be able to avoid it altogether with a different caching approach.
Parallelise: Look for sequential transactions that don't logically need to be issued strictly sequentially, and rework the system to issue them in parallel. I dealt with one case where an end-to-end request had an inherent network delay of ~2s, which was not a problem for a single transaction, but when 6 sequential 2s round trips were required before the user regained control of the client application, it became a huge source of frustration. Discovering that these transactions were in fact independent allowed them to be executed in parallel, reducing the end-user delay to very close to the cost of a single round trip.
Combine: Where sequential requests must be executed sequentially, look for opportunities to combine them into a single more comprehensive request. Typical examples include creation of new entities, followed by requests to relate those entities to other existing entities.
Compress: Look for opportunities to leverage compression of the payload, either by replacing a textual form with a binary one, or using actual compression technology. Many modern (i.e. within a decade) technology stacks support this almost transparently, so make sure it's configured. I have often been surprised by the significant impact of compression where it seemed clear that the problem was fundamentally latency rather than bandwidth, discovering after the fact that it allowed the transaction to fit within a single packet or otherwise avoid packet loss and therefore have an outsize impact on performance.
Repeat: Go back to the beginning and re-measure your operations (at the same locations and times) with the improvements in place, record and report your results. As with all optimisation, some problems may have been solved exposing others that now dominate.
In the steps above, I focus on the application related optimisation process, but of course you must ensure the underlying network itself is configured in the most efficient manner to support your application too. Engage the networking specialists in the business and determine if they're able to apply capacity improvements, QoS, network compression, or other techniques to address the problem. Usually, they will not understand your application's needs, so it's important that you're equipped (after the Analyse step) to discuss it with them, and also to make the business case for any costs you're going to be asking them to incur. I've encountered cases where erroneous network configuration caused the applications data to be transmitted over a slow satellite link rather than an overland link, simply because it was using a TCP port that was not "well known" by the networking specialists; obviously rectifying a problem like this can have a dramatic impact on performance, with no software code or configuration changes necessary at all.
Very difficult to give a generic answer to this question. It really depends on your problem domain and technical implementation. A general technique that is fairly language neutral: Identify code hotspots that cannot be eliminated, and hand-optimize assembler code.
The last few % is a very CPU and application dependent thing... cache architectures differ, some chips have on-chip RAM you can map directly, ARMs (sometimes) have a vector unit, and the SH4 has a useful matrix opcode. Is there a GPU - maybe a shader is the way to go. TMS320s are very sensitive to branches within loops (so separate loops and move conditions outside if possible). The list goes on... but these sorts of things really are the last resort.
Build for x86, and run Valgrind/Cachegrind against the code for proper performance profiling. Or Texas Instruments' CCStudio has a sweet profiler. Then you'll really know where to focus...
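The point above about branches within loops amounts to loop unswitching: hoist a condition that is invariant across iterations out of the loop so the inner loop body is branch-free. A trivial sketch of my own (not from the answer above):
#include <cstddef>

// Branch inside the loop: the predicate is re-tested every iteration.
void scale_or_clear(float* data, std::size_t n, bool clear) {
    for (std::size_t i = 0; i < n; ++i) {
        if (clear) {
            data[i] = 0.0f;
        } else {
            data[i] *= 2.0f;
        }
    }
}

// Unswitched version: the condition is tested once, and each loop body
// is simple enough for the compiler (or a branch-sensitive DSP) to optimise.
void scale_or_clear_unswitched(float* data, std::size_t n, bool clear) {
    if (clear) {
        for (std::size_t i = 0; i < n; ++i) data[i] = 0.0f;
    } else {
        for (std::size_t i = 0; i < n; ++i) data[i] *= 2.0f;
    }
}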
Not nearly as in depth or complex as previous answers, but here goes:
(these are more beginner/intermediate level)
obvious: DRY (don't repeat yourself)
run loops backwards so you're always comparing to 0 rather than a variable
use bitwise operators whenever you can
break repetitive code into modules/functions
cache objects
local variables have a slight performance advantage
limit string manipulation as much as possible
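For what the "loops backwards" and bitwise points above look like in practice, here is a small sketch of my own. Note that modern optimising compilers will often make these rewrites for you, so measure before relying on them.
#include <cstddef>

int sum_all(const int* values, std::size_t n) {
    int total = 0;
    // Counting down lets the loop condition compare against zero
    // instead of comparing against an upper bound each iteration.
    for (std::size_t i = n; i-- > 0; ) {
        total += values[i];
    }
    return total;
}

// Bitwise alternatives to multiply/divide/modulo by powers of two.
unsigned times_eight(unsigned x) { return x << 3; }  // x * 8
unsigned div_by_four(unsigned x) { return x >> 2; }  // x / 4
unsigned mod_sixteen(unsigned x) { return x & 15u; } // x % 16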
Did you know that a CAT6 cable is capable of roughly 10x better shielding against external interference than a standard Cat5e UTP cable?
For any non-offline project, even with the best software and the best hardware, if your throughput is weak, that thin line is going to squeeze your data and introduce delays, albeit in milliseconds...
The maximum throughput is also higher on CAT6 cables, because there is a better chance that you will actually receive a cable whose strands have solid copper cores instead of CCA (copper-clad aluminium), which is often found in standard CAT5e cables.
If you are facing lost packets or packet drops, then an increase in throughput reliability for 24/7 operation can make the difference you may be looking for.
For those who seek the ultimate in home/office connection reliability (and are willing to skip this year's fast-food visits), gift yourself the pinnacle of LAN connectivity in the form of a CAT7 cable from a reputable brand.
Impossible to say. It depends on what the code looks like. If we can assume that the code already exists, then we can simply look at it and figure out from that, how to optimize it.
Better cache locality, loop unrolling, Try to eliminate long dependency chains, to get better instruction-level parallelism. Prefer conditional moves over branches when possible. Exploit SIMD instructions when possible.
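To make the "prefer conditional moves over branches" point concrete, here is a sketch of my own: a branchy loop versus a branchless form that compilers typically lower to a conditional move (and can also auto-vectorise). Whether it is actually faster depends on how predictable the branch is, so measure.
#include <cstddef>

// Branchy: an unpredictable comparison can cost a mispredict per element.
// Both functions assume n >= 1.
int max_element_branchy(const int* v, std::size_t n) {
    int best = v[0];
    for (std::size_t i = 1; i < n; ++i) {
        if (v[i] > best) {
            best = v[i];
        }
    }
    return best;
}

// Branchless: the ternary usually compiles to a conditional move,
// which also makes the loop easier to auto-vectorise with SIMD.
int max_element_branchless(const int* v, std::size_t n) {
    int best = v[0];
    for (std::size_t i = 1; i < n; ++i) {
        best = (v[i] > best) ? v[i] : best;
    }
    return best;
}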
Understand what your code is doing, and understand the hardware it's running on. Then it becomes fairly simple to determine what you need to do to improve performance of your code. That's really the only truly general piece of advice I can think of.
Well, that, and "Show the code on SO and ask for optimization advice for that specific piece of code".
If better hardware is an option, then definitely go for that. Otherwise:
Check you are using the best compiler and linker options.
If the hotspot routine is in a different library from its frequent caller, consider moving or cloning it into the caller's module. This eliminates some of the call overhead and may improve cache hits (cf. how AIX links strcpy() statically into separately linked shared objects). It could of course decrease cache hits as well, which is why you measure.
See if there is any possibility of using a specialized version of the hotspot routine. Downside is more than one version to maintain.
Look at the assembler. If you think it could be better, consider why the compiler did not figure this out, and how you could help the compiler.
Consider: are you really using the best algorithm? Is it the best algorithm for your input size?
The google way is one option "Cache it.. Whenever possible don't touch the disk"
Here are some quick and dirty optimization techniques I use. I consider this to be a 'first pass' optimization.
Learn where the time is spent: find out exactly what is taking the time. Is it file IO? Is it CPU time? Is it the network? Is it the database? It's useless to optimize for IO if that's not the bottleneck.
Know Your Environment: knowing where to optimize typically depends on the development environment. In VB6, for example, passing by reference is slower than passing by value, but in C and C++, by reference is vastly faster. In C, it is reasonable to try something and do something different if a return code indicates a failure, while in .NET, catching exceptions is much slower than checking for a valid condition before attempting.
Indexes: build indexes on frequently queried database fields. You can almost always trade space for speed.
Avoid lookups: inside the loop to be optimized, I avoid having to do any lookups. Find the offset and/or index outside of the loop and reuse the data inside (see the sketch after this list).
Minimize IO: try to design in a manner that reduces the number of times you have to read or write, especially over a networked connection.
Reduce Abstractions: the more layers of abstraction the code has to work through, the slower it is. Inside the critical loop, reduce abstractions (e.g. reveal lower-level methods that avoid extra code).
Spawn Threads: for projects with a user interface, spawning a new thread to perform slower tasks makes the application feel more responsive, although it isn't actually faster.
Pre-process: you can generally trade space for speed. If there are calculations or other intense operations, see if you can precompute some of the information before you're in the critical loop.
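A minimal sketch of the "avoid lookups" and "pre-process" points combined, using a hypothetical configuration map of my own invention. The map lookup, the string parse, and the gain conversion are all hoisted out of the hot loop.
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_map>

// Naive: a map lookup, a parse, and a pow() happen on every iteration.
void apply_gain_naive(double* samples, std::size_t n,
                      const std::unordered_map<std::string, std::string>& config) {
    for (std::size_t i = 0; i < n; ++i) {
        const double gain_db = std::stod(config.at("gain_db")); // lookup + parse per element
        samples[i] *= std::pow(10.0, gain_db / 20.0);           // recomputed per element
    }
}

// Hoisted: look up, parse, and precompute once; reuse the result inside the loop.
void apply_gain_hoisted(double* samples, std::size_t n,
                        const std::unordered_map<std::string, std::string>& config) {
    const double gain_db = std::stod(config.at("gain_db"));
    const double factor = std::pow(10.0, gain_db / 20.0);
    for (std::size_t i = 0; i < n; ++i) {
        samples[i] *= factor;
    }
}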
If you have a lot of highly parallel floating point math-especially single-precision-try offloading it to a graphics processor (if one is present) using OpenCL or (for NVidia chips) CUDA. GPUs have immense floating point computing power in their shaders, which is much greater than that of a CPU.
Adding this answer since I didn't see it included in any of the others.
Minimize implicit conversion between types and signedness:
This applies to C/C++ at least. Even if you already think you're free of conversions, sometimes it's good to try adding compiler warnings around functions that require performance; watch out especially for conversions within loops.
GCC specific: you can test this by adding some strict diagnostic pragmas around your code,
#ifdef __GNUC__
/* Promote conversion-related warnings to errors for the enclosed code only. */
# pragma GCC diagnostic push
# pragma GCC diagnostic error "-Wsign-conversion"
# pragma GCC diagnostic error "-Wdouble-promotion"
# pragma GCC diagnostic error "-Wsign-compare"
# pragma GCC diagnostic error "-Wconversion"
#endif
/* your code */
#ifdef __GNUC__
/* Restore the previous diagnostic settings. */
# pragma GCC diagnostic pop
#endif
I've seen cases where you can get a few percent speedup by reducing conversions raised by warnings like this.
In some cases I have a header with strict warnings that I keep included to prevent accidental conversions, however this is a trade-off since you may end up adding a lot of casts to quiet intentional conversions which may just make the code more cluttered for minimal gains.
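As an illustration of the kind of conversion those warnings flag inside loops, a common case is mixing int with a container's unsigned size type. This sketch is my own example, not from the answer above.
#include <cstddef>
#include <vector>

// Flagged by -Wsign-compare / -Wsign-conversion: 'i' is signed, v.size() is
// unsigned, so both the comparison and the subscript involve implicit conversions.
long sum_with_conversions(const std::vector<int>& v) {
    long total = 0;
    for (int i = 0; i < v.size(); ++i) {
        total += v[i];
    }
    return total;
}

// Clean version: keep the index in the container's own size type.
long sum_without_conversions(const std::vector<int>& v) {
    long total = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        total += v[i];
    }
    return total;
}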
Sometimes changing the layout of your data can help. In C, you might switch from an array of structures to a structure of arrays, or vice versa.
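A small sketch of that layout change (my own example): if a hot loop only touches one field, the structure-of-arrays form keeps that field contiguous in memory, which tends to be friendlier to the cache and to vectorisation.
#include <cstddef>
#include <vector>

// Array of structures: each particle's fields are interleaved in memory.
struct ParticleAoS {
    float x, y, z;
    float mass;
};

float total_mass_aos(const std::vector<ParticleAoS>& particles) {
    float total = 0.0f;
    for (const auto& p : particles) {
        total += p.mass;  // strides past x, y, z to reach each mass
    }
    return total;
}

// Structure of arrays: all masses are contiguous, so the loop streams
// through one dense array and vectorises easily.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> mass;
};

float total_mass_soa(const ParticlesSoA& particles) {
    float total = 0.0f;
    for (float m : particles.mass) {
        total += m;
    }
    return total;
}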
Tweak the OS and framework.
It may sound like overkill, but think about it like this: operating systems and frameworks are designed to do many things. Your application only does very specific things. If you could get the OS to do exactly what your application needs, and have your application understand how the framework (PHP, .NET, Java) works, you could get much more out of your hardware.
Facebook, for example, changed some kernel-level things in Linux and changed how memcached works (for example, they wrote a memcached proxy and used UDP instead of TCP).
Another example of this is Windows Server 2008. Win2K8 has a version where you can install just the basic OS needed to run certain applications (e.g. web apps, server apps). This reduces much of the overhead that the OS puts on running processes and gives you better performance.
Of course, you should always throw in more hardware as the first step...

Hardest types of bugs to track? [closed]

What are some of the nastiest, most difficult bugs you have had to track and fix and why?
I am both genuinely curious and knee-deep in the process as we speak. So as they say - misery loves company.
Heisenbugs:
A heisenbug (named after the Heisenberg Uncertainty Principle) is a computer bug that disappears or alters its characteristics when an attempt is made to study it.
Race conditions and deadlocks. I do a lot of multithreaded processes and that is the hardest thing to deal with.
Bugs that happen when compiled in release mode but not in debug mode.
Any bug based on timing conditions. These often come when working with inter-thread communication, an external system, reading from a network, reading from a file, or communicating with any external server or device.
Bugs that are not in your code per se, but rather in a vendor's module on which you depend. Particularly when the vendor is unresponsive and you are forced to hack a work-around. Very frustrating!
We were developing a database to hold words and definitions in another language. It turns out that this language had only recently been added to the Unicode standard (around 2005) and it didn't make it into SQL Server 2005's collations. This had a very frustrating effect when it came to collation.
Words and definitions went in just fine, I could see everything in Management Studio. But whenever we tried to find the definition for a given word, our queries returned nothing. After a solid 8 hours of debugging, I was at the point of thinking I had lost the ability to write a simple SELECT query.
That is, until I noticed English letters matched other English letters with any amount of foreign letters thrown in. For example, EnglishWord would match E!n#gl##$ish$&Word. (With !##$%^&* representing foreign letters).
When a collation doesn't know about a certain character, it can't sort it. If it can't sort the characters, it can't tell whether two strings match or not (a surprise for me). So frustrating, and a whole day down the drain for a stupid collation setting.
Threading bugs, especially race conditions. When you cannot stop the system (because the bug goes away), things quickly get tough.
The hardest ones I usually run into are ones that don't show up in any log trace. You should never silently eat an exception! The problem is that eating an exception often moves your code into an invalid state, where it fails later in another thread and in a completely unrelated manner.
That said, the hardest one I ever really ran into was a C program in a function call where the calling signature didn't exactly match the called signature (one was a long, the other an int). There were no errors at compile time or link time and most tests passed, but the stack was off by sizeof(int), so the variables after it on the stack would randomly have bad values, but most of the time it would work fine (the values following that bad parameter were generally being passed in as zero).
That was a BITCH to track.
Memory corruption under load due to bad hardware.
Bugs that happen on one server and not another, and you don't have access to the offending server to debug it.
Bugs that have to do with threading.
The most frustrating for me have been compiler bugs, where the code is correct but I've hit an undocumented corner case or something where the compiler's wrong. I start with the assumption that I've made a mistake, and then spend days trying to find it.
Edit: The other most frustrating was the time I got the test case set slightly wrong, so my code was correct but the test wasn't. That took days to find.
In general, I guess the worst bugs I've had have been the ones that aren't my fault.
The hardest bugs to track down and fix are those that combine all the difficult cases:
reported by a third party but you can't reproduce it under your own testing conditions;
bug occurs rarely and unpredictably (e.g. because it's caused by a race condition);
bug is on an embedded system and you can't attach a debugger;
when you try to get logging information out the bug goes away;
bug is in third-party code such as a library ...
... to which you don't have the source code so you have to work with disassembly only;
and the bug is at the interface between multiple hardware systems (e.g. networking protocol bugs or bus contention bugs).
I was working on a bug with all these features this week. It was necessary to reverse engineer the library to find out what it was up to; then generate hypotheses about which two devices were racing; then make specially-instrumented versions of the program designed to provoke the hypothesized race condition; then once one of the hypotheses was confirmed it was possible to synchronize the timing of events so that the library won the race 100% of the time.
There was a project building a chemical engineering simulator using a Beowulf cluster. It so happened that the network cards would not transmit one particular sequence of bytes: if a packet contained that string, the packet would be lost. They solved the problem by replacing the hardware - finding it in the first place was much harder.
One of the hardest bugs I had to find was a memory corruption error that only occurred after the program had been running for hours. Because of the length of time it took to corrupt the data, we assumed hardware and tried two or three other computers first.
The bug would take hours to appear, and when it did it was usually only noticed quite some time later, once the program got so messed up that it started misbehaving. Narrowing down where in the code base the bug was occurring was very difficult, because the crashes due to corrupted memory never occurred in the function that corrupted the memory, and it took so damned long for the bug to manifest itself.
The bug turned out to be an off-by-one error in a rarely called piece of code to handle a data line that had something wrong with it (invalid character encoding from memory).
In the end the debugger proved next to useless because the crashes never occurred in the call tree for the offending function. A well sequenced stream of fprintf(stderr, ...) calls in the code and dumping the output to a file was what eventually allowed us to identify what the problem was.
Concurrency bugs are quite hard to track, because reproducing them can be very hard when you do not yet know what the bug is. That's why, every time you see an unexplained stack trace in the logs, you should search for the reason for that exception until you find it. Even if it happens only one time in a million, that does not make it unimportant.
Since you cannot rely on tests to reproduce the bug, you must use deductive reasoning to track it down. That in turn requires a deep understanding of how the system works (for example how Java's memory model works and what the possible sources of concurrency bugs are).
Here is an example of a concurrency bug in Guice 1.0 which I located just some days ago. You can test your bug-finding skills by trying to find out what bug is causing that exception. It is not too hard to find - I found its cause in some 15-30 min (the answer is here).
java.lang.NullPointerException
at com.google.inject.InjectorImpl.injectMembers(InjectorImpl.java:673)
at com.google.inject.InjectorImpl$8.call(InjectorImpl.java:682)
at com.google.inject.InjectorImpl$8.call(InjectorImpl.java:681)
at com.google.inject.InjectorImpl.callInContext(InjectorImpl.java:747)
at com.google.inject.InjectorImpl.injectMembers(InjectorImpl.java:680)
at ...
P.S. Faulty hardware might cause even nastier bugs than concurrency, because it may take a long time before you can confidently conclude that there is no bug in the code. Luckily hardware bugs are rarer than software bugs.
A friend of mine had this bug. He accidentally put a function argument in a C program in square brackets instead of parentheses, like this: foo[5] instead of foo(5). The compiler was perfectly happy, because the function name is a pointer, and there is nothing illegal about indexing off a pointer.
One of the most frustrating for me was when the algorithm was wrong in the software spec.
Probably not the hardest, but they are extremely common and not trivial:
Bugs concerning mutable state. It is hard to maintain invariants in a data structure if it has many mutable fields. And you have operation-order dependencies: swap two lines and something bad occurs. One of my recent hard-to-find bugs was when I found that the previous developer of the system I maintained had used mutable data for hashtable keys - under some rare conditions this led to infinite loops.
Order of initialization bugs. Can be obvious when found, but not so when coding.
The hardest one ever was actually a bug I was helping a friend with. He was writing C in MS Visual Studio 2005 and forgot to include time.h. He then called time() without the required argument (usually NULL). This implicitly declared time like: int time(); That corrupted the stack, and in a completely unpredictable way. It was a large amount of code, and we didn't think to look at the time() call for quite some time.
Buffer overflows ( in native code )
Last year I spent a couple of months tracking a problem that ended up being a bug in a downstream system. The team lead from the offending system kept claiming that it must be something funny in our processing, even though we passed the data exactly as they had requested it from us. If the lead had been a little more cooperative, we might have nailed the bug sooner.
Uninitialized variables. (Or have modern languages done away with this?)
For embedded systems:
Unusual behaviour reported by customers in the field, but which we're unable to reproduce.
After that, bugs which turn out to be due to a freak series or concurrence of events. These are at least reproducible, but obviously they can take a long time - and a lot of experimentation - to make happen.
Difficulty of tracking:
off-by-one errors
boundary condition errors
Machine dependent problems.
I'm currently trying to debug why an application has an unhandled exception in a try{} catch{} block (yes, unhandled inside of a try / catch) that only manifests on certain OS / machine builds, and not on others.
Same version of software, same installation media, same source code, works on some - unhandled exception in what should be a very well handled part of code on others.
Gak.
When objects are cached and their equals and hashCode implementations are written so poorly that the hash code value isn't unique and equals returns true when the objects aren't actually equal.
Cosmetic web bugs involving styling in various browser/OS configurations, e.g. a page looks fine on Windows and Mac in Firefox and IE, but on the Mac in Safari something gets messed up. These are annoying because they require so much attention to detail, and the change that fixes Safari may break something in Firefox or IE, so one has to tread carefully and realize that the styling may be a series of hacks fixing page after page. I'd say those are my nastiest ones, and sometimes they just don't get fixed because they aren't viewed as important.
WAY back in the days, memory leaks. Thankfully, there's a lot of tools to find them, these days.
Memory issues, particularly on older systems. We have some legacy 16-bit C software that must remain 16-bit for the time being. The 64K memory blocks are a royal pain to work with, and we constantly add statics or code logic that pushes us past the 64K group limits.
To make matters worse, memory errors usually don't cause the program to crash, but cause certain features to sporadically break (and not always the same features). Debugging is a non-option - the debugger doesn't have the same memory constraints so the programs always run fine in debug mode ... plus, we can't add inline printf statements for testing since that bumps the memory usage even higher.
As a result, we can sometimes spend DAYS trying to find a single block of code to rewrite, or hours moving static chars to files. Luckily the system is slowly being moved offline.
Multithreading, memory leaks, anything requiring extensive mocks, interfacing with third-party software.

Resources