If as assert fails, is there a bug?

If as assert fails, is there a bug? - debugging

I've always followed the logic: if assert fails, then there is a bug. Root cause could either be:
Assert itself is invalid (bug)
There is a programming error (bug)
(no other options)
I.E. Are there any other conclusions one could come to? Are there cases where an assert would fail and there is no bug?

If assert fails there is a bug in either the caller or callee. Why else would there be an assertion?

Yes, there is a bug in the code.
Code Complete
Assertions check for conditions that
should never occur. [...]
If an
assertion is fired for an anomalous
condition, the corrective action is
not merely to handle an error
gracefully- the corrective action is
to change the program's source code,
recompile, and release a new version
of the software.
A good way to
think of assertions is as executable
documentation - you can't rely on them
to make the code work, but they can
document assumptions more actively
than program-language comments can.

That's a good question.
My feeling is, if the assert fails due to your code, then it is a bug. The assertion is an expected behaviour/result of your code, so an assertion failure will be a failure of your code.

Only if the assert was meant to show a warning condition - in which case a special class of assert should have been used.
So, any assert should show a bug as you suggest.

If you are using assertions you're following Bertrand Meyer's Design by Contract philosophy. It's a programming error - the contract (assertion) you have specified is not being followed by the client (caller).

If you are trying to be logically inclusive about all the possibilities, remember that electronic circuitry is known to be affected by radiation from space. If the right photon/particle hits in just the right place at just the right time, it can cause an otherwise logically impossible state transition.
The probability is vanishingly small but still non-zero.

I can think of one case that wouldn't really class as a bug:
An assert placed to check for something external that normally should be there. You're hunting something nutty that occurs on one machine and you want to know if a certain factor is responsible.
A real world example (although from before the era of asserts): If a certain directory was hidden on a certain machine the program would barf. I never found any piece of code that should have cared if the directory was hidden. I had only very limited access to the offending machine (it had a bunch of accounting stuff on it) so I couldn't hunt it properly on the machine and I couldn't reproduce it elsewhere. Something that was done with that machine (the culprit was never identified) occasionally turned that directory hidden.
I finally resorted to putting a test in the startup to see if the directory was hidden and stopping with an error if it was.

No. An assertion failure means something happened that the original programmer did not intend or expect to occur.
This can indicate:
A bug in your code (you are simply calling the method incorrectly)
A bug in the Assertion (the original programmer has been too zealous and is complaining about you doing something that is quite reasonable and the method will actually handle perfectly well.
A bug in the called code (a design flaw). That is, the called code provides a contract that does not allow you to do what you need to do. The assertion warns you that you can't do things that way, but the solution is to extend the called method to handle your input.
A known but unimplemented feature. Imagine I implement a method that could process positive and negative integers, but I only need it (for now) to handle positive ones. I know that the "perfect" implementation would handle both, but until I actually need it to handle negatives, it is a waste of effort to implement support (and it would add code bloat and possibly slow down my application). So I have considered the case but I decide not to implement it until the need is proven. I therefore add an assert to mark this unimplemented code. When I later trigger the assert by passing a negative value in, I know that the additional functionality is now needed, so I must augment the implementation. Deferring writing the code until it is actually required thus saves me a lot of time (in most cases I never imeplement the additiona feature), but the assert makes sure that I don't get any bugs when I try to use the unimplemented feature.

Related

Fail vs. raise in Ruby : Should we really believe the style guide?

Ruby offers two possibilities to cause an exception programmatically: raise and fail, both being Kernel methods. According to the documents, they are absolutely equivalent.
Out of a habit, I used only raise so far. Now I found several recommendations (for example here), to use raise for exceptions to be caught, and fail for serious errors which are not meant to be handled.
But does it really make sense? When you are writing a class or module, and cause a problem deep inside, which you signal by fail, your programming colleagues who are reviewing the code, might happily understand your intentions, but the person who is using my code will most likely not look at my code and has no way of knowing, whether the exception was caused by a raise or by fail. Hence, my careful usage of raise or fail can't have any influence on his decision, whether she should or should not handle it.
Could someone see flaws in my arguments? Or are there other criteria, which might me want to use fail instead of raise?

use 'raise' for exceptions to be caught, and 'fail' for serious errors which are not meant to be handled
This is not what the official style guide or the link you provided say on the matter.
What is meant here is use raise only in rescue blocks. Aka use fail when you want to say something is failing and use raise when rethrowing an exception.
As for the "does it matter" part - it is not one of the most hardcore strictly followed rules, but you could make the same argument for any convention. You should follow in that order:
Your project style guide
Your company style guide
The community style guide
Ideally, the three should be the same.
Update: As of this PR (December 2015), the convention is to always use raise.

I once had a conversation with Jim Weirich about this very thing, I have since always used fail when my method is explicitly failing for some reason and raise to re-thrown exceptions.
Here is a post with a message from Jim (almost verbatim to what he said to me in person):
http://www.virtuouscode.com/2014/05/21/jim-weirich-on-exceptions/
Here is the relevant text from the post, a quote attributed to Jim:
Here’s my basic philosophy (and other random thoughts) on exceptions.
When you call a method, you have certain expectations about what the method will accomplish. Formally, these expectations are called post-conditions. A method should throw an exception whenever it fails to meet its postconditions.
To effectively use this strategy, it implies you must have a small understanding of Design by Contract and the meaning of pre- and post-conditions. I think that’s a good thing to know anyways.
Here’s some concrete examples. The Rails model save method:
model.save!
-- post-condition: The model object is saved.
If the model is not saved for some reason, then an exception must be raised because the post-condition is not met.
model.save
-- post-condition: (the model is saved && result == true) ||
(the model is not saved && result == false)
If save doesn’t actually save, then the returned result will be false , but the post-condition is still met, hence no exception.
I find it interesting that the save! method has a vastly simpler post-condition.
On the topic of rescuing exceptions, I think an application should have strategic points where exceptions are rescued. There is little need for rescue/rethrows for the most part. The only time you would want to rescue and rethrow is when you have a job half-way done and you want to undo something so avoid a partially complete state. Your strategic rescue points should be chosen carefully so that the program can continue with other work even if the current operation failed. Transaction processing programs should just move on to the next transaction. A Rails app should recover and be ready to handle the next http request.
Most exception handlers should be generic. Since exceptions indicate a failure of some type, then the handler needs only make a decision on what to do in case of failure. Detailed recovery operations for very specific exceptions are generally discouraged unless the handler is very close (call graph wise) to the point of the exception.
Exceptions should not be used for flow control, use throw/catch for that. This reserves exceptions for true failure conditions.
(An aside, because I use exceptions to indicate failures, I almost always use the fail keyword rather than the raise keyword in Ruby. Fail and raise are synonyms so there is no difference except that fail more clearly communcates that the method has failed. The only time I use raise is when I am catching an exception and re-raising it, because here I’m not failing, but explicitly and purposefully raising an exception. This is a stylistic issue I follow, but I doubt many other people do).
There you have it, a rather rambling memory dump on my thoughts on exceptions.
I know that there are many style guides that do not agree (the style guide used by RoboCop, for example). I don't care. Jim convinced me.

How "defensive" should my code be?

I was having a discussion with one of my colleagues about how defensive your code should be. I am all pro defensive programming but you have to know where to stop. We are working on a project that will be maintained by others, but this doesn't mean we have to check for ALL the crazy things a developer could do. Of course, you could do that but this will add a very big overhead to your code.
How do you know where to draw the line?

Anything a user enters directly or indirectly, you should always sanity-check. Beyond that, a few asserts here and there won't hurt, but you can't really do much about crazy programmers editing and breaking your code, anyway!-)

I tend to change the amount of defense I put in my code based on the language. Today I'm primarily working in C++ so my thoughts are drifting in that direction.
When working in C++ there cannot be enough defensive programming. I treat my code as if I'm guarding nuclear secrets and every other programmer is out to get them. Asserts, throws, compiler time error template hacks, argument validation, eliminating pointers, in depth code reviews and general paranoia are all fair game. C++ is an evil wonderful language that I both love and severely mistrust.

I'm not a fan of the term "defensive programming". To me it suggests code like this:
void MakePayment( Account * a, const Payment * p ) {
if ( a == 0 || p == 0 ) {
return;
}
// payment logic here
}
This is wrong, wrong, wrong, but I must have seen it hundreds of times. The function should never have been called with null pointers in the first place, and it is utterly wrong to quietly accept them.
The correct approach here is debatable, but a minimal solution is to fail noisily, either by using an assert or by throwing an exception.
Edit: I disagree with some other answers and comments here - I do not think that all functions should check their parameters (for many functions this is simply impossible). Instead, I believe that all functions should document the values that are acceptable and state that other values will result in undefined behaviour. This is the approach taken by the most succesful and widely used libraries ever written - the C and C++ standard libraries.
And now let the downvotes begin...

I don't know that there's really any way to answer this. It's just something that you learn from experience. You just need to ask yourself how common a potential problem is likely to be and make a judgement call. Also consider that you don't necessarily have to always code defensively. Sometimes it's acceptable just to note any potential problems in your code's documentation.
Ultimately though, I think this is just something that a person has to follow their intuition on. There's no right or wrong way to do it.

If you're working on public APIs of a component then its worth doing a good amount of parameter validation. This led me to have a habit of doing validation everywhere. Thats a mistake. All that validation code never gets tested and potentially makes the system more complicated than it needs to be.
Now I prefer to validate by unit testing. Validation definitely happens for data coming from external sources, but not for calls from non-external developers.

I always Debug.Assert my assumptions.

My personal ideology: the defensiveness of a program should be proportional to the maximum naivety/ignorance of the potential user base.

Being defensive against developers consuming your API code is not that different from being defensive against regular users.
Check the parameters to make sure they are within appropriate bounds and of expected types
Verify that the number of API calls which could be made are within your Terms of Service. Generally called throttling it usually only applies to web services and password checking functions.
Beyond that there's not much else to do except make sure your app recovers well in the event of a problem and that you always give ample information to the developer so that they understand what's going on.

Defensive programming is only one way of hounouring a contract in a design-by-contract manner of coding.
The other two are
total programming and
nominal programming.
Of course you shouldnt defend yourself against every crazy thing a developer could do, but then you should state in wich context it will do what is expected to using preconditions.
//precondition : par is so and so and so
function doSth(par)
{
debug.assert(par is so and so and so )
//dostuf with par
return result
}

I think you have to bring in the question of whether you're creating tests as well. You should be defensive in your coding, but as pointed out by JaredPar -- I also believe it depends on the language you're using. If it's unmanaged code, then you should be extremely defensive. If it's managed, I believe you have a little bit of wiggleroom.
If you have tests, and some other developer tries to decimate your code, the tests will fail. But then again, it depends on test coverage on your code (if there is any).

I try to write code that is more than defensive, but down right hostile. If something goes wrong and I can fix it, I will. if not, throw or pass on the exception and make it someone elses problem. Anything that interacts with a physical device - file system, database connection, network connection should be considered unereliable and prone to failure. anticipating these failures and trapping them is critical
Once you have this mindset, the key is to be consistent in your approach. do you expect to hand back status codes to comminicate problems in the call chain or do you like exceptions. mixed models will kill you or at least drive you to drink. heavily. if you are using someone elses api, then isolate these things into mechanisms that trap/report in terms you use. use these wrapping interfaces.

If the discussion here is how to code defensively against future (possibly malevolent or incompetent) maintainers, there is a limit to what you can do. Enforcing contracts through test coverage and liberal use of asserting your assumptions is probably the best you can do, and it should be done in a way that ideally doesn't clutter the code and make the job harder for the future non-evil maintainers of the code. Asserts are easy to read and understand and make it clear what the assumptions of a given piece of code is, so they're usually a great idea.
Coding defensively against user actions is another issue entirely, and the approach that I use is to think that the user is out to get me. Every input is examined as carefully as I can manage, and I make every effort to have my code fail safe - try not to persist any state that isn't rigorously vetted, correct where you can, exit gracefully if you cannot, etc. If you just think about all the bozo things that could be perpetrated on your code by outside agents, it gets you in the right mindset.
Coding defensively against other code, such as your platform or other modules, is exactly the same as users: they're out to get you. The OS is always going to swap out your thread at an inopportune time, networks are always going to go away at the wrong time, and in general, evil abounds around every corner. You don't need to code against every potential problem out there - the cost in maintenance might not be worth the increase in safety - but it sure doesn't hurt to think about it. And it usually doesn't hurt to explicitly comment in the code if there's a scenario you thought of but regard as unimportant for some reason.

Systems should have well designed boundaries where defensive checking happens. There should be a decision about where user input is validated (at what boundary) and where other potential defensive issues require checking (for example, third party integration points, publicly available APIs, rules engine interaction, or different units coded by different teams of programmers). More defensive checking than that violates DRY in many cases, and just adds maintenance cost for very little benifit.
That being said, there are certain points where you cannot be too paranoid. Potential for buffer overflows, data corruption and similar issues should be very rigorously defended against.

I recently had scenario, in which user input data was propagated through remote facade interface, then local facade interface, then some other class, to finally get to the method where it was actually used. I was asking my self a question: When should be the value validated? I added validation code only to the final class, where the value was actually used. Adding other validation code snippets in classes laying on the propagation path would be too defensive programming for me. One exception could be the remote facade, but I skipped it too.

Good question, I've flip flopped between doing sanity checks and not doing them. Its a 50/50
situation, I'd probably take a middle ground where I would only "Bullet Proof" any routines that are:
(a) Called from more than one place in the project
(b) has logic that is LIKELY to change
(c) You can not use default values
(d) the routine can not be 'failed' gracefully
Darknight

What helps to you improve your ability to find a bug?

I want to know if there are method to quickly find bugs in the program.
It seems that the more you master the architecture of your software, the more quickly
you can locate the bugs.
How the programmers improve their ability to find a bug?

Logging, and unit tests. The more information you have about what happened, the easier it is to reproduce it. The more modular you can make your code, the easier it is to check that it really is misbehaving where you think it is, and then check that your fix solves the problem.

Divide and conquer. Whenever you are debugging, you should be thinking about cutting down the possible locations of the problem. Every time you run the app, you should be trying to eliminate a possible source and zero in on the actual location. This can be done with logging, with a debugger, assertions, etc.

Here's a prophylactic method after you have found a bug: I find it really helpful to take a minute and think about the bug.
What was the bug exactly in essence.
Why did it occur.
Could you have found it earlier, easier.
Anything else you learned from the bug.
I find taking a minute to think about these things will make it far less likely that you will produce the same bug in the future.

I will assume you mean logic bugs. The best way I have found to capture logic bugs is to implement some sort of testing scheme. Check out jUnit as the standard. Pretty much you define a set of accepted outputs of your methods. Every time you compile your system it checks all of your test cases. If you have introduced new logic that breaks your tests, you will know about it instantly and know exactly what you have to fix.
Test driven design is a pretty big movement in programming right now. You will be hard pressed to find a language that doesn't support some kind of testing. Even JavaScript has a multitude of test suites.

Experience makes you a better debugger. Pay close attention to the bugs that you AND others commonly make. Try to figure out if/how these bugs apply to ALL code that affects you, not the single instance of where the bug was seen.
Raymond Chen is famous for his powers of psychic debugging.
Most of what looks like psychic
debugging is really just knowing what
people tend to get wrong.
That means that you don't necessarily have to be intimately familiar with the architecture / system. You just need enough knowledge to understand the types of bugs that apply and are easy to make.

I personally take the approach of thinking about where the bug may be in the code before actually opening up the code and taking a look. When you first start with this approach, it may not actually work very well, especially if you are pretty unfamiliar with the code base. However, over time someone will be able to tell you the behavior they are experiencing and you'll have a good idea where the problem is located or you may even know what to fix in the code to remedy the problem before even looking at the code.
I was on a project for several years that maintained by a vendor. They were not very good debuggers and most of the time it was up to us to point them to an area of the code that had the problem. What made our problem worse was that we didn't have a nice way to view the source code, so a lot of our "debugging" was just feeling.

Error checking and reporting. The #1 newbie coder debugging mistake is to turn off error reporting, avoid checking for whether what's going on makes sense, etc etc. In general, people feel like if they can't see anything going wrong then nothing is going wrong. Which of course could not be further from the case.
Instead, your code should be chock full of error conditions that will make lots of noise, with detailed reporting, someplace you will see it. (This doesn't mean inside a production web page.) Then, instead of having to trace an error all over the place because it got passed through sixteen layers of execution before it finally got someplace that broke, your errors start happening proximately to the actual issue.

It seems that the more you master the
architecture of your software ,the
more quickly you can locate the bugs.
After understanding the architecture, one's ability to find bugs in the application increases with their ability to identify and write extensive tests.

Know your tools.
Make sure that you know how to use conditional breakpoints and watches in your debugger.
Use static analysis tools as well - they can point out the more obvious issues.

Sleep and rest.

Use programming methods that produce fewer bugs in the first place.
If to implement a single stand-alone functional requirement it takes N separate point-edits to source code, the number of bugs put into the code is roughly proportional to N, so find programming methods that minimize N. Ways to do this: DRY (don't repeat yourself), code generation, and DSL (domain-specific-language).
Where bugs are likely, have unit tests.
Obviously.IMHO, the best unit tests are monte-carlo.
Make intermediate results visible.
For example, compilers have intermediate representations, in the form of 4-tuples. If there is a bug, the intermediate code can be examined. That tells if the bug is in the first or second half of the compiler.
P.S. Most programmers are not aware that they have a choice of how much data structure to use. The less data structure you use, the less are the chances for bugs (and performance issues) caused by it.

I find tracepoints to be an invaluable debugging tool. They are a bit like logging, except you create them during a debugging session to solve a particular issue, like breakpoints.
Printing the stacktrace in a tracepoint can be especially useful. For example, you can print the hash code and stacktrace in the constructor of an object, and then later on when the object is used again you can search for its hashcode to see which client code created it. Same for seeing who disposed it or called a certain method etc.
They are also great for debugging issues related to window focus changes etc, where the debugger would interfere if you drop in break mode.

Static code tools like FindBugs

Assertions, assertions, and assertions.
Some areas of our code has 4 or 5 assertions for each line of real code. When we get a bug report the first thing that happens is that the customer data is processed in our debug build 99 times out a hundred an assert will fire near the cause of the bug.
Additionally our debug build perform redundant calculations to ensure that an optimized algorithm is returning the correct result, and also debug functions are used to examine the sanity of data structures.
The hardest thing new developers have to contend with is getting their code to survive the assertions of the code gthey are calling.
Additionally we do not allow any code to be putback to toplevel that causes any integration or unit test to fail.

Stepping through the code, examining flow/state where unexpected behavior is occurring. (Then develop a test for it, of course).

Writing Debug.Write(message) in your code and using DebugView is another option. And then run your application find out what is going on.

"Architecture" in software means something like:
Several components
The components interact across clearly-defined interfaces
Each component has a well-defined responsibility
The responsibility of one component is unlike the responsibilities of other components
So, as you said, the better the architecture the easier it is to find bugs.
First: knowing the bug, you can decide which functionality is broken, and therefore know which component implements that functionality. For example, if the bug is that something isn't being logged properly, therefore this bug should be in one of 3 places:
In the component that's responsible for logging (your logging library)
Or, above that in the application code which is using this library
Or, below that in the system code which this library is using
Second: examine the data transfered across the interfaces between components. To continue the previous example above:
Set a debugger breakpoint on the application code which invokes the logger API, to verify whether the logger API is being used correctly (e.g. whether it's being invoked at all, whether parameters are as-expected, etc.).
Doing this tells you whether the bug is in the component above this interface, or in the component that's below this interface.
Repeat (perhaps using binary search if the call stack is very deep) until you've found which component is at fault.

When you come to the point that you think there must be a bug in the OS, check your assertions -- and put them into the code with "assert" statements.
Conversely, as you are writing the code, think of the range of valid inputs for your algorithms and put in assertions to make sure you have what you think you have. Same goes for output: Check that you produced what you think you produced.
E.g. if you expect a non-empty list:
l = getList(input)
assert l, "List was empty for input: %s" % str(input)

I'm part of the QA team # work, and knowing anything about the product and how it is developed, helps a lot in finding bugs, also when I make new QA tools I pass it to our dev team to test it, finding bugs in your own code is just plain hard!
Some people say programmers are tainted, so we cannot see bugs in their own product; we are not talking about code here, we are beyond that, usability and functionality itself.
Meanwhile unit testing seams to be a nice solution to find bugs in your own code, its totally pointless if you're wrong even before writing the unit test, how are you going to find the bugs then? you don't!, let your co-worker find them, hire a QA guy.

Scientific debugging is what I always used, and it greatly helps.
Basically, if you can replicate a bug, you can track its origin. You should then experiment some tests, observe the results, and infer hypotheses on why the bug happens.
Writing about all your hypotheses, attempts, expected results and observed results can help you track down the bugs, particularly if they're nasty.
There are automated tools that can help you with that process, particularly git-bisect (and similar bisection tools on other revision systems) to quickly find which change introduced the bug, unit testing to reproduce a bug and prevent regressions in your code (can be used in combination with bisect), and delta debugging to find the culprit in your code (similar to git-bisect but whereas git-bisect works on the code history, delta debugging works on the code directly).
But whatever the tools you are using, the most important benefit is in the scientific methodology, as this is the formalization of what most experienced debuggers do.

How do you approach intermittent bugs? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 6 years ago.
Improve this question
Scenario
You've got several bug reports all showing the same problem. They're all cryptic with similar tales of how the problem occurred. You follow the steps but it doesn't reliably reproduce the problem. After some investigation and web searching, you suspect what might be going on and you are pretty sure you can fix it.
Problem
Unfortunately, without a reliable way to reproduce the original problem, you can't verify that it actually fixes the issue rather than having no effect at all or exacerbating and masking the real problem. You could just not fix it until it becomes reproducible every time, but it's a big bug and not fixing it would cause your users a lot of other problems.
Question
How do you go about verifying your change?
I think this is a very familiar scenario to anyone who has engineered software, so I'm sure there are a plethora of approaches and best practices to tackling bugs like this. We are currently looking at one of these problems on our project where I have spent some time determining the issue but have been unable to confirm my suspicions. A colleague is soak-testing my fix in the hopes that "a day of running without a crash" equates to "it's fixed". However, I'd prefer a more reliable approach and I figured there's a wealth of experience here on SO.

Bugs that are hard to reproduce are the hardest one to solve. What you need to make sure that you have found the root of the problem, even if the problem itself cannot be reproduced successfully.
The most common intermittent bugs are caused by race-conditions - by eliminating the race, or ensuring that one side always wins you have eliminated the root of the problem even if you can't successfully confirm it by testing the results. The only thing you can test is that the cause does need repeat itself.
Sometimes fixing what is seen as the root indeed solves a problem but not the right one - there is no avoiding it. The best way to avoid intermittent bugs is be careful and methodical with the system design and architecture.

You'll never be able to verify the fix without identifying the root cause and coming up with a reliable way to reproduce the bug.
For identifying the root cause: If your platform allows it, hook some post-mortem debugging into the problem.
For example, on Windows, get your code to create a minidump file (core dump on Unix) when it encounters this problem. You can then get the customer (or WinQual, on Windows) to send you this file. This should give you more information about how your code's gone wrong on the production system.
But without that, you'll still need to come up with a reliable way to reproduce the bug. Otherwise you'll never be able to verify that it's fixed.
Even with all of this information, you might end up fixing a bug that looks like, but isn't, the one that the customer is seeing.

Instrument the build with more extensive (possibly optional) logging and data saving that allows exact reproduction of the variable UI steps the users took before the crash occurred.
If that data does not reliably allow you to reproduce the issue then you've narrowed the class of bug. Time to look at sources of random behaviour, such as variations in system configuration, pointer comparisons, uninitialized data, etc.
Sometimes you "know" (or rather feel) that you can fix the issue without extensive testing or unit testing scaffolding, because you truly understand the issue. However, if you don't, it very often boils down to something like "we ran it 100 times and the error no longer occurred, so we'll consider it fixed until the next time it's reported.".

I use what i call "heavy style defensive programming" : add asserts in all the modules that seems linked by the problem. What i mean is, add A LOT of asserts, asserts evidences, assert state of objects in all their memebers, assert "environnement" state, etc.
Asserts help you identify the code that is NOT linked to the problem.
Most of the time i find the origin of the problem just by writing the assertions as it forces you to reread all the code and plundge under the guts of the application to understand it.

There is no one answer to this problem. Sometimes the solution you've found helps you figure out the scenario to reproduce the problem, in which case you can test that scenario before and after the fix. Sometimes, though, that solution you've found only fixes one of the problems but not all of them, or like you say masks a deeper problem. I wish I could say "do this, it works every time", but there isn't a "this" that fits that scenario.

You say in a comment that you think it is a race condition. If you think you know what "feature" of the code is generating the condition, you can write a test to try to force it.
Here is some risky code in c:
const int NITER = 1000;
int thread_unsafe_count = 0;
int thread_unsafe_tracker = 0;
void* thread_unsafe_plus(void *a){
int i, local;
thread_unsafe_tracker++;
for (i=0; i<NITER; i++){
local = thread_unsafe_count;
local++;
thread_unsafe_count+=local;
};
}
void* thread_unsafe_minus(void *a){
int i, local;
thread_unsafe_tracker--;
for (i=0; i<NITER; i++){
local = thread_unsafe_count;
local--;
thread_unsafe_count+=local;
};
}
which I can test (in a pthreads enironment) with:
pthread_t th1, th2;
pthread_create(&th1,NULL,&thread_unsafe_plus,NULL);
pthread_create(&th2,NULL,&thread_unsafe_minus,NULL);
pthread_join(th1,NULL);
pthread_join(th2,NULL);
if (thread_unsafe_count != 0) {
printf("Ah ha!\n");
}
In real life, you'll probably have to wrap your suspect code in some way to help the race hit more ofter.
If it works, adjust the number of threads and other parameters to make it hit most of the time, and now you have a chance.

First you need to get stack traces from your clients, that way you can actually do some forensics.
Next do fuzz tests with random input, and keep these tests running for long stretches, they're great at finding those irrational border cases, that human programmers and testers can find through use cases and understanding of the code.

In this situation, where nothing else works, I introduce additional logging.
I also add in email notifications that show me the state of the application when it breaks down.
Sometimes I add in performance counters... I put that data in a table and look at trends.
Even if nothing shows up, you are narrowing things down. One way or another, you will end up with useful theories.

These are horrible and almost always resistant to the 'fixes' the engineer thinks he is putting in, as they have a habit of coming back to bite months later. Be wary of any fixes made to intermittent bugs. Be prepared for a bit of grunt work and intensive logging as this sounds more of a testing problem than a development problem.
My own problem when overcoming bugs like these was that I was often too close to the problem, not standing back and looking at the bigger picture. Try and get someone else to look at how you approach the problem.
Specifically my bug was to do with the setting of timeouts and various other magic numbers that in retrospect where borderline and so worked almost all of the time. The trick in my own case was to do a lot of experimentation with settings that I could find out which values would 'break' the software.
Do the failures happen during specific time periods? If so, where and when? Is it only certain people that seem to reproduce the bug? What set of inputs seem to invite the problem? What part of the application does it fail on? Does the bug seem more or less intermittent out in the field?
When I was a software tester my main tools where a pen and paper to record notes of my previous actions - remember a lot of seemingly insignificant details is vital. By observing and collecting little bits of data all the time the bug will appear to become less intermittent.

For a difficult-to-reproduce error, the first step is usually documentation. In the area of the code that is failing, modify the code to be hyper-explicit: One command per line; heavy, differentiated exception handling; verbose, even prolix debug output. That way, even if you can't reproduce or fix the error, you can gain far more information about the cause the next time the failure is seen.
The second step is usually assertion of assumptions and bounds checking. Everything you think you know about the code in question, write .Asserts and checks. Specifically, check objects for nullity and (if your language is dynamic) existence.
Third, check your unit test coverage. Do your unit tests actually cover every fork in execution? If you don't have unit tests, this is probably a good place to start.
The problem with unreproducible errors is that they're only unreproducible to the developer. If your end users insist on reproducing them, it's a valuable tool to leverage the crash in the field.

I've run into bugs on systems that seem to consistently cause errors, but when stepping through the code in a debugger the problem mysteriously disappears. In all of these cases the issue was one of timing.
When the system was running normally there was some sort of conflict for resources or taking the next step before the last one finished. When I stepped through it in the debugger, things were moving slowly enough that the problem disappeared.
Once I figured out it was a timing issue it was easy to find a fix. I'm not sure if this is applicable in your situation, but whenever bugs disappear in the debugger timing issues are my first suspects.

Once you fully understand the bug (and that's a big "once"), you should be able to reproduce it at will. When the reproduction code (automated test) is written, you fix the bug.
How to get to the point where you understand the bug?
Instrument the code (log like crazy). Work with your QA - they are good at re-creating the problem, and you need to arrange to have full dev toolkit available to you on their machines. Use automated tools for uninitialized memory/resources. Just plain stare at the code. No easy solution there.

Those types of bugs are very frustrating. Extrapolate it out to different machines with different types of custom hardware that might be in them (like at my company), and boy oh boy does it become a nightmare. I currently have several bugs like this at the moment at my job.
My rule of thumb: I don't fix it unless I can reproduce it myself or I'm presented with a log that clearly shows something wrong. Otherwise I cannot verify my change, nor can I verify that my change has not broken anything else. Of course, it's just a rule of thumb - I do make exceptions.
I think you're quite right to be concerned with your colleuge's approach.

These problems have always been caused by:
Memory Problems
Threading Problems
To solve the problem, you should:
Instrument your code (Add log statements)
Code Review threading
Code Review memory allocation / dereferencing
The code reviews will most likely only happen if it is a priority, or if you have a strong suspicion about which code is shared by the multiple bug reports. If it's a threading issue, then check your thread safety - make sure variables accessable by both threads are protected. If it's a memory issue, then check your allocations and dereferences and especially be suspicious of code that allocates and returns memory, or code that uses memory allocation by someone else who may be releasing it.

Some questions you could ask yourself:
When did this piece of code last work without problem.
What has been done since it stopped working.
If the code has never worked the approach would be different naturally.
At least when many users change a lot of code all the time this is a very common scenario.

Specific scenario
While I don't want to concentrate on only the issue I am having, here are some details of the current issue we face and how I've tackled it so far.
The issue occurs when the user interacts with the user interface (a TabControl to be exact) at a particular phase of a process. It doesn't always occur and I believe this is because the window of time for the problem to be exhibited is small. My suspicion is that the initialization of a UserControl (we're in .NET, using C#) coincides with a state change event from another area of the application, which leads to a font being disposed. Meanwhile, another control (a Label) tries to draw its string with that font, and hence the crash.
However, actually confirming what leads to the font being disposed has proved difficult. The current fix has been to clone the font so that the drawing label still has a valid font, but this really masks the root problem which is the font being disposed in the first place. Obviously, I'd like to track down the full sequence, but that is proving very difficult and time is short.
Approach
My approach was first to look at the stack trace from our crash reports and examine the Microsoft code using Reflector. Unfortunately, this led to a GDI+ call with little documentation, which only returns a number for the error - .NET turns this into a pretty useless message indicating something is invalid. Great.
From there, I went to look at what call in our code leads to this problem. The stack starts with a message loop, not in our code, but I found a call to Update() in the general area under suspicion and, using instrumentation (traces, etc), we were able to confirm to about 75% certainty that this was the source of the paint message. However, it wasn't the source of the bug - asking the label to paint is no crime.
From there, I looked at each aspect of the paint call that was crashing (DrawString) to see what could be invalid and started to rule each one out until it fell on the disposable items. I then determined which ones we had control over and the font was the only one. So, I took a look at how we handled the font and under what circumstances we disposed it to identify any potential root causes. I was able to come up with a plausible sequence of events that fit the reports from users, and therefore able to code a low risk fix.
Of course, it crossed my mind that the bug was in the framework, but I like to assume we screwed up before passing the blame to Microsoft.
Conclusion
So, that's how I approached one particular example of this kind of problem. As you can see, it's less than ideal, but fits with what many have said.

Unless there are major time constraints, I don't start testing changes until I can reliably reproduce the problem.
If you really had to, I suppose you could write a test case that appears to sometimes trigger the problem, and add it to your automated test suite (you do have an automated test suite, right?), and then make your change and hope that test case never fails again, knowing that if you didn't really fix anything at least you now have more chance of catching it. But by the time you can write a test case, you almost always have things reduced down to the point where you're no longer dealing with such an (apparently) non-deterministic situation.

Simply: ask the user who reported it.
I just use one of the reporters as a verification system.
Usually the person who was willing to report a bug is more than happy to help you to solve her problem [1].
Just give her your version with a possible fix and ask if the problem is gone.
In cases where the bug is a regression, the same method can be used to bisect where the problem occurred by giving the user with the problem multiple versions to test.
In other cases the user can also help you to debug the problem by giving her a version with more debugging capabilities.
This will limit any negative effects from a possible fix to that person instead of guessing that something will fix the bug and then later on realising that you've just released a "bug fix" that has no effect or in worst case a negative effect for the system stability.
You can also limit the possible negative effects of the "bug fix" by giving the new version to a limited number of users (for example to all of the ones that reported the problem) and releasing the fix only after that.
Also ones she can confirm that the fix you've made works, it is easy to add tests that ensures that your fix will stay in the code (at least on unit test level, if the bug is hard to reproduce on more higher system level).
Of course this requires that whatever you are working on supports this kind of approach. But if it doesn't I would really do whatever I can to enable it - end users are more satisfied and many of the hardest tech problems just go away and priorities come clear when development can directly interact with the system end users.
[1] If you have ever reported a bug, you most likely know that many times the response from the development/maintenance team is somehow negative from the end users point of view or there will be no response at all - especially in situations where the bug can not be reproduced by the development team.

Test Cases AND assertion statements

The code in this question made me think
assert(value>0); //Precondition
if (value>0)
{
//Doit
}
I never write the if-statement. Asserting is enough/all you can do.
"Crash early, crash often"
CodeComplete states:
The assert-statement makes the application Correct
The if-test makes the application Robust
I don't think you've made an application more robust by correcting invalid input values, or skipping code:
assert(value >= 0 ); //Precondition
assert(value <= 90); //Precondition
if(value < 0) //Just in case
value = 0;
if (value > 90) //Just in case
value = 90;
//Doit
These corrections are based on assumptions you made about the outside world.
Only the caller knows what "a valid input value" is for your function, and he must check its validity before he calls your function.
To paraphrase CodeComplete:
"Real-world programs become too messy when we don't rely solely on assertions."
Question: Am I wrong, stuborn, stupid, too non-defensive...

The problem with trusting just Asserts, is that they may be turned off in a production environment. To quote the wikipedia article:
Most languages allow assertions to be
enabled or disabled globally, and
sometimes independently. Assertions
are often enabled during development
and disabled during final testing and
on release to the customer. Not
checking assertions avoiding the cost
of evaluating the assertions while,
assuming the assertions are free of
side effects, still producing the same
result under normal conditions. Under
abnormal conditions, disabling
assertion checking can mean that a
program that would have aborted will
continue to run. This is sometimes
preferable.
Wikipedia
So if the correctness of your code relies on the Asserts to be there you may run into serious problems. Sure, if the code worked during testing it should work during production... Now enter the second guy that works on the code and is just going to fix a small problem...

Use assertions for validating input you control: private methods and such.
Use if statements for validating input you don't control: public interfaces designed for consumption by the user, user input testing etc.
Test you application with assertions built in. Then deploy without the assertions.

I some cases, asserts are disabled when building for release. You may not have control over this (otherwise, you could build with asserts on), so it might be a good idea to do it like this.
The problem with "correcting" the input values is that the caller will not get what they expect, and this can lead to problems or even crashes in wholly different parts of the program, making debugging a nightmare.
I usually throw an exception in the if-statement to take over the role of the assert in case they are disabled
assert(value>0);
if(value<=0) throw new ArgumentOutOfRangeException("value");
//do stuff

I would disagree with this statement:
Only the caller knows what "a valid
input value" is for your function, and
he must check its validity before he
calls your function.
Caller might think that he know that input value is correct. Only method author knows how it suppose to work. Programmer's best goal is to make client to fall into "pit of success". You should decide what behavior is more appropriate in given case. In some cases incorrect input values can be forgivable, in other you should throw exception\return error.
As for Asserts, I'd repeat other commenters, assert is a debug time check for code author, not code clients.

Don't forget that most languages allow you to turn off assertions... Personally, if I was prepared to write if tests to protect against all ranges of invalid input, I wouldn't bother with the assertion in the first place.
If, on the other hand you don't write logic to handle all cases (possibly because it's not sensible to try and continue with invalid input) then I would be using the assertion statement and going for the "fail early" approach.

If I remember correctly from CS-class
Preconditions define on what conditions the output of your function is defined. If you make your function handle errorconditions your function is defined for those condition and you don't need the assert statement.
So I agree. Usually you don't need both.
As Rik commented this can cause problems if you remove asserts in released code. Usually I don't do that except in performance-critical places.

I should have stated I was aware of the fact that asserts (here) dissappear in production code.
If the if-statement actually corrects invalid input data in production code, this means the assert never went off during testing on debug code, this means you wrote code that you never executed.
For me it's an OR situation:
(quote Andrew) "protect against all ranges of invalid input, I wouldn't bother with the assertion in the first place." -> write an if-test.
(quote aku) "incorrect input values can be forgivable" -> write an assert.
I can't stand both...

For internal functions, ones that only you will use, use asserts only. The asserts will help catch bugs during your testing, but won't hamper performance in production.
Check inputs that originate externally with if-conditions. By externally, that's anywhere outside the code that you/your team control and test.
Optionally, you can have both. This would be for external facing functions where integration testing is going to be done before production.

A problem with assertions is that they can (and usually will) be compiled out of the code, so you need to add both walls in case one gets thrown away by the compiler.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio