Best refactoring for the dreaded While (True) loop


If, like me, you shiver at the sight of a While (True) loop, then you too must have thought long and hard about the best way to refactor it away. I've seen several different implementations, none really better than any other, such as the timer & delegate combination.
So what's the best way you've come up with or seen to refactor the dreaded While (True) loop?
Edit: As some comments mentioned, my intent was for this question to be an "infinite loop" refactoring, such as running a Windows style service where the only stop conditions would be OnStop or a fatal exception.

My preference would be
start:
// code goes here
goto start;
This most clearly expresses the intent. Good luck getting it past your coding standards. (Wonder how much karma this is going to cost me).

Do we really need to refactor while(true) loops?
Sometimes it's a coding standard, and most developers have gotten used to this structure. If you have to think hard about how to refactor this code, are you sure it's a good idea to refactor it?
Goto used to be the black sheep of coding standards. I've met algorithms where goto made the code much shorter and more readable. Sometimes refactoring isn't worth it (or it's better to use goto).
On the other hand you can avoid while(true) most of the time.

What's so dreaded about it? Try finding a common break condition and refactor it to be the head of the loop. If that's not possible – fine.

When I encounter a while(true) loop, that tells me either
1. the break condition is not easily tested at the top (or bottom) of the loop,
2. there are multiple break conditions,
3. or the prior programmer was too lazy to factor the loop properly.
1 and 2 mean you might as well stick with while(true). (I use for(;;), but that's a style thing in my opinion.) I'm with another poster: why dread this? I dread tortured loops that jump through hoops to get the loop rolled "properly".

Why refactor? And what is so "dreadful" about this construct? It is widely used, and well understood.
If it ain't broke, don't fix it.

Replace True with the condition you were going to use to break out of the loop.
In the case of a service or background thread, you might use:
volatile bool m_shutdown = false;
void Run()
{
    while (!m_shutdown)
    { ... }
}
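For comparison, here is a minimal sketch of the same shutdown-flag idea in Python. The `PollingService` class and its method names are made up for illustration; `threading.Event` plays the role of `m_shutdown`, and `stop()` stands in for a service's OnStop.

```python
import threading


class PollingService:
    """Hypothetical service that polls until asked to stop."""

    def __init__(self, poll, interval_seconds=60.0):
        self._poll = poll                      # callable doing one unit of work
        self._interval = interval_seconds
        self._stop = threading.Event()         # plays the role of m_shutdown
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self, timeout=None):
        """Rough equivalent of OnStop: request shutdown, then wait for the loop."""
        self._stop.set()
        self._thread.join(timeout)

    def is_running(self):
        return self._thread.is_alive()

    def _run(self):
        while not self._stop.is_set():
            self._poll()
            # wait() doubles as an interruptible sleep: it returns early
            # as soon as stop() sets the event.
            self._stop.wait(self._interval)
```

Using the event's `wait()` instead of a plain sleep means a stop request does not have to sit out the rest of the polling interval.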

The "running forever" situation is sometimes part of a larger state machine. Many embedded devices (with run-forever loops) don't really run forever. They often have several operating modes and will sequence among those modes.
When we built heat-pump controllers, there was a power-on-self-test (POST) mode that ran for a little while. Then there was a preliminary environmental gathering mode that ran until we figured out all the zones and thermostats and what-not.
Some engineers claimed that what came next was the "run-forever" loop. It wasn't really that simple. It was actually several operating modes that flipped and flopped. There was heating, and defrosting, and cooling, and idling, and other stuff.
My preference is to treat a "forever" loop as really just one operating mode -- there may be others at some point in the future.
someMode = True
while someMode:
    try:
        ... do stuff ...
    except SomeException as e:
        log.exception(e)
        # will keep running
    except OtherException as e:
        log.info("stopping now")
        someMode = False
Under some circumstances, nothing we've seen so far sets someMode to False. But I like to pretend that there'll be a mode change in some future version.
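To make the "several operating modes" idea concrete, here is a rough sketch in which each mode is a handler that returns the next mode. The `Mode` names and `run_controller` are invented for illustration and are not taken from any real controller firmware.

```python
from enum import Enum, auto


class Mode(Enum):
    SELF_TEST = auto()
    DISCOVERY = auto()
    HEATING = auto()
    IDLE = auto()
    STOPPED = auto()


def run_controller(handlers, start=Mode.SELF_TEST, max_steps=None):
    """Run mode handlers until one of them returns Mode.STOPPED.

    handlers maps each Mode to a callable that does that mode's work
    and returns the next Mode. max_steps is only a safety valve for tests.
    """
    mode = start
    history = [mode]
    steps = 0
    while mode is not Mode.STOPPED:
        if max_steps is not None and steps >= max_steps:
            break
        mode = handlers[mode]()
        history.append(mode)
        steps += 1
    return history
```

If no handler ever returns `Mode.STOPPED`, this is still a run-forever loop, but the exit is already part of the design rather than something to retrofit later.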

#define ever 1
for (;ever;)
?
Meh, just leave it how it is, while (true) is probably as legible as you're going to get..

errr, to be a refactoring.....
Replace Infinite Loop with Infinite Recursion :-)
well, if you have a language that supports Tail calls....

If you want it to continue indefinitely until a total abortion of program flow, I don't see anything wrong with while (true). I encountered it recently in a .NET data collection service that combined while (true) with thread.sleep to wake up every minute and poll the third-party data service for new reports. I considered refactoring it with a timer and a delegate, but ultimately decided that this was the simplest and easiest-to-read method. 9 times out of 10 it's a clear code smell, but when there's no exit condition, why make things more difficult?

I don't mind it when the infinite loop is contained within a window, and dies with the window.
Think of the Hasselhoff Recursion.

void whiletrue_sim(void)
{
    //some code
    whiletrue_sim();
}
Warning: Your stack may overflow.
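The warning is easy to confirm in a language without tail-call elimination. A small sketch in Python (CPython), where the helper name `run_until_stack_gives_out` is made up for the example:

```python
def whiletrue_sim(depth=0):
    # ... some code ...
    return whiletrue_sim(depth + 1)   # CPython does not eliminate tail calls


def run_until_stack_gives_out():
    """Call the 'infinite' recursion and report how it ends."""
    try:
        whiletrue_sim()
    except RecursionError:
        return "recursion limit reached"
```

So the joke refactoring only works as a loop replacement where the language guarantees tail calls.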

Related

How to use DoEvents() without being "evil"?

A simple search for DoEvents brings up lots of results that lead, basically, to:
DoEvents is evil. Don't use it. Use threading instead.
The reasons generally cited are:
Re-entrancy issues
Poor performance
Usability issues (e.g. drag/drop over a disabled window)
But some notable Win32 functions such as TrackPopupMenu and DoDragDrop perform their own message processing to keep the UI responsive, just like DoEvents does.
And yet, none of these seem to come across these issues (performance, re-entrancy, etc.).
How do they do it? How do they avoid the problems cited with DoEvents? (Or do they?)

DoEvents() is dangerous. But I bet you do lots of dangerous things every day. Just yesterday I set off a few explosive devices (future readers: note the original post date relative to a certain American holiday). With care, we can sometimes account for the dangers. Of course, that means knowing and understanding what the dangers are:
Re-entry issues. There are actually two dangers here:
Part of the problem here has to do with the call stack. If you call .DoEvents() in a loop that itself handles messages that use DoEvents(), and so on, you're getting a pretty deep call stack. It's easy to over-use DoEvents() and accidentally fill up your call stack, resulting in a StackOverflow exception. If you're only using .DoEvents() in one or two places, you're probably okay. If it's the first tool you reach for whenever you have a long-running process, you can easily find yourself in trouble here. Even one use in the wrong place can make it possible for a user to force a stackoverflow exception (sometimes just by holding down the enter key), and that can be a security issue.
It is sometimes possible to find your same method on the call stack twice. If you didn't build the method with this in mind (hint: you probably didn't) then bad things can happen. If everything passed in to the method is a value type, and there is no dependence on things outside of the method, you might be fine. But otherwise, you need to think carefully about what happens if your entire method were to run again before control is returned to you at the point where .DoEvents() is called. What parameters or resources outside of your method might be modified that you did not expect? Does your method change any objects, where both instances on the stack might be acting on the same object?
Performance Issues. DoEvents() can give the illusion of multi-threading, but it's not real multithreading. This has at least three real dangers:
When you call DoEvents(), you are giving control on your existing thread back to the message pump. The message pump might in turn give control to something else, and that something else might take a while. The result is that your original operation could take much longer to finish than if it were in a thread by itself that never yields control, definitely longer than it needs.
Duplication of work. Since it's possible to find yourself running the same method twice, and we already know this method is expensive/long-running (or you wouldn't need DoEvents() in the first place), even if you accounted for all the external dependencies mentioned above so there are no adverse side effects, you may still end up duplicating a lot of work.
The other issue is the extreme version of the first: a potential to deadlock. If something else in your program depends on your process finishing, and will block until it does, and that thing is called by the message pump from DoEvents(), your app will get stuck and become unresponsive. This may sound far-fetched, but in practice it's surprisingly easy to do accidentally, and the crashes are very hard to find and debug later. This is at the root of some of the hung app situations you may have experienced on your own computer.
Usability Issues. These are side-effects that result from not properly accounting for the other dangers. There's nothing new here, as long as you looked in other places appropriately.
If you can be sure you accounted for all these things, then go ahead. But really, if DoEvents() is the first place you look to solve UI responsiveness/updating issues, you're probably not accounting for all of those issues correctly. If it's not the first place you look, there are enough other options that I would question how you made it to considering DoEvents() at all. Today, DoEvents() exists mainly for compatibility with older code that came into being before other credible options were available, and as a crutch for newer programmers who haven't yet gained enough experience for exposure to the other options.
The reality is that most of the time, at least in the .Net world, a BackgroundWorker component is nearly as easy, at least once you've done it once or twice, and it will do the job in a safe way. More recently, the async/await pattern or the use of a Task can be much more effective and safe, without needing to delve into full-blown multi-threaded code on your own.
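The re-entrancy danger described above can be reproduced with a toy message pump. This is only an illustrative sketch in Python, not the WinForms API: `message_queue`, `do_events` and `on_button_click` are all made-up names, and the "pump" is just a queue of callables.

```python
from collections import deque

message_queue = deque()      # pending UI messages (callables)
max_depth_seen = 0
_depth = 0


def do_events():
    """Toy stand-in for DoEvents(): drain every pending message right now."""
    while message_queue:
        handler = message_queue.popleft()
        handler()


def on_button_click():
    """A long-running handler that 'stays responsive' by pumping messages."""
    global _depth, max_depth_seen
    _depth += 1
    max_depth_seen = max(max_depth_seen, _depth)
    # ... long-running work would go here ...
    do_events()              # a queued second click re-enters this function
    _depth -= 1


def simulate_double_click():
    """Queue two clicks and report how deeply the handler nested."""
    global max_depth_seen
    max_depth_seen = 0
    message_queue.append(on_button_click)
    message_queue.append(on_button_click)   # user clicked again while busy
    do_events()
    return max_depth_seen
```

Here `simulate_double_click()` reports a nesting depth of 2: the second click runs inside the first one, which is exactly the "same method on the call stack twice" situation.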

Back in 16-bit Windows days, when every task shared a single thread, the only way to keep a program responsive within a tight loop was DoEvents. It is this non-modal usage that is discouraged in favor of threads. Here's a typical example:
' Process image
For y = 1 To height
    For x = 1 To width
        ProcessPixel x, y
    Next x
    DoEvents ' <-- DON'T DO THIS -- just put the whole loop in another thread
Next y
For modal things (like tracking a popup), it is likely to still be OK.
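As a rough illustration of "put the whole loop in another thread", here is a sketch in Python using the standard `concurrent.futures` module. `process_pixel` is a hypothetical placeholder for the real per-pixel work, and marshalling the result back to a UI thread is left out because it depends on the toolkit.

```python
from concurrent.futures import ThreadPoolExecutor


def process_pixel(x, y):
    """Hypothetical per-pixel work; here it just returns a number."""
    return x * y


def process_image(width, height):
    """The whole nested loop, with no message pumping inside it."""
    total = 0
    for y in range(1, height + 1):
        for x in range(1, width + 1):
            total += process_pixel(x, y)
    return total


def process_image_in_background(width, height, on_done):
    """Start the loop on a worker thread and return immediately.

    on_done receives the result, usually on the worker thread; a real UI
    toolkit would need to marshal that call back to the UI thread.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(process_image, width, height)
    future.add_done_callback(lambda f: on_done(f.result()))
    executor.shutdown(wait=False)   # already-submitted work still runs
    return future
```

The caller keeps handling messages normally while the worker runs, so nothing in the loop has to yield.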

I may be wrong, but it seems to me that DoDragDrop and TrackPopupMenu are rather special cases, in that they take over the UI, so don't have the reentrancy problem (which I think is the main reason people describe DoEvents as "Evil").
Personally I don't think it's helpful to dismiss a feature as "Evil" - rather explain the pitfalls so that people can decide for themselves. In the case of DoEvents there are rare cases where it's still reasonable to use it, for example while a modal progress dialog is displayed, where the user can't interact with the rest of the UI so there is no re-entrancy issue.
Of course, if by "Evil" you mean "something you shouldn't use without fully understanding the pitfalls", then I agree that DoEvents is evil.

Biggest beef with game loops

What do you hate most about the modern game loop? Can the game loop be improved or is there just a better alternative, such as an event-driven architecture?

It seems like this really ought to be a CW...
I'm taking a grad-level game engine programming course right now and they're sticking with the game loop approach. Granted, that doesn't mean it's the only/best solution but it's certainly logical. Using a loop allows you to ensure that all game systems get their turn to run without requesting their own timed interrupts or something else. Control can be centralized: in my current project, I have a GameManager class that, each frame, loops through the Update(float deltaTime) function for every registered object in turn. I don't have to debug an event system or set up timed signals, I just use a loop to call a series of functions. No muss, no fuss.
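A minimal sketch of that arrangement, written in Python for brevity. `GameManager` follows the description above; `Mover` and the fixed frame count are assumptions made so the example is self-contained and deterministic.

```python
class GameManager:
    """Calls update(delta_time) on every registered object once per frame."""

    def __init__(self):
        self._objects = []

    def register(self, obj):
        self._objects.append(obj)

    def run_frame(self, delta_time):
        for obj in self._objects:
            obj.update(delta_time)

    def run(self, frames, delta_time):
        # A real loop would measure elapsed time and render each frame;
        # a fixed frame count keeps this sketch deterministic.
        for _ in range(frames):
            self.run_frame(delta_time)


class Mover:
    """Hypothetical game object moving at constant velocity."""

    def __init__(self, velocity):
        self.position = 0.0
        self.velocity = velocity

    def update(self, delta_time):
        self.position += self.velocity * delta_time
```

Anything with an `update(delta_time)` method can be registered, which keeps the loop itself trivial.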
To answer your question of what do I hate most, the loop approach does logically lend itself to liberal use of inheritance and polymorphism which can bloat the size/complexity of your objects. If you're not careful, this can be a mild-to-horrible pitfall. If you are careful, it may not be a problem at all.

Whether or not any event occurs in the game, the game is supposed to draw and update itself at a fixed rate, so I don't think dropping the game loop is possible. Still, I would be curious whether anyone can think of an alternative to the game loop.

Usually the event driven architectures work best for games (only do something if the user wants something done). BUT, you'll still always need to have a thread constantly redrawing the world.

To be fully event based you could spawn an extra thread that does nothing but put a CPUTick event into your event queue every x milliseconds.
In JavaScript this is usually the more natural route, because you can so easily create an extra 'thread' that sends you the events with setInterval().
Or, if you already have a loop in the framework containing your game - like JS has in the browser, or python has with twisted - you can tell that looper to call you back at fixed intervals. e.g.:
function gameLoop() {
// update, draw...
}
window.setInterval(gameLoop, 1000/fps);

What do you do with atrocious code?

What do you do when you're assigned to work on code that's atrocious and antiquated to the point where it's almost incomprehensible?
For example: hardware interface code, mixed with logic, AND user interface code, ALL in the same functions?
We see bad code all the time, but what do you actually do about it?
Do you try to refactor it?
Try to make it OO if it's not?
Or do you try to make some sense of it, make the necessary changes and move on?

Depends on a few factors for me:
Will I be maintaining this code in the future, or is it a one-off fix?
How long until this system is replaced entirely?
How busy am I at the moment?
Ideally, I'd refactor all bad code I had to maintain, but the reality is there are only so many hours in the day.

As is frequently the case, "It Depends".
I tend to ask myself some of the following questions:
Are there unit tests for the existing code?
Is refactoring the code an acceptable risk for my project?
Is the author still available to clarify any questions I might have about the code?
Will my employer consider the time spent on changing existing, functioning code to be an acceptable use of my time?
And so on...
But assuming that I have the capacity to do so, refactoring is preferable, as the up-front cost of fixing the code now will likely save me a lot of time and effort later in maintenance and development.
There are other benefits as well, including the fact that the more clean and well maintained you keep your code base, the more likely other developers are to keep it that way. The Pragmatic Programmer calls this the Broken Window Theory.

Developers have an instinct to assume that code is always ugly because of other, inferior developers. Sometimes, code is ugly because the problem space is ugly. All that ugliness isn't just ugliness - it is sometimes institutional memory. Each line of ugly in your code probably represents a bug fix. So think very carefully before you rip it all out.
Basically, I would say that you shouldn't touch code like this unless you actually have to. If there's a real bug that you can solve, refactoring is reasonable, if you can be sure you're maintaining the same amount of functionality. But refactoring for the sake of refactoring (eg, "make the code OO") is what I would generally classify as a classic newbie mistake.

The book Working Effectively with Legacy Code discusses the options you have. In general the rule is not to change code until you have to (to fix a bug or add a feature). The book describes how to make changes when you can't add testing and how to add testing to complex code (which allows more substantial changes).

You try to refactor it, in the strict sense of the word, where you're not changing the behaviour.
The first target is usually to break up giant methods.

Given the strength of some of the adjectives you use, i.e. atrocious, antiquated and incomprehensible, I'd bin it!
If it is in such a state, like the example you give, it's probably not got any test code for it either. Refactoring is mentioned in many of the other answers but, sometimes, it is not appropriate. I always find that, when refactoring, you generally need a clear path through which the old code can be gradually morphed into the new in a number of well defined steps.
When the old code is so far removed from how you want it to look, such as the extreme cases you seem to be suggesting, you could probably redesign, rewrite and test the new code in a shorter time than it would take to refactor it.

Scrap it and start over, using the compiled legacy application as a business requirements document.
And spending time in analysis with the users to see what they want changed.

Post it to www.worsethanfailure.com!!!

If no modifications are needed, I don't touch it.
If at all possible, I write automated unit tests first, especially focused on the areas that need modification.
If automated unit tests are not possible, I do what I can to document manual unit tests.
I am just using the tests to document "current" behavior at this point.
If possible, I always keep a version of the code and executable environment that runs things the "original" way (before I touched it) so I can always add new "behavior documentation" tests and better detect regressions I may have caused later.
Once I start changing things, I want to be very careful not to introduce regressions. I do this by continually rerunning the tests I wrote before I started writing code (and adding new ones as needed).
When possible, I leave bugs as-is if there is no business need for them to be fixed. Those bugs may be "features" to some users and may have unclear side effects that wouldn't be clear until the code was redeployed to production.
As far as refactoring, I do that as aggressively as possible, but only in the code that I need to change otherwise anyway. I may refactor more aggressively in my own personal copy of the code that will never be checked in, just to improve the readability of the code for me personally. It's often times difficult to properly test changes that are only made for readability reasons, so for safety reasons, I generally don't check those changes in / deploy them unless I can confidently test that the code changes are completely safe (it's really bad to introduce bugs when you are making changes that are unnecessary for anything but readability).
Really, it's a risk management problem. Proceed with caution. The users do not care if the code is atrocious, they just care that it gets better without getting worse. Your need for beautiful code is not important in this scenario, get past it.

Just like any other code, you leave it slightly better when you leave it than it was when you entered it. You do not ever, ever rewrite the whole code. If that is the work it takes for some reason, then you start a project (small or large) for it.
I am assuming we are talking about a substantial amount of code here.
Not every day is a great day at work you know :)

The first question to ask is: does it work?
If the answer is yes, that would be a huge disincentive to simply ditch it and start over. There may be thousands of man-hours in that code which address edge cases and nasty bugs. Worse yet, there may be other modules in the system that depend on the current incorrect (but known and possibly documented) behavior. Don't mess with it if it isn't broken.
If you are keen on cleaning it up, start by writing test cases for the current behavior. When you run across an instance where the behavior differs from the specification, you must decide whether to accept the behavior as "correct" or go with what the spec says it ought to do.
Only once you have written test cases that all pass should you begin to refactor. The tests will tell you whether your efforts are breaking anything.
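A small sketch of what such "current behavior" tests can look like, in Python. `legacy_discount` is an invented stand-in for the inherited code, and the expected values were recorded by running it, not taken from any spec.

```python
def legacy_discount(price, customer_type):
    """Stand-in for an inherited function whose rules nobody remembers."""
    if customer_type == "gold":
        price = price * 0.8
    if price > 100:
        price = price - 5
    return round(price, 2)


# Characterization cases: inputs paired with whatever the code returns
# today, recorded by running it, not by reading a spec.
RECORDED_BEHAVIOR = [
    ((50, "regular"), 50),
    ((200, "regular"), 195),
    ((200, "gold"), 155.0),
    ((120, "gold"), 96.0),
]


def check_characterization(func):
    """Return the cases where func no longer matches the recorded behavior."""
    return [(args, expected, func(*args))
            for args, expected in RECORDED_BEHAVIOR
            if func(*args) != expected]
```

During refactoring, pass the rewritten function to `check_characterization`; an empty list means none of the recorded cases changed.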

I'd talk to my manager and describe the code. Most managers would not want a program held together by baling wire and duct tape per se. If the code is really that bad there are sure to be some business logic errors, hardcoding etc. stuffed in there that will eventually just destroy productivity.
I've come across some pretty bad code before (single letter variable names, no comments, everything crammed onto one line, etc.) and once I mentioned/showed it to my manager they almost always said "go ahead and re-write it", because not only are you taking the hit for reading and changing the code but future co-workers will have to go through the same pain. Better that you take a longer period of time just once to rewrite it rather than having each person who touches the code in the future have to go through and comprehend and decipher it first.

There is an old saying: if it isn't broke, do not fix it. If you have to maintain it, then reverse engineer it and document it so the next time you come across it you will know what it does.
You do not know the situation the developer was in when he or she wrote the code. He or she may have been under a time crunch when it was written (management was all over the developer, etc.).
There are also situations where he or she wrote the code per the spec, the spec then changed several times, and the developer had to patch the code because a rewrite was out of the question due to time constraints. This happens all the time.
If the code impacts the performance or robustness of the application and is modular, then you can refactor or rewrite it. Document the situation to assist future programmers in understanding.
Also, many programmers consider reverse engineering other developers' code beneath them; they would rather rewrite without considering the ramifications of doing so.
If you have never done so, try it sometime. It will make you a better developer.
If you have never done so, try it sometime, it will make you a better developer.
Thanks
Joe

Kill it with fire.

Depends on your time frame and how important that code is to you. If you have to "just make it work" then do that and rewrite the module when time allows.
If its an important or integral part of what you do then refactor refactor refactor.
Then find the guy/girl who wrote it and send them a rude postcard!

The worst offender (in my experience) of really AWFUL code is the ease with which people can do cut & paste these days. Cut & paste should be used rarely. If you think that's the right solution, it's generally better to step back and generalize the problem a little.

Anytime you see code that is "nearly incomprehensible", PROCEED WITH CAUTION. You need to assume that any major re-factoring will result in new bugs being introduced that you'll need to find and correct.
Additionally, I've seen this scenario many times (even fell victim to it myself once or twice): Programmer inherits legacy code, decides code is ancient & unmaintainable and decides to refactor it, ends up deleting key "fixes" or "business rules" subtly patched in over the years, ends up spending a lot of time tracking down and re-introducing similar code when users complain about "a problem fixed years ago is happening again".
Re-factoring (and debugging) almost always takes longer than expected and should never be considered as a "freebie" that comes along with whatever task you're supposed to be doing.
"If it ain't broke, don't 'fix' it" still has a lot of truth.

In my company we always Refactor Mercilessly, so we still come across atrocious code, but LESS and Less and less ...
We write a lot of in-house code, and the company has been run by the same family for about 100 years. Management usually tells us we have to maintain (evolve) the code base for another 50 years or so. In this setting, having code you don't dare to touch is considered a bigger risk to the long-term survival of the company than the prospect of downtime because some under-tested code broke during refactoring.

I run copy-paste detector and findbugs on all legacy code that comes my way.
I then plan my initial refactoring:
remove unused code, unused variables and unused methods
refactor duplicated code
set up a single step build
build a basic functional test
By that point the code meets the basic minimum for maintainability. It can be easily built and basic errors can be found via an automated test.
I often add code like this:
log.debug("is foo null? " + (foo == null));
log.debug("is discount < raw price ? " + (foo.getDiscount() < foo.getRawPrice()));
Some of that code will be recovered for unit tests when I can refactor to it.

I've worked places where we ship that kind of code.

I try to make sense of it, make the necessary changes, and move on.
Of course, making sense of it usually involves some changes; at the very least, I move around the whitespace and line up corresponding braces in the same column like so:
if(condition){
    doSomething(); }
// becomes...
if(condition)
{
    doSomething();
}
I'll also often change variable names.
And very often, "the necessary changes" require refactoring. :)

Get an idea of what the code is doing and of the deadline. With a longer deadline, I typically rebuild much of the code from the ground up. I find it very worthwhile, not only because you decipher terrible code, make it legible and document it, but also because the exercise trains your brain to avoid similar mistakes in the future.

How "defensive" should my code be?

I was having a discussion with one of my colleagues about how defensive your code should be. I am all pro defensive programming but you have to know where to stop. We are working on a project that will be maintained by others, but this doesn't mean we have to check for ALL the crazy things a developer could do. Of course, you could do that but this will add a very big overhead to your code.
How do you know where to draw the line?

Anything a user enters directly or indirectly, you should always sanity-check. Beyond that, a few asserts here and there won't hurt, but you can't really do much about crazy programmers editing and breaking your code, anyway!-)

I tend to change the amount of defense I put in my code based on the language. Today I'm primarily working in C++ so my thoughts are drifting in that direction.
When working in C++ there cannot be enough defensive programming. I treat my code as if I'm guarding nuclear secrets and every other programmer is out to get them. Asserts, throws, compile-time error template hacks, argument validation, eliminating pointers, in-depth code reviews and general paranoia are all fair game. C++ is an evil, wonderful language that I both love and severely mistrust.

I'm not a fan of the term "defensive programming". To me it suggests code like this:
void MakePayment( Account * a, const Payment * p ) {
    if ( a == 0 || p == 0 ) {
        return;
    }
    // payment logic here
}
This is wrong, wrong, wrong, but I must have seen it hundreds of times. The function should never have been called with null pointers in the first place, and it is utterly wrong to quietly accept them.
The correct approach here is debatable, but a minimal solution is to fail noisily, either by using an assert or by throwing an exception.
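To show what "fail noisily" might look like, here is a sketch of the same function in Python that raises instead of returning quietly. The dict-based account and payment are assumptions made only to keep the example self-contained.

```python
def make_payment(account, payment):
    """Apply payment to account, refusing missing arguments outright."""
    if account is None:
        raise ValueError("make_payment: account must not be None")
    if payment is None:
        raise ValueError("make_payment: payment must not be None")
    # payment logic here; a dict balance keeps the sketch self-contained
    account["balance"] -= payment["amount"]
    return account["balance"]
```

A caller that passes `None` now finds out immediately, at the call site, instead of discovering much later that a payment silently never happened.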
Edit: I disagree with some other answers and comments here - I do not think that all functions should check their parameters (for many functions this is simply impossible). Instead, I believe that all functions should document the values that are acceptable and state that other values will result in undefined behaviour. This is the approach taken by the most successful and widely used libraries ever written - the C and C++ standard libraries.
And now let the downvotes begin...

I don't know that there's really any way to answer this. It's just something that you learn from experience. You just need to ask yourself how common a potential problem is likely to be and make a judgement call. Also consider that you don't necessarily have to always code defensively. Sometimes it's acceptable just to note any potential problems in your code's documentation.
Ultimately though, I think this is just something that a person has to follow their intuition on. There's no right or wrong way to do it.

If you're working on the public APIs of a component, then it's worth doing a good amount of parameter validation. This led me into a habit of doing validation everywhere. That's a mistake. All that validation code never gets tested and potentially makes the system more complicated than it needs to be.
Now I prefer to validate by unit testing. Validation definitely happens for data coming from external sources, but not for calls from non-external developers.

I always Debug.Assert my assumptions.

My personal ideology: the defensiveness of a program should be proportional to the maximum naivety/ignorance of the potential user base.

Being defensive against developers consuming your API code is not that different from being defensive against regular users.
Check the parameters to make sure they are within appropriate bounds and of expected types
Verify that the number of API calls being made is within your Terms of Service. This is generally called throttling, and it usually applies only to web services and password-checking functions.
Beyond that there's not much else to do except make sure your app recovers well in the event of a problem and that you always give ample information to the developer so that they understand what's going on.

Defensive programming is only one way of honouring a contract in a design-by-contract manner of coding.
The other two are
total programming and
nominal programming.
Of course you shouldn't defend yourself against every crazy thing a developer could do, but then you should state, using preconditions, the context in which the code will do what is expected.
//precondition : par is so and so and so
function doSth(par)
{
debug.assert(par is so and so and so )
//dostuf with par
return result
}

I think you have to bring in the question of whether you're creating tests as well. You should be defensive in your coding, but as pointed out by JaredPar -- I also believe it depends on the language you're using. If it's unmanaged code, then you should be extremely defensive. If it's managed, I believe you have a little bit of wiggleroom.
If you have tests, and some other developer tries to decimate your code, the tests will fail. But then again, it depends on test coverage on your code (if there is any).

I try to write code that is more than defensive: downright hostile. If something goes wrong and I can fix it, I will. If not, I throw or pass on the exception and make it someone else's problem. Anything that interacts with a physical device (file system, database connection, network connection) should be considered unreliable and prone to failure. Anticipating these failures and trapping them is critical.
Once you have this mindset, the key is to be consistent in your approach. Do you expect to hand back status codes to communicate problems up the call chain, or do you prefer exceptions? Mixed models will kill you, or at least drive you to drink. Heavily. If you are using someone else's API, then isolate it behind wrapping interfaces that trap and report failures in the terms you use.
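One way to picture the wrapping idea is the Python sketch below. `FlakyVendorClient` is an invented stand-in for a third-party API that mixes status codes and exceptions; `Storage` and `StorageError` are likewise hypothetical names for the wrapper and the single error type the rest of the code sees.

```python
class StorageError(Exception):
    """The one failure type the rest of our code has to know about."""


class FlakyVendorClient:
    """Stand-in for a third-party API with its own error conventions."""

    def fetch(self, key):
        if key == "missing":
            return None, 404            # reports failure via status code
        if key == "down":
            raise ConnectionError("vendor unreachable")   # or via exception
        return "value-for-" + key, 200


class Storage:
    """Wrapper that reports every vendor failure as StorageError."""

    def __init__(self, client):
        self._client = client

    def get(self, key):
        try:
            value, status = self._client.fetch(key)
        except OSError as exc:          # ConnectionError is a subclass of OSError
            raise StorageError(f"fetch failed for {key!r}") from exc
        if status != 200:
            raise StorageError(f"fetch returned status {status} for {key!r}")
        return value
```

Callers of `Storage.get` deal with exactly one failure model, whatever mix of conventions the vendor code uses underneath.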

If the discussion here is how to code defensively against future (possibly malevolent or incompetent) maintainers, there is a limit to what you can do. Enforcing contracts through test coverage and liberal use of asserting your assumptions is probably the best you can do, and it should be done in a way that ideally doesn't clutter the code and make the job harder for the future non-evil maintainers of the code. Asserts are easy to read and understand and make it clear what the assumptions of a given piece of code are, so they're usually a great idea.
Coding defensively against user actions is another issue entirely, and the approach that I use is to think that the user is out to get me. Every input is examined as carefully as I can manage, and I make every effort to have my code fail safe - try not to persist any state that isn't rigorously vetted, correct where you can, exit gracefully if you cannot, etc. If you just think about all the bozo things that could be perpetrated on your code by outside agents, it gets you in the right mindset.
Coding defensively against other code, such as your platform or other modules, is exactly the same as users: they're out to get you. The OS is always going to swap out your thread at an inopportune time, networks are always going to go away at the wrong time, and in general, evil abounds around every corner. You don't need to code against every potential problem out there - the cost in maintenance might not be worth the increase in safety - but it sure doesn't hurt to think about it. And it usually doesn't hurt to explicitly comment in the code if there's a scenario you thought of but regard as unimportant for some reason.
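To make the split between "assert your assumptions" and "treat every input as hostile" a bit more concrete, here is a minimal sketch in C. The names parse_percent and apply_discount are made up for illustration and do not come from any answer above: the first validates untrusted text at the boundary and refuses anything it cannot vet, the second is internal and only asserts its contract.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Boundary check (hypothetical helper): untrusted text in, vetted value out.
 * Stores the value and returns true only for a whole number in 0..100. */
bool parse_percent(const char *text, int *out)
{
    char *end = NULL;
    long value;

    if (text == NULL || out == NULL || *text == '\0')
        return false;

    errno = 0;
    value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value < 0 || value > 100)
        return false;               /* reject rather than guess */

    *out = (int)value;
    return true;
}

/* Internal routine (hypothetical): callers are our own code, so the
 * assumptions are asserted instead of being validated a second time. */
int apply_discount(int price_cents, int percent)
{
    assert(price_cents >= 0);
    assert(percent >= 0 && percent <= 100);
    return (int)(price_cents - ((long long)price_cents * percent) / 100);
}
```

The asserts document the contract for future maintainers and vanish in builds compiled with NDEBUG, while the boundary check stays in every build because user input can never be trusted.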

Systems should have well-designed boundaries where defensive checking happens. There should be a decision about where user input is validated (at what boundary) and where other potential defensive issues require checking (for example, third party integration points, publicly available APIs, rules engine interaction, or different units coded by different teams of programmers). More defensive checking than that violates DRY in many cases, and just adds maintenance cost for very little benefit.
That being said, there are certain points where you cannot be too paranoid. Potential for buffer overflows, data corruption and similar issues should be very rigorously defended against.
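As one example of a point where paranoia pays off, here is a rough sketch of a bounds-checked string copy in C. The helper name copy_checked is an assumption of mine, not a standard function. It refuses input that does not fit instead of silently truncating or overflowing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dst only if it fits, including
 * the terminating NUL. On failure dst is left as an empty string. */
bool copy_checked(char *dst, size_t dst_size, const char *src)
{
    size_t len;

    /* The destination is supplied by our own code, so this is a contract. */
    assert(dst != NULL && dst_size > 0);
    dst[0] = '\0';

    /* The source is outside data, so it is checked in every build. */
    if (src == NULL)
        return false;

    len = strlen(src);
    if (len >= dst_size)
        return false;               /* would overflow: refuse */

    memcpy(dst, src, len + 1);
    return true;
}
```

Whether to refuse or truncate is a design decision; refusing makes the failure visible at the boundary instead of letting mangled data travel further into the system.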

I recently had a scenario in which user input data was propagated through a remote facade interface, then a local facade interface, then some other class, to finally get to the method where it was actually used. I asked myself: where should the value be validated? I added validation code only to the final class, where the value was actually used. Adding more validation snippets in the classes lying on the propagation path would have been too defensive for me. One exception could be the remote facade, but I skipped it too.

Good question. I've flip-flopped between doing sanity checks and not doing them. It's a 50/50 situation; I'd probably take a middle ground where I would only "bullet proof" routines that:
(a) are called from more than one place in the project
(b) have logic that is LIKELY to change
(c) cannot fall back on default values
(d) cannot be 'failed' gracefully
Darknight

How do you approach intermittent bugs? [closed]

Scenario
You've got several bug reports all showing the same problem. They're all cryptic with similar tales of how the problem occurred. You follow the steps but it doesn't reliably reproduce the problem. After some investigation and web searching, you suspect what might be going on and you are pretty sure you can fix it.
Problem
Unfortunately, without a reliable way to reproduce the original problem, you can't verify that it actually fixes the issue rather than having no effect at all or exacerbating and masking the real problem. You could just not fix it until it becomes reproducible every time, but it's a big bug and not fixing it would cause your users a lot of other problems.
Question
How do you go about verifying your change?
I think this is a very familiar scenario to anyone who has engineered software, so I'm sure there are a plethora of approaches and best practices to tackling bugs like this. We are currently looking at one of these problems on our project where I have spent some time determining the issue but have been unable to confirm my suspicions. A colleague is soak-testing my fix in the hopes that "a day of running without a crash" equates to "it's fixed". However, I'd prefer a more reliable approach and I figured there's a wealth of experience here on SO.

Bugs that are hard to reproduce are the hardest ones to solve. What you need is to make sure that you have found the root of the problem, even if the problem itself cannot be reproduced successfully.
The most common intermittent bugs are caused by race conditions. By eliminating the race, or ensuring that one side always wins, you have eliminated the root of the problem even if you can't confirm it by testing the results. The only thing you can test is that the cause does not repeat itself.
Sometimes fixing what is seen as the root does solve a problem, just not the right one; there is no avoiding it. The best way to avoid intermittent bugs is to be careful and methodical with the system design and architecture.

You'll never be able to verify the fix without identifying the root cause and coming up with a reliable way to reproduce the bug.
For identifying the root cause: If your platform allows it, hook some post-mortem debugging into the problem.
For example, on Windows, get your code to create a minidump file (core dump on Unix) when it encounters this problem. You can then get the customer (or WinQual, on Windows) to send you this file. This should give you more information about how your code's gone wrong on the production system.
But without that, you'll still need to come up with a reliable way to reproduce the bug. Otherwise you'll never be able to verify that it's fixed.
Even with all of this information, you might end up fixing a bug that looks like, but isn't, the one that the customer is seeing.

Instrument the build with more extensive (possibly optional) logging and data saving that allows exact reproduction of the variable UI steps the users took before the crash occurred.
If that data does not reliably allow you to reproduce the issue then you've narrowed the class of bug. Time to look at sources of random behaviour, such as variations in system configuration, pointer comparisons, uninitialized data, etc.
Sometimes you "know" (or rather feel) that you can fix the issue without extensive testing or unit testing scaffolding, because you truly understand the issue. However, if you don't, it very often boils down to something like "we ran it 100 times and the error no longer occurred, so we'll consider it fixed until the next time it's reported".

I use what I call "heavy style defensive programming": add asserts in all the modules that seem linked to the problem. What I mean is, add A LOT of asserts: assert the obvious, assert the state of objects in all their members, assert the "environment" state, etc.
Asserts help you identify the code that is NOT linked to the problem.
Most of the time I find the origin of the problem just by writing the assertions, as it forces you to reread all the code and plunge into the guts of the application to understand it.

There is no one answer to this problem. Sometimes the solution you've found helps you figure out the scenario to reproduce the problem, in which case you can test that scenario before and after the fix. Sometimes, though, that solution you've found only fixes one of the problems but not all of them, or like you say masks a deeper problem. I wish I could say "do this, it works every time", but there isn't a "this" that fits that scenario.

You say in a comment that you think it is a race condition. If you think you know what "feature" of the code is generating the condition, you can write a test to try to force it.
Here is some risky code in C:
    #include <pthread.h>
    #include <stdio.h>
    const int NITER = 1000;
    int thread_unsafe_count = 0;
    int thread_unsafe_tracker = 0;
    /* Unsynchronized read-modify-write: another thread can update the
       counter between the read and the write, and that update is lost. */
    void *thread_unsafe_plus(void *a) {
        int i, local;
        (void)a;
        thread_unsafe_tracker++;
        for (i = 0; i < NITER; i++) {
            local = thread_unsafe_count;
            local++;
            thread_unsafe_count = local;
        }
        return NULL;
    }
    void *thread_unsafe_minus(void *a) {
        int i, local;
        (void)a;
        thread_unsafe_tracker--;
        for (i = 0; i < NITER; i++) {
            local = thread_unsafe_count;
            local--;
            thread_unsafe_count = local;
        }
        return NULL;
    }
which I can test (in a pthreads environment) with:
    int main(void) {
        pthread_t th1, th2;
        pthread_create(&th1, NULL, &thread_unsafe_plus, NULL);
        pthread_create(&th2, NULL, &thread_unsafe_minus, NULL);
        pthread_join(th1, NULL);
        pthread_join(th2, NULL);
        if (thread_unsafe_count != 0) {
            printf("Ah ha!\n");  /* an update was lost */
        }
        return 0;
    }
In real life, you'll probably have to wrap your suspect code in some way to help the race hit more often.
If it works, adjust the number of threads and other parameters to make it hit most of the time, and now you have a chance.
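Once a stress test like that can hit the race, the same harness can be pointed at the candidate fix. Below is a rough sketch of that idea, assuming the fix is to guard the counter with a pthread mutex. The names locked_worker, run_locked_trial, MAX_PAIRS and STRESS_ITERATIONS are mine, not from the code above. A trial that returns anything other than 0 means an update was still lost.

```c
#include <assert.h>
#include <limits.h>
#include <pthread.h>
#include <stddef.h>

enum { MAX_PAIRS = 8, STRESS_ITERATIONS = 20000 };

static int shared_count = 0;
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;

/* Adds delta (+1 or -1) to the shared counter, holding the lock for the
 * whole read-modify-write so no update can be lost. */
static void *locked_worker(void *arg)
{
    int delta = *(const int *)arg;
    for (int i = 0; i < STRESS_ITERATIONS; i++) {
        pthread_mutex_lock(&count_lock);
        shared_count += delta;
        pthread_mutex_unlock(&count_lock);
    }
    return NULL;
}

/* Runs `pairs` incrementing and `pairs` decrementing threads.
 * Returns the final counter (0 when nothing was lost), or INT_MIN
 * if a thread could not be created. */
int run_locked_trial(int pairs)
{
    static const int plus_one = 1, minus_one = -1;
    pthread_t threads[2 * MAX_PAIRS];
    int started = 0, failed = 0;

    assert(pairs >= 1 && pairs <= MAX_PAIRS);
    shared_count = 0;
    for (int i = 0; i < 2 * pairs; i++) {
        const int *delta = (i % 2 == 0) ? &plus_one : &minus_one;
        if (pthread_create(&threads[started], NULL, locked_worker,
                           (void *)delta) != 0) {
            failed = 1;
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    return failed ? INT_MIN : shared_count;
}
```

A clean run does not prove the absence of a race, but if the unlocked version fails most of the time under the same thread count and iteration settings while the locked one never does, that is much stronger evidence than a day of soak testing.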

First, you need to get stack traces from your clients; that way you can actually do some forensics.
Next, do fuzz tests with random input and keep these tests running for long stretches. They're great at finding those irrational border cases that human programmers and testers can't find through use cases and understanding of the code.
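A fuzz loop does not have to be fancy. Here is a minimal sketch, with a made-up function under test (midpoint) and a made-up harness name (fuzz_midpoint), that throws random inputs at the code and checks an invariant. The important detail for intermittent bugs is reporting the seed, so that a failing run can be replayed exactly.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical function under test: midpoint of two non-negative indices. */
int midpoint(int lo, int hi)
{
    assert(lo >= 0 && hi >= lo);
    return lo + (hi - lo) / 2;      /* avoids the overflow in (lo + hi) / 2 */
}

/* Feeds `rounds` random pairs to midpoint() and counts invariant
 * violations. Seed and inputs are printed for every failure so the
 * exact case can be reproduced later. */
unsigned fuzz_midpoint(unsigned seed, unsigned rounds)
{
    unsigned failures = 0;

    srand(seed);
    for (unsigned i = 0; i < rounds; i++) {
        int a = rand();
        int b = rand();
        int lo = a < b ? a : b;
        int hi = a < b ? b : a;
        int mid = midpoint(lo, hi);

        if (mid < lo || mid > hi) {
            fprintf(stderr, "seed=%u round=%u lo=%d hi=%d mid=%d\n",
                    seed, i, lo, hi, mid);
            failures++;
        }
    }
    return failures;
}
```

For long-running stretches you would call this in a loop with a fresh seed each time (from the clock, say) and keep the output.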

In this situation, where nothing else works, I introduce additional logging.
I also add in email notifications that show me the state of the application when it breaks down.
Sometimes I add in performance counters... I put that data in a table and look at trends.
Even if nothing shows up, you are narrowing things down. One way or another, you will end up with useful theories.
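One cheap form of additional logging is a small in-memory ring of recent events that gets dumped along with the failure report. A rough sketch follows; all names here are illustrative, and it is not thread-safe as written, so a multi-threaded application would need a lock around it.

```c
#include <assert.h>
#include <stdio.h>

enum { TRACE_SLOTS = 4, TRACE_TEXT = 64 };

static char trace_ring[TRACE_SLOTS][TRACE_TEXT];
static unsigned long trace_total = 0;   /* events recorded so far */

/* Records one event, overwriting the oldest entry once the ring is full.
 * Text longer than the slot is truncated by snprintf. */
void trace_event(const char *text)
{
    char *slot = trace_ring[trace_total % TRACE_SLOTS];
    snprintf(slot, TRACE_TEXT, "%s", text ? text : "(null)");
    trace_total++;
}

/* Writes the retained events, oldest first, to out.
 * Returns how many events were written. */
unsigned trace_dump(FILE *out)
{
    unsigned long kept = trace_total < TRACE_SLOTS ? trace_total : TRACE_SLOTS;
    unsigned long first = trace_total - kept;

    assert(out != NULL);
    for (unsigned long i = first; i < trace_total; i++)
        fprintf(out, "#%lu %s\n", i, trace_ring[i % TRACE_SLOTS]);
    return (unsigned)kept;
}
```

Calling trace_event at the interesting points and trace_dump from the crash or error path gives you the last few steps before the failure without the cost of logging everything to disk.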

These are horrible and almost always resistant to the 'fixes' the engineer thinks he is putting in, as they have a habit of coming back to bite months later. Be wary of any fixes made to intermittent bugs. Be prepared for a bit of grunt work and intensive logging, as this sounds more like a testing problem than a development problem.
My own problem when overcoming bugs like these was that I was often too close to the problem, not standing back and looking at the bigger picture. Try and get someone else to look at how you approach the problem.
Specifically, my bug had to do with the setting of timeouts and various other magic numbers that in retrospect were borderline and so worked almost all of the time. The trick in my own case was to do a lot of experimentation with the settings so that I could find out which values would 'break' the software.
Do the failures happen during specific time periods? If so, where and when? Is it only certain people that seem to reproduce the bug? What set of inputs seem to invite the problem? What part of the application does it fail on? Does the bug seem more or less intermittent out in the field?
When I was a software tester, my main tools were a pen and paper to record notes of my previous actions; remembering a lot of seemingly insignificant details is vital. By observing and collecting little bits of data all the time, the bug will appear to become less intermittent.

For a difficult-to-reproduce error, the first step is usually documentation. In the area of the code that is failing, modify the code to be hyper-explicit: One command per line; heavy, differentiated exception handling; verbose, even prolix debug output. That way, even if you can't reproduce or fix the error, you can gain far more information about the cause the next time the failure is seen.
The second step is usually assertion of assumptions and bounds checking. Everything you think you know about the code in question, write asserts and checks for. Specifically, check objects for nullity and (if your language is dynamic) existence.
Third, check your unit test coverage. Do your unit tests actually cover every fork in execution? If you don't have unit tests, this is probably a good place to start.
The problem with unreproducible errors is that they're only unreproducible to the developer. If your end users insist on reproducing them, it's a valuable tool to leverage the crash in the field.

I've run into bugs on systems that seem to consistently cause errors, but when stepping through the code in a debugger the problem mysteriously disappears. In all of these cases the issue was one of timing.
When the system was running normally there was some sort of conflict for resources or taking the next step before the last one finished. When I stepped through it in the debugger, things were moving slowly enough that the problem disappeared.
Once I figured out it was a timing issue it was easy to find a fix. I'm not sure if this is applicable in your situation, but whenever bugs disappear in the debugger timing issues are my first suspects.

Once you fully understand the bug (and that's a big "once"), you should be able to reproduce it at will. When the reproduction code (automated test) is written, you fix the bug.
How to get to the point where you understand the bug?
Instrument the code (log like crazy). Work with your QA - they are good at re-creating the problem, and you need to arrange to have full dev toolkit available to you on their machines. Use automated tools for uninitialized memory/resources. Just plain stare at the code. No easy solution there.

Those types of bugs are very frustrating. Extrapolate it out to different machines with different types of custom hardware that might be in them (like at my company), and boy oh boy does it become a nightmare. I have several bugs like this at my job at the moment.
My rule of thumb: I don't fix it unless I can reproduce it myself or I'm presented with a log that clearly shows something wrong. Otherwise I cannot verify my change, nor can I verify that my change has not broken anything else. Of course, it's just a rule of thumb - I do make exceptions.
I think you're quite right to be concerned with your colleague's approach.

These problems have always been caused by:
Memory Problems
Threading Problems
To solve the problem, you should:
Instrument your code (Add log statements)
Code Review threading
Code Review memory allocation / dereferencing
The code reviews will most likely only happen if it is a priority, or if you have a strong suspicion about which code is shared by the multiple bug reports. If it's a threading issue, then check your thread safety: make sure variables accessible by both threads are protected. If it's a memory issue, then check your allocations and dereferences, and be especially suspicious of code that allocates and returns memory, or code that uses memory allocated by someone else who may be releasing it.

Some questions you could ask yourself:
When did this piece of code last work without problems?
What has been done since it stopped working?
If the code has never worked the approach would be different naturally.
At least when many users change a lot of code all the time this is a very common scenario.

Specific scenario
While I don't want to concentrate on only the issue I am having, here are some details of the current issue we face and how I've tackled it so far.
The issue occurs when the user interacts with the user interface (a TabControl to be exact) at a particular phase of a process. It doesn't always occur and I believe this is because the window of time for the problem to be exhibited is small. My suspicion is that the initialization of a UserControl (we're in .NET, using C#) coincides with a state change event from another area of the application, which leads to a font being disposed. Meanwhile, another control (a Label) tries to draw its string with that font, and hence the crash.
However, actually confirming what leads to the font being disposed has proved difficult. The current fix has been to clone the font so that the drawing label still has a valid font, but this really masks the root problem which is the font being disposed in the first place. Obviously, I'd like to track down the full sequence, but that is proving very difficult and time is short.
Approach
My approach was first to look at the stack trace from our crash reports and examine the Microsoft code using Reflector. Unfortunately, this led to a GDI+ call with little documentation, which only returns a number for the error - .NET turns this into a pretty useless message indicating something is invalid. Great.
From there, I went to look at what call in our code leads to this problem. The stack starts with a message loop, not in our code, but I found a call to Update() in the general area under suspicion and, using instrumentation (traces, etc), we were able to confirm to about 75% certainty that this was the source of the paint message. However, it wasn't the source of the bug - asking the label to paint is no crime.
From there, I looked at each aspect of the paint call that was crashing (DrawString) to see what could be invalid and started to rule each one out until it fell on the disposable items. I then determined which ones we had control over and the font was the only one. So, I took a look at how we handled the font and under what circumstances we disposed it to identify any potential root causes. I was able to come up with a plausible sequence of events that fit the reports from users, and therefore able to code a low risk fix.
Of course, it crossed my mind that the bug was in the framework, but I like to assume we screwed up before passing the blame to Microsoft.
Conclusion
So, that's how I approached one particular example of this kind of problem. As you can see, it's less than ideal, but fits with what many have said.

Unless there are major time constraints, I don't start testing changes until I can reliably reproduce the problem.
If you really had to, I suppose you could write a test case that appears to sometimes trigger the problem, and add it to your automated test suite (you do have an automated test suite, right?), and then make your change and hope that test case never fails again, knowing that if you didn't really fix anything at least you now have more chance of catching it. But by the time you can write a test case, you almost always have things reduced down to the point where you're no longer dealing with such an (apparently) non-deterministic situation.

Simply: ask the user who reported it.
I just use one of the reporters as a verification system.
Usually the person who was willing to report a bug is more than happy to help you to solve her problem [1].
Just give her your version with a possible fix and ask if the problem is gone.
In cases where the bug is a regression, the same method can be used to bisect where the problem occurred by giving the user with the problem multiple versions to test.
In other cases the user can also help you to debug the problem by giving her a version with more debugging capabilities.
This will limit any negative effects from a possible fix to that person instead of guessing that something will fix the bug and then later on realising that you've just released a "bug fix" that has no effect or in worst case a negative effect for the system stability.
You can also limit the possible negative effects of the "bug fix" by giving the new version to a limited number of users (for example to all of the ones that reported the problem) and releasing the fix only after that.
Also, once she can confirm that the fix you've made works, it is easy to add tests that ensure that your fix will stay in the code (at least on the unit test level, if the bug is hard to reproduce on a higher system level).
Of course this requires that whatever you are working on supports this kind of approach. But if it doesn't, I would really do whatever I can to enable it: end users are more satisfied, many of the hardest tech problems just go away, and priorities become clear when development can directly interact with the system's end users.
[1] If you have ever reported a bug, you most likely know that the response from the development/maintenance team is often somehow negative from the end user's point of view, or there is no response at all, especially in situations where the bug cannot be reproduced by the development team.
