Our product is a distributed system. The modules I work on are fairly new, quite rigorous, and well tested. They were developed with recent best practices in mind. Other modules can be considered legacy software.
While I'm vigilant about everything that happens within the modules I'm responsible for, I'm under constant pressure to work with bad data sent to me from the other modules. At heart I'm a "Fail Fast" developer, and as a result, when problems arise I'm usually able to rule out errors in my own modules. It's not so much about blame as about saving the wasted effort of chasing bugs in the wrong places.
But the argument I keep coming up against is: "We can't let this stuff fail in production, the customer expects this to work, why don't you work around this problem". And this would be an argument for robustness: be liberal in what you accept, conservative in what you send.
I should also note that these are mostly intermittent problems. We see them in integration tests but they are hard to reproduce. Timing and concurrency are involved.
I'm having a hard time balancing between the two principles. Part of it is my worry that if I start allowing and propagating exceptional data, I'm inviting trouble and I won't have as much confidence in my system. But I can't argue against keeping the system working even if other modules are sending me wrong data. The reason other modules aren't getting fixed is that they are too complex and fragile, while mine still appear clear and safe. But if I don't resist the pressure, my modules will slowly be saddled with the same problems I've been rejecting until now.
I should say that the system is not "crashing" in production, but my module may simply display an error to the operator and ask them to contact support. A crash would be a big problem, but if I'm reporting the error clearly, then isn't this the right thing to do? I suspect that my peers just don't want the customer to see any problems, period. But my module is rejecting data from other modules within our product, not customer input. So it seems to me that we are just not tackling problems.
So, do I need to be more pragmatic or hold my ground?
I share the "fail fast" preference/principle. Don't think of this as a conflict of principles, though; it's more a conflict of understanding. Your counterpart has some unspoken requirement ("don't show the user a bad time") that you never had a chance to think about or implement beforehand, so it has left a bad taste in your mouth. Forget that viewpoint and re-approach it as a new project with a fixed requirement you can work against.
Maybe the best result is an error message like the one you displayed. But it sounds like you implemented it before getting buy-in from your counterpart, while they still had a chance to accept it. Earlier communication about what you were doing could have addressed that.
Be careful in how you present the ideas. Constantly referring to the other systems as "too complex and fragile" might be rubbing people the wrong way. Simply say the systems are new to you and take longer to understand. Do put the time into understanding them, so you don't reduce people's expectations of your capability.
I'd say that it depends on what happens if you don't halt. Does someone's paycheck get processed wrong? Does the wrong order get sent out? That would be worth stopping for.
If possible, have your cake and eat it too - don't report the error to the user, get the customer to agree to send diagnostic reports and report every failure back. Bug the developer(s) who own the faulting module(s) to fix them. And by bug I mean file a bug against them. Or, if management doesn't think it's worth the cost of fixing, don't.
I'd also write up unit tests against those modules that fail, especially if you can tell what the original input was that caused them to generate the wrong output.
What it really comes down to though is what the person who reviews your performance wants from you, especially after you explain the problem to them, via email.
Simply put, this sounds like a case of "don't check for something you can't handle". The fact that you're catching the error and able to report it means you're not propagating it. But it also means that, since you can report it, you have some mechanism to trap the error, and therefore could potentially handle it yourself and correct it rather than report it.
Mind, I'm assuming that your error report is more interesting than a random exception you caught some place deep in the system. But even then, if it's an exception you're testing for and you're creating (i.e. you check if the denominator is zero and send an error rather than simply inadvertently dividing by zero and catching the exception higher up), then that suggests you may well have a way of correcting the problem.
Bottom line, you need both. You need to try to make the data as error free as practical, but also report the unexpected.
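To make that concrete, here is a minimal sketch (Java, with invented names) of a boundary check that silently applies only lossless corrections and fails fast, with a clear report, on anything it cannot interpret:

    // Hypothetical boundary check: the names and the rules are illustrative only.
    public final class MessageBoundary {

        /** Raised when incoming data cannot be safely normalized. */
        public static class InvalidUpstreamDataException extends RuntimeException {
            public InvalidUpstreamDataException(String message) { super(message); }
        }

        /**
         * Accepts a raw value from another module. Trims harmless noise
         * (whitespace), but refuses values it cannot interpret rather than
         * guessing and propagating them.
         */
        public static String normalizeOrFail(String rawValue, String sourceModule) {
            if (rawValue == null) {
                throw new InvalidUpstreamDataException(
                    "Null value received from " + sourceModule);
            }
            String cleaned = rawValue.trim();          // safe, lossless correction
            if (cleaned.isEmpty()) {
                throw new InvalidUpstreamDataException(
                    "Empty value received from " + sourceModule);
            }
            return cleaned;                            // pass only known-good data onward
        }
    }

The split is the point: harmless clean-up happens quietly, while anything ambiguous surfaces immediately with enough context to point back at the sending module.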
I don't think you can lock the door and cross your arms saying "it's not my problem". The fact that the data is coming from "old, fragile systems" is meaningless. YOUR code is not old and fragile, and is clearly the efficient place, in terms of the entire integrated system, to "fix" the data once you've detected the problem. Yes, the old modules will continue to GIGO to other, lesser systems, but those legacy modules combined with your new module are a cohesive whole and thus make up "the system".
The typical real problem here is simply the time/value equation of writing all this fix up code vs new features. That's a different debate. But if you have time, and you know things that you can do to clean up incoming data, "be liberal in what you accept" is sound policy.
I won't get into the reasons, but you are right.
In my experience, PHBs are missing the part of the brain required to understand why fail fast has merit and why "robustness", as defined by do-whatever-it-takes-eat-errors-if-necessary, is a bad idea. It is hopeless. They just don't have the hardware to grok it. They tend to say things like "OK, you make a good point, but what about the user?" It's just their version of "think of the children", and it signals the end of the conversation with me any time it's brought up.
My advice is to stand your ground. Eternally.
Thanks everyone. The case that prompted this question ended well, partly thanks to insights I got from the answers above.
My initial reaction was to stick to fail fast, but I thought about it some more and reached the conclusion that one of the roles of my module is to provide a stabilizing anchor for the rest of the system. That does not necessarily mean accepting bad data; it means surfacing problems, isolating them, and handling them in a transparent manner until we find a solution.
I planned to add a new handler and code path for this case, which would execute properly, as if it were a special use case that had previously been undocumented.
We had a discussion where I reiterated the need to deal with the problem at the boundary, but said I was also willing to help. I outlined my plan to the other side, because I suspected my position was viewed as overly pedantic, and the solution was perceived as me merely having to turn off spurious validation of harmless, if incorrect, data. In reality, though, my module is largely data driven, so I explained why the data has to be correct, how behavior is driven by it, and why accommodating this data means implementing a special code path.
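A sketch of the shape such a dedicated path can take (the names are invented for illustration, not the actual code): detect the known deviation at the boundary, log it, and route it to its own handler instead of quietly relaxing validation.

    // Illustrative only: the handler and payload names are invented.
    public class IncomingDataRouter {

        private static final java.util.logging.Logger LOG =
            java.util.logging.Logger.getLogger(IncomingDataRouter.class.getName());

        interface Handler { void handle(String payload); }

        private final Handler normalHandler;
        private final Handler legacyQuirkHandler;   // dedicated path for the known deviation

        IncomingDataRouter(Handler normalHandler, Handler legacyQuirkHandler) {
            this.normalHandler = normalHandler;
            this.legacyQuirkHandler = legacyQuirkHandler;
        }

        void route(String payload) {
            if (isKnownLegacyQuirk(payload)) {
                // Surfaced, isolated, and handled transparently until the sender is fixed.
                LOG.warning("Known legacy deviation received; using special code path");
                legacyQuirkHandler.handle(payload);
            } else {
                normalHandler.handle(payload);
            }
        }

        private boolean isKnownLegacyQuirk(String payload) {
            // Placeholder check for the specific, documented deviation.
            return payload != null && payload.startsWith("LEGACY:");
        }
    }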
I think this gave weight to my position and led to a more thorough discussion of the other side's aversion to fixing the data. It turned out to be more a weariness of dealing with an error-prone legacy system than an actual obstacle. There was a relatively simple solution; it was just scary to make the change, a mindset that's fairly entrenched.
But having aired all challenges and possible solutions, we eventually agreed to fix the data, and so far it seems to have solved our problem. Our integration tests are now passing consistently, but we have also added logging and will continue to monitor it.
In summary, I think that for me, the synthesis of both principles is that fail fast is essential for surfacing problems. But once they do surface, robustness means providing a transparent path to continue operation in a way that does not compromise the system. I was able to offer that, and by doing so, won some goodwill from the other side and got the data fixed in the end.
Again, thanks to everyone that responded. I'm too new to rate comments, but I do appreciate all the perspectives presented.
That's a tricky one. If your module receives bad data and it's "OK" for you to just do nothing with it and return, then I would suggest writing to an error log instead of showing an error to the user.
It kind of depends on the class of error you are getting. If the way the system is breaking means you can keep going without feeding bad data to any other parts of the system, you should do everything in your power to work with whatever input is given.
To my mind, though, data purity trumps a working system: you cannot allow bad data to propagate elsewhere and corrupt other systems. To the extent that you can massage the data to be correct and then keep going, you should do so, on the theory that the data is then safe and you must keep the system running...
I like to think of things in terms of data streams. Passing bad data along pollutes the whole stream, and that is bad because, just like real pollution, a drop can spoil a whole river of data (if one element is bad, what else can you trust?). But equally bad is blocking the flow, letting nothing pass because you spotted something you could easily remove. Filter it out, and if everyone at every stage also filters, you get clear, clean data out the other end even if a few impurities got in somewhere in the middle.
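As a rough sketch of such a filtering stage (Java streams; the Reading record and its validity check are invented for illustration, and records need Java 16+):

    import java.util.List;
    import java.util.stream.Collectors;

    // Illustrative only: "Reading" and isValid() are made up for this example.
    record Reading(String source, double value) {
        boolean isValid() {
            return source != null && !Double.isNaN(value);
        }
    }

    class StreamFilterExample {
        // Each stage keeps the flow moving but lets only clean data through.
        static List<Reading> filterStage(List<Reading> incoming) {
            return incoming.stream()
                           .filter(Reading::isValid)
                           .collect(Collectors.toList());
        }
    }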
The question from your peers is: "why don't you work around this problem"
You say that it's possible for you to detect the bad data and report an error to the user. This is the normal approach: once you know the data coming into your functions is bad, you should fail fast (and this is the recommendation in the other answers I have read here).
However, your question doesn't specify the domain in which your software is operating. If you know the data coming in is erroneous, is it possible for you to request that data again? Is it actually possible to recover from the situation?
I mentioned that the "domain" here is important. So if you have an app which displays streamed video data for example, and maybe your wireless signal is weak so the stream is corrupt, should the system "fail fast" and display an error message? Or should a poorer image be displayed, and an attempt to reconnect made if needed, depending on the magnitude of the problem?
Depending on your domain, it may be possible for you to detect bad data and make a second request for it without inconveniencing the user. (This is clearly only relevant in cases where you'd expect the data to be better the second time, but you do say the issues you're experiencing are intermittent and possibly concurrency-related.)
So, fail-fast is good, and is definitely something you should do if you can't recover. And you should definitely not propagate bad data. But if you can recover, which in some domains you can, then failing straight away is not necessarily the best thing to do.
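A rough sketch of that kind of recovery (the retry limit, the fetch, and the validity check are all assumptions; whether a re-request is even possible depends on your domain):

    import java.util.function.Predicate;
    import java.util.function.Supplier;

    // Illustrative retry wrapper for intermittent bad data.
    class Retrier {
        static <T> T fetchWithRetry(Supplier<T> fetch, Predicate<T> isValid, int maxAttempts) {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                T result = fetch.get();
                if (isValid.test(result)) {
                    return result;               // recovered without bothering the user
                }
            }
            // Could not recover: now fail fast with a clear report.
            throw new IllegalStateException(
                "Data still invalid after " + maxAttempts + " attempts");
        }
    }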
I'm biased towards writing fool-proof applications. For example, with a PHP site, I validate all the inputs on the client side using JS. On the server side I validate again. On both sides I validate for emptiness and for patterns (email, phone, URL, number, etc.). Then, on the server side, I strip malicious tags and characters and trim the input. Later I convert the input into the desired formats/data types (string, int, float, etc.). If the library is meant for the server side only, I even give developers a chance at graceful degradation: I tolerate the worst inputs and normalize them to acceptable ones (I have a predefined set of acceptable values).
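For illustration, here is that validate-sanitize-convert pipeline sketched in Java rather than PHP (every name is made up; the PHP version follows the same steps):

    import java.util.regex.Pattern;

    // Sketch of the validate -> sanitize -> convert pipeline; names are invented.
    class InputPipeline {
        private static final Pattern DIGITS = Pattern.compile("\\d+");

        static int parseQuantity(String raw) {
            if (raw == null || raw.isBlank()) {                   // emptiness check
                throw new IllegalArgumentException("quantity is required");
            }
            String trimmed = raw.trim();                          // trim
            String stripped = trimmed.replaceAll("<[^>]*>", "");  // naive tag stripping for the sketch
            if (!DIGITS.matcher(stripped).matches()) {            // pattern check
                throw new IllegalArgumentException("quantity must be a whole number");
            }
            return Integer.parseInt(stripped);                    // convert to the target type
        }
    }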
Now I'm re-reading a library I wrote a year and a half ago, and I'm wondering whether developers are really so evil, or so lacking in IQ, that I need to do this much graceful degradation, finding every possible chance to make their input right even when they gave crappy input, which seriously harms performance. Or shall I do minimal checking and expect developers to be able and willing to give proper input? I have no hope for end users, but should I trust developers more and give them an application/library with better performance?
Common policy is to validate on the server anything sent from the client because you can't be totally sure it really was your client that sent it. You don't want to "trust developers more" and in the process find that you've "trusted hackers of your site more".
Fixing invalid input automatically can be as much a curse as a blessing -- you've essentially committed to accepting the invalid input as a valid part of your protocol (ie, in a future version if you make a change that will break the invalid input that you were correcting, it is no longer backwards compatible with the client code that has been written). In extremis, you might paint yourself into a corner that way. Also, invalid calls tend to propagate to new code -- people often copy-and-paste example code and then modify it to meet their needs. If they've copied bad code that you've been correcting at the server, you might find you start getting proportionally more and more bad data coming in, as well as confusing new programmers who think "that just doesn't look like it should be right, but it's the example everyone is using -- maybe I don't understand this after all".
Never expect diligence from developers. Always validate, if you can, any input that comes into your code, especially if it comes across a network.
End users (whether they're programmers using your tool, or non-programmers using your application) don't have to be stupid or evil to type the wrong thing in. As programmers we all too often make wrong assumptions about what's obvious for them.
That's the first thing, which justifies comprehensive validation all on its own. But validation isn't the same as guessing what they meant from what they typed, and inferring correct input from incorrect - unless the inference rules are also well known to the users (like Word's auto-correct, for instance).
But what is this performance you seek? There's no bit of client-side (or server-side, for that matter) validation that takes longer to run than the second or so that is an acceptable response time.
Validate, and ensure it doesn't break as the first priority. Then worry about making it clever enough to know (reliably) what they meant. After that, worry about how fast it is. In the real world, syntax validation doesn't make a measurable difference to anything where user input takes most of the total time.
Microsoft made the mistake of trusting programmers to do the right thing back in the days of Windows 3.1 and to a lesser extent Windows 95. You need only read a few posts from Raymond Chen to see where that road ultimately leads.
(P.S. This is not a dig against Microsoft - it's a statement of fact about how programmers abused the more liberal Win16, either deliberately or through ignorance)
I think you are right in being biased toward fool-proof applications. I would not assume that that degrades performance enough to be of much concern. Rather I would address performance concerns separately, starting by profiling or my favorite method, stackshots. There must be a way to get those in PHP.
What's your standard way of debugging a problem? This might seem like a pretty broad question with some of you replying 'It depends on the problem' but I think a lot of us debug by instinct and haven't actually tried wording our process. That's why we say 'it depends'.
I was sort of forced to word my process recently because a few developers and I were working on the same problem, and we were debugging it in totally different ways. I wanted them to understand what I was trying to do, and vice versa.
After some reflection I realized that my way of debugging is actually quite monotonous. I first try to reliably replicate the problem (especially on my local machine). Then, through a process of elimination (and this is where I think it's problem-dependent), I try to identify the cause.
The other guys were trying to do it in a totally different way.
So, just wondering what has been working for you guys out there? And what would you say your process is for debugging if you had to formalize it in words?
BTW, we still haven't found out our problem =)
My approach varies based on my familiarity with the system at hand. Typically I do something like:
Replicate the failure, if at all possible.
Examine the fail state to determine the immediate cause of the failure.
If I'm familiar with the system, I may have a good guess about the root cause. If not, I start to mechanically trace the data back through the software while challenging basic assumptions made by the software.
If the problem seems to have a consistent trigger, I may manually walk forward through the code with a debugger while challenging implicit assumptions that the code makes.
Tracing the root cause is, of course, where things can get hairy. This is where having a dump (or better, a live, broken process) can be truly invaluable.
I think the key point in my debugging process is challenging preconceptions and assumptions. The number of times I've found a bug in a component that I or a colleague would have sworn was working fine is massive.
I've been told by my more intuitive friends and colleagues that I'm quite pedantic when they watch me debug or ask me to help them figure something out. :)
Consider getting hold of the book "Debugging" by David J Agans. The subtitle is "The 9 Indispensable Rules for Finding Even the Most Elusive Software and Hardware Problems". His list of debugging rules, available in poster form on the web site (there's a link to the book, too), is:
Understand the system
Make it fail
Quit thinking and look
Divide and conquer
Change one thing at a time
Keep an audit trail
Check the plug
Get a fresh view
If you didn't fix it, it ain't fixed
The last point is particularly relevant in the software industry.
I picked these up from the web or from some book I can't recall (it may have been Coding Horror...)
Debugging 101:
Reproduce
Progressively Narrow Scope
Avoid Debuggers
Change Only One Thing At a Time
Psychological Methods:
Rubber-duck debugging
Don't Speculate
Don't be too Quick to Blame the Tools
Understand Both Problem and Solution
Take a Break
Consider Multiple Causes
Bug Prevention Methods:
Monitor Your Own Fault Injection Habits
Introduce Debugging Aids Early
Loose Coupling and Information Hiding
Write a Regression Test to Prevent Recurrence
Technical Methods:
Inert Trace Statements
Consult the Log Files of Third Party Products
Search the web for the Stack Trace
Introduce Design By Contract
Wipe the Slate Clean
Intermittent Bugs
Exploit Locality
Introduce Dummy Implementations and Subclasses
Recompile / Relink
Probe Boundary Conditions and Special Cases
Check Version Dependencies (third party)
Check Code that Has Changed Recently
Don't Trust the Error Message
Graphics Bugs
When I'm up against a bug that I can't seem to figure out, I like to make a model of the problem. Make a copy of the section of problem code, and start removing features from it, one at a time. Run a unit test against the code after every removal. Through this process you will either remove the feature with the bug (and hence locate the bug), or you will have isolated the bug down to a core piece of code that contains the essence of the problem. And once you figure out the essence of the problem, it's a lot easier to fix.
I normally start off by forming an hypothesis based on the information I have at hand. Once this is done, I work to prove it to be correct. If it proves to be wrong, I start off with a different hypothesis.
Most of the Multithreaded synchronization issues get solved very easily with this approach.
Also you need to have a good understanding of the debugger you are using and its features. I work on Windows applications and have found windbg to be extremely helpful in finding bugs.
Reducing the bug to its simplest form often leads to greater understanding of the issue as well adding the benefit of being able to involve others if necessary.
Setting up a quick reproduction scenario to allow efficient use of your time in testing any hypothesis you choose.
Creating tools to dump the environment quickly for comparisons.
Creating and reproducing the bug with logging turned up to the maximum level.
Examining the system logs for anything alarming.
Looking at file dates and timestamps to get a feeling if the problem could be a recent introduction.
Looking through the source repository for recent activity in the relevant modules.
Applying deductive reasoning and Ockham's Razor.
Being willing to step back and take a break from the problem.
I'm also a big fan of using the process of elimination. Ruling out variables tremendously simplifies the debugging task. It's often the very first thing that should be done.
Another really effective technique is to roll back to your last working version, if possible, and try again. This can be extremely powerful because it gives you solid footing to proceed more carefully. A variation on this is to get the code to a point where it is working with less functionality, rather than not working with more functionality.
Of course, it's very important not to just try things. That only increases your despair, because it never works. I'd rather make 50 runs to gather information about the bug than take one wild swing and hope it works.
I find the best time to "debug" is while you're writing the code. In other words, be defensive. Check return values, liberally use assert, use some kind of reliable logging mechanism and log everything.
To more directly answer the question, the most efficient way for me to debug problems is to read code. Having a log helps you find the relevant code to read quickly. No logging? Spend the time putting it in. It may not seem like you're finding the bug, and you may not be. The logging might help you find another bug though, and eventually once you've gone through enough code, you'll find it....faster than setting up debuggers and trying to reproduce the problem, single stepping, etc.
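A small sketch of what that defensiveness can look like (Java here; the names, the placeholder lookup, and the logging setup are just for illustration):

    import java.util.logging.Logger;

    class OrderProcessor {
        private static final Logger LOG = Logger.getLogger(OrderProcessor.class.getName());

        // Placeholder lookup used only to illustrate checking a return value.
        Integer findPrice(String sku) { return null; }

        void process(String sku, int quantity) {
            assert quantity > 0 : "quantity must be positive";    // cheap sanity check (enable with -ea)

            LOG.info(() -> "Processing sku=" + sku + " quantity=" + quantity);

            Integer price = findPrice(sku);
            if (price == null) {                                   // check the return value, don't assume
                LOG.warning(() -> "No price found for sku=" + sku);
                return;
            }
            LOG.info(() -> "Total=" + price * quantity);
        }
    }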
While debugging I try to think of what the possible problems could be. I've come up with a fairly arbitrary classification system, but it works for me: all bugs fall into one of four categories. Keep in mind here that I'm talking about runtime problems, not compiler or linker errors. The four categories are:
dynamic memory allocation
stack overflow
uninitialized variable
logic bug
These categories have been most useful to me with C and C++, but I expect they apply pretty well elsewhere. The logic bug category is a big one (e.g. putting a < b when the correct thing was a <= b), and can include things like failing to synchronize access among threads.
Knowing what I'm looking for (one of these four things) helps a lot in finding it. Finding bugs always seems to be much harder than fixing them.
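For example, the boundary flavor of a logic bug, the "< where <= was meant" case, might look like this (an invented snippet, not from any real codebase):

    // Invented example of the "< where <= was meant" logic bug.
    class DiscountCalculator {
        double discountFor(int units) {
            // Bug: customers buying exactly 100 units were promised the discount too.
            // if (units > 100) ...      <- the faulty comparison
            if (units >= 100) {          // corrected boundary
                return 10.0;
            }
            return 0.0;
        }
    }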
The actual mechanics for debugging are most often:
do I have an automated test that demonstrates the problem?
if not, add a test that fails
change the code so the test passes
make sure all the other tests still pass
check in the change
No automated testing in your environment? No time like the present to set it up. Too hard to organize things so you can test individual pieces of your program? Take the time to make it so. May make it take "too long" to fix this particular bug, but the sooner you start, the faster everything else'll go. Again, you might not fix the particular bug you're looking for but I bet you find and fix others along the way.
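As a sketch, that loop might look like the following JUnit-style test (the class under test and the expected values are invented; the point is only that the test reproduces the bug and fails until the fix goes in):

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    // Hypothetical regression test written to reproduce a reported bug before fixing it.
    class DiscountCalculatorTest {

        @Test
        void discountAppliesAtExactThreshold() {
            DiscountCalculator calc = new DiscountCalculator();   // invented class under test
            // Reported bug: orders of exactly 100 units got no discount (boundary case).
            assertEquals(10.0, calc.discountFor(100), 0.0001);    // fails until the comparison is fixed
        }
    }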
My method of debugging is different, probably because I am still a beginner.
When I encounter a logic bug, I seem to end up adding more variables to see which values go where, and then I debug line by line through the piece of code that's causing the problem.
Replicating the problem and generating a repeatable test data set is definitely the first and most important step to debugging.
If I can identify a repeatable bug, I'll typically try and isolate the components involved until I locate the problem. Frequently I'll spend a little time ruling out cases so I can state definitively: The problem is not in component X (or process Y, etc.).
First I try to replicate the error, without being able to replicate the error it is basically impossible in a non-trivial program to guess the problem.
Then, if possible, I break the code out into a separate standalone project. There are several reasons for this: first, if the original project is big, it's quite difficult to debug; second, it eliminates or highlights any assumptions about the code.
I normally always have another copy of VS open which I use for the debugging parts in mini projects and to test routines which I later add to the main project.
Once having reproduced the error in the separate module the battle is almost won.
Sometimes it is not easy to break out a piece of code, so in those cases I use different methods depending on how complex the issue is. In most cases assumptions about data seem to come back and bite me, so I try to add lots of asserts to the code in order to make sure my assumptions are correct. I also disable code using #ifdef until the error disappears, eliminating dependencies on other modules and so on - sort of slowly circling in on the bug like a vulture...
I don't think I really have a conscious way of doing it; it varies quite a lot, but the general principle is to eliminate the noise around the issue until it's quite obvious what it is. Hope I didn't sound too confusing :)
I am a junior programmer. Since my supervisor told me to sit in with the client, I joined. I saw the unsatisfied face of the client despite the successful (from my programmer's perspective) delivery of the project!
Client: You could have included this!
Us: Was not in the specification!
Client: Common Sense!
As a programmer, how do you respond in this situation?
What you should do to avoid this situation:
Explicitly spec out what will be included and what will not be included.
The problem probably comes down to the unspecified parts of the spec:
The client thinks that unspecified stuff should be in, i.e. it was implied.
The developer thinks that unspecified stuff should not be in.
For future specs that you have, you should have a catch all statement, that explicitly states that if something is not specified in this document, it can be done after the original specification is done at an additional cost.
What you should do in the current situation:
Other than learning from your experiences, you should come to some compromise with the client.
Example: I will do this feature that you feel is common sense, but for all future additions/changes it will have to be spec'ed out explicitly.
That is, you will have to do a little more work now, but it is worth it in return for the explicitly spec'ed, catch-all agreement your client will enter into.
Bad spec?
Was it necessarily a bad spec? No.
It is impossible to mention everything your clients may expect, so it is critical to have this catch all statement mentioned above stated clearly and explicitly in your spec/contract.
Other ways to reduce the problem:
Involve the client early, show them early prototypes. Even if they don't demand it.
Try not to sell the client an end product, but more of a service for working on his product.
Consider an agile development model or something similar so that tasks are well defined, small, paid for, and indisputable.
This would be one of many reasons why I switched to an Agile development philosophy. The only way, in my opinion, to successfully avoid this scenario is to either be omniscient or involve the customer heavily and release early/release often to get feedback as soon as possible. That way you can develop the software the customer really wants, not the software the customer tells you they want.
Client: You could have included this!
Us: Was not in the specification!
Client: Common Sense!!
Us: We do not attempt to go beyond what the client has specified - we follow the specification. It's as important to NOT implement features not specified as it is to implement features specified. We will never second guess our customers, who value the fact that they can completely depend on us to correctly and completely implement the specification on time and under budget.
As others very rightly point out, the situation is almost always more complex than the simple exchange I've described above.
However, the above is valid if the implementer has a specification with the customer's signature on it which essentially implements an agreement that says "once the software provably implements all the features in the spec then it is considered complete", and anything additional is outside the specification and therefore outside the contract.
The contract itself may have some input here as well - if you don't have a signed contract than it doesn't matter what's in the spec - everything so far has been done on a handshake, and the entire deal (including payment) can go down the toilet based on any dissatisfaction on either side.
But if you have a contract and a specification, and the customer has seen and signed both, then they have no wriggle room to ask you to go further.
Now, as to the question of whether you should implement it:
AWESOME! You delivered a product and they only had one complaint. Implement the feature, call it a 'freebie' (make sure they understand you're working outside the spec and contract, and explicitly send them a bill for the work with the discount shown in dollars), and have them sign off on the project as a whole.
It will explicitly demonstrate that the project is ended, that you went above and beyond the call of duty, and that any further 'surprises' are outside the contract/spec, which gives you a nice layer of protection beyond what you already (ostensibly) have.
If it's a UI issue, then you're in murkier water.
Does the spec adequately describe the UI? Does it have mockups? I wouldn't fault a customer for this complaint about the UI if the spec did not very closely describe the layout, usage, and include mockups.
Either way, I think you can understand the customer's position - if they haven't played with UI mockups, then they're going to be disappointed with the result regardless - there's no way, psychologically speaking, that you and your customer could have possibly had the same idea in mind (nevermind the fact that common sense isn't!).
Quite frankly if this is the first time the customer has thought about checking out the UI before the work is finished, then it's at least partially your fault for not explaining good UI design processes to them. This is a key feature for their app, and it's very tightly coupled to what they've imagined - no one can be satisfied in such a situation unless they've 'grown' their internal representation over time to match what the reality is.
This disconnect is solved only through frequent user and customer testing, which is obviously missing. This is a problem regarding client education and communication, not whether the specification was met or not.
-Adam
Expect last minute changes of scope - they always happen, so be ready.
Review progress frequently with client - to minimize surprises.
Contract: Functional Spec, plus Time & Materials with an initial cap (so the client feels in control).
Then when changes come along, re-negotiate the cap if necessary.
Never say they can't have what they want. They can get that answer for free!
Always give them a little more than they asked for, so they know you've got a positive attitude.
Relate to the client as being on the same team with them. Don't accept being legalistically painted as an adversary.
They may think of contractors as not loyal, compared to employees. Show them you're as dedicated to their success as their employees are, and you'll go the extra mile.
Classic case...
There's no definite answer to this one, but it all revolves around communication. There should have been preventive measures put in place (like weekly reviews or something like that).
For sure, you can't redo the whole thing for free.
Two ways: either you tell them to ** off, or you deal with it.
If you choose to deal:
First, empathize, respect the client.
Have a look at what can easily be changed.
Have a look at the contracts.
Maybe create a new agreement.
Don't do too much.
Make them see the progress and the work it takes.
Find workarounds for the missing features (maybe using other great features, or available tools.)
Use your common sense; it is so common, it's not even funny.
This is one of the many drawbacks of a fixed bid arrangement. Any time business needs or priorities change, or there is even a simple misunderstanding, it results in anything from an awkward situation like this to calling lawyers in. If you have an arrangement where you get paid for development time, you can always react to any change and get paid for whatever time it takes to make that change. Also, having a by-the-hour arrangement does not preclude having a plan or making an estimate.
Once you are in a fixed bid pickle, though, your options are:
1) Do it at an additional cost.
2) Do it free.
3) Don't do it.
Option 3 is the worst, and Option 1 is the best. If you have a good, trusting relationship and decent communication with the client, it's usually easy to arrive at Option 1. If the relationship is bad, then you've got bigger problems. At that point, just try to avoid lawyers.
A final point - any project that has something known as "The Delivery Date" inevitably runs into the problem described. Projects with said date usually involve retreating to a cave for several months to develop in hiding followed by an unleashing of the product all at once in front of the stakeholders. This is abrupt and leaves plenty of time for client expectations and the actual product to drift apart. If, instead, you show intermediate versions of the product and gather feedback every few weeks, two things happen. First, you get better feedback, minimize misunderstandings, and make a better product. Second, there is no single point in time on which a massive amount of expectation is laid. The potential difference between what the client is imagining and what actually exists is much smaller. No surprises.
Good luck.
"how do you react?"
Question 1 - do you want to continue this relationship with this customer? Seriously. If they are going to claim that unspecified features are "common sense," this may not be a good relationship to maintain or enhance.
If you want to disengage, then that's easy. Ask for them to highlight each part of the specification that you failed to comply with and play that game. Get specific test criteria for each missing feature. Pull Teeth. Be confrontational in determining what's missing. Don't ask why. Just ask for all the details up front. It's slow and unpleasant. But you don't want them anyway.
If you want to engage, well, you're going to have to change the relationship. Currently, you have a Passive Aggressive Customer. They won't say what they want, but they will say what they don't want.
This may be a habit with them; this may be how they win concessions. Or this just may be sloppy specification on their part.
If you want the relationship, your reaction has two parts.
Short-term. Get something they're happy with. They have to identify specific changes. You have to score each change with a "cost to do" and "fit with specification".
Some things are cheap and a good fit. Do those.
Some things are cheap to do, but a bad fit with the specification. Think twice about enabling a bad specification to lead to rework. In a sense, you purchased the specification from them; you may need to raise your standards, also.
The expensive things which (sadly) fit the specification are a problem. You're in trouble with these, and pretty much have to do them.
The expensive things which don't fit well with the specification are lessons learned for everyone. Detail a plan for these, including specification rewrites and approvals.
Long-term. Make sure you're not PA'd again. Review early and often, use Agile techniques. Communicate more, prototype more, release more.
Well, it was not successfully delivered. Somewhere along the line there was miscommunication. Without knowing the specifics I would suggest this is not a developer injected problem and this is probably not to be blamed on the customer - the requirements gathering task was insufficient. This is a classic example of what happens when the software side does not have domain experts or the requirements discovery process doesn't do all that it could...
If it were me, I would correct the problem and figure out how to avoid similar issues in the future.
How you handle this can very well determine the future of this contract/business with the client. Taking responsibility and correcting the issue is a huge opportunity for your company.
EDIT:
This is a good time to evaluate how this happened to help correct it. Some companies choose to totally revamp everything they do which is a mistake I think. So is ignoring it. Blaming people for the problem is also a mistake.
It is a good time to walk through how this happened, what the process is, and maybe how it could have been caught. I would not make huge rule changes or process changes - but coming up with guidelines for future work is a great thing. Your company had a clear lesson about a shortcoming. Losing the opportunity to correct this problem and to correct your process would be a waste of a good chance.
ZiG, I've had to deal with this problem on several occasions at my current place of Employment. My group (3 developers) tries to approach things in an Agile manner. We're used to getting mid-stream and even last-second requests (which we then treat on a case-by-case basis).
However, we make it clear that resources (particularly time) are limited and if it's not in the spec we can't make promises. If it's judged important and it can't fit into the current release, we generally plan a followup release. If it isn't important, it goes on a list.
One thing I've found is that you can get users to agree to Spec S at Time T. However at Time T + N, getting them to remember they agreed to Spec S, or getting them to acknowledge that they did so (with the documentation you've been keeping, I hope!) can be trickier than it should be.
Speaking to the OP's subject and question:
If you are an employed programmer, then I would hope that other resources are in the meeting with you. Possibly "higher ups" in the organization.
If this is the case, then your job is to answer DIRECT questions, and to keep your emotions in check. Yes, you may feel injured because they don't love your code, but showing any emotion with bosses present is not a good thing. Rather, try and look neutral and let the others handle the session.
Now, if they "hang you out to dry", then I would recommend the following questions:
a) "OK. I see. Why exactly do you feel it is common sense to include this feature? I'd like to discover why we didn't include it." (Force them to explain their thought process. Common sense to one person is rarely common sense to anyone else.)
b) "Well, I'm sure we could include that in the next release. I'll leave it up to XXX (the bosses) to come to a mutually agreeable approach" (i.e. don't talk cost or freebies with bosses present. EVER.)
Again, this assumes you are a programmer WORKING for a company that delivered the product. Now, if are more than that - i.e. you ARE one of the higher ups, then many of the suggestions here are excellent.
However, if you are the higher-up or a consultant programmer, then first and foremost:
a) Apologize for the process that did not catch this requirement. Promise to work with the client to prevent this from recurring.
Then on to the other strategies. It really doesn't matter if you charge for the fix or not - the apology is the most important action to the client. Again, it bears repeating - you are not apologizing for the missed feature. You are apologizing for the faulty design process that let it slip. Clients are usually pretty accommodating when you start this way and then seek a solution.
Cheers,
-Richard
Use Scrum-like approaches to avoid this deathtrap: involve the client in the dev process early, frequently, and in informal, restricted committees -> risk reduction and improved agility.
In terms of your literal question, how to react, the best way is to ignore your ego ("what?! After I worked so hard on this and met the spec?!") and instead focus on some active listening and working to consensus.
Client: You could have included this!
Us: Was not in the specification!
Client: Common Sense!!
Us: I understand that you're not happy that we didn't go beyond the bounds of the specification. Seeing how you feel about this, how can we make you happy? Let's see if there's a process we can create together that will help everyone.
Essentially, you don't want to turn this into a "you said/I said" death match. The only way to resolve those involves lawyers, and then nobody wins. If you can agree that the spec or the process was at fault, work together to fix those.
This approach actually just worked for me: wait for the guy who doesn't like your software to leave and be replaced by the guy who does like it.
Obviously you can't really rely on this, but if you're sure that you did a good job and that your software really will satisfy the business needs of the people who hired you, it does pay to wait it out. Sometimes the client's initial reaction will not be their final one, especially if you can quickly incorporate their concerns.
Don't try to make the client feel like it is their fault. It might be their fault, but making them feel that way will not produce constructive results, and could just annoy them.
Instead, you should realize that clients only complain about software they use, in most cases because they like it. Nobody complains about software nobody uses. It is inevitable that a client will complain about the software you deliver, even if you deliver exactly what they ask for. So don't sweat it. Software is never done.
Total failure on the part of the person in charge of requirements collection, no doubt about it. Additional failure of the project management to not iterate the deliverable and have check-in meetings with the client.
However, you have a signed-off spec, and what you've delivered matches the spec. So, your company has two choices: write off the cost in the name of business development and make the change for free, or charge them for the change request.
If it ain't in the spec, it ain't in the spec. As a developer with no specific domain knowledge, 'common sense' is an irrelevant concept. Different industries work in different ways and one approach might be quite appropriate for a particular domain but completely unacceptable in the other.
Writing good specs is an art-form. IMO, you can either take an agile 'analyst/programmer' approach where you make small iterations or write and maintain a detailed, unambiguous specification. Both are highly skilled tasks, and are still iterative. You still have to evolve the specification.
Either way is not as easy as it sounds and both require the ability to establish a good working relationship with the client.
You cannot know what your customer is thinking. This situation occurs often with clients who have no experience with programming projects. What I suggest is to simply show them that "common sense" isn't a very accurate answer in engineering (or programming, if you prefer).
Show them other examples from everyday life that demonstrate you cannot build something that isn't written down. Example: when building a new house, the builder needs a plan with every detail... he won't add optional electrical outlets just because it's more "common sense" to have some extras in the living room...
I had this once. And luckily it wasn't me that created the design because that proved to be the problem.
It is of vital importance that the communication between your company and the client be as close to perfect as possible. Be sure you understand each other. Ask questions and let them ask questions. Do not leave anything open in the design - that will be the problem point at delivery. And have regular meetings during the project (preferably with a prerelease).
Unfortunately a lot of developers are bad at communication, and a lot of clients are not aware of their own needs. But if you can minimize the gap, you will have found yourself a happy (and returning) customer.
This is why I/the teams I worked with always used a prototype-style approach, that means:
after collecting the requirements, you show the client an early and basic release of the software
the client says "you could have included this"/"it's common sense"
you change your design to reflect the client's desiderata
iterate from point 1 till the official release
You have to start early on: tell the customer, early and often, that the spec/use-cases/user-stories are a contract that defines what will be delivered. In an agile environment there are plenty of chances for the customer to observe some "common sense" feature they want and ask for it, which is one of the advantages of an agile approach; but if you start accepting "common sense" additions at the end, you are setting yourself up for infinite extensions, probably at your expense.
Some customers expect this; the more and better you tell them they can't, the easier the eventual arguments will be.
As a junior guy, I realize you can't do this -- yet -- but one of the hard-but-necessary lessons is that sometimes you have to fire a customer.
You learn - everything is learning and nothing is personal.
We are experts in our area; we know better than the customer what he needs. And next time, for the next customer, we will suggest all the useful features in advance and make him happy, and he will pay us more money, because we are the experts and we know better.
How would you begin improving on a really bad system?
Let me explain what I mean before you recommend creating unit tests and refactoring. I could use those techniques but that would be pointless in this case.
Actually the system is so broken it doesn't do what it needs to do.
For example the system should count how many messages it sends. It mostly works but in some cases it "forgets" to increase the value of the message counter. The problem is that so many other modules with their own workarounds build upon this counter that if I correct the counter the system as a whole would become worse than it is currently. The solution could be to modify all the modules and remove their own corrections, but with 150+ modules that would require so much coordination that I can not afford it.
Even worse, there are some problems that have workarounds not in the system itself, but in people's heads. For example, the system cannot represent more than four related messages in one message group. Some services would require five messages grouped together. The accounting department knows about this limitation, and every time they count the messages for these services, they count the message groups and multiply by 5/4 to get the correct number of messages. There is absolutely no documentation of these deviations, and nobody knows how many such things are present in the system now.
So how would you begin working on improving this system? What strategy would you follow?
A few additional things: I'm a one-man army working on this, so hiring enough people to redesign/refactor the system is not an acceptable answer. And in a few weeks or months I really should show some visible progress, so spending a couple of years refactoring it myself isn't an option either.
Some technical details: the system is written in Java and PHP, but I don't think that really matters. There are two databases behind it, an Oracle one and a PostgreSQL one. Besides the flaws mentioned before, the code itself smells too; it is really badly written and badly documented.
Additional info:
The counter issue is not a synchronization problem. The counter++ statements were added to some modules and not to others. A quick and dirty fix is to add them where they are missing. The long-term solution is to make the counting a kind of aspect for the modules that need it, making it impossible to forget later. I have no problem with fixing things like this, but if I made this change I would break over 10 other modules.
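To illustrate the direction I mean (a sketch with invented names; a real aspect would more likely be wired up with something like AspectJ): route every send through one choke point so the increment cannot be forgotten.

    import java.util.concurrent.atomic.AtomicLong;

    // Sketch: route every send through one method so the counter can't be skipped.
    class MessageGateway {
        private final AtomicLong sentCount = new AtomicLong();
        private final Transport transport;               // invented dependency

        interface Transport { void send(String message); }

        MessageGateway(Transport transport) {
            this.transport = transport;
        }

        void send(String message) {
            transport.send(message);
            sentCount.incrementAndGet();                 // counted in exactly one place
        }

        long sentCount() {
            return sentCount.get();
        }
    }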
Update:
I accepted Greg D's answer. Even if I like Adam Bellaire's more, it wouldn't help me to know what would be ideal to know. Thanks all for the answers.
Put out the fires. If there are any issues of critical priority, whatever they are, you've got to handle them first. Hack it in if you must, with a smelly codebase it's ok. You know you'll improve it going forward. This is your sales technique targeted at whomever you're reporting to.
Pick some low-hanging fruit. I assume you're relatively new to this particular software and that you were re-tasked to deal with it. Find some apparently easy problems in a related subsystem of the code that shouldn't take more than a day or two to resolve apiece, and fix them. This may involve refactoring, or it may not. The goal is to familiarize yourself with the system and with the style of the original author. You may not get really lucky (One of the two incompetents who worked on my system before me always post-fixed his comments with four punctuation marks instead of one, which made it very easy to distinguish who wrote the particular segment of code.), but you'll develop insight into the author's weaknesses so you know what to look out for. Extensive, tight coupling with global state vs poor understanding of language tools, for example.
Set a big goal. If your experience parallels mine, you'll find yourself in a particular bit of spaghetti code more and more often as you perform the prior step. This is the first knot you need to untangle. With the experience you've gained understanding the component and knowledge about what the original author likely did wrong (and thus, what you need to watch out for), you can start envisioning a better model for this subset of the system. Don't worry if you still have to maintain some messy interfaces to maintain functionality, just take it one step at a time.
Lather, rinse, repeat! :)
Given time, consider adding unit tests for your new model one level underneath your interfaces with the rest of the system. Don't engrave the bad interfaces in code via tests that use them, you'll be changing them in a future iteration.
Addressing the particular issues you mention:
When you run into a situation that users are working around manually, talk with the users about changing it. Verify that they'll accept the change if you provide it before sinking the time into it. If they don't want the change, your job is to maintain the broken behavior.
When you run into a buggy component that multiple other components have worked around, I espouse a parallel component technique. Create a counter that works how the existing one should work. Provide a similar (or, if practical, identical) interface and slide the new component into the codebase. When you touch external components that work around the broken one, try to replace the old component with the new one. Similar interfaces ease porting of the code, and the old component is still around if the new one fails. Don't remove the old component until you can.
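A rough sketch of the parallel-component idea (all names invented): both counters sit behind the same interface, so callers can be ported one at a time and the old component stays available until the new one has proven itself.

    // Illustrative parallel-component setup; all names are invented.
    interface SendCounter {
        void recordSend();
        long total();
    }

    // The existing, quirky implementation stays in place, wrapped unchanged.
    class LegacySendCounter implements SendCounter {
        public void recordSend() { /* delegates to the old, inconsistent code */ }
        public long total()      { return 0; /* old behavior, workarounds intact */ }
    }

    // The replacement that behaves the way the counter always should have.
    class CorrectSendCounter implements SendCounter {
        private long count;
        public synchronized void recordSend() { count++; }
        public synchronized long total()      { return count; }
    }

    // External modules are switched over one at a time by swapping the instance they receive.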
What is being asked of you right now? Are you being asked to implement functionality, or fix bugs? Do they even know what they want you to do?
If you don't have the manpower, time, or resources to "fix" the system as a whole, then all you can do is bail water. You're saying you should be able to make some "visible progress" in a few months' time. Well, with the system being as bad as you describe, you may actually make it worse. Under pressure to do something noticeable, you'll simply add code and make the system even more convoluted.
You need to refactor, eventually. There is no way around it. If you can find a way to refactor that is visible to your end users, that would be ideal, even if it takes 6-9 months or a year instead of "a few months." But if you can't, then you have a choice to make:
Refactor, and risk being viewed as "not accomplishing anything" despite your efforts
Don't refactor, accomplish "visible" goals, and make the system more convoluted and more difficult to refactor one day. (Maybe after you find a better job, and hope the next developer to come along can never find out where you live.)
Which one is most beneficial to you personally depends on your company's culture. Will they one day decide to hire more developers, or replace this system completely with some other product?
Conversely, if your efforts to "fix things" actually break other things, will they be understanding about the monstrosity you're being asked to tackle single-handedly?
No easy answers here, sorry. You have to evaluate based on your unique, individual situation.
This is a whole book (Working Effectively with Legacy Code) that will basically say "unit test and refactor", but with more practical advice on how to do it:
http://ecx.images-amazon.com/images/I/51RCXGPXQ8L._SL500_AA240_.jpg
http://www.amazon.com/Working-Effectively-Legacy-Robert-Martin/dp/0131177052
You open the directory that contains this system with Windows Explorer. Then, press Ctrl-A, and then Shift-Delete. That sounds like an improvement in your case.
Seriously though: that counter sounds like it's got thread-safety issues. I'd put a lock around the increasing functions.
And regarding the rest of the system, you can't do the impossible so try to do the possible. You need to attack your system from two fronts. Take care of the more visibly problematic issues first, so you can show progress. At the same time, you should deal with the more infrastructural problems, so that you have a chance at actually fixing this thing some day.
Good luck, and may the source be with you.
Pick one area that would be of medium difficulty to refactor. Create a skeleton of the original code with only the method signatures of the existing ones; maybe use an Interface even. Then start hacking away. You can even point the "new" methods to the old ones until you get to them.
Then, testing, testing, testing. Since there aren't any unit tests, maybe just use good old fashioned Voice-Activated-Unit Tests (people)? Or write your own tests as you go.
Document your progress as you go in some kind of repository, including frustrations and questions, so that the next poor schmuck who gets this project won't be where you are :)
Once you get the first part done, move on to the next. The key is to build on top of incremental progress, that's why you shouldn't start with the hardest part first; it'll be too easy to get demoralized.
Joel has a couple of articles on rewriting/refactoring:
http://www.joelonsoftware.com/articles/fog0000000069.html
http://www.joelonsoftware.com/articles/fog0000000348.html
I've been working with a legacy system with the same characteristics for almost three years now, and there are no shortcuts that I'm aware of.
What bothers me most about our legacy system is that I'm not allowed to fix some bugs, since many other functions could break if I fixed them. This calls for ugly workarounds, or for creating new versions of the old functions. Calls to the old functions can then be replaced with the new ones, one at a time (testing as you go).
I'm not sure what the goal of your task is, but I strongly advise you to touch as little of the code as possible. Only do what you need to do.
You may want to get as much as possible documented by interviewing people. This is a huge task, since you don't know which questions to ask, and people will have forgotten a lot of details.
Other than that: make sure you're getting paid and enough moral support. There will be weeping and gnashing of teeth...
Well you need to start somewhere, and it sounds like there are bugs that need fixing. I would work through those bugs, making quick win refactorings, and writing any unit tests possible along the way. I would also use a tool like SourceMonitor to identify some of the most 'complex' parts of code in the system and see if I could simplify their design in any way. Ultimately, you just have to accept that it will be a slow process, and make small steps towards a better system.
I would try to pick a part of the system that could be extracted and rewritten in isolation fairly quickly. Even if it doesn't do much, you could show progress pretty quickly, and you don't have the problem of interfacing with the legacy code directly.
Hopefully, if you could pick off a few such tasks, they will see you making visible progress, and you could put forward an argument for hiring more people to rewrite the bigger modules. When parts of the system rely on broken behaviour, you don't have much choice but to separate before you fix anything.
Hopefully, you could gradually build a team capable of rewriting the whole lot.
All of this would have to go hand in hand with some decent training, otherwise people's old habits will stick, and your work will get the blame when things don't work as expected.
Good luck!
Deprecate everything that currently exists and has problems, and write new replacements that work correctly. Document as much as you can about what will change and put big red flashing signs all over the place pointing to this documentation.
By doing it that way, you can keep your existing bugs (the ones that are being compensated for somewhere else) around without slowing down your progress towards getting an actual working system.
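If this happens to be .NET, the [Obsolete] attribute is one cheap way to turn those big red flashing signs into compiler warnings at every remaining call site; the class, methods, and the rounding bug below are all invented for illustration:

    using System;

    public static class PriceCalculator
    {
        // Keep the old behaviour (and whatever bugs other code compensates for),
        // but make every remaining caller show up as a compiler warning.
        [Obsolete("Known rounding problem; use CalculateTotalV2 and see the migration notes.")]
        public static decimal CalculateTotal(decimal unitPrice, int quantity)
        {
            return Math.Round(unitPrice * quantity, 1);   // old, wrong precision, left as-is
        }

        public static decimal CalculateTotalV2(decimal unitPrice, int quantity)
        {
            return Math.Round(unitPrice * quantity, 2, MidpointRounding.AwayFromZero);
        }
    }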
I was asked to do a code review and report on the feasibility of adding a new feature to one of our new products, one that I haven't personally worked on until now. I know it's easy to nitpick someone else's code, but I'd say it's in bad shape (while trying to be as objective as possible). Some highlights from my code review:
Abuse of threads: QueueUserWorkItem and threads in general are used a lot, and Thread-pool delegates have uninformative names such as PoolStart and PoolStart2. There is also a lack of proper synchronization between threads, in particular accessing UI objects on threads other than the UI thread.
Magic numbers and magic strings: Some Const's and Enum's are defined in the code, but much of the code relies on literal values.
Global variables: Many variables are declared global and may or may not be initialized depending on what code paths get followed and what order things occur in. This gets very confusing when the code is also jumping around between threads.
Compiler warnings: The main solution file contains 500+ warnings, and the total number is unknown to me. I got a warning from Visual Studio that it couldn't display any more warnings.
Half-finished classes: The code was worked on and added to here and there, and I think this led to people forgetting what they had done before, so there are a few seemingly half-finished classes and empty stubs.
Not Invented Here: The product duplicates functionality that already exists in common libraries used by other products, such as data access helpers, error logging helpers, and user interface helpers.
Separation of concerns: I think someone was holding the book upside down when they read about the typical "UI -> business layer -> data access layer" 3-tier architecture. In this codebase the UI layer accesses the database directly, because the business layer is only partially implemented and mostly ignored, while the data access layer controls the UI layer: most of the low-level database and network methods operate on a global reference to the main form and directly show, hide, and modify it. Where the rather thin business layer is actually used, it also tends to control the UI directly. Most of this lower-level code also uses MessageBox.Show to display error messages when an exception occurs, and most of it swallows the original exception. This of course makes it much harder to start writing unit tests to verify the functionality of the program before attempting to refactor it.
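To make that last pattern concrete, here is an invented miniature of it (the real code is messier, and the names below are mine), next to the shape a refactoring would probably head towards:

    using System;
    using System.Windows.Forms;   // the current style drags a UI dependency into the data layer

    public class Customer { }

    public class DataAccessException : Exception
    {
        public DataAccessException(string message, Exception inner) : base(message, inner) { }
    }

    public class CustomerRepository
    {
        // Roughly the pattern described above: a MessageBox from the data layer,
        // and the original exception is swallowed.
        public Customer LoadCurrentStyle(int id)
        {
            try
            {
                return QueryDatabase(id);
            }
            catch (Exception)
            {
                MessageBox.Show("Could not load customer.");  // UI call in the wrong layer
                return null;                                  // the real error is gone
            }
        }

        // The alternative: preserve the error and let the caller (ultimately the
        // UI layer) decide what to show. This is also what makes it testable.
        public Customer Load(int id)
        {
            try
            {
                return QueryDatabase(id);
            }
            catch (Exception ex)
            {
                throw new DataAccessException("Failed to load customer " + id, ex);
            }
        }

        private Customer QueryDatabase(int id)
        {
            // real database access would go here
            return new Customer();
        }
    }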
I'm just scratching the surface here, but my question is simple enough: Would it make more sense to take the time to refactor the existing codebase, focusing on one issue at a time, or would you consider rewriting the entire thing from scratch?
EDIT: To clarify a bit, we do have the original requirements for the project, which is why starting over could be an option. Another way to phrase my question is: Can code ever reach a point where the cost of maintaining it would become greater than the cost of dumping it and starting over?
Without any offense intended, the decision to rewrite a codebase from scratch is a common and serious management mistake that newbie software developers make.
There are many disadvantages to be wary of.
Rewrites stop new feature development cold for months or years. Few companies, if any, can afford to stand still for that long.
Most development schedules are difficult to nail down, and this rewrite will be no exception; amplify the previous point by however long the rewrite runs over.
Bugs that were fixed in the existing codebase through painful experience will be re-introduced. Joel Spolsky has more examples in this article.
Danger of falling victim to the Second-system effect -- in summary, "People who have designed something only once before try to do all the things they 'didn't get to do last time', loading the project up with all the things they put off while making version one, even if most of them should be put off in version two as well."
Once this expensive, burdensome rewrite is completed, the very next team to inherit the new codebase is likely to use the same excuses for doing another rewrite. Programmers hate learning someone else's code. No one writes perfect code because perfection is so subjective. Find me any real-world application and I can give you a damning indictment and rationale for doing a from-scratch rewrite.
Whether you ultimately rewrite from scratch or not, beginning a refactoring phase now is a good way both to really sit down and understand the problem, so that a rewrite will go more smoothly if it's truly called for, and to give the existing codebase an honest look to see whether a rewrite is really needed.
To actually scrap and start over?
When the current code doesn't do what you would like it to do, and would be cost prohibitive to change.
I'm sure someone will now link Joel's article about Netscape throwing their code away and how it's oh-so-terrible and a huge mistake. I don't want to talk about it in detail, but if you do link that article, before you do so, consider this: the IE engine, the engine that allowed MS to release IE 4, 5, 5.5, and 6 in quick succession, the IE engine that totally destroyed Netscape... it was new. Trident was a new engine after they threw away the IE 3 engine because it didn't provide a suitable basis for their future development work. MS did that which Joel says you must never do, and it is because MS did so that they had a browser that allowed them to completely eclipse Netscape. So please... just meditate on that thought for a moment before you link Joel and say "oh you should never do it, it's a terrible idea".
A rule of thumb I've found useful: if, given a code base, I have to re-write more than 25% of the code to make it work or to modify it for new requirements, I may as well re-write it from scratch.
The reasoning is that you can only patch a body of code so far; beyond a certain point, it's quicker to do over.
There's an underlying assumption that you have a mechanism (such as thorough unit and/or system tests) that will tell you whether your re-written version is functionally equivalent (where it needs to be) to the original.
If it requires more time to read and understand the code (if that is even possible) than it would to rewrite the entire application, I say scrap it and start over.
Be very careful with this:
Are you sure you aren't just being lazy and not bothering to read the code?
Are you being arrogant about the great code you will write compared to the rubbish anyone else produced?
Remember, tested, working code is worth a lot more than imaginary yet-to-be-written code.
In the words of our esteemed host and overlord, Joel, in "Things You Should Never Do": it's not always wrong to abandon working code - but you have to be sure about the reason.
I saw an application re-architected within 2 years of its introduction into production, and others rewritten in different technologies (one went from C++ to Java). Neither effort was, to my mind, successful.
I prefer a more evolutionary approach to bad software. If you can "componentize" your old app such that you can introduce your new requirements and interface with the old code, you can ease yourself into the new environment without having to "sell" the zero-value (from a biz perspective) investment in rewriting.
Suggested approach - write unit tests for the functionality with which you wish to interface to 1) ensure the code behaves as you expect and 2) provide a safety net for any refactoring that you may wish to do on the old base.
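For example (names and numbers invented, and NUnit used only because it's handy), a characterization test pins down what the old code does today rather than what it "should" do:

    using NUnit.Framework;

    // Stand-in for whatever legacy functionality you need to interface with.
    public class LegacyPricing
    {
        public decimal Total(decimal unitPrice, int quantity)
        {
            return unitPrice * quantity;   // imagine something much hairier here
        }
    }

    [TestFixture]
    public class LegacyPricingCharacterizationTests
    {
        // Assert the behaviour the code has *now*, even where it looks odd;
        // any refactoring that changes it will then fail loudly.
        [Test]
        public void Total_for_three_items_matches_current_behaviour()
        {
            var pricing = new LegacyPricing();
            Assert.That(pricing.Total(9.99m, 3), Is.EqualTo(29.97m));
        }

        [Test]
        public void Zero_quantity_currently_returns_zero_rather_than_throwing()
        {
            var pricing = new LegacyPricing();
            Assert.That(pricing.Total(9.99m, 0), Is.EqualTo(0m));
        }
    }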
Bad code is the norm. I think IT gets a bad rap from business for favoring rewrites/rearchitecting/etc. They pay the money and "trust" us (as an industry) to deliver solid, extensible code. Sadly, business pressures frequently result in shortcuts that make the code unmaintainable. Sometimes it's bad programmers... sometimes bad situations.
To answer your rephrased question... can code maintenance costs ever exceed rewriting costs... the answer is clearly yes. I don't see anything in your examples, however, that leads me to believe this is your case. I think those issues can be addressed with tests and refactoring.
In terms of business value, I would think it's extremely rare that a real case can be made for a rewrite due solely to the internal state of the code. If the product's customer-facing and is currently live and bringing in money (i.e. is not a mothballed or unreleased product), then consider that:
You already have customers using it. They're familiar with it, and might have built some of their own assets around it. (Other systems that interface to it; products based on it; processes they'd have to change; staff they'd maybe have to retrain). All of this costs the customer money.
Re-writing it might cost less in the long term than making difficult changes and fixes. But you can't quantify that yet, unless your app is no more complex than Hello World. And a re-write means a re-test and a redeploy, and probably an upgrade path for your customers.
Who says the re-write will be any better? Can you honestly say your firm is writing sparkly code now? Have the practices that turned the original code to spaghetti been corrected? (Even if the main culprit was a single developer, where were his peers and management, ensuring quality through reviews, testing, etc.?)
In terms of technical reasons, I'd suggest it could be time for a major rewrite if the original has some technical dependencies that have become problematic. e.g. a third party dependency that's now out of support, etc.
In general though, I think the most sensible move is to refactor piece by piece (very small pieces if it's really that bad), and improve the internal architecture incrementally rather than in one big drop.
Two threads of thought on this one: Do you have the original requirements? Do you have confidence that the original requirements are accurate? What about test plans or unit tests? If you have those things in place it might be easier.
Putting on my customer hat, does the system work or is it unstable? If you've got something that's unstable you've got an argument to change; otherwise you're best off refactoring it bit by bit.
I think the line in the sand is when basic maintenance is taking 25% - 50% longer than it should. There comes a time when maintaining legacy code becomes too costly. A number of factors contribute to the final decision. Time and cost being the most important factors I think.
If there are clean interfaces and you can cleanly delineate module boundaries, then it might be worth refactoring it module by module or layer by layer, so that you can migrate existing customers forward into cleaner, more stable codebases; over time, after you've refactored every module, you will have rewritten everything.
But, based on the code review, it doesn't sound like there would be any clean boundaries.
I wonder if the people who vote for scrapping and starting over have ever successfully refactored a large project, or at least seen a large project in poor condition that they think could use a refactoring?
If anything, I err on the opposite side: I've seen 4 large projects that were a mess, that I advocated refactoring as opposed to rewriting. On a couple, there was barely a single line of original code that remained, and major interfaces changed in significant ways, but the process never involved the entire project failing to function as well as it originally did, for any more than a week. (And top-of-trunk was never broken).
Perhaps a project exists that is so severely broken that to attempt to refactor it would be doomed to failure, or perhaps one of the previous projects I refactored would have been better served by a "clean re-write", but I'm not sure I'd know how to recognize it.
I agree with Martin. You really need to weigh the effort that will be involved in writing the app from scratch against the current state of the app and how many people use it, do they like it, etc. Often we may want to completely start from scratch, but the cost far outweighs the benefit. I come across bits of ugly looking code all the time, but I soon realize that some of these 'ugly' areas are really bug fixes and make the program work correctly.
I would try to consider the architecture of the system and see whether it is possible to scrap and rewrite specific well defined components without starting everything from scratch.
What would usually happen is that you can either do that (and then sell that to the customer/management), or you find out that the code is such a horrible and tangled mess that you become even more convinced that you need a rewrite and have more convincing arguments for it (including: "if we engineer it right, we would never need to scrap the whole thing and do a third rewrite").
Slow maintenance would eventually cause that architectural drift that would make a rewrite more expensive later.
Scrap old code early and often. When in doubt, throw it out. The hard part is convincing non-technical folks of the cost-to-maintain.
So long as the value derived appears to be greater than the cost to operate and maintain, there's still positive value flowing from the software. The question surrounding a rewrite is this: "Will we get even more value from a rewrite?" Or alternatively, "How much more value will we get from a rewrite?" How many person-hours of maintenance will you save?
Remember, the rewrite investment is once only. The return on the rewrite investment lasts forever. Forever.
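To make that concrete with purely invented numbers: if the old code eats 120 person-hours of maintenance a month, the rewrite is estimated at 2,000 hours, and the rewritten system is expected to need 30 hours a month, then you save 90 hours a month and break even after roughly 2000 / 90, i.e. about 22 months. Everything after that is the "forever" part of the return. If the break-even point lands beyond the product's expected lifetime, the rewrite has no business case no matter how ugly the code is.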
Focus the value question down to specific issues. You listed a bunch of them above. Stick with that.
"Will we get more value by reducing cost through
dropping the junk that we don't use
but still have to wade through?"
"Will we get more value from dropping the junk that's unreliable and breaks?"
"Will we get more value if we understand it -- not by documenting, but by replacing with something we built as a team?"
Do your homework. You'll have to confront the following show-stoppers.
These will originate somewhere in your executive foodchain from someone who'll respond as follows:
"Is it broken?" And when you say "It's not crashed as such," They'll say "It's not broke - don't fix it."
"You've done the code analysis, you understand it, you no longer need to fix it."
What's your answer to them?
That's only the first hurdle. Here's the worst possible situation. This doesn't always happen, but it does happen with alarming frequency.
Someone in your executive foodchain will have this thought:
"A rewrite doesn't create enough value. Rather than simply rewrite, let's expand it." The justification is that by creating enough value, users are more likely to buy in to the rewrite.
A project where scope is expanded -- artificially -- to add value is usually doomed.
Instead, do the smallest rewrite you can to replace the darn thing. Then expand to fit real needs and add value.
You can only give a definite yes to rewriting if you know completely how your application works (and by completely I mean it, not just having a general idea of how it should work) and you know more or less exactly how to make it better. In any other case it's a shot in the dark; it depends on too many things. Perhaps gradual refactoring would be safer, if it is possible.
If possible, I generally prefer to rewrite smaller portions of the code over time when I need to refactor a baseline. There are typically many smaller issues, such as magic numbers and poor commenting, that tend to make the code look worse than it actually is. So, unless the baseline is just awful, keep the code and just make improvements at the same time you are maintaining it.
If refactoring requires a lot of work, I recommend laying out a small re-design plan/todo list that gives you a list of things to work on in order so that you can bring the baseline to a better state. Starting from scratch is always a risky move and you are not guaranteed that the code will be better when you are finished. Using this technique, you will always have a working system that improves over time.
Code with excessively high cyclomatic complexity (like over 100 in a large number of modules) is a good clue. Also, how many bugs does it have per KLOC? How critical are the bugs? How often are bugs introduced when bug fixes are made? If your answer is "a lot" (I can't remember the norms right now), then a rewrite is warranted.
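For a sense of scale: a method's cyclomatic complexity is roughly 1 plus the number of decision points in it, so the small invented C# method below scores about 4. Modules averaging over 100 are a different universe entirely.

    // Complexity = 1 + decision points; here the if, else-if and second if give about 4.
    public static class ShippingExample
    {
        public static decimal ShippingCost(decimal weightKg, bool express)
        {
            decimal cost;

            if (weightKg <= 1m)           // decision 1
                cost = 5m;
            else if (weightKg <= 10m)     // decision 2
                cost = 12m;
            else
                cost = 25m;

            if (express)                  // decision 3
                cost *= 2m;

            return cost;
        }
    }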
As early as possible. Whenever you get a premonition that your code is slowly turning into an ugly beast that is very likely to consume your soul and give you headaches, and you know the problem is in the underlying structure of the code (so any fix would be a hack, e.g. introduce a global variable), then it's time to start over.
For some reason, people don't like throwing away precious code, but if you feel you're better off starting over, you are probably right. Trust your instinct and remember that it wasn't a waste of time: it taught you one more way of NOT approaching the problem. You could (should) always use a version control system so your baby is never really lost.
I do not have any experience with using metrics for this myself, but the article "Software Maintainability Metrics Models in Practice" discusses more or less the same question asked here for two case studies they did. It starts with the following editor's note:
In the past, when a maintainer received new code to maintain, the rule-of-thumb was "If you have to change more than 40 percent of someone else's code, you throw it out and start over." The Maintainability Index [MI] addressed here gives a much more quantifiable method to determine when to "throw it out and start over." This work was sponsored by the U.S. Air Force Information Warfare Center and the U.S. Department of Energy [DOE], Idaho Field Office, DOE Contract No. DE-AC07-94ID13223.
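From memory (so please verify against the article itself), the three-metric Maintainability Index it discusses is roughly: MI = 171 - 5.2 * ln(average Halstead Volume) - 0.23 * (average extended cyclomatic complexity) - 16.2 * ln(average lines of code per module), with an optional fourth term that rewards comments. Higher is better; values above roughly 85 are usually read as maintainable and values below about 65 as hard to maintain.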
I think the rule was...
The first version is always a throwaway
So, if you learned your lesson(s), or his/her lessons, then you can go ahead and write it fresh now that you understand your problem domain better.
Not that there aren't parts that can/should be kept. Tested code is the most valuable code, so if it isn't deficient in any real way other than style, no reason to toss it all out.
When is it good (if ever) to scrap production code and start over?
Never had to do this, but logic would dictate (to me, anyway) that once you pass the inflection point where you're spending more time reworking and fixing bugs in the existing code base than you are adding new functionality, it's time to trash the old stuff and get a fresh start.
If it requires more time to read and understand the code (if that is even possible) than it would to rewrite the entire application, I say scrap it and start over.
I have never completely thrown out code, even when going from a FoxPro system to a C# system.
If the old system worked then why just throw it out?
I have come across a few really bad systems: threads used where they weren't needed, horrible inheritance, and abuse of interfaces.
It is best to understand what the old code is doing and why it is doing it. Then change it so that it is not confusing.
Of course, if the old code doesn't work (I mean it can't even compile), then you might be justified in just starting over. But how often does that actually happen?
Yes, it totally can happen. I've seen money saved by doing it.
This is not a tech decision, it's a business decision. Code rewrites are a long-term gain, while "if it ain't totally broke..." is a short-term gain. If you are in a first-year startup that is focused on getting a product out the door, the answer is usually to just live with it. If you're in an established company, or the errors in the current system are creating more workload and therefore costing the company more money, then they might go for it.
Present the problem as best as you can to your GM, use dollar values where you can. "I don't like dealing with it" means nothing. "It'll take twice the time to do everything until this is fixed" means a lot.
I think there are a number of issues here that depend largely on where you are at.
Is the software working well from a customer perspective? (If yes, be very careful about changes.) If the system is working, I would think there is little point re-writing it unless you are expanding the feature set. Are you planning to expand the features and customer base of the software? If so, then you have much more reason to change.
As much as anything, just trying to understand someone else's code can be difficult even when it's well written; when it's badly written, I would imagine it's almost impossible. What you describe sounds like something that would be very difficult to expand.
I would take into consideration whether the application does what it is intended to do, whether you will ever be required to make modifications, and whether you are confident that the app has been thoroughly tested in all the scenarios it will be used in.
Do not invest the time if the app does not need alterations. However, if it doesn't function as you need, and you need to control the hours invested in making corrections, scrap it and re-write it to standards that your team can support. There's nothing worse than terrible code that you have to support and decipher but still have to live with. Remember, Murphy's Law says it will be 10 at night when you have to make things work, and that is never productive.
Production code always has some value. The only case where I would truly throw it all out and start again is if we determine the intellectual property is irrevocably contaminated. For example if someone brought large amounts of code from a previous employer, or a large percentage of the code was ripped from a GPLd codebase.
I'm going to post this book every time I see a discussion on Refactoring. Everyone should read "Working Effectively with Legacy Code" by Michael Feathers. I found it to be an excellent book - if nothing else, it's a fun read, and motivational.
When the code has reached a point where it is no longer maintainable or extensible. When it is full of short-term hacky fixes, has lots of coupling, has long (100+ line) methods, has database access in the UI, and generates a lot of random, impossible-to-debug errors.
Bottom line: When maintaining it is more expensive (i.e. takes longer) than rewriting it.
I used to believe in just rewriting from scratch, but it is wrong.
http://www.joelonsoftware.com/articles/fog0000000069.html
Changed my mind.
What I would suggest is figuring out a way to properly refactor the code. Keep all existing functionality and test as you go. We have all seen horrible code bases, but it is important to keep the knowledge your application has accumulated over time.