Related
I'm currently working on a small short-lived project. But despite the size it's complicated enough with very unclear logic. That's why it was started by more experienced developers. They work on it from time to time because it's not their main project.
They made some code drafts with numerous places which 'would be rewritten in the nearest future'. After that they added several another 'temporary pieces'. And then again..
So, now the project is a mess of 'half-working' pieces of code with some hardcoded values, like file names or some constants which 'will be replaced latter with working parts'. The API is awful (nobody thinks about it actually).
And it's really, really hard to do development now (for me it's the main and only project). I caught myself thinking that I spent about an hour every day just to understand again all that tricky 'temporary' things and API weaknesses. And after that hour my brain melts.
I can't just say "guys, your code smells like a trash dump". What's the correct way?
It seems like the ultimate problem is they are writing code and not taking responsibility for its quality.
If this goes against the culture of the organization, it's a matter of making the situation known others. If the developers don't know, and have a modicum of empathy, I would take the "I don't quite understand this. Could you spend a few minutes walking me through it?" with them. They should soon realize what they are doing to you, and good programmers will adjust their practices. This may also have to be done via the management hierarchy-- "In order to progress on project X, I need Y hours of the programmers' time to work with their code effectively." It should either happen or bring up a "Why" conversation that should lead to changes.
If this is the culture of the organization, that's unfortunate. It may mean that the programmers producing the code don't care, and nor do any of the management. This is a bit of a political question-- who is most capable and/or interested in seeing this change? Find allies and proceed best you can. A candid conversation with the developers may be the best choice, as they are the people capable of change and no one else is going to induce them to-- so just ask outright.
Hope this helps.
Push for implementation of a formal code review process. Then they won't dare write code like that in the first place. I recommend using a collaborative tool like SmartBear's Code Collaborator or the free ReviewBoard.
Just like people drive slower when they know the cops are watching, they write better code if they know someone is going to be looking at it.
Are these 'other developers' no longer working on the project? And if so, are you the main person working on it? If the answer to both of these is "yes" the the project is yours. Start to make incremental improvements to make it more readable.
You might also like to show the code to a more experienced developer who didn't work on the project. See if they agree that the code is hard to maintain. Suggest to your boss that you set some time aside to 'finish off' the temporary work and bring it to a point where it is maintainable.
Implementing a formal code review process is also a good idea if you want to prevent this happening again.
And remember it may not have been the other developers fault. Sometimes people are told to spend the minimum amount of time, or are told that the code will be thrown away.
We have a bug in our application that does not occur every time and therefore we don't know its "logic". I don't even get it reproduced in 100 times today.
Disclaimer: This bug exists and I've seen it. It's not a pebkac or something similar.
What are common hints to reproduce this kind of bug?
Analyze the problem in a pair and pair-read the code. Make notes of the problems you KNOW to be true and try to assert which logical preconditions must hold true for this happen. Follow the evidence like a CSI.
Most people instinctively say "add more logging", and this may be a solution. But for a lot of problems this just makes things worse, since logging can change timing-dependencies sufficiently to make the problem more or less frequent. Changing the frequency from 1 in 1000 to 1 in 1,000,000 will not bring you closer to the true source of the problem.
So if your logical reasoning does not solve the problem, it'll probably give you a few specifics you could investigate with logging or assertions in your code.
There is no general good answer to the question, but here is what I have found:
It takes a talent for this kind of thing. Not all developers are best suited for it, even if they are superstars in other areas. So know your team, who has a talent for it, and hope you can give them enough candy to get them excited about helping you out, even if it isn't their area.
Work backwards, and treat it like a scientific investigation. Start with the bug, what you see is wrong. Develop hypotheses about what could cause it (this is the creative/imaginative part, the art that not everyone has the talent for) - and it helps a lot to know how the code works. For each of those hypotheses (preferably sorted by what you think is most likely - again pure gut feel here), develop a test that tries to eliminate it as the cause, and test the hypothesis. Any given failure to meet a prediction doesn't mean the hypothesis is wrong. Test the hypothesis until it is confirmed to be wrong (although as it gets less likely you may want to move on to another hypothesis first, just don't discount this one until you have a definitive failure).
Gather as much data as you can during this process. Extensive logging and whatever else is applicable. Do not discount a hypothesis because you lack the data, rather remedy the lack of data. Quite often the inspiration for the right hypothesis comes from examining the data. Noticing something off in a stack trace, weird issue in a log, something missing that should be there in a database, etc.
Double check every assumption. So many times I have seen an issue not get fixed quickly because some general method call was not further investigated, so the problem was just assumed to be not applicable. "Oh that, that should be simple." (See point 1).
If you run out of hypotheses, that is generally caused by insufficient knowledge of the system (this is true even if you wrote every line of code yourself), and you need to run through and review code and gain additional insight into the system to come up with a new idea.
Of course, none of the above guarantees anything, but that is the approach that I have found gets results consistently.
Add some sort of logging or tracing. For example log the last X actions the user committed before causing the bug (only if you can set a condition to match bug).
It's quite common for programmers not to be able to reiterate a user-experienced crash simply because you have developed a certain workflow and habits in using the application that obviously goes around the bug.
At this frequency of 1/100, I'd say that the first thing to do is to handle exceptions and log anything anywhere or you could be spending another week hunting this bug.
Also make a priority list of potentially sensitive articulations and features in your project. For example :
1 - Multithreading
2 - Wild pointers/ loose arrays
3 - Reliance on input devices
etc.
This will help you segment areas that you can brute-force-until-break-again as suggested by other posters.
Since this is language-agnostic, I'll mention a few axioms of debugging.
Nothing a computer ever does is random. A 'random occurrence' indicates a as-yet-undiscovered pattern. Debugging begins with isolating the pattern. Vary individual elements and assess what makes a change in the behaviour of the bug.
Different user, same computer?
Same user, different computer?
Is the occurrence strongly periodic? Does rebooting change the periodicity?
FYI- I once saw a bug that was experienced by a single person. I literally mean person, not a user account. User A would never see the problem on their system, User B would sit down at that workstation, signed on as User A and could immediately reproduce the bug. There should be no conceivable way for the app to know the difference between the physical body in the chair. However-
The users used the app in different ways. User A habitually used a hotkey to to invoke a action and User B used an on-screen control. The difference in the user behaviour would cascade into a visible error a few actions later.
ANY difference that effects the behaviour of the bug should be investigated, even if it makes no sense.
There's a good chance your application is MTWIDNTBMT (Multi Threaded When It Doesn't Need To Be Multi Threaded), or maybe just multi-threaded (to be polite). A good way to reproduce sporadic errors in multi-threaded applications is to sprinkle code like this around (C#):
Random rnd = new Random();
System.Threading.Thread.Sleep(rnd.Next(2000));
and/or this:
for (int i = 0; i < 4000000000; i++)
{
// tight loop
}
to simulate threads completing their tasks at different times than usual or tying up the processor for long stretches.
I've inherited many buggy, multi-threaded apps over the years, and code like the above examples usually makes the sporadic errors occur much more frequently.
Add verbose logging. It will take multiple -- sometimes dozen(s) -- iterations to add enough logging to understand the scenario.
Now the problem is that if the problem is a race condition, which is likely if it doesn't reproduce reliably, so logging can change timing and the problem will stop happening. In this case do not log to a file, but keep a rotating buffer of the log in memory and only dump it on disk when you detect that the problem has occurred.
Edit: a little more thoughts: if this is a gui application run tests with a qa automation tool which allows you to replay macros. If this is a service-type app, try to come up with at least a guess as to what is happening and then programmatically create 'freak' usage patterns which would exercise the code that you suspect. Create higher than usual loads etc.
What development environment?
For C++, your best bet may be VMWare Workstation record/replay, see:
http://stackframe.blogspot.com/2007/04/workstation-60-and-death-of.html
Other suggestions include inspecting the stack trace, and careful code overview... there is really no silver bullet :)
Try to add code in your app to trace the bug automatically once it happens (or even alert you via mail / SMS)
log whatever you can so when it happens you can catch the right system state.
Another thing- try applying automated testing that can cover more territory than human based testing in a formed manner.. it's a long shot, but a good practice in general.
all the above, plus throw some brute force soft-robot at it that is semi random, and scater a lot of assert/verify (c/c++, probably similar in other langs) through the code
Tons of logging and careful code review are your only options.
These can be especially painful if the app is deployed and you can't adjust the logging. At that point, your only choice is going through the code with a fine-tooth comb and trying to reason about how the program could enter into the bad state (scientific method to the rescue!)
Often these kind of bugs are related to corrupted memory and for that reason they might not appear very often. You should try to run your software with some kind of memory profiler e.g., valgrind, to see if something goes wrong.
Let’s say I’m starting with a production application.
I typically add debug logging around the areas where I think the bug is occurring. I setup the logging statements to give me insight into the state of the application. Then I have the debug log level turned on and ask the user/operator(s) notify me of the time of the next bug occurrence. I then analyze the log to see what hints it gives about the state of the application and if that leads to a better understanding of what could be going wrong.
I repeat step 1 until I have a good idea of where I can start debugging the code in the debugger
Sometimes the number of iterations of the code running is key but other times it maybe the interaction of a component with an outside system (database, specific user machine, operating system, etc.). Take some time to setup a debug environment that matches the production environment as closely as possible. VM technology is a good tool for solving this problem.
Next I proceed via the debugger. This could include creating a test harness of some sort that puts the code/components in the state I’ve observed from the logs. Knowing how to setup conditional break points can save a lot of time, so get familiar with that and other features within your debugger.
Debug, debug , debug. If you’re going nowhere after a few hours, take a break and work on something unrelated for awhile. Come back with a fresh mind and perspective.
If you have gotten nowhere by now, go back to step 1 and make another iteration.
For really difficult problems you may have to resort to installing a debugger on the system where the bug is occurring. That combined with your test harness from step 4 can usually crack the really baffling issues.
Unit Tests. Testing a bug in the app is often horrendous because there is so much noise, so many variable factors. In general the bigger the (hay)stack, the harder it is to pinpoint the issue. Creatively extending your unit test framework to embrace edge cases can save hours or even days of sifting
Having said that there is no silver bullet. I feel your pain.
Add pre and post condition check in methods related to this bug.
You may have a look at Design by contract
Along with a lot of patience, a quiet prayer & cursing you would need:
a good mechanism for logging the user actions
a good mechanism for gathering the data state when the user performs some actions (state in application, database etc.)
Check the server environment (e.g. an anti-virus software running at a particular time etc.) & record the times of the error & see if you can find any trends
some more prayers & cursing...
HTH.
Assuming you're on Windows, and your "bug" is a crash or some sort of corruption in unmanaged code (C/C++), then take a look at Application Verifier from Microsoft. The tool has a number of stops that can be enabled to verify things during runtime. If you have an idea of the scenario where your bug occurs, then try to run through the scenario (or a stress version of the scenario) with AppVerifer running. Make sure to either turn on pageheap in AppVerifier, or consider compiling your code with the /RTCcsu switch (see http://msdn.microsoft.com/en-us/library/8wtf2dfz.aspx for more information).
"Heisenbugs" require great skills to diagnose, and if you want help from people here you have to describe this in much more detail, and patiently listen to various tests and checks, report result here, and iterate this till you solve it (or decide it is too expensive in terms of resources).
You will probably have to tell us your actual situation, language, DB, operative system, workload estimate, time of the day it happened in the past, and a myriad of other things, list tests you did already, how they went, and be ready to do more and share the results.
And this will not guarantee that we collectively can find it, either...
I'd suggest to write down all things that user has been doing. If you have lets say 10 such bug reports You can try to find something that connects them.
Read the stack trace carefully and try to guess what could be happened;
then try to trace\log every line of code that potentially can cause trouble.
Keep your focus on disposing resources; many sneaky sporadical bugs i found were related to close\dispose things :).
For .NET projects You can use Elmah (Error Logging Modules and Handlers) to monitor you application for un-caught exceptions, it's very simple to install and provides a very nice interface to browse unknown errors
http://code.google.com/p/elmah/
This saved me just today in catching a very random error that was occuring during a registration process
Other than that I can only recommend trying to get as much information from your users as possible and having a thorough understanding of the project workflow
They mostly come out at night....
mostly
The team that I work with has enlisted the users in recording their time they spend in our app with CamStudio when we've got a pesky bug to track down. It's easy to install and for them to use, and makes reproducing those nagging bugs much easier, since you can watch what the users are doing. It also has no relationship to the language you're working in, since it's just recording the windows desktop.
However, this route seems to be viable only if you're developing corporate apps and have good relationships with your users.
This varies (as you say), but some of the things that are handy with this can be
immediately going into the debugger when the problem occurs and dumping all the threads (or the equivalent, such as dumping the core immediately or whatever.)
running with logging turned on but otherwise entirely in release/production mode. (This is possible in some random environments like c and rails but not many others.)
do stuff to make the edge conditions on the machine worse... force low memory / high load / more threads / serving more requests
Making sure that you're actually listening to what the users encountering the problem are actually saying. Making sure that they're actually explaining the relevant details. This seems to be the one that breaks people in the field a lot. Trying to reproduce the wrong problem is boring.
Get used to reading assembly that was produced by optimizing compilers. This seems to stop people sometimes, and it isn't applicable to all languages/platforms, but it can help
Be prepared to accept that it is your (the developer's) fault. Don't get into the trap of insisting the code is perfect.
sometimes you need to actually track the problem down on the machine it is happening on.
#p.marino - not enough rep to comment =/
tl;dr - build failures due to time of day
You mentioned time of day and that caught my eye. Had a bug once were someone stayed later at work on night, tried to build and commit before they left and kept getting a failure. They eventually gave up and went home. When they caught in the next morning it built fine, they committed (probably should have been more suspiscious =] ) and the build worked for everyone. A week or two later someone stayed late and had an unexpected build failure. Turns out there was a bug in the code that made any build after 7PM break >.>
We also found a bug in one seldom used corner of the project this january that caused problems marshalling between different schemas because we were not accounting for the different calendars being 0 AND 1 month based. So if no one had messed with that part of the project we wouldn't have possibly found the bug until jan. 2011
These were easier to fix than threading issues, but still interesting I think.
hire some testers!
This has worked for really weird heisenbugs.
(I'd also recommend getting a copy of "Debugging" by Dave Argans, these ideas are partly derived form using his ideas!)
(0) Check the ram of the system using something like Memtest86!
The whole system exhibits the problem, so make a test jig that exercises the whole thing.
Say it's a server side thing with a GUI, you run the whole thing with a GUI test framework doing the necessary input to provoke the problem.
It doesn't fail 100% of the time, so you have to make it fail more often.
Start by cutting the system in half ( binary chop)
worse case, you have to remove sub-systems one at a time.
stub them out if they can't be commented out.
See if it still fails. Does it fail more often ?
Keep proper test records, and only change one variable at a time!
Worst case you use the jig and you test for weeks to get meaningful statistics. This is HARD; but remember, the jig is doing the work.
I've got No threads and only one process, and I don't talk to hardware
If the system has no threads, no communicating processes and contacts no hardware; it's tricky; heisenbugs are generally synchronization, but in the no-thread no processes case it's more likely to be uninitialized data, or data used after being released, either on the heap or the stack. Try to use a checker like valgrind.
For threaded/multi-process problems:
Try running it on a different number of CPU's. If it's running on 1, try on 4! Try forcing a 4-computer system onto 1.
It'll mostly ensure things happen one at a time.
If there are threads or communicating processes this can shake out bugs.
If this is not helping but you suspect it's synchronization or threading, try changing the OS time-slice size.
Make it as fine as your OS vendor allows!
Sometimes this has made race conditions happen almost every time!
Obversely, try going slower on the timeslices.
Then you set the test jig running with debugger(s) attached all over the place and wait for the test jig to stop on a fault.
If all else fails, put the hardware in the freezer and run it there. The timing of everything will be shifted.
Debugging is hard and time consuming especially if you are unable to deterministically reproduce the problem. My advice to you is to find out the steps to reproduce it deterministically (not just sometimes).
There has been a lot of research in the field of failure reproduction in the past years and is still very active. Record&Replay techniques have been (so far) the research direction of most researchers. This is what you need to do:
1) Analyze the source code and determine what are the sources of non-determinism in the application, that is, what are the aspects that may take your application through different execution paths (e.g. user input, OS signals)
2) Log them in the next time you execute the application
3) When your application fails again, you have the steps-to-reproduce the failure in your log.
If your log still does not reproduce the failure, then you are dealing with a concurrency bug. In that case, you should take a look at how your application accesses shared variables. Do not attempt to record the accesses to shared variables, because you would be logging too much data, thereby causing severe slowdowns and large logs. Unfortunately, there is not much I can say that would help you to reproduce concurrency bugs, because research still has a long way to go in this subject. The best I can do is to provide a reference to the most recent advance (so far) in the topic of deterministic replay of concurrency bugs:
http://www.gsd.inesc-id.pt/~nmachado/software/Symbiosis_Tutorial.html
Best regards
Use an enhanced crash reporter. In the Delphi environment, we have EurekaLog and MadExcept. Other tools exist in other environments. Or you can diagnose the core dump. You're looking for the stack trace, which will show you where it's blowing up, how it got there, what's in memory, etc.. It's also useful to have a screenshot of the app, if it's a user-interaction thing. And info about the machine that it crashed on (OS version and patch, what else is running at the time, etc..) Both of the tools that I mentioned can do this.
If it's something that happens with a few users but you can't reproduce it, and they can, go sit with them and watch. If it's not apparent, switch seats - you "drive", and they tell you what to do. You'll uncover the subtle usability issues that way. double-clicks on a single-click button, for example, initiating re-entrancy in the OnClick event. That sort of thing. If the users are remote, use WebEx, Wink, etc., to record them crashing it, so you can analyze the playback.
What do you do when you're assigned to work on code that's
atrocious and antiquated to the point where it's almost incomprehensible?
For example: hardware interface code, mixed with logic, AND user interface code, ALL in the same functions?
We see bad code all the time, but what do you actually do about it?
Do you try to refactor it?
Try to make it OO if it's not?
Or do you try to make some sense of it, make the necessary changes and move on?
Depends on a few factors for me:
Will I be maintaining this code in the future, or is it a one-off fix?
How long until this system is replaced entirely?
How busy am I at the moment?
Ideally, I'd refactor all bad code I had to maintain, but the reality is there are only so many hours in the day.
As is frequently the case, "It Depends".
I tend to ask myself some of the following questions:
Are there unit tests for the existing code?
Is refactoring the code an acceptable risk for my project?
Is the author still available to clarify any questions I might have about the code?
Will my employer consider the time spent on changing existing, functioning code to be an acceptable use of my time?
And so on...
But assuming that I have the capacity to do so, refactoring is preferential as the up front cost of fixing the code now will likely save me a lot of time and effort later in maintenance and development time.
There are other benefits as well, including the fact that the more clean and well maintained you keep your code base, the more likely other developers are to keep it that way. The Pragmatic Programmer calls this the Broken Window Theory.
Developers have an instinct to assume that code is always ugly because of other, inferior developers. Sometimes, code is ugly because the problem space is ugly. All that ugliness isn't just ugliness - it is sometimes institutional memory. Each line of ugly in your code probably represents a bug fix. So think very carefully before you rip it all out.
Basically, I would say that you shouldn't touch code like this unless you actually have to. If there's a real bug that you can solve, refactoring is reasonable, if you can be sure you're maintaining the same amount of functionality. But refactoring for the sake of refactoring (eg, "make the code OO") is what I would generally classify as a classic newbie mistake.
The book Working Effectively with Legacy Code discusses the options you can do. In general the rule is not to change code until you have to (to fix a bug or add a feature). The book describes how to make changes when you can't add testing and how to add testing to complex code (which allows more substantial changes).
You try to refactor it, in the strict sense on the word, where you're not changing the behaviour.
The first target is usually to break up giant methods.
Given the strength of some of the adjectives you use, i.e. atrocious, antiquated and incomprehensible, I'd bin it!
If it is in such a state, like the example you give, it's probably not got any test code for it either. Refactoring is mentioned in many of the other answers but, sometimes, it is not appropriate. I always find that, when refactoring, you generally need a clear path through which the old code can be gradually morphed into the new in a number of well defined steps.
When the old code is so far removed from how you want it to look, such as the extreme cases you seem to be suggesting, you could probably redesign, rewrite and test the new code in a shorter time than it would to take to refactor it.
Scrap it and start over, using the compiled legacy application as a business requirements document.
And spending time in analysis with the users to see what they want changed.
Post it to www.worsethanfailure.com!!!
If no modifications are needed, I don't touch it.
If at all possible, I write automated unit tests first, especially focused on the areas that need modification.
If automated unit tests are not possible, I do what I can to document manual unit tests.
I am just using the tests to document "current" behavior at this point.
If possible, I always keep a version of the code and executable environment that runs things the "original" way (before I touched it) so I can always add new "behavior documentation" tests and better detect regressions I may have caused later.
Once I start changing things, I want to be very careful not to introduce regressions. I do this by continually rerunning (and or adding new tests) to the tests I wrote before I started writing code.
When possible, I leave bugs as-is if there is no business need for them to be fixed. Those bugs may be "features" to some users and may have unclear side effects that wouldn't be clear until the code was redeployed to production.
As far as refactoring, I do that as aggressively as possible, but only in the code that I need to change otherwise anyway. I may refactor more aggressively in my own personal copy of the code that will never be checked in, just to improve the readability of the code for me personally. It's often times difficult to properly test changes that are only made for readability reasons, so for safety reasons, I generally don't check those changes in / deploy them unless I can confidently test that the code changes are completely safe (it's really bad to introduce bugs when you are making changes that are unnecessary for anything but readability).
Really, it's a risk management problem. Proceed with caution. The users do not care if the code is atrocious, they just care that it gets better without getting worse. Your need for beautiful code is not important in this scenario, get past it.
Just like any other code, you leave it slightly better when you leave it than it was when you entered it. You do not ever, ever rewrite the whole code. If that is the work it takes for some reason, then you start a project (small or large) for it.
I am assuming we are talking about a substantial amount of code here.
Not every day is a great day at work you know :)
The first question to ask is: does it work?
If the answer is yes, that would be a huge disincentive to simply ditch it and start over. There may be thousands of man-hours in that code which address edge cases and nasty bugs. Worse yet, there may be other modules in the system that depend on the current incorrect (but known and possibly documented) behavior. Don't mess with it if it isn't broken.
If you are keen on cleaning it up, start by writing test cases for the current behavior. When you run across an instance where the behavior differs from the specification, you must decide whether to accept the behavior as "correct" or go with what the spec say it ought to do.
Only once you have written test cases that all pass should you begin to refactor. The tests will tell you whether your efforts are breaking anything.
I'd talk to my manager and describe the code. Most managers would not want a program held together by banding wire and duct tape per se. If the code is really that bad there are sure to be some business logic errors, hardcoding etc. stuffed in there that will eventually just destroy productivity.
I've come across some pretty bad code before (single letter variable names, no comments, everything crammed onto one line, etc.) and once I mentioned/showed it to my manager they almost always said "go ahead and re-write it", because not only are you taking the hit for reading and changing the code but future co-workers will have to go through the same pain. Better that you take a longer period of time just once to rewrite it rather than having each person who touches the code in the future have to go through and comprehend and decipher it first.
There is an old saying. If is isn't broke, do not fix it. If you have to maintain it then reverse engineer it and document it so the next time you come across it you will know what it does.
You do not know the situation the developer was in when he or she wrote the code. He or she may have been under a time crunch when it was written, (management was all over the developer, etc)
There are also situations where he or she wrote the code per the spec, The spec then changed several times, the developer had to patch the code, as rewrite is out of the question due to time constraints. This happens all of the time.
If the code impacts the performance of robustness of the application and is modular then you can re factor or re-write. Document the situation to assist future programmers in understanding.
Also many programmers consider reverse engineering other developers code as beneath them.
they would rather rewrite without considering the ramifications of doing so.
If you have never done so, try it sometime, it will make you a better developer.
Thanks
Joe
Kill it with fire.
Depends on your time frame and how important that code is to you. If you have to "just make it work" then do that and rewrite the module when time allows.
If its an important or integral part of what you do then refactor refactor refactor.
Then find the guy/girl who wrote it and send them a rude postcard!
The worst offender (in my experience) of really AWFUL code is the ease with which people can do cut & paste these days. Cut & paste should be used rarely. If you think that's the right solution, it's generally better to step back and generalize the problem a little.
Anytime you see code that is "nearly incomprehensible", PROCEED WITH CAUTION. You need to assume that any major re-factoring will result in new bugs being introduced that you'll need to find and correct.
Additionally, I've seen this scenario many times (even fell victim to it myself once or twice): Programmer inherits legacy code, decides code is ancient & unmaintainable and decides to refactor it, ends up deleting key "fixes" or "business rules" subtly patched in over the years, ends up spending a lot of time tracking down and re-introducing similar code when users complain about "a problem fixed years ago is happening again".
Re-factoring (and debugging) almost always takes longer than expected and should never be considered as a "freebie" that comes along with whatever task you're supposed to be doing.
"If it ain't broke, don't 'fix' it" still has a lot of truth.
Im my company we always Refactor Mercilessly. so we still come across atrocious code but LESS and Less and less ...
We write a lot of in-house code and the company is run for about 100 years by the same family. Management usually tells us we have to maintain the code base (evolve) for another 50 years or so. In this setting having code you don't dare to touch is considered a bigger risk to the long term survival of the company then the prospect of downtime because some under-tested code broke because of refactoring.
I run copy-paste detector and findbugs on all legacy code that comes my way.
I then plan my initial refactoring:
remove unused code, unused variable and unused methods
refactor duplicated code
set up a single step build
build a basic functional test
By that point the code meets the basic minimum for maintainability. It can be easily built and basic errors can be found via an automated test.
I often add code like this:
log.debug("is foo null? " + (foo == null));
log.debug("is discount < raw price ? " + (foo.getDiscount() < foo.getRawPrice()));
Some of that code will be recovered for unit tests when I can refactor to it.
I've worked places where we ship that kind of code.
I try to make sense of it, make the necessary changes, and move on.
Of course, making sense of it usually involves some changes; at the very least, I move around the whitespace and line up corresponding braces in the same column like so:
if(condition){
doSomething(); }
// becomes...
if(condition)
{
doSomething();
}
I'll also often change variable names.
And very often, "the necessary changes" require refactoring. :)
Get the idea of what they're doing and the deadline to finish. A larger deadline, typically rebuild much of the code from the ground up, as I find it a very worthwhile experience to not only decipher terrible code and make it legible and document, but somewhere in your brain those neurons are pressed to avoid similar mistakes in the future.
I want to know if there are method to quickly find bugs in the program.
It seems that the more you master the architecture of your software, the more quickly
you can locate the bugs.
How the programmers improve their ability to find a bug?
Logging, and unit tests. The more information you have about what happened, the easier it is to reproduce it. The more modular you can make your code, the easier it is to check that it really is misbehaving where you think it is, and then check that your fix solves the problem.
Divide and conquer. Whenever you are debugging, you should be thinking about cutting down the possible locations of the problem. Every time you run the app, you should be trying to eliminate a possible source and zero in on the actual location. This can be done with logging, with a debugger, assertions, etc.
Here's a prophylactic method after you have found a bug: I find it really helpful to take a minute and think about the bug.
What was the bug exactly in essence.
Why did it occur.
Could you have found it earlier, easier.
Anything else you learned from the bug.
I find taking a minute to think about these things will make it far less likely that you will produce the same bug in the future.
I will assume you mean logic bugs. The best way I have found to capture logic bugs is to implement some sort of testing scheme. Check out jUnit as the standard. Pretty much you define a set of accepted outputs of your methods. Every time you compile your system it checks all of your test cases. If you have introduced new logic that breaks your tests, you will know about it instantly and know exactly what you have to fix.
Test driven design is a pretty big movement in programming right now. You will be hard pressed to find a language that doesn't support some kind of testing. Even JavaScript has a multitude of test suites.
Experience makes you a better debugger. Pay close attention to the bugs that you AND others commonly make. Try to figure out if/how these bugs apply to ALL code that affects you, not the single instance of where the bug was seen.
Raymond Chen is famous for his powers of psychic debugging.
Most of what looks like psychic
debugging is really just knowing what
people tend to get wrong.
That means that you don't necessarily have to be intimately familiar with the architecture / system. You just need enough knowledge to understand the types of bugs that apply and are easy to make.
I personally take the approach of thinking about where the bug may be in the code before actually opening up the code and taking a look. When you first start with this approach, it may not actually work very well, especially if you are pretty unfamiliar with the code base. However, over time someone will be able to tell you the behavior they are experiencing and you'll have a good idea where the problem is located or you may even know what to fix in the code to remedy the problem before even looking at the code.
I was on a project for several years that maintained by a vendor. They were not very good debuggers and most of the time it was up to us to point them to an area of the code that had the problem. What made our problem worse was that we didn't have a nice way to view the source code, so a lot of our "debugging" was just feeling.
Error checking and reporting. The #1 newbie coder debugging mistake is to turn off error reporting, avoid checking for whether what's going on makes sense, etc etc. In general, people feel like if they can't see anything going wrong then nothing is going wrong. Which of course could not be further from the case.
Instead, your code should be chock full of error conditions that will make lots of noise, with detailed reporting, someplace you will see it. (This doesn't mean inside a production web page.) Then, instead of having to trace an error all over the place because it got passed through sixteen layers of execution before it finally got someplace that broke, your errors start happening proximately to the actual issue.
It seems that the more you master the
architecture of your software ,the
more quickly you can locate the bugs.
After understanding the architecture, one's ability to find bugs in the application increases with their ability to identify and write extensive tests.
Know your tools.
Make sure that you know how to use conditional breakpoints and watches in your debugger.
Use static analysis tools as well - they can point out the more obvious issues.
Sleep and rest.
Use programming methods that produce fewer bugs in the first place.
If to implement a single stand-alone functional requirement it takes N separate point-edits to source code, the number of bugs put into the code is roughly proportional to N, so find programming methods that minimize N. Ways to do this: DRY (don't repeat yourself), code generation, and DSL (domain-specific-language).
Where bugs are likely, have unit tests.
Obviously.IMHO, the best unit tests are monte-carlo.
Make intermediate results visible.
For example, compilers have intermediate representations, in the form of 4-tuples. If there is a bug, the intermediate code can be examined. That tells if the bug is in the first or second half of the compiler.
P.S. Most programmers are not aware that they have a choice of how much data structure to use. The less data structure you use, the less are the chances for bugs (and performance issues) caused by it.
I find tracepoints to be an invaluable debugging tool. They are a bit like logging, except you create them during a debugging session to solve a particular issue, like breakpoints.
Printing the stacktrace in a tracepoint can be especially useful. For example, you can print the hash code and stacktrace in the constructor of an object, and then later on when the object is used again you can search for its hashcode to see which client code created it. Same for seeing who disposed it or called a certain method etc.
They are also great for debugging issues related to window focus changes etc, where the debugger would interfere if you drop in break mode.
Static code tools like FindBugs
Assertions, assertions, and assertions.
Some areas of our code has 4 or 5 assertions for each line of real code. When we get a bug report the first thing that happens is that the customer data is processed in our debug build 99 times out a hundred an assert will fire near the cause of the bug.
Additionally our debug build perform redundant calculations to ensure that an optimized algorithm is returning the correct result, and also debug functions are used to examine the sanity of data structures.
The hardest thing new developers have to contend with is getting their code to survive the assertions of the code gthey are calling.
Additionally we do not allow any code to be putback to toplevel that causes any integration or unit test to fail.
Stepping through the code, examining flow/state where unexpected behavior is occurring. (Then develop a test for it, of course).
Writing Debug.Write(message) in your code and using DebugView is another option. And then run your application find out what is going on.
"Architecture" in software means something like:
Several components
The components interact across clearly-defined interfaces
Each component has a well-defined responsibility
The responsibility of one component is unlike the responsibilities of other components
So, as you said, the better the architecture the easier it is to find bugs.
First: knowing the bug, you can decide which functionality is broken, and therefore know which component implements that functionality. For example, if the bug is that something isn't being logged properly, therefore this bug should be in one of 3 places:
In the component that's responsible for logging (your logging library)
Or, above that in the application code which is using this library
Or, below that in the system code which this library is using
Second: examine the data transfered across the interfaces between components. To continue the previous example above:
Set a debugger breakpoint on the application code which invokes the logger API, to verify whether the logger API is being used correctly (e.g. whether it's being invoked at all, whether parameters are as-expected, etc.).
Doing this tells you whether the bug is in the component above this interface, or in the component that's below this interface.
Repeat (perhaps using binary search if the call stack is very deep) until you've found which component is at fault.
When you come to the point that you think there must be a bug in the OS, check your assertions -- and put them into the code with "assert" statements.
Conversely, as you are writing the code, think of the range of valid inputs for your algorithms and put in assertions to make sure you have what you think you have. Same goes for output: Check that you produced what you think you produced.
E.g. if you expect a non-empty list:
l = getList(input)
assert l, "List was empty for input: %s" % str(input)
I'm part of the QA team # work, and knowing anything about the product and how it is developed, helps a lot in finding bugs, also when I make new QA tools I pass it to our dev team to test it, finding bugs in your own code is just plain hard!
Some people say programmers are tainted, so we cannot see bugs in their own product; we are not talking about code here, we are beyond that, usability and functionality itself.
Meanwhile unit testing seams to be a nice solution to find bugs in your own code, its totally pointless if you're wrong even before writing the unit test, how are you going to find the bugs then? you don't!, let your co-worker find them, hire a QA guy.
Scientific debugging is what I always used, and it greatly helps.
Basically, if you can replicate a bug, you can track its origin. You should then experiment some tests, observe the results, and infer hypotheses on why the bug happens.
Writing about all your hypotheses, attempts, expected results and observed results can help you track down the bugs, particularly if they're nasty.
There are automated tools that can help you with that process, particularly git-bisect (and similar bisection tools on other revision systems) to quickly find which change introduced the bug, unit testing to reproduce a bug and prevent regressions in your code (can be used in combination with bisect), and delta debugging to find the culprit in your code (similar to git-bisect but whereas git-bisect works on the code history, delta debugging works on the code directly).
But whatever the tools you are using, the most important benefit is in the scientific methodology, as this is the formalization of what most experienced debuggers do.
As you work in a legacy codebase what will have the greatest impact over time that will improve the quality of the codebase?
Remove unused code
Remove duplicated code
Add unit tests to improve test coverage where coverage is low
Create consistent formatting across files
Update 3rd party software
Reduce warnings generated by static analysis tools (i.e.Findbugs)
The codebase has been written by many developers with varying levels of expertise over many years, with a lot of areas untested and some untestable without spending a significant time on writing tests.
Read Michael Feather's book "Working effectively with Legacy Code"
This is a GREAT book.
If you don't like that answer, then the best advice I can give would be:
First, stop making new legacy code[1]
[1]: Legacy code = code without unit tests and therefore an unknown
Changing legacy code without an automated test suite in place is dangerous and irresponsible. Without good unit test coverage, you can't possibly know what affect those changes will have. Feathers recommends a "stranglehold" approach where you isolate areas of code you need to change, write some basic tests to verify basic assumptions, make small changes backed by unit tests, and work out from there.
NOTE: I'm not saying you need to stop everything and spend weeks writing tests for everything. Quite the contrary, just test around the areas you need to test and work out from there.
Jimmy Bogard and Ray Houston did an interesting screen cast on a subject very similar to this:
http://www.lostechies.com/blogs/jimmy_bogard/archive/2008/05/06/pablotv-eliminating-static-dependencies-screencast.aspx
I work with a legacy 1M LOC application written and modified by about 50 programmers.
* Remove unused code
Almost useless... just ignore it. You wont get a big Return On Investment (ROI) from that one.
* Remove duplicated code
Actually, when I fix something I always search for duplicate. If I found some I put a generic function or comment all code occurrence for duplication (sometime, the effort for putting a generic function doesn't worth it). The main idea, is that I hate doing the same action more than once. Another reason is because there's always someone (could be me) that forget to check for other occurrence...
* Add unit tests to improve test coverage where coverage is low
Automated unit tests is wonderful... but if you have a big backlog, the task itself is hard to promote unless you have stability issue. Go with the part you are working on and hope that in a few year you have decent coverage.
* Create consistent formatting across files
IMO the difference in formatting is part of the legacy. It give you an hint about who or when the code was written. This can gave you some clue about how to behave in that part of the code. Doing the job of reformatting, isn't fun and it doesn't give any value for your customer.
* Update 3rd party software
Do it only if there's new really nice feature's or the version you have is not supported by the new operating system.
* Reduce warnings generated by static analysis tools
It can worth it. Sometime warning can hide a potential bug.
I'd say 'remove duplicated code' pretty much means you have to pull code out and abstract it so it can be used in multiple places - this, in theory, makes bugs easier to fix because you only have to fix one piece of code, as opposed to many pieces of code, to fix a bug in it.
Add unit tests to improve test coverage. Having good test coverage will allow you to refactor and improve functionality without fear.
There is a good book on this written by the author of CPPUnit, Working Effectively with Legacy Code.
Adding tests to legacy code is certianly more challenging than creating them from scratch. The most useful concept I've taken away from the book is the notion of "seams", which Feathers defines as
"a place where you can alter behavior in your program without editing in that place."
Sometimes its worth refactoring to create seams that will make future testing easier (or possible in the first place.) The google testing blog has several interesting posts on the subject, mostly revolving around the process of Dependency Injection.
I can relate to this question as I currently have in my lap one of 'those' old school codebase. Its not really legacy but its certainly not followed the trend of the years.
I'll tell you the things I would love to fix in it as they bug me every day:
Document the input and output variables
Refactor the variable names so they actually mean something other and some hungarian notation prefix followed by an acronym of three letters with some obscure meaning. CammelCase is the way to go.
I'm scared to death of changing any code as it will affect hundreds of clients that use the software and someone WILL notice even the most obscure side effect. Any repeatable regression tests would be a blessing since there are zero now.
The rest is really peanuts. These are the main problems with a legacy codebase, they really eat up tons of time.
I'd say it largely depends on what you want to do with the legacy code...
If it will indefinitely remain in maintenance mode and it's working fine, doing nothing at all is your best bet. "If it ain't broke, don't fix it."
If it's not working fine, removing the unused code and refactoring the duplicate code will make debugging a lot easier. However, I would only make these changes on the erring code.
If you plan on version 2.0, add unit tests and clean up the code you will bring forward
Good documentation. As someone who has to maintain and extend legacy code, that is the number one problem. It's difficult, if not downright dangerous to change code you don't understand. Even if you're lucky enough to be handed documented code, how sure are you that the documentation is right? That it covers all of the implicit knowledge of the original author? That it speaks to all of the "tricks" and edge cases?
Good documentation is what allows those other than the original author to understand, fix, and extend even bad code. I'll take hacked yet well-documented code that I can understand over perfect yet inscrutable code any day of the week.
The single biggest thing that I've done to the legacy code that I have to work with is to build a real API around it. It's a 1970's style COBOL API that I've built a .NET object model around, so that all the unsafe code is in one place, all of the translation between the API's native data types and .NET data types is in one place, the primary methods return and accept DataSets, and so on.
This was immensely difficult to do right, and there are still some defects in it that I know about. It's not terrifically efficient either, with all the marshalling that goes on. But on the other hand, I can build a DataGridView that round-trips data to a 15-year-old application which persists its data in Btrieve (!) in about half an hour, and it works. When customers come to me with projects, my estimates are in days and weeks rather than months and years.
As a parallel to what Josh Segall said, I would say comment the hell out of it. I've worked on several very large legacy systems that got dumped in my lap, and I found the biggest problem was keeping track of what I already learned about a particular section of code. Once I started placing notes as I go, including "To Do" notes, I stopped re-figuring out what I already figured out. Then I could focus on how those code segments flow and interact.
I would say just leave it alone for the most part. If it's not broken then don't fix it. If it is broken then go ahead and fix and improve the portion of the code that is broken and its immediately surrounding code. You can use the pain of the bug or sorely missing feature to justify the effort and expense of improving that part.
I would not recommend any wholesale kind of rewrite, refactor, reformat, or putting in of unit tests that is not guided by actual business or end-user need.
If you do get the opportunity to fix something, then do it right (the chance of doing it right the first time might have already passed, but since you are touching that part again might as well do it right time around) and this includes all the items you mentioned.
So in summary, there's no single or just a few things that you should do. You should do it all but in small portions and in an opportunistic manner.
Late to the party, but the following may be worth doing where a function/method is used or referenced often:
Local variables often tend to be poorly named in legacy code (often owing to their scope expanding when a method is modified, and not being updated to reflect this). Renaming these in line with their actual purpose can help clarify legacy code.
Even just laying out the method slightly differently can work wonders - for instance, putting all the clauses of an if on one line.
There might be stale/confusing code comments there already. Remove them if they're not needed, or amend them if you absolutely have to. (Of course, I'm not advocating removal of useful comments, just those that are a hindrance.)
These might not have the massive headline impact you're looking for, but they are low risk, particularly if the code can't be unit tested.