How do you troubleshoot "works on my machine" scenarios? (debugging)

It often happens that when you report a bug to a developer, they come back saying "it works on my system", even though it's a browser app. How do you go about sorting that out?

From a training/process point of view:
Train your team to know that "works on my machine" is not a get-out-of-jail-free response.
Have an automated build server.
Have an automated test deployment.
Your developers must know that "works" is defined as "works on the test server", not just their machine.
From a testing/debugging point of view:
The developer needs to be shown the sequence of actions that result in the bug happening.
You might want to capture screenshots showing the bug, or possibly a video capture (using tools such as Camtasia). People can be quite bad at describing the sequence of actions they performed on a system that led to a bug showing itself, so the more information you can capture about the bug and how to replicate it, the better.
From a development/environment point of view:
If there is genuinely a bug that exhibits itself in one environment but not the developer's, then find out whether it fails to appear in all development environments or just that one developer's.
From there on it is a case of trying to reduce the differences between the two environments so that your developer can see the issue on his machine.
Or you can go the other way and attempt to debug the issue on the production (non-development) environment.
Implementation details of these vary by platform.

You need to give as much information to the developer as possible. Even stuff that you don't think is relevant.
I can't count the number of times I've had a problem reported and couldn't repeat it, only to find out later a piece of information that the user hadn't originally included but was the key to unlocking the puzzle.
You also need to not accept that answer and say "well, something must be different between your setup and mine; what can we do to sort it out?"

We deal with that problem by having a development environment, on top of the local development environments, that is as close to the production system as possible in terms of setup, hardware, etc. As a result, almost all problems that occur in the production environment are reproducible on that development system, even if they can't be reproduced on local developer machines.

This is a common escapist retort that I encounter from teams. My response usually is: "You know, your system isn't the production server and that's where it needs to work". In other words, that excuse simply isn't acceptable.
I also indicate to them the possibilities:
a. There is a configuration difference between the local system and the server.
b. Certain dependencies of the functionality are not updated on the server (a version/config fingerprint, like the sketch after this list, helps rule out a and b).
c. They haven't cleared their browser cache.
d. I replicate the problem on the Staging server and demonstrate it to them.
e. ... and so on, depending on the case.
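For possibilities (a) and (b), it helps to have something both the tester and the developer can run and then diff. A minimal sketch in Python, assuming a hypothetical app.cfg configuration file:

# Hypothetical helper: print a fingerprint of the running environment so two
# machines' outputs can be diffed line by line.
import hashlib
import platform
import sys
from importlib import metadata  # Python 3.8+

def environment_fingerprint(config_path="app.cfg"):  # assumed config file name
    lines = [
        "python=" + sys.version.split()[0],
        "os=" + platform.platform(),
    ]
    # Installed dependency versions: catches "not updated on the server" cases.
    for dist in sorted(metadata.distributions(), key=lambda d: (d.metadata["Name"] or "")):
        lines.append(f"{dist.metadata['Name']}=={dist.version}")
    # Hash of the config file: catches configuration drift without exposing secrets.
    try:
        with open(config_path, "rb") as fh:
            lines.append("config_sha256=" + hashlib.sha256(fh.read()).hexdigest())
    except FileNotFoundError:
        lines.append("config_sha256=<missing>")
    return "\n".join(lines)

if __name__ == "__main__":
    print(environment_fingerprint())

Run it on the developer's machine and on the server and diff the two outputs; configuration and dependency differences usually jump out immediately.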

Try to recreate the system of the user who found the bug as closely as possible: from server config to machine config, including browser, OS, and so on. You should probably have several different setups on which to test your app before releasing.

IE Tester is a good tool for this kind of troubleshooting. If you need to test lots of browsers then virtual machines like Virtual PC are your best bet so you can have many client set-ups on your test server.

Ahh yes... the oldest excuse in the book.
Assuming that both the developer and the tester are testing on the same server, I would try to isolate the bug by identifying the difference between the developer's machine and the tester's machine. It could be something minor like the Flash version, browser differences, or forgetting to clear the browser cache.
I would also recommend using an automated testing framework and test apps on a dedicated test server.

Not much you can do as an end user, but as a developer you can avoid a lot of these issues by including a lot of logging in the system. The differences the user will think of will just be the simple things you have tested already, but good logging lets you see exactly what was happening when the system failed. I've found quite a few bugs that "couldn't possibly happen" that way.
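As a rough illustration of the kind of logging that pays off (transfer_funds is just an invented example operation): record the inputs on the way in, and the full traceback with that context on failure.

import logging

logging.basicConfig(
    filename="app.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger(__name__)

def transfer_funds(user_id, amount):  # hypothetical operation
    log.info("transfer requested: user=%s amount=%s", user_id, amount)
    try:
        ...  # the real work would go here
    except Exception:
        # log.exception records the full traceback plus the context above,
        # which is exactly what a "works on my machine" report usually lacks.
        log.exception("transfer failed: user=%s amount=%s", user_id, amount)
        raise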

Related

Testing a wide variety of computers with a small company

I work for a small dotcom which will soon be launching a reasonably complicated Windows program. As the program has been passed around to the various non-technical types, we have uncovered a number of "WTF?"-type scenarios that we've been unable to replicate.
One of the biggest problems we're facing is that of testing: there are a total of three programmers -- only one working on this particular project, me -- no testers, and a handful of assorted other staff (sales, etc). We are also geographically isolated. The "testing lab" consists of a handful of VMWare and VPC images running sort-of fresh installs of Windows XP and Vista, which runs on my personal computer. The non-technical types try to be helpful when problems arise, we have trained them on how to most effectively report problems, and the software itself sports a wide array of diagnostic features, but since they aren't computer nerds like us their reporting is only so useful, and arranging remote control sessions to dig into the guts of their computers is time-consuming.
I am looking for resources that allow us to amplify our testing abilities without having to put together an actual lab and hire beta testers. My boss mentioned rental VPS services and asked me to look into them; however, they are still largely self-service and I was wondering if there were any better ways. How have you, or other companies in a similar situation, handled this sort of thing?
EDIT: According to the lingo, our goal here is to expand our systems testing capacity via an elastic computing platform such as Amazon EC2. At this point I am not sure suggestions of beefing up our unit/integration testing are going to help very much as we are consistently hitting walls at the systems testing phase. Has anyone attempted to do this kind of software testing on a cloud-type service like EC2?
Tom
The first question I would ask is whether you have any automated testing being done.
By this I mainly mean unit and integration testing. If not, then I think you need to look into unit testing immediately, first as part of your build process, and second via automated runs on servers. Even with a UI-based application, it should be possible to find software that can automate the actions of a user and tell you when a test has failed.
Apart from the tests you as developers can think of, every time a user finds a bug, you should be able to create a test for that bug, reproduce it with the test, fix it, and then add the test to the automated tests. This way if that bug is ever re-introduced your automated tests will find it before the users do. Plus you have the confidence that your application has been tested for every known issue before the user sees it and without someone having to sit there for days or weeks manually trying to do it.
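A minimal sketch of that bug-becomes-a-test workflow, in pytest style; parse_order_total and the reported scenario are invented for illustration:

# test_order_total.py: turn a bug report into a permanent regression test.
def parse_order_total(text):
    # Fixed version: strip thousands separators before converting.
    return float(text.replace(",", ""))

def test_reported_bug_totals_over_999():
    # The original (made-up) report: "1,250.00" came back as 1.0 instead of 1250.0.
    assert parse_order_total("1,250.00") == 1250.0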
I believe logging application activity and error/exception details is the most useful strategy to communicate technical details about problems on the customer side. You can add a feature to automatically mail you logs or let the customer do it manually.
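For example, in Python the standard library can mail error-level log records home with no extra infrastructure; the host and addresses below are placeholders:

import logging
import logging.handlers

mail_handler = logging.handlers.SMTPHandler(
    mailhost=("smtp.example.com", 25),
    fromaddr="app@example.com",
    toaddrs=["support@example.com"],
    subject="Application error report",
)
mail_handler.setLevel(logging.ERROR)  # only errors and exceptions get mailed

root = logging.getLogger()
root.addHandler(mail_handler)
root.setLevel(logging.INFO)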
The question is, what exactly do you mean to test? Are you only interested in error-free operation or are you also concerned how the software is accepted at the customer side (usability)?
For technical errors, write a log and manually test in different scenarios in different OS installations. If you could add unit tests, it could also help. But I suppose the issue is that it works on your machine but doesn't work somewhere else.
You could also debug remotely by using IDE features like "Attach to remote process", etc. I'm not sure how to do it if you're not in the same office; you'd likely need to set up a VPN.
If it's about usability, organize workshops. New people will be working with your application, and you will be video- and audio-recording them. Then analyze the problems they encountered in team "after-flight" sessions. Talk to users, ask what they didn't like, and act on it.
Theoretically, you could also build this activity logging into the application. You'll need to have a clear idea, though, of what to log and how to interpret the data.

Why should my development team have a build server?

We know this is good to have, but I find myself justifying it to my employer. Please pitch in on why a development team needs a build server.
There are multiple reasons to use build servers. In no particular order and off the top of my head:
You simplify the developers' workflow and reduce the chance of mistakes. Your build server can take care of multiple steps such as checking out the latest code, ensuring required software is installed, etc. There's no chance of a developer having some stray DLLs on their machine that can cause the build to pass or fail seemingly at random.
Your build server can replicate your target environment (operating system, etc.) and there's less of a chance of something working on developers' desktops and breaking in production.
While it's a good practice for developers to test everything they check in, sometimes they just don't. Then it's good to have the build server there to catch test errors and let the team know the product is broken.
Centralized builds provide easy access to code metrics -- which tests passed, which failed, how often, how well is your code covered by your tests, etc. Having a solid understanding of the quality state of the codebase reduces maintenance and testing costs by providing timely feedback that allows errors to be fixed quickly and easily.
Product deployment is simplified -- the developer or QA doesn't have to remember multiple manual steps. It can be easily automated.
The link between developers and QA is simplified. QA personnel can go to a known location to grab the latest, properly versioned builds.
It's easy to set up builds for release branches, providing an extra safety net for products in their release stage, when making code changes must be done with extra care.
To avoid the "but it works on my box" issue.
Have a consistent, known environment where the software is built to avoid dependencies on local dev boxes.
You can use a virtual server to avoid (much) extra cost if you need to.
ASAP knowledge of which unit tests are currently passing and which are not; furthermore, you'll also know if a once-passing unit test starts to fail.
This should sum up why it is critical to have a build server:
http://www.codinghorror.com/blog/2006/10/the-build-server-your-projects-heart-monitor.html
It's a continuous quality test dashboard; it shows you statistics about how the quality of your software is doing, and it shows them to you now. (JUnit, Cobertura)
It makes sure developers aren't hamstrung by other developers breaking the build, and encourages developers to write better code. (FindBugs, PMD)
It saves you time and money throughout the year by getting better code from developers the first time - less money on testing and retesting - and by getting more code from the same developers, because they're less likely to trip each other up.
Two main reasons that non technical people can relate to:
It improves the productivity of the dev team because problems are identified earlier.
It makes the state of the project very obvious. I've shown my management the build status dashboard and now they look at it all the time.
One more thing. Something like Hudson is very simple to set up - you might want to simply run it somewhere in a corner for a while and then show it later.
This is my principal argument:
all official releases must be built in a controlled environment. No exceptions.
simply because you never know how the developers create their personal releases.
You also don't need to talk about a build server as in "blade that costs an arm and a leg". The first build server I set up was a desktop machine that sat unplugged in a corner. It served us very well for more than 3 years.
Once you have your build machine, you can start adding some features (Hudson is great) and implement everything that the other posters mentioned.
Once your build machine becomes indispensable to your organization (and everyone sees its benefits), you will be able to ask for a shiny new blade if you wish :-)
The simplest thing you can do to convince your employer to have a build server is to tell them that they will be able to release faster and with better quality.
Faster releases come from the immediate feedback about the quality of the build. If someone breaks the build, he or she can fix it immediately, thus avoiding a delay in the build and release schedule. Without a build server the team will have to spend time trying to find out what happened, when it happened, and how to fix it.
Better quality is achieved by the build server running bug-detection tools automatically every time someone checks changes into the version control system. You don't mention the main development language in your organization, but such tools (both advanced commercial ones and simple free ones) exist for practically every language. Lint, FxCop, FindBugs and PMD come to mind.
You may also check this presentation on benefits of continuous integration for a more extensive discussion.
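As a very rough sketch, the job a build server runs on every check-in often boils down to something like this (flake8 and pytest are just example tools; substitute whatever fits your stack):

# A hypothetical build step: lint, then test, and fail loudly on any error.
import subprocess
import sys

STEPS = [
    ["flake8", "src"],          # static analysis / lint
    ["pytest", "--maxfail=1"],  # unit and integration tests
]

for step in STEPS:
    print("running:", " ".join(step))
    if subprocess.run(step).returncode != 0:
        sys.exit(1)  # non-zero exit fails the build and notifies the team
print("build OK")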

Different methodologies for solving bugs that only occur in production

As someone who is relatively new to the whole support and bug-fixing environment, and a young programmer, I had never come across a bug that only occurs in the Websphere environment but not in the localhost test environment, until today. When I first got this bug report I was confused as to why I couldn't reproduce it on the localhost test environment. I decided to try the Websphere test environment to see what would happen, and I successfully reproduced the bug. The problem is I can't make changes and build to the Websphere test environment; I can only make changes to my local environment. Given this handicap, what methodologies exist for resolving these kinds of bugs? Or are there even any methodologies at all? Any advice or help on how to approach issues like this?
Campaign for full access to a test environment. Being able to tweak things, redeploy and retry makes a huge difference. It's entirely reasonable to explain how not having access severely restricts your ability to do your job.
Make sure you've got sufficient logging, and make it configurable. Make sure you keep the logs for long enough to track down a problem reported by a customer even if it happened a few days ago.
When you finally diagnose a problem and why it only happens in a particular environment, document it and try to persuade your local system to behave the same way - that should make it easier to diagnose another symptom of the same problem next time.
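A minimal sketch of that kind of logging setup in Python: the level comes from an environment variable (APP_LOG_LEVEL here is an assumed name) so it can be raised without a redeploy, and two weeks of history are kept so last week's report can still be traced.

import logging
import logging.handlers
import os

handler = logging.handlers.TimedRotatingFileHandler(
    "app.log", when="midnight", backupCount=14  # keep two weeks of logs
)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s")
)

root = logging.getLogger()
root.addHandler(handler)
root.setLevel(os.environ.get("APP_LOG_LEVEL", "INFO"))  # configurable at runtime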
In short, the methodology is to isolate and understand the differences between environments and which one or ones might be causing the issue.
Check your local build. Make sure it is the same version (code and database) as Test and Prod. If it is, what environment differences could affect the issue you are seeing (multi-core, load balancing, operating system version, 3rd-party library version)? Don't run locally in the debugger; make sure you're running a release build (if that's what is on Test and Prod), and maybe even do a local deployment rather than building from source.
Check to see if it is particular data that might be causing the problem. If you can, pull a copy of the database back from Test onto Local and see if that enables you to repro the problem.
Check with other developers. See if they can repro the issue in their environment. Check with your QA folks and get their thoughts on what might be causing such an issue (oftentimes they will have seen "similar" issues and might give you a clue).
At that point, if nothing helps, I generally go into a deep state of zen to try and understand what I am missing. But, there must be a difference, you just have to find it.
The scientific method always applies: check your assumptions first. If the systems are different, the problem might reside in some sort of implicit default being different, or a different implementation of some function.
In all debugging processes, localization is the key. You gotta isolate the area of the problem first. If your OS, patches, libraries, and the main software itself are all identical, then it's probably the system settings (limits for sockets, file descriptors, etc). If you know you have enough inodes, space and memory left, then it's not a resource issue. If the computer is barely responsive to your interactive prodding, your load is too high, or you have some runaway processes. Remember what every process needs to run, and make sure they got what they need.
It can also be that the code simply can't deal with the load of the production system. Locking mechanisms are a very popular cause of problems in production vs. dev/test systems, simply because you can't generate enough of the test cases that you get for free in production.
Logging can easily be overlooked, but I also like to add a lot of debug values to the code to make debugging easier. I can't even count how many times some environment variable, path, or broken symlink has ruined my day, only to realize it would have been a trivial fix if I had looked at the values of the variables at runtime instead of just reading the static code.
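A small sketch of dumping exactly those runtime values; the variable and path names are illustrative only:

import os
from pathlib import Path

def dump_runtime_state(paths=("/etc/myapp.conf", "data/cache")):  # example paths
    for name in ("PATH", "LD_LIBRARY_PATH", "LANG", "TMPDIR"):
        print(f"{name}={os.environ.get(name, '<unset>')}")
    for p in map(Path, paths):
        if p.is_symlink() and not p.exists():
            print(f"{p}: BROKEN SYMLINK -> {os.readlink(p)}")
        else:
            print(f"{p}: exists={p.exists()} resolved={p.resolve()}")

if __name__ == "__main__":
    dump_runtime_state()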
If all else fails, ltrace and strace are the best way to really look at what's going on under the hood. They're not easy to read, but once you get used to spotting and interpreting the return values of syscalls, you gain a very powerful debugging tool.

Software tools for recording non-reproducible bugs

Obviously the non-reproducible bugs are the hardest to fix due to the nature of their cause (i.e. race conditions), so we as programmers must do our best to gather data (i.e. logs, screenshots, etc.) and verify the bug documentation is accurate in an attempt to understand what happened. Can anyone recommend any software tools, or methods, that can record and reconstitute the actual executed sequence of machine instructions while allowing the user to step through them and inspect the code?
If it helps, the project I'm building is a windows application written in C++ and uses VS2005.
Thanks in advance for all your help!
'Time-machine' / replay debugging is very helpful for debugging the type of issues you describe,
e.g. the Green Hills Time Machine debugger.
I have not used this myself, but it sounds like it might be useful for the type of project you are building: VMWare replay debugging.

"Works on my machine" - How to fix non-reproducible bugs?

Very occasionally, despite all testing efforts, I get hit with a bug report from a customer that I simply can't reproduce in the office.
(Apologies to Jeff for the 'borrowing' of the badge)
I have a few "tools" that I can use to try to locate and fix these, but it always feels a bit like I'm knife-and-forking it:
Asking for more and more context from the customer (systeminfo)
Log files from our application
Ad-hoc tests with the customer to attempt to change the behaviour
Providing customer with a new build with additional diagnostics
Thinking about the problem in the bath...
Site visit (assuming customer is somewhere warm and sunny)
Are there set procedures, or other techniques than anyone uses to resolve problems like this?
One of the attributes of good debuggers, I think, is that they always have a lot of weapons in their toolkit. They never seem to get "stuck" for too long and there is always something else for them to try. Some of the things I've been known to do:
ask for memory dumps
install a remote debugger on a client machine
add tracing code to builds
add logging code for debugging purposes
add performance counters
add configuration parameters to various bits of suspicious code so I can turn on and off features
rewrite and refactor suspicious code
try to replicate the issue locally on a different OS or machine
use debugging tools such as application verifier
use 3rd party load generation tools
write simulation tools in-house for load generation when the above failed
use tools like Glowcode to analyse memory leaks and performance issues
reinstall the client machine from scratch
get registry dumps and apply them locally
use registry and file watcher tools
Eventually, I find the bug just gives up out of some kind of awe at my persistence. Or the client realises that it's probably a machine or client side install or configuration issue.
Extensive logging usually helps.
The easiest way is always to see the customer in action (assuming that it's readily reproducible by the customer). Oftentimes, problems arise due to issues with the customer's computer environment, conflicts with other programs, etc.; these are details which you will not be able to catch on your dev rig. So a site visit might be useful, but if that's not convenient, tools like RealVNC might help as well in letting you see the customer 'do their thing'.
(watching the customer in action also allows you to catch them out in any WTF moments that they might have)
Now, if the problem is intermittent, then things get somewhat more complicated. The best way to get around this problem would be to log useful information in places where you guess problems could occur and perhaps use a tool like Splunk to index the log files during analysis. A diagnostic build (i.e. with extra logging) might be useful in this case.
I'm just in the middle of implementing an automated error reporting system that sends information back to me (currently via email, although you could use a web service) about any exception encountered by the app.
That way I get (nearly) all the information I would have if I were sitting in front of VS2008, and it really helps me work out what the problem is.
The customers are also usually (sorta) impressed that I know about their problem as soon as they encounter it!
Also, if you use the Application.ThreadException error handler you can send back info on unexpected exceptions too!
We use all the methods you mention progressively starting with the easiest and proceeding to the harder.
However, you forget that sometimes hardware is at fault. For example, memory could be malfunctioning and some computation-intensive code will behave strangely, throwing exceptions with weird diagnostics. Of course it works on your machine, since you don't have faulty hardware.
Experience is needed to identify such errors and insist that the customer tries to install the program on another machine or runs a hardware check. One thing that helps greatly is good error handling: when your code throws an exception it should provide details, not just indicate that something is bad. With good error indication it's easier to identify such suspicious issues related to faulty hardware.
I think one of the most important things is the ability to ask sensible questions around what the customer has reported... More often than not they omit something that they don't see as relevant but that is actually key.
Telepathy would also be useful...
We've had good success using EurekaLog with it posting directly to FogBugz. This gets us a bug report containing a call stack, along with related system info (other processes running, memory, network details etc) and a screen shot. Occasionally customers enter further info too, which is helpful. It's certainly, in most cases, made it much easier and quicker to fix bugs.
One technique I've found useful is building an application with an integrated "diagnostic" mode (enabled by a command-line switch when you launch the app). That certainly avoids the need to create custom builds with additional logging.
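A rough sketch of such a switch in Python (the flag name and log file are assumptions, not a prescription):

import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument("--diagnostic", action="store_true",
                    help="enable verbose diagnostic logging")
args = parser.parse_args()

if args.diagnostic:
    # Same build, extra visibility: everything goes to a separate file.
    logging.basicConfig(filename="diagnostic.log", level=logging.DEBUG,
                        format="%(asctime)s %(levelname)s %(name)s %(message)s")
else:
    logging.basicConfig(level=logging.WARNING)

logging.getLogger(__name__).debug("diagnostic mode active")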
Otherwise, it sounds like what you're doing is as good an approach as any.
Copilot (assuming customer is somewhere cold and rainy :)
The usual procedure for this is to expect that something like this will happen and add a ton of logging. Of course you don't enable it from the beginning, but only when this happens.
Usually customers don't like having to install a new version or some diagnostic tools; it is not their job to do your debugging. And visiting a client for cases like these is rarely an option. You must involve the client as little as possible. Changing a switch and sending you a log file is OK; anything more than that is too much.
I like the alternative of thinking about the problem in the bath. I would start by trying to find the differences between my machine and the client's configuration.
As a software engineer doing webstuff (booking/shop/member systems etc) the most important thing for us is to get as much information from the customer as possible.
Going from
"it's broke!"
to
"it's broke! & here are screenshots of every option I picked whilst generating this particular report"
reduces the amount of time it takes us to reproduce and fix an issue no end.
It may be obvious, but it takes a fair amount of chasing to get this kind of information from our customers sometimes! But it's worth it just for those moments you find they're not actually doing what they say they are.
I had these problems also. My solution was to add lots of logging and give the customer a debug build with all the possible debug information. Then make sure Dr. Watson (it was on Windows NT) created a memory dump with enough information.
After loading the memory dump in the debugger I could find out where and why it crashed.
EDIT: Oh, this obviously only works if the application terminates violently...
I think following the trail of the actions the user took can lead us to the reason for the failure, or for selective failures. But most of the time users are at a loss to precisely describe their interactions with the application, so automatic screenshot taking helps (if it is a desktop app; for a .NET app you can check Jeff's UnhandledExceptionHandler). Logging all the important actions which change the state of objects can also help us understand it.
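A sketch of recording that trail of actions in Python, so a crash report includes the sequence the user can rarely describe; record_action would be called from your UI event handlers:

import collections
import logging
import sys
import traceback

logging.basicConfig(filename="crash.log", level=logging.INFO)
_recent_actions = collections.deque(maxlen=50)  # keep the last 50 actions

def record_action(description):
    _recent_actions.append(description)

def _crash_handler(exc_type, exc, tb):
    logging.error("Unhandled exception:\n%s",
                  "".join(traceback.format_exception(exc_type, exc, tb)))
    logging.error("Last actions before the crash: %s", list(_recent_actions))

sys.excepthook = _crash_handler  # catches otherwise-unhandled exceptions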
I don't have this problem very often, but if I did, I would use a screen sharing or recorded application to watch the user in action without having to go there (unless, as you said, it's warm and sunny and the company pays the trip).
I have recently been investigating such an issue myself. Over the course of my career I have learnt that, while computer systems may be complex, they are predictable, so have faith that you can find the problem. My approach to these kinds of issues is two-fold:
1) Gather as much detailed information as possible from the customer about their failure and analyse it meticulously for patterns. Gather multiple sets of data for multiple failure occurrences to build up a clearer picture.
2) Try to reproduce the failure in-house. Continue to make your system more and more similar to the customer's system until you can reproduce it, the systems are identical, or it becomes impractical to make them more similar.
While doing this consider:
1) What differences exist between this system and other, working systems?
2) What has recently changed in your product or the customer's configuration that could have caused the problem to start occurring?
Regards
Depending on the issue, you could get WinDbg dumps; they normally give a pretty good idea of what is going on. We have diagnosed quite a few problems that weren't crashes from minidumps.
For .NET apps we also use Trace.WriteLine, and then we can get the user to fire up DbgView and send us the output.
It's a very complicated issue. I've been thinking about writing a procedure for this, and I've put together one for non-reproducible bugs; it might be helpful.
When the bug occurred, several factors might have caused it.
I am sure all bugs are reproducible; I always keep an eye out for this kind of issue.
Get the system information.
Find out what else the customer did before it happened.
Note the time period in which it occurs, and whether it is rare or frequent.
Note what happened next, after the issue (is it always the same or different?).
As the developer, find the factors behind the bug:
Find the exact place in the code where the issue happened.
Find all the system factors at that time.
Check for memory leaks, user errors, or a wrong condition in the code.
List all the factors that may cause the issue.
Work out how each factor affects it and what data those factors hold.
Check whether memory issues happened.
Check that the customer is running the same, up-to-date code as you.
Check all logs from at least the past month, look for any abnormal operations, and keep notes.
Just a short anecdote (hence 'community wiki'): Last week I thought it was a clever idea in a Django app to import the module pprint for pretty printing Python data only if DEBUG was True:
if settings.DEBUG:
    from pprint import pprint
Then I used the pprint command here and there as a debugging statement:
pprint(somevar) # show somevar on the console
After finishing the work, I tested the app with DEBUG=False. You can guess what happened: the site broke with HTTP 500 errors all over the place, and I did not know why, because there is no traceback if DEBUG is False. I was puzzled that the errors disappeared magically if I switched back to debug mode.
It took me 1-2 hours of putting print statements all over the code to find that the code crashes at exactly the above pprint() line. Then it took me another half an hour to convince myself to stop banging my head on the table.
Now comes the moral of the story:
Not everything that looks like a clever idea at first glance is so savvy in the end.
An important area to look at when debugging these errors is all the configuration options and platform switches your code makes by itself. This can be quite a lot more than just some user preferences. Document it well if you make an assumption about the user's platform (e.g., if you test only on Win/Mac/Linux, will your code crash on BSD or Solaris?).
Cheers,
However tough a non-reproducible problem is, we can still take a structured and strategic approach to solving it, and I can say from experience that it requires out-of-the-box thinking in 50% of cases. Generally speaking, one can categorize the problems into different types, which helps to identify what tool to use. For example, if you have a non-reproducible application crash or a memory issue, you can use profilers to nail the issue down to the particular functionality.
Also, one of the most important approaches is information-rich logging. I use a lot of enums to describe the state of the process, depending on the scenario in question. For example, I used states like Initiated, Triggered, Running, Waiting, and Repaired to describe schedule states and saved them to the DB at different stages.
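A sketch of that enum-plus-persistence idea in Python; the table and column names are invented for illustration:

import enum
import sqlite3
from datetime import datetime, timezone

class ScheduleState(enum.Enum):
    INITIATED = "initiated"
    TRIGGERED = "triggered"
    RUNNING = "running"
    WAITING = "waiting"
    REPAIRED = "repaired"

def save_state(conn, schedule_id, state):
    # Persisting every transition gives you a timeline to read back when a
    # customer reports a failure you cannot reproduce.
    conn.execute(
        "INSERT INTO schedule_states (schedule_id, state, recorded_at) VALUES (?, ?, ?)",
        (schedule_id, state.value, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()

conn = sqlite3.connect("states.db")
conn.execute("CREATE TABLE IF NOT EXISTS schedule_states (schedule_id, state, recorded_at)")
save_state(conn, 42, ScheduleState.RUNNING)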
Not mentioned yet, but "directed code review" is one good solution, especially if you didn't do a proper review (at least 1 hour per 100 lines of code) before release.
I have also seen impressive demos of AppSight Suite, which is basically an advanced environment monitoring and logging tool. It allows the customer to record what happens on his machine in an extensive but fairly compact log file which you can then replay.
As many have mentioned, extensive logging, and asking the client for the log files when something goes wrong. In addition, as I've worked more with web apps, I'll also provide detailed but succinct deployment documentation (e.g., deployment steps, environmental resources that need to be set up, etc.).
Here are common problems I've seen that lead to the types of problem you are describing:
Environment not set up properly (e.g., missing environment variables, data sources etc).
Application not fully deployed (e.g., database schema not deployed).
Difference in operating system configuration (default character encoding being the most common culprit for me).
Most of the time, these issues can be identified through the log content.
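A sketch of a start-up sanity check covering the three problems above; the environment variable names and the schema check are placeholders for whatever your deployment actually needs:

import locale
import logging
import os

log = logging.getLogger("deployment-check")

def check_deployment():
    # 1. Environment set up properly?
    for var in ("DATABASE_URL", "CACHE_HOST"):  # example variables
        if var not in os.environ:
            log.error("missing environment variable: %s", var)
    # 2. Application fully deployed? e.g. query a schema_version table here
    #    and log the result (left as a stub in this sketch).
    # 3. OS configuration differences, such as the default character encoding.
    log.info("default character encoding: %s", locale.getpreferredencoding())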
You can use tools like Microsoft SharedView or TeamViewer to connect to remote PC and inspect problem directly on site. Of course, you'll need cooperation with customer.
