How are hardware-specific bugs and features tested in the Linux kernel?

How do kernel developers test changes on very specific hardware? If they don't have that particular type of hardware, how do they figure out where the issue is happening?

Generally, the kernel developer responsible for writing the driver for a specific piece of hardware is also the tester and the owner of that hardware.
If that developer comes across a bug (or if people report a bug) on that hardware, he is usually the one who fixes it. It would be silly to ask someone to fix or maintain a driver for hardware he does not own.
As to higher-level kernel features, this is where the Linux development paradigm comes into play. Linux uses a merge-window release strategy, since it is unrealistic to ask every developer to test his changes on every possible combination of hardware and software.
Instead, after a major release, a merge window opens for around two weeks during which developers can push in significant changes, new drivers, and features. After the merge window closes, no more patches for new changes are accepted.
During the time when the merge window is closed, developers have a chance to test the newly merged code on their own hardware and report bugs and breakages. Usually, the developer who finds a bug works together with the person who introduced the change to fix it, and the fix can then be sent in for inclusion in the next release candidate.
A new release candidate is released roughly weekly. When Linus is happy that most of the bugs introduced in this cycle have been fixed, he makes a major release and the cycle starts again.
In other words, the answer to your question is that they don't. No single developer is responsible for everything, nor is anyone expected to fix problems for other people. A developer is only responsible for the part of the kernel that he maintains.

I do maintain drivers for some hardware which I do not have (and which no other kernel developer has).
Changes in such drivers must be done very carefully:
Sometimes it is possible to test the change on similar hardware that is available.
When an issue is reported by a user, it might be possible to ask that user to test the change, or to find some other user with the same hardware.
When an actual bug (security related or otherwise) must be fixed, the best solution might be to apply the change untested, and hope for the best.
Otherwise, just don't make the change.

Related

How come the Windows OS hasn't been decompiled?

As far as I'm aware, Windows hasn't been decompiled by anyone yet. Obviously it's complicated, but surely it should've been done by now to some degree?
My thinking behind this is that if the end-user has access to the software, and the computer is able to run it, then even an obfuscated version of it must be obtainable?
I'm obviously missing something, I'm just not sure what.
There's nothing preventing Windows from being decompiled (apart from the EULA and similar legal bindings, of course). As you noted, the code must run on the CPU at some point, the CPU must read the code from memory, and you can read from memory too. Some parts of the system can be a bit trickier, since to run the OS you need to give the OS some exclusive privileges (that's how most modern protected OSes work), but it's nothing that can't be worked around. In any case, there's not a lot of effort spent on preventing Windows decompilation - that would have barely any benefit, while making debugging, error reporting, and the like harder. Microsoft even goes so far as to provide a special debug version of Windows that's specifically tailored for software development.
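To make the point that "you can read the code from memory too" concrete, here is a tiny C sketch (hedged: casting a function pointer to a data pointer is non-standard C, though it works on mainstream platforms): it prints the first machine-code bytes of its own main function, which is exactly the raw material a disassembler starts from.

    #include <stdio.h>

    /* Print the first bytes of main's machine code: the program's own
     * instructions are just readable bytes in memory. */
    int main(void) {
        const unsigned char *p = (const unsigned char *)main;
        for (int i = 0; i < 16; i++)
            printf("%02x ", p[i]);
        printf("\n");
        return 0;
    }

Feeding those bytes to a disassembler gives you back assembly; decompilation is the (much harder) step of recovering structured source from that.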
The main point is that there's little reason to decompile Windows. What practical use would such a massive effort have? And if you're a corporation that needs access to the Windows source code (for example, when developing embedded solutions), you can get it. Just because Windows isn't open source doesn't mean the sources aren't available.
If you're not someone who needs their own version of Windows (common in the times of Windows CE), there's even less of a reason to decompile Windows. You need to stick to the defined public APIs anyway - that's a good practice regardless of whether the software is open source or not. APIs are contracts - implementation details you'd get through decompiling aren't. They might very well change with the next security hotfix or such. This is especially important given how serious Windows is about compatibility - it's quite rare for an update (or even a new major release) to break compatibility with old software.
So, if you want to decompile Windows, there's nothing technical that's really preventing you from doing so. But you're looking at tens of millions of lines of source code that was compiled by very smart compilers, with bits of handwritten optimised assembly thrown around, tons of compatibility workarounds that might as well be outright obfuscation (remember, you don't get the comments - just the actually compiled code). Are you willing to spend a few hundred thousand hours to satisfy your idle curiosity? :P

Software patching at a billion miles

Could someone here shed some light about how NASA goes about designing their spacecraft architecture to ensure that they are able to patch bugs in the deployed code?
I have never built any “real time” type systems and this is a question that has come to mind after reading this article:
http://pluto.jhuapl.edu/overview/piPerspective.php?page=piPerspective_05_21_2010
“One of the first major things we’ll do when we wake the spacecraft up next week will be uploading almost 20 minor bug fixes and other code enhancements to our fault protection (or “autopilot response”) software.”
I've been a developer on public telephone switching system software, which has pretty severe constraints on reliability, availability, survivability, and performance that approach what spacecraft systems need. I haven't worked on spacecraft (although I did work with many former shuttle programmers while at IBM), and I'm not familiar with VxWorks, the operating system used on many spacecraft (including the Mars rovers, which have a phenomenal operating record).
One of the core requirements for patchability is that a system should be designed from the ground up for patching. This includes module structure, so that new variables can be added, and methods replaced, without disrupting current operations. This often means that both old and new code for a changed method will be resident, and the patching operation simply updates the dispatching vector for the class or module.
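A minimal sketch of that dispatching-vector idea in C (all names are invented for illustration, and a real system would also have to make the pointer swap safe with respect to in-flight calls): callers reach every replaceable method through a table of function pointers, so a patch is a single pointer update while the old code stays resident.

    #include <stdio.h>

    typedef int (*handler_fn)(int);

    static int handler_v1(int x) { return x + 1; }  /* original, buggy  */
    static int handler_v2(int x) { return x + 2; }  /* patched version  */

    /* The module's dispatch vector: one slot per replaceable method. */
    static handler_fn dispatch[] = { handler_v1 };

    static int call_handler(int slot, int arg) {
        return dispatch[slot](arg);  /* every call goes through the table */
    }

    int main(void) {
        printf("before patch: %d\n", call_handler(0, 41));  /* 42 */
        dispatch[0] = handler_v2;    /* the "patch": one pointer update */
        printf("after patch:  %d\n", call_handler(0, 41));  /* 43 */
        return 0;
    }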
It is just about mandatory that the patching (and un-patching) software is integrated into the operating system.
When I worked on telephone systems, we generally used the patching and module-replacement functions in the system to load and test our new features as well as bug fixes, long before these changes were submitted for builds. Every developer needs to be comfortable with patching and replacing modules as part of their daily work. It builds a level of trust in these components, and makes sure that the patching and replacement code is exercised routinely.
Testing is far more stringent on these systems than anything you've ever encountered on any other project. Complete and partial mock-ups of the deployment system will be readily available. There will likely be virtual machine environments as well, where the complete load can be run and tested. Test plans at all levels above unit test will be written and formally reviewed, just like formal code inspections (and those will be routine as well).
Fault tolerant system design, including software design, is essential. I don't know about spacecraft systems specifically, but something like high-availability clusters is probably standard, with the added capability to run both synchronized and unsynchronized, and with the ability to transfer information between sides during a failover. An added benefit of this system structure is that you can split the system (if necessary), reload the inactive side with a new software load, and test it in the production system without being connected to the system network or bus. When you're satisfied that the new software is running properly, you can simply failover to it.
As with patching, every developer should know how to do failovers, and should do them both during development and testing. In addition, developers should know every software update issue that can force a failover, and should know how to write patches and module replacement that avoid required failovers whenever possible.
In general, these systems are designed from the ground up (hardware, operating system, compilers, and possibly programming language) for these environments. I would not consider Windows, Mac OS X, Linux, or any Unix variant to be sufficiently robust. Part of that is the realtime requirements, but the whole issue of reliability and survivability is just as critical.
UPDATE: As another point of interest, here's a blog by one of the Mars rover drivers. This will give you a perspective on the daily life of maintaining an operating spacecraft. Neat stuff!
I've never built real-time systems either, but I suspect such systems would not have a memory protection mechanism. They do not need one, since they wrote all of their software themselves. Without memory protection, it is trivial for one program to write to another program's memory, and this can be used to hot-patch a running program (writing self-modifying code was a popular technique in the past; without memory protection, the same techniques used for self-modifying code can be used to modify another program's code).
Linux has been able to do minor kernel patching without rebooting for some time with Ksplice. This is necessary in situations where any downtime can be catastrophic. I've never used it myself, but I think the technique they use is basically this:
Ksplice can apply patches to the Linux kernel without rebooting the computer. Ksplice takes as input a unified diff and the original kernel source code, and it updates the running kernel in memory. Using Ksplice does not require any preparation before the system is originally booted (the running kernel does not need to have been specially compiled, for example). In order to generate an update, Ksplice must determine what code within the kernel has been changed by the source code patch. Ksplice performs this analysis at the ELF object code layer, rather than at the C source code layer.
To apply a patch, Ksplice first freezes execution of a computer so it is the only program running. The system verifies that no processors were in the middle of executing functions that will be modified by the patch. Ksplice modifies the beginning of changed functions so that they instead point to new, updated versions of those functions, and modifies data and structures in memory that need to be changed. Finally, Ksplice resumes each processor running where it left off.
(from Wikipedia)
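For illustration only, here is a userspace toy version of the redirection step that quote describes, sketched in C for x86-64 Linux (the function names are invented, it must be compiled with gcc -O0, W^X hardening may block the mprotect call on some systems, and it glosses over everything that makes real kernel patching hard, such as quiescing the other CPUs):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical old (buggy) and new (fixed) versions of a function. */
    __attribute__((noinline)) static int old_handler(int x) { return x + 1; }
    __attribute__((noinline)) static int new_handler(int x) { return x + 2; }

    int main(void) {
        long page = sysconf(_SC_PAGESIZE);
        void *start = (void *)((uintptr_t)old_handler & ~((uintptr_t)page - 1));

        /* Make the code pages writable; normally they are read+execute only. */
        if (mprotect(start, 2 * page, PROT_READ | PROT_WRITE | PROT_EXEC) != 0)
            return 1;

        /* x86-64 only: overwrite old_handler's prologue with
         * "mov rax, new_handler; jmp rax" so every future call is redirected. */
        unsigned char jmp[12] = { 0x48, 0xB8, 0,0,0,0,0,0,0,0, 0xFF, 0xE0 };
        uintptr_t target = (uintptr_t)new_handler;
        memcpy(jmp + 2, &target, sizeof target);
        memcpy((void *)(uintptr_t)old_handler, jmp, sizeof jmp);

        printf("%d\n", old_handler(41));  /* prints 43: the patched code ran */
        return 0;
    }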
Well I'm sure they have simulators to test with and mechanisms for hot-patching. Take a look at the linked article below - there's a pretty good overview of the spacecraft design. Section 5 discusses the computation machinery.
http://www.boulder.swri.edu/pkb/ssr/ssr-fountain.pdf
Of note:
Redundant processors
Command switching by the uplink card that does not require processor help
Time-lagged rules
I haven't worked on spacecraft, but the machines I've worked on have all been built to have a stable idle state where it's possible to shut down the machine briefly to patch the firmware. The systems that have accommodated 'live' updates are those that were broken into interacting components, where you can bring down one segment of the system long enough to update it and the other components can continue operating as normal, as they can tolerate the temporary downtime of the serviced component.
One way you can do this is to have parallel (redundant) capabilities, such as parallel machines that all perform the same task, so that the process can be routed around the machine under service. The benefit of this approach is that you can bring it down for longer periods for more significant service, such as regular hardware preventative maintenance. Once you have this capability, supporting downtime for a firmware patch is fairly easy.
One of the approaches that's been used in the past is to use LISP.

How do you fix a bug you can't replicate?

The question says it all. If you have a bug that multiple users report, but there is no record of the bug occurring in the log, nor can the bug be repeated, no matter how hard you try, how do you fix it? Or even can you?
I am sure this has happened to many of you out there. What did you do in this situation, and what was the final outcome?
Edit:
I am more interested in what was done about an unfindable bug, not an unresolvable bug. Unresolvable bugs are such that you at least know that there is a problem and have a starting point, in most cases, for searching for it. In the case of an unfindable one, what do you do? Can you even do anything at all?
Language
Different programming languages will have their own flavour of bugs.
C
Adding debug statements can make the problem impossible to duplicate, because the debug statement itself shifts pointers far enough to avoid a segfault; such bugs are known as Heisenbugs. Pointer issues are arduous to track and replicate, but debuggers such as GDB and DDD can help.
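A minimal sketch of this kind of Heisenbug (the names and sizes are arbitrary): the overflow corrupts adjacent stack memory, and whether that actually crashes depends on a stack layout that can change the moment a debug statement and its temporaries are added. Compile with gcc -fno-stack-protector to observe the raw, layout-dependent behaviour.

    #include <stdio.h>
    #include <string.h>

    void greet(const char *name) {
        char buf[8];
        strcpy(buf, name);             /* overflows for names >= 8 chars   */
        /* printf("debug\n"); */       /* uncommenting may shift the stack */
        printf("hello %s\n", buf);     /* layout enough to hide the crash  */
    }

    int main(void) {
        greet("a-rather-long-name");
        return 0;
    }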
Java
An application that has multiple threads might only show its bugs with a very specific timing or sequence of events. Improper concurrency implementations can cause deadlocks in situations that are difficult to replicate.
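Sketched here in C with pthreads, to keep all the examples in one language (the same lock-ordering mistake produces deadlocks with Java's synchronized blocks): two threads take the same two locks in opposite order, so the program hangs only when the scheduler interleaves them just so, which is exactly why such bugs resist replication.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

    static void *worker1(void *arg) {
        (void)arg;
        pthread_mutex_lock(&a);
        usleep(1000);               /* widen the race window */
        pthread_mutex_lock(&b);     /* blocks if worker2 holds b */
        pthread_mutex_unlock(&b);
        pthread_mutex_unlock(&a);
        return NULL;
    }

    static void *worker2(void *arg) {
        (void)arg;
        pthread_mutex_lock(&b);
        usleep(1000);
        pthread_mutex_lock(&a);     /* blocks if worker1 holds a */
        pthread_mutex_unlock(&a);
        pthread_mutex_unlock(&b);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker1, NULL);
        pthread_create(&t2, NULL, worker2, NULL);
        pthread_join(t1, NULL);     /* usually never returns */
        pthread_join(t2, NULL);
        printf("no deadlock this run\n");
        return 0;
    }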
JavaScript
Some web browsers are notorious for memory leaks. JavaScript code that runs fine in one browser might cause incorrect behaviour in another browser. Using third-party libraries that have been rigorously tested by thousands of users can be advantageous to avoid certain obscure bugs.
Environment
Depending on the complexity of the environment in which the application (that has the bug) is running, the only recourse might be to simplify the environment. Does the application run:
on a server?
on a desktop?
in a web browser?
In what environment does the application produce the problem?
development?
test?
production?
Exit extraneous applications, kill background tasks, stop all scheduled events (cron jobs), eliminate plug-ins, and uninstall browser add-ons.
Networking
As networking is essential to so many applications:
Ensure stable network connections, including wireless signals.
Does the software reconnect after network failures robustly?
Do all connections get closed properly so as to release file descriptors?
Are people using the machine who shouldn't be?
Are rogue devices interacting with the machine's network?
Are there factories or radio towers nearby that can cause interference?
Do packet sizes and frequency fall within nominal ranges?
Are packets being monitored for loss?
Are all network devices adequate for heavy bandwidth usage?
Consistency
Eliminate as many unknowns as possible:
Isolate architectural components.
Remove non-essential, or possibly problematic (conflicting), elements.
Deactivate different application modules.
Remove all differences between production, test, and development. Use the same hardware. Follow the exact same steps, perfectly, to set up the computers. Consistency is key.
Logging
Use liberal amounts of logging to correlate the time events happened. Examine logs for any obvious errors, timing issues, etc.
Hardware
If the software seems okay, consider hardware faults:
Are the physical network connections solid?
Are there any loose cables?
Are chips seated properly?
Do all cables have clean connections?
Is the working environment clean and free of dust?
Have any hidden devices or cables been damaged by rodents or insects?
Are there bad blocks on drives?
Are the CPU fans working?
Can the motherboard power all components? (CPU, network card, video card, drives, etc.)
Could electromagnetic interference be the culprit?
And mostly for embedded:
Insufficient supply bypassing?
Board contamination?
Bad solder joints / bad reflow?
CPU not reset when supply voltages are out of tolerance?
Bad resets because supply rails are back-powered from I/O ports and don't fully discharge?
Latch-up?
Floating input pins?
Insufficient (sometimes negative) noise margins on logic levels?
Insufficient (sometimes negative) timing margins?
Tin whiskers?
ESD damage?
ESD upsets?
Chip errata?
Interface misuse (e.g. I2C off-board or in the presence of high-power signals)?
Race conditions?
Counterfeit components?
Network vs. Local
What happens when you run the application locally (i.e., not across the network)? Are other servers experiencing the same issues? Is the database remote? Can you use a local database?
Firmware
In between hardware and software is firmware.
Is the computer BIOS up-to-date?
Is the BIOS battery working?
Are the BIOS clock and system clock synchronized?
Time and Statistics
Timing issues are difficult to track:
When does the problem happen?
How frequently?
What other systems are running at that time?
Is the application time-sensitive (e.g., will leap days or leap seconds cause issues)?
Gather hard numerical data on the problem. A problem that at first appears random might actually have a pattern.
Change Management
Sometimes problems appear after a system upgrade.
When did the problem first start?
What changed in the environment (hardware and software)?
What happens after rolling back to a previous version?
What differences exist between the problematic version and the good version?
Library Management
Different operating systems have different ways of distributing conflicting libraries:
Windows has DLL Hell.
Unix can have numerous broken symbolic links.
Java library files can be equally nightmarish to resolve.
Perform a fresh install of the operating system, and include only the supporting software required for your application.
Java
Make sure every library is used only once. Sometimes application containers have a different version of a library than the application itself. This might not be possible to replicate in the development environment.
Use a library management tool such as Maven or Ivy.
Debugging
Code a detection method that triggers a notification (e.g., log, e-mail, pop-up, pager beep) when the bug happens. Use automated testing to submit data into the application. Use random data. Use data that covers known and possible edge cases. Eventually the bug should reappear.
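One possible shape for such a detection method, as a hedged C sketch (the log file name and the checked invariant are invented): a macro records a timestamp, the violated condition, and the source location the instant an assumption fails, so the next field occurrence leaves evidence behind instead of vanishing.

    #include <stdio.h>
    #include <time.h>

    static void bug_tripwire(const char *what, const char *file, int line) {
        FILE *f = fopen("bug_report.log", "a");  /* hypothetical log file */
        if (f) {
            fprintf(f, "%ld: invariant '%s' violated at %s:%d\n",
                    (long)time(NULL), what, file, line);
            fclose(f);
        }
        /* could also e-mail, page, or abort() here to force a core dump */
    }

    #define CHECK(cond) \
        do { if (!(cond)) bug_tripwire(#cond, __FILE__, __LINE__); } while (0)

    int main(void) {
        int queue_len = -1;          /* pretend this state got corrupted */
        CHECK(queue_len >= 0);       /* fires and logs the context       */
        return 0;
    }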
Sleep
It is worth reiterating what others have mentioned: sleep on it. Spend time away from the problem, finish other tasks (like documentation). Be physically distant from computers and get some exercise.
Code Review
Walk through the code, line-by-line, and describe what every line does to yourself, a co-worker, or a rubber duck. This may lead to insights on how to reproduce the bug.
Cosmic Radiation
Cosmic rays can flip bits. This is not as big a problem as it was in the past, thanks to modern error checking of memory. Software for hardware that leaves Earth's protection is subject to issues that simply cannot be replicated due to the randomness of cosmic radiation.
Tools
Sometimes, albeit infrequently, the compiler will introduce a bug, especially for niche tools (e.g. a C micro-controller compiler suffering from a symbol table overflow). Is it possible to use a different compiler? Could any other tool in the tool-chain be introducing issues?
If it's a GUI app, it's invaluable to watch the customer generate the error (or try to). They'll no doubt be doing something you'd never have guessed they were doing (not wrongly, just differently).
Otherwise, concentrate your logging in that area. Log almost everything (you can pull it out later) and get your app to dump its environment as well, e.g. machine type, VM type, encoding used.
Does your app report a version number, a build number, etc.? You need this to determine precisely which version you're debugging (or not!).
If you can instrument your app (e.g. by using JMX if you're in the Java world) then instrument the area in question. Store stats e.g. requests+parameters, time made, etc. Make use of buffers to store the last 'n' requests/responses/object versions/whatever, and dump them out when the user reports an issue.
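A sketch of that "last n requests" buffer in plain C (the sizes and names are illustrative; in the Java/JMX setting this would live behind an instrumented MBean): a fixed-size ring overwrites the oldest entries, and dumping it when a user reports an issue shows exactly what led up to the problem.

    #include <stdio.h>

    #define RING_SIZE 8

    static char ring[RING_SIZE][128];
    static unsigned ring_head;

    static void record_event(const char *msg) {
        snprintf(ring[ring_head % RING_SIZE], sizeof ring[0], "%s", msg);
        ring_head++;
    }

    static void dump_recent_events(void) {
        unsigned start = ring_head > RING_SIZE ? ring_head - RING_SIZE : 0;
        for (unsigned i = start; i < ring_head; i++)
            printf("%u: %s\n", i, ring[i % RING_SIZE]);
    }

    int main(void) {
        char buf[64];
        for (int i = 0; i < 20; i++) {
            snprintf(buf, sizeof buf, "request %d handled", i);
            record_event(buf);
        }
        dump_recent_events();        /* shows only the last 8 requests */
        return 0;
    }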
If you can't replicate it, you may fix it, but can't know that you've fixed it.
I've made my best explanation about how the bug was triggered (even if I didn't know how that situation could come about), fixed that, and made sure that if the bug surfaced again, our notification mechanisms would let a future developer know the things that I wish I had known. In practice, this meant adding log events when the paths which could trigger the bug were crossed, and metrics for related resources were recorded. And, of course, making sure that the tests exercised the code well in general.
Deciding what notifications to add is a feasibility and triage question. So is deciding how much developer time to spend on the bug in the first place. Neither can be answered without knowing how important the bug is.
I've had good outcomes (didn't show up again, and the code was better for it), and bad (spent too much time not fixing the problem, whether the bug ended up fixed or not). That's what estimates and issue priorities are for.
Sometimes I just have to sit and study the code until I find the bug. Try to prove that the bug is impossible, and in the process you may figure out where you might be mistaken. If you actually succeed in convincing yourself it's impossible, assume you messed up somewhere.
It may help to add a bunch of error checking and assertions to confirm or deny your beliefs/assumptions. Something may fail that you'd never expect to.
It can be difficult, and sometimes nearly impossible. But in my experience, you will sooner or later be able to reproduce and fix the bug if you spend enough time on it (whether that time is worth spending is another matter).
General suggestions that might help in this situation.
Add more logging, if possible, so that you have more data the next time the bug appears.
Ask the users if they can replicate the bug. If yes, have them replicate it while you watch over their shoulder, and hopefully you'll find out what triggers the bug.
Make random changes until something works :-)
Assuming you have already added all the logging that you think would help and it didn't... two things spring to mind:
Work backwards from the reported symptom. Think to yourself: "If I wanted to produce the symptom that was reported, what bit of code would I need to be executing, and how would I get to it, and how would I get to that?" D leads to C leads to B leads to A. Accept that if a bug is not reproducible, normal methods won't help. I've had to stare at code for many hours with these kinds of thought processes going on to find some bugs. Usually it turns out to be something really stupid.
Remember Bob's first law of debugging: if you can't find something, it's because you're looking in the wrong place :-)
Think. Hard. Lock yourself away, admit no interruptions.
I once had a bug where the evidence was a hex dump of a corrupt database. The chains of pointers were systematically screwed up. All the user's programs, and our database software, worked faultlessly in testing. I stared at it for a week (it was an important customer), and after eliminating dozens of possible ideas, I realised that the data was spread across two physical files and the corruption occurred where the chains crossed file boundaries. I realized that if a backup/restore operation failed at a critical point, the two files could end up "out of sync", restored to different time points. If you then ran one of the customer's programs on the already-corrupt data, it would produce exactly the knotted chains of pointers I was seeing. I then demonstrated a sequence of events that reproduced the corruption exactly.
Modify the code where you think the problem is happening, so extra debug info is recorded somewhere. When it happens next time, you will have what you need to solve the problem.
There are two types of bugs you can't replicate. The kind you discovered, and the kind someone else discovered.
If you discovered the bug, you should be able to replicate it. If you can't replicate it, then you simply haven't considered all of the contributing factors leading towards the bug. This is why whenever you have a bug, you should document it. Save the log, get a screenshot, etc. If you don't, then how can you even prove the bug really exists? Maybe it's just a false memory?
If someone else discovered a bug, and you can't replicate it, obviously ask them to replicate it. If they can't replicate it, then you try to replicate it. If you can't replicate it quickly, ignore it.
I know that sounds bad, but I think it is justified. The amount of time it will take you to replicate a bug that someone else discovered is very large. If the bug is real, it will happen again naturally. Someone, maybe even you, will stumble across it again. If it is difficult to replicate, then it is also rare, and probably won't cause too much damage if it happens a few more times.
You can be a lot more productive if you spend your time actually working, fixing other bugs and writing new code, than you will be trying to replicate a mystery bug that you can't even guarantee actually exists. Just wait for it to appear again naturally, then you will be able to spend all your time fixing it, rather than wasting your time trying to reveal it.
Discuss the problem, read code, often quite a lot of it. Often we do it in pairs, because you can usually eliminate the possibilities analytically quite quickly.
Start by looking at what tools you have available to you. For example, crashes on a Windows platform go to WinQual, so if this is your case you now have crash dump information. Do you have static analysis tools that spot potential bugs, runtime analysis tools, or profiling tools?
Then look at the input and output. Anything similar about the inputs in situations when users report the error, or anything out of place in the output? Compile a list of reports and look for patterns.
Finally, as David stated, stare at the code.
Ask the user to give you remote access to his computer and see everything yourself, or ask the user to make a short video of how he reproduces the bug and send it to you.
Sure, both are not always possible, but when they are, they may clarify some things. The common ways of finding bugs are still the same: isolating the parts that may cause the bug, trying to understand what's happening, and narrowing down the code that could cause it.
There are tools like gotomeeting.com, which you can use to share the screen with your user and observe the behaviour. There could be many potential problems, such as the amount of software installed on their machines, or some utility conflicting with your program. GoToMeeting is not the only solution, and such tools can suffer from timeouts and slow-internet issues.
Most of the time, software does not report correct error messages. For example, in the case of Java and C#, track every exception; don't catch everything, but keep a point where you can catch and log. UI bugs are difficult to solve unless you use remote desktop tools. And much of the time it could even be a bug in third-party software.
If you work on a real significant sized application, you probably have a queue of 1,000 bugs, most of which are definitely reproducible.
Therefore, I'm afraid I'd probably close the bug as WORKSFORME (Bugzilla) and then get on fixing some more tangible bugs. Or doing whatever the project manager decides to do.
Certainly making random changes is a bad idea, even if they're localised, because you risk introducing new bugs.

Bug tracking best practices [closed]

In my company, these rules apply:
Only testers are allowed to create issues.
Developers must e-mail a tester to have them create an issue.
Developers e-mail the technical lead to have him assign an issue to them, for issues they think they can resolve.
A developer cannot assign an issue to another developer (he must e-mail the technical lead).
If a developer's issue is blocked by another developer's code, she must solve this problem outside of the bug tracking system.
Only testers are allowed to close issues which are opened by themselves.
All assignments must go through technical lead so he can track issues.
Bugs that are not directly related to user interface are not entered into the system (must be resolved externally).
What bug tracking flow are you using? Does it work well for you?

We use BugZilla for bug tracking and there are rules like:
Anybody can report a bug, and every little change whatsoever should go through the bug-tracking system. If it is an enhancement to the product, the bug should be marked as an enhancement and the bug-tracking process should be followed.
Anybody can assign a bug to anybody else, which makes it easy to route an issue to someone else if a bug resides in somebody else's code. There may be circumstances when a bug needs to be fixed in more than one place, i.e., there is a dependency on somebody else's code being fixed first, after which the other person fixes his/her code. In those cases, the bug gets assigned to the person who needs to do the work first, and he/she then re-routes the bug to the appropriate person by re-assigning it.
If an issue appears in more than one place and the code behind it is different but the issue is apparently the same, the bug is cloned so that a separate track can be kept of all the changes.
Technical leads are responsible for prioritizing the bugs based on the demand of that particular fix.
Testers/QAEs are responsible for assigning a Severity to the bug i.e., Critical/Major/Minor etc.
All bugs go through bug-tracking system. Bugs coming from customers are classified separately by a custom flag to indicate a customer bug. Customer bugs are mostly in the older released builds and patches are created for them, therefore, those are kept separate.
This way we ensure that we keep track of all the changes simultaneously in our Source Control System (which is TFS btw) and the Bugzilla so that any changes can be traced back to the original code-change/owner if needed in the future.
Sounds pretty complicated. We are using roughly the following process:
Everyone in the company can open an issue ticket and assigns it to a department.
Every department has a "dispatcher" who checks the incoming tickets for validity and prioritizes them.
Depending on the department's practices, developers are assigned tickets for the current development cycle by the dispatcher, or they assign themselves the tickets, highest priority first.
When a ticket is solved, it goes back to whoever opened it. This person also performs all activities necessary afterwards, like informing customers.
All tickets are held in a software system that makes these tasks easy. If you get a ticket, you also get an e-mail notification.
This is a lightweight process that encourages developers to take responsibility for their issues.
Aside from this, we have several quality assurance measures in place for the process of changing anything in the software, regardless of the source and type of the change requests. This includes especially:
All code must be reviewed before it is checked into the source code management system. This includes GUI and database reviews by specialized reviewers if necessary.
Code must be tested thoroughly by the developer himself before checking it in.
After the monthly build, all changes have to be tested again to prevent problems that occur due to several changes affecting the same code.
The monthly build enters a "first customer phase" where it is only rolled out to a few customer systems. If this phase shows no previously undetected errors, the build is declared safe.
I've used a great number of issue tracking systems, including gnats (ugh!), Bugzilla (slightly less ugh), Trac, Jira, and now FogBugz. I like Trac most of all, but that's probably because I'm not the administrator on FogBugz and it's being sadly and horribly mis-used in its current incarnation.
Getting the workflow right is pretty crucial, and oddly enough it starts with deciding what to put in your bug tracker and how to label the things you put in there. As soon as you have a customer, all development teams really track three kinds of issues:
Problems noted by real customers (live bugs).
Problems with new software currently in development (dev bugs).
Things we want to do in the future (features).
Each of these three classes of issues have their own priorities, of course. A 'live bug' that's just a spelling error on a button may be a lot less important than a 'dev bug' that's blocking a publicly announced release, or gating other development, testing, etc.
The severity of an issue describes how horrible the side effects are. In my experience, this boils down to:
1. The program is ruining something: data, customers being billed incorrectly, the wrong medicine being dispensed. I once worked on a system where a software command retracted a hydraulic arm right through the middle of a serviceman. This is as bad as it gets.
2. The program is crashing and we don't have a work-around, but it's not ruining anything (other than being down) in the meantime. If the downtime results in something getting ruined, use severity #1.
3. The program is misbehaving, but we have an identified work-around that can actually be used.
4. The program is misbehaving in ways that are annoying but don't affect the results.
5. The program needs to be better in some well-defined way: easier to use, implement a new feature, run faster, etc.
Another problem that arises a lot in these systems is the concept of 'roles.' As applied to issue tracking systems, roles boil down to who is allowed to do what. Who gets to create issues? Who gets to change their status, who gets to reassign them to another user, who gets to close them, and so on.
In the small- to mid-size teams I've worked closely with, this general set of rules has worked well:
Anyone can create an issue. The creator can assign the issue to any (or most) recipients as it's being created. The default recipient is the Issue Triage team. Developers can note bugs they've found working on code this way, and assign the bug to themselves, to track why they are changing code.
The Triage team meets (specify interval here) to evaluate and assign issues. The Triage team specifically looks for duplicate reports, in which case the new issue is 'rolled up' into the existing issue chain; for unreproduced issues from the field, which are assigned to QA for reproduction; and for high-severity issues from the customers.
The originator of a bug is the ONLY person that can close it. Bug reports initiated by QA or by a CSR cannot be closed by a developer. Yes, this means that bugs that CS and the dev team disagree on remain unresolved. Why have the issue tracker report an issue as resolved when the people aren't in agreement? If you want a digital repository of lies, you have C-SPAN.
Some teams may want to reserve moving an issue from one department to another to managers, other teams may allow any team member to move an issue on to (or BACK to) another team. This may boil down to management suspicion, or simply to who is allowed to allocate work time.
The Triage process is the key. The Triage team is essentially whoever in your organization decides who works on what, and what gets worked on next. Having the team meet on a regular schedule helps to make sure that really important stuff doesn't get missed, and that the mundane stuff doesn't get dropped due to inattention. If there isn't anything in the Triage queue, the meeting (concall, netmeeting, whatever the implementation is) can be cancelled by the meeting leader.
If you're using Scrum, the Triage team is probably the scrum masters, deciding if an issue is going to be pulled into the current sprint and properly assigning the priority if it's going into the backlog.
Wait, you write:
If a developer's issue is blocked by another developer's code, she must solve this problem outside of the bug tracking system.
so there are bugs that fall outside of the normal bug flow. Do you then have a second system for tracking those bugs, or are they all handled ad hoc?
Sounds like your bug tracking system is really a user-defect tracking system.
Does it work well for you or are you looking at alternatives?
I think that customers also should be able to create issues, with no separation between bug reports and feature requests.
Assignment of issues should not be performed by developers themselves: deciding which issues have to be fixed for the next release should be the responsibility of customers and managers.
Other practices can be found in Painless Bug Tracking by Joel Spolsky.
I’ve used several different types of bug tracking systems over the past 10 years, including nothing, a Word document, FogBugz, Bugzilla, and Remedy. FogBugz is by far the best one. At that job anyone was allowed to enter bugs, and anyone could assign a bug to anyone else. I found that this worked well, especially if I found a small bug in my code. Instead of spending an hour writing e-mails, filling out forms, and getting several other people involved, I could quickly log that I had found and fixed a bug. This encouraged me to enter all the bugs I found and fix them quickly. If a bug required a lot of work, I would assign it to my manager so he could prioritize it against my other work.
At the job where I used Bugzilla, every time a bug was created, assigned, or changed an e-mail was sent to all the developers and managers. This had the opposite effect, it discouraged me from finding and entering bugs in the system.
Logging bugs is about speed: just the minimum amount of information needed to investigate/replicate the bug.
For web projects, this comes down to: 1) a descriptive bug title, 2) the page where the error occurred, 3) a description of the problem plus a screenshot OR step-by-step instructions for replicating the problem (if a screenshot isn't provided).
Screenshots are very powerful for two reasons: 1) a picture says a thousand words, 2) it gives credibility to the bug report (ever investigate a bug you couldn't replicate and think "looks like the client is making stuff up again"?).
I have a blog article which goes into the topic further: Logging Bugs Like a Pro
My small shop uses a pretty simple workflow:
Anyone can create an issue (I think it's unnecessarily restrictive not to allow this). This includes customers and users of our open source projects.
A change control board (sounds fancy, but it's just the QA lead and the head of engineering, plus the product manager) reviews new issues and assigns a fix version and priority.
Anyone can reassign a bug, to ask the reporter a question or pass on to another person to fix or test
Anyone can mark a bug resolved
Only QA can close a bug - we do this to enforce verification of each bug fix.
This way, everything gets logged in the bug tracking system and we keep things efficient by not restricting updates. You can end up with a bit of "bug spam" this way, but it's better than creating bottlenecks in my experience.
We use JIRA as our bug tracker - it's possible to set up all kinds of custom workflows in JIRA to enforce your particular process, but I've never found the need to do that in smaller organizations.
What bug tracking flow are you using?
The tester posts all bugs in the open state.
The bug is assigned to a developer.
The developer tries to fix the bug and marks it fixed.
The bug is closed.
If the fix does not hold up, the bug is reopened.

Has anyone tried their software with ReactOS yet?

The Free MS Windows replacement operating system ReactOS has just released a new version. They have a large and active development team.
Have you tried your software with it yet?
If so, what is your recommendation?
Is it time to start investigating it as a serious Windows replacement?
Targeting ReactOS specifically is a bit too narrow IMO -- perhaps a better focus is to target compatibility with WINE. Because ReactOS shares so many of its usermode DLLs with WINE, targeting WINE should result in the app running just fine on ReactOS.
Of course, there will always be things that WINE can't emulate well (hence the need for ReactOS). In this way, it seems that if something runs in WINE, it will run in ReactOS, whereas the fact that something runs in ReactOS doesn't mean that it will necessarily run in WINE.
Targeting WINE is well documented, perhaps easier to test, and by definition, should make your app compatible with ReactOS as a matter of course. In this way, you're not only gathering the rather large user base of current WINE users, but you're future-proofing yourself for whenever anyone wants to use your app with ReactOS.
On their homepage, under the Tour, you can see a partial list of office applications, tools, and games that already run OK (more or less) on ReactOS. If you subscribe to the newsletter, you'll receive info about many more - for instance, I was quite surprised when I read that most SQL Server 2000 tools actually work on ReactOS! Query Analyzer, OSQL, and Books Online work fine; Enterprise Manager and Profiler are buggy; and the DBMS won't work at all.
At a former workplace (an all-MS shop) we seriously investigated it as a way to reduce our expenditure on licenses whilst keeping our in-house developed apps. Since it couldn't run MSDE reliably, we had to abandon the project - I hope this will be solved in the future so my ex-coworkers can push for it again.
These announcements might as well also be on their homepage - I couldn't find them after 5 minutes of searching, though. Probably the easiest way to learn about all these compatibility issues is to join the newsletter, or to look through its archives.
I have been tracking this OS's progress for quite some time. I believe it has all the potential to really bring an OSS operating system to the masses, for it breaks the "chicken and egg" problem: it has applications and drivers from the very beginning (since it aims to have full ABI compatibility with MS Windows).
Just wait for their first beta, I won't be surprised if they surpass Linux in popularity really soon after that...
Post Edit: Found it! Look at the Support Database section; it's the place on the web to check whether a particular piece of hardware or a particular program works on ReactOS.
ReactOS has been under development for a long long time.
They were in some hot water earlier because some of their code appeared to be a line-by-line disassembly of some NT kernel code; I think they have since replaced all of it.
I wouldn't bother with cross platform testing until they hit the same market penetration as Linux, which I would wager is never.
Until ReactOS doesn't randomly crash just sitting there within 5 minutes of booting, I won't worry about testing my code on it. Don't get me wrong, I like ReactOS, but it's just not stable enough for any meaningful testing yet!
No, I do not think it is time to start thinking of it as a Windows replacement.
As the site states, it's still in the Alpha stages. More importantly, whose Windows replacement? Yours? Your users'? The former is one thing; the latter is categorically a no-go.
As an aside, I'm not really sure who this OS is targeting. It has to be people who rely on Windows software but don't want to pay, because people who simply don't want Windows can use MacOS / Linux, and the support (community or otherwise) for these choices is good.
Moreover, if you use Linux you already have some amounts of Windows software support via Wine.
Back to people who rely on Windows software but don't want to pay. If they are home users, they can simply pirate it; if they are large business users, they already have support contracts, trained people, etc. It's hard enough for large businesses to agree to update to new versions of Windows, let alone an open source replacement.
So I suppose that leaves small businesses who don't want to obtain illegal copies of MS software, can't afford the OS licences, and rely on software that only runs on Windows and has bad or non-existent Wine compatibility.
It is a useful replacement for Windows when it runs 'your' software without crashing. At the moment it is not a general-purpose OS, as it is too unstable (being only alpha), but people have already used ReactOS successfully in anger for specific tasks. As a Windows replacement it has multiple potential uses: sandbox systems, test and development systems, multiple virtual instances, embedded devices, even packaging/bundling legacy apps with their own compatible OS. Driver and application compatibility, freed from Microsoft's policy of planned obsolescence and regular GUI renewal - what's not to like?
