Troubleshoot Windows freezes and slowdowns - windows

I'm a (happy?) user of Windows, but recently have problems that I don't know how to track.
I have a WinXP plus home and work Win2k3 systems. Some of them are freezing itermittently for a short amount of time (from less than a second to a few seconds). There is no CPU usage spike and not much HDD activity. Neither Process Explorer nor Windows Task Manager show any suspicious processes. The services also look ok.
On one of computers, dragging and droping (within Explorer windows or windows and apps) freezes the machine for 10-20 sec. After this period I can continue to use drag & drop for some (long) time with no delays. Don't think it is virus – it would probably infect all machines easily.
How can I know what is going on with my systems?
Update: Thank you for your suggestions. I solved the problem on one of the machines – it was a nasty rootkit. I needed to use 3rd party tools to detect and remove it. How can I diagnose it without this tool?

This is most likely not faulty hardware.
On Windows, there are occasional messages that are broadcast system-wide to all top-level windows. If a window does not respond (or is slow in responding), then the whole system will appear to freeze. There is a built-in timeout and if exceeded, the system will assume that the window isn't going to respond and it skips the window (this could be the 10-20 second delay you're seeing although I think the timeout is a little higher than this).
I have not seen a solution for tracking these kinds of problems. You might experiment by creating a program that sends individual messages to each top-level window and record the time taken for each to respond. This isn't failsafe but it's a starting point, and this is (if I recall correctly) the technique I used to identify such a problem with Adobe's iFilter (for the Microsoft indexing service).
But before you go down this path, you said that these are recent problems. See if you can figure out what you might have installed recently and then uninstall it. This includes Windows patches as well as any new drivers or applications.

Are you able to peg it to a rough time-frame of when the symptoms started? If so, you could match the critical updates/installs in Add/Remove programs to that estimation and start looking there.
More generally, I find using MSCONFIG to temporarily turn off all startup programs and all non-Microsoft services can help quickly divide and conquer - If the symptoms disappear, you have a shorter list to work through.
Safe mode (with or without network - see next idea) is another way of narrowing the list of suspects.
Since it is multiple machines, if it were hardware it would have to be something common... Especially if it is two different locations. That said, network connectivity (or lack thereof) is the other frequent culprit. Bringing up a system in a standalone config (net cable unplugged/wireless radio disabled) will seem VERY slow at first, then once the timeouts and various retries have been exceeded, should zip along, especially if you are still running in a limited startup environment. I have had recalcitrant switches/routers be a problem, as well as sluggish external services (like an ISP's DNS) cause symptoms like this.
No floppy, optical, or other removable drive access at those times?

I would recommend a tool that can show files, COM objects and network addresses accessed within the application:
http://www.moduleanalyzer.com/
You can see the dlls that use each resource and the time is taking the accesses.
The problem with Windows slowdown is in general related to a dll that is running in a process/es that is doing some staff inside a process.
In these situations you won't see anything in tools that monitor from a Process perspective. You will need to see what is happening inside the process to see any suspicious dll or module.
This tool use call stack information to see what module is accessing resources.
Try that application that has a full-feature trial.

You probably have a faulty piece of hardware, from my experience likely your HD. If you are connect to a network share (SMB) and having connectivity issues that also could cause hangs. The drag and drop slowness in general points to the "explorer" process hanging, the same process used to communicate with network resources (file shares for example).

To diagnose the activities or infiltration a rootkit or other malware uses, you might check out the forums on Bleeping Computer, some of the volunteers there who help people remove such may be willing to help you figure out where to look for such infestations.
I recently cleaned up some malware through the help of an expert on that site which I also needed to use a third-party tool (in my case Malwarebytes) to remove, but the malware was relatively new such that this tool couldn't fully clean out the stuff until a more recent update to its definitions got released.
I still don't know how or where exactly to look on a given system for such an infestation, but that site might hook you up with someone who has that expertise. As long as you emphasize that you're looking for this to be able to track down such and not for purposes of writing your own malware I would hope they'd be receptive to your request.

Related

Make windbg or kd attached to local kernel behave like system wide strace

I am running Windows 7 on which I want to do kernel debugging and I do not want to mess with boot loader. So I've downloaded LiveKd as suggested here and make it run and seems it is working. If I understand correct it is some kind of read only debugging. Here is mentioned that it is very limited and even breakpoint cannot be used. I would like to ask if is possible in this mode to periodically dump all the instructions that are being executed or basically all events which are happening on current OS? I would like to have some system wide strace (Linux users know) and to do some statistical analysis on this. I suppose it depends on more factors like installed debug symbols to begin able resolve addresses etc.
I'm not sure if debugger is the best tool you can use for tracing live system calls. As you've mentioned LiveKd session is quite limited and you are not allowed to place breakpoints in it (otherwise you would hang your own system). However, you still can create memory dumps using the .dump command (check windbg help: .hh .dump). Keep in mind though that getting a full dump (/f) of a running system might take a lot of time.
Moving back to the subject of your question, by using the "dump approach" you will miss many system calls as you will have only snapshots of a system at given points in time. So if you are looking for something similar to Linux strace I would recommend checking those tools:
Process Monitor (procmon) - it's a tool which will show you all I/O requests in the system, as well as operations performed on the registry or process activity events
Windows Performance Toolkit - it contains tools for collecting (WPR) and analysing (WPA) system and application tracing events. It might be a lot of events and it's really important to filter them accordingly to your needs. ETW (Event Tracing for Windows) is a huge subject and you probably will need to read some tutorials or books before you will be able to use it effectively (but it's really worth it!).
API Monitor - it's one of many (I consider it as one of the best) tracing applications - this tool will allow you to trace method calls in any of the running processes. It has a nice interface and even allows you to place breakpoints on methods you'd like to intercept.
There are many other tools which might be used for tracing on Windows, but I would start with the ones I listed above. You may also check a great book on this subject: Inside Windows Debugging. Good luck! :)

Guidance : I want to work at Process Information level

I couldn't find a suitable title for this. I'm going to express my query with examples.
Consider following softwares:
Process explorer from sysinternals (an advanced task manager)
Resource Manager : resmon.exe (lists each and every fine detail about resource usage about each process).
For me these softwares seems like miracles. I wonder how these are even made. C'mon how a user process can know such fine details about other processes? Who tells this software, what processes are running and what all resources are utilized? Which dlls are used? etc..
Does windows operating system give these software that information? I mean though (obviously the most lower level api) WIN32API. Are there some functions,which on calling return these values
abstractly say:
GetAllRunningProcesses()
GetMemoryUsedByProcess(Process* proc)
etc..
Other similar applications are
network Packet Capture software. How does it get information about all those packets? It clearly sits just infront of the NIC card. How is it possible?
Anti-virus: It scans memory for viruses. Intercepts other processes. Acts like a sandbox for the user application space. How? How??
If its WIN32API. I swear, I'm going to master it.
I don't want to create a multi-threaded application. I want to get information about other multithreaded applications.
I don't want to create a program which communicates using sockets. I want to learn how to learn how to capture all communication packets.
I actually want to work at the lower level. But I don't know, what should I learn. Please guide me in proper direction.
This is really a pretty open-ended question. For things like a list of running processes, look up "PSAPI" or "Toolhelp32". For memory information about a particular process, you can use VirtualQuery.
Capturing network packets is normally done by installing a device driver. If you look, you should be able to find a fair amount about how to write device drivers, though don't expect to create wonders overnight, and do expect to crash your machine a few times in the process (device drivers run in kernel mode, so it's easy for a mistake to crash the machine hard).
I can't say as much with any certainty about anti-virus, because I've never tried to write one. My immediate guess would be that their primary technique is API hooking. There's probably more to it than that, but offhand I've never spent enough time looking at them to know what.
Mark Russinovich's classic, Windows Internals, is the go-to book if you want to get deep in this kind of stuff. I notice that the just-released 5th edition includes Vista. Here's a sample chapter to peek at.
If you like Process Explorer, this is the guy who wrote that, and there are lots of examples using it in the book.
Plus, at 1232 hardcover pages, you can use it to press your clothes.

How do you fix a bug you can't replicate?

The question says it all. If you have a bug that multiple users report, but there is no record of the bug occurring in the log, nor can the bug be repeated, no matter how hard you try, how do you fix it? Or even can you?
I am sure this has happened to many of you out there. What did you do in this situation, and what was the final outcome?
Edit:
I am more interested in what was done about an unfindable bug, not an unresolvable bug. Unresolvable bugs are such that you at least know that there is a problem and have a starting point, in most cases, for searching for it. In the case of an unfindable one, what do you do? Can you even do anything at all?
Language
Different programming languages will have their own flavour of bugs.
C
Adding debug statements can make the problem impossible to duplicate because the debug statement itself shifts pointers far enough to avoid a SEGFAULT---also known as Heisenbugs. Pointer issues are arduous to track and replicate, but debuggers can help (such as GDB and DDD).
Java
An application that has multiple threads might only show its bugs with a very specific timing or sequence of events. Improper concurrency implementations can cause deadlocks in situations that are difficult to replicate.
JavaScript
Some web browsers are notorious for memory leaks. JavaScript code that runs fine in one browser might cause incorrect behaviour in another browser. Using third-party libraries that have been rigorously tested by thousands of users can be advantageous to avoid certain obscure bugs.
Environment
Depending on the complexity of the environment in which the application (that has the bug) is running, the only recourse might be to simplify the environment. Does the application run:
on a server?
on a desktop?
in a web browser?
In what environment does the application produce the problem?
development?
test?
production?
Exit extraneous applications, kill background tasks, stop all scheduled events (cron jobs), eliminate plug-ins, and uninstall browser add-ons.
Networking
As networking is essential to so many applications:
Ensure stable network connections, including wireless signals.
Does the software reconnect after network failures robustly?
Do all connections get closed properly so as to release file descriptors?
Are people using the machine who shouldn't be?
Are rogue devices interacting with the machine's network?
Are there factories or radio towers nearby that can cause interference?
Do packet sizes and frequency fall within nominal ranges?
Are packets being monitored for loss?
Are all network devices adequate for heavy bandwidth usage?
Consistency
Eliminate as many unknowns as possible:
Isolate architectural components.
Remove non-essential, or possibly problematic (conflicting), elements.
Deactivate different application modules.
Remove all differences between production, test, and development. Use the same hardware. Follow the exact same steps, perfectly, to setup the computers. Consistency is key.
Logging
Use liberal amounts of logging to correlate the time events happened. Examine logs for any obvious errors, timing issues, etc.
Hardware
If the software seems okay, consider hardware faults:
Are the physical network connections solid?
Are there any loose cables?
Are chips seated properly?
Do all cables have clean connections?
Is the working environment clean and free of dust?
Have any hidden devices or cables been damaged by rodents or insects?
Are there bad blocks on drives?
Are the CPU fans working?
Can the motherboard power all components? (CPU, network card, video card, drives, etc.)
Could electromagnetic interference be the culprit?
And mostly for embedded:
Insufficient supply bypassing?
Board contamination?
Bad solder joints / bad reflow?
CPU not reset when supply voltages are out of tolerance?
Bad resets because supply rails are back-powered from I/O ports and don't fully discharge?
Latch-up?
Floating input pins?
Insufficient (sometimes negative) noise margins on logic levels?
Insufficient (sometimes negative) timing margins?
Tin whiskers?
ESD damage?
ESD upsets?
Chip errata?
Interface misuse (e.g. I2C off-board or in the presence of high-power signals)?
Race conditions?
Counterfeit components?
Network vs. Local
What happens when you run the application locally (i.e., not across the network)? Are other servers experiencing the same issues? Is the database remote? Can you use a local database?
Firmware
In between hardware and software is firmware.
Is the computer BIOS up-to-date?
Is the BIOS battery working?
Are the BIOS clock and system clock synchronized?
Time and Statistics
Timing issues are difficult to track:
When does the problem happen?
How frequently?
What other systems are running at that time?
Is the application time-sensitive (e.g., will leap days or leap seconds cause issues)?
Gather hard numerical data on the problem. A problem that might, at first, appear random, might actually have a pattern.
Change Management
Sometimes problems appear after a system upgrade.
When did the problem first start?
What changed in the environment (hardware and software)?
What happens after rolling back to a previous version?
What differences exist between the problematic version and good version?
Library Management
Different operating systems have different ways of distributing conflicting libraries:
Windows has DLL Hell.
Unix can have numerous broken symbolic links.
Java library files can be equally nightmarish to resolve.
Perform a fresh install of the operating system, and include only the supporting software required for your application.
Java
Make sure every library is used only once. Sometimes application containers have a different version of a library than the application itself. This might not be possible to replicate in the development environment.
Use a library management tool such as Maven or Ivy.
Debugging
Code a detection method that triggers a notification (e.g., log, e-mail, pop-up, pager beep) when the bug happens. Use automated testing to submit data into the application. Use random data. Use data that covers known and possible edge cases. Eventually the bug should reappear.
Sleep
It is worth reiterating what others have mentioned: sleep on it. Spend time away from the problem, finish other tasks (like documentation). Be physically distant from computers and get some exercise.
Code Review
Walk through the code, line-by-line, and describe what every line does to yourself, a co-worker, or a rubber duck. This may lead to insights on how to reproduce the bug.
Cosmic Radiation
Cosmic Rays can flip bits. This is not as big as a problem in the past due to modern error checking of memory. Software for hardware that leaves Earth's protection is subject to issues that simply cannot be replicated due to the randomness of cosmic radiation.
Tools
Sometimes, albeit infrequently, the compiler will introduce a bug, especially for niche tools (e.g. a C micro-controller compiler suffering from a symbol table overflow). Is it possible to use a different compiler? Could any other tool in the tool-chain be introducing issues?
If it's a GUI app, it's invaluable to watch the customer generate the error (or try to). They'll no doubt being doing something you'd never have guessed they were doing (not wrongly, just differently).
Otherwise, concentrate your logging in that area. Log most everything (you can pull it out later) and get your app to dump its environment as well. e.g. machine type, VM type, encoding used.
Does your app report a version number, a build number, etc.? You need this to determine precisely which version you're debugging (or not!).
If you can instrument your app (e.g. by using JMX if you're in the Java world) then instrument the area in question. Store stats e.g. requests+parameters, time made, etc. Make use of buffers to store the last 'n' requests/responses/object versions/whatever, and dump them out when the user reports an issue.
If you can't replicate it, you may fix it, but can't know that you've fixed it.
I've made my best explanation about how the bug was triggered (even if I didn't know how that situation could come about), fixed that, and made sure that if the bug surfaced again, our notification mechanisms would let a future developer know the things that I wish I had known. In practice, this meant adding log events when the paths which could trigger the bug were crossed, and metrics for related resources were recorded. And, of course, making sure that the tests exercised the code well in general.
Deciding what notifications to add is a feasability and triage question. So is deciding on how much developer time to spend on the bug in the first place. It can't be answered without knowing how important the bug is.
I've had good outcomes (didn't show up again, and the code was better for it), and bad (spent too much time not fixing the problem, whether the bug ended up fixed or not). That's what estimates and issue priorities are for.
Sometimes I just have to sit and study the code until I find the bug. Try to prove that the bug is impossible, and in the process you may figure out where you might be mistaken. If you actually succeed in convincing yourself it's impossible, assume you messed up somewhere.
It may help to add a bunch of error checking and assertions to confirm or deny your beliefs/assumptions. Something may fail that you'd never expect to.
It can be difficult, and sometimes near impossible. But my experience is, that you will sooner or later be able to reproduce and fix the bug, if you spend enough time on it (if that spent time is worth it, is another matter).
General suggestions that might help in this situation.
Add more logging, if possible, so that you have more data the next time the bug appears.
Ask the users, if they can replicate the bug. If yes, you can have them replicate it while watching over their shoulder, and hopefully find out, what triggers the bug.
Make random changes until something works :-)
Assuming you have already added all the logging that you think would help and it didn't... two things spring to mind:
Work backwards from the reported symptom. Think to yourself.. "it I wanted to produce the symptom that was reported, what bit of code would I need to be executing, and how would I get to it, and how would I get to that?" D leads to C leads to B leads to A. Accept that if a bug is not reproducible, then normal methods won't help. I've had to stare at code for many hours with these kind of thought processes going on to find some bugs. Usually it turns out to be something really stupid.
Remember Bob's first law of debugging: if you can't find something, it's because you're looking in the wrong place :-)
Think. Hard. Lock yourself away, admit no interuptions.
I once had a bug where the evidence was a hex dump of a corrupt database. The chains of pointers were systematically screwed up. All the user's programs, and our database software, worked faultlessly in testing. I stared at it for a week (it was an important customer), and after eliminating dozens of possible ideas, I realised that the data was spread across two physical files and the corruption occurred where the chains crossed file boundaries. I realized that if a backup/restore operation failed at a critical point, the two files could end up "out of sync", restored to different time points. If you then ran one of the customer's programs on the already-corrupt data, it would produce exactly the knotted chains of pointers I was seeing. I then demonstrated a sequence of events that reproduced the corruption exactly.
modify the code where you think the problem is happening, so extra debug info is recorded somewhere. when it happens next time, you will have what your need to solve the problem.
There are two types of bugs you can't replicate. The kind you discovered, and the kind someone else discovered.
If you discovered the bug, you should be able to replicate it. If you can't replicate it, then you simply haven't considered all of the contributing factors leading towards the bug. This is why whenever you have a bug, you should document it. Save the log, get a screenshot, etc. If you don't, then how can you even prove the bug really exists? Maybe it's just a false memory?
If someone else discovered a bug, and you can't replicate it, obviously ask them to replicate it. If they can't replicate it, then you try to replicate it. If you can't replicate it quickly, ignore it.
I know that sounds bad, but I think it is justified. The amount of time it will take you to replicate a bug that someone else discovered is very large. If the bug is real, it will happen again naturally. Someone, maybe even you, will stumble across it again. If it is difficult to replicate, then it is also rare, and probably won't cause too much damage if it happens a few more times.
You can be a lot more productive if you spend your time actually working, fixing other bugs and writing new code, than you will be trying to replicate a mystery bug that you can't even guarantee actually exists. Just wait for it to appear again naturally, then you will be able to spend all your time fixing it, rather than wasting your time trying to reveal it.
Discuss the problem, read code, often quite a lot of it. Often we do it in pairs, because you can usually eliminate the possibilities analytically quite quickly.
Start by looking at what tools you have available to you. For example crashes on a Windows platform go to WinQual, so if this is your case you now have crash dump information. Do you can static analysis tools that spot potential bugs, runtime analysis tools, profiling tools?
Then look at the input and output. Anything similar about the inputs in situations when users report the error, or anything out of place in the output? Compile a list of reports and look for patterns.
Finally, as David stated, stare at the code.
Ask user to give you a remote access for his computer and see everything yourself. Ask user to make a small video of how he reproduces this bug and send it to you.
Sure both are not always possible but if they are it may clarify some things. The common way of finding bugs are still the same: separating parts that may cause bug, trying to understand what`s happening, narrowing codespace that could cause the bug.
There are tools like gotomeeting.com, which you can use to share screen with your user and observe the behaviour. There could be many potential problems like number of softwares installed on their machines, some tools utility conflicting with your program. I believe gotomeeting, is not the only solution, but there could be timeout issues, slow internet issue.
Most of times I would say softwares do not report you correct error messages, for example, in case of java and c# track every exceptions.. dont catch all but keep a point where you can catch and log. UI Bugs are difficult to solve unless you use remote desktop tools. And most of time it could be bug in even third party software.
If you work on a real significant sized application, you probably have a queue of 1,000 bugs, most of which are definitely reproducible.
Therefore, I'm afraid I'd probably close the bug as WORKSFORME (Bugzilla) and then get on fixing some more tangible bugs. Or doing whatever the project manager decides to do.
Certainly making random changes is a bad idea, even if they're localised, because you risk introducing new bugs.

What could cause the application as well as the system to slowdown?

I am debugging an application which slows down the system very badly. The application loads a large amount of data (some 1000 files each of half an MB) from the local hard disk.The files are loaded as memory mapped files and are mapped only when needed. This means that at any given point in time the virtual memory usage does not exceed 300 MB.
I also checked the Handle count using handle.exe from sysinternals and found that there are at the most some 8000 odd handles opened. When the data is unloaded it drops to around 400. There are no handle leaks after each load and unload operation.
After 2-3 Load unload cycles, during one load, the system becomes very slow. I checked the virtual memory usage of the application as well as the handle counts at this point and it was well within the limits (VM about 460MB not much fragmentation also, handle counts 3200).
I want how an application could make the system very slow to respond? What other tools can I use to debug this scenario?
Let me be more specific, when i mean system it is entire windows that is slowing down. Task manager itself takes 2 mins to come up and most often requires a hard reboot
The fact that the whole system slows downs is very annoying, it means you can not attach a profiler easily, it also means it would be even difficult to stop the profiling session in order to view the results ( since you said it require a hard reboot ).
The best tool suited for the job in this situation is ETW ( Event Tracing for Windows ), these tools are great, will give you the exact answer you are looking for
Check them out here
http://msdn.microsoft.com/en-us/library/cc305210.aspx
and
http://msdn.microsoft.com/en-us/library/cc305221.aspx
and
http://msdn.microsoft.com/en-us/performance/default.aspx
Hope this works.
Thanks
Tools you can use at this point:
Perfmon
Event Viewer
In my experience, when things happen to a system that prevent Task Manager from popping up, they're usually of the hardware variety -- checking the system event log of Event Viewer is sometimes just full of warnings or errors that some hardware device is timing out.
If Event Viewer doesn't indicate that any kind of loggable hardware error is causing the slowdown, then try Perfmon -- add counters for system objects to track file read, exceptions, context switches etc. per second and see if there's something obvious there.
Frankly the sort of behavior demonstrated is meant to be impossible - by design - for user-mode code to cause. WinNT goes to a lot of effort to insulate applications from each other and prevent rogue applications from making the system unusable. So my suspicion is some kind of hardware fault is to blame. Is there any chance you can simply run the same test on a different PC?
If you don't have profilers, you may have to do the same work by hand...
Have you tried commenting out all read/write operations, just to check whether the slow down disappears ?
"Divide and conquer" strategies will help you find where the problem lies.
If you run it under an IDE, run it until it gets real slow, then hit the "pause" button. You will catch it in the act of doing whatever takes so much time.
You use tools like "IBM Rational Quantify" or "Intel VTune" to detect performance issue.
[EDIT]
Like Benoît did, one good mean is measuring tasks time to identify which is eating cpu.
But remember, as you are working with many files, is likely to be missing that causes the memory to disk swap.
when task manager is taking 2 minutes to come up, are you getting a lot of disk activity? or is it cpu-bound?
I would try process explorer from sysinternals. When your system is in the slowed-down state, and you try running, say, notepad, pay attention to page fault deltas.
Windows is very greedy about caching file data. I would try removing file I/O as someone suggested, and also making sure you close the file mapping as soon as you are done with a file.
I/O is probably causing your slowdown,especially if your files are on the same disk as the OS. Another way to test that would be to move your files to another disk and see if that alleviates the problem.

How to isolate causes of system hang on Unix/OSX

I am on OSX, and my system is becoming unresponsive for a few seconds roughly every 10 minutes. (It gives me the spinning beach ball of death). I was wondering if there was any way I could isolate the problem (I have plenty of RAM, and there are no pageouts/thrashing). Any Unix/OSX tools that could help me monitor and isolate the cause of this behaviour?
Activity Monitor (cmd+space, type, activity monitor), should give you an intuitive overview of what's happening on your system. If as you say it is there are no processes clogging CPU, please do take a look at the disk/IO activity. Perhaps your disk is going south.
I have been having problems continually over the years with system hangs. It seems that generally they are a result of filesystem errors, however Apple does not do enough to take care of this issue. System reliability should be a 100% focus and I am certainly sick of these issues. I have started to move a lot of files and all backups over to a ZFS volume on a FreeBSD server and this is helping a bit as it has started to both ease my mind and allow me to recover more quickly from issues. Additionally I've placed my system volume on a large SSD (240GB as I have a lot of support files and am trying to keep things from being too divided up with symlinks) and my Users folders on another drive. This too has helped add to reliability.
Having said this, you should try to explore spindump and stackshot to see if you can catch frozen processes before the system freezes up entirely. It is very likely that you have an app or two that is attempting to access bad blocks and it just hangs the system or you have a process blocking all others for some reason with a system call that's halting io.
Apple has used stackshot a few times with me over the last couple years to hunt some nasty buggers down and the following link can shed some light on how to perhaps better hunt this goblin down: http://www.stormacq.com/?p=346
Also try: top -l2 -S > top_output.txt and exame the results for hangs / zombie processes.
The deeper you go into this, you may find it useful to subscribe to the kernel developer list (darwin-kernel#lists.apple.com) as there are some very, very sharp cookies on here that can shed light on some of the most obscure issues and help to understand precisely what panic's are saying.
Additionally, you may want to uninstall any VMs you have installed. There is a particular developer who, I've heard from very reliable sources, has very faulty hypervisor issues and it would be wise to look into that if you have any installed. It may be time to clean up your kexts altogether as well.
But, all-in-all, we really quite desperately need a better filesystem and proactive mechanisms therein to watch for bad blocks. I praised the day and shouted for joy when I thought that we were getting ZFS officially. I am doubtful Lion is that much better on the HFS+ front sadly and I certainly am considering ZFS for my Users volume + other storage on the workstation due to it's ability to scrub for bad blocks and to eliminate issues like these.
They are the bain of our existence on Apple hardware and having worked in this field for 20 years and thousands of clients, hard drive failure should be considered inexcusable at this point. Even if the actual mfgs can't and won't fix it, the onus falls upon OS developers to handle exceptions better and to guard against such failures in order to hold off silent data loss and nightmares such as these.
I'd run a mixture of 'top' as well as tail -f /var/log/messages (or wherever your main log file is).
Chances are right before/after the hang, some error message will come out. From there you can start to weed out your problems.
Activity Monitor is the GUI version of top, and with Leopard you can use the 'Sample Process' function to watch what your culprit tasks are spending most of their time doing. Also in Utilities you'll find Console aka tail -f /var/log/messages.
As a first line of attack, I'd suggest keeping top running in a Terminal window where you can see it, and watching for runaway jobs there.
If the other answers aren't getting you anywhere, I'd run watch uptime and keep notes on the times and uptimes when it locks up. Locking up about every 10 minutes is very different from locking up exactly every 10 minutes; the latter suggests looking in crontab -l for jobs starting with */10.
Use Apple's Instruments. Honestly, it's helped immensely in finding hangs like these.
Periodic unresponsiveness is often the case when swapping is happening. Do you have sufficient memory in your system? Examine the disk io to see if there are peaks.
EDIT:
I have seen similar behaviour on my Mac lately which was caused by the filesystem being broken so OS X tried to access non-existing blocks on the disk and even trying to repair it with Disk Manger told me to reformat and reinstall. Doing that and reestablishing with Time Machine helped!
If you do this, then double check that Journalling is enabled on the HFS on the harddisk. This helps quite a bit avoiding it happening again.

Resources