We have a large JDK7 application deployed on JBoss using several libraries like Hibernate, Spring and so on. After initial startup of the server, the application runs as expected but after some uptime it becomes very slow.
Using a profiler, we have seen that every time certain aspects of out application are slowing down but not always the same aspects. While in one run it might be that hibernate flush slows to a crawl, in another run it might be some DI-code from Spring.
What's going on there?
There is a bug in JDK7 regarding the CodeCache memory area which hit us very, very hard.
Explanation
Basically Java starts up and uses just in time compilation (JIT) to compile just the required parts of the bytecode during runtime. This enables the JVM to de- and recompile certain code fragments during execution. This happend, if the JVM determins, that an initial compilation of a certain code fragment is suboptimal. Oracle introduced a feature named tiered compilation in JDK 7 which allows the VM to do just that.
Compiled code in the JVM is stored in the CodeCache memory area. Up to JDK6 the default was that this area would be filled up and once at a 100% the JIT would stop compiling and an error would be printed to the console, however the application would be running same as before: Everything already compiled would stay compiled, everything not yet compiled would be executed in interpretation mode (which is roughly 100x slower)
This option is named CodeCacheFlushing, it is enabled by default since JDK7u4. The idea is, that once CodeCache is full, the least used parts of compiled code are flushed from memory to make room for other code fragments. That would make the JDK6-default-behaviour (to stop compilation all in all) obsolete. It also allowed for a much smaller CodeCache area (in JDK7 CodeCache is 48M by default/96M if tiered compilation is enabled).
Here comes the bug. In JDK7 once the CodeCache gets full, the JIT is stopped. Next comes the flushing of the CodeCache area. That's it. JIT should be reenabled after flushing is completed but that doesn't happen. Also, there is no warning printed to the console. Worse: prior to disabling the JIT roughly half of the already compiled code is thrown out.
In contrast to JDK6 where everything that was fast will stay fast and only new code will be interpreted, in JDK7 you actually lose already compiled and optimized code! All of the sudden parts of your application that performed well will stop doing so. It is left to chance, which parts of the application slow down, which makes tracking that bugger by profiler nearly impossible: At times the hibernate code for flushing slows down, at other times, its the spring DI code or your own appcode.
Are you affected?
You can use a profiler (JProfiler/YourKit) or JConsole (JVisualVM won't do) to monitor the memory-consumption of CodeCache memory area. Typically the CodeCache amount committed will stay very close to the used amount (say, committed is 23mb, used is 22mb). While your application runs, committed and used go up until committed reaches max. At that point used will drop sharply to 1/2 - 2/3 of max. After that, used whill no longer grow. That's where the bug will hit you. In JConsole, it will look like this:
Why me and not all the others?
Chances are, you are using JBoss. Oracle quickly found out that there are things not like they should be and disabled tiered compilation by default - yet Red Hat in its infinite wisdom decided, it knew better and reenabled it. Basically our webapp runs fine on Weblogic and only JBoss is affected, because without the tiered compilation (not enabled in weblogic) the growth of CodeCache is so small, we never actually hit the 48mb threshold even after weeks of operation.
What can I do?
First of all, decide, whether this bug hits you. Second, make it harder for the bug to damage you. If you disable CodeCacheFlushing at least hitting the bug won't make things worse than they were before. Stopping tiered compilation will make it less probable the bug hits you, same as increasing the amount of CodeCache-Memory available.
You can always try to switch to JDK8, this seems unaffected and also you could implement monitoring in your software to warn you, if CodeCache is running full.
TL;DR
In JDK 7 never enable tiered compilation (disabled by default, enabled in JBoss)
in JBoss 7 always set PRESERVE_JAVA_OPTS=true in standalone.conf
always disable CodeCacheFlushing (-XX:-UseCodeCacheFlushing)
always pack a sufficient amount of memory into CodeCache (-XX:ReservedCodeCacheSize=xxM).
I am currently trying to reverse a program under Linux that has a bunch of anti-debug tricks. I was able to defeat some of them, but I am still fighting against the remaining ones. Sadly since I am mediocre, it is taking me more time than expected. Anyway, the programs runs without any pain in a VM (I tried with VMWare and VBox), so I was thinking about taking a trace of its execution in the VM, then a trace under the debugger (gdb) and diff them to see were the changes are and find out the anti-debug tricks more easily.
However, I did some kernel debugging with vmware a long time ago, it was more or less ok (I remember having access to the linear address...), but here it's a bit different I think.
Do you see an easy way to debug this userland program without going into too much pain ?
I would suggest using Ether, which is a tool for monitoring the execution of a program and is based on the XEN hypervisor. The whole point of the tool is to trace a program's execution without being observable. The first thing to do is go to their website and click on the malware tab, then submit your binary and see if their automated web interface can do it for you. If this fails, you can install it yourself, which is a pain, but doable, and should yield good results, I have been able to install it in the past. They have instructions on the Ether website, but if you I'd suggest you also take a look at these supplemental instructions from Offensive Computing
A couple of other automated analysis sites that could do the trick for you:
Eureka by SRI international
and Renovo by bitblaze at UC Berkeley
The question says it all. If you have a bug that multiple users report, but there is no record of the bug occurring in the log, nor can the bug be repeated, no matter how hard you try, how do you fix it? Or even can you?
I am sure this has happened to many of you out there. What did you do in this situation, and what was the final outcome?
Edit:
I am more interested in what was done about an unfindable bug, not an unresolvable bug. Unresolvable bugs are such that you at least know that there is a problem and have a starting point, in most cases, for searching for it. In the case of an unfindable one, what do you do? Can you even do anything at all?
Language
Different programming languages will have their own flavour of bugs.
C
Adding debug statements can make the problem impossible to duplicate because the debug statement itself shifts pointers far enough to avoid a SEGFAULT---also known as Heisenbugs. Pointer issues are arduous to track and replicate, but debuggers can help (such as GDB and DDD).
Java
An application that has multiple threads might only show its bugs with a very specific timing or sequence of events. Improper concurrency implementations can cause deadlocks in situations that are difficult to replicate.
JavaScript
Some web browsers are notorious for memory leaks. JavaScript code that runs fine in one browser might cause incorrect behaviour in another browser. Using third-party libraries that have been rigorously tested by thousands of users can be advantageous to avoid certain obscure bugs.
Environment
Depending on the complexity of the environment in which the application (that has the bug) is running, the only recourse might be to simplify the environment. Does the application run:
on a server?
on a desktop?
in a web browser?
In what environment does the application produce the problem?
development?
test?
production?
Exit extraneous applications, kill background tasks, stop all scheduled events (cron jobs), eliminate plug-ins, and uninstall browser add-ons.
Networking
As networking is essential to so many applications:
Ensure stable network connections, including wireless signals.
Does the software reconnect after network failures robustly?
Do all connections get closed properly so as to release file descriptors?
Are people using the machine who shouldn't be?
Are rogue devices interacting with the machine's network?
Are there factories or radio towers nearby that can cause interference?
Do packet sizes and frequency fall within nominal ranges?
Are packets being monitored for loss?
Are all network devices adequate for heavy bandwidth usage?
Consistency
Eliminate as many unknowns as possible:
Isolate architectural components.
Remove non-essential, or possibly problematic (conflicting), elements.
Deactivate different application modules.
Remove all differences between production, test, and development. Use the same hardware. Follow the exact same steps, perfectly, to setup the computers. Consistency is key.
Logging
Use liberal amounts of logging to correlate the time events happened. Examine logs for any obvious errors, timing issues, etc.
Hardware
If the software seems okay, consider hardware faults:
Are the physical network connections solid?
Are there any loose cables?
Are chips seated properly?
Do all cables have clean connections?
Is the working environment clean and free of dust?
Have any hidden devices or cables been damaged by rodents or insects?
Are there bad blocks on drives?
Are the CPU fans working?
Can the motherboard power all components? (CPU, network card, video card, drives, etc.)
Could electromagnetic interference be the culprit?
And mostly for embedded:
Insufficient supply bypassing?
Board contamination?
Bad solder joints / bad reflow?
CPU not reset when supply voltages are out of tolerance?
Bad resets because supply rails are back-powered from I/O ports and don't fully discharge?
Latch-up?
Floating input pins?
Insufficient (sometimes negative) noise margins on logic levels?
Insufficient (sometimes negative) timing margins?
Tin whiskers?
ESD damage?
ESD upsets?
Chip errata?
Interface misuse (e.g. I2C off-board or in the presence of high-power signals)?
Race conditions?
Counterfeit components?
Network vs. Local
What happens when you run the application locally (i.e., not across the network)? Are other servers experiencing the same issues? Is the database remote? Can you use a local database?
Firmware
In between hardware and software is firmware.
Is the computer BIOS up-to-date?
Is the BIOS battery working?
Are the BIOS clock and system clock synchronized?
Time and Statistics
Timing issues are difficult to track:
When does the problem happen?
How frequently?
What other systems are running at that time?
Is the application time-sensitive (e.g., will leap days or leap seconds cause issues)?
Gather hard numerical data on the problem. A problem that might, at first, appear random, might actually have a pattern.
Change Management
Sometimes problems appear after a system upgrade.
When did the problem first start?
What changed in the environment (hardware and software)?
What happens after rolling back to a previous version?
What differences exist between the problematic version and good version?
Library Management
Different operating systems have different ways of distributing conflicting libraries:
Windows has DLL Hell.
Unix can have numerous broken symbolic links.
Java library files can be equally nightmarish to resolve.
Perform a fresh install of the operating system, and include only the supporting software required for your application.
Java
Make sure every library is used only once. Sometimes application containers have a different version of a library than the application itself. This might not be possible to replicate in the development environment.
Use a library management tool such as Maven or Ivy.
Debugging
Code a detection method that triggers a notification (e.g., log, e-mail, pop-up, pager beep) when the bug happens. Use automated testing to submit data into the application. Use random data. Use data that covers known and possible edge cases. Eventually the bug should reappear.
Sleep
It is worth reiterating what others have mentioned: sleep on it. Spend time away from the problem, finish other tasks (like documentation). Be physically distant from computers and get some exercise.
Code Review
Walk through the code, line-by-line, and describe what every line does to yourself, a co-worker, or a rubber duck. This may lead to insights on how to reproduce the bug.
Cosmic Radiation
Cosmic Rays can flip bits. This is not as big as a problem in the past due to modern error checking of memory. Software for hardware that leaves Earth's protection is subject to issues that simply cannot be replicated due to the randomness of cosmic radiation.
Tools
Sometimes, albeit infrequently, the compiler will introduce a bug, especially for niche tools (e.g. a C micro-controller compiler suffering from a symbol table overflow). Is it possible to use a different compiler? Could any other tool in the tool-chain be introducing issues?
If it's a GUI app, it's invaluable to watch the customer generate the error (or try to). They'll no doubt being doing something you'd never have guessed they were doing (not wrongly, just differently).
Otherwise, concentrate your logging in that area. Log most everything (you can pull it out later) and get your app to dump its environment as well. e.g. machine type, VM type, encoding used.
Does your app report a version number, a build number, etc.? You need this to determine precisely which version you're debugging (or not!).
If you can instrument your app (e.g. by using JMX if you're in the Java world) then instrument the area in question. Store stats e.g. requests+parameters, time made, etc. Make use of buffers to store the last 'n' requests/responses/object versions/whatever, and dump them out when the user reports an issue.
If you can't replicate it, you may fix it, but can't know that you've fixed it.
I've made my best explanation about how the bug was triggered (even if I didn't know how that situation could come about), fixed that, and made sure that if the bug surfaced again, our notification mechanisms would let a future developer know the things that I wish I had known. In practice, this meant adding log events when the paths which could trigger the bug were crossed, and metrics for related resources were recorded. And, of course, making sure that the tests exercised the code well in general.
Deciding what notifications to add is a feasability and triage question. So is deciding on how much developer time to spend on the bug in the first place. It can't be answered without knowing how important the bug is.
I've had good outcomes (didn't show up again, and the code was better for it), and bad (spent too much time not fixing the problem, whether the bug ended up fixed or not). That's what estimates and issue priorities are for.
Sometimes I just have to sit and study the code until I find the bug. Try to prove that the bug is impossible, and in the process you may figure out where you might be mistaken. If you actually succeed in convincing yourself it's impossible, assume you messed up somewhere.
It may help to add a bunch of error checking and assertions to confirm or deny your beliefs/assumptions. Something may fail that you'd never expect to.
It can be difficult, and sometimes near impossible. But my experience is, that you will sooner or later be able to reproduce and fix the bug, if you spend enough time on it (if that spent time is worth it, is another matter).
General suggestions that might help in this situation.
Add more logging, if possible, so that you have more data the next time the bug appears.
Ask the users, if they can replicate the bug. If yes, you can have them replicate it while watching over their shoulder, and hopefully find out, what triggers the bug.
Make random changes until something works :-)
Assuming you have already added all the logging that you think would help and it didn't... two things spring to mind:
Work backwards from the reported symptom. Think to yourself.. "it I wanted to produce the symptom that was reported, what bit of code would I need to be executing, and how would I get to it, and how would I get to that?" D leads to C leads to B leads to A. Accept that if a bug is not reproducible, then normal methods won't help. I've had to stare at code for many hours with these kind of thought processes going on to find some bugs. Usually it turns out to be something really stupid.
Remember Bob's first law of debugging: if you can't find something, it's because you're looking in the wrong place :-)
Think. Hard. Lock yourself away, admit no interuptions.
I once had a bug where the evidence was a hex dump of a corrupt database. The chains of pointers were systematically screwed up. All the user's programs, and our database software, worked faultlessly in testing. I stared at it for a week (it was an important customer), and after eliminating dozens of possible ideas, I realised that the data was spread across two physical files and the corruption occurred where the chains crossed file boundaries. I realized that if a backup/restore operation failed at a critical point, the two files could end up "out of sync", restored to different time points. If you then ran one of the customer's programs on the already-corrupt data, it would produce exactly the knotted chains of pointers I was seeing. I then demonstrated a sequence of events that reproduced the corruption exactly.
modify the code where you think the problem is happening, so extra debug info is recorded somewhere. when it happens next time, you will have what your need to solve the problem.
There are two types of bugs you can't replicate. The kind you discovered, and the kind someone else discovered.
If you discovered the bug, you should be able to replicate it. If you can't replicate it, then you simply haven't considered all of the contributing factors leading towards the bug. This is why whenever you have a bug, you should document it. Save the log, get a screenshot, etc. If you don't, then how can you even prove the bug really exists? Maybe it's just a false memory?
If someone else discovered a bug, and you can't replicate it, obviously ask them to replicate it. If they can't replicate it, then you try to replicate it. If you can't replicate it quickly, ignore it.
I know that sounds bad, but I think it is justified. The amount of time it will take you to replicate a bug that someone else discovered is very large. If the bug is real, it will happen again naturally. Someone, maybe even you, will stumble across it again. If it is difficult to replicate, then it is also rare, and probably won't cause too much damage if it happens a few more times.
You can be a lot more productive if you spend your time actually working, fixing other bugs and writing new code, than you will be trying to replicate a mystery bug that you can't even guarantee actually exists. Just wait for it to appear again naturally, then you will be able to spend all your time fixing it, rather than wasting your time trying to reveal it.
Discuss the problem, read code, often quite a lot of it. Often we do it in pairs, because you can usually eliminate the possibilities analytically quite quickly.
Start by looking at what tools you have available to you. For example crashes on a Windows platform go to WinQual, so if this is your case you now have crash dump information. Do you can static analysis tools that spot potential bugs, runtime analysis tools, profiling tools?
Then look at the input and output. Anything similar about the inputs in situations when users report the error, or anything out of place in the output? Compile a list of reports and look for patterns.
Finally, as David stated, stare at the code.
Ask user to give you a remote access for his computer and see everything yourself. Ask user to make a small video of how he reproduces this bug and send it to you.
Sure both are not always possible but if they are it may clarify some things. The common way of finding bugs are still the same: separating parts that may cause bug, trying to understand what`s happening, narrowing codespace that could cause the bug.
There are tools like gotomeeting.com, which you can use to share screen with your user and observe the behaviour. There could be many potential problems like number of softwares installed on their machines, some tools utility conflicting with your program. I believe gotomeeting, is not the only solution, but there could be timeout issues, slow internet issue.
Most of times I would say softwares do not report you correct error messages, for example, in case of java and c# track every exceptions.. dont catch all but keep a point where you can catch and log. UI Bugs are difficult to solve unless you use remote desktop tools. And most of time it could be bug in even third party software.
If you work on a real significant sized application, you probably have a queue of 1,000 bugs, most of which are definitely reproducible.
Therefore, I'm afraid I'd probably close the bug as WORKSFORME (Bugzilla) and then get on fixing some more tangible bugs. Or doing whatever the project manager decides to do.
Certainly making random changes is a bad idea, even if they're localised, because you risk introducing new bugs.
I have a crash dump of an application that is supposedly leaking GDI. The app is running on XP and I have no problems loading it into WinDbg to look at it. Previously we have use the Gdikdx.dll extension to look at Gdi information but this extension is not supported on XP or Vista.
Does anyone have any pointers for finding GDI object usage in WinDbg.
Alternatively, I do have access to the failing program (and its stress testing suite) so I can reproduce on a running system if you know of any 'live' debugging tools for XP and Vista (or Windows 2000 though this is not our target).
I've spent the last week working on a GDI leak finder tool. We also perform regular stress testing and it never lasted longer than a day's worth w/o stopping due to user/gdi object handle overconsumption.
My attempts have been pretty successful as far as I can tell. Of course, I spent some time beforehand looking for an alternative and quicker solution. It is worth mentioning, I had some previous semi-lucky experience with the GDILeaks tool from msdn article mentioned above. Not to mention that i had to solve a few problems prior to putting it to work and this time it just didn't give me what and how i wanted it. The downside of their approach is the heavyweight debugger interface (it slows down the researched target by orders of magnitude which I found unacceptable). Another downside is that it did not work all the time - on some runs I simply could not get it to report/compute anything! Its complexity (judging by the amount of code) was another scare-away factor. I'm not a big fan of GUIs, as it is my belief that I'm more productive with no windows at all ;o). I also found it hard to make it find and use my symbols.
One more tool I used before setting on to write my own, was the leakbrowser.
Anyways, I finally settled on an iterative approach to achieve following goals:
minor performance penalties
implementation simplicity
non-invasiveness (used for multiple products)
relying on as much available as possible
I used detours (non-commercial use) for core functionality (it is an injectible DLL). Put Javascript to use for automatic code generation (15K script to gen 100K source code - no way I code this manually and no C preprocessor involved!) plus a windbg extension for data analysis and snapshot/diff support.
To tell the long story short - after I was finished, it was a matter of a few hours to collect information during another stress test and another hour to analyze and fix the leaks.
I'll be more than happy to share my findings.
P.S. some time did I spend on trying to improve on the previous work. My intention was minimizing false positives (I've seen just about too many of those while developing), so it will also check for allocation/release consistency as well as avoid taking into account allocations that are never leaked.
Edit: Find the tool here
There was a MSDN Magazine article from several years ago that talked about GDI leaks. This points to several different places with good information.
In WinDbg, you may also try the !poolused command for some information.
Finding resource leaks in from a crash dump (post-mortem) can be difficult -- if it was always the same place, using the same variable that leaks the memory, and you're lucky, you could see the last place that it will be leaked, etc. It would probably be much easier with a live program running under the debugger.
You can also try using Microsoft Detours, but the license doesn't always work out. It's also a bit more invasive and advanced.
I have created a Windbg script for that. Look at the answer of
Command to get GDI handle count from a crash dump
To track the allocation stack you could set a ba (Break on Access) breakpoint past the last allocated GDICell object to break just at the point when another GDI allocation happens. That could be a bit complex because the address changes but it could be enough to find pretty much any leak.
Im sure this has happened to folks before, something works in debug mode, you compile in release, and something breaks.
This happened to me while working on a Embedded XP environment, the best way i found to do it really was to write a log file to determine where it would go wrong.
What are your experiences/ discoveries trying to tackle an annoying Release-mode bug?
Make sure you have good debug symbols available (you can do this even with a release build, even on embedded devices). You should be able to get a stack trace and hopefully the values of some variables. A good knowledge of assembly language is probably also useful at this point.
My experience is that generally the bug is related to code that is near the area of breakage. That is to say, if you are seeing an issue arising in the function "LoadConfigInfoFromFile" then probably you should start by closely analysing that for issues, rather than "DrawControlsOnScreen", if you know what I mean. "Spooky action at a distance" type bugs do not tend to arise often (although when they do, they tend to be a major bear).
Tracefile is always a good idea.
When it's about crashes, I'm using adplus, which is part of debugging tools for windows. basically what adplus does, is, it attaches windbg to the executable you're monitoring. When the application crashes, you get a crash dump and a log file. You can load the crash dump in your preferred debugger and find out, which instruction lead to the crash.
As release builds are heavily optimized compared to debug builds, the way you compile your code affects its behaviour. This is basically true when crashes in multithreaded code happen in the release version but not the debug version. adplus and windbg helped me, to find out, where this happened.
ADPlus is explained here:
httx://support.microsoft.com/?scid=kb%3Ben-us%3B286350&x=15&y=12
Basically what you have to do is:
1. Download and install WinDbg into C:\debuggers
httx://www.microsoft.com/whdc/devtools/debugging/default.mspx
Start your application
open a cmd and cd to c:\debuggers
start adplus like this:
"adplus.bat -crash your_exe.exe"
reproduce the crash
analyze the crashdump in vs2005 or in windbg
If it's only a small portion of the application that needs debugging then you can change those source files only to be built without optimisations. Presumably you generate debug info for all builds, and so this makes the application run mostly as it would in release, but allows you to debug the interesting parts properly.
How about using Trace statements. They are there for Release mode value checking.
Trace.WriteLine(myVar);
I agree on log file debugging to narrow it down.
I've used "Entering FunctionName" "Leaving FunctionName" until I can find what method it enters before the crash. Then I add more log messages re-compile and re-release.
Besides playing with turning off optimization and/or turning on debug information for your Release build as pauldoo said, a log file will good data can really help. I once wrote a "trace" app that would capture trace logs for the app if it was running when the release build started (otherwise the results would go to the debugger's output window if running under the debugger). I was able to have end-users email me log files from them reproducing the bugs they were seeing, and it was the only way I would have found the problem in at least one case.
Though it's probably not usable in an embedded environment, I've had good luck with WinDbg for debugging release-mode Windows applications. Even if the application is not compiled with symbol information, you can at least get a usable stack trace and plenty of other useful crash information.
You could also copy your debug symbols to the production environment even if it's compiled in relase mode
Here's an article with more information
If you problem is synchronization related dumping log in the file might be problematic.
In this case i usually will use some big array of string and dump this to screen/file after the problem was reproduces.
This is of course depend on your memory restriction, sometime i use just few symbols and numbers to store in the array if the memory on the platform is limited. Reading such logs is not a big pleasure, but sometimes this is the only choice.