Open cursor derivation logic - oracle

In an environment where online processing and batch processing run simultaneously, is there a way to derive a value for the open_cursors parameter?
I am trying to tune the open_cursors parameter for our testing environment. I have already checked the Oracle Performance Tuning guide, but I am still unable to understand how to arrive at this number.
Will running LoadRunner tests help me get to this number? Please let me know if any more info is needed to help.

Do you actually have a problem? open_cursors is a limit on the number of cursors a single session can have open. It is not a system-wide limit. The proper value isn't influenced by load or what happens in some other session.
The default value is almost always more than sufficient for a properly written application. If you have an application that has long-running sessions and cursor leaks, increasing the value may let you run longer before you start to encounter problems while you find and address the cursor leaks but if you have a leak you'll eventually run out no matter what your setting. In the vast majority of cases, when people get an error related to open_cursors, the proper solution is to find and fix the bug that is leaking cursors rather than to change open_cursors.

Lost Duration while Debugging Apex CPU time limit exceeded

I'm open to posting the code in this section to work through the optimization, but it's a bit lengthy and complex, so instead I'm hoping that somebody can assist me with a few debugging questions I have. My goal is to find out what is causing my Apex CPU Time Limit Exceeded issue.
When using the Debug Log in its basic or normal layout I receive the message
Maximum CPU Time: 15062 out of 10,000 ** Close to Limit
I've optimized and rewritten various loops and queries several times now, and in each case this number ends up around there, which leads me to believe it is lying to me and that my actual usage far exceeds that number. So on my journey I switched the Log Panels of the Developer Console to Analysis in hopes of isolating exactly which loop, method, or area of the code is giving me a headache.
This leads me to my main question and problem.
Execution Tree, Performance Tree & Executed Units
All show my durations as UNDER the 10,000ms allowance. My largest consumption is 3,556.19ms, which is used by the constructor of a wrapper class I created; the constructor contains a fair amount of logic to build a fairly complicated wrapper that spans 5-7 custom objects. Still, even with those 3,000ms, the remainder of the process shows negligible times, bringing my total to around 4,000ms. Again, my question is: why am I unable to see or find what is consuming all my time?
Incorrect Iteration Data
In addition to this, the Performance Tree has a column that shows the number of iterations for each method. I know that my Production org has 81 objects that would each call the constructor of my custom wrapper object, i.e. my constructor SHOULD be called 81 times, but instead it is called 32 times. So my other question is: can I rely on the iteration data in that column? Or, because it was iterating so many times, does it stop counting at a certain point? It's possible that one of my objects is corrupted or causing an infinite loop somehow, but I don't want to dig through all the data in search of that conclusion if it's a known issue that the iteration data is not accurate anyway.
System.Debug in the Production org
The last question is why my System.debug() lines are not displaying in the Developer Console on the production org. I've added several breadcrumbs throughout the code that would help me isolate which objects are making it through and which are not; however, I cannot view System.debug messages in any layout outside of my Sandbox.
Sorry for the wealth of questions but I did want to give an honest effort to better understand the debugging process in Salesforce. If this is a lost cause I'm happy to start sharing some code as well but hopefully some debugging tips can get me to the solution.
It's likely your debug log got truncated, see "Each debug log must be 20 MB or smaller. If it exceeds this amount, you won’t see everything you need." in https://trailhead.salesforce.com/en/content/learn/modules/apex_basics_dotnet/debugging_diagnostics
Download the log and search for text similar to "skipped 123456 bytes of detailed log" to confirm. If the log was truncated, some System.debug statements will simply not show up.
You might have to fine-tune the log levels (don't log validation rules and workflows? don't log every single variable assignment with "FINE" level etc). You might have to set all flags to NONE, then track only 1 particular class/trigger that you suspect (see https://help.salesforce.com/articleView?id=code_debug_log_classes.htm&type=5 and https://salesforce.stackexchange.com/questions/214380/how-are-we-supposed-to-use-debug-logs-for-a-specific-apex-class-only)
If it's truncated, it's possible the analysis tools give up. (I've had mixed luck with the console, to be honest; sometimes https://apextimeline.herokuapp.com/ is great for an overview, but it'll also fail to parse a 20 MB log.)
When all else fails you can load the log into Notepad++ (or any editor of your choice), find the lines related to method entry/method exit (you might need a regular expression search), take these filtered lines to Excel, play with "text to columns" and just look at the timing manually to see if there's a record that causes the spike. It could be record #10 that's the problem; the fact that it exhausts the limits on #32 of 81 doesn't mean much. A regular expression search like (METHOD_ENTRY|METHOD_EXIT).*MyTriggerHandler\.onBeforeUpdate could be a good start. But the first thing is to make sure the log is not truncated.
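The manual timing pass described above can also be scripted. The following is only a minimal sketch in Python. It assumes each log line has the shape `HH:MM:SS.mmm (elapsed_ns)|EVENT|[line]|id|signature`, with the parenthesised value being nanoseconds elapsed since the start of the transaction. The sample lines and the handler name are made up for illustration, so check the format against your own downloaded log first.

```python
import re
from collections import defaultdict

# Assumed line shape (verify against your own log):
# "12:00:00.001 (1000000)|METHOD_ENTRY|[10]|01p000|MyClass.method()"
LINE_RE = re.compile(r"^\S+ \((\d+)\)\|(METHOD_ENTRY|METHOD_EXIT)\|(.*)$")


def method_durations(lines):
    """Return {method signature: [duration in ms for each call]}."""
    open_calls = defaultdict(list)  # signature -> stack of entry times (ns)
    durations = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # not a method entry/exit line
        elapsed_ns = int(match.group(1))
        event = match.group(2)
        signature = match.group(3).split("|")[-1]  # last field is the method
        if event == "METHOD_ENTRY":
            open_calls[signature].append(elapsed_ns)
        elif open_calls[signature]:
            start_ns = open_calls[signature].pop()
            durations[signature].append((elapsed_ns - start_ns) / 1_000_000)
    return dict(durations)


# Hypothetical sample: the second call is the slow one.
sample = [
    "12:00:00.001 (1000000)|METHOD_ENTRY|[10]|01p000|MyTriggerHandler.onBeforeUpdate()",
    "12:00:00.004 (4000000)|METHOD_EXIT|[10]|01p000|MyTriggerHandler.onBeforeUpdate()",
    "12:00:00.005 (5000000)|METHOD_ENTRY|[10]|01p000|MyTriggerHandler.onBeforeUpdate()",
    "12:00:02.505 (2505000000)|METHOD_EXIT|[10]|01p000|MyTriggerHandler.onBeforeUpdate()",
    "12:00:02.506 (2506000000)|USER_DEBUG|[20]|DEBUG|done",
]

for name, calls in method_durations(sample).items():
    print(name, "calls:", len(calls), "slowest ms:", max(calls))
```

Sorting the per-call durations makes it easy to see whether one record dominates, which is the question the spreadsheet approach is trying to answer.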

RabbitMQ Message size limitation?

I am trying to gauge the performance of RabbitMQ when my message size increases to a few MB. However, even when I send a 32KB message, I get a "Resource temporarily unavailable" message from the server. There are no errors in the log files and no memory-limit errors... How do I go about debugging this issue?
If it's of any help, I'm running this on an EC2 t1.micro instance, so 592MB RAM.
According to the bug you linked, someone recently (looks like after you left the link to the bug) left a comment that they can reliably reproduce the bug when the message size is >=15821 bytes.
I would recommend that you see if that also holds true for you -- i.e. can you also reproduce at that threshold -- and then evaluate if under that amount -- thus avoiding the bug documented in the issue above -- is a sufficient size for your needs. If not, you may want to try pika (https://github.com/pika/pika) and see if that works better with larger messages (one of the other comments on that bug suggests that pika did work for them with larger message sizes).
Another option that may work, depending on your exact use case, would be to include in the RabbitMQ message payload a key of sorts that allows you to fetch the large blob of data from wherever it's stored (Postgres, MongoDB, etc.) when you consume the message, and therefore allows you to avoid the bug. Perhaps not ideal if you really want to encapsulate everything inside the payload, but it may be a feasible workaround for the bug.
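The "key in the payload" workaround could look roughly like the sketch below. It is not tied to rabbitpy or pika: `BlobStore` is a hypothetical in-memory stand-in for whatever database you use, and `build_message` / `read_message` are illustrative names. Only the small envelope would be published to the queue.

```python
import json
import uuid


class BlobStore:
    """Hypothetical stand-in for Postgres, MongoDB, etc. (in-memory only)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = uuid.uuid4().hex
        self._blobs[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def build_message(store: BlobStore, payload: bytes) -> bytes:
    """Producer side: store the large payload, publish only a pointer to it."""
    key = store.put(payload)
    envelope = {"blob_key": key, "size": len(payload)}
    return json.dumps(envelope).encode("utf-8")


def read_message(store: BlobStore, body: bytes) -> bytes:
    """Consumer side: resolve the pointer back into the original payload."""
    envelope = json.loads(body.decode("utf-8"))
    return store.get(envelope["blob_key"])


store = BlobStore()
big_payload = b"x" * (5 * 1024 * 1024)  # 5 MB of dummy data
message_body = build_message(store, big_payload)
print(len(message_body), "bytes would go through the broker")
```

The trade-off is that the consumer now depends on the store being reachable, and something has to clean up blobs once their messages have been processed.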
In terms of debugging, since it appears that this is a bug with rabbitpy itself, I think you would need to debug the actual rabbitpy library if you wanted to proceed on that front. Doable, but perhaps not feasible due to time, etc.

PL/SQL - check for memory leaks?

I have some PL/SQL code that I think might have a memory leak. Every time I run it, it seems to run slower than the time before, even though I am now decreasing the input size. The code that I'm suspicious of populates an array from a cursor using bulk collect, something like this:
open c_myCursor(in_key);
fetch c_myCursor bulk collect into io_Array; /*io_array is a parameter, declared as in out nocopy */
close c_myCursor;
I'm not sure how to check to see what's causing this slowdown. I know there are some tables in Oracle that track this kind of memory usage, but I'm not sure if it's possible to look at those tables and find my way back to something useful about what my code is doing.
Also, I tried logging out the session and logging back in after about 10-15 minutes, still very slow.
Oracle version is 10.2
So it turns out there was other database activity. The DBA decided to run some large insert and update jobs at about the same time I started changing and testing code. I suspected my code was the root cause because I hadn't been told about the other jobs running (and I only heard about this other job after it completely froze everything and all the other devs got annoyed). That was probably why my code kept getting slower and slower.
Is there a way to find this out programmatically, such as querying for a session inserting/updating lots of data, just in case the DBA forgets to tell me the next time he does this?
v$sessmetric is a quick way to see what resources each session is using - cpu, physical_reads, logical_reads, pga_memory, etc.
"I tried logging out the session and logging back in after about 10-15 minutes, still very slow."
Assuming you are using a conventional dedicated connection on a *nix platform, this would pretty much rule out any memory leak. When you make a new connection to a database, Oracle will fork off a new process for it, and all the PGA memory will belong to that process and will be released (by the OS) when the session is disconnected and the process terminated.
If you are using shared server connections, then the session uses memory belonging to both the process and the shared memory. This would probably be more vulnerable to any memory leak problem.
Windows doesn't work quite the same way, as it doesn't fork a separate process for each session, but rather has a separate thread under a single Oracle process. Again, I'd suspect this would be more vulnerable to a memory leak.
I'd generally look for other issues first, and probably start at the query underlying c_myCursor. Maybe it has to read through more old data to get to the fresh data ?
http://www.dba-oracle.com/t_plsql_dbms_profiler.htm describes DBMS_PROFILER. I suppose the slowest parts of your code could be connected to a memory leak. In any case, going back to the original problem (it gets slower and slower), the first thing to do is to find out what is slow, and only then to suspect a memory leak.
It sounds like you do not commit between executions, so the undo data grows larger and larger. That is probably the cause, because the DB needs it to provide read consistency.
You can also check the Enterprise Manager console. Which version do you use? Never use XE for development since, as far as I know, the professional version can be used for development purposes. The Enterprise Manager console even gives you suggestions. Maybe it can tell you something clever about your PL/SQL problem.
If your query returns a lot of data, your collection can grow enormously large, say 10 000 000 records; that could be the source of the suspicious memory usage.
You can check this by logging the size of the collection you bulk collect into. If it's larger than 10 000 (just a rough estimate; this depends on the data, of course), you may consider splitting the data and working on it in parts, something like this:
declare
  cursor cCur is select smth from your_table;
  --
  type TCur is table of cCur%rowtype index by pls_integer;
  --
  fTbl TCur;
begin
  open cCur;
  loop
    fTbl.delete;
    -- fetch the next batch of at most 10000 rows
    fetch cCur bulk collect into fTbl limit 10000;
    -- exit on an empty batch; testing cCur%notfound here would skip the last partial batch
    exit when fTbl.count = 0;
    for i in 1 .. fTbl.count loop
      null; -- do your work here
    end loop;
  end loop;
  close cCur;
end;
Since you said that table is declared as in out nocopy I understand that you can't directly rewrite logic like this but just consider the methodology, maybe this can help you.

ORA-03113 while executing a sql query

I have a 400-line SQL query which is throwing an exception within 30 seconds:
ORA-03113: end-of-file on communication channel
Below are things to note:
I have set the timeout as 10 mins
There is one last condition when removed resolves this error.
This error came only recently when I analyzed indexes.
The troubling condition is like this:
AND UPPER (someMultiJoin.someColumn) LIKE UPPER ('%90936%')
So my assumption is that the query is being terminated on the server side, apparently because it's identified as a resource hog.
Is my assumption correct? How should I go about fixing this problem?
EDIT: I tried to get the explain plan of the faulty query, but the explain plan query also gives me an ORA-03113 error. I understand that my query is not very performant, but why should that be a reason for an ORA-03113 error? I am trying to run the query from Toad, and no alert log entries or trace files are generated. My DB version is
Oracle9i Enterprise Edition Release 9.2.0.7.0 - Production
One possible cause of this error is a thread crash on the server side. Check whether the Oracle server has generated any trace files, or logged any errors in its alert log.
You say that removing one condition from the query causes the problem to go away. How long does the query take to run without that condition? Have you checked the execution plans for both versions of the query to see if adding that condition is causing some inefficient plan to be chosen?
I've had similar connection dropping issues with certain variations on a query. In my case connections dropped when using rownum under certain circumstances. It turned out to be a bug that had a workaround by adjusting a certain Oracle Database configuration setting. We went with a workaround until a patch could be installed. I wish I could remember more specifics or find an old email on this but I don't know that the specifics would help address your issue. I'm posting this just to say that you've probably encountered a bug and if you have access to Oracle's support site (support.oracle.com) you'll likely find that others have reported it.
Edit:
I had a quick look at Oracle support. There are more than 1000 bugs related to ORA-03113 but I found one that may apply:
Bug 5015257: QUERY FAILS WITH ORA-3113 AND COREDUMP WHEN QUERY_REWRITE_ENABLED='TRUE'
To summarize:
Identified in 9.2.0.6.0 and fixed in 10.2.0.1
Running a particular query (not identified) causes ORA-03113
Running explain on the query does the same
There is a core file in $ORACLE_HOME/dbs
Workaround is to set QUERY_REWRITE_ENABLED to false: alter system set query_rewrite_enabled = FALSE;
Another possibility:
Bug 3659827: ORA-3113 FROM LONG RUNNING QUERY
9.2.0.5.0 through 10.2.0.0
Problem: Customer has a long-running query that consistently produces ORA-3113 errors. On the customer's system they receive core.log files but do not receive any errors in the alert.log. On the test system I used, I received ORA-7445 errors.
Workaround: set "_complex_view_merging"=false at session level or instance level.
You can safely remove the UPPER on both sides when the LIKE pattern contains only digits (which have no case); this can reduce the time needed to evaluate the LIKE condition.
AND UPPER (someMultiJoin.someColumn) LIKE UPPER ('%90936%')
is equivalent to:
AND someMultiJoin.someColumn LIKE '%90936%'
Numbers are not affected by UPPER (and % is independent of character casing).
From the information so far it looks like a back-end crash, as Dave Costa suggested some time ago. Were you able to check the server logs?
Can you get the plan with set autotrace traceonly explain? Does it happen from SQL*Plus locally, or only with a remote connection? Certainly sounds like an ORA-600 on the back-end could be the culprit, particularly if it's at parse time. The successful run taking longer than the failing one seems to rule out a network problem. I suspect it's failing quite quickly but the client is taking up to 30 seconds to give up on the dead connection, or the server is taking that long to write trace and core files.
Which probably leaves you the option of patching (if you can find a relevant fix for the specific ORA-600 on Metalink) or upgrading the DB; or rewriting the query to avoid it. You may get some ideas for how to do that from Metalink if it's a known bug. If you're lucky it might be as simple as a hint, if the extra condition is having an unexpected impact on the plan. Is someMultiJoin.someColumn part of an index that's used in the successful version? It's possible the UPPER is confusing it and you could persuade it back on to the successful plan by hinting it to use the index anyway, but that's obviously rather speculative.
It means you have been disconnected. This is not likely to be due to being a resource hog.
I have seen where the connection to the DB is running over a NAT and because there is no traffic it closes the tunnel and thus drops the connection. Generally if you use connection pooling you won't get this.
As @Daniel said, the network connection to the server is being broken. You might take a look at End-of-file on communication channel to see if it offers any useful suggestions.
Share and enjoy.
This is often a bug in the Cost Based Optimizer with complex queries.
What you can try to do is to change the execution plan, e.g. use WITH to pull some subqueries out, or use the SELECT /*+ RULE */ hint to prevent Oracle from using the CBO. Dropping the statistics can also help, because Oracle then uses another execution plan.
If you can update the database, make a test installation of 9.2.0.8 and see if the error is gone there.
Sometimes it helps to make a dump of the schema, drop everything in it and import the dump again.
I was having the same error, in my case what was causing it was the length of the query.
By reducing said length, I had no more problems.

How do you reproduce bugs that occur sporadically?

We have a bug in our application that does not occur every time, and therefore we don't know its "logic". I couldn't reproduce it even once in 100 attempts today.
Disclaimer: This bug exists and I've seen it. It's not a PEBKAC or something similar.
What are common hints to reproduce this kind of bug?
Analyze the problem in a pair and pair-read the code. Make notes of the things you KNOW to be true and try to work out which logical preconditions must hold for this to happen. Follow the evidence like a CSI.
Most people instinctively say "add more logging", and this may be a solution. But for a lot of problems this just makes things worse, since logging can change timing-dependencies sufficiently to make the problem more or less frequent. Changing the frequency from 1 in 1000 to 1 in 1,000,000 will not bring you closer to the true source of the problem.
So if your logical reasoning does not solve the problem, it'll probably give you a few specifics you could investigate with logging or assertions in your code.
There is no general good answer to the question, but here is what I have found:
It takes a talent for this kind of thing. Not all developers are best suited for it, even if they are superstars in other areas. So know your team, who has a talent for it, and hope you can give them enough candy to get them excited about helping you out, even if it isn't their area.
Work backwards, and treat it like a scientific investigation. Start with the bug, what you see is wrong. Develop hypotheses about what could cause it (this is the creative/imaginative part, the art that not everyone has the talent for) - and it helps a lot to know how the code works. For each of those hypotheses (preferably sorted by what you think is most likely - again pure gut feel here), develop a test that tries to eliminate it as the cause, and test the hypothesis. Any given failure to meet a prediction doesn't mean the hypothesis is wrong. Test the hypothesis until it is confirmed to be wrong (although as it gets less likely you may want to move on to another hypothesis first, just don't discount this one until you have a definitive failure).
Gather as much data as you can during this process. Extensive logging and whatever else is applicable. Do not discount a hypothesis because you lack the data, rather remedy the lack of data. Quite often the inspiration for the right hypothesis comes from examining the data. Noticing something off in a stack trace, weird issue in a log, something missing that should be there in a database, etc.
Double check every assumption. So many times I have seen an issue not get fixed quickly because some general method call was not further investigated, so the problem was just assumed to be not applicable. "Oh that, that should be simple." (See point 1).
If you run out of hypotheses, that is generally caused by insufficient knowledge of the system (this is true even if you wrote every line of code yourself), and you need to run through and review code and gain additional insight into the system to come up with a new idea.
Of course, none of the above guarantees anything, but that is the approach that I have found gets results consistently.
Add some sort of logging or tracing. For example log the last X actions the user committed before causing the bug (only if you can set a condition to match bug).
It's quite common for programmers not to be able to reproduce a user-experienced crash, simply because they have developed a certain workflow and habits in using the application that happen to work around the bug.
At this frequency of 1/100, I'd say that the first thing to do is to handle exceptions and log anything anywhere or you could be spending another week hunting this bug.
Also make a priority list of potentially sensitive articulations and features in your project. For example :
1 - Multithreading
2 - Wild pointers/ loose arrays
3 - Reliance on input devices
etc.
This will help you segment areas that you can brute-force-until-break-again as suggested by other posters.
Since this is language-agnostic, I'll mention a few axioms of debugging.
Nothing a computer ever does is random. A 'random occurrence' indicates an as-yet-undiscovered pattern. Debugging begins with isolating the pattern. Vary individual elements and assess what makes a change in the behaviour of the bug.
Different user, same computer?
Same user, different computer?
Is the occurrence strongly periodic? Does rebooting change the periodicity?
FYI: I once saw a bug that was experienced by a single person. I literally mean person, not a user account. User A would never see the problem on their system; User B would sit down at that workstation, signed on as User A, and could immediately reproduce the bug. There should be no conceivable way for the app to know the difference between the physical bodies in the chair. However:
The users used the app in different ways. User A habitually used a hotkey to invoke an action and User B used an on-screen control. The difference in the user behaviour would cascade into a visible error a few actions later.
ANY difference that affects the behaviour of the bug should be investigated, even if it makes no sense.
There's a good chance your application is MTWIDNTBMT (Multi Threaded When It Doesn't Need To Be Multi Threaded), or maybe just multi-threaded (to be polite). A good way to reproduce sporadic errors in multi-threaded applications is to sprinkle code like this around (C#):
Random rnd = new Random();
System.Threading.Thread.Sleep(rnd.Next(2000));
and/or this:
for (long i = 0; i < 4000000000L; i++) // long counter: an int can never reach 4000000000
{
// tight loop
}
to simulate threads completing their tasks at different times than usual or tying up the processor for long stretches.
I've inherited many buggy, multi-threaded apps over the years, and code like the above examples usually makes the sporadic errors occur much more frequently.
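To see why the injected delays help, here is a minimal sketch of the same idea in Python rather than C#. The `Counter` class and its deliberately unsafe read-modify-write are made up for illustration; the random sleep sits between the read and the write, which is exactly where a lost update can happen.

```python
import random
import threading
import time


def jitter(max_seconds=0.002):
    """Sleep for a random interval to perturb thread scheduling."""
    delay = random.uniform(0, max_seconds)
    time.sleep(delay)
    return delay


class Counter:
    """Hypothetical shared state with an unprotected read-modify-write."""

    def __init__(self):
        self.value = 0

    def unsafe_increment(self):
        current = self.value      # read
        jitter()                  # widen the window between read and write
        self.value = current + 1  # write back, possibly clobbering another thread


def run_trial(thread_count=8):
    counter = Counter()
    threads = [
        threading.Thread(target=counter.unsafe_increment)
        for _ in range(thread_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter.value


final_value = run_trial()
print("expected 8, got", final_value)
```

Without the `jitter()` call the lost update is rare enough to look sporadic; with it, the final value usually falls short of the thread count, although the exact result varies from run to run.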
Add verbose logging. It will take multiple -- sometimes dozen(s) -- iterations to add enough logging to understand the scenario.
Now the problem is that if the bug is a race condition, which is likely if it doesn't reproduce reliably, logging can change the timing and the problem will stop happening. In this case do not log to a file, but keep a rotating buffer of the log in memory and only dump it to disk when you detect that the problem has occurred.
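A rotating in-memory buffer of this kind can be very small. The sketch below is one possible shape in Python; `RingLog` and its method names are hypothetical, and the capacity is something you would tune to your own application.

```python
import time
from collections import deque


class RingLog:
    """Keeps only the most recent entries in memory; writes nothing until asked."""

    def __init__(self, capacity=1000):
        # A deque with maxlen silently discards the oldest entry when full.
        self._entries = deque(maxlen=capacity)

    def log(self, message):
        # Appending to a deque is cheap, so the timing impact stays small.
        self._entries.append((time.monotonic(), message))

    def messages(self):
        return [message for _, message in self._entries]

    def dump(self, path):
        """Call this only once the bug has been detected."""
        with open(path, "w", encoding="utf-8") as handle:
            for stamp, message in self._entries:
                handle.write(f"{stamp:.6f} {message}\n")


ring = RingLog(capacity=3)
for step in range(5):
    ring.log(f"step {step}")
print(ring.messages())
```

In a real application the `dump` call would sit in whatever code path detects the bad state, such as an exception handler or a failed consistency check.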
Edit: a few more thoughts: if this is a GUI application, run tests with a QA automation tool which allows you to replay macros. If this is a service-type app, try to come up with at least a guess as to what is happening and then programmatically create 'freak' usage patterns which would exercise the code that you suspect. Create higher than usual loads etc.
What development environment?
For C++, your best bet may be VMware Workstation record/replay, see:
http://stackframe.blogspot.com/2007/04/workstation-60-and-death-of.html
Other suggestions include inspecting the stack trace, and careful code overview... there is really no silver bullet :)
Try to add code in your app to trace the bug automatically once it happens (or even alert you via mail / SMS)
log whatever you can so when it happens you can catch the right system state.
Another thing: try applying automated testing, which can cover more territory than human-based testing and does so in a structured manner. It's a long shot, but a good practice in general.
All of the above, plus throw some brute-force soft-robot at it that is semi-random, and scatter a lot of assert/verify calls (C/C++, probably similar in other languages) through the code.
Tons of logging and careful code review are your only options.
These can be especially painful if the app is deployed and you can't adjust the logging. At that point, your only choice is going through the code with a fine-tooth comb and trying to reason about how the program could enter into the bad state (scientific method to the rescue!)
Often these kinds of bugs are related to corrupted memory, and for that reason they might not appear very often. You should try to run your software under some kind of memory checker, e.g. Valgrind, to see if something goes wrong.
Let’s say I’m starting with a production application.
I typically add debug logging around the areas where I think the bug is occurring. I set up the logging statements to give me insight into the state of the application. Then I have the debug log level turned on and ask the user/operator(s) to notify me of the time of the next bug occurrence. I then analyze the log to see what hints it gives about the state of the application and whether that leads to a better understanding of what could be going wrong.
I repeat step 1 until I have a good idea of where I can start debugging the code in the debugger.
Sometimes the number of iterations of the code running is key, but other times it may be the interaction of a component with an outside system (database, specific user machine, operating system, etc.). Take some time to set up a debug environment that matches the production environment as closely as possible. VM technology is a good tool for solving this problem.
Next I proceed via the debugger. This could include creating a test harness of some sort that puts the code/components in the state I’ve observed from the logs. Knowing how to set up conditional breakpoints can save a lot of time, so get familiar with that and other features within your debugger.
Debug, debug, debug. If you’re going nowhere after a few hours, take a break and work on something unrelated for a while. Come back with a fresh mind and perspective.
If you have gotten nowhere by now, go back to step 1 and make another iteration.
For really difficult problems you may have to resort to installing a debugger on the system where the bug is occurring. That combined with your test harness from step 4 can usually crack the really baffling issues.
Unit tests. Testing a bug in the app is often horrendous because there is so much noise and there are so many variable factors. In general, the bigger the (hay)stack, the harder it is to pinpoint the issue. Creatively extending your unit test framework to embrace edge cases can save hours or even days of sifting.
Having said that there is no silver bullet. I feel your pain.
Add pre- and post-condition checks to the methods related to this bug.
You may want to have a look at Design by Contract.
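As a rough illustration of pre- and post-condition checks, here is a small Python sketch. The `contract` decorator and the `withdraw` function are hypothetical examples, not part of any library; dedicated Design by Contract tooling for your language will offer more than this.

```python
import functools


def contract(pre=None, post=None):
    """Wrap a function with optional precondition and postcondition checks."""

    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if pre is not None and not pre(*args, **kwargs):
                raise AssertionError(
                    f"precondition failed for {func.__name__}: "
                    f"args={args!r} kwargs={kwargs!r}"
                )
            result = func(*args, **kwargs)
            if post is not None and not post(result, *args, **kwargs):
                raise AssertionError(
                    f"postcondition failed for {func.__name__}: result={result!r}"
                )
            return result

        return wrapper

    return decorate


@contract(
    pre=lambda balance, amount: 0 < amount <= balance,
    post=lambda result, balance, amount: result >= 0,
)
def withdraw(balance, amount):
    return balance - amount


print(withdraw(100, 30))
```

The point for a sporadic bug is that the failure is reported at the first call that violates an assumption, with the offending arguments, rather than several steps later when the symptom finally shows.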
Along with a lot of patience, a quiet prayer & cursing you would need:
a good mechanism for logging the user actions
a good mechanism for gathering the data state when the user performs some actions (state in application, database etc.)
Check the server environment (e.g. an anti-virus software running at a particular time etc.) & record the times of the error & see if you can find any trends
some more prayers & cursing...
HTH.
Assuming you're on Windows, and your "bug" is a crash or some sort of corruption in unmanaged code (C/C++), then take a look at Application Verifier from Microsoft. The tool has a number of stops that can be enabled to verify things during runtime. If you have an idea of the scenario where your bug occurs, then try to run through the scenario (or a stress version of the scenario) with AppVerifier running. Make sure to either turn on pageheap in AppVerifier, or consider compiling your code with the /RTCcsu switch (see http://msdn.microsoft.com/en-us/library/8wtf2dfz.aspx for more information).
"Heisenbugs" require great skills to diagnose, and if you want help from people here you have to describe this in much more detail, and patiently listen to various tests and checks, report result here, and iterate this till you solve it (or decide it is too expensive in terms of resources).
You will probably have to tell us your actual situation, language, DB, operating system, workload estimate, the time of day it happened in the past, and a myriad of other things; list the tests you have already done and how they went; and be ready to do more and share the results.
And this will not guarantee that we collectively can find it, either...
I'd suggest writing down everything the user has been doing. If you have, let's say, 10 such bug reports, you can try to find something that connects them.
Read the stack trace carefully and try to guess what could have happened;
then try to trace/log every line of code that could potentially cause trouble.
Keep your focus on disposing of resources; many sneaky sporadic bugs I found were related to close/dispose calls :).
For .NET projects you can use Elmah (Error Logging Modules and Handlers) to monitor your application for uncaught exceptions. It's very simple to install and provides a very nice interface for browsing unknown errors.
http://code.google.com/p/elmah/
This saved me just today in catching a very random error that was occurring during a registration process.
Other than that I can only recommend trying to get as much information from your users as possible and having a thorough understanding of the project workflow
They mostly come out at night....
mostly
The team that I work with has enlisted the users in recording their time they spend in our app with CamStudio when we've got a pesky bug to track down. It's easy to install and for them to use, and makes reproducing those nagging bugs much easier, since you can watch what the users are doing. It also has no relationship to the language you're working in, since it's just recording the windows desktop.
However, this route seems to be viable only if you're developing corporate apps and have good relationships with your users.
This varies (as you say), but some of the things that are handy with this can be
immediately going into the debugger when the problem occurs and dumping all the threads (or the equivalent, such as dumping the core immediately or whatever.)
running with logging turned on but otherwise entirely in release/production mode. (This is possible in some environments like C and Rails, but not many others.)
do stuff to make the edge conditions on the machine worse... force low memory / high load / more threads / serving more requests
Making sure that you're actually listening to what the users encountering the problem are actually saying. Making sure that they're actually explaining the relevant details. This seems to be the one that breaks people in the field a lot. Trying to reproduce the wrong problem is boring.
Get used to reading assembly that was produced by optimizing compilers. This seems to stop people sometimes, and it isn't applicable to all languages/platforms, but it can help
Be prepared to accept that it is your (the developer's) fault. Don't get into the trap of insisting the code is perfect.
sometimes you need to actually track the problem down on the machine it is happening on.
@p.marino - not enough rep to comment =/
tl;dr - build failures due to time of day
You mentioned time of day and that caught my eye. We had a bug once where someone stayed late at work one night, tried to build and commit before they left, and kept getting a failure. They eventually gave up and went home. When they came in the next morning it built fine, they committed (probably should have been more suspicious =] ) and the build worked for everyone. A week or two later someone else stayed late and had an unexpected build failure. It turns out there was a bug in the code that made any build after 7PM break >.>
We also found a bug in one seldom-used corner of the project this January that caused problems marshalling between different schemas, because we were not accounting for the different calendars being 0- AND 1-based for months. If no one had touched that part of the project, we possibly wouldn't have found the bug until Jan. 2011.
These were easier to fix than threading issues, but still interesting I think.
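Bugs like these two only show up when the clock crosses some boundary, so one way to hunt for them is to pass the current time in as a parameter and sweep it through every hour and month in a test. A minimal Python sketch follows; `build_label` and `breaks_after_7pm` are hypothetical stand-ins for whatever time-dependent code is under suspicion, not the code from the stories above.

```python
from datetime import datetime
from typing import Callable, List


def build_label(now: datetime) -> str:
    """Hypothetical time-dependent code under test.

    It receives the current time as an argument instead of calling
    datetime.now() itself, so a test can feed it any moment it likes.
    """
    # datetime.month is 1-based (1..12); a 0-based source would need +1 first.
    return f"{now.year:04d}-{now.month:02d}-{now.day:02d}T{now.hour:02d}"


def breaks_after_7pm(now: datetime) -> str:
    """Deliberately buggy example: fails for any time from 19:00 onwards."""
    if now.hour >= 19:
        raise ValueError("simulated evening-only failure")
    return build_label(now)


def sweep_clock(func: Callable[[datetime], object], year: int = 2010) -> List[datetime]:
    """Call func at every hour of the first day of every month.

    Returns the moments at which func raised an exception.
    """
    failures = []
    for month in range(1, 13):
        for hour in range(24):
            moment = datetime(year, month, 1, hour)
            try:
                func(moment)
            except Exception:
                failures.append(moment)
    return failures
```

Running `sweep_clock(breaks_after_7pm)` returns only evening timestamps, which points straight at the time-of-day dependency without anyone having to stay late.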
Hire some testers!
This has worked for really weird heisenbugs.
(I'd also recommend getting a copy of "Debugging" by Dave Agans; these ideas are partly derived from using his ideas!)
(0) Check the RAM of the system using something like Memtest86!
The whole system exhibits the problem, so make a test jig that exercises the whole thing.
Say it's a server side thing with a GUI, you run the whole thing with a GUI test framework doing the necessary input to provoke the problem.
It doesn't fail 100% of the time, so you have to make it fail more often.
Start by cutting the system in half (binary chop).
Worst case, you have to remove subsystems one at a time.
Stub them out if they can't be commented out.
See if it still fails. Does it fail more often ?
Keep proper test records, and only change one variable at a time!
Worst case you use the jig and you test for weeks to get meaningful statistics. This is HARD; but remember, the jig is doing the work.
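The record keeping can be as simple as a loop that runs the scenario many times per configuration and reports the failure rate. Here is a minimal Python sketch; `make_flaky_run` is a hypothetical stand-in for the real jig, and the configuration names are made up for illustration.

```python
import random
from typing import Callable, Dict


def failure_rate(run_once: Callable[[], bool], runs: int) -> float:
    """Run the scenario `runs` times and return the fraction that failed.

    run_once returns True on success and False on failure; an exception
    is counted as a failure too.
    """
    failed = 0
    for _ in range(runs):
        try:
            ok = run_once()
        except Exception:
            ok = False
        if not ok:
            failed += 1
    return failed / runs


def compare_configs(make_run: Callable[..., Callable[[], bool]],
                    configs: Dict[str, dict], runs: int) -> Dict[str, float]:
    """Measure the failure rate of each named configuration.

    Each configuration should differ from the baseline in one variable only,
    so that a change in the rate can be attributed to that variable.
    """
    return {name: failure_rate(make_run(**cfg), runs) for name, cfg in configs.items()}


def make_flaky_run(fail_probability: float, seed: int = 0) -> Callable[[], bool]:
    """Hypothetical stand-in for the real system: fails at a fixed rate."""
    rng = random.Random(seed)
    return lambda: rng.random() >= fail_probability
```

A call such as `compare_configs(make_flaky_run, {"baseline": {"fail_probability": 0.1}, "subsystem stubbed": {"fail_probability": 0.3}}, runs=1000)` gives one number per configuration to write down in the test record.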
I've got no threads and only one process, and I don't talk to hardware
If the system has no threads, no communicating processes, and contacts no hardware, it's tricky. Heisenbugs are generally synchronization problems, but in the no-thread, no-process case the cause is more likely to be uninitialized data, or data used after being released, either on the heap or the stack. Try a checker like Valgrind.
For threaded/multi-process problems:
Try running it on a different number of CPUs. If it's running on 1, try on 4! Try forcing a 4-computer system onto 1.
It'll mostly ensure things happen one at a time.
If there are threads or communicating processes this can shake out bugs.
If this is not helping but you suspect it's synchronization or threading, try changing the OS time-slice size.
Make it as fine as your OS vendor allows!
Sometimes this has made race conditions happen almost every time!
Conversely, try going slower on the timeslices.
Then you set the test jig running with debugger(s) attached all over the place and wait for the test jig to stop on a fault.
If all else fails, put the hardware in the freezer and run it there. The timing of everything will be shifted.
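As a rough in-process illustration of the CPU-count and time-slice ideas, here is a Python sketch. It rests on two assumptions: `os.sched_setaffinity` is only available on some platforms (Linux, for example), and `sys.setswitchinterval` changes how often the CPython interpreter switches threads, which is an analogue of the OS time slice rather than the OS setting itself. `Counter` is a hypothetical piece of shared state used to have something to race on.

```python
import contextlib
import os
import sys
import threading
from typing import Callable, Iterator


@contextlib.contextmanager
def single_cpu() -> Iterator[bool]:
    """Pin this process to one CPU where the platform allows it.

    Yields True if the pin was applied, False if the call is unavailable.
    The original affinity is restored on exit.
    """
    if not hasattr(os, "sched_setaffinity"):
        yield False
        return
    original = os.sched_getaffinity(0)
    os.sched_setaffinity(0, {min(original)})
    try:
        yield True
    finally:
        os.sched_setaffinity(0, original)


def stress(worker: Callable[[], None], check: Callable[[], bool],
           threads: int = 4, switch_interval: float = 1e-5) -> bool:
    """Run worker in several threads with a very short switch interval.

    Returns the result of check() once every thread has finished.
    """
    previous = sys.getswitchinterval()
    sys.setswitchinterval(switch_interval)
    try:
        pool = [threading.Thread(target=worker) for _ in range(threads)]
        for t in pool:
            t.start()
        for t in pool:
            t.join()
    finally:
        sys.setswitchinterval(previous)
    return check()


class Counter:
    """Hypothetical shared state; without the lock the update is racy."""

    def __init__(self, use_lock: bool) -> None:
        self.value = 0
        self._lock = threading.Lock() if use_lock else None

    def bump(self, times: int = 2000) -> None:
        for _ in range(times):
            if self._lock is not None:
                with self._lock:
                    self.value += 1
            else:
                current = self.value      # read
                self.value = current + 1  # write; another thread may run in between
```

With `c = Counter(use_lock=False)`, a call like `stress(c.bump, lambda: c.value == 4 * 2000)` may start returning False far more often than it would at the default interval, though how often depends on the interpreter version and the machine. Wrapping the call in `with single_cpu():` tries the same run on one CPU.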
Debugging is hard and time consuming especially if you are unable to deterministically reproduce the problem. My advice to you is to find out the steps to reproduce it deterministically (not just sometimes).
There has been a lot of research in the field of failure reproduction in recent years, and the field is still very active. Record & Replay techniques have been (so far) the direction most researchers have taken. This is what you need to do:
1) Analyze the source code and determine what are the sources of non-determinism in the application, that is, what are the aspects that may take your application through different execution paths (e.g. user input, OS signals)
2) Log them the next time you execute the application
3) When your application fails again, you have the steps-to-reproduce the failure in your log.
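The three steps above can be sketched in a few lines of Python. This is a toy illustration, not one of the research tools; `NondeterminismLog` and `flaky_step` are hypothetical names, and the only sources of non-determinism assumed here are a random number and the clock.

```python
import json
import random
import time
from typing import Any, Callable, List, Optional, Tuple


class NondeterminismLog:
    """Records every non-deterministic value, or replays an earlier recording."""

    def __init__(self, replay_from: Optional[List[Any]] = None) -> None:
        self.replaying = replay_from is not None
        self.events: List[Any] = list(replay_from) if replay_from is not None else []
        self._cursor = 0

    def capture(self, label: str, source: Callable[[], Any]) -> Any:
        """In record mode call source() and log it; in replay mode return the logged value."""
        if self.replaying:
            recorded_label, value = self.events[self._cursor]
            if recorded_label != label:
                raise RuntimeError(
                    f"replay diverged: expected {recorded_label!r}, got {label!r}")
            self._cursor += 1
            return value
        value = source()
        self.events.append([label, value])
        return value

    def dump(self) -> str:
        """Serialize the recording so it can be attached to a bug report."""
        return json.dumps(self.events)


def flaky_step(log: NondeterminismLog) -> Tuple[str, float, float]:
    """Hypothetical application code whose path depends on outside inputs."""
    roll = log.capture("random", random.random)
    stamp = log.capture("clock", time.time)
    path = "slow path" if roll < 0.5 else "fast path"
    return path, roll, stamp
```

Recording with `log = NondeterminismLog()` and later constructing `NondeterminismLog(replay_from=json.loads(log.dump()))` takes `flaky_step` down the same path with the same values as the recorded run.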
If your log still does not reproduce the failure, then you are dealing with a concurrency bug. In that case, you should take a look at how your application accesses shared variables. Do not attempt to record the accesses to shared variables, because you would be logging too much data, thereby causing severe slowdowns and large logs. Unfortunately, there is not much I can say that would help you to reproduce concurrency bugs, because research still has a long way to go in this subject. The best I can do is to provide a reference to the most recent advance (so far) in the topic of deterministic replay of concurrency bugs:
http://www.gsd.inesc-id.pt/~nmachado/software/Symbiosis_Tutorial.html
Best regards
Use an enhanced crash reporter. In the Delphi environment, we have EurekaLog and MadExcept. Other tools exist in other environments. Or you can diagnose the core dump. You're looking for the stack trace, which will show you where it's blowing up, how it got there, what's in memory, etc. It's also useful to have a screenshot of the app, if it's a user-interaction thing, and info about the machine that it crashed on (OS version and patch level, what else was running at the time, etc.). Both of the tools that I mentioned can do this.
If it's something that happens with a few users but you can't reproduce it, and they can, go sit with them and watch. If it's not apparent, switch seats - you "drive", and they tell you what to do. You'll uncover the subtle usability issues that way: double-clicks on a single-click button, for example, initiating re-entrancy in the OnClick event. That sort of thing. If the users are remote, use WebEx, Wink, etc., to record them crashing it, so you can analyze the playback.
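Where no such tool is at hand, the core of the idea (stack trace plus machine info, captured at the moment of the crash) can be approximated by hand. Here is a minimal Python sketch, not a replacement for EurekaLog or MadExcept; the report fields and the `crash_report.json` file name are assumptions made for illustration.

```python
import json
import platform
import sys
import traceback
from datetime import datetime, timezone


def build_crash_report(exc_type, exc_value, exc_tb) -> dict:
    """Collect the stack trace and basic machine info for one exception."""
    return {
        "when": datetime.now(timezone.utc).isoformat(),
        "exception": f"{exc_type.__name__}: {exc_value}",
        "stack": traceback.format_exception(exc_type, exc_value, exc_tb),
        "os": platform.platform(),
        "python": platform.python_version(),
        "argv": list(sys.argv),
    }


def install_crash_reporter(path: str = "crash_report.json") -> None:
    """Write a report to `path` for any uncaught exception, then print as usual."""

    def hook(exc_type, exc_value, exc_tb):
        report = build_crash_report(exc_type, exc_value, exc_tb)
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(report, fh, indent=2)
        # Fall through to the default handler so the traceback still prints.
        sys.__excepthook__(exc_type, exc_value, exc_tb)

    sys.excepthook = hook
```

Calling `install_crash_reporter()` once at startup is enough; the file it writes is what you would ask the user to send back.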

Resources