Application health monitoring in Linux - embedded-linux

Systemd will create a process & which in-turn creates many other applications/processes at startup, that need to run on my embedded device.
Is there any way we can add a piece of code in all the applications such that systemd will exchange 'heartbeat' & will know if some application is hung or not
Is there some examples which I can refer & understand?

The answer is yes, see this superuser post:
Yes; but first fix your buggy program before fiddling with systemd.
MariusMatutiae is quite correct. You have a problem with your program.
It deadlocks. Fiddling with systemd isn't the answer. At best, it's a
distraction. Fix your program so that it isn't broken. Direct your
energies at the right thing.
That said, other people are going to come here because of the question
title, rather than the question proper. For their benefit, here's the
answer to the title, ignoring the question proper:
Yes, systemd can monitor dæmons and automatically restart them if they
stop talking. Not just any old dæmons, though. As mvp notes, there's
no way to know that a dæmon has hung (in this universe, where the
halting problem is undecidable, at least). Neither systemd nor any
other computer program will ever be capable of deducing from scratch
that some random program thrown at them has deadlocked, or gone into
an infinite loop, or whatever. The best that you'll get here is
detecting that a dæmon hasn't performed a regular "heartbeat"
operation within a required timespan.
Dæmons that take advantage of systemd's watchdog capabilities,
therefore, have to be written to speak a systemd-specific protocol,
the sd_notify protocol. This complicates the dæmon code a tad. It's
complicated further because dæmons should, if written properly, check
whether they've been invoked with the watchdog function enabled, as
well.
A dæmon that speaks this protocol to make use of systemd's watchdog
capability …
… must check for the WATCHDOG_USEC environment variable;
… must call sd_notify() continually and frequently, throughout its lifetime, with the WATCHDOG=1 option set, at an interval of about WATCHDOG_USEC/2 ("USEC" stands for microseconds) ;
… must have Type=notify set in its unit file;
… should have NotifyAccess=main (or =all) set in its unit file;
… must have WatchdogSec=seconds set in its unit file.
… must link with libsystemd-daemon.so If you want to know the details of coding this, after reading the manual, make sure that you go to the right StackExchange. This is SuperUser. StackOverflow is over there.
Further reading
Lennart Poettering. 2011-04-12. Watchdogs. Freedesktop.org.

Related

Windows: redirect ReadFile to run process and pipe it's stdout

I was wondering how hard it would be to create a set-up under Windows where a regular ReadFile on certain files is being redirected by the file system to actually run (e.g. ShellExecute) those files, and then the new process' stdout is being used as the file content streamed out to the ReadFile call to the callee...
What I envision the set-up to look like, is that you can configure it to denote a certain folder as 'special', and that this extra functionality is then only available on that folder's content (so it doesn't need to be disk-wide). It might be accessible under a new drive letter, or a path parallel to the source folder; the location it is hooked up to is irrelevant to me.
To those of you that wonder if this is a classic xy problem: it might very well be ;) It's just that this idea has intrigued me, and I want to know what possibilities there are. In my particular case I want to employ it to #include content in my C++ code base, where the actual content included is being made up on the spot, different on each compile round. I could of course also create a script to create such content to include, call it as a pre-build step and leave it at that, but why choose the easy route.
Maybe there are already ready-made solutions for this? I did an extensive Google search for it, but came out empty handed. But then I'm not sure I already know all the keywords involved to do a good search...
When coding up something myself, I think a minifilter driver might be needed intercepting ReadFile calls, but then it must at that spot run usermode apps from kernel space - not a happy marriage I assume. Or use an existing file system driver framework that allows for usermode parts, but I found the price of existing solutions to be too steep for my taste (several thousand dollars).
And I also assume that a standard file system (minifilter) driver might be required to return a consistent file size for such files, although the actual data size returned through ReadFile would of course differ on each call. Not to mention negating any buffering that takes place.
All in all I think that a create-it-yourself solution will take quite some effort, especially when you have never done Windows driver development in your life :) Although I see myself quite capable of learning up on it, the time invested will be prohibitive I think.
Another approach might be to hook ReadFile calls from the process doing the ReadFile - via IAT hooking, or via code injection. But I want this solution to more work 'out-of-the-box', i.e. all ReadFile requests for these special files trigger the correct behavior, regardless of origin. In my case I'd need to intercept my C++ compiler (G++) behavior, but that one is called on the fly by the IDE, so I see no easy way to detect it's startup and hook it up quickly before it does it's ReadFiles. And besides, I only want certain files to be special in this regard; intercepting all ReadFiles for a certain process is overkill.
You want something like FUSE (which I used with profit many times), but for Windows. Apparently there's Dokan, I've never used it but seems to be well known enough (and, at very least, can be used as an inspiration to see "how it's done").

Character device driver hangs the system - how to avoid?

I'm writing a simple writable character device driver (2.6.32-358.el6.x86_64, under VirtualBox), and since it's not mature yet, it tends to crash/freeze (segfaults, infinite loops).
I'm testing it like this: $> echo "some data" > /dev/my_dev, and if crash/freeze occurs, the whole system (VirtualBox) freezes. I tried to move all the work to another kernel thread to avoid the system-wide freeze, but it doesn't help.
Is it possible to "isolate" such a crash/freeze, so that I'd be able to kill the process, in whose context the kernel module runs?
The module runs in kernel context. That's why debugging it is difficult and bugs can easily crash the system. Infinite loop is not really an issue as it just slows the system down, but doesn't cause a crash. Writing to the wrong memory region however is fatal.
If you are lucky, you would get a kernel oops before the freeze. If you test your code in one of the TTYs, rather than the GUI, then you might immediately see the oops (kernel BUG log) on the screen which you can study and might be helpful to you.
In my experience however, it's best to write and test the kernel-independent code in user-space, probably with mock functions and test it heavily, run valgrind on it, and make sure it doesn't have bugs. Then use it in kernel space. You'd be surprised at how much of a kernel module's code may in fact not need kernel context at all. Of course this very much depends on the functionality of the kernel module.
To actually debug the code in kernel space, there are tools which I have never used, such as kgdb. What I do myself usually is a mixture of printks and binary search. That is, if the crash is so severe that the kernel oops is not shown at all. First, I put printk (possibly with a delay after) in different places to see which parts of the code are reached before the oops. tail -f /var/log/messages comes in handy. Then, I do binary search; disable half of the code to see if the crash occurs. If not, possibly the problem is in the second half. If it occurs, surely the problem is in the first half. Repeat!
The ultimate way of writing a bug-free kernel module is to write code that doesn't have bugs in the first place. Of course, this is rarely possible, but if you write clean and undefined-behavior-free C code and write very concise functions whose correctness is obvious and you pay attention to the boundaries of arrays, it's not that hard.

Can an OS X application examine the launchd plist which invoked it?

I'm writing an application which will potentially be invoked by more than one launchd job – that is, there might be two distinct LaunchAgent .plist files which invoke the application with different program arguments or in different conditions. I'd like the application to be able to examine the job, or the job's .plist, so it can adjust its behaviour based on what it finds there.
In particular, supposing a program foo could be started up from both A.plist and B.plist, I'd like the program to be able to preserve different state depending on which job/plist invoked it. If all I can do is detect the (presumed distinct) Label of the job, that will be enough (though more would be better).
The obvious ways to do this are using different flags in the ProgramArguments array in the job, or to set different values in EnvironmentVariables, but both of those feel fragile, both imply duplication of bits of configuration, and both require extra documentation (“copy the value of the Label into EnvironmentVariables field FOO...; don't ask why”).
I can see the function SMJobCopyDictionary. With that, it appears that I can get access to the job's dictionary – ie, this information is available in principle – but I need to know the job's label first. Function SMCopyAllJobDictionaries allows me to iterate through all of the jobs, but it's not obvious how I'd find the one which invoked a particular instance of the application.
Googling launchd read job label or launchd self dict (or similar) don't come up with anything useful.
Take a look at the SampleD code on Apple's site. This code shows how daemon's can access calling launchd information.
Looking at the launch.h header, I suspect LAUNCH_JOBKEY_LABEL is what you are after.
Using the ProgramArguments is, in fact, probably the best way to handle this. It will make your code more portable (e.g. making it possible to launch in Linux, for example, or under a different launch daemon) to use program arguments that configure the behavior rather than relying on the program invocation chain.
With regard to program arguments, though, I would suggest not copying the label, but rather using the specific flags that differ between the jobs. For example, you might set --checkpoint-file and --log-file in both A.plist and B.plist to files that happen to include "A" or "B" in their names. Doing this will make the application code that depends on --checkpoint-file and --log-file much clearer, though, than if it were to make arbitrary choices based on some label and will also allow you to run the command directly from the commandline and have it still work despite there not being an invoking PLIST file.
Summary: I think the answer here is a qualified no.
Graham Miln's answer (which I've accepted) shows that this is possible in fact,
but the interface he mentions is apparently documented nowhere other than in
this sample code, and also possibly in Mac OS X Internals: A
Systems Approach (Amit Singh, http://osxbook.com, thanks to Mitchell J
Laurren-Ring for pointing this out).
I also asked about this on the launchd-dev list (see the thread starting with
my
question),
and Damien Sorresso suggested that this interface was “really only so that jobs
can check in with launchd to obtain listening sockets. They were way too general
at the outset.”
Put together, I'm left with the feeling that, while the behaviour I
want seems possible, and while I think it's neater in this case than
passing more state in environment variables, the API to support such
behaviour is, if not deprecated, than at least somewhat discouraged,
and at any rate not idiomatic.

How do you reproduce bugs that occur sporadically?

We have a bug in our application that does not occur every time and therefore we don't know its "logic". I don't even get it reproduced in 100 times today.
Disclaimer: This bug exists and I've seen it. It's not a pebkac or something similar.
What are common hints to reproduce this kind of bug?
Analyze the problem in a pair and pair-read the code. Make notes of the problems you KNOW to be true and try to assert which logical preconditions must hold true for this happen. Follow the evidence like a CSI.
Most people instinctively say "add more logging", and this may be a solution. But for a lot of problems this just makes things worse, since logging can change timing-dependencies sufficiently to make the problem more or less frequent. Changing the frequency from 1 in 1000 to 1 in 1,000,000 will not bring you closer to the true source of the problem.
So if your logical reasoning does not solve the problem, it'll probably give you a few specifics you could investigate with logging or assertions in your code.
There is no general good answer to the question, but here is what I have found:
It takes a talent for this kind of thing. Not all developers are best suited for it, even if they are superstars in other areas. So know your team, who has a talent for it, and hope you can give them enough candy to get them excited about helping you out, even if it isn't their area.
Work backwards, and treat it like a scientific investigation. Start with the bug, what you see is wrong. Develop hypotheses about what could cause it (this is the creative/imaginative part, the art that not everyone has the talent for) - and it helps a lot to know how the code works. For each of those hypotheses (preferably sorted by what you think is most likely - again pure gut feel here), develop a test that tries to eliminate it as the cause, and test the hypothesis. Any given failure to meet a prediction doesn't mean the hypothesis is wrong. Test the hypothesis until it is confirmed to be wrong (although as it gets less likely you may want to move on to another hypothesis first, just don't discount this one until you have a definitive failure).
Gather as much data as you can during this process. Extensive logging and whatever else is applicable. Do not discount a hypothesis because you lack the data, rather remedy the lack of data. Quite often the inspiration for the right hypothesis comes from examining the data. Noticing something off in a stack trace, weird issue in a log, something missing that should be there in a database, etc.
Double check every assumption. So many times I have seen an issue not get fixed quickly because some general method call was not further investigated, so the problem was just assumed to be not applicable. "Oh that, that should be simple." (See point 1).
If you run out of hypotheses, that is generally caused by insufficient knowledge of the system (this is true even if you wrote every line of code yourself), and you need to run through and review code and gain additional insight into the system to come up with a new idea.
Of course, none of the above guarantees anything, but that is the approach that I have found gets results consistently.
Add some sort of logging or tracing. For example log the last X actions the user committed before causing the bug (only if you can set a condition to match bug).
It's quite common for programmers not to be able to reiterate a user-experienced crash simply because you have developed a certain workflow and habits in using the application that obviously goes around the bug.
At this frequency of 1/100, I'd say that the first thing to do is to handle exceptions and log anything anywhere or you could be spending another week hunting this bug.
Also make a priority list of potentially sensitive articulations and features in your project. For example :
1 - Multithreading
2 - Wild pointers/ loose arrays
3 - Reliance on input devices
etc.
This will help you segment areas that you can brute-force-until-break-again as suggested by other posters.
Since this is language-agnostic, I'll mention a few axioms of debugging.
Nothing a computer ever does is random. A 'random occurrence' indicates a as-yet-undiscovered pattern. Debugging begins with isolating the pattern. Vary individual elements and assess what makes a change in the behaviour of the bug.
Different user, same computer?
Same user, different computer?
Is the occurrence strongly periodic? Does rebooting change the periodicity?
FYI- I once saw a bug that was experienced by a single person. I literally mean person, not a user account. User A would never see the problem on their system, User B would sit down at that workstation, signed on as User A and could immediately reproduce the bug. There should be no conceivable way for the app to know the difference between the physical body in the chair. However-
The users used the app in different ways. User A habitually used a hotkey to to invoke a action and User B used an on-screen control. The difference in the user behaviour would cascade into a visible error a few actions later.
ANY difference that effects the behaviour of the bug should be investigated, even if it makes no sense.
There's a good chance your application is MTWIDNTBMT (Multi Threaded When It Doesn't Need To Be Multi Threaded), or maybe just multi-threaded (to be polite). A good way to reproduce sporadic errors in multi-threaded applications is to sprinkle code like this around (C#):
Random rnd = new Random();
System.Threading.Thread.Sleep(rnd.Next(2000));
and/or this:
for (int i = 0; i < 4000000000; i++)
{
// tight loop
}
to simulate threads completing their tasks at different times than usual or tying up the processor for long stretches.
I've inherited many buggy, multi-threaded apps over the years, and code like the above examples usually makes the sporadic errors occur much more frequently.
Add verbose logging. It will take multiple -- sometimes dozen(s) -- iterations to add enough logging to understand the scenario.
Now the problem is that if the problem is a race condition, which is likely if it doesn't reproduce reliably, so logging can change timing and the problem will stop happening. In this case do not log to a file, but keep a rotating buffer of the log in memory and only dump it on disk when you detect that the problem has occurred.
Edit: a little more thoughts: if this is a gui application run tests with a qa automation tool which allows you to replay macros. If this is a service-type app, try to come up with at least a guess as to what is happening and then programmatically create 'freak' usage patterns which would exercise the code that you suspect. Create higher than usual loads etc.
What development environment?
For C++, your best bet may be VMWare Workstation record/replay, see:
http://stackframe.blogspot.com/2007/04/workstation-60-and-death-of.html
Other suggestions include inspecting the stack trace, and careful code overview... there is really no silver bullet :)
Try to add code in your app to trace the bug automatically once it happens (or even alert you via mail / SMS)
log whatever you can so when it happens you can catch the right system state.
Another thing- try applying automated testing that can cover more territory than human based testing in a formed manner.. it's a long shot, but a good practice in general.
all the above, plus throw some brute force soft-robot at it that is semi random, and scater a lot of assert/verify (c/c++, probably similar in other langs) through the code
Tons of logging and careful code review are your only options.
These can be especially painful if the app is deployed and you can't adjust the logging. At that point, your only choice is going through the code with a fine-tooth comb and trying to reason about how the program could enter into the bad state (scientific method to the rescue!)
Often these kind of bugs are related to corrupted memory and for that reason they might not appear very often. You should try to run your software with some kind of memory profiler e.g., valgrind, to see if something goes wrong.
Let’s say I’m starting with a production application.
I typically add debug logging around the areas where I think the bug is occurring. I setup the logging statements to give me insight into the state of the application. Then I have the debug log level turned on and ask the user/operator(s) notify me of the time of the next bug occurrence. I then analyze the log to see what hints it gives about the state of the application and if that leads to a better understanding of what could be going wrong.
I repeat step 1 until I have a good idea of where I can start debugging the code in the debugger
Sometimes the number of iterations of the code running is key but other times it maybe the interaction of a component with an outside system (database, specific user machine, operating system, etc.). Take some time to setup a debug environment that matches the production environment as closely as possible. VM technology is a good tool for solving this problem.
Next I proceed via the debugger. This could include creating a test harness of some sort that puts the code/components in the state I’ve observed from the logs. Knowing how to setup conditional break points can save a lot of time, so get familiar with that and other features within your debugger.
Debug, debug , debug. If you’re going nowhere after a few hours, take a break and work on something unrelated for awhile. Come back with a fresh mind and perspective.
If you have gotten nowhere by now, go back to step 1 and make another iteration.
For really difficult problems you may have to resort to installing a debugger on the system where the bug is occurring. That combined with your test harness from step 4 can usually crack the really baffling issues.
Unit Tests. Testing a bug in the app is often horrendous because there is so much noise, so many variable factors. In general the bigger the (hay)stack, the harder it is to pinpoint the issue. Creatively extending your unit test framework to embrace edge cases can save hours or even days of sifting
Having said that there is no silver bullet. I feel your pain.
Add pre and post condition check in methods related to this bug.
You may have a look at Design by contract
Along with a lot of patience, a quiet prayer & cursing you would need:
a good mechanism for logging the user actions
a good mechanism for gathering the data state when the user performs some actions (state in application, database etc.)
Check the server environment (e.g. an anti-virus software running at a particular time etc.) & record the times of the error & see if you can find any trends
some more prayers & cursing...
HTH.
Assuming you're on Windows, and your "bug" is a crash or some sort of corruption in unmanaged code (C/C++), then take a look at Application Verifier from Microsoft. The tool has a number of stops that can be enabled to verify things during runtime. If you have an idea of the scenario where your bug occurs, then try to run through the scenario (or a stress version of the scenario) with AppVerifer running. Make sure to either turn on pageheap in AppVerifier, or consider compiling your code with the /RTCcsu switch (see http://msdn.microsoft.com/en-us/library/8wtf2dfz.aspx for more information).
"Heisenbugs" require great skills to diagnose, and if you want help from people here you have to describe this in much more detail, and patiently listen to various tests and checks, report result here, and iterate this till you solve it (or decide it is too expensive in terms of resources).
You will probably have to tell us your actual situation, language, DB, operative system, workload estimate, time of the day it happened in the past, and a myriad of other things, list tests you did already, how they went, and be ready to do more and share the results.
And this will not guarantee that we collectively can find it, either...
I'd suggest to write down all things that user has been doing. If you have lets say 10 such bug reports You can try to find something that connects them.
Read the stack trace carefully and try to guess what could be happened;
then try to trace\log every line of code that potentially can cause trouble.
Keep your focus on disposing resources; many sneaky sporadical bugs i found were related to close\dispose things :).
For .NET projects You can use Elmah (Error Logging Modules and Handlers) to monitor you application for un-caught exceptions, it's very simple to install and provides a very nice interface to browse unknown errors
http://code.google.com/p/elmah/
This saved me just today in catching a very random error that was occuring during a registration process
Other than that I can only recommend trying to get as much information from your users as possible and having a thorough understanding of the project workflow
They mostly come out at night....
mostly
The team that I work with has enlisted the users in recording their time they spend in our app with CamStudio when we've got a pesky bug to track down. It's easy to install and for them to use, and makes reproducing those nagging bugs much easier, since you can watch what the users are doing. It also has no relationship to the language you're working in, since it's just recording the windows desktop.
However, this route seems to be viable only if you're developing corporate apps and have good relationships with your users.
This varies (as you say), but some of the things that are handy with this can be
immediately going into the debugger when the problem occurs and dumping all the threads (or the equivalent, such as dumping the core immediately or whatever.)
running with logging turned on but otherwise entirely in release/production mode. (This is possible in some random environments like c and rails but not many others.)
do stuff to make the edge conditions on the machine worse... force low memory / high load / more threads / serving more requests
Making sure that you're actually listening to what the users encountering the problem are actually saying. Making sure that they're actually explaining the relevant details. This seems to be the one that breaks people in the field a lot. Trying to reproduce the wrong problem is boring.
Get used to reading assembly that was produced by optimizing compilers. This seems to stop people sometimes, and it isn't applicable to all languages/platforms, but it can help
Be prepared to accept that it is your (the developer's) fault. Don't get into the trap of insisting the code is perfect.
sometimes you need to actually track the problem down on the machine it is happening on.
#p.marino - not enough rep to comment =/
tl;dr - build failures due to time of day
You mentioned time of day and that caught my eye. Had a bug once were someone stayed later at work on night, tried to build and commit before they left and kept getting a failure. They eventually gave up and went home. When they caught in the next morning it built fine, they committed (probably should have been more suspiscious =] ) and the build worked for everyone. A week or two later someone stayed late and had an unexpected build failure. Turns out there was a bug in the code that made any build after 7PM break >.>
We also found a bug in one seldom used corner of the project this january that caused problems marshalling between different schemas because we were not accounting for the different calendars being 0 AND 1 month based. So if no one had messed with that part of the project we wouldn't have possibly found the bug until jan. 2011
These were easier to fix than threading issues, but still interesting I think.
hire some testers!
This has worked for really weird heisenbugs.
(I'd also recommend getting a copy of "Debugging" by Dave Argans, these ideas are partly derived form using his ideas!)
(0) Check the ram of the system using something like Memtest86!
The whole system exhibits the problem, so make a test jig that exercises the whole thing.
Say it's a server side thing with a GUI, you run the whole thing with a GUI test framework doing the necessary input to provoke the problem.
It doesn't fail 100% of the time, so you have to make it fail more often.
Start by cutting the system in half ( binary chop)
worse case, you have to remove sub-systems one at a time.
stub them out if they can't be commented out.
See if it still fails. Does it fail more often ?
Keep proper test records, and only change one variable at a time!
Worst case you use the jig and you test for weeks to get meaningful statistics. This is HARD; but remember, the jig is doing the work.
I've got No threads and only one process, and I don't talk to hardware
If the system has no threads, no communicating processes and contacts no hardware; it's tricky; heisenbugs are generally synchronization, but in the no-thread no processes case it's more likely to be uninitialized data, or data used after being released, either on the heap or the stack. Try to use a checker like valgrind.
For threaded/multi-process problems:
Try running it on a different number of CPU's. If it's running on 1, try on 4! Try forcing a 4-computer system onto 1.
It'll mostly ensure things happen one at a time.
If there are threads or communicating processes this can shake out bugs.
If this is not helping but you suspect it's synchronization or threading, try changing the OS time-slice size.
Make it as fine as your OS vendor allows!
Sometimes this has made race conditions happen almost every time!
Obversely, try going slower on the timeslices.
Then you set the test jig running with debugger(s) attached all over the place and wait for the test jig to stop on a fault.
If all else fails, put the hardware in the freezer and run it there. The timing of everything will be shifted.
Debugging is hard and time consuming especially if you are unable to deterministically reproduce the problem. My advice to you is to find out the steps to reproduce it deterministically (not just sometimes).
There has been a lot of research in the field of failure reproduction in the past years and is still very active. Record&Replay techniques have been (so far) the research direction of most researchers. This is what you need to do:
1) Analyze the source code and determine what are the sources of non-determinism in the application, that is, what are the aspects that may take your application through different execution paths (e.g. user input, OS signals)
2) Log them in the next time you execute the application
3) When your application fails again, you have the steps-to-reproduce the failure in your log.
If your log still does not reproduce the failure, then you are dealing with a concurrency bug. In that case, you should take a look at how your application accesses shared variables. Do not attempt to record the accesses to shared variables, because you would be logging too much data, thereby causing severe slowdowns and large logs. Unfortunately, there is not much I can say that would help you to reproduce concurrency bugs, because research still has a long way to go in this subject. The best I can do is to provide a reference to the most recent advance (so far) in the topic of deterministic replay of concurrency bugs:
http://www.gsd.inesc-id.pt/~nmachado/software/Symbiosis_Tutorial.html
Best regards
Use an enhanced crash reporter. In the Delphi environment, we have EurekaLog and MadExcept. Other tools exist in other environments. Or you can diagnose the core dump. You're looking for the stack trace, which will show you where it's blowing up, how it got there, what's in memory, etc.. It's also useful to have a screenshot of the app, if it's a user-interaction thing. And info about the machine that it crashed on (OS version and patch, what else is running at the time, etc..) Both of the tools that I mentioned can do this.
If it's something that happens with a few users but you can't reproduce it, and they can, go sit with them and watch. If it's not apparent, switch seats - you "drive", and they tell you what to do. You'll uncover the subtle usability issues that way. double-clicks on a single-click button, for example, initiating re-entrancy in the OnClick event. That sort of thing. If the users are remote, use WebEx, Wink, etc., to record them crashing it, so you can analyze the playback.

How to debug a program without a debugger? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
Interview question-
Often its pretty easier to debug a program once you have trouble with your code.You can put watches,breakpoints and etc.Life is much easier because of debugger.
But how to debug a program without a debugger?
One possible approach which I know is simply putting print statements in your code wherever you want to check for the problems.
Are there any other approaches other than this?
As its a general question, its not restricted to any specific language.So please share your thoughts on how you would have done it?
EDIT- While submitting your answer, please mention a useful resource (if you have any) about any concept. e.g. Logging
This will be lot helpful for those who don't know about it at all.(This includes me, in some cases :)
UPDATE: Michal Sznajderhas put a real "best" answer and also made it a community wiki.Really deserves lots of up votes.
Actually you have quite a lot of possibilities. Either with recompilation of source code or without.
With recompilation.
Additional logging. Either into program's logs or using system logging (eg. OutputDebugString or Events Log on Windows). Also use following steps:
Always include timestamp at least up to seconds resolution.
Consider adding thread-id in case of multithreaded apps.
Add some nice output of your structures
Do not print out enums with just %d. Use some ToString() or create some EnumToString() function (whatever suits your language)
... and beware: logging changes timings so in case of heavily multithreading you problems might disappear.
More details on this here.
Introduce more asserts
Unit tests
"Audio-visual" monitoring: if something happens do one of
use buzzer
play system sound
flash some LED by enabling hardware GPIO line (only in embedded scenarios)
Without recompilation
If your application uses network of any kind: Packet Sniffer or I will just choose for you: Wireshark
If you use database: monitor queries send to database and database itself.
Use virtual machines to test exactly the same OS/hardware setup as your system is running on.
Use some kind of system calls monitor. This includes
On Unix box strace or dtrace
On Windows tools from former Sysinternals tools like http://technet.microsoft.com/en-us/sysinternals/bb896645.aspx, ProcessExplorer and alike
In case of Windows GUI stuff: check out Spy++ or for WPF Snoop (although second I didn't use)
Consider using some profiling tools for your platform. It will give you overview on thing happening in your app.
[Real hardcore] Hardware monitoring: use oscilloscope (aka O-Scope) to monitor signals on hardware lines
Source code debugging: you sit down with your source code and just pretend with piece of paper and pencil that you are computer. Its so called code analysis or "on-my-eyes" debugging
Source control debugging. Compare diffs of your code from time when "it" works and now. Bug might be somewhere there.
And some general tips in the end:
Do not forget about Text to Columns and Pivot Table in Excel. Together with some text tools (awk, grep or perl) give you incredible analysis pack. If you have more than 32K records consider using Access as data source.
Basics of Data Warehousing might help. With simple cube you may analyse tons of temporal data in just few minutes.
Dumping your application is worth mentioning. Either as a result of crash or just on regular basis
Always generate you debug symbols (even for release builds).
Almost last but not least: most mayor platforms has some sort of command line debugger always built in (even Windows!). With some tricks like conditional debugging and break-print-continue you can get pretty good result with obscure bugs
And really last but not least: use your brain and question everything.
In general debugging is like science: you do not create it you discover it. Quite often its like looking for a murderer in a criminal case. So buy yourself a hat and never give up.
First of all, what does debugging actually do? Advanced debuggers give you machine hooks to suspend execution, examine variables and potentially modify state of a running program. Most programs don't need all that to debug them. There are many approaches:
Tracing: implement some kind of logging mechanism, or use an existing one such as dtrace(). It usually worth it to implement some kind of printf-like function that can output generally formatted output into a system log. Then just throw state from key points in your program to this log. Believe it or not, in complex programs, this can be more useful than raw debugging with a real debugger. Logs help you know how you got into trouble, while a debugger that traps on a crash assumes you can reverse engineer how you got there from whatever state you are already in. For applications that you use other complex libraries that you don't own that crash in the middle of them, logs are often far more useful. But it requires a certain amount of discipline in writing your log messages.
Program/Library self-awareness: To solve very specific crash events, I often have implemented wrappers on system libraries such as malloc/free/realloc which extensions that can do things like walk memory, detect double frees, attempts to free non-allocated pointers, check for obvious buffer over-runs etc. Often you can do this sort of thing for your important internal data types as well -- typically you can make self-integrity checks for things like linked lists (they can't loop, and they can't point into la-la land.) Even for things like OS synchronization objects, often you only need to know which thread, or what file and line number (capturable by __FILE__, __LINE__) the last user of the synch object was to help you work out a race condition.
If you are insane like me, you could, in fact, implement your own mini-debugger inside of your own program. This is really only an option in a self-reflective programming language, or in languages like C with certain OS-hooks. When compiling C/C++ in Windows/DOS you can implement a "crash-hook" callback which is executed when any program fault is triggered. When you compile your program you can build a .map file to figure out what the relative addresses of all your public functions (so you can work out the loader initial offset by subtracting the address of main() from the address given in your .map file). So when a crash happens (even pressing ^C during a run, for example, so you can find your infinite loops) you can take the stack pointer and scan it for offsets within return addresses. You can usually look at your registers, and implement a simple console to let you examine all this. And voila, you have half of a real debugger implemented. Keep this going and you can reproduce the VxWorks' console debugging mechanism.
Another approach, is logical deduction. This is related to #1. Basically any crash or anomalous behavior in a program occurs when it stops behaving as expected. You need to have some feed back method of knowing when the program is behaving normally then abnormally. Your goal then is to find the exact conditions upon which your program goes from behaving correctly to incorrectly. With printf()/logs, or other feedback (such as enabling a device in an embedded system -- the PC has a speaker, but some motherboards also have a digital display for BIOS stage reporting; embedded systems will often have a COM port that you can use) you can deduce at least binary states of good and bad behavior with respect to the run state of your program through the instrumentation of your program.
A related method is logical deduction with respect to code versions. Often a program was working perfectly at one state, but some later version is not longer working. If you use good source control, and you enforce a "top of tree must always be working" philosophy amongst your programming team, then you can use a binary search to find the exact version of the code at which the failure occurs. You can use diffs then to deduce what code change exposes the error. If the diff is too large, then you have the task of trying to redo that code change in smaller steps where you can apply binary searching more effectively.
Just a couple suggestions:
1) Asserts. This should help you work out general expectations at different states of the program. As well familiarize yourself with the code
2) Unit tests. I have used these at times to dig into new code and test out APIs
One word: Logging.
Your program should write descriptive debug lines which include a timestamp to a log file based on a configurable debug level. Reading the resultant log files gives you information on what happened during the execution of the program. There are logging packages in every common programming language that make this a snap:
Java: log4j
.Net: NLog or log4net
Python: Python Logging
PHP: Pear Logging Framework
Ruby: Ruby Logger
C: log4c
I guess you just have to write fine-grain unit tests.
I also like to write a pretty-printer for my data structures.
I think the rest of the interview might go something like this...
Candidate: So you don't buy debuggers for your developers?
Interviewer: No, they have debuggers.
Candidate: So you are looking for programmers who, out of masochism or chest thumping hamartia, make things complicated on themselves even if they would be less productive?
Interviewer: No, I'm just trying to see if you know what you would do in a situation that will never happen.
Candidate: I suppose I'd add logging or print statements. Can I ask you a similar question?
Interviewer: Sure.
Candidate: How would you recruit a team of developers if you didn't have any appreciable interviewing skill to distinguish good prospects based on relevant information?
Peer review. You have been looking at the code for 8 hours and your brain is just showing you what you want to see in the code. A fresh pair of eyes can make all the difference.
Version control. Especially for large teams. If somebody changed something you rely on but did not tell you it is easy to find a specific change set that caused your trouble by rolling the changes back one by one.
On *nix systems, strace and/or dtrace can tell you an awful lot about the execution of your program and the libraries it uses.
Binary search in time is also a method: If you have your source code stored in a version-control repository, and you know that version 100 worked, but version 200 doesn't, try to see if version 150 works. If it does, the error must be between version 150 and 200, so find version 175 and see if it works... etc.
use println/log in code
use DB explorer to look at data in DB/files
write tests and put asserts in suspicious places
More generally, you can monitor side effects and output of the program, and trigger certain events in the program externally.
A Print statement isn't always appropriate. You might use other forms of output such as writing to the Event Log or a log file, writing to a TCP socket (I have a nice utility that can listen for that type of trace from my program), etc.
For programs that don't have a UI, you can trigger behavior you want to debug by using an external flag such as the existence of a file. You might have the program wait for the file to be created, then run through a behavior you're interested in while logging relevant events.
Another file's existence might trigger the program's internal state to be written to your logging mechanism.
like everyone else said:
Logging
Asserts
Extra Output
&
your favorite task manager or process
explorer
links here and here
Another thing I have not seen mentioned here that I have had to use quite a bit on embedded systems is serial terminals.
You can cannot a serial terminal to just about any type of device on the planet (I have even done it to embedded CPUs for hydraulics, generators, etc). Then you can write out to the serial port and see everything on the terminal.
You can get real fancy and even setup a thread that listens to the serial terminal and responds to commands. I have done this as well and implemented simple commands to dump a list, see internal variables, etc all from a simple 9600 baud RS-232 serial port!
Spy++ (and more recently Snoop for WPF) are tremendous for getting an insight into Windows UI bugs.
A nice read would be Delta Debugging from Andreas Zeller. It's like binary search for debugging

Resources