CUDA/PyCUDA: Diagnosing launch failure that disappears under cuda-gdb - debugging

Anyone know likely avenues of investigation for kernel launch failures that disappear when run under cuda-gdb? Memory assignments are within spec, launches fail on the same run of the same kernel every time, and (so far) it hasn't failed within the debugger.
Oh Great SO Gurus, What now?

cuda-gdb spills all shared memory and registers to local memory. So when something runs OK built for debugging and fails otherwise, it usually means an out-of-bounds shared memory access. cuda-memcheck might help, depending on what sort of card you are using. Fermi is better than older cards in that respect.
EDIT:
Casting my mind back to the bad old days, I remember having an ornery GT9500 which used to throw similar NV13 errors and have random code failures when running very memory intensive kernels with a lot of shared memory activity. Never when debugging. I put it down to bad hardware and moved on to a GT200, never to see a similar error since. One possibility might be bad hardware. Is this a G92 (9800GT or similar)?

cuda-gdb can make some of the CUDA operations synchronous.
Are you reading from memory only after it has been initialized?
Are you using streams?
Are you launching more than one kernel?
Where and how does it fail?

Related

Does OpenCL release all device memory after process termination?

On Linux I used to be sure that whatever resources a process allocates, they are released after process termination. Memory is freed, open file descriptors are closed. No memory is leaked when I loop starting and terminating a process several times.
Recently I've started working with OpenCL.
I understand that the OpenCL compiler keeps compiled kernels in a cache. So when I run a program that uses the same kernels as a previous run (or probably even those from another process running the same kernels), they don't need to be compiled again. I guess that cache is on the device.
From that behaviour I suspect that allocated device memory might be cached as well (maybe associated with a magic cookie for later reuse or something like that) if it was not released explicitly before termination.
So I pose this question to rule out any such suspicion.
Kernels survive in cache => other memory allocations survive somehow???
My short answer would be yes based on this tool http://www.techpowerup.com/gpuz/
I'm investigating a memory leak on my device and I noticed that memory is freed when my process terminates... most of the time. If you have a memory leak like me, it may linger around even after the process is finished.
Another tool that may help is http://www.gremedy.com/download.php
but it's really buggy, so use it judiciously.

Debugging kernel hang

I am trying to run an app which uses a kernel-mode driver. The system locks up every hour, and the only way to recover is a hard reset. SysRq stops responding, telnet sessions hang, and there are no error messages of any kind. Unfortunately the board does not have EJTAG support. I have been trying to isolate it functionally, but this is like looking for a needle in a haystack. Any suggestions?
PS: This is a MIPS Linux system (2.6.31).
Here are some options, depending on the specifics of your situation. If you can provide more detail about the platform and the nature of the kernel-mode driver, it would be helpful.
Assuming you have reason to be confident in the hardware, your likely sources of lockups are locking problems in the kernel, uninitialized variables, and infinite loops with preemption disabled.
Can you configure a timer interrupt to run periodically and blink an LED? You might find it useful to see if interrupts continue to be handled while in a lockup.
Enable soft lockup detection in the Linux kernel hacking menu, and any other relevant kernel hacking features. It may take Linux a minute or two to detect and report a soft lockup. Have you waited long enough to check for this?
Enable lock dependency checking in kernel hacking, and fix any reported locking errors in your driver.
Try changing the kernel preemption mode. This changes the behaviour of some system locks, in some cases turning deadlocks into less harmful locks. If it's relevant/possible, disable SMP.
Unfortunately, without SysRq operating, or some way of poking the underlying system, you are out of luck.
If you can get some behavior out of the system (perhaps a hardware watchdog?), I would recommend kdump.
Furthermore, if this is a more recent problem, start by bisecting the code of the driver to determine where the crash is occurring.
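The bisecting idea can be sketched as a plain binary search over driver revisions. This is only an illustration: the revision names and the is_bad callback (build that revision, run it, report whether the box locked up) are hypothetical, and it assumes the oldest revision is known good, the newest is known bad, and the lockup never goes away again once introduced.

```python
from typing import Callable, Sequence


def first_bad(revisions: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest revision for which is_bad() is true.

    Assumes revisions are ordered oldest to newest, revisions[0] is good
    and revisions[-1] is bad.
    """
    lo, hi = 0, len(revisions) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid  # the breakage is at mid or earlier
        else:
            lo = mid  # the breakage is after mid
    return revisions[hi]
```

With a lockup that takes an hour to show up, each is_bad call is expensive, which is exactly why halving the search range each time is worth it.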
If the kernel isn't totally hung and you are still getting interrupts, you might be able to use KGDB.
If you can't do that, you could add more logging code to your driver to track down the source of the problem. I'd put a printk() on every function's entry at a minimum and probably on every exit of each function as well. That should at least help you find out where the problem is happening.

Can a simple program be responsible for a BSOD?

I've got a customer who told me that my program (simple user-land program, not a driver) is crashing his system with a Blue Screen Of Death (BSOD). He says he has never encountered that with other programs and that he can reproduce it easily with mine.
The BSOD is of type CRITICAL_OBJECT_TERMINATION (0x000000F4) with object type 0x3 (process): A process or thread crucial to system operation has unexpectedly exited or been terminated.
Can a simple program be responsible for a BSOD (even on Vista...) or should he check the hardware or OS installation?
Just because your program isn't a driver doesn't mean it won't use a driver.
In theory, your code shouldn't be able to BSOD the computer. It's up to the OS to make sure that doesn't happen. By definition, that means there's a problem somewhere either in hardware or in code other than your program. That doesn't preclude there being a bug in your code as well though.
The easiest way to cause a BSOD with a user-space program is (afaik) to kill the Windows subsystem process (csrss.exe). This needs neither faulty hardware nor a bug in the kernel or a driver; it only needs administrator privileges [1].
What exactly is your code doing? The error message ("A process or thread crucial to system operation has unexpectedly exited or been terminated.") sounds like one of the essential system processes terminated. Maybe you are killing a process and unintentionally got the wrong one?
If at all possible, you could try to get a memory dump from that customer. Using the Debugging Tools for Windows you can then further analyze that dump as described here.
[1] Windows doesn't prevent you from doing so because it "keeps administrators in control of their computer". So this is by design and not a bug. Read Raymond's articles and you will see why.
Short answer is yes. The long answer depends on what your program is supposed to do and how it does it.
Normally, it shouldn't. If it does, there must be either
A bug in the Windows kernel (possible but very unlikely)
A bug in a device driver (not necessarily in a device your program uses, this could get quite complicated)
A fault in the hardware
I would bet on option number two (device driver) but it would be interesting if you could get us a more detailed dump.
Well, yes it can - but for many different reasons.
That's why we test on different machines, operating systems, hardware, etc.
Have you set some requirements for your program and is your user following them?
If you can't duplicate it yourself, and your program doesn't need admin to run, I'd be a bit suspicious about:
The stability of that system's hardware
The virus/malware status of that system.
If you can get physical access to the client box, it might be worth running a full virus scan with an up-to-date scanner, and running a full memtest on it.
I had a system once that seemed stable, except that a certain few programs wouldn't run on it (and would sometimes crash the box). Memtest showed my RAM had some bad bits, but they were in the higher SIMMs, so they only got accessed if a program tried to use a lot of RAM.
No, and that is pretty much by definition. The worst thing that you can say is that a user-land application may have "triggered" a Windows bug or a driver bug. But a modern desktop Operating System is fully responsible for its own integrity; a BSOD is a failure of that integrity. Therefore the OS is responsible, and only the OS.
(Example of a BSOD bug that your application alone could expose: a virus scanner implemented as a driver, that crashes when executing a file from sector 0xFFFFFFFF, a sector that on this one machine just happens to contain one DLL of your application)
I had problems when my app exited without stopping all of its processes and DB connections (I crashed the entire IDE). I placed the "stopping and disconnecting" code in the "Terminate" or "Form_Closed" event of my main form and the problem was solved. I don't know if this is your situation.
Another problem can be if the user is trying to access the same resources your app is using (databases, hardware, sockets, etc). Ask him/her which apps he/she is using when the BSOD happens.
A virus can't be ruled out.

How can I use up RAM quickly to test garbage collection?

Windows Server 2008. How can I quickly use up RAM so as to induce GC in my app? If there is a way to do it without needing Visual Studio or installing a language runtime, that would be good.
EDIT: I don't want to have to write an app and then copy it over to the server. I'm looking for a way to do it quickly without writing an app that requires an IDE or installation of a runtime/compiler.
Perhaps a PowerShell or batch script?...
I don't think using up RAM outside your process is going to necessarily trigger GC.
If I understand your question correctly, you have a program Foo.exe that is written in some unknown language, running on some unknown runtime (are you not allowed to post the details for some reason, or do you just not know?), and you want to try to get that program's runtime to trigger a garbage collection. However, you want to do this by using up RAM outside of Foo.exe.
You could do this by creating a simple batch file that just started up a hundred copies of IE or Word or whatever program you want. However, I don't think that will do what you want it to do. If your process has already allocated a certain amount of memory, it won't necessarily give that memory up or trigger GC just because other processes are being started. It may page to disk, or may force other programs to page to disk. But not all garbage collectors are alike, so we can't really help without more details. I'm pretty sure some VMs never give back memory once they've allocated it, even after GC.
You could run your program inside a virtual machine such as VirtualBox, where you specify the memory ceiling of the guest operating system.
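If a scripting runtime does happen to be available on the server (which the question would rather avoid), memory pressure from outside the process could be generated more directly than by launching a hundred copies of Word. A rough sketch in Python; hold_memory and its parameters are made-up names, and as noted above there is no guarantee that this pressure makes the other process collect anything.

```python
PAGE = 4096  # assumed page size; only used to decide which bytes to touch


def hold_memory(chunk_mib: int, chunks: int) -> list:
    """Allocate `chunks` buffers of `chunk_mib` MiB each and keep them alive."""
    held = []
    for _ in range(chunks):
        buf = bytearray(chunk_mib * 1024 * 1024)
        # Write one byte per page so the memory is actually used,
        # not just reserved.
        for offset in range(0, len(buf), PAGE):
            buf[offset] = 1
        held.append(buf)
    return held  # the memory stays in use for as long as the caller keeps this list
```

Something like hold_memory(100, 20) would hold roughly 2 GB until the returned list is dropped or the script exits.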
I'm having trouble imagining a scenario where this would be necessary though. Could you provide more information about the problem?
If you are using Java you can specify the maximum amount of memory using -Xmx. Search for JVM memory settings.

I need to find the point in my userland code that crashes my kernel

I have a big system that makes my machine crash hard. When I boot back up, I don't even have a core dump. If I log every line that gets executed until the system goes down, I will find that evil code.
Can I log every source code line in GDB to a file?
UPDATE:
OK, I found the bug. It was nasty. The application I started did not take the system down. After learning about core dump inspection with mdb, and some gdb stepping, I found out that the system call causing the dump was not implemented. Updating the system to the latest kernel will fix my problem. Thanks to all of you.
MY LESSON:
Make sure you know what process causes the core dump. It's not always the one you started.
Sounds like a tricky little problem.
I often try to eliminate as many possible suspects as I can by commenting out large chunks of code, configuring the system to not run certain pieces (if it allows you to do that) etc. This amounts to doing an ad-hoc binary search on the problem, and is a surprisingly effective way of zooming in on offending code relatively quickly.
A potential problem with logging is that the log might not hit the disk before the system locks up - if you don't get a core dump, you might not get the log.
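One way to narrow that window, sketched here in Python purely for illustration (the asker's application may be in any language, and log_durably is a made-up name), is to flush and fsync after every single log line. It is slow, and a hard lockup can still beat the write to the device, but each line that returned has at least been handed to the disk.

```python
import os


def log_durably(path: str, message: str) -> None:
    """Append one line to `path` and push it to storage before returning."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(message + "\n")
        f.flush()             # move the data from Python's buffer to the OS
        os.fsync(f.fileno())  # ask the OS to write it out to the device
```

Logging to a second machine over the network or a serial line avoids the local disk entirely, if that is an option.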
Speaking of core dumps, make sure you don't have a limit on your core dump size (man ulimit).
You could try to obtain a list of all the functions in your code using objdump, process it a little bit and create a bunch of GDB trace statements on those functions - basically creating a GDB script automatically. If that turns out to be overkill, then a binary search on the code using tracepoints can also help you zoom in on the problem.
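A rough sketch of that generator, in Python. It assumes the symbol table comes from "objdump -t", where function symbols carry an F flag and sit in .text, and it uses ordinary breakpoints with command lists rather than real GDB tracepoints. The helper names and the log file name are made up.

```python
from typing import List


def function_names(objdump_t_output: str) -> List[str]:
    """Pick function symbols in .text out of `objdump -t` output."""
    names = []
    for line in objdump_t_output.splitlines():
        fields = line.split()
        # Expected shape: address, flags..., section, size, name
        if len(fields) >= 5 and "F" in fields[1:-1] and ".text" in fields:
            names.append(fields[-1])
    return names


def gdb_script(names: List[str], log_file: str = "gdb-trace.log") -> str:
    """Build a GDB command file that logs a line each time a function is hit."""
    lines = [
        "set pagination off",
        f"set logging file {log_file}",
        "set logging on",  # newer GDB releases spell this 'set logging enabled on'
    ]
    for name in names:
        # silent must come first; backtrace 1 prints the frame, then carry on
        lines += [f"break {name}", "commands", "silent",
                  "backtrace 1", "continue", "end"]
    lines.append("run")
    return "\n".join(lines) + "\n"
```

Save the result to a file and start the program with "gdb -x" on that file. Expect it to be slow when there are many functions, and the log is subject to the same might-not-reach-the-disk caveat as any other log.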
And don't panic. You're smarter than the bug - you'll find it.
You cannot reasonably track every line of your source using GDB (too slow). Besides, a system crash is most likely a result of a system call, and libc is probably doing the system call on your behalf. Even if you find the line of the application that caused the OS crash, you still don't really know anything.
You should start by clarifying which OS is crashing. For Linux, you can try the following approaches:
strace -fo trace.out /path/to/app
After reboot, trace.out will contain syscalls the application was doing just before the crash. If you are lucky, you'll see the last syscall-of-death, but I wouldn't count on it.
Alternatively, try to reproduce the crash under User-Mode Linux, or on a kernel with KGDB compiled in.
These will tell you where the problem in the kernel is. Finding the matching system call in your application will likely be trivial.
Please clarify your problem: What part of the system is crashing?
Is it an application?
If so, which application? Is this an application which you have written yourself? Is this an application you have obtained from elsewhere? Can you obtain a clean interrupt if you use a debugger? Can you obtain a backtrace showing which functions are calling the section of code which crashes?
Is it a new hardware driver?
Is it based on an older driver? If so, what has changed? Is it based on a manufacturer's data sheet? Is that data sheet the latest and most correct?
Is it somewhere in the kernel? Which kernel?
What is the OS? I assume it is Linux, seeing that you are using the GNU debugger. But of course, that is not necessarily so.
You say you have no coredump. Have you enabled coredumps on your machine? Most systems these days do not have coredumps enabled by default.
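Whether core dumps are enabled for a process can be checked from the process itself. A small sketch in Python using the standard resource module (Unix only); the function names are made up, and it only raises the soft limit as far as the existing hard limit allows, which is what "ulimit -c unlimited" attempts from a shell.

```python
import resource


def core_dumps_enabled() -> bool:
    """True if this process is currently allowed to write a core file."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_CORE)
    return soft != 0  # 0 means core files are suppressed


def raise_core_limit() -> None:
    """Lift the soft core size limit up to the hard limit."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
```

The limit is inherited by child processes, so it needs to be raised in whatever launches the crashing program. Where the core file ends up depends on the system's configuration.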
Regarding logging GDB output, you may have some success, but it depends where the problem is whether or not you will have the right output logged before the system crashes. There is plenty of delay in writing to disk. You may not catch it in time.
I'm not familiar with the gdb way of doing this, but with windbg the way to go is to have a debugger attached to the kernel and control the debugger remotely over a serial cable (or firewire) from a second debugger. I'm pretty sure gdb has similar capabilities, I could quickly find some hints here: http://www.digipedia.pl/man/gdb.4.html
