libusb doesn't enumerate a device if cleanup failed in previous program execution. Can I prevent this, or recover from the bad state in software? - libusb-1.0

I have code that opens a libusb device, does some asynchronous transfers, and then quits. Sometimes the cleanup code at the end works out and I can re-run the program without problems, but other times
libusb_cancel_transfer returns LIBUSB_ERROR_PIPE and thereafter the device does not appear in the list returned by libusb_get_device_list in any subsequent execution of the program.
The problem can be recovered from by unplugging the device and reinserting into the same port, but that's not acceptable. I need at the very least some way to recover from the bad state in software, but ideally this shouldn't ever happen.
In case this problem could be platform related, I'm running MacOSX 10.9.5 and the program is C++ built with the llvm compiler.

I have discovered that I was inadvertently submitting control transfers whose setup packets were uninitialized memory. (Used the wrong size constant.) When I stopped doing that, the problem no longer occurred.

Related

Multiple loading and unloading of PCI driver causes its /sys/bus/pci/devices/xxx directory to disappear

I have a PCI driver for a FPGA card that installs and works fine.However, we have a need to clean up our system without rebooting which includes unloading this driver.
When starting again (without rebooting) the driver is re-installed. I have found that when I do this process (install/uninstall) multiple times, on the 5th unload of the driver the directory associated with the device just disappears.
lspci command can no longer find my device because of a bad link. I have to reboot to get the device directory (/sys/bus/pci/devices/00000:04:00.0) to show up again.
With some experimentation and reducing the driver down to the bare minimum I discovered that if I do not do a call to pci_enable_device(..) function in my pci_probe_method, then I am able to install/uninstall the driver multiple times without error.
Of course, I need to call this method before I can do anything with the device but I wanted to be sure it was not some other of the more complex initialization I am doing was causing the problem.
I have verified that my call to pci_disable_device() is being called in the pci_remove_method(). I should be able to enable and disable a PCI device indefinitely, right? Any help in figuring out what is happening would be appreciated.
The actual solution to this problem was to eliminate an extraneous call I had to pci_dev_put(..). I did not notice this before when submitting the question. This was leftover from when this driver was not using the pci_probe() method to discover this device. So, executing this call in the exit routine caused the structure for this device to go away after 5 calls. So for now this problem is solved.

how to debug a pci device and linux driver

I am programming a pci device with verilog and also writing its driver,
I have probably inserted some bug in the hardware design and when i load the driver with insmod the kernel just gets stuck and doesnt respond. Now Im trying to figure out what's the last driver code line that makes my computer stuck. I have inserted printk in all relevant functions like probe and init but non of them get printed.
What other code is running when i use insmod before it gets to my init function? (I guess the kernel gets stuck over there)
printks are often not useful debugging such a problem. They are buffered sufficiently that you won't see them in time if the system hangs shortly after printk is called.
It is far more productive to selectively comment out sections of your driver and by process of elimination determine which line is the (first) problem.
Begin by commenting out the entire module's init section leaving only return 0;. Build it and load it. Does it hang? Reboot system, reenable the next few lines (class_create()?) and repeat.
From what you are telling, it is looks like that Linux scheduler is deadlocking by your driver. That's mean that interrupts from the system timer doesn't arrive or have a chance to be handled by kernel. There are two possible reasons:
You hang somewhere in your driver interrupt handler (handler starts its work but never finish it).
Your device creates interrupts storm (Device generates interrupts too frequently as a result your system do the only job -- handling of your device interrupts).
You explicitly disable all interrupts in your driver but doesn't reenable them.
In all other cases system will either crash, either oops or panic with all appropriate outputs or tolerate potential misbehavior of your device.
I guess that printk won't work for such extreme scenario as hang in kernel mode. It is quite heavy weight and due to this unreliable diagnostic tool for scenarios like your.
This trick works only in simpler environments like bootloaders or more simple kernels where system runs in default low-end video mode and there is no need to sync access to the video memory. In such systems tracing via debugging output to the display via direct writing to the video memory can be great and in many times the only tool that can be used for debugging purposes. Linux is not the case.
What techniques can be recommended from the software debugging point of view:
Try to review you driver code devoting special attention to interrupt handler and places where you disable/enable interrupts for synchronization.
Commenting out of all driver logic with gradual uncommenting can help a lot with localization of the issue.
You can try to use remote kernel debugging of your driver. I advice to try to use virtual machine for that purposes, but I'm not aware about do they allow to pass the PCI device in the virtual machine.
You can try the trick with in-memory tracing. The idea is to preallocate the memory chunk with well known virtual and physical addresses and zeroes it. Then modify your driver to write the trace data in this chunk using its virtual address. (For example, assign an unique integer value to each event that you want to trace and write '1' into the appropriate index of bytes array in the preallocated memory cell). Then when your system will hang you can simply force full memory dump generation and then analyze the memory layout packed in the dump using physical address of the memory chunk with traces. I had used this technique with VmWare Workstation VM on Windows. When the system had hanged I just pause a VM instance and looked to the appropriate .vmem file that contains raw memory latout of the physical memory of the VM instance. Not sure that this trick will work easy or even will work at all on Linux, but I would try it.
Finally, you can try to trace the messages on the PCI bus, but I'm not an expert in this field and not sure do it can help in your case or not.
In general kernel debugging is a quite tricky task, where a lot of tricks in use and all they works only for a specific set of cases. :(
I would put a logic analyzer on the bus lines (on FPGA you could use chipscope or similar). You'll then be able to tell which access is in cause (and fix the hardware). It will be useful anyway in order to debug or analyze future issues.
Another way would be to use the kernel crash dump utility which saved me some headaches in the past. But depending your Linux distribution requires installing (available by default in RH). See http://people.redhat.com/anderson/crash_whitepaper/
There isn't really anything that is run before your init. Bus enumeration is done at boot, if that goes by without a hitch the earliest cause for freezing should be something in your driver init AFAIK.
You should be able to see printks as they are printed, they aren't buffered and should not get lost. That's applicable only in situations where you can directly see kernel output, such as on the text console or over a serial line. If there is some other application in the way, like displaying the kernel logs in a terminal in X11 or over ssh, it may not have a chance to read and display the logs before the computer freezes.
If for some other reasons the printks still do not work for you, you can instead have your init function return early. Just test and move the return to later in the init until you find the point where it crashes.
It's hard to say what is causing your freezes, but interrupts is one of those things I would look at first. Make sure the device really doesn't signal interrupts until the driver enables them (that includes clearing interrupt enables on system reset) and enable them in the driver only after all handlers are registered (also, clear interrupt status before enabling interrupts).
Second thing to look at would be bus master transfers, same thing applies: Make sure the device doesn't do anything until it's asked to and let the driver make sure that no busmaster transfers are active before enabling busmastering at the device level.
The fact that the kernel gets stuck as soon as you install your driver module makes me wonder if any other driver (built in to kernel?) is already driving the device. I made this mistake once which is why i am asking. I'd look for the string "kernel driver in use" in the output of 'lspci' before installing the module. In any case, your printk's should be visible in dmesg output.
in addition to Claudio's suggestion, couple more debug ideas:
1. try kgdb (https://www.kernel.org/doc/htmldocs/kgdb/EnableKGDB.html)
2. use JTAG interfaces to connect to debug tools (these i think vary between devices, vendors so you'll have to figure out which debug tools you need to the particular hardware)

Using DeviceIoControl with FSCTL_LOCK_VOLUME to lock a volume. Debugger issue

I'm using DeviceIoControl with FSCTL_LOCK_VOLUME to lock a USB pen drive prior to direct disk read/write. The program works - sometimes.
I am having an issue with the lock call itself. When I step the command in Visual Studio 2008 the result is correct and the lock succeeds - every time!, when running the code (debug or not) the call fails sporadically with invalid handle. The only noticable difference is when stepping there is a half second pause - which I am happy with, but when running/debugging the call fails immediately.
Please can you give me a hint as to where this is falling down.
I think this is one for the true techies!
Sounds like a timing bug. Are there other threads with access to the handle? If so, one of them may be closing it before you call DeviceIoControl.

Make thread uninterruptable mac os

I'm developing multi-threading application on mac os. I'm faced with next problem: when i'm trying to debug with xcode-cocoa application(NOTE: console application doesn't have same problem), my threads are returning with errors in next calls: kevent(), semaphore_wait(), semaphore_timedwait() with EINTR (for kevent) and KERN_ABORTED (for semaphore_*). I think this is due to lldb work.
The problem is: i can't debug my application as i'm crashing after receiving such events. If i will do their handling(just make recall of same function) then my application is working very strange. Anyway I can't(I can, but it's very ugly) make good handling for such situation when my semaphore_timedwait() interrupting as i should "remember" time before i have gone timedwait() to make new timedwait() correct.
So, solution for my problems would be if I could disable for current thread "interrupting" - ability to be interrupted from another thread\process, that my functions will not return if lldb will send some signal. Is it possible on mac os?
Few notes:
In some debugegrs (I know gdb supports that) you can say whether all threads or only the one are stopped on the breakpoint.
Generally, you should be ready in your code a signal comes even if it is more work.
In multithreading application you can consider blocking the signals in most (helper) threads, so the signals shall be handled in a thread which is ready for that. See pthread_sigmask().

Can a simple program be responsible for a BSOD?

I've got a customer who told me that my program (simple user-land program, not a driver) is crashing his system with a Blue Screen Of Death (BSOD). He says he has never encountered that with other program and that he can reproduce it easily with mine.
The BSOD is of type CRITICAL_OBJECT_TERMINATION (0x000000F4) with object type 0x3 (process): A process or thread crucial to system operation has unexpectedly exited or been terminate.
Can a simple program be responsible for a BSOD (even on Vista...) or should he check the hardware or OS installation?
Just because your program isn't a driver doesn't mean it won't use a driver.
In theory, your code shouldn't be able to BSOD the computer. It's up to the OS to make sure that doesn't happen. By definition, that means there's a problem somewhere either in hardware or in code other than your program. That doesn't preclude there being a bug in your code as well though.
The easiest way to cause a BSOD with a user-space program is (afaik) to kill the Windows subsystem process (csrss.exe). This doesn't need faulty hardware nor a bug in the kernel or a driver, it only needs administrator privileges1.
What is your code exactly doing? The error message ("A process or thread crucial to system operation has unexpectedly exited or been terminate.") sounds like one of the essential system processes terminated. Maybe you are killing a process and unintentionally got the wrong process?
If somehow possible you could try to get a memory dump from that customer. Using the Debugging Tools for Windows you can then further analyze that dump as described here.
1Windows doesn't prevent you from doing so because it "keeps administrators in control of their computer". So this is by design and not a bug. Read Raymond's articles and you will see why.
Short answer is yes. Long answer depends on what is you program is suppose to do and how it does it?
Normally, it shouldn't. If it does, there must be either
A bug in the Windows kernel (possible but very unlikely)
A bug in a device driver (not necessarily in a device your program uses, this could get quite complicated)
A fault in the hardware
I would bet on option number two (device driver) but it would be interesting if you could get us a more detailed dump.
Well, yes it can - but for many different reasons.
That's why we test on different machines, operating systems, hardware etc..
Have you set some requirements for your program and is your user following them?
If you can't duplicate it yourself, and your program doesn't need admin to run, I'd be a bit suspicous about
The stability of that system's hardware
The virus/malware status of that system.
If you can get physical access to the client box, it might be worth running a full virus scan with an up-to-date scanner, and running a full memtest on it.
I had a system once that seemed stable, except that a certian few programs wouldn't run on it (and would sometimes crash the box). Memtest showed my RAM had some bad bits, but they were in higer sims, so they only got accessed if a program tried to use a lot of RAM.
No, and that is pretty much by definition. The worst thing that you can say is that a user-land application may have "triggered" a Windows bug or a driver bug. But a modern desktop Operating System is fully responsible for its own integrity; a BSOD is a failure of that integrity. Therefore the OS is responsible, and only the OS.
(Example of a BSOD bug that your application alone could expose: a virus scanner implemented as a driver, that crashes when executing a file from sector 0xFFFFFFFF, a sector that on this one machine just happens to contain one DLL of your application)
I had problems when exit my app without stopping all the processes and BD connections when the program ends (I crashed the entire IDE). I place the "stopping and disconnecting" code in the "Terminate" of "Form_Closed" event of my main form and the problem wa solved, I don't know it this is your situation.
Another problem can be if the user is trying to access the same resources your app is using (databases, hardware, sockets, etc). Ask him/her about what apps he/she is using when the BSOD happens.
A virus can't be discarded.

Resources