I am trying to use KGDB on Ubuntu 14.04.2 with the 3.16 kernel.
The target is running the 3.16 kernel on Ubuntu 14.04.2.
The host is running the 3.16 kernel on Ubuntu 14.04.2.
The target is waiting for a remote gdb connection.
I started my host machine and tried to connect to the target:
$ gdb ./vmlinux
(./vmlinux is the kernel image file of the target machine.)
(gdb) target remote /dev/ttyS0
GDB then reports “unrecognized item timeout in qsupported response”.
I am not able to proceed further. Can anyone shed some light on this?
In my experience, the “unrecognized item timeout in qsupported response” error most typically occurs when the serial connection between the debug station (GDB host) and the target (SUT) is not running reliably. The usual solution is to do two things.
FIRST, check the serial connection manually using programs such as minicom, setserial, and stty. (Check that the baud rates match and that characters transfer correctly between the two systems by hand.) Unfortunately, in my experience, even with the correct RTS/CTS hardware flow-control dongles in place, the KGDB agent on the target doesn't handle flow control well. So the manual test will appear to work (a manual test exercises no real flow control, but it does prove that you have the correct baud rate on both ends and a working connection).
SECOND, and typically the required fix in my experience, lower the baud rate to 9600. (Everyone starts at MAX, or 57600, or 38400, or 19200, but drop it down to 9600.) The data sent and received by the kernel debugger is small, and even the serial console doesn't generate a lot of data in debug situations. By locking the baud rate down to 9600, you make sure the serial kgdboc on the SUT keeps up, and the problem you are seeing usually goes away. If you find the speed too slow, then once you have it running properly at 9600 you can increase the speed one step at a time (on both ends) and find the maximum serial rate that works properly for your setup.
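For the manual link test, here is a minimal sketch (my own illustration, not part of KGDB or gdb) of forcing a host serial port to raw 8N1 at 9600 with termios. The names baud_to_speed, set_serial_baud and open_serial_at are hypothetical. Flow control is switched off on the assumption, taken from the answer above, that the KGDB agent does not cope with it. The target needs the matching rate in its kgdboc setting.

```c
#include <assert.h>
#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

/* Map a numeric baud rate to its termios constant. Returns 0 on success,
 * -1 if the rate is not one of the standard steps listed here. */
int baud_to_speed(int baud, speed_t *speed)
{
    assert(speed != NULL);
    switch (baud) {
    case 9600:   *speed = B9600;   return 0;
    case 19200:  *speed = B19200;  return 0;
    case 38400:  *speed = B38400;  return 0;
#ifdef B57600
    case 57600:  *speed = B57600;  return 0;
#endif
#ifdef B115200
    case 115200: *speed = B115200; return 0;
#endif
    default:     return -1;
    }
}

/* Put an already-open serial port into raw 8N1 mode at the given rate,
 * with hardware and software flow control disabled. Returns 0 on success. */
int set_serial_baud(int fd, int baud)
{
    struct termios tio;
    speed_t speed;

    if (baud_to_speed(baud, &speed) != 0)
        return -1;
    if (tcgetattr(fd, &tio) != 0)
        return -1;

    tio.c_iflag &= ~(IGNBRK | BRKINT | PARMRK | ISTRIP |
                     INLCR | IGNCR | ICRNL | IXON | IXOFF);
    tio.c_oflag &= ~OPOST;
    tio.c_lflag &= ~(ECHO | ECHONL | ICANON | ISIG | IEXTEN);
    tio.c_cflag &= ~(CSIZE | PARENB | CSTOPB);
    tio.c_cflag |= CS8 | CLOCAL | CREAD;
#ifdef CRTSCTS
    tio.c_cflag &= ~CRTSCTS;            /* no RTS/CTS handshaking */
#endif
    if (cfsetispeed(&tio, speed) != 0 || cfsetospeed(&tio, speed) != 0)
        return -1;
    return tcsetattr(fd, TCSANOW, &tio);
}

/* Open a serial device and configure it. O_NONBLOCK keeps open() from
 * waiting for a carrier signal. Returns the fd, or -1 on any failure. */
int open_serial_at(const char *path, int baud)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDWR | O_NOCTTY | O_NONBLOCK);
    if (fd < 0)
        return -1;
    if (set_serial_baud(fd, baud) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Calling open_serial_at("/dev/ttyS0", 9600) on the host and typing characters across the link is one way to confirm the rate before starting gdb. Stepping the rate back up later only means passing a larger value on both ends.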
I'm working on an embedded device based on an NXP i.MX8M mini SoC. It is running Linux based on NXP's "hardknott" Yocto recipe: https://source.codeaurora.org/external/imx/imx-manifest/tree/imx-5.10.52-2.1.0.xml?h=imx-linux-hardknott
Here's the bitbake script for the kernel it is using: https://source.codeaurora.org/external/imx/meta-imx/tree/meta-bsp/recipes-kernel/linux?h=hardknott-5.10.52-2.1.0
I believe it is pulling the kernel sources from here: https://source.codeaurora.org/external/imx/linux-imx/tree/?h=lf-5.10.y
Our device has a digital microphone that is being driven by a modified version of NXP's fsl_micfil.c driver: https://source.codeaurora.org/external/imx/linux-imx/tree/sound/soc/fsl/fsl_micfil.c?h=lf-5.10.y
Our modifications are:
In fsl_micfil_hw_params, where their code does "enable channels" (around line 1592), we've changed it to only enable the channel we care about instead of all of them.
We changed:
ret = regmap_update_bits(micfil->regmap, REG_MICFIL_CTRL1,
0xFF, ((1 << channels) - 1));
To:
ret = regmap_update_bits(micfil->regmap, REG_MICFIL_CTRL1,
0xFF, 1 << 6);
At the start of fsl_micfil_probe (around line 2202), we add code to pull a GPIO line (the mic's enable line) low. If the GPIO isn't available at that time, we return -EPROBE_DEFER in order to try again later.
Further down in fsl_micfil_probe (around line 2334), we change the DMA channel from 0 to 6 (to align with our enable change).
We changed:
micfil->dma_params_rx.addr = res->start + REG_MICFIL_DATACH0;
To:
micfil->dma_params_rx.addr = res->start + REG_MICFIL_DATACH6;
That having been described, here's the problem.
Our application is trying to read audio from this microphone (we can test it with the arecord command).
We have found that if we launch our application immediately after Linux boots, the device hangs. We need to power-cycle it to recover.
If we wait about 60 seconds after bootup, however, we see the following messages appear on the console:
[ 60.386133] imx-sdma 302c0000.dma-controller: firmware found.
[ 60.392010] imx-sdma 302c0000.dma-controller: loaded firmware 4.6
If we launch our application after these messages appear, everything works fine.
I looked through the implementation of the imx-sdma driver: https://source.codeaurora.org/external/imx/linux-imx/tree/drivers/dma/imx-sdma.c?h=lf-5.10.y
The driver loads some kind of firmware when it starts up. The first message ("firmware found") is presented by the probe function. The second ("loaded firmware") is presented by one of the two power management resume functions (sdma_resume or sdma_runtime_resume).
So it appears that the problem is that this DMA driver is not loading until the system has been running for about 60 seconds. I assume this is due to some kind of lazy initialization that waits until some kernel driver requires access. And my problem is happening because my driver requires access and is initializing before the DMA driver has loaded.
So, what's a good fix for this? I assume I need to add something to the fsl_micfil driver to request startup of the DMA driver. But I don't know how to do this and I don't actually know if that is the correct fix for this problem.
What would you suggest as a proper fix for this bug?
(FWIW, we made very similar changes to an older version of the kernel - based on Yocto's "sumo" branch, running kernel 4.14, and there are no problems there. So clearly something changed between kernel 4.14 and 5.10 that our code needs to deal with, but I don't know what that might be.)
I eventually figured out the reason (but not the best solution) for the problem.
As I wrote in the original message, the micfil driver uses imx-sdma, which tries to load firmware from the file system.
The "Sumo" (kernel 4.14) branch has no problem with this. At the time the driver starts, the root file system has been mounted (read-only), so it has no problem loading its firmware.
With the "Hardknott" (kernel 5.10) and "Kirkstone" (kernel 5.15) branches, however, the SDMA driver tries to load very early in the boot sequence, before any file systems have been mounted. So direct loading fails. It then posts an event via udev to request that user-space code provide the firmware via sysfs. But the driver appears to be starting so early (within 10-15 ms, according to dmesg) that udev isn't able to queue up the event. I'm not sure of the specific reason, but debugging shows that the event is never delivered to user space.
The kernel side of the udev request times out after 60 seconds, at which point the driver retries the request, which succeeds because all the file systems are fully mounted by that point.
An ideal fix would be to modify the imx-sdma driver so it doesn't start until after the file system mounts, so it can directly load its firmware. Or at minimum, after udev is ready and able to deliver the firmware-load event to user space. I let NXP know about this issue, so maybe they'll fix it in a future release of their driver.
Until that happens, my workaround is to change the kernel configuration so the imx-sdma driver is compiled as a loadable module. This means it can't load and start until after the root file system is available, at which point it has no problem finding its firmware. dmesg shows that this happens about 6s into the boot sequence - later than I'd prefer, but early enough that no application software has started running, so nothing hangs or crashes.
I am writing a Windows kernel driver and I am using COM1 to capture the kernel log. The host machine is Ubuntu and the guest is a QEMU VM with Windows 10 installed. The problem is that a large amount of data has to be written to COM1. If it has to write ABCDEFGHI on one line and JKLMNOPQ on the next, I get something like ABCDEF on one line and GHIJKLMNOPQ on the second, so the first line is incomplete and the second is corrupted. Is there a lock we can apply, so that while one writer is using COM1 no other can write to it?
QEMU does not control who can write to the serial port address at any given time. That is the guest OS's job. QEMU just receives whatever data it sees in the virtual registers.
If you really need to write bulk data to the serial port in the guest OS, expect it to take a very long time. I once had the same problem and had to test the speed on a very fast Intel CPU. The best speed I could get was 220 KBps.
If processing time is not a problem for you, then from the guest Windows perspective: if you write directly to COM1's port address, it is your driver's responsibility to implement a locking mechanism, assuming no other drivers or processes are writing to COM1 as well. If you write data via a handle to the COM1 device, the built-in Windows serial port driver provides no locking directly. But you can still do it in other ways: open the handle to the serial port device exclusively, so only one process can write to it; or write a device filter driver that sits above the Windows serial port driver to add this access control; or even modify the Windows serial port driver's source code to add what you want. Its source code is now available on GitHub.
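To illustrate the "your driver does the locking" case, here is a rough user-space sketch of the idea, not Windows kernel code. One lock is held for a whole line, so two writers can never interleave bytes. The com1_* names and the in-memory sink standing in for the port register are hypothetical. A C11 atomic_flag spinlock stands in for whatever lock primitive the real driver would use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sink standing in for the COM1 data register: bytes are
 * appended to a buffer so the result can be inspected afterwards. */
static char   com1_sink[256];
static size_t com1_len;

/* One lock guards the port for the duration of a whole line. */
static atomic_flag com1_lock = ATOMIC_FLAG_INIT;

static void com1_put_byte(char c)
{
    /* Keep the last byte free so the buffer stays NUL-terminated. */
    if (com1_len < sizeof(com1_sink) - 1)
        com1_sink[com1_len++] = c;
}

/* Emit one complete line. The lock is taken before the first byte and
 * released only after the trailing newline. */
void com1_write_line(const char *line)
{
    size_t n;

    assert(line != NULL);
    n = strlen(line);

    while (atomic_flag_test_and_set_explicit(&com1_lock, memory_order_acquire))
        ;   /* spin until the current writer has finished its line */

    for (size_t i = 0; i < n; i++)
        com1_put_byte(line[i]);
    com1_put_byte('\n');

    atomic_flag_clear_explicit(&com1_lock, memory_order_release);
}

/* Everything written so far, as a C string. */
const char *com1_sink_text(void)
{
    return com1_sink;
}
```

The point of the design is the granularity. Locking per byte would still allow ABCDEF / GHIJKLMNOPQ style mixing, whereas locking per line keeps each message intact at the cost of making other writers wait.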
If none of the above applies to your case, you might want to double-check how you capture the COM1 output on the host (usually redirected to a file, a shell, etc.). That can also cause mysterious problems.
I have an ARM board at a remote location. Sometimes it hits a kernel panic. When that happens there is no way to do a hardware restart, because no one is on site to restart it.
I want the board to restart automatically after a kernel panic. What do I need to do in the kernel?
If your hardware contains a watchdog timer, compile the kernel with watchdog support and configure it. I suggest following this blog post: http://www.jann.cc/2013/02/02/linux_watchdog.html
Caution: I have never tried this. If it solves the problem, please post an update here.
You can modify the panic() function in kernel/panic.c to call kernel_restart(*cmd) at the point where you want it to restart (probably after printing the required debug information).
I am assuming you are bringing up a board, so please note that you need to supply the ops for the associated functions in machine_restart() (called by kernel_restart()) according to the MACH. If you are just using the board as is, then I guess rebuilding the kernel with the kernel_restart(*cmd) call should do.
The panic() is usually due to events that the kernel cannot recover from. If you do not have a watchdog, you need to look at your hardware to see if a GPIO, etc. is connected to the RESET line. If so, you can toggle this pin to reboot the CPU. Trying to alter panic() may just make things worse, depending on the root cause and the type of features you use.
You may hook arm_pm_restart with your custom restart functionality. You can test it with the shell command reboot, if present. panic() should call the same routine. With current ARM Linux versions
You may wish to turn off the MMU and block interrupts in this routine. It will make it more resilient when called from panic(). As you are going to reset, you can copy the routine to any physical address you like.
The watchdog may be better; it may catch cases where even panic() is never called. You may have a watchdog and not realize it. Many Cortex-A CPUs have one built in. It is fairly rare for hardware not to have a watchdog.
However, if you don't have a watchdog, you can use the GPIO mechanism above; hardware should usually provide some way for software to restart the device (and peripherals). The panic() may be due to some misbehaving device trampling memory, latched-up DRAM/Flash, etc. Toggling a RESET line may be better than a watchdog in this case if the RESET is also connected to other hardware besides the CPU.
Related: How to debug kernel freeze, How to change watchdog timer
AFAIK, a simple way to restart the board after a kernel panic is to pass a kernel parameter (usually from the bootloader):
panic=1
The board will then reboot automatically 1 second after a panic.
Search the Documentation for more.
Some examples from the documentation:
...
panic=          [KNL] Kernel behaviour on panic: delay <timeout>
                timeout > 0: seconds before rebooting
                timeout = 0: wait forever
                timeout < 0: reboot immediately
                Format: <timeout>
...
oops=panic      Always panic on oopses. Default is to just kill the
                process, but there is a small probability of
                deadlocking the machine.
                This will also cause panics on machine check exceptions.
                Useful together with panic=30 to trigger a reboot.
...
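To check on a running board that the bootloader really passed panic=, one option is to scan the kernel command line (for example, the contents of /proc/cmdline) for the token. The sketch below is only an illustration. find_panic_timeout and panic_behaviour are hypothetical helper names, and the sketch assumes that a later panic= overrides an earlier one. The behaviour strings simply restate the documentation excerpt above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Look for "panic=<timeout>" tokens in a kernel command line string.
 * Returns 1 and stores the last value found, or 0 if there is none.
 * Tokens such as "oops=panic" do not match, because only the start of
 * each whitespace-separated token is compared. */
int find_panic_timeout(const char *cmdline, long *timeout)
{
    static const char key[] = "panic=";
    const size_t key_len = sizeof(key) - 1;
    const char *p = cmdline;
    int found = 0;

    assert(cmdline != NULL && timeout != NULL);
    while (*p != '\0') {
        while (isspace((unsigned char)*p))
            p++;                                /* skip separators */
        if (strncmp(p, key, key_len) == 0) {
            const char *v = p + key_len;
            if (*v == '-' || isdigit((unsigned char)*v)) {
                char *end;
                long value = strtol(v, &end, 10);
                if (end != v) {                 /* at least one digit */
                    *timeout = value;
                    found = 1;
                }
            }
        }
        while (*p != '\0' && !isspace((unsigned char)*p))
            p++;                                /* skip rest of token */
    }
    return found;
}

/* Restate the documented meaning of a panic= value. */
const char *panic_behaviour(long timeout)
{
    if (timeout > 0)
        return "reboot after timeout seconds";
    if (timeout == 0)
        return "wait forever";
    return "reboot immediately";
}
```

If find_panic_timeout reports nothing, the parameter never reached the kernel and the bootloader environment is the place to look.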
As suggested in previous comments, a watchdog timer is your friend here. If your hardware contains a watchdog timer, enable it in the kernel configuration and configure it.
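Once the kernel's watchdog driver is enabled, something in user space usually has to keep feeding it. Here is a minimal sketch, assuming the driver exposes the standard Linux watchdog character device (commonly /dev/watchdog) and supports the WDIOC_SETTIMEOUT and WDIOC_KEEPALIVE ioctls from linux/watchdog.h. The watchdog_* function names are hypothetical. When the kernel panics, the feeding stops and the hardware resets the board after the timeout.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

/* Open the watchdog device and request a timeout in seconds.
 * Opening the device arms the watchdog. Returns the fd, or -1 on error. */
int watchdog_start(const char *path, int timeout_s)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    /* Not every driver allows changing the timeout. If this fails, the
     * driver's default timeout stays in effect. */
    (void)ioctl(fd, WDIOC_SETTIMEOUT, &timeout_s);
    return fd;
}

/* Tell the watchdog the system is still alive. Returns 0 on success. */
int watchdog_feed(int fd)
{
    return ioctl(fd, WDIOC_KEEPALIVE, 0);
}

/* Feed the watchdog at a fixed interval until feeding fails. The
 * interval must be shorter than the watchdog timeout. */
void watchdog_feed_loop(int fd, unsigned int interval_s)
{
    while (watchdog_feed(fd) == 0)
        sleep(interval_s);
}
```

A typical use would be watchdog_start("/dev/watchdog", 30) followed by watchdog_feed_loop(fd, 10) in a small daemon. The device path and the numbers are only examples.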
Another alternative is to use a Phidget, if a USB connection is available at the remote location. The Phidget controller/software is used to control your board over USB. Check whether your board is supported.
I am programming a PCI device in Verilog and also writing its driver.
I have probably introduced some bug in the hardware design, and when I load the driver with insmod the kernel just gets stuck and doesn't respond. Now I'm trying to figure out which is the last line of driver code that runs before my computer gets stuck. I have inserted printk calls in all relevant functions, like probe and init, but none of them get printed.
What other code runs when I use insmod before it gets to my init function? (I guess the kernel gets stuck there.)
printks are often not useful for debugging such a problem. They are buffered sufficiently that you won't see them in time if the system hangs shortly after printk is called.
It is far more productive to selectively comment out sections of your driver and, by process of elimination, determine which line is the (first) problem.
Begin by commenting out the module's entire init section, leaving only return 0;. Build it and load it. Does it hang? Reboot the system, re-enable the next few lines (class_create()?) and repeat.
From what you describe, it looks like your driver is deadlocking the Linux scheduler. That means interrupts from the system timer don't arrive or don't get a chance to be handled by the kernel. There are three possible reasons:
1. You hang somewhere in your driver's interrupt handler (the handler starts its work but never finishes).
2. Your device creates an interrupt storm (the device generates interrupts so frequently that the system does nothing but handle them).
3. You explicitly disable all interrupts in your driver but don't re-enable them.
In all other cases the system will either crash, oops, or panic with the appropriate output, or tolerate the potential misbehavior of your device.
I guess that printk won't work for such an extreme scenario as a hang in kernel mode. It is quite heavyweight and therefore an unreliable diagnostic tool for scenarios like yours.
Tracing by writing debug output directly to video memory works only in simpler environments, like bootloaders or simpler kernels, where the system runs in a default low-end video mode and there is no need to synchronize access to the video memory. In such systems it can be a great tool, and often the only one usable for debugging. Linux is not such a case.
Techniques I can recommend from the software debugging point of view:
Review your driver code, paying special attention to the interrupt handler and the places where you disable/enable interrupts for synchronization.
Commenting out all the driver logic and gradually uncommenting it can help a lot in localizing the issue.
You can try remote kernel debugging of your driver. I advise trying a virtual machine for that purpose, but I don't know whether they allow passing the PCI device through to the virtual machine.
You can try in-memory tracing. The idea is to preallocate a memory chunk with well-known virtual and physical addresses and zero it. Then modify your driver to write trace data into this chunk through its virtual address. (For example, assign a unique integer value to each event that you want to trace and write '1' at the corresponding index of a byte array in the preallocated chunk.) When your system hangs, you can force a full memory dump and then analyze it, using the physical address of the chunk to locate the traces. I have used this technique with a VMware Workstation VM on Windows. When the system hung, I just paused the VM instance and looked at the corresponding .vmem file, which contains the raw layout of the VM instance's physical memory. I'm not sure this trick will work easily, or at all, on Linux, but I would try it.
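The bookkeeping side of the in-memory tracing idea can be sketched in plain C as below. This is a user-space illustration only. In a real driver the area would be the preallocated chunk with a known physical address. The signature string, the trace_* names and the event count are all hypothetical choices of mine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TRACE_EVENTS 64

/* A recognisable signature followed by one byte per event. The signature
 * makes the area easy to find when searching a raw memory dump. */
struct trace_area {
    char signature[8];
    volatile unsigned char hit[TRACE_EVENTS];
};

static struct trace_area trace;

/* Zero all event bytes and stamp the signature. */
void trace_init(void)
{
    /* The signature must fill its field exactly (no NUL stored). */
    assert(sizeof(trace.signature) == sizeof("DRVTRACE") - 1);

    for (size_t i = 0; i < TRACE_EVENTS; i++)
        trace.hit[i] = 0;
    memcpy(trace.signature, "DRVTRACE", sizeof(trace.signature));
}

/* Record that the code path with this event id was reached.
 * Out-of-range ids are ignored. */
void trace_mark(unsigned int event)
{
    if (event < TRACE_EVENTS)
        trace.hit[event] = 1;
}

/* Highest event id recorded so far, or -1 if none. This is what you
 * would read out of the dump by hand after a hang. */
int trace_last_hit(void)
{
    for (int i = TRACE_EVENTS - 1; i >= 0; i--)
        if (trace.hit[i])
            return i;
    return -1;
}
```

With one trace_mark(n) call placed before each suspicious step of init and probe, the highest marked index in the dump points at the last step that started before the hang.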
Finally, you can try to trace the messages on the PCI bus, but I'm not an expert in this field and not sure whether it would help in your case.
In general, kernel debugging is quite a tricky task where many tricks are in use, and each of them works only for a specific set of cases. :(
I would put a logic analyzer on the bus lines (on an FPGA you could use ChipScope or similar). You'll then be able to tell which access is the cause (and fix the hardware). It will be useful anyway for debugging or analyzing future issues.
Another way would be to use the kernel crash dump utility, which has saved me some headaches in the past. Depending on your Linux distribution, it may need to be installed (it is available by default in RH). See http://people.redhat.com/anderson/crash_whitepaper/
There isn't really anything that is run before your init. Bus enumeration is done at boot, if that goes by without a hitch the earliest cause for freezing should be something in your driver init AFAIK.
You should be able to see printks as they are printed, they aren't buffered and should not get lost. That's applicable only in situations where you can directly see kernel output, such as on the text console or over a serial line. If there is some other application in the way, like displaying the kernel logs in a terminal in X11 or over ssh, it may not have a chance to read and display the logs before the computer freezes.
If for some other reasons the printks still do not work for you, you can instead have your init function return early. Just test and move the return to later in the init until you find the point where it crashes.
It's hard to say what is causing your freezes, but interrupts is one of those things I would look at first. Make sure the device really doesn't signal interrupts until the driver enables them (that includes clearing interrupt enables on system reset) and enable them in the driver only after all handlers are registered (also, clear interrupt status before enabling interrupts).
Second thing to look at would be bus master transfers, same thing applies: Make sure the device doesn't do anything until it's asked to and let the driver make sure that no busmaster transfers are active before enabling busmastering at the device level.
The fact that the kernel gets stuck as soon as you install your driver module makes me wonder whether some other driver (built into the kernel?) is already driving the device. I made this mistake once, which is why I am asking. I'd look for the string "Kernel driver in use" in the output of 'lspci -k' before installing the module. In any case, your printks should be visible in the dmesg output.
In addition to Claudio's suggestion, a couple more debug ideas:
1. Try kgdb (https://www.kernel.org/doc/htmldocs/kgdb/EnableKGDB.html)
2. Use JTAG interfaces to connect to debug tools (I think these vary between devices and vendors, so you'll have to figure out which debug tools you need for your particular hardware).
I don't know where I am making a mistake. I am trying to connect my host PC (Windows 7) to a target PC (a virtual machine with Windows 7) in order to start remote kernel debugging.
VMware (virtual machine) serial port settings:
WinDbg kernel debugging:
Boot virtual machine settings:
Whether I turn the virtual machine on or off, nothing happens.
Does anyone know what I'm doing wrong? By the way, is it possible to view the contents of variables in a driver using LiveKd?
I changed the debug port to 2 and the host machine can now connect to the target machine, but WinDbg shows the error message "Assertion failed: Missing StreamContext Support ..." and the VM hangs at "Starting Windows" and nothing more happens.
Those settings look correct to me. Occasionally when I see the same behavior I just tell WinDbg to "Break" and that appears to finish the connection.
I've been struggling with much the same thing. It's been a while since I've spent much time kernel debugging with WinDbg. I run Linux for pretty much everything, so this time I tried using two KVM/QEMU VMs managed by Libvirt. There is a lot of different complexity there, since the version of Libvirt I'm using doesn't provide easy "ui" methods of connecting serial ports between VMs. (Libvirt hint: in the XML setup for the serial ports, one system's serial port source type must be set to "bind" and the other system's to "connect", even for serial type "unix".)
Finally, I was able to use PuTTY on both VMs and chat back and forth, confirming that the COM ports I had chosen were indeed connected.
... and still WinDbg on my debug host continued to say "Waiting to connect..."
Just confirming @jcopenha's answer: sending Break did just work for me (I don't have a Break key on my laptop keyboard, so I used the Debug menu to choose "Break").
The Target system is frozen (yes, after the target was fully booted, which was another question I couldn't remember the answer to), and !process gives me interesting info from the target system. I would Up-Vote their answer, but I am new to StackOverflow and don't have the reputation yet.
Thank you!