IMX fsl_micfil driver hang because DMA is not available yet

I'm working on an embedded device based on an NXP i.MX8M mini SoC. It is running Linux based on NXP's "hardknott" Yocto recipe: https://source.codeaurora.org/external/imx/imx-manifest/tree/imx-5.10.52-2.1.0.xml?h=imx-linux-hardknott
Here's the bitbake script for the kernel it is using: https://source.codeaurora.org/external/imx/meta-imx/tree/meta-bsp/recipes-kernel/linux?h=hardknott-5.10.52-2.1.0
I believe it is pulling the kernel sources from here: https://source.codeaurora.org/external/imx/linux-imx/tree/?h=lf-5.10.y
Our device has a digital microphone that is being driven by a modified version of NXP's fsl_micfil.c driver: https://source.codeaurora.org/external/imx/linux-imx/tree/sound/soc/fsl/fsl_micfil.c?h=lf-5.10.y
Our modifications are:
In fsl_micfil_hw_params, where their code does "enable channels" (around line 1592), we've changed it to only enable the channel we care about instead of all of them.
We changed:
ret = regmap_update_bits(micfil->regmap, REG_MICFIL_CTRL1,
0xFF, ((1 << channels) - 1));
To:
ret = regmap_update_bits(micfil->regmap, REG_MICFIL_CTRL1,
0xFF, 1 << 6);
At the start of fsl_micfil_probe (around line 2202), we add code to pull a GPIO line (the mic's enable line) low. If the GPIO isn't available at that time, we return -EPROBE_DEFER so that the probe is retried later.
Further down in fsl_micfil_probe (around line 2334), we change the DMA channel from 0 to 6 (to align with our enable change).
We changed:
micfil->dma_params_rx.addr = res->start + REG_MICFIL_DATACH0;
To:
micfil->dma_params_rx.addr = res->start + REG_MICFIL_DATACH6;
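To make the difference between the two enable masks concrete, here is a small userspace sketch. It assumes, as the 0xFF mask and the "enable channels" comment suggest, that bit n of the low byte of CTRL1 enables channel n. The function names are made up for illustration and are not part of the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Low 8 bits are treated as per-channel enable bits (assumption, based on
 * the 0xFF mask passed to regmap_update_bits() in the driver). */
#define CH_EN_MASK 0xFFu

/* Stock behaviour: enable channels 0 .. channels-1. */
uint32_t enable_first_n(unsigned int channels)
{
    assert(channels >= 1 && channels <= 8);
    return ((1u << channels) - 1u) & CH_EN_MASK;
}

/* Modified behaviour: enable exactly one channel. */
uint32_t enable_only(unsigned int channel)
{
    assert(channel < 8);
    return (1u << channel) & CH_EN_MASK;
}
```

With these, enable_first_n(2) gives 0x03 (channels 0 and 1), while enable_only(6) gives 0x40, which is the value written by the modified driver.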
That having been described, here's the problem.
Our application is trying to read audio from this microphone (we can test it with the arecord command).
We have found that if we launch our application immediately after Linux boots, the device hangs. We need to power-cycle it to recover.
If we wait about 60 seconds after bootup, however, we see the following messages appear on the console:
[ 60.386133] imx-sdma 302c0000.dma-controller: firmware found.
[ 60.392010] imx-sdma 302c0000.dma-controller: loaded firmware 4.6
If we launch our application after these messages appear, everything works fine.
I looked through the implementation of the imx-sdma driver: https://source.codeaurora.org/external/imx/linux-imx/tree/drivers/dma/imx-sdma.c?h=lf-5.10.y
The driver loads some kind of firmware when it starts up. The first message ("firmware found") is presented by the probe function. The second ("loaded firmware") is presented by one of the two power management resume functions (sdma_resume or sdma_runtime_resume).
So it appears that the problem is that this DMA driver is not loading until the system has been running for about 60 seconds. I assume this is due to some kind of lazy initialization that waits until some kernel driver requires access. And my problem is happening because my driver requires access and is initializing before the DMA driver has loaded.
So, what's a good fix for this? I assume I need to add something to the fsl_micfil driver to request startup of the DMA driver. But I don't know how to do this and I don't actually know if that is the correct fix for this problem.
What would you suggest as a proper fix for this bug?
(FWIW, we made very similar changes to an older version of the kernel - based on Yocto's "sumo" branch, running kernel 4.14, and there are no problems there. So clearly something changed between kernel 4.14 and 5.10 that our code needs to deal with, but I don't know what that might be.)

I eventually figured out the reason (but not the best solution) for the problem.
As I wrote in the original message, the micfil driver uses imx-sdma, which tries to load firmware from the file system.
The "Sumo" (kernel 4.14) branch has no problem with this. At the time the driver starts, the root file system has been mounted (read-only), so it has no problem loading its firmware.
With the "Hardknott" (kernel 5.10) and "Kirkstone" (kernel 5.15) branches, however, the SDMA driver tries to load very early in the boot sequence, before any file systems have been mounted, so direct loading fails. It then posts an event via udev to request that user-space code provide the firmware via sysfs. But the driver appears to start so early (within 10-15 ms, according to dmesg) that udev isn't able to queue up the event. I'm not sure of the specific reason, but debugging shows that the event is never delivered to user space.
The kernel side of the udev request times out after 60 seconds, at which point the driver retries the request, which succeeds because all the file systems are fully mounted by that point.
An ideal fix would be to modify the imx-sdma driver so it doesn't start until after the file system mounts, so it can directly load its firmware. Or at minimum, after udev is ready and able to deliver the firmware-load event to user space. I let NXP know about this issue, so maybe they'll fix it in a future release of their driver.
Until that happens, my workaround is to change the kernel configuration so the imx-sdma driver is compiled as a loadable module. This means it can't load and start until after the root file system is available, at which point it has no problem finding its firmware. dmesg shows that this happens about 6s into the boot sequence - later than I'd prefer, but early enough that no application software has started running, so nothing hangs or crashes.

Bootloader Strategy for Corrupt Applications

I've implemented a bootloader for a Kinetis ARM Cortex-M4 microcontroller.
The main application (starting at 0x10000) is re-programmed via the bootloader over a custom RS232 interface. I've implemented jumpToApplication and jumpToBootloader functions from the bootloader and application perspectives and all works fine so far.
One strategy I'm keen to understand is what to do upon the event of a corrupt main application?
The bootloader currently checks the stack pointer and program counter of the main application before deciding whether to jump. However, if the main application is corrupt, then one of two issues will occur:
The main application will hang and make it difficult to re-program
The microcontroller will reboot and will be stuck in a bootloader > application > bootloader (etc) loop
I have a SharedData structure which allows me to share data (via a fixed RAM location) between both the bootloader and application. I have considered adding a rebootCounter to this structure which would be incremented upon the HardFaultInterrupt being triggered in the main application.
This value could be tested in the bootloader and, depending on the counter value, a decision could be made as to whether to stay in the bootloader or try to launch the application.
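A minimal sketch of that idea, written as plain C so it can be tried on a host. The struct layout, the SHARED_MAGIC marker and the MAX_FAULTS threshold are all hypothetical; the real SharedData would sit at a fixed RAM address that survives a warm reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SHARED_MAGIC 0x5AFEB007u /* hypothetical "struct is initialised" marker */
#define MAX_FAULTS   3u          /* hypothetical threshold */

/* Hypothetical layout of the shared structure. */
typedef struct {
    uint32_t magic;
    uint32_t rebootCounter;
} SharedData;

/* Application side: called from the hard fault handler before resetting. */
void record_hard_fault(SharedData *sd)
{
    assert(sd != NULL);
    sd->rebootCounter++;
}

/* Bootloader side: returns true if the application should be started. */
bool bootloader_should_jump(SharedData *sd)
{
    assert(sd != NULL);
    if (sd->magic != SHARED_MAGIC) {
        /* Cold boot: RAM contents are undefined, so start counting afresh. */
        sd->magic = SHARED_MAGIC;
        sd->rebootCounter = 0;
    }
    return sd->rebootCounter < MAX_FAULTS;
}

/* Bootloader side: called after a successful re-programming. */
void clear_fault_count(SharedData *sd)
{
    assert(sd != NULL);
    sd->rebootCounter = 0;
}
```

Note that this only counts faults the application itself manages to report. An image that is corrupt enough to have no working fault handler would not increment the counter.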
Are there more "industry standard" ways of dealing with this?
UPDATE
To clarify, the ultimate reason for asking this question is to cover the following scenario:
Bootloader is programmed into the device during production phase via JTAG
Main application (latest build) is loaded during testing phase
During the testing phase, there is a power-cut or connection issue and the device is only partially programmed
When power is applied again, the bootloader will "assume" that there is a valid program in the main part of flash and will "jump" to this application
The microcontroller is now stuck in no man's land with no way of re-loading flash via the bootloader without opening up the product's enclosure and re-flashing the chip via JTAG, which is not something we can do when the product is in the field.
During the bootloader programming phase, the firmware is programmed and validated byte-by-byte to ensure that there is no corruption during the data transfer. If corruption occurs during this phase (bad packet due to USB hub issue, for example) then the bootloader will continue to accept re-programming commands.
UPDATE #2
The following post seems to be thinking along similar lines:
https://interrupt.memfault.com/blog/how-to-write-a-bootloader-from-scratch

First, I recommend adding a delay to your bootloader during which it waits for an indicator that a firmware update is starting. I developed something similar: the desktop application sends a start byte periodically, and when you connect your device, it enters bootloader mode and waits five more seconds for new firmware information. That way it does not matter whether there is a valid main application in flash or not.
Another way to check for the existence of the main application is to reserve a specific sector of the flash for firmware information. Before a firmware update, erase that sector. After a successful firmware update, write specific data to that sector. In the bootloader, read this sector to verify that there is a valid application in flash.
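The ordering of the info-sector approach (erase first, mark last) can be sketched as follows. The sector is simulated with a RAM array, and APP_VALID_MARK, the sector size and the function names are made up for illustration. On real hardware the erase and write would go through the flash driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INFO_SECTOR_SIZE 16u
#define ERASED_BYTE      0xFFu       /* assumed erased state of the flash */
#define APP_VALID_MARK   0xA5C3F00Du /* hypothetical "application valid" word */

/* RAM stand-in for the reserved firmware-information sector. */
typedef struct {
    uint8_t bytes[INFO_SECTOR_SIZE];
} InfoSector;

/* First step of an update: invalidate the old application. */
void info_erase(InfoSector *s)
{
    assert(s != NULL);
    memset(s->bytes, ERASED_BYTE, sizeof s->bytes);
}

/* Last step of an update: only after programming and verification succeed. */
void info_mark_valid(InfoSector *s)
{
    uint32_t mark = APP_VALID_MARK;

    assert(s != NULL);
    memcpy(s->bytes, &mark, sizeof mark);
}

/* Bootloader check at reset. */
bool info_app_is_valid(const InfoSector *s)
{
    uint32_t mark;

    assert(s != NULL);
    memcpy(&mark, s->bytes, sizeof mark);
    return mark == APP_VALID_MARK;
}
```

If power is lost anywhere between info_erase() and info_mark_valid(), the sector still reads as erased, so the bootloader stays in update mode instead of jumping into a half-written image.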

I would add some 'magic' value (say 0xDEAD00D) at the end of the application and only jump to the application if the magic value is there. You can have a pointer to that location at 0x10000.
To make things more robust, program the magic value after the verify is completed.
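A host-side sketch of the check the bootloader would do. To keep it testable, the image is a byte buffer, and its first word holds the offset of the magic word within the image, standing in for the pointer stored at 0x10000. The function name and this offset convention are assumptions, and where the pointer can really live depends on the image layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define APP_MAGIC 0xDEAD00Du /* the value suggested above */

/* image[0..3] holds the byte offset of the magic word inside the image. */
bool app_image_is_valid(const uint8_t *image, size_t image_size)
{
    uint32_t offset, magic;

    assert(image != NULL);
    if (image_size < 2 * sizeof(uint32_t))
        return false;
    memcpy(&offset, image, sizeof offset);
    /* Reject offsets outside the image, e.g. 0xFFFFFFFF from erased flash. */
    if (offset < sizeof offset || offset > image_size - sizeof magic)
        return false;
    memcpy(&magic, image + offset, sizeof magic);
    return magic == APP_MAGIC;
}
```

Because the magic word is programmed last, a transfer that is interrupted part-way leaves either an invalid offset or an erased word where the magic should be, and the check fails in both cases.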

Using kexec --load-panic to reboot if file system is corrupted

I have a device which loads a small 'safe mode' Yocto image from coreboot, then selects a larger image to load, and performs a kexec to load that image. Typically this works, but in rare cases the target image's file system has been corrupted and the kernel panics on boot.
Since the device will eventually be deployed into locations that are difficult to access, I was hoping to find a way to recover from any kernel panic without having to physically reboot the device.
To fix this, I added an init script using "init=/sbin/init.sh" in the kexec command line when the new kernel is loaded, and I added a recovery kernel load using "kexec --load-panic" in the init script on the 2nd file system. This method successfully recovers kernel panics that happen late in the boot process, but I encountered a file system which was trashed in a particular way so that the kernel panic would occur before the init script gets launched. Since the init script isn't executed, the panic kernel never gets loaded, and the device must be power cycled.
To fix this, I tried adding the recovery kernel into the initial small kernel loaded by coreboot, but it appears to only handle kernel panics that occur before the "kexec --exec" command loads the new kernel.
I'm trying to figure out what is the best way to solve this. For example, I could add validation before I kexec to the new image. I currently check that the file system can be mounted, that its kernel file and the init script are present. If anyone knows which other files are necessary to get to the init script, I could add them to my validation.
Alternatively, is there a way to load the new kernel and kexec to it with the recovery kernel "--load-panic" parameter already loaded?
I tried putting both the kexec --load and --load-panic in the same line, but that doesn't work.
Any recommendation is greatly appreciated.

Booting linux kernel using a simple char driver as console?

I'm trying to boot the linux kernel (v3.16.1) on a simulation model of a Sparc v8 processor for an academic project.
The simulation model consists of a cpu, memory, timer and a simple polling based output device. We've modified the kernel so that a bootloader is not necessary. We directly put the kernel image in memory, set up some necessary variables and jump into kernel code. We have a rudimentary polling based output-device, and we've been able to direct output of printk to this device.
The kernel boots all the way up to the start of "/init". After this point no output is visible. Just before this point, there's a warning displayed : "Warning: unable to open an initial console." My filesystem image seems to be fine, and contains a /dev/console node (I checked this with Qemu).
My understanding is that while printk works fine (using an early console), user processes need a device node with a proper device driver to be set up. Printk works fine, so is there a way to see all writes by user processes to the console via printk ? There's an existing driver called "ttyprintk" which sends all writes to printk. I enabled it and tried using it by passing "console=ttyprintk" kernel argument, but this gives the same warning. The kernel is not able to open "/dev/console" for writing.
My questions are:
1. Can I write a simple character device driver and use it as my console? Inside this driver I plan to send all writes to printk. Is this possible?
2. How can I ask the kernel to use this as my console? Would the kernel argument "console = /dev/MyDriver" work?
3. Is there a simpler way to have /init and other user processes use my rudimentary output device as a console?
4. Is there some other reason that could be causing the "Warning: unable to open an initial console." message?
Thanks for any hints. I am new to kernel programming.
-neha

The way the /dev/console driver works is to attach to some other tty device (or devices!), which is/are then used for the console. The sysfs attribute active of the console device (try /sys/class/tty/console/active) will tell you what device the console is attached to at the moment.
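If you want to check that attribute from a program rather than with cat, a small helper like the following would do. The helper name is made up, and it simply returns the first line of the file, which may name more than one device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Reads the first line of `path` into buf, without the trailing newline.
 * Returns 0 on success, -1 if the file cannot be opened or is empty. */
int read_first_line(const char *path, char *buf, size_t buflen)
{
    FILE *f;

    assert(path != NULL && buf != NULL && buflen > 0);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)buflen, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

Calling read_first_line("/sys/class/tty/console/active", buf, sizeof buf) from init would then tell you which tty, if any, the console ended up attached to.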
The kernel also tends to log console changes:
[ 0.186989] dw-apb-uart ffc02000.serial0: ttyS0 at MMIO 0xffc02000 (irq = 194, base_baud = 6250000) is a 16550A
[ 0.755529] console [ttyS0] enabled
In the above log, once the serial port device was created the kernel decided to use it as a console. This refers to the binding of the device to the driver in the kernel, not the device node in /dev. The latter does not matter here. Also understand that the attachment of the console device to a tty happens in the kernel. /dev/console is not a symlink to another device node.
The kernel has chosen ttyS0 because I told it to via the kernel command line, console=ttyS0,115200n8. Without a console argument, the kernel uses the first console that registers with register_console().
So the question here is how can one get /dev/ttyprintk to be attached to /dev/console. And the answer appears to be you can't.
A workaround might be to create a custom initramfs that changes the /dev/console device node from major 5 minor 1 to use minor 3, thus changing it into /dev/ttyprintk. Or use a symlink to achieve the same thing. This should get init to use ttyprintk as its stdin/stdout/stderr.
In your example, writing a tty/console driver for your output device would be the right way. Make it the console and then the kernel sends printk there.

how to debug a pci device and linux driver

I am programming a PCI device in Verilog and also writing its driver.
I have probably introduced a bug in the hardware design, and when I load the driver with insmod the kernel just gets stuck and doesn't respond. Now I'm trying to figure out which line of driver code is the last one to run before my computer gets stuck. I have inserted printk calls in all relevant functions, like probe and init, but none of them get printed.
What other code runs when I use insmod before it gets to my init function? (I guess the kernel gets stuck there.)

printks are often not useful for debugging such a problem. They are buffered sufficiently that you won't see them in time if the system hangs shortly after printk is called.
It is far more productive to selectively comment out sections of your driver and by process of elimination determine which line is the (first) problem.
Begin by commenting out the entire module's init section leaving only return 0;. Build it and load it. Does it hang? Reboot system, reenable the next few lines (class_create()?) and repeat.

From what you describe, it looks like your driver is deadlocking the Linux scheduler. That means that interrupts from the system timer don't arrive or don't get a chance to be handled by the kernel. There are three possible reasons:
You hang somewhere in your driver's interrupt handler (the handler starts its work but never finishes it).
Your device creates an interrupt storm (the device generates interrupts so frequently that the system does nothing but handle your device's interrupts).
You explicitly disable all interrupts in your driver but don't re-enable them.
In all other cases the system will either crash (with an oops or panic and all the appropriate output) or tolerate the potential misbehavior of your device.
I guess that printk won't work for a scenario as extreme as a hang in kernel mode. It is quite heavyweight, and therefore an unreliable diagnostic tool for scenarios like yours.
Tracing by writing directly to video memory works only in simpler environments, like bootloaders or simpler kernels, where the system runs in a default low-end video mode and there is no need to synchronize access to the video memory. In such systems, writing debug output directly to video memory can be a great tool, and often the only one available for debugging. Linux is not such a case.
What techniques can be recommended from the software debugging point of view:
Review your driver code, paying special attention to the interrupt handler and to places where you disable/enable interrupts for synchronization.
Commenting out all of the driver logic and then gradually uncommenting it can help a lot with localizing the issue.
You can try remote kernel debugging of your driver. I'd advise trying a virtual machine for that purpose, but I don't know whether they allow a PCI device to be passed through to the virtual machine.
You can try the trick of in-memory tracing. The idea is to preallocate a memory chunk with well-known virtual and physical addresses and zero it. Then modify your driver to write trace data into this chunk using its virtual address. (For example, assign a unique integer value to each event that you want to trace and write '1' at the corresponding index of a byte array in the preallocated chunk.) Then, when your system hangs, you can force a full memory dump and analyze it, using the physical address of the chunk to locate the traces. I used this technique with a VMware Workstation VM on Windows. When the system hung, I just paused the VM instance and looked at the .vmem file that contains the raw layout of the VM instance's physical memory. I'm not sure whether this trick will work easily, or at all, on Linux, but I would try it.
Finally, you can try to trace the messages on the PCI bus, but I'm not an expert in this field and am not sure whether it can help in your case.
In general, kernel debugging is quite a tricky task: a lot of tricks are in use, and each of them works only for a specific set of cases. :(
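The in-memory tracing idea from the list above can be sketched in plain C. This is a host-side illustration only: in a driver the buffer would be the preallocated chunk with a known physical address, and the event names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_SLOTS 64u

/* Hypothetical event ids, numbered in the order they are expected to occur. */
enum trace_event {
    EV_INIT_ENTER = 0,
    EV_PROBE_ENTER,
    EV_IRQ_REQUESTED,
    EV_IRQ_HANDLER_ENTER,
    EV_IRQ_HANDLER_EXIT
};

/* One byte per event; volatile so every store really reaches memory. */
volatile uint8_t trace_buf[TRACE_SLOTS];

/* Zero the buffer before tracing starts. */
void trace_reset(void)
{
    for (size_t i = 0; i < TRACE_SLOTS; i++)
        trace_buf[i] = 0;
}

/* Record that event `ev` was reached. */
void trace_mark(unsigned int ev)
{
    assert(ev < TRACE_SLOTS);
    trace_buf[ev] = 1;
}

/* Post-mortem helper: highest-numbered event set in a dump, or -1 if none. */
int trace_last_marked(const volatile uint8_t *dump, size_t n)
{
    assert(dump != NULL);
    for (size_t i = n; i > 0; i--)
        if (dump[i - 1] != 0)
            return (int)(i - 1);
    return -1;
}
```

Because the ids are numbered in expected execution order, the highest marked slot in the dump points at the last traced location the driver reached before the hang.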

I would put a logic analyzer on the bus lines (on an FPGA you could use ChipScope or similar). You'll then be able to tell which access is the cause (and fix the hardware). It will be useful anyway for debugging or analyzing future issues.
Another way would be to use the kernel crash dump utility, which has saved me some headaches in the past. Depending on your Linux distribution, it may need to be installed (it is available by default in RH). See http://people.redhat.com/anderson/crash_whitepaper/

There isn't really anything that is run before your init. Bus enumeration is done at boot, if that goes by without a hitch the earliest cause for freezing should be something in your driver init AFAIK.
You should be able to see printks as they are printed, they aren't buffered and should not get lost. That's applicable only in situations where you can directly see kernel output, such as on the text console or over a serial line. If there is some other application in the way, like displaying the kernel logs in a terminal in X11 or over ssh, it may not have a chance to read and display the logs before the computer freezes.
If for some other reasons the printks still do not work for you, you can instead have your init function return early. Just test and move the return to later in the init until you find the point where it crashes.
It's hard to say what is causing your freezes, but interrupts are one of the things I would look at first. Make sure the device really doesn't signal interrupts until the driver enables them (that includes clearing interrupt enables on system reset) and enable them in the driver only after all handlers are registered (also, clear interrupt status before enabling interrupts).
Second thing to look at would be bus master transfers, same thing applies: Make sure the device doesn't do anything until it's asked to and let the driver make sure that no busmaster transfers are active before enabling busmastering at the device level.

The fact that the kernel gets stuck as soon as you install your driver module makes me wonder if any other driver (built in to the kernel?) is already driving the device. I made this mistake once, which is why I am asking. I'd look for the string "kernel driver in use" in the output of 'lspci' before installing the module. In any case, your printks should be visible in dmesg output.

In addition to Claudio's suggestion, a couple more debug ideas:
1. Try kgdb (https://www.kernel.org/doc/htmldocs/kgdb/EnableKGDB.html)
2. Use JTAG interfaces to connect to debug tools (I think these vary between devices and vendors, so you'll have to figure out which debug tools you need for the particular hardware)

How to know that the kernel has panicked?

I want to be able to monitor kernel panics - know if and when they have happened.
Is there a way to know, after the machine has booted, that it went down due to a kernel panic (and not, for example, an ordered reboot or a power failure)?
The machine may be configured with KDUMP and/or KDB, but I prefer not to assume that either is or is not installed.
Patching the kernel is an option, though I prefer to avoid it. But even if I do it, I'm not sure what can the patch do.
I'm using kernel 2.6.18 (ancient, I know). Solutions for newer kernels may be interesting too.
Thanks.

The kernel module 'netconsole' may help you to log kernel printk messages over UDP.
You can view the log messages on a remote syslog server, even if the machine has rebooted.
Introduction:
=============
This module logs kernel printk messages over UDP allowing debugging of
problem where disk logging fails and serial consoles are impractical.
It can be used either built-in or as a module. As a built-in,
netconsole initializes immediately after NIC cards and will bring up
the specified interface as soon as possible. While this doesn't allow
capture of early kernel panics, it does capture most of the boot
process.
Check kernel document for more information: https://www.kernel.org/doc/Documentation/networking/netconsole.txt
