PCIe hot reset vs slot reset - linux-kernel

I am working on the Linux PCIe and NVMe drivers. I came across a function in the PCI driver, pci_reset_bus(), which performs a PCI reset via the slot or the bus. I understand that the reset via the bus is the "PCIe hot reset" defined in the PCIe spec, but I am not sure what a PCI slot reset is (the one implemented by __pci_reset_slot()).
Could anyone help me understand this? Also, can I use this exported symbol, pci_reset_bus(), for a PCIe hot reset? I want to use it in my custom NVMe driver.

I found a good tool ("NVMeCraft") for handling some NVMe SSDs. You can check your question directly with that tool, and you can find it by searching online.

Related

How to tell linux retrain and scan PCIe bus?

We have an embedded board with an iMX8M-Plus processor running Linux v5.4.161. The board has one PCIe bus, which is connected to an FPGA. When we power up the board, the FPGA is not yet configured, so it behaves as if it were not on the PCIe bus.
Once Linux has fully booted, we configure the FPGA, and only after that does it start acting as a PCIe endpoint (device).
At this point, when I run lspci, it returns nothing.
When I first execute echo "1" > /sys/bus/pci/rescan as suggested here and here, and then run lspci, I still get nothing.
But if I reboot Linux without resetting the FPGA, the FPGA becomes visible in the lspci list. Rebooting Linux is not an option for us. Somehow I need to tell Linux to repeat at runtime whatever it does at boot time, but I have not found a way to do that so far.
According to the Texas Instruments support forum, if the PCIe link is not trained at boot time, the rescan command never works.
At boot time, while Linux loads a PCI driver, it tries to establish a PCIe link. I can see this with an oscilloscope: the PERST pin is asserted and PCIE_CLK is generated for a while, then stops if no device is detected. The rescan command never does that.
Also, there is no PCIe device in the system on which to execute echo 1 > $pcidevice/remove in order to make the rescan functional, and there is no device or bus to power off and back on with something like echo 0 > /sys/bus/pci/slots/.../power
I also learned that in old Linux versions (v2.6) there was a method of adding a fake PCIe device, one that does not physically exist, to solve this problem. For that I took the fakephp.c driver from an old Linux repo and ported it to ours. After fixing a couple of deprecated-function problems, it compiled for Linux kernel v5.4. modprobe fakephp worked and the driver loaded, but the fake device did not show up in my device list. Here it is mentioned that the fakephp driver was removed from mainline Linux because the PCI core has similar functionality, but it does not say how to use it.
Long story short, I am stuck here: I need my FPGA to be visible in the lspci list without restarting Linux.
I recommend configuring the FPGA in U-Boot to avoid these kinds of problems. Connect SPI pins to the FPGA's config pins and run it in slave configuration mode.
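For reference, the sysfs remove-and-rescan sequence mentioned in the question can be sketched in Python as follows. This is only a minimal sketch: the function name remove_and_rescan and its root parameter are made up for illustration (root exists so the code can be tried against a fake directory tree), and on a real system the writes need root privileges. As the question notes, this only asks the PCI core to re-enumerate; it is not expected to make the host controller retrain a link that never came up at boot.

```python
from pathlib import Path

# Standard sysfs location of the PCI bus on Linux.
SYSFS_PCI = Path("/sys/bus/pci")


def remove_and_rescan(bdf=None, root=SYSFS_PCI):
    """Optionally remove one device, then ask the PCI core to rescan.

    bdf  -- device address such as "0000:01:00.0", or None to skip removal
    root -- sysfs PCI directory (hypothetical parameter, overridable for testing)

    Returns the sorted list of device addresses listed after the rescan.
    """
    root = Path(root)
    if bdf is not None:
        # Same effect as: echo 1 > /sys/bus/pci/devices/<bdf>/remove
        (root / "devices" / bdf / "remove").write_text("1")
    # Same effect as: echo 1 > /sys/bus/pci/rescan
    (root / "rescan").write_text("1")
    devices = root / "devices"
    if not devices.is_dir():
        return []
    return sorted(p.name for p in devices.iterdir())
```

Calling remove_and_rescan() with no arguments performs only the rescan, which matches the situation in the question where no device exists to remove.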

PCIe DMA problems in ARM Machines

I'm trying to write a PCIe driver for an ARM machine (Cavium ThunderX2). I'm working with Xilinx Alveo FPGAs. Our work involves migrating pages between heterogeneous nodes (x86 and ARM) and the driver takes care of the DMA between the host and the FPGA, and handles the device interrupts.
The DMA doesn't work (From Device/To Device) and I get "ARM SMMU v3.x 0x10 event occurred" errors. I tried disabling the SMMU (recommended by some threads in the NVIDIA community - https://forums.developer.nvidia.com/t/how-dma-works-in-arm-the-dma-stopped-working-with-our-pci-driver/53699), but that leads to a protection issue ("RAS Controller stopped"), and the system hangs.
I use the dma_map_single() API from dma-mapping.h to convert the virtual address to a DMA-capable bus address. Would dma_alloc_coherent() make a difference? (Of course, I'll try this out.)
I'm unable to figure out the problem. Is this a PCIe driver issue, an issue with the device, or is there a fix/patch available for the ARM PCIe DMA ops? Any help would be appreciated!
Thanks,
Narayan
Error snippet

Unable to wake pci bus from D3 sleep state

On my board (x86_64, Android Lollipop, kernel 3.14), the "pci bus" goes into the D3 sleep state, and when I try to wake it up by setting it to the D0 state, it fails with this message:
Error log:
Refused to change power state, currently in D3.
After going through the PCI architecture, I learned that we cannot bring a PCI device directly from D3hot to D0 initialized; we need to follow a sequence like:
D3hot -> D0Uninitialized -> D0Initialized
But I'm unable to figure out how to do that. Please help me find an appropriate solution.
After debugging further, I figured out that the power state transition for the PCI device (D3 to D0) works fine when it is requested from within the PCI driver (pcieport). But when I try to wake up the PCI device through the iwlwifi driver, I hit the problem above, because the driver is not able to write the wake request to the PCI chip.
Any help or clue will be much appreciated.
After a lot of research, I found that any device that wants to use ACPI features to communicate with the OS must be registered in the ACPI table.
In my case, my Wi-Fi chip was not registered in the ACPI table, so it was unable to use ACPI features.
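When debugging this kind of problem, it can help to check which D-state the device itself reports. The state is held in the two low bits of the PMCSR register, which sits at offset 4 inside the PCI Power Management capability (capability ID 0x01). The Python sketch below walks the capability list in a raw config-space dump, such as the bytes read from /sys/bus/pci/devices/<address>/config, and decodes those bits. The function name pm_state is made up for illustration, and the sketch assumes the dump is long enough to cover the capability (reading beyond the first 64 bytes of that file normally requires root).

```python
import struct

# Offsets and IDs from the PCI configuration space layout.
PCI_STATUS = 0x06               # 16-bit status register
PCI_STATUS_CAP_LIST = 0x10      # status bit: capability list present
PCI_CAPABILITY_LIST = 0x34      # pointer to the first capability
PCI_CAP_ID_PM = 0x01            # Power Management capability ID
PCI_PM_CTRL = 4                 # PMCSR offset inside the PM capability
PCI_PM_CTRL_STATE_MASK = 0x0003 # PMCSR bits 1:0 hold the power state

STATE_NAMES = {0: "D0", 1: "D1", 2: "D2", 3: "D3hot"}


def pm_state(config):
    """Return 'D0', 'D1', 'D2' or 'D3hot' from raw config-space bytes.

    Returns None if the dump is too short or no PM capability is found.
    """
    if len(config) < 0x40:
        return None
    (status,) = struct.unpack_from("<H", config, PCI_STATUS)
    if not status & PCI_STATUS_CAP_LIST:
        return None
    pos = config[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()  # guards against a malformed, looping capability list
    while pos >= 0x40 and pos + PCI_PM_CTRL + 2 <= len(config) and pos not in seen:
        seen.add(pos)
        if config[pos] == PCI_CAP_ID_PM:
            (pmcsr,) = struct.unpack_from("<H", config, pos + PCI_PM_CTRL)
            return STATE_NAMES[pmcsr & PCI_PM_CTRL_STATE_MASK]
        pos = config[pos + 1] & 0xFC  # follow the next-capability pointer
    return None
```

If the value still decodes as D3hot after a wake request, the write to PMCSR did not take effect, which is consistent with the "Refused to change power state" message quoted in the question.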

USB bandwidth / host controller issues - Linux

I have 12 USB 2.0 devices plugged into an Intel NUC D54250WYK running Ubuntu 14.04.
Running lshw -short shows two different USB buses and two host controllers (xHCI and eHCI).
All of the USB devices appear on the same bus and use xHCI regardless of the ports they are plugged into. As a result I'm seeing the following errors in dmesg:
Not enough host controller resources for new device state.
Not enough bandwidth for altsetting 0.
Is there a way to force devices to a specific bus?
I've also read that Linux can have problems with xHCI. Is there a way to force eHCI without recompiling the kernel? Intel does not provide that option in the BIOS.
Last I checked on this, you're in a bit of a bind. It seems xHCI is compiled into the kernel, not built as a module, and if you compile in eHCI/aHCI/oHCI and not xHCI, USB as a whole breaks, possibly due to some built-in support for on-board USB-controlled Bluetooth and Wi-Fi devices on certain motherboards. DO NOT UPDATE YOUR BIOS yet... see if the option to disable xHCI still exists on yours.
At this time, it seems your best option is to disable xHCI in your BIOS. This will likely disable all USB3 controllers, but allow USB2 controllers to work without this issue impeding you.
With respect to the Intel device you described, I don't see many USB ports on it, so I assume you're using hubs. From the tech specs for your device, it looks like you'll have to get access to the internal header to get at the USB2 ports.
Good news for anyone else facing this issue: Intel released a new BIOS (v40) that adds back the option to disable xHCI. In my case I updated the BIOS, disabled xHCI, and everything works as expected.
Beware of platforms that have XHCI ONLY (Apollolake, Denverton).
You will brick your HW if you disable XHCI there.
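To confirm which bus each device actually landed on, before and after changing the BIOS setting, the sysfs USB tree can be read directly. The following Python sketch groups device entries by their busnum attribute and reports the speed attribute (in Mbit/s, so 480 for a high-speed USB 2.0 device). The function name devices_by_bus and its root parameter are made up for illustration; the busnum and speed files are standard sysfs attributes of USB device entries.

```python
from collections import defaultdict
from pathlib import Path


def devices_by_bus(root="/sys/bus/usb/devices"):
    """Group USB devices by bus number.

    Returns {bus_number: [(sysfs_name, speed_in_mbit_per_s), ...]}.
    """
    buses = defaultdict(list)
    for dev in sorted(Path(root).iterdir()):
        busnum = dev / "busnum"
        speed = dev / "speed"
        # Interface entries such as "1-1:1.0" lack these attributes; skip them.
        if not (busnum.is_file() and speed.is_file()):
            continue
        entry = (dev.name, speed.read_text().strip())
        buses[int(busnum.read_text())].append(entry)
    return dict(buses)
```

Root hubs appear as usb1, usb2, and so on, so a bus whose list contains only its root hub has no devices attached to it.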

Keeping device functionality inside device controller rather than OS kernel. What are consequences?

A friend of mine asked me this question in the class and I could not answer it. He asked:
Since we know kernel controls the physical hardware via device drivers, what if all this functionality is kept inside the device controller itself rather than the kernel managing it? What would be the consequences of such a scenario? Good or bad?
I searched online for this question but could not find information about this scenario. Maybe I'm not searching with the right keywords.
Your insight into this will help me clear up my concepts.
Please answer.
Thanks.
Your question seems to propose the elimination of the "device driver" by "keeping" "control (of) the physical hardware ... inside the device controller". The premise for this seems to be:
kernel controls the physical hardware via device drivers.
That description of a device driver is similar to what I've seen written for end-user comprehension rather than from a developer's perspective. The end user is aware of the device, and it is the device driver that takes that abstraction and can control the device down to the specific control bits of each device port.
But a device driver is responsible for mundane housekeeping tasks such as:
maintaining device status and availability;
configuring the device for operation;
managing data flow, setting-up/tearing-down data transfers, copying data between user space and kernel space;
handling interrupts and exceptions.
These tasks are integral to a device driver. These tasks cannot be transferred out of the purview of the kernel driver to a peripheral device.
Sometimes the device driver can only try to manage the device, rather than fully control it, for example, a NIC driver during a packet flood.
There is simply no possibility that you can eliminate a device driver no matter how much of "all this functionality is kept inside the device controller itself". And there would still be control directives/commands issued from the device driver to the peripheral.
The hardware device in question should be a computer peripheral device, not an autonomous robot device. The device should be designed to operate with a computer. Whatever interface there is between processor and device should be suitable for the task. If the peripheral is made more "intelligent", then perhaps the CPU can be unburdened and a high-level command interface can replace low-level sub-operation directives. But only "some" functionality can be transferred to the peripheral, not "all".

Resources