What are transactions and streamID in ARM SMMU - linux-kernel

I have a basic query regarding the ARM SMMU, as per the following link:
https://www.intel.com/content/www/us/en/docs/programmable/683567/21-3/stream-id-falconmesa-stratix-10.html
"Each transaction is also classified by a 10-bit stream ID. The stream ID represents a set or stream of transactions from a particular master device."
What exactly does a "transaction" from a master device mean here, such that it gets an ID? Are these the DRAM addresses that the master device wants to access using DMA via the SMMU?

Related

What happens when we press a key on Windows?

First of all, I should say that I am writing this question from scratch because I have attempted to find good documentation, but nothing stands out...
What happens when we press a key?
I think this is complex but I hope you can help me.
What I want to know: everything, but especially which program starts on the host machine and how the key's electrical signal is encoded and sent...
The eXtensible Host Controller (xHC) has a Periodic Transfer Ring. Windows programs this ring to trigger a transfer every time an interval in milliseconds has passed. The interval is specified in the endpoint descriptor returned by the USB device. When the transfer occurs, the xHC puts a Transfer Event TRB on the event ring and triggers an MSI-X interrupt, which bypasses the IOAPIC as some kind of inter-processor interrupt. If Windows detects a change in the keys pressed, it sends a message to the application which currently has focus (calling the window's procedure) with the key pressed in one of the arguments.
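To make that last step concrete, here is a hypothetical Win32 fragment: the window procedure of the focused application receives WM_KEYDOWN with the virtual-key code in wParam.

#include <windows.h>

/* Hypothetical window procedure: the key pressed arrives as the
   virtual-key code in wParam of the WM_KEYDOWN message. */
LRESULT CALLBACK WndProc(HWND hwnd, UINT msg, WPARAM wParam, LPARAM lParam)
{
    switch (msg) {
    case WM_KEYDOWN:
        if (wParam == VK_RETURN) {
            /* react to the Enter key here */
        }
        return 0;
    case WM_DESTROY:
        PostQuitMessage(0);
        return 0;
    }
    return DefWindowProc(hwnd, msg, wParam, lParam);
}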
I don't know about the electrical signals, but I know the eXtensible Host Controller is the USB controller responsible for interacting with USB devices on modern Windows systems. Since Windows nowadays requires a modern x64 platform, an xHC is almost certainly present on your motherboard. The xHC is a PCI-Express device which is compliant with the PCI-Express specification.
To find an xHC, you:
Find the RSDP ACPI table in RAM;
This table will be found by the UEFI firmware, which acts as a kind of small operating system (OS) during boot of the computer. The OS developers write a small UEFI application named bootx64.efi and place it in the EFI/BOOT directory of a FAT32 partition on the hard disk. The UEFI firmware will directly launch that application on boot of the computer, which allows an OS to be launched without user input (similarly to how it used to work with the legacy BIOS fetching the first sector of the hard disk and executing the instructions found there).
In practice the UEFI application is built with either EDK2 or gnu-efi. These toolchains are aware of the UEFI environment and specification, so the application can use the boot services that are available at that stage. The System Table (which, among other things, points to the ACPI tables) is given as an argument to the "main" function (often called UefiMain) that the UEFI firmware calls in the UEFI application. The code of the application can thus simply use this argument to find the RSDP table and pass it to the OS.
Find the MCFG ACPI table using the RSDP;
The chain of tables is RSDP -> XSDT -> MCFG. Once the OS has found the MCFG, this table specifies the base address of the PCI configuration space. To interact with PCI devices you use memory-mapped IO (MMIO): you write to certain physical addresses and the writes go to the registers of the PCI devices instead of RAM. The MCFG thus specifies the base address at which you will start finding MMIO configuration registers for the different PCI devices that are plugged into the computer.
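A rough sketch of walking that chain, assuming the ACPI tables are identity-mapped so physical pointers can be dereferenced directly, and abbreviating the structures to the fields actually used:

#include <stdint.h>
#include <string.h>

struct acpi_sdt_header {
    char     signature[4];
    uint32_t length;
    uint8_t  revision, checksum;
    char     oem_id[6], oem_table_id[8];
    uint32_t oem_revision, creator_id, creator_revision;
} __attribute__((packed));

struct rsdp {
    char     signature[8];       /* "RSD PTR " */
    uint8_t  checksum;
    char     oem_id[6];
    uint8_t  revision;
    uint32_t rsdt_address;
    uint32_t length;
    uint64_t xsdt_address;       /* valid when revision >= 2 */
    uint8_t  ext_checksum, reserved[3];
} __attribute__((packed));

static struct acpi_sdt_header *find_mcfg(struct rsdp *rsdp)
{
    struct acpi_sdt_header *xsdt = (void *)(uintptr_t)rsdp->xsdt_address;
    uint64_t *entries = (uint64_t *)((uint8_t *)xsdt + sizeof(*xsdt));
    size_t count = (xsdt->length - sizeof(*xsdt)) / sizeof(uint64_t);

    for (size_t i = 0; i < count; i++) {
        struct acpi_sdt_header *t = (void *)(uintptr_t)entries[i];
        if (memcmp(t->signature, "MCFG", 4) == 0)
            return t;            /* MCFG: base addresses of PCI config space */
    }
    return 0;
}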
Iterate on the PCI devices and look at their IDs until you find an xHC.
To iterate on the PCI devices, the PCI convention specifies a formula which is the following:
UINT64 physical_address = base_address + ((bus - first_bus) << 20 | device << 15 | function << 12);
The base_address is for a specific segment group. There can be up to 65536 segment groups, and each can have up to 256 PCI buses (suitable for large servers or large computers with lots of components). Each PCI bus can have up to 32 devices plugged onto it, and each device can have up to 8 functions. A function can also be a PCI bridge. This is quite straightforward to understand because the terminology is clear. The bus here is an actual hardware bus that the PCI devices (like a network card, a graphics card, an xHC, an AHCI controller, etc.) use to communicate with RAM. A function is a functionality of the PCI device, like controlling USB devices, hard disks, or HDMI screens (for graphics cards). A PCI bridge bridges a PCI bus to another PCI bus, which means you can have an almost unlimited number of devices, because bridges allow the tree of devices to be extended with further buses.
Meanwhile, the bus is simply a number between 0 and 255. The first bus is specified in the MCFG ACPI table for a specific segment group. The device is a number between 0 and 31 and the function is a number between 0 and 7. The formula returns a physical address which points to a conventional configuration space (its layout is the same for all functions) with specific registers. These registers are used to determine the type of the device and to load a proper driver for it. Each function of each device thus gets its own configuration space.
For the xHC, there will be only one function, and the IDs returned by its configuration space will be 0x0C for the class code, 0x03 for the subclass and 0x30 for the programming interface (https://wiki.osdev.org/EXtensible_Host_Controller_Interface).
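Putting the formula and those IDs together, a minimal sketch of the scan could look like the following; it assumes the ECAM region from the MCFG is already mapped so it can be dereferenced directly, and the function name is made up.

#include <stdint.h>

/* Scan one segment group's configuration space for an xHC. base_address and
   first_bus come from the MCFG. */
static volatile uint8_t *find_xhci_cfg(uint64_t base_address, uint32_t first_bus)
{
    for (uint32_t bus = first_bus; bus < 256; bus++) {
        for (uint32_t device = 0; device < 32; device++) {
            for (uint32_t function = 0; function < 8; function++) {
                volatile uint8_t *cfg = (volatile uint8_t *)
                    (base_address + (((uint64_t)(bus - first_bus) << 20) |
                                     (device << 15) | (function << 12)));

                uint16_t vendor = cfg[0] | (cfg[1] << 8);
                if (vendor == 0xFFFF)        /* no function here */
                    continue;

                if (cfg[0x0B] == 0x0C &&     /* class: serial bus controller */
                    cfg[0x0A] == 0x03 &&     /* subclass: USB controller */
                    cfg[0x09] == 0x30)       /* prog-if: xHCI */
                    return cfg;
            }
        }
    }
    return 0;                                /* no xHC found */
}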
Once you have found an xHC, it gets rather complex. You need to initialize it and enumerate the USB devices which are currently plugged into the computer. Several steps are needed to get the xHC operational. For this part, I'll leave you to read the xHCI specification, which (in chapter 4) specifies exactly the steps that need to be taken (https://www.intel.com/content/dam/www/public/us/en/documents/technical-specifications/extensible-host-controler-interface-usb-xhci.pdf).
For the keyboard portion I'll leave you to read one of my answers on the Computer Science Stack Exchange: https://cs.stackexchange.com/questions/141870/when-are-a-controllers-registers-loaded-and-ready-to-inform-an-i-o-operation/141918#141918.
Some good links:
https://wiki.osdev.org/Universal_Serial_Bus
https://wiki.osdev.org/PCI

How to record/get memory requests that Last-Level Cache sends to Memory Controller?

When there is an LLC miss, the memory request is sent to the MC to get the data from memory.
Are there any tools that can get information (address/[read or write]/accurate timing) of the memory request sent by the LLC to the MC?
I want this information to be an input for my MC simulator, so that I can schedule them.
I have used a tool called Pin before, but it only records virtual memory addresses and can't give accurate timing.
As far as I know, there are no tools that can capture the memory requests sent by the Last-Level Cache (LLC) to the Memory Controller (MC) in a physical processor. Intel processors have hardware counters that allow monitoring requests to DRAM, but no address information is available; their purpose is only to count the number of requests.
You can use full-system simulators like Simics or M5 to generate memory request traces with timing information. You can also go back to Pin and attach a cycle-accurate CPU simulator, but you will have to model the virtual-to-physical address translation yourself.
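As a side note, a small userspace sketch with perf_event_open() shows what those hardware counters give you: a count of events, with no addresses or per-request timing. PERF_COUNT_HW_CACHE_MISSES is the generic LLC-miss event; the model-specific DRAM/uncore events would need raw event codes instead.

#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    long long count;
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;   /* generic LLC-miss event */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    fd = perf_event_open(&attr, 0, -1, -1, 0);  /* this process, any CPU */
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... run the workload to be measured here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(count));
    printf("LLC misses: %lld\n", count);
    close(fd);
    return 0;
}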

Linux Driver and API architecture for a data acquisition device

We're trying to write a driver/API for a custom data acquisition device, which captures several "channels" of data. For the sake of discussion, let's assume this is a several-channel video capture device. The device is connected to the system via an 8xPCIe Gen-1 link, which has a theoretical throughput of 16Gbps. Our actual data rate will be around 2.8Gbps (~350MB/sec).
Because of the data rate requirement, we think we have to be careful about the driver/API architecture. We've already implemented a descriptor-based DMA mechanism and the associated driver. For example, we can start a DMA transaction for 256KB from the device and it completes successfully. However, in this implementation we're only capturing the data in the kernel driver and then dropping it; we aren't streaming the data to user space at all. Essentially, this is just a small DMA test implementation.
We think we have to separate the problem into three sections: 1. Kernel driver 2. Userspace API 3. User Code
The acquisition device has a register in the PCIe address space which indicates whether there is data to read for any channel from the device. So, our kernel driver must poll for this bit-vector. When the kernel driver sees this bit set, it starts a DMA transaction. The user application however does not need to know about all these DMA transactions and data, until an entire chunk of data is ready (For example, assume that the device provides us with 16 lines of video data per transaction, but we need to notify the user only when the entire video frame is ready). We need to only transfer entire frames to the user application.
Here was our first attempt:
Our user-side API allows a user application to register a function callback for a "channel".
The user-side API has a "start" function that the user application can call; it uses ioctl to send a start message to the kernel driver.
In the kernel driver, upon receiving the start message, we started a kernel thread, which continuously monitors the "data ready" bit-vector, and when it sees new data, copies it over to a driver-allocated (kmalloc) buffer. It keeps doing this until the size of the collected data reaches the "frame size".
At this point a custom Linux signal (similar to SIGINT, SIGHUP, etc.) is sent to the process that is using the driver. Our API catches this signal and then calls the appropriate user callback function.
The user callback function calls a function in the API (transfer_data), which uses an ioctl call to send a userspace buffer address to the kernel, and the kernel completes the data transfer by doing a copy_to_user of the channel frame data to userspace.
All of the above is working OK, except that the performance is abysmal. We can only achieve about 2MB/sec of transfer rate. We need to completely re-write this and we're open to any suggestions or pointers to examples.
Other notes:
Unfortunately, we can not change anything in the hardware device. So we must poll for the "data-ready" bit and start DMA based on that bit.
Some people suggested to look at Infiniband drivers as a reference, but we're completely lost in that code.
You're probably way past this now, but if not here's my 2p.
It's hard to believe that your card can't generate interrupts when it has transferred data. It's got a DMA engine, and it can handle 'descriptors', which are presumably elements of a scatter-gather list. I'll assume that it can generate a PCIe 'interrupt'; YMMV.
Don't bother trawling the kernel for existing similar drivers. You might get lucky, but I suspect not.
You need to write a blocking read, which you supply a large memory buffer to. The driver read op (a) gets a list of user pages for your user buffer and locks them in memory (get_user_pages); (b) creates a scatter list with pci_map_sg; (c) iterates through the list (for_each_sg); (d) for each entry writes the corresponding physical bus address and data length to the DMA controller as what I presume you're calling a 'descriptor'.
The card now has a list of descriptors which correspond to the physical bus addresses of your large user buffer. When data arrives at the card, it writes it directly into user space, into your user buffer, while your user-level read is still blocked. When it has finished the descriptor list, the card has to be able to interrupt, or it's useless. The driver responds to the interrupt and unblocks your user-level read.
And that's it. The details are nasty, of course, and poorly documented, but that should be the basic architecture. If you really haven't got interrupts you can set up a timer in the kernel to poll for completion of transfer, but if it is really a custom card you should get your money back.
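For what it's worth, a very rough sketch of that blocking read follows. It is written against an older kernel API where pci_map_sg() still exists; the acq_* names and struct acq_dev are invented, get_user_pages*() signatures have changed across kernel versions, and the sketch assumes a page-aligned, page-multiple user buffer, so treat it as an outline rather than working driver code.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pci.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>
#include <linux/wait.h>

struct acq_dev {                    /* invented driver state */
    struct pci_dev *pdev;
    wait_queue_head_t done_wq;      /* woken by the card's completion IRQ */
    bool done;
};

/* Invented helpers: push one descriptor to the card, then start the DMA. */
void acq_write_descriptor(struct acq_dev *dev, dma_addr_t addr, unsigned int len);
void acq_start_dma(struct acq_dev *dev);

static ssize_t acq_read(struct file *filp, char __user *buf,
                        size_t count, loff_t *ppos)
{
    struct acq_dev *dev = filp->private_data;
    int nr_pages = count >> PAGE_SHIFT;       /* assumes page-aligned buffer */
    struct page **pages;
    struct scatterlist *sgl, *sg;
    int i, nents;
    ssize_t ret = -EFAULT;

    pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
    sgl   = kcalloc(nr_pages, sizeof(*sgl), GFP_KERNEL);
    if (!pages || !sgl) {
        ret = -ENOMEM;
        goto out;
    }

    /* (a) pin the caller's pages so the card can write into them */
    if (get_user_pages_fast((unsigned long)buf, nr_pages, FOLL_WRITE,
                            pages) != nr_pages)
        goto out;

    /* (b) build a scatter list over those pages and map it for DMA */
    sg_init_table(sgl, nr_pages);
    for (i = 0; i < nr_pages; i++)
        sg_set_page(&sgl[i], pages[i], PAGE_SIZE, 0);
    nents = pci_map_sg(dev->pdev, sgl, nr_pages, PCI_DMA_FROMDEVICE);

    /* (c)+(d) one descriptor (bus address + length) per mapped entry */
    for_each_sg(sgl, sg, nents, i)
        acq_write_descriptor(dev, sg_dma_address(sg), sg_dma_len(sg));

    dev->done = false;
    acq_start_dma(dev);
    wait_event_interruptible(dev->done_wq, dev->done);   /* block until IRQ */

    pci_unmap_sg(dev->pdev, sgl, nr_pages, PCI_DMA_FROMDEVICE);
    ret = count;
out:
    for (i = 0; pages && i < nr_pages && pages[i]; i++)
        put_page(pages[i]);
    kfree(sgl);
    kfree(pages);
    return ret;
}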

Linux driver DMA transfer to a PCIe card with PC as master

I am working on a DMA routine to transfer data from PC to a FPGA on a PCIe card. I read DMA-API.txt and LDD3 ch. 15 for details. However, I could not figure out how to do a DMA transfer from PC to a consistent block of iomem on the PCIe card. The dad sample for PCI in LDD3 maps a buffer and then tells the card to do the DMA transfer, but I need the PC to do this.
What I already found out:
Request bus master
pci_set_master(pdev);
Set the DMA mask
if (dma_set_mask(&(pdev->dev), DMA_BIT_MASK(32))) {
dev_err(&pdev->dev,"No suitable DMA available.\n");
goto cleanup;
}
Request a DMA channel
if (request_dma(dmachannel, DRIVER_NAME)) {
dev_err(&pdev->dev,"Could not reserve DMA channel %d.\n", dmachannel);
goto cleanup;
}
Map a buffer for DMA transfer
dma_handle = pci_map_single(pci_dev, buffer, count, DMA_TO_DEVICE);
Question:
What do I have to do in order to let the PC perform the DMA transfer instead of the card?
Thank you for your help!
First of all thank you for your replies. Maybe I should put my questions more precisely:
In my understanding the PC has to have a DMA controller. How do I access this DMA controller to start a transfer to a memory mapped IO region in the PCIe card?
Our specification demands that the PC's DMA controller initiates the transfer. However, I could only find examples where the device does the DMA job (DMA_mapping.txt, LDD3 ch. 15). Is there a reason why nobody uses the PC's DMA controller (it still has DMA channels, though)? Would it be better to request a specification change for our project?
Thanks for your patience.
Look up DMA_mapping.txt. There's a long section in there that tells you how to set the direction ('DMA direction', line 408).
EDIT
Ok, since you edited your question... your specification is wrong. You could set up the system DMA controller, but it would be pointless, because it's too slow, as I said in the comments. Read this thread.
You must change your FPGA to support bus mastering. I do this for a living - contact me off-thread if you want to sub-contract.
What you are talking about is not really DMA. DMA is when your device accesses memory while the CPU itself is not involved (with the exception of the PC's memory controller, which is usually embedded into the PC's CPU these days). Not all devices can do it, and if you are using an FPGA, then you surely need some sort of DMA controller in your design (e.g. the Expresso DMA Core or similar). In your case, you just have to write to the mapped memory region (e.g. one you obtain with ioremap_nocache) using iowrite calls (e.g. iowrite32) followed by a write memory barrier, wmb(). Which I/O BAR and address you have to write to depends entirely on your device.
Hope it helps. Good Luck!
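For illustration only, a minimal sketch of that programmed-I/O path might look like the following; it assumes BAR 0 exposes the FPGA's target memory, uses pci_iomap() (a close relative of the ioremap_nocache() mentioned above), and all names are invented.

#include <linux/io.h>
#include <linux/pci.h>

static void __iomem *fpga_mem;          /* invented name for the mapped BAR */

static int fpga_map_bar(struct pci_dev *pdev)
{
    fpga_mem = pci_iomap(pdev, 0, 0);   /* assume BAR 0 is the target memory */
    return fpga_mem ? 0 : -ENOMEM;
}

static void fpga_push_words(const u32 *buf, size_t nwords)
{
    size_t i;

    for (i = 0; i < nwords; i++)
        iowrite32(buf[i], fpga_mem + 4 * i);

    wmb();   /* order the writes before telling the device the data is there */
}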

Why would a device driver cause page faults?

I have a Windows console application that uses a parallel IO card for high speed data transmission. (General Standards HPDI32ALT)
My process runs in user mode; however, I am sure that somewhere behind the device's API there is some kernel-mode driver activity (PCI DMA transfers, reading device status registers, etc.). The working model is roughly this:
at startup: I request a pointer to an IO buffer from the API.
in my main loop:
block on API waiting for room in device's buffer (low watermark)
fill the IO buffer with transmission data
begin transmission to device by passing it the pointer to the IO buffer (during this time the API uses DMA on PCI bus to move the data to the card)
block on API waiting for IO to complete
The application appears to be working correctly, with the proper data rate and sustained throughput over long periods of time. However, when I look at the process in the Sysinternals tool Process Explorer, I see a large number of page faults (~6k per second). I am moving ~30MB/s to the card.
I have plenty of RAM and am reasonably sure the page faults are not disk IO related.
Any thoughts on what could be causing the page faults? I also have a receive side to this application that uses an identical IO card in receive mode. The receive-mode use of the API does not cause a large number of page faults.
Could the act of moving the IO buffer to kernel mode cause page faults?
So your application asks the driver for a memory buffer and you copy the send data into that buffer? That's a pretty strange model; usually you let the application manage the buffers.
If you're faulting 6K pages/s and you're only transferring 30MB/s, you're getting almost one page fault for every page you transfer (30MB/s over 4KiB pages is roughly 7,500 pages per second). When you get the data buffer from the driver, is it always zero-filled? I'm wondering if you're getting demand-zero faults for every transfer.
-scott
