Are DMA DRQs the DMA virtual channels - linux-kernel

I have a few basic questions about DMA in Linux. In one of the DTS files, for instance for the S900 SoC, I see that the total number of DMA channels is 12, but when I check
ls /sys/class/dma
I see a total of 46 DMA channels (up to dma0chan45). Does that mean DRQs are also counted as DMA channels?
Also, I didn't quite get what the DMA channels are about, and what is the difference between a virtual and a physical channel?

Related

mmap, axi and multiple reads from pcie

I am trying to optimize reading data over PCIe via mmap. We have some tools that allow reading/writing one word over the PCIe link at a time, but I would like to read/write as many words as required in one request.
My project uses PCIe Gen3 with AXI bridges (2 PCIe BARs).
I can successfully read any word from the bus, but I notice a pattern when requesting data:
request data at address 0: the AXI master requests 4 addresses of data, initial addr is 0
request data at addresses 0 and 1: two AXI requests: the first is the same as above, followed by a read request of 3 addresses of data, initial addr is 1
request data from address 0 to 2: 3 AXI requests: the first two are the same as the previous case, followed by a read request of 2 addresses of data, initial addr is 2
The pattern continues until the addr is a multiple of 4. It seems that if I request the first address, the AXI sends the first 4 values. Any hints? Could this be in the driver that I am using?
Here's how I use mmap:
length_offset = tmp_offset_rw & ~(sysconf(_SC_PAGESIZE) - 1);   /* round offset down to a page boundary */
mmap_offset   = (u_long)(tmp_barx_rw << 12) + length_offset;
mmap_len      = (u_long)(tmp_size * sizeof(int));
mmap_address  = mmap(NULL, mmap_len + (int)tmp_offset_rw - length_offset,
                     PROT_READ | PROT_WRITE, MAP_SHARED, fd, mmap_offset);
close(fd);

// tmp_reg_buf = new u_int[tmp_size];
// memcpy(tmp_reg_buf, mmap_address, tmp_size * sizeof(int));
// for (int i = 0; i < 4; i++)
//     printf("0x%08X\n", tmp_reg_buf[i]);

for (int i = 0; i < tmp_size; i++)
    printf("0x%08X\n", *((u_int *)mmap_address + (int)tmp_offset_rw - length_offset + i));
First off, the driver just sets up the mapping between application virtual addresses and physical addresses, but is not involved in requests between the CPU and the FPGA.
PCIe memory regions are typically mapped in uncached fashion, so the memory requests you see in the FPGA correspond exactly to the width of the values the CPU is reading or writing.
If you disassemble the code you have written, you will see load and store instructions operating on different widths of data. Depending on the CPU architecture, load/store instructions requesting wider data widths may have address alignment restrictions, or there may be performance penalties for fetching unaligned data.
Different memcpy() implementations often have special cases so that they can use the fewest possible instructions to transfer a given amount of data.
The reason memcpy() may not be suitable for MMIO is that memcpy() may read more memory locations than specified in order to use larger transfer sizes. If the MMIO memory locations have side effects on read, this could cause problems. If you're exposing something that behaves like ordinary memory, it is OK to use memcpy() with MMIO.
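To make the fixed-width point concrete, here is a minimal sketch (mine, not part of the original answer; the function name is hypothetical and it assumes the BAR has already been mmap'ed as in the question) that copies a range of registers with explicit 32-bit accesses through a volatile pointer, so every bus request is exactly one word wide:

#include <stdint.h>
#include <stddef.h>

/* Copy 'count' 32-bit registers from an uncached MMIO mapping.
 * Each loop iteration issues exactly one 32-bit read on the bus,
 * so the access pattern seen by the FPGA is predictable. */
static void mmio_read32_block(const volatile uint32_t *mmio,
                              uint32_t *dst, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i] = mmio[i];           /* one aligned 32-bit load per register */
}

Unlike memcpy(), this never merges or widens accesses, which is the safer choice when reads have side effects.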
If you want higher performance and there is a DMA engine available on the host side of PCIe or you can include a DMA engine in the FPGA, then you can arrange for transfers up to the limits imposed by PCIe protocol, the BIOS, and the configuration of the PCIe endpoint on the FPGA. DMA is the way to maximize throughput, with bursts of 128 or 256 bytes commonly available.
The next problem that needs to be addressed to maximize throughput is latency, which can be quite long. DMA engines need to be able to pipeline requests in order to mask the latency from the FPGA to the memory system and back.

ACP and DMA, how do they work?

I'm using an ARM Cortex-A53 platform; it has an ACP (Accelerator Coherency Port), and I'm trying to use DMA to transfer data through the ACP.
According to the ARM TRM, if I understand it correctly, the DMA data size is limited to 64 bytes per transfer when using the ACP.
If so, doesn't this limitation make DMA nearly unusable? It seems pointless to set up a DMA descriptor only to transfer 64 bytes at a time.
Or should the DMA automatically divide its transfer length into many ACP-size-limited (64-byte) transfers, without any software intervention?
I'd appreciate an expert explanation of how the ACP and DMA work together.
Somewhere along the path from the DMA to the ACP's AXI port, the transfer length should be divided automatically, as needed, into transfers of an appropriate length. For the Cortex-A53 ACP, AXI transfers are limited to 64 B (perhaps intentionally exactly one cache line).
From https://developer.arm.com/documentation/ddi0500/e/level-2-memory-system/acp/transfer-size-support :
x-byte INCR request characterized by: (some list of limitations)
Note the use of INCR instead of FIXED. INCR automatically increments the address according to the size of each transfer, while FIXED does not. This makes it simple for the peripheral to break a large transfer into a series of multiple INCR transfers.
However, do note that on the Cortex-A53 the transfer size (x in the quote) is fixed at 16- or 64-byte aligned transfers. If the DMA sends a transfer of an inappropriate size (because it is misconfigured or the correct size is unsupported), the AXI port will respond with SLVERR. If the buffer is not appropriately aligned, I think this also causes a SLVERR.
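As a rough illustration of that division (my sketch, not from the answer; issue_burst() is a hypothetical hook that programs one AXI INCR burst), splitting a long, 64-byte-aligned buffer into ACP-sized transfers could look like this:

#include <stdint.h>
#include <stddef.h>

#define ACP_BURST 64u   /* Cortex-A53 ACP accepts 16- or 64-byte INCR transfers */

/* Hypothetical hook that programs a single AXI INCR burst. */
extern void issue_burst(uint64_t addr, uint32_t len);

/* Split a long transfer into ACP-sized bursts.
 * Assumes addr and len are already multiples of ACP_BURST; a misaligned
 * address or an unsupported size would be answered with SLVERR. */
static void acp_transfer(uint64_t addr, size_t len)
{
    while (len >= ACP_BURST) {
        issue_burst(addr, ACP_BURST);
        addr += ACP_BURST;          /* INCR: address advances by the burst size */
        len  -= ACP_BURST;
    }
}

In a real system this splitting is done in hardware by the DMA engine or the interconnect, not in software; the sketch only shows the arithmetic involved.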
Lastly, the on-chip network routing must connect the DMA to the ACP, and that choice is made at chip design time. In my experience this is more commonly done for network accelerators and FPGA fabric glue, and tends not to be connected for low-speed peripherals like UART/SPI/I2C.

Output from an ADC needs to be stored in memory

We want to take the output of a 16-bit analog-to-digital converter, arriving at a rate of 10 million samples per second, and save the sequence of 16-bit output words in a computer's memory. How can this 16-bit binary voltage signal (0 V, 5 V) be saved in computer memory?
If an FPGA is to be used, please elaborate on the method.
Sample the data and feed it into a FIFO.
Take data from the FIFO, prepare UDP frames, and send the data over Ethernet.
Receive the UDP packets on the PC side and put them in memory (see the sketch below).
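At 10 Msps × 16 bits the raw data rate is about 20 MB/s (160 Mbit/s), which fits comfortably within a gigabit Ethernet link. For the PC side, a minimal receive sketch in C (mine, with a hypothetical port number and buffer size, assuming each UDP datagram carries a whole number of 16-bit samples) could look like this:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#define PORT        5000        /* hypothetical UDP port used by the FPGA */
#define MAX_SAMPLES (1 << 20)   /* capture buffer: 1M samples (2 MiB) */

int main(void)
{
    static uint16_t samples[MAX_SAMPLES];
    size_t count = 0;

    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0)
        return 1;

    struct sockaddr_in addr = { 0 };
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(PORT);
    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return 1;

    while (count < MAX_SAMPLES) {
        uint8_t pkt[1500];                      /* one Ethernet-sized datagram */
        ssize_t n = recv(sock, pkt, sizeof(pkt), 0);
        if (n <= 0)
            break;
        size_t nsamp = (size_t)n / sizeof(uint16_t);
        if (count + nsamp > MAX_SAMPLES)
            nsamp = MAX_SAMPLES - count;
        memcpy(&samples[count], pkt, nsamp * sizeof(uint16_t));
        count += nsamp;                         /* samples now sit in memory */
    }

    printf("captured %zu samples\n", count);
    close(sock);
    return 0;
}

A real capture tool would also handle packet loss and reordering (for example with a sequence number in each frame), which plain UDP does not guarantee.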

When to Update ALSA Audio Driver Buffer Pointer

I am writing a USB audio playback driver using the ALSA APIs. To do that, I have been studying existing audio drivers in the Linux kernel, but I am confused about when to update the kernel audio buffer pointer. The kernel puts new audio data into a ring buffer, and our driver's task is to take new data from the ring buffer, send it over USB, and update the kernel buffer pointer.
The drivers I was looking at take care of this in the URB completion function. They have a predefined macro for the USB transfer size, which is around 4096 bytes in almost all cases. So when a URB transfer is finished and execution reaches the URB completion handler, they copy another 4096 bytes from the kernel buffer into the URB buffer, submit the URB to the USB controller again, and advance the kernel buffer pointer by 4096 bytes.
What I don't understand is: how can they be so sure that by the time a URB transfer has finished, there are 4096 bytes of new data in the kernel buffer? Couldn't the amount of new data in the kernel buffer be smaller than 4096 bytes? Then why do they always advance the buffer pointer by 4096 bytes? I think there should be some way of knowing how many new bytes are in the kernel buffer, and the driver should only advance by that amount, or maybe I have misunderstood something. Any suggestion or guideline is appreciated.
These USB audio drivers behave exactly like a PCI sound card, i.e., when the device needs some samples, those samples are just read from the ring buffer.
A PCI chip has no way of knowing what part of the buffer actually contains valid samples.
A buffer underrun is detected later by software (the device informs the driver about the current position with an interrupt; the interrupt handler then raises the underrun error if the position is too far ahead).
USB audio drivers use exactly the same mechanism for detecting underruns, i.e., the snd_pcm_period_elapsed() function checks whether the current position (as returned by your .pointer callback) is too far ahead.
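As a minimal sketch of that mechanism (mine, not taken from any specific driver; the my_stream struct, its field names, and the callback names are hypothetical), the URB completion handler advances a byte counter and calls snd_pcm_period_elapsed(), while the .pointer callback reports that counter back to ALSA, which then detects underruns itself:

#include <linux/usb.h>
#include <sound/core.h>
#include <sound/pcm.h>

/* Hypothetical per-stream state kept by the driver (set as
 * substream->private_data in the .open callback, not shown). */
struct my_stream {
    struct snd_pcm_substream *substream;
    size_t hw_ptr_bytes;            /* bytes consumed from the ALSA ring buffer */
};

/* URB completion: the device has consumed another chunk of audio. */
static void my_urb_complete(struct urb *urb)
{
    struct my_stream *stream = urb->context;
    struct snd_pcm_runtime *runtime = stream->substream->runtime;

    /* Advance by what was actually sent, wrapping at the buffer size. */
    stream->hw_ptr_bytes += urb->transfer_buffer_length;
    stream->hw_ptr_bytes %= frames_to_bytes(runtime, runtime->buffer_size);

    /* Refill and resubmit the URB here (not shown), then let ALSA check
     * progress; if the application has not written enough new data,
     * ALSA raises the XRUN itself. */
    snd_pcm_period_elapsed(stream->substream);
}

/* .pointer callback: report the current hardware position in frames. */
static snd_pcm_uframes_t my_pcm_pointer(struct snd_pcm_substream *substream)
{
    struct my_stream *stream = snd_pcm_substream_chip(substream);

    return bytes_to_frames(substream->runtime, stream->hw_ptr_bytes);
}

In this scheme the driver always copies and advances by the full URB size; if the application falls behind, ALSA notices via the reported pointer position and raises the underrun, rather than the driver throttling itself.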

How do I calculate PCIe 1x, 2.0, 3.0, speeds properly?

I am honestly very lost with the speed calculations of PCIe devices.
I can understand the 33 MHz and 66 MHz clocks of PCI and PCI-X devices, but PCIe confuses me.
Could anyone explain how to calculate the transfer speeds of PCIe?
To understand the table pointed to by Paebbels, you should know how PCIe transmission works. Contrary to PCI and PCI-X, PCIe is a point-to-point serial bus with link aggregation (meaning that several serial lanes are put together to increase transfer bandwidth).
For PCIe 1.0, a single lane transmits one symbol on each edge of a 1.25 GHz clock. This yields a transmission rate of 2.5G transfers (or symbols) per second. The protocol encodes 8 bits of data in 10 symbols (8b/10b encoding) for DC balance and clock recovery. Therefore the raw transfer rate of a lane is
2.5 Gsymb/s × 8 bit / 10 symb = 2 Gbit/s = 250 MB/s
The raw transfer rate can be multiplied by the number of lanes available to get the full link transfer rate.
Note that the useful transfer rate is actually lower than that, because the data is packetized, similar to the packetization in the Ethernet protocol stack.
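To make the arithmetic concrete for the other generations as well (PCIe 2.0 raises the rate to 5 GT/s, still with 8b/10b; PCIe 3.0 moves to 8 GT/s with 128b/130b encoding), here is a small sketch of the same formula in C (mine, not part of the original answer):

#include <stdio.h>

/* Raw per-lane bandwidth in MB/s: transfer rate in GT/s times the
 * encoding efficiency, divided by 8 bits per byte. */
static double lane_mb_per_s(double gtps, double enc_num, double enc_den)
{
    return gtps * 1e3 * (enc_num / enc_den) / 8.0;
}

int main(void)
{
    /* PCIe 1.0: 2.5 GT/s, 8b/10b    ->  250 MB/s per lane */
    /* PCIe 2.0: 5.0 GT/s, 8b/10b    ->  500 MB/s per lane */
    /* PCIe 3.0: 8.0 GT/s, 128b/130b -> ~985 MB/s per lane */
    printf("PCIe 1.0 x1: %.0f MB/s\n", lane_mb_per_s(2.5, 8, 10));
    printf("PCIe 2.0 x1: %.0f MB/s\n", lane_mb_per_s(5.0, 8, 10));
    printf("PCIe 3.0 x1: %.0f MB/s\n", lane_mb_per_s(8.0, 128, 130));
    printf("PCIe 3.0 x4: %.0f MB/s\n", 4 * lane_mb_per_s(8.0, 128, 130));
    return 0;
}

Multiply the per-lane figure by the lane count (x1, x4, x8, x16) to get the raw link bandwidth; protocol overhead reduces the usable rate further, as noted above.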
A more detailed explanation can be found in this Xilinx white paper.
