ACP and DMA, how they work? - linux-kernel

I'm using ARM a53 platform, it has ACP component, and I'm trying to use DMA to transfer data through ACP.
By ARM trm document, if I understand it correctly, the DMA transmission data size limits to 64 bytes for each DMA transfer when using ACP.
If so, does this limitation make DMA not usable? Because it's dumb to configure DMA descriptor but to transfer 64 bytes only each time.
Or DMA should auto divide its transfer length into many ACP size limited(64 bytes) packets, without any software intervention.
Need any expert to explain how ACP and DMA work together.

Somewhere in the interfaces from the DMA to the ACP's AXI port should auto divide its transfer length as needed into transfers of appropriate length. For the Cortex-A53 ACP, AXI transfers are limited to 64B(perhaps intentionally 1x cacheline).
From https://developer.arm.com/documentation/ddi0500/e/level-2-memory-system/acp/transfer-size-support :
x byte INCR request characterized by:(some list of limitations)
Note the use of INCR instead of FIXED. INCR will automatically increment the address according to the size of the transfer, while FIXED will not. This makes it simple for the peripheral break a large transfer into a series of multiple INCR transfers.
However, do note that on the Cortex-A53, transfer size(x in the quote) is fixed at 16 or 64 byte aligned transfers. If the DMA sends an inappropriate sized transfer(because misconfigured or correct size unsupported), the AXI will emit a SLVERR. If the buffer is not appropriately aligned, I think this also causes a SLVERR.
Lastly, the on-chip network routing must support connecting the DMA to the ACP at chip design time. In my experience this is more commonly done for network accelerators and FPGA fabric glue, but tends to be less often connected for low speed peripherals like UART/SPI/I2C.

Related

mmap, axi and multiple reads from pcie

I am trying to optimize the reading of data via pcie via mmap. We have some tools that allow for reading/writing one word from the PCIe communication at the time, but I would like to get/write as many words as require in one request.
My project uses PCIe Gen3 with AXI bridges (2 PCIe bars).
I can successfully read any word from the bus but I notice a pattern when requesting data:
request data in address 0: AXI master requests 4 addresses of data, initial addr is 0
request data in address 0 and 1: two AXI requests: first is similar to the one above, follow by a read requests of 3 addresses of data, initial addr is 1
request data from address 0 to 2: 3 AXI requests: first two are similar to the previous one, follow by a read requests of 2 addresses of data, initial addr is 2
The pattern continues until the addr is a multiple of 4. In seems that if I request the first address, the AXI sends the first 4 values. Any hints? Could this be on the driver that I am using?
Here's how I use mmap:
length_offset = tmp_offset_rw & ~(sysconf (_SC_PAGESIZE)-1);
mmap_offset = (u_long)(tmp_barx_rw << 12) + length_offset;
mmap_len = (u_long)(tmp_size * sizeof(int));
mmap_address = mmap(NULL, mmap_len + (int)(tmp_offset_rw) - length_offset,
PROT_READ | PROT_WRITE, MAP_SHARED, fd, mmap_offset);
close(fd);
// tmp_reg_buf = new u_int[tmp_size];
// memcpy(tmp_reg_buf, mmap_address , tmp_size*sizeof(int));
// for(int i = 0; i < 4; i++)
// printf("0x%08X\n", tmp_reg_buf[i]);
for(int i = 0; i < tmp_size; i++)
printf("0x%08X\n", *((u_int*)mmap_address + (int)tmp_offset_rw - length_offset + i));
First off, the driver just sets up the mapping between application virtual addresses and physical addresses, but is not involved in requests between the CPU and the FPGA.
PCIe memory regions are typically mapped in uncached fashion, so the memory requests you see in the FPGA correspond exactly to the width of the values the CPU is reading or writing.
If you disassemble the code you have written, you will see load and store instruction operating on different widths of data. Depending on the CPU architecture, load/store instructions requesting wider data widths may have address alignment restrictions, or there may be performance penalties for fetching unaligned data.
Different memcpy() implementations often have special cases so that they can the fewest possible instructions to transfer a certain amount of data.
The reason why memcpy() may not be suitable for MMIO is that memcpy() may read more memory locations than specified in order to use larger transfer sizes. If the MMIO memory locations cause side effects on read, this could cause problems. If you're exposing something that behaves like memory, it is OK to use memcpy() with MMIO.
If you want higher performance and there is a DMA engine available on the host side of PCIe or you can include a DMA engine in the FPGA, then you can arrange for transfers up to the limits imposed by PCIe protocol, the BIOS, and the configuration of the PCIe endpoint on the FPGA. DMA is the way to maximize throughput, with bursts of 128 or 256 bytes commonly available.
The next problem that needs to be addressed to maximize throughput is latency, which can be quite long. DMA engines need to be able to pipeline requests in order to mask the latency from the FPGA to the memory system and back.

STM32F411 I need to send a lot of data by USB with high speed

I'm using STM32F411 with USB CDC library, and max speed for this library is ~1Mb/s.
I'm creating a project where I have 8 microphones connected into ADC line (this part works fine), I need a 16-bit signal, so I'm increasing accuracy by adding first 16 signals from one line (ADC gives only 12-bits signal). In my project, I need 96k 16-bit samples for one line, so it's 0,768M signals for all 8 lines. This signal needs 12000Kb space, but STM32 have only 128Kb SRAM, so I decided to send about 120 with 100Kb data in one second.
The conclusion is I need ~11,72Mb/s to send this.
The problem is that I'm unable to do that because CDC USB limited me to ~1Mb/s.
Question is how to increase USB speed to 12Mb/s for STM32F4. I need some prompt or library.
Or maybe should I set up "audio device" in CubeMX?
If small b means byte in your question, the answer is: it is not possible as your micro has FS USB which max speeds is 12M bits per second.
If it means bits your 1Mb (bit) speed assumption is wrong. But you will not reach the 12M bit payload transfer.
You may try to write (only if b means bit) your own class but I afraid you will not find a ready made library. You will need also to write the device driver on the host computer

AXI4 (Lite) Narrow Burst vs. Unaligned Burst Clarification/Compatibility

I'm currently writing an AXI4 master that is supposed to support AXI4 Lite (AXI4L) as well.
My AXI4 master is receiving data from a 16-bit interface. This is on a Xilinx Spartan 6 FPGA and I plan on using the EDK AXI4 Interconnect IP, which has a minimum WDATA width of 32 bits.
At first I wanted to use narrow burst, i.e. AWSIZE = x"01" (2 bytes in transfer). However, I found that Xilinx' AXI Reference Guide UG761 states "narrow bursts [are] supported but [...] not recommended." Unaligned transactions are supposed to be supported.
This had me thinking. Say I start an unaligned burst:
AWLEN = x"01" (2 beats)
AWSIZE = x"02" (4 bytes in transfer")
And do the following:
AX (32-bit word #0: send hi16)
XB (32-bit word #1: send lo16)
Where A, B are my 16 bit words that start off at an unaligned (2-byte aligned) address. X means WSTRB is deasserted for the indicated 16 bit.
Is this supported or does this fall under the category "narrow burst" even through AWSIZE = x"02" (4 bytes in transfer) as opposed to AWSIZE = x"01" (2 bytes in transfer)?
Now, if this was just for AXI4, I would probably not care as much about this use case, because AXI4 peripherals are required to use the WSTRB signals. However, the AXI Reference Guide UG761 states "[AXI4L] Slaves interface can elect to ignore WSTRB (assume all bytes valid)."
I read here that many (but not all; and there is not list?) Xilinx AXI4L peripherals do elect to ignore WSTRB.
Does this mean that I'm essentially barred from doing narrow burst ("not recommended") as well as unaligned bursts ("WSTRB can be ignored") or is there an easy way to unload some of the implementation work from my master into the interconnect, guaranteeing proper system behavior when accessing AXI4L peripherals?
Your example is not a narrow burst, and should work.
The reason narrow burst is not recommended is that it gives sub-optimal performances. Both narrow-burst and data realignement cost in area and are not recommended IMHO. However, DRE has minimal bandwidth cost, while narrow burst does. If your AXI port is 100MHz 32 bits, you have 3.2GBits maximum throughput, if you use narrow burst of 16 bits 50% of the time, than your maximum throughput is reduced to 2.4GBits (32bits X 50MHz + 16bits X 50Mhz). Also, I'm not sure AXI-Lite support narrow burst or data realignement.
Your example has 2 major flaws. First, it requires 3 data-beats to transfer 32 bits, which is worst than narrow-burst (I don't think AXI is smart enough to cancel the last burst with WSTRB to 0). Second, you can't burst more than 2 16-bits at a time, which will hang your AXI infrastructure's performances if you have a lot of data to transfer.
The best way to deal with this is concatenate the 16 bits together to form a 32 bits in your block. Then you buffer these 32 bits and burst them when you have enough. This is the AXI high performance way to do this.
However, if you receive data as 16-bits, it seems you would be better using AXI-Stream, which support 16-bits but doesn't have the notion of addresses. You can map an AXI-Stream to AXI-4 using Xilinx's IP cores. Either AXI-Datamover or AXI-DMA can do that. Both do the same (in fact, AXI-DMA includes a datamover), but AXI-DMA is controlled trough an AXI-Lite interface while Datamover is controlled through additionals AXI-Streams.
As a final note, the Xilinx cores never requires narrow-burst or DRE. If you need DRE in AXI-DMA, it's done by the AXI-DMA core and not the AXI Interconnect. Also, these cores are clear-source, so you can checkout how they operate easily.

How does computer really request data in a computer?

I was wondering how exactly does a CPU request data in a computer. In a 32 Bits architecture, I thought that a computer would put a destination on the address bus and would receive 4 Bytes on the data bus. I recently read on the memory alignment in computer and it confused me. I read that the CPU has to read two times the memory to access a not multiple 4 address. Why is so? The address bus lets it access not multiple 4 address.
The address bus itself, even in a 32-bit architecture, is usually not 32 bits in size. E.g. the Pentium's address bus was 29 bits. Given that it has a full 32-bit range, in the Pentium's case that means each slot in memory is eight bytes wide. So reading a value that straddles two of those slots means two reads rather than one, and alignment prevents that from happening.
Other processors (including other implementations of the 32-bit Intel architecture) have different memory access word sizes but the point is generic.

Linux block driver merge bio's

I have a block device driver which is working, after a fashion. It is for a PCIe device, and I am handling the bios directly with a make_request_fn rather than use a request queue, as the device has no seek time. However, it still has transaction overhead.
When I read consecutively from the device, I get bios with many segments (generally my maximum of 32), each consisting of 2 hardware sectors (so 2 * 2k) and this is then handled as one scatter-gather transaction to the device, saving a lot of signaling overhead. However on a write, the bios each have just one segment of 2 sectors and therefore the operations take a lot longer in total. What I would like to happen is to somehow cause the incoming bios to consist of many segments, or to merge bios sensibly together myself. What is the right approach here?
The current content of the make_request_fn is something along the lines of:
Determine read/write of the bio
For each segment in the bio, make an entry in a scatterlist* with sg_set_page
Map this scatterlist to PCI with pci_map_sg
For every segment in the scatterlist, add to a device-specific structure defining a multiple-segment DMA scatter-gather operation
Map that structure to DMA
Carry out transaction
Unmap structure and SG DMA
Call bio_endio with -EIO if failed and 0 if succeeded.
The request queue is set up like:
#define MYDEV_BLOCK_MAX_SEGS 32
#define MYDEV_SECTOR_SIZE 2048
blk_queue_make_request(mydev->queue, mydev_make_req);
set_bit(QUEUE_FLAG_NONROT, &mydev->queue->queue_flags);
blk_queue_max_segments(mydev->queue, MYDEV_BLOCK_MAX_SEGS);
blk_queue_physical_block_size(mydev->queue, MYDEV_SECTOR_SIZE);
blk_queue_logical_block_size(mydev->queue, MYDEV_SECTOR_SIZE);
blk_queue_flush(mydev->queue, 0);
blk_queue_segment_boundary(mydev->queue, -1UL);
blk_queue_max_segments(mydev->queue, MYDEV_BLOCK_MAX_SEGS);
blk_queue_dma_alignment(mydev->queue, 0x7);

Resources