mmap, axi and multiple reads from pcie

mmap, axi and multiple reads from pcie - fpga

I am trying to optimize the reading of data via pcie via mmap. We have some tools that allow for reading/writing one word from the PCIe communication at the time, but I would like to get/write as many words as require in one request.
My project uses PCIe Gen3 with AXI bridges (2 PCIe bars).
I can successfully read any word from the bus but I notice a pattern when requesting data:
request data in address 0: AXI master requests 4 addresses of data, initial addr is 0
request data in address 0 and 1: two AXI requests: first is similar to the one above, follow by a read requests of 3 addresses of data, initial addr is 1
request data from address 0 to 2: 3 AXI requests: first two are similar to the previous one, follow by a read requests of 2 addresses of data, initial addr is 2
The pattern continues until the addr is a multiple of 4. In seems that if I request the first address, the AXI sends the first 4 values. Any hints? Could this be on the driver that I am using?
Here's how I use mmap:
length_offset = tmp_offset_rw & ~(sysconf (_SC_PAGESIZE)-1);
mmap_offset = (u_long)(tmp_barx_rw << 12) + length_offset;
mmap_len = (u_long)(tmp_size * sizeof(int));
mmap_address = mmap(NULL, mmap_len + (int)(tmp_offset_rw) - length_offset,
PROT_READ | PROT_WRITE, MAP_SHARED, fd, mmap_offset);
close(fd);
// tmp_reg_buf = new u_int[tmp_size];
// memcpy(tmp_reg_buf, mmap_address , tmp_size*sizeof(int));
// for(int i = 0; i < 4; i++)
// printf("0x%08X\n", tmp_reg_buf[i]);
for(int i = 0; i < tmp_size; i++)
printf("0x%08X\n", *((u_int*)mmap_address + (int)tmp_offset_rw - length_offset + i));

First off, the driver just sets up the mapping between application virtual addresses and physical addresses, but is not involved in requests between the CPU and the FPGA.
PCIe memory regions are typically mapped in uncached fashion, so the memory requests you see in the FPGA correspond exactly to the width of the values the CPU is reading or writing.
If you disassemble the code you have written, you will see load and store instruction operating on different widths of data. Depending on the CPU architecture, load/store instructions requesting wider data widths may have address alignment restrictions, or there may be performance penalties for fetching unaligned data.
Different memcpy() implementations often have special cases so that they can the fewest possible instructions to transfer a certain amount of data.
The reason why memcpy() may not be suitable for MMIO is that memcpy() may read more memory locations than specified in order to use larger transfer sizes. If the MMIO memory locations cause side effects on read, this could cause problems. If you're exposing something that behaves like memory, it is OK to use memcpy() with MMIO.
If you want higher performance and there is a DMA engine available on the host side of PCIe or you can include a DMA engine in the FPGA, then you can arrange for transfers up to the limits imposed by PCIe protocol, the BIOS, and the configuration of the PCIe endpoint on the FPGA. DMA is the way to maximize throughput, with bursts of 128 or 256 bytes commonly available.
The next problem that needs to be addressed to maximize throughput is latency, which can be quite long. DMA engines need to be able to pipeline requests in order to mask the latency from the FPGA to the memory system and back.

Related

Does memory-mapped I/O work by using RAM addresses?

Imagine a processor capable of addressing an 8-bit range (I know this is ridiculously small in reality) with a 128 byte RAM. And there is some 8-bit device register mapped to address 100. In order to store a value to it, does the CPU need to store a value at address 100 or does it specifically need to store a value at address 100 within RAM? In pseudo-assembly:
STI 100, value
VS
STI RAM_start+100, value

Usually, the address of a device is specified relative to the start of the address space it lives in.
The datasheet has surely more context and will clarify if the address is relative to something else.
However, before using it you have to translate that address as the CPU would see it.
For example, if your 8-bit address range accessible with the sti instruction is split in half:
0-127 => RAM
128-255 => IO
Because the hardware is wired this way, then, as seen from the CPU, the IO address range starts at 128, so an IO address of x is accessible at 128 + x.
The CPU datasheet usually establishes the convention used to give the addresses of the devices and the memory map of the CPU.
Address spaces can be hierarchical (e.g. as in PCI) or windowed (e.g. like the legacy PCI config space on x86), can have aliases, they may require special instructions or overlaps (e.g. reads to ROM, writes to RAM).
Always refers to the CPU manual/datasheet to understand the CPU memory map and how its address range(s) is (are) routed.

ACP and DMA, how they work?

I'm using ARM a53 platform, it has ACP component, and I'm trying to use DMA to transfer data through ACP.
By ARM trm document, if I understand it correctly, the DMA transmission data size limits to 64 bytes for each DMA transfer when using ACP.
If so, does this limitation make DMA not usable? Because it's dumb to configure DMA descriptor but to transfer 64 bytes only each time.
Or DMA should auto divide its transfer length into many ACP size limited(64 bytes) packets, without any software intervention.
Need any expert to explain how ACP and DMA work together.

Somewhere in the interfaces from the DMA to the ACP's AXI port should auto divide its transfer length as needed into transfers of appropriate length. For the Cortex-A53 ACP, AXI transfers are limited to 64B(perhaps intentionally 1x cacheline).
From https://developer.arm.com/documentation/ddi0500/e/level-2-memory-system/acp/transfer-size-support :
x byte INCR request characterized by:(some list of limitations)
Note the use of INCR instead of FIXED. INCR will automatically increment the address according to the size of the transfer, while FIXED will not. This makes it simple for the peripheral break a large transfer into a series of multiple INCR transfers.
However, do note that on the Cortex-A53, transfer size(x in the quote) is fixed at 16 or 64 byte aligned transfers. If the DMA sends an inappropriate sized transfer(because misconfigured or correct size unsupported), the AXI will emit a SLVERR. If the buffer is not appropriately aligned, I think this also causes a SLVERR.
Lastly, the on-chip network routing must support connecting the DMA to the ACP at chip design time. In my experience this is more commonly done for network accelerators and FPGA fabric glue, but tends to be less often connected for low speed peripherals like UART/SPI/I2C.

AXI4 (Lite) Narrow Burst vs. Unaligned Burst Clarification/Compatibility

I'm currently writing an AXI4 master that is supposed to support AXI4 Lite (AXI4L) as well.
My AXI4 master is receiving data from a 16-bit interface. This is on a Xilinx Spartan 6 FPGA and I plan on using the EDK AXI4 Interconnect IP, which has a minimum WDATA width of 32 bits.
At first I wanted to use narrow burst, i.e. AWSIZE = x"01" (2 bytes in transfer). However, I found that Xilinx' AXI Reference Guide UG761 states "narrow bursts [are] supported but [...] not recommended." Unaligned transactions are supposed to be supported.
This had me thinking. Say I start an unaligned burst:
AWLEN = x"01" (2 beats)
AWSIZE = x"02" (4 bytes in transfer")
And do the following:
AX (32-bit word #0: send hi16)
XB (32-bit word #1: send lo16)
Where A, B are my 16 bit words that start off at an unaligned (2-byte aligned) address. X means WSTRB is deasserted for the indicated 16 bit.
Is this supported or does this fall under the category "narrow burst" even through AWSIZE = x"02" (4 bytes in transfer) as opposed to AWSIZE = x"01" (2 bytes in transfer)?
Now, if this was just for AXI4, I would probably not care as much about this use case, because AXI4 peripherals are required to use the WSTRB signals. However, the AXI Reference Guide UG761 states "[AXI4L] Slaves interface can elect to ignore WSTRB (assume all bytes valid)."
I read here that many (but not all; and there is not list?) Xilinx AXI4L peripherals do elect to ignore WSTRB.
Does this mean that I'm essentially barred from doing narrow burst ("not recommended") as well as unaligned bursts ("WSTRB can be ignored") or is there an easy way to unload some of the implementation work from my master into the interconnect, guaranteeing proper system behavior when accessing AXI4L peripherals?

Your example is not a narrow burst, and should work.
The reason narrow burst is not recommended is that it gives sub-optimal performances. Both narrow-burst and data realignement cost in area and are not recommended IMHO. However, DRE has minimal bandwidth cost, while narrow burst does. If your AXI port is 100MHz 32 bits, you have 3.2GBits maximum throughput, if you use narrow burst of 16 bits 50% of the time, than your maximum throughput is reduced to 2.4GBits (32bits X 50MHz + 16bits X 50Mhz). Also, I'm not sure AXI-Lite support narrow burst or data realignement.
Your example has 2 major flaws. First, it requires 3 data-beats to transfer 32 bits, which is worst than narrow-burst (I don't think AXI is smart enough to cancel the last burst with WSTRB to 0). Second, you can't burst more than 2 16-bits at a time, which will hang your AXI infrastructure's performances if you have a lot of data to transfer.
The best way to deal with this is concatenate the 16 bits together to form a 32 bits in your block. Then you buffer these 32 bits and burst them when you have enough. This is the AXI high performance way to do this.
However, if you receive data as 16-bits, it seems you would be better using AXI-Stream, which support 16-bits but doesn't have the notion of addresses. You can map an AXI-Stream to AXI-4 using Xilinx's IP cores. Either AXI-Datamover or AXI-DMA can do that. Both do the same (in fact, AXI-DMA includes a datamover), but AXI-DMA is controlled trough an AXI-Lite interface while Datamover is controlled through additionals AXI-Streams.
As a final note, the Xilinx cores never requires narrow-burst or DRE. If you need DRE in AXI-DMA, it's done by the AXI-DMA core and not the AXI Interconnect. Also, these cores are clear-source, so you can checkout how they operate easily.

Linux block driver merge bio's

I have a block device driver which is working, after a fashion. It is for a PCIe device, and I am handling the bios directly with a make_request_fn rather than use a request queue, as the device has no seek time. However, it still has transaction overhead.
When I read consecutively from the device, I get bios with many segments (generally my maximum of 32), each consisting of 2 hardware sectors (so 2 * 2k) and this is then handled as one scatter-gather transaction to the device, saving a lot of signaling overhead. However on a write, the bios each have just one segment of 2 sectors and therefore the operations take a lot longer in total. What I would like to happen is to somehow cause the incoming bios to consist of many segments, or to merge bios sensibly together myself. What is the right approach here?
The current content of the make_request_fn is something along the lines of:
Determine read/write of the bio
For each segment in the bio, make an entry in a scatterlist* with sg_set_page
Map this scatterlist to PCI with pci_map_sg
For every segment in the scatterlist, add to a device-specific structure defining a multiple-segment DMA scatter-gather operation
Map that structure to DMA
Carry out transaction
Unmap structure and SG DMA
Call bio_endio with -EIO if failed and 0 if succeeded.
The request queue is set up like:
#define MYDEV_BLOCK_MAX_SEGS 32
#define MYDEV_SECTOR_SIZE 2048
blk_queue_make_request(mydev->queue, mydev_make_req);
set_bit(QUEUE_FLAG_NONROT, &mydev->queue->queue_flags);
blk_queue_max_segments(mydev->queue, MYDEV_BLOCK_MAX_SEGS);
blk_queue_physical_block_size(mydev->queue, MYDEV_SECTOR_SIZE);
blk_queue_logical_block_size(mydev->queue, MYDEV_SECTOR_SIZE);
blk_queue_flush(mydev->queue, 0);
blk_queue_segment_boundary(mydev->queue, -1UL);
blk_queue_max_segments(mydev->queue, MYDEV_BLOCK_MAX_SEGS);
blk_queue_dma_alignment(mydev->queue, 0x7);

why is data structure alignment important for performance?

Can someone give me a short and plausible explanation for why the compiler adds padding to data structures in order to align its members? I know that it's done so that the CPU can access the data more efficiently, but I don't understand why this is so.
And if this is only CPU related, why is a double 4 byte aligned in Linux and 8 byte aligned in Windows?

Alignment helps the CPU fetch data from memory in an efficient manner: less cache miss/flush, less bus transactions etc.
Some memory types (e.g. RDRAM, DRAM etc.) need to be accessed in a structured manner (aligned "words" and in "burst transactions" i.e. many words at one time) in order to yield efficient results. This is due to many things amongst which:
setup time: time it takes for the memory devices to access the memory locations
bus arbitration overhead i.e. many devices might want access to the memory device
"Padding" is used to correct the alignment of data structures in order to optimize transfer efficiency.
In other words, accessing a "mis-aligned" structure will yield lower overall performance. A good example of such pitfall: suppose a data structure is mis-aligned and requires the CPU/Memory Controller to perform 2 bus transactions (instead of 1) in order to fetch the said structure, the performance is thus consequently lower.

the CPU fetches data from memory in groups of 4 bytes (it actualy depends on the hardware its 8 or other values for some types of hardware, but lets stick with 4 to keep it simple),
all is well if the data begins in an address which is dividable by 4, the CPU goes to the memory address and loads the data.
now suppose the data begins in an address not dividable by 4 say for the sake of simplicity at address 1, the CPU must take data from address 0 and then apply some algorithm to dump the byte at the 0 address , to gain access to the actual data at byte 1. this takes time and therefore lowers preformance. so it is much more efficient to have all data addresses aligned.

A cache line is a basic unit of caching. Typically it is 16-64 bytes or more.
Pentium IV: 64 bytes; Pentium Pro/II: 32 bytes; Pentium I: 32 bytes; 486: 16 bytes.
myrandomreader:
; ...
; ten instructions to generate next pseudo-random
; address in ESI from previous address
; ...
MOV EAX, DS:[ESI] ; X
LOOP myrandomreader
For memory read straddling two cachelines:
(for L1 cache miss) the processor must wait for the whole of cache line 1 to be read from L2->L1 into the processor before it can request the second cache line, causing a short execution stall
(for L2 cache miss) the processor must wait for two burst reads from L3 cache (if present) or main memory to complete rather than one
Processor stalls
A random 4 byte read will straddle a cacheline boundary about 5% of the time for 64 byte cachelines, 10% for 32 byte ones and 20% for 16 byte ones.
There may be additional execution overheads for some instructions on misaligned data even if it is within a cacheline. This is talked about on the Intel website for some SSE instructions.
If you are defining the structures yourself, it may make sense to look at listing all the <32bit data fields together in a struct so that padding overhead is reduced or alternatively review whether it is better to turn packing on or off for a particular structure.
On MIPS and many other platforms you don't get the choice and must align - kernel exception if you don't!!
Alignment may also matter extra specially to you if you are doing I/O on the bus or using atomic operations such as atomic increment/decrement or if you wish to be able to port your code to non-Intel.
On Intel only (!) code, a common practice is to define one set of packed structures for network and disk, and another padded set for in-memory and to have routines to convert data between these formats (also consider "endianness" for the disk and network formats).

In addition to jldupont's answer, some architectures have load and store instructions (those used to read/write to and from memory) that only operate on word aligned boundaries - so, to load a non-aligned word from memory would take two load instructions, a shift instruction, and then a mask instruction - much less efficient!

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio