Linux block driver merge bios - linux-kernel

I have a block device driver which is working, after a fashion. It is for a PCIe device, and I am handling the bios directly with a make_request_fn rather than using a request queue, as the device has no seek time. However, it still has per-transaction overhead.
When I read consecutively from the device, I get bios with many segments (generally my maximum of 32), each consisting of 2 hardware sectors (so 2 * 2k), and this is then handled as one scatter-gather transaction to the device, saving a lot of signaling overhead. On a write, however, the bios each have just one segment of 2 sectors, so the operations take much longer in total. What I would like is to somehow cause the incoming bios to consist of many segments, or to merge bios sensibly together myself. What is the right approach here?
The current content of the make_request_fn is something along the lines of:
Determine read/write of the bio
For each segment in the bio, make an entry in a scatterlist with sg_set_page()
Map this scatterlist to PCI with pci_map_sg
For every segment in the scatterlist, add to a device-specific structure defining a multiple-segment DMA scatter-gather operation
Map that structure to DMA
Carry out transaction
Unmap structure and SG DMA
Call bio_endio with -EIO if failed and 0 if succeeded.
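
Put together, that flow looks roughly like the sketch below (older ~3.x kernel APIs to match the calls named above; mydev_do_sg_transfer() is a hypothetical stand-in for the device-specific descriptor/DMA steps):

static void mydev_make_req(struct request_queue *q, struct bio *bio)
{
	struct mydev *mydev = q->queuedata;
	struct scatterlist sg[MYDEV_BLOCK_MAX_SEGS]; /* preallocate in real code */
	struct bio_vec *bvec;
	int i, err, nents;
	int dir = bio_data_dir(bio) == WRITE ? PCI_DMA_TODEVICE
	                                     : PCI_DMA_FROMDEVICE;

	sg_init_table(sg, MYDEV_BLOCK_MAX_SEGS);
	bio_for_each_segment(bvec, bio, i)
		sg_set_page(&sg[i], bvec->bv_page, bvec->bv_len, bvec->bv_offset);

	nents = pci_map_sg(mydev->pdev, sg, bio_segments(bio), dir);

	/* build the device's multi-segment SG descriptor, map it, run the
	 * transaction, unmap it -- all behind this hypothetical helper */
	err = mydev_do_sg_transfer(mydev, sg, nents, dir);

	pci_unmap_sg(mydev->pdev, sg, bio_segments(bio), dir);
	bio_endio(bio, err ? -EIO : 0);
}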
The request queue is set up like:
#define MYDEV_BLOCK_MAX_SEGS 32
#define MYDEV_SECTOR_SIZE    2048

blk_queue_make_request(mydev->queue, mydev_make_req);
set_bit(QUEUE_FLAG_NONROT, &mydev->queue->queue_flags);  /* no seek penalty */
blk_queue_max_segments(mydev->queue, MYDEV_BLOCK_MAX_SEGS);
blk_queue_physical_block_size(mydev->queue, MYDEV_SECTOR_SIZE);
blk_queue_logical_block_size(mydev->queue, MYDEV_SECTOR_SIZE);
blk_queue_flush(mydev->queue, 0);                        /* no flush support */
blk_queue_segment_boundary(mydev->queue, -1UL);          /* no boundary limit */
blk_queue_dma_alignment(mydev->queue, 0x7);              /* 8-byte DMA alignment */

Related

mmap, AXI and multiple reads from PCIe

I am trying to optimize the reading of data over PCIe via mmap. We have some tools that allow reading/writing one word of the PCIe communication at a time, but I would like to get/write as many words as required in one request.
My project uses PCIe Gen3 with AXI bridges (2 PCIe bars).
I can successfully read any word from the bus, but I notice a pattern when requesting data:
request data at address 0: the AXI master requests 4 addresses of data, initial addr is 0
request data at addresses 0 and 1: two AXI requests: the first is like the one above, followed by a read request of 3 addresses of data, initial addr is 1
request data from address 0 to 2: 3 AXI requests: the first two are like the previous ones, followed by a read request of 2 addresses of data, initial addr is 2
The pattern continues until the addr is a multiple of 4. It seems that if I request the first address, the AXI sends the first 4 values. Any hints? Could this be in the driver that I am using?
Here's how I use mmap:
length_offset = tmp_offset_rw & ~(sysconf(_SC_PAGESIZE) - 1);
mmap_offset   = (u_long)(tmp_barx_rw << 12) + length_offset;
mmap_len      = (u_long)(tmp_size * sizeof(int));
mmap_address  = mmap(NULL, mmap_len + (int)tmp_offset_rw - length_offset,
                     PROT_READ | PROT_WRITE, MAP_SHARED, fd, mmap_offset);
close(fd);
// tmp_reg_buf = new u_int[tmp_size];
// memcpy(tmp_reg_buf, mmap_address , tmp_size*sizeof(int));
// for(int i = 0; i < 4; i++)
// printf("0x%08X\n", tmp_reg_buf[i]);
for (int i = 0; i < tmp_size; i++)
    printf("0x%08X\n", *((u_int *)mmap_address + (int)tmp_offset_rw - length_offset + i));
First off, the driver just sets up the mapping between application virtual addresses and physical addresses, but is not involved in requests between the CPU and the FPGA.
PCIe memory regions are typically mapped in uncached fashion, so the memory requests you see in the FPGA correspond exactly to the width of the values the CPU is reading or writing.
If you disassemble the code you have written, you will see load and store instructions operating on different widths of data. Depending on the CPU architecture, load/store instructions requesting wider data widths may have address alignment restrictions, or there may be performance penalties for fetching unaligned data.
Different memcpy() implementations often have special cases so that they can use the fewest possible instructions to transfer a certain amount of data.
The reason why memcpy() may not be suitable for MMIO is that memcpy() may read more memory locations than specified in order to use larger transfer sizes. If the MMIO memory locations cause side effects on read, this could cause problems. If you're exposing something that behaves like memory, it is OK to use memcpy() with MMIO.
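When read side effects do matter, a word-by-word copy keeps every access at a fixed width. A minimal sketch, assuming 32-bit registers and an aligned mapping:

#include <stdint.h>
#include <stddef.h>

/* Copy out of MMIO one aligned 32-bit word at a time. The volatile
 * qualifier stops the compiler from widening, narrowing, or merging
 * the accesses the way a generic memcpy() implementation might. */
static void mmio_read_words(uint32_t *dst, const volatile uint32_t *src,
                            size_t nwords)
{
	size_t i;

	for (i = 0; i < nwords; i++)
		dst[i] = src[i];
}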
If you want higher performance and there is a DMA engine available on the host side of PCIe or you can include a DMA engine in the FPGA, then you can arrange for transfers up to the limits imposed by PCIe protocol, the BIOS, and the configuration of the PCIe endpoint on the FPGA. DMA is the way to maximize throughput, with bursts of 128 or 256 bytes commonly available.
The next problem that needs to be addressed to maximize throughput is latency, which can be quite long. DMA engines need to be able to pipeline requests in order to mask the latency from the FPGA to the memory system and back.

ACP and DMA: how do they work?

I'm using an ARM A53 platform; it has an ACP component, and I'm trying to use DMA to transfer data through the ACP.
According to the ARM TRM, if I understand it correctly, the DMA transfer size is limited to 64 bytes for each transfer when using the ACP.
If so, doesn't this limitation make DMA unusable? It seems wasteful to configure a DMA descriptor only to transfer 64 bytes each time.
Or should the DMA automatically divide its transfer length into many ACP-size-limited (64-byte) packets, without any software intervention?
Could an expert explain how ACP and DMA work together?
Something in the interface between the DMA and the ACP's AXI port should automatically divide the transfer length as needed into transfers of appropriate length. For the Cortex-A53 ACP, AXI transfers are limited to 64 B (perhaps intentionally exactly one cache line).
From https://developer.arm.com/documentation/ddi0500/e/level-2-memory-system/acp/transfer-size-support :
An x-byte INCR request characterized by: (some list of limitations)
Note the use of INCR instead of FIXED. INCR automatically increments the address according to the size of the transfer, while FIXED does not. This makes it simple for the peripheral to break a large transfer into a series of multiple INCR transfers, as sketched below.
However, do note that on the Cortex-A53 the transfer size (x in the quote) is fixed at 16- or 64-byte aligned transfers. If the DMA sends an inappropriately sized transfer (because it is misconfigured, or the correct size is unsupported), the AXI port will respond with SLVERR. If the buffer is not appropriately aligned, I think this also causes a SLVERR.
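A hypothetical sketch of that splitting, assuming the driver has already validated 64-byte alignment and that queue_descriptor() stands in for whatever the real DMA engine's descriptor API looks like:

#include <stdint.h>
#include <stddef.h>

extern void queue_descriptor(uintptr_t addr, size_t len); /* hypothetical */

static void acp_dma_submit(uintptr_t addr, size_t len)
{
	/* the A53 ACP accepts only 16 B or 64 B aligned transfers */
	while (len >= 64) {
		queue_descriptor(addr, 64); /* one 64 B INCR burst */
		addr += 64;
		len  -= 64;
	}
	while (len >= 16) {
		queue_descriptor(addr, 16); /* 16 B tail */
		addr += 16;
		len  -= 16;
	}
}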
Lastly, the on-chip network routing must support connecting the DMA to the ACP, which is decided at chip design time. In my experience this is more commonly done for network accelerators and FPGA fabric glue, and less often for low-speed peripherals like UART/SPI/I2C.

strange behaviour using mmap

I'm using the Angstrom embedded Linux kernel v2.6.37, based on the Technexion distribution.
DM3730 SoC, TDM3730 module, custom baseboard.
CodeSourcery toolchain v. 2010-09.50
Here is the dataflow in my system:
http://i.stack.imgur.com/kPhKw.png
The FPGA generates incrementing data, and the kernel reads it via GPMC DMA. GPMC pack size = 512 data samples. Buffer size = 61440 32-bit samples (= 60 RAM pages).
The DMA buffer is allocated with dma_alloc_coherent() and mapped to userspace by an mmap() call. The user application reads data directly from the DMA buffer, 4096 samples at a time, and saves it to NAND using fwrite().
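For reference, the export path described above presumably looks something like this sketch (identifiers are hypothetical):

/* in probe(): allocate the coherent ring buffer (61440 * 4 bytes) */
drv->buf = dma_alloc_coherent(drv->dev, 61440 * sizeof(u32),
                              &drv->buf_dma, GFP_KERNEL);

/* the driver's mmap handler maps that same buffer into the process */
static int mydrv_mmap(struct file *filp, struct vm_area_struct *vma)
{
	struct mydrv *drv = filp->private_data;

	return dma_mmap_coherent(drv->dev, vma, drv->buf, drv->buf_dma,
	                         vma->vm_end - vma->vm_start);
}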
And what do I see in my file? http://i.stack.imgur.com/etzo0.png
The red line marks the first border of the ring buffer. Oops! Small packs (~16 samples) of stale data start to appear after the border. Their values exactly match the "old" values of the corresponding buffer positions. But WHY? 16 samples is much less than the DMA pack size and the user read size, so there cannot be a pointer mismatch.
I guess some mmap() behaviour is hiding somewhere. I have tried different flags for mmap() - such as MAP_LOCKED, MAP_POPULATE, MAP_NONBLOCK - with no success. I completely fail to understand this behaviour :(
P.S. When I use copy_to_user() from the kernel instead of mmap() and zero-copy access, there is no such behaviour.

Provide several kernel buffers through mmap

I have a kernel driver which allocates several buffers in kernel space (physically contiguous, aligned to page boundaries, and consisting of integral number of pages).
Next, I need to make my driver able to mmap some of these buffers to userspace (one buffer per mmap() call, of course). The driver registers a single character device for that purpose.
The userspace program must be able to tell the kernel which buffer it wants to mmap (for example, by specifying its index or unique ID, or a physical address previously resolved through ioctl()).
I want to do so by using mmap()'s offset parameter, for example (from userspace):
mapped_ptr = mmap(NULL, buf_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, (MAGIC + buffer_id) * PAGE_SIZE);
Where "MAGIC" is some magic number, and buffer_id is the buffer ID which I want to mmap.
Next, in the kernel part there will be something like this:
static int my_dev_mmap(struct file *filp, struct vm_area_struct *vma)
{
	int bufferID = vma->vm_pgoff - MAGIC;
	unsigned long len = vma->vm_end - vma->vm_start;
	struct my_buffer *buf;

	/* convert bufferID to a PFN via the driver's buffer descriptors */
	buf = my_dev_find_buffer(bufferID);
	if (!buf || len > buf->len)
		return -EINVAL;

	return remap_pfn_range(vma, vma->vm_start, buf->pfn, len,
			       vma->vm_page_prot);
}
But I think this is a somewhat dirty way, because the "offset" in mmap() is not supposed to specify an index or identifier; its role is to give the number of skipped bytes (or pages) from the beginning of the mmap-ed device (or file) memory (which is supposed to be contiguous, right?).
However, I've already seen some drivers in mainline which use the "offset" to distinguish between mmap-ed buffers.
Are there any alternative solutions to this?
P.S.
I need all this just because I'm dealing with an unusual SoC graphics controller, which can operate only on physically contiguous memory buffers aligned to an 8-byte boundary. So I can only allocate such buffers in kernel space and pass them to user space via mmap().
Most of the controller programming (composing instruction batches and pushing them to the kernel driver) is performed in user space.
Also, I can't just allocate a single big chunk of physically contiguous memory, because it would need to be really big (e.g. 16+ MiB) and alloc_pages_exact() would fail.
I don't see anything wrong with using the offset to pass the index in from userspace to your driver. If it bugs you, then just look at your driver as assembling a large buffer out of individual pages that it wants to present to userspace as virtually contiguous, so that the offset really is an offset into this buffer. But really in my opinion there's nothing wrong with doing things this way.
Another alternative, if you can use kernel 3.5 or newer, might be to use the "Contiguous Memory Allocator" (CMA) -- look at <linux/dma-contiguous.h> and drivers/base/dma-contiguous.c for more information. There's also https://lwn.net/Articles/486301/ as a reference but I don't know how much (if anything) changed between that article and getting the code merged into mainline.
Finally, I've chosen to mmap exactly one buffer per opened device file descriptor (struct file in the kernel) and implement control through ioctl(): one IOCTL to allocate a new buffer, one to attach to an already allocated buffer with a known ID, and another to get information about a buffer.
Usually, userspace will mmap() about 10-20 buffers at the same time, so this is a nice, clean solution for this case.
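The user-visible interface for that design might look something like this sketch (all names and numbers are made up):

#include <linux/ioctl.h>
#include <linux/types.h>

struct mybuf_info {
	__u32 id;    /* unique buffer ID */
	__u32 size;  /* buffer length in bytes */
	__u64 phys;  /* physical address, for building GPU batches */
};

#define MYBUF_IOC_MAGIC 'b'
#define MYBUF_ALLOC  _IOWR(MYBUF_IOC_MAGIC, 0, struct mybuf_info) /* allocate new */
#define MYBUF_ATTACH _IOW(MYBUF_IOC_MAGIC, 1, __u32)              /* attach by ID */
#define MYBUF_QUERY  _IOR(MYBUF_IOC_MAGIC, 2, struct mybuf_info)  /* get info */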

why is data structure alignment important for performance?

Can someone give me a short and plausible explanation for why the compiler adds padding to data structures in order to align their members? I know that it's done so that the CPU can access the data more efficiently, but I don't understand why this is so.
And if this is only CPU related, why is a double 4 byte aligned in Linux and 8 byte aligned in Windows?
Alignment helps the CPU fetch data from memory efficiently: fewer cache misses/flushes, fewer bus transactions, and so on.
Some memory types (e.g. RDRAM, DRAM etc.) need to be accessed in a structured manner (aligned "words" and in "burst transactions" i.e. many words at one time) in order to yield efficient results. This is due to many things amongst which:
setup time: time it takes for the memory devices to access the memory locations
bus arbitration overhead i.e. many devices might want access to the memory device
"Padding" is used to correct the alignment of data structures in order to optimize transfer efficiency.
In other words, accessing a "mis-aligned" structure yields lower overall performance. A good example of such a pitfall: suppose a data structure is mis-aligned and requires the CPU/memory controller to perform 2 bus transactions (instead of 1) to fetch it; performance is consequently lower.
The CPU fetches data from memory in groups of 4 bytes (it actually depends on the hardware; it's 8 or other values for some types of hardware, but let's stick with 4 to keep it simple).
All is well if the data begins at an address that is divisible by 4: the CPU goes to the memory address and loads the data.
Now suppose the data begins at an address not divisible by 4 - say, for simplicity, at address 1. The CPU must fetch starting from address 0 and then apply some shifting and masking to drop the byte at address 0 in order to get at the actual data at byte 1. This takes time and therefore lowers performance, so it is much more efficient to have all data addresses aligned, as the example below shows.
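You can watch the compiler enforce this by checking where it places members. A minimal example; the offsets shown assume a typical ABI with 4-byte int alignment:

#include <stdio.h>
#include <stddef.h>

struct sample {
	char flag;  /* offset 0 */
	int  value; /* offset 4, not 1: 3 padding bytes inserted */
};

int main(void)
{
	printf("offsetof value: %zu\n", offsetof(struct sample, value)); /* 4 */
	printf("sizeof struct:  %zu\n", sizeof(struct sample));          /* 8 */
	return 0;
}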
A cache line is a basic unit of caching. Typically it is 16-64 bytes or more.
Pentium IV: 64 bytes; Pentium Pro/II: 32 bytes; Pentium I: 32 bytes; 486: 16 bytes.
myrandomreader:
; ...
; ten instructions to generate next pseudo-random
; address in ESI from previous address
; ...
MOV EAX, DS:[ESI] ; 4-byte read from a pseudo-random address
LOOP myrandomreader
For memory read straddling two cachelines:
(for L1 cache miss) the processor must wait for the whole of cache line 1 to be read from L2->L1 into the processor before it can request the second cache line, causing a short execution stall
(for L2 cache miss) the processor must wait for two burst reads from L3 cache (if present) or main memory to complete rather than one
Either way, the processor stalls.
A random 4-byte read will straddle a cacheline boundary about 5% of the time for 64-byte cachelines, 10% for 32-byte ones and 20% for 16-byte ones.
There may be additional execution overheads for some instructions on misaligned data even if it is within a cacheline. This is talked about on the Intel website for some SSE instructions.
If you are defining the structures yourself, it may make sense to list all the <32-bit data fields together in a struct so that padding overhead is reduced, or alternatively to review whether it is better to turn packing on or off for a particular structure.
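For instance (sizes again assume a typical ABI with 4-byte int alignment), grouping the narrow fields together shrinks the struct:

/* 1 + pad(3) + 4 + 1 + pad(3) = 12 bytes */
struct scattered {
	char a;
	int  b;
	char c;
};

/* 4 + 1 + 1 + pad(2) = 8 bytes: same fields, grouped by size */
struct grouped {
	int  b;
	char a;
	char c;
};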
On MIPS and many other platforms you don't get the choice and must align - you get a kernel exception if you don't!
Alignment may also matter especially to you if you are doing I/O on the bus, using atomic operations such as atomic increment/decrement, or if you want to be able to port your code to non-Intel platforms.
For Intel-only (!) code, a common practice is to define one set of packed structures for network and disk, and another padded set for in-memory use, and to have routines that convert data between these formats (also consider "endianness" for the disk and network formats).
In addition to jldupont's answer, some architectures have load and store instructions (those used to read/write to and from memory) that only operate on word-aligned boundaries - so, to load a non-aligned word from memory would take two load instructions, a shift instruction, and then a mask instruction - much less efficient!
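In C, that fallback looks like the sketch below: assembling an unaligned 32-bit little-endian word from four byte loads plus shifts, which is essentially what the two-load/shift/mask sequence has to achieve on alignment-strict hardware:

#include <stdint.h>

/* Read a 32-bit little-endian word from an arbitrarily aligned address
 * using only byte loads and shifts. */
static uint32_t load_u32_unaligned(const uint8_t *p)
{
	return (uint32_t)p[0]
	     | ((uint32_t)p[1] << 8)
	     | ((uint32_t)p[2] << 16)
	     | ((uint32_t)p[3] << 24);
}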
