strange behaviour using mmap - linux-kernel

I'm using Angtsrom embedded linux kernel v.2.6.37, based on Technexion distribution.
DM3730 SoC, TDM3730 module, custom baseboard.
CodeSourcery toolchain v. 2010-09.50
Here is dataflow in my system:
http://i.stack.imgur.com/kPhKw.png
FPGA generates incrementing data, Kernel reads it via GPMC DMA. GPMC pack size = 512 data samples. Buffer size = 61440 32bit samples (=60 ram pages).
DMA buffer is allocated by dma_alloc_coherent and mapped to userspace by mmap() call. User application directly reads data from DMA buffer and saving to NAND using fwrite() call. User reads data by 4096 samples at once.
And what I see in my file? http://i.stack.imgur.com/etzo0.png
Red line means first border of ring buffer. Ooops! Small packs (~16 samples) starts to hide after border. Their values is accurately = "old" values of corresponding buffer position. But WHY? 16 samples is much lesser than DMA pack size and user read pack size, so there cannot be pointers mismatch.
I guess there is some mmap() feature is hiding somewhere. I have tried different flags for mmap() - such as MAP_LOCKED, MAP_POPULATE, MAP_NONBLOCK with no success. I completely missunderstanding this behaviour :(
P.S. When i'm using copy_to_user() from kernel instead of mmap() and zero-copy access, there is no such behaviour.

Related

ACP and DMA, how they work?

I'm using ARM a53 platform, it has ACP component, and I'm trying to use DMA to transfer data through ACP.
By ARM trm document, if I understand it correctly, the DMA transmission data size limits to 64 bytes for each DMA transfer when using ACP.
If so, does this limitation make DMA not usable? Because it's dumb to configure DMA descriptor but to transfer 64 bytes only each time.
Or DMA should auto divide its transfer length into many ACP size limited(64 bytes) packets, without any software intervention.
Need any expert to explain how ACP and DMA work together.
Somewhere in the interfaces from the DMA to the ACP's AXI port should auto divide its transfer length as needed into transfers of appropriate length. For the Cortex-A53 ACP, AXI transfers are limited to 64B(perhaps intentionally 1x cacheline).
From https://developer.arm.com/documentation/ddi0500/e/level-2-memory-system/acp/transfer-size-support :
x byte INCR request characterized by:(some list of limitations)
Note the use of INCR instead of FIXED. INCR will automatically increment the address according to the size of the transfer, while FIXED will not. This makes it simple for the peripheral break a large transfer into a series of multiple INCR transfers.
However, do note that on the Cortex-A53, transfer size(x in the quote) is fixed at 16 or 64 byte aligned transfers. If the DMA sends an inappropriate sized transfer(because misconfigured or correct size unsupported), the AXI will emit a SLVERR. If the buffer is not appropriately aligned, I think this also causes a SLVERR.
Lastly, the on-chip network routing must support connecting the DMA to the ACP at chip design time. In my experience this is more commonly done for network accelerators and FPGA fabric glue, but tends to be less often connected for low speed peripherals like UART/SPI/I2C.

AVR Internal Data bus width

I got a doubt about the width of internal data bus of AVR controllers connected to flash memory. I was mainly referring to Atmega328. Datasheet says (Page 17) "Since all AVR instructions are 16 or 32 bits wide, the Flash is organized as
2/4/8/16K x 16.". That means flash memory data bus width must be 16 bit? I could not see anywhere mentioning about 16 bit wide program memory data bus (Of course internal to the controller). But bus for RAM seems to be again 8 bit. Just want a clarification.
BitThe 8-bit AVR family is based on a (modified) Harvard architecture, where you have dedicated program and data storages. The data path to program memory is indeed 16-bit, while it is 8-bit only to data memory.
The funny part is, that in the beginning Atmel points out, that these are 8-bit CPUs. This makes them look very competitive when compared to other 8-bit products like 8051 or Rabbit. Due to the 16-bit program data path the AVRs perform very well in benchmark tests. Later, when 8-bit sounds a bit old-fashioned, Atmel decided to call them 8/16-bit CPUs.
Figure 7.1 on page 9 of the data sheet/complete shows that the flash isn't at all connected to the (8 bit) data bus but only to an address bus. The "data" of the flash memory primarily goes into the instruction register and by use of the LPM instruction this data is transferred into a register. Note that when writing data to the flash you always write 16 bit (R1:R0) addressed by the Z pointer (SPM instruction) ... and that the SPM instruction cannot be expressed in "clock cycles" (pg. 617)

Allocating a large DMA buffer

I want to allocate a large DMA buffer, about 40 MB in size. When I use dma_alloc_coherent(), it fails and what I see is:
------------[ cut here ]------------
WARNING: at mm/page_alloc.c:2106 __alloc_pages_nodemask+0x1dc/0x788()
Modules linked in:
[<8004799c>] (unwind_backtrace+0x0/0xf8) from [<80078ae4>] (warn_slowpath_common+0x4c/0x64)
[<80078ae4>] (warn_slowpath_common+0x4c/0x64) from [<80078b18>] (warn_slowpath_null+0x1c/0x24)
[<80078b18>] (warn_slowpath_null+0x1c/0x24) from [<800dfbd0>] (__alloc_pages_nodemask+0x1dc/0x788)
[<800dfbd0>] (__alloc_pages_nodemask+0x1dc/0x788) from [<8004a880>] (__dma_alloc+0xa4/0x2fc)
[<8004a880>] (__dma_alloc+0xa4/0x2fc) from [<8004b0b4>] (dma_alloc_coherent+0x54/0x60)
[<8004b0b4>] (dma_alloc_coherent+0x54/0x60) from [<803ced70>] (mxc_ipu_ioctl+0x270/0x3ec)
[<803ced70>] (mxc_ipu_ioctl+0x270/0x3ec) from [<80123b78>] (do_vfs_ioctl+0x80/0x54c)
[<80123b78>] (do_vfs_ioctl+0x80/0x54c) from [<8012407c>] (sys_ioctl+0x38/0x5c)
[<8012407c>] (sys_ioctl+0x38/0x5c) from [<80041f80>] (ret_fast_syscall+0x0/0x30)
---[ end trace 4e0c10ffc7ffc0d8 ]---
I've tried different values and it looks like dma_alloc_coherent() can't allocate more than 2^25 bytes (32 MB).
How can such a large DMA buffer can be allocated?
After the system has booted up dma_alloc_coherent() is not necessarily reliable for large allocations. This is simply because non-moveable pages quickly fill up your physical memory making large contiguous ranges rare. This has been a problem for a long time.
Conveniently a recent patch-set may help you out, this is the contiguous memory allocator which appeared in kernel 3.5. If you're using a kernel with this then you should be able to pass cma=64M on your kernel command line and that much memory will be reserved (only moveable pages will be placed there). When you subsequently ask for your 40M allocation it should reliably succeed. Simples!
For more information check out this LWN article:
https://lwn.net/Articles/486301/

Linux block driver merge bio's

I have a block device driver which is working, after a fashion. It is for a PCIe device, and I am handling the bios directly with a make_request_fn rather than use a request queue, as the device has no seek time. However, it still has transaction overhead.
When I read consecutively from the device, I get bios with many segments (generally my maximum of 32), each consisting of 2 hardware sectors (so 2 * 2k) and this is then handled as one scatter-gather transaction to the device, saving a lot of signaling overhead. However on a write, the bios each have just one segment of 2 sectors and therefore the operations take a lot longer in total. What I would like to happen is to somehow cause the incoming bios to consist of many segments, or to merge bios sensibly together myself. What is the right approach here?
The current content of the make_request_fn is something along the lines of:
Determine read/write of the bio
For each segment in the bio, make an entry in a scatterlist* with sg_set_page
Map this scatterlist to PCI with pci_map_sg
For every segment in the scatterlist, add to a device-specific structure defining a multiple-segment DMA scatter-gather operation
Map that structure to DMA
Carry out transaction
Unmap structure and SG DMA
Call bio_endio with -EIO if failed and 0 if succeeded.
The request queue is set up like:
#define MYDEV_BLOCK_MAX_SEGS 32
#define MYDEV_SECTOR_SIZE 2048
blk_queue_make_request(mydev->queue, mydev_make_req);
set_bit(QUEUE_FLAG_NONROT, &mydev->queue->queue_flags);
blk_queue_max_segments(mydev->queue, MYDEV_BLOCK_MAX_SEGS);
blk_queue_physical_block_size(mydev->queue, MYDEV_SECTOR_SIZE);
blk_queue_logical_block_size(mydev->queue, MYDEV_SECTOR_SIZE);
blk_queue_flush(mydev->queue, 0);
blk_queue_segment_boundary(mydev->queue, -1UL);
blk_queue_max_segments(mydev->queue, MYDEV_BLOCK_MAX_SEGS);
blk_queue_dma_alignment(mydev->queue, 0x7);

Provide several kernel buffers through mmap

I have a kernel driver which allocates several buffers in kernel space (physically contiguous, aligned to page boundaries, and consisting of integral number of pages).
Next, I need to make my driver able to mmap some of these buffers to userspace (one buffer per mmap() call, of course). The driver registers single character device for that purpose.
Userspace program must be able to tell kernel which buffer it wants to mmap (for example, by specifying its index or unique ID, or physical address previously resolved through ioctl()).
I want to do so by using mmap()'s offset parameter, for example (from userspace):
mapped_ptr = mmap(NULL, buf_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, (MAGIC + buffer_id) * PAGE_SIZE);
Where "MAGIC" is some magic number, and buffer_id is the buffer ID which I want to mmap.
Next, in the kernel part there will be something like this:
static int my_dev_mmap(struct file *filp, struct vm_area_struct *vma)
{
int bufferID = vma->vm_pgoff - MAGIC;
/*
* Convert bufferID to PFN by looking through driver's buffer descriptors
* Check length = vma->vm_end - vma->vm_start
* Call remap_pfn_range()
*/
}
But I think it is some sort of dirty way, because "offset" in the mmap() is not supposed to specify index or identifier, its role is to provide number of skipped bytes (or pages) from the beginning of mmap-ed device(or file) memory (which is supposed to be contiguous, right?).
However, i've already seen some drivers in mainline which use "offset" to distinguish between mmap-ed buffers.
Are there any alternative solutions to this?
P.S.
I need all this just because I'm dealing with some unusual SoC' graphics controller, which can operate only on physically contiguous, aligned to 8-byte boundary memory buffers. So, I can only allocate such buffers in kernel space and pass them to user space via mmap().
The most part of controller' programming (composing instruction batches and pushing them to kernel driver) is performed in user space.
Also, I can't just allocate single big chunk of physically contiguous memory, because in that case it needs to be really big (for ex., 16+ MiB) and alloc_pages_exact() will fail.
I don't see anything wrong with using the offset to pass the index in from userspace to your driver. If it bugs you, then just look at your driver as assembling a large buffer out of individual pages that it wants to present to userspace as virtually contiguous, so that the offset really is an offset into this buffer. But really in my opinion there's nothing wrong with doing things this way.
Another alternative, if you can use kernel 3.5 or newer, might be to use the "Contiguous Memory Allocator" (CMA) -- look at <linux/dma-contiguous.h> and drivers/base/dma-contiguous.c for more information. There's also https://lwn.net/Articles/486301/ as a reference but I don't know how much (if anything) changed between that article and getting the code merged into mainline.
Finally, I've chosen to mmap exactly one buffer per one opened device file descriptor (struct file in kernel) and implement control through ioctl(): one IOCTL for allocating new buffer, one for attaching to already allocated buffer with known ID, and another one to get information about buffer.
Usually, userspace will mmap() about 10..20 buffers at the same time, so it is nice and clean solution for this case.

Resources