How to use mmap to share memory between user space and the kernel - linux-kernel

I am having trouble finding suitable examples to solve my problem. I want to share 4K (4096 bytes) of data between user and kernel space. I have found many suggestions saying I should allocate memory in the kernel and mmap it in user space. Can someone provide an example of how to do this on Linux 2.6.38? Is there any good document that explains it?
Thanks in advance.

Your proposed way is one way, but since user space is not within your control (meaning any user-space program could poke into the shared memory), you are opening up opportunities for malicious attacks from user space. This kernel-based memory sharing with user space is also described here:
http://www.scs.ch/~frey/linux/memorymap.html
Instead, how about allocating the memory in user space, and then having the kernel use the APIs copy_from_user() and copy_to_user() to copy to/from user-space memory? If you want to share the memory among different processes, you can always use the IPC-related APIs to allocate and define it, e.g. shmget() etc. And in this case there are lots of sample usages within the kernel source itself.
e.g.:
fs/checksum.c: missing = __copy_from_user(dst, src, len);
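For a rough idea, a read/write pair for a character device might look like the following sketch (the 4K buffer dev_buf and the surrounding device registration are my own assumptions, not shown):

#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/uaccess.h>

#define BUF_SIZE 4096
static char dev_buf[BUF_SIZE]; /* hypothetical 4K kernel-side buffer */

/* read(): copy kernel data out to the caller's buffer */
static ssize_t my_read(struct file *f, char __user *ubuf, size_t len, loff_t *off)
{
    size_t n = min(len, (size_t)BUF_SIZE);

    if (copy_to_user(ubuf, dev_buf, n))
        return -EFAULT;
    return n;
}

/* write(): copy the caller's data into the kernel-side buffer */
static ssize_t my_write(struct file *f, const char __user *ubuf, size_t len, loff_t *off)
{
    size_t n = min(len, (size_t)BUF_SIZE);

    if (copy_from_user(dev_buf, ubuf, n))
        return -EFAULT;
    return n;
}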

Related

Can one remap kernel virtual address for use by kernel code

I am porting a large application to ARM32 Linux and splitting off the hardware-specific parts into a device driver. Nearly all of the extensive driver code uses absolute addresses to access buffers and I/O-related variables and registers. I'd hate to have to change all of that to pointer-relative addressing - a lot of the code is in assembler as well.
From user space it is simple to use mmap to ask for a target virtual address for physical memory (via /dev/mem), so that side poses no issue.
But how can I do something similar in kernel code? ioremap and memremap give you a random kernel virtual address; worse, loading a driver with insmod places both code and data (.bss) in vmalloc memory.
remap_pfn_range can be used to map kernel memory to user space via an mmap call (and with that, to ask for a given virtual address range) - but can that be used from the kernel itself, if at all?
So for example, say I have a buffer at physical address 0x60000000 - how can I tell the kernel to map it to a given kernel-accessible virtual address (perhaps also 0x60000000, but it could be anything as long as it's known at compile time)?
So far I have spent days surfing anything that mentions remapping, but I am not finding the "golden" answer. Does anybody know if one exists?
AFAIK there is no "easy" way to do that.
This document explains the layout of Linux kernel memory, and as you can see, modules have a specific mapping space that can't be changed as long as you load your code with the init_module syscall, and dynamic memory allocated with things like kmalloc also has a specific range.
Maybe you'll be able to hack something together to create a buffer at a known address, but if memory serves, Linux depends on the layout mentioned above for some fundamental things (page faults etc.).
OK, I have the answer and it's embarrassingly simple.
In my case I am running an STM32MP157 chip under Buildroot. It so happens that 512 MB of DRAM sits at physical address 0xC0000000, which means kernel-space virtual address = physical address: PAGE_OFFSET and PHYS_OFFSET are both 0xC0000000, so they simply cancel out.
Right, to display a nice logo on startup, a 3 MB framebuffer is allocated in CMA memory, which starts at 0xD8000000. This is done in early kernel init and it is the first thing in CMA. Later on I allocate more framebuffers via DRM, but the first one stays.
It's unused after kernel boot - except now it isn't: it's my perfect solution. I just read and write directly into 0xD8000000 to 0xD83FFFFF (the physical location and size of that framebuffer). All the variables I need at locations known at compile time are placed in that space - directly accessible, no pointers needed. No need to modify my existing code other than telling the linker to place the variables at 0xD8000000 (see the sketch below).
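In case it helps anyone, the placement itself can be done with a dedicated section plus a linker-script entry, something like this sketch (the section name .fbshared and the variables are my own invention; the address is the framebuffer base from above):

/* C side: force the variables into a dedicated section */
volatile unsigned int frame_count __attribute__((section(".fbshared")));
volatile unsigned char scratch[4096] __attribute__((section(".fbshared")));

/*
 * Linker script side (custom .lds fragment), placing the section at the
 * framebuffer's address without emitting load data for it:
 *
 *   SECTIONS {
 *       .fbshared 0xD8000000 (NOLOAD) : { *(.fbshared) }
 *   }
 */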

shared lock-free queue between kernel/user address space

I am trying to construct two shared queues (one command queue and one reply queue) between user and kernel space, so that the kernel can send messages to user space and user space can send replies back to the kernel after it finishes processing.
What I have done is allocate kernel memory pages (for the queues) and mmap them to user space, and now both the user and kernel side can access those pages (here I mean that what is written in kernel space can be correctly read in user space, and vice versa).
The problem is that I don't know how to synchronize access between kernel and user space. Say I am going to construct a ring buffer for a multi-producer, 1-consumer scheme: how do I make sure the ring buffer accesses don't get corrupted by simultaneous writes?
I did some research this week and here are some possible approaches, but I am quite new to kernel module development and not so sure whether they will work or not. While digging into them, I would be glad to get any comments or suggestions:
Use a shared semaphore between user/kernel space: Shared semaphore between user and kernel spaces.
But many system calls like sem_timedwait() would be used, and I worry about how efficient that would be.
What I really prefer is a lock-free scheme, as described in https://lwn.net/Articles/400702/. The related files in the kernel tree are:
kernel/trace/ring_buffer_benchmark.c
kernel/trace/ring_buffer.c
Documentation/trace/ring-buffer-design.txt
How the lock-free behavior is achieved is documented here: https://lwn.net/Articles/340400/
However, I assume these are kernel implementations and cannot be used directly in user space (as with the example in ring_buffer_benchmark.c). Is there any way I can reuse those schemes in user space? I also hope to find more examples.
Also, in that article (LWN 400702), an alternative approach is mentioned that uses the perf tools, which seems similar to what I am trying to do. If 2 won't work, I will try this approach:
The user-space perf tool therefore interacts with the kernel through reads and writes in a shared memory region without using system calls.
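That shared-memory handshake is visible in the perf mmap layout: the first page of the mapping is a struct perf_event_mmap_page whose data_head/data_tail fields play the producer/consumer roles. A consumer-side sketch, following the perf_event_open(2) man page:

#include <linux/perf_event.h>
#include <stdint.h>

/* Drain whatever the kernel has produced since the last call.
 * meta points at the first page of the perf mmap region. */
static void drain(struct perf_event_mmap_page *meta)
{
    uint64_t head = meta->data_head;
    __sync_synchronize(); /* pairs with the kernel's write barrier */

    /* ... consume records between meta->data_tail and head ... */

    meta->data_tail = head; /* tell the kernel the data has been read */
}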
Sorry for the English grammar... hope it makes sense.
To synchronize between kernel and user space, you may use the circular buffer mechanism (documented in Documentation/circular-buffers.txt).
The key feature of such buffers is the two pointers (head and tail), which can be updated separately; this fits well with separate user and kernel code. Also, the implementation of a circular buffer is quite simple, so it is not difficult to implement one in user space (see the sketch below).
Note that for multiple producers in the kernel you need to synchronize them with a spinlock or similar.
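A user-space consumer following that head/tail scheme could be as small as this sketch (the shared page layout is my own assumption; C11 atomics stand in for the kernel's smp_load_acquire()/smp_store_release()):

#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 1024 /* must be a power of two */

struct ring { /* assumed layout of the shared, mmap'ed page */
    _Atomic uint32_t head; /* written by the producer (kernel side) */
    _Atomic uint32_t tail; /* written by the consumer (user side) */
    unsigned char data[RING_SIZE];
};

/* Consume one byte; returns 0 if the ring is empty. */
static int ring_consume(struct ring *r, unsigned char *out)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)
        return 0; /* nothing to read */

    *out = r->data[tail & (RING_SIZE - 1)];
    /* release: free the slot only after the data has been read */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 1;
}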

Two-way communication to PCIe device via /dev/mem in Linux user-space?

Pretty sure I already know the answer to this question, since there are related questions on SO already (here, here, and here, and this was useful), but I wanted to be absolutely sure before I dive into kernel-space driver land (never been there before).
I have a PCIe device that I need to communicate with (and vice versa) from an app in Linux user space. By opening /dev/mem and then mmap'ing, I have been able to write a user-space driver built on top of pciutils that has allowed me to mmap the BARs and successfully write data to the device. Now we need communication to go the other direction, from the PCIe device to the Linux user app. In order for this to work, we believe we are going to need a large chunk (~100MB) of physically contiguous memory that never gets paged/swapped. Once allocated, that address will need to be passed to the PCIe device so it knows where to write its data (thus I don't see how this could be virtual, swappable memory). Is there any way to do this without a kernel-space driver? One idea was floated here: perhaps we can open /dev/mem and then feed it an ioctl command to allocate what we need? If this is possible, I haven't been able to find any examples online yet and will need to research it more heavily.
Assuming we need a kernel-space driver, it would be best to allocate our large chunk during boot, then use ioremap to get a kernel virtual address, then mmap from there to user space, correct? From what I've read about kmalloc, we won't get anywhere close to 100MB with that call, and vmalloc is no good since its memory is only virtually, not physically, contiguous. In order to allocate at boot, the driver should be statically linked into the kernel, correct? This is basically an embedded application, so portability is not a huge concern. A module rather than a statically linked driver could probably work, but my worry there is that memory fragmentation could prevent a physically contiguous region from being found, so I'd like to allocate it as soon as possible after power-on. Any feedback?
EDIT1: My CPU is an ARM7 architecture.
Hugepages-1G
Current x86_64 processors support not only 4K and 2M pages, but also 1G pages (the pdpe1gb flag in /proc/cpuinfo indicates support).
These 1G pages must be reserved at kernel boot, so the boot parameters hugepagesz=1G hugepages=1 must be specified.
Then, the hugetlbfs must be mounted:
mkdir /hugetlb-1G
mount -t hugetlbfs -o pagesize=1G none /hugetlb-1G
Then open a file in it and mmap it:
#define SIZE_1G (1UL << 30)

int fd = open("/hugetlb-1G/page-1", O_CREAT | O_RDWR, 0755);
void *addr = mmap(NULL, SIZE_1G, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
You can now access 1G of physically contiguous memory at addr. To be sure it doesn't get swapped out you can use mlock (though this is probably not even necessary for hugepages, which are not swapped).
Even if your process crashes, the huge page stays reserved and can be mapped again as above, so the PCIe device will not write into unrelated system or process memory.
You can find out the physical address by reading /proc/[pid]/pagemap, as sketched below.
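A sketch of that lookup (one 64-bit entry per virtual page; bits 0-54 hold the page frame number and bit 63 the present flag, per Documentation/vm/pagemap.txt):

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

/* Translate a virtual address of the calling process to a physical one. */
static uint64_t virt_to_phys_user(uintptr_t vaddr)
{
    uint64_t entry = 0;
    long pagesize = sysconf(_SC_PAGESIZE);
    FILE *f = fopen("/proc/self/pagemap", "rb");

    if (!f)
        return 0;
    fseeko(f, (vaddr / pagesize) * sizeof(entry), SEEK_SET);
    if (fread(&entry, sizeof(entry), 1, f) != 1)
        entry = 0;
    fclose(f);

    if (!(entry & (1ULL << 63))) /* bit 63: page present */
        return 0;
    /* bits 0-54: page frame number */
    return (entry & ((1ULL << 55) - 1)) * pagesize + (vaddr % pagesize);
}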
Actually, Ctx's comment about memmap is what got me down the right path. To reserve memory, I passed a kernel boot argument memmap=[size]$[location], which I found here. Different symbols mean different things, and they aren't exactly intuitive. Just another slight correction: the flag is CONFIG_STRICT_DEVMEM, which my kernel was not compiled with.
There are still some mysteries. For instance, the [location] in the memmap argument seemed to be meaningless. No matter what I set as the location, Linux took everything that was not reserved with [size] in one contiguous chunk, and the space I reserved was at the end. The only indication of this was looking at /proc/iomem: the amount of space I reserved matched the gap between the end of the Linux memory space and the end of system memory. I could find no indication anywhere that Linux said "I see your reserved chunk and I won't touch it" other than that it wasn't claimed by Linux in /proc/iomem. But the FPGA has been writing to this space for days now with no visible ill effects for Linux, so I guess we're all good! I can just mmap to that location and read the data (surprised this works since Linux doesn't indicate it exists, but glad it does). Thanks for the help! Ian, I'll come back to your comment if I go to kernel-driver land.
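For anyone following along, the user-space side is then just an mmap of /dev/mem at the reserved physical base. A sketch with placeholder values (take the real base and size from your memmap argument and /proc/iomem):

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define RESERVED_BASE 0x20000000UL /* placeholder: real base per /proc/iomem */
#define RESERVED_SIZE (100 * 1024 * 1024UL)

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0)
        return 1;

    volatile uint8_t *buf = mmap(NULL, RESERVED_SIZE, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, RESERVED_BASE);
    if (buf == MAP_FAILED)
        return 1;

    uint8_t first = buf[0]; /* read back data the FPGA has written */
    (void)first;

    munmap((void *)buf, RESERVED_SIZE);
    close(fd);
    return 0;
}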

Need help mapping pre-reserved **cacheable** DMA buffer on Xilinx/ARM SoC (Zynq 7000)

I've got a Xilinx Zynq 7000-based board with a peripheral in the FPGA fabric that has DMA capability (on an AXI bus). We've developed a circuit and are running Linux on the ARM cores. We're having performance problems accessing a DMA buffer from user space after it's been filled by hardware.
Summary:
We have pre-reserved at boot time a section of DRAM for use as a large DMA buffer. We're apparently using the wrong APIs to map this buffer, because it appears to be uncached, and the access speed is terrible.
Using it even as a bounce buffer is untenably slow due to the horrible performance. IIUC, ARM caches are not DMA-coherent, so I would really appreciate some insight on how to do the following:
Map a region of DRAM into the kernel virtual address space but ensure that it is cacheable.
Ensure that mapping it into userspace doesn't also have an undesirable effect, even if that requires we provide an mmap call by our own driver.
Explicitly invalidate a region of physical memory from the cache hierarchy before doing a DMA, to ensure coherency.
More info:
I've been trying to research this thoroughly before asking. Unfortunately, this being an ARM SoC/FPGA, there's very little information available on this, so I have to ask the experts directly.
Since this is an SoC, a lot of stuff is hard-coded for u-boot. For instance, the kernel and a ramdisk are loaded to specific places in DRAM before handing control over to the kernel. We've taken advantage of this to reserve a 64MB section of DRAM for a DMA buffer (it does need to be that big, which is why we pre-reserve it). There isn't any worry about conflicting memory types or the kernel stomping on this memory, because the boot parameters tell the kernel what region of DRAM it has control over.
Initially, we tried to map this physical address range into kernel space using ioremap, but that appears to mark the region uncacheable, and the access speed is horrible, even if we try to use memcpy to make it a bounce buffer. We use /dev/mem to map this also into userspace, and I've timed memcpy as being around 70MB/sec.
Based on a fair amount of searching on this topic, it appears that although half the people out there want to use ioremap like this (which is probably where we got the idea from), ioremap is not supposed to be used for this purpose and that there are DMA-related APIs that should be used instead. Unfortunately, it appears that DMA buffer allocation is totally dynamic, and I haven't figured out how to tell it, "here's a physical address already allocated -- use that."
One document I looked at is this one, but it's way too x86 and PC-centric:
https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt
And this question also comes up at the top of my searches, but there's no real answer:
get the physical address of a buffer under Linux
Looking at the standard calls, dma_set_mask_and_coherent and family won't take a pre-defined address, and they want a device structure for PCI. I don't have such a structure, because this is an ARM SoC without PCI. I could manually populate such a structure, but that smells to me like abusing the API, not using it as intended.
BTW: This is a ring buffer, where we DMA data blocks into different offsets, but we align to cache line boundaries, so there is no risk of false sharing.
Thank you a million for any help you can provide!
UPDATE: It appears that there's no such thing as a cacheable DMA buffer on ARM if you do it the normal way. Maybe if I don't make the ioremap call, the region won't be marked as uncacheable, but then I have to figure out how to do cache management on ARM, which I haven't managed yet. One of the problems is that memcpy in user space appears to perform really poorly. Is there a memcpy implementation optimized for uncached memory that I could use? Maybe I should write one. I have to figure out whether this processor has NEON.
Have you tried implementing your own char device with an mmap() method remapping your buffer as cacheable (by means of remap_pfn_range())?
I believe you need a driver that implements mmap() if you want the mapping to be cached.
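A minimal sketch of such an mmap() method (PHYS_BASE is a placeholder for the pre-reserved buffer's physical base; the point is that leaving vma->vm_page_prot untouched gives a normal cacheable mapping):

#include <linux/fs.h>
#include <linux/mm.h>

#define PHYS_BASE 0x38000000UL /* placeholder: pre-reserved buffer base */

static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    /*
     * Leave vma->vm_page_prot alone for a cacheable mapping; this is
     * where pgprot_noncached()/pgprot_writecombine() would be applied
     * if an uncached or write-combined mapping were wanted instead.
     */
    return remap_pfn_range(vma, vma->vm_start, PHYS_BASE >> PAGE_SHIFT,
                           size, vma->vm_page_prot);
}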
We use two device drivers for this: portalmem and zynqportal. In the Connectal Project, we call the connection between user space software and FPGA logic a "portal". These drivers require dma-buf, which has been stable for us since Linux kernel version 3.8.x.
The portalmem driver provides an ioctl to allocate a reference-counted chunk of memory and returns a file descriptor associated with that memory. This driver implements dma-buf sharing. It also implements mmap() so that user-space applications can access the memory.
At allocation time, the application may choose cached or uncached mapping of the memory. On x86, the mapping is always cached. Our implementation of mmap() currently starts at line 173 of the portalmem driver. If the mapping is uncached, it modifies vma->vm_page_prot using pgprot_writecombine(), enabling buffering of writes but disabling caching.
The portalmem driver also provides an ioctl to invalidate and optionally write back data cache lines.
The portalmem driver has no knowledge of the FPGA. For that, we use the zynqportal driver, which provides an ioctl for transferring a translation table to the FPGA so that we can use logically contiguous addresses on the FPGA and translate them to the actual DMA addresses. The allocation scheme used by portalmem is designed to produce compact translation tables.
We use the same portalmem driver with pcieportal for PCI Express attached FPGAs, with no change to the user software.
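For the invalidate/write-back step itself, outside of portalmem's custom ioctl, the standard route is the streaming DMA API's sync calls, roughly like this sketch (dev and dma_handle are assumed to come from an earlier dma_map_single()):

#include <linux/dma-mapping.h>

/* Receive one block from the device into a streaming DMA buffer. */
static void receive_block(struct device *dev, dma_addr_t dma_handle, size_t len)
{
    /* Hand ownership to the device: the buffer's cache lines are dealt
     * with so stale data cannot mask what the device writes. */
    dma_sync_single_for_device(dev, dma_handle, len, DMA_FROM_DEVICE);

    /* ... start the DMA and wait for its completion here ... */

    /* Take ownership back before the CPU reads the fresh data. */
    dma_sync_single_for_cpu(dev, dma_handle, len, DMA_FROM_DEVICE);
}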
The Zynq has NEON instructions, and an assembly-code implementation of memcpy using NEON instructions, operating on buffers aligned to the cache-line boundary (32 bytes), will achieve rates of 300 MB/s or higher.
I struggled with this for some time with udmabuf and discovered the answer was as simple as adding dma-coherent; to its entry in the device tree. I saw a dramatic speedup in access time from this simple step, though I still need to add code to invalidate/flush caches whenever I transfer ownership from/to the device.

How to access DMA in Linux

I am writing a device driver in Linux for which I need to implement DMA.
It is clear that DMA buffers can be allocated by a call to pci_alloc_consistent(). But how can we write commands to those buffers from user level?
Tasks include writing values to specific registers; how are these implemented using DMA commands?
I believe you can write with DMA through I/O operations that you can access through the GNU C library. You must use system calls such as ioperm or iopl, and run as root, to gain access to the relevant I/O registers. At least that's how one gains access to I/O space, which may be used to set up DMA access. Though this may not answer the question completely, hopefully it points you in a good direction; a small sketch follows.
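For what it's worth, the port-I/O access mentioned above looks like this on x86 (a sketch; port 0x378 is just an example, and note this gives you register access, not a DMA buffer):

#include <stdio.h>
#include <sys/io.h> /* ioperm(), inb(), outb() - x86 glibc */

#define PORT 0x378 /* example I/O port */

int main(void)
{
    /* Must run as root: grant this process access to one port. */
    if (ioperm(PORT, 1, 1) < 0) {
        perror("ioperm");
        return 1;
    }

    outb(0xAB, PORT); /* write a value to the register */
    printf("read back: 0x%02x\n", inb(PORT));

    ioperm(PORT, 1, 0); /* drop the access again */
    return 0;
}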
