Two-way communication to PCIe device via /dev/mem in Linux user-space? - linux-kernel

Pretty sure I already know the answer to this question since there are related questions on SO already (here, here, and here, and this one was useful), but I wanted to be absolutely sure before I dive into kernel-space driver land (never been there before).
I have a PCIe device that I need to communicate with (and vice versa) from an app in Linux user space. By opening /dev/mem and then mmap'ing, I have been able to write a user-space driver built on top of pciutils that mmaps the BARs and successfully writes data to the device. Now we need communication to go the other direction, from the PCIe device to the Linux user app. For this to work, we believe we are going to need a large chunk (~100MB) of physically contiguous memory that never gets paged/swapped. Once allocated, that address will need to be passed to the PCIe device so it knows where to write its data (which is why I don't see how this could be virtual, swappable memory). Is there any way to do this without a kernel-space driver? One idea floated here: perhaps we can open /dev/mem and then feed it an ioctl command to allocate what we need? If this is possible, I haven't been able to find any examples online yet and will need to research it more heavily.
Assuming we need a kernel-space driver, it will be best to allocate our large chunk during bootup, then use ioremap to get a kernel virtual address, then mmap from there to user space, correct? From what I've read on kmalloc, we won't get anywhere close to 100MB with that call, and vmalloc is no good since that gives virtually (not physically) contiguous memory. In order to allocate at bootup, the driver should be statically linked into the kernel, correct? This is basically an embedded application, so portability is not a huge concern for me. A module rather than a statically linked driver could probably work, but my worry there is that memory fragmentation could prevent a physically contiguous region from being found, so I'd like to allocate it as early as possible after power-on. Any feedback?
EDIT1: My CPU is an ARM7 architecture.

Hugepages-1G
Current x86_64 processors support not only 4k and 2M pages, but also 1G pages (the pdpe1gb flag in /proc/cpuinfo indicates support).
These 1G pages must be reserved at kernel boot time, so the boot parameters hugepagesz=1GB hugepages=1 must be specified.
Then, the hugetlbfs must be mounted:
mkdir /hugetlb-1G
mount -t hugetlbfs -o pagesize=1G none /hugetlb-1G
Then open some file and mmap it:
fd = open("/hugetlb-1G/page-1", O_CREAT | O_RDWR, 0755);
addr = mmap(NULL, SIZE_1G, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
You can now access 1G of physically contiguous memory at addr. To be sure it doesn't get swapped out, you can use mlock (though this is probably not even necessary for hugepages).
Even if your process crashes, the huge page stays reserved and can be mapped again as above, so the PCIe device will not write into unrelated system or process memory.
You can find out the physical address by reading /proc/pid/pagemap.
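Putting the pieces together, here is a minimal sketch (not tested against your hardware) that maps the hugetlbfs file from above, pins it, and looks up the physical address via /proc/self/pagemap. It assumes the mount point above and that the program runs as root (otherwise the PFN field in pagemap reads as zero):

/* Sketch: map one 1G huge page and find its physical address via pagemap. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SIZE_1G (1UL << 30)

int main(void)
{
    int fd = open("/hugetlb-1G/page-1", O_CREAT | O_RDWR, 0755);
    if (fd < 0) { perror("open"); return 1; }

    unsigned char *addr = mmap(NULL, SIZE_1G, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    mlock(addr, SIZE_1G);   /* belt and braces; hugetlb pages are not swapped anyway */
    addr[0] = 0;            /* touch the page so the mapping is populated */

    /* pagemap has one 8-byte entry per 4 KiB of virtual address space;
     * bits 0-54 of the entry hold the page frame number. */
    int pm = open("/proc/self/pagemap", O_RDONLY);
    if (pm < 0) { perror("open pagemap"); return 1; }

    uint64_t entry;
    off_t offset = ((uintptr_t)addr / 4096) * sizeof(entry);
    if (pread(pm, &entry, sizeof(entry), offset) != sizeof(entry)) { perror("pread"); return 1; }

    uint64_t phys = (entry & ((1ULL << 55) - 1)) * 4096;
    printf("virtual %p -> physical 0x%llx\n", (void *)addr, (unsigned long long)phys);
    return 0;
}

The physical address printed at the end is what you would hand to the PCIe device.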

Actually, Ctx's comment about memmap is what got me down the right path. To reserve memory, I passed a kernel boot argument of the form memmap=[size]$[location], which I found here. Different symbols mean different things, and they aren't exactly intuitive. One other slight correction: the flag is CONFIG_STRICT_DEVMEM, which my kernel was not compiled with.
There are still some mysteries. For instance, the [location] in the memmap argument seemed to be meaningless. No matter what I set for the location, Linux took everything not reserved by [size] in one contiguous chunk, and the space that I reserved was at the end. The only indication of this was looking at /proc/iomem: the amount of space I reserved matched the gap between the end of the Linux memory space and the end of system memory. I could find no indication anywhere that Linux said "I see your reserved chunk and I won't touch it" other than it not being claimed by Linux in /proc/iomem. But the FPGA has been writing to this space for days now with no visible ill effects for Linux, so I guess we're all good! I can just mmap that location and read the data (surprised this works since Linux doesn't indicate it exists, but glad it does). Thanks for the help! Ian, I'll come back to your comment if I end up in kernel-driver space.
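For anyone following along, a minimal sketch of that last step, i.e. mapping the memmap-reserved region through /dev/mem (the size and physical address below are placeholders, not the ones from my board, and the '$' in the boot argument may need escaping depending on your bootloader):

/* Boot with something like: memmap=100M$0x30000000
 * (reserve 100 MB at physical 0x30000000), then map it from user space: */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>

#define RESERVED_PHYS 0x30000000UL    /* placeholder physical base of the reserved chunk */
#define RESERVED_SIZE (100UL << 20)   /* placeholder size: 100 MB */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    volatile unsigned char *buf = mmap(NULL, RESERVED_SIZE, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, RESERVED_PHYS);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    printf("first byte of the reserved region: 0x%02x\n", buf[0]);
    return 0;
}

Note that the mapping goes through /dev/mem, so the CONFIG_STRICT_DEVMEM remark above still applies.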

Related

Can one remap kernel virtual address for use by kernel code

I am porting a large application to ARM32 Linux and splitting off the hardware stuff into a device driver. Nearly all of the extensive driver code uses absolute addresses to access buffers and I/O-related variables and registers. I'd have to change all that to pointer-relative addresses, and a lot of the code is in assembler as well.
From user space it is simple to use mmap to ask for a target virtual address for physical memory (via /dev/mem) so that side poses no issue.
But how can I do something similar in kernel code? ioremap and memremap give you an arbitrary kernel virtual address; worse, loading a driver with insmod places both code and data (.bss) in vmalloc memory.
remap_pfn_range can be used to map kernel memory into user space via an mmap call (and with that, request a given virtual address range), but how can that be used from the kernel itself, if at all?
So, for example, say I have a buffer at physical address 0x60000000: how can I tell the kernel to map that to a given kernel-accessible virtual address (perhaps also 0x60000000, but it could be anything as long as it's known at compile time)?
So far I have spent days surfing anything that mentions remapping but am not finding the "golden" answer. Anybody know if one exists ?
AFAIK there is no "easy" way to do that.
This document explains the layout of Linux kernel memory, and as you can see, modules have a specific mapping region that can't be changed as long as you load your code with the init_module syscall, and dynamic memory allocated with things like kmalloc also has a specific range.
Maybe you'll be able to hack something together to create a buffer at a known address, but if my memory doesn't fail me, Linux kind of depends on the layout I mentioned above for some fundamental stuff (page faults, etc.).
OK, I have the answer and it's embarrassingly simple.
In my case I am running an STM32MP157 chip under Buildroot. It so happens that 512 MB of DRAM is placed at physical 0xC0000000, which means that for this memory the kernel virtual address equals the physical address: PAGE_OFFSET and PHYS_OFFSET are both 0xC0000000, so they simply cancel out.
Right: to display a nice logo on startup, a 3 MB framebuffer is allocated in CMA memory, which starts at 0xD8000000. This is done in early kernel init and it is the first thing in CMA. Later on I allocate more framebuffers via DRM, but the first one stays.
It's unused after kernel boot - except now it isn't. It's my perfect solution: just read and write directly into 0xD8000000 to 0xD83FFFFF (the physical location and size of that framebuffer). All the variables I need at locations known at compile time are placed in that space. Directly accessible, no pointers needed. No need to modify my existing code other than telling the linker to place the variables at 0xD8000000.
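For illustration, a minimal sketch of the idea (my actual setup uses the linker, but a fixed pointer gives the same compile-time-known address; the struct layout here is invented):

/* Sketch: variables shared with the driver/assembler code, living in the
 * otherwise-unused boot framebuffer. Because PAGE_OFFSET == PHYS_OFFSET ==
 * 0xC0000000 on this board, the kernel's linear mapping puts physical
 * 0xD8000000 at virtual 0xD8000000, so the address is usable directly
 * from kernel code. */
struct shared_vars {
    volatile unsigned int status;        /* invented example fields */
    volatile unsigned int sample_count;
    volatile unsigned int samples[1024];
};

#define SHARED_BASE 0xD8000000UL
static struct shared_vars *const shared = (struct shared_vars *)SHARED_BASE;

/* e.g.  shared->status = 1;  reads/writes physical 0xD8000000 directly. */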

Mapping device memory into user process address space

While I was going through the LDD3 book, in chapter 15 (Memory Mapping and DMA), the introduction of the mmap call says:
mmap() system call allows mapping of device memory directly into user process address space.
My confusion is regarding the address space. Why would device memory be mapped into user space, since the kernel alone takes care of the device? Why would one need to map it into user space at all? And if the device memory is mapped into user space, why does the kernel still manage it? Couldn't the device be used erroneously if it lies in the user address space?
Please correct me if I am wrong, I am just new to it.
Thanks
The same chapter you are referring to has the answer to your question.
A definitive example of mmap usage can be seen by looking at a subset of the virtual memory areas for the X Window System server. Whenever the program reads or writes in the assigned address range, it is actually accessing the device. In the X server example, using mmap allows quick and easy access to the video card’s memory. For a performance-critical application like this, direct access makes a large difference.
...
Another typical example is a program controlling a PCI device. Most PCI peripherals map their control registers to a memory address, and a high-performance application might prefer to have direct access to the registers instead of repeatedly having to call ioctl to get its work done.
But you are correct that usually kernel drivers handle devices without revealing device memory to user space:
As you might suspect, not every device lends itself to the mmap abstraction; it makes no sense, for instance, for serial ports and other stream-oriented devices. Another limitation of mmap is that mapping is PAGE_SIZE grained.
In the end, it all depends on how you want your device to be used from user space:
which interfaces you want to provide from driver to user space
what the performance requirements are
Usually you hide device memory from the user, but sometimes you need to give the user direct access to device memory (when the alternative is bad performance or an ugly interface). Only you, as an engineer, can decide which way is best in each particular case.
There are a few usages I can think of:
user mode drivers - in this case the kernel driver only acts as a stub: mapping memory to user space, passing interrupts, etc. (this is common for proprietary drivers).
some user space application is filling or reading DMA buffers directly, to avoid copying them between user space and kernel space.
Regards,
Mateusz.

Need help mapping pre-reserved **cacheable** DMA buffer on Xilinx/ARM SoC (Zynq 7000)

I've got a Xilinx Zynq 7000-based board with a peripheral in the FPGA fabric that has DMA capability (on an AXI bus). We've developed a circuit and are running Linux on the ARM cores. We're having performance problems accessing a DMA buffer from user space after it's been filled by hardware.
Summary:
We have pre-reserved at boot time a section of DRAM for use as a large DMA buffer. We're apparently using the wrong APIs to map this buffer, because it appears to be uncached, and the access speed is terrible.
Even using it as a bounce buffer is untenably slow. IIUC, ARM caches are not DMA-coherent, so I would really appreciate some insight on how to do the following:
Map a region of DRAM into the kernel virtual address space but ensure that it is cacheable.
Ensure that mapping it into userspace doesn't also have an undesirable effect, even if that requires we provide an mmap call by our own driver.
Explicitly invalidate a region of physical memory from the cache hierarchy before doing a DMA, to ensure coherency.
More info:
I've been trying to research this thoroughly before asking. Unfortunately, this being an ARM SoC/FPGA, there's very little information available on this, so I have to ask the experts directly.
Since this is an SoC, a lot of stuff is hard-coded for u-boot. For instance, the kernel and a ramdisk are loaded to specific places in DRAM before handing control over to the kernel. We've taken advantage of this to reserve a 64MB section of DRAM for a DMA buffer (it does need to be that big, which is why we pre-reserve it). There isn't any worry about conflicting memory types or the kernel stomping on this memory, because the boot parameters tell the kernel what region of DRAM it has control over.
Initially, we tried to map this physical address range into kernel space using ioremap, but that appears to mark the region uncacheable, and the access speed is horrible, even if we try to use memcpy to make it a bounce buffer. We use /dev/mem to map this also into userspace, and I've timed memcpy as being around 70MB/sec.
Based on a fair amount of searching on this topic, it appears that although half the people out there want to use ioremap like this (which is probably where we got the idea from), ioremap is not supposed to be used for this purpose and that there are DMA-related APIs that should be used instead. Unfortunately, it appears that DMA buffer allocation is totally dynamic, and I haven't figured out how to tell it, "here's a physical address already allocated -- use that."
One document I looked at is this one, but it's way too x86 and PC-centric:
https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt
And this question also comes up at the top of my searches, but there's no real answer:
get the physical address of a buffer under Linux
Looking at the standard calls, dma_set_mask_and_coherent and family won't take a pre-defined address, and they want a device structure for PCI. I don't have such a structure, because this is an ARM SoC without PCI. I could manually populate such a structure, but that smells to me like abusing the API rather than using it as intended.
BTW: This is a ring buffer, where we DMA data blocks into different offsets, but we align to cache line boundaries, so there is no risk of false sharing.
Thank you a million for any help you can provide!
UPDATE: It appears that there's no such thing as a cacheable DMA buffer on ARM if you do it the normal way. Maybe if I don't make the ioremap call, the region won't be marked as uncacheable, but then I have to work out how to do cache management on ARM, which I haven't figured out yet. One of the problems is that memcpy in user space appears to really suck. Is there a memcpy implementation that's optimized for uncached memory that I could use? Maybe I could write one. I have to figure out if this processor has NEON.
Have you tried implementing your own char device with an mmap() method remapping your buffer as cacheable (by means of remap_pfn_range())?
I believe you need a driver that implements mmap() if you want the mapping to be cached.
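For concreteness, a minimal sketch of such an mmap() file operation (driver name, physical base, and size are placeholders, not taken from the question). Leaving vma->vm_page_prot untouched gives a normal cacheable mapping, whereas pgprot_noncached()/pgprot_writecombine() would make it uncached or write-combined:

/* Hypothetical char-device mmap(): map the pre-reserved DMA buffer into the
 * calling process. Not touching vm_page_prot keeps the mapping cacheable. */
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/module.h>

#define RESERVED_PHYS 0x38000000UL   /* placeholder: physical base of the reserved buffer */
#define RESERVED_SIZE (64UL << 20)   /* placeholder: 64 MB */

static int mydrv_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long len = vma->vm_end - vma->vm_start;

    if (len > RESERVED_SIZE)
        return -EINVAL;

    /* For an uncached mapping you would do:
     *   vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
     * Here we leave vm_page_prot alone so the pages are mapped cacheable. */
    return remap_pfn_range(vma, vma->vm_start,
                           RESERVED_PHYS >> PAGE_SHIFT,
                           len, vma->vm_page_prot);
}

static const struct file_operations mydrv_fops = {
    .owner = THIS_MODULE,
    .mmap  = mydrv_mmap,
};

With a cacheable mapping you then own the cache-maintenance problem around DMA transfers, which is exactly the third item on your wish list.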
We use two device drivers for this: portalmem and zynqportal. In the Connectal Project, we call the connection between user space software and FPGA logic a "portal". These drivers require dma-buf, which has been stable for us since Linux kernel version 3.8.x.
The portalmem driver provides an ioctl to allocate a reference-counted chunk of memory and returns a file descriptor associated with that memory. This driver implements dma-buf sharing. It also implements mmap() so that user-space applications can access the memory.
At allocation time, the application may choose cached or uncached mapping of the memory. On x86, the mapping is always cached. Our implementation of mmap() currently starts at line 173 of the portalmem driver. If the mapping is uncached, it modifies vma->vm_page_prot using pgprot_writecombine(), enabling buffering of writes but disabling caching.
The portalmem driver also provides an ioctl to invalidate and optionally write back data cache lines.
The portalmem driver has no knowledge of the FPGA. For that, we use the zynqportal driver, which provides an ioctl for transferring a translation table to the FPGA, so that we can use logically contiguous addresses on the FPGA and translate them to the actual DMA addresses. The allocation scheme used by portalmem is designed to produce compact translation tables.
We use the same portalmem driver with pcieportal for PCI Express attached FPGAs, with no change to the user software.
The Zynq has NEON instructions, and an assembly-code implementation of memcpy using NEON instructions, with accesses aligned on a cache-line boundary (32 bytes), will achieve rates of 300 MB/s or higher.
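For illustration only (the answer above refers to a hand-written assembly version; this is just a hedged sketch using NEON intrinsics, copying one 32-byte cache line per iteration):

/* Sketch: memcpy variant built on NEON intrinsics, 32 bytes per iteration.
 * Assumes an ARM core with NEON (the Zynq's Cortex-A9 has it). */
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

void neon_memcpy32(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;

    for (; i + 32 <= n; i += 32) {          /* one 32-byte cache line at a time */
        uint8x16_t lo = vld1q_u8(src + i);
        uint8x16_t hi = vld1q_u8(src + i + 16);
        vst1q_u8(dst + i, lo);
        vst1q_u8(dst + i + 16, hi);
    }
    for (; i < n; i++)                      /* copy any remaining tail bytes */
        dst[i] = src[i];
}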
I struggled with this for some time with udmabuf and discovered the answer was as simple as adding dma-coherent; to its entry in the device tree. I saw a dramatic speedup in access time from this simple step, though I still need to add code to invalidate/flush whenever I transfer ownership from/to the device.

Does the last GB in the address space of a linux process map to the same physical memory?

I read that the first 3 GB are reserved for the process and the last 1 GB is for the kernel. I also read that the kernel is loaded starting from the 2nd MB of the physical address space (depending on the configuration). My question is: is the mapping of that last 1 GB the same for all processes, and does it map to this physical area of memory?
Another question: when a process switches to kernel mode (e.g. when a syscall occurs), which page tables are used, the process page tables or the kernel page tables? If kernel page tables are used, then they can't access the memory locations belonging to the process. If that is the case, then there is apparently no use for separate kernel page tables, since all access to kernel code and data will go through the mapping in the last 1 GB of the process address space. Please help me clarify this (any useful links will be much appreciated).
It seems, you are talking about 32-bit x86 systems, right?
If I am not mistaken, the kernel can be configured not only for a 3GB/1GB memory split; there are other variants as well (e.g. 2GB/2GB). Still, 3GB/1GB is probably the most common one on x86-32.
The kernel part of the address space should be inaccessible from user space. From the kernel's point of view, yes, the mapping of the memory occupied by the kernel itself is always the same, no matter in the context of which process (or interrupt handler, or whatever else) the kernel currently operates.
As one of the consequences, if you look at the addresses of kernel symbols in /proc/kallsyms from different processes, you will see the same addresses each time. And these are exactly the addresses of the respective kernel functions, variables and others from the kernel's point of view.
So I suppose the answer to your first question is "yes", but it is probably not very useful for user-space code, as kernel-space memory is not directly accessible from there anyway.
As for the second question: if the kernel currently operates in the context of some process, it can actually access the user-space memory of that process. I can't describe it in detail, but the implementation of the kernel functions copy_from_user and copy_to_user could give you some hints. See arch/x86/lib/usercopy_32.c and arch/x86/include/asm/uaccess.h in the kernel sources. It seems that on x86-32 the user-space memory is accessed in these functions directly, using the default memory mappings for the current process context. The 'magic' stuff there is only related to optimizations and to checking the address of the memory area for correctness.
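To make that concrete, a hedged sketch (not taken from those files) of a driver's write() handler running in process context; it can reach the calling process's buffer simply because the kernel is running on that process's page tables:

/* Sketch: a write() file operation copying data from the caller's user-space
 * buffer into a kernel buffer via copy_from_user(). */
#include <linux/fs.h>
#include <linux/uaccess.h>

static char scratch[256];

static ssize_t demo_write(struct file *filp, const char __user *buf,
                          size_t count, loff_t *ppos)
{
    if (count > sizeof(scratch))
        count = sizeof(scratch);

    /* copy_from_user() returns the number of bytes it could NOT copy */
    if (copy_from_user(scratch, buf, count))
        return -EFAULT;

    return count;
}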
Yes, the mapping of the kernel part of the address space is the same in all processes. Part of it does map that part of the physical memory where the kernel image is loaded, but that's not the bulk of it - the remainder is used to map other physical memory locations for the kernel's runtime working set.
When a process switches to kernel mode, the page tables are not changed. The kernel part of the address space simply becomes accessible because the CPL (Current Privilege Level) is now zero.

What are the different areas of Memory & Disk?

I'm not sure whether this is the right place to ask, nor how best to put my query.
Let me put it this way:
Main memory runs from 0x00000 to 0xFFFFF.
Disk space runs from 0x00000000 to 0xFFFFFFFF.
But what we're able to access won't be everything from the 0th byte to the last byte, right?
On a hard disk, I guess at the 0th byte we have the MBR, and somewhere we have the filesystem (which is all we can access). What else?
Similarly with main memory: we have some kernel memory and user memory (in which the processes live). What else?
My question is: what are all the regions from the 0th byte to the last byte? I don't know what to search for or where to find such information. If anyone can post some links, that would be great.
EDIT:
I'm using x86 32-bit on Windows. Actually, I was reading a book on computer security where the author mentions that malware can live either on disk or in memory (which is very true). But when we say a computer is infected, that doesn't mean only files (which are part of the filesystem) are infected. There are other areas which are not meant for the user, like the MBR or kernel memory.
So the question popped into my mind: what are all such areas that I may not be aware of?
Apart from the fact that the answer to this question is highly dependent on the OS, disk space is not at all part of main memory. On Intel architectures, disk access takes some I/O address space (which is separate from the memory address space) per channel, and the exact number of words depends on the channel: IDE/ATA/SATA/SCSI. On other architectures that are memory-mapped, like the PowerPC, disk access does take some memory address space, but still not much.
To illustrate (and be warned that this is a very simplified example, not the real world), assume a memory-mapped CPU* like the PowerPC trying to access a disk with LBA addressing. The disk really only needs 2 to 3 words of address space to expose multiple gigabytes of data. That is, we only need 12 bytes of address space to store and retrieve gigabytes of data (a small sketch follows after this list):
2 words (8 bytes) to tell the disk where to seek to, that is, at what address we want to read from or write to.
1 word (4 bytes) to actually do the read and write. Every time you read from this address, the 2-word pointer automagically increments by 1 byte (or by 4 if you read 32 bits at a time).
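A very rough sketch of this simplified interface in C (the register addresses are invented purely for illustration):

/* Simplified, invented memory-mapped disk interface matching the description above. */
#include <stdint.h>

#define DISK_LBA_LO ((volatile uint32_t *)0xF0000000)  /* low word of the seek address */
#define DISK_LBA_HI ((volatile uint32_t *)0xF0000004)  /* high word of the seek address */
#define DISK_DATA   ((volatile uint32_t *)0xF0000008)  /* data port; auto-increments the pointer */

static uint32_t disk_read_word(uint64_t byte_addr)
{
    *DISK_LBA_LO = (uint32_t)byte_addr;          /* tell the disk where to seek */
    *DISK_LBA_HI = (uint32_t)(byte_addr >> 32);
    return *DISK_DATA;                           /* each read advances the pointer */
}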
But the above is an abstracted view of what really happens. Most disk controllers have several more registers to control power management, disk spin speed, entering and exiting sleep modes, etc.
And what are the addresses of these memory locations? Well, it depends on what I/O channel you're talking about. The old-school ISA bus depends on the user setting jumpers on cards to set the addresses, so for those you need to ask the user. The PCI bus auto-negotiates the addresses with the disk controllers at boot time and then, depending on the architecture, either tells your BIOS what devices exist, passes them as parameters to the bootloader, or stores them in some temporary registers on the system bus. USB works like PCI but negotiates with the OS instead of the BIOS... etc.
As you can see, there is no simple answer to this even if you limit it to specific cases like Windows 7 running on a 64-bit AMD CPU on a Dell motherboard.
*note: since you're worried about memory locations.
Your question is complex and hard to answer without knowing what view of memory you have in mind.
Pretending we're in ring 0 with directly mapped memory, a PC-compatible has multiple memory regions: lower memory, BIOS-mapped code, I/O ports, video memory, etc. They all live in the same address space. You communicate with peripherals by reading and writing specific memory addresses (which are mapped to those components). These addresses are set up by the hardware in question and the drivers in use.
Once we enter user mode, you have to deal with virtual memory. Addresses are virtual, and may or may not map to any particular part of physical memory. I'd suggest reading up on virtual memory.
