While I was going through the LDD3 book, in chapter 15, (memory mapping and dma), the introduction of mmap call says:
mmap() system call allows mapping of device memory directly into user process address space.
My confusion is about the address space. Why would device memory be mapped into the user process's address space when the kernel alone takes care of the device? Why would one need to map it in user space at all? If the device memory is mapped in user space, why does the kernel still manage it? And couldn't the device be used erroneously if its memory lies in the user address space?
Please correct me if I am wrong, I am just new to it.
Thanks
The same chapter you are referring to has the answer to your question.
A definitive example of mmap usage can be seen by looking at a subset of the virtual memory areas for the X Window System server. Whenever the program reads or writes in the assigned address range, it is actually accessing the device. In the X server example, using mmap allows quick and easy access to the video card’s memory. For a performance-critical application like this, direct access makes a large difference.
...
Another typical example is a program controlling a PCI device. Most PCI peripherals map their control registers to a memory address, and a high-performance application might prefer to have direct access to the registers instead of repeatedly having to call ioctl to get its work done.
But you are correct that usually kernel drivers handle devices without revealing device memory to user space:
As you might suspect, not every device lends itself to the mmap abstraction; it makes no sense, for instance, for serial ports and other stream-oriented devices. Another limitation of mmap is that mapping is PAGE_SIZE grained.
In the end, it all depends on how you want your device to be used from user space:
which interfaces you want to provide from the driver to user space
what the performance requirements are
Usually you hide device memory from the user, but sometimes it's necessary to give the user direct access to device memory (when the alternative is bad performance or an ugly interface). Only you, as an engineer, can decide which way is best in each particular case.
There are a few usages I can think of (a minimal user-space sketch follows the list):
user-mode drivers - in this case the kernel driver acts only as a stub: mapping memory to user space, passing interrupts, etc. (this is common for proprietary drivers).
a user-space application filling or reading DMA buffers directly, to avoid copying them between user space and kernel space.
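Here is a minimal user-space sketch of what such direct access looks like. The device node /dev/mydev, the mapping size, and the register indices are all hypothetical; the point is only that once the driver implements mmap(), plain loads and stores from user space reach the device with no ioctl round trips.

/* Hypothetical sketch: map a char device whose driver implements mmap()
 * and touch its registers directly from user space. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/mydev", O_RDWR | O_SYNC);   /* hypothetical device node */
    if (fd < 0) { perror("open"); return 1; }

    /* Map one page of device memory into this process's address space. */
    void *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }
    volatile uint32_t *regs = map;

    /* Reads and writes now go straight to the device, no ioctl needed. */
    printf("register 0: 0x%08x\n", (unsigned)regs[0]);
    regs[1] = 0x1;                                  /* hypothetical "start" register */

    munmap(map, 4096);
    close(fd);
    return 0;
}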
Regards,
Mateusz.
I gather that the main ways of the CPU addressing devices are "port" and "memory" mapped.
In both of these:
How are devices dynamically assigned an address - who assigns it and how?
How does the CPU then know that a device exists, that it has been assigned an address, and what that address is, and in particular how do its running programs know? (How does this work both while the computer is already on and when it is powered off and booted again?)
How do interrupts work with these devices?
What's the distinction between what the OS and the hardware does?
Is it fair to say that Memory Mapped is the dominant approach in modern systems?
I realise this might be a lot in one go, but thanks in advance!
In general, the CPU does not know that a specific address is memory-mapped.
It's software's responsibility (mainly the BIOS and drivers) to define the address range as uncacheable (so each read/write goes through to the device and is not held internally until write-back). Outside the core there is some mapping that redirects specific addresses to a device rather than to the DDR (memory).
Short answers to some of your bullet points (I'm not sure I understand all the questions):
How are devices dynamically assigned an address - who assigns it and how?
Mainly, the BIOS defines such ranges (the driver reports a new device to the BIOS; the BIOS reserves some addresses for plug-and-play devices).
How does the CPU then know that a device exists, that it has been assigned an address, and what that address is, and in particular how do its running programs know? (How does this work both while the computer is already on and when it is powered off and booted again?)
The CPU doesn't know that; these addresses are treated as normal uncacheable addresses.
Is it fair to say that Memory Mapped is the dominant approach in modern systems?
Yes, it's easier to treat the device as just another place in memory (it's also a bit faster).
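To make the memory-mapped case concrete, here is a hedged sketch of how a Linux driver typically talks to such a device from the kernel side. The base address and register offset are placeholders rather than values for any real device; the pattern itself (ioremap plus readl/writel) is the standard one.

/* Hedged sketch: accessing a memory-mapped device from a Linux driver.
 * DEV_PHYS_BASE and DEV_REG_CTRL are hypothetical placeholders. */
#include <linux/errno.h>
#include <linux/io.h>

#define DEV_PHYS_BASE 0xfd000000UL   /* hypothetical MMIO base from a BAR */
#define DEV_REG_CTRL  0x04           /* hypothetical control register offset */

static void __iomem *base;

static int mydev_setup(void)
{
    /* ioremap creates an uncacheable kernel virtual mapping of the
     * device's physical range, as described above. */
    base = ioremap(DEV_PHYS_BASE, 0x1000);
    if (!base)
        return -ENOMEM;

    /* readl/writel reach the device, not DRAM. */
    writel(0x1, base + DEV_REG_CTRL);
    return readl(base + DEV_REG_CTRL) ? 0 : -EIO;
}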
The code I'm referring to is here.
When I create a memory mapping for a PCI device, I am always getting the same value for getPhysicalAddress and getVirtualAddress:
e.g.
pciDevice = OSDynamicCast(IOPCIDevice, provider);
deviceMap = pciDevice->mapDeviceMemoryWithRegister(kIOPCIConfigBaseAddress0);
deviceRegisters = (struct oxygen *) deviceMap->getVirtualAddress();
pciDevice->setMemoryEnable(true);
pciDevice->setBusMasterEnable(true);
deviceMap->getPhysicalAddress();
Now, actually, I'm not too surprised by this, because I think this is the point of "DMA".
If we have some kind of mapping in the driver, then one address is all we need.
That is, the physical address is the virtual address, as it's the sole spot we need for the "memory to memory" transfer (CPU data store to PCI sound card).
Is this understanding correct?
Now for the main issue: I am experiencing kernel panics that are caused by any access to or assignment of deviceRegisters' members, such as:
kprintf("Xonar Vendor ID:0x%04x, Device ID:0x%04x, SubDevice ID:0x%04x, Physical Address:%lu\n",
vendor_id, dev_id, subdev_id, deviceRegisters->addr);
Now that tells me I am doing something wrong in terms of allocation, since accessing members of this structure should not cause a panic.
However, if you look at Listing 3-2 here: https://developer.apple.com/library/archive/documentation/DeviceDrivers/Conceptual/WritingAudioDrivers/ImplementDriver/ImplementDriver.html#//apple_ref/doc/uid/TP30000732-DontLinkElementID_15
that is exactly how it is supposed to be done.
A wise man (pmj) suggested I must use the ioRead/ioWrite functions to assign/access these values, but this does not really jibe with the (admittedly old) skeleton code provided by Apple. What could cause access issues to this memory mapping? Surely having to do pointer arithmetic to assign/read values, while probably correct, is not the purpose of this design?
When I create a memory mapping for a PCI device, I am always getting the same value for getPhysicalAddress and getVirtualAddress: e.g.
Are the values by any chance in the range 0x0..0xffff?
I very strongly suspect this is a port-mapped I/O range in your PCI device, not a memory-mapped range.
The way to check for this in your code is:
if (0 != (kIOPCIIOSpace & pciDevice->configRead32(kIOPCIConfigBaseAddress0)))
{
// port mapped range
}
else
{
// memory mapped range
}
See also: https://stackoverflow.com/a/44352611/48660
Now, actually, I'm not too surprised by this, because I think this is the point of "DMA".
No, port-mapped I/O is pretty much the opposite of DMA. You can certainly use port-mapped I/O to initiate a DMA transfer if that's how your device happens to operate, so perhaps it'd be better phrased as being orthogonal to DMA.
DMA is about devices directly accessing system memory. PCI BARs are about the CPU accessing device registers or memory.
If we have some kind of mapping in the driver, then one address is all we need.
That is, the physical address is the virtual address, as it's the sole spot we need for the "memory to memory" transfer (CPU data store to PCI sound card).
Is this understanding correct?
No, at least on x86, the I/O port address space is completely separate from the physical memory address space, and therefore also can't be mapped into virtual address space, as the MMU translates between virtual and physical memory spaces. On x86, there are special machine instructions, in and out, for reading and writing from I/O ports. On most architectures (for OS X notably PPC, but I think it's the case for ARM too), there is some form of memory mapping going on, however. I don't know how it works in detail on those architectures, but for the purposes of this question, you don't really need to care:
The architecture-independent way of performing I/O on a port-mapped range in a macOS kext is to use the ioRead* and ioWrite* methods on IOPCIDevice, where * can be 8, 16, or 32 for the 3 different possible I/O word sizes allowed by the PCI standard.
Now for the main issue: I am experiencing kernel panics that are caused by any access to or assignment of deviceRegisters' members, such as:
Assuming you are in fact dealing with a port-mapped I/O range in your device, then this explains your kernel panics. Use pciDevice->ioRead16(register_offset, deviceMap) or similar.
A wise man (pmj) suggested I must use the ioRead/ioWrite functions to assign/access these values, but this does not really jibe with the (admittedly old) skeleton code provided by Apple.
The document you linked to assumes the device's BAR is referring to a memory mapped range, not a port-mapped I/O range.
Pretty sure I already know the answer to this question since there are related questions on SO already (here, here, and here, and this was useful), but I wanted to be absolutely sure before I dive into kernel-space driver land (never been there before).
I have a PCIe device that I need to communicate with (and vice versa) from an app in linux user space. By opening /dev/mem, then mmap'ing, I have been able to write a user-space driver built on top of pciutils that has allowed me to mmap the BARs and successfully write data to the device. Now, we need comm to go the other direction, from the PCIe device to the linux user app. In order for this to work, we believe we are going to need a large chunk (~100MB) of physically contiguous memory that never gets paged/swapped. Once allocated, that address will need to be passed to the PCIe device so it knows where to write its data (thus I don't see how this could be virtual, swappable memory). Is there any way to do this without a kernel-space driver? One idea was floated here: perhaps we can open /dev/mem and then feed it an ioctl command to allocate what we need? If this is possible, I haven't been able to find any examples online yet and will need to research it more heavily.
Assuming we need a kernel-space driver, it will be best to allocate our large chunk during bootup, then use ioremap to get a kernel virtual address, then mmap from there to user space, correct? From what I've read on kmalloc, we won't get anywhere close to 100MB using that call, and vmalloc is no good since that's virtual memory. In order to allocate at bootup, the driver should be statically linked into the kernel, correct? This is basically an embedded application, so portability is not a huge concern to me. A module rather than a statically-linked driver could probably work, but my worry there is that memory fragmentation could prevent a physically contiguous region from being found, so I'd like to allocate it as soon as possible from power-on. Any feedback?
EDIT1: My CPU is an ARM7 architecture.
Hugepages-1G
Current x86_64 processors support not only 4k and 2M pages, but also 1G pages (the pdpe1gb flag in /proc/cpuinfo indicates support).
These 1G-pages must already be reserved at kernel boot, so the boot-parameters hugepagesz=1GB hugepages=1 must be specified.
Then, the hugetlbfs must be mounted:
mkdir /hugetlb-1G
mount -t hugetlbfs -o pagesize=1G none /hugetlb-1G
Then open some file and mmap it:
#define SIZE_1G (1024UL * 1024 * 1024)   /* one 1G hugepage */
fd = open("/hugetlb-1G/page-1", O_CREAT | O_RDWR, 0755);
addr = mmap(NULL, SIZE_1G, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
You can now access 1G of physically contiguous memory at addr. To be sure it doesn't get swapped out you can use mlock (but this is probably not even necessary at all for hugepages).
Even if your process crashes, the huge page will remain reserved for mapping as above, so the PCIe device will not write rogue data into system or process memory.
You can find out the physical address by reading /proc/pid/pagemap.
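If it helps, here is a hedged C sketch of that lookup via /proc/self/pagemap (one 64-bit entry per virtual page; bit 63 means "present" and bits 0-54 hold the page frame number; on recent kernels you need root or CAP_SYS_ADMIN to see non-zero PFNs):

/* Hedged sketch: translate a virtual address of the current process into
 * a physical address using /proc/self/pagemap. Returns 0 on failure. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

static uint64_t virt_to_phys(const void *virt)
{
    long page_size = sysconf(_SC_PAGESIZE);
    uint64_t entry = 0;
    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return 0;

    /* One 64-bit entry per page; read the entry for this address. */
    off_t offset = ((uintptr_t)virt / page_size) * sizeof(entry);
    if (pread(fd, &entry, sizeof(entry), offset) != sizeof(entry)) {
        close(fd);
        return 0;
    }
    close(fd);

    if (!(entry & (1ULL << 63)))                 /* page not present */
        return 0;

    uint64_t pfn = entry & ((1ULL << 55) - 1);   /* bits 0-54: PFN */
    return pfn * page_size + ((uintptr_t)virt % page_size);
}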
Actually, Ctx's comment about memmap is what got me down the right path. To reserve memory, I gave a boot argument of memmap=[size]$[location], which I found here. Different symbols mean different things, and they aren't exactly intuitive. Just another slight correction: the flag is CONFIG_STRICT_DEVMEM, which my kernel was not compiled with.
There are still some mysteries. For instance, the [location] in the memmap argument seemed to be meaningless. No matter what I set for the location, linux took all that was not reserved with [size] in one contiguous chunk, and the space that I reserved was at the end. The only indication of this was looking at /proc/iomem. The amount of space I reserved matched the gap between the end of linux memory space and the end of system memory space. I could find no indication anywhere that linux said "I see your reserved chunk and I won't touch it" other than that it wasn't taken by linux in /proc/iomem. But the FPGA has been writing to this space for days now with no visible ill effects for linux, so I guess we're all good! I can just mmap to that location and read the data (surprised this works since linux doesn't indicate this exists, but glad it does). Thanks for the help! Ian, I'll come back to your comment if I go to kernel driver space.
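For reference, a hedged sketch of the user-space side described here, i.e. mmap'ing the reserved region through /dev/mem. The base address and size are placeholders for whatever your memmap= argument actually reserved, and this only works because the kernel was built without CONFIG_STRICT_DEVMEM.

/* Hedged sketch: map a physical region reserved with memmap=[size]$[location]
 * into user space via /dev/mem. RESERVED_BASE/RESERVED_SIZE are placeholders. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define RESERVED_BASE 0x3f000000UL          /* hypothetical */
#define RESERVED_SIZE (64UL * 1024 * 1024)  /* hypothetical */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    uint8_t *buf = mmap(NULL, RESERVED_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, RESERVED_BASE);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* The FPGA writes into this physical range; the CPU reads it here. */
    printf("first byte of reserved region: 0x%02x\n", buf[0]);

    munmap(buf, RESERVED_SIZE);
    close(fd);
    return 0;
}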
I've got a Xilinx Zynq 7000-based board with a peripheral in the FPGA fabric that has DMA capability (on an AXI bus). We've developed a circuit and are running Linux on the ARM cores. We're having performance problems accessing a DMA buffer from user space after it's been filled by hardware.
Summary:
We have pre-reserved at boot time a section of DRAM for use as a large DMA buffer. We're apparently using the wrong APIs to map this buffer, because it appears to be uncached, and the access speed is terrible.
Even using it as a bounce buffer is untenably slow. IIUC, ARM caches are not DMA-coherent, so I would really appreciate some insight on how to do the following:
Map a region of DRAM into the kernel virtual address space but ensure that it is cacheable.
Ensure that mapping it into userspace doesn't also have an undesirable effect, even if that requires we provide an mmap call by our own driver.
Explicitly invalidate a region of physical memory from the cache hierarchy before doing a DMA, to ensure coherency.
More info:
I've been trying to research this thoroughly before asking. Unfortunately, this being an ARM SoC/FPGA, there's very little information available on this, so I have to ask the experts directly.
Since this is an SoC, a lot of stuff is hard-coded for u-boot. For instance, the kernel and a ramdisk are loaded to specific places in DRAM before handing control over to the kernel. We've taken advantage of this to reserve a 64MB section of DRAM for a DMA buffer (it does need to be that big, which is why we pre-reserve it). There isn't any worry about conflicting memory types or the kernel stomping on this memory, because the boot parameters tell the kernel what region of DRAM it has control over.
Initially, we tried to map this physical address range into kernel space using ioremap, but that appears to mark the region uncacheable, and the access speed is horrible, even if we try to use memcpy to make it a bounce buffer. We use /dev/mem to map this also into userspace, and I've timed memcpy as being around 70MB/sec.
Based on a fair amount of searching on this topic, it appears that although half the people out there want to use ioremap like this (which is probably where we got the idea from), ioremap is not supposed to be used for this purpose and that there are DMA-related APIs that should be used instead. Unfortunately, it appears that DMA buffer allocation is totally dynamic, and I haven't figured out how to tell it, "here's a physical address already allocated -- use that."
One document I looked at is this one, but it's way too x86 and PC-centric:
https://www.kernel.org/doc/Documentation/DMA-API-HOWTO.txt
And this question also comes up at the top of my searches, but there's no real answer:
get the physical address of a buffer under Linux
Looking at the standard calls, dma_set_mask_and_coherent and family won't take a predefined address, and they want a device structure for PCI. I don't have such a structure, because this is an ARM SoC without PCI. I could manually populate such a structure, but that smells to me like abusing the API, not using it as intended.
BTW: This is a ring buffer, where we DMA data blocks into different offsets, but we align to cache line boundaries, so there is no risk of false sharing.
Thank you a million for any help you can provide!
UPDATE: It appears that there's no such thing as a cacheable DMA buffer on ARM if you do it the normal way. Maybe if I don't make the ioremap call, the region won't be marked as uncacheable, but then I have to figure out how to do cache management on ARM, which I can't figure out. One of the problems is that memcpy in userspace appears to really suck. Is there a memcpy implementation that's optimized for uncached memory I can use? Maybe I could write one. I have to figure out if this processor has Neon.
Have you tried implementing your own char device with an mmap() method remapping your buffer as cacheable (by means of remap_pfn_range())?
I believe you need a driver that implements mmap() if you want the mapping to be cached.
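A hedged sketch of that suggestion, assuming a physically contiguous buffer reserved at boot (BUF_PHYS and BUF_SIZE are placeholders): the key point is to call remap_pfn_range() without touching vma->vm_page_prot, so the user-space mapping stays cacheable.

/* Hedged sketch: char-device mmap() handler that maps a pre-reserved
 * physical buffer into user space with the default (cacheable) protection. */
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/module.h>

#define BUF_PHYS 0x18000000UL          /* hypothetical reserved base */
#define BUF_SIZE (64UL * 1024 * 1024)  /* hypothetical reserved size */

static int mybuf_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long len = vma->vm_end - vma->vm_start;

    if (len > BUF_SIZE)
        return -EINVAL;

    /* Deliberately NOT calling pgprot_noncached()/pgprot_writecombine():
     * leaving vm_page_prot alone gives a normal cacheable mapping. */
    return remap_pfn_range(vma, vma->vm_start, BUF_PHYS >> PAGE_SHIFT,
                           len, vma->vm_page_prot);
}

static const struct file_operations mybuf_fops = {
    .owner = THIS_MODULE,
    .mmap  = mybuf_mmap,
};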
We use two device drivers for this: portalmem and zynqportal. In the Connectal Project, we call the connection between user space software and FPGA logic a "portal". These drivers require dma-buf, which has been stable for us since Linux kernel version 3.8.x.
The portalmem driver provides an ioctl to allocate a reference-counted chunk of memory and returns a file descriptor associated with that memory. This driver implements dma-buf sharing. It also implements mmap() so that user-space applications can access the memory.
At allocation time, the application may choose cached or uncached mapping of the memory. On x86, the mapping is always cached. Our implementation of mmap() currently starts at line 173 of the portalmem driver. If the mapping is uncached, it modifies vma->vm_page_prot using pgprot_writecombine(), enabling buffering of writes but disabling caching.
The portalmem driver also provides an ioctl to invalidate and optionally write back data cache lines.
The portalmem driver has no knowledge of the FPGA. For that, we use the zynqportal driver, which provides an ioctl for transferring a translation table to the FPGA so that we can use logically contiguous addresses on the FPGA and translate them to the actual DMA addresses. The allocation scheme used by portalmem is designed to produce compact translation tables.
We use the same portalmem driver with pcieportal for PCI Express attached FPGAs, with no change to the user software.
The Zynq has NEON instructions, and an assembly-code implementation of memcpy using NEON instructions, with accesses aligned on cache-line boundaries (32 bytes), will achieve rates of 300 MB/s or higher.
I struggled with this for some time with udmabuf and discovered the answer was as simple as adding dma-coherent; to its entry in the device tree. I saw a dramatic speedup in access time from this simple step - though I still need to add code to invalidate/flush whenever I transfer ownership from/to the device.
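For the non-coherent case, the invalidate/flush on ownership hand-off mentioned above is usually done with the streaming DMA API rather than by hand. A hedged kernel-side sketch, assuming the buffer was mapped with dma_map_single() and the device writes into it (DMA_FROM_DEVICE); dev, dma_handle, and the function names are placeholders:

/* Hedged sketch: cache maintenance around a device-to-memory transfer
 * using the streaming DMA API. */
#include <linux/dma-mapping.h>

static dma_addr_t dma_handle;   /* from a previous dma_map_single() */

static void hand_buffer_to_device(struct device *dev, size_t size)
{
    /* Device takes ownership: make sure no stale/dirty CPU cache lines
     * cover the buffer before the hardware writes into it. */
    dma_sync_single_for_device(dev, dma_handle, size, DMA_FROM_DEVICE);
    /* ... kick off the FPGA DMA here ... */
}

static void take_buffer_from_device(struct device *dev, size_t size)
{
    /* CPU takes ownership back: invalidate so the CPU sees what the
     * hardware wrote rather than old cached data. */
    dma_sync_single_for_cpu(dev, dma_handle, size, DMA_FROM_DEVICE);
    /* ... now it is safe to read the buffer from the CPU ... */
}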
The more I read about low-level languages like C, pointers, and memory management, the more I wonder about the current state of the art with modern operating systems and memory protection. For example, what kind of checks are in place to prevent some rogue program from randomly trying to read as much address space as it can and disregarding the rules set in place by the operating system?
In general terms, how do these memory protection schemes work? What are their strengths and weaknesses? To put it another way, are there things that simply cannot be done anymore when running a compiled program on a modern OS, even if you have C and your own compiler with whatever tweaks you want?
The protection is enforced by the hardware (i.e., by the CPU). Applications can only express addresses as virtual addresses and the CPU resolves the mapping of virtual address to physical address using lookaside buffers. Whenever the CPU needs to resolve an unknown address it generates a 'page fault' which interrupts the current running application and switches control to the operating system. The operating system is responsible for looking up its internal structures (page tables) and find a mapping between the virtual address touched by the application and the actual physical address. Once the mapping is found the CPU can resume the application.
The CPU instructions needed to load a mapping between a physical address and a virtual one are protected and as such can only be executed by a protected component (ie. the OS kernel).
Overall the scheme works because:
applications cannot address physical memory
resolving mapping from virtual to physical requires protected operations
only the OS kernel is allowed to execute protected operations
The scheme fails though if a rogue module is loaded in the kernel, because at that protection level it can read and write into any physical address.
Applications can read and write other processes' memory, but only by asking the kernel to perform the operation for them (e.g. ReadProcessMemory in Win32), and such APIs are protected by access control (certain privileges are required of the caller).
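A tiny illustration of the mechanism: touching an address that is not mapped into the process raises a page fault, the kernel finds no mapping for it, and the process receives a fatal signal (SIGSEGV on Linux, an access violation on Windows).

/* Demonstrates that a wild pointer access is stopped by the MMU/kernel. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    volatile int *p = (int *)(uintptr_t)0xdeadbeef;  /* almost certainly unmapped */

    printf("about to dereference %p\n", (void *)p);
    *p = 42;    /* page fault -> kernel -> SIGSEGV -> process killed */

    printf("never reached\n");
    return 0;
}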
Memory protection is enforced in hardware, typically with a minimum granularity on the order of KBs.
From the Wikipedia article about memory protection:
In paging, the memory address space is divided into equal, small pieces, called pages. Using a virtual memory mechanism, each page can be made to reside in any location of the physical memory, or be flagged as being protected. Virtual memory makes it possible to have a linear virtual memory address space and to use it to access blocks fragmented over physical memory address space.
Most computer architectures based on pages, most notably x86 architecture, also use pages for memory protection. A page table is used for mapping virtual memory to physical memory. The page table is usually invisible to the process. Page tables make it easier to allocate new memory, as each new page can be allocated from anywhere in physical memory.
By such design, it is impossible for an application to access a page that has not been explicitly allocated to it, simply because any memory address, even a completely random one, that application may decide to use, either points to an allocated page, or generates a page fault (PF) error. Unallocated pages simply do not have any addresses from the application point of view.
You should ask Google about Segmentation fault, Memory Violation Error and General Protection Fault. These are errors returned by various OSes in response to a program trying to access a memory address it shouldn't access.
And Windows Vista (or 7) randomizes the addresses at which DLLs are loaded, which means that a buffer overflow can take you to different addresses each time it occurs. This also makes buffer overflow attacks a little less repeatable.
So, to link together the answers posted with your question: a program that attempts to read any memory address that is not mapped in its address space will cause the processor to issue a page fault exception, transferring execution control to the operating system code (trusted code). The kernel then checks which address faulted; if there is no mapping for it in the current process's address space, it sends the SIGSEGV (segmentation fault) signal to the process, which typically kills the process (talking about Linux/Unix here). On Windows you get something along the same lines.
Note: you can take a look at mprotect() on Linux and POSIX operating systems. It allows you to protect pages of memory explicitly. Functions like malloc() return memory on pages with default protection, which you can then modify; this way you can mark areas of memory as read-only (but only in page-size chunks, typically 4KB).
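A short sketch of that note (one page allocated with mmap, so it is page-aligned as mprotect() requires): after the page is marked read-only, reading still works but the next write faults with SIGSEGV.

/* Demonstrates page-granular protection with mprotect(). */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);

    /* mmap returns page-aligned memory, which mprotect requires. */
    char *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    strcpy(buf, "hello");                 /* fine: the page is writable */

    if (mprotect(buf, page, PROT_READ)) { perror("mprotect"); return 1; }

    printf("%s\n", buf);                  /* reading is still allowed */
    buf[0] = 'H';                         /* write to a read-only page: SIGSEGV */

    return 0;
}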