How to get memory address from memfd_create? - shared-memory

In my application I need to share memory
between parent and child (using fork+execl).
I use memfd_create to allocate memory, because it provides a
file descriptor, which may be conveniently used in the child
process (the descriptor is tied to stdin via dup2 before execl)
to attach to the allocated memory.
I do not use write and read - I use pointers
to read and write memory directly.
The only piece of the puzzle left to solve
is how to get the address of the memory allocated
via fd = memfd_create ....
Using mmap is undesirable, because it duplicates the memory instead of giving the
address of the memory already allocated by memfd_create.
This is demonstrated by the following code.
In its output the two mmap addresses differ by 4096, which is the size of the memory referred to by fd:
0x7f98411c1000
0x7f98411c0000
whereas if mmap had given the direct address,
addresses in the output would be the same.
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    int fd = syscall(SYS_memfd_create, "shm", 0); /* raw syscall; glibc only added a wrapper in 2.27 */
    if (fd == -1) return 1;
    size_t size = 4096; /* minimal */
    if (ftruncate(fd, size) == -1) return 1; /* set the size of the memory file */
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptr == MAP_FAILED) return 1;
    void *ptr2 = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (ptr2 == MAP_FAILED) return 1;
    printf("%p\n%p\n", ptr, ptr2);
    return 0;
}
So, how to get direct address, avoiding memory duplication
by mmap?

Not sure if you still need an answer, since you have written the main point in your own answer, but just adding this to be sure it is complete:
memfd_create creates a memory-only file (meaning it is not stored on disk, although it can be swapped out). As you write in your answer, it returns a file descriptor.
mmap ensures that the file behind a file descriptor is in memory (which requires no action in case of a memory-only file), and gives you a pointer to that memory. It does not copy the memory (except from disk to memory when mapping a file from disk). If the same file is mapped multiple times, each call to mmap reserves a new region of virtual memory, but all those regions access the same portion of physical memory.
So the short answer to your question is that you misunderstand mmap; it does not copy the memory, and it is the perfect solution to your problem.
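A quick way to convince yourself is to write through one mapping and read through the other. A minimal sketch, reusing the setup from the question:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    int fd = syscall(SYS_memfd_create, "shm", 0);
    if (fd == -1) return 1;
    if (ftruncate(fd, 4096) == -1) return 1;

    char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED || b == MAP_FAILED) return 1;

    strcpy(a, "hello");  /* write through the first mapping */
    printf("%s\n", b);   /* prints "hello": both map the same physical memory */
    return 0;
}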

Instead of a physical address, we get a file descriptor from memfd_create.
The file descriptor is a handle through which we get access to the memory
(i.e., it is a mapping from file descriptor to memory). A file
descriptor is more convenient to work with than a memory address:
it can be passed to forked processes, etc. (see OP).
mmap just maps the physical memory (referred to by the file descriptor)
to a virtual address. A virtual address is in fact the memory
address, because user space never sees physical addresses.
Each mmap call returns a different virtual address, all of which
are mapped to the same physical memory by the kernel.
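A minimal sketch of the fork case (with execl the mapping itself does not survive the exec, but the descriptor does, so the child simply calls mmap on it again):

#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd = syscall(SYS_memfd_create, "shm", 0);
    if (fd == -1 || ftruncate(fd, 4096) == -1) return 1;

    int *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    if (fork() == 0) {   /* child inherits both the mapping and the descriptor */
        *p = 42;
        _exit(0);
    }
    wait(NULL);
    printf("%d\n", *p);  /* prints 42: both processes see the same memory */
    return 0;
}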

Related

mapping device memory and kernel allocated memory into the same vma

I'm working on a driver where ranges of device memory are mapped into the user space (via IOCTL) for the application to write to. It works:
vma->vm_flags |= VM_DONTCOPY;
vma->vm_flags |= VM_DONTEXPAND;
down_write(&current->mm->mmap_sem);
ret = vm_iomap_memory(vma, from, sz_required);
up_write(&current->mm->mmap_sem);
where from is a physical address obtained from pci_resource_start() with some offset added to it.
The application also needs to read from the device, so I increase the size of the region mmapped by the application by PAGE_SIZE, allocate a page with dma_alloc_coherent(), and try to insert it at the end of the vma, but that returns EBUSY. What am I doing wrong? I should be able to stitch together multiple physical ranges into a single vma, both real memory and device-mapped, or is that not supported?
In the new code a page is allocated like that, dma_addr is passed to the device so it knows where to write to:
dma = dma_alloc_coherent(&device, PAGE_SIZE, &dma_addr, GFP_KERNEL);
memset(dma, 0xfe, PAGE_SIZE);
set_memory_wb((unsigned long)dma, 1);
And the mapping code is changed to:
vma->vm_flags |= VM_DONTCOPY;
vma->vm_flags |= VM_DONTEXPAND;
vma->vm_flags |= VM_MIXEDMAP;
down_write(&current->mm->mmap_sem);
ret = vm_iomap_memory(vma, from, sz_required);
up_write(&current->mm->mmap_sem);
down_write(&current->mm->mmap_sem);
ret = vm_insert_page(vma, vma->vm_end - PAGE_SIZE, virt_to_page(dma));
up_write(&current->mm->mmap_sem);
The kernel is 4.15 on x86_64
Got it working by following the "hack" in Map multiple kernel buffer into contiguous userspace buffer?
Before vm_iomap_memory() I decrement vma->vm_end by PAGE_SIZE and restore the old value afterwards, as sketched below. Also, I switched from dma_alloc_coherent() to alloc_page() followed by dma_map_page().
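A hypothetical sketch of that workaround, under the assumptions of the question (from, sz_required, ret, and a dma_page obtained from alloc_page() are defined elsewhere):

vma->vm_flags |= VM_DONTCOPY | VM_DONTEXPAND | VM_MIXEDMAP;

down_write(&current->mm->mmap_sem);
vma->vm_end -= PAGE_SIZE;           /* temporarily hide the last page */
ret = vm_iomap_memory(vma, from, sz_required);
vma->vm_end += PAGE_SIZE;           /* restore the real size */
if (!ret)
    ret = vm_insert_page(vma, vma->vm_end - PAGE_SIZE, dma_page);
up_write(&current->mm->mmap_sem);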
Not a solution I'm satisfied with, though. There has to be a better way, perhaps a fault handler in vm_ops? Although that seems counter-productive, considering I know exactly what I will be mapping and where.
It appears to be working on x86_64 and aarch64

Allocate swappable memory in linux kernel

Memory in the Linux kernel is usually unswappable (Do Kernel pages get swapped out?). However, sometimes it is useful to allow memory to be swapped out. Is it possible to explicitly allocate swappable memory inside the Linux kernel? One method I thought of was to create a user space process and use its memory. Is there anything better?
You can create a file in the internal shm shared memory filesystem.
const char *name = "example";
loff_t size = PAGE_SIZE;
unsigned long flags = 0;
struct file *filp = shmem_file_setup(name, size, flags);
/* assert(!IS_ERR(filp)); */
The file isn't actually linked, so the name isn't visible. The flags may include VM_NORESERVE to skip accounting up-front, instead accounting as pages are allocated. Now you have a shmem file. You can map a page like so:
struct address_space *mapping = filp->f_mapping;
pgoff_t index = 0;
struct page *p = shmem_read_mapping_page(mapping, index);
/* assert(!IS_ERR(p)); */
void *data = page_to_virt(p);
memset(data, 0, PAGE_SIZE);
There is also shmem_read_mapping_page_gfp(..., gfp_t) to specify how the page is allocated. Don't forget to put the page back when you're done with it.
put_page(p);
Ditto with the file.
fput(filp);
The answer to your question is a simple no, or a yes that requires complex modifications to the kernel source.
First, to enable swapping out, ask yourself what happens when kswapd swaps pages out. Essentially, it walks through all the processes and decides whether their memory can be swapped out or not. All of that memory runs in hardware ring 3, so SMAP essentially forbids it from being read as data or executed as a program in the kernel (ring 0):
https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention
Check your distro's CONFIG_X86_SMAP; on my Ubuntu it defaults to "y", as it has for the past few years.
But if you keep your memory at a kernel address (ring 0), then you would have to change the kswapd operation to trigger swapout of kernel addresses. Which kernel addresses should be walked first? And what if the address is part of kswapd's own operation? The complexity involved is huge.
Next, consider the swap-in operation: when a memory read is attempted and the page's "not present" bit is set, a hardware exception triggers the Linux kernel's memory-fault handler (__do_page_fault()).
Looking into __do_page_fault:
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1477
and thereafter at how it handles kernel addresses (do_kern_addr_fault()):
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1174
it essentially just reports an error for every possible scenario. If you want to enable page faulting on kernel addresses, this path has to be modified.
Note too that the SMAP check (inside smap_violation) is done in the user-address page-fault path (do_user_addr_fault()).

mmap query on linux platform

On a Linux machine, I am trying to write a driver and map some kernel memory to the application for performance gains.
Checking driver implementations of mmap online, I found several varieties.
As per the man pages, mmap creates a new mapping in the virtual address space of the calling process.
1) Who allocates the physical address space during the mmap call? The kernel or the device driver?
I have seen the following varieties of driver mmap implementation.
a) The driver allocates physically contiguous kernel memory and maps it into the process address space.
static int driver_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long start = vma->vm_start;
    unsigned long size = vma->vm_end - vma->vm_start;
    char *pos = kmalloc(size, GFP_KERNEL); // allocate physically contiguous memory

    if (!pos)
        return -ENOMEM;
    while (size > 0) {
        unsigned long pfn = virt_to_phys(pos) >> PAGE_SHIFT; // get the page frame number
        if (remap_pfn_range(vma, start, pfn, PAGE_SIZE, PAGE_SHARED)) // create the mapping
            return -EAGAIN;
        start += PAGE_SIZE;
        pos += PAGE_SIZE;
        size -= PAGE_SIZE;
    }
    return 0;
}
b) The driver allocates virtually contiguous kernel memory and maps it into the process address space.
static struct vm_operations_struct dr_vm_ops = {
    .open = dr_vma_open,
    .close = dr_vma_close,
};

static int driver_mmap(struct file *filp, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;
    char *kp = vmalloc(size); // allocate virtually contiguous memory
    unsigned long up;
    int err;

    if (!kp)
        return -ENOMEM;
    for (up = vma->vm_start; up < vma->vm_end; up += PAGE_SIZE) {
        struct page *page = vmalloc_to_page(kp); // find the physical page behind the virtual address
        err = vm_insert_page(vma, up, page); // how is this different from remap_pfn_range?
        if (err)
            break;
        kp += PAGE_SIZE;
    }
    vma->vm_ops = &dr_vm_ops;
    dr_vma_open(vma);
    return 0;
}
c) Not sure who allocates the memory in this case.
static int driver_mmap(struct file *filp, struct vm_area_struct *vma)
{
    if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
                        vma->vm_end - vma->vm_start,
                        vma->vm_page_prot)) // create the mapping
        return -EAGAIN;
    return 0;
}
2) If the kernel allocates memory for mmap, isn't memory wasted in cases a and b?
3) remap_pfn_range maps multiple pages, whereas vm_insert_page maps just a single page. Is that the only difference between these two APIs?
Thank you,
Gopinath.
Which you use depends on what you're trying to accomplish.
(1) A device driver is part of the kernel so it doesn't really make sense to differentiate that way. For these cases, the device driver is asking for memory to be allocated for its own use from the (physical) memory resources available to the entire kernel.
With (a), a physically contiguous space is being allocated. You might do this if there is some piece of external hardware (a PCI device, for example) that will be reading or writing that memory. The return value from kmalloc already has a mapping to kernel virtual address space. remap_pfn_range is being used to map the page into the user virtual address space of the current process as well.
For (b), a virtually contiguous space is being allocated. If there is no external hardware involved, this is what you would typically use. There is still physical memory being allocated to your driver, but it isn't guaranteed that the pages are physically contiguous -- hence fewer constraints on which pages can be allocated. (They will still be contiguous in kernel virtual address space.) And then you are simply using a different API to implement the same kind of mapping into user virtual address space.
For (c), the memory being mapped is allocated under control of some other subsystem. The vm_pgoff field has already been set to the base physical address of the resource. For example, the memory might correspond to a PCI device's address region (a network interface controller's registers, say) where that physical address is determined/assigned by your BIOS (or whatever mechanism your machine uses).
(2) Not sure I understand this question. How can the memory be "wasted" if it's being used by the device driver and a cooperating user process? And if the kernel needs to read and write the memory, there must be kernel virtual address space allocated and it needs to be mapped to the underlying physical memory. Likewise, if the user space process is to access the memory, there must be user virtual address space allocated and that must be mapped to the physical memory as well.
"Allocating virtual address space" essentially just means allocating page table entries for the memory. That is done separately from actually allocating the physical memory. And it's done separately for kernel space and user space. And "mapping" means setting the page table entry (the virtual address of the beginning of the page) to point to the correct physical page address.
(3) Yes. They are different APIs that accomplish much the same thing. Sometimes you have a struct page, sometimes you have a pfn. It can be confusing: there are often several ways to accomplish the same thing. Developers typically use the one most obvious for the item they already have ("I already have a struct page. I could calculate its pfn. But why do that when there's this other API that accepts a struct page?").
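For example, mapping a single page for which you already hold a struct page could be written either way (a sketch; vma, addr, page, and err are assumed to exist):

/* with a struct page in hand: */
err = vm_insert_page(vma, addr, page);

/* or, going through the page frame number: */
err = remap_pfn_range(vma, addr, page_to_pfn(page),
                      PAGE_SIZE, vma->vm_page_prot);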

mwait x86 instruction doesn't wait for DMA

I'm trying to use monitor/mwait instructions to monitor DMA writes from a device to a memory location. In a kernel module (char device) I have the following code (very similar to this piece of kernel code) that runs in a kernel thread:
static int do_monitor(void *arg)
{
    struct page *p = arg; // p is a 'struct page *'; it's also remapped to user space
    uint32_t *location_p = phys_to_virt(page_to_phys(p));
    uint32_t prev = 0;
    int i = 0;

    while (i++ < 20) // to avoid an infinite loop
    {
        if (*location_p == prev)
        {
            __monitor(location_p, 0, 0);
            if (this_cpu_has(X86_FEATURE_CLFLUSH_MONITOR))
                clflush(location_p);
            if (*location_p == prev)
                __mwait(0, 0);
        }
        prev = *location_p;
        printk(KERN_NOTICE "%d", prev);
    }
    return 0;
}
In user space I have the following test code:
int fd = open("/dev/mon_test_dev", O_RDWR);
unsigned char *mapped = (unsigned char *)mmap(0, mmap_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
for (int i = 1; i <= 5; ++i)
*mapped = i;
munmap(mapped, mmap_size);
close(fd);
And the kernel log looks like this:
1
2
3
4
5
5
5
5
5
5
5 5 5 5 5 5 5 5 5 5
I.e. it seems that mwait doesn't wait at all.
What could be the reason?
The definition of MONITOR/MWAIT semantics does not explicitly specify whether DMA transactions may or may not trigger it. Triggering is supposed to happen for a logical processor's stores.
Current descriptions of MONITOR and MWAIT in the Intel's official Software Developer Manual are quite vague to that respect. However, there are two clauses in the MONITOR section that caught my attention:
The content of EAX is an effective address (in 64-bit mode, RAX is used). By default, the DS segment is used to create a linear address that is monitored.
The address range must use memory of the write-back type. Only write-back memory will correctly trigger the
monitoring hardware.
The first clause states that MONITOR is meant to be used with linear addresses, not physical ones. Devices and their DMA work with physical addresses only. So basically this means that all agents relying on the same MONITOR range should operate in the same domain of virtual memory space.
The second clause requires the monitored memory region to be cacheable (write-back, WB). For DMA, the respective memory range usually has to be marked as uncacheable, or write-combining at best (UC or WC). This is an even stronger indicator that your intent to have MONITOR/MWAIT triggered by DMA is very unlikely to work on current hardware.
Considering your high-level goal - to be able to tell when a device has written to a given memory range - I cannot think of any robust method to achieve it besides using virtualization for devices (VT-d, IOMMU, etc.). Basically, the classic approach for a peripheral device is to issue an interrupt when it is done writing to memory. Until an interrupt arrives, there is no way for the CPU to tell whether all DMA bytes have successfully reached their destination in memory.
Device virtualization allows physical addresses to be abstracted from a device in a transparent manner, and provides an equivalent of a page fault when the device attempts to write to or read from memory.

How to get a struct page from any address in the Linux kernel

I have existing code that takes a list of struct page * and builds a descriptor table to share memory with a device. The upper layer of that code currently expects a buffer allocated with vmalloc or from user space, and uses vmalloc_to_page to obtain the corresponding struct page *.
Now the upper layer needs to cope with all kinds of memory, not just memory obtained through vmalloc. This could be a buffer obtained with kmalloc, a pointer inside the stack of a kernel thread, or other cases that I'm not aware of. The only guarantee I have is that the caller of this upper layer must ensure that the memory buffer in question is mapped in kernel space at that point (i.e. it is valid to access buffer[i] for all 0<=i<size at this point). How do I obtain a struct page* corresponding to an arbitrary pointer?
Putting it in pseudo-code, I have this:
lower_layer(struct page *);
upper_layer(void *buffer, size_t size) {
    unsigned long addr;
    for (addr = (unsigned long)buffer & PAGE_MASK;
         addr < (unsigned long)buffer + size;
         addr += PAGE_SIZE) {
        struct page *pg = vmalloc_to_page((void *)addr);
        lower_layer(pg);
    }
}
and I now need to change upper_layer to cope with any valid buffer (without changing lower_layer).
I've found virt_to_page, which Linux Device Drivers indicates operates on “a logical address, [not] memory from vmalloc or high memory”. Furthermore, is_vmalloc_addr tests whether an address comes from vmalloc, and virt_addr_valid tests if an address is a valid virtual address (fodder for virt_to_page; this includes kmalloc(GFP_KERNEL) and kernel stacks). What about other cases: global buffers, high memory (it'll come one day, though I can ignore it for now), possibly other kinds that I'm not aware of? So I could reformulate my question as:
What are all the kinds of memory zones in the kernel?
How do I tell them apart?
How do I obtain page mapping information for each of them?
If it matters, the code is running on ARM (with an MMU), and the kernel version is at least 2.6.26.
I guess what you want is a page table walk, something like (warning, not actual code, locking missing, etc.):
struct mm_struct *mm = current->mm;
pgd_t *pgd = pgd_offset(mm, address);
pmd_t *pmd = pmd_offset(pgd, address);
pte_t pte = *pte_offset_map(pmd, address);
struct page *page = pte_page(pte);
But you should be very, very careful with this. The kmalloc address you got might very well not be page-aligned, for example. This sounds like a very dangerous API to me.
Mapping Addresses to a struct page
There is a requirement for Linux to have a fast method of mapping virtual addresses to physical addresses and for mapping struct pages to their physical address. Linux achieves this by knowing where, in both virtual and physical memory, the global mem_map array is because the global array has pointers to all struct pages representing physical memory in the system. All architectures achieve this with very similar mechanisms, but, for illustration purposes, we will only examine the x86 carefully.
Mapping Physical to Virtual Kernel Addresses
Any virtual address can be translated to its physical address by simply subtracting PAGE_OFFSET, which is essentially what the function virt_to_phys() with the macro __pa() does:
/* from <asm-i386/page.h> */
#define __pa(x) ((unsigned long)(x)-PAGE_OFFSET)

/* from <asm-i386/io.h> */
static inline unsigned long virt_to_phys(volatile void *address)
{
        return __pa(address);
}
Obviously, the reverse operation involves simply adding PAGE_OFFSET, which is carried out by the function phys_to_virt() with the macro __va(). Next we see how this helps the mapping of struct pages to physical addresses.
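For completeness, the reverse pair, as it appears in the same-era i386 headers:
/* from <asm-i386/page.h> */
#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))

/* from <asm-i386/io.h> */
static inline void *phys_to_virt(unsigned long address)
{
        return __va(address);
}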
There is one exception where virt_to_phys() cannot be used to convert virtual addresses to physical ones. Specifically, on the PPC and ARM architectures, virt_to_phys() cannot be used to convert addresses that have been returned by the function consistent_alloc(). consistent_alloc() is used on PPC and ARM architectures to return memory from non-cached for use with DMA.
What are all the kinds of memory zones in the kernel? <--- see the excerpt above
For user-space allocated memory, you want to use get_user_pages, which will give you the list of pages associated with the malloc'd memory, and also increment their reference counter (you'll need to call page_cache_release on each page once done with them.)
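A sketch for the 2.6-era kernels mentioned in the question (note that the get_user_pages() signature has changed considerably in later kernels; NPAGES and the surrounding error handling are placeholders):

struct page *pages[NPAGES];
int i, got;

down_read(&current->mm->mmap_sem);
got = get_user_pages(current, current->mm,
                     (unsigned long)buffer & PAGE_MASK, NPAGES,
                     1 /* write */, 0 /* force */, pages, NULL);
up_read(&current->mm->mmap_sem);

for (i = 0; i < got; i++) {
    lower_layer(pages[i]);
    page_cache_release(pages[i]); /* drop the reference taken by get_user_pages */
}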
For vmalloc'd pages, vmalloc_to_page is your friend, and I don't think you need to do anything.
For 64 bit architectures, the answer of gby should be adapted to:
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
struct page *page = NULL;
void *kernel_address;

pgd = pgd_offset(mm, address);
pud = pud_offset(pgd, address);
pmd = pmd_offset(pud, address);
pte = pte_offset_map(pmd, address);
page = pte_page(*pte);

/* mapping in kernel memory: */
kernel_address = kmap(page);
/* work with kernel_address.... */
kunmap(page);
You could try virt_to_page. I am not sure it is what you want, but at least it is somewhere to start looking.
