I have an userspace process that allocates memory from the huge pages. This memory needs to be shared with kernel space threads and work queues. To do so I'm using ioctl to register the process memory.
In IOCTL:
take struct ring address from usersapce and do the vmap() the pages pages.
for memory blocks that can show up in the ring I pin_user_pages with LONG_TERM flag.
And here are the questions I have:
If I do the vmap on struct ring pointer can I (should I) unpin_user_page referring to it?
Should I vmap buffers that are going to be placed in the ring? Or pinning user pages is enough?
How to translate (remote to the kthread) userspace address to kernel address which can be accessed from kernel?
For second and third question:
So far I analyzed the code of the pin_user_pages_remote function. It ends up in the __get_user_pages_locked. In this function there is piece of code:
pages[i] = virt_to_page((void *)start);
if (pages[i])
get_page(pages[i]);
Does this mean that when pages are pinned in ioctl then in kthread I can:
u64 offset = user_address & (PAGE_SIZE - 1);
struct page page = virt_to_page(user_address);
u64 phys = page_to_phys(page);
void *kva = phys_to_virt(phys) + offset;
Does this need to be vmaped to do this trick?
Or there is much simpler method for doing so?
Related
I'm working on a driver where ranges of device memory are mapped into the user space (via IOCTL) for the application to write to. It works:
vma->vm_flags |= VM_DONTCOPY;
vma->vm_flags |= VM_DONTEXPAND;
down_write(¤t->mm->mmap_sem);
ret = vm_iomap_memory(vma, from, sz_required);
up_write(¤t->mm->mmap_sem);
where from is a physical address obtained from pci_resource_start() with some offset added to it.
The application also needs to read from the device so I increase the size of the region mmapped by application by PAGE_SIZE, allocate a page with dma_alloc_coherent(), and try to insert it at the end of the vma but that returns EBUSY. What do I do wrong? I should be able to stitch together multiple physical ranges into a single vma, both real memory and device mapped, or is that not supported?
In the new code a page is allocated like that, dma_addr is passed to the device so it knows where to write to:
dma = dma_alloc_coherent(&device, PAGE_SIZE, &dma_addr, GFP_KERNEL);
memset(dma, 0xfe, PAGE_SIZE);
set_memory_wb((unsigned long)dma, 1);
And the mapping code is changed to:
vma->vm_flags |= VM_DONTCOPY;
vma->vm_flags |= VM_DONTEXPAND;
vma->vm_flags |= VM_MIXEDMAP;
down_write(¤t->mm->mmap_sem);
ret = vm_iomap_memory(vma, from, sz_required);
up_write(¤t->mm->mmap_sem);
down_write(¤t->mm->mmap_sem);
ret = vm_insert_page(vma, vma->vm_end - PAGE_SIZE, virt_to_page(dma));
up_write(¤t->mm->mmap_sem);
The kernel is 4.15 on x86_64
Got it working by following the "hack" in Map multiple kernel buffer into contiguous userspace buffer?
Before vm_iomap_memory() I decrement vma->vm_end by PAGE_SIZE and restore the old value afterwards. Also, I switched from dma_alloc_coherent() to alloc_page() following by dma_map_page()
Not the solution I'm satisfied with though. There has to be a better way, perhaps a fault handler in vm_ops? Although that seems counter-productive considering I know exactly what I will be mapping and where.
It appears to be working on x86_64 and aarch64
I was wondering if there is an existing system call/API for accessing getting the physical address of the virtual address?
If there is none then some direction on how to get that working ?
Also, how to get the physical address of MMIO which is non-pageable physical memory ?
The answer lies in IOMemoryDescriptor and IODMACommand objects.
If the memory in question is kernel-allocated, it should be allocated by creating an IOBufferMemoryDescriptor in the first place. If that's not possible, or if it's a buffer allocated in user space, you can wrap the relevant pointer using IOMemoryDescriptor::withAddressRange(address, length, options, task) or one of the other factory functions. In the case of withAddressRange, the address passed in must be virtual, in the address space of task.
You can directly grab physical address ranges from an IOMemoryDescriptor by calling the getPhysicalSegment() function (only valid between prepare()…complete() calls). However, normally you would do this for creating scatter-gather lists (DMA), and for this purpose Apple strongly recommends the IODMACommand. You can create these using IODMACommand::withSpecification(). Then use the genIOVMSegments() function to generate the scatter-gather list.
Modern Macs, and also some old PPC G5s contain an IOMMU (Intel calls this VT-d), so the system memory addresses you pass to PCI/Thunderbolt devices are not in fact physical, but IO-Mapped. IODMACommand will do this for you, as long as you use the "system mapper" (the default) and set mappingOptions to kMapped. If you're preparing addresses for the CPU, not a device, you will want to turn off mapping - use kIOMemoryMapperNone in your IOMemoryDescriptor options. Depending on what exactly you're trying to do, you probably don't need IODMACommand in this case either.
Note: it's often wise to pool and reuse your IODMACommand objects, rather than freeing and reallocating them.
Regarding MMIO, I assume you mean PCI BARs and similar - for IOPCIDevice, you can grab an IOMemoryDescriptor representing the memory-mapped device range using getDeviceMemoryWithRegister() and similar functions.
Example:
If all you want are pure CPU-space physical addresses for a given virtual memory range in some task, you can do something like this (untested as a complete kext that uses it would be rather large):
// INPUTS:
mach_vm_address_t virtual_range_start = …; // start address of virtual memory
mach_vm_size_t virtual_range_size_bytes = …; // number of bytes in range
task_t task = …; // Task object of process in which the virtual memory address is mapped
IOOptionBits direction = kIODirectionInOut; // whether the memory will be written or read, or both during the operation
IOOptionBits options =
kIOMemoryMapperNone // we want raw physical addresses, not IO-mapped
| direction;
// Process for getting physical addresses:
IOMemoryDescriptor* md = IOMemoryDescriptor::withAddressRange(
virtual_range_start, virtual_range_size_bytes, direction, task);
// TODO: check for md == nullptr
// Wire down virtual range to specific physical pages
IOReturn result = md->prepare(direction);
// TODO: do error handling
IOByteCount offset = 0;
while (offset < virtual_range_size_bytes)
{
IOByteCount segment_len = 0;
addr64_t phys_addr = md->getPhysicalSegment(offset, &len, kIOMemoryMapperNone);
// TODO: do something with physical range of segment_len bytes at address phys_addr here
offset += segment_len;
}
/* Unwire. Call this only once you're done with the physical ranges
* as the pager can change the physical-virtual mapping outside of
* prepare…complete blocks. */
md->complete(direction);
md->release();
As explained above, this is not suitable for generating DMA scatter-gather lists for device I/O. Note also this code is only valid for 64-bit kernels. You'll need to be careful if you still need to support ancient 32-bit kernels (OS X 10.7 and earlier) because virtual and physical addresses can still be 64-bit (64-bit user processes and PAE, respectively), but not all memory descriptor functions are set up for that. There are 64-bit-safe variants available to be used for 32-bit kexts.
Memory in the Linux kernel is usually unswappable (Do Kernel pages get swapped out?). However, sometimes it is useful to allow memory to be swapped out. Is it possible to explicitly allocate swappable memory inside the Linux kernel? One method I thought of was to create a user space process and use its memory. Is there anything better?
You can create a file in the internal shm shared memory filesystem.
const char *name = "example";
loff_t size = PAGE_SIZE;
unsigned long flags = 0;
struct file *filp = shmem_file_setup(name, size, flags);
/* assert(!IS_ERR(filp)); */
The file isn't actually linked, so the name isn't visible. The flags may include VM_NORESERVE to skip accounting up-front, instead accounting as pages are allocated. Now you have a shmem file. You can map a page like so:
struct address_space *mapping = filp->f_mapping;
pgoff_t index = 0;
struct page *p = shmem_read_mapping_page(mapping, index);
/* assert(!IS_ERR(filp)); */
void *data = page_to_virt(p);
memset(data, 0, PAGE_SIZE);
There is also shmem_read_mapping_page_gfp(..., gfp_t) to specify how the page is allocated. Don't forget to put the page back when you're done with it.
put_page(p);
Ditto with the file.
fput(filp);
Answer to your question is a simple No, or Yes with a complex modification to kernel source.
First, to enable swapping out, you have to ask yourself what is happening when kswapd is swapping out. Essentially it will walk through all the processes and make a decision whether its memory can be swapped out or not. And all these memory have the hardware mode of ring 3. So SMAP essentially forbid it from being read as data or executed as program in the kernel (ring 0):
https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention
And check your distros "CONFIG_X86_SMAP", for mine Ubuntu it is default to "y" which is the case for past few years.
But if you keep your memory as a kernel address (ring 0), then you may need to consider changing the kswapd operation to trigger swapout of kernel addresses. Whick kernel addresses to walk first? And what if the address is part of the kswapd's kernel operation? The complexities involved is huge.
And next is to consider the swap in operation: When the memory read is attempted and it's "not present" bit is enabled, then hardware exception will trigger linux kernel memory fault handler (which is __do_page_fault()).
And looking into __do_page_fault:
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1477
and there after how it handler the kernel addresses (do_kern_address_fault()):
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1174
which essentially is just reporting as error for possible scenario. If you want to enable kernel address pagefaulting, then this path has to be modified.
And note too that the SMAP check (inside smap_violation) is done in the user address pagefaulting (do_usr_addr_fault()).
On Linux machine, trying to write driver and trying to map some kernel memory to application for performance gains.
Checking driver implementation for mmap online, finding different varieties of implementation.
As per man pages, mmap - creates new mapping in virtual address space of calling process.
1) Who allocates physical address space during mmap calling? kernel or device driver?
seen following varieties of driver mmap implementation.
a) driver creates contiguous physical kernel memory and maps it with process address space.
static int driver_mmap(struct file *filp, struct vm_area_struct *vma)
{
unsigned long size = vma->vm_end - vma->vm_start;
pos = kmalloc(size); //allocate contiguous physical memory.
while (size > 0) {
unsigned long pfn;
pfn = virt_to_phys((void *) pos) >> PAGE_SHIFT; // Get Page frame number
if (remap_pfn_range(vma, start, pfn, PAGE_SIZE, PAGE_SHARED)) // creates mapping
return -EAGAIN;
start += PAGE_SIZE;
pos += PAGE_SIZE;
size -= PAGE_SIZE;
}
}
b) driver creates virtual kernel memory and maps it with process address space.
static struct vm_operations_struct dr_vm_ops = {
.open = dr_vma_open,
.close = dr_vma_close,
};
static int driver_mmap(struct file *filp, struct vm_area_struct *vma)
{
unsigned long size = vma->vm_end - vma->vm_start;
void *kp = vmalloc(size);
unsigned long up;
for (up = vma->vm_start; up < vma->vm_end; up += PAGE_SIZE) {
struct page *page = vmalloc_to_page(kp); //Finding physical page from virtual address
err = vm_insert_page(vma, up, page); //How is it different from remap_pfn_range?
if (err)
break;
kp += PAGE_SIZE;
}
vma->vm_ops = &dr_vm_ops;
ps_vma_open(vma);
}
c) not sure who allocates memory in this case.
static int driver_mmap(struct file *filp, struct vm_area_struct *vma)
{
if (remap_pfn_range(vma, vma->vm_start, vma->vm_pgoff,
vma->vm_end - vma->vm_start,
vma->vm_page_prot)) // creates mapping
return -EAGAIN;
}
2) If kernel allocates memory for mmap, is n't memory wasted in a&b cases?
3) remap_pfn_range is to map multiple pages where as vm_insert_page is just for single page mapping. is it the only difference b/w these two APIs?
Thank You,
Gopinath.
Which you use depends on what you're trying to accomplish.
(1) A device driver is part of the kernel so it doesn't really make sense to differentiate that way. For these cases, the device driver is asking for memory to be allocated for its own use from the (physical) memory resources available to the entire kernel.
With (a), a physically contiguous space is being allocated. You might do this if there is some piece of external hardware (a PCI device, for example) that will be reading or writing that memory. The return value from kmalloc already has a mapping to kernel virtual address space. remap_pfn_range is being used to map the page into the user virtual address space of the current process as well.
For (b), a virtually contiguous space is being allocated. If there is no external hardware involved, this is what you would typically use. There is still physical memory being allocated to your driver, but it isn't guaranteed that the pages are physically contiguous -- hence fewer constraints on which pages can be allocated. (They will still be contiguous in kernel virtual address space.) And then you are simply using a different API to implement the same kind of mapping into user virtual address space.
For (c), the memory being mapped is allocated under control of some other subsystem. Thevm_pgoff field has already been set to the base physical address of the resource. For example, the memory might correspond to a PCI device's address region (a network interface controller's registers, say) where that physical address is determined/assigned by your BIOS (or whatever mechanism your machine uses).
(2) Not sure I understand this question. How can the memory be "wasted" if it's being used by the device driver and a cooperating user process? And if the kernel needs to read and write the memory, there must be kernel virtual address space allocated and it needs to be mapped to the underlying physical memory. Likewise, if the user space process is to access the memory, there must be user virtual address space allocated and that must be mapped to the physical memory as well.
"Allocating virtual address space" essentially just means allocating page table entries for the memory. That is done separately from actually allocating the physical memory. And it's done separately for kernel space and user space. And "mapping" means setting the page table entry (the virtual address of the beginning of the page) to point to the correct physical page address.
(3) Yes. They are different APIs that accomplish much the same thing. Sometimes you have a struct page, sometimes you have a pfn. It can be confusing: there are often several ways to accomplish the same thing. Developers typically use the one most obvious for the item they already have ("I already have a struct page. I could calculate its pfn. But why do that when there's this other API that accepts a struct page?").
I have existing code that takes a list of struct page * and builds a descriptor table to share memory with a device. The upper layer of that code currently expects a buffer allocated with vmalloc or from user space, and uses vmalloc_to_page to obtain the corresponding struct page *.
Now the upper layer needs to cope with all kinds of memory, not just memory obtained through vmalloc. This could be a buffer obtained with kmalloc, a pointer inside the stack of a kernel thread, or other cases that I'm not aware of. The only guarantee I have is that the caller of this upper layer must ensure that the memory buffer in question is mapped in kernel space at that point (i.e. it is valid to access buffer[i] for all 0<=i<size at this point). How do I obtain a struct page* corresponding to an arbitrary pointer?
Putting it in pseudo-code, I have this:
lower_layer(struct page*);
upper_layer(void *buffer, size_t size) {
for (addr = buffer & PAGE_MASK; addr <= buffer + size; addr += PAGE_SIZE) {
struct page *pg = vmalloc_to_page(addr);
lower_layer(pg);
}
}
and I now need to change upper_layer to cope with any valid buffer (without changing lower_layer).
I've found virt_to_page, which Linux Device Drivers indicates operates on “a logical address, [not] memory from vmalloc or high memory”. Furthermore, is_vmalloc_addr tests whether an address comes from vmalloc, and virt_addr_valid tests if an address is a valid virtual address (fodder for virt_to_page; this includes kmalloc(GFP_KERNEL) and kernel stacks). What about other cases: global buffers, high memory (it'll come one day, though I can ignore it for now), possibly other kinds that I'm not aware of? So I could reformulate my question as:
What are all the kinds of memory zones in the kernel?
How do I tell them apart?
How do I obtain page mapping information for each of them?
If it matters, the code is running on ARM (with an MMU), and the kernel version is at least 2.6.26.
I guess what you want is a page table walk, something like (warning, not actual code, locking missing etc):
struct mm_struct *mm = current->mm;
pgd = pgd_offset(mm, address);
pmd = pmd_offset(pgd, address);
pte = *pte_offset_map(pmd, address);
page = pte_page(pte);
But you you should be very very careful with this. the kmalloc address you got might very well be not page aligned for example. This sounds like a very dangerous API to me.
Mapping Addresses to a struct page
There is a requirement for Linux to have a fast method of mapping virtual addresses to physical addresses and for mapping struct pages to their physical address. Linux achieves this by knowing where, in both virtual and physical memory, the global mem_map array is because the global array has pointers to all struct pages representing physical memory in the system. All architectures achieve this with very similar mechanisms, but, for illustration purposes, we will only examine the x86 carefully.
Mapping Physical to Virtual Kernel Addresses
any virtual address can be translated to the physical address by simply subtracting PAGE_OFFSET, which is essentially what the function virt_to_phys() with the macro __pa() does:
/* from <asm-i386/page.h> */
132 #define __pa(x) ((unsigned long)(x)-PAGE_OFFSET)
/* from <asm-i386/io.h> */
76 static inline unsigned long virt_to_phys(volatile void * address)
77 {
78 return __pa(address);
79 }
Obviously, the reverse operation involves simply adding PAGE_OFFSET, which is carried out by the function phys_to_virt() with the macro __va(). Next we see how this helps the mapping of struct pages to physical addresses.
There is one exception where virt_to_phys() cannot be used to convert virtual addresses to physical ones. Specifically, on the PPC and ARM architectures, virt_to_phys() cannot be used to convert addresses that have been returned by the function consistent_alloc(). consistent_alloc() is used on PPC and ARM architectures to return memory from non-cached for use with DMA.
What are all the kinds of memory zones in the kernel? <---see here
For user-space allocated memory, you want to use get_user_pages, which will give you the list of pages associated with the malloc'd memory, and also increment their reference counter (you'll need to call page_cache_release on each page once done with them.)
For vmalloc'd pages, vmalloc_to_page is your friend, and I don't think you need to do anything.
For 64 bit architectures, the answer of gby should be adapted to:
pgd_t * pgd;
pmd_t * pmd;
pte_t * pte;
struct page *page = NULL;
pud_t * pud;
void * kernel_address;
pgd = pgd_offset(mm, address);
pud = pud_offset(pgd, address);
pmd = pmd_offset(pud, address);
pte = pte_offset_map(pmd, address);
page = pte_page(*pte);
// mapping in kernel memory:
kernel_address = kmap(page);
// work with kernel_address....
kunmap(page);
You could try virt_to_page. I am not sure it is what you want, but at least it is somewhere to start looking.