In linux kernel, when I do
cat /proc/pid/maps
I get some entries that map the files in /dev/XXX. I understand this is the device file, which corresponds to hardware devices instead of actual files. How does the memory management in linux kernel handle such mapping? What happens if I read or write to /dev/XXX?
This is what I believe is correct: When a device maps a particular memory region, say X to Y, a vm_area_struct (vm_start = X and vm_end = Y) is associated with that region and then mapped into the process's virtual memory map. That vm_are_struct is then associated with a vm_operations_struct that supplied by the device driver.
That means a device driver will implement some or all the functions in the vm_operations_struct. The most important one being the fault function.
Now, assume the process references a page in that area for the first time. That will trigger the page fault handler. The page fault has this code in it:
3646 if (pte_none(entry)) {
3647 if (vma->vm_ops) {
3648 if (likely(vma->vm_ops->fault))
3649 return do_linear_fault(mm, vma, address,
3650 pte, pmd, flags, entry);
3651 }
3652 return do_anonymous_page(mm, vma, address,
3653 pte, pmd, flags);
3654 }
Notice line #3648. It checks if the vm_operations_struct is implemented for that vm_area_struct. If it is, it will call the "fault" member function. Look at some of the implementations for that function.
The implementation of this function should return a pointer to page struct. The page itself will be allocated and populated with data by the device driver (i.e. "fault" function).
The rest of the page handler will associate a pte entry with that page. That means the next time the page is accessed and causes a page fault the check pte_none on line #3646 will fail. This will result in skipping the device driver fault function the second time that particular page is referenced.
Note: a vm_area_struct may be mapped to more than one page struct. So, one vm_area_struct may fit N*4KB. Assuming PAGE_SIZE is 4KB and N=1,2,...,X
Related
When a program calls mmap to allocate an anonymous page, also known as a demand-zero page, what appears in the address field of the corresponding page table entry (PTE)? I am assuming that the kernel does not create a zero-initialized page in physical memory (and enter that physical page's page number into the PTE) until the requesting process actually touches the page — hence the term demand-zero. Since it would not be a disk address, and would not be 0 (which is for unallocated pages), what value would appear there? As a different but related question, how does the kernel "know" that this page is to be handled as a demand-zero page, i.e., that the fault handler should find a physical page and initialize it with 0 rather than copy a page from disk?
I am assuming that the kernel does not create a zero-initialized page in physical memory
Indeed, this is usually the case. Unless special cases, like for example if MAP_POPULATE is specified to explicitly request the page to be initialized (also called "pre-fauting").
what appears in the address field of the corresponding page table entry (PTE)?
Right after mmap you don't even have a PTE allocated for the page (or in general, you don't have any entry at any page table level). For what the CPU is concerned, the page doesn't even exist. If you were to walk the page table you would just get to a point (at an arbitrary level) where the corresponding entry is marked as "not present".
Since it would not be a disk address, and would not be 0 (which is for unallocated pages), what value would appear there?
For what the CPU is concerned, the page is unallocated. At the first page fault, two things can happen:
For a read page fault, the PTE is updated to point to the zero page: this is a special page that is always entirely zeroed-out and is pointed to by the PTEs of any anonymous (demand-zero) page in the system that has not been modified yet.
For a write page fault, an actual physical page will be allocated and the corresponding PTE updated to point to its physical address.
Quoting directly from the documentation:
The anonymous memory or anonymous mappings represent memory that is not backed by a filesystem. Such mappings are implicitly created for program’s stack and heap or by explicit calls to mmap(2) system call. Usually, the anonymous mappings only define virtual memory areas that the program is allowed to access. The read accesses will result in creation of a page table entry that references a special physical page filled with zeroes. When the program performs a write, a regular physical page will be allocated to hold the written data. The page will be marked dirty and if the kernel decides to repurpose it, the dirty page will be swapped out.
how does the kernel "know" that this page is to be handled as a demand-zero page, i.e., that the fault handler should find a physical page and initialize it with 0 rather than copy a page from disk?
When a page fault occurs, the kernel page fault handler (architecture-dependent) determines to which VMA the page belongs to, and retrieves the corresponding struct vm_area_struct (which was created earlier either by the kernel itself or by a mmap syscall). This structure is then passed on to architecture-independent code (do_fault()) along with the needed fault information (struct vm_fault).
The vm_area_struct then contains all the remaining necessary information to handle the fault (for example the ->vm_file field which is != NULL in case of a file-backed mapping). The field ->vm_ops points to a struct vm_operations_struct which defines a set of function pointers to call in different occasions. In particular anonymous VMAs have ->vm_ops == NULL.
For other kind of pages, ->fault() is the function used when handling a page fault. This function knows what to check and how to actually handle the fault.
B & O also describe the VMA, but do not explain how the kernel could use the VMA to distinguish between, say, an unallocated page and an allocated page to be created and zero-initialized.
Simple, just check vma->vm_ops == NULL and in such case you know that the page is a demand-zero anon page. Then on a page fault act as needed (read fault -> update PTE to point to global zero page, write fault -> allocate a page and update PTE).
I am recently studying Linux kernel and I have a question regarding how user process's page table is first updated by the kernel. Let's consider X86 architecture as an example.
When a binary is first started, it's handled by a structure called bprm. The main function for handling it is called: do_execve. In this function, mm_struct is created for the process which will create the top level page table pgd.
Then kernel will go ahead and create virtual spaces for the process and map each segment into the virtual space by elf_map function which will eventually call into do_mmap()
Then what I found is that, after all the preparation above, the kernel will call START_THREAD to start the process.
My question is that, before START_THREAD, there seems no where to initialize the page table of the process. Then after searched for a while, I found out that page table is updated only when needed which I assume that only during first read/write operation from the userspace, the page table entry is first updated(please correct me if I am wrong).
My question is where in the kernel does the first page table updates(if you can tell me the code location that would be better)?
Answer is: yes, modern kernel does employ a lazy allocation for page tables.
Before identify the code location, let's check how this happens. The common path here is:
You do mmap to create a new vma(virtual memory area) -> kernel reserves the virtual memory range on your process address space -> After return to userspace, you try to access(read/write) the mmap region -> As the physical page has not been allocated, this triggers a page fault to allocate the physical page -> After some hardware context saving and trap prologues, you arrive at the architecture-specific page fault handler -> Before allocating the physical page, kernel has to make sure the page table itself is allocated. This is where the lazy page table allocation happens.
In general, the stack/heap of a userspace process is each described by a vma(just like the normal mmap region), so the above path also applies to your question.
Code location(source code version on my workspace is 5.13.2) for x86 can be found under arch/x86/mm/fault.c(as mentioned above, page fault handling is highly architecture-dependent). The call-chain can be traced with handle_page_fault() -> do_user_addr_fault() -> handle_mm_fault()(goto mm/memory.c) -> __handle_mm_fault()
Within mm/memory.c/__handle_mm_fault(), you will find code snippet like this:
struct vm_fault vmf = {
.vma = vma,
.address = address & PAGE_MASK,
.flags = flags,
.pgoff = linear_page_index(vma, address),
.gfp_mask = __get_fault_gfp_mask(vma),
};
unsigned int dirty = flags & FAULT_FLAG_WRITE;
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
p4d_t *p4d;
vm_fault_t ret;
pgd = pgd_offset(mm, address);
p4d = p4d_alloc(mm, pgd, address);
if (!p4d)
return VM_FAULT_OOM;
vmf.pud = pud_alloc(mm, p4d, address);
if (!vmf.pud)
return VM_FAULT_OOM;
...
vmf.pmd = pmd_alloc(mm, vmf.pud, address);
if (!vmf.pmd)
return VM_FAULT_OOM;
...
return handle_pte_fault(&vmf);
Modern linux uses four/five level page table on 64-bit machine, lazy page table allocation is necessary to reduce memory usage.
I was wondering if there is an existing system call/API for accessing getting the physical address of the virtual address?
If there is none then some direction on how to get that working ?
Also, how to get the physical address of MMIO which is non-pageable physical memory ?
The answer lies in IOMemoryDescriptor and IODMACommand objects.
If the memory in question is kernel-allocated, it should be allocated by creating an IOBufferMemoryDescriptor in the first place. If that's not possible, or if it's a buffer allocated in user space, you can wrap the relevant pointer using IOMemoryDescriptor::withAddressRange(address, length, options, task) or one of the other factory functions. In the case of withAddressRange, the address passed in must be virtual, in the address space of task.
You can directly grab physical address ranges from an IOMemoryDescriptor by calling the getPhysicalSegment() function (only valid between prepare()…complete() calls). However, normally you would do this for creating scatter-gather lists (DMA), and for this purpose Apple strongly recommends the IODMACommand. You can create these using IODMACommand::withSpecification(). Then use the genIOVMSegments() function to generate the scatter-gather list.
Modern Macs, and also some old PPC G5s contain an IOMMU (Intel calls this VT-d), so the system memory addresses you pass to PCI/Thunderbolt devices are not in fact physical, but IO-Mapped. IODMACommand will do this for you, as long as you use the "system mapper" (the default) and set mappingOptions to kMapped. If you're preparing addresses for the CPU, not a device, you will want to turn off mapping - use kIOMemoryMapperNone in your IOMemoryDescriptor options. Depending on what exactly you're trying to do, you probably don't need IODMACommand in this case either.
Note: it's often wise to pool and reuse your IODMACommand objects, rather than freeing and reallocating them.
Regarding MMIO, I assume you mean PCI BARs and similar - for IOPCIDevice, you can grab an IOMemoryDescriptor representing the memory-mapped device range using getDeviceMemoryWithRegister() and similar functions.
Example:
If all you want are pure CPU-space physical addresses for a given virtual memory range in some task, you can do something like this (untested as a complete kext that uses it would be rather large):
// INPUTS:
mach_vm_address_t virtual_range_start = …; // start address of virtual memory
mach_vm_size_t virtual_range_size_bytes = …; // number of bytes in range
task_t task = …; // Task object of process in which the virtual memory address is mapped
IOOptionBits direction = kIODirectionInOut; // whether the memory will be written or read, or both during the operation
IOOptionBits options =
kIOMemoryMapperNone // we want raw physical addresses, not IO-mapped
| direction;
// Process for getting physical addresses:
IOMemoryDescriptor* md = IOMemoryDescriptor::withAddressRange(
virtual_range_start, virtual_range_size_bytes, direction, task);
// TODO: check for md == nullptr
// Wire down virtual range to specific physical pages
IOReturn result = md->prepare(direction);
// TODO: do error handling
IOByteCount offset = 0;
while (offset < virtual_range_size_bytes)
{
IOByteCount segment_len = 0;
addr64_t phys_addr = md->getPhysicalSegment(offset, &len, kIOMemoryMapperNone);
// TODO: do something with physical range of segment_len bytes at address phys_addr here
offset += segment_len;
}
/* Unwire. Call this only once you're done with the physical ranges
* as the pager can change the physical-virtual mapping outside of
* prepare…complete blocks. */
md->complete(direction);
md->release();
As explained above, this is not suitable for generating DMA scatter-gather lists for device I/O. Note also this code is only valid for 64-bit kernels. You'll need to be careful if you still need to support ancient 32-bit kernels (OS X 10.7 and earlier) because virtual and physical addresses can still be 64-bit (64-bit user processes and PAE, respectively), but not all memory descriptor functions are set up for that. There are 64-bit-safe variants available to be used for 32-bit kexts.
Memory in the Linux kernel is usually unswappable (Do Kernel pages get swapped out?). However, sometimes it is useful to allow memory to be swapped out. Is it possible to explicitly allocate swappable memory inside the Linux kernel? One method I thought of was to create a user space process and use its memory. Is there anything better?
You can create a file in the internal shm shared memory filesystem.
const char *name = "example";
loff_t size = PAGE_SIZE;
unsigned long flags = 0;
struct file *filp = shmem_file_setup(name, size, flags);
/* assert(!IS_ERR(filp)); */
The file isn't actually linked, so the name isn't visible. The flags may include VM_NORESERVE to skip accounting up-front, instead accounting as pages are allocated. Now you have a shmem file. You can map a page like so:
struct address_space *mapping = filp->f_mapping;
pgoff_t index = 0;
struct page *p = shmem_read_mapping_page(mapping, index);
/* assert(!IS_ERR(filp)); */
void *data = page_to_virt(p);
memset(data, 0, PAGE_SIZE);
There is also shmem_read_mapping_page_gfp(..., gfp_t) to specify how the page is allocated. Don't forget to put the page back when you're done with it.
put_page(p);
Ditto with the file.
fput(filp);
Answer to your question is a simple No, or Yes with a complex modification to kernel source.
First, to enable swapping out, you have to ask yourself what is happening when kswapd is swapping out. Essentially it will walk through all the processes and make a decision whether its memory can be swapped out or not. And all these memory have the hardware mode of ring 3. So SMAP essentially forbid it from being read as data or executed as program in the kernel (ring 0):
https://en.wikipedia.org/wiki/Supervisor_Mode_Access_Prevention
And check your distros "CONFIG_X86_SMAP", for mine Ubuntu it is default to "y" which is the case for past few years.
But if you keep your memory as a kernel address (ring 0), then you may need to consider changing the kswapd operation to trigger swapout of kernel addresses. Whick kernel addresses to walk first? And what if the address is part of the kswapd's kernel operation? The complexities involved is huge.
And next is to consider the swap in operation: When the memory read is attempted and it's "not present" bit is enabled, then hardware exception will trigger linux kernel memory fault handler (which is __do_page_fault()).
And looking into __do_page_fault:
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1477
and there after how it handler the kernel addresses (do_kern_address_fault()):
https://elixir.bootlin.com/linux/latest/source/arch/x86/mm/fault.c#L1174
which essentially is just reporting as error for possible scenario. If you want to enable kernel address pagefaulting, then this path has to be modified.
And note too that the SMAP check (inside smap_violation) is done in the user address pagefaulting (do_usr_addr_fault()).
I have existing code that takes a list of struct page * and builds a descriptor table to share memory with a device. The upper layer of that code currently expects a buffer allocated with vmalloc or from user space, and uses vmalloc_to_page to obtain the corresponding struct page *.
Now the upper layer needs to cope with all kinds of memory, not just memory obtained through vmalloc. This could be a buffer obtained with kmalloc, a pointer inside the stack of a kernel thread, or other cases that I'm not aware of. The only guarantee I have is that the caller of this upper layer must ensure that the memory buffer in question is mapped in kernel space at that point (i.e. it is valid to access buffer[i] for all 0<=i<size at this point). How do I obtain a struct page* corresponding to an arbitrary pointer?
Putting it in pseudo-code, I have this:
lower_layer(struct page*);
upper_layer(void *buffer, size_t size) {
for (addr = buffer & PAGE_MASK; addr <= buffer + size; addr += PAGE_SIZE) {
struct page *pg = vmalloc_to_page(addr);
lower_layer(pg);
}
}
and I now need to change upper_layer to cope with any valid buffer (without changing lower_layer).
I've found virt_to_page, which Linux Device Drivers indicates operates on “a logical address, [not] memory from vmalloc or high memory”. Furthermore, is_vmalloc_addr tests whether an address comes from vmalloc, and virt_addr_valid tests if an address is a valid virtual address (fodder for virt_to_page; this includes kmalloc(GFP_KERNEL) and kernel stacks). What about other cases: global buffers, high memory (it'll come one day, though I can ignore it for now), possibly other kinds that I'm not aware of? So I could reformulate my question as:
What are all the kinds of memory zones in the kernel?
How do I tell them apart?
How do I obtain page mapping information for each of them?
If it matters, the code is running on ARM (with an MMU), and the kernel version is at least 2.6.26.
I guess what you want is a page table walk, something like (warning, not actual code, locking missing etc):
struct mm_struct *mm = current->mm;
pgd = pgd_offset(mm, address);
pmd = pmd_offset(pgd, address);
pte = *pte_offset_map(pmd, address);
page = pte_page(pte);
But you you should be very very careful with this. the kmalloc address you got might very well be not page aligned for example. This sounds like a very dangerous API to me.
Mapping Addresses to a struct page
There is a requirement for Linux to have a fast method of mapping virtual addresses to physical addresses and for mapping struct pages to their physical address. Linux achieves this by knowing where, in both virtual and physical memory, the global mem_map array is because the global array has pointers to all struct pages representing physical memory in the system. All architectures achieve this with very similar mechanisms, but, for illustration purposes, we will only examine the x86 carefully.
Mapping Physical to Virtual Kernel Addresses
any virtual address can be translated to the physical address by simply subtracting PAGE_OFFSET, which is essentially what the function virt_to_phys() with the macro __pa() does:
/* from <asm-i386/page.h> */
132 #define __pa(x) ((unsigned long)(x)-PAGE_OFFSET)
/* from <asm-i386/io.h> */
76 static inline unsigned long virt_to_phys(volatile void * address)
77 {
78 return __pa(address);
79 }
Obviously, the reverse operation involves simply adding PAGE_OFFSET, which is carried out by the function phys_to_virt() with the macro __va(). Next we see how this helps the mapping of struct pages to physical addresses.
There is one exception where virt_to_phys() cannot be used to convert virtual addresses to physical ones. Specifically, on the PPC and ARM architectures, virt_to_phys() cannot be used to convert addresses that have been returned by the function consistent_alloc(). consistent_alloc() is used on PPC and ARM architectures to return memory from non-cached for use with DMA.
What are all the kinds of memory zones in the kernel? <---see here
For user-space allocated memory, you want to use get_user_pages, which will give you the list of pages associated with the malloc'd memory, and also increment their reference counter (you'll need to call page_cache_release on each page once done with them.)
For vmalloc'd pages, vmalloc_to_page is your friend, and I don't think you need to do anything.
For 64 bit architectures, the answer of gby should be adapted to:
pgd_t * pgd;
pmd_t * pmd;
pte_t * pte;
struct page *page = NULL;
pud_t * pud;
void * kernel_address;
pgd = pgd_offset(mm, address);
pud = pud_offset(pgd, address);
pmd = pmd_offset(pud, address);
pte = pte_offset_map(pmd, address);
page = pte_page(*pte);
// mapping in kernel memory:
kernel_address = kmap(page);
// work with kernel_address....
kunmap(page);
You could try virt_to_page. I am not sure it is what you want, but at least it is somewhere to start looking.