How does get_user_pages() pin a process page in Linux?

I am trying to pin a page of a Linux process by calling get_user_pages() in the kernel (Ubuntu 16.04, Linux 4.4.0).
But it is not clear to me how get_user_pages() pins the page, or what "pin" means in the function's description.
I did following test to check if the page is pinned.
1. A process calls aligned_alloc(0x1000, 0x1000) to allocate 4KB of memory.
2. A kernel module, which will receive a virtual address from a process by ioctl().
3. Once the virtual address is received in kernel module, it is used to call get_user_pages() like this,
res = get_user_pages(current, current->mm, vaddr, 1, 1, 1, &page);
4. The process is sleeping for hours, for me to check the status.
After these steps, I could NOT find any sign of the process's virtual address being locked (or pinned) in /proc/pid/maps, /proc/pid/smaps, or /proc/meminfo.
I also checked the reference count in the struct page for that virtual address before and after calling get_user_pages(); it is the same (3 in my test case), as shown below.
[ 7159.432196] Before, page flag = ffff800004004c, count=3
[ 7159.432196] Pinned Got mmaped.
[ 7159.432197] After, page flags = ffff800004004c, count = 3
Did I miss something?
And how does get_user_pages() pin the process pages?
I found a similar question on SO, How do "pinned" pages in Linux present (or actually "pin") themselves, but it has no answers.

Related

When is the user process's page table first updated during the program start stage

I have recently been studying the Linux kernel and I have a question about how a user process's page table is first updated by the kernel. Let's consider the x86 architecture as an example.
When a binary is first started, it is handled through a structure called bprm. The main function for handling it is do_execve. In this function, an mm_struct is created for the process, which in turn creates the top-level page table, the pgd.
Then the kernel goes ahead and creates the virtual address space for the process and maps each segment into it with the elf_map function, which eventually calls into do_mmap().
What I found is that, after all the preparation above, the kernel calls START_THREAD to start the process.
My question is that, before START_THREAD, there seems to be nowhere that initializes the page table of the process. After searching for a while, I found out that the page table is updated only when needed, which I assume means that a page table entry is first filled in on the first read/write from userspace (please correct me if I am wrong).
So where in the kernel does the first page table update happen (if you can tell me the code location, even better)?

The answer is: yes, modern kernels do allocate page tables lazily.
Before identifying the code location, let's check how this happens. The common path here is:
You call mmap to create a new vma (virtual memory area) -> the kernel reserves the virtual memory range in your process address space -> after returning to userspace, you access (read/write) the mmap region -> as the physical page has not been allocated, this triggers a page fault -> after some hardware context saving and trap prologues, you arrive at the architecture-specific page fault handler -> before allocating the physical page, the kernel has to make sure the page table itself is allocated. This is where the lazy page table allocation happens.
In general, the stack and heap of a userspace process are each described by a vma (just like a normal mmap region), so the above path also applies to your question.
The code location for x86 (the source version in my workspace is 5.13.2) is arch/x86/mm/fault.c (as mentioned above, page fault handling is highly architecture-dependent). The call chain can be traced as handle_page_fault() -> do_user_addr_fault() -> handle_mm_fault() (which moves to mm/memory.c) -> __handle_mm_fault()
Within __handle_mm_fault() in mm/memory.c, you will find a code snippet like this:
struct vm_fault vmf = {
    .vma = vma,
    .address = address & PAGE_MASK,
    .flags = flags,
    .pgoff = linear_page_index(vma, address),
    .gfp_mask = __get_fault_gfp_mask(vma),
};
unsigned int dirty = flags & FAULT_FLAG_WRITE;
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
p4d_t *p4d;
vm_fault_t ret;
pgd = pgd_offset(mm, address);
p4d = p4d_alloc(mm, pgd, address);
if (!p4d)
    return VM_FAULT_OOM;
vmf.pud = pud_alloc(mm, p4d, address);
if (!vmf.pud)
    return VM_FAULT_OOM;
...
vmf.pmd = pmd_alloc(mm, vmf.pud, address);
if (!vmf.pmd)
    return VM_FAULT_OOM;
...
return handle_pte_fault(&vmf);
Modern Linux uses four- or five-level page tables on 64-bit machines, so allocating page tables lazily is necessary to keep memory usage down.
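The same laziness can be observed from user space. The following is only a minimal sketch, not part of the kernel code above: it assumes Linux and uses mincore() to ask whether a page of an anonymous mapping is resident before and after the first write. The helper names page_is_resident() and first_touch_demo() are made up for illustration.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for mincore() and MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: 1 if the page starting at page_addr is resident,
 * 0 if it is not, -1 if mincore() failed. page_addr must be page-aligned. */
int page_is_resident(void *page_addr)
{
    unsigned char vec = 0;
    long page_size = sysconf(_SC_PAGESIZE);

    if (page_size <= 0 || mincore(page_addr, (size_t)page_size, &vec) != 0)
        return -1;
    return vec & 1;
}

/* Hypothetical demo: maps one anonymous page and records its residency
 * before and after the first write. Returns 0 on success, -1 on error. */
int first_touch_demo(int *before, int *after)
{
    long page_size = sysconf(_SC_PAGESIZE);
    char *p;

    assert(before != NULL && after != NULL);
    if (page_size <= 0)
        return -1;

    /* mmap() only creates the vma; no physical page is allocated yet. */
    p = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    *before = page_is_resident(p);
    p[0] = 1;                       /* first write: triggers the page fault */
    *after = page_is_resident(p);

    munmap(p, (size_t)page_size);
    return 0;
}
```

On a typical Linux system I would expect the page to be reported as not resident right after mmap() and resident after the first write, which is the fault path described above seen from the outside.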

A Process accessing memory outside of allocated region

Assume a process is allocated a certain region of virtual memory.
How will the processor react if the process happens to access a memory region outside this allocation region?
Does the processor kill the process? Or does it raise a Fault?
Thank you in advance.

Processes are not really allocated a certain region of virtual memory. They are allocated physical frames that they can access using virtual memory. Processes have virtual access to all virtual memory available.
When a high level language is compiled, it is placed in an executable. This executable is a file format which specifies several things among which is the virtual memory in use by the program. When the OS launches that executable, it will allocate certain physical pages to the newly created process. These pages contain the actual code. The OS needs to set up the page tables so that the virtual addresses that the process uses are translated to the right position in memory (the right physical addresses).
When a process jumps to a virtual address it shouldn't jump to, several things can happen, depending on what that address happens to map to. From the program's point of view it is undefined behavior.
As stated on osdev.org (https://wiki.osdev.org/Paging):
A page fault exception is caused when a process is seeking to access an area of virtual memory that is not mapped to any physical memory, when a write is attempted on a read-only page, when accessing a PTE or PDE with the reserved bit or when permissions are inadequate.
The CPU pushes an error code on the stack before firing a page fault exception. The error code must be analyzed by the exception handler to determine how to handle the exception. The bottom 3 bits of the exception code are the only ones used, bits 3-31 are reserved.
What happens next really depends on several factors. For example, in assembly, if you jump to a random virtual address, several things can happen.
If you jump into an allocated page, then the page could contain anything. It could just as well contain zeroes. If it contains zeroes, the process will keep executing instructions until it reaches a page which isn't present in RAM and triggers a page fault. Or it could end up executing a jmp to somewhere else in RAM and eventually trigger a page fault.
If you jump into a page which does not have the present bit set (unallocated page), then the CPU will trigger a page fault immediately. Since the page is not allocated, it will not magically become allocated. The OS needs to take action. If the page was supposed to be accessible to the process, then maybe it was swapped out to the hard disk and the OS needs to swap it back into RAM. If it wasn't supposed to be accessed (like in this case), the OS needs to kill the process (and it does). The OS knows the process should not access a page by looking at its memory map for that process. It should not just blindly allocate a page to a process which jumps nowhere. If the process needs more memory during execution, it can ask the OS properly using system calls.
If you jump to a virtual address which, once translated by the MMU using the page tables, lands in kernel mode code (a supervisor page), the CPU will trigger a page fault whose error code has the present and user bits set (binary 1 0 1): the page is present, but the access came from user mode and violated its protection.
The OS uses 2 privilege levels (0 and 3), so all user mode processes run at level 3. Nothing prevents one user process from accessing the memory and the code of another process except the way the page tables are set up. The page tables are often not filled up completely. If you jump to a random virtual address, anything can happen. The virtual address can be translated to anything.
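To make the "page fault, then the OS decides" sequence concrete, here is a small user-space sketch. It assumes Linux, where an access to an address outside the process's memory map is reported to the process as SIGSEGV. The function names read_faults() and unmapped_access_faults() are hypothetical. It reads a freshly mapped page (a fault the kernel quietly resolves) and then the same address after munmap() (a fault the kernel turns into a signal).

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_env;

static void on_segv(int sig)
{
    (void)sig;
    siglongjmp(fault_env, 1);   /* unwind out of the faulting read */
}

/* Hypothetical helper: 1 if reading *addr raised SIGSEGV, 0 if the read
 * succeeded, -1 if the handler could not be installed. */
int read_faults(const volatile char *addr)
{
    struct sigaction sa, old;
    int faulted;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, &old) != 0)
        return -1;

    if (sigsetjmp(fault_env, 1) == 0) {
        char c = *addr;         /* may page-fault */
        (void)c;
        faulted = 0;
    } else {
        faulted = 1;            /* we got here through the signal handler */
    }

    sigaction(SIGSEGV, &old, NULL);
    return faulted;
}

/* Returns 1 if the page is readable while mapped and faults once it has
 * been unmapped, 0 otherwise, -1 on setup error. */
int unmapped_access_faults(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    int ok_while_mapped, faults_after_unmap;
    char *p;

    if (page_size <= 0)
        return -1;
    p = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    assert(((unsigned long)p % (unsigned long)page_size) == 0);

    ok_while_mapped = (read_faults(p) == 0);    /* in the memory map */
    munmap(p, (size_t)page_size);
    faults_after_unmap = (read_faults(p) == 1); /* no longer in the map */

    return ok_while_mapped && faults_after_unmap;
}
```

Without the handler, the second read would simply kill the process, which is the default action for SIGSEGV.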

How do I write to a __user memory from within the top half of an interrupt handler?

I am working on a proprietary device driver. The driver is implemented as a kernel module. This module is then coupled with a user-space process.
It is essential that each time the device generates an interrupt, the driver updates a set of counters directly in the address space of the user-space process from within the top half of the interrupt handler. The driver knows the PID and the task_struct of the user-process and is also aware of the virtual address where the counters lie in the user-process context. However, I am having trouble in figuring out how code running in the interrupt context could take up the mm context of the user-process and write to it. Let me sum up what I need to do:
1. Get the address of the physical page and offset corresponding to the virtual address of the counters in the context of the user-process.
2. Set up mappings in the page table and write to the physical page corresponding to the counter.
For this, I have tried the following:
1. Try to take up the mm context of the user-task, like below:
use_mm(tsk->mm);
/* write to counters. */
unuse_mm(tsk->mm);
This apparently causes the entire system to hang.
2. Wait for the interrupt to occur when our user-process was the current process. Then use copy_to_user().
I'm not much of an expert on kernel programming. If there's a good way to do this, please do advise and thank you in advance.

Your driver should be the one that maps kernel memory into the user-space process. E.g., you may implement the .mmap callback of struct file_operations for your device.
The kernel driver may write to the kernel address it has mapped at any time (even in an interrupt handler). The user-space process will immediately see all modifications on its side of the mapping (using the address obtained from the mmap() system call).
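A driver's .mmap callback cannot be shown as a runnable snippet here, so what follows is only a user-space analogy of the visibility property described above, assuming Linux. A MAP_SHARED anonymous mapping stands in for the buffer the driver would export, and a forked child stands in for the interrupt handler that updates the counter. The function name shared_counter_after() is hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the counter value the "reader" sees after the "writer" has
 * incremented it `updates` times, or -1 on error. */
long shared_counter_after(unsigned long updates)
{
    volatile unsigned long *counter;
    unsigned long i;
    long seen;
    int status = 0;
    pid_t pid;

    /* Both sides of the fork share the same physical page. */
    counter = mmap(NULL, sizeof *counter, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    assert(counter != MAP_FAILED);  /* sketch: treat failure as fatal */
    *counter = 0;

    pid = fork();
    if (pid < 0) {
        munmap((void *)counter, sizeof *counter);
        return -1;
    }
    if (pid == 0) {
        /* Writer: stands in for the driver updating its own mapping. */
        for (i = 0; i < updates; i++)
            (*counter)++;
        _exit(0);
    }

    /* Reader: stands in for the user-space process. */
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) {
        munmap((void *)counter, sizeof *counter);
        return -1;
    }
    seen = (long)*counter;
    munmap((void *)counter, sizeof *counter);
    return seen;
}
```

In the real driver case the writer would be kernel code and the reader would obtain its pointer from mmap() on the device file, but the idea is the same: both sides address the same physical page, so no copy into user space is needed at interrupt time.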

Unix's architecture frowns on interrupt routines accessing user space
because a process could (in theory) be swapped out when the interrupt occurs. 
If the process is running on another CPU, that could be a problem, too. 
I suggest that you write an ioctl to synchronize the counters,
and then have the process call that ioctl
every time it needs to access the counters.

Outside of an interrupt context, your driver will need to check the user memory is accessible (using access_ok), and pin the user memory using get_user_pages or get_user_pages_fast (after determining the page offset of the start of the region to be pinned, and the number of pages spanned by the region to be pinned, including page alignment at both ends). It will also need to map the list of pages to kernel address space using vmap. The return address from vmap, plus the offset of the start of the region within its page, will give you an address that your interrupt handler can access.
At some point, you will want to terminate access to the user memory, which will involve ensuring that your interrupt routine no longer accesses it, a call to vunmap (passing the pointer returned by vmap), and a sequence of calls to put_page for each of the pages pinned by get_user_pages or get_user_pages_fast.
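The page arithmetic mentioned above (page offset of the start of the region, and the number of pages spanned once both ends are page-aligned) is easy to get wrong by one, so here is a small sketch of it in plain C. It assumes 4 KiB pages and ignores address overflow. The names DEMO_PAGE_SIZE, struct pin_span and compute_pin_span() are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096UL   /* assumption: 4 KiB pages */

struct pin_span {
    uintptr_t first_page;   /* page-aligned address of the first page */
    size_t offset;          /* offset of the region within that page */
    size_t nr_pages;        /* number of pages the region touches */
};

/* Hypothetical helper: describes which pages cover [uaddr, uaddr + len). */
struct pin_span compute_pin_span(uintptr_t uaddr, size_t len)
{
    struct pin_span s;

    s.first_page = uaddr & ~(uintptr_t)(DEMO_PAGE_SIZE - 1);
    s.offset = (size_t)(uaddr - s.first_page);
    assert(s.offset < DEMO_PAGE_SIZE);

    /* Round the end of the region up to the next page boundary. */
    s.nr_pages = len ? (s.offset + len + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE
                     : 0;
    return s;
}
```

For example, a 16-byte counter block starting at 0x1ff8 straddles two pages. In the scheme described above, nr_pages would be the page count to pin, and offset is what gets added to the address returned by vmap.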

I don't think what you are trying to do is possible. Consider this situation:
(assuming how your device works)
1. Some function allocates the user-space memory for the counters and supplies its address in PROCESS X.
2. A switch occurs and PROCESS Y executes.
3. Your device interrupts.
4. The address for your counters is inaccessible.
You need to schedule a kernel mode asynchronous event (lower half) that will execute when PROCESS X is executing.

Why there is no SIGSEGV signal on copy on write?

The copy-on-write article on Wikipedia says that copy-on-write is usually implemented by giving read-only access to the pages, so that when one is written, the page fault trap handler can map a unique physical memory page for it. So my question is: why doesn't a user-level application receive a SIGSEGV signal when such a page fault happens? After all, the Wikipedia article on SIGSEGV says that SIGSEGV is the signal sent to a process when it makes an invalid memory reference, or segmentation fault. So why is no SIGSEGV sent to the process in the copy-on-write case?

I know it's been a while since this was asked, but I wanted to expand on Alexey's answer a bit.
Copy-on-write (I assume you're talking about virtual memory and not filesystems) usually works like so:
The OS knows which pages need to be copied on write. (They are the pages which are private to a process.) These pages are marked in hardware as read-only. However, the virtual memory map of the process has the pages marked as readable and writable. This means that the user process believes it has full access to the pages in question.
When a user process attempts to write to one of these pages, a page fault is generated because the processor recognizes that the page is read-only (based on the hardware marks before). Page faults are sort of like segfaults, but for the kernel instead of for user processes.
This triggers the page fault handler to run within the kernel, which looks at the page in question and sees that it's a private page which has not yet been copied. The handler will create a copy of the page and mark the copy as writable.
Then the handler will replace the old page's address with the new one in the virtual-to-physical translation table and exit.
The last instruction will be retried by the user process at this point, and this time the write will succeed because the new page is writeable at both the virtual memory map (the user process' view of memory permissions) and hardware (the kernel's view of memory permissions) levels.
A page fault is generated every time a segmentation fault occurs, but most page faults are handled by the kernel and are never passed to the process that caused them as segfaults. There are many reasons why a page fault might be handled at a lower level, including:
The page which was accessed was paged out to disk because it hadn't been used in a long time. The OS must bring it back into memory so the process can use it again.
The process is accessing a newly-allocated page for the first time, and the actual physical page hasn't been allocated yet. The OS must allocate a page and then insert it into the virtual-to-physical translation table before the memory can actually be used.
The OS is playing a hardware page access permissions trick to allow it to watch for accesses to a particular page. This is what happens in copy-on-write, but it can have other uses as well. Consider an OS-level virtualization technology like kvm, where writing to a memory-mapped device's location in memory in the guest OS should actually write to a file or the display in the host OS.

The main idea of COW is that COW is completely transparent to the user process as if it fully owned the memory without any sharing.
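That transparency is easy to check from user space. Below is a minimal sketch, assuming a Unix-like system where fork() shares the parent's private pages with the child copy-on-write (as Linux does). The function name cow_is_transparent() is made up. The child writes to a page it shares with the parent, exits normally rather than being killed by SIGSEGV, and the parent's copy keeps its old value.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the child's write succeeded without a signal and did not
 * leak into the parent, 0 otherwise, -1 on setup error. */
int cow_is_transparent(void)
{
    volatile int *value = malloc(sizeof *value);
    int status = 0;
    int ok;
    pid_t pid;

    assert(value != NULL);      /* sketch: treat allocation failure as fatal */
    *value = 1;

    pid = fork();               /* the page is now shared between both */
    if (pid < 0) {
        free((void *)value);
        return -1;
    }
    if (pid == 0) {
        /* Write fault: the kernel copies the page, then the store is
         * retried and succeeds. The child never sees a signal. */
        *value = 2;
        _exit(*value == 2 ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) != pid) {
        free((void *)value);
        return -1;
    }
    /* WIFEXITED is false if the child was killed by a signal. */
    ok = WIFEXITED(status) && WEXITSTATUS(status) == 0 && *value == 1;
    free((void *)value);
    return ok;
}
```

If the kernel had forwarded the write fault as a SIGSEGV, the child would have been terminated by a signal and the function would return 0.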

Page translation of process code section in Linux. Why does the Page Table Entry get 0 for some pages?

For some reason, I need to translate the virtual address of the code section to physical address. I did following experiment:
I get the virtual address from the start_code and end_code in mm_struct of process A, which are the initial address and final address of the executable code.
I get the CR3 of process A.
I translate the virtual addresses to physical addresses page by page. For example, if there are 10 pages in the code section of process A, I translate the 10 virtual addresses at the beginning of each page.
I found that some pages have a Page Table Entry (PTE) == 0, while other pages translate successfully to a physical address.
I tried Firefox and Minicom as my process, and both of them run into this situation.
I guess my question is: could anyone explain to me why PTE == 0? Does it mean these pages have been swapped out to disk? If this is the case, how can I find these pages?
Thanks for any input!!

It looks as if you are trying to perform page table introspection without using the kernel APIs for it. Note that the address space is arranged in a red-black tree of vm_area_struct structs and you should probably use the APIs that traverse them. The mappings might change at any time so using the proper locking for these data structures is necessary.
For example, see the get_user_pages() function. It can be used to swap in the pages and temporarily pin them in memory. Pinning is usually what page table introspection calls for, because once you have a physical address in hand, the kernel could otherwise swap the page out at any time.
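For reference, the manual walk described in the question starts by splitting the virtual address into table indices. The sketch below assumes x86-64 with 4-level paging and 4 KiB pages, and uses Linux's names for the levels. It also shows a rough way to tell apart the kinds of entries such a walk can find: as far as I know, an all-zero entry means the page is not currently populated (for example it has not been faulted in yet), while a swapped-out page is normally a non-zero entry with the present bit clear. The struct, enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Index into each table level for one virtual address
 * (assumption: x86-64, 4-level paging, 4 KiB pages). */
struct va_parts {
    unsigned pgd;       /* bits 47..39 */
    unsigned pud;       /* bits 38..30 */
    unsigned pmd;       /* bits 29..21 */
    unsigned pte;       /* bits 20..12 */
    unsigned offset;    /* bits 11..0  */
};

struct va_parts split_va(uint64_t va)
{
    struct va_parts p;

    p.offset = (unsigned)(va & 0xfff);
    p.pte = (unsigned)((va >> 12) & 0x1ff);
    p.pmd = (unsigned)((va >> 21) & 0x1ff);
    p.pud = (unsigned)((va >> 30) & 0x1ff);
    p.pgd = (unsigned)((va >> 39) & 0x1ff);
    assert(p.pgd < 512 && p.pud < 512 && p.pmd < 512 && p.pte < 512);
    return p;
}

enum demo_pte_state {
    DEMO_PTE_NONE,          /* all zero: nothing has been mapped here */
    DEMO_PTE_PRESENT,       /* bit 0 set: translation is valid */
    DEMO_PTE_NOT_PRESENT    /* non-zero, bit 0 clear: e.g. a swap entry */
};

enum demo_pte_state classify_pte(uint64_t pte)
{
    if (pte == 0)
        return DEMO_PTE_NONE;
    if (pte & 1)
        return DEMO_PTE_PRESENT;
    return DEMO_PTE_NOT_PRESENT;
}
```

This is only the arithmetic. Reading the entries themselves safely still needs the kernel's own helpers and locking, as noted above.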
