Kernel threads accessing user space address - linux-kernel

Here's a quote from the book Understanding the Linux Kernel (emphasis mine):
... no need to invalidate a TLB entry that refers to a User Mode linear address, because no kernel thread accesses the User Mode address space
I understand that a user-space process cannot access kernel space, but why is the reverse true (which is what I think the sentence above implies)? Is this enforced by hardware, or is it simply a design choice of the kernel?

The sentence is wrong if taken literally, but in context it is good enough.
The full quote is:
In fact, each kernel thread does not have its own set of page tables;
rather, it makes use of the set of page tables belonging to a regular
process. However, there is no need to invalidate a TLB entry that
refers to a User Mode linear address, because no kernel thread
accesses the User Mode address space
What they mean is that switching user thread <-> user thread changes address spaces (duh), but switching user thread -> kernel thread or kernel thread -> kernel thread DOES NOT, as an optimisation. Kernel threads are not tied to any user thread, so there is no specific user part of the address space for them to access in the first place. As things get scheduled in a different order over time and a particular kernel thread runs after random user threads, it keeps executing with different page tables for the user part (the kernel part stays the same). So there is nothing meaningful for a kernel thread to access in userspace. Just run ps auxw and look at all the entries enclosed in '[]'. Those are kernel threads.
This must not be confused with kernel code accessing userspace - this happens all the time, e.g. when a user thread performs a syscall.
I also said the sentence is wrong because, in special cases, a kernel thread can explicitly adopt a particular address space. aio does this.
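To make the borrowing concrete, here is a small user-space model of the idea, not kernel code. The field names mm and active_mm mirror task_struct; the function model_context_switch and the counter address_space_switches are made up for illustration, and the real scheduler does considerably more.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model: only the two fields that matter for this discussion. */
struct mm { int id; };

struct task {
    const char *name;
    struct mm *mm;        /* NULL for a kernel thread */
    struct mm *active_mm; /* address space actually loaded while running */
};

static int address_space_switches; /* counts simulated page-table reloads */

/* Switch from prev to next; a kernel thread borrows prev's address space. */
static void model_context_switch(struct task *prev, struct task *next)
{
    assert(prev->active_mm != NULL);

    if (next->mm == NULL) {
        /* Kernel thread: keep whatever page tables are already loaded. */
        next->active_mm = prev->active_mm;
    } else {
        /* User thread: reload only if the address space really differs. */
        next->active_mm = next->mm;
        if (next->mm != prev->active_mm)
            address_space_switches++;
    }
}
```

Switching from a user task to a kernel thread and back to the same user task costs no reload in this model, which is the optimisation the book is describing.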

Related

Page Fault in Linux Kernel

I have a few questions after reading Mel Gorman's book Understanding the Linux Virtual Memory Manager. Section 4.3, Process Address Space Descriptor, says kernel threads never page fault or access the user space portion. The only exception is page faulting within the vmalloc space. Following are my questions.
kernel threads never page fault: Does this mean only user space code triggers page faults? If kmalloc() or vmalloc() is called, will it not page fault? I believe the kernel has to map these to anon pages. When a write to these pages is performed, a page fault occurs. Is my understanding correct?
Why can't kernel threads access user space? Don't copy_to_user() and copy_from_user() do that?
Exception is page faulting within vmalloc space: Does that mean vmalloc() triggers a page fault and kmalloc() doesn't? Why doesn't kmalloc() page fault? Don't the mappings from the kernel's virtual addresses to physical frames need to be kept as page table entries?
kernel threads never page fault: The page fault in question is the one taken when making a virtual page resident or bringing it back from swap. Kernel pages not only get paged in on kmalloc(), but also remain resident for their lifetime. The same does not hold for user space pages, which A) may be lazily allocated (i.e. just reserved in the address space on malloc(), but not actually faulted in until a memset() or other dereference) and B) may be swapped out under low-memory conditions.
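The lazy allocation in (A) can be observed from user space on Linux. The sketch below is only a demonstration under assumptions: it uses mmap() directly instead of malloc(), asks mincore() which pages are resident, and the helper names (reserve_pages, touch_pages, resident_pages) are made up. On a typical system the count is 0 right after mapping and equals the number of pages after the memset(), but the exact numbers depend on the kernel and its configuration.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for mincore() and MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Reserve npages of anonymous memory; nothing is touched yet. NULL on failure. */
static void *reserve_pages(size_t npages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, npages * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Write to every page, so each one is faulted in. */
static void touch_pages(void *addr, size_t npages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    assert(addr != NULL);
    memset(addr, 1, npages * page);
}

/* Number of resident pages in the range (at most 64 pages), or -1 on error. */
static long resident_pages(void *addr, size_t npages)
{
    unsigned char vec[64];
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    long count = 0;

    if (npages > sizeof(vec) || mincore(addr, npages * page, vec) != 0)
        return -1;
    for (size_t i = 0; i < npages; i++)
        count += vec[i] & 1; /* bit 0 set means the page is in RAM */
    return count;
}
```

Calling resident_pages() before and after touch_pages() on the same mapping shows the page faults doing their work.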
Why can't kernel threads access user space? Don't copy_to_user() and copy_from_user() do that?
That's a great question, with a hardware-specific answer. It used to be the case that kernel threads were discouraged from accessing user space, precisely because of the page fault that could occur when touching unpaged or paged-out user memory (recall that this would not happen in kernel space, as explained above). So copy_to/from_user would be a normal memcpy, but wrapped in a page fault handler. This way, any page fault would be handled transparently (i.e. the memory would be paged in) and all would be well. But there were certainly cases where the bad approach of a plain memcpy to/from user memory would just work. Worse, it would work more often than not, since page faults vary with RAM residency and availability, and so the occasional unhandled fault would cause random panics. Hence the decree to always use copy_from/to_user.
Recently, however, kernel/user memory isolation became important from a security standpoint. This is due to many exploitation techniques (NULL pointer dereferencing being a very common and powerful one), where fake kernel objects (or code) could be constructed in user space (and thus, easily controlled) memory, and could lead to code execution in kernel.
Most architectures thus now have a hardware feature that prevents the kernel from accessing or executing pages that belong to user mode. Taking ARM64 as an example, these features are called PAN and PXN (Privileged Access Never and Privileged Execute Never).
Thus, copy_from/to_user now not only handles page faults, but also disables PAN before the operation and restores it afterwards.
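The calling contract that results from this is worth spelling out: copy_from_user() returns the number of bytes it could not copy (0 on success) and zero-fills the part of the destination it could not fill. Below is a user-space model of just that contract. The struct user_region and the function model_copy_from_user are hypothetical stand-ins; there is no real fault handling or PAN toggling here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the accessible part of user memory. */
struct user_region {
    const unsigned char *base; /* start of the "mapped" user bytes */
    size_t len;                /* number of valid bytes */
};

/*
 * Model of the copy_from_user() contract: copy n bytes starting at
 * offset, return how many bytes could NOT be copied, and zero-fill
 * the uncopied tail of dst.
 */
static size_t model_copy_from_user(void *dst, const struct user_region *u,
                                   size_t offset, size_t n)
{
    size_t ok;

    assert(dst != NULL && u != NULL);

    if (offset >= u->len)
        ok = 0;                   /* nothing accessible at all */
    else if (n > u->len - offset)
        ok = u->len - offset;     /* "faults" part-way through */
    else
        ok = n;

    if (ok > 0)
        memcpy(dst, u->base + offset, ok);
    memset((unsigned char *)dst + ok, 0, n - ok);
    return n - ok;
}
```

A caller treats any non-zero return as a failure, which in a driver usually means returning -EFAULT to user space.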
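Exception is page faulting within vmalloc space: neither vmalloc() nor kmalloc() memory is swappable. The difference is in how the memory is mapped. kmalloc() returns physically contiguous memory from the kernel's permanent direct mapping, so accessing it never faults. vmalloc() builds a virtually contiguous area out of individual pages by setting up kernel page table entries, and on some architectures those entries are copied into each process's page tables lazily, so the first access to a vmalloc address can fault and be fixed up by the kernel. kmalloc() is more likely to fail when memory is short or fragmented, but it will not page fault; it returns NULL, which the caller has to handle.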
I think you are confused because you haven't clearly understood how the kernel starts up, what processes are, and how virtual memory works.
kernel threads never page fault: This is because pages in kernel space and user space are allocated differently. For kernel space, pages are allocated at initialization; for user space, they are allocated while the process runs and calls functions like malloc(), and after the mapping is set up, a page fault is triggered when that virtual memory is actually used.
Why can't kernel threads access user space? When the kernel starts, process 0 creates process 1 and process 2. Process 1 becomes the root of the user-space process tree, while process 2 manages the kernel threads. The functions you mentioned are used on behalf of user threads to transfer data into and out of the kernel, to implement operations such as opening a file or a socket.
Exception is page faulting within vmalloc space: The vmalloc space is not the function vmalloc(); it is an area of the kernel address space used for certain dynamic memory allocations, and it is treated as an exception.

How do I write to a __user memory from within the top half of an interrupt handler?

I am working on a proprietary device driver. The driver is implemented as a kernel module. This module is then coupled with a user-space process.
It is essential that each time the device generates an interrupt, the driver updates a set of counters directly in the address space of the user-space process from within the top half of the interrupt handler. The driver knows the PID and the task_struct of the user process and also knows the virtual address where the counters lie in the user-process context. However, I am having trouble figuring out how code running in interrupt context could take up the mm context of the user process and write to it. Let me sum up what I need to do:
Get the address of the physical page and offset corresponding to the virtual address of the counters in the context of the user-process.
Set up mappings in the page table and write to the physical page corresponding to the counter.
For this, I have tried the following:
Try to take up the mm context of the user-task, like below:
use_mm(tsk->mm);
/* write to counters. */
unuse_mm(tsk->mm);
This apparently causes the entire system to hang.
Wait for the interrupt to occur when our user-process was the current process. Then use copy_to_user().
I'm not much of an expert on kernel programming. If there's a good way to do this, please do advise and thank you in advance.
Your driver should be the one that maps kernel memory into the user-space process. E.g., you can implement the .mmap callback in struct file_operations for your device.
A kernel driver may write to a kernel address that it has mapped at any time (even in an interrupt handler). The user-space process immediately sees all modifications on its side of the mapping (using the address obtained from the mmap() system call).
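The effect can be imitated entirely in user space, which may help to see why no copying is needed. This is only an analogy under assumptions: a temporary file stands in for the driver's buffer, two MAP_SHARED mappings of it stand in for the kernel's view and the process's view, and all names (shared_counters, driver_view, user_view) are invented. A real driver would implement .mmap and map its own pages instead.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for fileno() and ftruncate() on glibc */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

struct shared_counters {
    volatile uint64_t *driver_view;     /* stands in for the kernel mapping */
    const volatile uint64_t *user_view; /* stands in for the mmap() result */
    size_t len;
    FILE *backing;                      /* stands in for the driver's pages */
};

/* Create one page of backing storage and map it twice. 0 on success. */
static int shared_counters_open(struct shared_counters *sc)
{
    void *w, *r;
    int fd;

    assert(sc != NULL);
    sc->len = (size_t)sysconf(_SC_PAGESIZE);
    sc->backing = tmpfile();
    if (sc->backing == NULL)
        return -1;
    fd = fileno(sc->backing);
    if (ftruncate(fd, (off_t)sc->len) != 0) {
        fclose(sc->backing);
        return -1;
    }
    w = mmap(NULL, sc->len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    r = mmap(NULL, sc->len, PROT_READ, MAP_SHARED, fd, 0);
    if (w == MAP_FAILED || r == MAP_FAILED) {
        if (w != MAP_FAILED)
            munmap(w, sc->len);
        if (r != MAP_FAILED)
            munmap(r, sc->len);
        fclose(sc->backing);
        return -1;
    }
    sc->driver_view = w;
    sc->user_view = r;
    return 0;
}

static void shared_counters_close(struct shared_counters *sc)
{
    munmap((void *)sc->driver_view, sc->len);
    munmap((void *)sc->user_view, sc->len);
    fclose(sc->backing);
}
```

An increment through driver_view[0] is visible through user_view[0] straight away, because both addresses refer to the same physical page.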
Unix's architecture frowns on interrupt routines accessing user space
because a process could (in theory) be swapped out when the interrupt occurs. 
If the process is running on another CPU, that could be a problem, too. 
I suggest that you write an ioctl to synchronize the counters,
and then have the process call that ioctl
every time it needs to access the counters.
Outside of an interrupt context, your driver will need to check the user memory is accessible (using access_ok), and pin the user memory using get_user_pages or get_user_pages_fast (after determining the page offset of the start of the region to be pinned, and the number of pages spanned by the region to be pinned, including page alignment at both ends). It will also need to map the list of pages to kernel address space using vmap. The return address from vmap, plus the offset of the start of the region within its page, will give you an address that your interrupt handler can access.
At some point, you will want to terminate access to the user memory, which will involve ensuring that your interrupt routine no longer accesses it, a call to vunmap (passing the pointer returned by vmap), and a sequence of calls to put_page for each of the pages pinned by get_user_pages or get_user_pages_fast.
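The page arithmetic mentioned above (page offset of the start, number of pages spanned including partial pages at both ends) is easy to get wrong by one, so here it is on its own as plain C. The struct pin_range and the function pin_range_for are illustrative names, not kernel API; in a driver the page size would be PAGE_SIZE and the results would feed get_user_pages_fast and vmap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pin_range {
    uintptr_t first_page; /* page-aligned address where pinning starts */
    size_t offset;        /* offset of the region within its first page */
    size_t nr_pages;      /* pages spanned, counting partial pages at both ends */
};

/* Compute what to pin for the user region [uaddr, uaddr + len). */
static struct pin_range pin_range_for(uintptr_t uaddr, size_t len,
                                      size_t page_size)
{
    struct pin_range r;

    /* page_size must be a power of two and the region must be non-empty. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    assert(len != 0);

    r.offset = (size_t)(uaddr & (page_size - 1));
    r.first_page = uaddr - r.offset;
    r.nr_pages = (r.offset + len + page_size - 1) / page_size;
    return r;
}
```

For example, a 16-byte counter block at 0x1ff8 with 4096-byte pages straddles a page boundary, so two pages have to be pinned, and the address to use in the interrupt handler is the vmap return value plus 0xff8.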
I don't think what you are trying to do is possible. Consider this situation:
(assuming how your device works)
Some function allocates the user-space memory for the counters and
supplies its address in PROCESS X.
A switch occurs and PROCESS Y executes.
Your device interrupts.
The address for your counters is inaccessible.
You need to schedule a kernel-mode asynchronous event (a bottom half) that will execute when PROCESS X is executing.

Mapping of Page allocated to user process in Kernel virtual address space

When a page is created for a process (which will be mapped into the process address space), will that page also be mapped into the kernel address space?
If not, then it won't have a kernel virtual address. How will the swapper then find the page and swap it out if the need arises?
If we're talking about the x86 or similar (in terms of page translation) architectures, at any given time there's one virtual address space and normally one part of it is reserved for the kernel and the other for user-mode processes.
On a context switch between two processes only the user-mode part of the virtual address space changes.
With such an organization, the kernel always has full access to the current user-mode process, because, again, there's only one current virtual address space at any moment for both the kernel and a user-mode process, it's not two, it's one. So, the kernel doesn't really have to have another, extra mapping for user-mode pages. But that's not the main point.
The main point is that the kernel keeps some sort of statistics for every page that if needed can be saved to the disk and reused elsewhere. The CPU marks each page's page table entry (PTE) as accessed when the page is first read from or written to and as dirty when it's first written to.
The kernel scans the PTEs periodically, reads the accessed and dirty markers to update said statistics, and clears accessed and dirty so it can detect a change in them later (if there is one, of course). Based on these statistics it determines which pages are rarely used or long unused and can be repurposed.
If the "swapper" runs in the context of the current process and if it runs in the kernel, then in theory it has enough information from the kernel (the list of rarely used or long unused pages to save and unmap if dirty or just unmap if not dirty) and sufficient access to the pages of interest.
If the "swapper" itself runs as a user-mode process, things become more complicated because it doesn't have access to another process' pages by default and has to either create a mapping or ask the kernel to do some extra work for it in the context of the process of interest.
So, finding rarely used and long unused pages and their addresses occurs in the kernel. The CPU helps by automatically marking PTEs as accessed and dirty. There may need to be an extra mapping to dirty pages if they get saved to the disk not in the context of the process that owns them.
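Here is a toy version of that scan loop, as a sketch only. The bit positions are the x86 ones (accessed is bit 5, dirty is bit 6 of a PTE); the struct page_stats, the idle counter, and the "coldest page" policy are assumptions made for the example and are far simpler than what a real kernel does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_ACCESSED 0x20u /* bit 5 of an x86 PTE, set by the CPU on access */
#define PTE_DIRTY    0x40u /* bit 6 of an x86 PTE, set by the CPU on write */

struct page_stats {
    unsigned idle_scans; /* consecutive scans that saw no access */
    int needs_writeback; /* page was seen dirty, must be saved before reuse */
};

/* One scan pass: harvest the hardware bits, update stats, clear the bits. */
static void scan_ptes(uint32_t *ptes, struct page_stats *stats, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ptes[i] & PTE_ACCESSED)
            stats[i].idle_scans = 0;
        else
            stats[i].idle_scans++;
        if (ptes[i] & PTE_DIRTY)
            stats[i].needs_writeback = 1;
        ptes[i] &= ~(PTE_ACCESSED | PTE_DIRTY);
    }
}

/* Index of the page that has been idle longest: the best one to repurpose. */
static size_t coldest_page(const struct page_stats *stats, size_t n)
{
    size_t best = 0;

    assert(n > 0);
    for (size_t i = 1; i < n; i++)
        if (stats[i].idle_scans > stats[best].idle_scans)
            best = i;
    return best;
}
```

A page chosen by coldest_page() is simply unmapped if needs_writeback is 0 and is saved to disk first otherwise, which matches the description above.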

Why is kernel said to be in process address space?

This might be a silly question but it just popped up in my mind. All the text about process address space and virtual memory layout mentions that the process address space has space reserved for the kernel. E.g., on 32-bit systems the process address space is 4 GB, of which 1 GB is reserved for the kernel in Linux (it might be different on other OSes).
I am just wondering why the kernel is said to be in the process address space when a process cannot address the kernel directly. Why don't we say that the kernel has an address space separate from that of a process, and why can't we have a page table for the kernel itself which is separate from the page tables of the processes?
When the process makes a system call, we don't need to switch page tables (from the process address space's page tables to separate kernel page tables) to service the system call (which can only be done in kernel mode). This is what is meant by saying that the kernel runs in process context.
Kernel code that does not run in process context relies only on the kernel part of the page tables.
Got it?
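The split from the question can be written down in a few lines. This sketch assumes the default 3 GB/1 GB layout of 32-bit x86 Linux, where the kernel part starts at 0xC0000000; the names MODEL_PAGE_OFFSET, is_kernel_address and access_allowed are made up, and features like SMAP or PAN that restrict the kernel further are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed split: 3 GB user part, 1 GB kernel part (32-bit x86 default). */
#define MODEL_PAGE_OFFSET 0xC0000000u

/* Does this virtual address fall into the kernel part of the address space? */
static bool is_kernel_address(uint32_t vaddr)
{
    return vaddr >= MODEL_PAGE_OFFSET;
}

/*
 * Same page tables for both modes; only the permission check differs.
 * User mode may touch the user part only, kernel mode may touch both.
 */
static bool access_allowed(uint32_t vaddr, bool in_kernel_mode)
{
    assert(in_kernel_mode == true || in_kernel_mode == false);
    return in_kernel_mode || !is_kernel_address(vaddr);
}
```

So the kernel part is "in" every process address space in the sense that it is mapped there, even though user-mode code gets a fault if it tries to use it.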

Kernel mode transition

If I understand correctly, a memory address in system space is accessible only from kernel mode. Does that mean the processor must be switched to kernel mode whenever components mapped in system space are executed?
For example, the virtual memory manager is a frequently used component and is mapped in system space. Whenever the VMM runs in the context of a user process (let's say it translates an address), must the processor be switched to kernel mode?
Thanks,
Suresh.
Typically, there are two parts involved: the MMU (memory management unit), which is a hardware component that translates virtual addresses to physical addresses, and the operating system's VM subsystem.
The operating system part needs to run in privileged mode (a.k.a. kernel mode) and sets up or changes the mapping in the MMU based on what user space needs.
E.g. to request more (virtual) memory, or to map a file into memory, a transition to kernel mode is needed so that the VM subsystem can change the mapping of the process.
Around this there are often a ton of tricks to be played, e.g. mapping the whole address space of the kernel into the user process's virtual address space but changing its access rights so the process can't use that memory. This means that whenever you transition to kernel mode you don't need to reload the mapping for the kernel.
Taking your example of the virtual memory manager, it never actually runs in user space. To allocate memory, user mode applications make calls to the Win32 API (NTDLL.DLL as one example) to routines such as VirtualAlloc.
With regards to address translation, here's a summary of how it works (based on the content from Windows Internals 5th Edition).
The VMM maintains page tables, which the CPU uses to translate virtual addresses to physical addresses. The page tables live in system space. Each table contains many PTEs (page table entries), each of which stores the physical address to which a virtual address is mapped. I won't go into too much detail here, but the point is that all of the VMM's work is performed in system space and not in user space.
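To make the translation step less abstract, here is a software model of the lookup the MMU performs. It assumes the classic 32-bit x86 layout without PAE (10 bits of directory index, 10 bits of table index, 12 bits of page offset); the struct and function names are invented for the example, and real hardware walks physical memory rather than C pointers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRIES     1024u
#define PTE_PRESENT 0x1u
#define FRAME_MASK  0xFFFFF000u

struct page_table { uint32_t pte[ENTRIES]; };
struct page_directory { struct page_table *tables[ENTRIES]; };

/* Install a mapping from the page containing vaddr to the given frame. */
static void map_page(struct page_directory *pd, struct page_table *pt,
                     uint32_t vaddr, uint32_t frame)
{
    assert(pd != NULL && pt != NULL);
    pd->tables[vaddr >> 22] = pt;
    pt->pte[(vaddr >> 12) & 0x3FFu] = (frame & FRAME_MASK) | PTE_PRESENT;
}

/* Walk the two levels; false means the access would page fault. */
static bool translate(const struct page_directory *pd, uint32_t vaddr,
                      uint32_t *paddr)
{
    uint32_t dir = vaddr >> 22;             /* top 10 bits */
    uint32_t tbl = (vaddr >> 12) & 0x3FFu;  /* middle 10 bits */
    uint32_t off = vaddr & 0xFFFu;          /* low 12 bits */
    const struct page_table *pt = pd->tables[dir];

    if (pt == NULL || !(pt->pte[tbl] & PTE_PRESENT))
        return false;
    *paddr = (pt->pte[tbl] & FRAME_MASK) | off;
    return true;
}
```

The operating system's part of the job is map_page(), done in kernel mode; translate() is what the hardware does on every memory access without any mode switch.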
As for context switching - when a thread running in user space needs to run in system space, a context switch will occur. Since the memory manager lives in system space, its threads never need to make a context switch, because they already run in system space.
Apologies for the simplistic explanation, this is quite a complicated topic of discussion in depth. I would highly recommend that you pick up a copy of Windows Internals as this sounds like it would come in handy for you.
