Are kernel virtual memory pages swappable? - linux-kernel

Each user-level process has its own virtual memory space whose pages are swapped in and out. Are the Linux kernel's virtual memory pages likewise swappable?

Kernel-space pages are never paged in or out by design; they are pinned in memory. The pages in the kernel can usually be trusted from a security point of view, while user-space pages should NOT be trusted.
For this reason you can access kernel buffers directly in your code without worrying about handling page faults; the same is not true of user-space buffers.
Kernel-space pages cannot be paged out by design: consider what your system would do if the page containing the instructions for handling a page fault had itself been paged out!
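To make the consequence concrete, here is a minimal, hypothetical kernel-module sketch (all names invented, not from the thread above): because kmalloc() memory is pinned, it is safe to touch even in a context where a page fault could not be serviced, such as while holding a spinlock.

    #include <linux/slab.h>
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(demo_lock);

    static void touch_pinned_buffer(void)
    {
        char *buf = kmalloc(128, GFP_KERNEL);   /* pinned kernel memory */
        if (!buf)
            return;

        spin_lock(&demo_lock);
        buf[0] = 1;   /* safe: kernel pages are never paged out, so no fault here */
        spin_unlock(&demo_lock);

        kfree(buf);
    }

A user-space pointer could not be dereferenced inside the locked region this way, since servicing a fault there could sleep.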

No, kernel memory is not swapped on Linux.

Related

Page Fault in Linux Kernel

I have a few questions after reading Mel Gorman's book Understanding the Linux Virtual Memory Manager. Section 4.3, Process Address Space Descriptor, says kernel threads never page fault or access the user space portion; the only exception is page faulting within the vmalloc space. Following are my questions.
kernel threads never page fault: Does this mean only user-space code triggers page faults? If kmalloc() or vmalloc() is called, will it not page fault? I believe the kernel has to map these to anonymous pages, and when a write to those pages is performed, a page fault occurs. Is my understanding correct?
Why can't kernel threads access user space? Don't copy_to_user() and copy_from_user() do exactly that?
Exception is page faulting within vmalloc space: Does that mean vmalloc() triggers a page fault and kmalloc() doesn't? Why does kmalloc() not page fault? Do the physical frames backing the kernel's virtual addresses not need to be kept as page table entries?
kernel threads never page fault: The page fault being talked about here is the one taken when making a virtual page resident, or bringing it back from swap. Kernel pages not only get paged in on kmalloc(), but also remain resident for their lifetime. The same does not hold for user-space pages, which a) may be lazily allocated, i.e. just reserved as page table entries on malloc() but not actually faulted in until a memset() or other dereference (see the sketch below), and b) may be swapped out under low-memory conditions.
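A quick user-space experiment (a sketch, assuming Linux and glibc; the size is arbitrary) makes point a) visible: the minor-fault counter jumps when the malloc()'d region is first touched, not when it is allocated.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rusage before, after;
        size_t len = 64 * 1024 * 1024;

        char *p = malloc(len);        /* address space reserved, pages not yet present */
        if (!p)
            return 1;

        getrusage(RUSAGE_SELF, &before);
        memset(p, 1, len);            /* first touch faults each page in */
        getrusage(RUSAGE_SELF, &after);

        printf("minor faults during memset: %ld\n",
               after.ru_minflt - before.ru_minflt);
        free(p);
        return 0;
    }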
Why can't kernel threads access user space? Don't copy_to_user() and copy_from_user() do exactly that?
That's a great question, with a hardware-specific reply. It used to be the case that kernel threads were discouraged from accessing user space precisely because of the page fault that might be taken when touching unpaged or paged-out memory in user space (recall that this cannot happen in kernel space, as explained above). So copy_to_user()/copy_from_user() would be a normal memcpy, but wrapped in a page fault handler: any potential page fault would be handled transparently (i.e. the memory would be paged in) and all would be well. But there were certainly cases where the bad approach of a raw memcpy to/from user memory would just work - worse, it would work more often than not, since page faults vary with RAM residency and availability - and the unhandled faults would cause random panics. Hence the decree of always using copy_from/to_user.
Recently, however, kernel/user memory isolation became important from a security standpoint. This is due to many exploitation techniques (NULL pointer dereferencing being a very common and powerful one), where fake kernel objects (or code) could be constructed in user-space memory (which is easily controlled) and could lead to code execution in the kernel.
Most architectures thus have a page table bit which physically prevents a page belonging to user mode from being accessed by the kernel. Taking ARM64 as an example, this feature is called PAN/PXN (Privileged Access Never / Privileged Execute Never).
Thus, copy_from/to_user now not only handle page faults, but also disable PAN/PXN before the operation and restore it after.
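Putting the above together, a driver's write handler typically looks like this minimal sketch (names are illustrative): copy_from_user() validates the user pointer, transparently handles any fault, and on hardware with PAN (or the x86 analogue, SMAP) briefly permits the privileged access.

    #include <linux/fs.h>
    #include <linux/uaccess.h>

    static char kbuf[256];

    static ssize_t demo_write(struct file *filp, const char __user *ubuf,
                              size_t len, loff_t *off)
    {
        if (len > sizeof(kbuf))
            len = sizeof(kbuf);

        /* Returns the number of bytes that could NOT be copied. */
        if (copy_from_user(kbuf, ubuf, len))
            return -EFAULT;

        return len;
    }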
Exception is page faulting within vmalloc space: vmalloc() returns virtually contiguous memory whose mappings are created lazily: they are installed in the master kernel page tables and only synced into a process's page tables on the first faulting access, which is why vmalloc addresses can fault. kmalloc() hands out memory from the kernel's direct mapping (allocated with GFP_KERNEL), which is always present, so it never faults. This also means that kmalloc() is more likely to fail if no suitable RAM is available: it returns NULL rather than faulting (which would itself be a problem..).
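The contrast can be sketched in a few lines of hypothetical module code; note that both allocators report failure by returning NULL rather than by faulting:

    #include <linux/slab.h>
    #include <linux/vmalloc.h>

    static int alloc_demo(void)
    {
        void *k = kmalloc(4096, GFP_KERNEL); /* physically contiguous, from the always-present direct map */
        void *v = vmalloc(1 << 20);          /* virtually contiguous; mapping synced to page tables lazily */

        if (!k || !v) {
            kfree(k);                        /* both are NULL-safe */
            vfree(v);
            return -ENOMEM;
        }

        kfree(k);
        vfree(v);
        return 0;
    }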
I think you are confused because you don't yet have a clear picture of kernel start-up, processes, and virtual memory.
kernel threads never page fault: This is because kernel space and user space use different allocation methods. Kernel-space pages are allocated at initialization, but user-space pages are allocated while the process runs, by calling functions like malloc(); after the mapping is set up, a page fault is triggered only when that virtual memory is actually used.
Why can't kernel threads access user space? When the kernel starts, process 0 creates process 1 and process 2. Process 1 is the root of the user-space process tree, while process 2 manages the kernel threads. The functions you mention are used by user threads to move data into and out of the kernel, to implement things like opening a file or a socket.
Exception is page faulting within vmalloc space: The vmalloc space is not the function vmalloc(); it is an area of the kernel memory space reserved for dynamic memory allocation, and it is handled as an exception.

TLB Hit - Checking if the page is within the process's memory space

I have been reading about the translation of virtual addresses to physical addresses. I understand that the TLB is a hardware cache that resides in the CPU's memory management unit and contains mappings for recently accessed pages.
However, say there is a TLB hit: how does the OS ensure that the page can actually be accessed by the process (i.e. is within the process's allocated address space)?
I believe one way to do that would be to check the process's page table, but that seems to defeat the whole purpose of using a TLB. Any insights?
It depends upon the memory management strategy the OS is using. For example, when the OS uses an inverted page table, each entry contains the id of the process (PID) that owns the page.
For "normal" paging, each page table entry may contain extra bits for memory protection and sharing.
At a basic level the TLB only contains translations for pages that are in RAM, and the OS flushes the relevant TLB entries whenever a page is removed from RAM.
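As a conceptual model (not real hardware code; all names are invented), a TLB entry caches the protection bits and an address-space tag alongside the translation, so a hit can be permission-checked without walking the page table:

    #include <stdbool.h>
    #include <stdint.h>

    struct tlb_entry {
        uint64_t vpn;       /* virtual page number */
        uint64_t pfn;       /* physical frame number */
        uint16_t asid;      /* address-space (process) tag */
        bool     valid;
        bool     user_ok;   /* accessible from user mode? */
        bool     writable;
    };

    /* A hit counts only if the ASID matches and the protection bits
     * allow the access; otherwise the hardware faults and the OS
     * takes over. */
    static bool tlb_permits(const struct tlb_entry *e, uint64_t vpn,
                            uint16_t asid, bool user, bool write)
    {
        return e->valid && e->vpn == vpn && e->asid == asid &&
               (!user || e->user_ok) && (!write || e->writable);
    }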

Windows kernel memory protection

In Windows, the high part of every process's address space (above 0x80000000 or 0xc0000000) is reserved for kernel code. User code cannot access these regions of memory; if it tries, an access violation exception is thrown.
I wish to know how the kernel space is protected.
Is it via memory segmentation or via paging?
I would like to hear a technical explanation.
Thanks a lot,
Michael.
Assuming you are talking about x86 and x64 architectures.
Memory protection is achieved using the paging system. Each page table entry on an x86/x64 CPU has a bit to indicate whether it is a user or supervisor page. Accesses to supervisor pages are only permitted for code running with CPL<3, whereas accesses to non-supervisor pages are possible regardless of CPL.
CPL is the "Current Privilege Level" which is sometimes referred to as Ring. Windows only uses two rings, although the CPU implements 4. Ring 0 is the CPU mode in which what Windows refers to as "kernel mode" runs. Ring 3 is the CPU mode in which "User mode" runs. Since code running at CPL=3 cannot access supervisor pages, this is how memory protection is implemented.
The answer for ARM is likely to be similar, but different.
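For x86/x64 the relevant bits are architecturally defined; this small sketch (the helper name is invented) shows the check the MMU effectively performs:

    #include <stdbool.h>
    #include <stdint.h>

    #define PTE_PRESENT (1ULL << 0)
    #define PTE_WRITE   (1ULL << 1)
    #define PTE_USER    (1ULL << 2)  /* 0 = supervisor-only, 1 = user-accessible */

    static bool access_allowed(uint64_t pte, unsigned cpl)
    {
        if (!(pte & PTE_PRESENT))
            return false;            /* not mapped at all */
        if (cpl == 3 && !(pte & PTE_USER))
            return false;            /* ring 3 touching a supervisor page */
        return true;
    }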
That's an easy one and doesn't require talking about rings or kernel behavior. Accessing virtual memory at a particular address requires that address to be mapped; the operating system has to allocate a memory page for that address. The low-level winapi function that does this is VirtualAlloc(), which takes an optional address as its first argument. The OS will simply fail a request for an unmappable address. It is otherwise the exact same mechanism that prevents you from mapping any address in the lowest 64KB of the address space.
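A small illustration of that failure mode (assuming a default 32-bit Windows configuration, where 0x80000000 and above is kernel space):

    #include <stdio.h>
    #include <windows.h>

    int main(void)
    {
        /* Ask for a page at a kernel-space address; the OS refuses. */
        void *p = VirtualAlloc((void *)0x80000000, 4096,
                               MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        if (p == NULL)
            printf("VirtualAlloc failed, error %lu\n", GetLastError());
        return 0;
    }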

How does management of page table entries (PTEs) differ between kernel space and user space?

In Linux, after the page tables are enabled, does the kernel map the PTEs belonging to kernel space only once and never remap them again? Is this the opposite of the PTEs in user space, which need to be remapped every time a process switch happens?
So, I want to know the difference in the management of PTEs in kernel space and user space.
This question extends the question at:
Page table in Linux kernel space during boot
Each process has its own page tables (although the parts that describe the kernel's address space are the same and are shared.)
On a process switch, the CPU is told the address of the new table (this is a single pointer which is written to the CR3 register on x86 CPUs).
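Conceptually the switch is tiny; here is a sketch of the x86-64 primitive (valid only in kernel context, of course; the real kernel wraps this in switch_mm()):

    #include <stdint.h>

    /* Point the MMU at another process's top-level page table.
     * One register write swaps the whole user address space; the
     * kernel half of every PGD maps the same lower-level tables. */
    static inline void load_page_table(uint64_t pgd_phys)
    {
        asm volatile("mov %0, %%cr3" : : "r"(pgd_phys) : "memory");
    }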
So, I want to know the difference in the management of PTEs in kernel space and user space.
See these related questions,
Does Linux use self map for page tables?
Linux Virtual memory
Kernel developer on memory management
Position independent code and shared libraries
There are many optimizations to this:
Each task has a different PGD, but PTE values may be shared between processes, so large chunks of memory can be mapped the same for each process; only the top-level directory (CR3 on x86, TTB on ARM) is updated.
Also, many CPUs have a TLB and caches, which must be kept consistent with the memory mapping. Caches may be VIVT, VIPT or PIPT; the first two need some cache flushing if the PGD and/or PTEs change. Often a CPU supports a process, thread or domain id: the OS then only needs to switch this register during a context switch, and the hardware cache and TLB entries carry tags with that id. This is an implementation detail of each architecture.
So TLB flushes may be needed when the top-level page table register changes. The CPU could flush the entire TLB when that happens, but this would penalize pages that remain mapped.
Also, sub-sections of memory can be the same across processes. A loader or other library can use mmap to create code that is shared between processes (see the sketch after this answer). Depending on the architecture, loader and Linux version, this common code may not need to be switched at the page table level; it could of course live at a virtual alias, in which case it does need to be switched.
And the final point of the answer: kernel pages are always mapped. Only a non-preemptive OS could leave the kernel unmapped, but that would make little sense, as every process wants to call into the kernel. I guess the micro-kernel paradigm allows device drivers to be unloaded when they are not in use; Linux uses module loading to handle this.
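The mmap-sharing point above can be seen from user space with a short sketch (the file choice is arbitrary): every process that maps the same file read-only shares the same page cache pages, and only the per-process page table entries differ.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/bin/ls", O_RDONLY);  /* in practice, a shared library */
        if (fd < 0)
            return 1;

        void *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED)
            return 1;

        printf("mapped at %p\n", p);         /* backed by shared page cache pages */
        munmap(p, 4096);
        close(fd);
        return 0;
    }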

How does remap_pfn_range remap kernel memory to user space?

The remap_pfn_range function (used in a driver's mmap call) can be used to map kernel memory to user space. How is it done? Can anyone explain the precise steps? Kernel mode is a privileged mode (PM) while user space is non-privileged (NPM). In PM the CPU can access all memory, while in NPM some memory is restricted and cannot be accessed by the CPU. When remap_pfn_range is called, how does that range of memory, previously restricted to PM, become accessible to user space?
Looking at the remap_pfn_range code, there is a pgprot_t struct. This is the protection-mapping related struct. What is protection mapping? Is it the answer to the above question?
It's simple really: kernel memory (usually) simply has a page table entry with the architecture-specific bit that says: "this page table entry is only valid while the CPU is in kernel mode".
What remap_pfn_range does is create another page table entry, with a different virtual address to the same physical memory page that doesn't have that bit set.
Usually, it's a bad idea btw :-)
The core of the mechanism is the MMU's page table:
[figure: x86 page table translation - http://windowsitpro.com/content/content/3686/figure_01.gif]
[second figure not recoverable]
Both pictures above are characteristics of the x86 hardware MMU and have nothing to do with the Linux kernel as such.
Below is how the VMAs are linked to the process's task_struct:
[figure: VMAs and task_struct - http://image9.360doc.com/DownloadImg/2010/05/0320/3083800_2.gif (source: slideplayer.com)]
And looking into the function itself here:
http://lxr.free-electrons.com/source/mm/memory.c#L1756
The data in physical memory can be accessed by the kernel through the kernel's PTE, as shown below:
[figure: kernel PTE mapping of physical memory (source: tldp.org)]
But after calling remap_pfn_range(), a PTE for the existing kernel memory is derived for use from user space, with different page protection flags. The process's VMA is updated to use this PTE to access the same memory, which avoids wasting memory on copying. Kernel and user-space PTEs carry different attributes, which control access to the physical memory, and the VMA also specifies the attributes at the process level:
vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
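For context, here is a minimal, hypothetical mmap handler built around remap_pfn_range(); demo_buf stands in for a page-aligned buffer the driver is assumed to have allocated elsewhere:

    #include <linux/io.h>
    #include <linux/mm.h>

    static void *demo_buf;   /* assumed: kmalloc'ed, page-aligned, sized for the mapping */

    static int demo_mmap(struct file *filp, struct vm_area_struct *vma)
    {
        unsigned long size = vma->vm_end - vma->vm_start;
        unsigned long pfn  = virt_to_phys(demo_buf) >> PAGE_SHIFT;

        vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;

        /* Creates the user-space PTEs for the same physical pages,
         * with protections taken from vma->vm_page_prot. */
        if (remap_pfn_range(vma, vma->vm_start, pfn, size, vma->vm_page_prot))
            return -EAGAIN;

        return 0;
    }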
