When and how CPL field changes? - linux-kernel

The processor maintains the current privilege level in the CPL field. I want to know all the scenarios in which the CPL field changes from 3 to 0 and vice versa. For example, the CPL field might change from 3 to 0 when a user process invokes a system call.
Moreover, please try to elaborate on what goes on inside the kernel/CPU before the CPL field is changed.
Note: I have read a few posts explaining how protection is enforced by the CPU using CPL, RPL and DPL. What I am unable to understand is when and how the CPL changes.

This is a pretty in-depth question. The answer depends on which kernel you're looking at. Typically, CPL is only going to change during context switches (probably the initial switch from kernel to userspace) and during system calls.
The kernel needs to have usermode (CPL 3) segments set up in the Global Descriptor Table. Segment selectors (CS, DS, ES, FS, GS) are then set to the CPL=3 segment values.
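To make the point about selectors concrete: on x86 the CPL is held in the low two bits of the selector currently loaded in CS, so the CPL changes when CS is reloaded with a selector of a different privilege level. The sketch below only packs and unpacks a selector value; it does not touch any real segment register. The struct and function names are made up for this example, and the values 0x10 and 0x33 mentioned afterwards are assumed to be the kernel and user code selectors of an x86-64 Linux build.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the three fields packed into an x86 segment selector. */
struct selector_fields {
    unsigned index; /* bits 15..3: descriptor index in the GDT or LDT */
    unsigned ti;    /* bit 2: table indicator, 0 = GDT, 1 = LDT */
    unsigned rpl;   /* bits 1..0: privilege level; for the selector in CS this is the CPL */
};

/* Split a 16-bit selector into its fields. */
static struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;
    f.index = sel >> 3;
    f.ti    = (sel >> 2) & 1u;
    f.rpl   = sel & 3u;
    return f;
}

/* Inverse of decode_selector, with a range check on each field. */
static uint16_t encode_selector(unsigned index, unsigned ti, unsigned rpl)
{
    assert(index < 8192 && ti <= 1 && rpl <= 3);
    return (uint16_t)((index << 3) | (ti << 2) | rpl);
}
```

Under those assumptions, decode_selector(0x33) yields GDT index 6 with privilege level 3 (user mode), while decode_selector(0x10) yields GDT index 2 with privilege level 0 (kernel mode).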
Here is a great reference: http://duartes.org/gustavo/blog/post/cpu-rings-privilege-and-protection/
Also take a look at the Intel manuals. https://www-ssl.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html (Specifically Vol 3A, Page 5-7 is what you're looking for)

Related

ARM Linux: PTE not writable but dirty

I am aware that the ARM architecture emulates Linux's young and dirty flags by setting them in page fault handlers, as discussed here. But recently, for a small binary, I observed that a Linux PTE in one of the anonymous segments was set to be not writable and dirty. The following Linux PTE state was observed:
- L_PTE_PRESENT : 1
- L_PTE_YOUNG : 1
- L_PTE_DIRTY : 1
- L_PTE_RDONLY : 1
- L_PTE_XN : 0
I couldn't find an explanation for this combination of PTE flags. Does the kernel set this combination for special anonymous VMA segments? What does this combination signify? Any pointers will be helpful. Thanks in advance.
I observed that a Linux PTE in one of the anonymous segments was set to be not writable and dirty... What does this combination signify?
TL;DR - This simply means that the page is not in a backing store and it is read-only.
Dirty just means not written to a backing store (swap, mmap file or inode). Many things such as code are always read from a file, so they are backed by an inode.
If you mmap some read-only memory, then you could get this combination, for example. Other possibilities are a stack guard, allocator run-time buffer overflow detection, and copy-on-write functionality.
These are not normal. For a typical allocation, you will have something backed by swap and only a write will cause the page to become dirty. So the case is probably less frequent but valid.
See: ARM Linux PTE bits
ARM Linux emulate dirty/accessed
There seems to be little documentation on what the young bit means. young is information about what to swap: it records that the page has been accessed recently. If a page has not been accessed for a prolonged time, so that its young bit stays clear, it is a good candidate to evict. In contrast, dirty says whether the page needs to be written out. If a page is dirty, it has not been written to a backing store (a swap file, an mmap'ed file, etc.), and the pager must write it out before reusing the memory. If it is not dirty (that is, clean), the pager can simply discard the page and reuse the memory.
The difference between young and dirty is like should and must.
- L_PTE_PRESENT : 1 - it has physical RAM (not swapped)
- L_PTE_YOUNG : 1 - it has been accessed recently
- L_PTE_DIRTY : 1 - it is different than backing store
- L_PTE_RDONLY : 1 - user space can not write.
- L_PTE_XN : 0 - code can execute.
Not present and dirty, for instance, looks like an impossible combination, but dirty and read-only is valid.
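The "should versus must" distinction can be sketched as two predicates over a PTE value. This is only an illustration: the bit positions below are assumptions modelled on the 2-level ARM layout (the real definitions live in the kernel's ARM pgtable headers and depend on version and configuration), and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions for this sketch; check your kernel's headers. */
#define L_PTE_PRESENT (1u << 0)
#define L_PTE_YOUNG   (1u << 1)
#define L_PTE_DIRTY   (1u << 6)
#define L_PTE_RDONLY  (1u << 7)
#define L_PTE_XN      (1u << 9)

/* "Must": a present, dirty page differs from its backing store and
 * has to be written out before the memory can be reused. */
static bool must_write_back(uint32_t pte)
{
    return (pte & L_PTE_PRESENT) && (pte & L_PTE_DIRTY);
}

/* "Should": a present page that is not young has not been touched
 * recently, so it is a reasonable candidate for eviction. */
static bool is_eviction_candidate(uint32_t pte)
{
    return (pte & L_PTE_PRESENT) && !(pte & L_PTE_YOUNG);
}

/* Read-only is independent of dirty: it only limits user-space writes. */
static bool user_writable(uint32_t pte)
{
    return (pte & L_PTE_PRESENT) && !(pte & L_PTE_RDONLY);
}
```

For the state in the question (present, young, dirty, read-only), must_write_back is true while is_eviction_candidate and user_writable are both false, which is a consistent combination.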

Linux kernel ARM Translation table base (TTB0 and TTB1)

Compiled Linux kernel 2.6.34.3 for ARMv7 (Cortex-a8)
I looked into the kernel code and it looks like the Linux kernel sets up the hardware page tables for the kernel address space (everything over 0xC0000000) on TTBR1 (translation table base register 1) and those for the user process (everything under 0xC0000000) on TTBR0, which changes on every process context switch. Is this correct? I'm still confused about how the MMU knows which TTBR to look at for translations.
I read that the TTBCR (translation table base control register) determines which of the TTBR registers to walk when an MVA is not found; however, the register always reads 0 which, according to the ARM Architecture Reference Manual, means always use TTBR0. How is that possible? Can anyone explain to me how the Linux kernel uses these two TTBRs?
I read how the TTBRs work on this site https://www.cs.rutgers.edu/~pxk/416/notes/10-paging.html but I still don't understand how the kernel uses the two of them.
(Double-checked the kernel code: for some reason both TTBR0 and TTBR1 are set, but it seems like TTBR1 is never used. I set the TTBR1 register to 0 and the Linux kernel continued to run as usual.)
The TTBR registers are used together to determine addressing for the full 32-bit or 40-bit address space. Which register is used for what address ranges is controlled via the tXsz bits in the TTBCR. There is an entry for t0sz corresponding to TTBR0 and t1sz for TTBR1.
The page tables addressed by each TTBRx register are independent, but you typically find most Linux implementations just use TTBR0. Linux expects to be able to use a 3G/1G address space partitioning scheme, which is not supported by ARM. If you look at page B3-1345 of the ARMv7 Architecture Reference Manual, you'll see that the values of t0sz and t1sz determine the address ranges supported by TTBR0 and TTBR1 respectively. To add confusion to disorientation, it is even possible to have disjoint address spaces where TTBR0 and TTBR1 support ranges that are not contiguous, resulting in a hole in the system address space. Good times!
To answer your main question though, it is recommended by ARM that TTBR0 be used to store the offset to the page tables used by USER processes, and TTBR1 be used to store the offset to the page tables used by the KERNEL. I have yet to see a single implementation that actually does this. Almost exclusively TTBR0 is used in all cases, with TTBR1 containing a duplicate copy of the L1 tables.
So how does this work? The value of TTBR is stored as part of the process state and simply restored each time a process is switched in. This is how it is expected to work. Originally, TTBR1 would hold a constant value for the kernel tables and never be replaced or swapped out, whereas TTBR0 would be changed each time you context switch between processes. Apparently most Linux implementations for ARM have decided to just basically eliminate the use of TTBR1 and stick to using TTBR0 for everything.
If you want to test this theory on your device, try whacking TTBR1 and watch nothing happen. Then try whacking TTBR0 and watch your system crash. I've yet to encounter a single instance that didn't produce exactly this result. Long story short, TTBR1 is left unused by Linux, and TTBR0 is used almost exclusively and simply swapped out.
Now, once you get to LPAE support, throw all this away and start over again. This is the implementation where you will start to see the value of t0sz and t1sz being something other than zero, and hence N as well.
I have very little knowledge about the ARM architecture, but from what I read in your enclosed link, I guess Linux implements its virtual-memory management this way:
High-order bits of the virtual address determine which one to use. The base of the table is stored in one of two base registers (TTBR0 or TTBR1), depending on whether the topmost n bits of the virtual address are 0 (use TTBR0) or not (use TTBR1). The value for n is defined by the Translation Table Base Control Register (TTBCR).
The TTBCR register tells which addresses will be translated through the page tables pointed to by TTBR0 or TTBR1. If TTBCR contains 0xc0000000, then any address from 0 to 0xbfffffff is translated by the page table pointed to by TTBR0, and any address from 0xc0000000 to 0xffffffff is translated by the page table pointed to by TTBR1. That would match the Linux memory split of 3GB for the user process / 1GB for the kernel.
This allows one to have a design where the operating system and memory-mapped I/O are located in the upper part of the address space and managed by the page table in TTBR1 and user processes are in the lower part of memory and managed by the page table in TTBR0. On a context switch, the operating system has to change TTBR0 to point to the first-level table for the new process. TTBR1 will still contain the memory map for the operating system and memory-mapped I/O.
Hence, the value of TTBR1 should never change because you want the kernel to be permanently mapped (think of what happens when an interrupt is raised). On the other hand, TTBR0 is modified at every process-switch, it contains the page-table of the current process.
See http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0211k/Bihgfcgf.html
For ARM5 and lower the TTB table is fixed in size and alignment (to 16k). Each level 1 entry represents 1MB. The table entry is 32bits (16k*1M/(32bit/8) = 4GB). The TTBCR controls TTBR0 table size. From the above URL,
Selecting which Translation Table Base Register is used
The Translation Table Base Register is selected as follows:
If N = 0, always use Translation Table Base Register 0.
- This is the default case at reset. It is backwards compatible with ARMv5 or earlier processors.
If N is greater than 0, then:
- if bits [31:32-N] of the Virtual Address are all 0, use Translation Table Base Register 0 otherwise use Translation Table Base Register 1.
So the size of the TTBR0 table also sets the memory split. A value of 2 gives a 1G/3G split (not the traditional Linux 3G/1G): 4kB table == 1G memory == bits 31..30 are zero. For a value of 6 the table is 256 bytes == 64MB == bits 31..26 are zero.
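The selection rule and the table-size arithmetic quoted above can be written out as a small sketch. It models only the rule as quoted (N = 0 always selects TTBR0, otherwise the top N bits of the virtual address decide), assumes the short-descriptor format with 4-byte first-level entries that each map 1MB, and uses function names invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Which TTBR translates va for a given TTBCR.N (0..7)? Returns 0 or 1. */
static int select_ttbr(uint32_t va, unsigned n)
{
    assert(n <= 7);
    if (n == 0)
        return 0;                 /* N = 0: always TTBR0 */
    /* Bits [31:32-N] all zero -> TTBR0, otherwise TTBR1. */
    return (va >> (32 - n)) == 0 ? 0 : 1;
}

/* Size in bytes of the first-level table behind TTBR0: 16KB >> N. */
static uint32_t ttbr0_table_bytes(unsigned n)
{
    assert(n <= 7);
    return (16u * 1024u) >> n;
}

/* Address range covered by TTBR0: one 1MB section per 4-byte entry. */
static uint64_t ttbr0_span_bytes(unsigned n)
{
    return (uint64_t)(ttbr0_table_bytes(n) / 4u) << 20;
}
```

With N = 2, ttbr0_span_bytes returns 1GB, so 0x3FFFFFFF still goes through TTBR0 and 0x40000000 already goes through TTBR1. Every possible boundary is a power of two (4GB, 2GB, 1GB, 512MB, ...), which is why a 3G/1G split with TTBR0 covering the lower 3GB cannot be expressed this way.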
In Linux parlance these are page global directory entries (and this splits the page global directory). An entry can point to another table or simply map a 1MB section. In Linux's scheme the next level is the page middle directory, followed by the final page table entries. I think the page middle directory is unused on ARM.
The MMU hardware doesn't walk the tables every time. There is a TLB (translation lookaside buffer). It is like a cache for the MMU tables. When the OS updates these tables, the TLB must be flushed or the processor will use stale entries. Similarly, the ARM cache is virtually tagged, so changing the mapping may also mean the cache must be flushed. For these reasons, you want to change as little as possible on a context switch. Shared library text (say libc.so) should be the same across a context switch. Hopefully each process has libc.so mapped at the same virtual address. There is a big gain in doing this: lower memory use and good I-cache use.
The domain and PID registers as well as supervisor/user modes can also control memory accesses. These are single registers that can be toggled on a context switch.
See http://lwn.net/images/conf/rtlws11/papers/proc/p01.pdf for info on PID and domain use on the ARMV5. The current Linux source doesn't do exactly like the paper describes. It is entirely possible that Linux doesn't need to use this mechanism and sets the TTBCR to zero so that the VM code for ARM sub-architectures is similar.
Edit: I don't believe the TTBCR functionality can be used to achieve a 3G/1G split. I think the Rutgers page was discussing the TTBCR generically and not in the Linux context. Also, at least the 2.6.38 Linux kernel uses domains (the DACR) but does not use the PID or FCSE, as that mechanism supports only a limited number of processes.
http://lwn.net/Articles/106177/ - also referenced on the Rutgers page.
The TTBR0 holds the base address of translation table 0, and information about the memory it occupies.
This is one of the translation tables for the stage 1 translation of memory accesses from modes other than Hyp mode

Change user space memory protection flags from kernel module

I am writing a kernel module that has access to a particular process's memory. I have done an anonymous mapping on some of the user space memory with do_mmap():
#define MAP_FLAGS (MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS)
prot = PROT_WRITE;
retval = do_mmap(NULL, vaddr, vsize, prot, MAP_FLAGS, 0);
vaddr and vsize are set earlier, and the call succeeds. After I write to that memory block from the kernel module (via copy_to_user), I want to remove the PROT_WRITE permission on it (like I would with mprotect in normal user space). I can't seem to find a function that will allow this.
I attempted unmapping the region and remapping it with the correct protections, but that zeroes out the memory block, erasing all the data I just wrote; setting MAP_UNINITIALIZED might fix that, but, from the man pages:
MAP_UNINITIALIZED (since Linux 2.6.33)
Don't clear anonymous pages. This flag is intended to improve performance on embedded
devices. This flag is only honored if the kernel was configured with the
CONFIG_MMAP_ALLOW_UNINITIALIZED option. Because of the security implications, that option
is normally enabled only on embedded devices (i.e., devices where one has complete
control of the contents of user memory).
so, while that might do what I want, it wouldn't be very portable. Is there a standard way to accomplish what I've suggested?
After some more research, I found a function called get_user_pages() (best documentation I've found is here) that returns a list of pages from userspace at a given address that can be mapped to kernel space with kmap() and written to that way (in my case, using kernel_read()). This can be used as a replacement for copy_to_user() because it allows forcing write permissions on the pages retrieved. The only drawback is that you have to write page by page, instead of all in one go, but it does solve the problem I described in my question.
In userspace there is a system call, mprotect, that can modify the protection flags on an existing mapping. You probably need to follow the implementation of that system call, or maybe simply call it directly from your code. See mm/mprotect.c.
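For reference, the user-space behaviour the question is trying to reproduce looks roughly like the sketch below: map anonymous memory writable, fill it, then drop PROT_WRITE with mprotect, which keeps the contents intact. This is plain user-space code, not kernel-module code; the function name is made up for the example, and a module would have to reach the same effect through the kernel's own implementation of the system call.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write msg into a fresh anonymous page, then make the page read-only.
 * Returns 0 if the data is still readable afterwards, -1 on any failure. */
static int write_then_seal(const char *msg)
{
    assert(msg != NULL);
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    if (strlen(msg) >= len)
        return -1;

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    strcpy(p, msg);                      /* write while still writable */

    if (mprotect(p, len, PROT_READ) != 0) {
        munmap(p, len);
        return -1;
    }

    int ok = (strcmp(p, msg) == 0);      /* contents survive the change */
    munmap(p, len);
    return ok ? 0 : -1;
}
```

Unlike the unmap-and-remap attempt described in the question, changing the protection in place does not zero the pages.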

Ring level shift in Win NT based OS

Can anyone please tell me how the privilege change happens in Windows?
I know that user-mode code (RL:3) passes parameters to APIs,
and that these APIs call the kernel code (RL:1).
What I want to know is whether, during the security (RPL) check, some token is exchanged between the RL3 API and the RL1 kernel API.
If I am wrong, please let me know (through a link or a brief description) how it works.
Please feel free to close this thread if it's off-topic, offensive or a duplicate.
RL= Ring Level
RPL:Requested Privilege level
Interrupt handlers and the syscall instruction (which is an optimized software interrupt) automatically modify the privilege level (this is a hardware feature, the ring 0 vs ring 3 distinction you mentioned) along with replacing other processor state (instruction pointer, stack pointer, etc). The prior state is of course saved so that it can be restored after the interrupt completes.
Kernel code has to be extremely careful not to trust input from user-mode. One way of handling this is to not let user-mode pass in pointers which will be dereferenced in kernel mode, but instead HANDLEs which are looked up in a table in kernel-mode memory, which can't be modified by user-mode at all. Capability information is stored in the HANDLE table and associated kernel data structures, this is how, for example, WriteFile knows to fail if a file object is opened for read-only access.
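The handle idea can be sketched as a small table lookup. This is a toy model under stated assumptions, not how the Windows object manager is actually implemented: all type, constant, and function names are hypothetical. The point it shows is that user mode only ever holds an index, and the access mask recorded when the object was opened is checked on every use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ACCESS_READ  0x1u
#define ACCESS_WRITE 0x2u
#define MAX_HANDLES  16

struct handle_entry {
    int      in_use;
    uint32_t granted;  /* access mask decided at open time */
    void    *object;   /* kernel-side object; never handed to user mode */
};

struct handle_table {
    struct handle_entry slots[MAX_HANDLES];
};

/* Record an opened object; returns the handle (a table index) or -1 if full. */
static int handle_open(struct handle_table *t, void *object, uint32_t granted)
{
    assert(t != NULL);
    for (int i = 0; i < MAX_HANDLES; i++) {
        if (!t->slots[i].in_use) {
            t->slots[i].in_use  = 1;
            t->slots[i].granted = granted;
            t->slots[i].object  = object;
            return i;
        }
    }
    return -1;
}

/* Validate an untrusted handle value and the requested access.
 * Returns the object on success, NULL if the handle or access is invalid. */
static void *handle_reference(struct handle_table *t, int handle, uint32_t wanted)
{
    assert(t != NULL);
    if (handle < 0 || handle >= MAX_HANDLES)
        return NULL;
    struct handle_entry *e = &t->slots[handle];
    if (!e->in_use || (e->granted & wanted) != wanted)
        return NULL;
    return e->object;
}
```

A handle opened with ACCESS_READ only will fail a later handle_reference with ACCESS_WRITE, which mirrors the WriteFile example in the answer.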
The task switcher maintains information on which process is currently running, so that syscalls which perform security checks, such as CreateFile, can check the user account of the current process and verify it against the file ACL. This process ID and user token are again stored in memory which is accessible only to the kernel.
The MMU page tables are used to prevent user-mode from modifying kernel memory -- generally there is no page mapping at all; there are also page access bits (read, write, execute) which are enforced in hardware by the MMU. Kernel code uses a different page table, the swap occurs as part of the syscall instruction and/or interrupt activation.

Question about memory page protection

Here's another question I met when reading < Windows via C/C++ 5th Edition >. First, let's see some quotation.
LPVOID WINAPI VirtualAlloc(
__in_opt LPVOID lpAddress,
__in SIZE_T dwSize,
__in DWORD fdwAllocationType,
__in DWORD fdwProtect
);
The last parameter, fdwProtect, indicates the protection attribute that should be assigned to the region. The protection attribute associated with the region has no effect on the committed storage mapped to the region.
When reserving a region, assign the protection attribute that will be used most often with the storage committed to the region. For example, if you intend to commit physical storage with a protection attribute of PAGE_READWRITE, you should reserve the region with PAGE_READWRITE. The system's internal record keeping behaves more efficiently when the region's protection attribute matches the committed storage's protection attribute.
(When committing storage)...you usually pass the same page protection attribute that was used when VirtualAlloc was called to reserve the region, although you can specify a different protection attribute.
The above quotation totally puzzled me.
If the protection attribute associated with the region has no effect on the committed storage, why do we need it?
Since it is recommended to use the same protection attribute for both reserving and committing, why does Windows still offer us the option to use different attributes? Isn't it misleading and kind of a paradox?
Where exactly is the protection attribute stored for the reserved region and the committed storage, respectively?
Many thanks for your insights.
It's important to read it in context.
The protection attribute associated with the region has no effect on the committed storage mapped to the region.
was referring to reserving, not committing, regions.
A reserved page has no backing store, so its protection is always conceptually PAGE_NOACCESS, regardless of what you pass to VirtualAlloc. That is, if a thread attempts to read or write an address in a reserved region, an access violation is raised.
From linked article:
Reserved addresses are always PAGE_NOACCESS, a default enforced by the system no matter what value is passed to the function. Committed pages can be either read-only, read-write, or no-access.
Re:
Where exactly is the protection attribute stored for reserved region and committed storage, respectively?
The protection attributes for virtual address regions are stored in the VAD tree, per process. (VAD == Virtual Address Descriptor, see Windows Internals, or linked article)
Since it is recommended to use the same protection attribute for both reserving and committing, why does Windows still offer us the option to use different attributes? Isn't it misleading and kind of a paradox?
Because the function always accepts a protection parameter, but its behaviour depends on fdwAllocationType. Protection only makes sense for committed storage.
The reason Richter suggests using the same protection setting is presumably because fewer changes in the protection flags in a region mean fewer "blocks" (see your book for definition), and hence a smaller AVL tree for the VADs. I.e. if all pages in a region are committed with the same flags, there'll only be 1 block. Otherwise there could be as many blocks as pages in the region. And you need a VAD for each block (not page).
Block == set of consecutive pages with identical protection/state.
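Using that definition of a block, the effect of mixed protection flags can be shown by counting runs in a per-page array. This is only a model of the argument above, not of Windows' internal bookkeeping; the function name is hypothetical, and the values 0x02 and 0x04 in the usage note are assumed to stand for PAGE_READONLY and PAGE_READWRITE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count blocks: maximal runs of consecutive pages whose protection
 * values are identical. page_prot[i] is the protection of page i. */
static size_t count_blocks(const uint32_t *page_prot, size_t npages)
{
    assert(page_prot != NULL || npages == 0);
    if (npages == 0)
        return 0;

    size_t blocks = 1;
    for (size_t i = 1; i < npages; i++) {
        if (page_prot[i] != page_prot[i - 1])
            blocks++;             /* protection changed: a new block starts */
    }
    return blocks;
}
```

A four-page region committed uniformly as {0x04, 0x04, 0x04, 0x04} forms a single block, whereas the alternating pattern {0x04, 0x02, 0x04, 0x02} forms four, one per page, which is the worst case the answer describes.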
If the protection attribute associated with the region has no effect on the committed storage, why do we need it?
As above.
Well... One reason could be so you could use guard pages so you can commit memory as you use it.
Think of the thread stack in Windows; the page immediately below the stack is set as a guard page, typically with read and write ability. Once the guard page is touched, an exception handler runs and commits the guard page and makes the next page a guard.
See here for a better description. Also, that link is part of a series on how windows handles low level resources and is pretty good reading.
Another reason for allowing you to respecify the protection attributes could be for copy on write techniques. Pages are set to read only until they're changed, which can raise an exception you can handle etc etc etc.
On the 386 family of Intel chips, the commit, read/write/reserve flags are stored in the page tables. Take a look at a 386 chip reference for more details.
Edit: I poked around for a bit and could not find where MS stores the PAGE_GUARD bit. Now I'm curious where I saw it. :) Too bad I threw out about 500 pounds of old reference material last spring...
Hope this helps :)
