Address designation in RISC-V - linux-kernel

I am running a simulated RV64GC core in QEMU and am trying to better understand the virtual memory subsystem and address translation process in RISC-V. My simulated system runs with OpenSBI, the Linux kernel v5.5, and a minimal rootfs.
In QEMU debug traces, I see that sometimes (most commonly with ecalls) control is passed to the SBI and the addresses change from kernel (virtual?) addresses with an offset of 0xffffffe000000000 to something that looks like real physical addresses in RAM. For example:
...
0xffffffe00003a192: 00000073 ecall
...
IN: sbi_ecall_0_1_handler
0x0000000080004844: 00093603 ld a2,0(s2)
0x0000000080004848: 4785 addi a5,zero,1
0x000000008000484a: 00a797b3 sll a5,a5,a0
...
In the RISC-V privileged specification version 1.11, section 4.1.12, the satp CSR (control and status register) is defined to have a MODE field that determines the address translation scheme. A MODE of 0 means that translation is bare (addresses are treated as physical), a MODE of 8 or 9 selects Sv39 or Sv48 page-based virtual addressing, respectively, and any other MODE values are reserved.
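For reference, a minimal sketch of the RV64 satp layout as I read it from the spec (MODE in bits 63:60, ASID in 59:44, root page table PPN in 43:0; the helper below is purely illustrative):
#include <stdint.h>

#define SATP_MODE_BARE 0ULL  /* no translation: addresses are treated as physical */
#define SATP_MODE_SV39 8ULL  /* Sv39 page-based virtual addressing */
#define SATP_MODE_SV48 9ULL  /* Sv48 page-based virtual addressing */

/* RV64 satp layout: MODE[63:60] | ASID[59:44] | PPN[43:0] */
static inline uint64_t satp_value(uint64_t mode, uint64_t asid, uint64_t root_pt_phys)
{
    return (mode << 60) | ((asid & 0xFFFFULL) << 44) | (root_pt_phys >> 12);
}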
Now, neither the RISC-V privileged nor the unprivileged specification seems to mention when satp may be changed (other than with csrrw), so this leads me to the following questions:
When control is handed to the SBI (as with the ecall above), does the satp MODE change to 0? If yes, does this mean the satp mode should be reset on a u/s/mret instruction? Are there other instances (other than csrrw) where satp is supposed to change?
If not, is there some other mechanism by which the addresses are interpreted and designated as physical? Or are the addresses (the 0x80XXXXXX addresses above) instead considered virtual and should go through the usual virtual address translation process (as outlined in section 4.3.2 of the RISC-V privileged specification)? If this is the case, when are page table entries created for this?

The memory model of RISC-V works in the following way:
M mode has its own memory protection system, described in section 3.6 of the privileged specification, called PMP (physical memory protection). It exists to impose memory protection on the lower privilege levels, and also on M mode itself if the lock bit is used. There is no virtual memory system in M mode.
S mode, in turn, has a page-based virtual memory system that it can use to set up virtual-to-physical address mappings and to impose memory restrictions on S mode itself and on U mode.
So each privilege level has control over its own resources and the resources of the levels below it, but never over the resources of a privilege level above it. This is how things work.
M mode can control the memory accessible by M, S and U modes, and S mode can control the memory view (virtual memory) and accessibility of S and U modes, but not of M mode. So the satp MODE does not change at all when control moves to M mode; the mapping satp points to simply does not apply to M mode, which has its own memory protection unit (PMP).
It would be a huge security hole if lower privilege levels could impose memory restrictions on higher privilege levels.
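To make that concrete, here is a minimal sketch of the effective rule (illustrative C, not spec text; it deliberately ignores mstatus.MPRV, which can make M-mode loads and stores use the S/U translation):
#include <stdbool.h>
#include <stdint.h>

enum riscv_priv { PRV_U = 0, PRV_S = 1, PRV_M = 3 };

/* satp only governs S/U-mode accesses; M-mode accesses bypass it entirely
 * and are checked against PMP instead. satp itself is not modified by traps. */
static bool access_is_translated(enum riscv_priv mode, uint64_t satp)
{
    if (mode == PRV_M)
        return false;                  /* physical addresses, PMP checks only   */
    return ((satp >> 60) & 0xF) != 0;  /* MODE == 0 (Bare) means no translation */
}
So in the trace above, the 0x80XXXXXX addresses inside sbi_ecall_0_1_handler are physical simply because the hart is executing in M mode; satp itself was not rewritten.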

Related

How is virtual system space protected against access?

On Microsoft Docs I read:
In 64-bit Windows, the theoretical amount of virtual address space is 2^64 bytes (16 exabytes), but only a small portion of the 16-exabyte range is actually used. The 8-terabyte range from 0x000'00000000 through 0x7FF'FFFFFFFF is used for user space, and portions of the 248-terabyte range from 0xFFFF0800'00000000 through 0xFFFFFFFF'FFFFFFFF are used for system space.
Since I have 64 bit pointers, I could possibly construct a pointer that points to some 0xFFFFxxxxxxxxxxxx address.
The site continues:
Code running in user mode has access to user space but does not have access to system space.
If I were able to guess a valid address in system virtual address space, what mechanism prevents me from writing there?
I know about memory protection but that doesn't seem to offer something that distinguishes between user memory and system memory.
According to the comments by @RbMm, this information is stored in the PTE (page table entry). There seems to be a bit which defines whether access is granted from user mode.
This is confirmed by an article on OSR online, which says
Bit Name: User access
The structure itself does not seem to be part of Microsoft symbols
0:000> dt ntdll!_page*
ntdll!_PAGED_LOOKASIDE_LIST
ntdll!_PAGEFAULT_HISTORY
0:000> dt ntdll!page*
0:000> dt ntdll!*pte*
00007fff324fe910 ntdll!RtlpTestHookInitialize
The PTEs are interpreted directly by the CPU (specifically by the MMU, the memory management unit). That's why we find additional information at OSDev, which says
U, the 'User/Supervisor' bit, controls access to the page based on privilege level. If the bit is set, then the page may be accessed by all; if the bit is not set, however, only the supervisor can access it.
In some leaked SDK files, the bit seems to be
unsigned __int64 Owner : 1;
Since the PTE is supported by the CPU, we should find similar things in Linux. And voilà, I see this SO answer which also has the bit:
#define _PAGE_USER 0x004
which exactly matches the information of OSDev.
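A minimal sketch of that check for x86/x64 page table entries (bit positions per the architecture: P in bit 0, R/W in bit 1, U/S in bit 2, which is exactly the _PAGE_USER value 0x004 above; the function is illustrative, not a real kernel routine):
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT (1ULL << 0)  /* P:   the page is mapped                  */
#define PTE_WRITE   (1ULL << 1)  /* R/W: the page is writable                */
#define PTE_USER    (1ULL << 2)  /* U/S: accessible from user mode (CPL = 3) */

/* Would a user-mode access to the page described by this PTE be allowed? */
static bool user_can_access(uint64_t pte, bool is_write)
{
    if (!(pte & PTE_PRESENT) || !(pte & PTE_USER))
        return false;                       /* not mapped, or supervisor-only */
    return !is_write || (pte & PTE_WRITE);  /* writes also need the R/W bit   */
}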

PowerPC protected/privileged mode structure

Where can I get information on the privileged/protected mode structure of PowerPC?
I have tried looking through the user manuals but I couldn't find any info.
Architecturally, it's fairly straightforward - as a broad overview: Firstly, you have a couple of bits in the Machine State Register (MSR): Problem state (PR) and Hypervisor (HV). Those two bits represent three states:
PR=1, HV=x - unprivileged (typically: userspace)
PR=0, HV=0 - supervisor (typically: virtualised guest OS kernel)
PR=0, HV=1 - hypervisor (typically: hypervisor host, or non-virtualised OS kernel)
If your implementation doesn't support hardware virtualisation (ie., doesn't have a HV bit), then there are just two states:
PR=1 - userspace
PR=0 - supervisor
Then, certain facilities are only available at specific machine states. For example, some special purpose registers can only be accessed in PR=0 state; attempts to access these registers with PR=1 will cause a program interrupt, transferring control back to the OS. The OS can then decide what action to take (say, kill the process, or access the privileged resource on behalf of that process).
Of course, the MSR itself is privileged; userspace processes can't simply clear the PR bit to enter supervisor state.
For implementing access control to memory, the storage control facility can mark mappings as only available when the machine is in PR=0 and/or HV=1 states. Processing a virtual address translation will check the mapping configuration against the machine state, and potentially raise a data or instruction storage exception if access should not be allowed. Again, this transfers control back to the OS/hypervisor.
For details, see the POWER ISA documents. Book III has most of the details about privileged states.
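As a tiny illustration of the PR/HV decoding above (a hypothetical helper, not part of any PowerPC header):
#include <stdbool.h>

enum ppc_state { PPC_USER, PPC_SUPERVISOR, PPC_HYPERVISOR };

/* Map the MSR's Problem state (PR) and Hypervisor (HV) bits to the three states. */
static enum ppc_state decode_state(bool pr, bool hv)
{
    if (pr)
        return PPC_USER;          /* PR=1, HV=x: unprivileged (userspace)        */
    return hv ? PPC_HYPERVISOR    /* PR=0, HV=1: hypervisor / non-virtualised OS */
              : PPC_SUPERVISOR;   /* PR=0, HV=0: virtualised guest OS kernel     */
}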

Windows kernel memory protection

In Windows, the high part of every process's address space (above 0x80000000 or 0xC0000000) is reserved for kernel code. User code cannot access these regions of memory; if it tries to, an access violation exception is thrown.
I wish to know how the kernel space is protected.
Is it via memory segmentation or via paging?
I would like to hear a technical explanation.
Thanks a lot,
Michael.
Assuming you are talking about x86 and x64 architectures.
Memory protection is achieved using the paging system. Each page table entry on an x86/x64 CPU has a bit to indicate whether it is a user or supervisor page. Accesses to supervisor pages are only permitted for code running with CPL<3, whereas accesses to non-supervisor (user) pages are possible regardless of CPL.
CPL is the "Current Privilege Level" which is sometimes referred to as Ring. Windows only uses two rings, although the CPU implements 4. Ring 0 is the CPU mode in which what Windows refers to as "kernel mode" runs. Ring 3 is the CPU mode in which "User mode" runs. Since code running at CPL=3 cannot access supervisor pages, this is how memory protection is implemented.
The answer for ARM is likely to be similar, but different.
That's an easy one and doesn't require talking about rings and kernel behavior. Accessing virtual memory at a particular address requires that address to be mapped: the operating system has to allocate a memory page for that address. The low-level winapi function that does that is VirtualAlloc(), which takes an optional address as its first argument. The OS will simply fail a request for an unmappable address. It is otherwise the exact same mechanism that prevents you from mapping any address in the lowest 64KB of the address space.
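For example, a small user-mode test along these lines (assuming a 32-bit process with the default 2GB user space, so anything at or above 0x80000000 is system space) would be expected to fail:
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* 0x80000000 is the start of system space under the default 2GB split. */
    void *kernel_addr = (void *)0x80000000u;

    void *p = VirtualAlloc(kernel_addr, 4096, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
    if (p == NULL)
        printf("VirtualAlloc refused the address, error %lu\n", GetLastError());
    else
        printf("unexpectedly mapped %p\n", p);
    return 0;
}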

Linux kernel ARM Translation table base (TTB0 and TTB1)

Compiled Linux kernel 2.6.34.3 for ARMv7 (Cortex-a8)
I looked into the kernel code and it looks like the Linux kernel sets the hardware page tables for the kernel address space (everything above 0xC0000000) on TTB1 (translation table base 1) and the user process tables (everything below 0xC0000000) on TTB0, which changes on every process context switch. Is this correct? I'm still confused about how the MMU knows which TTB to look at for translations.
I read that the TTBCR (translation table base control register) determines which of the TTB registers to walk when an MVA is not found; however, the register always reads 0, which according to the ARM Architecture Reference Manual means "always use TTBR0". How is that possible? Can anyone explain to me how the Linux kernel uses these two TTBs?
I read how the TTB works from this site https://www.cs.rutgers.edu/~pxk/416/notes/10-paging.html but I still don't understand how the kernel uses the two TTBs.
(I double-checked the kernel code; for some reason both TTB0 and TTB1 are set, but it seems like TTB1 is never used. I set the TTBR1 register to 0 and the Linux kernel continued to run as usual.)
The TTBR registers are used together to determine addressing for the full 32-bit or 40-bit address space. Which register is used for what address ranges is controlled via the tXsz bits in the TTBCR. There is an entry for t0sz corresponding to TTBR0 and t1sz for TTBR1.
The page tables addressed by each TTBRx register are independent, but you typically find most Linux implementations just use TTBR0. Linux expects to be able to use a 3G/1G address space partitioning scheme, which is not supported by ARM. If you look at page B3-1345 of the ARMv7 Architecture Reference Manual, you'll see that the values of t0sz and t1sz determine the address ranges supported by TTBR0 and TTBR1 respectively. To add confusion to disorientation, it is even possible to have disjoint address spaces where TTBR0 and TTBR1 support ranges that are not contiguous, resulting in a hole in the system address space. Good times!
To answer your main question though, it is recommended by ARM that TTBR0 be used to store the offset to the page tables used by USER processes, and TTBR1 be used to store the offset to the page tables used by the KERNEL. I have yet to see a single implementation that actually does this. Almost exclusively TTBR0 is used in all cases, with TTBR1 containing a duplicate copy of the L1 tables.
So how does this work? The value of TTBR is stored as part of the process state and simply restored each time a process is switched in. This is how it is expected to work. Originally, TTBR1 would hold a constant value for the kernel tables and never be replaced or swapped out, whereas TTBR0 would be changed each time you context switch between processes. Apparently most Linux implementations for ARM have decided to basically eliminate the use of TTBR1 and stick to using TTBR0 for everything.
If you want to test this theory on your device, try whacking TTBR1 and watch nothing happen. Then try whacking TTBR0 and watch your system crash. I've yet to encounter a single instance that didn't behave exactly this way. Long story short, TTBR1 goes unused by Linux, and TTBR0 is used almost exclusively and simply swapped out.
Now, once you get to LPAE support, throw all this away and start over again. This is the implementation where you will start to see the value of t0sz and t1sz being something other than zero, and hence N as well.
I have very little knowledge of the ARM architecture, but from what I read in the link you included, I guess Linux implements its virtual-memory management this way:
High-order bits of the virtual address determine which one to use. The base of the table is stored in one of two base registers (TTBR0 or TTBR1), depending on whether the topmost n bits of the virtual address are 0 (use TTBR0) or not (use TTBR1). The value for n is defined by the Translation Table Base Control Register (TTBCR).
The TTBCR register tells which addresses will be translated through the page tables pointed to by TTBR0 or TTBR1. If TTBCR were configured so that the split fell at 0xc0000000, then any address from 0 to 0xbfffffff would be translated by the page table pointed to by TTBR0, and any address from 0xc0000000 to 0xffffffff by the page table pointed to by TTBR1. That would match the Linux memory split of 3GB for user processes / 1GB for the kernel.
This allows a design where the operating system and memory-mapped I/O are located in the upper part of the address space and managed by the page table in TTBR1, and user processes are in the lower part of memory and managed by the page table in TTBR0. On a context switch, the operating system has to change TTBR0 to point to the first-level table for the new process. TTBR1 will still contain the memory map for the operating system and memory-mapped I/O.
Hence, the value of TTBR1 should never change, because you want the kernel to be permanently mapped (think of what happens when an interrupt is raised). On the other hand, TTBR0 is modified at every process switch; it contains the page table of the current process.
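A heavily simplified sketch of that per-process switch on ARMv7 (short-descriptor format; real kernels also handle ASIDs, barriers and TLB/cache maintenance, all omitted here, and the function name is made up):
#include <stdint.h>

/* Point TTBR0 at the new process's first-level table; TTBR1 is left untouched. */
static inline void switch_user_tables(uint32_t pgd_phys)
{
    asm volatile("mcr p15, 0, %0, c2, c0, 0" : : "r"(pgd_phys) : "memory"); /* write TTBR0 */
    asm volatile("isb");                                  /* make the new tables take effect */
}
Since TTBR1 never changes, the kernel mapping stays valid across every such switch.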
See http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0211k/Bihgfcgf.html
For ARMv5 and earlier the TTB table is fixed in size and alignment (to 16kB). Each level 1 entry is 32 bits and maps 1MB (16kB / 4 bytes per entry * 1MB per entry = 4GB). The TTBCR controls the TTBR0 table size. From the above URL:
Selecting which Translation Table Base Register is used
The Translation Table Base Register is selected as follows:
If N = 0, always use Translation Table Base Register 0.
- This is the default case at reset. It is backwards compatible with ARMv5 or earlier processors.
If N is greater than 0, then:
- if bits [31:32-N] of the Virtual Address are all 0, use Translation Table Base Register 0 otherwise use Translation Table Base Register 1.
So the size of the TTBR0 table also sets the memory split. For example, selecting the value 2 puts the split at the 1GB boundary: a 4kB table == 1GB of memory == bits 31..30 are zero. For a value of 6 the table is 256 bytes == 64MB == bits 31..26 are zero.
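A small illustrative sketch of the selection rule and the table-size arithmetic just described (not kernel code):
#include <stdbool.h>
#include <stdint.h>

/* With TTBCR.N = n (n > 0), TTBR1 is used whenever any of VA bits [31:32-n] is set. */
static bool uses_ttbr1(uint32_t va, unsigned n)
{
    return n != 0 && (va >> (32 - n)) != 0;
}

/* The TTBR0 table shrinks as N grows: one 32-bit descriptor per 1MB section,
 * covering (4GB >> N) of address space with a (16kB >> N) table. */
static uint32_t ttbr0_table_bytes(unsigned n)   { return 16384u >> n; }
static uint64_t ttbr0_covered_bytes(unsigned n) { return (4ULL << 30) >> n; }
With N = 2 these give the 4kB table covering 1GB mentioned above; with N = 6, 256 bytes covering 64MB.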
In Linux parlance these are page global directory entries (and this value splits the page global directory). The entries can point to another table or just map a 1MB section. The next level of table entries are the Linux page middle directories and then the final page table entries. I think the page middle entries are unused on ARM.
The MMU hardware doesn't walk the tables every time. There is a TLB (translation look-aside buffer). It is like a cache for the MMU tables. When the OS updates these tables, the TLB must be flushed or the processor will use stale entries. Similarly, the ARM cache is virtually tagged, so changing the mapping may also mean the cache must be flushed. For these reasons, you never want to change things on a context switch. Shared library text (say libc.so) should stay the same across a context switch. Hopefully each process has libc.so mapped at the same virtual address. There is a big gain in doing this: lower memory use and good I-cache use.
The domain and PID registers as well as supervisor/user modes can also control memory accesses. These are single registers that can be toggled on a context switch.
See http://lwn.net/images/conf/rtlws11/papers/proc/p01.pdf for info on PID and domain use on the ARMv5. The current Linux source doesn't do exactly what the paper describes. It is entirely possible that Linux doesn't need to use this mechanism and sets the TTBCR to zero so that the VM code for the ARM sub-architectures stays similar.
Edit: I don't believe the TTBCR functionality can be used to achieve a 3G/1G split. I think the Rutgers page was discussing the TTBCR generically and not in the Linux context. Also, at least the 2.6.38 Linux kernel used domains (the DACR) but does not use the PID or FCSE, as that supports only a limited number of processes.
http://lwn.net/Articles/106177/ - also referenced on the Rutgers page.
The TTBR0 holds the base address of translation table 0, and information about the memory it occupies.
This is one of the translation tables for the stage 1 translation of memory accesses from modes other than Hyp mode

Kernel mode transition

If I understand correctly, a memory address in system space is accessible only from kernel mode. Does that mean that when components mapped in system space are executed, the processor must be switched to kernel mode?
For example: the virtual memory manager is a frequently used component and is mapped in system space. Whenever the VMM runs in the context of a user process (let's say it translates an address), must the processor be switched to kernel mode?
Thanks,
Suresh.
Typically, there are two parts involved: the MMU (memory management unit), a hardware component that does the translation from virtual addresses to physical addresses, and the operating system's VM subsystem.
The operating system part needs to run in privileged mode (a.k.a. kernel mode) and will set up or change the mappings in the MMU based on the needs of user space.
E.g. to request more (virtual) memory, or to map a file into memory, a transition to kernel mode is needed so that the VM subsystem can change the mappings of the process.
Around this there are often a ton of tricks, e.g. mapping the whole kernel address space into each user process's virtual space but changing its access rights so the process can't use that memory; this means that whenever you transition to kernel mode you don't need to reload the mapping for the kernel.
Taking your example of the virtual memory manager, it never actually runs in user space. To allocate memory, user mode applications make calls to the Win32 API (NTDLL.DLL as one example) to routines such as VirtualAlloc.
With regards to address translation, here's a summary of how it works (based on the content from Windows Internals 5th Edition).
The VMM uses page tables which the CPU uses to translate virtual addresses to physical addresses. The page tables live in the system space. Each table contains many PTEs (page table entries) which stores the physical address to which a virtual address is mapped. I won't go into too much detail here, but the point is that all of the VMM's work is performed in system space and not in user space.
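As a very rough illustration of what a single PTE lookup boils down to for a 4KB page (a toy one-level table; real x86/x64 translation walks several levels, and the names here are made up):
#include <stdint.h>

#define PAGE_SHIFT 12              /* 4KB pages */

/* Toy one-level "page table": index = virtual page number, value = PTE. */
static uint64_t page_table[16];

static uint64_t translate(uint64_t va)
{
    uint64_t pte   = page_table[(va >> PAGE_SHIFT) & 0xF];  /* PTE for this page           */
    uint64_t frame = pte & ~0xFFFULL;                        /* physical frame from the PTE */
    return frame | (va & 0xFFF);                             /* keep the offset in the page */
}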
As for context switching: when a thread running in user space needs to run in system space, a context switch will occur. Since the memory manager lives in system space, its threads never need to make such a switch; they already live in system space.
Apologies for the simplistic explanation; this is quite a complicated topic to discuss in depth. I would highly recommend that you pick up a copy of Windows Internals, as it sounds like it would come in handy for you.

Resources