Where can I get information on the privileged/protected mode structure of PowerPC?
I have tried seeing the user manuals but I couldn't get any info.
Architecturally, it's fairly straightforward - as a broad overview: Firstly, you have a couple of bits in the Machine State Register (MSR): Problem state (PR) and Hypervisor (HV). Those two bits represent three states:
PR=1, HV=x - unprivileged (typically: userspace)
PR=0, HV=0 - supervisor (typically: virtualised guest OS kernel)
PR=0, HV=1 - hypervisor (typically: hypervisor host, or non-virtualised OS kernel)
If your implementation doesn't support hardware virtualisation (ie., doesn't have a HV bit), then there are just two states:
PR=1 - userspace
PR=0 - supervisor
Then, certain facilities are only available at specific machine states. For example, some special purpose registers can only be accessed in PR=0 state; attempts to access these registers with PR=1 will cause a program interrupt, transferring control back to the OS. The OS can then decide what action to take (say, kill the process, or access the privileged resource on behalf of that process).
Of course, the MSR itself is privileged; userspace processes can't simply clear the PR bit to enter supervisor state.
For implementing access control to memory, the storage control facility can mark mappings as only available when the machine is in PR=0 and/or HV=1 states. Processing a virtual address translation will check the mapping configuration against the machine state, and potentially raise a data or instruction storage exception if access should not be allowed. Again, this transfers control back to the OS/hypervisor.
For details, see the POWER ISA documents. Book III has most of the details about privileged states.
Related
I am running a simulated RV64GC core in QEMU and am trying to better understand the virtual memory subsystem and address translation process in RISC-V. My simulated system runs with OpenSBI, the Linux Kernal v5.5, and a minimal rootfs.
In QEMU debug traces, I see that sometimes (most commonly with ecalls) control is passed to the SBI and the addresses change from kernel (virtual?) addresses with an offset of 0xffffffe000000000 into something that looks like real, physical, addresses in RAM. For example,
...
0xffffffe00003a192: 00000073 ecall
...
IN: sbi_ecall_0_1_handler
0x0000000080004844: 00093603 ld a2,0(s2)
0x0000000080004848: 4785 addi a5,zero,1
0x000000008000484a: 00a797b3 sll a5,a5,a0
...
In the RISC-V privileged specification version 1.11, section 4.1.12, the satp CSR (control and state register) is defined to have a MODE field that determines address translation designation. A MODE of 0 means that translation is bare (addresses are considered physical), a MODE of 8 or 9 requires Sv39 or Sv48 page-based virtual addressing, respectively, and any other MODE values are reserved.
Now, both the RISC-V privileged and unprivileged specifications don't seem to mention when satp may be changed (other than with csrrw), so this leads me to the following questions:
When control is handed to the SBI (as with the ecall above), does the satp MODE change to 0? If yes, does this mean the satp mode should be reset on a u/s/mret instruction? Are there other instances (other than csrrw) where satp is supposed to change?
If not, is there some other mechanism by which the addresses are interpreted and designated as physical? Or are the addresses (the 0x80XXXXXX addresses above) instead considered virtual and should go through the usual virtual address translation process (as outlined in section 4.3.2 of the RISC-V privileged specification)? If this is the case, when are page table entries created for this?
The memory model of RISC-V works in the following way:
M mode has its own memory protection system described under section 3.6 of privileged specifications called PMP (physical memory protection). This is to impose memory protection on lower privilege levels and also M mode itself (if lock bit is used). There is no virtual memory system in M mode.
Now in the S mode, it has page based virtual memory system that S mode can use to set virtual to physical address mapping and also to impose memory restrictions on S mode itself and also U mode.
So each privilege level has control on its own resources and the resources below it but never on the resources of the privilege level above it. This is how things work.
M mode can control memory accessible by M, S and U modes, and S mode can control memory view (virtual memory) and accessibility of S and U modes but not M mode. So satp mode never even changes when moving to M mode. As the mapping pointed by it is never even applicable to M mode. It has its on memory protection unit.
This would be huge security hole if lower privilege levels could impose memory restrictions on higher privilege levels.
I think I'm missing a fundamental concept of how the OS manages memory.
OS is responsible for keeping track of what parts of physical memory are free.
OS creates and manages page tables, which have mappings between virtual to physical addresses.
For each instruction that specifies a virtual address, the hardware reads the page table to get the corresponding physical address. One way the hardware may know the location of the current page table is by a register that the OS updates.
This makes sense for how processes access memory. However, I don't understand how the OS itself accesses memory.
Assuming it uses the same instructions, the hardware would still be translating from virtual addresses to physical? Is there, for example, a known physical location for a page table for the OS itself? I know the question is murky, having trouble even understanding what to ask.
At some point there has to be a page table in a physical location. The method used for this depends upon the processor.
Let me give a simple example based upon the VAX processor. Suppose you divide the logical address space into a system range shared by all processes and a user range that is unique to each process. Then give each of those ranges its own page table.
Now you can place the user page table in the system address range of the logical address space.
If you access memory in the user space, you go to page table that the system finds at a logical address in the system space, that the processor then had to translate into a physical address using the page table for the system space; a two level translation.
If you use logical addresses for the the system space page table then you'd have no way of translating those into physical addresses. Instead, the local of the system page table is defined using physical addresses.
Another approach would be to define all page tables using physical addresses.
I don't understand how the OS itself accesses memory.
Think of the operating system as a process. The OS basically is a process, just like other processes, with elevated privileges. Whenever the OS wants to use eome memory location,it uses page tables for virtual to physical address translation, just like other processes would.
Think of it this way: Every process has a page table of its own, the same goes for the OS. The OS remembers the location of these page tables in the control structures associated with each process (e.g. the PCB), and for the currently running process the address (physical pointer) to the page table is kept in hardware (for the x86 architecture this is in control register 4 (CR4)). On x86, whenever the OS switches the running process it changes the value in CR4 so that the address points to the correct page table (its own if it switches to itself).
However, this is greatly simplified in modern operating systems, where the kernel (the OS) is mapped into the memory space of all processes, so that the kernel can run whenever it wants without having to switch page tables (which is costly). The pages in a process' page table belonging to the kernel are restricted to the process, and only accessible once the kernel takes control to do some management task.
In Windows the high memory of every process (0x80000000 or 0xc0000000)
Is reserved for kernel code, user code cannot access these regions of memory, if it tries so an access violation exception will be thrown.
I wish to know how is the kernel space protected ?
Is it via memory segmentations or via paging ?
I would like to hear a technical explanation.
Thanks a lot,
Michael.
Assuming you are talking about x86 and x64 architectures.
Memory protection is achieved using the paging system. Each page table entry on an x86/x64 CPU has a bit to indicate whether it is a user or supervisor page. Accesses to supervisor pages are only permitted for code running with CPL<3, whereas accesses to non supervisor pages are possible regardless of CPL.
CPL is the "Current Privilege Level" which is sometimes referred to as Ring. Windows only uses two rings, although the CPU implements 4. Ring 0 is the CPU mode in which what Windows refers to as "kernel mode" runs. Ring 3 is the CPU mode in which "User mode" runs. Since code running at CPL=3 cannot access supervisor pages, this is how memory protection is implemented.
The answer for ARM is likely to be similar, but different.
That's an easy one and doesn't require talking about rings and kernel behavior. Accessing virtual memory at a particular address requires that address to be mapped, the operating system has to allocate a memory page for that address. The low-level winapi function that does that is VirtualAlloc(). Which takes an optional address, first argument. The OS will simply fail a request for an unmappable address. Otherwise the exact same mechanism that prevents you from mapping any address in the lowest 64KB of the address space.
Can anyone please tell me how there is privilege change in Windows OS.
I know the user mode code (RL:3) passes the parameters to APIs.
And these APIs call the kernel code (RL:1).
But now I want to know, during security(RPL) check is there some token that is exchanged between these RL3 API and RL1 Kernel API.
if I am wrong please let me know (through Some Link or Brief description) how it works.
Please feel free to close this thread if its offtopic, offensive or duplicate.
RL= Ring Level
RPL:Requested Privilege level
Interrupt handlers and the syscall instruction (which is an optimized software interrupt) automatically modify the privilege level (this is a hardware feature, the ring 0 vs ring 3 distinction you mentioned) along with replacing other processor state (instruction pointer, stack pointer, etc). The prior state is of course saved so that it can be restored after the interrupt completes.
Kernel code has to be extremely careful not to trust input from user-mode. One way of handling this is to not let user-mode pass in pointers which will be dereferenced in kernel mode, but instead HANDLEs which are looked up in a table in kernel-mode memory, which can't be modified by user-mode at all. Capability information is stored in the HANDLE table and associated kernel data structures, this is how, for example, WriteFile knows to fail if a file object is opened for read-only access.
The task switcher maintains information on which process is currently running, so that syscalls which perform security checks, such as CreateFile, can check the user account of the current process and verify it against the file ACL. This process ID and user token are again stored in memory which is accessible only to the kernel.
The MMU page tables are used to prevent user-mode from modifying kernel memory -- generally there is no page mapping at all; there are also page access bits (read, write, execute) which are enforced in hardware by the MMU. Kernel code uses a different page table, the swap occurs as part of the syscall instruction and/or interrupt activation.
If I understand correctly, a memory adderss in system space is accesible only from kernel mode. Does it mean when components mapped in system space are executed the processor must be swicthed to kernel mode?
For ex: the virtual memory manager is a frequently used component and is mapped in system space. Whenever the VMM runs in the context of user process (lets say it translated an address), does the processor must be swicthed to kernel mode?
Thanks,
Suresh.
Typically, there's 2 parts involved.The MMU(Memory manage unit) which is a hardware component that does the translation from virtual addresses to physical addresses. And the operating system VM subsystem.
The operating system part needs to run in privileged mode (a.k.a. kernel mode) and will set up/change the mapping in the MMU based on the the user space needs.
E.g. to request more (virtual) memory, or map a file into memory, a transition to kernel mode is needed and the VM subsystem can change the mapping of the process.
Around this there's often a ton of tricks to be made - e-g. map the whole address space of the kernel into the user process virtual space, but change its access so the process can't use that memory - this means whenever you transit to kernel mode you don't need to reload the mapping for the kernel.
Taking your example of the virtual memory manager, it never actually runs in user space. To allocate memory, user mode applications make calls to the Win32 API (NTDLL.DLL as one example) to routines such as VirtualAlloc.
With regards to address translation, here's a summary of how it works (based on the content from Windows Internals 5th Edition).
The VMM uses page tables which the CPU uses to translate virtual addresses to physical addresses. The page tables live in the system space. Each table contains many PTEs (page table entries) which stores the physical address to which a virtual address is mapped. I won't go into too much detail here, but the point is that all of the VMM's work is performed in system space and not in user space.
As for context switching - when a thread running in user space needs to run in the system space, then a context switch will occur. Since the memory manager lives in system space, it's threads never need to make a context switch, since it already lives in the system space.
Apologies for the simplistic explanation, this is quite a complicated topic of discussion in depth. I would highly recommend that you pick up a copy of Windows Internals as this sounds like it would come in handy for you.