TLB flush on a switch from kernel to user UNIX

I was looking for an answer for this question but didn't find any clear answer.
When calling a system call such as getpid(), does the TLB get flushed on the return from kernel mode to user mode?
My "logic" says yes, in order to prevent the user from invading the kernel's address space.
I am not fully convinced by that answer, though, because hardware protection can guard the kernel's virtual address space and save the cost of flushing.
Would love to get this straight,
Thanks.

Related

Windows kernel memory protection

In Windows, the high memory of every process (starting at 0x80000000 or 0xc0000000) is reserved for kernel code. User code cannot access these regions of memory; if it tries, an access violation exception is thrown.
I wish to know how the kernel space is protected.
Is it via memory segmentation or via paging?
I would like to hear a technical explanation.
Thanks a lot,
Michael.

Assuming you are talking about x86 and x64 architectures.
Memory protection is achieved using the paging system. Each page table entry on an x86/x64 CPU has a bit to indicate whether it is a user or supervisor page. Accesses to supervisor pages are only permitted for code running with CPL<3, whereas accesses to non-supervisor pages are possible regardless of CPL.
CPL is the "Current Privilege Level" which is sometimes referred to as Ring. Windows only uses two rings, although the CPU implements 4. Ring 0 is the CPU mode in which what Windows refers to as "kernel mode" runs. Ring 3 is the CPU mode in which "User mode" runs. Since code running at CPL=3 cannot access supervisor pages, this is how memory protection is implemented.
The answer for ARM is likely to be similar in principle, though the details differ.
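
The user/supervisor check described above can be sketched in a few lines of C. This is a simplified model and not how any kernel or CPU is actually written: the function name access_allowed and the example values are made up for illustration. The bit positions (bit 0 = present, bit 1 = writable, bit 2 = user) are the x86/x64 page table entry flag bits. The sketch ignores the upper paging levels, NX, SMEP/SMAP, and treats supervisor writes to read-only pages as allowed, which corresponds to CR0.WP = 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low flag bits of an x86/x64 page table entry. */
#define PTE_PRESENT  (1ull << 0)  /* page is mapped */
#define PTE_WRITABLE (1ull << 1)  /* writes are allowed */
#define PTE_USER     (1ull << 2)  /* 1 = user page, 0 = supervisor page */

/*
 * Hypothetical helper: would an access at privilege level `cpl` succeed?
 * Simplified: supervisor code (cpl < 3) may touch any present page.
 */
bool access_allowed(uint64_t pte, int cpl, bool is_write)
{
    if (!(pte & PTE_PRESENT))
        return false;                 /* not mapped: page fault */
    if (cpl == 3 && !(pte & PTE_USER))
        return false;                 /* user code touching a supervisor page */
    if (cpl == 3 && is_write && !(pte & PTE_WRITABLE))
        return false;                 /* user write to a read-only page */
    return true;
}

/* Usage example: a kernel page is visible from ring 0 but not from ring 3. */
void pte_example(void)
{
    uint64_t kernel_page = PTE_PRESENT | PTE_WRITABLE;          /* U/S = 0 */
    uint64_t user_page   = PTE_PRESENT | PTE_WRITABLE | PTE_USER;

    assert(access_allowed(kernel_page, 0, true));
    assert(!access_allowed(kernel_page, 3, false));
    assert(access_allowed(user_page, 3, true));
}
```

Because the check is made per page on every access, the kernel can stay mapped in the process's address space without being reachable from ring 3.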

That's an easy one and doesn't require talking about rings and kernel behavior. Accessing virtual memory at a particular address requires that address to be mapped: the operating system has to allocate a memory page for that address. The low-level winapi function that does that is VirtualAlloc(), which takes an optional address as its first argument. The OS will simply fail a request for an unmappable address. It is the exact same mechanism that prevents you from mapping any address in the lowest 64KB of the address space.

What happens when I printk a char * that was initialized in userspace?

I implemented a new system call as an intro exercise. All it does is take in a buffer and printk that buffer. I later learned that the correct practice would be to use copy_from_user.
Is this just a precautionary measure to validate the address, or is my system call causing some error (page fault?) that I cannot see?
If it is just a precautionary measure, what is it protecting against?
Thanks!

There are several reasons.
Some architectures employ segmented memory, where there is a separate segment for the user memory. In that case, copy_from_user is essential to actually get the right memory address.
The kernel has access to everything, including (almost by definition) a lot of privileged information. Not using copy_from_user could allow information disclosure if a user passes in a kernel address. Worse, if you are writing to a user-supplied buffer without copy_to_user, the user could overwrite kernel memory.
You'd like to prevent the user from crashing the kernel module just by passing in a bad pointer; using copy_from_user protects against faults so e.g. a system call handler can return EFAULT in response to a bad user pointer.
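
The last point can be observed from user space. The sketch below is not the system call from the question; it uses the existing write() call on a pipe, which has to copy the buffer out of user memory. The helper name errno_for_unmapped_buffer is made up for this example, and it assumes a Linux system where write() on a pipe reports EFAULT for an unreadable buffer.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Passes an unmapped address to write() and reports what the kernel did.
 * Returns the errno from write(), 0 if write() unexpectedly succeeded,
 * or -1 if the setup (pipe/mmap) failed.
 */
int errno_for_unmapped_buffer(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    /* Map one page and unmap it again so the address is known to be invalid. */
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    munmap(p, len);

    /* The kernel has to copy from p; the copy faults and the call fails. */
    errno = 0;
    ssize_t n = write(fds[1], p, 8);
    int err = (n < 0) ? errno : 0;

    close(fds[0]);
    close(fds[1]);
    return err;
}

/* Usage example: the process survives the bad pointer and sees EFAULT. */
void bad_pointer_example(void)
{
    assert(errno_for_unmapped_buffer() == EFAULT);
}
```

A handler that dereferenced the user pointer directly would instead take an unhandled fault inside the kernel, which is the failure mode the copy helpers are there to avoid.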

Change user space memory protection flags from kernel module

I am writing a kernel module that has access to a particular process's memory. I have done an anonymous mapping on some of the user space memory with do_mmap():
#define MAP_FLAGS (MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS)
prot = PROT_WRITE;
retval = do_mmap(NULL, vaddr, vsize, prot, MAP_FLAGS, 0);
vaddr and vsize are set earlier, and the call succeeds. After I write to that memory block from the kernel module (via copy_to_user), I want to remove the PROT_WRITE permission on it (like I would with mprotect in normal user space). I can't seem to find a function that will allow this.
I attempted unmapping the region and remapping it with the correct protections, but that zeroes out the memory block, erasing all the data I just wrote; setting MAP_UNINITIALIZED might fix that, but, from the man pages:
MAP_UNINITIALIZED (since Linux 2.6.33)
Don't clear anonymous pages. This flag is intended to improve performance on embedded
devices. This flag is only honored if the kernel was configured with the
CONFIG_MMAP_ALLOW_UNINITIALIZED option. Because of the security implications, that option
is normally enabled only on embedded devices (i.e., devices where one has complete
control of the contents of user memory).
so, while that might do what I want, it wouldn't be very portable. Is there a standard way to accomplish what I've suggested?

After some more research, I found a function called get_user_pages() (best documentation I've found is here) that returns a list of pages from userspace at a given address that can be mapped to kernel space with kmap() and written to that way (in my case, using kernel_read()). This can be used as a replacement for copy_to_user() because it allows forcing write permissions on the pages retrieved. The only drawback is that you have to write page by page, instead of all in one go, but it does solve the problem I described in my question.

In userspace there is a system call, mprotect, that can modify the protection flags on an existing mapping. You probably need to follow the implementation of that system call, or maybe simply call it directly from your code. See mm/mprotect.c.
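
For reference, this is the user-space behaviour the question wants to reproduce from inside the kernel: write to an anonymous mapping, then drop the write permission without losing the contents. It is only a user-space sketch and does not show the in-kernel equivalent; the function name write_then_make_read_only and the "payload" string are made up for illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 0 if data written before mprotect() is still readable afterwards,
 * -1 on any error.
 */
int write_then_make_read_only(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);

    /* Anonymous private mapping, initially writable. */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    strcpy(buf, "payload");

    /* Change the protection in place; the pages are not unmapped or zeroed. */
    if (mprotect(buf, len, PROT_READ) != 0) {
        munmap(buf, len);
        return -1;
    }

    int intact = (strcmp(buf, "payload") == 0);
    munmap(buf, len);
    return intact ? 0 : -1;
}

/* Usage example. */
void mprotect_example(void)
{
    assert(write_then_make_read_only() == 0);
}
```

Unlike the unmap-and-remap approach described in the question, changing the protection in place keeps the data that was already written.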

Ring level shift in Win NT based OS

Can anyone please tell me how the privilege change happens in Windows?
I know the user mode code (RL:3) passes the parameters to APIs,
and these APIs call the kernel code (RL:1).
But now I want to know whether, during the security (RPL) check, some token is exchanged between the RL3 API and the RL1 kernel API.
If I am wrong, please let me know (through some link or a brief description) how it works.
Please feel free to close this thread if it's off-topic, offensive or a duplicate.
RL: Ring Level
RPL: Requested Privilege Level

Interrupt handlers and the syscall instruction (which is an optimized software interrupt) automatically modify the privilege level (this is a hardware feature, the ring 0 vs ring 3 distinction you mentioned) along with replacing other processor state (instruction pointer, stack pointer, etc). The prior state is of course saved so that it can be restored after the interrupt completes.
Kernel code has to be extremely careful not to trust input from user-mode. One way of handling this is to not let user-mode pass in pointers which will be dereferenced in kernel mode, but instead HANDLEs which are looked up in a table in kernel-mode memory, which can't be modified by user-mode at all. Capability information is stored in the HANDLE table and associated kernel data structures; this is how, for example, WriteFile knows to fail if a file object is opened for read-only access.
The task switcher maintains information on which process is currently running, so that syscalls which perform security checks, such as CreateFile, can check the user account of the current process and verify it against the file ACL. This process ID and user token are again stored in memory which is accessible only to the kernel.
The MMU page tables are used to prevent user-mode from modifying kernel memory: kernel pages are either not mapped at all in user mode or are marked supervisor-only, and the page access bits (read, write, execute) are enforced in hardware by the MMU. Note that the syscall instruction itself does not switch page tables; where the kernel uses a separate page table, the swap is done by the kernel's entry code on syscall or interrupt entry.
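
The HANDLE idea described above can be illustrated with a toy table in C. This is only a sketch of the general technique, not the Windows implementation: every name here (handle_entry, open_object, write_object, ACCESS_READ, ACCESS_WRITE) is hypothetical, and a plain integer stands in for the kernel object.

```c
#include <assert.h>
#include <stdbool.h>

enum { ACCESS_READ = 1, ACCESS_WRITE = 2 };
#define TABLE_SIZE 16

struct handle_entry {
    bool in_use;
    unsigned access;   /* rights granted when the object was opened */
    int object_id;     /* stand-in for a pointer to a kernel object */
};

/* Lives in "kernel" memory; callers only ever see an index into it. */
static struct handle_entry table[TABLE_SIZE];

/* Returns a handle (table index), or -1 if the table is full. */
int open_object(int object_id, unsigned access)
{
    for (int h = 0; h < TABLE_SIZE; h++) {
        if (!table[h].in_use) {
            table[h] = (struct handle_entry){ true, access, object_id };
            return h;
        }
    }
    return -1;
}

/* Validates the handle and its rights. Returns 0 if a write is permitted. */
int write_object(int handle)
{
    if (handle < 0 || handle >= TABLE_SIZE || !table[handle].in_use)
        return -1;                    /* forged or stale handle */
    if (!(table[handle].access & ACCESS_WRITE))
        return -1;                    /* opened without write access */
    return 0;                         /* the real write would happen here */
}

/* Usage example: a read-only handle cannot be used for writing. */
void handle_example(void)
{
    int ro = open_object(42, ACCESS_READ);
    assert(ro >= 0);
    assert(write_object(ro) == -1);
}
```

Since the caller never holds a pointer, there is nothing for it to corrupt: the worst it can do is pass an index that fails validation.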

Linux kernel code space write protection

I have a couple of questions about Linux kernel memory page write protection.
How can I figure out whether the kernel code (text segment) is write protected? I can look at /proc/<process-id>/maps to see the memory map for various processes, but I am not sure where to look for the kernel code memory map.
If the kernel code segment is write protected, is it still possible for the code segment pages to be overwritten by other kernel-level code? In other words, does write protection on a text segment page only protect against user-space code writing to it, or does it prevent writes from kernel-space code as well?
Thanks

Code running in the kernel has direct access to the page tables for the current address space, so it can check for write access by examining those. There are probably functions to help you with that check, but I'm not familiar enough with the mm code to point them out. Is there an easier way? I'm not sure.
The kernel text should never be writable from user-space. The text can additionally be protected against writing from kernel code too (I think this is what you're talking about). This is only a basic protection against bugs. Kernel code, if it really wants to, can disable that protection by modifying the page tables directly.

There is one paper talking about that. Basically, it uses a small hypervisor to protect the OS kernel.
SecVisor: A Tiny Hypervisor to Provide Lifetime Kernel Code Integrity for Commodity OSes.
http://www.sosp2007.org/papers/sosp079-seshadri.pdf
