I am try to understand the mechanism used by Linux to invoke a system call. In particular, I am struggling to understand the VSDO mechanism. Can it be used to invoke all system calls? And what the difference between the vsdo page and vsyscall page within the process memory? are they always there?
For example using cat /proc/self/maps :
7fff32938000-7fff32939000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Best,
The vsyscall and vDSO segments are two mechanisms used to accelerate certain system calls in Linux. For instance, gettimeoftheday is usually invoked through this mechanism. The first mechanism introduced was vsyscall, which was added as a way to execute specific system calls which do not need any real level of privilege to run in order to reduce the system call overhead. Following the previous example, all gettimeofday needs to do is to read the kernel's the current time. There are applications that call gettimeofday frequently (e.g to generate timestamps), to the point that they care about even a little bit of overhead. To address this concern, the kernel maps into user space a page containing the current time and a fast gettimeofday implementation (i.e. just a function which reads the time saved into vsyscall). Using this virtual system call, the C library can provide a fast gettimeofday which does not have the overhead introduced by the context switch between kernel space and user space usually introduced by the classic system call model INT 0x80 or SYSCALL.
However, this vsyscall mechanism has some limitations: the memory allocated is small and allows only 4 system calls, and, more important and serious, the vsyscall page is statically allocated to the same address in each process, since the location of the vsyscall page is nailed down in the kernel ABI. This static allocation of the vsyscall compromises the benefit introduced by the memory space randomisation commonly used by Linux. An attacker, after compromising an application by exploiting a stack-overflow, can invoke a system call from the vsyscall page with arbitrary parameters. All he needs is the address of the system call, which is easily predicable as it is statically allocated (if you try to run again your command even with different applications, you'll notice that the address of the vsyscall does not change).
It would be nice to remove or at least randomize the location of the vsyscall page to thwart this type of attack. Unfortunately, applications depend on the existence and exact address of that page, so nothing can be done.
This security issue has been addresses by replacing all system call instructions at fixed addressed by a special trap instruction. An application trying to call into the vsyscall page will trap into the kernel, which will then emulate the desired virtual system call in kernel space. The result is a kernel system call emulating a virtual system call which was put there to avoid the kernel system call in the first place. The result is a vsyscall which takes longer to execute but, crucially, does not break the existing ABI. In any case, the slowdown will only be seen if the application is trying to use the vsyscall page instead of the vDSO. The vDSO offers the same functionality as the vsyscall, while overcoming its limitations. The vDSO (Virtual Dynamically linked Shared Objects) is a memory area allocated in user space which exposes some kernel functionalities at user space in a safe manner.
This has been introduced to solve the security threats caused by the vsyscall.
The vDSO is dynamically allocated which solves security concerns and can have more than 4 system calls. The vDSO links are provided via the glibc library. The linker will link in the glibc vDSO functionality, provided that such a routine has an accompanying vDSO version, such as gettimeofday. When your program executes, if your kernel does not have vDSO support, a traditional syscall will be made.
Credits and useful links :
Awesome tutorial, how to create your own vDSO.
vsyscall andvDSO, nice article
useful article and links
What is linux-gate.so.1?
Related
This is actually a theoretical question about memory management. Since different operating systems implement things differently, I'll have to relieve my thirst for knowledge asking how things work in only one of them :( Preferably the open source and widely used one: Linux.
Here is the list of things I know in the whole puzzle:
malloc() is user space. libc is responsible for the syscall job (calling brk/sbrk/mmap...). It manages to get big chunks of memory, described by ranges of virtual addresses. The library slices these chunks and manages to respond the user application requests.
I know what brk/sbrk syscalls do. I know what 'program break' means. These calls basically push the program break offset. And this is how libc gets its virtual memory chunks.
Now that user application has a new virtual address to manipulate, it simply writes some value to it. Like: *allocated_integer = 5;. Ok. Now, what? If brk/sbrk only updates offsets in the process' entry in the process table, or whatever, how the physical memory is actually allocated?
I know about virtual memory, page tables, page faults, etc. But I wanna know exactly how these things are related to this situation that I depicted. For example: is the process' page table modified? How? When? A page fault occurs? When? Why? With what purpose? When is this 'buddy algorithm' called, and this free_area data structure accessed? (http://www.tldp.org/LDP/tlk/mm/memory.html, section 3.4.1 Page Allocation)
Well, after finally finding an excellent guide (http://duartes.org/gustavo/blog/post/how-the-kernel-manages-your-memory/) and some hours digging the Linux kernel, I found the answers...
Indeed, brk only pushes the virtual memory area.
When the user application hits *allocated_integer = 5;, a page fault occurs.
The page fault routine will search for the virtual memory area responsible for the address and then call the page table handler.
The page table handler goes through each level (2 levels in x86 and 4 levels in x86_64), allocating entries if they're not present (2nd, 3rd and 4th), and then finally calls the real handler.
The real handler actually calls the function responsible for allocating page frames.
I am writing a kernel module that has access to a particular process's memory. I have done an anonymous mapping on some of the user space memory with do_mmap():
#define MAP_FLAGS (MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS)
prot = PROT_WRITE;
retval = do_mmap(NULL, vaddr, vsize, prot, MAP_FLAGS, 0);
vaddr and vsize are set earlier, and the call succeeds. After I write to that memory block from the kernel module (via copy_to_user), I want to remove the PROT_WRITE permission on it (like I would with mprotect in normal user space). I can't seem to find a function that will allow this.
I attempted unmapping the region and remapping it with the correct protections, but that zeroes out the memory block, erasing all the data I just wrote; setting MAP_UNINITIALIZED might fix that, but, from the man pages:
MAP_UNINITIALIZED (since Linux 2.6.33)
Don't clear anonymous pages. This flag is intended to improve performance on embedded
devices. This flag is only honored if the kernel was configured with the
CONFIG_MMAP_ALLOW_UNINITIALIZED option. Because of the security implications, that option
is normally enabled only on embedded devices (i.e., devices where one has complete
control of the contents of user memory).
so, while that might do what I want, it wouldn't be very portable. Is there a standard way to accomplish what I've suggested?
After some more research, I found a function called get_user_pages() (best documentation I've found is here) that returns a list of pages from userspace at a given address that can be mapped to kernel space with kmap() and written to that way (in my case, using kernel_read()). This can be used as a replacement for copy_to_user() because it allows forcing write permissions on the pages retrieved. The only drawback is that you have to write page by page, instead of all in one go, but it does solve the problem I described in my question.
In userspace there is a system call mprotect that can modify the protection flags on existing mapping. You probably need to follow from the implementation of that system call, or maybe simply call it directly from your code. See mm/protect.c.
I had couple of questions on linux kernel memory page write protection.
How can i figure out if the kernel
code (text segment) is write
protected or not. I can look at
/proc/<process-id>/map to see the
memory map for various processes.
But not sure where to look for the
kernel code memory map.
If the kernel code segment is write
protected, then is it possible for
the code segment pages to be
overwritten by any other kernel
level code. In other words, does the
write protect on a text segment page
protects against only the user space
code writing to it or will it
prevent writes even from within the
kernel space code.
Thanks
Code running in the kernel has direct access to the page tables for the current address space, so it can check for write access by examining those. There are probably functions to help you with that check, but I'm not familiar enough with the mm code to point them out. Is there an easier way? I'm not sure.
The kernel text should never be writable from user-space. The text can additionally be protected against writing from kernel code too (I think this is what you're talking about). This is only a basic protection against bugs. Kernel code, if it really wants to, can disable that protection by modifying the page tables directly.
There is one paper talking about that. Basically, it uses a small hypervisor to protect the OS kernel.
SecVisor: A Tiny Hypervisor to Provide Lifetime Kernel Code Integrity for Commodity OSes.
http://www.sosp2007.org/papers/sosp079-seshadri.pdf
Ultimately I am just trying to figure out how to dynamically allocate heap memory from within assembly.
If I call Linux sbrk() from assembly code, can I use the address returned as I would use an address of a statically (ie in the .data section of my program listing) declared chunk of memory?
I know Linux uses the hardware MMU if present, so I am not sure if what sbrk returns is a 'raw' pointer to real RAM, or is it a cooked pointer to RAM that may be modified by Linux's VM system?
I read this: How are sbrk/brk implemented in Linux?. I suspect I can not use the return value from sbrk() without worry: the MMU fault on access-non-allocated-address must cause the VM to alter the real location in RAM being addressed. Thus assy, not linked against libc or what-have-you, would not know the address has changed.
Does this make sense, or am I out to lunch?
Unix user processes live in virtual memory, no matter if written in assembler of Fortran, and should not care about physical addresses. That's kernel's business - kernel sets up and manages the MMU. You don't have to worry about it. Page faults are handled automatically and transparently.
sbrk(2) returns a virtual address specific to the process, if that's what you were asking.
How does Windows protect against a user-mode thread from arbitrarily transitioning the CPU to kernel-mode?
I understand these things are true:
User-mode threads DO actually transition to kernel-mode when a system call is made through NTDLL.
The transition to kernel-mode is done through processor-specific instructions.
So what is special about these system calls through NTDLL? Why can't the user-mode thread fake-it and execute the processor-specific instructions to transition to kernel-mode? I know I'm missing some key piece of Windows architecture here...what is it?
You're probably thinking that thread running in user mode is calling into Ring 0, but that's not what's actually happening. The user mode thread is causing an exception that's caught by the Ring 0 code. The user mode thread is halted and the CPU switches to a kernel/ring 0 thread, which can then inspect the context (e.g., call stack and registers) of the user mode thread to figure out what to do. Before syscall, it really was an exception rather than a special exception specifically to invoke ring 0 code.
If you take the advice of the other responses and read the Intel manuals, you'll see syscall/sysenter don't take any parameters - the OS decides what happens. You can't call arbitrary code. WinNT uses function numbers that map to which kernel mode function the user mode code will execute (for example, NtOpenFile is fnc 75h on my Windows XP machine (the numbers change all the time; it's one of the jobs of NTDll is to map a function call to a fnc number, put it in EAX, point EDX to the incoming parameters then invoke sysenter).
Intel CPUs enforce security using what's called 'Protection Rings'.
There are 4 of these, numbered from 0 to 3. Code running in ring 0 has the highest privileges; it can (practically) do whatever it pleases with your computer. The code in ring 3, on the other hand, is always on a tight leash; it has only limited powers to influence things. And rings 1 and 2 are currently not used for any purpose at all.
A thread running in a higher privileged ring (such as ring 0) can transition to lower privilege ring (such as ring 1, 2 or 3) at will. However, the transition the other way around is strictly regulated. This is how the security of high privileged resources (such as memory) etc. is maintained.
Naturally, your user mode code (applications and all) runs in ring 3 while the OS's code runs in ring 0. This ensures that the user mode threads can't mess with the OS's data structures and other critical resources.
For details on how all this is actually implemented you could read this article. In addition, you may also want to go through Intel Manuals, especially Vol 1 and Vol 3A, which you can download here.
This is the story for Intel processors. I'm sure other architectures have something similar going on.
I think (I may be wrong) that the mechanism which it uses for transition is simple:
User-mode code executes a software interrupt
This (interrupt) causes a branch to a location specified in the interrupt descriptor table (IDT)
The thing that prevents user-mode code from usurping this is as follows: you need to be priviledged to write to the IDT; so only the kernel is able to specify what happens when an interrupt is executed.
Code running in User Mode (Ring 3) can't arbitrarily change to Kernel Mode (Ring 0). It can only do so using special routes -- jump gates, interrupts, and sysenter vectors. These routes are highly protected and input is scrubbed so that bad data can't (shouldn't) cause bad behavior.
All of this is set up by the kernel, usually on startup. It can only be configured in Kernel Mode so User-Mode code can't modify it.
It's probably fair to say that it does it in a (relatively) similar way to what Linux does. In both cases it's going to be CPU-specific, but on x86 probably either a software interrupt with the INT instruction, or via SYSENTER instruction.
The advantage of looking at how Linux does it is that you can do so without a Windows source licence.
The userspace source part is here here at LXR and the
kernel space bit - look at entry_32.S and entry_64.S
Under Linux on x86 there are three different mechanisms, int 0x80, syscall and sysenter.
A library which is built at runtime by the kernel called vdso is called by the C library to implement the syscall function, which uses a different mechanism depending on the CPU and which system call it is. The kernel then has handlers for those mechanisms (if they exist on the specific CPU variant).