What happens when I printk a char * that was initialized in userspace? - linux-kernel

I implemented a new system call as an intro exercise. All it does is take in a buffer and printk that buffer. I later learned that the correct practice would be to use copy_from_user.
Is this just a precautionary measure to validate the address, or is my system call causing some error (page fault?) that I cannot see?
If it is just a precautionary measure, what is it protecting against?
Thanks!

There are several reasons.
Some architectures employ segmented memory, where there is a separate segment for the user memory. In that case, copy_from_user is essential to actually get the right memory address.
The kernel has access to everything, including (almost by definition) a lot of privileged information. Not using copy_from_user could allow information disclosure if a user passes in a kernel address. Worse, if you are writing to a user-supplied buffer without copy_to_user, the user could overwrite kernel memory.
You'd like to prevent the user from crashing the kernel module just by passing in a bad pointer; using copy_from_user protects against faults so e.g. a system call handler can return EFAULT in response to a bad user pointer.

Related

Making a virtual IOPCIDevice with IOKit

I have managed to create a virtual IOPCIDevice which attaches to IOResources and basically does nothing. I'm able to get existing drivers to register and match to it.
However when it comes to IO handling, I have some trouble. IO access by functions (e.g. configRead, ioRead, configWrite, ioWrite) that are described in IOPCIDevice class can be handled by my own code. But drivers that use memory mapping and IODMACommand are the problem.
There seems to be two things that I need to manage: IODeviceMemory(described in the IOPCIDevice) and DMA transfer.
How could I create a IODeviceMemory that ultimately points to memory/RAM, so that when driver tries to communicate to PCI device, it ultimately does nothing or just moves the data to RAM, so my userspace client can handle this data and act as an emulated PCI device?
And then could DMA commands be directed also to my userspace client without interfering to existing drivers' source code that use IODMACommand.
Thanks!
Trapping memory accesses
So in theory, to achieve what you want, you would need to allocate a memory region, set its protection bits to read-only (or possibly neither read nor write if a read in the device you're simulating has side effects), and then trap any writes into your own handler function where you'd then simulate device register writes.
As far as I'm aware, you can do this sort of thing in macOS userspace, using Mach exception handling. You'd need to set things up that page protection fault exceptions from the process you're controlling get sent to a Mach port you control. In that port's message handler, you'd:
check where the access was going to
if it's the device memory, you'd suspend all the threads of the process
switch the thread where the write is coming from to single-step, temporarily allow writes to the memory region
resume the writer thread
trap the single-step message. Your "device memory" now contains the written value.
Perform your "device's" side effects.
Turn off single-step in the writer thread.
Resume all threads.
As I said, I believe this can be done in user space processes. It's not easy, and you can cobble together the Mach calls you need to use from various obscure examples across the web. I got something similar working once, but can't seem to find that code anymore, sorry.
… in the kernel
Now, the other problem is you're trying to do this in the kernel. I'm not aware of any public KPIs that let you do anything like what I've described above. You could start looking for hacks in the following places:
You can quite easily make IOMemoryDescriptors backed by system memory. Don't worry about the IODeviceMemory terminology: these are just IOMemoryDescriptor objects; the IODeviceMemory class is a lie. Trapping accesses is another matter entirely. In principle, you can find out what virtual memory mappings of a particular MD exist using the "reference" flag to the createMappingInTask() function, and then call the redirect() method on the returned IOMemoryMap with a NULL backing memory argument. Unfortunately, this will merely suspend any thread attempting to access the mapping. You don't get a callback when this happens.
You could dig into the guts of the Mach VM memory subsystem, which mostly lives in the osfmk/vm/ directory of the xnu source. Perhaps there's a way to set custom fault handlers for a VM region there. You're probably going to have to get dirty with private kernel APIs though.
Why?
Finally, why are you trying to do this? Take a step back: What is it you're ultimately trying to do with this? It doesn't seem like simulating a PCI device in this way is an end to itself, so is this really the only way to do what greater goal you're ultimately trying to achieve? See: XY problem

What happens in the kernel when the process accesses an address just allocated with brk/sbrk?

This is actually a theoretical question about memory management. Since different operating systems implement things differently, I'll have to relieve my thirst for knowledge asking how things work in only one of them :( Preferably the open source and widely used one: Linux.
Here is the list of things I know in the whole puzzle:
malloc() is user space. libc is responsible for the syscall job (calling brk/sbrk/mmap...). It manages to get big chunks of memory, described by ranges of virtual addresses. The library slices these chunks and manages to respond the user application requests.
I know what brk/sbrk syscalls do. I know what 'program break' means. These calls basically push the program break offset. And this is how libc gets its virtual memory chunks.
Now that user application has a new virtual address to manipulate, it simply writes some value to it. Like: *allocated_integer = 5;. Ok. Now, what? If brk/sbrk only updates offsets in the process' entry in the process table, or whatever, how the physical memory is actually allocated?
I know about virtual memory, page tables, page faults, etc. But I wanna know exactly how these things are related to this situation that I depicted. For example: is the process' page table modified? How? When? A page fault occurs? When? Why? With what purpose? When is this 'buddy algorithm' called, and this free_area data structure accessed? (http://www.tldp.org/LDP/tlk/mm/memory.html, section 3.4.1 Page Allocation)
Well, after finally finding an excellent guide (http://duartes.org/gustavo/blog/post/how-the-kernel-manages-your-memory/) and some hours digging the Linux kernel, I found the answers...
Indeed, brk only pushes the virtual memory area.
When the user application hits *allocated_integer = 5;, a page fault occurs.
The page fault routine will search for the virtual memory area responsible for the address and then call the page table handler.
The page table handler goes through each level (2 levels in x86 and 4 levels in x86_64), allocating entries if they're not present (2nd, 3rd and 4th), and then finally calls the real handler.
The real handler actually calls the function responsible for allocating page frames.

Change user space memory protection flags from kernel module

I am writing a kernel module that has access to a particular process's memory. I have done an anonymous mapping on some of the user space memory with do_mmap():
#define MAP_FLAGS (MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS)
prot = PROT_WRITE;
retval = do_mmap(NULL, vaddr, vsize, prot, MAP_FLAGS, 0);
vaddr and vsize are set earlier, and the call succeeds. After I write to that memory block from the kernel module (via copy_to_user), I want to remove the PROT_WRITE permission on it (like I would with mprotect in normal user space). I can't seem to find a function that will allow this.
I attempted unmapping the region and remapping it with the correct protections, but that zeroes out the memory block, erasing all the data I just wrote; setting MAP_UNINITIALIZED might fix that, but, from the man pages:
MAP_UNINITIALIZED (since Linux 2.6.33)
Don't clear anonymous pages. This flag is intended to improve performance on embedded
devices. This flag is only honored if the kernel was configured with the
CONFIG_MMAP_ALLOW_UNINITIALIZED option. Because of the security implications, that option
is normally enabled only on embedded devices (i.e., devices where one has complete
control of the contents of user memory).
so, while that might do what I want, it wouldn't be very portable. Is there a standard way to accomplish what I've suggested?
After some more research, I found a function called get_user_pages() (best documentation I've found is here) that returns a list of pages from userspace at a given address that can be mapped to kernel space with kmap() and written to that way (in my case, using kernel_read()). This can be used as a replacement for copy_to_user() because it allows forcing write permissions on the pages retrieved. The only drawback is that you have to write page by page, instead of all in one go, but it does solve the problem I described in my question.
In userspace there is a system call mprotect that can modify the protection flags on existing mapping. You probably need to follow from the implementation of that system call, or maybe simply call it directly from your code. See mm/protect.c.

Ring level shift in Win NT based OS

Can anyone please tell me how there is privilege change in Windows OS.
I know the user mode code (RL:3) passes the parameters to APIs.
And these APIs call the kernel code (RL:1).
But now I want to know, during security(RPL) check is there some token that is exchanged between these RL3 API and RL1 Kernel API.
if I am wrong please let me know (through Some Link or Brief description) how it works.
Please feel free to close this thread if its offtopic, offensive or duplicate.
RL= Ring Level
RPL:Requested Privilege level
Interrupt handlers and the syscall instruction (which is an optimized software interrupt) automatically modify the privilege level (this is a hardware feature, the ring 0 vs ring 3 distinction you mentioned) along with replacing other processor state (instruction pointer, stack pointer, etc). The prior state is of course saved so that it can be restored after the interrupt completes.
Kernel code has to be extremely careful not to trust input from user-mode. One way of handling this is to not let user-mode pass in pointers which will be dereferenced in kernel mode, but instead HANDLEs which are looked up in a table in kernel-mode memory, which can't be modified by user-mode at all. Capability information is stored in the HANDLE table and associated kernel data structures, this is how, for example, WriteFile knows to fail if a file object is opened for read-only access.
The task switcher maintains information on which process is currently running, so that syscalls which perform security checks, such as CreateFile, can check the user account of the current process and verify it against the file ACL. This process ID and user token are again stored in memory which is accessible only to the kernel.
The MMU page tables are used to prevent user-mode from modifying kernel memory -- generally there is no page mapping at all; there are also page access bits (read, write, execute) which are enforced in hardware by the MMU. Kernel code uses a different page table, the swap occurs as part of the syscall instruction and/or interrupt activation.

How can I access memory at known physical address inside a kernel module?

I am trying to work on this problem: A user spaces program keeps polling a buffer to get requests from a kernel module and serves it and then responses to the kernel.
I want to make the solution much faster, so instead of creating a device file and communicating via it, I allocate a memory buffer from the user space and mark it as pinned, so the memory pages never get swapped out. Then the user space invokes a special syscall to tell the kernel about the memory buffer so that the kernel module can get the physical address of that buffer. (because the user space program may be context-switched out and hence the virtual address means nothing if the kernel module accesses the buffer at that time.)
When the module wants to send request, it needs put the request to the buffer via physical address. The question is: How can I access the buffer inside the kernel module via its physical address.
I noticed there is get_user_pages, but don't know how to use it, or maybe there are other better methods?
Thanks.
You are better off doing this the other way around - have the kernel allocate the buffer, then allow the userspace program to map it into its address space using mmap().
Finally I figured out how to handle this problem...
Quite simple but may be not safe.
use phys_to_virt, which calls __va(pa), to get the virtual address in kernel and I can access that buffer. And because the buffer is pinned, that can guarantee the physical location is available.
What's more, I don't need s special syscall to tell the kernel the buffer's info. Instead, a proc file is enough because I just need tell the kernel once.

Resources