User to kernel mode big picture? - linux-kernel

I have to implement a char device as an LKM.
I know some basics about OS, but I feel I don't have the big picture.
In a C program, when I call a syscall, what I think happens is that the CPU switches to ring 0, goes to the syscall vector, and jumps to a function in kernel memory that handles it. (I think it does int 0x80 with the syscall's offset into the vector in eax, but I'm not sure.)
Then I'm in the syscall itself, but I guess that for the kernel it is still the same process as before, only in kernel mode; I mean the current PCB is that of the process that called the syscall.
So far, so good? Correct me if something is wrong.
Other questions: how can I read/write process memory?
If in the syscall handler I refer to an address, say 0xbfffffff, what does that address mean? Is it a physical one? Some virtual kernel one?

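To read or write user memory from the kernel, you need to use accessor functions such as get_user() or copy_to_user() (the __copy_to_user() variant skips the access_ok() check, so the caller has to do it first).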
See the User Space Memory Access API of the Linux Kernel.
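One visible effect of those accessors: when a syscall is handed a user pointer the kernel cannot read, the copy helper fails and the call returns EFAULT instead of crashing the kernel. A minimal user-space sketch, assuming Linux; the function name is made up for illustration, and the pipe plus PROT_NONE page are just a convenient way to provoke the failure.

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns the errno reported by write(2) when the source
 * buffer is unreadable, 0 if the write unexpectedly succeeded, -1 on setup
 * failure. */
int errno_for_unreadable_buffer(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return -1;

    /* A mapped address range with no access rights at all. */
    void *bad = mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (bad == MAP_FAILED) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* The kernel has to copy from `bad` into the pipe; that copy fails. */
    errno = 0;
    ssize_t n = write(fds[1], bad, 16);
    int err = (n < 0) ? errno : 0;

    munmap(bad, 4096);
    close(fds[0]);
    close(fds[1]);
    return err;
}
```

Calling errno_for_unreadable_buffer() from main() and comparing the result with EFAULT shows that the process survives and merely gets an error back.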

Your own code can never run in ring0 from a regular process; a syscall only enters kernel code.
To run your own code in ring0 you'll have to write a kernel module.
And you never have to deal with physical addresses: 0xbfffffff is an address in the virtual address space of your process.

Big picture:
Everything comes down to assembly. In Intel assembly there is a set of privileged instructions which can only be executed in Ring0 mode. To make the transition into Ring0 mode, you can use the "Int" or "Sysenter" instruction (see the related question "what all happens in sysenter instruction is used in linux?").
Then, inside Ring0 mode (which is your kernel mode), memory accesses require the privilege levels to match, via the DPL/CPL/RPL bits attached to the segment selectors and descriptors.
You may ask how the CPU initializes memory and registers in the first place: at boot, an x86 CPU runs in real mode, unprotected (no ring concept), so everything is possible and a lot of the setup work is done there.
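Going back to the Int/Sysenter transition mentioned above: from C you normally never write that instruction yourself, because libc's syscall() wrapper loads the syscall number into a register (eax on 32-bit x86) and executes the trap instruction for you. A minimal sketch, assuming Linux and glibc; raw_getpid is a name made up for this example.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask the kernel for our PID by syscall number instead of going through the
 * getpid() wrapper. The kernel runs the handler on behalf of the calling
 * process, so the answer is this process's own PID. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

Comparing raw_getpid() with getpid() in main() gives the same value, which matches the idea that the syscall handler still runs in the context of the calling process.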
As for virtual vs. physical addresses: just remember that any address held in a register and used for memory addressing is a virtual address (once the MMU is set up and paging is enabled). Everything coming from the CPU is a virtual address; only the memory bus sees physical addresses.
As for memory separation between userspace and kernel, you can read here:


Difference between copying user space memory and mapping userspace memory

What is the difference between copying from a user-space buffer to a kernel-space buffer, and mapping a user-space buffer into kernel space and then copying a kernel-space buffer to another kernel data structure?
What I meant to say is:
The first method is the copy_from_user() function.
The second method is, say, that a user-space buffer is mapped into kernel space: the kernel is passed the buffer's physical address (obtained, say, through /proc/self/pagemap) and calls phys_to_virt() on it to get the corresponding kernel virtual address. The kernel then copies data from one of its data structures, say an sk_buff, to that kernel virtual address.
Note: phys_to_virt() adds an offset of 0xc0000000 to the passed physical address to get the kernel virtual address, right?
The second method is what the KNI module in DPDK does, and the documentation says that it eliminates the overhead of copying from user space to kernel space. Please explain how.
It really depends on what you're trying to accomplish, but here are some differences I can think of.
To begin with, copy_from_user() has built-in safety checks that should be considered: it verifies that the source range lies in user space, and it returns the number of bytes it could not copy instead of faulting.
Mapping your data "manually" into kernel space, on the other hand, lets you read it continuously, and maybe monitor what the user process is doing to the data in that page, whereas with copy_from_user() you have to call it again every time you want to see changes.
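The "read continuously" vs. "copy again" difference can be shown with a user-space analogy. This is only a sketch under stated assumptions: two mmap() views of one temporary file stand in for the user and kernel mappings of the same physical page, memcpy() stands in for copy_from_user(), and all names are made up.

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION 4096

/* Returns 1 if the second mapping sees a later write while an earlier copy
 * still holds the old data, 0 if not, -1 on setup failure. */
int mapping_sees_update_copy_does_not(void)
{
    FILE *f = tmpfile();
    if (f == NULL)
        return -1;
    int fd = fileno(f);
    if (ftruncate(fd, REGION) != 0) {
        fclose(f);
        return -1;
    }

    /* Two virtual address ranges backed by the same pages. */
    char *writer = mmap(NULL, REGION, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *reader = mmap(NULL, REGION, PROT_READ, MAP_SHARED, fd, 0);
    if (writer == MAP_FAILED || reader == MAP_FAILED) {
        fclose(f); /* a mapping may leak here; acceptable for a sketch */
        return -1;
    }

    char snapshot[16];
    strcpy(writer, "old");
    memcpy(snapshot, writer, sizeof snapshot); /* one-time copy */
    strcpy(writer, "new");                     /* producer updates the data */

    int ok = strcmp(reader, "new") == 0 && strcmp(snapshot, "old") == 0;

    munmap(writer, REGION);
    munmap(reader, REGION);
    fclose(f);
    return ok;
}
```

The mapped view picks up the update for free, while the snapshot would need another copy. That saved copy is the overhead a shared mapping avoids, at the price of losing the checks the copy helper performs.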
Can you elaborate on what you are trying to do?

Windows kernel memory protection

In Windows, the high part of every process's address space (starting at 0x80000000 or 0xc0000000) is reserved for kernel code. User code cannot access these regions of memory; if it tries, an access violation exception is thrown.
I wish to know how the kernel space is protected.
Is it via memory segmentation or via paging?
I would like to hear a technical explanation.
Thanks a lot,
I'm assuming you are talking about the x86 and x64 architectures.
Memory protection is achieved using the paging system. Each page table entry on an x86/x64 CPU has a bit that indicates whether it is a user or a supervisor page. Accesses to supervisor pages are only permitted for code running with CPL<3, whereas accesses to user pages are possible regardless of CPL.
CPL is the "Current Privilege Level" which is sometimes referred to as Ring. Windows only uses two rings, although the CPU implements 4. Ring 0 is the CPU mode in which what Windows refers to as "kernel mode" runs. Ring 3 is the CPU mode in which "User mode" runs. Since code running at CPL=3 cannot access supervisor pages, this is how memory protection is implemented.
The answer for ARM is likely to be similar, but different.
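The x86 user/supervisor check described above can be sketched as a small C model. It is deliberately simplified and the names are made up: it looks at a single page table entry (real hardware combines the bits of every paging level) and it lets supervisor code write to read-only pages, which on a real CPU depends on CR0.WP.

```c
#include <stdbool.h>
#include <stdint.h>

/* Low flag bits of an x86 page table entry. */
#define PTE_PRESENT  (1u << 0) /* P:   page is mapped           */
#define PTE_WRITABLE (1u << 1) /* R/W: writes are allowed       */
#define PTE_USER     (1u << 2) /* U/S: set = user page, clear = supervisor */

/* Would an access at privilege level `cpl` to a page with flags `pte`
 * succeed? Simplified model, see the assumptions in the text above. */
bool access_allowed(uint32_t pte, unsigned cpl, bool is_write)
{
    if (!(pte & PTE_PRESENT))
        return false;                      /* not mapped at all */
    if (cpl == 3 && !(pte & PTE_USER))
        return false;                      /* user code, supervisor page */
    if (cpl == 3 && is_write && !(pte & PTE_WRITABLE))
        return false;                      /* user write to read-only page */
    return true;
}
```

With this model, access_allowed(PTE_PRESENT | PTE_WRITABLE, 3, false) is false while the same entry is accessible at CPL 0, which is the kernel/user split the answer describes.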
That's an easy one and doesn't require talking about rings and kernel behavior. Accessing virtual memory at a particular address requires that address to be mapped: the operating system has to allocate a memory page for it. The low-level winapi function that does that is VirtualAlloc(), which takes an optional address as its first argument. The OS will simply fail a request for an unmappable address. It is the exact same mechanism that prevents you from mapping any address in the lowest 64KB of the address space.

OS and Hardware role during a LD instruction

When loading the contents of a virtual address into a particular register, what is the general sequence of events that needs to happen in the hardware and operating system as part of the process?
For example,
LD 0xffe4ca32, R1
The address used here is a virtual address, right?
And it would need to go through some address translation first to get a physical address.
My first question is,
When this instruction executes, how is this instruction handled by the Hardware and Operating System?
And my second question is,
Is the "value" of that virtual address, 0xffe4ca32, the contents of its mapped physical address or is it the physical address itself?
I'm just not clear on what is being loaded into R1.
Let's assume x86. First, the CPU asks the MMU (memory management unit) to translate the address. The MMU first checks the TLB (translation look-aside buffer), where recent virtual-to-physical translations are cached. If the translation is there, the physical address is returned. Otherwise, the MMU looks the address up in the page table. If the page is a supervisor-only page, or is marked as not present in memory, the CPU raises a protection fault or a page fault, respectively.
For a protection fault, the OS will usually terminate the responsible process, however it does that. For a page fault, the OS checks its own paging structures to see whether that page has been paged out or just doesn't exist. If it has been paged out, it is read into some free page in memory, and the virtual address is remapped to that new place. If no free page can be found, another page is written to disk to make room (when this happens constantly, it is called thrashing). If the page was never paged out, the OS will most likely kill the process, as it is trying to reference a nonexistent page.
The value stored at the mapped physical address. From the perspective of user space, virtual memory pointers behave just like physical memory pointers. In kernel space there are some complications, as physical memory access is sometimes needed (this is usually achieved through something called identity paging, where the first few hundred pages are mapped directly to their corresponding physical memory).
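The lookup order described above (TLB first, then the page table, then a fault) can be written down as a toy model. This is only a sketch under assumptions: 4 KiB pages, a flat one-level table, a one-entry TLB, and made-up names and frame numbers. Real x86 reports both failure cases through the page-fault exception with an error code; the two status values just mirror the distinction made in the answer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u     /* 4 KiB pages */
#define OFFSET_MASK 0xFFFu

enum xlate_status { XLATE_OK, XLATE_PAGE_FAULT, XLATE_PROTECTION_FAULT };

struct mapping {
    uint32_t vpn;     /* virtual page number   */
    uint32_t pfn;     /* physical frame number */
    bool present;
    bool user;        /* false = supervisor only */
};

struct toy_mmu {
    const struct mapping *table;
    size_t table_len;
    struct mapping tlb;   /* one cached translation */
    bool tlb_valid;
    unsigned tlb_hits;
};

enum xlate_status translate(struct toy_mmu *mmu, uint32_t vaddr,
                            bool user_mode, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & OFFSET_MASK;
    const struct mapping *m = NULL;

    if (mmu->tlb_valid && mmu->tlb.vpn == vpn) {
        m = &mmu->tlb;                      /* TLB hit */
        mmu->tlb_hits++;
    } else {
        for (size_t i = 0; i < mmu->table_len; i++) {
            if (mmu->table[i].vpn == vpn) { /* page table walk */
                m = &mmu->table[i];
                break;
            }
        }
        if (m == NULL || !m->present)
            return XLATE_PAGE_FAULT;        /* the OS decides what happens next */
        mmu->tlb = *m;                      /* cache the translation */
        mmu->tlb_valid = true;
    }

    if (user_mode && !m->user)
        return XLATE_PROTECTION_FAULT;

    /* Same offset within the page, different page frame. */
    *paddr = (m->pfn << PAGE_SHIFT) | offset;
    return XLATE_OK;
}
```

For the address in the question, 0xffe4ca32 splits into page number 0xffe4c and offset 0xa32. If that page maps to a hypothetical frame 0x123, the load reads the contents of physical address 0x123a32 into R1; the physical address itself never shows up in the register.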

Somewhat newb question about assy and the heap

Ultimately I am just trying to figure out how to dynamically allocate heap memory from within assembly.
If I call Linux sbrk() from assembly code, can I use the address returned as I would use the address of a statically declared chunk of memory (i.e. one in the .data section of my program listing)?
I know Linux uses the hardware MMU if present, so I am not sure if what sbrk returns is a 'raw' pointer to real RAM, or is it a cooked pointer to RAM that may be modified by Linux's VM system?
I read this: How are sbrk/brk implemented in Linux?. I suspect I can not use the return value from sbrk() without worry: the MMU fault on access-non-allocated-address must cause the VM to alter the real location in RAM being addressed. Thus assy, not linked against libc or what-have-you, would not know the address has changed.
Does this make sense, or am I out to lunch?
Unix user processes live in virtual memory, no matter whether they are written in assembler or Fortran, and should not care about physical addresses. That's the kernel's business: the kernel sets up and manages the MMU. You don't have to worry about it. Page faults are handled automatically and transparently.
sbrk(2) returns a virtual address specific to the process, if that's what you were asking.
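To see that the returned address is usable like any other pointer, here is a minimal sketch in C rather than assembly, assuming Linux with glibc; sbrk_roundtrip is a made-up name. The same holds for the raw brk syscall issued from assembly, since the page faults on first touch are resolved by the kernel without the program noticing.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Grow the heap by `len` bytes, fill the new region and read it back.
 * Returns 1 if the data reads back intact, 0 if not, -1 if sbrk() failed. */
int sbrk_roundtrip(size_t len)
{
    /* sbrk() returns the old program break, i.e. the start of the new area. */
    void *start = sbrk((intptr_t)len);
    if (start == (void *)-1)
        return -1;

    unsigned char *p = start;
    memset(p, 0xAB, len);   /* first touch: any page faults are handled here */

    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0xAB)
            return 0;
    }
    return 1;
}
```

The pointer stays valid for the lifetime of the process (or until the break is lowered again); the kernel may move the backing physical page around, but the virtual address the program holds does not change.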

How can I access memory at known physical address inside a kernel module?

I am trying to work on this problem: a user-space program keeps polling a buffer to get requests from a kernel module, serves them, and then responds to the kernel.
I want to make the solution much faster, so instead of creating a device file and communicating via it, I allocate a memory buffer from the user space and mark it as pinned, so the memory pages never get swapped out. Then the user space invokes a special syscall to tell the kernel about the memory buffer so that the kernel module can get the physical address of that buffer. (because the user space program may be context-switched out and hence the virtual address means nothing if the kernel module accesses the buffer at that time.)
When the module wants to send a request, it needs to put the request into the buffer via its physical address. The question is: how can I access the buffer inside the kernel module via its physical address?
I noticed there is get_user_pages, but don't know how to use it, or maybe there are other better methods?
You are better off doing this the other way around - have the kernel allocate the buffer, then allow the userspace program to map it into its address space using mmap().
Finally I figured out how to handle this problem...
It is quite simple, but may not be safe.
Use phys_to_virt(), which calls __va(pa), to get the kernel virtual address, through which I can access the buffer. Because the buffer is pinned, its physical location is guaranteed to stay valid.
What's more, I don't need a special syscall to tell the kernel about the buffer. A proc file is enough, because I only need to tell the kernel once.
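For reference, the arithmetic behind __va(pa) is just a constant offset. The sketch below models it in plain C under one loud assumption: 32-bit x86 with the default 3G/1G split, where PAGE_OFFSET is 0xC0000000. The toy_ names are made up, and on other configurations (64-bit in particular) the offset differs.

```c
#include <stdint.h>

/* Assumption: 32-bit x86, default 3G/1G split. Not valid elsewhere. */
#define TOY_PAGE_OFFSET 0xC0000000u

/* Model of __va(): physical address -> kernel virtual address. */
uint32_t toy_phys_to_virt(uint32_t pa)
{
    return pa + TOY_PAGE_OFFSET;
}

/* Model of __pa(): kernel virtual address -> physical address. */
uint32_t toy_virt_to_phys(uint32_t va)
{
    return va - TOY_PAGE_OFFSET;
}
```

This is why the approach "may not be safe": the offset trick only works for memory covered by the kernel's direct mapping. On a 32-bit machine, a user page that happens to live in high memory has no such address and would need a temporary mapping (kmap) instead.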
