From Kernel Space to User Space: Inner-workings of Interrupts - linux-kernel

I have been trying to understand how do h/w interrupts end up in some user space code, through the kernel.
My research led me to understand that:
1- An external device needs attention from CPU
2- It signals the CPU by raising an interrupt (h/w trance to cpu or bus)
3- The CPU asserts, saves current context, looks up address of ISR in the
interrupt descriptor table (vector)
4- CPU switches to kernel (privileged) mode and executes the ISR.
Question #1: How did the kernel store ISR address in interrupt vector table? It might probably be done by sending the CPU some piece of assembly described in the CPUs user manual? The more detail on this subject the better please.
In user space how can a programmer write a piece of code that listens to a h/w device notifications?
This is what I understand so far.
5- The kernel driver for that specific device has now the message from the device and is now executing the ISR.
Question #3:If the programmer in user space wanted to poll the device, I would assume this would be done through a system call (or at least this is what I understood so far). How is this done? How can a driver tell the kernel to be called upon a specific systemcall so that it can execute the request from the user? And then what happens, how does the driver gives back the requested data to user space?
I might be completely off track here, any guidance would be appreciated.
I am not looking for specific details answers, I am only trying to understand the general picture.

Question #1: How did the kernel store ISR address in interrupt vector table?
Driver calls request_irq kernel function (defined in include/linux/interrupt.h and in kernel/irq/manage.c), and Linux kernel will register it in right way according to current CPU/arch rules.
It might probably be done by sending the CPU some piece of assembly described in the CPUs user manual?
In x86 Linux kernel stores ISR in Interrupt Descriptor Table (IDT), it format is described by vendor (Intel - volume 3) and also in many resources like http://en.wikipedia.org/wiki/Interrupt_descriptor_table and http://wiki.osdev.org/IDT and http://phrack.org/issues/59/4.html and http://en.wikibooks.org/wiki/X86_Assembly/Advanced_Interrupts.
Pointer to IDT table is registered in special CPU register (IDTR) with special assembler commands: LIDT and SIDT.
If the programmer in user space wanted to poll the device, I would assume this would be done through a system call (or at least this is what I understood so far). How is this done? How can a driver tell the kernel to be called upon a specific systemcall so that it can execute the request from the user? And then what happens, how does the driver gives back the requested data to user space?
Driver usually registers some device special file in /dev; pointers to several driver functions are registered for this file as "File Operations". User-space program opens this file (syscall open), and kernels calls device's special code for open; then program calls poll or read syscall on this fd, kernel will call *poll or *read of driver's file operations (http://www.makelinux.net/ldd3/chp-3-sect-7.shtml). Driver may put caller to sleep (wait_event*) and irq handler will wake it up (wake_up* - http://www.makelinux.net/ldd3/chp-6-sect-2 ).
You can read more about linux driver creation in book LINUX DEVICE DRIVERS (2005) by Jonathan Corbet, Alessandro Rubini, and Greg Kroah-Hartman: https://lwn.net/Kernel/LDD3/
Chapter 3: Char Drivers https://lwn.net/images/pdf/LDD3/ch03.pdf
Chapter 10: Interrupt Handling https://lwn.net/images/pdf/LDD3/ch10.pdf

Related

Linux kernel initialization - When are devicetree blobs parsed and tree nodes are loaded?

I would like to establish a milestone roadmap for Linux initialization for me to easily understand. (For an embedded system) Here is what I got:
Bootloader loads kernel to RAM and starts it
Linux kernel enters head.o, starts start_kernel()
CPU architecture is found, MMU is started.
setup_arch() is called, setting CPU up.
Kernel subsystems are loaded.
do_initcalls() is called and modules with *_initcall() and module_init() functions are started.
Then /sbin/init (or alike) is run.
I don't know when exactly devicetree is processed here. Is it when do_initcall() functions are beings processed or is it something prior to that?
In general when devicetree is parsed, and when tree nodes are processed?
Thank you very much in advance.
Any correction to my thoughts are highly appreciated.
It's a good question.
Firstly, I think you already know that the kernel will use data in the DT to identify the specific machine, in case of general use across different platform or hardware, we need it to establish in the early boot so that it has the opportunity to run machine-specific fixups.
Here is some information I digest from linux kernel documents.
In the majority of cases, the machine identity is irrelevant, and the kernel will instead select setup code based on the machine’s core CPU or SoC. On ARM for example, setup_arch() in arch/arm/kernel/setup.c will call setup_machine_fdt() in arch/arm/kernel/devtree.c which searches through the machine_desc table and selects the machine_desc which best matches the device tree data. It determines the best match by looking at the ‘compatible’ property in the root device tree node, and comparing it with the dt_compat list in struct machine_desc (which is defined in arch/arm/include/asm/mach/arch.h if you’re curious).
As for the Linux Initialization, I think there are something we can add in the list.
Put on START button, reset signal trigger
CS:IP fix to the BIOS 0XFFFF0 address
Jump to the start of BIOS
Self-check, start of hardware device like keyboard, real mode IDT & GDT
Load Bootloader like grub2 or syslinux.
Bootloader loads kernel to RAM and starts it (boot.img->core.img).
A20 Open, call setup.s, switch into protected mode
Linux kernel enters head.o, IDT & GDT refresh, decompress_kernel(), starts start_kernel()
INIT_TASK(init_task) create
trap_init()
CPU architecture is found, MMU is started (mmu_init()).
setup_arch() is called, setting CPU up.
Kernel subsystems are loaded.
do_initcalls() is called and modules with *_initcall() and module_init() functions are started.
rest_init() will create process 1 & 2, in other word, /sbin/init (or alike) and kthreadd is run.

How do I write to a __user memory from within the top half of an interrupt handler?

I am working on a proprietary device driver. The driver is implemented as a kernel module. This module is then coupled with an user-space process.
It is essential that each time the device generates an interrupt, the driver updates a set of counters directly in the address space of the user-space process from within the top half of the interrupt handler. The driver knows the PID and the task_struct of the user-process and is also aware of the virtual address where the counters lie in the user-process context. However, I am having trouble in figuring out how code running in the interrupt context could take up the mm context of the user-process and write to it. Let me sum up what I need to do:
Get the address of the physical page and offset corresponding to the virtual address of the counters in the context of the user-process.
Set up mappings in the page table and write to the physical page corresponding to the counter.
For this, I have tried the following:
Try to take up the mm context of the user-task, like below:
use_mm(tsk->mm);
/* write to counters. */
unuse_mm(tsk->mm);
This apparently causes the entire system to hang.
Wait for the interrupt to occur when our user-process was the
current process. Then use copy_to_user().
I'm not much of an expert on kernel programming. If there's a good way to do this, please do advise and thank you in advance.
Your driver should be the one, who maps kernel's memory for user space process. E.g., you may implement .mmap callback for struct file_operation for your device.
Kernel driver may write to kernel's address, which it have mapped, at any time (even in interrupt handler). The user-space process will immediately see all modifications on its side of the mapping (using address obtained with mmap() system call).
Unix's architecture frowns on interrupt routines accessing user space
because a process could (in theory) be swapped out when the interrupt occurs. 
If the process is running on another CPU, that could be a problem, too. 
I suggest that you write an ioctl to synchronize the counters,
and then have the the process call that ioctl
every time it needs to access the counters.
Outside of an interrupt context, your driver will need to check the user memory is accessible (using access_ok), and pin the user memory using get_user_pages or get_user_pages_fast (after determining the page offset of the start of the region to be pinned, and the number of pages spanned by the region to be pinned, including page alignment at both ends). It will also need to map the list of pages to kernel address space using vmap. The return address from vmap, plus the offset of the start of the region within its page, will give you an address that your interrupt handler can access.
At some point, you will want to terminate access to the user memory, which will involve ensuring that your interrupt routine no longer accesses it, a call to vunmap (passing the pointer returned by vmap), and a sequence of calls to put_page for each of the pages pinned by get_user_pages or get_user_pages_fast.
I don't think what you are trying to do is possible. Consider this situation:
(assuming how your device works)
Some function allocates the user-space memory for the counters and
supplies its address in PROCESS X.
A switch occurs and PROCESS Y executes.
Your device interrupts.
The address for your counters is inaccessible.
You need to schedule a kernel mode asynchronous event (lower half) that will execute when PROCESS X is executing.

bypassing tty layer and copy to user

I would like to copy data to user space from kernel module which receives data from serial port and transfers it to DMA, which in turn forwards the data to tty layer and finally to user space.
the current flow is
serial driver FIFO--> DMA-->TTY layer -->User space (the data to tty layer is emptied from DMA upon expiration of timer)
What I want to achieve is
serial driver FIFO-->DMA-->user space. (I am OK with using timer to send the data to user space, if there is a better way let me know)
Also the kernel module handling the serialFIFO->DMA is not a character device.
I would like to bypass tty layer completely. what is the best way to achieve so?
Any pointers/code snippet would be appreciated.
In >=3.10.5 the "serial FIFO" that you refer to is called a uart_port. These are defined in drivers/tty/serial.
I assume that what you want to do is to copy the driver for your UART to a new file, then instead of using uart_insert_char to insert characters from the UART RX FIFO, you want to insert the characters into a buffer that you can access from user space.
The way to do this is to create a second driver, a misc class device driver that has file operations, including mmap, and that allocates kernel memory that the driver's mmap file operation function associates with the userspace mapped memory. There is a good example of code for this written by Maxime Ripard. This example was written for a FIQ handled device, but you can use just the probe routine's dma_zalloc_coherent call and the mmap routine, with it's call to remap_pfn_range, to do the trick, that is, to associate a user space mmap on the misc device file with the alloc'ed memory.
You need to connect the memory that you allocated in your misc driver to the buffer that you write to in your UART driver using either a global void pointer, or else by using an exported symbol, if your misc driver is a module. Initialize the pointer to a known invalid value in the UART driver and test it to make sure the misc driver has assigned it before you try to insert characters to the address to which it points.
Note that you can't add an mmap function to the UART driver directly because the UART driver class does not support an mmap file operation. It only supports the operations defined in the include/linux/serial_core.h struct uart_ops.
Admittedly this is a cumbersome solution - two device drivers, but the alternative is to write a new device class, a UART device that has an mmap operation, and that would be a lot of work compared with the above solution although it would be elegant. No one has done this to date because as Jonathan Corbet say's "...not every device lends itself to the mmap abstraction; it makes no sense, for instance, for serial ports and other stream-oriented devices", though this is exactly what you are asking for.
I implemented this solution for a polling mode UART driver based on the mxs-auart.c code and Maxime's example. It was non-trivial effort but mostly because I am using a FIQ handler for the polling timer. You should allow two to three weeks to get the whole thing up and running.
The DMA aspect of your question depends on whether the UART supports DMA transfer mode. If so, then you should be able to set it using the serial flags. The i.MX28's PrimeCell auarts support DMA transfer but for my application there was no advantage over simply reading bytes directly from the UART RX FIFO.

Ring level shift in Win NT based OS

Can anyone please tell me how there is privilege change in Windows OS.
I know the user mode code (RL:3) passes the parameters to APIs.
And these APIs call the kernel code (RL:1).
But now I want to know, during security(RPL) check is there some token that is exchanged between these RL3 API and RL1 Kernel API.
if I am wrong please let me know (through Some Link or Brief description) how it works.
Please feel free to close this thread if its offtopic, offensive or duplicate.
RL= Ring Level
RPL:Requested Privilege level
Interrupt handlers and the syscall instruction (which is an optimized software interrupt) automatically modify the privilege level (this is a hardware feature, the ring 0 vs ring 3 distinction you mentioned) along with replacing other processor state (instruction pointer, stack pointer, etc). The prior state is of course saved so that it can be restored after the interrupt completes.
Kernel code has to be extremely careful not to trust input from user-mode. One way of handling this is to not let user-mode pass in pointers which will be dereferenced in kernel mode, but instead HANDLEs which are looked up in a table in kernel-mode memory, which can't be modified by user-mode at all. Capability information is stored in the HANDLE table and associated kernel data structures, this is how, for example, WriteFile knows to fail if a file object is opened for read-only access.
The task switcher maintains information on which process is currently running, so that syscalls which perform security checks, such as CreateFile, can check the user account of the current process and verify it against the file ACL. This process ID and user token are again stored in memory which is accessible only to the kernel.
The MMU page tables are used to prevent user-mode from modifying kernel memory -- generally there is no page mapping at all; there are also page access bits (read, write, execute) which are enforced in hardware by the MMU. Kernel code uses a different page table, the swap occurs as part of the syscall instruction and/or interrupt activation.

How does the kernel know if the CPU is in user mode or kenel mode?

Since the CPU runs in user/kernel mode, I want to know how this is determined by kernel. I mean, if a sys call is invoked, the kernel executes it on behalf of the process, but how does the kernel know that it is executing in kernel mode?
You can tell if you're in user-mode or kernel-mode from the privilege level set in the code-segment register (CS). Every instruction loaded into the CPU from the memory pointed to by the RIP or EIP register (the instruction pointer register depending on if you are x86_64 or x86 respectively) will read from the segment described in the global descriptor table (GDT) by the current code-segment descriptor. The lower two-bits of the code segment descriptor will determine the current privilege level that the code is executing at. When a syscall is made, which is typically done through a software interrupt, the CPU will check the current privilege-level, and if it's in user-mode, will exchange the current code-segment descriptor for a kernel-level one as determined by the syscall's software interrupt gate descriptor, as well as make a stack-switch and save the current flags, the user-level CS value and RIP value on this new kernel-level stack. When the syscall is complete, the user-mode CS value, flags, and instruction pointer (EIP or RIP) value are restored from the kernel-stack, and a stack-switch is made back to the current executing processes' stack.
Broadly if it's running kernel code it's in kernel mode. The transition from user-space to kernel mode (say for a system call) causes a context switch to occur. As part of this context switch the CPU mode is changed.
Kernel code only executes in kernel mode. There is no way, kernel code can execute in user mode. When application calls system call, it will generate a trap (software interrupt) and the mode will be switch to kernel mode and kernel implementation of system call will executed. Once it is done, kernel will switch back to user mode and user application will continue processing in user mode.
The term is called "Superviser Mode", which applies to x86/ARM and many other processor as well.
Read this (which applies only to x86 CPU):
http://en.wikipedia.org/wiki/Ring_(computer_security)
Ring 0 to 3 are the different privileges level of x86 CPU. Normally only Ring0 and 3 are used (kernel and user), but nowadays Ring 1 find usages (eg, VMWare used it to emulate guest's execution of ring 0). Only Ring 0 has the full privilege to run some privileged instructions (like lgdt, or lidt), and so a good test at the assembly level is of course to execute these instruction, and see if your program encounters any exception or not.
Read this to really identify your current privilege level (look for CPL, which is a pictorialization of Jason's answer):
http://duartes.org/gustavo/blog/post/cpu-rings-privilege-and-protection
It is a simple question and does not need any expert comment as provided above..
The question is how does a cpu come to know whether it is kernel mode or its a user mode.
The answer is "mode bit"....
It is a bit in Status register of cpu's registers set.
When "mode bit=0",,,it is considered as kernel mode(also called,monitor mode,privileged mode,protected mode...and many other...)
When "mode bit=1",,it is considered as User mode...and user can now perform its personal applications without any special kernel interruption.
so simple...isn't it??

Resources