Difference between User vs Kernel System call - performance

A system call is how a program requests a service from an operating system's kernel.
They can occur in user-mode and kernel-mode.
What are the differences?
For example:
Overhead
System time

A system call is the way you transition between the application ("user mode") and the kernel.
Syscalls are slower than normal function calls, but newer x86 chips from Intel and AMD have a special sysenter/syscall opcode to make it take just a hundred nanoseconds or so, give or take.

@Leo,
Could you elaborate on how system calls differ when made from within kernel space? I am asking to better understand the Linux kernel, which is written in C and assembly.
Note that system calls are just an interface between user space and kernel space. When you need some computer resource (files, network, ...), you ask the kernel to provide it (under the hood, you ask the kernel to run the kernel code responsible for it).
The overhead of a system call is that you need to perform a CPU interrupt (or, on modern CPUs, the equivalent sysenter/syscall entry). As Will mentioned, the time this takes depends heavily on the CPU type.
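Here is a rough userspace microbenchmark (a sketch, not part of the original answers) that compares a plain function call with getpid() forced through syscall(), so every iteration really enters the kernel; the absolute numbers vary with the CPU, kernel version and mitigation settings, but the gap Will describes should be visible.

    /* Sketch: plain function call vs. a real system call. */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    #define N 1000000L

    static long __attribute__((noinline)) plain_call(long x) { return x + 1; }

    static double ns_per_iter(struct timespec a, struct timespec b)
    {
        return ((b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec)) / N;
    }

    int main(void)
    {
        struct timespec t0, t1, t2;
        volatile long sink = 0;   /* keeps the loops from being optimized away */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
            sink += plain_call(i);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        for (long i = 0; i < N; i++)
            sink += syscall(SYS_getpid);   /* real user->kernel transition every time */
        clock_gettime(CLOCK_MONOTONIC, &t2);

        printf("function call: ~%.1f ns/iter, getpid syscall: ~%.1f ns/iter\n",
               ns_per_iter(t0, t1), ns_per_iter(t1, t2));
        return 0;
    }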

What is the relation between reentrant kernel and preemptive kernel?

If a kernel is preemptive, must it be reentrant? (I guess yes)
If a kernel is reentrant, must it be preemptive? (I am not sure)
I have read https://stackoverflow.com/a/1163946, but I am not sure whether there is a relation between the two concepts.
I guess my questions are about operating system concepts in general, but if it matters, I am interested mostly in the Linux kernel; I encountered the two concepts while reading Understanding the Linux Kernel.
What is a reentrant kernel:
As the name suggests, a reentrant kernel is one which allows multiple processes to be executing in kernel mode at any given point of time, and that too without causing any consistency problems among the kernel data structures.
What is kernel preemption:
Kernel preemption is a method used mainly in monolithic and hybrid kernels where all or most device drivers are run in kernel space, whereby the scheduler is permitted to forcibly perform a context switch (i.e. preemptively schedule, on behalf of a runnable and higher priority process) on a driver or other part of the kernel during its execution, rather than co-operatively waiting for the driver or kernel function (such as a system call) to complete its execution and return control of the processor to the scheduler.
Can I imagine a preemptive kernel which is not reentrant? Hardly, but I can. Consider an example: some thread performs a system call. While entering the kernel it takes a big kernel lock and disables all interrupts except the scheduler timer IRQ. After that, this thread is preempted in the kernel by the scheduler. Now we may switch to another userspace thread. That thread does some work in userspace, then enters the kernel, tries to take the big kernel lock, sleeps, and so on. In practice it looks like this scheme can't really be implemented, because of the huge latency caused by disabling interrupts over long intervals.
Can I imagine a reentrant kernel which is not preemptive? Why not? Just use cooperative preemption in the kernel: thread 1 enters the kernel and calls thread_yield() after some time; thread 2 enters the kernel, does its own work, maybe calls another thread_yield(), maybe not. There is nothing special here.
As for the Linux kernel, it is fully reentrant; kernel preemption can be configured with CONFIG_PREEMPT, voluntary preemption is also possible, and there are many other options.
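To make the cooperative case concrete, here is a sketch of what "reentrant but only cooperatively preemptible" looks like in kernel code. cond_resched() is the real kernel helper for an explicit scheduling point; struct my_item, its node member and process_item() are made-up names for illustration only.

    #include <linux/list.h>
    #include <linux/sched.h>

    struct my_item {
        struct list_head node;
        /* ... payload ... */
    };

    static void process_item(struct my_item *it)
    {
        (void)it;   /* illustrative worker: pretend to do something with the payload */
    }

    static void process_all(struct list_head *items)
    {
        struct my_item *it;

        list_for_each_entry(it, items, node) {
            process_item(it);   /* may keep the CPU in kernel mode for a while */
            cond_resched();     /* explicit scheduling point: on a non-preemptible
                                   kernel this is where another task can run -
                                   the thread_yield() of the answer above */
        }
    }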

Is there any mechanism where the kernel part of an OS in memory may also be swapped?

I'm currently learning about the I/O buffering part of operating systems, and according to the book I use,
When a user process issues an I/O request, the OS assigns a buffer in the system portion of main memory to the operation.
I understand how this method avoids the swapping problem of the unbuffered case. But is it assumed that the buffer the OS creates for the process will never be swapped out?
To extend my question, I was wondering if there is any mechanism where the kernel portion of an OS in memory may also be swapped?
It is common for operating systems to page out parts of the kernel. The kernel has to define which parts may be paged out and which may not. For example, there will typically be separate memory allocators for paged pool and non-paged pool.
Note that on most processors the page table format is the same for system pages as for user pages, thus supporting paging of the kernel.
Determining what parts of the kernel may be paged out is part of the system design and is done up front. You cannot page out the system interrupt table. You can page out system service code for the most part. You cannot page out interrupt handling code for the most part.
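Windows makes this split visible to driver writers through the pool allocators the answer mentions. A minimal sketch using the classic pool API is below (my illustration; the function name, buffer sizes and pool tag are arbitrary).

    #include <ntddk.h>

    VOID AllocateExampleBuffers(VOID)
    {
        /* Pageable: may end up in the paging file, so it must never be touched
           at IRQL >= DISPATCH_LEVEL (doing so is the classic IRQL_NOT_LESS_OR_EQUAL). */
        PVOID paged = ExAllocatePoolWithTag(PagedPool, 4096, 'tseT');

        /* Non-pageable: guaranteed resident, safe to touch from an ISR or DPC. */
        PVOID nonPaged = ExAllocatePoolWithTag(NonPagedPool, 4096, 'tseT');

        if (paged)
            ExFreePoolWithTag(paged, 'tseT');
        if (nonPaged)
            ExFreePoolWithTag(nonPaged, 'tseT');
    }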
I was wondering if there is any mechanism where the kernel portion of an OS in memory may also be swapped?
IIRC some old versions of AIX were able to swap out (i.e. page out) some kernel code, and probably older OSes could too (perhaps even Multics).
However, it is of little practical use today, because kernel memory is a tiny fraction of the RAM of current (desktop & server) computers: total kernel memory amounts to only a few dozen megabytes, while most computers have dozens of gigabytes of RAM.
BTW, microkernel systems (e.g. GNU Hurd) can run their servers as ordinary, pageable processes.
See Operating Systems: Three Easy Pieces

Will gettimeofday() be slowed due to the fix to the recently announced Intel bug?

I have been estimating the impact of the recently announced Intel bug on my packet processing application using netmap. So far, I have measured that I process about 50 packets per each poll() system call made, but this figure doesn't include gettimeofday() calls. I have also measured that I can read from a non-existing file descriptor (which is about the cheapest thing that a system call can do) 16.5 million times per second. My packet processing rate is 1.76 million packets per second, or in terms of system calls, 0.0352 million system calls per second. This means performance reduction would be 0.0352 / 16.5 = 0.21333% if system call penalty doubles, hardly something I should worry about.
However, my application may use gettimeofday() system calls quite often. My understanding is that these are not true system calls, but rather implemented as virtual system calls, as described in What are vdso and vsyscall?.
Now, my question is, does the fix to the recently announced Intel bug (that may affect ARM as well and that probably won't affect AMD) slow down gettimeofday() system calls? Or is gettimeofday() an entirely different animal due to being implemented as a different kind of virtual system call?
In general, no.
The current patches keep things like the vDSO pages mapped in user-space, and only change the behavior for the remaining vast majority of kernel-only pages which will no longer be mapped in user-space.
On most architectures, gettimeofday() is implemented as a purely userspace call and never enters the kernel, so it doesn't incur the TLB flush or CR3 switch that KPTI implies, and you shouldn't see a performance impact.
Exceptions include unusual kernel or hardware configurations that don't use the vDSO mechanisms, e.g. if you don't have a constant rdtsc or if you have explicitly disabled rdtsc timekeeping via a boot parameter. You'd probably already know if that were the case, since it would mean gettimeofday() takes 100-200 ns rather than 15-20 ns: it is already making a kernel call.
Good question, the VDSO pages are kernel memory mapped into user space. If you single-step into gettimeofday(), you see a call into the VDSO page where some code there uses rdtsc and scales the result with scale factors it reads from another data page.
But these pages are supposed to be readable from user-space, so Linux can keep them mapped without any risk. The Meltdown vulnerability is that the U/S bit (user/supervisor) in page-table / TLB entries doesn't stop unprivileged loads (and further dependent instructions) from happening microarchitecturally, producing a change in the microarchitectural state which can then be read with cache-timing.
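You can check this on your own machine. The sketch below (my addition, assuming x86-64 Linux with a working vDSO) times the ordinary gettimeofday() path against the same call forced through syscall(SYS_gettimeofday), which bypasses the vDSO and pays the full kernel-entry cost, including KPTI if it is enabled.

    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/time.h>
    #include <time.h>
    #include <unistd.h>

    #define N 1000000L

    static double bench(int force_syscall)
    {
        struct timespec a, b;
        struct timeval tv;

        clock_gettime(CLOCK_MONOTONIC, &a);
        for (long i = 0; i < N; i++) {
            if (force_syscall)
                syscall(SYS_gettimeofday, &tv, NULL);  /* bypasses the vDSO: real kernel entry */
            else
                gettimeofday(&tv, NULL);               /* normally resolved in user space via the vDSO */
        }
        clock_gettime(CLOCK_MONOTONIC, &b);
        return ((b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec)) / N;
    }

    int main(void)
    {
        printf("vDSO gettimeofday:    ~%.1f ns/call\n", bench(0));
        printf("syscall gettimeofday: ~%.1f ns/call\n", bench(1));
        return 0;
    }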

Are there any performance penalties for running SMP enabled Linux kernel on a Uni processor (ARM Cortex A8 based SOC)?

This is a two-fold question that arose from my trivial observation that I am running an SMP-enabled Linux on our ARM Cortex-A8 based SoC. The first part is about the performance (memory space / CPU time) difference between an SMP and a non-SMP Linux kernel on a uniprocessor system. Does any difference exist?
The second part is about the use of spinlocks. AFAIK, spinlocks are no-ops in the uniprocessor case: since there is only one CPU and only one process is running on it at a time, there is no other process to busy-loop against. So for synchronization I just need to disable interrupts to protect my critical section. Is this understanding of mine correct?
Ignore the driver-portability factor for this discussion.
A large amount of synchronisation code in the kernel compiles away to almost nothing in uniprocessor kernels, which gives the behaviour you describe. The performance of an n-way system is definitely not n times that of a single processor, and it gets relatively worse as the number of CPUs increases.
You should continue to write your driver using the synchronisation mechanisms appropriate for SMP systems, safe in the knowledge that you'll get the correct single-processor behaviour when the kernel is configured for a uniprocessor.
Disabling interrupts globally is like taking a sledgehammer to a nut; often just disabling preemption on the current CPU is enough, which is what a spinlock does even on uniprocessor systems.
If you've not already done so, take a look at Chapter 5 of Linux Device Drivers, 3rd Edition; there is a variety of spinlock options depending on the circumstances.
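A sketch of that advice in driver code (struct my_dev is an illustrative type, not a real kernel structure): write for SMP and let the uniprocessor build shrink the lock for you.

    #include <linux/spinlock.h>

    struct my_dev {
        unsigned int counter;
    };

    static DEFINE_SPINLOCK(dev_lock);

    static void update_device_state(struct my_dev *dev)
    {
        unsigned long flags;

        /* SMP build: spins against other CPUs and disables local interrupts.
         * Non-preemptible UP build: reduces to just the local IRQ disable/enable. */
        spin_lock_irqsave(&dev_lock, flags);
        dev->counter++;
        spin_unlock_irqrestore(&dev_lock, flags);
    }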
As you have stated, you are running a Linux kernel compiled in SMP mode on a uniprocessor system, so it's clear that you'll not get any benefit in terms of speed and memory: the Linux kernel uses extensive locking for synchronization, and while in uniprocessor mode much of that locking is theoretically unnecessary, there are still many cases where it is needed, so use locking where it is required, just not as heavily as a true SMP system exercises it.
You should also know that spinlocks are implemented by a set of macros, some of which prevent concurrency with IRQ handlers while others do not. Spinlocks are suitable for protecting small pieces of code which are intended to run for a very short time.
As for your second question, you are trying to replace spinlocks with interrupt disabling for the uniprocessor case, but the spinlock macros in non-preemptible UP (uniprocessor) kernels already evaluate to empty macros (or to macros that just disable/enable interrupts), and UP kernels with preemption enabled use spinlocks to disable preemption. For most purposes, preemption can be thought of as the UP equivalent of SMP concurrency. So in UP kernels spinlocks cost you essentially nothing, and I think it is better to keep using them.
There are basically four techniques for synchronization: (1) non-preemptability, (2) atomic operations, (3) interrupt disabling, and (4) locks.
If you do intend to disable interrupts for synchronization, remember that interrupt disabling is used by kernel functions to implement critical regions because of its simplicity, but this technique does not always prevent kernel control-path interleaving, and the critical section must be short, because any communication between the CPU and I/O is blocked while a kernel control path is running inside it.
So if you need synchronization on a uniprocessor beyond what that allows, use a semaphore.
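For completeness, here is a sketch of the sleeping-lock alternative. struct my_cfg and set_config() are illustrative names; a mutex is used here as the usual modern stand-in where older texts say "semaphore", and unlike interrupt disabling it may sleep, so it is only valid in process context.

    #include <linux/errno.h>
    #include <linux/mutex.h>

    struct my_cfg {
        int value;
    };

    static DEFINE_MUTEX(cfg_lock);

    static int set_config(struct my_cfg *cfg, int value)
    {
        if (mutex_lock_interruptible(&cfg_lock))
            return -ERESTARTSYS;    /* interrupted by a signal: let the caller retry */
        cfg->value = value;         /* critical section may be long and may sleep */
        mutex_unlock(&cfg_lock);
        return 0;
    }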

Performance difference between system call vs function call

I quite often hear driver developers saying it's good to avoid kernel-mode switches as much as possible. I couldn't understand the precise reason. To start with, my understanding is:
System calls are software interrupts. On x86 they are triggered using the sysenter instruction, which actually looks like a branch instruction that takes its target from a machine-specific register.
System calls don't really have to change the address space or process context.
Though, they do save registers on the process stack and change the stack pointer to the kernel stack.
Among these operations a syscall works pretty much like a normal function call, though sysenter could behave like a mispredicted branch, which could lead to a ROB flush in the processor pipeline. Even that is not really bad; it's just like any other mispredicted branch.
I heard a few people answering on Stack Overflow:
You never know how long a syscall takes - [me] yeah, but that's the case with any function; the amount of time it takes depends on the function.
It is often a scheduling spot. - [me] a process can get rescheduled even if it is running all the time in user mode; e.g., while(1); doesn't guarantee no context switch.
Where is the actual syscall cost coming from?
You don't indicate what OS you are asking about. Let me attempt an answer anyway.
The CPU instructions syscall and sysenter should not be confused with the concept of a system call and its representation in the respective OSs.
The best explanation for the difference in the overhead incurred by each respective instruction is given by reading through the Operation sections of the Intel® 64 and IA-32 Architectures Software Developer's Manual, volume 2A (for int, see page 3-392) and volume 2B (for sysenter, see page 4-463). Also don't forget to glance at iretd and sysexit while you're at it.
A casual counting of the pseudo-code for the operations yields:
408 lines for int
55 lines for sysenter
Note: although the existing answer is right that sysenter and syscall are not interrupts or in any way related to interrupts, older kernels in the Linux and Windows world used interrupts to implement their system call mechanism. On Linux this used to be int 0x80 and on Windows int 0x2E, and consequently on those kernel versions the IDT had to be primed to provide an interrupt handler for the respective interrupt. On newer systems, it's true, the sysenter and syscall instructions have completely replaced the old ways. With sysenter it's the MSR (model-specific register) 0x176 which gets primed with the address of the handler for sysenter (see the reading material linked below).
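For reference, this is roughly what that legacy interrupt-based entry looks like from user space. The sketch is mine and assumes a 32-bit (i386) build such as gcc -m32, where SYS_getpid resolves against the 32-bit syscall table used by the int 0x80 handler.

    #include <stdio.h>
    #include <sys/syscall.h>

    static long getpid_int80(void)
    {
        long ret;
        /* eax holds the syscall number; the kernel's int 0x80 handler is
           reached through the IDT, exactly the path the note describes. */
        asm volatile("int $0x80"
                     : "=a"(ret)
                     : "a"((long)SYS_getpid)
                     : "memory");
        return ret;
    }

    int main(void)
    {
        printf("pid via int 0x80: %ld\n", getpid_int80());
        return 0;
    }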
On Windows ...
A system call on Windows, just like on Linux, results in the switch to kernel mode. The scheduler of NT doesn't provide any guarantees about the time a thread is granted. Also it yanks away time from threads and can even end up starving threads. In general one can say that user mode code can be preempted by kernel mode code (with very few very specific exceptions to which you'll certainly get in the "advanced driver writing class"). This makes perfect sense if we only look at one example. User mode code can be swapped out - or, for that matter, the data it's trying to access. Now the CPU doesn't have the slightest clue how to access pages in the swap/paging file, so an intermediate step is required. And that's also why kernel mode code must be able to preempt user mode code. It is also the reason for one of the most prolific bug-check codes seen on Windows and mostly caused by third-party drivers: IRQL_NOT_LESS_OR_EQUAL. It means that a driver accessed paged memory when it wasn't possible to preempt the code touching that memory.
Further reading
SYSENTER and SYSEXIT in Windows by Geoff Chappell (always worth a read in my experience!)
Sysenter Based System Call Mechanism in Linux 2.6
Windows NT platform specific discussion: How Do Windows NT System Calls REALLY Work?
Windows NT platform specific discussion: System Call Optimization with the SYSENTER Instruction
Windows Internals, 5th ed., by Russinovich et al. - pages 125 through 132.
ReactOS implementation of KiFastSystemCall
SYSENTER/SYSCALL is not a software interrupt; the whole point of those instructions is to avoid the overhead caused by issuing an IRQ and calling an interrupt handler.
Saving registers on the stack costs time; this is one place the syscall cost comes from.
Another part comes from the kernel-mode switch itself. It involves changing the segment registers - CS, DS, ES, FS, GS all have to be changed (it's less costly on x86-64, since segmentation is mostly unused, but you still essentially need to make a far jump to kernel code) - and it also changes the CPU's ring of execution.
To conclude: a function call is (on modern systems, where segmentation is not used) a near call, while a syscall involves a far call and a ring switch.
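A small sketch (mine, assuming x86-64 Linux and the System V ABI) contrasting the two entries at the instruction level: a plain near call versus the SYSCALL instruction, whose clobber list reflects the kernel syscall ABI (the instruction destroys rcx and r11).

    #include <stdio.h>
    #include <sys/syscall.h>

    static long normal_function(long x)       /* reached with a plain near CALL */
    {
        return x + 1;
    }

    static long getpid_raw(void)              /* reached via SYSCALL: ring switch,
                                                 kernel stack, and (with KPTI) a CR3 switch */
    {
        long ret;
        asm volatile("syscall"
                     : "=a"(ret)
                     : "a"((long)SYS_getpid)
                     : "rcx", "r11", "memory");
        return ret;
    }

    int main(void)
    {
        printf("%ld %ld\n", normal_function(41), getpid_raw());
        return 0;
    }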

Resources