What is the overhead involved in a mode switch - linux-kernel

Many times I read or hear the argument that making a lot of system calls is inefficient because the application performs a mode switch, i.e. it goes from user mode to kernel mode and, after executing the system call, switches back to user mode again.
My question is: what is the overhead of a mode switch? Does the CPU cache get invalidated, are TLB entries flushed, or what else happens that causes the overhead?
Please note that I am asking about the overhead involved in a mode switch, not a context switch. I know that mode switches and context switches are two different things, and I am fully aware of the overhead associated with a context switch; what I fail to understand is what overhead is caused by a mode switch.
If possible, please provide some information about a particular *nix platform like Linux, FreeBSD, Solaris etc.
Regards
lali

There should be no CPU cache or TLB flush on a simple mode switch.
A quick test tells me that, on my Linux laptop, it takes about 0.11 microseconds for a userspace process to complete a simple syscall that does an insignificant amount of work beyond the switch to kernel mode and back. I'm using getuid(), which only copies a single integer from an in-memory struct. strace confirms that the syscall is repeated MAX times.
#include <unistd.h>

#define MAX 100000000

int main() {
    int ii;
    for (ii = 0; ii < MAX; ii++) getuid();
    return 0;
}
This takes about 11 seconds on my laptop, measured using time ./testover, and 11 seconds divided by 100 million calls gives you 0.11 microseconds per call.
Technically, that's two mode switches, so I suppose you could claim that a single mode switch takes 0.055 microseconds, but a one-way switch isn't very useful, so I'd consider the there-and-back number the more relevant one.
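If you'd rather not divide wall-clock time by hand, the same experiment can time the loop with clock_gettime(CLOCK_MONOTONIC) and print a per-call figure directly. This is only a sketch of the same measurement, not a rigorous benchmark (link with -lrt on older glibc):

#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define MAX 100000000

int main(void) {
    struct timespec start, end;
    long ii;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (ii = 0; ii < MAX; ii++)
        getuid();                       /* trivial syscall: user -> kernel -> user */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns per getuid() round trip\n", ns / MAX);
    return 0;
}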

There are several ways to perform a mode switch on x86 CPUs (which I am assuming here): software interrupts, call gates, task gates, and the dedicated sysenter/syscall instructions. Of these, only a task gate implies a full task switch (the hardware equivalent of a context switch). Add to that a bit of processing before the call, the standard verification after the call, and the return, and you have the bare minimum for a safe mode switch.
As for Eric's timing: I am not a Linux expert, but in most OSes I have dealt with, simple system calls cache data in user space (when that can be done safely) to avoid this overhead, and getuid() looks like a prime candidate for such caching. Eric's timing could therefore reflect the pre-switch processing in user space more than anything else.
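Whether a particular libc caches getuid() is easy to rule out: invoke the call directly through syscall(2), bypassing any wrapper. A Linux-specific sketch, timed the same way as the program above:

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

#define MAX 100000000

int main(void) {
    int ii;
    /* syscall(2) goes straight to the kernel, so no library-side
     * caching or vDSO fast path can short-circuit the mode switch. */
    for (ii = 0; ii < MAX; ii++)
        syscall(SYS_getuid);
    return 0;
}

If time ./a.out reports roughly the same 11 seconds, the figure really is the cost of entering and leaving the kernel rather than library-side processing.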

Related

Does context switching usually happen between calling a function, and executing it?

I have been working on the source code of a complex application (written by hundreds of programmers) for a while now. Among other things, I have created some time-checking functions, along with suitable data structures, to measure the execution time of different segments of the main loop and to run some analysis on these measurements.
Here is some pseudocode to help explain:
main()
{
    TimeSlicingSystem::AddTimeSlice(0);
    FunctionA();
    TimeSlicingSystem::AddTimeSlice(3);
    FunctionB();
    TimeSlicingSystem::AddTimeSlice(6);
    PrintTimeSlicingValues();
}
void FunctionA()
{
    TimeSlicingSystem::AddTimeSlice(1);
    //...
    TimeSlicingSystem::AddTimeSlice(2);
}
void FunctionB()
{
    TimeSlicingSystem::AddTimeSlice(4);
    //...
    TimeSlicingSystem::AddTimeSlice(5);
}
void PrintTimeSlicingValues()
{
    //Prints the difference between each slice and the slice before it,
    //starting from slice number 1.
}
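For reference, the slice bookkeeping behind AddTimeSlice / PrintTimeSlicingValues could look something like the sketch below. This is a hypothetical C version using QueryPerformanceCounter, not the application's actual implementation; the flat array and the fixed slice count are assumptions.

#include <stdio.h>
#include <windows.h>

#define NUM_SLICES 7                        /* slice ids 0..6, as in the pseudocode */

static LARGE_INTEGER g_slices[NUM_SLICES];  /* raw counter value per slice id */

void AddTimeSlice(int id)
{
    QueryPerformanceCounter(&g_slices[id]);
}

void PrintTimeSlicingValues(void)
{
    LARGE_INTEGER freq;
    int i;
    QueryPerformanceFrequency(&freq);
    /* print the delta between each slice and the one before it, starting at slice 1 */
    for (i = 1; i < NUM_SLICES; i++) {
        double ms = (double)(g_slices[i].QuadPart - g_slices[i - 1].QuadPart)
                    * 1000.0 / (double)freq.QuadPart;
        printf("slice %d -> %d: %.3f ms\n", i - 1, i, ms);
    }
}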
Most measurements were very reasonable; for instance, assigning a value to a local variable costs a fraction of a microsecond, and most functions execute from start to finish in a few microseconds, rarely ever reaching one millisecond.
I then ran a few tests for half an hour or so and found some strange results that I couldn't quite understand. Certain functions are called, and the time measured from the moment of the call (the last line in the 'calling' code) to the first line inside the 'called' function is very long, up to 30 milliseconds. That happens in a loop that would otherwise complete a full iteration in less than 8 milliseconds.
To picture this in the pseudocode above: it is the period between slice number 0 and slice number 1, or between slice number 3 and slice number 4, that is being measured. These are the sorts of periods I am referring to: the time between calling a function and running the first line inside the called function.
QuestionA. Could this behavior be due to thread or process switching by the OS? Is calling a function a uniquely vulnerable spot for that? The OS I am working on is Windows 10.
Interestingly enough, the return path (from the last line of a called function back to the first line after the call in the 'calling' code, i.e. the periods from slice number 2 to 3 or from 5 to 6 in the pseudocode) never showed the problem at all; all those measurements were always less than 5 microseconds.
QuestionB. Could this be, in any way, due to the time measurement method I am using? Could switching between different cores give an illusion of slower-than-actual context switching due to clock differences? (Although I have never found a single negative delta time so far, which seems to refute this hypothesis altogether.) Again, the OS I am working on is Windows 10.
My time measuring function looks like this:
FORCEINLINE double Seconds()
{
    Windows::LARGE_INTEGER Cycles;
    Windows::QueryPerformanceCounter(&Cycles);
    // add big number to make bugs apparent where return value is being passed to float
    return Cycles.QuadPart * GetSecondsPerCycle() + 16777216.0;
}
QuestionA. Could this behavior be due to thread, or process switching by the OS?
Yes. Thread switches can happen at any time (e.g. when a device sends an IRQ that causes a different higher priority thread to unblock and preempt your thread immediately) and this can/will cause unexpected time delays in your thread.
Is calling a function a uniquely vulnerable spot for that?
There's nothing particularly special about calling your own functions that makes them uniquely vulnerable. If the function involves the kernel's API, a thread switch becomes more likely, and some things (e.g. calling sleep()) are almost guaranteed to cause a thread switch.
There's also a potential interaction with virtual memory management: often things (e.g. your executable file, your code, your data) are backed by memory-mapped files, where touching them for the first time causes the OS to fetch the code or data from disk (and your thread is blocked until the code or data it wanted arrives from disk); rarely used code or data can also be sent to swap space and have to be fetched back.
QuestionB. Could this be, in any way, due to the time measurement method I am using?
In practice it's likely that Windows' QueryPerformanceCounter() is implemented with an RDTSC instruction (assuming 80x86 CPU/s) and doesn't involve the kernel at all, and for modern hardware it's likely that this is monotonic. In theory Windows could emulate RDTSC and/or implement QueryPerformanceCounter() in another way to guard against security problems (timing side channels), as has been recommended by Intel for about 30 years now, but this is unlikely (modern operating systems, including but not limited to Windows, tend to care more about performance than security). In theory your hardware/CPU could also be so old (roughly 10+ years) that Windows has to implement QueryPerformanceCounter() in a different way, or you could be using some other CPU (e.g. ARM and not 80x86).
In other words; it's unlikely (but not impossible) that the time measurement method you're using is causing any timing problems.

Restart a CPU that ends up unresponsive during undervolting

I'm working on a set of kernel changes that allows me to undervolt my CPU at runtime. One consequence of extreme undervolting that I'm often facing is that the CPU becomes completely unresponsive.
I've tried using functions cpu_up and cpu_down in the hope of asking the kernel to restore the CPU, but to no avail.
Is there any way to recover the CPU from this state? Does the kernel have any routines that can bring back a CPU from this unresponsive state?
First, to successfully benefit from undervolting, it's important to reduce the voltage by small amounts each time (say 5-10 mV). After each step of reduction, you should check the change in one or more hardware error metrics (typically the CPU cache error rate). Generally, the error rate increases gradually as the voltage is decreased slowly. However, at some point an error will occur that cannot be corrected through ECC (or whatever hardware correction mechanism the processor uses). That is when execution becomes unreliable. Linux responds to such errors by panicking (the system will either automatically reboot or just hang). You may still have a chance to detect the error and choose to continue execution, but correctness is no longer guaranteed even if you immediately raise the voltage back, so that would be a very, very dangerous thing to do. It can get very nasty very quickly: an error might occur while you're handling another error (perhaps caused by the very code that is handling the error), so the safest thing to do is to abort (see Peter's comment).
Modern processors offer mechanisms to profile and handle correctable and uncorrectable hardware errors. In particular, x86 offers the Machine Check Architecture (MCA). By default, in Linux, when an uncorrectable machine check occurs, the machine check exception handler is invoked, which may abort the system (although it will try to see if it can safely recover somehow). You cannot handle that in user mode without using additional tools.
Here are the different x86 MCE tolerance levels supported by Linux:
struct mca_config mca_cfg __read_mostly = {
    .bootlog = -1,
    /*
     * Tolerant levels:
     * 0: always panic on uncorrected errors, log corrected errors
     * 1: panic or SIGBUS on uncorrected errors, log corrected errors
     * 2: SIGBUS or log uncorrected errors (if possible), log corr. errors
     * 3: never panic or SIGBUS, log all errors (for testing only)
     */
    .tolerant = 1,
    .monarch_timeout = -1
};
Note that the default tolerant value is 1. But since you are modifying the kernel, you can change the way Linux handles MCEs either by changing the tolerant level or by changing the handling code itself. The machine_check_poll and do_machine_check functions are a good place to start.
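For reference, on kernels of that era the tolerant level could also be changed at runtime through sysfs; the knob has since been removed from newer kernels, so treat the path below as an assumption to verify on your system. A minimal sketch:

/* Hypothetical sketch: raise the MCE tolerant level at runtime through sysfs.
 * The path comes from older mcelog/kernel documentation; recent kernels have
 * dropped the "tolerant" attribute, so check that the file exists first. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/devices/system/machinecheck/machinecheck0/tolerant";
    FILE *f = fopen(path, "w");
    if (!f) {
        perror("fopen");
        return 1;
    }
    fputs("2\n", f);   /* 2: SIGBUS or log uncorrected errors, log corrected errors */
    fclose(f);
    return 0;
}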
User-mode tools that may enable you to profile and potentially respond to machine checks include mcelog and mcedaemon. MCA is discussed in Volume 3, Chapters 15 and 16 of the Intel manual. For ARM, you can also profile cache ECC errors as discussed here.
It is very important to understand that different cores of the same chip may behave differently when reducing the voltage beyond the nominal value. This is due to process variation. So don't assume that voltage reductions would work across cores of the same chip or across chips. You're going to have to test that on every core of every chip (in case you have multiple sockets).
I've tried using functions cpu_up and cpu_down in the hope of asking the kernel to restore the CPU, but to no avail.
These functions are part of the Hotplug CPU infrastructure. Not really useful here.
The answer is CPU dependent. My answer is limited to x86_64 and s390:
Extreme undervolting is essentially unplugging the CPU. To be able to bring it back up, you have to make sure that CONFIG_HOTPLUG_CPU=y is configured.
Also, depending on the kernel version you are using, you may have different teardown or setup options readily available. If you are using 4.x, have a look at the cpuhp_* routines in <linux/cpuhotplug.h>; in particular, cpuhp_setup_state_multi may be the one you can use to set things up. If in doubt, look at cpuhp_setup_state_nocalls as well as __cpuhp_setup_state. A sketch of the basic pattern follows. Hopefully this helps :-)
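A hedged sketch of that hotplug-state pattern, using the simpler cpuhp_setup_state() variant against a 4.x-era <linux/cpuhotplug.h>; the module and callback names are made up for illustration, not existing code:

/* Register online/offline callbacks so undervolting code is notified
 * when a CPU is torn down or brought back up. */
#include <linux/cpuhotplug.h>
#include <linux/module.h>

static enum cpuhp_state uv_state;

static int uv_cpu_online(unsigned int cpu)
{
    pr_info("undervolt: cpu%u back online, restoring a safe voltage\n", cpu);
    return 0;
}

static int uv_cpu_offline(unsigned int cpu)
{
    pr_info("undervolt: cpu%u going down\n", cpu);
    return 0;
}

static int __init uv_init(void)
{
    int ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "undervolt:online",
                                uv_cpu_online, uv_cpu_offline);
    if (ret < 0)
        return ret;
    uv_state = ret;   /* dynamically allocated state id, needed for removal */
    return 0;
}

static void __exit uv_exit(void)
{
    cpuhp_remove_state(uv_state);
}

module_init(uv_init);
module_exit(uv_exit);
MODULE_LICENSE("GPL");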

User CPU time deviation. What is it caused by?

I am wondering why the time command in Unix always reports a different user CPU time. It is said to be the time the CPU spends executing the user code of the process in question, so it excludes work managed by the kernel:
Any I/O or other hardware waits and interrupts, as well as cache management
Other processes' intervention (taking control away)
All things that user code does not know about
But for a simple C program that bubble-sorts 1,000,000 elements, it always shows a user CPU time ranging from 0.3 to 1.0 seconds.
I have found little information about this in the classic books on kernels and operating systems. Please enlighten me, somebody.
"All things that user code does not know about" is not true. User time means CPU cycles spent in user mode.
There are two execution modes: user mode (with limited privileges) and kernel mode (with almost all privileges). Operations that do not need higher privileges generally run in user mode; the CPU switches from user mode to kernel mode whenever a system call is made.
More information on CPU modes is available here,
http://www.linfo.org/kernel_mode.html
http://minnie.tuhs.org/CompArch/Lectures/week05.html
Thus even a simple bubble sort program will use quite a few CPU cycles. Measuring user time in wall-clock seconds per program is difficult and not very useful, because the exact numbers depend on, and vary a lot with, the underlying hardware, kernel version, other processes sharing resources, and so on. It varies even between consecutive runs, so a range is the more meaningful figure in such cases.
In general, user CPU time will be higher than system CPU time, but the inverse is possible.
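To see the split for yourself, getrusage() reports the same two numbers that time(1) prints as "user" and "sys"; a minimal sketch (the busy loop is just there to burn user-mode cycles):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    volatile long sink = 0;
    long i;

    /* burn some user-mode CPU; no system calls inside the loop */
    for (i = 0; i < 100000000L; i++)
        sink += i % 7;

    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("user: %ld.%06ld s, sys: %ld.%06ld s\n",
           (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
           (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
    return 0;
}

Run it a few times: the user figure varies from run to run for the reasons above, while the sys figure stays near zero because the loop makes no system calls.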

QueryPerformanceCounter on multi-core processor under Windows 10 behaves erratically

Under Windows, my application makes use of QueryPerformanceCounter (and QueryPerformanceFrequency) to perform "high resolution" timestamping.
Since Windows 10 (and only tested on Intel i7 processors so far), we observe erratic behaviours in the values returned by QueryPerformanceCounter.
Sometimes, the value returned by the call will jump far ahead and then back to its previous value.
It feels as if the thread has moved from one core to another and was returned a different counter value for a lapse of time (no proof, just a gut feeling).
This has never been observed under XP or 7 (no data about Vista, 8 or 8.1).
A "simple" workaround has been to enable the UsePlatformClock boot opiton using BCDEdit (which makes everything behaves wihtout a hitch).
I know about the potentially superior GetSystemTimePreciseAsFileTime but as we still support 7 this is not exactly an option unless we write totatlly different code for different OSes, which we really don't want to do.
Has such behaviour been observed/explained under Windows 10 ?
I'd need much more knowledge about your code, but let me highlight a few things from MSDN:
When computing deltas, the values [from QueryPerformanceCounter] should be clamped to ensure that any bugs in the timing values do not cause crashes or unstable time-related computations.
And especially this:
Set that single thread to remain on a single processor by using the Windows API SetThreadAffinityMask ... While QueryPerformanceCounter and QueryPerformanceFrequency typically adjust for multiple processors, bugs in the BIOS or drivers may result in these routines returning different values as the thread moves from one processor to another. So, it's best to keep the thread on a single processor.
Your case might be hitting one of those bugs. In short:
Query the timestamp always from one thread (pinned to a single CPU so its affinity can't change) and read that value from any other thread (just an interlocked read, no need for fancy synchronization); see the sketch below.
Clamp the calculated delta (at least to make sure it's not negative).
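A minimal sketch of that pattern (the thread and variable names are made up for illustration): one thread is pinned to CPU 0, samples QueryPerformanceCounter in a loop and publishes the value with an interlocked write; readers on any core only ever do an interlocked read.

#include <stdio.h>
#include <windows.h>

static volatile LONGLONG g_timestamp;   /* latest counter value, published atomically */

static DWORD WINAPI TimestampThread(LPVOID arg)
{
    LARGE_INTEGER now;
    /* pin the sampling thread to CPU 0 so the counter always comes
     * from the same core, sidestepping BIOS/driver TSC-sync bugs */
    SetThreadAffinityMask(GetCurrentThread(), 1);
    for (;;) {
        QueryPerformanceCounter(&now);
        InterlockedExchange64(&g_timestamp, now.QuadPart);
        Sleep(0);   /* yield; tune this to the resolution you actually need */
    }
    return 0;
}

LONGLONG ReadTimestamp(void)
{
    /* readers never touch QPC themselves, only the published value */
    return InterlockedCompareExchange64(&g_timestamp, 0, 0);
}

int main(void)
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);
    CreateThread(NULL, 0, TimestampThread, NULL, 0, NULL);
    Sleep(100);
    printf("counter: %lld (%.0f counts/s)\n", ReadTimestamp(), (double)freq.QuadPart);
    return 0;
}

Deltas computed from consecutive ReadTimestamp() values should still be clamped to zero if they come out negative, per the recommendation above.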
Notes:
QueryPerformanceCounter() uses, if possible, the TSC (see MSDN). The algorithm used to synchronize the TSC (if available, and in your case it should be) changed considerably from Windows 7 to Windows 8; however, note that:
With the advent of multi-core/hyper-threaded CPUs, systems with multiple CPUs, and hibernating operating systems, the TSC cannot be relied upon to provide accurate results — unless great care is taken to correct the possible flaws: rate of tick and whether all cores (processors) have identical values in their time-keeping registers. There is no promise that the timestamp counters of multiple CPUs on a single motherboard will be synchronized. Therefore, a program can get reliable results only by limiting itself to run on one specific CPU.
So even if QPC is monotonic in theory, in practice you must always call it from the same thread to be sure of it.
Another note: if the synchronization is done by software, the Intel documentation warns that:
...It may be difficult for software to do this in a way that ensures that all logical processors will have the same value for the TSC at a given point in time...
Edit: if your application is multithreaded and you can't (or don't want to) set CPU affinity (especially if you need precise timestamps even at the cost of de-synchronized values between threads), then you may use GetSystemTimePreciseAsFileTime() when running on Win8 (or later) and fall back to timeGetTime() on Win7 (after setting the granularity to 1 ms with timeBeginPeriod(1), and assuming 1 ms resolution is enough). A very interesting read: The Windows Timestamp Project.
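A sketch of that runtime detection (the function-pointer typedef and helper names are made up): resolve GetSystemTimePreciseAsFileTime from kernel32 at startup, and fall back to timeGetTime() at 1 ms granularity when it isn't there.

#include <stdio.h>
#include <windows.h>
#pragma comment(lib, "winmm.lib")   /* timeGetTime/timeBeginPeriod (MSVC; link winmm manually elsewhere) */

typedef VOID (WINAPI *PreciseTimeFn)(LPFILETIME);
static PreciseTimeFn g_preciseTime;

void InitClock(void)
{
    g_preciseTime = (PreciseTimeFn)GetProcAddress(
        GetModuleHandleA("kernel32.dll"), "GetSystemTimePreciseAsFileTime");
    if (!g_preciseTime)
        timeBeginPeriod(1);   /* Win7 path: request 1 ms timer granularity */
}

/* returns a timestamp in milliseconds; precision differs between the two paths */
double NowMilliseconds(void)
{
    if (g_preciseTime) {
        FILETIME ft;
        ULARGE_INTEGER u;
        g_preciseTime(&ft);               /* 100 ns units since 1601 */
        u.LowPart = ft.dwLowDateTime;
        u.HighPart = ft.dwHighDateTime;
        return (double)u.QuadPart / 10000.0;
    }
    return (double)timeGetTime();         /* wraps every ~49.7 days */
}

int main(void)
{
    InitClock();
    printf("%s path, t = %.3f ms\n",
           g_preciseTime ? "precise" : "timeGetTime", NowMilliseconds());
    return 0;
}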
Edit 2: directly suggested by the OP! This, when applicable (because it's a system setting, not local to your application), might be an easy workaround. You can force QPC to use the HPET instead of the TSC using bcdedit (see MSDN). Latency and resolution should be worse, but it's intrinsically safe from the issues described above.

Is 16 milliseconds an unusually long length of time for an unblocked thread running on Windows to be waiting for execution?

Recently I was doing some deep timing checks on a DirectShow application I have in Delphi 6, using the DSPACK components. As part of my diagnostics, I created a Critical Section class that adds a time-out feature to the usual Critical Section object found in most Windows programming languages. If the time duration between the first Acquire() and the last matching Release() is more than X milliseconds, an Exception is thrown.
Initially I set the time-out at 10 milliseconds. The code I have wrapped in Critical Sections is pretty fast using mostly memory moves and fills for most of the operations contained in the protected areas. Much to my surprise I got fairly frequent time-outs in seemingly random parts of the code. Sometimes it happened in a code block that iterates a buffer list and does certain quick operations in sequence, other times in tiny sections of protected code that only did a clearing of a flag between the Acquire() and Release() calls. The only pattern I noticed is that the durations found when the time-out occurred were centered on a median value of about 16 milliseconds. Obviously that's a huge amount of time for a flag to be set in the latter example of an occurrence I mentioned above.
So my questions are:
1) Is it possible for Windows thread-management code to, on a fairly frequent basis (about once every few seconds), switch out an unblocked thread and not return to it for 16 milliseconds or longer?
2) If that is a reasonable scenario, what steps can I take to lessen that occurrence and should I consider elevating my thread priorities?
3) If it is not a reasonable scenario, what else should I look at or try as an analysis technique to diagnose the real problem?
Note: I am running on Windows XP on an Intel i5 Quad Core with 3 GB of memory. Also, the reason I need this code to be fast is the buffer size, in milliseconds, I have chosen for my DirectShow filter graphs. To keep latency at a minimum, audio buffers in my graph are delivered every 50 milliseconds. Therefore, any operation that takes a significant percentage of that time duration is troubling.
Thread priorities determine when ready threads are run. There is, however, a starvation-prevention mechanism: the so-called Balance Set Manager wakes up every second and looks for ready threads that haven't run for about 3 or 4 seconds; if it finds one, it boosts its priority to 15 and gives it double the normal quantum. It does this for no more than 10 threads at a time (per second) and scans no more than 16 threads at each priority level at a time. At the end of the quantum, the boosted priority drops back to its base value. You can find out more in the Windows Internals book(s).
So what you observe is pretty normal behavior; threads may not be run for seconds.
You may need to elevate priorities, or otherwise consider the other threads that are competing for CPU time.
Sounds like normal Windows behaviour with respect to timer resolution, unless you explicitly go for one of the high-precision timers. Some details in this MSDN link.
First of all, I am not sure that Delphi's Now is a good choice for millisecond-precision measurements. The GetTickCount and QueryPerformanceCounter APIs would be a better choice.
When there is no collision on the critical-section lock, everything runs pretty fast. However, if you try to enter a critical section that is currently locked by another thread, you eventually hit a wait on an internal kernel object (a mutex or event), which means yielding control of the thread and waiting for the scheduler to give control back later.
The "later" above depends on a few things, including the priorities mentioned above, and there is one important thing you omitted from your test: the overall CPU load at the time of your testing. The higher the load, the lower the chance that your thread resumes execution soon. 16 ms still looks to be within reasonable tolerance, and all in all it may depend on your actual implementation.
