Who calls delay_tsc() on Linux - linux-kernel

I used OProfile to profile my Linux box. During profiling, I found that besides the "native_safe_halt" function, "delay_tsc" was the second most significant consumer of CPU cycles (around 10%). It seems delay_tsc() is a busy loop, but who calls it and what is its purpose?

Nobody calls it directly, since it's a local (static) function inside the source file you link to. The public way to reach it is the exported __delay() function.
When you call __delay(), it dispatches through the delay_fn function pointer (also local to that file) to one of several delay implementations. By default, the one selected is delay_loop(), which uses x86 instructions to mark time in a software loop.
However, if use_tsc_delay() has been called (at boot time), it switches the function pointer to delay_tsc(), which uses the time stamp counter (a CPU counter) to mark time.
It's called by any kernel code that wants a reasonably reliable, high-resolution delay function. You can see all the code in the kernel that references __delay here (quite a few places).
I think it's probably pretty safe, in terms of profiling, to ignore the time spent in that function since its intent is to delay. In other words, it's not useful work that's taking a long time to perform - if callers didn't want to delay, they wouldn't call it.
Some examples from that list:
A watchdog timer uses it to pace the cores so that their output is not mixed up with each other, by delaying for some multiple of the current core ID.
The ATI frame buffer driver appears to use it for delays between low-level accesses to the hardware. In fact, it's used quite a bit for that purpose in many device drivers.
It's used during start-up to figure out the relationship between jiffies and the actual hardware speeds.


Does context switching usually happen between calling a function, and executing it?

So I have been working on the source code of a complex application (written by hundreds of programmers) for a while now. And among other things, I have created some time checking functions, along with suitable data structures to measure execution periods of different segments of the main loop and run some analysis on these measurements.
Here's a pseudocode that helps explaining:
main()
{
    TimeSlicingSystem::AddTimeSlice(0);
    FunctionA();
    TimeSlicingSystem::AddTimeSlice(3);
    FunctionB();
    TimeSlicingSystem::AddTimeSlice(6);
    PrintTimeSlicingValues();
}
void FunctionA()
{
    TimeSlicingSystem::AddTimeSlice(1);
    //...
    TimeSlicingSystem::AddTimeSlice(2);
}
void FunctionB()
{
    TimeSlicingSystem::AddTimeSlice(4);
    //...
    TimeSlicingSystem::AddTimeSlice(5);
}
void PrintTimeSlicingValues()
{
    //Prints the difference between each slice and the slice before it,
    //starting from slice number 1.
}
Most measurements were very reasonable, for instance assigning a value to a local variable will cost less than a fraction of a microsecond. Most functions will execute from start to finish in a few microseconds, and rarely ever reach one millisecond.
I then ran a few tests for half an hour or so, and I found some strange results that I couldn't quite understand. Certain functions would be called, and the time measured from the moment of calling the function (the last line in the 'calling' code) to the first line inside the 'called' function would be very long, up to 30 milliseconds. That's happening in a loop that would otherwise complete a full iteration in less than 8 milliseconds.
To get a picture of that, in the pseudocode I included, it is the time between slice number 0 and slice number 1, or between slice number 3 and slice number 4, that is measured. This is the sort of period I am referring to: the measured time between calling a function and running the first line inside the called function.
QuestionA. Could this behavior be due to thread or process switching by the OS? Is calling a function a uniquely vulnerable spot for that? The OS I am working on is Windows 10.
Interestingly enough, the reverse direction, from the last line of a function back to the first line after the call in the 'calling' code, never showed this problem at all (the periods from slice number 2 to 3, or from 5 to 6, in the pseudocode); all such measurements were always less than 5 microseconds.
QuestionB. Could this be, in any way, due to the time measurement method I am using? Could switching between different cores give an illusion of context switching being slower than it actually is, due to clock differences between cores? (Although I have never found a single negative delta time so far, which seems to refute this hypothesis altogether.) Again, the OS I am working on is Windows 10.
My time measuring function looks like this:
FORCEINLINE double Seconds()
{
    Windows::LARGE_INTEGER Cycles;
    Windows::QueryPerformanceCounter(&Cycles);
    // add big number to make bugs apparent where return value is being passed to float
    return Cycles.QuadPart * GetSecondsPerCycle() + 16777216.0;
}
QuestionA. Could this behavior be due to thread, or process switching by the OS?
Yes. Thread switches can happen at any time (e.g. when a device sends an IRQ that causes a different higher priority thread to unblock and preempt your thread immediately) and this can/will cause unexpected time delays in your thread.
Is calling a function a uniquely vulnerable spot for that?
There's nothing particularly special about calling your own functions that makes them uniquely vulnerable. If the function involves the kernel's API, a thread switch becomes more likely, and some things (e.g. calling sleep()) are almost guaranteed to cause a thread switch.
There's also a potential interaction with virtual memory management: often things (e.g. your executable file, your code, your data) use "memory mapped files", where accessing a page for the first time may cause the OS to fetch the code or data from disk (and your thread can be blocked until the code or data it wanted arrives from disk); rarely used code or data can also be sent to swap space and need to be fetched back.
QuestionB. Could this be, in any way, due to the time measurement method I am using?
In practice it's likely that Windows' QueryPerformanceCounter() is implemented with an RDTSC instruction (assuming an 80x86 CPU) and doesn't involve the kernel at all, and on modern hardware it's likely that this counter is monotonic. In theory Windows could emulate RDTSC and/or implement QueryPerformanceCounter() in another way to guard against security problems (timing side channels), as has been recommended by Intel for about 30 years now, but this is unlikely (modern operating systems, including but not limited to Windows, tend to care more about performance than security); and in theory your hardware/CPU could be so old (about 10+ years old) that Windows has to implement QueryPerformanceCounter() in a different way, or you could be using some other CPU (e.g. ARM and not 80x86).
In other words; it's unlikely (but not impossible) that the time measurement method you're using is causing any timing problems.

Will gettimeofday() be slowed due to the fix to the recently announced Intel bug?

I have been estimating the impact of the recently announced Intel bug on my packet processing application using netmap. So far, I have measured that I process about 50 packets per each poll() system call made, but this figure doesn't include gettimeofday() calls. I have also measured that I can read from a non-existing file descriptor (which is about the cheapest thing that a system call can do) 16.5 million times per second. My packet processing rate is 1.76 million packets per second, or in terms of system calls, 0.0352 million system calls per second. This means performance reduction would be 0.0352 / 16.5 = 0.21333% if system call penalty doubles, hardly something I should worry about.
However, my application may use gettimeofday() system calls quite often. My understanding is that these are not true system calls, but rather implemented as virtual system calls, as described in What are vdso and vsyscall?.
Now, my question is, does the fix to the recently announced Intel bug (that may affect ARM as well and that probably won't affect AMD) slow down gettimeofday() system calls? Or is gettimeofday() an entirely different animal due to being implemented as a different kind of virtual system call?
In general, no.
The current patches keep things like the vDSO pages mapped in user-space, and only change the behavior for the remaining vast majority of kernel-only pages which will no longer be mapped in user-space.
On most architectures, gettimeofday() is implemented as a purely user-space call that never enters the kernel, so it doesn't incur the TLB flush or CR3 switch that KPTI implies, and you shouldn't see a performance impact.
Exceptions include unusual kernel or hardware configurations that don't use the vDSO mechanisms, e.g. if you don't have a constant-rate TSC, or if you have explicitly disabled TSC timekeeping via a boot parameter. You'd probably already know if that were the case, since gettimeofday would then take 100-200 ns rather than 15-20 ns, because it's already making a kernel call.
Good question: the VDSO pages are kernel memory mapped into user space. If you single-step into gettimeofday(), you see a call into the VDSO page, where some code uses rdtsc and scales the result with scale factors it reads from another data page.
But these pages are supposed to be readable from user-space, so Linux can keep them mapped without any risk. The Meltdown vulnerability is that the U/S bit (user/supervisor) in page-table / TLB entries doesn't stop unprivileged loads (and further dependent instructions) from happening microarchitecturally, producing a change in the microarchitectural state which can then be read with cache-timing.

Performance profiling a KEXT

How can I measure the performance impact of a kext in OS X, in terms of CPU, memory, or thread usage, during some user-defined activities? Is there a particular method or tool that can be used from user land, or any approach/method that should be considered?
You've essentially got 2 options:
Instrumenting your kext with time measurements. Take timestamps before and after the operation you're trying to measure using mach_absolute_time(), convert to a human-readable unit using absolutetime_to_nanoseconds(), take the difference, then collect that information somewhere in your kext where it can be extracted from userspace.
Sampling kernel stacks using dtrace (iprofiler -kernelstacks -timeprofiler from the command line, or using Instruments.app)
Personally, I've had a lot more success with the former method, although it's definitely more work. Most kext code runs so briefly that a sampling profiler barely catches any instances of it executing, unless you reduce the sampling interval so far that measurements start interfering with the system, or your kext is seriously slow. It's pretty easy to do though, so it's often a valid sanity check.
You can also get your compiler to instrument your code with counters (-fprofile-arcs), which in theory will allow you to combine the sampling statistics with the branch counters to determine the runtime of each branch. Extracting this data is a pain though (my code may help) and again, the statistical noise has made this useless for me in practice.
The explicit method also allows you to measure asynchronous operations, etc., but of course it comes with some intrinsic overhead. Accumulating the data safely is also a little tricky. (I use atomic operations, but you could use spinlocks too. Don't forget to measure not just means but also standard deviation and minimum/maximum times.) And extracting the data can be a pain, because you have to add a userspace interface to your kext for it. But it's definitely worth it!

If a CPU is always executing instructions how do we measure its work?

Let us say we have a fictitious single core CPU with Program Counter and basic instruction set such as Load, Store, Compare, Branch, Add, Mul and some ROM and RAM. Upon switching on it executes a program from ROM.
Would it be fair to say the work the CPU does is based on the type of instruction it's executing? For example, a MUL operation would likely involve more transistors firing than, say, a Branch.
However from an outside perspective if the clock speed remains constant then surely the CPU could be said to be running at 100% constantly.
How exactly do we establish a paradigm for measuring the work of the CPU? Is there some kind of standard metric perhaps based on the type of instructions executing, the power consumption of the CPU, number of clock cycles to complete or even whether it's accessing RAM or ROM.
A related second question is what it means for a program to "stop". Does it usually just branch in an infinite loop, or does the PC halt and the CPU wait for an interrupt?
First of all, that a CPU is always executing some code is just an approximation these days. Computer systems have so-called sleep states which allow for energy saving when there is not too much work to do. Modern CPUs can also throttle their speed in order to improve battery life.
Apart from that, there is a difference between the CPU executing "some work" and "useful work". The CPU by itself can't tell the difference, but the operating system usually can. Except for some embedded software, a CPU will never be running a single job, but rather an operating system with different processes within it. If there is no useful process to run, the operating system will schedule the "idle task", which mostly means putting the CPU to sleep for some time (see above) or just burning CPU cycles in a loop which does nothing useful. Calculating the ratio of time spent in the idle task to time spent in regular tasks gives the CPU's utilization.
So while in the old days of DOS, when the computer was running (almost) only a single task, it was true that it was always doing something. Many applications used so-called busy-waiting if they just had to delay their execution for some time, doing nothing useful. But today there will almost always be a smart OS in place which can run the idle process, put the CPU to sleep, throttle down its speed, etc.
Oh boy, this is a toughie. It’s a very practical question as it is a measure of performance and efficiency, and also a very subjective question as it judges what instructions are more or less “useful” toward accomplishing the purpose of an application. The purpose of an application could be just about anything, such as finding the solution to a complex matrix equation or rendering an image on a display.
In addition, modern processors do things like clock gating in power-idle states. The oscillator is still producing cycles, but no instructions execute, because the clock signal is gated off from the idle circuitry. These are cycles that are not doing anything useful and need to be ignored.
Similarly, modern processors can execute multiple instructions simultaneously, execute them out of order, and predict which instructions will be executed next and execute them before your program (i.e. the IP or Instruction Pointer) actually reaches them. You don't want to include instructions whose execution never actually completes, e.g. because the processor guessed wrong and has to flush those instructions after a branch mispredict. So a better metric is counting only those instructions that actually complete. Instructions that complete are termed "retired".
So we should only count those instructions that complete (i.e. retire), and cycles that are actually used to execute instructions (i.e. unhalted).
Perhaps the most practical general metric for "work" is CPI, or cycles-per-instruction: CPI = CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY. CPU_CLK_UNHALTED.CORE counts cycles used to execute actual instructions (vs. those "wasted" in an idle state). INST_RETIRED.ANY counts those instructions that complete (vs. those that don't, due to something like a branch mispredict).
Trying to get a more specific metric, such as the instructions that contribute to the solution of a matrix multiply, while excluding instructions that don't directly contribute to computing the solution, such as control instructions, is very subjective and difficult to gather statistics on. (There are some that you can gather, such as VECTOR_INTENSITY = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED, the number of SIMD vector elements, such as SSE or AVX, processed per vector instruction. These instructions are more likely to directly contribute to the solution of a mathematical problem, as that is their primary purpose.)
Now that I’ve talked your ear off, check out some of the optimization resources at your friendly local Intel developer resource, software.intel.com. In particular, check out how to effectively use VTune. I’m not suggesting you need to buy VTune, though you can get a free or heavily discounted student license (I think). But the material will tell you a lot about increasing your program's performance (i.e. optimizing), which is, if you think about it, increasing the useful work your program accomplishes.
Expanding on Michał's answer a bit:
Programs written for modern multi-tasking OSes are more like a collection of event handlers: they effectively set up listeners for I/O and then yield control back to the OS. The OS wakes them up each time there is something to process (e.g. a user action, data from a device), and they "go to sleep" by calling into the OS once they've finished processing. Most OSes will also preempt, in case one process hogs the CPU for too long and starves the others.
The OS can then keep tabs on how long each process actually runs (by remembering the start and end time of each run) and generate statistics like CPU time and load (ready-process queue length).
And to answer your second question:
To stop mostly means a process is no longer scheduled and all associated resources (scheduling data structures, file handles, memory space, ...) are destroyed. This usually requires the process to make a special OS call (syscall/interrupt) so the OS can release the resources gracefully.
If, however, a process runs into an infinite loop and stops responding to OS events, then it can only be forcibly stopped (by simply not scheduling it anymore).

Something faster than GetSystemTime?

I'm writing a DDE logging applet in visual c++ that logs several hundred events per minute and I need a faster way to keep time than calling GetSystemTime in winapi. Do you have any ideas?
(I'm asking this because in testing under load, all exceptions were caused by a call to GetSystemTime.)
There is a mostly undocumented struct named USER_SHARED_DATA at a high usermode readable address that contains the time and other global things, but GetSystemTime just extracts the time from there and calls RtlTimeToTimeFields so there is not much you can gain by using the struct directly.
Possibly crazy thought: do you definitely need an accurate timestamp? Suppose you only got the system time, say, every 10th call - how bad would that be?
As per your comments, calling GetSystemDateFormat and GetSystemTimeFormat each time is a waste of time. These values are not likely to change, so they could easily be cached for improved performance. I would imagine (without having actually tested it) that those two calls are far more time-consuming than a call to GetSystemTime.
First of all, find out why your code is throwing exceptions (assuming you have described it correctly, i.e. a real exception is being thrown and the app descends into kernel mode - which is really slow, by the way).
Then you will most likely have solved any performance bottleneck.
Chris J.
First, the fastest way I can think of is using the RDTSC instruction. RDTSC is an Intel time-stamp counter instruction with cycle-level resolution. Use it in combination with the CPUID instruction to serialize execution. To use those instructions properly, you can read "Using the RDTSC Instruction for Performance Monitoring", then convert the cycle counts to seconds for your purpose.
Second, consider using QueryPerformanceFrequency() and QueryPerformanceCounter().
Third, not the fastest, but much faster than the standard GetSystemTime(): use timeGetTime().
