high resolution timing in kernel? - linux-kernel

I'm writing a kernel module that requires function called in 0.1ms intervals measured with at least 0.01ms precision. 250MHz ARM CPU, the HZ variable (jiffies per second) is 100 so anything jiffies-based is a no-go unless I can increase jiffies granularity.
Any suggestions where/how to look?

Assuming the kernel you are running has the Hi-Res timer support turned on (it is a build time config option) and that you have a proper timer hardware which can provide the needed support to raise an interrupt in such granularity, you can use the in kernel hrtimer API to register a timer with your requirement.
Here is the hrtimer documentation: http://www.mjmwired.net/kernel/Documentation/timers/hrtimers.txt
Bare in mind though, that for truly getting uninterrupted responses on such a scale you most probably also need to apply and configure the Linux RT (aka PREEMPT_RT) patches.
You can read more here: http://elinux.org/Real_Time

Related

Understanding the `scheduler_tick()` function of Linux Kernel

To my knowledge, I understood that the scheduler_tick() function periodically gets called by the timer interrupt with the frequency of HZ.
I'm new to kernel development, so these might be naive questions but I'm trying to understand the execution of the scheduler_tick() function. I'd appreciate your help on these questions:
It is possible for the timer to invoke this function again before the previous call returns?
Is it called per-core? (with per-core timer interrupt)
If the HZ frequency is high, the scheduler executes a process for longer than its timeslice. If the HZ is low, I presume a significant overhead due to frequent interruptions and invocation of this function. Is this assumption correct?
Answering your main questions and comments at the same time:
It's not possible for scheduler code to be called while already scheduling, as it runs with interrupts disabled.
Yes, each core has its own runqueue and does its own scheduling.
It's the opposite: high HZ -> scheduler will run more frequently (since HZ represents the number of scheduling ticks per second); low HZ -> scheduler will run less frequently.
Scheduling on every single core simultaneously is not needed, each CPU has its own timer and does its own timer interrupts regardless of sync with other CPUs. Scheduler ticks don't happen at the same time on all CPUs (furthermore CPUs could be idle and not ticking at all in a "tickless" kernel, configured with CONFIG_NOHZ). Tickless kernels make things a bit tricker, for more info check out Documentation/timers/NO_HZ.txt and "How are jiffies incremented in a tickless kernel?".
As per what trigger_load_balance() does and how, see Documentation/scheduler/sched-domains.txt, which explains it more or less in detail. In particular, trigger_load_balance() doesn't do the actual load balancing, it just checks if it's needed and if so raises a soft IRQ (SCHED_SOFTIRQ), which is then handled by run_rebalance_domains(), which calls rebalance_domains(), which does the actual work. The latter function indeed uses RCU/locking, as you can see by the calls to functions such as spin_trylock() and rcu_read_lock().

QueryPerformanceCounter on multi-core processor under Windows 10 behaves erratically

Under Windows, my application makes use of QueryPerformanceCounter (and QueryPerformanceFrequency) to perform "high resolution" timestamping.
Since Windows 10 (and only tested on Intel i7 processors so far), we observe erratic behaviours in the values returned by QueryPerformanceCounter.
Sometimes, the value returned by the call will jump far ahead and then back to its previous value.
It feels as if the thread has moved from one core to another and was returned a different counter value for a lapse of time (no proof, just a gut feeling).
This has never been observed under XP or 7 (no data about Vista, 8 or 8.1).
A "simple" workaround has been to enable the UsePlatformClock boot opiton using BCDEdit (which makes everything behaves wihtout a hitch).
I know about the potentially superior GetSystemTimePreciseAsFileTime but as we still support 7 this is not exactly an option unless we write totatlly different code for different OSes, which we really don't want to do.
Has such behaviour been observed/explained under Windows 10 ?
I'd need much more knowledge about your code but let me highlight few things from MSDN:
When computing deltas, the values [from QueryPerformanceCounter] should be clamped to ensure that any bugs in the timing values do not cause crashes or unstable time-related computations.
And especially this:
Set that single thread to remain on a single processor by using the Windows API SetThreadAffinityMask ... While QueryPerformanceCounter and QueryPerformanceFrequency typically adjust for multiple processors, bugs in the BIOS or drivers may result in these routines returning different values as the thread moves from one processor to another. So, it's best to keep the thread on a single processor.
Your case might exploited one of those bugs. In short:
You should query the timestamp always from one thread (setting same CPU affinity to be sure it won't change) and read that value from any other thread (just an interlocked read, no need for fancy synchronizations).
Clamp the calculated delta (at least to be sure it's not negative)...
Notes:
QueryPerformanceCounter() uses, if possible, TSC (see MSDN). Algorithm to synchronize TSC (if available and in your case it should be) is vastly changed from Windows 7 to Windows 8 however note that:
With the advent of multi-core/hyper-threaded CPUs, systems with multiple CPUs, and hibernating operating systems, the TSC cannot be relied upon to provide accurate results — unless great care is taken to correct the possible flaws: rate of tick and whether all cores (processors) have identical values in their time-keeping registers. There is no promise that the timestamp counters of multiple CPUs on a single motherboard will be synchronized. Therefore, a program can get reliable results only by limiting itself to run on one specific CPU.
Then, even if in theory QPC is monotonic then you must always call it from the same thread to be sure of this.
Another note: if synchronization is made by software you may read from Intel documentation that:
...It may be difficult for software to do this in a way than ensures that all logical processors will have the same value for the TSC at a given point in time...
Edit: if your application is multithreaded and you can't (or you don't wan't) to set CPU affinity (especially if you need precise timestamping at the cost to have de-synchronized values between threads) then you may use GetSystemTimePreciseAsFileTime() when running on Win8 (or later) and fallback to timeGetTime() for Win7 (after you set granularity to 1 ms with timeBeginPeriod(1) and assuming 1 ms resolution is enough). A very interesting reading: The Windows Timestamp Project.
Edit 2: directly suggested by OP! This, when applicable (because it's a system setting, not local to your application), might be an easy workaround. You can force QPC to use HPET instead of TSC using bcdedit (see MSDN). Latency and resolution should be worse but it's intrinsically safe from above described issues.

How to improve scheduling and interrupt latency

How to improve scheduler and interrupt latency:
Background:
Embedded system based on 10 cores mips64 processor
9 cores run SMP linux. kernel version 2.6.32.27
We have realtime performance required process which has to complete certain tasks within 1ms. At maximum load conditions it may take 800uS.
This process starts the processing after receiving GPIO interrupt (1ms interrupt provided by FPGA. implemented as a kernel driver).
Till then it will make a icotl call to gpio driver and will be put to sleep by the virtue of wake_up_interruptible system call
The GPIO ISR will wake_up() this process
To prevent other processes hogging CPU for this process, we run this process on an "isolcpus" core.
We have set priority to be highest among user thread for this process as below:
Priority: 80, Scheduling type:SCHED_FIFO
threadSetRtPriority(SCHED_FIFO, 80);
All /proc/sys/kernel/sched_ parameter values are default. We haven't fine tuned them
Problem:
Sometimes we see that ISR has called wake_up, but the process is scheduled only after 350uS.
This is a big time since our processor is running at 1.25GHz.
This big number for scheduling latency, is puzzling us, as we have already isolated the core exclusively for this process by using "isolcpus"
We profile the max CPU cycle count between consecutive 1ms GPIO ISR calls. This max time is more than 1.5ms.
This big number for interrupt latency is too a concern for us, as this will eat up into the time available for the process to do its processing within 1ms boundary.
Please help us with inputs to reduce the interrupt and scheduling latency numbers
The standard Linux kernel does not provide real-time scheduling. A level of real-time determinism can be achieved with the RT_Preempt patch. It still requires careful design, and is no substitute for an RTOS for critical real-time requirements.
I have been working on linux kernel 4.8 preempt-rt which has the RT_Preempt patch applied from this repo: linux kernel 4.8 preempt-rt and have some promising results!
I have benchmarked both preempt-rt and non-preempt-rt linux kernels by running rt-benchmark cyclictests and found that the Max Latency in case of preempt-rt linux kernel has come down to 61 us as against 2025 us when using non-preempt linux kernel, which might as well help your case.
The results have clearly tempted me to use the prempt-rt kernel as there is an overwhelming difference in Max Latency between the two. I have documented the results here: sachin-mokashi-linux-preempt-rt, in case if it might be of help to you!

windows c++ and possibility of a microsecond sleep

Is there anyway at all in the windows environment to sleep for ~1 microsecond? After researching and reading many threads and various websites, I have not been able to see that this is possible. Since the scheduler appears to be the limiting factor and it operates
at the 1 millisecond level, then I believe it can't be done without going to a real time OS.
It may not be the most portable, and I've not used these functions myself, but it might be possible to use the information in the High-Resolution Timer section of this link and block: QueryPerformanceCounter
Despite the fact that windows is claimed to be not a "real-time" OS, events can be generated at microsecond resolution. The use of a combination of system time (file_time) and the performance counter frequency has been described at other places. However, careful
implementation with taking care about processor affinity and process/thread priorities opens the door to timed events at microsecond resolution.
Since the windows scheduler and the windows timer services do rely on the systems interrupt
mechanism, the microsecond can only be tuned for by polling. Particulary on multicore systems
polling is not so ugly anymore. And the polling only has to last for the shortest possible
interrupt period. The multimedia timer interface allows to put the interrupt period down to
about 1ms, thus one can get near the desired (microsecond resolution) time and the polling will last for 1ms at most.
My implementation of microsecond resolution time services for windows, test code and an extensive description can be found at the
Windows Timestamp Project located at windowstimestamp.com

What is the clock source for the count returned by QueryPerfomanceCounter

I was under the impression that QueryPerformanceCounter was actually accessing the counter that feeds the HPET (High Performance Event Timer)---the difference of course being that HPET is a timer which send an interrupt when the counter value matches the desired interval whereas to make a timer "out of" QueryPerformanceCounter you have to write your own loop in software.
The only reason I had assumed the hardware behind the two was the same is because somewhere I had read that QueryPerformanceCounter was accessing a counter on the chipset.
http://www.gamedev.net/reference/programming/features/timing/ claims that QueryPerformanceCounter use chipset timers which apparently have a specified clock rate. However, I can verify that QueryPerformanceFrequency returns wildly different numbers on different machines, and in fact, the number can change slightly from boot to boot.
The numbers returned can sometimes be totally ridiculous---implying ticks in the nanosecond range. Of course when put together it all works; that is, writing timer software using QueryPerformanceCounter/QueryPerformanceFrequency allows you to get proper timing and latency is pretty low.
A software timer using these functions can be pretty good. For example, with an interval of 1 millisecond, over 30 seconds it's easy to nearly 100% of ticks to fall within 10% of the intended interval. With an interval of 100 microseconds, you still get a high success rate (99.7%) but the worst ticks can be way off (200 microseconds).
I'm wondering if the clock behind the HPET is the same. Supposedly HPET should still increase accuracy since it is a hardware timer, but as of yet I don't know how to access it in Windows.
Sounds like Microsoft has made these functions use "whatever best timer there is":
http://www.microsoft.com/whdc/system/sysinternals/mm-timer.mspx
Did you try updating your CPU driver for your AMD multicore system? Did you check whether your motherboard chipset is on the "bad" list? Did you try setting the CPU affinity?
One can also use the RTC-based time functions and/or a skip-detecting heuristic to eliminate trouble with QPC.
This has some hints: CPU clock frequency and thus QueryPerformanceCounter wrong?
Please improve this. It is a community wiki.

Resources