I would like a timed delivery of a non-maskable interrupt (NMI). Specifically, I would like to be able to put the processor into a C state with interrupts disabled. Then, I expect the processor to wake up on delivery of the NMI.
I know that performance counters can be set up to deliver NMIs on overflow. However, I am not sure what event the counters should be set up to count. You can't count instructions or unhalted clock cycles, because the CPU is basically halted.
I realize that you might not want to do this on a deployed system, but this is more of an academic experiment. I want to be able to control the precise amount of time spent in the sleep state.
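For reference, this is roughly how I would arm a counter from user space with perf_event_open() so that its overflow raises the PMU interrupt (which Linux delivers as an NMI on x86). The event below is only a placeholder; picking one that still ticks while the CPU is in a C-state is exactly what I'm unsure about.

    /* Sketch: arm a PMC with a sample period so that overflow raises the PMU
     * interrupt (delivered as an NMI on x86). The event is a placeholder. */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                               int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void)
    {
        struct perf_event_attr attr;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_REF_CPU_CYCLES;  /* placeholder event */
        attr.sample_period = 1000000;   /* overflow (and NMI) after 1M events */
        attr.disabled = 1;

        fd = perf_event_open(&attr, 0, -1, -1, 0);
        if (fd < 0) {
            perror("perf_event_open");
            return 1;
        }
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
        /* ... now disable interrupts and enter the C-state ... */
        return 0;
    }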
Related
When CPU usage is 60%, flame graphs (perf record) are used to capture the CPU usage. Why is the remaining 40% of idle-related stack usage not displayed in the flame graphs? The idle stack usage shown is often less than 5%.
For flame graphs, the point is normally to measure where a process spends CPU time while it's running, not which blocking functions it calls that make it sleep, or where it gets scheduled out and sleeps when it doesn't want to.
I capture performance for one CPU, not one process. By operating-system design, if there is no runnable task on a CPU, the CPU calls an idle-wait function. For example, Linux often calls schedule_idle until it is interrupted by a new task. Therefore, I expect schedule_idle to show up in the flame graph and account for 40% of the CPU usage.
Perf events like cycles don't increment when the clock is halted (e.g. cycles is cpu_clk_unhalted.thread_p or similar). If you really wanted to see time spent idle, you might be able to disable idle power saving to get Linux to just spin in a loop instead of using x86 monitor/mwait or even basic hlt to put the CPU into a C-state where the clock doesn't tick.
Or run your code pinned to one logical core, and on the other logical core, pin a task that runs the pause instruction in a loop. So the physical core's clock keeps ticking for the core you're counting events for.
You should still get counts for cpu_clk_unhalted.thread_any ([Core cycles when at least one thread on the physical core is not in halt state]) when recording that event on the logical core with your task, even when that logical core is asleep.
And you can also record counts for cpu_clk_unhalted.thread to count cycles when this (hardware) thread aka logical core isn't halted, to know how much CPU time you actually used. (Or use the software event task-clock for that.)
Use perf list to see events available on your CPU, and read their descriptions carefully.
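If it helps, here is a minimal sketch of the pause-loop idea: pin a throwaway task to the sibling logical core so the physical core's clock keeps ticking. The core number and the __builtin_ia32_pause() intrinsic (GCC/Clang on x86) are assumptions.

    /* Pin this process to one logical core and spin on the x86 PAUSE
     * instruction, keeping the physical core's clock unhalted while the
     * sibling logical core sleeps. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int cpu = (argc > 1) ? atoi(argv[1]) : 1;  /* sibling logical core */
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        for (;;)
            __builtin_ia32_pause();   /* low-power spin; clock stays unhalted */
    }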
Can you tell me the difference between the TEST pin and the READY pin on the 8086 microprocessor? Both of them seem to deal with making the processor wait.
TEST: input is examined by the "Wait" instruction. If the TEST input is LOW execution continues, otherwise the processor waits in an "Idle" state. This input is synchronized internally during each clock cycle on the leading edge of CLK.

READY: is the acknowledgement from the addressed memory or I/O device that it will complete the data transfer. The READY signal from memory/IO is synchronized by the 8284A Clock Generator to form READY. This signal is active HIGH. The 8086 READY input is not synchronized. Correct operation is not guaranteed if the setup and hold times are not met.
If you read the description of the READY signal, the wait instruction is not mentioned.
The READY signal is sampled on each and every memory or I/O cycle. If a device is not capable of responding to the CPU's request in the standard bus cycle, the READY signal can be used to stretch out the cycle, giving it more time.
This is done by signalling to the CPU that the device is not READY. The CPU adds clock cycles to the bus transaction until the device is READY. These extra cycles go by the confusing name of "wait states", and have nothing to do with the WAIT instruction or the TEST line. Many years ago, makers of fast memory would brag "No wait states!"
The part about the 8284A refers to the details of ensuring that the READY input meets the timing requirements of the processor, namely the so-called setup and hold times, which are normally only of concern to the engineer designing the computer system.
In your question, you can see that the TEST input is explicitly sampled by the WAIT instruction. The TEST input is simply an input signal with a dedicated pin on the processor (TEST) sampled by a dedicated instruction (WAIT).
Most processors have signals similar to the READY line. The TEST line is rather more rare.
I'm having a hard time understanding this.
How does the scheduler know that a certain period of time has passed?
Does it use some sort of syscall or interrupt for that?
What's the point of using the constant HZ instead of seconds?
What does the system timer have to do with the scheduler?
How does the scheduler know that a certain period of time has passed?
The scheduler consults the system clock.
Does it use some sort of syscall or interrupt for that?
Since the system clock is updated frequently, it suffices for the scheduler to just read its current value. The scheduler is already in kernel mode so there is no syscall interface involved in reading the clock.
Yes, there are timer interrupts that trigger an ISR, an interrupt service routine, which reads hardware registers and advances the current value of the system clock.
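As a toy illustration (this is not actual Linux code), the per-tick work in such an ISR looks roughly like this:

    /* Toy sketch of a periodic timer ISR: advance the software clock by one
     * tick and do per-tick bookkeeping. Names are illustrative only. */
    struct task {
        int ticks_left;                    /* remaining quantum, in ticks */
    };

    static struct task *current_task;      /* task running when the tick hit */
    static volatile unsigned long ticks;   /* the system clock, in ticks */
    static volatile int need_resched;      /* ask the scheduler to run on exit */

    void timer_isr(void)
    {
        ticks++;                           /* the clock the scheduler reads */

        if (current_task && --current_task->ticks_left <= 0)
            need_resched = 1;              /* quantum expired: reschedule */
    }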
What's the point of using the constant HZ instead of seconds?
Once upon a time there was significant cost to invoking the ISR, and on each invocation it performed a certain amount of bookkeeping, such as checking whether the scheduler quantum had expired and firing TCP RTO retransmit timers. The hardware had limited flexibility and could only invoke the ISR at fixed intervals, e.g. every 10 ms if HZ is 100.

Higher HZ values made it more likely the ISR would run and find there was nothing to do, that no events had occurred since the previous run, in which case the ISR represented pure overhead, cycles stolen from a foreground user task. Lower HZ values would hurt dispatch latency, leading to sluggish network and interactive response times. The HZ tuning tradeoff tended to wind up somewhere near 100 or 1000 for practical hardware systems.

APIs that reported system clock time could only do so in units of ticks, where each ISR invocation would advance the clock by one tick, so callers needed to know the value of HZ in order to convert from tick units to S.I. units. Modern systems perform network tasks on a separately scheduled TCP kernel thread, and may support tickless kernels, which discard many of these outdated assumptions.
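For example, reporting a tick count to a caller in milliseconds requires exactly this kind of arithmetic; the sketch below assumes HZ = 100 (one tick every 10 ms), and inside the kernel helpers such as jiffies_to_msecs() wrap up the same conversion.

    /* Convert scheduler ticks to milliseconds, given the build-time HZ. */
    #include <stdio.h>

    #define HZ 100                          /* assumed: 100 ticks per second */

    static unsigned long ticks_to_ms(unsigned long ticks)
    {
        return ticks * 1000UL / HZ;         /* 1 tick = 1000/HZ ms = 10 ms here */
    }

    int main(void)
    {
        unsigned long uptime_ticks = 4321;  /* e.g. a tick count read from the OS */
        printf("%lu ticks = %lu ms\n", uptime_ticks, ticks_to_ms(uptime_ticks));
        return 0;
    }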
What does the system timer have to do with the scheduler?
The scheduler runs when the system timer fires an interrupt.
The nature of a pre-emptive scheduler is that it can pause "spinning" usermode code, e.g. while (1) {}, and manipulate the run queue, even on a single-core system.
Additionally, the scheduler runs when a process voluntarily gives up its time slice, e.g. when issuing syscalls or taking page faults.
How to improve scheduler and interrupt latency:
Background:
Embedded system based on a 10-core MIPS64 processor.
9 cores run SMP Linux, kernel version 2.6.32.27.
We have a process with real-time performance requirements which has to complete certain tasks within 1 ms. Under maximum load it may take 800 µs.
This process starts its processing after receiving a GPIO interrupt (a 1 ms interrupt provided by an FPGA, implemented as a kernel driver).
Until then it makes an ioctl call into the GPIO driver and is put to sleep inside the driver on an interruptible wait.
The GPIO ISR then calls wake_up() on this process.
To prevent other processes hogging CPU for this process, we run this process on an "isolcpus" core.
We have set the priority of this process to be the highest among our user threads, as below:
Priority: 80, Scheduling type:SCHED_FIFO
threadSetRtPriority(SCHED_FIFO, 80);
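(For what it's worth, the wrapper is assumed to boil down to the standard pthread call; a minimal sketch of what it does:)

    /* Hypothetical sketch of what a wrapper like threadSetRtPriority() does:
     * switch the calling thread to SCHED_FIFO at the given priority.
     * Requires root or CAP_SYS_NICE. */
    #include <pthread.h>
    #include <sched.h>
    #include <string.h>
    #include <stdio.h>

    static int threadSetRtPriority(int policy, int priority)
    {
        struct sched_param sp;

        memset(&sp, 0, sizeof(sp));
        sp.sched_priority = priority;              /* 80 in our case */
        return pthread_setschedparam(pthread_self(), policy, &sp);
    }

    int main(void)
    {
        int err = threadSetRtPriority(SCHED_FIFO, 80);
        if (err != 0) {
            fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
            return 1;
        }
        return 0;
    }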
All /proc/sys/kernel/sched_* parameter values are at their defaults; we haven't fine-tuned them.
Problem:
Sometimes we see that the ISR has called wake_up(), but the process is scheduled only after 350 µs.
This is a long time, given that our processor runs at 1.25 GHz.
This large scheduling latency puzzles us, as we have already isolated the core exclusively for this process using "isolcpus".
We profile the maximum CPU cycle count between consecutive 1 ms GPIO ISR calls. This maximum time is more than 1.5 ms.
This large interrupt latency is also a concern, as it eats into the time available for the process to do its processing within the 1 ms boundary.
Please help us with inputs to reduce the interrupt and scheduling latency numbers
The standard Linux kernel does not provide real-time scheduling. A level of real-time determinism can be achieved with the RT_Preempt patch. It still requires careful design, and is no substitute for an RTOS for critical real-time requirements.
I have been working on Linux kernel 4.8 preempt-rt, which has the RT_Preempt patch applied, from this repo: linux kernel 4.8 preempt-rt, and have some promising results!
I benchmarked both the preempt-rt and non-preempt-rt Linux kernels by running the cyclictest RT benchmark, and found that the max latency with the preempt-rt kernel came down to 61 µs, as against 2025 µs with the non-preempt kernel, which may well help your case.
The results have clearly tempted me to use the preempt-rt kernel, as there is an overwhelming difference in max latency between the two. I have documented the results here: sachin-mokashi-linux-preempt-rt, in case it is of help to you!
I'm trying to generate a clock signal on a GPIO pin (ARM platform, mach-davinci, kernel 2.6.27) at around 100 kHz, using a high-priority tasklet. The theory is simple: set the GPIO high, udelay for 5 µs, set the GPIO low, wait another 5 µs. But strange problems appear. First of all, I can't get this 5 µs delay, but that's fine; it looks like a hardware performance limit, so I moved to a 40 µs period (which gives ~25 kHz). The second problem is worse: about once per 10 ms, udelay waits 3x longer than usual. I'm thinking that it's the heartbeat taking this time, but that is unacceptable from the point of view of the protocol that will be implemented on top of this. Is there any way to temporarily disable the heartbeat procedure, let's say for 500 ms? Or maybe I'm doing it wrong from the beginning? Any comments?
You cannot use a tasklet for this kind of job. Tasklets can be preempted by interrupts, and in some cases your tasklet may even be executed in process context!
If you absolutely have to do it this way, use an interrupt handler - get in, disable interrupts, do whatever you have to do and get out as fast as you can.
Generating the clock asynchronously in software is not the right thing to do. I can think of two alternatives that will work better:
Your processor may have a built-in clock generator peripheral that isn't already being used by the kernel or another driver. When you set one of these up, you tell it how fast to run its clock, and it just starts running out the pulses.
Get your processor's datasheet and study it.
You might not find a peripheral called a "clock" per se, but might find something similar that you can press into service, like a PWM peripheral.
The other device you are talking to may not actually require a regular clock. Some chips that need a "clock" line merely need a line that goes high when there is a bit to read, which then goes low while the data line(s) are changing. If this is the case, the 100 kHz thing you're reading isn't a hard requirement for a clock of exactly that frequency, it is just an upper limit on how fast the clock line (and thus the data line(s)) are allowed to transition.
With a CPU so much faster than the clock, you want to split this into two halves:
The "top half" sets the data line(s) state correctly, then brings the clock line up. Then it schedules the bottom half to run 5 μs later, using an interrupt or kernel timer.
In the "bottom half", called by the interrupt or timer, bring the clock line back down, then schedule the top half to run again 5 μs later.
Unless you can run your timer tasklet at a higher priority than the kernel timer, you will always be susceptible to this kind of jitter. Do you really have to do this by bit-banging? It would be far easier to use a hardware timer or PWM generator: configure the timer to run at your desired rate, set the pin to output, and you're done.
If you need software control over each bit period, you can try to work around the other tasks by scheduling your tasklet at a shorter period, say three-fourths of your 40 µs delay. In the tasklet, disable interrupts and poll the clock until you reach the correct 40 µs timeslot, set the I/O state, re-enable interrupts, and exit. But this effectively ties up 25% of your system in watching a clock.
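Sketched out, that poll-up-to-the-slot step might look like this (CLK_GPIO is a placeholder, and I'm assuming ktime_get() and gpiolib are available on your kernel):

    #include <linux/ktime.h>
    #include <linux/irqflags.h>
    #include <linux/gpio.h>
    #include <asm/processor.h>               /* cpu_relax() */

    #define CLK_GPIO  42                     /* placeholder GPIO number */
    #define SLOT_NS   40000                  /* 40 us bit period */

    static ktime_t next_edge;                /* when the next edge is due */
    static int level;

    /* Called from the tasklet scheduled ~3/4 of a bit period early. */
    static void hit_next_slot(void)
    {
        unsigned long flags;

        local_irq_save(flags);               /* keep other IRQs from adding jitter */
        while (ktime_to_ns(ktime_sub(next_edge, ktime_get())) > 0)
            cpu_relax();                     /* burn cycles until the slot arrives */

        level = !level;
        gpio_set_value(CLK_GPIO, level);     /* drive the edge right on time */
        next_edge = ktime_add_ns(next_edge, SLOT_NS);
        local_irq_restore(flags);
    }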