I'm writing a driver for device that has no DMA.
Driver start device operation and then waits for operation to complete by pulling busy flag from device periodically.
The operation takes 4.5 or 9 microseconds to complete. Under no circumstances device completes operation earlier. No errors will arise from reading state of device later than designated amount of time (9 microseconds).
Busywait
All kernel articles found in Internet tell to busywait for intervals this small. They argue that any sleep operation will waste more resources than busywait. Personally, I don't like wasting cycles on trivial wait.
Schedule/sleep
I've found out that on target hardware with a process using whole CPU schedule() call causing other threads (on empty processor there are no other threads) can be invoked up to 4 times before device operation completes. When I add more dummy processes to load the CPU, I see that schedule call in my driver gives fairer CPU allocation for concurrently running processes.
Should I call schedule() or even usleep_range() while waiting for my device, or should I rely on kernel preemption to deal with busywaits?
According to Documentation/timers/timers-howto.txt, usleep_range is what you want for recent kernels.
[Update: Whoops! I should have read more carefully. For < 10us, it says to use udelay. So I guess it depends on how long you are willing to sleep for.]
Related
Whenever a process is moved into the waiting state, I understand that the CPU moved to another process. But whenever a process is in waiting state if it is still needing to make a request to another I/O resource does that computation not require processing? Is there i'm assuming a small part of the processor that is dedicated to help computation of the I/O request to move data back and forth?
I hope this question makes sense lol.
IO operations are actually tasks for peripheral devices to do some work. Usually you set the task by writing data to special areas of memory which belongs to devices. They monitor changes in that small area and start to execute the tasks. So CPU does not need to do anything while the operation is in progress and can switch to another program. When the IO is completed usually an interrupt is triggered. This is a special hardware mechanism which pauses currently executed program in arbitrary place and switches to a special suprogramm, which decides what to do later. There can be another designs, for example device may set special flag somewhere in it's memory region and OS must check it from time to time.
The problem is that these IO are usually quite small, such as send 1 byte over COM port, so CPU has to be interrupted too often. You can't achieve high speed with them. Here is where DMA comes handy. This is a special coprocessor (or part of peripheral device) which has direct access to RAM and can feed big blocks of memory in-to devices. So it can process megabytes of data without interrupting CPU.
That is, if the core processor most of the time waiting for data from RAM or cache-L3 with cache-miss, but the system is a real-time (real-time thread priority), and the thread is attached (affinity) to the core and works without switching thread/context, what kind of load(usage) CPU-Core should show on modern x86_64?
That is, CPU usage is displayed as decrease only when logged in Idle?
And if anyone knows, if the behavior is different in this case for other processors: ARM, Power[PC], Sparc?
Clarification: shows CPU-usage in standard Task manager in OS-Windows
A hardware thread (logical core) that's stalled on a cache miss can't be doing anything else, so it still counts as busy for the purposes of task-managers / CPU time accounting / OS process scheduler time-slices / stuff like that.
This is true across all architectures.
Without hyperthreading, "hardware thread" / "logical core" are the same as a "physical core".
Morphcore / other on-the-fly changing between hyperthreading and a more powerful single core could make there be a difference between a thread that keeps many execution units busy, vs. a thread that is blocked on cache misses a lot of the time.
I don't get the link between the OS CPU usage statistics and the optimal use of the pipeline. I think they are uncorrelated as the OS doesn't measure the pipeline load.
I'm writing this in the hope that Peter Cordes can help me understand it better and as a continuation of the comments.
User programs relinquish control to OS very often: when they need input from user or when
they are done with the signal/message. GUI program are basically just big loops and at
each iteration control is given to the OS until the next message.
When the OS has the control it schedules others threads/tasks and if not other actions
are needed just enter the idle process (long time ago a tight loop, now a sleep state)
until the next interrupt. This is the Idle Time.
Time spent on an ISR processing user input is considered idle time by any OS.
An a cache miss there would be still considered idle time.
A heavy program takes more time to complete the work for a given message thereby returning
control to OS say 2 times in a second instead of
20.
If the OS measures that in the last second, it got control for 20ms only then the
CPU usage is (1000-20)/1000 = 98%.
This has nothing to do with the optimal use of the CPU architecture, as said stalls can
occur in the OS code and still be part of the Idle time statistic.
The CPU utilization at pipeline level is not what is measured and it is orthogonal to the
OS statistics.
CPU usage is meant to be used by sysadmin, it is a measure of the load you put on a system,
it is not the measure of how efficiently the assembly of a program was generated.
Sysadmins can't help with that, but measuring how often the OS got the control back (without
preempting) is a measure of how much load a program is putting on the system.
And sysadmins can definitively do terminate heavy programs.
One has blocking calls whenever the CPU is waiting for some system to respond, e.g. waiting for an internet request. Is the CPU literally wasting time during these calls (I don't know whether there are machine instructions other than no-op that would correspond to the CPU literally wasting time). If not, what is it doing?
The thread is simply skipped when the operating system scheduler looks for work to hand off to a core. With the very common outcome that nothing needs to be done. The processor core then executes the HLT instruction.
In the HALT state it consumes (almost) no power. An interrupt is required to bring it back alive. Most typically that will be the clock interrupt, it ticks 64 times per second by default. It could be a device interrupt. The scheduler then again looks for work to do. Rinse and repeat.
Basically, the kernel maintains run queues or something similar to schedule threads. Each thread receives a time slice where it gets to execute until it expires or it volontarily yields its slice. When a thread yields or its slice expires, the scheduler decides which thread gets to execute next.
A blocking system call would result in a yield. It would also result in the thread being removed from the run queue and placed in a sleep/suspend queue where it is not eligible to receive time slices. It would remain in the sleep/suspend queue until some critiera is met (e.g. timer tick, data available on socket, etc.). Once the criteria is met, it'd be placed back into the run queue.
Sleep(1); // Yield, install a timer, and place the thread in a sleep queue.
As long as there are tasks in any of the run queues (there may be more than one, commonly one per processor core), the scheduler will keep handing out time slices. Depending on scheduler design and hardware constraints, these time slices may vary in length.
When there are no tasks in the run queue, the core can enter a powersaving state until an interrupt is received.
In essence, the processor never wastes time. Its either executing other threads, servicing interrupts or in a powersaving state (even for very short durations).
While a thread is blocked, especially if it is blocked on an efficient wait object that puts the blocked thread to sleep, the CPU is busy servicing other threads in the system. If there are no application threads running, there is always system threads running. The CPU is never truly idle.
I am trying to determine approximate time delay (Win 7, Vista, XP) to switch threads when an IO operation completes.
What I (think I) know is that:
a) Thread contex switches are themselves computationally very fast. (By very fast, I mean typically way under 1ms, maybe even under 1us? - assuming a relatively fast, unloaded machine etc.)
b) Round robin time slice quantums are on the order of 10-15ms.
What I can't seem to find is information about the typical latency time from a (high priority) thread becoming active/signaled - via, say, a synchronous disk write completing - and that thread actually running again.
E.g., I have read in at least one place that all inactive threads remain asleep until ~10ms system quantum expires and then (assuming they are ready to go), they all get reactivated almost synchronously together. But in another place I read that the delay between when a thread completes an I/O operation and when it becomes active/signaled and runs again is measured in microseconds, not milliseconds.
My context for asking is related to capture and continuous streaming write to a RAID array of SSDs, from a high speed camera, where unless I can start a new write after a prior one has finished in well under 1ms (it would be best if under 1/10ms, on average), it will be problematic.
Any information regarding this issue would be most appreciated.
Thanks,
David
Thread context switches cost between 2,000 and 10,000 cpu cycles, so a handful of microseconds.
An I/O completion is fast when a thread is blocking on the synchronization handle that signals completion. That makes the Windows thread scheduler temporarily boost the thread priority. Which in turn makes it likely (but not guaranteed) to be chosen as the thread that gets the processor love. So that's typically microseconds, not milliseconds.
Do note that disk writes normally go through the file system cache. Which makes the WriteFile() call a simple memory-to-memory copy that doesn't block the thread. This runs at memory bus speeds, 5 gigabytes per second and up. Data is then written to the disk in a lazy fashion, the thread isn't otherwise involved or delayed by that. You'll only get slow writes when the file system cache is filled to capacity and you don't use overlapped I/O. Which is certainly a possibility if you write video streams. The amount of RAM makes a great deal of difference. And SSD controllers are not made the same. Nothing you can reason out up front, you'll have to test.
I know that spinlocks work with spining, different kernel paths exist and Kernels are preemptive, so why spinlocks don't work in uniprocessor systems? (for example, in Linux)
If I understand your question, you're asking why spin locks are a bad idea on single core machines.
They should still work, but can be much more expensive than true thread-sleeping concurrency:
When you use a spinlock, you're essentially asserting that you don't think you will have to wait long. You are saying that you think it's better to maintain the processor time slice with a busy loop than the cost of sleeping your thread and context-shifting to another thread or process. If you have to wait a very short amount of time, you can sleep and be reawakened almost immediately, but the cost of going down and up is more expensive than just waiting around.
This is more likely to be OK on multi-core processors, since they have much better concurrency profiles than single core processors. On multi core processors, between loop iterations, some other thread may have taken care of your prerequisite. On single core processors, it's not possible that someone else could have helped you out - you've locked up the one and only core.
The problem here is that if you wait or sleep on a lock, you hint to the system that you don't have everything you need yet, so it should go do some other stuff and come back to you later. With a spin lock, you never tell the system this, so you lock it up waiting for something else to happen - but, meanwhile, you're holding up the whole system, so something else can't happen.
The nature of a spinlock is that it does not deschedule the process - instead it spins until the process acquires the lock.
On a uniprocessor, it will either immediately acquire the lock or it will spin forever - if the lock is contended, then there will never be an opportunity for the process which currently holds the resource to give it up. Spinlocks are only useful when another process can execute while one is spinning on the lock - which means multiprocessor systems.
there are different versions of spinlock:
spin_lock_irqsave(&xxx_lock, flags);
... critical section here ..
spin_unlock_irqrestore(&xxx_lock, flags);
In Uni processor spin_lock_irqsave() should be used when data needs to shared between process context and interrupt context, as in this case IRQ also gets disabled. spin_lock_irqsave() work under all circumstances, but partly because they are safe they are also fairly slow.
However, in case data needs to be protected across different CPUs then it is better to use below versions, these are cheaper ones as IRQs dont get disabled in this case:
spin_lock(&lock);
...
spin_unlock(&lock);
In uniprocessor systems calling spin_lock_irqsave(&xxx_lock, flags); has the same effect as disabling interrupts which will provide the needed interrupt concurrency protection without unneeded SMP protection. However, in multiprocessor systems this covers both interrupt and SMP concurrency issues.
Spinlocks are, by their nature, intended for use on multiprocessor systems, although a uniprocessor workstation running a preemptive kernel behaves like SMP, as far as concurrency is concerned. If a nonpreemptive uniprocessor system ever went into a spin on a lock, it would spin forever; no other thread would ever be able to obtain the CPU to release the lock. For this reason, spinlock operations on uniprocessor systems without preemption enabled are optimized to do nothing, with the exception of the ones that change the IRQ masking status. Because of preemption, even if you never expect your code to run on an SMP system, you still need to implement proper locking.
Ref:Linux device drivers
By Jonathan Corbet, Alessandro Rubini, Greg Kroah-Hartma
Find the following two paragraph in Operating System Three Easy Pieces that might be helpful:
For spin locks, in the single CPU case, performance overheads can be
quite painful; imagine the case where the thread holding the lock is
pre-empted within a critical section. The scheduler might then run
every other thread (imagine there are N − 1 others), each of which
tries to ac- quire the lock. In this case, each of those threads will
spin for the duration of a time slice before giving up the CPU, a
waste of CPU cycles.
However, on multiple CPUs, spin locks work
reasonably well (if the number of threads roughly equals the number of
CPUs). The thinking goes as follows: imagine Thread A on CPU 1 and
Thread B on CPU 2, both contending for a lock. If Thread A (CPU 1)
grabs the lock, and then Thread B tries to, B will spin (on CPU 2).
However, presumably the crit- ical section is short, and thus soon the
lock becomes available, and is ac- quired by Thread B. Spinning to
wait for a lock held on another processor doesn’t waste many cycles in
this case, and thus can be effective