Force Intel Core i7 CPU to sleep momentarily? - windows

I would like to get my Core i7 CPU to enter sleep state just momentarily, for one millisecond or so from a batch file or executable.
I know sleep can be induced with SetSuspendState, but I'm looking for a solution that does not put the entire system to sleep, but just the CPU momentarily.
CPU is Core i7 3632QM, and OS is Windows 7 and 10.
Thanks

Based on your comment about defeating some kind of shutdown every 30 mins, it sounds like you need the whole CPU (all cores) to sleep. We need much more detail on that to do more than guess about which sleep states will serve your purpose and which won't.
Based on comments, it's likely that ACPI S3 sleep will be needed. Ross's comment about the hardware supporting an S1 sleep didn't mention an S2 (CPU actually powered down), so it's probably not even possible to power down just the CPU.
So your best bet is to look into programmatically doing a sleep/wake cycle, which is possible on at least some hardware. On Linux, the rtcwake command has an option to do that. I assume it programs a wakeup time into the BIOS's NVRAM before initiating a sleep. (I think there are only a few commonly-used formats/locations for storing this, so there's a good chance it's possible on your computer.)
Try a Google search for "wake up laptop at a certain time" or something similar to find a Windows equivalent of rtcwake. I didn't look at any of the hits, but they look promising.
I'm not an expert at this system power-state management stuff, but you probably need the system to enter an ACPI sleep state. S3 is the usual "suspend to RAM"; OSes that support suspend usually use this as their non-hibernate option.
For your use, maybe S1 or S2 will do (and anything less than this, like CPU power-saving C-states probably won't be sufficient, especially not states that are just per-core).
ACPI global sleep states (from Wikipedia). Systems are not required to implement all levels.
S1, Power on Suspend (POS): Processor caches are flushed, and the CPU(s) stops executing instructions. The power to the CPU(s) and RAM is maintained. Devices that do not indicate they must remain on may be powered off.
S2: CPU powered off. Dirty cache is flushed to RAM.
S3, commonly referred to as Standby, Sleep, or Suspend to RAM (STR): RAM remains powered. (But hard drives and everything else powers down)
S4, hibernate
I'm not going to try to write Windows API function calls to do this. I wouldn't be surprised if there's a program for requesting Windows to enter S1 or S2 state (ideally with some kind of triggered wakeup).
@RossRidge says that the HM70 chipset does implement S1 sleep (and implies that it doesn't support an S2 sleep). Since S1 doesn't power down the CPU, it may not reset the timer. Even a hypothetical S2 sleep might not do the trick, because the timer may be external to the CPU and/or managed by the BIOS.
Software exists to program the BIOS to wake at a certain time. That's one possible way to trigger coming out of suspend. So it might be possible to write a script that programs a wakeup time for 2 seconds in the future, then initiates a sleep.
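As a rough illustration of the "program a wakeup, then sleep" idea, here is a minimal Python sketch built around the Linux rtcwake command mentioned above. It is Linux-only and does not answer the Windows side of the question. The helper names build_rtcwake_command and sleep_wake_cycle are made up for this example, and by default the sketch only prints the command instead of suspending the machine.

```python
import shlex
import subprocess


def build_rtcwake_command(seconds: int, mode: str = "mem") -> list[str]:
    """Build an rtcwake invocation: suspend now, wake after `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # -m selects the suspend mode ("mem" is suspend-to-RAM, i.e. ACPI S3);
    # -s sets the RTC alarm that many seconds in the future.
    return ["rtcwake", "-m", mode, "-s", str(seconds)]


def sleep_wake_cycle(seconds: int, dry_run: bool = True) -> str:
    """Return the command line; actually run it only when dry_run is False."""
    cmd = build_rtcwake_command(seconds)
    if not dry_run:
        # Normally needs root, and really suspends the machine.
        subprocess.run(cmd, check=True)
    return shlex.join(cmd)


print(sleep_wake_cycle(2))
```

Whether a 2-second suspend/resume cycle actually resets the timer in question depends on the hardware, as discussed above.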
@MargaretBloom comments that Chapter 14 of the Intel Manuals enumerates all the power-management capabilities. (See the x86 tag wiki for links.) She also notes that a totally different workaround may be possible by using SMM.
Re: your followup question, which was downvoted into oblivion:
enter sleep state just momentarily, for one millisecond
1ms is about 3 million core clock cycles. That's not momentary for a computer, especially from an asm programming perspective.
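The arithmetic behind that figure, as a small sketch. The 3 GHz clock is an assumed round number for illustration, not a measured frequency of this CPU, and cycles_in is a name invented for this example.

```python
def cycles_in(duration_us: int, clock_hz: int) -> int:
    """Number of core clock cycles that elapse in `duration_us` microseconds."""
    return clock_hz * duration_us // 1_000_000


ASSUMED_CLOCK_HZ = 3_000_000_000  # assumption: roughly 3 GHz

print(f"{cycles_in(1_000, ASSUMED_CLOCK_HZ):,} cycles in 1 ms")
```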
You definitely don't want to write assembly by hand to enter these states. Instead, use your OS's existing ACPI interface. This is a big part of the reason that everyone downvoted the crap out of your followup question.
Other than short per-core sleeps from mwait, pause, and hlt insns, the OS needs to know what's going on. For more about pause, see this. There aren't specific instructions to enter deeper sleeps anyway; you program ACPI by writing to device registers in MMIO space.
When all cores are HLTed at the same time, the whole CPU can opportunistically power down more stuff until the next timer or other interrupt wakes it up again (this is, or at least is related to, ACPI C-states, as I understand it). But this happens all the time during normal operation, because modern OSes run HLT on cores that are idle. The only interesting thing you could do here is get the CPU to sleep like this occasionally even if the system was running some CPU-intensive processes (e.g. some threads with non-idle priority that run hlt in a loop). Since HLT is a privileged instruction, this would require a kernel thread or a syscall. You probably can't actually raise the priority of the system idle process so it steals time from other runnable processes.
This may be an oversimplification: I haven't looked at kernel idle tasks recently to see if they still just run HLT when they want the current core to sleep until the next interrupt. For a while (when CPU power management was in its infancy) idle loops used to run some other stuff to enter a low-power C-state. But HLT may do that now.

Related

Is there some sort of hardware support required for the implementation of the scheduler?

The state of the process at any given time consists of the processes in execution right? So at the moment say there are 4 userspace programs running on the processors. Now after each time slice, I assume control has to pass over to the scheduler so that the appropriate process can be scheduled next. What initiates this transfer of control? For me it seems like there has to be some kind of special timer/register in hardware that keeps count of the current time taken by the process since the process itself has no mechanism to keep track of the time for which it has executed... Is my intuition right??
First of all, this answer concerns the x86 architecture only.
There are different kinds of schedulers: preemptive and non-preemptive (cooperative).
Preemptive schedulers preempt the execution of a process, that is, initiate a context switch using a TSS (Task State Segment), which then performs a jump to another process. The process is stopped and another one is started.
Cooperative schedulers do not stop processes. They rely on the process giving up the CPU in favor of the scheduler, also called "yielding," similar to user-level threads without kernel support.
Preemption can be accomplished in two ways: as the result of some I/O-bound action or while the CPU is at play.
Imagine you sent some instructions to the FPU. It takes some time until it's finished. Instead of sitting around idly, you could do something else while the FPU performs its calculations! So, as the result of an I/O operation, the scheduler switches to another process, possibly resuming with the preempted process right after the FPU is done.
However, regular preemption, as required by many scheduling algorithms, can only be implemented with some interruption mechanism that fires at a certain frequency, independently of the process. A timer chip was deemed suitable, and with the IBM 5150 (a.k.a. IBM PC) released in 1981, an x86 system was delivered, incorporating, inter alia, an Intel 8088, an Intel 8042 keyboard controller chip, the Intel 8259 PIC (Programmable Interrupt Controller), and the Intel 8253 PIT (Programmable Interval Timer).
The i8253 was connected, like a few other peripheral devices, to the i8259. A couple of times per second (about 18.2 Hz with the default divisor) it issued an #INT signal to the PIC on IRQ 0, and after the acknowledge handshake the CPU was interrupted and a handler was executed.
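The "18 Hz" figure can be checked with a quick calculation. The constants below are not stated in the answer: they are the nominal PIT input clock (about 1.193182 MHz) and the default 16-bit divisor of 65536 on PC-compatible systems. pit_tick_rate is a name made up for this sketch.

```python
PIT_INPUT_HZ = 1_193_182   # nominal input clock of the 8253/8254 PIT
DEFAULT_DIVISOR = 65_536   # a programmed count of 0 is treated as 65536


def pit_tick_rate(divisor: int = DEFAULT_DIVISOR) -> float:
    """Interrupt rate on IRQ 0 for a given PIT divisor."""
    if not 1 <= divisor <= 65_536:
        raise ValueError("divisor must fit the 16-bit counter (1..65536)")
    return PIT_INPUT_HZ / divisor


rate = pit_tick_rate()
print(f"{rate:.2f} Hz, one tick every {1000 / rate:.1f} ms")
```

A smaller divisor gives a faster tick, which is how later operating systems got scheduler ticks in the 100 to 1000 Hz range out of the same kind of timer.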
That very handler could contain scheduling code, which decides on the next process to execute [1].
Of course, we (most of us) are living in the 21st century by now, and the IBM PC and its derivatives like the XT or AT are no longer in use. The PIC has given way to the more sophisticated Intel 82093AA APIC, to handle multiple processors/cores and for general improvement, but the PIT has remained the same, I think, maybe in the shape of some integrated version like the Intel AIP.
Cooperative schedulers do not need a regular interrupt and therefore need no special hardware support (except maybe for hardware-supported multitasking). The process yields the CPU deliberately, and if it doesn't, you have a problem. This is why few OSes actually use cooperative schedulers: a process that never yields can monopolize the CPU, which is a big security hole.
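To illustrate yielding without any timer, here is a toy cooperative round-robin scheduler using Python generators. It is only a sketch of the idea, not how any real OS implements it; task and run_cooperative are names invented for the example.

```python
from collections import deque


def task(name, steps, log):
    """A toy task that does `steps` units of work, yielding after each one."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # voluntarily hand the CPU back to the scheduler


def run_cooperative(tasks):
    """Round-robin over tasks; a switch happens only when a task yields."""
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)          # run until the task's next yield
        except StopIteration:
            continue               # task finished, drop it
        ready.append(current)      # still has work, back of the queue


log = []
run_cooperative([task("A", 2, log), task("B", 3, log)])
print(log)
```

If one of the tasks looped forever without reaching a yield, run_cooperative would never regain control, which is exactly the weakness described above.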
[1] Note, however, that OSes on the 8086 (mostly DOS) didn't have a real scheduler. The x86 architecture only natively supported multitasking in the hardware with the advent of one of the 80386 versions (SX, DX, and whatever). I just wanted to stress that the IBM 5150 was the first x86 system with a timer chip (and, of course, the first PC altogether).
Systems running an OS with a preemptive scheduler (i.e. all those in common use) are, IME, all provided with a hardware timer interrupt that causes a driver to run and can change the set of running threads.
Such a timer interrupt is very useful for providing timeouts for system calls, sleep() functionality and other time-related functions. It can also help share out the available CPU amongst ready threads when the system is overloaded, or the thread/s run on it are CPU-intensive, and so the number of ready threads exceeds the number of cores available to run them.
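A toy model of how a timer tick shares one core among CPU-bound threads: each simulated tick stands in for a timer interrupt, and a thread that has used up its time slice is moved to the back of the ready queue. This is a simplified sketch; run_preemptive and the tick counts are invented for illustration.

```python
from collections import deque


def run_preemptive(work, slice_ticks):
    """work maps thread name -> ticks of CPU it needs; returns who ran on each tick."""
    ready = deque(work.items())
    timeline = []
    while ready:
        name, remaining = ready.popleft()
        used = 0
        # The thread runs until it finishes or the simulated timer
        # interrupt has fired slice_ticks times.
        while remaining > 0 and used < slice_ticks:
            timeline.append(name)
            remaining -= 1
            used += 1
        if remaining > 0:
            ready.append((name, remaining))  # preempted, back of the queue
    return timeline


print("".join(run_preemptive({"A": 3, "B": 5}, slice_ticks=2)))
```

Note that the threads here never give up the CPU themselves; the switch is forced by the tick count alone.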
It is quite possible to implement a preemptive scheduler without any hardware timer, allowing the set of running threads to be scheduled upon software interrupts (system calls) from threads that are already running, and upon all the other interrupts raised on I/O completion by the peripheral drivers for disk, NIC, KB, mouse etc. I've never seen it done though - the timer functionality is too useful :)

cpu_idle_loop vs halt/wfe/sevl instructions

Whenever a cpu is idle, it executes the cpu_idle_loop.
I am curious to know about the advantages of this loop compared to the halt [x86] or wfe/wfi instructions on ARM.
Are there any power consumption advantages?
wfe / wfi are just instructions that can put a core into a low-power mode, but they can't affect the clocks feeding the core, etc. If the core is still receiving power at this time, leakage will still be there, which matters a lot in battery-powered devices.
In a function like cpu_idle_loop, you have more control over the power going into the core, since you know what affects what; you can also flush caches and reduce the power they use, etc. You can also cut power to the core entirely, removing leakage or reducing it to the minimum possible. In a multicore system, the last core going idle can power down the platform / board into an even more power-saving state.
wfe / wfi are good for keeping a core from wasting power while it waits, which also means less heat to dissipate. They are a must-have for implementing mutexes / semaphores, but an SoC consists of many elements these days, and the kernel can inform the hardware when most of it is not needed, rather than just efficiently idling a single core.
On top of the power advantage pointed out by other users, I would like to point out another, less noticed advantage of using WFIs. Consider the case when our kernel is being run as a virtual machine on top of another host operating system. The host OS would have configured WFI instructions to trap. When a WFI instruction is executed by a guest OS, control is immediately transferred (trapped) to the host OS. This allows the host to efficiently schedule other OSes in its ready queue. If the guest OS were using a busy idle loop (instead of WFI), the time slice allotted to the guest OS would have to expire before the host OS could schedule in another guest OS, which leads to wasted CPU cycles.

If a CPU is always executing instructions how do we measure its work?

Let us say we have a fictitious single core CPU with Program Counter and basic instruction set such as Load, Store, Compare, Branch, Add, Mul and some ROM and RAM. Upon switching on it executes a program from ROM.
Would it be fair to say the work the CPU does is based on the type of instruction it's executing? For example, a MUL operation would likely involve more transistors firing than, say, a Branch.
However from an outside perspective if the clock speed remains constant then surely the CPU could be said to be running at 100% constantly.
How exactly do we establish a paradigm for measuring the work of the CPU? Is there some kind of standard metric perhaps based on the type of instructions executing, the power consumption of the CPU, number of clock cycles to complete or even whether it's accessing RAM or ROM.
A related second question is what it means for the program to "stop". Does it usually just branch in an infinite loop, or does the PC halt while the CPU waits for an interrupt?
First of all, that a CPU is always executing some code is just an approximation these days. Computer systems have so-called sleep states which allow for energy saving when there is not too much work to do. Modern CPUs can also throttle their speed in order to improve battery life.
Apart from that, there is a difference between the CPU executing "some work" and "useful work". The CPU by itself can't tell, but the operating system usually can. Except for some embedded software, a CPU will never be running a single job, but rather an operating system with different processes within it. If there is no useful process to run, the operating system will schedule the "idle task", which mostly means putting the CPU to sleep for some time (see above) or just burning CPU cycles in a loop that does nothing useful. Calculating the ratio of time spent in the idle task to time spent in regular tasks gives the CPU's busyness factor.
So in the old days of DOS, when the computer was running (almost) only a single task, it was true that it was always doing something. Many applications used so-called busy-waiting if they just had to delay their execution for some time, doing nothing useful. But today there will almost always be a smart OS in place which can run the idle process that can put the CPU to sleep, throttle down its speed, etc.
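The difference between busy-waiting and letting the OS put the process to sleep can be observed from user space. The sketch below compares the CPU time charged to the process for a 0.2 s busy-wait with a 0.2 s time.sleep(); busy_wait and cpu_time_used are helper names invented here, and the exact numbers will vary by machine and OS timer resolution.

```python
import time


def busy_wait(seconds):
    """Delay by spinning: the process stays runnable and burns CPU."""
    end = time.perf_counter() + seconds
    while time.perf_counter() < end:
        pass


def cpu_time_used(fn, *args):
    """CPU time (user + system) charged to this process while fn runs."""
    start = time.process_time()
    fn(*args)
    return time.process_time() - start


busy = cpu_time_used(busy_wait, 0.2)
idle = cpu_time_used(time.sleep, 0.2)
print(f"busy-wait: {busy:.3f} s CPU, sleep: {idle:.3f} s CPU")
```

Both calls take about the same wall-clock time, but only the busy-wait shows up as CPU usage, because during sleep the OS is free to run something else or idle the core.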
Oh boy, this is a toughie. It’s a very practical question as it is a measure of performance and efficiency, and also a very subjective question as it judges what instructions are more or less “useful” toward accomplishing the purpose of an application. The purpose of an application could be just about anything, such as finding the solution to a complex matrix equation or rendering an image on a display.
In addition, modern processors do things like clock gating in power idle states. The oscillator is still producing cycles, but no instructions execute due to certain circuitry being idled due to cycles not reaching them. These are cycles that are not doing anything useful and need to be ignored.
Similarly, modern processors can execute multiple instructions simultaneously, execute them out of order, and predict and speculatively execute instructions before your program (i.e. the IP or Instruction Pointer) actually reaches them. You don’t want to include instructions whose execution never actually completes, for example because the processor guesses wrong and has to flush those instructions after a branch mispredict. So a better metric is counting those instructions that actually complete. Instructions that complete are termed “retired”.
So we should only count those instructions that complete (i.e. retire), and cycles that are actually used to execute instructions (i.e. unhalted).
Perhaps the most practical general metric for “work” is CPI or cycles-per-instruction: CPI = CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY. CPU_CLK_UNHALTED.CORE are cycles used to execute actual instructions (vs those “wasted” in an idle state). INST_RETIRED are those instructions that complete (vs those that don’t due to something like a branch mispredict).
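As a small worked example of that ratio, the sketch below computes CPI (and its inverse, IPC) from two counter readings. The counter values are made-up sample numbers, not measurements, and cpi is a helper name chosen for this illustration; reading the real counters requires a profiler or a performance-counter interface.

```python
def cpi(unhalted_cycles: int, instructions_retired: int) -> float:
    """Cycles per instruction = unhalted core cycles / retired instructions."""
    if instructions_retired <= 0:
        raise ValueError("need at least one retired instruction")
    return unhalted_cycles / instructions_retired


# Hypothetical readings taken over the same interval.
sample = {"CPU_CLK_UNHALTED.CORE": 4_000_000, "INST_RETIRED.ANY": 2_500_000}

value = cpi(sample["CPU_CLK_UNHALTED.CORE"], sample["INST_RETIRED.ANY"])
print(f"CPI = {value:.2f}, IPC = {1 / value:.3f}")
```

A lower CPI (higher IPC) means more retired instructions per unhalted cycle.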
Trying to get a more specific metric, such as counting the instructions that contribute to the solution of a matrix multiply and excluding instructions that don’t directly contribute to computing the solution, such as control instructions, is very subjective, and it is difficult to gather statistics on. (There are some that you can gather, such as VECTOR_INTENSITY = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED, which is the average number of active vector elements per executed vector (SIMD) instruction. Such instructions, like SSE or AVX, are more likely to contribute directly to the solution of a mathematical problem, as that is their primary purpose.)
Now that I’ve talked your ear off, check out some of the optimization resources at your local friendly Intel developer resource, software.intel.com. In particular, check out how to use VTune effectively. I’m not suggesting you need to get VTune, though you can get a free or very discounted student license (I think). But the material will tell you a lot about increasing your program’s performance (i.e. optimizing), which is, if you think about it, increasing the useful work your program accomplishes.
Expanding on Michał's answer a bit:
Programs written for modern multi-tasking OSes are more like a collection of event handlers: they effectively set up listeners for I/O and then yield control back to the OS. The OS wakes them up each time there is something to process (e.g. user action, data from a device) and they "go to sleep" by calling into the OS once they've finished processing. Most OSes will also preempt a process that hogs the CPU for too long and starves the others.
The OS can then keep tabs on how long each process is actually running (by remembering the start and end time of each run) and generate statistics like CPU time and load (ready process queue length).
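The bookkeeping described here can be sketched as follows: the scheduler records a (pid, start, end) triple each time a process holds the CPU, and per-process CPU time is the sum of those intervals. The interval data and the name cpu_time_per_process are invented for illustration; real kernels keep these counters incrementally instead of storing every interval.

```python
from collections import defaultdict


def cpu_time_per_process(run_intervals):
    """Sum (end - start) over every interval in which each pid held the CPU."""
    totals = defaultdict(int)
    for pid, start, end in run_intervals:
        totals[pid] += end - start
    return dict(totals)


# Hypothetical schedule over a 100 ms window, times in milliseconds.
intervals = [(1, 0, 30), (2, 30, 40), (1, 40, 70), (2, 70, 75)]
WINDOW_MS = 100

totals = cpu_time_per_process(intervals)
idle_ms = WINDOW_MS - sum(totals.values())
for pid, ms in totals.items():
    print(f"pid {pid}: {ms} ms CPU ({100 * ms / WINDOW_MS:.0f}% of the window)")
print(f"idle: {idle_ms} ms")
```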
And to answer your second question:
To stop mostly means that a process is no longer scheduled and all associated resources (scheduling data structures, file handles, memory space, ...) are destroyed. This usually requires the process to make a special OS call (syscall/interrupt) so the OS can release the resources gracefully.
If, however, a process runs into an infinite loop and stops responding to OS events, then it can only be forcibly stopped (by simply not running it anymore).

What will the workload (usage) of a CPU core be if there are persistent cache misses? Will it be 100%?

That is, suppose the core spends most of its time waiting for data from RAM or the L3 cache because of cache misses, but the system is real-time (the thread has real-time priority), and the thread is pinned (affinity) to the core and runs without any thread/context switches. What load (usage) should the CPU core show on modern x86_64?
In other words, does the displayed CPU usage decrease only when the core enters idle?
And, if anyone knows, is the behavior in this case different on other processors: ARM, Power[PC], SPARC?
Clarification: I mean the CPU usage shown in the standard Task Manager on Windows.
A hardware thread (logical core) that's stalled on a cache miss can't be doing anything else, so it still counts as busy for the purposes of task-managers / CPU time accounting / OS process scheduler time-slices / stuff like that.
This is true across all architectures.
Without hyperthreading, "hardware thread" / "logical core" are the same as a "physical core".
Morphcore / other on-the-fly changing between hyperthreading and a more powerful single core could make there be a difference between a thread that keeps many execution units busy, vs. a thread that is blocked on cache misses a lot of the time.
I don't get the link between the OS CPU usage statistics and the optimal use of the pipeline. I think they are uncorrelated as the OS doesn't measure the pipeline load.
I'm writing this in the hope that Peter Cordes can help me understand it better and as a continuation of the comments.
User programs relinquish control to the OS very often: when they need input from the user or when they are done with a signal/message. GUI programs are basically just big loops, and at each iteration control is given to the OS until the next message.
When the OS has control, it schedules other threads/tasks, and if no other actions are needed it just enters the idle process (a long time ago a tight loop, now a sleep state) until the next interrupt. This is the idle time.
Time spent in an ISR processing user input is considered idle time by any OS.
And a cache miss there would still be considered idle time.
A heavy program takes more time to complete the work for a given message, thereby returning control to the OS, say, 2 times a second instead of 20.
If the OS measures that in the last second it had control for only 20 ms, then the CPU usage is (1000-20)/1000 = 98%.
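The same calculation as a tiny function, using the numbers from the example above. cpu_usage_percent is a name made up for this sketch; it only restates the formula and does not show how any particular OS gathers the idle time.

```python
def cpu_usage_percent(window_ms: float, idle_ms: float) -> float:
    """Share of the window during which the OS was NOT sitting in idle."""
    if window_ms <= 0 or not 0 <= idle_ms <= window_ms:
        raise ValueError("need 0 <= idle_ms <= window_ms and window_ms > 0")
    return 100 * (window_ms - idle_ms) / window_ms


print(cpu_usage_percent(1000, 20))  # 98.0
```

Nothing in this formula knows whether the busy 980 ms were spent retiring instructions or stalled on cache misses.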
This has nothing to do with the optimal use of the CPU architecture; as said, stalls can occur in the OS code and still be part of the idle time statistic.
The CPU utilization at pipeline level is not what is measured, and it is orthogonal to the OS statistics.
CPU usage is meant to be used by sysadmins: it is a measure of the load you put on a system, not a measure of how efficiently the assembly of a program was generated.
Sysadmins can't help with that, but measuring how often the OS got control back (without preempting) is a measure of how much load a program is putting on the system.
And sysadmins can definitely terminate heavy programs.

Why don't spinlocks work in uniprocessor (unicore) systems?

I know that spinlocks work by spinning, that different kernel paths exist, and that kernels are preemptive, so why don't spinlocks work in uniprocessor systems (for example, in Linux)?
If I understand your question, you're asking why spin locks are a bad idea on single core machines.
They should still work, but can be much more expensive than true thread-sleeping concurrency:
When you use a spinlock, you're essentially asserting that you don't think you will have to wait long. You are saying that you think it's better to maintain the processor time slice with a busy loop than the cost of sleeping your thread and context-shifting to another thread or process. If you have to wait a very short amount of time, you can sleep and be reawakened almost immediately, but the cost of going down and up is more expensive than just waiting around.
This is more likely to be OK on multi-core processors, since they have much better concurrency profiles than single core processors. On multi core processors, between loop iterations, some other thread may have taken care of your prerequisite. On single core processors, it's not possible that someone else could have helped you out - you've locked up the one and only core.
The problem here is that if you wait or sleep on a lock, you hint to the system that you don't have everything you need yet, so it should go do some other stuff and come back to you later. With a spin lock, you never tell the system this, so you lock it up waiting for something else to happen - but, meanwhile, you're holding up the whole system, so something else can't happen.
The nature of a spinlock is that it does not deschedule the process - instead it spins until the process acquires the lock.
On a uniprocessor, it will either immediately acquire the lock or it will spin forever - if the lock is contended, then there will never be an opportunity for the process which currently holds the resource to give it up. Spinlocks are only useful when another process can execute while one is spinning on the lock - which means multiprocessor systems.
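A toy tick-by-tick simulation of that argument, for one core with one lock holder and one spinner. It is a deliberately simplified model (the holder is assumed to finish its critical section once it gets the CPU), and spin_on_single_core and its parameters are invented for this sketch.

```python
def spin_on_single_core(preemptive, slice_ticks=3, critical_ticks=2, max_ticks=50):
    """Return the tick at which the spinner gets the lock, or None if it never does.

    The holder still needs `critical_ticks` ticks of CPU before it releases
    the lock, but the spinner is the one currently running on the only core.
    """
    holder_remaining = critical_ticks
    running, used = "spinner", 0
    for tick in range(max_ticks):
        if running == "spinner":
            if holder_remaining == 0:
                return tick                # lock is free: acquired
            used += 1                      # one more wasted tick of spinning
            if preemptive and used == slice_ticks:
                running, used = "holder", 0  # timer tick preempts the spinner
        else:
            holder_remaining -= 1          # holder makes progress
            if holder_remaining == 0:
                running = "spinner"        # lock released, spinner runs again
    return None


print("non-preemptive:", spin_on_single_core(preemptive=False))
print("preemptive:    ", spin_on_single_core(preemptive=True))
```

Without preemption the spinner never gets the lock within the simulated window; with preemption it does, but only after wasting a full time slice spinning.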
There are different versions of spinlock:
spin_lock_irqsave(&xxx_lock, flags);
... critical section here ..
spin_unlock_irqrestore(&xxx_lock, flags);
On a uniprocessor, spin_lock_irqsave() should be used when data needs to be shared between process context and interrupt context, as in this case IRQs also get disabled. spin_lock_irqsave() works under all circumstances, but partly because it is safe, it is also fairly slow.
However, in case data needs to be protected across different CPUs, it is better to use the versions below; these are cheaper, as IRQs don't get disabled in this case:
spin_lock(&lock);
...
spin_unlock(&lock);
In uniprocessor systems calling spin_lock_irqsave(&xxx_lock, flags); has the same effect as disabling interrupts which will provide the needed interrupt concurrency protection without unneeded SMP protection. However, in multiprocessor systems this covers both interrupt and SMP concurrency issues.
Spinlocks are, by their nature, intended for use on multiprocessor systems, although a uniprocessor workstation running a preemptive kernel behaves like SMP, as far as concurrency is concerned. If a nonpreemptive uniprocessor system ever went into a spin on a lock, it would spin forever; no other thread would ever be able to obtain the CPU to release the lock. For this reason, spinlock operations on uniprocessor systems without preemption enabled are optimized to do nothing, with the exception of the ones that change the IRQ masking status. Because of preemption, even if you never expect your code to run on an SMP system, you still need to implement proper locking.
Ref: Linux Device Drivers
by Jonathan Corbet, Alessandro Rubini, Greg Kroah-Hartman
I found the following two paragraphs in Operating Systems: Three Easy Pieces that might be helpful:
For spin locks, in the single CPU case, performance overheads can be quite painful; imagine the case where the thread holding the lock is pre-empted within a critical section. The scheduler might then run every other thread (imagine there are N − 1 others), each of which tries to acquire the lock. In this case, each of those threads will spin for the duration of a time slice before giving up the CPU, a waste of CPU cycles.
However, on multiple CPUs, spin locks work reasonably well (if the number of threads roughly equals the number of CPUs). The thinking goes as follows: imagine Thread A on CPU 1 and Thread B on CPU 2, both contending for a lock. If Thread A (CPU 1) grabs the lock, and then Thread B tries to, B will spin (on CPU 2). However, presumably the critical section is short, and thus soon the lock becomes available, and is acquired by Thread B. Spinning to wait for a lock held on another processor doesn’t waste many cycles in this case, and thus can be effective.
