If the processor is always spinning, why does it drain more resources sometimes and not others? - sleep

I've been curious about this for a long time. How do processors "sleep"? As far as I can tell, a sleeping processor just spins until a set time, and that's fine. But there's a problem with this: if the processor is always running at full speed, why does the machine get noisy and heat up when the spinning comes from my program, yet stay quiet and draw little power during the OS's idle spinning?
Does this have more to do with shared-variable access? Even a spin lock requires checking a memory location holding a shared variable. So would it take external device I/O to actually heat things up, or is there a "different sleep"?
I've always wondered how to put my program into a sleep() and know it won't drain the battery (admittedly, this was before I learned about OS scheduling/time slicing).
In short, how does this work, and how do I ensure that my Sleep() calls, or my spin locks, use the low-power kind of waiting?
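To make the contrast concrete, here is a minimal C sketch for Windows (since the question mentions Sleep()). The busy-wait version keeps the core executing instructions the whole time, so it stays hot and loud; the Sleep() version asks the scheduler to take the thread off the CPU, and if nothing else is runnable the OS idle loop halts the core so it can drop into a low-power state. The 5-second durations are arbitrary.

    #include <windows.h>
    #include <stdio.h>

    /* Busy-wait: the core keeps executing instructions, so it stays in its
       active power state, draws full power, and heats up. */
    static void busy_wait_ms(DWORD ms)
    {
        ULONGLONG start = GetTickCount64();
        while (GetTickCount64() - start < ms) {
            /* spinning: 100% CPU doing nothing useful */
        }
    }

    /* Blocking wait: Sleep() asks the scheduler not to run this thread until
       the timeout expires. If no other thread is runnable, the OS idle loop
       halts the core and it drops into a low-power state. */
    static void blocking_wait_ms(DWORD ms)
    {
        Sleep(ms);
    }

    int main(void)
    {
        printf("busy-waiting 5 s (watch CPU usage and the fan)...\n");
        busy_wait_ms(5000);

        printf("sleeping 5 s (CPU usage stays near zero)...\n");
        blocking_wait_ms(5000);
        return 0;
    }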

Related

Do you need a realtime operating system in order to ensure your program is never taken off the CPU?

If I were to write a program and wanted a guarantee that, once it is running, it never gets kicked off the CPU until it terminates, would I need an RTOS, or is there a way to get such a guarantee on a regular Linux OS?
Example:
Let's say we are running a headless Linux machine and running a program as user or root (e.g. reading SPI data from a sensor, or listening for HTTP requests), and there is reason to believe there is almost no other interaction with the machine aside from the single standalone script running.
If I want to ensure that my running process never gets taken off the CPU, even for a moment, so that I never miss valuable sensor data or incoming HTTP requests, does that warrant a real-time operating system?
Are the process priorities of programs run by the user or root enough to keep them from being kicked off?
Is a real-time OS needed to guarantee that our program never witnesses a moment when it is kicked off the CPU?
I know that real-time OSes are needed for guarantees on hard limits and hard deadlines. I also know that on a regular operating system it is up to the OS to decide priority and scheduling.
If this is on the wrong stack, let me know.
Do you need to act on the sensor readings within a fixed time frame? How complicated does that action need to be? If all you need is to never miss a reading and you're OK with buffering them, just add a microcontroller or an FPGA between your non-realtime device and the sensor.
Also, you can meet some soft real-time constraints even with an unpatched Linux. You can pin a process to a CPU and avoid using any syscalls in it: spin and poll instead, at 100% CPU utilisation, and then it's likely the kernel will never touch it. Make sure the process binary and all the dynamic libraries (if any) are on a RAM disk (to avoid paging) and disable swap.
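A rough C sketch of that recipe on Linux: pin the process to one core, lock its pages into RAM, and then busy-poll without making syscalls. The SCHED_FIFO priority boost is an extra step not mentioned above (it needs root or CAP_SYS_NICE), and the sensor_ready flag is just a placeholder for whatever memory-mapped interface the real sensor exposes; treat this as an illustration of the shape, not a drop-in program.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        /* Pin this process to CPU 2 so the scheduler never migrates it. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0)
            perror("sched_setaffinity");

        /* Real-time FIFO priority: ordinary tasks won't preempt us. */
        struct sched_param sp = { .sched_priority = 80 };
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
            perror("sched_setscheduler");

        /* Lock all current and future pages into RAM to avoid page faults. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            perror("mlockall");

        /* Placeholder "data ready" flag; in a real program this would be a
           memory-mapped device register or shared-memory location. */
        volatile int sensor_ready = 0;

        for (;;) {
            if (sensor_ready) {
                /* read and buffer the sample here */
                sensor_ready = 0;
            }
            /* busy-poll at 100% CPU: no syscalls, so the kernel has little
               reason to touch this core */
        }
    }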

How to prevent Windows GPU "Timeout Detection and Recovery"?

If I run a long-running kernel on a GPU device, after 2 seconds (by default) the Windows TDR (Timeout Detection and Recovery) will kill the running kernels. I understand why, but what if you can't predict how long the kernel will run, because you need to do lots of computation and you don't know the capacity/speed of the GPU of the actual user who runs your program?
What are the best practices for solving this problem?
I found 3 ways to prevent it from happening, but none of them seems like a good solution to me:
You need to make sure that your kernels are not too time-consuming:
The kernel is time-consuming, and although I could fragment the work and run not 1 million work-items at once but 2*500k or 4*250k, I still can't predict whether that will fit into the default 2 seconds on the actual user's GPU. (I had the idea of halving the count until the kernel no longer triggers a CL_INVALID_COMMAND_QUEUE error, and then calling it multiple times with the smaller amount, but to be honest it sounds really hacky and has other drawbacks.)
You can turn off the watchdog timer (or increase the delay): Timeout Detection and Recovery of GPUs:
It's done by a registry edit, and you need to restart Windows to make it effective. You can't do that on a user's machine.
You can run the kernel on a GPU that is not hooked up to a display:
How can you make sure the GPU is not hooked up to a display on a user's machine? Even on my laptop the primary GPU is the Intel HD4000 and the NVIDIA GPU is not driving the display (I think), but TDR still kills my kernels.
You listed all of the solutions I know of. Since solution 2 leaves the machine in an unusable state while your kernel runs (not good practice), it should be avoided. Since adding another GPU (solution 3) is not practical for you, your best bet is to focus on solution 1. I don't know why you are trying to maximize the work size so it runs as long as possible without hitting TDR. You should instead target around 10 ms or less per kernel (if you run many kernels that take longer, the GUI gets very sluggish). So instead of 4*250000, think more like 400*2500. You may need to put in some clFinish calls between each one (or each batch of 10, or whatever). Keeping the execution time small (~10 ms) and not overfilling the queue will let the GPU do other things in between kernels; you won't get TDR resets or make the machine unusable, and yet the GPU will stay quite busy.
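A host-side sketch of that chunking idea in C, assuming an OpenCL command queue and kernel have already been created and the kernel indexes its data with get_global_id(0) (which already includes the global work offset). The chunk and batch sizes are placeholders to tune; error checking is omitted for brevity.

    #include <CL/cl.h>

    #define TOTAL_ITEMS 1000000u
    #define CHUNK       2500u    /* tune so one launch runs ~10 ms on slow GPUs */
    #define BATCH       10u      /* drain the queue every few launches */

    static void run_in_chunks(cl_command_queue queue, cl_kernel kernel)
    {
        unsigned launched = 0;

        for (size_t offset = 0; offset < TOTAL_ITEMS; offset += CHUNK) {
            size_t global = CHUNK;
            if (offset + global > TOTAL_ITEMS)
                global = TOTAL_ITEMS - offset;   /* last, possibly short, chunk */

            /* The global work offset lets the same kernel cover a different
               slice of the problem on each launch. */
            clEnqueueNDRangeKernel(queue, kernel, 1,
                                   &offset, &global, NULL, 0, NULL, NULL);

            /* Don't overfill the queue: wait for each small batch so the GPU
               can service the display (and the TDR timer) in between. */
            if (++launched % BATCH == 0)
                clFinish(queue);
        }
        clFinish(queue);
    }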

Control Block Processes

Whenever a process is moved into the waiting state, I understand that the CPU moves on to another process. But while a process is in the waiting state, if it still needs to make a request to another I/O resource, doesn't that work require processing? I'm assuming there is a small part of the processor dedicated to handling the I/O request and moving data back and forth?
I hope this question makes sense lol.
I/O operations are actually tasks for peripheral devices to do some work. Usually you set up the task by writing data to special areas of memory that belong to the device. The device monitors changes in that small area and starts executing the task. So the CPU does not need to do anything while the operation is in progress and can switch to another program. When the I/O completes, usually an interrupt is triggered. This is a special hardware mechanism which pauses the currently executing program at an arbitrary point and switches to a special subprogram, which decides what to do next. There are other designs too; for example, the device may set a special flag somewhere in its memory region, and the OS must check it from time to time.
The problem is that these I/O operations are usually quite small, such as sending 1 byte over a COM port, so the CPU would have to be interrupted too often; you can't achieve high speed that way. This is where DMA comes in handy. It is a special coprocessor (or part of the peripheral device) which has direct access to RAM and can feed big blocks of memory into devices, so it can move megabytes of data without interrupting the CPU.
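A toy C sketch of the "write a command, poll a status flag" design described above, with made-up register addresses for a hypothetical device (on real hardware the addresses come from the datasheet or bus enumeration, and a driver would usually enable an interrupt or DMA instead of polling):

    #include <stdint.h>

    /* Hypothetical memory-mapped device registers. */
    #define DEV_BASE     0x40001000u
    #define REG_COMMAND  (*(volatile uint32_t *)(DEV_BASE + 0x0))
    #define REG_STATUS   (*(volatile uint32_t *)(DEV_BASE + 0x4))
    #define REG_DATA     (*(volatile uint32_t *)(DEV_BASE + 0x8))
    #define STATUS_DONE  0x1u

    static uint32_t read_one_word(void)
    {
        REG_COMMAND = 0x1;                       /* ask the device to start */
        while ((REG_STATUS & STATUS_DONE) == 0)
            ;                                    /* polling: the CPU is busy here */
        return REG_DATA;                         /* result the device produced */
    }

    /* With an interrupt the device signals completion itself, and with DMA it
       copies whole buffers into RAM, so the CPU never has to sit in this loop. */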

Force Intel Core i7 CPU to sleep momentarily?

I would like to get my Core i7 CPU to enter a sleep state just momentarily, for one millisecond or so, from a batch file or executable.
I know sleep can be induced with SetSuspendState, but I'm looking for a solution that does not put the entire system to sleep, but just the CPU momentarily.
CPU is Core i7 3632QM, and OS is Windows 7 and 10.
Thanks
Based on your comment about defeating some kind of shutdown every 30 mins, it sounds like you need the whole CPU (all cores) to sleep. We need much more detail on that to do more than guess about which sleep states will serve your purpose and which won't.
Based on comments, it's likely that ACPI S3 sleep will be needed. Ross's comment about the hardware supporting an S1 sleep didn't mention an S2 (CPU actually powered down), so it's probably not even possible to power down just the CPU.
So your best bet is to look into programmatically doing a sleep/wake cycle, which is possible on at least some hardware. On Linux, the rtcwake command has an option to do that. I assume it programs a wakeup time into the BIOS's NVRAM before initiating a sleep. (I think there are only a few commonly-used formats/locations for storing this, so there's a good chance it's possible on your computer.)
Try a Google search for "wake up laptop at a certain time" or something similar to find the Windows equivalent of rtcwake. I didn't look at any of the hits, but they look promising.
I'm not an expert at this system power-state management stuff, but you probably need the system to enter an ACPI sleep state. S3 is the usual "suspend to RAM"; OSes that support suspend usually use this as their non-hibernate option.
For your use, maybe S1 or S2 will do; anything less than this, like CPU power-saving C-states, probably won't be sufficient, especially not states that are just per-core.
ACPI global sleep states (from Wikipedia). Systems are not required to implement all levels.
S1, Power on Suspend (POS): Processor caches are flushed, and the CPU(s) stops executing instructions. The power to the CPU(s) and RAM is maintained. Devices that do not indicate they must remain on may be powered off.
S2: CPU powered off. Dirty cache is flushed to RAM.
S3, commonly referred to as Standby, Sleep, or Suspend to RAM (STR): RAM remains powered, but hard drives and everything else power down.
S4: hibernate (suspend to disk).
I'm not going to try to write the Windows API function calls to do this. I wouldn't be surprised if there's a program for requesting that Windows enter the S1 or S2 state (ideally with some kind of triggered wakeup).
@RossRidge says that the HM70 chipset does implement S1 sleep (and implies that it doesn't support an S2 sleep). Since S1 doesn't power down the CPU, it may not reset the timer. Even a hypothetical S2 sleep might not do the trick, because the timer may be external to the CPU and/or managed by the BIOS.
Software exists to program the BIOS to wake at a certain time. That's one possible way to trigger coming out of suspend. So it might be possible to write a script that programs a wakeup time for 2 seconds in the future, then initiates a sleep.
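A rough Windows sketch of that idea, using the documented SetWaitableTimer (with its resume flag) and SetSuspendState calls; whether the wake timer actually fires from S3 depends on the platform, and the process needs the shutdown privilege. Link against PowrProf.lib.

    #include <windows.h>
    #include <powrprof.h>

    int main(void)
    {
        /* Relative due time: negative, in 100-nanosecond units (2 seconds). */
        LARGE_INTEGER due;
        due.QuadPart = -2LL * 10 * 1000 * 1000;

        HANDLE timer = CreateWaitableTimer(NULL, TRUE, NULL);
        if (!timer)
            return 1;

        /* fResume = TRUE asks the platform to wake the system from suspend
           when the timer fires (only works if wake timers are supported). */
        if (!SetWaitableTimer(timer, &due, 0, NULL, NULL, TRUE))
            return 1;

        /* Suspend to RAM (S3), not hibernate. Execution resumes after wakeup. */
        SetSuspendState(FALSE, FALSE, FALSE);

        WaitForSingleObject(timer, INFINITE);
        CloseHandle(timer);
        return 0;
    }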
@MargaretBloom comments that Chapter 14 of the Intel manuals enumerates all the power-management capabilities (see the x86 tag wiki for links), and also that a totally different workaround may be possible, using SMM.
Re: your followup question, which was downvoted into oblivion:
enter sleep state just momentarily, for one millisecond
1ms is about 3 million core clock cycles. That's not momentary for a computer, especially from an asm programming perspective.
You definitely don't want to write assembly by hand to enter these states. Instead, use your OS's existing ACPI interface. This is a big part of the reason that everyone downvoted the crap out of your followup question.
Other than short per-core sleeps from mwait, pause, and hlt insns, the OS needs to know what's going on. For more about pause, see this. There aren't specific instructions to enter deeper sleeps anyway; you program ACPI by writing to device registers in MMIO space.
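Of those three, only pause is usable from unprivileged code; it's just a hint inside a spin-wait loop that lets the core back off a little while waiting on a shared flag, as in this small C sketch using the SSE2 intrinsic:

    #include <immintrin.h>
    #include <stdatomic.h>

    /* Spin until another thread sets *flag, issuing PAUSE on each iteration.
       This is a tiny per-core power/pipeline hint, not a real sleep state. */
    void spin_wait(atomic_int *flag)
    {
        while (atomic_load_explicit(flag, memory_order_acquire) == 0)
            _mm_pause();
    }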
When all cores are halted at the same time, the whole CPU can opportunistically power down more stuff until the next timer or other interrupt wakes it up again (this is, or at least is related to, ACPI C-states, as I understand it). But this happens all the time during normal operation, because modern OSes run HLT on cores that are idle. The only interesting thing you could do here is get the CPU to sleep like this occasionally even while the system is running some CPU-intensive processes (e.g. some threads with non-idle priority that run hlt in a loop). Since HLT is a privileged instruction, this would require a kernel thread or a syscall. You probably can't actually raise the priority of the system idle process so that it steals time from other runnable processes.
This may be an oversimplification: I haven't looked at kernel idle tasks recently to see if they still just run HLT when they want the current core to sleep until the next interrupt. For a while (when CPU power management was in its infancy) idle loops used to run some other stuff to enter a low-power C-state, but HLT may do that itself now.

If a CPU is always executing instructions how do we measure its work?

Let us say we have a fictitious single-core CPU with a program counter and a basic instruction set such as Load, Store, Compare, Branch, Add, Mul, plus some ROM and RAM. Upon switching on, it executes a program from ROM.
Would it be fair to say the work the CPU does is based on the type of instruction it's executing? For example, a MUL operation would likely involve more transistors firing than, say, a Branch.
However, from an outside perspective, if the clock speed remains constant then surely the CPU could be said to be running at 100% constantly.
How exactly do we establish a paradigm for measuring the work of the CPU? Is there some kind of standard metric, perhaps based on the type of instructions executing, the power consumption of the CPU, the number of clock cycles to complete, or even whether it's accessing RAM or ROM?
A related second question is what it means for a program to "stop". Does it usually just branch in an infinite loop, or does the PC halt and the CPU wait for an interrupt?
First of all, that a CPU is always executing some code is just an approximation these days. Computer systems have so-called sleep states which allow for energy saving when there is not too much work to do. Modern CPUs can also throttle their speed in order to improve battery life.
Apart from that, there is a difference between the CPU executing "some work" and "useful work". The CPU by itself can't tell, but the operating system usually can. Except for some embedded software, a CPU will never be running a single job, but rather an operating system with different processes within it. If there is no useful process to run, the operating system will schedule the "idle task", which mostly means putting the CPU to sleep for some time (see above) or just burning CPU cycles in a loop that does nothing useful. Calculating the ratio of time spent in the idle task to time spent in regular tasks gives the CPU's utilization.
In the old days of DOS, when the computer was running (almost) only a single task, it was true that it was always doing something; many applications used so-called busy-waiting if they just had to delay their execution for some time, doing nothing useful. But today there will almost always be a smart OS in place which can run the idle process, put the CPU to sleep, throttle down its speed, and so on.
Oh boy, this is a toughie. It’s a very practical question as it is a measure of performance and efficiency, and also a very subjective question as it judges what instructions are more or less “useful” toward accomplishing the purpose of an application. The purpose of an application could be just about anything, such as finding the solution to a complex matrix equation or rendering an image on a display.
In addition, modern processors do things like clock gating in idle power states. The oscillator is still producing cycles, but no instructions execute because the clock is gated off from certain circuitry, so the cycles never reach it. Those cycles are not doing anything useful and need to be ignored.
Similarly, modern processors can execute multiple instructions simultaneously, execute them out of order, and predict and execute the instructions that will come next before your program (i.e. the IP or Instruction Pointer) actually reaches them. You don't want to include instructions whose execution never actually completes, for example because the processor guessed wrong and has to flush them after a branch mispredict. So a better metric is counting the instructions that actually complete. Instructions that complete are termed "retired".
So we should only count those instructions that complete (i.e. retire), and cycles that are actually used to execute instructions (i.e. unhalted).
Perhaps the most practical general metric for “work” is CPI or cycles-per-instruction: CPI = CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY. CPU_CLK_UNHALTED.CORE are cycles used to execute actual instructions (vs those “wasted” in an idle state). INST_RETIRED are those instructions that complete (vs those that don’t due to something like a branch mispredict).
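On Linux, the generic hardware events exposed by perf_event_open map onto those counters, so a sketch of measuring CPI around a piece of work looks roughly like this (the dummy loop stands in for whatever code you actually care about):

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Open one hardware counter for this process on any CPU. */
    static int open_counter(uint64_t config)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = config;
        attr.disabled = 1;
        attr.exclude_kernel = 1;        /* count user-space work only */
        attr.exclude_hv = 1;
        return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void)
    {
        int cyc = open_counter(PERF_COUNT_HW_CPU_CYCLES);    /* unhalted cycles */
        int ins = open_counter(PERF_COUNT_HW_INSTRUCTIONS);  /* retired instructions */
        if (cyc < 0 || ins < 0) { perror("perf_event_open"); return 1; }

        ioctl(cyc, PERF_EVENT_IOC_ENABLE, 0);
        ioctl(ins, PERF_EVENT_IOC_ENABLE, 0);

        volatile uint64_t sink = 0;                 /* the "work" being measured */
        for (uint64_t i = 0; i < 100000000ULL; i++)
            sink += i;

        ioctl(cyc, PERF_EVENT_IOC_DISABLE, 0);
        ioctl(ins, PERF_EVENT_IOC_DISABLE, 0);

        uint64_t cycles = 0, instructions = 0;
        read(cyc, &cycles, sizeof(cycles));
        read(ins, &instructions, sizeof(instructions));

        printf("cycles=%llu instructions=%llu CPI=%.2f\n",
               (unsigned long long)cycles, (unsigned long long)instructions,
               (double)cycles / (double)instructions);
        return 0;
    }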
Trying to get a more specific metric, such as the instructions that contribute to the solution of a matrix multiply, while excluding instructions that don't directly contribute to computing the solution, such as control instructions, is very subjective and difficult to gather statistics on. (There are some that you can gather, such as VECTOR_INTENSITY = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED, which is the number of SIMD vector elements, e.g. SSE or AVX lanes, processed per vector instruction executed. These instructions are more likely to directly contribute to the solution of a mathematical problem, as that is their primary purpose.)
Now that I've talked your ear off, check out some of the optimization resources at your friendly local Intel developer resource, software.intel.com. In particular, check out how to use VTune effectively. I'm not suggesting you need to get VTune, though you can get a free or heavily discounted student license (I think). But the material will tell you a lot about increasing your program's performance (i.e. optimizing), which is, if you think about it, increasing the useful work your program accomplishes.
Expanding on Michał's answer a bit:
Programs written for modern multi-tasking OSes are more like a collection of event handlers: they effectively set up listeners for I/O and then yield control back to the OS. The OS wakes them up each time there is something to process (e.g. a user action or data from a device), and they "go to sleep" by calling back into the OS once they've finished processing. Most OSes will also preempt a process that hogs the CPU for too long and starves the others.
The OS can then keep tabs on how long each process actually runs (by remembering the start and end time of each run) and generate statistics like CPU time and load (ready-process queue length).
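A small C sketch of that shape using poll(), with standard input standing in for a socket or device descriptor: the process blocks inside the kernel while it waits, using no CPU, and only runs when there is something to handle.

    #include <poll.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void)
    {
        struct pollfd fds[1];
        fds[0].fd = STDIN_FILENO;        /* stand-in for a socket or device fd */
        fds[0].events = POLLIN;

        for (;;) {
            /* The process sits in the waiting state here; the scheduler runs
               other work (or the idle task) until input arrives. */
            if (poll(fds, 1, -1) < 0) {
                perror("poll");
                return 1;
            }
            if (fds[0].revents & POLLIN) {
                char buf[256];
                ssize_t n = read(fds[0].fd, buf, sizeof(buf));
                if (n <= 0)
                    break;               /* EOF or error: stop */
                printf("handled %zd bytes, going back to waiting\n", n);
            }
        }
        return 0;
    }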
And to answer your second question:
To stop mostly means that a process is no longer scheduled and all associated resources (scheduling data structures, file handles, memory space, ...) are destroyed. This usually requires the process to make a special OS call (syscall/interrupt) so the OS can release the resources gracefully.
If, however, a process runs into an infinite loop and stops responding to OS events, it can only be forcibly stopped (by simply never running it again).

Resources