cpu_idle_loop vs halt/wfe/sevl instructions

Whenever a cpu is idle, it executes the cpu_idle_loop.
I am curious to know about the advantages of this loop compared to halt [x86] or the wfe/wfi instructions on ARM.
Are there any power-consumption advantages?

wfe / wfi are just instructions that put a core into a low-power mode; they can't affect the clocks feeding the core, etc. If the core is still powered during this time, leakage will still be there, which matters a lot in battery-powered devices.
In a function like cpu_idle_loop, you can control more of the power going into the core, since you know what affects what, and you can also flush caches and reduce the power they use. You can also cut power to the core entirely, removing leakage or reducing it to the minimum possible. In a multicore system, the last core going idle can power down the platform / board into an even more power-preserving state.
wfe / wfi is good for keeping a core from wasting power while waiting, which also means less heat to dissipate. It is a must-have for implementing mutexes / semaphores, but an SoC consists of many elements these days, and the kernel can inform the hardware when most of it is not needed, rather than just idling a single core efficiently.

On top of the power advantage pointed out by other users, I would like to point out another, less noticed advantage of using WFI. Consider the case where our kernel is run as a virtual machine on top of another host operating system. The host OS would have marked WFI instructions to trap. When a WFI instruction is executed by a guest OS, control is immediately transferred (trapped) to the host OS. This allows the host to efficiently schedule other OSes in its ready queue. If the guest OS were using a busy idle loop (instead of WFI), the time slice allotted to the guest OS would have to expire before the host OS could schedule in another guest OS, which leads to wasted CPU cycles.
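To put rough numbers on that argument, here is a toy model (not real hypervisor code). It assumes each guest does some useful work at the start of its turn and then goes idle; `run_round` and its parameters are hypothetical names made up for this sketch.

```python
def run_round(useful_ticks, timeslice, idle_traps_to_host):
    """Simulate one scheduling round over all guests.

    useful_ticks: useful work (in ticks) each guest has before it goes idle.
    timeslice: maximum ticks the host gives a guest per turn.
    idle_traps_to_host: True if the guest idles with a trapping instruction
        (WFI-style), False if it spins in a busy idle loop.
    Returns (elapsed_ticks, wasted_ticks).
    """
    elapsed = 0
    wasted = 0
    for work in useful_ticks:
        work = min(work, timeslice)
        if idle_traps_to_host:
            # The host regains control as soon as the guest goes idle.
            elapsed += work
        else:
            # The guest spins until its time slice expires.
            elapsed += timeslice
            wasted += timeslice - work
    return elapsed, wasted


if __name__ == "__main__":
    print(run_round([2, 10, 5], timeslice=10, idle_traps_to_host=True))
    print(run_round([2, 10, 5], timeslice=10, idle_traps_to_host=False))
```

In this model the only difference between the two runs is the ticks burned by guests that have nothing to do, which is exactly the waste described above.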

Related

Is there some sort of hardware support required for the implementation of the scheduler?

The state of the process at any given time consists of the processes in execution, right? So at the moment, say, there are 4 userspace programs running on the processors. Now after each time slice, I assume control has to pass over to the scheduler so that the appropriate process can be scheduled next. What initiates this transfer of control? To me it seems like there has to be some kind of special timer/register in hardware that keeps count of the time taken by the current process, since the process itself has no mechanism to keep track of the time for which it has executed. Is my intuition right?

First of all, this answer concerns the x86 architecture only.
There are different kinds of schedulers: preemptive and non-preemptive (cooperative).
Preemptive schedulers preempt the execution of a process, that is, initiate a context switch using a TSS (Task State Segment), which then performs a jump to another process. The process is stopped and another one is started.
Cooperative schedulers do not stop processes. They rely on the process giving up the CPU in favor of the scheduler, also called "yielding," similar to user-level threads without kernel support.
Preemption can be accomplished in two ways: as the result of some I/O-bound action or while the CPU is at play.
Imagine you sent some instructions to the FPU. It takes some time until it's finished. Instead of sitting around idly, you could do something else while the FPU performs its calculations! So, as the result of an I/O operation, the scheduler switches to another process, possibly resuming with the preempted process right after the FPU is done.
However, regular preemption, as required by many scheduling algorithms, can only be implemented with some interruption mechanism that fires at a certain frequency, independently of the process. A timer chip was deemed suitable, and with the IBM 5150 (a.k.a. IBM PC), released in 1981, an x86 system was delivered incorporating, inter alia, an Intel 8088, an Intel 8042 keyboard controller chip, the Intel 8259 PIC (Programmable Interrupt Controller), and the Intel 8253 PIT (Programmable Interval Timer).
The i8253 was connected, like a few other peripheral devices, to the i8259. A couple of times a second (18 Hz?) it issued an #INT signal to the PIC on IRQ 0, and after the acknowledgment and all, the CPU was interrupted and a handler was executed.
That very handler could contain scheduling code, which decides on the next process to execute [1].
Of course, we (most of us) are living in the 21st century by now, and nobody uses an IBM PC or one of its derivatives like the XT or AT anymore. The PIC has been replaced by the more sophisticated Intel 82093AA APIC, to handle multiple processors/cores and for general improvement, but the PIT has remained the same, I think, maybe in the shape of some integrated version like the Intel AIP.
Cooperative schedulers do not need a regular interrupt and therefore no special hardware support (except maybe for hardware-supported multitasking). The process yields the CPU deliberately and if it doesn't, you have a problem. The reason as to why few OSes actually use cooperative schedulers: it poses a big security hole.
[1] Note, however, that OSes on the 8086 (mostly DOS) didn't have a real scheduler. The x86 architecture only natively supported multitasking in the hardware with the advent of one of the 80386 versions (SX, DX, and whatever). I just wanted to stress that the IBM 5150 was the first x86 system with a timer chip (and, of course, the first PC altogether).

Systems running an OS with a preemptive scheduler (i.e. all those in common use) are, IME, all provided with a hardware timer interrupt that causes a driver to run and can change the set of running threads.
Such a timer interrupt is very useful for providing timeouts for system calls, sleep() functionality and other time-related functions. It can also help share out the available CPU amongst ready threads when the system is overloaded, or the thread/s run on it are CPU-intensive, and so the number of ready threads exceeds the number of cores available to run them.
It is quite possible to implement a preemptive scheduler without any hardware timer, allowing the set of running threads to be scheduled upon software interrupts (system calls) from threads that are already running, and upon all the other interrupts on I/O completion from the peripheral drivers for disk, NIC, KB, mouse etc. I've never seen it done though - the timer functionality is too useful :)
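The timer-driven case can be illustrated with a small simulation. This is only a sketch of round-robin time slicing, not how any particular kernel does it; `round_robin`, `burst_ticks` and `quantum` are names invented here, and the "timer interrupt" is simply the point where a slice ends.

```python
from collections import deque


def round_robin(burst_ticks, quantum):
    """Return the list of (process, ticks_run) slices in execution order.

    burst_ticks: dict mapping a process name to the CPU ticks it still needs.
    quantum: ticks between two simulated timer interrupts.
    """
    ready = deque(burst_ticks.items())
    timeline = []
    while ready:
        name, remaining = ready.popleft()
        ran = min(quantum, remaining)
        timeline.append((name, ran))
        if remaining > ran:
            # Simulated timer interrupt: the process is preempted and
            # goes to the back of the ready queue.
            ready.append((name, remaining - ran))
    return timeline


if __name__ == "__main__":
    for name, ran in round_robin({"A": 5, "B": 2, "C": 4}, quantum=2):
        print(name, ran)
```

Without the periodic interrupt, process A in this example would simply run all 5 of its ticks before anyone else got the CPU, unless it yielded voluntarily.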

Force Intel Core i7 CPU to sleep momentarily?

I would like to get my Core i7 CPU to enter sleep state just momentarily, for one millisecond or so from a batch file or executable.
I know sleep can be induced with SetSuspendState, but I'm looking for a solution that does not put the entire system to sleep, but just the CPU momentarily.
CPU is Core i7 3632QM, and OS is Windows 7 and 10.
Thanks

Based on your comment about defeating some kind of shutdown every 30 mins, it sounds like you need the whole CPU (all cores) to sleep. We need much more detail on that to do more than guess about which sleep states will serve your purpose and which won't.
Based on comments, it's likely that ACPI S3 sleep will be needed. Ross's comment about the hardware supporting an S1 sleep didn't mention an S2 (CPU actually powered down), so it's probably not even possible to power down just the CPU.
So your best bet is to look into programmatically doing a sleep/wake cycle, which is possible on at least some hardware. On Linux, the rtcwake command has an option to do that. I assume it programs a wakeup time into the BIOS's NVRAM before initiating a sleep. (I think there are only a few commonly-used formats/locations for storing this, so there's a good chance it's possible on your computer.)
Try a google search for wake up laptop at a certain time or something to find Windows equivalent of rtcwake. I didn't look at any of the hits, but they look promising.
I'm not an expert at this system power-state management stuff, but you probably need the system to enter an ACPI sleep state. S3 is the usual "suspend to RAM"; OSes that support suspend usually use this as their non-hibernate option.
For your use, maybe S1 or S2 will do (and anything less than this, like CPU power-saving C-states probably won't be sufficient, especially not states that are just per-core).
ACPI global sleep states (from Wikipedia). Systems are not required to implement all levels.
S1, Power on Suspend (POS): Processor caches are flushed, and the CPU(s) stops executing instructions. The power to the CPU(s) and RAM is maintained. Devices that do not indicate they must remain on may be powered off.
S2: CPU powered off. Dirty cache is flushed to RAM.
S3, commonly referred to as Standby, Sleep, or Suspend to RAM (STR): RAM remains powered. (But hard drives and everything else powers down)
S4, hibernate
I'm not going to try to write Windows API function calls to do this. I wouldn't be surprised if there's a program for requesting Windows to enter S1 or S2 state (ideally with some kind of triggered wakeup).
@RossRidge says that the HM70 chipset does implement S1 sleep (and implies that it doesn't support an S2 sleep). Since S1 doesn't power down the CPU, it may not reset the timer. Even a hypothetical S2 sleep might not do the trick, because the timer may be external to the CPU and/or managed by the BIOS.
Software exists to program the BIOS to wake at a certain time. That's one possible way to trigger coming out of suspend. So it might be possible to write a script that programs a wakeup time for 2 seconds in the future, then initiates a sleep.
@MargaretBloom comments that Chapter 14 of the Intel Manuals enumerates all the power-management capabilities. (See the x86 tag wiki for links.) Also, a totally different workaround may be possible by using SMM.
re: your followup question, which was downvoted into oblivion:
"enter sleep state just momentarily, for one millisecond"
1ms is about 3 million core clock cycles. That's not momentary for a computer, especially from an asm programming perspective.
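The arithmetic behind that figure, as a quick check. The 3 GHz clock is an assumption used for illustration (the answer does not state which frequency it had in mind), and `cycles_in` is a hypothetical helper.

```python
def cycles_in(duration_us, clock_hz):
    """Core clock cycles that elapse in duration_us microseconds at clock_hz."""
    return clock_hz * duration_us // 1_000_000


if __name__ == "__main__":
    # 1 ms = 1000 us at an assumed 3 GHz core clock.
    print(cycles_in(1000, 3_000_000_000))
```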
You definitely don't want to write assembly by hand to enter these states. Instead, use your OS's existing ACPI interface. This is a big part of the reason that everyone downvoted the crap out of your followup question.
Other than short per-core sleeps from mwait, pause, and hlt insns, the OS needs to know what's going on. For more about pause, see this. There aren't specific instructions to enter deeper sleeps anyway; you program ACPI by writing to device registers in MMIO space.
When all cores are HLTed at the same time, the whole CPU can opportunistically power down more stuff until the next timer or other interrupt wakes it up again (this is, or at least is related to, ACPI C-states, as I understand it). But this happens all the time during normal operation, because modern OSes run HLT on cores that are idle. The only interesting thing you could do here is get the CPU to sleep like this occasionally even if the system was running some CPU-intensive processes (e.g. some threads with non-idle priority that run hlt in a loop). Since HLT is a privileged instruction, this would require a kernel thread or a syscall. You probably can't actually raise the priority of the system idle process so it steals time from other runnable processes.
This may be an oversimplification: I haven't looked at kernel idle tasks recently to see if they still just run HLT when they want the current core to sleep until the next interrupt. For a while (when CPU power management was in its infancy) idle loops used to run some other stuff to enter a low-power C-state. But HLT may do that now.

Multicore thread processing order

I am having some real trouble finding this info online. I'm at uni on Monday so I could use the library then, but the sooner the better. When a system has multicore processors, does each processor take a thread from the first process in the ready queue, or does it take one from the first and one from the second? Also, just to check, threads will be sent to and fetched from the cores concurrently by the OS, right? If anyone could point me in the right direction resource-wise, that would be great!

The key thing is to appreciate what the machine's architecture actually is.
A "core" is a CPU with cache with a connection to the system memory. Most machine architectures are Symmetric Multi-Processing, meaning that the system memory is equally accessible by all cores in the system.
Most operating systems run a scheduler thread on each core (Linux does). The scheduler has a list of threads it is responsible for, and it will run them to the best of its ability on the core that it controls. The rules it uses to choose which thread to run will be either round robin, or priority based, etc; ie all the normal scheduling rules. So far it is just like a scheduler that you would find in a single core computer. To some extent each scheduler is independent from all the other schedulers.
However, this is an SMP environment, meaning that it really doesn't matter which core runs which thread. This is because all the cores can see all the memory, and all the code and data for all threads in the entire system is stored in that single memory.
So the schedulers talk amongst themselves to help each other out. Schedulers with too many threads to run can pass a thread over to a scheduler whose core is under utilised. They are load balancing within the machine. "Pass a thread over" means copying the data structure that describes the thread (thread id, which data, which code).
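A very rough sketch of that hand-over, under the simplifying assumption that "load" is just the length of each core's run queue. Real schedulers use much richer heuristics; `balance` and the thread names are made up for illustration.

```python
def balance(run_queues):
    """Move threads from the longest run queue to the shortest one until
    no two queues differ in length by more than one thread.

    run_queues: one list of thread descriptors per core.
    Returns new lists; the input is left untouched.
    """
    queues = [list(q) for q in run_queues]
    while True:
        busiest = max(queues, key=len)
        idlest = min(queues, key=len)
        if len(busiest) - len(idlest) <= 1:
            return queues
        # "Passing a thread over" is just moving its descriptor, since
        # every core can reach the same memory.
        idlest.append(busiest.pop())


if __name__ == "__main__":
    print(balance([["t1", "t2", "t3", "t4"], [], ["t5"]]))
```

The move is cheap here precisely because of the SMP assumption: only the descriptor changes hands, while code and data stay where they are in shared memory.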
So that's about it. As the only communication between cores is via memory it all relies on an effective mutual exclusion semaphore system being available, which is something the hardware has to allow for.
The Difficulty
So I've painted a very simple picture, but in practice the memory is not perfectly symmetrical. SMP these days is synthesised on top of HyperTransport and QPI.
Long gone are the days when cores really did have equal access to the system memory at the electronic level. At the very lowest layer of their architecture AMD are purely NUMA, and Intel nearly so.
Nowadays a core has to send a request to other cores over a high speed serial link (HyperTransport or QPI) asking them to send data that they've got in their attached memory. Intel and AMD have done a good job of making it look convincingly like SMP in the general case, but it's not perfect. Data in memory attached to a different core takes longer to get hold of. It's insanely complex - the cores are now nodes on a network - but it's what they've had to do to get improved performance.
So schedulers take that into account when choosing which core should run which thread. They will try to place a thread on a core that is closest to the memory holding the data that the thread has access to.
The Future, Again
If the world's software ecosystem could be weaned off SMP the hardware guys would be able to save a lot of space on the silicon, and we would have faster more efficient systems. This has been done before; Transputers were a good attempt at a strictly NUMA architecture.
NUMA and Communicating Sequential Processes would today make it far easier to write multi threaded software that scales very easily and runs more efficiently than today's SMP shared memory behemoths.
SMP was in effect a cheap and nasty way of bringing together multiple cores, and the cost in terms of software development difficulties and inefficient hardware has been very high.

What load (usage) will a CPU core show if there are persistent cache misses: will it be 100%?

That is, if the core spends most of its time waiting for data from RAM or L3 cache because of cache misses, but the system is real-time (real-time thread priority), and the thread is pinned (affinity) to the core and runs without thread/context switches, what load (usage) should the CPU core show on modern x86_64?
That is, is CPU usage displayed as lower only when the core is in idle?
And if anyone knows: is the behavior different in this case for other processors (ARM, Power[PC], Sparc)?
Clarification: I mean the CPU usage shown in the standard Task Manager in Windows.

A hardware thread (logical core) that's stalled on a cache miss can't be doing anything else, so it still counts as busy for the purposes of task-managers / CPU time accounting / OS process scheduler time-slices / stuff like that.
This is true across all architectures.
Without hyperthreading, "hardware thread" / "logical core" are the same as a "physical core".
Morphcore / other on-the-fly changing between hyperthreading and a more powerful single core could make there be a difference between a thread that keeps many execution units busy, vs. a thread that is blocked on cache misses a lot of the time.

I don't get the link between the OS CPU usage statistics and the optimal use of the pipeline. I think they are uncorrelated as the OS doesn't measure the pipeline load.
I'm writing this in the hope that Peter Cordes can help me understand it better and as a continuation of the comments.
User programs relinquish control to the OS very often: when they need input from the user or when they are done with a signal/message. GUI programs are basically just big loops, and at each iteration control is given to the OS until the next message.
When the OS has control it schedules other threads/tasks, and if no other actions are needed it just enters the idle process (a long time ago a tight loop, now a sleep state) until the next interrupt. This is the idle time.
Time spent in an ISR processing user input is considered idle time by any OS. A cache miss there would still be considered idle time.
A heavy program takes more time to complete the work for a given message, thereby returning control to the OS, say, 2 times a second instead of 20.
If the OS measures that in the last second it had control for only 20 ms, then the CPU usage is (1000-20)/1000 = 98%.
This has nothing to do with the optimal use of the CPU architecture: as said, stalls can occur in the OS code and still be part of the idle-time statistic. CPU utilization at the pipeline level is not what is measured, and it is orthogonal to the OS statistics.
CPU usage is meant to be used by sysadmins. It is a measure of the load you put on a system, not a measure of how efficiently the assembly of a program was generated. Sysadmins can't help with that, but measuring how often the OS got control back (without preempting) is a measure of how much load a program is putting on the system. And sysadmins can definitely terminate heavy programs.
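The usage figure in this answer is just the non-idle share of a measurement window. As a minimal sketch (the function name and the 1000 ms default window are assumptions taken from the example above, not an OS API):

```python
def cpu_usage_percent(idle_ms, window_ms=1000):
    """CPU usage as the share of the window that was not idle time."""
    return 100 * (window_ms - idle_ms) / window_ms


if __name__ == "__main__":
    # 20 ms of idle time in a 1000 ms window.
    print(cpu_usage_percent(20))
```

Nothing in this calculation looks at what the core did during the busy time, so a thread stalled on cache misses counts the same as one retiring instructions every cycle.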

Difference between core and processor

What is the difference between a core and a processor?
I've already looked for it on Google, but I only get definitions for multi-core and multi-processor, which is not what I am looking for.

A core is usually the basic computation unit of the CPU - it can run a single program context (or multiple ones if it supports hardware threads such as hyperthreading on Intel CPUs), maintaining the correct program state, registers, and correct execution order, and performing the operations through ALUs. For optimization purposes, a core can also hold on-core caches with copies of frequently used memory chunks.
A CPU may have one or more cores to perform tasks at a given time. These tasks are usually software processes and threads that the OS schedules. Note that the OS may have many threads to run, but the CPU can only run X such tasks at a given time, where X = number of cores * number of hardware threads per core. The rest have to wait for the OS to schedule them, whether by preempting currently running tasks or by any other means.
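The X formula is simple enough to check by hand. In the sketch below, `max_concurrent_tasks` is a hypothetical helper and the 4-core, 2-thread figures are an assumed example; `os.cpu_count()` is the standard-library call that reports how many logical CPUs the OS sees on the machine running the script.

```python
import os


def max_concurrent_tasks(cores, hw_threads_per_core):
    """X = number of cores * number of hardware threads per core."""
    return cores * hw_threads_per_core


if __name__ == "__main__":
    # Assumed example: 4 cores with 2 hardware threads each.
    print(max_concurrent_tasks(4, 2))
    # Logical CPUs visible to the OS on this machine (may be None).
    print(os.cpu_count())
```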
In addition to the one or many cores, the CPU will include some interconnect that connects the cores to the outside world, and usually also a large "last-level" shared cache. There are multiple other key elements required to make a CPU work, but their exact locations may differ according to design. You'll need a memory controller to talk to the memory, and I/O controllers (display, PCIe, USB, etc.). In the past these elements were outside the CPU, in the complementary "chipset", but most modern designs have integrated them into the CPU.
In addition, the CPU may have an integrated GPU, and pretty much everything else the designer wanted to keep close for performance, power and manufacturing considerations. CPU design is mostly trending toward what's called system on chip (SoC).
This is a "classic" design, used by most modern general-purpose devices (client PCs, servers, and also tablets and smartphones). You can find more elaborate designs, usually in academia, where the computation is not done in basic "core-like" units.

An image may say more than a thousand words:
* Figure describing the complexity of a modern multi-processor, multi-core system.
Source:
https://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization

Let's first clarify what a CPU is and what a core is. A central processing unit (CPU) can have multiple core units. Each of those cores is a processor in itself, capable of executing a program, but it is self-contained on the same chip.
In the past, one CPU was distributed among quite a few chips, but as Moore's Law progressed it became possible to fit a complete CPU inside one chip (die). Since the 90's, manufacturers started to fit more cores on the same die, and that is the concept of multi-core.
These days it is possible to have hundreds of cores on the same CPU (chip or die), as in GPUs or Intel Xeon. Another technique developed in the 90's was simultaneous multi-threading: basically, they found that it was possible to run another thread on the same single-core CPU, since most of the resources, like ALUs and multiple registers, were already duplicated.
So basically a CPU can have multiple cores, each of them capable of running one or more threads at the same time. We may expect to have more cores in the future, but it will become harder to program them efficiently.

CPU stands for central processing unit. Until about 2002 we had only single-core processors, i.e. a CPU could perform only a single task or program at a time.
To run multiple programs at the same time we had to use multiple processors, which required a different motherboard, and that was very expensive.
So Intel introduced the concept of hyper-threading, i.e. it makes a single CPU appear as two virtual CPUs, so we have two (virtual) cores for our tasks. The CPU is still single, but it pretends (masquerades) to be a dual CPU and performs multiple tasks. Having real multiple cores is better than that, though, so multi-core processors were developed, i.e. multiple processors in a single package: several CPUs combined on one big CPU. That is what multiple cores are.

In the early days...like before the 90s...the processors weren't able to multitask that efficiently...coz a single processor could handle just a single task...so when we used to say that my antivirus, Microsoft Word, VLC, etc. are all running at the same time...that wasn't actually true. When I said a processor could handle a single process at a time...I meant it. It would actually process a single task...then pause that task...take another task...complete it if it's a short one or again pause it and add it to the queue...then the next. But this 'pause' that I mentioned was so small (approx. 1 ns) that you didn't notice that the task had been paused. E.g. on VLC while listening to music there are other apps running simultaneously but as I told you...one program at a time...so VLC is actually pausing in between for ns so you don't notice it, but the music is actually stopping in between.
But this was about the old processors...
Nowadays processors, i.e. 3rd gen PCs, have multi-core processors. Now the 'cores' can be compared to 1st or 2nd gen processors themselves...embedded onto a single chip, a single processor. So now we understand what cores are, i.e. they are mini processors which combine to become a processor. And each core can handle a single process at a time, or multiple threads as designed for the OS. And they follow the same steps as I mentioned above about the single processor.
E.g. an i7 6th gen processor has 8 cores...i.e. 8 mini processors in 1 i7...i.e. its speed is 8x that of the old processors. And this is how multitasking can be done.
There could be hundreds of cores in a single processor
E.g. Intel i128.
I hope I explained this well.

I have read all the answers, but this link gave me a clearer explanation of the difference between CPU (processor) and core, so I'm leaving some notes from it here.
The main difference between CPU and Core is that the CPU is an electronic circuit inside the computer that carries out instruction to perform arithmetic, logical, control and input/output operations while the core is an execution unit inside the CPU that receives and executes instructions.

Intel's picture is helpful, as shown by Tortuga's best answer. Here's a caption for it.
Processor: One semiconductor chip, the CPU (central processing unit) seated in one socket, circa 1950s-2010s. Over time, more functions have been packed onto the CPU chip. Prior to the 1950s releases of single-chip processors, one processor might have spread across multiple chips. In the mid 2010s the system-on-a-chip chips made it slightly more sketchy to equate one processor to one chip, though that's generally what people mean by processor, as in "this computer has an i7 processor" or "this computer system has four processors."
Core: One block of a CPU, executing one instruction at a time. (You'll see people say one instruction per clock cycle, but some CPUs use multiple clock cycles for some instructions.)
