How do you limit a process' CPU usage on Windows? (need code, not an app) - windows

There are programs that can limit the CPU usage of processes on Windows, for example BES and ThreadMaster. I need to write my own program that does the same thing as these programs, but with different configuration capabilities. Does anybody know how the CPU throttling of a process is done (in code)? I'm not talking about setting the priority of a process, but rather about limiting its CPU usage to, for example, 15%, even when there are no other processes competing for CPU time.
Update: I need to be able to throttle any process that is already running and whose source code I don't have access to.

You probably want to run the process(es) in a job object, and set the maximum CPU usage for the job object with SetInformationJobObject, with JOBOBJECT_CPU_RATE_CONTROL_INFORMATION.
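For reference, a minimal sketch of that approach (CPU rate control requires Windows 8 / Server 2012 or later; CapProcessCpu is a hypothetical helper name, and hProcess must have been opened with PROCESS_SET_QUOTA and PROCESS_TERMINATE access for AssignProcessToJobObject to succeed):

#include <windows.h>

/* Cap an already-running process at `percent` percent of total CPU time. */
BOOL CapProcessCpu(HANDLE hProcess, DWORD percent)
{
    HANDLE hJob = CreateJobObject(NULL, NULL);
    if (!hJob)
        return FALSE;

    JOBOBJECT_CPU_RATE_CONTROL_INFORMATION info = { 0 };
    info.ControlFlags = JOB_OBJECT_CPU_RATE_CONTROL_ENABLE |
                        JOB_OBJECT_CPU_RATE_CONTROL_HARD_CAP;
    info.CpuRate = percent * 100;   /* the API wants 1/100ths of a percent */

    if (!SetInformationJobObject(hJob, JobObjectCpuRateControlInformation,
                                 &info, sizeof(info)) ||
        !AssignProcessToJobObject(hJob, hProcess)) {
        CloseHandle(hJob);
        return FALSE;
    }
    return TRUE;   /* keep hJob open for as long as the cap should apply */
}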

Very simplified, it could work somehow like this:
Create a periodic waitable timer with some reasonably small wait time (maybe 100 ms). Get a "last" value for each relevant process by calling GetProcessTimes once.
Loop forever, blocking on the timer.
Each time you wake up:
if the affinity mask reported by GetProcessAffinityMask is 0, call SetProcessAffinityMask(old_value). This means we suspended the process in our last iteration and are now giving it a chance to run again.
else call GetProcessTimes to get the "current" value
call GetSystemTimeAsFileTime
calculate delta by subtracting last from current
cpu_usage = (deltaKernelTime + deltaUserTime) / (deltaTime)
if that's more than you want, save old_value via GetProcessAffinityMask and then call SetProcessAffinityMask(0), which takes the process offline.
This is basically a very primitive version of the scheduler that runs in the kernel, implemented in userland. It puts a process "to sleep" for a small amount of time if it has used more CPU time than you deem right. A more sophisticated measurement, perhaps averaging over one second or five seconds, would be possible (and probably desirable).
You might be tempted to suspend all threads in the process instead. However, it is important not to fiddle with priorities and not to use SuspendThread unless you know exactly what a program is doing, as this can easily lead to deadlocks and other nasty side effects. Think for example of suspending a thread holding a critical section while another thread is still running and trying to acquire the same object. Or imagine your process gets swapped out in the middle of suspending a dozen threads, leaving half of them running and the other half dead.
Setting the affinity mask to zero on the other hand simply means that from now on no single thread in the process gets any more time slices on any processor. Resetting the affinity gives -- atomically, at the same time -- all threads the possibility to run again.
Unfortunately, SetProcessAffinityMask does not return the old mask the way SetThreadAffinityMask does, at least according to the documentation. Therefore an extra Get... call is necessary.
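Put together, a very rough sketch of this loop might look like the following (hProcess is assumed to have been opened with PROCESS_QUERY_INFORMATION and PROCESS_SET_INFORMATION access; taking a process offline with a zero affinity mask is exactly what this answer proposes, but some Windows versions may reject a zero mask, so treat this as an illustration rather than guaranteed behavior):

#include <windows.h>

static ULONGLONG FtToU64(FILETIME ft)
{
    ULARGE_INTEGER u;
    u.LowPart  = ft.dwLowDateTime;
    u.HighPart = ft.dwHighDateTime;
    return u.QuadPart;
}

void ThrottleLoop(HANDLE hProcess, double target /* e.g. 0.15 */)
{
    HANDLE timer = CreateWaitableTimer(NULL, FALSE, NULL);
    LARGE_INTEGER due;
    due.QuadPart = -1000000LL;                        /* 100 ms, relative */
    SetWaitableTimer(timer, &due, 100, NULL, NULL, FALSE);

    FILETIME ftc, fte, ftk, ftu, now;
    GetProcessTimes(hProcess, &ftc, &fte, &ftk, &ftu);
    GetSystemTimeAsFileTime(&now);
    ULONGLONG lastCpu  = FtToU64(ftk) + FtToU64(ftu);
    ULONGLONG lastWall = FtToU64(now);

    DWORD_PTR oldMask = 0, sysMask = 0;
    BOOL offline = FALSE;

    for (;;) {
        WaitForSingleObject(timer, INFINITE);

        if (offline) {
            /* We took the process offline last round; let it run again. */
            SetProcessAffinityMask(hProcess, oldMask);
            offline = FALSE;
            continue;
        }

        GetProcessTimes(hProcess, &ftc, &fte, &ftk, &ftu);
        GetSystemTimeAsFileTime(&now);
        ULONGLONG cpu  = FtToU64(ftk) + FtToU64(ftu);
        ULONGLONG wall = FtToU64(now);
        double usage = (double)(cpu - lastCpu) / (double)(wall - lastWall);
        lastCpu  = cpu;
        lastWall = wall;

        if (usage > target) {
            /* Save the current mask, then take the process offline. */
            GetProcessAffinityMask(hProcess, &oldMask, &sysMask);
            SetProcessAffinityMask(hProcess, 0);  /* may fail on some
                                                     Windows versions */
            offline = TRUE;
        }
    }
}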

CPU usage is fairly simple to estimate using QueryProcessCycleTime. The machine's nominal processor speed can be obtained from HKLM\HARDWARE\DESCRIPTION\System\CentralProcessor\<n>\~MHz (where <n> is the processor number; there is one entry for each processor present). With these values, you can estimate your process's CPU usage and yield the CPU as necessary, using Sleep() to keep your usage in bounds.
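A rough sketch of that idea, applied to the calling process itself (DoSomeWork is a hypothetical unit of the program's real work, and the ~MHz value is only a nominal figure on CPUs with dynamic frequency scaling):

#include <windows.h>

extern void DoSomeWork(void);   /* hypothetical: one chunk of real work */

void ThrottledWorkLoop(double target /* e.g. 0.15 */)
{
    DWORD mhz = 0, size = sizeof(mhz);
    RegGetValueA(HKEY_LOCAL_MACHINE,
                 "HARDWARE\\DESCRIPTION\\System\\CentralProcessor\\0",
                 "~MHz", RRF_RT_REG_DWORD, NULL, &mhz, &size);

    ULONG64 prev = 0;
    QueryProcessCycleTime(GetCurrentProcess(), &prev);

    for (;;) {
        DoSomeWork();

        ULONG64 now;
        QueryProcessCycleTime(GetCurrentProcess(), &now);
        /* cycles -> milliseconds of full-speed CPU time */
        double cpuMs = (double)(now - prev) / ((double)mhz * 1000.0);
        prev = now;

        /* sleep so that cpuMs / (cpuMs + sleepMs) comes out near target */
        DWORD sleepMs = (DWORD)(cpuMs * (1.0 / target - 1.0));
        if (sleepMs > 0)
            Sleep(sleepMs);
    }
}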

Related

If a CPU is always executing instructions how do we measure its work?

Let us say we have a fictitious single core CPU with Program Counter and basic instruction set such as Load, Store, Compare, Branch, Add, Mul and some ROM and RAM. Upon switching on it executes a program from ROM.
Would it be fair to say that the work the CPU does is based on the type of instruction it's executing? For example, a MUL operation would likely involve more transistors firing up than, say, a Branch.
However from an outside perspective if the clock speed remains constant then surely the CPU could be said to be running at 100% constantly.
How exactly do we establish a paradigm for measuring the work of the CPU? Is there some kind of standard metric perhaps based on the type of instructions executing, the power consumption of the CPU, number of clock cycles to complete or even whether it's accessing RAM or ROM.
A related second question is what it means for the program to "stop". Does it usually just branch into an infinite loop, or does the PC halt while the CPU waits for an interrupt?
First of all, that a CPU is always executing some code is just an approximation these days. Computer systems have so-called sleep states which allow for energy saving when there is not too much work to do. Modern CPUs can also throttle their speed in order to improve battery life.
Apart from that, there is a difference between the CPU executing "some work" and "useful work". The CPU by itself can't tell the difference, but the operating system usually can. Except for some embedded software, a CPU will never be running a single job, but rather an operating system with different processes within it. If there is no useful process to run, the operating system will schedule the "idle task", which mostly means putting the CPU to sleep for some time (see above) or just burning CPU cycles in a loop that does nothing useful. Calculating the ratio of time spent in the idle task to time spent in regular tasks tells you how busy the CPU is.
So while in the old days of DOS, when the computer was running (almost) only a single task, it was true that it was always doing something. Many applications used so-called busy-waiting if they just had to delay their execution for some time, doing nothing useful. But today there will almost always be a smart OS in place which can run the idle process, put the CPU to sleep, throttle down its speed, etc.
Oh boy, this is a toughie. It’s a very practical question as it is a measure of performance and efficiency, and also a very subjective question as it judges what instructions are more or less “useful” toward accomplishing the purpose of an application. The purpose of an application could be just about anything, such as finding the solution to a complex matrix equation or rendering an image on a display.
In addition, modern processors do things like clock gating in idle power states. The oscillator is still producing cycles, but no instructions execute because the clock signal is gated off from the relevant circuitry. These are cycles that do no useful work and need to be ignored.
Similarly, modern processors can execute multiple instructions simultaneously, execute them out of order, and predict and execute the instructions that will likely be executed next before your program (i.e. the IP, or Instruction Pointer) actually reaches them. You don’t want to include instructions whose execution never actually completes, such as when the processor guesses wrong and has to flush those instructions, e.g. due to a branch mispredict. So a better metric is counting those instructions that actually complete. Instructions that complete are termed “retired”.
So we should only count those instructions that complete (i.e. retire), and cycles that are actually used to execute instructions (i.e. unhalted).
Perhaps the most practical general metric for “work” is CPI, or cycles-per-instruction: CPI = CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY. CPU_CLK_UNHALTED.CORE counts the cycles used to execute actual instructions (vs. those “wasted” in an idle state). INST_RETIRED.ANY counts the instructions that complete (vs. those that don’t, due to something like a branch mispredict).
Trying to get a more specific metric, such as the instructions that contribute to the solution of a matrix multiply, while excluding instructions that don’t directly contribute to computing the solution, such as control instructions, is very subjective and difficult to gather statistics on. (There are some that you can gather, such as VECTOR_INTENSITY = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED, the average number of SIMD vector elements, such as SSE or AVX, operated on per vector instruction executed. These instructions are more likely to contribute directly to the solution of a mathematical problem, as that is their primary purpose.)
Now that I’ve talked your ear off, check out some of the optimization resources at your local friendly Intel developer resource, software.intel.com. Particularly, check out how to effectively use VTune. I’m not suggesting you need to get VTune, though you can get a free or very discounted student license (I think). But the material will tell you a lot about increasing your program’s performance (i.e. optimizing), which is, if you think about it, increasing the useful work your program accomplishes.
Expanding on Michał's answer a bit:
Programs written for modern multi-tasking OSes are more like a collection of event handlers: they effectively set up listeners for I/O and then yield control back to the OS. The OS wakes them up each time there is something to process (e.g. a user action, data from a device), and they "go to sleep" by calling into the OS once they've finished processing. Most OSes will also preempt a process that hogs the CPU for too long and starves the others.
The OS can then keep tabs on how long each process actually runs (by remembering the start and end time of each run) and generate statistics like CPU time and load (ready-process queue length).
And to answer your second question:
To stop mostly means a process is no longer scheduled and all associated resources (scheduling data structures, file handles, memory space, ...) are destroyed. This usually requires the process to make a special OS call (syscall/interrupt) so the OS can release the resources gracefully.
If, however, a process runs into an infinite loop and stops responding to OS events, then it can only be stopped forcibly (by simply never running it again).

What exactly is CPU load if instructions are executed one at a time?

I know this question has been asked many times in many different ways, but it's still not clear to me what the CPU load % means.
I'll start explaining how I perceive the concepts now (of course, I might, and sure will, be wrong):
A CPU core can only execute one instruction at a time. It will not execute the next instruction until it finishes executing the current one.
Suppose your box has one single CPU with one single core. Parallel computing is hence not possible. Your OS's scheduler will pick up a process, set the IP to the entry point, and send that instruction to the CPU. It won't move to the next instruction until the CPU finishes executing the current instruction. After a certain amount of time it will switch to another process, and so on. But it will never switch to another process if the CPU is currently executing an instruction. It will wait until the CPU becomes free to switch to another process. Since you only have one single core, you can't have two processes executing simultaneously.
I/O is expensive. Whenever a process wants to read a file from the disk, it has to wait until the disk accomplishes its task, and the current process can't execute its next instruction until then. The CPU is not doing anything while the disk is working, and so our OS will switch to another process until the disk finishes its job in order not to waste time.
Following these principles, I've come myself to the conclusion that CPU load at a given time can only be one of the following two values:
0% - Idle. CPU is doing nothing at all.
100% - Busy. CPU is currently executing an instruction.
This is obviously false, as taskmgr reports CPU usage values of 1%, 12%, 15%, 50%, etc.
What does it mean that a given process, at a given time, is utilizing 1% of a given CPU core (as reported by taskmgr)? While that given process is executing, what happens with the 99%?
What does it mean that the overall CPU usage is 19% (as reported by Rainmeter at the moment)?
If you look into the Task Manager on Windows, there is an Idle process that does exactly that: it just accounts for the cycles spent doing nothing useful. Yes, the CPU is always busy, but it might just be running in a loop waiting for useful things to come along.
Since you only have one single core, you can't have two processes executing simultaneously.
This is not really true. Yes, true parallelism is not possible with a single core, but you can create the illusion of it with preemptive multitasking. Yes, it is impossible to interrupt an instruction mid-execution, but that is not a problem, because most instructions take only a tiny amount of time to finish. The OS divides CPU time into time slices, which are significantly longer than the execution time of a single instruction.
What does it mean that a given process, at a given time, is utilizing 1% of a given CPU core
Most of the time, applications are not doing anything useful. Think of an application that waits for the user to click a button before it starts processing something. This app doesn't need the CPU, so it sleeps most of the time; every time it gets a time slice, it just goes back to sleep (see the event loop in Windows). GetMessage is blocking, so the thread will sleep until a message arrives. So what does CPU load really mean? Imagine the app receives some events or data to act on: it will then perform operations instead of sleeping. If it utilizes X% of the CPU, that means that over a sampling period, the app used X% of the available CPU time. CPU usage is an average metric.
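For illustration, here is the classic shape of such a Windows event loop; while the thread is blocked inside GetMessage it consumes no CPU time at all:

#include <windows.h>

int RunMessageLoop(void)
{
    MSG msg;
    /* GetMessage blocks (the thread sleeps) until a message arrives. */
    while (GetMessage(&msg, NULL, 0, 0) > 0) {
        TranslateMessage(&msg);
        DispatchMessage(&msg);  /* briefly uses the CPU, then blocks again */
    }
    return (int)msg.wParam;
}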
PS: To summarize the concept of CPU load, think of speed (in the physics sense). There are instantaneous and average speeds, and likewise for CPU load there are instantaneous and average measurements. The instantaneous value is always either 0% or 100%, because at any given point in time a process either uses the CPU or it doesn't. If a process used 100% of the CPU for 250 ms and didn't use it for the next 750 ms, then we can say the process loaded the CPU to 25% over a sampling period of 1 second (an average measurement only makes sense with respect to a given sampling period).
http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages
A single-core CPU is like a single lane of traffic. Imagine you are a bridge operator ... sometimes your bridge is so busy there are cars lined up to cross. You want to let folks know how traffic is moving on your bridge. A decent metric would be how many cars are waiting at a particular time. If no cars are waiting, incoming drivers know they can drive across right away. If cars are backed up, drivers know they're in for delays.
This is basically what CPU load is. "Cars" are processes using a slice of CPU time ("crossing the bridge") or queued up to use the CPU. Unix refers to this as the run-queue length: the sum of the number of processes that are currently running plus the number that are waiting (queued) to run.
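For completeness, a minimal sketch of reading those Unix load averages from C (on glibc and the BSDs, getloadavg is declared in stdlib.h):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    double load[3];
    /* 1-, 5- and 15-minute averages of the run-queue length */
    if (getloadavg(load, 3) == -1) {
        fprintf(stderr, "getloadavg failed\n");
        return 1;
    }
    printf("1m: %.2f  5m: %.2f  15m: %.2f\n", load[0], load[1], load[2]);
    return 0;
}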
Also see: http://en.wikipedia.org/wiki/Load_(computing)

Why one non-voluntary context switch per second?

The OS is RHEL 6 (2.6.32). I have isolated a core and am running a compute intensive thread on it. /proc/{thread-id}/status shows one non-voluntary context switch every second.
The thread in question is a SCHED_NORMAL thread and I don't want to change this.
How can I reduce this number of non-voluntary context switches? Does this depend on any scheduling parameters in /proc/sys/kernel?
EDIT: Several responses suggest alternative approaches. Before going that route, I first want to understand why I am getting exactly one non-voluntary context switch per second even over hours of run. For example, is this caused by CFS? If so, which parameters and how?
EDIT2: Further clarification - first question I would like an answer to is the following: Why am I getting one non-voluntary context switch per second instead of, say, one switch every half or two seconds?
This is a guess, but an educated one: since you use an isolated CPU, the scheduler does not schedule any task on it except your own, with one exception. The vmstat code in the kernel has a timer that schedules a single work-queue item on each CPU once per second to calculate memory usage statistics, and this is what you are seeing get scheduled each second.
The work-queue code is smart enough not to schedule the work-queue kernel thread when the core is 100% idle, but not when it is running a single task.
You can verify this using ftrace. If the sched_switch tracer shows that the entity you switch to roughly once per second (the value is rounded to the nearest jiffy, and the timer does not count while the CPU is idle, so the timing may be skewed) is the events/CPU_NUMBER task (or keventd for older kernels), then it is almost certain that the cause is indeed the vmstat_update function setting its timer to queue a work-queue item every second, which the events kernel thread then runs.
Note that the interval at which vmstat sets its timer is configurable: you can set it to another value via the vm.stat_interval sysctl knob (the unit is seconds, so for example sysctl -w vm.stat_interval=60 would fire it once a minute). Increasing this value gives you a lower rate of such interruptions, at the cost of less accurate memory usage statistics.
I maintain a wiki with all the sources of interruptions to isolated-CPU workloads here. I also have a patch in the works that stops vmstat from scheduling the work-queue item if nothing changed between one vmstat run and the next, as would happen if your single task on the CPU does not use any dynamic memory allocations. Not sure it will benefit you, though; it depends on your workload.
I strongly suggest you try to optimize the code itself so that when it's running on a CPU, you get the maximum out of it.
Anyhow, I am not sure this will work, but give it a try anyway and let us know:
The basic idea is to set the scheduling policy to FIFO and then give the process the maximum priority possible:
#include <sched.h>
#include <stdio.h>

/* Run the calling process as SCHED_FIFO at the highest priority. */
struct sched_param sp;
int ret;

sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
ret = sched_setscheduler(0, SCHED_FIFO, &sp);
if (ret == -1) {
    perror("sched_setscheduler");
    return 1;
}
Please keep in mind that any blocking call your process makes is most likely going to cause the scheduler to take it off the CPU.
Source
Man page
EDIT:
Sorry, just noticed the pthread tag; the concept still holds so check out this man page:
http://www.kernel.org/doc/man-pages/online/pages/man3/pthread_setschedparam.3.html
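A minimal sketch of the same idea through the pthread interface (the helper name is mine; note that pthread_setschedparam returns an error number rather than -1/errno):

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int make_self_fifo(void)
{
    struct sched_param sp;
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);

    int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
    if (err != 0) {
        fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
        return 1;
    }
    return 0;
}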
If one interrupt per second on your dedicated CPU is still too much, then you really need to bypass the normal scheduler entirely. May I suggest the real-time and isochronous priority levels, which can keep your process scheduled more reliably than the usual pre-emptive mechanisms?

What effect does changing the process priority have in Windows?

If you go into Task Manager, right click a process, and set priority to Realtime, it often stops program crashes, or makes them run faster.
In a programming context, what does this do?
It calls SetPriorityClass().
Every thread has a base priority level determined by the thread's priority value and the priority class of its process. The system uses the base priority level of all executable threads to determine which thread gets the next slice of CPU time. The SetThreadPriority function enables setting the base priority level of a thread relative to the priority class of its process. For more information, see Scheduling Priorities.
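For illustration, a bare-bones sketch of doing the same thing programmatically (the helper name is made up; note that raising a process to REALTIME_PRIORITY_CLASS additionally requires the "Increase scheduling priority" privilege, otherwise Windows silently gives you HIGH_PRIORITY_CLASS instead):

#include <windows.h>

BOOL SetProcessPriorityById(DWORD pid, DWORD priorityClass)
{
    HANDLE h = OpenProcess(PROCESS_SET_INFORMATION, FALSE, pid);
    if (!h)
        return FALSE;
    /* e.g. priorityClass = HIGH_PRIORITY_CLASS or REALTIME_PRIORITY_CLASS */
    BOOL ok = SetPriorityClass(h, priorityClass);
    CloseHandle(h);
    return ok;
}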
It tells the Windows scheduler to be more or less greedy when allocating execution time slices to your process. Realtime execution makes it never yield execution (not even to drivers, according to MSDN), which may cause stalls in your app if it waits on external events but never yields of its own accord (via Sleep, SwitchToThread or WaitFor[Single|Multiple]Objects). As such, Realtime should be avoided unless you know that the application will handle it correctly.
It works by changing the weight given to this process in the OS task scheduler. Your CPU can only execute one instruction at a time (to put it very, very simply), and the OS's job is to keep swapping between the instruction streams of all running processes. By raising or lowering the priority, you're affecting how much CPU time the process is allotted relative to other applications currently being multi-tasked.

How to reserve a core for one thread on windows?

I am working on a very time sensitive application which polls a region of shared memory taking action when it detects a change has occurred. Changes are rare but I need to minimize the time from change to action. Given the infrequency of changes I think the CPU cache is getting cold. Is there a way to reserve a core for my polling thread so that it does not have to compete with other threads for either cache or CPU?
Thread affinity alone (SetThreadAffinityMask) will not be enough. It does not reserve a CPU core; it does the opposite, binding the thread to only the cores that you specify (which is not the same thing!).
By constraining the CPU affinity, you reduce the likelihood that your thread will run. If another thread with higher priority runs on the same core, your thread will not be scheduled until that other thread is done (this is how Windows schedules threads).
Without constraining affinity, your thread has a chance of being migrated to another core (taking the last time it was run as metric for that decision). Thread migration is undesirable if it happens often and soon after the thread has run (or while it is running) but it is a harmless, beneficial thing if a couple of dozen milliseconds have passed since it was last scheduled (caches will have been overwritten by then anyway).
You can "kind of" assure that your thread will run by giving it a higher priority class (no guarantee, but high likelihood). If you then use SetThreadAffinityMask as well, you have a reasonable chance that the cache is always warm on most common desktop CPUs (which luckily are normally VIPT and PIPT). For the TLB, you will probably be less lucky, but there's nothing you can do about it.
The problem with a high priority thread is that it will starve other threads because scheduling is implemented so it serves higher priority classes first, and as long as these are not satisfied, lower classes get zero. So, the solution in this case must be to block. Otherwise, you may impair the system in an unfavorable way.
Try this:
create a semaphore and share it with the other process
set priority to THREAD_PRIORITY_TIME_CRITICAL
block on the semaphore
in the other process, after writing data, call SignalObjectAndWait on the semaphore with a timeout of 1 (or even zero timeout)
if you want, you can experiment binding them both to the same core
This gives you a thread that will be the first (or among the first) to get CPU time when the semaphore is signaled, but which is not running (and thus not starving anyone) in the meantime.
When the writer thread calls SignalObjectAndWait, it atomically signals and blocks (even if it waits for "zero time", that is enough to reschedule). The other thread will wake from the semaphore and do its work. Thanks to its high priority, it will not be interrupted by other "normal" (that is, non-realtime) threads. It will keep hogging CPU time until done, and then block again on the semaphore. At this point, SignalObjectAndWait returns.
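Here is a hedged sketch of the two halves (the names are mine; in practice the semaphore would be created named, so that both processes can open it, and hNeverSignaled is just an event that is never set, used as the mandatory wait handle):

#include <windows.h>

/* Reader side: time-critical thread that blocks until the writer signals. */
DWORD WINAPI ReaderThread(LPVOID param)
{
    HANDLE hSem = (HANDLE)param;   /* shared, e.g. via OpenSemaphore */
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);
    for (;;) {
        WaitForSingleObject(hSem, INFINITE);  /* blocked: first in line,
                                                 but using no CPU */
        /* ... react to the change in shared memory ... */
    }
    return 0;
}

/* Writer side: after writing the data, signal and yield in one atomic step. */
void NotifyReader(HANDLE hSem, HANDLE hNeverSignaled)
{
    SignalObjectAndWait(hSem, hNeverSignaled, 0, FALSE);
}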
Using the Task Manager, you can set the "affinity" of processes.
You would have to set the affinity of your time-critical app to core 4, and the affinity of all the other processes to cores 1, 2, and 3. Assuming four cores of course.
You could call SetProcessAffinityMask on every process but yours with a mask that excludes just the core that will "belong" to your process, and use it on your own process to set it to run just on that core (or, even better, use SetThreadAffinityMask just on the thread that does the time-critical task).
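On a hypothetical four-core machine, that could look like this (affinity masks are bit fields, one bit per core):

#include <windows.h>

void ReserveCoreFour(HANDLE hOtherProcess, HANDLE hCriticalThread)
{
    /* Confine a competing process to cores 0-2 ... */
    SetProcessAffinityMask(hOtherProcess, 0x7);   /* binary 0111 */

    /* ... and pin only the time-critical thread to core 3. */
    SetThreadAffinityMask(hCriticalThread, 0x8);  /* binary 1000 */
}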
Given the infrequency of changes I think the CPU cache is getting cold.
That sounds very strange.
Let's assume your polling thread and the writing thread are on different cores.
The polling thread will be reading the shared memory address and so will be caching the data. That cache line is probably marked as exclusive. Then the writing thread finally writes; first it reads the cache line in (so that line is now marked as shared on both cores), and then it writes. Writing causes the polling thread's cached copy of the line to be marked as invalid. The polling thread then comes to read again; if it reads while the writing thread still has the data cached, it will read from the second core's cache, invalidating its cache line and taking ownership for itself. There's a lot of bus-traffic overhead in doing this.
Another issue is that the writing thread, if it doesn't write often, will almost certainly lose the TLB entry for the page with the shared memory address. Recalculating the physical address is a long, slow process. Since the polling thread polls often, possibly that page is always in that core's TLB; in that sense, you might well do better, in latency terms, to have both threads on the same core. (Although if they're both compute-intensive, they might interfere destructively, and that cost could be much higher - I can't know, as I don't know what the threads are doing.)
One thing you could do is use a hyperthread on the writing thread's core: if you know early on that you're going to write, get the hyperthread to read the shared memory address. This will load the TLB and cache while the writing thread is still busy computing, giving you parallelism.
The Win32 function SetThreadAffinityMask() is what you are looking for.
