If a CPU is always executing instructions how do we measure its work? - cpu

Let us say we have a fictitious single core CPU with Program Counter and basic instruction set such as Load, Store, Compare, Branch, Add, Mul and some ROM and RAM. Upon switching on it executes a program from ROM.
Would it be fair to say the work the CPU does is based on the type of instruction it's executing. For example, a MUL operating would likely involve more transistors firing up than say Branch.
However from an outside perspective if the clock speed remains constant then surely the CPU could be said to be running at 100% constantly.
How exactly do we establish a paradigm for measuring the work of the CPU? Is there some kind of standard metric perhaps based on the type of instructions executing, the power consumption of the CPU, number of clock cycles to complete or even whether it's accessing RAM or ROM.
A related second question is what does it mean for the program to "stop". Usually does it just branch in an infinite loop or does the PC halt and the CPU waits for an interupt?

First of all, that a CPU is always executing some code is just an approximation these days. Computer systems have so-called sleep states which allow for energy saving when there is not too much work to do. Modern CPUs can also throttle their speed in order to improve battery life.
Apart from that, there is a difference between the CPU executing "some work" and "useful work". The CPU by itself can't tell, but the operating system usually can. Except for some embedded software, a CPU will never be running a single job, but rather an operating system with different processes within it. If there is no useful process to run, the Operating System will schedule the "idle task" which mostly means putting the CPU to sleep for some time (see above) or jsut burning CPU cycles in a loop which does nothing useful. Calculating the ratio of time spent in idle task to time spent in regular tasks gives the CPU's business factor.
So while in the old days of DOS when the computer was running (almost) only a single task, it was true that it was always doing something. Many applications used so-called busy-waiting if they jus thad to delay their execution for some time, doing nothing useful. But today there will almost always be a smart OS in place which can run the idle process than can put the CPU to sleep, throttle down its speed etc.

Oh boy, this is a toughie. It’s a very practical question as it is a measure of performance and efficiency, and also a very subjective question as it judges what instructions are more or less “useful” toward accomplishing the purpose of an application. The purpose of an application could be just about anything, such as finding the solution to a complex matrix equation or rendering an image on a display.
In addition, modern processors do things like clock gating in power idle states. The oscillator is still producing cycles, but no instructions execute due to certain circuitry being idled due to cycles not reaching them. These are cycles that are not doing anything useful and need to be ignored.
Similarly, modern processors can execute multiple instructions simultaneously, execute them out of order, and predict and execute which instructions will be executed next before your program (i.e. the IP or Instruction Pointer) actually reaches them. You don’t want to include instructions whose execution never actually complete, such as because the processor guesses wrong and has to flush those instructions, e.g. as due to a branch mispredict. So a better metric is counting those instructions that actually complete. Instructions that complete are termed “retired”.
So we should only count those instructions that complete (i.e. retire), and cycles that are actually used to execute instructions (i.e. unhalted).)
Perhaps the most practical general metric for “work” is CPI or cycles-per-instruction: CPI = CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY. CPU_CLK_UNHALTED.CORE are cycles used to execute actual instructions (vs those “wasted” in an idle state). INST_RETIRED are those instructions that complete (vs those that don’t due to something like a branch mispredict).
Trying to get a more specific metric, such as the instructions that contribute to the solution of a matrix multiple, and excluding instructions that don’t directly contribute to computing the solution, such as control instructions, is very subjective and difficult to gather statistics on. (There are some that you can, such as VECTOR_INTENSITY = VPU_ELEMENTS_ACTIVE / VPU_INSTRUCTIONS_EXECUTED which is the number of SIMD vector operations, such as SSE or AVX, that are executed per second. These instructions are more likely to directly contribute to the solution of a mathematical solution as that is their primary purpose.)
Now that I’ve talked your ear off, check out some of the optimization resources at your local friendly Intel developer resource, software.intel.com. Particularly, check out how to effectively use VTune. I’m not suggesting you need to get VTune though you can get a free or very discounted student license (I think). But the material will tell you a lot about increasing your programs performance (i.e. optimizing), which is, if you think about it, increasing the useful work your program accomplishes.

Expanding on Michał's answer a bit:
Program written for modern multi-tasking OSes are more like a collection of event handlers: they effectively setup listeners for I/O and then yield control back to the OS. The OS wake them up each time there is something to process (e.g. user action, data from device) and they "go to sleep" by calling into the OS once they've finished processing. Most OSes will also preempt in case one process hog the CPU for too long and starve the others.
The OS can then keep tabs on how long each process are actually running (by remembering the start and end time of each run) and generate the statistics like CPU time and load (ready process queue length).
And to answer your second question:
To stop mostly means a process is no longer scheduled and all associated resource (scheduling data structures, file handles, memory space, ...) destroyed. This usually require the process to call a special OS call (syscall/interrupt) so the OS can release the resources gracefully.
If however a process run into an infinite loop and stops responding to OS events, then it can only be forcibly stopped (by simply not running it anymore).

Related

Should CPU time always be identical between executions of same code?

My understanding of CPU time is that it should always be the same between every execution, on a same machine. It should require an identical amount of cpu cycles every time.
But I'm running some tests now, of executing a basic echo "Hello World", and it's giving me 0.003 to 0.005 seconds.
Is my understanding of CPU time wrong, or there's an issue in my measurement?
Your understanding is completely wrong. Real-world computers running modern OSes on modern CPUs are not simple, theoretical abstractions. There are all kinds of factors that can affect how much CPU time code requires to execute.
Consider memory bandwidth. On a typical modern machine, all the tasks running on the machine's cores are competing for access to the system memory. If the code is running at the same time code on another core is using lots of memory bandwidth, that may result in accesses to RAM taking more clock cycles.
Many other resources are shared as well, such as caches. Say the code is frequently interrupted to let other code run on the core. That will mean that the code will frequently find the cache cold and take lots of cache misses. That will also result in the code taking more clock cycles.
Let's talk about page faults as well. The code itself may be in memory or it may not be when the code starts running. Even if the code is in memory, you may or may not take soft page faults (to update the operating system's tracking of what memory is being actively used) depending on when that page last took a soft page fault or how long ago it was loaded into RAM.
And your basic hello world program is doing I/O to the terminal. The time that takes can depend on what else is interacting with the terminal at the time.
The biggest effects on modern systems include:
virtual memory lazily paging in code and data from disk if it's not hot in pagecache. (First run of a program tends to have a lot more startup overhead.)
CPU frequency isn't fixed. (idle / turbo speeds. grep MHz /proc/cpuinfo).
CPU caches can be hot or not
(for very short intervals) an interrupt randomly happening or not in your timed region.
So even if cycles were fixed (which they very much are not), you wouldn't see equal times.
Your assumption is not totally wrong, but it only applies to core clock cycles for individual loops, and only to cases that don't involve any memory access. (e.g. data already hot in L1d cache, code already hot in L1i cache inside a CPU core). And assuming no interrupt happens while the timed loop is running.
Running a whole program is a much larger scale of operation and will involve shared resources (and possible contention for them) like access to main memory. And as #David pointed out, a write system call to print a string on a terminal emulator - that communication with another process can be slow and involves waking up another process, if your program ends up waiting for it. Redirecting to /dev/null or a regular file would remove that, or just closing stdout like ./hello >&- would make your write system call return -EBADF (on Linux).
Modern CPUs are very complex beasts. You presumably have an Intel or AMD x86-64 CPU with out-of-order execution, and a dozen or so buffers for incoming / outgoing cache lines, allowing it to track about that many outstanding cache misses (memory-level parallelism). And 2 levels of private cache per core, and a shared L3 cache. Good luck predicting an exact number of clock cycles for anything but the most controlled conditions.
But yes, if you do control the condition, the same small loop will typically run at the same number of core clock cycles per iteration.
However, even that's not always the case. I've seen cases where the same loop seems to have have two stable states for how the CPU schedules instructions. Different entry condition quirks can lead to an ongoing speed difference over millions of loop iterations.
I've seen this occasionally when microbenchmarking stuff on modern Intel CPUs like Sandybridge and Skylake. It's usually not clear exactly what the two stable states are, and what exactly is causing the bottleneck, even with the help of performance counters and https://agner.org/optimize
In one case I remember, an interrupt tended to get the loop into the efficient mode of execution. #BeeOnRope was measuring slow cycles/iteration using or RDPMC for a short interval (or maybe RDTSC with core clock fixed = TSC reference clocks), while I was measuring it running faster by using a really large repeat count and just using perf stat on the whole program (which was a static executable with just that one loop written by hand in asm). And #Bee was able to repro my results by increasing the iteration count so an interrupt would happen inside the timed region, and returning from the interrupt tended to get the CPU out of that non-optimal uop-scheduling pattern, whatever it was.

Set CPU affinity for profiling

I am working on a calculation intensive C# project that implements several algorithms. The problem is that when I want to profile my application, the time it takes for a particular algorithm varies. For example sometimes running the algorithm 100 times takes about 1100 ms and another time running 100 times takes much more time like 2000 or even 3000 ms. It may vary even in the same run. So it is impossible to measure improvement when I optimize a piece of code. It's just unreliable.
Here is another run:
So basically I want to make sure one CPU is dedicated to my app. The PC has an old dual core Intel E5300 CPU running on Windows 7 32 bit. So I can't just set process affinity and forget about one core forever. It would make the computer very slow for daily tasks. I need other apps to use a specific core when I desire and the when I'm done profiling, the CPU affinities come back to normal. Having a bat file to do the task would be a fantastic solution.
My question is: Is it possible to have a bat file to set process affinity for every process on windows 7?
PS: The algorithm is correct and every time runs the same code path. I created some object pool so after first run, zero memory is allocated. I also profiled memory allocation with dottrace and it showed no allocation after first run. So I don't believe GC is triggered when the algorithm is working. Physical memory is available and system is not running low on RAM.
Result: The answer by Chris Becke does the job and sets process affinities exactly as intended. It resulted in more uniform results specially when background apps like visual studio and dottrace are running. Further investigation into the divergent execution time revealed that the root for the unpredictability is CPU overheat. The CPU overheat alarm was off while the temperature was over 100C! So after fixing the malfunctioning fan, the results became completely uniform.
You mean SetProcessAffinityMask?
I see this question, while tagged windows, is c#, so... I see the System.Diagnostics.Process object has a ThreadAffinity member that should perform the same function.
I am just not sure that this will stabilize the CPU times quite in the way you expect. A single busy task that is not doing IO should remain scheduled on the same core anyway unless another thread interrupts it, so I think your variable times are more due to other threads / processes interrupting your algorithm than the OS randomly shunting your thread to a different core - so unless you set the affinity for all other threads in the system to exclude your preferred core I can't see this helping.

Which one will workload(usage) of the CPU-Core if there is a persistent cache-miss, will be 100%?

That is, if the core processor most of the time waiting for data from RAM or cache-L3 with cache-miss, but the system is a real-time (real-time thread priority), and the thread is attached (affinity) to the core and works without switching thread/context, what kind of load(usage) CPU-Core should show on modern x86_64?
That is, CPU usage is displayed as decrease only when logged in Idle?
And if anyone knows, if the behavior is different in this case for other processors: ARM, Power[PC], Sparc?
Clarification: shows CPU-usage in standard Task manager in OS-Windows
A hardware thread (logical core) that's stalled on a cache miss can't be doing anything else, so it still counts as busy for the purposes of task-managers / CPU time accounting / OS process scheduler time-slices / stuff like that.
This is true across all architectures.
Without hyperthreading, "hardware thread" / "logical core" are the same as a "physical core".
Morphcore / other on-the-fly changing between hyperthreading and a more powerful single core could make there be a difference between a thread that keeps many execution units busy, vs. a thread that is blocked on cache misses a lot of the time.
I don't get the link between the OS CPU usage statistics and the optimal use of the pipeline. I think they are uncorrelated as the OS doesn't measure the pipeline load.
I'm writing this in the hope that Peter Cordes can help me understand it better and as a continuation of the comments.
User programs relinquish control to OS very often: when they need input from user or when
they are done with the signal/message. GUI program are basically just big loops and at
each iteration control is given to the OS until the next message.
When the OS has the control it schedules others threads/tasks and if not other actions
are needed just enter the idle process (long time ago a tight loop, now a sleep state)
until the next interrupt. This is the Idle Time.
Time spent on an ISR processing user input is considered idle time by any OS.
An a cache miss there would be still considered idle time.
A heavy program takes more time to complete the work for a given message thereby returning
control to OS say 2 times in a second instead of
20.
If the OS measures that in the last second, it got control for 20ms only then the
CPU usage is (1000-20)/1000 = 98%.
This has nothing to do with the optimal use of the CPU architecture, as said stalls can
occur in the OS code and still be part of the Idle time statistic.
The CPU utilization at pipeline level is not what is measured and it is orthogonal to the
OS statistics.
CPU usage is meant to be used by sysadmin, it is a measure of the load you put on a system,
it is not the measure of how efficiently the assembly of a program was generated.
Sysadmins can't help with that, but measuring how often the OS got the control back (without
preempting) is a measure of how much load a program is putting on the system.
And sysadmins can definitively do terminate heavy programs.

How expensive is a context switch? Is it better to implement a manual task switch than to rely on OS threads?

Imagine I have two (three, four, whatever) tasks that have to run in parallel. Now, the easy way to do this would be to create separate threads and forget about it. But on a plain old single-core CPU that would mean a lot of context switching - and we all know that context switching is big, bad, slow, and generally simply Evil. It should be avoided, right?
On that note, if I'm writing the software from ground up anyway, I could go the extra mile and implement my own task-switching. Split each task in parts, save the state inbetween, and then switch among them within a single thread. Or, if I detect that there are multiple CPU cores, I could just give each task to a separate thread and all would be well.
The second solution does have the advantage of adapting to the number of available CPU cores, but will the manual task-switch really be faster than the one in the OS core? Especially if I'm trying to make the whole thing generic with a TaskManager and an ITask, etc?
Clarification: I'm a Windows developer so I'm primarily interested in the answer for this OS, but it would be most interesting to find out about other OSes as well. When you write your answer, please state for which OS it is.
More clarification: OK, so this isn't in the context of a particular application. It's really a general question, the result on my musings about scalability. If I want my application to scale and effectively utilize future CPUs (and even different CPUs of today) I must make it multithreaded. But how many threads? If I make a constant number of threads, then the program will perform suboptimally on all CPUs which do not have the same number of cores.
Ideally the number of threads would be determined at runtime, but few are the tasks that can truly be split into arbitrary number of parts at runtime. Many tasks however can be split in a pretty large constant number of threads at design time. So, for instance, if my program could spawn 32 threads, it would already utilize all cores of up to 32-core CPUs, which is pretty far in the future yet (I think). But on a simple single-core or dual-core CPU it would mean a LOT of context switching, which would slow things down.
Thus my idea about manual task switching. This way one could make 32 "virtual" threads which would be mapped to as many real threads as is optimal, and the "context switching" would be done manually. The question just is - would the overhead of my manual "context switching" be less than that of OS context switching?
Naturally, all this applies to processes which are CPU-bound, like games. For your run-of-the-mill CRUD application this has little value. Such an application is best made with one thread (at most two).
I don't see how a manual task switch could be faster since the OS kernel is still switching other processes, including yours in out of the running state too. Seems like a premature optimization and a potentially huge waste of effort.
If the system isn't doing anything else, chances are you won't have a huge number of context switches anyway. The thread will use its timeslice, the kernel scheduler will see that nothing else needs to run and switch right back to your thread. Also the OS will make a best effort to keep from moving threads between CPUs so you benefit there with caching.
If you are really CPU bound, detect the number of CPUs and start that many threads. You should see nearly 100% CPU utilization. If not, you aren't completely CPU bound and maybe the answer is to start N + X threads. For very IO bound processes, you would be starting a (large) multiple of the CPU count (i.e. high traffic webservers run 1000+ threads).
Finally, for reference, both Windows and Linux schedulers wake up every millisecond to check if another process needs to run. So, even on an idle system you will see 1000+ context switches per second. On heavily loaded systems, I have seen over 10,000 per second per CPU without any significant issues.
The only advantage of manual switch that I can see is that you have better control of where and when the switch happens. The ideal place is of course after a unit of work has been completed so that you can trash it all together. This saves you a cache miss.
I advise not to spend your effort on this.
Single-core Windows machines are going to become extinct in the next few years, so I generally write new code with the assumption that multi-core is the common case. I'd say go with OS thread management, which will automatically take care of whatever concurrency the hardware provides, now and in the future.
I don't know what your application does, but unless you have multiple compute-bound tasks, I doubt that context switches are a significant bottleneck in most applications. If your tasks block on I/O, then you are not going to get much benefit from trying to out-do the OS.

How can you insure your code runs with no variability in execution time due to cache?

In an embedded application (written in C, on a 32-bit processor) with hard real-time constraints, the execution time of critical code (specially interrupts) needs to be constant.
How do you insure that time variability is not introduced in the execution of the code, specifically due to the processor's caches (be it L1, L2 or L3)?
Note that we are concerned with cache behavior due to the huge effect it has on execution speed (sometimes more than 100:1 vs. accessing RAM). Variability introduced due to specific processor architecture are nowhere near the magnitude of cache.
If you can get your hands on the hardware, or work with someone who can, you can turn off the cache. Some CPUs have a pin that, if wired to ground instead of power (or maybe the other way), will disable all internal caches. That will give predictability but not speed!
Failing that, maybe in certain places in the software code could be written to deliberately fill the cache with junk, so whatever happens next can be guaranteed to be a cache miss. Done right, that can give predictability, and perhaps could be done only in certain places so speed may be better than totally disabling caches.
Finally, if speed does matter - carefully design the software and data as if in the old day of programming for an ancient 8-bit CPU - keep it small enough for it all to fit in L1 cache. I'm always amazed at how on-board caches these days are bigger than all of RAM on a minicomputer back in (mumble-decade). But this will be hard work and takes cleverness. Good luck!
Two possibilities:
Disable the cache entirely. The application will run slower, but without any variability.
Pre-load the code in the cache and "lock it in". Most processors provide a mechanism to do this.
It seems that you are referring to x86 processor family that is not built with real-time systems in mind, so there is no real guarantee for constant time execution (CPU may reorder micro-instructions, than there is branch prediction and instruction prefetch queue which is flushed each time when CPU wrongly predicts conditional jumps...)
This answer will sound snide, but it is intended to make you think:
Only run the code once.
The reason I say that is because so much will make it variable and you might not even have control over it. And what is your definition of time? Suppose the operating system decides to put your process in the wait queue.
Next you have unpredictability due to cache performance, memory latency, disk I/O, and so on. These all boil down to one thing; sometimes it takes time to get the information into the processor where your code can use it. Including the time it takes to fetch/decode your code itself.
Also, how much variance is acceptable to you? It could be that you're okay with 40 milliseconds, or you're okay with 10 nanoseconds.
Depending on the application domain you can even further just mask over or hide the variance. Computer graphics people have been rendering to off screen buffers for years to hide variance in the time to rendering each frame.
The traditional solutions just remove as many known variable rate things as possible. Load files into RAM, warm up the cache and avoid IO.
If you make all the function calls in the critical code 'inline', and minimize the number of variables you have, so that you can let them have the 'register' type.
This should improve the running time of your program. (You probably have to compile it in a special way since compilers these days tend to disregard your 'register' tags)
I'm assuming that you have enough memory not to cause page faults when you try to load something from memory. The page faults can take a lot of time.
You could also take a look at the generated assembly code, to see if there are lots of branches and memory instuctions that could change your running code.
If an interrupt happens in your code execution it WILL take longer time. Do you have interrupts/exceptions enabled?
Understand your worst case runtime for complex operations and use timers.

Resources