I'm aware of the standard methods of getting time deltas using CPU clock counters on various operating systems. My question is: how do such operating systems account for changes in CPU frequency made for power-saving purposes? I initially thought this could be explained by the OS using specific calls to measure frequency, getting a corrected frequency based on which core is being used, what frequency it's currently set to, and so on. But then I realized: wouldn't that make any time delta inaccurate if the CPU frequency was lowered and raised back to its original value between the two clock queries?
For example, take the following scenario:
1. Query the CPU cycle counter.
2. The operating system lowers the CPU frequency to save power.
3. Some other code runs.
4. The operating system raises the CPU frequency for performance.
5. Query the CPU cycle counter again.
6. Compute the delta as the cycle difference divided by the frequency.
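With made-up numbers, the error in that scenario might look like this:

low_hz, high_hz = 1_000_000_000, 3_000_000_000   # 1 GHz while saving power, 3 GHz afterwards (invented values)

# Suppose the code between the two queries ran for one real second at each frequency:
cycles_elapsed = 1.0 * low_hz + 1.0 * high_hz    # 4e9 cycles over 2.0 real seconds

naive_delta = cycles_elapsed / high_hz           # divide by the frequency seen at query time
print(naive_delta)                               # 1.33 s instead of the true 2.0 s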
This would yield an inaccurate delta since the CPU frequency was not constant between the two queries. How is this worked around by the operating system or programs that have to work with time deltas using CPU cycles?
See this related question: wrong clock cycle measurements with rdtsc.
There are several ways to deal with this:
Set the CPU clock to its maximum.
The link above shows how to do that.
Use the PIT instead of RDTSC.
The PIT is the programmable interval timer (an Intel 8253, if I remember correctly). It has been present on all PC motherboards since the 286 era (and maybe even earlier), but its base clock is only about 1.19 MHz and not all operating systems give you access to it.
Combine the PIT and RDTSC.
Measure the CPU clock with the PIT repeatedly; once it is stable enough, start your measurement (and keep scanning for CPU clock changes). If the CPU clock changes during the measurement, throw that measurement away and start again.
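A rough sketch of that last idea in Python, with time.perf_counter_ns standing in for the PIT as the stable reference clock, and a hypothetical read_cycles() helper for the raw cycle counter (plain Python cannot execute RDTSC, so that part would have to be a small C/ctypes shim):

import time

def estimate_cpu_hz(read_cycles, sample_s=0.05):
    # Estimate the current cycle rate against the reference clock.
    t0, c0 = time.perf_counter_ns(), read_cycles()
    time.sleep(sample_s)
    t1, c1 = time.perf_counter_ns(), read_cycles()
    return (c1 - c0) / ((t1 - t0) * 1e-9)

def measure_seconds(read_cycles, workload, tolerance=0.01, max_tries=10):
    for _ in range(max_tries):
        hz_before = estimate_cpu_hz(read_cycles)
        start = read_cycles()
        workload()
        cycles = read_cycles() - start
        hz_after = estimate_cpu_hz(read_cycles)
        if abs(hz_after - hz_before) / hz_before < tolerance:
            return cycles / hz_before          # clock looked stable: accept the result
        # the clock changed during the measurement: throw it away and retry
    raise RuntimeError("CPU clock never stabilized")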
The CPU frequency and CPU usage are the main factors that impact energy consumption (as far as I know). However, from an energy-saving perspective, which is better for running a task with minimum energy consumption:
Option 1: Maximum CPU frequency with minimum usage.
Option 2: Maximum CPU usage with minimum frequency.
Work per time scales approximately linearly with CPU frequency. (A bit less than linear because higher CPU frequency means DRAM latency is more clock cycles).
CPU power has two components: switching (dynamic) power, which scales with f^3 (because voltage has to increase for higher frequency, and the switching transistors pump that V^2 capacitor energy more often), and leakage power, which doesn't vary as dramatically. At high frequency dynamic power dominates, but as you lower the frequency, leakage eventually becomes significant. The smaller your transistors, the more significant leakage is.
System-wide, there's also other power for things like DRAM that doesn't change much or at all with CPU frequency.
Min frequency is more efficient, unless the minimum is far below the best frequency for work per energy. (Some parts of power decrease with frequency, others like leakage current and DRAM refresh don't).
Frequencies lower than max give more work per energy (better task efficiency), down to a certain point, something like 800 MHz on a Skylake CPU on Intel's 14 nm process. If there's work to be done, there's no gain from dropping below that; just race-to-sleep at that most efficient frequency. (Power would decrease below that point, but the work rate would decrease more.)
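As a toy model of that trade-off (the coefficients below are invented for illustration, not calibrated to any real CPU): dynamic power grows roughly with f^3, leakage/uncore power stays roughly flat, and the work rate grows with f, so energy per unit of work bottoms out at some intermediate frequency:

def energy_per_unit_work(f_ghz, dyn_coeff=1.0, static_w=2.0):
    power_w   = dyn_coeff * f_ghz**3 + static_w   # dynamic (f^3) plus roughly-flat leakage/uncore
    work_rate = f_ghz                             # work per second scales ~linearly with frequency
    return power_w / work_rate                    # joules per unit of work

for f in (0.4, 0.8, 1.6, 3.2):
    print(f"{f:.1f} GHz: {energy_per_unit_work(f):.2f} J per unit of work")

With these made-up coefficients the minimum lands near the second entry; below it, the flat part of the power keeps burning while the work rate falls, so energy per task rises again.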
https://en.wikichip.org/wiki/File:Intel_Architecture,_Code_Name_Skylake_Deep_Dive-_A_New_Architecture_to_Manage_Power_Performance_and_Energy_Efficiency.pdf is a slide deck from IDF2015 about Skylake power management that covered a lot of that general-case stuff well. Unfortunately I don't know where to find a copy of the audio from Efraim Rotem's talk; it was up for a year or so after, but the original link is dead now. :/
Also, in general, on dynamic power (from switching, not leakage) scaling with frequency cubed when you adjust voltage along with frequency, see Modern Microprocessors: A 90-Minute Guide! and
https://electronics.stackexchange.com/questions/614018/why-does-switching-cause-power-dissipation
https://electronics.stackexchange.com/questions/258724/why-do-cpus-need-so-much-current
https://electronics.stackexchange.com/questions/548601/why-does-decreasing-the-cmos-supply-voltage-also-decrease-the-maximum-circuit-fr
I'm trying to figure out how to put a certain type of load on the CPU that makes it work constantly but with only moderate stress.
The only approach I know for loading a CPU below its maximum possible performance is to alternate giving the CPU something to do with sleeping for some time. E.g. to achieve 20% CPU usage, do some computation that takes e.g. 0.2 seconds and then sleep for 0.8 seconds. Then the CPU usage will be roughly 20%.
However this essentially means the CPU will be jumping between max performance and idle all the time.
I wrote a small Python program where I create a process for each CPU core, set its affinity so each process runs on a designated core, and give it some absolutely meaningless load:
def actual_load_cycle():
    x = list(range(10000))
    del x
I repeatedly call this procedure in a loop and then sleep for some time, to ensure the working time is N% of the total time:
while 1:
    timer.mark_time(timer_marker)
    for i in range(coef):
        actual_load_cycle()
    elapsed = timer.get_time_since(timer_marker)
    # now we need to sleep for some time. The elapsed time is CPU_LOAD_TARGET% of 100%.
    time_to_sleep = elapsed / CPU_LOAD_TARGET * (100 - CPU_LOAD_TARGET)
    sleep(time_to_sleep)
It works well, giving a load within 7% of the desired value of CPU_LOAD_TARGET - I don't need a precise amount of load.
But it drives the CPU temperature very high: with CPU_LOAD_TARGET=35 (real CPU usage reported by the system is around 40%), the CPU temps go up to 80 degrees.
Even with a minimal target like 5%, the temps still spike, just not as much - up to 72-73 degrees.
I believe the reason is that during the working fraction of each cycle the CPU works as hard as it can, and it doesn't cool down fast enough while sleeping afterwards.
But when I'm running a game like Uncharted 4, the CPU usage as measured by MSI Afterburner is 42-47%, yet the temperatures stay under 65 degrees.
How can I achieve similar results? How can I program a load where CPU usage is high but the work itself is quite relaxed, as games seem to do?
Thanks!
The heat dissipation of a CPU mainly depends on its power consumption, which in turn depends heavily on the workload, and more precisely on the instructions being executed and the number of active cores. Modern processors are very complex, so it is very hard to predict power consumption from a given workload, especially when the executed code is Python running in the CPython interpreter.
There are many factors that can impact the power consumption of a modern processor. The most important one is frequency scaling. Mainstream x86-64 processors adapt the frequency of a core based on the kind of computation being done (e.g. use of wide SIMD floating-point vectors like the ZMM registers of AVX-512F versus scalar 64-bit integers), the number of active cores (the more active cores, the lower the frequency), the current temperature of the core, the time spent executing instructions versus sleeping, etc. On modern processors the memory hierarchy can draw a significant amount of power, so operations involving the memory controller, and more generally the RAM, can consume more power than those operating only on in-core registers. In fact, depending on the instructions actually executed, the processor needs to enable/disable some parts of its integrated circuit (e.g. SIMD units, integrated GPU, etc.), and not all of them can be enabled at the same time due to TDP constraints (see dark silicon). Floating-point SIMD instructions tend to consume more energy than integer SIMD instructions. Something even weirder: the consumption can actually depend on the input data, since transistors may switch between states more often with some data than with others (researchers found this while running matrix-multiplication kernels on different kinds of platforms with different kinds of input). The power is adapted automatically by the processor, since it would be insane (if even possible) for engineers to explicitly consider all possible cases and all possible dynamic workloads.
One of the cheapest x86 instructions is NOP, which basically means "do nothing". That said, the processor can run at its highest turbo frequency while executing a loop of NOPs, resulting in a pretty high power consumption. In fact, some processors can run NOPs in parallel on multiple execution units of a given core, keeping all the available ALUs busy. Funny point: running a chain of dependent high-latency instructions might actually reduce the power consumption of the target processor.
The MWAIT/MONITOR instructions provide hints that allow the processor to enter an implementation-dependent optimized state. This includes lower power consumption, possibly due to a lower frequency (e.g. no turbo) and the use of sleep states. Basically, your processor can sleep for a very short time to reduce its power consumption, and is then able to run at a high frequency for longer thanks to the lower power draw / heat dissipation beforehand. The behaviour is similar to humans: the deeper the sleep, the faster the processor can be afterwards, but also the longer it takes to (completely) wake up. The bad news is that these instructions require very high privileges AFAIK, so you basically cannot use them from user-land code. AFAIK there are user-land instructions for this, UMWAIT and UMONITOR, but they are not yet implemented except maybe on very recent processors. For more information, please read this post.
In practice, the default CPython interpreter consumes a lot of power because it makes a lot of memory accesses (including indirections and atomic operations), executes a lot of branches that need to be predicted by the processor (which has special, power-hungry units for that), and performs a lot of dynamic jumps across a large code base. The kind of pure-Python code executed does not reflect the actual instructions executed by the processor, since most of the time is spent in the interpreter itself. Thus, I think you need to use a lower-level language like C or C++ to better control the kind of workload being executed. Alternatively, you can use a JIT compiler like Numba to get better control while still writing Python code (though no longer pure Python). Still, keep in mind that a JIT can generate many unwanted instructions that result in an unexpectedly higher power consumption; conversely, it can also optimize trivial code away entirely (e.g. a sum from 1 to N simplified to just an N*(N+1)/2 expression).
Here is an example of code:
import numba as nb

def test(n):
    s = 1
    for i in range(1, n):
        s += i
        s *= i
        s &= 0xFF
    return s

pythonTest = test
numbaTest = nb.njit('(int64,)')(test)  # Compile the function

pythonTest(1_000_000_000)  # takes about 108 seconds
numbaTest(1_000_000_000)   # takes about 1 second
In this code, the Python function takes 108 times longer to execute than the Numba function on my machine (i5-9600KF processor), so one might expect roughly 108 times more energy to execute the Python version. However, in practice it is even worse: the pure-Python function causes the target core to draw a much higher power (not just more energy) than the equivalent compiled Numba implementation on my machine. This can be clearly seen on the temperature monitor:
Base temperature when nothing is running: 39°C
Temperature during the execution of pythonTest: 55°C
Temperature during the execution of numbaTest: 46°C
Note that my processor was running at 4.4-4.5 GHz in all cases (due to the performance governor being chosen). The temperature is read after 30 seconds in each case and it is stable (due to the cooling system). The functions are run in a while(True) loop during the benchmark.
Note that games often use multiple cores and do a lot of synchronization (at least waiting for the rendering to complete). As a result, the target processor can run at a slightly lower turbo frequency (due to TDP constraints) and have a lower temperature thanks to the small sleeps (which save energy).
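As an untested sketch of how that could be applied to the loop from the question: keep the same duty cycle, but chop it into work/sleep slices of a few milliseconds so the core alternates far more often, closer to how a frame-paced game behaves (whether this actually lowers temperatures depends on the governor and the cooling):

import time

def relaxed_load(target_pct=35, slice_s=0.005):
    # Same average duty cycle as the original loop, but in ~5 ms slices.
    while True:
        start = time.perf_counter()
        while time.perf_counter() - start < slice_s:   # short busy slice
            sum(range(1000))
        time.sleep(slice_s * (100 - target_pct) / target_pct)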
On Windows, GetProcessTimes() and QueryProcessCycleTime() can be used to get totals for all threads of an app. I expected (apparently naively) to find a proportional relationship between the total number of cycles and the total processor time (user + kernel). When converted to the same units (seconds) and expressed as a percent of the app's running time, they're not even close, and the ratio between them varies greatly.
Right after an app starts, they're fairly close.
3.6353% CPU cycles
5.2000% CPU time
0.79 Ratio
But this ratio increases as an app remains idle (below, after 11 hours, mostly idle).
0.0474% CPU cycles
0.0039% CPU time
12.16 Ratio
Apparently, cycles are counted that don't contribute to user or kernel time. I'm curious about how it works. Please enlighten me.
Thanks.
Vince
The GetProcessTimes and QueryProcessCycleTime values are calculated in different ways. GetProcessTimes/GetThreadTimes are updated in response to timer interrupts, while the QueryProcessCycleTime values are based on tracking actual thread execution time. These different ways of measuring can produce vastly different results when the two APIs are compared, especially since GetThreadTimes only credits fully completed time slices to its thread counters (see http://blog.kalmbachnet.de/?postid=28), which usually results in inaccurate timings.
Since GetProcessTimes will in general report less time than was actually spent (because a thread does not always complete its time slice), it makes sense that its CPU-time percentage decreases over time relative to the cycle-measurement percentage.
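For anyone who wants to reproduce the comparison, here is a minimal sketch (Windows only, via ctypes) that reads both counters for the current process; FILETIME values are 100-nanosecond units:

import ctypes
from ctypes import wintypes

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
kernel32.GetCurrentProcess.restype = wintypes.HANDLE
kernel32.GetProcessTimes.argtypes = [wintypes.HANDLE] + [ctypes.POINTER(wintypes.FILETIME)] * 4
kernel32.QueryProcessCycleTime.argtypes = [wintypes.HANDLE, ctypes.POINTER(ctypes.c_ulonglong)]

def filetime_to_seconds(ft):
    # FILETIME is a 64-bit count of 100-nanosecond intervals, split into two DWORDs.
    return ((ft.dwHighDateTime << 32) | ft.dwLowDateTime) * 100e-9

def process_cpu_seconds_and_cycles():
    handle = kernel32.GetCurrentProcess()          # pseudo-handle, no CloseHandle needed
    creation, exit_, kernel, user = (wintypes.FILETIME() for _ in range(4))
    kernel32.GetProcessTimes(handle, ctypes.byref(creation), ctypes.byref(exit_),
                             ctypes.byref(kernel), ctypes.byref(user))
    cycles = ctypes.c_ulonglong(0)
    kernel32.QueryProcessCycleTime(handle, ctypes.byref(cycles))
    return filetime_to_seconds(kernel) + filetime_to_seconds(user), cycles.value

cpu_seconds, cpu_cycles = process_cpu_seconds_and_cycles()
print(cpu_seconds, cpu_cycles)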
Is there an x86 instruction to get the current time?
Basically... something like a replacement for clock_get_time ... something with the minimum overhead... where I don't really care about getting the time in any specific format... as long as it's a format I can use.
Basically I'm doing some work to "Detect how much PHYSICAL REAL LIFE TIME" has gone by... and I want to be able to measure time as frequently as possible!
I guess you can imagine i'm doing something like a profiling app... :)
I really need aggressively efficient access to the hardware time. So ideally... some ASM to get the time... store it somewhere... then massage it later into some format that I can actually process.
I'm not interested in _rdtsc, as that measures the number of cycles gone by. I need to know how much physical time has elapsed... not cycles, which can vary due to thermal fluctuations and so on.
For profiling, often it's most useful to profile in terms of CPU clock cycles, rather than wall-clock time. CPU dynamic clocking (turbo and power saving) makes it annoying to get the CPU ramped up to full speed before the start of a measurement period.
If you still need wall-clock time after that:
Recent x86 CPUs have a TSC that runs at a fixed rate, regardless of CPU frequency adjustment for power-saving. Also, the TSC doesn't stop when the CPU is halted. (i.e. no work to do, so it ran the HLT instruction to wait for an interrupt in low-power mode.)
It turned out that efficient access to a useful time-source was more useful to have in hardware than an actual clock cycle counter, so that's what RDTSC morphed into, a few CPU generations after its introduction. Now we're back to using hardware performance counters for measuring clock cycles.
On Linux, look for constant_tsc and nonstop_tsc in the CPU feature flags in /proc/cpuinfo. IDK if there are CPUID bits for those; if not, use Linux's detection code for it (if you can use GPLed code).
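A quick way to check those flags from Python (Linux only; this just scans /proc/cpuinfo as described above):

def tsc_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                return {name: (name in flags) for name in ("tsc", "constant_tsc", "nonstop_tsc")}
    return {}

print(tsc_flags())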
On a CPU with those two key features, Linux uses the TSC as its clocksource, IIRC.
The lowest-overhead way to get the current time in user-space will be to work out the conversion between RDTSC ticks and real time. While profiling, you might just store 64-bit TSC snapshots and convert to real time later (so you can handle TSC wraparound then). RDTSC only takes about 24 cycles (Agner Fog's instruction table, Intel Haswell). I think the overhead of a system call will be an order of magnitude higher than that. (The kernel will have to do a RDTSC in there somewhere anyway.)
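A sketch of that calibrate-then-convert idea, assuming a hypothetical read_tsc() helper (plain Python cannot execute RDTSC, so that part would be a tiny C or ctypes shim):

import time

def calibrate_tsc_hz(read_tsc, interval_s=0.5):
    # Estimate TSC ticks per second against the OS clock, once, up front.
    t0, c0 = time.perf_counter_ns(), read_tsc()
    time.sleep(interval_s)
    t1, c1 = time.perf_counter_ns(), read_tsc()
    return (c1 - c0) / ((t1 - t0) * 1e-9)

def ticks_to_seconds(start_tsc, end_tsc, tsc_hz):
    # Snapshots are 64-bit; the mask keeps the subtraction well-defined even across wraparound.
    return ((end_tsc - start_tsc) & (2**64 - 1)) / tsc_hz

During profiling you would only store the raw snapshots and call ticks_to_seconds() afterwards.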
Agner Fog has documented his profiling / timing methods, and has some example code. I haven't looked recently, but it might have useful stuff for this application.
I've been trying to measure/monitor the utilization of all 60 cores on a Xeon Phi (Knights Corner, in-order cores) at a relatively high frequency, say at least every 0.1 s, i.e. 10 Hz.
I tried the latest PAPI library, but it only supports PAPI_TOT_INS, which counts completed instructions. This won't work because I actually need something related to the instructions issued in each 0.1 s window, not the ones that have finished: several instructions issued at different cycles may finish in the same cycle, and instruction issue is affected by whether the core is halted or not.
Other available tools like top and perf operate at 1 Hz, which is too slow for my measurement; I need a higher frequency. I also need to synchronize the measurement with critical phases of my code, so Intel VTune does not work for me either.
Is there a way for me to monitor instruction issue on the Xeon Phi, or any other activity linked to core utilization? I understand that the hardware counters are there, but reading them seems very challenging to me. Maybe I can deduce this utilization by measuring the CPU time of each thread?
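As a rough sketch of that last idea (per-process rather than per-thread or per-core, so only an approximation): sample the process CPU time at 10 Hz and turn each window into an average utilization:

import os
import time

def sample_utilization(period=0.1, ncores=None):
    # Yields the average utilization of this process over each `period` window,
    # normalized by the core count (1.0 means every core was busy the whole window).
    ncores = ncores or os.cpu_count()
    prev_cpu, prev_wall = time.process_time(), time.perf_counter()
    while True:
        time.sleep(period)
        cpu, wall = time.process_time(), time.perf_counter()
        yield (cpu - prev_cpu) / ((wall - prev_wall) * ncores)
        prev_cpu, prev_wall = cpu, wall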
Thanks.