CPU Benchmark lookup or similar value rating results in C# - performance

I normally look up a CPU Benchmark rating on https://www.cpubenchmark.net/
Is there a way I can get such a rating programmatically when checking a system's capabilities?
For example: the user clicks a button, the app determines a CPU Benchmark rating of 7653, and I flag the CPU as not good enough because the minimum required benchmark is 10000.

Related

Type of CPU load to perform constant relaxed work

I'm trying to figure out how to program a certain type of CPU load that makes the CPU work constantly but only under moderate stress.
The only approach I know for loading a CPU with work at less than its maximum possible performance is to alternate giving the CPU something to do with sleeping for some time. E.g. to achieve 20% CPU usage, do some computation that takes e.g. 0.2 seconds and then sleep for 0.8 seconds. The CPU usage will then be roughly 20%.
However, this essentially means the CPU keeps jumping between maximum performance and idle all the time.
I wrote a small Python program that creates a process for each CPU core, sets its affinity so each process runs on a designated core, and gives it some absolutely meaningless load:
def actual_load_cycle():
    x = list(range(10000))
    del x
while repeating calls to this procedure in a loop and then sleeping for some time so that the working time is N% of the total time:
while 1:
    timer.mark_time(timer_marker)
    for i in range(coef):
        actual_load_cycle()
    elapsed = timer.get_time_since(timer_marker)
    # now we need to sleep for some time. The elapsed time is CPU_LOAD_TARGET% of 100%.
    time_to_sleep = elapsed / CPU_LOAD_TARGET * (100 - CPU_LOAD_TARGET)
    sleep(time_to_sleep)
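For instance, with CPU_LOAD_TARGET = 20 and a measured elapsed of 0.2 s, this gives time_to_sleep = 0.2 / 20 * (100 - 20) = 0.8 s, i.e. exactly the 0.2 s work / 0.8 s sleep split described above.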
It works well, keeping the load within 7% of the desired CPU_LOAD_TARGET - I don't need a precise amount of load.
But it drives the CPU temperature very high: with CPU_LOAD_TARGET=35 (the real CPU usage reported by the system is around 40%) the CPU temps go up to 80 degrees.
Even with a minimal target like 5%, the temps spike, just not as much - up to 72-73 degrees.
I believe the reason is that during the working fraction of each cycle the CPU runs as hard as it can, and it doesn't cool down fast enough while sleeping afterwards.
But when I'm running a game, like Uncharted 4, the CPU usage as measured by MSI Afterburner is 42-47%, yet the temperatures stay under 65 degrees.
How can I achieve similar results - how can I program a load that keeps CPU usage high but leaves the work itself relaxed, as games seem to do?
Thanks!
The heat dissipation of a CPU mainly depends on its power consumption, which in turn depends heavily on the workload - more precisely, on the instructions being executed and the number of active cores. Modern processors are very complex, so it is very hard to predict the power consumption for a given workload, especially when the executed code is Python code running in the CPython interpreter.
There are many factors that can impact the power consumption of a modern processor. The most important one is frequency scaling. Mainstream x86-64 processors adapt the frequency of a core based on the kind of computation being done (e.g. use of wide SIMD floating-point vectors like the ZMM registers of AVX-512F versus scalar 64-bit integers), the number of active cores (the more cores are active, the lower the frequency), the current temperature of the core, the time spent executing instructions versus sleeping, etc. On modern processors the memory hierarchy can draw a significant amount of power, so operations involving the memory controller and more generally the RAM can consume more power than ones operating on in-core registers. In fact, depending on the instructions actually executed, the processor needs to enable/disable parts of its integrated circuit (e.g. SIMD units, integrated GPU, etc.), and not all of them can be enabled at the same time due to TDP constraints (see dark silicon). Floating-point SIMD instructions tend to consume more energy than integer SIMD instructions. Something even weirder: the consumption can actually depend on the input data, since transistors may switch between states more frequently with some data (researchers found this while running matrix multiplication kernels on different kinds of platforms with different kinds of input). The power is adapted automatically by the processor, since it would be insane (if even possible) for engineers to consider all possible cases and all possible dynamic workloads.
One of the cheapest x86 instructions is NOP, which basically means "do nothing". That being said, the processor can run at the highest turbo frequency while executing a loop of NOPs, resulting in a pretty high power consumption. In fact, some processors can run NOPs in parallel on multiple execution units of a given core, keeping all the available ALUs busy. Funny point: running dependent instructions with a high latency can actually reduce the power consumption of the target processor.
The MWAIT/MONITOR instructions provide hints that allow the processor to enter an implementation-dependent optimized state. This includes a lower power consumption, possibly due to a lower frequency (e.g. no turbo) and the use of sleep states. Basically, your processor can sleep for a very short time so as to reduce its power consumption, and then be able to use a high frequency for longer thanks to the lower power draw / heat dissipation beforehand. The behaviour is similar to humans: the deeper the sleep, the faster the processor can be afterwards, but also the longer it takes to (completely) wake up. The bad news is that these instructions require very high privileges AFAIK, so you basically cannot use them from user-land code. AFAIK there are user-land instructions for this, UMWAIT and UMONITOR, but they are not yet implemented except maybe in very recent processors. For more information, please read this post.
In practice, the default CPython interpreter consumes a lot of power because it makes a lot of memory accesses (including indirections and atomic operations), performs a lot of branches that need to be predicted by the processor (which has special power-hungry units for that), and does a lot of dynamic jumps across a large code base. The kind of pure-Python code executed does not reflect the actual instructions executed by the processor, since most of the time is spent in the interpreter itself. Thus, I think you need to use a lower-level language like C or C++ to better control the kind of workload being executed. Alternatively, you can use a JIT compiler like Numba so as to have better control while still writing Python code (though no longer pure Python). Still, keep in mind that a JIT can generate many unwanted instructions that result in an unexpectedly higher power consumption, or it can optimize trivial code away, like a sum from 1 to N being simplified to just an N*(N+1)/2 expression.
Here is an example of code:
import numba as nb

def test(n):
    s = 1
    for i in range(1, n):
        s += i
        s *= i
        s &= 0xFF
    return s

pythonTest = test
numbaTest = nb.njit('(int64,)')(test)  # Compile the function

pythonTest(1_000_000_000)  # takes about 108 seconds
numbaTest(1_000_000_000)   # takes about 1 second
In this code, the Python function takes about 108 times longer to execute than the Numba function on my machine (i5-9600KF processor), so one might expect roughly 108 times more energy to be needed for the Python version. In practice it is even worse: the pure-Python function causes the target core to draw a much higher power (not just more energy) than the equivalent compiled Numba implementation on my machine. This can be clearly seen on the temperature monitor:
Base temperature when nothing is running: 39°C
Temperature during the execution of pythonTest: 55°C
Temperature during the execution of numbaTest: 46°C
Note that my processor was running at 4.4-4.5 GHz in all cases (due to the performance governor being chosen). The temperature is read after 30 seconds in each case and is stable (thanks to the cooling system). The functions are run in a while(True) loop during the benchmark.
Note that games often use multiple cores and do a lot of synchronization (at least to wait for the rendering to complete). As a result, the target processor can use a slightly lower turbo frequency (due to TDP constraints) and stay at a lower temperature thanks to the small sleeps (which save energy).
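As a rough, untested sketch of that last point in Python: instead of one long burst per second, the same duty cycle can be split into much shorter work/sleep quanta spread across all cores, so no core runs flat-out for long. The CPU_LOAD_TARGET and PERIOD values below are arbitrary illustrative choices, not something from the original post:

import multiprocessing as mp
import time

CPU_LOAD_TARGET = 35   # percent of each period spent working (illustrative)
PERIOD = 0.05          # 50 ms periods instead of ~1 s bursts

def light_worker(target, period):
    busy = period * target / 100.0
    idle = period - busy
    while True:
        end = time.perf_counter() + busy
        while time.perf_counter() < end:   # short busy spin
            pass
        time.sleep(idle)                   # then yield the core

if __name__ == "__main__":
    for _ in range(mp.cpu_count()):
        mp.Process(target=light_worker, args=(CPU_LOAD_TARGET, PERIOD), daemon=True).start()
    time.sleep(60)                         # keep the load running for a minute

Note that this only reshapes the duty cycle; as explained above, the instructions actually executed during the busy phase still largely determine the power draw.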

What is an appropriate model for detecting CPU usage abnormality?

I am currently working on a model for detecting CPU usage abnormality for our regression test cases. The goal is to keep tracking the CPU usage day after day, and raise an alarm if the CPU usage of any particular test case rockets up so that the relevant developer can kick in in time for investigation.
The model I've figured out is like this:
Measure the CPU usage of the target process every second: read /proc/<pid>/stat and divide the CPU time by the wall time elapsed during the past second (which means the wall time is always 1).
When the test case finishes, we get an array of CPU usage data (its length depends on how long the test case runs). Go through this array and produce a summary array with 5 elements representing the distribution of CPU usage -- [0~20%, 20~40%, 40~60%, 60~80%, +80%], such as [0.1, 0.2, 0.2, 0.4, 0.1].
0~20%: How many measurements over the total measurements have CPU usage between 0 and 20%
20~40%: How many measurements over the total measurements have CPU usage between 20% and 40%
40~60%: How many measurements over the total measurements have CPU usage between 40% and 60%
60~80%: How many measurements over the total measurements have CPU usage between 60% and 80%
+80%: How many measurements over the total measurements have CPU usage over 80% (actually the CPU usage might be higher than 100% if multi-threading is enabled)
The sum of this summary array should be 1.
Then I do calculation on the summary array and get a score with this formula:
summary[0]*1 + summary[1]*1 + summary[2]*2 + summary[3]*4 + summary[4]*8
The principle is: higher CPU usage gets higher penalty.
The higher the final score is, the higher the overall CPU usage is. For any particular test case, I can get one score every day. If the score rockets up some day, an alarm is raised.
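Here is a minimal Python sketch of the bucketing and scoring described above; the function name and the example samples are mine, only the bin edges and weights come from the model:

def cpu_usage_score(samples):
    # samples: per-second CPU usage values, e.g. 0.37 for 37%
    # (values may exceed 1.0 when the process is multi-threaded)
    bins = [0, 0, 0, 0, 0]
    for u in samples:
        if u < 0.2:
            bins[0] += 1
        elif u < 0.4:
            bins[1] += 1
        elif u < 0.6:
            bins[2] += 1
        elif u < 0.8:
            bins[3] += 1
        else:
            bins[4] += 1
    summary = [b / len(samples) for b in bins]   # sums to 1
    weights = [1, 1, 2, 4, 8]                    # higher usage gets a higher penalty
    return sum(s * w for s, w in zip(summary, weights))

# a mostly-idle run scores lower than a run spending most of its time above 80%
print(cpu_usage_score([0.05, 0.10, 0.15, 0.30, 0.50]))   # ~1.2
print(cpu_usage_score([0.90, 0.85, 0.95, 0.30, 0.50]))   # ~5.4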
To verify this model, I've repeatedly run a randomly selected set of test cases a few hundred times, on different machines. I found that on any particular host the scores fluctuate, but within a reasonable range. However, different hosts live in different ranges.
This is a diagram showing the scores of 600 runs for a particular test case. Different colors indicate different hosts.
Here are a few questions:
Does it make sense that the scores on different machines locate in different ranges?
If I want to trace CPU usage for a test case, should I run it merely on one dedicated machine day after day?
Is there any better model for tracing CPU usage and raising alarms?
You should have an O(n) model for every function, based on what you believe it should be doing. Then generate your regression cases against it by varying n. If the function does not display the expected performance as n increases, something is wrong. If it is too good, the problem is in your expectations and test cases.
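As a hedged illustration of that idea (the helper names and the tolerance value are mine, not from the answer), a regression check could time a function at a few input sizes and flag it when its growth deviates too far from the expected O(n) model:

import time

def timed(fn, n):
    start = time.perf_counter()
    fn(n)
    return time.perf_counter() - start

def check_linear(fn, sizes=(1_000_000, 2_000_000, 4_000_000), tolerance=2.0):
    # expect runtime to grow roughly proportionally to n; fail when it grows
    # more than `tolerance` times faster than the O(n) model predicts
    times = [timed(fn, n) for n in sizes]
    for i in range(1, len(sizes)):
        expected = sizes[i] / sizes[i - 1]
        actual = times[i] / times[i - 1]
        if actual > expected * tolerance:
            return False
    return True

print(check_linear(lambda n: sum(range(n))))   # a genuinely O(n) function passes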
Another approach is to come up with a theoretical ideal-run metric. Such ideal runs tend to save time via hard-coding or lookups in massive tables. The additional error checking etc. that a real implementation requires will then show up as reduced efficiency compared with the ideal run.
The only way to prove that CPU use is bad is to come up with a better alternative. If you could do so, then you should have designed your dev process to produce the better algorithm in the first place!

How to interpret ETW graphs for sampled and precise CPU usage when they contradict

I'm having difficulties pinning down where our application is spending its time.
Looking at the flame graphs of an ETW trace for the sampled and the precise CPU usage, they contradict each other. Below are the graphs for a 1 second duration.
According to the "CPU Usage (Sampled)" graph
vbscript.dll!COleScript::ParseScriptText is a big contributor in the total performance.
ws2_32.dll!recv is a small contributor.
According to the "CPU Usage (Precise)" graph
Essentially, this shows it's the other way around?
vbscript.dll!COleScript::ParseScriptText is a small contributor, only taking up 3.95 ms of CPU.
ws2_32.dll!recv is a big contributor, taking up 915.09 ms of CPU.
What am I missing or misinterpreting?
CPU Usage (Sampled)
CPU Usage (Precise)
There is a simple explanation:
CPU Usage (Precise) is based on context switch data and can therefore give an extremely accurate measure of how much time a thread spends on-CPU - how much time it is running. However, because it is based on context switch data, it only knows what the call stack is when a thread goes on/off the CPU. It has no idea what the thread is doing in between. Therefore, CPU Usage (Precise) knows how much time a thread is using, but it has no idea where that time is spent.
CPU Usage (Sampled) is less accurate in regards to how much CPU time a thread consumes, but it is quite good (statistically speaking, absent systemic bias) at telling you where time is spent.
With CPU Usage (Sampled) you still need to be careful about inclusive versus exclusive time (time spent in a function versus spent in its descendants) in order to interpret data correctly but it sounds like that data is what you want.
For more details see https://randomascii.wordpress.com/2015/09/24/etw-central/ which has documentation of all of the columns for these two tables, as well as case studies of using them both (one to investigate CPU usage scenarios, the other to investigate CPU idle scenarios)

jvmtop CPU usage >100%

I've been using jvmtop for a few months to monitor JVM statistics. Having compared its output with JConsole, I've mostly observed similar stats in jvmtop as well.
However, during a recent test execution I observed a few entries where CPU% went above 100% (120% being the highest). Since jvmtop, as I understand it, reports cumulative CPU usage (unlike top, which gives more core-wise detail), I need guidance on how to interpret these beyond-100% readings.
I have looked at the jvmtop code and can only conclude that its algorithm for computing CPU usage is completely screwed up. Let me explain.
In the simplified form, the formula looks like
CPU_usage = Δ processCpuTime / ( Δ processRunTime * CPU_count)
The idea makes sense, but the problem is how it is actually implemented. I see at least 3 issues:
Process CPU time and process uptime are obtained by two independent method calls. Furthermore, those are remote calls made via RMI. This means there can be an arbitrarily long delay between obtaining those values. So Δ processCpuTime and Δ processRunTime are computed over different time intervals, and it's not quite correct to divide one by the other. When Δ processCpuTime is computed over a longer period than Δ processRunTime, the result may happen to be > 100%.
Process CPU time is based on the OperatingSystemMXBean.getProcessCpuTime() call, while the process uptime relies on RuntimeMXBean.getUptime(). The problem is that these two methods use different clock sources and different time scales. Generally speaking, the times obtained from these methods are not comparable to each other.
CPU_count is computed as OperatingSystemMXBean.getAvailableProcessors(). However, the number of logical processors visible to the JVM is not always equal to the number of physical processors on the machine. For example, in a container getAvailableProcessors() may return a value based on cpu-shares, while in practice the JVM may use more physical cores. In this case CPU usage may again appear to be > 100%.
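For comparison, here is a minimal Python sketch of the same formula applied to the current Python process (not a JVM): both deltas come from one pair of back-to-back samples on consistent clocks, which avoids the mismatch described in the first two points:

import os
import time

def cpu_usage(interval=1.0):
    cpus = os.cpu_count()
    cpu0, wall0 = time.process_time(), time.monotonic()   # sampled back to back
    time.sleep(interval)
    cpu1, wall1 = time.process_time(), time.monotonic()
    return (cpu1 - cpu0) / ((wall1 - wall0) * cpus)        # fraction of all cores

print(f"{cpu_usage() * 100:.1f}%")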
However, as far as I can see, in the latest version the final value of the CPU load is artificially capped at 99%:
return Math.min(99.0,
deltaTime / (deltaUptime * osBean.getAvailableProcessors()));

CPU cycles vs. total CPU time

On Windows, GetProcessTimes() and QueryProcessCycleTime() can be used to get totals for all threads of an app. I expected (apparently naively) to find a proportional relationship between the total number of cycles and the total processor time (user + kernel). But when converted to the same units (seconds) and expressed as a percentage of the app's running time, they're not even close, and the ratio between them varies greatly.
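For reference, a rough Windows-only sketch in Python (via ctypes) of reading the two totals for the current process so they can be converted and compared as described; the helper names are mine, not part of any library:

import ctypes
from ctypes import wintypes

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
kernel32.GetCurrentProcess.restype = wintypes.HANDLE
kernel32.GetProcessTimes.argtypes = [wintypes.HANDLE] + [ctypes.POINTER(wintypes.FILETIME)] * 4
kernel32.QueryProcessCycleTime.argtypes = [wintypes.HANDLE, ctypes.POINTER(ctypes.c_ulonglong)]

def filetime_to_seconds(ft):
    # FILETIME counts 100-nanosecond ticks
    return ((ft.dwHighDateTime << 32) | ft.dwLowDateTime) / 10_000_000

def process_cpu_seconds_and_cycles():
    hproc = kernel32.GetCurrentProcess()
    creation, exit_, kernel, user = (wintypes.FILETIME() for _ in range(4))
    kernel32.GetProcessTimes(hproc, ctypes.byref(creation), ctypes.byref(exit_),
                             ctypes.byref(kernel), ctypes.byref(user))
    cycles = ctypes.c_ulonglong()
    kernel32.QueryProcessCycleTime(hproc, ctypes.byref(cycles))
    return filetime_to_seconds(kernel) + filetime_to_seconds(user), cycles.value

cpu_seconds, cpu_cycles = process_cpu_seconds_and_cycles()
print(cpu_seconds, cpu_cycles)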
Right after an app starts, they're fairly close.
3.6353% CPU cycles
5.2000% CPU time
0.79 Ratio
But this ratio increases as an app remains idle (below, after 11 hours, mostly idle).
0.0474% CPU cycles
0.0039% CPU time
12.16 Ratio
Apparently, cycles are counted that don't contribute to user or kernel time. I'm curious about how it works. Please enlighten me.
Thanks.
Vince
The GetProcessTimes and QueryProcessCycleTime values are calculated in different ways. GetProcessTimes/GetThreadTimes are updated in response to timer interrupts, while QueryProcessCycleTime values are based on tracking actual thread execution time. These different measuring methods can produce vastly different results when the two APIs are compared, especially since GetThreadTimes only counts fully completed time slices towards its thread counters (see http://blog.kalmbachnet.de/?postid=28), which usually results in inaccurate timings.
Since GetProcessTimes will in general report less time than was actually spent (because time slices are not always completed), it makes sense that its CPU time percentage decreases over time relative to the cycle measurement percentage.
