What is an appropriate model for detecting CPU usage abnormality?

I am currently working on a model for detecting CPU usage abnormalities in our regression test cases. The goal is to keep tracking the CPU usage day after day and raise an alarm if the CPU usage of any particular test case shoots up, so that the relevant developer can step in for investigation in time.
The model I've figured out is like this:
Measure the CPU usage of the target process every second: read /proc/<pid>/stat and divide the CPU time consumed by the wall time elapsed over the past second (so the wall time is always 1 second).
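A minimal sketch of this sampling step, assuming the standard /proc/<pid>/stat layout (utime and stime are fields 14 and 15, in clock ticks); the function names are hypothetical:

import os
import time

CLK_TCK = os.sysconf('SC_CLK_TCK')  # kernel clock ticks per second, usually 100

def read_cpu_ticks(pid):
    # total CPU time (utime + stime) of the process, in clock ticks
    with open('/proc/%d/stat' % pid) as f:
        stat = f.read()
    # the command name (field 2) may contain spaces, so split after the last ')'
    fields = stat.rsplit(')', 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15 overall
    return utime + stime

def sample_cpu_usage(pid):
    # yield one CPU-usage sample per second (may exceed 1.0 for multi-threaded processes)
    prev = read_cpu_ticks(pid)
    while True:
        time.sleep(1)
        cur = read_cpu_ticks(pid)
        yield (cur - prev) / float(CLK_TCK)  # CPU seconds used during ~1 wall second
        prev = cur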
When the test case finishes, we get an array of CPU usage samples (how long the array is depends on how long the test case runs). I go through this array and reduce it to a 5-element summary array that represents the distribution of CPU usage over the buckets [0~20%, 20~40%, 40~60%, 60~80%, +80%], for example [0.1, 0.2, 0.2, 0.4, 0.1]:
0~20%: the fraction of all measurements with CPU usage between 0% and 20%
20~40%: the fraction of all measurements with CPU usage between 20% and 40%
40~60%: the fraction of all measurements with CPU usage between 40% and 60%
60~80%: the fraction of all measurements with CPU usage between 60% and 80%
+80%: the fraction of all measurements with CPU usage over 80% (the usage can actually exceed 100% if the process is multi-threaded)
The sum of this summary array should be 1.
Then I compute a score from the summary array with this formula:
summary[0]*1 + summary[1]*1 + summary[2]*2 + summary[3]*4 + summary[4]*8
The principle is: higher CPU usage gets a higher penalty.
The higher the final score, the higher the overall CPU usage. For any particular test case, I get one score every day. If the score shoots up some day, an alarm is raised.
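A minimal sketch of the summary and scoring steps described above; the weights follow the formula given, and the function name is hypothetical:

def score_samples(samples):
    # samples: per-second CPU-usage values, e.g. 0.37 for 37%
    buckets = [0, 0, 0, 0, 0]                       # [0-20%, 20-40%, 40-60%, 60-80%, 80%+]
    for u in samples:
        if u < 0.2:
            buckets[0] += 1
        elif u < 0.4:
            buckets[1] += 1
        elif u < 0.6:
            buckets[2] += 1
        elif u < 0.8:
            buckets[3] += 1
        else:
            buckets[4] += 1
    total = float(len(samples))
    summary = [b / total for b in buckets]          # sums to 1
    weights = [1, 1, 2, 4, 8]                       # higher usage, higher penalty
    return sum(s * w for s, w in zip(summary, weights))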
To verify this model, I repeatedly ran a randomly selected set of test cases a few hundred times, on different machines. I found that on any particular host the scores fluctuate, but within a reasonable range; however, different hosts settle in different ranges.
This is a diagram showing the scores of 600 runs for a particular test case. Different colors indicate different hosts.
Here are a few questions:
Does it make sense that the scores on different machines fall in different ranges?
If I want to track CPU usage for a test case, should I run it on just one dedicated machine day after day?
Is there a better model for tracking CPU usage and raising alarms?

You should have a complexity (big-O) model for every function, based on what you believe it should be doing. Then generate your regression cases against it by varying n. If the function does not show the expected performance as n increases, something is wrong. If it performs too well, the problem is in your expectations and test cases.
Another approach is to come up with a theoretical ideal-run metric. Ideal runs tend to save time by hard-coding results or looking them up in massive tables; any additional error checking and so on required in the real implementation will detract from its efficiency compared to the ideal run.
The only way to prove that the CPU usage is bad is to come up with a better alternative. If you could do so, then you should have designed your development process to produce the better algorithm in the first place!
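As a rough illustration of the first suggestion, one could fit the measured runtimes against the input size and alarm on an unexpected scaling exponent; the names, measurements, and tolerance below are hypothetical:

import numpy as np

def scaling_exponent(ns, runtimes):
    # fit runtime ~ c * n^k on a log-log scale and return the exponent k
    k, _ = np.polyfit(np.log(ns), np.log(runtimes), 1)
    return k

# Example: a routine believed to be O(n) should give an exponent close to 1.
ns = [1_000, 10_000, 100_000, 1_000_000]
runtimes = [0.002, 0.021, 0.198, 2.05]           # hypothetical measurements, in seconds
k = scaling_exponent(ns, runtimes)
if abs(k - 1.0) > 0.3:                           # the tolerance is a judgment call
    print("unexpected scaling: measured exponent %.2f" % k)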

Related

Type of CPU load to perform constant relaxed work

I'm trying to figure out how to program a certain type of CPU load that makes the CPU work constantly but with only moderate stress.
The only approach I know for loading a CPU with work without running it at its maximum possible performance is to alternate between giving the CPU something to do and sleeping for some time. E.g., to achieve 20% CPU usage, do some computation that takes e.g. 0.2 seconds and then sleep for 0.8 seconds; the CPU usage will then be roughly 20%.
However, this essentially means the CPU will be jumping between maximum performance and idle all the time.
I wrote a small Python program that spawns a process for each CPU core, sets its affinity so that each process runs on a designated core, and gives it some absolutely meaningless load:
def actual_load_cycle():
    x = list(range(10000))
    del x
while repeating calls to this procedure in a loop and then sleeping for some time, so that the working time is N% of the total time:
while 1:
    timer.mark_time(timer_marker)
    for i in range(coef):
        actual_load_cycle()
    elapsed = timer.get_time_since(timer_marker)
    # now we need to sleep for some time. The elapsed time is CPU_LOAD_TARGET% of 100%.
    time_to_sleep = elapsed / CPU_LOAD_TARGET * (100 - CPU_LOAD_TARGET)
    sleep(time_to_sleep)
It works well, keeping the load within 7% of the desired CPU_LOAD_TARGET value; I don't need a precise amount of load.
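For reference, a self-contained sketch of the same duty-cycle idea using only the standard library (time.monotonic() stands in for the timer helper above; the coef and CPU_LOAD_TARGET values are arbitrary):

import time

CPU_LOAD_TARGET = 20     # desired CPU usage, in percent
coef = 50                # busy iterations per cycle

def actual_load_cycle():
    x = list(range(10000))
    del x

while True:
    start = time.monotonic()
    for _ in range(coef):
        actual_load_cycle()
    elapsed = time.monotonic() - start
    # the busy time should be CPU_LOAD_TARGET% of the whole cycle
    time.sleep(elapsed / CPU_LOAD_TARGET * (100 - CPU_LOAD_TARGET))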
But it drives the temperature of the CPU very high: with CPU_LOAD_TARGET=35 (the real CPU usage reported by the system is around 40%) the CPU temperature goes up to 80 degrees.
Even with a minimal target like 5%, the temperatures spike, maybe just not as much: up to 72-73 degrees.
I believe the reason for this is that during the busy part of each cycle the CPU works as hard as it can, and it doesn't cool down fast enough while sleeping afterwards.
But when I'm running a game, like Uncharted 4, the CPU usage as measured by MSI Afterburner is 42-47%, but the temperatures stay under 65 degrees.
How can I achieve similar results? How can I program the load so that CPU usage is high but the work itself is quite relaxed, as is done e.g. in games?
Thanks!
The heat dissipation of a CPU depends mainly on its power consumption, which is highly dependent on the workload, and more precisely on the instructions being executed and the number of active cores. Modern processors are very complex, so it is very hard to predict the power consumption of a given workload, especially when the executed code is Python code running in the CPython interpreter.
There are many factors that can impact the power consumption of a modern processor. The most important one is frequency scaling. Mainstream x86-64 processors can adapt the frequency of a core based on the kind of computation being done (e.g. use of wide SIMD floating-point vectors like the ZMM registers of AVX-512F vs. scalar 64-bit integers), the number of active cores (the more cores are active, the lower the frequency), the current temperature of the core, the time spent executing instructions vs. sleeping, etc. On modern processors the memory hierarchy can draw a significant amount of power, so operations involving the memory controller, and more generally the RAM, can consume more power than ones operating on in-core registers. In fact, depending on the instructions actually executed, the processor needs to enable/disable some parts of its integrated circuit (e.g. SIMD units, the integrated GPU, etc.), and not all of them can be enabled at the same time due to TDP constraints (see dark silicon). Floating-point SIMD instructions tend to consume more energy than integer SIMD instructions. Something even weirder: power consumption can actually depend on the input data, since transistors may switch between states more frequently with some data (researchers found this while running matrix-multiplication kernels on different kinds of platforms with different kinds of input). The power is adapted automatically by the processor, since it would be insane (if even possible) for engineers to consider all possible cases and all possible dynamic workloads.
One of the cheapest x86 instructions is NOP, which basically means "do nothing". That said, the processor can run at its highest turbo frequency while executing a loop of NOPs, resulting in a pretty high power consumption. In fact, some processors can run NOPs in parallel on multiple execution units of a given core, keeping all the available ALUs busy. Funny point: running dependent instructions with a high latency might actually reduce the power consumption of the target processor.
The MWAIT/MONITOR instructions provide hints that allow the processor to enter an implementation-dependent optimized state. This includes lower power consumption, possibly due to a lower frequency (e.g. no turbo) and the use of sleep states. Basically, the processor can sleep for a very short time so as to reduce its power consumption, and can then run at a high frequency for longer thanks to the lower power draw and heat dissipation beforehand. The behaviour is similar to humans: the deeper the sleep, the faster the processor can be afterwards, but also the longer it takes to (completely) wake up. The bad news is that these instructions require very high privileges AFAIK, so you basically cannot use them from user-land code. AFAIK there are user-land instructions for this, UMWAIT and UMONITOR, but they are not yet implemented except maybe in very recent processors. For more information, please read this post.
In practice, the default CPython interpreter consumes a lot of power because it makes a lot of memory accesses (including indirections and atomic operations), performs a lot of branches that need to be predicted by the processor (which has special power-hungry units for that), and does a lot of dynamic jumps across a large code base. The kind of pure-Python code executed does not reflect the actual instructions executed by the processor, since most of the time is spent in the interpreter itself. Thus, I think you need to use a lower-level language like C or C++ to better control the kind of workload being executed. Alternatively, you can use a JIT compiler like Numba so as to have better control while still writing Python code (though not pure Python anymore). Still, keep in mind that the JIT can generate many unwanted instructions that result in an unexpectedly higher power consumption, and conversely a JIT compiler can optimize away trivial code, like reducing a sum from 1 to N to just the expression N*(N+1)/2.
Here is an example of code:
import numba as nb

def test(n):
    s = 1
    for i in range(1, n):
        s += i
        s *= i
        s &= 0xFF
    return s

pythonTest = test
numbaTest = nb.njit('(int64,)')(test)  # Compile the function

pythonTest(1_000_000_000)  # takes about 108 seconds
numbaTest(1_000_000_000)   # takes about 1 second
In this code, the Python function takes 108 times longer to execute than the Numba function on my machine (an i5-9600KF processor), so one should expect roughly 108 times more energy to be needed for the Python version. However, in practice it is even worse: the pure-Python function causes the target core to consume a much higher power (not just more energy) than the equivalent compiled Numba implementation on my machine. This can be clearly seen on the temperature monitor:
Base temperature when nothing is running: 39°C
Temperature during the execution of pythonTest: 55°C
Temperature during the execution of numbaTest: 46°C
Note that my processor was running at 4.4-4.5 GHz in all cases (due to the performance governor being chosen). The temperature is read after 30 seconds in each case and is stable (thanks to the cooling system). The functions are run in a while(True) loop during the benchmark.
Note that games often use multiple cores and do a lot of synchronization (at least to wait for the rendering part to be completed). As a result, the target processor can run at a slightly lower turbo frequency (due to TDP constraints) and stay at a lower temperature thanks to the small sleeps (which save energy).

jvmtop CPU usage >100%

I've been using jvmtop for a few months to monitor JVM statistics. I have cross-checked its output against JConsole and have mostly observed similar stats in jvmtop as well.
However, during a recent test execution I observed a few CPU% entries go above 100% (120% being the highest). Since I believe jvmtop reports cumulative CPU usage (unlike top, which gives more core-wise detail), I need guidance on how to interpret these entries of more than 100% usage.
I have looked at the jvmtop code, and I can only conclude that its algorithm for computing CPU usage is completely screwed up. Let me explain.
In the simplified form, the formula looks like
CPU_usage = Δ processCpuTime / ( Δ processRunTime * CPU_count)
The idea makes sense, but the problem is how it is actually implemented. I see at least 3 issues:
1. The process CPU time and the process uptime are obtained by two independent method calls. Furthermore, those are remote calls made via RMI, which means there can be an arbitrarily long delay between obtaining the two values. So Δ processCpuTime and Δ processRunTime are computed over different time intervals, and it is not quite correct to divide one by the other. When Δ processCpuTime is computed over a longer period than Δ processRunTime, the result may happen to be > 100%.
2. The process CPU time is based on the OperatingSystemMXBean.getProcessCpuTime() call, while the process uptime relies on RuntimeMXBean.getUptime(). The problem is, these two methods use different clock sources and different time scales. Generally speaking, times obtained from these methods are not comparable to each other.
3. CPU_count is computed as OperatingSystemMXBean.getAvailableProcessors(). However, the number of logical processors visible to the JVM is not always equal to the number of physical processors on the machine. For example, in a container getAvailableProcessors() may return a value based on cpu-shares, while in practice the JVM may use more physical cores. In this case the CPU usage may again appear to be > 100%.
However, as far as I can see, in the latest version the final value of the CPU load is artificially capped at 99%:
return Math.min(99.0,
deltaTime / (deltaUptime * osBean.getAvailableProcessors()));
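To illustrate the first issue with a small numeric sketch (the values are hypothetical): if the CPU-time delta happens to be measured over a slightly longer window than the uptime delta, the ratio exceeds 100% even on a fully busy single core:

CPU_COUNT = 1

# Hypothetical samples, in milliseconds. The CPU-time delta covers 1100 ms of real
# time because the second RMI call landed 100 ms after the uptime sample was taken.
delta_process_cpu_time = 1100.0   # the core was busy for the whole (longer) window
delta_process_uptime = 1000.0

cpu_usage = delta_process_cpu_time / (delta_process_uptime * CPU_COUNT)
print(cpu_usage)   # 1.1, reported as 110%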

CPU cycles vs. total CPU time

On Windows, GetProcessTimes() and QueryProcessCycleTime() can be used to get totals for all threads of an app. I expected (apparently naively) to find a proportional relationship between the total number of cycles and the total processor time (user + kernel). When converted to the same units (seconds) and expressed as a percent of the app's running time, they're not even close, and the ratio between them varies greatly.
Right after an app starts, they're fairly close.
3.6353% CPU cycles
5.2000% CPU time
0.79 Ratio
But this ratio increases as an app remains idle (below, after 11 hours, mostly idle).
0.0474% CPU cycles
0.0039% CPU time
12.16 Ratio
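For clarity, the comparison above amounts to something like the following sketch (the clock frequency and counter values are made-up; a real cycles-to-seconds conversion depends on the actual core frequency over the interval):

NOMINAL_GHZ = 3.0                     # assumed average core frequency

elapsed_s = 3600.0                    # wall-clock running time of the app
cpu_time_s = 1.4                      # user + kernel seconds from GetProcessTimes
cycle_count = 15_000_000_000          # total from QueryProcessCycleTime

cycle_time_s = cycle_count / (NOMINAL_GHZ * 1e9)

print("CPU cycles: %.4f%%" % (100.0 * cycle_time_s / elapsed_s))
print("CPU time:   %.4f%%" % (100.0 * cpu_time_s / elapsed_s))
print("Ratio:      %.2f" % (cycle_time_s / cpu_time_s))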
Apparently, cycles are being counted that don't contribute to user or kernel time. I'm curious how this works. Please enlighten me.
Thanks.
Vince
The GetProcessTimes and QueryProcessCycleTime values are calculated in different ways. GetProcessTimes/GetThreadTimes are updated in response to timer interrupts, while the QueryProcessCycleTime values are based on tracking actual thread execution time. These different ways of measuring can produce vastly different timing results when the two APIs are compared, especially since GetThreadTimes only counts fully completed time slices in its thread counters (see http://blog.kalmbachnet.de/?postid=28), which usually results in incorrect timings.
Since GetProcessTimes will in general report less time than was actually spent (because threads do not always complete their time slice), it makes sense that its CPU-time percentage decreases over time compared to the cycle-measurement percentage.

Predicting system performance - a method for extrapolating multivariate performance metrics into a predictive equation

I have a reporting application. Its performance is dependent on the hardware it is hosted on and the data it runs against. So under hardware, the main factors are:
CPU cores
Memory
Hard disk speed
... and under data, the main factors are:
Number of customers
The average amount of data each customer has generated
My plan is to run a series of tests to measure the performance when I alter a single factor. So, for example, I will run the performance tests against 1 core, 2 cores and 4 cores and then run the tests against 4GB RAM, 16GB RAM and 64GB RAM.
From these measurements I would like to produce a formula that can roughly predict how well a system will perform given certain hardware and data.
For example:
Performance Score = f(cpu) + g(mem) + h(disk) + j(cust) + k(data)
where f, g, h, j and k are functions of the parameter they are passed.
My question is:
Is there a formal method for taking performance metrics as an input and extrapolating that data to produce a formula that predicts performance?
Yes: I would use linear regression as a starting point.
For an example, see How can I predict memory usage and time based on historical values.
I found Data Analysis Using Regression and Multilevel/Hierarchical Models to be a highly readable introduction to the subject (you probably won't need multilevel models, so you can skip the second part of the book).
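A minimal sketch of that starting point, fitting the additive model above by ordinary least squares with each of f, g, h, j and k taken as linear for simplicity (all measurements below are hypothetical; numpy is used for the fit):

import numpy as np

# Hypothetical benchmark results: one row per test run.
# Columns: CPU cores, RAM (GB), disk speed (MB/s), customers, data per customer (MB)
X = np.array([
    [1,  4, 150, 100,  50],
    [2,  4, 150, 100,  50],
    [4,  4, 150, 100,  50],
    [4, 16, 150, 100,  50],
    [4, 64, 150, 100,  50],
    [4, 16, 500, 100,  50],
    [4, 16, 150, 500,  50],
    [4, 16, 150, 100, 200],
], dtype=float)
y = np.array([40, 55, 72, 80, 83, 90, 60, 58], dtype=float)   # measured performance scores

# Add an intercept column and solve the least-squares problem.
A = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(cpu, mem, disk, cust, data):
    return coef @ np.array([1, cpu, mem, disk, cust, data], dtype=float)

print(predict(8, 32, 500, 300, 100))   # rough prediction for a new configuration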

Is memory latency affected by CPU frequency? Is it a result of memory power management by the memory controller?

I basically need some help to explain/confirm some experimental results.
Basic Theory
A common idea expressed in papers on DVFS is that execution time has an on-chip and an off-chip component. The on-chip component of execution time scales with the clock period (i.e. inversely with CPU frequency), whereas the off-chip component remains unaffected.
Therefore, for CPU-bound applications there is a linear relationship between CPU frequency and instruction-retirement rate. On the other hand, for a memory-bound application, where the caches are missed often and DRAM has to be accessed frequently, the relationship should be affine (one is not just a multiple of the other; you also have to add a constant).
Experiment
I was doing experiments looking at how CPU frequency affects instruction-retirement rate and execution time under different levels of memory-boundedness.
I wrote a test application in C that traverses a linked list. I create a linked list whose individual nodes are the size of a cache line (64 bytes), inside a large allocated region whose size is a multiple of the cache-line size.
The linked list is circular, so the last element links back to the first. The list also traverses the cache-line-sized blocks of the allocated memory in random order; every block is visited, and no block is visited more than once.
Because of the random traversal, I assumed the hardware would not be able to prefetch anything. Basically, traversing the list produces a sequence of memory accesses with no stride pattern, no temporal locality, and no spatial locality. Also, because this is a linked list, one memory access cannot begin until the previous one completes, so the memory accesses should not be parallelizable.
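A sketch of the structure being described (the actual test was written in C; Python is used here only to illustrate the pattern): the list can be built as a random single-cycle successor array with Sattolo's algorithm, which guarantees that a traversal visits every block exactly once before wrapping around:

import random

def build_random_cycle(n_blocks):
    # next_index[i] is the block visited after block i; the permutation is a
    # single cycle, so a traversal touches every block exactly once.
    next_index = list(range(n_blocks))
    for i in range(n_blocks - 1, 0, -1):        # Sattolo's algorithm
        j = random.randrange(i)                 # j < i, never i itself
        next_index[i], next_index[j] = next_index[j], next_index[i]
    return next_index

def traverse(next_index, start=0):
    # dependent pointer chasing: each step needs the result of the previous one
    i = start
    for _ in range(len(next_index)):
        i = next_index[i]
    return i                                    # back at the starting block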
When the amount of allocated memory is small enough, you should have no cache misses beyond initial warm up. In this case, the workload is effectively CPU bound and the instruction-retirement rate scales very cleanly with CPU frequency.
When the amount of allocated memory is large enough (bigger than the LLC), you should be missing the caches. The workload is memory bound and the instruction-retirement rate should not scale as well with CPU frequency.
The basic experimental setup is similar to the one described here:
"Actual CPU Frequency vs CPU Frequency Reported by the Linux "cpufreq" Subsystem".
The above application is run repeatedly for some duration. At the start and end of the duration, the hardware performance counter is sampled to determine the number of instructions retired over the duration. The length of the duration is measured as well. The average instruction-retirement rate is measured as the ratio between these two values.
This experiment is repeated across all the possible CPU frequency settings using the "userspace" CPU-frequency governor in Linux. Also, the experiment is repeated for the CPU-bound case and the memory-bound case as described above.
Results
The two following plots show results for the CPU-bound case and memory-bound case respectively. On the x-axis, the CPU clock frequency is specified in GHz. On the y-axis, the instruction-retirement rate is specified in (1/ns).
A marker is placed for each repetition of the experiment described above. The line shows what the result would be if the instruction-retirement rate increased at the same rate as the CPU frequency, passing through the lowest-frequency marker.
Results for the CPU-bound case.
Results for the memory-bound case.
The results make sense for the CPU-bound case, but less so for the memory-bound case. All the markers for the memory-bound case fall below the line, which is expected, because the instruction-retirement rate should not increase at the same rate as CPU frequency for a memory-bound application. The markers also appear to fall on straight lines, which is expected as well.
However, there appear to be step changes in the instruction-retirement rate as the CPU frequency changes.
Question
What is causing the step changes in the instruction-retirement rate? The only explanation I can think of is that the memory controller somehow changes the speed and power consumption of the memory as the rate of memory requests changes. (As the instruction-retirement rate increases, the rate of memory requests should increase as well.) Is this a correct explanation?
You seem to have exactly the results you expected: a roughly linear trend for the CPU-bound program, and a shallower affine one for the memory-bound case (which is less affected by the CPU). You will need a lot more data to determine whether these are consistent steps or whether they are, as I suspect, mostly random jitter depending on how 'good' the list is.
The CPU clock will affect bus clocks, which will affect timings and so on; synchronisation between differently clocked buses is always challenging for hardware designers. The spacing of your steps is, interestingly, 400 MHz, but I wouldn't read too much into it; generally, this kind of thing is far too complex and hardware-specific to be properly analysed without 'inside' knowledge of the memory controller used, etc.
(please draw nicer lines of best fit)
