jvmtop CPU usage >100%

I've been using jvmtop for a few months to monitor JVM statistics. When I've tallied its output against JConsole, I've mostly observed similar stats in jvmtop as well.
However, during a recent test execution I observed a few CPU% entries going above 100% (120% being the highest). Since I believe jvmtop reports cumulative CPU usage (unlike top, which gives more per-core detail), I need guidance on how to interpret these entries of beyond-100% usage.

I have looked at the jvmtop code and can only conclude that its algorithm for computing CPU usage is completely screwed up. Let me explain.
In simplified form, the formula looks like this:
CPU_usage = Δ processCpuTime / ( Δ processRunTime * CPU_count)
The idea makes sense, but the problem is how it is actually implemented. I see at least 3 issues:
Process CPU time and process uptime are obtained by two independent method calls. Furthermore, those are remote calls made via RMI. This means there can be an arbitrarily long delay between obtaining those values. So Δ processCpuTime and Δ processRunTime are computed over different time intervals, and it's not quite correct to divide one by the other. When Δ processCpuTime is computed over a longer period than Δ processRunTime, the result may turn out to be > 100%.
Process CPU time is based on the OperatingSystemMXBean.getProcessCpuTime() call, while the process uptime relies on RuntimeMXBean.getUptime(). The problem is, these two methods use different clock sources and different time scales. Generally speaking, the times obtained by these methods are not comparable to each other.
CPU_count is computed as OperatingSystemMXBean.getAvailableProcessors(). However, the number of logical processors visible to the JVM is not always equal to the number of physical processors on the machine. For example, in a container getAvailableProcessors() may return a value based on cpu-shares, while in practice the JVM may use more physical cores. In this case CPU usage may again appear > 100%.
However, as far as I can see, in the latest version the final value of CPU load is artificially capped at 99%:
return Math.min(99.0,
    deltaTime / (deltaUptime * osBean.getAvailableProcessors()));
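A minimal sketch of the sampling idea, shown in Python rather than jvmtop's Java (the function name and the one-second interval are mine): read the process CPU time and the wall time together, from consistent clocks, so both deltas cover exactly the same interval.

import os
import time

def cpu_usage_sampler(interval=1.0):
    """Yield this process's CPU usage as a fraction of total machine capacity."""
    ncpu = os.cpu_count() or 1
    prev_cpu = time.process_time()   # CPU time consumed by this process
    prev_wall = time.monotonic()     # wall time from a monotonic clock
    while True:
        time.sleep(interval)
        cpu, wall = time.process_time(), time.monotonic()  # read both together
        usage = (cpu - prev_cpu) / ((wall - prev_wall) * ncpu)
        prev_cpu, prev_wall = cpu, wall
        yield usage

if __name__ == "__main__":
    sampler = cpu_usage_sampler()
    for _ in range(5):
        print(f"CPU usage: {next(sampler):.1%}")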

Related

Type of CPU load to perform constant relaxed work

I'm trying to figure out how to program a type of CPU load that makes the CPU work constantly but under only average stress.
The only approach I know for giving a CPU some work without running it at its maximum possible performance is to alternate between giving the CPU something to do and sleeping for some time. E.g., to achieve 20% CPU usage, do some computation that takes, say, 0.2 seconds and then sleep for 0.8 seconds. Then the CPU usage will be roughly 20%.
However, this essentially means the CPU is jumping between maximum performance and idle all the time.
I wrote a small Python program where I create a process for each CPU core, set its affinity so each process runs on a designated core, and give it some absolutely meaningless load:
def actual_load_cycle():
    x = list(range(10000))
    del x
while repeating calls to this procedure in a loop and then sleeping for some time, to ensure the working time is N% of the total time:
while 1:
    timer.mark_time(timer_marker)
    for i in range(coef):
        actual_load_cycle()
    elapsed = timer.get_time_since(timer_marker)
    # now we need to sleep for some time. The elapsed time is CPU_LOAD_TARGET% of 100%.
    time_to_sleep = elapsed / CPU_LOAD_TARGET * (100 - CPU_LOAD_TARGET)
    sleep(time_to_sleep)
It works well, giving a load within 7% of the desired CPU_LOAD_TARGET value - I don't need a precise amount of load.
But it pushes the CPU temperature very high: with CPU_LOAD_TARGET=35 (the real CPU usage reported by the system is around 40%), the CPU temps go up to 80 degrees.
Even with a minimal target like 5%, the temps still spike, maybe just not as much - up to 72-73 degrees.
I believe the reason is that during the active part of each cycle the CPU works as hard as it can, and it doesn't cool down fast enough while sleeping afterwards.
But when I'm running a game like Uncharted 4, the CPU usage as measured by MSI Afterburner is 42-47%, yet the temperatures stay under 65 degrees.
How can I achieve similar results? How can I program a load that keeps CPU usage high while the work itself stays quite relaxed, as it apparently is in games?
Thanks!
The heat dissipation of a CPU mainly depends on its power consumption, which is very dependent on the workload - more precisely, on the instructions being executed and the number of active cores. Modern processors are very complex, so it is very hard to predict the power consumption of a given workload, especially when the executed code is Python code run in the CPython interpreter.
There are many factors that can impact the power consumption of a modern processor. The most important one is frequency scaling. Mainstream x86-64 processors can adapt the frequency of a core based on the kind of computation done (e.g. wide SIMD floating-point vectors like the ZMM registers of AVX-512 vs. scalar 64-bit integers), the number of active cores (the higher the number of cores, the lower the frequency), the current temperature of the core, the time spent executing instructions vs. sleeping, etc. On modern processors the memory hierarchy can draw a significant amount of power, so operations involving the memory controller, and more generally the RAM, can consume more power than ones operating on in-core registers. In fact, depending on the instructions actually executed, the processor needs to enable/disable some parts of its integrated circuit (e.g. SIMD units, integrated GPU, etc.), and not all of them can be enabled at the same time due to TDP constraints (see dark silicon). Floating-point SIMD instructions tend to consume more energy than integer SIMD instructions. Something even weirder: the consumption can actually depend on the input data, since transistors may switch more frequently between states with some data (researchers found this while running matrix multiplication kernels on different kinds of platforms with different kinds of input). The power is adapted automatically by the processor, since it would be insane (if even possible) for engineers to consider all possible cases and all possible dynamic workloads.
One of the cheapest x86 instructions is NOP, which basically means "do nothing". That being said, the processor can still run at its highest turbo frequency while executing a loop of NOPs, resulting in pretty high power consumption. In fact, some processors can run NOPs in parallel on multiple execution units of a given core, keeping all the available ALUs busy. Funny point: running dependent instructions with a high latency might actually reduce the power consumption of the target processor.
The MWAIT/MONITOR instructions provide hints that allow the processor to enter an implementation-dependent optimized state. This includes lower power consumption, possibly due to a lower frequency (e.g. no turbo) and the use of sleep states. Basically, your processor can sleep for a very short time so as to reduce its power consumption, and then run at a high frequency for longer thanks to the lower power draw / heat dissipation before. The behaviour is similar to humans: the deeper the sleep, the faster the processor can be afterwards, but also the longer it takes to (completely) wake up. The bad news is that such instructions require very high privileges AFAIK, so you basically cannot use them from user-land code. AFAIK there are instructions to do this from user-land, like UMWAIT and UMONITOR, but they are not yet implemented except maybe in very recent processors. For more information, please read this post.
In practice, the default CPython interpreter consumes a lot of power because it makes a lot of memory accesses (including indirections and atomic operations), performs a lot of branches that need to be predicted by the processor (which has special power-hungry units for that), and performs a lot of dynamic jumps in a large code base. The kind of pure-Python code executed does not reflect the actual instructions executed by the processor, since most of the time is spent in the interpreter itself. Thus, I think you need to use a lower-level language like C or C++ to better control the kind of workload being executed. Alternatively, you can use a JIT compiler like Numba so as to have better control while still writing Python code (though not pure Python anymore). Still, keep in mind that a JIT can generate many unwanted instructions that result in unexpectedly higher power consumption, and that it can also optimize away trivial code like a sum from 1 to N (simplified to just an N*(N+1)/2 expression).
Here is an example of code:
import numba as nb

def test(n):
    s = 1
    for i in range(1, n):
        s += i
        s *= i
        s &= 0xFF
    return s

pythonTest = test
numbaTest = nb.njit('(int64,)')(test)  # Compile the function

pythonTest(1_000_000_000)  # takes about 108 seconds
numbaTest(1_000_000_000)   # takes about 1 second
In this code, the Python function takes 108 times longer to execute than the Numba function on my machine (i5-9600KF processor), so one would expect roughly 108 times more energy for the Python version. However, in practice it is even worse: the pure-Python function causes the target core to draw a much higher power (not just more energy) than the equivalent compiled Numba implementation on my machine. This can be clearly seen on the temperature monitor:
Base temperature when nothing is running: 39°C
Temperature during the execution of pythonTest: 55°C
Temperature during the execution of numbaTest: 46°C
Note that my processor was running at 4.4-4.5 GHz in all cases (due to the performance governor being chosen). The temperature is read after 30 seconds in each case and is stable (thanks to the cooling system). The functions are run in a while(True) loop during the benchmark.
Note that games often use multiple cores and do a lot of synchronization (at least to wait for the rendering part to complete). As a result, the target processor can use a slightly lower turbo frequency (due to TDP constraints) and stay at a lower temperature thanks to the small sleeps (saving energy).

What is an appropriate model for detecting CPU usage abnormality?

I am currently working on a model for detecting CPU usage abnormalities in our regression test cases. The goal is to keep tracking the CPU usage day after day, and raise an alarm if the CPU usage of any particular test case rockets up, so that the relevant developer can step in in time for investigation.
The model I've figured out is like this:
Measure the CPU usage of the target process every second. Just read /proc/<pid>/stat and divide the CPU time by the wall time elapsed during the past second (which means the wall time is always 1).
When the test case finishes, we get an array of CPU usage data (how long the array is depends on how long the test case runs). Go through this array and produce a summary array with 5 elements, which represents the distribution of CPU usage -- [0~20%, 20~40%, 40~60%, 60~80%, +80%], such as [0.1, 0.2, 0.2, 0.4, 0.1].
0~20%: the fraction of measurements with CPU usage between 0 and 20%
20~40%: the fraction of measurements with CPU usage between 20% and 40%
40~60%: the fraction of measurements with CPU usage between 40% and 60%
60~80%: the fraction of measurements with CPU usage between 60% and 80%
+80%: the fraction of measurements with CPU usage over 80% (the CPU usage can actually exceed 100% if multi-threading is enabled)
The sum of this summary array should be 1.
Then I do a calculation on the summary array and get a score with this formula:
summary[0]*1 + summary[1]*1 + summary[2]*2 + summary[3]*4 + summary[4]*8
The principle is: higher CPU usage gets a higher penalty.
The higher the final score, the higher the overall CPU usage. For any particular test case, I get one score every day. If the score rockets up some day, an alarm is raised.
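To make the binning and scoring concrete, here is a minimal sketch using the bin edges and weights given above (the function names and the sample data are mine, for illustration only):

def summarize(usages):
    # usages: per-second CPU usage samples, e.g. 0.37 for 37% (can exceed 1.0)
    counts = [0, 0, 0, 0, 0]  # [0~20%, 20~40%, 40~60%, 60~80%, +80%]
    for u in usages:
        counts[min(int(u / 0.20), 4)] += 1
    return [c / len(usages) for c in counts]

def score(summary, weights=(1, 1, 2, 4, 8)):
    return sum(s * w for s, w in zip(summary, weights))

samples = [0.05, 0.15, 0.30, 0.55, 0.70, 0.95, 1.10, 0.25, 0.45, 0.65]
summary = summarize(samples)
print(summary, score(summary))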
To verify this model, I've repeatedly run a randomly selected set of test cases a few hundred times, on different machines. I found that on any particular host the scores fluctuate, but within a reasonable range. However, different hosts live in different ranges.
This is a diagram showing the scores of 600 runs for a particular test case. Different colors indicate different hosts.
Here are a few questions:
Does it make sense that the scores on different machines fall in different ranges?
If I want to trace CPU usage for a test case, should I run it merely on one dedicated machine day after day?
Is there any better model for tracing CPU usage and raising alarms?
You should have an O(n) model for every function, based on what you believe it should be doing. Then generate your regression cases against it by varying n. If the function does not display the expected performance as n increases, something is wrong. If it is too good, the problem is in your expectations and test cases.
Another approach is to come up with a theoretical ideal-run metric. Ideal runs tend to save time via hard-coding or look-ups in massive tables. Any additional error checking, etc. that your real code requires will detract from its efficiency compared to the ideal run.
The only way to prove the CPU use is bad is to come up with a better alternative. If you could do so, then you should have designed your dev process to produce the better algorithm in the first place!
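For the first suggestion (checking observed timings against an expected complexity model), a rough sketch might look like the following; the function under test, the sizes, and the tolerance are placeholders, not part of the original answer:

import time

def expected_cost(n):
    return n  # the O(n) model you believe the function should follow

def function_under_test(n):
    return sum(range(n))  # stand-in for the real code

def check_growth(sizes=(10**5, 10**6, 10**7), tolerance=3.0):
    timings = []
    for n in sizes:
        start = time.perf_counter()
        function_under_test(n)
        timings.append(time.perf_counter() - start)
    # Compare measured growth ratios against the model's predicted ratios.
    for (n1, t1), (n2, t2) in zip(zip(sizes, timings), zip(sizes[1:], timings[1:])):
        predicted = expected_cost(n2) / expected_cost(n1)
        measured = t2 / t1
        if measured > predicted * tolerance:
            print(f"possible regression at n={n2}: x{measured:.1f} vs expected ~x{predicted:.1f}")

check_growth()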

maxed CPU performance - stack all tasks or aim for less than 100%?

I have 12 tasks to run on an octo-core machine. All tasks are CPU intensive and each will max out a core.
Is there a theoretical reason to avoid stacking tasks on a maxed-out core (such as overhead from swapping across tasks), or is it faster to queue everything?
Task switching is a waste of CPU time. Avoid it if you can.
Whatever the scheduler timeslice is set to, the CPU wastes time at every time slice by going into the kernel, saving all the registers, swapping the memory mappings, and starting the next task. Then that task has to load everything back into the CPU caches, etc.
Much more efficient to just run one task at a time.
Things are different of course if the tasks use I/O and aren't purely compute bound.
Yes, it's called queueing theory: https://en.wikipedia.org/wiki/Queueing_theory. There are many different models (https://en.wikipedia.org/wiki/Category:Queueing_theory) for a range of different problems. I'd suggest you scan them and pick the one most applicable to your workload, then read up on how to avoid the worst outcomes for that model, or pick a different, better model for dispatching your workload.
Although the graph at this link (https://commons.wikimedia.org/wiki/File:StochasticQueueingQueueLength.png) applies to traffic, it will give you an idea of what happens to response times as your CPU utilisation increases. It shows that you'll reach an inflection point after which things get slower and slower.
More work arrives than can be processed, with subsequent work waiting longer and longer before it can be dispatched.
The more cores you have, the further to the right you push the inflection point, but the faster things go bad once you reach it.
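As a hedged illustration of that inflection point (using the textbook M/M/1 queue formula rather than anything measured on your workload), the average number of jobs waiting, Lq = rho^2 / (1 - rho), blows up as utilisation approaches 100%:

# Mean M/M/1 queue length vs. utilisation: flat at first, then it explodes.
for rho in (0.5, 0.7, 0.8, 0.9, 0.95, 0.99):
    lq = rho ** 2 / (1 - rho)
    print(f"utilisation {rho:.0%}: ~{lq:.1f} jobs waiting on average")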
I would also note that unless you've got some really serious cooling in place, you are going to cook your CPU. Depending on its design, it will either slow itself down, making your problem worse, or you'll trigger its thermal overload protection.
So a simplistic design for 8 cores would be: 1 thread to manage things and add tasks to the work queue, and 7 threads pulling tasks from the work queue. If the tasks need to be performed within a certain time, you can add a TimeToLive value so that stale tasks can be discarded rather than executed needlessly. As you are almost certainly running your application on an OS that uses a pre-emptive threading model, consider things like using processor affinity where possible, because as @Zan-Lynx says, task/context switching hurts. Be careful not to try to rebuild your OS's thread management yourself, as you'll probably wind up in conflict with it.
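A minimal sketch of that design using Python's standard library worker pool (the task body and the worker count are placeholders, not your actual workload):

from concurrent.futures import ProcessPoolExecutor
import os

def cpu_bound_task(task_id):
    total = 0
    for i in range(10_000_000):  # stand-in for one of the 12 CPU-heavy tasks
        total += i * i
    return task_id

if __name__ == "__main__":
    workers = max(1, (os.cpu_count() or 8) - 1)  # leave a core for the manager/OS
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # The executor is the work queue: tasks wait until a worker core is free.
        for task_id in pool.map(cpu_bound_task, range(12)):
            print(f"task {task_id} done")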
tl;dr: cache thrash is Bad
You have a dozen tasks. Each will have to do a certain amount of work.
At the app level, each of them processes a thousand customer records or whatever. That is fixed; it is a constant no matter what happens on the hardware.
At the language level, again it is fixed: C++, Java, or Python will execute a fixed number of app instructions or bytecodes. We'll gloss over GC overhead here, and page fault and scheduling details.
At the assembly level, again it is fixed, some number of x86 instructions will execute as the app continues to issue new instructions.
But you don't care about how many instructions; you only care about how long it takes to execute them. Many of the instructions are reads which MOV a value from RAM to a register. Think about how long that will take. Your computer has several components that implement the memory hierarchy - which ones will be involved? Will that read hit in L1 cache? In L2? Will it miss in the last-level cache, so you wait (for tens or hundreds of cycles) until RAM delivers that cache line? Did the virtual memory reference miss in RAM, so you wait (for milliseconds) until SSD or Winchester storage can page in the needed frame? You think of your app as issuing N reads, but you might more productively think of it as issuing 0.2 * N cache misses. Running at a different multi-programming level, where you issue 0.3 * N cache misses, could make the elapsed time quite noticeably longer.
Every workload is different, and can place larger or smaller demands on memory storage. But every level of the memory hierarchy depends on caching to some extent, and higher multi-programming levels are guaranteed to impact cache hit rates. There are network- and I/O-heavy workloads where very high multi-programming levels absolutely make sense. But for CPU- and memory-intensive workloads, when you benchmark elapsed times you may find that less is more.
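A back-of-envelope sketch of that point: the same N reads, with a slightly higher miss rate, can noticeably stretch elapsed time. The latency and frequency figures below are rough assumptions for illustration, not measurements:

N = 1_000_000_000   # reads issued by the app (fixed, as argued above)
HIT_CYCLES = 4      # assumed cache-hit latency
MISS_CYCLES = 200   # assumed round trip to RAM

for miss_rate in (0.2, 0.3):
    cycles = N * ((1 - miss_rate) * HIT_CYCLES + miss_rate * MISS_CYCLES)
    print(f"miss rate {miss_rate:.0%}: ~{cycles / 3e9:.0f} s at an assumed 3 GHz")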

Computing time in relation to number of operations

Is it possible to calculate the computing time of a process based on the number of operations it performs and the speed of the CPU in GHz?
For example, I have a for loop that performs a total of 5*10^14 cycles. If it runs on a 2.4 GHz processor, will the computing time in seconds be 5*10^14 / (2.4*10^9) = 208333 s?
If the process runs on 4 cores in parallel, will the time be reduced by a factor of four?
Thanks for your help.
No, it is not possible to calculate the computing time based just on the number of operations. First of all, based on your question, it sounds like you are talking about the number of lines of code in some higher-level programming language since you mention a for loop. So depending on the optimization level of your compiler, you could see varying results in computation time depending on what kinds of optimizations are done.
But even if you are talking about assembly-language operations, it is still not possible to calculate the computation time based on the number of instructions and CPU speed alone. Some instructions take multiple CPU cycles. If you have a lot of memory accesses, you will likely have cache misses that stall the CPU while data is fetched from main memory (or even from disk, if the page was swapped out), which is unpredictable.
Also, if the time that you are concerned about is the actual amount of time that passes between the moment the program begins executing and the time it finishes, you have the additional confounding variable of other processes running on the computer and taking up CPU time. The operating system should be pretty good about context switching during disk reads and other slow operations so that the program isn't stopped in the middle of computation, but you can't count on never losing some computation time because of this.
As far as running on four cores in parallel, a program can't just do that by itself. You need to actually write the program as a parallel program. A for loop is a sequential operation on its own. In order to run four processes on four separate cores, you will need to use the fork system call and have some way of dividing up the work between the four processes. If you divide the work into four processes, the maximum speedup you can have is 4x, but in most cases it is impossible to achieve the theoretical maximum. How close you get depends on how well you are able to balance the work between the four processes and how much overhead is necessary to make sure the parallel processes successfully work together to generate a correct result.
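As a hedged sketch of that last point, here is one way to divide a loop's iterations across four processes with Python's multiprocessing module (the per-iteration work is a placeholder, and the near-4x speedup only materialises if the work splits cleanly):

from multiprocessing import Pool

def partial_sum(bounds):
    start, stop = bounds
    total = 0
    for i in range(start, stop):
        total += i  # placeholder for the real per-iteration work
    return total

if __name__ == "__main__":
    n = 10_000_000
    # Split the iteration range into four contiguous chunks, one per process.
    chunks = [(i * n // 4, (i + 1) * n // 4) for i in range(4)]
    with Pool(processes=4) as pool:
        print(sum(pool.map(partial_sum, chunks)))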

Cache miss in an Out of order processor

Imagine an application is running on an out-of-order processor and it has a lot of last-level cache (LLC) misses (more than 70%). Do you think that if we decrease the frequency of the processor and set it to a smaller value, the execution time of the application will increase in a big way, or will it not be affected much? Could you please explain your answer?
Thanks and regards
As is the case with most micro-architectural features, the safe answer would be - "it might, and it might not - depends on the exact characteristics of your application".
Take, for example, a workload that runs through a large graph residing in memory - each new node needs to be fetched and processed before the next node can be chosen. If you reduce the frequency, you harm the execution phase, which is latency-critical since the next set of memory accesses depends on it.
On the other hand, a workload that is bandwidth-bound (i.e. performs as fast as the system memory bandwidth allows) is probably not fully utilizing the CPU and would therefore not be harmed as much.
Basically, the question boils down to how well your application utilizes the CPU, or rather, where between the CPU and the memory the performance bottleneck lies.
By the way, note that even if reducing the frequency does impact the execution time, it could still be beneficial for your power/performance ratio; it depends on where along the power/perf curve you're located and on the exact values.
