Out of curiosity ... is it possible to have more than 100% utilization of the CPU in a multi-threaded environment?
No, of course not. And any utility which tells you otherwise is lying.
A single CPU core cannot be at more than 100% utilization. But on a multi-core system, most utilities report the sum of the per-core utilization, so numbers above 100% are quite common.
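If you want to see the difference for yourself, here is a small Python sketch (assuming the third-party psutil package is installed) that prints the per-core figures, their sum, and the normalized value:

import psutil

# Per-core utilization sampled over one second, e.g. [90.0, 85.0, 12.0, 5.0]
per_core = psutil.cpu_percent(interval=1, percpu=True)
print("per core:", per_core)
print("summed:", sum(per_core))                      # can exceed 100 on a multi-core box
print("normalized:", sum(per_core) / len(per_core))  # never exceeds 100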
No, by definition of CPU utilization this could never happen. What you may see is a number of runnable processes greater than the number of CPUs. This is normal in a multi-threaded environment, since the scheduler schedules at the thread level instead of the process level.
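On a Unix-like box you can observe exactly that by comparing the load average (roughly, the number of runnable tasks) with the core count; a small sketch, assuming os.getloadavg() is available on your platform:

import os

load_1min, _, _ = os.getloadavg()   # not available on Windows
cpus = os.cpu_count()
print(f"about {load_1min:.1f} runnable tasks competing for {cpus} CPUs")
# A load above the CPU count just means threads are queued,
# not that utilization of any core exceeds 100%.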
We have a 20-core CPU running on KVM (CentOS 7.8).
We have two heavy enterprise Java applications (Java 8) running on the same node.
We are using ParallelGC in both, and by default 14 GC threads show up in each (the default is determined using roughly 5/8 * number of cores).
Is it okay to have the GC threads (14 + 14 = 28 combined) exceed the number of cores (20) in the system? Will there be no issues when GC threads on both JVM instances are running concurrently?
Would it make sense to reduce the number of GC threads to 10 each?
How can we determine the minimum number of GC threads (ParallelGC) needed to get the job done without impacting the application?
Will there be no issues when GC threads on both JVM instances are running concurrently?
Well, if both JVMs are running the GC at the same time, then their respective GC runs may take longer because there are fewer physical cores available. However, the OS should (roughly speaking) give both JVMs an equal share of the available CPU. So nothing should break.
Would it make sense to reduce the number of GC threads to 10 each?
That would mean that if two JVMs were running the GC simultaneously then they wouldn't be competing for CPU.
But the flip side is that, since a JVM now has only 10 threads for GC, its GC runs will always take longer than they would with 14 threads ... assuming enough idle cores are currently available.
Bear in mind that most of the time you would expect that the JVMs are not GC'ing at the same time. (If they are, then probably something else is wrong¹.)
How can we determine the minimum number of GC threads (ParallelGC) needed to get the job done without impacting the application?
By a process of trial and error:
Pick an initial setting
Measure performance with an indicative benchmark workload
Adjust setting
Repeat ... until you have determined the best settings.
But beware that if your actual workload doesn't match your benchmark, your settings may be "off".
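As a sketch of that measurement loop (benchmark.jar and its arguments are placeholders for your own repeatable workload), you could script the JVM launches and time each -XX:ParallelGCThreads setting:

import subprocess, time

for gc_threads in (6, 8, 10, 12, 14):
    start = time.monotonic()
    # "benchmark.jar" is a placeholder for your own benchmark entry point.
    subprocess.run(
        ["java", f"-XX:ParallelGCThreads={gc_threads}", "-jar", "benchmark.jar"],
        check=True,
    )
    print(f"ParallelGCThreads={gc_threads}: {time.monotonic() - start:.1f}s")

Wall-clock time is a blunt metric; enabling GC logging (e.g. -verbose:gc on Java 8) and comparing pause times for each setting will tell you more.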
My advice would be that if you have two CPU intensive, performance critical applications, and you suspect that they are competing for resources, you should try to run them on different (dedicated!) compute nodes.
There is only so much you can achieve by "fiddling with the tuning knobs".
¹ Maybe the applications have memory leaks. Maybe they need to be profiled to look for major CPU hotspots. Maybe you don't have enough RAM and the real competition is for physical RAM pages and swap device bandwidth (during GC) rather than CPU.
I have a system - 2.8 GHz processor, 20 physical cores, 40 logical cores, 128 GB RAM and a 4 TB hard drive.
Scenario:
I am running 3 independent Python-based processes/scripts that read data from files and write it to a database. They are taking a long time, yet they are not using 100% of CPU and memory - not even 40%.
Why is it so? (I think it depends upon the OS.)
How can I configure it to utilise CPU and memory more?
I am using Windows 8.1.
Take a look at ProcessorAffinity and PriorityClass:
https://msdn.microsoft.com/en-us/library/system.diagnostics.processthread.processoraffinity(v=vs.110).aspx
https://msdn.microsoft.com/en-us/library/system.diagnostics.process.priorityclass(v=vs.110).aspx
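Those links are the .NET APIs; if you would rather set the same things from inside the Python scripts themselves, the third-party psutil package exposes roughly equivalent calls (a sketch, assuming psutil is installed and you are on Windows):

import psutil

p = psutil.Process()                 # the current script
p.cpu_affinity([0, 1, 2, 3])         # pin to (or spread across) these logical cores
p.nice(psutil.HIGH_PRIORITY_CLASS)   # Windows priority class constant (psutil also accepts Unix nice values)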
A process (including a python script) isn't going to use any more cores than it has running threads. So if your python script is single-threaded, it's only going to use a single core.
Further, disk and database operations will stall the process while blocked on I/O and network. (Effective CPU usage == 0).
In other words, your program may not be "cpu bound" if it's doing a lot of I/O.
I'm not sure what your programs do, but if the problem at hand can be parallelized (split up into multiple independent tasks), then it might lend itself to having more threads or processes to take advantage of the extra hardware you have. But it's tricky, and it can be hard to get this right and actually see the performance gain.
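If the work really can be split that way, here is a minimal sketch with the standard-library multiprocessing module (load_into_db and the file list are placeholders for whatever your scripts actually do):

from multiprocessing import Pool

def load_into_db(path):
    ...  # placeholder: read the file and write its rows to the database
    return path

if __name__ == "__main__":                            # required on Windows
    files = ["data1.csv", "data2.csv", "data3.csv"]   # placeholder inputs
    with Pool(processes=4) as pool:                   # tune to your core count
        for done in pool.imap_unordered(load_into_db, files):
            print("finished", done)

Note that if the real bottleneck is the database or the disk, adding processes will raise CPU usage only slightly; the extra workers will mostly sit waiting on the same I/O.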
I'm new to Elixir, and I'm starting to read through Dave Thomas's excellent Programming Elixir. I was curious how far I could take the concurrency of the "pmap" function, so I iteratively boosted the number of items to square from 1,000 to 10,000,000. Out of curiosity, I watched the output of htop as I did so, usually peaking out with CPU usage similar to that shown below:
After showing the example in the book, Dave says:
And, yes, I just kicked off 1,000 background processes, and I used all the cores and processors on my machine.
My question is, how come on my machine only cores 1, 3, 5, and 7 are lighting up? My guess would be that it has to do with my iex process being only a single OS-level process and OSX is managing the reach of that process. Is that what's going on here? Is there some way to ensure all cores get utilized for performance-intensive tasks?
Great comment by @Thiago Silveira about the first line of iex's output. The part [smp:8:8] says how many schedulers Erlang is using (8 started, 8 online). You can control this with the -smp emulator flag if you want to disable symmetric multiprocessing:
iex --erl '-smp disable'
This will ensure that you have only one scheduler. You can achieve a similar result by leaving symmetric multiprocessing enabled, but setting NumberOfSchedulers:NumberOfSchedulersOnline directly:
iex --erl '+S 1:1'
Each scheduler runs in its own operating-system thread and executes Erlang processes, so you can easily see how many of them you currently have:
:erlang.system_info(:schedulers_online)
To answer your question about performance: if your processors are not working at full capacity (100%) and none of them is completely idle (0%), then it is probable that making the load more evenly distributed will not speed things up. Why?
CPU usage is measured by probing the processor state at many points in time. These states are either "working" or "idle". 82% CPU usage means that you can still perform a couple more tasks on this CPU without slowing the other tasks down.
Erlang schedulers try to be smart and not migrate Erlang processes between cores unless they have to, because migration is not free (it hurts cache locality). The migration occurs, for example, when one of the schedulers is idle; it can then steal a process from another scheduler's run queue.
The next thing that may cause such a big discrepancy between odd and even cores is Hyper-Threading. On my dual-core processor, htop shows 4 logical cores. In your case you probably have 4 physical cores and 8 logical ones because of HT. It might be that you are already utilizing your physical cores at 100%.
Another thing: pmap needs to calculate the result in a separate process, but at the end it sends it back to the caller, which may be a bottleneck. The more messages you send, the less CPU utilization you can achieve. For fun, you can try giving the processes a task that is really CPU-intensive, like calculating the Ackermann function. You can even calculate how much of your job is sequential and how much is parallel using Amdahl's law, by measuring execution times for different numbers of cores.
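For reference, Amdahl's law in its usual form (p is the fraction of the job that can run in parallel, n the number of cores), together with the rearrangement you would use after measuring the speedup S = T(1) / T(n):

speedup(n) = 1 / ((1 - p) + p / n)
p = (1 - 1/S) / (1 - 1/n)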
To sum up: the CPU utilization from the screenshot looks really good! You don't have to change anything for more performance-intensive tasks.
Concurrency is not Parallelism
In order to get good parallel performance out of Elixir/BEAM coding you need to have some understanding of how the BEAM scheduler works.
This is a very simplistic model, but the BEAM scheduler gives each process 2000 reductions before it swaps out the process for the next process. Reductions can be thought of as function calls. By default a process runs on the core/scheduler that spawned it. Processes only get moved between schedulers if the queue of outstanding processes builds up on a given scheduler. By default the BEAM runs a scheduling thread on each available core.
What this implies is that in order to get the most use of the processors you need to break up your tasks into large enough pieces of work that will exceed the standard "reduction" slice of work. In general, pmap style parallelism only gives significant speedup when you chunk many items into a single task.
The other thing to be aware of is that some parts of the BEAM use a spin/wait loop when awaiting work, and that can skew usage when you use a tool like htop to examine CPU usage. You'll get a much better understanding of your program's performance by using :observer.
I have built software that I deploy on Windows 2003 server. The software runs as a service continuously and it's the only application on the Windows box of importance to me. Part of the time, it's retrieving data from the Internet, and part of the time it's doing some computations on that data. It's multi-threaded -- I use thread pools of roughly 4-20 threads.
I won't bore you with all those details, but suffice it to say that as I enable more threads in the pool, more concurrent work occurs, and CPU use rises. (as does demand for other resources, like bandwidth, although that's of no concern to me -- I have plenty)
My question is this: should I simply try to max out the CPU to get the best bang for my buck? Intuitively, I don't think it makes sense to run at 100% CPU; even 95% CPU seems high, almost like I'm not giving the OS much space to do what it needs to do. I don't know the right way to identify the best balance. I'm guessing I could measure and measure and probably find that the best throughput is achieved at a CPU average utilization of 90% or 91%, etc., but...
I'm just wondering if there's a good rule of thumb about this??? I don't want to assume that my testing will take into account all kinds of variations of workloads. I'd rather play it a bit safe, but not too safe (or else I'm underusing my hardware).
What do you recommend? What is a smart, performance-minded rule of utilization for a multi-threaded, mixed-load (some I/O, some CPU) application on Windows?
Yep, I'd suggest 100% is thrashing, so I wouldn't want to see processes running like that all the time. I've always aimed for 80% to get a balance between utilization and room for spikes / ad-hoc processes.
An approach I've used in the past is to crank up the pool size slowly and measure the impact (both on CPU and on other constraints such as I/O); you never know, you might find that I/O suddenly becomes the bottleneck.
CPU utilization shouldn't matter much in this I/O-intensive workload; you care about throughput. So try using a hill-climbing approach: programmatically inject and remove worker threads and track completion progress...
If you add a thread and it helps, add another one. If you try a thread and it hurts remove it.
Eventually this will stabilize.
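A rough sketch of that loop in Python (process_item is a placeholder for your real mixed I/O and CPU task); it measures items per second at each pool size and stops growing the pool once throughput stops improving:

import time
from concurrent.futures import ThreadPoolExecutor

def process_item(item):
    ...              # placeholder: fetch data, then compute on it
    return item

def measure_throughput(n_workers, n_items=200):
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(process_item, range(n_items)))
    return n_items / (time.monotonic() - start)

best_rate, best_workers = 0.0, 0
for workers in range(2, 33, 2):          # try 2, 4, ..., 32 threads
    rate = measure_throughput(workers)
    print(f"{workers} workers -> {rate:.1f} items/sec")
    if rate <= best_rate:                # adding threads stopped helping
        break
    best_rate, best_workers = rate, workers
print("settled on", best_workers, "workers")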
If this is a .NET based app, hill climbing was added to the .NET 4 threadpool.
UPDATE:
Hill climbing is a control-theory-based approach to maximizing throughput. You can call it trial and error if you want, but it is a sound approach. In general, there isn't a good 'rule of thumb' to follow here because the overheads and latencies vary so much that it's not really possible to generalize. The focus should be on throughput and task/thread completion, not CPU utilization. For example, it's pretty easy to peg the cores with coarse- or fine-grained synchronization yet not actually make a difference in throughput.
Also, regarding .NET 4: if you can reframe your problem as a Parallel.For or Parallel.ForEach, then the thread pool will adjust the number of threads to maximize throughput, so you don't have to worry about this.
-Rick
Assuming nothing else of importance but the OS runs on the machine:
If your load is also constant, you should aim at 100% CPU utilization; everything else is a waste of CPU. Remember that the OS handles the threads, so it is indeed able to run; it's hard to starve the OS with a well-behaved program.
But if your load is variable and you expect peaks that you should take into consideration, I'd say 80% CPU is a good threshold to use, unless you know exactly how that load will vary and how much CPU it will demand, in which case you can aim for the exact number.
If you simply give your threads a low priority, the OS will do the rest and take cycles as it needs to do its work. Server 2003 (and most server OSes) are very good at this; there is no need to try to manage it yourself.
I have also used 80% as a general rule-of-thumb for target CPU utilization. As some others have mentioned, this leaves some headroom for sporadic spikes in activity and will help avoid thrashing on the CPU.
Here is a little (older but still relevant) advice from the Weblogic crew on this issue: http://docs.oracle.com/cd/E13222_01/wls/docs92/perform/basics.html#wp1132942
If you feel your load is very even and predictable, you could push that target a little higher, but unless your user base is exceptionally tolerant of periodic slow responses and your project budget is incredibly tight, I'd recommend adding more resources to your system (adding a CPU, using a CPU with more cores, etc.) over making a risky move to try to squeeze another 10% of CPU utilization out of your existing platform.
The Windows Task Manager shows CPU usage in percentage. What's the formula behind this? Is it this:
% CPU usage for process A = (sum of all time slices given to A till now) / (total time since the machine booted)
Or is it something else?
I am not 100% sure what it uses, but I think you are a bit off on the CPU calculation.
I believe they are doing something like this:
Process A CPU Usage = (Cycles for A over last X seconds)/(Total cycles for last X seconds)
I believe it is tied to the "update interval" set in task manager.
While doing a bit of research for you, though, I found this MSDN article that shows a Microsoft-recommended way of calculating the CPU time of a set of instructions; this might point you a bit towards their calculation as well.
No, it's not "since boot time" - it's far more time-sensitive than that.
It's "proportion of time during which a CPU was actively running a thread in that process since the last refresh". (Where the refresh rate is typically about a second.) In task manager I believe it's then divided by the number of CPUs, so the total ends up being 100% (i.e. on a dual core machine, a single-threaded CPU hog will show as 50%). Other similar programs sometimes don't do this, giving a total of 100% * cores.
You may also want to check this article as the way CPU cycles are handled with regards to scheduling was changed as part of Vista. I presume that this also applies to Win7.
See the source code of Task Manager