We are running a Windows 2019 cluster in AWS ECS.
From time to time the instances run into problems with elevated CPU and memory usage that is not related to container usage.
When checking the instances, we can see that the vmcompute process has spiked its memory usage (commit) to up to 90% of the system memory, with an average CPU usage of at least 30-40%.
I fail to understand why that is happening, and whether it is a real issue.
Or will the memory and CPU usage decrease once more load is put onto the containers?
Related
We have a 20-core CPU running on a KVM (CentOS 7.8)
We have two heavy enterprise JAVA applications (Java 8) running on the same node.
We are using ParallelGC in both, and by default 14 GC threads show up in each (the default number is determined using roughly 5/8 * no. of cores).
Is it okay for the GC threads (combined 14 + 14 = 28) to exceed the number of cores (20) in the system? Will there be issues when the GC threads of both JVM instances run concurrently?
Would it make sense to reduce the number of GC threads to 10 each?
How can we determine the minimum number of GC threads (ParallelGC) needed to get the job done without impacting the application?
Will there be issues when the GC threads of both JVM instances run concurrently?
Well, if both JVMs run the GC at the same time, their respective GC runs may take longer because fewer physical cores are available. However, the OS should (roughly speaking) give both JVMs an equal share of the available CPU, so nothing should break.
Would it make sense to reduce the number of GC threads to 10 each?
That would mean that if two JVMs were running the GC simultaneously then they wouldn't be competing for CPU.
But the flip-side is that since a JVM would then have only 10 threads for GC, its GC runs will always take longer than if it had 16 threads ... and 16 cores were currently available.
Bear in mind that most of the time you would expect that the JVMs are not GC'ing at the same time. (If they are, then probably something else is wrong¹.)
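In case you decide to pin the thread count explicitly rather than relying on the ergonomics, it is set per JVM on the command line with a standard HotSpot flag (the value and jar name below are just illustrative):
$ java -XX:+UseParallelGC -XX:ParallelGCThreads=10 -jar your-app.jar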
How can we determine the minimum number of GC threads (ParallelGC) needed to get the job done without impacting the application?
By a process of trial and error:
Pick an initial setting
Measure performance with an indicative benchmark workload
Adjust setting
Repeat ... until you have determined the best settings.
But beware that if your actual workload doesn't match your benchmark, your settings may be "off".
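For the "measure" step, the GC's own logs are usually enough to compare settings; on Java 8 you could, for example, run the benchmark with (these are standard HotSpot flags, while the jar and log file names are placeholders):
$ java -XX:+UseParallelGC -XX:ParallelGCThreads=10 -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -jar your-app.jar
Comparing pause times and total GC time in gc.log across runs with different thread counts shows you the smallest count that still meets your pause and throughput targets.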
My advice would be that if you have two CPU intensive, performance critical applications, and you suspect that they are competing for resources, you should try to run them on different (dedicated!) compute nodes.
There is only so much you can achieve by "fiddling with the tuning knobs".
¹ - Maybe the applications have memory leaks. Maybe they need to be profiled to look for major CPU hotspots. Maybe you don't have enough RAM, and the real competition is for physical RAM pages and swap device bandwidth (during GC) rather than CPU.
We are having an interesting issue where we see a CPU spike on our EC2 instance and, at the same time, a spike in disk latency. Here is the pattern for the CPU spike:
CPU spike from 50% to 100% within 30 seconds
It stays at 100% utilization for two minutes
CPU utilization drops from 100% to almost 0 in 10 seconds. At the same time, disk latency is almost back to normal.
This issue has happened on different AWS EC2 instances a couple of times over a week and is still happening. In all cases we see the CPU spike along with disk latency, with the CPU spike following a similar pattern as above.
We had put process monitoring tools in place to check whether any particular process was occupying the CPU. The tool revealed that each process on the EC2 instance starts taking approximately twice its usual CPU. For example, our app server's CPU utilization increases from 0.75% to 1.5%. Similar observations hold for Nginx and other processes. There was no single process occupying more than 8% CPU. We studied our traffic pattern and there is nothing unusual that could cause this. So the question is:
Can an increase in disk latency cause a CPU spike pattern like the one above, or, in general, can disk latency result in a CPU spike?
Here is my bet: you are running t2/t3 machines, which are burstable instances. You can use roughly 30% of the CPU all the time, and a credit system creates a fair, predictable usage model for the remaining 70%. You earn credits by running the instance and you spend credits by going over the ~30% baseline CPU usage.
You are running out of credits, and AWS then reduces your access to the CPU. The system runs smoothly again once credits are added back to your balance.
t2 and t3 don't have the same credit system; you can find details here: CPU Credits and baseline
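As a rough worked example (using the published t3.medium figures, so treat the exact numbers as an assumption rather than something stated above): one CPU credit is one vCPU running at 100% for one minute. A t3.medium earns 24 credits per hour and has 2 vCPUs, so running both vCPUs flat out costs 120 credit-minutes per hour while only 24 are earned; the roughly 20% baseline is exactly that 24/120 ratio, and once the accumulated balance reaches zero the instance is throttled back to the baseline.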
You have two solutions:
Take a bigger instance, so you have more credits per hour and a better baseline, or move to another family such as c5, m5, r5, etc.
Turn on the unlimited mode option for your t3 instances.
I would suggest looking at faster storage. CPU time is always accounted for so that it adds up to 100%, split across a handful of categories, which means throttling or waiting can show up as apparent "usage" even when no real work is being done. The possible categories are:
idle time (notice this is what you consider FREE CPU; that's why I say it all adds up to 100%)
user time (normal usage)
system time (system usage)
iowait (your case: the CPU is waiting for the HDD/SSD to answer)
nice time (low-priority processes that are not included in user time)
interrupt time (external device "talk" time - could be your case if you have many USB devices etc. - rather unlikely)
softirq (queued work from a processed interrupt - see above)
steal time (the case that Clement is describing)
I would suggest first confirming which one of these is your case.
You can try the commands below to get the info (a small sketch that reads /proc/stat directly follows them):
$ sudo apt-get install sysstat   # provides mpstat (Debian/Ubuntu)
$ mpstat -P ALL 1                # per-CPU breakdown every second; watch %iowait and %steal
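If you prefer to see where those numbers come from, here is a minimal Java sketch (my own illustration, not part of any tool mentioned above) that samples /proc/stat twice and prints how the interval splits across the categories listed earlier; it assumes a Linux host and the field order documented in proc(5):

import java.nio.file.Files;
import java.nio.file.Paths;

public class CpuBreakdown {
    // Field names for the aggregate "cpu" line of /proc/stat, in order (see proc(5)).
    private static final String[] FIELDS = {
            "user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"
    };

    // Read the first line of /proc/stat and return the first eight tick counters.
    private static long[] sample() throws Exception {
        String line = Files.readAllLines(Paths.get("/proc/stat")).get(0);
        String[] parts = line.trim().split("\\s+");   // parts[0] is the literal "cpu"
        long[] ticks = new long[FIELDS.length];
        for (int i = 0; i < FIELDS.length; i++) {
            ticks[i] = Long.parseLong(parts[i + 1]);
        }
        return ticks;
    }

    public static void main(String[] args) throws Exception {
        long[] before = sample();
        Thread.sleep(1000);                           // one-second measurement interval
        long[] after = sample();

        long total = 0;
        long[] delta = new long[FIELDS.length];
        for (int i = 0; i < FIELDS.length; i++) {
            delta[i] = after[i] - before[i];
            total += delta[i];
        }
        for (int i = 0; i < FIELDS.length; i++) {
            System.out.printf("%-8s %5.1f%%%n", FIELDS[i], 100.0 * delta[i] / total);
        }
    }
}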
From here there are two options for you :)
EBS allows you to run an IO-optimized volume type called "io1" (mid price - mid speed)
Change the machine and use one built on the "Nitro System" (provides bare-metal-like capabilities - that is, as if you had an actual NVMe drive connected directly - maximum possible speed)
Instance        vCPU   ECU   Memory    Instance Storage      Price
m5.2xlarge      8      37    32 GiB    EBS only              $0.384 per Hour
m5d.2xlarge     8      37    32 GiB    1 x 300 NVMe SSD      $0.452 per Hour
Source: Instances built on the Nitro System
We have Zalenium installed on an Azure Linux VM that has 16 vCPUs and 64 GB of memory.
We have configured Zalenium with:
- Max 10 containers
- Max tests per container = 10
- Video recording = Only Failed Tests
When we execute 10 tests in parallel, we notice the memory usage is about 10 GB but the CPU usage is at 70%.
This high resource usage will affect our ability to scale beyond 10 containers, as the Azure cost would be too high.
My question is, has anyone else seen such high resource usage, and is there any advice on how to bring that down?
Thanks
That is normal: each container uses on average 1 GB of RAM and 1 CPU (2 CPUs if video recording is enabled). Memory consumption also depends on how heavy the site under test is.
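As a rough sanity check on those numbers (my arithmetic, not anything from the Zalenium docs): 10 containers at about 1 CPU each keeps roughly 10 of your 16 vCPUs busy, which is already around 60-65% CPU, and any failed test that triggers video recording pushes its container toward 2 CPUs, which is consistent with the ~70% you are seeing. Memory-wise, 10 containers at ~1 GB each plus the hub and the OS lands in the same ballpark as your observed 10 GB.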
I'm a bit stumped on searching for a tool to help me with a pesky load balancing problem.
Say you have a large variety of repeated hourly, daily and weekly units of code that can vary greatly in their RAM, CPU, disk and network usage. The usage is known and tracked fairly well.
Now say you have N servers to execute these tasks. The RAM, CPU, disk and network limits of these servers are known as well.
What tools are available that can automatically delegate tasks to these servers in a manner that ensures resources won't be strained?
For example (simplified; a rough sketch of this first-fit logic follows the list):
Task A - Consumes 20% CPU, 15GB of RAM -> Push to Server P that has 100% CPU and 32GB RAM Available
Task B - Consumes 10% CPU, 20GB of RAM -> Server P would be Memory strained if I allocate this. Push to Server L that has 100% CPU and 32GB RAM Available
Task C - Consumes 10% CPU, 2GB of RAM -> Push to Server P that has 80% CPU and 17GB RAM Available
Task A - Reports finish, Server P now has 90% CPU and 30GB RAM
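For what it's worth, here is a minimal, illustrative Java sketch of the first-fit logic in the example above (all class, field and server names are made up for the illustration):

import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical model of a task's resource demand and a server's remaining capacity.
class Task {
    final String name; final double cpuPct; final double ramGb;
    Task(String name, double cpuPct, double ramGb) { this.name = name; this.cpuPct = cpuPct; this.ramGb = ramGb; }
}

class Server {
    final String name; double freeCpuPct; double freeRamGb;
    Server(String name, double freeCpuPct, double freeRamGb) { this.name = name; this.freeCpuPct = freeCpuPct; this.freeRamGb = freeRamGb; }

    boolean canFit(Task t) { return t.cpuPct <= freeCpuPct && t.ramGb <= freeRamGb; }
    void allocate(Task t)  { freeCpuPct -= t.cpuPct; freeRamGb -= t.ramGb; }
    void release(Task t)   { freeCpuPct += t.cpuPct; freeRamGb += t.ramGb; }
}

public class FirstFitScheduler {
    private final List<Server> servers = new ArrayList<>();

    FirstFitScheduler(List<Server> servers) { this.servers.addAll(servers); }

    // Greedy first-fit: push the task to the first server with enough headroom.
    Optional<Server> dispatch(Task t) {
        for (Server s : servers) {
            if (s.canFit(t)) { s.allocate(t); return Optional.of(s); }
        }
        return Optional.empty(); // nothing has capacity right now; queue or reject
    }

    public static void main(String[] args) {
        Server p = new Server("P", 100, 32);
        Server l = new Server("L", 100, 32);
        FirstFitScheduler sched = new FirstFitScheduler(List.of(p, l));

        Task a = new Task("A", 20, 15), b = new Task("B", 10, 20), c = new Task("C", 10, 2);
        System.out.println("A -> " + sched.dispatch(a).map(s -> s.name).orElse("queued")); // P (80% CPU, 17 GB left)
        System.out.println("B -> " + sched.dispatch(b).map(s -> s.name).orElse("queued")); // L (P lacks the RAM)
        System.out.println("C -> " + sched.dispatch(c).map(s -> s.name).orElse("queued")); // P (70% CPU, 15 GB left)

        p.release(a); // Task A reports finish: P is back to 90% CPU and 30 GB RAM free
    }
}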
I feel like this is a common problem, and I'm not sure why, but I'm having a heck of a time finding anything in my Google adventures.
It would be pretty straightforward to code something myself, but if there's already a tried and tested tool for this, why re-invent the wheel?
The closest I could find was a tool called Hangfire: https://www.hangfire.io/overview.html
It is somewhat close, but it looks like it ignores the resource-aware load balancing aspect of scheduling, which is key. I already have a solid scheduling system in place as it is, so this won't really help me.
We have a transaction-intensive process at one customer site running on a server with four quad-core processors. The process is designed to take advantage of every core available, so in this installation we take an input queue, divide it into sixteenths and allocate each fraction of the queue to a core. It works well and keeps up with the transaction volume on the box.
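The pattern described here (split the input queue into one slice per core and let each core drain its own slice) looks roughly like the following Java sketch; the class and method names are placeholders, not the actual module:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PartitionedQueueProcessor {
    public static void process(List<String> inputQueue) throws InterruptedException {
        int cores = Runtime.getRuntime().availableProcessors();   // 16 on the box described
        ExecutorService pool = Executors.newFixedThreadPool(cores);

        // Divide the queue into one slice per core and hand each slice to a worker.
        int sliceSize = (inputQueue.size() + cores - 1) / cores;
        for (int i = 0; i < inputQueue.size(); i += sliceSize) {
            List<String> slice = inputQueue.subList(i, Math.min(i + sliceSize, inputQueue.size()));
            pool.submit(() -> slice.forEach(PartitionedQueueProcessor::handleTransaction));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // Placeholder for the real per-transaction work.
    private static void handleTransaction(String tx) {
        // ... parse, validate, write, etc.
    }
}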
Looking at the CPU utilization on the box, it never seems to go above 33%. Now we have a new customer with at least double the volume of the existing customer. Some of us argue that since CPU usage is way below maximum utilization, we should go with the same configuration.
Others claim that there is no direct correlation between CPU utilization and transaction processing speed, and that since the logic of the underlying software module is based on the number of available cores, it makes sense to obtain a box with proportionately more cores for the new client to accommodate the increased traffic volume.
Does anyone have a sense as to who is right in this instance?
Thank you,
To determine the optimum configuration for your new customer, understanding the reason for low CPU usage is paramount.
Very likely, the reason is one of the following:
Your process is limited by memory bandwidth. In this case, faster RAM will help if supported by the motherboard. If possible, a redesign to limit the amount of data accessed during processing will improve performance. Adding more CPU cores will, on its own, do nothing to improve performance.
Your process is limited by disk I/O. Using faster disk connections (SATA etc.) and/or upgrading to a SSD might help, but more CPU power will not.
Your process is limited by synchronization contention. In this case, adding more threads for more cores might even be counterproductive. Redesigning your algorithm might help here.
Having said this, I have also seen situations where processes that are definitely CPU-bound fail to show 100% CPU usage on modern processors (Core i7 etc.), because in certain Turbo Boost scenarios Task Manager will report less than 100% even though the CPU is saturated.
As 9000 said, you need to find out what your bottlenecks are when under load. Perfmon might provide enough data to find out.
Another afterthought: you could limit your process on the existing machine to a subset of the cores (but still at least 30% of them, so that in theory the CPU doesn't become a bottleneck due to this limitation) and check whether overall throughput degrades. If it does not, adding more cores will not improve performance.
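One lightweight way to run that experiment on Windows (assuming the process can be launched from a command prompt; the executable name and mask below are placeholders) is the built-in affinity switch, e.g. start /affinity 0x3F yourprocess.exe pins the process to the first six cores; the affinity of an already running process can also be changed via Task Manager's "Set affinity" option.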