I am migrating a web application written in Go from AWS Elastic Beanstalk to Kubernetes, and I noticed that the garbage collector pause times (I am using New Relic to monitor the application) increased roughly 100-fold when running on Kubernetes.
I believe it is related to the CPU limiting that Kubernetes does.
Does anyone have any idea what is really causing this? Is it possible to overcome it?
Below is a small example of this difference.
Elastic Beanstalk:
Kubernetes:
After some tests and more research I discovered some interesting things.
The CPU limit on Docker seems to have a great influence on GC time/pauses. After some tests I settled on a CPU limit of 500m, which means half of a single core on the 8-core machine.
I set GOMAXPROCS=1 and GOGC=1000, and this led to fewer and faster GC pauses; however, the average memory usage increased.
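For reference, here is a minimal sketch of the programmatic equivalents of those two settings; the values are the ones from the experiment above, and setting them via environment variables in the pod spec works just as well.

    // A minimal sketch: the programmatic equivalents of GOMAXPROCS=1 and
    // GOGC=1000 used in the experiment above.
    package main

    import (
        "fmt"
        "runtime"
        "runtime/debug"
    )

    func main() {
        // With a 500m quota the process effectively has half a core, so
        // more than one busy OS thread just gets throttled by the CFS.
        runtime.GOMAXPROCS(1)

        // GOGC=1000: let the heap grow to 10x its live size between
        // collections, trading memory for fewer GC cycles.
        debug.SetGCPercent(1000)

        fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
    }

Libraries like go.uber.org/automaxprocs can also derive GOMAXPROCS from the container's CPU quota automatically, which avoids hard-coding the value.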
Here is a 27-hour overview of Kubernetes and Elastic Beanstalk:
Kubernetes:
Elastic Beanstalk:
I am running a Kotlin Spring Boot based service in a Kubernetes cluster that connects to a PostgreSQL database. Each request makes around 3-5 database calls, which partially run in parallel via Kotlin coroutines (with a threadpool-backed coroutine context).
No matter the configuration, this service gets throttled heavily after being hit by real traffic right after starting up. The slowness sometimes persists for 2-3 minutes and often affects only some fresh pods, not all of them.
I am looking for new avenues to analyze the problem - here's a succinct list of circumstances / stuff I am already doing:
The usual response time of my service is around 7-20ms while serving 300-400 requests / second per pod
New / autoscaled instances warm themselves up by issuing 15,000 HTTP requests against themselves. The readiness probe does not report "up" before this process finishes
We are currently setting a CPU request and limit of 2000m; changing this to 3000m reduces the issue, but latency still spikes to around 300-400ms, which is not acceptable (at most 100ms would be great, 50ms ideal)
The memory is set to 2 GB; changing this to 3 GB has no significant impact
The pods allocate 200-300 MB/s during peak load; the GC activity does not seem abnormal to me
Switching between GCs (G1 and ZGC) has no impact
We are experiencing pod throttling of around 25-50% (calculated via Kubernetes metrics) while the pod CPU usage is around 40-50%
New pods struggle to take 200-300 requests/sec even though we warm up; curiously, some pods suffer for long periods. All external factors have been analyzed, and disabling most baggage has no impact (this includes testing with tracing disabled, metric collection disabled, and the Kafka integration disabled, as well as verifying that our database load is not maxed out - it sits at around 20-30% CPU usage while network and memory usage are far lower)
The throttling is observed in custom load tests which replicate the warmup requests described above
Connecting with VisualVM during the load tests and checking the CPU time spent yields no striking issues
This is all done on managed Kubernetes from AWS
All the nodes in our cluster are the same instance type (AWS c5.2xlarge)
Any tools / avenues to investigate are appreciated - thank you! I am still puzzled why my service is getting throttled although its CPU usage is well below 100%. Our nodes are also not affected by the old kernel CFS throttling bug from before kernel 5.6 (I'm not entirely sure in which version it got fixed, but our nodes are on a very recent kernel). A sketch of how the raw throttling counters can be read from inside a pod follows.
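For completeness, here is a minimal sketch (assuming cgroup v1, where the file lives under /sys/fs/cgroup/cpu/; on cgroup v2 the equivalent counters are in /sys/fs/cgroup/cpu.stat) of dumping the CFS throttling counters that the Kubernetes throttling metrics are derived from:

    // A minimal sketch of reading the raw CFS throttling counters from
    // inside a pod. nr_throttled / nr_periods is the fraction of
    // scheduler periods in which the cgroup hit its CPU quota.
    package main

    import (
        "fmt"
        "os"
        "strconv"
        "strings"
    )

    func main() {
        data, err := os.ReadFile("/sys/fs/cgroup/cpu/cpu.stat")
        if err != nil {
            fmt.Fprintln(os.Stderr, "read cpu.stat:", err)
            os.Exit(1)
        }

        // Each line is "name value", e.g. "nr_throttled 42".
        stats := map[string]int64{}
        for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
            fields := strings.Fields(line)
            if len(fields) == 2 {
                v, _ := strconv.ParseInt(fields[1], 10, 64)
                stats[fields[0]] = v
            }
        }

        if periods := stats["nr_periods"]; periods > 0 {
            ratio := float64(stats["nr_throttled"]) / float64(periods)
            fmt.Printf("throttled in %.1f%% of CFS periods, %.2fs throttled in total\n",
                ratio*100, float64(stats["throttled_time"])/1e9)
        }
    }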
In the end this all boiled down to missing one part of the equation: the I/O bound.
Imagine one request takes 10 DB calls, each taking 3 milliseconds to fulfill (including network latency etc.). A single request then takes 10 * 3 = 30 milliseconds of I/O. One thread can therefore handle 1000ms / 30ms = 33.33 requests/second. If one service instance uses 10 threads to handle requests, we get 333.3 requests/second as our upper bound on throughput. We can't get any faster than this because we are I/O bottlenecked with respect to our thread count.
And this leaves out multiple factors like:
thread pool size vs. DB connection pool size
our service doing non-DB-related work (actual logic, JSON serialization when the response is built)
database capacity (was not an issue for us)
TL;DR: You can't get faster when you are I/O bottlenecked, no matter how much CPU you provide. I/O has to improve if you want a single service instance to have more throughput; this mostly comes down to sizing the DB connection pool in relation to the thread pool in relation to the number of DB calls per request. We missed this basic (and well-known) relationship between resources!
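A small sketch of that back-of-the-envelope bound, using the illustrative numbers from the example above (10 DB calls at 3 ms each, 10 request-handling threads; none of these are measurements):

    // Back-of-the-envelope I/O throughput bound from the example above.
    package main

    import "fmt"

    func main() {
        const (
            dbCallsPerRequest = 10
            dbCallLatencyMs   = 3.0
            threads           = 10
        )

        ioPerRequestMs := dbCallsPerRequest * dbCallLatencyMs // 30 ms of I/O per request
        perThreadRPS := 1000.0 / ioPerRequestMs               // 33.33 requests/second
        maxRPS := perThreadRPS * threads                      // 333.3 requests/second

        fmt.Printf("I/O per request: %.0f ms\n", ioPerRequestMs)
        fmt.Printf("Upper bound: %.1f requests/second\n", maxRPS)
    }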
I am facing performance issues with my Spring Boot application: when I increase the concurrency, performance degrades and responses become about 10 times slower. I am using EFS for some file processing. From my analysis, I am under the impression that accessing EFS files takes a significant amount of time as the number of users increases. Is there any way to improve EFS performance under increased concurrency?
I have tried changing the EFS throughput mode from Provisioned to Bursting but couldn't observe any significant performance improvement.
I would really appreciate any advice/suggestions/experience in line with this scenario.
I have just gotten into Kubernetes and am really liking its ability to orchestrate containers. I had assumed that when the app starts to grow, I could simply increase the replicas to handle the demand. However, now that I have run some benchmarks, the results confuse me.
I am running Laravel 6.2 w/ Apache on GKE with a single g1-small machine as the node. I'm only using NodePort service to expose the app since LoadBalancer seems expensive.
The benchmarking tools used are wrk and ab. When the replica count is increased to 2, requests/s somehow drops. I would expect requests/s to increase since there are 2 pods available to serve requests. Is there a bottleneck occurring somewhere, or is my understanding flawed? I hope someone can point out what I'm missing.
A g1-small instance is really tiny: you get 50% utilization of a single core and 1.7 GB of RAM. You don't describe what your application does or how you've profiled it, but if it's CPU-bound, then adding more replicas of the process won't help you at all; you're still limited by the amount of CPU that GCP gives you. If you're hitting the memory limit of the instance that will dramatically reduce your performance, whether you swap or one of the replicas gets OOM-killed.
The other thing that can affect this benchmark is that, sometimes, for a limited time, you can be allowed to burst up to 100% CPU utilization. So if you got an instance and ran the first benchmark, it might have used a burst period and seen higher performance, but then re-running the second benchmark on the same instance might not get to do that.
In short, you can't just crank up the replica count on a Deployment and expect better performance. You need to identify where in the system the actual bottleneck is. Monitoring tools like Prometheus that can report high-level statistics on per-pod CPU utilization can help. In a typical database-backed Web application the database itself is the bottleneck, and there's nothing you can do about that at the Kubernetes level.
I've noticed a strange thing happening on my PostgreSQL Amazon RDS read replica.
We've done a "stress test" of dozens of parallel high-load read requests. Performance was really good at the beginning of the test but then decreased rapidly, while PostgreSQL itself kept holding dozens of SELECT queries that had been fast before it got stuck.
I opened the monitoring statistics tab in the RDS console and saw that, along with the visible performance drop, the Read IOPS number also decreased from 3000/sec to 300/sec and didn't go above 300 IOPS for a long time.
At the same time, CPU usage was really low (~3%), and there weren't any problems with RAM or storage space.
So my question: are there any documented limitations on Read IOPS for a read replica? It looks like Amazon RDS automatically reduced the IOPS ceiling after a period of really high load (3000/sec).
The read-replica server runs on a db.t2.large instance with 100 GB of General Purpose (SSD) storage, without the Provisioned IOPS feature.
The behavior you describe is exactly as documented for the underlying storage class GP2.
GP2 is designed to [...] deliver a consistent baseline performance of 3 IOPS/GB
GP2 volumes smaller than 1 TB can also burst up to 3,000 IOPS.
https://aws.amazon.com/ebs/details/
3 IOPS/GB on a 100 GB volume is 300 IOPS, which matches the floor you observed.
See also http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html for a description of how IOPS credits work. While your system isn't busy, it will build up credits that can be used for the next burst.
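To make the credit mechanics concrete, here is a small sketch of the burst arithmetic for this volume; the 5.4 million initial credit balance is the figure from the EBS documentation linked above, and the numbers are illustrative:

    // A small sketch of the GP2 burst-bucket arithmetic: a volume earns
    // credits at its baseline rate and spends them while bursting above
    // it. Numbers are for the 100 GB volume in question.
    package main

    import "fmt"

    func main() {
        const (
            volumeGB      = 100.0
            baselineIOPS  = 3.0 * volumeGB // 3 IOPS per GB -> 300 IOPS
            burstIOPS     = 3000.0         // burst ceiling for gp2 volumes under 1 TB
            creditBalance = 5400000.0      // documented initial/maximum credit bucket
        )

        // While bursting at full speed, credits drain at (burst - baseline) per second.
        burstSeconds := creditBalance / (burstIOPS - baselineIOPS)
        fmt.Printf("baseline: %.0f IOPS\n", baselineIOPS)
        fmt.Printf("full-speed burst lasts about %.0f s (~%.0f minutes)\n",
            burstSeconds, burstSeconds/60)
    }

That works out to roughly 2000 seconds (about half an hour) of full-speed bursting from a full bucket, after which the volume falls back to its 300 IOPS baseline - consistent with the drop you saw mid-test.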
I'm trying to run RabbitMQ on a small VPS (512 MB RAM) along with Nginx and a few other programs. I've been able to tweak the memory usage of everything else without difficulty, but I can't seem to get RabbitMQ to use any less RAM.
I think I need to reduce the number of threads Erlang uses for RabbitMQ, but I've not been able to get it to work. I've also tried setting the vm_memory_high_watermark to a few different values below the default (of 40%), even as low as 5%.
Part of the problem might be that the VPS provider (MediaTemple) allows me to go over my allocated memory, so when using free or top, it shows that the server has around 900 MB.
Any suggestions to reduce memory usage by RabbitMQ, or to limit the number of threads Erlang will create? I believe Erlang is using 30 threads, based on the -A30 flag I've seen on the process command line.
Ideally I'd like RabbitMQ's memory usage to be below 100 MB.
Edit:
With vm_memory_high_watermark set to 5% (or 0.05 in the config file), the RabbitMQ logs report that RabbitMQ's memory limit is set to 51 MB. I'm not sure where 51 MB comes from. The VPS's current allocated memory is 924 MB, so 5% of that should be around 46 MB.
According to htop/free, before starting RabbitMQ I'm sitting at around 453 MB of used RAM, and after starting RabbitMQ I'm at around 650 MB - nearly a 200 MB increase. Could it be that 200 MB is the lower limit RabbitMQ will run with?
Edit 2
Here are some screenshots of ps aux and free before and after starting RabbitMQ and a graph showing the memory spike when RabbitMQ is started.
Edit 3
I also checked with no plugins enabled, and it made very little difference. It seems the plugins I had (management and its prerequisites) only added about 8 MB of RAM usage.
Edit 4
I no longer have this server to test with; however, there is a config setting delegate_count that defaults to 16. As far as I know, this spawns 16 supervisor processes for RabbitMQ. Lowering this number on smaller servers may help reduce the memory footprint. I have no idea whether this actually works or how it impacts performance, but it's something to try.
The appropriate way to limit memory usage in RabbitMQ is using the vm_memory_high_watermark. You said:
I've also tried setting the vm_memory_high_watermark to a few different values below the default (of 40%), even as low as 5%.
This should work, but it might not be behaving the way you expect. In the logs, you'll find a line that tells you what the absolute memory limit is, something like this:
=INFO REPORT==== 29-Oct-2009::15:43:27 ===
Memory limit set to 2048MB.
You need to tweak the memory limit as needed - Rabbit might be seeing your system as having a lot more RAM than you think it has if you're running on a VPS environment.
Sometimes, Rabbit can't tell what system it's on and uses 1 GB as the base point (so you get a limit of 410 MB by default).
Also, make sure you are running on a version of RabbitMQ that supports the vm_memory_high_watermark setting - ideally you should run with the latest stable release.
Make sure to set an appropriate QoS prefetch value. By default, if there's a client, the Rabbit server will send any messages it has for that client's queue straight to the client. This results in extensive memory usage on both the client and the server.
Drop the prefetch limit down to something reasonable, say 100, and Rabbit will keep the remaining messages on disk on the server until the client is really ready to process them, and your memory usage will go way down on both the client and the server.
Note that the suggestion of 100 is just a reasonable starting point - it sure beats infinity. To really optimize that number, you'll want to take into consideration the messages/sec your client can process, the latency of your network, and how large each of your messages is on average.
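To make that concrete, here is a hedged sketch of setting the prefetch from a consumer using the Go AMQP client (github.com/rabbitmq/amqp091-go) - not necessarily the client you are using; the connection URL and queue name are placeholders:

    // A sketch of applying the prefetch advice: cap unacknowledged
    // messages per consumer so the broker doesn't push its whole
    // backlog into client and server memory at once.
    package main

    import (
        "log"

        amqp "github.com/rabbitmq/amqp091-go"
    )

    func main() {
        conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        ch, err := conn.Channel()
        if err != nil {
            log.Fatal(err)
        }
        defer ch.Close()

        // At most 100 unacknowledged messages in flight per consumer.
        if err := ch.Qos(100, 0, false); err != nil {
            log.Fatal(err)
        }

        // autoAck=false so the prefetch window actually applies: the
        // server only tops us up as we acknowledge messages.
        msgs, err := ch.Consume("work-queue", "", false, false, false, false, nil)
        if err != nil {
            log.Fatal(err)
        }
        for m := range msgs {
            // ... process the message ...
            m.Ack(false)
        }
    }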