I am facing performance issues with my Spring Boot application: when I increase the concurrency, performance degrades and responses become up to 10 times slower. I am using EFS for some file processing. From my analysis, I have the impression that accessing EFS files takes a significant amount of time when I increase the number of users. Is there any way to improve EFS performance under increased concurrency?
I have tried switching the EFS throughput mode from Provisioned to Bursting but couldn't observe any significant performance improvement.
I would really appreciate any advice, suggestions, or experience in line with this scenario.
I am running a Kotlin Spring Boot service in a Kubernetes cluster that connects to a PostgreSQL database. Each request makes around 3-5 database calls, which partially run in parallel via Kotlin coroutines (on a thread-pool-backed coroutine context).
No matter the configuration, this service gets throttled heavily when it is hit by real traffic right after starting up. The slowness sometimes persists for 2-3 minutes and often only affects some fresh pods, but not all.
I am looking for new avenues to analyze the problem - here's a succinct list of circumstances / stuff I am already doing:
The usual response time of my service is around 7-20ms while serving 300-400 requests / second per pod
New / autoscaled instances warm themselves up by making 15000 HTTP requests against themselves. The readiness probe does not report "up" before this process finishes
We are currently setting a CPU request and limit of 2000m; changing this to 3000m reduces the issue, but latency still spikes to around 300-400 ms, which is not acceptable (at most 100 ms would be great, 50 ms ideal)
The memory is set to 2 GB; changing this to 3 GB has no significant impact
The pods allocate 200-300 MB/s during peak load, and the GC activity does not seem abnormal to me
Switching between GCs (G1 and ZGC) has no impact
We are experiencing pod throttling of around 25-50% (calculated via Kubernetes metrics) while the pod CPU usage is around 40-50%
New pods struggle to take 200-300 requests/sec even though we warm up; curiously enough, some pods suffer for long periods. All external factors have been analyzed and disabling most baggage has no impact (this includes testing with tracing disabled, metric collection disabled, and the Kafka integration disabled, and verifying our database load is not maxing out - it sits at around 20-30% CPU usage while network and memory usage are far lower)
The throttling is observed in custom load tests which replicate the warmup requests described above
Connecting with VisualVM during the load tests and checking the CPU time spent yields no striking issues
This is all done on managed Kubernetes on AWS
All the nodes in our cluster are of the same type (AWS c5.2xlarge)
Any tools / avenues to investigate are appreciated - thank you! I am still puzzled why my service is getting throttled although its CPU usage is well below 100%. Our nodes are also not affected by the old kernel CFS bug from before kernel 5.6 (I'm not entirely sure in which version it was fixed, but our node kernel versions are very recent).
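One concrete thing worth checking (a rough diagnostic sketch, assuming the nodes expose the usual cgroup CPU accounting files): read the container's cpu.stat from inside a pod to see how often the CFS quota actually throttles it, and compare that with what the JVM reports as its available processor count.

```kotlin
import java.io.File

// cpu.stat lives at /sys/fs/cgroup/cpu.stat on cgroup v2 nodes and under
// /sys/fs/cgroup/cpu[,cpuacct]/cpu.stat on cgroup v1 nodes. nr_throttled and
// throttled_time / throttled_usec show how often and how long the container was throttled.
fun main() {
    println("JVM sees ${Runtime.getRuntime().availableProcessors()} available processors")

    val candidates = listOf(
        "/sys/fs/cgroup/cpu.stat",             // cgroup v2
        "/sys/fs/cgroup/cpu/cpu.stat",         // cgroup v1
        "/sys/fs/cgroup/cpu,cpuacct/cpu.stat"  // cgroup v1, combined controller mount
    )
    val statFile = candidates.map { File(it) }.firstOrNull { it.exists() }
    if (statFile == null) {
        println("No cpu.stat found - is a CPU limit set on this container?")
    } else {
        println("--- ${statFile.path} ---")
        statFile.readLines().forEach { println(it) }
    }
}
```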
In the end this all boiled down to missing one part of the equation: I/O bounds.
Imagine one request takes 10 DB calls, each taking 3 milliseconds to fulfill (including network latency etc.). A single request then takes 10 * 3 = 30 milliseconds of I/O, so a single handler thread can serve at most 1000 ms / 30 ms = 33.33 requests/second. Now if one service instance uses 10 threads to handle requests, we get 333.3 requests/second as our upper bound on throughput. We can't get any faster than this because we are I/O-bottlenecked with respect to our thread count.
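As a rough illustration of that arithmetic (a standalone toy simulation, not the actual service): give 10 handler threads a workload where every request blocks for 30 ms of I/O and measure the resulting throughput.

```kotlin
import kotlinx.coroutines.asCoroutineDispatcher
import kotlinx.coroutines.joinAll
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking
import java.util.concurrent.Executors
import kotlin.system.measureTimeMillis

fun main() {
    val handlerThreads = 10
    val ioPerRequestMs = 30L   // 10 DB calls x 3 ms each
    val totalRequests = 1_000

    val pool = Executors.newFixedThreadPool(handlerThreads)
    val dispatcher = pool.asCoroutineDispatcher()

    val elapsedMs = measureTimeMillis {
        runBlocking {
            (1..totalRequests).map {
                launch(dispatcher) {
                    Thread.sleep(ioPerRequestMs)  // simulate blocking DB I/O
                }
            }.joinAll()
        }
    }
    pool.shutdown()

    // Prints roughly 333 req/s - the I/O bound - no matter how many CPU cores the machine has.
    println("Throughput: %.1f req/s".format(totalRequests * 1000.0 / elapsedMs))
}
```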
And this leaves out multiple factors like:
thread pool size vs. db connection pool size
our service doing non-DB-related work (actual logic, JSON serialization when the response is produced)
database capacity (was not an issue for us)
TL;DR: You can't get faster when you are I/O-bottlenecked, no matter how much CPU you provide. I/O has to improve if you want a single service instance to have more throughput; this mostly comes down to sizing the DB connection pool in relation to the thread pool and the number of DB calls per request. We missed this basic (and well-known) relationship between resources!
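For completeness, a minimal sketch of what that sizing looks like with HikariCP in a Kotlin service (the numbers, URL and credentials are placeholders, not the configuration we actually run):

```kotlin
import com.zaxxer.hikari.HikariConfig
import com.zaxxer.hikari.HikariDataSource

// Starting point: one pooled connection per request-handling thread, so threads don't
// queue waiting for a connection on top of the unavoidable per-call I/O latency.
fun buildDataSource(requestThreads: Int = 10): HikariDataSource {
    val config = HikariConfig().apply {
        jdbcUrl = "jdbc:postgresql://db.example.internal:5432/app"  // placeholder URL
        username = "app"                                            // placeholder credentials
        password = "change-me"
        maximumPoolSize = requestThreads
        minimumIdle = requestThreads
    }
    return HikariDataSource(config)
}
```

The right ratio depends on how many of the 3-5 DB calls per request actually run concurrently; it is worth measuring rather than simply maxing the pool out.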
Are there any clear ideas that define a scalability test? I have designed load, stress, spike and soak tests using the JMeter Ultimate Thread Group, but I have no idea how a scalability test differs from these. How do I design a good scalability test with the Ultimate Thread Group in JMeter for a maximum user count of 500?
As per the Wikipedia article on Scalability testing:
Scalability testing is the testing of a software application to measure its capability to scale up or scale out in terms of any of its non-functional capabilities.
So basically you can use the same approach as for stress testing: start with 1 virtual user and gradually ramp the load up to your maximum of 500 virtual users.
Then you need to pay attention to the following charts/KPIs:
Active Threads Over Time - to show number of active virtual users
Transactions per Second - to show the system throughput
Charts of system resources consumption - to show usage of CPU, RAM, etc.
Ideally the charts should be more or less linear, i.e. if you increase the load by a factor of 2x, the throughput should increase by the same factor and resource consumption should increase proportionally.
If the charts are not proportional, then at some point the system is no longer able to keep the threads-to-transactions-per-second ratio, and that point indicates the bottleneck.
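To make "proportional" concrete, here is a tiny sketch (with made-up numbers) that compares measured throughput at each load step against perfectly linear scaling from the first step:

```kotlin
fun main() {
    // Hypothetical results from a ramp-up to the 500-user target.
    val users = listOf(100, 200, 300, 400, 500)
    val measuredTps = listOf(250.0, 490.0, 700.0, 820.0, 850.0)

    val tpsPerUserAtBaseline = measuredTps[0] / users[0]
    users.zip(measuredTps).forEach { (u, tps) ->
        val linearTps = tpsPerUserAtBaseline * u
        val efficiency = 100 * tps / linearTps
        println("users=%d  measured=%.0f tps  linear=%.0f tps  efficiency=%.0f%%"
            .format(u, tps, linearTps, efficiency))
    }
}
```

A sharp drop in the efficiency column marks the load level where the bottleneck kicks in.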
I have just gotten into Kubernetes and really liking its ability to orchestrate containers. I had the assumption that when the app starts to grow, I can simply increase the replicas to handle the demand. However, now that I have run some benchmarking, the results confuse me.
I am running Laravel 6.2 with Apache on GKE with a single g1-small machine as the node. I'm only using a NodePort service to expose the app, since a LoadBalancer seems expensive.
The benchmarking tools used are wrk and ab. When the replica count is increased to 2, requests/s somehow drop. I would expect requests/s to increase since there are 2 pods available to serve requests. Is there a bottleneck occurring somewhere, or is my understanding flawed? I hope someone can point out what I'm missing.
A g1-small instance is really tiny: you get 50% utilization of a single core and 1.7 GB of RAM. You don't describe what your application does or how you've profiled it, but if it's CPU-bound, then adding more replicas of the process won't help you at all; you're still limited by the amount of CPU that GCP gives you. If you're hitting the memory limit of the instance that will dramatically reduce your performance, whether you swap or one of the replicas gets OOM-killed.
The other thing that can affect this benchmark is that, sometimes, for a limited time, you can be allowed to burst up to 100% CPU utilization. So if you got an instance and ran the first benchmark, it might have used a burst period and seen higher performance, but then re-running the second benchmark on the same instance might not get to do that.
In short, you can't just crank up the replica count on a Deployment and expect better performance. You need to identify where in the system the actual bottleneck is. Monitoring tools like Prometheus that can report high-level statistics on per-pod CPU utilization can help. In a typical database-backed Web application the database itself is the bottleneck, and there's nothing you can do about that at the Kubernetes level.
We're just starting to investigate using Postgres as the backend for our system, which will be used with an OLTP-type workload: >95% (possibly >99%) of the transactions will be inserting 1 row into 4 separate tables, or updating 1 row. Our test machine is running 9.5.6 (with out-of-the-box config options) on a modest cloud-hosted Windows VM with a 4-core i7 processor and a conventional 7200 RPM disk. This is much, much slower than our targeted production hardware, but useful right now for finding bottlenecks in our basic design.
Our initial tests have been pretty discouraging. Although the insert statements themselves run fairly quickly (combined execution time is around 2 ms), the overall transaction time is around 40 ms, due to the commit statement taking 38 ms. Furthermore, during a simple 3-minute load test (5000 transactions), we're only seeing about 30 transactions per second, with pgBadger reporting 3 minutes spent in "commit" (38 ms avg.) and the next-highest statements being the inserts at 10 (2 ms) and 3 (0.6 ms) respectively. During this test, the CPU on the Postgres instance is pegged at 100%.
The fact that the time spent in commit is essentially equal to the elapsed time of the test (5000 commits × 38 ms ≈ 190 s, i.e. roughly the whole 3 minutes) tells me that not only is commit serialized (unsurprising, given the relatively slow disk on this system), but that it is consuming a CPU for that entire duration, which surprises me. I would have assumed that if we were I/O-bound, we would be seeing very low CPU usage, not high usage.
From a bit of reading, it appears that using asynchronous commits would solve a lot of these issues, with the caveat of data loss on crashes or immediate shutdown. Similarly, grouping transactions together into a single begin/commit block, or using multi-row insert syntax, improves throughput as well.
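To illustrate the "group it into one transaction" variant (a sketch with made-up table and connection details, not our actual schema): commit once per batch so the synchronous WAL flush is paid once per N rows instead of once per row.

```kotlin
import java.sql.DriverManager

fun insertBatch(rows: List<Pair<Int, String>>) {
    // Placeholder connection string and table; adjust to the real schema.
    DriverManager.getConnection("jdbc:postgresql://localhost:5432/test", "postgres", "postgres").use { conn ->
        conn.autoCommit = false
        conn.prepareStatement("INSERT INTO events (id, payload) VALUES (?, ?)").use { ps ->
            for ((id, payload) in rows) {
                ps.setInt(1, id)
                ps.setString(2, payload)
                ps.addBatch()
            }
            ps.executeBatch()
        }
        conn.commit()  // one synchronous commit for the whole batch
    }
}
```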
All of these options are possible for us to employ, but in a traditional OLTP application none of them would be (you need fast, atomic, synchronous transactions). 35 transactions per second on a 4-core box would have been unacceptable 20 years ago on other RDBMSs running on much slower hardware than this test machine, which makes me think that we're doing this wrong, as I'm sure Postgres is capable of handling much higher workloads.
I've looked around but can't find common-sense config options that would serve as starting points for tuning a Postgres instance. Any suggestions?
If COMMIT is your time hog, that probably means:
Your system honors the FlushFileBuffers system call, which is as it should be.
Your I/O is miserably slow.
You can test this by setting fsync = off in postgresql.conf – but don't ever do this on a production system. If that improves performance a lot, you know that your I/O system is very slow when it actually has to write data to disk.
There is nothing that PostgreSQL (or any other reliable database) can improve here without sacrificing data durability.
Although it would be interesting to see some good starting configs for OLTP workloads, we've solved our mystery of the unreasonably high CPU during commits: it turns out it wasn't Postgres at all, it was Windows Defender constantly scanning the Postgres data files. The team that set up the VM hosting the test server didn't understand that we needed a backend (server) configuration as opposed to a user configuration.
I am migrating a web application written in Go from AWS Elastic Beanstalk to Kubernetes, and I noticed that the garbage collector pause times (I am using New Relic to monitor the application) increased by roughly a factor of 100 when running the application there.
I believe it is related to the CPU limiting that Kubernetes does.
Does anyone have any idea about what is really causing it? Is it possible to overcome it?
Below is a small example of this difference.
Elastic Beanstalk:
Kubernetes:
After some tests and more research I discovered some interesting things.
The CPU limit on Docker seems to have a great influence on GC time/pauses. After some tests I set the CPU limit to 500m, which means about 1/2 of one CPU on the 8-core machine.
I set GOMAXPROCS=1 and GOGC=1000, and this led to fewer and faster GC pauses; however, the average memory usage increased.
Here is a 27-hour overview of Kubernetes and Elastic Beanstalk:
Kubernetes:
Elastic Beanstalk: