We have Application Insights running on a web application deployed to a public interface on IIS. After approx. 1 hour of test usage (8-10 concurrent users), all requests enter a long-running state (1-2 minutes) before performance returns to normal. This pattern repeats at regular intervals during the day, generally in line with usage.
Removing App Insights "fixes" the problem. Put Insights back in, problem re-occurs.
What we know:
PerfMon counters: CPU, Memory, Network Interface, Disk I/O, % maxConcurrentCalls, % maxConcurrentInstances show no spikes during the bottleneck
A firewall blocks the outbound App Insights call and the request is left to time out
No IIS Events are raised (w3wp crash / app pool recycling) around the bottleneck time
WCF configuration is default
Given the frequency of the App Insights calls, and the fact that they are blocked and left to time out, I would expect the number of threads to max out and cause the bottleneck. But I would not expect performance to return to normal after 1-2 minutes, given that processing the queued requests would equally trigger App Insights calls.
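To make that expectation concrete, here is a toy model (a sketch only; the pool size, timings, and timeout are assumptions, not measured values) of a fixed worker pool where every request also performs a blocking outbound call that is left to time out:

    import java.util.concurrent.Executors
    import java.util.concurrent.TimeUnit

    // Toy model of the hypothesis: each request does fast real work, then a
    // blocked telemetry-style call that only returns when its timeout expires.
    fun main() {
        val pool = Executors.newFixedThreadPool(10)   // stand-in for the request threads
        val telemetryTimeoutMs = 5_000L               // assumed firewall-induced timeout
        val start = System.currentTimeMillis()
        repeat(100) { i ->
            pool.submit {
                Thread.sleep(20)                      // the actual request work: fast
                Thread.sleep(telemetryTimeoutMs)      // blocked outbound call
                println("request $i finished at ${System.currentTimeMillis() - start} ms")
            }
        }
        pool.shutdown()
        pool.awaitTermination(2, TimeUnit.MINUTES)
    }

Once all 10 workers are stuck in the timeout, every new request queues behind them, which is the choke I would expect a thread-count counter to show.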
What is the counter here that shows the bottleneck, and why does performance return to normal when it should stay choked?
I am running a Kotlin Spring Boot based service in a Kubernetes cluster that connects to a PostgreSQL database. Each request makes around 3-5 database calls, which partially run in parallel via Kotlin coroutines (with a thread-pool-backed coroutine context).
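For context, the request pattern looks roughly like this (a simplified sketch; the function names, pool size, and dispatcher setup are placeholders, not our real code):

    import kotlinx.coroutines.*
    import java.util.concurrent.Executors

    // Thread-pool-backed dispatcher for blocking JDBC calls (size is illustrative).
    val dbDispatcher = Executors.newFixedThreadPool(10).asCoroutineDispatcher()

    suspend fun fetchOrders(): List<String> = withContext(dbDispatcher) { listOf("o1") }
    suspend fun fetchCustomer(): String = withContext(dbDispatcher) { "c1" }
    suspend fun fetchPrices(): Map<String, Int> = withContext(dbDispatcher) { mapOf("o1" to 9) }

    suspend fun handleRequest(): String = coroutineScope {
        val orders = async { fetchOrders() }      // these two DB calls run in parallel
        val customer = async { fetchCustomer() }
        val prices = fetchPrices()                // a third call runs while they are in flight
        "${customer.await()}: ${orders.await()} @ $prices"
    }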
No matter the configuration, this service gets throttled heavily after being hit by real traffic right after starting up. This slowness sometimes persists for 2-3 minutes and often only affects some fresh pods, but not all.
I am looking for new avenues to analyze the problem - here's a succinct list of circumstances / stuff I am already doing:
The usual response time of my service is around 7-20ms while serving 300-400 requests / second per pod
New / autoscaled instances warm themselves up by doing 15000 HTTP requests against themselves (see the sketch after this list). The readiness probe is not "up" before this process finishes
We are currently setting a CPU request and limit of 2000m; changing this to 3000m does reduce the issue, but latency still spikes to around 300-400ms, which is not acceptable (at most 100ms would be great, 50ms ideal)
The memory is set to 2 GB; changing this to 3 GB has no significant impact
The pods are allocating 200-300 MB/s during peak load; the GC activity does not seem abnormal to me
Switching between GCs (G1 and ZGC) has no impact
We are experiencing pod throttling of around 25-50% (calculated via Kubernetes metrics) while the pod CPU usage is around 40-50%
New pods struggle to take 200-300 requests / sec even though we warm up; curiously, some pods suffer for long periods. All external factors have been analyzed and disabling most baggage has no impact (this includes testing with tracing disabled, metric collection disabled, Kafka integration disabled, and verifying that our database load is not maxing out - it is sitting at around 20-30% CPU usage while network and memory usage are far lower)
The throttling is observed in custom load tests which replicate the warmup requests described above
Connecting with visualvm during the load tests and checking the CPU time spent yields no striking issues
This is all done on managed Kubernetes by AWS
All the nodes in our cluster are of the same type (AWS c5.2xlarge)
Any tools / avenues to investigate are appreciated - thank you! I am still puzzled why my service is getting throttled although its CPU usage is way below 100%. Our nodes are also not affected by the old kernel CFS bug from before kernel 5.6 (I am not entirely sure in which version it got fixed, but we are very recent on our nodes' kernel version).
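For reference, the warmup mentioned above is roughly this (a sketch; the endpoint, request count, and client setup are placeholders for our real warmup code):

    import java.net.URI
    import java.net.http.HttpClient
    import java.net.http.HttpRequest
    import java.net.http.HttpResponse

    // Fire N requests against the service itself before the readiness probe
    // reports "up", so the JIT, connection pools, etc. are warm.
    fun warmUp(baseUrl: String = "http://localhost:8080/warmup", count: Int = 15_000) {
        val client = HttpClient.newHttpClient()
        val request = HttpRequest.newBuilder(URI.create(baseUrl)).GET().build()
        repeat(count) {
            client.send(request, HttpResponse.BodyHandlers.discarding())
        }
    }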
In the end this all boiled down to missing one part of the equation: I/O bounds.
Imagine one request takes 10 DB calls, each taking 3 milliseconds to fulfill (including network latency etc.). A single request then takes 10 * 3 = 30 milliseconds of I/O. The throughput of one request-handling thread is then 1000 ms / 30 ms = 33.33 requests / second. Now if one service instance uses 10 threads to handle requests, we get 333.3 requests / second as our upper bound of throughput. We can't get any faster than this, because we are I/O-bottlenecked with regard to our thread count.
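The same arithmetic as a snippet (just the numbers from above, nothing else):

    // Upper bound on throughput when every thread is blocked on I/O.
    fun main() {
        val dbCallsPerRequest = 10
        val dbCallMillis = 3.0
        val threads = 10

        val ioMillisPerRequest = dbCallsPerRequest * dbCallMillis      // 30 ms of I/O
        val perThread = 1000.0 / ioMillisPerRequest                    // ~33.33 req/s
        println("Upper bound: ${perThread * threads} requests/second") // ~333.3
    }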
And this leaves out multiple factors like:
thread pool size vs. db connection pool size
our service doing non-DB-related tasks (actual logic, JSON serialization when the response gets fulfilled)
database capacity (was not an issue for us)
TL;DR: You can't get faster when you are I/O-bottlenecked, no matter how much CPU you provide. I/O has to improve if you want a single service instance to have more throughput; this mostly comes down to sizing the DB connection pool in relation to the thread pool in relation to DB calls per request. We missed this basic (and well known) relation between resources!
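As a hedged illustration of that sizing relation (HikariCP is an assumption on my part; the URL and numbers are placeholders):

    import com.zaxxer.hikari.HikariConfig
    import com.zaxxer.hikari.HikariDataSource

    // Size the connection pool against the worker threads and the degree of
    // parallel DB calls per request, so threads don't queue for connections.
    fun buildDataSource(workerThreads: Int, parallelDbCallsPerRequest: Int): HikariDataSource {
        val config = HikariConfig().apply {
            jdbcUrl = "jdbc:postgresql://localhost:5432/app"  // placeholder URL
            maximumPoolSize = workerThreads * parallelDbCallsPerRequest
        }
        return HikariDataSource(config)
    }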
What could be the possible reason for throughput remaining the same even though load has increased significantly compared to previous tests? NOTE: I even received the error "Internal Server Error" while running my performance test.
It means you have reached the Saturation Point - the point of maximum performance!
It is the point at which a certain number of concurrent users coincides with maximum CPU utilization and peak throughput. Adding any more concurrent users will degrade response time and throughput while keeping CPU utilization at its peak. It can also throw some errors!
After that, If you continue increasing the number of virtual users you may see these:
Response time is increasing.
Some of your requests fail.
Throughput either remains the same or decreases - this indicates the performance bottleneck!
Ideal load test in ideal world looks like:
response time remains the same, no matter how many virtual users are hitting the server
throughput increases by the same factor as the load increases, i.e.:
100 virtual users - 500 requests/second
200 virtual users - 1000 requests/second
etc.
In reality, the application under test might scale up to a certain extent, but finally you will reach the point where you keep increasing the load while response time goes up and throughput remains the same (or goes down).
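One way to see that plateau in numbers (Little's law is my addition here, not part of the original answer; the 200 ms / 800 ms figures are made up):

    // throughput ≈ users / response time. While response time is flat, doubling
    // users doubles throughput; past saturation, response time grows instead.
    fun throughputPerSecond(users: Int, responseTimeMs: Double): Double =
        users / (responseTimeMs / 1000.0)

    fun main() {
        println(throughputPerSecond(100, 200.0))  // 500 req/s  (pre-saturation)
        println(throughputPerSecond(200, 200.0))  // 1000 req/s (scales linearly)
        println(throughputPerSecond(400, 800.0))  // 500 req/s  (saturated: latency grew)
    }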
Looking into HTTP Status Code 500
The HyperText Transfer Protocol (HTTP) 500 Internal Server Error server error response code indicates that the server encountered an unexpected condition that prevented it from fulfilling the request.
Most probably it indicates that the application under test is overloaded; the next step would be to find out the reason, which may be one of:
Application is not properly configured for high loads (including all the middleware: application server, database, load balancer, etc.)
Application lacks resources (CPU, RAM, etc.), which can be checked using e.g. JMeter PerfMon Plugin
Application uses inefficient functions/algorithms, which can be checked using profiling tools
Initially I ran a load test with 100 users for 10 minutes, and 1000 records got inserted in the database for the scenarios below.
Employee Creation -- Test script design took 1 minute
Employee Update -- Test script design took 2 minutes
And then I ran the same load test with 200 users for 10 minutes and 1100 records got inserted without any error logs or deadlocks.
My question is: when we double the thread group count from 100 to 200, record insertion should also double, or approximately double. Why is that not happening? The same goes for the number of requests/samples.
You reached a maximum in your test throughput at about 110 records per minute. In other words, you have a bottleneck on the client or the server which doesn't allow 200 users to process requests concurrently and/or within the same amount of time (either some users wait until they can start processing a request, or each request takes longer, so the total number of requests is lower).
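The arithmetic behind that ceiling (just your numbers restated):

    fun main() {
        // 100 users: 1000 records in 10 min; 200 users: 1100 records in 10 min.
        println(1000 / 10.0)  // 100 records/min
        println(1100 / 10.0)  // 110 records/min -> doubling users barely moved it: a ceiling
    }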
Some bottlenecks can be resolved by you (if they are related to script, JMeter configuration or JMeter machine), others have to be resolved on server side (by whoever has access to it), and some cannot be resolved at all (they are true bottlenecks of your app).
Without knowing your application, it's hard to suggest anything beyond general "checklist" items:
Verify the JMeter script and check if it has any places where it may wait, take a long time, and so on. For example, if your ramp-up period is too high, the "first" user may finish execution before the "last" user even starts (see the sketch after this list). Scriptable samplers and pre- and post-processors may cause delays as well.
Make sure JMeter is configured properly to handle 200 concurrent threads. For example, if the JMeter heap is set too low, JMeter may be very slow because it constantly needs to run GC. See this question for how to inspect and configure memory (it discusses an out-of-memory error, but even without that error, inadequate memory can cause slowness).
Make sure the JMeter machine is configured to allow creation of 200+ concurrent HTTP connections. A common issue on both Windows and Linux machines is assuming you can have 65535 connections (the maximal number of ports), when in reality both OSes limit the number of ports they allow by default. Also, after use, a port may remain in TIME_WAIT or CLOSE_WAIT state for several minutes, which makes it unusable. As a result, running out of ports is quite common. Here's how to monitor and resolve this issue on Windows and Linux.
Check the JMeter machine's performance as a whole: does it have enough CPU and memory; is it swapping, etc.
If none of the above is a problem, you need to look at how requests arrive at the server. If the client is capable of sending 200 concurrent requests (which you should have established in the previous steps), but the server receives them at a slower rate, then maybe something in the network slows things down, for example slow DNS resolution or slow routing between JMeter and the server.
Item #3 on the client side is also applicable to the server.
If requests do arrive at the server at the same speed as they are sent from the client, then their processing probably slows down as the number of parallel requests goes up. This is where you are in dev and DevOps territory, and you probably need to work with them to identify bottlenecks on the server side. It could be the configuration of the web or application server, the application itself... pretty much anything along the request's path.
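To illustrate the ramp-up point from the first item above (a toy model; the 30-second per-user active time is hypothetical):

    // Users start rampUp/total seconds apart; each is busy for activeSeconds.
    // Peak concurrency is how many of those activity windows overlap.
    fun peakConcurrency(totalUsers: Int, rampUpSeconds: Double, activeSeconds: Double): Int {
        val startInterval = rampUpSeconds / totalUsers
        return minOf(totalUsers, (activeSeconds / startInterval).toInt() + 1)
    }

    fun main() {
        println(peakConcurrency(200, 600.0, 30.0))  // ~11 users actually overlap
        println(peakConcurrency(200, 10.0, 30.0))   // all 200 overlap
    }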
Performance testing is 10% execution, and 90% analysis and identification of bottlenecks, so here you go.
Concurrent users: 2000
Ramp-up period: 10 sec
Loop count: 1
The summary report for the above is:
Average: 2000 ms
Min: 4 ms
Max: 19964 ms
Error: 17.50% (these errors are java.net.SocketTimeoutException)
Throughput: 67.2/sec
From this result, can I infer that my server cannot handle a load of 2000 users?
Assuming a 20-second response time and a nearly 20% error rate, my expectation is that your application under test is overloaded. Personally, I would go for the following steps:
Re-run the test using a larger ramp-up time, generate the Reporting Dashboard after the run, and correlate increasing load with other KPIs like:
what was the maximum throughput (how many requests per second the system could serve) and what was the number of active users at that time
what was the number of users when errors started occurring
what is the relationship between the number of users and response time
etc.
The next step would be determining the bottleneck, i.e. why the application responds slowly or does not respond at all. The reasons could be:
The application simply lacks resources (CPU, RAM, network or disk IO, or it starts swapping). A good practice is monitoring the system health, e.g. with JMeter PerfMon Plugin
The configuration of the software (application/web server, database, load balancer, etc.) is not appropriate for high loads. Each software component needs to be appropriately tuned for high loads
Your application code is not optimal; you can re-run your test with profiler tool telemetry to see what the largest objects and the slowest functions are, etc.
It can be something entirely external, like a faulty router, a bad network cable, using WiFi instead of LAN, hitting corporate proxy limits, etc.
I'm trying to stress-test my Spring RESTful Web Service.
I run my Tomcat server on an Intel Core 2 Duo notebook with 4 GB of RAM. I know it's not a real server machine, but it's all I have, and it's only for study purposes.
For the test, I run JMeter on a remote machine, and communication goes through a private WLAN with a central wireless router. I prefer to test over a wireless connection because the service would be accessed from mobile clients. With JMeter I run a group of 50 threads, starting one thread per second, so after 50 seconds all threads are running. Each thread repeatedly sends an HTTP request to the server containing a small JSON object to be processed, sleeping on each iteration for a constant delay of 100 milliseconds plus a random value from a Gaussian distribution with a standard deviation of 100 milliseconds. I use some JMeter plugins for graphs.
Here are the results:
I can't figure out why my hits per second don't pass the 100 threshold (in the graph they are multiplied by 10), because with this configuration the rate should be higher (50 threads, each sending at least three times per second, would generate 150 hits/sec). I don't get any error message from the server, and all seems to work well. I've tried more and more configurations, but I can't get more than 100 hits/sec.
Why?
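For reference, the arithmetic I'm assuming: each thread's rate is bounded by response time plus think time, so the cap depends on how slow the server answers (the 400 ms figure below is hypothetical):

    // Per-thread rate ≈ 1000 / (responseTimeMs + thinkTimeMs); think time ≈ 100 ms.
    fun main() {
        val threads = 50
        val thinkMs = 100.0
        for (responseMs in listOf(50.0, 100.0, 400.0)) {
            val perThread = 1000.0 / (responseMs + thinkMs)
            println("response=$responseMs ms -> ${"%.0f".format(perThread * threads)} hits/s")
        }
    }

If the server's response time creeps toward 400 ms, 50 threads with 100 ms think time cap out at about 100 hits/sec.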
[EDIT] Many times I notice a substantial performance degradation from some point on, without any visible cause: no error response messages on the client, only OK HTTP responses, and all seems to work well on the server too. But looking at the reports:
As you can see, something happens between 01:54 and 02:14: hits per second decrease and response time increases. Okay, it could be a server overload, but what about the CPU decreasing? That is not compatible with the congestion hypothesis.
I want to note that you've chosen the rows to display on the Composite Graph very well. It's enough to draw some conclusions:
Note that Hits Per Second correlates perfectly with CPU usage. This means you have a "CPU-bound" system, and the maximum performance is mostly limited by CPU. This is very important to remember: server resources are spent per hit, not per active user. You could disable your sleep timers entirely and would still see the same 80-90 hits/s.
The maximum level of CPU is somewhere around 80%, so I assume you run a Windows OS (Win7?) on your machine. I have seen that it's impossible to achieve 100% CPU utilization on a Windows machine; it just does not allow you to spend the last 20%. And if you have reached that maximum, then you are seeing your installation's capacity limit: it simply does not have enough CPU resources to serve more requests. To fight this bottleneck you should either provide more CPU (use another server with better CPU hardware), configure the OS to let you use up to 100% (I don't know if that is applicable), or optimize your system (code, OS settings) to spend less CPU per request.
For the second graph, I'd suppose something is being downloaded via the router, or something is happening on the JMeter machine. "Something happens" means some task is running. This may be your friend who just wanted to run "grep error.log", or some scheduled task. To pin this down, you should look at the router's resources and the JMeter machine's resources during the degradation. There must be a process that swallows CPU/disk/network.