Problem: slowness is observed during the first 1 sec only; the remaining 59 sec have a constant response time that is roughly 90% better than the max response time.
Server environment: Spring Boot WebFlux with an R2DBC connection pool, deployed on ECS Fargate and connecting to a Postgres Aurora cluster.
Pool settings: maxSize and initialSize are both 200.
Using Spring Data R2DBC, with a proxy listener enabled for debugging.
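For reference, a minimal sketch (assuming Spring Boot's r2dbc-pool auto-configuration; the project's actual property names may differ) of how those pool settings could be expressed in application.properties:

spring.r2dbc.pool.enabled=true
spring.r2dbc.pool.initial-size=200
spring.r2dbc.pool.max-size=200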
Client:
A Gatling script with a very minimal load of 200, 250, 300, or 500 users and a ramp time of 50 sec, running on an AWS EC2 instance in the same VPC.
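For illustration, a minimal sketch of that kind of injection profile using Gatling's Java DSL; the base URL and endpoint are hypothetical, and the original script may use the Scala DSL with different numbers:

import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.ScenarioBuilder;
import io.gatling.javaapi.core.Simulation;
import io.gatling.javaapi.http.HttpProtocolBuilder;

public class WarmupSimulation extends Simulation {

    // Hypothetical service URL standing in for the real ECS endpoint.
    HttpProtocolBuilder httpProtocol = http.baseUrl("http://ecs-service.internal:8080");

    ScenarioBuilder scn = scenario("ramp test")
            .exec(http("get records").get("/records")); // hypothetical endpoint

    {
        // 500 users ramped over 50 seconds, matching the heaviest run described above.
        setUp(scn.injectOpen(rampUsers(500).during(50)))
                .protocols(httpProtocol);
    }
}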
Scenario:
Start the ECS server.
Wait for 4 min.
Do a dry run of 5 requests using Postman.
Trigger the load using Gatling.
Shut down the ECS server.
Repeat the steps with a different number of users.
The behaviour is consistent across different user counts: the first minute always has the slowest responses, including the max response time. Subsequent runs without a server restart perform well, with no delays.
Gatling results (response times in ms):
Total | OK  | KO | Cnt/s | Min | 50th pct | 75th pct | 95th pct | 99th pct | Max  | Mean | Std Dev
500   | 500 | 0  | 9.804 | 94  | 184      | 397      | 1785     | 2652     | 2912 | 417  | 556
I also observed in the logs that, for the request with the max response time, the time difference between these two consecutive log lines is 168 ms:
-- Executing query: BEGIN
-- io.r2dbc.spi.Connection.beginTransaction callback at ConnectionFactory#create()
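For reference, a minimal sketch (assuming the r2dbc-proxy library that produces listener output like the lines above; the surrounding wiring is hypothetical) of wrapping the ConnectionFactory so each statement's execution time is logged:

import io.r2dbc.proxy.ProxyConnectionFactory;
import io.r2dbc.spi.ConnectionFactory;

public class ProxyListenerConfig {

    // Wraps the real ConnectionFactory so the execution time of every statement
    // is printed after it completes.
    public static ConnectionFactory withQueryTiming(ConnectionFactory original) {
        return ProxyConnectionFactory.builder(original)
                .onAfterQuery(execInfo ->
                        System.out.printf("query=%s took=%dms%n",
                                execInfo.getQueries(),
                                execInfo.getExecuteDuration().toMillis()))
                .build();
    }
}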
Any suggestions on how to approach/fix this issue?
Thanks.
Related
I have a microservice using Spring Boot 2.7.0 with embedded NIO Tomcat. The application receives requests and, for each request, makes 6 parallel remote calls, waiting at most 2 seconds for a response from any of the 6 calls.
While performance testing this microservice with JMeter, I observed that the CPU remains under-utilised at around 14-15%, but the microservice's response time increases to more than a minute. Typically it should be no more than 2-3 seconds.
There are 3 thread configurations in my microservice:
Tomcat threads: I tried various configurations of max-threads, max-connections and accept-count, such as (5000, 30000, 2000), (500, 10000, 2000) and (200, 5000, 2000), but the CPU is always under-utilised. Here are the properties I am changing:
server.tomcat.max-threads=200
server.tomcat.max-connections=5000
server.tomcat.accept-count=2000
server.connection-timeout=3000
For each request received we create a ForkJoinPool with parallelism 6 to make the 6 remote calls (a sketch of this fan-out is shown below). We also tried an ExecutorService with different configurations (newSingleThreadExecutor, newCachedThreadPool, newWorkStealingPool) and increased the pool size to roughly the same as Tomcat's maxThreads and beyond, but the result was the same: CPU still under-utilised and the microservice taking more than a minute to respond.
When logging the active thread count, we saw that no matter how much we increased the thread pool size or Tomcat maxThreads, the active thread count went up to about 300 and then started declining. We tried a 4-core/8 GB system and an 8-core/16 GB system; the results were exactly the same.
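For illustration, a minimal sketch of the per-request fan-out described above (the remote call is a stand-in; the real service uses RestTemplate as noted below). The key point is that the calling Tomcat thread blocks for up to 2 seconds per request while doing almost no CPU work:

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class FanOutSketch {

    // Stand-in for the real remote call (made with RestTemplate in the actual service).
    static String callRemote(int i) {
        return "response-" + i;
    }

    // For one incoming request: fire 6 calls in parallel and wait at most 2 seconds.
    static List<String> handleRequest(ExecutorService pool) throws InterruptedException {
        List<Callable<String>> tasks = IntStream.range(0, 6)
                .mapToObj(i -> (Callable<String>) () -> callRemote(i))
                .collect(Collectors.toList());

        // The calling (Tomcat) thread blocks here until all calls finish or 2 s elapse.
        List<Future<String>> futures = pool.invokeAll(tasks, 2, TimeUnit.SECONDS);

        return futures.stream()
                .filter(f -> f.isDone() && !f.isCancelled())
                .map(f -> {
                    try {
                        return f.get();
                    } catch (Exception e) {
                        return "error";
                    }
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = new ForkJoinPool(6);
        System.out.println(handleRequest(pool));
        pool.shutdown();
    }
}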
For the remote calls we use Spring RestTemplate with maxConnTotal and maxConnPerRoute set to the same value as Tomcat's maxThreads. maxConnTotal and maxConnPerRoute are the same because all 6 remote calls go to the same server.
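For reference, a minimal sketch (assuming Apache HttpClient 4.x on the classpath; names are illustrative) of a RestTemplate configured with maxConnTotal and maxConnPerRoute as described:

import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.springframework.http.client.HttpComponentsClientHttpRequestFactory;
import org.springframework.web.client.RestTemplate;

public class PooledRestTemplate {

    // Both limits are equal because all 6 remote calls target the same host (one route).
    public static RestTemplate build(int maxThreads) {
        CloseableHttpClient httpClient = HttpClientBuilder.create()
                .setMaxConnTotal(maxThreads)
                .setMaxConnPerRoute(maxThreads)
                .build();
        return new RestTemplate(new HttpComponentsClientHttpRequestFactory(httpClient));
    }
}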
Here are the JMeter parameters used: -GTHREADS=1000 -GRAMP_UP=180 -GDURATION=300
There are 3 instances of this microservice running. Roughly 2-2.5 minutes after JMeter starts, every request to all 3 instances takes more than a minute while CPU remains at only 14-15%. Could someone please help figure out why the CPU is not spiking? If the CPU spiked to 35%, autoscaling would kick in, but since the CPU is under-utilised no scaling is happening.
Use a profiler like VisualVM, YourKit or JProfiler to see where your application spends the most time.
CPU is not the only possible bottleneck. Check Tomcat's connection pool utilization, as requests might be queuing up, as well as memory usage, network usage, database pool usage, the DB slow-query log, and so on. If you don't have better monitoring software or an APM tool in place, consider using the JMeter PerfMon Plugin.
We replaced RestTemplate with WebClient for the remote calls and introduced WebFlux Mono to make the complete request non-blocking. The request itself now returns our response wrapped in a Mono. This solved our issue: there is no more idle time, as threads are no longer blocked on IO and are instead busy serving other requests.
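For illustration, a minimal sketch of the non-blocking version described above; the base URL and endpoint are hypothetical, and the real implementation will differ:

import java.time.Duration;
import java.util.List;

import org.springframework.web.reactive.function.client.WebClient;

import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class NonBlockingFanOut {

    // Hypothetical base URL standing in for the real remote server.
    private final WebClient webClient = WebClient.create("http://remote-service");

    // Fires the 6 remote calls concurrently without blocking the serving thread;
    // any call that takes longer than 2 seconds (or fails) is simply dropped.
    public Mono<List<String>> callAll() {
        return Flux.range(0, 6)
                .flatMap(i -> webClient.get()
                        .uri("/endpoint/{id}", i)            // hypothetical endpoint
                        .retrieve()
                        .bodyToMono(String.class)
                        .timeout(Duration.ofSeconds(2))
                        .onErrorResume(e -> Mono.empty()))
                .collectList();
    }
}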
We are getting a weird issue and we are not able to identify the root cause. Our service is written using Spring Boot and we connect to MariaDB through JPA.
If we performance test the service by hitting its endpoint directly (returning a list of records), we get 300 ms response times at a 40 TPS load.
If we run the same 40 TPS load through an experience API that consumes our service via a Feign client, the response time is around 5 seconds.
Interestingly, our service logs show that the difference is in opening the DB connection and executing the query within our service. We are confused about why the DB performance differs between these two ways of hitting the same service. Has anyone faced a similar issue before, or does anyone have suggestions for debugging it?
I am using Spring Boot 2 for APIs, hosted on AWS ECS Fargate, and the database is Postgres 10.6 on RDS with 16 GB RAM and 4 CPUs.
My Hikari configuration is as follows:
spring.datasource.testWhileIdle = true
spring.datasource.validationQuery = SELECT 1
spring.datasource.hikari.maximum-pool-size=100
spring.datasource.hikari.minimum-idle=80
spring.datasource.hikari.connection-timeout=30000
spring.datasource.hikari.idle-timeout=500000
spring.datasource.hikari.max-lifetime=1800000
Generally this works perfectly, but when load hits the server, say around 5000 concurrent API requests (which is not huge either), my application crashes.
I have enabled debug logging for Hikari and am getting the messages below:
hikaripool-1 - pool stats (total=100 active=100 idle=0 waiting=100)
The exception message says the connection is not available:
HikariPool-1 - Connection is not available, request timed out after 30000ms.
org.hibernate.exception.JDBCConnectionException: Unable to acquire JDBC
At the same time, RDS Postgres Performance Insights shows a max query execution time of < 0.03 seconds, and CPU utilization is also under 50%. So there is no issue with the database server.
I am only using EntityManager and JPA; I am not running any queries by opening connections manually, so unclosed connections or a connection leak should not be an issue. But after enabling leak detection:
spring.datasource.hikari.leakDetectionThreshold=2000
I am getting warnings in the logs saying an apparent connection leak was detected.
When I check the method this error points to, it is just a JPA findById() method.
So what could be the root cause of 'connection is not available' and request timeouts for just 10k API requests with a pool size of 100? Why is no connection released once the active count reaches 100 and 100 more are waiting? My ECS application server restarts automatically with this error and is only accessible again 5-7 minutes later.
HikariCP recommends removing minimumIdle when there are spike demands like the ones you are testing:
for maximum performance and responsiveness to spike demands, we recommend not setting this value and instead allowing HikariCP to act as a fixed size connection pool
And if you remove it, idle-timeout is also irrelevant.
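As an illustrative sketch only (the pool size shown is an example, not a recommendation for your exact workload), the pool-related part of the configuration above would then shrink to something like:

spring.datasource.hikari.maximum-pool-size=10
spring.datasource.hikari.connection-timeout=30000
spring.datasource.hikari.max-lifetime=1800000

(minimum-idle is removed so HikariCP behaves as a fixed-size pool, and idle-timeout is removed because it then has no effect.)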
See also configure HikariCP for PostgreSQL
It is likely that your application is throttling itself into timeouts because of the wrong connection pool size in your configuration. A pool size of 100 is 10 times too many, and this will affect performance and stability.
HikariCP Pool size formula can be found in their wiki, but it looks like this:
((core_count * 2) + effective_spindle_count). Core count should not include
HT threads, even if hyperthreading is enabled.
If you have 4 cores then your connection pool size can be left at the default size of 10.
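For example, on a machine with 4 physical cores and, say, one effective spindle, the formula gives (4 * 2) + 1 = 9 connections, which is in line with HikariCP's default pool size of 10.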
In case this helps: I was facing this issue recently and it gave me a tough time.
The server accepts more requests than the Hikari pool can handle, so Hikari tries to obtain extra connections to serve the spike in demand.
E.g. for Tomcat with the default 200 threads, if your maxPoolSize = 10, on spike demands your server would try to serve 200 threads at the same time. If the connections in the pool are busy, Hikari would try to obtain 190 more connections, and this is what you see in the 'waiting' count.
Here is how I am managing it.
I made sure that the Tomcat thread count does not exceed Hikari's maxPoolSize. That way, there is no need to ask for more connections during a spike.
In Spring Boot, this is the config I used:
server.tomcat.threads.max = 50
spring.datasource.hikari.maximumPoolSize = 50
Note: 50 is variable based on your server capacity
The scenario to be covered: login user -> navigate to Page 01 -> hold the user for 5 min -> logout user.
Scripted as below:
Navigate to the Home page
The user is logged in (Assertion for login verification over some text on the dashboard)
Dashboard appears
Navigate to Page 01 (Assertion Page 01 content)
Logout (a Constant Timer of 5 min is added, and an assertion verifies the redirect back to the home page)
For achieving this scenario, a distributed setup was implemented as follows:
Master: my own machine (8 GB RAM, Core 2 Duo processor)
2 slave machines (8 GB RAM each, one i7 and one Core 2 Duo processor)
Thread group: jp@gc - Stepping Thread Group (used for the step-up thread configuration)
The server has been configured as below:
2 EC2 instances (16 GB RAM each)
1 load balancer
1 RDS instance
Note: instances are auto-scaled at 60% CPU utilization.
While executing the script for 500 concurrent users with the Stepping Thread Group in non-GUI mode, the following errors appear in the dashboard report:
504/Gateway Time-out
Non HTTP response code: java.net.SocketException/Non HTTP response message: Connection reset
Logout assertion failed
Could someone help me understand why these are appearing? When I checked the load balancer, the 504 Gateway Time-out did not show up there. I was trying to track these errors but was not able to figure out why they, along with the other two, are appearing. When the same script is executed for 10 users in GUI mode, no errors appear.
The same script, when executed for 100-250 concurrent users, works well with none of the above errors.
If the issue doesn't happen for 250 virtual users and happens for 500, it's definitely a bottleneck caused by the increased load; you just need to find out the reason.
Make sure to have a DNS Cache Manager added to your Test Plan, otherwise you may run into a situation where all the load goes to one server only.
Set up monitoring of your EC2 instances to ensure that they have enough headroom to operate in terms of CPU, RAM, Network, etc. You can use Amazon CloudWatch or JMeter PerfMon Plugin for this.
You might want to re-run the test with profiling tool telemetry enabled - this way you will be able to see where application spends the majority of time
Inspect the configuration of your application servers, databases, etc. as it might be configuration issue of the middleware
Be aware that, according to JMeter Best Practices, you should always use the latest JMeter version, so consider migrating to JMeter 5.0 (or whatever the latest version available on the JMeter Downloads page is) as soon as possible.
I have an application hosted on OpenShift and I need it to generate some Excel reports. The report generation process can take a long time (over 5 minutes). This causes the client to see a 502 error and the request times out. How and where can I configure my OpenShift stack (it is a Java webapp running on Tomcat 6) to increase the timeout duration?
5 minutes is an awfully long time for a web request to run. It would be better to have the web request schedule a background job that then notifies the user when the report is done being generated.
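For illustration, a minimal sketch (class and method names are hypothetical) of the background-job approach: the web request submits the report job and returns immediately with a job id, and the client polls (or is notified) for the result:

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReportJobService {

    private final ExecutorService workers = Executors.newFixedThreadPool(2);
    private final Map<String, String> status = new ConcurrentHashMap<>();

    // Called by the web request handler: returns immediately with a job id.
    public String submit() {
        String jobId = UUID.randomUUID().toString();
        status.put(jobId, "RUNNING");
        workers.submit(() -> {
            try {
                generateExcelReport(jobId);   // the real 5+ minute report generation goes here
                status.put(jobId, "DONE");
            } catch (Exception e) {
                status.put(jobId, "FAILED");
            }
        });
        return jobId;
    }

    // Called by a lightweight polling endpoint, or used to trigger a notification.
    public String statusOf(String jobId) {
        return status.getOrDefault(jobId, "UNKNOWN");
    }

    // Stand-in for the actual Excel report generation.
    private void generateExcelReport(String jobId) throws InterruptedException {
        Thread.sleep(1000);
    }
}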