AWS Aurora MySQL multi-reader CPU utilization - amazon-aurora

I have an Aurora MySQL database with one writer and three readers.
My server's SQL calls always go through the Aurora reader endpoint (except for writes, of course).
Too often during my app's peak hours, one of the readers reaches 100% CPU usage while the other readers sit at around 10-20%. This causes serious issues: calls from my app (or from anyone else) get no response because the overloaded reader does not serve them, which is odd since the other two readers are available. Am I missing something about how queries and connections are distributed between the readers? Is any other action required to get a better distribution?
As far as I know, once you create multiple readers, the reader endpoint should automatically act as a load balancer for exactly this purpose.
If it helps, the three readers are all db.r3.xlarge.
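One thing that may be worth checking: Aurora's documentation describes the reader endpoint as balancing new connections across replicas, not individual queries. The sketch below is only a toy simulation of that behaviour, not a diagnosis of this cluster. The reader names are hypothetical, and picking a replica at random per new connection is an assumption standing in for the endpoint's DNS-based selection.

```python
import random
from collections import Counter

READERS = ["reader-1", "reader-2", "reader-3"]  # hypothetical instance names


def resolve_reader(rng):
    """Stand-in for resolving the reader endpoint: one replica per NEW connection.

    Assumption: modelled as a uniform random pick.
    """
    return rng.choice(READERS)


def queries_per_reader(num_connections, queries_per_connection, rng):
    """Count queries per reader when each connection stays pinned to the
    replica it first reached (as a pooled, long-lived connection would)."""
    load = Counter({reader: 0 for reader in READERS})
    for _ in range(num_connections):
        reader = resolve_reader(rng)
        load[reader] += queries_per_connection
    return load


# A small pool of long-lived connections can pile its queries onto one reader,
# while many short-lived connections spread out much more evenly.
few_long_lived = queries_per_reader(2, 5000, random.Random(0))
many_short_lived = queries_per_reader(5000, 2, random.Random(0))
```

If the application keeps a small pool of persistent connections, comparing the per-reader connection counts in CloudWatch against the per-reader CPU would show whether this kind of pinning is happening.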

Related

Limit concurrent queries in Spring JPA

I have a simple REST endpoint that executes a Postgres procedure.
This procedure returns the current state of a device.
For example:
There are 20 devices.
The client app connects to the API and makes 20 requests to that endpoint every second.
For x clients there are x*20 requests.
For 2 clients, 40 requests.
This causes a heavy CPU load on the Postgres server only when there are many clients and/or many devices.
I didn't create it, but I need to redesign it.
How can I limit concurrent queries to the DB for this endpoint only? It would be a hot fix.
My second idea is to create a background worker that executes the queries one at a time. The endpoint would then fetch the data from memory.
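The background-worker idea could look roughly like the sketch below. It is written in Python only to show the shape (the real service is Spring/JPA), and `fetch_states` is a hypothetical callable standing in for the Postgres procedure call. The database is queried once per refresh, no matter how many clients poll the endpoint.

```python
import threading


class DeviceStateCache:
    """Keeps the latest device states in memory, refreshed by a single worker."""

    def __init__(self, fetch_states, interval_seconds=1.0):
        self._fetch_states = fetch_states  # hypothetical: returns {device_id: state}
        self._interval = interval_seconds
        self._lock = threading.Lock()
        self._states = {}
        self._stop = threading.Event()

    def refresh_once(self):
        """Run the database query once and swap in the new snapshot."""
        states = self._fetch_states()
        with self._lock:
            self._states = states

    def get(self, device_id):
        """What the REST endpoint would call: a memory read, no database access."""
        with self._lock:
            return self._states.get(device_id)

    def run_forever(self):
        """Worker loop: only one refresh runs at a time."""
        while not self._stop.is_set():
            self.refresh_once()
            self._stop.wait(self._interval)

    def start(self):
        thread = threading.Thread(target=self.run_forever, daemon=True)
        thread.start()
        return thread

    def stop(self):
        self._stop.set()
```

The trade-off is staleness: clients see data up to one refresh interval old, which may or may not be acceptable for device state.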
I would try the simple way first. Try to reduce
the number of database connections in the pool, OR
the number of worker threads in the built-in Tomcat.
A more flexible option would be to put the logic behind a thread pool that limits the number of worker threads. This is not trivial if the Spring context and the database are used inside a worker. Take a look at the Spring annotation @Async.
Off-topic: The solution we are discussing here looks like a workaround. On its own it will most probably increase throughput only by a factor of 2, maybe 3. It is not JEE-conformant and will most probably not be very stable. It is better to refactor the application to avoid the problem. Another option would be to buy a new database server.
Update: A JEE-compliant solution would be to implement some sort of bulkhead pattern. It limits the number of concurrently running requests and rejects new ones once some critical number is reached. The server application answers with "503 Service Unavailable". The client application catches this status and retries a second later (see "exponential backoff").
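To make the bulkhead idea concrete, here is a minimal sketch in Python (the real application would use Java or a library for this). The class and function names are made up for illustration, and the doubling delay starting at one second is an assumption based on "retries a second later".

```python
import threading
import time


class Bulkhead:
    """Allows at most max_concurrent handlers to run at once; rejects the rest."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, handler):
        # Non-blocking acquire: if no slot is free, reject instead of queueing.
        if not self._slots.acquire(blocking=False):
            return 503, None
        try:
            return 200, handler()
        finally:
            self._slots.release()


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Client side: retry on 503, doubling the wait after each rejection."""
    delay = base_delay
    for attempt in range(max_attempts):
        status, body = send()
        if status != 503:
            return status, body
        if attempt < max_attempts - 1:
            sleep(delay)
            delay *= 2
    return 503, None
```

Rejecting immediately rather than queueing is the point of the pattern: the database never sees more than `max_concurrent` of these queries at once, and the excess load is pushed back to the clients.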

AWS Aurora / Lambda serverless production environment exhibiting occasional spikes

We've been running our production web app off AWS Lambda / API Gateway, with an Aurora serverless database. Things had been running smoothly for over a year, but recently (coinciding with much increased periods of peak usage) we've experienced temporary slowness, and in the worst case unavailability, due to some kind of bottleneck that results in a spike in the number of DB connections and 4XX and 5XX from our two APIs.
We're using the serverless-mysql library to execute queries and manage DB connections.
Some potential causes of the issue that have been eliminated:
There are no long-running queries locking up tables or anything of that sort (as demonstrated by show full processlist in MySQL); in fact, no query runs longer than 1s according to our slow_log.
All calls to await serverlessMysql.query() are immediately followed by await serverlessMysql.end().
Our database manager class is instantiated outside the Lambda handler, so it isn't reinstantiated every time a Lambda instance is reused.
We've adjusted the config options for serverless-mysql so that retries aren't so aggressive. The default config retries connections very aggressively, both in frequency and in number of retries. This has definitely helped, but has not eliminated the problem.
What details can I post that might help someone diagnose this problem? It's a major pain in the ass.
It would be helpful to see the load this application is getting, which I know is easier said than done with Lambda.
You sort of hinted at it, but it's possible you're hitting the max connections limit of the 'capacity class' your Aurora Serverless instance is set to. I've hit this a few times. It's hard to discover with Lambda and Aurora Serverless because you don't have the same logging you would traditionally have.
Outside of that, the core issue you're experiencing seems to be related to spikes created by your application, so you need to find out whether a query is simply inefficient and running too many times at once. These are almost impossible to troubleshoot with Lambda logs. Also note that DB locks still occur with Aurora Serverless.
To help track down the issue, you could try the following:
Set up APM
I highly, highly recommend getting something like New Relic set up and monitoring your Lambda function.
I'm pretty sure NR has a free trial option, and tracking down a problem like this would be much simpler with an APM. I can't tell you how much easier problems like this are to solve with a solid APM.
Monitor traffic ingress
Again, I'm not sure what this application is doing, but it's possible that a spike in network traffic from a particular user kicks off a load of queries that make things go awry. Set up a free Cloudflare account or some other proxy if you can, so you can inspect network traffic more easily.
Hope this helps.

Aurora Replica Lag Much Higher Than Reported

Our application uses reader and writer instances of Amazon RDS Aurora. The AWS dashboard shows the replica lag to consistently be about 20ms. However, we are seeing old results on the reader more than 90ms after a commit on the master and at least up to 170ms in some cases.
When doing CRUD operations, our app commits the data, then issues a HTTP redirect to the client to load the new data. The network turnaround on the redirect is logged on the client and is usually at least 90ms. We are logging both the commit time and read time on the application server and see a difference of around 170ms. Old data is showing up consistently.
Prior to Aurora we had a standard MySQL replication setup with significantly less powerful boxes and never had this issue.
Altering the application to read and write from the same Aurora instance solves the problem, but I thought Aurora used shared storage for replication. What is going on? Could this be an issue with Aurora's query cache? Is the reported replica lag inaccurate?
Any help would be appreciated.
Thank you,
If you do care about strong consistency, you should issue your queries to the writer (the RW cluster endpoint). On that note, it is definitely worrisome that you are seeing stale data and the replica lag metric is not capturing it. To sort out that specific part, I would recommend opening a support case with AWS Aurora.
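If sending every read to the writer is too heavy, a middle ground is to send only the reads that immediately follow a write to the writer. The sketch below is one hypothetical way to do that in Python. The class name, the per-session tracking, and the 0.5 s window are assumptions (the window just needs to exceed the roughly 170ms of staleness observed in the question), and it is not an Aurora feature.

```python
import time


class ReadYourWritesRouter:
    """Routes a session's reads to the writer for a short window after it commits."""

    def __init__(self, pin_seconds=0.5, clock=time.monotonic):
        self._pin_seconds = pin_seconds  # assumed window, tune to observed lag
        self._clock = clock
        self._last_write = {}  # session id -> time of that session's last commit

    def record_write(self, session_id):
        """Call right after a successful commit."""
        self._last_write[session_id] = self._clock()

    def endpoint_for_read(self, session_id):
        """Return which endpoint this read should use: 'writer' or 'reader'."""
        last = self._last_write.get(session_id)
        if last is not None and self._clock() - last < self._pin_seconds:
            return "writer"
        return "reader"
```

In the commit-then-redirect flow described in the question, the app would call `record_write` after the commit and `endpoint_for_read` when the redirected request arrives, so only that one follow-up read hits the writer.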

How many users should a EC2 Micro Instance be able to handle only with a nginx server?

I have an iOS social app.
This app talks to my server to do updates and retrievals fairly often, mostly small text as JSON. Sometimes users upload pictures, which my web server then uploads to an S3 bucket. No pictures or any other type of file will be retrieved from the web server.
The EC2 Micro Ubuntu 13.04 instance runs PHP 5.5, PHP-FPM and NGINX. Caching is handled by ElastiCache using Redis, and the database is a separate m1.large MongoDB server. The content can be fairly dynamic, as the newsfeed can be dynamic.
I am a total newbie when it comes to configuring NGINX for performance, and I am trying to see whether I've configured my server properly or not.
I am using Siege to test my server load, but I can't find any statistics on how many concurrent users / page loads my system should be able to handle, so I can't tell whether I've done something right or wrong.
What number of concurrent users / page loads should my server be able to handle?
If I can't get statistics from others' experience, what would count as an easy, medium, and extreme load for my micro instance?
I am aware that there are several other questions asking similar things. But none provide any sort of estimates for a similar system, which is what I am looking for.
I haven't tried nginx on a micro instance, for the reasons Jonathan pointed out. If you consume your CPU burst you will be throttled very hard and your app will become unusable.
If you want to follow that path I would recommend:
Try to cap CPU usage for nginx and php5-fpm to make sure you do not go over the threshold for CPU penalties. I have no idea what that threshold is. I believe the main problem with a micro instance is maintaining consistent CPU availability. If you go over the cap you are screwed.
Try to use fastcgi_cache, if possible. You want to hit php5-fpm only if really needed.
Keep in mind that gzipping on the fly will eat a lot of CPU. I mean a lot of CPU (for an instance that has almost no CPU power). If you can use gzip_static, do it. But I believe you cannot.
As for statistics, you will need to gather them yourself. I have statistics for m1.small but none for micro. Start by making nginx serve a static HTML file of only a few KB. Run siege in benchmark mode with 10 concurrent users for 10 minutes and measure. Make sure you are sieging from a stronger machine.
siege -b -c10 -t600s 'http:// private-ip /test.html'
You will probably see the effects of CPU throttling just by doing that! What you want to keep an eye on is the transactions per second and how much throughput nginx can serve. Keep in mind that the m1.small max is 35mb/s, so t1.micro will be even less.
Then move on to a JSON response. Try gzipping. See how many concurrent requests per second you can get.
And don't forget to come back here and report your numbers.
Best regards.
Micro instances are unique in that they use a burstable profile. While you may get up to 2 ECUs in terms of performance for a short period of time, once the instance uses up its burstable allotment it will be limited to around 0.1 or 0.2 ECU. Eventually the allotment resets and you can get 2 ECUs again.
Much of this is going to come down to how CPU/Memory heavy your application is. It sounds like you have it pretty well optimized already.

Strange performance using JPA, am I missing something?

We have a JPA -> Hibernate -> Oracle setup, where we are only able to crank up to 22 transactions per second (two reads and one write per transaction). The CPU, disk and network are not bottlenecking.
Is there something I am missing? I wonder if there could be some sort of Oracle-imposed limit that the DBAs have applied?
Network is not the problem: when I do raw reads on the table, I can do 2000 reads per second. The problem is clearly writes.
CPU is not the problem on the app server; the CPU is basically idling.
Disk is not the problem on the app server; the data is completely loaded into memory before the processing starts.
It might be worth comparing performance with a different client technology (or even just a simple test using SQL*Plus) to see if you can beat this performance at all. It may simply be an under-resourced or misconfigured database.
I'd also compare the results for SQL*Plus running directly on the d/b server with the results for it running locally on whatever machine your Java code runs on (where it communicates over SQL*Net). This would confirm whether the problem is below your Java tier.
To be honest, there are so many layers between your JPA code and the database itself that diagnosing the cause is going to be fun . . . I recall one mysterious d/b performance problem that turned out to be a misconfigured network card; the DBAs were rightly insistent that the database wasn't showing any bottlenecks.
It sounds like the application is doing a transaction in a bit less than 0.05 seconds (1/22 of a second). If you extract the SELECT and UPDATE statements from the app and run them by themselves, using SQL*Plus or some other tool, how long do they take, and if you add up the times for the statements, do they come close to 0.05? Where does the data come from that is used in the queries and eventually gets used in the UPDATE? It's entirely possible that the slowdown is not in the database but somewhere else in the app, such as the data acquisition phase. Perhaps something like a profiler could be used to find out where the app is spending its time.
Share and enjoy.
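The arithmetic in the last answer can be sketched as a small timing harness. This is Python purely for illustration. The callables passed in are hypothetical stand-ins for running each extracted statement by itself, and the helper names are made up.

```python
import time


def transaction_budget_seconds(transactions_per_second):
    """Wall-clock time available per transaction at the observed throughput."""
    return 1.0 / transactions_per_second


def time_statements(statements, clock=time.perf_counter):
    """Run each statement callable once and record how long it took.

    statements: dict of name -> zero-argument callable (hypothetical stand-ins
    for the two SELECTs and the UPDATE run on their own).
    """
    timings = {}
    for name, run in statements.items():
        start = clock()
        run()
        timings[name] = clock() - start
    return timings


def unexplained_seconds(timings, transactions_per_second):
    """Per-transaction time NOT accounted for by the statements themselves."""
    budget = transaction_budget_seconds(transactions_per_second)
    return budget - sum(timings.values())
```

At 22 transactions per second the budget is about 0.045 s. If the statements add up to much less than that, the remainder is being spent outside the database, which is where a profiler would help.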
