Aurora Replica Lag Much Higher Than Reported

Our application uses reader and writer instances of Amazon RDS Aurora. The AWS dashboard shows replica lag consistently around 20ms. However, we are seeing stale results on the reader more than 90ms after a commit on the writer, and in some cases 170ms or more.
When doing CRUD operations, our app commits the data, then issues an HTTP redirect so the client loads the new data. The network turnaround on the redirect is logged on the client and is usually at least 90ms. We also log both the commit time and the read time on the application server and see a difference of around 170ms. Stale data shows up consistently.
Before Aurora, we had a standard MySQL replication setup on significantly less powerful boxes and never had this issue.
Altering the application to read and write from the same Aurora instance solves the problem, but I thought Aurora used shared storage for replication. What is going on? Could this be an issue with Aurora's query cache? Is the reported replica lag metric inaccurate?
Any help would be appreciated.
Thank you,
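One way to pin down what the application actually sees, independent of the console graph, is to log observed staleness directly and compare it with the lag Aurora itself reports per replica. Below is a minimal sketch in TypeScript using the mysql2 driver; the endpoint hostnames are placeholders, the lag_probe table is hypothetical, and the information_schema.replica_host_status table (which Aurora MySQL exposes) should be verified against your engine version:

```typescript
// Sketch: measure read-after-write staleness on the reader endpoint.
// Assumes a hypothetical probe table: lag_probe(id INT PRIMARY KEY, ts VARCHAR(64)).
import mysql, { RowDataPacket } from "mysql2/promise";

const writer = await mysql.createConnection({ host: "mycluster.cluster-xxxx.rds.amazonaws.com", user: "app", password: "...", database: "appdb" });
const reader = await mysql.createConnection({ host: "mycluster.cluster-ro-xxxx.rds.amazonaws.com", user: "app", password: "...", database: "appdb" });

async function probeOnce(): Promise<void> {
  const token = Date.now().toString();
  await writer.execute("REPLACE INTO lag_probe (id, ts) VALUES (1, ?)", [token]);
  const committedAt = Date.now();

  // Poll the reader until the freshly committed value becomes visible.
  for (;;) {
    const [rows] = await reader.execute<RowDataPacket[]>("SELECT ts FROM lag_probe WHERE id = 1");
    if (rows.length > 0 && rows[0].ts === token) break;
    await new Promise((r) => setTimeout(r, 5));
  }
  console.log(`observed staleness: ${Date.now() - committedAt} ms`);

  // Compare with the per-replica lag Aurora reports internally (Aurora MySQL).
  const [lag] = await reader.query("SELECT server_id, replica_lag_in_milliseconds FROM information_schema.replica_host_status");
  console.table(lag as RowDataPacket[]);
}

await probeOnce();
```

If the observed staleness regularly exceeds the reported replica_lag_in_milliseconds values, that is concrete evidence to attach to a support case.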

If you care about strong consistency, you should issue your queries to the writer (the RW cluster endpoint). That said, it is definitely worrisome that you are seeing stale data that the replica lag metric is not capturing. To sort out that specific part, I would recommend opening a support case with AWS Aurora.
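In the meantime, a common workaround is to route only the reads that must observe a just-committed write to the cluster (writer) endpoint, and keep everything else on the reader endpoint. A minimal sketch, again with mysql2 and placeholder endpoints:

```typescript
// Sketch: read-your-writes routing. Reads that must see the latest commit
// go to the writer (cluster endpoint); lag-tolerant reads use the reader endpoint.
import mysql from "mysql2/promise";

const writerPool = mysql.createPool({ host: "mycluster.cluster-xxxx.rds.amazonaws.com", user: "app", password: "...", database: "appdb" });
const readerPool = mysql.createPool({ host: "mycluster.cluster-ro-xxxx.rds.amazonaws.com", user: "app", password: "...", database: "appdb" });

// fresh: true => this read must observe the caller's own recent write.
async function query(sql: string, params: unknown[] = [], opts: { fresh?: boolean } = {}) {
  const pool = opts.fresh ? writerPool : readerPool;
  const [rows] = await pool.execute(sql, params);
  return rows;
}

// After the commit that precedes the HTTP redirect, serve the follow-up
// page load from the writer so the client never sees pre-commit data:
const order = await query("SELECT * FROM orders WHERE id = ?", [42], { fresh: true });
```

The cost is extra load on the writer for those reads, but it removes the redirect race entirely.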

Related

Apache Geode performance drop in cluster usage

I have integrated Apache Geode into a web application to store HTTP session data. The web application runs load-balanced, i.e. there are multiple instances sharing session data. Each web application instance has its own local Geode cache (locator and server), and the data is distributed to the other Geode nodes in the cluster through a replicated region. All instances are in the same network; there is no multi-site usage. GET operations run at around 5000 per second; PUT operations at approximately half that.
Testing this setup with only one web application instance, the performance is very promising (in the area of 20-30 ms). However, when a second instance is added there is a significant performance drop, up to a few seconds.
We found that disabling TCP SYN cookies improved processing time by up to 50%, though the performance is still not acceptable.
How could a potential bottleneck (e.g. in the communication between Geode nodes) be identified? My main idea is to extract metrics/statistics from Geode, but I have not yet found anything helpful in that regard. I'd appreciate any hints on how to investigate and eliminate performance problems with Apache Geode.
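On the metrics question specifically: Geode can sample its internal statistics to an archive file that you can analyze offline (for example with the VSD tool) to see where time goes — distribution between members, serialization, socket waits. A minimal sketch of the relevant gemfire.properties entries; the property names assume stock Apache Geode, so double-check them against your version:

```properties
# Sketch: enable Geode statistics sampling (gemfire.properties).
statistic-sampling-enabled=true
# Sample interval in milliseconds.
statistic-sample-rate=1000
# Archive file to analyze offline (e.g. with the VSD tool).
statistic-archive-file=/var/log/geode/stats.gfs
# Also time individual operations (adds some overhead).
enable-time-statistics=true
```

With an archive from each node captured under load, statistics categories such as DistributionStats and CachePerfStats usually make it clear whether time is spent in inter-member communication or in the cache itself.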

Losing Provenance records in Apache NiFi

We work with a lot of data and have a high throughput of files going through our NiFi instances. We have recently been losing provenance records and can't work out the cause.
Below are some details, if relevant:
We have our provenance repository on its own drive in the cloud, and are not seeing any high IO usage or resource contention.
We have allocated additional threads to it, as well as raising the limit to 999k file handles.
If it means anything, provenance data is kept for 2 weeks in our configuration.
We are on NiFi version 1.15.3, but are planning an upgrade in the near future.
Any ideas on what the cause may be and how to remediate this? Thanks!
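One configuration angle worth checking (hedged, since the post doesn't include the config): the provenance repository is bounded by both a maximum storage time and a maximum storage size, and whichever limit is reached first evicts the oldest records. Under high throughput, a size cap can age records out long before the configured two weeks, which looks exactly like "lost" provenance. The relevant nifi.properties entries (names per stock NiFi 1.x; values here are illustrative) look like this:

```properties
# Sketch: provenance repository settings in nifi.properties (NiFi 1.x).
nifi.provenance.repository.implementation=org.apache.nifi.provenance.WriteAheadProvenanceRepository
# Retention is bounded by BOTH limits; the first one reached wins.
nifi.provenance.repository.max.storage.time=14 days
nifi.provenance.repository.max.storage.size=100 GB
# Indexing throughput; if indexing falls behind, events can appear missing
# in the UI even though they were written.
nifi.provenance.repository.index.threads=4
nifi.provenance.repository.index.shard.size=500 MB
```

Comparing the repository's actual on-disk size against max.storage.size during peak throughput should quickly confirm or rule this out.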

AWS Aurora MySQL multi-reader CPU utilization

I have an Aurora MySQL database with a writer and 3 readers.
My server's SQL calls always go through the Aurora reader endpoint (except writes, of course).
For some reason, often during my app's peak hours, one of the readers will reach 100% CPU usage while the other readers sit at around 10-20% CPU. This scenario is causing me serious issues: if I make calls from my app (or anyone else does), I get no response, because the overloaded reader will not serve me. That is odd, since the other two readers are available. I'm not sure if I'm missing something about how queries and connections are distributed between the different readers. Is there any other action required to achieve a better distribution between the readers?
As far as I know, the moment you create multiple readers, the reader endpoint should automatically act as a load balancer for exactly this purpose.
If it's of any help, the 3 readers are all db.r3.xlarge.
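One likely mechanism: the reader endpoint balances by DNS round robin at connection time, not per query. If the app opens its connections in one burst, or a resolver caches the DNS answer, every connection can land on the same reader and stay pinned there for its lifetime. A minimal Node/TypeScript sketch of one mitigation, recycling connections so that fresh DNS lookups can redistribute the load; the endpoint and threshold are placeholders:

```typescript
// Sketch: avoid pinning all traffic to one reader. The reader is chosen
// when a connection is OPENED (DNS round robin), so periodically recycling
// connections lets new DNS answers spread the load across readers.
import mysql, { Connection } from "mysql2/promise";

const READER_ENDPOINT = "mycluster.cluster-ro-xxxx.rds.amazonaws.com";
const MAX_USES_PER_CONNECTION = 500; // hypothetical recycle threshold

let conn: Connection | null = null;
let uses = 0;

async function readerQuery(sql: string, params: unknown[] = []) {
  if (!conn || uses >= MAX_USES_PER_CONNECTION) {
    if (conn) await conn.end();
    // A fresh connection triggers a fresh DNS lookup, possibly a different reader.
    conn = await mysql.createConnection({ host: READER_ENDPOINT, user: "app", password: "...", database: "appdb" });
    uses = 0;
  }
  uses++;
  const [rows] = await conn.execute(sql, params);
  return rows;
}
```

Alternatives worth evaluating are RDS Proxy or an Aurora-topology-aware driver; it is also worth checking whether the hot reader simply receives disproportionately heavy queries.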

AWS Aurora / Lambda serverless production environment exhibiting occasional spikes

We've been running our production web app off AWS Lambda / API Gateway, with an Aurora serverless database. Things had been running smoothly for over a year, but recently (coinciding with much-increased periods of peak usage) we've experienced temporary slowness, and in the worst case unavailability, due to some kind of bottleneck that results in a spike in the number of DB connections and in 4XX and 5XX responses from our two APIs.
We're using the serverless-mysql library to execute queries and manage DB connections.
Some potential causes of the issue that have been eliminated:
There are no long-running queries locking up tables or anything of that sort (as demonstrated by SHOW FULL PROCESSLIST in MySQL); in fact, no query runs longer than 1s according to our slow_log
All calls to await serverlessMysql.query() are immediately followed by await serverlessMysql.end()
Our database manager class is instantiated outside the Lambda handler, so it isn't reinstantiated every time a Lambda instance is reused
We've adjusted the config options for serverless-mysql so that retries aren't so aggressive. The default config makes it very aggressive in retrying to connect, both in frequency and number of retries (the kind of configuration involved is sketched after this list). This has definitely helped, but has not eliminated the problem.
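For reference, a minimal sketch of the pattern described in the list above, using serverless-mysql with the retry/backoff options dialed down from the library defaults (the values shown are placeholders, not recommendations):

```typescript
// Sketch: serverless-mysql with query()/end() per invocation and
// less aggressive retries than the defaults.
import serverlessMysql from "serverless-mysql";

const db = serverlessMysql({
  config: {
    host: process.env.DB_HOST,
    database: process.env.DB_NAME,
    user: process.env.DB_USER,
    password: process.env.DB_PASSWORD,
  },
  maxRetries: 3,     // library default retries far more often
  backoff: "full",   // full-jitter exponential backoff
  base: 50,          // backoff base in ms
  cap: 1000,         // backoff cap in ms
});

export async function handler(event: unknown) {
  try {
    const rows = await db.query("SELECT * FROM widgets WHERE id = ?", [1]);
    return { statusCode: 200, body: JSON.stringify(rows) };
  } finally {
    // Release the connection so idle Lambda containers don't hold it open.
    await db.end();
  }
}
```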
What details can I post that might help someone diagnose this problem? It's a major pain in the ass.
It would be helpful to see the load this application is getting, which I know is easier said than done with Lambda.
You sort of hinted at it, but it's possible you're hitting the max_connections limit for the capacity class your Aurora serverless instance is set to. I've hit this a few times. It's hard to discover with Lambda and serverless Aurora because you don't have the same logging you would traditionally have.
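A quick way to check that theory from anything that can reach the cluster (the limit scales with the capacity setting, so check during a spike if possible):

```typescript
// Sketch: compare live connection usage against the current max_connections
// (which scales with Aurora serverless capacity).
import mysql from "mysql2/promise";

const conn = await mysql.createConnection({ host: process.env.DB_HOST, user: "admin", password: "...", database: "mysql" });
const [limitRows] = await conn.query("SELECT @@max_connections AS max_connections");
const [usedRows] = await conn.query("SHOW STATUS LIKE 'Threads_connected'");
console.log(`connections: ${(usedRows as any)[0].Value} / ${(limitRows as any)[0].max_connections}`);
await conn.end();
```

The CloudWatch DatabaseConnections metric for the cluster gives the same picture over time without needing a live session.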
Outside of that, the core issue you're experiencing seems to be related to spikes created by your application, so you need to discover whether a query is perhaps just inefficient and running too many times at once. These are almost impossible to troubleshoot with Lambda logs, but DB locks still occur with Aurora serverless.
To help track down the issue, you could try the following:
Set up APM
I highly, highly recommend getting something like New Relic set up and monitoring your Lambda function.
I'm pretty sure NR has a free trial option, and tracking down a problem like this would be relatively simple with an APM. I can't tell you how much easier problems like this are to solve with a solid APM.
Monitor traffic ingress
Again, I'm not sure what this application is doing, but it's possible that a spike in network traffic from a particular user kicks off a load of queries that makes things go awry. Set up a free Cloudflare account or some other proxy if you can, to observe network traffic more easily.
Hope this helps.

GoldenGate replication is extremely delayed

We are using GoldenGate in production to replicate from an Oracle database into Postgres. In addition, GoldenGate also replicates into another Oracle database instance.
The source Oracle database is on our company's internal network.
The target Oracle database is also on our company's internal network.
Postgres is in the AWS cloud.
The Oracle->Oracle replication has no problems; there is no delay.
The Oracle->Postgres replication can have an incredibly large delay; sometimes it grows to as much as one day. Also, no error is reported.
We have been investigating the problem and cannot find the cause: the network bandwidth is large enough for the data we transfer, there is enough RAM, and the CPU is only at about 20% utilization.
The only difference seems to be the ping between the internal network and the AWS cloud: within the internal network the ping is approximately 2ms, while to AWS it is almost 20ms.
What can be the cause and how to resolve it?
You really should contact Oracle Support on this topic. Note that Oracle GoldenGate 12.2 supports Postgres as a target only.
As for the latency within your replication process: it sounds like Oracle-to-Oracle is working fine within your internal network, and the problem only appears when going Oracle-to-Postgres (AWS cloud).
Do you have lag monitoring configured? LAGINFO (https://docs.oracle.com/goldengate/c1221/gg-winux/GWURF/laginfo.htm#GWURF532) should be configured within your MGR processes. This will provide some baseline lag information for determining how to proceed (a minimal parameter sketch is at the end of this answer).
Are you compressing the trail files?
How much data are you sending? DML stats?
This should get you started on the right path.
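To make the lag-monitoring and compression points concrete, here is a minimal sketch of the parameters involved. The parameter names are standard GoldenGate, but verify them against the 12.2 reference for your exact build before using them:

```
-- Sketch: lag reporting thresholds in the MGR parameter file. MGR writes
-- lag messages to the error log at these intervals/thresholds.
LAGREPORTMINUTES 5
LAGINFOMINUTES 5
LAGCRITICALMINUTES 15

-- Sketch: in the Extract (data pump) parameter file, compress trail data
-- sent over the higher-latency link to AWS.
RMTHOST aws-target-host, MGRPORT 7809, COMPRESS
RMTTRAIL ./dirdat/rt
```

Since bandwidth looks sufficient but the round-trip time to AWS is ten times higher, a pump sending many small packets is a plausible culprit; besides COMPRESS, the RMTHOST options TCPBUFSIZE and TCPFLUSHBYTES are worth reviewing.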
