GoldenGate replication is extremely delayed - Oracle

We are using GoldenGate in production to replicate from an Oracle Database into Postgres. The same GoldenGate setup also replicates into another Oracle Database instance.
The source Oracle Database is located in our company's internal network.
The target Oracle Database is also located in our internal network.
Postgres is hosted in the AWS cloud.
The Oracle->Oracle replication works without problems; there is no delay.
The Oracle->Postgres replication can have an incredibly large delay - sometimes it grows to as much as one day - and no error is reported.
We have been investigating the problem and cannot find the cause: the network bandwidth is large enough for the data we transfer, there is enough RAM, and CPU usage is only around 20%.
The only difference seems to be the ping between the internal network and the AWS cloud: within the internal network the ping is approximately 2 ms, while to AWS it is almost 20 ms.
What could be the cause, and how can we resolve it?

You really should contact Oracle Support on this topic; however, Oracle GoldenGate 12.2 supports Postgres as a target (only).
As for the latency within your replication process: it sounds like Oracle-to-Oracle is working fine within your internal network, and the problem only appears when going Oracle-to-Postgres (AWS cloud).
Do you have lag monitoring configured? LAGINFO (https://docs.oracle.com/goldengate/c1221/gg-winux/GWURF/laginfo.htm#GWURF532) should be configured within your MGR processes. This will provide some baseline lag information for determining how to proceed.
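As a minimal sketch, the relevant Manager parameters might look like this in mgr.prm (the port and thresholds below are placeholders, not recommendations):
-- hypothetical mgr.prm excerpt; tune the thresholds to your SLA
PORT 7809
-- how often Manager checks Extract/Replicat lag
LAGREPORTMINUTES 5
-- lag above this is written to the error log as an informational message
LAGINFOMINUTES 5
-- lag above this is reported as a warning
LAGCRITICALMINUTES 15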
Are you compressing the trail files?
How much data are you sending? DML stats?
This should get you started on the right path.

Developer copy of Oracle DB

Here is the problem: I have to use a remote DB for a few hours a day, and the VPN we use drops the Oracle connection several times an hour (for unknown reasons), which is really annoying and time consuming.
The sysadmin who manages both the Sonic VPN and the DB can't help.
So I am thinking of placing a copy of the DB locally.
What I need/don't need:
all changes on the remote DB (the master) should propagate easily to the copy (automatically or manually - I don't mind, as long as it is a one-button push); changes are rare - once a day at most
my changes to the local DB should not be propagated to the master (but I am flexible here)
I shouldn't have to spend more than 5 minutes a day maintaining this
it would be nice to replicate only the DDL from the master (I don't need the actual data changes, only table changes)
Is there some sort of replication or any other solution I can use to achieve this?
Database replication isn't cheap. Your company will pay more to build a replication environment, starting with the Oracle edition and license, plus many extras.
Replication also increases the complexity of database administration.
Finally, and most importantly, the replication would run over your VPN (which is disconnected all the time), so it would fail all the time.
What you can do with the network team:
Review the VPN service level agreement (SLA) with the service provider to learn the expected downtime percentage and the quality of service.
Have the network administrator monitor the network to spot where the problem is - it may be the line, the router, the network configuration, or a network card.
Take some measurements: what is the size of your transactions per minute (in bytes)? This helps you choose the right speed from the network service provider.
Measure network bandwidth using iperf (see the sketch after this list); for reference: https://blogs.oracle.com/mandalika/entry/measuring_network_bandwidth_using_iperf
Perform a network performance test.
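A rough sketch of that iperf measurement (the host name is a placeholder): run the server side on one end and the client on the other, then compare the reported bandwidth with your transaction volume per minute.
# on the remote end (e.g. near the master DB)
iperf -s
# on your end; -t 30 runs the test for 30 seconds
iperf -c remote-db-host -t 30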
If the changes happen only once a day, your best and easiest solution would be to do a full backup of the master DB, zip it, send it over FTP/email, and unzip and restore it on your end. This won't be feasible if the DB is too large, though.
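If that full-copy approach fits, here is a hedged sketch with Oracle Data Pump (the credentials, the DP_DIR directory object, and the file names are placeholders; add content=metadata_only if, as you mention, you only want the DDL):
# on the master: export to a dump file
expdp system/*** full=y directory=DP_DIR dumpfile=master.dmp logfile=exp.log
# zip and transfer the dump file, then on the local copy:
impdp system/*** full=y directory=DP_DIR dumpfile=master.dmp table_exists_action=replace logfile=imp.log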

Postgres constant 30% CPU usage

I recently migrated my Postgres database from Windows to CentOS 6.7.
On Windows the database never used much CPU, but on Linux I see it using a constant ~30% CPU in top (on a 4-core machine).
Does anyone know if this is normal, or why it would be doing this?
The application seems to run fine, and as fast as or faster than on Windows.
Note that it is a big installation: 100 GB+ of data spread over 1000+ databases.
I tried using pgAdmin to monitor the server status, but the server status window hangs and fails to run with the error "the log_filename parameter must be equal".
With 1000 databases I would expect the autovacuum workers and the stats collector to spend a lot of time checking what needs maintenance.
I suggest you do two things (a sketch follows this list):
raise the autovacuum_naptime parameter to reduce the frequency of the checks
point stats_temp_directory at a ramdisk
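A minimal sketch of those two changes, assuming PostgreSQL 9.4+ so ALTER SYSTEM is available (edit postgresql.conf directly on older versions), and assuming a tmpfs mount such as /var/run/pg_stats_tmp already exists; the values are placeholders to tune for your workload:
-- check the current settings first
SHOW autovacuum_naptime;
SHOW stats_temp_directory;
-- raise the naptime and move the stats temp files to the ramdisk
ALTER SYSTEM SET autovacuum_naptime = '5min';
ALTER SYSTEM SET stats_temp_directory = '/var/run/pg_stats_tmp';
-- both settings only need a reload, not a restart
SELECT pg_reload_conf();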
You have probably also set a high max_connections limit to let your clients use that large number of databases; this is another likely source of CPU load, due to the high number of 'slots' that have to be checked every time a backend synchronizes with the others.
There could be multiple reasons for increased server load.
If you are looking at query-level load on the server, match a specific Postgres backend to a system process ID using the pg_stat_activity system view.
SELECT pid, datname, usename, query FROM pg_stat_activity;
Once you know what queries are running you can investigate further (EXPLAIN/EXPLAIN ANALYZE; check locks, etc.)
You may have lock contention issues, probably due to a very high max_connections. Consider lowering max_connections and using a connection pooler if this is the case, but be aware that pooling can increase turnaround time for client connections.
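To check the lock contention theory, here is a quick query against the standard catalogs (assuming 9.2+, where pg_stat_activity has a query column) that lists backends waiting on ungranted locks and what they are running:
-- sessions currently waiting for a lock, and the statement they are stuck on
SELECT l.pid, a.datname, a.usename, l.locktype, l.mode, a.query
FROM pg_locks l
JOIN pg_stat_activity a USING (pid)
WHERE NOT l.granted;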
It might be that Windows was limiting the connections and not letting them fully use the system, and now Linux is letting those connections use the CPU and perform faster. :P
Also worth reading:
How to monitor PostgreSQL
Monitoring CPU and memory usage from Postgres

Aurora Replica Lag Much Higher Than Reported

Our application uses reader and writer instances of Amazon RDS Aurora. The AWS dashboard shows the replica lag as consistently about 20 ms. However, we are seeing old results on the reader more than 90 ms after a commit on the master, and up to at least 170 ms in some cases.
When doing CRUD operations, our app commits the data and then issues an HTTP redirect so the client loads the new data. The network turnaround on the redirect is logged on the client and is usually at least 90 ms. We log both the commit time and the read time on the application server and see a difference of around 170 ms. Old data shows up consistently.
Before Aurora we had a standard MySQL replication setup on significantly less powerful boxes and never had this issue.
Altering the application to read and write from the same Aurora instance solves the problem, but I thought Aurora used shared storage for replication. What is going on? Could this be an issue with Aurora's query cache? Is the reported replica lag inaccurate?
Any help would be appreciated.
Thank you,
If you care about strong consistency, you should issue your reads to the writer (the RW cluster endpoint). That said, it is definitely worrisome that you are seeing stale data that the replica lag metric does not capture. To sort out that specific part, I would recommend opening a support case with AWS for Aurora.
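To cross-check the dashboard metric from inside the database, Aurora MySQL exposes per-replica lag in an information_schema table; here is a sketch (I believe the table is information_schema.replica_host_status, but verify it against your Aurora version):
-- per-instance replication lag as Aurora itself measures it
SELECT server_id, session_id, replica_lag_in_milliseconds
FROM information_schema.replica_host_status;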

Redis cluster-network latency

A team in my company is working on a new Redis cluster setup to improve application data caching. The setup is as follows: a Redis cluster with one Redis master and many slaves, say 40-50 (which can grow further as the application scales), with one Redis instance per virtual machine. I was told this setup lets the applications deployed on each virtual machine query the data in their local Redis instance rather than an instance across the network, in order to avoid network latency.
Periodically, say every 5 seconds or so, the Redis master is updated with only the data that was modified, newly created, or deleted (the data is backed by a relational database). This triggers a data sync with all the Redis slave instances, and the data consumers (the applications deployed on the virtual machines) read the updated values from their local slaves for processing.
Is this a correct approach to the network latency problem the applications face when querying a Redis instance elsewhere in the data center network? And will this setup not create a lot of network traffic when the Redis master syncs the data with all its slave nodes?
I couldn't find many answers on this on the internet. Your opinions are much appreciated.
The relevance of this kind of architecture depends a lot on the workload. Here are the important criteria:
the ratio between write and read operations. Obviously, the more read operations there are, the more relevant the architecture. The main benefit, IMO, is not necessarily the latency gain, but the scalability, the extra reliability it brings, and the reduced network resource consumption.
the ratio between the cost of a local Redis access and the cost of a remote Redis access. Do not assume that the only cost of a remote access is the network latency; it is not. On my systems, a local Redis access costs about 50 us (on average, at very low workload), while a remote access costs about 120 us, of which the network latency accounts for roughly 60 us. Measure the same kind of figures on your own system/network, with your own data.
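redis-cli has a built-in probe that can give you exactly those local-versus-remote numbers; here is a sketch with placeholder host and socket paths:
# round-trip latency to the local instance (over a unix socket here)
redis-cli -s /tmp/redis.sock --latency
# round-trip latency to a remote instance
redis-cli -h redis-master.internal -p 6379 --latency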
Here are a few pieces of advice:
do not replicate a single Redis master directly to many slave instances; it will limit the scalability of the system. If you want to scale, build a hierarchy of slaves: for instance, have the master replicate to 8 intermediate slaves, and have each of those replicate to 8 further slaves running locally on your 64 application servers. If you need more nodes, you can tune the replication factor at the master or intermediate level, or add one more layer to this tree for extreme scalability. It gives you flexibility.
consider using a Unix socket between the application and the local slave, rather than a TCP socket. It is good for both latency and throughput (a configuration sketch follows this list).
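As a sketch of those last two points, a local slave on an application server might be configured like this (addresses and paths are placeholders; note that in Redis 5+ the slaveof directive was renamed replicaof):
# redis.conf excerpt for a slave running on an application server
# replicate from an intermediate slave, not directly from the master
slaveof 10.0.1.10 6379
# let local apps connect over a unix socket instead of TCP
unixsocket /var/run/redis/redis.sock
unixsocketperm 770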
Regarding your last question, you really need to evaluate the average local and remote latencies to decide whether this is worth it. Note that the protocol Redis uses to synchronize master and slaves is close to normal client-server traffic: every SET command applied on the master will also be applied on each slave, so the network bandwidth consumption is similar. In the end, it is really a matter of how many reads and how many writes you expect.

Oracle data synchronization in live environment

What are known, reliable tools for syncing huge amounts of data between Oracle DB instances in a live environment?
The requirements are that the host with the live data keeps running in the live environment, i.e. the database continues to be updated, while the receiving host is offline and will go online only when the data sync is complete.
Most of the data is stored in BLOB columns, and the amount of data to sync reaches ~100 GB. Only part of the data from a table needs to move, while the actual size of the table is around 50 TB.
This is a clustered system: each live machine is a clone of the others, and each machine contains its own Oracle DB instance. Sometimes machines need to go down for maintenance and miss live data; when they come back up, the data needs to be synchronized. A machine is usually offline for maintenance for no longer than 6 hours. Without the clone machines we would not be able to keep the system up when one of the machines must go down for maintenance.
The sync should not severely impact CPU usage on the live machine.
The first things to look at are Oracle Advanced Replication and Oracle Streams. You might want to consider getting a good book on Streams.
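As one hedged illustration of the Advanced Replication route (the object names and the database link are hypothetical): a fast-refreshable materialized view lets the node that was offline pull only the rows changed since its last refresh when it comes back up.
-- on the live (master) instance: track row-level changes
CREATE MATERIALIZED VIEW LOG ON app_owner.documents WITH PRIMARY KEY;
-- on the receiving instance: an incrementally refreshable copy over a db link
CREATE MATERIALIZED VIEW app_owner.documents_mv
  REFRESH FAST ON DEMAND
  AS SELECT * FROM app_owner.documents@live_node_link;
-- after maintenance, pull only the deltas ('F' = fast refresh)
EXEC DBMS_MVIEW.REFRESH('APP_OWNER.DOCUMENTS_MV', method => 'F');
Whether a fast refresh can keep up with ~100 GB of BLOB changes inside your 6-hour window is something you would have to test; Streams is the heavier-weight alternative mentioned above.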
