How to replicate, for DR purposes, a Greenplum DB to another data centre? - greenplum

We are planning for a large Greenplum DB (growing from 10 to 100TB over the first 18 months). Traditional backup and restore tools aren't going to help as we have 24hr RPO/RTOs to deal with.
Is there a way to replicate the DB across to our DR site without resorting to block replication (i.e. place a segment on SAN and mirror)?

You've got a number of options to choose from:
Dual ETL. Replicate the input data and run the same ETL on both sites. Synchronize them with backup-restore every week or so.
Backup-restore. A simple backup-restore may not be that efficient, but if you use DataDomain it can perform deduplication at the block level and store only the changed blocks. It can offload the deduplication work to run on the Greenplum cluster (DDBoost). When replicating to a remote site it also ships only the changed blocks, which greatly reduces replication time. In my experience, if a clean backup to DD takes 12 hours, a subsequent DDBoost backup will take 4 hours plus 4 hours to replicate the data.
Custom solution. I know of a case where replication to the remote site is done as part of the ETL process. The ETL job knows which tables have changed, so they are added to a replication queue and moved to the remote site using external tables (see the sketch below). Analysts work in a special sandbox, and their sandbox is replicated with backup-restore daily.
At the moment Greenplum has no built-in WAN replication solution, so these are about all the options to choose from.
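To illustrate the external-table approach from the custom solution above, here is a minimal sketch, assuming a gpfdist process is running on a host at the DR site; the table, host and file names are made up:
-- Hypothetical example: push one changed table to the DR site via gpfdist.
-- Assumes gpfdist is listening on dr-etl-host:8081 and that the DR cluster
-- loads the produced file into a table of the same structure.
CREATE WRITABLE EXTERNAL TABLE sales_repl_out (LIKE sales)
LOCATION ('gpfdist://dr-etl-host:8081/sales_changed.txt')
FORMAT 'TEXT' (DELIMITER '|');

INSERT INTO sales_repl_out SELECT * FROM sales;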

I did some investigation on this; here are my results.
I. Using EMC Symmetrix VMAX SAN (Storage Area Network) Mirror and SRDF (Symmetrix Remote Data Facility) remote replication software
Please refer to h12079-vnx-replication-technologies-overview-wp.pdf for details
Preconditions
1. Having EMC Symmetrix VMAX SAN installed
2. Having SRDF software
Advantages of the 3 different modes
1. Symmetrix Remote Data Facility / Synchronous (SRDF/S)
Provides a no-data-loss solution (zero RPO). No server resource contention for the remote mirroring operation. Can perform restoration of the primary site with minimal impact to application performance on the remote site. Enterprise disaster recovery solution. Supports replicating over IP and Fibre Channel protocols.
2. Symmetrix Remote Data Facility / Asynchronous (SRDF/A)
Extended-distance data replication that supports longer distances than SRDF/S. SRDF/A does not affect host performance, because host activity is decoupled from the remote copy process. Efficient link utilization that results in lower link-bandwidth requirements. Facilities to invoke failover and restore operations. Supports replicating over IP and Fibre Channel protocols.
3. Symmetrix Remote Data Facility / Data Mobility (SRDF/DM)
II. Using Backup Tools
Please refer to http://gpdb.docs.pivotal.io/4350/admin_guide/managing/backup.html for details
Parallel Backup
Parallel backup utility gpcrondump
Non-parallel backup
It is not recommended. It is used to migrate PostgreSQL databases to Greenplum databases.
Parallel Restore
Supports restoring to a system with the same configuration as the source Greenplum database, or to a system with a different configuration.
Non-Parallel Restore
pg_restore requires modifying the CREATE TABLE statements to add the DISTRIBUTED BY clause.
 
Disadvantages
1. The backup process locks tables: it puts an EXCLUSIVE lock on the pg_class catalog table, which means that only read access is allowed during this period.
2. After releasing the EXCLUSIVE lock on pg_class, it puts an ACCESS SHARE lock on all the tables being dumped, which again allows only read access for the duration of the lock.
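To observe these locks while a backup is running, a catalog query along these lines can be used (a sketch for Greenplum/PostgreSQL):
-- Sketch: show which locks are currently held and by which backend.
SELECT c.relname, l.mode, l.granted, l.pid
FROM pg_locks l
JOIN pg_class c ON c.oid = l.relation
WHERE l.mode IN ('ExclusiveLock', 'AccessShareLock')
ORDER BY c.relname;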
III. Replay DDL statements
In PostgreSQL there is a parameter to log all SQL statements to a file.
In data/postgresql.conf, set log_statement to 'all'.
Write an application to extract the DML and DDL statements from the log and run them on the DR servers.
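After reloading the configuration, the setting can be verified from any session (a simple sanity check):
-- Confirm that full statement logging is active.
SELECT name, setting FROM pg_settings WHERE name = 'log_statement';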
Advantages
1. Easy to configure and maintain
2. No decrease in performance
Disadvantage
1. Needs additional storage for the statement logging
IV. Parse the WAL of PostgreSQL
Parse the WAL to extract the DDL statements and then run the generated DDL statements on the DR Greenplum cluster.
Advantage
1. Doesn't impact the source Greenplum database
Disadvantages
1. Requires writing code to parse the WAL
2. The WAL is not easy to parse, and there is not enough documentation about its format.
3. It is unclear whether this is feasible for Greenplum, since it is a PostgreSQL solution.

Related

Dynamically list the contents of a table in a database that continuously updates

It's kind of a real-world problem, and I believe a solution exists, but I couldn't find one.
We have a database called Transactions that contains tables such as Positions, Securities, Bogies, Accounts, Commodities and so on, which are updated continuously, every second, whenever a new transaction happens. For the time being, we have replicated the master database Transactions to a new database named TRN, on which we do all the querying and updating.
We want a sort of monitoring system (like the htop process viewer in Linux) for the database that dynamically lists the updated rows in its tables at any time.
TL;DR Is there any way to get a continuously updating list of rows in any table in the database?
Currently we are working with Sybase and Oracle DBMSs on the Linux (Ubuntu) platform, but we would like generic answers that apply to most platforms and DBMSs (including MySQL), and any tools, utilities or scripts that can do this, so that it can help us easily migrate to other platforms and/or DBMSs in the future.
To list updated rows, you conceptually need one of two things:
The updating statement's effect on the table.
A previous version of the table to compare with.
How you get them and in what form is completely up to you.
The 1st option allows you to list updates with statement granularity while the 2nd is more suitable for time-based granularity.
Some options from the top of my head:
Write to a temporary table
Add a field with a transaction id/timestamp (see the sketch at the end of this answer)
Make clones of the table regularly
AFAICS, Oracle doesn't have built-in facilities to get the affected rows, only their count.
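As an illustration of the timestamp-field option in Oracle (a rough sketch; the table, column and trigger names are made up, and Sybase syntax would differ):
-- Hypothetical: track the last modification time of every row in positions.
ALTER TABLE positions ADD (last_updated TIMESTAMP);

CREATE OR REPLACE TRIGGER positions_touch
BEFORE INSERT OR UPDATE ON positions
FOR EACH ROW
BEGIN
  :NEW.last_updated := SYSTIMESTAMP;
END;
/

-- Rows changed in the last 10 seconds:
SELECT * FROM positions
WHERE last_updated > SYSTIMESTAMP - INTERVAL '10' SECOND;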
Not a lot of details in the question so not sure how much of this will be of use ...
'Sybase' is mentioned but nothing is said about which Sybase RDBMS product (ASE? SQLAnywhere? IQ? Advantage?)
by 'replicated master database transaction' I'm assuming this means the primary database is being replicated (as opposed to the database called 'master' in a Sybase ASE instance)
no mention is made of what products/tools are being used to 'replicate' the transactions to the 'new database' named 'TRN'
So, assuming part of your environment includes Sybase(SAP) ASE ...
MDA tables can be used to capture counters of DML operations (eg, insert/update/delete) over a given time period
MDA tables can capture some SQL text, though the volume/quality could be in doubt if a) MDA is not configured properly and/or b) the DML operations are wrapped up in prepared statements, stored procs and triggers
auditing could be enabled to capture some commands but again, volume/quality could be in doubt based on how the DML commands are executed
also keep in mind that there's a performance hit for using MDA tables and/or auditing, with the level of performance degradation based on individual config settings and the volume of DML activity
Assuming you're using the Sybase(SAP) Replication Server product, those replicated transactions sent through repserver likely have all the info you need to know which tables/rows are being affected; so you have a couple options:
route a copy of the transactions to another database where you can capture the transactions in whatever format you need [you'll need to design the database and/or any customized repserver function strings]
consider using the Sybase(SAP) Real Time Data Streaming product (yeah, additional li$ence is required) which is specifically designed for scenarios like yours, ie, pull transactions off the repserver queues and format for use in downstream systems (eg, tibco/mqs, custom apps)
I'm not aware of any 'generic' products that work, out of the box, as per your (limited) requirements. You're likely looking at some different solutions and/or customized code to cover your particular situation.

Reduce durability to improve performance in db2

In Db2 10.5, is it possible for a transaction commit to return control to the client without waiting for log I/O to finish, like SQL Server's delayed durability?
Is there any way to reduce the number of log I/Os when there is a large number of small serial transactions?
Db2 on Linux/Unix/Windows, at version 11.1, does not currently support lazy commits as offered by some versions of Microsoft SQL Server.
For some kinds of processing, especially large batches, it is often very convenient to use unlogged global temporary tables for intermediate tables. That is the one convenient way to eliminate logging overhead, although the use case is limited to specific scenarios. Such tables (declared global temporary tables, or created global temporary tables) let you do fast processing without incurring logging overhead, although you have to design your batches (typically your stored procedures) to work with these kinds of tables, including the ability to restart after failures and recover etc.
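For example, a declared global temporary table with logging disabled might look like this (a sketch; the table and columns are illustrative, and a user temporary tablespace must already exist):
-- Hypothetical intermediate table for a batch step; its changes are not logged.
DECLARE GLOBAL TEMPORARY TABLE session.stage_orders (
  order_id  INTEGER,
  amount    DECIMAL(12,2)
) ON COMMIT PRESERVE ROWS NOT LOGGED;

INSERT INTO session.stage_orders
SELECT order_id, amount FROM orders WHERE order_date = CURRENT DATE;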
If you have high-frequency discrete OLTP transactions that include both inserts and updates (not batch), you should concentrate on optimising your active-logging configuration: for example, ensure your active logs are on the fastest media, ensure Db2 is never waiting for log files, ensure that logbufsz is adequate, that bufferpool cleaning is optimal, and that the size of your transaction-log files is compatible with your RTO and RPO service levels.
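The relevant logging parameters can be inspected from SQL through the administrative views (a sketch; the db2 get db cfg command shows the same values):
-- Current values of some logging-related database configuration parameters.
SELECT name, value
FROM sysibmadm.dbcfg
WHERE name IN ('logbufsz', 'logfilsiz', 'logprimary', 'mincommit');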
Db2 LUW has a database configuration parameter called mincommit. The value of mincommit indicates how many transactions will commit before flushing out the log buffer to disk. This is probably what you are looking for.
From Db2 LUW v10.5 and higher, this parameter is ignored, and the value is only meaningful on versions of Db2 LUW up to v10.1.
For older versions of Db2 LUW, the recommendation is to leave the value at 1. In most cases there is no performance improvement while at the same time introducing risk. Hence, the configuration parameter has been deprecated in version 10.1. My advice: Do not use it even though it still exists.

How much data/networking usage does an Oracle Client use while running queries across schemas?

When I run a query to copy data from schemas, does it perform all SQL on the server end or copy data to a local application and then push it back out to the DB?
The two tables sit in the same DB, but the DB is accessed through a VPN. Would it change if it was across databases?
For instance (Running in Toad Data Point):
create table schema2.table2 as
select
   sum(row1)
  ,row2
from schema1.table1
group by row2;
The reason I ask is that I'm getting quotes for a virtual machine in Azure Cloud and want to make sure that I'm not going to break the bank on data costs.
The processing of SQL statements on the same database usually takes place entirely on the server and generates little network traffic.
In Oracle, schemas are a logical object. There is no physical barrier between them. In a SQL query using two tables it makes no difference if those tables are in the same schema or in different schemas (other than privilege issues).
Some exceptions:
Real Application Clusters (RAC) - RAC may share a huge amount of data between the nodes. For example, if the table was cached on one node and the processing happened on another, it could send all the table data through the network. (I'm not sure how this works on the cloud though. Normally the inter-node traffic is done with a separate, dedicated network connection.)
Database links - It should be obvious if your application is using database links though.
Oracle Reports and Forms(?) - A few rare tools have client-side PL/SQL processing. Possibly those programs might send data to the client for processing. But I still doubt it would do something crazy like send an entire table to the client to be sorted, and then return the results to the server.
Backups/archive logs - I assume all the data will be backed up. I'm not sure how that's counted, but possibly that means all data written will also be counted as network traffic eventually.
The queries below are examples of different ways to check the network traffic being generated.
--SQL*Net bytes sent for a session.
select *
from gv$sesstat
join v$statname
on gv$sesstat.statistic# = v$statname.statistic#
--You probably also want to filter for a specific INST_ID and SID here.
where lower(display_name) like '%sql*net%';
--SQL*Net bytes sent for the entire system.
select *
from gv$sysstat
where lower(name) like '%sql*net%'
order by value desc;

Kafka Streams with lookup data on HDFS

I'm writing an application with Kafka Streams (v0.10.0.1) and would like to enrich the records I'm processing with lookup data. This data (a timestamped file) is written into an HDFS directory on a daily basis (or 2-3 times a day).
How can I load this in the Kafka Streams application and join it to the actual KStream?
What would be the best practice to reread the data from HDFS when a new file arrives there?
Or would it be better switching to Kafka Connect and write the RDBMS table content to a Kafka topic which can be consumed by all the Kafka Streams application instances?
Update:
As suggested, Kafka Connect would be the way to go. Because the lookup data is updated in the RDBMS on a daily basis, I was thinking about running Kafka Connect as a scheduled one-off job instead of keeping the connection always open. Yes, because of the semantics and the overhead of keeping a connection always open and making sure that it won't be interrupted, etc. For me, having a scheduled fetch looks safer in this case.
The lookup data is not big, and records may be deleted / added / modified. I don't know either how I could always push a full dump into a Kafka topic and truncate the previous records. Enabling log compaction and sending null values for the keys that have been deleted would probably not work, as I don't know what has been deleted in the source system. Additionally, AFAIK I don't have control over when the compaction happens.
The recommended approach is indeed to ingest the lookup data into Kafka, too -- for example via Kafka Connect -- as you suggested above yourself.
But in this case how can I schedule the Connect job to run on a daily basis rather than continuously fetching from the source table, which is not necessary in my case?
Perhaps you can update your question to explain why you do not want to have a continuous Kafka Connect job running? Are you concerned about resource consumption (load on the DB), are you concerned about the semantics of the processing if it's not "daily updates", or...?
Update:
As suggested, Kafka Connect would be the way to go. Because the lookup data is updated in the RDBMS on a daily basis, I was thinking about running Kafka Connect as a scheduled one-off job instead of keeping the connection always open. Yes, because of the semantics and the overhead of keeping a connection always open and making sure that it won't be interrupted, etc. For me, having a scheduled fetch looks safer in this case.
Kafka Connect is safe, and the JDBC connector has been built exactly for the purpose of feeding DB tables into Kafka in a robust, fault-tolerant, and performant way (there are many production deployments already). So I would suggest not falling back to a "batch update" pattern just because "it looks safer"; personally, I think triggering daily ingestions is operationally less convenient than just keeping it running for continuous (and real-time!) ingestion, and it also leads to several downsides for your actual use case (see next paragraph).
But of course, your mileage may vary -- so if you are set on updating just once a day, go for it. But you lose a) the ability to enrich your incoming records with the very latest DB data at the point in time when the enrichment happens, and, conversely, b) you might actually enrich the incoming records with stale/old data until the next daily update completes, which most probably will lead to incorrect data that you are sending downstream / making available to other applications for consumption. If, for example, a customer updates her shipping address (in the DB) but you only make this information available to your stream processing app (and potentially many other apps) once per day, then an order-processing app will ship packages to the wrong address until the next daily ingest completes.
The lookup data is not big, and records may be deleted / added / modified. I don't know either how I could always push a full dump into a Kafka topic and truncate the previous records. Enabling log compaction and sending null values for the keys that have been deleted would probably not work, as I don't know what has been deleted in the source system.
The JDBC connector for Kafka Connect already handles this automatically for you: 1. it ensures that DB inserts/updates/deletes are properly reflected in a Kafka topic, and 2. Kafka's log compaction ensures that the target topic doesn't grow out of bounds. Perhaps you may want to read up on the JDBC connector in the docs to learn which functionality you just get for free: http://docs.confluent.io/current/connect/connect-jdbc/docs/ ?

How to provision Oracle-to-MySQL replication using synchronous CDC with no downtime?

I'm trying to provision Oracle-to-MySQL replication using the parallel extraction method outlined in the Tungsten Replicator documentation.
Set up CDC tables in Oracle using the setupCDC.sh script provided by Tungsten.
Start the parallel extractor, specifying the starting SCN of the CDC process given by the previous script.
The parallel extractor will insert all existing data using flashback queries of the form AS OF SCN ..., performing a point-in-time provisioning with data integrity.
The problem is the setupCDC script prints out an SCN only if the CDC is asynchronous. It's hinted in an official forum thread that this is to "get a single position for the whole schema snapshot."
Due to licensing restrictions, I can only use synchronous CDC. Is it safe to manually read the SCN recorded in the all_capture table and use it for provisioning? What are my options that can achieve both data integrity and minimum downtime?
a. Disable write operations to the master database while provisioning is in progress:
This is non-desirable as my database holds hundreds of gigabytes of data, probably resulting in a long downtime.
b. Allow write operations during provisioning: any discrepancies will be fixed by re-applying all CDC data through normal replication after parallel extraction has processed all tables. Any errors raised during the re-application will have to be ignored.
Would this be safe, from the viewpoint of data integrity?
Is it safe to manually read the SCN recorded in the all_capture table and use it for provisioning?
For synchronous CDC, there is no entry in the all_capture table, which is for asynchronous capture processes.
Instead, each change table records the SCN at the time of its creation. You can determine the lowest SCN from the change_tables table and provide it as the argument to the provisioning command: trepctl online -provision <scn>.
SQL> COL scn FORMAT 999999999999999
SQL> SELECT MIN(created_scn) scn FROM change_tables WHERE change_set_name = 'TUNGSTEN_CS_{service_name}';
(Replace {service_name} with your own service name.)
