How to perform Greenplum 6.x Backup & Recovery - database-backups

I am using Greenplum 6.x and facing issues while performing backup and recovery. Do we have any tool to take a physical backup of the whole cluster, like pgBackRest for Postgres? Also, how can we purge the WAL of the master and each segment, since we can't take a pg_basebackup of the whole cluster?

Are you using open source Greenplum 6 or a paid version? If paid, you can download the gpbackup/gprestore parallel backup utility (separate from the database software itself), which will back up the whole cluster with a wide variety of options. If using open source, your options are pretty much limited to pg_dump/pg_dumpall.
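If you go the logical dump route, a minimal sketch run from the master host (the host name mdw, port, database name mydb, and output paths below are placeholders for your environment):
pg_dump -h mdw -p 5432 -Fc -f /data/backups/mydb.dump mydb            # one database, compressed custom format
pg_dumpall -h mdw -p 5432 --globals-only -f /data/backups/globals.sql # roles and tablespaces
Both run serially through the master, so expect them to be slow on a large cluster.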
There is no way to purge the WAL logs that I am aware of. In Greenplum 6, the WAL logs are used to keep all the individual postgres engines in sync throughout the cluster. You would not want to purge these individually.
Jim McCann
VMware Tanzu Data Engineer

I would like to better understand the issues you are facing when performing your backup and recovery.
For open source users of Greenplum Database, the gpbackup/gprestore utilities can be downloaded from the Releases page of the GitHub repo:
https://github.com/greenplum-db/gpbackup/releases
v1.19.0 is the latest.
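Once installed, a rough sketch of a full backup and restore with these utilities (the database name mydb, backup directory, and timestamp below are placeholders; run gpbackup --help for the full option list):
gpbackup --dbname mydb --backup-dir /data/backups
gprestore --timestamp 20240101093000 --backup-dir /data/backups --create-db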
There currently isn't a pg_basebackup / WAL-based backup/restore solution for Greenplum Database 6.x.

WAL files are periodically purged from the master and segments individually (as they are replicated to the mirror and flushed), so no manual purging is required. Have you looked into why the WAL files are not getting purged? One possible reason is that a mirror in the cluster is down; if that happens, WAL will keep accumulating on the primary and won't get purged. Run select * from pg_replication_slots; on the master or on the segment whose WAL is building up to find out more.
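For a specific segment, a hedged example of that check (the segment host sdw1 and port 6000 are placeholders; direct segment connections in Greenplum 6 need utility mode):
PGOPTIONS='-c gp_session_role=utility' psql -h sdw1 -p 6000 -d postgres -c "select slot_name, active, restart_lsn from pg_replication_slots;"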
If the WAL build-up is caused by a replication slot because a mirror is down for some reason, you can use the GUC max_slot_wal_keep_size to configure the maximum size the WAL may consume; beyond that, the replication slot is disabled and will not consume any more disk space for WAL.
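A sketch of setting that GUC cluster-wide with gpconfig, assuming it is available in your Greenplum 6 build (the value is illustrative; check the units and default for your release):
gpconfig -c max_slot_wal_keep_size -v 10240   # cap the WAL retained for replication slots
gpstop -u                                     # reload configuration without a restart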

Related

How to sync data between two Greenplum Clusters in remote data centers (DR)

My team is planning for a DR solution and we need to sync data between Greenplum Databases in Production and DR sites.
We are running the 6.4 community edition. So tools like gpbackup and gprestore are not available.
pg_dump and pg_restore are not an option because there is a large data set involved. What is the most suitable solution for our scenario?
gpbackup and gprestore are one way Greenplum users commonly keep two clusters in sync.
While gpbackup and gprestore don't ship with open source Greenplum Database, the tools are open source themselves and freely available from their own repository: https://github.com/greenplum-db/gpbackup
Due to Greenplum's distribution of data across segments, the DR cluster must contain the same number of primary segments for a successful restore (although the number of segment hosts can differ).
A common approach we see Greenplum users implementing is backing up off-cluster to a third-party storage system (NFS, S3-compatible storage, etc.) and restoring to the destination/DR cluster from there.
There is an open source gpbackup_s3_plugin available here: https://github.com/greenplum-db/gpbackup-s3-plugin
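As a rough sketch, a backup to S3 and a restore on the DR cluster with the plugin might look like this (the YAML path, database name, and timestamp are placeholders; the plugin config file holds the bucket, region, and credentials):
gpbackup --dbname mydb --plugin-config /home/gpadmin/s3_config.yaml
gprestore --timestamp 20240101093000 --plugin-config /home/gpadmin/s3_config.yaml --create-db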
Let us know if you have any other questions.
oak

How to configure and install a standby master in Greenplum?

I've installed a single-node Greenplum DB with 2 segment hosts, each hosting 2 primary and mirror segments. I want to configure a standby master; can anyone help me with it?
It is pretty simple.
gpinitstandby -s smdw -a
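If the primary master later fails, the standby is promoted with gpactivatestandby, run on the standby host (the data directory below is an example path for the standby master's data directory):
gpactivatestandby -d /data/master/gpseg-1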
Note: If you are using one of the cloud Marketplaces that deploys Greenplum for you, the standby master runs on the first segment host. The overhead of running the standby master is pretty small, so it doesn't impact performance. The cloud Marketplaces also have self-healing, so if that node fails, it is replaced and all services are automatically restored.
As Jon said, this is fairly straightforward. Here is a link to the documentation: https://gpdb.docs.pivotal.io/5170/utility_guide/admin_utilities/gpinitstandby.html
If you have follow up questions, post them here.

How to back up entire HDFS data on a local machine

We have a small CDH cluster of 3 nodes with approx. 2 TB of data. We are planning to expand it, but before that the current Hadoop machines/racks are being relocated. I just want to make sure I have a backup on a local machine, in case the racks somehow are not relocated (or get damaged on the way) and we have to install new ones. How do I ensure this?
I have taken a snapshot of the HDFS data from Cloudera Manager as a backup, and it resides on the cluster. But in this case I need to back up the whole data set to a local machine or hard drive. Please advise.
Distcp the data somewhere.
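For example, a rough DistCp sketch (the NameNode host/port, bucket name, and mount point are placeholders; S3 credentials must be configured separately, and the file:// target only makes sense if /mnt/backup is a shared mount visible to the nodes running the copy tasks):
hadoop distcp hdfs://namenode:8020/ s3a://my-backup-bucket/hdfs-backup/
hadoop distcp hdfs://namenode:8020/ file:///mnt/backup/hdfs-backup/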
Possible options:
Own solution - a temporary cluster. 2 TB is not that much, and hardware is cheap.
Managed solution - the cloud. There are plenty of storage-as-a-service providers. If you're not sure, S3 should work for you. Of course, data transfer is your cost, but there is always a trade-off between a managed service and something hand-crafted.

How to do a Backup and Restore of Cassandra Nodes in AWS?

We have 2 m3.large instances that we want to back up. How do we go about it?
The data is in the SSD drive.
nodetool snapshot will cause the data to be written back to the same SSD drive. What's the correct procedure to follow?
You can certainly use nodetool snapshot to back up your data on each node. You will need enough SSD space to account for snapshots and the compaction frequency; typically, about 50% of the SSD storage should be reserved for this. There are other options as well. DataStax OpsCenter has backup and recovery capabilities that use snapshots and help automate some of the steps, but you will need storage allocated for that as well. Talena also has a solution for backup/restore and test/dev management for Cassandra (and other data stores like HDFS, Hive, Impala, Vertica, etc.). It relies less on snapshots by making copies off-cluster and simplifying restores.
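A minimal sketch of the snapshot route on each node (the keyspace, snapshot tag, S3 bucket, and data directory are placeholders; adjust to your paths):
nodetool snapshot -t nightly_01 my_keyspace
aws s3 sync /var/lib/cassandra/data/my_keyspace/ s3://my-backup-bucket/cassandra/$(hostname)/ --exclude "*" --include "*/snapshots/nightly_01/*"
nodetool clearsnapshot -t nightly_01   # free the SSD space once the copy is verified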

How to build a Vertica slave?

I have a Vertica instance running on our prod. Currently, we are taking regular backups of the database. I want to build a Master/Slave configuration for Vertica so that I always have the latest backup in case something goes bad. I tried to google but did not find much on this topic. Your help will be much appreciated.
There is no concept of a Master/Slave in Vertica. It seems that you are after a DR solution which would give you a standby instance if your primary goes down.
The standard practice with Vertica is to use a dual-load solution that streams data into both your primary and DR instances. The option you're currently using would require an identical standby system and would take time to restore from your backup. Your other option is storage replication, which is more expensive.
Take a look at the best practices for disaster recovery in the documentation.
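If you stay with the backup-based approach, a rough sketch with Vertica's vbr utility (the config file names are placeholders; the copycluster task generally requires the target cluster to match the source's node count and Vertica version):
vbr --task backup --config-file full_backup.ini
vbr --task copycluster --config-file copy_to_dr.ini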
