How to do a Backup and Restore of Cassandra Nodes in AWS? - amazon-ec2

We have 2 m3.large instances that we want to back up. How do we go about it?
The data is on the SSD drive.
nodetool snapshot will write the snapshot data back to the same SSD drive. What's the correct procedure to follow?

You can certainly use nodetool snapshot to back up your data on each node. You will need enough free SSD space to accommodate the snapshots alongside normal compaction activity; as a rule of thumb, reserve roughly 50% of the SSD storage for this. There are other options as well. DataStax OpsCenter has backup and restore capabilities that use snapshots and automate some of the steps, but it also needs storage allocated for them. Talena also offers a backup/restore and test/dev management solution for Cassandra (and other data stores such as HDFS, Hive, Impala, Vertica, etc.); it relies less on snapshots by keeping copies off-cluster, which also simplifies restores.
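For example, a common pattern is to take a snapshot on each node and then copy the snapshot directories off the instance, e.g. to S3, so the SSD doesn't have to hold them long term. A minimal sketch, assuming a keyspace, bucket, and data path that are placeholders rather than anything from the question:

    # Flush memtables and take a named snapshot on this node (keyspace name is an assumption)
    TAG=backup_$(date +%Y%m%d)
    nodetool flush mykeyspace
    nodetool snapshot -t "$TAG" mykeyspace

    # Snapshots are hard links under each table's snapshots/ directory;
    # copy them off the SSD to S3 (bucket name is a placeholder)
    aws s3 sync /var/lib/cassandra/data/mykeyspace/ \
        "s3://my-backup-bucket/$(hostname)/mykeyspace/" \
        --exclude "*" --include "*/snapshots/$TAG/*"

    # Remove the local snapshot once the copy has been verified
    nodetool clearsnapshot -t "$TAG"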

Related

What is the recommended DefaultFS (File system) for Hadoop on ephemeral Dataproc clusters?

What is the recommended DefaultFS (file system) for Hadoop on Dataproc? Are there any benchmarks or considerations available around using GCS vs HDFS as the default file system?
I was also trying to test things out and discovered that when I set the DefaultFS to a gs:// path, the Hive scratch files get created both on HDFS and on the GCS paths. Is this happening synchronously and adding latency, or does the write to GCS happen after the fact?
Would appreciate any guidance, reference around this.
Thank you
PS: These are ephemeral Dataproc clusters that are going to be using GCS for all persistent data.
HDFS is faster. There should already be public benchmarks for that, or it can simply be taken as a fact, because GCS is networked storage while HDFS is mounted directly on the Dataproc VMs.
"Recommended" would be persistent storage, though, so GCS, but perhaps only after finalizing the data in the applications. For example, you might not want Hive scratch files in GCS since they'll never be used outside of the current query session, but you would want Spark checkpoints there if you're running periodic batch jobs that scale down the HDFS cluster between executions.
I would say the default (HDFS) is the recommended choice. Typically, the input and output data of Dataproc jobs are persisted outside the cluster in GCS or BigQuery, and the cluster is used for compute and intermediate data. That intermediate data is stored on local disks, either directly or through HDFS, which ultimately lands on local disks as well. After the job is done, you can safely delete the cluster and pay only for the storage of the input and output data, which saves cost.
HDFS also usually has lower latency for intermediate data, especially for lots of small files and metadata operations such as directory renames. GCS is better at throughput for large files.
But when using HDFS, you need to provision sufficient disk space (at least 1 TB per node) and consider using local SSDs. See https://cloud.google.com/dataproc/docs/support/spark-job-tuning#optimize_disk_size for more details.
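If you do want to experiment with GCS as the default file system while keeping scratch data on HDFS, Dataproc lets you set these as cluster properties at creation time. A rough sketch, where the cluster name, region, bucket, and disk size are all placeholders and the property choices reflect the trade-off discussed above rather than an official recommendation:

    # Hypothetical example: GCS as the default FS, Hive scratch files kept on HDFS
    gcloud dataproc clusters create my-ephemeral-cluster \
        --region=us-central1 \
        --num-workers=2 \
        --worker-boot-disk-size=1000GB \
        --properties="core:fs.defaultFS=gs://my-data-bucket,hive:hive.exec.scratchdir=hdfs:///tmp/hive"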

Change persistent disk type to SSD

I have an Elasticsearch deployment running via ECK on a GKE cluster for production purposes, and to increase its performance I'm thinking of changing the persistent disk type to SSD. I came across solutions that suggest creating a snapshot of the disk in GCE and then creating another SSD disk from the data stored in the snapshot. I'm still concerned whether this risks data loss, and, if I create another disk, whether my Elasticsearch will be able to pick it up, since it is a StatefulSet.
Since this is a production deployment, I would advise doing the following:
Create a volume snapshot (doc).
Set up a secondary cluster (doc).
Modify the deployment so that it uses an SSD (doc).
Deploy to the second cluster.
Once this new deployment has been fully tested you can switch over the traffic.
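For the GCE-level variant described in the question (snapshot the existing disk, then create an SSD disk from it), the commands would look roughly like the sketch below. The disk names and zone are placeholders, and in an ECK/StatefulSet setup you would normally drive this through a Kubernetes VolumeSnapshot and a pd-ssd StorageClass rather than raw disks:

    # Snapshot the existing persistent disk backing the Elasticsearch data volume (names are assumptions)
    gcloud compute disks snapshot es-data-disk \
        --zone=us-central1-a \
        --snapshot-names=es-data-snap

    # Create a new SSD persistent disk from that snapshot
    gcloud compute disks create es-data-disk-ssd \
        --zone=us-central1-a \
        --type=pd-ssd \
        --source-snapshot=es-data-snap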

How to perform GreenPlum 6.x Backup & Recovery

I am using Greenplum 6.x and facing issues while performing backup and recovery. Do we have any tool to take a physical backup of the whole cluster, like pgBackRest for Postgres? Further, how can we purge the WAL of the master and each segment, since we can't take a pg_basebackup of the whole cluster?
Are you using open source Greenplum 6 or a paid version? If paid, you can download the gpbackup/gprestore parallel backup utility (separate from the database software itself), which will back up the whole cluster with a wide variety of options. If using open source, your options are pretty much limited to pg_dump/pg_dumpall.
There is no way to purge the WAL logs that I am aware of. In Greenplum 6, the WAL logs are used to keep all the individual Postgres engines in sync throughout the cluster. You would not want to purge these individually.
Jim McCann
VMware Tanzu Data Engineer
I would like to better understand the issues you are facing when you are performing your backup and recovery.
For Open Source user of the Greenplum Database, the gpbackup/gprestore utilities can be downloaded from the Releases page on the Github repo:
https://github.com/greenplum-db/gpbackup/releases
v1.19.0 is the latest.
There currently isn't a pg_basebackup / WAL-based backup/restore solution for Greenplum Database 6.x.
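For reference, a minimal gpbackup/gprestore invocation might look like the sketch below. The database name, backup directory, and timestamp are placeholders; see the utility's documentation for the full option set:

    # Back up the whole cluster in parallel to a backup directory (paths are assumptions)
    gpbackup --dbname mydb --backup-dir /data/backups

    # Restore from a specific backup, identified by its timestamp
    gprestore --timestamp 20240101120000 --backup-dir /data/backups --create-db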
WAL logs are periodically purged (as they are replicated to the mirror and flushed) from the master and segments individually, so no manual purging is required. Have you looked into why the WAL logs are not getting purged? One possible reason is that a mirror in the cluster is down; if that happens, WAL will keep accumulating on the primary and won't get purged. Run select * from pg_replication_slots; on the master or segment where WAL is building up to learn more.
If WAL is building up because a replication slot is being retained while a mirror is down, you can use the GUC max_slot_wal_keep_size to configure the maximum amount of WAL it may consume; once that limit is exceeded, the replication slot is disabled and stops consuming more disk space for WAL.
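A quick way to check and cap this from the master host could look like the following; the size value is a placeholder, and you should confirm your Greenplum build actually exposes max_slot_wal_keep_size before relying on it:

    # Inspect replication slots on the master (run against a segment's port to check a segment)
    psql -d postgres -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"

    # Cap how much WAL a lagging slot may retain (value is an assumption), then reload config
    gpconfig -c max_slot_wal_keep_size -v '10GB'
    gpstop -u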

How to backup entire hdfs data on local machine

We have a small CDH cluster of 3 nodes with approximately 2 TB of data. We are planning to expand it, but before that the current Hadoop machines/racks are being relocated, and I just want to make sure I have a backup on a local machine in case the racks are not relocated (or get damaged on the way) and we have to install new ones. How do I ensure this?
I have taken a snapshot of the HDFS data from Cloudera Manager as a backup, but it resides on the cluster. In this case I need to take a backup of the whole data set onto a local machine or hard drive. Please advise.
Distcp the data somewhere.
Possible options:
Own solution - a temporary cluster. 2 TB is not that much, and hardware is cheap.
Managed solution - to the cloud. There are plenty of storage-as-a-service providers; if you're not sure, S3 should work for you. Of course the data transfer is your cost, but there is always a trade-off between a managed service and something hand-crafted.
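For example, a distcp run to S3 could look roughly like this. The namenode address, paths, and bucket are placeholders, and the S3 credentials would normally come from core-site.xml or the environment rather than the command line:

    # Copy the HDFS data to S3 via the s3a connector (all names are assumptions)
    hadoop distcp \
        hdfs://namenode:8020/data \
        s3a://my-hdfs-backup-bucket/data

    # Or, for a one-off copy onto a single local or USB drive attached to an edge node:
    hdfs dfs -copyToLocal /data /mnt/backup/data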

Cassandra Backup and Restore using EBS snapshot and NodeTool Snaphots

I've a couple of questions regarding best approach to Backup/Restore Cassandra Cluster.
Background: I have a cluster running in EC2. Its nodes are configured like so:
Instance type: m3.medium
Storage: 50 GB root volume / 100 GB additional volume
After reading a lot of documents and searching a few websites, I understood that EBS snapshots combined with Cassandra (nodetool) snapshots look quite promising.
Questions: EBS also takes incremental snapshots and nodetool also takes snapshots, so how are these two tools different, or are they the same? And is there any other, better approach to backing up a Cassandra cluster?
Please advise.
Take a look at Netflix's Priam as a possible solution for creating backups for AWS deployments. It only seems to work with Cassandra 2.0.x, but it might point you in the right direction.
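To illustrate how the two kinds of snapshot differ in practice: nodetool snapshot works at the Cassandra/SSTable level on each node, while an EBS snapshot captures the whole block device. A common combination is to flush and snapshot in Cassandra first, so the on-disk files are consistent, and then snapshot the EBS data volume. In the sketch below the keyspace name and volume ID are placeholders:

    # On each node: flush memtables and take a Cassandra-level snapshot (keyspace is an assumption)
    nodetool flush mykeyspace
    nodetool snapshot -t pre_ebs_backup mykeyspace

    # Then snapshot the EBS data volume at the block level (volume ID is a placeholder)
    aws ec2 create-snapshot \
        --volume-id vol-0123456789abcdef0 \
        --description "cassandra data volume, pre_ebs_backup"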
