I have a couple of questions about the best approach to backing up and restoring a Cassandra cluster.
Background: I have a cluster running in EC2. Its nodes are configured like so:
Instance type : m3.medium
Storage: 50 GB root volume plus another 100 GB volume
After reading a lot of documents and searching a few websites, I understood that EBS snapshots combined with Cassandra (nodetool) snapshots look quite promising.
Questions: EBS takes incremental snapshots and nodetool also takes snapshots, so how do these two tools differ, or are they the same? And is there a better approach to backing up a Cassandra cluster?
Please advise.
Take a look at Netflix's Priam as a possible solution for creating backups for AWS deployments. It only seems to work with 2.0.x, but it might point you in the right direction.
We have just started exploring Druid and didn't find any blog or link on how to install Druid on AWS. Is it possible to install Druid on AWS EMR? If so, a predefined CloudFormation template would be helpful for my R&D on Druid.
It's pretty straightforward to set up a basic single-node Druid cluster:
launch EMR with a single-node master, like r3.4xlarge
download the Imply tar (comes with Druid and Pivot), https://docs.imply.io/on-prem/quickstart
tar -xzf imply-3.1.8.1.tar.gz
cd imply-3.1.8.1
bin/supervise -c conf/supervise/quickstart.conf
If you are looking for a full cluster deploy, EMR is not the right tool.
If you know EKS / Kubernetes, I think the easiest way to get started is using Helm:
https://github.com/helm/charts/tree/master/incubator/druid
Another option is to look at Imply Cloud.
They also have solid documentation around Druid. Druid's own documentation is pretty intense; I found Imply's to be better for beginners.
https://docs.imply.io/cloud/
For a POC, though, a single r3.4xlarge or i3.4xlarge with some 200 GB of storage is good enough.
The most likely reason why you would not find much documentation is that the two things have a different nature.
Druid is meant to be long-lived and stateful, whereas the EMR Hadoop variant is meant to spin up and down in a more ephemeral manner. As such, the combination is somewhat awkward.
Consider using a different Hadoop distribution like HDP. Of course you can easily deploy it on AWS if needed, or on your own hardware if you want to minimize infra costs.
Disclaimer: I am an employee of Cloudera, the distributor of HDP, which is currently the most common Hadoop platform under Druid.
I am planning to create a cluster with three nodes, with each node launched in a different Amazon EC2 zone.
As per the DataStax documentation, I will use Ec2MultiRegionSnitch, and the replication strategy is NetworkTopologyStrategy. Below is what I need to achieve:
Cluster size: 3 (spanning Amazon EC2 regions).
Replication Factor: 3
Read and write consistency level: QUORUM.
Based on the above configuration, I can survive the loss of a single node (meaning the outage of any one Amazon region; correct me if I am wrong).
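The arithmetic behind that claim can be checked with a minimal sketch. It assumes Cassandra's usual quorum rule, floor(RF / 2) + 1, applied to the total replication factor across all regions. The function names `quorum` and `tolerated_failures` are hypothetical helpers for illustration, not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a QUORUM read or write.

    Assumes the usual rule: floor(RF / 2) + 1.
    """
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while QUORUM is still reachable."""
    return replication_factor - quorum(replication_factor)


# The setup described above: 3 nodes, replication factor 3.
rf = 3
print("quorum size:", quorum(rf))
print("replicas that may be down:", tolerated_failures(rf))
```

With three nodes and a replication factor of 3, every node holds a replica of every row, so losing any one node (or the one region it lives in) still leaves the two replicas that QUORUM needs.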
To achieve the above configuration, I have two options:
Option 1: Using the DataStax-provided Amazon EC2 AMI.
This option launches the instance with almost all the components needed to run Cassandra, plus some monitoring tools (OpsCenter, etc.).
But it stores all data on the EC2 instance store, so data persists only for the life of the instance, and the storage size depends on the instance type.
Option 2: Using a customised installation.
With this option, I have to launch an Amazon EC2 Ubuntu AMI, then install Java and the DataStax Community Edition.
This option enables me to store all my data on EBS. Hence I can expand EBS whenever I need to, and at the same time I can restore any node from an EBS snapshot.
My Question:
Which of these options is suitable for my needs?
Note:
I have read the documentation provided by DataStax and am very new to Cassandra, so any input you provide will be very useful to me.
Thanks
It's not true that the DataStax AMI only comes with EC2 ephemeral storage. Starting from version 2.5 they claim you can choose EBS as well: Introducing the DataStax Auto-Clustering AMI 2.5. That's a relatively easy way of getting started, and the one I've personally chosen.
Should you choose EBS or EC2 ephemeral storage?
The answer is: it depends...
The past (~2012-2013):
EC2 instances with ephemeral storage were a better choice. There were detailed performance benchmarks over the years which indicated that EBS is getting better, but still, attached physical drives were better.
The past (~2014):
EC2 ephemeral storage is still the better choice. DataStax wrote a nice post about pricing, network and failure resilience: What is the story with AWS storage?
Present (~2016):
instaclustr claims:
By running Cassandra on Amazon EBS, you can run denser, cheaper
Cassandra clusters with just as much availability as ephemeral storage
instances.
Nice presentation here: AWS re:Invent 2015 | (BDT323) Amazon EBS & Cassandra: 1 Million Writes Per Second on 60 Nodes
All in all, I suggest doing a TCO analysis and, if there isn't a big difference in price, choosing EBS because of its out-of-the-box ability to take snapshots. What's more, chances are EBS will improve over time.
We have two m3.large instances that we want to back up. How should we go about it?
The data is on the SSD drive.
nodetool snapshot will cause the data to be written back to the same SSD drive. What's the correct procedure to follow?
You can certainly use nodetool snapshot to back up your data on each node. You will need enough SSD space to account for snapshots and the compaction frequency; typically, you would reserve about 50% of the SSD storage for this. There are other options as well. DataStax OpsCenter has backup and recovery capabilities that use snapshots and help automate some of the steps, but you will need storage allocated for that as well. Talena also has a solution for backup/restore and test/dev management for Cassandra (and other data stores like HDFS, Hive, Impala, Vertica, etc.). It relies less on snapshots by making copies off-cluster and simplifying restores.
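As a rough illustration of the headroom rule mentioned above, here is a minimal sketch that checks whether a disk still has the reserved fraction free and then builds a `nodetool snapshot` command line. The 50% default is just the rule of thumb from this answer, and the helper names, the tag, and the keyspace are hypothetical values chosen for the example.

```python
def has_snapshot_headroom(total_bytes, used_bytes, reserve_fraction=0.5):
    """Return True if at least `reserve_fraction` of the disk is still free.

    The 0.5 default mirrors the "about 50%" rule of thumb; adjust it
    to your own snapshot and compaction behaviour.
    """
    free_bytes = total_bytes - used_bytes
    return free_bytes >= total_bytes * reserve_fraction


def build_snapshot_command(tag, keyspace):
    """Build the argument list for `nodetool snapshot -t <tag> <keyspace>`."""
    return ["nodetool", "snapshot", "-t", tag, keyspace]


# Example with made-up numbers: an 80 GB SSD holding 30 GB of data.
gib = 1024 ** 3
if has_snapshot_headroom(total_bytes=80 * gib, used_bytes=30 * gib):
    # "daily_backup" and "my_keyspace" are placeholder names.
    print(" ".join(build_snapshot_command("daily_backup", "my_keyspace")))
```

On a real node, the totals could come from `shutil.disk_usage()` on the Cassandra data directory, and the command list could be passed to `subprocess.run()`; both steps are left out here so the sketch runs without a Cassandra installation.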
I'm sorry that this is probably a rather broad question, but I haven't found a solution for this problem yet.
I try to run an Elasticsearch cluster on Mesos through Marathon with Docker containers. Therefore, I built a Docker image that can start on Marathon and dynamically scale via either the frontend or the API.
This works great for test setups, but the question remains how to persist the data so that, if the cluster is scaled down (I know this is also about the index configuration itself) or stopped, I can restart later (or scale up) with the same data.
The thing is that Marathon decides where (on which Mesos slave) the nodes are run, so from my point of view it's not predictable whether all the data is available to the "new" nodes upon restart when I try to persist the data to the Docker hosts via Docker volumes.
The only things that come to my mind are:
Using a distributed file system like HDFS or NFS, with mounted volumes either on the Docker host or the Docker images themselves. Still, that would leave the question how to load all data during the new cluster startup if the "old" cluster had for example 8 nodes, and the new one only has 4.
Using the Snapshot API of Elasticsearch to save to a common drive somewhere in the network. I assume that this will have performance penalties...
Are there any other ways to approach this? Are there any recommendations? Unfortunately, I didn't find a good resource on this kind of topic. Thanks a lot in advance.
Elasticsearch and NFS are not the best of pals ;-). You don't want to run your cluster on NFS, it's much too slow and Elasticsearch works better when the speed of the storage is better. If you introduce the network in this equation you'll get into trouble. I have no idea about Docker or Mesos. But for sure I recommend against NFS. Use snapshot/restore.
The first snapshot will take some time, but the rest of the snapshots should take less space and less time. Also, note that "incremental" means incremental at file level, not document level.
The snapshot itself needs all the nodes that hold the primaries of the indices you want snapshotted. And those nodes all need access to the common location (the repository) so that they can write to it. This common access to the same location is usually not that obvious, which is why I'm mentioning it.
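To make the shared-repository point concrete, here is a minimal sketch that builds the two HTTP requests involved: one to register a shared filesystem (`fs`) repository and one to take a snapshot into it. It only constructs the requests and does not send them. The base URL, repository name, snapshot name, and mount path are all assumptions for illustration, and the path must be a location that every node can reach.

```python
import json
import urllib.request


def build_register_repo_request(base_url, repo_name, location):
    """Build a PUT request that registers a shared-filesystem repository."""
    body = {"type": "fs", "settings": {"location": location}}
    return urllib.request.Request(
        f"{base_url}/_snapshot/{repo_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def build_snapshot_request(base_url, repo_name, snapshot_name):
    """Build a PUT request that starts a snapshot and waits for it to finish."""
    url = f"{base_url}/_snapshot/{repo_name}/{snapshot_name}?wait_for_completion=true"
    return urllib.request.Request(url, method="PUT")


# Placeholder values: adjust the host, names, and shared mount to your setup.
repo_req = build_register_repo_request(
    "http://localhost:9200", "my_backup", "/mnt/es_backups"
)
snap_req = build_snapshot_request("http://localhost:9200", "my_backup", "snapshot_1")
print(repo_req.get_method(), repo_req.full_url)
print(snap_req.get_method(), snap_req.full_url)
```

Passing either request to `urllib.request.urlopen()` would send it to the cluster. The `location` has to point at the same shared storage on every node that holds a primary shard, which is the requirement described above.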
The best way to run Elasticsearch on Mesos is to use a specialized Mesos framework. The first effort in this area is https://github.com/mesosphere/elasticsearch-mesos. There is a more recent project which is, AFAIK, currently under development: https://github.com/mesos/elasticsearch. I don't know what its status is, but you may want to give it a try.
I have a Vertica instance running on our prod. Currently, we are taking regular backups of the database. I want to build a Master/Slave configuration for Vertica so that I always have the latest backup in case something goes bad. I tried to google but did not find much on this topic. Your help will be much appreciated.
There is no concept of a Master/Slave in Vertica. It seems that you are after a DR solution which would give you a standby instance if your primary goes down.
The standard practice with Vertica is to use a dual load solution which streams data into your primary and DR instances. The option you're currently using would require an identical standby system and take time to restore from your backup. Your other option is to do storage replication which is more expensive.
Take a look at the best practices for disaster recovery in the documentation.