I have Elasticsearch running on ECK in a GKE cluster for production, and to improve its performance I'm thinking of changing the persistent disk type to SSD. I came across solutions that suggest creating a snapshot of the disk in GCE and then creating a new SSD disk from the data stored in the snapshot. I'm still concerned about whether this carries a risk of data loss, and whether my Elasticsearch will be able to bind to the new disk, since it runs as a StatefulSet.
Since this is a production deployment, I would advise doing the following:
Create a volume snapshot (doc).
Set up a secondary cluster (doc).
Modify the deployment so that it uses an SSD (doc).
Deploy to the second cluster.
Once this new deployment has been fully tested you can switch over the traffic.
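As a rough illustration of the snapshot and SSD steps above, the sketch below builds the two Kubernetes manifests as Python dictionaries and prints them as JSON, which kubectl can apply. It assumes the Compute Engine persistent disk CSI driver is enabled on the cluster. The names es-ssd, es-data-snap, and pd-snapshot-class, and the PVC name in the example call, are hypothetical placeholders and not taken from the question.

```python
import json


def ssd_storage_class(name="es-ssd"):
    """StorageClass that provisions SSD persistent disks via the GKE PD CSI driver."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},  # hypothetical name
        "provisioner": "pd.csi.storage.gke.io",
        "parameters": {"type": "pd-ssd"},
        # Delay provisioning until a pod is scheduled, so the disk lands in the pod's zone.
        "volumeBindingMode": "WaitForFirstConsumer",
        # Keep the underlying disk if the claim is deleted by mistake.
        "reclaimPolicy": "Retain",
    }


def volume_snapshot(pvc_name, name="es-data-snap", snapshot_class="pd-snapshot-class"):
    """VolumeSnapshot of an existing PVC; the snapshot class is assumed to exist already."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": name},  # hypothetical name
        "spec": {
            "volumeSnapshotClassName": snapshot_class,  # hypothetical name
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


if __name__ == "__main__":
    # The PVC name below is a placeholder; use the claim that backs your data node.
    for manifest in (ssd_storage_class(), volume_snapshot("my-es-data-pvc")):
        print(json.dumps(manifest, indent=2))
```

In the second cluster, the Elasticsearch resource would then reference the new class through storageClassName in the volumeClaimTemplates of its nodeSets, so the StatefulSet that ECK manages requests SSD-backed claims itself and no disk has to be matched by hand.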
Related
I am using EBS for storage for my Elasticsearch cluster on EKS. For performance reasons, I want to use instance storage instead of EBS. However, if a critical instance shuts down, its shards or replicas will be lost. How can I configure this properly so that no data is lost in that scenario?
I've made sample changes to a few parameters I created for storage in the configuration files, but I'm not sure it's the right way. I left it as is so that the changes I made do not cause any data loss.
We have two m3.large instances that we want to back up. How should we go about it?
The data is in the SSD drive.
nodetool snapshot will cause the data to be written back to the same SSD drive. What is the correct procedure to follow?
You can certainly use nodetool snapshot to back up your data on each node. You will need enough SSD space to account for snapshots and the compaction frequency. Typically, you would need about 50% of the SSD storage reserved for this. There are other options as well. DataStax OpsCenter has backup and recovery capabilities that use snapshots and help automate some of the steps, but you will need storage allocated for that as well. Talena also has a solution for backup/restore and test-dev management for Cassandra (and other data stores like HDFS, Hive, Impala, Vertica, etc.). It relies less on snapshots by making copies off-cluster and simplifying restores.
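The 50% rule of thumb above can be checked before a snapshot is taken. This is a minimal sketch, assuming the reserve is measured as free space relative to total capacity of the data volume; the function names are illustrative and the path passed to check_path is whatever directory your Cassandra data lives in.

```python
import shutil


def snapshot_headroom_ok(total_bytes, free_bytes, reserve_fraction=0.5):
    """Return True if at least reserve_fraction of the volume is still free."""
    return free_bytes >= total_bytes * reserve_fraction


def check_path(path, reserve_fraction=0.5):
    """Apply the headroom check to the filesystem that contains path."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (bytes)
    return snapshot_headroom_ok(usage.total, usage.free, reserve_fraction)


if __name__ == "__main__":
    # "." is only a stand-in; point this at the Cassandra data directory.
    print("enough headroom for snapshots:", check_path("."))
```

A check like this could run on each node ahead of nodetool snapshot, so a backup is skipped rather than started on a volume that has no room left for it.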
I've done quite a bit of research and have yet to find an answer to this. Here's what I'm trying to accomplish:
I have an ELK stack container running in a pod on a k8s cluster in GCE - the cluster also contains a PersistentVolume (format: ext4) and a PersistentVolumeClaim.
In order to scale the ELK stack to multiple pods/nodes and keep persistent data in ElasticSearch, I either need to have all pods write to the same PV (using the node/index structure of the ES file system), or have some volume logic to scale up/create these PVs/PVCs.
Currently what happens is if I spin up a second pod on the replication controller, it can't mount the PV.
So I'm wondering if I'm going about this the wrong way, and what is the best way to architect this solution to allow for persistent data in ES when my cluster/nodes autoscale.
Persistent Volumes have access semantics. On GCE, I'm assuming you are using a Persistent Disk, which can be mounted either as writable by a single pod or as read-only by multiple pods. If you want multi-writer semantics, you need to set up NFS or some other storage that lets you write from multiple pods.
In case you are interested in running NFS - https://github.com/kubernetes/kubernetes/blob/release-1.2/examples/nfs/README.md
FYI: We are still working on supporting auto-provisioning of PVs as you scale your deployment. As of now it is a manual process.
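The access semantics described in this answer can be summarized in a small model. This is a simplified sketch and not Kubernetes code: it treats ReadWriteOnce as "one consumer at a time", as the answer does, whereas Kubernetes enforces that mode per node. The function name and its parameters are hypothetical.

```python
from enum import Enum


class AccessMode(Enum):
    READ_WRITE_ONCE = "ReadWriteOnce"   # single writer (a GCE Persistent Disk supports this)
    READ_ONLY_MANY = "ReadOnlyMany"     # many readers (a GCE Persistent Disk supports this)
    READ_WRITE_MANY = "ReadWriteMany"   # many writers (needs NFS or similar)


def can_attach(mode, existing_mounts, wants_write):
    """Decide whether one more consumer may mount a volume bound with the given mode.

    existing_mounts is the number of consumers that already have it mounted.
    """
    if mode is AccessMode.READ_WRITE_MANY:
        return True
    if mode is AccessMode.READ_ONLY_MANY:
        return not wants_write
    # ReadWriteOnce: modeled here as exclusive to a single consumer.
    return existing_mounts == 0


if __name__ == "__main__":
    # A second writable pod on a ReadWriteOnce disk is refused, as in the question.
    print(can_attach(AccessMode.READ_WRITE_ONCE, existing_mounts=1, wants_write=True))
```

Under this model, the second pod in the question fails to mount because the disk is already held by the first writer; only a ReadWriteMany backend would let both pods write.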
The Kubernetes example of an Elasticsearch production deployment contains a warning about using emptyDir. It advises that the storage "be adapted according to your storage needs" and links to the documentation on persistent storage in Kubernetes.
Is it better to use a persistent storage, which is an external storage for the node, and so needs (high) I/O over network, or can we deploy a reliable Elasticsearch using multiple data nodes with local emptyDir storage?
Context: We're deploying our Kubernetes on commodity hardware, and we prefer not to use SAN for the storage layer (because it doesn't seem like commodity).
The warning is there so that folks don't assume that using emptyDir provides a persistent storage layer. An emptyDir volume will persist as long as the pod is running on the same host. But if the host is replaced or its disk becomes corrupted, then all data would be lost. Using network-mounted storage is one way to work around both of these failure modes. If you want to use replicated storage instead, that works as well.
I have a couple of questions regarding the best approach to backing up and restoring a Cassandra cluster.
Background: I have a cluster running in EC2. Its nodes are configured like so:
Instance type : m3.medium
Storage : 50 GB Root Volume/100 GB another volume
After reading a lot of documents and searching a few websites, I understood that combining EBS snapshots with Cassandra (nodetool) snapshots looks quite promising.
Questions: EBS takes incremental snapshots and nodetool also takes snapshots, so how do these two tools differ, or are they the same? And is there any better approach to backing up a Cassandra cluster?
Please advise.
Take a look at Netflix's Priam as a possible solution for creating backups for AWS deployments. It only seems to work with 2.0.x, but it might point you in the right direction.