Clone production cluster - elasticsearch

I'm new to elasticsearch and would like to know the best way to do this
basically I'm trying to clone a production cluster and then use it for testing
it needs to be a complete copy and should not interrupt the production cluster
I thought about adding a new node to the production cluster and increase the number of replicas then separate that node and rename it as a new cluster
Is there a better way?

That's one way to do it, though not my personal preference because I wouldn't want to risk mixing test and production concerns, and you'd have the increased number of replicas to deal with after the fact.
You could also use the Snapshot and Restore APIs to take a snapshot and then restore it into your testing cluster. Just set your test cluster to readonly for the repository, then you can take snapshots from production and load them into test at will.

Related

ElasticSearch backup and restore

as a PoC we are looking to define a method of backing up and restoring elasticsearch clusters that are running on AWS EC2 instances. The clusters each have more than 1 node running on different EC2 instances.
Being new to elasticsearch the main method that appears is to use the elasticsearch snapshot API, however are there any issues with using AWS Backup as a service to take snapshots of the EC2 instances themselves?
The restoration process would then be to create a new EC2 instance from a specified AMI that is created by the AWS Backup snapshot of the original EC2 instance running elasticsearch.
You can do that, but it has some drawbacks and it is not recommended.
First, to make a snapshot of any instance, you will need to stop your entire elasticsearch cluster. If, for example, your cluster has 3 nodes, you will need to stop all your nodes and make the snapshots, you can't make a snapshot of only one node, you will need to make a snapshot of the entire cluster at the same moment, always.
Second, since you are making snapshots of the entire instance, not only the elasticsearch data, you lose the flexiblity of restoring the data in another place, or restore just part of the data, you need to restore everything. Also, if you make snapshots everyday at 23:00 P.M. and for some reason you need to restore your snapshot at 17:00 P.M. next day, everything stored after your last snapshot will be lost.
And Third, even if you took those precautions, there is no guarantee that you will not have problems or corrupted data.
As per the documentation:
The only reliable way to back up a cluster is by using the snapshot
and restore functionality
Since you are using AWS, the best approach would be to use a s3 repository for your snapshots and automate your backups using the snapshot lifecycle managment in kibana.

Elastic Search - How to get lost records

We have a 5 node, 16 shards ElasticSearch cluster across 5 servers, plus a routing server and a monitoring server
Scenario
A developer has accidentally deleted a number of documents from an index within the cluster. ES snapshots have not been set up, though through our VPS provider, each of the servers has regular server-wide backups, and we can spin up and down extra instances easily as necessary. What is the fastest way to restore the lost records?
there is no guarantee that those backups from a regular backup tool are useful, because they do not guarantee a consistent point-in-time snapshot.
You should not try to bring those backups into your production cluster, you can try to have a second cluster in your non-production environment and load those backups, but there is zero guarantee that this will work.
The fastest way would be reindexing from the original source I guess.

Synchronizing Ambari cluster configurations

We have been exploring Apache Ambari with HDP 2.2 to setup a cluster. Our backend features three environments: testing, staging and production which is a standard practice in our industry.
When we would deploy a cluster in the testing environment with Ambari, what is the easiest way to have the same cluster configuration on the staging, and later, production environment ?
The initial step seems easy: you create a cluster in the testing environment using the UI and then you export the configuration as a blueprint. Subsequently, you use the exported blueprint to create a new cluster in the other environments. So far, so good.
Inevitably, we will need to change our Ambari configuration (e.g. deploy a new service, increase heap size for the JVM's,...). I was hoping we could just update the blueprint (using the UI or by hand) and then use the updated blueprint to also update the different clusters. However, this seems not possible unless you destroy and recreate the cluster which seems a bit harsh.. (we don't want to lose our data) ?
Alternatively we could use the REST API of Ambari to do specific updates to the configuration but as configuration changes with respect to the initial blueprint will undoubtedly accumulate, this will prove unwieldy and unmaintainable over time, I am afraid.
Can you suggest us a better solution for this use case?
I believe the easiest way would be to dump each services configuration to a file. Then import each of those configurations into the other clusters. This could be done simply by using the Ambari API or by using the script provided by Ambari to update configurations (/var/lib/ambari-server/resources/scripts/configs.sh).

Strategy to persist the node's data for dynamic Elasticsearch clusters

I'm sorry that this is probably a kind of broad question, but I didn't find a solution form this problem yet.
I try to run an Elasticsearch cluster on Mesos through Marathon with Docker containers. Therefore, I built a Docker image that can start on Marathon and dynamically scale via either the frontend or the API.
This works great for test setups, but the question remains how to persist the data so that if either the cluster is scaled down (I know this is also about the index configuration itself) or stopped, and I want to restart later (or scale up) with the same data.
The thing is that Marathon decides where (on which Mesos Slave) the nodes are run, so from my point of view it's not predictable if the all data is available to the "new" nodes upon restart when I try to persist the data to the Docker hosts via Docker volumes.
The only things that comes to my mind are:
Using a distributed file system like HDFS or NFS, with mounted volumes either on the Docker host or the Docker images themselves. Still, that would leave the question how to load all data during the new cluster startup if the "old" cluster had for example 8 nodes, and the new one only has 4.
Using the Snapshot API of Elasticsearch to save to a common drive somewhere in the network. I assume that this will have performance penalties...
Are there any other way to approach this? Are there any recommendations? Unfortunately, I didn't find a good resource about this kind of topic. Thanks a lot in advance.
Elasticsearch and NFS are not the best of pals ;-). You don't want to run your cluster on NFS, it's much too slow and Elasticsearch works better when the speed of the storage is better. If you introduce the network in this equation you'll get into trouble. I have no idea about Docker or Mesos. But for sure I recommend against NFS. Use snapshot/restore.
The first snapshot will take some time, but the rest of the snapshots should take less space and less time. Also, note that "incremental" means incremental at file level, not document level.
The snapshot itself needs all the nodes that have the primaries of the indices you want snapshoted. And those nodes all need access to the common location (the repository) so that they can write to. This common access to the same location usually is not that obvious, that's why I'm mentioning it.
The best way to run Elasticsearch on Mesos is to use a specialized Mesos framework. The first effort is this area is https://github.com/mesosphere/elasticsearch-mesos. There is a more recent project, which is, AFAIK, currently under development: https://github.com/mesos/elasticsearch. I don't know what is the status, but you may want to give it a try.

Setting up a single backup node for an elasticsearch cluster?

Given Elasticsearch cluster with several machines, I would want to have a single machine(special node) that is located on a different geographical region that can effectively sync with the cluster for read only purpose. (i.e. no write for the special node; and that special node should be able to handle all query on its own). Is it possible and how can this be done?
With elasticsearch 1.0 (currently available in RC1) you can use the snapshot & restore api; have a look at this blog too to know more.
You can basically make a snapshot of your indices, then copy the snapshot over to the secondary location and restore it into a different cluster. The nice part is that snapshots are incremental, which means that only the files that have changed since the last snapshot are actually backed up. You can then create snapshots at regular intervals, and import them into the secondary cluster.
If you are not using 1.0 yet, I would suggest to have a look at it, snapshot & restore is a great addition. You can still make backups manually and restore them with 0.90, but you don't have a nice api to do that and you need to do everything pretty much manually.

Resources