Elasticsearch - How to recover lost records

We have a 5-node, 16-shard Elasticsearch cluster across 5 servers, plus a routing server and a monitoring server.
Scenario
A developer has accidentally deleted a number of documents from an index within the cluster. ES snapshots have not been set up, though through our VPS provider, each of the servers has regular server-wide backups, and we can spin up and down extra instances easily as necessary. What is the fastest way to restore the lost records?

There is no guarantee that backups taken by a regular backup tool are useful, because they do not provide a consistent point-in-time snapshot of the cluster.
You should not try to bring those backups into your production cluster. You can try standing up a second cluster in a non-production environment and loading the backups there, but there is zero guarantee that this will work.
The fastest way would probably be to reindex the lost documents from the original source.
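If the original source of the data is still available (a database, an export, flat files), a minimal sketch of re-ingesting the lost documents through the Bulk API might look like the following. The host, index name, and the `load_source_records()` helper are placeholders for illustration, not part of the original answer.

```python
# Hedged sketch: re-ingest the deleted documents from the original source
# (e.g. a database export or flat files) using the Bulk API.
import json
import requests

ES_URL = "http://localhost:9200"   # hypothetical coordinating/routing node
INDEX = "my-index"                 # hypothetical index that lost documents

def load_source_records():
    """Placeholder: yield (doc_id, document_dict) pairs from the system of record."""
    yield "1", {"title": "example", "body": "restored from source"}

def bulk_body(records):
    # The Bulk API expects newline-delimited JSON: an action line, then a source line.
    lines = []
    for doc_id, doc in records:
        lines.append(json.dumps({"index": {"_index": INDEX, "_id": doc_id}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

resp = requests.post(
    f"{ES_URL}/_bulk",
    data=bulk_body(load_source_records()),
    headers={"Content-Type": "application/x-ndjson"},
)
resp.raise_for_status()
print(resp.json()["errors"])  # False if every action succeeded
```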

Related

How can I migrate to instance storage for an EBS-defined Elasticsearch cluster without losing data?

I am using EBS for storage for my Elasticsearch cluster on an EKS cluster, but for performance reasons I want to use instance storage instead of EBS. The problem is that if an instance is terminated, its shards or replicas are lost. How can I set up the configuration properly so that I don't lose data in this scenario?
I've made sample changes to a few storage-related parameters in the configuration files, but I'm not sure it's the right approach, so I've left things as they are to avoid causing any data loss.

ElasticSearch backup and restore

As a PoC we are looking to define a method of backing up and restoring Elasticsearch clusters that are running on AWS EC2 instances. The clusters each have more than one node running on different EC2 instances.
Being new to Elasticsearch, the main method that appears is to use the Elasticsearch snapshot API; however, are there any issues with using AWS Backup as a service to take snapshots of the EC2 instances themselves?
The restoration process would then be to create a new EC2 instance from a specified AMI that is created by the AWS Backup snapshot of the original EC2 instance running Elasticsearch.
You can do that, but it has some drawbacks and it is not recommended.
First, to make a snapshot of any instance, you need to stop your entire Elasticsearch cluster. If, for example, your cluster has 3 nodes, you need to stop all of them before making the snapshots; you can't snapshot just one node, you always need to snapshot the entire cluster at the same moment.
Second, since you are making snapshots of the entire instance, not only the Elasticsearch data, you lose the flexibility of restoring the data somewhere else or restoring just part of it; you need to restore everything. Also, if you make snapshots every day at 23:00 and for some reason need to restore at 17:00 the next day, everything stored after the last snapshot will be lost.
And third, even if you take those precautions, there is no guarantee that you will not end up with problems or corrupted data.
As per the documentation:
The only reliable way to back up a cluster is by using the snapshot and restore functionality
Since you are using AWS, the best approach would be to use an S3 repository for your snapshots and automate your backups using snapshot lifecycle management (SLM) in Kibana.
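As a rough sketch of what that setup looks like behind the Kibana UI, the same repository and SLM policy can be created through the REST API. The bucket name, policy name, and schedule below are placeholders, and S3 repository support (the repository-s3 plugin, or the built-in support in recent versions) is assumed to be installed on every node.

```python
# Hedged sketch: register an S3 snapshot repository and a nightly SLM policy.
import requests

ES_URL = "http://localhost:9200"   # hypothetical coordinating node

# 1. Register the S3 repository
requests.put(
    f"{ES_URL}/_snapshot/my_s3_repository",
    json={"type": "s3", "settings": {"bucket": "my-es-snapshots"}},
).raise_for_status()

# 2. Create a nightly snapshot lifecycle policy with retention
requests.put(
    f"{ES_URL}/_slm/policy/nightly-snapshots",
    json={
        "schedule": "0 30 1 * * ?",          # every day at 01:30
        "name": "<nightly-snap-{now/d}>",
        "repository": "my_s3_repository",
        "config": {"indices": ["*"]},
        "retention": {"expire_after": "30d", "min_count": 5, "max_count": 50},
    },
).raise_for_status()
```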

Clone production cluster

I'm new to Elasticsearch and would like to know the best way to do this.
Basically, I'm trying to clone a production cluster and then use it for testing.
It needs to be a complete copy and must not interrupt the production cluster.
I thought about adding a new node to the production cluster, increasing the number of replicas, then separating that node and renaming it as a new cluster.
Is there a better way?
That's one way to do it, though not my personal preference because I wouldn't want to risk mixing test and production concerns, and you'd have the increased number of replicas to deal with after the fact.
You could also use the Snapshot and Restore APIs to take a snapshot and then restore it into your testing cluster. Just set your test cluster to readonly for the repository, then you can take snapshots from production and load them into test at will.
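A minimal sketch of that flow might look like the following: production writes snapshots into a shared repository, and the test cluster registers the same repository as readonly and restores from it. The host names, repository location, and snapshot name are placeholders, and the repository is assumed to already be registered on the production side.

```python
# Hedged sketch: snapshot on production, restore on the test cluster.
import requests

PROD = "http://prod-es:9200"   # hypothetical production coordinating node
TEST = "http://test-es:9200"   # hypothetical test cluster

# On production: take a snapshot into the shared repository
# (assumes "shared_repo" is already registered there).
requests.put(
    f"{PROD}/_snapshot/shared_repo/snap-for-testing?wait_for_completion=true"
).raise_for_status()

# On test: register the same repository as readonly, then restore from it.
requests.put(
    f"{TEST}/_snapshot/shared_repo",
    json={"type": "fs", "settings": {"location": "/mnt/es-snapshots", "readonly": True}},
).raise_for_status()
requests.post(
    f"{TEST}/_snapshot/shared_repo/snap-for-testing/_restore",
    json={"indices": "*", "include_global_state": False},
).raise_for_status()
```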

What's the easiest way of moving Elastic Search data between servers

I've got Elasticsearch v6.1.0 installed on a Windows machine and a CentOS 7 machine. The goal is to migrate data from the Windows machine to the CentOS 7 machine.
Since they both run the same ES version, I simply copied the "data" folder from machine A to machine B. When I checked the cluster health, the status was red and active_primary_shards was 0, so I reversed the changes I made.
What other methods are there? Can the Snapshot/Restore method be used for this purpose? I thought it was for migrating between different versions.
So the question is: what's the best/easiest method for moving data between two servers with the same ES version?
Using snapshot/restore
You can perfectly use snapshot/restore for this task as long as you have a shared file system or a single-node cluster. The shared FS should meet the following criteria:
In order to register the shared file system repository it is necessary to mount the same shared filesystem to the same location on all master and data nodes.
So it's not a problem if you have a single-node cluster. In this case just make a snapshot and copy it over to the other machine.
It might be a challenging task, though, if you have many nodes running.
You may use one of the supported repository plugins for S3, HDFS, and other cloud storage.
The advantage of this approach is that the data and the indices are snapshotted entirely.
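For the single-node case described above, a rough sketch of the whole flow might look like the following. The host names, repository paths, and snapshot name are placeholders, and `path.repo` is assumed to include the repository location in elasticsearch.yml on both machines.

```python
# Hedged sketch: snapshot on the source node, copy the repository directory,
# restore on the target node.
import requests

SOURCE = "http://windows-host:9200"   # hypothetical source (Windows) node
TARGET = "http://centos-host:9200"    # hypothetical target (CentOS 7) node

# On the source node: register a filesystem repository and take a snapshot.
requests.put(
    f"{SOURCE}/_snapshot/migration_repo",
    json={"type": "fs", "settings": {"location": "C:/es-snapshots"}},
).raise_for_status()
requests.put(
    f"{SOURCE}/_snapshot/migration_repo/snap-1?wait_for_completion=true"
).raise_for_status()

# Copy the snapshot directory to the target machine (scp, rsync, etc.),
# then on the target node: register a repository over the copied directory
# and restore the snapshot.
requests.put(
    f"{TARGET}/_snapshot/migration_repo",
    json={"type": "fs", "settings": {"location": "/var/es-snapshots"}},
).raise_for_status()
requests.post(
    f"{TARGET}/_snapshot/migration_repo/snap-1/_restore",
    json={"indices": "*"},
).raise_for_status()
```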
Using _reindex API
It might be easier to use the _reindex API to transfer data from one ES cluster to another. There is a special Reindex from Remote mode that allows exactly this use case.
What reindex actually does is a scroll on the source index and a lot of bulk inserts into the target index (which can be remote).
There are a couple of issues you should take care of (see the sketch after this list):
setting up the target index yourself (reindex copies no mappings and no settings)
if some fields on the source index are excluded from _source then their contents won't be copied to the target index
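A minimal sketch of Reindex from Remote, under the same assumptions as above (hypothetical host and index names), might look like this. The source host must be whitelisted via `reindex.remote.whitelist` in elasticsearch.yml on the target cluster.

```python
# Hedged sketch: the target cluster pulls documents from the source cluster
# over HTTP using Reindex from Remote.
import requests

TARGET = "http://centos-host:9200"   # hypothetical target cluster

# Create the target index with the mappings/settings you need first,
# because _reindex does not copy them.
requests.put(
    f"{TARGET}/my-index",
    json={"settings": {"number_of_shards": 1}},
).raise_for_status()

# Pull the data from the remote (source) cluster.
resp = requests.post(
    f"{TARGET}/_reindex",
    json={
        "source": {
            "remote": {"host": "http://windows-host:9200"},
            "index": "my-index",
        },
        "dest": {"index": "my-index"},
    },
)
resp.raise_for_status()
print(resp.json())   # e.g. how many documents were created
```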
Summing up
For snapshot/restore
Pros:
all data and the indices are saved/restored as they are
2 calls to the ES API are needed
Cons:
if the cluster has more than 1 node, you need to set up a shared FS or use some cloud storage
For _reindex
Pros:
Works for cluster of any size
Data is copied directly (no intermediate storage required)
1 call to the ES API is needed
Cons:
Data excluded from _source will be lost
Here's also a similar SO question from some three years ago.
Hope that helps!

Setting up a single backup node for an elasticsearch cluster?

Given an Elasticsearch cluster with several machines, I want to have a single machine (a special node) located in a different geographical region that can effectively sync with the cluster for read-only purposes (i.e. no writes go to the special node, and that node should be able to handle all queries on its own). Is this possible, and how can it be done?
With Elasticsearch 1.0 (currently available as RC1) you can use the snapshot & restore API; have a look at this blog post too to learn more.
You can basically make a snapshot of your indices, then copy the snapshot over to the secondary location and restore it into a different cluster. The nice part is that snapshots are incremental, which means that only the files that have changed since the last snapshot are actually backed up. You can then create snapshots at regular intervals, and import them into the secondary cluster.
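The regular-interval import on the secondary cluster could, as a rough sketch, amount to listing the snapshots in the shared repository and restoring the most recent one. The repository name and host below are placeholders, and on many versions you would need to close or delete existing indices before restoring over them.

```python
# Hedged sketch: restore the newest snapshot on the secondary (read-only) cluster.
import requests

SECONDARY = "http://secondary-region-es:9200"   # hypothetical secondary cluster
REPO = "backup_repo"                            # hypothetical repository name

# List all snapshots in the repository and pick the newest one.
snaps = requests.get(f"{SECONDARY}/_snapshot/{REPO}/_all").json()["snapshots"]
latest = max(snaps, key=lambda s: s["start_time_in_millis"])["snapshot"]

# Restore it on the secondary cluster.
requests.post(
    f"{SECONDARY}/_snapshot/{REPO}/{latest}/_restore",
    json={"indices": "*", "include_global_state": False},
).raise_for_status()
```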
If you are not using 1.0 yet, I would suggest having a look at it; snapshot & restore is a great addition. You can still make backups manually and restore them with 0.90, but you don't have a nice API to do that and you need to do pretty much everything manually.
