Archive old data from Elasticsearch to Google Cloud Storage - elasticsearch

I have an Elasticsearch server installed on a Google Compute Engine instance. A huge amount of data is ingested every minute and the underlying disk fills up pretty quickly.
I understand we can increase the size of the disks but this would cost a lot for storing the long term data.
We need 90 days of data in the Elasticsearch server (Compute Engine disk) and data older than 90 days (up to 7 years) stored in Google Cloud Storage buckets. The older data should be retrievable in case it is needed for later analysis.
One way I know is to take snapshots frequently and delete the indices older than 90 days from Elasticsearch server using Curator. This way I can keep the disks free and minimize the storage cost.
Is there any other way this can be done without manually automating the above-mentioned idea?
For example, is there something Elasticsearch provides out of the box that archives data older than 90 days by itself and keeps the data files on disk, so that we can then manually move those files from the disk to Google Cloud Storage?

There is no way around it: to back up your data you need to use the snapshot/restore API. It is the only safe and reliable option available.
There is a plugin to use google cloud storage as a repository.
If you are using version 7.5+ and Kibana with the Basic license, you can configure snapshots directly from the Kibana interface. If you are on an older version or do not have Kibana, you will need to rely on Curator or a custom script run by a cron scheduler.
While you can copy the data directory, you would need to stop your entire cluster every time you want to copy the data, and to restore it you would also need to create a new cluster from scratch every time. This is a lot of work and not practical when you have something like the snapshot/restore API.
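For readers who take the custom-script route, here is a minimal Python sketch of the logic such a script might contain. It only builds the repository registration request and picks deletion candidates; it does not talk to a cluster. The repository name `gcs_archive`, the bucket `my-es-archive`, and the `logs-YYYY.MM.DD` index naming are assumptions for illustration, and the `gcs` repository type assumes the Google Cloud Storage repository plugin is installed and has credentials configured.

```python
from datetime import date, datetime, timedelta


def gcs_repository_request(repo, bucket, base_path):
    """Build the request that registers a GCS bucket as a snapshot repository."""
    body = {"type": "gcs", "settings": {"bucket": bucket, "base_path": base_path}}
    return ("PUT", f"/_snapshot/{repo}", body)


def indices_older_than(index_names, today, days=90, prefix="logs-", fmt="%Y.%m.%d"):
    """Return the date-suffixed indices whose date is more than `days` days before `today`.

    The prefix and date format are hypothetical; adjust them to your own index names.
    """
    cutoff = today - timedelta(days=days)
    old = []
    for name in index_names:
        if not name.startswith(prefix):
            continue  # not one of the time-based indices
        try:
            day = datetime.strptime(name[len(prefix):], fmt).date()
        except ValueError:
            continue  # suffix is not a date, leave the index alone
        if day < cutoff:
            old.append(name)
    return old


if __name__ == "__main__":
    print(gcs_repository_request("gcs_archive", "my-es-archive", "elasticsearch"))
    names = ["logs-2024.01.01", "logs-2024.03.31", ".kibana"]
    print(indices_older_than(names, today=date(2024, 4, 1)))
```

A script built on this should delete an index only after the snapshot that contains it has completed successfully.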

Look into Snapshot Lifecycle Management and Index Lifecycle Management. They are available with a Basic license.
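As a rough illustration of how the two features could fit the 90-day / 7-year requirement, the sketch below builds one SLM policy and one ILM policy as plain request tuples. The policy names, the `logs-*` pattern, the schedule, and the retention values are assumptions, not settings taken from the question; `2555d` is simply 7 x 365 days.

```python
def slm_policy_request(policy_id, repository, indices, expire_after="2555d"):
    """Build a PUT _slm/policy request: nightly snapshots kept for about 7 years."""
    body = {
        "schedule": "0 30 1 * * ?",          # cron expression: every day at 01:30
        "name": "<daily-snap-{now/d}>",      # date-math snapshot name
        "repository": repository,
        "config": {"indices": indices, "include_global_state": False},
        "retention": {"expire_after": expire_after},
    }
    return ("PUT", f"/_slm/policy/{policy_id}", body)


def ilm_delete_policy_request(policy_name, min_age="90d"):
    """Build a PUT _ilm/policy request that deletes indices once they reach min_age."""
    body = {
        "policy": {
            "phases": {
                "delete": {"min_age": min_age, "actions": {"delete": {}}}
            }
        }
    }
    return ("PUT", f"/_ilm/policy/{policy_name}", body)


if __name__ == "__main__":
    print(slm_policy_request("nightly-archive", "gcs_archive", ["logs-*"]))
    print(ilm_delete_policy_request("logs-90d"))
```

The ILM policy only takes effect on indices that reference it, typically through an index template, and the snapshot schedule has to run often enough that every index is captured before ILM deletes it.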

Related

How to take snapshots of specific indices in Elastic Cloud Enterprise?

In the Elastic Cloud UI, you can take snapshots/backups of your entire on-disk data and store them in shared storage, say, an S3 object store.
How do I back up only certain indices instead of all of them, using only the Elastic Cloud UI? Is there a way?
If not, then and only then do I want to go with the APIs.
If you follow the link to the Elasticsearch Service docs for Snapshot and Restore, you will see that we also link to the Elasticsearch Snapshot and Restore docs. There you will find instructions to back up certain indices. You can use the API console to do this more easily through the Elastic Cloud UI.
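The request you would paste into the API console might look like the one this small sketch produces. The repository name `my_repository`, the snapshot name, and the index names are placeholders; use the repository that your deployment actually has configured.

```python
import json


def snapshot_indices_request(repo, snapshot, indices):
    """Build a create-snapshot request limited to the given indices."""
    body = {
        "indices": ",".join(indices),      # comma-separated list, wildcards allowed
        "ignore_unavailable": True,        # skip indices that do not exist
        "include_global_state": False,     # leave cluster-wide state out of the snapshot
    }
    return ("PUT", f"/_snapshot/{repo}/{snapshot}", body)


if __name__ == "__main__":
    method, path, body = snapshot_indices_request(
        "my_repository", "orders-2024-04-01", ["orders", "customers"]
    )
    # Print in the "METHOD path" + JSON body form that the console accepts.
    print(method, path)
    print(json.dumps(body, indent=2))
```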

Elasticsearch - Migrate Index from Windows machine to Linux machine

We are currently running ES on a Windows 2012 R2 server machine (in-house) and it has a total of 20 million documents with an index size of 12 GB.
Now we have been asked to migrate our Windows server to a Linux server. In order to do that, I am looking for a reliable method to ship index data from the Windows machine to the Linux machine. Can anyone please suggest the best approach?
Thanks.
Don't copy the data directory! Choose a supported path:
1. CCR - the easiest and fastest option if both clusters have a Platinum license
2. Snapshot via FS/S3 - a good option if you already have snapshots in place, especially with S3 as storage, since you don't need to copy the snapshot to the new nodes or mount it on all data nodes in both clusters. It is also fast because nothing is reindexed in the destination cluster: it's just a restore of shards, and probably the second-best approach in terms of speed.
3. Reindex from remote - comes with the overhead of reindexing the docs, but also works across different Elasticsearch versions. Try this if you want a simple way or need to move to a newer major version.
4. Logstash with elasticsearch input and output - same as 3, but with Logstash in between. An easy path if you want to modify the docs while copying
Good luck!

Is there any way to restore Elasticsearch snapshots apart from using the Elasticsearch restore API?

My company wants to use an existing Elasticsearch snapshot repository (several hundred gigabytes in size) to obtain the original documents and store them elsewhere. I must state that the snapshots were taken using the Elasticsearch snapshot API.
My company is somehow reluctant to use Elasticsearch to restore the snapshots, as they fear that would involve creating a new Elasticsearch cluster that would consume considerable resources. So far, I have not seen any other way to restore the snapshots than to use Elasticsearch, but, given my company's insistence, I ask here: is there any other tool that I could use to restore said snapshots? Thank you in advance for any help resolving this issue.
What I would do in your shoes is to spin up a local cluster and restore the existing snapshot into it (here is the relevant Elastic documentation: Restoring to a different cluster). Then, from there, I would either export the data by using the Kibana Reporting plugin (https://www.elastic.co/what-is/kibana-reporting), or by writing a Logstash pipeline to export the data from the local cluster to - say - a CSV file.
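If a small script is preferred over Logstash for the export step, the conversion itself is short. The sketch below assumes you have already paged through the restored index and collected the hits in the shape the search API returns them (each hit carries its document under `_source`); the field names in the example are made up.

```python
import csv
import io


def hits_to_csv(hits, fields):
    """Render search hits as CSV text, one row per document.

    Only the listed fields are written; fields missing from a document
    become empty cells, and extra fields are ignored.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for hit in hits:
        writer.writerow(hit.get("_source", {}))
    return buf.getvalue()


if __name__ == "__main__":
    sample_hits = [
        {"_id": "1", "_source": {"user": "alice", "status": "active", "age": 31}},
        {"_id": "2", "_source": {"user": "bob"}},
    ]
    print(hits_to_csv(sample_hits, ["user", "status"]))
```

For several hundred gigabytes you would write each page of hits to the output file as it arrives rather than holding everything in memory.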

What's the easiest way of moving Elastic Search data between servers

I've got Elastic Search v6.1.0 installed on Windows and Centos7 machines. The goal is to migrate data from Win to Centos7 machine.
Since they both have the same ES version, I simply dragged "data" folder from machine A to B. When I checked its health, its status was red and active_primary_shards was 0. So I reversed the changes I made.
What other methods are there? Can the Snapshot/Restore method be used for this purpose? I thought it was for migrating between different versions.
So the question is, what's the best/easiest method for moving data between 2 servers with same ES versions?
Using snapshot/restore
You can certainly use snapshot/restore for this task as long as you have a shared file system or a single-node cluster. The shared FS should meet the following criteria:
In order to register the shared file system repository it is necessary
to mount the same shared filesystem to the same location on all master
and data nodes.
So it's not a problem if you have a single-node cluster. In this case just make a snapshot and copy it over to the other machine.
It might be a challenging task, though, if you have many nodes running.
You may use one of the supported plugins for S3, HDFS and other cloud storage services.
The advantage of this approach is that the data and the indices are snapshotted entirely.
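For the single-node case, the sequence of calls can be sketched as follows. The repository name, snapshot name, and location are placeholders, and the sketch assumes the location is listed under `path.repo` in `elasticsearch.yml` on each machine and that the snapshot directory is copied between the two steps.

```python
def fs_migration_requests(repo, location, snapshot):
    """Return the API calls to run on the source and on the target cluster."""
    register = ("PUT", f"/_snapshot/{repo}",
                {"type": "fs", "settings": {"location": location}})
    create = ("PUT", f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true", None)
    restore = ("POST", f"/_snapshot/{repo}/{snapshot}/_restore", None)
    return {
        # Machine A: register the repository, then take the snapshot.
        "source": [register, create],
        # Machine B (after copying the directory): register it again, then restore.
        "target": [register, restore],
    }


if __name__ == "__main__":
    plan = fs_migration_requests("migration", "/mnt/es_backup", "snap_1")
    for side, calls in plan.items():
        for method, path, body in calls:
            print(side, method, path, body)
```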
Using _reindex API
It might be easier to use _reindex API to transfer data from one ES cluster to another. There is a special Reindex from Remote mode that allows exactly this use case.
What reindex actually does is a scroll on the source index and a lot of bulk inserts to the target index (which can be remote).
There are a couple of issues you should take care of:
setting up the target index (no mapping, no settings will be set by reindex)
if some fields on the source index are excluded from _source then their contents won't be copied to the target index
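Taking the first issue into account, the target index is created before the reindex call. The sketch below builds both requests; the host, index names, and settings are placeholders, the exact shape of the mappings depends on your Elasticsearch version, and the source host has to be allowed through `reindex.remote.whitelist` on the target cluster.

```python
def create_index_request(index, settings, mappings):
    """Build the request that creates the target index up front."""
    return ("PUT", f"/{index}", {"settings": settings, "mappings": mappings})


def reindex_from_remote_request(remote_host, source_index, dest_index):
    """Build a _reindex request that pulls documents from another cluster."""
    body = {
        "source": {"remote": {"host": remote_host}, "index": source_index},
        "dest": {"index": dest_index},
    }
    return ("POST", "/_reindex", body)


if __name__ == "__main__":
    # Both calls are sent to the target cluster, in this order.
    print(create_index_request("products", {"number_of_shards": 1}, {}))
    print(reindex_from_remote_request("http://windows-host:9200", "products", "products"))
```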
Summing up
For snapshot/restore
Pros:
all data and the indices are saved/restored as they are
2 calls to the ES API are needed
Cons:
if the cluster has more than 1 node, you need to set up a shared FS or use some cloud storage
For _reindex
Pros:
Works for cluster of any size
Data is copied directly (no intermediate storage required)
1 call to the ES API is needed
Cons:
Data excluded from _source will be lost
Here's also a similar SO question from some three years ago.
Hope that helps!

Setting up a single backup node for an elasticsearch cluster?

Given an Elasticsearch cluster with several machines, I would like to have a single machine (a special node), located in a different geographical region, that can effectively sync with the cluster for read-only purposes (i.e. no writes to the special node, and the special node should be able to handle all queries on its own). Is this possible, and how can it be done?
With elasticsearch 1.0 (currently available in RC1) you can use the snapshot & restore api; have a look at this blog too to know more.
You can basically make a snapshot of your indices, then copy the snapshot over to the secondary location and restore it into a different cluster. The nice part is that snapshots are incremental, which means that only the files that have changed since the last snapshot are actually backed up. You can then create snapshots at regular intervals, and import them into the secondary cluster.
If you are not using 1.0 yet, I would suggest having a look at it; snapshot & restore is a great addition. You can still make backups and restore them with 0.90, but there is no API for it and you need to do pretty much everything manually.

Resources