Elasticsearch - Migrate index from Windows machine to Linux machine - elasticsearch

We are currently running ES on an in-house Windows Server 2012 R2 machine. It holds 20 million documents in total, with an index size of 12 GB.
We have now been asked to migrate from the Windows server to a Linux server, so I am looking for a reliable method to ship the index data from the Windows machine to the Linux machine. Can anyone please suggest the best approach?
Thanks.

Don't copy the data directory! Choose a supported path:
1. CCR - the easiest and fastest option if both clusters have a Platinum license
2. Snapshot via FS/S3 - a good option if you already have snapshots in place, especially with S3 as storage, since you don't need to copy the snapshot to the new nodes or mount a share on all data nodes in both clusters. This is also fast because nothing is reindexed in the destination cluster; it's just a restore of shards, and probably the second-best approach in terms of speed.
3. Reindex from remote - comes with the overhead of reindexing the docs, but also works across different Elasticsearch versions. Try this if you want a simple way or need to move to a newer major version.
4. Logstash with elasticsearch input and output - same as 3.) but with Logstash in between. An easy path if you want to modify the docs while copying
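As a rough illustration of the reindex-from-remote option, the sketch below only builds the JSON body that would be sent as POST /_reindex to the new (Linux) cluster. The host name and index names are made-up placeholders, and it assumes the old host has been allowed via reindex.remote.whitelist in the new cluster's elasticsearch.yml.

```python
import json


def build_remote_reindex_body(source_host, source_index, dest_index, batch_size=1000):
    """Build the body for POST /_reindex, to be run on the destination cluster.

    source_host must include scheme and port, e.g. "http://old-host:9200".
    batch_size is the number of documents pulled per scroll batch.
    """
    return {
        "source": {
            "remote": {"host": source_host},
            "index": source_index,
            "size": batch_size,
        },
        "dest": {"index": dest_index},
    }


# Hypothetical host and index names, for illustration only.
body = build_remote_reindex_body(
    "http://windows-es.example.com:9200", "my_index", "my_index"
)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted into Kibana Dev Tools or sent with any HTTP client. The target index should exist with the right mappings before the call, since reindex does not copy them.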
Good luck!

Related

Archive old data from Elasticsearch to Google Cloud Storage

I have an Elasticsearch server installed on a Google Compute Engine instance. A huge amount of data is being ingested every minute and the underlying disk fills up pretty quickly.
I understand we can increase the size of the disks but this would cost a lot for storing the long term data.
We need 90 days of data in the Elasticsearch server (Compute engine disk) and data older than 90 days (till 7 years) to be stored in Google Cloud Storage Buckets. The older data should be retrievable in case needed for later analysis.
One way I know is to take snapshots frequently and delete the indices older than 90 days from Elasticsearch server using Curator. This way I can keep the disks free and minimize the storage cost.
Is there any other way this can be done without manually automating the above-mentioned idea?
For example, something provided by Elasticsearch out of the box that archives data older than 90 days by itself and keeps the data files on disk, so that we can then manually move those files from the disk to Google Cloud Storage.
There is no other way around it. To back up your data you need to use the snapshot/restore API; it is the only safe and reliable option available.
There is a plugin to use google cloud storage as a repository.
If you are using version 7.5+ and Kibana with the Basic license, you can configure snapshots directly from the Kibana interface. If you are on an older version or do not have Kibana, you will need to rely on Curator or a custom script run from a cron scheduler.
While you can copy the data directory, you would need to stop your entire cluster every time you want to copy the data, and to restore it you would also need to create a new cluster from scratch every time. This is a lot of work and not practical when you have something like the snapshot/restore API.
Look into Snapshot Lifecycle Management and Index Lifecycle Management. They are available with a Basic license.
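To make the SLM/ILM suggestion a bit more concrete, here is a minimal sketch that builds the two policy bodies: one for PUT /_slm/policy/<policy-id> and one for PUT /_ilm/policy/<policy-name>. It assumes time-based indices, a snapshot repository already registered on top of the GCS plugin, and a version with SLM retention (7.5+). The repository name "gcs_archive", the schedule and the index pattern are placeholders, not values from the question.

```python
import json


def build_slm_policy(repository, keep_days, indices=("*",)):
    """Body for PUT /_slm/policy/<policy-id>: one snapshot per day into `repository`."""
    return {
        "schedule": "0 30 1 * * ?",        # Elasticsearch cron: every day at 01:30
        "name": "<daily-snap-{now/d}>",    # date-math snapshot name
        "repository": repository,          # assumed to be registered already
        "config": {"indices": list(indices)},
        "retention": {"expire_after": f"{keep_days}d"},
    }


def build_ilm_delete_policy(max_age_days):
    """Body for PUT /_ilm/policy/<policy-name>: delete indices older than max_age_days."""
    return {
        "policy": {
            "phases": {
                "delete": {
                    "min_age": f"{max_age_days}d",
                    "actions": {"delete": {}},
                }
            }
        }
    }


# Roughly 7 years of snapshots in the bucket, 90 days of indices on disk.
print(json.dumps(build_slm_policy("gcs_archive", 7 * 365), indent=2))
print(json.dumps(build_ilm_delete_policy(90), indent=2))
```

With this split, ILM keeps the disks small while the snapshots in the bucket remain restorable until SLM retention expires them.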

What's the easiest way of moving Elastic Search data between servers

I've got Elastic Search v6.1.0 installed on Windows and Centos7 machines. The goal is to migrate data from Win to Centos7 machine.
Since they both have the same ES version, I simply dragged "data" folder from machine A to B. When I checked its health, its status was red and active_primary_shards was 0. So I reversed the changes I made.
What other methods are there? Can Snapshot/Restore method be used for this purpose? I think it's for migrating between different versions.
So the question is, what's the best/easiest method for moving data between 2 servers with same ES versions?
Using snapshot/restore
You can certainly use snapshot/restore for this task, as long as you have a shared file system or a single-node cluster. The shared FS should meet the following criteria:
In order to register the shared file system repository it is necessary
to mount the same shared filesystem to the same location on all master
and data nodes.
So it's not a problem if you have a single-node cluster. In this case just make a snapshot and copy it over to the other machine.
It might be a challenging task, though, if you have many nodes running.
You may use one of the supported plugins for S3, HDFS and other cloud storage services.
The advantage of this approach is that the data and the indices are snapshotted entirely.
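For the single-node case, the sequence of calls might look roughly like the sketch below. It only lists the requests (cluster, method, path, body) rather than sending them. The repository name, path and index names are hypothetical, and the location is assumed to be listed under path.repo in elasticsearch.yml on both machines.

```python
def snapshot_migration_steps(repo, location, snapshot, indices):
    """List the REST calls for a filesystem-repository migration.

    Between the snapshot and the second registration, the repository
    directory has to be copied from the source to the destination machine.
    """
    repo_body = {"type": "fs", "settings": {"location": location}}
    selection = {"indices": indices, "include_global_state": False}
    return [
        {"cluster": "source", "method": "PUT",
         "path": f"/_snapshot/{repo}", "body": repo_body},
        {"cluster": "source", "method": "PUT",
         "path": f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true",
         "body": selection},
        {"cluster": "destination", "method": "PUT",
         "path": f"/_snapshot/{repo}", "body": repo_body},
        {"cluster": "destination", "method": "POST",
         "path": f"/_snapshot/{repo}/{snapshot}/_restore",
         "body": selection},
    ]


# Placeholder names; `indices` is a comma-separated list of index names.
for step in snapshot_migration_steps("migration", "/mnt/es_backup", "snap_1", "my_index"):
    print(step["cluster"], step["method"], step["path"])
```

Registering the repository is a one-time setup on each side; the actual work is the one snapshot call and the one restore call.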
Using _reindex API
It might be easier to use _reindex API to transfer data from one ES cluster to another. There is a special Reindex from Remote mode that allows exactly this use case.
What reindex actually does is a scroll on the source index and a lot of bulk inserts to the target index (which can be remote).
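To illustrate the "bulk inserts" half of that description, here is a small sketch (not the actual reindex implementation) that turns one page of scroll hits into the newline-delimited body expected by the _bulk endpoint. The sample hits are invented. The doc_type parameter is there because on 6.x the action line also needs a _type unless the type is given in the URL.

```python
import json


def hits_to_bulk_ndjson(hits, dest_index, doc_type=None):
    """Convert search/scroll hits into an NDJSON body for POST /_bulk."""
    lines = []
    for hit in hits:
        meta = {"_index": dest_index, "_id": hit["_id"]}
        if doc_type is not None:
            meta["_type"] = doc_type          # needed on pre-7.x clusters
        lines.append(json.dumps({"index": meta}))   # action line
        lines.append(json.dumps(hit["_source"]))    # document line
    # The bulk body must end with a newline; an empty page yields no body.
    return "\n".join(lines) + "\n" if lines else ""


# Invented sample page of hits, shaped like hits.hits in a search response.
page = [
    {"_id": "1", "_source": {"title": "first"}},
    {"_id": "2", "_source": {"title": "second"}},
]
print(hits_to_bulk_ndjson(page, "target_index", doc_type="doc"), end="")
```

Only _source is carried over here, which is why fields excluded from _source cannot be copied this way.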
There are a couple of issues you should take care of:
setting up the target index (no mapping, no settings will be set by reindex)
if some fields on the source index are excluded from _source then their contents won't be copied to the target index
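One way to handle the first point is to derive the target index definition from the source index. The sketch below takes the per-index part of a GET /<index> response and produces a body for PUT /<target-index>. The list of settings it strips is an assumption about which ones are generated by Elasticsearch and rejected on creation; it may need adjusting for your version. The sample definition is invented.

```python
import copy

# Assumed list of index settings that are generated internally and
# cannot be supplied when creating an index.
NON_COPYABLE_SETTINGS = ("uuid", "version", "creation_date", "provided_name")


def build_create_index_body(source_definition):
    """Build a create-index body from one entry of a GET /<index> response."""
    settings = copy.deepcopy(source_definition.get("settings", {}))
    index_settings = settings.get("index", {})
    for key in NON_COPYABLE_SETTINGS:
        index_settings.pop(key, None)
    return {
        "settings": settings,
        "mappings": copy.deepcopy(source_definition.get("mappings", {})),
    }


# Invented example of what the source cluster might return for one index.
source = {
    "aliases": {},
    "mappings": {"properties": {"title": {"type": "text"}}},
    "settings": {"index": {
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "abc123",
        "version": {"created": "6010099"},
        "creation_date": "1514764800000",
        "provided_name": "my_index",
    }},
}
print(build_create_index_body(source))
```

Creating the target index with this body before calling _reindex keeps field types and shard settings in line with the source.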
Summing up
For snapshot/restore
Pros:
all data and the indices are saved/restored as they are
2 calls to the ES API are needed
Cons:
if the cluster has more than 1 node, you need to set up a shared FS or use some cloud storage
For _reindex
Pros:
Works for cluster of any size
Data is copied directly (no intermediate storage required)
1 call to the ES API is needed
Cons:
Data excluded from _source will be lost
Here's also a similar SO question from some three years ago.
Hope that helps!

Elasticsearch Shard Location

I am trying to set up an Elasticsearch cluster and have a question that's bothering me. I am transitioning from MarkLogic to Elasticsearch and am used to storing data on a different disk from the one where my software (i.e. MarkLogic) is installed. I know how to do it in MarkLogic, but somehow cannot find anything about this for Elasticsearch. Can anyone point me to a document that can help me configure my shards on a different machine where Elasticsearch is not installed?
Thanks,
S.
You simply need to change the path.data setting in your elasticsearch.yml configuration file:
path:
  data:
    - /mnt/hda1
    - /mnt/hda2
    - /mnt/hda3
You can use a single location or several and when you do, ES will store your index data on those locations. Note that data pertaining to a given shard will always be located at the same path location.

elasticsearch snapshot vs elasticdump

I have a very slow internet connection and a server that is running Elasticsearch. I am looking at having a local, read-only version of the Elasticsearch indices with a local Kibana instance, as I don't need the data to be live. I know there are 3 ways of doing this: making my local machine a node in the ES cluster, taking a snapshot and transferring it, or using elasticdump and transferring the file. I understand the issues with adding my local machine as a node, but I don't understand the difference between a snapshot and elasticdump.
What is the difference between a snapshot and elasticdump? What are the advantages and disadvantages of each?
elasticdump will simply scan one index in your remote ES cluster and dump the JSON data into a file, which it can then replay to rebuild the index in the same or some other ES instance (remote or local).
elasticdump can also store the data it pumps from your remote ES directly into your local instance (instead of storing the data into a file).
Snapshot/restore is the official way of backing up your index data. There are various targets (filesystem, S3, etc.), but the main idea is that you take a first snapshot and then all subsequent snapshots are incremental, i.e. the snapshot process only stores what has changed since the last run.
In your case, you can go either way, but using elasticdump is straightforward if all you want to do is to have a local copy of your production data.
Another option we sometimes use successfully is autossh, which keeps the connection alive and opens an SSH tunnel to the remote Elasticsearch nodes.
autossh -M 30010 -f user@remote.example.com -L 9200:localhost:9200 -N
Depending on your security policies and environment, this works really well for accessing live data remotely even with poor connectivity.

Strategy to persist the node's data for dynamic Elasticsearch clusters

I'm sorry that this is probably a rather broad question, but I haven't found a solution for this problem yet.
I try to run an Elasticsearch cluster on Mesos through Marathon with Docker containers. Therefore, I built a Docker image that can start on Marathon and dynamically scale via either the frontend or the API.
This works great for test setups, but the question remains how to persist the data so that if either the cluster is scaled down (I know this is also about the index configuration itself) or stopped, and I want to restart later (or scale up) with the same data.
The thing is that Marathon decides where (on which Mesos slave) the nodes are run, so from my point of view it's not predictable whether all the data is available to the "new" nodes upon restart when I try to persist the data to the Docker hosts via Docker volumes.
The only things that come to my mind are:
Using a distributed file system like HDFS or NFS, with mounted volumes either on the Docker host or the Docker images themselves. Still, that would leave the question how to load all data during the new cluster startup if the "old" cluster had for example 8 nodes, and the new one only has 4.
Using the Snapshot API of Elasticsearch to save to a common drive somewhere in the network. I assume that this will have performance penalties...
Is there any other way to approach this? Are there any recommendations? Unfortunately, I didn't find a good resource on this kind of topic. Thanks a lot in advance.
Elasticsearch and NFS are not the best of pals ;-). You don't want to run your cluster on NFS, it's much too slow and Elasticsearch works better when the speed of the storage is better. If you introduce the network in this equation you'll get into trouble. I have no idea about Docker or Mesos. But for sure I recommend against NFS. Use snapshot/restore.
The first snapshot will take some time, but the rest of the snapshots should take less space and less time. Also, note that "incremental" means incremental at file level, not document level.
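The file-level point can be illustrated with a toy model (this is not how the snapshot code is actually written, and it compares files by name only): a snapshot uploads whichever segment files are not yet in the repository. The file names below are invented.

```python
def files_to_upload(repository_files, current_segment_files):
    """Toy model: segment files present in the shard but missing from the repository."""
    return sorted(set(current_segment_files) - set(repository_files))


repo = set()

# First snapshot: nothing is in the repository yet, so everything is copied.
first = files_to_upload(repo, ["_0.cfs", "_1.cfs"])
repo.update(first)

# Second snapshot: one new segment was written since the last run.
second = files_to_upload(repo, ["_0.cfs", "_1.cfs", "_2.cfs"])
repo.update(second)

# Third snapshot: a merge rewrote all segments into one new file, so that
# whole file is copied even though no documents were added.
third = files_to_upload(repo, ["_3.cfs"])

print(first, second, third)
```

This is why a snapshot taken right after a large merge can be much bigger than the number of changed documents would suggest.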
The snapshot itself needs all the nodes that hold the primaries of the indices you want snapshotted. Those nodes all need access to the common location (the repository) so that they can write to it. This requirement of common access to the same location is usually not that obvious, which is why I'm mentioning it.
The best way to run Elasticsearch on Mesos is to use a specialized Mesos framework. The first effort in this area is https://github.com/mesosphere/elasticsearch-mesos. There is a more recent project which is, AFAIK, currently under development: https://github.com/mesos/elasticsearch. I don't know what its status is, but you may want to give it a try.
