elasticsearch snapshot vs elasticdump - elasticsearch

I have a very slow internet connection and have a server that is running Elasticsearch. I am looking at having a local, read only, version of the elastic search indices with a local kabana instance as i dont need the data to be live. I know there are 3 ways of doing this, making my local machine a node in the ES cluster, taking a snapshot and transferring it or using elasticdump and transferring the file. i understand the issues with adding my local as a node but dont understand the difference between a snapshot and elasticdump.
What is the difference between a snapshot and elasticdump? what are the advantages and disadvantages of each?

elasticdump will simply scan one index in your remote ES cluster and will either dump the JSON data into a file it can then replay to rebuild the index in the same or some other ES instance (remote or local).
elasticdump can also store the data it pumps from your remote ES directly into your local instance (instead of storing the data into a file).
Snapshot/restore is the official way of backuping your index data. There are various targets (filesystem, S3, etc), but the main idea is that you do a first snapshot and then all subsequent snapshots will be incremental, i.e. the snapshot process will only store what has changed since the last run.
In your case, you can go either way, but using elasticdump is straightforward if all you want to do is to have a local copy of your production data.

Another option we are sometimes using successfully is using autossh for maintaining connection and opening SSH tunnel between remote Elasticsearch nodes.
autossh -M 30010 -f user#remote.example.com -L 9200:localhost:9200 -N
Depending on your security policies and environment, this works really well for accessing live data remotely even with poor connectivity.

Related

Is there any way to restore Elasticsearch snapshots apart from using the Elasticsearch restore API?

my company wants to use an existing Elasticsearch snapshot repository (consisting of various hundreds of gigabytes) to obtain the original documents and store them elsewhere. I must state that the snapshots have been obtained using the Elasticsearch snapshot API.
My company is somehow reluctant to use Elasticsearch to restore the snapshots, as they fear that would involve creating a new Elasticsearch cluster that would consume considerable resources. So far, I have not seen any other way to restore the snapshots than to use Elasticsearch, but, given my company's insistence, I ask here: is there any other tool that I could use to restore said snapshots? Thank you in advance for any help resolving this issue.
What I would do in your shoes is to spin up a local cluster and restore the existing snapshot into it (here is the relevant Elastic documentation: Restoring to a different cluster). Then, from there, I would either export the data by using the Kibana Reporting plugin (https://www.elastic.co/what-is/kibana-reporting), or by writing a Logstash pipeline to export the data from the local cluster to - say - a CSV file.

Can StreamSets be used to fetch data onto a local system?

Our team is exploring options for HDFS to local data fetch. We were suggested about StreamSets and no one in the team has an idea about it. Could anyone help me to understand if this will fit our requirement that is to fetch the data from HDFS onto our local system?
Just an additional question.
I have setup StreamSets locally. For example on local ip: xxx.xx.x.xx:18630 and it works fine on one machine. But when I try to access this URL from some other machine on the network, it doesn't work. While my other application like Shiny-server etc works fine with the same mechanism.
Yes - you can read data from HDFS to a local filesystem using StreamSets Data Collector's Hadoop FS Standalone origin. As cricket_007 mentions in his answer, though, you should carefully consider if this is what you really want to do, as a single Hadoop file can easily be larger than your local disk!
Answering your second question, Data Collector listens on all addresses by default. There is a http.bindHost setting in the sdc.properties config file that you can use to restrict the addresses that Data Collector listens on, but it is commented out by default.
You can use netstat to check - this is what I see on my Mac, with Data Collector listening on all addresses:
$ netstat -ant | grep 18630
tcp46 0 0 *.18630 *.* LISTEN
That wildcard, * in front of the 18630 in the output means that Data Collector will accept connections on any address.
If you are running Data Collector directly on your machine, then the most likely problem is a firewall setting. If you are running Data Collector in a VM or on Docker, you will need to look at your VM/Docker network config.
I believe by default Streamsets only exposes its services on localhost. You'll need to go through the config files to find where you can set it to listen on external addresses
If you are using the CDH Quickstart VM, you'll need to externally forward that port.
Anyway, StreamSets is really designed to run as a cluster, on dedicated servers, for optimal performance. It's production deployments are comparable to Apache Nifi offered in Hortonworks HDF.
So no, it wouldn't make sense to use the local FS destinations for anything other than testing/evaluation purposes.
If you want HDFS exposed as a local device, look into installing an NFS Gateway. Or you can use Streamsets to write to FTP / NFS, probably.
It's not clear what data you're trying to get, but many BI tools can perform CSV exports or Hue can be used to download files from HDFS. At the very least, hdfs dfs -getmerge is the one minimalist way to get data from HDFS to local, however, Hadoop typically stores many TB worth of data in the ideal cases, and if you're using anything smaller, then dumping those results into a database is typically the better option than moving around flatfiles

What's the easiest way of moving Elastic Search data between servers

I've got Elastic Search v6.1.0 installed on Windows and Centos7 machines. The goal is to migrate data from Win to Centos7 machine.
Since they both have the same ES version, I simply dragged "data" folder from machine A to B. When I checked its health, its status was red and active_primary_shards was 0. So I reversed the changes I made.
What other methods are there? Can Snapshot/Restore method be used for this purpose? I think it's for migrating between different versions.
So the question is, what's the best/easiest method for moving data between 2 servers with same ES versions?
Using snapshot/restore
You can perfectly use snapshot/restore for this task as long as you have a shared file system or a single-node cluster. The shared FS should meet the following criteria:
In order to register the shared file system repository it is necessary
to mount the same shared filesystem to the same location on all master
and data nodes.
So it's not a problem if you have a single-node cluster. In this case just make a snapshot and copy it over to other machine.
It might though be a challenging task if you have many nodes running.
You may use one of the supported plugins for S3, HDFS and other cloud storages.
The advantage of this approach is that the data and the indices are snapshotted entirely.
Using _reindex API
It might be easier to use _reindex API to transfer data from one ES cluster to another. There is a special Reindex from Remote mode that allows exactly this use case.
What reindex actually does is a scroll on the source index and a lot of bulk inserts to the target index (which can be remote).
There are couple of issues you should take care of:
setting up the target index (no mapping, no settings will be set by reindex)
if some fields on the source index are excluded from _source then their contents won't be copied to the target index
Summing up
For snapshot/restore
Pros:
all data and the indices are saved/restored as they are
2 calls to the ES API are needed
Cons:
if cluster has more than 1 node, you need to setup a shared FS or to use some cloud storage
For _reindex
Pros:
Works for cluster of any size
Data is copied directly (no intermediate storage required)
1 call to the ES API is needed
Cons:
Data excluded from _source will be lost
Here's also a similar SO question from some three years ago.
Hope that helps!

how to transfer elastic data from one server to another

How do I move Elasticsearch data from one server to another?
I have server A running Elasticsearch 1.4.2 on one local node with multiple indices. I would like to copy that data to server B running Elasticsearch with the same version. The lucene_version is also same on both the servers.But when I copy all the files to server B data is not migrated it only shows the mappings of all the node. I tried the same procedure on my local computer and it worked perfectly. Am I missing something on the server end?
This can be achieved by multiple ways. The easier and safest way is to create a replica on the new node. Replica can be created by starting a new node on the new server by assigning the same cluster name. (if you have changed other network configurations then you might need to change that also). If you have initialized your index with no replica before then you can change the number of replica online using update settings api
Your cluster will be in yellow state until your datas are in sync.Normal operations won't get affected.
Once your cluster state is in green you can shut down the server you do not wish to have. At this stage your cluster stage will go to yellow again. You can use the update setting to change replica count to 0 / add other nodes to bring cluster state in green state.
This way is recommended only if both your servers are on the same network else data syncing will take lots of time.
Another way is to use snapshot. You can create a snapshot on your old server. Copy the snapshot files from the old server to new server in the same location. On the new server create the same snapshot on the same location. You will find the snapshot file you copied. You can restore it using that. Doing it using command line can be a bit cumbersome. You can use a plugin like kopf which will make taking snapshot and restore as easy as button click.

Splitting a Redis RDB file

Currently I'm using redis on a EC2 machine, with 60G RAM without any slaves, but as my data grows I will need more memory.
I was thinking to migrate to 2 x 60G machines and split the already existing data between the two.
Is there any tool for splitting the RDB file? I haven't found anything specifically designed for this.
If you want to split your data, you will need to have a way to shard your keys so some keys will be written/read from server A and the others from server B
There is no way to split a RDB file, but there is something you can do to achieve what you want.
First what you can do is start a redis instance on your second server and say it is a slave of your current server, but set the param slave-read-only to false. This will cause the slave to synchronize and read all of your redis data from master. So far you only have a slave with all the data, but now we will do the interesting bit.
Then you need to decide on a sharding strategy. Some redis clients do this for you. For example, the official Ruby client knows how to handle that if you configure it. You will need to configure your client so keys will be sharded to A and B (or alternative use twemproxy so the clients won't know about different servers and the twemproxy will take care of it)
Once you have the clients configure, you need to deploy the new clients to production and immediately configure the slave as not a slave anymore. You can do this directly using the CONFIG command on the slave server (don't forget to persist the config using CONFIG REWRITE) or you can change the config file of the slave and restart, whatever is more convenient for you. Since the slave is configured as slave-read-only false, it will accept writes even on slave mode. This means if you change the config directly from the redis-cli you can change from slave to just a sharded stand-alone redis without restarting, which I think is quite cool.
Be aware once you shard, you will have to be careful with MULTI commands or when using LUA scripts. If you are using twemproxy you won't be able to use those commands, but if you are sharding on the client side, you will still be able to use MULTI or LUA. Just be careful to use a sharding mechanism in which all the related keys will stay on the same server.
step1: install https://github.com/leonchen83/redis-rdb-cli/
step2: create a config file to set spliting condition
content of nodes.conf
34b6e1dfb871ad30398ef5edd6b9a954617e6ec1 127.0.0.1:10003#20003 master - 0 1531044047088 3 connected 8193-16383
89d020a7e727e81f003836207902ae26fe05fd51 127.0.0.1:10001#20001 myself,master - 0 1531044047000 1 connected 0-8192
vars currentEpoch 6 lastVoteEpoch 0
step3: run rdt -s your-dump.rdb -c nodes.conf -o /path/to
after step3. that will generate 2 rdb files in /path/to directory 34b6e1dfb871ad30398ef5edd6b9a954617e6ec1.rdb and 89d020a7e727e81f003836207902ae26fe05fd51.rdb

Resources