how to move elasticsearch data from one server to another - elasticsearch

How do I move Elasticsearch data from one server to another?
I have server A running Elasticsearch 1.1.1 on one local node with multiple indices.
I would like to copy that data to server B running Elasticsearch 1.3.4
Procedure so far
Shut down ES on both servers and
scp all the data to the correct data dir on the new server. (data seems to be located at /var/lib/elasticsearch/ on my debian boxes)
change permissions and ownership to elasticsearch:elasticsearch
start up the new ES server
When I look at the cluster with the ES head plugin, no indices appear.
It seems that the data is not loaded. Am I missing something?

The selected answer makes it sound slightly more complex than it is, the following is what you need (install npm first on your system).
npm install -g elasticdump
elasticdump --input=http://mysrc.com:9200/my_index --output=http://mydest.com:9200/my_index --type=mapping
elasticdump --input=http://mysrc.com:9200/my_index --output=http://mydest.com:9200/my_index --type=data
You can skip the first elasticdump command for subsequent copies if the mappings remain constant.
I have just done a migration from AWS to Qbox.io with the above without any problems.
More details over at:
https://www.npmjs.com/package/elasticdump
Help page (as of Feb 2016) included for completeness:
elasticdump: Import and export tools for elasticsearch
Usage: elasticdump --input SOURCE --output DESTINATION [OPTIONS]
--input
Source location (required)
--input-index
Source index and type
(default: all, example: index/type)
--output
Destination location (required)
--output-index
Destination index and type
(default: all, example: index/type)
--limit
How many objects to move in bulk per operation
limit is approximate for file streams
(default: 100)
--debug
Display the elasticsearch commands being used
(default: false)
--type
What are we exporting?
(default: data, options: [data, mapping])
--delete
Delete documents one-by-one from the input as they are
moved. Will not delete the source index
(default: false)
--searchBody
Preform a partial extract based on search results
(when ES is the input,
(default: '{"query": { "match_all": {} } }'))
--sourceOnly
Output only the json contained within the document _source
Normal: {"_index":"","_type":"","_id":"", "_source":{SOURCE}}
sourceOnly: {SOURCE}
(default: false)
--all
Load/store documents from ALL indexes
(default: false)
--bulk
Leverage elasticsearch Bulk API when writing documents
(default: false)
--ignore-errors
Will continue the read/write loop on write error
(default: false)
--scrollTime
Time the nodes will hold the requested search in order.
(default: 10m)
--maxSockets
How many simultaneous HTTP requests can we process make?
(default:
5 [node <= v0.10.x] /
Infinity [node >= v0.11.x] )
--bulk-mode
The mode can be index, delete or update.
'index': Add or replace documents on the destination index.
'delete': Delete documents on destination index.
'update': Use 'doc_as_upsert' option with bulk update API to do partial update.
(default: index)
--bulk-use-output-index-name
Force use of destination index name (the actual output URL)
as destination while bulk writing to ES. Allows
leveraging Bulk API copying data inside the same
elasticsearch instance.
(default: false)
--timeout
Integer containing the number of milliseconds to wait for
a request to respond before aborting the request. Passed
directly to the request library. If used in bulk writing,
it will result in the entire batch not being written.
Mostly used when you don't care too much if you lose some
data when importing but rather have speed.
--skip
Integer containing the number of rows you wish to skip
ahead from the input transport. When importing a large
index, things can go wrong, be it connectivity, crashes,
someone forgetting to `screen`, etc. This allows you
to start the dump again from the last known line written
(as logged by the `offset` in the output). Please be
advised that since no sorting is specified when the
dump is initially created, there's no real way to
guarantee that the skipped rows have already been
written/parsed. This is more of an option for when
you want to get most data as possible in the index
without concern for losing some rows in the process,
similar to the `timeout` option.
--inputTransport
Provide a custom js file to us as the input transport
--outputTransport
Provide a custom js file to us as the output transport
--toLog
When using a custom outputTransport, should log lines
be appended to the output stream?
(default: true, except for `$`)
--help
This page
Examples:
# Copy an index from production to staging with mappings:
elasticdump \
--input=http://production.es.com:9200/my_index \
--output=http://staging.es.com:9200/my_index \
--type=mapping
elasticdump \
--input=http://production.es.com:9200/my_index \
--output=http://staging.es.com:9200/my_index \
--type=data
# Backup index data to a file:
elasticdump \
--input=http://production.es.com:9200/my_index \
--output=/data/my_index_mapping.json \
--type=mapping
elasticdump \
--input=http://production.es.com:9200/my_index \
--output=/data/my_index.json \
--type=data
# Backup and index to a gzip using stdout:
elasticdump \
--input=http://production.es.com:9200/my_index \
--output=$ \
| gzip > /data/my_index.json.gz
# Backup ALL indices, then use Bulk API to populate another ES cluster:
elasticdump \
--all=true \
--input=http://production-a.es.com:9200/ \
--output=/data/production.json
elasticdump \
--bulk=true \
--input=/data/production.json \
--output=http://production-b.es.com:9200/
# Backup the results of a query to a file
elasticdump \
--input=http://production.es.com:9200/my_index \
--output=query.json \
--searchBody '{"query":{"term":{"username": "admin"}}}'
------------------------------------------------------------------------------
Learn more # https://github.com/taskrabbit/elasticsearch-dump`enter code here`

Use ElasticDump
1) yum install epel-release
2) yum install nodejs
3) yum install npm
4) npm install elasticdump
5) cd node_modules/elasticdump/bin
6)
./elasticdump \
--input=http://192.168.1.1:9200/original \
--output=http://192.168.1.2:9200/newCopy \
--type=data

You can use snapshot/restore feature available in Elasticsearch for this. Once you have setup a Filesystem based snapshot store, you can move it around between clusters and restore on a different cluster

There is also the _reindex option
From documentation:
Through the Elasticsearch reindex API, available in version 5.x and later, you can connect your new Elasticsearch Service deployment remotely to your old Elasticsearch cluster. This pulls the data from your old cluster and indexes it into your new one. Reindexing essentially rebuilds the index from scratch and it can be more resource intensive to run.
POST _reindex
{
"source": {
"remote": {
"host": "https://REMOTE_ELASTICSEARCH_ENDPOINT:PORT",
"username": "USER",
"password": "PASSWORD"
},
"index": "INDEX_NAME",
"query": {
"match_all": {}
}
},
"dest": {
"index": "INDEX_NAME"
}
}

I've always had success simply copying the index directory/folder over to the new server and restarting it. You'll find the index id by doing GET /_cat/indices and the folder matching this id is in data\nodes\0\indices (usually inside your elasticsearch folder unless you moved it).

I tried on ubuntu to move data from ELK 2.4.3 to ELK 5.1.1
Following are the steps
$ sudo apt-get update
$ sudo apt-get install -y python-software-properties python g++ make
$ sudo add-apt-repository ppa:chris-lea/node.js
$ sudo apt-get update
$ sudo apt-get install npm
$ sudo apt-get install nodejs
$ npm install colors
$ npm install nomnom
$ npm install elasticdump
in home directory goto
$ cd node_modules/elasticdump/
execute the command
If you need basic http auth, you can use it like this:
--input=http://name:password#localhost:9200/my_index
Copy an index from production:
$ ./bin/elasticdump --input="http://Source:9200/Sourceindex" --output="http://username:password#Destination:9200/Destination_index" --type=data

If you can add the second server to cluster, you may do this:
Add Server B to cluster with Server A
Increment number of replicas for indices
ES will automatically copy indices to server B
Close server A
Decrement number of replicas for indices
This will only work if number of replaces equal to number of nodes.

If anyone encounter the same issue, when trying to dump from elasticsearch <2.0 to >2.0 you need to do:
elasticdump --input=http://localhost:9200/$SRC_IND --output=http://$TARGET_IP:9200/$TGT_IND --type=analyzer
elasticdump --input=http://localhost:9200/$SRC_IND --output=http://$TARGET_IP:9200/$TGT_IND --type=mapping
elasticdump --input=http://localhost:9200/$SRC_IND --output=http://$TARGET_IP:9200/$TGT_IND --type=data --transform "delete doc.__source['_id']"

We can use elasticdump or multielasticdump to take the backup and restore it, We can move data from one server/cluster to another server/cluster.
Please find a detailed answer which I have provided here.

You can take a snapshot of the complete status of your cluster (including all data indices) and restore them (using the restore API) in the new cluster or server.

If you simply need to transfer data from one elasticsearch server to another, you could also use elasticsearch-document-transfer.
Steps:
Open a directory in your terminal and run
$ npm install elasticsearch-document-transfer.
Create a file config.js
Add the connection details of both elasticsearch servers in config.js
Set appropriate values in options.js
Run in the terminal
$ node index.js

i guess that you can copy the folder data.

Another great new tool which uses the _bulk API to reindex data between server is esm:
Download and Install
wget https://github.com/medcl/esm/releases/download/v0.6.1/migrator-linux-amd64
mv migrator-linux-amd64 esm
chmod +x esm
Migrate One Index
Migrate a single index between 2 servers using 40 workers:
./esm -s https://my.source.server.com:9200 \
-m elastic:*** \
-d http://my.destination.server.com:9200 \
-n elastic:*** \
-x myindex \
-w 40
It may be necessary to create your index (or index template) on the destination server first.
See docs for further examples of how to migrate all or multiple indices.

If you don't want to use the elasticdump like a console tool. You can use next node.js script

Related

How to bulk load data into a dgraph/standalone:graphql container?

Assuming I've a db like the quick-start of https://graphql.dgraph.io/docs/quick-start/
i.e.
type Product {
productID: ID!
name: String #search(by: [term])
reviews: [Review] #hasInverse(field: about)
}
type Customer {
custID: ID!
name: String #search(by: [hash, regexp])
reviews: [Review] #hasInverse(field: by)
}
type Review {
id: ID!
about: Product! #hasInverse(field: reviews)
by: Customer! #hasInverse(field: reviews)
comment: String #search(by: [fulltext])
rating: Int #search
}
Now I would like to import millions of entries and therefore would like to use the bulk loader. My dataset is a bug folder full of .json files.
To what I've seen, I should be able to run a command like
dgraph bulk -f folderOfJsonFiles -s goldendata.schema --map_shards=4 --reduce_shards=2 --http localhost:8000 --zero=localhost:5080
But to run my server, I am using the dgraph/standalone:graphql image ran docker run -v $(pwd):/dgraph -p 9000:9000 -it dgraph/standalone:graphql
Now how to start the bulk import ?
1:
Should I run the command within the docker container itself (and share the volume (folder) containing all my .json files ) or install dgraph on my host and run the dgraph bulk command from the host ?
2: What should be the format of the .json files ?
3: Would the bulk loader support blank nodes (id which are not _:0x1234) ?
[edit]
bulk loader seems not to support graphql schema, the schema should be converted to rdf first. To achieve this, I exported the schema and data right after importing the graphql schema curl 'localhost:8080/admin/export?format=json'
Here a few things to understand:
the bulk loader is not an offline version of the live loader. It is a tool which purpose is to prepare the data for the Dgraph Alpha(s) server(s).
the bulk loader, seems to be only able to load triples
the bulk loader can load a schema and files but this is not the graphql schema, the graphql schema must be loaded apart later.
So to answer the question:
start the dgraph graphql server using docker run -v $(pwd)/dgraph:/dgraph -p 8000:8000 -p 9000:9000 -p 8080:8080 -p 9080:9080 -p 5080:5080 -it dgraph/standalone:graphql for your information, this image launch the /tmp/run.sh script which will itself run dgraph-ratel & dgraph zero & dgraph alpha --lru_mb $lru_mb & dgraph graphql (where lru_mb is the memory you give to dgraph alpha). Keep the container's id for later find it using docker ps if you lost it.
Unless you have + 5 millions of entries (or no time), try using the live loader. If you have troubles with the live loader like: it became very slow after few hundred of thousands entries (300k in my case), this is very likely because your alpha does not have sufficient memory. In my case, I had to tune docker to provide 16Gb of memory to the engine, the script gives to the $lru_mb variable a third of the host memory.
Once you imported your full set of data using live loader, you can export the data using docker exec -it yourDockerContainerId curl localhost:8080/admin/export?format=json, the export will generate 2 files for instance: g01.json.gzand g01.schema.gz which corresponds to your entries and their schema (which is not the graphql schema).
To import those 2 files g01.json.gzand g01.schema.gz back to your dgraph graphql instance, you need to convert them to group’s "p" directory output. To what I understood, the "p" directory holds all the data for the Dgraph Alpha. If you delete it, you lose your data, if you replace it with another set, you will replace / restore the data with the one you just copied. Bulk loader is not an instance of dgraph, it is only the tool which will generate those "p" directory outputs. I have been successful running it within the container. Just run docker exec -it yourDockerContainerId dgraph bulk -f export/pathTo/g01.json.gz -s export/pathTo/g01.schema.gz --map_shards=1 --reduce_shards=1 --http localhost:8001 --zero=localhost:5080. I will be honest, I do not understand the purpose of the http localhost:8001 argument in this command. If the bulk loader ran successfully, it created an out/0/p folder containing the data you can use in your Dgraph Alpha. Stop your docker container docker stop yourDockerContainerId then Replace your current Dgraph Alpha's p folder with the one generated by bulk loader. (Re)start your docker container and you should have your imported data. (perhaps trash the w and zw folders as well, I have no clue about their use).
The data is imported but you will have an warning saying something like there is no graphql schema. Okay let's import our schema (assuming you have it at path dgraph/schemas/schema.graphql) schema=$(cat dgraph/schemas/schema.graphql | tr '\\n' ' ');jq -n --arg schema \"$schema\" '{ query: \"mutation addSchema($sch: String!) { addSchema(input: { schema: $sch }) { schema { schema } } }\", variables: { sch: $schema }}' | curl -X POST -H \"Content-Type: application/json\" http://localhost:9000/admin -d #- This might take few minutes as graph will likely have to index your data according to your graphql schema's indexing rule (typically related to the #search decorator)
You're done…
Now, I am still not completely answering the question because the data we are importing back is the one we just exported (and the one we actually imported using the live loader). So unfortunately, the bulk loader cannot import nice data like live loader, you have to feed him with triples. Therefore you have to prepare the data to load using bulk loader in that format. To help you in this talk, I suggest to
Run the dgraph graphql server docker run -v $(pwd)/dgraph:/dgraph -p 8000:8000 -p 9000:9000 -p 8080:8080 -p 9080:9080 -p 5080:5080 -it dgraph/standalone:graphql
import a graphql schema (assuming the schema is at path dgraph/schemas/schema.graphql ) schema=$(cat dgraph/schemas/schema.graphql | tr '\\n' ' ');jq -n --arg schema \"$schema\" '{ query: \"mutation addSchema($sch: String!) { addSchema(input: { schema: $sch }) { schema { schema } } }\", variables: { sch: $schema }}' | curl -X POST -H \"Content-Type: application/json\" http://localhost:9000/admin -d #-
create one or two basic / template entries using a graphql client. You can install the Altair chrome extension, connect to http://localhost:9000/graphql then add some data, something like:
mutation {
addCustomer(input:{name:"Toto"}){
name
}
}
You can also using a file and the live loader
Then export your small template data docker exec -it yourDockerContainerId curl localhost:8080/admin/export?format=json
Open the g01.json.gz and you will find an example of the data the bulk loader expects to be fed with.
What about blank ids ? I am not sure but as the bulk loader is doing a 2 levels mapping on ids, I can imagine you can provide your ids and those will be converted to dgraph ids later.

elasticdump How do I use offset?

When elasticdump is stopped and restarted, it tries to execute after offset.
but an error occurs.
[Execute command]
nohup ./elasticdump --input=http://host/common --output=http://host/common --type=data --limit=1000 --offset=1000 &
[error]
Error Emitted =>
{"error":{"root_cause":[{"type":"action_request_validation_exception","reason":"Validation
Failed: 1: using [from] is not allowed in a scroll
context;"}],"type":"action_request_validation_exception","reason":"Validation
Failed: 1: using [from] is not allowed in a scroll
context;"},"status":400}
How do I use offset???
From the notes in the elasticdump project:
if you are using Elasticsearch version 6.0.0 or higher the offset parameter is no longer allowed in the scrollContext
What you can do to prevent this (as long as you don't cross the 10000 limit) is to not use the offset parameter (i.e. no scroll context) and provide a search body instead with from and size settings, like this:
nohup ./elasticdump --input=http://host/common --output=http://host/common --type=data --searchBody='{"from": 1000, "size": 1000, "query": { "match_all": {} }}' &
UPDATE:
If you have more than 10K records and elasticdump is prone to stop midway, I suggest leveraging the snapshot/restore feature in order to move the data from one server to another.
You can use --limit parameter in the command, offset is dangerous to use as it can skip n number of records, n being the offset.
more reference - https://github.com/elasticsearch-dump/elasticsearch-dump
e.g.
elasticdump --input=domain/index --output "s3://bucket/file.json" --limit 1000
You can select Elasticsearch max id , and use searchBody continue dump.
elasticdump --input=http://host/common --output=http://host/common --type=data --searchBody='{"query": {"range": {"xxxId": {"gt": 10000}}}}' --limit=1000

Dump all documents of Elasticsearch

Is there any way to create a dump file that contains all the data of an index among with its settings and mappings?
A Similar way as mongoDB does with mongodump
or as in Solr its data folder is copied to a backup location.
Cheers!
Here's a new tool we've been working on for exactly this purpose https://github.com/taskrabbit/elasticsearch-dump. You can export indices into/out of JSON files, or from one cluster to another.
Elasticsearch supports a snapshot function out of the box:
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html
We can use elasticdump to take the backup and restore it, We can move data from one server/cluster to another server/cluster.
1. Commands to move one index data from one server/cluster to another using elasticdump.
# Copy an index from production to staging with analyzer and mapping:
elasticdump \
--input=http://production.es.com:9200/my_index \
--output=http://staging.es.com:9200/my_index \
--type=analyzer
elasticdump \
--input=http://production.es.com:9200/my_index \
--output=http://staging.es.com:9200/my_index \
--type=mapping
elasticdump \
--input=http://production.es.com:9200/my_index \
--output=http://staging.es.com:9200/my_index \
--type=data
2. Commands to move all indices data from one server/cluster to another using multielasticdump.
Backup
multielasticdump \
--direction=dump \
--match='^.*$' \
--limit=10000 \
--input=http://production.es.com:9200 \
--output=/tmp
Restore
multielasticdump \
--direction=load \
--match='^.*$' \
--limit=10000 \
--input=/tmp \
--output=http://staging.es.com:9200
Note:
If the --direction is dump, which is the default, --input MUST be a URL for the base location of an ElasticSearch server (i.e. http://localhost:9200) and --output MUST be a directory. Each index that does match will have a data, mapping, and analyzer file created.
For loading files that you have dumped from multi-elasticsearch, --direction should be set to load, --input MUST be a directory of a multielasticsearch dump and --output MUST be a Elasticsearch server URL.
The 2nd command will take a backup of settings, mappings, template and data itself as JSON files.
The --limit should not be more than 10000 otherwise, it will give an exception.
Get more details here.
For your case Elasticdump is the perfect answer.
First, you need to download the mapping and then the index
# Install the elasticdump
npm install elasticdump -g
# Dump the mapping
elasticdump --input=http://<your_es_server_ip>:9200/index --output=es_mapping.json --type=mapping
# Dump the data
elasticdump --input=http://<your_es_server_ip>:9200/index --output=es_index.json --type=data
If you want to dump the data on any server I advise you to install esdump through docker. You can get more info from this website Blog Link
ElasticSearch itself provides a way to create data backup and restoration. The simple command to do it is:
CURL -XPUT 'localhost:9200/_snapshot/<backup_folder name>/<backupname>' -d '{
"indices": "<index_name>",
"ignore_unavailable": true,
"include_global_state": false
}'
Now, how to create, this folder, how to include this folder path in ElasticSearch configuration, so that it will be available for ElasticSearch, restoration method, is well explained here. To see its practical demo surf here.
At the time of writing this answer(2021), the official way of backing up an ElasticSearch cluster is to snapshot it. Refer to: https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshot-restore.html
The data itself is one or more lucene indices, since you can have multiple shards. What you also need to backup is the cluster state, which contains all sorts of information regarding the cluster, the available indices, their mappings, the shards they are composed of etc.
It's all within the data directory though, you can just copy it. Its structure is pretty intuitive. Right before copying it's better to disable automatic flush (in order to backup a consistent view of the index and avoiding writes on it while copying files), issue a manual flush, disable allocation as well. Remember to copy the directory from all nodes.
Also, next major version of elasticsearch is going to provide a new snapshot/restore api that will allow you to perform incremental snapshots and restore them too via api. Here is the related github issue: https://github.com/elasticsearch/elasticsearch/issues/3826.
You can also dump elasticsearch data in JSON format by http request:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
CURL -XPOST 'https://ES/INDEX/_search?scroll=10m'
CURL -XPOST 'https://ES/_search/scroll' -d '{"scroll": "10m", "scroll_id": "ID"}'
To export all documents from ElasticSearch into JSON, you can use the esbackupexporter tool. It works with index snapshots. It takes the container with snapshots (S3, Azure blob or file directory) as the input and outputs one or several zipped JSON files per index per day. It is quite handy when exporting your historical snapshots. To export your hot index data, you may need to make the snapshot first (see the answers above).
If you want to massage the data on its way out of Elasticsearch, you might want to use Logstash. It has a handy Elasticsearch Input Plugin.
And then you can export to anything, from a CSV file to reindexing the data on another Elasticsearch cluster. Though for the latter you also have the Elasticsearch's own Reindex.

Amazon Elastic MapReduce: Output directory

I'm running through Amazon's example of running Elastic MapReduce and keep getting hit with the following error:
Error launching job , Output path already exists.
Here is the command to run the job that I am using:
C:\ruby\elastic-mapreduce-cli>ruby elastic-mapreduce --create --stream \
--mapper s3://elasticmapreduce/samples/wordcount/wordSplitter.py \
--input s3://elasticmapreduce/samples/wordcount/input \
--output [A path to a bucket you own on Amazon S3, such as, s3n://myawsbucket] \
--reducer aggregate
Here is where the example comes from here
I'm following Amazon'd directions for the output directory. The bucket name is s3n://mp.maptester321mark/. I've looked through all their suggestions for problems on this url
Here is my credentials.json info:
{
"access_id": "1234123412",
"private_key": "1234123412",
"keypair": "markkeypair",
"key-pair-file": "C:/Ruby/elastic-mapreduce-cli/markkeypair",
"log_uri": "s3n://mp-mapreduce/",
"region": "us-west-2"
}
hadoop jobs won't clobber directories that already exist. You just need to run:
hadoop fs -rmr <output_dir>
before your job ot just use the AWS console to remove the directory.
Use:
--output s3n://mp.maptester321mark/output
instead of:
--output s3n://mp.maptester321mark/
I suppose EMR makes the output bucket before running and that means you'll already have your output directory / if you specify --output s3n://mp.maptester321mark/ and that might be the reason why you get this error.
---> If the folder (bucket) already exists then remove it.
---> If you delete it and you still get the above error make sure your output is like this
s3n://some_bucket_name/your_output_bucket if you have it like this s3n://your_output_bucket/
its an issue with EMR!! as i think it first creates bucket on the path (some_bucket_name) and then tries to create the (your_output_bucket).
Thanks
Hari

search a text in a svn repository across all releases

I need to search a text in a SVN repository, inside all revisions.
I am trying this software, which indexes all repositories and then you can search:
http://www.supose.org/projects/supose/wiki
#create the index
./bin/supose sc -U "svn://..." --username ... -p ... --fromrev 0 --create
#search
./bin/supose se -Q "+contents:class"
if i understood correctly, this should search across all files for the text "class".
this should return a lot of matches, as there a lot of java files.
however, it only returns maches in some .xml; it is ignoring java files.
why?
and what is the "search --fields" option? what is a field?

Resources