How to clone existing index data from Elasticsearch

I want to clone existing index data (from day 1 to now) from one Elasticsearch instance to another Elasticsearch instance on a new server. I have read some references and know I can take a snapshot and restore it. Is that the best way to clone the index data, and does the snapshot include the data from day 1 to now?
Thanks.

Snapshots are meant to be used for:
Backups & recovery
Data export -> import
Using a snapshot to migrate data between two ES clusters is fine as long as both clusters are on the same version of ES.
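For illustration, here is a minimal sketch of that snapshot-and-restore migration using Python's requests library against the snapshot REST API. The hostnames, repository name and filesystem path are placeholders, and it assumes both clusters can reach the same shared repository location (with path.repo configured on every node).

    import requests

    SOURCE = "http://source-es:9200"   # existing cluster
    DEST = "http://dest-es:9200"       # new cluster
    REPO = {"type": "fs", "settings": {"location": "/mnt/es_backups"}}

    # 1. Register the same filesystem repository on both clusters.
    requests.put(f"{SOURCE}/_snapshot/migration_repo", json=REPO).raise_for_status()
    requests.put(f"{DEST}/_snapshot/migration_repo", json=REPO).raise_for_status()

    # 2. Snapshot the source cluster and wait for the snapshot to finish.
    requests.put(
        f"{SOURCE}/_snapshot/migration_repo/snapshot_1",
        params={"wait_for_completion": "true"},
    ).raise_for_status()

    # 3. Restore the snapshot (here a single index) on the destination cluster.
    requests.post(
        f"{DEST}/_snapshot/migration_repo/snapshot_1/_restore",
        json={"indices": "my_index"},
    ).raise_for_status()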

Related

How to maintain all the changes made to Elasticsearch Mapping?

How do people maintain all the changes made to an Elasticsearch index over time, so that if I have to rebuild the index from scratch to match the existing one, I can do so in minutes? Do people keep logs of all the PUT calls made over time to update the mappings and other settings?
I guess one way is to use a snapshot. A snapshot is a backup taken from a running Elasticsearch cluster or index. You can take a snapshot of an individual index or of the entire cluster and store it in a repository on a shared filesystem. It contains a copy of the on-disk data structures and mappings that make up an index. Besides that, when you create a snapshot of an index, Elasticsearch avoids copying any data that is already stored in the repository as part of an earlier snapshot, so you can rebuild or recover an index from scratch up to the last taken snapshot very quickly.
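As a hedged illustration of snapshotting an individual index rather than the whole cluster, assuming a repository named my_repo is already registered and using made-up host and index names:

    import requests

    ES = "http://localhost:9200"

    # Snapshot a single index; a later snapshot of the same index into this
    # repository only stores data not already present in the repository.
    requests.put(
        f"{ES}/_snapshot/my_repo/my_index_snap_1",
        params={"wait_for_completion": "true"},
        json={"indices": "my_index", "include_global_state": False},
    ).raise_for_status()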

Is it possible to append (instead of restore) a snapshot of indices?

Suppose we have some indices in our cluster. I can make a snapshot of my favorite index, and I can restore that index to my cluster again as long as an index with the same name does not exist or is closed. But what if the index currently exists and I need to add/append extra data/documents to it?
Suppose I currently have 100000 documents in my index on my server. I create/add 100 documents to an index on my local system which has the same name, the same mappings, the same settings, the same number of shards, and so on. Now I want to add those 100 documents to the index on my server (the one with 100000 documents). What is the best way?
In MySQL I can export to CSV or Excel, and it is so easy to import or append that data to an existing index.
There is no Append API in Elasticsearch, but I suggest restoring the indices under a temporary name and using the Reindex API to index the restored data into the bigger indices, then deleting the temporary indices.
You can also use Logstash for this purpose (reindexing): build a pipeline which reads data from the temporary indices (Elasticsearch input plugin) and writes it to the primary indices (Elasticsearch output plugin).
Note: you can't have two indices with the same name in a cluster.
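A minimal sketch of that restore-under-a-temporary-name-then-reindex flow, using the REST API from Python; the repository, snapshot and index names are placeholders:

    import requests

    ES = "http://localhost:9200"

    # 1. Restore the snapshotted index under a temporary name so it does not
    #    clash with the existing index of the same name.
    requests.post(
        f"{ES}/_snapshot/my_repo/snapshot_1/_restore",
        params={"wait_for_completion": "true"},
        json={
            "indices": "my_index",
            "rename_pattern": "my_index",
            "rename_replacement": "my_index_tmp",
        },
    ).raise_for_status()

    # 2. Reindex the restored documents into the existing (bigger) index.
    requests.post(
        f"{ES}/_reindex",
        params={"wait_for_completion": "true"},
        json={"source": {"index": "my_index_tmp"}, "dest": {"index": "my_index"}},
    ).raise_for_status()

    # 3. Drop the temporary index.
    requests.delete(f"{ES}/my_index_tmp").raise_for_status()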
In addition to the answer by Hamid Bayat:
Is it possible to append (instead of restore) a snapshot of indices?
Snapshots are by nature incremental, i.e. append-only. See this and also this. Thus, if your index has 1000 docs and you snapshot it, and later add 100 more docs, then when you trigger another snapshot only the recently added 100 docs are snapshotted, not all 1100. However, restore is not incremental, i.e. you cannot restore only those recently added 100 docs: if you restore an index, you restore all the docs.
From your description, it seems you are looking for something like this: when you add 100 docs to the local ES cluster, you also want those 100 docs to be added to the remote (other) ES cluster. Am I correct?
As for exporting to CSV or Excel, there's an excellent tool called es2csv that lets you export data from ES to CSV. You can then use Kibana to import the CSV data, or use the tool called Elasticsearch_Loader. You might also want to look at another excellent tool called elasticdump.

Apache Lucene / Elasticsearch snapshot restore with merge

I have successfully snapshotted and restored data multiple times in Elasticsearch (ES) using its APIs. But now I want to merge two snapshots, either in ES or directly in Lucene, to restore a 'larger' chunk of data.
Details:
I take weekly snapshots of my data, and as soon as restoration is done I delete the index, so essentially the workflow looks like this (sketched as API calls after the list):
Create index abc
Snapshot index abc
Delete index abc
-----
Create index abc (again)
Snapshot index abc
Delete index abc
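For reference, that weekly cycle is roughly this sequence of API calls (the repository and snapshot names below are made up):

    import requests

    ES = "http://localhost:9200"

    requests.put(f"{ES}/abc").raise_for_status()          # create index abc
    # ... index the week's documents into abc ...
    requests.put(                                         # snapshot index abc
        f"{ES}/_snapshot/weekly_repo/abc_week_01",
        params={"wait_for_completion": "true"},
        json={"indices": "abc"},
    ).raise_for_status()
    requests.delete(f"{ES}/abc").raise_for_status()       # delete index abc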
I have looked around, but it seems there is no way to do that. Those posts are a year old, though, so I wanted to reach out to the community again.
Also, if this is not possible in Elasticsearch, is there a way to do it in Lucene directly and then configure ES to use the 'new combined' index for restoration?
My language of choice for development is Python, so I am looking into PyLucene as well, but haven't explored it much yet.

How do you update or sync with a JDBC river?

A question about rivers and data syncing with a production database using Elasticsearch:
Are rivers suited only for the initial bulk loading of data, or do they somehow listen for or monitor changes?
If I have a nightly import of data, is it better to just delete the rivers and indexes, then re-index and recreate the rivers?
If I update or change a river, do I have to delete and re-create the index?
How do I set up a schedule for a river to fetch new data periodically? Can it store the last max id so that it can run diff queries in SQL to select into the river?
Any suggestions for a better way to keep the database and Elasticsearch in sync, without calling individual index update functions with a PUT command?
All of the Elasticsearch rivers are different - some are provided directly by Elasticsearch, many more are developed by third parties:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-plugins.html
Each operates differently, so to answer your questions you have to choose a specific river. For your case, since you're looking to index data from a production database, I'll assume that the JDBC river is what you would use:
https://github.com/jprante/elasticsearch-river-jdbc
This river will index data from your JDBC source, including picking up changes. It can do so on a schedule (there is detailed documentation on the schedule parameter on this page: https://github.com/jprante/elasticsearch-river-jdbc). However, this river will not pick up deletes:
https://github.com/jprante/elasticsearch-river-jdbc/issues/213
You may find this discussion useful; it covers working around the lack of delete support by building a new river/index daily and using index aliases: ElasticSearch river JDBC MySQL not deleting records
You can also alias the id column in your database query as _id; this way Elasticsearch can tell whether a document has changed or not (an update overwrites the existing document instead of creating a duplicate).
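For illustration only, a scheduled JDBC river definition looked roughly like the sketch below. The parameter names under "jdbc" are recalled from the plugin's README rather than verified, and the connection details are placeholders, so double-check against the documentation linked above:

    import requests

    # Rivers were configured by PUTting a _meta document into the _river index.
    # "sql" aliases the primary key to _id (see the note above) and "schedule"
    # takes a cron-like expression so the query is re-run periodically.
    requests.put(
        "http://localhost:9200/_river/my_jdbc_river/_meta",
        json={
            "type": "jdbc",
            "jdbc": {
                "url": "jdbc:mysql://localhost:3306/mydb",
                "user": "db_user",
                "password": "db_password",
                "sql": "select id as _id, name, price from products",
                "schedule": "0 0 2 ? * *",
            },
        },
    ).raise_for_status()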

elasticsearch - how to copy data to another cluster

How can I export an Elasticsearch index to a file and then insert that data into another cluster?
I want to move data from one cluster to another, but I can't connect them directly.
If you don't need to keep _id the same and the only important bit is _source, you can use Logstash with a config along these lines (plugin option names vary between Logstash versions, and the hosts and index names here are placeholders):
    input  { elasticsearch { hosts => ["old-cluster:9200"] index => "my_index" } }
    output { elasticsearch { hosts => ["new-cluster:9200"] index => "my_index" } }
Here is more info: http://www.logstash.net/docs/1.4.2/
Yes, the method is a bit weird, but I tried it for instant data transfer between clusters, index by index, and it works like a charm (of course only if you don't need to keep the _id values generated by Elasticsearch).
There is a script which will help you back up and restore indices from one cluster to another. I haven't tested it, but maybe it will fit your needs.
Check this: Backup and restore an Elastic search index
You can also use a Perl script to copy an index from one cluster to another (or within the same cluster).
Check this link: clintongormley/ElasticSearch.pm
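In the same do-it-yourself spirit as those scripts, here is a minimal Python sketch that scrolls an index out of the source cluster into a newline-delimited JSON file (which you can move between networks) and then bulk-loads that file into the target cluster. Host and index names are placeholders, and a real dump would split the upload into smaller bulk requests:

    import json
    import requests

    SOURCE = "http://source-es:9200"
    DEST = "http://dest-es:9200"
    INDEX = "my_index"

    # 1. Scroll all documents out of the source cluster into an NDJSON file.
    resp = requests.post(
        f"{SOURCE}/{INDEX}/_search",
        params={"scroll": "2m"},
        json={"size": 1000, "query": {"match_all": {}}},
    ).json()

    with open("my_index_dump.ndjson", "w") as out:
        while resp["hits"]["hits"]:
            for hit in resp["hits"]["hits"]:
                out.write(json.dumps({"index": {"_index": INDEX, "_id": hit["_id"]}}) + "\n")
                out.write(json.dumps(hit["_source"]) + "\n")
            resp = requests.post(
                f"{SOURCE}/_search/scroll",
                json={"scroll": "2m", "scroll_id": resp["_scroll_id"]},
            ).json()

    # 2. Move the file to the other network, then bulk-index it into the target
    #    cluster (chunk this for anything beyond a small index).
    with open("my_index_dump.ndjson", "rb") as dump:
        requests.post(
            f"{DEST}/_bulk",
            data=dump,
            headers={"Content-Type": "application/x-ndjson"},
        ).raise_for_status()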
I recently tried my hand at this, and there are a couple of approaches that can help you.
Use Elasticsearch's Snapshot and Restore APIs.
You can take a snapshot at the source cluster and use that snapshot to restore data to your destination cluster.
If your setup allows installing external packages, you can use Elasticdump as well.
HTH!
