elasticsearch - how to copy data to another cluster - elasticsearch

How can I get an elasticsearch index to a file and then insert that data to another cluster?
I want to move data from one cluster to another but I can't connect them directly.

If you no need to keep _id the same and only important bit is _source you may use logstash with config:
input { //from one cluster } output { //to another cluster }
here is more info: http://www.logstash.net/docs/1.4.2/
Yes it's method is weird, but I tried it for instant data transfer between clusters index by index and it is working as a charm (of course if you no need to keep _id generated by elasticsearch)

There is script which will help you to backup and restore indices from one cluster to another. i didn't tested this but may be it will fix your needs.
check this Backup and restore an Elastic search index
And you can also use perl script to copy index from one cluster to another (or the same cluster).
check this link clintongormley/ElasticSearch.pm

I recently tried my hands around this and there are a couple of approaches that can help you.
Use Elasticsearch's Snapshot and Restore APIs.
You can take a snapshot at the source cluster and use that snapshot to restore data to your destination cluster.
If your setup allows installing external packages, you can use Elasticdump as well.
HTH!

Related

Is it possible to append (instead of restore) a snapshot of indices?

Suppose we have some indices in our cluster. I can make a snapshot of my favorite index and I can restore the same index again to my cluster if the same index is not exists or is closed. But what if the index currently exists and I need to add/append extra data/documents to it ?
Suppose I currently have 100000 documents in my index in my server. I create/add 100 documents to my index in my local system which has the same name, the same mappings and the same settings, the same number of shards and . . ., now I want to add 100 documents to my current index in my server (100000 documents) . What is the best way ?
In MySQL I use export to csv or excel and ... and it is so easy to import or append data to currently existed index.
There is no Append API for Elasticsearch but I suggest to restore indices with temporary name and use Reindex API to index local data to bigger indices. then delete temporary indices.
also you can use Logstash for this purpose (reindex). build a pipeline which read data from temp indices (Elasticsearch input plugin ) and write data to primary indices (Elasticsearch output plugin)
note: you can't have two indices with the same name in cluster.
In addition to answer by Hamid Bayat, :
Is it possible to append (instead of restore) a snapshot of indices?
Snapshots by nature are incremental i.e append-only. See this and also this. Thus, if your index has 1000 docs and you snapshot it and later add 100 more docs, then when you trigger another snapshot, only the recently added 100 docs will be snapshotted and not all the 1100. However, restore is not incremental. I.e. you cannot restore only those recently added 100 docs. If you restore an index, you restore all the docs.
From your description of the question, it seems you are looking for something like: when you add 100 docs to local ES Cluster, you also want those 100 docs to be added in the remote (other) ES Cluster as well. Am I correct?
As for export csv or excel, there's an excellent tool called es2csv that allows to export data from ES to csv. And then you can use Kibana to import the CSV data. Or use this tool called Elasticsearch_Loader. You might also want to look at another excellent tool called elasticdump

How does elasticsearch snapshot/restore work

I'm using Elasticsearch 2.4.4 and need to use the snapshot/restore mechanism to backup data. But I have a few questions about it.
1. Can a snapshot be taken without any issues while data is being written into ES.
2. Does it matter which of master/data/client node is being used for taking snapshots.
3.Does restore require indices to be closed.If yes then why
Yes
Does not. Most important that storage where you want to write data is available to access from all cluster nodes. Because data are replicated via your cluster and I think you dont have control who will backup your data
No you can snapshot open indicies
Read a bit more about it and here

Is solr cloud applicable for use case where indexing is offline?

Solr cloud seems to be the suggested method to scale solr in future. I understand that legacy scaling methods (like master slave and replication) still exists. My use case with solr does not have to be near real time (NRT). It is fine if the newly indexed data is visible for searchers after about 1 day.
In the master slave (legacy scaling), I could replicate it once a day. In Solr cloud do i have an option like this?
Also i don't want the indexing to impact the searcher performance during index time. Is there a way to isolate the indexer from searcher shards in solr cloud?
You could skip SolrCloud and just index on a dedicate separate collection.
Then, you bring the new content to each machine individually and do a Core Swap.
Or similar thing using Aliases to point to the newest core/collection. Which also allows you to segment old content and new content into different collections and search them together.
I also used collection aliases in such cases. You can build your index once a day and when it is ready you simply change the alias. I'll give an example
At very begining you create index called: index_2014_12_01. This index is aliased by index_2014_12_01. The next day you build index_2014_12_02 and changing the alias now to point index_2014_12_02 instead of index_2014_12_01.

Use Elasticsearch as backup store

My application receives and parse thousands of small JSON snippets each about ~1Kb every hour. I want to create a backup of all incoming JSON snippets.
Is it a good idea to use Elasticsearch to backup this snippets in an index with f.ex. "number_of_replicas:" 4? Never read that anyone has used Elasticsearch for this.
Is my data safe in Elasticsearch when I use a cluster of servers and replicas or should I better use another storage for this use case?
(Writing it to the local file system isn't safe, as our hard discs crashes often. First I have thought about using HDFS, but this isn't made for small files.)
First you need to find difference between replica and backups.
replica is more than one copy of data at run time.It increases high availability and failover support,it wont support accidental delete of data.
Backup is copy of whole data at backup time.it will be used to restore when system crashed.
Elastic search for back up.. its not good idea.. Elastic search is a search engine not DB.If you have not configured ES cluster carefully,then you will end up with loss of data.
So in my opinion ,
To store json object, we got lot of dbs.. For example mongodb is a nosql db.We can easily configure it with more replicas.It means high availability of data and failover support.As you asked its also opensource and more reliable.
for more info about mongodb refer https://www.mongodb.org/
Update:
In elasticsearch if you create index with more shards it'll be distributed among nodes.If a node fails then the data will be lost.But in mongoDB more node means ,each mongodb node contains its own copy of data.If a mongodb fails then we can retrieve out data from replica mongodbs. We need to be more conscious about replica setup and shard allocation in Elasticsearch. But in mongoDB it's easier and good architecture too.
Note: I didn't say storing data in elasticsearch is not safe.I mean, comparing to mongodb,it's difficult to configure replica and maintain in elasticsearch.
Hope it helps..!

How to build distribute search base on hadoop and lucene

I'm preparing to make distribute search module with lucence and hadoop but fell confused with something:
as we know , hdfs is a distribute file system ,when i put a file to hdfs , the file will be divided into severial blocks and stored in diffrent slave machine in the claster , but if i use lucene to write index on hdfs , i want to see the index on each machine , how to acheived it ?
i have read some of the hadoop/contrib/index and some katta ,but don't understand the idea of the "shards ,looks like part of the index" , it was stored on local disk of one computer or only one directionary distribut in the cluster ?
Thanks for advance
-As for your Question 1:
You can implement the Lucene "Directory" interface to make it work with with hadoop and let hadoop handle the files you submit to it. You could also provide your own implementation of "IndexWriter" and "IndexReader" and use your hadoop client to write and read the Index. This way since you could have more control about the format the index you will write. You can "see" or access the index on each machine via the your lucene/hadoop implementation.
-For your question 2:
A shard is a subset of the index. When you run your query all shards are processed in the same time and the results of the index search on all shards are combined. On each machine of your cluster you will have a part of your index: a shard. So a part of the index will be stored on a local machine but will appear to you as as a single file distributed across the cluster.
I can also suggest you to checkout the distributed search SolrCloud, or here
It is runs on Lucene as indexing/search engine and already enables you to have a clustered index. It also provides an API for submitting the files to index and query the index. Maybe it is sufficient for your use case.

Resources