How to reindex AWS Elasticsearch? - ruby

My Ruby/Sinatra app connects to an AWS ES cluster using the elasticsearch-ruby gem to index text documents that authorised users can search through (authorisation works by indexing each document under the user's ID). Now I want to copy a document from one index to another so that it becomes query-able by a different authorised user. I tried the _reindex endpoint as documented in the Elasticsearch reference, only to get the following error:
Elasticsearch::Transport::Transport::Errors::Unauthorized - [401] {"Message":"Your request: '/_reindex' is not allowed."}:
Googling around, I stumbled across an Amazon docs page that lists all the operations supported on both of their APIs, and for some twisted reason _reindex isn't there yet. Why is that? More importantly,
how do I get around this efficiently and achieve what I want to do?

You should double-check the Elasticsearch version deployed by AWS ES. The _reindex API became available in version 2.3, I believe. You can check the version number with a curl GET against the ES root endpoint (host and port) and inspecting version.number in the response.
To work around not having the _reindex endpoint, I would recommend implementing it yourself. This isn't too bad: use the scroll API to iterate through all the documents you want to reindex (with a match_all query if it's the entire index), manipulate each document as you wish or leave it as-is, and then use the bulk API to post (i.e. reindex) the documents into the new index.
Make sure to have created the new index with the mapping template you want ahead of time.
The procedure above is best for reindexing lots of documents. If you just want to move one or a few (which it sounds like you do), simply grab the document from its existing index by ID and submit it to your second index.
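The scroll-and-bulk loop described above could be sketched like this with the elasticsearch-ruby gem the app already uses (index names, batch size, and the scroll timeout are placeholders, not values from the question):

```ruby
# Turn one page of scroll hits into a bulk payload for the destination index,
# reusing each document's original _id.
def bulk_actions(hits, dest_index)
  hits.flat_map do |hit|
    [{ index: { _index: dest_index, _id: hit['_id'] } }, hit['_source']]
  end
end

# Drive the scroll: page through the source index and bulk-post each batch.
# `client` is assumed to be an Elasticsearch::Client from the gem.
def manual_reindex(client, source_index, dest_index, batch_size: 500)
  response = client.search(index: source_index, scroll: '2m', size: batch_size,
                           body: { query: { match_all: {} } })
  loop do
    hits = response['hits']['hits']
    break if hits.empty?
    client.bulk(body: bulk_actions(hits, dest_index))
    response = client.scroll(scroll_id: response['_scroll_id'], scroll: '2m')
  end
end
```

Remember the note below about creating the destination index with the right mapping first, or the bulk inserts will fall back to dynamic mapping.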

AWS Elasticsearch now supports remote reindex, check this documentation:
https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/remote-reindex.html
Example below:
POST <local-domain-endpoint>/_reindex
{
  "source": {
    "remote": {
      "host": "https://remote-domain-endpoint:443"
    },
    "index": "remote_index"
  },
  "dest": {
    "index": "local_index"
  }
}

Related

elasticsearch querying only the document I want and saving it as a snapshot

I want to find a way to back up (snapshot) and restore only the documents (data) that I want out of Elasticsearch.
I looked up the Elasticsearch reference, but there was only a way to back up an entire index; I couldn't find a way to back up only the documents matching a query.
Is there a way to back up only the desired data, the way a query would select it in MySQL?
The code below backs up the entire index by storing a basic snapshot.
How can I modify something here?
PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "my_backup_location"
  }
}
You can try creating another index using your query, and then snapshot this new index.
You can use the reindex API (with a query in the source) to create this new index.
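Sketched as request-body builders (index names and the query are placeholders; include_global_state: false keeps the snapshot limited to the listed indices):

```ruby
# Body for a _reindex that copies only documents matching `query`
# into a temporary index.
def filtered_reindex_body(source_index, dest_index, query)
  { source: { index: source_index, query: query }, dest: { index: dest_index } }
end

# Body for a snapshot-create request restricted to the given indices.
def snapshot_body(indices)
  { indices: indices, include_global_state: false }
end

# Usage sketch with an elasticsearch-ruby client:
# client.reindex(body: filtered_reindex_body('logs', 'logs_subset',
#                                            { match: { level: 'error' } }))
# client.snapshot.create(repository: 'my_backup', snapshot: 'subset_snap',
#                        body: snapshot_body('logs_subset'))
```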

Elasticsearch: when inserting a record to index I don't want to create an index mapping

Elasticsearch's default behavior when inserting a document into an index is to create the index (and its mapping) if it does not exist.
I know that I can change this behavior at the cluster level using this call:
PUT _cluster/settings
{
  "persistent": {
    "action.auto_create_index": "false"
  }
}
but I can't control the customer's Elasticsearch.
What I'm asking is: is there a parameter I can send with the index-document request that tells Elasticsearch not to create the index in case it doesn't exist, but to fail instead?
If you can't change the cluster settings or the settings in elasticsearch.yml, I'm afraid it's not possible, since there is no such parameter on the POST/PUT of documents.
Another possible solution could be to add a check at the API level, which prevents going to Elasticsearch at all if there is no such index.
There is an issue on GitHub proposing to set action.auto_create_index to false by default, but unfortunately I couldn't see whether there has been any progress on it.
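The API-level guard could look like this sketch, assuming an elasticsearch-ruby client (raising ArgumentError is just one choice of failure mode, not anything the question prescribes):

```ruby
# Refuse to index when the target index does not already exist,
# instead of letting Elasticsearch auto-create it.
def index_document(client, index, id, doc)
  unless client.indices.exists?(index: index)
    raise ArgumentError, "index #{index} does not exist"
  end
  client.index(index: index, id: id, body: doc)
end
```

Note this is a check-then-act pattern: there is still a small race if another process deletes the index between the existence check and the write.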

ElasticSearch - Reindex from Remote on a schedule, with daily deletions on the source index

I have an index (we'll call it index01) on an ElasticSearch instance #1 (we'll call this ES1) on a Linux box in the US. I have another ElasticSearch instance on a Linux box in the UK (we'll call it ES2). What I need to do is duplicate index01 from ES1 to ES2, once-a-day.
At first it seemed easy enough using the Reindex from Remote functionality, but now I'm overwhelmed and confused by the documentation.
So I first created an index on ES2 called index01, using the exact same settings and parameters as index01 on ES1. Then, per the documentation, I'm supposed to make this call to build the index:
POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",
      "username": "user",
      "password": "pass"
    },
    "index": "source",
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}
Turns out I don't need to put anything in the query clause, since I just want to bring the entire index over.
1st question: Each day, index01 on ES1 has many documents added to it and many documents deleted from it. How do I keep the two indexes in sync and make sure index01 on ES2 matches ES1 exactly?
2nd question - is it possible to do this on a schedule using only Postman, or will I need to build an application in order to make this sync happen every 24 hours?
Reindex just copies documents from one index to another; it doesn't track changes.
So the answer to the first question is: you can't with a simple reindex. You should delete the index on ES2 and then reindex from ES1.
If there were no deletes on ES1, the delete step would not be necessary, thanks to the op_type: create property (see https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html).
Regarding the second question, you can do this directly from Postman, using 2 calls:
first call: delete index from ES2
second call: reindex from ES1 to ES2
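Those two calls could equally be one small script run on a schedule (e.g. from cron) instead of Postman; a sketch with elasticsearch-ruby, where the endpoint names are placeholders:

```ruby
# Daily sync: drop the stale local copy (so deletions on ES1 are reflected),
# then pull a fresh copy from the remote cluster.
# `client` is assumed to be an Elasticsearch::Client pointed at ES2.
def sync_index(client, index, remote_host)
  client.indices.delete(index: index)   # 1st call: delete the index on ES2
  client.reindex(body: {                # 2nd call: reindex from ES1
    source: { remote: { host: remote_host }, index: index },
    dest:   { index: index }
  })
end

# e.g. sync_index(client, 'index01', 'http://es1-host:9200')
```

Between the delete and the end of the reindex, searches on ES2 will see a partial index; if that matters, reindex into a fresh index name and flip an alias instead.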

ElasticSearch reindex nested field as new documents

I am currently changing my ElasticSearch schema.
I previously had one type Product in my index with a nested field Product.users.
I now want two different indices, one for Product and another for User, and to make the links between them in code.
I use the reindex API to reindex all my Product documents to the new index, removing the Product.users field with a script:
ctx._source.remove('users');
But I don't know how to reindex all my Product.users documents into the new User index, because in a script I get an ArrayList of users and I want to create one User document for each.
Does anyone knows how to achieve that?
For those who may face this situation: I finally ended up reindexing the users nested field using both the scroll and bulk APIs.
I used the scroll API to get batches of Product documents
For each batch, iterate over those Product documents
For each document, iterate over Product.users
Create a new User document and add it to a bulk request
Send the bulk request when I finish iterating over the Product batch
Doing the job <3
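The per-batch transform in those steps could be sketched like this; the Product.users field name comes from the question, while the `id` key used as the new document's _id is an assumption:

```ruby
# For one batch of Product hits, emit bulk actions that create a separate
# User document for every entry in each product's nested `users` array.
def user_bulk_actions(product_hits, user_index)
  product_hits.flat_map do |hit|
    users = hit['_source']['users'] || []
    users.flat_map do |user|
      [{ index: { _index: user_index, _id: user['id'] } }, user]
    end
  end
end

# Then, inside the scroll loop over Product batches:
# client.bulk(body: user_bulk_actions(response['hits']['hits'], 'users'))
```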
What you need is called ETL (Extract, Transform, Load).
Most of the time, it is handier to write a small Python script that does exactly what you want, but with Elasticsearch there is one combination I love: Apache Spark + the elasticsearch-hadoop plugin.
Also, sometimes Logstash can do the trick, but with Spark you have:
SQL syntax, or Java/Scala/Python code
very fast reads/writes from Elasticsearch because of distributed workers (1 ES shard = 1 Spark worker)
fault tolerance (a worker crashes? no problem)
clustering (ideal if you have billions of documents)
Use it with Apache Zeppelin (a notebook with Spark packaged and ready), you will love it!
The simplest solution I can think of is to run the reindex command twice: once selecting the Product fields and reindexing into the new_Products index, and once for the users:
POST _reindex
{
  "source": {
    "index": "Product",
    "type": "_doc",
    "_source": ["fields", "to keep in", "new Products"],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "new_Products"
  }
}
Then you should be able to do the reindex again into the new User index by selecting only Product.users in the second reindex.

What is the best way to index Couchbase data on Elastic Search

I work with Couchbase DB and I want to index part of its data on Elastic Search (ES).
The data from Couchbase should stay synced, i.e. if a document changes in CB, the corresponding document in ES should change too.
I have several questions about what is the best way to do it:
What is the best way to sync the data? I saw that there is a CB plugin for ES (http://www.couchbase.com/couchbase-server/connectors/elasticsearch), but is that the recommended way?
I don't want to store the whole CB document in ES, only part of it, e.g. some fields I want to store and some not. How can I do that?
My documents may have different attributes, and the differences may be big (e.g. 50 different attributes/fields). Assuming I want to index all these attributes in ES, will having a lot of indexed fields affect performance?
Thanks,
Given the doc link, I am assuming you are using Couchbase and not CouchDB.
You are following the correct link for use of Elastic Search with Couchbase. Per the documentation, configure the Cross Data Center Replication (XDCR) capabilities of Couchbase to push data to ES automatically as mutations occur.
Without a defined mapping file, ES will create a default mapping. You can provide your own mapping file (or alter the one it generates) to control which fields get indexed. Refer to the enabled property in the ES documentation at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-object-type.html.
Yes, indexing all fields will affect performance. You can find some performance management tips for the Couchbase integration at http://docs.couchbase.com/couchbase-elastic-search/#managing-performance. The preferred approach to the integration is perform the search in ES and only get keys back for the matched documents. You then make a multiget call against the Couchbase cluster to retrieve the document details themselves. So while ES will index many fields, you do not store all fields there nor do you retrieve their values from ES. The in-memory multiget against Couchbase is the fastest way to retrieve the matching documents, using the IDs from ES.
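The Elasticsearch side of that search-then-multiget pattern might look like this sketch (the Couchbase fetch is left as a comment, since it depends entirely on which Couchbase client library you use):

```ruby
# Ask Elasticsearch for matching document IDs only, suppressing _source
# so no field values travel back in the response.
# `es_client` is assumed to be an Elasticsearch::Client.
def matching_ids(es_client, index, query)
  response = es_client.search(index: index,
                              body: { query: query, _source: false })
  response['hits']['hits'].map { |hit| hit['_id'] }
end

# ids = matching_ids(client, 'products', { match: { name: 'widget' } })
# then fetch `ids` from Couchbase with your client's multi-get operation
```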
A lot of questions! Let me answer them one by one:
1) The best way, and an already available solution, is to use the river plugin to sync the data dynamically. It also indexes only the changed documents, which helps a lot with performance.
2) Yes, you can restrict which fields are indexed in the river plugin.
The plugin documentation is available on the Couchbase website itself.
Refer: http://docs.couchbase.com/couchbase-elastic-search/
The GitHub river is still in development, but you can use the code and modify it to your needs:
https://github.com/mschoch/elasticsearch-river-couchbase
3) If you index all the fields, yes, there will be some lag in performance, so it is better to index only the fields you need. If you need a field only for storage, not search, set it in the mapping as not analyzed. This will decrease both indexing and search time.
Hope it helps!
You might find this additional explanation regarding Don Stacy's answer to question 2 useful:
When replicating from Couchbase, there are 3 ways in which you can interfere with Elasticsearch's default mapping (before you start XDCR) and thus, as desired, not store certain fields by setting "store" = false:
Create manual mappings on your index
Create a dynamic template
Edit couchbase_template.json
Hints:
Note that when we do XDCR from Couchbase to Elasticsearch, Couchbase wraps the original document in a "doc" field. This means that you have to take this modified structure into account when you create your mapping. It would look something like this:
curl -XPUT 'http://localhost:9200/test/couchbaseDocument/_mapping' -d '
{
  "couchbaseDocument": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "doc": {
        "properties": {
          "your_field_name": {
            "store": true,
            ...
          },
          ...
        }
      }
    }
  }
}'
Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
Including/Excluding fields from _source: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html
Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/2.0/dynamic-templates.html
https://forums.couchbase.com/t/about-elasticsearch-plugin/2433
https://forums.couchbase.com/t/custom-maps-for-jsontypes-with-elasticsearch-plugin/395
