Is there a smarter way to reindex Elasticsearch?

I ask because our search is in a state of flux as we work things out, but each time we make a change to the index (change tokenizer or filter, or number of shards/replicas), we have to blow away the entire index and re-index all our Rails models back into Elasticsearch ... this means we have to factor in downtime to re-index all our records.
Is there a smarter way to do this that I'm not aware of?

I think @karmi has it right. However, let me explain it a bit more simply. I occasionally needed to upgrade the production schema with new properties or analysis settings.
I recently started using the scenario described below to do live, constant-load, zero-downtime index migrations. You can do this remotely.
Here are the steps:
Assumptions:
You have an index real1 and aliases real_write and real_read pointing to it,
the client writes only to real_write and reads only from real_read,
the _source property of each document is available.
1. New index
Create the real2 index with the new mapping and settings of your choice.
2. Writer alias switch
Switch the write alias using the following request.
curl -XPOST 'http://esserver:9200/_aliases' -H 'Content-Type: application/json' -d '
{
"actions" : [
{ "remove" : { "index" : "real1", "alias" : "real_write" } },
{ "add" : { "index" : "real2", "alias" : "real_write" } }
]
}'
This is an atomic operation. From this point on, real2 receives new client data on all nodes. Readers still use the old real1 via real_read, so reads are eventually consistent.
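If you issue the alias switch from application code rather than curl, the request body can be built with a small helper. This is only a sketch: the function name swap_alias_actions is made up for illustration, and the resulting dictionary still has to be sent as JSON to POST /_aliases with whatever HTTP client you use.

```python
import json


def swap_alias_actions(alias, old_index, new_index):
    """Build the body for POST /_aliases that moves `alias` in one request.

    Both actions travel in the same request, so the alias never points
    at zero or two indices in between.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Same payload as the curl example above, for the write alias.
body = swap_alias_actions("real_write", "real1", "real2")
print(json.dumps(body, indent=2))
```

The same helper covers the reader switch in step 4 by passing real_read instead of real_write.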
3. Old data migration
Data must be migrated from real1 to real2, but new documents in real2 must not be overwritten with old entries. The migration script should therefore use the bulk API with the create operation (not index or update). I use a simple Ruby script, es-reindex, which shows a nice ETA status:
$ ruby es-reindex.rb http://esserver:9200/real1 http://esserver:9200/real2
UPDATE 2017: You may consider the newer Reindex API instead of the script. It has a lot of interesting features, such as conflict reporting.
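To make the "create, not index" point concrete, here is a rough sketch of how a bulk request body with create actions looks. The helper name bulk_create_body is hypothetical, and the action line is written for recent Elasticsearch versions; older versions also expect a _type in it.

```python
import json


def bulk_create_body(index, docs):
    """Return the NDJSON body for POST /_bulk using `create` actions.

    `docs` is an iterable of (doc_id, source_dict) pairs. With `create`,
    a document whose id already exists in `index` is rejected instead of
    being overwritten, so fresh client writes survive the migration.
    """
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"create": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk API requires every line, including the last, to end in "\n".
    return "".join(line + "\n" for line in lines)


print(bulk_create_body("real2", [("1", {"title": "old entry"})]), end="")
```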
4. Reader alias switch
Now real2 is up to date and clients are writing to it, but they are still reading from real1. Let's update the reader alias:
curl -XPOST 'http://esserver:9200/_aliases' -H 'Content-Type: application/json' -d '
{
"actions" : [
{ "remove" : { "index" : "real1", "alias" : "real_read" } },
{ "add" : { "index" : "real2", "alias" : "real_read" } }
]
}'
5. Backup and delete old index
Writes and reads now go to real2. You can back up and delete the real1 index from the ES cluster.
Done!

Yes, there are smarter ways to re-index your data without downtime.
First, never, ever use the "final" index name as your real index name. So, if you'd like to name your index "articles", don't use that name as a physical index, but create an index such as "articles-2012-12-12" or "articles-A", "articles-1", etc.
Second, create an alias "articles" pointing to that index. Your application will then use this alias, so you'll never need to manually change the index name, restart the application, etc.
Third, when you want or need to re-index the data, re-index it into a different index, let's say "articles-B" -- all the tools in Tire's indexing toolchain support you here.
When you're done, point the alias to the new index. In this way, you not only minimize downtime (there isn't any), you also have a safe snapshot: if you somehow mess up the indexing into the new index, you can just switch back to the old one, until you resolve the issue.

Wrote up a blog post about how I handled reindexing with no downtime recently. Takes some time to figure out all the little things that need to be in place to do so. Hope this helps!
https://summera.github.io/infrastructure/2016/07/04/reindexing-elasticsearch.html
To summarize:
Step 1: Prepare New Index
Create your new index with your new mapping. This can be on the same instance of Elasticsearch or on a brand new instance.
Step 2: Keep Indexes Up To Date
While you're reindexing you want to keep both your new and old indexes up to date. For a write operation, this can be done by sending the write operation to a background worker on both the new and old index.
Deletes are a bit trickier because there is a race condition between deleting and reindexing the record into the new index. So, you'll want to keep track of the records that need to be deleted during your reindex and process these when you are finished. If you aren't performing many deletes, another way would be to eliminate the possibility of a delete during your reindex.
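One way to "keep track of the records that need to be deleted" is a small in-memory tracker that is replayed against the new index once the reindex has finished. This is a sketch only: the DeleteTracker class is invented for illustration, and a real setup would more likely persist the ids in a job queue or database table so they survive a restart.

```python
import json


class DeleteTracker:
    """Remember ids deleted during a reindex so they can be replayed later."""

    def __init__(self):
        self._ids = []       # keeps first-seen order
        self._seen = set()   # avoids duplicate delete actions

    def record(self, doc_id):
        if doc_id not in self._seen:
            self._seen.add(doc_id)
            self._ids.append(doc_id)

    def bulk_delete_body(self, index):
        """NDJSON body for POST /_bulk; delete actions carry no source line."""
        return "".join(
            json.dumps({"delete": {"_index": index, "_id": doc_id}}) + "\n"
            for doc_id in self._ids
        )


tracker = DeleteTracker()
tracker.record("42")
print(tracker.bulk_delete_body("new_index"), end="")
```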
Step 3: Perform Reindexing
You’ll want to use a scrolled search for reading the data and bulk API for inserting. Since after Step 2 you'll be writing new and updated documents to the new index in the background, you want to make sure you do NOT update existing documents in the new index with your bulk API requests.
This means that the operation you want for your bulk API requests is create, not index. From the documentation: “create will fail if a document with the same index and type exists already, whereas index will add or replace a document as necessary”. The main point here is you do not want old data from the scrolled search snapshot to overwrite new data in the new index.
There's a great script on github to help you with this process: es-reindex.
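If you write the reindexing loop yourself, keep in mind that a rejected create is the expected outcome for documents that were already written to the new index in the background. A minimal sketch of how the bulk response could be tallied, assuming the usual response shape (an items list with one entry per action and an HTTP-style status); summarize_bulk_create is a made-up name:

```python
def summarize_bulk_create(response):
    """Split a bulk response for `create` actions into three buckets.

    Returns (created, skipped, failed_ids):
      created    - documents newly written (2xx status)
      skipped    - documents that already existed (409 conflict), which is
                   the desired result when newer data is already in place
      failed_ids - ids of everything else, which needs a closer look
    """
    created, skipped, failed_ids = 0, 0, []
    for item in response.get("items", []):
        result = item.get("create", {})
        status = result.get("status")
        if isinstance(status, int) and 200 <= status < 300:
            created += 1
        elif status == 409:
            skipped += 1
        else:
            failed_ids.append(result.get("_id"))
    return created, skipped, failed_ids
```

Note that the top-level errors flag in the response will be true as soon as one create is rejected, so it cannot be used on its own to decide whether a batch needs attention.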
Step 4: Switch Over
Once you’re finished reindexing, it’s time to switch your search over to the new index. You’ll want to turn deletes back on or process the enqueued delete jobs for the new index. You may notice that searching the new index is a bit slow at first. This is because Elasticsearch and the JVM need time to warm up.
Perform any code changes you need so your application starts searching the new index. You can continue writing to the old index in case you run into problems and need to roll back. If you feel this is unnecessary, you can stop writing to it.
Step 5: Clean Up
At this point you should be completely transitioned to the new index. If everything is going well, perform any necessary cleanup such as:
Delete the old index host if it’s different from the new
Remove serialization code related to your old index

Maybe create another index, reindex all the data into that one, and then make the switch when it's done re-indexing?

Use Elasticsearch Reindex API effectively

I am working on a task of reindexing my Elasticsearch indexes whenever a change happens. I can find two ways to implement this, but they look the same to me unless I am missing something.
My Elasticsearch service gets its data from the Postgres database of service B, which exposes a paginated endpoint.
Approach 1:
Create alias which will point to our existing index.
When reindex is triggered, create a new index and once the reindexing is complete, point the alias, which was pointing to old index, to the newly created index.
Delete the old index.
Approach 2:
Create a new Index.
Use the reindex API to copy the data from old index to new index, which will apply the new changes to the old documents.
To me, both of these look the same. The disadvantage of approach 2 seems to be that it creates a new index name, so we would have to change the index name when querying.
Also, since reindexing would not be a frequent task and I am reading the data from a paginated endpoint and then creating the indexes again, approach 1 seems to make more sense to me.
In approach 1 you are using an alias. In approach 2 you are not.
Both would be the same if you extended approach 2 with a step 3 (point an alias at the new index) and a step 4 (delete the old index).
Refer As you need to do little often.

How to make Logstash replace old data?

I have an Oracle DB. Logstash retrieves data from Oracle and puts it into Elasticsearch.
But when Logstash runs its scheduled export every 5 minutes, Elasticsearch fills up with copies because the old data still exists. This is to be expected. The state of the Oracle data barely changes within those 5 minutes: say 2-3 rows are added and 4-5 are deleted.
How can we replace the old data with the new data without creating copies?
For example:
Delete the whole old index;
Create new index with the same name and make the same configuration (nGram configuration and mapping);
Add all new data;
Wait for 5 minutes and repeat.
It's pretty easy: create a new index for each import, apply the mappings, and afterwards switch your alias to the most recent index. Remove old indices if needed. Your current data will always be searchable while the most recent data is being indexed.
Here are the sources you'll probably need to read:
Use aliases (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html) to point to the most current data when searching in Elasticsearch (BTW, it's always a good idea to have aliases in place).
Use the rollover API (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-rollover-index.html) to create a new index for each import run - note the alias handling here too.
Use index templates (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-templates.html) to automatically apply the mappings/settings to your newly created indices.
Shrink, close and/or delete old indices so that your cluster only handles the data you really need. Have a look at Curator (https://github.com/elastic/curator) as a standalone tool.
You just need to use a fingerprint/hash of each document, or a hash of the unique fields in each document, as the document ID, so that every time you can overwrite the same documents in place with the updated ones while adding new documents as well.
But this approach will not handle rows deleted from Oracle.
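The fingerprint idea boils down to deriving the document ID deterministically from the fields that identify a row, so that a repeated import overwrites instead of duplicating. In Logstash this would normally be done in the pipeline configuration rather than in code; the Python below only illustrates the principle, and the function name and the sample column names are made up.

```python
import hashlib
import json


def fingerprint_id(row, key_fields):
    """Derive a stable document id from the identifying fields of a row.

    The same key values always produce the same id, so indexing the row
    again replaces the existing document instead of adding a copy.
    """
    payload = json.dumps([row[field] for field in key_fields], default=str)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Hypothetical rows: same key, different non-key data.
first = {"order_no": 1001, "line": 2, "status": "open"}
later = {"order_no": 1001, "line": 2, "status": "shipped"}
print(fingerprint_id(first, ["order_no", "line"]) == fingerprint_id(later, ["order_no", "line"]))
```

Hashing only the identifying fields, and not the whole row, is what lets an updated row land on the same ID.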

Elasticsearch Reindexing race condition

Hello elasticsearch users/experts,
I have a bit of trouble understanding the race condition problem with the reindex api of Elasticsearch and would like to hear if anyone has found a solution about it.
I have searched a lot of places and could not find any clear solution (most of the solutions date back to before the reindex api).
As you might know, the (now) standard way of reindexing a document (after changing the mapping, for example) is to use an alias.
Suppose the alias points to "old_index". We then create a new index called "new_index" with the new mapping, we call the reindex api to reindex the documents from 'old_index' to 'new_index' and then switch the alias to point to the new_index (and remove the alias pointer to old_index). It seems this is the standard way of reindexing and that is what I have seen on almost all recent websites I visited.
My questions are the following, for using this method, while I would not want downtime (so the user should still be able to search documents), and I would still want to be able to inject documents to ElasticSearch while the reindexing process is happening :
1. If documents would still be incoming while the reindexing process is working (which would probably take a lot of time), how would the reindexing process ensure that the document would be ingested in the old index (to be able to search for it while the reindexing process is working) but still would be correctly reindexed to the new index?
2. If a document is modified in the old index, after it has been reindexed (mapped to the new index), while the reindexing process is working, how would ElasticSearch ensure that this modification is also taken account in the new index?
3. (Similar to 2.) If a record is deleted in the old index, after it has been reindexed (mapped to the new index), while the reindexing process is working, how would ElasticSearch ensure that this removal is also taken account in the new index?
Basically in a scenario where it is not affordable to make any indexing mistake for a document, how would one be able to proceed to make sure the reindexing goes without any of the above problems?
Has anyone any idea? And if there is no solution without any downtime, how would we proceed with the least amount of downtime in that case?
Thanks in advance!
Apologies if it's too verbose, but my two cents:
If documents would still be incoming while the reindexing process is
working (which would probably take a lot of time), how would the
reindexing process ensure that the document would be ingested in the
old index (to be able to search for it while the reindexing process is
working) but still would be correctly reindexed to the new index?
While reindexing runs from source to destination, the alias still points (and must point) to source_index. All modifications to this index happen independently, and these updates and deletes take effect immediately.
Let's say the state of source_index changes from t to t+1.
If you ran a reindexing job to dest_index at time t, it still consumes the snapshot of source_index as of t. You need to run the reindexing job again to get the latest data of source_index, i.e. the data at t+1, into dest_index.
Ingestions at source_index and ingestions from source_index to dest_index are independent processes.
A reindexing job does not, by itself, guarantee consistency between source_index and dest_index.
If a document is modified in the old index, after it has been
reindexed (mapped to the new index), while the reindexing process is
working, how would ElasticSearch ensure that this modification is also
taken account in the new index?
It won't be reflected in the new index, because reindexing works from the snapshot of source_index at time t.
You would need to reindex again. The general approach is to have a scheduler that reruns the reindexing process every few hours.
Updates and deletes at source_index can happen every few minutes (if you are using a scheduler) or in real time (if you are using an event-based approach).
However, schedule full reindexing (from source_index to dest_index) only once or twice a day, as it is an expensive process.
(Similar to 2.) If a record is deleted in the old index, after it has
been reindexed (mapped to the new index), while the reindexing process
is working, how would ElasticSearch ensure that this removal is also
taken account in the new index?
Again, you need to run a new job/reindexing process.
Version_type: External
Just as a side note, one interesting thing you can do during reindexing is to use version_type: external, which ensures that only the updated or missing documents from source_index are reindexed into dest_index.
You can refer to this LINK for more info on this
POST _reindex
{
"source": {
"index": "source_index"
},
"dest": {
"index": "dest_index",
"version_type": "external"
}
}
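To see what version_type: external means for a second reindex pass, here is a small simulation in plain Python. It does not talk to Elasticsearch; it only mimics the rule that a document is written when it is missing from the destination or when the source version is strictly higher. The function name and the dictionaries are illustrative.

```python
def reindex_external(source, dest):
    """Simulate one reindex pass with external versioning.

    `source` and `dest` map doc_id -> (version, body). `dest` is updated
    in place; the ids that were actually written are returned.
    """
    written = []
    for doc_id, (version, body) in source.items():
        current = dest.get(doc_id)
        # Write only if the document is missing or the source is newer.
        if current is None or version > current[0]:
            dest[doc_id] = (version, body)
            written.append(doc_id)
    return written


source = {"a": (3, "edited"), "b": (1, "unchanged"), "c": (1, "new")}
dest = {"a": (2, "stale"), "b": (1, "unchanged"), "d": (5, "deleted in source")}
print(reindex_external(source, dest))
print(sorted(dest))
```

Note that document d stays in the destination: a rerun picks up updates and additions, but a deletion in the source is not carried over and has to be handled separately.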

ElasticSearch : Concurrent updates to index while _reindex for the same index in progress

We have been using this link as a reference to accommodate any change in the mappings for a field in our index with zero downtime.
Question:
Considering the same example taken in the above link, when we reindex the data from
my_index_v1 to my_index_v2 using _reindex API. Does ElasticSearch guarantee that any concurrent updates happening in my_index_v1 would make it to my_index_v2 for sure?
For example, a document might get updated in my_index_v1 before or after it is reindexed by api to my_index_v2.
Ultimately, we just need to ensure that while we did not want any downtime for doing any mapping changes (hence did _reindex using alias and other cool stuff by ES), we also want to ensure that none of the add/update were missed while this huge reindex was in progress, as we are talking about reindexing >50GB data.
Thanks,
Sandeep
The reindex API will not take into account changes made after the process has started.
One thing you can do is, once the reindexing process is done, start it again with version_type: external.
This copies from the source index to the destination index only those documents that have a newer version in the source or are not yet present in the destination.
Here is the example
POST _reindex
{
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter",
"version_type": "external"
}
}
Setting version_type to external will cause Elasticsearch to preserve the version from the source, create any documents that are missing, and update any documents that have an older version in the destination index than they do in the source index.
One way to solve this is by using two aliases instead of one. One for queries (let’s call it read_alias), and one for indexing (write_alias). We can write our code so that all indexing happens through the write_alias and all queries go through the read_alias. Let's consider three periods of time:
Before rebuild
read_alias: points to current_index
write_alias: points to current_index
All queries return current data.
All modifications go into current_index.
During rebuild
read_alias: points to current_index
write_alias: points to new_index
All queries keep getting data as it existed before the rebuild, since searching code uses read_alias.
All rows, including modified ones, get indexed into the new_index, since both the rebuilding loop and the DB trigger use the write_alias.
After rebuild
read_alias: points to new_index
write_alias: points to new_index
All queries return new data, including the modifications made during rebuild.
All modifications go into new_index.
It should even be possible to get the modified data from queries while rebuilding, if we make the DB trigger code index modified rows into both the indices while the rebuild is going on (i.e., while the aliases point to different indices).
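The dual-write variant from the last paragraph can be reduced to one decision in the indexing code: if the two aliases resolve to different indices, a rebuild is in progress and the row goes to both. A minimal sketch, assuming the application has already looked up which index each alias points to (the function name is made up):

```python
def write_targets(read_index, write_index):
    """Return the indices a modified row should be indexed into.

    Outside a rebuild both aliases resolve to the same index, so there is
    one target. During a rebuild they differ, and writing to both keeps
    queries on the read alias up to date as well.
    """
    if read_index == write_index:
        return [write_index]
    # Old index first, then the new one.
    return [read_index, write_index]


print(write_targets("current_index", "current_index"))
print(write_targets("current_index", "new_index"))
```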
It is often better to rebuild the index from source data using custom code instead of relying on the _reindex API, since that way we can add new fields that may not have been stored in the old index.
This article has some more details.
It looks like it does it based off of snapshots of the source index.
Which would suggest to me that they couldn't reasonably honor changes to the source happening in the middle of the process. You avoid downtime on the search side, but I think you would need to pause updates on the indexing side during this process.
Something you could do is keep track on your index of when the document was last modified. Then once you finish indexing and switch the alias, you query the old index for what changed in the middle. Propagate those changes over to the new index and you get eventual consistency.
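The catch-up query for that last idea could look roughly like this. It assumes every document carries a last-modified timestamp; the field name updated_at and the helper name are hypothetical, and the body is meant to be sent to the old index's _search endpoint.

```python
import json


def changed_since_query(field, since_iso):
    """Search body matching documents modified at or after `since_iso`.

    `since_iso` should be the moment the reindex started, so everything
    touched while it was running is found and can be copied over.
    """
    return {
        "query": {"range": {field: {"gte": since_iso}}},
        "sort": [{field: "asc"}],
    }


print(json.dumps(changed_since_query("updated_at", "2018-03-01T12:00:00Z"), indent=2))
```

This finds updated and newly added documents only; deleted documents leave no trace in the old index and need separate tracking.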

Elasticsearch Reindexing while updating documents?

What if I've changed the mapping for my index and want to reindex?
I'm currently using the Java API, which does not yet have the reindex functionality, so using bulk would solve my problem. The solution would look something like this:
ref How to reindex in ElasticSearch via Java API
Long time ago
create index MY_INDEX_1
create mapping for MY_INDEX_1
create alias MY_INDEX_1 -> MY_INDEX
create documents in MY_INDEX
Time to reindex!
create index MY_INDEX_2
create mapping for MY_INDEX_2
scroll search + bulk all documents from MY_INDEX_1 to MY_INDEX_2
Renaming and deletion of old index
create alias MY_INDEX_2 -> MY_INDEX
delete alias MY_INDEX_1 -> MY_INDEX
delete index MY_INDEX_1
But what happens if, while all documents are being reindexed, a user updates a document that was already reindexed at the beginning?
Or if this happens between reindexing and renaming the aliases?
Possible solutions?
One way would be to use external versioning, so that a document with a higher version is not overwritten.
Or could it be solved in another way?
Or, between renaming the aliases and deleting MY_INDEX_1, reindex all documents that have been indexed since the reindexing started? But then it could still be the case that a document is updated between renaming the aliases and the second reindexing.
Or should we lock while reindexing? Seems like a bad solution.
I think this is your real question:
But what happens if, while all documents are being reindexed, a user updates a document that was already reindexed at the beginning? Or if this happens between reindexing and renaming the aliases?
I just asked a question that is very close, but still has questions that need to be resolved separately. However, my research allows me to answer this question. See the question for details and references.
To answer your question, you create a second alias just before reindexing. I call this a duplicate_write_alias. If your application sees this second alias, it writes first to the old and then to the new index via the two aliases (the order is important to avoid a potential race). When the indexing is done, your indexing process deletes this duplicate_write_alias and moves your MY_INDEX alias to the new MY_INDEX_2 as noted above. Do the alias switch in one atomic command.
As I noted in my question, you still have to deal with potential 'index does not exist' errors because of a remaining race between your application's checking for existence of the alias and the alias being deleted. I'm hoping there's a better answer than 'always write twice and ignore errors' or 'check and hope for the best'...
I think there is also another (uglier) way:
You can disable write operations on the source index while reindexing. This makes your APIs temporarily unusable for writes, but you don't have to:
Maintain a second storage to hold the truth
Deal with inconsistency
Flag documents that should be deleted after the migration
You can use Elasticsearch's own storage to create snapshots between indices
You can signal users of your API to send their changes again later (when the indexing is done)
Downsides:
You have downtime, at least for write operations
You need more logic to handle errors if the index is not set back to allow writes (automatic recovery etc.)
Holding more than one index causes more storage space to be used.
For more information look here:
https://www.elastic.co/guide/en/elasticsearch/reference/6.2/index-modules.html
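Disabling writes is done through the index.blocks.write index setting, which is changed with a PUT on the index's _settings endpoint. A small sketch of building that request; the helper name is invented, and actually sending the request is left to your HTTP client:

```python
import json


def write_block_request(index, blocked):
    """Return (method, path, body) for toggling the write block on `index`.

    blocked=True rejects writes while reads keep working;
    blocked=False lifts the block again after the migration.
    """
    return ("PUT", f"/{index}/_settings", {"index.blocks.write": blocked})


method, path, body = write_block_request("MY_INDEX_1", True)
print(method, path, json.dumps(body))
```

Whatever calls this should make sure the block is lifted again on every error path, which is the extra recovery logic mentioned in the downsides above.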
