what is offline and online indexing in Elastic search? and when do we need to reindex? - elasticsearch

what is offline and online indexing in Elastic search? I did my research but I couldn't find enough resources to see what these terms mean? any idea? and also when do we need to reindex? any examples would be great
The terms offline and online indexing are used here.
https://spark-summit.org/2014/wp-content/uploads/2014/07/Streamlining-Search-Indexing-using-Elastic-Search-and-Spark-Holden-Karau.pdf

Reindexing
The most basic form if reindexing just copies one index to another.
I have used this form of reindexing to change a mapping.
Elasticsearch doesn't allow you to change a mapping, so if you want to change a mapping you have to create a new index (index2) with a new mapping and then reindex. The reindex will fill that new mapping with the data of the old index.
The command below will move everything from index to index2.
curl -XPOST 'localhost:9200/_reindex?pretty' -d'
{
"source": {
"index": "index"
},
"dest": {
"index": "index2"
}
}'
You can also use reindexing to fill a new index with a part of the old one. You can do so by using a couple of parameters. The example below will copy the newest 1000 documents.
POST /_reindex
{
"size": 1000,
"source": {
"index": "index",
"sort": { "date": "desc" }
},
"dest": {
"index": "index2"
}
}
For more examples about reindexing please have a look at the official documentation.
offline vs online indexing
In ONLINE mode the new index is built while the old index is accessible to reads and writes. any update on the old index will also get applied to the new index.
In OFFLINE mode the table is locked up front for any read or write, and then the new index gets built from the old index. No read or write operation is permitted on the table while the index is being rebuilt. Only when the operation is done is the lock on the table released and reads and writes are allowed again.

Related

How to reindex and change _type

We need to migrate a number of indexes from ElasticSearch 6.8 to ElasticSearch 7.x. To be able to do this, we now need to go back and fix a large number of documents are the _type field of these documents aren't _doc as required. We fixed this for newer indexes, but some of the older data which we still need has other values in here.
How do we reindex these indexes and also change the _type field?
POST /_reindex
{
"source": {
"index": "my-index-2021-11"
},
"dest": {
"index": "my-index-2021-11-n"
},
"script": {
"source": "ctx._type = '_doc';"
}
}
I saw a post indicating the above might work, but on execution, the value for _type in the next index was still the existing of my-index.
The one option I can think of is to iterate through each document in the index and add it to the new index again which should create the correct _type, but that will take days to complete, so not so keen on doing that.
I think below should work . Please test it out, before running on actual data
{
"source": {
"index": "my-index-2021-11"
},
"dest": {
"index": "my-index-2021-11-n",
"type":"_doc"
}
}
Docs to help in upgradation
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/reindex-upgrade-inplace.html

Elasticsearch reindex only missing documents

I am trying to reindex an index of 200M of documents from cluster A to cluster B. I used the Reindex API with a remote source and everything worked fine. In the menwhile of my reindex some documents were added into the cluster A so I want to add them as well into the cluster B.
I launched again the reindex request but it seems that the reindex process is taking a lot, like if it was reindexing everything again.
My question is, is the cluster reindexing from scratch all the documents, even if they didn't change ?
My elasticsearch version is the 5.6
The elasticsearch does not know there is a change in the documents or not. So it tries to have each document completely in both indices. If you have a field like insert_time in your data, you can use reindex with query to limit the part of index of A to become reindex on B. This will let you use your older reindex and finish it faster. Reindex by query would be something like this:
POST _reindex
{
"source": {
"index": "A",
"query": {
"range": {
"insert_time": {
"gt": "time you want"
}
}
},
"dest": {
"index": "B"
}
}

Elasticsearch reindex API - Not able to reindex large number of documents

I'm using Elasticsearch's reindex API to migrate logs from an old cluster to a new version 7.9.2 cluster. Here is the command I'm using.
curl -X POST "new_host:9200/_reindex?pretty&refresh&wait_for_completion=true" -H 'Content-Type: application/json' -d'
{
"source": {
"remote": {
"host": "old_host:9200"
},
"index": "*",
"size": 10000,
"query": {
"match_all": {}
}
},
"conflicts": "proceed",
"dest": {
"index": "logstash"
}
}'
This gets only the last 10000 documents or 1 batch and request gets completed after that. However, I need to reindex more than a million documents. Is there a way to make the request run for all the matched documents? Can we set the number of batches in the request or make the request issue batches till all documents are indexed?
One option I can think of is to send request recursively by modifying query on datetime. Is there a better way to do it? Can I get all the matched documents (1 million plus) in one request?
Remove the query and size params in order to get all the data. If you need to filter only desired documents using a query, just remove the size to fetch all matched logs.
Using wait_for_completion=false as query param will return the task id and you will be able to monitor the reindex progress using GET /_tasks/<task_id>.
If you need or want to break the reindexing into serveral steps/chunks consider using the slice feature.
BTW: Reindex one index after another instead all at one using * and consider using daily/monthly indicies as it becomes easier to resume the process on errors and manage the log retention in comparison to one whole index.
In order to improve the speed, you should reduce the replicas to 0 and set refresh_interval=-1 in the destination index bevore reindexing and reset the values afterwards.
curl -X POST "new_host:9200/_reindex?pretty&wait_for_completion=false" -H 'Content-Type: application/json' -d'
{
"source": {
"remote": {
"host": "old_host:9200"
},
"index": "index_name"
},
"conflicts": "proceed",
"dest": {
"index": "logstash"
}
}'
UPDATE based on comments:
While reindexing, there is at least one error what causes the reindexing to stop. The error is being caused by at least one document (id=xiB9...) having 'OK' as value in field 'fields.StatusCode'. But the mapping in the destination index has long as data type what is causing the mentioned exception.
The solution is to change the source documents StatusCode to 200 for example, but there will be probably more documents causing the very same error.
Another solution is to change the mapping in the destination index to keyword type - that requires a handmade mapping set before any data has been inserted and maybe reindexing the already present data.

Update configuration for actively used index without data loss

Sometimes, I need to update mappings, settings, or bind default pipelines to the actively used index.
For the time being, I am using a method with data loss as follows:
update the index template with proper mapping (or binding the default pipeline by index.default_pipeline);
create a_new_index (matching the template index_patterns);
reindex the index_to_fix to a_new_index to migrate the data already indexed;
use alias to redirect the coming indexing request to a_new_index (the alias will have the same name as index_to_fix to ensure the indexing is undisturbed) and delete the index_to_fix;
But between step 3 and step 4, there is a time gap, during which the newly indexed data are lost in the original index_to_fix.
Is there a way, to update configurations for actively used index without any data loss?
Thanks for the help of #LeBigCat, after some discussions. I think this problem could be solved in three steps.
Use Alias for CRUD
First thing first, try not to use index directly, use alias if possible; since you can't use an alias with the same name as the existed indices, directly you can't replace the index even if it's broken (badly designed). The easiest way is to use a template and include the index name directly in the alias.
PUT _template/test
{
...
"aliases" : {
"{index}-alias" : {}
}
}
Redirect the Indexing
Since the index_to_fix is being actively used, after updating the template and create a new index a_new_fix, we can use alias to redirect the indexing to a_new_fix.
POST /_aliases
{
"actions" : [
{ "add": { "index": "a_new_index", "alias": "index_to_fix-alias" } },
{ "remove": { "index": "index_to_fix", "alias": "index_to_fix-alias" } }
]
}
Migrating the Data
Simply use _reindex to migrate all the data from index_to_fix to a_new_index.
POST _reindex
{
"source": {
"index": "index_to_fix"
},
"dest": {
"index": "index_to_fix-alias"
}
}

Best way to reindex multiple indices in ElasticSearch

I am using Elasticsearch 5.1.1 and have 500 + indices created with default mapping provided by ES.
Now we have decided to use dynamic templates.
In order to apply this template/mapping to old indices I need to reindex all indices.
What is the best way to do it? Can we use Kibana for this ? Couldn't find sufficient documentation to do so.
Example: Reindex from a daily index to a monthly index (August)
POST _reindex?slices=10&refresh
{
"source": {
"index": "myindex-2019.08.*"
},
"dest": {
"index": "myindex-2019.08"
}
}
Monitor reindex task (wait until is finished)
GET _tasks?detailed=true&actions=*reindex
Check if new index was created
GET _cat/indices/myindex-2019.08*?v&s=index
You can delete old indices
DELETE myindex-2019.08.*
Source:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
You can use the _reindex API which can also reindex multiple indices. It was specifically built for this.
Bash script to re-index all indices matching a pattern: https://gist.github.com/hartfordfive/e507bc47e17f4e03a89055918900e44d
If you want to filter some field and reindex it from index you can use this.
POST _reindex
{
"source": {
"index": "auditbeat",
"query": {
"match": {
"agent.version": "7.6.0"
}
}
},
"dest": {
"index":"auditbeat-7.6.0"
}
}

Resources