Elasticsearch reindex only missing documents

I am trying to reindex an index of 200M documents from cluster A to cluster B. I used the Reindex API with a remote source and everything worked fine. While the reindex was running, some documents were added to cluster A, so I want to add them to cluster B as well.
I launched the reindex request again, but it is taking a long time, as if it were reindexing everything again.
My question is: is the cluster reindexing all the documents from scratch, even if they didn't change?
My Elasticsearch version is 5.6.

Elasticsearch does not know whether a document has changed, so it copies every document again to make sure both indices hold the complete set. If you have a field like insert_time in your data, you can use reindex with a query to limit the part of index A that gets reindexed into B. This lets you keep the result of your earlier reindex and finish faster. A reindex with a query would look something like this:
POST _reindex
{
  "source": {
    "index": "A",
    "query": {
      "range": {
        "insert_time": {
          "gt": "time you want"
        }
      }
    }
  },
  "dest": {
    "index": "B"
  }
}
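To avoid hand-editing the request for every catch-up run, the body can be generated. This is only a minimal Python sketch, not something from the answer above: the function name build_catchup_reindex, the only_missing flag and the example timestamp are made up for illustration, and insert_time is assumed to exist in the documents. With only_missing=True it sets op_type to create and conflicts to proceed, which the Reindex API documentation describes as the way to create only documents that are missing in the destination. Note that documents changed in A after the first copy are then skipped, not updated, and the source is still scanned in full unless a query narrows it.

```python
import json


def build_catchup_reindex(source_index, dest_index, time_field=None, since=None,
                          only_missing=False):
    """Build a _reindex request body for a catch-up run.

    A "remote" section (as used for cluster-to-cluster copies) is left out
    here and would have to be added to "source" by the caller.
    """
    source = {"index": source_index}
    if time_field is not None and since is not None:
        # Only read documents newer than the given point in time.
        source["query"] = {"range": {time_field: {"gt": since}}}

    dest = {"index": dest_index}
    body = {"source": source, "dest": dest}
    if only_missing:
        # Create documents that do not exist yet; existing IDs raise version
        # conflicts, which "proceed" counts instead of aborting the request.
        dest["op_type"] = "create"
        body["conflicts"] = "proceed"
    return body


if __name__ == "__main__":
    # The timestamp is a placeholder, not a value from the question.
    print(json.dumps(
        build_catchup_reindex("A", "B", "insert_time", "2018-01-01T00:00:00Z"),
        indent=2))
```

The printed JSON can be sent as the body of POST _reindex on cluster B.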

Related

How to reindex and change _type

We need to migrate a number of indexes from Elasticsearch 6.8 to Elasticsearch 7.x. To be able to do this, we now need to go back and fix a large number of documents, as the _type field of these documents isn't _doc as required. We fixed this for newer indexes, but some of the older data which we still need has other values here.
How do we reindex these indexes and also change the _type field?
POST /_reindex
{
  "source": {
    "index": "my-index-2021-11"
  },
  "dest": {
    "index": "my-index-2021-11-n"
  },
  "script": {
    "source": "ctx._type = '_doc';"
  }
}
I saw a post indicating the above might work, but on execution, the value of _type in the new index was still the existing one from my-index.
The one option I can think of is to iterate through each document in the index and add it to the new index again, which should create the correct _type, but that will take days to complete, so I'm not keen on doing that.

I think the request below should work. Please test it before running it on real data.
POST /_reindex
{
  "source": {
    "index": "my-index-2021-11"
  },
  "dest": {
    "index": "my-index-2021-11-n",
    "type": "_doc"
  }
}
Docs to help with the upgrade:
https://www.elastic.co/guide/en/elasticsearch/reference/7.17/reindex-upgrade-inplace.html
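Since the question mentions a number of indexes, the same request can be generated once per index. The following Python sketch is just an illustration under assumptions: the function name build_type_fix_requests is invented, the "-n" suffix follows the naming in the question, and the second index name in the example is hypothetical. It only builds the request bodies; sending them is left to whatever client you use.

```python
import json


def build_type_fix_requests(index_names, suffix="-n", dest_type="_doc"):
    """Return one _reindex body per index, writing into <name><suffix> with the given type."""
    return [
        {
            "source": {"index": name},
            "dest": {"index": name + suffix, "type": dest_type},
        }
        for name in index_names
    ]


if __name__ == "__main__":
    # "my-index-2021-10" is a made-up name used only to show the loop.
    for body in build_type_fix_requests(["my-index-2021-11", "my-index-2021-10"]):
        print("POST /_reindex")
        print(json.dumps(body, indent=2))
```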

Elasticsearch: How can I tell the _reindex API to continue indexing docs while the source index is still receiving new docs?

I have indices that are created daily. They are filled by an agent which collects logs every second of the day, and I am reindexing them (by field) to new indices using the _reindex API.
How can I tell the _reindex API to keep reindexing while the source index is still receiving new documents?
Any help would be really appreciated!
Thank you

You cannot force the reindex API to stay online and pick up newly received documents.
But I have a solution. You can add a date field (index_time) to your source index, then write an hourly cron job that runs the reindex API with a query that selects the docs indexed in the last hour via index_time.
POST _reindex
{
  "source": {
    "index": "my-index-000001",
    "query": {
      "range": {
        "index_time": { "gte": "now-1h" }
      }
    }
  },
  "dest": {
    "index": "my-new-index-000001"
  }
}
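One thing to watch with now-1h is that the window moves with the moment the cron job actually fires, so a late run may leave a small gap between two consecutive windows. A possible variation, sketched here in Python and not part of the answer above, is to compute a fixed window for the previous full clock hour and pass explicit bounds. The function names are invented for illustration, and index_time is assumed to be mapped as a date field.

```python
from datetime import datetime, timedelta, timezone


def previous_hour_window(now):
    """Return (start, end) of the last fully elapsed clock hour before `now`."""
    end = now.replace(minute=0, second=0, microsecond=0)
    start = end - timedelta(hours=1)
    return start, end


def build_hourly_reindex(source_index, dest_index, now, time_field="index_time"):
    """Build a _reindex body that covers exactly one closed-open hourly window."""
    start, end = previous_hour_window(now)
    return {
        "source": {
            "index": source_index,
            "query": {
                "range": {
                    # gte/lt gives [start, end), so consecutive runs do not overlap.
                    time_field: {"gte": start.isoformat(), "lt": end.isoformat()}
                }
            },
        },
        "dest": {"index": dest_index},
    }


if __name__ == "__main__":
    body = build_hourly_reindex("my-index-000001", "my-new-index-000001",
                                datetime.now(timezone.utc))
    print(body)
```

Running the job a few minutes after the full hour also gives documents near the end of the window time to become searchable in the source index.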

Elasticsearch reindex API - Not able to reindex large number of documents

I'm using Elasticsearch's reindex API to migrate logs from an old cluster to a new version 7.9.2 cluster. Here is the command I'm using.
curl -X POST "new_host:9200/_reindex?pretty&refresh&wait_for_completion=true" -H 'Content-Type: application/json' -d'
{
  "source": {
    "remote": {
      "host": "old_host:9200"
    },
    "index": "*",
    "size": 10000,
    "query": {
      "match_all": {}
    }
  },
  "conflicts": "proceed",
  "dest": {
    "index": "logstash"
  }
}'
This gets only the last 10000 documents (one batch), and the request completes after that. However, I need to reindex more than a million documents. Is there a way to make the request run for all the matched documents? Can we set the number of batches in the request, or make the request issue batches until all documents are indexed?
One option I can think of is to send requests repeatedly while modifying the query on datetime. Is there a better way to do it? Can I get all the matched documents (1 million plus) in one request?

Remove the query and size params in order to get all the data. If you need a query to select only certain documents, just remove the size to fetch all matching logs.
Using wait_for_completion=false as a query param returns the task id, and you can then monitor the reindex progress using GET /_tasks/<task_id>.
If you need or want to break the reindexing into several steps/chunks, consider using the slice feature.
BTW: Reindex one index after another instead of all at once using *, and consider using daily/monthly indices, as that makes it easier to resume the process after errors and to manage log retention compared to one big index.
To improve the speed, you should reduce the replicas to 0 and set refresh_interval=-1 on the destination index before reindexing, and reset the values afterwards.
curl -X POST "new_host:9200/_reindex?pretty&wait_for_completion=false" -H 'Content-Type: application/json' -d'
{
  "source": {
    "remote": {
      "host": "http://old_host:9200"
    },
    "index": "index_name"
  },
  "conflicts": "proceed",
  "dest": {
    "index": "logstash"
  }
}'
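The settings change and the progress check mentioned above can be sketched in Python as follows. This is an illustration under assumptions rather than a finished tool: the function names are invented, the restore values (1 replica, "1s") are assumed defaults that you should replace with your own, and the sample task response is hand-written in the shape returned by GET /_tasks/<task_id>. The progress estimate adds up created, updated and deleted and compares the sum to total.

```python
def bulk_load_settings():
    """Body for PUT /<dest_index>/_settings before the reindex starts."""
    return {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}


def restore_settings(replicas=1, refresh_interval="1s"):
    """Body for PUT /<dest_index>/_settings after the reindex (values are assumptions)."""
    return {"index": {"number_of_replicas": replicas, "refresh_interval": refresh_interval}}


def reindex_progress(task_response):
    """Estimate progress from a GET /_tasks/<task_id> response of a reindex task."""
    status = task_response["task"]["status"]
    total = status.get("total", 0)
    done = status.get("created", 0) + status.get("updated", 0) + status.get("deleted", 0)
    return {
        "completed": task_response.get("completed", False),
        "done": done,
        "total": total,
        "fraction": done / total if total else 0.0,
    }


if __name__ == "__main__":
    # Hand-written sample, not real output from a cluster.
    sample = {
        "completed": False,
        "task": {"status": {"total": 1000000, "created": 250000,
                            "updated": 0, "deleted": 0, "batches": 250}},
    }
    print(bulk_load_settings())
    print(reindex_progress(sample))
    print(restore_settings())
```

With conflicts set to proceed, skipped documents are reported under version_conflicts and are not included in this estimate.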
UPDATE based on comments:
While reindexing, at least one error occurs that causes the reindexing to stop. The error is caused by at least one document (id=xiB9...) having 'OK' as the value of the field 'fields.StatusCode', while the mapping in the destination index defines that field as long, which triggers the mentioned exception.
One solution is to change StatusCode in the source documents to, for example, 200, but there will probably be more documents causing the very same error.
Another solution is to change the mapping in the destination index to the keyword type. That requires an explicit mapping that is set before any data has been inserted, and possibly reindexing the data that is already present.
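For the second option, the explicit mapping has to be part of the index creation request. The Python sketch below only builds that request body from a dotted field path. The helper name mapping_for_field is invented, and the typeless mapping layout assumes the 7.x destination cluster from the question; the resulting body would be sent with PUT /logstash before any document is indexed.

```python
import json


def mapping_for_field(dotted_path, field_type):
    """Build an index-creation body that maps one (possibly nested) field explicitly."""
    node = {"type": field_type}
    # Wrap the leaf in one "properties" level per path segment, innermost first.
    for name in reversed(dotted_path.split(".")):
        node = {"properties": {name: node}}
    return {"mappings": node}


if __name__ == "__main__":
    print(json.dumps(mapping_for_field("fields.StatusCode", "keyword"), indent=2))
```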

Best way to reindex multiple indices in ElasticSearch

I am using Elasticsearch 5.1.1 and have 500+ indices created with the default mapping provided by ES.
Now we have decided to use dynamic templates.
In order to apply this template/mapping to the old indices, I need to reindex all of them.
What is the best way to do it? Can we use Kibana for this? I couldn't find sufficient documentation on how to do so.

Example: Reindex from a daily index to a monthly index (August)
POST _reindex?slices=10&refresh
{
  "source": {
    "index": "myindex-2019.08.*"
  },
  "dest": {
    "index": "myindex-2019.08"
  }
}
Monitor the reindex task (wait until it is finished)
GET _tasks?detailed=true&actions=*reindex
Check if new index was created
GET _cat/indices/myindex-2019.08*?v&s=index
You can delete old indices
DELETE myindex-2019.08.*
Source:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
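With hundreds of indices, the daily-to-monthly request above can be generated from the list of index names. The following Python sketch is an assumption-laden illustration: the function name and the regular expression are made up, and it presumes the <prefix>-YYYY.MM.DD naming from the example. It only produces the request bodies, one per month.

```python
import re

# Matches daily indices named like "myindex-2019.08.01".
DAILY_INDEX = re.compile(r"^(?P<prefix>.+)-(?P<year>\d{4})\.(?P<month>\d{2})\.(?P<day>\d{2})$")


def monthly_reindex_bodies(index_names):
    """Group daily index names by month and return one _reindex body per month."""
    months = set()
    for name in index_names:
        match = DAILY_INDEX.match(name)
        if match:  # names that do not follow the daily pattern are ignored
            months.add("{prefix}-{year}.{month}".format(**match.groupdict()))
    return [
        {"source": {"index": month + ".*"}, "dest": {"index": month}}
        for month in sorted(months)
    ]


if __name__ == "__main__":
    names = ["myindex-2019.08.01", "myindex-2019.08.02", "myindex-2019.09.01"]
    for body in monthly_reindex_bodies(names):
        print(body)
```

The list of names could come from GET _cat/indices, and each body can be posted with the same slices and refresh parameters as in the example.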

You can use the _reindex API which can also reindex multiple indices. It was specifically built for this.

Bash script to re-index all indices matching a pattern: https://gist.github.com/hartfordfive/e507bc47e17f4e03a89055918900e44d

If you want to reindex only the documents that match a certain field value, you can use this.
POST _reindex
{
  "source": {
    "index": "auditbeat",
    "query": {
      "match": {
        "agent.version": "7.6.0"
      }
    }
  },
  "dest": {
    "index": "auditbeat-7.6.0"
  }
}

What is offline and online indexing in Elasticsearch, and when do we need to reindex?

What is offline and online indexing in Elasticsearch? I did my research, but I couldn't find enough resources that explain what these terms mean. Any idea? Also, when do we need to reindex? Any examples would be great.
The terms offline and online indexing are used here.
https://spark-summit.org/2014/wp-content/uploads/2014/07/Streamlining-Search-Indexing-using-Elastic-Search-and-Spark-Holden-Karau.pdf

Reindexing
The most basic form of reindexing just copies one index to another.
I have used this form of reindexing to change a mapping.
Elasticsearch doesn't allow you to change the mapping of an existing field, so if you want to change a mapping you have to create a new index (index2) with the new mapping and then reindex. The reindex will fill the new index, with its new mapping, with the data of the old index.
The command below will copy everything from index to index2.
curl -XPOST 'localhost:9200/_reindex?pretty' -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "index"
  },
  "dest": {
    "index": "index2"
  }
}'
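The mapping change described above boils down to a fixed order of requests: create the new index with the new mapping, reindex, then compare document counts. A rough Python sketch of that order is shown below. The function name mapping_change_steps is invented, and the example mapping (a date field named date, in the typeless 7.x layout) is a placeholder; on older versions the mapping body has to include the type name.

```python
import json


def mapping_change_steps(old_index, new_index, mappings):
    """Return the ordered (method, path, body) requests for a mapping change."""
    return [
        # 1. Create the new index with the new mapping before any data arrives.
        ("PUT", "/" + new_index, {"mappings": mappings}),
        # 2. Copy all documents from the old index into the new one.
        ("POST", "/_reindex", {"source": {"index": old_index},
                               "dest": {"index": new_index}}),
        # 3. Compare counts; they only match once the new index has refreshed.
        ("GET", "/" + old_index + "/_count", None),
        ("GET", "/" + new_index + "/_count", None),
    ]


if __name__ == "__main__":
    example_mappings = {"properties": {"date": {"type": "date"}}}  # placeholder mapping
    for method, path, body in mapping_change_steps("index", "index2", example_mappings):
        print(method, path, "" if body is None else json.dumps(body))
```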
You can also use reindexing to fill a new index with a part of the old one. You can do so by using a couple of parameters. The example below will copy the newest 1000 documents.
POST /_reindex
{
  "size": 1000,
  "source": {
    "index": "index",
    "sort": { "date": "desc" }
  },
  "dest": {
    "index": "index2"
  }
}
For more examples about reindexing please have a look at the official documentation.
offline vs online indexing
In ONLINE mode the new index is built while the old index remains accessible to reads and writes. Any update on the old index will also be applied to the new index.
In OFFLINE mode the table is locked up front for any read or write, and then the new index gets built from the old index. No read or write operation is permitted on the table while the index is being rebuilt. Only when the operation is done is the lock on the table released and reads and writes are allowed again.
