Elasticsearch reindex API - Not able to reindex large number of documents

I'm using Elasticsearch's reindex API to migrate logs from an old cluster to a new version 7.9.2 cluster. Here is the command I'm using.
curl -X POST "new_host:9200/_reindex?pretty&refresh&wait_for_completion=true" -H 'Content-Type: application/json' -d'
{
"source": {
"remote": {
"host": "old_host:9200"
},
"index": "*",
"size": 10000,
"query": {
"match_all": {}
}
},
"conflicts": "proceed",
"dest": {
"index": "logstash"
}
}'
This copies only the last 10000 documents (one batch), and then the request completes. However, I need to reindex more than a million documents. Is there a way to make the request run for all the matched documents? Can we set the number of batches in the request, or make the request issue batches until all documents are indexed?
One option I can think of is to send request recursively by modifying query on datetime. Is there a better way to do it? Can I get all the matched documents (1 million plus) in one request?

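Remove the query and size params in order to get all the data. If you need a query to filter only the desired documents, keep the query and just remove the size to fetch all matched logs. Note that in 7.x the size inside source sets the scroll batch size rather than a cap on the total (the total is capped by the top-level max_docs), so if the request still stops after one batch, check the failures section of the response.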
Using wait_for_completion=false as query param will return the task id and you will be able to monitor the reindex progress using GET /_tasks/<task_id>.
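As a rough illustration of the monitoring step, here is a minimal Python sketch that estimates progress from a GET /_tasks/<task_id> response. It assumes the response carries the usual task.status counters (total, created, updated, deleted); the function name and the sample numbers are made up for the example.

```python
def reindex_progress(task_response: dict) -> float:
    """Estimate the completed fraction (0.0 to 1.0) of a reindex task."""
    status = task_response["task"]["status"]
    total = status.get("total", 0)
    if total == 0:
        # total stays 0 until the first batch is read, or if nothing matched
        return 1.0 if task_response.get("completed") else 0.0
    done = status.get("created", 0) + status.get("updated", 0) + status.get("deleted", 0)
    return min(done / total, 1.0)


# Abbreviated, made-up response in the shape returned by GET /_tasks/<task_id>
sample = {
    "completed": False,
    "task": {"status": {"total": 1000000, "created": 240000, "updated": 10000, "deleted": 0}},
}
print(f"{reindex_progress(sample):.0%}")
```

This is only an estimate: documents skipped because of version conflicts are not part of the sum.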
If you need or want to break the reindexing into several steps/chunks, consider using the slice feature.
BTW: Reindex one index after another instead of all at once using *, and consider using daily/monthly indices, since that makes it easier to resume the process after errors and to manage log retention than with one big index.
To improve the speed, reduce the replicas to 0 and set refresh_interval=-1 on the destination index before reindexing, then reset the values afterwards.
curl -X POST "new_host:9200/_reindex?pretty&wait_for_completion=false" -H 'Content-Type: application/json' -d'
{
"source": {
"remote": {
"host": "old_host:9200"
},
"index": "index_name"
},
"conflicts": "proceed",
"dest": {
"index": "logstash"
}
}'
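The speed tip (replicas to 0 and refresh_interval=-1 before, reset afterwards) could be sketched in Python as the two settings bodies that bracket the reindex call. The helper names are hypothetical, and the restore values (1 replica, 1s) are assumptions based on the usual defaults; use whatever the index had before.

```python
import json

DEST_INDEX = "logstash"  # destination index from the question


def bulk_load_settings() -> dict:
    """Settings to apply before reindexing: no replicas, no periodic refresh."""
    return {"index": {"number_of_replicas": 0, "refresh_interval": "-1"}}


def restored_settings(replicas: int = 1, refresh_interval: str = "1s") -> dict:
    """Settings to apply afterwards; the default values here are assumptions."""
    return {"index": {"number_of_replicas": replicas, "refresh_interval": refresh_interval}}


# Print the two settings requests that bracket the reindex call.
for label, body in (("before", bulk_load_settings()), ("after", restored_settings())):
    print(f"# {label} reindexing")
    print(f"PUT /{DEST_INDEX}/_settings")
    print(json.dumps(body, indent=2))
```

With wait_for_completion=false, send the second request only once the task reports that it has completed.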
UPDATE based on comments:
While reindexing, there is at least one error that causes the reindexing to stop. The error is caused by at least one document (id=xiB9...) having 'OK' as the value of the field 'fields.StatusCode'. The mapping in the destination index has long as the data type for that field, which causes the mentioned exception.
One solution is to change the source document's StatusCode to 200, for example, but there will probably be more documents causing the very same error.
Another solution is to change the mapping in the destination index to the keyword type. That requires a handmade mapping set before any data has been inserted, and maybe reindexing the data already present.
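For the second option, a handmade mapping might look like the body built below. This is a sketch assuming the 7.x typeless mapping format and that StatusCode sits under an object field named fields, as the error message suggests; the function name is made up.

```python
import json


def status_code_as_keyword() -> dict:
    """Create-index body that maps fields.StatusCode as keyword (7.x typeless form)."""
    return {
        "mappings": {
            "properties": {
                "fields": {
                    "properties": {"StatusCode": {"type": "keyword"}}
                }
            }
        }
    }


# The index has to be created with this body before any document is indexed into it.
print("PUT /logstash")
print(json.dumps(status_code_as_keyword(), indent=2))
```

With keyword, both 'OK' and 200 are accepted, but numeric range queries on the field no longer behave numerically.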

Related

Elasticsearch: How can I tell the _reindex API to continue indexing docs while the source index is still receiving new docs?

I have daily indices that are filled by an agent which collects logs every second of the day, and I am reindexing them (by field) to new indices using the _reindex API.
How can I tell the _reindex API to keep reindexing while the source index is still receiving new documents?
Any help would be really appreciated!
Thank you
You cannot force the reindex API to stay online and pick up newly received documents.
But I have a solution: add a date field (index_time) to your source index, then write an hourly cron job that runs the reindex API with a query selecting the docs indexed in the last hour via index_time.
POST _reindex
{
"source": {
"index": "my-index-000001",
"query": {
"bool": {
"filter": {
"range": {
"index_time": {"gte" : "now-1h"}
}
}
}
}
},
"dest": {
"index": "my-new-index-000001"
}
}
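One caveat with now-1h in a cron job is that runs rarely start at exactly the same offset, so documents can be copied twice or missed. A possible variation, sketched in Python, passes explicit window bounds (gte start, lt end) computed by the job itself. The function name is hypothetical and index_time is the date field suggested in the answer.

```python
from datetime import datetime, timedelta, timezone


def hourly_window_body(source: str, dest: str, window_end: datetime,
                       field: str = "index_time") -> dict:
    """Reindex body covering the hour that ends at window_end (end exclusive)."""
    window_start = window_end - timedelta(hours=1)
    return {
        "source": {
            "index": source,
            "query": {
                "range": {
                    field: {"gte": window_start.isoformat(), "lt": window_end.isoformat()}
                }
            },
        },
        "dest": {"index": dest},
    }


# Example: the run scheduled for 13:00 UTC copies documents from 12:00 up to 13:00.
end = datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc)
body = hourly_window_body("my-index-000001", "my-new-index-000001", end)
print(body["source"]["query"])
```

Because consecutive windows share a boundary and the upper bound is exclusive, each document falls into exactly one run.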

Elasticsearch reindex only missing documents

I am trying to reindex an index of 200M documents from cluster A to cluster B. I used the Reindex API with a remote source and everything worked fine. In the meantime, some documents were added to cluster A, so I want to add them to cluster B as well.
I launched the reindex request again, but the process is taking a long time, as if it were reindexing everything again.
My question is: does the cluster reindex all the documents from scratch, even if they didn't change?
My Elasticsearch version is 5.6.
Elasticsearch does not know whether the documents have changed, so it copies every document again. If you have a field like insert_time in your data, you can use reindex with a query to limit which part of index A gets reindexed to B. This lets you keep the result of your earlier reindex and finish faster. A reindex with a query would be something like this:
POST _reindex
{
"source": {
"index": "A",
"query": {
"range": {
"insert_time": {
"gt": "time you want"
}
}
}
},
"dest": {
"index": "B"
}
}
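If there is no usable timestamp field, another option (not from the answer above, so treat it as an assumption to verify on your version) is to set op_type to create on the destination together with conflicts: proceed, so documents that already exist in B are skipped instead of overwritten. A Python sketch of such a body, with the remote block omitted for brevity and a hypothetical function name:

```python
from typing import Optional


def missing_only_body(source: str, dest: str, field: Optional[str] = None,
                      since: Optional[str] = None) -> dict:
    """Reindex body that only creates documents absent from the destination."""
    body = {
        "conflicts": "proceed",  # count version conflicts instead of aborting
        "source": {"index": source},
        "dest": {"index": dest, "op_type": "create"},  # never overwrite existing ids
    }
    if field and since:
        # Optionally narrow the scan, as in the range query above
        body["source"]["query"] = {"range": {field: {"gt": since}}}
    return body


print(missing_only_body("A", "B"))
print(missing_only_body("A", "B", field="insert_time", since="2020-01-01T00:00:00Z"))
```

Without the range filter the source is still scanned in full, so this saves indexing work on B rather than read time on A.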

Reindex fail due to SearchContextMissingException

My company is using elasticsearch 2.3.4.
We have a cluster that contains 38 ES nodes, and we've been having a problem with reindexing some of our data lately...
We've reindexed very large indexes before and had no problems, but recently, when trying to reindex much smaller indexes (less than 10GB), we get: "SearchContextMissingException [No search context found for id [XXX]]".
We have no idea what's causing this problem or how to fix it. We'd like some guidance.
Has anyone seen this exception before?
From GitHub comments on issues related to this, I think it can be avoided by changing the batch size:
From documentation:
By default _reindex uses scroll batches of 1000. You can change the batch size with the size field in the source element:
POST _reindex
{
"source": {
"index": "source",
"size": 100
},
"dest": {
"index": "dest",
"routing": "=cat"
}
}
I had the same problem with an index that holds many huge documents. I had to reduce the batch size down to 10. (100 and 50 both didn't work).
This was the request that worked in the end:
POST _reindex?slices=5&refresh
{
"source": {
"index": "source_index",
"size": 10
},
"dest": {
"index": "dest_index"
}
}
You should also set the slices to the number of shards you have in your index.
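The request that worked can be parameterized so the batch size and slice count are easy to adjust per index. A small Python sketch follows; the function name is made up, and slices is only available on versions that support sliced reindex.

```python
from urllib.parse import urlencode


def small_batch_reindex(source: str, dest: str, batch_size: int, slices: int):
    """Return (path, body) for a sliced reindex with a reduced scroll batch size."""
    if batch_size < 1 or slices < 1:
        raise ValueError("batch_size and slices must be positive")
    path = "/_reindex?" + urlencode({"slices": slices, "refresh": "true"})
    body = {"source": {"index": source, "size": batch_size}, "dest": {"index": dest}}
    return path, body


# Same values as the request above: batches of 10 documents, 5 slices
path, body = small_batch_reindex("source_index", "dest_index", batch_size=10, slices=5)
print("POST", path)
print(body)
```

Smaller batches mean more round trips, so it is worth starting from the default and lowering the size only until the exception disappears.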

What is offline and online indexing in Elasticsearch, and when do we need to reindex?

What is offline and online indexing in Elasticsearch? I did my research, but I couldn't find enough resources explaining what these terms mean. Any idea? Also, when do we need to reindex? Any examples would be great.
The terms offline and online indexing are used here.
https://spark-summit.org/2014/wp-content/uploads/2014/07/Streamlining-Search-Indexing-using-Elastic-Search-and-Spark-Holden-Karau.pdf
Reindexing
The most basic form of reindexing just copies one index to another.
I have used this form of reindexing to change a mapping.
Elasticsearch doesn't allow you to change the mapping of an existing field, so if you want to change a mapping you have to create a new index (index2) with the new mapping and then reindex. The reindex will fill the new index with the data of the old index.
The command below will copy everything from index to index2.
curl -XPOST 'localhost:9200/_reindex?pretty' -H 'Content-Type: application/json' -d'
{
"source": {
"index": "index"
},
"dest": {
"index": "index2"
}
}'
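The whole mapping-change procedure is two requests in order: create index2 with the new mapping, then reindex into it. A Python sketch that prints both, assuming the 7.x typeless mapping format; the date field mapping and the function name are only illustrations.

```python
import json

# Hypothetical new mapping: "date" mapped as a date field (7.x typeless form)
NEW_MAPPING = {"mappings": {"properties": {"date": {"type": "date"}}}}


def remap_steps(old_index: str, new_index: str, mapping: dict) -> list:
    """Ordered (method, path, body) requests: create the new index, then copy."""
    return [
        ("PUT", f"/{new_index}", mapping),
        ("POST", "/_reindex", {"source": {"index": old_index}, "dest": {"index": new_index}}),
    ]


for method, path, body in remap_steps("index", "index2", NEW_MAPPING):
    print(method, path)
    print(json.dumps(body, indent=2))
```

The order matters: if the reindex runs first, index2 gets created automatically with dynamically guessed mappings.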
You can also use reindexing to fill a new index with a part of the old one. You can do so by using a couple of parameters. The example below will copy the newest 1000 documents. (In recent 7.x releases the top-level size is deprecated in favor of max_docs.)
POST /_reindex
{
"size": 1000,
"source": {
"index": "index",
"sort": { "date": "desc" }
},
"dest": {
"index": "index2"
}
}
For more examples about reindexing please have a look at the official documentation.
offline vs online indexing
In ONLINE mode the new index is built while the old index is accessible to reads and writes. Any update on the old index will also get applied to the new index.
In OFFLINE mode the table is locked up front for any read or write, and then the new index gets built from the old index. No read or write operation is permitted on the table while the index is being rebuilt. Only when the operation is done is the lock on the table released and reads and writes are allowed again.

How to copy some ElasticSearch data to a new index

Let's say I have movie data in my ElasticSearch and I created them like this:
curl -XPUT "http://192.168.0.2:9200/movies/movie/1" -d'
{
"title": "The Godfather",
"director": "Francis Ford Coppola",
"year": 1972
}'
And I have a bunch of movies from different years. I want to copy all the movies from a particular year (say, 1972) to a new index of "70sMovies", but I couldn't see how to do that.
Since ElasticSearch 2.3 you can now use the built in _reindex API
for example:
POST /_reindex
{
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter"
}
}
Or only a specific part by adding a filter/query
POST /_reindex
{
"source": {
"index": "twitter",
"query": {
"term": {
"user": "kimchy"
}
}
},
"dest": {
"index": "new_twitter"
}
}
Read more: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
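Applied to the movies question, the same filtered reindex could be built like this in Python. The destination name movies_70s is an assumption (index names must be lowercase, so "70sMovies" would be rejected), and the function name is made up.

```python
import json


def copy_year_body(year: int, source: str = "movies", dest: str = "movies_70s") -> dict:
    """Reindex body that copies only the documents whose year field equals year."""
    return {
        "source": {"index": source, "query": {"term": {"year": year}}},
        "dest": {"index": dest},
    }


print("POST /_reindex")
print(json.dumps(copy_year_body(1972), indent=2))
```

For the whole decade, a range query on year (gte 1970, lte 1979) would replace the term query.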
The best approach would be to use the elasticsearch-dump tool: https://github.com/taskrabbit/elasticsearch-dump.
The real-world example I used:
elasticdump \
--input=http://localhost:9700/.kibana \
--output=http://localhost:9700/.kibana_read_only \
--type=mapping
elasticdump \
--input=http://localhost:9700/.kibana \
--output=http://localhost:9700/.kibana_read_only \
--type=data
Check out knapsack:
https://github.com/jprante/elasticsearch-knapsack
Once you have the plugin installed and working, you could export part of your index via query. For example:
curl -XPOST 'localhost:9200/test/test/_export' -d '{
"query" : {
"match" : {
"myfield" : "myvalue"
}
},
"fields" : [ "_parent", "_source" ]
}'
This will create a tarball with only your query results, which you can then import into another index.
To reindex a specific type from a source index to a destination index type, the syntax is:
POST _reindex/
{
"source": {
"index": "source_index",
"type": "source_type",
"query": {
// add filter criteria
}
},
"dest": {
"index": "dest_index",
"type": "dest_type"
}
}
If the intent is to copy the entire data to an index with the same settings/mappings as the original index, one could use the clone API. (Clone copies the whole index; to copy only a portion of the data, use reindex with a query.) Something like below:
POST /<index>/_clone/<target-index>
OR
PUT /<index>/_clone/<target-index>
However, if the intent is to copy the data to a new index with different settings/mappings than the original index, one could use the reindex API. Something like below:
POST _reindex/
{
"source": {
"index": "source_index",
"type": "source_type",
"query": {
// add filter criteria
}
},
"dest": {
"index": "dest_index",
"type": "dest_type"
}
}
*Note: The reindex API does not copy settings or mappings from the source, so a target index that needs specific settings/mappings has to be created prior to the actual API call.
For further reading on the difference between clone and reindex, refer to the question "What's the difference between cloning and reindexing an index in Elasticsearch?"
You can do it easily with elasticsearch-dump (https://github.com/taskrabbit/elasticsearch-dump) in three steps. In the following example I copy the index "thor" to "thor2"
elasticdump --input=http://localhost:9200/thor --output=http://localhost:9200/thor2 --type=analyzer
elasticdump --input=http://localhost:9200/thor --output=http://localhost:9200/thor2 --type=mapping
elasticdump --input=http://localhost:9200/thor --output=http://localhost:9200/thor2 --type=data
Well the straightforward way to do this is to write code, with the API of your choice, querying for "year": 1972 and then indexing that data into a new index. You would use the Search api or the Scan and Scroll API to get all the documents and then either index them one by one or use the Bulk Api:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-search.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-index_.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-bulk.html
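For the do-it-in-code route, the fiddly part is usually turning search or scroll hits into a Bulk API payload. A minimal Python sketch of that conversion, assuming hits shaped like the entries of hits.hits in a search response; the function name and the destination index name are hypothetical.

```python
import json


def hits_to_bulk(hits: list, dest_index: str) -> str:
    """Build a newline-delimited Bulk API body that indexes each hit into dest_index."""
    lines = []
    for hit in hits:
        # Action line first, keeping the original document id
        lines.append(json.dumps({"index": {"_index": dest_index, "_id": hit["_id"]}}))
        # Then the document itself on the following line
        lines.append(json.dumps(hit["_source"]))
    # Every line, including the last one, must end with a newline
    return "".join(line + "\n" for line in lines)


sample_hits = [
    {"_id": "1", "_source": {"title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972}},
]
payload = hits_to_bulk(sample_hits, "movies_70s")
print(payload, end="")
```

The payload would be sent with POST /_bulk once per page of scroll results.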
Assuming you don't want to do this via code but are looking for a direct way of doing this, I suggest the Elasticsearch Snapshot and Restore. Basically you would take a snapshot of your existing index, restore it into a new index and then use the Delete command to delete all documents with a year other than 1972.
Snapshot And Restore
The snapshot and restore module allows to create snapshots of
individual indices or an entire cluster into a remote repository. At
the time of the initial release only shared file system repository was
supported, but now a range of backends are available via officially
supported repository plugins.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-snapshots.html
Delete By Query API
The delete by query API allows to delete documents from one or more
indices and one or more types based on a query. The query can either
be provided using a simple query string as a parameter, or using the
Query DSL defined within the request body.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-delete-by-query.html
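Put together, the snapshot route is three requests: snapshot the index, restore it under a new name, then delete everything that is not from 1972. The Python sketch below prints them. The repository and snapshot names are hypothetical, a registered snapshot repository is assumed, and the _delete_by_query endpoint shape is the one used by current versions (older releases exposed delete by query differently).

```python
import json

REPO, SNAPSHOT = "my_backup", "snapshot_1"  # hypothetical repository and snapshot names


def snapshot_copy_steps(source: str, dest: str, year: int) -> list:
    """Ordered (method, path, body) requests for snapshot, renamed restore, and cleanup."""
    return [
        ("PUT", f"/_snapshot/{REPO}/{SNAPSHOT}?wait_for_completion=true",
         {"indices": source}),
        ("POST", f"/_snapshot/{REPO}/{SNAPSHOT}/_restore?wait_for_completion=true",
         {"indices": source, "rename_pattern": source, "rename_replacement": dest}),
        # Keep only the wanted year by deleting every document that does not match it
        ("POST", f"/{dest}/_delete_by_query",
         {"query": {"bool": {"must_not": {"term": {"year": year}}}}}),
    ]


for method, path, body in snapshot_copy_steps("movies", "movies_70s", 1972):
    print(method, path)
    print(json.dumps(body))
```

This copies the full index before trimming it, so it is heavier than a filtered reindex when only a small share of the documents is wanted.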
The _clone API was introduced in v7.4 and can easily satisfy your need (read the documentation for the relevant prerequisites and monitoring involved):
POST /<index>/_clone/<target-index>
Or:
PUT /<index>/_clone/<target-index>
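One of those prerequisites is that the source index is blocked for writes before it is cloned. A Python sketch of the two requests in order; the index names and the function name are hypothetical.

```python
import json


def clone_steps(source: str, target: str) -> list:
    """Ordered (method, path, body) requests: block writes on the source, then clone it."""
    return [
        ("PUT", f"/{source}/_settings", {"settings": {"index.blocks.write": True}}),
        ("POST", f"/{source}/_clone/{target}", None),
    ]


for method, path, body in clone_steps("movies", "movies_copy"):
    print(method, path)
    if body is not None:
        print(json.dumps(body))
```

Once the clone has finished, the write block can be lifted again on the source by setting index.blocks.write back to false.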
You can use elasticdump --searchBody:
# Copy documents from movies to 70sMovies (filtering using query)
elasticdump \
--input=http://localhost:9200/movies \
--output=http://localhost:9200/70sMovies \
--type=data \
--searchBody="{\"query\":{\"term\":{\"username\": \"admin\"}}}" # <--- Your query here
See the elasticdump documentation for more options.
