I'm using the Reindex API of Elasticsearch to move documents from one index (named index1) to another index (named index2).
My problem arises when index1 is too big, so Elasticsearch responds with a timeout. There is another query (GET _tasks?detailed=true&actions=*reindex) that shows the reindex progress, but I can't figure out how to see the errors, if any occur during the reindex, and why my reindex task fails.
One possible solution, which I don't like, is to increase Elasticsearch's response timeout. Is there any way to see the errors without increasing the timeout?
What I usually do is launch the reindex with ?wait_for_completion=false, so that a background task gets created. The reindex call returns almost immediately and tells you the ID of the task that was created.
You can then use the Task API to check the status of the task using:
GET .tasks/task/<taskId>
Even when the reindex is done, the task document stays in the .tasks index, and you can check it for errors, if any.
It is your responsibility, though, to delete that document using:
DELETE .tasks/task/<taskId>
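Putting the pieces together, the whole sequence could look like this (the index names and the task ID are placeholders):

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "index1"
  },
  "dest": {
    "index": "index2"
  }
}

The response contains a task ID, e.g. "task": "oTUltX4IQMOUUVeiohTt8A:12345", which you then use to check the status and any failures:

GET .tasks/task/oTUltX4IQMOUUVeiohTt8A:12345

and finally to clean up:

DELETE .tasks/task/oTUltX4IQMOUUVeiohTt8A:12345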
Related
I'm performing a large upload of data to an empty index.
This article suggests setting "refresh_interval=-1" and "number_of_replicas=0" to increase upload performance, and then setting them back afterwards.
The interesting thing is that if I don't set it back, I can still send queries to the newly created index and get results.
I'd like to know why that is and what I got wrong. (My expectation was that I should get zero results because indexing is disabled.)
And one more thing I'd like to understand: if I set refresh_interval back to its original value, do I need to execute a /_refresh operation?
By default, Elasticsearch periodically refreshes indices every second,
but only on indices that have received one search request or more in
the last 30 seconds. You can change this default interval using the
index.refresh_interval setting.
Setting refresh_interval=-1 disables the periodic refresh, not indexing itself: documents are still written to the index, and they become searchable whenever a refresh happens for some other reason (for example, when the indexing buffer fills up, or when you call _refresh explicitly). That is why you still get results; you may just miss the most recently indexed documents. Once you set refresh_interval back to a positive value, the periodic refresh resumes on its own, so an explicit /_refresh is not strictly required, but running it once makes all pending documents searchable immediately. It is better to keep a refresh_interval set if you keep indexing new data into your indices.
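For reference, the settings dance described in the article can be sketched like this (the index name and values are illustrative):

PUT my-index/_settings
{
  "index": {
    "refresh_interval": "-1",
    "number_of_replicas": 0
  }
}

...bulk upload your data...

PUT my-index/_settings
{
  "index": {
    "refresh_interval": "1s",
    "number_of_replicas": 1
  }
}

POST my-index/_refresh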
I want to use the elasticsearch reindex api to reindex an index on prod.
What is the behaviour when I keep writing to the source index? Will the reindexing task keep reindexing as long as I still write to it?
The reindex process will reindex only the documents that exist in the source index at the time when the reindex request was made.
Any document created or updated after the reindex request was made will not be indexed in the destination index, you will need to run another reindex to get these documents.
So, if you keep writing to the index, you will need to keep doing reindexes.
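If your documents carry a modification timestamp (a hypothetical last_updated field here), each follow-up run can be restricted to documents changed since the previous run started, instead of copying everything again:

POST _reindex
{
  "source": {
    "index": "source_index",
    "query": {
      "range": {
        "last_updated": {
          "gte": "2021-01-01T00:00:00Z"
        }
      }
    }
  },
  "dest": {
    "index": "dest_index"
  }
}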
The reindex operation in Elasticsearch creates an entry in the ".tasks" index.
Following is the excerpt from docs:
If the request contains wait_for_completion=false then Elasticsearch will perform some preflight checks, launch the request, and then return a task which can be used with Tasks APIs to cancel or get the status of the task. Elasticsearch will also create a record of this task as a document at .tasks/task/${taskId}. This is yours to keep or remove as you see fit. When you are done with it, delete it so Elasticsearch can reclaim the space it uses
However, if I disable dynamic index creation using this API/property, then the .tasks index is not created during the reindex, and the operation still succeeds.
My questions are:
Will this affect the normal processing of Elasticsearch (especially the reindex operation)?
Did this ".tasks" index exist in versions before 6.6?
The .tasks index exists at least since ES 5.0 and the purpose is to let you manage your long-running tasks instead of letting them run and finish without seeing their outcome.
The normal processing is in no way affected by this; that index is just a container for task outcomes that you can consult at your leisure. However, if you decide to store task outcomes there, it's your job to keep that index clean, i.e. ES will not delete task documents from it.
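If you have turned off automatic index creation cluster-wide, you can still whitelist the .tasks index so that task documents keep getting stored; a sketch (the pattern list is just an example):

PUT _cluster/settings
{
  "persistent": {
    "action.auto_create_index": "+.tasks,-*"
  }
}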
Hello elasticsearch users/experts,
I have a bit of trouble understanding the race condition problem with the reindex api of Elasticsearch and would like to hear if anyone has found a solution about it.
I have searched a lot of places and could not find any clear solution (most of the solutions date back to before the reindex api).
As you might know, the (now) standard way of reindexing (after changing the mapping, for example) is to use an alias.
Suppose the alias points to "old_index". We then create a new index called "new_index" with the new mapping, we call the reindex api to reindex the documents from 'old_index' to 'new_index' and then switch the alias to point to the new_index (and remove the alias pointer to old_index). It seems this is the standard way of reindexing and that is what I have seen on almost all recent websites I visited.
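The alias switch at the end can be done atomically in a single _aliases call; a sketch using the index names above and a hypothetical alias name my_alias:

POST _aliases
{
  "actions": [
    { "remove": { "index": "old_index", "alias": "my_alias" } },
    { "add": { "index": "new_index", "alias": "my_alias" } }
  ]
}

Both actions are applied as one atomic operation, so searches never see a moment where the alias points to nothing.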
My questions are the following, given that I would not want downtime (so users should still be able to search documents) and I would still want to be able to ingest documents into Elasticsearch while the reindexing process is happening:
If documents would still be incoming while the reindexing process is working (which would probably take a lot of time), how would the reindexing process ensure that the document would be ingested in the old index (to be able to search for it while the reindexing process is working) but still would be correctly reindexed to the new index?
If a document is modified in the old index, after it has been reindexed (copied to the new index), while the reindexing process is working, how would Elasticsearch ensure that this modification is also taken into account in the new index?
(Similar to 2.) If a record is deleted in the old index, after it has been reindexed (copied to the new index), while the reindexing process is working, how would Elasticsearch ensure that this removal is also taken into account in the new index?
Basically in a scenario where it is not affordable to make any indexing mistake for a document, how would one be able to proceed to make sure the reindexing goes without any of the above problems?
Does anyone have any idea? And if there is no solution without any downtime, how would we proceed with the least amount of downtime in that case?
Thanks in advance!
Apologies if it's too verbose, but my two cents:
If documents would still be incoming while the reindexing process is
working (which would probably take a lot of time), how would the
reindexing process ensure that the document would be ingested in the
old index (to be able to search for it while the reindexing process is
working) but still would be correctly reindexed to the new index?
When reindexing is happening from source to destination, the alias would (and must) still be pointing to the source_index. All modifications to this index happen independently, and those updates/deletes take effect immediately.
Let's say the state of source_index changes from t to t+1.
If you ran a reindexing job at t to dest_index, it would still consume the snapshot of source_index at t. You need to run the reindexing job again to get the latest data of source_index, i.e. the data at t+1, into your dest_index.
Ingestion into source_index and reindexing from source_index to dest_index are independent processes.
A reindexing job alone will never guarantee consistency between source_index and dest_index.
If a document is modified in the old index, after it has been
reindexed (copied to the new index), while the reindexing process is
working, how would Elasticsearch ensure that this modification is also
taken into account in the new index?
It won't be taken into account in the new index, as the reindexing makes use of a snapshot of source_index taken at time t.
You would need to perform the reindexing again. A general approach is to have a scheduler that re-runs the reindexing process every few hours.
Updates/deletes can land in source_index every few minutes (if you are using a scheduler) or in real time (if you are using an event-based approach).
However, schedule the full reindexing (from source_index to dest_index) only once or twice a day, as it is an expensive process.
(Similar to 2.) If a record is deleted in the old index, after it has
been reindexed (copied to the new index), while the reindexing process
is working, how would Elasticsearch ensure that this removal is also
taken into account in the new index?
Again, you need to run a new job/reindexing process.
version_type: external
Just as a side note, one interesting thing you can do during reindexing is to make use of version_type: external, which ensures that only the updated/missing documents from source_index are reindexed into dest_index.
You can refer to this LINK for more info.
POST _reindex
{
"source": {
"index": "source_index"
},
"dest": {
"index": "dest_index",
"version_type": "external"
}
}
I have a use case where I ran a batch code to first create and then subsequently update my index in elasticsearch.
My program crashed prematurely, and now I want to know the last time an update was made to my Elasticsearch index.
Is there any API that could give me the last update time of the index?
I have not been able to find any such resources. I looked specifically in https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-stats.html
and tried,
curl http://{myhost}/{indexName}/_stats