How does the 'delete index' command work inside ES? - performance

How does the 'delete index' command work inside ES?
Are there any risks when using the 'delete index' command on a running ES cluster?
Will this command cost too much CPU or memory?

Deleting an index is normally pretty fast, and Elasticsearch doesn't actually delete all the documents before it sends the success response for a delete index request.
Elasticsearch mainly updates the cluster state (maintained on all the nodes of a cluster) to mark the indices as deleted; the heavy lifting is in updating that state along with related structures such as the routing table, metadata, and so on.
This is the main method in the Elasticsearch source code that will help you understand what I described above and the internals of deleting an index.
Here is an important code snippet from the link above:
RoutingTable.Builder routingTableBuilder = RoutingTable.builder(currentState.routingTable());
Metadata.Builder metadataBuilder = Metadata.builder(meta);
ClusterBlocks.Builder clusterBlocksBuilder = ClusterBlocks.builder().blocks(currentState.blocks());
final IndexGraveyard.Builder graveyardBuilder = IndexGraveyard.builder(metadataBuilder.indexGraveyard());
final int previousGraveyardSize = graveyardBuilder.tombstones().size();
for (final Index index : indices) {
    String indexName = index.getName();
    logger.info("{} deleting index", index);
    routingTableBuilder.remove(indexName);
    clusterBlocksBuilder.removeIndexBlocks(indexName);
    metadataBuilder.remove(indexName);
}
// add tombstones to the cluster state for each deleted index
final IndexGraveyard currentGraveyard = graveyardBuilder.addTombstones(indices).build(settings);
metadataBuilder.indexGraveyard(currentGraveyard); // the new graveyard set on the metadata
logger.trace("{} tombstones purged from the cluster state. Previous tombstone size: {}. Current tombstone size: {}.",
        graveyardBuilder.getNumPurged(), previousGraveyardSize, currentGraveyard.getTombstones().size());
Coming to your questions: are there any risks when using the 'delete index' command on a running ES cluster, and will this command cost too much CPU or memory?
No, there is no risk in issuing a delete index request against a running Elasticsearch cluster unless you have a huge cluster state. As mentioned, actually deleting the index data happens asynchronously; the request itself just updates various state and flags, so it doesn't consume much CPU.
You can also enable the trace log on org.elasticsearch.cluster.metadata.MetadataDeleteIndexService to see the tombstone counts in your cluster state, as they are logged in the code snippet above.
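For example, a minimal sketch of enabling that trace logger dynamically through the cluster settings API (set the value back to null to restore the default):
PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.cluster.metadata.MetadataDeleteIndexService": "trace"
  }
}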

Related

Elasticsearch .tasks index usage

The reindex operation in Elasticsearch creates an entry in the ".tasks" index.
Following is an excerpt from the docs:
If the request contains wait_for_completion=false then Elasticsearch will perform some preflight checks, launch the request, and then return a task which can be used with Tasks APIs to cancel or get the status of the task. Elasticsearch will also create a record of this task as a document at .tasks/task/${taskId}. This is yours to keep or remove as you see fit. When you are done with it, delete it so Elasticsearch can reclaim the space it uses
However, if I disable creating indices dynamically by using this API/property, then the .tasks index is not created during reindex and the operation still succeeds.
My questions are:
Will this affect the normal processing of Elasticsearch (especially the reindex operation)?
Did this ".tasks" index exist in versions before 6.6?
The .tasks index has existed at least since ES 5.0, and its purpose is to let you manage your long-running tasks instead of letting them run and finish without you ever seeing their outcome.
Normal processing is in no way affected by this; that index is just a container for task outcomes that you can consult at your leisure. However, if you decide to store task outcomes in there, it's your job to keep that index clean, i.e. ES will not delete task documents from that index.
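A sketch of the typical flow, with placeholder index names and task id: run the reindex asynchronously, poll the task, and delete its record from .tasks once you are done with it (newer versions store it under .tasks/_doc/<taskId> rather than .tasks/task/<taskId>):
POST _reindex?wait_for_completion=false
{
  "source": { "index": "source_index" },
  "dest": { "index": "dest_index" }
}

GET _tasks/<taskId>

DELETE .tasks/task/<taskId>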

Elasticsearch Reindexing race condition

Hello elasticsearch users/experts,
I have a bit of trouble understanding the race condition problem with the reindex API of Elasticsearch and would like to hear if anyone has found a solution for it.
I have searched in a lot of places and could not find any clear solution (most of the solutions date back to before the reindex API existed).
As you might know, the (now) standard way of reindexing a document (after changing the mapping, for example) is to use an alias.
Suppose the alias points to "old_index". We then create a new index called "new_index" with the new mapping, we call the reindex api to reindex the documents from 'old_index' to 'new_index' and then switch the alias to point to the new_index (and remove the alias pointer to old_index). It seems this is the standard way of reindexing and that is what I have seen on almost all recent websites I visited.
My questions are the following, given that I use this method, do not want downtime (the user should still be able to search documents), and still want to be able to ingest documents into Elasticsearch while the reindexing process is happening:
1. If documents would still be incoming while the reindexing process is working (which would probably take a lot of time), how would the reindexing process ensure that the document would be ingested in the old index (to be able to search for it while the reindexing process is working) but still would be correctly reindexed to the new index?
2. If a document is modified in the old index, after it has been reindexed (mapped to the new index), while the reindexing process is working, how would Elasticsearch ensure that this modification is also taken into account in the new index?
3. (Similar to 2.) If a record is deleted in the old index, after it has been reindexed (mapped to the new index), while the reindexing process is working, how would Elasticsearch ensure that this removal is also taken into account in the new index?
Basically, in a scenario where it is not affordable to make any indexing mistake for a document, how would one proceed to make sure the reindexing goes through without any of the above problems?
Does anyone have any idea? And if there is no solution without downtime, how would we proceed with the least amount of downtime?
Thanks in advance!
Apologies if it's too verbose, but here are my two cents:
If documents would still be incoming while the reindexing process is working (which would probably take a lot of time), how would the reindexing process ensure that the document would be ingested in the old index (to be able to search for it while the reindexing process is working) but still would be correctly reindexed to the new index?
When a reindex is running from source to destination, the alias would (and must) still point to source_index. All modifications to source_index happen independently of the reindex, and those updates/deletes take effect on source_index immediately.
Let's say the state of source_index changes from t to t+1.
If you started a reindexing job at t into dest_index, it will still consume the data from the snapshot of source_index taken at t. You need to run the reindexing job again to get the latest data of source_index, i.e. the data at t+1, into dest_index.
Ingestion into source_index and reindexing from source_index to dest_index are independent processes.
A reindexing job will never, on its own, guarantee consistency between source_index and dest_index.
If a document is modified in the old index, after it has been reindexed (mapped to the new index), while the reindexing process is working, how would Elasticsearch ensure that this modification is also taken into account in the new index?
It won't be taken into account in the new index, because the reindex makes use of the snapshot of source_index taken at time t.
You would need to perform the reindexing again. The general approach for this is to have a scheduler that re-runs the reindexing process every few hours.
Updates/deletes can keep happening at source_index every few minutes (if you are using a scheduler) or in real time (if you are using an event-based approach).
However, schedule the full reindex (from source_index to dest_index) only once or twice a day, as it is an expensive process.
(Similar to 2.) If a record is deleted in the old index, after it has been reindexed (mapped to the new index), while the reindexing process is working, how would Elasticsearch ensure that this removal is also taken into account in the new index?
Again, you need to run a new job/reindexing process.
version_type: external
Just as a side note, one interesting thing you can do during reindexing is to make use of version_type: external, which ensures that only the updated/missing documents from source_index are reindexed into dest_index.
You can refer to this LINK for more info on it:
POST _reindex
{
  "source": {
    "index": "source_index"
  },
  "dest": {
    "index": "dest_index",
    "version_type": "external"
  }
}
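If you re-run that job on a schedule, as suggested above, adding "conflicts": "proceed" keeps the job from aborting on documents that are already up to date in dest_index (they are skipped as version conflicts). Note that documents deleted from source_index are still not removed from dest_index this way; only a full rebuild of dest_index picks up deletions. A sketch:
POST _reindex
{
  "conflicts": "proceed",
  "source": {
    "index": "source_index"
  },
  "dest": {
    "index": "dest_index",
    "version_type": "external"
  }
}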

How to recover space after bulk update in ElasticSearch?

We have an index of around 20GB; the documents have several large fields, many of which are now redundant.
So I decided to use bulk update to set those fields to empty, in the expectation of recovering space on the server.
I tested a small number of instances, using code of the form:
POST myindex/doc/_bulk
{"update":{"_id":"ccp-23-1002"}}
{"doc" : { "long_text_1":"", "long_text_2":""}}
{"update":{"_id":"ccp-28-1007"}}
{"doc" : { "long_text_1":"", "long_text_2":""}}
This worked fine: a search showed that the fields long_text_1 and long_text_2 were now blank on the specified docs, with the other fields unchanged.
So then I scripted something to run the above across all the docs in the index, 1000 at a time. After a few had gone through, I checked the data in the console using
GET _cat/indices?v&s=store.size&h=index,docs.count,store.size
... which showed that while the index in question had the same number of documents, the store.size had got larger, not smaller!
Presumably what is happening is that in each case after an update, a new doc has been created with the same data as the old doc, except with the fields specified in the update request changed; and the old doc is still sitting in the index, presumably marked as dead, but taking up space. So the exercise is having exactly the opposite of the intended effect.
So my question is, how to instruct ES to compact the index or otherwise reclaim this dead space?
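One way to reclaim the space held by those dead documents is a force merge that expunges deletes; a minimal sketch against the index from the question (a force merge is I/O-intensive, so it is best run during a quiet period, and Elasticsearch will also gradually merge away deleted documents on its own over time):
POST myindex/_forcemerge?only_expunge_deletes=true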

Elasticsearch reindex store sizes vary greatly

I am running Elasticsearch 6.2.4. I have a program that will automatically create an index for me as well as the mappings necessary for my data. For this issue, I created an index called "landsat" but it needs to actually be named "landsat_8", so I chose to reindex. The original "landsat" index has 2 shards and 0 read replicas. The store size is ~13.4gb with ~6.6gb per shard and the index holds just over 515k documents.
I created a new index called "landsat_8" with 5 shards, 1 read replica, and started a reindex with no special options. On a very small Elastic Cloud cluster (4GB RAM), it finished in 8 minutes. It was interesting to see that the final store size was only 4.2gb, yet it still held all 515k documents.
After it was finished, I realized that I had failed to create my mappings before reindexing, so I blew it away and started over. I was shocked to find that after an hour, the _cat/indices endpoint showed that only 7.5gb of data and 154,800 documents had been reindexed. 4 hours later, the entire job seemed to have died at 13.1gb, but it showed that only 254,000 documents had been reindexed.
On this small 4gb cluster, this reindex operation was maxing out CPU. I increased the cluster to the biggest one Elastic Cloud offered (64gb ram), 5 shards, 0 RR and started the job again. This time, I set the refresh_interval on the new index to -1 and changed the size for the reindex operation to 2000. Long story short, this job ended in somewhere between 1h10m and 1h19m. However, this time I ended up with a total store size of 25gb, where each shard held ~5gb.
I'm very confused as to why the reindex operation causes such wildly different results in store size and reindex performance. Why, when I don't explicitly define any mappings and let ES automatically create mappings, is the store size so much smaller? And why, when I use the exact same mappings as the original index, is the store so much bigger?
Any advice would be greatly appreciated. Thank you!
UPDATE 1:
Here are the only differences in mappings:
The left image is "landsat" and the right image is "landsat_8". There is a root level "type" field and a nested "properties.type" field in the original "landsat" index. I forgot one of my goals was to remove the field "properties.type" from the data during the reindex. I seem to have been successful in doing so, but at the same time, accidentally renamed the root-level "type" field mapping to "provider", thus "landsat_8" has an unused "provider" mapping and an auto-created "type" mapping.
So there are some problems here, but I wouldn't think this would nearly double my store size...
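For reference, a sketch of the tuning described above, assuming the poster's index names: disable refresh on the target, run the reindex with a batch size of 2000, then restore the refresh interval afterwards.
PUT landsat_8/_settings
{ "index": { "refresh_interval": "-1" } }

POST _reindex
{
  "source": { "index": "landsat", "size": 2000 },
  "dest": { "index": "landsat_8" }
}

PUT landsat_8/_settings
{ "index": { "refresh_interval": "1s" } }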

Indexing new data while replacing old data in Elasticsearch with zero downtime

I am working with Elasticsearch, where I need to index new data while replacing old data. This replacement happens every day.
My requirement is that until indexing of the new data is completed, users should be able to search the old data only. When the indexing is completed, there should be a pointer in Elasticsearch that switches to the newly indexed data in no time, followed by deletion of the old data. In this way I want to achieve zero downtime in this process. This indexing may take around 1 hour to complete.
Is there any switching concept in Elasticsearch which can handle this scenario?
Index Aliases is what you want.
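A sketch of how that typically looks, with hypothetical index names: keep the application searching through an alias, build the new index next to the old one, then switch the alias atomically and drop the old index. Searches against the alias keep hitting the old index until the swap, and the swap itself is a single atomic cluster-state update.
# create and load the new index
PUT daily_data_v2

# switch the alias in one atomic step
POST _aliases
{
  "actions": [
    { "remove": { "index": "daily_data_v1", "alias": "daily_data" } },
    { "add": { "index": "daily_data_v2", "alias": "daily_data" } }
  ]
}

# remove the old data
DELETE daily_data_v1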
