Internals of deleting a document in Elasticsearch

In Elasticsearch's documentation, Updating a document says:
Internally, Elasticsearch has marked the old document as deleted and
added an entirely new document. The old version of the document
doesn’t disappear immediately, although you won’t be able to access
it. Elasticsearch cleans up deleted documents in the background as you
continue to index more data.
And in Deleting a document:
deleting a document doesn’t immediately remove the document from
disk; it just marks it as deleted. Elasticsearch will clean up deleted
documents in the background as you continue to index more data.
Does this mean that if we never index anything, the data will be stored and marked for deletion forever but never deleted?

Not necessarily. Even if you never index anything else, you can force Elasticsearch to purge documents that are marked as deleted. Use the following command:
curl -XPOST 'http://localhost:9200/_forcemerge?only_expunge_deletes=true'
Force merge used to be called 'optimize', but that API is now deprecated.
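If you would rather expunge deletes from a single index instead of the whole cluster, the same endpoint can be scoped to an index name ("my_index" here is just a placeholder):
curl -XPOST 'http://localhost:9200/my_index/_forcemerge?only_expunge_deletes=true'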

Related

Does updating a doc increase the "delete" count of the index?

I am facing a strange issue with the number of docs reported as deleted in an Elasticsearch index. The data is never deleted, only inserted and/or updated. While I can see that the total number of docs is increasing, I have also been seeing non-zero values in the docs deleted column. I am unable to understand where this number comes from.
I tried to find out whether updating a doc first deletes it and then re-indexes it, which would increase the delete count, but I could not find any information on this.
The command I type to check the index is:
curl -XGET localhost:9200/_cat/indices
The output I get is:
yellow open e0399e012222b9fe70ec7949d1cc354f17369f20 zcq1wToKRpOICKE9-cDnvg 5 1 21219975 4302430 64.3gb 64.3gb
Note: this is a single-node Elasticsearch cluster.
I would like to understand why documents are being reported as deleted.
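As a side note on reading that output, the _cat APIs accept a ?v parameter that prints column headers, which makes it easier to spot the docs.deleted column:
curl -XGET 'localhost:9200/_cat/indices?v'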
You are correct: updates are the reason you see a deleted-document count for the index.
At the Lucene level there is no such thing as an update; documents in Lucene are immutable.
So how does Elasticsearch provide the update feature?
It does so by making use of the _source field, which is why _source must be enabled for updates to work. When you use the update API, Elasticsearch reads all the fields and their existing values from _source and replaces only the values of the fields sent in the update request. It then marks the existing document as deleted and indexes a new document with the updated _source.
What is the advantage of this if it's not an actual update?
It removes the overhead on the application of always assembling the complete document, even when only a small subset of fields needs updating. Rather than sending the full document, you can send only the fields that need an update via the update API; Elasticsearch takes care of the rest.
It also saves extra network round-trips, reduces payload size, and reduces the chance of version conflicts.
You can read more about how updates work here.
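As a rough illustration of that partial-update flow (the index name, document ID, and field below are made up, and the exact URL format differs between Elasticsearch versions):
POST my_index/_update/1
{
  "doc": {
    "status": "shipped"
  }
}
Elasticsearch merges the partial document into the existing _source, marks the old document as deleted, and indexes the new version, which is exactly why the deleted count grows as you update.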

Elasticsearch Reindexing race condition

Hello elasticsearch users/experts,
I have a bit of trouble understanding the race condition problem with the reindex API of Elasticsearch and would like to hear if anyone has found a solution for it.
I have searched in a lot of places and could not find any clear solution (most of the solutions date back to before the reindex API existed).
As you might know, the (now) standard way of reindexing a document (after changing the mapping, for example) is to use an alias.
Suppose the alias points to "old_index". We then create a new index called "new_index" with the new mapping, we call the reindex api to reindex the documents from 'old_index' to 'new_index' and then switch the alias to point to the new_index (and remove the alias pointer to old_index). It seems this is the standard way of reindexing and that is what I have seen on almost all recent websites I visited.
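(For reference, the alias switch at the end of that procedure can be done atomically with the _aliases endpoint, so searches never see a gap; the index and alias names below are just illustrative:)
POST _aliases
{
  "actions": [
    { "remove": { "index": "old_index", "alias": "my_alias" } },
    { "add": { "index": "new_index", "alias": "my_alias" } }
  ]
}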
My questions are the following, given that I do not want downtime (users should still be able to search documents) and still want to be able to ingest documents into Elasticsearch while the reindexing process is happening:
1. If documents would still be incoming while the reindexing process is working (which would probably take a lot of time), how would the reindexing process ensure that the document would be ingested into the old index (to be able to search for it while the reindexing process is working) but would still be correctly reindexed to the new index?
2. If a document is modified in the old index, after it has been reindexed (mapped to the new index), while the reindexing process is working, how would Elasticsearch ensure that this modification is also taken into account in the new index?
3. (Similar to 2.) If a record is deleted in the old index, after it has been reindexed (mapped to the new index), while the reindexing process is working, how would Elasticsearch ensure that this removal is also taken into account in the new index?
Basically, in a scenario where no indexing mistake is affordable for any document, how would one proceed to make sure the reindexing runs into none of the above problems?
Does anyone have any ideas? And if there is no solution without downtime, how would we proceed with the least amount of downtime?
Thanks in advance!
Apologies if it's too verbose, but here are my two cents:
If documents would still be incoming while the reindexing process is
working (which would probably take a lot of time), how would the
reindexing process ensure that the document would be ingested in the
old index (to be able to search for it while the reindexing process is
working) but still would be correctly reindexed to the new index?
When a reindex is running from source to destination, the alias would (and must) still point to source_index. All modifications to this index happen independently, and those updates/deletes take effect immediately.
Let's say the state of source_index changes from t to t+1.
If you ran a reindexing job at t to dest_index, it will still consume the snapshot of source_index taken at t. You need to run the reindexing job again to get the latest data of source_index, i.e. the data at t+1, into dest_index.
Ingestion into source_index and reindexing from source_index to dest_index are independent processes.
Reindexing jobs on their own will not guarantee consistency between source_index and dest_index.
If a document is modified in the old index, after it has been
reindexed (mapped to the new index), while the reindexing process is
working, how would Elasticsearch ensure that this modification is also
taken into account in the new index?
It won't be taken into account in the new index, because the reindex works from a snapshot of source_index taken at time t.
You would need to perform the reindexing again. The general approach for this is a scheduler that re-runs the reindexing process every few hours.
Updates/deletes can reach source_index every few minutes (if you use a scheduler) or in real time (if you use an event-based approach).
However, schedule the full reindex (from source_index to dest_index) only once or twice a day, as it is an expensive process.
(Similar to 2.) If a record is deleted in the old index, after it has
been reindexed (mapped to the new index), while the reindexing process
is working, how would Elasticsearch ensure that this removal is also
taken into account in the new index?
Again, you need to run a new job/reindexing process.
version_type: external
Just as a side note, one interesting thing you can do during reindexing is to use version_type: external, which ensures that only updated or missing documents from source_index are reindexed into dest_index.
You can refer to this LINK for more info.
POST _reindex
{
  "source": {
    "index": "source_index"
  },
  "dest": {
    "index": "dest_index",
    "version_type": "external"
  }
}
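If dest_index already holds newer versions of some documents, a reindex with external versioning will hit version conflicts, and by default the request aborts. A common variation (a sketch, using the same illustrative index names) tells it to record the conflicts and keep going:
POST _reindex
{
  "conflicts": "proceed",
  "source": {
    "index": "source_index"
  },
  "dest": {
    "index": "dest_index",
    "version_type": "external"
  }
}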

ElasticSearch: Deleting then adding a document

I have a problem with making sure that my index is not corrupted. The way I do my Add and Index actions ensures that the index does not get out of sync with my DB:
Get document version from index
Get document from DB
Index document
This means that no matter what order my index requests arrive in, my index is always in sync with the DB. The problem comes with the Delete action. There can be a situation where an Add request comes after a Delete and the document is re-added even though it shouldn't be. I know that Lucene doesn't remove the document right away. Is there any way to know the IDs of deleted documents? Alternatively, can I check the version of a deleted document? If yes, how long after the delete request does the document live in the index?
Perhaps this might work:
1. When a document is deleted, it is not removed from the index but is replaced by a "delete document". This can be achieved by indexing an empty document with the same ID, which replaces the existing document when it is "deleted".
2. Before every insert, check whether the stored document is a "delete document"; if so, skip the insert/index action. If the document does not exist or is not a "delete document", index it. This is additional work, but it allows deletes and adds to arrive out of order.
Relying on getting IDs from the list of deleted docs that Lucene maintains internally would add a timing constraint: Lucene/Elasticsearch cleans up deleted docs whenever it sees fit.
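A minimal sketch of that tombstone idea, assuming an index called my_index and a flag field named deleted (both invented for illustration):
# "Delete" a document by overwriting it with a tombstone
PUT my_index/_doc/42
{
  "deleted": true
}
# Before re-adding document 42, fetch it first
GET my_index/_doc/42
# If the document exists and its _source has "deleted": true, skip the insert;
# otherwise index the new version as usual.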

Is it possible to undelete a document in Elasticsearch?

From reading this article (Lucene's Handling of Deleted Documents), I understand that deleted documents in Elasticsearch are simply marked as deleted, such that they may remain on the disk for some time afterwards.
I was therefore wondering if there was a way to recover deleted documents in Elasticsearch?
Deleted documents and old document versions are totally removed by the segment merging process :(
This is the moment when those old deleted documents are purged from the filesystem. Deleted documents (or old versions of updated documents) are not copied over to the new bigger segment.
https://www.elastic.co/guide/en/elasticsearch/guide/current/merge-process.html

Backing up, Deleting, Restoring Elasticsearch Indexes By Index Folder

Most of the Elasticsearch documentation discusses working with indexes through the REST API - is there any reason I can't simply move or delete index folders on disk?
You can move data around on disk, to a point -
If Elasticsearch is running, it is never a good idea to move or delete the index
folders, because Elasticsearch will not know what happened to the data, and you
will get all kinds of FileNotFoundExceptions in the logs as well as indices
that are red until you manually delete them.
If Elasticsearch is not running, you can move index folders to another node (for
instance, if you were decomissioning a node permanently and needed to get the
data off), however, if you delete or move the folder to a place where
Elasticsearch cannot see it when the service is restarted, then Elasticsearch
will be unhappy. This is because Elasticsearch writes what is known as the
cluster state to disk, and in this cluster state the indices are recorded, so if
ES starts up and expects to find index "foo", but you have deleted the "foo"
index directory, the index will stay in a red state until it is deleted through
the REST API.
Because of this, I would recommend that if you want to move or delete individual
index folders from disk, that you use the REST API whenever possible, as it's
possible to get ES into an unhappy state if you delete a folder that it expects
to find an index in.
EDIT: I should mention that it's safe to copy (for backups) an indices folder,
from the perspective of Elasticsearch, because it doesn't modify the contents of
the folder. Sometimes people do this to perform backups outside of the snapshot
& restore API.
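If you want supported backups, the snapshot & restore API mentioned above is the usual route. A minimal sketch, assuming a filesystem location registered in path.repo and repository/snapshot names that are made up here:
curl -XPUT 'http://127.0.0.1:9200/_snapshot/my_backup' -H 'Content-Type: application/json' -d '{"type": "fs", "settings": {"location": "/mnt/backups/my_backup"}}'
curl -XPUT 'http://127.0.0.1:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true'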
I use this procedure: I close, back up, then delete the indexes.
curl -XPOST "http://127.0.0.1:9200/*index_name*/_close"
After this point all index data is on disk and in a consistent state, and no writes are possible. I copy the directory where the index is stored and then delete it:
curl -XPOST "http://127.0.0.1:9200/*index_name*/_delete"
By closing the index, Elasticsearch stops all access to it. Then I send a command to delete the index (and all corresponding files on disk).
