ElasticSearch: Deleting then adding a document - elasticsearch

I have a problem with making sure that my index is not corrupted. The way I do my Add and Index actions keeps the index from getting out of sync with my DB:
1. Get the document version from the index
2. Get the document from the DB
3. Index the document
This means that no matter what order my index requests arrive in, the index stays in sync with the DB. The problem is the Delete action: an Add request can arrive after a Delete, and the document gets re-added even though it shouldn't be. I know that Lucene doesn't remove the document right away. Is there any way to know the ids of deleted documents, or to check the version of a deleted document? If so, how long after the delete request does the document live on in the index?

Perhaps this might work:
1. When a document is deleted, it is not removed from the index but replaced by a "delete document": an empty document with the same ID that overwrites the existing document when it is "deleted".
2. Before every insert, check whether the stored document is a "delete document"; if so, skip the insert/index action. If the document does not exist or is not a "delete document", index it. This is extra work on every insert, but it tolerates deletes and adds arriving out of order.
Relying on the ids from whatever deleted-docs list Lucene might maintain internally adds a timing constraint: Lucene/ES cleans up deleted docs whenever it needs to.
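The tombstone approach above can be sketched with a plain dict standing in for the index (a toy model, not real client code; all names are hypothetical):

```python
# Sketch of the "delete document" (tombstone) approach: instead of removing
# a deleted document, replace it with a tombstone so a late-arriving Add
# cannot silently resurrect it. A dict stands in for the index.

TOMBSTONE = {"deleted": True}  # the "empty document" marking a delete

def delete_doc(index: dict, doc_id: str) -> None:
    # Replace the document with a tombstone rather than removing the entry.
    index[doc_id] = TOMBSTONE

def add_doc(index: dict, doc_id: str, body: dict) -> bool:
    # Skip the insert if the document was already tombstoned.
    existing = index.get(doc_id)
    if existing is not None and existing.get("deleted"):
        return False  # the delete won; ignore the out-of-order add
    index[doc_id] = body
    return True
```

With this in place, a Delete followed by a stale Add leaves the tombstone intact instead of re-adding the document.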

Related

Does updating a doc increase the "delete" count of the index?

I am facing a strange issue with the number of docs getting deleted in an Elasticsearch index. The data is never deleted, only inserted and/or updated. While I can see the total number of docs increasing, I have also been seeing non-zero values in the docs.deleted column, and I cannot work out where that number comes from.
I tried to find out whether updating a doc first deletes it and then re-indexes it, so that the delete count goes up, but I could not find any information on this.
The command I type to check the index is:
curl -XGET localhost:9200/_cat/indices
The output I get is:
yellow open e0399e012222b9fe70ec7949d1cc354f17369f20 zcq1wToKRpOICKE9-cDnvg 5 1 21219975 4302430 64.3gb 64.3gb
Note: It is a single node elasticsearch.
I expect to know the reason behind deletion of docs.
You are correct: updates are the reason you see a non-zero deleted-documents count.
At the Lucene level there is no such thing as an update; documents in Lucene are immutable.
So how does Elasticsearch provide the update feature?
It does so by using the _source field, which is why _source must be enabled to use the update feature. When you use the update API, Elasticsearch reads _source to get all the fields and their existing values, replaces the values of only the fields sent in the update request, marks the existing document as deleted, and indexes a new document with the updated _source.
What is the advantage of this if it's not an actual update?
It removes the burden on the application of assembling the complete document even when only a small subset of fields needs updating. Rather than sending the full document, you send only the fields that need to change via the update API; Elasticsearch takes care of the rest.
It also saves some network round-trips, reduces payload size, and reduces the chance of a version conflict.
You can read more about how update works here.
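A rough model of that mechanism, with a toy class standing in for the index (illustrative only, not the real implementation), shows why every update bumps docs.deleted:

```python
# Toy model of an Elasticsearch partial update: merge the fields from the
# update request into the stored _source, mark the old document deleted,
# and index a new one. Structures here are illustrative, not real ES code.

class MiniIndex:
    def __init__(self):
        self.docs = {}     # doc_id -> _source
        self.deleted = 0   # what _cat/indices reports as docs.deleted

    def index(self, doc_id, source):
        if doc_id in self.docs:
            self.deleted += 1  # the old version is only marked deleted
        self.docs[doc_id] = source

    def update(self, doc_id, fields):
        # Merge the update into the existing _source, then re-index:
        # this is why an "update" still increases the deleted count.
        merged = {**self.docs[doc_id], **fields}
        self.index(doc_id, merged)
        return merged
```

Indexing a document once and then updating one field leaves one live document and one marked-as-deleted old version, matching the non-zero docs.deleted column in the question.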

Elasticsearch delete api only deletes the record from one shard. The record still can be searched unless I manually delete & rebuild the whole index

Following is the response of Delete API:
{"found":false,"_index":"companyindex","_type":"companydata","_id":"932","_version":1,"_shards":{"total":2,"successful":1,"failed":0}}
The issue was in the document id: we were outputting user_id as the document_id, so the company_id was never going to be found. Hence the {"found": false} in the cURL response.
The question text also needs a small correction: the record was never deleted from just one shard. To correct myself, a record exists in only one primary shard, irrespective of how many shards the index has.

Elasticsearch: time to index document

I am updating existing documents by deleting and reindexing them. I did it this way because the documents have nested components and it was easier to massage the document myself rather than construct an update operation.
Mostly this works fine, but occasionally the system updates the same document twice in quick succession. I think what happens is that the search for the second update returns the original document (from before the first update) because the previous update has not yet been reflected in the index. By the time I try to delete the document (by id), the index has caught up and the document comes up as not found.
I am not doing bulk updates.
Is this a known issue and if so how does one work around it?
I can't find any reference to problems like this anywhere so I am puzzled.
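One likely explanation is Elasticsearch's near-real-time search: writes only become visible to search after a refresh (by default roughly every second), while get-by-id is real-time. A toy model of that behavior (an assumption-laden sketch, not real ES internals):

```python
# Toy model of near-real-time search: writes land in a buffer and only
# become visible to search after refresh(), mirroring why a quick second
# update can still "see" the pre-update document.

class NRTIndex:
    def __init__(self):
        self._live = {}    # what searches can see
        self._buffer = {}  # writes not yet refreshed

    def index(self, doc_id, source):
        self._buffer[doc_id] = source

    def refresh(self):
        # Make buffered writes searchable (ES does this periodically).
        self._live.update(self._buffer)
        self._buffer.clear()

    def search(self, doc_id):
        return self._live.get(doc_id)
```

In real Elasticsearch, working around this usually means refreshing (or using a refresh option on the write) before the follow-up search, or fetching the document by id instead of searching for it.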

Internals of deleting a document in elasticsearch

In Elasticsearch's documentation Updating a document says:
Internally, Elasticsearch has marked the old document as deleted and
added an entirely new document. The old version of the document
doesn’t disappear immediately, although you won’t be able to access
it. Elasticsearch cleans up deleted documents in the background as you
continue to index more data.
And in Deleting a document:
deleting a document doesn’t immediately remove the document from
disk; it just marks it as deleted. Elasticsearch will clean up deleted
documents in the background as you continue to index more data.
Does this mean that if we never index anything, the data will be stored and marked for deletion forever but never deleted?
You can force Elasticsearch to physically remove documents that are marked as deleted, even without indexing more data, with the following command -
curl -XPOST 'http://localhost:9200/_forcemerge?only_expunge_deletes=true'
Force merge used to be called 'optimize', but that endpoint is now deprecated.

Elasticsearch Reindexing while updating documents?

What if I've changed the mapping for my index and want to reindex?
I'm currently using the Java API, which does not yet have the reindex functionality, so using bulk would solve my problem. The solution would look something like this
ref How to reindex in ElasticSearch via Java API
Long time ago
create index MY_INDEX_1
create mapping for MY_INDEX_1
create alias MY_INDEX_1 -> MY_INDEX
create documents in MY_INDEX
Time to reindex!
create index MY_INDEX_2
create mapping for MY_INDEX_2
scroll search + bulk all documents from MY_INDEX_1 to MY_INDEX_2
Renaming and deletion of old index
create alias MY_INDEX_2 -> MY_INDEX
delete alias MY_INDEX_1 -> MY_INDEX
delete index MY_INDEX_1
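The alias-based flow above can be sketched with plain dicts standing in for the indices and the alias table (a hedged illustration, not real client code):

```python
# Sketch of reindexing behind an alias: copy everything from the old index
# to the new one, then repoint the alias. Dicts stand in for indices.

def reindex(src: dict, dst: dict) -> None:
    # Stand-in for scroll search + bulk: copy every document across.
    for doc_id, source in src.items():
        dst[doc_id] = source

def swap_alias(aliases: dict, alias: str, new_index: str) -> None:
    # In real Elasticsearch the add + remove is one atomic _aliases call,
    # so readers never see a moment with no index behind the alias.
    aliases[alias] = new_index
```

The application only ever talks to MY_INDEX, so the swap is invisible to readers; the race this question is about happens between `reindex` finishing and `swap_alias` running.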
But what happens if, while reindexing all the documents, a document that was reindexed at the beginning is updated by a user?
Or if the same thing happens between reindexing and switching the aliases?
Possible solutions?
One way would be to use external versioning, so that a document is never overwritten by one with a lower version.
Or could it be solved another way?
Or, between switching the aliases and deleting my_index_1, reindex all documents that have been indexed since the reindexing started? But even then a document could be updated between switching the aliases and the second reindexing pass.
Or should we lock while reindexing? That seems like a bad solution...
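The external-versioning idea can be sketched like this (a toy model; in real Elasticsearch this corresponds to indexing with version_type=external, where a write succeeds only if its version is higher than the stored one):

```python
# Sketch of external versioning: keep the version alongside each document
# and refuse writes whose version is not strictly higher, so a stale copy
# produced by the reindex cannot overwrite a user's fresher update.

def index_with_external_version(index: dict, doc_id: str,
                                source: dict, version: int) -> bool:
    current = index.get(doc_id)
    if current is not None and version <= current["version"]:
        return False  # stale write (e.g. the reindex copy) is rejected
    index[doc_id] = {"source": source, "version": version}
    return True
```

If the reindexer replays a document at version 5 after a user already wrote version 6 to the new index, the replay is simply rejected.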
I think this is your real question:
But what happens if, while reindexing all the documents, a document that was reindexed at the beginning is updated by a user? Or if the same thing happens between reindexing and switching the aliases?
I just asked a question that is very close, but it still has parts that need to be resolved separately. My research, however, lets me answer this one; see that question for details and references.
To answer your question: create a second alias just before reindexing. I call it a duplicate_write_alias, and if your application sees this second alias, it writes first to the old index and then to the new one via the two aliases (the order is important to avoid a potential race). When the reindexing is done, your indexing process deletes the duplicate_write_alias and moves your MY_INDEX alias to the new MY_INDEX_2 as noted above. Do the alias switch in one atomic command.
As I noted in my question, you still have to deal with potential 'index does not exist' errors because of the remaining race between your application checking for the alias and the alias being deleted. I'm hoping there's a better answer than 'always write twice and ignore errors' or 'check and hope for the best'...
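The duplicate_write_alias idea can be sketched as follows (a simplified model with dicts standing in for indices; all names are hypothetical):

```python
# Sketch of dual writes during reindexing: while the duplicate_write_alias
# exists, the application writes first to the old index, then to the new
# one, so no user write is lost during the copy.

def write_doc(indices: dict, aliases: dict, doc_id: str, source: dict) -> None:
    targets = ["old_index"]
    if "duplicate_write_alias" in aliases:
        # Old first, then new: this ordering closes a potential race
        # between the copy loop and concurrent user writes.
        targets.append(aliases["duplicate_write_alias"])
    for name in targets:
        indices[name][doc_id] = source
```

Once the reindexing process removes the duplicate_write_alias, writes go to a single index again; the remaining race is the 'index does not exist' window mentioned above.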
I think there is also another (uglier) way:
You can disable write operations on the source index while reindexing. This makes the write APIs temporarily unusable, but it means you don't have to:
Maintain a second storage to hold the truth
Deal with inconsistency
Flag documents for deletion that should be deleted after the migration
You can use Elasticsearch's snapshot support to create snapshots between indices
You can signal users of your API to send their changes again later (when the reindexing is done)
Downsides:
You have downtime, at least for write operations
You need extra logic to handle errors in case the index is not switched back to allow-writes mode (automatic recovery, etc.)
Holding more than one index means more storage space is used.
For more information look here:
https://www.elastic.co/guide/en/elasticsearch/reference/6.2/index-modules.html
