Is it possible to undelete a document in Elasticsearch?

From reading this article (Lucene's Handling of Deleted Documents), I understand that deleted documents in Elasticsearch are simply marked as deleted, such that they may remain on the disk for some time afterwards.
I was therefore wondering if there was a way to recover deleted documents in Elasticsearch?

Deleted documents and old document versions are totally removed by the segment merging process :(
This is the moment when those old deleted documents are purged from the filesystem. Deleted documents (or old versions of updated documents) are not copied over to the new bigger segment.
https://www.elastic.co/guide/en/elasticsearch/guide/current/merge-process.html
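There is no undelete API, but you can at least watch how many deleted-but-not-yet-purged documents an index still holds. A minimal check, assuming a hypothetical index named my_index:
curl -XGET 'http://localhost:9200/_cat/indices/my_index?v&h=index,docs.count,docs.deleted'
Once docs.deleted drops to zero after a merge, those documents are gone from disk for good.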

Related

How to recover Elasticsearch deleted indices if I have a copy of data folder?

I accidentally deleted all Elasticsearch indices with a Kibana DELETE request. There was a huge amount of data there.
Immediately after that, I copied the elasticsearch/6.4.0/data folder, which I still have; it includes the indices folder.
My question is: is there any way to recover the data, or at least part of it, from that data folder?

Internals of deleting a document in elasticsearch

In Elasticsearch's documentation, Updating a document says:
Internally, Elasticsearch has marked the old document as deleted and
added an entirely new document. The old version of the document
doesn’t disappear immediately, although you won’t be able to access
it. Elasticsearch cleans up deleted documents in the background as you
continue to index more data.
And in Deleting a document:
deleting a document doesn’t immediately remove the document from
disk; it just marks it as deleted. Elasticsearch will clean up deleted
documents in the background as you continue to index more data.
Does this mean that if we never index anything, the data will be stored and marked for deletion forever but never deleted?
You can still purge documents that are marked as deleted, even if you never index anything new, by forcing a merge. Use the following command:
curl -XPOST 'http://localhost:9200/_forcemerge?only_expunge_deletes=true'
Forcemerge used to be 'optimize', but that API is now deprecated.
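The same endpoint can also be scoped to a single index rather than the whole cluster; a sketch, assuming a hypothetical index named my_index:
curl -XPOST 'http://localhost:9200/my_index/_forcemerge?only_expunge_deletes=true'
With only_expunge_deletes=true, only segments containing deleted documents are merged, which is cheaper than a full merge down to one segment.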

Documents in elasticsearch getting deleted automatically?

I'm creating an index through Logstash and pushing data to it from a MySQL database. But what I noticed in Elasticsearch was that once all the data is uploaded, it starts deleting some of the docs. The total number of docs is 160729. Without the scheduler it works fine.
I added a cron schedule to check whether new rows had been added to the table. Could that be the issue?
My logstash conf looks like this.
Where am I going wrong? Or is this behavior common?
Any help could be appreciated.
The docs.deleted number doesn't mean that your documents are being deleted, but simply that existing documents are being "updated" and the older version of the updated document is marked as deleted in the process.
Those documents marked as deleted will be eventually cleaned up as Lucene merges segments in the background.
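If you want to confirm that no live documents were actually lost, compare the number of searchable documents against your MySQL row count; a minimal sketch, assuming a hypothetical index named my_index:
curl -XGET 'http://localhost:9200/my_index/_count?pretty'
The returned count covers only live documents; versions marked as deleted are excluded.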

Elasticsearch - What is the indexing process?

I am working on a project that uses Elasticsearch. I have my core search UI working. I'm now looking to improve some things. In this process, I discovered that I do not really understand what happens during "indexing". I understand what an index is. I understand what a document is. I understand that indexing happens either a) when a document is added, b) when a document is updated, or c) when the refresh endpoint is called.
Still, I do not really understand the detail behind indexing. For example, does indexing happen if a document is removed? What really happens during indexing? I keep looking for some documentation that explains this. However, I'm not having any luck.
Can someone please explain what happens during indexing and possibly point out some documentation?
Thank you!
Indexing is a big process with a lot of steps involved. I will try to provide a brief intro to the major steps in the indexing process.
Making Text Searchable
Every word in a text field needs to be searchable. The data structure that best supports the multiple-values-per-field requirement is the inverted index. The inverted index contains a sorted list of all of the unique values, or terms, that occur in any document and, for each term, a list of all the documents that contain it.
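You can see the terms that will end up in the inverted index by running text through the _analyze API; a quick sketch (the analyzer and sample text are only illustrative):
curl -XGET 'http://localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d '
{
  "analyzer": "standard",
  "text": "The QUICK brown foxes"
}'
The response lists the lowercased tokens (the, quick, brown, foxes) that become terms in the index.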
Updating Index :
First of all, please note that a Lucene index is immutable.
Hence, for any write operation (create, update, delete), instead of rewriting the whole inverted index, Lucene adds new supplementary indices to reflect more-recent changes.
Indexing Process
New documents are collected in an in-memory indexing buffer.
Every so often, the buffer is committed:
A new segment—a supplementary inverted index—is written to disk.
A new commit point is written to disk, which includes the name of the new segment.
The disk is fsync’ed—all writes waiting in the filesystem cache are flushed to disk, to ensure that they have been physically written.
The new segment is opened, making the documents it contains visible to search.
The in-memory buffer is cleared, and is ready to accept new documents.
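You can trigger parts of this cycle by hand; a minimal sketch, assuming a hypothetical index named my_index:
# make recently indexed documents visible to search
curl -XPOST 'http://localhost:9200/my_index/_refresh'
# force a full commit: fsync segments to disk and clear the transaction log
curl -XPOST 'http://localhost:9200/my_index/_flush'
By default Elasticsearch refreshes every second, which is why it is described as near-real-time search.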
What happens in case of Delete
Segments are immutable, so documents cannot be removed from older segments.
When a document is “deleted,” it is actually just marked as deleted in the .del file. A document that has been marked as deleted can still match a query, but it is removed from the results list before the final query results are returned.
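The per-segment deleted counts are visible through the _cat API; a sketch, again assuming a hypothetical index named my_index:
curl -XGET 'http://localhost:9200/_cat/segments/my_index?v&h=segment,docs.count,docs.deleted'
Here docs.deleted is the number of documents marked in that segment's .del file.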
When is it actually removed
In Segment Merging, deleted documents are purged from the filesystem.
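You can bring that purge forward with a force merge; a sketch for a hypothetical index named my_index (merging down to one segment is expensive and is best reserved for indices that are no longer being written to):
curl -XPOST 'http://localhost:9200/my_index/_forcemerge?max_num_segments=1'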
References:
Elasticsearch Docs
Inverted Index
Lucene Talks

Backing up, Deleting, Restoring Elasticsearch Indexes By Index Folder

Most of the ElasticSearch documentation discusses working with the indexes through the REST API - is there any reason I can't simply move or delete index folders from the disk?
You can move data around on disk, to a point.
If Elasticsearch is running, it is never a good idea to move or delete the index folders, because Elasticsearch will not know what happened to the data, and you will get all kinds of FileNotFoundExceptions in the logs, as well as indices that are red until you manually delete them.
If Elasticsearch is not running, you can move index folders to another node (for instance, if you were decommissioning a node permanently and needed to get the data off). However, if you delete or move the folder to a place where Elasticsearch cannot see it when the service is restarted, then Elasticsearch will be unhappy. This is because Elasticsearch writes what is known as the cluster state to disk, and in this cluster state the indices are recorded. So if ES starts up and expects to find index "foo", but you have deleted the "foo" index directory, the index will stay in a red state until it is deleted through the REST API.
Because of this, I would recommend that if you want to move or delete individual index folders from disk, you use the REST API whenever possible, as it's possible to get ES into an unhappy state if you delete a folder that it expects to find an index in.
EDIT: I should mention that it's safe to copy (for backups) an indices folder, from the perspective of Elasticsearch, because it doesn't modify the contents of the folder. Sometimes people do this to perform backups outside of the snapshot & restore API.
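For reference, the supported way looks roughly like this; a minimal sketch, assuming a hypothetical repository named my_backup whose location is listed under path.repo in elasticsearch.yml:
# register a shared-filesystem snapshot repository
curl -XPUT 'http://localhost:9200/_snapshot/my_backup' -H 'Content-Type: application/json' -d '
{ "type": "fs", "settings": { "location": "/mount/backups/my_backup" } }'
# snapshot the whole cluster and wait for it to finish
curl -XPUT 'http://localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true'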
I use this procedure: I close, back up, then delete the indexes.
curl -XPOST "http://127.0.0.1:9200/index_name/_close"
After this point all index data is on disk and in a consistent state, and no writes are possible. I copy the directory where the index is stored and then delete it:
curl -XDELETE "http://127.0.0.1:9200/index_name"
By closing the index, Elasticsearch stops all access to it. Then I send a command to delete the index (and all corresponding files on disk).
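Put together, the procedure looks roughly like this; the data path below is an assumption (it varies by Elasticsearch version and configuration), and index_name is a placeholder:
# stop all reads and writes; index data is now consistent on disk
curl -XPOST 'http://127.0.0.1:9200/index_name/_close'
# copy the index directory somewhere safe (actual path varies by version and config)
cp -r /var/lib/elasticsearch/nodes/0/indices/index_name /backups/
# remove the index and its files from the cluster
curl -XDELETE 'http://127.0.0.1:9200/index_name'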
