Most of the Elasticsearch documentation discusses working with indices through the REST API - is there any reason I can't simply move or delete index folders on disk?
You can move data around on disk, to a point -
If Elasticsearch is running, it is never a good idea to move or delete the index
folders, because Elasticsearch will not know what happened to the data, and you
will get all kinds of FileNotFoundExceptions in the logs as well as indices
that are red until you manually delete them.
If Elasticsearch is not running, you can move index folders to another node (for
instance, if you were decommissioning a node permanently and needed to get the
data off). However, if you delete the folder, or move it to a place where
Elasticsearch cannot see it when the service is restarted, then Elasticsearch
will be unhappy. This is because Elasticsearch writes what is known as the
cluster state to disk, and in this cluster state the indices are recorded, so if
ES starts up and expects to find index "foo", but you have deleted the "foo"
index directory, the index will stay in a red state until it is deleted through
the REST API.
Because of this, if you want to move or delete individual index folders from
disk, I would recommend using the REST API whenever possible, as deleting a
folder where Elasticsearch expects to find an index can leave it in an unhappy
state.
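For completeness, removing such a leftover red index through the REST API is a single call (using the hypothetical index name "foo" from the example above; adjust host and port for your cluster):

```shell
# Delete the red "foo" index so it is also removed from the cluster state:
curl -XDELETE "http://localhost:9200/foo"
```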
EDIT: I should mention that, from Elasticsearch's perspective, it is safe to
copy an index folder (for backups), because copying does not modify the folder's
contents. Some people do this to take backups outside of the snapshot & restore
API.
I use this procedure: I close, back up, and then delete the index.
curl -XPOST "http://127.0.0.1:9200/*index_name*/_close"
After this point, all index data is on disk in a consistent state and no writes are possible. I copy the directory where the index is stored and then delete the index:
curl -XDELETE "http://127.0.0.1:9200/*index_name*"
Closing the index makes Elasticsearch stop all access to it. The delete command then removes the index and all of its corresponding files on disk.
I have a production index with 1 replica (about 1 TB in total). New data is constantly written to this index (many updates and creates).
When I created a copy of this index by running _reindex (same data, 1 replica as well), the new index took only 600 GB.
It looks like the original index contains a lot of junk that could be cleaned up, but I'm not sure how to do it.
The questions: how do I clean up the index (without _reindex), why does this happen, and how do I prevent it in the future?
Lucene segment files are immutable, so when you delete or update a document (Lucene can't update a doc in place), the old version is just marked as deleted but not actually removed from disk. ES runs merge operations periodically to "defragment" the data, but you can also trigger a merge manually with _forcemerge (try running with only_expunge_deletes as well; it might be faster).
Also, make sure your shards are sized correctly and use ILM rollover to keep index size under control.
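The segment mechanics described above can be illustrated with a toy model. This is plain Python for illustration only, not Elasticsearch code: segments are immutable, a delete or update only records a tombstone, and disk space is reclaimed only when a merge rewrites the live documents into a new segment.

```python
# Toy model of Lucene-style segments; an illustration, not real Elasticsearch internals.

class Segment:
    """An immutable batch of documents plus a set of tombstones (deleted doc ids)."""
    def __init__(self, docs):
        self.docs = dict(docs)      # doc_id -> body; never modified after creation
        self.tombstones = set()     # ids marked deleted, but still on "disk"

    def size_on_disk(self):
        # Tombstoned docs still occupy space until a merge rewrites the segment.
        return len(self.docs)

class ToyIndex:
    def __init__(self):
        self.segments = []

    def add_segment(self, docs):
        self.segments.append(Segment(docs))

    def delete(self, doc_id):
        # A delete is just a tombstone; nothing is removed from "disk".
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.tombstones.add(doc_id)

    def update(self, doc_id, body):
        # An update is a delete (tombstone) plus a fresh doc in a new segment.
        self.delete(doc_id)
        self.add_segment({doc_id: body})

    def live_docs(self):
        out = {}
        for seg in self.segments:
            for doc_id, body in seg.docs.items():
                if doc_id not in seg.tombstones:
                    out[doc_id] = body
        return out

    def size_on_disk(self):
        return sum(seg.size_on_disk() for seg in self.segments)

    def force_merge(self):
        # Rewrite all live docs into one new segment and drop the old ones;
        # only now do tombstoned docs stop taking up space.
        self.segments = [Segment(self.live_docs())]

idx = ToyIndex()
idx.add_segment({1: "a", 2: "b", 3: "c"})
idx.update(1, "a2")          # old copy of doc 1 is tombstoned, not removed
idx.delete(2)
print(idx.size_on_disk())    # 4 docs on disk, only 2 of them live
idx.force_merge()
print(idx.size_on_disk())    # 2: space is finally reclaimed
```

This is why a freshly reindexed copy can be much smaller than the original: the copy contains only live documents, while the original is still carrying tombstoned versions that haven't been merged away yet.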
How do people keep track of all the changes made to an Elasticsearch index over time, so that if I have to rebuild the index from scratch to match the existing one, I can do so in minutes? Do people keep logs of all the PUT calls made over time to update the mappings and other settings?
I guess one way is to use a snapshot: a backup taken from a running Elasticsearch cluster or index. You can take a snapshot of an individual index or of the entire cluster and store it in a repository on a shared filesystem. A snapshot contains a copy of the on-disk data structures and mappings that make up an index. In addition, when you create a snapshot of an index, Elasticsearch avoids copying any data that is already stored in the repository as part of an earlier snapshot, so you can rebuild or restore an index to the state of the last snapshot very quickly.
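A sketch of that workflow with curl; the repository name "my_backup", the path, and the index and snapshot names are placeholders for your own values, and the path must be listed under path.repo in elasticsearch.yml on every node:

```shell
# 1. Register a shared-filesystem snapshot repository:
curl -XPUT "http://localhost:9200/_snapshot/my_backup" \
  -H 'Content-Type: application/json' \
  -d '{"type": "fs", "settings": {"location": "/mnt/es_backups"}}'

# 2. Snapshot one index (incremental against earlier snapshots in the repo):
curl -XPUT "http://localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true" \
  -H 'Content-Type: application/json' \
  -d '{"indices": "my_index"}'

# 3. Restore it later when rebuilding from scratch:
curl -XPOST "http://localhost:9200/_snapshot/my_backup/snapshot_1/_restore" \
  -H 'Content-Type: application/json' \
  -d '{"indices": "my_index"}'
```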
I accidentally deleted all my Elasticsearch indices with a DELETE request in Kibana; there was a huge amount of data there.
Immediately after that, I copied the elasticsearch/6.4.0/data folder, which I still have, and it includes the indices folder.
My question is: is there any way to recover the data, or at least part of it, from that data folder?
I copied 1TB of data to a cloud server, then ran Elasticsearch on that folder. Things seemed to index great. However, I noticed that hard disk space went from 33% used to 90% used. So it seems Elastic must have copied the source directory? Can I now delete that 1TB of original data from that machine?
If you run GET _stats/?human you'll see lots of details from your cluster, like how much storage you are using or how many documents you have added. If you have all the data you want in your cluster and it's correctly structured, you can delete the original data. Elasticsearch has its own copy.
BTW, by default you will get 1 replica if you have more than 1 node, so 1 primary and 1 replica copy of the data. If you have a single node, there will only be the primary one.
In Elasticsearch's documentation Updating a document says:
Internally, Elasticsearch has marked the old document as deleted and
added an entirely new document. The old version of the document
doesn’t disappear immediately, although you won’t be able to access
it. Elasticsearch cleans up deleted documents in the background as you
continue to index more data.
And in Deleting a document:
deleting a document doesn’t immediately remove the document from
disk; it just marks it as deleted. Elasticsearch will clean up deleted
documents in the background as you continue to index more data.
Does this mean that if we never index anything, the data will be stored and marked for deletion forever but never deleted?
You can still remove documents that are marked as deleted, even without indexing more data, by forcing a merge. Use the following command -
curl -XPOST 'http://localhost:9200/_forcemerge?only_expunge_deletes=true'
Forcemerge used to be called 'optimize', but that is now deprecated.
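To see how many deleted-but-not-yet-merged documents each index is carrying, you can check the docs.deleted column of the cat indices API (adjust host and port for your cluster):

```shell
# docs.deleted shows documents marked deleted but not yet merged away:
curl "http://localhost:9200/_cat/indices?v"
```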