When I delete documents from Elasticsearch, why does my 'total size' stay the same, even though the index should obviously be far smaller now that the previously stored data is gone?
I've read about index optimization but I'm not sure what this is or how to do it.
Thanks
I'm sure there are tons of questions relating to this on both SO and Google, so this may be a duplicate answer. However, deleting documents only marks them as deleted; it doesn't actually remove them from your data store.
In old ES, there used to be a feature named 'optimize' (now deprecated); nowadays forcemerge is the enhanced replacement. The following command should free up the space you're entitled to.
curl -XPOST 'http://localhost:9200/_forcemerge?only_expunge_deletes=true'
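One way to verify the merge actually reclaimed space is the _cat/indices API, which shows the deleted-document count and store size per index:
curl -XGET 'http://localhost:9200/_cat/indices?v&h=index,docs.count,docs.deleted,store.size'
If docs.deleted drops to 0 and store.size shrinks after the forcemerge, the space was reclaimed.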
Here's a bit more info on forcemerge if you're interested:
https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up
Related
I have 2 TB of indices. Manually deleting some indices does remove them from Kibana, etc.: the delete is acknowledged via curl or Kibana and the index is removed. It is, however, not freeing up the space.
I went ahead and also removed the ILM from the index before deleting a few indices, still no luck.
Although I removed a whole index, I also tried POST _forcemerge, to no avail.
How can I recover space now that the indices are deleted?
For those who look at this later
Deleting a whole index should free up space instantly! It does not require _forcemerge or anything similar.
The issue here was the use of a ZFS file system, which required a snapshot to be cleared before the space was recovered.
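For anyone in the same situation: a ZFS snapshot keeps referencing the deleted blocks, so the space only comes back once the snapshot is destroyed. A minimal sketch, assuming a hypothetical dataset (tank/elasticsearch) and snapshot name:
zfs list -t snapshot                          # list snapshots still holding the blocks
zfs destroy tank/elasticsearch@daily-2020.01  # destroy the offending snapshot (placeholder name)
Destroying the snapshot frees any blocks that no other snapshot or the live dataset still references.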
I'm using Elasticsearch 7.5.2 on Ubuntu. Recently, I began using Elasticsearch to display relevant search results on every page load. This shot up the query volume, but I also found that it created large index files. Note that I'm using 'app-search' to power my queries.
Here are the sample index files that are occupying too much space:
.app-search-analytics-logs-loco_togo_production-7.1.0-2020.01.26 => 52 GB
.app-search-analytics-logs-loco_togo_production-7.1.0-2020.01.27 => 53 GB
I tried deleting these using curl, but they reappear, showing less space (~5 GB each).
I want to know if there is a way to control these indices. I'm not sure what purpose these indices serve, or whether there is a way to prevent them.
I tried deleting these using curl, but they reappear, showing less space (~5 GB each).
Your delete action was clearly executed. It seems that the indices are still being written to: if documents keep arriving in Elasticsearch, the index gets re-created.
So for example:
The index from 2020.01.27 held 53 GB before the deletion. After you delete it, the data is gone, and the index itself too. But as soon as new documents for that very same day (2020.01.27) get indexed, the index is re-created, containing only the documents indexed after the deletion, which is probably the ~5 GB you see.
If this is not what you want, you need to check if there are some sources still sending data.
Hope this helps.
EDIT:
Q: However, is there a way to manage these indices? I don't want them to eat up too much space.
Yes! Index Lifecycle Management (ILM) is what you are looking for. It aims to automate the maintenance/management of indices. For example, you could define a rollover to a new index every 30 GB in order to keep indices small, or delete an index after X days. Take a look at all the phases and actions; a minimal policy sketch follows.
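As a minimal sketch (the policy name and exact thresholds are placeholders, not anything app-search ships with), a policy that rolls over at 30 GB and deletes indices 30 days after rollover could look like this:
curl -XPUT 'http://localhost:9200/_ilm/policy/my_cleanup_policy' -H 'Content-Type: application/json' -d '
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "30gb" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}'
The policy then has to be attached to the indices (or their index template) via the index.lifecycle.name setting.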
I am working on a project that uses Elasticsearch. I have my core search UI working. I'm now looking to improve some things. In this process, I discovered that I do not really understand what happens during "indexing". I understand what an index is. I understand what a document is. I understand that indexing happens either a) when a document is added, b) when a document is updated, or c) when the refresh endpoint is called.
Still, I do not really understand the details behind indexing. For example, does indexing happen if a document is removed? What really happens during indexing? I keep looking for documentation that explains this, but I'm not having any luck.
Can someone please explain what happens during indexing and possibly point out some documentation?
Thank you!
Indexing is a big process with a lot of steps involved. I will try to provide a brief intro to the major steps in the indexing process.
Making Text Searchable
Every word in a text field needs to be searchable.
The data structure that best supports the multiple-values-per-field requirement is the inverted index. The inverted index contains a sorted list of all of the unique values, or terms, that occur in any document and, for each term, a list of all the documents that contain it.
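As a tiny made-up example, take two documents:
Doc 1: "the quick brown fox"
Doc 2: "the lazy brown dog"
The inverted index conceptually built from them maps each term to the documents that contain it:
brown -> 1, 2
dog   -> 2
fox   -> 1
lazy  -> 2
quick -> 1
the   -> 1, 2
A query for "brown fox" simply intersects the lists for brown and fox, yielding document 1.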
Updating the Index:
First of all, please note that a Lucene index is immutable: once a segment has been written, it is never modified.
Hence, for any create, update, or delete operation (CRUD minus the R), instead of rewriting the whole inverted index, Lucene adds new supplementary indices to reflect the more recent changes.
Indexing Process
New documents are collected in an in-memory indexing buffer.
Every so often, the buffer is committed:
A new segment—a supplementary inverted index—is written to disk.
A new commit point is written to disk, which includes the name of the new segment.
The disk is fsync’ed—all writes waiting in the filesystem cache are flushed to disk, to ensure that they have been physically written.
The new segment is opened, making the documents it contains visible to search.
The in-memory buffer is cleared, and is ready to accept new documents.
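In Elasticsearch terms, the lightweight step that opens a new searchable segment is a refresh, while the full fsync'ed commit is a flush. Both run automatically, but can also be triggered by hand (my_index is a placeholder name):
curl -XPOST 'http://localhost:9200/my_index/_refresh'   # make recently indexed docs searchable
curl -XPOST 'http://localhost:9200/my_index/_flush'     # commit and fsync segments to disk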
What happens in case of Delete
Segments are immutable, so documents cannot be removed from older segments.
When a document is “deleted,” it is actually just marked as deleted in the .del file. A document that has been marked as deleted can still match a query, but it is removed from the results list before the final query results are returned.
When is it actually removed
During segment merging, deleted documents are purged from the filesystem.
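Merging normally happens automatically in the background, but it can also be forced. A sketch against a hypothetical index named my_index (on modern versions; older releases used _optimize instead of _forcemerge):
curl -XPOST 'http://localhost:9200/my_index/_forcemerge?max_num_segments=1'
Merging down to a single segment rewrites the live documents into a new segment and drops the ones marked as deleted.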
References:
Elasticsearch Docs
Inverted Index
Lucene Talks
I'm new to ES, so the question may be somewhat stupid, but:
I was experimenting with ES: creating an index, putting some data into it (1 million records), deleting it afterwards, and creating the same index again (with the same name).
It seems that ES is not actually deleting the data in the index (via curl DELETE), as the disk space is not freed after all the deletes; right now, 1 million records seem to take 40 GB of disk space.
Is there any way to remove the deleted data completely so it actually frees the space?
If it's just for experimentation, a quick and dirty way would be to delete your data directory.
Another way to reclaim disk space is to run this command (on recent versions, _optimize has been replaced by _forcemerge):
curl -XPOST 'http://localhost:9200/_optimize?only_expunge_deletes=true'
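Since the experiment re-creates the index from scratch anyway, another option is to delete the whole index rather than the documents inside it; a whole-index delete releases the space immediately, with no merge needed (the index name is a placeholder):
curl -XDELETE 'http://localhost:9200/my_index'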
My understanding was that Elasticsearch would store the latest copy of the document and just update the version field number. But I was playing around with a few thousand documents and had the need to index them repeatedly without changing any data in the documents. My thinking was that the index size would remain the same, but that wasn't the case: the index size seemed to increase.
This confused me a little bit, so i just wanted to seek clarification on the internal mechanism of versioning within elasticsearch.
An update is a Delete + Insert Lucene operation behind the scenes.
But you should know that Lucene does not really delete the document; it marks it as deleted.
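A quick way to see this is to index the same document twice and check the deleted-docs counter; the first copy shows up as a deleted doc until a merge purges it. A sketch (the endpoint paths follow newer ES versions; the twitter index is just the one from the question):
curl -XPUT 'http://localhost:9200/twitter/_doc/1' -H 'Content-Type: application/json' -d '{"user": "kimchy"}'
curl -XPUT 'http://localhost:9200/twitter/_doc/1' -H 'Content-Type: application/json' -d '{"user": "kimchy"}'
curl -XPOST 'http://localhost:9200/twitter/_refresh'
curl -XGET 'http://localhost:9200/_cat/indices/twitter?v&h=docs.count,docs.deleted'
The second PUT bumps _version to 2, docs.count stays at 1, and docs.deleted shows the superseded copy.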
To remove deleted docs, you have to optimize your Lucene segments.
$ curl -XPOST 'http://localhost:9200/twitter/_optimize?only_expunge_deletes=true'
See the Optimize API. Also have a look at the merge options. Merging segments happens behind the scenes over time.
For a general overview of versioning support in Elasticsearch, please refer to the Elasticsearch Versioning Support.