On my Elasticsearch server:
total documents: 3 million, total size: 3.6G
Then I delete about 2.8 million documents:
total documents: about 0.13 million, total size: 3.6G
I have deleted the documents; how can I free the disk space they occupied?
Deleting documents only flags them as deleted, so they are no longer searched, but they still occupy disk space. To reclaim the disk space, you have to optimize the index:
curl -XPOST 'http://localhost:9200/_optimize?only_expunge_deletes=true'
documentation: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-optimize.html
The documentation has moved to:
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html
Update
Starting with Elasticsearch 2.1.x, optimize is deprecated in favor of forcemerge.
The API is the same; only the endpoint has changed.
curl -XPOST 'http://localhost:9200/_forcemerge?only_expunge_deletes=true'
In the current Elasticsearch version (7.5):
To optimize all indices:
POST /_forcemerge?only_expunge_deletes=true
To optimize a single index:
POST /twitter/_forcemerge?only_expunge_deletes=true (where twitter is the index)
To optimize several indices:
POST /twitter,facebook/_forcemerge?only_expunge_deletes=true (where twitter and facebook are the indices)
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/7.5/indices-forcemerge.html#indices-forcemerge
knutwalker's answer is correct. However, if you are using AWS Elasticsearch and want to free storage space, this will not quite work.
On AWS, the index to force-merge must be specified in the URL. It can include wildcards, as is common with index rotation.
curl -XPOST 'https://something.es.amazonaws.com/index-*/_forcemerge?only_expunge_deletes=true'
AWS publishes a list of Elasticsearch API differences.
I just want to note that the 7.15 docs for the Force Merge API include this warning:
Force merge should only be called against an index after you have finished writing to it. Force merge can cause very large (>5GB) segments to be produced, and if you continue to write to such an index then the automatic merge policy will never consider these segments for future merges until they mostly consist of deleted documents. This can cause very large segments to remain in the index which can result in increased disk usage and worse search performance.
So you should shut down writes to the index before beginning.
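If you want to be explicit about this, one option is to block writes on the index via its settings before merging and unblock afterwards. A minimal sketch, assuming a local node and a placeholder index name my-index (index.blocks.write is a standard index setting):
curl -XPUT 'http://localhost:9200/my-index/_settings' -H 'Content-Type: application/json' -d '{"index.blocks.write": true}'
curl -XPOST 'http://localhost:9200/my-index/_forcemerge?only_expunge_deletes=true'
curl -XPUT 'http://localhost:9200/my-index/_settings' -H 'Content-Type: application/json' -d '{"index.blocks.write": false}'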
Replace indexname with your index name; the space is freed once the merge completes:
curl -XPOST 'http://localhost:9200/indexname/_forcemerge?only_expunge_deletes=false&max_num_segments=1'
Note that max_num_segments=1 merges the whole index down to a single segment, which can take a while on a large index.
Related
I have a basic question regarding Elasticsearch.
As per documentation : By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds.
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html#refresh-api-desc
Also as per documentation: When a document is stored, it is indexed and fully searchable in near real-time--within 1 second.
Reference : https://www.elastic.co/guide/en/elasticsearch/reference/7.14/documents-indices.html
So when a write happens, indexing happens. When no writes are happening and the documents are already indexed, why does Elasticsearch refresh existing documents every second?
It's not indexing existing documents; that has already been done.
It's checking whether there are any in-memory indexing requests that need to be written to disk to make those documents searchable.
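Both knobs are exposed through standard APIs if the default cadence doesn't suit you. A minimal sketch, assuming a local node and a placeholder index my_index:
curl -XPUT 'http://localhost:9200/my_index/_settings' -H 'Content-Type: application/json' -d '{"index": {"refresh_interval": "30s"}}'
curl -XPOST 'http://localhost:9200/my_index/_refresh'
The first call slows the automatic refresh to every 30 seconds; the second forces an immediate refresh when you need newly written documents to be searchable right away.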
Steps:
- Elasticsearch 2.3
- create documents in ES => ~1 GB of disk is used
- update the same documents in ES => ~2 GB of disk is used
Why does this happen?
Is it due to versioning?
Is it possible to avoid doubling disk usage?
Currently we use force merge (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html), but it takes some hours.
When you index a document in ES that already exists, ES will mark the previous document as deleted (but won't immediately remove it from the index), and index the new document.
Effectively, if your document weighs 1K, once you have reindexed a new version of your document, the space taken by the first document won't be reclaimed immediately. So, the first "version" of the document takes 1K and the second "version" of the document another 1K. The only way to remove deleted documents is to call the Force Merge API as you have discovered, or to wait until segments are merged automatically under the hood. You should not really have to worry about this process.
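You can watch this in action with the cat indices API (a standard endpoint; assuming a local node), which reports a docs.deleted count per index. The count grows as you update documents and shrinks again once segments are merged:
curl -XGET 'http://localhost:9200/_cat/indices?v'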
I am running Elasticsearch on a personal machine that only has so much memory. I'd like to use all of the memory at any given time for whatever problem I'm working on, but make it easy to switch between projects.
For example, I have a project involving a large text corpus, and a different project with geospatial data. I'd like to switch Elasticsearch from indexing one to the other without reindexing all the documents.
Is there an easier way to do this than to do a backup/reload of the index?
ES has an open/close index API:
curl -XPOST 'localhost:9200/my_index/_close'
curl -XPOST 'localhost:9200/my_index/_open'
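A closed index releases its heap but stays on disk. As a small sketch (h is the standard cat-API parameter for selecting columns), you can check each index's status:
curl -XGET 'http://localhost:9200/_cat/indices?v&h=index,status,store.size'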
How can I delete ES clusters?
Every time I start ES locally, it brings all my indexes back into the cluster state, which is now up to 33 indexes, and I believe this is taking up much of my RAM (8 GB).
I only have 3 very small indexes, the biggest being just about 3 MB.
Simply delete all the indices that you do not need. Have a look at https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-delete-index.html
You can delete whichever indices you no longer need.
For example, list your indices:
curl -X GET 'http://127.0.0.1:9200/_cat/indices?v'
Say you have an index called web; just delete it:
curl -X DELETE http://127.0.0.1:9200/web
I'm wondering how Elasticsearch searches so fast. Does it use an inverted index, and how is that represented in memory? How is it stored on disk? How is it loaded from disk into memory? And how does it merge indexes so fast (I mean, when searching, how does it combine two lists so fast)?
Elasticsearch uses Lucene to store its inverted document indexes. Lucene in turn stores the inverted index data in read-only files called segments. Each segment contains some of the documents. The segments are read-only and are never changed; to delete or update documents, Elasticsearch maintains a delete/update list which is used to mask stale results coming from the read-only segments.
With this approach, some segments eventually become obsolete altogether or contain only a little up-to-date data. Such segments are rewritten (merged) or deleted.
There is an interesting elasticsearch plugin which visualizes the segments and the rewriting process:
https://github.com/polyfractal/elasticsearch-segmentspy
To see it in action, start indexing a lot of data and watch the segment information.
With the Segment API you can retrieve information about the segments:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-segments.html
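For instance (assuming a local node; my_index is a placeholder), you can inspect the segments of all indices or of a single index:
curl -XGET 'http://localhost:9200/_segments'
curl -XGET 'http://localhost:9200/my_index/_segments'
The response lists, per shard, each segment's size, document count, and deleted-document count.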
I'll share what I know of Elasticsearch (ES). Yes, ES uses an inverted index. Here is how it would be structured if we analyze these documents by splitting on whitespace and dropping punctuation:
{
  "_id": 1,
  "text": "Hello, John"
}
AND
{
  "_id": 2,
  "text": "Bonjour, John"
}
INVERTED INDEX
Word    | Docs
___________________
Hello   | 1
Bonjour | 2
John    | 1 & 2
This index is built at index time. Each document is allocated to a shard based on a hash of its document ID. Whenever a search request is made, a lookup is performed on all shards, and the per-shard results are then merged and returned to the requester. The results come back blazingly fast thanks to the inverted index.
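As a side note, you can see exactly how ES tokenizes a string with the standard _analyze API; a quick sketch (the analyzer choice is illustrative):
curl -XGET 'http://localhost:9200/_analyze' -H 'Content-Type: application/json' -d '{"analyzer": "whitespace", "text": "Hello, John"}'
With the whitespace analyzer the comma stays attached to the first token ("Hello,"); analyzers such as standard strip punctuation and lowercase the tokens.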
ES stores its data within the data folder, which is created once you have launched ES and created an index. The file structure resembles /data/clustername/nodes/...; if you look into this directory, you will see how it is organised. You can also configure where and how ES stores its index data, for instance in memory or on disk.
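As a hedged sketch (the paths are illustrative), the on-disk locations are set in elasticsearch.yml:
path.data: /var/data/elasticsearch
path.logs: /var/log/elasticsearch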
There is plenty of information on the ES website, and there are also several published books on ES.