steps:
- elasticsearch 2.3
- create documents in ES => ~ 1 GB of disk is used
- update same documents in ES => ~ 2 GB of disk is used
Why it happens?
Is it due to versioning?
Is it possible to avoid doubling disk usage?
Currently we use forcemerge (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-forcemerge.html) but it takes some hours.
When you index a document in ES that already exists, ES will mark the previous document as deleted (but won't immediately remove it from the index), and index the new document.
Effectively, if your document weighs 1K, once you have reindexed a new version of your document, the space taken by the first document won't be reclaimed immediately. So, the first "version" of the document takes 1K and the second "version" of the document another 1K. The only way to remove deleted documents is to call the Force Merge API as you have discovered, or to wait until segments are merged automatically under the hood. You should not really have to worry about this process.
Related
I have index in production with 1 replica (this takes total ~ 1TB). Into this index every time coming new data (a lot of updates and creates).
When i have created the copy of this index - by running _reindex(with the same data and 1 replica as well) - the new index takes 600 GB.
Looks like there is a lot of junk and some kind of logs in original index which possible to cleanup. But not sure how to do it.
The questions: how to cleanup the index (without _reindex), why this is happening and how to prevent for it in the future?
Lucene segment files are immutable so when you delete or update (since it can't update doc in place) a document, old version is just marked deleted but not actually removed from disk. ES runs merge operation periodically to "defragment" the data but you can also trigger merge manually with _forcemerge (try running with only_expunge_deletes as well: it might be faster).
Also, make sure your shards are sized correctly and use ILM rollover to keep index size under control.
I have basic question regarding elastic search.
As per documentation : By default, Elasticsearch periodically refreshes indices every second, but only on indices that have received one search request or more in the last 30 seconds.
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html#refresh-api-desc
Also as per documentation: When a document is stored, it is indexed and fully searchable in near real-time--within 1 second.
Reference : https://www.elastic.co/guide/en/elasticsearch/reference/7.14/documents-indices.html
So when write happens, indexing happen. When write is not happening and documents are already indexed, then why elastic search indexes every 1 second existing documents?
it's not indexing existing documents, that's already been done
it's checking to see if it needs to write any in memory indexing requests that need to be written to disk to make them searchable
I'm using Elasticsearch 7.5.2 on Ubuntu. Recently, I began using Elasticsearch to display relevant search results on every page load. This shot up the volume, but I also found out that it has created large index files. Note that I'm using 'app-search' to power my queries.
Here's the sample index files that are occupying too much space:
.app-search-analytics-logs-loco_togo_production-7.1.0-2020.01.26 => 52 GB
.app-search-analytics-logs-loco_togo_production-7.1.0-2020.01.27 => 53 GB
I tried deleting these using CURL, but they reappear and show lesser space (~5 GB each).
I want to know if there is a way to control these indexes. I'm not sure what purpose do these indices solve and if there is a way to prevent them?
I tried deleting these using CURL, but they reappear and show lesser space (~5 GB each).
Obviously your delete-action was executed. It seems like that the indices still get written to. If documents still get into elasticsearch, the index gets re-created.
So for example:
The index from 2020.01.27 has 53 GB before the deletion. After you delete it, the data is gone and the index itself too. But as soon as new documents of the very same day (2020.01.27) get indexed, the index gets re-created containing the documents after the deletion which is probably the 5GB.
If this is not what you want, you need to check if there are some sources still sending data.
Hope this helps.
EDIT:
Q: However, is there a way to manage these indices? I don't want them to eat up too much space.
Yes! Index Lifecycle Management (ILM) is what you are looking for. It aims to automate the maintenance/management of indices. So for example you could define a rollover every 30GB to a new index in order to keep them small. Another example is to delete the index after X days. Take a look at all the phases and actions.
I am running Elasticsearch 6.2.4. I have a program that will automatically create an index for me as well as the mappings necessary for my data. For this issue, I created an index called "landsat" but it needs to actually be named "landsat_8", so I chose to reindex. The original "landsat" index has 2 shards and 0 read replicas. The store size is ~13.4gb with ~6.6gb per shard and the index holds just over 515k documents.
I created a new index called "landsat_8" with 5 shards, 1 read replica, and started a reindex with no special options. On a very small Elastic Cloud cluster (4GB RAM), it finished in 8 minutes. It was interesting to see that the final store size was only 4.2gb, yet it still held all 515k documents.
After it was finished, I realized that I failed to create my mappings before reindexing, so I blew it away and started over. I was shocked to find that after an hour, the /cat/_indices endpoint showed that only 7.5gb of data and 154,800 documents had been reindexed. 4 hours later, the entire job seemed to have died at 13.1gb, but it only showed 254,000 documents had been reindexed.
On this small 4gb cluster, this reindex operation was maxing out CPU. I increased the cluster to the biggest one Elastic Cloud offered (64gb ram), 5 shards, 0 RR and started the job again. This time, I set the refresh_interval on the new index to -1 and changed the size for the reindex operation to 2000. Long story short, this job ended in somewhere between 1h10m and 1h19m. However, this time I ended up with a total store size of 25gb, where each shard held ~5gb.
I'm very confused as to why the reindex operation causes such wildly different results in store size and reindex performance. Why, when I don't explicitly define any mappings and let ES automatically create mappings, is the store size so much smaller? And why, when I use the exact same mappings as the original index, is the store so much bigger?
Any advice would be greatly appreciated. Thank you!
UPDATE 1:
Here are the only differences in mappings:
The left image is "landsat" and the right image is "landsat_8". There is a root level "type" field and a nested "properties.type" field in the original "landsat" index. I forgot one of my goals was to remove the field "properties.type" from the data during the reindex. I seem to have been successful in doing so, but at the same time, accidentally renamed the root-level "type" field mapping to "provider", thus "landsat_8" has an unused "provider" mapping and an auto-created "type" mapping.
So there are some problems here, but I wouldn't think this would nearly double my store size...
I have a ~3m document (70GB) elastic index that I wish to reindex to an updated template mapping. I issue the basic reindex source -> destination REST call from kibana and it returns a task id. The designated task shows the correct number of documents that need to be reindexed into the new index. However the task completes with only a small fraction (6000 - approx 1/2 GB) of these actually reindexed. I have noticed this trouble on approximately 40% of my reindexes to varying degrees.
I cannot see what the issue may be on my side. The updated template is based on the source one save a few additional multifields therefore is type compatible.
What are the possible issues?