Elasticsearch: Can i safely delete shrunk index? - elasticsearch

I used the Shrink API to shrink an index from 500 shards down to 100. But it seems that no new disk space was taken after the operation was complete.
Can i safely delete the original index with 500 shards? or does the new index rely on that?

Can i safely delete the original index with 500 shards? or does the new index rely on that?
Yes, deleting an index does not affect any other indices.
But it seems that no new disk space was taken after the operation was complete.
This is expected, at least initially. The documentation for the shrink API notes that usually the shrunken index is made by hard-linking the files of the original index rather than copying them. Hard-linking works pretty much just like copying except that it doesn't take up any extra space. In particular, if you have two hard links to the same underlying data then you can delete one of them without affecting the other, and this is the property that Elasticsearch is using when shrinking.

Related

Does huge number of deleted doc count affects ES query performance

I have few read heavy indices(started seeing performance issues on these indices) in my ES cluster which has ~50 million docs and noticed most of them have around 25% of total documents as deleted, I know that these deleted document count decrease over time when background merge operation happens, But in my case these count is always around ~25% of total documents and I have below questions/concerns:
Will these huge no of deleted count affects the search performance as they are still part of lucene immutable segments and search happens to all the segments and latest version of document is returned, so size of immutable segments would be high as they contains huge number of deleted docs and then another operation to figure out the latest version of doc.
Will periodic merge operation would take lot of time and inefficient if huge number of deleted documents are there?
is there is any way to delete these huge number of deleted docs in one shot as looks like background merge operation is not able to keep up with huge number?
Thanks
your deleted documents are still part of the index so they impact the search performance ( but I can't tell you if its a huge impact ).
For the periodic merge, Lucene is "reluctant" to merge heavy segments as it requires some disk space and generates a lot of IO.
You can get some precious insight on your segments thanks to the Index Segments API
If you have segments close to the 5GB limit, it is probable that they won't be merged automatically until they are mostly constituted with deleted docs.
You can force a merge on your index with the force merge API
Remember a force merge can generate some stress on a cluster for huge indices. An option exists to only delete documents, that should reduce the burden.
only_expunge_deletes (Optional, boolean) If true, only expunge
segments containing document deletions. Defaults to false.
In Lucene, a document is not deleted from a segment; just marked as
deleted. During a merge, a new segment is created that does not
contain those document deletions.
Regards

Why segment merge in Elasticsearch requires stopping the writes to index

I am looking to run the optimize(ES 1.X) which is now known as forcemerge API in ES latest version. After reading some articles like this and this. it seems we should run it only on read-only indices, quoting the official ES docs:
Force merge should only be called against read-only indices. Running
force merge against a read-write index can cause very large segments
to be produced (>5Gb per segment)
But I don't understand the
Reason behind putting index on read-only mode before running forcemerge or optimize API.
As explained in above ES doc, it could cause very large segments which shouldn't be the case as what I understand is that, new updates are first written in memory which are written to segments when refresh happens, so why having write during forcemerge can produce the very large segments?
Also is there is any workaround if we don't want to put the index on read-only mode and still run force merge to expunge delete.
Let me know if I need to provide any additional information.
forcemerge can significantly improve the performance of your queries as it allows you to merge the existing number of segments into a smaller number of segments which is more efficient for querying, as segments get searched sequentially. While merging, also all documents marked for deletion get cleaned up.
Merging happens regularly and automatically in the background as part of Elasticsearch‘s housekeeping based on a merge policy.
The tricky thing: only segments up to 5 GB are considered by the merge policy. Using the forcemerge API with the parameter that allows you to specify the number of resulting segments, you risk that the resulting segment(s) get bigger than 5GB, meaning that they will no longer be considered by future merge requests. As long as you don‘t delete or update documents there is nothing wrong about that. However, if you keep on deleting or updating documents, Lucene will mark the old version of your documents in the existing segments as deleted and write the new version of your documents into new segments. If your deleted documents reside in segments larger than 5GB, no more housekeeping is done on them, i.e. the documents marked for deletion will never get cleaned up.
By setting an index to readonly prior to doing a force-merge, you ensure that you will not end up with huge segments, containing a lot of legacy documents, which consume precious resources in memory and on disk and slow down your queries.
A refresh is doing something different: it‘s correct that documents you want to get indexed are first processed in memory, before getting written to disk. But the data structure that allows you to actually find a document (the „segment“) does not get created for every single document right away, as this would be highly inefficient. Segments are only created when the internal buffer gets full, or when a refresh occurs. By triggering a refresh you make a document immediately available for finding. Still the segment at first only lives in memory, as - again - it would be extremely inefficient to immediately sync every segment to disk right after it got created. Segments in memory get periodically synced to disk. Even if you pull the plug before a sync to disk happened you don‘t lose any information, as Elasticsearch maintains a translog that will allow Elasticsearch to „replay“ all indexing request that did not make it yet into a segment on disk.

What does "inverted index are immutable in elastic search" exactly mean?

I was going through the online definitive guide of elastic search.
I have a question on immutability of inverted index described at following link:
https://www.elastic.co/guide/en/elasticsearch/guide/current/making-text-searchable.html
What will happen when a new document is added in index? Will inverted index be recreated to include details/metadata related to new document?
Will it not impact the performance of elastic?
Your question is answered towards the end of that article:
Of course, an immutable index has its downsides too, primarily the fact that it is immutable! You can’t change it. If you want to make new documents searchable, you have to rebuild the entire index. This places a significant limitation either on the amount of data that an index can contain, or the frequency with which the index can be updated.
This means that your old index will need to be destroyed and recreated to include the new document. Performance impact can be mitigated by clustering your data and performing the new index creation on the cold cluster then switching it to hot and then rebuilding the index on the now cold cluster.
When you add new documents to an index, all the documents written within 1 second (default value — you can increase it, but you really shouldn't set it to 0) are written to a (Lucene) segment. That segment will be in memory first and will be flushed to disk later on.
If you update a document, the original version will be marked as deleted and a new document will be created (batched together with other documents within 1s into a segment).
Every segment has its own inverted index(es) and as soon as it's in memory, it is searchable.
Eventually, Elasticsearch will do a merge and combine multiple segments into one. During this step the deleted and replaced (old version of an update) documents will be removed as well. You don't have to call a force merge in general — Elasticsearch is very good at figuring out when it should do that on its own.
This provides a very good performance balance in general. If you don't need to find your documents immediately, a common performance tweak is to set the refresh interval to 30s or a similar value.
PS: Changing existing data will require you to reindex your documents — there's an API for that. Reindexing data is common, especially for search use cases.

elasticsearch - index size change

I insert some data to elasticsearch and here is result screen from es-head.
I do nothing but size is getting smaller. why? and can it be bigger somehow?
Elasticsearch (or actually the underlying Apache Lucene) writes data immutably. The only way to decrease the file size is through a merge, which will happen automatically in the background.
Quoting from the documentation:
Smaller segments are periodically merged into larger segments to keep the index size at bay and to expunge deletes.
I would only expect bigger changes if you have lots of deleted documents. Note however that every update is treated as a (soft) delete and insert — immutable nature of Lucene segments.

efficiently getting all documents in an elasticsearch index

I want to get all results from a match-all query in an elasticsearch cluster. I don't care if the results are up to date and I don't care about the order, I just want to steadily keep going through all results and then start again at the beginning. Is scroll and scan best for this, it seems like a bit of a hit taking a snapshot that I don't need. I'll be looking at processing 10s millions of documents.
Somewhat of a duplicate of elasticsearch query to return all records. But we can add a bit more detail to address the overhead concern. (Viz., "it seems like a bit of a hit taking a snapshot that I don't need.")
A scroll-scan search is definitely what you want in this case. The
"snapshot" is not a lot of overhead here. The documentation describes it metaphorically as "like a snapshot in time" (emphasis added). The actual implementation details are a bit more subtle, and quite clever.
A slightly more detailed explanation comes later in the documentation:
Normally, the background merge process optimizes the index by merging together smaller segments to create new bigger segments, at which time the smaller segments are deleted. This process continues during scrolling, but an open search context prevents the old segments from being deleted while they are still in use. This is how Elasticsearch is able to return the results of the initial search request, regardless of subsequent changes to documents.
So the reason the context is cheap to preserve is because of how Lucene index segments behave. A Lucene index is partitioned into multiple segments, each of which is like a stand-alone mini index. As documents are added (and updated), Lucene simply appends a new segment to the index. Segments are write-once: after they are created, they are never again updated.
Over time, as segments accumulate, Lucene will periodically do some housekeeping in the background. It scans through the segments and merges segments to flush the deleted and outdated information, eventually consolidating into a smaller set of fresher and more up-to-date segments. As newer merged segments replace older segments, Lucene will then go and remove any segments that are no longer actively used by the index at large.
This segmented index design is one reason why Lucene is much more performant and resilient than a simple B-tree. Continuously appending segments is cheaper in the long run than the accumulated IO of updating files directly on disk. Plus the write-once design has other useful properties.
The snapshot-like behavior used here by Elasticsearch is to maintain a reference to all of the segments active at the time the scrolling search begins. So the overhead is minimal: some references to a handful of files. Plus, perhaps, the size of those files on disk, as the index is updated over time.
This may be a costly amount of overhead, if disk space is a serious concern on the server. It's conceivable that an index being updated rapidly enough while a scrolling search context is active may as much as double the disk size required for an index. Toward that end, it's helpful to ensure that you have enough capacity such that an index may grow to 2–3 times its expected size.

Resources