efficiently getting all documents in an elasticsearch index - elasticsearch

I want to get all results from a match-all query in an elasticsearch cluster. I don't care if the results are up to date and I don't care about the order, I just want to steadily keep going through all results and then start again at the beginning. Is scroll and scan best for this, it seems like a bit of a hit taking a snapshot that I don't need. I'll be looking at processing 10s millions of documents.

Somewhat of a duplicate of elasticsearch query to return all records. But we can add a bit more detail to address the overhead concern. (Viz., "it seems like a bit of a hit taking a snapshot that I don't need.")
A scroll-scan search is definitely what you want in this case. The
"snapshot" is not a lot of overhead here. The documentation describes it metaphorically as "like a snapshot in time" (emphasis added). The actual implementation details are a bit more subtle, and quite clever.
A slightly more detailed explanation comes later in the documentation:
Normally, the background merge process optimizes the index by merging together smaller segments to create new bigger segments, at which time the smaller segments are deleted. This process continues during scrolling, but an open search context prevents the old segments from being deleted while they are still in use. This is how Elasticsearch is able to return the results of the initial search request, regardless of subsequent changes to documents.
So the reason the context is cheap to preserve is because of how Lucene index segments behave. A Lucene index is partitioned into multiple segments, each of which is like a stand-alone mini index. As documents are added (and updated), Lucene simply appends a new segment to the index. Segments are write-once: after they are created, they are never again updated.
Over time, as segments accumulate, Lucene will periodically do some housekeeping in the background. It scans through the segments and merges segments to flush the deleted and outdated information, eventually consolidating into a smaller set of fresher and more up-to-date segments. As newer merged segments replace older segments, Lucene will then go and remove any segments that are no longer actively used by the index at large.
This segmented index design is one reason why Lucene is much more performant and resilient than a simple B-tree. Continuously appending segments is cheaper in the long run than the accumulated IO of updating files directly on disk. Plus the write-once design has other useful properties.
The snapshot-like behavior used here by Elasticsearch is to maintain a reference to all of the segments active at the time the scrolling search begins. So the overhead is minimal: some references to a handful of files. Plus, perhaps, the size of those files on disk, as the index is updated over time.
This may be a costly amount of overhead, if disk space is a serious concern on the server. It's conceivable that an index being updated rapidly enough while a scrolling search context is active may as much as double the disk size required for an index. Toward that end, it's helpful to ensure that you have enough capacity such that an index may grow to 2–3 times its expected size.

Related

Why is ElasticSearch PIT/search_after a lightweight version to scrollAPI?

Elasticsearch provides different ways of paginating through large amounts of data. The scroll API and search_after/PIT both allow a view of the data at a given point in time.
Normally, the background merge process optimizes the index by merging
together smaller segments to create new bigger segments, at which time
the smaller segments are deleted. This process continues during
scrolling, but an open search context prevents the old segments from
being deleted while they are still in use. This is how Elasticsearch
is able to return the results of the initial search request,
regardless of subsequent changes to documents.
It seems that this is achieved in the same way for both. The old segments are prevented from being deleted. However, search_after/PIT is often referred to as the light-weight alternative. Is this purely because a PIT can be shared between queries and therefore not as many PITs should have to be created?

Lucene index segment lifecycle and performance impact

For a project which maps files in a directory structure to Lucene documents (1:1), I'd like to know the impact of using multiple index segments. When a file on disk changes, the indexing process basically removes the corresponding document and adds a new one.
In the project, at the end of the indexing, the forceMerge() method of IndexWriter is used to reduce the number of segments to 1. This practice has been present in the code for a very long time, likely since early Lucene versions. As noted in the Lucene documentation, this is expensive task:
This is a horribly costly operation, especially when you pass a small maxNumSegments; usually you should only call this if the index is static (will no longer be changed).
Based on this I am considering removing this step altogether. It's just unclear what will be the performance impact.
In one answer a claim is made that multi segment performance got better over the time, however this is pretty vague statement. Is there some benchmark and/or explanatory article that would shed more light on the performance with multiple segments ? What if the segment count grows to thousands, millions ? Is this even possible ? How much will the search/indexing performance degrade ?
Also, when experimenting with disabiling the forceMerge() step, I noticed that after adding bunch of documents to the index, the next time the indexer is run, the segment count grows, however sometimes decreases after subsequent runs of the indexer (according to the segmentInfos field in the IndexReader object). Is there some automatic segment merge process ?

Does huge number of deleted doc count affects ES query performance

I have few read heavy indices(started seeing performance issues on these indices) in my ES cluster which has ~50 million docs and noticed most of them have around 25% of total documents as deleted, I know that these deleted document count decrease over time when background merge operation happens, But in my case these count is always around ~25% of total documents and I have below questions/concerns:
Will these huge no of deleted count affects the search performance as they are still part of lucene immutable segments and search happens to all the segments and latest version of document is returned, so size of immutable segments would be high as they contains huge number of deleted docs and then another operation to figure out the latest version of doc.
Will periodic merge operation would take lot of time and inefficient if huge number of deleted documents are there?
is there is any way to delete these huge number of deleted docs in one shot as looks like background merge operation is not able to keep up with huge number?
Thanks
your deleted documents are still part of the index so they impact the search performance ( but I can't tell you if its a huge impact ).
For the periodic merge, Lucene is "reluctant" to merge heavy segments as it requires some disk space and generates a lot of IO.
You can get some precious insight on your segments thanks to the Index Segments API
If you have segments close to the 5GB limit, it is probable that they won't be merged automatically until they are mostly constituted with deleted docs.
You can force a merge on your index with the force merge API
Remember a force merge can generate some stress on a cluster for huge indices. An option exists to only delete documents, that should reduce the burden.
only_expunge_deletes (Optional, boolean) If true, only expunge
segments containing document deletions. Defaults to false.
In Lucene, a document is not deleted from a segment; just marked as
deleted. During a merge, a new segment is created that does not
contain those document deletions.
Regards

Why segment merge in Elasticsearch requires stopping the writes to index

I am looking to run the optimize(ES 1.X) which is now known as forcemerge API in ES latest version. After reading some articles like this and this. it seems we should run it only on read-only indices, quoting the official ES docs:
Force merge should only be called against read-only indices. Running
force merge against a read-write index can cause very large segments
to be produced (>5Gb per segment)
But I don't understand the
Reason behind putting index on read-only mode before running forcemerge or optimize API.
As explained in above ES doc, it could cause very large segments which shouldn't be the case as what I understand is that, new updates are first written in memory which are written to segments when refresh happens, so why having write during forcemerge can produce the very large segments?
Also is there is any workaround if we don't want to put the index on read-only mode and still run force merge to expunge delete.
Let me know if I need to provide any additional information.
forcemerge can significantly improve the performance of your queries as it allows you to merge the existing number of segments into a smaller number of segments which is more efficient for querying, as segments get searched sequentially. While merging, also all documents marked for deletion get cleaned up.
Merging happens regularly and automatically in the background as part of Elasticsearch‘s housekeeping based on a merge policy.
The tricky thing: only segments up to 5 GB are considered by the merge policy. Using the forcemerge API with the parameter that allows you to specify the number of resulting segments, you risk that the resulting segment(s) get bigger than 5GB, meaning that they will no longer be considered by future merge requests. As long as you don‘t delete or update documents there is nothing wrong about that. However, if you keep on deleting or updating documents, Lucene will mark the old version of your documents in the existing segments as deleted and write the new version of your documents into new segments. If your deleted documents reside in segments larger than 5GB, no more housekeeping is done on them, i.e. the documents marked for deletion will never get cleaned up.
By setting an index to readonly prior to doing a force-merge, you ensure that you will not end up with huge segments, containing a lot of legacy documents, which consume precious resources in memory and on disk and slow down your queries.
A refresh is doing something different: it‘s correct that documents you want to get indexed are first processed in memory, before getting written to disk. But the data structure that allows you to actually find a document (the „segment“) does not get created for every single document right away, as this would be highly inefficient. Segments are only created when the internal buffer gets full, or when a refresh occurs. By triggering a refresh you make a document immediately available for finding. Still the segment at first only lives in memory, as - again - it would be extremely inefficient to immediately sync every segment to disk right after it got created. Segments in memory get periodically synced to disk. Even if you pull the plug before a sync to disk happened you don‘t lose any information, as Elasticsearch maintains a translog that will allow Elasticsearch to „replay“ all indexing request that did not make it yet into a segment on disk.

elasticsearch ttl vs daily dropping tables

I understand that there are two dominant patterns for keeping a rolling window of data inside elasticsearch:
creating daily indices, as suggested by logstash, and dropping old indices, and therefore all the records they contain, when they fall out of the window
using elasticsearch's TTL feature and a single index, having elasticsearch automatically remove old records individually as they fall out of the window
Instinctively I go with 2, as:
I don't have to write a cron job
a single big index is easier to communicate to my colleagues and for them to query (I think?)
any nightmare stream dynamics, that cause old log events to show up, don't lead to the creation of new indices and the old events only hang around for the 60s period that elasticsearch uses to do ttl cleanup.
But my gut tells me that dropping an index at a time is probably a lot less computationally intensive, though tbh I've no idea how much less intensive, nor how costly the ttl is.
For context, my inbound streams will rarely peak above 4K messages per second (mps) and are much more likely to hang around 1-2K mps.
Does anyone have any experience with comparing these two approaches? As you can probably tell I'm new to this world! Would appreciate any help, including even help with what the correct approach is to thinking about this sort of thing.
Cheers!
Short answer is, go with option 1 and simply delete indexes that are no longer needed.
Long answer is it somewhat depends on the volume of documents that you're adding to the index and your sharding and replication settings. If your index throughput is fairly low, TTLs can be performant but as you start to write more docs to Elasticsearch (or if you a high replication factor) you'll run into two issues.
Deleting documents with a TTL requires that Elasticsearch runs a periodic service (IndicesTTLService) to find documents that are expired across all shards and issue deletes for all those docs. Searching a large index can be a pretty taxing operation (especially if you're heavily sharded), but worse are the deletes.
Deletes are not performed instantly within Elasticsearch (Lucene, really) and instead documents are "marked for deletion". A segment merge is required to expunge the deleted documents and reclaim disk space. If you have large number of deletes in the index, it'll put much much more pressure on your segment merge operations to the point where it will severely affect other thread pools.
We originally went the TTL route and had an ES cluster that was completely unusable and began rejecting search and indexing requests due to greedy merge threads.
You can experiment with "what document throughput is too much?" but judging from your use case, I'd recommend saving some time and just going with the index deletion route which is much more performant.
I would go with option 1 - i.e. daily dropping of indices.
Daily Dropping Indices
pros:
This is the most efficient way of deleting data
If you need to restructure your index (e.g. apply a new mapping, increase number of shards) any changes are easily applied to the new index
Details of the current index (i.e. the name) is hidden from clients by using aliases
Time based searches can be directed to search only a specific small index
Index templates simplify the process of creating the daily index.
These benefits are also detailed in the Time-Based Data Guide, see also Retiring Data
cons:
Needs more work to set up (e.g. set up of cron jobs), but there is a plugin (curator) that can help with this.
If you perform updates on data then all versions of a document data will need to sit in the same index, i.e. multiple indexes won't work for you.
Use of TTL or Queries to delete data
pros:
Simple to understand and easily implemented
cons:
When you delete a document, it is only marked as deleted. It won’t be physically deleted until the segment containing it is merged away. This is very inefficient as the deleted data will consume disk space, CPU and memory.

Resources