Elasticsearch Batch Re-Indexing | To Scroll or Not To Scroll - elasticsearch

OK - Here is what I'm trying to achieve.
I've got an ES cluster with tens of millions of data (can linearly grow). These are raw data (something like an audit log). We will have features (incrementally) to retrospectively transform this audit log into a different document (index) depending upon the feature requirement. Therefore this would require reindexing (bulk read and bulk write).
These are my technical requirements:
The "reindexing component" should be horizontally scalable. Linearly scale by spinning up multiple instances of this (to speed up).
The "reindexing component" should be resilient. If one chunk of data fails during read by one worker, some other worker should pick this up.
Resume from where it left. These should be resumable from where it stopped (or crashed) rather than reading through the full index again.
A bit of research showed me that I'd have to build a bespoke solution for my needs.
Now my question is to use scroll or from&size
Scroll is naturally more intended to do bulk reads in an efficient way, but I also need it to be horizontally scalable. I understand there's a "sliced scroll" feature that allows parallel scrolls but this is only limited to the number of shards? ie if number of shards are 5, then I can only have 5 workers reading the elasticsearch. However, the transformations can be scaled though.
Alternatively, I was wondering if the paging (using from and size) would be ticking all my boxes. The approach is, I'd find the total count. Then I'd be computing the offsets and throwing that in a queue. Then a pool of workers would read the offsets from the queue and reading it using the (from & size). By this, I will exactly know which offsets have failed/pending etc and also reads can scale.
However, the important question I have is does it harm elasticsearch by firing more and more large paging requests concurrently (assuming page size is 2000).
I'd like to hear different views/solutions/pointers/comments on this.

Related

Why is ElasticSearch PIT/search_after a lightweight version to scrollAPI?

Elasticsearch provides different ways of paginating through large amounts of data. The scroll API and search_after/PIT both allow a view of the data at a given point in time.
Normally, the background merge process optimizes the index by merging
together smaller segments to create new bigger segments, at which time
the smaller segments are deleted. This process continues during
scrolling, but an open search context prevents the old segments from
being deleted while they are still in use. This is how Elasticsearch
is able to return the results of the initial search request,
regardless of subsequent changes to documents.
It seems that this is achieved in the same way for both. The old segments are prevented from being deleted. However, search_after/PIT is often referred to as the light-weight alternative. Is this purely because a PIT can be shared between queries and therefore not as many PITs should have to be created?

Amazon Elasticsearch - Concurrent Bulk Requests

When I am adding 200 documents to ElasticSearch via one bulk request - it's super fast.
But I am wondering if is there a chance to speed up the process with concurrent executions: 20 concurrent executions with 10 documents each.
I know it's not efficient, but maybe there is a chance to speed up the process with concurrent executions?
Lower concurrency is preferable for bulk document inserts. Some concurrency is helpful in some circumstances — It Depends™ and I'll get into it — but is not a major or automatic win.
There's a lot that can be tuned when it comes to performance of writes to Elasticsearch. One really quick win that you should check: are you using HTTP keep-alive for your connections? That's going to save a lot of the TCP and TLS overhead of setting up each connection. Just that change can make a big performance boost, and also uncover some meaningful architectural considerations for your indexing pipeline.
So check that out and see how it goes. From there, we should go to the bottom, and work our way up.
The index on disk is Lucene. Lucene is a segmented index. The index part is a core reason why you're using Elasticsearch in the first place: a dictionary of sorted terms can be searched in O(log N) time. That's super fast and scalable. The segment part is because inserting into an index is not particularly fast — depending on your implementation, it costs O(log N) or O(N log N) to maintain the sorting.
So Lucene's trick is to buffer those updates and append a new segment; essentially a collection of mini-indices. Searching some relatively small number of segments is still much faster than taking all the time to maintain a sorted index with every update. Over time Lucene takes care of merging these segments to keep them within some sensible size range, expunging deleted and overwritten docs in the process.
In Elasticsearch, every shard is a distinct Lucene index. If you have an index with a single shard, then there is very little benefit to having more than a single concurrent stream of bulk updates. There may be some benefit to concurrency on the application side, depending on the amount of time it takes for your indexing pipeline to collect and assemble each batch of documents. But on the Elasticsearch side, it's all just one set of buffers getting written out to one segment after another.
Sharding makes this a little more interesting.
One of Elasticsearch's strengths is the ability to partition the data of an index across multiple shards. This helps with availability, and it helps workloads scale beyond the resources of a single server.
Alas it's not quite so simple as to say that the concurrency should be equal, or proportional, to the number of primary shards that an index has. Although, as a rough heuristic, that's not a terrible one.
You see, internally, the first Elasticsearch node to handle the request is going to turn that Bulk request into a sequence of individual document update actions. Each document update is sent to the appropriate node that is hosting the shard that this document belongs to. Responses are collected by the bulk action so that it can send a summary of the bulk operation in its response to the client.
So at this point, depending on the document-shard routing, some shards may be busier than others during the course of processing an incoming bulk request. Is that likely to matter? My intuition says not really. It's possible, but it would be unusual.
In most tests and analysis I've seen, and in my experience over ~ten years with Lucene, the slow part of indexing is the transformation of the documents' values into the inverted index format. Parsing the text, analyzing it into terms, and so on, can be very complex and costly. So long as a bulk request has sufficient documents that are sufficiently well distributed across shards, the concurrency is not as meaningful as saturating the work done at the shard and segment level.
When tuning bulk requests, my advice is something like this.
Use HTTP keep-alive. This is not optional. (You are using TLS, right?)
Choose a batch size where each request is taking a modest amount of time. Somewhere around 1 second, probably not more than 10 seconds.
If you can get fancy, measure how much time each bulk request took, and dynamically grow and shrink your batch.
A durable queue unlocks a lot of capabilities. If can fetch and assemble documents and insert them into, say, Kafka, then that process can be run in parallel to saturate the database and parallelize any denormalization or preparation of documents. A different process then pulls from the queue and sends requests to the server, and with some light coordination you can test and tune different concurrencies at different stages. A queue also lets you pause your updates for various migrations and maintenance tasks when it helps to put the cluster into read-only mode for a time.
I've avoided replication throughout this answer because there's only one reason where I'd ever recommend tweaking replication. And that is when you are bulk creating an index that is not serving any production traffic. In that case, it can help save some resources through your server fleet to turn off all replication to the index, and enable replication after the index is essentially done being loaded with data.
To close, what if you crank up the concurrency anyway? What's the risk? Some workloads don't control the concurrency and there isn't the time or resources to put a queue in front of the search engine. In that case, Elasticsearch can avoid a fairly substantial amount of concurrency. It has fairly generous thread pools for handling concurrent document updates. If those thread pools are saturated, it will reject responses with a HTTP 429 error message and a clear message about queue depths being exceeded. Those can impact stability of the cluster, depending on available resources, and number of shards in the index. But those are all pretty noticeable issues.
Bottom line: no, 20 concurrent bulks with 10 documents each will probably not speed up performance relative to 1 bulk with 200 documents. If your bulk operations are fast, you should increase their size until they run for a second or two, or are problematic. Use keep-alive. If there is other app-side overhead, increase your concurrency to 2x or 3x and measure empirically. If indexing is mission critical, use a fast, durable queue.
There is no straight answer to this as it depends on lots of factors. Above the optimal bulk request size, performance no longer improves and may even drop off. The optimal size, however, is not a fixed number.
It depends entirely on your hardware, your document size and complexity, and your indexing and search load.
Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big.
Since you are doing it in batches of 200, the chances are high that it should be most optimal way to index. But again it will depend on the factors mentioned above.

What considerations should I take into account when increasing the size in the Scroll API in Elasticsearch?

I am currently toying around with the Scroll API of Elasticsearch, and want to use it to obtain a large set of data and do some manual processing on it. The processing is performed by an external library and is not of the type that can easily be included as a script.
While this seems to work nicely at the moment, I was wondering what considerations that I should take into account when fine-tuning the scroll size for performing this form of processing. A quick observation seems to indicate that increasing the scroll size will reduce the latency of the operation. While I suspect that larger scroll sizes will generally reduce throughput, I have no idea whether this hypothesis is correct. Also, I have no idea if there are any other consequences that I do not envision right now.
So to summarize, my question is: what impact does changing Elasticsearch's scroll size have, especially on performance, in a scenario where the results are processed for each batch that is obtained?
Thanks in advance!
The one (and the only I know of) consideration is to be able to process batch fast enough to not release scroll context (which is controlled by ?scroll=X parameter).
Assuming that you will consume all the data from query, there, scroll should be tuned based on network and 3rd-party app performance. I.e.
if your app can process data in stream-like manner, bigger chunks is better
if your app processing data in batches (waiting for full ES response first), the upper limit for batch size should guarantee processing time < scroll release time
if you work in poor network environment, less batch size is better to handle overhead of dropped connections/retries
generally, bigger batch is obviously better, as it eliminates some network/ES cpu overhead

elasticsearch ttl vs daily dropping tables

I understand that there are two dominant patterns for keeping a rolling window of data inside elasticsearch:
creating daily indices, as suggested by logstash, and dropping old indices, and therefore all the records they contain, when they fall out of the window
using elasticsearch's TTL feature and a single index, having elasticsearch automatically remove old records individually as they fall out of the window
Instinctively I go with 2, as:
I don't have to write a cron job
a single big index is easier to communicate to my colleagues and for them to query (I think?)
any nightmare stream dynamics, that cause old log events to show up, don't lead to the creation of new indices and the old events only hang around for the 60s period that elasticsearch uses to do ttl cleanup.
But my gut tells me that dropping an index at a time is probably a lot less computationally intensive, though tbh I've no idea how much less intensive, nor how costly the ttl is.
For context, my inbound streams will rarely peak above 4K messages per second (mps) and are much more likely to hang around 1-2K mps.
Does anyone have any experience with comparing these two approaches? As you can probably tell I'm new to this world! Would appreciate any help, including even help with what the correct approach is to thinking about this sort of thing.
Cheers!
Short answer is, go with option 1 and simply delete indexes that are no longer needed.
Long answer is it somewhat depends on the volume of documents that you're adding to the index and your sharding and replication settings. If your index throughput is fairly low, TTLs can be performant but as you start to write more docs to Elasticsearch (or if you a high replication factor) you'll run into two issues.
Deleting documents with a TTL requires that Elasticsearch runs a periodic service (IndicesTTLService) to find documents that are expired across all shards and issue deletes for all those docs. Searching a large index can be a pretty taxing operation (especially if you're heavily sharded), but worse are the deletes.
Deletes are not performed instantly within Elasticsearch (Lucene, really) and instead documents are "marked for deletion". A segment merge is required to expunge the deleted documents and reclaim disk space. If you have large number of deletes in the index, it'll put much much more pressure on your segment merge operations to the point where it will severely affect other thread pools.
We originally went the TTL route and had an ES cluster that was completely unusable and began rejecting search and indexing requests due to greedy merge threads.
You can experiment with "what document throughput is too much?" but judging from your use case, I'd recommend saving some time and just going with the index deletion route which is much more performant.
I would go with option 1 - i.e. daily dropping of indices.
Daily Dropping Indices
pros:
This is the most efficient way of deleting data
If you need to restructure your index (e.g. apply a new mapping, increase number of shards) any changes are easily applied to the new index
Details of the current index (i.e. the name) is hidden from clients by using aliases
Time based searches can be directed to search only a specific small index
Index templates simplify the process of creating the daily index.
These benefits are also detailed in the Time-Based Data Guide, see also Retiring Data
cons:
Needs more work to set up (e.g. set up of cron jobs), but there is a plugin (curator) that can help with this.
If you perform updates on data then all versions of a document data will need to sit in the same index, i.e. multiple indexes won't work for you.
Use of TTL or Queries to delete data
pros:
Simple to understand and easily implemented
cons:
When you delete a document, it is only marked as deleted. It won’t be physically deleted until the segment containing it is merged away. This is very inefficient as the deleted data will consume disk space, CPU and memory.

max number of couchbase views per bucket

How many views per bucket is too much, assuming a large amount of data in the bucket (>100GB, >100M documents, >12 document types), and assuming each view applies only to one document type? Or asked another way, at what point should some document types be split into separate buckets to save on the overhead of processing all views on all document types?
I am having a hard time deciding how to split my data into couchbase buckets, and the performance implications of the views required on the data. My data consists of more than a dozen relational DBs, with at least half with hundreds of millions of rows in a number of tables.
The http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-writing-bestpractice.html doc section "using document types" seems to imply having multiple document types in the same bucket is not ideal because views on specific document types are updated for all documents, even those that will never match the view. Indeed, it suggests separating data into buckets to avoid this overhead.
Yet there is a limit of 10 buckets per cluster for performance reasons. My only conclusion therefore is that each cluster can handle a maximum of 10 large collections of documents efficiently. Is this accurate?
Tug's advice was right on and allow me to add some perspective as well.
A bucket can be considered most closely related to (though not exactly) a "database instantiation" within the RDMS world. There will be multiple tables/schemas within that "database" and those can all be combined within a bucket.
Think about a bucket as a logical grouping of data that all shares some common configuration parameters (RAM quota, replica count, etc) and you should only need to split your data into multiple buckets when you need certain datasets to be controlled separately. Other reasons are related to very different workloads to different datasets or the desire to be able to track the workload to those datasets separately.
Some examples:
-I want to control the caching behavior for one set of data differently than another. For instance, many customers have a "session" bucket that they want always in RAM whereas they may have a larger, "user profile" bucket that doesn't need all the data cached in RAM. Technically these two data sets could reside in one bucket and allow Couchbase to be intelligent about which data to keep in RAM, but you don't have as much guarantee or control that the session data won't get pushed out...so putting it in its own bucket allows you to enforce that. It also gives you the added benefit of being able to monitor that traffic separately.
-I want some data to be replicated more times than others. While we generally recommend only one replica in most clusters, there are times when our users choose certain datasets that they want replicated an extra time. This can be controlled via separate buckets.
-Along the same lines, I only want some data to be replicated to another cluster/datacenter. This is also controlled per-bucket and so that data could be split to a separate bucket.
-When you have fairly extreme differences in workload (especially around the amount of writes) to a given dataset, it does begin to make sense from a view/index perspective to separate the data into a separate bucket. I mention this because it's true, but I also want to be clear that it is not the common case. You should use this approach after you identify a problem, not before because you think you might.
Regarding this last point, yes every write to a bucket will be picked up by the indexing engine but by using document types within the JSON, you can abort the processing for a given document very quickly and it really shouldn't have a detrimental impact to have lots of data coming in that doesn't apply to certain views. If you don't mind, I'm particularly curious at which parts of the documentation imply otherwise since that certainly wasn't our intention.
So in general, we see most deployments with a low number of buckets (2-3) and only a few upwards of 5. Our limit of 10 comes from some known CPU and disk IO overhead of our internal tracking of statistics (the load or lack thereof on a bucket doesn't matter here). We certainly plan to reduce this overhead with future releases, but that still wouldn't change our recommendation of only having a few buckets. The advantages of being able to combine multiple "schemas" into a single logical grouping and apply view/indexes across that still exist regardless.
We are in the process right now of coming up with much more specific guidelines and sizing recommendations (I wrote those first two blogs as a stop-gap until we do).
As an initial approach, you want to try and keep the number of design documents around 4 because by default we process up to 4 in parallel. You can increase this number, but that should be matched by increased CPU and disk IO capacity. You'll then want to keep the number of views within each document relatively low, probably well below 10, since they are each processed in serial.
I recently worked with one user who had an fairly large amount of views (around 8 design documents and some dd's with nearly 20 views) and we were able to drastically bring this down by combining multiple views into one. Obviously it's very application dependent, but you should try to generate multiple different "queries" off of one index. Using reductions, key-prefixing (within the views), and collation, all combined with different range and grouping queries can make a single index that may appear crowded at first, but is actually very flexible.
The less design documents and views you have, the less disk space, IO and CPU resources you will need. There's never going to be a magic bullet or hard-and-fast guideline number unfortunately. In the end, YMMV and testing on your own dataset is better than any multi-page response I can write ;-)
Hope that helps, please don't hesitate to reach out to us directly if you have specific questions about your specific use case that you don't want published.
Perry
As you can see from the Couchbase documentation, it is not really possible to provide a "universal" rules to give you an exact member.
But based on the best practice document that you have used and some discussion(here) you should be able to design your database/views properly.
Let's start with the last question:
YES the reason why Couchbase advice to have a small number of bucket is for performance - and more importantly resources consumption- reason. I am inviting you to read these blog posts that help to understand what's going on "inside" Couchbase:
Sizing 1: http://blog.couchbase.com/how-many-nodes-part-1-introduction-sizing-couchbase-server-20-cluster
Sizing 2: http://blog.couchbase.com/how-many-nodes-part-2-sizing-couchbase-server-20-cluster
Compaction: http://blog.couchbase.com/compaction-magic-couchbase-server-20
So you will see that most of the "operations" are done by bucket.
So let's now look at the original question:
yes most the time your will organize the design document/and views by type of document.
It is NOT a problem to have all the document "types" in a single(few) buckets, this is in fact the way your work with Couchbase
The most important part to look is, the size of your doc (to see how "long" will be the parsing of the JSON) and how often the document will be created/updated, and also deleted, since the JS code of the view is ONLY executed when you create/change the document.
So what you should do:
1 single bucket
how many design documents? (how many types do you have?)
how any views in each document you will have?
In fact the most expensive part is not during the indexing or quering it is more when you have to rebalance the data and indices between nodes (add, remove , failure of nodes)
Finally, but it looks like you already know it, this chapter is quite good to understand how views works (how the index is created and used):
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-operation.html
Do not hesitate to add more information if needed.

Resources