Elasticsearch - Physical size of a bulk request

We use the Bulk API to index multiple docs. We try to control the batch size indirectly through various parameters, but I would like to know whether there is a clean, recommended way to get the physical size of a prepared batch before sending the bulk index request to ES.
Note: Language - C# using NEST

TL;DR
The entire bulk request has to be loaded into RAM.
Beyond a certain size, performance no longer improves.
That size differs between hardware setups, so experiment to find yours.
https://www.elastic.co/guide/en/elasticsearch/guide/current/bulk.html
The entire bulk request needs to be loaded into memory by the node
that receives our request, so the bigger the request, the less memory
available for other requests. There is an optimal size of bulk
request. Above that size, performance no longer improves and may even
drop off. The optimal size, however, is not a fixed number. It depends
entirely on your hardware, your document size and complexity, and your
indexing and search load.
Fortunately, it is easy to find this sweet spot: Try indexing typical
documents in batches of increasing size. When performance starts to
drop off, your batch size is too big. A good place to start is with
batches of 1,000 to 5,000 documents or, if your documents are very
large, with even smaller batches.
It is often useful to keep an eye on the physical size of your bulk
requests. One thousand 1KB documents is very different from one
thousand 1MB documents. A good bulk size to start playing with is
around 5-15MB in size.
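One way to get the physical size before sending is to serialize the bulk body yourself and count its bytes. The following is a minimal sketch in Python rather than C#/NEST; the same idea should carry over to whatever serializer the client uses. The function name and the index name "my-index" are hypothetical. The only thing assumed about Elasticsearch is the newline-delimited JSON body of the _bulk endpoint: one action line followed by one source line per document, each terminated by a newline.

```python
import json


def build_bulk_body(index_name, docs):
    """Serialize docs into a newline-delimited JSON bulk body (as bytes)."""
    lines = []
    for doc in docs:
        # Action line, then the document source line.
        lines.append(json.dumps({"index": {"_index": index_name}},
                                separators=(",", ":")))
        lines.append(json.dumps(doc, separators=(",", ":"), ensure_ascii=False))
    # Every line, including the last one, ends with a newline.
    return "".join(line + "\n" for line in lines).encode("utf-8")


docs = [{"title": "héllo", "n": i} for i in range(3)]
body = build_bulk_body("my-index", docs)

# The physical size is the length of the encoded bytes, not of the string:
# non-ASCII characters take more than one byte in UTF-8.
size_bytes = len(body)
print(size_bytes)
```

Counting the encoded bytes rather than characters matters as soon as documents contain non-ASCII text. If the client library serializes differently (extra whitespace, different field order), the number will be an approximation of what actually goes over the wire.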

Related

Amazon Elasticsearch - Concurrent Bulk Requests

When I am adding 200 documents to ElasticSearch via one bulk request - it's super fast.
But I am wondering whether there is a chance to speed up the process with concurrent executions, for example 20 concurrent executions with 10 documents each.
I know it's not efficient, but maybe there is a chance to speed up the process with concurrent executions?
Lower concurrency is preferable for bulk document inserts. Some concurrency is helpful in some circumstances — It Depends™ and I'll get into it — but is not a major or automatic win.
There's a lot that can be tuned when it comes to performance of writes to Elasticsearch. One really quick win that you should check: are you using HTTP keep-alive for your connections? That's going to save a lot of the TCP and TLS overhead of setting up each connection. Just that change can make a big performance boost, and also uncover some meaningful architectural considerations for your indexing pipeline.
So check that out and see how it goes. From there, we should go to the bottom, and work our way up.
The index on disk is Lucene. Lucene is a segmented index. The index part is a core reason why you're using Elasticsearch in the first place: a dictionary of sorted terms can be searched in O(log N) time. That's super fast and scalable. The segment part is because inserting into an index is not particularly fast — depending on your implementation, it costs O(log N) or O(N log N) to maintain the sorting.
So Lucene's trick is to buffer those updates and append a new segment; essentially a collection of mini-indices. Searching some relatively small number of segments is still much faster than taking all the time to maintain a sorted index with every update. Over time Lucene takes care of merging these segments to keep them within some sensible size range, expunging deleted and overwritten docs in the process.
In Elasticsearch, every shard is a distinct Lucene index. If you have an index with a single shard, then there is very little benefit to having more than a single concurrent stream of bulk updates. There may be some benefit to concurrency on the application side, depending on the amount of time it takes for your indexing pipeline to collect and assemble each batch of documents. But on the Elasticsearch side, it's all just one set of buffers getting written out to one segment after another.
Sharding makes this a little more interesting.
One of Elasticsearch's strengths is the ability to partition the data of an index across multiple shards. This helps with availability, and it helps workloads scale beyond the resources of a single server.
Alas it's not quite so simple as to say that the concurrency should be equal, or proportional, to the number of primary shards that an index has. Although, as a rough heuristic, that's not a terrible one.
You see, internally, the first Elasticsearch node to handle the request is going to turn that Bulk request into a sequence of individual document update actions. Each document update is sent to the appropriate node that is hosting the shard that this document belongs to. Responses are collected by the bulk action so that it can send a summary of the bulk operation in its response to the client.
So at this point, depending on the document-shard routing, some shards may be busier than others during the course of processing an incoming bulk request. Is that likely to matter? My intuition says not really. It's possible, but it would be unusual.
In most tests and analysis I've seen, and in my experience over ~ten years with Lucene, the slow part of indexing is the transformation of the documents' values into the inverted index format. Parsing the text, analyzing it into terms, and so on, can be very complex and costly. So long as a bulk request has sufficient documents that are sufficiently well distributed across shards, the concurrency is not as meaningful as saturating the work done at the shard and segment level.
When tuning bulk requests, my advice is something like this.
Use HTTP keep-alive. This is not optional. (You are using TLS, right?)
Choose a batch size where each request is taking a modest amount of time. Somewhere around 1 second, probably not more than 10 seconds.
If you can get fancy, measure how much time each bulk request took, and dynamically grow and shrink your batch.
A durable queue unlocks a lot of capabilities. If you can fetch and assemble documents and insert them into, say, Kafka, then that process can be run in parallel to saturate the database and parallelize any denormalization or preparation of documents. A different process then pulls from the queue and sends requests to the server, and with some light coordination you can test and tune different concurrencies at different stages. A queue also lets you pause your updates for various migrations and maintenance tasks when it helps to put the cluster into read-only mode for a time.
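The "measure each request and grow or shrink the batch" idea from the list above could be sketched like this. It is only an illustration in Python: the function name, the size limits and the 2x step cap are assumptions, and the 1 second target is taken from the advice above.

```python
def next_batch_size(current, elapsed_seconds, target_seconds=1.0,
                    min_size=100, max_size=10_000):
    """Grow the batch when requests finish quickly, shrink when they run long."""
    if elapsed_seconds <= 0:
        return current
    # Scale proportionally toward the target duration, but cap each step
    # at 2x in either direction so one outlier does not swing the size.
    ratio = target_seconds / elapsed_seconds
    ratio = max(0.5, min(2.0, ratio))
    proposed = int(current * ratio)
    # Keep the result inside the configured bounds (hypothetical values).
    return max(min_size, min(max_size, proposed))


print(next_batch_size(1000, 0.25))  # request was fast, so the batch grows
print(next_batch_size(1000, 4.0))   # request was slow, so the batch shrinks
```

After each bulk request, feed the measured duration back in and use the returned value for the next batch.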
I've avoided replication throughout this answer because there's only one reason where I'd ever recommend tweaking replication. And that is when you are bulk creating an index that is not serving any production traffic. In that case, it can help save some resources through your server fleet to turn off all replication to the index, and enable replication after the index is essentially done being loaded with data.
To close, what if you crank up the concurrency anyway? What's the risk? Some workloads don't control the concurrency and there isn't the time or resources to put a queue in front of the search engine. In that case, Elasticsearch can absorb a fairly substantial amount of concurrency. It has fairly generous thread pools for handling concurrent document updates. If those thread pools are saturated, it will reject requests with an HTTP 429 response and a clear message about queue depths being exceeded. Heavy concurrency can impact stability of the cluster, depending on available resources and the number of shards in the index. But those are all pretty noticeable issues.
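If the client does run into 429 rejections, backing off and retrying is one common way to react. Below is a minimal sketch in Python, assuming a hypothetical send callable that performs the bulk request and returns the HTTP status code. The attempt count and delays are made-up defaults. The sketch only looks at the overall status code; rejections reported per item inside a bulk response body are not handled here.

```python
import time


def send_with_backoff(send, payload, max_attempts=5, base_delay=0.5,
                      sleep=time.sleep):
    """Call send(payload); on HTTP 429, wait and retry with doubling delays."""
    for attempt in range(max_attempts):
        status = send(payload)
        if status != 429:
            return status
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return 429


# Demo with a stand-in for the real HTTP call: rejects twice, then accepts.
responses = iter([429, 429, 200])
delays = []
status = send_with_backoff(lambda body: next(responses), b"bulk body",
                           sleep=delays.append)
print(status, delays)
```

Passing sleep as a parameter keeps the function easy to test without actually waiting.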
Bottom line: no, 20 concurrent bulks with 10 documents each will probably not speed up performance relative to 1 bulk with 200 documents. If your bulk operations are fast, you should increase their size until they run for a second or two, or are problematic. Use keep-alive. If there is other app-side overhead, increase your concurrency to 2x or 3x and measure empirically. If indexing is mission critical, use a fast, durable queue.
There is no straight answer to this as it depends on lots of factors. Above the optimal bulk request size, performance no longer improves and may even drop off. The optimal size, however, is not a fixed number.
It depends entirely on your hardware, your document size and complexity, and your indexing and search load.
Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big.
Since you are indexing in batches of 200, chances are high that this is already close to the optimal way to index. But again, it will depend on the factors mentioned above.
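The "increase the batch size until performance drops off" procedure can be written down as a small loop. This is a sketch in Python, assuming a hypothetical measure_throughput(size) function that indexes a test batch of that size and returns documents per second. The numbers in the demo are invented placeholders, not measurements.

```python
def find_sweet_spot(batch_sizes, measure_throughput):
    """Try increasing batch sizes and stop at the first drop in throughput."""
    best_size, best_rate = None, 0.0
    for size in sorted(batch_sizes):
        rate = measure_throughput(size)
        if rate <= best_rate:
            # Throughput stopped improving: the previous size was the best.
            break
        best_size, best_rate = size, rate
    return best_size, best_rate


# Placeholder numbers for illustration only (docs per second per batch size).
fake_rates = {500: 3000.0, 1000: 4500.0, 2000: 5200.0, 4000: 5100.0, 8000: 4000.0}
result = find_sweet_spot(fake_rates.keys(), fake_rates.get)
print(result)
```

In a real test each measurement should be repeated a few times with typical documents, since a single run can be noisy.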

Elasticsearch Batch size limit

I have 200 documents of 1 MB each, 200MB in total, and I want to index all of them at once in a single batch using bulk processing.
Is sending 200MB over the wire too much for Elasticsearch to handle?
Sending 200MB of data across the wire is going to take a very long time and will timeout your connection. You'd be better off writing something that indexes 1 document at a time with maybe 5 concurrent threads. Bulk inserting this much data will not really give you any benefit.
More generally, 1MB of data is ~500 pages of text. I would argue that is WAY too much data to be putting into a single record in ES! I think you're going to be disappointed with performance unless you've got a lot of horsepower, but that's going to be very expensive. I recommend looking into making much smaller documents.

Can Circuit break exception be avoided using horizontal scaling?

I am using crate 1.0.2 which internally uses elasticsearch. So my question is applicable for both. For certain queries I get circuit break exception.
CircuitBreakingException: [parent] Data too large, data for [collect: 0] would be larger than limit of [11946544332/11.1gb]
These queries are mainly GROUP BY on multiple columns. I have billions of documents indexed and 16 GB of RAM allocated as the crate heap size. I have multiple such nodes connected together in a cluster. Will adding more nodes to the cluster help get rid of this error, and will the same queries then run fine? Or must I increase the heap to 30 GB? My worry is that if I increase it to 30 GB and keep adding data, someday that query will hit the circuit breaker again. So I wanted to solve it by scaling horizontally, i.e. adding more nodes. Would that be the wiser decision?
Short answer: Usually horizontal scaling helps.
Your error seems to be caused by group by queries.
The GROUP BY operations are executed in a distributed fashion. So more nodes
will generally split the load and therefore also the memory usage. (Make sure
there are enough shards so that they're spread among all nodes)
There is a catch though: Eventually the data needs to be merged together on the
node you sent the initial query to. This is generally fine because the data
arrives pre-aggregated, but if the cardinality is too high (e.g. GROUP BY on a
primary key), the whole data set has to fit into memory on this coordinator
node.
If your nodes have enough memory to go up to 30 GB (with still having enough to
spare for the file system cache), I'd personally tend to increase the HEAP size
first, before adding new nodes.
Update:
Recent versions (2.1.X) also contain some fixes regarding the circuit-breaker behavior. So if it's possible to update, that'd be recommended as well.
Update2:
Note that there are different cases in which a circuit breaker can trip. In
your case it's caused by a GROUP BY using up quite a lot of memory. But it can
also be tripped if a single request is too large. For example if the bulk size
is too large. In such a case more nodes wouldn't help. You'd have to reduce the
bulk size.

What is the maximum Elasticsearch document size?

I read notes about Lucene being limited to 2Gb documents. Are there any additional limitations on the size of documents that can be indexed in Elasticsearch?
Lucene uses a byte buffer internally that uses 32bit integers for addressing. By definition this limits the size of the documents. So 2GB is max in theory.
In ElasticSearch:
There is a max http request size in the ES GitHub code, and it is set against Integer.MAX_VALUE or 2^31-1. So, basically, 2GB is the maximum document size for bulk indexing over HTTP. And also to add to it, ES does not process an HTTP request until it completes.
Good Practices:
Do not use a very large java heap if you can help it: set it only as large as is necessary (ideally no more than half of the machine’s RAM) to hold the overall maximum working set size for your usage of Elasticsearch. This leaves the remaining (hopefully sizable) RAM for the OS to manage for IO caching.
On the client side, always use the bulk API, which indexes multiple documents in one request, and experiment with the right number of documents to send with each bulk request. The optimal size depends on many factors, but try to err in the direction of too few rather than too many documents. Use concurrent bulk requests with client-side threads or separate asynchronous requests.
For further study refer to these links:
Performance considerations for elasticsearch indexing
Document maximum size for bulk indexing over HTTP
I think things have changed slightly over the years with Elasticsearch. The 7.x documentation (General Recommendations) says:
Given that the default http.max_content_length is set to 100MB, Elasticsearch will refuse to index any document that is larger than that. You might decide to increase that particular setting, but Lucene still has a limit of about 2GB.
So it would seem that ES has a limit of ~100MB and Lucene's is 2GB as the other answer stated.
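To stay under a request size limit such as the 100MB default quoted above (or the much smaller 5-15MB starting range suggested in the old guide), documents can be grouped by their serialized byte size rather than by count. A minimal sketch in Python; the function name is hypothetical and each entry is assumed to be the already-serialized bytes for one document (action line plus source line).

```python
def chunk_by_bytes(entries, max_bytes):
    """Group serialized bulk entries into batches of at most max_bytes each."""
    batch, batch_size = [], 0
    for entry in entries:
        if len(entry) > max_bytes:
            # A single oversized document can never fit, whatever the batching.
            raise ValueError("single entry exceeds max_bytes")
        if batch and batch_size + len(entry) > max_bytes:
            yield batch
            batch, batch_size = [], 0
        batch.append(entry)
        batch_size += len(entry)
    if batch:
        yield batch


entries = [b"a" * 40, b"b" * 40, b"c" * 40, b"d" * 10]
batch_sizes = [sum(len(e) for e in b) for b in chunk_by_bytes(entries, 100)]
print(batch_sizes)
```

The demo uses a tiny 100 byte limit so the grouping is easy to follow; a real limit would be expressed in megabytes.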

What is the ideal bulk size formula in ElasticSearch?

I believe there should be a formula to calculate bulk indexing size in ElasticSearch. The following are probably the variables of such a formula.
Number of nodes
Number of shards/index
Document size
RAM
Disk write speed
LAN speed
I wonder if anyone knows or uses a mathematical formula. If not, how do people decide their bulk size? By trial and error?
Read ES bulk API doc carefully: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html#_using_and_sizing_bulk_requests
Try with 1 KiB, then with 20 KiB, then with 10 KiB, and so on: narrow it down by bisection
Measure bulk size in KiB (or equivalent), not in document count!
Send data in bulk (no streaming), and put redundant info in the API URL if you can (for example the index name, instead of repeating it on every action line)
Remove superfluous whitespace in your data if possible
Disable search index updates during the load and re-enable them afterwards
Round-robin across all your data nodes
There is no golden rule for this. Extracted from the doc:
There is no “correct” number of actions to perform in a single bulk call. You should experiment with different settings to find the optimum size for your particular workload.
I derived this information from the Java API's BulkProcessor class. It defaults to 1000 actions or 5MB, it also allows you to set a flush interval but this is not set by default. I'm just using the default settings.
I'd suggest using BulkProcessor if you are using the Java API.
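For clients without such a helper, the two thresholds described above (flush after 1000 actions or 5MB, whichever comes first) are simple to imitate. The following Python sketch is not the Java BulkProcessor API; the class name and the flush callback are hypothetical, and the flush interval is left out.

```python
class BulkBuffer:
    """Collect serialized entries; flush on an action-count or byte threshold."""

    def __init__(self, flush, max_actions=1000, max_bytes=5 * 1024 * 1024):
        self._flush = flush          # callback that receives a list of entries
        self.max_actions = max_actions
        self.max_bytes = max_bytes
        self._entries = []
        self._bytes = 0

    def add(self, entry):
        self._entries.append(entry)
        self._bytes += len(entry)
        if (len(self._entries) >= self.max_actions
                or self._bytes >= self.max_bytes):
            self.flush()

    def flush(self):
        # Send whatever is buffered, then reset the counters.
        if self._entries:
            self._flush(self._entries)
            self._entries, self._bytes = [], 0


flushed = []
buf = BulkBuffer(flushed.append, max_actions=3, max_bytes=100)
for _ in range(7):
    buf.add(b"0123456789")
buf.flush()  # send the remainder at the end of the run
print([len(batch) for batch in flushed])
```

The final explicit flush() matters: without it, the last partial batch would never be sent.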
I was searching for this and found your question :)
I found this in the Elastic documentation:
It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size.
.. so I will investigate the size of my documents.
In my case, I could not get more than 100,000 records to insert at a time. Started with 13 million, down to 500,000 and after no success, started on the other side, 1,000, then 10,000 then 100,000, my max.
I haven't found a better way than trial and error (i.e. the traditional engineering process), as there are many factors beyond hardware influencing indexing speed: the structure/complexity of your index (complex mappings, filters or analyzers), data types, whether your workload is I/O or CPU bound, and so on.
In any case, to demonstrate how variable it can be, I can share my experience, as it seems different from most posted here:
Elastic 5.6 with 10GB heap running on a single vServer with 16GB RAM, 4 vCPU and an SSD that averages 150 MB/s while searching.
I can successfully index documents of wildly varying sizes via the http bulk api (curl) using a batch size of 10k documents (20k lines, file sizes between 25MB and 79MB), each batch taking ~90 seconds. index.refresh_interval is set to -1 during indexing, but that's about the only "tuning" I did, all other configurations are the default. I guess this is mostly due to the fact that the index itself is not too complex.
The vServer is at about 50% CPU, SSD averaging at 40 MB/s and 4GB RAM free, so I could probably make it faster by sending two files in parallel (I've tried simply increasing the batch size by 50% but started getting errors), but after that point it probably makes more sense to consider a different API or simply spreading the load over a cluster.
Actually, there is no clear way of finding out the exact upper limit for a bulk update. An important factor to consider is the data volume of the request, not only the number of documents.
An excerpt from link
How Big Is Too Big?
      The entire bulk request needs to be loaded into memory by the node that receives our request, so the bigger the request, the less memory available for other requests. There is an optimal size of bulk request. Above that size, performance no longer improves and may even drop off. The optimal size, however, is not a fixed number. It depends entirely on your hardware, your document size and complexity, and your indexing and search load.
      Fortunately, it is easy to find this sweet spot: Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big. A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with even smaller batches.
      It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size.
Actually, I'm facing some problems related to the bulk API. One parameter that affects it is the number of indices inside a bulk request.
