I'm indexing docs in batches and am trying to figure out what to prefer: reducing batch sizes to fit the existing max_content_length, or enlarging the limit and indexing as many documents as possible per request.
What is the recommended strategy for setting max_content_length for Elasticsearch? Is it OK to have a 1GB limit, for example?
As the famous saying goes: It depends... :-)
There's no right answer to your question because the maximum size that you can send depends on what your cluster can handle based on the software/hardware specs it is running on.
The empirical way of figuring this out is to test different sizes and see which one offers the best throughput, while still allowing the cluster to serve user requests during peak times.
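For reference, a minimal sketch of where that limit lives, assuming a recent Elasticsearch version; http.max_content_length caps the size of a single HTTP request body, and its default is 100mb, so a 1GB limit would be a deliberate (and memory-hungry) override rather than a tweak:

```yaml
# elasticsearch.yml (static setting, applies per node)
# Default is 100mb in recent versions; the whole bulk request body has to
# be held in memory by the node that receives it, so raise this with care.
http.max_content_length: 100mb
```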
Related
When I add 200 documents to Elasticsearch via one bulk request, it's super fast.
But I am wondering whether there is a chance to speed up the process with concurrent executions: 20 concurrent executions with 10 documents each.
I know it may not be efficient, but maybe concurrency would speed things up anyway?
Lower concurrency is preferable for bulk document inserts. Some concurrency is helpful in some circumstances — It Depends™ and I'll get into it — but is not a major or automatic win.
There's a lot that can be tuned when it comes to performance of writes to Elasticsearch. One really quick win that you should check: are you using HTTP keep-alive for your connections? That's going to save a lot of the TCP and TLS overhead of setting up each connection. Just that change can make a big performance boost, and also uncover some meaningful architectural considerations for your indexing pipeline.
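As a quick illustration, here's a minimal Python sketch of that connection reuse, assuming a local cluster and the ndjson bulk format; a requests.Session pools connections, so consecutive bulk requests reuse one TCP/TLS connection instead of re-handshaking every time:

```python
import requests

# One Session for the whole indexing run: its connection pool keeps the
# TCP (and TLS) connection alive across requests (HTTP keep-alive).
session = requests.Session()

def send_bulk(ndjson_body: str) -> dict:
    # The _bulk endpoint expects newline-delimited JSON.
    resp = session.post(
        "http://localhost:9200/_bulk",  # assumed cluster address
        data=ndjson_body,
        headers={"Content-Type": "application/x-ndjson"},
    )
    resp.raise_for_status()
    return resp.json()
```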
So check that out and see how it goes. From there, we should go to the bottom, and work our way up.
The index on disk is Lucene. Lucene is a segmented index. The index part is a core reason why you're using Elasticsearch in the first place: a dictionary of sorted terms can be searched in O(log N) time. That's super fast and scalable. The segment part is because inserting into an index is not particularly fast — depending on your implementation, it costs O(log N) or O(N log N) to maintain the sorting.
So Lucene's trick is to buffer those updates and append a new segment; essentially a collection of mini-indices. Searching some relatively small number of segments is still much faster than taking all the time to maintain a sorted index with every update. Over time Lucene takes care of merging these segments to keep them within some sensible size range, expunging deleted and overwritten docs in the process.
In Elasticsearch, every shard is a distinct Lucene index. If you have an index with a single shard, then there is very little benefit to having more than a single concurrent stream of bulk updates. There may be some benefit to concurrency on the application side, depending on the amount of time it takes for your indexing pipeline to collect and assemble each batch of documents. But on the Elasticsearch side, it's all just one set of buffers getting written out to one segment after another.
Sharding makes this a little more interesting.
One of Elasticsearch's strengths is the ability to partition the data of an index across multiple shards. This helps with availability, and it helps workloads scale beyond the resources of a single server.
Alas it's not quite so simple as to say that the concurrency should be equal, or proportional, to the number of primary shards that an index has. Although, as a rough heuristic, that's not a terrible one.
You see, internally, the first Elasticsearch node to handle the request acts as the coordinator: it splits that bulk request into groups of document update actions, one group per shard. Each group is sent to the node that is hosting that shard. Responses are collected by the bulk action so that it can send a summary of the bulk operation in its response to the client.
So at this point, depending on the document-shard routing, some shards may be busier than others during the course of processing an incoming bulk request. Is that likely to matter? My intuition says not really. It's possible, but it would be unusual.
In most tests and analysis I've seen, and in my experience over ~ten years with Lucene, the slow part of indexing is the transformation of the documents' values into the inverted index format. Parsing the text, analyzing it into terms, and so on, can be very complex and costly. So long as a bulk request has sufficient documents that are sufficiently well distributed across shards, the concurrency is not as meaningful as saturating the work done at the shard and segment level.
When tuning bulk requests, my advice is something like this.
Use HTTP keep-alive. This is not optional. (You are using TLS, right?)
Choose a batch size where each request is taking a modest amount of time. Somewhere around 1 second, probably not more than 10 seconds.
If you can get fancy, measure how much time each bulk request took, and dynamically grow and shrink your batch (see the sketch after this list).
A durable queue unlocks a lot of capabilities. If you can fetch and assemble documents and insert them into, say, Kafka, then that process can run in parallel to saturate the database and parallelize any denormalization or preparation of documents. A different process then pulls from the queue and sends requests to the server, and with some light coordination you can test and tune different concurrencies at different stages. A queue also lets you pause your updates for various migrations and maintenance tasks when it helps to put the cluster into read-only mode for a time.
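Here's a minimal sketch of that dynamic sizing, with an assumed send_bulk(batch) helper (building the ndjson body is omitted) and a ~1 second target per request; the grow/shrink factors are illustrative, not tuned values:

```python
import time

def index_all(docs, send_bulk, target_seconds=1.0, initial_batch=500):
    """Index docs using adaptively sized bulk requests.

    send_bulk(batch) is an assumed helper that sends one bulk request
    and blocks until Elasticsearch responds.
    """
    batch_size = initial_batch
    i = 0
    while i < len(docs):
        batch = docs[i:i + batch_size]
        started = time.monotonic()
        send_bulk(batch)
        elapsed = time.monotonic() - started
        # Grow when comfortably under target, shrink when well over it.
        if elapsed < target_seconds / 2:
            batch_size = int(batch_size * 1.5)
        elif elapsed > target_seconds * 2:
            batch_size = max(1, batch_size // 2)
        i += len(batch)
```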
I've avoided replication throughout this answer because there's only one reason where I'd ever recommend tweaking replication. And that is when you are bulk creating an index that is not serving any production traffic. In that case, it can help save some resources through your server fleet to turn off all replication to the index, and enable replication after the index is essentially done being loaded with data.
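For that one case, a sketch of the settings calls involved, assuming a hypothetical index named my-index; number_of_replicas is a dynamic index setting, so no restart is needed:

```python
import requests

ES = "http://localhost:9200"  # assumed cluster address

def set_replicas(index: str, count: int) -> None:
    # number_of_replicas can be changed on a live index via _settings.
    resp = requests.put(
        f"{ES}/{index}/_settings",
        json={"index": {"number_of_replicas": count}},
    )
    resp.raise_for_status()

set_replicas("my-index", 0)  # before the initial bulk load
# ... run the bulk load ...
set_replicas("my-index", 1)  # re-enable replication once loaded
```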
To close, what if you crank up the concurrency anyway? What's the risk? Some workloads don't control the concurrency, and there isn't always the time or resources to put a queue in front of the search engine. In that case, Elasticsearch can absorb a fairly substantial amount of concurrency. It has fairly generous thread pools for handling concurrent document updates. If those thread pools are saturated, it will reject requests with an HTTP 429 response and a clear message about queue depths being exceeded. Saturation can impact the stability of the cluster, depending on available resources and the number of shards in the index. But those are all pretty noticeable issues.
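If you do run hot anyway, a minimal sketch of handling those 429s client-side with exponential backoff, reusing the assumed send_bulk helper from earlier:

```python
import time
import requests

def send_bulk_with_retry(ndjson_body: str, send_bulk, max_retries: int = 5):
    """Retry a bulk request on HTTP 429, backing off exponentially."""
    for attempt in range(max_retries):
        try:
            return send_bulk(ndjson_body)
        except requests.HTTPError as err:
            # Only retry queue-full rejections; re-raise anything else.
            if err.response is not None and err.response.status_code == 429:
                time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
                continue
            raise
    raise RuntimeError("bulk request still rejected after retries")
```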
Bottom line: no, 20 concurrent bulks with 10 documents each will probably not be faster than 1 bulk with 200 documents. If your bulk operations are fast, increase their size until they run for a second or two, or start causing problems. Use keep-alive. If there is other app-side overhead, increase your concurrency to 2x or 3x and measure empirically. If indexing is mission-critical, use a fast, durable queue.
There is no straight answer to this as it depends on lots of factors. Above the optimal bulk request size, performance no longer improves and may even drop off. The optimal size, however, is not a fixed number.
It depends entirely on your hardware, your document size and complexity, and your indexing and search load.
Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big.
Since you are already doing it in batches of 200, chances are high that you're close to the optimal way to index. But again, it will depend on the factors mentioned above.
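A rough sketch of that experiment, assuming a local cluster, a hypothetical index name, and a list of representative docs; it ramps the batch size and prints docs/second so you can see where throughput flattens or drops:

```python
import json
import time
import requests

ES = "http://localhost:9200"  # assumed cluster address

def bulk_index(index, docs):
    # Build an ndjson bulk body: one action line, then one source line.
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    resp = requests.post(f"{ES}/_bulk", data="\n".join(lines) + "\n",
                         headers={"Content-Type": "application/x-ndjson"})
    resp.raise_for_status()

def benchmark(index, docs):
    # When docs/sec stops improving, the batch size has gotten too big.
    for size in (100, 500, 1000, 5000, 10000):
        batch = docs[:size]
        started = time.monotonic()
        bulk_index(index, batch)
        elapsed = time.monotonic() - started
        print(f"batch={size:6d}  {len(batch) / elapsed:,.0f} docs/sec")
```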
We have a cluster of workers that send indexing requests to a 4-node Elasticsearch cluster. The documents are indexed as they are generated, and since the workers have a high degree of concurrency, Elasticsearch is having trouble handling all the requests. To give some numbers, the workers process up to 3,200 tasks at the same time, and each task usually generates about 13 indexing requests. This generates an instantaneous rate that is between 60 and 250 indexing requests per second.
From the start, Elasticsearch had problems and requests were timing out or returning 429. To get around this, we increased the timeout on our workers to 200 seconds and increased the write thread pool queue size on our nodes to 700.
That's not a satisfactory long-term solution though, and I was looking for alternatives. I have noticed that when I copied an index within the same cluster with elasticdump, the write thread pool was almost empty and I attributed that to the fact that elasticdump batches indexing requests and (probably) uses the bulk API to communicate with Elasticsearch.
That gave me the idea that I could write a buffer that receives requests from the workers, batches them in groups of 200-300 requests, and then sends each group to Elasticsearch as one bulk request.
Does such a thing already exist, and does it sound like a good idea?
First of all, it's important to understand what happens behind the scenes when you send an indexing request to Elasticsearch, in order to troubleshoot the issue and find the root cause.
Elasticsearch has several thread pools, but indexing requests (single and bulk) go through the write thread pool. Check this against your Elasticsearch version, since Elastic keeps changing its thread pools (earlier releases had a separate thread pool for single and for bulk requests, each with a different queue capacity).
In the latest ES version (7.10), the write thread pool's queue capacity was increased significantly, from 200 in earlier releases to 10,000. There are likely a few reasons for this:
Elasticsearch now prefers to buffer more indexing requests instead of rejecting them.
Increasing queue capacity means more latency, but it's a trade-off, and it reduces data loss when the client doesn't have a retry mechanism.
You have probably not moved to ES 7.9, where the capacity was increased, but you can increase the size of this queue gradually, and allocate more processors if you have spare capacity, through the config change mentioned in the official example. This is a very debatable topic, and a lot of people consider it a band-aid rather than the proper fix; but now that Elastic themselves have increased the queue size, you can try it too, and if your increased traffic comes in short bursts it makes even more sense.
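For illustration, a sketch of that node-level change, using the 700 from the question above; thread_pool.write.queue_size is a static setting, so it lives in elasticsearch.yml and takes effect after a node restart:

```yaml
# elasticsearch.yml on each data node
# 200 was the old default; 7.9+ raised the default to 10000.
thread_pool.write.queue_size: 700
```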
Another critical thing is to find out the root cause of why your ES nodes are queueing up so many requests. It can be legitimate, e.g. indexing traffic has increased and your infrastructure has reached its limit. If it's not legitimate, have a look at my short tips to improve one-time indexing performance and overall indexing performance; implementing them will get you a better indexing rate, which will reduce the pressure on the write thread pool queue.
Edit: As mentioned by @Val in the comments, if you are also indexing docs one by one, then moving to the bulk index API will give you the biggest boost.
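And to the original question of whether such a buffer already exists: this is essentially what the official clients' bulk helpers do (e.g. the Java BulkProcessor mentioned later in this thread). A minimal sketch, assuming single-document requests arrive as (index, doc) pairs and a local cluster; a real implementation would also flush on a timer and handle errors:

```python
import json
import requests

class BulkBuffer:
    """Collect individual index requests; flush them as one bulk call."""

    def __init__(self, es_url="http://localhost:9200", flush_at=250):
        self.es_url = es_url      # assumed cluster address
        self.flush_at = flush_at  # ~200-300 actions, as proposed above
        self.lines = []
        self.pending = 0

    def add(self, index, doc):
        self.lines.append(json.dumps({"index": {"_index": index}}))
        self.lines.append(json.dumps(doc))
        self.pending += 1
        if self.pending >= self.flush_at:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        body = "\n".join(self.lines) + "\n"
        resp = requests.post(f"{self.es_url}/_bulk", data=body,
                             headers={"Content-Type": "application/x-ndjson"})
        resp.raise_for_status()
        self.lines, self.pending = [], 0
```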
I have an Elasticsearch setup with 192 active indices ranging from a few hundred MB to possibly 5 GB each. I read that for a Logstash use case with 1 GB indices you should only use 1 shard. The difference with my setup is that I will have more users (an estimate of up to 100) expecting a quick response time. I intend to have 1 replica for reliability.
Will having 1 shard per index still be appropriate for my use case?
In a word: yes.
The need to create multiple primary shards derives from the need to isolate documents, from extreme counts (e.g., when you're into the billions of documents), or from the need to improve write throughput (writing documents across more places, thereby reducing the burden on each one).
In practice, you want to shard based on your use case, unless you're in one of those first two scenarios (isolation or extreme counts).
Are you read heavy?
Are you write heavy? (Less common, but it does happen)
If you're read heavy, as most use cases are, then having fewer shards will help you by limiting the request size (fewer places to look). Given that your shard sizes are also relatively small (I'd consider anything under 5 GB to be relatively small), you can easily get away with having a single primary shard and it should benefit your search performance by doing so.
Indexes that share the same mappings, but are also tiny ("few hundred MBs"), should likely be combined if you search across them. If they're independent, then it really makes no difference and the isolation sounds like good practice at the expense of slightly bloating your cluster state (with each index).
Have a look at this blog: https://qbox.io/blog/optimizing-elasticsearch-how-many-shards-per-index. He has a lot of good pointers to sharding and shard sizing.
However, the question you really should be asking yourself is: how easy is it to change? When it comes to sizing and scalability, the answer often is "it depends" - and the real question is: how quickly can you reconfigure?
This could e.g. mean designing your application in a way that allows quick re-spooling of data into a new index, using aliases so that you can in fact change these things, knowing where your data lies (not just in Elastic, I hope), etc.
Building a system - from the start - so that you can quickly rebuild indices enables you to experiment with sizes and, more importantly, change them as your needs change.
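A minimal sketch of the alias technique, with hypothetical index names docs_v1 and docs_v2; clients always query the docs alias, and the swap is atomic, so you can rebuild an index with a different shard count and cut over without downtime:

```python
import requests

ES = "http://localhost:9200"  # assumed cluster address

# Atomically repoint the "docs" alias from the old index to the new one;
# both actions are applied in a single cluster state update.
resp = requests.post(f"{ES}/_aliases", json={
    "actions": [
        {"remove": {"index": "docs_v1", "alias": "docs"}},
        {"add": {"index": "docs_v2", "alias": "docs"}},
    ]
})
resp.raise_for_status()
```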
How scalable is RethinkDB? Can it be used for TBs of data?
I have around 400 GB of data, which is bound to increase by 10-25 GB per week. Any suggestions would be of great help.
RethinkDB is very scalable and you should be able to use it with TBs of data, no problem; that's what it's designed for. As for its stability, it should be very stable as of version 2.0, and fully production-ready.
http://www.rethinkdb.com/stability/
Note that a NoSQL-Database does increase the amount of storage required massively, so be sure to have enough disk space on your nodes.
But a few terabytes are no issue for RethinkDB; it's designed to handle a nearly unbounded amount of data. If your nodes have a lot of RAM, it also provides near in-memory performance, because the caching algorithms are very good.
You can easily spread your data out over many nodes by sharding, which also increases the absolute throughput of your cluster, but slightly increases query latency because of the round trips between cluster members.
I believe there should be a formula to calculate the bulk indexing size in Elasticsearch. Probably the following are the variables of such a formula.
Number of nodes
Number of shards/index
Document size
RAM
Disk write speed
LAN speed
I wonder if anyone knows or uses a mathematical formula. If not, how do people decide their bulk size? By trial and error?
Read the ES bulk API doc carefully: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html#_using_and_sizing_bulk_requests
Try with 1 KiB, try with 20 KiB, then with 10 KiB, ... bisecting your way to the sweet spot
Use bulk size in KiB (or equivalent), not document count!
Send data in bulk (no streaming); pass info that repeats on every action, such as the index name, in the API URL if you can
Remove superfluous whitespace in your data if possible
Disable search index updates (refresh), and turn them back on afterwards
Round-robin across all your data nodes
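A minimal sketch of sizing bulks by bytes rather than by document count, assuming a local cluster and a hypothetical 10 MiB cap to be tuned via the bisection above:

```python
import json
import requests

ES = "http://localhost:9200"         # assumed cluster address
MAX_BULK_BYTES = 10 * 1024 * 1024    # ~10 MiB per request; tune this

def bulk_by_bytes(index, docs):
    """Flush a bulk request whenever the body would exceed the byte cap."""
    buf, size = [], 0
    for doc in docs:
        action = json.dumps({"index": {"_index": index}})
        source = json.dumps(doc)
        line_bytes = len(action) + len(source) + 2  # +2 for the newlines
        if buf and size + line_bytes > MAX_BULK_BYTES:
            _flush(buf)
            buf, size = [], 0
        buf += [action, source]
        size += line_bytes
    if buf:
        _flush(buf)

def _flush(lines):
    resp = requests.post(f"{ES}/_bulk", data="\n".join(lines) + "\n",
                         headers={"Content-Type": "application/x-ndjson"})
    resp.raise_for_status()
```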
There is no golden rule for this. Extracted from the doc:
There is no “correct” number of actions to perform in a single bulk call. You should experiment with different settings to find the optimum size for your particular workload.
I derived this information from the Java API's BulkProcessor class. It defaults to 1000 actions or 5MB; it also allows you to set a flush interval, but that is not set by default. I'm just using the default settings.
I'd suggest using BulkProcessor if you are using the Java API.
I was searching about this and I found your question :)
I found this in the Elastic documentation:
It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size.
... so I will investigate the size of my documents.
In my case, I could not get more than 100,000 records to insert at a time. I started with 13 million, went down to 500,000, and after no success started from the other side: 1,000, then 10,000, then 100,000 - my max.
I haven't found a better way than trial and error (i.e. the traditional engineering process), as there are many factors beyond hardware influencing indexing speed: the structure/complexity of your index (complex mappings, filters or analyzers), data types, whether your workload is I/O or CPU bound, and so on.
In any case, to demonstrate how variable it can be, I can share my experience, as it seems different from most posted here:
Elastic 5.6 with 10GB heap running on a single vServer with 16GB RAM, 4 vCPU and an SSD that averages 150 MB/s while searching.
I can successfully index documents of wildly varying sizes via the HTTP bulk API (curl) using a batch size of 10k documents (20k lines, file sizes between 25MB and 79MB), with each batch taking ~90 seconds. index.refresh_interval is set to -1 during indexing, but that's about the only "tuning" I did; all other configurations are the default. I guess this is mostly due to the fact that the index itself is not too complex.
The vServer is at about 50% CPU, the SSD averaging 40 MB/s, and 4GB RAM free, so I could probably make it faster by sending two files in parallel (I've tried simply increasing the batch size by 50%, but started getting errors), but after that point it probably makes more sense to consider a different API or simply to spread the load over a cluster.
Actually, there is no clear way of finding out the exact upper limit for the bulk update. An important factor to consider in the bulk update is the request data volume, not just the number of documents.
An excerpt from the link:
How Big Is Too Big?
The entire bulk request needs to be loaded into memory by the node that receives our request, so the bigger the request, the less memory available for other requests. There is an optimal size of bulk request. Above that size, performance no longer improves and may even drop off. The optimal size, however, is not a fixed number. It depends entirely on your hardware, your document size and complexity, and your indexing and search load.
Fortunately, it is easy to find this sweet spot: Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big. A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with even smaller batches.
It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents. A good bulk size to start playing with is around 5-15MB in size.
Actually, I'm facing some problems related to the bulk API. There is one parameter that impacts the bulk API: the number of index actions inside a bulk request.