Performance issues when pushing data at a constant rate to Elasticsearch on multiple indexes at the same time

Performance issues when pushing data at a constant rate to Elasticsearch on multiple indexes at the same time - performance

We are experiencing some performance issues or anomalies on a elasticsearch specifically on a system we are currently building.
The requirements:
We need to capture data for multiple of our customers, who will query and report on them on a near real time basis. All the documents received are the same format with the same properties and are in a flat structure (all fields are of primary type and no nested objects). We want to keep each customer’s information separate from each other.
Frequency of data received and queried:
We receive data for each customer at a fluctuating rate of 200 to 700 documents per second – with the peak being in the middle of the day.
Queries will be mostly aggregations over around 12 million documents per customer – histogram/percentiles to show patterns over time and the occasional raw document retrieval to find out what happened a particular point in time. We are aiming to serve 50 to 100 customer at varying rates of documents inserted – the smallest one could be 20 docs/sec to the largest one peaking at 1000 docs/sec for some minutes.
How are we storing the data:
Each customer has one index per day. For example, if we have 5 customers, there will be a total of 35 indexes for the whole week. The reason we break it per day is because it is mostly the latest two that get queried with occasionally the remaining others. We also do it that way so we can delete older indexes independently of customers (some may want to keep 7 days, some 14 days’ worth of data)
How we are inserting:
We are sending data in batches of 10 to 2000 – every second. One document is around 900bytes raw.
Environment
AWS C3-Large – 3 nodes
All indexes are created with 10 shards with 2 replica for the test purposes
Both Elasticsearch 1.3.2 and 1.4.1
What we have noticed:
If I push data to one index only, Response time starts at 80 to 100ms for each batch inserted when the rate of insert is around 100 documents per second. I ramp it up and I can reach 1600 before the rate of insert goes to close to 1sec per batch and when I increase it to close to 1700, it will hit a wall at some point because of concurrent insertions and the time will spiral to 4 or 5 seconds. Saying that, if I reduce the rate of inserts, Elasticsearch recovers nicely. CPU usage increases as rate increases.
If I push to 2 indexes concurrently, I can reach a total of 1100 and CPU goes up to 93% around 900 documents per second.
If I push to 3 indexes concurrently, I can reach a total of 150 and CPU goes up to 95 to 97%. I tried it many times. The interesting thing is that response time is around 109ms at the time. I can increase the load to 900 and response time will still be around 400 to 600 but CPU stays up.
Question:
Looking at our requirements and findings above, is the design convenient for what’s asked? Are there any tests that I can do to find out more? Is there any setting that I need to check (and change)?

I've been hosting thousands of Elasticsearch clusters on AWS over at https://bonsai.io for the last few years, and have had many a capacity planning conversation that sound like this.
First off, it sounds to me like you have a pretty good cluster design and test rig going here. My first intuition here is that you are legitimately approaching the limits of your c3.large instances, and will want to bump up to a c3.xlarge (or bigger) fairly soon.
An index per tenant per day could be reasonable, if you have relatively few tenants. You may consider an index per day for all tenants, using filters to focus your searches on specific tenants. And unless there are obvious cost savings to discarding old data, then filters should suffice to enforce data retention windows as well.
The primary benefit of segmenting your indices per tenant would be to move your tenants between different Elasticsearch clusters. This could help if you have some tenants with wildly larger usage than others. Or to reduce the potential for Elasticsearch's cluster state management to be a single point of failure for all tenants.
A few other things to keep in mind that may help explain the performance variance you're seeing.
Most importantly here, indexing is incredibly CPU bottlenecked. This makes sense, because Elasticsearch and Lucene are fundamentally just really fancy string parsers, and you're sending piles of strings. (Piles are a legitimate unit of measurement here, right?) Your primary bottleneck is going to be the number and speed of your CPU cores.
In order to take the best advantage of your CPU resources while indexing, you should consider the number of primary shards you're using. I'd recommend starting with three primary shards to distribute the CPU load evenly across the three nodes in your cluster.
For production, you'll almost certainly end up on larger servers. The goal is for your total CPU load for your peak indexing requirements ends up under 50%, so you have some additional overhead for processing your searches. Aggregations are also fairly CPU hungry. The extra performance overhead is also helpful for gracefully handling any other unforeseen circumstances.
You mention pushing to multiple indices concurrently. I would avoid concurrency when bulk updating into Elasticsearch, in favor of batch updating with the Bulk API. You can bulk load documents for multiple indices with the cluster-level /_bulk endpoint. Let Elasticsearch manage the concurrency internally without adding to the overhead of parsing more HTTP connections.
That's just a quick introduction to the subject of performance benchmarking. The Elasticsearch docs have a good article on Hardware which may also help you plan your cluster size.

Related

Random spikes in usage (CockroachCloud Serverless)

I recently set up a free CockroachDB Serverless cluster on CockroachCloud. It's been really great so far, but sometimes there are random spikes in Request Units even though the amount of SQL statements doesn't increase at all. Here's a screenshot of the two graphs in the cluster management page, it illustrates pretty well what I mean. I would really appreciate some help on how I could eliminate these spikes because CockroachCloud has some limits on free usage. That being said, I'm still fairly new to CockroachDB, so I might be missing something obvious.

You are likely performing enough mutations on your data to trigger automatic statistics collection as a background process. By default when 20% or more rows are modified in a table, CockroachDB will trigger a statistics refresh. The statistics are used by the optimizer to create more efficient query plans.
Your SQL Statements graph indicates that almost all your operations are inserts. That many inserts is almost certainly triggering stats collection. While you can turn off stats collection, the optimizer will then be using stale data to calculate query plans, potentially causing performance problems.
The occasional spikes in your Request Unit graph are above the 100 RUs per second baseline, but the rest of the time you are well below 100 RUs per second. That means you are accumulating RUs most of the time, and that (plus the initial 10 million RU allocation) should cover the bursts.
I added a FAQ entry to the Serverless docs covering this.

Elasticsearch Latency

I am using Elasticsearch's MultiSearch API to make multiple search requests at once for one of my endpoints. My understanding is that these requests are done in parallel, but my endpoint's latency increases with the number of search requests I make through the API (<50). I have two questions:
Why is this latency increase happening/how does multisearch work behind the scenes? I am new to Elasticsearch, apologies for my lack of knowledge here.
What are some ways I can improve latency while keeping multisearch?

To provide a more comprehensive answer, it would be good to know your cluster setup.
These requests are indeed done in parallel, but your cluster still has its limits.
What I believe might be happening is that you might not have enough search threads to process that many searches in parallel and your search thread pool start queueing.
https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html
So for instance, if you issue a MultiSearch query of let's say 10 search queries where each query would hit 15 shards, this means that this whole query will need 150 search threads in total. And if there are other searches running and the cluster doesn't have available search threads - they will start queueing and eventually might reject if the queue grows too big.
What can you do about it?
Carefully review indices setups, their number_of_shards of shards, and indices sizes. Reducing the number_of_shards will require fewer search threads. Find a balance between number_of_shards and index sizes and their doc count. If there are less than 5M documents, keep everything in a single shard, otherwise, try to have shards of 3M-5M documents, e.g. index with 23M documents could use 5 or 6 shards.
Scale your cluster horizontally by adding new nodes, this will add new search threads
Tweak default thread pool settings (this is mostly the last thing you'd do)

Amazon Elasticsearch - Concurrent Bulk Requests

When I am adding 200 documents to ElasticSearch via one bulk request - it's super fast.
But I am wondering if is there a chance to speed up the process with concurrent executions: 20 concurrent executions with 10 documents each.
I know it's not efficient, but maybe there is a chance to speed up the process with concurrent executions?

Lower concurrency is preferable for bulk document inserts. Some concurrency is helpful in some circumstances — It Depends™ and I'll get into it — but is not a major or automatic win.
There's a lot that can be tuned when it comes to performance of writes to Elasticsearch. One really quick win that you should check: are you using HTTP keep-alive for your connections? That's going to save a lot of the TCP and TLS overhead of setting up each connection. Just that change can make a big performance boost, and also uncover some meaningful architectural considerations for your indexing pipeline.
So check that out and see how it goes. From there, we should go to the bottom, and work our way up.
The index on disk is Lucene. Lucene is a segmented index. The index part is a core reason why you're using Elasticsearch in the first place: a dictionary of sorted terms can be searched in O(log N) time. That's super fast and scalable. The segment part is because inserting into an index is not particularly fast — depending on your implementation, it costs O(log N) or O(N log N) to maintain the sorting.
So Lucene's trick is to buffer those updates and append a new segment; essentially a collection of mini-indices. Searching some relatively small number of segments is still much faster than taking all the time to maintain a sorted index with every update. Over time Lucene takes care of merging these segments to keep them within some sensible size range, expunging deleted and overwritten docs in the process.
In Elasticsearch, every shard is a distinct Lucene index. If you have an index with a single shard, then there is very little benefit to having more than a single concurrent stream of bulk updates. There may be some benefit to concurrency on the application side, depending on the amount of time it takes for your indexing pipeline to collect and assemble each batch of documents. But on the Elasticsearch side, it's all just one set of buffers getting written out to one segment after another.
Sharding makes this a little more interesting.
One of Elasticsearch's strengths is the ability to partition the data of an index across multiple shards. This helps with availability, and it helps workloads scale beyond the resources of a single server.
Alas it's not quite so simple as to say that the concurrency should be equal, or proportional, to the number of primary shards that an index has. Although, as a rough heuristic, that's not a terrible one.
You see, internally, the first Elasticsearch node to handle the request is going to turn that Bulk request into a sequence of individual document update actions. Each document update is sent to the appropriate node that is hosting the shard that this document belongs to. Responses are collected by the bulk action so that it can send a summary of the bulk operation in its response to the client.
So at this point, depending on the document-shard routing, some shards may be busier than others during the course of processing an incoming bulk request. Is that likely to matter? My intuition says not really. It's possible, but it would be unusual.
In most tests and analysis I've seen, and in my experience over ~ten years with Lucene, the slow part of indexing is the transformation of the documents' values into the inverted index format. Parsing the text, analyzing it into terms, and so on, can be very complex and costly. So long as a bulk request has sufficient documents that are sufficiently well distributed across shards, the concurrency is not as meaningful as saturating the work done at the shard and segment level.
When tuning bulk requests, my advice is something like this.
Use HTTP keep-alive. This is not optional. (You are using TLS, right?)
Choose a batch size where each request is taking a modest amount of time. Somewhere around 1 second, probably not more than 10 seconds.
If you can get fancy, measure how much time each bulk request took, and dynamically grow and shrink your batch.
A durable queue unlocks a lot of capabilities. If can fetch and assemble documents and insert them into, say, Kafka, then that process can be run in parallel to saturate the database and parallelize any denormalization or preparation of documents. A different process then pulls from the queue and sends requests to the server, and with some light coordination you can test and tune different concurrencies at different stages. A queue also lets you pause your updates for various migrations and maintenance tasks when it helps to put the cluster into read-only mode for a time.
I've avoided replication throughout this answer because there's only one reason where I'd ever recommend tweaking replication. And that is when you are bulk creating an index that is not serving any production traffic. In that case, it can help save some resources through your server fleet to turn off all replication to the index, and enable replication after the index is essentially done being loaded with data.
To close, what if you crank up the concurrency anyway? What's the risk? Some workloads don't control the concurrency and there isn't the time or resources to put a queue in front of the search engine. In that case, Elasticsearch can avoid a fairly substantial amount of concurrency. It has fairly generous thread pools for handling concurrent document updates. If those thread pools are saturated, it will reject responses with a HTTP 429 error message and a clear message about queue depths being exceeded. Those can impact stability of the cluster, depending on available resources, and number of shards in the index. But those are all pretty noticeable issues.
Bottom line: no, 20 concurrent bulks with 10 documents each will probably not speed up performance relative to 1 bulk with 200 documents. If your bulk operations are fast, you should increase their size until they run for a second or two, or are problematic. Use keep-alive. If there is other app-side overhead, increase your concurrency to 2x or 3x and measure empirically. If indexing is mission critical, use a fast, durable queue.

There is no straight answer to this as it depends on lots of factors. Above the optimal bulk request size, performance no longer improves and may even drop off. The optimal size, however, is not a fixed number.
It depends entirely on your hardware, your document size and complexity, and your indexing and search load.
Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big.
Since you are doing it in batches of 200, the chances are high that it should be most optimal way to index. But again it will depend on the factors mentioned above.

Creating high throughput Elasticsearch cluster

We are in process of implementing Elasticsearch as a search solution in our organization. For the POC we implemented a 3-Node cluster ( each node with 16 VCores and 60 GB RAM and 6 * 375GB SSDs) with all the nodes acting as master, data and co-ordination node. As it was a POC indexing speeds were not a consideration we were just trying to see if it will work or not.
Note : We did try to index 20 million documents on our POC cluster and it took about 23-24 hours to do that which is pushing us to take time and design the production cluster with proper sizing and settings.
Now we are trying to implement a production cluster (in Google Cloud Platform) with emphasis on both indexing speed and search speed.
Our use case is as follows :
We will bulk index 7 million to 20 million documents per index ( we have 1 index for each client and there will be only one cluster). This bulk index is a weekly process i.e. we'll index all data once and will query it for whole week before refreshing it.We are aiming for a 0.5 million document per second indexing throughput.
We are also looking for a strategy to horizontally scale when we add more clients. I have mentioned the strategy in subsequent sections.
Our data model has nested document structure and lot of queries on nested documents which according to me are CPU, Memory and IO intensive. We are aiming for sub second query times for 95th percentile of queries.
I have done quite a bit of reading around this forum and other blogs where companies have high performing Elasticsearch clusters running successfully.
Following are my learnings :
Have dedicated master nodes (always odd number to avoid split-brain). These machines can be medium sized ( 16 vCores and 60 Gigs ram) .
Give 50% of RAM to ES Heap with an exception of not exceeding heap size above 31 GB to avoid 32 bit pointers. We are planning to set it to 28GB on each node.
Data nodes are the workhorses of the cluster hence have to be high on CPUs, RAM and IO. We are planning to have (64 VCores, 240 Gb RAM and 6 * 375 GB SSDs).
Have co-ordination nodes as well to take bulk index and search requests.
Now we are planning to begin with following configuration:
3 Masters - 16Vcores, 60GB RAM and 1 X 375 GB SSD
3 Cordinators - 64Vcores, 60GB RAM and 1 X 375 GB SSD (Compute Intensive machines)
6 Data Nodes - 64 VCores, 240 Gb RAM and 6 * 375 GB SSDs
We have a plan to adding 1 Data Node for each incoming client.
Now since hardware is out of windows, lets focus on indexing strategy.
Few best practices that I've collated are as follows :
Lower number of shards per node is good of most number of scenarios, but have good data distribution across all the nodes for a load balanced situation. Since we are planning to have 6 data nodes to start with, I'm inclined to have 6 shards for the first client to utilize the cluster fully.
Have 1 replication to survive loss of nodes.
Next is bulk indexing process. We have a full fledged spark installation and are going to use elasticsearch-hadoop connector to push data from Spark to our cluster.
During indexing we set the refresh_interval to 1m to make sure that there are less frequent refreshes.
We are using 100 parallel Spark tasks which each task sending 2MB data for bulk request. So at a time there is 2 * 100 = 200 MB of bulk requests which I believe is well within what ES can handle. We can definitely alter these settings based on feedback or trial and error.
I've read more about setting cache percentage, thread pool size and queue size settings, but we are planning to keep them to smart defaults for beginning.
We are open to use both Concurrent CMS or G1GC algorithms for GC but would need advice on this. I've read pros and cons for using both and in dilemma in which one to use.
Now to my actual questions :
Is sending bulk indexing requests to coordinator node a good design choice or should we send it directly to data nodes ?
We will be sending query requests via coordinator nodes. Now my question is, lets say since my data node has 64 cores, each node has thread pool size of 64 and 200 queue size. Lets assume that during search data node thread pool and queue size is completely exhausted then will the coordinator nodes keep accepting and buffering search requests at their end till their queue also fill up ? Or will 1 thread on coordinator will also be blocked per each query request ?
Say a search request come up to coordinator node it blocks 1 thread there and send request to data nodes which in turn blocks threads on data nodes as per where query data is lying. Is this assumption correct ?
While bulk indexing is going on ( assuming that we do not run indexing for all the clients in parallel and schedule them to be sequential) how to best design to make sure that query times do not take much hit during this bulk index.
References
https://thoughts.t37.net/designing-the-perfect-elasticsearch-cluster-the-almost-definitive-guide-e614eabc1a87
https://thoughts.t37.net/how-we-reindexed-36-billions-documents-in-5-days-within-the-same-elasticsearch-cluster-cd9c054d1db8
https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html

We did try to index 20 million documents on our POC cluster and it took about 23-24 hours
That is surprisingly little — like less than 250 docs/s. I think my 8GB RAM laptop can insert 13 million docs in 2h. Either you have very complex documents, some bad settings, or your bottleneck is on the ingestion side.
About your nodes: I think you could easily get away with less memory on the master nodes (like 32GB should be plenty). Also the memory on data nodes is pretty high; I'd normally expect heap in relation to the rest of the memory to be 1:1 or for lots of "hot" data maybe 1:3. Not sure you'll get the most out of that 1:7.5 ratio.
CMS vs G1GC: If you have a current Elasticsearch and Java version, both are an option, otherwise CMS. You're generally trading throughput for (GC) latency, so if you benchmark be sure to have a long enough timeframe to properly hit GC phases and run as close to production queries in parallel as possible.
Is sending bulk indexing requests to coordinator node a good design choice or should we send it directly to data nodes ?
I'd say the coordinator is fine. Unless you use a custom routing key and the bulk only contains data for that specific data node, 5/6th of the documents would need to be forwarded to other data nodes anyway (if you have 6 data nodes). And you can offload the bulk processing and coordination handling to non data nodes.
However, overall it might make more sense to have 3 additional data nodes and skip the dedicated coordinating node. Though this is something you can only say for certain by benchmarking your specific scenario.
Now my question is, lets say since my data node has 64 cores, each node has thread pool size of 64 and 200 queue size. Lets assume that during search data node thread pool and queue size is completely exhausted then will the coordinator nodes keep accepting and buffering search requests at their end till their queue also fill up ? Or will 1 thread on coordinator will also be blocked per each query request ?
I'm not sure I understand the question. But have you looked into https://www.elastic.co/blog/why-am-i-seeing-bulk-rejections-in-my-elasticsearch-cluster, which might shed some more light on this topic?
While bulk indexing is going on ( assuming that we do not run indexing for all the clients in parallel and schedule them to be sequential) how to best design to make sure that query times do not take much hit during this bulk index.
While there are different queues for different query operations, there is otherwise no clear separation of tasks (like "only use 20% of the resources for indexing). Maybe go a little more conservative on the parallel bulk requests to avoid overloading the node.
If you are not reading from an index while it's being indexed (ideally you flip an alias once done): You might want to disable the refresh rate entirely and let Elasticsearch create segments as needed, but do a force refresh and change the setting once done. Also you could try running with 0 replicas while indexing, change replicas to 1 once done, and then wait for it to finish — though I'd benchmark if this is helping overall and if it's worth the added complexity.

Resource usage with rolling indices in Elasticsearch

My question is mostly based on the following article:
https://qbox.io/blog/optimizing-elasticsearch-how-many-shards-per-index
The article advises against having multiple shards per node for two reasons:
Each shard is essentially a Lucene index, it consumes file handles, memory, and CPU resources
Each search request will touch a copy of every shard in the index. Contention arises and performance decreases when the shards are competing for the same hardware resources
The article advocates the use of rolling indices for indices that see many writes and fewer reads.
Questions:
Do the problems of resource consumption by Lucene indices arise if the old indices are left open?
Do the problems of contention arise when searching over a large time range involving many indices and hence many shards?
How does searching many small indices compare to searching one large one?
I should mention that in our particular case, there is only one ES node though of course generally applicable answers will be more useful to SO readers.

It's very difficult to spit out general best practices and guidelines when it comes to cluster sizing as it depends on so many factors. If you ask five ES experts, you'll get ten different answers.
After several years of tinkering and fiddling around ES, I've found out that what works best for me is always to start small (one node, how many indices your app needs and one shard per index), load a representative data set (ideally your full data set) and load test to death. Your load testing scenarii should represent the real maximum load you're experiencing (or expecting) in your production environment during peak hours.
Increase the capacity of your cluster (add shard, add nodes, tune knobs, etc) until your load test pass and make sure to increase your capacity by a few more percent in order to allow for future growth. You don't want your production to be fine now, you want it to be fine in a year from now. Of course, it will depend on how fast your data will grow and it's very unlikely that you can predict with 100% certainty what will happen in a year from now. For that reason, as soon as my load test pass, if I expect a large exponential data growth, I usually increase the capacity by 50% more percent, knowing that I will have to revisit my cluster topology within a few month or a year.
So to answer your questions:
Yes, if old indices are left open, they will consume resources.
Yes, the more indices you search, the more resources you will need in order to go through every shard of every index. Be careful with aliases spanning many, many rolling indices (especially on a single node)
This is too broad to answer, as it again depends on the amount of data we're talking about and on what kind of query you're sending, whether it uses aggregation, sorting and/or scripting, etc

Do the problems of resource consumption by Lucene indices arise if the old indices are left open?
Yes.
Do the problems of contention arise when searching over a large time range involving many indices and hence many shards?
Yes.
How does searching many small indices compare to searching one large one?
When ES searches an index it will pick up one copy of each shard (be it replica or primary) and asks that copy to run the query on its own set of data. Searching a shard will use one thread from the search threadpool the node has (the threadpool is per node). One thread basically means one CPU core. If your node has 8 cores then at any given time the node can search concurrently 8 shards.
Imagine you have 100 shards on that node and your query will want to search all of them. ES will initiate the search and all 100 shards will compete for the 8 cores so some shards will have to wait some amount of time (microseconds, milliseconds etc) to get their share of those 8 cores. Having many shards means less documents on each and, thus, potentially a faster response time from each. But then the node that initiated the request needs to gather all the shards' responses and aggregate the final result. So, the response will be ready when the slowest shard finally responds with its set of results.
On the other hand, if you have a big index with very few shards, there is not so much contention for those CPU cores. But the shards having a lot of work to do individually, it can take more time to return back the individual result.
When choosing the number of shards many aspects need to be considered. But, for some rough guidelines yes, 30GB per shard is a good limit. But this won't work for everyone and for every use case and the article fails to mention that. If, for example, your index is using parent/child relationships those 30GB per shard might be too much and the response time of a single shard can be too slow.
You took this out of the context: "The article advises against having multiple shards per node". No, the article advises one to think about the aspects of structuring the indices shards before hand. One important step here is the testing one. Please, test your data before deciding how many shards you need.
You mentioned in the post "rolling indices", and I assume time-based indices. In this case, one question is about the retention period (for how long you need the data). Based on the answer to this question you can determine how many indices you'll have. Knowing how many indices you'll have gives you the total number of shards you'll have.
Also, with rolling indices, you need to take care of deleting the expired indices. Have a look at Curator for this.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio