mongodb connection count not scale balanced between nodes in replicas set - spring

I am now building a web application using MongoDB + Spring Data Mongodb.
This application has an API which is a simple MongoDB query:
db.myCollectionName.aggregate([{ "$sample" : { "size" : 100}}, { "$project" : { "myFieldName" : 1, "_id" : 0}}])
We randomly pick up 100 documents from myCollectionName and project myFieldName. It's a simple read request.
Total collection is about 100m records and afak I use $sample operator properly:
$sample is the first stage of the pipeline
N is less than 5% of the total documents in the collection
The collection contains more than 100 documents
Our mongodb cluster is a 3 node replicaset, one node handles write requests, and two others handle read requests.
I am testing my application using JMeter, trying to figure how many TPS the application can support at most.
Report shows that, when the number of concurrency is 100, the TPS performance of the application is the best, reaching 1000, with a average 95pct latency of 180ms. The workload is evenly distributed to the two secondary nodes. The number of operations is about 500, connection count is about 180 in each secondary node.
Meanwhile, on the application side, the work load is fairly low, only 30% of cpu and memory used.
Next, I try to add one more secondary node to see if TPS will increase linearly and now strange things happens:
After the number of concurrency exceeds 100, the TPS of the system does not increase linearly. The TPS stays in 1000, with each secondary node handling only 400 operations (same operationCount). As the number of concurrency goes up, average 95pct lantency of the application also begins to degrade, from 180ms to ~450ms.
In the mean time, I notice that the connectiont count is not evenly established in each secondary node. One secondary node has established over 300 connections, while two others have only 180. Moreover, as the number of concurrency exceeds 150, slow queries start to appear on the node with more connections.
I can confirm that the work load on the application side is fairly low, with 3 secondary nodes, the application still does not reach its load limitations. The max connection count configured on the MongoDB client side is 2000, as far as I know, it should be enough.

Review https://github.com/mongodb/specifications/blob/master/source/server-selection/server-selection.rst under "Server Selection Algorithm".
First, check the latencies as seen by the driver for each of the servers. Increase the window using localThresholdMS URI option if needed.
Then, you either have incorrectly collected the data or you have a server problem, because
Moreover, as the number of concurrency exceeds 150, slow queries start to appear on nodes with more connections.
If you have a node that is able to sustain 400 connections but with the same workload is not able to sustain >150 <400 connections, that's a server problem that you need to investigate and repair.
After you've done that, note that in the server selection algorithm is this item:
Of the two randomly chosen servers, select the one with the lower operationCount. If both servers have the same operationCount, select arbitrarily between the two of them.
This is supposed to route queries to less loaded nodes. This naturally doesn't work well if either the nodes are reporting their load incorrectly, or perhaps you are mistaken as to what load you are reading from where.

Related

Elasticsearch application latency investigation

We have an Elasticsearch setup w/ [data, master, client] nodes. Client receives only query traffic, pass query to data nodes, gets the response, sends back to caller (based on my general understanding).
We are seeing that 'took' latency in the query response is around ~16ms, but our application who is measuring latency when calling into client is around ~90ms. Here are some numbers on our setup:
ES Setup: 3 client nodes (60GB/3 cpu/30GB heap each), 3 data nodes (80GB/16 cpu/30GB heap each), 3 master nodes. Its a k8 based helm chart setup.
client/data have enough cpu/mem (based on k8's pod level cpu/mem usages)
QPS - 20 req/sec
Shard size ~ 24GB, 0 replicas. Each shard is on a separate data-node. Indices are using mmapfs/preload "*"
Query types: bool query w/ 3 match clauses and 3 should for boosting on few fields. We have "_source=true".
Our documents are quite bigger with (mean, p90, p99) as (200kb, 400kb, 800kb)
Our response size is of the order of (mean, p99) (164kb, 840kb). We also observed latencies for bigger response sizes is much higher than the baseline.
Can someone comment on following questions:
How can we know more exactly where is this extra latency is
introduced? When reading about "took" here, it includes the querying and response forming stages. But something happens after that so that our application measured latency jumps to ~90ms. Where else can I look to look more into this increase? I have access to Prometheus ES dash and K8 pods usages, but all of them look normal and no spikes.
Are there some ES settings we can play with to optimize this
latency? We feel its mostly due to bigger response sizes. Can there be some compression introduced in ES to help w/ this?
Your question is very broad with very less information on your deployment, like types of search queries, index mapping, cluster/nodes/indices specification, and QPS..it's generally very difficult to suggest anything without looking at the system which has performance issues...
Reg, client nodes, yes, they receive the traffic but they also calculate the Global result from the local result set received from each shard involved in the search request. so they do the heavy processing and should have enough capacity otherwise would become bottleneck, even though your data nodes calculate the local result fast, processing at the client node would take more time and overall took time would increase.
You can also see if you have a room to improve some of the suggestions I wrote in this blog post.
Hope this helps.

Creating high throughput Elasticsearch cluster

We are in process of implementing Elasticsearch as a search solution in our organization. For the POC we implemented a 3-Node cluster ( each node with 16 VCores and 60 GB RAM and 6 * 375GB SSDs) with all the nodes acting as master, data and co-ordination node. As it was a POC indexing speeds were not a consideration we were just trying to see if it will work or not.
Note : We did try to index 20 million documents on our POC cluster and it took about 23-24 hours to do that which is pushing us to take time and design the production cluster with proper sizing and settings.
Now we are trying to implement a production cluster (in Google Cloud Platform) with emphasis on both indexing speed and search speed.
Our use case is as follows :
We will bulk index 7 million to 20 million documents per index ( we have 1 index for each client and there will be only one cluster). This bulk index is a weekly process i.e. we'll index all data once and will query it for whole week before refreshing it.We are aiming for a 0.5 million document per second indexing throughput.
We are also looking for a strategy to horizontally scale when we add more clients. I have mentioned the strategy in subsequent sections.
Our data model has nested document structure and lot of queries on nested documents which according to me are CPU, Memory and IO intensive. We are aiming for sub second query times for 95th percentile of queries.
I have done quite a bit of reading around this forum and other blogs where companies have high performing Elasticsearch clusters running successfully.
Following are my learnings :
Have dedicated master nodes (always odd number to avoid split-brain). These machines can be medium sized ( 16 vCores and 60 Gigs ram) .
Give 50% of RAM to ES Heap with an exception of not exceeding heap size above 31 GB to avoid 32 bit pointers. We are planning to set it to 28GB on each node.
Data nodes are the workhorses of the cluster hence have to be high on CPUs, RAM and IO. We are planning to have (64 VCores, 240 Gb RAM and 6 * 375 GB SSDs).
Have co-ordination nodes as well to take bulk index and search requests.
Now we are planning to begin with following configuration:
3 Masters - 16Vcores, 60GB RAM and 1 X 375 GB SSD
3 Cordinators - 64Vcores, 60GB RAM and 1 X 375 GB SSD (Compute Intensive machines)
6 Data Nodes - 64 VCores, 240 Gb RAM and 6 * 375 GB SSDs
We have a plan to adding 1 Data Node for each incoming client.
Now since hardware is out of windows, lets focus on indexing strategy.
Few best practices that I've collated are as follows :
Lower number of shards per node is good of most number of scenarios, but have good data distribution across all the nodes for a load balanced situation. Since we are planning to have 6 data nodes to start with, I'm inclined to have 6 shards for the first client to utilize the cluster fully.
Have 1 replication to survive loss of nodes.
Next is bulk indexing process. We have a full fledged spark installation and are going to use elasticsearch-hadoop connector to push data from Spark to our cluster.
During indexing we set the refresh_interval to 1m to make sure that there are less frequent refreshes.
We are using 100 parallel Spark tasks which each task sending 2MB data for bulk request. So at a time there is 2 * 100 = 200 MB of bulk requests which I believe is well within what ES can handle. We can definitely alter these settings based on feedback or trial and error.
I've read more about setting cache percentage, thread pool size and queue size settings, but we are planning to keep them to smart defaults for beginning.
We are open to use both Concurrent CMS or G1GC algorithms for GC but would need advice on this. I've read pros and cons for using both and in dilemma in which one to use.
Now to my actual questions :
Is sending bulk indexing requests to coordinator node a good design choice or should we send it directly to data nodes ?
We will be sending query requests via coordinator nodes. Now my question is, lets say since my data node has 64 cores, each node has thread pool size of 64 and 200 queue size. Lets assume that during search data node thread pool and queue size is completely exhausted then will the coordinator nodes keep accepting and buffering search requests at their end till their queue also fill up ? Or will 1 thread on coordinator will also be blocked per each query request ?
Say a search request come up to coordinator node it blocks 1 thread there and send request to data nodes which in turn blocks threads on data nodes as per where query data is lying. Is this assumption correct ?
While bulk indexing is going on ( assuming that we do not run indexing for all the clients in parallel and schedule them to be sequential) how to best design to make sure that query times do not take much hit during this bulk index.
References
https://thoughts.t37.net/designing-the-perfect-elasticsearch-cluster-the-almost-definitive-guide-e614eabc1a87
https://thoughts.t37.net/how-we-reindexed-36-billions-documents-in-5-days-within-the-same-elasticsearch-cluster-cd9c054d1db8
https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
We did try to index 20 million documents on our POC cluster and it took about 23-24 hours
That is surprisingly little — like less than 250 docs/s. I think my 8GB RAM laptop can insert 13 million docs in 2h. Either you have very complex documents, some bad settings, or your bottleneck is on the ingestion side.
About your nodes: I think you could easily get away with less memory on the master nodes (like 32GB should be plenty). Also the memory on data nodes is pretty high; I'd normally expect heap in relation to the rest of the memory to be 1:1 or for lots of "hot" data maybe 1:3. Not sure you'll get the most out of that 1:7.5 ratio.
CMS vs G1GC: If you have a current Elasticsearch and Java version, both are an option, otherwise CMS. You're generally trading throughput for (GC) latency, so if you benchmark be sure to have a long enough timeframe to properly hit GC phases and run as close to production queries in parallel as possible.
Is sending bulk indexing requests to coordinator node a good design choice or should we send it directly to data nodes ?
I'd say the coordinator is fine. Unless you use a custom routing key and the bulk only contains data for that specific data node, 5/6th of the documents would need to be forwarded to other data nodes anyway (if you have 6 data nodes). And you can offload the bulk processing and coordination handling to non data nodes.
However, overall it might make more sense to have 3 additional data nodes and skip the dedicated coordinating node. Though this is something you can only say for certain by benchmarking your specific scenario.
Now my question is, lets say since my data node has 64 cores, each node has thread pool size of 64 and 200 queue size. Lets assume that during search data node thread pool and queue size is completely exhausted then will the coordinator nodes keep accepting and buffering search requests at their end till their queue also fill up ? Or will 1 thread on coordinator will also be blocked per each query request ?
I'm not sure I understand the question. But have you looked into https://www.elastic.co/blog/why-am-i-seeing-bulk-rejections-in-my-elasticsearch-cluster, which might shed some more light on this topic?
While bulk indexing is going on ( assuming that we do not run indexing for all the clients in parallel and schedule them to be sequential) how to best design to make sure that query times do not take much hit during this bulk index.
While there are different queues for different query operations, there is otherwise no clear separation of tasks (like "only use 20% of the resources for indexing). Maybe go a little more conservative on the parallel bulk requests to avoid overloading the node.
If you are not reading from an index while it's being indexed (ideally you flip an alias once done): You might want to disable the refresh rate entirely and let Elasticsearch create segments as needed, but do a force refresh and change the setting once done. Also you could try running with 0 replicas while indexing, change replicas to 1 once done, and then wait for it to finish — though I'd benchmark if this is helping overall and if it's worth the added complexity.

Azure Table Increased Latency

I'm trying to create an app which can efficiently write data into Azure Table. In order to test storage performance, I created a simple console app, which sends hardcoded entities in a loop. Each entry is 0.1 kByte. Data is sent in batches (100 items in each batch, 10 kBytes each batch). For every batch, I prepare entries with the same partition key, which is generated by incrementing a global counter - so I never send more than one request to the same partition. Also, I control a degree of parallelism by increasing/decreasing the number of threads. Each thread sends batches synchronously (no request overlapping).
If I use 1 thread, I see 5 requests per second (5 batches, 500 entities). At that time Azure portal metrics shows table latency below 100ms - which is quite good.
If I increase the number of treads up to 12 I see x12 increase in outgoing requests. This rate stays stable for a few minutes. But then, for some reason I start being throttled - I see latency increase and requests amount drop.
Below you can see account metrics - highlighted point shows 2K31 transactions (batches) per minute. It is 3850 entries per second. If threads are increased up to 50, then latency increases up to 4 seconds, and transaction rate drops to 700 requests per second.
According to documentation, I should be able to send up to 20K transaction per second within one account (my test account is used only for my performance test). 20K batches mean 200K entries. So the question is why I'm being throttled after 3K entries?
Test details:
Azure Datacenter: West US 2.
My location: Los Angeles.
App is written in C#, uses CosmosDB.Table nuget with the following configuration: ServicePointManager.DefaultConnectionLimit = 250, Nagles Algorithm is disabled.
Host machine is quite powerful with 1Gb internet link (i7, 8 cores, no high CPU, no high memory is observed during the test).
PS: I've read docs
The system's ability to handle a sudden burst of traffic to a partition is limited by the scalability of a single partition server until the load balancing operation kicks-in and rebalances the partition key range.
and waited for 30 mins, but the situation didn't change.
EDIT
I got a comment that E2E Latency doesn't reflect server problem.
So below is a new graph which shows not only E2E latency but also the server's one. As you can see they are almost identical and that makes me think that the source of the problem is not on the client side.

Max connection pool size and autoscaling group

In Sequelize.js you should configure the max connection pool size (default 5). I don't know how to deal with this configuration as I work on an autoscaling platform in AWS.
The Aurora DB cluster on r3.2xlarge allows 2000 max connections per read replica (you can get that by running SELECT ##MAX_CONNECTIONS;).
The problem is I don't know what should be the right configuration for each server hosted on our EC2s. What should be the right max connection pool size as I don't know how many servers will be launched by the autoscaling group? Normally, the DB MAX_CONNECTIONS value should be divided by the number of connection pools (one by server), but I don't know how many server will be instantiated at the end.
Our concurrent users count is estimated to be between 50000 and 75000 concurrent users at our release date.
Did someone get previous experience with this kind of situation?
It has been 6 weeks since you asked, but since I got involved in this recently I thought I would share my experience.
The answer various based on how the application works and performs. Plus the characteristics of the application under load for the instance type.
1) You want your pool size to be > than the expected simultaneous queries running on your host.
2) You never want your a situation where number of clients * pool size approaches your max connection limit.
Remember though that simultaneous queries is generally less than simultaneous web requests since most code uses a connection to do a query and then releases it.
So you would need to model your application to understand the actual queries (and amount) that would happen for your 75K users. This is likely a lot LESS than 75K/second db queries a second.
You then can construct a script - we used jmeter - and run a test to simulate performance. One of the items we did during our test was to increase the pool higher and see the difference in performance. We actually used a large number (100) after doing a baseline and found the number made a difference. We then dropped it down until it start making a difference. In our case it was 15 and so I set it to 20.
This was against t2.micro as our app server. If I change the servers to something bigger, this value likely will go up.
Please note that you pay a cost on application startup when you set a higher number...and you also incur some overhead on your server to keep those idle connections so making larger than you need isn't good.
Hope this helps.

Performance issues when pushing data at a constant rate to Elasticsearch on multiple indexes at the same time

We are experiencing some performance issues or anomalies on a elasticsearch specifically on a system we are currently building.
The requirements:
We need to capture data for multiple of our customers, who will query and report on them on a near real time basis. All the documents received are the same format with the same properties and are in a flat structure (all fields are of primary type and no nested objects). We want to keep each customer’s information separate from each other.
Frequency of data received and queried:
We receive data for each customer at a fluctuating rate of 200 to 700 documents per second – with the peak being in the middle of the day.
Queries will be mostly aggregations over around 12 million documents per customer – histogram/percentiles to show patterns over time and the occasional raw document retrieval to find out what happened a particular point in time. We are aiming to serve 50 to 100 customer at varying rates of documents inserted – the smallest one could be 20 docs/sec to the largest one peaking at 1000 docs/sec for some minutes.
How are we storing the data:
Each customer has one index per day. For example, if we have 5 customers, there will be a total of 35 indexes for the whole week. The reason we break it per day is because it is mostly the latest two that get queried with occasionally the remaining others. We also do it that way so we can delete older indexes independently of customers (some may want to keep 7 days, some 14 days’ worth of data)
How we are inserting:
We are sending data in batches of 10 to 2000 – every second. One document is around 900bytes raw.
Environment
AWS C3-Large – 3 nodes
All indexes are created with 10 shards with 2 replica for the test purposes
Both Elasticsearch 1.3.2 and 1.4.1
What we have noticed:
If I push data to one index only, Response time starts at 80 to 100ms for each batch inserted when the rate of insert is around 100 documents per second. I ramp it up and I can reach 1600 before the rate of insert goes to close to 1sec per batch and when I increase it to close to 1700, it will hit a wall at some point because of concurrent insertions and the time will spiral to 4 or 5 seconds. Saying that, if I reduce the rate of inserts, Elasticsearch recovers nicely. CPU usage increases as rate increases.
If I push to 2 indexes concurrently, I can reach a total of 1100 and CPU goes up to 93% around 900 documents per second.
If I push to 3 indexes concurrently, I can reach a total of 150 and CPU goes up to 95 to 97%. I tried it many times. The interesting thing is that response time is around 109ms at the time. I can increase the load to 900 and response time will still be around 400 to 600 but CPU stays up.
Question:
Looking at our requirements and findings above, is the design convenient for what’s asked? Are there any tests that I can do to find out more? Is there any setting that I need to check (and change)?
I've been hosting thousands of Elasticsearch clusters on AWS over at https://bonsai.io for the last few years, and have had many a capacity planning conversation that sound like this.
First off, it sounds to me like you have a pretty good cluster design and test rig going here. My first intuition here is that you are legitimately approaching the limits of your c3.large instances, and will want to bump up to a c3.xlarge (or bigger) fairly soon.
An index per tenant per day could be reasonable, if you have relatively few tenants. You may consider an index per day for all tenants, using filters to focus your searches on specific tenants. And unless there are obvious cost savings to discarding old data, then filters should suffice to enforce data retention windows as well.
The primary benefit of segmenting your indices per tenant would be to move your tenants between different Elasticsearch clusters. This could help if you have some tenants with wildly larger usage than others. Or to reduce the potential for Elasticsearch's cluster state management to be a single point of failure for all tenants.
A few other things to keep in mind that may help explain the performance variance you're seeing.
Most importantly here, indexing is incredibly CPU bottlenecked. This makes sense, because Elasticsearch and Lucene are fundamentally just really fancy string parsers, and you're sending piles of strings. (Piles are a legitimate unit of measurement here, right?) Your primary bottleneck is going to be the number and speed of your CPU cores.
In order to take the best advantage of your CPU resources while indexing, you should consider the number of primary shards you're using. I'd recommend starting with three primary shards to distribute the CPU load evenly across the three nodes in your cluster.
For production, you'll almost certainly end up on larger servers. The goal is for your total CPU load for your peak indexing requirements ends up under 50%, so you have some additional overhead for processing your searches. Aggregations are also fairly CPU hungry. The extra performance overhead is also helpful for gracefully handling any other unforeseen circumstances.
You mention pushing to multiple indices concurrently. I would avoid concurrency when bulk updating into Elasticsearch, in favor of batch updating with the Bulk API. You can bulk load documents for multiple indices with the cluster-level /_bulk endpoint. Let Elasticsearch manage the concurrency internally without adding to the overhead of parsing more HTTP connections.
That's just a quick introduction to the subject of performance benchmarking. The Elasticsearch docs have a good article on Hardware which may also help you plan your cluster size.

Resources