Elasticsearch: QueueResizingEsThreadPoolExecutor exception

At some point during a 35k endurance load test of my Java web app, which fetches static data from Elasticsearch, I started getting the following Elasticsearch exception:
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable#1a25fe82 on QueueResizingEsThreadPoolExecutor[name = search4/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 10.7ms, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor#6312a0bb[Running, pool size = 25, active threads = 25, queued tasks = 1000, completed tasks = 34575035]]
Elasticsearch details:
Elasticsearch version 6.2.4.
The cluster consists of 5 nodes. The JVM heap size for each node is set to Xms16g and Xmx16g. Each node machine has 16 processors.
NOTE: When I first got this exception, I decided to increase the thread_pool.search.queue_size parameter in elasticsearch.yml, setting it to 10000. Yes, I understand that this only postpones the problem.
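For reference, the override looks roughly like this in elasticsearch.yml (6.x setting name):

thread_pool.search.queue_size: 10000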
Elasticsearch indices details:
Currently there are about 20 indices, and only 6 of them are in use. The unused ones are old indices that were not deleted after newer ones were created. The indices themselves are really small:
The index within the red rectangle is the one used by my web app. Its shard and replica settings are "number_of_shards": "5" and "number_of_replicas": "2" respectively.
Its shard details:
In this article I found that:
Small shards result in small segments, which increases overhead. Aim to keep the average shard size between at least a few GB and a few tens of GB. For use-cases with time-based data, it is common to see shards between 20GB and 40GB in size.
As you can see from the screenshot above, my shard size is much smaller than the mentioned size.
Hence, Q: what is the right number of shards in my case? Is it 1 or 2?
The index won't grow much over time.
ES queries issued during the test: the load test simulates a scenario where a user navigates to a page for searching products. The user can filter the products using corresponding filters (e.g. name, city, etc.). The unique filter values are fetched from the ES index using a composite aggregation; this is the first query type. The other query fetches the products themselves from ES and consists of must, must_not, filter and has_child clauses, with the size attribute equal to 100.
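For illustration, the first query type is roughly of the following shape (the index and field names here are made up, not my real mapping):

GET /products/_search
{
  "size": 0,
  "aggs": {
    "unique_cities": {
      "composite": {
        "size": 100,
        "sources": [
          { "city": { "terms": { "field": "city.keyword" } } }
        ]
      }
    }
  }
}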
I enabled slow search logging, but nothing was logged:
"index": {
"search": {
"slowlog": {
"level": "info",
"threshold": {
"fetch": {
"debug": "500ms",
"info": "500ms"
},
"query": {
"debug": "2s",
"info": "1s"
}
}
}
}
I feel like I am missing something simple that would finally let the cluster handle my load. I'd appreciate it if anyone could help me solve this issue.

For such a small index you are using 5 primary shards, which I suspect is simply the default of your ES version 6.x (the default was 5) that you never changed. In short, having a high number of primary shards for a small index carries a severe performance penalty; please refer to a very similar use-case (I also had 5 primary shards 😀) which I covered in my blog.
As you already mentioned that your index size will not grow significantly in the future, I would suggest having 1 primary shard and 4 replica shards.
1 primary shard means that for a single search only one thread and one request will be created in Elasticsearch, which provides better utilisation of resources.
As you have 5 data nodes, having 4 replicas means the shards are evenly distributed across the data nodes, so your throughput and performance will be optimal.
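Since the number of primary shards cannot be changed on an existing index, this means creating a new index and reindexing into it; a rough sketch, with placeholder index names:

PUT /products_v2
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 4
  }
}

POST /_reindex
{
  "source": { "index": "products_v1" },
  "dest": { "index": "products_v2" }
}

(The _shrink API is an alternative, but for an index this small a plain reindex followed by an alias switch is usually simpler.)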
After this change, measure the performance; I am sure you will then be able to reduce the search queue size back to 1k, since, as you know, a high queue size just delays the problem rather than addressing it.
Coming to your search slowlog, I feel your thresholds are very high; 1 second for the query phase is really high for a user-facing application. Try lowering it to ~100ms, note down the slow queries, and optimize them further.
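These thresholds are dynamic index settings, so lowering them is a single call; something along these lines (index name is a placeholder):

PUT /products_v2/_settings
{
  "index.search.slowlog.threshold.query.info": "100ms",
  "index.search.slowlog.threshold.fetch.info": "100ms"
}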

Related

Elastic search, experiencing very low search speed

We have a cluster consisting of 3 master nodes (4 cores, 16 GB RAM each), 3 hot nodes (8 cores, 32 GB RAM, 300 GB SSD each), and 3 warm nodes (8 cores, 32 GB RAM, 1.5 TB HDD each).
We have one index for each month of the year, following the naming convention voucher_YYYY_MMM (e.g. voucher_2021_JAN). All these indexes have an alias voucher, which acts as a read alias, and our search queries are directed at this read alias.
Our index resides on the hot nodes for 32 days, which is the period during which it receives 99% of its writes. We estimate approximately 480 million docs in this index; it has 1 replica and 16 shards (we chose 16 shards because our data will eventually grow; right now we are thinking of shrinking down to 8 shards, each holding 30 GB of data, since as per our mapping 2 million docs take about 1 GB of space).
After 32 days the index moves to the warm nodes. Currently we have 450 million documents in our hot index and 1.8 billion documents collectively in our warm indexes, which comes to about 2.25 billion docs in total.
Our docs contain a customer id and some fields on which we apply filters; they are all mapped as keyword types, and we use custom routing on the customer id to improve search speed.
Our typical query looks like:
GET voucher/_search?routing=1000636779&search_type=query_then_fetch
{
  "from": 0,
  "size": 20,
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "filter": [
            {
              "term": {
                "uId": {
                  "value": "1000636779",
                  "boost": 1
                }
              }
            },
            {
              "terms": {
                "isGift": [
                  "false"
                ]
              }
            }
          ]
        }
      }
    }
  },
  "version": true,
  "sort": [
    {
      "cdInf.crtdAt": {
        "order": "desc"
      }
    }
  ]
}
We are using a constant score query because we don't want to score our documents and want to increase search speed.
We have 13 search threads on each of our hot and warm nodes and we are sending requests to our master node for indexing and searching.
We are sending 100 search requests per second and getting an average search response time of about 3.5 seconds, with the max going up to 9 seconds.
I don't understand what we are missing and why our search performance is so poor.
Thank you for the exhaustive explanations. Based on them here are a few points of improvement (in no particular order):
Never direct your search and index requests to the master nodes, they should never handle traffic. Send them to the data nodes directly, or better yet, to dedicated coordinating nodes.
As a direct consequence, the master(-eligible) nodes don't need 16GB of RAM, 2GB is more than sufficient, because they will not act as coordinating nodes anymore.
In case you have time ranges in your queries, you could leverage index sorting on the cdInf.crtdAt field: faster searches at the cost of slower ingestion. It only makes sense if your queries have a time constraint, though (a sketch follows at the end of these points).
16 shards per index on 3 hot nodes is not a good sharding strategy, you should have a multiple of the number of nodes (3, 6, 9, etc) otherwise one of the nodes will have more shards, and hence, you might create hot spots. You can also add one more hot node, so each one has 4 shards. It's a typical example of oversharding. Since your indexes are rolled over each month, it's easy to just modify the number of primary shards in the index template as you see data growing.
It's a good idea to leverage routing in order to search fewer shards. It's not clear from the question how many indexes in total you have behind the voucher alias, but that would also be good information to have in order to assess whether the sharding and the size of the search thread pool are appropriate. Based on the doc counts you provide, it seems you have 1 hot index and 5 warm ones, so 6 indexes in total. So each search request with routing will search only 6 shards.
100 search requests per second and 13 search threads per node (the default for 8 cores) means that each second each node has to handle 7+ search requests, and since requests take approximately 3 seconds to return, you're building up a search queue, because the nodes might not be able to keep up.
Another feature to leverage in order to benefit from filter caching is the preference query string parameter.
Also part of the slowness comes from the fact that 80% of the data you're searching on is located on warm nodes with spinning disks, so depending on your use case, you might want to maybe split your search in two, i.e. one super fast search on the hot data and another slower search on the warm data.
Once your indexes get reallocated to the warm nodes (and if they don't get updated anymore), it might be a good idea to force merge them down to a few segments (3 to 5) so that your searches have fewer segments to browse, and also to decrease their size (i.e. remove deleted documents).
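For example (index name taken from the question's naming convention; adjust to your own):

POST /voucher_2021_JAN/_forcemerge?max_num_segments=5

GET /_cat/segments/voucher_2021_JAN?v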
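Regarding the index sorting point above: it has to be set at index creation time, so it would go into the template used for the monthly indices. A minimal sketch, assuming 7.x mapping syntax and omitting the rest of the mapping (the sort field needs doc values, which date fields have by default):

PUT /voucher_2021_FEB
{
  "settings": {
    "index.sort.field": "cdInf.crtdAt",
    "index.sort.order": "desc"
  },
  "mappings": {
    "properties": {
      "cdInf": {
        "properties": {
          "crtdAt": { "type": "date" }
        }
      }
    }
  }
}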

mongodb connection count not scale balanced between nodes in replicas set

I am now building a web application using MongoDB + Spring Data MongoDB.
This application has an API which is a simple MongoDB query:
db.myCollectionName.aggregate([{ "$sample" : { "size" : 100}}, { "$project" : { "myFieldName" : 1, "_id" : 0}}])
We randomly pick up 100 documents from myCollectionName and project myFieldName. It's a simple read request.
The total collection is about 100M records and AFAIK I use the $sample operator properly:
$sample is the first stage of the pipeline
N is less than 5% of the total documents in the collection
The collection contains more than 100 documents
Our MongoDB cluster is a 3-node replica set: one node handles write requests, and the other two handle read requests.
I am testing my application using JMeter, trying to figure out how many TPS the application can support at most.
The report shows that when the concurrency is 100, the TPS of the application is at its best, reaching 1000, with an average 95th-percentile latency of 180ms. The workload is evenly distributed across the two secondary nodes: the number of operations is about 500 and the connection count is about 180 on each secondary node.
Meanwhile, on the application side, the workload is fairly low, with only 30% of CPU and memory used.
Next, I added one more secondary node to see if TPS would increase linearly, and now strange things happen:
After the concurrency exceeds 100, the TPS of the system does not increase linearly. TPS stays at 1000, with each secondary node handling only 400 operations (the same operationCount). As the concurrency goes up, the application's average 95th-percentile latency also begins to degrade, from 180ms to ~450ms.
In the meantime, I notice that the connection count is not evenly distributed across the secondary nodes. One secondary node has established over 300 connections, while the other two have only 180. Moreover, as the concurrency exceeds 150, slow queries start to appear on the node with more connections.
I can confirm that the workload on the application side is fairly low; even with 3 secondary nodes the application still does not reach its load limit. The max connection count configured on the MongoDB client side is 2000, which as far as I know should be enough.
Review https://github.com/mongodb/specifications/blob/master/source/server-selection/server-selection.rst under "Server Selection Algorithm".
First, check the latencies as seen by the driver for each of the servers. Increase the window using the localThresholdMS URI option if needed.
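localThresholdMS is a connection string option; a sketch of where it goes (hosts, database, replica set name and read preference below are placeholders for whatever you actually use; the default window is 15ms):

mongodb://host1:27017,host2:27017,host3:27017/mydb?replicaSet=rs0&readPreference=secondaryPreferred&localThresholdMS=30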
Then, you either have incorrectly collected the data or you have a server problem, because
Moreover, as the number of concurrency exceeds 150, slow queries start to appear on nodes with more connections.
If you have a node that is able to sustain 400 connections but with the same workload is not able to sustain >150 <400 connections, that's a server problem that you need to investigate and repair.
After you've done that, note this item in the server selection algorithm:
Of the two randomly chosen servers, select the one with the lower operationCount. If both servers have the same operationCount, select arbitrarily between the two of them.
This is supposed to route queries to less loaded nodes. This naturally doesn't work well if either the nodes are reporting their load incorrectly, or perhaps you are mistaken as to what load you are reading from where.

Elasticsearch: High CPU usage by Lucene Merge Thread

I have an ES 2.4.1 cluster with 3 master and 18 data nodes which collects log data, with a new index being created every day. Over a day, an index grows to about 2TB. Indexes older than 7 days get deleted. Very few searches are performed on the cluster, so the main goal is to increase indexing throughput.
I see a lot of the following exceptions, which is another symptom of what I am going to describe next:
EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4#5a7d8a24 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#5f9ef44f[Running, pool size = 8, active threads = 8, queued tasks = 50, completed tasks = 68888704]]];]];
The nodes in the cluster are constantly pegging the CPU. I increased the index refresh interval to 30s, but that had little effect. When I check hot threads I see multiple "Lucene Merge Thread" threads per node using 100% CPU. I also noticed that the segment count is constantly around 1000 per shard, which seems like a lot. The following is an example of a segment stat:
"_2zo5": {
"generation": 139541,
"num_docs": 5206661,
"deleted_docs": 123023,
"size_in_bytes": 5423948035,
"memory_in_bytes": 7393758,
"committed": true,
"search": true,
"version": "5.5.2",
"compound": false
}
The extremely high "generation" number worries me, and I'd like to optimize segment creation and merging to reduce CPU load on the nodes.
Details about indexing and cluster configuration:
Each node is an i2.2xl AWS instance with 8 CPU cores and 1.6T SSD drives
Documents are indexed constantly by 6 client threads with bulk size 1000
Each index has 30 shards with 1 replica
It takes about 25 sec per batch of 1000 documents
/_cat/thread_pool?h=bulk*&v shows that bulk.completed are equally spread out across nodes
Index buffer size and transaction durability are left at default
_all is disabled, but dynamic mappings are enabled
The number of merge threads is left at default, which should be OK given that I am using SSDs
What's the best way to go about it?
Thanks!
Here are the optimizations I made to the cluster to increase indexing throughput:
Increased threadpool.bulk.queue_size to 500 because index requests were frequently overloading the queues
Increased disk watermarks, because default settings were too aggressive for the large SSDs that we were using. I set "cluster.routing.allocation.disk.watermark.low": "100gb" and "cluster.routing.allocation.disk.watermark.high": "10gb"
Deleted unused indexes to free up resources ES uses to manage their shards
Increased the number of primary shards to 175 with the goal of keeping shard size under 50GB and having approximately one shard per processor
Set client index batch size to 10MB, which seemed to work very well for us because the size of documents indexed varied drastically (from KBs to MBs)
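In case it helps to replicate the first two changes, they translate to roughly the following (the queue size can go in elasticsearch.yml with the 2.x setting name, the watermarks can be applied dynamically; localhost is a placeholder):

threadpool.bulk.queue_size: 500

curl -XPUT 'localhost:9200/_cluster/settings' -d '
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "100gb",
    "cluster.routing.allocation.disk.watermark.high": "10gb"
  }
}'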
Hope this helps others
I have run similar workloads and your best bet is to run hourly indices and run optimize on older indices to keep segments in check.

Is there a limit on the number of indexes that can be created on Elastic Search?

I'm using AWS-provided Elastic Search.
I have a signup page on my website, and on each signup a new index gets created for the new user (to be used later by their work-group), which means that the number of indexes is continuously growing (it has now reached around 4-5k).
My question is: is there a performance limit on the number of indexes? is it safe (performance-wise) to keep creating new indexes dynamically with each new user?
Note: I haven't used AWS Elasticsearch, so this answer may vary, because they have started using Open Distro for Elasticsearch and have forked the main branch, but a lot of the principles should be the same. Also, this question doesn't have a definitive answer; it depends on various factors, but I hope this answer helps the thought process.
One of the factors is the number of shards and replicas per index, as that contributes to the total number of shards per node. Each shard consumes some memory, so you will have to keep the number of shards per node limited so that you stay within the maximum recommended 30GB heap space. As per this comment, 600 to 1000 shards per node should be reasonable, and you can scale your cluster according to that.
Also, you have to monitor the number of file descriptors and make sure that doesn't create any bottleneck for nodes to operate.
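If the cat APIs are exposed on your AWS domain, a quick way to keep an eye on heap, file descriptors and per-node shard counts is something like:

GET _cat/nodes?v&h=name,heap.percent,file_desc.current,file_desc.max
GET _cat/shards?v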
HTH!
If I'm not mistaken, the only limit is the disk space of your server, but if your index is growing too fast you should think about having more replica servers. I recommend reading this page: Indexing Performance Tips
Indexes themselves have no limit, however shards do: the recommended amount is 20 shards per GB of JVM heap (which you can check on the Kibana Stack Monitoring tab). This means that if you have 5GB of JVM heap, the recommended amount is 100 shards.
Remember that 1 index can have from 1 to x shards (1 primary and x replicas); normally people have 1 primary and 1 replica. If this is your case, then you would be able to create 50 indexes with those 5GB of heap.

How to prevent Elasticsearch from index throttling?

I have a 40-node Elasticsearch cluster which is hammered by a high index request rate. Each of these nodes uses an SSD for best performance. As suggested by several sources, I have tried to prevent index throttling with the following configuration:
indices.store.throttle.type: none
Unfortunately, I'm still seeing performance issues as the cluster still periodically throttles indices. This is confirmed by the following logs:
[2015-03-13 00:03:12,803][INFO ][index.engine.internal ] [CO3SCH010160941] [siphonaudit_20150313][19] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
[2015-03-13 00:03:12,829][INFO ][index.engine.internal ] [CO3SCH010160941] [siphonaudit_20150313][19] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
[2015-03-13 00:03:13,804][INFO ][index.engine.internal ] [CO3SCH010160941] [siphonaudit_20150313][19] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
[2015-03-13 00:03:13,818][INFO ][index.engine.internal ] [CO3SCH010160941] [siphonaudit_20150313][19] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
[2015-03-13 00:05:00,791][INFO ][index.engine.internal ] [CO3SCH010160941] [siphon_20150313][6] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
[2015-03-13 00:05:00,808][INFO ][index.engine.internal ] [CO3SCH010160941] [siphon_20150313][6] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
[2015-03-13 00:06:00,861][INFO ][index.engine.internal ] [CO3SCH010160941] [siphon_20150313][6] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
[2015-03-13 00:06:00,879][INFO ][index.engine.internal ] [CO3SCH010160941] [siphon_20150313][6] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
The throttling occurs after one of the 40 nodes dies for various expected reasons. The cluster immediately enters a yellow state, in which a number of shards will begin initializing on the remaining nodes.
Any idea why the cluster continues to throttle after explicitly configuring it not to? Any other suggestions to have the cluster more quickly return to a green state after a node failure?
The setting that actually corresponds to the maxNumMerges in the log file is called index.merge.scheduler.max_merge_count. Increasing this along with index.merge.scheduler.max_thread_count (where max_thread_count <= max_merge_count) will increase the number of simultaneous merges which are allowed for segments within an individual index's shards.
If you have a very high indexing rate that results in many GBs in a single index, you probably want to raise some of the other assumptions that the Elasticsearch default settings make about segment size, too. Try raising the floor_segment - the minimum size before a segment will be considered for merging, the max_merged_segment - the maximum size of a single segment, and the segments_per_tier -- the number of segments of roughly equivalent size before they start getting merged into a new tier. On an application that has a high indexing rate and finished index sizes of roughly 120GB with 10 shards per index, we use the following settings:
curl -XPUT /index_name/_settings -d '
{
  "settings": {
    "index.merge.policy.max_merge_at_once": 10,
    "index.merge.scheduler.max_thread_count": 10,
    "index.merge.scheduler.max_merge_count": 10,
    "index.merge.policy.floor_segment": "100mb",
    "index.merge.policy.segments_per_tier": 25,
    "index.merge.policy.max_merged_segment": "10gb"
  }
}'
Also, one important thing you can do to improve loss-of-node / node-restart recovery time on applications with high indexing rates is to take advantage of index recovery prioritization (in ES >= 1.7). Tune this setting so that the indices that receive the most indexing activity are recovered first. As you may know, the "normal" shard initialization process just copies the already-indexed segment files between nodes. However, if indexing activity is occurring against a shard before or during initialization, the translog with the new documents can become very large. In the scenario where merging goes through the roof during recovery, it's the replay of this translog against the shard that is almost always the culprit. Thus, by using index recovery prioritization to recover those shards first and delaying shards with less indexing activity, you can minimize the eventual size of the translog, which will dramatically improve recovery time.
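Recovery prioritization is just a dynamic per-index setting; a sketch using one of the index names from your logs (host is a placeholder; higher values are recovered first):

curl -XPUT 'localhost:9200/siphon_20150313/_settings' -d '
{
  "index.priority": 10
}'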
We are using 1.7 and noticed a similar problem: indexing was getting throttled even when the IO was not saturated (Fusion-io in our case).
After increasing "index.merge.scheduler.max_thread_count" the problem seems to be gone; we have not seen any more throttling being logged so far.
I would try setting "index.merge.scheduler.max_thread_count" to at least the max reported numMergesInFlight (6 in the logs above).
https://www.elastic.co/guide/en/elasticsearch/reference/1.7/index-modules-merge.html#scheduling
Hope this helps!
Have you looked into increasing the shard allocation delay to give the node time to recover before the master starts promoting replicas?
https://www.elastic.co/guide/en/elasticsearch/reference/current/delayed-allocation.html
try setting index.merge.scheduler.max_thread_count to 1
https://www.elastic.co/blog/performance-considerations-elasticsearch-indexing
