Elasticsearch Uneven Write Queue Distribution - elasticsearch

I have a cluster with an even distribution of data, routed on the document _id, which is a random string. During normal operations, searching and writing to the cluster are evenly distributed. However, when bulk updating documents in the cluster for several minutes, only 1-2 nodes appear to be doing any work.
Here is what a bulk update operation looks like after several minutes of running:
q qs node_id
0 200 Wd5JFj4gRk-9pKL_Jubd3w
0 200 FQ86BI1ASUS0tu-XQMuk6w
0 200 dMeO029LSiqjwicm3YP8JA
0 200 b8zAduWdRyO7P9Lz7hSFBQ
0 200 K0o4v_mHRqSRNZWJpzvJPQ
224 200 HN1yQG_hRF2eiCyy_0Dpcg
0 200 GXsc0FKsSUemue-e1Cuzsg
0 200 LcDaZoipQA63UOg0_WHguA
0 200 PdKFe7nLRaCnEqECNLpFvg
0 200 glani3PYQ4qppwzvLQnjIQ
0 200 T9jqycccQ-a03YtUCGVy0w
As you can see, the HN1y node becomes very active while the other nodes go quiet. The total update throughput drops dramatically, and the only way to resolve it is to pause the bulk update operation, wait a minute, and resume. At that point distribution is even again until, eventually, one node once more appears to be doing all of the work.
How can a cluster get into a situation like this? Does this suggest there really is an uneven distribution, or is something else going on?
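For reference, the q/qs numbers above come from the thread-pool cat API; while the bulk update runs I can also check which shards the busy node actually holds. A rough sketch of those checks (the host and index name are placeholders, and the pool is called write on recent versions, bulk on older ones):
# Per-node write/bulk queue depth while the bulk update runs:
curl 'localhost:9200/_cat/thread_pool/write?v&h=node_name,node_id,active,q,qs,rejected'
# Which shards each node hosts, to see whether the busy node holds a
# disproportionate share of the index ("my_index" is a placeholder):
curl 'localhost:9200/_cat/shards/my_index?v'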

Related

Scaling up Elasticsearch

We have a problem with scaling up Elasticsearch.
We have a simple structure with multiple indexes; each index has 2 shards, 1 primary and 1 replica.
Right now we have ~3000 indexes, which means 6000 shards, and we think we will hit shard limits soon. (We are currently running 2 nodes with 32 GB of RAM and 4 cores, at a top usage of 65%; we are working on moving to 3 nodes so each index has 1 primary shard and 2 replicas.)
Refresh interval is set to 60s.
Some indexes have 200 documents, others have 10 million; most have fewer than 200k.
The total number of documents is about 40 million (and it can grow quickly).
Our search requests hit multiple indexes at the same time (we might search 50 indexes or 500, and in the future we may need to search all of them).
Searching needs to be fast.
Currently we synchronize all documents daily via bulk requests, in chunks of 5000 documents (~7 MB), using 10 async workers, because our tests showed that works best: ~2.3 seconds per request of 5000 documents (a sketch of one such request is at the end of this post).
Sometimes several workers hit the same index at the same time and the bulk requests take longer, stretching to ~12.5 seconds per request.
The current synchronization run takes about an hour for the 40 million documents.
Documents are stored by UUID (we use the UUIDs to fetch documents directly from Elasticsearch). Document values can change daily; sometimes we only change a synchronization_hash field, which determines which documents have changed, and after the synchronization run we delete every document that still has the old synchronization_hash.
The other thing is that we think our data architecture is broken: we have ~300 clients (the number can grow), and each client is only allowed to search a set of 50 to 500 indexes. Indexes can be shared between clients (client x has 50 indexes, client y has 70, and x and y often need access to the same documents), which is why we store the data in separate indexes, so we don't have to update every index where a given document is stored.
To increase indexing speed we are considering moving to 4 nodes with each index having 2 primaries and 2 replicas, or to 4 nodes with each index having only 2 shards (1 primary, 1 replica), but we need to test things out to figure out what would work best for us. We may need to double the number of documents in the next few months.
What do you think can be changed to increase indexing speed without reducing search speed?
What can be changed in our data architecture?
Is there any other way our data should be organized to allow fast searching and faster indexing?
I have tried many chunk sizes for synchronization, but I haven't tried changing the architecture.
We are trying to achieve increased indexing speed without reducing search speed.
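To make the synchronization concrete, a single bulk chunk is roughly the request below. Real chunks contain 5000 documents; the index name, UUIDs and hash values here are only illustrative:
# ndjson body: one action line plus one source line per document, ending with a
# trailing newline (pre-7.x clusters also need a _type in the action line):
cat > chunk.ndjson <<'EOF'
{"index":{"_index":"client_index_01","_id":"9b2f6e3a-0000-4c1d-8f7a-000000000001"}}
{"title":"doc one","synchronization_hash":"sync-2024-05-01"}
{"index":{"_index":"client_index_01","_id":"4c81d0f7-0000-4c1d-8f7a-000000000002"}}
{"title":"doc two","synchronization_hash":"sync-2024-05-01"}
EOF
curl -s -H 'Content-Type: application/x-ndjson' -XPOST 'localhost:9200/_bulk' --data-binary @chunk.ndjson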

Elasticsearch bulk write requests end up on the same node of the cluster, causing the cluster to reject writes

I have an Elasticsearch 2.1.1 cluster with 11 nodes.
Most of the time everything goes well, but when the load increases (writing data from multiple Storm topologies), all the write requests seem to go to the same node, which ends up overworked with its queue over the limit while the other nodes just sit there doing nothing:
node bulk.active bulk.queue
1 0 0
2 0 0
3 32 114
4 0 0
...and so on
And after a while the cluster starts to reject the write requests:
nested: EsRejectedExecutionException[rejected execution
of org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1#7e368c0b on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor#13ff1612[Running, pool size = 32, active threads = 32, queued tasks =
55, completed tasks = 249363622]
After the load passes it recovers, but the same thing happens each time the load increases.
Has anyone encountered something like this? What might be the cause?
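For reference, the bulk.active/bulk.queue numbers above come from the thread-pool cat API, roughly like this (the host is a placeholder; bulk.rejected is included to also track rejections):
# Per-node bulk thread pool state on the 2.1.1 cluster:
curl 'es-host:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'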

Something inside Elasticsearch 7.4 cluster is getting slower and slower with read timeouts now and then

Over the past few days our ES 7.4 cluster (4 nodes) has regularly been giving read timeouts, and certain management commands are getting slower and slower. Before that it had been running for more than a year without any trouble. For instance, /_cat/nodes took 2 minutes to execute yesterday; today it is already taking 4 minutes. Server loads are low and memory usage seems fine, so I'm not sure where to look further.
Using the opster.com online tool I got a hint that the management queue size is high; however, when executing the suggested commands to investigate, I don't see anything out of the ordinary other than that the command itself takes a long time to return a result:
$ curl "http://127.0.0.1:9201/_cat/thread_pool/management?v&h=id,active,rejected,completed,node_id"
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 345 100 345 0 0 2 0 0:02:52 0:02:47 0:00:05 90
id active rejected completed node_id
JZHgYyCKRyiMESiaGlkITA 1 0 4424211 elastic7-1
jllZ8mmTRQmsh8Sxm8eDYg 1 0 4626296 elastic7-4
cI-cn4V3RP65qvE3ZR8MXQ 5 0 4666917 elastic7-2
TJJ_eHLIRk6qKq_qRWmd3w 1 0 4592766 elastic7-3
How can I debug this / solve this? Thanks in advance.
If you notice, your elastic7-2 node has 5 active threads in the management thread pool, which is really high: the management pool has a maximum size of just 5 threads, so it is saturated, and it is used only for a small number of operations (management APIs, not search/index).
You can have a look at the thread pools documentation in Elasticsearch for further reading.
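One way to see what those management threads are actually busy with is the node hot threads API, for example against the saturated node from your output (port taken from your curl above):
# Dump the busiest threads on elastic7-2:
curl 'http://127.0.0.1:9201/_nodes/elastic7-2/hot_threads?threads=5'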

Elasticsearch triggers refresh_mappings when deleting indices

We're dealing with a huge number of shards (70k+), which makes our ES cluster (v1.6.0, 1 replica, 5 shards per index) not so reliable. We're in the process of deleting indices, but we're noticing a spike of refresh-mapping tasks after each individual delete (if it matters, these deletes are performed via the REST API). This is a problem because subsequent DELETE requests get interleaved with the refresh-mapping tasks and eventually time out.
For example, here's the output of _cat/pending_tasks when deleting an index:
3733244 1m URGENT delete-index [test.class_jump.2015-07-16]
3733245 210ms HIGH refresh-mapping [business.bear_case1validation.2015-09-19][[bear_case1Validation]]
3733246 183ms HIGH refresh-mapping [business.bear_case1validation.2015-09-15][[bear_case1Validation]]
3733247 156ms HIGH refresh-mapping [search.cube_scan.2015-09-24][[cube_scan]]
3733248 143ms HIGH refresh-mapping [business.bear_case1validation.2015-09-17][[bear_case1Validation]]
3733249 117ms HIGH refresh-mapping [business.bear_case1validation.2015-09-22][[bear_case1Validation]]
3733250 85ms HIGH refresh-mapping [search.santino.2015-09-25][[santino]]
3733251 27ms HIGH refresh-mapping [search.santino.2015-09-25][[santino]]
3733252 9ms HIGH refresh-mapping [business.output_request_finalized.2015-09-22][[output_request_finalized]]
3733253 2ms HIGH refresh-mapping [business.bear_case1validation.2015-08-19][[bear_case1Validation]]
There are two things we don't understand:
Why are refresh-mapping tasks being triggered at all? Maybe they are always triggered and are only visible now because they are queued behind the URGENT task. Is that the case?
Why are they triggered on "old" indices which do not change anymore? (The indices being refreshed are one to two weeks old; the one being deleted is two weeks old as well.)
Could this be caused by load rebalancing between nodes? It seems odd, but nothing else comes to mind. Moreover, there seem to be only a few documents in the index (see below), so load rebalancing looks like an extreme long shot.
_cat/shards for test.class_jump.2015-07-16
index state docs store
test.class_jump.2015-07-16 2 r STARTED 0 144b 192.168.9.240 st-12
test.class_jump.2015-07-16 2 p STARTED 0 108b 192.168.9.252 st-16
test.class_jump.2015-07-16 0 p STARTED 0 144b 192.168.9.237 st-10
test.class_jump.2015-07-16 0 r STARTED 0 108b 192.168.7.49 st-01
test.class_jump.2015-07-16 3 p STARTED 1 15.5kb 192.168.7.51 st-03
test.class_jump.2015-07-16 3 r STARTED 1 15.5kb 192.168.10.11 st-18
test.class_jump.2015-07-16 1 r STARTED 0 144b 192.168.9.107 st-08
test.class_jump.2015-07-16 1 p STARTED 0 144b 192.168.7.48 st-00
test.class_jump.2015-07-16 4 r STARTED 1 15.6kb 192.168.10.65 st-19
test.class_jump.2015-07-16 4 p STARTED 1 15.6kb 192.168.9.106 st-07
Is there any way in which these can be suppressed? And, more importantly, is there any way to speed up index deletion?
It looks like you're experiencing the same problem as reported in issue #10318: it is due to the cluster trying to keep mappings in sync between the master and data nodes. The comparison runs on a serialized version of the mappings, and the fielddata part is a Java Map that gets serialized.
Since Maps don't guarantee any ordering, the serialization yields syntactically different mappings every time, so ES thinks the mappings differ between the master and data nodes and keeps trying to refresh mappings all over the place.
Until you migrate to 2.0, it seems that the "fix" is to set indices.cluster.send_refresh_mapping: false in elasticsearch.yml on all your nodes and restart them.
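Concretely, applying the workaround looks something like the following; the config path below assumes the standard package layout and may differ on your install:
# On every node: add the setting to elasticsearch.yml, then restart that node.
echo 'indices.cluster.send_refresh_mapping: false' >> /etc/elasticsearch/elasticsearch.yml
# After the rolling restart, deleting an index should no longer flood the
# pending-tasks queue with refresh-mapping entries:
curl 'localhost:9200/_cat/pending_tasks?v'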

Storm topology performance hit when acking

I'm using this tool from Yahoo to run some performance tests on my Storm cluster -
https://github.com/yahoo/storm-perf-test
I notice an almost 10x performance hit when I turn acking on. Here are some details to reproduce the test -
Cluster -
3 supervisor nodes and 1 nimbus node. Each node is a c3.large.
With acking -
bin/storm jar storm_perf_test-1.0.0-SNAPSHOT-jar-with-dependencies.jar com.yahoo.storm.perftest.Main --ack --boltParallel 60 --maxSpoutPending 100 --messageSizeByte 100 --name some-topo --numWorkers 9 --spoutParallel 20 --testTimeSec 100 --pollFreqSec 20 --numLevels 2
status topologies totalSlots slotsUsed totalExecutors executorsWithMetrics time time-diff ms transferred throughput (MB/s)
WAITING 1 3 0 141 0 1424707134585 0 0 0.0
WAITING 1 3 3 141 141 1424707154585 20000 24660 0.11758804321289062
WAITING 1 3 3 141 141 1424707174585 20000 17320 0.08258819580078125
RUNNING 1 3 3 141 141 1424707194585 20000 13880 0.06618499755859375
RUNNING 1 3 3 141 141 1424707214585 20000 21720 0.10356903076171875
RUNNING 1 3 3 141 141 1424707234585 20000 43220 0.20608901977539062
RUNNING 1 3 3 141 141 1424707254585 20000 35520 0.16937255859375
RUNNING 1 3 3 141 141 1424707274585 20000 33820 0.16126632690429688
Without acking -
bin/storm jar ~/target/storm_perf_test-1.0.0-SNAPSHOT-jar-with-dependencies.jar com.yahoo.storm.perftest.Main --boltParallel 60 --maxSpoutPending 100 --messageSizeByte 100 --name some-topo --numWorkers 9 --spoutParallel 20 --testTimeSec 100 --pollFreqSec 20 --numLevels 2
status topologies totalSlots slotsUsed totalExecutors executorsWithMetrics time time-diff ms transferred throughput (MB/s)
WAITING 1 3 0 140 0 1424707374386 0 0 0.0
WAITING 1 3 3 140 140 1424707394386 20000 565460 2.6963233947753906
WAITING 1 3 3 140 140 1424707414386 20000 1530680 7.298851013183594
RUNNING 1 3 3 140 140 1424707434386 20000 3280760 15.643882751464844
RUNNING 1 3 3 140 140 1424707454386 20000 3308000 15.773773193359375
RUNNING 1 3 3 140 140 1424707474386 20000 4367260 20.824718475341797
RUNNING 1 3 3 140 140 1424707494386 20000 4489000 21.40522003173828
RUNNING 1 3 3 140 140 1424707514386 20000 5058960 24.123001098632812
The last 2 columns are the really important ones: they show the number of tuples transferred and the throughput in MB/s.
Is this kind of performance hit expected with Storm when acking is turned on? I'm using version 0.9.3 and no advanced networking.
There is always going to be a certain degree of performance degradation with acking enabled -- it's the price you pay for reliability. Throughput will ALWAYS be higher with acking disabled, but you have no guarantee if your data is processed or dropped on the floor. Whether that's a 10x hit like you're seeing, or significantly less, is a matter of tuning.
One important setting is topology.max.spout.pending, which allows you to throttle spouts so that only that many tuples are allowed "in flight" at any given time. That setting is useful for making sure downstream bolts don't get overwhelmed and start timing out tuples.
That setting also has no effect with acking disabled -- it's like opening the flood gates and dropping any data that overflows. So again, it will always be faster.
With acking enabled, Storm will make sure everything gets processed at least once, but you need to tune topology.max.spout.pending appropriately for your use case. Since every use case is different, this is a matter of trial and error. Set it too low, and you will have low throughput. Set it too high and your downstream bolts will get overwhelmed, tuples will time out, and you will get replays.
To illustrate, set maxSpoutPending to 1 and run the benchmark again. Then try 1000.
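For example, reusing the invocation from the question with only the --maxSpoutPending value changed:
# Same acking benchmark as above, but with the spout throttled to a single in-flight tuple:
bin/storm jar storm_perf_test-1.0.0-SNAPSHOT-jar-with-dependencies.jar com.yahoo.storm.perftest.Main --ack --boltParallel 60 --maxSpoutPending 1 --messageSizeByte 100 --name some-topo --numWorkers 9 --spoutParallel 20 --testTimeSec 100 --pollFreqSec 20 --numLevels 2
# ...then run it again with --maxSpoutPending 1000 and compare the transferred/throughput columns.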
So yes, a 10x performance hit is possible without proper tuning. If data loss is okay for your use case, turn acking off. But if you need reliable processing, turn it on, tune for your use case, and scale horizontally (add more nodes) to reach your throughput requirements.

Resources