Background
With our Elasticsearch nodes, I've noticed very high CPU usage relative to I/O throughput when indexing documents (queries seem to be fine). I was able to increase throughput via vertical scaling (adding more CPUs to the servers), but I wanted to see what kind of increase I would get from horizontal scaling (doubling the number of nodes from 2 to 4).
Problem
I expected to see increased throughput with the expanded cluster size, but performance was actually a little worse. I also noticed that half of the nodes reported very little I/O and CPU usage.
Research
I saw that the primary shard distribution was wonky, so I shuffled some of them around using the cluster reroute API. This didn't really have any effect other than to change which two nodes were being used.
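A reroute move for this index looks roughly like the following (the shard and node values here are just an example; newer versions also want a Content-Type: application/json header):

  curl -XPOST 'localhost:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '{
    "commands" : [
      { "move" : { "index" : "files-v2", "shard" : 0, "from_node" : "es-qa-01", "to_node" : "es-qa-03" } }
    ]
  }'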
The _search_shards API indicates that all nodes and shards should participate.
Question
I'm not sure why only two nodes are participating in indexing. Once a document has been indexed, is there a way to see which shard it resides in? Is there something obvious that I'm missing?
Setup
Servers: 2 CPUs, 10 GB JVM heap, 18 GB RAM, 500 GB SSD
Index: 8 shards, 1 replica
Routing Key: _id
Total Document Count: 4.1M
Index Document Count: 50k
Avg Document Size: 14.6 KB
Max Document Size: 32.4 MB
Stats
Shards
index shard prirep state docs store ip node
files-v2 4 r STARTED 664644 8.4gb 10.240.219.136 es-qa-03
files-v2 4 p STARTED 664644 8.4gb 10.240.211.15 es-qa-01
files-v2 7 r STARTED 854807 10.5gb 10.240.53.190 es-qa-04
files-v2 7 p STARTED 854807 10.2gb 10.240.147.89 es-qa-02
files-v2 0 r STARTED 147515 711.4mb 10.240.53.190 es-qa-04
files-v2 0 p STARTED 147515 711.4mb 10.240.211.15 es-qa-01
files-v2 3 r STARTED 347552 1.2gb 10.240.53.190 es-qa-04
files-v2 3 p STARTED 347552 1.2gb 10.240.147.89 es-qa-02
files-v2 1 p STARTED 649461 3.5gb 10.240.219.136 es-qa-03
files-v2 1 r STARTED 649461 3.5gb 10.240.147.89 es-qa-02
files-v2 5 r STARTED 488581 3.6gb 10.240.219.136 es-qa-03
files-v2 5 p STARTED 488581 3.6gb 10.240.211.15 es-qa-01
files-v2 6 r STARTED 186067 916.8mb 10.240.147.89 es-qa-02
files-v2 6 p STARTED 186067 916.8mb 10.240.211.15 es-qa-01
files-v2 2 r STARTED 765970 7.8gb 10.240.53.190 es-qa-04
files-v2 2 p STARTED 765970 7.8gb 10.240.219.136 es-qa-03
Make sure that the JVM and Elasticsearch configurations are the same on all nodes.
For testing purposes, try making all nodes hold all of the data (in your case, set the number of replicas to 3).
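For example, via the index settings API (index name taken from your shard listing):

  curl -XPUT 'localhost:9200/files-v2/_settings' -H 'Content-Type: application/json' -d '{
    "index" : { "number_of_replicas" : 3 }
  }'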
On the document-to-shard relationship, see:
https://www.elastic.co/guide/en/elasticsearch/guide/current/routing-value.html
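Since your routing key is _id, one way to see where a given document lives is to ask the _search_shards API with that id as the routing value, e.g.:

  curl -XGET 'localhost:9200/files-v2/_search_shards?routing=<document-id>'

A search with "explain": true also reports the _shard and _node of each hit.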
OK, so I think I found it. I'm using Spring Data's Elasticsearch repository. Inside their save(doc) method, there's a call to refresh:
public <S extends T> S save(S entity) {
    Assert.notNull(entity, "Cannot save 'null' entity.");
    elasticsearchOperations.index(createIndexQuery(entity));
    elasticsearchOperations.refresh(entityInformation.getIndexName(), true);
    return entity;
}
I bypassed this by invoking the API without Spring's abstraction, and the CPU usage across all nodes was much, much better. I'm still not quite clear why a refresh would have an effect on 2 nodes (instead of 1 or all), but the issue appears to be resolved.
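In other words, each save() was doing an index call immediately followed by an explicit refresh, roughly equivalent to this (document type and id are placeholders):

  curl -XPUT 'localhost:9200/files-v2/<type>/<id>' -H 'Content-Type: application/json' -d '{ ... }'
  curl -XPOST 'localhost:9200/files-v2/_refresh'

Dropping the per-document _refresh and letting the index refresh on its own interval is what brought the CPU usage back down.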
Related
We are in the process of implementing Elasticsearch as a search solution in our organization. For the POC we implemented a 3-node cluster (each node with 16 vCores, 60 GB RAM and 6 x 375 GB SSDs) with all the nodes acting as master, data and coordinating nodes. As it was a POC, indexing speed was not a consideration; we were just trying to see whether it would work or not.
Note: We did try to index 20 million documents on our POC cluster and it took about 23-24 hours, which is pushing us to take the time to design the production cluster with proper sizing and settings.
Now we are trying to implement a production cluster (in Google Cloud Platform) with emphasis on both indexing speed and search speed.
Our use case is as follows :
We will bulk index 7 to 20 million documents per index (we have one index for each client, and there will be only one cluster). This bulk index is a weekly process, i.e. we'll index all data once and will query it for the whole week before refreshing it. We are aiming for an indexing throughput of 0.5 million documents per second.
We are also looking for a strategy to horizontally scale when we add more clients. I have mentioned the strategy in subsequent sections.
Our data model has a nested document structure and a lot of queries on nested documents, which as far as I understand are CPU-, memory- and I/O-intensive. We are aiming for sub-second query times for the 95th percentile of queries.
I have done quite a bit of reading around this forum and other blogs where companies have high performing Elasticsearch clusters running successfully.
Following are my learnings:
Have dedicated master nodes (always an odd number to avoid split-brain). These machines can be medium-sized (16 vCores and 60 GB RAM).
Give 50% of RAM to the ES heap, but do not let the heap exceed roughly 31 GB so that the JVM can keep using compressed (32-bit) object pointers. We are planning to set it to 28 GB on each node.
Data nodes are the workhorses of the cluster and hence need to be big on CPU, RAM and I/O. We are planning on 64 vCores, 240 GB RAM and 6 x 375 GB SSDs per data node.
Have coordinating nodes as well to take bulk index and search requests.
Now we are planning to begin with the following configuration:
3 Masters - 16 vCores, 60 GB RAM and 1 x 375 GB SSD
3 Coordinators - 64 vCores, 60 GB RAM and 1 x 375 GB SSD (compute-intensive machines)
6 Data Nodes - 64 vCores, 240 GB RAM and 6 x 375 GB SSDs
We plan to add one data node for each incoming client.
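A sketch of how these roles could be split in elasticsearch.yml (pre-7.x style node flags; adjust to your version):

  # dedicated master
  node.master: true
  node.data: false
  node.ingest: false

  # coordinating-only node
  node.master: false
  node.data: false
  node.ingest: false

  # data node
  node.master: false
  node.data: true
  node.ingest: true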
Now that hardware is out of the way, let's focus on indexing strategy.
A few best practices that I've collated are as follows:
A lower number of shards per node is good for most scenarios, but you need good data distribution across all the nodes for a balanced load. Since we are planning to start with 6 data nodes, I'm inclined to use 6 shards for the first client to utilize the cluster fully.
Have 1 replica to survive the loss of a node.
Next is the bulk indexing process. We have a full-fledged Spark installation and are going to use the elasticsearch-hadoop connector to push data from Spark to our cluster.
During indexing we set refresh_interval to 1m so that refreshes are less frequent.
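That is, something like this before kicking off the bulk load (the index name is just a placeholder):

  curl -XPUT 'localhost:9200/client1/_settings' -H 'Content-Type: application/json' -d '{
    "index" : { "refresh_interval" : "1m" }
  }'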
We are using 100 parallel Spark tasks, with each task sending 2 MB of data per bulk request. So at any given time there is 2 x 100 = 200 MB of in-flight bulk requests, which I believe is well within what ES can handle. We can definitely alter these settings based on feedback or trial and error.
I've read more about cache percentages, thread pool sizes and queue size settings, but we are planning to keep them at the defaults to begin with.
We are open to using either CMS or G1GC for garbage collection but would like advice on this. I've read the pros and cons of both and am unsure which one to use.
Now to my actual questions:
Is sending bulk indexing requests to the coordinating nodes a good design choice, or should we send them directly to the data nodes?
We will be sending query requests via the coordinating nodes. Now my question: since each data node has 64 cores, each node has a thread pool of size 64 and a queue size of 200. Let's assume that during searching the data nodes' thread pools and queues are completely exhausted. Will the coordinating nodes keep accepting and buffering search requests at their end until their queues also fill up, or will one thread on the coordinating node also be blocked for each query request?
Say a search request arrives at a coordinating node: it blocks one thread there and sends requests to the data nodes, which in turn block threads on the data nodes depending on where the queried data lies. Is this assumption correct?
While bulk indexing is going on (assuming that we do not run indexing for all the clients in parallel but schedule them sequentially), how do we best design things so that query times do not take much of a hit during the bulk index?
References
https://thoughts.t37.net/designing-the-perfect-elasticsearch-cluster-the-almost-definitive-guide-e614eabc1a87
https://thoughts.t37.net/how-we-reindexed-36-billions-documents-in-5-days-within-the-same-elasticsearch-cluster-cd9c054d1db8
https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
We did try to index 20 million documents on our POC cluster and it took about 23-24 hours
That is surprisingly slow: less than 250 docs/s (20 million documents in roughly 23.5 hours). I think my 8 GB RAM laptop can insert 13 million docs in 2 hours. Either you have very complex documents, some bad settings, or your bottleneck is on the ingestion side.
About your nodes: I think you could easily get away with less memory on the master nodes (like 32GB should be plenty). Also the memory on data nodes is pretty high; I'd normally expect heap in relation to the rest of the memory to be 1:1 or for lots of "hot" data maybe 1:3. Not sure you'll get the most out of that 1:7.5 ratio.
CMS vs G1GC: if you have a current Elasticsearch and Java version, both are an option; otherwise CMS. You're generally trading throughput for (GC) latency, so if you benchmark, be sure to use a long enough timeframe to properly hit GC phases and run queries as close to production as possible in parallel.
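If you do end up testing both, the switch is just the GC flags in jvm.options (assuming ES 5.x+ which ships that file); these are the CMS defaults:

  -XX:+UseConcMarkSweepGC
  -XX:CMSInitiatingOccupancyFraction=75
  -XX:+UseCMSInitiatingOccupancyOnly
  # or, on a recent JVM, replace the three lines above with:
  # -XX:+UseG1GC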
Is sending bulk indexing requests to the coordinating nodes a good design choice, or should we send them directly to the data nodes?
I'd say the coordinator is fine. Unless you use a custom routing key and the bulk only contains data for that specific data node, 5/6ths of the documents would need to be forwarded to other data nodes anyway (with 6 data nodes). And you can offload the bulk processing and coordination overhead to non-data nodes.
However, overall it might make more sense to have 3 additional data nodes and skip the dedicated coordinating node. Though this is something you can only say for certain by benchmarking your specific scenario.
Now my question: since each data node has 64 cores, each node has a thread pool of size 64 and a queue size of 200. Let's assume that during searching the data nodes' thread pools and queues are completely exhausted. Will the coordinating nodes keep accepting and buffering search requests at their end until their queues also fill up, or will one thread on the coordinating node also be blocked for each query request?
I'm not sure I understand the question. But have you looked into https://www.elastic.co/blog/why-am-i-seeing-bulk-rejections-in-my-elasticsearch-cluster, which might shed some more light on this topic?
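Independent of that, it's worth watching the relevant thread pools while you load test, e.g. (the bulk pool is called write on newer versions):

  curl -XGET 'localhost:9200/_cat/thread_pool/bulk,search?v&h=node_name,name,active,queue,rejected,completed'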
While bulk indexing is going on (assuming that we do not run indexing for all the clients in parallel but schedule them sequentially), how do we best design things so that query times do not take much of a hit during the bulk index?
While there are different queues for different operations, there is otherwise no clear separation of tasks (like "only use 20% of the resources for indexing"). Maybe go a little more conservative on the parallel bulk requests to avoid overloading the nodes.
If you are not reading from an index while it's being indexed (ideally you flip an alias once done): you might want to disable refreshes entirely and let Elasticsearch create segments as needed, then do a forced refresh and restore the setting once the load is done. You could also run with 0 replicas while indexing, change replicas to 1 once done, and wait for the replication to finish; though I'd benchmark whether this helps overall and whether it's worth the added complexity.
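A minimal sketch of that sequence (the index name is a placeholder):

  # before the bulk load
  curl -XPUT 'localhost:9200/client1/_settings' -H 'Content-Type: application/json' -d '{
    "index" : { "refresh_interval" : "-1", "number_of_replicas" : 0 }
  }'

  # ... run the bulk load ...

  # afterwards: restore the settings and refresh once
  curl -XPUT 'localhost:9200/client1/_settings' -H 'Content-Type: application/json' -d '{
    "index" : { "refresh_interval" : "1s", "number_of_replicas" : 1 }
  }'
  curl -XPOST 'localhost:9200/client1/_refresh'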
We have an ES cluster at AWS running with the following setup:
(I know, I need a minimum of 3 master nodes)
1 Coordinator
2 Data nodes
1 Master Node
Data nodes spec:
CPU: 8 Cores
RAM: 20 GB
Disk: 1 TB SSD, 4000 IOPS
Problem:
The ES endpoints for search, delete, backup, cluster health and insert are working fine.
Since yesterday, some endpoints like /_cat/indices, /_nodes/_local/stats, etc. have started to take too long to respond (more than 4 minutes) :( and consequently our Kibana is in a red state (timeout after 30000 ms).
Useful info:
All Shards are OK (3500 in total)
The cluster is in green state
X-pack disabled
Average of 1 GB per shard
500k document count.
Requests made by localhost at AWS
CPU, DISK, RAM, IOPS all fine
Any ideas?
Thanks in advance :)
EDIT/SOLUTION 1:
After a few days I found out what the problem was, but first a little bit of context...
We use Elasticsearch for storing user audit messages and mobile error messages. At first (obviously in a rush to deliver new microservices and remove load from our MongoDB cluster) we designed the Elasticsearch indices by day, so every day a new index was created, and by the end of the day that index held around 6-9 GB of data.
Six months later, with almost 180 more indices and 720 open primary shards, we bumped into this problem.
Then I read this again (the basics!):
https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_concepts.html
After talking to the team responsible for this microservice, we redesigned our indices to monthly indices, and guess what? Problem solved!
Now our cluster is much faster than before and this simple command saved me some sweet nights of sleep.
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
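A reindex call to merge the daily indices into a monthly one is essentially of this shape (the index names here are made up):

  curl -XPOST 'localhost:9200/_reindex' -H 'Content-Type: application/json' -d '{
    "source" : { "index" : "audit-2018.01.*" },
    "dest"   : { "index" : "audit-2018.01" }
  }'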
Thanks!
We're dealing with a huge number of shards (70k+), which makes our ES (v1.6.0, 1 replica, 5 shards per index) not so reliable. We're in the process of deleting indices, but we're noticing that there's a spike of refresh-mapping tasks after each individual delete (if it matters, these delete actions are performed via the REST API). This can be a problem, because subsequent DELETE requests will be interleaved with the refresh-mapping tasks, and eventually they will time out.
For example, here's the output of _cat/pending_tasks when deleting an index.
insertOrder timeInQueue priority source
3733244 1m URGENT delete-index [test.class_jump.2015-07-16]
3733245 210ms HIGH refresh-mapping [business.bear_case1validation.2015-09-19][[bear_case1Validation]]
3733246 183ms HIGH refresh-mapping [business.bear_case1validation.2015-09-15][[bear_case1Validation]]
3733247 156ms HIGH refresh-mapping [search.cube_scan.2015-09-24][[cube_scan]]
3733248 143ms HIGH refresh-mapping [business.bear_case1validation.2015-09-17][[bear_case1Validation]]
3733249 117ms HIGH refresh-mapping [business.bear_case1validation.2015-09-22][[bear_case1Validation]]
3733250 85ms HIGH refresh-mapping [search.santino.2015-09-25][[santino]]
3733251 27ms HIGH refresh-mapping [search.santino.2015-09-25][[santino]]
3733252 9ms HIGH refresh-mapping [business.output_request_finalized.2015-09-22][[output_request_finalized]]
3733253 2ms HIGH refresh-mapping [business.bear_case1validation.2015-08-19][[bear_case1Validation]]
There are two things which we don't understand:
1. Why are refresh-mapping tasks being triggered? Maybe they are always triggered, but are only visible now because they are queued behind the URGENT task. Is this the case?
2. Why are they triggered on "old" indices which do not change anymore? (The indices being refreshed are one to two weeks old. The one being deleted is two weeks old as well.)
Could this be caused by load rebalancing between nodes? It seems odd, but nothing else comes to mind. Moreover, it seems that there are only a few documents in the index (see below), so load rebalancing seems an extreme long shot.
_cat/shards for test.class_jump.2015-07-16
index shard prirep state docs store ip node
test.class_jump.2015-07-16 2 r STARTED 0 144b 192.168.9.240 st-12
test.class_jump.2015-07-16 2 p STARTED 0 108b 192.168.9.252 st-16
test.class_jump.2015-07-16 0 p STARTED 0 144b 192.168.9.237 st-10
test.class_jump.2015-07-16 0 r STARTED 0 108b 192.168.7.49 st-01
test.class_jump.2015-07-16 3 p STARTED 1 15.5kb 192.168.7.51 st-03
test.class_jump.2015-07-16 3 r STARTED 1 15.5kb 192.168.10.11 st-18
test.class_jump.2015-07-16 1 r STARTED 0 144b 192.168.9.107 st-08
test.class_jump.2015-07-16 1 p STARTED 0 144b 192.168.7.48 st-00
test.class_jump.2015-07-16 4 r STARTED 1 15.6kb 192.168.10.65 st-19
test.class_jump.2015-07-16 4 p STARTED 1 15.6kb 192.168.9.106 st-07
Is there any way in which these can be suppressed? And more importantly, any way to speed up Index Deletion?
It looks like you're experiencing the same problem as reported in issue #10318, and it is due to the cluster trying to keep mappings in sync between master and data nodes. The comparison runs on a serialized version of the mappings, and the fielddata part is a Java Map that is being serialized.
Since Maps don't guarantee any ordering, the serialization yields syntactically different mappings every time, and for that reason ES thinks the mappings differ between master and data nodes, hence it tries to refresh mappings all over the place all the time.
Until you migrate to 2.0, it seems that the "fix" is to set indices.cluster.send_refresh_mapping: false in elasticsearch.yml on all your nodes and restart them.
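That is, on every node:

  # elasticsearch.yml (then rolling-restart the nodes)
  indices.cluster.send_refresh_mapping: false

  # afterwards, confirm the refresh-mapping storm is gone
  curl -XGET 'localhost:9200/_cat/pending_tasks?v'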
I'm doing a benchmark of Cassandra's read performance. In the test setup I created clusters with 1 / 2 / 4 EC2 instances as data nodes. I wrote one table with 100 million entries (a ~3 GB CSV file). Then I launch a Spark application which reads the data into an RDD using the spark-cassandra-connector.
However, I expected the following behavior: the more instances Cassandra uses (with the same number of instances on the Spark side), the faster the reads. The writes seem to behave correctly (roughly 2 times faster if the cluster is 2 times larger).
But in my benchmark the read is always faster with a 1-instance cluster than with a 2- or 4-instance cluster!
My Benchmark Results:
Cluster-size 4: Write: 1750 seconds / Read: 360 seconds
Cluster-size 2: Write: 3446 seconds / Read: 420 seconds
Cluster-size 1: Write: 7595 seconds / Read: 284 seconds
ADDITIONAL TEST - WITH THE CASSANDRA-STRESS TOOL
I launched the cassandra-stress tool on the Cassandra cluster (size 1 / 2 / 3 / 4 nodes), with the following results:
Cluster size  Threads  Ops/sec  Time (s)
1             4        10146    30.1
1             8        15612    30.1
1             16       20037    30.2
1             24       24483    30.2
1             121      43403    30.5
1             913      50933    31.7
2             4        8588     30.1
2             8        15849    30.1
2             16       24221    30.2
2             24       29031    30.2
2             121      59151    30.5
2             913      73342    31.8
3             4        7984     30.1
3             8        15263    30.1
3             16       25649    30.2
3             24       31110    30.2
3             121      58739    30.6
3             913      75867    31.8
4             4        7463     30.1
4             8        14515    30.1
4             16       25783    30.3
4             24       31128    31.1
4             121      62663    30.9
4             913      80656    32.4
Results: with 4 or 8 threads, the single-node cluster is as fast as or faster than the larger clusters!
Results as a diagram: the data sets are the cluster sizes (1/2/3/4), the x-axis shows the number of threads, and the y-axis the ops/sec.
Question here: are these results cluster-wide, or is this a test against a single local node (and therefore the result of only one instance of the ring)?
Can someone give an explanation? Thank you!
I ran a similar test with a spark worker running on each Cassandra node.
Using a Cassandra table with 15 million rows (about 1.75 GB of data), I ran a spark job to create an RDD from the table with each row as a string, and then printed a count of the number of rows.
Here are the times I got:
1 C* node, 1 spark worker - 1 min. 42 seconds
2 C* nodes, 2 spark workers - 55 seconds
4 C* nodes, 4 spark workers - 35 seconds
So it seems to scale pretty well with the number of nodes when the spark workers are co-located with the C* nodes.
By not co-locating your workers with Cassandra, you are forcing all the table data to go across the network. That will be slow and perhaps in your environment is a bottleneck. If you co-locate them, then you benefit from data locality since spark will create the RDD partitions from the tokens that are local to each machine.
You may also have some other bottleneck. I'm not familiar with EC2 and what it offers. Hopefully it has local disk storage rather than network storage since C* doesn't like network storage.
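For reference, a co-located run just means one Spark worker per C* node with the connector pointed at those nodes; a submit looks roughly like this (the connector version, Scala suffix and addresses are placeholders to adjust):

  spark-submit \
    --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0 \
    --conf spark.cassandra.connection.host=10.0.0.1,10.0.0.2,10.0.0.3,10.0.0.4 \
    read_benchmark.py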
Currently I'm working on a project which uses Elasticsearch 1.4.1.
The cluster consists of 22 nodes, 20 of which are data nodes; each node has 16 GB of heap memory.
The thing is, when I'm running massive queries, some of the nodes (2 or 3) consume 70% of the heap memory, while the rest use less than 10%.
So I'm wondering: is this because most of the queries go to those 2 or 3 nodes?
If not, what can I do to achieve better performance?
Thanks!
Just updated:
I just ran this command: curl -XGET localhost:9200/_cat/shards?v, and it returned:
index shard prirep state docs store ip node
....
mm 2 r STARTED 2248969 293.6mb 10.2.4.117 Mark Todd
mm 2 p STARTED 2248969 293.6mb 10.2.4.129 Saint Elmo
mm 19 r STARTED 30172116 3.5gb 10.2.4.126 Fixer
mm 19 p STARTED 30172116 3.5gb 10.2.4.123 Loki
....
I'm wondering what store means here. If it is the actual size of the documents, can I load all of them into memory?
This could be because the document matches for that query are somehow concentrated on just those 2 machines.
That is, if there are 20 million matches, chances are that 8 million of them belong to one machine, 8 million come from another, and only 2 million come from the rest of the 18 machines.
I am guessing you are also using aggregations in the process, which is what builds up the field data cache.
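If that's the case, the skew should show up in the per-node field data stats, e.g.:

  curl -XGET 'localhost:9200/_cat/fielddata?v&fields=*'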