I'm deploying an Elasticsearch cluster with roughly 40GB a day with a time-to-live of 365 days. Write speed would be around 50 msgs/sec. Reads would be mostly driven by user dashboards, so the read frequency won't be high. What would be the best hardware requirements for this amount of data? How many master and data nodes may require in this situation?
obviously base on search index rate you should choose the hardware. 50 msg/sec is very low for elasticsearch. you have total 14.6TB data that is your 85 percent of total disk (base on 85% watermark). this means that you need 17TB disk. I think you can use one server with 128GB RAM and atleast 10 Core CPU and 17TB disk or have two server with half of this config. one server is master and data node and one server will be only data node.
Related
We are in process of implementing Elasticsearch as a search solution in our organization. For the POC we implemented a 3-Node cluster ( each node with 16 VCores and 60 GB RAM and 6 * 375GB SSDs) with all the nodes acting as master, data and co-ordination node. As it was a POC indexing speeds were not a consideration we were just trying to see if it will work or not.
Note : We did try to index 20 million documents on our POC cluster and it took about 23-24 hours to do that which is pushing us to take time and design the production cluster with proper sizing and settings.
Now we are trying to implement a production cluster (in Google Cloud Platform) with emphasis on both indexing speed and search speed.
Our use case is as follows :
We will bulk index 7 million to 20 million documents per index ( we have 1 index for each client and there will be only one cluster). This bulk index is a weekly process i.e. we'll index all data once and will query it for whole week before refreshing it.We are aiming for a 0.5 million document per second indexing throughput.
We are also looking for a strategy to horizontally scale when we add more clients. I have mentioned the strategy in subsequent sections.
Our data model has nested document structure and lot of queries on nested documents which according to me are CPU, Memory and IO intensive. We are aiming for sub second query times for 95th percentile of queries.
I have done quite a bit of reading around this forum and other blogs where companies have high performing Elasticsearch clusters running successfully.
Following are my learnings :
Have dedicated master nodes (always odd number to avoid split-brain). These machines can be medium sized ( 16 vCores and 60 Gigs ram) .
Give 50% of RAM to ES Heap with an exception of not exceeding heap size above 31 GB to avoid 32 bit pointers. We are planning to set it to 28GB on each node.
Data nodes are the workhorses of the cluster hence have to be high on CPUs, RAM and IO. We are planning to have (64 VCores, 240 Gb RAM and 6 * 375 GB SSDs).
Have co-ordination nodes as well to take bulk index and search requests.
Now we are planning to begin with following configuration:
3 Masters - 16Vcores, 60GB RAM and 1 X 375 GB SSD
3 Cordinators - 64Vcores, 60GB RAM and 1 X 375 GB SSD (Compute Intensive machines)
6 Data Nodes - 64 VCores, 240 Gb RAM and 6 * 375 GB SSDs
We have a plan to adding 1 Data Node for each incoming client.
Now since hardware is out of windows, lets focus on indexing strategy.
Few best practices that I've collated are as follows :
Lower number of shards per node is good of most number of scenarios, but have good data distribution across all the nodes for a load balanced situation. Since we are planning to have 6 data nodes to start with, I'm inclined to have 6 shards for the first client to utilize the cluster fully.
Have 1 replication to survive loss of nodes.
Next is bulk indexing process. We have a full fledged spark installation and are going to use elasticsearch-hadoop connector to push data from Spark to our cluster.
During indexing we set the refresh_interval to 1m to make sure that there are less frequent refreshes.
We are using 100 parallel Spark tasks which each task sending 2MB data for bulk request. So at a time there is 2 * 100 = 200 MB of bulk requests which I believe is well within what ES can handle. We can definitely alter these settings based on feedback or trial and error.
I've read more about setting cache percentage, thread pool size and queue size settings, but we are planning to keep them to smart defaults for beginning.
We are open to use both Concurrent CMS or G1GC algorithms for GC but would need advice on this. I've read pros and cons for using both and in dilemma in which one to use.
Now to my actual questions :
Is sending bulk indexing requests to coordinator node a good design choice or should we send it directly to data nodes ?
We will be sending query requests via coordinator nodes. Now my question is, lets say since my data node has 64 cores, each node has thread pool size of 64 and 200 queue size. Lets assume that during search data node thread pool and queue size is completely exhausted then will the coordinator nodes keep accepting and buffering search requests at their end till their queue also fill up ? Or will 1 thread on coordinator will also be blocked per each query request ?
Say a search request come up to coordinator node it blocks 1 thread there and send request to data nodes which in turn blocks threads on data nodes as per where query data is lying. Is this assumption correct ?
While bulk indexing is going on ( assuming that we do not run indexing for all the clients in parallel and schedule them to be sequential) how to best design to make sure that query times do not take much hit during this bulk index.
References
https://thoughts.t37.net/designing-the-perfect-elasticsearch-cluster-the-almost-definitive-guide-e614eabc1a87
https://thoughts.t37.net/how-we-reindexed-36-billions-documents-in-5-days-within-the-same-elasticsearch-cluster-cd9c054d1db8
https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
We did try to index 20 million documents on our POC cluster and it took about 23-24 hours
That is surprisingly little — like less than 250 docs/s. I think my 8GB RAM laptop can insert 13 million docs in 2h. Either you have very complex documents, some bad settings, or your bottleneck is on the ingestion side.
About your nodes: I think you could easily get away with less memory on the master nodes (like 32GB should be plenty). Also the memory on data nodes is pretty high; I'd normally expect heap in relation to the rest of the memory to be 1:1 or for lots of "hot" data maybe 1:3. Not sure you'll get the most out of that 1:7.5 ratio.
CMS vs G1GC: If you have a current Elasticsearch and Java version, both are an option, otherwise CMS. You're generally trading throughput for (GC) latency, so if you benchmark be sure to have a long enough timeframe to properly hit GC phases and run as close to production queries in parallel as possible.
Is sending bulk indexing requests to coordinator node a good design choice or should we send it directly to data nodes ?
I'd say the coordinator is fine. Unless you use a custom routing key and the bulk only contains data for that specific data node, 5/6th of the documents would need to be forwarded to other data nodes anyway (if you have 6 data nodes). And you can offload the bulk processing and coordination handling to non data nodes.
However, overall it might make more sense to have 3 additional data nodes and skip the dedicated coordinating node. Though this is something you can only say for certain by benchmarking your specific scenario.
Now my question is, lets say since my data node has 64 cores, each node has thread pool size of 64 and 200 queue size. Lets assume that during search data node thread pool and queue size is completely exhausted then will the coordinator nodes keep accepting and buffering search requests at their end till their queue also fill up ? Or will 1 thread on coordinator will also be blocked per each query request ?
I'm not sure I understand the question. But have you looked into https://www.elastic.co/blog/why-am-i-seeing-bulk-rejections-in-my-elasticsearch-cluster, which might shed some more light on this topic?
While bulk indexing is going on ( assuming that we do not run indexing for all the clients in parallel and schedule them to be sequential) how to best design to make sure that query times do not take much hit during this bulk index.
While there are different queues for different query operations, there is otherwise no clear separation of tasks (like "only use 20% of the resources for indexing). Maybe go a little more conservative on the parallel bulk requests to avoid overloading the node.
If you are not reading from an index while it's being indexed (ideally you flip an alias once done): You might want to disable the refresh rate entirely and let Elasticsearch create segments as needed, but do a force refresh and change the setting once done. Also you could try running with 0 replicas while indexing, change replicas to 1 once done, and then wait for it to finish — though I'd benchmark if this is helping overall and if it's worth the added complexity.
What is the maximum number of points that can be written to influxdb (single node) per second? Is it feasible to scale influxdb without going for the paid cluster? And should I consider elasticsearch instead of influxdb for time series data (~3000 bytes/sec/user) if I am expecting around 60 concurrent users?
Depends on hardware.
Limiting factors are
Cardinality of series in the DB (total unique series)
WAL disk throughput (this could be put on tmpfs if you don't have SSD)
Data disk throughput (use SSD for best results)
RAM (more is better)
CPU for ingestion, indexing and queries
How far a single node can go largely depends on these and on the workload.
For write-heavy workloads of low cardinality, CPU generally tends to run out faster than anything else, assuming SSDs are used and disk I/O has been optimised accordingly.
After that, cardinality is the biggest limiting factor. Schema design plays a huge role, much bigger than number of nodes.
From some benchmarks I have done, a single node easily scales to ~70K series per second, with CPU being the limiting factor. This was on an old version though, likely higher than that now. Again, largely depends on data and schema design.
It is feasible to scale it without paid cluster by adding separate nodes, but not if you want to keep a homogeneous view (single source of all your data). Scaling vertically (more CPU, RAM) works only as long as cardinality remains consistent, meaning more data points for roughly same number of series.
InfluxDB suggest up to 250K writes / second with 25 queries per second on up to 1M unique queries is feasible on a single node. See hardware guidelines.
For the amount of data you have single node is more than enough - size of data does not matter, number of series does. Avoid elasticsearch for time series data - needs much more infrastructure to handle same amount of data.
How to plan resources (I suspect, elasticsearch instances) according to load:
With load I mean ≈500K events/min, each containing 8-10 fields.
What are the configuration knobs I should turn?
I'm new to this stack.
500,000 events per minute is 8,333 events per second, which should be pretty easy for a small cluster (3-5 machines) to handle.
The problem will come with keeping 720M daily documents open for 60 days (43B documents). If each of the 10 fields is 32 bytes, that's 13.8TB of disk space (nearly 28TB with a single replica).
For comparison, I have 5 nodes at the max (64GB of RAM, 31GB heap), with 1.2B documents consuming 1.2TB of disk space (double with a replica). This cluster could not handle the load with only 32GB of RAM per machine, but it's happy now with 64GB. This is 10 days of data for us.
Roughly, you're expecting to have 40x the number of documents consuming 10x the disk space than my cluster.
I don't have the exact numbers in front of me, but our pilot project for using doc_values is giving us something like a 90% heap savings.
If all of that math holds, and doc_values is that good, you could be OK with a similar cluster as far as actual bytes indexed were concerned. I would solicit additional information on the overhead of having so many individual documents.
We've done some amount of elasticsearch tuning, but there's probably more than could be done as well.
I would advise you to start with a handful of 64GB machines. You can add more as needed. Toss in a couple of (smaller) client nodes as the front-end for index and search requests.
I setup 3 nodes of Cassandra (1.2.10) cluster on 3 instances of EC2 m1.xlarge.
Based on default configuration with several guidelines included, like:
datastax_clustering_ami_2.4
not using EBS, raided 0 xfs on ephemerals instead,
commit logs on separate disk,
RF=3,
6GB heap, 200MB new size (also tested with greater new size/heap values),
enhanced limits.conf.
With 500 writes per second, the cluster works only for couple of hours. After that time it seems like not being able to respond because of CPU overload (mainly GC + compactions).
Nodes remain Up, but their load is huge and logs are full of GC infos and messages like:
ERROR [Native-Transport-Requests:186] 2013-12-10 18:38:12,412 ErrorMessage.java (line 210) Unexpected exception during request java.io.IOException: Broken pipe
nodetool shows many dropped mutations on each node:
Message type Dropped
RANGE_SLICE 0
READ_REPAIR 7
BINARY 0
READ 2
MUTATION 4072827
_TRACE 0
REQUEST_RESPONSE 1769
Is 500 wps too much for 3-node cluster of m1.xlarge and I should add nodes? Or is it possible to further tune GC somehow? What load are you able to serve with 3 nodes of m1.xlarge? What are your GC configs?
Cassandra is perfectly able to handle tens of thousands small writes per second on a single node. I just checked on my laptop and got about 29000 writes/second from cassandra-stress on Cassandra 1.2. So 500 writes per second is not really an impressive number even for a single node.
However beware that there is also a limit on how fast data can be flushed to disk and you definitely don't want your incoming data rate to be close to the physical capabilities of your HDDs. Therefore 500 writes per second can be too much, if those writes are big enough.
So first - what is the average size of the write? What is your replication factor? Multiply number of writes by replication factor and by average write size - then you'll approximately know what is required write throughput of a cluster. But you should take some safety margin for other I/O related tasks like compaction. There are various benchmarks on the Internet telling a single m1.xlarge instance should be able to write anywhere between 20 MB/s to 100 MB/s...
If your cluster has sufficient I/O throughput (e.g. 3x more than needed), yet you observe OOM problems, you should try to:
reduce memtable_total_space_mb (this will cause C* to flush smaller memtables, more often, freeing heap earlier)
lower write_request_timeout to e.g. 2 seconds instead of 10 (if you have big writes, you don't want to keep too many of them in the incoming queues, which reside on the heap)
turn off row_cache (if you ever enabled it)
lower size of the key_cache
consider upgrading to Cassandra 2.0, which moved quite a lot of things off-heap (e.g. bloom filters and index-summaries); this is especially important if you just store lots of data per node
add more HDDs and set multiple data directories, to improve flush performance
set larger new generation size; I usually set it to about 800M for a 6 GB heap, to avoid pressure on the tenured gen.
if you're sure memtable flushing lags behind, make sure sstable compression is enabled - this will reduce amount of data physically saved to disk, at the cost of additional CPU cycles
I'm currently rebuilding our servers that have our region-servers and data nodes. When I take down a data node, after 10 minutes the blocks that it had are being re-replicated among other data nodes, as it should. We have 10 data-nodes, so I see heavy network traffic as the blocks are being re-replicated. However, I'm seeing that traffic to be about only 500-600mbps per server (the machines all have gigabit interfaces) so it's definitely not network-bound. I'm trying to figure out what is limiting the speed that the data-nodes send and receive blocks. Each data-node has six 7200 rpm sata drives, and the IO usage is very low during this, only peaking to 20-30% per drive. Is there a limit built into hdfs that limits the speed at which blocks are replicated?
The rate of replication work is throttled by HDFS to not interfere with cluster traffic when failures happen during regular cluster load.
The properties that control this are dfs.namenode.replication.work.multiplier.per.iteration (2), dfs.namenode.replication.max-streams (2) and dfs.namenode.replication.max-streams-hard-limit (4). The foremost controls the rate of work to be scheduled to a DN at every heartbeat that occurs, and the other two further limit the maximum parallel threaded network transfers done by a DataNode at a time. The values in () indicate their defaults. Some description of this is available at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
You can perhaps try to increase the set of values to (10, 50, 100) respectively to spruce up the network usage (requires a NameNode restart), but note that your DN memory usage may increase slightly as a result of more blocks information being propagated to it. A reasonable heap size for these values for the DN role would be about 4 GB.
P.s. These values were not tried by me on production systems personally. You will also not want to max out the re-replication workload such that it affects regular cluster work, as recovery of 1/3 replicas may be of lesser priority than missing job/query SLAs due to lack of network resources (unless you have a really fast network that's always under-utilised even under loaded periods). Try to tune it till you're satisfied with the results.