Cassandra replication read performance oddities - performance

Sorry, this will take a bit to explain... We're testing the performance of Cassandra using YCSB. We have a 3-node setup and a 9-node setup. The 3-node setup is pretty simple: replication factor 1 (no additional copies).
Our 9-node setup contains 3 data centers (3 nodes per data center). In the 9-node setup we also kept the replication factor at 1, because we understand that Cassandra's default NetworkTopologyStrategy is going to automatically replicate across data centers. That effectively gives us a copy of the data at each data center, which is great because that is what we want to test.
Our read-only test against the 9-node setup uses the DCAwareRoundRobinPolicy to query only the "local" data center. So we are querying just 3 of the 9 nodes and expected results similar to our simple 3-node setup. In fact, we'd expect the results to be a little worse because of Cassandra's read-repair messages and because we are using QUORUM read consistency.
However, we found the opposite: the read-only test on the simple 3-node setup performed a little worse than the more complex 3-data-center/9-node setup.
The data loaded on both clusters is the same. Read-only tests were run with varying thread counts, and we noticed a larger disparity with more threads. The 9-node setup got better with more threads, which should not have been the case, because we verified that only the 3 nodes we connected to in our "local" data center were receiving queries.
So, why are reads faster in the more complex setup when we are still hitting the same number of nodes (3)? Our write-only test did not exhibit this behaviour.
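For reference, here is a minimal sketch of the DC-pinned read path described above, using the DataStax Python driver (the question uses the Java driver's DCAwareRoundRobinPolicy; the data center name, contact points, and YCSB keyspace/table names below are assumptions):

from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy
from cassandra import ConsistencyLevel

# Only nodes in the "local" data center are used as coordinators; QUORUM still
# counts acknowledgements across replicas in the whole cluster.
profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy(local_dc="dc1")),
    consistency_level=ConsistencyLevel.QUORUM)

cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"],   # local-DC contact points
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect("ycsb")                          # keyspace name is hypothetical
row = session.execute("SELECT field0 FROM usertable WHERE y_id = %s", ["user1"]).one()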
Thanks in advance!

Related

Improve h2o DRF runtime on a multi-node cluster

I am currently running h2o's DRF algorithm on a 3-node EC2 cluster (the h2o server spans all 3 nodes).
My data set has 1m rows and 41 columns (40 predictors and 1 response).
I use the R bindings to control the cluster and the RF call is as follows:
model = h2o.randomForest(x = x,
                         y = y,
                         ignore_const_cols = TRUE,
                         training_frame = train_data,
                         seed = 1234,
                         mtries = 7,
                         ntrees = 2000,
                         max_depth = 15,
                         min_rows = 50,
                         stopping_rounds = 3,
                         stopping_metric = "MSE",
                         stopping_tolerance = 2e-5)
For the 3-node cluster (c4.8xlarge, enhanced networking turned on), this takes about 240 sec; CPU utilization is between 10-20%, RAM utilization is between 20-30%, and network transfer is between 10-50 MByte/sec (in and out). 300 trees are built before early stopping kicks in.
On a single-node cluster, I can get the same results in about 80 sec. So, instead of an expected 3-fold speed-up, I get a 3-fold slow-down on the 3-node cluster.
I did some research and found a few resources that were reporting the same issue (not as extreme as mine though). See, for instance:
https://groups.google.com/forum/#!topic/h2ostream/bnyhPyxftX8
Specifically, the author of http://datascience.la/benchmarking-random-forest-implementations/ notes that
While not the focus of this study, there are signs that running the distributed random forests implementations (e.g. H2O) on multiple nodes does not provide the speed benefit one would hope for (because of the high cost of shipping the histograms at each split over the network).
Also https://www.slideshare.net/0xdata/rf-brighttalk points to 2 different DRF implementations, one of which has larger network overhead.
I think that I am running into the same problems as described in the links above.
How can I improve h2o's DRF performance on a multi-node cluster?
Are there any settings that might improve runtime?
Any help highly appreciated!
If your Random Forest is slower on a multi-node H2O cluster, it just means that your dataset is not big enough to take advantage of distributed computing. There is an overhead to communicate between cluster nodes, so if you can train your model successfully on a single node, then using a single node will always be faster.
Multi-node is designed for when your data is too big to train on a single node. Only then will it be worth using multiple nodes. Otherwise, you are just adding communication overhead for no reason and will see the type of slowdown that you observed.
If your data fits into memory on a single machine (and you can successfully train a model without running out of memory), the way to speed up your training is to switch to a machine with more cores. You can also play around with certain parameter values that affect training speed to see if you can get a speed-up, but that usually comes at a cost in model performance.
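As a rough illustration of the single-machine route, this is how a single-node H2O instance can be started with all cores and a bigger heap from Python (the heap size is a placeholder; the question itself uses the R bindings, where h2o.init() accepts similar arguments):

import h2o

# Use every available core (nthreads=-1) and a larger heap on one machine,
# instead of spreading the work over the network.
h2o.init(nthreads=-1, max_mem_size="60G")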
As Erin says, often adding more nodes just adds the capability for bigger data sets, not quicker learning. Random forest might be the worst; I get fairly good results with deep learning (e.g. 3x quicker with 4 nodes, 5-6x quicker with 8 nodes).
In your comment on Erin's answer you mention that the real problem is that you want to speed up hyper-parameter optimization. It is frustrating that h2o.grid() doesn't support building models in parallel, one on each node, when the data will fit in memory on each node. But you can do that yourself with a bit of scripting: set up one h2o cluster on each node, do a grid search with a subset of hyper-parameters on each node, have them save the results and models to S3, then bring the results in and combine them at the end. (If doing a random grid search, you can run exactly the same grid on each cluster, but it is a good idea to explicitly use a different seed on each.)
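A rough sketch of what one such per-node grid search could look like, here with H2O's Python API (the hyper-parameter grid, column names, file paths, and the per-node seed are all illustrative; the idea is just that every node runs the same random grid with its own seed and saves its models for later merging):

import h2o
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()  # connect to the single-node H2O cluster running on this machine
train = h2o.import_file("train.csv")                # hypothetical data set
x = [c for c in train.columns if c != "response"]   # "response" is a placeholder name

grid = H2OGridSearch(
    model=H2ORandomForestEstimator(ntrees=2000, stopping_rounds=3,
                                   stopping_metric="MSE", stopping_tolerance=2e-5),
    hyper_params={"max_depth": [10, 15, 20], "mtries": [5, 7, 9], "min_rows": [20, 50]},
    search_criteria={"strategy": "RandomDiscrete", "max_models": 20,
                     "seed": 1001})                 # use a different seed on each node

grid.train(x=x, y="response", training_frame=train)
best = grid.get_grid(sort_by="mse", decreasing=False).models[0]
h2o.save_model(best, path="/tmp/grid_models")       # the answer suggests S3 instead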

Strange replication in Cassandra

I have locally configured 3 nodes in one 'Test Cluster' of Cassandra. When I run them and create a keyspace or table, the keyspace or table appears on all three nodes.
The problem I'm dealing with is that when I import millions of rows from CSV into the table I already built, the whole data set suddenly appears on all three nodes; I have the same data replicated across the three nodes.
As far as I understand, the data I'm importing should be distributed over the nodes only partially: one partition on the first node, the second on the third, the third on the second node, the fourth again on the first node, and so on.
Am I right, or am I missing something big?
Also, my write speed locally is about 10k rows/second for the multi-node cluster. Isn't that a little bit too low?
I want to create a discussion so I can maybe learn something more from your experience and see where I'm messing things up.
Thank you!
The number of nodes that data is written to in your cluster is determined by the replication factor for that keyspace. If you have 3 nodes and the data is being written to all 3 nodes, then this setting must be set to 3. If you only want the data to be replicated to two nodes, you'd set this value to two.
Your write speed will be affected by the consistency level you are specifying on the write. If you have it set to ALL then you have to wait until all the nodes that are going to write the data have written the data (in your case all 3 nodes based on your replication factor). Dropping your consistency level on the write will probably net you faster write times. There is a balance between your replication factor, write consistency level, and read consistency level that you can research further.
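As a rough illustration of those two knobs (the keyspace, table, and contact point are made up; shown with the DataStax Python driver), the replication factor is fixed when the keyspace is created, while the consistency level is chosen per request:

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

session = Cluster(["127.0.0.1"]).connect()

# Replication factor 2: each row is stored on 2 of the 3 nodes.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2}
""")
session.execute("CREATE TABLE IF NOT EXISTS demo.users (id int PRIMARY KEY, name text)")

# Writing at ONE returns after a single replica acknowledges; ALL would wait
# for both replicas and therefore be slower.
insert = SimpleStatement("INSERT INTO demo.users (id, name) VALUES (%s, %s)",
                         consistency_level=ConsistencyLevel.ONE)
session.execute(insert, (1, "alice"))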

Datastax Cassandra - Spanning Cluster node across amazon region

I am planning to launch three EC2 instances across Amazon regions, say Region-A, Region-B and Region-C.
Based on the above plan, each region acts as a cluster (or data center) and has one node. (Correct me if I am wrong.)
Using this infrastructure, can I attain the configuration below?
Replication factor: 2
Write and read consistency level: QUORUM.
My basic intention is to achieve the following: if two regions go down, I can survive with the remaining one region.
Please help me with your inputs.
Note: I am very new to Cassandra, so whatever input you can give will be useful to me.
Thanks
If you have a replication factor of 2 and use a CL of QUORUM, you will not tolerate failure, i.e. if a node goes down and you only get 1 ack, that's not a majority of responses.
If you deploy across multiple regions, each region is, as you mention, a DC in your cluster. Each individual DC is a complete replica of all your data, i.e. it will hold all the data for your keyspace. If you read/write at a LOCAL_* consistency level (e.g. LOCAL_ONE, LOCAL_QUORUM) within each region, then you can tolerate the loss of the other regions.
The number of replicas in each DC/region and the consistency level you are using to read/write in that DC will determine how much failure you can tolerate. If you are using QUORUM, that is a cross-DC consistency level: it requires a majority of acks from ALL replicas in your cluster across all DCs. If you lose 2 regions, then it's unlikely that you will get a quorum of responses.
Also, it's worth remembering that Cassandra can be made aware of the AZs it is deployed in within a region and can do its best to ensure replicas of your data are placed in multiple AZs. This will give you even better tolerance to failure.
If this were me and I didn't need a strong cross-DC consistency level (like QUORUM), I would have 4 nodes in each region, deployed across the AZs, with a replication factor of 3 in each region. I would then read/write at LOCAL_QUORUM or LOCAL_ONE (preferably). If you go with LOCAL_ONE, you could have fewer replicas in each DC, e.g. a replication factor of 2 with LOCAL_ONE means you could tolerate the loss of 1 replica.
However, this would be more expensive than what you're initially suggesting, but (for me) that would be the minimum setup I would need if I wanted to be in multiple regions and tolerate the loss of 2 of them. You could go with 3 nodes in each region if you wanted to really save costs.
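A sketch of that suggested layout (data center names, keyspace, and table are placeholders; shown as CQL executed through the DataStax Python driver):

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

session = Cluster(["10.0.0.1"]).connect()

# 3 replicas in every region/DC, so each region holds a full copy of the data.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS app
    WITH replication = {'class': 'NetworkTopologyStrategy',
                        'region_a': 3, 'region_b': 3, 'region_c': 3}
""")
session.execute("CREATE TABLE IF NOT EXISTS app.orders (id int PRIMARY KEY, total decimal)")

# LOCAL_QUORUM needs only 2 of the 3 replicas in the local DC, so losing the
# other two regions does not block reads or writes in the surviving one.
query = SimpleStatement("SELECT * FROM app.orders WHERE id = %s",
                        consistency_level=ConsistencyLevel.LOCAL_QUORUM)
session.execute(query, [42])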

Cassandra multiple nodes in different data centers on same server

Just want to know if I can configure multiple nodes from different data centers on the same physical server. Example: I want to have 2 data centers with 3 nodes each; 1 node from each data center will be on each server.
In total: 2 data centers, 6 nodes, on 3 physical servers.
You can technically configure it as you describe; however, a data center is typically thought of as a location, so configuring nodes as two data centers when they are not actually in two locations is confusing (especially for anyone who has to troubleshoot the environment later).
A best practice would be a topology of 3 nodes in each data center (actually physically located in each data center). Then you could configure the cluster to keep your data in both data centers for availability and also have appropriate latency within a single data center for all reads, writes, etc.
For example, using an RF of 3 in each data center and a consistency level of LOCAL_QUORUM would balance data availability while reducing request latency. This configuration ensures the read/write occurs within a single data center (lower latency than across data centers) while the data is still saved across both data centers (an eventually consistent design).
Yes, it is possible to follow the topology you have listed, but think about the following scenario:
With two nodes from different DCs on a single machine, there is a high chance that a given piece of data will have replicas from two different data center nodes sitting on that single machine. If the single machine fails, you would lose two copies of that piece of data.
Assuming you have an RF of DC1:2, DC2:2 and use a CL of QUORUM, you would need 3 of the 4 replicas (floor(4/2) + 1) to respond to read requests. With one physical server down, a piece of data could be missing 2 of its replicas, so your reads will fail, and writes at the same CL will also fail.
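To make the quorum arithmetic explicit, a tiny illustrative calculation (not driver code; the DC names are just labels):

# QUORUM counts replicas across all data centers: floor(total_rf / 2) + 1.
rf = {"DC1": 2, "DC2": 2}
quorum = sum(rf.values()) // 2 + 1
print(quorum)   # 3 -- a row that had 2 replicas on the failed server has only 2 left, below quorum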

Cassandra: 6 node cluster, RF=2: What to do when 2 nodes crash?

Good Day
We have a 6-node Cassandra cluster with a replication factor of 3 on our keyspaces. Our applications use QUORUM, so we can survive the loss of a single node without it affecting the application.
Let's assume I lose 2 nodes at the same time. If my application were using a consistency level of ONE, it would have been fine and would have run without any issues, but we would like to keep the level at QUORUM.
My question is: if 2 nodes crash at the same time and I do a nodetool removenode for each of the crashed nodes, will the cluster then rebalance the data over the remaining 4 nodes (getting it back to 3 replicas), and once that is done, should my application be able to work again using QUORUM?
In the title you write RF=2, in the text RF=3. You did not specify the Cassandra version or whether you are using single-token or vnodes. A QUORUM CL with RF = 3 means that 2 nodes must acknowledge a write/read before returning. It is possible that you face minimal or no issues even if 2 nodes die; it depends on how many common ranges (partitions) the nodes share.
Take a look at this distribution example, which is exactly like the one you describe: RF 3, 6 nodes.
Using single tokens:
If you lose pairs like (1,4), (2,5) or (3,6), your cluster should allow all writes and reads with no issues. A good client will recognize the nodes that are down and won't use them as coordinators anymore. Other situations, for example the loss of nodes (1,6), might lead to a situation in which any r/w of the E and F token ranges will fail (assuming an equal distribution, about 33% of r/w operations will fail).
Using vnodes:
Here the situation is slightly different and also depends on which pair you lose. If you repeat the worst scenario above and lose a pair of nodes like (1,6), only the B token range will be affected in r/w operations, since it's the only range shared between them.
That said, just to clarify the possible scenarios, here's your answer: nodetool removenode should be used as explained in this document. Use removenode IF AND ONLY IF you want to reduce the cluster size (here is what to do if you want to replace a dead node). Once you have done that, your application will start working again using QUORUM, since other nodes will be responsible for the partitions previously assigned to the dead nodes.
If you are using the official DataStax Java driver, you might want to let the driver temporarily fight your monsters by specifying a DowngradingConsistencyRetryPolicy.
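For illustration, the analogous knob in the DataStax Python driver looks roughly like this (a sketch; the contact point and keyspace are placeholders, and the downgrading policy trades consistency for availability, so use it knowingly):

from cassandra.cluster import Cluster
from cassandra.policies import DowngradingConsistencyRetryPolicy

# If a QUORUM request cannot gather enough replica responses, the policy retries
# it at a lower consistency level instead of failing the request outright.
cluster = Cluster(["10.0.0.1"],
                  default_retry_policy=DowngradingConsistencyRetryPolicy())
session = cluster.connect("my_keyspace")   # keyspace name is hypothetical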
HTH,
Carlo
