Is it good to create virtual machines (nodes) to get better performance on Cassandra?

I know Cassandra works well in a multi-node setup: the more nodes, the better the performance. If I have two dedicated servers with the same hardware, would it be good to create some virtual machines on both of them to have more nodes, or not?
For example, I have two dedicated servers with these specifications:
1 TB hard drive
64 GB RAM
8-core CPU
I would then create 8 virtual machines (nodes) on each of them, each with:
~150 GB hard drive
8 GB RAM
a share of the 8-core CPU
So I would have 16 nodes. Would these 16 nodes perform better than 2 nodes on the two dedicated servers?
In other words, which side of this trade-off is better: more nodes with weaker hardware, or two stronger nodes?
I know it should be tested, but I want to know whether it is basically reasonable or not.

Adding new nodes always adds some overhead: they need to communicate with each other and sync their data. Therefore, the more nodes you add, the more overhead you should expect with each one. You would add more nodes only when the existing nodes can't handle the input/output demands. In the situation you are describing, the virtual machines would all be writing to the same disk, so adding more nodes would effectively slow down your cluster.
Imagine the situation: you have a server, it receives some data and writes it to disk. Now imagine the same situation where the disk is shared between two servers, and they both write information to the same disk at almost the same time. The two servers also spend CPU cycles communicating with each other to confirm the data has been written so they can stay in sync. I think this is enough information to explain why what you are considering is not a good idea if you can avoid it.
EDIT:
Of course, this is only a layman's explanation. C* has a very nice architecture in which data is spread according to an algorithm across a certain range of nodes (not all of them), and when you query for a specific key, that algorithm can tell you where to find the data. That said, when you add or remove nodes, the new nodes have to tell the cluster that they want to share 'the burden', and as a result a recalculation of what is known as the 'token ring' takes place, at the end of which data may be shuffled around so it remains accessible in a predictable way.
You can take a look at this:
http://www.datastax.com/dev/blog/upgrading-an-existing-cluster-to-vnodes-2
But in general, there is indeed some overhead when nodes communicate with each other, although the number of nodes will almost never dramatically impact your query speed, positively or negatively, if you are querying for a single key.
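To make the token-ring idea a bit more concrete, here is a toy consistent-hashing sketch in Python (the hash function and node names are purely illustrative; C* actually uses Murmur3 tokens and vnodes, and you never implement this yourself):

import hashlib
from bisect import bisect_right

def token(value):
    # Hash a node name or partition key onto a numeric ring (illustrative only).
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

# Each node owns one token on the ring; real clusters use vnodes, i.e. many
# tokens per node.
nodes = ["node-a", "node-b", "node-c"]
ring = sorted((token(n), n) for n in nodes)

def owner(key):
    # The replica for a key is found by walking clockwise from the key's hash
    # to the next node token, wrapping around the ring.
    tokens = [t for t, _ in ring]
    i = bisect_right(tokens, token(key)) % len(ring)
    return ring[i][1]

print(owner("user:42"))  # any node can compute this locally, without asking the others

This is why a lookup for a single key costs roughly the same whether you have 2 nodes or 16: the coordinator computes the owner directly instead of broadcasting the query.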

"I know it should be tested, but I want to know basically is it reasonable or not?"
That will answer most of your assumptions.
The basic advantage of using Cassandra is availability. If you are planning to have just two dedicated servers, then there is a question mark over your data availability: in the worst case, you only ever have two replicas of the data at any point in time.
My take is to go for a dedicated setup nicely split into small chunks. Everything boils down to your use case:
1. If you have a lot of data flowing in and you consider data as king (in which case you need more replicas to handle failures), I would prefer a high-end distributed setup.
2. If it is the other way around (data is not your forte and is just another part of your setup), you can go with the setup you have described.
3. If you have a cost constraint and you are a start-up with a minimal amount of data that is important to you, set up the two nodes you have with a replication factor of 2 (SimpleStrategy) and a replication factor of 1 (NetworkTopologyStrategy); a sketch of declaring these follows below.
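For illustration, here is a minimal sketch of how those replication settings are declared, using the Python cassandra-driver (the contact point and keyspace names are hypothetical; adjust them to your cluster):

from cassandra.cluster import Cluster

# Connect to one of the two nodes (address is hypothetical).
cluster = Cluster(["10.0.0.1"])
session = cluster.connect()

# Replication factor 2 with SimpleStrategy.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS app_data
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2}
""")

# Replication factor 1 per data center with NetworkTopologyStrategy
# (assumes a data center named 'dc1' in your snitch configuration).
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS app_logs
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 1}
""")

cluster.shutdown()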

Related

What is locality in Graph Matching problem and Distributed models?

I’m a beginner in the fields of graph matching and parallel computing. I read a paper about an efficient parallel matching algorithm. The authors explain the importance of locality, but I don't know what it represents, or what good and bad locality are.
Our distributed memory parallelization (using MPI) on p processing elements (PEs or MPI processes) assigns nodes to PEs and stores all edges incident to a node locally. This can be done in a load balanced way if no node has degree exceeding m/p. The second pass of the basic algorithm from Section 2 has to exchange information on candidate edges that cross a PE boundary. In the worst case, this can involve all edges handled by a PE, i.e., we can expect better performance if we manage to keep most edges locally. In our experiments, one PE owns nodes whose numbers are a consecutive range of the input numbers. Thus, depending on how much locality the input numbering contains we have a highly local or a highly non-local situation.
Generally speaking, locality in distributed models is the extent to which a global solution to a computational problem can be obtained from locally available data.
Good locality is when most nodes can construct solutions using local data, since they require less communication to obtain any missing data. Bad locality is when a node spends an undesirable amount of time fetching remote data rather than computing a solution from local data.
Think of a simple distributed computer system which comprises a collection of computers each somewhat like a desktop PC, in as much as each one has a CPU and some RAM. (These are the nodes mentioned in the question.) They are assembled into a distributed system by plugging them all into the same network.
Each CPU has memory-bus access (very fast) to data stored in its local RAM. The same CPU's access to data in the RAM on another computer in the system will run across the network (much slower) and may require co-operation with the CPU on that other computer.
Locality is a property of the data used in the algorithm: local data is on the same computer as the CPU, non-local data is elsewhere in the distributed system. I trust it is clear that parallel computations can proceed more quickly the more each CPU has to work only with local data. So the designers of parallel programs for distributed systems pay great attention to the placement of data, often seeking to minimise the number and sizes of data exchanges between processing elements.
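As a rough illustration of how you might measure locality for the edge-partitioning scheme quoted in the question, here is a small Python sketch (the toy graph, the node-to-PE assignment, and the helper name are all made up):

def locality_fraction(edges, node_to_pe):
    # Fraction of edges whose two endpoints live on the same PE; edges that
    # cross a PE boundary have to be exchanged over the network.
    local = sum(1 for u, v in edges if node_to_pe[u] == node_to_pe[v])
    return local / len(edges) if edges else 1.0

# Toy example: 6 nodes split across 2 PEs in consecutive ranges, as in the
# quoted paper's experiments.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5)]
node_to_pe = {n: 0 if n < 3 else 1 for n in range(6)}
print(locality_fraction(edges, node_to_pe))  # 4 of the 6 edges are local

The higher that fraction, the less candidate-edge information has to cross PE boundaries in the exchange pass the paper describes.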
Complication, unnecessary for understanding the key issues: of course on real distributed systems many of the individual CPUs are multi-core, and in some designs multiple multi-core CPUs will share the same enclosure and have approximately memory-bus-speed access to all the RAM in the same enclosure. Which makes for a node which itself is a shared-memory computer. But that's just detail and a topic for another answer.

H2O cluster uneven distribution of performance usage

I set up a cluster with a 4-core (2 GHz) and a 16-core (1.8 GHz) virtual machine. The creation of and connection to the cluster work without problems. But now I want to do some deep learning on the cluster, and I see an uneven distribution of CPU usage across the two virtual machines: the one with 4 cores is always at 100% CPU usage while the 16-core machine is idle most of the time.
Do I have to do additional configuration when creating the cluster? It seems odd to me that the stronger machine of the two is idle while the weaker one does all the work.
Best regards,
Markus
Two things to keep in mind here.
Your data needs to be large enough to take advantage of data parallelism. In particular, the number of chunks per column needs to be large enough for all the cores to have work to do. See this answer for more details: H2O not working on parallel
H2O-3 assumes your nodes are symmetric. It doesn't try to load balance work across the cluster based on capability of the nodes. Faster nodes will finish their work first and wait idle for the slower nodes to catch up. (You can see this same effect if you have two symmetric nodes but one of them is busy running another process.)
Asymmetry is a bigger problem for memory (where smaller nodes can run out of memory and fail entirely) than it is for CPU (where some nodes are just waiting around). So always make sure to start each H2O node with the same value of -Xmx.
You can limit the number of cores H2O uses with the -nthreads option. So you can try giving each of your two nodes -nthreads 4 and see if they behave more symmetrically with each using roughly four cores. In the case you describe, that would mean the smaller machine is roughly 100% utilized and the larger machine is roughly 25% utilized. (But since the two machines probably have different chips, the cores are probably not identical and won't balance perfectly, which is OK.)
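For example, here is a minimal sketch of pinning the thread count and heap when starting a node, using the h2o Python API (the R bindings and the java -jar launcher expose equivalent nthreads/-Xmx options; the 8 GB heap here is just an assumed value):

import h2o

# Start (or attach to) an H2O node with a capped thread count and a fixed
# heap; use the same values on every node so the cluster stays symmetric.
h2o.init(nthreads=4, max_mem_size="8G")

Use the same settings (or pass the equivalent flags to the java -jar launcher) on every node before forming the cluster.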
[I'm ignoring the virtualization aspect completely, but CPU shares could also come into the picture depending on the configuration of your hypervisor.]

Improve h2o DRF runtime on a multi-node cluster

I am currently running h2o's DRF algorithm on a 3-node EC2 cluster (the h2o server spans all 3 nodes).
My data set has 1M rows and 41 columns (40 predictors and 1 response).
I use the R bindings to control the cluster, and the RF call is as follows:
model <- h2o.randomForest(x = x,
                          y = y,
                          ignore_const_cols = TRUE,
                          training_frame = train_data,
                          seed = 1234,
                          mtries = 7,
                          ntrees = 2000,
                          max_depth = 15,
                          min_rows = 50,
                          stopping_rounds = 3,
                          stopping_metric = "MSE",
                          stopping_tolerance = 2e-5)
For the 3-node cluster (c4.8xlarge, enhanced networking turned on), this takes about 240sec; the CPU utilization is between 10-20%; RAM utilization is between 20-30%; network transfer is between 10-50MByte/sec (in and out). 300 trees are built until early stopping kicks in.
On a single-node cluster, I can get the same results in about 80sec. So, instead of an expected 3-fold speed up, I get a 3-fold slow down for the 3-node cluster.
I did some research and found a few resources that were reporting the same issue (not as extreme as mine though). See, for instance:
https://groups.google.com/forum/#!topic/h2ostream/bnyhPyxftX8
Specifically, the author of http://datascience.la/benchmarking-random-forest-implementations/ notes that:
"While not the focus of this study, there are signs that running the distributed random forests implementations (e.g. H2O) on multiple nodes does not provide the speed benefit one would hope for (because of the high cost of shipping the histograms at each split over the network)."
Also https://www.slideshare.net/0xdata/rf-brighttalk points at 2 different DRF implementations, where one has a larger network overhead.
I think that I am running into the same problems as described in the links above.
How can I improve h2o's DRF performance on a multi-node cluster?
Are there any settings that might improve runtime?
Any help highly appreciated!
If your Random Forest is slower on a multi-node H2O cluster, it just means that your dataset is not big enough to take advantage of distributed computing. There is an overhead to communicate between cluster nodes, so if you can train your model successfully on a single node, then using a single node will always be faster.
Multi-node is designed for when your data is too big to train on a single node. Only then will it be worth using multiple nodes. Otherwise, you are just adding communication overhead for no reason and will see the type of slowdown that you observed.
If your data fits into memory on a single machine (and you can successfully train a model w/o running out of memory), the way to speed up your training is to switch to a machine with more cores. You can also play around with certain parameter values which affect training speed to see if you can get a speed-up, but that usually comes at a cost in model performance.
As Erin says, often adding more nodes just adds the capability for bigger data sets, not quicker learning. Random forest might be the worst; I get fairly good results with deep learning (e.g. 3x quicker with 4 nodes, 5-6x quicker with 8 nodes).
In your comment on Erin's answer you mention the real problem is you want to speed up hyper-parameter optimization? It is frustrating that h2o.grid() doesn't support building models in parallel, one on each node, when the data will fit in memory on each node. But you can do that yourself, with a bit of scripting: set up one h2o cluster on each node, do a grid search with a subset of hyper-parameters on each node, have them save the results and models to S3, then bring the results in and combine them at the end. (If doing a random grid search, you can run exactly the same grid on each cluster, but it might be a good idea to explicitly use a different seed on each.)
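For what it's worth, here is a minimal sketch of that per-node grid approach using the h2o Python API (the node address, file paths, and hyper-parameter values are all hypothetical; the R bindings have the same functions):

import h2o
from h2o.estimators import H2ORandomForestEstimator
from h2o.grid.grid_search import H2OGridSearch

# Run one copy of this script per node, each pointing at its own
# single-node H2O cluster (address is hypothetical).
h2o.init(ip="10.0.0.1", port=54321)

train = h2o.import_file("/data/train.csv")      # hypothetical path
x, y = train.columns[:-1], train.columns[-1]

# Random grid over a subset of hyper-parameters; give each node a
# different seed so they explore different points of the same grid.
grid = H2OGridSearch(
    model=H2ORandomForestEstimator(ntrees=2000, stopping_rounds=3,
                                   stopping_metric="MSE"),
    hyper_params={"max_depth": [10, 15, 20], "mtries": [5, 7, 9]},
    search_criteria={"strategy": "RandomDiscrete", "max_models": 20, "seed": 1},
)
grid.train(x=x, y=y, training_frame=train)

# Save the best model locally, then copy it (e.g. to S3) so the results
# from all nodes can be compared and combined at the end.
best = grid.get_grid(sort_by="mse", decreasing=False).models[0]
h2o.save_model(best, path="/tmp/grid-node1")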

compare B+tree implementation: storing internal nodes on disk

Is there any implementation in which the internal nodes of a B+tree are also stored on disk? I am just wondering if anyone is aware of such an implementation or sees a real advantage in doing it this way. Normally, one stores the leaf nodes on disk and builds the B+tree as needed.
But it is also possible to save the current state of the B+tree's internal nodes (by replacing the pointers with the disk block numbers they point to). I see there are other challenges, like keeping the internal nodes in memory in sync with the disk blocks, but the B+tree may be implemented on NVRAM, say battery-backed DRAM, or kept in sync by some other method.
Just wondering if anyone has already implemented it this way, like Linux's bcache or some other implementation?
cheers, cforfun!
All persistent B+Tree implementations I've ever seen - as opposed to pure 'transient' in-memory structures - store both node types on disk.
Not doing so would require scanning all the data (the external nodes, a.k.a. the 'sequence set') on every load in order to rebuild the index, something that is feasible only when you're dealing with piddling small amounts of data or very special circumstances.
I've seen single-user implementations that sync the disk image only when the page manager ejects a dirty page and on program shutdown, which has the effect that often-used internal nodes - which are rarely replaced/ejected - can go without sync-to-disk for a long time. This is somewhat justified by the fact that internal ('index') nodes can be rebuilt after a crash, so that only the external ('data') nodes need the full fault-tolerant persistence treatment. The advantage of such schemes is that they eliminate the wasted writes for nodes close to the root whose update frequency is fairly high. Think SSDs, for example.
One way of increasing disk efficiency for persisted in-memory structures is to persist only the log to disk, and to rebuild the whole tree from the log on each restart. One very successful Java package uses this approach to great advantage.
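As a rough sketch of that log-based scheme (the record format and file name are made up; a plain dict stands in for the in-memory tree):

import json
import os

LOG_PATH = "tree.log"  # hypothetical log file

def append_op(op, key, value=None):
    # The log is the only thing persisted; every mutation is appended and
    # fsynced before the call returns.
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps({"op": op, "key": key, "value": value}) + "\n")
        f.flush()
        os.fsync(f.fileno())

def rebuild():
    # On restart, replay the log to rebuild the in-memory structure.
    tree = {}
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as f:
            for line in f:
                rec = json.loads(line)
                if rec["op"] == "put":
                    tree[rec["key"]] = rec["value"]
                elif rec["op"] == "delete":
                    tree.pop(rec["key"], None)
    return tree

append_op("put", "k1", "v1")
append_op("delete", "k1")
tree = rebuild()  # empty again after replaying both operations

A real implementation would also checkpoint and truncate the log periodically so that replay time stays bounded.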

Distributed and replicated data storage for small amounts of data under Windows

We're looking for a good solution to a caching problem. We'd like to distribute a relatively small amount of data (perhaps 10's of GBs) among a cluster of web servers such that:
The data is replicated to all nodes
The data is persistent
The data can be accessed locally
Our motivation for a caching solution is that we currently have a single point of failure: a SQL Server database. We're unable to set up a fail-over cluster for this database, unfortunately. We're already using Memcached to a large extent, but we want to avoid the problem where, if a Memcached node goes down, we'd suddenly have a large number of cache misses and therefore a massive number of requests to one endpoint.
We'd prefer instead to have local persistent caches on each web server node so that the resulting load would be distributed. When a retrieval is made, it would pass through the following:
Check for data in Memcached. If it's not there...
Check for data in local persistent storage. If it's not there...
Retrieve data from the database.
When data changes, the cache key is invalidated at both caching layers.
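For illustration, a minimal read-through sketch of that lookup chain with invalidation on writes (the memcached, local_store, and database objects are hypothetical stand-ins for our Memcached client, the local persistent cache, and the SQL Server access layer):

def get(key, memcached, local_store, database):
    # 1. shared cache, 2. local persistent cache, 3. the database.
    value = memcached.get(key)
    if value is not None:
        return value
    value = local_store.get(key)
    if value is None:
        value = database.fetch(key)      # last resort: hit SQL Server
        local_store.put(key, value)      # repopulate the local layer
    memcached.set(key, value)            # repopulate the shared layer
    return value

def update(key, value, memcached, local_store, database):
    # Write the source of truth first, then invalidate both caching layers.
    database.write(key, value)
    local_store.delete(key)
    memcached.delete(key)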
We've been looking at a bunch of potential solutions, but none of them seem to match exactly what we need:
CouchDB
This is pretty close; the data model we'd like to cache is very document-oriented. However, its replication model isn't exactly what we're looking for. It seems to me as though replication is an action you have to perform rather than a permanent relationship among nodes. You can set up continuous replication, but this doesn't persist between restarts.
Cassandra
This solution seems to be mostly geared toward those with large storage requirements. We have a large number of users, but small amounts of data. Cassandra looks able to support any number of fail-over nodes, but 100% replication among nodes doesn't seem to be what it's intended for; instead, it seems more geared toward distribution only.
SAN
One attractive idea is that we can store a bunch of files on a SAN or similar type of appliance. I haven't worked with these before, but it seems like this would still be a single point of failure; if the SAN goes down, we'd suddenly be going to the database for all cache misses.
DFS Replication
A simple Google search revealed this. It seems to do what we want; it synchronizes files across all nodes in a replication cluster. But the marketing text makes it look like it's more of a system for ensuring documents are copied to different office locations. Also, it has limits, like a file count maximum, that wouldn't work well for us.
Have any of you had similar requirements to ours and found a good solution that meets your needs?
We've been using Riak successfully in production for several months now for a problem that's somewhat similar to what you describe. We too have evaluated CouchDB and Cassandra before.
The advantage of Riak for this sort of problem, in my opinion, is that distribution and data replication are at the core of the system. You define how many replicas of the data you want across the cluster and it takes care of the rest (it's a bit more complicated than that, of course, but that's the essence). We went through adding nodes, removing nodes, and having nodes crash, and it has proven surprisingly resilient.
It's a lot like Couch in other respects: document-oriented, REST interface, Erlang.
You could also check out Hazelcast.
It does not persist the data, but it provides a fail-over system: each node can have a number of other nodes back up its data in case a node fails.
