Why should an HDFS cluster not be stretched across DCs?

It's easy to find well regarded references stating that HDFS should not be stretched across data centers [1], while Kafka should be stretched [2].
What specific issues make HDFS ill-suited to being stretched?
I'm considering stretching HDFS across two DCs that are less than 50km apart, with an average latency of less than 1ms. I'm planning on running a soak test spanning a couple of weeks, with representative read and write workloads, but with volumes of a few hundred GB - orders of magnitude less than the cluster will store in a few years.
If the tests succeed, what level of confidence does this provide that stretching HDFS is likely to succeed? Specifically, are issues related to the relatively long inter-host latency likely to remain hidden, only to be exposed at far larger volumes, e.g. a couple of hundred TB?
Finally, if the inter-DC latency spikes, e.g. to 10ms for a few minutes, what issues am I likely to encounter?
[1] Tom White: Hadoop: The Definitive Guide
[2] https://www.confluent.io/blog/design-and-deployment-considerations-for-deploying-apache-kafka-on-aws/
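One concrete latency-sensitive path worth reasoning about is the HDFS write pipeline: the client streams packets through all replicas in sequence and waits for acknowledgements, with a bounded number of packets in flight. A back-of-envelope sketch of the per-stream throughput cap (the 64 KB packet size and 80-packet window are assumed common defaults, not measurements from any cluster):

```python
# Rough per-stream cap for an HDFS write whose replica pipeline crosses
# the inter-DC link: at most MAX_IN_FLIGHT unacknowledged packets at a
# time, so throughput <= in-flight bytes / RTT.
# 64 KB packets and an 80-packet window are assumed defaults.

PACKET_BYTES = 64 * 1024
MAX_IN_FLIGHT = 80

def write_cap_mb_s(rtt_s):
    in_flight_bytes = PACKET_BYTES * MAX_IN_FLIGHT   # ~5 MB in flight
    return in_flight_bytes / rtt_s / 1e6

print(round(write_cap_mb_s(0.001)))   # 1 ms RTT  -> ~5243 MB/s
print(round(write_cap_mb_s(0.010)))   # 10 ms RTT -> ~524 MB/s
```

Under these assumptions a single write stream stays well above a 1GbE link even during a 10ms spike, which suggests the bigger risks are elsewhere: timeouts, NameNode operations, and the aggregate cross-DC replication traffic of many concurrent streams.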

Related

H2O cluster uneven distribution of performance usage

I set up a cluster with a 4-core (2GHz) and a 16-core (1.8GHz) virtual machine. Creating and connecting to the cluster works without problems. But when I run deep learning on the cluster, I see an uneven distribution of CPU usage across the two virtual machines: the one with 4 cores is always at 100% CPU usage while the 16-core machine is idle most of the time.
Do I have to make additional configuration changes when creating the cluster? It seems odd to me that the stronger of the two machines is idle while the weaker one does all the work.
Best regards,
Markus
Two things to keep in mind here.
Your data needs to be large enough to take advantage of data parallelism. In particular, the number of chunks per column needs to be large enough for all the cores to have work to do. See this answer for more details: H2O not working on parallel
H2O-3 assumes your nodes are symmetric. It doesn't try to load balance work across the cluster based on capability of the nodes. Faster nodes will finish their work first and wait idle for the slower nodes to catch up. (You can see this same effect if you have two symmetric nodes but one of them is busy running another process.)
Asymmetry is a bigger problem for memory (where smaller nodes can run out of memory and fail entirely) than it is for CPU (where some nodes are just waiting around). So always make sure to start each H2O node with the same value of -Xmx.
You can limit the number of cores H2O uses with the -nthreads option. So you can try giving each of your two nodes -nthreads 4 and see if they behave more symmetrically with each using roughly four cores. In the case you describe, that would mean the smaller machine is roughly 100% utilized and the larger machine is roughly 25% utilized. (But since the two machines probably have different chips, the cores are probably not identical and won't balance perfectly, which is OK.)
[I'm ignoring the virtualization aspect completely, but CPU shares could also come into the picture depending on the configuration of your hypervisor.]
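The expected utilization in the -nthreads 4 scenario above is simple arithmetic; a toy check of that claim:

```python
# Expected CPU utilization when each H2O node is capped at the same
# thread count (-nthreads 4), on a 4-core and a 16-core machine.
# This is just the arithmetic from the answer, not an H2O API call.

def utilization(nthreads, cores):
    return min(nthreads, cores) / cores

print(utilization(4, 4))    # 1.0  -> small machine fully busy
print(utilization(4, 16))   # 0.25 -> big machine ~25% busy
```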

Improve h2o DRF runtime on a multi-node cluster

I am currently running h2o's DRF algorithm on a 3-node EC2 cluster (the h2o server spans all 3 nodes).
My data set has 1m rows and 41 columns (40 predictors and 1 response).
I use the R bindings to control the cluster and the RF call is as follows
model = h2o.randomForest(x = x,
                         y = y,
                         ignore_const_cols = TRUE,
                         training_frame = train_data,
                         seed = 1234,
                         mtries = 7,
                         ntrees = 2000,
                         max_depth = 15,
                         min_rows = 50,
                         stopping_rounds = 3,
                         stopping_metric = "MSE",
                         stopping_tolerance = 2e-5)
For the 3-node cluster (c4.8xlarge, enhanced networking turned on), this takes about 240sec; the CPU utilization is between 10-20%; RAM utilization is between 20-30%; network transfer is between 10-50MByte/sec (in and out). 300 trees are built until early stopping kicks in.
On a single-node cluster, I can get the same results in about 80sec. So, instead of an expected 3-fold speed up, I get a 3-fold slow down for the 3-node cluster.
I did some research and found a few resources that were reporting the same issue (not as extreme as mine though). See, for instance:
https://groups.google.com/forum/#!topic/h2ostream/bnyhPyxftX8
Specifically, the author of http://datascience.la/benchmarking-random-forest-implementations/ notes that
While not the focus of this study, there are signs that running the distributed random forests implementations (e.g. H2O) on multiple nodes does not provide the speed benefit one would hope for (because of the high cost of shipping the histograms at each split over the network).
Also https://www.slideshare.net/0xdata/rf-brighttalk points at 2 different DRF implementations, where one has a larger network overhead.
I think that I am running into the same problems as described in the links above.
How can I improve h2o's DRF performance on a multi-node cluster?
Are there any settings that might improve runtime?
Any help highly appreciated!
If your Random Forest is slower on a multi-node H2O cluster, it just means that your dataset is not big enough to take advantage of distributed computing. There is an overhead to communicate between cluster nodes, so if you can train your model successfully on a single node, then using a single node will always be faster.
Multi-node is designed for when your data is too big to train on a single node. Only then will it be worth using multiple nodes; otherwise, you are just adding communication overhead for no reason and will see the kind of slowdown you observed.
If your data fits into memory on a single machine (and you can successfully train a model w/o running out of memory), the way to speed up your training is to switch to a machine with more cores. You can also play around with certain parameter values which affect training speed to see if you can get a speed-up, but that usually comes at a cost in model performance.
As Erin says, often adding more nodes just adds the capability for bigger data sets, not quicker learning. Random forest might be the worst; I get fairly good results with deep learning (e.g. 3x quicker with 4 nodes, 5-6x quicker with 8 nodes).
In your comment on Erin's answer you mention that the real problem is that you want to speed up hyper-parameter optimization. It is frustrating that h2o.grid() doesn't support building models in parallel, one on each node, when the data fits in memory on each node. But you can do that yourself with a bit of scripting: set up one h2o cluster on each node, do a grid search over a subset of hyper-parameters on each node, have them save the results and models to S3, then bring the results in and combine them at the end. (If doing a random grid search, you can run exactly the same grid on each cluster, but it is a good idea to explicitly use a different seed on each.)
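The manual "one grid per node" idea can be sketched in plain Python: take the cartesian product of the hyper-parameter values and deal the combinations round-robin across nodes, so each H2O cluster trains a disjoint slice (the parameter values here are illustrative; the h2o.grid calls and S3 steps are omitted):

```python
import itertools

# Hypothetical hyper-parameter space for the DRF grid search.
grid = {
    "mtries":    [5, 7, 9],
    "max_depth": [10, 15, 20],
    "min_rows":  [25, 50],
}

def split_grid(grid, n_nodes):
    """Round-robin the cartesian product of hyper-parameter values
    across n_nodes, so each cluster gets a disjoint slice to train."""
    keys = sorted(grid)
    combos = [dict(zip(keys, vals))
              for vals in itertools.product(*(grid[k] for k in keys))]
    return [combos[i::n_nodes] for i in range(n_nodes)]

slices = split_grid(grid, 3)
print([len(s) for s in slices])   # 18 combos -> [6, 6, 6]
```

Each node then runs its slice against a local copy of the training frame and writes its best models out; combining the leaderboards at the end is a simple sort by validation metric.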

is there any apache storm cluster size limit?

I presume that having more nodes in a storm cluster increases the "keep-topology-alive" intra-cluster communication.
Given that the topology works fine with 10 nodes (2 or 4 CPU, 4GB RAM) for small data, can we scale the topology to 1,000 or 10,000 nodes and still be competitive for (very) big data? Is there any known practical limit?
Thanks
The scaling of a Storm cluster is limited by the speed of state storage in Zookeeper; most of that state is "heartbeats" from workers. The theoretical limit is more or less 1,200 nodes (depending on disk speed; 80MB/s write speed is assumed here). Obviously, using a faster disk will let things scale further.
However, people at Yahoo are working on an in-memory store for worker heartbeats. Their solution would increase the limit to about 6,250 nodes over Gigabit Ethernet connections; 10-Gigabit connections would raise this theoretical limit to 62,500 nodes. You can take a look at this Hadoop Summit 2015 presentation from Bobby Evans for further details.
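The ballpark node limits above follow from simple throughput arithmetic; a sketch with an assumed per-node heartbeat write rate (the 64 KB/s figure is chosen purely to illustrate the ~1,200-node ballpark, it is not a Storm constant):

```python
# Rough capacity model: Zookeeper must persist every worker heartbeat,
# so the node count is capped by store write throughput divided by the
# heartbeat bytes each node writes per second.
# 64 KB/s per node is an assumed, illustrative figure.

def max_nodes(store_mb_per_s, node_kb_per_s=64):
    return int(store_mb_per_s * 1024 / node_kb_per_s)

print(max_nodes(80))    # HDD at ~80 MB/s      -> 1280 nodes
print(max_nodes(125))   # saturating 1GbE link -> 2000 nodes
```

The same formula explains why moving heartbeats off disk and into memory raises the ceiling: the bottleneck becomes network throughput instead of disk write speed.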

Tasks taking longer over time in Apache Spark

I have a large dataset that I am trying to run with Apache Spark (around 5TB). I have noticed that when the job starts, it retrieves data really fast and the first stage of the job (a map transformation) gets done really fast.
However, after having processed around 500GB of data, that map transformation starts being slow and some of the tasks are taking several minutes or even hours to complete.
I am using 10 machines with 122 GB of RAM and 16 CPUs each, and I am allocating all resources to the worker nodes. I thought about increasing the number of machines, but is there anything else I could be missing?
I have tried with a small portion of my data set (30 GB) and it seemed to be working fine.
It seems that the stage gets completed locally in some nodes faster than in others. Driven from that observation, here is what I would try:
1. Cache the RDD that you process. Do not forget to unpersist it when you don't need it anymore. See: Understanding caching, persisting in Spark.
2. Check whether the partitions are balanced, which doesn't seem to be the case (that would explain why some local stages complete much earlier than others). Having balanced partitions is the holy grail in distributed computing, isn't it? :) See: How to balance my data across the partitions?
3. Reduce the communication costs, i.e. use fewer workers than you do now, and see what happens. Of course that heavily depends on your application. Sometimes communication costs become so big they dominate, so using fewer machines can speed up the job. However, I would do that only if steps 1 and 2 do not suffice.
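Whether the partitions are balanced can be checked rather than guessed: collect the per-partition record counts (in Spark, something like `rdd.glom().map(len).collect()`) and look at the skew. A pure-Python sketch of the skew metric, with made-up counts:

```python
# Toy skew check: given per-partition record counts, report how far the
# largest partition is from the mean. A ratio well above ~1.5 usually
# means some tasks will straggle while the rest sit idle.

def skew_ratio(partition_sizes):
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean

balanced = [1000, 990, 1010, 1005]
skewed   = [100, 120, 90, 3800]

print(round(skew_ratio(balanced), 2))  # ~1.01
print(round(skew_ratio(skewed), 2))    # ~3.7
```

If the ratio is high, repartitioning (or choosing a better partitioning key) is the fix before throwing more machines at the job.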
Without any more info, it would seem that at some point in the computation your data gets spilled to disk because there is no more space in memory.
It's just a guess; you should check your Spark UI.

Distributing Data Nodes Across Multiple Data Centers

Has anyone tried to test the performance of data nodes across multiple data centers, especially over networks with small pipes? I can't find much information on it, and what I have found is either old (circa 2010) or proprietary (DataStax seems to have something). I know Hadoop supports rack awareness, but as I said, I haven't seen any documentation on tuning a system for multiple data centers.
I've tried it with a 12 x DataNode cluster arranged in a 2:1 split between two data centers roughly 120 miles apart. Latency between the data centers was ~4ms across 2 x 1GbE pipes.
2 racks were configured in site A, 1 rack in site B; each "rack" had 4 machines in it. We were basically testing site B as a 'DR' site. The replication factor was set to 3.
Long story short, it works, but the performance was really, really bad. You definitely have to compress your source, map, and reduce outputs to shrink your write I/O, and if the links between sites are used for anything else, you will get timeouts while transferring data. TCP windowing effectively limited our transfers to around 4MB/s, instead of a potential 100MB/s+ on a 1GbE line.
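That windowing cap is the classic bandwidth-delay-product limit: a sender can have at most one window of unacknowledged data in flight per round trip. A sketch of the arithmetic (the 64 KB window is the classic TCP maximum without window scaling, and the inflated RTT is an assumption about queuing and retransmits under load, not a measurement):

```python
# Bandwidth-delay-product cap: throughput <= window_bytes / RTT.
# 64 KB is the classic TCP window without window scaling enabled.

WINDOW_BYTES = 64 * 1024

def tcp_cap_mb_s(rtt_s, window_bytes=WINDOW_BYTES):
    return window_bytes / rtt_s / 1e6

print(round(tcp_cap_mb_s(0.004), 1))   # nominal 4 ms RTT      -> 16.4 MB/s
print(round(tcp_cap_mb_s(0.016), 1))   # RTT inflated by load  -> 4.1 MB/s
```

Under these assumptions, an effective RTT in the mid-teens of milliseconds lands right around the ~4MB/s observed; enabling TCP window scaling (or tuning socket buffers) is the usual remedy on long fat pipes.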
Save yourself the headache and just use distcp jobs to replicate data!