Is there any Apache Storm cluster size limit? - apache-storm

I presume that having more nodes in a storm cluster increases the "keep-topology-alive" intra-cluster communication.
Given that the topology works fine with 10 nodes (2 or 4 CPU, 4GB RAM) for small data, can we scale the topology to 1,000 or 10,000 nodes and still be competitive for (very) big data? Is there any known practical limit?
Thanks

The scaling of a Storm cluster is limited by the speed of state storage in ZooKeeper, most of which is "heartbeats" from workers. The theoretical limit is roughly 1,200 nodes (this depends on disk speed; a write speed of 80 MB/s is assumed here). A faster disk would raise that limit.
However, people at Yahoo are working on an in-memory store for worker heartbeats. Their solution would raise the limit to about 6,250 nodes over Gigabit Ethernet connections, and 10 Gigabit connections would raise the theoretical limit to 62,500 nodes. See Bobby Evans's Hadoop Summit 2015 presentation for further details.
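As a rough sanity check on the figures above, the per-node heartbeat budget they imply can be worked out by dividing each store's throughput by the quoted node limit. This is only a back-of-envelope sketch: the decimal unit conversions and the helper names are my assumptions, not something taken from the presentation.

```python
# Back-of-envelope check of the figures quoted above.
# Assumption: decimal units (1 MB = 10**6 bytes, 1 Gbit/s = 10**9 bits/s).

QUOTED = [
    # (label, store throughput in bytes/s, quoted node limit)
    ("ZooKeeper on disk, 80 MB/s", 80 * 10**6, 1_200),
    ("in-memory store, 1 Gbit/s", 10**9 // 8, 6_250),
    ("in-memory store, 10 Gbit/s", 10 * 10**9 // 8, 62_500),
]


def implied_bytes_per_node(throughput_bytes_per_s, node_limit):
    """Heartbeat budget per node implied by a throughput and a node limit."""
    return throughput_bytes_per_s / node_limit


def node_limit(throughput_bytes_per_s, bytes_per_node_per_s):
    """Largest node count whose combined heartbeats fit in the throughput."""
    return int(throughput_bytes_per_s // bytes_per_node_per_s)


for label, throughput, nodes in QUOTED:
    rate = implied_bytes_per_node(throughput, nodes)
    print(f"{label}: {nodes} nodes -> about {rate / 1000:.1f} kB/s per node")
```

The two network figures imply the same budget per node, while the disk figure implies a larger one, so the limits were presumably derived under different per-node assumptions. The presentation is the authoritative source for the exact model.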

Related

Suggestions required in increasing utilization of yarn containers on our discovery cluster

Current setup
We have a 10-node discovery cluster.
Each node of this cluster has 24 cores and 264 GB of RAM. Keeping some memory and CPU aside for background processes, we are planning to use 240 GB of memory.
When it comes to container setup, since each container may need 1 core, we can have at most 24 containers per node, each with 10 GB of memory.
Clusters usually have containers with 1-2 GB of memory, but we are restricted by the number of cores available to us, or maybe I am missing something.
Problem statement
As our cluster is used extensively by data scientists and analysts, having just 24 containers does not suffice. This leads to heavy resource contention.
Is there any way we can increase the number of containers?
Options we are considering
If we ask the team to run many Tez queries together from a single file rather than separately, then at most one container would be held.
Requests
Is there any other possible way to manage our discovery cluster?
Is there any possibility of reducing the container size?
Can a vcore (as it's a logical concept) be shared by multiple containers?
Vcores are just a logical unit and are not in any way tied to a CPU core unless you are using YARN with CGroups and have yarn.nodemanager.resource.percentage-physical-cpu-limit enabled. Most tasks are rarely CPU-bound; more typically they are network I/O bound. So if you look at your cluster's overall CPU utilization and memory utilization, you should be able to resize your containers based on the wasted (spare) capacity.
You can measure utilization with a host of tools. sar, Ganglia and Grafana are the obvious ones, but you can also look at Brendan Gregg's Linux Performance tools for more ideas.
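To make the sizing trade-off concrete, here is a minimal sketch of how the container count per node falls out of the advertised memory and vcores. The function name and the "96 vcores, 2 GB containers" scenario are hypothetical illustrations; the advertised values would correspond to yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, and whether vcores are actually enforced depends on the scheduler and CGroups configuration.

```python
def containers_per_node(node_mem_gb, node_vcores,
                        container_mem_gb, container_vcores=1):
    """Containers that fit on one node when both memory and vcores are counted.

    Assumes the scheduler takes vcores into account; if it only considers
    memory, the memory term alone is the limit.
    """
    by_memory = node_mem_gb // container_mem_gb
    by_vcores = node_vcores // container_vcores
    return min(by_memory, by_vcores)


# Setup from the question: 240 GB usable, 24 vcores, 10 GB containers.
print(containers_per_node(240, 24, 10))

# Hypothetical resize: advertise 96 vcores (4 per physical core) and
# shrink containers to 2 GB, which is only sensible if measured CPU
# utilization shows spare capacity.
print(containers_per_node(240, 96, 2))
```

With the question's numbers both limits coincide at 24. In the hypothetical resize, vcores (96) rather than memory (120) become the binding limit.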

Why should an HDFS cluster not be stretched across DCs?

It's easy to find well regarded references stating that HDFS should not be stretched across data centers [1], while Kafka should be stretched [2].
What specific issues make HDFS ill-suited to being stretched?
I'm considering stretching HDFS across two DCs that are less than 50km apart, with an average latency of less than 1ms. I'm planning on running a soak test spanning a couple of weeks, with representative read and write workloads, but with volumes of a few hundred GB - orders of magnitude less than the cluster will store in a few years.
If the tests succeed, what level of confidence does this provide that stretching HDFS is likely to succeed? Specifically, are issues related to the relatively long inter-host latency likely to stay hidden and only be exposed with far larger volumes, e.g. a couple of hundred TB?
Finally, if the inter-DC latency spikes, e.g. to 10ms for a few minutes, what issues am I likely to encounter?
[1] Tom White: Hadoop: The Definitive Guide
[2] https://www.confluent.io/blog/design-and-deployment-considerations-for-deploying-apache-kafka-on-aws/

How to make RabbitMQ scalable?

I tried to test RabbitMQ, but I found that it has some problems:
If I create a cluster of 3 nodes, I can't publish/deliver more than 6,000 messages/s.
On the other hand, if I work with a single node, I can publish/deliver up to 25,000 messages/s.
This means that the more nodes I add, the more performance deteriorates.
But according to this article: https://blog.pivotal.io/pivotal/products/rabbitmq-hits-one-million-messages-per-second-on-google-compute-engine
they can publish more than 1 million messages per second, so how do they do that?
I want to make RabbitMQ process more than 1 million messages per second.
I resolved the problem by adding a load balancer.
The producers send data to the load balancer. The load balancer in turn is connected to many RabbitMQ nodes, but those nodes are not connected to each other (to avoid the synchronization that hurts performance).
This way, I can multiply the throughput (e.g. 3 nodes = 3x throughput).
It might depend on other factors such as your network or your hardware performance.
When reading a benchmark, always consider the environment in which the tests were run.
As for how to improve performance, you can upgrade your hardware or network if either is the limiting factor.
Switching to an SSD or using link aggregation on your network would be a good start.
In this test of RabbitMQ performance, the authors concluded that a small cluster will underperform a single-node cluster. More nodes need to be added to increase performance. This makes sense when you think about the overhead induced by the replication required in a distributed system, especially given that RabbitMQ's focus is reliability.
The following is mentioned in a blog post by RabbitMQ:
If you use quorum queues or mirrored queues, then each message will be delivered to multiple brokers. If you have a cluster of three brokers and quorum queues with a replication factor of 3, then every broker will receive every message. In that case, we’ve created a cluster for redundancy only. But we can also create larger clusters for scalability. We could have a cluster of 9 brokers, with quorum queues with a rep factor of 3 and now we’ve spread that load out and can handle a much larger total throughput.
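The arithmetic behind that quote can be sketched as follows. This is a simplified model, assuming queues (and their replicas) are spread evenly across brokers; the function name is hypothetical and it ignores leader placement and other RabbitMQ specifics.

```python
def per_broker_share(brokers, replication_factor):
    """Fraction of all messages each broker handles on average.

    Assumes replicas are spread evenly over the brokers (an idealization).
    """
    if not 1 <= replication_factor <= brokers:
        raise ValueError("replication factor must be between 1 and broker count")
    return replication_factor / brokers


for brokers in (3, 9):
    share = per_broker_share(brokers, 3)
    print(f"{brokers} brokers, replication factor 3: "
          f"each broker handles about {share:.0%} of messages")
```

With 3 brokers every broker sees every message, so adding the cluster buys redundancy but no throughput. With 9 brokers each one sees about a third of the traffic under this model.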

Cloudera 5.4.4 Cluster - Getting aggregate usage metrics

I would like to collect aggregate usage metrics from a Cloudera 5.4.4 Hadoop cluster. Some of the metrics I have in mind are:
1. Average CPU utilization of the cluster per day/per week
2. Top n longest running jobs/queries on Hadoop
3. Top n users who use the cluster most (by utilization, by number of submitted jobs)
4. Cluster disk usage vs disk capacity
5. Cluster disk usage growth over time
Are there any APIs/resources/tools etc. that I could use to get started with this? I am not entirely sure where to begin. Any starting point would be greatly appreciated. Also, please do share your experience with cluster usage metrics, if you have any.
Thanks in advance!
Ganglia is an open-source, scalable and distributed monitoring system for large clusters. It collects, aggregates and provides time-series views of tens of machine-related metrics such as CPU, memory, storage, network usage. You can see Ganglia in action at UC Berkeley Grid.
Ganglia is also a popular solution for monitoring Hadoop and HBase clusters, since Hadoop (and HBase) has built-in support for publishing its metrics to Ganglia. With Ganglia you may easily see the number of bytes written by a particular HDFS datanode over time, the block cache hit ratio for a given HBase region server, the total number of requests to the HBase cluster, time spent in garbage collection and many, many others.
ref- http://hakunamapdata.com/ganglia-configuration-for-a-small-hadoop-cluster-and-some-troubleshooting/
I hope this link (here) may provide some details for 2 and 3.

Is it faster to replicate your data in hdfs for all your nodes?

If I have 6 data nodes, is it faster to set replication to 6 so that all the data is replicated across all my nodes and the cluster can split up queries (say in Hive) without having to move data around? I believe that if you have a replication factor of 3 and you put a 300GB file into HDFS, it splits it across just 3 of the data nodes, and then when all 6 nodes need to be used for a query it has to move data around to the other 3 nodes that the data doesn't exist on, causing slower responses. Is that accurate?
I understand what you mean: you are talking about data locality. Generally speaking, data locality can reduce run time, because it saves the time spent transferring blocks over the network. But in fact, unless you enable "HDFS Short-Circuit Local Reads" (it is off by default, please visit here), the MapTask will still read the block over TCP, that is, through the network stack, even if the block and the MapTask are both on the same node.
Recently, while optimizing Hadoop and HDFS, we replaced the HDDs with SSDs, but the effect was not good and run time did not get shorter, because the disk was not the bottleneck and the network load was not heavy. From the results, we concluded that the CPU was heavily loaded. If you want to understand your Hadoop cluster's situation clearly, I advise you to use Ganglia to monitor the cluster; it can help you analyze your cluster's bottleneck. Please see here.
Finally, Hadoop is a very large and complicated system: disk performance, CPU performance, network bandwidth, parameter values and many other factors need to be considered. If you want to save time, you have much work to do, not just changing the replication factor.
