How can I decide the number of Kafka brokers on a ZooKeeper cluster having 5 nodes? - cluster-computing

I am new to Kafka.
We are going to use Kafka in production. We have around 63 physical nodes (each with a 1TB HDD and 8GB RAM).
How can I choose the minimum ZooKeeper ensemble size based on the number of broker nodes?
I have a 3-node ZooKeeper cluster with a 3-node broker cluster for testing in our dev environment, but I want to understand how the ZooKeeper ensemble should be sized for production.

Related

Geo cluster with pacemaker - quorum vs booth

I configured a geo cluster using pacemaker and DRBD.
The cluster has 3 different nodes, each node is in a different geographic location.
The locations are pretty close to one another and the communication between them is fast enough for our requirements (around 80MB/s).
I have one master node, one slave node and the third node is an arbitrator.
I use AWS route 53 failover DNS record to do a failover between the nodes in the different sites.
A failover will happen from the master to the slave only if the slave has a quorum, thus ensuring it has communication to the outside world.
I have read that using booth is advised to perform failover between clusters/nodes in different locations - but having a quorum between different geographic locations seems to work very well.
I want to emphasize that I don't have a cluster of clusters - it is a single cluster, with each node in a different geo-location.
My question is - do I need booth in my case? If so - why? Am I missing something?
Booth helps in an overlay cluster consisting of clusters running at different sites.
You have one single cluster, so you should be fine with just quorum.

How to add a node for failover in Elasticsearch

I currently have a single Elasticsearch node on a Windows server. Can you please explain how to add one extra node on a different machine for failover? I also wonder how the two nodes can be kept identical using NEST.
Usually, you don't run a failover node, but run a cluster of nodes to provide High Availability.
A minimum topology of 3 master eligible nodes with minimum_master_nodes set to 2 and a sharding strategy that distributes primary and replica shards over nodes to provide data redundancy is the minimum viable topology I'd consider running in production.
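Assuming a pre-7.x Elasticsearch version (where `minimum_master_nodes` still applies), the topology above could be sketched in each node's `elasticsearch.yml` roughly as follows; the cluster name, node names, and hostnames are placeholders:

```yaml
# elasticsearch.yml (one per node; node.name differs on each machine)
cluster.name: my-cluster                # placeholder name
node.name: node-1                       # node-2 / node-3 on the other machines
node.master: true                       # all three nodes are master-eligible
node.data: true
discovery.zen.ping.unicast.hosts: ["host1", "host2", "host3"]  # placeholder hosts
discovery.zen.minimum_master_nodes: 2   # majority of 3 master-eligible nodes
```

Setting `minimum_master_nodes` to a majority (2 of 3) is what prevents a split-brain when one node is partitioned away.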

HDP cluster with RAID?

What is your experience with RAID 1 on an HDP cluster?
I have two options in mind:
Set up RAID 1 for the master and ZooKeeper nodes, and don't use RAID at all on slave nodes such as Kafka brokers, HBase regionservers, and YARN nodemanagers.
Even if I lose one slave node, I will still have two other replicas.
In my opinion, RAID would only slow down my cluster.
Alternatively, set everything up using RAID 1.
What do you think? What is your experience with HDP and RAID?
What do you think about using RAID 0 for slave nodes?
I'd recommend no RAID at all on Hadoop hosts. There is one caveat: if you are running services like Oozie or the Hive metastore that use a relational DB behind the scenes, RAID may well make sense on the DB host.
On a master node, assuming you have Namenode, zookeeper etc - generally the redundancy is built into the service. For namenodes, all the data is stored on both namenodes. For Zookeeper, if you lose one node, then the other two nodes have all the information.
Zookeeper likes fast disks - ideally dedicate a full disk to zookeeper. If you have namenode HA, give the namenode edits directory and each journal node a dedicated disk too.
For the slave nodes, the datanode will write across all disks, effectively striping the data anyway. Each 'write' is at most the HDFS block size, so if you were writing a large file, you could get 128MB on disk 1, then the next 128MB on disk 2 etc.
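The striping effect described above can be sketched as a small model: a datanode round-robins successive blocks across its data disks. The 128 MB figure matches the default `dfs.blocksize` in Hadoop 2.x+; the round-robin placement here is a simplification of the real volume-choosing policy:

```python
# Sketch: how successive HDFS blocks of one file spread across a
# datanode's disks when placement is (approximately) round-robin.

BLOCK_SIZE_MB = 128  # default dfs.blocksize in Hadoop 2.x+

def place_blocks(file_size_mb, num_disks):
    """Return the disk index chosen for each successive block."""
    num_blocks = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    return [block % num_disks for block in range(num_blocks)]

# A 512 MB file on a 4-disk datanode: 4 blocks, one landing on each disk.
print(place_blocks(512, 4))  # [0, 1, 2, 3]
```

So a large sequential write already behaves much like striping, which is why RAID 0 on datanodes buys little while making a single-disk failure take out the whole node's data.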

Actual need of Zookeepers

I am new to HBase and still learning it. I just wanted to know how many ZooKeepers we actually need. Is it one per regionserver or one per cluster? Thanks
ZooKeeper is per cluster, not per regionserver.
From HBase: The Definitive Guide:
How many ZooKeepers should I run? You can run a ZooKeeper ensemble that comprises 1 node only, but in production it is recommended that you run a ZooKeeper ensemble of 3, 5, or 7 machines; the more members an ensemble has, the more tolerant the ensemble is of host failures. Also, run an odd number of machines. In ZooKeeper, an even number of peers is supported, but it is normally not used because an even-sized ensemble requires, proportionally, more peers to form a quorum than an odd-sized ensemble requires. For example, an ensemble with 4 peers requires 3 to form a quorum, while an ensemble with 5 also requires 3 to form a quorum. Thus, an ensemble of 5 allows 2 peers to fail, and thus is more fault tolerant than the ensemble of 4, which allows only 1 down peer.
Give each ZooKeeper server around 1GB of RAM, and if possible, its own dedicated disk (a dedicated disk is the best thing you can do to ensure a performant ZooKeeper ensemble). For very heavily loaded clusters, run ZooKeeper servers on separate machines from RegionServers (DataNodes and TaskTrackers).
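The quorum arithmetic in the quoted passage can be checked with a short sketch: a quorum is the smallest majority, floor(n/2) + 1, and the ensemble tolerates whatever remains:

```python
# Quorum math behind the ZooKeeper ensemble-sizing advice above.

def quorum(ensemble_size):
    """Smallest majority of an ensemble of the given size."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size):
    """Peers that can fail while a quorum can still be formed."""
    return ensemble_size - quorum(ensemble_size)

for n in (3, 4, 5, 7):
    print(f"{n} peers: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failures")
```

As the quote says, 4 peers need 3 for a quorum and tolerate only 1 failure, while 5 peers also need 3 and tolerate 2, which is why odd-sized ensembles are preferred.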

How to set hadoop cluster priority?

I am starting to learn Hadoop. I have a Hadoop server that connects to 3 cluster nodes. If I run a MapReduce job, it works well. I need to set a priority for these nodes.
For example:
node1, node2, and node3 are the cluster nodes connected to my Hadoop server. If I run an MR job, will it split and assign tasks according to this priority every time? Is that possible?
The cluster nodes have different memory capacities, so I need the node with the most memory to handle the job first.
It's not possible to 'weight' certain servers based on capacity. However, each server can have a configuration that matches its memory, processor count, etc.
For example, if one server has 16 cores and another has 8 cores, you can configure the first server to run 12 tasks simultaneously and the second to run only 6. The same idea applies to memory.
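With classic MapReduce (MR1), those per-server slot counts go in each TaskTracker's `mapred-site.xml`; the split between map and reduce slots below is an illustrative assumption, not a prescribed ratio:

```xml
<!-- mapred-site.xml on the 16-core server: 12 concurrent task slots total -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
<!-- on the 8-core server, use 4 and 2 instead for 6 slots total -->
```

On YARN the equivalent knobs are per-NodeManager resources (`yarn.nodemanager.resource.memory-mb` and `yarn.nodemanager.resource.cpu-vcores`), which let the scheduler pack more containers onto the larger node.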