Actual need of ZooKeepers - Hadoop

I am new to HBase and I am still learning it. I just wanted to know how many ZooKeeper servers we actually need. Is it one per RegionServer or one per cluster? Thanks

The ZooKeeper ensemble is per cluster, not per RegionServer.
From HBase: The Definitive Guide:
How many ZooKeepers should I run? You can run a ZooKeeper ensemble
that comprises 1 node only but in production it is recommended that
you run a ZooKeeper ensemble of 3, 5 or 7 machines; the more members
an ensemble has, the more tolerant the ensemble is of host failures.
Also, run an odd number of machines. In ZooKeeper, an even number of
peers is supported, but it is normally not used because an even sized
ensemble requires, proportionally, more peers to form a quorum than an
odd sized ensemble requires. For example, an ensemble with 4 peers
requires 3 to form a quorum, while an ensemble with 5 also requires 3
to form a quorum. Thus, an ensemble of 5 allows 2 peers to fail, and
thus is more fault tolerant than the ensemble of 4, which allows only
1 down peer.
Give each ZooKeeper server around 1GB of RAM, and if possible, its own
dedicated disk (A dedicated disk is the best thing you can do to
ensure a performant ZooKeeper ensemble). For very heavily loaded
clusters, run ZooKeeper servers on separate machines from
RegionServers (DataNodes and TaskTrackers).
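As a minimal sketch of the "per cluster" point (the hostnames are hypothetical), the whole HBase cluster is pointed at one 3-member ensemble via hbase-site.xml:

    <!-- hbase-site.xml: one ZooKeeper ensemble serves the whole cluster -->
    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
    </property>
    <property>
      <name>hbase.zookeeper.property.clientPort</name>
      <value>2181</value>
    </property>

Every RegionServer and client uses this same list; nothing is configured per RegionServer.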

Related

Can I use the same flow.xml.gz for two different NiFi clusters?

We have a 13-node NiFi cluster with around 50k processors. The size of the flow.xml.gz is around 300 MB. Bringing up the 13-node NiFi cluster usually takes 8-10 hours. Recently we split the cluster into two parts, a 5-node cluster and an 8-node cluster, with the same 300 MB flow.xml.gz in both. Since then we have not been able to get NiFi up in either cluster, and we are not seeing any useful logs related to this issue. Is it okay to have the same flow.xml.gz in both? What best practices could we be missing here when splitting a NiFi cluster?
You ask a number of questions that all boil down to "How to improve performance of our NiFi cluster with a very large flow.xml.gz".
Without a lot more details on your cluster and the flows in it, I can't give a definite or guaranteed-to-work answer, but I can point out some of the steps.
Splitting the cluster is no good without splitting the flow.
Yes, you will reduce cluster communications overhead somewhat, but you probably have a number of input processors that are set to "Primary Node only". If you load the same flow.xml.gz on two clusters, both will have a primary node executing these, leading to contention issues.
More importantly, since every node still loads all of the flow.xml.gz (probably around 4 GB unzipped), you get no other performance benefit, and verifying the 50k processors in the flow at startup still takes ages.
How to split the cluster
Splitting the cluster in the way you did probably left references to nodes that are now in the other cluster, for example in the local state directory. For NiFi clustering, that may cause problems electing a new cluster coordinator and primary node, because a quorum can't be reached.
It would be cleaner to disconnect, offload and delete those nodes first from the cluster GUI so that these references are deleted. Those nodes can then be configured as a fresh cluster with an empty flow. Even if you use the old flow again later, test it out with an empty flow to make it a lot quicker.
Since you already split the cluster, I would try to start one node of the 8-member cluster and see if you can access the cluster menu to delete the split-off nodes (disconnecting and offloading probably doesn't work anymore). Then, for the other 7 members of the cluster, delete the flow.xml.gz and start them; they should copy over the flow from the running node. You should adjust the number of expected election candidates in nifi.properties (nifi.cluster.flow.election.max.candidates) so that it is not larger than the number of nodes, to slightly speed up this process.
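For illustration, the relevant election settings in nifi.properties might look like this on the 8-member cluster (the values are assumptions to adapt to your setup):

    # nifi.properties: end the flow election as soon as all expected nodes have voted,
    # rather than waiting out the full election window
    nifi.cluster.flow.election.max.candidates=8
    nifi.cluster.flow.election.max.wait.time=5 mins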
If successful, you then have the 300 MB flow running on the 8-member cluster and an empty flow on the new 5-member cluster.
Connect the new cluster to your development pipeline (NiFi Registry, templates or otherwise). Then you can stop process groups on the 8-member cluster, import them on the new one, and after verifying that the flows are running on the new cluster, delete each process group from the old one, slowly shrinking it.
If you have no pipeline or it's too much work to recreate all the controllers and parameter contexts, you could take a copy of the flow.xml.gz to one new node, start only that node and delete all the stuff you don't need. Only after that should you start the others (with their empty flow.xml.gz) again.
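As a rough shell sketch of that copy-to-one-node approach (the install path /opt/nifi is an assumption; adapt it to your layout):

    # On the single seed node: restore the flow, then start only this node
    cp /backup/flow.xml.gz /opt/nifi/conf/flow.xml.gz
    /opt/nifi/bin/nifi.sh start

    # After pruning what you don't need in the UI, on each remaining node:
    # start with an empty flow so the node inherits the elected cluster flow
    rm -f /opt/nifi/conf/flow.xml.gz
    /opt/nifi/bin/nifi.sh start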
For more expert advice, you should also try the Apache NiFi users mailing list. If you supply enough relevant details in your question, someone there may know what is going wrong with your cluster.

How do we allocate different numbers of reducers in a heterogeneous cluster?

Our system has a cluster of 5 hosts (e.g., data nodes or compute slaves…). Now, I want to allocate a different number of reducers to these hosts because 1 host is slow. We are using Hadoop YARN. The ResourceManager (called the JobTracker in MapReduce 1) always allocates reducers evenly across the 5 hosts. Is there any way I can limit the number of reducers on a specific host? For example, what I want is that for a job with 40 reducers, the 4 fast computers get 36 reducers (e.g., 9 reducers per host) and the slow computer gets only 4 reducers.
It is entirely possible, and in fact a common phenomenon, to have heterogeneous systems in a Hadoop cluster. Typically, as the cluster keeps growing and thus scales horizontally, new nodes of different configurations get added to it.
In such scenarios, in order to have configurations applicable to a specific node or to a group of nodes, we need to modify the configurations accordingly on those hosts.
For example, in case of Hortonworks Data Platform where the cluster is managed through Ambari, the concept of host config groups can be leveraged for this purpose.
Please see the below link for further information:
https://docs.hortonworks.com/HDPDocuments/Ambari-2.1.1.0/bk_Ambari_Users_Guide/content/_using_host_config_groups.html
Also see the link below, where the discussion is about increasing the number of YARN containers at the node level. The same settings apply in your case, which is simply the opposite of the use case discussed there:
How to increase the number of containers in nodemanager in YARN
Another useful link:
http://hortonworks.com/blog/how-to-plan-and-configure-yarn-in-hdp-2-0/
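To make the per-node idea concrete, a hedged sketch of what the override could look like in yarn-site.xml on the slow host only (the values are illustrative assumptions, not recommendations):

    <!-- yarn-site.xml on the slow node: advertise fewer resources, so YARN
         schedules fewer containers (and hence fewer reducers) there -->
    <property>
      <name>yarn.nodemanager.resource.cpu-vcores</name>
      <value>2</value>
    </property>
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>4096</value>
    </property>

The fast hosts keep their larger values, so with uniform container sizes the slow host simply fits fewer reducers.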

Can a slave node have multiple blocks of the same file in Hadoop?

Say I have a Hadoop cluster where one node is the master node and the other is a data node. The slave node is an 8-core machine, just to make sure there are enough cores to process jobs in parallel. Can I still split the file into, say, 3 blocks and have the slave node store all three blocks separately? In other words, if we want to utilize all the slave nodes in a Hadoop cluster, is there a 1:1 relation between the number of slave nodes and the maximum number of blocks of a file? If yes, then how would MapReduce work in such a case? Will the master node fire three map tasks at the slave node and have each mapper pick up one block?
My question can be seen in a different way: if we have a 1 GB file on a cluster with 3 data nodes, how do the 64 MB blocks get divided, and how are they distributed between the three nodes?
The second question is the clearer one for me, so I will take that first.
From the HDFS perspective:
With a 64 MB block size, a 1 GB file consists of 16 blocks. Blocks are stored somewhat randomly across the DataNodes; if you have more DataNodes than the replication factor, you can expect a roughly even distribution between the nodes, provided you do not load the data from one of the DNs. If you do, that DN will hold a replica of every block, and the other DNs will hold the remaining replicas distributed more or less evenly (still placed randomly). So yes, if you have a file consisting of 16 blocks and only 3 DataNodes with a replication factor of 3, all 3 DNs will hold all 16 blocks.
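If you want to verify where the blocks of a particular file actually landed, the standard HDFS fsck tool reports block locations (the path here is hypothetical):

    hdfs fsck /user/alice/data/file1.dat -files -blocks -locations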
From YARN's perspective when you run the MapReduce job:
YARN tries to find a container for a mapper on a node that has the data locally; there is a configurable wait time for a free container on such nodes before YARN starts the mapper on a node that does not have the data.
YARN does not rely on physical cores directly: you can configure the number of virtual cores and the amount of memory a container uses, and based on these values YARN calculates how many containers fit on each NodeManager.
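As a hedged example of those per-container knobs for MapReduce jobs (illustrative values only):

    <!-- mapred-site.xml: resources each map container requests from YARN -->
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>2048</value>
    </property>
    <property>
      <name>mapreduce.map.cpu.vcores</name>
      <value>1</value>
    </property>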
Further reading on YARN tuning is available on the Cloudera Engineering blog.
However:
From the first part of the question, as I understand it, you want to achieve parallelism by choosing the block size used to split your data files.
MapReduce does not care about HDFS blocks; it has its own abstraction for splitting the input, called an InputSplit. InputSplits are fed to the mappers by the InputFormat. InputSplits also declare where the split's data is available locally, so that YARN can find a container on a node that holds the split on local storage. I suggest checking the API and the available implementations of InputFormat, as they most likely suit your needs; if they do not, you can still write your own implementation and specify it via the job configuration.
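To illustrate (the class name and input path are hypothetical), a minimal driver sketch showing that the split size, not the HDFS block size, is the per-job knob for mapper parallelism:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "split-size-demo");
            job.setJarByClass(SplitSizeDemo.class);

            // Input splits, not HDFS blocks, determine the number of mappers.
            // Capping the max split size at 32 MB yields more, smaller splits
            // (hence more mappers) regardless of the 64/128 MB block size.
            FileInputFormat.addInputPath(job, new Path("/user/alice/input"));
            FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);

            // ... set mapper/reducer classes and the output path as usual ...
        }
    }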

Hadoop - Difference between a single large node and multiple small nodes

I am new to Hadoop. I am wondering whether there is any difference between a single node and multiple nodes, if both have the same total computing power.
For example, suppose there is one server with a 4-core CPU and 32 GB RAM to set up a single-node Hadoop installation. On the other hand, there are four servers, each with a one-core CPU (same clock rate as the "big" server) and 8 GB RAM, to set up a 4-node Hadoop cluster. Which setup would be better?
It is generally better to have a larger number of data nodes than one powerful node. Basically, the NameNode needs a high-end configuration, while the DataNodes can get by with less.
So if you have a single node, all the daemons run on the same machine and compete for its resources, whereas if you have multiple nodes, the daemons run in parallel in JVMs spread across separate machines.

Cassandra: strategy for single datacenter deployment

We are planning to use Apache Shiro and Cassandra for distributed session management, very similar to what is described at https://github.com/lhazlewood/shiro-cassandra-sample
We need advice on the deployment of Cassandra in Amazon EC2:
In EC2, we have the following setup:
Single region, 2 Availability Zones (AZs), 4 nodes
Accordingly, Cassandra is configured with:
Single datacenter: DC1
Two racks: Rack1, Rack2
4 nodes: Rack1_Node1, Rack1_Node2, Rack2_Node1, Rack2_Node2
The data replication strategy used is NetworkTopologyStrategy.
Since Cassandra is used as the session datastore, we need high consistency and availability.
My Questions:
How many replicas shall I keep in a cluster?
Thinking of 2 replicas, 1 per rack.
What should the consistency level (CL) be for read and write operations?
Thinking of QUORUM for both read and write, considering 2 replicas in a cluster.
In case 1 rack is down, would Cassandra writes and reads succeed with the above configuration?
I know it can use hinted handoff for temporarily down nodes, but does that work for both read and write operations?
Any other suggestion for my requirements?
Generally, going for an even number of nodes is not the best idea, and neither is an even number of availability zones. In your case, if one of the two racks fails, QUORUM reads and writes will fail across the whole cluster. I'd recommend going for 3 racks with 1 or 2 nodes per rack, 3 replicas, and QUORUM for both reads and writes. Then the cluster only fails if two nodes/AZs fail.
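A hedged sketch of the keyspace definition this implies (the keyspace name is an assumption; the datacenter name DC1 follows the setup above):

    -- cqlsh: a 3-replica keyspace for the session store
    CREATE KEYSPACE sessions
      WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};

    -- Consistency is chosen per request, e.g. for the cqlsh session:
    CONSISTENCY QUORUM;

With 3 replicas, QUORUM needs 2 of them, so any single node or rack can be down while reads and writes still succeed.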
You have probably heard of the CAP theorem in database theory. If not, you can learn the details on Wikipedia (https://en.wikipedia.org/wiki/CAP_theorem) or just google it. It says that a distributed database with multiple nodes can only achieve two of the following three goals: consistency, availability, and partition tolerance.
Cassandra is designed to achieve high availability and partition tolerance (AP), sacrificing consistency to get there. However, you can set the consistency level to ALL in Cassandra to shift it towards CA, which seems to be your goal. Your setting of QUORUM with 2 replicas is essentially the same as ALL, since a quorum of 2 replicas is 2. But in this setting, if a single node containing the data is down, the client will get an error for reads and writes (not partition-tolerant).
You may take a look at a video here to learn more (it requires a DataStax account): https://academy.datastax.com/courses/ds201-cassandra-core-concepts/introduction-big-data
