HBase and ZooKeeper roles in Hadoop? - hadoop

I have installed a single-node Hadoop cluster on my Ubuntu machine and am able to run the NameNode, DataNode, etc. Now I need to install HBase and ZooKeeper, but I don't really know what they are. Could anybody give me a brief description of these tools?
Thanks

First of all, I would strongly recommend going through the official pages of the HBase and ZooKeeper projects.
HBase is a NoSQL datastore that runs on top of your existing Hadoop cluster (HDFS). It gives you capabilities such as random, real-time reads and writes, which HDFS, being a file system, lacks. Since it is a NoSQL datastore, it doesn't follow SQL conventions and terminology. HBase provides a good set of APIs (including Java and Thrift) and integrates seamlessly with the MapReduce framework. Keep in mind, though, that while random reads and writes are quick, they always carry additional overhead, so think carefully before you make any decision.
ZooKeeper is a high-performance coordination service for distributed applications (like HBase). It exposes common services like naming, configuration management, synchronization, and group services in a simple interface, so you don't have to write them from scratch. You can use it off the shelf to implement consensus, group management, leader election, and presence protocols, and you can build on it for your own specific needs.
HBase relies completely on ZooKeeper. HBase gives you the option of using its built-in ZooKeeper, which is started whenever you start HBase. That is not a good idea on a production cluster, though; there it is better to have a dedicated ZooKeeper cluster and integrate it with your HBase cluster.
Note: You should always have an odd number of nodes in your ZK quorum.
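To see why an odd ensemble size is the usual advice, here is a small arithmetic sketch. It assumes only the general rule that a quorum needs a strict majority of servers; the function names are made up for illustration and are not part of any ZooKeeper API.

```python
def majority(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be lost while a majority is still reachable."""
    return ensemble_size - majority(ensemble_size)


# Print the majority size and fault tolerance for small ensembles.
for n in range(1, 8):
    print(f"{n} servers: majority={majority(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Going from 3 to 4 servers (or from 5 to 6) raises the majority size without raising the number of failures the ensemble survives, so the extra even-numbered node buys nothing.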
HTH

An overview:
ZooKeeper: In short, ZooKeeper is a configuration and management tool for distributed applications (clusters), and it exists independently of HBase. From the docs:
ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and
providing group services. All of these kinds of services are used in
some form or another by distributed applications. Each time they are
implemented there is a lot of work that goes into fixing the bugs and
race conditions that are inevitable. Because of the difficulty of
implementing these kinds of services, applications initially usually
skimp on them, which make them brittle in the presence of change and
difficult to manage. Even when done correctly, different
implementations of these services lead to management complexity when
the applications are deployed.
HBase: The NoSQL datastore on top of HDFS (it can also run on a plain local file system, but then data durability is not guaranteed). HBase contains two primary services:
Master server - The master server (HMaster) coordinates the cluster and performs administrative operations, such as assigning regions and balancing the load.
Region servers - The region servers do the real work. Each region server handles a subset of the data of each table. Clients talk to region servers to access data in HBase.
The connection between HBase and Zookeeper:
A distributed HBase relies completely on Zookeeper (for cluster configuration and management). In Apache HBase, ZooKeeper coordinates, communicates, and shares state between the Masters and RegionServers. HBase has a design policy of using ZooKeeper only for transient data (that is, for coordination and state communication). Thus if the HBase’s ZooKeeper data is removed, only the transient operations are affected — data can continue to be written and read to/from HBase.
Once HBase is started, you can verify the processes it has launched using the jps command:
$ jps
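The command lists all the Java processes on the machine (HBase itself is a Java application). The likely output, for a simple single-machine setup in which HBase manages ZooKeeper and runs its daemons as separate processes, is the following (a true standalone setup runs everything inside a single HMaster JVM):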
62019 Jps
61098 HMaster
61233 HRegionServer
61003 HQuorumPeer
Technically speaking:
By default HBase manages ZooKeeper itself, i.e. it starts and stops the ZooKeeper quorum (the cluster of ZooKeeper nodes) when we start and stop HBase. To verify the setting, look into the file conf/hbase-env.sh (in your HBase directory); there must be a line:
export HBASE_MANAGES_ZK=true
Once that is set, all we need to do is set the following properties in conf/hbase-site.xml (from the docs):
<configuration>
  ...
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
    <description>The port at which the clients will connect.
    </description>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>rs1.example.com,rs2.example.com,rs3.example.com,rs4.example.com,rs5.example.com</value>
    <description>Comma separated list of servers in the ZooKeeper Quorum.
    For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
    By default this is set to localhost for local and pseudo-distributed modes
    of operation. For a fully-distributed setup, this should be set to a full
    list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
    this is the list of servers which we will start/stop ZooKeeper on.
    </description>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/usr/local/zookeeper</value>
    <description>Property from ZooKeeper's config zoo.cfg.
    The directory where the snapshot is stored.
    </description>
  </property>
  ...
</configuration>
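To make the relationship between these properties concrete, here is a rough Python sketch that reads the quorum and client port from such a file and combines them into the host:port,host:port string that ZooKeeper clients typically connect with. The helper names and the trimmed sample XML are made up for illustration, and the fallback values (localhost, 2181) are assumptions taken from the descriptions above, not read from HBase itself.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text: str) -> dict:
    """Collect <name>/<value> pairs from a Hadoop-style configuration document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def zk_connect_string(props: dict) -> str:
    """Combine the quorum hosts and the client port into host:port pairs."""
    # Fallbacks below are assumptions based on the property descriptions.
    port = props.get("hbase.zookeeper.property.clientPort", "2181")
    hosts = props.get("hbase.zookeeper.quorum", "localhost").split(",")
    return ",".join(f"{h.strip()}:{port}" for h in hosts if h.strip())


# Hypothetical, shortened hbase-site.xml used only for this example.
SAMPLE = """<configuration>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>rs1.example.com,rs2.example.com,rs3.example.com</value>
  </property>
</configuration>"""

print(zk_connect_string(read_properties(SAMPLE)))
```

Every quorum host shares the single clientPort value, which is why the port is configured once rather than per host.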

Related

Namenode with high availability vs zookeeper based leader selection

I am reading two different things in the Apache Hadoop documentation and Cloudera's documentation.
According to Cloudera, we should set up the NameNode in high availability mode, i.e. by defining a primary and a secondary NameNode; but according to the Hadoop documentation, this should be taken care of automatically by ZooKeeper, which should pick the NameNode from among the available DataNodes.
Can anyone explain the difference and which one to use?
"by defining a primary and a secondary NameNode"
There is such a thing as a "secondary namenode", but it is actually a very different thing: it is not a standby and cannot become active.
There's no "vs". NameNode HA with automatic failover needs ZooKeeper.
If you read more of the Cloudera documentation, you will see that it does mention ZooKeeper:
Automatic failover adds two new components to an HDFS deployment: a ZooKeeper quorum, and the ZKFailoverController process (abbreviated as ZKFC).
Cloudera doesn't package many extras, if any, on top of the core Hadoop functions.
Regarding your question...
"this should be taken care of automatically by ZooKeeper"
The failover is automatic if the HDFS ZooKeeper properties are (manually) configured, ZooKeeper is running, and the Active NameNode goes down.
"among the available DataNodes"
The operation has nothing to do with DataNodes.

differences between HDFS and ZooKeeper?

While reading ZooKeeper's documentation, it seems to me that HDFS relies on pretty much the same mechanisms of distribution/replication (broadly speaking) as ZooKeeper. I hear some echo from one to the other, but I still can't distinguish things clearly and strictly.
I understand ZooKeeper is a Cluster Management / Sync tool, while HDFS is a Distributed File Management System, but could ZK be needed on an HDFS cluster for example?
Yes. The deciding factor is distributed processing and high availability on a Hadoop cluster, which rely on a ZooKeeper quorum.
One example is the Hadoop NameNode failover process.
Hadoop high availability is designed around an Active NameNode and a Standby NameNode for failover. At no point in time should you have two masters (active NameNodes) at once.
ZooKeeper is what the cluster uses to determine which NameNode is currently the active one.

Hadoop client and cluster separation

I am a newbie to Hadoop, and to Linux as well. My professor asked us to separate the Hadoop client and cluster using port mapping or a VPN. I don't understand what such a separation means. Can anybody give me a hint?
Now I get the idea of cluster/client separation. I think Hadoop needs to be installed on the client machine as well. When the client submits a Hadoop job, it is submitted to the masters of the cluster.
And I have some naive ideas:
1. Create a client machine and install Hadoop.
2. Set fs.default.name to hdfs://master:9000
3. Set dfs.namenode.name.dir to file://master/home/hduser/hadoop_tmp/hdfs/namenode
Is it correct?
4. Then I don't know how to set dfs.namenode.name.dir and the other configurations.
5. I think the main idea is to set up the configuration files so that the job runs on the Hadoop cluster, but I don't know exactly how to do it.
First of all, this link has detailed information on how a client communicates with the NameNode:
http://www.informit.com/articles/article.aspx?p=2460260&seqNum=2
As I understand it, your professor wants a separate node to act as a client from which you can run Hadoop jobs, but that node should not be part of the Hadoop cluster.
Consider a scenario where you have to submit a Hadoop job from a client machine that is not part of the existing Hadoop cluster, and the job is expected to be executed on the Hadoop cluster.
The NameNode and DataNodes form the Hadoop cluster, and the client submits jobs to it.
To achieve this, the client should have the same Hadoop distribution and configuration as the NameNode.
Only then will the client know on which node the JobTracker is running and the IP of the NameNode for accessing HDFS data.
Go through the configuration on the NameNode.
core-site.xml will have this property:
<property>
  <name>fs.default.name</name>
  <value>hdfs://192.168.0.1:9000</value>
</property>
mapred-site.xml will have this property-
<property>
  <name>mapred.job.tracker</name>
  <value>192.168.0.1:8021</value>
</property>
These two important properties must be copied to the client machine’s Hadoop configuration.
You also need to set one additional property in mapred-site.xml to avoid a PrivilegedActionException:
<property>
  <name>mapreduce.jobtracker.staging.root.dir</name>
  <value>/user</value>
</property>
You also need to update /etc/hosts on the client machine with the IP addresses and hostnames of the NameNode and DataNodes.
Now you can submit job from client machine with hadoop jar command, and job will be executed on Hadoop Cluster. Note that, you shouldn’t start any hadoop service on client machine.
Users shouldn't be able to disrupt the functionality of the cluster. That's the meaning. Imagine there is a whole bunch of data scientists that launch their jobs from one of the cluster's masters. In case someone launches a memory-intensive operation, the master processes that are running on the same machine could end up with no memory and crash. That would leave the whole cluster in a failed state.
If you separate client node from master/slave nodes, users could still crash the client, but the cluster would stay up.

What are the differences between ZooKeeper, JournalNode tasks and the Quorum Journal Manager in Hadoop?

After studying the material on a number of websites and videos, I am confused about the functions and the differences in purpose of three Hadoop components: ZooKeeper, the JournalNode and the Quorum Journal Manager.
Could anyone please explain why each of them was invented, and how the three differ in purpose and function?
Thanks in advance.
Think of it like this: ZooKeeper is a group of people, each assigned to watch over a factory and coordinate it; a JournalNode is a place where all factory managers can check each other's status and coordinate. QJM is a combination of both, used in HA for better coordination in case of failover.
ZooKeeper coordinates HBase RegionServers and other Hadoop components that require it.
JournalNodes keep the edit log that the active NameNode writes and the standby NameNode reads.
QJM is the NameNode-side mechanism that writes those edits to a majority (quorum) of JournalNodes.
In a core Hadoop setup, JournalNodes are only needed when NameNode HA is configured.
First, quorum means that a majority is needed for decisions. So when you see the word "quorum", you should think of a clustered, that is, multi-host, configuration. You will hear this term for both ZooKeeper and JournalNodes.
A short description of what each does will help you distinguish their purposes.
ZooKeeper: ZooKeeper is the central synchronisation service for information that applications need to check frequently. An application may need many kinds of information, such as naming structure and configuration information (or simply configuration). The most common case is application configuration. When you change a setting that affects, say, 80 servers, you need a synchronisation service to propagate the change to all nodes. The application itself may have this feature. But imagine you add another 12 applications to your environment: you would need to take care of each application's synchronisation service one by one. This is where ZooKeeper comes in. ZooKeeper can handle the management of all this information by itself. If you set it up as a cluster (use an odd number of hosts, since adding one more to make it even does not increase the number of failures it can survive), you get high availability for ZooKeeper (failover cases) and have a ZooKeeper quorum.
JournalNode: In a high-availability Hadoop cluster you have more than one NameNode running in active/passive mode. The active NameNode informs the JournalNodes of changes, and the standby NameNode asks the JournalNodes what has changed. As with ZooKeeper, if you set them up as a cluster (again with an odd number of hosts), you also get high availability for the JournalNode function and have a Quorum Journal Manager.
In fact, I have never heard of them being set up on a single host or node except for lab purposes (a VM on a PC).
1. Zookeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications
Role of Zookeeper in Hadoop ecosystem:
During the Hadoop NameNode failover process, ZooKeeper is used to avoid a split-brain scenario, so that the NameNode state does not diverge because of the failover.
Refer to this post for more details:
How does Hadoop Namenode failover process works?
2. JournalNode ( Used in Namenode failover process)
In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called “JournalNodes” (JNs).
JournalNode machines - the machines on which you run the JournalNodes. The JournalNode daemon is relatively lightweight, so these daemons may reasonably be collocated on machines with other Hadoop daemons, for example NameNodes, the JobTracker, or the YARN ResourceManager.
Note: There must be at least 3 JournalNode daemons, since edit log modifications must be written to a majority of JNs. This will allow the system to tolerate the failure of a single machine.
3. Quorum Journal Manager (QJM)
The Quorum Journal Manager allows edit logs to be shared between the Active and Standby NameNodes.
Importantly, when using the Quorum Journal Manager, only one NameNode will ever be allowed to write to the JournalNodes, so there is no potential for corrupting the file system metadata from a split-brain scenario.
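The two properties quoted above (writes need a majority of JournalNodes, and only one NameNode may write) can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the class and method names are hypothetical, a plain integer "epoch" stands in for the fencing mechanism, and nothing here is the actual Hadoop implementation or its API.

```python
class JournalNode:
    """Toy journal daemon: remembers the highest writer epoch it has promised."""

    def __init__(self):
        self.alive = True
        self.promised_epoch = 0
        self.edits = []

    def new_epoch(self, epoch: int) -> bool:
        # Accept a new writer only if its epoch is higher than any seen so far.
        if self.alive and epoch > self.promised_epoch:
            self.promised_epoch = epoch
            return True
        return False

    def journal(self, epoch: int, edit: str) -> bool:
        # Reject edits from a writer whose epoch has been superseded.
        if not self.alive or epoch < self.promised_epoch:
            return False
        self.edits.append(edit)
        return True


class Writer:
    """Toy NameNode-side writer that needs a majority of acknowledgements."""

    def __init__(self, nodes, epoch: int):
        self.nodes = nodes
        self.epoch = epoch
        acks = sum(node.new_epoch(epoch) for node in nodes)
        if acks <= len(nodes) // 2:
            raise RuntimeError("could not reach a majority of journal nodes")

    def write(self, edit: str) -> bool:
        acks = sum(node.journal(self.epoch, edit) for node in self.nodes)
        return acks > len(self.nodes) // 2


journal_nodes = [JournalNode() for _ in range(3)]
active = Writer(journal_nodes, epoch=1)
print(active.write("mkdir /a"))        # all three nodes acknowledge
journal_nodes[0].alive = False         # one journal node fails
print(active.write("mkdir /b"))        # two of three still form a majority
standby = Writer(journal_nodes, epoch=2)   # failover: standby takes over
print(active.write("mkdir /c"))        # old writer is now rejected
print(standby.write("mkdir /d"))       # new writer succeeds
```

With three nodes the writer keeps working after one failure, and once a newer writer has registered, the old one can no longer get its edits accepted, which is the split-brain protection the quote describes.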

HBase HDFS zookeeper

Now I am learning about HBase. I set up my HBase Cluster and Hadoop Cluster like this:
server1: Namenode HMaster
server2: datanode1 RegionServer1 HQuorumPeer
Server3: datanode2 RegionServer2 HQuorumPeer
Server4: datanode3 RegionServer3 HQuorumPeer
I have several questions about the HBase cluster:
1: All RegionServers must be in the Hadoop cluster so they can use HDFS to store
data, even though the data ends up in the local file system, right?
2: What does a RegionServer do? Does the HMaster give the job to all RegionServers
and let them run in parallel, like a TaskTracker on a DataNode?
3: What does ZooKeeper do? Do I need to set up ZooKeeper on all RegionServer
nodes and the master node?
4: This is related to #3. I know HBase uses ZooKeeper for recovery once a RegionServer
is down. How specifically does it work?
All RegionServers must be in the Hadoop cluster so they can use HDFS to store
data, even though the data ends up in the local file system, right?
Yes. RegionServers are the daemons responsible for storing data in an HBase cluster. You store data in HBase tables, which are spread over many regions on several RegionServers across the cluster. Although data goes into the RegionServers, it actually gets stored inside HDFS. On a standalone setup, however, HDFS is not used and the data gets stored directly in the local FS. The relationship is analogous to that between any DB and FS; take MySQL and ext3, for example. And yes, all the HDFS data is in reality stored on your disk, although you cannot see it directly.
What does a RegionServer do? Does the HMaster give the job to all RegionServers
and let them run in parallel, like a TaskTracker on a DataNode?
As mentioned above, the RegionServer is the daemon that actually stores data in an HBase cluster. I'm sorry, I didn't quite get the second part of this question: what do you mean by "like a TaskTracker on a DataNode"? In an HBase cluster, the HMaster is the daemon responsible for monitoring all RegionServer instances in the cluster, and it is the interface for all metadata changes. Its job is monitoring and management. RegionServers don't run any jobs the way TaskTrackers do. They just store data and are responsible for things like serving and managing regions.
What does ZooKeeper do? Do I need to set up ZooKeeper on all RegionServer
nodes and the master node?
Zookeeper is the guy who coordinates everything behind the curtains. It is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. A distributed HBase setup depends on a running ZooKeeper cluster. All participating nodes and clients need to be able to access the running ZooKeeper ensemble. HBase by default manages a ZooKeeper cluster. It gets started and stopped as part of the HBase start/stop process. But, you can also manage the ZooKeeper ensemble independent of HBase and just point HBase at the cluster it should use. You don't have to have Zookeepers running on all the nodes. Just decide some number which suits your cluster. One thing to note here is that you should always use an odd number of Zookeepers.
This is related to #3. I know HBase uses ZooKeeper for recovery once a RegionServer
is down. How specifically does it work?
Each RegionServer is connected to ZooKeeper, and the master watches these connections. ZooKeeper manages a heartbeat with a timeout. So, on a timeout, the HMaster declares the region server dead and starts the recovery process. The following things happen during the recovery process:
Identifying that a node is down : a node can cease to respond simply because it is overloaded or as well because it is dead.
Recovering the writes in progress : that’s reading the commit log and recovering the edits that were not flushed.
Reassigning the regions : the region server was previously handling a set of regions. This set must be reallocated to other region servers, depending on their respective workload.
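As a rough illustration of the first and last steps above (detecting a timed-out server and redistributing its regions), here is a toy Python sketch. The function names, the timestamps and the use of region count as the measure of workload are all assumptions made for the example; commit log replay is not modelled, and this is not how HBase itself is implemented.

```python
def find_dead(last_heartbeat: dict, now: float, timeout: float) -> list:
    """Return servers whose last heartbeat is older than the timeout."""
    return sorted(s for s, seen in last_heartbeat.items() if now - seen > timeout)


def reassign(assignments: dict, dead: list) -> dict:
    """Move regions of dead servers to the live server with the fewest regions."""
    live = {s: list(regions) for s, regions in assignments.items() if s not in dead}
    if not live:
        raise RuntimeError("no live region servers left")
    for server in dead:
        for region in assignments.get(server, []):
            # Region count is used as a stand-in for workload (an assumption);
            # ties are broken by server name to keep the result deterministic.
            target = min(live, key=lambda name: (len(live[name]), name))
            live[target].append(region)
    return live


# Hypothetical cluster state: rs2 stopped sending heartbeats a while ago.
heartbeats = {"rs1": 100.0, "rs2": 70.0, "rs3": 99.0}
regions = {"rs1": ["r1", "r2"], "rs2": ["r3", "r4"], "rs3": ["r5"]}

dead = find_dead(heartbeats, now=101.0, timeout=30.0)
print(dead)
print(reassign(regions, dead))
```

Here rs2 has been silent for longer than the timeout, so its two regions are handed to rs3 and rs1, leaving both live servers with a similar number of regions.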
The process is actually a bit more involved. I would also suggest going through the book HBase: The Definitive Guide by Lars George to get a good grip on HBase.
HTH

Resources