I understand from the slides that, in the context of Hadoop, ZooKeeper stores information about the master, the status of the different tasks, which worker is working on which partition, and the list of available workers.
Why is ZooKeeper used for this metadata storage? Any data store could be used, right?
For instance, Celery can be configured with any result backend (Redis, Mongo, etc.). So in practice Hadoop could use any storage backend, right? Why ZooKeeper?
This doc suggests that Redis, SQLite, MySQL, PostgreSQL can be used for celery task result storage.
https://docs.celeryq.dev/en/stable/getting-started/backends-and-brokers/index.html
ZooKeeper, which is built on the ZAB atomic broadcast protocol, is used for leader election as well as distributed locks.
It is not simply a data store, and no, not just any data store can be used in its place.
Celery isn't used within the Hadoop ecosystem, so I'm not sure how that's relevant to the question.
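To illustrate why leader election needs more than a plain key-value store, here is a minimal in-memory sketch of the usual election recipe: every candidate registers an entry with an increasing sequence number, the entry disappears when its owner's session dies, and the live candidate with the lowest number is the leader. `ToyCoordinator` and its methods are hypothetical names invented for this example; this is not the ZooKeeper client API, only an imitation of the behaviour of ephemeral sequential znodes.

```python
import itertools


class ToyCoordinator:
    """In-memory stand-in for a coordination service (NOT the ZooKeeper API)."""

    def __init__(self):
        self._seq = itertools.count()
        self._candidates = {}  # sequence number -> candidate name

    def join(self, name):
        """Register a candidate; imitates creating an ephemeral sequential node."""
        seq = next(self._seq)
        self._candidates[seq] = name
        return seq

    def session_lost(self, seq):
        """Imitates an ephemeral node vanishing when its owner's session dies."""
        self._candidates.pop(seq, None)

    def leader(self):
        """The live candidate with the lowest sequence number is the leader."""
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]


coord = ToyCoordinator()
seq_a = coord.join("master-a")
seq_b = coord.join("master-b")

first_leader = coord.leader()    # master-a registered first
coord.session_lost(seq_a)        # master-a crashes; its entry disappears
second_leader = coord.leader()   # master-b takes over without manual cleanup

print(first_leader, second_leader)
```

The point of the sketch is the automatic removal on session loss: with an ordinary result backend, a crashed master's record would stay behind until something else cleaned it up.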
I need to know whether the Mesos master manages any state information itself, such as the number of slaves and frameworks, or whether it relies on ZooKeeper for all of that information.
Mesos stores cluster data in memory and in a so-called replicated log. If you are curious about what exactly is persisted across Mesos master failovers, check the Registry protobuf. Everything else, e.g. allocation information and agent state, is restored from the cluster as agents and frameworks re-register.
ZooKeeper is used for leader election only; Mesos does not store any data there. However, some Mesos frameworks, e.g. Marathon, may use ZooKeeper as persistent storage. Such a ZooKeeper cluster is often configured separately to avoid any interference with Mesos.
I have a question about interoperability in Hadoop.
Can a single ZooKeeper interact with both a Solr system and an HBase system? If yes, how does it interact with them?
Also, suppose we have a ZooKeeper that interacts with both a Solr system and an HBase system.
The requirements of the Solr and HBase systems are different.
How does ZooKeeper differentiate between the requirements of Solr and those of HBase?
ZooKeeper is a 'dumb' service; it's HBase and Solr that interact with ZooKeeper. The way it works is that they each have their own set of ZooKeeper keys that they write and read, so as long as there is no conflict in the key space, you're good.
Zookeeper is designed to be accessed from a wide range of machines, so as long as both services can interact with the same version of Zookeeper you're set. Think of it like a datastore -- you can have multiple systems interacting with the same datastore.
Here's an example:
Maybe HBase uses /hbase/flags/abc and Solr uses /solr/flags/abc. So long as they're not writing to the same path, it'll work just fine.
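As a rough sketch of that idea, the snippet below uses a plain dictionary as a stand-in for the shared ZooKeeper tree and gives each service a client that prefixes every key with its own root. `NamespacedClient` is a hypothetical helper made up for this illustration, and the paths are the example paths from above, not the keys the real services write.

```python
class NamespacedClient:
    """Toy client that prefixes every key with its service's root path."""

    def __init__(self, store, root):
        self._store = store              # shared dict standing in for the znode tree
        self._root = root.rstrip("/")

    def _full(self, path):
        return f"{self._root}/{path.lstrip('/')}"

    def set(self, path, value):
        self._store[self._full(path)] = value

    def get(self, path):
        return self._store.get(self._full(path))


shared_store = {}
hbase = NamespacedClient(shared_store, "/hbase")
solr = NamespacedClient(shared_store, "/solr")

# Both services use the same relative key, but they never collide.
hbase.set("flags/abc", "on")
solr.set("flags/abc", "off")

print(sorted(shared_store))
print(hbase.get("flags/abc"), solr.get("flags/abc"))
```

In a real deployment the root is normally a configuration setting of each service rather than something hand-rolled; HBase, for example, has the zookeeper.znode.parent property for this.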
Hope that helps.
I have installed a Hadoop single-node cluster on my Ubuntu machine and am able to run the NameNode, DataNode, etc. Now I need to install HBase and ZooKeeper, but I don't really know what they are. Could anybody give me a brief description of these tools?
Thanks
First of all I would strongly recommend you to go through the official pages of these projects. Go here for HBase and here for Zookeeper.
HBase is a NoSQL datastore that runs on top of your existing Hadoop cluster (HDFS). It provides capabilities like random, real-time reads and writes, which HDFS, being a file system, lacks. Since it is a NoSQL datastore, it doesn't follow SQL conventions and terminology. HBase provides a good set of APIs (including Java and Thrift), along with seamless integration with the MapReduce framework. But alongside all these advantages, keep in mind that random read-write is quick but always carries additional overhead. So think carefully before you make any decision.
ZooKeeper is a high-performance coordination service for distributed applications (like HBase). It exposes common services like naming, configuration management, synchronization, and group services in a simple interface, so you don't have to write them from scratch. You can use it off-the-shelf to implement consensus, group management, leader election, and presence protocols. And you can build on it for your own specific needs.
HBase relies completely on ZooKeeper. HBase gives you the option to use its built-in ZooKeeper, which is started whenever you start HBase. But that is not a good idea on a production cluster. In such scenarios it's always better to have a dedicated ZooKeeper cluster and integrate it with your HBase cluster.
Note: You should always have an odd number of nodes in your ZK quorum.
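The reasoning behind the odd-number advice is simple majority arithmetic: an ensemble stays available only while more than half of its nodes are up, so an extra even-numbered node adds cost without adding fault tolerance. A small sketch of that arithmetic (the function names are just for this example):

```python
def majority(n):
    """Smallest number of nodes that forms a strict majority of n."""
    return n // 2 + 1


def tolerated_failures(n):
    """How many nodes can fail while a majority is still alive."""
    return n - majority(n)


for n in range(3, 7):
    print(f"{n} nodes: majority={majority(n)}, tolerated failures={tolerated_failures(n)}")
```

Running it shows that 3 and 4 nodes both tolerate one failure, and 5 and 6 nodes both tolerate two.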
HTH
An overview:
ZooKeeper: In short, ZooKeeper is a distributed application (cluster) configuration and management tool, and it exists independently of HBase. From the docs:
ZooKeeper is a centralized service for maintaining configuration
information, naming, providing distributed synchronization, and
providing group services. All of these kinds of services are used in
some form or another by distributed applications. Each time they are
implemented there is a lot of work that goes into fixing the bugs and
race conditions that are inevitable. Because of the difficulty of
implementing these kinds of services, applications initially usually
skimp on them, which make them brittle in the presence of change and
difficult to manage. Even when done correctly, different
implementations of these services lead to management complexity when
the applications are deployed.
HBase: The NoSQL datastore on top of HDFS (it can use a simple local file system instead, but then it guarantees no data durability). HBase contains two primary services:
Master server - The master server (HMaster) coordinates the cluster and performs administrative operations, such as assigning regions and balancing the load.
Region servers - The region servers do the real work. A subset of the data of each table is handled by each region server. Clients talk to region servers to access data in HBase.
The connection between HBase and Zookeeper:
A distributed HBase relies completely on Zookeeper (for cluster configuration and management). In Apache HBase, ZooKeeper coordinates, communicates, and shares state between the Masters and RegionServers. HBase has a design policy of using ZooKeeper only for transient data (that is, for coordination and state communication). Thus if the HBase’s ZooKeeper data is removed, only the transient operations are affected — data can continue to be written and read to/from HBase.
Once HBase is started, you can verify the processes it has launched using the jps command:
$ jps
The command lists all the Java processes on the machine (HBase itself is a Java application). The likely output, for a pseudo-distributed setup in which HBase manages ZooKeeper (a true standalone setup runs all of these daemons inside a single JVM), is:
62019 Jps
61098 HMaster
61233 HRegionServer
61003 HQuorumPeer
Technically speaking:
By default HBase manages ZooKeeper itself, i.e. it starts and stops the ZooKeeper quorum (the cluster of ZooKeeper nodes) when we start and stop HBase. To verify the setting, look in the file conf/hbase-env.sh (in your HBase directory); there must be a line:
export HBASE_MANAGES_ZK=true
Once that is set, all we need to do is set the following properties in conf/hbase-site.xml (from the docs):
<configuration>
...
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2181</value>
<description> The port at which the clients will connect.
</description>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>rs1.example.com,rs2.example.com,rs3.example.com,rs4.example.com,rs5.example.com</value>
<description>Comma separated list of servers in the ZooKeeper Quorum.
For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
By default this is set to localhost for local and pseudo-distributed modes
of operation. For a fully-distributed setup, this should be set to a full
list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
this is the list of servers which we will start/stop ZooKeeper on.
</description>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/usr/local/zookeeper</value>
<description>Property from ZooKeeper's config zoo.cfg.
The directory where the snapshot is stored.
</description>
</property>
...
</configuration>
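To make the relationship between these properties concrete, here is a small sketch that reads the quorum and client port from an hbase-site.xml style document and combines them into the host:port list a ZooKeeper client would connect to. The `zk_connect_string` helper and the trimmed sample are assumptions made for this example, and the fallback values (localhost, 2181) simply mirror the defaults mentioned in the configuration above; HBase does this resolution internally, so this is only an illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample in the same shape as conf/hbase-site.xml.
SAMPLE = """
<configuration>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>rs1.example.com,rs2.example.com,rs3.example.com</value>
  </property>
</configuration>
"""


def zk_connect_string(xml_text):
    """Combine the quorum hosts and the client port into 'host:port,...'."""
    root = ET.fromstring(xml_text)
    props = {p.findtext("name"): p.findtext("value") for p in root.iter("property")}
    port = props.get("hbase.zookeeper.property.clientPort", "2181")
    hosts = props.get("hbase.zookeeper.quorum", "localhost").split(",")
    return ",".join(f"{host.strip()}:{port}" for host in hosts)


connect = zk_connect_string(SAMPLE)
print(connect)
```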
I'm new to Hadoop, and running under AWS Elastic Mapreduce.
I need cluster-wide atomic counters in Hadoop, and it was suggested that I use ZooKeeper for this.
I believe zookeeper is part of the Hadoop stack (right?), how would I access it from an Elastic Mapreduce job in order to set and update a cluster-wide counter?
I believe zookeeper is part of the Hadoop stack (right?)
ZooKeeper (ZK) is not part of the Hadoop Stack. It's a Top Level Project (TLP) under Apache and is independent of Hadoop. So, first ZK has to be installed on EC2. Here are the instructions for the same.
how would I access it from an Elastic Mapreduce job in order to set and update a cluster-wide counter?
Once installed, ZK can be used to build a cluster-wide counter through the ZK API. Here (1 and 2) are discussions of the approach with its pros and cons. Here are some alternatives to ZK for the same requirement.
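One common way to build such a counter is an optimistic read-modify-write loop: read the value together with its version, then write back only if the version has not changed, and retry otherwise. ZooKeeper supports this pattern because its setData call takes an expected version and rejects the write on a mismatch. The sketch below imitates that behaviour in memory; `ToyNode`, `BadVersion` and `increment` are hypothetical names for this example, not the ZooKeeper API.

```python
import threading


class BadVersion(Exception):
    """Raised when a write is attempted against a stale version."""


class ToyNode:
    """In-memory stand-in for a versioned znode (NOT the ZooKeeper API)."""

    def __init__(self, value=0):
        self._lock = threading.Lock()
        self._value = value
        self._version = 0

    def get(self):
        with self._lock:
            return self._value, self._version

    def set(self, value, expected_version):
        with self._lock:
            if expected_version != self._version:
                raise BadVersion()
            self._value = value
            self._version += 1


def increment(node):
    """Retry the read-modify-write until no other writer got in between."""
    while True:
        value, version = node.get()
        try:
            node.set(value + 1, version)
            return value + 1
        except BadVersion:
            continue  # someone else updated the node first; read again


def worker(node, times):
    for _ in range(times):
        increment(node)


counter = ToyNode()
threads = [threading.Thread(target=worker, args=(counter, 100)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

final_value, final_version = counter.get()
print(final_value)
```

Every successful increment costs one write, and contended increments cost extra retries, which is why write throughput matters for this design.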
You can, as Praveen Sripati's answer says.
But I want to clarify some points:
Keep in mind that ZK has a limited write rate (~300 requests per second).
Clients can see stale data (ZK doesn't guarantee read consistency across replicas).
I suggest using a dedicated sequence generator server, which will generate sequences for you (and this service can use ZK or whatever it wants). One example of such a service: https://github.com/kasabi/H1
Can anyone tell me whether I can read data from Amazon's HBase using org.apache.hadoop.conf.Configuration and org.apache.hadoop.hbase.client.HTablePool?
We are migrating to Amazon's EMR framework with HBase running on top of it.
The present implementation is based on pure Apache Hadoop and HBase distributions. I'm trying to verify that no code changes are needed when we migrate to Amazon's EMR.
Please share your thoughts.
While it should not require code changes, I would expect problems and adjustments related to the nature of EC2 and its networking.
HBase relies on region servers being able to renew their leases in a timely manner. If region servers are too busy, because of some massive operations running on them, they cannot do so and get kicked out of the cluster.
On Amazon, the performance of EC2 instances is much less predictable than in a dedicated cluster (unless you use cluster instances), so adjusting timeout parameters and/or the nature of your load might be needed to get the cluster to work properly.