Hadoop Interoperability

I have a question about Hadoop interoperability.
Can a single ZooKeeper ensemble interact with both a Solr system and an HBase system? If yes, how does that interaction work?
Also:
Suppose we have a ZooKeeper ensemble that is interacting with both a Solr system and an HBase system.
The requirements of the Solr and HBase systems are different.
How is ZooKeeper going to differentiate between the requirements of the Solr and HBase systems?

ZooKeeper is a 'dumb' service; it's HBase and Solr that interact with ZooKeeper, not the other way around. The way it works is that each of them reads and writes its own set of ZooKeeper keys (znodes), so as long as there is no conflict in the key space you're good.
ZooKeeper is designed to be accessed from a wide range of machines, so as long as both services can interact with the same version of ZooKeeper you're set. Think of it like a datastore: you can have multiple systems interacting with the same datastore.
Here's an example:
Maybe HBase uses /hbase/flags/abc and Solr uses /solr/flags/abc; as long as they're not writing to the same paths, it'll work just fine.
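Here's a minimal sketch of that idea using the plain ZooKeeper Java client. The connect string and the /hbase-demo and /solr-demo paths are made up for illustration; the real services use their own subtrees (HBase defaults to /hbase, and SolrCloud keeps its state under paths like /collections and /live_nodes):

```java
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SharedEnsembleDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // One ensemble serves both "applications" (address is a placeholder).
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Each application keeps to its own subtree, so they never collide.
        zk.create("/hbase-demo", "hbase state".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        zk.create("/solr-demo", "solr state".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Reads are just path lookups; ZooKeeper itself has no notion of
        // which application owns which znode.
        System.out.println(new String(zk.getData("/hbase-demo", false, null)));
        zk.close();
    }
}
```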
Hope that helps.

Related

Role of Zookeeper in Hadoop

I understand, based on the slides, that in the context of Hadoop, ZooKeeper is used for storing information about the master and the status of different tasks: which worker is working on which partition, and also which workers are available.
Why is ZooKeeper used for this metadata storage? Any data store could be used, right?
For instance, Celery can be configured with any result backend (Redis, Mongo, etc.). So in practice, couldn't Hadoop use any storage backend? Why ZooKeeper specifically?
This doc suggests that Redis, SQLite, MySQL, and PostgreSQL can be used for Celery task result storage:
https://docs.celeryq.dev/en/stable/getting-started/backends-and-brokers/index.html
ZooKeeper's ZAB protocol is used for leader election, as well as for distributed locks.
It is not simply a datastore, and no, not just any datastore can be used: coordination primitives like these depend on ZAB's totally ordered, consistent broadcast of updates, which ordinary result backends don't provide.
Celery isn't used within the Hadoop ecosystem, so I'm not sure how that's relevant to the question.
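To make "not simply a datastore" concrete, below is a minimal sketch of a distributed lock built on ZooKeeper via the Apache Curator recipes. Curator isn't mentioned in the answer above, and the connect string and lock path are placeholders:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LockDemo {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Any process in the cluster acquiring this path blocks until
        // the current holder releases it or its session dies.
        InterProcessMutex lock = new InterProcessMutex(client, "/demo/lock");
        lock.acquire();
        try {
            // critical section: only one process cluster-wide runs this
        } finally {
            lock.release();
        }
        client.close();
    }
}
```

Under the hood the recipe creates ephemeral sequential znodes and watches the predecessor node, which is exactly the kind of primitive a plain key-value backend can't offer.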

Confusion in Apache Nutch, HBase, Hadoop, Solr, Gora

I am new to all these terms and have spent some time trying to understand them, but I still have some confusions. Please correct me if I am wrong.
Nutch: It's for web crawling; using it we can crawl web pages and store them somewhere in a DB.
Solr: Solr can be used for indexing the web pages crawled by Apache Nutch. It helps in searching the indexed web pages.
HBase: It's used as an interface to interact with Hadoop. It helps in getting data in real time from HDFS. It provides a simple SQL-type interface for interacting.
Hadoop: It provides two pieces of functionality: one is HDFS (the Hadoop Distributed File System) and the other is the MapReduce processing model taken from Google's algorithms. It's basically used for offline data backup, etc.
Gora and ZooKeeper: I am not sure about these.
Confusions:
1) Is HBase a key-value-pair DB or just an interface to Hadoop? Or should I ask: can HBase exist without Hadoop? If yes, can you explain a bit more about its usage?
2) Is there any use in crawling data with Apache Nutch without indexing it into Solr?
3) To run Apache Nutch, do we need HBase and Hadoop? If not, how can we make it work without them?
4) Is Hadoop part of HBase?
Here is a good short discussion of HBase vs. Hadoop: Difference between HBase and Hadoop/HDFS
Because HBase is built on top of Hadoop, you can't really have HBase without Hadoop.
Yes, you can run Nutch without Solr; however, there do not seem to be many use cases for that, much less living examples in the wild.
Yes, you can run Nutch without Hadoop, but again there don't seem to be many real-world examples of people doing this.
Yes, Hadoop is part of HBase in the sense that there is no HBase without Hadoop, but of course Hadoop is used for other things as well.
ZooKeeper is used for configuration, naming, synchronization, etc. in Hadoop-stack workflows. Gora is an in-memory data model and persistence framework, and it is built on top of Hadoop.
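On confusion 1): HBase is a key-value (wide-column) store in its own right, and its native interface is get/put/scan rather than SQL. A minimal sketch with the standard HBase Java client (the table, row key, and column names below are made up for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseKeyValueDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("webpage"))) {

            // Write: row key -> (column family:qualifier -> value)
            Put put = new Put(Bytes.toBytes("com.example/index.html"));
            put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("html"),
                    Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // Read it back by key -- no SQL involved.
            Result result = table.get(new Get(Bytes.toBytes("com.example/index.html")));
            byte[] html = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
            System.out.println(Bytes.toString(html));
        }
    }
}
```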

Can I access ZooKeeper from an AWS Elastic MapReduce job

I'm new to Hadoop and am running under AWS Elastic MapReduce.
I need cluster-wide atomic counters in Hadoop, and it was suggested that I use ZooKeeper for this.
I believe ZooKeeper is part of the Hadoop stack (right?). How would I access it from an Elastic MapReduce job in order to set and update a cluster-wide counter?
I believe zookeeper is part of the Hadoop stack (right?)
ZooKeeper (ZK) is not part of the Hadoop stack. It's a Top-Level Project (TLP) under Apache and is independent of Hadoop. So ZK first has to be installed on EC2; here are the instructions for that.
how would I access it from an Elastic Mapreduce job in order to set and update a cluster-wide counter?
Once installed, ZK can be used to generate a cluster-wide counter using the ZK API. Here (1 and 2) are discussions of the approach, with its pros and cons. Here are some other alternatives to ZK for the same requirement.
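As one hedged illustration of the ZK-API approach, Apache Curator (a high-level ZooKeeper client library, not mentioned in the answer itself) ships a DistributedAtomicLong recipe; the connect string and counter path below are placeholders:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.atomic.AtomicValue;
import org.apache.curator.framework.recipes.atomic.DistributedAtomicLong;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CounterDemo {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Counter state lives in a znode, so every node in the cluster
        // sees (and contends for) the same value.
        DistributedAtomicLong counter = new DistributedAtomicLong(
                client, "/counters/records-processed",
                new ExponentialBackoffRetry(250, 5));

        AtomicValue<Long> result = counter.increment();
        if (result.succeeded()) {
            System.out.println("new value: " + result.postValue());
        }
        client.close();
    }
}
```

Every increment goes through the same znode, which is what makes the counter cluster-wide; it is also why the write-rate caveat in the next answer matters.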
You can, as Praveen Sripati answers.
But I want to clarify some points:
Keep in mind that ZK has a limited write rate (~300 requests per second).
Clients can see stale data (ZK doesn't guarantee read consistency across replicas).
I suggest using a dedicated sequence-generator server, which will generate sequences for you (and that service can use ZK or whatever else it wants internally). One example of such a service: https://github.com/kasabi/H1

Read data from Amazon HBase

Can anyone tell me whether I can read data from Amazon HBase using org.apache.hadoop.conf.Configuration and org.apache.hadoop.hbase.client.HTablePool?
We are migrating to Amazon's EMR framework, with HBase running on top of it.
The present implementation is based on the stock Apache Hadoop and HBase distributions. I'm trying to verify that no code changes are needed even when we migrate to Amazon's EMR.
Please share your thoughts.
While it should not be a problem in principle, I would expect issues and changes related to the nature of EC2 and its networking.
HBase relies on region servers being able to renew their leases in a timely manner. If region servers are too busy, because of some massive operation running over them, they cannot do so and get kicked off the cluster.
On Amazon, the performance of EC2 instances is much less predictable than in a dedicated cluster (unless you use cluster instances), so adjusting timeout parameters and/or the nature of your loads might be needed to get the cluster to work properly.
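To the asker's specific API question: HBase on EMR is the Apache distribution, so code written against Configuration and HTablePool should in principle work unchanged; typically only the client configuration needs to point at the EMR master's ZooKeeper quorum. A hedged sketch (the hostname and table are placeholders, and note that HTablePool was deprecated in later HBase releases in favor of Connection/Table):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class EmrHBaseReadDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The expected "migration" change: point the client at the EMR
        // master's ZooKeeper quorum (placeholder hostname).
        conf.set("hbase.zookeeper.quorum", "ec2-master.example.amazonaws.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        HTablePool pool = new HTablePool(conf, 10);
        HTableInterface table = pool.getTable("mytable");  // hypothetical table
        try {
            Result row = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(row.value()));
        } finally {
            table.close();  // returns the table to the pool
        }
        pool.close();
    }
}
```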

HBase and Hadoop

HBase requires a Hadoop installation, based on what I have read so far. It also looks like HBase can be set up either to use an existing Hadoop cluster (shared with other users) or to use a dedicated Hadoop cluster. I guess the latter is the safer configuration, but I am wondering if anybody has any experience with the former (though I am not very sure my understanding of HBase setup is correct).
I know that Facebook and other large organizations separate their HBase cluster (real-time access) from their Hadoop cluster (batch analytics) for performance reasons. Large MapReduce jobs on the cluster can impact the performance of the real-time interface, which can be problematic.
In a smaller organization or in a situation in which your HBase response time doesn't necessarily need to be consistent, you can just use the same cluster.
There aren't many (or any) concerns with coexistence other than performance concerns.
We've set it up with an existing Hadoop cluster that's 1,000 cores strong. Short answer: it works just fine, at least with Cloudera CH2 +149.88. But your mileage may vary by Hadoop version.
In distributed mode, Hadoop is used for its HDFS storage: HBase stores its HFiles on HDFS and thus benefits from the replication strategies and data-locality principles provided by the datanodes.
RegionServers mostly handle data that is local to them, but they may still have to fetch data from other datanodes.
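In configuration terms, that wiring is the hbase.rootdir setting: it names the HDFS directory under which HBase persists its HFiles and write-ahead logs, which is how HBase inherits HDFS replication. A sketch (these are real HBase keys, but the namenode address is a placeholder, and in practice they live in hbase-site.xml rather than in code):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class HBaseOnHdfsConfig {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // All HBase data (HFiles, WALs) is persisted under this HDFS path,
        // so it gets HDFS block replication and locality for free.
        conf.set("hbase.rootdir", "hdfs://namenode:8020/hbase");
        // Fully distributed mode, i.e. backed by a real HDFS cluster.
        conf.set("hbase.cluster.distributed", "true");
        System.out.println(conf.get("hbase.rootdir"));
    }
}
```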
Hope that helps you understand why and how Hadoop is used with HBase.
