HBase integration with Hadoop - sync support - hadoop

I am relatively new to HBase or Hadoop and this may sound naive. However..
I have issues in the integration of Hbase with the exisitng hadoop cluster.
For the purpose of learning, i configured a 2 nodes Hadoop 1.1.1 cluster. Lets say master and slave.
I could even run the map reduce examples without any problems.
On Master --- 1. Namenode 2. Secondary Namenode 3. Job Tracker + 4. Datanode 5. Task tracker
On Salve --- 1. Datanode 2. Task Tracker
Now, i want to run HBase 0.90.6 on top of this hadoop cluster. The problem is that this version of HBase is bundled with Hadoop-code-append jar. Now to integrate HBase 0.90.6 with Hadoop 1.1.1, i have replaced the hadoop core jar in hbase lib directory with the hadoop-core-1.1.1 jar. I also have to place the commons-configuration jar under the hbase lib folder. Then make HBase point to the hadoop cluster via hbase.rootdir property under hbase-site.xml This works perfectly fine.
The problem occurs when i start the HBase master web UI and it says
"You are currently running the HMaster without HDFS append support enabled. This may result in data loss. Please see the HBase wiki for details."
When i searched for the sync support, it looks like not all versions of Hadoop support this.
Now the question is, How to get the sync support with Hbase 0.90.6 and hadoop 1.1.1 combination?

Have you turned on append support on both hbase-site.xml and hdfs-site.xml? This works for HBase 0.96.0.
<property>
<name>dfs.support.append</name>
<value>true</value>
</property>
You will have to restart the cluster after making this change.

Related

YARN-Specify Which Application to be Run on Which Nodemanager

I have a Hadoop YARN cluster including one resourcemanager and 6 nodemanagers. I want to run both Flink and Spark applications on the cluster. So I have two main question about YARN:
In case of Spark, Should I install and config Spark on resource manager and each nodemanagers? When I want to submit a Spark application on YARN, in addition to YARN resourcemanager and nodemanagers, should Spark cluster (master and slaves) be run?
Can I set YARN such that run Flink in some special nodemanagers?
Thanks
For the first question, that depends on whether you're using a packaged Hadoop distribution (like Cloudera CDH, Hortonworks HDP for example) or not. The distros will likely take care of this. If you're not using a distribution, you need to consider if you want to run Spark on YARN or Spark stand-alone.
For the second question, you can specify special Node Managers if you are using Capacity Scheduler with the node-labelling feature enabled and if you are using Hadoop 2.6 and higher.

Spark cluster - read/write on hadoop

I would like to read data from hadoop, process on spark, and wirte result on hadoop and elastic search. I have few worker nodes to do this.
Spark standalone cluster is sufficient? or Do I need to make hadoop cluster to use yarn or mesos?
If standalone cluster mode is sufficient, should jar file be set on all node unlike yarn, mesos mode?
First of all, you can not write data in Hadoop or read data from Hadoop. It is HDFS (Component of Hadoop ecosystem) which is responsible for read/write of data.
Now coming to your question
Yes, it possible to read data from HDFS and process it in spark engine and then write the output on HDFS.
YARN, mesos and spark standalone all are cluster managers and you can use any one of them to do management of resources in your cluster and it had nothing to do with hadoop. But since you want to read and write data from/to HDFS then you need to install HDFS on cluster and thus it is better to install hadoop on your all nodes that will also install HDFS on all nodes. Now whether you want to use YARN, mesos or spark standalone that is your choice all will work with HDFS I myself use spark standalone for cluster management.
It is not clear about which jar files you are talking to but I assume it will be of spark then yes you need to set the path for spark jar on each node so that there will be no contradiction in paths when spark run's.

Druid + Hadoop (for both uses, deep-store & indexing)

If I have Hadoop server (pseudo-distributed mode) running on a separate machine, do I still need to have these files under my Druid's conf dir ? : http://druid.io/docs/latest/configuration/hadoop.html
The way I see it:
Looks like those -site.xml files are for Hadoop server..., and Druid only acts as Hadoop client. So I don't think Druid needs the hdfs-site.xml.
Core-site.xml..., ok, I can get it. I mean, Druid nees to know the IP of the name node (hadoop).
Mapred-site.xml, partially. Druid needs to know the status of mapreduce jobs (I suppose it will delegate the indexing to Hadoop as MR job). So it needs to communicate with those job trackers to see if the indexing is finished / failed / in progress. For that, it needs the URL of Hadoop JT.
However Druid does not need this prperty "mapreduce.cluster.local.dir", because it does not participate actively in MR job.
Yarn-site.xml? Maybe it should stay, partially. At least for submitting a job (?).
What about HDFS-site.xml? I think this can be scrapped completely.
Capacity-scheduler.xml? It can go.
Please correct me If I'm wrong.
These questions / doubts arises because I'm quite new to hadoop. I have my hadoop setup running. Pseudo distributed mode. I also tested it with javascript webhdfs library to write and read file. Also have tried the sample MR jobs provided by the hadoop dist. So I guess my hadoop setup is fine. I'm just a bit unsure on the Druid site, partly because the doc is not ver clear about it.
Btw.... I have hadoop 2.7.2... While the hadoop-client libs used by Druid is still on 2.3.0.
Should I downgrade my hadoop server to 2.3.0?
http://druid.io/docs/latest/operations/other-hadoop.html
Thansk,
Raka
Please add the mapred-site.xml core-site.xml hdfs-site.xml yarn-site.xml to the classpath.
Also you don't need to downgrade druid works well with 2.7.X.
As you can see in the doc you can use multiple version of hadoop.

Hadoop YARN clusters - Adding node at runtime

I am working on a solution providing run-time ressources addition to an Hadoop Yarn cluster. The purpose is to handle heavy peaks on our application.
I am not an expert and I need help in order to approve / contest what I understand.
Hadoop YARN
This application an run in a cluster-mode. It provides ressource management (CPU & RAM).
A spark a application, for example, ask for a job to be done. Yarn handles the request and provides an executor computing on the Yarn cluster.
HDFS - Data & Executors
The datas are not shared through executors, so they have to be stored in a file System. In my case : HDFS. That means I will have to run a copy of my spark streaming application in the new server (hadoop node).
I am not sure of this :
The yarn cluster and HDFS are different, writing on HDFS won't write on the new hadoop node local data (because it is not an HDFS node).
As I will only write on HDFS new data from a spark streaming application, creating a new application should not be a problem.
Submit the job to yarn
--- peak, resources needed
Instance new server
Install / configure Hadoop & YARN, making it a slave
Modifying hadoop/conf/slaves, adding it's ip adress (or dns name from host file)
Moddifying dfs.include and mapred.include
On host machine :
yarn -refreshNodes
bin/hadoop dfsadmin -refreshNodes
bin/hadoop mradmin -refreshNodes
Should this work ? refreshQueues sounds not really useful here as it seems to only take care of the process queue.
I am not sure if the running job will increase it's capacity. Another idea is to wait for the new ressources to be available and submit a new job.
Thanks for you help

Configuring Hadoop, HBase and Hive Cluster

I am a newbie to Hadoop, HBase and Hive. I installed Hadoop, HBase and Hive in pseudodistributed mode and everything works fine.
Now I am planning to set up an simple Hadoop Cluster (5 nodes) with Hive, HBase and ZooKeeper. I´ve read several documentations and instructions before but i could not find a good explanation for my question. I´m not sure, where to run all the daemons. This is my consideration:
Node_1 (Master)
NameNode
JobTrakcer
HBase Master
ZooKeeper (Standalone node; managed by HBase)
Node_2 (Backup_Master)
SecondaryNameNode
Node_3 (Slave1)
DataNode1
TaskTracker1
RegionServer1
Node_4 (Slave2)
DataNode2
TaskTracker2
RegionServer2
Node_5 (Slave3)
DataNode3
TaskTracker3
RegionServer3
I know, in production it is recommended to run ZooKeeper ensemble at an odd number of nodes (seperate Cluster). But for a simple cluster, is it OK to set up a standalone ZooKeeper node which runs on the master node?
Another question is regarding Hive: I know that Hive is a Hadoop client. Should I also install Hive on the master node? Does it make sense?
Thanks for all tips and comments!
Hakan
Note: I have just 5 machines to simulate a cluster.
For testing purposes, I believe you can setup Zookeeper on the master node; I did install all of them on the same server.
What I do not understand from your question why you installed hadoop in pseudo distributed mode if you have 5 machines in your cluster? it might be better to install a fully distributed mode.
For hive, it seems that you have to install it with hadoop
Hive uses hadoop that means:
you must have hadoop in your path OR export HADOOP_HOME=<hadoop-install-dir>
For hive, it seems that you have to install it with hadoop
Hive uses hadoop that means:
you must have hadoop in your path OR export HADOOP_HOME=
#iTech : That´s right. If you install hive, you have to set the variable "HADOOP_HOME" to your hadoop installation path.But that´s not the problem..As I said, I worked before with Hadoop and Hive in pseudo distributed mode.
The only problem is, I´m not sure where to run the all the daemons in a 5-node-cluster in fully distributed mode. I´m confused because I want to run a lot of Tools together (Hadoop, HBase and Hive)
Hope that someone have a good tip...
If you are planning to use the described cluster for testing purposes, it is OK to have all your master nodes on the same server. Also you can move the SecondaryNameNode role to Node_1, since SecondaryNameNode is not a backup server for the NameNode, it is there to make checkpoints of your NameNode. So it makes sense to use the Node_2 as another "worker" node in you cluster, or the HiveServer2 and the metastore.
Hope this will help.

Resources