Apache HAWQ installation built on top of HDFS - hadoop

I would like to install Apache HAWQ based on the Hadoop.
Before installing HAWQ, I need to install Hadoop and configure all my nodes.
I have four nodes as below, and my question follows.
Should I install a Hadoop distribution on hawq-master?
1. hadoop-master //namenode, Secondary Namenode, ResourceManager, HAWQ Standby
2. hawq-master //HAWQ Master
3. datanode01 //Datanode, HAWQ Segment
4. datanode02 //Datanode, HAWQ Segment
I wrote the role of each node next to its name above.
In my opinion, I should install Hadoop on hadoop-master, datanode01 and datanode02, setting hadoop-master as the namenode (master) and the others as datanodes (slaves). Then I will install Apache HAWQ on all the nodes, setting hawq-master as the HAWQ master, hadoop-master as the HAWQ standby, and the other two nodes as HAWQ segments.
What I want is to install HAWQ on top of Hadoop. So I think hawq-master should be built on top of Hadoop, but it has no connection with hadoop-master.
If I follow the procedure above, I think I don't have to install a Hadoop distribution on hawq-master. Is my reasoning right for successfully installing HAWQ on top of Hadoop?
If Hadoop should be installed on hawq-master, then which one is correct?
1. `hawq-master` should be set as `namenode`.
2. `hawq-master` should be set as `datanode`.
Any help will be appreciated.

Honestly, there are no strict constraints on how Hadoop and HAWQ are installed, as long as they are configured correctly.
Regarding your concern, "I think hawq-master should be built on top of Hadoop, but it has no connection with hadoop-master": IMO, it should be "HAWQ should be built on top of Hadoop". We configure the HAWQ master's conf files (hawq-site.xml) to make HAWQ connect to Hadoop, as sketched below.
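For example (my sketch, not part of the original answer), the key entry is hawq_dfs_url in hawq-site.xml; assuming the namenode runs on hadoop-master and listens on the default port 8020, it could look roughly like this:

<property>
    <name>hawq_dfs_url</name>
    <value>hadoop-master:8020/hawq_default</value>
</property>

Here hadoop-master:8020 and the hawq_default directory are placeholder values to adjust to your own cluster.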
Usually, for the HAWQ master and Hadoop master components, we could install each component on its own node, but we could also co-locate some of them on one node to save machines. For the HDFS datanode and HAWQ segment, we often install them together. Considering the workload of each machine, we could install them as below:
node            hadoop role          hawq role
hadoop-master   namenode             hawq standby
hawq-master     secondarynamenode    hawq master
other nodes     datanode             hawq segment
If you configure HAWQ with YARN integration, there will also be a resourcemanager and nodemanagers in the cluster (a configuration sketch follows the table):
node            hadoop role                   hawq role
hadoop-master   namenode                      hawq standby
hawq-master     snamenode, resourcemanager    hawq master
other nodes     datanode, nodemanager         hawq segment
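As an illustration of YARN mode (again my sketch, not from the original answer), hawq-site.xml carries properties like the following; the hostname and ports below assume a resourcemanager on hawq-master with default YARN ports and should be replaced with your own values:

<property>
    <name>hawq_global_rm_type</name>
    <value>yarn</value>
</property>
<property>
    <name>hawq_rm_yarn_address</name>
    <value>hawq-master:8032</value>
</property>
<property>
    <name>hawq_rm_yarn_scheduler_address</name>
    <value>hawq-master:8030</value>
</property>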
Installing them together does not mean they have connections; it is your config files that let them reach each other.
You can install all the master components together, but that may be too heavy for one machine. Read more about Apache HAWQ at http://incubator.apache.org/projects/hawq.html and read some docs at http://hdb.docs.pivotal.io/211/hdb/index.html.
Besides, you could subscribe to the dev and user mailing lists: send an email to dev-subscribe@hawq.incubator.apache.org / user-subscribe@hawq.incubator.apache.org to subscribe, and send emails to dev@hawq.incubator.apache.org / user@hawq.incubator.apache.org to ask questions.

Related

Adding a node to hadoop cluster without restarting master

I have created a Hadoop cluster and want to add a new node to the cluster, running as a slave, without restarting the master node.
How can this be achieved?
Datanodes and nodemanagers can be added without restarting the namenode(s) or resource manager(s).
More specifically, these commands need to be run on the machines hosting the respective services:
Namenode
hdfs dfsadmin -refreshNodes
ResourceManager
yarn rmadmin -refreshNodes
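As a rough sketch (assuming a Hadoop 2.x layout and that the new host already has Hadoop installed with the cluster configuration copied over), the daemons are started on the new slave itself, and then the refresh commands above are run on the master hosts:

# on the new slave node: bring up the worker daemons so they register with the masters
hadoop-daemon.sh start datanode
yarn-daemon.sh start nodemanager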

Adding external ssd to single node hadoop cluster

I have to add an external SSD to my single-node Hadoop cluster and use that disk as a datanode directory where my blocks will be stored.
I have a running Apache single-node Hadoop cluster. The new requirement is: can we use the SSD as another datanode directory, and how?
Thanks in advance
Shakir
Yes, you may add it as a datanode. Install Hadoop on the new node, set up passwordless SSH, and copy the configuration to the new node.
Read this: Steps to add a node in Hadoop cluster
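A minimal sketch of those steps, assuming the new machine is reachable as newnode (a placeholder hostname), Hadoop runs as the hadoop user, and $HADOOP_HOME is the same on both machines:

# from the existing node: set up passwordless ssh to the new node
ssh-copy-id hadoop@newnode
# copy the cluster configuration to the new node
scp $HADOOP_HOME/etc/hadoop/*-site.xml hadoop@newnode:$HADOOP_HOME/etc/hadoop/
# on newnode: start the datanode so it registers with the namenode
hadoop-daemon.sh start datanode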

How to bring down your namenode?

How do I bring down my namenode in Hadoop 1.2.1 on CentOS and swap the namenode with a datanode instance? I also have to make sure no data is lost during the process.
I am using Hadoop 1.2.1 with master, slave 1 and slave 2 nodes.
I am looking for the Unix commands or the changes I need to make in the configuration files.
Please ask for any particular details if needed!
You can take a backup of the namenode metadata and kill the namenode. Install the namenode packages on the other node of interest and put the backup copy of the metadata in the namenode data directory. Now start the namenode; it should pick up your old metadata. Remember to change the namenode details in all config files.
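A rough sketch of that procedure for Hadoop 1.x, assuming dfs.name.dir points at /data/dfs/name (a placeholder path) and slave1 is the node taking over as namenode:

# on the old namenode: stop the daemon and back up the metadata
hadoop-daemon.sh stop namenode
tar czf name-metadata.tar.gz -C /data/dfs/name .
# move the backup to slave1 and restore it into its dfs.name.dir
scp name-metadata.tar.gz slave1:/tmp/
ssh slave1 'mkdir -p /data/dfs/name && tar xzf /tmp/name-metadata.tar.gz -C /data/dfs/name'
# point fs.default.name in core-site.xml on every node at slave1, then on slave1:
hadoop-daemon.sh start namenode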

How to run hadoop balancer from client node?

I want to ask how I can run the Hadoop balancer. I have tried running the hadoop balancer command on the namenode before, but it had no effect at all (my new datanode is still empty). I also read that the Hadoop balancer is not run on the namenode but on a client node. So what is a client node, how can I configure it, and how can the client node access the Hadoop file system?
Thanks all, I need your suggestions.
A client node is also known as an edge node. Usually, not all developers in an organization have access to every node of the cluster, so to give developers access we usually provide a client node. You need to install the hadoop-client packages on the client node. If you are using a Cloudera RPM-based installation, you can use the command below.
sudo yum install hadoop-client
After the client node installation, update your configuration files like core-site.xml, hdfs-site.xml and other required files. Now when you execute Hadoop CLI commands, they will be executed against the cluster.
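The essential part is pointing the client at the namenode. As a sketch (namenode-host and port 8020 are placeholder values), core-site.xml on the client would contain something like:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:8020</value>
</property>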
The balancer can be run from any node: a client machine or any node in the cluster.
sudo -u hdfs hdfs balancer
Regarding the newly added datanode, just check in the namenode web UI whether your node has been added. If you can see it there, check the logs.

Is ZooKeeper part of Hadoop or a separate installation?

As I read in various tutorials, ZooKeeper helps to coordinate and sync various Hadoop clusters.
Currently I have installed Hadoop 2.5.0. When I run jps it displays:
4494 SecondaryNameNode
8683 Jps
4679 ResourceManager
3921 NameNode
4174 DataNode
4943 NodeManager
There is no process for ZooKeeper.
I am in doubt: is ZooKeeper part of HDFS, or do we need to install it manually?
If you use Hadoop only, ZooKeeper is not required. Other tools in the Hadoop ecosystem, e.g. HBase, depend on ZooKeeper, but you don't need to install it separately: HBase includes it, and if you start up HBase, ZooKeeper will start up at the same time.
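To illustrate, the behaviour described above is HBase's managed-ZooKeeper mode. Assuming a default HBase install, the relevant setting in conf/hbase-env.sh is:

# let HBase start and stop its own bundled ZooKeeper (this is the default)
export HBASE_MANAGES_ZK=true

With that in place, start-hbase.sh brings up the bundled ZooKeeper along with the HBase daemons, and jps will then show an HQuorumPeer process.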
