I am going to install Hadoop and HBase on Ubuntu. I searched for a good guide but could not find one that is clear and detailed. I need a detailed link that I can follow to set up Hadoop and HBase easily.
Thanks
You didn't mention whether you want to set these up in pseudo-distributed or fully distributed mode (single-node or multi-node). Anyway, here are some links that should be helpful:
Hadoop single-node cluster
Hadoop multi-node cluster
And for HBase, I think you should see these links:
Install HBase in pseudo-distributed mode
HBase installation in fully distributed mode
Hope it will help you - Thanks
For pseudo-distributed mode I would suggest the following links:
HADOOP INSTALLATION
http://preciselyconcise.com/apis_and_installations/hadoop_installation.php
HBASE INSTALLATION
http://preciselyconcise.com/apis_and_installations/hbase_installation.php
Related
If I have a Hadoop server (pseudo-distributed mode) running on a separate machine, do I still need to have these files under my Druid conf dir? http://druid.io/docs/latest/configuration/hadoop.html
The way I see it:
It looks like those -site.xml files are meant for the Hadoop server, and Druid only acts as a Hadoop client. So I don't think Druid needs hdfs-site.xml.
Core-site.xml: OK, I can see why. Druid needs to know the IP of the name node (Hadoop).
Mapred-site.xml: partially. Druid needs to know the status of MapReduce jobs (I suppose it delegates the indexing to Hadoop as an MR job). So it needs to communicate with the job tracker to see whether the indexing is finished, failed or in progress. For that, it needs the URL of the Hadoop JT.
However, Druid does not need the property "mapreduce.cluster.local.dir", because it does not actively participate in the MR job.
Yarn-site.xml? Maybe it should stay, partially. At least for submitting a job (?).
What about hdfs-site.xml? I think this can be scrapped completely.
Capacity-scheduler.xml? It can go.
Please correct me if I'm wrong.
These questions and doubts arise because I'm quite new to Hadoop. I have my Hadoop setup running in pseudo-distributed mode. I also tested it with a JavaScript WebHDFS library to write and read a file, and I have tried the sample MR jobs provided with the Hadoop distribution. So I guess my Hadoop setup is fine. I'm just a bit unsure on the Druid side, partly because the doc is not very clear about it.
By the way, I have Hadoop 2.7.2, while the hadoop-client libs used by Druid are still on 2.3.0.
Should I downgrade my Hadoop server to 2.3.0?
http://druid.io/docs/latest/operations/other-hadoop.html
Thanks,
Raka
Please add mapred-site.xml, core-site.xml, hdfs-site.xml and yarn-site.xml to the classpath.
Also, you don't need to downgrade; Druid works well with 2.7.x.
As you can see in the doc, you can use multiple versions of Hadoop.
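As a rough sketch of how one might sanity-check that setup, the Python snippet below verifies that the four files are present in a directory and reads the NameNode address from core-site.xml. The directory path and all function names are hypothetical and not part of Druid or Hadoop; `fs.defaultFS` is the Hadoop 2.x property that holds the default file system URI.

```python
import os
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# The four client-side files named in the answer above.
REQUIRED_FILES = ["core-site.xml", "hdfs-site.xml", "mapred-site.xml", "yarn-site.xml"]


def missing_site_files(conf_dir):
    """Return the required *-site.xml files that are absent from conf_dir."""
    return [name for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(conf_dir, name))]


def read_property(site_xml_path, prop_name):
    """Return the value of one <property> in a Hadoop *-site.xml file, or None."""
    root = ET.parse(site_xml_path).getroot()
    for prop in root.findall("property"):
        if (prop.findtext("name") or "").strip() == prop_name:
            return (prop.findtext("value") or "").strip()
    return None


def namenode_host_port(core_site_path):
    """Extract (host, port) from fs.defaultFS, e.g. hdfs://host:9000."""
    uri = read_property(core_site_path, "fs.defaultFS")
    if uri is None:
        return None
    parsed = urlparse(uri)
    return parsed.hostname, parsed.port


if __name__ == "__main__":
    conf_dir = "/path/to/druid/conf"  # hypothetical location, adjust to your install
    print("missing files:", missing_site_files(conf_dir))
```

If `missing_site_files` returns an empty list and `namenode_host_port` points at the machine running your NameNode, the client-side configuration is at least pointing at the right cluster.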
I'm trying to set up a multi-node Apache Hadoop cluster. I'm following their tutorial here: http://hadoop.apache.org/docs/stable/cluster_setup.html. I've set up a single-node Hadoop installation on each individual node and installed Sun Java on each. Unfortunately, the documentation is not very clear and is a bit outdated. Am I supposed to (under the 'Configuring the Hadoop Daemons' section) update those files (.conf/*-site.xml) on every single node?
Also, how am I supposed to locate the host:port/IP:port pair for each node?
Sorry, I am new to all of this. Thank you in advance for your help!
I am a newbie to Hadoop, HBase and Hive. I installed Hadoop, HBase and Hive in pseudo-distributed mode and everything works fine.
Now I am planning to set up a simple Hadoop cluster (5 nodes) with Hive, HBase and ZooKeeper. I've read several guides and instructions, but I could not find a good answer to my question. I'm not sure where to run all the daemons. This is the layout I am considering:
Node_1 (Master)
NameNode
JobTracker
HBase Master
ZooKeeper (Standalone node; managed by HBase)
Node_2 (Backup_Master)
SecondaryNameNode
Node_3 (Slave1)
DataNode1
TaskTracker1
RegionServer1
Node_4 (Slave2)
DataNode2
TaskTracker2
RegionServer2
Node_5 (Slave3)
DataNode3
TaskTracker3
RegionServer3
I know that in production it is recommended to run a ZooKeeper ensemble on an odd number of nodes (as a separate cluster). But for a simple cluster, is it OK to set up a standalone ZooKeeper node that runs on the master node?
Another question concerns Hive: I know that Hive is a Hadoop client. Should I also install Hive on the master node? Does that make sense?
Thanks for all tips and comments!
Hakan
Note: I have just 5 machines to simulate a cluster.
For testing purposes, I believe you can set up ZooKeeper on the master node; I installed all of them on the same server.
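For background on the odd-number recommendation: ZooKeeper needs a strict majority of its servers to be available, so an even-sized ensemble tolerates no more failures than the odd size just below it. A small sketch of that arithmetic (the function names are illustrative, not part of any ZooKeeper API):

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble must have at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Print ensemble size, quorum size and tolerated failures for 1 to 5 servers.
    for n in range(1, 6):
        print(n, quorum_size(n), tolerated_failures(n))
```

A standalone node tolerates zero failures, which is acceptable for a test cluster but is the reason production setups use three or five servers.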
What I do not understand from your question is why you installed Hadoop in pseudo-distributed mode if you have 5 machines in your cluster. It might be better to install it in fully distributed mode.
For Hive, it seems that you have to install it alongside Hadoop.
Hive uses Hadoop, which means:
you must have hadoop in your path OR export HADOOP_HOME=<hadoop-install-dir>
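A quick way to check that condition before starting Hive is sketched below in Python. The function name is hypothetical; it only tests whether a `hadoop` executable is on `PATH` or whether `HADOOP_HOME` names an existing directory, and it does not validate the installation itself.

```python
import os
import shutil


def hadoop_is_reachable(environ=None):
    """True if `hadoop` is on PATH or HADOOP_HOME points at an existing directory."""
    env = os.environ if environ is None else environ
    # shutil.which returns None when the executable is not found on the given path.
    on_path = shutil.which("hadoop", path=env.get("PATH", "")) is not None
    home = env.get("HADOOP_HOME")
    home_ok = bool(home) and os.path.isdir(home)
    return on_path or home_ok


if __name__ == "__main__":
    print("Hadoop reachable:", hadoop_is_reachable())
```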
#iTech: That's right. If you install Hive, you have to set the variable "HADOOP_HOME" to your Hadoop installation path. But that's not the problem. As I said, I have worked with Hadoop and Hive in pseudo-distributed mode before.
The only problem is that I'm not sure where to run all the daemons in a 5-node cluster in fully distributed mode. I'm confused because I want to run a lot of tools together (Hadoop, HBase and Hive).
I hope that someone has a good tip...
If you are planning to use the described cluster for testing purposes, it is OK to have all your master roles on the same server. You can also move the SecondaryNameNode role to Node_1: the SecondaryNameNode is not a backup server for the NameNode, it is only there to make checkpoints of your NameNode. So it makes sense to use Node_2 as another "worker" node in your cluster, or to run HiveServer2 and the metastore on it.
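To make the suggestion concrete, here is the adjusted layout written out as a small Python sketch. It assumes the "Node_2 as another worker" option rather than the HiveServer2 option; node and role names follow the question, and the helper function is purely illustrative.

```python
# Roles that every worker node runs, as listed in the question.
WORKER_ROLES = ["DataNode", "TaskTracker", "RegionServer"]

# Assumed layout: all master roles on Node_1, Node_2 repurposed as a worker.
layout = {
    "Node_1": ["NameNode", "SecondaryNameNode", "JobTracker", "HBase Master", "ZooKeeper"],
    "Node_2": list(WORKER_ROLES),
    "Node_3": list(WORKER_ROLES),
    "Node_4": list(WORKER_ROLES),
    "Node_5": list(WORKER_ROLES),
}


def hosts_running(cluster_layout, role):
    """Return the sorted names of all nodes that run the given role."""
    return sorted(node for node, roles in cluster_layout.items() if role in roles)


if __name__ == "__main__":
    for role in ["NameNode", "SecondaryNameNode", "DataNode", "RegionServer"]:
        print(role, "->", hosts_running(layout, role))
```

With this layout the cluster has four workers instead of three, at the cost of having every master role on a single machine.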
Hope this will help.
I am a Mahout/Hadoop beginner.
I am trying to run the Mahout examples given in the book "Mahout in Action". I am able to run the examples in Eclipse without Hadoop.
Can you please let me know how to run the same examples on a Hadoop cluster?
This wiki page lists the different algorithms implemented in Mahout and how to run them. Many of them take the option below as an argument:
-xm "execution method: sequential or mapreduce"
The Mahout requirements mention that it works on Hadoop 0.20.0+. See this tutorial on how to set up Hadoop on a single node and on a multi-node cluster on Ubuntu.
I have a single-node pseudo-distributed Hadoop setup on a Unix system in the network. What are the minimum steps to add another computer/node (Cygwin) on the network to form a Hadoop cluster?
Instructions for a single-node Hadoop cluster:
http://www.michael-noll.com/blog/2007/08/05/running-hadoop-on-ubuntu/
Instructions for a multi-node Hadoop cluster:
http://www.michael-noll.com/blog/2007/08/09/running-hadoop-on-ubuntu-part-2-multi-node-cluster/
The author, Michael, makes installation and configuration very easy and has been keeping the instructions up to date.