Create a 2-node Hadoop cluster between a Windows host and an Ubuntu guest VM

I have a JAR file that I want to share between a Windows host and my VirtualBox guest machine, which runs Ubuntu. Because the two OSes need to share the same directory (it must have the same name on both), the only way I found to do this is to create an HDFS directory, that is, a 2-node cluster sharing the same HDFS directory.
I have managed to set up a single-node cluster on my Windows host and on the Ubuntu VM separately, and both work correctly. Now I want to run them as a multi-node cluster. I tried following the instructions at http://doctuts.readthedocs.io/en/latest/hadoop.html#multi-node-installation , but it didn't work: when I start the master node, it does not detect the VM node.
I set up passwordless SSH correctly, but I think it may not be working because of my configuration. Here are the three files that I changed to try to make the 2-node cluster:
core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.1.2:54310</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapreduce.job.user.name</name>
<value>%USERNAME%</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.apps.stagingDir</name>
<value>/user/%USERNAME%/staging</value>
</property>
<property>
<name>mapreduce.jobtracker.address</name>
<value>192.168.1.2:54311</value>
</property>
</configuration>
Again, the single-node clusters work well, but the master doesn't detect the Ubuntu VM.
Can someone help me? Thank you so much.

Related

Hadoop localhost:9870 browser interface is not working

I need to do data analysis using Hadoop, so I installed Hadoop and configured it as shown below. But localhost:9870 is not working, even though I format the namenode every time I work with it. Some articles and answers on this forum mention that 9870 replaced 50070. I am on Windows 10. I also tried other answers on this forum, but none of them worked. The Java home and Hadoop home paths are set, and the paths to Hadoop's bin and sbin are set up as well. Can anyone please tell me what I am doing wrong here?
I followed this site for the installation and configuration:
https://medium.com/#pedro.a.hdez.a/hadoop-3-2-2-installation-guide-for-windows-10-454f5b5c22d3
core-site.xml
I have set up the Java path in this xml as well.
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9870</value>
</property>
hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-3.2.2\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-3.2.2\data\datanode</value>
</property>
mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
If you look at the namenode logs, they very likely contain an error saying that a port is already in use.
fs.defaultFS should use port 9000, as in the official single-node guide - https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html ; you shouldn't change this without good reason.
The Namenode web UI address isn't the value in fs.defaultFS. Its default port is 9870, and it is defined by dfs.namenode.http-address in hdfs-site.xml.
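As a minimal sketch of the point above, assuming a Hadoop 3.x single-node setup on localhost: fs.defaultFS carries the HDFS RPC address, while the web UI address is a separate property. The dfs.namenode.http-address entry here only restates the Hadoop 3.x default and could be left out entirely.

```xml
<!-- core-site.xml: HDFS RPC endpoint used by clients, not the web UI -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: NameNode web UI; 0.0.0.0:9870 is the Hadoop 3.x default -->
<configuration>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>0.0.0.0:9870</value>
  </property>
</configuration>
```

With these values the browser UI would be at http://localhost:9870, while HDFS clients would talk to hdfs://localhost:9000.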
"need to do data analysis"
You can do analysis on Windows without Hadoop using Spark, Hive, MapReduce, etc. directly, and it will have direct access to your machine without being limited by YARN container sizes.

Which mode of Hadoop and HDFS needs to be installed?

I am a beginner with Hadoop and HDFS. I have a situation where I need to connect 3 different PCs: one holding a file, one running NiFi, and one running Hadoop + HDFS.
Machine 1: will have a .csv file
Machine 2 (personal laptop): will have NiFi running on it.
Machine 3 (running at my office): will have Hadoop + HDFS on it.
Now I would like to send a csv file from machine 1 to my database running on machine 3, using NiFi running on machine 2.
I connect to machine 3 over SSH, through what is basically a router at my office.
Question: How can I connect from machine 2, which runs NiFi, to machine 3 so that NiFi can send the file to my Hadoop HBase?
Should I use a public key for the configuration, or should I use a different setup or server?
My Hadoop and HDFS configuration files are as follows:
hbase-site.xml
<configuration>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://localhost:9000/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>2222</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hadoop/zookeeper</value>
</property>
<property>
<name>hbase.wal.provider</name>
<value>filesystem</value>
</property>
<property>
<name>hbase.tmp.dir</name>
<value>./tmp</value>
</property>
</configuration>
core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/hdoop/tmpdata</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hdfs/datanode</value>
</property>
</configuration>
Please look at the configuration files and let me know which properties I need to change. So far I have installed HDFS in pseudo-distributed mode on machine 3.
Pseudo-distributed and fully distributed modes aren't any different.
You say only machine 3 has HDFS. Therefore only it needs to run a Namenode and Datanode, set up in a distributed fashion, meaning that external clients will be able to communicate with it.
More specifically, no config file should use localhost; each should instead use the LAN IP or hostname.
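A rough sketch of what that change could look like for the files in the question. The hostname machine3.example.lan is a made-up placeholder for machine 3's LAN hostname or IP and is not from the original post; the ports are the ones already used in the question.

```xml
<!-- core-site.xml on machine 3 (machine3.example.lan is a hypothetical name) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://machine3.example.lan:9000</value>
  </property>
</configuration>

<!-- hbase-site.xml on machine 3: point HBase at the same non-localhost address -->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://machine3.example.lan:9000/hbase</value>
  </property>
</configuration>
```

The same name must resolve from machine 2, for example through its hosts file, so that NiFi can reach the cluster.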

Set up a multi-node Hadoop cluster on 2 Windows 10 machines

I am trying to set up a multi-node Hadoop cluster between 2 Windows devices. I am using Hadoop 2.9.2.
How can I achieve that?
After a lot of trial and error, the following did the job for me.
Do the same configuration as in the previous answer by @AbsoluteBeginner.
Disable the Windows firewall on all machines (I think you could keep it on and just adjust the rules, but that's for you to find out).
Run hdfs namenode -format on all nodes (master and slaves).
Make sure that the datanode folder is empty on all 3 nodes (just Shift+Del its contents).
On the master node, run start-all.cmd. All of the following should appear in the jps output:
50436 NameNode
54696 NodeManager
54744 DataNode
60028 Jps
7340 ResourceManager
On the slave nodes, run start-all.cmd. All of the following should appear:
6116 DataNode
2408 Jps
3208 NodeManager
Note: the reason NameNode and ResourceManager aren't appearing is that they are already running on the master node and occupy the port; you only need the master's ResourceManager and NameNode running.
Note: if you followed a multi-node tutorial for Linux, the master node also shows SecondaryNameNode when executing jps. I'm not really sure why it's not appearing on Windows.
Go to master:50070 and navigate to Datanodes; you should see your datanodes listed there.
Go to master:8088 and navigate to Nodes; you should see your nodes listed there.
1. Install an OpenSSH server on both of your systems using this guide. Generating a new SSH public and private key pair on your local computer is the first step towards authenticating with a remote server without a password. Add the public key to authorized_keys and add your hostname to the list of known hosts. You can find guides on how to do this by searching the internet.
2. Add your Hadoop master and slave IPs to your hosts file. Open "C:\Windows\System32\drivers\etc\hosts"
and add
your-master-ip hadoopMaster
your-slave-ip hadoopSlave
You can use these names in your configuration files.
Much like on Linux systems, these are the steps you have to follow in order to run a Hadoop cluster on Windows:
3. First you need to have Java installed on your system, and JAVA_HOME must be added to your environment variables. You can download Java from the Oracle website and install it.
4. Download the Hadoop binary files from the Apache website and extract them.
5. Note that you shouldn't have spaces in your folder names, or you might encounter problems.
6. Next you have to add the Java and Hadoop home and bin folders to your environment variables. Just open the Start menu, type "environment variable", and open the edit environment variables window from Control Panel.
7. Add
HADOOP_HOME="root of your hadoop extracted folder\hadoop-2.9.2"
HADOOP_BIN="root of hadoop extracted folder\hadoop-2.9.2\bin"
JAVA_HOME="root of your JDK installation"
8. Edit your "path" environment variable and add %JAVA_HOME%, %HADOOP_HOME%, %HADOOP_BIN%, %HADOOP_HOME%\sbin to your PATH one by one.
9. You can validate your additions by opening cmd and typing:
echo %HADOOP_HOME%
echo %HADOOP_BIN%
echo %PATH%
CONFIGURING HADOOP:
10. Open "your hadoop root\hadoop-2.9.2\etc\hadoop\hadoop-env.cmd" and add the following lines to the bottom of the file:
set HADOOP_PREFIX=%HADOOP_HOME%
set HADOOP_CONF_DIR=%HADOOP_PREFIX%\etc\hadoop
set YARN_CONF_DIR=%HADOOP_CONF_DIR%
set PATH=%PATH%;%HADOOP_PREFIX%\bin
11. Open "your-hadoop-root\hadoop-2.9.2\etc\hadoop\hdfs-site.xml" and add the below content:
<property>
<name>dfs.name.dir</name>
<value>your desired address</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>your desired address</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.datanode.use.datanode.hostname</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
<value>false</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>hadoopMaster:50070</value>
<description>Your NameNode hostname for http access.</description>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoopMaster:50090</value>
<description>Your Secondary NameNode hostname for http access.</description>
</property>
12. Edit your core-site.xml and add:
<property>
<name>fs.default.name</name>
<value>hdfs://hadoopMaster:9000</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>your-temp-directory</value>
<description>A base for other temporary directories.</description>
</property>
13. Open "root to hadoop\hadoop-2.9.2\etc\hadoop\mapred-site.xml" and add the content below inside the configuration tags. If you don't see mapred-site.xml, rename mapred-site.xml.template to mapred-site.xml.
<property>
<name>mapred.job.tracker</name>
<value>hadoopMaster:9001</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
14. Edit your yarn-site.xml and add:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>Long running service which executes on Node Manager(s) and provides MapReduce Sort and Shuffle functionality.</description>
</property>
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
<description>Enable log aggregation so application logs are moved onto hdfs and are viewable via web ui after the application completed. The default location on hdfs is '/tmp/logs' and can be changed via yarn.nodemanager.remote-app-log-dir property</description>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoopMaster:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoopMaster:8031</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoopMaster:8032</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>hadoopMaster:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoopMaster:8088</value>
</property>
15. In your slaves file in "root-hadoop-directory\etc\hadoop" add
hadoopSlave
16. Do these steps on your slave nodes too.
17. Open cmd and cd to the sbin folder in your Hadoop directory.
18. Format your NameNode:
hdfs namenode -format
19. Run the following command:
start-dfs.cmd
then run:
start-yarn.cmd

How to access my Namenode GUI in hadoop outside the GCP instance in browser

I just set up a single-node Hadoop installation on a GCP instance. Running the jps command shows that all the processes are running fine.
I want to access the GUI of my namenode. I am using http://localhost:50070/ in my laptop's browser.
It shows "This site can't be reached".
core-site.xml
hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description></description>
</property>
</configuration>
mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>
</description>
</property>
</configuration>
Solution attempted:
I have tried replacing the values in the <value> tag with the public DNS of the GCP instance, but then the namenode stopped working.
Does anyone have any idea what I am doing wrong?
I found the answer to this problem:
You need to use your public IP and port number in the browser.
Check your firewall settings: they should allow the inbound traffic (inbound rules in AWS, firewall settings in GCP).
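One possible way to make the web UI reachable from outside, sketched under the assumption of a Hadoop 2.x install where the NameNode UI listens on port 50070 (the port used in the question). Binding the UI to all interfaces avoids putting the public DNS name into the config values, which is what the attempted solution did.

```xml
<!-- hdfs-site.xml on the GCP instance -->
<configuration>
  <property>
    <!-- 0.0.0.0 listens on every interface; 50070 is the Hadoop 2.x UI port -->
    <name>dfs.namenode.http-address</name>
    <value>0.0.0.0:50070</value>
  </property>
</configuration>
```

The UI would then be opened from the laptop as http://PUBLIC_IP:50070/, where PUBLIC_IP stands for the instance's external IP, once the firewall permits inbound traffic on that port.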

HBase UI doesn't show any region servers

I run HBase in distributed mode. HBase starts the region server Java processes on all nodes, but the web UI doesn't show them.
http://s1.ipicture.ru/uploads/20120517/16DXTnsU.png
Here is hbase-site.xml:
<configuration>
<property>
<name>hbase.zookeeper.quorum</name>
<value>10.3.6.44</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/hadoop/hdfs/zookeeper</value>
</property>
<property>
<name>hbase.rootdir</name>
<value>hdfs://10.3.6.44:9000/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
</configuration>
By the way, the Hadoop cluster is running normally and sees all the datanodes.
Thanks very much for your help.
The problem was with DNS and the hosts file.
Add this property to your hbase-site.xml file and see if it works for you:
name - hbase.zookeeper.property.clientPort
value - 2181
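Written out as it would appear in hbase-site.xml, the suggested property looks roughly like this (placed inside the existing configuration element, alongside the other properties):

```xml
<!-- ZooKeeper client port that HBase connects to; 2181 is the value suggested above -->
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2181</value>
</property>
```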

Resources