Meaning of fs.defaultFS property in core-site.xml in hadoop

I am trying to set up Hadoop in fully distributed mode, and to some extent I have been successful in doing this.
However, I have a doubt about one of the parameter settings in core-site.xml: fs.defaultFS.
In my set up, I have three nodes as described below:
Node1 -- 192.168.1.2 --> Configured to be Master (Running ResourceManager and NameNode daemons)
Node2 -- 192.168.1.3 --> Configured to be Slave (Running NodeManager and Datanode daemons)
Node3 -- 192.168.1.4 --> Configured to be Slave (Running NodeManager and Datanode daemons)
Now what does property fs.defaultFS mean? For example, if I set it like this:
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.1.2:9000/</value>
</property>
I am not able to understand the meaning of hdfs://192.168.1.2:9000. I can figure out that hdfs means we are using the HDFS file system, but what do the other parts mean?
Does this mean that the host with IP address 192.168.1.2 is running the Namenode at port 9000?
Can anyone help me understand this?

In this code:
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.1.2:9000/</value>
</property>
Include fs.defaultFS (or the older fs.default.name) in core-site.xml so that dfs commands work without spelling out the full file system URI: you can run hdfs dfs -ls / instead of hdfs dfs -ls hdfs://192.168.1.2:9000/.
This property specifies the default file system. It defaults to your local file system, which is why it needs to be set to an HDFS address. It is important for client configuration as well, so your local configuration file should also include this element.
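For illustration, with the value above set, the following two commands are equivalent (a small sketch assuming the NameNode address given in the question, 192.168.1.2:9000):
# With fs.defaultFS set to hdfs://192.168.1.2:9000/ in core-site.xml,
# a plain path is resolved against the default file system:
hdfs dfs -ls /
# ...which is equivalent to spelling out the full URI explicitly:
hdfs dfs -ls hdfs://192.168.1.2:9000/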
As #Shashank explained very well for hdfs://192.168.1.2:9000/:
9000 is the port on which the NameNode listens and to which the DataNodes send their heartbeats, and the address part is the machine's IP, which resolves to its hostname.

<name>fs.default.name</name>
Here fs denotes the file system and default.name denotes the NameNode.
<value>hdfs://192.168.1.2:9000/</value>
Here 9000 is the port on which the DataNodes send their heartbeats to the NameNode, and the full address is the machine's IP, which resolves to a hostname.
Something important to note about the port: you can use any port greater than 1024; ports below that require root privileges.
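As a side note, fs.default.name is the deprecated (Hadoop 1) name of this key; Hadoop 2 still accepts it but maps it to fs.defaultFS. A minimal core-site.xml sketch using the current name, with the same value as above:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://192.168.1.2:9000/</value>
  <!-- NameNode RPC endpoint; clients and DataNodes contact this host:port -->
</property>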

Related

How do I check the Hadoop server name?

When I try to import data into Excel from HDFS, it asks me to enter the name of the Hadoop server, and I am confused about what to type there. Kindly help me with this.
Give the name of your NameNode and try once.
You should enter the name of your NameNode here.
For example, in hdfs-site.xml there is a property dfs.namenode.http-address; check the value of this property. You need to use the server name given in that property.
For me, it is set to:
<property>
<name>dfs.namenode.http-address</name>
<value>mballur.myorg.com:50070</value>
<description>The address and the base port where the dfs namenode
web ui will listen on. If the port is 0 then the server will
start on a free port.</description>
<final>true</final>
</property>
So, if you take my example, you need to set this to mballur.myorg.com.
You can also get the name of the NameNode by running the hadoop fsck command.
For example, when I run the following command:
CMD PROMPT>hadoop fsck /tmp/
I get the following output:
Connecting to namenode via http://mballur.myorg.com:50070/fsck?ugi=mballur&path=%2Ftmp
FSCK started by mballur (auth:SIMPLE) from /192.168.56.1 for path /tmp at Wed Jan 06 18:29:57 IST 2016
You can see in the first line that the highlighted portion is the name of the NameNode:
Connecting to namenode via http://mballur.myorg.com:50070/fsck?ugi=mballur&path=%2Ftmp
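Another quick way to look up the same value from the command line is hdfs getconf (a small sketch; it prints the configured value from your client configuration rather than contacting the cluster):
# Prints the configured NameNode HTTP address, e.g. mballur.myorg.com:50070
hdfs getconf -confKey dfs.namenode.http-address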
Also, check this tutorial on YouTube: https://www.youtube.com/watch?v=_eyE7Qcj0_A

Hadoop UI shows only one Datanode

I've started a Hadoop cluster composed of one master and 4 slave nodes.
Configuration seems ok:
hduser@ubuntu-amd64:/usr/local/hadoop$ ./bin/hdfs dfsadmin -report
When I enter the NameNode UI (http://10.20.0.140:50070/), the Overview card seems ok - for example, the total capacity of all nodes sums up.
The problem is that in the Datanodes card I see only one datanode.
I came across the same problem and, fortunately, I solved it. I guess it is caused by 'localhost'.
Configure a different name for each of these IPs in /etc/hosts.
Remember to restart all the machines; things will go well.
It's because of the same hostname in both datanodes.
In your case both datanodes are registering to the namenode with the same hostname, i.e. 'localhost'. Try different hostnames; it will fix your problem.
In the UI it will show only one entry per hostname.
In the "hdfs dfsadmin -report" output you can see both.
The following tips may help you:
Check the core-site.xml and ensure that the namenode hostname is correct
Check the firewall rules in namenode and datanodes and ensure that the required ports are open
Check the logs of datanodes
Ensure that all the datanodes are up and running
As #Rahul said, the problem is because of the same hostname.
Change your hostname in the /etc/hostname file and give a different hostname to each host,
and resolve each hostname to its IP address in the /etc/hosts file.
Then restart your cluster; you will see all datanodes in the Datanode information tab in the browser.
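A minimal sketch of what that looks like, with hypothetical hostnames and placeholder IPs (hadoop-master, hadoop-slave1, hadoop-slave2 are just example names, not from the question):
# On each machine, put a unique name in /etc/hostname, e.g. on the first slave:
hadoop-slave1
# Then map every hostname to its IP in /etc/hosts on ALL nodes (IPs are placeholders):
10.20.0.140   hadoop-master
10.20.0.141   hadoop-slave1
10.20.0.142   hadoop-slave2
After editing, reboot (or at least restart the Hadoop daemons) so the datanodes register with their new hostnames.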
I had the same trouble because I use IPs instead of hostnames; [hdfs dfsadmin -report] is correct, though only one [localhost] shows in the UI. Finally, I solved it like this:
<property>
       <name>dfs.datanode.hostname</name>                   
       <value>the name you want to show</value>
</property>
You can hardly find it in any document...
Sorry, it feels like it's been a while, but I'd still like to share my answer:
the root cause is in hadoop/etc/hadoop/hdfs-site.xml:
the xml file has a property named dfs.datanode.data.dir. If you set all the datanodes to the same name, then hadoop assumes the cluster has only one datanode. So the proper way of doing it is to give every datanode a unique name.
Regards,
YUN HANXUAN
Your admin report looks absolutely fine. Please run the command below to check the HDFS disk space details.
"hdfs dfs -df /"
If you still see the size looking good, it's just a UI glitch.
My problem: I have 1 master node and 3 slave nodes. When I start all nodes with start-all.sh and access the dashboard of the master node, I am able to see only one data node in the web UI.
My Solution:
Try to stop the firewall temporarily with sudo systemctl stop firewalld. If you do not want to stop your firewalld service, then allow the ports of the data node with
sudo firewall-cmd --permanent --add-port={PORT_NUMBER_1/tcp,PORT_NUMBER_2/tcp} ; sudo firewall-cmd --reload
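For example, assuming the default Hadoop 2 DataNode ports (50010 for data transfer, 50020 for IPC, 50075 for the web UI) - adjust to whatever ports your cluster actually uses:
# Open the default DataNode ports, then reload the firewall rules
sudo firewall-cmd --permanent --add-port={50010/tcp,50020/tcp,50075/tcp}
sudo firewall-cmd --reload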
If you are using a separate user for Hadoop (in my case I am using a hadoop user to manage the Hadoop daemons), then change the owner of your dataNode and nameNode directories with: sudo chown hadoop:hadoop /opt/data -R
My hdfs-site.xml config is as given in the image.
Check the daemons on the data node with the jps command; it should show output as given in the image below.
jps Output
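For reference, on a healthy slave node running both HDFS and YARN daemons, jps typically shows something like the following (the PIDs are illustrative, not the poster's actual output):
$ jps
2674 DataNode
2821 NodeManager
3150 Jps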

Hadoop 2.x -- how to configure secondary namenode?

I have an old Hadoop install that I'm looking to update to Hadoop 2. In the old setup, I have a $HADOOP_HOME/conf/masters file that specifies the secondary namenode.
Looking through the Hadoop 2 documentation I can't find any mention of a "masters" file, or how to set up a secondary namenode.
Any help in the right direction would be appreciated.
The slaves and masters files in the conf folder are only used by some scripts in the bin folder, such as start-mapred.sh, start-dfs.sh and start-all.sh.
These scripts are a mere convenience so that you can run them from a single node to ssh into each master / slave node and start the desired hadoop service daemons.
You only need these files on the name node machine if you intend to launch your cluster from this single node (using password-less ssh).
Alternatively, you can also start a Hadoop daemon manually on a machine via
bin/hadoop-daemon.sh start [namenode | secondarynamenode | datanode | jobtracker | tasktracker]
In order to run the secondary namenode, use the above script on the designated machine, providing the 'secondarynamenode' value to the script.
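For example, to start the secondary namenode on the machine designated for it:
# Run this on the designated secondary namenode machine
bin/hadoop-daemon.sh start secondarynamenode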
See #pwnz0r's 2nd comment on the answer to How to separate hadoop secondary namenode from primary namenode?
To reiterate here:
In hdfs-site.xml:
<property>
<name>dfs.secondary.http.address</name>
<value>$secondarynamenode.full.hostname:50090</value>
<description>SecondaryNameNodeHostname</description>
</property>
I am using Hadoop 2.6 and had to use
<property>
<name>dfs.secondary.http.address</name>
<value>secondarynamenode.hostname:50090</value>
</property>
For further details, refer to https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
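Note that in the hdfs-default.xml linked above, the current name of this key is dfs.namenode.secondary.http-address (dfs.secondary.http.address is the old Hadoop 1 name, which Hadoop 2 still maps to it). A sketch using the current name:
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>secondarynamenode.hostname:50090</value>
  <!-- Address and port the secondary namenode web UI listens on (default port 50090) -->
</property>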
Update the hdfs-site.xml file by adding the following property:
cd $HADOOP_HOME/etc/hadoop
sudo vi hdfs-site.xml
Then paste these lines inside the configuration tag:
<property>
<name>dfs.secondary.http.address</name>
<value>hostname:50090</value>
</property>

hadoop conf "fs.default.name" can't be set in ip:port format directly?

Hi all,
I have set up a hadoop cluster in fully distributed mode. First, I set core-site.xml "fs.default.name" and mapred-site.xml "mapred.job.tracker" in hostname:port format and changed /etc/hosts correspondingly, and the cluster works successfully.
Then I tried another way: I set core-site.xml "fs.default.name" and mapred-site.xml "mapred.job.tracker" in ip:port format. It doesn't work.
I find
ERROR org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Error getting localhost name. Using 'localhost'...
in namenode log file and
ERROR org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Error getting localhost name. Using 'localhost'...
java.net.UnknownHostException: slave01: slave01: Name or service not known
in datanode log file.
In my opinion, an IP and a hostname should be equivalent. Is there something wrong in my hadoop conf?
Maybe there is a wrongly configured hostname in /etc;
you should check the hostname in /etc/hosts, /etc/HOSTNAME (rhel/debian) or rc.conf (archlinux), etc.
I got your point. This is probably because in mapred-site.xml you wrote hdfs://ip:port (it starts with hdfs, which is wrong for that property), but when you used hostname:port you probably did not put hdfs at the beginning of the value, which is the correct way. Therefore, the first one did not work, but the second one worked.
Fatih haltas
I found the answer here.
It seems that HDFS uses the hostname for all of its communication and display purposes, so we can NOT use an IP directly in core-site.xml and mapred-site.xml.
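Putting the pieces together, a minimal sketch under the same assumptions as the question (the hostname "master", the placeholder IP, and the port numbers here are only examples): map the IP to a hostname in /etc/hosts on every node, then use that hostname in the configs.
# /etc/hosts on every node (replace the IP with your namenode/jobtracker machine's IP)
192.168.1.10   master
# core-site.xml (the default file system keeps the hdfs:// scheme)
<property>
  <name>fs.default.name</name>
  <value>hdfs://master:9000</value>
</property>
# mapred-site.xml (note: no hdfs:// prefix for the job tracker address)
<property>
  <name>mapred.job.tracker</name>
  <value>master:9001</value>
</property>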

how to setup a hadoop node to be a tasktracker but not a datanode

For a special reason, I want to set up a hadoop node to be a tasktracker but not a datanode. It seems like there is a way to do it, but I have not been able to. Could someone give me a hand?
Thanks.
You have to set up a host-exclude file for your namenode.
Add this to core-site.xml:
<property>
<name>dfs.hosts.exclude</name>
<value>YOUR_PATH_TO_THE_EXCLUDE_FILE</value>
</property>
This file is basically like a slaves or masters file. You just have to insert the hostnames, like:
host1
host2
After a restart, the namenode will ignore these given hosts, but the jobtracker will still launch a tasktracker on them.
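As a small sketch of the full workflow (the exclude-file path is just an example and must match whatever you put in dfs.hosts.exclude), you can also make the namenode re-read the exclude list without a full restart:
# /etc/hadoop/conf/dfs.exclude  (example path; one hostname per line)
host1
host2
# Tell the namenode to re-read the include/exclude files
hadoop dfsadmin -refreshNodes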
