How to set up a Hadoop node to be a TaskTracker but not a DataNode - hadoop

For a special reason, I want to set up a Hadoop node to be a TaskTracker but not a DataNode. It seems like there is a way to do it, but I have not been able to. Could someone give me a hand?
Thanks.

You have to set up a host-exclude file for your NameNode.
Add this property to core-site.xml:
<property>
  <name>dfs.hosts.exclude</name>
  <value>YOUR_PATH_TO_THE_EXCLUDE_FILE</value>
</property>
This file has the same format as the slaves or masters file: just list one hostname per line, like:
host1
host2
After a restart, the NameNode will ignore the listed hosts (so no DataNode is admitted there), but the JobTracker will still launch a TaskTracker on them.
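If you do not want to restart the whole cluster, Hadoop 1.x can also re-read the exclude list on the fly; a short sketch, assuming the file you listed host1/host2 in is the one referenced by dfs.hosts.exclude:
# run on the NameNode after editing the exclude file
bin/hadoop dfsadmin -refreshNodes
# the JobTracker keeps scheduling tasks on host1/host2 as long as
# mapred.hosts.exclude does not list them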

Related

Meaning of fs.defaultFS property in core-site.xml in hadoop

I am trying to set up hadoop in fully distributed mode, and to some extent I have been successful in doing this.
However, I have a doubt about one of the parameter settings in core-site.xml --> fs.defaultFS
In my set up, I have three nodes as described below:
Node1 -- 192.168.1.2 --> Configured to be Master (Running ResourceManager and NameNode daemons)
Node2 -- 192.168.1.3 --> Configured to be Slave (Running NodeManager and Datanode daemons)
Node3 -- 192.168.1.4 --> Configured to be Slave (Running NodeManager and Datanode daemons)
Now what does property fs.defaultFS mean? For example, if I set it like this:
<property>
  <name>fs.default.name</name>
  <value>hdfs://192.168.1.2:9000/</value>
</property>
I am not able to understand the meaning of hdfs://192.168.1.2:9000. I can figure out that hdfs means we are using the HDFS file system, but what do the other parts mean?
Does this mean that the host with IP address 192.168.1.2 is running the Namenode at port 9000?
Can anyone help me understand this?
In this code:
<property>
  <name>fs.default.name</name>
  <value>hdfs://192.168.1.2:9000/</value>
</property>
Including fs.defaultFS/fs.default.name in core-site.xml lets you run dfs commands without giving the full file system URI in the command, e.g. hdfs dfs -ls / instead of hdfs dfs -ls hdfs://192.168.1.2:9000/.
The property specifies the default file system. It defaults to your local file system, which is why it needs to be set to an HDFS address. This is important for client configuration as well, so your local configuration file should also include this element.
As #Shashank explained very well above:
hdfs://192.168.1.2:9000/ : here 9000 is the RPC port of the NameNode; the DataNodes send their heartbeats to it and clients send their file system requests through it. The host part is the NameNode machine, given as an IP address or hostname.
<name>fs.default.name</name> : here fs denotes the file system and default.name points it at the NameNode.
<value>hdfs://192.168.1.2:9000/</value> : the scheme (hdfs), the NameNode host and the RPC port together make up the default file system URI.
Something important to note about the port: you can use any port greater than 1024, since ports below that require root privileges.
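For reference, a minimal core-site.xml sketch for this setup, using the current property name fs.defaultFS (fs.default.name is its deprecated alias); the IP and port are the ones from the question:
<!-- core-site.xml on the NameNode and on every client node -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.1.2:9000/</value>
  </property>
</configuration>
With this in place, hdfs dfs -ls / is resolved against hdfs://192.168.1.2:9000/ automatically.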

Hadoop datanode services are not starting on the slaves in hadoop

I am trying to configure a hadoop-1.0.3 multi-node cluster with one master and two slaves on my laptop using VMware Workstation.
When I run start-all.sh from the master, all daemon processes run on the master node (namenode, datanode, tasktracker, jobtracker, secondarynamenode), but the DataNode and TaskTracker do not start on the slave nodes. Passwordless SSH is enabled and I can ssh to both master and slaves from my master node without a password.
Please help me resolve this.
Stop the cluster.
If you have explicitly defined a tmp directory location in core-site.xml, then remove all files under that directory.
If you have explicitly defined data node and namenode directories in hdfs-site.xml, then delete all files under those directories.
If you have not defined anything in core-site.xml or hdfs-site.xml, then please remove all files under /tmp/hadoop-*nameofyourhadoopuser.
Format the namenode.
It should work!
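A command-level sketch of those steps for Hadoop 1.0.3, run from $HADOOP_HOME on the master, assuming the default /tmp locations (adjust the paths if you set hadoop.tmp.dir, dfs.name.dir or dfs.data.dir yourself):
bin/stop-all.sh                 # stop all daemons
rm -rf /tmp/hadoop-$USER*       # repeat on every node; only if you kept the default directories
bin/hadoop namenode -format     # reformat the NameNode (this wipes HDFS metadata)
bin/start-all.sh                # start again, then verify with jps on master and slaves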

Hadoop 2.x -- how to configure secondary namenode?

I have an old Hadoop install that I'm looking to update to Hadoop 2. In the
old setup, I have a $HADOOP_HOME/conf/masters file that specifies the
secondary namenode.
Looking through the Hadoop 2 documentation I can't find any mention of a
"masters" file, or how to setup a secondary namenode.
Any help in the right direction would be appreciated.
The slaves and masters files in the conf folder are only used by some scripts in the bin folder, such as the start-mapred.sh, start-dfs.sh and start-all.sh scripts.
These scripts are a mere convenience so that you can run them from a single node to ssh into each master/slave node and start the desired hadoop service daemons.
You only need these files on the name node machine if you intend to launch your cluster from this single node (using passwordless ssh); example contents are sketched below.
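For illustration, what the two files might look like under $HADOOP_HOME/conf (the hostnames are placeholders, not taken from the question):
# conf/masters -- the host on which start-dfs.sh starts the secondary namenode
snn.example.com
# conf/slaves -- the hosts on which datanode and tasktracker daemons are started
slave1.example.com
slave2.example.com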
Alternatively, you can start a Hadoop daemon manually on a machine via
bin/hadoop-daemon.sh start [namenode | secondarynamenode | datanode | jobtracker | tasktracker]
In order to run the secondary name node, use the above script on the designated machine, passing the 'secondarynamenode' value to the script.
See #pwnz0r's 2nd comment on the answer to "How separate hadoop secondary namenode from primary namenode?"
To reiterate here:
In hdfs-site.xml:
<property>
  <name>dfs.secondary.http.address</name>
  <value>$secondarynamenode.full.hostname:50090</value>
  <description>SecondaryNameNodeHostname</description>
</property>
I am using Hadoop 2.6 and had to use
<property>
  <name>dfs.secondary.http.address</name>
  <value>secondarynamenode.hostname:50090</value>
</property>
For further details, refer to https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
Update the hdfs-site.xml file by adding the following property:
cd $HADOOP_HOME/etc/hadoop
sudo vi hdfs-site.xml
Then paste these lines inside the configuration tag:
<property>
  <name>dfs.secondary.http.address</name>
  <value>hostname:50090</value>
</property>
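Once the property is in place you can start the daemon on that host and verify it; a short sketch for Hadoop 2.x, assuming $HADOOP_HOME points at your installation:
$HADOOP_HOME/sbin/hadoop-daemon.sh start secondarynamenode   # run on the secondary namenode host
jps                                                          # should now list SecondaryNameNode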

Start namenode without formatting

I tried to start the namenode using bin/start-all.sh, but this command doesn't start the namenode. I know that if I do bin/hadoop namenode -format the namenode will start, but in that case I will lose all my data. Is there a way to start the namenode without formatting it?
Your problem might be related to the following:
Hadoop writes its NameNode data to the /tmp/hadoop-<username> folder by default, which is cleaned after every reboot.
Add the following property to conf/hdfs-site.xml:
<property>
  <name>dfs.name.dir</name>
  <value><path to your desired folder></value>
</property>
The "dfs.name.dir" property allows you to control where Hadoop writes NameNode metadata.
bin/start-all.sh should start the namenode as well as the datanodes, the jobtracker and the tasktrackers, so check the log of the namenode for possible errors.
An alternative way to skip the jobtracker and the tasktrackers and start just the namenode (and the datanodes) is to use the command:
bin/start-dfs.sh
Actually, bin/start-all.sh is equivalent to using the commands:
bin/start-dfs.sh, which starts the namenode and datanodes and
bin/start-mapred.sh, which starts the jobtracker and the tasktrackers.
For more details, visit this page.
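A quick way to inspect the namenode log mentioned above, assuming Hadoop 1.x default log names of the form hadoop-<user>-namenode-<host>.log:
tail -n 100 $HADOOP_HOME/logs/hadoop-$USER-namenode-$(hostname).log   # the exception is usually near the end
bin/start-dfs.sh                                                      # start HDFS only, without MapReduce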

DataNode doesn't start in one of the slaves

I am trying to configure hadoop with 5 slaves. After I run start-dfs.sh on the master, there is one slave node on which the DataNode does not run. I tried looking for some difference in the configuration files on that node, but I didn't find anything.
There WAS a difference in the configuration files! In core-site.xml the hadoop.tmp.dir variable was set to an invalid directory, so it couldn't be created when the DataNode was started. Lesson learned: look in the logs (thanks Chris).
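For comparison, a hadoop.tmp.dir entry that would work, assuming /app/hadoop/tmp exists and is writable by the hadoop user (the path is an example, not taken from the question):
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>   <!-- must exist or be creatable, and be writable by the hadoop user -->
</property>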
