How to remove a hadoop node from DFS but not from Mapred? - hadoop

I am fairly new to Hadoop. To run some benchmarks, I need a variety of Hadoop configurations for comparison.
I want to know how to remove a Hadoop slave from DFS (no longer running the DataNode daemon) but not from Mapred (keep running the TaskTracker), or vice versa.
AFAIK, there is a single slaves file for such Hadoop nodes, not separate slaves files for DFS and Mapred.
Currently, I start both DFS and Mapred on the slave node and then kill the DataNode on the slave. But it takes a while before that node shows up under 'dead nodes' in the HDFS GUI. Is there a parameter that can be tuned to shorten this timeout?
Thanks

Try using dfs.hosts and dfs.hosts.exclude in hdfs-site.xml, and mapred.hosts and mapred.hosts.exclude in mapred-site.xml. Each property points to a file listing the hosts that are allowed to connect (or excluded from connecting) to the NameNode and the JobTracker respectively.
Once the lists of nodes in those files have been updated appropriately, the NameNode and the JobTracker have to be refreshed using the hadoop dfsadmin -refreshNodes and hadoop mradmin -refreshNodes commands respectively.
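As a minimal sketch of the exclude-file approach for the HDFS side only: the hostname, the file name dfs.exclude and the CONF_DIR location are hypothetical, and by default the script only prints the hadoop command (DRY_RUN=1) rather than running it. If dfs.hosts.exclude was not already set when the NameNode started, the NameNode will probably need a restart before a refresh has any effect.

```shell
set -eu

# All names below are hypothetical; adjust them to your cluster.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"    # stand-in for the Hadoop conf directory
EXCLUDE_FILE="$CONF_DIR/dfs.exclude"    # the file that dfs.hosts.exclude points at
SLAVE_HOST="slave1.example.com"         # node to drop from HDFS only
DRY_RUN="${DRY_RUN:-1}"                 # 1 = only print the hadoop commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# 1. Put the host in the HDFS exclude file (one hostname per line).
printf '%s\n' "$SLAVE_HOST" > "$EXCLUDE_FILE"

# 2. Show the property to place in hdfs-site.xml on the NameNode.
cat <<EOF
<property>
  <name>dfs.hosts.exclude</name>
  <value>$EXCLUDE_FILE</value>
</property>
EOF

# 3. Ask the NameNode to re-read its include/exclude files.
run hadoop dfsadmin -refreshNodes
```

Because mapred.hosts.exclude is left untouched here, the TaskTracker on that host keeps its connection to the JobTracker. The reverse case would use mapred.hosts.exclude and hadoop mradmin -refreshNodes instead.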

Instead of using the slaves file to start all processes on your cluster, you can start only the required daemons on each machine, which is practical if you have only a few nodes.
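A rough sketch of the per-machine approach, to be run on the slave itself. HADOOP_HOME and its default path are assumptions, and the script only prints the command unless DRY_RUN=0 is set.

```shell
set -eu

HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop}"   # hypothetical install path
DRY_RUN="${DRY_RUN:-1}"                     # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# On a slave that should serve MapReduce but not HDFS,
# start the TaskTracker and simply never start the DataNode.
run "$HADOOP_HOME/bin/hadoop-daemon.sh" start tasktracker

# For the opposite setup (HDFS only), start the DataNode instead:
# run "$HADOOP_HOME/bin/hadoop-daemon.sh" start datanode
```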

Related

Adding a node to hadoop cluster without restarting master

I have created a Hadoop cluster and want to add a new node to it as a slave without restarting the master node.
How can this be achieved?
Datanodes and nodemanagers can be added without restarting the namenode(s) or resource manager(s).
More specifically, the following commands need to be run on the machines hosting the respective services:
NameNode
hdfs dfsadmin -refreshNodes
ResourceManager
yarn rmadmin -refreshNodes
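Put together, adding a worker might look like the sketch below. The hostname and the temporary slaves file are hypothetical, the daemon scripts are the Hadoop 2.x ones, and the Hadoop commands are only printed unless DRY_RUN=0 is set. The refresh commands matter mainly when include/exclude files are configured; without them a new DataNode or NodeManager normally registers itself on startup.

```shell
set -eu

NEW_NODE="worker5.example.com"              # hypothetical new slave
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"        # stand-in for $HADOOP_CONF_DIR
SLAVES_FILE="$CONF_DIR/slaves"
DRY_RUN="${DRY_RUN:-1}"                     # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# On the master: record the node in the slaves file, once only,
# so that later start-dfs.sh / start-yarn.sh runs include it.
grep -qxF "$NEW_NODE" "$SLAVES_FILE" 2>/dev/null || echo "$NEW_NODE" >> "$SLAVES_FILE"

# On the new node: start only the worker daemons.
run hadoop-daemon.sh start datanode
run yarn-daemon.sh start nodemanager

# On the master services: re-read include/exclude lists if they are in use.
run hdfs dfsadmin -refreshNodes
run yarn rmadmin -refreshNodes
```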

Hadoop HDFS start up fails requires formatting

I have a multi-node standalone Hadoop cluster for HDFS. I am able to load data into HDFS; however, every time I reboot my computer and start the cluster with start-dfs.sh, I don't see the dashboard until I run hdfs namenode -format, which erases all my data.
How do I start hadoop cluster without having to go through hdfs namenode -format?
You need to shut down HDFS and the NameNode cleanly (stop-dfs.sh) before you shut down your computer. Otherwise you can corrupt the NameNode metadata, which forces a format to get back to a clean state.
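A small pre-shutdown sketch follows. Beyond the clean stop, it checks one additional possibility that the answer does not mention and that is offered here only as an assumption about this setup: by default dfs.namenode.name.dir lives under hadoop.tmp.dir, which defaults to /tmp/hadoop-${user.name}, and many systems clear /tmp on reboot. The NAME_DIR value below is a hypothetical example; on a real cluster it could be read with hdfs getconf -confKey dfs.namenode.name.dir.

```shell
set -eu

DRY_RUN="${DRY_RUN:-1}"    # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Succeeds if the given path (optionally prefixed with file://) is under /tmp.
is_under_tmp() {
  case "${1#file://}" in
    /tmp|/tmp/*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical value; replace with the output of:
#   hdfs getconf -confKey dfs.namenode.name.dir
NAME_DIR="${NAME_DIR:-file:///tmp/hadoop-hduser/dfs/name}"

if is_under_tmp "$NAME_DIR"; then
  echo "warning: $NAME_DIR is under /tmp and may be wiped on reboot" >&2
fi

# Stop HDFS cleanly before powering off the machine.
run stop-dfs.sh
```

If the warning fires, pointing dfs.namenode.name.dir (and dfs.datanode.data.dir) at a persistent directory may be worth trying before the next format.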

Hadoop : swap DataNode & NameNode without losing any HDFS data

I have a cluster of 5 machines:
1 big NameNode
4 standard DataNodes
I want to change my current NameNode with a DataNode without losing the data stored in HDFS, so my cluster could become:
1 standard NameNode
3 standard DataNodes
1 big DataNode
Does someone know a simple way to do that?
Thank you very much
1. Decommission the DataNode on the machine to which the NameNode will be moved.
2. Stop the cluster.
3. Create a tarball of dfs.name.dir on the current NameNode.
4. Copy all Hadoop config files from the current NameNode to the target NameNode.
5. Replace the name/IP of the NameNode with that of the target machine in core-site.xml.
6. Restore the tarball of dfs.name.dir on the target NameNode. Make sure that the full path is the same.
7. Start the cluster with the new NameNode and one fewer DataNode.
8. Verify that everything is working correctly.
9. Add the old NameNode to the cluster by configuring it as a DataNode.
I would suggest uninstalling and then reinstalling Hadoop on both nodes so that the previous configuration does not cause any problems.
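The tar and restore steps for dfs.name.dir can be sketched as follows. To keep the example runnable anywhere, the two NameNodes are simulated as directories under a temporary folder and the name directory contains a dummy file; the path data/hdfs/name is hypothetical. On real machines the archive would be copied between hosts (for example with scp) and extracted relative to / so that the full path stays identical.

```shell
set -eu

WORK="$(mktemp -d)"
OLD_NN="$WORK/old-nn"          # stands in for the current NameNode's filesystem root
NEW_NN="$WORK/new-nn"          # stands in for the target NameNode's filesystem root
NAME_DIR="data/hdfs/name"      # hypothetical dfs.name.dir, relative to the root

# Fake some NameNode metadata so the example has something to move.
mkdir -p "$OLD_NN/$NAME_DIR/current"
echo "demo fsimage" > "$OLD_NN/$NAME_DIR/current/fsimage"

# Archive the name directory with a path relative to the root (cluster stopped).
tar -C "$OLD_NN" -czf "$WORK/namedir.tar.gz" "$NAME_DIR"

# Restore on the target so the directory ends up at the same full path.
mkdir -p "$NEW_NN"
tar -C "$NEW_NN" -xzf "$WORK/namedir.tar.gz"

# Verify that both copies are identical before starting the new NameNode.
diff -r "$OLD_NN/$NAME_DIR" "$NEW_NN/$NAME_DIR"
echo "name directory restored identically"
```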

Adding a new Namenode to an existing HDFS cluster

In Hadoop HDFS Federation, the last step of adding a new NameNode to an existing HDFS cluster is:
==> Refresh the Datanodes to pickup the newly added Namenode by running the following command against all the Datanodes in the cluster:
[hdfs]$ $HADOOP_PREFIX/bin/hdfs dfsadmin -refreshNameNodes <datanode_host_name>:<datanode_rpc_port>
Which is the best place to execute this command: the NameNode or a DataNode?
If I have 1000 DataNodes, is it reasonable to run it 1000 times?
Run this command once on the NameNode:
$HADOOP_PREFIX/sbin/slaves.sh hdfs dfsadmin -refreshNameNodes <datanode_host_name>:<datanode_rpc_port>
slaves.sh script will distribute the command to all the slave hosts which are mentioned in slaves file (typically placed in $HADOOP_CONF_DIR)
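Since the command takes the address of one DataNode as its argument, another option is a loop on a single machine that substitutes each host from the slaves file in turn. The sketch below assumes a hypothetical slaves file with two hosts and the DataNode IPC port 50020 (the Hadoop 2.x default for dfs.datanode.ipc.address; check your own configuration). It only prints the commands unless DRY_RUN=0 is set.

```shell
set -eu

DRY_RUN="${DRY_RUN:-1}"        # 1 = only print the commands
DN_IPC_PORT=50020              # assumed DataNode IPC port; verify for your version

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Hypothetical slaves file; normally found in $HADOOP_CONF_DIR.
SLAVES_FILE="$(mktemp)"
printf '%s\n' dn1.example.com dn2.example.com > "$SLAVES_FILE"

count=0
while IFS= read -r host; do
  # Skip blank lines and comments.
  case "$host" in ''|'#'*) continue ;; esac
  run hdfs dfsadmin -refreshNameNodes "$host:$DN_IPC_PORT"
  count=$((count + 1))
done < "$SLAVES_FILE"

echo "refresh issued for $count DataNodes"
```

With 1000 DataNodes the loop still issues 1000 calls, but they all originate from one script rather than from 1000 manual invocations.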

What is best way to start and stop hadoop ecosystem, with command line?

I see there are several ways we can start hadoop ecosystem,
start-all.sh & stop-all.sh
These say they are deprecated and that start-dfs.sh & start-yarn.sh should be used instead.
start-dfs.sh, stop-dfs.sh and start-yarn.sh, stop-yarn.sh
hadoop-daemon.sh namenode/datanode and yarn-daemon.sh resourcemanager
EDIT: I think there has to be some specific use cases for each command.
start-all.sh & stop-all.sh : Used to start and stop hadoop daemons all at once. Issuing it on the master machine will start/stop the daemons on all the nodes of a cluster. Deprecated as you have already noticed.
start-dfs.sh, stop-dfs.sh and start-yarn.sh, stop-yarn.sh : Same as above but start/stop HDFS and YARN daemons separately on all the nodes from the master machine. It is advisable to use these commands now over start-all.sh & stop-all.sh
hadoop-daemon.sh namenode/datanode and yarn-daemon.sh resourcemanager : To start individual daemons on an individual machine manually. You need to log in to the particular node and issue these commands there.
Use case : Suppose you have added a new DN to your cluster and you need to start the DN daemon only on this machine,
bin/hadoop-daemon.sh start datanode
Note : You should have ssh enabled if you want to start all the daemons on all the nodes from one machine.
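As a rough illustration of the cluster-wide scripts on a YARN-era installation, the sketch below starts HDFS before YARN and stops them in the reverse order. The function names are made up for this example, the scripts are assumed to be on PATH, and nothing is executed unless DRY_RUN=0 is set.

```shell
set -eu

DRY_RUN="${DRY_RUN:-1}"    # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Run on the master: bring up storage first, then the resource layer.
start_cluster() {
  run start-dfs.sh
  run start-yarn.sh
}

# Stop in the reverse order so YARN is down before HDFS goes away.
stop_cluster() {
  run stop-yarn.sh
  run stop-dfs.sh
}

start_cluster
stop_cluster
```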
Hope this answers your query.
From Hadoop page,
start-all.sh
This will startup a Namenode, Datanode, Jobtracker and a Tasktracker on your machine.
start-dfs.sh
This will bring up HDFS with the Namenode running on the machine you ran the command on. On such a machine you would need start-mapred.sh to separately start the job tracker
start-all.sh/stop-all.sh has to be run on the master node
You would use start-all.sh on a single-node cluster (i.e. where all the services run on the same node; the NameNode is also the DataNode and is the master node).
In multi-node setup,
You will use start-all.sh on the master node and would start what is necessary on the slaves as well.
Alternatively,
Use start-dfs.sh on the node you want the Namenode to run on. This will bring up HDFS with the Namenode running on the machine you ran the command on and Datanodes on the machines listed in the slaves file.
Use start-mapred.sh on the machine you plan to run the Jobtracker on. This will bring up the Map/Reduce cluster with Jobtracker running on the machine you ran the command on and Tasktrackers running on machines listed in the slaves file.
hadoop-daemon.sh, as stated by Tariq, is used on each individual node. The master node will not start the services on the slaves. In a single-node setup this acts the same as start-all.sh. In a multi-node setup you will have to access each node (master as well as slaves) and execute it on each of them.
Have a look at start-all.sh: it calls the config script, followed by the dfs and mapred start scripts.
Starting
start-dfs.sh (starts the namenode and the datanode)
start-mapred.sh (starts the jobtracker and the tasktracker)
Stopping
stop-dfs.sh
stop-mapred.sh

Resources