How do I correctly remove nodes in Hadoop?

I'm running Hadoop 1.1.2 on a cluster with 10+ machines. I would like to scale it up and down nicely, both for HDFS and MapReduce. By "nicely", I mean that no data may be lost (HDFS nodes must be allowed to decommission), and nodes running a task must finish before shutting down.
I've noticed the datanode process dies once decommissioning is done, which is good. This is what I do to remove a node:
Add node to mapred.exclude
Add node to hdfs.exclude
$ hadoop mradmin -refreshNodes
$ hadoop dfsadmin -refreshNodes
$ hadoop-daemon.sh stop tasktracker
To add the node back in (assuming it was removed as above), this is what I do:
Remove from mapred.exclude
Remove from hdfs.exclude
$ hadoop mradmin -refreshNodes
$ hadoop dfsadmin -refreshNodes
$ hadoop-daemon.sh start tasktracker
$ hadoop-daemon.sh start datanode
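For reference, the refresh commands above only work because the exclude files are registered in the configuration; a minimal sketch of the relevant entries (the file paths are just examples from my setup):
dfs.hosts.exclude = /etc/hadoop/conf/hdfs.exclude (in hdfs-site.xml)
mapred.hosts.exclude = /etc/hadoop/conf/mapred.exclude (in mapred-site.xml)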
Is this the correct way to scale up and down "nicely"? When scaling down, I'm noticing that job duration rises sharply for certain unlucky jobs (since the tasks they had running on the removed node need to be re-scheduled).

If you have not set the dfs exclude file before, follow steps 1-3. Otherwise, start from step 4.
1. Shut down the NameNode.
2. Set dfs.hosts.exclude to point to an empty exclude file.
3. Restart the NameNode.
4. In the dfs exclude file, specify the nodes using the full hostname, IP, or IP:port format.
5. Do the same in mapred.exclude.
6. Execute bin/hadoop dfsadmin -refreshNodes. This forces the NameNode to reread the exclude file and start the decommissioning process.
7. Execute bin/hadoop mradmin -refreshNodes.
8. Monitor the NameNode and JobTracker web UIs and confirm the decommission process is in progress. It can take a few seconds to update. Messages like "Decommission complete for node XXXX.XXXX.X.XX:XXXXX" will appear in the NameNode log files when it finishes decommissioning, at which point you can remove the nodes from the cluster.
9. When the process has completed, the NameNode UI will list the datanode as decommissioned, and the JobTracker page will show the updated number of active nodes. Run bin/hadoop dfsadmin -report to verify (a sketch of what to look for follows below). Stop the datanode and tasktracker processes on the excluded node(s).
If you do not plan to reintroduce the machine to the cluster, remove it from the include and exclude files.
To add a node back as a datanode and tasktracker, see the Hadoop FAQ page.
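As a sketch of the verification in step 9, the per-node sections of the report should eventually look like this for each excluded node (the address is an example; the output format can vary slightly by version):
$ hadoop dfsadmin -report
# Name: 10.0.0.12:50010
# Decommission Status : Decommissioned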
EDIT: When a live node is to be removed from the cluster, what happens to the job?
The jobs running on a node being decommissioned are affected, since the tasks of the job scheduled on that node are marked as KILLED_UNCLEAN (for map and reduce tasks) or KILLED (for job setup and cleanup tasks). See line 4633 in JobTracker.java for details. The job is then informed to fail that task. Most of the time, the JobTracker will reschedule the execution. However, after many repeated failures it may instead decide to allow the entire job to fail or succeed. See line 2957 onwards in JobInProgress.java.
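To watch this from the client side, you can list the task-completion events of an affected job (a sketch; the job ID is hypothetical):
$ hadoop job -events job_201301010000_0001 0 100
# prints one line per task attempt with its final status (SUCCEEDED, FAILED, KILLED),
# so attempts killed on the decommissioned node and their rescheduled copies become visible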

You should be aware that for Hadoop to perform well, it really wants to have the data available in multiple copies. By removing nodes, you reduce the chances of the data being optimally available, and you put extra stress on the cluster to ensure availability.
I.e., by taking down a node, you force an extra copy of all its data to be made somewhere else. So you shouldn't really be doing this just for fun, unless you use a different data management paradigm than the default configuration (keep 3 copies in the cluster).
And for a Hadoop cluster to perform well, you will want to actually store the data in the cluster. Otherwise, you can't really move the computation to the data, because the data isn't there yet either. Much of Hadoop is about having "smart drives" that can perform computation before sending the data across the network.
So to make this reasonable, you will likely need to somehow split your cluster: have one set of nodes keep the 3 master copies of the original data, and have some "add-on" nodes that are only used for storing intermediate data and performing computations on that part. Never change the master nodes, so they don't need to redistribute your data. Shut down add-on nodes only when they are empty? But that is probably not yet implemented.
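For completeness, the "3 copies" come from the replication factor, which you can inspect or enforce yourself; a small sketch (the path is an example):
# cluster default, set in hdfs-site.xml: dfs.replication = 3
$ hadoop fs -setrep -w 3 /user/joe/data   # (re)set and wait for 3 replicas on a path
$ hadoop dfsadmin -report                 # the summary includes the count of under-replicated blocks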

While decommissioning is in progress, temporary or staging files get cleaned up automatically. Those files are now missing, and Hadoop does not recognize how they went missing, so the decommissioning process keeps waiting until that is resolved, even though the actual decommissioning is done for all the other files.
In the Hadoop GUI, if you notice that the "Number of Under-Replicated Blocks" parameter is not decreasing over time or is almost constant, this is likely the reason.
So list the files using the command below:
hadoop fsck / -files -blocks -racks
If you see that those files are temporary and not required, then delete those files or folders.
Example: hadoop fs -rmr /var/local/hadoop/hadoop/.staging/* (give the correct path here)
This should solve the problem immediately; decommissioned nodes will move to Dead Nodes within 5 minutes.
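To zero in on the blocking files quickly, you can filter the fsck output for under-replicated entries (a sketch; the exact wording of the fsck output varies by version):
$ hadoop fsck / -files -blocks -racks | grep -i -B1 "under replicated"
# -B1 keeps the preceding line, which typically names the affected file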

Related

Is there any file to read when one of my datanodes is dead?

Situation:
In my test Apache Hadoop cluster, I run a MapReduce job.
One of my datanodes goes down (I turn the machine off) while it is working on my MapReduce job.
My thinking:
Intuitively, I think the job will run a little bit longer, but it will not fail, because the file blocks are replicated on other nodes.
Some people say I can set these parameters:
dfs.client.block.write.replace-datanode-on-failure.enable = true
dfs.client.block.write.replace-datanode-on-failure.best-effort = true
so that my job will skip the dead datanode and look for another available one.
My question is:
Does anybody know which file I can check to see my job's lifecycle, i.e., how it resumes from a dead datanode onto another available one?
My first thought was the edit log, but I can't read it clearly.
Please check whether your datanode and namenode IDs are the same. To check on the datanode side, go into the datanode's data folder, then into the current directory, and look at the VERSION file there.
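A sketch of that check, assuming the default dfs.name.dir / dfs.data.dir locations under hadoop.tmp.dir (your paths may differ):
$ cat /tmp/hadoop-$USER/dfs/name/current/VERSION   # on the namenode
$ cat /tmp/hadoop-$USER/dfs/data/current/VERSION   # on the datanode
# the namespaceID lines must match; a datanode with a mismatched ID will not join the cluster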

Why did Hadoop lose nodes?

I am confused: when I run the command "hadoop dfsadmin -report", I can see the datanodes reported as live [screenshot omitted], but the ResourceManager's cluster metrics show unhealthy and lost nodes [screenshot omitted].
Why is that, and how could that happen?
Thanks in advance!
You're connected to 9 slave nodes, but only 5 slave nodes are in the active state; the remaining ones are in an unhealthy state.
Reason for the unhealthy state:
Hadoop MapReduce provides a mechanism by which administrators can configure the TaskTracker to run an administrator supplied script periodically to determine if a node is healthy or not. Administrators can determine if the node is in a healthy state by performing any checks of their choice in the script. If the script detects the node to be in an unhealthy state, it must print a line to standard output beginning with the string ERROR. The TaskTracker spawns the script periodically and checks its output. If the script's output contains the string ERROR, as described above, the node's status is reported as 'unhealthy' and the node is black-listed on the JobTracker. No further tasks will be assigned to this node. However, the TaskTracker continues to run the script, so that if the node becomes healthy again, it will be removed from the blacklisted nodes on the JobTracker automatically. The node's health along with the output of the script, if it is unhealthy, is available to the administrator in the JobTracker's web interface. The time since the node was healthy is also displayed on the web interface.
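For illustration, a hypothetical health script of the kind described above (the disk-usage check is just an example; the script path is configured via mapred.healthChecker.script.path and the interval via mapred.healthChecker.interval):
#!/bin/sh
# report unhealthy when the local data disk is over 90% full
USED=$(df /data | awk 'NR==2 { sub("%", "", $5); print $5 }')
if [ "$USED" -gt 90 ]; then
  echo "ERROR data disk is ${USED}% full"   # any stdout line starting with ERROR blacklists the node
fi
exit 0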
Reason for the lost nodes:
I think some blocks (data) may not be available on the slaves, so it shows the lost-node count as 9.
To remove dead nodes from the cluster, follow the decommissioning procedure described in the first answer above.
Cluster metrics in the ResourceManager show the status of the NodeManagers.
The hadoop dfsadmin -report command shows the status of the DataNodes.
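A quick way to compare those two views side by side on a YARN cluster (a sketch):
$ hadoop dfsadmin -report | grep -E "Name:|Datanodes available"   # HDFS view of the DataNodes
$ yarn node -list -all                                            # NodeManager states such as RUNNING, UNHEALTHY, LOST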

MapReduce dataflow internals

I have tried to understand the MapReduce anatomy from various books and blogs, but I am not getting a clear idea.
What happens when I submit a job to the cluster using this command:
(I have already loaded the files into HDFS.)
bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
Can anyone explain the sequence of operations that happens, right from the client through to the inside of the cluster?
The process goes like this:
1- The client configures and sets up the job via Job and submits it to the JobTracker.
2- Once the job has been submitted the JobTracker assigns a job ID to this job.
3- Then the output specification of the job is verified. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
4- Once this is done, InputSplits for the job are created(based on the InputFormat you are using). If the splits cannot be computed, because the input paths don’t exist, for example, then the job is not submitted and an error is thrown to the MapReduce program.
5- Based on the number of InputSplits, map tasks are created, and each InputSplit gets processed by one map task.
6- Then the resources required to run the job are copied across the cluster, such as the job JAR file, the configuration file, etc. The job JAR is copied with a high replication factor (which defaults to 10) so that there are plenty of copies across the cluster for the tasktrackers to access when they run tasks for the job.
7- Then, based on the location of the data blocks that are going to be processed, the JobTracker directs TaskTrackers to run map tasks on the same DataNodes where those blocks are present. If there are no free CPU slots on such a DataNode, the task is scheduled on a nearby node (ideally in the same rack) with free slots and reads the block over the network, so the process continues without having to wait.
8- Once the map phase starts, individual records (key-value pairs) from each InputSplit get processed by the Mapper one by one, until the entire InputSplit is complete.
9- Once the map phase is over, the output undergoes shuffle, sort, and combine. After this, the reduce phase starts, giving you the final output.
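To watch steps 1-2 and the task progress from the client side, a sketch (the job ID shown is hypothetical):
$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output
$ hadoop job -list                           # shows the job ID the JobTracker assigned (step 2)
$ hadoop job -status job_201301010000_0001   # reports map/reduce completion percentages as tasks run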
[The original answer included a pictorial representation of the entire process here.]
Also, I would suggest you go through this link.
HTH

Why are map tasks always running on a single node?

I have a fully-distributed Hadoop cluster with 4 nodes. When I submit my job to the JobTracker, which decides that 12 map tasks will be fine for my job, something strange happens. The 12 map tasks always run on a single node instead of on the entire cluster. Before asking the question, I have already done the things below:
Try a different job
Run start-balancer.sh to rebalance the cluster
But it does not work, so I hope someone can tell me why and how to fix it.
If all the blocks of the input data files are on that node, the scheduler will prioritize the same node.
Apparently the source data files are on one datanode now. It couldn't be the balancer's fault. From what I can see, your HDFS must have a replication factor of one, or you are not running a fully-distributed Hadoop cluster.
Check how your input is being split. You may only have one input split, meaning that only one node will be used to process the data. You can test this by adding more input files to your system and placing them on different nodes, then checking which nodes are doing the work.
If that doesn't work, check to make sure that your cluster is configured correctly. Specifically, check that your name node has paths to your other nodes set in its slaves file, and that each slave node has your name node set in its masters file.
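One way to check where the input blocks actually live (a sketch; replace /path/to/input with your input directory):
$ hadoop fsck /path/to/input -files -blocks -locations
# -locations prints the datanode address(es) holding each block; if every block
# lists the same node, the scheduler's preference for that node is expected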

How can I add new nodes to a live hbase/hadoop cluster?

I run some batch jobs with data inputs that are constantly changing, and I'm having problems provisioning capacity. I am using Whirr to do the initial setup, but once I start, for example, 5 machines, I don't know how to add new machines to the cluster while it's running. I don't know in advance how complex or how large the data will be, so I was wondering if there is a way to add new machines to a cluster and have the change take effect right away (or with some delay, but without having to bring down the cluster and bring it back up with the new nodes).
There is an exact explanation of how to add a node here:
http://wiki.apache.org/hadoop/FAQ#I_have_a_new_node_I_want_to_add_to_a_running_Hadoop_cluster.3B_how_do_I_start_services_on_just_one_node.3F
At the same time, I am not sure that already-running jobs will take advantage of these nodes, since planning where to run each task happens at job start time (as far as I understand).
I also think that it is more practical to run TaskTrackers only on these transient nodes.
Check the files referred to by the parameters below:
dfs.hosts => dfs.include
dfs.hosts.exclude
mapreduce.jobtracker.hosts.filename => mapred.include
mapreduce.jobtracker.hosts.exclude.filename
You can add the list of hosts to the files dfs.include and mapred.include and then run
hadoop mradmin -refreshNodes ;
hadoop dfsadmin -refreshNodes ;
That's all.
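Putting it together, a sketch of the whole sequence for one new node (the hostname and file locations are examples):
$ echo "newnode.example.com" >> /etc/hadoop/conf/dfs.include
$ echo "newnode.example.com" >> /etc/hadoop/conf/mapred.include
$ hadoop dfsadmin -refreshNodes
$ hadoop mradmin -refreshNodes
# then, on the new node itself:
$ hadoop-daemon.sh start datanode
$ hadoop-daemon.sh start tasktracker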
BTW, the 'mradmin -refreshNodes' facility was added in 0.21.
Nikhil
