I haven't found an answer to this anywhere.
I have a YARN cluster with some slaves. When a slave fails (chaos monkey, scale down, etc.), the ResourceManager doesn't "get it". Even a yarn rmadmin -refreshNodes doesn't fix it. The ResourceManager keeps listing the failed nodes as RUNNING. How do I get the ResourceManager to check the health of the slaves and remove them when they fail?

Please look at Hadoop: The Definitive Guide, Chapter 10, Maintenance, Commissioning and Decommissioning Nodes. It looks like you are trying to update the JobTracker with the above command. The book describes a more elaborate process, which includes updating the NameNode, verifying the progress in the web UI, and removing the nodes from the include file and the slaves file.


I want to take a single machine out of a Hadoop cluster temporarily.
Most documentation says to take it out by adding it to the yarn and dfs .exclude files. I don't want to add it to the dfs.exclude and yarn.exclude files and decommission it with hdfs dfsadmin -refreshNodes, though, because I want to take it out, make some changes to the machine, and bring it back online as soon as possible. I don't want to copy hundreds of gigabytes of data over to avoid under-replicated blocks!
Instead, I'd like to be able to power off the machine quickly while making sure:
The cluster as a whole is still operational.
No data is lost by the journalmanager or nodemanager processes.
No Yarn jobs fail or go AWOL when the process dies.
My best guess at how to do this is by issuing:
./hadoop-daemon.sh --hosts hostname stop datanode
./hadoop-daemon.sh --hosts hostname stop journalnode
./yarn-daemon.sh --hosts hostname stop nodemanager
And then starting each of these processes individually again when the machine comes back online.
Is that safe? And is there a more efficient way to do this?

I want to ask a few questions to understand how YARN works:
Can anyone explain, or point to a document that clearly explains, the failure modes in YARN (i.e. task failure, ApplicationMaster failure, NodeManager failure, ResourceManager failure)?
What is the container size in YARN? Is it the same as a slot in MapReduce 1?
Is there a practical, working example of YARN?
Thank you
Refer to the Hadoop Definitive Guide textbook ... Apart from that, there is a lot of information on the Apache web site.
Container size is not fixed; the ResourceManager allocates it dynamically based on what the application requests.
From a developer's perspective, the same old MapReduce code will work on YARN.
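To make the point about container sizing concrete, here is a minimal Python sketch of how a memory request might be normalized before a container is granted. It assumes the scheduler rounds requests up to a multiple of yarn.scheduler.minimum-allocation-mb and rejects anything above yarn.scheduler.maximum-allocation-mb, using the stock defaults of 1024 MB and 8192 MB. The function name normalize_request_mb is hypothetical, and the exact rounding rule depends on the scheduler and Hadoop version in use.

```python
import math

MIN_ALLOCATION_MB = 1024  # default of yarn.scheduler.minimum-allocation-mb
MAX_ALLOCATION_MB = 8192  # default of yarn.scheduler.maximum-allocation-mb


def normalize_request_mb(requested_mb, min_mb=MIN_ALLOCATION_MB, max_mb=MAX_ALLOCATION_MB):
    """Round a memory request up to a multiple of min_mb (hypothetical helper).

    Assumption: requests above max_mb are rejected rather than capped.
    """
    if requested_mb > max_mb:
        raise ValueError(
            "requested %d MB exceeds the maximum allocation of %d MB" % (requested_mb, max_mb)
        )
    # Every container gets at least one multiple of the minimum allocation.
    multiples = max(1, math.ceil(requested_mb / min_mb))
    return multiples * min_mb


if __name__ == "__main__":
    for request in (1, 1024, 1500, 8192):
        print(request, "->", normalize_request_mb(request))
```

Unlike an MRv1 slot, which has a fixed size set per TaskTracker, the size here is derived from each individual request.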
ResourceManager failures
In the initial versions of the YARN framework, a ResourceManager failure meant a total cluster failure, as the ResourceManager was a single point of failure. The ResourceManager stores the state of the cluster, such as the metadata of submitted applications, information on cluster resource containers, information on the cluster's general configuration, and so on. Therefore, if the ResourceManager went down because of a hardware failure, there was no way to avoid manually debugging the cluster and restarting the ResourceManager. While the ResourceManager was down, the cluster was unavailable, and once it was restarted all jobs had to be restarted too, so half-completed jobs lost their progress. In short, a restart of the ResourceManager used to restart all the running ApplicationMasters.
The latest versions of YARN address this problem in two ways. One way is an active-passive ResourceManager architecture, so that when one ResourceManager goes down, another becomes active and takes responsibility for the cluster. Another way is to use a ZooKeeper ResourceManager quorum, so that the ResourceManager state is stored externally in ZooKeeper, one ResourceManager is in the active state, and one or more ResourceManagers are in passive mode, waiting for an event that brings them to the active state.
ApplicationMaster failures
When the ApplicationMaster fails, the ResourceManager simply starts another container with a new ApplicationMaster running in it for another application attempt. It is the responsibility of the new ApplicationMaster to recover the state of the older ApplicationMaster, and this is possible only when ApplicationMasters persist their state in an external location so that it can be used later. If the ApplicationMaster stores its state on persistent storage, all the status up to the failure can be recovered.
NodeManager failures
If a NodeManager fails, the ResourceManager detects the failure using a timeout (that is, it stops receiving heartbeats from the NodeManager). The ResourceManager then removes the NodeManager from its pool of available NodeManagers. It also kills all the containers running on that node and reports the failure to all running AMs. The AMs are then responsible for reacting to the node failure by redoing the work done by any containers that were running on that node at the time of the fault.
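The timeout-based detection described above can be sketched in Python as follows. This is a toy model under stated assumptions, not YARN's actual implementation: the class LivenessMonitor and its methods are hypothetical names, and the only value taken from YARN is the default of yarn.nm.liveness-monitor.expiry-interval-ms (600000 ms).

```python
import time

EXPIRY_INTERVAL_MS = 600_000  # default of yarn.nm.liveness-monitor.expiry-interval-ms


class LivenessMonitor:
    """Toy model of heartbeat-based failure detection (hypothetical, not YARN code)."""

    def __init__(self, expiry_interval_ms=EXPIRY_INTERVAL_MS, clock=time.monotonic):
        self.expiry_interval_s = expiry_interval_ms / 1000.0
        self.clock = clock
        self.last_heartbeat = {}  # node id -> time of the most recent heartbeat

    def heartbeat(self, node_id):
        """Record that a node has just reported in."""
        self.last_heartbeat[node_id] = self.clock()

    def expire_dead_nodes(self):
        """Remove and return the nodes whose last heartbeat is older than the expiry interval."""
        now = self.clock()
        lost = sorted(
            node_id
            for node_id, seen in self.last_heartbeat.items()
            if now - seen > self.expiry_interval_s
        )
        for node_id in lost:
            del self.last_heartbeat[node_id]
        return lost


if __name__ == "__main__":
    fake_now = [0.0]  # injectable clock so the example does not have to sleep
    monitor = LivenessMonitor(expiry_interval_ms=1000, clock=lambda: fake_now[0])
    monitor.heartbeat("node-a")
    monitor.heartbeat("node-b")
    fake_now[0] = 0.9
    monitor.heartbeat("node-a")  # node-b stays silent
    fake_now[0] = 1.5
    print(monitor.expire_dead_nodes())
```

With the default interval, a node that stops sending heartbeats would only be dropped after roughly ten minutes, which may explain why a failed node can keep showing as RUNNING for a while.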
Container failures
Container failures are reported by the NodeManager to the ResourceManager, and the ResourceManager passes the information on to the ApplicationMaster. The ApplicationMaster is then responsible for requesting a new container and rerunning the failed work.

I'm trying to set up and use a 4-node Hadoop cluster.
Setting up seems to go fine, as everything is running in the master and slave nodes.
Master: DataNode, ResourceManager, SecondaryNameNode, NameNode, NodeManager
Slaves: NodeManager, DataNode
Also, the logs show no errors. When I try to run my code however, it takes roughly the same amount of time as when I run it on a single node. Also, there is no increased CPU activity on any of the slave nodes.
Slaves can ssh to the master node, master node is listening at the correct port, ...
Any help on how I can track down the problem?
OS: Ubuntu 14.04.2
Hadoop version: 2.6.0
Basically you have only one DataNode and two NodeManagers. That is not much of an improvement over a single-node cluster. To check what is happening, you can go to the ResourceManager UI. By default it is on port 8088.
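The same information shown in the ResourceManager UI is also available from its REST API, so the number of registered NodeManagers can be checked from a script. The following is a minimal Python sketch; the helper names summarize_nodes and fetch_node_summary are hypothetical, and the default address http://localhost:8088 is an assumption that should be replaced with the real ResourceManager host.

```python
import json
from urllib.request import Request, urlopen


def summarize_nodes(payload):
    """Count nodes per state from a parsed /ws/v1/cluster/nodes response."""
    nodes = (payload.get("nodes") or {}).get("node") or []
    counts = {}
    for node in nodes:
        state = node.get("state", "UNKNOWN")
        counts[state] = counts.get(state, 0) + 1
    return counts


def fetch_node_summary(rm_url="http://localhost:8088"):
    """Query the ResourceManager (address is an assumption) and summarize node states."""
    request = Request(rm_url + "/ws/v1/cluster/nodes", headers={"Accept": "application/json"})
    with urlopen(request) as response:
        return summarize_nodes(json.load(response))


if __name__ == "__main__":
    # Illustrative payload only; a real cluster returns many more fields per node.
    sample = {"nodes": {"node": [{"state": "RUNNING"}, {"state": "RUNNING"}, {"state": "LOST"}]}}
    print(summarize_nodes(sample))
```

If fewer RUNNING nodes are reported than there are machines in the cluster, the missing NodeManagers never registered with the ResourceManager, which would be consistent with jobs running no faster than on a single node.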

I am working on a hadoop-2.6.0 single-node cluster on Windows. When I submit any MapReduce job, it always stays in the ACCEPTED state. It seems my NodeManager is in an unhealthy state. How do I make it healthy? Why is the NodeManager unhealthy, and when will it return to the healthy state?
Found the solution here
It seems that the cause of the problem was low disk space on the drive where Hadoop is installed. I freed up some space, and the NodeManager then changed to the healthy state automatically. As far as I could tell, there is no command for changing the state of a Hadoop node manually.
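The disk check behind this behaviour can be approximated with a short Python sketch. It assumes the NodeManager treats a disk as unhealthy once its utilization exceeds yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage, whose default is 90.0. The function names are hypothetical, and the real checker also looks at other things (for example, minimum free space), so this is only an illustration.

```python
import shutil

# Default of yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage
MAX_UTILIZATION_PERCENT = 90.0


def utilization_percent(total_bytes, used_bytes):
    """Return the used share of a disk as a percentage."""
    return 100.0 * used_bytes / total_bytes


def is_disk_healthy(path, max_percent=MAX_UTILIZATION_PERCENT, disk_usage=shutil.disk_usage):
    """Hypothetical check: healthy while utilization stays at or below max_percent."""
    usage = disk_usage(path)
    return utilization_percent(usage.total, usage.used) <= max_percent


if __name__ == "__main__":
    # "." stands in for a NodeManager local or log directory.
    print("healthy" if is_disk_healthy(".") else "unhealthy")
```

Running this against the drive that holds the NodeManager's local and log directories gives a rough idea of whether freeing space is likely to help.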
When the job is in the ACCEPTED state, it means that it is waiting for the ResourceManager to allocate a container on a healthy node so that it can start processing.
The following should be done:
Check for available resources (slots).
If resources are available and it is taking a long time for the status to change to RUNNING, check the health of the nodes using either Cloudera Manager or the command line (hdfs dfsadmin -report for DataNodes, yarn node -list -all for NodeManagers).
If there are dead nodes, restarting them should solve the issue.
Please try adding this property to yarn-site.xml:
<property>
  <name>yarn.nodemanager.disk-health-checker.enable</name>
  <value>false</value>
</property>

I am new to Hadoop. I know that when the NameNode fails, the entire Hadoop framework goes down, so it is a single point of failure in Hadoop. Is it the same for the JobTracker? If the JobTracker goes down, there would be no daemon to contact the NameNode after a job submission, and also no point in running the TaskTrackers. How is this handled exactly?
Yes, JobTracker is a single point of failure in MRv1. In case of JobTracker failure all running jobs are halted (http://wiki.apache.org/hadoop/JobTracker).
In YARN, the ResourceManager is not a single point of failure, provided ResourceManager high availability is configured.
If you need MRv1, you can use MapR distribution, which provides the JobTracker high availability (http://www.mapr.com/resources/videos/demo-hadoop-jobtracker-failing-and-recovering-mapr-cluster).
JobTracker HA (high availability using an active and a standby JobTracker) can be configured in the Cloudera Hadoop distribution. This feature is available from CDH4.2.1 onwards.
The same can be configured in the Hortonworks distribution as well.
In MR2 the master service is the ResourceManager, which is not a single point of failure.
Yes, the JobTracker is a single point of failure. For the NameNode, Hadoop 2 with HA enabled has a standby NameNode that takes over on failure (the Secondary NameNode only performs checkpoints and does not act as a failover NameNode). MRv2 introduces the ResourceManager concept. YARN can run several ResourceManagers: one is active and the others are in standby mode, and if the active one fails, a standby takes over.
No, a NameNode failure does not mean the Hadoop framework itself goes down; the two are different things. The Hadoop framework is a layer that runs on all nodes. If the NameNode goes down, the framework no longer knows where data is stored or where free space is available, so it is not possible to store the actual data.
The JobTracker coordinates with the NameNode to get the data to be processed, so when the NameNode fails, the JobTracker cannot work properly either. The NameNode therefore has to be working first. In Hadoop this is called the NameNode single point of failure.
The JobTracker is responsible for scheduling jobs and processing the data. If the JobTracker is not working, a client can prepare a job request, but it does not know where the job should be submitted or where it should be processed. The client has the logic it wants to run, but nowhere to submit it. So when the JobTracker fails, it is not possible to schedule jobs or process the data.
This is one of the biggest problems in big data analysis.
Hadoop 2.x resolves both of these problems. With NameNode HA in HDFS and ResourceManager HA in YARN, there is no longer a single point of failure at the NameNode level or the job scheduling level.
