Cannot Start A Certain Node Manager After Decommissioning Some Nodes - hadoop

I have a cluster with 1 namenode and 6 datanodes. After I decommissioned 3 of the datanodes, our YARN service has been reporting bad health, and it seems that the NodeManager on one of the datanodes never starts successfully. I then tried to restart the NodeManager on that box, and here are the logs:
2014-08-01 11:19:08,217 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system shutdown complete.
2014-08-01 11:19:08,217 FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager ,Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from box708.datafireball.com, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:185)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:197)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:352)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:398)
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager ,Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from box708.datafireball.com, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:255)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:179)
... 6 more
I searched for this error but could not find a solution. Can anyone offer some guidance?

Message from ResourceManager: Disallowed NodeManager
This message means that either your NodeManager isn't in the list of allowed NodeManagers or it is in the exclude list.
Check the ResourceManager configuration for the following properties:
yarn.resourcemanager.nodes.include-path
yarn.resourcemanager.nodes.exclude-path

buryat is correct. I had the same problem, and the fix was to add all the nodes to the include list. I would like to add one note for anyone who runs across this issue.
Make sure you add EXACTLY the hostname that YARN is complaining about. In your example: ResourceManager: Disallowed NodeManager from box708.datafireball.com
In my case I was adding a node named "gpu-0-5". The "gpu-0-5" hostname was in my yarn.include file, but YARN kept complaining. I noticed that the message said "gpu-0-5.local" (even though gpu-0-5 routes to the same machine). Once I added gpu-0-5.local to my yarn.include list, it started working.
I'm not sure how to change the YARN configuration so that it only requires "gpu-0-5".
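The exact-match requirement can be checked with a small script before restarting the NodeManager. This is only a sketch: the function name host_listed and the file path in the usage note are hypothetical, and it assumes the include/exclude files hold whitespace-separated hostnames with optional # comments.

```shell
#!/usr/bin/env bash
# host_listed HOST FILE
# Succeeds only if HOST appears as an exact, whole entry in FILE.
# Assumption: entries are separated by whitespace or newlines, and
# anything after a '#' on a line is a comment.
host_listed() {
  local host="$1" file="$2"
  awk -v h="$host" '
    { sub(/#.*/, "") }                                    # drop trailing comments
    { for (i = 1; i <= NF; i++) if ($i == h) found = 1 }  # exact entry match
    END { exit !found }                                   # exit status 0 when found
  ' "$file"
}
```

For example, host_listed gpu-0-5.local /etc/hadoop/conf/yarn.include (an assumed path) should succeed before you expect the ResourceManager to accept that node. After editing the include or exclude file, yarn rmadmin -refreshNodes asks the ResourceManager to reload it.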

Related

Node manager stops running after a few moments

I am getting the error below:
ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Unexpected error starting NodeStatusUpdater
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager ,Registration of NodeManager failed, Message from ResourceManager: NodeManager from ubuntu-VirtualBox doesn't satisfy minimum allocations, Sending SHUTDOWN signal to the NodeManager.

Failed to start namenode: java.lang.IllegalStateException

I am using an Apache Hadoop 2.7.1 high-availability cluster that consists of two namenodes, mn1 and mn2, and 3 journal nodes.
While working on the cluster I ran into the following problem:
when I run start-dfs.sh, mn1 is standby and mn2 is active,
but after that, if either of the two namenodes goes down, it cannot be started again.
Here are the last lines of the log from one of the two namenodes:
2017-08-05 09:37:21,063 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Need to save fs image? false (staleImage=true, haEnabled=true, isRollingUpgrade=false)
2017-08-05 09:37:21,063 INFO org.apache.hadoop.hdfs.server.namenode.NameCache: initialized with 3 entries 72 lookups
2017-08-05 09:37:21,088 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 7052 msecs
2017-08-05 09:37:21,300 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: RPC server is binding to mn2:8020
2017-08-05 09:37:21,304 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
2017-08-05 09:37:21,316 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8020
2017-08-05 09:37:21,353 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemState MBean
2017-08-05 09:37:21,354 WARN org.apache.hadoop.hdfs.server.common.Util: Path /opt/hadoop/metadata_dir should be specified as a URI in configuration files. Please update hdfs configuration.
2017-08-05 09:37:21,361 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.lang.IllegalStateException
at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
at org.apache.hadoop.hdfs.server.namenode.LeaseManager.getNumUnderConstructionBlocks(LeaseManager.java:119)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCompleteBlocksTotal(FSNamesystem.java:5741)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startCommonServices(FSNamesystem.java:1063)
at org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:678)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:664)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:811)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:795)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1488)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1554)
2017-08-05 09:37:21,364 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2017-08-05 09:37:21,365 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at mn2/192.168.25.22
************************************************************/
One possible cause:
1. The namenode port may have been changed on each node.
This is a particularly vexing problem.
Swallow IllegalStateExceptions thrown by removeShutdownHook in FileSystem. The javadoc states:
public boolean removeShutdownHook(Thread hook)
Throws:
IllegalStateException - If the virtual machine is already in the process of shutting down
So if we are getting this exception, it MEANS we are already in the process of shutdown, so we CANNOT, try what we may, removeShutdownHook. If Runtime had a method Runtime.isShutdownInProgress(), we could have checked for it before the removeShutdownHook call. As it stands, there is no such method. In my opinion, this would be a good patch regardless of the needs for this JIRA.
Do not send SIGTERMs from the NM to the MR-AM in the first place. Rather, we should expose a mechanism for the NM to politely tell the AM that it is no longer needed and should shut down as soon as possible. Even then, if an admin were to kill the MRAppMaster with a SIGTERM, the JobHistory would be lost, defeating the purpose of 3614.
I discovered that my problem was in the journal node and not in the namenode,
even though the namenode log shows the error mentioned in the question.
jps lists the journal node, but that is misleading: the journal node service was actually shut down even though it still appeared in the jps output.
As a solution, I ran hadoop-daemon.sh stop journalnode,
then hadoop-daemon.sh start journalnode,
and then the namenode started working again.
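Because jps only shows that a JVM exists, a more direct test is whether the journal node still accepts connections. The sketch below assumes bash (for its /dev/tcp redirection), the coreutils timeout command, and the default JournalNode RPC port 8485. The function name port_open and the host name jn1 in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# port_open HOST PORT
# Succeeds if a TCP connection to HOST:PORT can be opened within 2 seconds.
# The inner bash receives HOST as $0 and PORT as $1.
port_open() {
  local host="$1" port="$2"
  timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

For example, port_open jn1 8485 || echo "journal node on jn1 is not answering" flags a journal node that jps still lists but that no longer serves requests. If the probe fails, the stop/start sequence described above is the fix that worked in this case.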

Hadoop Name node is not getting started

I'm trying to configure Hadoop in fully distributed mode with 1 master and 1 slave as different nodes. I have attached a screenshot showing the status of my master and slave nodes.
In Master:
ubuntu@hadoop-master:/usr/local/hadoop/etc/hadoop$ $HADOOP_HOME/bin/hdfs dfsadmin -refreshNodes
refreshNodes: Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message tag had invalid wire type.; Host Details : local host is: "hadoop-master/127.0.0.1"; destination host is: "hadoop-master":8020;
This is the error I get when I try to run the refresh nodes command. Can anyone tell me what I'm missing or what mistake I have made?
Master & Slave Screenshot
2016-04-26 03:29:17,090 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for standby state
2016-04-26 03:29:17,095 INFO org.mortbay.log: Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:50070
2016-04-26 03:29:17,095 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NameNode metrics system...
2016-04-26 03:29:17,095 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system stopped.
2016-04-26 03:29:17,096 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2016-04-26 03:29:17,097 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.net.BindException: Problem binding to [hadoop-master:8020] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:721)
at org.apache.hadoop.ipc.Server.bind(Server.java:425)
at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:574)
at org.apache.hadoop.ipc.Server.<init>(Server.java:2215)
at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:938)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:534)
at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:509)
at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:783)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.<init>(NameNodeRpcServer.java:344)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createRpcServer(NameNode.java:673)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:646)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:811)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:795)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1488)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1554)
Caused by: java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:463)
at sun.nio.ch.Net.bind(Net.java:455)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at org.apache.hadoop.ipc.Server.bind(Server.java:408)
... 13 more
2016-04-26 03:29:17,103 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2016-04-26 03:29:17,109 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop-master/127.0.0.1
************************************************************/
ubuntu@hadoop-master:/usr/local/hadoop$
DFS needs to be formatted (note that formatting erases any existing HDFS metadata). Issue one of the following commands:
hadoop namenode -format
Or
hdfs namenode -format
Check the namenode address in core-site.xml and try changing its port, for example to 9000; 50070 is normally taken by the namenode web UI.
The default address of the namenode web UI is http://localhost:50070/. You can open this address in your browser and check the namenode information.
The default address of the namenode server is hdfs://localhost:8020/. You can connect to it to access HDFS through the HDFS API. This is the real service address.
http://blog.cloudera.com/blog/2009/08/hadoop-default-ports-quick-reference/
Try to format the Namenode.
$ hadoop namenode -format
Your error logs clearly say that the namenode is not able to bind to the default port.
java.net.BindException: Problem binding to [hadoop-master:8020] java.net.BindException: Address already in use; For more details see: http://wiki.apache.org/hadoop/BindException
You need to change the default port to a port that is free.
The default ports are listed in hdfs-default.xml.
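Before moving the namenode to another port, it may be worth checking whether something is still listening on 8020, for example a namenode process left over from an earlier start. A rough sketch, assuming a Linux host where the ss utility from iproute2 is available; the helper name listening_on is hypothetical.

```shell
#!/usr/bin/env bash
# listening_on PORT
# Reads `ss -ltn` output on stdin and succeeds if any listening socket
# uses PORT as its local port (column 4 is "Local Address:Port").
listening_on() {
  awk -v p="$1" '
    NR > 1 { n = split($4, part, ":"); if (part[n] == p) found = 1 }  # skip header row
    END { exit !found }                                               # exit status 0 when found
  '
}

# Typical use (not run here):
#   ss -ltn | listening_on 8020 && echo "port 8020 is already in use"
```

Running sudo ss -ltnp | grep ':8020 ' additionally shows the PID and name of the process that owns the port, which tells you whether stopping a stale namenode is a better fix than changing the port.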

YARN not able to add vcores

I am trying to run a simple job on Hadoop in pseudo-distributed mode.
I only have a single machine, and I would like to run a simple word count using YARN.
I submit the application but it never runs. Looking at the ResourceManager, I think I understand the problem: I don't have any vcores allocated.
When I check the status of the application I am running the status is
ACCEPTED: waiting for AM container to be allocated, launched and register with RM.
I believe I just need to allocate a single vcore and some RAM, but I have no idea how to do that.
user@user:~/hadoop-2.6.1$ jps
18420 Jps
11076 NameNode
8772 DataNode
15786 NodeManager
11296 SecondaryNameNode
16652 ResourceManager
simo@simo:~/hadoop-2.6.1$ jps
11076 NameNode
8772 DataNode
15786 NodeManager
11296 SecondaryNameNode
18881 Jps
16652 ResourceManager
The problem was a lack of free space on my hard disk.
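One likely mechanism, offered as background rather than something the poster confirmed: the NodeManager's disk health checker stops using a local directory once its disk is fuller than yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage (90 by default), and a node with no usable directories is reported as unhealthy and offers no vcores or memory. A quick check can be sketched as follows; the helper name disk_over_limit and the directory in the usage note are assumptions.

```shell
#!/usr/bin/env bash
# disk_over_limit DIR [LIMIT]
# Succeeds if the filesystem holding DIR is more than LIMIT percent full.
# LIMIT defaults to 90, mirroring the NodeManager default.
disk_over_limit() {
  local dir="$1" limit="${2:-90}" used
  # df -P prints one line per filesystem; column 5 is the capacity, e.g. "42%".
  used=$(df -P "$dir" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
  [ "$used" -gt "$limit" ]
}
```

For example, disk_over_limit "/tmp/hadoop-$USER/nm-local-dir" && echo "free some disk space" checks the directory that yarn.nodemanager.local-dirs points to when it is left at its default. Adjust the path if your configuration sets it explicitly.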

Unable to initialize namenode, datanode, jobtracker, tasktracker on CentOS

When I run the command
for service in /etc/init.d/hadoop*
do
sudo "$service" stop
done
it stops all the services,
and when I run
for service in /etc/init.d/hadoop-hdfs-*
do
sudo "$service" stop
done
it stops all the services.
Sometimes the datanode starts and sometimes the namenode,
e.g.:
21270 NameNode
21422 Jps
21374 SecondaryNameNode
2624 HMaster
or
11070 DataNode
11422 Jps
11554 SecondaryNameNode
2554 HMaster
The same thing happens for the jobtracker and tasktracker.
I tried formatting the namenode, but it didn't help.
I also tried changing the localhost port in
core-site.xml from 8020 to 50020,
and in mapred-site.xml from 8021 to 50020.
This time jps shows NameNode, DataNode, JobTracker and TaskTracker,
but when I check localhost:50070 and localhost:50030 in the browser,
they still refer to 8020 instead of 50020.
Why is this happening?
Please help.
Run the following script from terminal to stop the running hadoop daemons.
$HADOOP_INSTALL/hadoop/bin/stop-all.sh
Run the following script from terminal to start the hadoop daemons.
$HADOOP_INSTALL/hadoop/bin/start-all.sh
