I had a perfectly working installation and a running HBase cluster with 2 nodes.
I shut down the servers, and now when I restart them I get this error.
No configuration has been changed; the IPs of the servers are the same, and the NameNode and DataNodes are also exactly the same.
What I have noticed is that the HBase master starts and runs: I can log on to the HBase shell and list all the tables, but I cannot read any data or create any new tables.
I have checked with jps that all DataNodes and NameNodes are started, and I have checked that they have also started on the other nodes.
From my previous installation notes I noticed that the ResourceManager is not running; I am not sure if this is relevant.
ZooKeeper is also running without any errors.
I am not sure what is going on, but it is really critical for me to solve this.
Detailed info on the steps followed and errors encountered
The steps I followed to start the HBase cluster are as follows:
Start HDFS
start-dfs.sh
JPS Output
2164 NameNode
2519 Jps
2399 SecondaryNameNode
Start Zookeeper
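(The exact command is not shown here; with an external/standalone ZooKeeper install this is typically something like the line below, which is consistent with the QuorumPeerMain process in the jps output that follows. Treat it as an assumption, not the verbatim command used.)
zkServer.sh start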
JPS output
2164 NameNode
2554 QuorumPeerMain
2588 Jps
2399 SecondaryNameNode
Start hbase
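(Presumably via start-hbase.sh; this is an assumption, but the console output below, with the master and the two region servers rd-demo-hbase-c1 and rd-demo-hbase-c2 starting, is consistent with it.)
start-hbase.sh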
This gives the following on the console
/opt/hadoop/libexec/hadoop-functions.sh: line 2366: HADOOP_ORG.APACHE.HADOOP.HBASE.UTIL.GETJAVAPROPERTY_USER: bad substitution
/opt/hadoop/libexec/hadoop-functions.sh: line 2461: HADOOP_ORG.APACHE.HADOOP.HBASE.UTIL.GETJAVAPROPERTY_OPTS: bad substitution
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hbase/lib/client-facing-thirdparty/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
running master, logging to /opt/hbase/bin/../logs/hbase-hadoop-master-rd-demo-hbase.out
/opt/hadoop/libexec/hadoop-functions.sh: line 2366: HADOOP_ORG.APACHE.HADOOP.HBASE.UTIL.GETJAVAPROPERTY_USER: bad substitution
/opt/hadoop/libexec/hadoop-functions.sh: line 2461: HADOOP_ORG.APACHE.HADOOP.HBASE.UTIL.GETJAVAPROPERTY_OPTS: bad substitution
rd-demo-hbase-c1: running regionserver, logging to /opt/hbase/bin/../logs/hbase-hadoop-regionserver-rd-demo-hbase-c1.out
rd-demo-hbase-c2: running regionserver, logging to /opt/hbase/bin/../logs/hbase-hadoop-regionserver-rd-demo-hbase-c2.out
rd-demo-hbase-c1: /opt/hadoop/libexec/hadoop-functions.sh: line 2366: HADOOP_ORG.APACHE.HADOOP.HBASE.UTIL.GETJAVAPROPERTY_USER: bad substitution
rd-demo-hbase-c1: /opt/hadoop/libexec/hadoop-functions.sh: line 2461: HADOOP_ORG.APACHE.HADOOP.HBASE.UTIL.GETJAVAPROPERTY_OPTS: bad substitution
rd-demo-hbase-c2: /opt/hadoop/libexec/hadoop-functions.sh: line 2366: HADOOP_ORG.APACHE.HADOOP.HBASE.UTIL.GETJAVAPROPERTY_USER: bad substitution
rd-demo-hbase-c2: /opt/hadoop/libexec/hadoop-functions.sh: line 2461: HADOOP_ORG.APACHE.HADOOP.HBASE.UTIL.GETJAVAPROPERTY_OPTS: bad substitution
Up to this point there are no errors in the hbase-hadoop-master-hbase.log
JPS output
2832 HMaster
2164 NameNode
2554 QuorumPeerMain
2399 SecondaryNameNode
3183 Jps
This implies that HMaster is indeed running.
Log on to the HBase shell
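(i.e. running the HBase shell client, presumably with the line below; the 2.4.4 version banner further down is consistent with this, but the exact invocation is an assumption.)
hbase shell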
This gives some warnings:
/opt/hadoop/libexec/hadoop-functions.sh: line 2366: HADOOP_ORG.APACHE.HADOOP.HBASE.UTIL.GETJAVAPROPERTY_USER: bad substitution
/opt/hadoop/libexec/hadoop-functions.sh: line 2461: HADOOP_ORG.APACHE.HADOOP.HBASE.UTIL.GETJAVAPROPERTY_OPTS: bad substitution
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hbase/lib/client-facing-thirdparty/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
HBase Shell
Use "help" to get list of supported commands.
Use "exit" to quit this interactive shell.
For Reference, please visit: http://hbase.apache.org/2.0/book.html#shell
Version 2.4.4, r20e7ba45b0c3affdc0c06b1a0e5cbddd1b2d8d18, Mon Jun 7 15:31:55 PDT 2021
Took 0.0052 seconds
It successfully logs on to the HBase shell.
The list command shows the tables present.
As soon as I try to scan a table, things start to go wrong.
The HBase shell shows the following:
scan 'md_Domains'
ERROR: Unknown table md_Domains!
The HBase logs show the following:
2022-01-27 18:49:18,055 ERROR [master/rd-demo-hbase:16000:becomeActiveMaster] master.HMaster: Failed to become active master
java.lang.IllegalStateException: Expected the service ClusterSchemaServiceImpl [FAILED] to be RUNNING, but the service has FAILED
at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:379)
at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:319)
at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1233)
at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1028)
at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2091)
at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:507)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned and enabled: tableName=hbase:namespace, state=ENABLED
at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:107)
at org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:63)
at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:249)
at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1231)
... 4 more
2022-01-27 18:49:18,056 ERROR [master/rd-demo-hbase:16000:becomeActiveMaster] master.HMaster: Master server abort: loaded coprocessors are: []
2022-01-27 18:49:18,057 ERROR [master/rd-demo-hbase:16000:becomeActiveMaster] master.HMaster: ***** ABORTING master rd-demo-hbase.c.rd-demo-320517.internal,16000,1643309039738: Unhandled exception. Starting shutdown. *****
java.lang.IllegalStateException: Expected the service ClusterSchemaServiceImpl [FAILED] to be RUNNING, but the service has FAILED
at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:379)
at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:319)
at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1233)
at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1028)
at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2091)
at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:507)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned and enabled: tableName=hbase:namespace, state=ENABLED
at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:107)
at org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:63)
at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:249)
at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1231)
... 4 more
2022-01-27 18:49:18,057 INFO [master/rd-demo-hbase:16000:becomeActiveMaster] regionserver.HRegionServer: ***** STOPPING region server 'rd-demo-hbase.c.rd-demo-320517.internal,16000,1643309039738' *****
2022-01-27 18:49:18,057 INFO [master/rd-demo-hbase:16000:becomeActiveMaster] regionserver.HRegionServer: STOPPED: Stopped by master/rd-demo-hbase:16000:becomeActiveMaster
2022-01-27 18:49:18,058 INFO [master/rd-demo-hbase:16000] regionserver.HRegionServer: Stopping infoServer
2022-01-27 18:49:18,070 INFO [master/rd-demo-hbase:16000] handler.ContextHandler: Stopped o.a.h.t.o.e.j.w.WebAppContext#35c12c7a{master,/,null,STOPPED}{file:/opt/hbase/hbase-webapps/master}
2022-01-27 18:49:18,075 INFO [master/rd-demo-hbase:16000] server.AbstractConnector: Stopped ServerConnector#3db972d2{HTTP/1.1, (http/1.1)}{0.0.0.0:16010}
2022-01-27 18:49:18,076 INFO [master/rd-demo-hbase:16000] server.session: node0 Stopped scavenging
2022-01-27 18:49:18,083 INFO [master/rd-demo-hbase:16000] handler.ContextHandler: Stopped o.a.h.t.o.e.j.s.ServletContextHandler#3d5790ea{static,/static,file:///opt/hbase/hbase-webapps/static/,STOPPED}
2022-01-27 18:49:18,090 INFO [master/rd-demo-hbase:16000] handler.ContextHandler: Stopped o.a.h.t.o.e.j.s.ServletContextHandler#bfc14b9{logs,/logs,file:///opt/hbase/logs/,STOPPED}
2022-01-27 18:49:18,094 INFO [master/rd-demo-hbase:16000] regionserver.HRegionServer: aborting server rd-demo-hbase.c.rd-demo-320517.internal,16000,1643309039738
2022-01-27 18:49:18,095 INFO [master/rd-demo-hbase:16000] regionserver.HRegionServer: stopping server rd-demo-hbase.c.rd-demo-320517.internal,16000,1643309039738; all regions closed.
2022-01-27 18:49:18,095 INFO [master/rd-demo-hbase:16000] master.ReplicationLogCleaner: Stopping replicationLogCleaner-0x1000019d2180006, quorum=rd-demo-hbase:2181, baseZNode=/hbase
2022-01-27 18:49:18,096 WARN [OldWALsCleaner-1] cleaner.LogCleaner: Interrupted while cleaning old WALs, will try to clean it next round. Exiting.
2022-01-27 18:49:18,098 WARN [OldWALsCleaner-0] cleaner.LogCleaner: Interrupted while cleaning old WALs, will try to clean it next round. Exiting.
2022-01-27 18:49:18,201 INFO [master/rd-demo-hbase:16000] zookeeper.ZooKeeper: Session: 0x1000019d2180006 closed
2022-01-27 18:49:18,202 INFO [master/rd-demo-hbase:16000:becomeActiveMaster-EventThread] zookeeper.ClientCnxn: EventThread shut down for session: 0x1000019d2180006
2022-01-27 18:49:18,202 INFO [master/rd-demo-hbase:16000] hbase.ChoreService: Chore service for: master/rd-demo-hbase:16000 had [] on shutdown
2022-01-27 18:49:18,203 INFO [master/rd-demo-hbase:16000] procedure2.RemoteProcedureDispatcher: Stopping procedure remote dispatcher
2022-01-27 18:49:18,203 INFO [master/rd-demo-hbase:16000] procedure2.ProcedureExecutor: Stopping
2022-01-27 18:49:18,206 INFO [master/rd-demo-hbase:16000] region.RegionProcedureStore: Stopping the Region Procedure Store, isAbort=true
2022-01-27 18:49:18,208 WARN [master/rd-demo-hbase:16000] master.ActiveMasterManager: Failed get of master address: java.io.IOException: Can't get master address from ZooKeeper; znode data == null
2022-01-27 18:49:18,208 INFO [master/rd-demo-hbase:16000] assignment.AssignmentManager: Stopping assignment manager
2022-01-27 18:49:18,208 INFO [master/rd-demo-hbase:16000] region.MasterRegion: Closing local region {ENCODED => 1595e783b53d99cd5eef43b6debb2682, NAME => 'master:store,,1.1595e783b53d99cd5eef43b6debb2682.', STARTKEY => '', ENDKEY => ''}, isAbort=true
2022-01-27 18:49:18,242 INFO [master/rd-demo-hbase:16000] regionserver.HRegion: Closing region master:store,,1.1595e783b53d99cd5eef43b6debb2682.
2022-01-27 18:49:18,248 ERROR [master/rd-demo-hbase:16000] regionserver.HRegion: Memstore data size is 54229 in region master:store,,1.1595e783b53d99cd5eef43b6debb2682.
2022-01-27 18:49:18,248 INFO [master/rd-demo-hbase:16000] regionserver.HRegion: Closed master:store,,1.1595e783b53d99cd5eef43b6debb2682.
2022-01-27 18:49:18,248 INFO [master/rd-demo-hbase:16000] flush.MasterFlushTableProcedureManager: stop: server shutting down.
2022-01-27 18:49:18,249 INFO [master/rd-demo-hbase:16000] ipc.NettyRpcServer: Stopping server on /192.168.0.111:16000
2022-01-27 18:49:18,252 INFO [master:store-WAL-Roller] wal.AbstractWALRoller: LogRoller exiting.
2022-01-27 18:49:18,373 INFO [master/rd-demo-hbase:16000] zookeeper.ZooKeeper: Session: 0x1000019d2180000 closed
2022-01-27 18:49:18,373 INFO [master/rd-demo-hbase:16000] regionserver.HRegionServer: Exiting; stopping=rd-demo-hbase.c.rd-demo-320517.internal,16000,1643309039738; zookeeper connection closed.
2022-01-27 18:49:18,374 ERROR [main] master.HMasterCommandLine: Master exiting
java.lang.RuntimeException: HMaster Aborted
at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:261)
at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:149)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:149)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2872)
2022-01-27 18:49:18,374 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down for session: 0x1000019d2180000
Running jps now shows that HMaster is no longer running:
2164 NameNode
3416 Jps
2554 QuorumPeerMain
2399 SecondaryNameNode
I am trying to restart one of the NameNodes (nn2), but I get the following error in the logs:
2021-12-17 10:23:53,676 ERROR namenode.NameNode (NameNode.java:main(1715)) - Failed to start namenode.
org.apache.hadoop.hdfs.server.namenode.EditLogInputException: Error replaying edit log at offset 0. Expected transaction ID was 274488049
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:226)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:160)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:890)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1090)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:937)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:910)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710)
Caused by: org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream$PrematureEOFException: got premature end-of-file at txid 274488048; expected file to go up to 274488109
at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:197)
at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.skipUntil(EditLogInputStream.java:151)
at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:179)
at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:213)
... 12 more
2021-12-17 10:23:53,678 INFO util.ExitUtil (ExitUtil.java:terminate(210)) - Exiting with status 1: org.apache.hadoop.hdfs.server.namenode.EditLogInputException: Error replaying edit log at offset 0. Expected transaction ID was 274488049
2021-12-17 10:23:53,681 INFO namenode.NameNode (LogAdapter.java:info(51)) - SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at XX-XXX-XX-XXXX.XXXXX.XX/XX.X.XX.XX
************************************************************/
I tried the following steps to solve the issue:
I copied the following edit logs from nn01 to the NameNode directories of nn02:
edits_0000000000274487928-0000000000274488048
edits_0000000000274488049-0000000000274488109
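For reference, the copy was done roughly as sketched below; the hostname nn01 comes from above, but the /data/hadoop/namenode/current path is only an assumption standing in for the actual dfs.namenode.name.dir on both nodes.
scp nn01:/data/hadoop/namenode/current/edits_0000000000274487928-0000000000274488048 /data/hadoop/namenode/current/
scp nn01:/data/hadoop/namenode/current/edits_0000000000274488049-0000000000274488109 /data/hadoop/namenode/current/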
So far nn02 is still not starting and I get the same error.
Can you please help?
If this is an HA setup and your NN1 is working properly, format your NN2 (hdfs namenode -format) and do a bootstrap (hdfs namenode -bootstrapStandby).
Then try restarting NN2.
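A minimal sketch of that sequence, run on nn02 and assuming nn01 is the healthy active NameNode (hadoop-daemon.sh is the Hadoop 2.x script; on Hadoop 3 the equivalent is hdfs --daemon stop/start namenode):
# stop the NameNode on nn02 if a half-started instance is still around
hadoop-daemon.sh stop namenode
# re-initialize nn02's metadata, then pull a consistent fsimage from the active NameNode
hdfs namenode -format
hdfs namenode -bootstrapStandby
# start the NameNode on nn02 again
hadoop-daemon.sh start namenode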
I am using an Apache Hadoop 2.7.1 high-availability cluster that consists of
two NameNodes (mn1, mn2) and 3 JournalNodes.
While I was working on the cluster I faced the following error.
When I issue start-dfs.sh, mn1 is standby and mn2 is active,
but after that, if one of these two NameNodes goes down, there is no
way to turn it on again.
Here are the last lines of the log of one of these two NameNodes:
2017-08-05 09:37:21,063 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Need to save fs image? false (staleImage=true, haEnabled=true, isRollingUpgrade=false)
2017-08-05 09:37:21,063 INFO org.apache.hadoop.hdfs.server.namenode.NameCache: initialized with 3 entries 72 lookups
2017-08-05 09:37:21,088 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 7052 msecs
2017-08-05 09:37:21,300 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: RPC server is binding to mn2:8020
2017-08-05 09:37:21,304 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
2017-08-05 09:37:21,316 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8020
2017-08-05 09:37:21,353 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemState MBean
2017-08-05 09:37:21,354 WARN org.apache.hadoop.hdfs.server.common.Util: Path /opt/hadoop/metadata_dir should be specified as a URI in configuration files. Please update hdfs configuration.
2017-08-05 09:37:21,361 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.lang.IllegalStateException
at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
at org.apache.hadoop.hdfs.server.namenode.LeaseManager.getNumUnderConstructionBlocks(LeaseManager.java:119)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCompleteBlocksTotal(FSNamesystem.java:5741)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startCommonServices(FSNamesystem.java:1063)
at org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:678)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:664)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:811)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:795)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1488)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1554)
2017-08-05 09:37:21,364 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2017-08-05 09:37:21,365 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at mn2/192.168.25.22
************************************************************/
This may be because:
1. The NameNode port may have changed for one of the nodes, so check that each NameNode is configured with the RPC port the rest of the cluster expects.
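A quick way to check the configured RPC addresses, assuming an HA nameservice called mycluster with NameNode IDs nn1 and nn2 (substitute the names from your own hdfs-site.xml):
hdfs getconf -confKey dfs.namenode.rpc-address.mycluster.nn1
hdfs getconf -confKey dfs.namenode.rpc-address.mycluster.nn2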
This is a particularly vexing problem.
One option is to swallow the IllegalStateException thrown by removeShutdownHook in FileSystem. The Javadoc states:
public boolean removeShutdownHook(Thread hook)
Throws:
IllegalStateException - If the virtual machine is already in the process of shutting down
So if we are getting this exception, it means we are already in the process of shutting down, so we cannot, try what we may, call removeShutdownHook. If Runtime had a method Runtime.isShutdownInProgress(), we could have checked for it before the removeShutdownHook call. As it stands, there is no such method. In my opinion, this would be a good patch regardless of the needs of this JIRA.
Another option is to not send SIGTERMs from the NM to the MR-AM in the first place. Rather, we should expose a mechanism for the NM to politely tell the AM that it is no longer needed and should shut down as soon as possible. Even after this, if an admin were to kill the MRAppMaster with a SIGTERM, the JobHistory would be lost, defeating the purpose of 3614.
I discovered that my problem was in the JournalNode and not in the NameNode,
even though the NameNode log shows the error mentioned in the question.
jps shows the JournalNode, but this is misleading, because the JournalNode service is actually shut down
even though it appears in the jps output.
So, as a solution, I issued hadoop-daemon.sh stop journalnode,
then hadoop-daemon.sh start journalnode,
and then the NameNode started to work again (see the sketch below).
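For reference, the full restart sequence on the affected node looks roughly like this (hadoop-daemon.sh is the Hadoop 2.x script; the port check assumes the default dfs.journalnode.rpc-address):
# restart the JournalNode that jps was falsely reporting as healthy
hadoop-daemon.sh stop journalnode
hadoop-daemon.sh start journalnode
# verify it is actually up and listening (8485 is the default JournalNode RPC port)
jps
netstat -tlnp | grep 8485
# then bring the NameNode back
hadoop-daemon.sh start namenode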
I have a cluster with 1 NameNode and 6 DataNodes. After decommissioning 3 of the DataNodes, our YARN service is always in bad health, and it seems the NodeManager on one of the DataNodes never gets started successfully. I then tried to restart the NodeManager on that box, and here are the logs:
2014-08-01 11:19:08,217 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system shutdown complete.
2014-08-01 11:19:08,217 FATAL org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting NodeManager
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager ,Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from box708.datafireball.com, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:185)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:197)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:352)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:398)
Caused by: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager ,Registration of NodeManager failed, Message from ResourceManager: Disallowed NodeManager from box708.datafireball.com, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:255)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:179)
... 6 more
I googled this error but cannot find a solution. Any guidance?
Message from ResourceManager: Disallowed NodeManager
This message means that your NodeManager either isn't in the allowed list of NodeManagers or is in the list of excluded ones.
Check your ResourceManager configuration for the following properties, then refresh the node lists as sketched after the list:
yarn.resourcemanager.nodes.include-path
yarn.resourcemanager.nodes.exclude-path
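A hedged sketch of checking and applying these on the ResourceManager host; the /etc/hadoop/conf path and the yarn.include file name are assumptions, so use whatever files those two properties actually point to in your yarn-site.xml:
# see which include/exclude files the ResourceManager is using
grep -A1 'yarn.resourcemanager.nodes' /etc/hadoop/conf/yarn-site.xml
# make sure the complaining host is in the include file and not in the exclude file
echo 'box708.datafireball.com' >> /etc/hadoop/conf/yarn.include
# ask the ResourceManager to re-read the lists without a restart
yarn rmadmin -refreshNodes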
buryat is correct. I had this same problem and the fix was to add all the nodes to the include list. But I would like to add this note for anyone running across this issue.
Make sure to add EXACTLY the hostname that YARN is complaining about. In your example that is ResourceManager: Disallowed NodeManager from box708.datafireball.com.
In my case I was adding a node named "gpu-0-5". The "gpu-0-5" hostname was in my yarn.include file and YARN kept complaining. Then I noticed the message said "gpu-0-5.local" (even though gpu-0-5 routes to the same machine). Once I added gpu-0-5.local to my yarn.include list, it started working.
I'm not sure how to change the YARN configuration to only require "gpu-0-5".
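One way to find the exact name YARN expects, assuming you can log on to the affected NodeManager host (the registered name usually follows the machine's fully-qualified hostname, although DNS and yarn-site.xml settings can override this):
# on the NodeManager host: print the fully-qualified hostname it will register with
hostname -f
Whatever this prints (gpu-0-5.local in my case) is what needs to go into the include file verbatim, followed by yarn rmadmin -refreshNodes on the ResourceManager.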