HA NameNode failover fails when the whole server is killed - hadoop

I'm trying to set up an HA HDFS cluster as explained in the Apache documentation:
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
When I kill the process of the active NameNode, the failover works fine, but when I cause a bigger failure, like unplugging the network cable, the standby NameNode does not go active.
The standby node tries to fence the formerly active node by setting it to standby, but since the machine is dead, establishing an SSH connection is impossible.
Is there anything I've missed?
This is what the hadoop-hduser-zkfc-nn1.log says:
2016-09-27 15:03:11,316 INFO org.apache.hadoop.ha.NodeFencer: ====== Beginning Service Fencing Process... ======
2016-09-27 15:03:20,316 INFO org.apache.hadoop.ha.NodeFencer: Trying method 1/1: org.apache.hadoop.ha.SshFenceByTcpPort(null)
2016-09-27 15:03:20,317 INFO org.apache.hadoop.ha.SshFenceByTcpPort: Connecting to nn2.example.org...
2016-09-27 15:03:20,317 INFO org.apache.hadoop.ha.SshFenceByTcpPort.jsch: Connecting to nn2.example.org port 22
2016-09-27 15:03:23,315 WARN org.apache.hadoop.ha.SshFenceByTcpPort: Unable to connect to nn2.example.org as user hduser
com.jcraft.jsch.JSchException: java.net.NoRouteToHostException: No route to host
at com.jcraft.jsch.Util.createSocket(Util.java:386)
at com.jcraft.jsch.Session.connect(Session.java:182)
at org.apache.hadoop.ha.SshFenceByTcpPort.tryFence(SshFenceByTcpPort.java:100)
at org.apache.hadoop.ha.NodeFencer.fence(NodeFencer.java:97)
at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:532)
at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-09-27 15:03:23,315 WARN org.apache.hadoop.ha.NodeFencer: Fencing method org.apache.hadoop.ha.SshFenceByTcpPort(null) was unsuccessful.
2016-09-27 15:03:23,315 ERROR org.apache.hadoop.ha.NodeFencer: Unable to fence service by any configured method.
2016-09-27 15:03:23,315 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
java.lang.RuntimeException: Unable to fence NameNode at nn2.example.org/172.16.1.188:8040
at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:533)
at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
2016-09-27 15:03:23,316 INFO org.apache.hadoop.ha.ActiveStandbyElector: Trying to re-establish ZK session
2016-09-27 15:03:23,325 INFO org.apache.zookeeper.ZooKeeper: Session: 0x3576b5e84d4003a closed
2016-09-27 15:03:24,329 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=nn1.example.org:2181,nn2.example.org:2181,dn1.example.org:2181,dn2.example.org:2181,dn3.example.org:2181 sessionTimeout=5000 watcher=org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef@3bf4eda6
2016-09-27 15:03:24,335 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server nn1.example.org/172.16.1.187:2181. Will not attempt to authenticate using SASL (unknown error)
2016-09-27 15:03:24,342 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to nn1.example.org/172.16.1.187:2181, initiating session
2016-09-27 15:03:24,379 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server nn1.example.org/172.16.1.187:2181, sessionid = 0x1576b95292f0001, negotiated timeout = 5000
2016-09-27 15:03:24,386 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2016-09-27 15:03:24,403 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2016-09-27 15:03:24,407 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2016-09-27 15:03:24,425 INFO org.apache.hadoop.ha.ActiveStandbyElector: Old node exists: 0a0a68612d636c757374657212036e6e321a1668646d617374657230322e7265696368656c742e646520e83e28d33e
2016-09-27 15:03:24,444 INFO org.apache.hadoop.ha.ZKFailoverController: Should fence: NameNode at nn2.example.org/172.16.1.188:8040
2016-09-27 15:03:27,315 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: nn2.example.org/172.16.1.188:8040. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
2016-09-27 15:03:29,315 WARN org.apache.hadoop.ha.FailoverController: Unable to gracefully make NameNode at nn2.example.org/172.16.1.188:8040 standby (unable to connect)
java.net.NoRouteToHostException: No Route to Host from nn1.example.org/172.16.1.187 to nn2.example.org:8040 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host; For more details see: http://wiki.apache.org/hadoop/NoRouteToHost
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:758)
at org.apache.hadoop.ipc.Client.call(Client.java:1479)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy9.transitionToStandby(Unknown Source)
at org.apache.hadoop.ha.protocolPB.HAServiceProtocolClientSideTranslatorPB.transitionToStandby(HAServiceProtocolClientSideTranslatorPB.java:112)
at org.apache.hadoop.ha.FailoverController.tryGracefulFence(FailoverController.java:172)
at org.apache.hadoop.ha.ZKFailoverController.doFence(ZKFailoverController.java:514)
at org.apache.hadoop.ha.ZKFailoverController.fenceOldActive(ZKFailoverController.java:505)
at org.apache.hadoop.ha.ZKFailoverController.access$1100(ZKFailoverController.java:61)
at org.apache.hadoop.ha.ZKFailoverController$ElectorCallbacks.fenceOldActive(ZKFailoverController.java:892)
at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:910)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:809)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:418)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:614)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:712)
at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:375)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1528)
at org.apache.hadoop.ipc.Client.call(Client.java:1451)
... 14 more
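For context: a common way to handle exactly this case (a documented fencing option, though not confirmed as this poster's fix) is to list a fallback fencing method after sshfence, so that fencing can still succeed when the old active is completely unreachable; with QJM the JournalNodes already prevent split-brain writes. A minimal hdfs-site.xml sketch (the private-key path is an assumption):
<property>
  <name>dfs.ha.fencing.methods</name>
  <!-- Try SSH fencing first; if the host is unreachable (e.g. the
       network was unplugged), fall back to a method that always
       succeeds so the standby can still become active. -->
  <value>sshfence
shell(/bin/true)</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hduser/.ssh/id_rsa</value> <!-- assumed key path -->
</property>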

Related

ERROR delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted (Hadoop on Windows 10)

I use Windows 10, and the NodeManager is also not starting correctly. I see the following errors.
The ResourceManager is not connecting and is failing due to:
2021-07-07 11:01:52,473 ERROR delegation.AbstractDelegationTokenSecretManager: ExpiredTokenRemover received java.lang.InterruptedException: sleep interrupted
2021-07-07 11:01:52,493 INFO handler.ContextHandler: Stopped o.e.j.w.WebAppContext@756b58a7{/,null,UNAVAILABLE}{/cluster}
2021-07-07 11:01:52,504 INFO server.AbstractConnector: Stopped ServerConnector@633a2e99{HTTP/1.1,[http/1.1]}{0.0.0.0:8088}
2021-07-07 11:01:52,504 INFO handler.ContextHandler: Stopped o.e.j.s.ServletContextHandler@7b420819{/static,jar:file:/F:/hadoop_new/share/hadoop/yarn/hadoop-yarn-common-3.2.1.jar!/webapps/static,UNAVAILABLE}
2021-07-07 11:01:52,507 INFO handler.ContextHandler: Stopped o.e.j.s.ServletContextHandler@c9d0d6{/logs,file:///F:/hadoop_new/logs/,UNAVAILABLE}
2021-07-07 11:01:52,541 INFO ipc.Server: Stopping server on 8033
2021-07-07 11:01:52,543 INFO ipc.Server: Stopping IPC Server listener on 8033
2021-07-07 11:01:52,544 INFO resourcemanager.ResourceManager: Transitioning to standby state
2021-07-07 11:01:52,544 INFO ipc.Server: Stopping IPC Server Responder
2021-07-07 11:01:52,550 INFO resourcemanager.ResourceManager: Transitioned to standby state
2021-07-07 11:01:52,554 FATAL resourcemanager.ResourceManager: Error starting ResourceManager
org.apache.hadoop.service.ServiceStateException: 5: Access is denied.
and
2021-07-07 11:01:51,625 INFO recovery.RMStateStore: Storing RMDTMasterKey.
2021-07-07 11:01:52,158 INFO store.AbstractFSNodeStore: Created store directory :file:/tmp/hadoop-yarn-Abby/node-attribute
2021-07-07 11:01:52,186 INFO service.AbstractService: Service NodeAttributesManagerImpl failed in state STARTED
5: Access is denied.
at org.apache.hadoop.io.nativeio.NativeIO$Windows.createFileWithMode0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.createFileOutputStreamWithMode(NativeIO.java:595)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:246)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:232)
at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:331)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:320)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.recoverFromStore(AbstractFSNodeStore.java:160)
at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.FileSystemNodeAttributeStore.recover(FileSystemNodeAttributeStore.java:95)
at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.NodeAttributesManagerImpl.initNodeAttributeStore(NodeAttributesManagerImpl.java:140)
at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.NodeAttributesManagerImpl.serviceStart(NodeAttributesManagerImpl.java:123)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:895)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1262)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1303)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1299)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1299)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1350)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1535)
2021-07-07 11:01:52,212 INFO service.AbstractService: Service RMActiveServices failed in state STARTED
org.apache.hadoop.service.ServiceStateException: 5: Access is denied.
at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:203)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:895)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1262)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1303)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1299)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToActive(ResourceManager.java:1299)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1350)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1535)
Caused by: 5: Access is denied.
at org.apache.hadoop.io.nativeio.NativeIO$Windows.createFileWithMode0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.createFileOutputStreamWithMode(NativeIO.java:595)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:246)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:232)
at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:331)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:320)
at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:305)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:987)
at org.apache.hadoop.yarn.nodelabels.store.AbstractFSNodeStore.recoverFromStore(AbstractFSNodeStore.java:160)
at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.FileSystemNodeAttributeStore.recover(FileSystemNodeAttributeStore.java:95)
at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.NodeAttributesManagerImpl.initNodeAttributeStore(NodeAttributesManagerImpl.java:140)
at org.apache.hadoop.yarn.server.resourcemanager.nodelabels.NodeAttributesManagerImpl.serviceStart(NodeAttributesManagerImpl.java:123)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
... 13 more
You have an access-denied error, so you may need to run as a different user. Try starting the services with a user that has more privileges, such as Administrator on Windows.
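A sketch of that suggestion (the install path is taken from the log above; the tmp-directory drive and user name are assumptions read off the trace):
REM Open a Command Prompt via "Run as administrator", then:
cd /d F:\hadoop_new\sbin
start-yarn.cmd

REM Alternatively, grant the current user full control over the
REM node-attribute store directory from the trace (drive assumed):
icacls "F:\tmp\hadoop-yarn-Abby" /grant Abby:(OI)(CI)F /T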

Hbase Bulk Load failing | KosmosFileSystem FileSystem Exception

I just followed the blog below to run a simple bulk load, but it is throwing a KosmosFileSystem FileSystem exception.
http://www.thecloudavenue.com/2013/04/bulk-loading-data-in-hbase.html
Here is the log that was generated after running the following command:
hadoop jar HbaseBulkImport.jar /user/hduser/hbase/input/RowFeeder.csv /user/hduser/hbase/ouput/ NBAFinal2010
15/04/16 12:40:35 INFO zookeeper.ZooKeeper: Client environment:java.library.path=:/usr/hdp/2.2.0.0-2041/hadoop/lib/native/Linux-amd64-64:/usr/hdp/2.2.0.0-2041/hadoop/lib/native
15/04/16 12:40:35 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
15/04/16 12:40:35 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
15/04/16 12:40:35 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
15/04/16 12:40:35 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
15/04/16 12:40:35 INFO zookeeper.ZooKeeper: Client environment:os.version=2.6.32-504.12.2.el6.x86_64
15/04/16 12:40:35 INFO zookeeper.ZooKeeper: Client environment:user.name=hduser
15/04/16 12:40:35 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/hduser
15/04/16 12:40:35 INFO zookeeper.ZooKeeper: Client environment:user.dir=/home/hduser/user/shashi/hbase/bulkLoad
15/04/16 12:40:35 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0x2dd06f21, quorum=localhost:2181, baseZNode=/hbase-unsecure
15/04/16 12:40:35 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x2dd06f21 connecting to ZooKeeper ensemble=localhost:2181
15/04/16 12:40:35 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
15/04/16 12:40:35 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
15/04/16 12:40:35 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x34cb73f98050030, negotiated timeout = 40000
Exception in thread "main" java.io.IOException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:426)
at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:403)
at org.apache.hadoop.hbase.client.ConnectionManager.getConnectionInternal(ConnectionManager.java:281)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:207)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:169)
at Driver.main(Driver.java:48)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:424)
... 11 more
Caused by: java.lang.ExceptionInInitializerError
at org.apache.hadoop.hbase.ClusterId.parseFrom(ClusterId.java:64)
at org.apache.hadoop.hbase.zookeeper.ZKClusterId.readClusterIdZNode(ZKClusterId.java:75)
at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getClusterId(ZooKeeperRegistry.java:106)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.retrieveClusterId(ConnectionManager.java:858)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:662)
... 16 more
Caused by: java.lang.UnsupportedOperationException: Not implemented by the KosmosFileSystem FileSystem implementation
at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:216)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2564)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2574)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:354)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.hbase.util.DynamicClassLoader.<init>(DynamicClassLoader.java:104)
at org.apache.hadoop.hbase.protobuf.ProtobufUtil.<clinit>(ProtobufUtil.java:226)
... 21 more
It looks like it was a jar-related issue. After using the latest jars from the HBase lib directory, the issue got resolved.
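In the same spirit, a sketch of one way to make sure the job runs with the cluster's current HBase jars instead of stale copies bundled with the job (command reused verbatim from the question):
# Put the cluster's HBase jars on the job classpath before launching
export HADOOP_CLASSPATH=$(hbase classpath)
hadoop jar HbaseBulkImport.jar /user/hduser/hbase/input/RowFeeder.csv /user/hduser/hbase/ouput/ NBAFinal2010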

YARN: what subsystem is connecting to port 44874?

I am trying to run my MR job on YARN. There is this error in one of the userlogs on node3:
2014-10-10 00:57:16,965 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
2014-10-10 00:57:16,965 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1412895371072_0001, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@69d5af30)
2014-10-10 00:57:17,330 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2014-10-10 00:57:18,547 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: node03/127.0.1.1:44874. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-10-10 00:57:19,548 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: node03/127.0.1.1:44874. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
...
2014-10-10 00:57:27,558 INFO [main] org.apache.hadoop.ipc.Client: Retrying connect to server: node03/127.0.1.1:44874. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2014-10-10 00:57:27,562 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.net.ConnectException: Call From node03/127.0.1.1 to node03:44874 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
at org.apache.hadoop.ipc.Client.call(Client.java:1415)
at org.apache.hadoop.ipc.Client.call(Client.java:1364)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:231)
at com.sun.proxy.$Proxy9.getTask(Unknown Source)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:137)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
at org.apache.hadoop.ipc.Client.call(Client.java:1382)
... 4 more
2014-10-10 00:57:27,564 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping MapTask metrics system...
2014-10-10 00:57:27,566 INFO [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: MapTask metrics system stopped.
I have the same config on all nodes. I can't find port 44874 specified anywhere in my configuration. What is this error actually telling me?
If by "half-random" you mean completely random and by "can't really be configured" you mean undocumented and completely f'ed when used in a hardened environmment - you are correct.
The problem is that map-reduce jobs are using dynamic ports. Horton of course doesn't document why the random port is being created.
The answer thus far: disable firewall or allow high ranges (32768-65535) to on each data node. I am still looking for why this situation occurs.
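For the firewall route, a sketch for iptables-based setups (the source subnet is an assumption; substitute your cluster network):
# Allow the ephemeral range YARN tasks bind to, from cluster hosts only
iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 32768:65535 -j ACCEPT
service iptables save   # persist on RHEL/CentOS 6-style systems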
Whenever I see a problem with Hadoop's ports, I google the port number and see if it is a default port for something. In your case it doesn't seem to be.
As far as I can tell, Hadoop uses these kinds of half-random ports internally for some things, and they can't really be configured. When there has been a problem with such ports, for me it has always been an indication of some other (detectable) problem.
I suggest you go through all your logs again to find other problems. Also check the NameNode statuses (web interface) and make sure all connections are working.

Hbase managed zookeeper suddenly trying to connect to localhost instead of zookeeper quorum

I was running some tests with table mappers and reducers on large-scale problems. After a certain point my reducers started failing when the job was 80% done. From what I can tell from the syslogs, the problem is that one of my ZooKeeper clients is attempting to connect to localhost as opposed to the other ZooKeepers in the quorum.
Oddly, it seems to do just fine connecting to the other nodes while mapping is going on; it's the reducing that it has a problem with. Here are selected portions of the syslog which might be relevant to figuring out what's going on:
2014-06-27 09:44:01,599 INFO [main] org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=hdev02:5181,hdev01:5181,hdev03:5181 sessionTimeout=10000 watcher=hconnection-0x4aee260b, quorum=hdev02:5181,hdev01:5181,hdev03:5181, baseZNode=/hbase
2014-06-27 09:44:01,612 INFO [main] org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x4aee260b connecting to ZooKeeper ensemble=hdev02:5181,hdev01:5181,hdev03:5181
2014-06-27 09:44:01,614 INFO [main-SendThread(hdev02:5181)] org.apache.zookeeper.ClientCnxn: Opening socket connection to server hdev02/172.17.43.36:5181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
2014-06-27 09:44:01,615 INFO [main-SendThread(hdev02:5181)] org.apache.zookeeper.ClientCnxn: Socket connection established to hdev02/172.17.43.36:5181, initiating session
2014-06-27 09:44:01,617 INFO [main-SendThread(hdev02:5181)] org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2014-06-27 09:44:01,723 WARN [main] org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=hdev02:5181,hdev01:5181,hdev03:5181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
2014-06-27 09:44:01,723 INFO [main] org.apache.hadoop.hbase.util.RetryCounter: Sleeping
***
org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 1 on-disk map-outputs
2014-06-27 09:55:12,012 INFO [main] org.apache.hadoop.mapred.Merger: Merging 1 sorted segments
2014-06-27 09:55:12,013 INFO [main] org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 33206049 bytes
2014-06-27 09:55:12,208 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: Merged 1 segments, 33206079 bytes to disk to satisfy reduce memory limit
2014-06-27 09:55:12,209 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: Merging 2 files, 265119413 bytes from disk
2014-06-27 09:55:12,209 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
2014-06-27 09:55:12,210 INFO [main] org.apache.hadoop.mapred.Merger: Merging 2 sorted segments
2014-06-27 09:55:12,212 INFO [main] org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 265119345 bytes
2014-06-27 09:55:12,279 INFO [main] org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0x65afdbbb, quorum=localhost:2181, baseZNode=/hbase
2014-06-27 09:55:12,281 INFO [main] org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x65afdbbb connecting to ZooKeeper ensemble=localhost:2181
2014-06-27 09:55:12,282 INFO [main-SendThread(localhost.localdomain:2181)] org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost.localdomain/127.0.0.1:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
2014-06-27 09:55:12,283 WARN [main-SendThread(localhost.localdomain:2181)] org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
2014-06-27 09:55:12,384 WARN [main] org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=localhost:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
2014-06-27 09:55:12,384 INFO [main] org.apache.hadoop.hbase.util.RetryCounter: Sleeping 1000ms before retry #0...
2014-06-27 09:55:13,385 INFO [main-SendThread(localhost.localdomain:2181)] org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost.localdomain/127.0.0.1:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
2014-06-27 09:55:13,385 WARN [main-SendThread(localhost.localdomain:2181)] org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing
***
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=localhost:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
2014-06-27 09:55:13,486 ERROR [main] org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 1 attempts
2014-06-27 09:55:13,486 WARN [main] org.apache.hadoop.hbase.zookeeper.ZKUtil: hconnection-0x65afdbbb, quorum=localhost:2181, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
I'm pretty sure it's configured correctly; here is the relevant portion of my hbase-site.xml:
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>5181</value>
<description>Property from ZooKeeper's config zoo.cfg.
The port at which the clients will connect.
</description>
</property>
<property>
<name>zookeeper.session.timeout</name>
<value>10000</value>
<description></description>
</property>
<property>
<name>hbase.client.retries.number</name>
<value>10</value>
<description></description>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>hdev01,hdev02,hdev03</value>
<description></description>
</property>
As far as I can tell, hdev03 is the only server that has any problem with this. Running netstat on all relevant ports doesn't show me anything strange.
I've had the same problem when running HBase through Spark on YARN. Everything was fine until it suddenly started trying to connect to localhost instead of the quorum. Setting the port and quorum programmatically before the HBase call fixed the issue:
conf.set("hbase.zookeeper.quorum","my.server")
conf.set("hbase.zookeeper.property.clientPort","5181")
I'm using MapR, which has an "unusual" ZooKeeper port (5181).
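A slightly fuller sketch of that workaround in Java (quorum hosts borrowed from the question's hbase-site.xml; the class itself is illustrative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class QuorumPinning {
    public static Configuration pinnedConf() {
        // Pin the quorum explicitly instead of relying on whichever
        // hbase-site.xml the task loads (which may fall back to localhost).
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "hdev01,hdev02,hdev03");
        conf.set("hbase.zookeeper.property.clientPort", "5181");
        return conf;
    }
}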
Hard to say what is happening with the information given. I have found the Hadoop stack (HBase especially) to be quite hostile to even the slightest bit of misconfiguration in DNS or the hosts file.
As the quorum in your hbase-site.xml looks good, I'd start checking network/hostname-resolution configuration:
Has the node name slipped into the localhost entry in /etc/hosts on hdev03? (See the illustrative hosts file after this list.)
Is there an entry for the host itself in hdev03's /etc/hosts (there should be)?
Has reverse DNS been correctly configured, in case you are using DNS for name resolution instead of the hosts file?
These are just a few pointers in the direction I'd look with this kind of issue. Hope it helps!
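For illustration, a hosts-file layout that avoids the first pitfall (hdev03's address is an assumption; only hdev02's address appears in the logs above):
# /etc/hosts on hdev03 (sketch)
127.0.0.1     localhost localhost.localdomain   # no hdev03 alias here
172.17.43.37  hdev03                            # assumed address for hdev03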
Add '--driver-class-path ~/hbase-1.1.2/conf' to the spark-submit command, so that the task can find the configured ZooKeeper servers instead of 127.0.0.1.
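For example (a sketch; the application jar and main class are illustrative):
# Ship the HBase client config to the driver so it reads the real quorum
spark-submit --driver-class-path ~/hbase-1.1.2/conf \
  --class com.example.MyHBaseJob myjob.jar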

"getMaster attempt 1 of 1 failed; no more retrying. com.google.protobuf.ServiceException: java.io.IOException: Broken pipe" when connecting

I'm trying to connect to HBase installed on the local system (using Hortonworks 1.1.1.16) from a small Java program, which executes the following call:
HBaseAdmin.checkHBaseAvailable(conf);
It is worth saying that there is no problem at all when connecting to HBase from the command line using the hbase command.
The content of the hosts file is the following (where example.com stands for the actual host name):
127.0.0.1 localhost example.com
HBase is configured to work in standalone mode:
hbase.cluster.distributed=false
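For completeness, a minimal program along the lines of the stack trace below (the class name TestHBase appears in the trace; the rest is a sketch):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class TestHBase {
    public static void main(String[] args) throws Exception {
        // Loads hbase-site.xml from the classpath, then pings the master.
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin.checkHBaseAvailable(conf);
        System.out.println("HBase is available");
    }
}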
When executing the program, the following exception is thrown:
13/05/13 15:18:29 INFO zookeeper.ZooKeeper: Client environment:zookeeper.version=3.4.5-1392090, built on 09/30/2012 17:52 GMT
13/05/13 15:18:29 INFO zookeeper.ZooKeeper: Client environment:host.name=localhost
13/05/13 15:18:29 INFO zookeeper.ZooKeeper: Client environment:java.version=1.7.0_19
13/05/13 15:18:29 INFO zookeeper.ZooKeeper: Client environment:java.vendor=Oracle Corporation
13/05/13 15:18:29 INFO zookeeper.ZooKeeper: Client environment:java.home=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64/jre
13/05/13 15:18:29 INFO zookeeper.ZooKeeper: Client environment:java.class.path=[...]
13/05/13 15:18:29 INFO zookeeper.ZooKeeper: Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
13/05/13 15:18:29 INFO zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/tmp
13/05/13 15:18:29 INFO zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
13/05/13 15:18:29 INFO zookeeper.ZooKeeper: Client environment:os.name=Linux
13/05/13 15:18:29 INFO zookeeper.ZooKeeper: Client environment:os.arch=amd64
13/05/13 15:18:29 INFO zookeeper.ZooKeeper: Client environment:os.version=2.6.32-358.2.1.el6.x86_64
13/05/13 15:18:29 INFO zookeeper.ZooKeeper: Client environment:user.name=root
13/05/13 15:18:29 INFO zookeeper.ZooKeeper: Client environment:user.home=/root
13/05/13 15:18:29 INFO zookeeper.ZooKeeper: Client environment:user.dir=/root/git/project
13/05/13 15:18:29 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=example.com:2181 sessionTimeout=60000 watcher=hconnection-0x678e4593
13/05/13 15:18:29 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is hconnection-0x678e4593
13/05/13 15:18:29 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
13/05/13 15:18:29 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
13/05/13 15:18:29 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x13e9d6851af0046, negotiated timeout = 40000
13/05/13 15:18:29 INFO client.HConnectionManager$HConnectionImplementation: ClusterId is cccadf06-f6bf-492e-8a39-e8beac521ce6
13/05/13 15:18:29 INFO client.HConnectionManager$HConnectionImplementation: getMaster attempt 1 of 1 failed; no more retrying.
com.google.protobuf.ServiceException: java.io.IOException: Broken pipe
at org.apache.hadoop.hbase.ipc.ProtobufRpcClientEngine$Invoker.invoke(ProtobufRpcClientEngine.java:149)
at com.sun.proxy.$Proxy5.isMasterRunning(Unknown Source)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.createMasterInterface(HConnectionManager.java:732)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.createMasterWithRetries(HConnectionManager.java:764)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveMasterProtocol(HConnectionManager.java:1724)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveMasterMonitor(HConnectionManager.java:1757)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.isMasterRunning(HConnectionManager.java:837)
at org.apache.hadoop.hbase.client.HBaseAdmin.checkHBaseAvailable(HBaseAdmin.java:2010)
at TestHBase.main(TestHBase.java:37)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.writeConnectionHeader(HBaseClient.java:896)
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:847)
at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1414)
at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1299)
at org.apache.hadoop.hbase.ipc.ProtobufRpcClientEngine$Invoker.invoke(ProtobufRpcClientEngine.java:131)
... 8 more
13/05/13 15:18:29 INFO client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x13e9d6851af0046
13/05/13 15:18:29 INFO zookeeper.ZooKeeper: Session: 0x13e9d6851af0046 closed
13/05/13 15:18:29 INFO zookeeper.ClientCnxn: EventThread shut down
org.apache.hadoop.hbase.exceptions.MasterNotRunningException: com.google.protobuf.ServiceException: java.io.IOException: Broken pipe
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.createMasterWithRetries(HConnectionManager.java:793)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveMasterProtocol(HConnectionManager.java:1724)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getKeepAliveMasterMonitor(HConnectionManager.java:1757)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.isMasterRunning(HConnectionManager.java:837)
at org.apache.hadoop.hbase.client.HBaseAdmin.checkHBaseAvailable(HBaseAdmin.java:2010)
at TestHBase.main(TestHBase.java:37)
Caused by: com.google.protobuf.ServiceException: java.io.IOException: Broken pipe
at org.apache.hadoop.hbase.ipc.ProtobufRpcClientEngine$Invoker.invoke(ProtobufRpcClientEngine.java:149)
at com.sun.proxy.$Proxy5.isMasterRunning(Unknown Source)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.createMasterInterface(HConnectionManager.java:732)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.createMasterWithRetries(HConnectionManager.java:764)
... 5 more
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:94)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450)
at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:55)
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
at java.io.DataOutputStream.flush(DataOutputStream.java:123)
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.writeConnectionHeader(HBaseClient.java:896)
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:847)
at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1414)
at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:1299)
at org.apache.hadoop.hbase.ipc.ProtobufRpcClientEngine$Invoker.invoke(ProtobufRpcClientEngine.java:131)
... 8 more
This trace provides some evidence of what may actually be happening. It seems that the connection to ZooKeeper is established, but something fails when it tries to access the master.
Though I've spent hours trying to find a solution on Google, I haven't seen this exact exception. In particular, it differs in two ways from most of those reported elsewhere:
Everybody else seems to get the error getMaster attempt 0 of 1 failed rather than getMaster attempt 1 of 1 failed. I don't know whether this means anything at all, but I find it somewhat odd.
I can't find other people getting the Broken pipe error.
By the way, the master is actually running, as far as I can see in the Hortonworks Management Console.
When looking at the most recent logs, this is the output:
2013-05-13 15:30:07,192 WARN org.apache.hadoop.ipc.HBaseServer: Incorrect header or version mismatch from 127.0.0.1:40788 got version 0 expected version 3
As it is a warning rather than an error, I don't know whether it has something to do with the actual problem. The port varies in each execution.
Finally we found the problem and solved it. It turned out to be a dependency problem. We were using hbase-0.95.0 and hbase-client-0.95.0. Using hbase-0.94.7 or hbase-0.94.9 seemed to work.
Yet some problems occurred under certain circumstances even with those versions of the HBase library. In particular, some problems arose when running it within an application server (JBoss AS7). In the end, all problems seemed to be solved by removing the dependency hbase-client-0.95.0 and replacing it with hadoop-core-1.1.2, as some classes not contained in the HBase libraries were required.
Regards.
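For reference, the dependency swap described above would look roughly like this in a Maven POM (coordinates assumed from the usual artifact names for those versions):
<!-- removed: org.apache.hbase:hbase-client:0.95.0 -->
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase</artifactId>
  <version>0.94.9</version>
</dependency>
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-core</artifactId>
  <version>1.1.2</version>
</dependency>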
I'd recommend first checking that your HBase Master / RegionServer ports are really bound, with netstat -n -a. I once had a situation where the HBase Master IPC was bound only to the external IP (this was Cloudera CDH), so it was not reachable through 127.0.0.1. That looks like the most probable case for you; the hbase shell would still work in this situation.
Another possible reason could be a previous cluster crash that left some HDFS data corrupted. In that case HBase does not actually start, as it waits for HDFS to exit safe mode. This does not look like your case, but if it is, you can manually force HDFS to exit safe mode from the console and then run fsck for Hadoop and a similar procedure for HBase.
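A sketch of those recovery steps with the standard command-line tools:
hdfs dfsadmin -safemode leave   # force HDFS out of safe mode
hdfs fsck /                     # check HDFS for corrupted files/blocks
hbase hbck                      # check HBase consistency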
