Unable to close file because the last block does not have enough number of replicas.​ - hadoop

I got an error like this. The health test result for HDFS_CANARY_HEALTH has become bad: Canary test failed to write file in directory /tmp/.cloudera_health_monitoring_canary_files.
2019-07-14 00:02:13,312 INFO hive.metastore: Trying to connect to metastore with URI xxxx:abc
2019-07-14 00:02:13,322 INFO hive.metastore: Opened a connection to metastore, current connections: 1
2019-07-14 00:02:13,322 INFO hive.metastore: Connected to metastore.
2019-07-14 00:02:14,255 INFO hive.metastore: Closed a connection to metastore, current connections: 0​
2019-07-14 00:02:21,444 INFO com.cloudera.cmon.tstore.leveldb.LDBTimeSeriesRollupManager: Finished rollup: duration=PT21.280S, numStreamsChecked=607071, numStreamsRolle
dUp=199173
2019-07-14 00:03:04,172 INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x2659d409 connecting to ZooKeeper ensemble=___
2019-07-14 00:03:04,182 INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Process identifier=ReplicationAdmin connecting to ZooKeeper ensemble=____
2019-07-14 00:03:04,198 INFO org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x16b95f7c324bcf3
2019-07-14 00:03:08,346 INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x2fc2fce5 connecting to ZooKeeper ensemble=___
2019-07-14 00:03:08,354 INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Process identifier=ReplicationAdmin connecting to ZooKeeper ensemble=___
2019-07-14 00:03:08,365 INFO org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x26b8ea7e7c6cfde
2019-07-14 00:03:11,897 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/.cloudera_health_monitoring_canary_files/.canary_file_2019_07_14-00_03_03 retrying...
2019-07-14 00:03:18,388 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete /tmp/.cloudera_health_monitoring_canary_files/.canary_file_2019_07_14-00_03_03 retrying...
2019-07-14 00:03:18,562 ERROR com.cloudera.cmon.firehose.polling.hdfs.HdfsCanary: com.cloudera.cmon.firehose.polling.hdfs.HdfsCanary#5b4e2d83 for hdfs://hdfs-cluster_name Failed to write to /tmp/.cloudera_health_monitoring_canary_files/.canary_file_2019_07_14-00_03_03. Error: java.io.IOException: Unable to close file because the last block
BP-972893351-10.14.208.100-1438206436439:blk_4349006744_3275530539 does not have enough number of replicas.​
My question is which "file" it referring to in ERROR: Unable to close file because the last block does not have enough number of replicas.​

Related

Connection Refused when creating a two nodes cluster in Apache NiFi

I'm facing a problem regarding a refused connection on the cluster node protocol port.
I'm using the following configs to create the two nodes cluster:
For the First node manager :
####################
# State Management #
####################
nifi.state.management.configuration.file=./conf/state-management.xml
# The ID of the local state provider
nifi.state.management.provider.local=local-provider
# The ID of the cluster-wide state provider. This will be ignored if NiFi is not clustered but must be populated if running in a cluster.
nifi.state.management.provider.cluster=zk-provider
# Specifies whether or not this instance of NiFi should run an embedded ZooKeeper server
nifi.state.management.embedded.zookeeper.start=true
# Properties file that provides the ZooKeeper properties to use if <nifi.state.management.embedded.zookeeper.start> is set to true
nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties
# web properties #
nifi.web.war.directory=./lib
nifi.web.http.host=10.129.140.22
nifi.web.http.port=3000
nifi.web.http.network.interface.default=
nifi.web.https.host=
nifi.web.https.port=
nifi.web.https.network.interface.default=
nifi.web.jetty.working.directory=./work/jetty
nifi.web.jetty.threads=200
nifi.web.max.header.size=16 KB
nifi.web.proxy.context.path=
nifi.web.proxy.host=
# cluster node properties (only configure for cluster nodes) #
nifi.cluster.is.node=true
nifi.cluster.node.address=
nifi.cluster.node.protocol.port=10000
nifi.cluster.node.protocol.threads=10
nifi.cluster.node.protocol.max.threads=50
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.node.max.concurrent.requests=100
nifi.cluster.firewall.file=
nifi.cluster.flow.election.max.wait.time=5 mins
nifi.cluster.flow.election.max.candidates=
# cluster load balancing properties #
nifi.cluster.load.balance.host=
nifi.cluster.load.balance.port=6342
nifi.cluster.load.balance.connections.per.node=4
nifi.cluster.load.balance.max.thread.count=8
nifi.cluster.load.balance.comms.timeout=30 sec
# zookeeper properties, used for cluster management #
nifi.zookeeper.connect.string=localhost:2181
nifi.zookeeper.connect.timeout=3 secs
nifi.zookeeper.session.timeout=3 secs
nifi.zookeeper.root.node=/nifi
For the second node slave:
####################
# State Management #
####################
nifi.state.management.configuration.file=./conf/state-management.xml
# The ID of the local state provider
nifi.state.management.provider.local=local-provider
# The ID of the cluster-wide state provider. This will be ignored if NiFi is not clustered but must be populated if running in a cluster.
nifi.state.management.provider.cluster=zk-provider
# Specifies whether or not this instance of NiFi should run an embedded ZooKeeper server
nifi.state.management.embedded.zookeeper.start=false
# Properties file that provides the ZooKeeper properties to use if <nifi.state.management.embedded.zookeeper.start> is set to true
nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties
# web properties #
nifi.web.war.directory=./lib
nifi.web.http.host=
nifi.web.http.port=9021
nifi.web.http.network.interface.default=
nifi.web.https.host=
nifi.web.https.port=
nifi.web.https.network.interface.default=
nifi.web.jetty.working.directory=./work/jetty
nifi.web.jetty.threads=200
nifi.web.max.header.size=16 KB
nifi.web.proxy.context.path=
nifi.web.proxy.host=
# cluster node properties (only configure for cluster nodes) #
nifi.cluster.is.node=true
nifi.cluster.node.address=
nifi.cluster.node.protocol.port=10001
nifi.cluster.node.protocol.threads=10
nifi.cluster.node.protocol.max.threads=50
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.node.max.concurrent.requests=100
nifi.cluster.firewall.file=
nifi.cluster.flow.election.max.wait.time=5 mins
nifi.cluster.flow.election.max.candidates=
# cluster load balancing properties #
nifi.cluster.load.balance.host=10.129.140.22
nifi.cluster.load.balance.port=6343
nifi.cluster.load.balance.connections.per.node=4
nifi.cluster.load.balance.max.thread.count=8
nifi.cluster.load.balance.comms.timeout=30 sec
# zookeeper properties, used for cluster management #
nifi.zookeeper.connect.string=10.129.140.22:2181
nifi.zookeeper.connect.timeout=3 secs
nifi.zookeeper.session.timeout=3 secs
nifi.zookeeper.root.node=/nifi
The logs fils shows the following :
For the slave
2019-05-23 10:37:07,384 INFO [main] o.a.n.c.repository.FileSystemRepository Initializing FileSystemRepository with 'Always Sync' set to false
2019-05-23 10:37:07,541 INFO [main] o.apache.nifi.controller.FlowController Not enabling RAW Socket Site-to-Site functionality because nifi.remote.input.socket.port is not set
2019-05-23 10:37:07,546 INFO [main] o.apache.nifi.controller.FlowController Checking if there is already a Cluster Coordinator Elected...
2019-05-23 10:37:07,591 INFO [main] o.a.c.f.imps.CuratorFrameworkImpl Starting
2019-05-23 10:37:07,658 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: CONNECTED
2019-05-23 10:37:07,693 INFO [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl backgroundOperationsLoop exiting
2019-05-23 10:37:07,697 INFO [main] o.apache.nifi.controller.FlowController The Election for Cluster Coordinator has already begun (Leader is localhost:10000). Will not register to be elected for this role until after connecting to the cluster and inheriting the cluster's flow.
2019-05-23 10:37:07,699 INFO [main] o.a.n.c.l.e.CuratorLeaderElectionManager CuratorLeaderElectionManager[stopped=true] Registered new Leader Selector for role Cluster Coordinator; this node is a silent observer in the election.
2019-05-23 10:37:07,699 INFO [main] o.a.c.f.imps.CuratorFrameworkImpl Starting
2019-05-23 10:37:07,703 INFO [main] o.a.n.c.l.e.CuratorLeaderElectionManager CuratorLeaderElectionManager[stopped=false] Registered new Leader Selector for role Cluster Coordinator; this node is a silent observer in the election.
2019-05-23 10:37:07,703 INFO [main] o.a.n.c.l.e.CuratorLeaderElectionManager CuratorLeaderElectionManager[stopped=false] started
2019-05-23 10:37:07,703 INFO [main] o.a.n.c.c.h.AbstractHeartbeatMonitor Heartbeat Monitor started
2019-05-23 10:37:07,706 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: CONNECTED
2019-05-23 10:37:09,587 INFO [main] o.e.jetty.server.handler.ContextHandler Started o.e.j.w.WebAppContext#1a6a4595{nifi-api,/nifi-api,file:///home/superman/nifi-1.9.2/work/jetty/nifi-web-api-1.9.2.war/webapp/,AVAILABLE}{./work/nar/framework/nifi-framework-nar-1.9.2.nar-unpacked/NAR-INF/bundled-dependencies/nifi-web-api-1.9.2.war}
2019-05-23 10:37:09,850 INFO [main] o.e.j.a.AnnotationConfiguration Scanning elapsed time=77ms
2019-05-23 10:37:09,852 INFO [main] o.e.j.s.h.C._nifi_content_viewer No Spring WebApplicationInitializer types detected on classpath
2019-05-23 10:37:09,873 INFO [main] o.e.jetty.server.handler.ContextHandler Started o.e.j.w.WebAppContext#4b1b2255{nifi-content-viewer,/nifi-content-viewer,file:///home/superman/nifi-1.9.2/work/jetty/nifi-web-content-viewer-1.9.2.war/webapp/,AVAILABLE}{./work/nar/framework/nifi-framework-nar-1.9.2.nar-unpacked/NAR-INF/bundled-dependencies/nifi-web-content-viewer-1.9.2.war}
2019-05-23 10:37:09,895 INFO [main] o.e.j.a.AnnotationConfiguration Scanning elapsed time=6ms
2019-05-23 10:37:09,896 WARN [main] o.e.j.webapp.StandardDescriptorProcessor Duplicate mapping from / to default
2019-05-23 10:37:09,915 INFO [main] o.e.j.s.h.ContextHandler._nifi_docs No Spring WebApplicationInitializer types detected on classpath
2019-05-23 10:37:09,917 INFO [main] o.e.jetty.server.handler.ContextHandler Started o.e.j.w.WebAppContext#4965454c{nifi-docs,/nifi-docs,file:///home/superman/nifi-1.9.2/work/jetty/nifi-web-docs-1.9.2.war/webapp/,AVAILABLE}{./work/nar/framework/nifi-framework-nar-1.9.2.nar-unpacked/NAR-INF/bundled-dependencies/nifi-web-docs-1.9.2.war}
2019-05-23 10:37:09,936 INFO [main] o.e.j.a.AnnotationConfiguration Scanning elapsed time=8ms
2019-05-23 10:37:09,955 INFO [main] o.e.j.server.handler.ContextHandler._ No Spring WebApplicationInitializer types detected on classpath
2019-05-23 10:37:09,957 INFO [main] o.e.jetty.server.handler.ContextHandler Started o.e.j.w.WebAppContext#1e4a4ed5{nifi-error,/,file:///home/superman/nifi-1.9.2/work/jetty/nifi-web-error-1.9.2.war/webapp/,AVAILABLE}{./work/nar/framework/nifi-framework-nar-1.9.2.nar-unpacked/NAR-INF/bundled-dependencies/nifi-web-error-1.9.2.war}
2019-05-23 10:37:09,967 INFO [main] o.eclipse.jetty.server.AbstractConnector Started ServerConnector#4518bffd{HTTP/1.1,[http/1.1]}{0.0.0.0:9021}
2019-05-23 10:37:09,967 INFO [main] org.eclipse.jetty.server.Server Started #28769ms
2019-05-23 10:37:09,978 INFO [main] org.apache.nifi.web.server.JettyServer Loading Flow...
2019-05-23 10:37:09,982 INFO [main] org.apache.nifi.io.socket.SocketListener Now listening for connections from nodes on port 10001
2019-05-23 10:37:10,026 INFO [main] o.apache.nifi.controller.FlowController Successfully synchronized controller with proposed flow
2019-05-23 10:37:10,071 INFO [main] o.a.nifi.controller.StandardFlowService Connecting Node: localhost:9021
2019-05-23 10:37:10,073 INFO [main] o.a.n.c.c.n.LeaderElectionNodeProtocolSender Determined that Cluster Coordinator is located at localhost:10000; will use this address for sending heartbeat messages
2019-05-23 10:37:10,074 WARN [main] o.a.nifi.controller.StandardFlowService Failed to connect to cluster due to: org.apache.nifi.cluster.protocol.ProtocolException: Failed to create socket to localhost:10000 due to: java.net.ConnectException: Connection refused (Connection refused)
2019-05-23 10:37:12,715 WARN [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Failed to determine which node is elected active Cluster Coordinator: ZooKeeper reports the address as localhost:10000, but there is no node with this address. Attempted to determine the node's information but failed to retrieve its information due to org.apache.nifi.cluster.protocol.ProtocolException: Failed to create socket due to: java.net.ConnectException: Connection refused (Connection refused)
2019-05-23 10:37:12,720 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Event Reported for localhost:9021 -- Received heartbeat from node previously disconnected due to Has Not Yet Connected to Cluster. Issuing reconnection request.
2019-05-23 10:37:12,721 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Event Reported for localhost:9021 -- Requesting that node connect to cluster
2019-05-23 10:37:12,721 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Status of localhost:9021 changed from NodeConnectionStatus[nodeId=localhost:9021, state=DISCONNECTED, Disconnect Code=Has Not Yet Connected to Cluster, Disconnect Reason=Has Not Yet Connected to Cluster, updateId=1] to NodeConnectionStatus[nodeId=localhost:9021, state=CONNECTING, updateId=3]
2019-05-23 10:37:15,075 INFO [main] o.a.n.c.c.n.LeaderElectionNodeProtocolSender Determined that Cluster Coordinator is located at localhost:10000; will use this address for sending heartbeat messages
2019-05-23 10:37:15,076 WARN [main] o.a.nifi.controller.StandardFlowService Failed to connect to cluster due to: org.apache.nifi.cluster.protocol.ProtocolException: Failed to create socket to localhost:10000 due to: java.net.ConnectException: Connection refused (Connection refused)
For the manager
2019-05-23 10:36:59,752 INFO [main] o.a.zookeeper.server.ZooKeeperServer Server environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
2019-05-23 10:36:59,752 INFO [main] o.a.zookeeper.server.ZooKeeperServer Server environment:java.io.tmpdir=/tmp
2019-05-23 10:36:59,752 INFO [main] o.a.zookeeper.server.ZooKeeperServer Server environment:java.compiler=<NA>
2019-05-23 10:36:59,752 INFO [main] o.a.zookeeper.server.ZooKeeperServer Server environment:os.name=Linux
2019-05-23 10:36:59,752 INFO [main] o.a.zookeeper.server.ZooKeeperServer Server environment:os.arch=amd64
2019-05-23 10:36:59,753 INFO [main] o.a.zookeeper.server.ZooKeeperServer Server environment:os.version=4.15.0-20-generic
2019-05-23 10:36:59,753 INFO [main] o.a.zookeeper.server.ZooKeeperServer Server environment:user.name=root
2019-05-23 10:36:59,753 INFO [main] o.a.zookeeper.server.ZooKeeperServer Server environment:user.home=/root
2019-05-23 10:36:59,753 INFO [main] o.a.zookeeper.server.ZooKeeperServer Server environment:user.dir=/home/superman/nifi-1.9.2
2019-05-23 10:36:59,753 INFO [main] o.a.zookeeper.server.ZooKeeperServer tickTime set to 2000
2019-05-23 10:36:59,754 INFO [main] o.a.zookeeper.server.ZooKeeperServer minSessionTimeout set to -1
2019-05-23 10:36:59,754 INFO [main] o.a.zookeeper.server.ZooKeeperServer maxSessionTimeout set to -1
2019-05-23 10:36:59,855 INFO [main] o.apache.nifi.controller.FlowController Checking if there is already a Cluster Coordinator Elected...
2019-05-23 10:36:59,903 INFO [main] o.a.c.f.imps.CuratorFrameworkImpl Starting
2019-05-23 10:36:59,950 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] o.a.zookeeper.server.ZooKeeperServer Client attempting to establish new session at /127.0.0.1:40388
2019-05-23 10:36:59,950 INFO [SyncThread:0] o.a.z.server.persistence.FileTxnLog Creating new log file: log.3c
2019-05-23 10:36:59,963 INFO [SyncThread:0] o.a.zookeeper.server.ZooKeeperServer Established session 0x16ae443f4130000 with negotiated timeout 4000 for client /127.0.0.1:40388
2019-05-23 10:36:59,975 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: CONNECTED
2019-05-23 10:36:59,998 INFO [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl backgroundOperationsLoop exiting
2019-05-23 10:37:00,003 INFO [main] o.apache.nifi.controller.FlowController The Election for Cluster Coordinator has already begun (Leader is localhost:10001). Will not register to be elected for this role until after connecting to the cluster and inheriting the cluster's flow.
2019-05-23 10:37:00,005 INFO [main] o.a.n.c.l.e.CuratorLeaderElectionManager CuratorLeaderElectionManager[stopped=true] Registered new Leader Selector for role Cluster Coordinator; this node is a silent observer in the election.
2019-05-23 10:37:00,005 INFO [main] o.a.c.f.imps.CuratorFrameworkImpl Starting
2019-05-23 10:37:00,017 INFO [main] o.a.n.c.l.e.CuratorLeaderElectionManager CuratorLeaderElectionManager[stopped=false] Registered new Leader Selector for role Cluster Coordinator; this node is a silent observer in the election.
2019-05-23 10:37:00,017 INFO [main] o.a.n.c.l.e.CuratorLeaderElectionManager CuratorLeaderElectionManager[stopped=false] started
2019-05-23 10:37:00,017 INFO [main] o.a.n.c.c.h.AbstractHeartbeatMonitor Heartbeat Monitor started
2019-05-23 10:37:00,019 INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] o.a.zookeeper.server.ZooKeeperServer Client attempting to establish new session at /127.0.0.1:40390
2019-05-23 10:37:00,020 INFO [SyncThread:0] o.a.zookeeper.server.ZooKeeperServer Established session 0x16ae443f4130001 with negotiated timeout 4000 for client /127.0.0.1:40390
2019-05-23 10:37:00,020 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: CONNECTED
2019-05-23 10:37:02,022 INFO [main] o.e.jetty.server.handler.ContextHandler Started o.e.j.w.WebAppContext#1a05ff8e{nifi-api,/nifi-api,file:///home/superman/nifi-1.9.2/work/jetty/nifi-web-api-1.9.2.war/webapp/,AVAILABLE}{./work/nar/framework/nifi-framework-nar-1.9.2.nar-unpacked/NAR-INF/bundled-dependencies/nifi-web-api-1.9.2.war}
2019-05-23 10:37:02,373 INFO [main] o.e.j.a.AnnotationConfiguration Scanning elapsed time=165ms
2019-05-23 10:37:02,375 INFO [main] o.e.j.s.h.C._nifi_content_viewer No Spring WebApplicationInitializer types detected on classpath
2019-05-23 10:37:02,401 INFO [main] o.e.jetty.server.handler.ContextHandler Started o.e.j.w.WebAppContext#251e2f4a{nifi-content-viewer,/nifi-content-viewer,file:///home/superman/nifi-1.9.2/work/jetty/nifi-web-content-viewer-1.9.2.war/webapp/,AVAILABLE}{./work/nar/framework/nifi-framework-nar-1.9.2.nar-unpacked/NAR-INF/bundled-dependencies/nifi-web-content-viewer-1.9.2.war}
2019-05-23 10:37:02,419 INFO [main] o.e.j.a.AnnotationConfiguration Scanning elapsed time=6ms
2019-05-23 10:37:02,420 WARN [main] o.e.j.webapp.StandardDescriptorProcessor Duplicate mapping from / to default
2019-05-23 10:37:02,421 INFO [main] o.e.j.s.h.ContextHandler._nifi_docs No Spring WebApplicationInitializer types detected on classpath
2019-05-23 10:37:02,441 INFO [main] o.e.jetty.server.handler.ContextHandler Started o.e.j.w.WebAppContext#1abea1ed{nifi-docs,/nifi-docs,file:///home/superman/nifi-1.9.2/work/jetty/nifi-web-docs-1.9.2.war/webapp/,AVAILABLE}{./work/nar/framework/nifi-framework-nar-1.9.2.nar-unpacked/NAR-INF/bundled-dependencies/nifi-web-docs-1.9.2.war}
2019-05-23 10:37:02,457 INFO [main] o.e.j.a.AnnotationConfiguration Scanning elapsed time=6ms
2019-05-23 10:37:02,475 INFO [main] o.e.j.server.handler.ContextHandler._ No Spring WebApplicationInitializer types detected on classpath
2019-05-23 10:37:02,478 INFO [main] o.e.jetty.server.handler.ContextHandler Started o.e.j.w.WebAppContext#6f5288c5{nifi-error,/,file:///home/superman/nifi-1.9.2/work/jetty/nifi-web-error-1.9.2.war/webapp/,AVAILABLE}{./work/nar/framework/nifi-framework-nar-1.9.2.nar-unpacked/NAR-INF/bundled-dependencies/nifi-web-error-1.9.2.war}
2019-05-23 10:37:02,488 INFO [main] o.eclipse.jetty.server.AbstractConnector Started ServerConnector#167ed1cf{HTTP/1.1,[http/1.1]}{10.129.140.22:3000}
2019-05-23 10:37:02,488 INFO [main] org.eclipse.jetty.server.Server Started #26145ms
2019-05-23 10:37:02,500 INFO [main] org.apache.nifi.web.server.JettyServer Loading Flow...
2019-05-23 10:37:02,503 INFO [main] org.apache.nifi.io.socket.SocketListener Now listening for connections from nodes on port 10000
2019-05-23 10:37:02,545 INFO [main] o.apache.nifi.controller.FlowController Successfully synchronized controller with proposed flow
2019-05-23 10:37:02,587 INFO [main] o.a.nifi.controller.StandardFlowService Connecting Node: 10.129.140.22:3000
2019-05-23 10:37:02,589 INFO [main] o.a.n.c.c.n.LeaderElectionNodeProtocolSender Determined that Cluster Coordinator is located at localhost:10001; will use this address for sending heartbeat messages
2019-05-23 10:37:02,590 WARN [main] o.a.nifi.controller.StandardFlowService Failed to connect to cluster due to: org.apache.nifi.cluster.protocol.ProtocolException: Failed to create socket to localhost:10001 due to: java.net.ConnectException: Connection refused (Connection refused)
2019-05-23 10:37:04,001 INFO [SessionTracker] o.a.zookeeper.server.ZooKeeperServer Expiring session 0x16ae42f180d0003, timeout of 4000ms exceeded
2019-05-23 10:37:04,001 INFO [SessionTracker] o.a.zookeeper.server.ZooKeeperServer Expiring session 0x16ae42f180d0002, timeout of 4000ms exceeded
2019-05-23 10:37:05,026 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Event Reported for 10.129.140.22:3000 -- Received heartbeat from node previously disconnected due to Has Not Yet Connected to Cluster. Issuing reconnection request.
2019-05-23 10:37:05,028 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Event Reported for 10.129.140.22:3000 -- Requesting that node connect to cluster
2019-05-23 10:37:05,028 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Status of 10.129.140.22:3000 changed from NodeConnectionStatus[nodeId=10.129.140.22:3000, state=DISCONNECTED, Disconnect Code=Has Not Yet Connected to Cluster, Disconnect Reason=Has Not Yet Connected to Cluster, updateId=0] to NodeConnectionStatus[nodeId=10.129.140.22:3000, state=CONNECTING, updateId=5]
2019-05-23 10:37:07,591 WARN [main] o.a.nifi.controller.StandardFlowService There is currently no Cluster Coordinator. This often happens upon restart of NiFi when running an embedded ZooKeeper. Will register this node to become the active Cluster Coordinator and will attempt to connect to cluster again
2019-05-23 10:37:07,594 INFO [main] o.a.n.c.l.e.CuratorLeaderElectionManager CuratorLeaderElectionManager[stopped=false] Registered new Leader Selector for role Cluster Coordinator; this node is an active participant in the election.
2019-05-23 10:37:07,612 INFO [Leader Election Notification Thread-1] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener#1d6dcdcb This node has been elected Leader for Role 'Cluster Coordinator'
2019-05-23 10:37:07,612 INFO [Leader Election Notification Thread-1] o.apache.nifi.controller.FlowController This node elected Active Cluster Coordinator
2019-05-23 10:37:07,668 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Event Reported for 10.129.140.22:3000 -- Received heartbeat from node previously disconnected due to Has Not Yet Connected to Cluster. Issuing reconnection request.
2019-05-23 10:37:07,668 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Event Reported for 10.129.140.22:3000 -- Requesting that node connect to cluster
2019-05-23 10:37:07,669 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Status of 10.129.140.22:3000 changed from NodeConnectionStatus[nodeId=10.129.140.22:3000, state=DISCONNECTED, Disconnect Code=Has Not Yet Connected to Cluster, Disconnect Reason=Has Not Yet Connected to Cluster, updateId=1] to NodeConnectionStatus[nodeId=10.129.140.22:3000, state=CONNECTING, updateId=6]
2019-05-23 10:37:07,675 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Event Reported for 10.129.140.22:3000 -- Received heartbeat from node previously disconnected due to Has Not Yet Connected to Cluster. Issuing reconnection request.
2019-05-23 10:37:07,675 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Event Reported for 10.129.140.22:3000 -- Requesting that node connect to cluster
2019-05-23 10:37:07,675 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Status of 10.129.140.22:3000 changed from NodeConnectionStatus[nodeId=10.129.140.22:3000, state=DISCONNECTED, Disconnect Code=Has Not Yet Connected to Cluster, Disconnect Reason=Has Not Yet Connected to Cluster, updateId=2] to NodeConnectionStatus[nodeId=10.129.140.22:3000, state=CONNECTING, updateId=7]
2019-05-23 10:37:07,694 INFO [Process Cluster Protocol Request-1] o.a.n.c.c.node.NodeClusterCoordinator Status of 10.129.140.22:3000 changed from NodeConnectionStatus[nodeId=10.129.140.22:3000, state=CONNECTING, updateId=5] to NodeConnectionStatus[nodeId=10.129.140.22:3000, state=CONNECTING, updateId=5]
2019-05-23 10:37:07,695 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Event Reported for 10.129.140.22:3000 -- Received heartbeat from node previously disconnected due to Has Not Yet Connected to Cluster. Issuing reconnection request.
2019-05-23 10:37:07,699 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Event Reported for 10.129.140.22:3000 -- Requesting that node connect to cluster
2019-05-23 10:37:07,700 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Status of 10.129.140.22:3000 changed from NodeConnectionStatus[nodeId=10.129.140.22:3000, state=DISCONNECTED, Disconnect Code=Has Not Yet Connected to Cluster, Disconnect Reason=Has Not Yet Connected to Cluster, updateId=3] to NodeConnectionStatus[nodeId=10.129.140.22:3000, state=CONNECTING, updateId=8]
2019-05-23 10:37:07,701 INFO [Process Cluster Protocol Request-5] o.a.n.c.c.node.NodeClusterCoordinator Status of 10.129.140.22:3000 changed from NodeConnectionStatus[nodeId=10.129.140.22:3000, state=CONNECTING, updateId=7] to NodeConnectionStatus[nodeId=10.129.140.22:3000, state=CONNECTING, updateId=7]
2019-05-23 10:37:07,702 INFO [Process Cluster Protocol Request-1] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 19834836-9bda-41b3-8fef-4a288d90c7bf (type=NODE_STATUS_CHANGE, length=1103 bytes) from localhost.localdomain in 33 millis
2019-05-23 10:37:07,702 INFO [Process Cluster Protocol Request-5] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 85b0bb3f-c2a6-4dfd-abd6-e9df14710c4d (type=NODE_STATUS_CHANGE, length=1103 bytes) from localhost.localdomain in 10 millis
2019-05-23 10:37:07,703 INFO [Process Cluster Protocol Request-3] o.a.n.c.c.node.NodeClusterCoordinator Status of 10.129.140.22:3000 changed from NodeConnectionStatus[nodeId=10.129.140.22:3000, state=CONNECTING, updateId=6] to NodeConnectionStatus[nodeId=10.129.140.22:3000, state=CONNECTING, updateId=6]
2019-05-23 10:37:07,705 INFO [Process Cluster Protocol Request-3] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 80447901-4ad3-44e3-91ad-d9f075624eae (type=NODE_STATUS_CHANGE, length=1103 bytes) from localhost.localdomain in 31 millis
2019-05-23 10:37:07,706 INFO [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Processing reconnection request from cluster coordinator.
2019-05-23 10:37:07,706 INFO [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Received a Reconnection Request that contained no DataFlow. Will attempt to connect to cluster using local flow.
2019-05-23 10:37:07,707 INFO [Process Cluster Protocol Request-2] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 22cdceee-c01f-445f-a091-38812e878d10 (type=RECONNECTION_REQUEST, length=3095 bytes) from 10.129.140.22:3000 in 34 millis
2019-05-23 10:37:07,708 INFO [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Processing reconnection request from cluster coordinator.
2019-05-23 10:37:07,708 INFO [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Received a Reconnection Request that contained no DataFlow. Will attempt to connect to cluster using local flow.
2019-05-23 10:37:07,709 INFO [Process Cluster Protocol Request-4] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 8605cf39-2034-4ee2-92c4-0fbe54e97fb2 (type=RECONNECTION_REQUEST, length=3013 bytes) from 10.129.140.22:3000 in 27 millis
2019-05-23 10:37:07,712 INFO [Reconnect 10.129.140.22:3000] o.a.n.c.c.node.NodeClusterCoordinator Successfully requested that 10.129.140.22:3000 join the cluster
2019-05-23 10:37:07,712 INFO [Reconnect 10.129.140.22:3000] o.a.n.c.c.node.NodeClusterCoordinator Successfully requested that 10.129.140.22:3000 join the cluster
2019-05-23 10:37:07,725 INFO [Process Cluster Protocol Request-6] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 0ca55348-44eb-416b-91dd-3d80da4c5ebe (type=RECONNECTION_REQUEST, length=3013 bytes) from 10.129.140.22:3000 in 29 millis
2019-05-23 10:37:07,725 INFO [Reconnect 10.129.140.22:3000] o.a.n.c.c.node.NodeClusterCoordinator Successfully requested that 10.129.140.22:3000 join the cluster
2019-05-23 10:37:07,728 INFO [Process Cluster Protocol Request-7] o.a.n.c.c.node.NodeClusterCoordinator Status of 10.129.140.22:3000 changed from NodeConnectionStatus[nodeId=10.129.140.22:3000, state=CONNECTING, updateId=8] to NodeConnectionStatus[nodeId=10.129.140.22:3000, state=CONNECTING, updateId=8]
2019-05-23 10:37:07,728 INFO [Process Cluster Protocol Request-7] o.a.n.c.p.impl.SocketProtocolListener Finished processing request c9b647d7-67ac-4d0a-833b-8a0a8cc0ba6d (type=NODE_STATUS_CHANGE, length=1103 bytes) from localhost.localdomain in 3 millis
2019-05-23 10:37:07,728 INFO [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Connecting Node: 10.129.140.22:3000
2019-05-23 10:37:07,725 INFO [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Processing reconnection request from cluster coordinator.
2019-05-23 10:37:07,732 INFO [Heartbeat Monitor Thread-1] o.a.n.c.c.h.AbstractHeartbeatMonitor Finished processing 4 heartbeats in 2 seconds, 708 millis
2019-05-23 10:37:07,732 INFO [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Received a Reconnection Request that contained no DataFlow. Will attempt to connect to cluster using local flow.
2019-05-23 10:37:07,733 INFO [Reconnect to Cluster] o.a.n.c.c.n.LeaderElectionNodeProtocolSender Determined that Cluster Coordinator is located at localhost:10000; will use this address for sending heartbeat messages
2019-05-23 10:37:07,734 INFO [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Connecting Node: 10.129.140.22:3000
2019-05-23 10:37:07,735 INFO [Reconnect to Cluster] o.a.nifi.controller.StandardFlowService Connecting Node: 10.129.140.22:3000
2019-05-23 10:37:07,736 INFO [Reconnect to Cluster] o.a.n.c.c.n.LeaderElectionNodeProtocolSender Determined that Cluster Coordinator is located at localhost:10000; will use this address for sending heartbeat messages
2019-05-23 10:37:07,736 INFO [Reconnect to Cluster] o.a.n.c.c.n.LeaderElectionNodeProtocolSender Determined that Cluster Coordinator is located at localhost:10000; will use this address for sending heartbeat messages
2019-05-23 10:37:07,748 INFO [Process Cluster Protocol Request-8] o.a.n.c.p.impl.SocketProtocolListener Finished processing request 434daf63-1beb-4b82-9290-bb0da4e89b7f (type=RECONNECTION_REQUEST, length=2972 bytes) from 10.129.140.22:3000 in 16 millis
2019-05-23 10:37:07,749 INFO [Reconnect 10.129.140.22:3000] o.a.n.c.c.node.NodeClusterCoordinator Successfully requested that 10.129.140.22:3000 join the cluster**
Set these properties on both:
nifi.web.http.host=<host>
nifi.cluster.node.address=<host>
Beware of this value visibility in network scopes:
nifi.zookeeper.connect.string=localhost:2181
e.g 'localhost', in the other side you're using real IP addr
They share it during replication, primary/coordinator node election and flow election.

Hbase managed zookeeper suddenly trying to connect to localhost instead of zookeeper quorum

I was running some tests with table mappers and reducers on large scale problems. After a certain point my reducers started failing when the job was 80% done. From what I can tell when looking at the syslogs the problem is that one of my zookeepers is attempting to connect to the localhost as opposed to the other zookeepers in the quorum
Oddly it seems to do just fine connecting to the other nodes when mapping is going on, its reducing that it has a problem with. Here are selected portions of the syslog which might be relevant to figuring out whats going on
2014-06-27 09:44:01,599 INFO [main] org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=hdev02:5181,hdev01:5181,hdev03:5181 sessionTimeout=10000 watcher=hconnection-0x4aee260b, quorum=hdev02:5181,hdev01:5181,hdev03:5181, baseZNode=/hbase
2014-06-27 09:44:01,612 INFO [main] org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x4aee260b connecting to ZooKeeper ensemble=hdev02:5181,hdev01:5181,hdev03:5181
2014-06-27 09:44:01,614 INFO [main-SendThread(hdev02:5181)] org.apache.zookeeper.ClientCnxn: Opening socket connection to server hdev02/172.17.43.36:5181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
2014-06-27 09:44:01,615 INFO [main-SendThread(hdev02:5181)] org.apache.zookeeper.ClientCnxn: Socket connection established to hdev02/172.17.43.36:5181, initiating session
2014-06-27 09:44:01,617 INFO [main-SendThread(hdev02:5181)] org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
2014-06-27 09:44:01,723 WARN [main] org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=hdev02:5181,hdev01:5181,hdev03:5181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
2014-06-27 09:44:01,723 INFO [main] org.apache.hadoop.hbase.util.RetryCounter: Sleeping
***
org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 1 on-disk map-outputs
2014-06-27 09:55:12,012 INFO [main] org.apache.hadoop.mapred.Merger: Merging 1 sorted segments
2014-06-27 09:55:12,013 INFO [main] org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 33206049 bytes
2014-06-27 09:55:12,208 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: Merged 1 segments, 33206079 bytes to disk to satisfy reduce memory limit
2014-06-27 09:55:12,209 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: Merging 2 files, 265119413 bytes from disk
2014-06-27 09:55:12,209 INFO [main] org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
2014-06-27 09:55:12,210 INFO [main] org.apache.hadoop.mapred.Merger: Merging 2 sorted segments
2014-06-27 09:55:12,212 INFO [main] org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 265119345 bytes
2014-06-27 09:55:12,279 INFO [main] org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0x65afdbbb, quorum=localhost:2181, baseZNode=/hbase
2014-06-27 09:55:12,281 INFO [main] org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x65afdbbb connecting to ZooKeeper ensemble=localhost:2181
2014-06-27 09:55:12,282 INFO [main-SendThread(localhost.localdomain:2181)] org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost.localdomain/127.0.0.1:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
2014-06-27 09:55:12,283 WARN [main-SendThread(localhost.localdomain:2181)] org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:599)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
2014-06-27 09:55:12,384 WARN [main] org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=localhost:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
2014-06-27 09:55:12,384 INFO [main] org.apache.hadoop.hbase.util.RetryCounter: Sleeping 1000ms before retry #0...
2014-06-27 09:55:13,385 INFO [main-SendThread(localhost.localdomain:2181)] org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost.localdomain/127.0.0.1:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
2014-06-27 09:55:13,385 WARN [main-SendThread(localhost.localdomain:2181)] org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing
***
org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=localhost:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
2014-06-27 09:55:13,486 ERROR [main] org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 1 attempts
2014-06-27 09:55:13,486 WARN [main] org.apache.hadoop.hbase.zookeeper.ZKUtil: hconnection-0x65afdbbb, quorum=localhost:2181, baseZNode=/hbase Unable to set watcher on znode (/hbase/hbaseid)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
I'm pretty sure its configured correctly, here is the relevant portion of my hbase-site.xml.
<property>
<name>hbase.zookeeper.property.clientPort</name>
<value>5181</value>
<description>Property from ZooKeeper's config zoo.cfg.
The port at which the clients will connect.
</description>
</property>
<property>
<name>zookeeper.session.timeout</name>
<value>10000</value>
<description></description>
</property>
<property>
<name>hbase.client.retries.number</name>
<value>10</value>
<description></description>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>hdev01,hdev02,hdev03</value>
<description></description>
</property>
So far as I can tell hdev03 is the only server that has any problem with this. Netstating all relevant ports doesn't show me anything strange.
I've had same problem when running HBase through Spark on Yarn. Everything was fine until suddenly it started to trying to connect to localhost instead of quorum. Setting port and quorum programmatically before HBase call fixed the issue
conf.set("hbase.zookeeper.quorum","my.server")
conf.set("hbase.zookeeper.property.clientPort","5181")
I'm using MapR, and it has "unusual" (5181) zookeeper port
Hard to say what is happening with the information, given. I have found the Hadoop Stack (HBase especially) to be quite hostile to even the slightest bit of misconfiguration in DNS or the hosts file.
As the quorum in your hbase-site.xml looks good, I'd start checking with network/hostname resolution related configurations:
Has the nodename slipped into the localhost entry in /etc/hosts on hdev03?
Is there an entry for the host itself in hdev03s /etc/hosts (there should)?
Has Reverse DNS been correctly configured in case you are using DNS for name resolution instead of the hosts file?
These are just a few pointers in the direction I'd look with this kind of issue. Hope it helps!
Add '--driver-class-path ~/hbase-1.1.2/conf' into spark-submit command, so that the task can find the configured zookeeper servers instead of 127.0.0.1.

While running a topology in storm we are getting error like this

While running a topology in storm we are getting error like this,
8983 [Thread-6] INFO com.netflix.curator.framework.imps.CuratorFrameworkImpl -
Starting
9144 [main] INFO **backtype.storm.daemon.nimbus** - Shutting down master
9199 [Thread-6-EventThread] INFO backtype.storm.zookeeper - Zookeeper state upd
ate: :connected:none
9241 [main] INFO backtype.storm.daemon.nimbus - Shut down master
9273 [Thread-6] INFO com.netflix.curator.framework.imps.CuratorFrameworkImpl -
Starting
9306 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2000] WARN org.apache.zookeeper.serv
er.NIOServerCnxn - EndOfStreamException: Unable to read additional data from cli
ent sessionid 0x143af55728d0003, likely client has closed socket
9354 [main] INFO backtype.storm.daemon.supervisor - Shutting down c094c3b1-a378
-4c4f-af35-9278647c217a:4beddc09-4675-4fb9-8bdc-9cf5013ce9ca
9358 [main] INFO backtype.storm.daemon.supervisor - Shut down c094c3b1-a378-4c4
f-af35-9278647c217a:4beddc09-4675-4fb9-8bdc-9cf5013ce9ca
9361 [main] INFO **backtype.storm.daemon.superviso**r - Shutting down supervisor c0
94c3b1-a378-4c4f-af35-9278647c217a
9364 [Thread-5] INFO **backtype.storm.event** - Event manager interrupted
9369 [Thread-6] INFO backtype.storm.event - Event manager interrupted
9425 [main] INFO **backtype.storm.daemon.supervisor** - Shutting down supervisor 38
6d8d71-c9b5-4b51-bd6e-f9f605034ea0
9428 [Thread-8] INFO backtype.storm.event - Event manager interrupted
9429 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2000] WARN org.apache.zookeeper.serv
er.NIOServerCnxn - EndOfStreamException: Unable to read additional data from cli
ent sessionid 0x143af55728d0007, likely client has closed socket
9429 [Thread-9] INFO backtype.storm.event - Event manager interrupted
9473 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2000] WARN org.apache.zookeeper.serv
er.NIOServerCnxn - EndOfStreamException: Unable to read additional data from cli
ent sessionid 0x143af55728d0009, likely client has closed socket
9476 [main] INFO backtype.storm.testing - Shutting down in process zookeeper
9503 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2000] WARN org.apache.zookeeper.serv
er.NIOServerCnxn - Ignoring exception
**java.nio.channels.ClosedChannelException**: null
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.jav
a:211) ~[na:1.7.0_03]
at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.j
ava:242) ~[zookeeper-3.3.3.jar:3.3.3-1073969]
9510 [main] INFO **backtype.storm.testing** - Done shutting down in process zookeep
er
9513 [main] INFO backtype.storm.testing - Deleting temporary path C:\Users\sowm
iya\AppData\Local\Temp\c9b1bc1a-a950-4098-af77-f81a4d2b112f
9520 [main] INFO backtype.storm.testing - Deleting temporary path C:\Users\sowm
iya\AppData\Local\Temp\7e75c468-18ea-4787-a4ac-496fb108db71
9527 [main] INFO backtype.storm.testing - Unable to delete file: C:\Users\sowmi
ya\AppData\Local\Temp\7e75c468-18ea-4787-a4ac-496fb108db71\version-2\log.1
9529 [main] INFO backtype.storm.testing - Deleting temporary path C:\Users\sowm
iya\AppData\Local\Temp\fa7b3c9b-ac93-4090-b9e2-63f10019e61f
9543 [main] INFO backtype.storm.testing - Deleting temporary path C:\Users\sowm
iya\AppData\Local\Temp\55f1fd11-508e-43bb-b340-0d9b79f3af33
9579 [Thread-6-EventThread] INFO com.netflix.curator.framework.state.Connection
StateManager - State change: SUSPENDED
9580 [ConnectionStateManager-0] WARN com.netflix.curator.framework.state.Connec
tionStateManager - There are no ConnectionStateListeners registered.
9583 [Thread-6-EventThread] WARN backtype.storm.cluster - Received event :disco
nnected::none: with disconnected Zookeeper.
11232 [Thread-6-SendThread(localhost:2000)] WARN org.apache.zookeeper.ClientCnx
n - Session 0x143af55728d000b for server null, unexpected error, closing socket
connection and attempting reconnect
**java.net.ConnectException: Connection refused: no further information**
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.7.0_0
3]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701
) ~[na:1.7.0_03]
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
~[zookeeper-3.3.3.jar:3.3.3-1073969]
13992 [Thread-6-SendThread(localhost:2000)] WARN org.apache.zookeeper.ClientCnx
n - Session 0x143af55728d000b for server null, unexpected error, closing socket
connection and attempting reconnect
java.net.ConnectException: Connection refused: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.7.0_0
3]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701
) ~[na:1.7.0_03]
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
Whwn we are trying to run the topology jar file all the operation like nimbus,zookeeper and supervisor process going to dead.please help us to know why this is happened.
Please help us to rectify this error and help to proceed further.
Thank you,
Sowmiya
Priya
This looks like a zookeeper issue. It looks like your processes are not being able to connect to zookeeper. Can't say more without more information.

FATAL master.HMaster: Unexpected state : .. Cannot transit it to OFFLINE

I've got a serious Hbase crash problem. I'm using HBase 0.94.7 with one master and two region servers. The HBase master crashed regularly, I can't even get it restarted. I've got the master logs as following:
DEBUG master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=master,60020,1374506461230, region=46c2333f401964bf877254be19c2cc8c
DEBUG handler.ClosedRegionHandler: Handling CLOSED event for 6423df864603aa6e8c45c726ab3ae62f
DEBUG master.AssignmentManager: Forcing OFFLINE; was=LogDetail,\x00\x00\x01\xE8\x00\x00\x01?\xF8\xB3\x8F\x17\xCE\xE2g\x84,1374498065657.6423df864603aa6e8c45c726ab3ae62f. state=CLOSED, ts=1374508769672, server=slave,60020,1374506460892
DEBUG zookeeper.ZKAssign: master:60000-0x14006f52f3f000e Creating (or updating) unassigned node for 6423df864603aa6e8c45c726ab3ae62f with OFFLINE state
FATAL master.HMaster: Unexpected state : LogDetail,\x00\x00\x01\xE8\x00\x00\x01?\xF6\xC17p&c\x8F\x14,1374498085655.c2f4143750eb1559a1dd92e937ea712d. state=PENDING_OPEN, ts=1374508769697, server=master,60020,1374506461230 .. Cannot transit it to OFFLINE.
java.lang.IllegalStateException: Unexpected state : LogDetail,\x00\x00\x01\xE8\x00\x00\x01?\xF6\xC17p&c\x8F\x14,1374498085655.c2f4143750eb1559a1dd92e937ea712d. state=PENDING_OPEN, ts=1374508769697, server=master,60020,1374506461230 .. Cannot transit it to OFFLINE.
at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1879)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1688)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1424)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1399)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1394)
at org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:105)
at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
INFO master.HMaster: Aborting
DEBUG handler.ClosedRegionHandler: Handling CLOSED event for 0710b486dcb3d51465695b51db376255
....
DEBUG master.AssignmentManager: The znode of region LogDetail,\x00\x00\x01\xE8\x00\x00\x01?\xF6\xC17p&c\x8F\x14,1374498085655.c2f4143750eb1559a1dd92e937ea712d. has been deleted.
INFO master.AssignmentManager: The master has opened the region LogDetail,\x00\x00\x01\xE8\x00\x00\x01?\xF6\xC17p&c\x8F\x14,1374498085655.c2f4143750eb1559a1dd92e937ea712d. that was online on master,60020,1374506461230
DEBUG master.AssignmentManager: Handling transition=M_ZK_REGION_OFFLINE, server=master,60000,1374508461536, region=c9cfdd360c09b292412ba5ad88815e6f
DEBUG catalog.CatalogTracker: Stopping catalog tracker org.apache.hadoop.hbase.catalog.CatalogTracker#5c061cd2
INFO client.HConnectionManager$HConnectionImplementation: Closed zookeeper sessionid=0x14006f52f3f000f
INFO zookeeper.ZooKeeper: Session: 0x14006f52f3f000f closed
INFO zookeeper.ClientCnxn: EventThread shut down
INFO master.AssignmentManager$TimerUpdater: master,60000,1374508461536.timerUpdater exiting
INFO master.SplitLogManager$TimeoutMonitor: master,60000,1374508461536.splitLogManagerTimeoutMonitor exiting
INFO master.AssignmentManager$TimeoutMonitor: master,60000,1374508461536.timeoutMonitor exiting
INFO zookeeper.ZooKeeper: Session: 0x14006f52f3f000e closed
INFO zookeeper.ClientCnxn: EventThread shut down
INFO master.HMaster: HMaster main thread exiting
ERROR master.HMasterCommandLine: Failed to start master
I also found something unusual in the ZK log:
INFO org.apache.zookeeper.server.NIOServerCnxnFactory: Accepted socket connection from /master:37856
INFO org.apache.zookeeper.server.ZooKeeperServer: Client attempting to establish new session at /master:37856
INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x140100dda0300e1 with negotiated timeout 180000 for client /master:37856
WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x140100dda0300e1, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Thread.java:662)
INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /master:37856 which had sessionid 0x140100dda0300e1
Can anybody help to see what the problem is? Is it related to the unassigned region or something like this? I've tried the bin/hbase hbck -repair and bin/hbase hbck -fix, but it doesn't help.
Thanks
After checked the log of my region server very carefully, I got the answer.
Cause
It turns out that there is one library called 'SNAPPY' for the compression of the hbase table is not well installed on the region server. And all my tables are created using this compression algorithm. When the master tries to balance the region to the region server, it failed. Eventually the master aborted.
Solution
Install and configure the SNAPPY on EVERY NODE as following:
apt-get install libsnappy1
su hbase
mkdir /home/hbase/hbase-0.94.7/lib/native/Linux-amd64-64
ln -s /usr/lib/libsnappy.so.1.1.2 /home/hbase/hbase-0.94.7/lib/native/Linux-amd64-64/libsnappy.so
exit (-> root)
ln -s /usr/lib/libsnappy.so.1.1.2 /usr/lib64/libsnappy.so.1.1.2
ln -s /usr/lib/libsnappy.so.1.1.2 /usr/lib64/libsnappy.so.1
ln -s /usr/lib/libsnappy.so.1.1.2 /usr/lib64/libsnappy.so
ln -s /usr/lib/libsnappy.so.1 /usr/lib/libsnappy.so
Now everything is OK! The regions are well balanced over region servers.
Check the region server log, if it is caused by LZO compressor missing and you are using Cloudera Hadoop,you can install lzo easily according to the following instruction:
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v1/v1-0-1/Installing-and-Using-Impala/ciiu_lzo.html

Running MapReduce on HBase gives Zookeeper error

I am doing a test project with Hadoop and HBase. Currently the cluster has 2 Ubuntu VMs hosted on a Windows machine.
I am able to perform PUT, QUERY and DELETE operation remotly (in my host machine) using following HBase Java API configuration
config = HBaseConfiguration.create();
config.set("hbase.zookeeper.quorum", "192.168.56.90");
config.set("hbase.zookeeper.property.clientPort", "2222");
When I am trying to run a HBase MapReduce job on Windows with the same config as above, I am getting following error
13/03/24 06:11:03 ERROR security.UserGroupInformation: PriviledgedActionException as:Joel cause:java.io.IOException: Failed to set permissions of path: \tmp\hadoop-Joel\mapred\staging\Joel290889388\.staging to 0700
java.io.IOException: Failed to set permissions of path: \tmp\hadoop-Joel\mapred\staging\Joel290889388\.staging to 0700
From what I have read on the web, there seems to be a problem with running MapReduce jobs on Windows. So I tried running the MapReduce job on Linux by using "java - jar MR.jar".
On Linux, I can't connect to Zookeeper. For unknown reason, Zookeeper host and port are getting reseted on the client side
13/03/24 05:59:33 INFO zookeeper.ZooKeeper: Client environment:os.version=3.5.0-23-generic
13/03/24 05:59:33 INFO zookeeper.ZooKeeper: Client environment:user.name=hduser
13/03/24 05:59:33 INFO zookeeper.ZooKeeper: Client environment:user.home=/home/hduser
13/03/24 05:59:33 INFO zookeeper.ZooKeeper: Client environment:user.dir=/home/hduser/testes
13/03/24 05:59:33 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=192.168.56.90:2222 sessionTimeout=180000 watcher=hconnection
13/03/24 05:59:33 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 11552#node01
13/03/24 05:59:33 INFO zookeeper.ClientCnxn: Opening socket connection to server node01/192.168.56.90:2222. Will not attempt to authenticate using SASL (unknown error)
13/03/24 05:59:33 INFO zookeeper.ClientCnxn: Socket connection established to node01/192.168.56.90:2222, initiating session
13/03/24 05:59:33 INFO zookeeper.ClientCnxn: Session establishment complete on server node01/192.168.56.90:2222, sessionid = 0x13d9afaa1a30006, negotiated timeout = 180000
13/03/24 05:59:33 INFO client.HConnectionManager$HConnectionImplementation: Closed zookeeper sessionid=0x13d9afaa1a30006
13/03/24 05:59:33 INFO zookeeper.ZooKeeper: Session: 0x13d9afaa1a30006 closed
13/03/24 05:59:33 INFO zookeeper.ClientCnxn: EventThread shut down
13/03/24 05:59:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/03/24 05:59:33 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/03/24 05:59:33 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
13/03/24 05:59:33 INFO zookeeper.RecoverableZooKeeper: The identifier of this process is 11552#node01
13/03/24 05:59:33 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
13/03/24 05:59:33 WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:692)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
Judging from log above, it connects correctly to node01:2222 (node01 resolves to 192.168.56.90). But for some reason, it changes to localhost:2181 and it then gives a connection refused error.
How can I fix this issue to get a MR jobs running on Linux, on the same machine as Zookeeper is running?
Version: Hbase 0.94.5 / Hadoop 1.1.2
Thanks.
You may need to set the hbase.master also.
also check the /etc/hosts file and see if it is correct. Are you able to telnet to the zookeeper using that connection info?
config.set("hbase.zookeeper.quorum", "192.168.56.90");
config.set("hbase.zookeeper.property.clientPort", "2222");
config.set("hbase.master", "some.host.com:60000")

Resources