Storm - Supervisors launched but not connecting to Nimbus - apache-storm

I have a Storm cluster with 1 Nimbus, 4 Supervisors and 2 Zookeeper nodes. My Storm.yaml is as following:
storm.zookeeper.servers:
- "storage14"
- "storage15"
nimbus.seeds: ["storage01"]
#storm.local.hostname: "storage05"
supervisor.supervisors:
- "storage02"
- "storage03"
- "storage04"
- "storage05"
storm.local.dir: "/tmp/storm"
worker.childopts: "-Xmx%HEAP-MEM%m -XX:+PrintGCDetails -Xloggc:artifacts/gc.log -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=artifacts/heapdump"
This storm.yaml file is used by both Nimbus and Supervisors. When Nimbus is started I have the storm.local.hostname commented out as is shown above.
However, when starting Supervisors on respective nodes, I uncomment the storm.local.hostname and set it to the hostname of the node on which the supervisor is being launched. For instance if I was launching the supervisor on storage05, the storm.yaml file would have the following additional config param:
storm.local.hostname: "storage05"
The problem is even though Nimubs is launched successfully and I can see it on the Storm UI, some supervisors do not seem to be able to connect to Nimbus. For instance of the 4 nodes I start supervisors on, Storm UI often shows only 2 of them connected. However, if I ssh in to these nodes and run jps, I can see that the supervisor process is running on ALL of these nodes.
The Supervisors at the nodes which do end up connecting are not the same always, so it is definitely not a problem with those specific nodes.
Another thing to notice is if I try to execute a topology on whatever nodes that got connected, it does not get registered by the cluster and I can not see that topology on the UI either.
What do you think might be causing this erratic behavior?
UPDATE:
Tail end of nimbus.log has the following lines
2017-01-25 00:04:25.216 o.a.s.s.o.a.z.ClientCnxn [WARN] Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
at org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2017-01-25 00:04:25.317 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server storage15/192.168.140.195:2181. Will not attempt to authenticate using SASL (unknown error)
2017-01-25 00:04:25.317 o.a.s.s.o.a.z.ClientCnxn [WARN] Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
at org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2017-01-25 00:04:25.686 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server storage15/192.168.140.195:2181. Will not attempt to authenticate using SASL (unknown error)
2017-01-25 00:04:25.686 o.a.s.s.o.a.z.ClientCnxn [WARN] Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
at org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2017-01-25 00:04:25.787 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server storage14/192.168.140.194:2181. Will not attempt to authenticate using SASL (unknown error)
2017-01-25 00:04:25.787 o.a.s.s.o.a.z.ClientCnxn [WARN] Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
at org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)

Your UPDATE (nimbus log) indicates that your Nimbus cannot connect Zookeeper cluster. Please check that Zookeeper cluster (storage14/storage15) is accessible from storage01 (not only node is accessible, but also do telnet to Zookeeper server via "telnet storage14 (and/or storage15) 2181").
When ZK connectivity issue is gone please try starting supervisor again.

Related

ambari on HDP cluster + ambari-metrics-collector service not start

we have some issue with ambari-metrics-collector service , ( we have HDP cluster version - 2.6.4 with 8 nodes )
ambari metrics collector service can’t start or start of few second then failed
the details about metrics collector version
rpm -qa | grep metrics
ambari-metrics-grafana-2.6.1.0-143.x86_64
ambari-metrics-monitor-2.6.1.0-143.x86_64
ambari-metrics-collector-2.5.0.3-7.x86_64
ambari-metrics-hadoop-sink-2.6.1.0-143.x86_64
all machines are rhel 7.2
we performed the following steps in order to resolve the problem
1.restart metrics-collector service
su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf/ stop'
su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf/ start'
or
ambari-metrics-collector stop
ambari-metrics-collector start
2.restart ambari-metrics-monitor on all nodes
ambari-metrics-monitor stop
ambari-metrics-monitor start
3.clean the folder /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper/
mv /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper/zookeeper_0 /tmp/bck/zookeeper/
Then restart metrics-collector service
4.Tuning the metrics-collector parameters according - https://docs.cloudera.com/HDPDocuments/Ambari-2.2.1.0/bk_ambari_reference_guide/content/_ams_general_guidelines.html
we update the follwing parameters in ambari
metrics_collector_heap_size=1024
hbase_regionserver_heapsize=1024
hbase_master_heapsize=512
hbase_master_xmn_size=128
status for now: - steps 1-4 doesn’t help
From the logs we can see the following:
log file - ambari-metrics-collector.log
2020-06-25 09:06:14,474 WARN org.apache.zookeeper.ClientCnxn: Session 0x172eab71f310002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2020-06-25 09:06:14,575 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=master02.sys671.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/meta-region-server
log file - hbase-ams-master-master02.sys671.com.log
2020-06-25 09:38:18,799 WARN [RS:0;master02:51842-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Session 0x172ead5d73a0004 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2020-06-25 09:38:20,437 INFO [main-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Opening socket connection to server master02.sys671.com/23.2.35.171:61181. Will not attempt to authenticate using SASL (unknown error)
2020-06-25 09:38:20,438 WARN [main-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Session 0x172ead5d73a0002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
we also not see that port is listening ( timeline.metrics.service.webapp.address )
netstat -tulpn | grep 6188
any advice how to continue from this point ?
we'll appreciate to get any help about this problem

Kerberos HBase Zookeeper fails

I'm trying to kerberise my HBase Cluster and I get some problems with Zookeeper. When I start Hbase I get this error on the Master log :
ERROR [main-SendThread(X.X.X.X:2181)] client.ZooKeeperSaslClient: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7) - LOOKING_UP_SERVER)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state.
ERROR [main-SendThread(X.X.X.X:2181)] zookeeper.ClientCnxn: SASL authentication with Zookeeper Quorum member failed: javax.security.sasl.SaslException: An error: (java.security.PrivilegedActionException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Server not found in Kerberos database (7) - LOOKING_UP_SERVER)]) occurred when evaluating Zookeeper Quorum Member's received SASL token. Zookeeper Client will go to AUTH_FAILED state.
DEBUG [main-EventThread] zookeeper.ZKWatcher: master:16000-0x16c236187be0000, quorum=Y.Y.Y.Y:2181,X.X.X.X:2181, baseZNode=/hbase Received ZooKeeper Event, type=None, state=AuthFailed, path=null
DEBUG [main] zookeeper.ZooKeeper: Close called on already closed client
On the Zookeeper log, I get :
WARN [QuorumPeer[myid=0]/0:0:0:0:0:0:0:0:2181] quorum.Learner: Unexpected exception, tries=0, connecting to /X.X.X.X:2888
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:229)
at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937)
I verified my firewall, the ports are open
For the configuration, I followed the HBase Reference Guide :
http://hbase.apache.org/book.html#zk.sasl.auth
At first I thought it was a problem with my keytab but Hadoop is working fine with it.
I run HBase 2.0.5, Hadoop 3.1.2 and the Zookeeper is the one provided by HBase.
Following #SamsonScharfrichter 's comment, I've tried a few things :
I've created and specified in /etc/hosts the FQDN of my servers and modified my configurations to reflect this change.
Changed the hostname of my servers for the FQDN
tried to nslookup my hostnames, didn't work since they are specified in /etc/hosts
It didn't do anything, I'm still getting the error. My guess is that Kerberos tries to search for a DNS on my public NIC and not my private. I do not know why it struggles so hard to find my servers, since hadoop has absolutely no problem with it.
EDIT - I set up a private DNS on my network. DNS working great, still getting the error. I'm about to give up
EDIT 2 - I installed tshark on the node with the error. Apparently I get a frame with the message :
Error: KRB5KDC_ERR_C_PRINCIPAL_UNKNOWN
which is weird, I verified my keytab and the principals listed in kadmin. Maybe there defaults principals that I don't use ?

Hbase Master turing off after start. Setup for Hbase on Hadoop for a single cluster DB on my local machine

I have installed Hadoop (2.9.1) and Hbase (2.1) on my linux machine with the appropriate configurations.
1) I start all hadoop components. Using jps, I am able to see all the components that are running. This step is working fine.
2) When I start hbase, all the hbase components start again . Using the jps command, I am able to see the required components are running again. However, within 10 seconds, Hmaster turns off.
This is the contents of the log file for hbase master:-
The errors outlined below are pretty much the same for both master and regionserver log file.
2018-08-17 17:13:14,255 WARN [main-SendThread(localhost:2181)] zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
I understand that there is some port connection problem, but don't quite know what and where to make the changes.
Thank you in advance for your guidance.

Nifi - "Connection refused" in gui after installation

I'm installing a three node non-secure nifi cluster (1.5) with embedded zookeeper.
The installation has completed, the cluster startup up and election have finished, without any obvious issues.
However when I hit the gui I see the following:
javax.ws.rs.ProcessingException: java.net.ConnectException: Connection refused (Connection refused)
In nifi-apps.log there is:
2018-03-07 14:41:37,857 WARN [Replicate Request Thread-7] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Failed to replicate request GET /nifi-api/flow/current-user to test-nifi01:8080 due to javax.ws.rs.ProcessingException: java.net.ConnectException: Connection refused (Connection refused)
2018-03-07 14:41:37,858 WARN [Replicate Request Thread-7] o.a.n.c.c.h.r.ThreadPoolRequestReplicator
javax.ws.rs.ProcessingException: java.net.ConnectException: Connection refused (Connection refused)
at org.glassfish.jersey.client.internal.HttpUrlConnector.apply(HttpUrlConnector.java:284)
at org.glassfish.jersey.client.ClientRuntime.invoke(ClientRuntime.java:278)
at org.glassfish.jersey.client.JerseyInvocation.lambda$invoke$0(JerseyInvocation.java:753)
....
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
....
in nifi-user.log:
2018-03-07 15:18:49,475 INFO [NiFi Web Server-21] org.apache.nifi.web.filter.RequestLogger Attempting request for (<no user found>) POST http://test-nifi01:8080/nifi-api/access/kerberos (source ip: 172.100.1.11)
2018-03-07 15:18:49,476 INFO [NiFi Web Server-21] o.a.n.w.a.c.IllegalStateExceptionMapper java.lang.IllegalStateException: Access tokens are only issued over HTTPS.. Returning Conflict response.
2018-03-07 15:18:49,572 INFO [NiFi Web Server-17] org.apache.nifi.web.filter.RequestLogger Attempting request for (<no user found>) POST http://test-nifi01:8080/nifi-api/access/oidc/exchange (source ip: 172.100.1.11)
2018-03-07 15:18:49,573 INFO [NiFi Web Server-17] o.a.n.w.a.c.IllegalStateExceptionMapper java.lang.IllegalStateException: User authentication/authorization is only supported when running over HTTPS.. Returning Conflict response.
2018-03-07 15:18:49,672 INFO [NiFi Web Server-23] org.apache.nifi.web.filter.RequestLogger Attempting request for (anonymous) GET http://test-nifi01:8080/nifi-api/flow/current-user (source ip: 172.100.1.11)
It's the same results regardless of what host I connect to.
I'm not sure what I need to be checking here.
EDIT: Updated with more detail.
Node1:
https://pastebin.com/5nMzWfXw
# web properties #
nifi.web.war.directory=./lib
nifi.web.http.host=test-nifi01
nifi.web.http.port=8080
nifi.web.http.network.interface.default=eth0
nifi.web.https.host=
nifi.web.https.port=
nifi.web.https.network.interface.default=
nifi.web.jetty.working.directory=./work/jetty
nifi.web.jetty.threads=200
nifi.web.max.header.size=16 KB
nifi.web.proxy.context.path=
# cluster node properties (only configure for cluster nodes) #
nifi.cluster.is.node=true
nifi.cluster.node.address=test-nifi01
nifi.cluster.node.protocol.port=11003
nifi.cluster.node.protocol.threads=10
nifi.cluster.node.protocol.max.threads=50
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.node.max.concurrent.requests=100
nifi.cluster.firewall.file=
nifi.cluster.flow.election.max.wait.time=5 mins
nifi.cluster.flow.election.max.candidates=
Node 2: https://pastebin.com/n5GMy1nA
# web properties #
nifi.web.war.directory=./lib
nifi.web.http.host=test-nifi02
nifi.web.http.port=8080
nifi.web.http.network.interface.default=eth0
nifi.web.https.host=
nifi.web.https.port=
nifi.web.https.network.interface.default=
nifi.web.jetty.working.directory=./work/jetty
nifi.web.jetty.threads=200
nifi.web.max.header.size=16 KB
nifi.web.proxy.context.path=
# cluster node properties (only configure for cluster nodes) #
nifi.cluster.is.node=true
nifi.cluster.node.address=test-nifi02
nifi.cluster.node.protocol.port=11003
nifi.cluster.node.protocol.threads=10
nifi.cluster.node.protocol.max.threads=50
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.node.max.concurrent.requests=100
nifi.cluster.firewall.file=
nifi.cluster.flow.election.max.wait.time=5 mins
nifi.cluster.flow.election.max.candidates=
Node 3: https://pastebin.com/EqztLfnD
# web properties #
nifi.web.war.directory=./lib
nifi.web.http.host=test-nifi03
nifi.web.http.port=8080
nifi.web.http.network.interface.default=eth0
nifi.web.https.host=
nifi.web.https.port=
nifi.web.https.network.interface.default=
nifi.web.jetty.working.directory=./work/jetty
nifi.web.jetty.threads=200
nifi.web.max.header.size=16 KB
nifi.web.proxy.context.path=
# cluster node properties (only configure for cluster nodes) #
nifi.cluster.is.node=true
nifi.cluster.node.address=test-nifi03
nifi.cluster.node.protocol.port=11003
nifi.cluster.node.protocol.threads=10
nifi.cluster.node.protocol.max.threads=50
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.node.max.concurrent.requests=100
nifi.cluster.firewall.file=
nifi.cluster.flow.election.max.wait.time=5 mins
nifi.cluster.flow.election.max.candidates=
Anyone have any thoughts on what I should be looking at?
EDIT 2:
It's looks like each node is refusing to connect to itself. E.g on node 2 we have
2018-03-07 20:31:08,099 WARN [Replicate Request Thread-4] o.a.n.c.c.h.r.ThreadPoolRequestReplicator Failed to replicate request GET /nifi-api/flow/current-user to test-nifi02:8080 due to javax.ws.rs.ProcessingException: java.net.ConnectException: Connection refused (Connection refused)
and a curl also fails:
[taas#test-nifi02 logs]$ curl http://test-nifi02:8080
curl: (7) Failed connect to test-nifi02:8080; Connection refused
Curls to the other nodes are fine. The same behavior is occurring on each node, e,g node1 can't curl to node1 and node3 can't curl to node3.
Make sure you have dns enabled for network/subnet . if you dont have luxury of dns try updating /etc/hosts files with the exact ip address and hostnames
ex:
192.168.0.1 test-nifi01
192.168.0.2 test-nifi02
in all the hosts of the nifi nodes then try connecting it. and also make sure to check if you can able to connect from one node to another node by using shell via ssh
In a NiFi cluster, a request comes in to one node, and then that node replicates the request to all other nodes.
This issue shown nifi-app.log is indicating that the connection was refused between the current node test-nifi01:8080.
I assume you have hosts like test-nifi01, test-nifi02, and test-nifi03, so each of these nodes would need to be able to resolve the other hostnames. You can start by issuing a ping from one node to the other two, and do that on each node.

Connection refused error in worker logs - apache storm

I see the below error in worker logs, it happens almost every milliseconds, but the cluster is running fine, I wanted to know what does these error mean and any idea on why this would occur.
This happens on all the worker nodes
2016-05-12T15:32:53.514-0500 b.s.m.n.Client [ERROR] connection attempt 3 to Netty-Client-xxxxx.hq.abc.com/xx.xx.xxx.xx:6700 failed: java.net.ConnectException: Connection refused: xxxxx.hq.abc.com/xx.xx.xxx.xxx:6700
And after some time i see this
2016-05-12T15:44:25.940-0500 b.s.m.n.Client [ERROR] discarding 1 messages because the Netty client to Netty-Client-xxxxx.hq.abc.com/xx.xx.xxx.xxx:6700 is being closed
After struggling forever with this problem, I found that setting storm.local.hostname property in the storm.yaml file solved the issue for me. On my laptop, I set storm.zookeeper.servers, nimbus.host and storm.local.hostname all to "localhost".
I am using version 0.10.2 of storm.

Resources