I set up a Mesos cluster with ZooKeeper, 3 master nodes and 12 agent nodes.
Three of the nodes run both master and agent processes.
When I run the C++ test framework, it usually fails, but sometimes the four tasks finish successfully.
On one agent node, I see the stderr log below:
I0420 22:49:05.532886 17577 exec.cpp:162] Version: 1.2.0
I0420 22:49:05.556433 17582 exec.cpp:237] Executor registered on agent 3f442a52-2e2f-4799-8c1a-3a05b27120b9-S1
I0420 22:49:05.811959 17579 exec.cpp:415] Executor asked to shutdown
There are also many connection-closed errors in the master's log:
E0420 22:48:51.964684 22212 process.cpp:2426] Failed to shutdown socket with fd 30: Transport endpoint is not connected
E0420 22:48:54.974160 22212 process.cpp:2426] Failed to shutdown socket with fd 30: Transport endpoint is not connected
E0420 22:48:56.192914 22212 process.cpp:2426] Failed to shutdown socket with fd 27: Transport endpoint is not connected
E0420 22:48:57.999858 22212 process.cpp:2426] Failed to shutdown socket with fd 30: Transport endpoint is not connected
E0420 22:49:00.994969 22212 process.cpp:2426] Failed to shutdown socket with fd 30: Transport endpoint is not connected
E0420 22:49:03.994499 22212 process.cpp:2426] Failed to shutdown socket with fd 30: Transport endpoint is not connected
E0420 22:49:05.999225 22212 process.cpp:2426] Failed to shutdown socket with fd 30: Transport endpoint is not connected
E0420 22:49:11.194205 22212 process.cpp:2426] Failed to shutdown socket with fd 27: Transport endpoint is not connected
E0420 22:49:26.196691 22212 process.cpp:2426] Failed to shutdown socket with fd 27: Transport endpoint is not connected
E0420 22:49:41.198381 22212 process.cpp:2426] Failed to shutdown socket with fd 27: Transport endpoint is not connected
My guess is that the master tries to shut down a connection between itself and an agent, but the connection is already closed.
How can I fix this, or find the real cause of the problem?
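To narrow this down, it may help to confirm that the framework actually registers and that the master still sees all the agents, and then to look at the agent-side log on the node whose executor was asked to shut down. A minimal sketch, assuming a typical setup (master-host, port 5050, and the /var/log/mesos location are assumptions, not taken from your cluster):
# Frameworks, agents, and completed tasks as the master currently sees them
# (replace master-host with one of your masters; 5050 is the default master port).
curl -s http://master-host:5050/master/state | python -m json.tool | less
# On the affected agent, check why the executor was told to shut down
# (the log directory depends on how the agent was started; /var/log/mesos is a common default).
tail -n 200 /var/log/mesos/mesos-agent.*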
Related
I am trying to start the gRPC server with the property
quarkus.grpc.server.use-separate-server=true
In that case, I am getting the below error during server startup:
2023-01-19 13:12:51,762 WARN [io.qua.grp.run.GrpcServerRecorder] (main) Using legacy gRPC support, with separate new HTTP server instance. Switch to single HTTP server instance usage with quarkus.grpc.server.use-separate-server=false property
2023-01-19 13:12:51,824 INFO [io.qua.grp.run.GrpcServerRecorder] (vert.x-eventloop-thread-0) Registering gRPC reflection service
2023-01-19 13:12:51,934 ERROR [io.qua.grp.run.GrpcServerRecorder] (vert.x-eventloop-thread-0) Unable to start the gRPC server: java.nio.channels.UnresolvedAddressException
at java.base/sun.nio.ch.Net.checkAddress(Net.java:149)
at java.base/sun.nio.ch.Net.checkAddress(Net.java:157)
at java.base/sun.nio.ch.ServerSocketChannelImpl.netBind(ServerSocketChannelImpl.java:330)
at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:294)
at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:141)
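One thing worth checking for the UnresolvedAddressException with the separate server is the bind address and port of the dedicated gRPC server, since an unresolvable hostname there fails exactly at bind time. A minimal sketch, assuming the Quarkus defaults shown below have been overridden somewhere in your configuration:
# Defaults for the dedicated gRPC server; an unresolvable value in
# quarkus.grpc.server.host would surface as UnresolvedAddressException at startup.
quarkus.grpc.server.host=0.0.0.0
quarkus.grpc.server.port=9000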
But when I start the gRPC server with the property
quarkus.grpc.server.use-separate-server=false
the gRPC server starts, but the client is not able to access the server.
I am getting the below error on the client side:
13:54:28 ERROR line=111 traceId=, parentId=, spanId=, sampled= [qu.ms.of.OfferResource] (executor-thread-0) Exception: UNAVAILABLE: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111: io.grpc.StatusRuntimeException: UNAVAILABLE: upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:271)
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:252)
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:165)
How do we overcome this issue?
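If the caller is also a Quarkus application, one thing to verify is that the client targets the unified HTTP port once use-separate-server=false is set, since the gRPC services are then served by the main HTTP server (port 8080 by default) rather than the dedicated gRPC port 9000. A minimal sketch, assuming a hypothetical client named offer injected via @GrpcClient("offer"):
# With use-separate-server=false the services are exposed on the main HTTP server,
# so the client must point at that port (8080 by default), not 9000.
quarkus.grpc.clients.offer.host=localhost
quarkus.grpc.clients.offer.port=8080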
We have an issue with the ambari-metrics-collector service (we have an HDP cluster, version 2.6.4, with 8 nodes).
The ambari-metrics-collector service either fails to start, or starts for a few seconds and then fails.
Details of the metrics collector version:
rpm -qa | grep metrics
ambari-metrics-grafana-2.6.1.0-143.x86_64
ambari-metrics-monitor-2.6.1.0-143.x86_64
ambari-metrics-collector-2.5.0.3-7.x86_64
ambari-metrics-hadoop-sink-2.6.1.0-143.x86_64
All machines are RHEL 7.2.
We performed the following steps in order to resolve the problem:
1. Restart the metrics-collector service:
su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf/ stop'
su - ams -c '/usr/sbin/ambari-metrics-collector --config /etc/ambari-metrics-collector/conf/ start'
or
ambari-metrics-collector stop
ambari-metrics-collector start
2. Restart ambari-metrics-monitor on all nodes:
ambari-metrics-monitor stop
ambari-metrics-monitor start
3. Clean the folder /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper/:
mv /var/lib/ambari-metrics-collector/hbase-tmp/zookeeper/zookeeper_0 /tmp/bck/zookeeper/
then restart the metrics-collector service.
4. Tune the metrics-collector parameters according to https://docs.cloudera.com/HDPDocuments/Ambari-2.2.1.0/bk_ambari_reference_guide/content/_ams_general_guidelines.html
We updated the following parameters in Ambari:
metrics_collector_heap_size=1024
hbase_regionserver_heapsize=1024
hbase_master_heapsize=512
hbase_master_xmn_size=128
Status for now: steps 1-4 didn't help.
From the logs we can see the following:
Log file: ambari-metrics-collector.log
2020-06-25 09:06:14,474 WARN org.apache.zookeeper.ClientCnxn: Session 0x172eab71f310002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2020-06-25 09:06:14,575 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=master02.sys671.com:61181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /ams-hbase-unsecure/meta-region-server
Log file: hbase-ams-master-master02.sys671.com.log
2020-06-25 09:38:18,799 WARN [RS:0;master02:51842-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Session 0x172ead5d73a0004 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1125)
2020-06-25 09:38:20,437 INFO [main-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Opening socket connection to server master02.sys671.com/23.2.35.171:61181. Will not attempt to authenticate using SASL (unknown error)
2020-06-25 09:38:20,438 WARN [main-SendThread(master02.sys671.com:61181)] zookeeper.ClientCnxn: Session 0x172ead5d73a0002 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
We also do not see the port listening (timeline.metrics.service.webapp.address):
netstat -tulpn | grep 6188
Any advice on how to continue from this point? We would appreciate any help with this problem.
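Since the logs show the collector repeatedly failing to reach its embedded HBase ZooKeeper on master02.sys671.com:61181, a few quick checks on the collector host may help. A sketch, assuming the usual AMS defaults (port 61181 is taken from the quorum shown in the logs; the log path and the ams user are common defaults, not confirmed from your cluster):
# Is the embedded AMS HBase ZooKeeper listening on the port shown in the logs?
netstat -tulpn | grep 61181
# Are there stale collector/embedded-HBase processes (they usually run as the ams user) left over from a failed start?
ps -ef | grep '^ams' | grep -v grep
# The embedded HBase master log usually explains why its ZooKeeper did not come up.
tail -n 200 /var/log/ambari-metrics-collector/hbase-ams-master-*.log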
I have a Storm cluster with 1 Nimbus, 4 Supervisors and 2 ZooKeeper nodes. My storm.yaml is as follows:
storm.zookeeper.servers:
- "storage14"
- "storage15"
nimbus.seeds: ["storage01"]
#storm.local.hostname: "storage05"
supervisor.supervisors:
- "storage02"
- "storage03"
- "storage04"
- "storage05"
storm.local.dir: "/tmp/storm"
worker.childopts: "-Xmx%HEAP-MEM%m -XX:+PrintGCDetails -Xloggc:artifacts/gc.log -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=artifacts/heapdump"
This storm.yaml file is used by both Nimbus and the Supervisors. When Nimbus is started, I have storm.local.hostname commented out, as shown above.
However, when starting Supervisors on their respective nodes, I uncomment storm.local.hostname and set it to the hostname of the node on which the supervisor is being launched. For instance, if I were launching the supervisor on storage05, the storm.yaml file would have the following additional config param:
storm.local.hostname: "storage05"
The problem is that even though Nimbus launches successfully and I can see it on the Storm UI, some supervisors do not seem to be able to connect to Nimbus. For instance, of the 4 nodes I start supervisors on, the Storm UI often shows only 2 of them connected. However, if I ssh into these nodes and run jps, I can see that the supervisor process is running on ALL of them.
The supervisors that do end up connecting are not always the same ones, so it is definitely not a problem with those specific nodes.
Another thing to note: if I try to submit a topology to whichever nodes did connect, it does not get registered by the cluster, and I cannot see that topology on the UI either.
What do you think might be causing this erratic behavior?
UPDATE:
The tail end of nimbus.log has the following lines:
2017-01-25 00:04:25.216 o.a.s.s.o.a.z.ClientCnxn [WARN] Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
at org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2017-01-25 00:04:25.317 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server storage15/192.168.140.195:2181. Will not attempt to authenticate using SASL (unknown error)
2017-01-25 00:04:25.317 o.a.s.s.o.a.z.ClientCnxn [WARN] Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
at org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2017-01-25 00:04:25.686 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server storage15/192.168.140.195:2181. Will not attempt to authenticate using SASL (unknown error)
2017-01-25 00:04:25.686 o.a.s.s.o.a.z.ClientCnxn [WARN] Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
at org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2017-01-25 00:04:25.787 o.a.s.s.o.a.z.ClientCnxn [INFO] Opening socket connection to server storage14/192.168.140.194:2181. Will not attempt to authenticate using SASL (unknown error)
2017-01-25 00:04:25.787 o.a.s.s.o.a.z.ClientCnxn [WARN] Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:701)
at org.apache.storm.shade.org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.storm.shade.org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
Your UPDATE (the nimbus log) indicates that Nimbus cannot connect to the ZooKeeper cluster. Please check that the ZooKeeper cluster (storage14/storage15) is reachable from storage01 (not just that the node is reachable, but also that you can telnet to the ZooKeeper server via "telnet storage14 (and/or storage15) 2181").
Once the ZK connectivity issue is gone, please try starting the supervisors again.
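For example, from storage01, a sketch of the checks described above (the ruok probe assumes the four-letter-word commands are enabled on the ZooKeeper servers):
# Raw TCP reachability of the ZooKeeper client port from the Nimbus host.
telnet storage14 2181
telnet storage15 2181
# Ask ZooKeeper whether it is serving requests; a healthy server answers "imok".
echo ruok | nc storage14 2181
echo ruok | nc storage15 2181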
I am getting the following error while running the command to start the peer node.
Error:
grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:5005: getsockopt: connection refused"; Reconnecting to {"127.0.0.1:5005" }
Can anybody help me out?
This happened to me as well after I just set up a development environment in a vagrant VM, following the instructions from http://hyperledger-fabric.readthedocs.io/en/latest/dev-setup/build/.
The connection to "127.0.0.1:5005" is configured in peer/core.yaml:
# orderer to talk to
orderer: 127.0.0.1:5005
So, the peer expects an orderer service listening on that port. The orderer service (https://github.com/hyperledger/fabric/blob/master/orderer/README.md) listens on port 5151 by default. This is configured in https://github.com/hyperledger/fabric/blob/master/orderer/orderer.yaml.
Build the orderer with make orderer and start it with orderer. Adjust the port in peer/core.yaml to 5151 (the one the orderer service listens on), rebuild the peer with make peer, run peer node start, and you will see that the error message disappears and the peer starts correctly:
...
09:51:50.430 [chaincode] notify -> DEBU 056 notifying Txid:vscc
09:51:50.430 [chaincode] Launch -> DEBU 057 sending init completed
09:51:50.430 [chaincode] Launch -> DEBU 058 LaunchChaincode complete
09:51:50.430 [sysccapi] RegisterSysCC -> INFO 059 system chaincode %s(%s) registered vscc github.com/hyperledger/fabric/core/system_chaincode/vscc
09:51:50.433 [committer] NewDeliverService -> INFO 05a Creating committer for single noops endorser
09:51:50.437 [nodeCmd] serve -> INFO 05b Starting peer with ID=name:"jdoe" , network ID=dev, address=0.0.0.0:7051, rootnodes=, validator=true
Nil tx from block
Commit success, created a block!
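For reference, the sequence described above looks roughly like this, a sketch assuming the built binaries end up on your PATH, as in the vagrant dev environment:
# Build and start the orderer (it listens on port 5151 by default).
make orderer
orderer &
# Edit peer/core.yaml so the peer talks to the orderer on that port:
#   orderer: 127.0.0.1:5151
# then rebuild and restart the peer.
make peer
peer node start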
After some JMS connectivity problems, I've noticed the following in the logs:
o.a.a.b.T.Transport:? - Transport Connection to: tcp://100.100.100.100:61141 failed: java.io.EOFException
o.a.a.b.T.Transport:? - Transport Connection to: tcp://100.100.100.100:61156 failed: java.io.EOFException
o.a.a.b.T.Transport:? - Transport Connection to: tcp://100.100.100.100:61148 failed: java.io.EOFException
o.a.a.b.T.Transport:? - Transport Connection to: tcp://100.100.100.100:61161 failed: java.io.EOFException
o.a.a.b.T.Transport:? - Transport Connection to: tcp://100.100.100.100:61192 failed: java.io.EOFException
o.a.a.b.T.Transport:? - Transport Connection to: tcp://100.100.100.100:61197 failed: java.io.EOFException
o.a.a.b.T.Transport:? - Transport Connection to: tcp://100.100.100.100:61226 failed: java.io.EOFException
o.a.a.b.T.Transport:? - Transport Connection to: tcp://100.100.100.100:61273 failed: java.io.EOFException
o.a.a.b.T.Transport:? - Transport Connection to: tcp://100.100.100.100:61241 failed: java.io.EOFException
Why is JMS retrying in this way? Does the ActiveMQ client/broker technology have any port discovery/negotiation protocol?
ActiveMQ has discovery using IP multicast, as well as failover mechanisms. It's not clear why it reconnects to the same host; the failover config may list the same host. The client configuration would be helpful for understanding this.
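For reference, a client-side failover broker URL typically looks like the sketch below; the hostnames, ports, and option values are placeholders, not taken from the question:
# Failover transport: the client retries/reconnects across the listed brokers.
# maxReconnectAttempts and initialReconnectDelay tune the retry behaviour.
failover:(tcp://broker1:61616,tcp://broker2:61616)?maxReconnectAttempts=10&initialReconnectDelay=1000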