What will cause zookeeper Client session timed out - apache-storm

I deployed a long-running Storm topology. After it had been running for several hours, the whole topology went down. I checked the worker logs and found the entries below. They show that the ZooKeeper client session timed out, which triggered a reconnect. I suspect this is related to the topology failure, so I am trying to find out what can cause the client to time out.
2016-02-29T10:34:12.386+0800 o.a.s.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 23789ms for sessionid 0x252f862028c0083, closing socket connection and attempting reconnect
2016-02-29T10:34:12.986+0800 o.a.s.c.f.s.ConnectionStateManager [INFO] State change: SUSPENDED
2016-02-29T10:34:13.059+0800 b.s.cluster [WARN] Received event :disconnected::none: with disconnected Zookeeper.
2016-02-29T10:34:13.197+0800 o.a.s.z.ClientCnxn [INFO] Opening socket connection to server zk-3.cloud.mos/172.16.13.147:2181. Will not attempt to authenticate using SASL (unknown error)
2016-02-29T10:34:13.241+0800 o.a.s.z.ClientCnxn [WARN] Session 0x252f862028c0083 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.8.0_31]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716) ~[na:1.8.0_31]
at org.apache.storm.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) ~[storm-core-0.9.6.jar:0.9.6]
at org.apache.storm.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) ~[storm-core-0.9.6.jar:0.9.6]

Your client can no longer talk to the ZooKeeper server. The first thing that happened was there was no answer to the heartbeats within the negotiated session timeout:
2016-02-29T10:34:12.386+0800 o.a.s.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 23789ms for sessionid 0x252f862028c0083, closing socket connection and attempting reconnect
Then when it tried to reconnect, it got a connection refused:
2016-02-29T10:34:13.241+0800 o.a.s.z.ClientCnxn [WARN] Session 0x252f862028c0083 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
This means either your ZooKeeper server:
Is not reachable (network connection down)
Is dead (so nothing is listening on the socket)
Is GCing itself to death and cannot communicate (although that might have issued a connection timeout error, I'm not sure)
To tell more you will need to check the ZooKeeper server logs on your (Hadoop?) cluster.
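The cases above can be told apart, roughly, from the client side with a plain TCP connect. The following is a minimal sketch, not part of Storm or ZooKeeper; the function name probe_zookeeper and its result labels are made up for illustration. A refused connection suggests that nothing is listening on the port, while a timeout suggests a network problem or a host that is too busy to answer.

```python
import socket


def probe_zookeeper(host, port=2181, timeout=3.0):
    """Classify a plain TCP connect attempt to host:port.

    Hypothetical helper for illustration only. Returns one of
    "listening", "refused", "timeout", or "unreachable".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            # The TCP handshake succeeded, so some process is listening.
            return "listening"
    except ConnectionRefusedError:
        # The host answered, but no process is listening on the port.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer in time: packets dropped or host unresponsive.
        return "timeout"
    except OSError:
        # DNS failure, no route to host, and similar errors.
        return "unreachable"
```

Calling probe_zookeeper("zk-3.cloud.mos") from the worker machine would check the server named in the log above. A "listening" result only shows that the port accepts connections; it does not prove the ZooKeeper process is healthy, so the server logs are still the place to look.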

Increasing the connection timeout in server.properties worked for me:
zookeeper.connection.timeout.ms=60000

One way this can happen is if you start ZooKeeper, interrupt it in the terminal (which stops it), and then try to start Kafka.
To use Kafka, you really should use 3 terminal windows (or 3 PuTTY sessions if you are SSHing into your instance from Windows):
First session for the ZooKeeper server.
Second session for the Kafka server.
Third session for running Kafka commands to do things like create topics.

I started Kafka in cluster mode with 3 ZooKeeper servers and 3 Kafka servers. All ZooKeeper servers started successfully, but each Kafka server disconnected during startup with "fatal error during Kafka server startup. prepare to shutdown (kafka.server.kafkaserver)". While investigating, I found that the Kafka server disconnected after 18 seconds every time (the default value of zookeeper.connection.timeout.ms, 18000), so I updated that setting and the issue was resolved.

Always use 2181 as the port number for the ZooKeeper connection unless you have configured ZooKeeper to use a different port.

Related

Region server issue

I get warnings like these every time the region server stops. Any suggestions, please?
WARN [regionserver60020.compactionChecker] util.Sleeper: We slept 36092ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
regionserver60020] util.Sleeper: We slept 19184ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2015-10-08 04:11:33,030 INFO [regionserver60020-SendThread(xxxx)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 24827ms for sessionid 0x54ee10bc6f21c05, closing socket connection and attempting reconnect
2015-10-08 04:11:33,030 INFO [SplitLogWorker-xxxx,60020,1443823580220-SendThread( xxxx:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 24356ms for sessionid 0x44ee10bc6361c47, closing socket connection and attempting reconnect
2015-10-08 04:11:33,607 INFO [SplitLogWorker-xxx,60020,1443823580220-SendThread(xxxxx)] client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism.
2015-10-08 22:46:15,211 WARN [main-SendThread(xxxxxxx)] client.ZooKeeperSaslClient: Could not login: the client is being asked for a password, but the Zookeeper client code does not currently support obtaining a password from the user. Make sure that the client is configured to use a ticket cache (using the JAAS configuration setting 'useTicketCache=true)' and restart the client. If you still get this message after that, the TGT in the ticket cache has expired and must be manually refreshed. To do so, first determine if you are using a password or a keytab. If the former, run kinit in a Unix shell in the environment of the user who is running this Zookeeper client using the command 'kinit <princ>' (where <princ> is the name of the client's Kerberos principal). If the latter, do 'kinit -k -t <keytab> <princ>' (where <princ> is the name of the Kerberos principal, and <keytab> is the location of the keytab file). After manually refreshing your cache, restart this client. If you continue to see this message after manually refreshing your cache, ensure that your KDC host's clock is in sync with this host's clock.
2015-10-08 22:46:15,919 WARN [main-SendThread(xxxxxxxx)] zookeeper.ClientCnxn: SASL configuration failed: javax.security.auth.login.LoginException: Checksum failed Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.

Could not connect to broker URL after 5000 ms [JMS]

I use Spring Integration 3.0.0 with ActiveMQ 5.11.1, and it works without any problem. However, I noticed that when I stop ActiveMQ, I get an error in my logs every 5 seconds.
Do you have any idea what causes this?
Error:
ERROR [org.springframework.jms.listener.DefaultMessageListenerContainer#0-1] [DefaultMessageListenerContainer] Could not refresh JMS Connection for destination 'topic' - retrying in 5000 ms. Cause: Could not connect to broker URL: localhost. Reason: java.net.ConnectException: Connection refused: connect
When the listener container loses the connection, it tries to reconnect every 5 seconds by default until the broker is running again.
You can configure the time and/or add an exponential back off. See setRecoveryInterval and setBackOff.
Or, call stop() on the container to stop the attempts.
Call start() to start again.
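To see what an exponential back-off does to the retry schedule compared with a fixed 5-second interval, here is a small Python sketch. It is not Spring's implementation; the function name and the parameter values (2000 ms initial interval, multiplier 1.5, 30000 ms cap) are assumptions chosen for illustration.

```python
def backoff_intervals(initial_ms=2000, multiplier=1.5, max_ms=30000, attempts=8):
    """Return the wait (in ms) before each retry.

    Each wait is the previous one times `multiplier`, capped at `max_ms`.
    A multiplier of 1.0 gives a fixed retry interval.
    """
    intervals = []
    current = float(initial_ms)
    for _ in range(attempts):
        intervals.append(int(min(current, max_ms)))
        current *= multiplier
    return intervals
```

With these assumed values the first four waits are 2000, 3000, 4500, and 6750 ms, so the log fills up more slowly the longer the broker stays down, while the first reconnect attempts still happen quickly.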

HBase connection hangs at INFO ClientCnxn - Session establishment complete on server

I am trying to connect to HBase (version 0.94.18) on Hadoop (2.4) from Eclipse, and the connection hangs after the output below. This happens on my local machine only.
The code works fine on the server. Any thoughts?
INFO ZooKeeper - Client environment:user.dir=D:\eclipse\eclipse-jee-64\eclipse
INFO ZooKeeper - Initiating client connection, connectString=11.45.66.78:2181 sessionTimeout=180000 watcher=hconnection
INFO ClientCnxn - Opening socket connection to server ip-55-77-77-99.ec2.internal/11.45.66.78:2181. Will not attempt to authenticate using SASL (unknown error)
INFO ClientCnxn - Socket connection established to ip-55-77-77-99.ec2.internal/11.45.66.78:2181, initiating session
INFO ClientCnxn - Session establishment complete on server ip-55-77-77-99.ec2.internal/11.45.66.78:2181, sessionid = 0x14b0dc1e5030dd7, negotiated timeout = 180000
I don't know if I should answer this a year and a half after it was asked, but you should add all hostnames from the cluster (on AWS, both the public and private DNS names) to your local /etc/hosts file. I had this issue this week and this resolved my problem.
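A quick way to find out which cluster hostnames the local machine cannot resolve is sketched below. This is only an illustration: the function name unresolved_hosts is made up, and the list of hostnames to check has to come from your own cluster.

```python
import socket


def unresolved_hosts(hostnames):
    """Return the hostnames that this machine cannot resolve to an address."""
    missing = []
    for name in hostnames:
        try:
            socket.gethostbyname(name)
        except OSError:
            # socket.gaierror (resolution failure) is a subclass of OSError.
            missing.append(name)
    return missing
```

For example, unresolved_hosts(["ip-55-77-77-99.ec2.internal"]) would report the internal EC2 name from the log above if it is missing from /etc/hosts and DNS. Any name returned is a candidate for an /etc/hosts entry.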

Reason for region server to become dead

I have a 3-node HBase cluster running on Amazon EC2, which works perfectly fine. Now I am trying to insert data from EMR into the EC2 cluster using two separate insert queries. The first insert query works and inserts the data, but after that all of my region servers die. Could you please suggest general guidelines for debugging this problem, and explain why region servers generally die?
Moreover, even if I explicitly restart the region servers, they die again after some time.
Updated question:
Earlier I thought the problem might be due to HBASE_HEAPSIZE, which is set to 1 GB by default. But I increased it to 5.5 GB and the region servers still die.
Below are the logs I get on every region server after it dies.
2013-10-07 18:16:27,949 WARN org.apache.zookeeper.ClientCnxn: Session 0x141916dfbe50000 for server null, unexpected error, closing socket connection and attempting rec$
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:597)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:286)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1035)
2013-10-07 18:16:27,990 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/10.179.42.93:50020. Already tried 1 time(s).
2013-10-07 18:16:28,049 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server master/10.179.42.93:2181
2013-10-07 18:16:28,049 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: Client will not SASL-authenticate because the default JAAS configuration section 'Client'$
2013-10-07 18:16:28,049 WARN org.apache.zookeeper.ClientCnxn: Session 0x141916dfbe50001 for server null, unexpected error, closing socket connection and attempting rec$
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:597)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:286)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1035)
2013-10-07 18:16:28,177 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server slave/10.178.5.52:2181
2013-10-07 18:16:28,177 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: Client will not SASL-authenticate because the default JAAS configuration section 'Client'$
2013-10-07 18:16:28,178 WARN org.apache.zookeeper.ClientCnxn: Session 0x141916dfbe50001 for server null, unexpected error, closing socket connection and attempting rec$
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:597)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:286)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1035)
You can check the RegionServer logs. More information about the log location is here:
http://hbase.apache.org/book/trouble.log.html
If you have to explicitly start the region servers every time, that is a problematic situation.
The best way is to spin up a new EMR instance with HBase.

How to choose zookeeper and regionserver

What is a good practice for setting the regionserver and ZooKeeper quorum?
I have a small Hadoop cluster with 16 nodes. Following the example given in http://hbase.apache.org/book/example_config.html I chose all 16 nodes as regionservers and a subset of these nodes as ZooKeeper servers.
But when a job is launched by a node that is not in the list given in hbase.zookeeper.quorum, I get the following error:
13/08/23 15:40:05 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error)
13/08/23 15:40:05 WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
13/08/23 15:40:05 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
13/08/23 15:40:05 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
13/08/23 15:40:05 WARN zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
13/08/23 15:40:05 INFO util.RetryCounter: Sleeping 2000ms before retry #1...
So it tries to connect for 600 sec and then returns
Task attempt_xxx failed to report status for 60 seconds. Killing!
After a few attempts it switches to another node, and if by chance the new node belongs to the ZooKeeper list, then the job finishes successfully.
Is this normal?
I ended up adding all nodes to the ZooKeeper list, but I would like to know if that is a good practice. Also, is there any case where the list of regionservers should differ from the node list?
Thank you
No, it doesn't look like what you're doing is a good practice. For a 16 RS cluster, 1 ZK node should be just fine.
Check out the ZK Admin guide:
For the ZooKeeper service to be active,
there must be a majority of non-failing machines that can communicate
with each other. To create a deployment that can tolerate the failure
of F machines, you should count on deploying 2xF+1 machines. Thus, a
deployment that consists of three machines can handle one failure, and
a deployment of five machines can handle two failures. Note that a
deployment of six machines can only handle two failures since three
machines is not a majority. For this reason, ZooKeeper deployments are
usually made up of an odd number of machines.
Although it doesn't say it there, a ZK cluster should be no bigger than 7 nodes. Given the recommendation of an odd number of nodes, that leaves the options of 1, 3, 5, and 7. Again for a smallish cluster like yours, 1 should suffice, but 3 will give you resiliency. 5 is probably overkill. 7 definitely is.
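The majority rule quoted above is simple arithmetic, sketched here in Python. The function names are made up for illustration and are not part of ZooKeeper.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size):
    """How many servers can fail while a majority still remains."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (1, 3, 5, 6, 7):
        print(n, "servers: quorum", quorum_size(n),
              "tolerates", tolerated_failures(n), "failure(s)")
```

A six-server ensemble needs four servers for a majority, so it tolerates only two failures, the same as a five-server one. That is why even ensemble sizes add cost without adding resilience.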
Also, looking at the error you pasted:
java.net.ConnectException: Connection refused
This would appear to indicate either:
Hadoop misconfiguration: you pointed to the wrong server/port, or the service is not currently running, or more likely -
Network misconfiguration, such as a firewall like iptables running

Resources