I have found that our RegionServers connect to ZooKeeper frequently. They seem to constantly establish a session, close it, and reconnect to ZooKeeper. Here are the logs from both the server and client sides. I have no idea why this happens or how to deal with it. We're using HBase 0.94.11 and ZooKeeper 3.4.4.
The log from HBase RegionServer:
2014-09-18,16:38:17,867 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=10.2.201.74:11000,10.2.201.73:11000,10.101.10.67:11000,10.101.10.66:11000,10.2.201.75:11000 sessionTimeout=30000 watcher=catalogtracker-on-org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation#69d892a1
2014-09-18,16:38:17,868 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism.
2014-09-18,16:38:17,868 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server lg-hadoop-srv-ct01.bj/10.2.201.73:11000. Will attempt to SASL-authenticate using Login Context section 'Client'
2014-09-18,16:38:17,868 INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: The identifier of this process is 11787#lg-hadoop-srv-st05.bj
2014-09-18,16:38:17,868 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to lg-hadoop-srv-ct01.bj/10.2.201.73:11000, initiating session
2014-09-18,16:38:17,870 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server lg-hadoop-srv-ct01.bj/10.2.201.73:11000, sessionid = 0x248782700e52b3c, negotiated timeout = 30000
2014-09-18,16:38:17,876 INFO org.apache.zookeeper.ZooKeeper: Session: 0x248782700e52b3c closed
2014-09-18,16:38:17,876 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2014-09-18,16:38:17,878 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Total replicated: 24
The log from its ZooKeeper server:
2014-09-18,16:38:17,869 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: [myid:2] Accepted socket connection from /10.2.201.76:55621
2014-09-18,16:38:17,869 INFO org.apache.zookeeper.server.ZooKeeperServer: [myid:2] Client attempting to establish new session at /10.2.201.76:55621
2014-09-18,16:38:17,870 INFO org.apache.zookeeper.server.ZooKeeperServer: [myid:2] Established session 0x248782700e52b3c with negotiated timeout 30000 for client /10.2.201.76:55621
2014-09-18,16:38:17,872 INFO org.apache.zookeeper.server.auth.SaslServerCallbackHandler: [myid:2] Successfully authenticated client: authenticationID=hbase_srv/hadoop#XIAOMI.HADOOP; authorizationID=hbase_srv/hadoop#XIAOMI.HADOOP.
2014-09-18,16:38:17,872 INFO org.apache.zookeeper.server.auth.SaslServerCallbackHandler: [myid:2] Setting authorizedID: hbase_srv
2014-09-18,16:38:17,872 INFO org.apache.zookeeper.server.ZooKeeperServer: [myid:2] adding SASL authorization for authorizationID: hbase_srv
2014-09-18,16:38:17,877 INFO org.apache.zookeeper.server.NIOServerCnxn: [myid:2] Closed socket connection for client /10.2.201.76:55621 which had sessionid 0x248782700e52b3c
Finally I have found the root cause.
Yes, it's about ReplicationSink. I found the log line "2014-09-23,14:58:01,736 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Replicating for table online_miliao_recent".
Then I looked at the relevant code and found that every time replicateEntries() is called, it also invokes sharedHBaseAdmin.tableExists(table).
sharedHBaseAdmin.tableExists() creates a new CatalogTracker object, which is itself a ZooKeeper client.
When the method exits, it cleans up the ZooKeeper client and its session.
So the log looks reasonable, because replication is running. But tableExists() is a little heavy, and I don't think we should invoke it every time we replicate entries. I also noticed that CatalogTracker is no longer used in ReplicationSink after 0.94.11, so this is not a problem in later versions.
It would be great to find the JIRA that removed the CatalogTracker from ReplicationSink :-)
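A minimal sketch of the kind of workaround I mean (not the actual upstream fix; the class and field names here are hypothetical): cache the result of tableExists() so that replicateEntries() does not open and tear down a CatalogTracker/ZooKeeper session for every batch.

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.hbase.client.HBaseAdmin;

// Hypothetical helper, for illustration only: remembers which tables exist so the
// expensive HBaseAdmin.tableExists() call (which in 0.94.11 spins up a CatalogTracker,
// i.e. a new ZooKeeper client, and tears it down again) runs at most once per table.
public class CachedTableExistence {
  private final HBaseAdmin sharedHBaseAdmin;
  private final Map<String, Boolean> existsCache = new ConcurrentHashMap<String, Boolean>();

  public CachedTableExistence(HBaseAdmin sharedHBaseAdmin) {
    this.sharedHBaseAdmin = sharedHBaseAdmin;
  }

  public boolean tableExists(String table) throws IOException {
    Boolean cached = existsCache.get(table);
    if (cached != null) {
      return cached.booleanValue();
    }
    boolean exists = sharedHBaseAdmin.tableExists(table); // the heavy call made from replicateEntries()
    existsCache.put(table, exists);
    return exists;
  }
}

In later versions the CatalogTracker usage was removed from ReplicationSink entirely, so no such cache is needed there.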
Related
I deployed a long-running Storm topology. After several hours of running, the whole topology went down. I checked the worker logs and found the entries below. As they show, the ZooKeeper client session timed out, which caused a reconnection. I suspect this is related to my broken topology. Now I am trying to find out what can cause client timeouts.
2016-02-29T10:34:12.386+0800 o.a.s.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 23789ms for sessionid 0x252f862028c0083, closing socket connection and attempting reconnect
2016-02-29T10:34:12.986+0800 o.a.s.c.f.s.ConnectionStateManager [INFO] State change: SUSPENDED
2016-02-29T10:34:13.059+0800 b.s.cluster [WARN] Received event :disconnected::none: with disconnected Zookeeper.
2016-02-29T10:34:13.197+0800 o.a.s.z.ClientCnxn [INFO] Opening socket connection to server zk-3.cloud.mos/172.16.13.147:2181. Will not attempt to authenticate using SASL (unknown error)
2016-02-29T10:34:13.241+0800 o.a.s.z.ClientCnxn [WARN] Session 0x252f862028c0083 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.8.0_31]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716) ~[na:1.8.0_31]
at org.apache.storm.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) ~[storm-core-0.9.6.jar:0.9.6]
at org.apache.storm.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) ~[storm-core-0.9.6.jar:0.9.6]
Your client can no longer talk to the ZooKeeper server. The first thing that happened was that there was no answer to the heartbeats within the negotiated session timeout:
2016-02-29T10:34:12.386+0800 o.a.s.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 23789ms for sessionid 0x252f862028c0083, closing socket connection and attempting reconnect
Then when it tried to reconnect, it got a connection refused:
2016-02-29T10:34:13.241+0800 o.a.s.z.ClientCnxn [WARN] Session 0x252f862028c0083 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
This means your ZooKeeper server either:
Is not reachable (the network connection is down)
Is dead (so nothing is listening on the socket)
Is GCing itself to death and cannot respond (although that would probably produce a connection timeout instead; I'm not sure)
To tell more, you will need to check the ZooKeeper server logs on your (Hadoop?) cluster. A quick way to check whether anything is listening at all is sketched below.
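As a minimal sketch (the hostname is taken from the log above; adjust host and port for your ensemble), you can probe a ZooKeeper server with the standard "ruok" four-letter command over a plain socket. A healthy server answers "imok", while "Connection refused" here means nothing is listening on that port:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Illustrative probe: sends ZooKeeper's "ruok" four-letter command and prints the reply.
public class ZkRuokProbe {
  public static void main(String[] args) throws Exception {
    String host = args.length > 0 ? args[0] : "zk-3.cloud.mos"; // host from the log above
    int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;

    Socket socket = new Socket();
    try {
      // A java.net.ConnectException here matches the "Connection refused" in the logs:
      // nothing is accepting connections on that host/port.
      socket.connect(new InetSocketAddress(host, port), 5000);
      OutputStream out = socket.getOutputStream();
      out.write("ruok".getBytes(StandardCharsets.UTF_8));
      out.flush();
      socket.shutdownOutput();

      InputStream in = socket.getInputStream();
      byte[] buf = new byte[16];
      int n = in.read(buf);
      System.out.println(n > 0 ? new String(buf, 0, n, StandardCharsets.UTF_8) : "<no reply>"); // expect "imok"
    } finally {
      socket.close();
    }
  }
}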
It worked for me after increasing the connection timeout in server.properties:
zookeeper.connection.timeout.ms=60000
One way this can happen is if you start ZooKeeper, then break out of it in the terminal, and then try to start Kafka.
To use Kafka, you really should use three terminal windows (or three PuTTY sessions if you are SSHing into your instance from Windows); typical commands for each are sketched after this list:
First session: the ZooKeeper server.
Second session: the Kafka server.
Third session: running Kafka commands to do things like creating topics.
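For example, from the Kafka installation directory (the topic name is only an illustration, and this assumes the older Kafka CLI that still takes a --zookeeper flag):
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test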
I started Kafka in cluster mode with 3 ZooKeeper servers and 3 Kafka servers. All ZooKeeper servers started successfully, but while starting the Kafka servers they got disconnected with "Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)". While investigating, I found that each Kafka server got disconnected after 18 seconds (which is the zookeeper.connection.timeout.ms = 18000 default value), so I increased that setting and the issue was resolved.
Always use 2181 as the port number for the ZooKeeper connection unless you have configured ZooKeeper otherwise!
I am getting warnings like these every time the region server stops. Any suggestions, please?
WARN [regionserver60020.compactionChecker] util.Sleeper: We slept 36092ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
regionserver60020] util.Sleeper: We slept 19184ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2015-10-08 04:11:33,030 INFO [regionserver60020-SendThread(xxxx)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 24827ms for sessionid 0x54ee10bc6f21c05, closing socket connection and attempting reconnect
2015-10-08 04:11:33,030 INFO [SplitLogWorker-xxxx,60020,1443823580220-SendThread( xxxx:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 24356ms for sessionid 0x44ee10bc6361c47, closing socket connection and attempting reconnect
2015-10-08 04:11:33,607 INFO [SplitLogWorker-xxx,60020,1443823580220-SendThread(xxxxx)] client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism.
2015-10-08 22:46:15,211 WARN [main-SendThread(xxxxxxx)] client.ZooKeeperSaslClient: Could not login: the client is being asked for a password, but the Zookeeper client code does not currently support obtaining a password from the user. Make sure that the client is configured to use a ticket cache (using the JAAS configuration setting 'useTicketCache=true)' and restart the client. If you still get this message after that, the TGT in the ticket cache has expired and must be manually refreshed. To do so, first determine if you are using a password or a keytab. If the former, run kinit in a Unix shell in the environment of the user who is running this Zookeeper client using the command 'kinit <princ>' (where <princ> is the name of the client's Kerberos principal). If the latter, do 'kinit -k -t <keytab> <princ>' (where <princ> is the name of the Kerberos principal, and <keytab> is the location of the keytab file). After manually refreshing your cache, restart this client. If you continue to see this message after manually refreshing your cache, ensure that your KDC host's clock is in sync with this host's clock.
2015-10-08 22:46:15,919 WARN [main-SendThread(xxxxxxxx)] zookeeper.ClientCnxn: SASL configuration failed: javax.security.auth.login.LoginException: Checksum failed Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
I am trying to connect to HBase (version 0.94.18) on Hadoop (2.4) from my Eclipse, and the connection hangs after this. This happens on my local machine only.
The code works fine on the server. Any thoughts?
INFO ZooKeeper - Client environment:user.dir=D:\eclipse\eclipse-jee-64\eclipse
INFO ZooKeeper - Initiating client connection, connectString=11.45.66.78:2181 sessionTimeout=180000 watcher=hconnection
INFO ClientCnxn - Opening socket connection to server ip-55-77-77-99.ec2.internal/11.45.66.78:2181. Will not attempt to authenticate using SASL (unknown error)
INFO ClientCnxn - Socket connection established to ip-55-77-77-99.ec2.internal/11.45.66.78:2181, initiating session
INFO ClientCnxn - Session establishment complete on server ip-55-77-77-99.ec2.internal/11.45.66.78:2181, sessionid = 0x14b0dc1e5030dd7, negotiated timeout = 180000
I don't know if I should answer this question a year and a half later, but you should add all hostnames from the cluster (if you use AWS, add both public and private DNS names) to your local /etc/hosts file. I had this issue this week and this resolved my problem.
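For example, based on the log above (your IP and hostnames will differ), the /etc/hosts entry would look something like:
11.45.66.78   ip-55-77-77-99.ec2.internal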
I have a 3-node HBase cluster running on Amazon EC2, which is working perfectly fine. Now I am trying to insert data from EMR into EC2 using two separate insert queries. The first insert query works perfectly fine and inserts the data, but after that all of my region servers become dead. Could you please suggest general guidelines for debugging this problem and explain why region servers generally die?
Moreover, even if I explicitly restart the region servers, after some time they die again.
Update:
Earlier I was thinking it might be a problem with HBASE_HEAPSIZE, which is set to 1 GB by default, but even after increasing it to 5.5 GB the region servers still die.
Below are the logs I am getting on every region server after it dies.
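(For reference, HBASE_HEAPSIZE is set in conf/hbase-env.sh and is interpreted in megabytes, so a 5.5 GB heap corresponds roughly to the line below; the exact value shown is only an illustration.)
export HBASE_HEAPSIZE=5500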
2013-10-07 18:16:27,949 WARN org.apache.zookeeper.ClientCnxn: Session 0x141916dfbe50000 for server null, unexpected error, closing socket connection and attempting rec$
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:597)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:286)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1035)
2013-10-07 18:16:27,990 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/10.179.42.93:50020. Already tried 1 time(s).
2013-10-07 18:16:28,049 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server master/10.179.42.93:2181
2013-10-07 18:16:28,049 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: Client will not SASL-authenticate because the default JAAS configuration section 'Client'$
2013-10-07 18:16:28,049 WARN org.apache.zookeeper.ClientCnxn: Session 0x141916dfbe50001 for server null, unexpected error, closing socket connection and attempting rec$
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:597)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:286)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1035)
2013-10-07 18:16:28,177 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server slave/10.178.5.52:2181
2013-10-07 18:16:28,177 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: Client will not SASL-authenticate because the default JAAS configuration section 'Client'$
2013-10-07 18:16:28,178 WARN org.apache.zookeeper.ClientCnxn: Session 0x141916dfbe50001 for server null, unexpected error, closing socket connection and attempting rec$
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:597)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:286)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1035)
You can check the RegionServer logs. Here is more information about log locations:
http://hbase.apache.org/book/trouble.log.html
If you have to explicitly restart the region servers every time, then it is a problematic situation.
The best way is to spin up a new EMR instance with HBase.
What is a good practice for setting the regionservers and the ZooKeeper quorum?
I have a small Hadoop cluster with 16 nodes. Following the example given in http://hbase.apache.org/book/example_config.html, I chose the 16 nodes as regionservers and a subset of these nodes as the ZooKeeper quorum.
But when a job is launched from a node which is not in the list set in hbase.zookeeper.quorum, I get the following error:
13/08/23 15:40:05 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate using SASL (unknown error)
13/08/23 15:40:05 WARN zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
    at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
    at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
13/08/23 15:40:05 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
13/08/23 15:40:05 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
13/08/23 15:40:05 WARN zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
13/08/23 15:40:05 INFO util.RetryCounter: Sleeping 2000ms before retry #1...
So it tries to connect for 600 seconds and then returns:
Task attempt_xxx failed to report status for 60 seconds. Killing!
After a few attempts it changes nodes, and if by chance the new node belongs to the ZooKeeper list, the job finishes with success.
Is this normal?
I ended up adding all nodes to the ZooKeeper list, but I would like to know if this is a good practice. Also, is there any case where the list of regionservers should differ from the node list?
Thank you
No, it doesn't look like what you're doing is a good practice. For a 16 RS cluster, 1 ZK node should be just fine.
Check out the ZK Admin guide:
For the ZooKeeper service to be active, there must be a majority of non-failing machines that can communicate with each other. To create a deployment that can tolerate the failure of F machines, you should count on deploying 2xF+1 machines. Thus, a deployment that consists of three machines can handle one failure, and a deployment of five machines can handle two failures. Note that a deployment of six machines can only handle two failures since three machines is not a majority. For this reason, ZooKeeper deployments are usually made up of an odd number of machines.
Although it doesn't say it there, a ZK cluster should be no bigger than 7 nodes. Given the recommendation of an odd number of nodes, that leaves the options of 1, 3, 5, and 7. Again for a smallish cluster like yours, 1 should suffice, but 3 will give you resiliency. 5 is probably overkill. 7 definitely is.
Also, looking at the error you pasted:
java.net.ConnectException: Connection refused
This would appear to indicate either:
Hadoop misconfiguration: you pointed to the wrong server/port, or the service is not currently running (see the client configuration sketch below), or more likely -
Network misconfiguration, such as a firewall like iptables blocking the connection
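A minimal client-side configuration sketch, assuming hypothetical hostnames node01..node03 for the ZooKeeper quorum: any node that launches jobs needs hbase.zookeeper.quorum set (typically via hbase-site.xml on the client classpath, or programmatically as below), otherwise the client falls back to localhost:2181, which is exactly the connection-refused pattern shown in the question.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

// Illustrative only: point the HBase client at the ZooKeeper quorum explicitly
// so it does not fall back to localhost:2181.
public class QuorumConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Only the ZooKeeper ensemble members go here; regionservers are discovered through ZooKeeper.
    conf.set("hbase.zookeeper.quorum", "node01,node02,node03");
    conf.set("hbase.zookeeper.property.clientPort", "2181");

    // From here on, use the configuration as usual, e.g. (table name is a placeholder):
    HTable table = new HTable(conf, "my_table");
    try {
      // ... reads/writes ...
    } finally {
      table.close();
    }
  }
}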