Region server issue - hadoop

I am getting warnings like these every time the region server stops. Any suggestions, please?
WARN [regionserver60020.compactionChecker] util.Sleeper: We slept 36092ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
WARN [regionserver60020] util.Sleeper: We slept 19184ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2015-10-08 04:11:33,030 INFO [regionserver60020-SendThread(xxxx)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 24827ms for sessionid 0x54ee10bc6f21c05, closing socket connection and attempting reconnect
2015-10-08 04:11:33,030 INFO [SplitLogWorker-xxxx,60020,1443823580220-SendThread( xxxx:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 24356ms for sessionid 0x44ee10bc6361c47, closing socket connection and attempting reconnect
2015-10-08 04:11:33,607 INFO [SplitLogWorker-xxx,60020,1443823580220-SendThread(xxxxx)] client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism.
2015-10-08 22:46:15,211 WARN [main-SendThread(xxxxxxx)] client.ZooKeeperSaslClient: Could not login: the client is being asked for a password, but the Zookeeper client code does not currently support obtaining a password from the user. Make sure that the client is configured to use a ticket cache (using the JAAS configuration setting 'useTicketCache=true)' and restart the client. If you still get this message after that, the TGT in the ticket cache has expired and must be manually refreshed. To do so, first determine if you are using a password or a keytab. If the former, run kinit in a Unix shell in the environment of the user who is running this Zookeeper client using the command 'kinit <princ>' (where <princ> is the name of the client's Kerberos principal). If the latter, do 'kinit -k -t <keytab> <princ>' (where <princ> is the name of the Kerberos principal, and <keytab> is the location of the keytab file). After manually refreshing your cache, restart this client. If you continue to see this message after manually refreshing your cache, ensure that your KDC host's clock is in sync with this host's clock.
2015-10-08 22:46:15,919 WARN [main-SendThread(xxxxxxxx)] zookeeper.ClientCnxn: SASL configuration failed: javax.security.auth.login.LoginException: Checksum failed Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
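The warnings above point to long JVM pauses on the RegionServer rather than to ZooKeeper itself. As a minimal sketch (assuming the stock hbase-site.xml and hbase-env.sh files; the values are illustrative, not tuned recommendations), the usual mitigations are to raise the ZooKeeper session timeout and to shorten GC pauses:

<!-- hbase-site.xml: allow longer pauses before the ZooKeeper session expires;
     note that the effective value is also capped by the ZooKeeper server's maxSessionTimeout -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>90000</value>
</property>

# hbase-env.sh: example GC flags aimed at shorter pauses on the RegionServer heap
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"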

Related

What can cause a ZooKeeper client session to time out?

I deployed a long-running Storm topology. After several hours of running, the whole topology went down. I checked the worker logs and found the entries below. As they say, the ZooKeeper client session timed out, which caused a reconnection. I suspect this is related to my broken topology. Now I am trying to find out what can cause clients to time out.
2016-02-29T10:34:12.386+0800 o.a.s.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 23789ms for sessionid 0x252f862028c0083, closing socket connection and attempting reconnect
2016-02-29T10:34:12.986+0800 o.a.s.c.f.s.ConnectionStateManager [INFO] State change: SUSPENDED
2016-02-29T10:34:13.059+0800 b.s.cluster [WARN] Received event :disconnected::none: with disconnected Zookeeper.
2016-02-29T10:34:13.197+0800 o.a.s.z.ClientCnxn [INFO] Opening socket connection to server zk-3.cloud.mos/172.16.13.147:2181. Will not attempt to authenticate using SASL (unknown error)
2016-02-29T10:34:13.241+0800 o.a.s.z.ClientCnxn [WARN] Session 0x252f862028c0083 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[na:1.8.0_31]
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716) ~[na:1.8.0_31]
at org.apache.storm.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361) ~[storm-core-0.9.6.jar:0.9.6]
at org.apache.storm.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081) ~[storm-core-0.9.6.jar:0.9.6]
Your client can no longer talk to the ZooKeeper server. The first thing that happened was that there was no answer to the heartbeats within the negotiated session timeout:
2016-02-29T10:34:12.386+0800 o.a.s.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 23789ms for sessionid 0x252f862028c0083, closing socket connection and attempting reconnect
Then when it tried to reconnect, it got a connection refused:
2016-02-29T10:34:13.241+0800 o.a.s.z.ClientCnxn [WARN] Session 0x252f862028c0083 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
This means that your ZooKeeper server:
Is not reachable (network connection down)
Is dead (so nothing is listening on the socket)
Is GCing itself to death and cannot communicate (although that might have issued a connection timeout error, I'm not sure)
To tell more you will need to check the ZooKeeper server logs on your (Hadoop?) cluster.
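As a quick first check before digging into the server logs, ZooKeeper's four-letter-word commands will tell you whether the server is reachable at all. A sketch, reusing the host and port from the log above:

# The server answers "imok" if it is running and healthy
echo ruok | nc zk-3.cloud.mos 2181
# Print basic server statistics, including connected clients and latencies
echo stat | nc zk-3.cloud.mos 2181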
It worked for me after increasing the connection timeout in server.properties:
zookeeper.connection.timeout.ms=60000
One way this can happen is if you start ZooKeeper, then interrupt it in the terminal, then try to start Kafka.
In order to use Kafka, you really should use 3 terminal windows (or 3 PuTTY sessions if you are SSHing into your instance from Windows), for example as sketched below:
First session for the ZooKeeper server.
Second session for the Kafka server.
Third session for running Kafka commands to do things like creating topics.
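For example, assuming a standard Kafka distribution layout (paths and the topic name are illustrative), the three sessions might run:

# Session 1: ZooKeeper
bin/zookeeper-server-start.sh config/zookeeper.properties
# Session 2: Kafka broker
bin/kafka-server-start.sh config/server.properties
# Session 3: Kafka commands, e.g. creating a topic
# (newer Kafka versions take --bootstrap-server instead of --zookeeper)
bin/kafka-topics.sh --create --topic test --zookeeper localhost:2181 --partitions 1 --replication-factor 1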
I started Kafka in cluster mode with 3 ZooKeeper servers and 3 Kafka servers. All the ZooKeeper servers started successfully, but on startup the Kafka servers kept getting disconnected, stating "fatal error during Kafka server startup. prepare to shutdown (kafka.server.kafkaserver)". While investigating, I found that the Kafka servers got disconnected every time after 18 seconds (which is the default value of zookeeper.connection.timeout.ms = 18000), so I increased that setting and the issue was resolved.
Always use 2181 as the port number for the ZooKeeper connection unless you have configured ZooKeeper to use a different port.
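For reference, a minimal sketch of the broker's connect string with the default client port (the hostname is a placeholder):

# server.properties: ZooKeeper connect string using the default client port 2181
zookeeper.connect=localhost:2181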

HBase connection hangs at INFO ClientCnxn - Session establishment complete on server

I am trying to connect to HBase (version 0.94.18) on Hadoop (2.4) from my Eclipse, and the connection hangs after this. This happens on my local machine only; the code works fine on the server. Any thoughts?
INFO ZooKeeper - Client environment:user.dir=D:\eclipse\eclipse-jee-64\eclipse
INFO ZooKeeper - Initiating client connection, connectString=11.45.66.78:2181 sessionTimeout=180000 watcher=hconnection
INFO ClientCnxn - Opening socket connection to server ip-55-77-77-99.ec2.internal/11.45.66.78:2181. Will not attempt to authenticate using SASL (unknown error)
INFO ClientCnxn - Socket connection established to ip-55-77-77-99.ec2.internal/11.45.66.78:2181, initiating session
INFO ClientCnxn - Session establishment complete on server ip-55-77-77-99.ec2.internal/11.45.66.78:2181, sessionid = 0x14b0dc1e5030dd7, negotiated timeout = 180000
I don't know if I should answer this question a year and a half after it was asked, but you should add all hostnames from the cluster (if you use AWS, add both the public and private DNS names) to your local /etc/hosts file. I had this issue this week, and this resolved my problem.
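A sketch of what that /etc/hosts entry might look like, using the address and internal DNS name from the log above (substitute your own cluster's hosts):

# /etc/hosts on the local development machine
11.45.66.78   ip-55-77-77-99.ec2.internal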

HBase establishes a session with ZooKeeper and closes the session immediately

I have found that our RegionServers connect to ZooKeeper frequently. They seem to constantly establish a session, close it, and reconnect to ZooKeeper. Here are the logs for both the server and client sides. I have no idea why this happens or how to deal with it. We're using HBase 0.94.11 and ZooKeeper 3.4.4.
The log from HBase RegionServer:
2014-09-18,16:38:17,867 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=10.2.201.74:11000,10.2.201.73:11000,10.101.10.67:11000,10.101.10.66:11000,10.2.201.75:11000 sessionTimeout=30000 watcher=catalogtracker-on-org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation#69d892a1
2014-09-18,16:38:17,868 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism.
2014-09-18,16:38:17,868 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server lg-hadoop-srv-ct01.bj/10.2.201.73:11000. Will attempt to SASL-authenticate using Login Context section 'Client'
2014-09-18,16:38:17,868 INFO org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: The identifier of this process is 11787#lg-hadoop-srv-st05.bj
2014-09-18,16:38:17,868 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to lg-hadoop-srv-ct01.bj/10.2.201.73:11000, initiating session
2014-09-18,16:38:17,870 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server lg-hadoop-srv-ct01.bj/10.2.201.73:11000, sessionid = 0x248782700e52b3c, negotiated timeout = 30000
2014-09-18,16:38:17,876 INFO org.apache.zookeeper.ZooKeeper: Session: 0x248782700e52b3c closed
2014-09-18,16:38:17,876 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2014-09-18,16:38:17,878 INFO org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Total replicated: 24
The log from its ZooKeeper server:
2014-09-18,16:38:17,869 INFO org.apache.zookeeper.server.NIOServerCnxnFactory: [myid:2] Accepted socket connection from /10.2.201.76:55621
2014-09-18,16:38:17,869 INFO org.apache.zookeeper.server.ZooKeeperServer: [myid:2] Client attempting to establish new session at /10.2.201.76:55621
2014-09-18,16:38:17,870 INFO org.apache.zookeeper.server.ZooKeeperServer: [myid:2] Established session 0x248782700e52b3c with negotiated timeout 30000 for client /10.2.201.76:55621
2014-09-18,16:38:17,872 INFO org.apache.zookeeper.server.auth.SaslServerCallbackHandler: [myid:2] Successfully authenticated client: authenticationID=hbase_srv/hadoop#XIAOMI.HADOOP; authorizationID=hbase_srv/hadoop#XIAOMI.HADOOP.
2014-09-18,16:38:17,872 INFO org.apache.zookeeper.server.auth.SaslServerCallbackHandler: [myid:2] Setting authorizedID: hbase_srv
2014-09-18,16:38:17,872 INFO org.apache.zookeeper.server.ZooKeeperServer: [myid:2] adding SASL authorization for authorizationID: hbase_srv
2014-09-18,16:38:17,877 INFO org.apache.zookeeper.server.NIOServerCnxn: [myid:2] Closed socket connection for client /10.2.201.76:55621 which had sessionid 0x248782700e52b3c
Finally I have found the root cause.
Yes, it's about ReplicationSink and I have found the log, "2014-09-23,14:58:01,736 DEBUG org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Replicating for table online_miliao_recent".
Then I looked at the relevant code and found that every time it calls replicateEntries(), it invokes sharedHBaseAdmin.tableExists(table) as well.
The sharedHBaseAdmin.tableExists() call creates a new CatalogTracker object, which is also a ZooKeeper client.
When this method exits, it cleans up the ZooKeeper client and the session.
So this log looks reasonable, because replication is running. But tableExists() is a little heavy and I don't think we should invoke it every time we replicate entries. I also noticed that CatalogTracker is no longer used in ReplicationSink after 0.94.11, so this is not a problem for later versions.
It would be great if I could find the JIRA which removed the CatalogTracker from ReplicationSink :-)
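To illustrate the point, here is a hypothetical sketch (not the actual ReplicationSink code) of how caching the result of tableExists() would avoid opening a fresh CatalogTracker/ZooKeeper session for every replicated batch:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.hbase.client.HBaseAdmin;

// Hypothetical helper: remember tables that are already known to exist so that
// HBaseAdmin.tableExists() - which in 0.94.11 creates a CatalogTracker and hence
// a short-lived ZooKeeper session - is only paid for once per table instead of
// once per replicated batch.
class CachedTableChecker {
  private final Set<String> knownTables = new HashSet<String>();

  boolean tableExists(HBaseAdmin admin, String table) throws IOException {
    if (knownTables.contains(table)) {
      return true;                              // no ZooKeeper round trip
    }
    boolean exists = admin.tableExists(table);  // expensive: opens and closes a ZK session
    if (exists) {
      knownTables.add(table);
    }
    return exists;
  }
}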

Reasons for a region server to become dead

I have a 3-node HBase cluster running on Amazon EC2, which is working perfectly fine. Now I try to insert data from EMR into EC2 using two separate insert queries. The first insert query works perfectly fine and inserts the data, but after that all of my region servers become dead. Could you please suggest general guidelines for debugging this problem, and explain why region servers generally become dead?
Moreover, even when I explicitly start the region servers again, after some time they become dead.
Update to the question:
Earlier I was thinking it might be a problem with HBASE_HEAPSIZE, which is set to 1 GB by default. But I increased that to 5.5 GB and the region servers are still becoming dead.
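(For reference, the heap is usually raised in hbase-env.sh; the value below is illustrative and is interpreted as megabytes unless a unit is given.)

# hbase-env.sh
export HBASE_HEAPSIZE=5500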
Below are the logs which I am getting on every region server after they are dead.
2013-10-07 18:16:27,949 WARN org.apache.zookeeper.ClientCnxn: Session 0x141916dfbe50000 for server null, unexpected error, closing socket connection and attempting rec$
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:597)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:286)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1035)
2013-10-07 18:16:27,990 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/10.179.42.93:50020. Already tried 1 time(s).
2013-10-07 18:16:28,049 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server master/10.179.42.93:2181
2013-10-07 18:16:28,049 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: Client will not SASL-authenticate because the default JAAS configuration section 'Client'$
2013-10-07 18:16:28,049 WARN org.apache.zookeeper.ClientCnxn: Session 0x141916dfbe50001 for server null, unexpected error, closing socket connection and attempting rec$
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:597)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:286)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1035)
2013-10-07 18:16:28,177 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server slave/10.178.5.52:2181
2013-10-07 18:16:28,177 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: Client will not SASL-authenticate because the default JAAS configuration section 'Client'$
2013-10-07 18:16:28,178 WARN org.apache.zookeeper.ClientCnxn: Session 0x141916dfbe50001 for server null, unexpected error, closing socket connection and attempting rec$
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:597)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:286)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1035)
You can check the logs for the RegionServer. Here is more information about log locations:
http://hbase.apache.org/book/trouble.log.html
If you have to explicitly turn the region servers back on every time, then it is a problematic situation.
The best way is to spin up a new EMR instance with HBase.
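A hedged example of following the RegionServer log while reproducing the insert from EMR; the exact path depends on your installation (a common default is $HBASE_HOME/logs or /var/log/hbase):

# Follow the RegionServer log and watch for GC, ZooKeeper, or OutOfMemory messages
tail -f /var/log/hbase/hbase-*-regionserver-*.log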

Redis server reports "Reading from client: Connection reset" on an Amazon EC2 c1.medium instance

I run Redis 2.4.16 on an EC2 medium instance; persistence is on standard EBS. I checked the Redis log and found that the message "Reading from client: Connection reset" occurs every few hours. All my clients and the server are in the same zone (ap-northeast-1a), and the operating system is Ubuntu Server 12.04. The client is JRedis + Spring Data Redis 1.0.0.M4. Can anyone figure this out or give some advice? Thanks!
Below is the result of the Redis INFO command:
redis_version:2.4.16
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
gcc_version:4.5.2
process_id:3265
uptime_in_seconds:2658600
uptime_in_days:30
lru_clock:561139
used_cpu_sys:29421.34
used_cpu_user:10731.37
used_cpu_sys_children:20022.24
used_cpu_user_children:75702.79
connected_clients:44
connected_slaves:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:1111572800
used_memory_human:1.04G
used_memory_rss:1133101056
used_memory_peak:1112071512
used_memory_peak_human:1.04G
mem_fragmentation_ratio:1.02
mem_allocator:jemalloc-3.0.0
loading:0
aof_enabled:0
changes_since_last_save:1343
bgsave_in_progress:0
last_save_time:1368760178
bgrewriteaof_in_progress:0
total_connections_received:904643
total_commands_processed:592333133
expired_keys:0
evicted_keys:0
keyspace_hits:443393839
keyspace_misses:30383206
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:359082
vm_enabled:0
role:master
slave0:xxx,online
db0:keys=364558,expires=0
As you can see from the logs, Redis is trying to communicate with a client that has closed its connection.
That's probably because some of your clients are not closing their connections to Redis after they are done with them.
This can eventually lead Redis to run out of connections (depending on your connection limits and the amount of traffic you have).
An easy solution is to set a connection timeout in redis.conf (the default is 0, meaning no timeout) so that Redis will close idle connections after X seconds.
Note: you should include the output of redis config get * when asking this kind of question ;)
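A minimal sketch of that setting, with an illustrative value of 300 seconds; it can go in redis.conf or be applied at runtime via redis-cli:

# redis.conf: close idle client connections after 300 seconds (0 means never)
timeout 300

# or at runtime, without a restart:
redis-cli CONFIG SET timeout 300
redis-cli CONFIG GET timeout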

Resources