I have removed a regionserver from my HBase cluster. I removed its hostname from $HBASE_HOME/conf/regionservers and restarted the HBase cluster, but the HBase UI still shows the removed regionserver as a 'dead' region server.
The 'status' command in the hbase shell also shows it as a dead region server. How should I get rid of it?
Cluster getting haunted by dead regionserver :D
HBase may sometimes still show a decommissioned regionserver as dead. This happens because the WAL (Write-Ahead Log) of the dead regionserver is still sitting in HDFS in the “splitting” state, so from HBase's perspective it is not really gone!
So the solution is to go to the WALs directory in HDFS (usually at /hbase/WALs) and remove the files of the old regionserver.
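For example, a rough sketch of the cleanup; the hostname, port and timestamp in the directory name below are placeholders, and the path assumes your hbase.rootdir is /hbase:

    # list the leftover WAL directories and find the one belonging to the removed host
    hdfs dfs -ls /hbase/WALs
    # remove the stale "-splitting" directory of the decommissioned regionserver
    hdfs dfs -rm -r /hbase/WALs/removedhost.example.com,60020,1400000000000-splitting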
I found this at the wonderful blog post kill zombie dead regionservers after much digging.
Related
How do I troubleshoot and recover a Lost Node in my long-running EMR cluster?
The node stopped reporting a few days ago. The host seems to be fine, and so does HDFS. I noticed the issue only from the Hadoop applications UI.
EMR nodes are ephemeral and you cannot recover them once they are marked as LOST. You can avoid this in the first place by enabling the 'Termination Protection' feature when launching the cluster.
As for finding the reason for the LOST node, you can check the YARN ResourceManager logs and/or the instance controller logs of your cluster to find out more about the root cause.
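For reference, a rough AWS CLI sketch of enabling termination protection; the cluster name, release label, instance settings and cluster id below are just placeholders:

    # enable termination protection at launch
    aws emr create-cluster --name "my-cluster" \
        --release-label emr-5.30.0 \
        --applications Name=Hadoop Name=HBase \
        --instance-type m5.xlarge --instance-count 3 \
        --use-default-roles \
        --termination-protected
    # or toggle it on an already running cluster
    aws emr modify-cluster-attributes --cluster-id j-XXXXXXXXXXXXX --termination-protected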
I am performing some tests using HBase and Hadoop. I set up a cluster with one master, two ZooKeeper nodes and four region servers. Up until yesterday everything was working perfectly well; starting from today it simply doesn't start anymore.
When executing start-hbase all the processes come up:
HMaster using ports 8020 and 60010
HQuorumPeer using ports 2181 and 3888
HRegionServer
However, when I take a look at the server logs it seems the servers got stuck for some reason...
. HServer stops after printing a WARNING about a native library that I was supposed to be using
. HQuorumPeer on node 1 prints a WARNING about getting zxid 0x10000000001 when it expected 0x1
. HQuorumPeer on the other node has not printed anything at all
Does anyone have any idea about this?
Thanks.
Well, I am far, far from being an hbase/hadoop expert. In fact this is just the first time I am playing around with it. Probably the problem I faced was related to an improper shutdown or a corrupted file in the hbase/hadoop pair.
So here is my tip if you find yourself in the same situation (the consolidated commands are sketched after this list):
clean up all hbase logs, in my case at $HBASE_INSTALL/logs/*
clean up all zookeeper data, in my case at /var/zookeeper/*
clean up all hadoop data, in my case at /var/hadoop/*
clean up all hdfs logs, in my case at /var/hdfs/log/*
clean up all hdfs namenode data, in my case at /var/hdfs/namenode/*
clean up all hdfs datanode data, in my case at /var/hdfs/datanode/*
format your HDFS cluster by running the command hdfs namenode -format
IMPORTANT: Don't do that if you have data; you will probably lose all of it. I could do it only because I am using the cluster for test purposes.
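Put together, the steps above look roughly like this; the paths are the ones from my setup, so double-check yours before deleting anything:

    # stop hbase and hadoop first, then wipe the state
    rm -rf $HBASE_INSTALL/logs/*
    rm -rf /var/zookeeper/*
    rm -rf /var/hadoop/*
    rm -rf /var/hdfs/log/*
    rm -rf /var/hdfs/namenode/*
    rm -rf /var/hdfs/datanode/*
    # reformat HDFS -- this destroys all data, test clusters only
    hdfs namenode -format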
I will keep reading about hbase/hadoop in order to understand it better; in any case, I can say it is far from being a "plug and play" tool when compared to Cassandra.
Hope this can help.
Regards
I have 5 datanodes in my Cloudera cluster (CDH4) and all are showing as healthy.
The only service with an issue is HDFS, which is showing under-replication. When I go to host:50070/dfsnodelist.jsp?whatNodes=LIVE I see all 5 nodes, but only one of them is showing as In Service. The rest of them are decommissioned.
I'm seeing a lot of information about removing decommissioned nodes but very little on how to recommission them. As far as Cloudera Manager is concerned, these nodes are active.
Well, that is strange. Although Cloudera Manager thinks they're okay, I thought I'd try decommissioning the datanodes and then recommissioning them as a last-ditch effort. It seems to have fixed the nodes being reported as decommissioned.
In addition, it fixed my under-replication issue as well.
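For anyone not using Cloudera Manager, I believe the manual equivalent is roughly to take the host out of the HDFS exclude file and refresh the node list; the file path and hostname below are just examples, check dfs.hosts.exclude in your hdfs-site.xml:

    # remove the node from the exclude file (path and hostname are examples)
    grep -v 'datanode3.example.com' /etc/hadoop/conf/dfs.exclude > /tmp/dfs.exclude
    mv /tmp/dfs.exclude /etc/hadoop/conf/dfs.exclude
    # tell the namenode to re-read its include/exclude lists
    hdfs dfsadmin -refreshNodes
    # the node should now show up as In Service again
    hdfs dfsadmin -report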
Now I am learning about HBase. I set up my HBase Cluster and Hadoop Cluster like this:
server1: Namenode HMaster
server2: datanode1 RegionServer1 HQuorumPeer
server3: datanode2 RegionServer2 HQuorumPeer
server4: datanode3 RegionServer3 HQuorumPeer
I have several questions about the HBase cluster:
1: All RegionServers must be in the Hadoop cluster so they can use HDFS to store data, even though the data ultimately ends up on the local file system, right?
2: What does a RegionServer do? Does the HMaster give jobs to all the RegionServers and let them run in parallel, like a TaskTracker on a DataNode?
3: What does ZooKeeper do? Do I need to set up ZooKeeper on all the RegionServer nodes and on the master node?
4: This is related to #3. I know HBase uses ZooKeeper for recovery once a RegionServer is down. How does that work specifically?
All RegionServers must be in the Hadoop cluster so they can use HDFS to store data, even though the data ultimately ends up on the local file system, right?
Yes. RegionServers are the daemons that are responsible for storing data in an HBase cluster. You store data in HBase tables, which are spread over many regions on several RegionServers across the cluster. Although data goes into the RegionServers, it actually gets stored inside HDFS. But if you are on a standalone setup, HDFS is not used and the data gets stored directly in the local FS. It is analogous to any DB and FS pairing; take MySQL and ext3 for example. And yes, all the HDFS data does live on your disks in the end, you just cannot see it directly.
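If you want to see this for yourself, you can peek at the table layout inside HDFS; the exact paths vary by HBase version, and 'mytable' below is just a made-up table name:

    # contents of hbase.rootdir
    hdfs dfs -ls /hbase
    # one subdirectory per region of the table (0.96+ layout)
    hdfs dfs -ls /hbase/data/default/mytable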
What does a RegionServer do? Does the HMaster give jobs to all the RegionServers and let them run in parallel, like a TaskTracker on a DataNode?
As specified above, the RegionServer is the daemon that actually stores data in an HBase cluster. I'm sorry, I didn't quite get the second part of this question: what do you mean by "like a TaskTracker on a DataNode"? In an HBase cluster the HMaster is the daemon responsible for monitoring all RegionServer instances in the cluster, and it is the interface for all metadata changes. Its job is monitoring and management. RegionServers don't run jobs the way TaskTrackers do; they just store data and are responsible for things like serving and managing regions.
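A quick way to see this division of labour is the status command in the HBase shell, which lists each RegionServer along with the regions it is currently serving:

    hbase shell
    status 'detailed'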
What does ZooKeeper do? Do I need to set up ZooKeeper on all the RegionServer nodes and on the master node?
ZooKeeper is the guy who coordinates everything behind the curtains. It is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. A distributed HBase setup depends on a running ZooKeeper cluster, and all participating nodes and clients need to be able to reach the running ZooKeeper ensemble. By default HBase manages a ZooKeeper cluster for you: it gets started and stopped as part of the HBase start/stop process. But you can also manage the ZooKeeper ensemble independently of HBase and just point HBase at the cluster it should use. You don't have to run ZooKeeper on every node; just pick a number that suits your cluster. One thing to note is that you should always use an odd number of ZooKeeper nodes.
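For example, if you decide to run your own ZooKeeper ensemble on three of the nodes instead of the HBase-managed one, the relevant knobs look roughly like this (the hostnames are just the ones from your layout above):

    # conf/hbase-env.sh -- tell HBase not to start/stop ZooKeeper itself
    export HBASE_MANAGES_ZK=false
    # conf/hbase-site.xml -- point HBase at the external ensemble, e.g.:
    #   hbase.zookeeper.quorum              = server2,server3,server4
    #   hbase.zookeeper.property.clientPort = 2181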
This is related to #3. I know HBase uses ZooKeeper for recovery once a RegionServer is down. How does that work specifically?
Each RegionServer is connected to ZooKeeper, and the master watches these connections. ZooKeeper manages a heartbeat with a timeout, so on a timeout the HMaster declares the region server dead and starts the recovery process. The following things happen during recovery:
Identifying that a node is down: a node can cease to respond simply because it is overloaded, or because it is actually dead.
Recovering the writes in progress: that means reading the commit log (WAL) and recovering the edits that were not yet flushed.
Reassigning the regions: the region server was previously handling a set of regions; this set must be reallocated to other region servers, depending on their respective workloads.
The process is actually a bit more involved. You can find more on this here. I would also suggest going through the book HBase: The Definitive Guide by Lars George in order to get a better grip on HBase.
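As a practical note, the heartbeat/timeout mentioned above is governed by the ZooKeeper session timeout; you can check what your cluster is using with something like:

    # value is in milliseconds; tune it in hbase-site.xml if you need faster or slower failover detection
    grep -A1 'zookeeper.session.timeout' $HBASE_HOME/conf/hbase-site.xml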
HTH
I am running HBase on HDP on an Amazon machine.
When I reboot my system and start all the HBase services, they start up.
But after some time my region server goes down.
The latest error that I am getting from its log file is:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /apps/hbase/data/usertable/dd5a251551619e0109349a0dce855e1b/recovered.edits/0000000000000001172.temp could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1657)
Now I am not able to start it.
Any suggestions as to why this is happening?
Thanks in advance.
Make sure your datanodes are up and running. Also, set "dfs.data.dir" to some permanent location if you haven't done so yet; it defaults to a directory under "/tmp", which gets emptied on each restart. Also make sure that your datanodes are able to talk to the namenode, that there is no network-related issue, and that the datanode machines have enough free space left.
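A few quick checks matching the advice above; the config path below is the usual HDP location, adjust it if yours differs (dfs.data.dir is the older property name, newer releases call it dfs.datanode.data.dir):

    # is a DataNode process actually running on each slave?
    jps
    # are the datanodes registered with the namenode, and do they have free space?
    hadoop dfsadmin -report
    # where does the datanode keep its blocks? make sure this is not under /tmp
    grep -A1 'dfs.data.dir' /etc/hadoop/conf/hdfs-site.xml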