High-Availability not working in Hadoop cluster - hadoop

I am trying to move my non-HA namenode to HA. After setting up all the configurations for the JournalNodes by following the Apache Hadoop documentation, I was able to bring the namenodes up. However, the namenodes crash immediately and throw the following error:
ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.io.IOException: There appears to be a gap in the edit log. We expected txid 43891997, but got txid 45321534.
I tried to recover the edit logs, initialize the shared edits, etc., but nothing works. I am not sure how to fix this problem without formatting the namenode, since I do not want to lose any data.
Any help is greatly appreciated. Thanks in advance.

The problem was with the limit of open files on the Linux machine. I increased the open-file limit and then the initialization of the shared edits worked.
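For reference, the steps involved look roughly like this (the limit value and the "hdfs" user name below are assumptions; adjust them to your environment):
# Check the current open-file limit for the user running the namenode
ulimit -n
# Raise it for the current shell (example value)
ulimit -n 65536
# To make it permanent, add entries like these to /etc/security/limits.conf
# (the "hdfs" user name is an assumption):
#   hdfs  soft  nofile  65536
#   hdfs  hard  nofile  65536
# Then re-run the shared-edits initialization on the active namenode
hdfs namenode -initializeSharedEdits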

Related

Services messed up now when trying to start HDFS, NodeManager cannot be brought up

I am doing some Hadoop practice on my local VMware VM: 12 GB RAM, 2 CPUs, 20 GB disk space.
For some unknown reason, my master node now has the following issue:
1. I manually started the NameNode, DataNode, ResourceManager, and NodeManager.
2. I checked with jps to confirm every service was up; so far so good.
3. I tried to start the last piece, the Job History Server; no error was reported.
BUT, when I check with jps, I don't see the NodeManager, it simply disappeared!
So I tried to bring the NodeManager up again:
You can see there is no error reported, but the NodeManager is not up.
I wondered whether I could find any clue in the logs; here is the log screenshot:
Another log:
I don't see any clue in either of the two logs.
Can anyone enlighten me on this? Thank you very much. Any clue is appreciated.
It would be nice to see what caused the NodeManager to stop. Check the log and note the time when it stopped; that may give more information.
Also, the open-file limit may be low: based on the ulimit -a output, it is set to 1024.
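For example, something along these lines (log locations are the usual defaults and may differ on your installation):
# Find the NodeManager log and look at its last entries around the time it stopped
ls -lt $HADOOP_HOME/logs/ | head
tail -n 200 $HADOOP_HOME/logs/yarn-*-nodemanager-*.log
# Check the limits for the user that runs the daemons; if "open files" is only 1024,
# raise it in /etc/security/limits.conf and log in again before restarting
ulimit -a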

NameNode shuts itself down after starting Hadoop

I have installed Hadoop 1.2.1 on Linux with a single-node cluster configuration. It was running fine and the jps command was displaying the information of all 5 jobs:
JobTracker
NameNode
TaskTracker
SecondaryNameNode
jps
DataNode
Now, when I start Hadoop using the command bin/start-all.sh, Hadoop starts all 5 jobs, but within a few seconds the NameNode shuts itself down.
Any ideas how can I solve this issue?
I have checked the namenode log file and it shows the following error:
ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException: Edit log corruption detected: corruption length = 98362 > toleration length = 0; the corruption is intolerable.
This has been asked many times and answered as well; searching with the exception message would give you the results.
Before asking questions on Stack Overflow, please check whether the same kind of question has been asked earlier, using the search option at the top right corner.
Coming to the problem statement:
It is most probably due to hadoop.tmp.dir, which is where your namenode stores the edit logs and checkpoint data.
After every reboot of your machine, the tmp folder is cleared by various services, which causes the problem when the namenode tries to access it again.
That is why the length is 0 after you reboot.
In core-site.xml, change the hadoop.tmp.dir property to point to another directory.
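A rough sketch of what that change could look like on a Hadoop 1.x single-node setup (the directory path and the user/group names are illustrative assumptions):
# Create a directory that survives reboots and hand it to the hadoop user
sudo mkdir -p /var/lib/hadoop/tmp
sudo chown -R hduser:hadoop /var/lib/hadoop/tmp
# Then, in conf/core-site.xml, point hadoop.tmp.dir at it:
#   <property>
#     <name>hadoop.tmp.dir</name>
#     <value>/var/lib/hadoop/tmp</value>
#   </property>
# Restart the cluster so the new location takes effect
bin/stop-all.sh
bin/start-all.sh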
Hope it helps!

After restart, HBase ZooKeeper logs Quorum.Learner: Got zxid 0x100000001 expected 0x1

I am performing some tests using HBase and Hadoop. I set up a cluster with one master, two ZooKeeper nodes, and four region servers. Up until yesterday everything was working perfectly well; starting today it simply doesn't start anymore.
When executing start-hbase, all the processes come up:
HMaster using ports 8020 and 60010
HQuorumPeer using ports 2181 and 3888
HRegionServer
However, when I take a look at the server logs, it seems the servers got stuck for some reason...
The HServer log stops after printing a WARNING about a native library that I was supposed to be using.
HQuorumPeer on node 1 prints a WARNING about getting a zxid 0x10000000001 when it expected 0x1.
The other HQuorumPeer has not printed anything at all.
Does anyone have any idea about this?
Thanks.
Well, I am far, far from being an HBase/Hadoop expert. In fact, this is the first time I am playing around with it. The problem I faced was probably related to an improper shutdown or a corrupted file somewhere in the HBase/Hadoop pair.
So here is my tip if you found yourself on the same situation:
cleanup all hbase logs, in my case at $HBASE_INSTALL/logs/*
cleanup all zookeeper data, in my case at /var/zookeeper/*
cleanup all hadoop data, in my case at /var/hadoop/*
cleanup all hdfs logs, in my case at /var/hdfs/log/*
cleanup all hdfs namenode data, in my case at /var/hdfs/namenode/*
cleanup all hdfs datanode data, in my case at /var/hdfs/datanode/*
format your hdfs cluster by typing the command hdfs namenode -format
IMPORTANT: Don't do that if you have data; you will probably lose all of it. I could do that since I am only using it for test purposes.
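Put together as shell commands, the cleanup above looks roughly like this (these are the paths from my setup quoted in the list; once more, this wipes everything stored in HDFS/HBase):
# Destroys all HBase/ZooKeeper/HDFS state -- only for throwaway test clusters
rm -rf $HBASE_INSTALL/logs/*
rm -rf /var/zookeeper/*
rm -rf /var/hadoop/*
rm -rf /var/hdfs/log/*
rm -rf /var/hdfs/namenode/*
rm -rf /var/hdfs/datanode/*
# Reformat HDFS and start over
hdfs namenode -format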
I will keep reading about HBase/Hadoop in order to understand it better; anyway, I can say it is a tool that is far from "plug and play" when compared to Cassandra.
Hope this can help.
Regards

Region server going down frequently after system start

I am running HBase on HDP on an Amazon machine.
When I reboot my system and start all HBase services, they start fine.
But after some time my region server goes down.
The latest error that I am getting from its log file is:
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /apps/hbase/data/usertable/dd5a251551619e0109349a0dce855e1b/recovered.edits/0000000000000001172.temp could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1657)
Now I am not able to start it.
Any suggestion why this is happening?
Thanks in advance.
Make sure your datanodes are up and running. Also, set dfs.data.dir to some permanent location if you haven't done so yet; it defaults to a directory under /tmp, which gets emptied at each restart. Also, make sure that your datanodes are able to talk to the namenode, that there are no network-related issues, and that the datanode machines have enough free disk space left.
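A few commands that can help confirm those points (this assumes a Hadoop 1.x / HDP-era layout; the config path below is an assumption, and newer releases use hdfs dfsadmin and dfs.datanode.data.dir instead):
# How many live datanodes does the namenode see, and how much space is free?
hadoop dfsadmin -report
# Free disk space on each datanode machine
df -h
# Where does the datanode store its blocks? If dfs.data.dir still points under
# /tmp, move it to a permanent location in hdfs-site.xml and restart HDFS
grep -A1 dfs.data.dir $HADOOP_HOME/conf/hdfs-site.xml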

Errors in setting up HBase on Distributed Hadoop, ZooKeeperServer not running

I'm trying to set up HBase on Hadoop and have been following various great tutorials online, such as those by Michael G. Noll. Basically all is fine: my HDFS and MapReduce work well, and the web interface shows that I have 2 nodes (my NameNode is both a NameNode and a DataNode, but that's just for testing purposes).
When I got to the point of installing HBase, that's where I ran into problems, and I get lots of different errors. The latest one is in the log file on my slave node:
INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client /10.2.xx.xx:43089 (no session established for client)
INFO org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
But when I type in
$ zkServer.sh status
it shows the mode that both machines are running in!
Does anyone have any idea what this problem is? Or does anyone know of another guide/tutorial that I can follow to set this up? I've tried following the HBase documentation on setting up HBase on a distributed HDFS, but it doesn't work either.
Thanks for any help offered!
Are both the ZooKeeper servers configured in a quorum? If so, have they managed to connect to one another and vote on who is the leader? (This should all be in the logs for both servers.)
ZooKeeper may be running, but if the servers can't communicate with one another (firewall rules or misconfiguration, for example), then ZooKeeper will not accept incoming client connections.
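One way to check that, assuming the standard ZooKeeper client port and default quorum/election ports (the host name below is a placeholder):
# On each ZooKeeper node: is the server serving, and in which mode?
zkServer.sh status
echo stat | nc localhost 2181
# From each node, confirm the peers can reach each other's quorum and election ports
nc -zv other-zk-host 2888
nc -zv other-zk-host 3888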
