Cloudera HDFS: another namenode has already locked the storage directory

I am running CDH-5.3.2-1.cdh5.3.2.p0.10 with Cloudera Manager on CentOS 6.6.
My HDFS service was working on a cluster, but I wanted to change the mount point for the Hadoop data. That did not succeed, so I decided to roll back all the changes; discouragingly, the previous configuration no longer works either.
The cluster has two nodes. On one of them the DataNode is reported as bad ("DataNodes Health Bad").
The log contains a few errors:
1:40:10.821 PM ERROR org.apache.hadoop.hdfs.server.common.Storage
It appears that another namenode 931#spark1.xxx.xx has already locked the storage directory
1:40:10.821 PM INFO org.apache.hadoop.hdfs.server.common.Storage
Cannot lock storage /dfs/nn. The directory is already locked
1:40:10.821 PM WARN org.apache.hadoop.hdfs.server.common.Storage
java.io.IOException: Cannot lock storage /dfs/nn. The directory is already locked
1:40:10.822 PM FATAL org.apache.hadoop.hdfs.server.datanode.DataNode
Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to spark1.xxx.xx/10.10.10.10:8022. Exiting.
java.io.IOException: All specified directories are failed to load.
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:463)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1318)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1288)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:320)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:221)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:829)
at java.lang.Thread.run(Thread.java:745)
I have tried many possible solutions, without any luck:
formatting with hadoop namenode -format
stopping the cluster and running rm -rf /dfs/* (and reformatting)
making some adjustments to the /dfs/nn/current/VERSION file
removing the in_use.lock file and starting only the missing node
removing the file in /tmp/hsperfdata_hdfs/ whose name matches the pid that locks the directory
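Before removing in_use.lock by hand, it may help to check whether the process named in the error is still running. A minimal Python sketch, assuming the lock owner has the JVM's usual pid@hostname form (the log above prints it as 931#spark1.xxx.xx, so both separators are accepted). The helpers parse_lock_owner and pid_is_alive are hypothetical names, not part of Hadoop, and the liveness check assumes a POSIX host:

```python
import os
import re


def parse_lock_owner(text):
    """Split a lock owner such as '931@spark1.xxx.xx' into (pid, hostname)."""
    match = re.match(r"\s*(\d+)[@#](\S+)", text)
    if match is None:
        raise ValueError("unrecognised lock owner: %r" % text)
    return int(match.group(1)), match.group(2)


def pid_is_alive(pid):
    """Return True if a process with this pid exists on the local host."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

If pid_is_alive reports True for the pid in the message, a Hadoop daemon is probably still holding the directory, and deleting the lock file alone is unlikely to help.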
These are the contents of the directory:
[root@spark1 dfs]# ll
total 8
drwxr-xr-x 3 hdfs hdfs 4096 Apr 28 13:39 nn
drwx------ 3 hdfs hadoop 4096 Apr 28 13:40 snn
There is no dn directory, which is a bit surprising.
I perform all operations on HDFS files as the hdfs user.
The file /etc/hadoop/conf/hdfs-site.xml contains:
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///dfs/nn</value>
</property>
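Since the DataNode log above complains about /dfs/nn, it can be worth listing which storage directories the configuration actually declares for each role. A rough Python sketch using only the standard library; dfs.datanode.data.dir is the standard DataNode property, while the SAMPLE document and its /dfs/dn value are assumptions for illustration and are not taken from the cluster above:

```python
import xml.etree.ElementTree as ET


def storage_dirs(hdfs_site_xml):
    """Return {property name: [directories]} for the HDFS storage properties."""
    wanted = ("dfs.namenode.name.dir", "dfs.datanode.data.dir")
    root = ET.fromstring(hdfs_site_xml)
    result = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name in wanted:
            value = (prop.findtext("value") or "").strip()
            # Values are comma-separated lists, optionally with a file:// scheme.
            result[name] = [v.strip().replace("file://", "", 1)
                            for v in value.split(",") if v.strip()]
    return result


def shared_dirs(dirs):
    """Directories that are listed for both the NameNode and the DataNode."""
    nn = set(dirs.get("dfs.namenode.name.dir", []))
    dn = set(dirs.get("dfs.datanode.data.dir", []))
    return sorted(nn & dn)


# Hypothetical configuration used only to exercise the functions.
SAMPLE = """<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///dfs/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///dfs/dn</value>
  </property>
</configuration>"""
```

A non-empty result from shared_dirs(storage_dirs(...)) would mean both roles point at the same directory, which is one way two daemons could end up competing for the same lock.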

Here is a similar thread from the CDH users Google group that might help you: https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/FYu0gZcdXuE
Also, did you format the NameNode from Cloudera Manager or from the command line? Ideally you should do it through Cloudera Manager, not from the command line.

Related

What is the preferred solution for corrupted namenode metadata

We have an HDP cluster, version 2.6.5.
The cluster includes two NameNodes (one active, one standby)
and 65 DataNode machines.
We have a problem with the standby NameNode: it does not start, and in the NameNode logs we can see the following:
2021-01-01 15:19:43,269 ERROR namenode.NameNode (NameNode.java:main(1783)) - Failed to start namenode.
java.io.IOException: There appears to be a gap in the edit log. We expected txid 90247527115, but got txid 90247903412.
at org.apache.hadoop.hdfs.server.namenode.MetaRecoveryContext.editLogLoaderPrompt(MetaRecoveryContext.java:94)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:215)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:143)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:838)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:693)
In Ambari we can see that the standby is down.
For now the active NameNode is up but the standby NameNode is down, and the root cause is that the NameNode metadata is damaged or corrupted.
So we have two possible solutions, A or B.
A)
Run the following recovery on the standby NameNode:
su
hadoop namenode -recover
B)
Put Active NN in safemode
su hdfs
hdfs dfsadmin -safemode enter
Do a savenamespace operation on Active NN
su hdfs
hdfs dfsadmin -saveNamespace
Leave Safemode
su hdfs
hdfs dfsadmin -safemode leave
Log in to the standby NN
Run the command below on the standby NameNode to fetch the latest fsimage saved in the steps above.
su hdfs
hdfs namenode -bootstrapStandby -force
What is the preferred solution for our problem?

Namenode in a Hadoop cluster and the fsimage and edit logs concept

I want to give a short background on NameNodes and fsimage/edit logs, and on how the NameNode works in Hadoop clusters.
The NameNode stores modifications to the file system as a log appended to a native file system file, edits.
When a NameNode starts up, it reads HDFS state from an image file, fsimage, and then applies edits from the edits log file.
It then writes new HDFS state to the fsimage and starts normal operation with an empty edits file.
FsImage is a file stored on the OS filesystem that contains the complete directory structure (namespace) of HDFS, including which blocks make up each file. Which DataNode holds each block is not stored in the FsImage; the NameNode learns that from the block reports the DataNodes send.
EditLogs is a transaction log that records the changes to the HDFS file system, that is, any action performed on the HDFS cluster such as the addition of a new block, replication, deletion, etc. It records the changes made since the last FsImage was created; a checkpoint then merges those changes into the FsImage to create a new FsImage file.
When the NameNode starts, the latest FsImage file is loaded into memory, and the edit log is then loaded and applied as well if the FsImage does not contain up-to-date information.
The NameNode keeps the metadata in memory in order to serve multiple client requests as fast as possible.
Otherwise, the NameNode would have to read the metadata from disk for every operation, which would cost additional disk seek time each time.
To summarize:
Persistence of HDFS metadata broadly consists of two categories of files:
fsimage
Contains the complete state of the file system at a point in time. Every file system modification is assigned a unique, monotonically increasing transaction ID. An fsimage file represents the file system state after all modifications up to a specific transaction ID.
edits file
Contains a log that lists each file system change (file creation, deletion or modification) that was made after the most recent fsimage.
Checkpointing
is the process of merging the content of the most recent fsimage with all edits applied after that fsimage was written, in order to create a new fsimage. Checkpointing is triggered automatically by configuration policies or manually by HDFS administration commands.
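To make the relationship between the three terms concrete, here is a toy Python model of checkpointing. It only illustrates the idea described above and is not how the NameNode is implemented: the namespace is a plain set of paths, and each edit is a hypothetical (txid, op, path) tuple.

```python
def apply_edits(image_txid, namespace, edits):
    """Replay the edits with txid > image_txid onto a copy of the namespace."""
    state = set(namespace)
    last_txid = image_txid
    for txid, op, path in sorted(edits):
        if txid <= image_txid:
            continue  # already reflected in the fsimage
        if txid != last_txid + 1:
            raise ValueError("gap in edit log: expected txid %d, got %d"
                             % (last_txid + 1, txid))
        if op == "create":
            state.add(path)
        elif op == "delete":
            state.discard(path)
        else:
            raise ValueError("unknown op: %r" % op)
        last_txid = txid
    return last_txid, state


def checkpoint(image_txid, namespace, edits):
    """Merge the image and the later edits into a new image.

    Returns (new image txid, new namespace, remaining edits); after a
    checkpoint no edits remain to be replayed on top of the new image.
    """
    new_txid, new_namespace = apply_edits(image_txid, namespace, edits)
    return new_txid, new_namespace, []
```

In this model, startup time grows with the number of edits that have accumulated since the last image, which is why regular checkpoints matter.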
That was a brief overview of the NameNode and edit logs.
Now let's talk about our cluster (it is based on HDP version 2.6.5).
In the folder /var/hadoop/hdfs/namenode/current of each NameNode, we have the following fsimage files:
fsimage_0000000000000031788 100% 104KB 104.1KB/s 00:00
fsimage_0000000000000031788.md5 100% 62 0.1KB/s 00:00
fsimage_0000000000000041641 100% 104KB 104.1KB/s 00:00
fsimage_0000000000000041641.md5 100% 62 0.1KB/s 00:00
and also the edit logs:
...
-rw-r--r-- 1 hdfs hadoop 328138542 Jan 23 12:37 edits_0000000022056979997-0000000022059239786
-rw-r--r-- 1 hdfs hadoop 301415558 Jan 23 13:07 edits_0000000022059239787-0000000022061345588
-rw-r--r-- 1 hdfs hadoop 311747850 Jan 23 13:37 edits_0000000022061345589-0000000022063490851
-rw-r--r-- 1 hdfs hadoop 12 Jan 23 13:37 seen_txid
-rw-r--r-- 1 hdfs hadoop 330301440 Jan 24 07:10 edits_0000000022063490852-0000000022065448335
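The file names encode the first and last transaction ID of each segment, so a gap in the edit log would show up as two consecutive segments whose IDs do not join up. A sketch in Python, assuming finalized segments are named edits_<first>-<last> as in the listing above; find_gaps is a hypothetical helper:

```python
import re

SEGMENT = re.compile(r"^edits_(\d+)-(\d+)$")


def find_gaps(filenames):
    """Return (expected_txid, found_txid) pairs where segments do not join up."""
    segments = []
    for name in filenames:
        match = SEGMENT.match(name)
        if match:  # skips seen_txid, fsimage_* and any in-progress segment
            segments.append((int(match.group(1)), int(match.group(2))))
    segments.sort()
    gaps = []
    for (_, prev_last), (next_first, _) in zip(segments, segments[1:]):
        if next_first != prev_last + 1:
            gaps.append((prev_last + 1, next_first))
    return gaps
```

Passing it the output of os.listdir() on the current directory gives an empty list when every segment continues exactly where the previous one ended, as the four files listed above do.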
Now we start both NameNodes.
In the NameNode logs we see that the NameNode replays each edit log (so if, for example, we have 1965 edit log files, the NameNode replays all of them one by one).
Example:
2020-01-27 06:20:37,306 INFO namenode.FSEditLogLoader (FSEditLogLoader.java:loadEditRecords(266)) - replaying edit log: 2072759/2282427 transactions completed. (91%)
2020-01-27 06:20:38,307 INFO namenode.FSEditLogLoader (FSEditLogLoader.java:loadEditRecords(266)) - replaying edit log: 2214991/2282427 transactions completed. (97%)
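Progress lines like these can be used to estimate the replay rate of the segment currently being loaded. A small Python sketch that parses two such lines; the regular expression is written against the two sample lines above and is an assumption about the log format in general, and parse_progress and seconds_remaining are hypothetical helpers:

```python
import re
from datetime import datetime

PROGRESS = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"replaying edit log: (\d+)/(\d+) transactions completed"
)


def parse_progress(line):
    """Return (timestamp, done, total) for a 'replaying edit log' line."""
    match = PROGRESS.search(line)
    if match is None:
        raise ValueError("not a replay progress line")
    stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
    return stamp, int(match.group(2)), int(match.group(3))


def seconds_remaining(first_line, second_line):
    """Estimate the time left for the current segment from two progress lines."""
    t1, done1, _ = parse_progress(first_line)
    t2, done2, total = parse_progress(second_line)
    elapsed = (t2 - t1).total_seconds()
    if elapsed <= 0 or done2 <= done1:
        raise ValueError("lines must be in order and show progress")
    rate = (done2 - done1) / elapsed  # transactions per second
    return (total - done2) / rate
```

The estimate covers only the segment named in the log lines; the total startup time also depends on how many segments are still waiting to be replayed.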
So the NameNodes fully start in the active/standby state only after replaying all 1965 edit logs,
and this takes almost 17 hours.
After we restart both NameNodes, we expect to get up-to-date fsimage files.
For example:
-rw-r--r-- 1 hdfs hadoop 445716 Jan 31 08:11 fsimage_0000000000000132222
-rw-r--r-- 1 hdfs hadoop 62 Jan 31 08:11 fsimage_0000000000000132222.md5
But in our case, after restarting both NameNodes, we get this instead (the fsimage is not updated; its timestamp is from Jan 03):
-rw-r--r-- 1 hdfs hadoop 445716 Jan 03 07:11 fsimage_0000000000000132222
-rw-r--r-- 1 hdfs hadoop 62 Jan 03 07:11 fsimage_0000000000000132222.md5
So we can see that the fsimage was not updated, even though both NameNodes fully started (after 17 hours) in the active/standby state.
Any suggestion why the fsimage is not updated to the current time?
You can create an fsimage file by running the checkpoint manually with these commands:
hdfs dfsadmin -safemode enter
hdfs dfsadmin -saveNamespace
hdfs dfsadmin -safemode leave
IMPORTANT: while these commands run, HDFS is in safe mode and not available for normal use, so make sure you have HA active and that your clients are aware of this pause (it can take around 5 minutes or more to complete).
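If you script this, it is worth making sure that safe mode is left again even when saveNamespace fails. A hedged Python sketch that wraps the three commands above with the standard subprocess module; manual_checkpoint and its run parameter are hypothetical names, and the function assumes the hdfs binary is on the PATH and that it is executed as the HDFS superuser:

```python
import subprocess

ENTER = ["hdfs", "dfsadmin", "-safemode", "enter"]
SAVE = ["hdfs", "dfsadmin", "-saveNamespace"]
LEAVE = ["hdfs", "dfsadmin", "-safemode", "leave"]


def manual_checkpoint(run=subprocess.check_call):
    """Enter safe mode, save the namespace, and always leave safe mode again."""
    run(ENTER)
    try:
        run(SAVE)
    finally:
        run(LEAVE)  # executed even if saveNamespace raised an error
```

The run parameter exists so the sequence can be dry-run, for example with manual_checkpoint(run=print), before it is pointed at a real cluster.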

HA HDFS: Initialization failed for Block pool <registering> (Datanode Uuid unassigned)

I get the following error when trying to start DataNodes in an HA HDFS cluster:
2016-01-06 22:54:58,064 INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory [DISK]file:/home/data/hdfs/dn/ has already been used.
2016-01-06 22:54:58,082 INFO org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid BP-1354640905-10.146.52.232-1452117061014
2016-01-06 22:54:58,083 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to analyze storage directories for block pool BP-1354640905-10.146.52.232-1452117061014
java.io.IOException: BlockPoolSliceStorage.recoverTransitionRead: attempt to load an used block storage: /home/data/hdfs/dn/current/BP-1354640905-10.146.52.232-1452117061014
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:210)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:242)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:396)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:477)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1338)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1304)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:314)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:226)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:867)
at java.lang.Thread.run(Thread.java:745)
2016-01-06 22:54:58,084 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to add storage for block pool: BP-1354640905-10.146.52.232-1452117061014 : BlockPoolSliceStorage.recoverTransitionRead: attempt to load an used block storage: /home/data/hdfs/dn/current/BP-1354640905-10.146.52.232-1452117061014
2016-01-06 22:54:58,084 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to master3/10.146.52.232:8020. Exiting.
java.io.IOException: All specified directories are failed to load.
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1338)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1304)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:314)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:226)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:867)
at java.lang.Thread.run(Thread.java:745)
2016-01-06 22:54:58,084 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to master3/10.146.52.232:8020
2016-01-06 22:54:58,084 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to master2/10.146.52.231:8020. Exiting.
org.apache.hadoop.util.DiskChecker$DiskErrorException: Invalid volume failure config value: 3
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.<init>(FsDatasetImpl.java:261)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:34)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:30)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1351)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1304)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:314)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:226)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:867)
at java.lang.Thread.run(Thread.java:745)
2016-01-06 22:54:58,085 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to master2/10.146.52.231:8020
2016-01-06 22:54:58,185 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool <registering> (Datanode Uuid unassigned)
I have already checked the cluster IDs on the NameNode and the DataNode, and they match.
I have tried reformatting everything several times.
Thanks for your help!
I have seen messages like this in the log file when the file system used by the DataNode is corrupt. Try running fsck -y on each of the disks used by the DataNode. In your case:
fsck -y /home/data/hdfs
Once the disks are clean, you should be able to start the DataNode. The NameNode will then work to restore the replication factor for any lost blocks.
I had a similar problem (I can't be sure it is the same without more logs, and mine didn't say "Datanode Uuid unassigned"), and fsck didn't solve it.
In my case, I had moved a subset of disks from one node to another node that already had disks, and disabled the old node, so the DatanodeUuid on the moved disks did not match that of the new machine.
Above those lines in the log, there were entries like:
2016-04-11 19:32:02,991 WARN org.apache.hadoop.hdfs.server.common.Storage: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /archive14/dfs/data is in an inconsistent state: Root /archive14/dfs/data: DatanodeUuid=5ba6418e-2c24-4582-8225-3e7f7fff9feb, does not match 519c1e34-a573-41f7-9e80-dca606fce704 from other StorageDirectory.
To solve that, I ran:
sed -i -r "s/${olduuid}/${newuuid}/" /mountpoints*/dfs/data/current/VERSION
This replaces the old UUID in the VERSION file with the new one. Then starting the datanode works.
Maybe in your case, you had a missing UUID rather than an incorrect one.
Deleting the NameNode directory and the DataNode directory and then creating new directories worked for me. Use this technique only if you accept that you will lose the data.
In my case, I reinstalled HDFS with CM 6.2.0 and set up two NameNodes for HA.
Then I reformatted both NameNodes, and this caused the error below.
Initialization failed for Block pool BP-666417012-10.253.76.213-1557044865448 (Datanode Uuid 5132035c-8d6a-4617-af7e-7d07355a905b) service to hzd-t-vbdl-02/10.253.76.222:8022 Blockpool ID mismatch: previously connected to Blockpool ID BP-666417012-10.253.76.213-1557044865448 but now connected to Blockpool ID BP-1262695848-10.253.76.222-1557045124181
How I resolved it:
ansible all -m shell -a " more /XXX/hdfs/dfs/nn/current/VERSION "
hzd-t-vbdl-01 | CHANGED | rc=0 >>
Sun May 05 16:27:45 CST 2019
namespaceID=732385684
clusterID=cluster54
cTime=1557044865448
storageType=NAME_NODE
blockpoolID=BP-666417012-10.253.76.213-1557044865448
layoutVersion=-64
hzd-t-vbdl-02 | CHANGED | rc=0 >>
Sun May 05 16:32:04 CST 2019
namespaceID=892287385
clusterID=cluster54
cTime=1557045124181
storageType=NAME_NODE
blockpoolID=BP-1262695848-10.253.76.222-1557045124181
layoutVersion=-64
Finally, copy the contents from hzd-t-vbdl-01 (the one formatted earlier) to hzd-t-vbdl-02, and restart the NameNodes and DataNodes.
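The mismatch can be spotted mechanically by comparing the two VERSION files. A minimal Python sketch, assuming the key=value layout shown above; parse_version and mismatched_keys are hypothetical helpers:

```python
def parse_version(text):
    """Parse the key=value lines of a VERSION file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines, comments and the timestamp line
        key, value = line.split("=", 1)
        fields[key.strip()] = value.strip()
    return fields


def mismatched_keys(first, second,
                    keys=("namespaceID", "clusterID", "blockpoolID")):
    """Return the keys whose values differ between two parsed VERSION files."""
    return [k for k in keys if first.get(k) != second.get(k)]
```

For the two files printed above, the clusterID agrees while namespaceID and blockpoolID differ, which matches the "Blockpool ID mismatch" in the error message.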

Datanode does not start correctly

I am trying to install Hadoop 2.2.0 in pseudo-distributed mode. While I am trying to start the datanode services it is showing the following error, can anyone please tell how to resolve this?
2014-03-11 08:48:15,916 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool <registering> (storage id unknown) service to localhost/127.0.0.1:9000 starting to offer service
2014-03-11 08:48:15,922 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2014-03-11 08:48:15,922 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50020: starting
2014-03-11 08:48:16,406 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /home/prassanna/usr/local/hadoop/yarn_data/hdfs/datanode/in_use.lock acquired by nodename 3627#prassanna-Studio-1558
2014-03-11 08:48:16,426 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-611836968-127.0.1.1-1394507838610 (storage id DS-1960076343-127.0.1.1-50010-1394127604582) service to localhost/127.0.0.1:9000
java.io.IOException: Incompatible clusterIDs in /home/prassanna/usr/local/hadoop/yarn_data/hdfs/datanode: namenode clusterID = CID-fb61aa70-4b15-470e-a1d0-12653e357a10; datanode clusterID = CID-8bf63244-0510-4db6-a949-8f74b50f2be9
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
at java.lang.Thread.run(Thread.java:662)
2014-03-11 08:48:16,427 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-611836968-127.0.1.1-1394507838610 (storage id DS-1960076343-127.0.1.1-50010-1394127604582) service to localhost/127.0.0.1:9000
2014-03-11 08:48:16,532 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-611836968-127.0.1.1-1394507838610 (storage id DS-1960076343-127.0.1.1-50010-1394127604582)
2014-03-11 08:48:18,532 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2014-03-11 08:48:18,534 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
2014-03-11 08:48:18,536 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
You can do the following:
copy the DataNode clusterID to the clipboard (in your example, CID-8bf63244-0510-4db6-a949-8f74b50f2be9)
and run the following command from the HADOOP_HOME/bin directory:
./hdfs namenode -format -clusterId CID-8bf63244-0510-4db6-a949-8f74b50f2be9
This formats the NameNode with the DataNode's cluster ID.
You must do the following:
bin/stop-all.sh
rm -Rf /home/prassanna/usr/local/hadoop/yarn_data/hdfs/*
bin/hadoop namenode -format
I had the same problem until I found an answer on this website.
Whenever you get the error below when trying to start a DN on a slave machine:
java.io.IOException: Incompatible clusterIDs in /home/hadoop/dfs/data: namenode clusterID= ****; datanode clusterID = ****
This happens because, after setting up your cluster, you decided for whatever reason to reformat your NN. The DNs on the slaves still refer to the old NN.
To resolve this, simply delete and recreate the data folder on that machine in the local Linux FS, namely /home/hadoop/dfs/data.
Restarting the DN daemon on that machine will recreate the contents of the data/ folder and resolve the problem.
Follow these simple steps:
Clear the Hadoop data directory
Format the NameNode again
Start the cluster
After this, your cluster will start normally, provided there is no other configuration issue.
The DataNode dies because its cluster ID is incompatible with the NameNode's. To fix this problem, delete the directory /tmp/hadoop-[user]/hdfs/data and restart Hadoop.
rm -r /tmp/hadoop-[user]/hdfs/data
I had a similar issue in my pseudo-distributed environment. I stopped the cluster first, then copied the cluster ID from the NameNode's VERSION file into the DataNode's VERSION file; after restarting the cluster, everything was fine.
My data paths are /usr/local/hadoop/hadoop_store/hdfs/datanode and /usr/local/hadoop/hadoop_store/hdfs/namenode.
FYI: the VERSION file is under /usr/local/hadoop/hadoop_store/hdfs/datanode/current/; likewise for the NameNode.
Here, the DataNode stops immediately because the clusterIDs of the DataNode and the NameNode are different. So you have to format the NameNode with the clusterID of the DataNode.
Copy the DataNode clusterID (in your example, CID-8bf63244-0510-4db6-a949-8f74b50f2be9) and run the following command from your home directory. You can go to your home directory by typing cd in your terminal.
From your home directory, type the command:
hdfs namenode -format -clusterId CID-8bf63244-0510-4db6-a949-8f74b50f2be9
Delete the NameNode and DataNode directories as specified in core-site.xml.
After that, create the new directories and restart DFS and YARN.
I also had a similar issue.
I deleted the namenode and datanode folders from all the nodes, and reran:
$HADOOP_HOME/bin> hdfs namenode -format -force
$HADOOP_HOME/sbin> ./start-dfs.sh
$HADOOP_HOME/sbin> ./start-yarn.sh
To check the health report from command line (which I would recommend)
$HADOOP_HOME/bin> hdfs dfsadmin -report
and I got all the nodes working correctly.
I had the same issue with Hadoop 2.7.7.
I removed the namenode/current and datanode/current directories on the NameNode and all the DataNodes,
removed the files under /tmp/hadoop-ubuntu/*,
then formatted the NameNode and DataNode
and restarted all the nodes.
Things work fine now.
Steps:
Stop all nodes/managers, then carry out the steps below:
rm -rf /tmp/hadoop-ubuntu/* (all nodes)
rm -r /usr/local/hadoop/data/hdfs/namenode/current (namenode: check hdfs-site.xml for path)
rm -r /usr/local/hadoop/data/hdfs/datanode/current (datanode:check hdfs-site.xml for path)
hdfs namenode -format (on namenode)
hdfs datanode -format (on namenode)
Reboot namenode & data nodes
There have been different solutions to this problem, but I tested another easy solution and it worked like a charm.
If someone gets the same error, you just need to change the clusterID on the DataNodes to the clusterID of the NameNode in the VERSION file.
In your case, here is what to change on the DataNode side:
namenode clusterID = CID-fb61aa70-4b15-470e-a1d0-12653e357a10; datanode clusterID = CID-8bf63244-0510-4db6-a949-8f74b50f2be9
Back up the current VERSION file: cp /home/prassanna/usr/local/hadoop/yarn_data/hdfs/datanode/current/VERSION /home/prassanna/usr/local/hadoop/yarn_data/hdfs/datanode/current/VERSION.BK
Then run vim /home/prassanna/usr/local/hadoop/yarn_data/hdfs/datanode/current/VERSION and change
clusterID=CID-8bf63244-0510-4db6-a949-8f74b50f2be9
to
clusterID=CID-fb61aa70-4b15-470e-a1d0-12653e357a10
Restart the datanode and it should work.
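The same edit can be scripted instead of done in vim. A small Python sketch that rewrites the clusterID line in the text of a VERSION file; set_cluster_id is a hypothetical helper, and it deliberately refuses to guess when the file does not contain exactly one clusterID line:

```python
import re


def set_cluster_id(version_text, new_cluster_id):
    """Return version_text with its clusterID line set to new_cluster_id."""
    updated, count = re.subn(
        r"^clusterID=.*$",
        lambda _match: "clusterID=" + new_cluster_id,
        version_text,
        flags=re.MULTILINE,
    )
    if count != 1:
        raise ValueError("expected exactly one clusterID line, found %d" % count)
    return updated
```

The function only transforms a string; reading the file, keeping the VERSION.BK backup described above, and writing the result back are left to the caller.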

Hadoop safemode recovery - taking too long!

I have a Hadoop cluster with 18 data nodes.
I restarted the name node over two hours ago and the name node is still in safe mode.
I have been searching for why this might be taking too long and I cannot find a good answer.
The posting here:
Hadoop safemode recovery - taking lot of time
is relevant but I'm not sure if I want/need to restart the name node after making a change to this setting as that article mentions:
<property>
<name>dfs.namenode.handler.count</name>
<value>3</value>
<final>true</final>
</property>
In any case, this is what I've been getting in 'hadoop-hadoop-namenode-hadoop-name-node.log':
2011-02-11 01:39:55,226 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8020, call delete(/tmp/hadoop-hadoop/mapred/system, true) from 10.1.206.27:54864: error: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /tmp/hadoop-hadoop/mapred/system. Name node is in safe mode.
The reported blocks 319128 needs additional 7183 blocks to reach the threshold 0.9990 of total blocks 326638. Safe mode will be turned off automatically.
org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete /tmp/hadoop-hadoop/mapred/system. Name node is in safe mode.
The reported blocks 319128 needs additional 7183 blocks to reach the threshold 0.9990 of total blocks 326638. Safe mode will be turned off automatically.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:1711)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:1691)
at org.apache.hadoop.hdfs.server.namenode.NameNode.delete(NameNode.java:565)
at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:966)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:960)
Any advice is appreciated.
Thanks!
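The numbers in that message fit together as simple arithmetic: the name node stays in safe mode until the reported blocks reach the threshold fraction of the total. A short Python check of the figures from the log; truncating the required count to a whole block is an assumption that happens to reproduce the logged value, and blocks_still_needed is a hypothetical helper:

```python
def blocks_still_needed(reported, total, threshold):
    """Blocks that must still be reported before safe mode can end."""
    required = int(total * threshold)  # assumed: truncated to a whole block
    return max(0, required - reported)


needed = blocks_still_needed(reported=319128, total=326638, threshold=0.9990)
print(needed)  # 7183, matching "needs additional 7183 blocks" in the log
```

If that number stops shrinking over time, the remaining blocks are probably not going to be reported by any running data node, and waiting longer will not help.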
I had it once, where some blocks were never reported in. I had to forcefully let the namenode leave safemode (hadoop dfsadmin -safemode leave) and then run an fsck to delete missing files.
Check the property dfs.namenode.handler.count in hdfs-site.xml.
It specifies the number of threads the NameNode uses for its processing. Its default value is 10. Too low a value for this property might cause the issue described.
Also check for missing or corrupt blocks:
hdfs fsck / | egrep -v '^\.+$' | grep -v replica
hdfs fsck /path/to/corrupt/file -locations -blocks -files
If corrupt blocks are found, remove the affected files:
hdfs dfs -rm /file-with-missing-corrupt-blocks

Resources