AWS EMR - How to extend HDFS capacity

Our cluster is running with 2 core nodes and very little DFS capacity, which needs to be increased.
I attached a new 500GB volume to the core node instance, mounted it at /mnt1, and updated hdfs-site.xml on both the master and core nodes:
<property>
<name>dfs.datanode.dir</name>
<value>/mnt/hdfs,/mnt/hdfs1</value>
</property>
Then I restarted both the hadoop-hdfs-namenode and hadoop-hdfs-datanode services, but the datanode shuts down because of the new volume.
2018-06-19 11:25:05,484 FATAL
org.apache.hadoop.hdfs.server.datanode.DataNode (DataNode: [[[DISK]file:/mnt/hdfs/, [DISK]file:/mnt/hdfs1]] heartbeating to ip-10-60-12-232.ap-south-1.compute.internal/10.60.12.232:8020): Initialization
failed for Block pool (Datanode Uuid unassigned) service
to ip-10-60-12-232.ap-south-1.compute.internal/10.60.12.232:8020.
Exiting.
org.apache.hadoop.util.DiskChecker$DiskErrorException: Too many failed
volumes - current valid volumes: 1, volumes configured: 2, volumes
failed: 1, volume failures tolerated: 0
Upon searching I see that people suggest formatting the namenode so that a block pool ID gets assigned to both volumes. How can I fix this issue?
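A minimal checklist sketch for this situation, assuming Hadoop 2.x on EMR with the datanode running as the hdfs user; the property name, ownership commands, and tolerated-failures setting below are things to verify, not a confirmed fix for this particular cluster:
# on the core node: the new directory must exist and be writable by the datanode user
sudo mkdir -p /mnt/hdfs1
sudo chown -R hdfs:hadoop /mnt/hdfs1   # group name may differ on your EMR release
# hdfs-site.xml: Hadoop 2.x reads dfs.datanode.data.dir (paths must match the actual mount point)
<property>
<name>dfs.datanode.data.dir</name>
<value>/mnt/hdfs,/mnt/hdfs1</value>
</property>
# optionally let the datanode come up even if one volume still fails to load
<property>
<name>dfs.datanode.failed.volumes.tolerated</name>
<value>1</value>
</property>
# then restart only the hadoop-hdfs-datanode service (the exact restart command depends on the EMR release);
# the namenode does not need to be reformatted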

Related

Setting up multinode hadoop cluster Blockpool ID mismatch

While setting up a multinode Hadoop cluster I ran into several issues. I have been going through different web portals to get the setup right, and a few fundamental questions came up.
I am using Hadoop 2.8.5 to set up a 2-node cluster in a master-slave configuration.
On the first machine I formatted the namenode using hdfs namenode -format; the clusterID and blockpoolID were assigned as below:
#Fri Mar 29 11:14:41 IST 2019
namespaceID=576041649
clusterID=CID-98480e8d-f7a9-4e1a-8997-400a7aa150c3
cTime=1553838281164
storageType=NAME_NODE
blockpoolID=BP-954411427-x.x.x.y-1553838281164
layoutVersion=-63
Now on the 2nd machine, I ran the command hdfs namenode -format -clusterId CID-98480e8d-f7a9-4e1a-8997-400a7aa150c3:
#Fri Mar 29 11:15:38 IST 2019
namespaceID=304822257
clusterID=CID-98480e8d-f7a9-4e1a-8997-400a7aa150c3
cTime=1553838338130
storageType=NAME_NODE
blockpoolID=BP-1421744029-x.x.x.x-1553838338130
layoutVersion=-63
My understanding is that the slave and master should have the same clusterID; correct me if I am wrong.
The configuration seems to be working correctly, but I am getting errors in logs/hadoop-cassandra-datanode-localnosql1.log and logs/hadoop-cassandra-datanode-localnosql2.log:
2019-03-29 11:25:44,009 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool BP-954411427-x.x.x.y-1553838281164 (Datanode Uuid 4b90bebb-3c34-43d5-8285-6ec6dfefc0a7) service to localnosql1/x.x.x.x:8020 Blockpool ID mismatch: previously connected to Blockpool ID BP-954411427-x.x.x.y-1553838281164 but now connected to Blockpool ID BP-1421744029-x.x.x.x-1553838338130
2019-03-29 11:25:49,010 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool BP-954411427-x.x.x.y-1553838281164 (Datanode Uuid 4b90bebb-3c34-43d5-8285-6ec6dfefc0a7) service to localnosql1/x.x.x.x:8020 Blockpool ID mismatch: previously connected to Blockpool ID BP-954411427-x.x.x.y-1553838281164 but now connected to Blockpool ID BP-1421744029-x.x.x.x-1553838338130
2019-03-29 11:25:54,012 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool BP-954411427-x.x.x.y-1553838281164 (Datanode Uuid 4b90bebb-3c34-43d5-8285-6ec6dfefc0a7) service to localnosql1/x.x.x.x:8020 Blockpool ID mismatch: previously connected to Blockpool ID BP-954411427-x.x.x.y-1553838281164 but now connected to Blockpool ID BP-1421744029-x.x.x.x-1553838338130
2019-03-29 11:25:59,013 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool BP-954411427-x.x.x.y-1553838281164 (Datanode Uuid 4b90bebb-3c34-43d5-8285-6ec6dfefc0a7) service to localnosql1/x.x.x.x:8020 Blockpool ID mismatch: previously connected to Blockpool ID BP-954411427-x.x.x.y-1553838281164 but now connected to Blockpool ID BP-1421744029-x.x.x.x-1553838338130
2019-03-29 11:26:04,014 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool BP-954411427-x.x.x.y-1553838281164 (Datanode Uuid 4b90bebb-3c34-43d5-8285-6ec6dfefc0a7) service to localnosql1/x.x.x.x:8020 Blockpool ID mismatch: previously connected to Blockpool ID BP-954411427-x.x.x.y-1553838281164 but now connected to Blockpool ID BP-1421744029-x.x.x.x-1553838338130
What are these error logs suggesting?
Does the blockpool ID need to be the same on all master and slave nodes, like the clusterID? If yes, how do I do that?
Why are you trying to format the namenode twice? Ideally, in a multinode configuration, there is one namenode and many datanodes.
When setting up for the first time, you initialize the namenode with hdfs namenode -format, then you start the datanodes and it works fine.
If you are trying a multi-master configuration (multiple namenodes running at the same time), then I am not sure this will work.
If you are trying an active-standby configuration for the namenode, you may try the steps below:
Hadoop Namenode HA setup
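For reference, a minimal sketch of the usual single-format flow for a 2-node Hadoop 2.8.5 cluster; $HADOOP_HOME is an assumption, and the key point is that hdfs namenode -format is run exactly once, on the master only:
# on the master only, run once
$HADOOP_HOME/bin/hdfs namenode -format
# start HDFS; the slave's datanode receives the master's clusterID and blockpoolID on first registration
$HADOOP_HOME/sbin/start-dfs.sh
# verify that both datanodes registered against the same block pool
$HADOOP_HOME/bin/hdfs dfsadmin -report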

Namenode doesn't detect datanode failures

I have set up a Hadoop high-availability cluster with 3 master nodes (3 journal nodes, an active namenode, and a standby namenode, with no secondary namenode) and 3 datanodes.
Using the commands
hadoop-daemon.sh start journalnode
hadoop-daemon.sh start namenode
hadoop-daemon.sh start zkfc
I start the namenode services, and with the command hadoop-daemon.sh start datanode I start the datanode services.
The problem is that when I intentionally stop a datanode with hadoop-daemon.sh stop datanode, the namenode web UIs (both active and standby) still show it as an alive node even after several minutes, so I think the namenodes don't detect the datanode's failure!
For future readers, from here:
A datanode is considered stale when:
dfs.namenode.stale.datanode.interval < last contact < (2 *
dfs.namenode.heartbeat.recheck-interval)
In the NameNode UI Datanodes tab, a stale datanode will stand out due to having a larger value for the Last contact among live datanodes (also available in JMX output). When a datanode is stale, it will be given lowest priority for reads and writes.
Using default values, the namenode will consider a datanode stale when its heartbeat is absent for 30 seconds. After another 10 minutes without a heartbeat (10.5 minutes total), a datanode is considered dead.
Relevant properties include:
dfs.heartbeat.interval - default: 3 seconds
dfs.namenode.stale.datanode.interval - default: 30 seconds
dfs.namenode.heartbeat.recheck-interval - default: 5 minutes
dfs.namenode.avoid.read.stale.datanode - default: true
dfs.namenode.avoid.write.stale.datanode - default: true
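If you need the namenode to mark dead datanodes sooner, these intervals can be tuned. The dead threshold follows the standard formula 2 * dfs.namenode.heartbeat.recheck-interval + 10 * dfs.heartbeat.interval, which with the defaults gives 2 * 300 s + 10 * 3 s = 630 s, i.e. the 10.5 minutes above. A hedged hdfs-site.xml sketch with illustrative values only (the recheck interval is in milliseconds, the heartbeat interval in seconds):
<property>
<name>dfs.namenode.heartbeat.recheck-interval</name>
<value>60000</value>
</property>
<property>
<name>dfs.heartbeat.interval</name>
<value>3</value>
</property>
With these values a datanode would be declared dead after roughly 2 * 60 s + 10 * 3 s = 150 s without a heartbeat.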

ha hdfs : Initialization failed for Block pool <registering> (Datanode Uuid unassigned)

I get the following error when trying to start the datanodes in an HA HDFS cluster:
2016-01-06 22:54:58,064 INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory [DISK]file:/home/data/hdfs/dn/ has already been used.
2016-01-06 22:54:58,082 INFO org.apache.hadoop.hdfs.server.common.Storage: Analyzing storage directories for bpid BP-1354640905-10.146.52.232-1452117061014
2016-01-06 22:54:58,083 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to analyze storage directories for block pool BP-1354640905-10.146.52.232-1452117061014
java.io.IOException: BlockPoolSliceStorage.recoverTransitionRead: attempt to load an used block storage: /home/data/hdfs/dn/current/BP-1354640905-10.146.52.232-1452117061014
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.loadBpStorageDirectories(BlockPoolSliceStorage.java:210)
at org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceStorage.recoverTransitionRead(BlockPoolSliceStorage.java:242)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:396)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:477)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1338)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1304)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:314)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:226)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:867)
at java.lang.Thread.run(Thread.java:745)
2016-01-06 22:54:58,084 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to add storage for block pool: BP-1354640905-10.146.52.232-1452117061014 : BlockPoolSliceStorage.recoverTransitionRead: attempt to load an used block storage: /home/data/hdfs/dn/current/BP-1354640905-10.146.52.232-1452117061014
2016-01-06 22:54:58,084 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool (Datanode Uuid unassigned) service to master3/10.146.52.232:8020. Exiting.
java.io.IOException: All specified directories are failed to load.
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1338)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1304)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:314)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:226)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:867)
at java.lang.Thread.run(Thread.java:745)
2016-01-06 22:54:58,084 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool (Datanode Uuid unassigned) service to master3/10.146.52.232:8020
2016-01-06 22:54:58,084 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool (Datanode Uuid unassigned) service to master2/10.146.52.231:8020. Exiting.
org.apache.hadoop.util.DiskChecker$DiskErrorException: Invalid volume failure config value: 3
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.(FsDatasetImpl.java:261)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:34)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetFactory.newInstance(FsDatasetFactory.java:30)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1351)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1304)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:314)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:226)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:867)
at java.lang.Thread.run(Thread.java:745)
2016-01-06 22:54:58,085 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool (Datanode Uuid unassigned) service to master2/10.146.52.231:8020
2016-01-06 22:54:58,185 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool (Datanode Uuid unassigned)
I have already checked the cluster IDs in the namenode and datanode and they match...
I have tried reformatting everything several times...
Thanks for your help!
I have seen messages like this in the log file when the file system for the DataNode is corrupt. Perhaps try running fsck -y on each of the disks used by the DataNode. In your case:
fsck -y /home/data/hdfs
Once the disk(s) are clean you should be able to start the DataNode. The NameNode will then work to ensure that the replication factor is restored for any lost blocks.
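A hedged usage note on that command: fsck normally runs against the unmounted block device rather than a directory path, so assuming /home/data/hdfs is its own ext4 mount backed by a device such as /dev/sdb1 (a hypothetical name; find yours with df or lsblk), the sequence would be:
# stop the datanode first, then take the filesystem offline
umount /home/data/hdfs
fsck -y /dev/sdb1
mount /home/data/hdfs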
I had a similar problem (though mine didn't say "Datanode Uuid unassigned"; it is hard to tell more without additional logs), and fsck didn't solve it.
In my case, I had moved a subset of disks from one node to another node that already had disks, and disabled the old node, so the moved disks did not match the DatanodeUuid of the new machine.
Above those lines in the log, there were entries like:
2016-04-11 19:32:02,991 WARN org.apache.hadoop.hdfs.server.common.Storage: org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory /archive14/dfs/data is in an inconsistent state: Root /archive14/dfs/data: DatanodeUuid=5ba6418e-2c24-4582-8225-3e7f7fff9feb, does not match 519c1e34-a573-41f7-9e80-dca606fce704 from other StorageDirectory.
To solve that, I ran:
sed -i "s/${olduuid}/${newuuid}/" /mountpoints*/dfs/data/current/VERSION
This replaces the old UUID in each VERSION file with the new one. Then starting the datanode works.
Maybe in your case, you had a missing UUID rather than an incorrect one.
Deleting the namenode directory and the datanode directory and then creating new directories worked for me. Use this technique only if you can accept losing the data.
In my case, I reinstalled HDFS with CM 6.2.0 and created two namenode instances for HA.
I then formatted each of these namenodes separately, which caused the error below:
Initialization failed for Block pool BP-666417012-10.253.76.213-1557044865448 (Datanode Uuid 5132035c-8d6a-4617-af7e-7d07355a905b) service to hzd-t-vbdl-02/10.253.76.222:8022 Blockpool ID mismatch: previously connected to Blockpool ID BP-666417012-10.253.76.213-1557044865448 but now connected to Blockpool ID BP-1262695848-10.253.76.222-1557045124181
How I fixed it:
ansible all -m shell -a " more /XXX/hdfs/dfs/nn/current/VERSION "
hzd-t-vbdl-01 | CHANGED | rc=0 >>
Sun May 05 16:27:45 CST 2019
namespaceID=732385684
clusterID=cluster54
cTime=1557044865448
storageType=NAME_NODE
blockpoolID=BP-666417012-10.253.76.213-1557044865448
layoutVersion=-64
hzd-t-vbdl-02 | CHANGED | rc=0 >>
Sun May 05 16:32:04 CST 2019
namespaceID=892287385
clusterID=cluster54
cTime=1557045124181
storageType=NAME_NODE
blockpoolID=BP-1262695848-10.253.76.222-1557045124181
layoutVersion=-64
Finally, I copied the contents from hzd-t-vbdl-01 (which was formatted earlier) to hzd-t-vbdl-02 and restarted the namenodes and datanodes.
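As an aside, for an HA pair the usual way to initialize the second namenode is to bootstrap it from the already-formatted one rather than formatting it a second time, which is what produces the mismatched block pool IDs; a minimal sketch:
# on the second namenode only, with the first namenode already formatted and running
hdfs namenode -bootstrapStandby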

Cloudera hdfs another namenode already locked the storage directory

I am running CDH-5.3.2-1.cdh5.3.2.p0.10 with Cloudera Manager on CentOS 6.6.
My HDFS service was working on the cluster, but I wanted to change the mount point for the Hadoop data. That did not succeed, so I decided to roll back all the changes, but the previous configuration no longer works either, which is discouraging.
I have two nodes in the cluster. One of the data nodes is reporting DataNodes Health Bad.
In the log I get a few errors:
1:40:10.821 PM ERROR org.apache.hadoop.hdfs.server.common.Storage
It appears that another namenode 931@spark1.xxx.xx has already locked the storage directory
1:40:10.821 PM INFO org.apache.hadoop.hdfs.server.common.Storage
Cannot lock storage /dfs/nn. The directory is already locked
1:40:10.821 PM WARN org.apache.hadoop.hdfs.server.common.Storage
java.io.IOException: Cannot lock storage /dfs/nn. The directory is already locked
1:40:10.822 PM FATAL org.apache.hadoop.hdfs.server.datanode.DataNode
Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to spark1.xxx.xx/10.10.10.10:8022. Exiting.
java.io.IOException: All specified directories are failed to load.
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:463)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1318)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1288)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:320)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:221)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:829)
at java.lang.Thread.run(Thread.java:745)
I have been trying many possible solutions, but without any luck:
formatting the namenode: hadoop namenode -format
stopping the cluster and rm -rf /dfs/* [and reformatting]
some adjustments to the /dfs/nn/current/VERSION file
removing the in_use.lock file and starting only the affected node
removing the file in /tmp/hsperfdata_hdfs/ whose name matches the pid locking the directory.
There are files in the directory
[root@spark1 dfs]# ll
total 8
drwxr-xr-x 3 hdfs hdfs 4096 Apr 28 13:39 nn
drwx------ 3 hdfs hadoop 4096 Apr 28 13:40 snn
There is no dn dir, which is a bit interesting.
I perform all operations on HDFS files as the hdfs user.
In the file /etc/hadoop/conf/hdfs-site.xml there is
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///dfs/nn</value>
</property>
Here is a similar thread on the CDH users Google group which might help you: https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/FYu0gZcdXuE
Also, did you do the namenode format from Cloudera Manager or from the command line? Ideally you should do it through Cloudera Manager, not the command line.
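One more thing worth checking, as an assumption based on the log above: the datanode is trying to load /dfs/nn, which is the namenode's directory, and there is no dn directory at all, so the datanode data directory may be unset or misconfigured. In CDH this is dfs.datanode.data.dir (DataNode Data Directory in Cloudera Manager) and should point at a separate path, for example:
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///dfs/dn</value>
</property>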

Datanode does not start correctly

I am trying to install Hadoop 2.2.0 in pseudo-distributed mode. When I try to start the datanode services I get the following error; can anyone please tell me how to resolve it?
2014-03-11 08:48:15,916 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool <registering> (storage id unknown) service to localhost/127.0.0.1:9000 starting to offer service
2014-03-11 08:48:15,922 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2014-03-11 08:48:15,922 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50020: starting
2014-03-11 08:48:16,406 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /home/prassanna/usr/local/hadoop/yarn_data/hdfs/datanode/in_use.lock acquired by nodename 3627@prassanna-Studio-1558
2014-03-11 08:48:16,426 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-611836968-127.0.1.1-1394507838610 (storage id DS-1960076343-127.0.1.1-50010-1394127604582) service to localhost/127.0.0.1:9000
java.io.IOException: Incompatible clusterIDs in /home/prassanna/usr/local/hadoop/yarn_data/hdfs/datanode: namenode clusterID = CID-fb61aa70-4b15-470e-a1d0-12653e357a10; datanode clusterID = CID-8bf63244-0510-4db6-a949-8f74b50f2be9
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:837)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:808)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
at java.lang.Thread.run(Thread.java:662)
2014-03-11 08:48:16,427 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-611836968-127.0.1.1-1394507838610 (storage id DS-1960076343-127.0.1.1-50010-1394127604582) service to localhost/127.0.0.1:9000
2014-03-11 08:48:16,532 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-611836968-127.0.1.1-1394507838610 (storage id DS-1960076343-127.0.1.1-50010-1394127604582)
2014-03-11 08:48:18,532 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2014-03-11 08:48:18,534 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
2014-03-11 08:48:18,536 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
You can use the following method:
Copy the datanode clusterID, which in your example is CID-8bf63244-0510-4db6-a949-8f74b50f2be9,
and run the following command under the HADOOP_HOME/bin directory:
./hdfs namenode -format -clusterId CID-8bf63244-0510-4db6-a949-8f74b50f2be9
This formats the namenode with the datanode's cluster ID.
You must do as follows:
bin/stop-all.sh
rm -Rf /home/prassanna/usr/local/hadoop/yarn_data/hdfs/*
bin/hadoop namenode -format
I had the same problem until I found an answer on this website.
Whenever you get the error below while trying to start a DN on a slave machine:
java.io.IOException: Incompatible clusterIDs in /home/hadoop/dfs/data: namenode clusterID= ****; datanode clusterID = ****
It is because after you set up your cluster you, for whatever reason, decided to reformat your NN. Your DNs on the slaves still bear a reference to the old NN.
To resolve this, simply delete and recreate the data folder on that machine in the local Linux FS, namely /home/hadoop/dfs/data.
Restarting that DN's daemon on that machine will recreate the data/ folder's contents and resolve the problem.
Do the following simple steps:
Clear the data directory of Hadoop
Format the namenode again
Start the cluster
After this your cluster will start normally, provided you have no other configuration issues.
The DataNode dies because its clusterID is incompatible with the NameNode's. To fix this problem you need to delete the directory /tmp/hadoop-[user]/hdfs/data and restart Hadoop.
rm -r /tmp/hadoop-[user]/hdfs/data
I got a similar issue in my pseudo-distributed environment. I stopped the cluster first, then I copied the cluster ID from the NameNode's VERSION file and put it in the DataNode's VERSION file, and after restarting the cluster everything was fine.
My data paths are /usr/local/hadoop/hadoop_store/hdfs/datanode and /usr/local/hadoop/hadoop_store/hdfs/namenode.
FYI: the VERSION file is under /usr/local/hadoop/hadoop_store/hdfs/datanode/current/, and likewise for the NameNode.
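A minimal sketch of that copy step, assuming the paths above and that HDFS is stopped; grep and sed do the work, though editing the datanode VERSION file by hand works just as well:
# read the clusterID from the namenode's VERSION file
NN_CID=$(grep '^clusterID=' /usr/local/hadoop/hadoop_store/hdfs/namenode/current/VERSION | cut -d= -f2)
# write it into the datanode's VERSION file, then restart HDFS
sed -i "s/^clusterID=.*/clusterID=${NN_CID}/" /usr/local/hadoop/hadoop_store/hdfs/datanode/current/VERSION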
Here, the datanode stops immediately because the clusterIDs of the datanode and namenode are different. So you have to format the namenode with the clusterID of the datanode.
Copy the datanode clusterID, CID-8bf63244-0510-4db6-a949-8f74b50f2be9 in your example, and run the following command from your home directory. You can go to your home dir by just typing cd in your terminal.
From your home dir, now type the command:
hdfs namenode -format -clusterId CID-8bf63244-0510-4db6-a949-8f74b50f2be9
Delete the namenode and datanode directories as specified in the core-site.xml.
After that create the new directories and restart the dfs and yarn.
I also had a similar issue.
I deleted the namenode and datanode folders from all the nodes, and reran:
$HADOOP_HOME/bin> hdfs namenode -format -force
$HADOOP_HOME/sbin> ./start-dfs.sh
$HADOOP_HOME/sbin> ./start-yarn.sh
To check the health report from command line (which I would recommend)
$HADOOP_HOME/bin> hdfs dfsadmin -report
and I got all the nodes working correctly.
I had the same issue with Hadoop 2.7.7.
I removed the namenode/current and datanode/current directories on the namenode and all the datanodes,
removed the files under /tmp/hadoop-ubuntu/*,
then formatted the namenode and datanode,
and restarted all the nodes. After that, things worked fine.
Steps:
Stop all nodes/managers, then attempt the steps below.
rm -rf /tmp/hadoop-ubuntu/* (all nodes)
rm -r /usr/local/hadoop/data/hdfs/namenode/current (namenode: check hdfs-site.xml for path)
rm -r /usr/local/hadoop/data/hdfs/datanode/current (datanode:check hdfs-site.xml for path)
hdfs namenode -format (on namenode)
hdfs datanode -format (on namenode)
Reboot namenode & data nodes
There have been different solutions to this problem, but I tested another easy solution and it worked like a charm:
If someone gets the same error, you just need to change the clusterID in the datanodes to the namenode's clusterID in the VERSION file.
In your case, here's where you can change it on the datanode side:
namenode clusterID = CID-fb61aa70-4b15-470e-a1d0-12653e357a10; datanode clusterID = CID-8bf63244-0510-4db6-a949-8f74b50f2be9
Back up the current VERSION: cp /home/prassanna/usr/local/hadoop/yarn_data/hdfs/datanode/current/VERSION /home/prassanna/usr/local/hadoop/yarn_data/hdfs/datanode/current/VERSION.BK
vim /home/prassanna/usr/local/hadoop/yarn_data/hdfs/datanode/current/VERSION and change
clusterID=CID-8bf63244-0510-4db6-a949-8f74b50f2be9
with
clusterID=CID-fb61aa70-4b15-470e-a1d0-12653e357a10
Restart the datanode and it should work.
