Hadoop 2-node cluster UI showing 1 live node

I am trying to configure a Hadoop 2.7 two-node cluster. When I start Hadoop
using start-dfs.sh and start-yarn.sh, all services on the master and slave start perfectly.
Here is the jps output on my master:
23913 Jps
22140 SecondaryNameNode
22316 ResourceManager
22457 NodeManager
21916 DataNode
21777 NameNode
Here is the jps output on my slave:
17223 Jps
14225 DataNode
14363 NodeManager
But the Hadoop cluster UI shows only 1 live datanode.
Here is the dfsadmin report (/bin/hdfs dfsadmin -report):
Live datanodes (1):
Name: 192.168.1.104:50010 (nn1.cluster.com)
Hostname: nn1.cluster.com
Decommission Status : Normal
Configured Capacity: 401224601600 (373.67 GB)
DFS Used: 237568 (232 KB)
Non DFS Used: 48905121792 (45.55 GB)
DFS Remaining: 352319242240 (328.12 GB)
DFS Used%: 0.00%
DFS Remaining%: 87.81%
I am able to SSH to all machines.
Here is a sample of the namenode logs (IP = 192.168.1.104):
2016-07-12 01:17:34,293 INFO BlockStateChange: BLOCK* processReport: from storage DS-d9ed40cf-bd5d-4033-a6ca-14fb4a8c3587 node DatanodeRegistration(192.168.1.104:50010, datanodeUuid=b702b518-5daa-4fa1-8e69-e4d620a72470, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-e86d0353-9f33-495b-88fa-16035abd3672;nsid=616310490;c=0), blocks: 24, hasStaleStorage: false, processing time: 0 msecs
2016-07-12 01:17:35,501 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* registerDatanode: from DatanodeRegistration(192.168.1.104:50010, datanodeUuid=37038a9f-23ac-42e2-abea-bdf356aaefbe, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-e86d0353-9f33-495b-88fa-16035abd3672;nsid=616310490;c=0) storage 37038a9f-23ac-42e2-abea-bdf356aaefbe
2016-07-12 01:17:35,501 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: BLOCK* registerDatanode: 192.168.1.104:50010
2016-07-12 01:17:35,501 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.1.104:50010
2016-07-12 01:17:35,501 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Number of failed storage changes from 0 to 0
2016-07-12 01:17:35,502 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.1.104:50010
2016-07-12 01:17:35,504 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Number of failed storage changes from 0 to 0
2016-07-12 01:17:35,504 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new storage ID DS-495b6b0e-f1fc-407c-bb9f-6c314c2fdaec for DN 192.168.1.104:50010
Here is a sample of the first datanode's logs (IP = 192.168.1.104):
2016-07-12 02:02:12,044 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action : DNA_REGISTER from nn1.cluster.com/192.168.1.104:8020 with active state
2016-07-12 02:02:12,045 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-1235752202-192.168.1.104-1468159707934 (Datanode Uuid b702b518-5daa-4fa1-8e69-e4d620a72470) service to nn1.cluster.com/192.168.1.104:8020 beginning handshake with NN
2016-07-12 02:02:12,047 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool Block pool BP-1235752202-192.168.1.104-1468159707934 (Datanode Uuid b702b518-5daa-4fa1-8e69-e4d620a72470) service to nn1.cluster.com/192.168.1.104:8020 successfully registered with NN
2016-07-12 02:02:12,050 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sent block report 0x236119eb3082, containing 1 storage report(s), of which we sent 1. The reports had 24 total blocks and used 1 RPC(s). This took 0 msec to generate and 1 msecs for RPC and NN processing. Got back one command: FinalizeCommand/5.
2016-07-12 02:02:12,050 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Got finalize command for block pool BP-1235752202-192.168.1.104-1468159707934
2016-07-12 02:02:15,049 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action : DNA_REGISTER from nn1.cluster.com/192.168.1.104:8020 with active state
2016-07-12 02:02:15,052 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-1235752202-192.168.1.104-1468159707934 (Datanode Uuid b702b518-5daa-4fa1-8e69-e4d620a72470) service to nn1.cluster.com/192.168.1.104:8020 beginning handshake with NN
2016-07-12 02:02:15,056 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool Block pool BP-1235752202-192.168.1.104-1468159707934 (Datanode Uuid b702b518-5daa-4fa1-8e69-e4d620a72470) service to nn1.cluster.com/192.168.1.104:8020 successfully registered with NN
2016-07-12 02:02:15,061 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sent block report 0x2361cd4be40d, containing 1 storage report(s), of which we sent 1. The reports had 24 total blocks and used 1 RPC(s). This took 0 msec to generate and 2 msecs for RPC and NN processing. Got back one command: FinalizeCommand/5.
2016-07-12 02:02:15,061 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Got finalize command for block pool BP-1235752202-192.168.1.104-1468159707934
Here is a sample of the second datanode's logs (IP = 192.168.35.128):
2016-07-12 11:45:07,346 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action : DNA_REGISTER from nn1.cluster.com/192.168.1.104:8020 with active state
2016-07-12 11:45:07,349 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-1235752202-192.168.1.104-1468159707934 (Datanode Uuid 37038a9f-23ac-42e2-abea-bdf356aaefbe) service to nn1.cluster.com/192.168.1.104:8020 beginning handshake with NN
2016-07-12 11:45:07,355 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool Block pool BP-1235752202-192.168.1.104-1468159707934 (Datanode Uuid 37038a9f-23ac-42e2-abea-bdf356aaefbe) service to nn1.cluster.com/192.168.1.104:8020 successfully registered with NN
2016-07-12 11:45:07,364 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sent block report 0xb0de42ec7c, containing 1 storage report(s), of which we sent 1. The reports had 0 total blocks and used 1 RPC(s). This took 0 msec to generate and 4 msecs for RPC and NN processing. Got back one command: FinalizeCommand/5.
2016-07-12 11:45:07,364 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Got finalize command for block pool BP-1235752202-192.168.1.104-1468159707934
2016-07-12 11:45:10,360 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action : DNA_REGISTER from nn1.cluster.com/192.168.1.104:8020 with active state
2016-07-12 11:45:10,363 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-1235752202-192.168.1.104-1468159707934 (Datanode Uuid 37038a9f-23ac-42e2-abea-bdf356aaefbe) service to nn1.cluster.com/192.168.1.104:8020 beginning handshake with NN
2016-07-12 11:45:10,370 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool Block pool BP-1235752202-192.168.1.104-1468159707934 (Datanode Uuid 37038a9f-23ac-42e2-abea-bdf356aaefbe) service to nn1.cluster.com/192.168.1.104:8020 successfully registered with NN
2016-07-12 11:45:10,377 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sent block report 0xb191ea9cb9, containing 1 storage report(s), of which we sent 1. The reports had 0 total blocks and used 1 RPC(s). This took 0 msec to generate and 3 msecs for RPC and NN processing. Got back one command: FinalizeCommand/5.
2016-07-12 11:45:10,377 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Got finalize command for block pool BP-1235752202-192.168.1.104-1468159707934
2016-07-12 11:45:13,376 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action : DNA_REGISTER from nn1.cluster.com/192.168.1.104:8020 with active state
2016-07-12 11:45:13,380 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-1235752202-192.168.1.104-1468159707934 (Datanode Uuid 37038a9f-23ac-42e2-abea-bdf356aaefbe) service to nn1.cluster.com/192.168.1.104:8020 beginning handshake with NN
2016-07-12 11:45:13,385 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool Block pool BP-1235752202-192.168.1.104-1468159707934 (Datanode Uuid 37038a9f-23ac-42e2-abea-bdf356aaefbe) service to nn1.cluster.com/192.168.1.104:8020 successfully registered with NN
2016-07-12 11:45:13,395 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sent block report 0xb245b893c4, containing 1 storage report(s), of which we sent 1. The reports had 0 total blocks and used 1 RPC(s). This took 0 msec to generate and 5 msecs for RPC and NN processing. Got back one command: FinalizeCommand/5.
2016-07-12 11:45:13,396 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Got finalize command for block pool BP-1235752202-192.168.1.104-1468159707934
Why is this happening? Thank you so much for your help!

Got the solution. If the slave nodes/datanodes are alive individually but are not shown in the hadoop dfsadmin -report output, then there is a problem with communication: the channel from the datanodes to the master is not available. Technically speaking, the issue is the firewall; the firewall on the master node is blocking the communication.
We either have to stop the firewall on the master, or allow the specific IPs to access the master.
To stop the firewall on CentOS, run the commands below:
service iptables save
service iptables stop
chkconfig iptables off
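Alternatively, rather than disabling the firewall entirely, you can allow just the slave through. A minimal sketch of that second option, assuming iptables on CentOS and the slave address 192.168.35.128 from the question:
# Allow all traffic from the slave's IP (address taken from the question; adjust to yours),
# then persist the rule across reboots.
iptables -I INPUT -s 192.168.35.128 -j ACCEPT
service iptables save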

Got the solution. The issue was the namenode's IP address. I had used the IP address from the wlan0 interface, which keeps changing. Since I have VMware Workstation installed, I switched to the IP address from the vmnet interface, which is static, and after that change the UI shows 2 live nodes instead of 1.
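One way to make this robust is to reference the namenode by a hostname that every node maps to the static address, instead of a raw interface IP. A minimal sketch, assuming the static vmnet address 192.168.1.104 and the hostname nn1.cluster.com that appear in the question:
# On every node, pin the namenode hostname to the static vmnet IP.
echo "192.168.1.104 nn1.cluster.com" >> /etc/hosts
# Then point fs.defaultFS in core-site.xml at hdfs://nn1.cluster.com:8020
# so the cluster no longer depends on the changing wlan0 address.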

Related

Hadoop Exception: All specified directories are failed to load

When I started the Hadoop cluster, the following exception was thrown. I have no idea how to solve it. Can anyone help me? Thanks.
2017-07-10 09:40:58,960 WARN org.apache.hadoop.hdfs.server.common.Storage: java.io.IOException: Incompatible clusterIDs in /tools/hadoop/hadoop_storage/hdfs/datanode: namenode clusterID = CID-47191263-b5b7-4a4d-b8b5-a78b782e66bb; datanode clusterID = CID-79a53373-9652-4c08-9735-b5972e0450ca
2017-07-10 09:40:58,960 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to localhost/127.0.0.1:54310. Exiting.
java.io.IOException: All specified directories are failed to load.
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:478)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1358)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1323)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:802)
at java.lang.Thread.run(Thread.java:745)
2017-07-10 09:40:58,961 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to localhost/127.0.0.1:54310
2017-07-10 09:40:58,962 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool <registering> (Datanode Uuid unassigned)
2017-07-10 09:41:00,962 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2017-07-10 09:41:00,964 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
2017-07-10 09:41:00,966 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
Perhaps you formatted your cluster one more time, which generated different cluster IDs on the master node and the datanode.
Your namenode and datanode cluster IDs do not match; make sure to make them the same.
On the namenode, change the cluster ID in the file located at:
$ nano HADOOP_FILE_SYSTEM/namenode/current/VERSION
On the datanode, the cluster ID is stored in the file:
$ nano HADOOP_FILE_SYSTEM/datanode/current/VERSION
Whichever way you change the ID, make sure the IDs on all of the cluster's nodes are the same.
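For example, instead of editing the files by hand, you could copy the namenode's ID over in one step. A sketch using the same placeholder paths as above (substitute your real dfs.namenode.name.dir and dfs.datanode.data.dir):
# Read the clusterID=... line from the namenode's VERSION file.
CID=$(grep "^clusterID=" HADOOP_FILE_SYSTEM/namenode/current/VERSION)
# Overwrite the datanode's clusterID line with the namenode's value.
sed -i "s/^clusterID=.*/$CID/" HADOOP_FILE_SYSTEM/datanode/current/VERSION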
@VanThaoNguyen is correct.
In my case:
/installation directory/hdata/dfs/name/current
/installation directory/hdata/dfs/data/current
clusterID=xxxx-xxxx-xxxx-xxxx
should be the same for the namenode and the datanode.

hadoop can't start ./sbin/start-dfs.sh

I started the shell script (./sbin/start-dfs.sh). Here is the jps output:
jps
3098 Jps
2492 NameNode
2700 SecondaryNameNode
Hadoop datanode log:
2017-02-15 15:55:12,787 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to add storage directory [DISK]file:/usr/local/Cellar/hadoop/2.7.3/libexec/%3E/data/hadoop/hdfs/datanode/
java.io.IOException: Incompatible clusterIDs in /usr/local/Cellar/hadoop/2.7.3/libexec/>/data/hadoop/hdfs/datanode: namenode clusterID = CID-4c9d5df1-10c6-45cb-9fe0-e1631e4d13e2; datanode clusterID = CID-6dc3d755-f713-4bec-a62a-c47e96dcbc0d
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:775)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadStorageDirectory(DataStorage.java:300)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadDataStorage(DataStorage.java:416)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:395)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:573)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1362)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1327)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:802)
at java.lang.Thread.run(Thread.java:745)
2017-02-15 15:55:12,792 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for Block pool <registering> (Datanode Uuid unassigned) service to localhost/127.0.0.1:9000. Exiting.
java.io.IOException: All specified directories are failed to load.
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:574)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1362)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1327)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:317)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:223)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:802)
at java.lang.Thread.run(Thread.java:745)
2017-02-15 15:55:12,793 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool <registering> (Datanode Uuid unassigned) service to localhost/127.0.0.1:9000
2017-02-15 15:55:12,799 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool <registering> (Datanode Uuid unassigned)
2017-02-15 15:55:14,800 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2017-02-15 15:55:14,802 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
2017-02-15 15:55:14,803 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
It looks like you have formatted the namenode on a working cluster.
Delete the data directories and start the datanode process again on all nodes:
rm -rf <dfs.datanode.data.dir>
./sbin/hadoop-daemon.sh start datanode
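If you are unsure what <dfs.datanode.data.dir> resolves to on a machine, you can query the configuration rather than guessing; a small sketch (hdfs getconf is a standard command, and the printed path depends on your hdfs-site.xml):
# Print the configured datanode storage directory before deleting anything.
./bin/hdfs getconf -confKey dfs.datanode.data.dir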

Hortonworks HA Namenodes gives an error "Operation category READ is not supported in state standby"

My Hadoop cluster's HA active namenode (host1) suddenly switched over to the standby namenode (host2). I could not find any error in the Hadoop logs (on any server) to identify the root cause.
After the namenode switch, the following error appeared frequently in the HDFS logs, and none of the applications could read HDFS files.
2014-07-17 01:58:53,381 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(6769)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
Once I restarted the new active node (host2), the namenode switched back to the new standby node (host1). Then the cluster worked as normal and users could retrieve HDFS files again.
I'm using Hortonworks 2.1.2.0 and HDFS version 2.4.0.2.1
Edit: 21st July 2014
The following entries were found in the active namenode's logs when the active-standby switch happened:
NT_SETTINGS-1675610.csv dst=null perm=null
2014-07-20 09:06:44,746 INFO FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7755)) - allowed=true ugi=storm (auth:SIMPLE) ip=/10.0.1.50 cmd=getfileinfo src=/user/tungsten/staging/LEAPSET/PRODUCTS/PRODUCTS-1380186.csv dst=null perm=null
2014-07-20 09:06:44,747 INFO FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7755)) - allowed=true ugi=storm (auth:SIMPLE) ip=/10.0.1.50 cmd=getfileinfo src=/user/tungsten/staging/LEAPSET/MERCHANT_SETTINGS/MERCHANT_SETTINGS-1695794.csv dst=null perm=null
2014-07-20 09:06:44,747 INFO FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7755)) - allowed=true ugi=storm (auth:SIMPLE) ip=/10.0.1.50 cmd=getfileinfo src=/user/tungsten/staging/LEAPSET/PRODUCTS/PRODUCTS-1399541.csv dst=null perm=null
2014-07-20 09:06:44,748 INFO namenode.FSNamesystem (FSNamesystem.java:stopActiveServices(1095)) - Stopping services started for active state
2014-07-20 09:06:44,750 INFO namenode.FSEditLog (FSEditLog.java:endCurrentLogSegment(1153)) - Ending log segment 842249
2014-07-20 09:06:44,752 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(673)) - Number of transactions: 2 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 4 35
2014-07-20 09:06:44,774 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(673)) - Number of transactions: 2 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 2 SyncTimes(ms): 24 37
2014-07-20 09:06:44,805 INFO namenode.FSNamesystem (FSNamesystem.java:run(4362)) - NameNodeEditLogRoller was interrupted, exiting
2014-07-20 09:06:44,824 INFO namenode.FileJournalManager (FileJournalManager.java:finalizeLogSegment(130)) - Finalizing edits file /ebs/hadoop/hdfs/namenode/current/edits_inprogress_0000000000000842249 -> /ebs/hadoop/hdfs/namenode/current/edits_0000000000000842249-0000000000000842250
2014-07-20 09:06:44,874 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(168)) - Shutting down CacheReplicationMonitor
2014-07-20 09:06:44,876 INFO namenode.FSNamesystem (FSNamesystem.java:startStandbyServices(1136)) - Starting services required for standby state
2014-07-20 09:06:44,927 INFO ha.EditLogTailer (EditLogTailer.java:(117)) - Will roll logs on active node at hadoop-client-us-west-1b/10.0.254.10:8020 every 120 seconds.
2014-07-20 09:06:44,929 INFO ha.StandbyCheckpointer (StandbyCheckpointer.java:start(129)) - Starting standby checkpoint thread... Checkpointing active NN at http://hadoop-client-us-west-1b:50070 Serving checkpoints at http://hadoop-client-us-west-1a:50070
2014-07-20 09:06:44,930 INFO ipc.Server (Server.java:run(2027)) - IPC Server handler 3 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 10.0.1.50:57297 Call#8431877 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-07-20 09:06:44,930 INFO ipc.Server (Server.java:run(2027)) - IPC Server handler 16 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 10.0.1.50:57294 Call#130105071 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-07-20 09:06:44,940 INFO ipc.Server (Server.java:run(2027)) - IPC Server handler 14 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 10.0.1.50:57294 Call#130105072 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
Edit: 13th August 2014
We were able to find out the root cause of the namenode switching: the namenode was receiving lots of file-info requests, and then the switch happened.
But we still could not resolve the "Operation category READ is not supported in state standby" error.
Edit: 7th December 2014
We found that, as the solution, an application needs to manually connect to the current active namenode once the previously active namenode fails; traffic for namenodes in HA mode is not automatically directed to the active node.
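For reference, you can check which namenode is currently active from the command line before pointing an application at it. A minimal sketch, assuming the NameNode IDs are nn1 and nn2 (hypothetical; use the IDs from your dfs.ha.namenodes.* property):
# Ask each namenode for its HA state; one should report "active", the other "standby".
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2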
I had the same issue. You need to update the client libraries. Use Ambari to set up Spark and have it install the client on the server. Then set your SPARK_HOME environment variable.

Hadoop: How do datanodes register with the namenode?

Do Hadoop datanodes register themselves with the namenode by calling the namenode, or does the namenode have a list of datanodes that it reaches out to?
I want to understand this in order to better troubleshoot a problem with a new namenode I brought up (after a namenode failure): it doesn't see any of the datanodes (but it has the correct fsimage).
Data nodes heartbeat in to the name node. The name node does not reach out to data nodes.
Even when retrieving data, the name node does not reach out to the data nodes. The name node tells the client where the data is, and the client retrieves it from the data nodes directly. (To clarify, during a MapReduce workflow the JobTracker learns from the name node where the data is and assigns TaskTrackers appropriately.)
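You can observe this division of labor from the shell: the namenode only hands out block locations, and the bytes are then read from the datanodes. A sketch (the path /some/file is a placeholder):
# List which datanodes hold each block of a file; the namenode answers with
# locations only, it never serves the block data itself.
hdfs fsck /some/file -files -blocks -locations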
Each datanode keeps the namenode details in its HDFS configuration file, and the namenode keeps the names of all datanodes in the slaves file. I think you should update the slaves file on the namenode and the masters file on the datanodes.
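For illustration, the slaves file is simply a list of worker hostnames, one per line; a sketch with hypothetical hostnames:
$ cat etc/hadoop/slaves   # hostnames below are examples
datanode1
datanode2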
I suppose you had a working cluster (with fs.default.name in core-site.xml properly configured on the datanodes) before hard-shutting down the namenode.
When I shut down my namenode with kill -9 pid, my datanodes start to show the following in their logs:
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenodehost/192.168.0.35:8020. Already tried 0 time(s).
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenodehost/192.168.0.35:8020. Already tried 1 time(s).
...
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenodehost/192.168.0.35:8020. Already tried 9 time(s).
WARN org.apache.hadoop.hdfs.server.datanode.DataNode: java.net.ConnectException: Call to namenodehost/192.168.0.35:8020 failed on connection exception: java.net.ConnectException: Connection refused at ...
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenodehost/192.168.0.35:8020. Already tried 0 time(s).
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenodehost/192.168.0.35:8020. Already tried 1 time(s).
...
repeatedly, until I start my namenode again. At that moment, the datanodes' logs show:
INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action: DNA_REGISTER
INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Finished generating blocks being written report for 1 volumes in 0 seconds
INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting asynchronous block report scan
INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Finished asynchronous block report scan in 10ms
INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Reconciled asynchronous block scan with filesystem. 0 blocks concurrently deleted during scan, 0 blocks concurrently added during scan, 4 ongoing creations ignored
INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Reconciled asynchronous block report against current state in 0 ms
INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 411 blocks took 0 msec to generate and 68 msecs for RPC and NN processing
Each datanode reconnects to the namenode and everything works OK again.
Does this help?

Hadoop: Datanode process killed

I am currently using Hadoop-2.0.3-alpha. After a period in which I could work perfectly with HDFS (copying files into HDFS, getting successful runs from an external framework, using the web frontend), the datanode process started stopping a short while after a new start of my VM. The namenode process and all YARN processes work without a problem. I installed Hadoop in a folder under an additional user, as I also still have Hadoop 0.2 installed, which worked fine too.
Taking a look at the log-file of all datanode processes I got the following information:
2013-04-11 16:23:50,475 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-04-11 16:24:17,451 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2013-04-11 16:24:23,276 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2013-04-11 16:24:23,279 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system started
2013-04-11 16:24:23,480 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Configured hostname is user-VirtualBox
2013-04-11 16:24:28,896 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened streaming server at /0.0.0.0:50010
2013-04-11 16:24:29,239 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwith is 1048576 bytes/s
2013-04-11 16:24:38,348 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2013-04-11 16:24:44,627 INFO org.apache.hadoop.http.HttpServer: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
2013-04-11 16:24:45,163 INFO org.apache.hadoop.http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context datanode
2013-04-11 16:24:45,164 INFO org.apache.hadoop.http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context logs
2013-04-11 16:24:45,164 INFO org.apache.hadoop.http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
2013-04-11 16:24:45,355 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened info server at 0.0.0.0:50075
2013-04-11 16:24:45,508 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: dfs.webhdfs.enabled = false
2013-04-11 16:24:45,536 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 50075
2013-04-11 16:24:45,576 INFO org.mortbay.log: jetty-6.1.26
2013-04-11 16:25:18,416 INFO org.mortbay.log: Started SelectChannelConnector#0.0.0.0:50075
2013-04-11 16:25:42,670 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 50020
2013-04-11 16:25:44,955 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened IPC server at /0.0.0.0:50020
2013-04-11 16:25:45,483 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Refresh request received for nameservices: null
2013-04-11 16:25:47,079 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting BPOfferServices for nameservices: <default>
2013-04-11 16:25:47,660 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool <registering> (storage id unknown) service to localhost/127.0.0.1:8020 starting to offer service
2013-04-11 16:25:50,515 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2013-04-11 16:25:50,631 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50020: starting
2013-04-11 16:26:15,068 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /home/hadoop/workspace/hadoop_space/hadoop23/dfs/data/in_use.lock acquired by nodename 3099#user-VirtualBox
2013-04-11 16:26:15,720 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-474150866-127.0.1.1-1365686732002 (storage id DS-317990214-127.0.1.1-50010-1365505141363) service to localhost/127.0.0.1:8020
java.io.IOException: Incompatible clusterIDs in /home/hadoop/workspace/hadoop_space/hadoop23/dfs/data: namenode clusterID = CID-1745a89c-fb08-40f0-a14d-d37d01f199c3; datanode clusterID = CID-bb3547b0-03e4-4588-ac25-f0299ff81e4f
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:850)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:821)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
at java.lang.Thread.run(Thread.java:722)
2013-04-11 16:26:16,212 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-474150866-127.0.1.1-1365686732002 (storage id DS-317990214-127.0.1.1-50010-1365505141363) service to localhost/127.0.0.1:8020
2013-04-11 16:26:16,276 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-474150866-127.0.1.1-1365686732002 (storage id DS-317990214-127.0.1.1-50010-1365505141363)
2013-04-11 16:26:18,396 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2013-04-11 16:26:18,940 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
2013-04-11 16:26:19,668 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at user-VirtualBox/127.0.1.1
************************************************************/
Any ideas? Maybe I made a mistake during the installation process? But it is strange that it worked once. I also have to say that if I am logged in as my additional user to execute the commands ./hadoop-daemon.sh start namenode (and the same for the datanode), I need to add sudo.
I used this installation guide: http://jugnu-life.blogspot.ie/2012/0...rial-023x.html
By the way, I use the Oracle Java-7 version.
The problem could be that the namenode was formatted after the cluster was set up and the datanodes were not, so the slaves are still referring to the old namenode.
We have to delete and recreate the folder /home/hadoop/dfs/data on the local filesystem for the datanode.
Check your hdfs-site.xml file to see where dfs.data.dir is pointing,
delete that folder,
and then restart the datanode daemon on the machine.
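Put together, a sketch of those steps, assuming dfs.data.dir points at /home/hadoop/dfs/data as suggested above:
# Remove the stale datanode storage, recreate it empty, then restart the daemon.
rm -rf /home/hadoop/dfs/data
mkdir -p /home/hadoop/dfs/data
./sbin/hadoop-daemon.sh start datanode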
The steps above should recreate the folder and resolve the problem.
Please share your config info if the instructions above do not work.
The DataNode dies because of incompatible clusterIDs. To fix this problem:
If you are using Hadoop 2.x, you have to delete everything in the folder that you have specified in hdfs-site.xml under "dfs.datanode.data.dir" (but NOT the folder itself).
The clusterID is maintained in that folder. Delete the contents and restart dfs.sh. This should work!
You need to delete both the
C:\hadoop\data\dfs\datanode and
C:\hadoop\data\dfs\namenode folders.
If you don't have these folders, open your C:\hadoop\etc\hadoop\hdfs-site.xml file and get the paths of these folders for deletion. For me it says:
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/hadoop/data/dfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/hadoop/data/dfs/datanode</value>
</property>
Run the command to format the namenode: c:\hadoop\bin>hdfs namenode -format
Now it should work!
I think the recommended way of doing this without deleting the data directory is to simply change the clusterID variable in the datanode's VERSION file.
If you look in your daemons directory, you will see the datanode directory, for example:
data/hadoop/daemons/datanode
The VERSION file should look like this.
cat current/VERSION
#Tue Oct 14 17:31:58 CDT 2014
storageID=DS-23bf7f3a-085c-4531-808f-801ff6d52d14
clusterID=CID-bb3547b0-03e4-4588-ac25-f0299ff81e4f
cTime=0
datanodeUuid=63154929-ae68-4149-9f75-9a6558545041
storageType=DATA_NODE
layoutVersion=-55
You need to change the clusterID to the first value in the output of the error message, so in your case that would be CID-1745a89c-fb08-40f0-a14d-d37d01f199c3 instead of CID-bb3547b0-03e4-4588-ac25-f0299ff81e4f.
The updated VERSION file should appear like this, with the altered clusterID:
cat current/VERSION
#Tue Oct 14 17:31:58 CDT 2014
storageID=DS-23bf7f3a-085c-4531-808f-801ff6d52d14
clusterID=CID-1745a89c-fb08-40f0-a14d-d37d01f199c3
cTime=0
datanodeUuid=63154929-ae68-4149-9f75-9a6558545041
storageType=DATA_NODE
layoutVersion=-55
Restart Hadoop and the datanode should start just fine.
