Hadoop: slaves in service but doing nothing at all

I set up a hadoop cluster and started a MapReduce job on the cluster.
The master node is actively running, but the slaves are doing nothing at all.
Running jps on a slave node produces:
20390 DataNode
20492 NodeManager
21256 Jps
Here is the screen cast:
The next to last row corresponds to the master node.
So why are the slaves using no blocks?
Also, running top on the master node shows the Java process (hadoop jar jar-file.jar args) taking almost 100% of the CPU. However, no such process exists on any of the slave machines.
That is why I think the slaves are idle, doing nothing at all.
Here is one example of the slave datanode log:
2014-07-24 23:28:01,302 INFO org.apache.hadoop.util.GSet: Computing capacity for map BlockMap
2014-07-24 23:28:01,302 INFO org.apache.hadoop.util.GSet: VM type = 64-bit
2014-07-24 23:28:01,304 INFO org.apache.hadoop.util.GSet: 0.5% max memory 889 MB = 4.4 MB
2014-07-24 23:28:01,304 INFO org.apache.hadoop.util.GSet: capacity = 2^19 = 524288 entries
2014-07-24 23:28:01,304 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Periodic Block Verification Scanner initialized with interval 504 hours for block pool BP-1752077220-193.167.138.8-1406217332464
2014-07-24 23:28:01,310 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Added bpid=BP-1752077220-193.167.138.8-1406217332464 to blockPoolScannerMap, new size=1
2014-07-24 23:31:01,116 INFO org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool BP-1752077220-193.167.138.8-1406217332464 Total blocks: 0, missing metadata files:0, missing block files:0, missing blocks in memory:0, mismatched blocks:0
And nothing more.
However, for the master data node, the log file contains lines like the following:
2014-07-24 22:27:23,443 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1752077220-193.167.138.8-1406217332464:blk_1073742749_1925 src: /193.167.138.8:44210 dest: /193.167.138.8:50010
which I think means the node is receiving tasks and processing the data.
The following is from the YARN log file of one of the slave nodes:
2014-07-24 23:28:13,811 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:8042
2014-07-24 23:28:13,812 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app /node started at 8042
2014-07-24 23:28:14,122 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules
2014-07-24 23:28:14,130 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at ugluk/193.167.138.8:8031
2014-07-24 23:28:14,176 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registering with RM using finished containers :[]
2014-07-24 23:28:14,366 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Rolling master-key for container-tokens, got key with id 1336429163
2014-07-24 23:28:14,369 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM: Rolling master-key for nm-tokens, got key with id :1986181585
2014-07-24 23:28:14,370 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as shagrat.hiit.fi:48662 with total resource of <memory:8192, vCores:8>
2014-07-24 23:28:14,370 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Notifying ContainerManager to unblock new container-requests
I am using Hadoop 2.4.0

It seems that you formatted the namenode more than once.
Block pool ID errors are mostly caused by formatting the namenode multiple times.
Every time you format a namenode, the block pool ID, the cluster ID and the namespace ID change.
So first check these attributes on the namenode, the datanodes and the secondary namenode.
You can check them in the VERSION file in the current directory of each node. To find it, first look up where the node's storage directory is configured in hdfs-site.xml.
Go to that path, look for the current directory and make the necessary changes.
Please let me know if this helps.
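To make the comparison less error-prone, here is a minimal Python sketch of the check described above. It assumes you have already copied the text of each node's current/VERSION file somewhere you can read it; the function names and the node labels are made up for illustration and are not part of Hadoop.

```python
def parse_version(text):
    """Parse the key=value lines of a Hadoop VERSION file into a dict."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the leading timestamp comment
        key, _, value = line.partition("=")
        props[key] = value
    return props


def cluster_id_mismatches(namenode_version, datanode_versions):
    """Return the labels of datanodes whose clusterID differs from the namenode's.

    namenode_version: text of the namenode's VERSION file.
    datanode_versions: dict mapping a (hypothetical) node label to VERSION text.
    """
    expected = parse_version(namenode_version).get("clusterID")
    return [
        label
        for label, text in datanode_versions.items()
        if parse_version(text).get("clusterID") != expected
    ]
```

Any label returned by cluster_id_mismatches points at a datanode whose storage directory was initialised against a different namenode format.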

Related

Failed to start namenode.java.lang.IllegalStateException

I am using an Apache Hadoop 2.7.1 high-availability cluster that consists of
two namenodes (mn1, mn2) and 3 journal nodes.
While working on the cluster I ran into the following error:
when I run start-dfs.sh, mn1 becomes standby and mn2 becomes active,
but after that, if one of these two namenodes goes down, it cannot
be started again.
Here are the last lines of the log from one of the two namenodes:
2017-08-05 09:37:21,063 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Need to save fs image? false (staleImage=true, haEnabled=true, isRollingUpgrade=false)
2017-08-05 09:37:21,063 INFO org.apache.hadoop.hdfs.server.namenode.NameCache: initialized with 3 entries 72 lookups
2017-08-05 09:37:21,088 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 7052 msecs
2017-08-05 09:37:21,300 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: RPC server is binding to mn2:8020
2017-08-05 09:37:21,304 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue class java.util.concurrent.LinkedBlockingQueue
2017-08-05 09:37:21,316 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8020
2017-08-05 09:37:21,353 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemState MBean
2017-08-05 09:37:21,354 WARN org.apache.hadoop.hdfs.server.common.Util: Path /opt/hadoop/metadata_dir should be specified as a URI in configuration files. Please update hdfs configuration.
2017-08-05 09:37:21,361 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.lang.IllegalStateException
at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
at org.apache.hadoop.hdfs.server.namenode.LeaseManager.getNumUnderConstructionBlocks(LeaseManager.java:119)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getCompleteBlocksTotal(FSNamesystem.java:5741)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startCommonServices(FSNamesystem.java:1063)
at org.apache.hadoop.hdfs.server.namenode.NameNode.startCommonServices(NameNode.java:678)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:664)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:811)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:795)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1488)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1554)
2017-08-05 09:37:21,364 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2017-08-05 09:37:21,365 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at mn2/192.168.25.22
************************************************************/
This may be because:
1. The namenode port may be different on each node.
This is a particularly vexing problem.
Swallow IllegalStateExceptions thrown by removeShutdownHook in FileSystem. The javadoc states:
public boolean removeShutdownHook(Thread hook)
Throws:
IllegalStateException - If the virtual machine is already in the process of shutting down
So if we are getting this exception, it MEANS we are already in the process of shutdown, so we CANNOT, try what we may, removeShutdownHook. If Runtime had a method Runtime.isShutdownInProgress(), we could have checked for it before the removeShutdownHook call. As it stands, there is no such method. In my opinion, this would be a good patch regardless of the needs for this JIRA.
Not send SIGTERMs from the NM to the MR-AM in the first place. Rather, we should expose a mechanism for the NM to politely tell the AM it is no longer needed and should shut down ASAP. Even after this, if an admin were to kill the MRAppMaster with a SIGTERM, the JobHistory would be lost, defeating the purpose of 3614.
I discovered that my problem was in the journal node and not in the namenode,
even though the namenode log shows the error mentioned in the question.
jps lists the journal node, but that is misleading: the journal node service is actually down
even though it appears in the jps output.
As a solution I ran hadoop-daemon.sh stop journalnode
and then hadoop-daemon.sh start journalnode,
after which the namenode started working again.
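Since jps can list a process that is no longer serving requests, a quick way to double-check is to see whether the journal node still accepts TCP connections. The following is only a rough Python sketch of that idea; the helper name is made up, and which host and port to probe depends on your dfs.journalnode.rpc-address setting (8485 is the usual default, but treat that as an assumption about your cluster).

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False
```

For example, port_is_open("jn1", 8485) would probe a hypothetical journal node host named jn1. A False result while jps still shows JournalNode suggests the same situation as above. An open port only shows that something is listening, not that the journal node is healthy.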

CDH upgrade from 5.1 to 5.3

After I finished all the distribution and activation steps on the manager website,
I got the following error when I restarted the cluster:
2016-07-14 14:51:12,335 INFO org.mortbay.log: Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@UT190320.shis.uth.tmc.edu:50070
2016-07-14 14:51:12,436 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Stopping NameNode metrics system...
2016-07-14 14:51:12,436 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system stopped.
2016-07-14 14:51:12,436 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system shutdown complete.
2016-07-14 14:51:12,436 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.io.IOException:
File system image contains an old layout version -55.
An upgrade to version -59 is required.
Please restart NameNode with the "-rollingUpgrade started" option if a rolling upgrade is already started; or restart NameNode with the "-upgrade" option to start a new upgrade.
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:232)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1006)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:736)
at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:553)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:609)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:776)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:760)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1466)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1534)
2016-07-14 14:51:12,439 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
You will need to perform the upgrade as the error message suggests. It is not clear exactly what you did, but I suggest you follow the documentation at http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_earlier_cdh5_upgrade.html
sudo service hadoop-hdfs-namenode upgrade is possibly what you need.

Hadoop 2 node Cluster Communication Query

I have a 2-node Hadoop cluster (master and slave). Both nodes are up and running, as I can check their health on localhost:50070.
I copied a 150 MB folder of plain text files into HDFS on the master. Then I ran the following command:
hadoop jar hadoop-mapreduce-examples-2.6.0.jar wordcount /In/ /Out/
The issue is that I get the same execution time as when running the command on a single node. To me it seems like the nodes are not doing any work in parallel!
Checking the logs on the slave, I see the following:
2015-03-18 23:52:49,455 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1680309327-31.220.211.10-1426721698684:blk_1073741856_1032 src: /31.220.211.10:46035 dest: /31.220.211.35:50010
2015-03-18 23:52:51,191 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /31.220.211.10:46035, dest: /31.220.211.35:50010, bytes: 3796560, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_688133940_1, offset: 0, srvID: fbea19bb-06ee-4868-af5c-0cb9699064f3, blockid: BP-1680309327-31.220.211.10-1426721698684:blk_1073741856_1032, duration: 1734807025
2015-03-18 23:52:51,191 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1680309327-31.220.211.10-1426721698684:blk_1073741856_1032, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2015-03-18 23:52:59,733 INFO org.apache.hadoop.hdfs.server.datanode.BlockPoolSliceScanner: Verification succeeded for BP-1680309327-31.220.211.10-1426721698684:blk_1073741856_1032
And on the Master:
15/03/18 23:52:50 INFO mapred.Task: Task 'attempt_local1934686363_0001_r_000000_0' done.
15/03/18 23:52:50 INFO mapred.LocalJobRunner: Finishing task: attempt_local1934686363_0001_r_000000_0
15/03/18 23:52:50 INFO mapred.LocalJobRunner: reduce task executor complete.
15/03/18 23:52:50 INFO mapreduce.Job: map 100% reduce 100%
15/03/18 23:52:50 INFO mapreduce.Job: Job job_local1934686363_0001 completed successfully
15/03/18 23:52:51 INFO mapreduce.Job: Counters: 38
Is this normal? Why am I told that both my nodes are alive when the wordcount example does not run in parallel, but instead behaves as if everything runs locally?
I can't seem to find an answer to this problem, so I would be very happy if I could get some help.
The problem was that even though both my nodes were recognised as alive, the job was still running locally.
That was because yarn-site.xml was missing this property:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
</property>
I also triple-checked that all the config files are the same on all the nodes. After everything was checked carefully, the job ran across the whole cluster.
Another thing to pay attention to when configuring the cluster is that Hadoop 1.x and Hadoop 2.x don't share the same configuration parameters.
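If you want to verify on each node that the property is really there, something like the following Python sketch could help. It only uses the standard library XML parser; the function name and the SAMPLE text are illustrative, and in practice you would read the text of your own yarn-site.xml instead.

```python
import xml.etree.ElementTree as ET


def get_property(config_xml, name):
    """Return the value of a Hadoop <property> by name, or None if it is absent."""
    root = ET.fromstring(config_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Illustrative config text mirroring the property from the answer above.
SAMPLE = """<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
</configuration>"""

print(get_property(SAMPLE, "yarn.resourcemanager.hostname"))  # prints: master
```

Running the same check against the file from every node is a cheap way to confirm that the configs really match.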

Hortonworks HA Namenodes gives an error "Operation category READ is not supported in state standby"

The active namenode (host1) of my Hadoop HA cluster suddenly switched over to the standby namenode (host2). I could not find any error in the Hadoop logs (on any server) that identifies the root cause.
After the namenodes switched, the following error appeared frequently in the HDFS logs and none of the applications could read HDFS files.
2014-07-17 01:58:53,381 WARN namenode.FSNamesystem (FSNamesystem.java:getCorruptFiles(6769)) - Get corrupt file blocks returned error: Operation category READ is not supported in state standby
Once I restart the new active node (host2), the active role switches back to the new standby node (host1). Then the cluster works as normal and users can retrieve HDFS files again.
I'm using Hortonworks 2.1.2.0 and HDFS version 2.4.0.2.1
Edit: 21st July 2014
The following entries were found in the active namenode's logs when the active-standby switch happened:
NT_SETTINGS-1675610.csv dst=null perm=null
2014-07-20 09:06:44,746 INFO FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7755)) - allowed=true ugi=storm (auth:SIMPLE) ip=/10.0.1.50 cmd=getfileinfo src=/user/tungsten/staging/LEAPSET/PRODUCTS/PRODUCTS-1380186.csv dst=null perm=null
2014-07-20 09:06:44,747 INFO FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7755)) - allowed=true ugi=storm (auth:SIMPLE) ip=/10.0.1.50 cmd=getfileinfo src=/user/tungsten/staging/LEAPSET/MERCHANT_SETTINGS/MERCHANT_SETTINGS-1695794.csv dst=null perm=null
2014-07-20 09:06:44,747 INFO FSNamesystem.audit (FSNamesystem.java:logAuditMessage(7755)) - allowed=true ugi=storm (auth:SIMPLE) ip=/10.0.1.50 cmd=getfileinfo src=/user/tungsten/staging/LEAPSET/PRODUCTS/PRODUCTS-1399541.csv dst=null perm=null
2014-07-20 09:06:44,748 INFO namenode.FSNamesystem (FSNamesystem.java:stopActiveServices(1095)) - Stopping services started for active state
2014-07-20 09:06:44,750 INFO namenode.FSEditLog (FSEditLog.java:endCurrentLogSegment(1153)) - Ending log segment 842249
2014-07-20 09:06:44,752 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(673)) - Number of transactions: 2 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 4 35
2014-07-20 09:06:44,774 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(673)) - Number of transactions: 2 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 2 SyncTimes(ms): 24 37
2014-07-20 09:06:44,805 INFO namenode.FSNamesystem (FSNamesystem.java:run(4362)) - NameNodeEditLogRoller was interrupted, exiting
2014-07-20 09:06:44,824 INFO namenode.FileJournalManager (FileJournalManager.java:finalizeLogSegment(130)) - Finalizing edits file /ebs/hadoop/hdfs/namenode/current/edits_inprogress_0000000000000842249 -> /ebs/hadoop/hdfs/namenode/current/edits_0000000000000842249-0000000000000842250
2014-07-20 09:06:44,874 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(168)) - Shutting down CacheReplicationMonitor
2014-07-20 09:06:44,876 INFO namenode.FSNamesystem (FSNamesystem.java:startStandbyServices(1136)) - Starting services required for standby state
2014-07-20 09:06:44,927 INFO ha.EditLogTailer (EditLogTailer.java:(117)) - Will roll logs on active node at hadoop-client-us-west-1b/10.0.254.10:8020 every 120 seconds.
2014-07-20 09:06:44,929 INFO ha.StandbyCheckpointer (StandbyCheckpointer.java:start(129)) - Starting standby checkpoint thread...
Checkpointing active NN at http:// hadoop-client-us-west-1b:50070
Serving checkpoints at http:// hadoop-client-us-west-1a:50070
2014-07-20 09:06:44,930 INFO ipc.Server (Server.java:run(2027)) - IPC Server handler 3 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 10.0.1.50:57297 Call#8431877 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-07-20 09:06:44,930 INFO ipc.Server (Server.java:run(2027)) - IPC Server handler 16 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 10.0.1.50:57294 Call#130105071 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
2014-07-20 09:06:44,940 INFO ipc.Server (Server.java:run(2027)) - IPC Server handler 14 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getFileInfo from 10.0.1.50:57294 Call#130105072 Retry#0: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
Edit: 13th August 2014
We were able to find the root cause of the namenode switch: the namenode was receiving a large number of file info requests, and then the switch happened.
But we still could not resolve the "Operation category READ is not supported in state standby" error.
Edit: 7th December 2014
We found that, as a solution, the application needs to connect manually to the currently active namenode once the previously active namenode has failed. Traffic for namenodes in HA mode is not automatically directed to the active node.
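One way an application could look up the currently active namenode before connecting is sketched below in Python. This is only an illustration, not what we actually deployed: it shells out to hdfs haadmin -getServiceState, the IDs nn1 and nn2 are hypothetical placeholders for whatever is listed under dfs.ha.namenodes.<nameservice> in your hdfs-site.xml, and the helper names are made up.

```python
import subprocess


def haadmin_state(namenode_id):
    """Ask `hdfs haadmin -getServiceState` for one NameNode's HA state."""
    result = subprocess.run(
        ["hdfs", "haadmin", "-getServiceState", namenode_id],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip()


def find_active(namenode_ids, get_state=haadmin_state):
    """Return the first NameNode ID whose reported state is 'active', else None."""
    for nn_id in namenode_ids:
        if get_state(nn_id) == "active":
            return nn_id
    return None
```

Calling find_active(["nn1", "nn2"]) on a machine with the hdfs command on its PATH would return the ID to connect to, or None if neither namenode reports itself as active. The get_state parameter is there so the lookup can be swapped out or tested without a cluster.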
I had the same issue. You need to update the client libraries. Use Ambari to set up Spark and have it install the client on the server. Then set your SPARK_HOME environment variable.

Hadoop: Datanode process killed

I am currently using Hadoop-2.0.3-alpha. HDFS worked perfectly at first (copying files into HDFS, successful runs from an external framework, using the web frontend), but after restarting my VM the datanode process stops after a while. The namenode process and all YARN processes work without a problem. I installed Hadoop in a folder under an additional user, as I also still have Hadoop 0.2 installed, which worked fine too.
Looking at the datanode log file, I found the following:
2013-04-11 16:23:50,475 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-04-11 16:24:17,451 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2013-04-11 16:24:23,276 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2013-04-11 16:24:23,279 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system started
2013-04-11 16:24:23,480 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Configured hostname is user-VirtualBox
2013-04-11 16:24:28,896 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened streaming server at /0.0.0.0:50010
2013-04-11 16:24:29,239 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwith is 1048576 bytes/s
2013-04-11 16:24:38,348 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2013-04-11 16:24:44,627 INFO org.apache.hadoop.http.HttpServer: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
2013-04-11 16:24:45,163 INFO org.apache.hadoop.http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context datanode
2013-04-11 16:24:45,164 INFO org.apache.hadoop.http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context logs
2013-04-11 16:24:45,164 INFO org.apache.hadoop.http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
2013-04-11 16:24:45,355 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened info server at 0.0.0.0:50075
2013-04-11 16:24:45,508 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: dfs.webhdfs.enabled = false
2013-04-11 16:24:45,536 INFO org.apache.hadoop.http.HttpServer: Jetty bound to port 50075
2013-04-11 16:24:45,576 INFO org.mortbay.log: jetty-6.1.26
2013-04-11 16:25:18,416 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:50075
2013-04-11 16:25:42,670 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 50020
2013-04-11 16:25:44,955 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened IPC server at /0.0.0.0:50020
2013-04-11 16:25:45,483 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Refresh request received for nameservices: null
2013-04-11 16:25:47,079 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting BPOfferServices for nameservices: <default>
2013-04-11 16:25:47,660 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool <registering> (storage id unknown) service to localhost/127.0.0.1:8020 starting to offer service
2013-04-11 16:25:50,515 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2013-04-11 16:25:50,631 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50020: starting
2013-04-11 16:26:15,068 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /home/hadoop/workspace/hadoop_space/hadoop23/dfs/data/in_use.lock acquired by nodename 3099@user-VirtualBox
2013-04-11 16:26:15,720 FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-474150866-127.0.1.1-1365686732002 (storage id DS-317990214-127.0.1.1-50010-1365505141363) service to localhost/127.0.0.1:8020
java.io.IOException: Incompatible clusterIDs in /home/hadoop/workspace/hadoop_space/hadoop23/dfs/data: namenode clusterID = CID-1745a89c-fb08-40f0-a14d-d37d01f199c3; datanode clusterID = CID-bb3547b0-03e4-4588-ac25-f0299ff81e4f
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:391)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:191)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:219)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:850)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:821)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:280)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:222)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:664)
at java.lang.Thread.run(Thread.java:722)
2013-04-11 16:26:16,212 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-474150866-127.0.1.1-1365686732002 (storage id DS-317990214-127.0.1.1-50010-1365505141363) service to localhost/127.0.0.1:8020
2013-04-11 16:26:16,276 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-474150866-127.0.1.1-1365686732002 (storage id DS-317990214-127.0.1.1-50010-1365505141363)
2013-04-11 16:26:18,396 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2013-04-11 16:26:18,940 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
2013-04-11 16:26:19,668 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at user-VirtualBox/127.0.1.1
************************************************************/
Any ideas? Maybe I made a mistake during the installation process? But it is strange that it worked once. I should also mention that when I am logged in as my additional user and run ./hadoop-daemon.sh start namenode (and the same for the datanode), I need to add sudo.
I used this installation guide: http://jugnu-life.blogspot.ie/2012/0...rial-023x.html
By the way, I use the Oracle Java-7 version.
The problem could be that the namenode was formatted after the cluster was set up while the datanodes were not, so the slaves are still referring to the old namenode.
You have to delete and recreate the folder /home/hadoop/dfs/data on the datanode's local filesystem.
Check your hdfs-site.xml file to see where dfs.data.dir points to,
delete that folder,
and then restart the datanode daemon on the machine.
The steps above should recreate the folder and resolve the problem.
Please share your config info if the instructions above do not work.
The DataNode dies because of incompatible clusterIDs. To fix this problem:
If you are using Hadoop 2.x, delete everything in the folder that you specified in hdfs-site.xml as "dfs.datanode.data.dir" (but NOT the folder itself).
The clusterID is stored in that folder. Delete its contents and restart HDFS (start-dfs.sh). This should work!
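If you would rather script the "delete the contents but keep the folder" step, a minimal Python sketch could look like the one below. The function name is made up, and the demonstration runs against a throwaway temporary directory, not a real datanode path. Be aware that pointing it at your actual dfs.datanode.data.dir permanently removes the block data stored there.

```python
import shutil
import tempfile
from pathlib import Path


def clear_directory(path):
    """Delete everything inside `path` but keep the directory itself."""
    for entry in Path(path).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a subdirectory and its contents
        else:
            entry.unlink()  # remove a regular file or a symlink


# Demonstration on a temporary directory that mimics a datanode data dir.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "current").mkdir()
    (data_dir / "current" / "VERSION").write_text("clusterID=CID-old\n")
    (data_dir / "in_use.lock").write_text("")
    clear_directory(data_dir)
    still_there = data_dir.is_dir()
    remaining = list(data_dir.iterdir())
```

After the call the directory still exists and is empty, which is the state the datanode needs in order to re-register with the current clusterID on the next start.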
You need to delete both the
C:\hadoop\data\dfs\datanode and
C:\hadoop\data\dfs\namenode folders.
If you don't have these folders, open your C:\hadoop\etc\hadoop\hdfs-site.xml file and look up the paths of the folders to delete. For me it says:
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/hadoop/data/dfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/hadoop/data/dfs/datanode</value>
</property>
Then run the command to format the namenode:
c:\hadoop\bin>hdfs namenode -format
Now it should work!
I think the recommended way of doing this without deleting the data directory is simply to change the clusterID value in the datanode's VERSION file.
If you look in your daemons directory, you will see the datanode directory, for example
data/hadoop/daemons/datanode
The VERSION file should look like this.
cat current/VERSION
#Tue Oct 14 17:31:58 CDT 2014
storageID=DS-23bf7f3a-085c-4531-808f-801ff6d52d14
clusterID=CID-bb3547b0-03e4-4588-ac25-f0299ff81e4f
cTime=0
datanodeUuid=63154929-ae68-4149-9f75-9a6558545041
storageType=DATA_NODE
layoutVersion=-55
You need to change the clusterID to the first value in the error message, so in your case that would be CID-1745a89c-fb08-40f0-a14d-d37d01f199c3 instead of CID-bb3547b0-03e4-4588-ac25-f0299ff81e4f.
The updated file should look like this, with the altered clusterID:
cat current/VERSION
#Tue Oct 14 17:31:58 CDT 2014
storageID=DS-23bf7f3a-085c-4531-808f-801ff6d52d14
clusterID=CID-1745a89c-fb08-40f0-a14d-d37d01f199c3
cTime=0
datanodeUuid=63154929-ae68-4149-9f75-9a6558545041
storageType=DATA_NODE
layoutVersion=-55
Restart hadoop and the datanode should start just fine.
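If you have several datanodes to fix, the edit can be scripted. Here is a rough Python sketch, assuming the error message has the "namenode clusterID = ...; datanode clusterID = ..." shape shown in the question; both function names are made up for illustration. It works on text only, so reading and writing the actual VERSION file (and backing it up first) is left to you.

```python
import re


def namenode_cluster_id(error_message):
    """Extract the namenode clusterID from an 'Incompatible clusterIDs' message."""
    match = re.search(r"namenode clusterID = (CID-[0-9a-f-]+)", error_message)
    return match.group(1) if match else None


def set_cluster_id(version_text, new_cluster_id):
    """Return VERSION file text with its clusterID line replaced."""
    lines = []
    for line in version_text.splitlines():
        if line.startswith("clusterID="):
            line = "clusterID=" + new_cluster_id
        lines.append(line)  # every other line is kept unchanged
    return "\n".join(lines) + "\n"
```

Feeding the exception line from the datanode log into namenode_cluster_id and the old file contents into set_cluster_id yields the same result as the manual edit above.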
