After IP change on server java.io.IOException: replica.getGenerationStamp() - hadoop

I am loading data into HDFS using Flume. The server's IP address recently changed, and since then I have not been able to start the slaves at all. There is a lot of data on the server and the data node, so reformatting is not an option. The master makes a call to the slave and tries to start it, but the slave node does not start. Below is the exception I see, because the slave is still referring to the master's old IP:
java.io.IOException: replica.getGenerationStamp() < block.getGenerationStamp()
WARN org.apache.hadoop.hdfs.server.protocol.InterDatanodeProtocol: Failed to obtain replica info for block (=BP-967573188-192.168.XX.XX-1413284771002:blk_1073757987_17249) from datanode (=192.168.XX.XX:50010)
java.io.IOException: replica.getGenerationStamp() < block.getGenerationStamp(), block=blk_1073757987_17249, replica=ReplicaWaitingToBeRecovered, blk_1073757987_17179, RWR
getNumBytes() = 81838954
getBytesOnDisk() = 81838954
getVisibleLength()= -1
getVolume() = /var/hadoop/data/current
getBlockFile() = /var/hadoop/data/current/BP-967573188-192.168.XX.XX-1413284771002/current/rbw/blk_1073757987
unlinked=false
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:1613)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.initReplicaRecovery(FsDatasetImpl.java:1579)
at org.apache.hadoop.hdfs.server.datanode.DataNode.initReplicaRecovery(DataNode.java:2094)
at org.apache.hadoop.hdfs.server.datanode.DataNode.callInitReplicaRecovery(DataNode.java:2105)
at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:2173)
at org.apache.hadoop.hdfs.server.datanode.DataNode.access$400(DataNode.java:140)
at org.apache.hadoop.hdfs.server.datanode.DataNode$2.run(DataNode.java:2079)
at java.lang.Thread.run(Thread.java:745)
I have updated the config file /etc/hosts, but it has had no effect. Kindly suggest a fix.

Update the new IP address of the datanode in the slaves file in the Hadoop conf directory.
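A sketch with placeholder names and addresses (the real ones are redacted above): the /etc/hosts entries on every node and the slaves file on the master should both agree on the new addresses.
# /etc/hosts (hypothetical hostnames and addresses)
192.168.YY.YY   master-node
192.168.ZZ.ZZ   slave-node-1
# slaves file in the Hadoop conf directory, one worker per line
slave-node-1
After updating both, restart DFS so the datanodes re-register with the namenode at the new address.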
Regards
Jyoti Ranjan Panda

Related

Hadoop: ...be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation

I'm getting the following error when attempting to write to HDFS as part of my multi-threaded application:
could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
I've tried the top-rated answer here around reformatting but this doesn't work for me: HDFS error: could only be replicated to 0 nodes, instead of 1
What is happening is this:
My application consists of 2 threads, each configured with its own Spring Data PartitionTextFileWriter.
Thread 1 is the first to process data, and it can successfully write to HDFS.
However, once Thread 2 starts to process data, I get this error when it attempts to flush to a file.
Thread 1 and 2 will not be writing to the same file, although they do share a parent directory at the root of my directory tree.
There are no problems with disk space on my server.
I also see this in my name-node logs, but I'm not sure what it means:
2016-03-15 11:23:12,149 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
2016-03-15 11:23:12,150 WARN org.apache.hadoop.hdfs.protocol.BlockStoragePolicy: Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=1, selected=[], unavailable=[DISK], removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
2016-03-15 11:23:12,150 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2016-03-15 11:23:12,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.104.247.78:52004 Call#61 Retry#0
java.io.IOException: File /metrics/abc/myfile could only be replicated to 0 nodes instead of [2016-03-15 13:34:16,663] INFO [Group Metadata Manager on Broker 0]: Removed 0 expired offsets in 1 milliseconds. (kafka.coordinator.GroupMetadataManager)
What could be the cause of this error?
Thanks
This error is raised by the HDFS block replication system when it cannot manage to make any copy of a specific block of the file in question. Common reasons for this:
Only a NameNode instance is running, and it is not in safe mode
There are no DataNode instances up and running, or some are dead (check the servers)
The NameNode and DataNode instances are both running, but they cannot communicate with each other, i.e. there is a connectivity issue between the DataNode and NameNode instances
Running DataNode instances are not able to talk to the server because of networking or Hadoop-related issues (check the logs that include DataNode info)
There is no disk space in the data directories configured for the DataNode instances, or the DataNode instances have run out of space (check dfs.data.dir and delete old files if any)
The reserved space specified for DataNode instances in dfs.datanode.du.reserved is larger than the free space, which makes the DataNode instances conclude there is not enough free space
There are not enough handler threads for the DataNode instances (check the DataNode logs and the dfs.datanode.handler.count value)
Make sure dfs.data.transfer.protection is not set to "authentication" while dfs.encrypt.data.transfer is set to true (a small hdfs-site.xml sketch of these properties follows this list)
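A minimal hdfs-site.xml sketch of the properties named above; the values are illustrative defaults, not recommendations:
<configuration>
<property>
<name>dfs.datanode.du.reserved</name>
<value>1073741824</value>
</property>
<property>
<name>dfs.datanode.handler.count</name>
<value>10</value>
</property>
<!-- do not combine dfs.data.transfer.protection=authentication with this set to true -->
<property>
<name>dfs.encrypt.data.transfer</name>
<value>false</value>
</property>
</configuration>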
Also please:
Verify the status of NameNode and DataNode services and check the related logs
Verify that core-site.xml has the correct fs.defaultFS value and that hdfs-site.xml has valid values.
Verify that hdfs-site.xml has dfs.namenode.http-address.. set for all NameNode instances in case of a PHD HA configuration (see the sketch after this list).
Verify if the permissions on the directories are correct
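A minimal sketch of the two values mentioned above; the nameservice, NameNode IDs, hostnames and ports are hypothetical:
<!-- core-site.xml -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
<!-- hdfs-site.xml, HA case -->
<property>
<name>dfs.namenode.http-address.mycluster.nn1</name>
<value>namenode1.example.com:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.nn2</name>
<value>namenode2.example.com:50070</value>
</property>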
Ref: https://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo
Ref: https://support.pivotal.io/hc/en-us/articles/201846688-HDFS-reports-Configured-Capacity-0-0-B-for-datanode
Also, please check: Writing to HDFS from Java, getting "could only be replicated to 0 nodes instead of minReplication"
Another reason could be that your Datanode machine hasn't exposed its port (50010 by default). In my case, I was trying to write a file from Machine1 to HDFS running in a Docker container C1 hosted on Machine2.
For the host machine to forward requests to the services running in the container, port forwarding has to be taken care of. I was able to resolve the issue after forwarding port 50010 from the host machine to the guest machine.
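For example, when running the datanode container you would publish that port to the host (the image name is hypothetical; other datanode ports such as 50020 and 50075 may also be needed):
docker run -p 50010:50010 my-hadoop-datanode-image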
Check whether the jps command on the machines that run the datanodes shows that the datanodes are running. If they are running, it means they could not connect to the namenode, and hence the namenode thinks there are no datanodes in the Hadoop system.
In that case, after running start-dfs.sh, run netstat -ntlp on the master node. 9000 is the port number most tutorials tell you to specify in core-site.xml. So if you see a line like this in the output of netstat
tcp 0 0 127.0.1.1:9000 0.0.0.0:* LISTEN 4209/java
then you have a problem with the host alias. I had the same problem, so I'll state how it was resolved.
This is the contents of my core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://vm-sm:9000</value>
</property>
</configuration>
So the vm-sm alias on the master machine maps to 127.0.1.1. This is because of the setup of my /etc/hosts file:
127.0.0.1 localhost
127.0.1.1 vm-sm
192.168.1.1 vm-sm
192.168.1.2 vm-sw1
192.168.1.3 vm-sw2
It looks like the Hadoop system on the master ended up binding to 127.0.1.1:9000 (via the vm-sm alias in core-site.xml), while the worker nodes were trying to connect through 192.168.1.1:9000.
So I had to change the alias of the master node for the hadoop system (just removed the hyphen) in the /etc/hosts file
127.0.0.1 localhost
127.0.1.1 vm-sm
192.168.1.1 vmsm
192.168.1.2 vm-sw1
192.168.1.3 vm-sw2
and reflected the change in the core-site.xml, mapred-site.xml, and slave files (wherever the old alias of the master occurred).
After deleting the old hdfs files from the hadoop location as well as the tmp folder and restarting all nodes, the issue was solved.
Now, netstat -ntlp after starting DFS returns
tcp 0 0 192.168.1.1:9000 0.0.0.0:* LISTEN ...
...
I had the same error; restarting the HDFS services, i.e. the NameNode and DataNode services, solved the issue.
In my case the storage policy of the output path was set to COLD.
How to check settings of your folder:
hdfs storagepolicies -getStoragePolicy -path my_path
In my case it returned
The storage policy of my_path
BlockStoragePolicy{COLD:2, storageTypes=[ARCHIVE], creationFallbacks=[], replicationFallbacks=[]}
I dumped the data elsewhere (to HOT storage) and the issue went away.
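Alternatively, the policy of the path can be switched back and the write retried; the same hdfs storagepolicies tool used above supports this:
hdfs storagepolicies -setStoragePolicy -path my_path -policy HOT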
You may leave HDFS safe mode:
hdfs dfsadmin -safemode forceExit
I had a similar issue recently. As my datanodes (only) had SSDs for storage, I put [SSD]file:///path/to/data/dir for the dfs.datanode.data.dir configuration. Due to the logs containing unavailableStorages=[DISK] I removed the [SSD] tag, which solved the problem.
Apparently, Hadoop uses [DISK] as the default storage type and does not 'fall back' (or rather 'fall up') to using SSD if no [DISK]-tagged storage location is available. I could not find any documentation on this behaviour, though.
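For reference, the change amounts to dropping the storage-type prefix from the data-dir entry in hdfs-site.xml (the path is the placeholder used above):
<property>
<name>dfs.datanode.data.dir</name>
<!-- was: [SSD]file:///path/to/data/dir -->
<value>file:///path/to/data/dir</value>
</property>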
I too had the same error; then I changed the block size, and that resolved the problem.
In my case the problem was Hadoop's temporary files.
Logs were showing the following error:
2019-02-27 13:52:01,079 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /tmp/hadoop-i843484/dfs/data/in_use.lock acquired by nodename 28111#slel00681841a
2019-02-27 13:52:01,087 WARN org.apache.hadoop.hdfs.server.common.Storage: java.io.IOException: Incompatible clusterIDs in /tmp/hadoop-i843484/dfs/data: namenode clusterID = CID-38b0104b-d3d2-4088-9a54-44b71b452006; datanode clusterID = CID-8e121bbb-5a08-4085-9817-b2040cd399e1
I solved it by removing the Hadoop tmp files:
sudo rm -r /tmp/hadoop-*
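To keep /tmp cleanups from breaking the cluster again, you can move Hadoop's working directories off /tmp; a minimal core-site.xml sketch, where the path is an assumption:
<property>
<name>hadoop.tmp.dir</name>
<value>/var/hadoop/tmp</value>
</property>
Unless dfs.namenode.name.dir and dfs.datanode.data.dir are set explicitly, the namenode and datanode directories default to locations under hadoop.tmp.dir.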
Got this error because the Data Node was not running. To resolve this on my VM:
Removed the Name/Data Node directories
Re-created the directories
Formatted the name node and data node (the latter is not required): hadoop namenode -format
Restarted the service: start-dfs.sh
Now jps shows both the Name and Data nodes, and the Sqoop job worked successfully
Maybe the number of DataNodes is too small (fewer than 3). I put 3 IP addresses in hadoop/etc/hadoop/slaves, and it works!
1. Check your firewall status; you can simply stop the firewall on both master and slaves: systemctl stop firewalld. This fixed my problem.
2. Delete the namenode and reformat it: delete both the namenode dir and the datanode dir (my slave machine didn't shut down normally, which broke my datanode), then call hdfs namenode -format.
3. Call jps on both master and slaves; make sure the master has a namenode and the slaves have datanodes.

cdh4.3: exception from the logs after ./start-dfs.sh, datanode and namenode fail to start

Here are the logs from hadoop-datanode-...log:
FATAL org.apache.hadoop.hdfs.server.datanode.DataNode: Initialization failed for block pool Block pool BP-1421227885-192.168.2.14-1371135284949 (storage id DS-30209445-192.168.2.41-50010-1371109358645) service to /192.168.2.8:8020
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.protocol.DisallowedDatanodeException): Datanode denied communication with namenode: DatanodeRegistration(0.0.0.0, storageID=DS-30209445-192.168.2.41-50010-1371109358645, infoPort=50075, ipcPort=50020, storageInfo=lv=-40;cid=CID-f16e4a3e-4776-4893-9f43-b04d8dc651c9;nsid=1710848135;c=0)
at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.registerDatanode(DatanodeManager.java:648)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.registerDatanode(FSNamesystem.java:3498)
My mistake: the namenode can start, but the datanode can't.
I saw this once too: the namenode server needs to do a reverse lookup request, so nslookup 192.168.2.41 should return a name. It doesn't, so 0.0.0.0 is recorded as well.
You don't need to hardcode addresses into /etc/hosts if you have DNS working correctly (i.e. the in-addr.arpa zone matches the entries in the domain zone file). But if you don't have DNS, then you need to help Hadoop out.
There seems to be a Name Resolution issue.
Datanode denied communication with namenode:
DatanodeRegistration(0.0.0.0,
storageID=DS-30209445-192.168.2.41-50010-1371109358645,
infoPort=50075, ipcPort=50020,
Here the DataNode is identifying itself as 0.0.0.0.
It looks like dfs.hosts enforcement. Can you recheck your NameNode's hdfs-site.xml configuration to make sure you are not using a dfs.hosts file?
This error may arise if the datanode that is trying to connect to the namenode is either listed in the file defined by dfs.hosts.exclude, or dfs.hosts is in use and that datanode is not listed in it. Make sure the datanode is not listed in the excludes, and, if you are using dfs.hosts, add it to the includes (see the sketch below). Restart Hadoop after that and run hadoop dfsadmin -refreshNodes.
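A sketch of the relevant hdfs-site.xml entries on the NameNode, with hypothetical file paths; the include and exclude files are plain lists of datanode hostnames, one per line:
<property>
<name>dfs.hosts</name>
<value>/etc/hadoop/conf/dfs.include</value>
</property>
<property>
<name>dfs.hosts.exclude</name>
<value>/etc/hadoop/conf/dfs.exclude</value>
</property>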
HTH
Reverse DNS lookup is required when a datanode tries to register with a namenode. I got the same exceptions with Hadoop 2.6.0 because my DNS does not allow reverse lookup.
But you can disable Hadoop's reverse lookup by setting this configuration "dfs.namenode.datanode.registration.ip-hostname-check" to false in hdfs-site.xml
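For reference, the corresponding hdfs-site.xml entry looks like this (place it inside your existing configuration element):
<property>
<name>dfs.namenode.datanode.registration.ip-hostname-check</name>
<value>false</value>
</property>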
I got this solution from here and it solved my problem.

org.apache.hadoop.hbase.PleaseHoldException: Master is initializing

I am trying to set up a multinode HBase cluster. When I do jps on the slave I get:
5780 Jps
5558 HQuorumPeer
5684 HRegionServer
1963 DataNode
2093 TaskTracker
Similarly, on the master I get:
4254 SecondaryNameNode
15226 Jps
14982 HMaster
3907 NameNode
14921 HQuorumPeer
4340 JobTracker
Everything is running properly. But when I try to create a table in the HBase shell, it gives an error:
ERROR: org.apache.hadoop.hbase.PleaseHoldException: org.apache.hadoop.hbase.PleaseHoldException: Master is initializing
regionserver log of my slave(where region server is running):
2013-06-11 13:09:53,119 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to Master server at localhost,60000,137093$
2013-06-11 13:10:53,190 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to master. Retrying. Error was:
org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is in the failed servers list: localhost/127.0.0.1:60000
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:425)
at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1124)
at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:974)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:86)
at $Proxy8.getProtocolVersion(Unknown Source)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:138)
at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:208)
at org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:2037)
at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2083)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:744)
at java.lang.Thread.run(Thread.java:722)
2013-06-11 13:10:53,391 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to Master server at localhost,60000,137093$
FYI, I have also taken care of the /etc/hosts file on both master and slave:
127.0.0.1 localhost
127.0.0.1 naresh-PC
I then changed the /etc/hosts file again, mapping 127.0.1.1 to naresh-PC, but I am still getting this error:
2013-06-11 14:51:17,781 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting connect to Master server at naresh-pc,60000,137094$
2013-06-11 14:52:17,817 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to connect to master. Retrying. Error was:
java.net.UnknownHostException: unknown host: naresh-pc
at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.<init>(HBaseClient.java:276)
at org.apache.hadoop.hbase.ipc.HBaseClient.createConnection(HBaseClient.java:255)
at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1111)
at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:974)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:86)
at $Proxy8.getProtocolVersion(Unknown Source)
at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:138)
at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:208)
at org.apache.hadoop.hbase.regionserver.HRegionServer.getMaster(HRegionServer.java:2037)
at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2083)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:744)
at java.lang.Thread.run(Thread.java:722)
Try clearing all the states in Zookeeper.
Stop Zookeeper
Wipe the Zookeeper data directory
Start Zookeeper
I was getting the same issue and followed this approach and it worked fine.
You need to change the configuration on the slave node to point at the master. It is currently pointing to localhost and not connecting to the actual master:
"org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This
server is in the failed servers list: localhost/127.0.0.1:60000 at "
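In practice this usually means pointing the region server's hbase-site.xml at the real ZooKeeper quorum instead of localhost, since region servers locate the master through ZooKeeper. A sketch, where master-host is a hypothetical name that must resolve from the slave:
<property>
<name>hbase.zookeeper.quorum</name>
<value>master-host</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>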
I'm hosting my own cluster inside Docker. Here's what worked in my case. I grepped the HBase log file for errors and found "Master passed us a different hostname to use"
[root@docker-iop bin]# grep ERROR /var/log/hbase/hbase-hbase-regionserver-bi-mgmt01.local.log
2016-10-06 00:05:29,816 ERROR [regionserver/bi-mgmt01.local/111.11.2.3:16020] regionserver.HRegionServer: Master passed us a different hostname to use; was=my-host-name, but now=111.22.33.444
I mapped my-host-name to 111.22.333.444 in my hosts file, restarted HBase and it worked.
I also had the same issue with a fully distributed hbase cluster with the configuration below.
Master Node (Node-A)
Backup Masters ($HBASE_HOME/conf/backup-masters) (Node-B & Node-C)
3 Replication servers (Node-A, Node-B & Node-C)
RCA:
An attempt was made to start the backup-master nodes when the cluster started.
Solution
I removed the backup masters by making $HBASE_HOME/conf/backup-masters empty in all hbase nodes.
So I had a cluster running without backup masters.
I wonder whether the master node and backup master nodes must not also function as regionservers? The HBase documentation says otherwise, though.
I came across the same issue and could not find anything; it turned out I was copy-pasting from the HBase documentation (https://hbase.apache.org/book.html#shell_exercises). I believe some character in there may be causing the error, so try entering the command manually:
create 'test', 'cf'
We resolved this issue. The solution is to:
stop HBase
log in to zookeeper-client as root
execute the command rmr /hbase-unsecure/meta-region-server
start HBase
We stop/start HBase through the Ambari UI, and delete /hbase... through the server bash shell:
[root@s1 ~]# zookeeper-client
Connecting to localhost:2181
.......
[zk: localhost:2181(CONNECTED) 0] rmr /hbase-unsecure/meta-region-server
I use docker/docker-compose to set up my distributed HBase; after I made changes, I could not create a table in the HBase shell.
I removed all the related containers and images (docker rm) and rebuilt them; that worked. Simply rebuilding the images did not.

How do you establish a single-node Hadoop instance on AWS using Apache Whirr?

I am attempting to run a single-node instance of Hadoop on Amazon Web Services using Apache Whirr. I set whirr.instance-templates equal to 1 jt+nn+dn+tt. The instance starts up fine. I am able to create directories, but when I try to put files, I get a File could only be replicated to 0 nodes, instead of 1 error. When I do a hadoop fsck / I get an Exception in thread "main" java.net.ConnectException: Connection refused error. Does anyone know what is wrong with my configuration?
In my experience, Whirr does not always start all services reliably. It sounds like the namenode started (the namenode is responsible for storing directory information) but the datanode did not (the datanode stores the data).
Try running
hadoop dfsadmin -report
to see if a datanode is available.
If not: often it helps to restart the cluster.

HDFS error: could only be replicated to 0 nodes, instead of 1

I've created an Ubuntu single-node hadoop cluster in EC2.
Testing a simple file upload to hdfs works from the EC2 machine, but doesn't work from a machine outside of EC2.
I can browse the filesystem through the web interface from the remote machine, and it shows one datanode which is reported as in service. I have opened all TCP ports in the security group from 0 to 60000(!) so I don't think it's that.
I get the error
java.io.IOException: File /user/ubuntu/pies could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1448)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:690)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:342)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1350)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1346)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1344)
at org.apache.hadoop.ipc.Client.call(Client.java:905)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:198)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:928)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:811)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)
The namenode log just gives the same error. The others don't seem to have anything interesting.
Any ideas?
Cheers
WARNING: The following will destroy ALL data on HDFS. Do not execute the steps in this answer unless you do not care about destroying existing data!!
You should do this:
stop all hadoop services
delete dfs/name and dfs/data directories
hdfs namenode -format (answer with a capital Y)
start hadoop services
Also, check the diskspace in your system and make sure the logs are not warning you about it.
This is your issue: the client can't communicate with the Datanode, because the IP that the client received for the Datanode is an internal IP rather than the public IP. Take a look at this:
http://www.hadoopinrealworld.com/could-only-be-replicated-to-0-nodes/
Look at the source code of DFSClient$DFSOutputStream (Hadoop 1.2.1):
//
// Connect to first DataNode in the list.
//
success = createBlockOutputStream(nodes, clientName, false);
if (!success) {
  LOG.info("Abandoning " + block);
  namenode.abandonBlock(block, src, clientName);
  if (errorIndex < nodes.length) {
    LOG.info("Excluding datanode " + nodes[errorIndex]);
    excludedNodes.add(nodes[errorIndex]);
  }
  // Connection failed. Let's wait a little bit and retry
  retry = true;
}
The key thing to understand here is that the Namenode only provides the list of Datanodes on which to store the blocks. The Namenode does not write the data to the Datanodes; it is the job of the client to write the data to the Datanodes using the DFSOutputStream. Before any write can begin, the above code makes sure that the client can communicate with the Datanode(s); if communication with a Datanode fails, that Datanode is added to excludedNodes.
Look at the following: seeing this exception (could only be replicated to 0 nodes, instead of 1) means no datanode is available to the Name Node.
These are the cases in which a Data Node may be unavailable to the Name Node:
The Data Node disk is full
The Data Node is busy with block reports and block scanning
The block size is a negative value (dfs.block.size in hdfs-site.xml)
The primary datanode goes down while a write is in progress (any network fluctuations between the Name Node and Data Node machines)
Whenever we append a partial chunk and call sync, on subsequent partial-chunk appends the client should keep the previous data in its buffer.
For example, after appending "a" I called sync; when I then try to append, the buffer should have "ab".
On the server side, when the chunk is not a multiple of 512, it will try to compare the CRC of the data present in the block file with the CRC present in the meta file. But while constructing the CRC for the data in the block, it always compares only up to the initial offset. For more analysis, please see the data node logs.
Reference: http://www.mail-archive.com/hdfs-user#hadoop.apache.org/msg01374.html
I had a similar problem setting up a single-node cluster. I realized that I hadn't configured any datanode. I added my hostname to conf/slaves, and then it worked out. Hope it helps.
I'll try to describe my setup & solution:
My setup: RHEL 7, hadoop-2.7.3
I tried to setup standalone Operation first and then Pseudo-Distributed Operation where the latter failed with the same issue.
Although, when I start hadoop with:
sbin/start-dfs.sh
I got the following:
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/<user>/hadoop-2.7.3/logs/hadoop-<user>-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /home/<user>/hadoop-2.7.3/logs/hadoop-<user>-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/<user>/hadoop-2.7.3/logs/hadoop-<user>-secondarynamenode-localhost.localdomain.out
which looks promising (starting datanode... with no failures), but the datanode did not in fact exist.
Another indication was that no datanode appeared to be in operation.
I fixed the issue by doing:
rm -rf /tmp/hadoop-<user>/dfs/name
rm -rf /tmp/hadoop-<user>/dfs/data
and then start again:
sbin/start-dfs.sh
...
I had the same error on MacOS X 10.7 (hadoop-0.20.2-cdh3u0) due to data node not starting.
start-all.sh produced following output:
starting namenode, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
localhost: ssh: connect to host localhost port 22: Connection refused
localhost: ssh: connect to host localhost port 22: Connection refused
starting jobtracker, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
localhost: ssh: connect to host localhost port 22: Connection refused
After enabling ssh login via System Preferences -> Sharing -> Remote Login
it started to work.
start-all.sh output changed to following (note start of datanode):
starting namenode, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
Password:
localhost: starting datanode, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
Password:
localhost: starting secondarynamenode, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
starting jobtracker, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
Password:
localhost: starting tasktracker, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
And I think you should make sure all the datanodes are up when you copy to DFS. In some cases, that takes a while. I think that's why the 'check the health status' solution works: you go to the health status web page and wait for everything to come up. My five cents.
It took me a week to figure out the problem in my situation.
When the client (your program) asks the NameNode for a data operation, the NameNode picks a DataNode and directs the client to it by giving the client that DataNode's IP.
But when the DataNode host is configured with multiple IPs and the NameNode gives you one that your client CAN'T ACCESS, the client adds the DataNode to its exclude list and asks the NameNode for a new one; eventually all DataNodes are excluded and you get this error.
So check the nodes' IP settings before you try everything else!
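A common mitigation for the multi-homed case (not stated in the answer above, but a standard HDFS knob) is to have datanodes register and be addressed by hostname, so the client resolves a name it can actually reach. The hdfs-site.xml switches are:
<property>
<name>dfs.client.use.datanode.hostname</name>
<value>true</value>
</property>
<property>
<name>dfs.datanode.use.datanode.hostname</name>
<value>true</value>
</property>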
If all data nodes are running, one more thing to check is whether HDFS has enough space for your data. I could upload a small file but failed to upload a big file (30 GB) to HDFS. 'bin/hdfs dfsadmin -report' showed that each data node only had a few GB available.
Have you tried the recommendations from the wiki, http://wiki.apache.org/hadoop/HowToSetupYourDevelopmentEnvironment ?
I was getting this error when putting data into the dfs. The solution is strange and probably inconsistent: I erased all temporary data along with the namenode, reformatted the namenode, started everything up, and visited my "cluster's" dfs health page (http://your_host:50070/dfshealth.jsp). The last step, visiting the health page, is the only way I can get around the error. Once I've visited the page, putting and getting files in and out of the dfs works great!
Reformatting the node is not the solution. You will have to edit the start-all.sh. Start the dfs, wait for it to start completely and then start mapred. You can do this using a sleep. Waiting for 1 second worked for me. See the complete solution here http://sonalgoyal.blogspot.com/2009/06/hadoop-on-ubuntu.html.
I realize I'm a little late to the party, but I wanted to post this for future visitors of this page. I was having a very similar problem when I was copying files from local to hdfs, and reformatting the namenode did not fix the problem for me. It turned out that my namenode logs had the following error message:
2012-07-11 03:55:43,479 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-920118459-192.168.3.229-50010-1341506209533, infoPort=50075, ipcPort=50020):DataXceiver java.io.IOException: Too many open files
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.createNewFile(File.java:883)
at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.createTmpFile(FSDataset.java:491)
at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.createTmpFile(FSDataset.java:462)
at org.apache.hadoop.hdfs.server.datanode.FSDataset.createTmpFile(FSDataset.java:1628)
at org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:1514)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:113)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:381)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:171)
Apparently, this is a relatively common problem on hadoop clusters, and Cloudera suggests increasing the nofile and epoll limits (if on kernel 2.6.27) to work around it. The tricky thing is that setting nofile and epoll limits is highly system dependent. My Ubuntu 10.04 server required a slightly different configuration for this to work properly, so you may need to alter your approach accordingly.
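For concreteness, here is roughly what the nofile limit looks like; the user name and numbers are assumptions and, as noted, highly system dependent (the epoll limit is a kernel sysctl whose exact name depends on the kernel version):
# /etc/security/limits.conf, for the user that runs the DataNode
hdfs soft nofile 32768
hdfs hard nofile 65536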
Don't format the name node immediately. Try stop-all.sh and start it using start-all.sh. If the problem persists, go for formatting the name node.
Follow the steps below:
1. Stop dfs and yarn.
2. Remove datanode and namenode directories as specified in the core-site.xml.
3. Start dfs and yarn as follows:
start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver
