Hadoop: Datanodes available: 0 (0 total, 0 dead) - hadoop

Each time I run:
hadoop dfsadmin -report
I get the following output:
Configured Capacity: 0 (0 KB)
Present Capacity: 0 (0 KB)
DFS Remaining: 0 (0 KB)
DFS Used: 0 (0 KB)
DFS Used%: �%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 0 (0 total, 0 dead)
There is no data directory in my dfs/ folder.
A lock file exists in this folder: in_use.lock
The master, job tracker and data nodes are running fine.

Please check the datanode logs. They will log errors when the datanode is unable to report to the namenode. If you post those errors, people will be able to help.

I had exactly the same problem. When I checked the datanode logs, there were lots of "could not connect to master:9000" errors, and when I checked the ports on the master via netstat -ntlp I saw this in the output:
tcp 0 0 127.0.1.1:9000 ...
I realized that I should either change my master machine's name or change master in all the configs. I decided to do the first because it seemed much easier.
So I modified /etc/hosts, changed 127.0.1.1 master to 127.0.1.1 master-machine, and added an entry at the end of the file like this:
192.168.1.1 master
Then I changed master to master-machine in /etc/hostname and restarted the machine.
The problem was gone.
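For reference, a minimal sketch of what the relevant part of /etc/hosts looked like after the change (192.168.1.1 is the address from my setup; use your master's real address):
127.0.0.1 localhost
127.0.1.1 master-machine
192.168.1.1 master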

Did you check the firewall? When I use Hadoop, I turn off the firewall (iptables -F on all the nodes) and then try again.
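If you'd rather not flush every rule, here is a rough sketch of opening just the common Hadoop ports with iptables (port numbers assume default Hadoop 1.x/2.x settings; adjust to your configuration):
iptables -A INPUT -p tcp --dport 9000 -j ACCEPT    # namenode RPC (fs.default.name / fs.defaultFS)
iptables -A INPUT -p tcp --dport 50010 -j ACCEPT   # datanode data transfer
iptables -A INPUT -p tcp --dport 50020 -j ACCEPT   # datanode IPC
iptables -A INPUT -p tcp --dport 50070 -j ACCEPT   # namenode web UI
iptables -A INPUT -p tcp --dport 50075 -j ACCEPT   # datanode web UI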

This happened to us when we restarted the cluster, but after a while the datanodes were automatically detected. It could possibly be because of the block report delay time property.

Usually this comes with namespace ID mismatch errors in the datanode logs.
So delete the name dir from the master and delete the data dirs from the datanodes.
Now format the namenode and try start-dfs.sh again.
The report usually takes some time to reflect all the datanodes.
Even I was getting 0 datanodes at first, but after some time the master detected the slaves.

I had the same problem and I just solved it.
/etc/hosts of all nodes should look like this:
127.0.0.1 localhost
xxx.xxx.xxx.xxx master
xxx.xxx.xxx.xxx slave-1
xxx.xxx.xxx.xxx slave-2

Just resolved the issue by following the steps below:
Make sure the IP addresses for master and slave nodes are correct in /etc/hosts file
Unless you really need the data, run stop-dfs.sh, delete all data directories on the master/slave nodes, then run hdfs namenode -format and start-dfs.sh. This should recreate HDFS and fix the issue (a sketch of the sequence follows).
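A minimal sketch of that sequence, assuming the storage lives under /usr/local/hadoop_store (an illustrative path; check dfs.namenode.name.dir and dfs.datanode.data.dir in your hdfs-site.xml). Warning: this destroys all data in HDFS:
stop-dfs.sh
rm -rf /usr/local/hadoop_store/hdfs/namenode/*   # on the master node (path is an assumption)
rm -rf /usr/local/hadoop_store/hdfs/datanode/*   # on every slave node (path is an assumption)
hdfs namenode -format
start-dfs.sh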

Just formatting the namenode didn't work for me, so I checked the logs at $HADOOP_HOME/logs. In the secondary namenode log, I found this error:
ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint
java.io.IOException: Inconsistent checkpoint fields.
LV = -64 namespaceID = 2095041698 cTime = 1552034190786 ; clusterId = CID-db399b3f-0a68-47bf-b798-74ed4f5be097 ; blockpoolId = BP-31586866-127.0.1.1-1552034190786.
Expecting respectively: -64; 711453560; 1550608888831; CID-db399b3f-0a68-47bf-b798-74ed4f5be097; BP-2041548842-127.0.1.1-1550608888831.
at org.apache.hadoop.hdfs.server.namenode.CheckpointSignature.validateStorageInfo(CheckpointSignature.java:143)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:550)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:360)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:325)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:482)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:321)
at java.lang.Thread.run(Thread.java:748)
So I stopped Hadoop and then formatted the namenode with the specific cluster ID:
hdfs namenode -format -clusterId CID-db399b3f-0a68-47bf-b798-74ed4f5be097
This solved the problem.

There's another obscure reason this could happen as well: your datanode did not start properly, but everything else was working.
In my case, when going through the log, I found that the bound port, 50010, was already in use by SideSync (for macOS). I found this through
sudo lsof -iTCP -n -P | grep 50010
but you can use similar techniques to determine what might have already taken your well-known datanode port.
Killing this off and restarting fixed the problem.
Additionally, if you've installed Hadoop/YARN as root but have data dirs in individual home directories, and then try to run it as an individual user, you'll have to make the datanode directory public.

Related

Hadoop: ...be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation

I'm getting the following error when attempting to write to HDFS as part of my multi-threaded application
could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and no node(s) are excluded in this operation.
I've tried the top-rated answer here around reformatting but this doesn't work for me: HDFS error: could only be replicated to 0 nodes, instead of 1
What is happening is this:
My application consists of 2 threads, each configured with its own Spring Data PartitionTextFileWriter
Thread 1 is the first to process data and this can successfully write to HDFS
However, once Thread 2 starts to process data I get this error when it attempts to flush to a file
Thread 1 and 2 will not be writing to the same file, although they do share a parent directory at the root of my directory tree.
There are no problems with disk space on my server.
I also see this in my name-node logs, but not sure what it means:
2016-03-15 11:23:12,149 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) For more information, please enable DEBUG log level on org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy
2016-03-15 11:23:12,150 WARN org.apache.hadoop.hdfs.protocol.BlockStoragePolicy: Failed to place enough replicas: expected size is 1 but only 0 storage types can be selected (replication=1, selected=[], unavailable=[DISK], removed=[DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
2016-03-15 11:23:12,150 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 1 to reach 1 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2016-03-15 11:23:12,151 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 10.104.247.78:52004 Call#61 Retry#0
java.io.IOException: File /metrics/abc/myfile could only be replicated to 0 nodes instead of minReplication (=1).
What could be the cause of this error?
Thanks
This error is caused by the block replication system of HDFS, since it could not manage to make any copies of a specific block of the file in question. Common reasons for that:
Only a NameNode instance is running, and it's not in safe mode
There are no DataNode instances up and running, or some are dead (check the servers)
NameNode and DataNode instances are both running, but they cannot communicate with each other, i.e. there is a connectivity issue between the DataNode and NameNode instances
Running DataNode instances are not able to talk to the server because of networking or Hadoop-related issues (check the logs that include datanode info)
There is no disk space in the configured data directories for the DataNode instances, or the DataNode instances have run out of space (check dfs.data.dir; delete old files if any)
The reserved space specified for DataNode instances in dfs.datanode.du.reserved is more than the free space, which makes the DataNode instances conclude that there is not enough free space
There are not enough threads for the DataNode instances (check the datanode logs and the dfs.datanode.handler.count value)
Make sure dfs.data.transfer.protection is not equal to "authentication" and dfs.encrypt.data.transfer is equal to true
Also please:
Verify the status of the NameNode and DataNode services and check the related logs
Verify that core-site.xml has the correct fs.defaultFS value and that hdfs-site.xml has valid values
Verify that hdfs-site.xml has dfs.namenode.http-address.. set for all NameNode instances, in case of a PHD HA configuration
Verify that the permissions on the directories are correct
Ref: https://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo
Ref: https://support.pivotal.io/hc/en-us/articles/201846688-HDFS-reports-Configured-Capacity-0-0-B-for-datanode
Also, please check: Writing to HDFS from Java, getting "could only be replicated to 0 nodes instead of minReplication"
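As a quick sanity check for several of the points above (standard Hadoop commands, assuming the binaries are on your PATH):
jps                           # the master should list NameNode, each worker should list DataNode
hdfs dfsadmin -report         # should show a non-zero number of live datanodes
hdfs dfsadmin -safemode get   # should print 'Safe mode is OFF'
hdfs dfs -df -h /             # shows configured vs. remaining DFS capacity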
Another reason could be that your datanode machine hasn't exposed the port (50010 by default). In my case, I was trying to write a file from Machine1 to HDFS running in a Docker container C1 hosted on Machine2.
For the host machine to forward requests to the services running in the container, port forwarding has to be taken care of. I could resolve the issue after forwarding port 50010 from the host machine to the guest machine.
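For illustration, a hedged sketch of publishing the relevant ports when starting the datanode container (the image name is hypothetical; 50010/50020/50075 are the default Hadoop 2.x datanode ports):
docker run -d --name datanode \
  -p 50010:50010 -p 50020:50020 -p 50075:50075 \
  some-hadoop-datanode-image
Note that the client must also be able to resolve the datanode's advertised hostname to the host machine, otherwise it will still try to reach the container's internal address.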
Check whether the jps command on the computers which run the datanodes shows that the datanodes are running. If they are running, then it means they could not connect to the namenode, and hence the namenode thinks there are no datanodes in the Hadoop system.
In such a case, after running start-dfs.sh, run netstat -ntlp on the master node. 9000 is the port number most tutorials tell you to specify in core-site.xml. So if you see a line like this in the output of netstat
tcp 0 0 127.0.1.1:9000 0.0.0.0:* LISTEN 4209/java
then you have a problem with the host alias. I had the same problem, so I'll state how it was resolved.
These are the contents of my core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://vm-sm:9000</value>
  </property>
</configuration>
So the vm-sm alias on the master computer maps to 127.0.1.1. This is because of the setup of my /etc/hosts file:
127.0.0.1 localhost
127.0.1.1 vm-sm
192.168.1.1 vm-sm
192.168.1.2 vm-sw1
192.168.1.3 vm-sw2
It looks like the core-site.xml of the master system resolved to 127.0.1.1:9000, while the worker nodes were trying to connect through 192.168.1.1:9000.
So I had to change the alias of the master node for the Hadoop system (I just removed the hyphen) in the /etc/hosts file:
127.0.0.1 localhost
127.0.1.1 vm-sm
192.168.1.1 vmsm
192.168.1.2 vm-sw1
192.168.1.3 vm-sw2
and reflected the change in the core-site.xml, mapred-site.xml, and slaves files (wherever the old alias of the master occurred).
After deleting the old HDFS files from the Hadoop location as well as the tmp folder and restarting all nodes, the issue was solved.
Now, netstat -ntlp after starting DFS returns:
tcp 0 0 192.168.1.1:9000 0.0.0.0:* LISTEN ...
...
I had the same error; restarting the HDFS services, i.e. the NameNode and DataNode services, solved the issue.
In my case it was the storage policy of the output path, which was set to COLD.
How to check the settings of your folder:
hdfs storagepolicies -getStoragePolicy -path my_path
In my case it returned
The storage policy of my_path
BlockStoragePolicy{COLD:2, storageTypes=[ARCHIVE], creationFallbacks=[], replicationFallbacks=[]}
I dumped the data elsewhere (to HOT storage) and the issue went away.
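If you need the existing folder to be writable again, a sketch of switching it back (HOT is a standard HDFS policy name; my_path is the placeholder from above, and already-written blocks are only relocated by the mover tool):
hdfs storagepolicies -setStoragePolicy -path my_path -policy HOT
hdfs mover -p my_path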
You may leave HDFS safe mode:
hdfs dfsadmin -safemode forceExit
I had a similar issue recently. As my datanodes (only) had SSDs for storage, I put [SSD]file:///path/to/data/dir for the dfs.datanode.data.dir configuration. Due to the logs containing unavailableStorages=[DISK], I removed the [SSD] tag, which solved the problem.
Apparently, Hadoop uses [DISK] as the default storage type and does not 'fall back' (or rather 'fall up') to using SSD if no [DISK]-tagged storage location is available. I could not find any documentation on this behaviour, though.
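For reference, this is roughly what the before/after looked like in my hdfs-site.xml (the path is the placeholder from above):
<!-- before: writes failed with unavailableStorages=[DISK] -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>[SSD]file:///path/to/data/dir</value>
</property>
<!-- after: tag removed, so the location is treated as the default DISK type -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///path/to/data/dir</value>
</property>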
I had the same error too; then I changed the block size, and that resolved the problem.
In my case the problem was Hadoop's temporary files.
The logs were showing the following error:
2019-02-27 13:52:01,079 INFO org.apache.hadoop.hdfs.server.common.Storage: Lock on /tmp/hadoop-i843484/dfs/data/in_use.lock acquired by nodename 28111#slel00681841a
2019-02-27 13:52:01,087 WARN org.apache.hadoop.hdfs.server.common.Storage: java.io.IOException: Incompatible clusterIDs in /tmp/hadoop-i843484/dfs/data: namenode clusterID = CID-38b0104b-d3d2-4088-9a54-44b71b452006; datanode clusterID = CID-8e121bbb-5a08-4085-9817-b2040cd399e1
I solved it by removing the Hadoop tmp files:
sudo rm -r /tmp/hadoop-*
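A more durable fix (my own suggestion, not part of the original answer) is to move the storage out of /tmp, which many systems wipe on reboot, by setting hadoop.tmp.dir in core-site.xml to a persistent location:
<property>
  <name>hadoop.tmp.dir</name>
  <!-- illustrative path; any persistent, writable directory will do -->
  <value>/var/lib/hadoop/tmp</value>
</property>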
Got this error because the DataNode was not running. To resolve this on the VM:
Removed the NameNode/DataNode directories
Re-created the directories
Formatted the name node & data node (not required): hadoop namenode -format
Restarted the service: start-dfs.sh
Now jps shows both the NameNode and DataNode, and the Sqoop job worked successfully.
Maybe the number of your DataNodes is too small (less than 3). I put 3 IP addresses in hadoop/etc/hadoop/slaves, and it works!
1. Check your firewall status; you can simply stop the firewall on both master and slaves: systemctl stop firewalld. This fixed my problem.
2. Delete the namenode and reformat it: delete both the namenode dir and the datanode dir (my slave machine didn't shut down normally, which broke my datanode), then call hdfs namenode -format.
3. Call jps on both master and slaves; make sure the master has a NameNode and the slaves have DataNodes.

Hadoop UI shows only one Datanode

I've started a Hadoop cluster composed of one master and 4 slave nodes.
Configuration seems ok:
hduser#ubuntu-amd64:/usr/local/hadoop$ ./bin/hdfs dfsadmin -report
When I enter the NameNode UI (http://10.20.0.140:50070/), the Overview card seems OK; for example, the total capacity of all nodes sums up correctly.
The problem is that in the Datanodes card I see only one datanode.
I came across the same problem and, fortunately, I solved it. I guess it is caused by the 'localhost' hostname.
Configure a different name for each of these IPs in /etc/hosts.
Remember to restart all the machines; then things will go well.
It's because of the same hostname on both datanodes.
In your case both datanodes are registering with the namenode using the same hostname, i.e. 'localhost'. Try different hostnames; it will fix your problem.
In the UI there will be only one entry per hostname.
In the "hdfs dfsadmin -report" output you can see both.
The following tips may help you
Check the core-site.xml and ensure that the namenode hostname is correct
Check the firewall rules in namenode and datanodes and ensure that the required ports are open
Check the logs of datanodes
Ensure that all the datanodes are up and running
As @Rahul said, the problem is because of the same hostname.
Change your hostname in the /etc/hostname file and give a different hostname to each host,
and resolve each hostname to its IP address in the /etc/hosts file.
Then restart your cluster and you will see all the datanodes in the Datanode Information tab in the browser.
I had the same trouble because I used IPs instead of hostnames: [hdfs dfsadmin -report] was correct even though only one [localhost] showed up in the UI. Finally, I solved it like this:
<property>
  <name>dfs.datanode.hostname</name>
  <value>the name you want to show</value>
</property>
You can hardly find this in any documentation...
Sorry, it feels like it's been a while, but I'd still like to share my answer:
the root cause is in hadoop/etc/hadoop/hdfs-site.xml:
the XML file has a property named dfs.datanode.data.dir. If you set all the datanodes with the same name, then Hadoop assumes the cluster has only one datanode. So the proper way of doing it is to give every datanode a unique name.
Regards,
YUN HANXUAN
Your admin report looks absolutely fine. Please run the command below to check the HDFS disk space details:
"hdfs dfs -df /"
If you still see the size being good, it's just a UI glitch.
My problem: I have 1 master node and 3 slave nodes. When I start all nodes with start-all.sh and access the master node's dashboard, I am able to see only one data node on the web UI.
My Solution:
Try to stop the firewall temporarily with sudo systemctl stop firewalld. If you do not want to stop your firewalld service, then allow the datanode ports with
sudo firewall-cmd --permanent --add-port={PORT_NUMBER/tcp,PORT_NUMBER2/tcp}; sudo firewall-cmd --reload
If you are using a separate user for Hadoop (in my case the hadoop user manages the Hadoop daemons), then change the owner of your datanode and namenode directories with: sudo chown hadoop:hadoop /opt/data -R
My hdfs-site.xml config is as given in the image.
Check your daemons on the data node with the jps command; it should show output as in the jps output image below.
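For reference, a concrete version of the firewall-cmd placeholder above, assuming default Hadoop 2.x port numbers (adjust to whatever your configs actually use):
sudo firewall-cmd --permanent --add-port={9000/tcp,50010/tcp,50020/tcp,50070/tcp,50075/tcp}
sudo firewall-cmd --reload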

Hadoop JobClient: Error Reading task output

I'm trying to process 40GB of Wikipedia English articles on my cluster. The problem is the following repeating error message:
13/04/27 17:11:52 INFO mapred.JobClient: Task Id : attempt_201304271659_0003_m_000046_0, Status : FAILED
Too many fetch-failures
13/04/27 17:11:52 WARN mapred.JobClient: Error reading task outputhttp://ubuntu:50060/tasklog?plaintext=true&attemptid=attempt_201304271659_0003_m_000046_0&filter=stdout
When I run the same MapReduce program on a smaller part of the Wikipedia articles rather than the full set, it works just fine and I get all the desired results. Based on that, I figured maybe it's a memory issue. I cleared all the user logs (as specified in a similar post) and tried again. No use.
I turned down replication to 1 and added a few more nodes. Still no use.
The cluster summary is as follows:
Configured Capacity: 205.76 GB
DFS Used: 40.39 GB
Non DFS USed: 44.66 GB
DFS Remaining: 120.7 GB
DFS Used%: 19.63%
DFS Remaining%: 58.66%
Live Nodes: 12
Dead Nodes: 0
Decomissioned Nodes: 0
Number of Under Replicated Blocks: 0
Each node runs on Ubuntu 12.04 LTS
Any help is appreciated.
EDIT
JobTracker Log: http://txtup.co/gtBaY
TaskTracker Log: http://txtup.co/wEZ5l
Fetch failures are often due to DNS problems. Check each datanode to be sure that the hostname and IP address it is configured with match what DNS resolves for that hostname.
You can do this by visiting each node in your cluster, running hostname and ifconfig, and noting the hostname and IP address returned. Let's say, for instance, this returns the following:
namenode.foo.com 10.1.1.100
datanode1.foo.com 10.1.1.1
datanode2.foo.com 10.1.1.2
datanode3.foo.com 10.1.1.3
Then, revisit each node and nslookup all the hostnames returned from the other nodes. Verify that the returned ip address matches the one found from ifconfig. For instance, when on datanode1.foo.com, you should do the following:
nslookup namenode.foo.com
nslookup datanode2.foo.com
nslookup datanode3.foo.com
and you should get back:
    10.1.1.100
    10.1.1.2
    10.1.1.3
When you ran your job on a subset of data, you probably didn't have enough splits to start a task on the datanode(s) that are misconfigured.
I had a similar problem and was able to find a solution. The problem lies in how Hadoop deals with small files. In my case, I had about 150 text files that added up to 10 MB. Because of how the files are "divided" into blocks, the system runs out of memory pretty quickly. So to solve this you have to "fill" the blocks and arrange your new files so that they are spread nicely into blocks. Hadoop lets you "archive" small files so that they are correctly allocated into blocks:
hadoop archive -archiveName files.har -p /user/hadoop/data /user/hadoop/archive
In this case I created an archive called files.har from the /user/hadoop/data folder and stored it in the folder /user/hadoop/archive. After doing this, I rebalanced the cluster allocation using start-balancer.sh.
Now when I run the wordcount example against files.har, everything works perfectly.
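For reference, a sketch of pointing the example at the archive via the har:// scheme (the examples jar name and output path are illustrative):
hadoop jar hadoop-examples.jar wordcount har:///user/hadoop/archive/files.har /user/hadoop/wordcount-out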
Hope this helps.
Best,
Enrique
I had exactly the same problem with Hadoop 1.2.1 on an 8-node cluster. The problem was in the /etc/hosts file. I removed all entries containing "127.0.0.1 localhost". Instead of "127.0.0.1 localhost" you should map your IP address to your hostname (e.g. "10.15.3.35 myhost"). Note that you should do that for all nodes in the cluster. So, in a two-node cluster, the master's /etc/hosts should contain "10.15.3.36 masters_hostname" and the slave's /etc/hosts should contain "10.15.3.37 slave1_hostname". After these changes, it would be good to restart the cluster.
Also have a look here for some basic Hadoop troubleshooting: Hadoop Troubleshooting

Datanode, tasktracker dies when executing hadoop fs -put command

I have a Hadoop cluster with 1 master (running the namenode and jobtracker) and 2 slaves (each running a datanode and tasktracker). Now whenever I execute
hadoop fs -put localfile /user/root/tmp/input
for 4-8 GB of data, the command executes perfectly.
But when I increase the data to 30 GB, one of the slaves dies: I get a java.net.NoRouteToHostException and the command exits unsuccessfully. I immediately pinged that slave and found that even the Ethernet connection was down. So I have to manually run
ifup eth0
on that slave to bring the host up again.
I am not able to figure out the problem here. I also changed the following properties:
dfs.socket.timeout, for the read timeout
dfs.datanode.socket.write.timeout, for the write timeout
I increased the read timeout to 600000 and changed the write timeout to 0 to make it infinite. Any suggestions, please? I've been stuck on this for a couple of days.
try using "distCp" to copy large data.
Got the solution. The problem was with the hardware. Though my NIC was gigabit, the switch into which all the nodes were plugged only supported 100 Mbps. Changing the switch to a gigabit one made it work perfectly fine.
I faced a similar problem and I used -copyFromLocal instead of -put, which resolved the issue:
hadoop fs -copyFromLocal localfile /user/root/tmp/input

HDFS error: could only be replicated to 0 nodes, instead of 1

I've created an Ubuntu single-node Hadoop cluster in EC2.
Testing a simple file upload to HDFS works from the EC2 machine, but doesn't work from a machine outside of EC2.
I can browse the filesystem through the web interface from the remote machine, and it shows one datanode which is reported as in service. I have opened all TCP ports in the security group from 0 to 60000(!), so I don't think it's that.
I get the error
java.io.IOException: File /user/ubuntu/pies could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1448)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:690)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:342)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1350)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1346)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1344)
at org.apache.hadoop.ipc.Client.call(Client.java:905)
at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:198)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:928)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:811)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:427)
The namenode log just gives the same error. The other logs don't seem to have anything interesting.
Any ideas?
Cheers
WARNING: The following will destroy ALL data on HDFS. Do not execute the steps in this answer unless you do not care about destroying existing data!!
You should do this:
stop all hadoop services
delete dfs/name and dfs/data directories
hdfs namenode -format (answer with a capital Y)
start hadoop services
Also, check the disk space on your system and make sure the logs are not warning you about it.
This is your issue: the client can't communicate with the DataNode, because the IP that the client received for the DataNode is an internal IP and not the public IP. Take a look at this:
http://www.hadoopinrealworld.com/could-only-be-replicated-to-0-nodes/
Look at the source code of DFSClient$DFSOutputStream (Hadoop 1.2.1):
//
// Connect to first DataNode in the list.
//
success = createBlockOutputStream(nodes, clientName, false);
if (!success) {
  LOG.info("Abandoning " + block);
  namenode.abandonBlock(block, src, clientName);
  if (errorIndex < nodes.length) {
    LOG.info("Excluding datanode " + nodes[errorIndex]);
    excludedNodes.add(nodes[errorIndex]);
  }
  // Connection failed. Let's wait a little bit and retry
  retry = true;
}
The key to understand here is that the Namenode only provides the list of Datanodes on which to store the blocks. The Namenode does not write the data to the Datanodes. It is the job of the client to write the data to the Datanodes using the DFSOutputStream. Before any write can begin, the above code makes sure that the client can communicate with the Datanode(s), and if the communication to a Datanode fails, the Datanode is added to the excludedNodes.
Look at the following: seeing this exception (could only be replicated to 0 nodes, instead of 1) means no datanode is available to the NameNode.
These are the cases in which a DataNode may not be available to the NameNode:
the DataNode disk is full
the DataNode is busy with block reports and block scanning
the block size is a negative value (dfs.block.size in hdfs-site.xml)
while a write is in progress, the primary datanode goes down (any network fluctuations between the NameNode and DataNode machines)
Whenever we append a partial chunk and call sync, then for subsequent partial-chunk appends the client should keep the previous data in a buffer.
For example, after appending "a" I called sync, and when I next try to append, the buffer should contain "ab".
On the server side, when the chunk is not a multiple of 512, it will try to do a CRC comparison between the data present in the block file and the CRC present in the meta file. But while constructing the CRC for the data present in the block, it always compares up to the initial offset. For more analysis, please check the datanode logs.
Reference: http://www.mail-archive.com/hdfs-user#hadoop.apache.org/msg01374.html
I had a similar problem setting up a single-node cluster. I realized that I hadn't configured any datanode. I added my hostname to conf/slaves, and then it worked out. Hope it helps.
I'll try to describe my setup & solution:
My setup: RHEL 7, hadoop-2.7.3
I tried to set up Standalone Operation first and then Pseudo-Distributed Operation, where the latter failed with the same issue.
However, when I started hadoop with:
sbin/start-dfs.sh
I got the following:
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/<user>/hadoop-2.7.3/logs/hadoop-<user>-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /home/<user>/hadoop-2.7.3/logs/hadoop-<user>-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /home/<user>/hadoop-2.7.3/logs/hadoop-<user>-secondarynamenode-localhost.localdomain.out
which looks promising (starting datanode... with no failures), but the datanode did not actually exist.
Another indication was seeing that there was no datanode in operation (the snapshot below shows the fixed, working state).
I fixed the issue by doing:
rm -rf /tmp/hadoop-<user>/dfs/name
rm -rf /tmp/hadoop-<user>/dfs/data
and then start again:
sbin/start-dfs.sh
...
I had the same error on MacOS X 10.7 (hadoop-0.20.2-cdh3u0) due to data node not starting.
start-all.sh produced following output:
starting namenode, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
localhost: ssh: connect to host localhost port 22: Connection refused
localhost: ssh: connect to host localhost port 22: Connection refused
starting jobtracker, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
localhost: ssh: connect to host localhost port 22: Connection refused
After enabling ssh login via System Preferences -> Sharing -> Remote Login
it started to work.
start-all.sh output changed to following (note start of datanode):
starting namenode, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
Password:
localhost: starting datanode, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
Password:
localhost: starting secondarynamenode, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
starting jobtracker, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
Password:
localhost: starting tasktracker, logging to /java/hadoop-0.20.2-cdh3u0/logs/...
And I think you should make sure all the datanodes are up when you copy to DFS. In some cases it takes a while. I think that's why the 'checking the health status' solution works: you go to the health status web page and wait for everything to come up. My five cents.
It took me a week to figure out the problem in my situation.
When the client (your program) asks the namenode for a data operation, the namenode picks a datanode and directs the client to it by giving the datanode's IP to the client.
But when the datanode host is configured with multiple IPs and the namenode gives you one your client CAN'T ACCESS, the client adds that datanode to the exclude list and asks the namenode for a new one; eventually all datanodes are excluded and you get this error.
So check the nodes' IP settings before you try everything else!
If all data nodes are running, one more thing to check is whether HDFS has enough space for your data. I could upload a small file but failed to upload a big file (30 GB) to HDFS. 'bin/hdfs dfsadmin -report' showed that each data node only had a few GB available.
Have you tried the recommendations from the wiki, http://wiki.apache.org/hadoop/HowToSetupYourDevelopmentEnvironment ?
I was getting this error when putting data into the dfs. The solution is strange and probably inconsistent: I erased all temporary data along with the namenode, reformatted the namenode, started everything up, and visited my "cluster's" dfs health page (http://your_host:50070/dfshealth.jsp). The last step, visiting the health page, is the only way I can get around the error. Once I've visited the page, putting and getting files in and out of the dfs works great!
Reformatting the node is not the solution. You will have to edit start-all.sh: start DFS, wait for it to start completely, and then start mapred. You can do this using a sleep; waiting for 1 second worked for me. See the complete solution here: http://sonalgoyal.blogspot.com/2009/06/hadoop-on-ubuntu.html.
I realize I'm a little late to the party, but I wanted to post this
for future visitors of this page. I was having a very similar problem
when I was copying files from local to hdfs and reformatting the
namenode did not fix the problem for me. It turned out that my namenode
logs had the following error message:
2012-07-11 03:55:43,479 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(127.0.0.1:50010, storageID=DS-920118459-192.168.3.229-50010-1341506209533, infoPort=50075, ipcPort=50020):DataXceiver java.io.IOException: Too many open files
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.createNewFile(File.java:883)
at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.createTmpFile(FSDataset.java:491)
at org.apache.hadoop.hdfs.server.datanode.FSDataset$FSVolume.createTmpFile(FSDataset.java:462)
at org.apache.hadoop.hdfs.server.datanode.FSDataset.createTmpFile(FSDataset.java:1628)
at org.apache.hadoop.hdfs.server.datanode.FSDataset.writeToBlock(FSDataset.java:1514)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:113)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:381)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:171)
Apparently, this is a relatively common problem on hadoop clusters and
Cloudera suggests increasing the nofile and epoll limits (if on
kernel 2.6.27) to work around it. The tricky thing is that setting
nofile and epoll limits is highly system dependent. My Ubuntu 10.04
server required a slightly different configuration for this to work
properly, so you may need to alter your approach accordingly.
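For what it's worth, a sketch of raising the open-file limit via /etc/security/limits.conf on a typical Linux setup (the user name and value are illustrative; the exact mechanism depends on your distro and on how the Hadoop daemons are started):
hadoop   soft   nofile   16384
hadoop   hard   nofile   16384
After re-logging in as that user, ulimit -n should report the new limit.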
Don't format the name node immediately. Try stop-all.sh and start it using start-all.sh. If the problem persists, go for formatting the name node.
Follow the steps below:
1. Stop dfs and yarn.
2. Remove datanode and namenode directories as specified in the core-site.xml.
3. Start dfs and yarn as follows:
start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver
