hdfs data node disconnected from namenode - hadoop

From time to time I get the following errors in Cloudera Manager:
This DataNode is not connected to one or more of its NameNode(s).
and
The Cloudera Manager agent got an unexpected response from this role's web server.
(usually together, sometimes only one of them)
In most references to these errors on SO and Google, the issue is a configuration problem (and the data node never connects to the name node).
In my case the data nodes usually connect at start-up but lose the connection after some time, so it doesn't appear to be a bad configuration.
Any other options?
Is it possible to force the data node to reconnect to the name node?
Is it possible to "ping" the name node from the data node (simulate the connection attempt of the data node)
Could it be some kind of resource problem (to many open files \ connections)?
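For context, here is roughly what I had in mind for those last two checks; 8020 is just the common NameNode RPC default on this kind of cluster, and <namenode-host> / <datanode-pid> are placeholders I would fill in for my own nodes:
telnet <namenode-host> 8020          # can this data node still reach the NameNode RPC port?
ulimit -n                            # open-file limit in the DataNode's environment
lsof -p <datanode-pid> | wc -l       # rough count of files/sockets the DataNode process holds open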
Sample logs (the errors vary from time to time):
2014-02-25 06:39:49,179 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: exception:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.56.144.18:50010 remote=/10.56.144.28:48089]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:153)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:114)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:504)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:673)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:338)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
at java.lang.Thread.run(Thread.java:662)
2014-02-25 06:39:49,180 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.56.144.18:50010, dest: /10.56.144.28:48089, bytes: 132096, op: HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1315770947_27, offset: 0, srvID: DS-990970275-10.56.144.18-50010-1384349167420, blockid: BP-1381780028-10.56.144.16-1384349161741:blk_-8718668700255896235_5121440, duration: 480291679056
2014-02-25 06:39:49,180 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.56.144.18, storageID=DS-990970275-10.56.144.18-50010-1384349167420, infoPort=50075, ipcPort=50020, storageInfo=lv=-40;cid=cluster16;nsid=7043943;c=0):Got exception while serving BP-1381780028-10.56.144.16-1384349161741:blk_-8718668700255896235_5121440 to /10.56.144.28:48089
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.56.144.18:50010 remote=/10.56.144.28:48089]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:153)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:114)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:504)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:673)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:338)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
at java.lang.Thread.run(Thread.java:662)
2014-02-25 06:39:49,181 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: host.com:50010:DataXceiver error processing READ_BLOCK operation src: /10.56.144.28:48089 dest: /10.56.144.18:50010
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/10.56.144.18:50010 remote=/10.56.144.28:48089]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:153)
at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:114)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:504)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:673)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:338)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:92)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:64)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
at java.lang.Thread.run(Thread.java:662)

Hadoop uses specific ports to communicate between the DataNode and the NameNode. It could be that a firewall is blocking those specific ports. Check the default ports on the Cloudera website and test connectivity to the NameNode on those ports.
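For example, from a DataNode host you could probe the usual defaults (8020 for NameNode RPC and 50070 for the NameNode web UI; substitute your own NameNode hostname and any non-default ports):
telnet <namenode-host> 8020     # NameNode RPC port
telnet <namenode-host> 50070    # NameNode web UI port
If telnet cannot connect to a port the NameNode is actually listening on, a firewall in between is the likely culprit.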

If you're using Linux, then please make sure that you have configured these settings correctly (a quick verification sketch follows the list):
Disable SELinux.
Type the command getenforce on the CLI; if it shows enforcing, SELinux is enabled. Change it in the /etc/selinux/config file.
Disable the firewall.
Make sure you have the NTP service installed.
Make sure your server can SSH to all client nodes.
Make sure all the nodes have an FQDN (Fully Qualified Domain Name) and an entry in /etc/hosts with name and IP.
If these settings are in place and the problem persists, please attach the log of any of your datanodes that got disconnected.
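A rough way to verify the checks above from a shell (commands assume a RHEL/CentOS-style node; adjust for your distribution):
getenforce                      # should print Disabled or Permissive
sudo setenforce 0               # temporary; make it permanent in /etc/selinux/config
sudo service iptables status    # firewall should be stopped (use ufw status on Ubuntu)
sudo service ntpd status        # NTP should be installed and running
hostname -f                     # should print the FQDN that appears in /etc/hosts
cat /etc/hosts                  # every node should be listed with its IP and FQDN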

I ran into this error
"This DataNode is not connected to one or more of its NameNode(s). "
and I solved it by turning off safe mode and restarting the HDFS service.
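For reference, the safe-mode part can be done from the command line (on CDH it usually has to run as the hdfs user; the HDFS restart itself would typically be done through Cloudera Manager):
sudo -u hdfs hdfs dfsadmin -safemode get     # check whether safe mode is on
sudo -u hdfs hdfs dfsadmin -safemode leave   # turn safe mode off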

I realize you took some steps to test this, but intermittent disconnects still make it sound like a connectivity issue.
If nodes really don't come back after a disconnect, that may be a configuration issue, which could well be completely independent from the reason why they disconnect in the first place.

Related

Cloudera Manager Health Issue: NameNode Connectivity, Web Server Status

Below is a snapshot of the health issues reported on CM. The datanodes in the list keep changing. Some errors from the datanode logs:
3:59:31.859 PM ERROR org.apache.hadoop.hdfs.server.datanode.DataNode
datanode05.hadoop.com:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.248.200.113:45252 dest: /10.248.200.105:50010
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:635)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:564)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:103)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:67)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
at java.lang.Thread.run(Thread.java:662)
5:46:03.606 PM INFO org.apache.hadoop.hdfs.server.datanode.DataNode
Exception for BP-846315089-10.248.200.4-1369774276029:blk_-780307518048042460_200374997
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.248.200.105:50010 remote=/10.248.200.122:43572]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:165)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:156)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at java.io.DataInputStream.read(DataInputStream.java:132)
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:414)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:635)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:564)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:103)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:67)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:221)
at java.lang.Thread.run(Thread.java:662)
Snapshot:
I am unable to figure out the root cause of the issue. I can manually connect from one datanode to another without issues, so I don't believe it is a network issue. Also, the missing block and under-replicated block counts change (up and down) as well.
Cloudera Manager : Cloudera Standard 4.8.1
CDH 4.7
Any help in resolving this issue is appreciated.
Update: Jan 01, 2016
For the datanodes listed as bad, when I look at the datanode logs, I see this message a lot...
11:58:30.066 AM INFO org.apache.hadoop.hdfs.server.datanode.DataNode
Receiving BP-846315089-10.248.200.4-1369774276029:blk_-706861374092956879_36606459 src: /10.248.200.123:56795 dest: /10.248.200.112:50010
Why is this datanode receiving a lot of blocks from other datanodes around the same time? It seems that because of this activity the datanode cannot respond to the namenode requests in time and thus times out. All bad datanodes show the same pattern.
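To see whether that block traffic is re-replication, I can look at the cluster-wide picture (standard HDFS commands; on CDH they may need to run as the hdfs user):
sudo -u hdfs hdfs dfsadmin -report     # live/dead datanodes and per-node block counts
sudo -u hdfs hdfs fsck /               # summary of missing and under-replicated blocks at the end of the output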
A similar question got answered:
hdfs data node disconnected from namenode.
Please check your firewall. Use
telnet ipaddress port
to check the connectivity.
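For example, using the addresses from the logs above (50010 is the default DataNode data-transfer port; 8020 is the usual NameNode RPC port, so substitute your own NameNode host):
telnet 10.248.200.105 50010
telnet <namenode-host> 8020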

getting java.net.SocketTimeoutException when trying to run the Hadoop mapReduce on fresh install of Hortonworks

I have a fresh install of Hortonworks version 2.3_1 for Oracle VirtualBox and I get a java.net.SocketTimeoutException whenever I try to run a MapReduce job. I changed nothing other than the memory and the cores available to the VM.
Full text of the run:
WARNING: Use "yarn jar" to launch YARN applications.
15/09/01 01:15:17 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/09/01 01:15:20 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/09/01 01:16:19 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/09/01 01:18:09 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-601678901-10.0.2.15-1439987491556:blk_1073742292_1499
java.net.SocketTimeoutException: 65000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.2.15:52924 remote=/10.0.2.15:50010]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2280)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:749)
15/09/01 01:18:11 INFO mapreduce.JobSubmitter: Cleaning up the staging area /user/root/.staging/job_1441069639378_0001
Exception in thread "main" java.io.IOException: All datanodes DatanodeInfoWithStorage[10.0.2.15:50010,DS-56099a5f-3cb3-426e-8e1a-ff3b53df9bf2,DISK] are bad. Aborting...
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
Full name of the .ova file I am using: Sandbox_HDP_2.3_1_virtualbox.ova
My host is a Windows 7 Home Premium machine with eight lines of execution (four hyperthreaded cores, I think).
The problem was exactly what it seemed: a timeout error. Fixed by going to the Hadoop config folder and raising all the timeouts as well as the number of retries (although from the log that didn't come into play), and stopping unnecessary services on both the host and guest operating systems.
Thanks, sunrise76; one of those issues pointed me to the config folder.
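For anyone looking for the concrete settings: the timeouts involved are presumably the HDFS socket timeouts mentioned elsewhere on this page, set in hdfs-site.xml along these lines (the values are just examples; I have not dug out the retry settings):
dfs.socket.timeout -> 360000
dfs.datanode.socket.write.timeout -> 3600000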

DataNodes can't talk to NameNode

I set up a Hadoop cluster of 3 nodes. One of them has both the NameNode and DataNode roles, while the other two are just DataNodes.
I started all nodes and services, but the summary shows only one DataNode's status as live. The status of the other nodes isn't showing at all.
My question is: what is the difference between being started and being live? And why don't the other nodes have a status at all?
I guess the issue is that the datanodes can't talk to the namenode. So, as Azwaw pointed out, I checked the /etc/hosts file. It was like this:
127.0.0.1 nnode.domain nnode localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.1.4.212 nnode.domain nnode
192.1.5.124 dnode02.domain dnode02
192.1.5.125 dnode01.domain dnode01
I changed the first line to:
127.0.0.1 localhost.localdomain localhost localhost4 localhost4.localdomain4
Now I can establish a connection to nnode.domain:50070; however, the errors on the datanode side changed. Here is a log piece from a datanode:
2015-05-15 10:08:21,721 ERROR datanode.DataNode (DataXceiver.java:run(253)) - dnode01.domain:50010:DataXceiver error processing unknown operation src: /127.0.0.1:49000 dst: /127.0.0.1:50010
java.io.EOFException
at java.io.DataInputStream.readShort(DataInputStream.java:315)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.readOp(Receiver.java:58)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:212)
at java.lang.Thread.run(Thread.java:745)
2015-05-15 10:08:23,670 INFO datanode.DataNode (BPServiceActor.java:register(782)) - Block pool BP-2116866246-127.0.0.1-1431441630609 (Datanode Uuid null) service to nnode.domain/192.1.4.212:8020 beginning handshake with NN
2015-05-15 10:08:23,674 ERROR datanode.DataNode (BPServiceActor.java:run(840)) - Initialization failed for Block pool BP-2116866246-127.0.0.1-1431441630609 (Datanode Uuid null) service to nnode.domain/192.1.4.212:8020 Datanode denied communication with namenode because hostname cannot be resolved (ip=192.1.4.1, hostname=192.1.4.1): DatanodeRegistration(0.0.0.0, datanodeUuid=7f1be518-1255-4a6a-b31c-22be5dc47673, infoPort=50075, ipcPort=8010, storageInfo=lv=-56;cid=CID-51d1dfd0-9376-44a7-b581-c14eec95fd74;nsid=450599258;c=0)
at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.registerDatanode(DatanodeManager.java:887)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.registerDatanode(FSNamesystem.java:5282)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.registerDatanode(NameNodeRpcServer.java:1082)
at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.registerDatanode(DatanodeProtocolServerSideTranslatorPB.java:92)
at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:26378)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
This is odd; there is no host with the IP address 192.1.4.1. Why would datanodes try to connect to 192.1.4.1?
Unresolved datanode registration: hostname cannot be resolved (ip=192.1.4.1, hostname=192.1.4.1)
"Datanodes 3/3 started" means 3 datanodes process running
Datanodes status "1 live / 0 Dead / 0 Decommissioning" means your namenode is able to communicate with one node.
It seems to be a network problem (make sure the HDFS ports are open on your firewall). The live DataNode is probably on the same machine as your NameNode.
Moving the NameNode to the same network as the DataNodes solved the problem.
The DataNodes are in the 192.1.5.* network.
The NameNode was in the 192.1.4.* network.
Moving the NameNode to 192.1.5.* did the trick in my case.

SocketTimeoutException in hadoop fs -getmerge

I'm running hadoop fs -getmerge and getting the following error:
12/10/30 09:24:45 INFO hdfs.DFSClient: Failed to connect to /[IP], add to deadNodes and continue
java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel
I'm getting this error with a different IP each try, and I don't see any suspicious error or warning in the data node logs.
Any thoughts?
HDFS reads are done directly from the DataNodes holding the blocks.
A common reason behind this, especially if it consistently fails this way, is a lack of proper client ➜ DataNode connectivity, owing to firewalls or other reasons.
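A quick way to test this from the client machine (50010 is the default DataNode data-transfer port; use the IP reported in the "Failed to connect" message):
telnet <datanode-ip> 50010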

Error in copying files to HDFS

I tried installing Hadoop on two nodes. Both nodes are up and running. The namenode runs on Ubuntu 10.10 and the datanode on Fedora 13. While copying a file from the local file system to HDFS, I encountered the following errors.
The terminal showed:
12/04/12 02:19:15 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink as 10.211.87.162:9200
12/04/12 02:19:15 INFO hdfs.DFSClient: Abandoning block blk_-1069539184735421145_1014
The log file in namenode showed:
2012-10-16 16:17:56,723 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.6.2.26:50010, storageID=DS-880164535-10.18.13.10-50010-1349721715148, infoPort=50075, ipcPort=50020):DataXceiver
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:282)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
at java.lang.Thread.run(Thread.java:662)
The number of datanodes available is indicated as 2. I've disabled the firewall and SELinux.
The following changes have also been made in hdfs-site.xml:
dfs.socket.timeout -> 360000
dfs.datanode.socket.write.timeout -> 3600000
dfs.datanode.max.xcievers -> 1048576
Both nodes run sun-java6-jdk. The datanode also contains OpenJDK, but the path settings have been made for Sun Java.
Yet the same error persists.
What might be the solution?
That's because your firewall is on.
try
sudo /etc/init.d/iptables stop
If you are on Ubuntu, do
sudo ufw disable
This should solve the issue.
The exception log mentions that the failure reason is No route to host.
Try ping 10.6.2.26 to test your network connection.
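If ping also fails, a few more probes can narrow down where the route breaks (run from the node that logs the error; 9200 is the port from the bad-connect-ack line above):
traceroute 10.6.2.26             # see where packets stop
telnet 10.211.87.162 9200        # the datanode port from the bad connect ack
sudo iptables -L -n              # confirm no REJECT/DROP rules remain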
