I am implementing datanode failover for writes in HDFS, so that HDFS can still write a block when the first datanode of the block fails.
The algorithm is as follows. First, the failed node is identified. Then, a new block is requested. The HDFS API provides excludeNodes, which I use to tell the namenode not to allocate the new block on those nodes. failedDatanodes holds the datanodes identified as failed, and the logs confirm that its contents are correct.
req := &hdfs.AddBlockRequestProto{
    Src:          proto.String(bw.src),
    ClientName:   proto.String(bw.clientName),
    ExcludeNodes: failedDatanodes,
}
However, the namenode still allocates the block on the failed datanodes.
Does anyone know why? Did I miss anything here?
Thank you.
I found the solution: first abandon the block, then request the new block. In the previous design, the newly requested block could not replace the old one.
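To make the ordering concrete, here is a toy Python simulation of "abandon first, then request again". FakeNamenode, add_block, abandon_block and recover_block are hypothetical names made up for this sketch. They only mimic the behavior described in this thread and are not the real HDFS RPC interface or namenode logic.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNamenode:
    """Toy stand-in for the namenode; NOT the real HDFS RPC interface."""
    datanodes: list
    # Maps a file path to the datanode holding its pending (unfinished) block.
    open_blocks: dict = field(default_factory=dict)

    def add_block(self, src, exclude_nodes=()):
        # Assumption of this toy model: while a block is still pending for
        # the file, a new request cannot replace it, so the same node is
        # returned again regardless of exclude_nodes.
        if src in self.open_blocks:
            return self.open_blocks[src]
        for node in self.datanodes:
            if node not in exclude_nodes:
                self.open_blocks[src] = node
                return node
        raise RuntimeError("no datanode available")

    def abandon_block(self, src):
        # Drop the pending block so the next request starts fresh.
        self.open_blocks.pop(src, None)


def recover_block(namenode, src, failed_datanodes):
    """Abandon the pending block, then request a replacement elsewhere."""
    namenode.abandon_block(src)
    return namenode.add_block(src, exclude_nodes=failed_datanodes)


nn = FakeNamenode(datanodes=["dn1", "dn2", "dn3"])
first = nn.add_block("/tmp/file")                          # allocated on dn1
retry = nn.add_block("/tmp/file", exclude_nodes=["dn1"])   # pending block wins
fixed = recover_block(nn, "/tmp/file", ["dn1"])            # abandon, then retry
print(first, retry, fixed)
```

In this model the exclusion list only takes effect once the pending block has been abandoned, which matches what was observed above.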
Related
I ran a Spark cluster of 12 nodes (8G memory and 8 cores for each) for some tests.
I'm trying to figure out why data localities of a simple wordcount app in "map" stage are all "Any". The 14GB dataset is stored in HDFS.
I ran into the same problem, and in my case it was a configuration problem. I was running on EC2 and had a hostname mismatch. Maybe the same thing happened to you.
When you check how HDFS sees your cluster, it should look something along these lines:
hdfs dfsadmin -printTopology
Rack: /default-rack
172.31.xx.xx:50010 (ip-172-31-xx-xxx.eu-central-1.compute.internal)
172.31.xx.xx:50010 (ip-172-31-xx-xxx.eu-central-1.compute.internal)
And the same should be seen in executors' address in the UI (by default it's http://your-cluster-public-dns:8080/).
In my case I was using the public hostname for the Spark slaves. I changed SPARK_LOCAL_IP in $SPARK/conf/spark-env.sh to use the private name as well, and after that change I get NODE_LOCAL most of the time.
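If you want to check for this kind of mismatch quickly, a small script can compare the names HDFS reports with the names the executors registered under. This is only a sketch: unmatched_executors is a hypothetical helper, the host names are made-up examples, and it assumes (as the fix above suggests) that locality depends on the two names being the same string.

```python
def unmatched_executors(hdfs_hosts, executor_hosts):
    """Return executor host names that HDFS does not list under the same name."""
    known = {h.strip().lower() for h in hdfs_hosts}
    return sorted(h for h in executor_hosts if h.strip().lower() not in known)


# Made-up example names: HDFS uses private names, one executor uses a public one.
hdfs_hosts = [
    "ip-172-31-0-1.eu-central-1.compute.internal",
    "ip-172-31-0-2.eu-central-1.compute.internal",
]
executor_hosts = [
    "ip-172-31-0-1.eu-central-1.compute.internal",
    "ec2-1-2-3-4.eu-central-1.compute.amazonaws.com",
]
mismatched = unmatched_executors(hdfs_hosts, executor_hosts)
print(mismatched)
```

Any name this prints is an executor that HDFS block locations will never refer to by that name.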
I encountered the same problem today. This is my situation:
My cluster has 9 workers (each runs one executor by default). When I set --total-executor-cores 9, the locality level is NODE_LOCAL, but when I set it below 9, such as --total-executor-cores 7, the locality level becomes ANY and the total time is 10x that of the NODE_LOCAL case. You can give it a try.
I'm running my cluster on EC2s, and I fixed my problem by adding the following to spark-env.sh on the name node
SPARK_MASTER_HOST=<name node hostname>
and then adding the following to spark-env.sh on the data nodes
SPARK_LOCAL_HOSTNAME=<data node hostname>
Don't start the slaves with start-all.sh. You should start every slave individually:
$SPARK_HOME/sbin/start-slave.sh -h <hostname> <masterURI>
I ran a MapReduce program using the command hadoop jar <jar> [mainClass] path/to/input path/to/output. However, my job was hanging at: INFO mapreduce.Job: map 100% reduce 29%.
Much later, I terminated and checked the datanode log (I am running in pseudo-distributed mode). It contained the following exception:
java.io.IOException: Premature EOF from inputStream
at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:201)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:134)
at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:472)
at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:849)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:804)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:251)
at java.lang.Thread.run(Thread.java:745)
5 seconds later in the log was ERROR DataXceiver error processing WRITE_BLOCK operation.
What problem might be causing this exception and error?
My NodeHealthReport said:
1/1 local-dirs are bad: /home/$USER/hadoop/nm-local-dir;
1/1 log-dirs are bad: /home/$USER/hadoop-2.7.1/logs/userlogs
I found this which indicates that dfs.datanode.max.xcievers may need to be increased. However, it is deprecated and the new property is called dfs.datanode.max.transfer.threads with default value 4096. If changing this would fix my problem, what new value should I set it to?
This indicates that the ulimit for the datanode may need to be increased. My ulimit -n (open files) is 1024. If increasing this would fix my problem, what should I set it to?
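For reference, the open-file limit can also be read programmatically. The sketch below uses Python's standard resource module (Unix only). It reports the limits of the Python process itself, so it only tells you something about the datanode if it runs as the same user and from the same environment that starts the datanode. open_file_limits is just a name chosen for this example.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
# The soft limit is the value that `ulimit -n` prints by default.
print(f"open files: soft={soft} hard={hard}")
```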
Premature EOF can occur for multiple reasons, one of which is spawning a huge number of threads to write to disk on one reducer node using FileOutputCommitter. The MultipleOutputs class allows you to write to files with custom names; to accomplish that, it spawns one thread per file and binds a port to it to write to disk. This limits the number of files that can be written on one reducer node. I encountered this error when the number of files on one reducer node crossed roughly 12000: the threads got killed and the _temporary folder got deleted, leading to a plethora of these exception messages. My guess is that this is not a memory overshoot issue, nor can it be solved by allowing the Hadoop engine to spawn more threads. Reducing the number of files being written at one time on one node solved my problem, either by reducing the actual number of files being written or by increasing the number of reducer nodes.
I am new to Hadoop, and help with this question is appreciated.
The replication of blocks in a cluster is handled by individual datanodes holding a copy of the block, but how does this transfer take place without going through the namenode?
I found that ssh is set up from slaves to master and from master to slaves, but not from slave to slave.
Could someone explain?
Is it through the Hadoop data transfer protocol, like client-to-DN communication?
http://blog.cloudera.com/blog/2013/03/how-to-set-up-a-hadoop-cluster-with-network-encryption/
After digging into the Hadoop source code, I found that datanodes use the BlockSender class to transfer block data. Under the hood it is a plain Socket.
Below is the hacky way I found this (Hadoop version 1.1.2 is used here).
DataNode line 946 is the offerService method, which is the main service loop.
In that loop, the datanode sends a heartbeat to the namenode, mainly to report that it is alive. The return value contains commands for the datanode to process; this is where block copying is triggered.
Digging into processCommand, we arrive at line 1160.
There is a comment there that makes it clear transferBlocks is what we want.
Digging into transferBlocks, we arrive at line 1257, a private method. At the end of the method:
new Daemon(new DataTransfer(xferTargets, block, this)).start();
So we know the datanode starts a new thread to do the block copy.
Look at DataTransfer at line 1424 and check its run method.
Near the end of the run method, we find the following snippet:
// send data & checksum
blockSender.sendBlock(out, baseStream, null);
From the code above, we can see that BlockSender is the actual worker.
That is as far as I went; it is up to you to explore further, for example BlockReader.
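The call chain traced above can be summarized with a toy Python model. None of this is Hadoop code: offer_service_once, process_command and send_block are hypothetical stand-ins for offerService, processCommand/transferBlocks and BlockSender.sendBlock, and the command dictionaries are an invented format. The sketch also assumes that the sender connects only to the first target and passes along the remaining targets.

```python
import threading


def send_block(block, targets, received):
    # Stand-in for BlockSender.sendBlock: instead of writing to a socket,
    # record the first target, the block, and the remaining targets.
    received.append((targets[0], block, tuple(targets[1:])))


def process_command(cmd, received):
    # Stand-in for processCommand/transferBlocks: a transfer command
    # starts one background thread per block transfer.
    if cmd.get("action") != "transfer":
        return None
    worker = threading.Thread(
        target=send_block,
        args=(cmd["block"], cmd["targets"], received),
        daemon=True,
    )
    worker.start()
    return worker


def offer_service_once(heartbeat, received):
    # One iteration of the service loop: send a heartbeat, then process
    # whatever commands came back in the reply.
    workers = [process_command(cmd, received) for cmd in heartbeat()]
    return [w for w in workers if w is not None]


def fake_heartbeat():
    # Invented reply: one transfer command and one unrelated command.
    return [
        {"action": "transfer", "block": "blk_1", "targets": ["dn2", "dn3"]},
        {"action": "noop"},
    ]


received = []
for worker in offer_service_once(fake_heartbeat, received):
    worker.join()
print(received)
```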
Whenever a block has to be written in HDFS, the NameNode will allocate space for this block on any datanode. It will also allocate space on other datanodes for the replicas of this block. Then it will instruct the first datanode to write the block and also to replicate the block on the other datanodes where space was allocated for the replicas.
How does the task tracker get the data for a map task from another node when the data is not local?
Does it talk directly to the datanode on the machine containing the data, or does it talk to its own datanode, which in turn talks to the other one?
Thanks,
Suresh.
The task tracker itself doesn't get the data - it launches (or reuses) a JVM to run a Map task. The map task uses the DFS File System client to query the name node for the block locations of the file it is to process. The client then connects to the data node where one of the blocks is replicated to actually acquire the file contents (as a stream).
If you want to delve deeper, the source is an excellent place to get a good understanding - check out the DFSClient and inner class DFSInputStream (especially the bestNode method)
http://svn.apache.org/viewvc/hadoop/common/tags/release-0.20.2/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java?view=markup
Class starts around line 1443
openInfo() method # line 1494
chooseDataNode() method # 1800
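As a rough illustration of the selection step, the sketch below picks the first located datanode that is not marked dead. It is a simplified Python paraphrase of what I understand bestNode to do, not a port of the Java source. best_node and its arguments are names chosen for this example, and the exact error text in the real client may differ.

```python
def best_node(nodes, dead_nodes):
    """Return the first located datanode that is not known to be dead."""
    for node in nodes or []:
        if node not in dead_nodes:
            return node
    raise IOError("No live nodes contain current block")


# The block is replicated on dn1 and dn2, but dn1 has already failed a read.
chosen = best_node(["dn1", "dn2"], dead_nodes={"dn1"})
print(chosen)
```

The order of the located nodes matters here: whichever live node comes first in the list is the one the client reads from.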
This is a fairly well-documented error and the fix is easy, but does anyone know why Hadoop datanode NamespaceIDs can get screwed up so easily or how Hadoop assigns the NamespaceIDs when it starts up the datanodes?
Here's the error:
2010-08-06 12:12:06,900 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /Users/jchen/Data/Hadoop/dfs/data: namenode namespaceID = 773619367; datanode namespaceID = 2049079249
at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
This seems to even happen for single node instances.
The namenode generates a new namespaceID every time you format HDFS. I think this is possibly to differentiate the current version from the previous one. You can always roll back to a previous version if something is not right, which may not be possible if the namespaceID is not unique for every formatted instance.
The namespaceID also connects the namenode and the datanodes. Datanodes bind themselves to the namenode through the namespaceID.
This problem is well explained, with help for fixing it, in the following fine guide
I was getting this too, and then I tried putting my configuration in hdfs-site.xml instead of core-site.xml.
Seems to stop and start without that error now.
[EDIT, 2010-08-13]
Actually this is still happening, and it is caused by formatting.
If you watch the VERSION files when you do a format, you'll see (at least I do) that the namenode gets assigned a new namespaceID, but the data node does not.
A quick solution is to delete the datanode's VERSION file before formatting.
[TIDE, 2010-08-13]
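To see the mismatch for yourself, you can compare the namespaceID entries of the two VERSION files. The sketch below parses the key=value text of a VERSION file. read_namespace_id is a hypothetical helper, and the sample contents are illustrative: only the two IDs are taken from the error message in the question, the other fields are made up.

```python
def read_namespace_id(version_text):
    """Extract namespaceID from the key=value text of a VERSION file."""
    for line in version_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and malformed lines
        key, value = line.split("=", 1)
        if key.strip() == "namespaceID":
            return int(value.strip())
    return None


# Illustrative file contents; in practice, read the VERSION file from the
# namenode and datanode storage directories instead.
namenode_version = "#illustrative sample\nnamespaceID=773619367\ncTime=0\n"
datanode_version = "#illustrative sample\nnamespaceID=2049079249\ncTime=0\n"

nn_id = read_namespace_id(namenode_version)
dn_id = read_namespace_id(datanode_version)
print("match" if nn_id == dn_id else f"mismatch: namenode={nn_id} datanode={dn_id}")
```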
When I formatted my HDFS I also encountered this error. Apart from datanode not getting started, the jobtracker also won't start.
For the datanode I manually changed the namespaceid; but for the jobtracker one has to create the /mapred/system (as hdfs user) directory and change its owner to mapred. The jobtracker should start running then after the format.
I got the following error: "Incompatible namespaceIDs in /home/hadoop/data/dn".
I have four datanodes in the cluster, and after running start-dfs.sh only one datanode would come up. The solution was to stop the services on the nn and jt, remove the dn configuration from hdfs-site on all datanodes, remove the dn directory (/home/hadoop/data/dn), and format the namenode.
Then add the datanode properties back to hdfs-site on all datanodes and format the namenode once again. Try starting the services now; all datanodes should come up.