What does Hadoop do with unreplicated data when the client closes its connection?

I am running a Hadoop 2.5.0-cdh5.3.2 cluster. Flume is running elsewhere writing data to this cluster. When the cluster is under heavy load, the flume-agent finishes writing and attempts to close the file before HDFS has finished replicating the data. The close fails and is retried, but the flume-agent is configured with a timeout and when the close cannot complete in time, the flume-agent disconnects.
What does HDFS do with the file that has not finished replication?
I was under the impression a background thread would finish the replication, but I am seeing only partially written blocks in my cluster. There is one good copy of the block and the replicas are only partially written, so HDFS considers the block corrupt.
I've read through the recovery process and did not think I'd be left with unwritten blocks.
I have the following client settings:
dfs.client.block.write.replace-datanode-on-failure.enable=true
dfs.client.block.write.replace-datanode-on-failure.policy=ALWAYS
dfs.client.block.write.replace-datanode-on-failure.best-effort=true
I set these because it seemed that the flume-agent was losing connections to datanodes and failing. I wanted it to retry, but if a block was written, to call it good and move on.
Is best-effort preventing the remaining blocks from being written? This seems pretty useless if it results in the final block being called corrupt.

I think the flume agent is losing its HDFS connection before it can successfully close the file. The DFS client caches some data locally, and before closing the file it must flush this local cache. If the HDFS connection is lost, the close will fail and the block will be marked corrupt. There is one scenario in which the HDFS connection is unexpectedly closed: the HDFS client registers shutdown hooks, and the order in which shutdown hooks are invoked is not guaranteed. In your case, if the flume agent is shutting down, the HDFS client shutdown hook may get invoked first and the file close will fail. If you think this is possible, try disabling the automatic-close shutdown hook:
fs.automatic.close = false
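In code, that would look roughly like the sketch below (the output path is just a placeholder). The key point is that with fs.automatic.close disabled, the application owns the FileSystem lifecycle and has to close it itself, after every stream has been closed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DisableAutoClose {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Keep the FileSystem shutdown hook from closing the client
        // out from under the application during JVM shutdown.
        conf.setBoolean("fs.automatic.close", false);

        FileSystem fs = FileSystem.get(conf);
        try (FSDataOutputStream out = fs.create(new Path("/tmp/flume-test.txt"))) {
            out.writeBytes("example");
        } finally {
            // With automatic close disabled, the application must close the
            // FileSystem explicitly once all output streams are closed.
            fs.close();
        }
    }
}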

Related

In HDFS, are DataNodes online for read/write before they finish the full block report?

In Apache HDFS, when a DataNode starts, it registers with the NameNode, and a block report follows some time later (it is not atomic with the registration). I haven't fully understood the code, but it seems to me the NameNode treats a DataNode that has not yet sent its block report the same as any other DataNode that has already reported. I get this hint because the DatanodeManager registration logic does not mark a DataNode with a special state like NOT_REPORTED.
Therefore the HDFS client will be able to issue reads/writes against the new DataNode before it finishes its report to the NameNode. Let's consider whether the DataNode is fresh.
If the DataNode is fresh (i.e. no blocks are stored on it anyway), it is safe to use for read/write. There is nothing to read, and any blocks written will show up in a later block report to the NameNode.
If the DataNode is NOT fresh (i.e. there is data on this DataNode; it somehow went offline and came back online), there may be a gap in the NameNode-side metadata: some blocks may have appeared on, or disappeared from, the DataNode without the NameNode knowing yet, since it is still holding the stale block locations from the previous report (if any). Will this cause inconsistencies like the ones below (and many other edge cases, of course)?
The NameNode instructs a client to read from this DataNode, but the block is in fact gone.
The NameNode instructs a client to write to this DataNode, but the block in fact already exists there.
If these inconsistencies are in fact handled, what am I missing? Or if these inconsistencies do not matter, why?
Appreciate it if anyone can explain the logic in design or point me to relevant code. Thanks in advance!

HDFS Showing 0 Blocks after cluster reboot

I've set up a small cluster for testing / academic purposes. I have 3 nodes, one of which is acting both as namenode and datanode (and secondarynamenode).
I uploaded 60 GB of files (about 6.5 million files) and uploads started to get really slow, so I read on the internet that I could stop the secondary namenode service on the main machine; at that moment it had no visible effect on anything.
After I rebooted all 3 computers, two of my datanodes show 0 blocks (despite showing disk usage in the web interface), even with both namenode services running.
One of the nodes with the problem is the one running the namenode as well, so I am guessing it is not a network problem.
Any ideas on how I can get these blocks recognized again (without starting all over again, which took about two weeks of uploading)?
Update
Half an hour after another reboot, this showed up in the logs:
2018-03-01 08:22:50,212 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x199d1a180e357c12, containing 1 storage report(s), of which we sent 0. The reports had 6656617 total blocks and used 0 RPC(s). This took 679 msec to generate and 94 msecs for RPC and NN processing. Got back no commands.
2018-03-01 08:22:50,212 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
java.io.EOFException: End of File Exception between local host is: "Warpcore/192.168.15.200"; destination host is: "warpcore":9000; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException
Then came the EOF stack trace. After searching the web I discovered this [http://community.cloudera.com/t5/CDH-Manual-Installation/CDH-5-5-0-datanode-failed-to-send-a-large-block-report/m-p/34420], but I still can't understand how to fix it.
The block report is too big and needs to be split, but I don't know how or where I should configure this. I'm still googling...
The problem seems to be low RAM on my namenode. As a workaround I added more directories to the namenode configuration as if I had multiple disks, and rebalanced the files manually as instructed in the comments here.
Since Hadoop 3.0 reports each disk separately, the datanode was able to send its report and I was able to retrieve the files. This is an ugly workaround and not for production, but good enough for my academic purposes.
An interesting side effect was the datanode reporting the available disk space multiple times, which could lead to serious problems in production.
It seems a better solution is to use HAR to reduce the number of blocks, as described here and here.
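For anyone hitting the same wall, the two settings I keep seeing mentioned for oversized block reports are the ones below. The values are the defaults I believe recent Hadoop versions ship with, so treat them as a starting point and verify against hdfs-default.xml / core-default.xml for your release:
dfs.blockreport.split.threshold=1000000
ipc.maximum.data.length=67108864
Lowering the first makes the datanode send one report per storage directory once its block count crosses the threshold (which is effectively what the multiple-directory workaround above exploits); raising the second on the namenode lets it accept larger RPC messages. This is a sketch of where to look, not a confirmed fix.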

Why does the HDFS client cache the file data into a temporary local file?

Why can't the HDFS client send data directly to the DataNode?
What's the advantage of the HDFS client cache?
An application request to create a file does not reach the NameNode immediately.
In fact, initially the HDFS client caches the file data into a temporary local file.
Application writes are transparently redirected to this temporary local file.
When the local file accumulates data worth at least one HDFS block size, the client contacts the NameNode to create a file.
The NameNode then proceeds as described in the section on Create. The client flushes the block of data from the local temporary file to the specified DataNodes.
When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the DataNode.
The client then tells the NameNode that the file is closed.
At this point, the NameNode commits the file creation operation into a persistent store. If the NameNode dies before the file is closed, the file is lost.
It sounds like you are referencing the Apache Hadoop HDFS Architecture documentation, specifically the section titled Staging. Unfortunately, this information is out-of-date and no longer an accurate description of current HDFS behavior.
Instead, the client immediately issues a create RPC call to the NameNode. The NameNode tracks the new file in its metadata and replies with a set of candidate DataNode addresses that can receive writes of block data. Then the client starts writing data to the file. As the client writes data, it is writing on a socket connection to a DataNode. If the written data becomes large enough to cross a block size boundary, then the client interacts with the NameNode again, issuing an addBlock RPC to allocate a new block in NameNode metadata and get a new set of candidate DataNode locations. There is no point at which the client writes to a local temporary file.
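To make that concrete, here is a minimal client-side sketch using the standard FileSystem API (the path is made up for illustration); the points where NameNode RPCs happen are noted in the comments:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create() issues the create RPC to the NameNode right away;
        // no local temporary file is involved.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"))) {
            // Writes stream to the DataNode pipeline over a socket. When the
            // data crosses a block boundary, the client calls addBlock on the
            // NameNode behind the scenes to allocate the next block.
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
        // Closing the stream completes the last block, and the client then
        // tells the NameNode the file is complete.
        fs.close();
    }
}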
Note however that alternative file systems, such as S3AFileSystem which integrates with Amazon S3, may support options for buffering to disk. (See the Apache Hadoop documentation for Integration with Amazon Web Services if you're interested in more details on this.)
I have filed Apache JIRA HDFS-11995 to track correcting the documentation.

When are files closed in HDFS

I'm running into a few issues when writing to HDFS (through Flume's HDFS sink). I think these are caused mostly by IO timeouts, but I'm not sure.
I end up with files that are open for write for a very long time and give the error "Cannot obtain block length for LocatedBlock{... }". It can be fixed if I explicitly recover the lease. I'm trying to understand what could cause this. I've been trying to reproduce it outside Flume but have had no luck yet. Could someone help me understand when such a situation could happen - a file on HDFS ends up not getting closed and stays like that until manual intervention to recover the lease?
I thought the lease was recovered automatically based on the soft and hard limits. I've tried killing my sample code that writes to HDFS (I've also tried disconnecting the network to make sure no shutdown hooks are executed) to leave a file open for write, but couldn't reproduce it.
We have had recurring problems with Flume, but it's substantially better with Flume 1.6+. We have an agent running on servers external to our Hadoop cluster with HDFS as the sink. The agent is configured to roll to new files (close current, and start a new one on the next event) hourly.
Once an event is queued on the channel, the Flume agent operates in a transactional manner -- the file is sent, but not dequeued until the agent can confirm a successful write to HDFS.
In the case where HDFS is unavailable to the agent (restart, network issue, etc.) there are files left on HDFS that are still open. Once connectivity is restored, the Flume agent will find these stranded files and either continue writing to them or close them normally.
However, we have found several edge cases where files seem to get stranded and left open, even after the hourly rolling has successfully renamed the file. I am not sure if this is a bug, a configuration issue, or just the way it is. When it happens, it completely messes up subsequent processing that needs to read the file.
We can find these files with hdfs fsck /foo/bar -openforwrite and can successfully hdfs dfs -mv them then hdfs dfs -cp from their new location back to their original one -- a horrible hack. We think (but have not confirmed) that hdfs debug recoverLease -path /foo/bar/openfile.fubar will cause the file to be closed, which is far simpler.
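For what it's worth, the same lease recovery can also be triggered from code; a rough sketch using the DistributedFileSystem API (reusing the placeholder path from above, so treat it as illustrative only):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class RecoverLeaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // Asks the NameNode to begin lease recovery for the stuck file.
            // Returns true if the file is closed (or gets closed immediately).
            boolean closed = dfs.recoverLease(new Path("/foo/bar/openfile.fubar"));
            System.out.println("closed: " + closed);
        }
        fs.close();
    }
}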
Recently we had a case where we stopped HDFS for a couple minutes. This broke the flume connections, and left a bunch of seemingly stranded open files in several different states. After HDFS was restarted, the recoverLease option would close the files, but moments later there would be more files open in some intermediate state. Within an hour or so, all the files had been successfully "handled" -- my assumption is that these files were reassociated with the agent channels. Not sure why it took so long -- not that many files. Another possibility is that it's pure HDFS cleaning up after expired leases.
I am not sure this is an answer to the question (which is also 1 year old now :-) ) but it might be helpful to others.

How does Apache Spark handle system failure when deployed in YARN?

Preconditions
Let's assume Apache Spark is deployed on a Hadoop cluster using YARN, and a Spark job is currently running. How does Spark handle the situations listed below?
Cases & Questions
One node of the Hadoop cluster fails due to a disk error. However, replication is high enough and no data was lost.
What will happen to tasks that were running on that node?
One node of the Hadoop cluster fails due to a disk error. Replication was not high enough and data was lost; Spark simply can't find a file anymore that was pre-configured as a resource for the workflow.
How will it handle this situation?
During execution, the primary namenode fails over.
Will Spark automatically use the failover namenode?
What happens when the secondary namenode fails as well?
For some reason, during a workflow the cluster is totally shut down.
Will Spark restart with the cluster automatically?
Will it resume from the last "save" point in the workflow?
I know, some questions might sound odd. Anyway, I hope you can answer some or all.
Thanks in advance. :)
Here are the answers to these questions given on the mailing list (the answers were provided by Sandy Ryza of Cloudera):
"Spark will rerun those tasks on a different node."
"After a number of failed task attempts trying to read the block, Spark would pass up whatever error HDFS is returning and fail the job."
"Spark accesses HDFS through the normal HDFS client APIs. Under an HA configuration, these will automatically fail over to the new namenode. If no namenodes are left, the Spark job will fail."
Restart is part of administration and "Spark has support for checkpointing to HDFS, so you would be able to go back to the last time checkpoint was called that HDFS was available."
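On the checkpointing point, here is a minimal sketch using Spark's Java API (the checkpoint directory is just a placeholder, and I'm assuming the job is submitted via spark-submit so the master is supplied externally); whether it helps depends entirely on where checkpoint() is called in your own workflow:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CheckpointSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("checkpoint-sketch");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Checkpoints are written to HDFS, so they survive executor loss
        // as long as HDFS itself is still reachable.
        sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints");

        JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4));
        rdd.checkpoint();                 // materialized on the next action
        System.out.println(rdd.count());  // triggers the checkpoint

        sc.close();
    }
}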
