Hadoop HDFS does not notice when a block file is manually deleted

I would like to remove a specific raw block file (and its associated .meta file) from a specific machine (DataNode) in my cluster running HDFS and move it to another specific machine (DataNode).
It's possible to accomplish this if I stop HDFS, move the block files manually, and restart it. The block shows up in the new location fine. However, I would like to do this without stopping the whole cluster.
I have found that if I stop the two DataNodes in question, move the file, and restart them, the Namenode immediately realizes that the destination DataNode now has the file (note that dfsadmin -triggerBlockReport does not work; the DataNodes must be restarted). However, nothing appears capable of making HDFS realize the file has been deleted from the source DataNode. The now-nonexistent replica shows up as existing, healthy, and valid no matter what I try. As a result, HDFS decides that the block is over-replicated and deletes a random replica, even though one of its "existing" replicas is actually gone.
Is there any way to force the Namenode to refresh more fully in some way, inform it that the replica has been deleted, make it choose to delete the replica that I myself now know to not exist, or otherwise accomplish this task? Any help would be appreciated.
(I'm aware that the Balancer/DiskBalancer must accomplish this in some way and have looked into its source, but I found it extremely dense and would like to avoid manually editing Hadoop/HDFS source code if at all possible.)
Edit:
Found a solution. If I delete the block file from the source DataNode but not the .meta file, the block report I then trigger informs the Namenode that the replica is missing. I believe that by deleting the .meta file as well, I was preventing the Namenode from ever registering any change to that replica on that DataNode (since nothing about it was ever reported).
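For reference, here is a rough sketch of that workflow as shell commands. The hostnames (dn-src, dn-dst), data directory, block-pool ID, and block ID are all placeholders, and the finalized/subdirN layout under dfs.datanode.data.dir varies by cluster and Hadoop version, so treat this as an outline rather than a recipe:

    # 1. Stop the source and destination DataNodes only (Hadoop 3 syntax;
    #    on Hadoop 2 use hadoop-daemon.sh stop datanode).
    ssh dn-src 'hdfs --daemon stop datanode'
    ssh dn-dst 'hdfs --daemon stop datanode'

    # 2. Copy the block and its .meta file to the destination, then delete
    #    only the block file (keep the .meta) on the source.
    scp dn-src:/data/dfs/dn/current/BP-1234/current/finalized/subdir0/subdir0/blk_1073741825* \
        dn-dst:/data/dfs/dn/current/BP-1234/current/finalized/subdir0/subdir0/
    ssh dn-src 'rm /data/dfs/dn/current/BP-1234/current/finalized/subdir0/subdir0/blk_1073741825'

    # 3. Restart both DataNodes; the destination's startup block report adds the
    #    new replica, and a triggered report from the source (with the .meta
    #    still present) marks the old replica as missing.
    ssh dn-src 'hdfs --daemon start datanode'
    ssh dn-dst 'hdfs --daemon start datanode'
    hdfs dfsadmin -triggerBlockReport dn-src:9867   # 9867 = Hadoop 3 DataNode IPC port (50020 on Hadoop 2)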

Related

In HDFS, are datanodes online for read/write before they finish the full block report?

In Apache HDFS, when a DataNode starts, it registers with the NameNode. A block report happens some time later (it is not atomic with the registration). I haven't fully understood the code, but it seems to me the NameNode treats a DataNode that has not yet sent its block report the same as all other DataNodes that have already reported. I get this hint because the DatanodeManager registration logic does not mark a DataNode with a special state like NOT_REPORTED.
Therefore the HDFS client will be able to issue reads/writes to the new DataNode before it finishes its report to the NameNode. Let's consider the two cases, depending on whether the DataNode is fresh.
If the DataNode is fresh (i.e. there are no blocks stored on it anyway), it is safe to use for read/write. There is nothing to read, and blocks written should be reported in the block report to the NameNode later.
If the DataNode is NOT fresh (i.e. there is data on this DataNode, and somehow it went offline and came back online), there may be a gap in the NameNode-side metadata where some blocks have appeared on or disappeared from the DataNode but the NameNode does not know it yet. The NameNode is still holding the stale block location metadata from the previous report (if any). Will this cause inconsistencies like the ones below (and many other edge cases, of course)?
The NameNode instructs a client to read from this DataNode but the block is in fact gone.
The NameNode instructs a client to write to this DataNode but the block in fact already exists there.
If these inconsistencies are in fact handled, what am I missing? Or if these inconsistencies do not matter, why?
I'd appreciate it if anyone could explain the design logic or point me to the relevant code. Thanks in advance!
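For anyone who wants to observe this, the NameNode's current (possibly stale) view of block locations can be inspected from the CLI, and a block report can be requested on demand. The path and hostname below are placeholders:

    # Show which DataNodes the NameNode currently believes hold a file's blocks
    # (this reflects the reports it has received so far, so it can be stale):
    hdfs fsck /some/path/file.txt -files -blocks -locations

    # Ask a specific DataNode to send a block report right away
    # (9867 is the Hadoop 3 default DataNode IPC port; 50020 on Hadoop 2):
    hdfs dfsadmin -triggerBlockReport dn1.example.com:9867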

Does the secondary namenode also update the metadata stored on NFS?

I am reading "Hadoop: The Definitive Guide". This is how the author explains fault tolerance before Hadoop 2.x:
Without the namenode, the filesystem cannot be used. In fact, if the machine running the namenode were obliterated, all the files on the filesystem would be lost since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes. For this reason, it is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for this.
The first way is to back up the files that make up the persistent state of the filesystem metadata. Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems. These writes are synchronous and atomic. The usual configuration choice is to write to local disk as well as a remote NFS mount.
It is also possible to run a secondary namenode, which despite its name does not act as a namenode. Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large. The secondary namenode usually runs on a separate physical machine because it requires plenty of CPU and as much memory as the namenode to perform the merge. It keeps a copy of the merged namespace image, which can be used in the event of the namenode failing. However, the state of the secondary namenode lags that of the primary, so in the event of total failure of the primary, data loss is almost certain. The usual course of action in this case is to copy the namenode's metadata files that are on NFS to the secondary and run it as the new primary.
My understanding is that NFS is always in sync with the primary namenode. My question is: how does the metadata stored on NFS get synced with the primary namenode after the secondary namenode has updated the primary namenode's metadata? What happens if the primary fails totally before NFS gets synced?
That document doesn't say the "primary" or secondary NameNode is necessarily in sync with NFS. It's saying that, in the event you have configured Namenode backups to NFS (something you must do yourself, I believe, as it says this is a "configuration choice"), you can restore them to a new server and designate it as the new Namenode. Note that the secondary namenode "despite its name does not act as a namenode" and that "the state of the secondary namenode lags that of the primary"; therefore it will never get data that didn't already arrive on the primary, it only checkpoints what's already there.
That quoted section is alluding to having a Standby Namenode, which serves a different purpose than the secondary, and the standby should be in sync.
Quoted from that link,
Note that, in an HA cluster, the Standby NameNode also performs checkpoints of the namespace state, and thus it is not necessary to run a Secondary NameNode, CheckpointNode, or BackupNode in an HA cluster. In fact, to do so would be an error
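Going back to the "configuration choice" above: whether the namenode is writing its metadata to an NFS backup at all is governed by dfs.namenode.name.dir, which you can check from the CLI. The paths below are made-up examples:

    # List the directories the NameNode persists its metadata to. More than one
    # entry (e.g. a local disk plus an NFS mount) is the backup setup the book
    # describes; the NameNode writes to every listed directory synchronously.
    hdfs getconf -confKey dfs.namenode.name.dir
    # e.g. file:///data/dfs/nn,file:///mnt/namenode-nfs/dfs/nn
    #      (configured as a comma-separated list in hdfs-site.xml)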

How to format datanodes after formatting the namenode on hdfs?

I've recently been setting up Hadoop in pseudo-distributed mode, and I have created data and loaded it into HDFS. Later, I formatted the namenode because of a problem. Now when I do that, I find that the directories and the files which were already there on the datanodes don't show up anymore. (The word "formatting" makes sense, though.) But now I have this doubt: since the namenode doesn't hold the metadata of the files anymore, is access to the previously loaded files cut off? If that's a yes, then how do we delete the data already there on the datanodes?
Your previous datanode directories are now stale, yes.
You need to manually go through each datanode and delete the contents of those directories. There is no such format command for datanodes in the Hadoop CLI.
By default, the datanode directory is a single folder under /tmp.
Otherwise, your XML configuration files determine where the data is stored.
Where HDFS stores data
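Roughly, on each datanode, something like the following (the paths are examples; check what your own configuration returns first):

    # Find where this datanode stores its blocks.
    hdfs getconf -confKey dfs.datanode.data.dir
    # e.g. file:///tmp/hadoop-hdfs/dfs/data

    # Stop the datanode, clear the now-stale block storage (including the old
    # VERSION file with the previous cluster ID), and start it again so it
    # registers cleanly against the freshly formatted namenode.
    hdfs --daemon stop datanode        # hadoop-daemon.sh stop datanode on Hadoop 2
    rm -rf /tmp/hadoop-hdfs/dfs/data/*
    hdfs --daemon start datanode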

Corrupted block in hdfs cluster

The screenshot below shows the output of hdfs fsck /. It shows that the "/" directory is corrupted. This is the master node of my Hadoop cluster. What should I do?
If you are using Hadoop 2, you can run a Standby namenode to achieve High Availability. Without that, your cluster's master will be a Single Point of Failure.
You cannot retrieve the Namenode's data from anywhere else, since it is different from the usual data you store. If your namenode goes down, your blocks and files will still be there, but you won't be able to access them, since the related metadata in the namenode would be gone.
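If the corruption reported by fsck is limited to file blocks (rather than the namenode metadata itself), the usual triage looks something like this; the path is a placeholder:

    # List the corrupt/missing blocks and the files they belong to.
    hdfs fsck / -list-corruptfileblocks

    # Inspect a specific affected file in detail.
    hdfs fsck /path/to/file -files -blocks -locations

    # Last resort: delete the corrupt files (their data is lost) so the
    # filesystem reports healthy again.
    hdfs fsck / -delete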

When are files closed in HDFS

I'm running into a few issues when writing to HDFS (through Flume's HDFS sink). I think these are caused mostly by IO timeouts, but I'm not sure.
I end up with files that stay open for write for a very long time and give the error "Cannot obtain block length for LocatedBlock{... }". It can be fixed if I explicitly recover the lease. I'm trying to understand what could cause this. I've been trying to reproduce it outside Flume but have had no luck yet. Could someone help me understand when such a situation could happen, i.e. a file on HDFS ends up not getting closed and stays like that until manual intervention recovers the lease?
I thought the lease was recovered automatically based on the soft and hard limits. I've tried killing my sample code that writes to HDFS (and also tried disconnecting the network to make sure no shutdown hooks are executed) in order to leave a file open for write, but I couldn't reproduce it.
We have had recurring problems with Flume, but it's substantially better with Flume 1.6+. We have an agent running on servers external to our Hadoop cluster with HDFS as the sink. The agent is configured to roll to new files (close the current one and start a new one on the next event) hourly.
Once an event is queued on the channel, the Flume agent operates in a transactional manner -- the file is sent, but not dequeued until the agent can confirm a successful write to HDFS.
In the case where HDFS is unavailable to the agent (restart, network issue, etc.), there are files left on HDFS that are still open. Once connectivity is restored, the Flume agent will find these stranded files and either continue writing to them or close them normally.
However, we have found several edge cases where files seem to get stranded and left open, even after the hourly rolling has successfully renamed the file. I am not sure if this is a bug, a configuration issue, or just the way it is. When it happens, it completely messes up subsequent processing that needs to read the file.
We can find these files with hdfs fsck /foo/bar -openforwrite and can successfully hdfs dfs -mv them and then hdfs dfs -cp them from their new location back to their original one -- a horrible hack. We think (but have not confirmed) that hdfs debug recoverLease -path /foo/bar/openfile.fubar will cause the file to be closed, which is far simpler.
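For reference, those two commands combine into a simple check-and-recover sequence (the paths are the placeholders from above; -retries is optional):

    # Find files stuck open-for-write under a directory.
    hdfs fsck /foo/bar -openforwrite | grep OPENFORWRITE

    # Ask the NameNode to recover the lease and close a specific file.
    hdfs debug recoverLease -path /foo/bar/openfile.fubar -retries 5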
Recently we had a case where we stopped HDFS for a couple of minutes. This broke the Flume connections and left a bunch of seemingly stranded open files in several different states. After HDFS was restarted, the recoverLease option would close the files, but moments later there would be more files open in some intermediate state. Within an hour or so, all the files had been successfully "handled" -- my assumption is that these files were reassociated with the agent channels. I'm not sure why it took so long -- there weren't that many files. Another possibility is that it was purely HDFS cleaning up after expired leases.
I am not sure this is an answer to the question (which is also 1 year old now :-) ) but it might be helpful to others.
