How to format datanodes after formatting the namenode on hdfs? - hadoop

I've recently set up Hadoop in pseudo-distributed mode, created data, and loaded it into HDFS. Later I formatted the namenode because of a problem. Now the directories and files that were previously on the datanodes no longer show up (the word "formatting" makes sense in that light). But this leaves me with a doubt: as the namenode no longer holds the metadata for those files, is access to the previously loaded files cut off? If so, how do we delete the data that is still sitting on the datanodes?

Your previous datanode directories are now stale, yes.
You need to go through each datanode manually and delete the contents of those directories; there is no format command for datanodes in the Hadoop CLI.
By default, the datanode directory is a single folder under /tmp. Otherwise, your XML configuration files (hdfs-site.xml, dfs.datanode.data.dir) determine where the data is stored.
See: Where HDFS stores data
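As a minimal sketch of what that cleanup can look like on a pseudo-distributed, single-node setup (the /tmp path and the <user> placeholder are assumptions; always check dfs.datanode.data.dir on your own cluster first):

    # Ask HDFS where this datanode keeps its block data
    hdfs getconf -confKey dfs.datanode.data.dir
    # e.g. file:///tmp/hadoop-<user>/dfs/data on a default pseudo-distributed install

    # Stop HDFS, wipe the stale datanode directory, and start it again
    stop-dfs.sh
    rm -rf /tmp/hadoop-<user>/dfs/data/*    # substitute the directory reported above
    start-dfs.sh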

Related

What will happen if the HDFS client fails to upload the file?

In my understanding, uploading a file to HDFS (the filesystem for Apache Hadoop) follows the procedure below:
1. The client (the hdfs shell) asks the Namenode which datanodes to put the data chunks on.
2. The Namenode answers and records the file's location and some metadata in itself.
3. The client writes the data chunks to the given datanodes.
Suppose that in step 1 the Namenode returned which datanodes should store the data, but after that one of those datanodes becomes unavailable for some reason (e.g. a network failure or a machine outage). The data then cannot be saved on that datanode, yet the metadata has already been stored in the Namenode, leaving the data in an inconsistent state.
Can someone explain how HDFS avoids this situation? I tried to read the Hadoop source code but eventually gave up because it's huge.
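A small hedged pointer for anyone poking at this: HDFS exposes client-side settings that govern what the writer does when a datanode in the write pipeline fails, and fsck shows whether a file's blocks ended up fully replicated afterwards. The property names below are real HDFS keys; the file path is a placeholder:

    # How the client reacts when a pipeline datanode fails mid-write
    hdfs getconf -confKey dfs.client.block.write.replace-datanode-on-failure.enable
    hdfs getconf -confKey dfs.client.block.write.replace-datanode-on-failure.policy

    # After an upload, verify that every block of the file reached its target replication
    hdfs fsck /user/<name>/data.csv -files -blocks -locations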

What is the difference between fsimage and snapshot in hadoop?

I am new to Hadoop. I want to know the difference between a snapshot and the fsimage used for file system state in Hadoop. I have heard that both do the same job, so what is the difference between them?
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. Any change to the file system namespace or its properties is recorded by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS or changing the replication factor causes the NameNode to insert a record into the EditLog indicating this. The NameNode uses a file in its local host OS file system to store the EditLog.
FsImage and the EditLog go hand in hand, which is why that explanation comes first. Now:
The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode’s local file system.
Snapshots support storing a copy of data at a particular instant of time. A snapshot can also be taken of the entire file system. This does not involve copying the data, but recording the file size, block info, etc. for a snapshottable directory.
In plain terms, the FsImage stores the information about where the data is stored and in how many blocks, along with related metadata, while a Snapshot stores a read-only image of the data/file system.
I hope this explains the difference.
fsimage doesn't store the mapping of blocks to files, right? That is stored in the block address table, which is rewritten every time the namenode restarts.
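A quick, hedged illustration of the two concepts on the command line (the /data directory and the fsimage path are placeholders; snapshots can only be taken on directories marked snapshottable):

    # Snapshots: enable them on a directory, then capture read-only, point-in-time views
    hdfs dfsadmin -allowSnapshot /data
    hdfs dfs -createSnapshot /data snap1
    # ... more writes to /data ...
    hdfs dfs -createSnapshot /data snap2
    hdfs snapshotDiff /data snap1 snap2     # what changed between the two snapshots

    # FsImage: a checkpoint of the whole namespace kept on the NameNode's local disk,
    # which can be inspected with the Offline Image Viewer (oiv)
    hdfs oiv -p XML -i /path/to/fsimage_0000000000000000042 -o fsimage.xml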

Hadoop HDFS does not notice when a block file is manually deleted

I would like to remove a specific raw block file (and its included .meta file) from a specific machine (DataNode) in my cluster running HDFS and move it to another specific machine (DataNode).
It's possible to accomplish this if I stop the HDFS, move the block files manually as such, and restart it. The block shows up in the new location fine. However, I would like to do this without stopping the whole cluster.
I have found that if I stop the two DataNodes in question, move the file, and restart them, the Namenode immediately realizes that the destination DataNode now has the file (note that dfsadmin -triggerBlockReport does not work; the DataNodes must be restarted). However, nothing appears capable of making HDFS realize the file has been deleted from the source DataNode. The now nonexistent replica shows up as existing, healthy, and valid no matter what I try. This means HDFS decides the block is over-replicated, causing it to delete a random replica while one of the supposedly existing replicas is actually gone.
Is there any way to force the Namenode to refresh more fully in some way, inform it that the replica has been deleted, make it choose to delete the replica that I myself now know to not exist, or otherwise accomplish this task? Any help would be appreciated.
(I'm aware that the Balancer/DiskBalancer must accomplish this in some way and have looked into its source, but I found it extremely dense and would like to avoid manually editing the Hadoop/HDFS source code if at all possible.)
Edit:
Found a solution. If I delete the block file from the source DataNode but leave its .meta file in place, the block report I then trigger informs the Namenode that the replica is missing. I believe that by deleting the .meta file as well, I was making it so that the DataNode never reported anything about that replica, and the Namenode therefore never registered any change to it on that DataNode.
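A rough sketch of the overall procedure described above, assuming a default directory layout; the hostnames, ports, block IDs, and paths are placeholders, and dfs.datanode.data.dir decides the real location:

    # Placeholder path to one block-pool storage directory on a DataNode
    DN_DIR=/tmp/hadoop-<user>/dfs/data/current/BP-<poolid>/current/finalized/subdir0/subdir0

    # On the source DataNode: stop the daemon and delete only the block file, keeping its .meta
    hadoop-daemon.sh stop datanode          # "hdfs --daemon stop datanode" on Hadoop 3
    rm $DN_DIR/blk_1073741825
    hadoop-daemon.sh start datanode

    # On the destination DataNode: stop, copy in the block file plus its .meta, restart
    hadoop-daemon.sh stop datanode
    cp blk_1073741825 blk_1073741825_1001.meta $DN_DIR/
    hadoop-daemon.sh start datanode

    # Ask the source DataNode for a fresh block report so the Namenode notices the deletion
    hdfs dfsadmin -triggerBlockReport source-datanode.example.com:9867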

Corrupted block in hdfs cluster

The screenshot added below shows the output of hdfs fsck /. It shows that the "/" directory is corrupted. This is the master node of my Hadoop cluster. What should I do?
If you are using Hadoop 2, you can run a Standby NameNode to achieve high availability. Without that, your cluster's master is a single point of failure.
You cannot retrieve the NameNode's data from anywhere else, since it is different from the usual data you store. If your NameNode goes down, your blocks and files will still be there, but you won't be able to access them because there is no longer any related metadata on the NameNode.
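As a hedged practical follow-up, fsck itself can list the affected files and either quarantine or remove them; the file path below is a placeholder:

    # List the files whose blocks are reported as corrupt
    hdfs fsck / -list-corruptfileblocks

    # Show block-level detail for one affected file
    hdfs fsck /user/<name>/file.txt -files -blocks -locations

    # Either move the corrupt files to /lost+found ...
    hdfs fsck / -move
    # ... or delete them outright if the data can be re-loaded from the source
    hdfs fsck / -delete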

Hadoop: How does NameNode know which blocks correspond to a file?

The NameNode in Hadoop does not persist the block locations; they are kept in memory, and on startup the DataNodes report which blocks they hold.
If I copyFromLocal a file to HDFS, it is transferred to HDFS, because I can see it with "hadoop fs -ls".
I was wondering how Hadoop knows which filename corresponds to which blocks.
The NameNode maintains a File System Image (fsimage) which stores the mapping of files to blocks. It also keeps an edit log which records any edits to the file system. The Secondary NameNode periodically reads the File System Image and the Edit Log from the NameNode and merges them to create the new File System Image for the NameNode.
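To see that mapping for a concrete file, a small example (the paths are placeholders):

    # Upload a file, then ask the Namenode which blocks it maps to and where the replicas live
    hdfs dfs -copyFromLocal big.log /user/<name>/big.log
    hdfs fsck /user/<name>/big.log -files -blocks -locations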

Resources