Metadata storage by Namenode for all file blocks - hadoop

While reading the book Hadoop: The Definitive Guide, I came across this page with the following line:
The namenode also knows the datanodes on which all the blocks for a given file are located; however, it does not store block locations persistently, since this information is reconstructed from datanodes when the system starts.
I am struggling to understand how this works. Let's say that I copy a 1 GB file onto an 8-node cluster with a replication factor of 3. So each datanode will have 1 block, and these blocks will be replicated onto other nodes, bringing the total number of blocks on each node effectively to 3. Now the namenode is supposed to keep an index containing the location of each block. But according to the text, if the namenode does not store block locations persistently, how are they reconstructed after the cluster is shut down and restarted? There will be no way of telling which block belongs to which file. Can someone please explain this to me?

The namenode does preserve some state about the files (name, path, size, block size, block IDs, etc.), just not the physical locations of the blocks.
When the datanodes start up, each one effectively walks its dfs data directory, discovering all the block files it holds, and once complete, reports to the namenode the blocks that it hosts.
The namenode builds up the map of files to block locations from the block reports sent by each datanode.
This is one of the reasons the cluster sometimes takes a few minutes to come out of safe mode when it first starts up: if you have lots of files, it can take a while for each datanode to walk its directories and report the blocks it hosts.

Each fsimage file contains a serialized form of all the directory and file inodes in the filesystem. Each inode is an internal representation of a file or directory’s metadata and contains such information as the file’s replication level, modification and access times, access permissions, block size, and the blocks the file is made up of. For directories, the modification time, permissions, and quota metadata are stored. An fsimage file does not record the datanodes on which the blocks are stored. Instead, the namenode keeps this mapping in memory, which it constructs by asking the datanodes for their block lists when they join the cluster and periodically afterward to ensure the namenode’s block mapping is up to date.
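As an illustration of that in-memory mapping, a client can ask the namenode where a file's blocks live through the standard FileSystem API. A minimal sketch, assuming the client is configured to reach the cluster and the file path is passed as the first argument (the class name is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path(args[0]);              // e.g. /user/hadoop/file.txt
            FileStatus status = fs.getFileStatus(file);
            // Block locations are served from the namenode's in-memory map,
            // which was built from the datanodes' block reports.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        b.getOffset(), b.getLength(), String.join(",", b.getHosts()));
            }
            fs.close();
        }
    }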

Related

Deleting HDFS Block Pool

I am running Spark on a Hadoop cluster. I tried running a Spark job and noticed some issues; by looking at the logs of a datanode I eventually realised that the filesystem of one of the datanodes is full.
I used hdfs dfsadmin -report to confirm this. DFS Remaining is 0 B because Non DFS Used is massive (155 GB of the 193 GB configured capacity).
When I looked at the file system on this data node I could see most of this comes from the /usr/local/hadoop_work/ directory. There are three block pools there and one of them is very large (98GB). When I look on the other data node in the cluster it only has one block pool.
What I am wondering is: can I simply delete two of these block pools? I'm assuming (but don't know enough about this) that the namenode (I have only one) will be looking at the most recent block pool, which is smaller in size and corresponds to the one on the other data node.
As outlined in the comment above, I eventually did just delete the two block pools. I did this based on the fact that these block pool IDs didn't exist on the other data node, and by looking through the local filesystem I could see that the files under these IDs hadn't been updated for a while.
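Before deleting anything, it can help to list which block-pool directories a datanode actually holds and when they were last modified, so they can be compared with the blockpoolID recorded in the namenode's current/VERSION file. A minimal sketch, assuming the usual <data.dir>/current/BP-* layout (the class name and the example path are illustrative):

    import java.io.File;

    public class ListBlockPools {
        public static void main(String[] args) {
            // e.g. /usr/local/hadoop_work/hdfs/datanode/current -- adjust to your dfs.datanode.data.dir
            File current = new File(args[0]);
            File[] pools = current.listFiles((dir, name) -> name.startsWith("BP-"));
            if (pools == null) {
                System.err.println("Not a datanode 'current' directory: " + current);
                return;
            }
            for (File pool : pools) {
                // Print each block-pool ID and when it was last touched.
                System.out.printf("%s  (last modified %tF)%n", pool.getName(), pool.lastModified());
            }
        }
    }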

When is a file from the local system moved to HDFS

I am new to Hadoop, so please excuse me if my questions are trivial.
Is the local file system different from HDFS?
While creating a MapReduce program, we set the input file path using the FileInputFormat.addInputPath() function. Does it split that data across multiple datanodes and also compute input splits? If yes, how long will this data stay on the datanodes? And can we write a MapReduce program against data that already exists in HDFS?
1: HDFS is a solution for distributed storage; purely local storage runs into capacity ceilings and backup problems. HDFS manages the storage of the whole server cluster as a single resource: the namenode manages the directory tree and block metadata, and the datanodes act as the containers that actually store the blocks. HDFS can be regarded as a higher-level abstraction over local storage, built to solve the core problems of distributed storage.
2: When we read through Hadoop's FileInputFormat, the client first calls open() on the FileSystem, which contacts the namenode for the block locations and returns them to the client. The client then reads through an FSDataInputStream, pulling the blocks from the different datanodes one by one, and at the end closes the FSDataInputStream.
When we put data into HDFS from the client, the data is split into blocks (128 MB by default, 64 MB on older versions) and stored on different machines.
The data is persisted on the datanodes' hard disks.
So if your file is far bigger than a single common server can handle and you need distributed computing, you can use HDFS.
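A minimal sketch of that read sequence using the FileSystem API (the class name and path argument are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadFromHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // open() asks the namenode for the block locations of the file;
            // the returned FSDataInputStream then streams each block from a datanode.
            try (FSDataInputStream in = fs.open(new Path(args[0]))) {   // e.g. /user/hadoop/file.txt
                IOUtils.copyBytes(in, System.out, 4096, false);
            }   // the stream is closed here
            fs.close();
        }
    }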
HDFS is not your local filesystem - it is a distributed file system. This means your dataset can be larger than the maximum storage capacity of a single machine in your cluster. HDFS by default uses a block size of 64 MB (128 MB in newer versions). Each block is replicated (3 copies by default) across different nodes in the cluster to account for redundancy (such as node failure). So with HDFS, you can think of your entire cluster as one large file system.
When you write a MapReduce program and set your input path, it will try to locate that path on HDFS. The input is then automatically divided up into what are known as input splits - fixed-size partitions containing multiple records from your input file. A Mapper is created for each of these splits. Next, the map function (which you define) is applied to each record within each split, and the output generated is stored on the local filesystem of the node where the map function ran. The Reducer then copies this output to its node and applies the reduce function. If a runtime error occurs while executing map and the task fails, Hadoop will run the same mapper task on another node and have the reducer copy that output instead.
The reducers use the outputs generated from all the mapper tasks, so by this point the reducers are not concerned with the input splits that were fed to the mappers.
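A minimal word-count mapper/reducer sketch of that flow (the class names are illustrative): each map call handles one record of a split, the map output is spilled to the local disk of the node that ran it, and the reducers later fetch and merge that output.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // The map function runs once per record of an input split; its output is written to the
    // local disk of the node it ran on, and the reducers later fetch it from there.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // The reduce function sees all values for a key, regardless of which split they came from.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }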
Grouping answers as per the questions:
HDFS vs local filesystem
Yes, HDFS and the local file system are different. HDFS is a Java-based file system that is a layer above a native filesystem (like ext3). It is designed to be distributed, scalable and fault-tolerant.
How long do data nodes keep data?
When data is ingested into HDFS, it is split into blocks, replicated 3 times (by default) and distributed throughout the cluster data nodes. This process is all done automatically. This data will stay in the data nodes till it is deleted and finally purged from trash.
InputSplit calculation
FileInputFormat.addInputPath() specifies the HDFS file or directory from which files should be read and sent to mappers for processing. Before this point is reached, the data should already be available in HDFS, since the job is about to process it. So the data files themselves will have been split into blocks and replicated across the datanodes. The mapping of files to their blocks and the nodes they reside on is maintained by a master node called the NameNode.
Now, based on the input path specified by this API, Hadoop will calculate the number of InputSplits required for processing the file/s. Calculation of InputSplits is done at the start of the job by the MapReduce framework. Each InputSplit then gets processed by a mapper. This all happens automatically when the job runs.
MapReduce on existing data
Yes, a MapReduce program can run on data that already exists in HDFS.
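For completeness, a minimal driver sketch that points FileInputFormat.addInputPath() at data already sitting in HDFS (class name and paths are illustrative; the identity Mapper/Reducer simply pass records through):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ExistingDataJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "process existing HDFS data");
            job.setJarByClass(ExistingDataJob.class);
            // Identity mapper/reducer just copy records through; replace with real logic.
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);
            // Default TextInputFormat gives LongWritable offsets as keys and Text lines as values.
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            // The input path points at data already stored in HDFS; the framework
            // computes the InputSplits from it when the job starts.
            FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/hadoop/existing-data
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }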

Data replication in hadoop cluster

I am a beginner learning Hadoop. Is it possible that 2 different data blocks from the same file could be stored in the same data node? For example: blk-A and blk-B from file "file.txt" could be placed in the same data node (datanode 1).
Here is the documentation that explains the block placement policy. Currently, HDFS replication is 3 by default, which means there are 3 replicas of each block. The way they are placed is:
The first replica is placed on a datanode on one rack.
The second replica is placed on a datanode on a different rack.
The third replica is placed on a different datanode on the same rack as the second replica.
This policy helps in events such as a datanode dying, a block getting corrupted, and so on.
Is it possible?
Unless you make changes in the source code, there is no property you can change that will allow you to place two replicas of a block on the same datanode.
My opinion is that placing two replicas on the same datanode defeats the purpose of HDFS. Blocks are replicated so HDFS can recover for the reasons described above. If both replicas are placed on the same datanode and that datanode dies, you lose two copies instead of one.
The answer depends on the cluster topology. Hadoop tries to distribute data among racks and datanodes. But what if you only have one rack, or a single-node (pseudo-distributed) cluster? In those cases the optimal distribution doesn't happen, and it is possible that all blocks end up on the same datanode. In production it is recommended to have more than one rack (physically, not only in configuration) and at least as many datanodes as the replication factor.
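One way to check this on a real cluster is to ask the namenode where a file's blocks ended up and see whether any datanode appears more than once. A minimal sketch (class name and path argument are illustrative):

    import java.util.HashSet;
    import java.util.Set;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Report which datanodes hold more than one block of the same file.
    public class BlocksPerNode {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path(args[0])); // e.g. /user/hadoop/file.txt
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            Set<String> seen = new HashSet<>();
            for (BlockLocation block : blocks) {
                for (String host : block.getHosts()) {
                    if (!seen.add(host)) {
                        System.out.println(host + " holds more than one block of this file");
                    }
                }
            }
            fs.close();
        }
    }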

how hdfs removes over-replicated blocks

For example, I wrote a file into HDFS using replication factor 2. The node I was writing to now has all the blocks of the file. The other copies of the file's blocks are scattered around the remaining nodes in the cluster. That's the default HDFS policy.
What exactly happens if I lower the replication factor of the file to 1?
How does HDFS decide which blocks to delete from which nodes? I hope it tries to delete blocks from the nodes that hold the largest number of the file's blocks?
Why am I asking: if it does, it would make sense, since it would ease processing of the file. If there were only one copy of every block and all the blocks were located on the same node, it would be harder to process the file with MapReduce because of the data transfer to other nodes in the cluster.
When a block becomes over-replicated, the name node chooses a replica to remove. The name node will prefer not to reduce the number of racks that host replicas, and secondly prefer to remove a replica from the data node with the least amount of available disk space. This may help rebalancing the load over the cluster.
Source: The Architecture of Open Source Applications
Over-replicated blocks are removed by HDFS from different nodes at random and the cluster is rebalanced; that means they are not just removed from the current node.
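For reference, lowering the replication factor of an existing file from a client is what triggers the over-replication handling described above. A minimal sketch (class name and path are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LowerReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path(args[0]); // e.g. /user/hadoop/file.txt
            // Ask the namenode to drop the target replication to 1; the namenode then
            // marks the excess replicas as over-replicated and schedules their deletion.
            boolean accepted = fs.setReplication(file, (short) 1);
            System.out.println("Replication change accepted: " + accepted);
            fs.close();
        }
    }

Note that the excess replicas are deleted asynchronously, so disk space is not freed the instant the call returns.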

What happens when data to be inserted into hdfs is larger than the capacity of datanodes

I know data uploaded into HDFS is replicated across the datanodes of a Hadoop cluster as blocks. My question is: what happens when the combined capacity of all datanodes in the cluster is insufficient? For example, I have 3 datanodes, each with 10 GB of data capacity (30 GB altogether), and I want to insert 60 GB of data into HDFS on the same cluster. I don't see how the 60 GB of data could be split into blocks (~64 MB typically) and accommodated by the datanodes.
Thanks
I haven't tested it, but it should fail with an out-of-storage message. As each block of data is written into HDFS, it goes through the replication-factor process. Your upload would get about halfway through and then die.
That being said, you could potentially gzip the data (high compression) before the upload and squeeze it in, depending on how compressible the data is.
I ran into this issue when I was trying to move a large file from the local fs to HDFS: it got stuck in the middle, responded with a Java out-of-space error, cancelled the move/copy command, and deleted all the blocks of the file that had already been copied to HDFS.
So that means we can't copy a single file larger than the HDFS capacity of the cluster.
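A rough pre-flight check along these lines compares the file size (times the replication factor) with the remaining capacity the namenode reports. A minimal sketch (class name and the byte-size argument are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FsStatus;

    public class CapacityCheck {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FsStatus status = fs.getStatus();               // capacity/used/remaining as seen by the namenode
            long fileSize = Long.parseLong(args[0]);        // size of the file you want to upload, in bytes
            short replication = fs.getDefaultReplication(); // typically 3
            long needed = fileSize * replication;
            System.out.printf("HDFS remaining: %d bytes, needed (with replication %d): %d bytes%n",
                    status.getRemaining(), replication, needed);
            if (needed > status.getRemaining()) {
                System.out.println("Upload would likely fail part-way with an out-of-space error.");
            }
            fs.close();
        }
    }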
