While reading about the metadata that is stored on datanodes in HDFS, I came across these options, but I am not sure whether all of them are correct or only some.
It stores a file with the checksum of the blocks that it stored.
It stores the version of Hadoop used for creating the blocks and the namespaceID.
It stores information about the other blocks in the same namespace.
What is the correct answer?
As per definitive guide:
HDFS blocks are stored in files with a blk_ prefix; they consist of the raw bytes of a portion of the file being stored. Each block has an associated metadata file with a .meta suffix. It is made up of a header with version and type information, followed by a series of checksums for sections of the block.
Too late to give an answer, but this will be useful for someone.
Option 1 is correct.
It stores a file with the checksum of the blocks that it stored.
The .meta file on the datanode contains the checksum information for that block, which is cross-checked when a client reads the block from the datanode; if the checksum does not match, an error is thrown.
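For illustration, here is a rough sketch of what that checksum scheme amounts to, assuming the current defaults of 512-byte chunks and CRC-32C. The class and method names below are invented for the sketch, not Hadoop's actual API:

```java
import java.util.zip.CRC32C;

// Illustrative sketch: one CRC-32C value per 512-byte chunk of a block's
// raw bytes, which is the scheme recorded in the datanode's .meta file.
// Class and method names are invented for the sketch, not Hadoop's API.
public class BlockChecksums {
    static final int BYTES_PER_CHECKSUM = 512; // dfs.bytes-per-checksum default

    static long[] chunkChecksums(byte[] block) {
        int chunks = (block.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            CRC32C crc = new CRC32C();
            int off = i * BYTES_PER_CHECKSUM;
            crc.update(block, off, Math.min(BYTES_PER_CHECKSUM, block.length - off));
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // A reader recomputes the chunk checksums and compares with the stored ones.
    static boolean verify(byte[] block, long[] stored) {
        long[] fresh = chunkChecksums(block);
        if (fresh.length != stored.length) return false;
        for (int i = 0; i < fresh.length; i++)
            if (fresh[i] != stored[i]) return false;
        return true;
    }

    public static void main(String[] args) {
        byte[] block = new byte[1300]; // spans three chunks: 512 + 512 + 276
        for (int i = 0; i < block.length; i++) block[i] = (byte) i;
        long[] stored = chunkChecksums(block);
        System.out.println(verify(block, stored)); // true: intact
        block[700] ^= 1;                           // flip one bit
        System.out.println(verify(block, stored)); // false: corruption detected
    }
}
```

The real datanode persists these per-chunk values in the block's .meta file and recomputes them when the block is read.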
The checksum of a HDFS block is stored in a local file, along with the raw content of the block, both on each of the dedicated datanodes (replica).
I am wondering: is the checksum of a block stored also within the namenode, as part of the metadata of a file?
No. The checksum is stored only along with the blocks on the slave nodes (sometimes also called DataNodes).
From the Apache Documentation for HDFS
Data Integrity
It is possible that a block of data fetched from a DataNode arrives
corrupted. This corruption can occur because of faults in a storage
device, network faults, or buggy software.
It works in the following manner.
The HDFS client software implements checksum checker. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace.
When a client retrieves file contents, it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file.
If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.
If the checksum of the block on another DataNode matches the checksum in the hidden file, the system will serve that data block.
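That fallback can be sketched roughly as follows; the Replica interface here is an invented stand-in for a datanode connection (the real logic lives inside the HDFS client), and the whole-block CRC is a simplification of the per-chunk scheme:

```java
import java.io.IOException;
import java.util.List;
import java.util.zip.CRC32C;

// Invented stand-in for "a datanode holding one replica of a block".
interface Replica {
    byte[] readBlock() throws IOException;
}

// Illustrative sketch of the client-side fallback: try each replica in
// turn and accept the first one whose checksum matches the value recorded
// at write time.
public class ChecksumFallback {
    static long crc32c(byte[] data) {
        CRC32C crc = new CRC32C();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    static byte[] readVerified(List<Replica> replicas, long expectedCrc) throws IOException {
        for (Replica r : replicas) {
            try {
                byte[] data = r.readBlock();
                if (crc32c(data) == expectedCrc) {
                    return data; // checksum matches: serve this copy
                }
                // corrupt copy: fall through and try the next replica
            } catch (IOException e) {
                // unreachable datanode: try the next replica
            }
        }
        throw new IOException("all replicas corrupt or unreachable");
    }

    public static void main(String[] args) throws IOException {
        byte[] good = "hello hdfs".getBytes();
        byte[] bad = "hellx hdfs".getBytes();
        long expected = crc32c(good);
        // First replica is corrupt, second one is fine.
        byte[] result = readVerified(
                List.of((Replica) () -> bad, (Replica) () -> good), expected);
        System.out.println(new String(result)); // prints "hello hdfs"
    }
}
```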
The Short Answer: Checksums are stored on datanodes
Explanation:
HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every dfs.bytes-per-checksum bytes of data. The default is 512 bytes, and because a CRC-32C checksum is 4 bytes long, the storage overhead is less than 1%.
Datanodes are responsible for verifying the data they receive before storing the data and its checksum. This applies to data that they receive from clients and from other datanodes during replication.
A client writing data sends it to a pipeline of datanodes and the last datanode in the pipeline verifies the checksum.
If the datanode detects an error, the client receives a subclass of IOException, which it should handle in an application-specific manner (for example, by retrying the operation).
When clients read data from datanodes, they verify checksums as well, comparing them with the ones stored at the datanodes. Each datanode keeps a persistent log of checksum verifications, so it knows the last time each of its blocks was verified.
When a client successfully verifies a block, it tells the datanode, which updates its log. Keeping statistics such as these is valuable in detecting bad disks.
In addition to block verification on client reads, each datanode runs a DataBlockScanner in a background thread that periodically verifies all the blocks stored on the datanode. This is to guard against corruption due to “bit rot” in the physical storage media.
See "Hadoop: The Definitive Guide", 4th edition, page 98.
I used the 'hdfs oiv' command to read the fsimage into an XML file.
hdfs oiv -p XML -i /../dfs/nn/current/fsimage_0000000003132155181 -o fsimage.out
Based on my understanding, the fsimage is supposed to store the "blockmap": how the files got broken into blocks, and where each block is stored. However, here is what an inode record looks like in the output file.
<inode>
<id>37749299</id>
<type>FILE</type>
<name>a4467282506298f8-e21f864f16b2e7c1_468511729_data.0.</name>
<replication>3</replication>
<mtime>1442259468957</mtime>
<atime>1454539092207</atime>
<preferredBlockSize>134217728</preferredBlockSize>
<permission>impala:hive:rw-r--r--</permission>
<blocks>
<block>
<id>1108336288</id>
<genstamp>35940487</genstamp>
<numBytes>16187048</numBytes>
</block>
</blocks>
</inode>
However, I was expecting something like the HDFS path to a file, how that file was broken down into smaller pieces, and where each piece is stored (which machine, which local FS path, etc.).
Is there a mapping anywhere on the name server containing:
the HDFS path to inode mapping
the blockid to local file system path / disk location mapping?
A bit late, but since I am looking into this now and stumbled across your question.
First of all, a bit of context.
(I am working with Hadoop 2.6)
The name node is responsible for maintaining the INodes, the in-memory representation of the (virtual) filesystem structure, while the blocks are maintained by the data nodes. I believe there are several reasons for the name node not to keep the rest of the information, such as links to the data nodes where the data is stored, within each INode:
It would require more memory to represent all that information (memory is the resource that actually limits the number of files that can be written into an HDFS cluster, since the whole structure is maintained in RAM for faster access)
It would induce more workload on the name node, for example when a block is moved from one node to another, or when a new node is installed and data needs to be replicated to it. Each time that happens, the name node would need to update its state.
Flexibility: since the INode is an abstraction, adding such links would bind it to a particular technology and communication protocol.
Now coming back to your questions:
The fsimage file already contains the mapping to the HDFS path. If you look more carefully at the XML, each INode, regardless of its type, has an ID (in your case it is 37749299). If you look further in the file, you can find the section <INodeDirectorySection>, which holds the mapping between parents and children, and it is this ID field that is used to determine the relation. Through the <name> attribute you can easily determine the structure you see, for example, in the HDFS explorer.
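To make this concrete, here is a minimal sketch that reconstructs full paths from such an XML dump by joining <INodeSection> names through the <INodeDirectorySection> parent/child links. The exact element layout varies across Hadoop versions, so the XML shape assumed below is illustrative:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Sketch: rebuild HDFS paths from an 'hdfs oiv -p XML' dump, assuming
// <inode> entries with <id>/<name> and <directory> entries with
// <parent>/<child> links (layout varies between Hadoop versions).
public class FsImagePaths {
    static Map<String, String> inodePaths(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> names = new HashMap<>();   // inode id -> name
        Map<String, String> parents = new HashMap<>(); // child id -> parent id
        NodeList inodes = doc.getElementsByTagName("inode");
        for (int i = 0; i < inodes.getLength(); i++) {
            Element e = (Element) inodes.item(i);
            String name = text(e, "name");
            names.put(text(e, "id"), name == null ? "" : name); // root has no name
        }
        NodeList dirs = doc.getElementsByTagName("directory");
        for (int i = 0; i < dirs.getLength(); i++) {
            Element d = (Element) dirs.item(i);
            String parent = text(d, "parent");
            NodeList children = d.getElementsByTagName("child");
            for (int j = 0; j < children.getLength(); j++) {
                parents.put(children.item(j).getTextContent(), parent);
            }
        }
        Map<String, String> paths = new HashMap<>();
        for (String id : names.keySet()) {
            StringBuilder path = new StringBuilder();
            for (String cur = id; cur != null; cur = parents.get(cur)) {
                path.insert(0, "/" + names.get(cur));
            }
            paths.put(id, path.toString().replaceAll("^/+", "/"));
        }
        return paths;
    }

    private static String text(Element e, String tag) {
        NodeList nl = e.getElementsByTagName(tag);
        return nl.getLength() == 0 ? null : nl.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<fsimage><INodeSection>"
                + "<inode><id>1</id><type>DIRECTORY</type><name></name></inode>"
                + "<inode><id>2</id><type>DIRECTORY</type><name>user</name></inode>"
                + "<inode><id>3</id><type>FILE</type><name>data.txt</name></inode>"
                + "</INodeSection><INodeDirectorySection>"
                + "<directory><parent>1</parent><child>2</child></directory>"
                + "<directory><parent>2</parent><child>3</child></directory>"
                + "</INodeDirectorySection></fsimage>";
        System.out.println(inodePaths(xml).get("3")); // prints "/user/data.txt"
    }
}
```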
Furthermore, you have the <blocks> section, which has the block ID (in your case it is 1108336288). If you look carefully into the Hadoop sources, you can find the method idToBlockDir in DatanodeUtil, which gives you a hint about how the files are organized on disk and how the block-id mapping is performed.
Basically, the original id is shifted twice (by 16 and by 8 bits):
int d1 = (int)((blockId >> 16) & 0xff);
int d2 = (int)((blockId >> 8) & 0xff);
And the final directory is built using obtained values:
String path = DataStorage.BLOCK_SUBDIR_PREFIX + d1 + SEP + DataStorage.BLOCK_SUBDIR_PREFIX + d2;
The block itself is stored in a file whose name uses the blk_<block_id> format.
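Putting those two lines together, here is a self-contained sketch of the computation, using the block id 1108336288 from the fsimage output above (BLOCK_SUBDIR_PREFIX is the string "subdir"):

```java
// Sketch of how a datanode derives the on-disk subdirectory for a block,
// mirroring DatanodeUtil.idToBlockDir (Hadoop 2.6-era layout).
public class BlockDir {
    static String idToBlockDir(long blockId) {
        int d1 = (int) ((blockId >> 16) & 0xff);
        int d2 = (int) ((blockId >> 8) & 0xff);
        return "subdir" + d1 + "/" + "subdir" + d2; // BLOCK_SUBDIR_PREFIX = "subdir"
    }

    public static void main(String[] args) {
        long blockId = 1108336288L; // the block id from the fsimage output above
        // The file blk_1108336288 would live under this path, relative to
        // the datanode's finalized block directory.
        System.out.println(idToBlockDir(blockId)); // prints "subdir15/subdir222"
    }
}
```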
I am not a Hadoop expert, so if someone who understands this better could correct any flaws in my logic, please do so. Hope this helps.
Consider I have a single File which is 300MB. The block size is 128MB.
So the input file is divided into the following chunks and placed in HDFS.
Block1: 128MB
Block2: 128MB
Block3: 44MB.
Now, does each block's data contain byte-offset information?
That is, do the blocks have the following offset information?
Block1: 0-128MB of the file
Block2: 128-256MB of the file
Block3: 256-300MB of the file
If so, how can I get the byte-offset information for Block2 (that is, that it starts at 128MB) in Hadoop?
This is for understanding purposes only. Any hadoop command-line tools to get this kind of meta data about the blocks?
EDIT
If the byte-offset info is not present, a mapper performing its map job on a block will start consuming lines from the beginning. If the offset information is present, the mapper will skip till it finds the next EOL and then starts processing the records.
So I guess byte offset information is present inside the blocks.
Disclaimer: I might be wrong on this one; I have not read that much of the HDFS source code.
Basically, datanodes manage blocks, which are just large blobs to them. They know the block id, but that's it. The namenode knows everything, especially the mapping between a file path and all the block ids of that file, and where each block is stored. Each block can be stored in one or more locations, depending on its replication settings.
I don't think you will find a public API to get the information you want from a block id, because HDFS does not need to do the mapping in this direction. Conversely, you can easily get the blocks and their locations for a file. You can try exploring the source code, especially the blockmanager package.
If you want to learn more, this article about the HDFS architecture could be a good start.
You can run hdfs fsck /path/to/file -files -blocks to get the list of blocks.
A Block does not contain offset info, only a length. But you can use LocatedBlocks to get all the blocks of a file, and from these you can easily reconstruct the offset at which each block starts.
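For example, given the block lengths reported for a file, the start offsets fall out of a running sum. Note that the last block of the 300MB example above would actually be 300 - 256 = 44MB:

```java
import java.util.Arrays;

// Sketch: reconstruct each block's start offset from the block lengths,
// since HDFS blocks themselves store only a length, not an offset.
public class BlockOffsets {
    static long[] startOffsets(long[] blockLengths) {
        long[] offsets = new long[blockLengths.length];
        long running = 0;
        for (int i = 0; i < blockLengths.length; i++) {
            offsets[i] = running;       // block i starts where the previous ended
            running += blockLengths[i];
        }
        return offsets;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 300MB file with a 128MB block size: 128MB + 128MB + 44MB.
        long[] lengths = { 128 * mb, 128 * mb, 44 * mb };
        System.out.println(Arrays.toString(startOffsets(lengths)));
        // Block2 starts at byte offset 128MB = 134217728.
    }
}
```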
Introduction
Follow-up question to this question.
A File has been provided to HDFS and has been subsequently replicated to three DataNodes.
If the same file is going to be provided again, HDFS indicates that the file already exists.
Based on this answer, a file will be split into blocks of 64MB (depending on the configuration settings). A mapping of the filename to the blocks will be created in the NameNode. The NameNode knows in which DataNodes the blocks of a certain file reside. If the same file is provided again, the NameNode knows that blocks of this file exist on HDFS and will indicate that the file already exists.
If the content of a file is changed and the file is provided again, does the NameNode update the existing file, or is the check restricted to the mapping of filename to blocks, and in particular the filename? Which process is responsible for this?
Which process is responsible for splitting a file into blocks?
Example Write path:
According to this documentation the Write Path of HBase is as follows:
Possible Write Path HDFS:
file provided to HDFS, e.g. hadoop fs -copyFromLocal ubuntu-14.04-desktop-amd64.iso /
FileName checked against the FSImage to see whether it already exists; if so, the message "file already exists" is displayed
file split into blocks of 64MB (depending on the configuration setting). Question: what is the name of the process responsible for block splitting?
blocks replicated on DataNodes (the replication factor can be configured)
mapping of FileName to blocks (MetaData) stored in the EditLog located on the NameNode
Question
What does the HDFS Write Path look like?
If the content of a file is changed and the file is provided again, does the NameNode update the existing file, or is the check restricted to the mapping of filename to blocks, and in particular the filename?
No, it does not update the file. The name node only checks if the path (file name) already exists.
What does the HDFS Write Path look like?
This is explained in detail in this paper: "The Hadoop Distributed File System" by Shvachko et al. In particular, read Section 2.C (and check Figure 1):
"When a client writes, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file. The client organizes a pipeline from node-to-node and sends the data. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block. A new pipeline is organized, and the client sends the further bytes of the file. Choice of DataNodes for each block is likely to be different. The interactions among the client, the NameNode and the DataNodes are illustrated in Fig. 1."
NOTE: A book chapter based on this paper is available online too. And a direct link to the corresponding figure (Fig. 1 on the paper and 8.1 on the book) is here.
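As a toy model of that description (all class and method names below are invented stand-ins, not the real RPC interfaces), note in particular that the file-into-blocks split happens on the client side:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the write path from the paper: for each block, ask the
// "namenode" to choose datanodes, then push the block through them as a
// pipeline. All names here are invented stand-ins for illustration.
public class WritePathSketch {
    static final long BLOCK_SIZE = 128; // toy block size, in bytes

    // Pretend namenode: choose `replication` datanodes for the next block.
    // Real placement is rack-aware; this just round-robins a 5-node cluster.
    static List<String> chooseDatanodes(int blockIndex, int replication) {
        List<String> nodes = new ArrayList<>();
        for (int r = 0; r < replication; r++) {
            nodes.add("datanode-" + ((blockIndex * replication + r) % 5));
        }
        return nodes;
    }

    // The CLIENT splits the byte stream into blocks (in real HDFS this
    // happens inside the client's output stream) and records which
    // pipeline of datanodes served each block.
    static List<String> write(byte[] file, int replication) {
        List<String> log = new ArrayList<>();
        int blocks = (int) ((file.length + BLOCK_SIZE - 1) / BLOCK_SIZE);
        for (int b = 0; b < blocks; b++) {
            List<String> pipeline = chooseDatanodes(b, replication);
            long len = Math.min(BLOCK_SIZE, file.length - b * BLOCK_SIZE);
            log.add("block " + b + " (" + len + " bytes) -> " + pipeline);
        }
        return log;
    }

    public static void main(String[] args) {
        // A 300-byte "file" with 128-byte blocks: 128 + 128 + 44.
        for (String line : write(new byte[300], 3)) {
            System.out.println(line);
        }
    }
}
```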
In HDFS, the blocks are distributed among the active nodes/slaves. The content of the blocks is simple text, so is there any way to see, read, or access the blocks present on each data node?
As an entire file or to read a single block (say block number 3) out of sequence?
You can read the file via various mechanisms including the Java API but you cannot start reading in the middle of the file (for example at the start of block 3).
Hadoop reads a block of data and feeds each line to the mapper for further processing. Also, the Hadoop client gets the blocks related to a file from different DataNodes before concatenating them. So, it should be possible to get the data from a particular block.
The Hadoop client might be a good place to start looking at the code. But HDFS provides a file system abstraction, so I am not sure what the requirement would be for reading the data from a particular block.
Assuming you have ssh access (and appropriate permissions) to the datanodes, you can cd to the path where the blocks are stored and read the blocks stored on that node (e.g., do a cat BLOCK_XXXX). The configuration parameter that tells you where the blocks are stored is dfs.datanode.data.dir, which defaults to file://${hadoop.tmp.dir}/dfs/data. More details here.
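If you would rather do the same thing programmatically from the datanode host, here is a small sketch that walks the data directory looking for blk_ files (the directory used in main is only an example; substitute whatever your dfs.datanode.data.dir actually points to):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch: walk a datanode's local data directory and list block files.
// Block files are named blk_<id>; each usually has a sibling
// blk_<id>_<genstamp>.meta checksum file, which we filter out here.
public class ListBlocks {
    static List<Path> blockFiles(Path dataDir) throws IOException {
        try (Stream<Path> walk = Files.walk(dataDir)) {
            return walk.filter(Files::isRegularFile)
                       .filter(p -> p.getFileName().toString().startsWith("blk_"))
                       .filter(p -> !p.getFileName().toString().endsWith(".meta"))
                       .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Example path only: the default is file://${hadoop.tmp.dir}/dfs/data.
        Path dataDir = Path.of("/tmp/hadoop/dfs/data");
        if (Files.exists(dataDir)) {
            for (Path p : blockFiles(dataDir)) {
                System.out.println(p);
            }
        }
    }
}
```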
Caveat: the block names are coded by HDFS depending on their internal block ID. Just by looking at their names, you cannot know to which file a block belongs.
Finally, I assume you want to do this for debugging purposes or just to satisfy your curiosity. Normally, there is no reason to do this and you should just use the HDFS web-UI or command-line tools to look at the contents of your files.