The checksum of a HDFS block is stored in a local file, along with the raw content of the block, both on each of the dedicated datanodes (replica).
I am wondering: is the checksum of a block stored also within the namenode, as part of the metadata of a file?
No. The checksum is stored only along with the blocks on the slave nodes[sometimes also called as Data Nodes].
From the Apache Documentation for HDFS
Data Integrity
It is possible that a block of data fetched from a DataNode arrives
corrupted. This corruption can occur because of faults in a storage
device, network faults, or buggy software.
It works in the following manner.
The HDFS client software implements checksum checker. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace.
When a client retrieves file contents, it verifies that the data it received from each DataNode matches the checksum stored in the associated checksum file.
If not, then the client can opt to retrieve that block from another DataNode that has a replica of that block.
If the checksum of another Data node block matches with the checksum of the hidden file, the system will serve these data blocks.
The Short Answer: Checksums are stored on datanodes
Explanation:
HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every dfs.bytes-perchecksum bytes of data. The default is 512 bytes, and because a CRC-32C checksum is 4 bytes long, the storage overhead is less than 1%.
Datanodes are responsible for verifying the data they receive before storing the data and its checksum. This applies to data that they receive from clients and from other datanodes during replication.
A client writing data sends it to a pipeline of datanodes and the last datanode in the pipeline verifies the checksum.
If the datanode detects an error, the client receives a subclass of IOException, which it should handle in an application-specific manner (for example, by retrying the operation).
When clients read data from datanodes, they verify checksums as well, comparing them with the ones stored at the datanodes. Each datanode keeps a persistent log of checksum verifications, so it knows the last time each of its blocks was verified.
When a client successfully verifies a block, it tells the datanode, which updates its log. Keeping statistics such as these is valuable in detecting bad disks.
In addition to block verification on client reads, each datanode runs a DataBlockScanner in a background thread that periodically verifies all the blocks stored on the datanode. This is to guard against corruption due to “bit rot” in the physical storage media.
see "hadoop the definitive guide 4th edition page 98"
Related
For example:
reduce results: part-00000, part-00001 ... part-00008,
the cluster has 3 datanodes and I want to
put the part-00000, part-00001 and part-00002 to the slave0
put the part-00003, part-00004 and part-00005 to the slave1
put the part-00006, part-00007 and part-00008 to the slave2
How can I do that?
It doesn't work like that. A file in HDFS is not stored in any specific datanode. Each file is composed of blocks and each block is replicated to multiple nodes (3 by default). So each file is actually stored in differed nodes, since the blocks that compose it are stored in different nodes.
Quoting the official documentation, which I advise you to read:
HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
Seeing the partition tag in your question, it may be worth stating that the Partitioner defines in which partition (not datanode), each key will end up. For example, knowing that you have 9 reduce tasks (9 partitions), you may wish to evenly split the workload of each such task. In order to do that, you can define that, e.g., the keys starting with the letter "s" should be sent to partition 0 and the keys starting with the letter "a" or "b" to partition 1, etc. (just a stupid example to illustrate what a partitioner does).
While reading about the metadata that is stored on datanodes in HDFS.I came through these options but not sure whether all are correct or some or correct.
It stores a file with the checksum of the blocks that it stored.
It stores the version of the hadoop used for creating the blocks and
the namespaceid.
It stores information about the other blocks in the same namespace.
What is the correct answer.?
As per definitive guide:
HDFS blocks are stored in files with a blk_ prefix; they consist of the raw bytes of a portion of the file being stored. Each block has an associated metadata file with a .meta suffix. It is made up of a header with version and type information, followed by a series of checksums for sections of the block.
Too late to give an answer. But will be useful for some one.
Option 1 is correct.
It stores a file with the checksum of the blocks that it stored.
The .meta file in datanode will contain the checksum information for taht block which would be cross-checked when a client reads that block from datanode, if the checksum is not matched it throws an error.
Consider I have a single File which is 300MB. The block size is 128MB.
So the input file is divided into the following chunks and placed in HDFS.
Block1: 128MB
Block2: 128MB
Block3: 64MB.
Now Does each block's data has byte offset information contained in it.
That is, do the blocks have the following offset information?
Block1: 0-128MB of File
Block2 129-256MB of File
Block3: 257MB-64MB of file
If so, how can I get the byte-offset information for Block2 (That is it starts at 129MB) in Hadoop.
This is for understanding purposes only. Any hadoop command-line tools to get this kind of meta data about the blocks?
EDIT
If the byte-offset info is not present, a mapper performing its map job on a block will start consuming lines from the beginning. If the offset information is present, the mapper will skip till it finds the next EOL and then starts processing the records.
So I guess byte offset information is present inside the blocks.
Disclaimer: I might be wrong on this one I have not read that much of the HDFS source code.
Basically, datanodes manage blocks which are just large blobs to them. They know the block id but that its. The namenode knows everything, especially the mapping between a file path and all the block ids of this file and where each block is stored. Each block id can be stored in one or more locations depending of its replication settings.
I don't think you will find public API to get the information you want from a block id because HDFS does not need to do the mapping this way. On the opposite you can easily know the blocks and their locations of a file. You can try explore the source code, especially the blockmanager package.
If you want to learn more, this article about the HDFS architecture could be a good start.
You can run hdfs fsck /path/to/file -files -blocks to get the list of blocks.
A Block does not contain offset info, only length. But you can use LocatedBlocks to get all blocks of a file and from this you can easily reconstruct each block what offset it starts at.
Introduction
Follow-up question to this question.
A File has been provided to HDFS and has been subsequently replicated to three DataNodes.
If the same file is going to be provided again, HDFS indicates that the file already exists.
Based on this answer a file will be split into blocks of 64MB (depending on the configuration settings). A mapping of the filename and the blocks will be created in the NameNode. The NameNode knows in which DataNodes the blocks of a certain file reside. If the same file is provided again the NameNode knows that blocks of this file exists on HDFS and will indicate that the file already exits.
If the content of a file is changed and provided again does the NameNode update the existing file or is the check restricted to mapping of filename to blocks and in particular the filename? Which process is responsible for this?
Which process is responsible for splitting a file into blocks?
Example Write path:
According to this documentation the Write Path of HBase is as follows:
Possible Write Path HDFS:
file provided to HDFS e.g. hadoop fs -copyFromLocal ubuntu-14.04-desktop-amd64.iso /
FileName checked in FSImage whether it already exists. If this is the case the message file already exists is displayed
file split into blocks of 64MB (depending on configuration
setting). Question: Name of the process which is responsible for block splitting?
blocks replicated on DataNodes (replication factor can be
configured)
Mapping of FileName to blocks (MetaData) stored in EditLog located in NameNode
Question
How does the HDFS' Write Path look like?
If the content of a file is changed and provided again does the NameNode update the existing file or is the check restricted to mapping of filename to blocks and in particular the filename?
No, it does not update the file. The name node only checks if the path (file name) already exists.
How does the HDFS' Write Path look like?
This is explained in detail in this paper: "The Hadoop Distributed File System" by Shvachko et al. In particular, read Section 2.C (and check Figure 1):
"When a client writes, it first asks the NameNode to choose DataNodes to host replicas of the first block of the file. The client organizes a pipeline from node-to-node and sends the data. When the first block is filled, the client requests new DataNodes to be chosen to host replicas of the next block. A new pipeline is organized, and the client sends the further bytes of the file. Choice of DataNodes for each block is likely to be different. The interactions among the client, the NameNode and the DataNodes are illustrated in Fig. 1."
NOTE: A book chapter based on this paper is available online too. And a direct link to the corresponding figure (Fig. 1 on the paper and 8.1 on the book) is here.
I am running some experiments to benchmark the time it takes (by map-reduce) to read and process data stored on HDFS with varying parameters. I use pig script to launch map-reduce jobs. Since I am working with the same set of files frequently, my results may get affected because of file/block caching.
I want to understand the various caching techniques employed in a map-reduce environment.
Lets say that a file foo (contains some data to be procesed) stored on HDFS occupies 1 HDFS block and it gets stored in machine STORE. During a map-reduce task, machine COMPUTE reads that block over network and processes it. Caching can happen at two levels:
Cached in memory of machine STORE (in-memory file cache)
Cached in memory/disk of machine COMPUTE.
I am pretty sure that #1 caching happens. I want to ensure whether something like #2 happens? From the post here, it looks like there is no client level caching going on since it is very unlikely that the block cached by COMPUTE will be needed again in the same machine before the cache is flushed.
Also, is the hadoop distributed cache used only to distribute any application specific files (not task specific input data files) to all task tracker nodes? Or is the task specific input file data (like the foo file block) cached in the distributed cache? I assume local.cache.size and related parameters only control the distributed cache.
Please clarify.
The only caching that is ever applied within HDFS is the OS caching to minimize disk accesses.
So if you access a block from a datanode, it is likely to be cached if nothing else is going on there.
On your client side, this depends on what you do with the block. If you directly write it to disk, it is also very likely that your client OS caches it.
The distributed cache is just for jars and files that need to be distributed across the cluster where your job launches tasks. The name is thus a bit misleading, as it "caches" nothing.