hadoop t-file and datablock relationship?

My understanding is that Hadoop takes a large file and saves it in chunks of "Datablocks". Are these data blocks stored in a T-file? Is the relationship between datablock and T-file 1-1?

HDFS stores large files as a series of data blocks (typically of a fixed size such as 64/128/256/512 MB). Say you have a 1 GB file and a block size of 256 MB: HDFS will represent this file as 4 blocks, and the NameNode will track which DataNodes hold copies (or replicas) of those blocks.
T-Files are a file format containing key/value pairs. Hadoop stores a T-File using one or more data blocks in HDFS, depending on the size of the T-File and the defined block size (either the system default or a file-specific setting), so the relationship is not necessarily 1-1.
In summary, you can store any file format in HDFS; it will just be chunked up into fixed-size blocks, distributed, and replicated throughout the cluster.
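As an aside (not from the original answer), this file-to-block mapping is visible through the standard Hadoop FileSystem API. A minimal sketch, assuming a hypothetical file at /data/large-file.tfile:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        Path path = new Path("/data/large-file.tfile"); // hypothetical path
        FileStatus status = fs.getFileStatus(path);

        // One BlockLocation per block; each lists the DataNodes holding a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        System.out.printf("File %s: %d bytes, block size %d, %d block(s)%n",
                path, status.getLen(), status.getBlockSize(), blocks.length);
        for (BlockLocation block : blocks) {
            System.out.printf("  offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```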

Related

Can a HDFS Block of 128 MB store two different ORC files of size 1MB each?

I'm working on the storage aspect of Hadoop and exploring how ORC files get stored in HDFS blocks.
In HDFS, a file is composed of blocks. One block cannot hold multiple files.
Two ORC files of 1 MB each will need a block per file.
If you are concerned about the actual disk storage they might consume, it will be only 2 MB. Though the blocks are 128 MB, the disk storage is determined by the size of the actual file/block.
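If you want to check the actual footprint on a real cluster, a hedged sketch using the standard getContentSummary() call (the directory /warehouse/orc is a hypothetical example):

```java
// Compares the logical length of the files with the raw space consumed on the
// DataNodes (length x replication); neither depends on the 128 MB block size.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OrcDiskUsage {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        ContentSummary summary = fs.getContentSummary(new Path("/warehouse/orc")); // hypothetical directory
        // For two 1 MB ORC files with replication 3: length ~2 MB, space consumed ~6 MB.
        System.out.println("Logical length (bytes):     " + summary.getLength());
        System.out.println("Raw space consumed (bytes): " + summary.getSpaceConsumed());
    }
}
```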

Why should I avoid storing lots of small files in Hadoop HDFS?

I have read that lots of small files stored in HDFS can be a problem, because lots of small files means lots of objects in the Hadoop NameNode's memory.
However, since each block is stored in the NameNode as an object, how is it different for a large file? Whether you store 1,000 blocks from a single file in memory or 1,000 blocks for 1,000 files, is the amount of NameNode memory used the same?
A similar question for map jobs: since they operate on blocks, how does it matter whether the blocks come from small files or from bigger ones?
At a high-level, you can think of a Hadoop NameNode as a tracker for where blocks composing 'files' stored in HDFS are located; blocks are used to break down large files into smaller pieces when stored in an HDFS cluster.
When you have lots of small files stored in HDFS, there are also lots of blocks, and the NameNode must keep track of all of those files and blocks in memory.
When you have larger files instead -- for example, if you combined all of those small files into bigger files first -- you would have fewer files stored in HDFS, and you would also have fewer blocks.
First let's discuss how file size, HDFS blocks, and NameNode memory relate:
This is easier to see with examples and numbers.
Our HDFS block size for this example is 100 MB.
Let's pretend we have a thousand (1,000) 1 MB files and we store them in HDFS. When storing these 1,000 1 MB files in HDFS, we would also have 1,000 blocks composing those files in our HDFS cluster.
Each block stored in HDFS requires about 150 bytes of NameNode memory, which is about 150 KB of memory for those 1,000 blocks representing 1,000 1 MB files.
Now, consider that we consolidate or concatenate those 1,000 1 MB files into a single 1,000 MB file and store that single file in HDFS. When storing the 1,000 MB file in HDFS, it would be broken down into blocks based on our HDFS cluster block size; in this example our block size was 100 MB, which means our 1,000 MB file would be stored as ten (10) 100 MB blocks in the HDFS cluster.
Each block stored in HDFS requires about 150 bytes of NameNode memory, which is about 1.5 KB of memory for those 10 blocks representing the single 1,000 MB file.
With the larger file, we have the same data stored in the HDFS cluster, but use 1% of the NameNode memory compared to the situation with many small files.
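As a quick sanity check of those numbers (not part of the original answer; the ~150 bytes per block object is a commonly quoted rule of thumb, not an exact figure), the arithmetic can be written out directly:

```java
// Back-of-the-envelope NameNode memory estimate for the two scenarios above.
public class NameNodeMemoryEstimate {
    public static void main(String[] args) {
        long blockSize = 100L * 1024 * 1024; // 100 MB block size from the example
        long bytesPerBlockObject = 150;      // rough rule of thumb

        // Case 1: 1,000 files of 1 MB each -> 1,000 blocks
        long smallFileBlocks = 1_000;
        System.out.println("1,000 small files: "
                + smallFileBlocks * bytesPerBlockObject + " bytes (~150 KB)");

        // Case 2: one 1,000 MB file -> ceil(1,000 MB / 100 MB) = 10 blocks
        long bigFileSize = 1_000L * 1024 * 1024;
        long bigFileBlocks = (bigFileSize + blockSize - 1) / blockSize;
        System.out.println("One 1,000 MB file: "
                + bigFileBlocks * bytesPerBlockObject + " bytes (~1.5 KB)");
    }
}
```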
Input blocks and the number of Map tasks for a job are related.
When it comes to map tasks, you will generally have one map task per input block. The size of the input blocks matters because there is overhead in starting and finishing each task: when map tasks finish too quickly, that overhead becomes a greater portion of each task's completion time, and the overall job can complete more slowly than the same job run with fewer, bigger input blocks. For a MapReduce2-based job, each map task also involves starting and stopping a YARN container at the resource management layer, which adds further overhead. (Note that you can also instruct MapReduce jobs to use a minimum input split size when dealing with many small input blocks, to address some of these inefficiencies; see the sketch below.)
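The minimum split size mentioned above is an ordinary job setting; a hedged sketch (the job name is a made-up example, and CombineTextInputFormat is another common option for packing many small files into fewer map tasks):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "example-job"); // hypothetical job name

        // Ask the input format not to create splits smaller than 128 MB, so that
        // small blocks can be coalesced into larger splits where the format allows it.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);

        // Equivalent configuration property:
        // mapreduce.input.fileinputformat.split.minsize=134217728
    }
}
```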

What should be the size of the file in HDFS for best MapReduce job performance

I want to copy text files from external sources to HDFS. Let's assume that I can combine and split the files based on their size; what should the size of the text files be for the best custom MapReduce job performance? Does size matter?
HDFS is designed to support very large files, not small files. Applications that are compatible with HDFS are those that deal with large data sets.
These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds.
HDFS supports write-once-read-many semantics on files. In the HDFS architecture there is a concept of blocks; a typical block size used by HDFS is 64 MB.
When you place a large file into HDFS, it is chopped up into 64 MB chunks (based on the default block configuration). Suppose you have a 1 GB file and you want to place that file in HDFS: there will be 1 GB / 64 MB = 16 splits/blocks, and these blocks will be distributed across the DataNodes.
The goal of splitting the file is parallel processing and failover of the data. These blocks/chunks will reside on different DataNodes based on your cluster configuration.
How mappers get assigned
Number of mappers is determined by the number of splits of your data in the MapReduce job.
In a typical InputFormat, it is directly proportional to the number of files and file sizes.
Suppose your HDFS block size is configured as 64 MB (the default) and you have a file of 100 MB: then there will be 2 splits, the file will occupy 2 blocks, and 2 mappers will be assigned based on those blocks. But if you have 2 files of 30 MB each, then each file will occupy one block and a mapper will be assigned per file.
So you don't need to split the large file, but if you are dealing with very small files it is worth combining them; the sketch below shows how the split arithmetic works out.
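A small sketch of that split arithmetic (not from the original answer; the max(minSize, min(maxSize, blockSize)) formula mirrors how the default FileInputFormat split size is computed, and the numbers are the ones used above):

```java
// Computes the number of input splits (and hence mappers) for the examples above.
public class SplitCountExample {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    static long splitsForFile(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB default block size from the text
        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);

        // One 100 MB file -> 2 splits -> 2 mappers
        System.out.println("100 MB file: " + splitsForFile(100L * 1024 * 1024, splitSize) + " split(s)");

        // Two 30 MB files -> 1 split each -> 2 mappers in total
        System.out.println("30 MB file:  " + splitsForFile(30L * 1024 * 1024, splitSize) + " split(s)");
    }
}
```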
This link will be helpful for understanding the problem with small files.
Please refer to the link below for more detail about the HDFS design:
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

Why does mapreduce split a compressed file into input splits?

So, from my understanding, when HDFS stores a bzip2-compressed 1 GB file with a block size of 64 MB, the file will be stored as 16 different blocks. If I want to run a MapReduce job on this compressed file, MapReduce tries to split the file again. Why doesn't MapReduce automatically use the 16 blocks in HDFS instead of splitting the file again?
I think I see where your confusion is coming from. I'll attempt to clear it up.
HDFS slices your file up into blocks. These are physical partitions of the file.
MapReduce creates logical splits on top of these blocks. These splits are defined based on a number of parameters, with block boundaries and locations being a huge factor. You can set your minimum split size to 128MB, in which case each split will likely be exactly two 64MB blocks.
All of this is unrelated to your bzip2 compression. If you had used gzip compression, each split would be an entire file, because gzip is not a splittable compression format.
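You can check how Hadoop classifies a codec yourself; a hedged sketch using CompressionCodecFactory (the input paths are hypothetical), where BZip2Codec implements SplittableCompressionCodec and GzipCodec does not:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittableCheck {
    public static void main(String[] args) {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        for (String name : new String[] {"/data/input.bz2", "/data/input.gz"}) { // hypothetical paths
            CompressionCodec codec = factory.getCodec(new Path(name)); // resolved by file extension
            boolean splittable = codec instanceof SplittableCompressionCodec;
            System.out.println(name + " -> codec=" + codec.getClass().getSimpleName()
                    + ", splittable=" + splittable);
        }
    }
}
```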

How a small file is stored in HDFS

In Hadoop: The Definitive Guide:
a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.
What does this mean?
Does it use 1 MB within a block of 128 MB, or is 1 MB used and the remaining 127 MB free to be occupied by some other file?
This is often a misconception about HDFS - the block size is more about how a single file is split up / partitioned, not about some reserved part of the file system.
Behind the scenes, each block is stored on the DataNode's underlying file system as a plain file (along with an associated checksum). If you look into the DataNode folder on your disks you should be able to find the file (if you know the file's block ID and DataNode allocations, which you can discover from the NameNode Web UI).
So, back to your question: a 1 MB file with a block size of 16 MB / 32 MB / 128 MB / 512 MB / 1 GB / 2 GB (you get the idea) will still only be a 1 MB file on the DataNode's disk. The difference between the block size and the amount of data stored in that block is then free for the underlying file system to use as it sees fit (by HDFS or anything else).
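A minimal sketch of checking this through the FileSystem API (the path is a hypothetical example): the block size reported for the file is just the partitioning unit, while getLen() is the actual data stored, which is what occupies space on the DataNode's disk.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallFileFootprint {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/tmp/one-megabyte-file")); // hypothetical path

        System.out.println("Block size (partitioning unit):  " + status.getBlockSize()); // e.g. 134217728
        System.out.println("Actual file length (disk usage): " + status.getLen());       // e.g. ~1048576
    }
}
```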
The Hadoop block size is a Hadoop storage concept. Every time you store a file in Hadoop, it is divided into blocks of that size, and based on the replication factor and data locality it is distributed over the cluster.
For details, you can find my answer here.
Small files and HDFS blocks
