Assuming that the block size is 128MB, the cluster has 10GB (so ~80 available blocks). Suppose that I have created 10 small files which together take 128MB on disk (block files, checksums, replication...) and 10 HDFS blocks. If I want to add another small file to HDFS, then what does HDFS use, the used blocks or the actual disk usage, to calculate the available blocks?
80 blocks - 10 blocks = 70 available blocks or (10 GB - 128 MB)/128 MB = 79 available blocks?
Thanks.
Block size is just an indication to HDFS of how to split up and distribute the files across the cluster. There is no physically reserved number of blocks in HDFS (you can change the block size for each individual file if you wish).
For your example, you need to also take into consideration the replication factor and checksum files, but essentially adding lots of small files (less than the block size) does not mean that you have wasted 'available blocks' - they take up as much room as they need (again you need to remember that replication will increase the physical data footprint required to store the file) and the number of 'available blocks' will be closer to your second calculation.
A final note: having lots of small files means that your name node will require more memory to track them (block sizes, locations, etc.), and it's generally less efficient to process 128 x 1MB files than a single 128MB file (although that depends on how you're processing it).
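The second calculation from the question can be sketched in a few lines. This is only an illustration of the arithmetic, assuming binary units (1 GB = 1024 MB); the function name is made up for the example and is not an HDFS API.

```python
MB = 1024 * 1024
GB = 1024 * MB

def blocks_that_still_fit(capacity_bytes, used_bytes, block_size_bytes):
    """Number of full-size blocks that fit in the remaining raw space."""
    return (capacity_bytes - used_bytes) // block_size_bytes

# Figures from the question: 10 GB cluster, 128 MB block size,
# ten small files that occupy 128 MB on disk in total.
print(blocks_that_still_fit(10 * GB, 128 * MB, 128 * MB))  # 79
```

The result depends on bytes actually used, not on how many blocks those bytes are spread over.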
Related
I have read many posts saying that a Hadoop block size of 64 MB reduces metadata and improves performance compared with a 4 KB block size. But why is the data block size exactly 4 KB on an OS disk and 64 MB in Hadoop?
Why not 100 or some other bigger number?
But why is the data block size exactly 4 KB on an OS disk and 64 MB in Hadoop?
In HDFS we store huge amounts of data compared to a single OS filesystem, so it doesn't make sense to have small block sizes for HDFS. With small block sizes there will be more blocks, and the NameNode has to store more metadata about those blocks. Fetching the data will also be slow, as data from a higher number of blocks dispersed across many machines has to be fetched.
Why not 100 or some other bigger number?
Initially the HDFS block size was 64MB and now it's 128MB by default. Check the dfs.blocksize property in hdfs-site.xml here. This is because of the bigger and better storage capacities and speed (HDD and SSD). We shouldn't be surprised when later it's changed to 256MB.
Check this HDFS comic to get a quick overview about HDFS.
In addition to the existing answers, the following is also relevant:
Blocks on an OS level and blocks on an HDFS level are different concepts. When you have a 10 KB file on the OS, that essentially means 3 blocks of 4 KB get allocated, and the result is that you consume 12 KB.
Obviously you don't want to allocate a large fraction of your space to blocks that are not full, so you need a small blocksize.
On HDFS however, the content of the block determines the size of the block.
So if you have 129 MB, it could be stored in 1 block of 128 MB and 1 block of 1 MB. (I'm not sure if it will spread it out differently.)
As a result you don't 'lose' the 127 MB which is not allocated.
With this in mind you will want to have a comparatively large blocksize to optimize block management.
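The rounding behaviour described above can be illustrated with a small sketch. The helper names are hypothetical; the code only models "allocate whole blocks" versus "store the actual content", using the 4 KB and 128 MB figures from this answer.

```python
KB = 1024
MB = 1024 * KB

def allocated_whole_blocks(file_size, block_size):
    """Space used when a filesystem always allocates whole blocks."""
    blocks = -(-file_size // block_size)  # integer ceiling division
    return blocks * block_size

def wasted(file_size, block_size):
    """Allocated space that holds no file content."""
    return allocated_whole_blocks(file_size, block_size) - file_size

# OS level: a 10 KB file on 4 KB blocks consumes 12 KB, wasting 2 KB.
print(allocated_whole_blocks(10 * KB, 4 * KB) // KB)  # 12
print(wasted(10 * KB, 4 * KB) // KB)                  # 2

# If HDFS padded blocks the same way, 129 MB on 128 MB blocks would
# waste 127 MB; it does not, because the last block is only 1 MB long.
print(wasted(129 * MB, 128 * MB) // MB)               # 127
```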
I have read that lots of small files stored in HDFS can be a problem because lots of small files means lots of objects in Hadoop NameNode memory.
However, since each block is stored in the NameNode as an object, how is it different for a large file? Whether you store 1000 blocks from a single file in memory or 1000 blocks for 1000 files, is the amount of NameNode memory used the same?
I have a similar question for Map jobs. Since they operate on blocks, why does it matter whether the blocks come from small files or from bigger ones?
At a high-level, you can think of a Hadoop NameNode as a tracker for where blocks composing 'files' stored in HDFS are located; blocks are used to break down large files into smaller pieces when stored in an HDFS cluster.
When you have lots of small files stored in HDFS, there are also lots of blocks, and the NameNode must keep track of all of those files and blocks in memory.
When you have large files instead (for example, if you first combined all of those small files into bigger files), you would have fewer files stored in HDFS, and you would also have fewer blocks.
First let's discuss how file size, HDFS blocks, and NameNode memory relate:
This is easier to see with examples and numbers.
The HDFS block size for this example is 100 MB.
Let's pretend we have a thousand (1,000) 1 MB files and we store them in HDFS. When storing these 1,000 1 MB files in HDFS, we would also have 1,000 blocks composing those files in our HDFS cluster.
Each block stored in HDFS requires about 150 bytes of NameNode memory, which is about 150 KB of memory for those 1,000 blocks representing 1,000 1 MB files.
Now, consider that we consolidate or concatenate those 1,000 1 MB files into a single 1,000 MB file and store that single file in HDFS. When storing the 1,000 MB file in HDFS, it would be broken down into blocks based on our HDFS cluster block size; in this example our block size was 100 MB, which means our 1,000 MB file would be stored as ten (10) 100 MB blocks in the HDFS cluster.
Each block stored in HDFS requires about 150 bytes of NameNode memory, which is about 1.5 KB of memory for those 10 blocks representing the single 1,000 MB file.
With the larger file, we have the same data stored in the HDFS cluster, but use 1% of the NameNode memory compared to the situation with many small files.
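The comparison above can be reproduced with a short sketch. It counts block objects only, at the approximate 150 bytes per block quoted in this answer; the function names are invented for the illustration.

```python
BYTES_PER_BLOCK_OBJECT = 150  # approximate figure quoted in the answer

def block_count(file_size_mb, block_size_mb):
    """Blocks needed for one file (integer ceiling division)."""
    return -(-file_size_mb // block_size_mb)

def namenode_block_memory(file_sizes_mb, block_size_mb=100):
    """Approximate NameNode bytes needed to track the blocks of these files."""
    blocks = sum(block_count(size, block_size_mb) for size in file_sizes_mb)
    return blocks * BYTES_PER_BLOCK_OBJECT

small_files = namenode_block_memory([1] * 1000)  # 1,000 files of 1 MB
one_big_file = namenode_block_memory([1000])     # one file of 1,000 MB

print(small_files)                 # 150000
print(one_big_file)                # 1500
print(one_big_file / small_files)  # 0.01
```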
Input blocks and the number of Map tasks for a job are related.
When it comes to Map tasks, you will generally have one Map task per input block. The size of input blocks matters here because there is overhead from starting and finishing tasks; i.e. when Map tasks finish too quickly, this overhead becomes a greater portion of each task's completion time, and the overall job can complete more slowly than the same job with fewer, bigger input blocks. For a MapReduce2-based job, each Map task also involves starting and stopping a YARN container at the resource management layer, which adds overhead. (Note that you can also instruct MapReduce jobs to use a minimum input size threshold when dealing with many small input blocks to address some of these inefficiencies.)
My file has a size of 10 MB and I stored it in Hadoop, but the default block size in HDFS is 64 MB. Thus, my file uses 10 MB out of 64 MB. How will HDFS utilize the remaining 54 MB of free space in the same block?
Logically, if your files are smaller than the block size, then HDFS will reduce the block size for those particular files to the size of the file. So HDFS will only use 10 MB for storing a 10 MB file. It will not waste 54 MB or leave it blank.
Small files in HDFS are described in detail here: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
The remaining 54 MB would be used for some other file. This is how it works: assume you do a put or copyFromLocal of 2 small files, each 20 MB in size, and your block size is 64 MB. HDFS calculates the available space in the filesystem (not the available blocks) and reports it in terms of blocks; if you previously saved a 10 MB file in a 64 MB block, the remaining 54 MB is included in that available space. Since you have 2 files with a replication factor of 3, a total of 6 blocks would be allocated for your files, even though each file is smaller than the block size. If the cluster doesn't have room for 6 blocks (6*64 MB), then the put process would fail. Since the report is fetched in terms of space, not in terms of blocks, you would never run out of blocks. The only time files are measured in blocks is at block allocation time.
Read this blog for more information.
I have taken the quote below from Hadoop - The Definitive Guide:
Note, however, that small files do not take up any more disk space than is required to store the raw contents of the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB,
Here are my questions:
1) A 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB. How does HDFS use the remaining 127 MB in this block?
2) Is there any chance to store another file in the same block?
Take a 1 MB file stored with a 128 MB block size and a replication factor of 3. The file will be stored in 3 block replicas and use only 3*1 = 3 MB, not 3*128 = 384 MB. HDFS still shows the block size of each as 128 MB, but that is just an abstraction for the metadata stored in the namenode, not the actual space used.
There is no way to store more than one file in a single block. Each file will be stored in a separate block.
Reference:
https://stackoverflow.com/a/21274388/3496666
https://stackoverflow.com/a/15065274/3496666
https://stackoverflow.com/a/14109147/3496666
NameNode Memory Usage:
Every file, directory and block in HDFS is represented as an object, i.e. each entry in the namenode corresponds to an item in the namenode's memory, and each object/item occupies 150 to 200 bytes of namenode memory. Namenodes prefer fewer large files because of the metadata that needs to be stored.
Consider a 1 GB file with the default block size of 64 MB.
- Stored as a single 1 GB file
Name: 1 item
Blocks: 16
Total items = 16 * 3 (replication factor = 3) = 48 + 1 (filename) = 49
Total NameNode memory: 150 * 49 bytes
- Stored as 1000 individual 1 MB files
Names: 1000 items
Blocks: 1000
Total items = 1000 * 3 (replication factor = 3) = 3000 + 1000 (filenames) = 4000
Total NameNode memory: 150 * 4000 bytes
The results above show that a large number of small files is an overhead for the namenode, as it takes up more NameNode memory.
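The two totals above can be checked with a short sketch that follows this answer's counting model: one item per file name plus one item per block replica, at 150 bytes per item. The model and the function name are taken as assumptions from the answer, not from HDFS internals.

```python
BYTES_PER_ITEM = 150  # lower bound of the 150-200 byte range quoted above

def namenode_items(file_count, file_size_mb, block_size_mb=64, replication=3):
    """Items per this answer's model: block replicas plus one name per file."""
    blocks_per_file = -(-file_size_mb // block_size_mb)  # ceiling division
    return file_count * (blocks_per_file * replication + 1)

single_file = namenode_items(1, 1024)  # one 1 GB file
many_files = namenode_items(1000, 1)   # 1000 files of 1 MB

print(single_file, single_file * BYTES_PER_ITEM)  # 49 7350
print(many_files, many_files * BYTES_PER_ITEM)    # 4000 600000
```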
The block name and block ID form a unique ID for a particular block of data. This unique ID is used to identify
the block when a client makes a request to read data. Hence it cannot be shared.
HDFS is designed to handle large files. Let's say you have a 1000 MB file. With a 4 KB block size, you'd have to make 256,000
requests to get that file (1 request per block). In HDFS, those requests go across a network and come with a lot of overhead.
Each request has to be processed by the Name Node to figure out where that block can be found. That's a lot of traffic!
If you use 64 MB blocks, the number of requests goes down to 16, greatly reducing the cost of overhead and load on the Name Node.
With these things in mind, Hadoop recommends a large block size.
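The request counts above follow from a ceiling division of file size by block size. A minimal sketch, assuming one request per block as stated in the answer (the function name is invented for the example):

```python
def block_requests(file_size_kb, block_size_kb):
    """One request per block, so round the block count up."""
    return -(-file_size_kb // block_size_kb)

file_kb = 1000 * 1024  # the 1000 MB file from the example

print(block_requests(file_kb, 4))          # 256000
print(block_requests(file_kb, 64 * 1024))  # 16
```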
The HDFS block size is a logical unit for splitting a large file into small chunks. Each chunk is called a block.
These chunks/blocks are used during further parallel processing of the data, i.e. by MapReduce programming or another model
that reads/processes the data within HDFS.
If a file is small enough to fit in this logical block, then one block will be assigned to the file and it will
take disk space according to the file size and the Unix file system you are using. The details of how a file gets stored on disk are available at this link.
HDFS block size Vs actual file size
As the HDFS block size is a logical unit, not a physical unit of storage, there is no wasted space.
These links will be useful for understanding the problem with small files.
Link1,
Link2
See Kumar's Answer
You could look into SequenceFiles or HAR files depending on your use case. HAR files are analogous to archives made with the tar command. MapReduce can act upon each HAR file with a little overhead. As for SequenceFiles, they are in a way a container of key/value pairs. The benefit of this is that a Map task can act upon each of these pairs.
HAR Files
Sequence Files
More About Sequence Files
If I need to do a sequential scan of (non-splittable) thousands of gzip files of sizes varying between 200 and 500mb, what is an appropriate block size for these files?
For the sake of this question, let's say that the processing done is very fast, so restarting a mapper is not costly, even for large block sizes.
My understanding is:
There's hardly an upper limit of block size, as there's "plenty of files" for an appropriate amount of mappers for the size of my cluster.
To ensure data-locality, I want each gzip file to be in 1 block.
However, the gzipped files are of varying sizes. How is data stored if I choose a block size of ~500mb (e.g. max file size of all my input files)? Would it be better to pick a "very large" block size, like 2GB? Is HDD capacity wasted excessively in either scenario?
I guess I'm really asking how files are actually stored and split across hdfs blocks - as well as trying to gain an understanding of best practice for non-splittable files.
Update: A concrete example
Say I'm running a MR Job on three 200 MB files, stored as in the following illustration.
If HDFS stores files as in case A, 3 mappers would be guaranteed to be able to process a "local" file each. However, if the files are stored as in case B, one mapper would need to fetch part of file 2 from another data node.
Given there's plenty of free blocks, does HDFS store files as illustrated by case A or case B?
If you have non-splittable files then you are better off using larger block sizes - as large as the files themselves (or larger, it makes no difference).
If the block size is smaller than the overall file size, then you run into the possibility that not all the blocks are on the same data node, and you lose data locality. This isn't a problem with splittable files, as a map task will be created for each block.
As for an upper limit for block size, I know that for certain older versions of Hadoop the limit was 2GB (above which the block contents were unobtainable) - see https://issues.apache.org/jira/browse/HDFS-96
There is no downside for storing smaller files with larger block sizes - to emphasize this point consider a 1MB and 2 GB file, each with a block size of 2 GB:
1 MB - 1 block, single entry in the Name Node, 1 MB physically stored on each data node replica
2 GB - 1 block, single entry in the Name node, 2 GB physically stored on each data node replica
So other than the required physical storage, there is no downside for the Name node block table (both files have a single entry in the block table).
The only possible downside is the time it takes to replicate a smaller versus a larger block, but on the flip side, if a data node is lost from the cluster, then tasking 2000 x 1 MB blocks to replicate is slower than a single 2 GB block.
Update - a worked example
Seeing as this is causing some confusion, here are some worked examples:
Say we have a system with a 300 MB HDFS block size, and to make things simpler we have a pseudo cluster with only one data node.
If you want to store a 1100 MB file, then HDFS will break up that file into blocks of at most 300 MB and store them on the data node in special block-indexed files. If you were to go to the data node and look at where it stores the indexed block files on physical disk, you may see something like this:
/local/path/to/datanode/storage/0/blk_000000000000001 300 MB
/local/path/to/datanode/storage/0/blk_000000000000002 300 MB
/local/path/to/datanode/storage/0/blk_000000000000003 300 MB
/local/path/to/datanode/storage/0/blk_000000000000004 200 MB
Note that the file isn't exactly divisible by 300 MB, so the final block of the file is sized as the modulo of the file by the block size.
Now if we repeat the same exercise with a file smaller than the block size, say 1 MB, and look at how it would be stored on the data node:
/local/path/to/datanode/storage/0/blk_000000000000005 1 MB
Again note that the actual file stored on the data node is 1 MB, NOT a 300 MB file with 299 MB of zero padding (which I think is where the confusion is coming from).
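The block lengths in the two listings above can be reproduced with a few lines. This is only a sketch of the arithmetic (full blocks plus a remainder), not of how HDFS itself is implemented, and the function name is hypothetical.

```python
def hdfs_block_lengths(file_size_mb, block_size_mb):
    """Lengths of the blocks a file is split into; the last one may be shorter."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    lengths = [block_size_mb] * full_blocks
    if remainder:
        lengths.append(remainder)  # final block holds only what is left
    return lengths

print(hdfs_block_lengths(1100, 300))  # [300, 300, 300, 200]
print(hdfs_block_lengths(1, 300))     # [1]
```

The lengths always sum to the file size, which is why no space is lost to padding.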
Now where the block size does play a factor in efficiency is in the Name Node. For the above two examples, the name node needs to maintain a map of the file names, to block names and data node locations (as well as the total file size and block size):
filename     index     datanode
-------------------------------------------
fileA.txt    blk_01    datanode1
fileA.txt    blk_02    datanode1
fileA.txt    blk_03    datanode1
fileA.txt    blk_04    datanode1
-------------------------------------------
fileB.txt    blk_05    datanode1
You can see that if you were to use a block size of 1 MB for fileA.txt, you'd need 1100 entries in the above map rather than 4 (which would require more memory in the name node). Also pulling back all the blocks would be more expensive as you'd be making 1100 RPC calls to datanode1 rather than 4.
I'll attempt to highlight, by way of example, how block splits differ depending on the file. In HDFS you have:
Splittable FileA size 1GB
dfs.block.size=67108864(~64MB)
MapRed job against this file:
16 splits and in turn 16 mappers.
Let's look at this scenario with a compressed (non-splittable) file:
Non-Splittable FileA.gzip size 1GB
dfs.block.size=67108864(~64MB)
MapRed job against this file:
16 Blocks will converge on 1 mapper.
It's best to proactively avoid this situation, since it means that the tasktracker will have to fetch 16 blocks of data, most of which will not be local to the tasktracker.
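The two scenarios above can be summarised in a small sketch. It assumes the simplification used in this answer: one split (and one mapper) per block for a splittable file, and a single mapper for a non-splittable file. The function names are made up for the illustration.

```python
def block_count(file_size_mb, block_size_mb):
    """Blocks needed for the file (integer ceiling division)."""
    return -(-file_size_mb // block_size_mb)

def mapper_count(file_size_mb, block_size_mb, splittable):
    """Simplified model: one mapper per block if splittable, otherwise one."""
    return block_count(file_size_mb, block_size_mb) if splittable else 1

# 1 GB file with a 64 MB block size
print(mapper_count(1024, 64, splittable=True))   # 16
print(mapper_count(1024, 64, splittable=False))  # 1

# Blocks the single mapper must read for the gzip file
print(block_count(1024, 64))                     # 16
```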
Finally, the relationships of the block, split and file can be summarized as follows:
block boundary
|BLOCK | BLOCK | BLOCK | BLOCK ||||||||
|FILE------------|----------------|----------------|---------|
|SPLIT | | | |
The split can extend beyond the block boundary because the split depends on how the InputFormat class defines splitting the file, which may not coincide with the block size. In that case the split extends past the boundary to include the seek points within the source.