Hadoop block size and file size issue? - hadoop

This might seem as a silly question but in Hadoop suppose blocksize is X (typically 64 or 128 MB) and a local filesize is Y (where Y is less than X).Now when I copy file Y to the HDFS will it consume one block or will hadoop create smaller size blocks?

One block is consumed by Hadoop. That does not mean that storage capacity will be consumed in an equivalent manner.
The output while browsing the HDFS from web looks like this:
filename1 file 48.11 KB 3 128 MB 2012-04-24 18:36
filename2 file 533.24 KB 3 128 MB 2012-04-24 18:36
filename3 file 303.65 KB 3 128 MB 2012-04-24 18:37
You see that each file size is lesser than the block size which is 128 MB. These files are in KB.
HDFS capacity is consumed based on the actual file size but a block is consumed per file.
There are limited number of blocks available dependent on the capacity of the HDFS. You are wasting blocks as you will run out of them before utilizing all the actual storage capacity. Remember that Unix filsystem also has concept of blocksize but is a very small number around 512 Bytes. This concept is inverted in HDFS where the block size is kept bigger around 64-128 MB.
The other issue is that when you run map/reduce programs it will try to spawn mapper per block so in this case when you are processing three small files, it may end up spawning three mappers to work on them eventually.
This wastes resources when the files are of smaller size. You also add latency as each mapper takes time to spawn and then ultimately would work on a very small sized file. You have to compact them into files closer to blocksize to take advantage of mappers working on lesser number of files.
Yet another issue with numerous small files is that it loads namenode which keeps the mapping (metadata) of each block and chunk mapping in main memory. With smaller files, you fill up this table faster and more main memory will be required as metadata grows.
Read the following for reference:
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
http://www.ibm.com/developerworks/web/library/wa-introhdfs/
Oh! there is a discussion on SO : Small files and HDFS blocks

Related

HDFS - one large file or few smaller files with the size of a block size

so I'm having some issues understanding in which way I should store large files.
For example, the block size in my HDFS is 128MB, and I have a 1GB file.
I know that saving files that are smaller than the block size is not the best practice and I understand why.
But what should I do with big files, for my 1GB file, should I save 1 file or 8 files of 128MB each, and why?
You can store 1 file with 1GB. Hadoop will autmatically store that file in 8 blocks.
Hadoop is designed for bigger files not smaller files. Please note that Block is physical storage in hadoop.
As you did not mention split size in your cluster so i assume it is 128 MB. Split is something that on which you parallelism depend. So if you process 1 GB file on 128 split size 8 mappers will be invoked ( 1 mapper on each split).
If you store 8 files of 128 mb each. There will be unneccesary overhead on your Namenode for maintaining info about those 8 files. In case of 8 files performance may be more or less similar as compared to 1 GB file but it will definitely better in case of 1 GB file with 8 blocks.
Do not confuse with Blocks in hadoop they are just storage unit like other file system. Hadoop will autmatically take care of storage no matter how bigger file is and it will divide files in block . Storing small files will be uncessary over head in i/o operations.

Disk block size and hadoop block size

I have read many posts saying Hadoop block size of 64 MB reduces metadata and helps in performance improvement over 4 kb block size. But, why data block size is exactly 4kb in OS Disk and 64 MB in Hadoop.
Why not 100 or some other bigger number?
But, why data block size is exactly 4kb in OS Disk and 64 MB in Hadoop.
In HDFS we store huge amounts of data as compared to a single OS filesystem. So, it doesn't make sense to have small block sizes for HDFS. By having small block sizes, there will be more blocks and the NameNode has to store more metadata about the blocks. And also fetching of the data will be slow as data from higher number of blocks dispersed across many machines has to fetched.
Why not 100 or some other bigger number?
Initially the HDFS block size was 64MB and now it's 128MB by default. Check the dfs.blocksize property in hdfs-site.xml here. This is because of the bigger and better storage capacities and speed (HDD and SSD). We shouldn't be surprised when later it's changed to 256MB.
Check this HDFS comic to get a quick overview about HDFS.
In addition to the existing answers, the following is also relevant:
Blocks on an OS level and blocks on a HDFS level are different concepts. When you have a 10kb file on the OS, then that essentially means 3 blocks of 4kb get allocated, and the result is that you consume 12kb.
Obviously you don't want to allocate a large fraction of your space to blocks that are not full, so you need a small blocksize.
On HDFS however, the content of the block determines the size of the block.
So if you have 129MB that could be stored in 1 block of 128MB and 1 block of 1MB. (I'm not sure if it will spread it out differently).
As a result you don't 'lose' the 127 mb which is not allocated.
With this in mind you will want to have a comparatively large blocksize to optimize block management.

Why should I avoid storing lots of small files in Hadoop HDFS?

I have read that lots of small files stored in HDFS can be a problem because lots of small files means lots of objects Hadoop NameNode memory.
However since each block is stored in named node as an object, how is it different for a large file? Whether you store 1000 blocks from a single file in memory or 1000 blocks for 1000 files, is the amount of NameNode memory used the same?
Similar question for Map jobs. Since they operate on blocks, how does it matter if blocks are of small files or from bigger ones ?
At a high-level, you can think of a Hadoop NameNode as a tracker for where blocks composing 'files' stored in HDFS are located; blocks are used to break down large files into smaller pieces when stored in an HDFS cluster.
When you have lots of small files stored in HDFS, there are also lots of blocks, and the NameNode must keep track of all of those files and blocks in memory.
When you have a large file, for example -- if you combined all of those files into bigger files, first -- you would have fewer files stored in HDFS, and you would also have fewer blocks.
First let's discuss how file size, HDFS blocks, and NameNode memory relate:
This is easier to see with examples and numbers.
Our HDFS NameNode's block size for this example is 100 MB.
Let's pretend we have a thousand (1,000) 1 MB files and we store them in HDFS. When storing these 1,000 1 MB files in HDFS, we would have also have 1,000 blocks composing those files in our HDFS cluster.
Each block stored in HDFS requires about 150 bytes of NameNode memory, which is about 150 KB of memory for those 1,000 blocks representing 1,000 1 MB files.
Now, consider that we consolidate or concatenate those 1,000 1 MB files into a single 1,000 MB file and store that single file in HDFS. When storing the 1,000 MB file in HDFS, it would be broken down into blocks based on our HDFS cluster block size; in this example our block size was 100 MB, which means our 1,000 MB file would be stored as ten (10) 100 MB blocks in the HDFS cluster.
Each block stored in HDFS requires about 150 bytes of NameNode memory, which is about 1.5 KB of memory for those 10 blocks representing the 1 1,000 MB file.
With the larger file, we have the same data stored in the HDFS cluster, but use 1% of the NameNode memory compared to the situation with many small files.
Input blocks and the number of Map tasks for a job are related.
When it comes to Map tasks, generally you will have 1-map task per input block. The size of input blocks here matters because there is overhead from starting and finishing new tasks; i.e. when Map tasks finish too quickly, the amount of this overhead becomes a greater portion of each tasks's completion time, and completion of the overall job this can be slower than the same job but with fewer, bigger input blocks. For a MapReduce2-based job, Map tasks also involve starting and stopping a YARN container at the resource management layer, for each task, which adds overhead. (Note that you can also instruct MapReduce jobs to use a minimum input size threshold when dealing with many small input blocks to address some of these inefficiencies as well)

How will the free space of a block in a data node be utilized by hdfs in hadoop?

My file has a size of 10MB, I stored this in hadoop, but the default block size in hdfs is 64 MB. Thus, my file uses 10 MB out-of 64 MB. How will HDFS utilize the remaining 54 MB of free space in the same block?
Logically, if you files are smaller than block size than HDFS will reduce the block size for that particular files to the size of file. So HDFS will only use 10MB for storing 10MB of small files.It will not waste 54MB or leave it blank.
Small file sin HDFS are desribed in detail here : http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
The remaining 54MB would be utilized for some other file. So this is how it works, assume you do a put or copyFromLocal of 2 small files each with size 20MB and your block size is 64MB. Now HDFS calculates the available space (suppose previously you have saved a file of 10 MB in a 64MB block it includes these remaining 54MB as well)in the filesystem(not available blocks) and gives a report in terms of block. Since you have 2 files, with replication factor as 3, so a total of 6 blocks would be allocated for your files even if your file size is less than the block size. If the cluster doesn't have 6 blocks(6*64MB) then the put process would fail. Since the report is fetched in terms of space not in terms of blocks, you would never run out of blocks. The only time files are measured in blocks is at block allocation time.
Read this blog for more information.

Is there any memory loss in HDFS if we use small files?

I have taken below Quoting from Hadoop - The Definitive Guide:
Note, however, that small files do not take up any more disk space than is required to store the raw contents of the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB,
Here my questions
1) 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.) How does hdfs use the remaining 127M in this block?
2)Is there any chance to store another file in same block?
1 MB file stored in 128MB block with 3 replication. Then the file will be stored in 3 blocks and uses 3*1=3 MB only not 3*128=384 MB. But it shows each the block size as 128 MB. It is just an abstraction to store the metadata in the namenode, but not an actual memory size used.
No way to store more than a file in a single block. Each file will be stored in a separate block.
Reference:
https://stackoverflow.com/a/21274388/3496666
https://stackoverflow.com/a/15065274/3496666
https://stackoverflow.com/a/14109147/3496666
NameNode Memory Usage:
Every file, directory and block in HDFS is represented as an object. i.e. each entry i the namenode is reflected to a item.
in the namenode’s memory, and each of object/item occupies 150 to 200 bytes of namenode memory.memorandums prefer fewer large files as a result of the metadata that needs to be stored.
Consider a 1 GB file with the default block size of 64MB.
-Stored as a single file 1 GB file
Name: 1 item
Block=16
Total Item = 16*3( Replication factor=3) = 48 + 1(filename) = 49
Total NameNode memory: 150*49
-Stored as 1000 individual 1 MB files
Name: 1000
Block=1000
Total Item = 1000*3( Replication factor=3) = 3000 + 1000(filename) = 4000
Total NameNode memory: 150*4000
Above results clarify that large number of small files is a overhead of naemnode memory as it takes more space of NameNode memory.
Block Name and Block ID is a unique ID of a particular block of data.This uniue ID is getting used to identified
the block during reading of the data when client make a request to read data.Hence it can not be shared.
HDFS is designed to handle large files. Lets say you have a 1000Mb file. With a 4k block size, you'd have to make 256,000
requests to get that file (1 request per block). In HDFS, those requests go across a network and come with a lot of overhead.
Each request has to be processed by the Name Node to figure out where that block can be found. That's a lot of traffic!
If you use 64Mb blocks, the number of requests goes down to 16, greatly reducing the cost of overhead and load on the Name Node.
To keep these things in mind hadoop recommend large block size.
HDFS block size is a logical unit of splitting a large file into small chunks. This chunks is basically called a block.
These chunks/block is used during further parallel processing of the data.i.e. MapReduce Programming or other model
to read/process of that within HDFS.
If a file is small enough to fit in this logical block then one block will get assigned for the file and it will
take disk space according to file size and Unix file system you are using.The detail about, how file gets stored in disk is available on this link.
HDFS block size Vs actual file size
As HDFS block size is a logical unit not a physical unit of the memory, so there is no waste of memory.
These link will be useful to understand the problem with small file.
Link1,
Link2
See Kumar's Answer
You could look into SequenceFiles or HAR Files depending on your use case. HAR files are analogous to the Tar command. MapReduce can act upon each HAR files with a little overhead. As for SequenceFiles, they are in a way a container of Key/Value pairs. The benefit of this is a Map task can act upon each of these pairs.
HAR Files
Sequence Files
More About Sequence Files

Resources