Apache Hadoop (Big Data) - hadoop

In Hadoop, data is split into blocks of, say, 64 MB or 128 MB. Let us say I have a file of size 70 MB. Does it split into two blocks of 64 MB and 6 MB? If so, since the second block holds only 6 MB, is the rest of that block's space wasted, or is it occupied by another block?

In Hadoop, the block size can be chosen by the application that writes into HDFS via the dfs.blocksize property:
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
There is no restriction that it must be 64 or 128 MB, but current Hadoop versions default to 128 MB.
Different block sizes can be set on different files.
No space is wasted if a file is smaller than the block size.
However, it is not recommended to have a lot of small files. More info about this problem and how to resolve it is here: https://developer.yahoo.com/blogs/hadoop/hadoop-archive-file-compaction-hdfs-461.html
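For illustration, here is a minimal sketch (not part of the original answer) of how a client could pick a per-file block size through the Hadoop FileSystem API. The path, buffer size, payload, and the 64 MB value are hypothetical, and it assumes a reachable cluster configured via core-site.xml.

    // Sketch: write one file with a non-default block size (hypothetical path and sizes).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up the cluster's dfs.blocksize default
            FileSystem fs = FileSystem.get(conf);

            // Override the block size for this one file: 64 MB instead of the cluster default.
            long blockSize = 64L * 1024 * 1024;
            short replication = fs.getDefaultReplication(new Path("/"));
            int bufferSize = 4096;

            try (FSDataOutputStream out = fs.create(
                    new Path("/tmp/seventy-mb-file"),    // hypothetical path
                    true, bufferSize, replication, blockSize)) {
                // A 70 MB file written here would occupy one full 64 MB block plus
                // a 6 MB final block; the final block uses only 6 MB on disk.
                out.write(new byte[1024]);               // placeholder payload
            }
        }
    }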

No space is wasted at all. If the second block holds only 6 MB, the remaining 58 MB on the DataNode's disk stays free and can be used for other files.

Related

Disk block size and Hadoop block size

I have read many posts saying that Hadoop's 64 MB block size reduces metadata and improves performance over a 4 KB block size. But why is the data block size exactly 4 KB on an OS disk and 64 MB in Hadoop?
Why not 100 MB or some other bigger number?
But why is the data block size exactly 4 KB on an OS disk and 64 MB in Hadoop?
In HDFS we store huge amounts of data compared to a single OS filesystem, so it doesn't make sense to use small block sizes. With small blocks there would be far more blocks, and the NameNode would have to store more metadata about them. Fetching the data would also be slower, because data spread over a larger number of blocks dispersed across many machines has to be fetched.
Why not 100 or some other bigger number?
Initially the HDFS block size was 64 MB, and now it's 128 MB by default. Check the dfs.blocksize property in hdfs-site.xml. This is because of bigger and faster storage (HDDs and SSDs). We shouldn't be surprised if it is later changed to 256 MB.
Check this HDFS comic to get a quick overview about HDFS.
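To make the metadata argument concrete, here is a back-of-the-envelope sketch (my own illustration, with a made-up 1 TB file) of how many block records the NameNode would have to track at different block sizes:

    // Sketch: block count for a single 1 TB file at various block sizes (illustrative numbers).
    public class BlockCountSketch {
        static long blocksFor(long fileBytes, long blockBytes) {
            return (fileBytes + blockBytes - 1) / blockBytes;   // ceiling division
        }

        public static void main(String[] args) {
            long oneTB = 1024L * 1024 * 1024 * 1024;
            System.out.println("4 KB blocks:   " + blocksFor(oneTB, 4L * 1024));          // ~268 million blocks
            System.out.println("64 MB blocks:  " + blocksFor(oneTB, 64L * 1024 * 1024));  // 16,384 blocks
            System.out.println("128 MB blocks: " + blocksFor(oneTB, 128L * 1024 * 1024)); // 8,192 blocks
        }
    }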
In addition to the existing answers, the following is also relevant:
Blocks at the OS level and blocks at the HDFS level are different concepts. When you have a 10 KB file on the OS, essentially three 4 KB blocks get allocated, and the result is that you consume 12 KB.
Obviously you don't want to allocate a large fraction of your space to blocks that are not full, so you need a small block size.
On HDFS, however, the content of the block determines the size of the block.
So 129 MB could be stored in one block of 128 MB and one block of 1 MB (I'm not sure whether it might be spread out differently).
As a result you don't 'lose' the 127 MB of the last block that is not filled.
With this in mind, you will want a comparatively large block size to optimize block management.
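A small illustrative sketch of the contrast described above, reusing the 10 KB and 129 MB figures from the answer; the code only models the allocation rules, it is not HDFS itself:

    // Sketch: OS filesystems round up to whole 4 KB blocks; HDFS lets the last block be partial.
    public class AllocationSketch {
        public static void main(String[] args) {
            long osBlock = 4L * 1024;
            long fileOnOs = 10L * 1024;
            long osAllocated = ((fileOnOs + osBlock - 1) / osBlock) * osBlock;
            System.out.println("OS: 10 KB file occupies " + osAllocated / 1024 + " KB (3 x 4 KB blocks)");

            long hdfsBlock = 128L * 1024 * 1024;
            long fileOnHdfs = 129L * 1024 * 1024;
            long fullBlocks = fileOnHdfs / hdfsBlock;          // 1 full block
            long lastBlock = fileOnHdfs % hdfsBlock;           // 1 MB final block
            System.out.println("HDFS: 129 MB file = " + fullBlocks + " x 128 MB block + a "
                    + lastBlock / (1024 * 1024) + " MB final block (nothing rounded up)");
        }
    }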

How does HDFS manage block size?

My file is 65 MB and the default HDFS block size is 64 MB, so how many 64 MB blocks will be allotted to my file?
Is it one 64 MB block plus one 1 MB block, or two 64 MB blocks? If it is two 64 MB blocks, is the remaining 63 MB wasted, or will it be allocated to another file?
A block size of 64 MB is an upper bound for a block. It doesn't mean that a file block smaller than 64 MB will consume 64 MB; storing a 1 MB chunk will not consume 64 MB.
If the file is 160 megabytes, it is split into two 64 MB blocks and one 32 MB block; the last block occupies only 32 MB on disk.
Hope this helps.
According to this page, it looks like it'll be one 64 MB block and one 1 MB block.
HDFS is often blissfully unaware that the final record in one block may be only a partial record, with the rest of its content shunted off to the following block. HDFS only wants to make sure that files are split into evenly sized blocks that match the predefined block size for the Hadoop instance... Not every file you need to store is an exact multiple of your system’s block size, so the final data block for a file uses only as much space as is needed.
The answer is 2 blocks: one of 64 MB and the other of 1 MB.
HDFS, just like other filesystems, splits a file into blocks and then saves those blocks to disk.
But there are two major differences between them:
HDFS block sizes are huge because every block has a metadata record at the NameNode; smaller block sizes would mean many more blocks and would overload the NameNode with metadata.
Hence, bigger block sizes are used in HDFS.
HDFS blocks are just an abstraction on top of the Linux-based filesystem, so a 65 MB file will use one full 64 MB block plus 1 MB of a second block; the remaining 63 MB of the second block is never physically allocated and stays available for other data.
That is, the NameNode records two blocks for the 65 MB file, but the actual filesystem space used is only 65 MB.
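If you want to verify this on a real cluster, a sketch like the following could ask the NameNode for a file's block layout; the path is hypothetical and it assumes a running HDFS with a 65 MB file already stored there. For a 64 MB block size you would expect two entries: one of length 64 MB and one of length 1 MB.

    // Sketch: inspect the actual block lengths of an existing file (hypothetical path).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLayoutInspector {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/data/example-65mb");      // hypothetical path

            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

            System.out.println("File length: " + status.getLen()
                    + " bytes, block size: " + status.getBlockSize());
            for (BlockLocation b : blocks) {
                // getLength() reports the actual bytes in the block, not the nominal block size.
                System.out.println("offset=" + b.getOffset() + " length=" + b.getLength());
            }
        }
    }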

How will the free space of a block in a data node be utilized by hdfs in hadoop?

My file has a size of 10 MB and I stored it in Hadoop, but the default block size in HDFS is 64 MB. Thus my file uses 10 MB out of 64 MB. How will HDFS utilize the remaining 54 MB of free space in that block?
Logically, if your file is smaller than the block size, HDFS will reduce the block size for that particular file to the size of the file. So HDFS uses only 10 MB to store a 10 MB file; it will not waste 54 MB or leave it blank.
Small files in HDFS are described in detail here: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
The remaining 54 MB would be utilized for some other file. This is how it works: assume you do a put or copyFromLocal of 2 small files, each of size 20 MB, and your block size is 64 MB. HDFS calculates the available space in the filesystem (not the available blocks); if you previously saved a 10 MB file in a 64 MB block, the remaining 54 MB is included in that figure. It then gives a report in terms of blocks. Since you have 2 files with a replication factor of 3, a total of 6 blocks would be allocated for your files even though each file is smaller than the block size. If the cluster doesn't have space for 6 blocks (6 * 64 MB), the put process fails. Since the report is fetched in terms of space, not in terms of blocks, you will never run out of blocks. The only time files are measured in blocks is at block allocation time.
Read this blog for more information.
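A quick sketch of the accounting described in the answer above, with the same illustrative numbers (two 20 MB files, a 64 MB block size, replication factor 3):

    // Sketch: block replicas allocated vs. physical space actually consumed.
    public class ReplicationAccounting {
        public static void main(String[] args) {
            long mb = 1024L * 1024;
            long blockSize = 64 * mb;
            long[] files = {20 * mb, 20 * mb};
            int replication = 3;

            long blockReplicas = 0;
            long physicalBytes = 0;
            for (long len : files) {
                blockReplicas += ((len + blockSize - 1) / blockSize) * replication;
                physicalBytes += len * replication;
            }
            System.out.println("Block replicas stored across DataNodes: " + blockReplicas); // 6
            System.out.println("Physical space used: " + physicalBytes / mb + " MB");       // 120 MB, not 6 x 64 = 384 MB
        }
    }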

Query regarding the block size

With regards to the HDFS, I read from their site under the Data Replication section (below link) that
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Replication
'all blocks in a file except the last block are the same size'
Could you please let me know why the last block would not be of the same size?
Could it be that total memory allocation plays a part here?
If memory size is not an issue, would the last block still differ in size from the rest of the blocks of a file?
And if yes, could you please elaborate a bit on this?
Any link to the JIRA for the development effort for this would be greatly appreciated.
Actually, this is not an issue at all. It is simply not guaranteed that the last block of a file will be the same size as the others.
Consider a file of 1000 MB with a 128 MB block size: the file will be split into 8 blocks, where the first 7 blocks are each exactly 128 MB.
The total size of those 7 blocks is 896 MB (7 * 128 MB), so the remaining size is 104 MB (1000 - 896). The last block's actual size is therefore 104 MB, while the other 7 blocks are 128 MB each.
The NameNode allocates a data block for every chunk of the file being stored on HDFS. It makes no exception for a final chunk whose size is less than the block size.
HDFS is designed to store data in equally sized blocks so that the blocks available on the DataNodes can be easily calculated and maintained by the NameNode.
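For reference, a worked version of the arithmetic above (1000 MB file, 128 MB block size):

    // Sketch: seven full 128 MB blocks plus a smaller final block for a 1000 MB file.
    public class LastBlockArithmetic {
        public static void main(String[] args) {
            long mb = 1024L * 1024;
            long fileLen = 1000 * mb;
            long blockSize = 128 * mb;

            long fullBlocks = fileLen / blockSize;      // 7
            long lastBlock = fileLen % blockSize;       // 104 MB
            System.out.println(fullBlocks + " blocks of 128 MB + a final block of "
                    + lastBlock / mb + " MB");
        }
    }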

Why is the last block in the Hadoop Distributed File System of a different size than the others?

Each file in HDFS is stored as a sequence of blocks. Blocks are of equal size, except the last one. Why? Is it possible to change it?
No, you cannot change this behavior. The size and number of blocks corresponding to a file depend on the configuration property dfs.blocksize.
E.g.: if you store a 130 MB file in HDFS with a block size of 64 MB, then 3 blocks are created: the first two blocks are 64 MB each and the third block is 2 MB.
If the third block were padded to the same size as the first two, the extra space would simply be wasted.
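To make the "wasted usage" point concrete, here is a small sketch with the same 130 MB / 64 MB figures, showing how much padding a full-size last block would cost:

    // Sketch: padding the last block of a 130 MB file out to 64 MB would waste 62 MB per replica.
    public class PaddingWasteSketch {
        public static void main(String[] args) {
            long mb = 1024L * 1024;
            long fileLen = 130 * mb;
            long blockSize = 64 * mb;

            long lastBlock = fileLen % blockSize;           // 2 MB of real data
            long wastedIfPadded = blockSize - lastBlock;    // 62 MB of padding
            System.out.println("Last block holds " + lastBlock / mb + " MB; padding it to "
                    + blockSize / mb + " MB would waste " + wastedIfPadded / mb + " MB per replica.");
        }
    }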

Resources