HDFS has a default block size of 64 MB. Does that mean the minimum size of a file in HDFS is 64 MB?
That is, if we create or copy a file smaller than 64 MB (say 5 bytes), my assumption is that its actual size in HDFS is one block, i.e. 64 MB. But when I copy a 5-byte file to HDFS and check its size with the ls command, it still shows 5 bytes. Shouldn't that be 64 MB?
Or is the ls command showing the size of the data in the file rather than the block size of the file on HDFS?
The default HDFS block size does not mean that every block uses all of the space specified, i.e. 64 MB. If the data is larger than 64 MB, it is split into blocks, and ceil(data size / 64 MB) blocks are created.
The ls command only shows the space the file is actually using.
For example, I uploaded a test.txt file to HDFS with the block size set to 128 MB and replication set to 2, but the actual file size is only 193 B.
Permission  Owner   Group       Size   Last Modified           Replication  Block Size  Name
-rw-r--r--  hduser  supergroup  193 B  10/27/2016, 2:58:41 PM  2            128 MB      test.txt
The default block size is a maximum size of a block. Each file consists of blocks, which are distributed (and replicated) to different datanodes on HDFS. The namenode knows which blocks constitute a file, and where to find them. Perhaps it's easier to understand this with the following image:
If a file exceeds 64 MB (128 MB in newer versions), it cannot be written in a single block; it needs at least two.
Of course, if it is smaller than 64 MB it fits in a single block, which occupies only as much space as necessary (less than 64 MB).
After all, it wouldn't make sense for a 5-byte file to occupy 64 MB.
I'm having some trouble understanding how I should store large files.
For example, the block size in my HDFS is 128 MB, and I have a 1 GB file.
I know that saving files smaller than the block size is not best practice, and I understand why.
But what should I do with big files? For my 1 GB file, should I save one file or 8 files of 128 MB each, and why?
You can store one 1 GB file. Hadoop will automatically store it in 8 blocks.
Hadoop is designed for large files, not small ones. Note that a block is the physical unit of storage in Hadoop.
Since you did not mention the split size in your cluster, I assume it is 128 MB. The split is what your parallelism depends on: if you process a 1 GB file with a 128 MB split size, 8 mappers will be invoked (one mapper per split).
If you store 8 files of 128 MB each, there will be unnecessary overhead on your NameNode for maintaining information about those 8 files. Performance with 8 files may be more or less similar to that of a single 1 GB file, but it will definitely be better with one 1 GB file stored as 8 blocks.
Do not be confused by blocks in Hadoop; they are just a storage unit, as in any other file system. Hadoop automatically takes care of storage no matter how big the file is, dividing it into blocks. Storing small files adds unnecessary overhead to I/O operations.
My file is 10 MB. I stored it in Hadoop, but the default block size in HDFS is 64 MB, so my file uses 10 MB out of 64 MB. How will HDFS utilize the remaining 54 MB of free space in the same block?
Logically, if your files are smaller than the block size, HDFS reduces the block size for those files to the size of the file. So HDFS will use only 10 MB to store a 10 MB file. It will not waste 54 MB or leave it blank.
Small files in HDFS are described in detail here: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
The remaining 54 MB will be used for some other file. Here is how it works: assume you do a put or copyFromLocal of 2 small files of 20 MB each, and your block size is 64 MB. HDFS calculates the available space in the filesystem (not the available blocks) and reports it in terms of blocks; if you previously saved a 10 MB file in a 64 MB block, the remaining 54 MB is included in that available space. Since you have 2 files and a replication factor of 3, a total of 6 blocks will be allocated for your files even though each file is smaller than the block size. If the cluster does not have room for 6 blocks (6 * 64 MB), the put will fail. Since the report is derived from space, not from a count of blocks, you never run out of blocks as such. The only time files are measured in blocks is at block allocation time.
Read this blog for more information.
The following quote is from Hadoop: The Definitive Guide:
Note, however, that small files do not take up any more disk space than is required to store the raw contents of the file. For example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB,
Here are my questions:
1) A 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB. How does HDFS use the remaining 127 MB in this block?
2) Is there any chance of storing another file in the same block?
Take a 1 MB file stored with a 128 MB block size and a replication factor of 3. The file is stored as 3 block replicas and uses only 3 * 1 = 3 MB, not 3 * 128 = 384 MB. The block size is still shown as 128 MB for each, but that is just an abstraction recorded in the NameNode's metadata, not the actual disk space used.
There is no way to store more than one file in a single block. Each file is stored in its own block or blocks.
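The replication arithmetic above can be written out as a small sketch. The function name `hdfs_disk_usage` is made up for illustration, and the model is simplified: it ignores the small checksum metadata that DataNodes keep alongside each block.

```python
MB = 1024 * 1024

def hdfs_disk_usage(file_size: int, replication: int = 3) -> int:
    """Approximate raw bytes used across the cluster (hypothetical helper).

    Disk usage depends on the file's actual size and the replica count,
    not on the configured block size.
    """
    return file_size * replication

block_size = 128 * MB
file_size = 1 * MB

used = hdfs_disk_usage(file_size, replication=3)
if_padded = block_size * 3  # what would be used if blocks were padded to full size

print(used // MB, "MB actually used")
print(if_padded // MB, "MB if every replica were padded to the block size")
```

Note that the block size never enters the calculation of `used`, which is the point of the answer.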
Reference:
https://stackoverflow.com/a/21274388/3496666
https://stackoverflow.com/a/15065274/3496666
https://stackoverflow.com/a/14109147/3496666
NameNode Memory Usage:
Every file, directory and block in HDFS is represented as an object, i.e. each entry in the NameNode corresponds to an item in the NameNode's memory, and each object/item occupies 150 to 200 bytes of NameNode memory. NameNodes prefer fewer large files because of the metadata that needs to be stored.
Consider a 1 GB file with the default block size of 64 MB.
- Stored as a single 1 GB file
Name: 1 item
Blocks: 16
Total items = 16 * 3 (replication factor = 3) + 1 (file name) = 48 + 1 = 49
Total NameNode memory: 150 * 49 = 7,350 bytes
- Stored as 1000 individual 1 MB files
Names: 1000 items
Blocks: 1000
Total items = 1000 * 3 (replication factor = 3) + 1000 (file names) = 3000 + 1000 = 4000
Total NameNode memory: 150 * 4000 = 600,000 bytes
These results show that a large number of small files is an overhead on the NameNode, because they take up more of its memory.
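The same estimate can be reproduced with a short sketch. It follows the counting used in this answer (one item per block replica plus one per file name, 150 bytes each); that is a rule of thumb, not how the NameNode actually accounts for memory, and the function name `namenode_items` is made up for illustration.

```python
import math

BYTES_PER_ITEM = 150  # lower end of the 150-200 byte rule of thumb quoted above

def namenode_items(num_files: int, file_size_mb: int,
                   block_size_mb: int = 64, replication: int = 3) -> int:
    """Count NameNode items the way the answer above does (hypothetical helper)."""
    blocks_per_file = math.ceil(file_size_mb / block_size_mb)
    block_items = num_files * blocks_per_file * replication
    name_items = num_files
    return block_items + name_items

one_large = namenode_items(num_files=1, file_size_mb=1024)   # one 1 GB file
many_small = namenode_items(num_files=1000, file_size_mb=1)  # 1000 x 1 MB files

print(one_large, "items,", one_large * BYTES_PER_ITEM, "bytes")
print(many_small, "items,", many_small * BYTES_PER_ITEM, "bytes")
```

Under these assumptions the small-file layout needs roughly 80 times more NameNode memory for about the same amount of data.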
The block name and block ID form a unique ID for a particular block of data. This unique ID is used to identify
the block when a client makes a request to read data. Hence a block cannot be shared.
HDFS is designed to handle large files. Let's say you have a 1000 MB file. With a 4 KB block size, you'd have to make 256,000
requests to get that file (1 request per block). In HDFS, those requests go across a network and come with a lot of overhead.
Each request has to be processed by the NameNode to figure out where that block can be found. That's a lot of traffic!
If you use 64 MB blocks, the number of requests goes down to 16, greatly reducing the overhead and the load on the NameNode.
With these things in mind, Hadoop recommends a large block size.
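To check the request counts quoted above, here is a minimal sketch that assumes one request per block, as the answer does. The helper name `block_requests` is made up for illustration.

```python
KB = 1024
MB = 1024 * KB

def block_requests(file_size: int, block_size: int) -> int:
    """Number of blocks (and so per-block requests) needed; ceiling division."""
    return (file_size + block_size - 1) // block_size

file_size = 1000 * MB

small_blocks = block_requests(file_size, 4 * KB)
large_blocks = block_requests(file_size, 64 * MB)

print(small_blocks, "requests with 4 KB blocks")
print(large_blocks, "requests with 64 MB blocks")
```

Integer ceiling division is used so that a partially filled last block still counts as one request.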
The HDFS block size is a logical unit for splitting a large file into smaller chunks. Each chunk is called a block.
These blocks are used for further parallel processing of the data, e.g. by MapReduce or another programming model
that reads and processes the data within HDFS.
If a file is small enough to fit in this logical block, then one block is assigned to the file and it
takes up disk space according to the file size and the Unix file system you are using. The details of how a file gets stored on disk are available at this link.
HDFS block size Vs actual file size
Because the HDFS block size is a logical unit, not a physical unit of storage, no disk space is wasted.
These links are useful for understanding the problem with small files.
Link1,
Link2
See Kumar's Answer
You could look into SequenceFiles or HAR files, depending on your use case. HAR files are analogous to archives made with the tar command. MapReduce can act upon each HAR file with a little overhead. SequenceFiles are, in a way, a container of key/value pairs. The benefit of this is that a map task can act upon each of these pairs.
HAR Files
Sequence Files
More About Sequence Files
If I need to do a sequential scan of (non-splittable) thousands of gzip files of sizes varying between 200 and 500mb, what is an appropriate block size for these files?
For the sake of this question, let's say that the processing done is very fast, so restarting a mapper is not costly, even for large block sizes.
My understanding is:
There's hardly an upper limit of block size, as there's "plenty of files" for an appropriate amount of mappers for the size of my cluster.
To ensure data-locality, I want each gzip file to be in 1 block.
However, the gzipped files are of varying sizes. How is data stored if I choose a block size of ~500mb (e.g. max file size of all my input files)? Would it be better to pick a "very large" block size, like 2GB? Is HDD capacity wasted excessively in either scenario?
I guess I'm really asking how files are actually stored and split across hdfs blocks - as well as trying to gain an understanding of best practice for non-splittable files.
Update: A concrete example
Say I'm running a MR Job on three 200 MB files, stored as in the following illustration.
If HDFS stores files as in case A, 3 mappers would be guaranteed to be able to process a "local" file each. However, if the files are stored as in case B, one mapper would need to fetch part of file 2 from another data node.
Given there's plenty of free blocks, does HDFS store files as illustrated by case A or case B?
If you have non-splittable files then you are better off using larger block sizes - as large as the files themselves (or larger, it makes no difference).
If the block size is smaller than the overall filesize then you run into the possibility that all the blocks are not all on the same data node and you lose data locality. This isn't a problem with splittable files as a map task will be created for each block.
As for an upper limit on block size, I know that for certain older versions of Hadoop the limit was 2 GB (above which the block contents were unobtainable); see https://issues.apache.org/jira/browse/HDFS-96
There is no downside for storing smaller files with larger block sizes - to emphasize this point consider a 1MB and 2 GB file, each with a block size of 2 GB:
1 MB - 1 block, single entry in the Name Node, 1 MB physically stored on each data node replica
2 GB - 1 block, single entry in the Name node, 2 GB physically stored on each data node replica
So other than the required physical storage, there is no difference as far as the NameNode block table is concerned (both files have a single entry in the block table).
The only possible downside is the time it takes to replicate a smaller versus a larger block, but on the flip side, if a data node is lost from the cluster, then tasking 2000 x 1 MB blocks to replicate is slower than a single 2 GB block.
Update - a worked example
Seeing as this is causing some confusion, here are some worked examples:
Say we have a system with a 300 MB HDFS block size, and to make things simpler we have a pseudo cluster with only one data node.
If you want to store a 1100 MB file, then HDFS will break up that file into blocks of at most 300 MB and store them on the data node in special block-indexed files. If you were to go to the data node and look at where it stores the indexed block files on physical disk, you may see something like this:
/local/path/to/datanode/storage/0/blk_000000000000001 300 MB
/local/path/to/datanode/storage/0/blk_000000000000002 300 MB
/local/path/to/datanode/storage/0/blk_000000000000003 300 MB
/local/path/to/datanode/storage/0/blk_000000000000004 200 MB
Note that the file isn't exactly divisible by 300 MB, so the final block of the file is sized as the modulo of the file by the block size.
Now if we repeat the same exercise with a file smaller than the block size, say 1 MB, and look at how it would be stored on the data node:
/local/path/to/datanode/storage/0/blk_000000000000005 1 MB
Again, note that the actual file stored on the data node is 1 MB, NOT a 300 MB file with 299 MB of zero padding (which I think is where the confusion is coming from).
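The two layouts above can be reproduced with a few lines. This is only a sketch of the arithmetic (sizes in MB, the function name `block_layout` is made up), not of how a DataNode actually writes block files.

```python
def block_layout(file_size: int, block_size: int) -> list[int]:
    """Sizes of the blocks a file is cut into: full blocks, then the remainder."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # last block holds only what is left, no padding
    return sizes

print(block_layout(1100, 300))  # the 1100 MB file
print(block_layout(1, 300))     # the 1 MB file
```

The last entry of each list is the modulo of the file size by the block size, which is why the 1 MB file ends up as a single 1 MB block file on disk.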
Now where the block size does play a factor in efficiency is in the Name Node. For the above two examples, the name node needs to maintain a map of the file names, to block names and data node locations (as well as the total file size and block size):
filename index datanode
-------------------------------------------
fileA.txt blk_01 datanode1
fileA.txt blk_02 datanode1
fileA.txt blk_03 datanode1
fileA.txt blk_04 datanode1
-------------------------------------------
fileB.txt blk_05 datanode1
You can see that if you were to use a block size of 1 MB for fileA.txt, you'd need 1100 entries in the above map rather than 4 (which would require more memory in the name node). Also pulling back all the blocks would be more expensive as you'd be making 1100 RPC calls to datanode1 rather than 4.
I'll attempt to highlight by way of example the differences in blocks splits in reference to file size. In HDFS you have:
Splittable FileA size 1GB
dfs.block.size=67108864(~64MB)
MapRed job against this file:
16 splits and in turn 16 mappers.
Let's look at this scenario with a compressed (non-splittable) file:
Non-Splittable FileA.gzip size 1GB
dfs.block.size=67108864(~64MB)
MapRed job against this file:
16 Blocks will converge on 1 mapper.
It's best to proactively avoid this situation since it means that the tasktracker will have to fetch 16 blocks of data most of which will not be local to the tasktracker.
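As a rough sketch of the two scenarios above, assuming the split size equals the block size and that a non-splittable file always goes to a single mapper (a simplification of what the InputFormat really does; the function names are made up for illustration):

```python
import math

def blocks_in_file(file_size: int, block_size: int) -> int:
    """Number of HDFS blocks the file occupies."""
    return math.ceil(file_size / block_size)

def estimated_mappers(file_size: int, block_size: int, splittable: bool) -> int:
    """One mapper per split if splittable, otherwise one mapper for the whole file."""
    if not splittable:
        return 1
    return blocks_in_file(file_size, block_size)

GB = 1024 ** 3
BLOCK_SIZE = 67108864  # dfs.block.size from the example above

print(estimated_mappers(1 * GB, BLOCK_SIZE, splittable=True))   # splittable FileA
print(estimated_mappers(1 * GB, BLOCK_SIZE, splittable=False))  # FileA.gzip
print(blocks_in_file(1 * GB, BLOCK_SIZE))  # blocks the single gzip mapper must read
```

In the gzip case the block count stays at 16 while the mapper count drops to 1, so that one mapper has to pull most of its blocks over the network.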
Finally, the relationships of the block, split and file can be summarized as follows:
block boundary
|BLOCK | BLOCK | BLOCK | BLOCK ||||||||
|FILE------------|----------------|----------------|---------|
|SPLIT | | | |
The split can extend beyond the block because the split depends on the InputFormat class's definition of how to split the file, which may not coincide with the block size. In that case the split extends past the block boundary to include the seek points within the source.
This might seem like a silly question, but suppose the Hadoop block size is X (typically 64 or 128 MB) and a local file's size is Y (where Y is less than X). When I copy file Y to HDFS, will it consume one block, or will Hadoop create a smaller block?
One block is consumed by Hadoop. That does not mean that storage capacity will be consumed in an equivalent manner.
The output while browsing the HDFS from web looks like this:
filename1 file 48.11 KB 3 128 MB 2012-04-24 18:36
filename2 file 533.24 KB 3 128 MB 2012-04-24 18:36
filename3 file 303.65 KB 3 128 MB 2012-04-24 18:37
You can see that each file is smaller than the block size, which is 128 MB. These files are only kilobytes in size.
HDFS capacity is consumed based on the actual file size but a block is consumed per file.
There is a limited number of blocks available, depending on the capacity of HDFS. You are wasting blocks, as you will run out of them before using all of the actual storage capacity. Remember that a Unix filesystem also has the concept of a block size, but it is a very small number, around 512 bytes. This concept is inverted in HDFS, where the block size is kept much larger, around 64-128 MB.
The other issue is that when you run map/reduce programs, the framework tries to spawn a mapper per block, so when you process three small files it may end up spawning three mappers to work on them.
This wastes resources when the files are small. You also add latency, as each mapper takes time to spawn and then ultimately works on a very small file. You should compact them into files closer to the block size to take advantage of mappers working on fewer files.
Yet another issue with numerous small files is that they load the NameNode, which keeps the mapping (metadata) of each file and block in main memory. With smaller files you fill up this table faster, and more main memory is required as the metadata grows.
Read the following for reference:
http://www.cloudera.com/blog/2009/02/the-small-files-problem/
http://www.ibm.com/developerworks/web/library/wa-introhdfs/
Oh! There is a discussion on SO: Small files and HDFS blocks