Hadoop HDFS maximum file size - hadoop

A colleague of mine thinks that HDFS has no maximum file size, i.e., by partitioning into 128 / 256 meg chunks any file size can be stored (obviously the HDFS disk has a size and that will limit, but is that the only limit). I can't find anything saying that there is a limit so is she correct?
thanks, jim

Well, there is obviously a practical limit. But physically, HDFS block IDs are Java longs, so there can be at most 2^63 of them, and if your block size is 64 MB then the maximum size is 512 yottabytes.
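The arithmetic behind that figure can be checked with a short sketch. It assumes 2^63 usable block IDs, exactly 64 MiB per block, and binary units (1 "yottabyte" taken as 2^80 bytes); the function and constant names are made up for illustration.

```python
# Back-of-the-envelope check of the block-ID bound.
# Assumptions: 2**63 usable block IDs, every block exactly 64 MiB,
# and a binary "yottabyte" of 2**80 bytes.
MAX_BLOCK_IDS = 2**63
BLOCK_SIZE_BYTES = 64 * 2**20
YOBIBYTE = 2**80


def max_addressable_bytes(block_ids: int = MAX_BLOCK_IDS,
                          block_size: int = BLOCK_SIZE_BYTES) -> int:
    """Upper bound on the bytes addressable with the given ID space."""
    return block_ids * block_size


if __name__ == "__main__":
    print(max_addressable_bytes() // YOBIBYTE, "YiB")
```

This is only a theoretical ceiling; namenode memory and cluster capacity would be hit long before it.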

I think she's right in saying there's no maximum file size on HDFS. The only thing you can really set is the chunk size, which is 64 MB by default. I guess files of any size can be stored; the only constraint is that the bigger the file, the more hardware you need to accommodate it.

I am not an expert in Hadoop, but AFAIK, there is no explicit limitation on a single file size, though there are implicit factors such as overall storage capacity and maximum namespace size. Also, there might be administrative quotas on the number of entities and directory sizes. The HDFS capacity topic is very well described in this document. Quotas are described here and discussed here.
I'd recommend paying some extra attention to Michael G. Noll's blog, referred to by the last link; it covers many Hadoop-specific topics.

Related

Disk block size and hadoop block size

I have read many posts saying a Hadoop block size of 64 MB reduces metadata and improves performance compared to a 4 KB block size. But why is the data block size exactly 4 KB on an OS disk and 64 MB in Hadoop?
Why not 100 or some other bigger number?
But why is the data block size exactly 4 KB on an OS disk and 64 MB in Hadoop?
In HDFS we store huge amounts of data compared to a single OS filesystem, so it doesn't make sense to have small block sizes for HDFS. With small block sizes there will be more blocks, and the NameNode has to store more metadata about them. Fetching the data will also be slower, as data from a larger number of blocks dispersed across many machines has to be fetched.
Why not 100 or some other bigger number?
Initially the HDFS block size was 64MB and now it's 128MB by default. Check the dfs.blocksize property in hdfs-site.xml here. This is because of the bigger and better storage capacities and speed (HDD and SSD). We shouldn't be surprised when later it's changed to 256MB.
Check this HDFS comic to get a quick overview about HDFS.
In addition to the existing answers, the following is also relevant:
Blocks on an OS level and blocks on a HDFS level are different concepts. When you have a 10kb file on the OS, then that essentially means 3 blocks of 4kb get allocated, and the result is that you consume 12kb.
Obviously you don't want to allocate a large fraction of your space to blocks that are not full, so you need a small blocksize.
On HDFS however, the content of the block determines the size of the block.
So if you have 129MB that could be stored in 1 block of 128MB and 1 block of 1MB. (I'm not sure if it will spread it out differently).
As a result you don't 'lose' the 127 MB which is not allocated in the second block.
With this in mind you will want to have a comparatively large blocksize to optimize block management.
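The difference in rounding can be sketched as follows. This assumes a local filesystem with 4 KB blocks and that an HDFS block is stored on the datanode as an ordinary local file; the helper names are hypothetical.

```python
KB = 1024
MB = 1024 * KB


def os_allocated_bytes(file_size: int, fs_block_size: int = 4 * KB) -> int:
    """Bytes a local filesystem reserves: whole blocks, rounded up."""
    return -(-file_size // fs_block_size) * fs_block_size


def slack_bytes(file_size: int, fs_block_size: int = 4 * KB) -> int:
    """Reserved but unused bytes at the end of the last block."""
    return os_allocated_bytes(file_size, fs_block_size) - file_size


if __name__ == "__main__":
    # A 10 KB local file occupies three 4 KB blocks.
    print(os_allocated_bytes(10 * KB) // KB, "KB allocated")
    # The 1 MB tail block of a 129 MB HDFS file is itself a local file,
    # so it is rounded up to 4 KB granularity, not to 128 MB.
    print(slack_bytes(1 * MB), "bytes of slack for the 1 MB tail block")
```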

HDFS and small files - part 2

This is with reference to the question : Small files and HDFS blocks where the answer quotes Hadoop: The Definitive Guide:
Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage.
I completely agree with this because, as per my understanding, blocks are just a way for the namenode to map which piece of a file is where in the cluster. And since HDFS is an abstraction over our regular filesystems, there is no way a 140 MB file will consume 256 MB of space on HDFS if the block size is 128 MB; in other words, the remaining space in the block will not be wasted.
However, I stumbled upon another answer here in Hadoop Block size and file size issue which says:
There are limited number of blocks available dependent on the capacity of the HDFS. You are wasting blocks as you will run out of them before utilizing all the actual storage capacity.
Does that mean if I have 1280 MB of HDFS storage and I try to load 11 files with size 1 MB each ( considering 128 MB block size and 1 replication factor per block ), the HDFS will throw an error regarding the storage?
Please correct if I am assuming anything wrong in the entire process. Thanks!
No. HDFS will not throw an error, because:
- the 1280 MB storage limit is not exhausted;
- 11 metadata entries won't exceed the memory limits of the namenode.
For example, say we have 3 GB of memory available on the namenode. The namenode needs to store a metadata entry for each file and each block. Each of these entries takes approx. 150 bytes. Thus, you can store roughly 10 million files at most, each having one block (two entries, about 300 bytes, per file). So even if you have much more storage capacity, you will not be able to utilize it fully if you have many small files that hit the memory limit of the namenode.
But the specific example mentioned in the question does not reach this memory limit. Thus, there should not be any error.
Consider a hypothetical scenario in which the available memory on the namenode is just 300 bytes * 10. In this case, it should give an error for the request to store the 11th block.
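A rough sketch of that estimate, assuming about 150 bytes per metadata entry and exactly two entries (one file entry, one block entry) for each single-block file. The function name is made up, and real namenode memory use will differ.

```python
BYTES_PER_ENTRY = 150  # approximate figure quoted in the answer


def max_single_block_files(namenode_memory_bytes: int,
                           bytes_per_entry: int = BYTES_PER_ENTRY) -> int:
    """Files that fit if each needs one file entry plus one block entry."""
    entries_per_file = 2
    return namenode_memory_bytes // (entries_per_file * bytes_per_entry)


if __name__ == "__main__":
    # About 3 GB of namenode memory (decimal gigabytes for round numbers).
    print(max_single_block_files(3 * 10**9))
    # The hypothetical "300 bytes * 10" namenode: the 11th file does not fit.
    print(max_single_block_files(300 * 10))
```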
References:
http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
https://www.mail-archive.com/core-user@hadoop.apache.org/msg02835.html

Hadoop Data Node: why is there a magic "number" for threshold of data blocks?

Experts,
We may see the block count grow in our Hadoop cluster. "Too many" blocks have consequences such as increased heap requirements at the data node, declining execution speeds, more GC, etc. We should take notice when the number of blocks exceeds a certain "threshold".
I have seen different static numbers for thresholds such as 200,000 or 500,000 -- "magic" numbers. Shouldn't it be a function of memory of node (Java Heap Size of DataNode in Bytes)?
Other interesting related questions:
What does a high block count indicate?
a. too many small files?
b. running out of capacity?
is it (a) or (b)? how to differentiate between the two?
What is a small file? A file whose size is smaller than block size (dfs.blocksize)?
Does each file take a new data block on disk? or is it the meta data associated with new file that is the problem?
The effects are more GC, declining execution speeds, etc. How to "quantify" the effects of a high block count?
Thanking in advance
Thanks everyone for their input. I have done some research on the topic and share my findings.
Any static number is a magic number. I propose the block-count threshold to be: heap memory (in GB) x 1 million x comfort percentage (say 50%).
Why?
Rule of thumb: 1 GB for 1M blocks, Cloudera [1]
The actual amount of heap memory required by the namenode turns out to be much lower.
Heap needed = (number of blocks + inodes (files + folders)) x object size (150-300 bytes [1])
For 1 million small files: heap needed = (1M + 1M) x 300 B ≈ 572 MB <== much smaller than the rule of thumb.
High block count may indicate both.
namenode UI states the heap capacity used.
For example,
http://namenode:50070/dfshealth.html#tab-overview
9,847,555 files and directories, 6,827,152 blocks = 16,674,707 total filesystem object(s).
Heap Memory used 5.82 GB of 15.85 GB Heap Memory. Max Heap Memory is 15.85 GB.
** Note, the heap memory used is still higher than 16,674,707 objects x 300 bytes = 4.65gb
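The two formulas above can be put into a small sketch. The 300-byte object size, the 1M-blocks-per-GB rule of thumb and the 50% comfort factor are the assumptions stated in this answer; the function names are made up for illustration.

```python
def estimated_heap_bytes(blocks: int, inodes: int,
                         bytes_per_object: int = 300) -> int:
    """Heap needed = (blocks + inodes) x object size."""
    return (blocks + inodes) * bytes_per_object


def block_threshold(heap_gb: float, comfort: float = 0.5,
                    blocks_per_gb: int = 1_000_000) -> int:
    """Proposed threshold = heap (GB) x 1 million x comfort factor."""
    return int(heap_gb * blocks_per_gb * comfort)


if __name__ == "__main__":
    # 1 million small files: 1M blocks + 1M inodes.
    print(estimated_heap_bytes(1_000_000, 1_000_000) / 2**20, "MiB")
    # Figures from the namenode UI example.
    print(estimated_heap_bytes(6_827_152, 9_847_555) / 2**30, "GiB")
    # Threshold for a hypothetical 16 GB heap at 50% comfort.
    print(block_threshold(16))
```

The estimate is a lower bound at best; as the note above says, the heap actually in use was higher than the object-count estimate.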
To find small files, run
hdfs fsck <path> -blocks | grep "Total blocks (validated):"
It would return something like:
Total blocks (validated): 2402 (avg. block size 325594 B) <== which is smaller than 1mb
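If you want to check this automatically, the summary line can be parsed with a short sketch like the one below. It assumes the line has the format shown above; the 1 MB cut-off and the helper names are illustrative assumptions, not anything defined by HDFS.

```python
import re
from typing import Optional, Tuple

# Matches the fsck summary line in the format shown above.
_SUMMARY = re.compile(
    r"Total blocks \(validated\):\s+(\d+)\s+\(avg\. block size (\d+) B\)"
)


def parse_fsck_summary(line: str) -> Optional[Tuple[int, int]]:
    """Return (total_blocks, avg_block_size_bytes), or None if no match."""
    match = _SUMMARY.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def looks_small(avg_block_size: int, cutoff_bytes: int = 2**20) -> bool:
    """Hypothetical heuristic: average block below the cutoff (1 MB here)."""
    return avg_block_size < cutoff_bytes


if __name__ == "__main__":
    sample = "Total blocks (validated): 2402 (avg. block size 325594 B)"
    print(parse_fsck_summary(sample))
```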
Yes. A file is small if its size < dfs.blocksize.
Each file takes a new data block on disk, though that block is only about as large as the file, so it is a small block.
For every new file, an inode-type object is created (150 B), which puts stress on the heap memory of the name node.
Impact on name and data nodes:
Small files pose problems for both name node and data nodes:
name nodes:
- Pull the ceiling on number of files down as it needs to keep metadata for each file in memory
- Long time in restarting as it must read the metadata of every file from a cache on local disk
data nodes:
- large number of small files means a large amount of random disk IO. HDFS is designed for large files, and benefits from sequential reads.
[1] https://www.cloudera.com/documentation/enterprise/5-8-x/topics/admin_nn_memory_config.html
Your first assumption is wrong, since the data node does not maintain the file structure in memory; it is the job of the name node to keep track of the filesystem (using INodes) in memory. So small files will actually cause your name node to run out of memory faster (since more metadata is required to represent the same amount of data), and the execution speed will be affected since a mapper is created per block.
For an answer to your first question, see: Namenode file quantity limit
Execute the following command: hadoop fs -du -s -h <path>. The first value is the total size of the files under that path (before replication); divide it by the number of files (for example, from hadoop fs -count <path>) to get the average file size. If that average is much smaller than the configured block size, then you are facing the small files problem. To check if you are running out of space: hadoop fs -df -h
Yup, it can be much smaller. If the file is too big for one block, though, it requires an additional block. Once a block is assigned to a file, it cannot be used by other files.
The block does not reserve the space on the disk beyond what it does actually need to store the data, it is metadata on the namenode which imposes the limits.
As I said before, the issue is that more mapper tasks need to be executed for the same amount of data. Since each mapper is run in a new JVM, GC is not a problem, but the overhead of starting it to process a tiny amount of data is.

How to set data block size in Hadoop ? Is it advantage to change it?

If we can change the data block size in Hadoop please let me know how to do that.
Is it advantageous to change the block size, If yes, then let me know Why and how? If no, then let me know why and how?
You can change the block size any time unless dfs.blocksize parameter is defined as final in hdfs-site.xml.
To change the block size:
- While running a hadoop fs command, you can run hadoop fs -Ddfs.blocksize=67108864 -put <local_file> <hdfs_path>. This command saves the file with a 64 MB block size.
- While running a hadoop jar command: hadoop jar <jar_file> <class> -Ddfs.blocksize=<desired_block_size> <other_args> (this works when the main class parses generic options, e.g. via ToolRunner). The reducer will use the defined block size when storing the output in HDFS.
- As part of the MapReduce program, you can use job.set to set the value.
Criteria for changing block size:
Typically 128 MB for uncompressed files works well
You can consider reducing block size on compressed files. If the compression rate is too high then having higher block size might slow down the processing. If the compression codec is not splittable, it will aggravate the issue.
As long as the file size is larger than the block size, you need not change the block size. If the number of mappers needed to process the data is very high, you can reduce it by increasing the split size. For example, if you have 1 TB of data with a 128 MB block size, then by default it will take roughly 8,000 mappers. Instead of changing the block size, you can consider changing the split size to 512 MB or even 1 GB, and it will take far fewer mappers to process the data.
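The mapper arithmetic can be sketched as below. It assumes one mapper per input split, splittable input, and binary units; the function name is made up, and real jobs may differ slightly because of how the input format computes splits.

```python
MB = 2**20
GB = 2**30
TB = 2**40


def mapper_count(input_bytes: int, split_bytes: int) -> int:
    """Number of splits (and so mappers), rounding the last split up."""
    return -(-input_bytes // split_bytes)


if __name__ == "__main__":
    for split in (128 * MB, 512 * MB, 1 * GB):
        print(split // MB, "MB split ->", mapper_count(1 * TB, split), "mappers")
```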
I have covered most of this in 2 and 3 of this performance tuning playlist.
There seems to be much confusion about this topic, and also wrong advice going around. To clear up the confusion, it helps to think about how HDFS is actually implemented:
HDFS is an abstraction over distributed disk-based file systems. So the words "block" and "blocksize" have a different meaning than generally understood. For HDFS a "file" is just a collection of blocks, each "block" in return is stored as an actual file on a datanode. In fact the same file is stored on several datanodes, according to the replication factor. The blocksize of these individual files and their other performance characteristics in turn depend on the underlying filesystems of the individual datanodes.
The mapping between an HDFS file and the individual files on the datanodes is maintained
by the namenode. But the namenode doesn't expect a specific blocksize; it just stores the
mappings which were created during the creation of the HDFS file, which is usually split
according to the default dfs.blocksize (but this can be overridden per file).
This means for example if you have 1 MB file with a replication of 3 and a blocksize of 64
MB, you don't lose 63 MB * 3 = 189 MB, since physically just three 1 MB files are stored
with the standard blocksize of the underlying filesystems (e.g. ext4).
So the question becomes what a good dfs.blocksize is and if it's advisable to change it.
Let me first list the aspects speaking for a bigger blocksize:
Namenode pressure: As mentioned, the namenode has to maintain the mappings between DFS files and their blocks to physical files on datanodes. So the fewer blocks per file, the less memory pressure and communication overhead it has.
Disk throughput: Files are written by a single process in Hadoop, which usually results in data being written sequentially to disk. This is especially advantageous for rotational disks because it avoids costly seeks. If the data is written that way, it can also be read that way, so it becomes an advantage for reads and writes. In fact, this optimization in combination with data locality (i.e. doing the processing where the data is) is one of the main ideas of MapReduce.
Network throughput: Data locality is the more important optimization, but in a distributed system this can not always be achieved, so sometimes it's necessary to copy data between nodes. Normally one file (dfs block) is transferred via one persistent TCP connection which can reach a higher throughput when big files are transferred.
Bigger default splits: Even though the splitsize can be configured at the job level, most people don't consider this and just go with the default, which is usually the blocksize. If your splitsize is too small, though, you can end up with too many mappers which don't have much work to do, which in turn can lead to even smaller output files, unnecessary overhead and many occupied containers which can starve other jobs. This also has an adverse effect on the reduce phase, since the results must be fetched from all mappers.
Of course the ideal splitsize heavily depends on the kind of work you've to do. But you always can set a lower splitsize when necessary, whereas when you set a higher splitsize than the blocksize you might lose some data locality.
The latter aspect is less of an issue than one would think though, because the default rule for replica placement in HDFS is: the first replica of a block is written on the datanode where the process creating the file runs, the second one on a node in a different rack and the third one on another node in the same rack as the second. So usually one replica of each block of a file can be found on a single datanode, and data locality can still be achieved even when one mapper is reading several blocks due to a splitsize which is a multiple of the blocksize. Still, in this case the mapred framework can only select one node instead of the usual three to achieve data locality, so an effect can't be denied.
But ultimately this point for a bigger blocksize is probably the weakest of all, since one can set the splitsize independently if necessary.
But there also have to be arguments for a smaller blocksize otherwise we should just set it to infinity…
Parallelism/Distribution: If your input data lies on just a few nodes even a big cluster doesn't help to achieve parallel processing, at least if you want to maintain some data locality. As a rule I would say a good blocksize should match what you also can accept as a splitsize for your default workload.
Fault tolerance and latency: If a network connection breaks, the cost of retransmitting a smaller file is lower. TCP throughput might be important, but individual connections shouldn't take forever either.
Weighing these factors against each other depends on your kind of data, cluster, workload etc. But in general I think the default blocksize of 128 MB is already a little low for typical use cases. 512 MB or even 1 GB might be worth considering.
But before you even dig into that you should first check the size of your input files. If most of your files are small and don't even reach the max default blocksize your blocksize is basically always the filesize and it wouldn't help anything to increase the default blocksize. There are workarounds like using an input combiner to avoid spawning too many mappers, but ultimately you need to ensure your input files are big enough to take advantage of a big blocksize.
And if your files are already small don't compound the problem by making the blocksize even smaller.
It depends on the input data. The number of mappers is directly proportional to the number of input splits, which depends on the DFS block size.
If you want to maximize throughput for a very large input file, using very large blocks (128MB or even 256MB) is best.
If a job has more than 1TB of input, consider increasing the block size of the input dataset to 256M or even 512M so that the number of tasks will be smaller.
For smaller files, using a smaller block size is better.
Have a look at this article
If you have small files that are smaller than the DFS block size, you can use alternatives like HAR or SequenceFiles.
Have a look at this cloudera blog

Query regarding the block size

With regards to the HDFS, I read from their site under the Data Replication section (below link) that
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html#Data+Replication
'all blocks in a file except the last block are the same size'
Could you please let me know what is the reason the last block would not be of the same size?
Could it be that the total memory allocation may play a part over here?
However, if memory size is not an issue, would the last block still not be the same size as the rest of the blocks in a file?
And if yes, could you please elaborate a bit on this?
Any link to the JIRA for the development effort for this would be greatly appreciated.
Actually, this is not an issue at all. The last block of a file is simply not guaranteed to be the same size as the others.
Consider a file of size 1000 MB and a block size of 128 MB: the file will be split into 8 blocks, where the first 7 blocks are all the same size, 128 MB each.
The total size of those 7 blocks is 896 MB (7 * 128 MB), hence the remaining size is 104 MB (1000 - 896). So the last block's actual size will be 104 MB, whereas the other 7 blocks are 128 MB each.
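The same split can be written as a tiny sketch. Sizes are in MB for readability, and the function name is made up for illustration; it models only the arithmetic, not how HDFS allocates blocks internally.

```python
from typing import List


def hdfs_block_sizes(file_size: int, block_size: int) -> List[int]:
    """Sizes of the blocks a file is split into: full blocks plus a tail."""
    full_blocks, tail = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([tail] if tail else [])


if __name__ == "__main__":
    # 1000 MB file with a 128 MB block size.
    print(hdfs_block_sizes(1000, 128))
```

Only when the file size is an exact multiple of the block size does the last block match the others.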
The namenode will allocate a data block for every chunk of the file being stored on HDFS. It makes no special provision for chunks whose size is less than the data block size.
HDFS is designed to store chunks of data in equally sized data blocks so that the data blocks available on data nodes can be easily calculated and maintained by the namenode.

Resources