Default size of input split in Hadoop

What is the default size of an input split in Hadoop? As far as I know, the default block size is 64 MB.
Is there any file in the Hadoop jars in which we can see the default values of all such things, like the default replication factor and any other defaults in Hadoop?

Remember these two parameters: mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize. I refer to these as minSize and maxSize respectively. By default minSize is 1 byte and maxSize is Long.MAX_VALUE. The block size can be 64 MB, 128 MB or more.
The input split size is calculated at runtime by the formula:
max(minSize, min(maxSize, blockSize))
Courtesy: Hadoop: The Definitive Guide.
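To make the formula concrete, here is a minimal Java sketch (not the actual FileInputFormat source, just the same arithmetic) showing how the defaults collapse the split size to the block size; the class and variable names are illustrative:

    public class SplitSizeDemo {
        // max(minSize, min(maxSize, blockSize))
        static long computeSplitSize(long minSize, long maxSize, long blockSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long minSize = 1L;                    // default mapreduce.input.fileinputformat.split.minsize
            long maxSize = Long.MAX_VALUE;        // default mapreduce.input.fileinputformat.split.maxsize
            long blockSize = 128L * 1024 * 1024;  // e.g. a 128 MB HDFS block

            // With the defaults, the split size collapses to the block size.
            System.out.println(computeSplitSize(minSize, maxSize, blockSize)); // prints 134217728
        }
    }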

Yes. The default values live in the *-default.xml files bundled with Hadoop: core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml.
They contain all the default configuration for a Hadoop cluster, and each value can be overridden in the corresponding *-site.xml file under the Hadoop etc/hadoop (or conf) folder.
You can refer to the following links:
https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
And if you have not defined any input split size in your MapReduce program, the default HDFS block size will be used as the input split.
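If you prefer to inspect or override these defaults from code rather than by editing XML, a minimal sketch against Hadoop's Configuration API might look like this (the fallback values passed to get() and getLong() are only illustrative):

    import org.apache.hadoop.conf.Configuration;

    public class DefaultsDemo {
        public static void main(String[] args) {
            // Loads core-default.xml and core-site.xml from the classpath.
            Configuration conf = new Configuration();

            // Read properties, falling back to the given defaults if unset.
            String replication = conf.get("dfs.replication", "3");
            long blockSize = conf.getLong("dfs.blocksize", 134217728L); // 128 MB fallback

            System.out.println("dfs.replication = " + replication);
            System.out.println("dfs.blocksize   = " + blockSize);

            // Override a value for this run only; the site files stay untouched.
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024);
        }
    }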

Related

HDFS BLOCK SIZE

What is the default block size for HDFS, 128 MB or 64 MB? In the Hadoop definitive guide it is mentioned that "HDFS, too, has the concept of a block, but it is a much larger unit—128 MB by default". Can anyone tell me which is the default size?
The best place to look up the default values of Hadoop configuration is the Apache Hadoop website. The default configurations and values are listed in the default XML files published there.
The answer to your question is in hdfs-default.xml. The latest stable version of Hadoop as of this writing is 2.9.0, and the value of the block size (dfs.blocksize) is 128 MB (134217728 bytes).
In Hadoop 1.x the default block size was 64 MB. Now, with high-speed networks and low-cost storage, the default has been raised to 128 MB.
Note: The hdfs-default.xml, core-default.xml and yarn-default.xml files on the Apache website can be used as a reference to find the default values of properties in Apache Hadoop.
This document states that it is 128M (search for dfs.blocksize).
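If you want to confirm the block size your cluster actually uses, rather than reading the docs, a small sketch against the FileSystem API can report it (the path here is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockSizeCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Default block size used for new files under this (illustrative) path.
            long blockSize = fs.getDefaultBlockSize(new Path("/"));
            System.out.println("Default block size: " + blockSize + " bytes"); // e.g. 134217728
        }
    }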

What effects do dfs.blocksize, file.blocksize, kfs.blocksize and etc have in hadoop mapreduce job?

When I check the job.xml file of a Hadoop (version 0.21.0) MapReduce job, I find that multiple block-size settings exist:
dfs.blocksize = 134217728 (i.e. 128MB)
file.blocksize = 67108864 (i.e. 64MB)
kfs.blocksize = 67108864
s3.blocksize = 67108864
s3native.blocksize = 67108864
ftp.blocksize = 67108864
I am expecting some answers to explain the following related questions:
What do dfs, file, kfs, s3, etc. mean in this context?
What are the differences among them?
What effects do they have when running a mapreduce job?
Thank you very much!
MapReduce can work on data stored on different types of storage systems. The settings above are the default block sizes for those storage schemes. dfs (the Hadoop Distributed File System), which is what we commonly use with Hadoop, has a default block size of 128 MB. The other settings are for file (the local file system), kfs (the Kosmos distributed file system), s3 and s3native (Amazon cloud storage) and ftp (files on an FTP server).
You may research them further for a better understanding of each and of how to use them with Hadoop. When a MapReduce job runs, the block-size setting for the particular storage scheme actually being used is the one that takes effect.
I hope this is helpful.
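To see this in practice, you can ask each FileSystem implementation for its default block size; the scheme in the URI selects the implementation. A minimal sketch, with the namenode address purely illustrative:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SchemeBlockSizes {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Local file system; governed by file.blocksize.
            FileSystem local = FileSystem.get(URI.create("file:///"), conf);
            System.out.println("file: " + local.getDefaultBlockSize(new Path("/")));

            // HDFS; governed by dfs.blocksize (host and port are illustrative).
            FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020/"), conf);
            System.out.println("hdfs: " + hdfs.getDefaultBlockSize(new Path("/")));
        }
    }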

Change dfs.block.size on application execution

Since dfs.block.size is an HDFS setting, it shouldn't make a difference if I change it during an application execution, right?
For example, if the block size of the files of a job is 128 and I call
hadoop jar /path/to/.jar xxx -D dfs.block.size=256
would it make a difference or would I need to change the block size before saving the files to HDFS?
Are dfs.block.size and the split size of tasks directly related? If I'm correct and they are not, is there a way to specify the size of a split?
The parameters that decide your split size for each MapReduce job are mapred.max.split.size and mapred.min.split.size, and both can be set per job, individually, through your Configuration object. Don't change dfs.block.size, which affects your HDFS too: it changes the block size of the output files your job writes.
If mapred.min.split.size is less than the block size and mapred.max.split.size is greater than the block size, then one block is sent to each map task. The block data is split into key-value pairs based on the InputFormat you use.
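A minimal sketch of such a per-job override through the Configuration object, using the Hadoop 2.x Job API (the sizes and job name are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class PerJobSplitSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Per-job split bounds; dfs.block.size is left untouched.
            conf.setLong("mapred.min.split.size", 64L * 1024 * 1024);   // 64 MB
            conf.setLong("mapred.max.split.size", 256L * 1024 * 1024);  // 256 MB

            Job job = Job.getInstance(conf, "split-size-demo");
            // ... mapper, reducer, input and output paths would be set here ...
        }
    }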

Why are only 1 map task, 1 reduce task and 1 node used in a Hadoop job?

I have configured a 3-node cluster to run the wordcount MapReduce program. I am using a book whose size is 659 KB (http://www.gutenberg.org/ebooks/20417) as the test data. Interestingly, in the web UI of that job, only 1 map, 1 reduce and 1 node are involved. I am wondering if this is because the data size is too small. If yes, can I manually make it split the data across different map tasks on multiple nodes?
Thanks,
Allen
The default block size is 64 MB. So yes, the framework assigns only one task of each kind because your input data is smaller than that.
1) You can give it input data larger than 64 MB and see what happens.
2) Change the value of mapred.max.split.size, which is specific to MapReduce jobs
(in mapred-site.xml, or by running the job with -D mapred.max.split.size=noOfBytes, as sketched below)
or
3) Change the value of dfs.block.size, which has a more global scope and applies to all of HDFS (in hdfs-site.xml).
Don't forget to restart your cluster to apply changes in case you are modifying the conf files.
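For the -D option in 2) to be picked up, the job's driver needs to go through GenericOptionsParser, typically via ToolRunner. A minimal sketch of such a driver (class and job names are illustrative, and the mapper/reducer setup is elided):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class WordCountDriver extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            // getConf() already contains anything passed with -D on the command line.
            Job job = Job.getInstance(getConf(), "wordcount");
            job.setJarByClass(WordCountDriver.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // ... set mapper, reducer and output key/value classes here ...
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new WordCountDriver(), args));
        }
    }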

What is the default size that each Hadoop mapper will read?

Is it the block size of 64 MB for HDFS? Is there any configuration parameter that I can use to change it?
For a mapper reading gzip files, is it true that the number of gzip files must be equal to the number of mappers?
This is dependent on your:
Input format - some input formats (NLineInputFormat, WholeFileInputFormat) work on boundaries other than the block size. In general, though, anything extending FileInputFormat will use the block boundaries as guides.
File block size - individual files don't need to have the same block size as the default block size. This is set when the file is uploaded into HDFS - if not explicitly set, then the default block size (at the time of upload) is applied. Any changes to the default / system block size after the file is uploaded will have no effect on the already-uploaded file.
The two FileInputFormat configuration properties mapred.min.split.size and mapred.max.split.size usually default to 1 and Long.MAX_VALUE, but if this is overridden in your system configuration, or in your job, then this will change the amount of data processed by each mapper, and the number of mapper tasks spawned.
Non-splittable compression - such as gzip - cannot be processed by more than a single mapper, so you'll get 1 mapper per gzip file (unless you're using something like CombineFileInputFormat or CompositeInputFormat).
So if you have a file with a block size of 64 MB, but want to process more or less than this per map task, then you should be able to set the following job configuration properties:
mapred.min.split.size - larger than the default, if you want to use fewer mappers, at the expense of (potentially) losing data locality (all the data processed by a single map task may now be on 2 or more data nodes)
mapred.max.split.size - smaller than the default, if you want to use more mappers (say you have a CPU-intensive mapper) to process each file
If you're using MR2 / YARN then the above properties are deprecated and replaced by:
mapreduce.input.fileinputformat.split.minsize
mapreduce.input.fileinputformat.split.maxsize
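In the new API, these per-job bounds can also be set through the FileInputFormat helpers, which write the mapreduce.input.fileinputformat.split.minsize / .maxsize properties listed above; the sizes here are illustrative:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class Mr2SplitBounds {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();

            // Fewer, larger splits (fewer mappers): raise the minimum split size.
            FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);

            // Or, for more, smaller splits (more mappers), lower the maximum instead:
            // FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
        }
    }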
