What is the default size that each Hadoop mapper will read?

Is it the block size of 64 MB for HDFS? Is there any configuration parameter that I can use to change it?
For a mapper reading gzip files, is it true that the number of gzip files must be equal to the number of mappers?

This is dependent on your:
Input format - some input formats (NLineInputFormat, WholeFileInputFormat) work on boundaries other than the block size. In general, though, anything that extends FileInputFormat will use the block boundaries as guides.
File block size - individual files don't need to have the same block size as the default block size. This is set when the file is uploaded into HDFS - if not explicitly set, then the default block size (at the time of upload) is applied. Any changes to the default / system block size after the file is uploaded will have no effect on files already in HDFS.
The two FileInputFormat configuration properties mapred.min.split.size and mapred.max.split.size usually default to 1 and Long.MAX_VALUE, but if these are overridden in your system configuration, or in your job, then this will change the amount of data processed by each mapper, and the number of mapper tasks spawned.
Non-splittable compression - a format such as gzip cannot be processed by more than a single mapper, so you'll get one mapper per gzip file (unless you're using something like CombineFileInputFormat or CompositeInputFormat)
So if you have a file with a block size of 64 MB, but want to process either more or less than this per map task, then you should just be able to set the following job configuration properties (a short sketch follows after the list):
mapred.min.split.size - larger than the default, if you want to use fewer mappers, at the expense of (potentially) losing data locality (all data processed by a single map task may now be on 2 or more data nodes)
mapred.max.split.size - smaller than the default, if you want to use more mappers (say you have a CPU-intensive mapper) to process each file
If you're using MR2 / YARN then the above properties are deprecated and replaced by:
mapreduce.input.fileinputformat.split.minsize
mapreduce.input.fileinputformat.split.maxsize
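For example, a minimal driver sketch using the MR2 FileInputFormat helpers (the 32 MB / 128 MB values and the elided mapper/reducer wiring are illustrative assumptions, not part of the original answer):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "split-size-demo");

            // More mappers: cap each split below the block size
            // (writes mapreduce.input.fileinputformat.split.maxsize).
            FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);

            // Fewer mappers (alternative to the line above): raise the minimum
            // split size above the block size
            // (writes mapreduce.input.fileinputformat.split.minsize).
            // FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);

            // ... set jar, mapper/reducer classes and input/output paths, then submit ...
        }
    }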

Related

Mapreduce configuration : mapreduce.job.split.metainfo.maxsize

I want to understand the property mapreduce.job.split.metainfo.maxsize and its effect. The description says:
The maximum permissible size of the split metainfo file. The JobTracker won't attempt to read split metainfo files bigger than the configured value. No limits if set to -1.
What does "split metainfo file" contain? I have read that it will store the meta info about the input splits. Input split is a logical wrapping on the blocks to create complete records, right? Does the split meta info contain the block address of the actual record that might be available in multiple blocks?
When a Hadoop job is submitted, the whole set of input files is sliced into "splits", and the metadata describing those splits is written to a split metainfo file so the scheduler knows where each split's data lives. There is a limit on the size of this split metadata: the property mapreduce.job.split.metainfo.maxsize (mapreduce.jobtracker.split.metainfo.maxsize in older releases) controls it, and its default value is 10,000,000 bytes. You can work around this limit by increasing the value, or remove it altogether by setting the value to -1.
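If a job does hit that limit, the value can be raised or disabled per job; a minimal sketch, assuming you set it from the driver (the -1 comes straight from the description quoted above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SplitMetaInfoLimitDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // -1 disables the split metainfo size check for this job only.
            conf.setLong("mapreduce.job.split.metainfo.maxsize", -1L);
            Job job = Job.getInstance(conf, "split-metainfo-limit-demo");
            // ... configure mapper/reducer and paths as usual, then submit ...
        }
    }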

Default size of input split in Hadoop

What is the default size of an input split in Hadoop? As I know, the default block size is 64 MB.
Is there any file in the Hadoop jar in which we can see the default values of all such things? Like the default replication factor, etc. - anything that has a default in Hadoop.
Remember these two parameters: mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize. I refer to these as minSize and maxSize respectively. By default minSize is 1 byte and maxSize is Long.MAX_VALUE. The block size can be 64 MB, 128 MB or more.
The input split size is calculated by a formula like this during runtime:
max(minSize, min(maxSize, blockSize))
Courtesy: Hadoop: The Definitive Guide.
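That rule matches what FileInputFormat's split-size computation does; a minimal standalone sketch of the same arithmetic (the values are illustrative):

    public class SplitSizeFormula {
        // max(minSize, min(maxSize, blockSize))
        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        public static void main(String[] args) {
            long blockSize = 128L * 1024 * 1024; // 128 MB block
            long minSize = 1L;                   // default minSize
            long maxSize = Long.MAX_VALUE;       // default maxSize
            // With the defaults, the split size collapses to the block size.
            System.out.println(computeSplitSize(blockSize, minSize, maxSize)); // 134217728
        }
    }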
Yes, you can see all of these defaults. They are listed in the files core-default.xml, hdfs-default.xml, yarn-default.xml and mapred-default.xml, which ship inside the Hadoop jars and are published in the documentation.
They contain all the default configuration for a Hadoop cluster, and every value can be overridden in the corresponding *-site.xml file under etc/hadoop (or conf/).
You can refer following links:
https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
And if you have not defined any input split size in your MapReduce program, then the HDFS block size is used as the input split size by default.

Change dfs.block.size on application execution

Since dfs.block.size is an HDFS setting, it shouldn't make a difference if I change it during an application execution, right?
For example, if the block size of the files used by a job is 128 and I call
hadoop jar /path/to/.jar xxx -D dfs.block.size=256
would it make a difference or would I need to change the block size before saving the files to HDFS?
Are dfs.block.size and the split size of tasks directly related? If I'm correct and they are not, is there a way to specify the size of a split?
The parameters that decide the split size for each MR job are mapred.min.split.size and mapred.max.split.size.
mapred.max.split.size can be set per job, individually, through your conf object. Don't change dfs.block.size, which affects your HDFS as a whole and would also change the block size of the files your job writes out.
If mapred.min.split.size is less than the block size and mapred.max.split.size is greater than the block size, then one block is sent to each map task. The block's data is split into key/value pairs based on the InputFormat you use.
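A minimal sketch of setting the split size on the per-job conf object via a Tool-style driver (the SplitSizeDriver name, the 64 MB value and the elided mapper/reducer wiring are illustrative assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class SplitSizeDriver extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            // getConf() already holds any -D key=value options passed on the
            // command line, because ToolRunner parses them before calling run().
            Configuration conf = getConf();

            // Or set the split size programmatically - this affects only this job,
            // not HDFS itself (use the mapreduce.* names on MR2/YARN):
            conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024);

            Job job = Job.getInstance(conf, "split-size-driver");
            // ... set jar, mapper/reducer classes and input/output paths here ...
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            // e.g. hadoop jar myjob.jar SplitSizeDriver \
            //        -D mapreduce.input.fileinputformat.split.maxsize=67108864 in out
            System.exit(ToolRunner.run(new Configuration(), new SplitSizeDriver(), args));
        }
    }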

how output files(part-m-0001/part-r-0001) are created in map reduce

I understand that the map reduce output are stored in files named like part-r-* for reducer and part-m-* for mapper.
When I run a MapReduce job, sometimes I get the whole output in a single file (around 150 MB), and sometimes for almost the same data size I get two output files (one 100 MB and the other 50 MB). This seems very random to me; I can't find any reason for it.
I want to know how it is decided whether that data goes into a single output file or multiple output files, and whether there is any way we can control it.
Thanks
Contrary to what is stated in the answer by Jijo here - the number of files depends on the number of Reducers/Mappers.
It has nothing to do with the number of physical nodes in the cluster.
The rule is: one part-r-* file per Reducer. The number of Reducers is set by job.setNumReduceTasks().
If there are no Reducers in your job, then there is one part-m-* file per Mapper. There is one Mapper per InputSplit (usually - unless you use a custom InputFormat implementation, there is one InputSplit per HDFS block of your input data).
The number of output files part-m-* and part-r-* is set according to the number of map tasks and the number of reduce tasks respectively.
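So to control the number of output files, pin the number of reduce tasks explicitly; a minimal sketch (the count of 2 is just an example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class OutputFileCountDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "output-file-count-demo");

            // Two reduce tasks -> two output files: part-r-00000 and part-r-00001.
            job.setNumReduceTasks(2);

            // With zero reduce tasks the map output is written directly instead,
            // one part-m-* file per map task:
            // job.setNumReduceTasks(0);

            // ... set mapper/reducer classes and input/output paths, then submit ...
        }
    }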

Why only 1 map and 1 reduce task and 1 node is used in a Hadoop job?

I have configured a 3-node cluster to run the wordcount MapReduce program. I am using a book whose size is 659 KB (http://www.gutenberg.org/ebooks/20417) as the test data. Interestingly, in the web UI of that job, only 1 map, 1 reduce and 1 node are involved. I am wondering if this is because the data size is too small. If yes, could I manually make it split the data into different maps on multiple nodes?
Thanks,
Allen
The default block size is 64 MB. So yes, the framework does assign only one task of each kind because your input data is smaller.
1) You can either give input data that is larger than 64 MB and see what happens.
2) Change the value of mapred.max.split.size, which is specific to MapReduce jobs
(in mapred-site.xml, or by running the job with -D mapred.max.split.size=noOfBytes; see the sketch below)
or
3) Change the value of dfs.block.size, which has a more global scope and applies to all of HDFS (in hdfs-site.xml); note it only affects files written after the change.
Don't forget to restart your cluster to apply changes in case you are modifying the conf files.
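For option 2, a minimal driver sketch that caps the split size well below the 659 KB file so several map tasks are spawned (the 100 KB cap and the elided WordCount wiring are illustrative assumptions):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SmallFileMoreMappers {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "wordcount-more-mappers");

            // Cap each split at 100 KB: a 659 KB plain-text (splittable) input
            // then yields roughly 7 input splits, i.e. about 7 map tasks.
            FileInputFormat.setMaxInputSplitSize(job, 100L * 1024);

            // ... set the usual WordCount mapper/reducer, input/output paths, then submit ...
        }
    }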
