Hadoop's input splitting - how does it work?

I know Hadoop only briefly and I am curious how it works.
To be precise, I want to know how exactly it divides/splits the input file.
Does it divide the input into equal-sized chunks, or is that configurable?
I did go through this post, but I couldn't understand it.

This is dependent on the InputFormat, which for most file-based formats is defined in the FileInputFormat base class.
There are a number of configurable options which determine how Hadoop will take a single file and either process it as a single split or divide the file into multiple splits:
If the input file is compressed, the input format and compression method must be splittable. Gzip, for example, is not splittable (you can't randomly seek to a point in the file and recover the compressed stream). BZip2 is splittable. See the isSplitable() implementation of your specific InputFormat for more information.
If the file size is less than or equal to its defined HDFS block size, then Hadoop will most probably process it in a single split (this can be configured; see a later point about the split size properties).
If the file size is greater than its defined HDFS block size, then Hadoop will most probably divide the file into splits based upon the underlying blocks (4 blocks would result in 4 splits).
You can configure two properties, mapred.min.split.size and mapred.max.split.size, which help the input format when breaking up blocks into splits. Note that the minimum size may be overridden by the input format (which may have a fixed minimum input size).
If you want to know more, and are comfortable looking through the source, check out the getSplits() method in FileInputFormat (both the new and old API have the same method, but they may have some subtle differences).
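As a concrete illustration of the split-size properties mentioned above, here is a minimal driver sketch (class name and input path are made up; it assumes the new mapreduce API, where the properties are exposed through FileInputFormat helper methods):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "split-size-demo");

        // Example input path only.
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));

        // Equivalent to setting mapred.min.split.size / mapred.max.split.size
        // (renamed mapreduce.input.fileinputformat.split.minsize / .maxsize in Hadoop 2).
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // 64 MB
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // 256 MB

        // ... set mapper/reducer classes, output path and submit as usual.
    }
}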

When you submit a MapReduce job (or a Pig/Hive job), Hadoop first calculates the input splits; each input split's size generally equals the HDFS block size. For example, for a 1 GB file with a 64 MB block size, there will be 16 input splits. However, the split size can be configured to be smaller or larger than the HDFS block size. The calculation of input splits is done by FileInputFormat. For each of these input splits, a map task must be started.
But you can change the size of an input split by configuring the following properties:
mapred.min.split.size: The minimum size chunk that map input should be split into.
mapred.max.split.size: The largest valid size in bytes for a file split.
dfs.block.size: The default block size for new files.
And the formula for the input split size is:
Math.max(mapred.min.split.size, Math.min(mapred.max.split.size, blockSize))
You can check examples here.
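As a rough worked example of the formula above (a sketch only, using the 1 GB file / 64 MB block case and assumed default values for the two properties; defaults differ slightly between the old and new APIs):

public class SplitSizeFormula {
    public static void main(String[] args) {
        long minSplitSize = 1L;                 // mapred.min.split.size (effectively 1 byte by default)
        long maxSplitSize = Long.MAX_VALUE;     // mapred.max.split.size (effectively unbounded by default)
        long blockSize    = 64L * 1024 * 1024;  // dfs.block.size, 64 MB in this example

        long splitSize = Math.max(minSplitSize, Math.min(maxSplitSize, blockSize));
        long fileSize  = 1024L * 1024 * 1024;   // the 1 GB file from the example above
        long numSplits = (fileSize + splitSize - 1) / splitSize;

        System.out.println("split size = " + splitSize + " bytes, splits = " + numSplits); // 16 splits
    }
}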

Related

MapReduce: why is the number of splits (text file) more than 1 even for a tiny file?

I know the difference between a physical block and an InputSplit in Hadoop.
BTW, I am using Hadoop 2.0 (YARN).
I have a very tiny input dataset, maybe 1.5 MB in size. When I run a MapReduce program that consumes this tiny dataset, it shows 2 input splits during the run. Why should the tiny dataset be split into two when it is less than 128 MB in size?
In my understanding, the block size is configured to be 128 MB and an input split is a logical division of the data, i.e. where each split starts (in which node and which block) and where it ends. The starting and ending locations of the data define the split.
I didn't get the reason for two splits on such a tiny dataset.
Can someone explain?
Thanks,
nath
First, try to understand how the number of splits is decided; it depends on two things:
If you have not defined any custom split size, then the default is used, which is the block size, in your case 128 MB.
This is the important part: if you have two small files, they will be saved in two different blocks, so the number of splits will be two.
Your answer is in the two points above. As extra info: the relation between the number of mappers and the number of splits is one-to-one, so the number of splits will be the same as the number of mappers.
I had the same problem.
I tried it on a file of size 2.5 MB, and running the job printed this on the console:
number of splits:2
Job Counters
Launched map tasks=2
Logically speaking, it should be one split, as I am using the default setting
split_size 128M, block_size 128M
However, I think the reason the MapReduce job is splitting the input is this config param:
mapreduce.job.maps: The default number of map tasks per job. Ignored when mapreduce.framework.name is "local".
The default value is 2, as you can see in the MapReduce defaults.
You can set it if you are using MRJob for Python:
MRJob.JOBCONF = {'mapreduce.job.maps': '1'}
By doing so, I got one split. Then I ran another file of roughly 450 MB with the same setting, and it was split into 4 (ceil(450/128)), which means this setting is ignored when more splits are needed.
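For a plain Java driver the same hint can be set on the job configuration; this is only a sketch (class and job name are arbitrary), and as noted above the hint is ignored when the input genuinely needs more splits:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SingleMapHint {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hint the framework to use one map task; the InputFormat can still override it.
        conf.setInt("mapreduce.job.maps", 1);
        Job job = Job.getInstance(conf, "single-map-hint");
        // ... configure input/output and submit as usual.
    }
}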

How many output files are created after an MR Job in Hadoop?

I have a file which is much smaller than the default block size. The output from my Mapper is a large number of <key, list<values>> pairs (more than 20).
I read somewhere that the number of output files generated after an MR job is equal to the number of reducers, which in my case is greater than 20. But I got a single file in the output.
Then I set job.setNumReduceTasks(2), hoping that it would generate two output files, but it still generated a single file.
So can I conclude that the number of output files is equal to the number of blocks?
And also, is one block of data fed to one Mapper?
- Block - A Physical Division:
HDFS was designed to hold and manage large amounts of data. The default block size is 64 MB. That means that if a 128 MB text file were put into HDFS, HDFS would divide the file into two blocks (128 MB / 64 MB) and distribute the two chunks to the data nodes in the cluster.
- Split - A Logical Division:
When Hadoop submits a job, it splits the input data logically, and each split is processed by a Mapper. A split is only a reference. The split details are in org.apache.hadoop.mapreduce.InputSplit, and the rules (how to split) are decided by getSplits() in the class org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
By default, split size = block size = 64 MB.
Now consider that your block size is 64 MB. The file you are processing has to be greater than 64 MB for it to be physically split into multiple blocks. If it is less than 64 MB, you will see only a single file in the output, as you mentioned (no matter how many key-value pairs your mapper produces!).
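A tiny arithmetic sketch of the example above, assuming the default case where split size equals block size:

public class BlockAndSplitCount {
    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB block size from the example
        long fileSize  = 128L * 1024 * 1024;  // the 128 MB text file

        // Physical division: HDFS stores the file as ceil(128 MB / 64 MB) = 2 blocks.
        // Logical division: with split size == block size, the job also sees 2 splits,
        // i.e. one mapper per block.
        long blocksAndSplits = (fileSize + blockSize - 1) / blockSize;

        System.out.println("blocks = splits = " + blocksAndSplits);
    }
}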

How to Hadoop MapReduce an entire file

I've played around with various streaming map reduce word count examples where Hadoop/HBase appears to take a large file and break it (at a line break) equally between the nodes. Then it submits each line of the partial document to the map portion of my code. My question is: when I have lots of little unstructured and semi-structured documents, how do I get Hadoop to submit the entire document to my map code?
File splits are calculated by InputFormat.getSplits(). For each input file it computes the splits, and each split is submitted to a mapper. The Mapper then processes the input split based on the InputFormat.
There are different types of InputFormats. Consider, for example, TextInputFormat, which takes text files as input and, for each split, supplies the byte offset of a line as the key and the entire line as the value to the map method of the Mapper. Similarly for other InputFormats.
Now, if you have many small files, say each file smaller than the block size, then each file will be supplied to a different mapper. If a file's size exceeds the block size, it will be split across multiple blocks and processed as multiple splits.
Consider an example where the input files are 1 MB each and you have 64 such files. Also assume that your block size is 64 MB.
Now you will have 64 mappers kicked off, one for each file.
Now consider that you have 100 MB files and you have 2 such files.
Each 100 MB file will be split into 64 MB + 36 MB, so 4 mappers will be kicked off.
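One common way to keep each small document in a single split (not spelled out in the answer above, so treat this as a sketch) is to subclass the input format and disable splitting. Note that TextInputFormat's record reader still hands the mapper one line at a time; delivering the whole file as a single record would additionally need a custom RecordReader.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Hypothetical input format: every input file becomes exactly one split,
// so a single mapper sees all records of one document.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never break a file into multiple splits
    }
}

In the driver you would then register it with job.setInputFormatClass(NonSplittableTextInputFormat.class).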

What exactly does the getSplits() method return?

What exactly does the getSplits() method return?
According to the Apache docs it returns an array of InputSplit. What does that mean?
Does it return the block of file bytes on which the mapper is going to run?
Let's say we have 3 files of 50 MB each; does it then return 64 MB of bytes at [0] (50 MB + 14 MB of the 2nd file), 64 MB at [1] (36 MB of the 2nd + 28 MB of the 3rd), and 36 MB at [2] (the third file), each processed by a different mapper?
If we have one big file of 120 MB, does it then return 64 MB blocks of the same file?
I am not even sure whether what I am asking is logical; I am new to the Hadoop stack.
The getSplits() method returns the splits: metadata about parts of the files. Each map processes one split.
If your file is large, it is divided into parts the size of the HDFS block (at least 64 MB). In your second example there will be two splits, of 64 MB and 56 MB. Nowadays, though, the recommended block size is 128 MB or even 256 MB.
If a file is smaller than the block size, it goes into a separate split. In your first example you will have three splits of 50 MB each. If you want to combine them and process them in one Mapper, you could use CombineFileInputFormat (example).
An input split in MapReduce is the unit of parallelization for the mapper phase. If you have ten input splits then you will have ten mappers. In the general case a file block will map to an input split.
An InputSplit object contains information about the split, but not the split data itself. Depending on the subclass (such as FileSplit) this information could be items such as the location of the split and how large it is.
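A hedged sketch of what that looks like from code (the input path is invented and must exist on whatever filesystem the Configuration points at); it simply calls the new-API getSplits() and prints the FileSplit metadata described above:

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class PrintSplits {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        FileInputFormat.addInputPath(job, new Path("/user/demo/input")); // example path

        // getSplits() returns metadata only: which file, at what offset, how long.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        for (InputSplit split : splits) {
            FileSplit fs = (FileSplit) split;
            System.out.println(fs.getPath() + " start=" + fs.getStart()
                    + " length=" + fs.getLength());
        }
    }
}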

Hadoop MapReduce: default number of mappers

If I don't specify the number of mappers, how would the number be determined? Is there a default setting read from a configuration file (such as mapred-site.xml)?
Adding more to what Chris added above:
The number of maps is usually driven by the number of DFS blocks in the input files. Although that causes people to adjust their DFS block size to adjust the number of maps.
The right level of parallelism for maps seems to be around 10-100 maps per node, although this can go up to 300 or so for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
You can increase the number of map tasks via JobConf's conf.setNumMapTasks(int num). Note: this can increase the number of map tasks, but will not set the number below what Hadoop determines via splitting the input data.
Finally controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits.
A lower bound on the split size can be set via mapred.min.split.size.
Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps.
Read more: http://wiki.apache.org/hadoop/HowManyMapsAndReduces
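A short sketch of the two knobs mentioned above, using the old mapred API (the values are arbitrary examples, not recommendations):

import org.apache.hadoop.mapred.JobConf;

public class MapCountHints {
    public static void main(String[] args) {
        JobConf conf = new JobConf();

        // Only a hint to the InputFormat; it may still create more maps if the splits require it.
        conf.setNumMapTasks(100);

        // Raise the lower bound on split size to get fewer, larger splits (and hence fewer maps).
        conf.set("mapred.min.split.size", String.valueOf(256L * 1024 * 1024)); // 256 MB
    }
}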
It depends on a number of factors:
Input format and particular configuration properties for the format
For file-based input formats (TextInputFormat, SequenceFileInputFormat, etc.):
Number of input files / paths
Whether the files are splittable (typically compressed files are not; SequenceFiles are an exception to this)
Block size of the files
There are probably more, but you hopefully get the idea.
