Split size vs Block size in Hadoop

What is the relationship between split size and block size in Hadoop? As I read in this, the split size must be n times the block size (where n is an integer and n > 0). Is this correct? Is there any required relationship between split size and block size?

In the HDFS architecture there is a concept of blocks. A typical block size used by HDFS is 64 MB. When we place a large file into HDFS it is chopped up into 64 MB chunks (based on the default block configuration). Suppose you have a file of 1 GB and you want to place that file in HDFS; then there will be 1 GB / 64 MB = 16 blocks, and these blocks will be distributed across the DataNodes. Each block/chunk will reside on a different DataNode, based on your cluster configuration.
Data splitting happens based on file offsets. The goal of splitting the file and storing it in different blocks is parallel processing and failover of data.
Difference between block size and split size.
A split is a logical division of the data, used during data processing by a Map/Reduce program or other data-processing techniques in the Hadoop ecosystem. The split size is a user-defined value, and you can choose your own split size based on the volume of data you are processing.
The split is basically used to control the number of Mappers in a Map/Reduce program. If you have not defined an input split size in your Map/Reduce program, then the default HDFS block size is used as the input split size.
Example:
Suppose you have a file of 100 MB and the HDFS default block size is 64 MB; then the file will be chopped into 2 blocks and occupy 2 blocks. Now you have a Map/Reduce program to process this data but you have not specified an input split size; then, based on the number of blocks (2), 2 input splits will be used for the Map/Reduce processing and 2 Mappers will get assigned for this job.
But suppose you have specified a split size (say 100 MB) in your Map/Reduce program; then both blocks (2 blocks) will be considered as a single split for the Map/Reduce processing and 1 Mapper will get assigned for this job.
Suppose you have specified a split size of, say, 25 MB in your Map/Reduce program; then there will be 4 input splits for the Map/Reduce program and 4 Mappers will get assigned for the job.
Conclusion:
A split is a logical division of the input data, while a block is a physical division of the data.
The HDFS default block size is the default split size if no input split size is specified.
The split size is user defined, and the user can control it in his Map/Reduce program (see the driver sketch below).
One split can map to multiple blocks, and one block can contain multiple splits.
The number of map tasks (Mappers) is equal to the number of splits.
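As a concrete illustration of the points above, here is a minimal driver sketch using the standard FileInputFormat helpers (the class name and the commented-out mapper/reducer are placeholders, not code from the question):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SplitSizeDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "split-size-demo");
            job.setJarByClass(SplitSizeDemo.class);
            job.setInputFormatClass(TextInputFormat.class);

            // Hypothetical mapper/reducer classes would be set here:
            // job.setMapperClass(MyMapper.class);
            // job.setReducerClass(MyReducer.class);

            // Cap each input split at 25 MB: a 100 MB input file then
            // yields 4 splits and therefore 4 map tasks.
            FileInputFormat.setMaxInputSplitSize(job, 25L * 1024 * 1024);

            // Or force splits of at least 100 MB, so the whole 100 MB file
            // becomes a single split handled by one mapper:
            // FileInputFormat.setMinInputSplitSize(job, 100L * 1024 * 1024);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }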

Assume we have a file of 400 MB which consists of 4 records (e.g. a 400 MB CSV file with 4 rows of 100 MB each).
If the HDFS block size is configured as 128 MB, the 4 records will not be distributed among the blocks evenly. It will look like this:
Block 1 contains the entire first record and a 28 MB chunk of the second record.
If a mapper were run on Block 1 alone, it could not process it, since it wouldn't have the entire second record.
This is exactly the problem that input splits solve: input splits respect logical record boundaries.
Let's assume the input split size is 200 MB.
Then input split 1 contains both record 1 and record 2, and input split 2 does not start with record 2 (it has already been assigned to input split 1) but with record 3.
This is why an input split is only a logical chunk of data. It points to start and end locations within blocks.
If the input split size is n times the block size, an input split can span multiple blocks, so fewer Mappers are needed for the whole job and there is less parallelism. (The number of mappers is the number of input splits.)
Input split size = block size is the ideal configuration.
Hope this helps.
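The boundary handling described above can be illustrated with a small plain-Java sketch (this is not Hadoop's actual LineRecordReader, just the idea behind it): a reader that starts mid-file discards the partial first line, because the previous split owns it, and reads past its end offset to finish its last record.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Illustrative only: reads the records of one "split" of a text file,
    // respecting line boundaries the way Hadoop's line readers do.
    public class SplitBoundaryDemo {

        static void readSplit(String path, long start, long length) throws IOException {
            long end = start + length;
            try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
                in.seek(start);
                if (start != 0) {
                    // We probably landed mid-record; the previous split owns that
                    // record, so skip ahead to the first complete line.
                    in.readLine();
                }
                long pos = in.getFilePointer();
                // Emit every record that *starts* at or before 'end'; the last one
                // may run past 'end' into the next block -- that is the whole point.
                while (pos <= end) {
                    String record = in.readLine();
                    if (record == null) break;      // end of file
                    System.out.println("record: " + record);
                    pos = in.getFilePointer();
                }
            }
        }

        public static void main(String[] args) throws IOException {
            // e.g. the first 200 MB "split" of the 400 MB file from the example
            readSplit(args[0], 0, 200L * 1024 * 1024);
        }
    }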

How splits are created depends on the InputFormat being used. FileInputFormat's getSplits() method decides the splits for each input file; note the role played by the SPLIT_SLOP factor (1.1).
The corresponding Java source boils down to the method computeSplitSize(), which expands to max(minSize, min(maxSize, blockSize)), where the min/max sizes can be configured by setting mapreduce.input.fileinputformat.split.minsize/maxsize.
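Here is a minimal, self-contained sketch of that sizing logic (class name and simplifications are mine; the real getSplits() additionally handles block locations, unsplittable compression codecs and empty files):

    // Simplified re-statement of FileInputFormat's split-size arithmetic.
    public class SplitMath {

        private static final double SPLIT_SLOP = 1.1; // last split may overflow by 10%

        static long computeSplitSize(long blockSize, long minSize, long maxSize) {
            // mapreduce.input.fileinputformat.split.minsize -> minSize
            // mapreduce.input.fileinputformat.split.maxsize -> maxSize
            return Math.max(minSize, Math.min(maxSize, blockSize));
        }

        static int countSplits(long fileLength, long blockSize, long minSize, long maxSize) {
            long splitSize = computeSplitSize(blockSize, minSize, maxSize);
            int splits = 0;
            long remaining = fileLength;
            while (((double) remaining) / splitSize > SPLIT_SLOP) {
                splits++;
                remaining -= splitSize;
            }
            if (remaining > 0) splits++;   // the tail becomes the (possibly oversized) last split
            return splits;
        }

        public static void main(String[] args) {
            long mb = 1024L * 1024;
            // 100 MB file, 64 MB blocks, default min/max -> prints 2
            System.out.println(countSplits(100 * mb, 64 * mb, 1, Long.MAX_VALUE));
        }
    }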

Related

Blocks in MapReduce

I have a very important question because I must give a presentation about MapReduce.
My question is:
I have read that the file in MapReduce is divided into blocks and every block is replicated on 3 different nodes. A block can be 128 MB. Is this block the input file? I mean, will this 128 MB block be split into parts, with every part going to a single map? If yes, into what size will this 128 MB be divided?
Or does the file break into blocks, and these blocks are the input for the mapper?
I'm a little bit confused.
Could you look at the picture and tell me which one is right?
Here the HDFS file is divided into blocks, and every single 128 MB block is the input for one Map.
Here the HDFS file is one block, and this 128 MB is split so that every part is the input for one Map.
Let's say you have a file of 2 GB and you want to place that file in HDFS; then there will be 2 GB / 128 MB = 16 blocks, and these blocks will be distributed across the different DataNodes.
Data splitting happens based on file offsets. The goal of splitting the file and storing it in different blocks is parallel processing and failover of data.
A split is a logical division of the data, used during data processing by a Map/Reduce program or other data-processing techniques in Hadoop. The split size is a user-defined value, and one can choose one's own split size based on the volume of data being processed.
The split is basically used to control the number of Mappers in a Map/Reduce program. If you have not defined an input split size in your Map/Reduce program, then the default HDFS block is taken as the input split (i.e. input split = block, so 16 mappers will be triggered for a 2 GB file). If the split size is defined as 100 MB (let's say), then 21 Mappers will be triggered (20 Mappers for 2000 MB and a 21st Mapper for the remaining 48 MB), as sketched below.
Hope this clears your doubt.
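The 100 MB split size from that example can be requested through the standard split-size property; a minimal sketch (class name is mine, and the rest of the job setup is omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class HundredMbSplitJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Cap the split size at 100 MB. With 128 MB blocks, the split size
            // becomes max(minsize, min(100 MB, 128 MB)) = 100 MB, so a 2 GB
            // (2048 MB) file yields 21 splits and therefore 21 mappers.
            conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 100L * 1024 * 1024);
            Job job = Job.getInstance(conf, "hundred-mb-split-job");
            // ... set mapper/reducer, input and output paths as usual ...
        }
    }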
HDFS stores the file as blocks, and each block is 128 MB in size (by default).
MapReduce processes this HDFS file. Each mapper processes one block (one input split).
So, to answer your question, 128 MB is a single block size, and it will not be split further.
Note: the input split used in the MapReduce context is a logical split, whereas the split mentioned for HDFS is a physical split.
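If you want to see the physical blocks of a file for yourself, the HDFS FileSystem API exposes them. The following is a small standalone sketch (class name is mine; pass the HDFS path you want to inspect as the first argument):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Prints the *physical* blocks of an HDFS file (offset, length, hosts),
    // to contrast with the *logical* input splits computed at job submission.
    public class ShowBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            FileStatus status = fs.getFileStatus(new Path(args[0]));
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset=" + b.getOffset()
                        + " length=" + b.getLength()
                        + " hosts=" + String.join(",", b.getHosts()));
            }
        }
    }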

How does HDFS store a single record which is larger than the block size?

How will Hadoop split the data in case a single record of mine is larger than the block size?
E.g. the data (a single record) I am storing is 80 MB in size and the block size is 64 MB, so how does Hadoop manage such a scenario?
If we use a 64 MB block size then the data will be loaded into only two blocks (64 MB and 16 MB). Hence the size of the metadata is decreased.
Edit:
The Hadoop framework divides a large file into blocks (64 MB or 128 MB) and stores them on the slave nodes. HDFS is unaware of the content of the blocks. While writing the data into a block it may happen that a record crosses the block limit, so that part of the record is written to one block and the other part to another block.
The way Hadoop tracks this splitting of the data is through the logical representation of the data known as an input split. When the MapReduce client calculates the input splits, it checks whether the entire record resides in the same block or not. If the record overflows and some part of it is written into another block, the input split captures the location information of the next block and the byte offset of the data needed to complete the record. This usually happens with multi-line records, as Hadoop is intelligent enough to handle the single-line record scenario.
Usually the input split is configured to be the same size as the block, but consider what happens if the input split is larger than the block size. The input split represents the amount of data that will go to one mapper. Consider the example below:
• Input split = 256MB
• Block size = 128 MB
Then the mapper will process two blocks that can be on different machines, which means that to process the split the mapper will have to transfer data between machines. Hence, to avoid unnecessary data movement (and preserve data locality), we usually keep the input split the same size as the block.
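For completeness, here is how one could deliberately force the 256 MB splits from the example, using the standard FileInputFormat helper (class name is mine; normally you would not do this, for exactly the data-locality reason above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class BigSplitDemo {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "big-split-demo");
            // Forces splits of at least 256 MB even though dfs.blocksize is 128 MB:
            // computeSplitSize() = max(256 MB, min(maxsize, 128 MB)) = 256 MB.
            // Each mapper then covers two blocks; the second block may live on
            // another node and has to be streamed over the network (lost locality).
            FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
            // ... rest of the job setup ...
        }
    }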

How does Hadoop fix the number of mappers or input splits when a MapReduce job runs over multiple input files?

I've four input files (CSV) of sizes 453 MB, 449 MB, 646 MB and 349 MB. Together they add up to a total of 1.85 GB.
The HDFS block size is 128 MB.
The record size is very small, as there are hardly 20 fields.
After the completion of the MapReduce job, I can observe that 16 mappers were used for the input files I provided.
I would like to know how Hadoop determined the number of mappers, or input splits, for multiple input files.
Each file undergoes splitting (based on the split size) individually, unless you are using CombineFileInputFormat.
Assuming the mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize properties are at their defaults, the split size will be approximately equal to dfs.blocksize.
So, in this case:
File 1: 453 MB = 4 splits
File 2: 449 MB = 4 splits
File 3: 646 MB = 5 splits (the boundary being very close, ~640 MB)
File 4: 349 MB = 3 splits
That is a total of 16 splits, and with one mapper per split, a total of 16 mappers will be spawned. Also refer to this answer for the split size computation formula.
UPDATE: Although File 3 has 6 blocks, the 6th block's data remains part of the 5th split. This is decided by the SPLIT_SLOP factor, which is 1.1 by default (the last split is allowed to overflow by up to 10%).
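For anyone who wants to check the arithmetic, a short self-contained sketch (class name is mine; it assumes 128 MB splits and the default SPLIT_SLOP of 1.1, as described above) reproduces the per-file split counts:

    // Reproduces the split counts above: 128 MB split size, SPLIT_SLOP = 1.1
    // (the last split may absorb up to 10% extra).
    public class MapperCount {
        public static void main(String[] args) {
            long mb = 1024L * 1024;
            long splitSize = 128 * mb;                     // = dfs.blocksize here
            long[] files = {453 * mb, 449 * mb, 646 * mb, 349 * mb};
            int total = 0;
            for (long remaining : files) {
                int splits = 0;
                while ((double) remaining / splitSize > 1.1) {
                    splits++;
                    remaining -= splitSize;
                }
                if (remaining > 0) splits++;
                System.out.println(splits + " splits");    // prints 4, 4, 5, 3
                total += splits;
            }
            System.out.println("total mappers: " + total); // prints 16
        }
    }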
The number of maps is usually driven by the number of HDFS blocks in the input files. The number of Mappers is calculated from the number of splits; however, if a file is smaller than the split size, that file corresponds to one mapper.
For each input file, given the file length and the block size, Hadoop calculates the split size as max(minSize, min(maxSize, blockSize)), where maxSize corresponds to mapred.max.split.size and minSize to mapred.min.split.size.
No. of mappers per file ≈ file size / input split size (subject to the SPLIT_SLOP factor described above).
Here is a reference about the number of Mappers and Reducers on the Apache wiki: http://wiki.apache.org/hadoop/HowManyMapsAndReduces

How many output files are created after an MR Job in Hadoop?

I have a file which is much smaller than the default block size. The output from my Mapper is a large number of <key, list<values>> pairs (more than 20).
I read somewhere that the number of output files generated after an MR job is equal to the number of reducers, which in my case is greater than 20. But I got a single file in the output.
Then I set job.setNumReduceTasks(2), hoping that it would generate two files in the output. But it still generated a single file.
So can I conclude that the number of output files is equal to the number of blocks?
Also, is one block of data fed to one Mapper?
- Block - A Physical Division:
HDFS was designed to hold and manage large amounts of data; the default block size is 64 MB. That means that if a 128 MB text file is put into HDFS, HDFS divides the file into two blocks (128 MB / 64 MB = 2) and distributes the two chunks to the data nodes in the cluster.
- Split - A Logical Division:
When Hadoop submits a job, it splits the input data logically, and each split is processed by a Mapper. A split is only a reference. The split details are in org.apache.hadoop.mapreduce.InputSplit, and the rules for how to split are decided by getSplits() in the class org.apache.hadoop.mapreduce.lib.input.FileInputFormat.
By default, split size = block size = 64 MB.
Now consider that your block size is 64 MB. The file you are processing should be greater than 64 MB for it to be physically split. If it is less than 64 MB then you will see only a single file in your output, as you mentioned (no matter how many key-value pairs your mapper produces!).

Data flow among InputSplit, RecordReader & Map instance and Mapper

If I have a data file with 1000 lines and I use TextInputFormat in my Word Count program, then every line in the data file will be considered as one split.
A RecordReader will feed each line (or split) as a (key, value) pair to the map() method.
As per my understanding, the map() method should execute 1000 times, once for each line or record.
So how many Mappers will run?
Sorry, I'm confused here. The map() method is just run on an instance of the mapper, right? So what decides how many map() invocations there are per Mapper task?
Note: when I executed the WordCount MapReduce program on 1000 lines of data, I saw that the number of Mappers was 2. So do 500 map() invocations run for each map task?
Please correct my question if I asked it wrong.
How mappers get assigned
The number of mappers is determined by the number of splits, which is determined by the InputFormat used in the Map/Reduce job. With a typical InputFormat, it is directly proportional to the number of files and the file sizes.
Suppose your HDFS block size is configured as 64 MB (the default) and you have a file of 100 MB; then it will occupy 2 blocks, and 2 mappers will get assigned based on those blocks.
Suppose you have 2 files of 30 MB each; then each file will occupy one block, and one mapper will get assigned per block.
Suppose you have a file of 60 MB; then it will occupy 1 block, but if you have specified an input split size in your code of, say, 30 MB, then 2 mappers will get assigned for this job.
RecordReader - the record reader breaks the data into key/value pairs for input to the Mapper. It takes as a parameter:
split - the split that defines the range of records to read.
InputSplit - getLength() returns the size of the split, so that the input splits can be sorted by size.
If you have not specified an input split size then it will take the whole block as one split, read the data, and generate the key/value pairs for the Mapper.
In your case 2 Mappers were assigned. That indicates that you either specified an input split size or your data resides in 2 blocks.
This link can be helpful to understand the record reader and input split.
First, it depends upon the size of the HDFS block.
There is a difference between the number of mappers the program invokes and the number of times the mapper code (the map() method) runs.
A single mapper instance will run map() as many times as there are lines in its split, since the input format is TextInputFormat. But how many mappers run depends purely on the block (split) size.
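To make the distinction concrete, here is a standard WordCount-style mapper (a generic sketch, not the asker's exact program). The framework creates one Mapper task per input split and then calls map() once per record within that split; with 2 map tasks over 1000 lines, map() therefore runs 1000 times in total, roughly 500 times per task if the lines are evenly spread.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // One Mapper task is created per input split; within that task the
    // framework calls map() once per record the RecordReader produces.
    // With TextInputFormat a record is one line, keyed by its byte offset.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // emit (word, 1) for every word in this line
            }
        }
    }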
