Calculating input splits in MapReduce - hadoop

A file is stored in HDFS of size 260 MB whereas the HDFS default block size is 64 MB. Upon performing a map-reduce job against this file, I found the number of input splits it creates is only 4. how did it calculated.? where is the rest 4 MB.? Any input is much appreciated.

Input split is NOT always a block size. Input split is a logical representation of data. Your input split could have been 63mb, 67mb, 65mb, 65mb(or possibly other sizes based on logical records' sizes) ... see examples in below links...
Hadoop input split size vs block size
Another example - see section 3.3...

Related

Blocks in Mapreduce

I have very important question cause I must make a presentation about map-reduce.
My Question is:
I have read that the file in map-reduce is divided into blocks and every blocks is replicated in 3 different nodes. the block can be 128 MB is this Block the input file? i mean this 128 MB block will be Splitting into parts and every part will go to single map? if yes so this 128 MB will be divided into Which Size?
or the File breaks into blocks and this blocks is the input for mapper
I'm little bit confused.
Could you see the photo and tell me which one is right.
Here HDFS File is divided into blocks and every singel block 128. MB will be as input for 1 Map
Here the HDFS file Is A Block and this 128 M.B will be splitting and every part will be input for 1 Map
Let's say you have a file of 2GB and you want to place that file in HDFS, then there will be 2GB/128MB = 16 blocks and these block will be distributed across the different DataNodes.
Data splitting happens based on file offsets. The goal of splitting the file and store it into different blocks, is parallel processing and fail over of data.
Split is logical split of the data, basically used during data processing using Map/Reduce program or other data-processing techniques in Hadoop. Split size is user defined value and one can choose his own split size based on the volume of data(How much data you are processing).
Split is basically used to control number of Mapper in Map/Reduce program. If you have not defined any input split size in Map/Reduce program then default HDFS block split will be considered as input split. (i.e., Input Split = Input Block. So 16 mappers will be triggered for a 2 GB file). If Split size is defined as 100 MB (lets say), then 21 Mappers will be triggered (20 Mappers for 2000MB and 21st Mapper for 48MB).
Hope this clears your doubt.
HDFS stores the file as blocks and each block is 128Mb in size (default).
Mapreduce processes this HDFS file. Each mapper processes a block (input split).
So, to answer your question, 128 Mb is a single block size which will not be further split.
Note : input split size used in mapreduce context is logical split, whereas the split size mentioned in the HDFS is physical split.

Mapreduce why number of splits (text file) are more than 1 even for a tiny file

I know the difference between a physical block and Inputsplits in hadoop.
BTW I am using Hadoop 2.0 version (Yarn processing).
I have a very tiny input dataset. May be 1.5 Mb in size. When I run mapredce program that consumes this tiny dataset, during the run, it shows there are 2 input splits. Why should the tiny dataset should be split into two when it is less than 128 MB in size.
In my understanding a block size is configured to be 128 MB in size and input split is logical division of data. Meaning where does each split starts (like in which node and which block number) and where it does it end. Starting location and ending location of data is the split.
I didn't get the reason for splits in a tiny datasets.
can someone explain?
thanks
nath
First try to understand how the number of splits will be decided, and it depends on two thing:
If you have not defined any custom split size then it will take default size which will be the block size, in your case 128 MB.
This is important, now if you have two small files, it will be saved in two different blocks. So number of splits will be two.
Your answer is in above two point this is extra info, now the relation between number of mapper and number of splits is one - one so number of splits will be same as number of mapper.
I had the same problem.
I tried on a file of size 2.5MB and I would get on the console from running the job
number of splits:2
Job Counters
Launched map tasks=2
Logically speaking, it should be one split, as I am using the default setting
split_size 128M, block_size 128M
However, I think the reason MapReduce job is splitting the input is due to this config param:
mapreduce.job.maps: The default number of map tasks per job. Ignored when mapreduce.framework.name is "local".
The default value is 2, as you can see in the MapReduce defaults.
You can set it if you are using MRJob for python
MRJob.JOBCONF = {'mapreduce.job.maps': '1'}
By doing so, I got one split, then I ran another file 450M or so with the same setting, it was split into 4 (ceil(450/128)). Which means this setting is ignored when more splits are needed.

How many output files are created after an MR Job in Hadoop?

I have a file which is less than (very less) default block size. The output from my Mapper is a large number of <key,list<values>> pairs (greater than 20).
I read somewhere that the number of output files generated after an MR job is equal to the number of reducers which in my case are greater than 20. But I got a single file in the output.
Then I made job.setNumReduceTasks(2) hoping that it would generate two files in the output. But it still generated a single file.
So can I conclude that the number of output files is equal to the number of blocks?
And also, is one block of data fed to one Mapper?
- Block - A Physical Division:
HDFS was designed to hold and manage large amounts of data. A default block size is 64 MB. That means if a 128-MB text file was put in to HDFS, HDFS would divide the file into two blocks (128 MB/64 MB) and distribute the two chunks to the data nodes in the cluster.
- Split - A Logical Division:
When Hadoop submits jobs, it splits the input data logically and process by each Mapper. Split is only a reference. Split has details in org.apache.hadoop.mapreduce.InputSplitand rules (how to split) decided by getSplits() in class org.apache.hadoop.mapreduce.Input.FileInputFormat.
By default, the size of split = block size = 64M.
Now consider your block size is 64MB. The file which you are processing should be greater than 64MB to create its physical splits. If it is less than 64 MB then you will see only single file as you mentioned in your output. (No matter how many key-value your mapper will produce!)

Split size vs Block size in Hadoop

What is relationship between split size and block size in Hadoop? As I read in this, split size must be n-times of block size (n is an integer and n > 0), is this correct? Is there any must in relationship between split size and block size?
In HDFS architecture there is a concept of blocks. A typical block size used by HDFS is 64 MB. When we place a large file into HDFS it chopped up into 64 MB chunks(based on default configuration of blocks), Suppose you have a file of 1GB and you want to place that file in HDFS, then there will be 1GB/64MB =
16 split/blocks and these block will be distribute across the DataNodes. These blocks/chunk will reside on a different different DataNode based on your cluster configuration.
Data splitting happens based on file offsets. The goal of splitting of file and store it into different blocks, is parallel processing and fail over of data.
Difference between block size and split size.
Split is logical split of the data, basically used during data processing using Map/Reduce program or other dataprocessing techniques on Hadoop Ecosystem. Split size is user defined value and you can choose your own split size based on your volume of data(How much data you are processing).
Split is basically used to control number of Mapper in Map/Reduce program. If you have not defined any input split size in Map/Reduce program then default HDFS block split will be considered as input split.
Example:
Suppose you have a file of 100MB and HDFS default block configuration is 64MB, then it will chopped in 2 split and occupy 2 blocks. Now you have a Map/Reduce program to process this data but you have not specified any input split then based on the number of blocks(2 block) input split will be considered for the Map/Reduce processing and 2 mapper will get assigned for this job.
But suppose, you have specified the split size(say 100MB) in your Map/Reduce program then both blocks(2 block) will be considered as a single split for the Map/Reduce processing and 1 Mapper will get assigned for this job.
Suppose, you have specified the split size(say 25MB) in your Map/Reduce program then there will be 4 input split for the Map/Reduce program and 4 Mapper will get assigned for the job.
Conclusion:
Split is a logical division of the input data while block is a physical division of data.
HDFS default block size is default split size if input split is not specified.
Split is user defined and user can control split size in his Map/Reduce program.
One split can be mapping to multiple blocks and there can be multiple split of one block.
The number of map tasks (Mapper) are equal to the number of splits.
Assume we have a file of 400MB with consists of 4 records(e.g : csv file of 400MB and it has 4 rows, 100MB each)
If the HDFS Block Size is configured as 128MB, then the 4 records will not be distributed among the blocks evenly. It will look like this.
Block 1 contains the entire first record and a 28MB chunk of the second record.
If a mapper is to be run on Block 1, the mapper cannot process since it won't have the entire second record.
This is the exact problem that input splits solve. Input splits respects logical record boundaries.
Lets Assume the input split size is 200MB
Therefore the input split 1 should have both the record 1 and record 2. And input split 2 will not start with the record 2 since record 2 has been assigned to input split 1. Input split 2 will start with record 3.
This is why an input split is only a logical chunk of data. It points to start and end locations with in blocks.
If the input split size is n times the block size, an input split could fit multiple blocks and therefore less number of Mappers needed for the whole job and therefore less parallelism. (Number of mappers is the number of input splits)
input split size = block size is the ideal configuration.
Hope this helps.
The Split creation depends on the InputFormat being used. The below diagram explains how FileInputFormat's getSplits() method decides the splits for two different files. Note the role played by the Split Slope (1.1).
The corresponding Java source that does the split is:
The method computeSplitSize() above expands to Max(minSize, min(maxSize, blockSize)), where min/max size can be configured by setting mapreduce.input.fileinputformat.split.minsize/maxsize

How to Hadoop Map Reduce entire file

I've played around with various streamin map reduce word count examples where Hadoop/Hbase appears to take a large file and break it (at a line break) equally between the nodes. Then it submits each line of the partial document to the map portion of my code. My question is when I have lots of little unstructured and semi-structured documents, how do I get Hadoop to submit the entire document to my map code?
File split are caluculated by the InputFormat.getSplits. So for the each input file it gets number of splits and each split is submitted to a mapper. Now based on the InputFormat Mapper will process the input split.
We have different types of Input Formats consider for example TextInputFormat which will take text files as input and for each split, it supplies line offset as key and entire line as value to map method in Mapper. Similarly for other InputFormats.
Now if you have many small files, say each file is less than the block size. Then each file will be supplied to a different mapper. If the file size exceeds the block size then it will be split into two blocks and executed on two blocks.
Consider an example where input files each are 1MB and you have 64 such files. Also assume that your block size is 64MB.
Now you will have 64 mappers kicked off for each file.
Consider you have 100 MB file and you have 2 such files.
Now your 100 MB file will be split into 64MB + 36MB and 4 mappers will be kicked off.

Resources