Data flow among InputSplit, RecordReader, map instances and Mapper - hadoop

If I have a data file with 1000 lines and I use TextInputFormat in my Word Count program, then (as I understand it) every line in the data file will be considered one split.
A RecordReader will feed each line (or split) as a (key, value) pair to the map() method.
As per my understanding, the map() method should execute 1000 times, once for each line or record.
So how many Mappers will run?
Sorry, I'm confused here. The map() method is just invoked on an instance of a Mapper, right? So what decides how many map() invocations there are per Mapper task?
Note: when I executed the WordCount MapReduce program on 1000 lines of data, I saw the number of Mappers as 2. So do 500 map() invocations run in each map task?
Please correct my question if I have asked it wrong.

How mappers get assigned
The number of mappers is determined by the number of splits, which is determined by the InputFormat used in the Map/Reduce job. For a typical InputFormat, it is directly proportional to the number of files and their sizes.
Suppose your HDFS block size is configured as 64MB (the default) and you have a file of 100MB: it will occupy 2 blocks, and 2 mappers will be assigned based on those blocks.
Suppose you have 2 files of 30MB each: each file will occupy one block, and a mapper will be assigned per block.
Suppose you have a 60MB file: it will occupy 1 block, but if you have specified an input split size in your code, say 30MB, then 2 mappers will be assigned for the job.
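The arithmetic in those examples can be sketched as follows (a rough Python illustration of the rule of thumb, not Hadoop's actual code):

```python
import math

def num_mappers(file_size_mb, block_size_mb, split_size_mb=None):
    """One mapper per input split; if no split size is specified,
    the HDFS block size is used as the split size."""
    split_mb = split_size_mb if split_size_mb is not None else block_size_mb
    return math.ceil(file_size_mb / split_mb)

print(num_mappers(100, 64))      # 100MB file, 64MB blocks -> 2 mappers
print(num_mappers(30, 64))       # 30MB file fits in one block -> 1 mapper
print(num_mappers(60, 64, 30))   # 60MB file, 30MB split size -> 2 mappers
```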
RecordReader - the record reader breaks the data into key/value pairs for input to the Mapper. It takes a parameter:
split - the split that defines the range of records to read.
InputSplit - its getLength() method returns the size of the split, so that input splits can be sorted by size.
If you have not specified any input split size, the whole block is taken as one split, and the reader generates key/value pairs from it for the Mapper.
In your case 2 Mappers are assigned. That indicates that either you specified an input split size or your data resides in 2 blocks.

First, it depends on the size of the HDFS block.
There is a difference between the number of mappers the program invokes and the number of times the map() method runs.
A single mapper instance will call map() as many times as there are lines in its split, because the input format in your code is TextInputFormat. But the number of mappers purely depends on the block size (and any configured split size).

Related

relation between splits and blocks in hadoop mapreduce

I'm trying to understand how mapreduce's InputFormat class is supposed to be working with respect to splits and blocks.
If I understand correctly:
an HDFS file is split into blocks. Each block is guaranteed to be on a specific machine, but 2 blocks of the same file can be on 2 different machines.
a split roughly defines an HDFS file plus a starting byte offset and an ending byte offset (or can it be a starting line number and ending line number?)
each mapper receives a split and a RecordReader instance for that specific split. Defining N splits means that N mappers and N RecordReaders are created, one for each split.
however, a split size doesn't need to be divisible by the block size, or the other way around
Is that correct?
However, is the RecordReader instance mandated by the API to work strictly on the data inside its split, or is it allowed to read data outside its split's bounds? Can it always read any part of the file, even if it has to go beyond the current block (so potentially the rest of the file is on another machine)?
In essence, are the splits only a "hint" for the RecordReader?
Because if that is not the case and the splits are strict, it seems to me impossible to process a simple file in which records have non-fixed sizes.

Does Hadoop create InputSplits in parallel

I have a large text file of around 13GB. I want to process the file using Hadoop. I know that Hadoop uses FileInputFormat to create InputSplits which are assigned to mapper tasks. I want to know whether Hadoop creates these InputSplits sequentially or in parallel. I mean, does it read the large text file sequentially on a single host and create split files which are then distributed to datanodes, or does it read chunks of, say, 50MB in parallel?
Does Hadoop replicate the big file on multiple hosts before splitting it up?
Is it recommended that I split up the file into 50MB chunks to speed up the processing? There are many questions on the appropriate split size for mapper tasks, but not on the exact split process itself.
Thanks
InputSplits are created on the client side, and a split is just a logical representation of the file: it contains only the file path plus start and end offset values (calculated in LineRecordReader's initialize function). Calculating this logical representation does not take much time, so there is no need to split your file into chunks yourself; the real execution happens on the mapper side, and it is done in parallel. The client places the input splits into HDFS, the JobTracker picks them up, and depending on the splits it allocates TaskTrackers. One mapper's execution is not dependent on another's: each mapper knows exactly where it has to start processing its split, so the mapper executions run in parallel.
I suppose you want to process the file using MapReduce, not just Hadoop. Hadoop is a platform that provides tools to process and store large data.
When you store a file in HDFS (the Hadoop filesystem), it splits the file into multiple blocks. The size of the block is defined in hdfs-site.xml as dfs.block.size. For example, if the block size is set to 128MB, your input file is split into 128MB blocks. This is how HDFS stores the data internally; to the user it always appears as a single file.
When you provide the input file (stored in HDFS) to MapReduce, it launches a mapper task for each block/split of the file. This is the default behavior.
You need not split the file into chunks yourself; just store the file in HDFS and it will do the rest for you.
First let us understand what is meant by input split.
When your text file is divided into 128MB blocks (the default) by HDFS, assume that the 10th line of the file is cut in two: the first half of the line is in the first block and the other half is in the second block. When you submit a map program, Hadoop understands that the last line of the 1st block (which becomes an input split here) is not complete, so it carries the second half of the 10th line into the first input split. Which implies:
1) 1st input split = 1st block + 2nd part of the 10th line from the 2nd block
2) 2nd input split = 2nd block - 2nd part of the 10th line from the 2nd block
This record-boundary handling is built into Hadoop and happens automatically. The block size in Hadoop v2 is 128MB by default; it can be configured, and as other answers note, the effective input split size can also be influenced via the split min/max size settings.
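The boundary rule described above can be sketched in Python (an illustrative sketch of the idea behind a line-oriented record reader, not Hadoop's actual code): skip the partial first line unless the split starts at byte 0, and read past the split's end to finish the last line.

```python
def records_for_split(data: bytes, start: int, length: int):
    """Return the full lines belonging to the split [start, start+length):
    skip the tail of a line owned by the previous split, and read beyond
    the split boundary to complete the last line."""
    end = start + length
    pos = start
    if start != 0:
        # the partial first line belongs to the previous split
        nl = data.find(b"\n", start)
        pos = nl + 1 if nl != -1 else len(data)
    records = []
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            records.append(data[pos:])
            pos = len(data)
        else:
            records.append(data[pos:nl])
            pos = nl + 1
    return records

data = b"line1\nline2\nline3\n"
# two 9-byte "splits": line2 straddles the boundary but is read whole by split 1
print(records_for_split(data, 0, 9))   # [b'line1', b'line2']
print(records_for_split(data, 9, 9))   # [b'line3']
```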

Split size vs Block size in Hadoop

What is the relationship between split size and block size in Hadoop? As I have read, split size should be n times the block size (n an integer, n > 0) - is this correct? Is there any required relationship between split size and block size?
In the HDFS architecture there is a concept of blocks. A typical block size used by HDFS is 64MB. When we place a large file into HDFS, it is chopped up into 64MB chunks (based on the default block configuration). Suppose you have a file of 1GB and you want to place it in HDFS: there will be 1GB/64MB = 16 blocks, and these blocks will be distributed across the DataNodes. The blocks/chunks will reside on different DataNodes based on your cluster configuration.
Data splitting happens based on file offsets. The goals of splitting a file and storing it in different blocks are parallel processing and fail-over of data.
Difference between block size and split size.
A split is a logical division of the data, used during data processing by a Map/Reduce program or other data-processing techniques in the Hadoop ecosystem. Split size is a user-defined value; you can choose your own split size based on the volume of data you are processing.
Splits are basically used to control the number of mappers in a Map/Reduce program. If you have not defined any input split size in the Map/Reduce program, the default HDFS block size is used as the input split size.
Example:
Suppose you have a file of 100MB and the HDFS default block size is 64MB; then it will be chopped up and occupy 2 blocks. Now you run a Map/Reduce program over this data without specifying an input split size: the number of blocks (2) determines the number of input splits, and 2 mappers are assigned for the job.
But suppose you specify a split size of, say, 100MB in your Map/Reduce program: then both blocks are treated as a single split, and 1 mapper is assigned for the job.
And if you specify a split size of, say, 25MB, there will be 4 input splits and 4 mappers are assigned for the job.
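The three scenarios above boil down to a ceiling division (a minimal sketch of the counting only; real split sizing also involves the min/max split settings):

```python
import math

def splits_for(file_mb, split_mb):
    """Number of input splits (and hence mappers) for a file,
    given the effective split size in MB."""
    return math.ceil(file_mb / split_mb)

print(splits_for(100, 64))    # no split size set, 64MB block size used -> 2
print(splits_for(100, 100))   # split size 100MB -> 1
print(splits_for(100, 25))    # split size 25MB -> 4
```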
Conclusion:
Split is a logical division of the input data while block is a physical division of data.
The HDFS default block size is the default split size if no input split size is specified.
Split size is user defined, and a user can control it in his Map/Reduce program.
One split can map to multiple blocks, and there can be multiple splits of one block.
The number of map tasks (mappers) is equal to the number of splits.
Assume we have a 400MB file consisting of 4 records (e.g., a 400MB CSV file with 4 rows, 100MB each).
If the HDFS block size is configured as 128MB, the 4 records will not be distributed evenly among the blocks. It will look like this:
Block 1 contains the entire first record and a 28MB chunk of the second record.
If a mapper were run directly on Block 1, it could not proceed, since the block does not contain the entire second record.
This is the exact problem that input splits solve: input splits respect logical record boundaries.
Let's assume the input split size is 200MB.
Then input split 1 contains both record 1 and record 2, while input split 2 does not start with record 2 (it has already been assigned to input split 1) but with record 3.
This is why an input split is only a logical chunk of data. It points to start and end locations within blocks.
If the input split size is n times the block size, an input split can span multiple blocks, so fewer mappers are needed for the whole job, and therefore there is less parallelism. (The number of mappers is the number of input splits.)
Input split size = block size is the ideal configuration.
Hope this helps.
Split creation depends on the InputFormat being used. FileInputFormat's getSplits() method decides the splits for each input file; note the role played by the split slope (1.1): a full-size split is carved off only while the remaining bytes are more than 1.1 times the split size, so a slightly oversized remainder becomes one split instead of two.
The method computeSplitSize() used there expands to max(minSize, min(maxSize, blockSize)), where the min/max size can be configured by setting mapreduce.input.fileinputformat.split.minsize/maxsize.
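As a sketch of that logic in Python (mirroring the formula and the 1.1 split slope described above; illustrative, not the actual Hadoop source):

```python
def compute_split_size(block_size, min_size, max_size):
    """max(minSize, min(maxSize, blockSize)), as in FileInputFormat."""
    return max(min_size, min(max_size, block_size))

def get_splits(file_size, split_size, slop=1.1):
    """Carve off full splits while the remainder exceeds slop * splitSize;
    the last chunk becomes one (possibly slightly larger) split."""
    splits, offset, remaining = [], 0, file_size
    while remaining / split_size > slop:
        splits.append((offset, split_size))
        offset += split_size
        remaining -= split_size
    if remaining:
        splits.append((offset, remaining))
    return splits

MB = 1024 * 1024
size = compute_split_size(128 * MB, 1, 2**63 - 1)  # defaults -> block size
print(size // MB)                                  # 128
print(get_splits(130 * MB, size))  # 130/128 < 1.1 -> one single 130MB split
print(get_splits(260 * MB, size))  # -> two splits: 128MB and 132MB
```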

How to Hadoop Map Reduce entire file

I've played around with various streaming MapReduce word-count examples where Hadoop/HBase appears to take a large file and break it (at a line break) roughly equally between the nodes. It then submits each line of the partial document to the map portion of my code. My question is: when I have lots of little unstructured and semi-structured documents, how do I get Hadoop to submit the entire document to my map code?
File splits are calculated by InputFormat.getSplits. For each input file it computes a number of splits, and each split is submitted to a mapper. The mapper then processes its input split according to the InputFormat.
We have different types of InputFormats. Consider, for example, TextInputFormat, which takes text files as input and, for each split, supplies the line offset as key and the entire line as value to the map method in the Mapper. Similarly for other InputFormats.
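The (key, value) shape TextInputFormat produces can be pictured like this (an illustrative Python sketch, with the byte offset of each line as the key):

```python
def text_records(data: bytes):
    """Sketch of TextInputFormat's record shape: key = byte offset of the
    line start, value = the line without its trailing newline."""
    records, offset = [], 0
    for line in data.splitlines(keepends=True):
        records.append((offset, line.rstrip(b"\r\n")))
        offset += len(line)
    return records

print(text_records(b"hello\nworld\n"))
# [(0, b'hello'), (6, b'world')]
```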
Now, if you have many small files, say each file smaller than the block size, then each file is supplied to a different mapper. If a file exceeds the block size, it is split into two blocks and processed by two mappers, one per block.
Consider an example where the input files are 1MB each and you have 64 such files. Also assume that your block size is 64MB.
You will then have 64 mappers kicked off, one for each file.
Now consider a 100MB file, of which you have 2.
Each 100MB file will be split into 64MB + 36MB, so 4 mappers will be kicked off in total.

What exactly the getSplits() method returns?

According to the Apache docs it returns an array of InputSplit. What does that mean?
Does it return the blocks of file bytes on which a mapper is going to run?
Let's say we have 3 files of 50MB each. Does it then return bytes for 64MB (50MB of file 1 + 14MB of file 2) at [0], 64MB (36MB of file 2 + 28MB of file 3) at [1], and 36MB (the rest of file 3) at [2], each processed by 3 different mappers?
If we have one big file of 120MB, does it return 64MB blocks of the same file?
I am not even sure whether what I am asking is logical; I am new to the Hadoop stack.
The getSplits() method returns the splits: metadata about parts of the files. Each map task processes one split.
If your file is large, it is divided into parts the size of the HDFS block (at least 64MB). In your second example it will be two splits, of 64MB and 56MB. (Nowadays the recommended block size is 128MB or even 256MB.)
If a file is smaller than the block size, it goes into a separate split, so in your first example you will have three splits of 50MB each (splits do not span files). If you want to combine them and process them in one Mapper, you could use CombineFileInputFormat.
An input split in MapReduce is the unit of parallelization for the map phase. If you have ten input splits, you will have ten mappers. In the general case, a file block maps to an input split.
An InputSplit object contains information about the split, but not the split data itself. Depending on the subclass (such as FileSplit), this information can include items such as the location of the split and how large it is.
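A FileSplit-style object can be pictured like this (a hypothetical Python sketch of the metadata only; the field names are illustrative, not Hadoop's actual API):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class SplitInfo:
    """Metadata only: where the split's bytes live, not the bytes themselves."""
    path: str
    start: int              # byte offset into the file
    length: int             # number of bytes in this split
    hosts: Tuple[str, ...]  # preferred nodes holding the underlying block

MB = 1024 * 1024
splits = [
    SplitInfo("/data/big.txt", 0, 64 * MB, ("node1", "node2")),
    SplitInfo("/data/big.txt", 64 * MB, 56 * MB, ("node3",)),
]
print(len(splits))  # one mapper per split -> 2 mappers for a 120MB file
```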

Resources