How does the JobClient in Hadoop compute InputSplits?

I am trying to get insight into the MapReduce architecture. I am consulting this http://answers.oreilly.com/topic/2141-how-mapreduce-works-with-hadoop/ article. I have some questions regarding the JobClient component of the MapReduce framework. My question is:
How does the JobClient compute the input splits on the data?
According to the material I am consulting, the JobClient computes input splits on the data located in the input path on HDFS specified while running the job. The article then says the JobClient copies the resources (jars and computed input splits) to HDFS. Here is my question: when the input data is already in HDFS, why does the JobClient copy the computed input splits into HDFS?
Let's assume that the JobClient copies the input splits to HDFS. Now, when the job is submitted to the JobTracker and the JobTracker initializes the job, why does it retrieve the input splits from HDFS?
Apologies if my question is not clear. I am a beginner. :)

No, the JobClient does not copy the input data itself to HDFS, only the split metadata. You have quoted the answer yourself:
Job Client computes input splits on the data located in the input path
on the HDFS specified while running the job. the article says then Job
Client copies the resources(jars and computed input splits) to the HDFS.
The input data itself stays on the cluster. The client only computes with the metadata it gets from the NameNode (block size, data length, block locations). The computed input splits carry that metadata to the tasks, e.g. the block offset and the length to compute on.
Have a look at org.apache.hadoop.mapreduce.lib.input.FileSplit: it contains the file path, the start offset, and the length of the chunk a single task will operate on as its input.
The serializable class you may also want to have a look at is: org.apache.hadoop.mapreduce.split.JobSplit.SplitMetaInfo.
This meta information will be computed for each task that will be run, and copied with the jars to the node that will actually execute this task.
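To make that concrete, here is a minimal sketch (the class name PrintSplits and the command-line input path are my own placeholders) that asks TextInputFormat for the splits of a path and prints the metadata each FileSplit carries: file path, start offset, length, and the hosts holding the underlying blocks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class PrintSplits {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "print-splits");
        FileInputFormat.addInputPath(job, new Path(args[0])); // e.g. an HDFS input directory

        TextInputFormat format = new TextInputFormat();
        for (InputSplit split : format.getSplits(job)) {
            FileSplit fileSplit = (FileSplit) split;
            System.out.printf("%s start=%d length=%d hosts=%s%n",
                    fileSplit.getPath(),
                    fileSplit.getStart(),
                    fileSplit.getLength(),
                    String.join(",", fileSplit.getLocations()));
        }
    }
}

Note that nothing here reads the file contents; the split computation only consults NameNode metadata.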

The computation of input splits depends on the InputFormat. For a typical text input format, the generic formula to calculate the split size is
max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))
By default, mapred.min.split.size < dfs.block.size < mapred.max.split.size, so the input split size equals dfs.block.size.
Where
mapred.min.split.size = minimum split size
mapred.max.split.size = maximum split size
dfs.block.size = HDFS block size
For DBInputFormat, the split size is
(total records / number of mappers)
That said, the number and size of the input splits are the metadata handed to the mapper tasks and record readers.
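For illustration, a small Java sketch of that formula (the same max/min computation FileInputFormat applies when sizing splits; the class name and the sample values are placeholders):

public class SplitSizeFormula {
    static long computeSplitSize(long blockSize, long minSplitSize, long maxSplitSize) {
        return Math.max(minSplitSize, Math.min(maxSplitSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // dfs.block.size (128 MB)
        long minSplit  = 1L;                   // mapred.min.split.size default
        long maxSplit  = Long.MAX_VALUE;       // mapred.max.split.size default
        // With the defaults, the split size falls back to the block size.
        System.out.println(computeSplitSize(blockSize, minSplit, maxSplit)); // 134217728
    }
}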

Related

How does Hadoop HDFS decide what data to be put into each block?

I have been trying to dive into how Hadoop HDFS decides what data is put into one block and can't seem to find any solid answer. We know that Hadoop automatically distributes data into blocks in HDFS across the cluster; however, what data of each file is put together in a block? Is it placed arbitrarily? And is this the same for Spark RDDs?
HDFS block behavior
I'll attempt to highlight, by way of example, how splits relate to blocks for the same file size. In HDFS you have:
Splittable FileA size 1GB
dfs.block.size=67108864(~64MB)
MapRed job against this file:
16 splits and in turn 16 mappers.
Let's look at this scenario with a compressed (non-splittable) file:
Non-Splittable FileA.gzip size 1GB
dfs.block.size=67108864(~64MB)
MapRed job against this file:
16 Blocks will converge on 1 mapper.
It's best to proactively avoid this situation, since it means that the tasktracker will have to fetch 16 blocks of data, most of which will not be local to the tasktracker.
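As a hedged sketch of how an input format decides splittability, the snippet below (class name and path are placeholders) looks up the compression codec for a path and treats the file as splittable only if there is no codec at all or the codec implements SplittableCompressionCodec; gzip does not, bzip2 does.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]); // e.g. /data/FileA.gzip
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
        boolean splittable = (codec == null)                      // plain text: splittable
                || (codec instanceof SplittableCompressionCodec); // e.g. bzip2
        System.out.println(path + " splittable=" + splittable);
    }
}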
Spark reading an HDFS splittable file:
sc.textFile doesn't commence any reading. It simply defines a driver-resident data structure which can be used for further processing.
It is not until an action is called on an RDD that Spark will build up a strategy to perform all the required transforms (including the read) and then return the result.
If there is an action called to run the sequence, and your next transformation after the read is to map, then Spark will need to read a small section of lines of the file (according to the partitioning strategy based on the number of cores) and then immediately start to map it until it needs to return a result to the driver, or shuffle before the next sequence of transformations.
If your partitioning strategy (defaultMinPartitions) seems to be swamping the workers because the java representation of your partition (an InputSplit in HDFS terms) is bigger than available executor memory, then you need to specify the number of partitions to read as the second parameter to textFile. You can calculate the ideal number of partitions by dividing your file size by your target partition size (allowing for memory growth). A simple check that the file can be read would be:
sc.textFile(file, numPartitions)
.count()
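As a rough sketch of that advice, assuming a hypothetical 128 MB target partition size, the Java snippet below derives the partition count from the file size and passes it to textFile before calling count():

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class CountWithPartitions {
    public static void main(String[] args) throws Exception {
        String file = args[0];
        long targetPartitionBytes = 128L * 1024 * 1024; // assumed target size per partition

        Path path = new Path(file);
        FileSystem fs = path.getFileSystem(new Configuration());
        long fileSize = fs.getFileStatus(path).getLen();
        int numPartitions = (int) Math.max(1, fileSize / targetPartitionBytes);

        // master is supplied by spark-submit (or add .setMaster("local[*]") for local tests)
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("count-check"));
        long lines = sc.textFile(file, numPartitions).count(); // the action triggers the actual read
        System.out.println(lines);
        sc.stop();
    }
}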

How does Hadoop read all data and then split it into chunks?

I am using Hadoop 2.6 to process a fair amount of data, so I have a question about how Hadoop reads all the data and then splits it into chunks. I understand that data is first uploaded to HDFS, then the data is split into N chunks depending on the chunk size. In the case where I have 1 TB of text for a word-count algorithm, I suppose that Hadoop first loads the file into memory, reads it, and somehow reads up to some row X and then copies that data into a chunk.
If my assumption is wrong, what is the correct way? I think loading the data into memory must be done in pieces. How is it done internally?
Thanks
Cheers
Your data upload to HDFS statement is correct.
When the WordCount MapReduce job is launched, one Mapper task is assigned to and executed for each chunk (block). The output of the Mappers is sent to the Reducers after the sort-shuffle phase. During sort-shuffle, the Mapper output is partitioned, sorted, and received (copied) by the Reducers.
The MapReduce framework does not read any data and copy it into chunks. That was already done when you stored the file in HDFS.
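To make that flow concrete, here is a minimal WordCount mapper and reducer sketch (class names are arbitrary): each Mapper instance is fed only the records of its own split (one block of the file), and the framework partitions and sorts the map output before the Reducers copy it.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // 'line' comes from this task's split only; other blocks are
            // handled by other map tasks running in parallel.
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(word, new IntWritable(sum)); // one total per word after the shuffle
        }
    }
}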
When you upload the data, it is divided into blocks based on your block size and stored on different nodes.
But when you launch MapReduce jobs, you should know about splits:
It is not number of blocks = number of mappers;
it is number of splits = number of mappers.
Splits are a logical division; blocks are a physical division.
Data is read in splits. By default, split size = block size, but you can change this (see the sketch below).
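A hedged driver-side sketch of changing the split size independently of the block size, using the new-API FileInputFormat helpers (the 256 MB value and the paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitSizeConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "custom-split-size");
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Ask for ~256 MB splits even if the block size is 128 MB:
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

        // mapper/reducer classes omitted (defaults apply); this only illustrates split sizing
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}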

Does Hadoop create InputSplits in parallel?

I have a large text file around 13 GB in size. I want to process the file using Hadoop. I know that Hadoop uses FileInputFormat to create InputSplits which are assigned to mapper tasks. I want to know whether Hadoop creates these InputSplits sequentially or in parallel. I mean, does it read the large text file sequentially on a single host and create split files which are then distributed to DataNodes, or does it read chunks of, say, 50 MB in parallel?
Does Hadoop replicate the big file on multiple hosts before splitting it up?
Is it recommended that I split the file into 50 MB chunks to speed up the processing? There are many questions on the appropriate split size for mapper tasks, but not on the exact split process itself.
Thanks
InputSplits are created on the client side, and they are just a logical representation of the file, in the sense that each one only contains the file path and the start and end offset values (calculated in LineRecordReader's initialize function). Calculating this logical representation does not take much time, so there is no need to split your file into chunks yourself; the real execution happens on the mapper side, where the work is done in parallel. The client then places the input splits into HDFS, and the JobTracker picks them up from there and allocates a TaskTracker for each split. One mapper's execution is not dependent on another's: each mapper knows exactly where it has to start processing its split, so the mapper executions run in parallel.
I suppose you want to process the file using MapReduce, not just Hadoop. Hadoop is a platform which provides tools to process and store large amounts of data.
When you store the file in HDFS (the Hadoop filesystem), it splits the file into multiple blocks. The block size is defined in the hdfs-site.xml file as dfs.block.size. For example, if dfs.block.size is set to 128 MB, then your input file will be split into 128 MB blocks. This is how HDFS stores the data internally. For the user it always appears as a single file.
When you provide the input file (stored in HDFS) to MapReduce, it launches a mapper task for each block/split of the file. This is the default behavior.
You need not split the file into chunks; just store the file in HDFS and it will do the rest for you.
First let us understand what is meant by input split.
When your text file is divided by HDFS into blocks of 128 MB (the default), assume that the 10th line of the file is cut in two: the first half is in the first block and the other half is in the second block. When you submit a map program, Hadoop understands that the last line of the 1st block (which becomes the input split here) is not complete, so it carries the second half of the 10th line over to the first input split. Which implies:
1) 1st input split = 1st Block + 2nd part of 10th line from 2nd block
2) 2nd input split = 2nd Block - 2nd part of 10th line from 2nd block.
This is a built-in process of Hadoop, and you cannot change or set the size of the input split here. The block size in Hadoop v2 is 128 MB by default. You can increase it during installation, but you cannot decrease it.
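Below is a simplified, hedged sketch of that boundary rule for a local file; it mirrors the idea behind Hadoop's LineRecordReader (skip the first partial line unless the split starts at offset 0, and keep reading past the split's end until the current line is finished), not its actual code, and it glosses over byte-versus-character handling.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SplitLineReader {
    // Reads the lines "owned" by the split [start, end) of a local file.
    static void readSplit(String file, long start, long end) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            long pos = in.skip(start);
            if (start != 0) {
                // the previous split finishes this line, so skip it here
                String partial = in.readLine();
                pos += (partial == null ? 0 : partial.length() + 1);
            }
            String line;
            while (pos <= end && (line = in.readLine()) != null) {
                pos += line.length() + 1;      // +1 for the newline
                System.out.println(line);      // hand the record to map()
            }
        }
    }
}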

who splits the file in hadoop? Is it Job Tracker?

I want to know
When a client stores data into HDFS, who exactly performs the task of splitting the large file into smaller chunks?
Does the client directly write the data to the DataNodes? If so, when does the data get split into 64 MB or 128 MB blocks?
The JobClient does that, not the JobTracker:
Job Client computes input splits on the data located in the input path
on the HDFS specified while running the job. the article says then Job
Client copies the resources(jars and computed input splits) to the HDFS.
The input data itself stays on the cluster. The client only computes with the metadata it gets from the NameNode (block size, data length, block locations). The computed input splits carry that metadata to the tasks, e.g. the block offset and the length to compute on.
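As a hedged illustration that the writing client fixes the block size, the sketch below (paths and values are placeholders) uses the FileSystem.create overload that takes an explicit block size; the DFS client then cuts the outgoing stream into blocks of that size as it writes, while the NameNode only hands out block locations.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        long blockSize = 128L * 1024 * 1024;   // overrides dfs.blocksize for this one file
        Path out = new Path(args[0]);
        try (FSDataOutputStream stream =
                     fs.create(out, true, 4096, (short) 3, blockSize)) {
            stream.writeBytes("data written by the client is chunked into blocks\n");
        }
    }
}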

Hadoop's input splitting - how does it work?

I know a bit about Hadoop and I am curious to know how it works.
To be precise, I want to know how exactly it divides/splits the input file.
Does it divide it into equal chunks in terms of size, or is it a configurable thing?
I did go through this post, but I couldn't understand it.
This is dependent on the InputFormat, which for most file-based formats is defined in the FileInputFormat base class.
There are a number of configurable options which determine how Hadoop will take a single file and either process it as a single split or divide the file into multiple splits:
If the input file is compressed, the input format and compression method must be splittable. Gzip, for example, is not splittable (you can't randomly seek to a point in the file and recover the compressed stream). BZip2 is splittable. See the specific InputFormat.isSplitable() implementation for your input format for more information.
If the file size is less than or equal to its defined HDFS block size, then Hadoop will most probably process it in a single split (this can be configured; see a later point about split size properties).
If the file size is greater than its defined HDFS block size, then Hadoop will most probably divide up the file into splits based upon the underlying blocks (4 blocks would result in 4 splits).
You can configure two properties, mapred.min.split.size and mapred.max.split.size, which help the input format when breaking up blocks into splits. Note that the minimum size may be overridden by the input format (which may have a fixed minimum input size).
If you want to know more, and are comfortable looking through the source, check out the getSplits() method in FileInputFormat (both the new and old API have the same method, but they may have some subtle differences).
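If you don't want to read the full source, here is a condensed, hedged sketch of what getSplits() boils down to for a splittable file (the SPLIT_SLOP factor of 1.1 matches the real constant; block-location lookup and many details are omitted):

import java.util.ArrayList;
import java.util.List;

public class GetSplitsSketch {
    static final double SPLIT_SLOP = 1.1; // allow the last split to be up to 10% larger

    static List<long[]> splits(long fileLength, long blockSize, long minSize, long maxSize) {
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        List<long[]> result = new ArrayList<>();   // each entry: {offset, length}
        long remaining = fileLength;
        while (((double) remaining) / splitSize > SPLIT_SLOP) {
            result.add(new long[]{fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining != 0) {
            result.add(new long[]{fileLength - remaining, remaining});
        }
        return result;
    }

    public static void main(String[] args) {
        // 1 GB file with 64 MB blocks and default min/max -> 16 splits
        System.out.println(splits(1024L << 20, 64L << 20, 1L, Long.MAX_VALUE).size());
    }
}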
When you submit a MapReduce job (or a Pig/Hive job), Hadoop first calculates the input splits; each input split's size generally equals the HDFS block size. For example, for a file of 1 GB, there will be 16 input splits if the block size is 64 MB. However, the split size can be configured to be smaller or larger than the HDFS block size. The calculation of input splits is done by FileInputFormat. For each of these input splits, a map task must be started.
But you can change the size of the input split by configuring the following properties:
mapred.min.split.size: The minimum size chunk that map input should be split into.
mapred.max.split.size: The largest valid size in bytes for a file split.
dfs.block.size: The default block size for new files.
And the formula for input split is:
Math.max("mapred.min.split.size", Math.min("mapred.max.split.size", blockSize));
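For reference, a hedged sketch of setting those three properties from a driver's Configuration; the quoted names are the older ones and still work as deprecated aliases, with the current names noted in comments, and the values are placeholders:

import org.apache.hadoop.conf.Configuration;

public class SplitProperties {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // current name: mapreduce.input.fileinputformat.split.minsize
        conf.setLong("mapred.min.split.size", 32L * 1024 * 1024);   // 32 MB minimum
        // current name: mapreduce.input.fileinputformat.split.maxsize
        conf.setLong("mapred.max.split.size", 256L * 1024 * 1024);  // 256 MB maximum
        // current name: dfs.blocksize (applies only to files written afterwards)
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);         // 128 MB blocks

        // A job built on this Configuration (Job.getInstance(conf)) will use
        // these values when FileInputFormat computes its splits.
        System.out.println("min=" + conf.getLong("mapred.min.split.size", 0)
                + " max=" + conf.getLong("mapred.max.split.size", 0)
                + " block=" + conf.getLong("dfs.block.size", 0));
    }
}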
