How does Hadoop read all the data and then split it into chunks?

I am using Hadoop 2.6 to process a large amount of data, and I have a question about how Hadoop reads all the data and then splits it into chunks. I understand that the data is first uploaded to HDFS and is then split into N chunks depending on the chunk size. Say I have 1 TB of text and want to run the word count algorithm: I suppose Hadoop first loads the file into memory, reads it, and somehow, after every x rows, copies that data into a chunk.
If my assumption is wrong, what is the correct way? I would think that loading data into memory has to be done in pieces. How is it done internally?
Thanks
Cheers

Your statement about uploading the data to HDFS is correct.
When the WordCount MapReduce job is launched, one Mapper task is assigned and executed for each chunk (block). The output of the Mappers is sent to the Reducers after the sort-shuffle phase. During sort-shuffle, the Mapper output is partitioned, sorted, and received (copied) by the Reducers.
The MapReduce framework does not read the data and copy it into chunks. That was already done when you stored the file in HDFS.

When you upload the data, it is divided into blocks according to your block size, and the blocks are stored on different nodes.
But when you launch MapReduce jobs, you need to know about splits.
It is not number of blocks = number of mappers;
it is number of splits = number of mappers.
Splits are a logical division and blocks are a physical division.
Data is read in splits. By default split size = block size, but this can be changed.
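To make the block/split distinction concrete, here is a small Python sketch. It is not Hadoop code: the function name count_chunks is made up for illustration, and it uses plain ceiling division, ignoring details such as the slack Hadoop allows on the last split.

```python
import math

MB = 1024 * 1024

def count_chunks(file_size: int, chunk_size: int) -> int:
    """Number of fixed-size chunks needed to cover file_size bytes."""
    return math.ceil(file_size / chunk_size)

file_size = 1024 * MB    # a 1 GB file
block_size = 128 * MB    # physical division, fixed when the file was stored

# Physical blocks on the DataNodes.
print("blocks:", count_chunks(file_size, block_size))
# Logical splits (one mapper each) when split size = block size.
print("default splits:", count_chunks(file_size, block_size))
# Logical splits when the job asks for 256 MB splits instead.
print("256 MB splits:", count_chunks(file_size, 256 * MB))
```

The block count stays the same in all three lines of output; only the number of splits, and so the number of mappers, changes with the split size.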

Related

How does Hadoop HDFS decide what data to be put into each block?

I have been trying to dive into how Hadoop HDFS decides what data to put into one block, and I can't seem to find a solid answer. We know that Hadoop will automatically distribute data into blocks in HDFS across the cluster, but what data of each file is put together in a block? Does it just place it arbitrarily? And is this the same for a Spark RDD?
HDFS block behavior
I'll attempt to highlight, by way of example, how blocks and splits differ in relation to file size. In HDFS you have:
Splittable FileA size 1GB
dfs.block.size=67108864(~64MB)
MapRed job against this file:
16 splits and in turn 16 mappers.
Let's look at this scenario with a compressed (non-splittable) file:
Non-Splittable FileA.gzip size 1GB
dfs.block.size=67108864(~64MB)
MapRed job against this file:
16 Blocks will converge on 1 mapper.
It's best to proactively avoid this situation since it means that the tasktracker will have to fetch 16 blocks of data most of which will not be local to the tasktracker.
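The two scenarios above can be reproduced with a few lines of Python. This is only an illustrative sketch: plan_mappers is a hypothetical helper, not a Hadoop API, and it assumes a non-empty file with split size equal to block size.

```python
import math

def plan_mappers(file_size: int, block_size: int, splittable: bool) -> list[int]:
    """Return, for each mapper, how many HDFS blocks it has to read."""
    blocks = math.ceil(file_size / block_size)
    if splittable:
        # One split per block, so every mapper reads a single block.
        return [1] * blocks
    # A non-splittable file (e.g. gzip) is one split: a single mapper
    # has to fetch every block, most of them from remote nodes.
    return [blocks]

GB = 1024 ** 3
block_size = 67108864  # 64 MB, as in the example above

print(plan_mappers(1 * GB, block_size, splittable=True))
print(plan_mappers(1 * GB, block_size, splittable=False))
```

The first call yields 16 mappers with one block each; the second yields a single mapper responsible for all 16 blocks.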
Spark reading an HDFS splittable file:
sc.textFile doesn't commence any reading. It simply defines a driver-resident data structure which can be used for further processing.
It is not until an action is called on an RDD that Spark will build up a strategy to perform all the required transforms (including the read) and then return the result.
If there is an action called to run the sequence, and your next transformation after the read is to map, then Spark will need to read a small section of lines of the file (according to the partitioning strategy based on the number of cores) and then immediately start to map it until it needs to return a result to the driver, or shuffle before the next sequence of transformations.
If your partitioning strategy (defaultMinPartitions) seems to be swamping the workers because the Java representation of your partition (an InputSplit in HDFS terms) is bigger than the available executor memory, then you need to specify the number of partitions to read as the second parameter to textFile. You can calculate the ideal number of partitions by dividing your file size by your target partition size (allowing for memory growth). A simple check that the file can be read would be:
sc.textFile(file, numPartitions)
.count()
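The partition-count calculation described above is simple arithmetic; a minimal Python sketch follows. The function name and the growth_factor parameter are assumptions made for illustration (the growth factor stands for how much larger the data becomes once it is held as Java objects, which you would have to estimate yourself).

```python
import math

def suggest_num_partitions(file_size_bytes: int,
                           target_partition_bytes: int,
                           growth_factor: float = 1.0) -> int:
    """File size divided by target partition size, rounded up.

    growth_factor is a hypothetical multiplier for in-memory expansion.
    """
    effective_size = file_size_bytes * growth_factor
    return max(1, math.ceil(effective_size / target_partition_bytes))

MB = 1024 ** 2
GB = 1024 ** 3

# 30 GB file, aiming for roughly 128 MB per partition on disk.
print(suggest_num_partitions(30 * GB, 128 * MB))
# Same file, assuming the data triples in size once in memory.
print(suggest_num_partitions(30 * GB, 128 * MB, growth_factor=3.0))
```

The resulting number is what you would pass as numPartitions in the textFile call; note that Spark treats that argument as a minimum, so the actual partition count may come out higher.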

Spark RDD partitions vs. Hadoop Splits

I am having a hard time understanding the difference between the RDD partitions and the HDFS Input Splits. So essentially when you submit a Spark application:
When the Spark application wants to read from HDFS, that file on HDFS will have input splits (of, let's say, 64 MB each, with the input splits present on different data nodes).
Now let's say the Spark application wants to load that file from HDFS using sc.textFile(PATH_IN_HDFS). The file is about 256 MB and has 4 input splits, where 2 of the splits are on data node 1 and the other 2 are on data node 2.
Now when Spark loads these 256 MB into its RDD abstraction, will it load each of the input splits (64 MB) into 4 separate RDDs (so that you have 2 RDDs with 64 MB of data on data node 1 and the other two RDDs with 64 MB of data on data node 2)? Or will the RDD further partition those Hadoop input splits? Also, how will these partitions then be redistributed? I do not understand whether there is a correlation between the RDD partitions and the HDFS input splits.
I'm pretty new to Spark, but splits are strictly related to MapReduce jobs. Spark loads the data into memory in a distributed fashion, and which machines load the data can depend on where the data is (read: it depends somewhat on where the data blocks are, which is very close to the split idea).
Spark's APIs allow you to think in terms of RDDs rather than splits.
You work on an RDD; how the data is distributed within the RDD is no longer the programmer's problem.
Your whole dataset, under Spark, is called an RDD.
Hope the below answer would help you.
When Spark reads a file from HDFS, it creates a single partition for a single input split.
If you have a 30 GB text file stored on HDFS, then with the default HDFS block size setting (128 MB) it would be stored in 240 blocks (30 × 1024 MB / 128 MB), which means that the RDD you read from this file would have 240 partitions.

Does Hadoop create InputSplits in parallel?

I have a large text file of around 13 GB. I want to process the file using Hadoop. I know that Hadoop uses FileInputFormat to create InputSplits, which are assigned to mapper tasks. I want to know whether Hadoop creates these InputSplits sequentially or in parallel. I mean, does it read the large text file sequentially on a single host and create split files which are then distributed to the datanodes, or does it read chunks of, say, 50 MB in parallel?
Does hadoop replicate the big file on multiple hosts before splitting it up?
Is it recommended that I split up the file into 50mb chunks to speed up the processing? There are many questions on appropriate split size for mapper tasks but not the exact split process itself.
Thanks
InputSplits are created on the client side, and a split is just a logical representation of part of the file: it contains only the file path and the start and end offset values (the values LineRecordReader's initialize function works from). Calculating this logical representation does not take much time, so there is no need to split the file into chunks yourself. The real execution happens at the mapper end, where it is done in parallel. The client places the input splits into HDFS, and the JobTracker takes it from there and allocates TaskTrackers depending on the splits. One mapper's execution does not depend on another's. The second mapper knows exactly where it has to start processing its split, so the mappers run in parallel.
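A rough Python sketch of why computing splits is cheap: each split is only a small record of (path, start, length), derived from the file size without reading any file contents. The LogicalSplit class, the compute_splits function and the path /data/big.txt are hypothetical names for illustration; Hadoop's real logic also accounts for block locations and allows a slightly larger final split, which this sketch ignores.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogicalSplit:
    """Metadata only: no file data is stored in a split."""
    path: str
    start: int   # byte offset where the split begins
    length: int  # number of bytes the split covers

def compute_splits(path: str, file_size: int, split_size: int) -> list[LogicalSplit]:
    """Carve the byte range [0, file_size) into consecutive splits."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append(LogicalSplit(path, offset, length))
        offset += length
    return splits

MB = 1024 ** 2
GB = 1024 ** 3

# A 13 GB file with 128 MB splits; only arithmetic is involved.
splits = compute_splits("/data/big.txt", 13 * GB, 128 * MB)
print(len(splits))
print(splits[0])
print(splits[-1])
```

Because every split carries its own start offset, each mapper can open the file and seek straight to its position, which is what lets the mappers run independently.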
I suppose you want to process the file using MapReduce, not Hadoop. Hadoop is a platform that provides tools to store and process large amounts of data.
When you store the file in HDFS (the Hadoop filesystem), it splits the file into multiple blocks. The block size is defined in the hdfs-site.xml file as dfs.block.size, in bytes. For example, if dfs.block.size=134217728 then your input file will be split into 128 MB blocks. This is how HDFS stores the data internally. To the user it always appears as a single file.
When you provide the input file (stored in HDFS) to MapReduce, it launches a mapper task for each block/split of the file. This is the default behavior.
You do not need to split the file into chunks; just store the file in HDFS and it will do the splitting for you.
First, let us understand what is meant by an input split.
When your text file is divided into blocks of 128 MB (the default) by HDFS, assume that the 10th line of the file is cut in two: the first half of the line is in the first block and the other half is in the second block. When you submit a map program, Hadoop understands that the last line of the 1st block (which becomes an input split here) is not complete, so it carries the second half of the 10th line over to the first input split. Which implies:
1) 1st input split = 1st block + 2nd part of the 10th line from the 2nd block
2) 2nd input split = 2nd block - 2nd part of the 10th line from the 2nd block.
This line-boundary handling is built into Hadoop. The split size itself follows the block size by default and can be adjusted through the minimum and maximum split size settings. The default block size in Hadoop v2 is 128 MB, and it is configurable, both cluster-wide and per file.
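The way a line that straddles two blocks ends up in exactly one split can be imitated in plain Python. The sketch below is a simplified model of what a line-oriented record reader does, not Hadoop's actual code, and read_split is a made-up name: a reader whose split does not start at offset 0 skips its first (possibly partial) line, and every reader keeps going past the end of its split until the line it is on is finished.

```python
def read_split(data: bytes, start: int, length: int) -> list[bytes]:
    """Return the lines owned by the split covering [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # Skip the first line: it may be a partial line, and the reader
        # of the previous split is responsible for it either way.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    lines = []
    # Read every line that begins at or before the end of the split,
    # even if the line itself runs into the next block.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        next_pos = len(data) if newline == -1 else newline + 1
        lines.append(data[pos:next_pos].rstrip(b"\n"))
        pos = next_pos
    return lines

data = b"alpha\nbravo\ncharlie\ndelta\n"   # 26 bytes
split_size = 10                            # tiny "blocks" for the demo

for start in range(0, len(data), split_size):
    length = min(split_size, len(data) - start)
    print(start, read_split(data, start, length))
```

With 10-byte splits, "bravo" is cut between the first and second split, yet it is returned once, by the first reader; across all three splits each of the four lines appears exactly once.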

Who splits the file in Hadoop? Is it the JobTracker?

I want to know:
When a client stores data in HDFS, who exactly performs the task of splitting the large file into smaller chunks?
Does the client write the data directly to the DataNodes? If so, when is the data split into 64 MB or 128 MB chunks?
The JobClient does that, not the JobTracker.
Job Client computes input splits on the data located in the input path
on the HDFS specified while running the job. The article then says the Job
Client copies the resources (jars and computed input splits) to the HDFS.
The input itself stays on the cluster. The client only computes on the meta information it gets from the namenode (block size, data length, block locations). These computed input splits carry meta information to the tasks, e.g. the block offset and the length to compute on.

How job client in hadoop compute inputSplits

I am trying to get some insight into the MapReduce architecture. I am consulting this article: http://answers.oreilly.com/topic/2141-how-mapreduce-works-with-hadoop/ and I have some questions regarding the JobClient component of the MapReduce framework. My questions are:
How does the JobClient compute the input splits on the data?
According to the material I am consulting, the Job Client computes input splits on the data located in the input path on HDFS specified while running the job. The article then says the Job Client copies the resources (jars and computed input splits) to HDFS. Now here is my question: when the input data is already in HDFS, why does the JobClient copy the computed input splits into HDFS?
Let's assume that the Job Client copies the input splits to HDFS. When the job is submitted to the JobTracker and the JobTracker initializes the job, why does it retrieve the input splits from HDFS?
Apologies if my question is not clear. I am a beginner. :)
No, the JobClient does not copy the input data to HDFS. You have quoted the answer yourself:
Job Client computes input splits on the data located in the input path
on the HDFS specified while running the job. The article then says the Job
Client copies the resources (jars and computed input splits) to the HDFS.
The input itself stays on the cluster. The client only computes on the meta information it gets from the namenode (block size, data length, block locations). These computed input splits carry meta information to the tasks, e.g. the block offset and the length to compute on.
Have a look at org.apache.hadoop.mapreduce.lib.input.FileSplit: it contains the file path, the start offset, and the length of the chunk that a single task will operate on as its input.
The serializable class you may also want to have a look at is: org.apache.hadoop.mapreduce.split.JobSplit.SplitMetaInfo.
This meta information will be computed for each task that will be run, and copied with the jars to the node that will actually execute this task.
The computation of input splits depends on the input format. For a typical textual input format, the generic formula to calculate the split size is
max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))
With the default settings, mapred.min.split.size < dfs.block.size < mapred.max.split.size, so
input split size = dfs.block.size
where
mapred.min.split.size = minimum split size
mapred.max.split.size = maximum split size
dfs.block.size = DFS block size
For the DB input format, the split size is
(total records / number of mappers)
That said, the number of input splits and their sizes are the meta information given to the mapper tasks and record readers.
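The split-size formula for textual input can be tried out directly. The Python sketch below is only a transcription of the formula quoted above; the function name is made up, and the default values for the minimum and maximum (1 byte and the largest signed 64-bit integer) are assumptions about a typical configuration rather than something stated in this thread.

```python
def compute_split_size(block_size: int,
                       min_split_size: int = 1,
                       max_split_size: int = 2 ** 63 - 1) -> int:
    """max(min_split_size, min(max_split_size, block_size))"""
    return max(min_split_size, min(max_split_size, block_size))

MB = 1024 ** 2
block_size = 128 * MB

# Defaults: min < block < max, so the split size equals the block size.
print(compute_split_size(block_size) // MB)
# Raising the minimum above the block size gives larger splits (fewer mappers).
print(compute_split_size(block_size, min_split_size=256 * MB) // MB)
# Lowering the maximum below the block size gives smaller splits (more mappers).
print(compute_split_size(block_size, max_split_size=32 * MB) // MB)
```

This shows the two practical levers: the minimum only matters when it exceeds the block size, and the maximum only matters when it is below it.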
