I want to know
when a client stores data in HDFS, who exactly splits the large file into smaller chunks?
Does the client write the data directly to the DataNodes? If so, when does the data get split into 64 MB or 128 MB blocks?
The JobClient does that, not the JobTracker.
Job Client computes input splits on the data located in the input path
on the HDFS specified while running the job. the article says then Job
Client copies the resources(jars and computed input splits) to the HDFS.
The input itself stays on the cluster. The client only computes on the meta information it got from the NameNode (block size, data length, block locations). These computed input splits carry meta information to the tasks, e.g. the block offset and the length to compute on.
Related
I have been trying to dive into how Hadoop HDFS decides what data is put into one block, and I can't seem to find a solid answer. We know that Hadoop automatically distributes data into blocks in HDFS across the cluster, but which data of each file is put together in a block? Is it just placed arbitrarily? And is this the same for a Spark RDD?
HDFS block behavior
I'll attempt to highlight by way of example the differences in blocks splits in reference to file size. In HDFS you have:
Splittable FileA size 1GB
dfs.block.size=67108864(~64MB)
MapRed job against this file:
16 splits and in turn 16 mappers.
Let's look at this scenario with a compressed (non-splittable) file:
Non-Splittable FileA.gzip size 1GB
dfs.block.size=67108864(~64MB)
MapRed job against this file:
16 Blocks will converge on 1 mapper.
It's best to proactively avoid this situation since it means that the tasktracker will have to fetch 16 blocks of data most of which will not be local to the tasktracker.
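The two scenarios above can be checked with a little arithmetic. This is only a sketch: it assumes the split size equals the block size and that "1GB" means 1 GiB, and the function names are made up for illustration.

```python
BLOCK_SIZE = 67108864      # dfs.block.size from the example (64 MiB)
FILE_SIZE = 1024 ** 3      # assumption: the 1GB file is exactly 1 GiB


def num_blocks(file_size, block_size):
    """Number of HDFS blocks needed to hold the file (ceiling division)."""
    return (file_size + block_size - 1) // block_size


def num_mappers(file_size, block_size, splittable):
    """Mappers launched, assuming split size == block size.

    A non-splittable file (e.g. gzip) is handed to a single mapper,
    which then has to read every block itself.
    """
    return num_blocks(file_size, block_size) if splittable else 1


print(num_mappers(FILE_SIZE, BLOCK_SIZE, splittable=True))
print(num_mappers(FILE_SIZE, BLOCK_SIZE, splittable=False))
```

The block count is the same in both cases; only the number of mappers that share those blocks changes.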
Spark reading an HDFS splittable file:
sc.textFile doesn't commence any reading. It simply defines a driver-resident data structure which can be used for further processing.
It is not until an action is called on an RDD that Spark will build up a strategy to perform all the required transforms (including the read) and then return the result.
If there is an action called to run the sequence, and your next transformation after the read is to map, then Spark will need to read a small section of lines of the file (according to the partitioning strategy based on the number of cores) and then immediately start to map it until it needs to return a result to the driver, or shuffle before the next sequence of transformations.
If your partitioning strategy (defaultMinPartitions) seems to be swamping the workers because the java representation of your partition (an InputSplit in HDFS terms) is bigger than available executor memory, then you need to specify the number of partitions to read as the second parameter to textFile. You can calculate the ideal number of partitions by dividing your file size by your target partition size (allowing for memory growth). A simple check that the file can be read would be:
sc.textFile(file, numPartitions)
.count()
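The "file size divided by target partition size" rule can be sketched as a small helper. The function name and the example sizes are assumptions for illustration; the result is what you would pass as the second argument to textFile, which Spark treats as a minimum number of partitions.

```python
def estimate_num_partitions(file_size_bytes, target_partition_bytes):
    """Partitions needed so that no partition exceeds the target size."""
    if target_partition_bytes <= 0:
        raise ValueError("target partition size must be positive")
    # Ceiling division, with at least one partition even for tiny files.
    return max(1, -(-file_size_bytes // target_partition_bytes))


MIB = 1024 ** 2
GIB = 1024 ** 3

# Hypothetical example: a 10 GiB file read in partitions of about 64 MiB.
num_partitions = estimate_num_partitions(10 * GIB, 64 * MIB)
print(num_partitions)
```

Choose the target size conservatively, since the in-memory representation of a partition is usually larger than its size on disk.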
I am new to Hadoop, so please excuse me if my questions are trivial.
Is the local file system different from HDFS?
While creating a MapReduce program, we set the input file path using the FileInputFormat.addInputPath() function. Does it split that data across multiple data nodes and also compute the input splits? If yes, how long will this data stay on the data nodes? And can we write a MapReduce program against existing data in HDFS?
1: HDFS is a solution for distributed storage; purely local storage runs into capacity ceilings and backup problems. HDFS treats the storage of the whole server cluster as one resource: the NameNode manages the directory tree and the block information, and the DataNodes are the containers that store the blocks. HDFS can be regarded as a higher-level abstraction over local storage, and it is easiest to understand as an answer to the core problems of distributed storage.
2: If we use Hadoop's FileInputFormat, it first calls open() on the FileSystem, which connects to the NameNode to get the block locations and returns them to the client. It then creates an FSDataInputStream to read from the different nodes one by one, and closes the FSDataInputStream at the end.
If we put data into HDFS, the data is split into multiple blocks and stored on different machines (when the file is bigger than 128 MB [64 MB]).
The data is persisted on hard disk.
So if your file is far too large for a single ordinary server and needs distributed computing, you can use HDFS.
HDFS is not your local filesystem - it is a distributed file system. This means your dataset can be larger than the maximum storage capacity of a single machine in your cluster. HDFS by default uses a block size of 64 MB (128 MB in Hadoop 2.x and later). Each block is replicated to 3 nodes in the cluster by default to provide redundancy (for example, against node failure). So with HDFS, you can think of your entire cluster as one large file system.
When you write a MapReduce program and set your input path, it will try to locate that path on the HDFS. The input is then automatically divided up into what is known as input splits - fixed size partitions containing multiple records from your input file. A Mapper is created for each of these splits. Next, the map function (which you define) is applied to each record within each split, and the output generated is stored in the local filesystem of the node where map function ran from. The Reducer then copies this output file to its node and applies the reduce function. In the case of a runtime error when executing map and the task fails, Hadoop will have the same mapper task run on another node and have the reducer copy that output.
The reducers use the outputs generated from all the mapper tasks, so by this point, the reducers are not concerned with the input splits that were fed to the mappers.
Grouping answers as per the questions:
HDFS vs local filesystem
Yes, HDFS and local file system are different. HDFS is a Java-based file system that is a layer above a native filesystem (like ext3). It is designed to be distributed, scalable and fault-tolerant.
How long do data nodes keep data?
When data is ingested into HDFS, it is split into blocks, replicated 3 times (by default) and distributed throughout the cluster data nodes. This process is all done automatically. This data will stay in the data nodes till it is deleted and finally purged from trash.
InputSplit calculation
FileInputFormat.addInputPath() specifies the HDFS file or directory from which files should be read and sent to mappers for processing. Before this point is reached, the data should already be available in HDFS, since it is now attempting to be processed. So the data files themselves have been split into blocks and replicated throughout the data nodes. The mapping of files, their blocks and which nodes they reside on - this is maintained by a master node called the NameNode.
Now, based on the input path specified by this API, Hadoop will calculate the number of InputSplits required for processing the file/s. Calculation of InputSplits is done at the start of the job by the MapReduce framework. Each InputSplit then gets processed by a mapper. This all happens automatically when the job runs.
MapReduce on existing data
Yes, a MapReduce program can run on existing data in HDFS.
I am having a hard time understanding the difference between the RDD partitions and the HDFS Input Splits. So essentially when you submit a Spark application:
When the Spark application wants to read from HDFS, that file on HDFS will have input splits (of let's say 64 mb each and each of these input splits are present on different data nodes).
Now let's say the Spark application wants to load that file from HDFS using the (sc.textFile(PATH_IN_HDFS)). And the file is about 256 mb and has 4 input splits where 2 of the splits are on data node 1 and the other 2 splits are on data node 2.
Now when Spark loads this 256 mb into its RDD abstraction, will it load each of the input splits (64mb) into 4 separate RDDs (where you will have 2 RDDs with 64mb of data on data node 1 and the other two RDDs of 64mb of data on data node 2)? Or will the RDD further partition those input splits on Hadoop? Also, how will these partitions be redistributed then? I do not understand whether there is a correlation between the RDD partitions and the HDFS input splits.
I'm pretty new to Spark, but splits are strictly related to MapReduce jobs. Spark loads the data into memory in a distributed fashion, and which machines load the data can depend on where the data is (read: it somewhat depends on where the data blocks are, which is very close to the idea of a split).
Spark's APIs allow you to think in terms of RDDs and no longer in terms of splits.
You work on an RDD; how the data is distributed within the RDD is no longer the programmer's problem.
Your whole dataset, under spark, is called RDD.
Hope the below answer would help you.
When Spark reads a file from HDFS, it creates a single partition for a single input split.
If you have a 30GB text file stored on HDFS, then with the default HDFS block size setting (128MB) it would be stored in 240 blocks (30 × 1024 MB / 128 MB), which means that the RDD you read from this file would have 240 partitions.
I am using Hadoop 2.6 to process a large amount of data, so I have a question about how Hadoop reads all the data and then splits it into chunks. I understand that the data is first uploaded to HDFS and is then split into N chunks depending on the chunk size. If I have 1 TB of text for a word-count algorithm, I suppose that Hadoop first loads the file into memory, reads it, and somehow reads x rows at a time and then copies that data into a chunk.
If my assumption is wrong, what is the correct way? I think loading the data into memory would have to be done in pieces. How is it done internally?
Thanks
Cheers
Your data upload to HDFS statement is correct.
When the WordCount MapReduce job is launched, one Mapper task is assigned and executed for each chunk (block). The output of the Mappers is sent to the Reducers after the sort-shuffle phase. During sort-shuffle, the Mapper output is partitioned, sorted and received (copied) by the Reducers.
The MapReduce framework does not read any data and copy it into chunks. That was already done when you stored the file in HDFS.
When you upload the data, it is divided into blocks based on your block size and stored on different nodes.
But when you launch MapReduce jobs, you should know about splits.
It is not number of blocks = number of mappers;
it is number of splits = number of mappers.
Splits are a logical division and blocks are a physical division.
Data is read in splits. By default split size = block size, but we can change this.
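To make the logical/physical distinction concrete, here is a small sketch that cuts the same file once by block size and once by a different split size. The file size and both chunk sizes are made-up example values, and the sketch ignores any extra rules a real input format may apply (for instance around the size of the last split).

```python
MB = 1024 * 1024


def ranges(total_size, chunk_size):
    """Return (offset, length) pairs that cover total_size in chunk_size steps."""
    return [
        (offset, min(chunk_size, total_size - offset))
        for offset in range(0, total_size, chunk_size)
    ]


file_size = 300 * MB                  # hypothetical file
blocks = ranges(file_size, 128 * MB)  # physical division, fixed at upload time
splits = ranges(file_size, 64 * MB)   # logical division, chosen per job

print(len(blocks), "blocks")
print(len(splits), "splits, so that many mappers")
```

The blocks on disk do not move when the split size changes; only the byte ranges handed to each mapper do.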
I am trying to get some insight into the MapReduce architecture. I am consulting this article: http://answers.oreilly.com/topic/2141-how-mapreduce-works-with-hadoop/. I have some questions regarding the JobClient component of the MapReduce framework. My questions are:
How does the JobClient compute the input splits on the data?
According to the material I am consulting, the Job Client computes input splits on the data located in the input path on the HDFS specified while running the job. The article then says the Job Client copies the resources (jars and computed input splits) to the HDFS. Now here is my question: when the input data is already in HDFS, why does the JobClient copy the computed input splits into HDFS?
Let's assume that the Job Client copies the input splits to the HDFS. When the job is submitted to the Job Tracker and the Job Tracker initializes the job, why does it retrieve the input splits from HDFS?
Apologies if my question is not clear. I am a beginner. :)
No, the JobClient does not copy the input splits to the HDFS. You have quoted the answer yourself:
Job Client computes input splits on the data located in the input path
on the HDFS specified while running the job. the article says then Job
Client copies the resources(jars and computed input splits) to the HDFS.
The input itself stays on the cluster. The client only computes on the meta information it got from the NameNode (block size, data length, block locations). These computed input splits carry meta information to the tasks, e.g. the block offset and the length to compute on.
Have a look at org.apache.hadoop.mapreduce.lib.input.FileSplit; it contains the file path, the start offset and the length of the chunk a single task will operate on as its input.
The serializable class you may also want to have a look at is: org.apache.hadoop.mapreduce.split.JobSplit.SplitMetaInfo.
This meta information will be computed for each task that will be run, and copied with the jars to the node that will actually execute this task.
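As a rough illustration of what such split meta information looks like, the sketch below models a split as a path, a start offset, a length and a list of hosts, with no file contents. It is not the Hadoop class itself: SplitInfo, compute_splits, the path and the host names are all hypothetical, and the split size is assumed to equal the block size.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SplitInfo:
    """Metadata only: where a task should read, not the data itself."""
    path: str
    start: int
    length: int
    hosts: Tuple[str, ...]


def compute_splits(path, file_length, block_size, hosts_per_block):
    """Build one SplitInfo per block from metadata a NameNode could supply."""
    splits = []
    for index, start in enumerate(range(0, file_length, block_size)):
        length = min(block_size, file_length - start)
        splits.append(SplitInfo(path, start, length, tuple(hosts_per_block[index])))
    return splits


MB = 1024 * 1024
example_splits = compute_splits(
    "/data/input.txt",                      # hypothetical HDFS path
    150 * MB,                               # file length
    64 * MB,                                # block size
    [["dn1", "dn2"], ["dn2", "dn3"], ["dn1", "dn3"]],  # hypothetical hosts
)
for split in example_splits:
    print(split)
```

Each record is a few dozen bytes regardless of how large the underlying block is, which is why it is cheap to ship alongside the job jars.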
The computation of input split depends on the Input Format. For a typical textual Input format, the generic formula to calculate the split size is
max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))
or by default
Input split size = dfs.block.size, because mapred.min.split.size < dfs.block.size < mapred.max.split.size
Where
mapred.min.split.size = Minimum split size
mapred.max.split.size = Maximum split size
dfs.block.size = DFS block size
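The formula is easy to try out with a few values. The sketch below is a direct transcription of max(min, min(max, block)); the numbers used for the minimum and maximum are illustrative assumptions, not the cluster's actual defaults.

```python
MB = 1024 * 1024


def split_size(min_split, max_split, block_size):
    """max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))"""
    return max(min_split, min(max_split, block_size))


block = 128 * MB

# min < block < max: the split size is simply the block size.
default_case = split_size(1, 1024 * MB, block)

# Raising the minimum above the block size gives larger splits (fewer mappers).
larger_splits = split_size(256 * MB, 1024 * MB, block)

# Lowering the maximum below the block size gives smaller splits (more mappers).
smaller_splits = split_size(1, 32 * MB, block)

print(default_case // MB, larger_splits // MB, smaller_splits // MB)
```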
For DB Input Format, the split size is
(total records / number of mappers)
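For the record-based case, the division can be sketched as row ranges per mapper. The function name is hypothetical, and letting the last mapper absorb the remainder is an assumption made here for illustration, not something stated above.

```python
def db_split_ranges(total_records, num_mappers):
    """Return (start, end) row ranges, end exclusive, one per mapper."""
    if num_mappers <= 0:
        raise ValueError("need at least one mapper")
    chunk = total_records // num_mappers
    result = []
    for i in range(num_mappers):
        start = i * chunk
        # Assumption: the last mapper also takes the leftover rows.
        end = total_records if i == num_mappers - 1 else start + chunk
        result.append((start, end))
    return result


print(db_split_ranges(10, 3))
```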
With the above said, number of input splits and size are the meta information given to the mapper tasks and Record readers.