As mentioned in Hadoop definitive guide, during submission of an MR job, Input splits get computed and then, get copied to JobTracker's FileSystem. However, it does not make sense to me if the data is really huge. This copy will take a lot of time and also, if the node running JobTracker does not have enough space, what would happen to this copy? Please clarify this processing framework.
Thanks in advance.
InputSplits are just a logical abstraction of block boundaries. Generally a InputSplit contains the following information:
Path to the file
Block start position
Number of bytes in the file to process
List of hosts containing the blocks for file being processed
For a given job its the responsibility of the JobClient to compute the input splits information (which is just an ArrayList of above stated FileSplit objects) by calling writeSplits method which internally calls the InputFormat's getSplits method, once computed this information is copied to HDFS from where the JobTracker will read and will schedule the mappers based on data-locality.
If you are interested in how the splits themselves are calculated take a look at the FileInputFormat.getSplits method.
Related
Couldn't find enough information on internet so asking here:
Assuming I'm writing a huge file to disk, hundreds of Terabytes, which is a result of mapreduce (or spark or whatever). How would mapreduce write such a file to HDFS efficiently (potentially parallel?) which could be read later in a parallel way as well?
My understanding is that HDFS is simply block based (128MB e.g.). so in order to write the second block, you must have wrote the first block (or at least determine what content will go to block 1). Let's say it's a CSV file, it is quite possible that a line in the file will span two blocks -- how could we read such CSV to different mapper in mapreduce? Does it have to do some smart logic to read two blocks, concat them and read the proper line?
Hadoop uses RecordReaders and InputFormats as the two interfaces which read and understand bytes within blocks.
By default, in Hadoop MapReduce each record ends on a new line with TextInputFormat, and for the scenario where just one line crosses the end of a block, the next block must be read, even if it's just literally the \r\n characters
Writing data is done from reduce tasks, or Spark executors, etc, in that each task is responsible for writing only a subset of the entire output. You'll generally never get a single file for non-small jobs, and this isn't an issue because the input arguments to most Hadoop processing engines are meant to scan directories, not point at single files
Distcp between/within clusters are Map-Reduce jobs. My assumption was, it copies files on the input split level, helping with copy performance since a file will be copied by multiple mappers working on multiple "pieces" in parallel.
However when I was going through the documentation of Hadoop Distcp, it seems Distcp will only work on the file level.
Please refer to here: hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html
According to the distcp doc, the distcp will only split the list of files, instead of the files themselves, and give the partitions of list to the mappers.
Can anyone tell how exactly this will work?
additional question: if a file is assigned to only one mapper, how does the mapper find all the input splits on one node that it's running on?
For a single file of ~50G size, 1 map task will be triggered to copy the data since files are the finest level of granularity in Distcp.
Quoting from the documentation:
Why does DistCp not run faster when more maps are specified?
At
present, the smallest unit of work for DistCp is a file. i.e., a file
is processed by only one map. Increasing the number of maps to a value
exceeding the number of files would yield no performance benefit. The
number of maps launched would equal the number of files.
UPDATE
The block locations of the file is obtained from the namenode during mapreduce. On Distcp, each Mapper will be initiated, if possible, on the node where the first block of the file is present. In cases where the file is composed of multiple splits, they will be fetched from the neighbourhood if not available on the same node.
I have some questions regarding the blocks in Hadoop. I read that Hadoop uses HDFS which will creates blocks of specific size.
First Question Are the blocks physically exist on the Harddisk on the normal file system like NTFS i.e. can we see the blocks on the hosting filesystem (NTFS) or only it can be seen using the hadoop commands?
Second Question Does hadoop create the blocks before running the tasks i.e. blocks exist from the beginning whenever there is a file, OR hadoop creates the blocks only when running the task.
Third Question Will the blocks be determined and created before splitting (i.e. getSplits method of InputFormat class) regardless of the number of splits or after depending on the splits?
Forth Question Are the blocks before and after running the task same or it depends on the configuration, and is there two types of blocks one for storing the files and one for grouping the files and sending them over network to data nodes for executing the task?
1.Are the blocks physically exist on the Harddisk on the normal file system like NTFS i.e. can we see the blocks on the hosting filesystem (NTFS) or only it can be seen using the hadoop commands?
Yes. Blocks exist physically. You can use commands like hadoop fsck /path/to/file -files -blocks
Refer below SE questions for commands to view blocks :
Viewing the number of blocks for a file in hadoop
2.Does hadoop create the blocks before running the tasks i.e. blocks exist from the beginning whenever there is a file, OR hadoop creates the blocks only when running the task.
Hadoop = Distributed storage ( HDFS) + Distributed processing ( MapReduce & Yarn).
A MapReduce job works on input splits => The input splits are are created from Data blocks in Datanodes. Data blocks are created during write operation of a file. If you are running a job on existing files, data blocks are pre-creared before the job and InputSplits are created during Map operation. You can think data block as physical entity and InputSplit as logical entity. Mapreduce job does not change input data blocks. Reducer generates output data as new data blocks.
Mapper process input splits and emit output to Reducer job.
3.Third Question Will the blocks be determined and created before splitting (i.e. getSplits method of InputFormat class) regardless of the number of splits or after depending on the splits?
Input is already available with physicals DFS blocks. A MapReduce job works in InputSplit. Blocks and InputSplits may or may not be same. Block is a physical entity and InputSplit is logical entity. Refer to below SE question for more details :
How does Hadoop perform input splits?
4.Forth Question Are the blocks before and after running the task same or it depends on the configuration, and is there two types of blocks one for storing the files and one for grouping the files and sending them over network to data nodes for executing the task?
Mapper input : Input blocks pre-exists. Map process starts on input blocks/splits, which have been stored in HDFS before commencement of Mapper job.
Mapper output : Not stored in HDFS and it does not make sense to store intermediate results on HDFS with replication factor of X more than 1.
Reducer output: Reducer output is stored in HDFS. Number of blocks will depend on size of reducer output data.
Are the blocks physically exist on the Harddisk on the normal file system like NTFS i.e. can we see the blocks on the hosting filesystem (NTFS) or only it can be seen using the hadoop commands?
Yes, the blocks exist physically on disk across the datanodes in your cluster. I suppose you could "see" them if you were on one of the datanodes and you really wanted to, but it would likely not be illuminating. It would only be a random 128m (or whatever dfs.block.size is set to in hdfs-site.xml) fragment of the file with no meaningful filename. The hdfs dfs commands enable you to treat HDFS as a "real" filesystem.
Does hadoop create the blocks before running the tasks i.e. blocks exist from the beginning whenever there is a file, OR hadoop creates the blocks only when running the task.
Hadoop takes care of splitting the file into blocks and distributing them among the datanodes when you put a file in HDFS (through whatever method applies to your situation).
Will the blocks be determined and created before splitting (i.e. getSplits method of InputFormat class) regardless of the number of splits or after depending on the splits?
Not entirely sure what you mean, but the blocks exist before, and irrespective of, any processing you do with them.
Are the blocks before and after running the task same or it depends on the configuration, and is there two types of blocks one for storing the files and one for grouping the files and sending them over network to data nodes for executing the task?
Again, blocks in HDFS are determined before any processing is done, if any is done at all. HDFS is simply a way to store a large file in a distributed fashion. When you do processing, for example with a MapReduce job, Hadoop will write intermediate results to disk. This is not related to the blocking of the raw file in HDFS.
I'm a hadoop newbie. While going through hadoop example for a similar implementation in a rather large cluster, I was wondering why the grep example that comes along with hadoop code, why do they have one map per line ?
I know that it makes sense from the perspective of a teaching example. But in a real hadoop cluster, where a grep is to be done on an industry(1 PB log files) scale, is it worth creating a map() per line? Is the overhead of creating a map(), and the tasktracker keeping track of it and the associated bandwidth usage justified if we create a map per line?
A separate Map task will not be done for every line; You are confusing the programming model for MapReduce with the execution model.
When you implement a mapper, you are implementing a function that operates on a single piece of data (let's say a line in a log file). The hadoop framework takes care of essentially looping over all your log files, reading each line, and passing that line into your mapper.
MapReduce allows you to write your code in such a way that you are dealing with an abstraction that's useful: a line in a log file is a good example. The advantage of using something like Hadoop is that it will take care of the parallelization of this code for you: It will distribute your program out to a bunch of processes that will execute it (TaskTracker) and those TaskTrackers will read chunks llof data from the HDFS nodes that store it (Data Nodes).
I am trying to understand data locality as it relates to Hadoop's Map/Reduce framework. In particular I am trying to understand what component handles data locality (i.e. is it the input format?)
Yahoo's Developer Network Page states "The Hadoop framework then schedules these processes in proximity to the location of data/records using knowledge from the distributed file system." This seems to imply that the HDFS input format will perhaps query the name node to determine which nodes contain the desired data and will start the map tasks on those nodes if possible. One could imagine a similar approach could be taken with HBase by querying to determine which regions are serving certain records.
If a developer writes their own input format would they be responsible for implementing data locality?
You're right. If you're looking at the FileInputFormat class and the getSplits() method. It searches for the Blocklocations:
BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
This implies the FileSystem query. This happens inside the JobClient, the results getting written into a SequenceFile (actually it's just raw byte code).
So the Jobtracker reads this file later on while initializing the job and is pretty much just assigning a task to an inputsplit.
BUT the distribution of the data is the NameNodes job.
To your question now:
Normally you are extending from the FileInputFormat. So you will be forced to return a list of InputSplit, and in the initialization step it is required for such a thing to set the location of the split. For example the FileSplit:
public FileSplit(Path file, long start, long length, String[] hosts)
So actually you don't implement the data locality itself, you are just telling on which host the split can be found. This is easily queryable with the FileSystem interface.
Mu understanding is that data locality is jointly determined by HDFS and the InputFormat. The former determines (via rack awareness) and stores the location of HDFS blocks across datanodes while the latter will determine which blocks are associated with which split. The jobtracker will try to optimize which splits are delivered to which map task by making sure that the blocks associated for each split (1 split to 1 map task mapping) are local to the tasktracker.
Unfortunately, this approach to guaranteeing locality is preserved in homogeneous clusters but would break down in inhomogeneous ones i.e. ones where there are different sizes of hard disks per datanode. If you want to dig deeper on this you should read this paper (Improving MapReduce performance through data placement in heterogeneous hadoop clusters) that also touches on several topics relative to your question.