In MapReduce, the output produced by the mappers is called intermediate data.
Is intermediate data also replicated?
Is intermediate data temporary?
When does intermediate data get deleted? Is it deleted automatically, or do we need to delete it explicitly?
A Mapper's spilled files are stored on the local file system of the worker node where the Mapper is running. Similarly, data streamed from one node to another is stored on the local file system of the worker node where the receiving task is running.
This local file system path is specified by the hadoop.tmp.dir property, which defaults to /tmp/hadoop-${user.name}.
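If you want these temporary files kept somewhere other than the default, you can override the property in core-site.xml. A minimal sketch; the path shown is purely illustrative:

<!-- core-site.xml: base directory for Hadoop's temporary files -->
<property>
  <name>hadoop.tmp.dir</name>
  <!-- hypothetical path; pick a disk with enough space for spills -->
  <value>/var/hadoop/tmp</value>
</property>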
After the completion or failure of the job, the temporary location used on the local file system gets cleared automatically. You don't have to perform any clean-up; it's handled by the framework.
Suppose that:
- the input to a Spark application is a 1 GB text file on HDFS,
- the HDFS block size is 16 MB, and
- the Spark cluster has 4 worker nodes.
In the first stage of the application, we read the file from HDFS with sc.textFile("hdfs://..."). Since the block size is 16 MB, this stage will have 64 tasks (one task per partition/block). These tasks will be dispatched to the cluster nodes. My questions are:
Does each individual task fetch its own block from HDFS, or does the driver fetch the data for all tasks before dispatching them and then send it to the nodes?
If each task fetches its own block from HDFS by itself, does it ask HDFS for a specific block, or does it fetch the whole file and then process its own block?
Suppose that HDFS doesn't have a copy of the text file on one of the nodes, say node one. Does HDFS make a copy of the file on node one the first time a task on node one asks for a block of the file? If not, does it mean that each time a task asks for a block of the file from node one, it has to wait for HDFS to fetch data from other nodes?
Thanks!
In general, Spark's access to HDFS is probably as efficient as you think it should be. Spark uses Hadoop's FileSystem object to access data in HDFS.
Does each individual task fetch its own block from HDFS, or does the driver fetch the data for all tasks before dispatching them and then send it to the nodes?
Each task fetches its own block from HDFS.
If each task fetches its own block from HDFS by itself, does it ask HDFS for a specific block, or does it fetch the whole file and then process its own block?
It pulls a specific block. It does not scan the entire file to get to the block.
Suppose that HDFS doesn't have a copy of the text file on one of the nodes, say node one. Does HDFS make a copy of the file on node one the first time a task on node one asks for a block of the file? If not, does it mean that each time a task asks for a block of the file from node one, it has to wait for HDFS to fetch data from other nodes?
Spark will attempt to assign tasks based on the location preferences of the partitions in the RDD. In the case of a HadoopRDD (which you get from sc.textFile), the location preference for each partition is the set of datanodes that hold that block locally. If a task cannot be run local to its data, it will run on a separate node and the block will be streamed from a datanode that has it to the task that is processing it.
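Under the hood, this is the same block-location information that Hadoop's FileSystem API exposes and that Spark consults when computing each partition's preferred locations. A minimal sketch using only the standard HDFS Java API; the namenode URI and file path are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        // Hypothetical namenode URI and file path; substitute your own.
        FileSystem fs = FileSystem.get(
                new java.net.URI("hdfs://namenode:8020"), new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/input.txt"));
        // One BlockLocation per block; each lists the datanodes holding a replica.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + loc.getOffset()
                    + " length=" + loc.getLength()
                    + " hosts=" + String.join(",", loc.getHosts()));
        }
        fs.close();
    }
}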
While a MapReduce job runs, the map task results are stored on the local file system, and the final results from the reducer are stored in HDFS. My questions are:
What is the reason that map task results are stored on the local file system?
In the case of a MapReduce job where there is no reduce phase (only a map phase), where is the final result stored?
1) Mapper output is stored on the local FS because, in most scenarios, we are interested in the output given by the Reducer phase (also known as the final output). The Mapper's <K,V> pairs are intermediate output, which is of little importance once passed to the Reducer. If we stored Mapper output in HDFS, it would be a waste of storage, because HDFS has a replication factor (3 by default), so three times the space would be taken up by data that is not required for any further processing.
2) In the case of a map-only job, the final output is stored in HDFS.
1) After the TaskTracker (TT) has run the mapper logic, and before the output is sent to the sort and shuffle phase, the TT stores the output in temporary files on the local file system (LFS).
This avoids restarting the entire MR job in case of a network glitch: once stored on the LFS, the mapper output can be picked up directly from there. This data is called intermediate data.
The intermediate data is deleted once the job is completed. Otherwise, the LFS would grow in size with intermediate data from different jobs as time progresses.
Data locality (running a task on the node that holds its input) applies only to the Mapper phase, not to the sort & shuffle or Reducer phases.
2) When there is no reducer phase, the map output is written to HDFS as the final result.
What is the reason that map task results are stored on the local file system?
Mapper output is temporary output and is relevant only to the Reducer. Storing temporary output in HDFS (with its replication factor) would be overkill. For this reason, the Hadoop framework stores the output of the Mapper on the local file system instead of HDFS. It saves a lot of disk space.
One more important point, from the Apache tutorial page:
All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output.
The Mapper outputs are sorted and then partitioned per Reducer
In the case of a MapReduce job where there is no reduce phase (only a map phase), where is the final result stored?
You can find more details about this in the Apache tutorial page:
Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired.
In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by FileOutputFormat.setOutputPath(Job, Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
If the number of Reducers is greater than 0, mapper outputs are stored on the local file system and sorted before being sent to the Reducers. If the number of Reducers is 0, the mapper outputs are stored in HDFS without sorting.
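To make this concrete, here is a minimal sketch of a map-only job: setting the number of reduce tasks to zero sends the mapper output straight to HDFS. The class and path names are hypothetical, and the mapper simply passes records through:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    // Identity mapper: emits each input record unchanged, for illustration only.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only demo");
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0); // no reduce phase: map output goes directly to HDFS, unsorted
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}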
In Hadoop, I understand that the master node (Namenode) is responsible for storing the blocks of data on the slave machines (Datanodes).
When we use -copyToLocal or -get from the master, files can be copied from HDFS to the local storage of the master node. Is there any way the slaves can copy the blocks (data) that are stored on them to their own local file systems?
For example, a file of 128 MB could be split among 2 slave nodes, each storing 64 MB. Is there any way for a slave to identify and load its chunk of data to its local file system? If so, how can this be done programmatically? Can the commands -copyToLocal or -get be used in this case as well? Please help.
Short Answer: No
The data/files cannot be copied directly from the Datanodes. The reason is that Datanodes store the data but don't have any metadata about the stored files. For them, the blocks are just bits and bytes. The metadata of the files is stored on the Namenode. This metadata contains all the information about the files (name, size, etc.). Along with this, the Namenode keeps track of which blocks of a file are stored on which Datanodes. The Datanodes are also not aware of the ordering of the blocks when a file is split into multiple blocks.
Can the commands -copyToLocal or -get be used in this case also?
Yes, you can simply run these from the slave. The slave will then contact the namenode (if you've configured it properly) and download the data to your local filesystem.
What it doesn't do is a "short-circuit" copy, in which it would just copy the raw blocks between directories. There is also no guarantee it will read the blocks from the local machine at all, as your commandline client doesn't know its location.
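For example, assuming a Hadoop client is configured on the slave (the paths below are hypothetical), either command works from any node:

$ hdfs dfs -get /user/hadoop/input/file.txt /tmp/file.txt
$ hdfs dfs -copyToLocal /user/hadoop/input/file.txt /tmp/file2.txt   # equivalent to -get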
HDFS blocks are stored on the slaves' local FS only. You can dig down into the directory defined by the dfs.datanode.data.dir property (dfs.data.dir in older releases).
But you won't get any benefit from reading the blocks directly (without the HDFS API). Also, reading and editing the block files directly can corrupt the file on HDFS.
If you want to store data on a particular slave's local disk, you will have to implement your own logic for maintaining block metadata (which the Namenode already does for you).
Can you elaborate on why you want to distribute blocks yourself when Hadoop takes care of all the challenges of distributed data?
You can copy a particular file or directory from one HDFS location to another by using distcp.
Usage: hadoop distcp <source-uri> <destination-uri>
I want to move some files from one location to another [both locations are on HDFS], and I need to verify that the data has moved correctly.
In order to compare the data moved, I was thinking of calculating a hash code for both files and then comparing them for equality. If equal, I would consider the data movement correct; otherwise, the data movement did not happen correctly.
But I have a couple of questions regarding this.
Do I need to use the hash code technique at all in the first place? I am using the MapR distribution, and I read somewhere that data movement, when done, implements hashing of the data in the backend and makes sure that it has been transferred correctly. So is it guaranteed that when data is moved inside HDFS, it will be consistent and no anomaly will be introduced during the move?
Is there any other way I can make sure that the data moved is consistent across locations?
Thanks in advance.
You are asking about data copying. Just use DistCp.
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. It uses MapReduce to effect its distribution, error handling and recovery, and reporting.
# sample example
$ hadoop distcp hdfs://nn1:8020/foo/bar \
    hdfs://nn2:8020/bar/foo
This will expand the namespace under /foo/bar on nn1 into a temporary file, partition its contents among a set of map tasks, and start a copy on each TaskTracker from nn1 to nn2.
EDIT
DistCp uses MapReduce to effect its distribution, error handling and recovery, and reporting.
After a copy, it is recommended that one generates and cross-checks a listing of the source and destination to verify that the copy was truly successful. Since DistCp employs both MapReduce and the FileSystem API, issues in or between any of the three could adversely and silently affect the copy.
EDIT
The common method I use to check the source and destination files is to compare the number of files and the size of each file. This can be done by generating a manifest at the source, then checking both the count and the sizes at the destination.
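As a quick sketch of that check (the paths are hypothetical), the HDFS shell can report file counts and sizes on both sides:

$ hdfs dfs -count /data/source /data/dest   # columns: dir count, file count, content size, path
$ hdfs dfs -du -s /data/source /data/dest   # total size per path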
In HDFS, a move doesn't physically move the data (blocks) across the datanodes. It only changes the namespace in the HDFS metadata. For copying data from one HDFS location to another, however, we have two ways:
1) plain copy (cp)
2) parallel copy (distcp)
With a plain copy, the integrity of the blocks is not checked. If you want data integrity while copying a file from one location to another within the same HDFS cluster, use checksums, either by modifying the FsShell.java class or by writing your own class using the HDFS Java API.
In the case of distcp, HDFS checks the data integrity while copying data from one HDFS cluster to another.
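A minimal sketch of the do-it-yourself route using the HDFS Java API's built-in file checksums. The paths are hypothetical, and the comparison is only meaningful when both files were written with the same block size and checksum settings:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChecksumCompare {
    public static void main(String[] args) throws Exception {
        // Assumes fs.defaultFS points at an HDFS cluster, where
        // getFileChecksum is supported (it may return null on other filesystems).
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical source and destination of the move/copy.
        FileChecksum src = fs.getFileChecksum(new Path("/data/source/file.txt"));
        FileChecksum dst = fs.getFileChecksum(new Path("/data/dest/file.txt"));
        // HDFS computes a composite checksum (MD5 of per-block CRCs by default).
        System.out.println(src.equals(dst) ? "match" : "MISMATCH");
        fs.close();
    }
}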
From various blogs I have read, I understand that HDFS is another layer that exists on top of the local file system of a computer.
I have also installed Hadoop, but I have trouble understanding the existence of the HDFS layer over the local file system.
Here is my question:
Consider that I am installing Hadoop in pseudo-distributed mode. What happens under the hood during this installation? I added a tmp.dir parameter in the configuration files. Is this the single folder that the namenode daemon talks to when it attempts to access the datanode?
OK, let me give it a try. When you configure Hadoop, it lays down a virtual FS on top of your local FS, which is the HDFS. HDFS stores data as blocks (similar to the local FS, but much, much bigger by comparison) in a replicated fashion. But the HDFS directory tree, or filesystem namespace, is analogous to that of the local FS. When you start writing data into HDFS, it eventually gets written onto the local FS, but you can't see it there directly.
The temp directory actually serves 3 purposes:
1- The directory where the namenode stores its metadata, with default value ${hadoop.tmp.dir}/dfs/name; it can be specified explicitly by dfs.name.dir. If you specify dfs.name.dir, the namenode metadata will be stored in the directory given as the value of this property.
2- The directory where HDFS data blocks are stored, with default value ${hadoop.tmp.dir}/dfs/data; it can be specified explicitly by dfs.data.dir. If you specify dfs.data.dir, the HDFS data will be stored in the directory given as the value of this property.
3- The directory where the secondary namenode stores its checkpoints, with default value ${hadoop.tmp.dir}/dfs/namesecondary; it can be specified explicitly by fs.checkpoint.dir.
So, it's always better to use proper dedicated locations as the values for these properties for a cleaner setup.
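For instance, a minimal sketch of such dedicated locations; the paths are hypothetical, and placing all three properties in hdfs-site.xml follows common convention:

<!-- hdfs-site.xml -->
<property>
  <name>dfs.name.dir</name>
  <value>/var/hadoop/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/var/hadoop/dfs/data</value>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/var/hadoop/dfs/namesecondary</value>
</property>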
When access to a particular block of data is required, the metadata stored in the dfs.name.dir directory is searched, and the location of that block on a particular datanode is returned to the client (the block itself lives somewhere under the dfs.data.dir directory on that node's local FS). The client then reads the data directly from there (the same holds for writes as well).
One important point to note here is that HDFS is not a physical FS. It is rather a virtual abstraction on top of your local FS, which can't be browsed like the local FS. You need to use the HDFS shell, the HDFS web UI, or the available APIs to do that.
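For example (with hypothetical paths), the HDFS namespace is browsed through the shell, not through ls on the local disk:

$ hdfs dfs -ls /user/hadoop              # browse the HDFS namespace
$ hdfs dfs -cat /user/hadoop/file.txt    # read a file through HDFS
$ ls /var/hadoop/dfs/data                # local FS under dfs.data.dir: only raw block files, not your file names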
HTH
When you install Hadoop in pseudo-distributed mode, all the HDFS daemons (namenode, datanode, and secondary namenode) run on the same machine. The temp dir which you configure is where the datanode stores the data. So when you look at it from the HDFS point of view, your data is still stored in blocks and read in blocks; each HDFS block is much bigger (an aggregation of many local file system level blocks).