How does a Spark task access HDFS?

Suppose that:
- the input to a Spark application is a 1 GB text file on HDFS,
- the HDFS block size is 16 MB,
- the Spark cluster has 4 worker nodes.
In the first stage of the application, we read the file from HDFS by calling sc.textFile("hdfs://..."). Since the block size is 16 MB, this stage will have 64 tasks (one task per partition/block). These tasks will be dispatched to the cluster nodes. My questions are:
Does each individual task fetch its own block from HDFS, or does the driver fetch the data for all tasks before dispatching them and then send the data to the nodes?
If each task fetches its own block from HDFS by itself, does it ask HDFS for a specific block, or does it fetch the whole file and then process its own block?
Suppose that HDFS doesn't have a copy of the text file on one of the nodes, say node one. Does HDFS make a copy of the file on node one the first time a task from node one asks for a block of the file? If not, does it mean that each time a task asks for a block of the file from node one, it has to wait for HDFS to fetch data from other nodes?
Thanks!

In general, Spark's access to HDFS is as efficient as you would expect: Spark uses Hadoop's FileSystem API to read data from HDFS.
Does each individual task fetch its own block from HDFS, or does the driver fetch the data for all tasks before dispatching them and then send the data to the nodes?
Each task fetches its own block from HDFS.
If each task fetches its own block from HDFS by itself, does it ask HDFS for a specific block, or does it fetch the whole file and then process its own block?
It pulls a specific block. It does not scan the entire file to get to the block.
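As a rough sketch of the kind of read that happens under the hood (the namenode URI, file path, and 16 MB block size below are just the hypothetical values from the question, not anything prescribed by Spark), a reader opens the file through the Hadoop FileSystem API and seeks straight to its split's offset, so HDFS only serves the requested byte range:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI

// Open the file and jump directly to a split's start offset instead of
// reading from byte 0. The values below match the question's hypothetical setup.
val conf = new Configuration()
val fs = FileSystem.get(new URI("hdfs://namenode:9000"), conf)

val in = fs.open(new Path("/data/input.txt"))
val blockSize = 16L * 1024 * 1024

// Seek to the start of, say, the third block; only the datanode(s) holding
// that block serve the requested bytes.
in.seek(2 * blockSize)
val buffer = new Array[Byte](1024)
in.readFully(buffer)
in.close()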
Suppose that HDFS doesn't have a copy of the text file on one of the nodes, say node one. Does HDFS make a copy of the file on node one the first time a task from node one asks for a block of the file? If not, does it mean that each time a task asks for a block of the file from node one, it has to wait for HDFS to fetch data from other nodes?
Spark will attempt to assign tasks based on the location preferences of the partitions in the RDD. In the case of a HadoopRDD (which is what you get from sc.textFile), the location preference for each partition is the set of datanodes that hold that partition's block locally. If a task cannot be scheduled local to its data, it runs on another node and the block is streamed over the network from a datanode that has it to the executor running the task; HDFS does not make an extra copy of the block on that node just because it was read from there.
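A quick way to see both points from the Spark shell (Scala); the path and sizes below are the question's hypothetical values:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// sc.textFile creates one partition per input split, which is normally one
// per HDFS block, so a 1 GB file with 16 MB blocks yields ~64 partitions/tasks.
val rdd = sc.textFile("hdfs://namenode:9000/data/input.txt")
println(rdd.getNumPartitions)

// To see the location preferences themselves, build the underlying HadoopRDD
// directly and ask for each partition's preferred hosts, i.e. the datanodes
// that hold that partition's block locally.
val hadoopRdd = sc.hadoopFile[LongWritable, Text, TextInputFormat](
  "hdfs://namenode:9000/data/input.txt")
hadoopRdd.partitions.foreach { p =>
  println(hadoopRdd.preferredLocations(p))
}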

Related

In MapReduce, does replication apply to intermediate data also?

In MapReduce, the output produced by the mappers is called intermediate data.
Is the intermediate data also replicated?
Is the intermediate data temporary?
When does the intermediate data get deleted? Is it deleted automatically, or do we need to delete it explicitly?
A mapper's spilled files are stored on the local file system of the worker node where the mapper is running, not in HDFS, so they are not replicated the way HDFS blocks are. Similarly, the data streamed from one node to another during the shuffle is stored on the local file system of the worker node where the receiving task is running.
This local file system path is specified by the hadoop.tmp.dir property, which by default points under /tmp (specifically /tmp/hadoop-${user.name}).
After the job completes or fails, the temporary location on the local file system is cleared automatically; you don't have to perform any cleanup, the framework handles it for you.
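If you want to check where that temporary data will land on a given node, a small sketch like the following (assuming the cluster's *-site.xml files are on the classpath) reads the relevant properties; mapreduce.cluster.local.dir is the property MapReduce actually uses for map spills and defaults to a directory under hadoop.tmp.dir:

import org.apache.hadoop.mapred.JobConf

// JobConf pulls in the MapReduce defaults in addition to core-site settings.
val conf = new JobConf()

// Base temporary directory on the local FS (default /tmp/hadoop-${user.name}).
println(conf.get("hadoop.tmp.dir"))

// Directories used for map spill / shuffle files
// (default ${hadoop.tmp.dir}/mapred/local).
println(conf.get("mapreduce.cluster.local.dir"))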

Hive Tables in multiple nodes - Processing

I have a conceptual doubt about Hive. I know that Hive is a data warehouse tool that runs on top of Hadoop, and that Hadoop has a distributed file system, HDFS.
Suppose I have one master and three slaves. Now I have created a table employees in HiveQL. The table is so huge that it can't be stored on one machine, so it has to be spread across all four machines. How can I load such data? Should it be done manually, or can I just type "LOAD DATA ..." on the master and have it automatically distributed among all the machines?
Hive uses HDFS as its warehouse to store the data, so HDFS concepts apply to how the data is stored. When you run LOAD DATA, the data is placed into the table's directory under the Hive warehouse in HDFS, and HDFS automatically splits it into blocks and distributes (and replicates) those blocks across the datanodes; you don't have to distribute anything manually.
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files.
Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
Please refer to the HDFS architecture documentation for more detail.
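To convince yourself that data loaded into a Hive table is already distributed, you can list the block locations of one of the table's files through the Hadoop FileSystem API; the namenode URI and warehouse path below are assumptions about a typical setup:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI

// After LOAD DATA, the table's files sit under the warehouse directory in HDFS
// and their blocks are already spread across the datanodes.
val fs = FileSystem.get(new URI("hdfs://namenode:9000"), new Configuration())
val tableFile = new Path("/user/hive/warehouse/employees/data_file")

val status = fs.getFileStatus(tableFile)
val blocks = fs.getFileBlockLocations(status, 0, status.getLen)
blocks.foreach { b =>
  // Each block reports the datanodes (slaves) holding a replica of it.
  println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(",")}")
}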

Copy files/chunks from HDFS to local file system of slave nodes

In Hadoop, I understand that the master node (Namenode) is responsible for storing the blocks of data in the slave machines (Datanodes).
When we use -copyToLocal or -get from the master, the files can be copied from HDFS to the local storage of the master node. Is there any way the slaves can copy the blocks (data) that are stored on them to their own local file system?
For example, a 128 MB file could be split between 2 slave nodes, each storing 64 MB. Is there any way for a slave to identify and load its chunk of the data into its local file system? If so, how can this be done programmatically? Can the commands -copyToLocal or -get be used in this case as well? Please help.
Short Answer: No
The data/files cannot be copied directly from the Datanodes. The reason is that Datanodes store the data but don't have any metadata about the stored files; to them, the blocks are just bits and bytes. The metadata of the files is stored in the Namenode. This metadata contains all the information about the files (name, size, etc.). Along with this, the Namenode keeps track of which blocks of a file are stored on which Datanodes. The Datanodes are also not aware of the ordering of the blocks when a file is split into multiple blocks.
Can the commands -copyToLocal or -get be used in this case also?
Yes, you can simply run these from the slave. The slave will then contact the namenode (if you've configured it properly) and download the data to your local filesystem.
What it doesn't do is a "short-circuit" copy, in which it would just copy the raw blocks between directories. There is also no guarantee it will read the blocks from the local machine at all, as your commandline client doesn't know its location.
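For the "programmatically" part of the question, the Hadoop FileSystem API offers the same behaviour as -copyToLocal; this sketch (the namenode URI and paths are hypothetical) can be run on any slave that has the cluster configuration:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI

// Programmatic equivalent of `hdfs dfs -copyToLocal`. Like the CLI, this goes
// through the namenode and reassembles the whole file; it does not pick out
// just the raw blocks that happen to be stored on this slave.
val conf = new Configuration()
val fs = FileSystem.get(new URI("hdfs://namenode:9000"), conf)

fs.copyToLocalFile(new Path("/data/input.txt"), new Path("/home/user/input.txt"))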
HDFS blocks are stored on the slaves' local file system only. You can dig into the directory defined by the property dfs.datanode.data.dir (dfs.data.dir in older releases).
But you won't get any benefit from reading the blocks directly (without the HDFS API). Also, reading and editing the block files directly can corrupt the file on HDFS.
If you want to store data on a particular slave's local disk yourself, you will have to implement your own logic for maintaining block metadata (which the Namenode already does for you).
Can you elaborate on why you want to distribute blocks yourself when Hadoop already takes care of all the challenges of distributed data?
You can copy a particular file or directory from one HDFS location (or cluster) to another by using distcp.
Usage: hadoop distcp <source-uri> <destination-uri>

What would happen to a MapReduce job if input data source keep increasing in HDFS?

We have a log collection agent writing to HDFS; that is, the agent (like Flume) keeps collecting logs from some applications and writes them to HDFS. The reading and writing run without a break, so the destination files on HDFS keep growing.
And here is the question: since the input data changes continuously, what happens to a MapReduce job if I set the collection agent's destination path as the job's input path?
FileInputFormat.addInputPath(job, new Path("hdfs://namenode:9000/data/collect"));
A MapReduce job processes only the data that is available when the job computes its input splits at submission time; files added (or appended to) after that are not picked up by the running job.
MapReduce is for batch data processing. For continuous data processing, use tools like Storm or Spark Streaming.
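As a sketch of the Spark Streaming route mentioned above, you can monitor the collection directory and process new files as they arrive; the directory path, application name, and batch interval below are assumptions:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Instead of one batch job over a fixed input, monitor the collection
// directory and process each new file in the next micro-batch.
val conf = new SparkConf().setAppName("log-collector")
val ssc = new StreamingContext(conf, Seconds(60))

val lines = ssc.textFileStream("hdfs://namenode:9000/data/collect")
lines.count().print()   // e.g. number of new log lines per 60-second batch

ssc.start()
ssc.awaitTermination()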

Hadoop. About file creation in HDFS

I read that whenever a client needs to create a file in HDFS (the Hadoop Distributed File System), the client's file must be 64 MB. Is that true? How can we load a file into HDFS that is less than 64 MB? Can we load a file that is just a reference for processing another file and has to be available to all datanodes?
I read that whenever a client needs to create a file in HDFS (the Hadoop Distributed File System), the client's file must be 64 MB.
Could you provide a reference for that? A file of any size can be put into HDFS. The file is split into 64 MB (default) blocks and saved on different datanodes in the cluster; a file smaller than one block simply occupies a single, partially filled block.
Can we load a file that is just a reference for processing another file and has to be available to all datanodes?
It doesn't matter whether a block or file is on a particular datanode or on all the datanodes. Any node can fetch the data from the datanodes that hold it, as long as they are part of the same cluster.
Think of HDFS as a very big hard drive and write your code for reading/writing data from HDFS. Hadoop will take care of internals like reading from or writing to multiple datanodes if required.
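As a small illustration of that "big hard drive" view (the namenode URI and path are assumptions), writing a file much smaller than 64 MB is just a normal write; the file occupies a single, partially filled block rather than a full 64 MB:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI

// Write a tiny reference file into HDFS; HDFS handles block placement and
// replication, and any node in the cluster can read it back.
val fs = FileSystem.get(new URI("hdfs://namenode:9000"), new Configuration())

val out = fs.create(new Path("/data/lookup/reference.txt"))
out.writeBytes("small reference data\n")
out.close()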
I would suggest reading the following 1 2 3 on HDFS, especially the 2nd one, which is a comic on HDFS.
