Hadoop: HDFS File Writes & Reads

I have a basic question regarding file writes and reads in HDFS.
For example, if I am writing a file using the default configuration, Hadoop internally has to write each block to 3 DataNodes. My understanding is that for each block, the client first writes the block to the first DataNode in the pipeline, which then informs the second, and so on. Once the third DataNode successfully receives the block, it sends an acknowledgement back to DataNode 2, and finally to the client through DataNode 1. Only after receiving the acknowledgement for the block is the write considered successful, and the client proceeds to write the next block.
If this is the case, then isn't the time taken to write each block more than in a traditional file write, due to:
the replication factor (default is 3), and
the write process happening sequentially, block after block?
Please correct me if I am wrong in my understanding. Also, please consider the following questions:
My understanding is that a file read / write in Hadoop doesn't have any parallelism, and the best it can do is match a traditional file read or write (i.e. as if the replication were set to 1) plus some overhead from the distributed communication mechanism.
Parallelism is provided only during the data processing phase via MapReduce, but not during file read / write by a client.

Though your above explanation of a file write is correct, a DataNode can read and write data simultaneously. From the HDFS Architecture Guide:
a DataNode can be receiving data from the previous one in the pipeline
and at the same time forwarding data to the next one in the pipeline
A write operation takes more time than on a traditional file system (due to bandwidth issues and general overhead) but not as much as 3x (assuming a replication factor of 3).
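To make the client's view of this concrete, here is a minimal write sketch using the FileSystem API. It assumes a reachable HDFS configured via the usual core-site.xml/hdfs-site.xml on the classpath; the path and payload are made up. The client writes a single output stream, and the replication pipeline runs behind the scenes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PipelineWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // HDFS client

        // The client sees one stream; the DataNode -> DataNode -> DataNode
        // replication pipeline is handled transparently.
        Path path = new Path("/tmp/pipeline-demo.txt");  // hypothetical path
        short replication = 3;                           // same as the dfs.replication default
        try (FSDataOutputStream out = fs.create(path, replication)) {
            out.writeBytes("hello hdfs\n");
        }                                                // close() waits for the pipeline acks
    }
}
```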

I think your understanding is correct.
One might expect that a simple HDFS client writes some data and, as soon as at least one block replica has been written, regains control, while HDFS generates the other replicas asynchronously.
But in Hadoop, HDFS is designed around the pattern "write once, read many times" so the focus wasn't on write performance.
On the other side, you can find parallelism in Hadoop MapReduce (which can also be seen as an HDFS client), which is designed explicitly to do so.

HDFS Write Operation:
There are two parameters
dfs.replication : Default block replication. The actual number of replicas can be specified when the file is created. The default is used if replication is not specified at create time.
dfs.namenode.replication.min : Minimal block replication.
Even if dfs.replication is set to 3, the write operation is considered successful once dfs.namenode.replication.min replicas (default value: 1) have been written.
But this replication up to dfs.replication happens in a sequential pipeline: the first DataNode writes the block and forwards it to the second DataNode, and the second DataNode writes the block and forwards it to the third DataNode.
DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by DataNodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the DataNodes in the pipeline.
Have a look at related SE question: Hadoop 2.0 data write operation acknowledgement
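As a rough illustration of where those two properties live, here is a hedged Java sketch. Note that dfs.replication is a client-side default, while dfs.namenode.replication.min is a NameNode-side setting that belongs in the NameNode's hdfs-site.xml; it is only named in a comment below for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSettingsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Client-side default used when a file is created without an explicit replication.
        conf.setInt("dfs.replication", 3);

        // NameNode-side setting: a write is acknowledged once this many replicas exist.
        // It belongs in the NameNode's hdfs-site.xml, not in client code:
        // conf.setInt("dfs.namenode.replication.min", 1);

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/replication-demo.txt");   // hypothetical path
        System.out.println("default replication for new files: "
                + fs.getDefaultReplication(path));
    }
}
```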
HDFS Read Operation:
Unlike writes, which go through a sequential pipeline, HDFS reads can be served in parallel: different blocks of a file live on different DataNodes, so a client can fetch several block ranges at the same time.
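As a hedged sketch of what such a parallel read could look like from a single client, the snippet below opens one stream per block-sized range on separate threads; the path is hypothetical and the block boundaries are only approximated from the file's reported block size.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelReadSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/big-file.dat");            // hypothetical path

        FileStatus status = fs.getFileStatus(path);
        long blockSize = status.getBlockSize();
        long fileLen = status.getLen();

        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Long>> results = new ArrayList<>();

        // One task per block-sized range; each task opens its own stream,
        // so different ranges can be served by different DataNodes concurrently.
        for (long offset = 0; offset < fileLen; offset += blockSize) {
            final long start = offset;
            final long len = Math.min(blockSize, fileLen - start);
            results.add(pool.submit((Callable<Long>) () -> {
                byte[] buf = new byte[64 * 1024];
                long read = 0;
                try (FSDataInputStream in = fs.open(path)) {
                    in.seek(start);
                    while (read < len) {
                        int n = in.read(buf, 0, (int) Math.min(buf.length, len - read));
                        if (n < 0) break;
                        read += n;
                    }
                }
                return read;
            }));
        }
        long total = 0;
        for (Future<Long> f : results) total += f.get();
        pool.shutdown();
        System.out.println("bytes read: " + total);
    }
}
```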

Related

How often are blocks on HDFS replicated?

I have a question regarding Hadoop HDFS block replication. Suppose a block is written to a DataNode and the DFS has a replication factor of 3: how long does it take for the NameNode to replicate this block on other DataNodes? Is it instantaneous? If not, and right after writing the block to a DataNode the disk on that DataNode fails and cannot be recovered, does it mean the block is lost forever? And also, how often does the NameNode check for missing/corrupt blocks?
You may want to review this article, which has a good description of HDFS writes. Replication should be close to immediate, depending on how busy the cluster is:
https://data-flair.training/blogs/hdfs-data-write-operation/
What happens if a DataNode fails while writing a file in HDFS?
While writing data to a DataNode, if that DataNode fails, the following actions take place, transparently to the client writing the data.
The pipeline is closed, and packets in the ack queue are added to the front of the data queue so that DataNodes downstream of the failed node do not miss any packets. The failed DataNode is removed from the pipeline, the remainder of the block is written to the remaining good DataNodes, and the NameNode later arranges for the under-replicated block to be copied to another node.

Concept of blocks in Hadoop HDFS

I have some questions regarding the blocks in Hadoop. I read that Hadoop uses HDFS, which creates blocks of a specific size.
First question: Do the blocks physically exist on the hard disk of the normal file system (e.g. NTFS), i.e. can we see the blocks on the hosting filesystem (NTFS), or can they only be seen using the Hadoop commands?
Second question: Does Hadoop create the blocks before running the tasks, i.e. do blocks exist from the beginning whenever there is a file, or does Hadoop create the blocks only when running the task?
Third question: Are the blocks determined and created before splitting (i.e. the getSplits method of the InputFormat class), regardless of the number of splits, or afterwards, depending on the splits?
Fourth question: Are the blocks before and after running the task the same, or does it depend on the configuration? And are there two types of blocks, one for storing the files and one for grouping the files and sending them over the network to data nodes for executing the task?
1. Do the blocks physically exist on the hard disk of the normal file system (e.g. NTFS), i.e. can we see the blocks on the hosting filesystem, or can they only be seen using the Hadoop commands?
Yes. Blocks exist physically. You can use commands like hadoop fsck /path/to/file -files -blocks
Refer to the SE question below for commands to view blocks:
Viewing the number of blocks for a file in hadoop
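Besides fsck, the block layout of a file can also be inspected programmatically. The following is a small sketch using the FileSystem API; the path is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocksSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/some-file.dat");           // hypothetical path

        FileStatus status = fs.getFileStatus(path);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        // Each BlockLocation is one HDFS block: its offset, its length and the
        // DataNodes that hold a replica of it.
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
    }
}
```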
2. Does Hadoop create the blocks before running the tasks, i.e. do blocks exist from the beginning whenever there is a file, or does Hadoop create the blocks only when running the task?
Hadoop = distributed storage (HDFS) + distributed processing (MapReduce & YARN).
A MapReduce job works on input splits, and the input splits are created from data blocks on the DataNodes. Data blocks are created during the write operation of a file. If you are running a job on existing files, the data blocks are pre-created before the job, and the InputSplits are computed at job submission time. You can think of a data block as a physical entity and an InputSplit as a logical entity. A MapReduce job does not change the input data blocks. The reducers generate their output as new data blocks.
Mappers process input splits and emit their output to the reducers.
3. Are the blocks determined and created before splitting (i.e. the getSplits method of the InputFormat class), regardless of the number of splits, or afterwards, depending on the splits?
The input is already available as physical DFS blocks. A MapReduce job works on InputSplits. Blocks and InputSplits may or may not be the same. A block is a physical entity and an InputSplit is a logical entity. Refer to the SE question below for more details:
How does Hadoop perform input splits?
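To see the logical splits that a job would get for already-written blocks, something like the following sketch can be used (new MapReduce API; the input directory is hypothetical). The splits are computed from block metadata; the physical blocks themselves are not touched.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitsVsBlocksSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        FileInputFormat.addInputPath(job, new Path("/tmp/input"));   // hypothetical input dir

        // InputSplits are computed logically at job setup time from block metadata.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        System.out.println("number of input splits: " + splits.size());
        for (InputSplit split : splits) {
            System.out.println(split);   // a FileSplit prints path, start offset and length
        }
    }
}
```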
4. Are the blocks before and after running the task the same, or does it depend on the configuration? And are there two types of blocks, one for storing the files and one for grouping the files and sending them over the network to data nodes for executing the task?
Mapper input: the input blocks pre-exist. The map phase works on input blocks/splits, which were stored in HDFS before the mapper job started.
Mapper output: not stored in HDFS; it does not make sense to store intermediate results in HDFS with a replication factor greater than 1.
Reducer output: stored in HDFS. The number of blocks depends on the size of the reducer output data.
Do the blocks physically exist on the hard disk of the normal file system (e.g. NTFS), i.e. can we see the blocks on the hosting filesystem, or can they only be seen using the Hadoop commands?
Yes, the blocks exist physically on disk across the datanodes in your cluster. I suppose you could "see" them if you were on one of the datanodes and you really wanted to, but it would likely not be illuminating. It would only be a random 128m (or whatever dfs.block.size is set to in hdfs-site.xml) fragment of the file with no meaningful filename. The hdfs dfs commands enable you to treat HDFS as a "real" filesystem.
Does Hadoop create the blocks before running the tasks, i.e. do blocks exist from the beginning whenever there is a file, or does Hadoop create the blocks only when running the task?
Hadoop takes care of splitting the file into blocks and distributing them among the datanodes when you put a file in HDFS (through whatever method applies to your situation).
Are the blocks determined and created before splitting (i.e. the getSplits method of the InputFormat class), regardless of the number of splits, or afterwards, depending on the splits?
Not entirely sure what you mean, but the blocks exist before, and irrespective of, any processing you do with them.
Are the blocks before and after running the task the same, or does it depend on the configuration? And are there two types of blocks, one for storing the files and one for grouping the files and sending them over the network to data nodes for executing the task?
Again, blocks in HDFS are determined before any processing is done, if any is done at all. HDFS is simply a way to store a large file in a distributed fashion. When you do processing, for example with a MapReduce job, Hadoop will write intermediate results to disk. This is not related to the blocking of the raw file in HDFS.

What is "HDFS write pipeline"?

While I was going through Hadoop: The Definitive Guide, I got stuck at the sentence below:
writing the reduce output does consume network bandwidth, but only as
much as a normal HDFS write pipeline consumes.
Questions:
1. Can someone help me understand the above sentence in more detail?
2. What does "HDFS write pipeline" mean?
When files are written to HDFS a number of things are going on behind the scenes related to HDFS block consistency and replication. The main IO component of this process is by far replication. There is also the bidirectional communication with the name node registering the block's existence and state.
I think when it says "write pipeline" it just means the process of:
Creating the blocks
Registering with the NN
Performing replication
Doing write flushes to disk
Maintaining block state across the cluster (location, is-locked, last-updated, checksums, etc.)
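From the client's point of view, most of that pipeline is hidden behind the output stream. The hedged sketch below only shows the knobs the client sees: create, write, hflush/hsync, close; the path is made up.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WritePipelineSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/pipeline-flush-demo.txt");   // hypothetical path

        try (FSDataOutputStream out = fs.create(path)) {
            out.writeBytes("first record\n");
            // hflush() pushes the buffered packets through the DataNode pipeline so
            // that new readers can see the data; it does not force a disk sync.
            out.hflush();

            out.writeBytes("second record\n");
            // hsync() additionally asks the DataNodes to sync the data to disk.
            out.hsync();
        }   // close() finalizes the last block and waits for the pipeline acks
    }
}
```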
It can be understood as follows:
* The write pipeline writes data to DataNodes, and the number of DataNodes written to is decided by the replication factor, which is 3 by default.
* Because the reduce output will be stored on 3 different nodes, as decided by the pipeline, the network consumption is the same as for any other data written through the pipeline.
* The same can be seen in the write-pipeline diagram from Cloudera's site (not reproduced here), where the HDFS client gets the pipeline's DataNode locations from the NameNode and writes to the pipeline via a handshake procedure (the handshake is a bit more involved; we won't go into its details here).
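Since the number of DataNodes in the pipeline is simply the file's replication factor, it can also be queried (and changed after the fact) per file. A small sketch, assuming a file that already exists at a made-up path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationOfFileSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path("/tmp/already-written.dat");     // hypothetical existing file

        // How many DataNodes each block of this file is stored on.
        short current = fs.getFileStatus(path).getReplication();
        System.out.println("current replication factor: " + current);

        // Ask the NameNode to change it; extra replicas are created (or dropped)
        // asynchronously in the background, not as part of this call.
        fs.setReplication(path, (short) 2);
    }
}
```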

Why are map task outputs written to the local disk and not to HDFS?

I am prepping for an exam and here is a question in the lecture notes:
Why are map task outputs written to the local disk and not to HDFS?
Here are my thoughts:
Reduced network traffic, as the reducer may run on the same machine as the output, so copying is not required.
Don't need the fault tolerance of HDFS. If the job dies halfway, we can always just re-run the map task.
What are other possible reasons? Are my answers reasonable?
Your reasons are correct. However, I would like to add a few points: what if map outputs were written to HDFS? Writing to HDFS is not like writing to the local disk. It is a more involved process, with the NameNode ensuring that at least dfs.replication.min copies are written, and the NameNode also runs a background thread to make additional copies of under-replicated blocks. Suppose the user kills the job midway, or the job simply fails: there would be lots of intermediate files sitting in HDFS for no reason, which you would have to delete manually. And if this happens too many times, your cluster's performance will degrade. HDFS is optimized for appending, not for frequent deleting. Also, during the map phase, if the job fails, it performs a cleanup before exiting. If the output were in HDFS, the deletion would require the NameNode to send block-deletion messages to the appropriate DataNodes, which would invalidate those blocks and remove them from the blocksMap. So much work involved just for the cleanup of a failed job, and for no gain!
Because it doesn’t use valuable cluster bandwidth. This is called the data locality optimization. Sometimes, however, all the nodes hosting the HDFS block replicas for a map task’s input split are running other map tasks, so the job scheduler will look for a free map slot on a node in the same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer.
from "Hadoop The Definitive Guide 4 edition"
Another point about writing the map output to the local file system: the output of all the mappers eventually gets merged and finally becomes the input for the shuffle and sort stages that precede the reduce phase.

hadoop node unused for map tasks

I've noticed that all map and all reduce tasks are running on a single node (node1). I tried creating a file consisting of a single HDFS block, which resides on node2. When running a MapReduce job whose input consists only of this block resident on node2, the task still runs on node1. I was under the impression that Hadoop prioritizes running tasks on the nodes that contain the input data. I see no errors reported in the log files. Any idea what might be going on here?
I have a 3-node cluster running on KVMs created by following the Cloudera CDH4 distributed installation guide.
I was under the impression that hadoop prioritizes running tasks on
the nodes that contain the input data.
Well, there might be an exceptional case:
If the node holding the data block doesn't have any free CPU slots, it won't be able to start any mappers on that particular node. In such a scenario, instead of waiting, the data block will be moved to a nearby node and processed there. But before that, the framework will try to process a replica of that block locally on another node that holds one (if the replication factor is greater than 1).
HTH
I don't understand what you mean by "I tried creating a file consisting of a single hdfs block which resides on node2". I don't think you can "direct" a Hadoop cluster to store some block on a specific node.
Hadoop decides the number of mappers based on the input size. If the input size is less than the HDFS block size (the default, I think, is 64 MB), it will spawn just one mapper.
You can set the job parameter "mapred.max.split.size" to whatever size you want in order to force spawning multiple mappers (the default should suffice in most cases); see the sketch below.
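A hedged sketch of how that parameter could be set on a job; the property name is the old-style one mentioned above, the job name and input directory are made up, and the split size is an arbitrary example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Old-style property name; newer releases use
        // mapreduce.input.fileinputformat.split.maxsize for the same thing.
        conf.setLong("mapred.max.split.size", 16 * 1024 * 1024);   // 16 MB splits

        Job job = Job.getInstance(conf, "small-splits-demo");      // hypothetical job name
        FileInputFormat.addInputPath(job, new Path("/tmp/input")); // hypothetical input dir

        // Equivalent helper on the new API:
        FileInputFormat.setMaxInputSplitSize(job, 16 * 1024 * 1024);

        // ... set mapper/reducer classes and the output path, then job.waitForCompletion(true)
    }
}
```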
