What is "HDFS write pipeline"? - hadoop

While I was going through the Hadoop Definitive Guide, I got stuck at the sentence below:
writing the reduce output does consume network bandwidth, but only as
much as a normal HDFS write pipeline consumes.
Questions:
1. Can someone help me understand the above sentence in more detail?
2. What does "HDFS write pipeline" mean?

When files are written to HDFS, a number of things go on behind the scenes related to HDFS block consistency and replication. The main I/O component of this process is by far replication. There is also bidirectional communication with the NameNode, registering each block's existence and state.
I think when it says "write pipeline" it just means the process of:
Creating the blocks
Registering with the NN
Performing replication
Doing write flushes to disk
Maintaining block state across the cluster (location, is-locked, last-updated, checksums, etc.)
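To make that concrete, here is a minimal client-side sketch using the plain Hadoop FileSystem API (the path is made up for illustration). A single create/write/close from the client is what triggers all of the steps listed above behind the scenes:

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SimpleHdfsWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    // One create/write/close on the client side; block allocation, NameNode
    // registration, the replication pipeline and the flushes to disk all
    // happen behind this call sequence.
    try (FSDataOutputStream out = fs.create(new Path("/user/demo/pipeline.txt"))) {
      out.write("some reduce output".getBytes(StandardCharsets.UTF_8));
    }
  }
}
```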

This can be understood as follows:
* The data pipeline writes the data to the datanodes; the number of datanodes written to is decided by the replication factor, which is 3 by default.
* Because the reduce output will be stored on 3 different nodes, as decided by the data pipeline, the network consumption is just that of the pipeline writing the data.
* The HDFS client gets the pipeline's datanode locations from the NameNode and writes to them via a handshake procedure (the handshake is a bit more involved, so we won't go into its details here). The original answer included a diagram from Cloudera's site illustrating this.
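One way to see the effect of the pipeline from the client side is to ask the NameNode where the replicas of a finished file ended up. The sketch below uses the standard FileSystem API; the reduce-output path is a made-up example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhereAreMyReplicas {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Hypothetical reduce output file; replace with a real path.
    FileStatus status = fs.getFileStatus(new Path("/user/demo/part-r-00000"));

    System.out.println("replication factor: " + status.getReplication());
    // Each block was pushed through the write pipeline to this many datanodes.
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("block at offset " + block.getOffset()
          + " lives on " + String.join(", ", block.getHosts()));
    }
  }
}
```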

HBase on Hadoop, data locality deep diving

I have read multiple articles about how HBase achieves data locality, e.g. this link
and HBase: The Definitive Guide.
I understand that when an HFile is rewritten, Hadoop writes the blocks on the same machine, i.e., the same Region Server that performed the compaction and created the bigger file on Hadoop. Everything is well understood so far.
Questions:
Assuming a Region Server has a region file (HFile) which is split on Hadoop into multiple blocks, e.g. A, B, and C: does that mean all the blocks (A, B, C) would be written to the same Region Server?
What would happen if the HFile after compaction has 10 blocks (a huge file), but the Region Server doesn't have storage for all of them? Does that mean we lose data locality, since those blocks would be written on other machines?
Thanks for the help.
HBase uses the HDFS API to write data to the distributed file system (HDFS). I know this will increase your doubt about data locality.
When a client writes data to HDFS using the HDFS API, it ensures that one copy of the data is written to the local datanode (if applicable) and then goes on to replicate it.
Now I will answer your questions:
Yes. HFile blocks written by a specific RegionServer (RS) reside on the local datanode until they are moved for load balancing or recovery by the HMaster (locality comes back on the next major compaction). So the blocks A, B, C would be on the same Region Server.
Yes, this may happen. But we can control it by configuring the region start and end keys for each region of an HBase table at creation time, which lets the data be distributed evenly across the cluster; see the sketch below.
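For example, a minimal sketch with the HBase 2.x client API (the table name, column family, and split keys are made up for illustration): you pre-split a table at creation time by handing createTable the region boundary keys.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTableExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Admin admin = connection.getAdmin()) {
      TableDescriptorBuilder table =
          TableDescriptorBuilder.newBuilder(TableName.valueOf("my_table")) // hypothetical name
              .setColumnFamily(ColumnFamilyDescriptorBuilder.of("cf"));
      // Region boundary keys chosen up front, so rows are spread across
      // Region Servers from the start instead of piling up in one region.
      byte[][] splitKeys = {
          Bytes.toBytes("g"),
          Bytes.toBytes("n"),
          Bytes.toBytes("t")
      };
      admin.createTable(table.build(), splitKeys);
    }
  }
}
```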
Hope this helps.

Hadoop: HDFS File Writes & Reads

I have a basic question regarding file writes and reads in HDFS.
For example, if I am writing a file using the default configuration, Hadoop internally has to write each block to 3 data nodes. My understanding is that, for each block, the client first writes the block to the first data node in the pipeline, which then informs the second, and so on. Once the third data node successfully receives the block, it sends an acknowledgement back to data node 2, and finally to the client through data node 1. Only after receiving the acknowledgement for the block is the write considered successful, and the client proceeds to write the next block.
If this is the case, isn't the time taken to write each block more than in a traditional file write, due to:
the replication factor (default is 3), and
the write process happening sequentially, block after block?
Please correct me if my understanding is wrong. Also, the questions below:
My understanding is that file reads/writes in Hadoop don't have any parallelism, and the best they can do is the same as a traditional file read or write (i.e. as if replication were set to 1), plus some overhead from the distributed communication mechanism.
Parallelism is provided only during the data processing phase via MapReduce, not during file reads/writes by a client.
Though your explanation of a file write above is correct, a DataNode can read and write data simultaneously. From the HDFS Architecture Guide:
a DataNode can be receiving data from the previous one in the pipeline
and at the same time forwarding data to the next one in the pipeline
A write operation takes more time than on a traditional file system (due to bandwidth issues and general overhead) but not as much as 3x (assuming a replication factor of 3).
I think your understanding is correct.
One might expect that a simple HDFS client writes some data and, once at least one block replica has been written, it gets control back while HDFS generates the other replicas asynchronously.
But in Hadoop, HDFS is designed around the pattern "write once, read many times", so the focus wasn't on write performance.
On the other side, you can find parallelism in Hadoop MapReduce (which can also be seen as an HDFS client), which is designed explicitly for it.
HDFS Write Operation:
There are two relevant parameters:
dfs.replication: the default block replication. The actual number of replicas can be specified when the file is created; the default is used if replication is not specified at create time.
dfs.namenode.replication.min : Minimal block replication.
Even though dfs.replication is set to 3, the write operation is considered successful once dfs.namenode.replication.min (default value: 1) replica has been written.
The replication up to dfs.replication still happens, in a sequential pipeline: the first datanode writes the block and forwards it to the second datanode, and the second datanode writes the block and forwards it to the third.
DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the Datanodes in the pipeline.
Have a look at related SE question: Hadoop 2.0 data write operation acknowledgement
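A hedged sketch of both points above: the replication factor can be chosen per file at create time, and the client can block until the current pipeline has acknowledged its data. The path, buffer size, and block size below are illustrative values, not recommendations.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicatedWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    Path out = new Path("/tmp/pipeline-demo.txt");   // hypothetical path
    // Replication is chosen per file at create time, overriding dfs.replication:
    // create(path, overwrite, bufferSize, replication, blockSize)
    try (FSDataOutputStream stream =
             fs.create(out, true, 4096, (short) 3, 128 * 1024 * 1024L)) {
      stream.write("hello pipeline".getBytes(StandardCharsets.UTF_8));
      // hflush() returns only after every datanode currently in the write
      // pipeline has acknowledged the buffered packets (it does not force a
      // disk sync on the datanodes; hsync() does that).
      stream.hflush();
    }
  }
}
```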
HDFS Read Operation:
HDFS read operations happen in parallel, unlike write operations, which go through the sequential pipeline described above.

Why are Map task outputs written to the local disk and not to HDFS?

I am prepping for an exam and here is a question in the lecture notes:
Why are Map task outputs written to the local disk and not to HDFS?
Here are my thoughts:
Reduced network traffic, as the reducer may run on the same machine as the output, so copying is not required.
We don't need the fault tolerance of HDFS. If the job dies halfway, we can always just re-run the map task.
What are other possible reasons? Are my answers reasonable?
Your reasoning is correct. However, I would like to add a few points: what if map outputs were written to HDFS? Writing to HDFS is not like writing to the local disk. It is a more involved process, with the NameNode ensuring that at least dfs.replication.min copies are written, and the NameNode also running a background thread to make additional copies of under-replicated blocks. Suppose the user kills the job midway, or the job simply fails: there would be lots of intermediate files sitting on HDFS for no reason, which you would have to delete manually. If this happened too often, your cluster's performance would degrade. HDFS is optimized for appending, not for frequent deleting.
Also, during the map phase, if the job fails, it performs a cleanup before exiting. If the output were on HDFS, the deletion would require the NameNode to send block-deletion messages to the appropriate datanodes, invalidating those blocks and removing them from the blocksMap. That is a lot of work just for the cleanup of a failed job, and for no gain!
Because it doesn’t use valuable cluster bandwidth. This is called the data locality optimization. Sometimes, however, all the nodes hosting the HDFS block replicas for a map task’s input split are running other map tasks, so the job scheduler will look for a free map slot on a node in the same rack as one of the blocks. Very occasionally even this is not possible, so an off-rack node is used, which results in an inter-rack network transfer.
from "Hadoop The Definitive Guide 4 edition"
One more point: by writing the map output to the local file system, the outputs of all the mappers eventually get merged and become the input for the shuffle and sort stages that precede the Reducer phase.

How HDFS works when running Hadoop on a single node cluster?

There is a lot of content explaining data locality and how MapReduce and HDFS work on multi-node clusters, but I can't find much information regarding a single-node setup. In the three months that I have been experimenting with Hadoop, I have been reading tutorials and threads about the number of mappers and reducers and about writing custom partitioners to optimize jobs, but I always wonder: does any of it apply to a single-node cluster?
What is the loss of running MapReduce jobs on a single node cluster comparing to a multi-node cluster?
Does the parallelism that is provided by splitting the input data still applies in this case?
What's the difference of reading input from a single node HDFS and reading from the local filesystem?
I think due to my little experience I can't answer these questions clearly, so any help is appreciated!
Thanks in advance!
EDIT: I understand Hadoop is not suitable for a single node setup because of all the factors listed by #TC1. So, what's the benefit of setting up a pseudo-distributed Hadoop environment?
I'm always reading tutorials and threads regarding number of mappers and reducers and writing custom partitioners to optimize jobs, but I always think, does it apply to a single node cluster?
It depends. Combiners are run between mapping and reducing, and you'd definitely feel their impact even on a single node if they were used right (see the sketch below). Custom partitioners: probably not, since the data hits the same disk before reducing. They would affect the logic, i.e., what data your reducers receive, but probably not the performance.
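A minimal word-count-style sketch of where a combiner plugs in; the mapper and reducer here are written out just for this example and are not from the answer above. Reusing the reducer as the combiner pre-aggregates map output before it is spilled and shuffled, which is where the single-node benefit comes from.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {

  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);   // intermediate output goes to local disk, not HDFS
        }
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count with combiner");
    job.setJarByClass(WordCountWithCombiner.class);
    job.setMapperClass(TokenizerMapper.class);
    // The combiner shrinks the map output before the shuffle; even on a
    // single node this means less data spilled and merged.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```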
What is the loss of running MapReduce jobs on a single node cluster comparing to a multi-node cluster?
Processing capability. If you can get by with a single node setup for your data, you probably shouldn't be using Hadoop for your processing in the first place.
Does the parallelism that is provided by splitting the input data still applies in this case?
No, the bottleneck typically is I/O, i.e., accessing the disk. In this case, you're still accessing the same disk, only hitting it from more threads.
What's the difference of reading input from a single node HDFS and reading from the local filesystem?
Virtually non-existent. The idea of HDFS is to
store files in big, contiguous blocks, to avoid disk seeking, and
replicate these blocks among the nodes to provide resilience;
both of those are moot when running on a single node.
EDIT:
The difference between "single-node" (standalone) and "pseudo-distributed" is that in standalone mode all the Hadoop processes run in a single JVM; there is no network communication involved, not even through localhost. Even if you are simply testing a job on small data, I'd advise using pseudo-distributed mode, since it is essentially the same as a cluster.
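To illustrate the point, here is a small sketch: the client code is identical in both modes, and only fs.defaultFS changes. The hdfs://localhost:9000 address is an assumption matching a typical pseudo-distributed setup; your core-site.xml may use a different port.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DefaultFsDemo {
  public static void main(String[] args) throws Exception {
    // Standalone / local mode: everything runs in one JVM against the local filesystem.
    Configuration local = new Configuration();
    local.set("fs.defaultFS", "file:///");
    FileSystem localFs = FileSystem.get(local);
    System.out.println("local: " + localFs.getUri());

    // Pseudo-distributed mode: separate daemons, reached over localhost RPC.
    // The port 9000 is an assumption; it depends on your core-site.xml.
    Configuration pseudo = new Configuration();
    pseudo.set("fs.defaultFS", "hdfs://localhost:9000");
    FileSystem hdfs = FileSystem.get(pseudo);
    System.out.println("pseudo-distributed: " + hdfs.getUri());

    // Same client API either way; only the URI (and the RPC/replication
    // machinery behind it) changes.
    System.out.println(hdfs.exists(new Path("/tmp")));
  }
}
```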

How does Hadoop/MapReduce scale when input data is NOT stored?

The intended use for Hadoop appears to be for when the input data is distributed (HDFS) and already stored local to the nodes at the time of the mapping process.
Suppose we have data which does not need to be stored; the data can be generated at runtime. For example, the input to the mapping process is to be every possible IP address. Is Hadoop capable of efficiently distributing the Mapper work across nodes? Would you need to explicitly define how to split the input data (i.e. the IP address space) to different nodes, or does Hadoop handle that automatically?
Let me first clarify a comment you made. Hadoop is designed to support potentially massively parallel computation across a potentially large number of nodes regardless of where the data comes from or goes. The Hadoop design favors scalability over performance when it has to. It is true that being clever about where the data starts out and how that data is distributed can make a significant difference in how well/quickly a hadoop job can run.
To your question and example: if you are going to generate the input data, you have the choice of generating it before the first job runs, or generating it within the first mapper. If you generate it within the mapper, you can figure out which node the mapper is running on and then generate just the data that would be reduced in that partition (use a partitioner to direct data between mappers and reducers; a small sketch follows).
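For the "use a partitioner" part, a hedged sketch: a custom Partitioner that routes IPv4 keys (assumed here to be dotted-quad Text values, an assumption made for this example) by their first octet, so each reducer handles a contiguous slice of the address space. It would be plugged in with job.setPartitionerClass(FirstOctetPartitioner.class).

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes "a.b.c.d" Text keys to reducers by their first octet.
public class FirstOctetPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text ip, IntWritable value, int numPartitions) {
    int firstOctet = Integer.parseInt(ip.toString().split("\\.")[0]);
    return (firstOctet * numPartitions) / 256;   // maps 0..255 onto 0..numPartitions-1
  }
}
```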
This is going to be a problem you'll have with any distributed platform. Storm, for example, lets you have some say in which bolt instance will process each tuple. The terminology might be different, but you'll be implementing roughly the same shuffle algorithm in Storm as you would in Hadoop.
You are probably trying to run a non-MapReduce task on a MapReduce cluster, then (e.g. IP scanning?). There may be more appropriate tools for this, you know...
A thing few people realize is that MapReduce is about checkpointing. It was developed for huge clusters, where you can expect machines to fail during the computation. By having checkpointing and recovery built into the architecture, it reduces the consequences of failures and slow hosts.
And that is why everything goes from disk to disk in MapReduce: it's checkpointed before, and it's checkpointed after. If something fails, only that part of the job is re-run.
You can easily outperform MapReduce by leaving out the checkpointing. If you have 10 nodes, you will win easily. If you have 100 nodes, you will usually win. If you have a major computation on 1000 nodes, chances are that one node fails, and you will wish you had been doing similar checkpointing...
Now your task doesn't sound like a MapReduce job, because the input data is virtual. It sounds much more as if you should be running some other distributed computing tool; and maybe just writing your initial result to HDFS for later processing via MapReduce.
But of course there are ways to hack around this. For example, you could use /16 subnets as input: each mapper reads a /16 subnet and does its job on that. It's not that much fake input to generate once you realize that you don't need to generate all 2^32 IPs, unless you have that many nodes in your cluster...
The number of mappers depends on the number of splits generated by the InputFormat implementation.
There is NLineInputFormat, which you can configure to generate as many splits as there are lines in the input file. You could create a file where each line is an IP range. I have not used it personally, and there are reports that it does not work as expected.
If you really need to, you can create your own InputFormat implementation that generates the InputSplits for your virtual data and forces as many mappers as you need. A sketch of the NLineInputFormat route follows.
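A minimal sketch of the NLineInputFormat approach under stated assumptions: the seed file contains one /16 prefix per line (e.g. "10.42"), the mapper just expands its prefix into the 65,536 addresses it is responsible for, and the job is map-only. The seed-file layout and the do-nothing mapper are illustrative choices, not part of the original answer.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IpRangeScan {

  // Each input line is a /16 prefix such as "10.42"; the mapper expands it to
  // all 65,536 addresses and emits them (no real "scan" logic in this sketch).
  public static class ExpandSubnetMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    private final Text ip = new Text();
    @Override
    protected void map(LongWritable offset, Text prefix, Context context)
        throws IOException, InterruptedException {
      String p = prefix.toString().trim();
      for (int c = 0; c < 256; c++) {
        for (int d = 0; d < 256; d++) {
          ip.set(p + "." + c + "." + d);
          context.write(ip, NullWritable.get());
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "ip range scan");
    job.setJarByClass(IpRangeScan.class);
    job.setMapperClass(ExpandSubnetMapper.class);
    job.setNumReduceTasks(0);                  // map-only job
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);

    // One /16 prefix per line in the seed file; one line per split => one mapper per /16.
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1);
    NLineInputFormat.addInputPath(job, new Path(args[0]));   // seed file with prefixes
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```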
