NameNode DataNode communication on read operation - hadoop

So I'm studying for the CCDH certification, and I found some sample questions online, but to be honest I don't think they are all that accurate, so I would like to check here.
Which of the following describes best the read operation on HDFS?
A. The client queries the NameNode for the block location(s). The NameNode returns the
block location(s) to the client. The client reads the data directly off the DataNode(s).
B. The client queries all DataNodes in parallel. The DataNode that contains the requested
data responds directly to the client. The client reads the data directly off the DataNode.
C. The client contacts the NameNode for the block location(s). The NameNode then
queries the DataNodes for block locations. The DataNodes respond to the NameNode,
and the NameNode redirects the client to the DataNode that holds the requested data
block(s). The client then reads the data directly off the DataNode.
D. The client contacts the NameNode for the block location(s). The NameNode contacts
the DataNode that holds the requested data block. Data is transferred from the DataNode
to the NameNode, and then from the NameNode to the client.
I know for sure that B and D are wrong. According to the document, the correct answer is C. But I always thought that the NameNode already had the block locations in RAM and did not need to query the DataNodes? So I would expect the correct answer to be A. Am I wrong, or is the document wrong?

The NameNode doesn't query DataNodes in order to get the block locations. Instead, it builds this mapping dynamically from the block reports sent by the DNs. Remember, DNs periodically send block reports to the NN along with their regular heartbeats.
So, the correct answer should be option A.
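For what it's worth, this flow is visible in the public HDFS client API: the client asks the NameNode for block locations, then streams the bytes straight off the DataNodes. A minimal sketch (the path /tmp/sample.txt is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/sample.txt"); // placeholder path

        // Ask the NameNode for block locations (served from its in-memory map).
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block hosts: " + String.join(",", block.getHosts()));
        }

        // The actual bytes are then streamed directly from the DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[4096];
            int n = in.read(buf);
            System.out.println("read " + n + " bytes");
        }
    }
}
```

Note that fs.open() alone is enough for a normal read; getFileBlockLocations() is only called here to make the NameNode round trip explicit.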

The reason the NameNode seldom communicates with the DataNodes is that its major work is serving read/write requests from clients and maintaining the metadata reported by the DataNodes; it doesn't waste its resources and time fetching data from DataNodes. Instead, the DataNodes communicate with the NameNode, using simple socket-based communication to deliver heartbeats and block reports. Refer to http://hashprompt.blogspot.com/2014/05/multi-node-hadoop-cluster-on-oracle.html.

The correct answer should be option A.
NN -> Client: the NN stores all file names and block locations in memory and responds to the client with the required information.
NN -> DN: this seems invalid, because in Hadoop (built on cheap hardware) a DN is sometimes unavailable in the cluster (due to network or hardware issues), so the NN should not depend on a DN for metadata.
Hope this helps.

Related

Why can't the metadata be stored in HDFS

Why can't the metadata be stored in HDFS with a replication factor of 3? Why is it stored on the local disk instead?
Because the NameNode would need several I/O operations for every resource allocation, which would take more time. So it's better to store the metadata in the NameNode's memory.
There are multiple reasons:
1. If it were stored on HDFS, there would be network I/O, which would be slower.
2. The NameNode would depend on DataNodes for its own metadata.
3. Metadata would in turn be required for the metadata itself, so that the NameNode could identify where the metadata is on HDFS: a chicken-and-egg problem.
Metadata is data about the data, such as which rack and DataNode a block is stored on, so that the block can be located. If the metadata were stored in HDFS and the DataNodes holding it failed, you would lose all your data, because you would no longer know how to access the blocks where your data was stored.
Even if you kept a higher replication factor, every change on a DataNode would have to be made in the replicas on the DataNodes as well as in the NameNode's edit log.
With 3 replicas of the NameNode metadata, every change on a DataNode would first have to be applied to:
1. its own replica blocks;
2. the NameNode and the replicas of the NameNode (the edit log would be written 3 times).
This would mean writing much more data than before. But data storage is not the only or even the main problem; the main problem is the time required to do all these operations.
Therefore the NameNode's metadata is backed up on a remote disk, so that even if your whole cluster fails (the chances are small) you can always restore your data.
To guard against NameNode failure, Hadoop comes with:
Primary NameNode -> holds the namespace image and edit logs.
Secondary NameNode -> periodically merges the namespace image and edit logs so that the edit logs don't become too large.
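In practice, the "backup on remote disk" mentioned above is achieved by pointing the NameNode at more than one metadata directory. A minimal sketch, assuming one local disk plus one NFS mount (both paths are placeholders); normally this is set in hdfs-site.xml rather than in code:

```java
import org.apache.hadoop.conf.Configuration;

public class NameDirSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // dfs.namenode.name.dir accepts a comma-separated list of directories;
        // the NameNode writes its fsimage and edit logs to every one of them.
        // /data/1/dfs/nn (local disk) and /mnt/nfs/dfs/nn (remote NFS mount)
        // are placeholder paths.
        conf.set("dfs.namenode.name.dir",
                "file:///data/1/dfs/nn,file:///mnt/nfs/dfs/nn");
        System.out.println(conf.get("dfs.namenode.name.dir"));
    }
}
```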

understanding how hbase uses hdfs

I'm trying to understand how HBase uses HDFS.
So here is what I understand (please correct me if I'm wrong):
I know that HBase uses HDFS to store data, that the data is split into regions, and that each region server may serve many regions. So I guess that one region server (exclusively) may communicate with many DataNodes to get and put data. If that is correct, then when that region server fails, will the data stored on those DataNodes no longer be accessible?
thank you in advance :)
In general, a Regionserver runs on a datanode.
Due to how HDFS works, the Regionserver will perform its reads and writes to the local datanode when possible, and then HDFS will ensure that the data is replicated onto two other random datanodes. So at all times, the data written by that regionserver is stored on 3 nodes in HDFS.
While a regionserver is serving a region, only it will read/write the data for that region, but if the regionserver process crashes, the HBase master will select another regionserver to serve that region. The data will be unavailable for a few minutes, but HBase will recover quickly.
If the entire host fails, then since HDFS ensured the data was written onto two other nodes, the scenario is the same: the master will select a new regionserver to open the failed region, and the data will not be lost.
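To see the region-to-regionserver assignment described above, the HBase client API exposes it directly. A minimal sketch, assuming an HBase 2.x client and a placeholder table name "mytable":

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;

public class RegionLocationsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("mytable"))) {
            for (HRegionLocation loc : locator.getAllRegionLocations()) {
                // Each region is served by exactly one regionserver at a time,
                // even though its underlying HFiles are replicated across datanodes.
                System.out.println(loc.getRegion().getRegionNameAsString()
                        + " -> " + loc.getHostname());
            }
        }
    }
}
```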

Reading operations on hadoop and consistency level

I am setting up distributed HBase on HDFS and I am trying to understand the behavior of the system during read operations.
This is how I understand the high-level steps of the read operation:
The client connects to the NameNode to get the list of DataNodes which contain replicas of the rows it is interested in.
From there, the client caches the list of DataNodes and starts talking to the chosen DataNode directly, until it needs some other rows from another DataNode, in which case it asks the NameNode again.
My questions are as follows:
Who chooses the best replica DataNode to contact? How does the client choose the "closest" replica? Does the NameNode return the list of relevant DataNodes in sorted order?
What are the scenarios (if any) in which the client switches to another DataNode that has the requested rows? For example, if one of the DataNodes becomes overloaded/slow, can the client library figure out that it should contact another DataNode from the list returned by the NameNode?
Is there a possibility of getting stale data from one of the replicas? For example, a client acquires the list of DataNodes and starts reading from one of them. In the meantime, a write request comes from another client to the NameNode. We have dfs.replication == 3 and dfs.replication.min == 2. The NameNode considers the write successful after flushing to disk on 2 out of 3 nodes, while the first client is reading from the 3rd node and doesn't know (yet) that another write has been committed?
Does Hadoop maintain the same reading policy when supporting HBase?
Thank you
Who chooses the best replica DataNode to contact? How does the client choose the "closest" replica? Does the NameNode return the list of relevant DataNodes in sorted order?
The client is the one that decides which replica is best to contact. It picks them in this order:
The file is on the same machine. In this case (if properly configured) it will short-circuit the DataNode and read the file directly as an optimization (see the config sketch after this list).
The file is in the same rack (if rack awareness is configured).
The file is somewhere else.
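Short-circuit reads have to be switched on explicitly. A minimal sketch of the client-side keys involved, assuming the standard hdfs-site.xml property names (the socket path is a placeholder and must match the DataNode's setting):

```java
import org.apache.hadoop.conf.Configuration;

public class ShortCircuitSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Let a client co-located with the datanode bypass the datanode's
        // TCP path and read block files directly from local disk.
        conf.setBoolean("dfs.client.read.shortcircuit", true);
        // UNIX domain socket shared between the datanode and the client;
        // the path below is a placeholder.
        conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");
        System.out.println(conf.get("dfs.domain.socket.path"));
    }
}
```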
What are the scenarios (if any) in which the client switches to another DataNode that has the requested rows? For example, if one of the DataNodes becomes overloaded/slow, can the client library figure out that it should contact another DataNode from the list returned by the NameNode?
It's not that smart. It'll switch if it thinks the DataNode is down (meaning it times out), but not in any other situation that I know of. I believe it will just go to the next one in the list, but it might contact the NameNode again; I'm not 100% sure.
Is there a possibility of getting stale data from one of the replicas? For example, a client acquires the list of DataNodes and starts reading from one of them. In the meantime, a write request comes from another client to the NameNode. We have dfs.replication == 3 and dfs.replication.min == 2. The NameNode considers the write successful after flushing to disk on 2 out of 3 nodes, while the first client is reading from the 3rd node and doesn't know (yet) that another write has been committed?
Stale data is possible, but not in the situation you describe. Files are write-once and immutable (other than append, but don't append if you don't have to). The NameNode won't tell you the file is there until it is completely written. In the case of append, shame on you then. The behavior of reading from a file that is actively being appended to is unpredictable on a local filesystem as well; you should expect the same in HDFS.
One way stale data could happen is if you retrieve your list of block locations and the NameNode decides to migrate all three of them at once before you access them. I don't know what would happen there. In my 5 years of using Hadoop, this has never been a problem, even when running the balancer at the same time as doing other work.
Does Hadoop maintain the same reading policy when supporting HBase?
HBase is not treated specially by HDFS. There is some talk about using a custom block placement strategy with HBase to get better data locality, but that's in the weeds.

How is data written to HDFS?

I'm trying to understand how data writing is managed in HDFS, by reading the hadoop-2.4.1 documentation.
According to the following schema :
whenever a client writes something to HDFS, it has no contact with the NameNode and is itself in charge of chunking and replication. I assume that in this case, the client is a machine running an HDFS shell (or equivalent).
However, I don't understand how this is managed.
Indeed, according to the same documentation :
The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
Is the schema presented above correct? If so,
is the NameNode only informed of new files when it receives a block report (which can take time, I suppose)?
why does the client write to multiple nodes?
If this schema is not correct, how does file creation work with HDFS?
As you said, DataNodes are responsible for serving read/write requests and for block creation/deletion/replication.
They then regularly send "heartbeats" (a state-of-health report) and "block reports" (the list of blocks on the DataNode) to the NameNode.
According to this article:
Data Nodes send heartbeats to the Name Node every 3 seconds via a TCP
handshake, ... Every tenth heartbeat is a Block Report,
where the Data Node tells the Name Node about all the blocks it has.
So block reports are sent every 30 seconds. I don't think this affects Hadoop jobs, because in general they are independent jobs.
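For what it's worth, both intervals are configurable, and defaults vary across Hadoop versions (recent releases send full block reports far less often than every 30 seconds). A minimal sketch reading the relevant keys:

```java
import org.apache.hadoop.conf.Configuration;

public class IntervalSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // dfs.heartbeat.interval is in seconds; dfs.blockreport.intervalMsec
        // is in milliseconds. Values come from whatever configuration files
        // are on the classpath, falling back to the defaults shown here.
        System.out.println("heartbeat: "
                + conf.get("dfs.heartbeat.interval", "3") + " s");
        System.out.println("block report: "
                + conf.get("dfs.blockreport.intervalMsec", "21600000") + " ms");
    }
}
```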
For your question:
why does the client write to multiple nodes?
I'd say that actually the client writes to just one DataNode and tells it to forward the data to the other DataNodes (see this link's picture: CLIENT START WRITING DATA), but this is transparent. That's why your schema shows the client as the one writing to multiple nodes.
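From the client's point of view, that pipeline is hidden behind a single output stream. A minimal sketch (the path /tmp/example.txt is a placeholder):

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // The client writes to a single stream; chunking into blocks and the
        // replication pipeline to the other datanodes happen behind this call.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"))) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```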

how does hdfs choose a datanode to store

As the title indicates, when a client requests to write a file to HDFS, how does HDFS or the NameNode choose which DataNode(s) to store the file on?
Does HDFS try to store all the blocks of the file on the same node, or on some node in the same rack if it is too big?
Does HDFS provide any APIs for applications to store the file on a particular DataNode of their choosing?
how does HDFS or the NameNode choose which DataNode(s) to store the file on?
HDFS has a BlockPlacementPolicyDefault; check the API documentation for more details. It should be possible to extend BlockPlacementPolicy for custom behavior.
Does HDFS provide any APIs for applications to store the file on a particular DataNode of their choosing?
The placement behavior should not be specific to a particular datanode. That's what makes HDFS resilient to failure and also scalable.
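If you do want custom placement, the policy class is wired in through configuration rather than through a per-file API. A hypothetical sketch; com.example.MyPlacementPolicy stands in for your own BlockPlacementPolicy subclass, and dfs.block.replicator.classname is, to my knowledge, the key HDFS uses to look the policy up:

```java
import org.apache.hadoop.conf.Configuration;

public class CustomPlacementSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Point the NameNode at a custom BlockPlacementPolicy implementation.
        // com.example.MyPlacementPolicy is a placeholder class name.
        conf.set("dfs.block.replicator.classname",
                "com.example.MyPlacementPolicy");
        System.out.println(conf.get("dfs.block.replicator.classname"));
    }
}
```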
The code for choosing DataNodes is in the function ReplicationTargetChooser.chooseTarget().
The comment says that:
The replica placement strategy is that if the writer is on a
datanode, the 1st replica is placed on the local machine, otherwise
a random datanode. The 2nd replica is placed on a datanode that is on
a different rack. The 3rd replica is placed on a datanode which is on
the same rack as the first replica.
It doesn't provide any API for applications to store the file on the DataNode they want.
If someone prefers charts, here is a picture (source):
Now, with the Hadoop-385 patch, we can choose the block placement policy, so as to place all blocks of a file on the same node (and similarly for its replica nodes). Read this blog about the topic; look at the comments section.
You can see how the NameNode instructs the DataNodes to store the data: the first replica is stored on the local machine, and the other two replicas are placed on another rack, and so on.
If any replica fails, the data is served from another replica. The chance of every replica failing is like a fan falling on your head while you sleep :p i.e. there is very little chance of that.
