how does hdfs choose a datanode to store - hadoop

As the title indicates, when a client requests to write a file to HDFS, how does HDFS or the name node choose which datanode stores the file?
Does HDFS try to store all the blocks of the file on the same node, or on nodes in the same rack if the file is too big?
Does HDFS provide any API that lets an application store a file on a datanode of its choosing?

How does HDFS or the name node choose which datanode stores the file?
HDFS has a BlockPlacementPolicyDefault; check the API documentation for more details. It should be possible to extend BlockPlacementPolicy for custom behavior.
Does HDFS provide any API that lets an application store a file on a datanode of its choosing?
The placement behavior should not be specific to a particular datanode. That's what makes HDFS resilient to failure and also scalable.

The code for choosing datanodes is in the function ReplicationTargetChooser.chooseTarget().
The comment there says:
The replica placement strategy is that if the writer is on a
datanode, the 1st replica is placed on the local machine, otherwise
a random datanode. The 2nd replica is placed on a datanode that is on
a different rack. The 3rd replica is placed on a datanode which is on
the same rack as the first replica.
It doesn't provide any API for applications to store the file on the datanode they want.
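To make the quoted comment concrete, here is a minimal Python sketch of that strategy. It is not the Hadoop code: `choose_targets`, the `nodes` mapping, and the single-rack fallback are all assumptions made for illustration, and the real chooser also weighs load and free space.

```python
import random

def choose_targets(writer, nodes, replication=3, rng=random):
    """Pick datanodes for one block, following the quoted comment.

    writer: host name of the client writing the block (hypothetical input).
    nodes:  dict mapping datanode name -> rack name (hypothetical input).
    """
    names = sorted(nodes)
    # 1st replica: the writer's own machine if it is a datanode, else a random one.
    first = writer if writer in nodes else rng.choice(names)
    targets = [first]

    if replication >= 2:
        # 2nd replica: a datanode on a different rack than the first replica.
        remote = [n for n in names if nodes[n] != nodes[first]]
        # Assumed fallback for single-rack clusters: any unused node.
        pool = remote or [n for n in names if n not in targets]
        if pool:
            targets.append(rng.choice(pool))

    if replication >= 3:
        # 3rd replica: a datanode on the same rack as the first replica.
        local = [n for n in names if nodes[n] == nodes[first] and n not in targets]
        pool = local or [n for n in names if n not in targets]
        if pool:
            targets.append(rng.choice(pool))

    return targets
```

With two racks of two nodes each and the writer running on a datanode, the first replica stays local, the second lands on the other rack, and the third goes to the writer's rack-mate.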

If someone prefers charts, here is a picture (source):

Now, with the HDFS-385 patch, we can choose the block placement policy, so it is possible to place all blocks of a file on the same node (and similarly for the replica nodes). Read this blog about this topic, and look at the comments section.

You can see what happens when the namenode instructs the datanodes to store data: the first replica is stored on the local machine, and the other two replicas are placed on another rack.
If any replica fails, the data is served from another replica. The chance of every replica failing is about the same as a ceiling fan falling on your head while you sleep :p i.e. it is very unlikely.


Reading operations on hadoop and consistency level

I am setting up distributed HBase on HDFS and I am trying to understand the behavior of the system during read operations.
This is how I understand the high-level steps of a read operation.
The client connects to the NameNode to get the list of DataNodes that contain replicas of the rows it is interested in.
From here the client caches the list of DataNodes and talks to the chosen DataNode directly until it needs rows from another DataNode, in which case it asks the NameNode again.
My questions are as follows:
Who chooses the best replica DataNode to contact? How does the client choose the "closest" replica? Does the NameNode return the list of relevant DataNodes in sorted order?
What are the scenarios (if any) in which the client switches to another DataNode that has the requested rows? For example, if one of the DataNodes becomes overloaded or slow, can the client library figure out that it should contact another DataNode from the list returned by the NameNode?
Is there a possibility of getting stale data from one of the replicas? For example, a client acquires the list of DataNodes and starts reading from one of them. In the meantime, a write request comes from another client to the NameNode. We have dfs.replication == 3 and dfs.replication.min = 2. The NameNode considers the write successful after flushing to disk on 2 out of 3 nodes, while the first client is reading from the 3rd node and doesn't know (yet) that another write has been committed?
Does Hadoop maintain the same reading policy when supporting HBase?
Thank you
Who chooses the best replica DataNode to contact? How does the client choose the "closest" replica? Does the NameNode return the list of relevant DataNodes in sorted order?
The client is the one that decides who best to contact. It picks them in this order:
The file is on the same machine. In this case (if properly configured) it will short circuit the DataNode and go directly to the file as an optimization.
The file is in the same rack (if rack awareness is configured).
The file is somewhere else.
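As a rough illustration of that preference order, here is a small Python sketch. The names `locality_rank` and `order_replicas` and the `(host, rack)` tuples are made up for the example; this is not the HDFS client API.

```python
def locality_rank(client_host, client_rack, replica):
    """Return 0 for a replica on the client's machine, 1 for same rack, 2 otherwise."""
    host, rack = replica
    if host == client_host:
        return 0
    if rack == client_rack:
        return 1
    return 2

def order_replicas(client_host, client_rack, replicas):
    """Sort (host, rack) replicas from closest to farthest.

    sorted() is stable, so replicas with equal rank keep their original order.
    """
    return sorted(replicas, key=lambda r: locality_rank(client_host, client_rack, r))
```

A reader on `dn1` in `/rack1` would try its local copy first, then a rack-mate, then anything else.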
What are the scenarios (if any) in which the client switches to another DataNode that has the requested rows? For example, if one of the DataNodes becomes overloaded or slow, can the client library figure out that it should contact another DataNode from the list returned by the NameNode?
It's not that smart. It'll switch if it thinks the DataNode is down (meaning it times out), but not in any other situation that I know of. I believe that it will just go to the next one in the list, but it might contact the NameNode again. I'm not 100% sure.
Is there a possibility of getting stale data from one of the replicas? For example, a client acquires the list of DataNodes and starts reading from one of them. In the meantime, a write request comes from another client to the NameNode. We have dfs.replication == 3 and dfs.replication.min = 2. The NameNode considers the write successful after flushing to disk on 2 out of 3 nodes, while the first client is reading from the 3rd node and doesn't know (yet) that another write has been committed?
Stale data is possible, but not in the situation you describe. Files are write-once and immutable (other than append, but don't append if you don't have to). The NameNode won't tell you the file is there until it is completely written. In the case of append, shame on you then. The behavior of reading from an actively-being-appended-to file on a local filesystem is unpredictable as well. You should expect the same in HDFS.
One way stale data could happen is if you retrieve your list of block locations and the NameNode decides to migrate all three of them at once before you access it. I don't know what would happen there. In the 5 years of using Hadoop, I've never had this be a problem. Even when running the balancer at the same time as doing stuff.
Does Hadoop maintain the same reading policy when supporting HBase?
HBase is not treated specially by HDFS. There is some talk about using a custom block placement strategy with HBase to get better data locality, but that's in the weeds.

HDFS' Location Awareness

According to several sources (1, 2, 3), HDFS' location awareness is about knowing the physical location of nodes and replicating data on different racks to reduce the impact of rack-level failures caused by, e.g., power supply and/or switch issues.
How does HDFS know the physical location of nodes and racks and subsequently decide to replicate data to nodes located on other racks?
Rack-awareness is configured when the cluster is set up. This can be done either manually for each node or through a script.
Each DataNode is given a network location, which is simply a string, much like a file system path.
The NameNode then builds a network topology (basically a tree structure) using the network locations of each DataNode. This topology is then used to determine block replica placement.
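Because the network locations are path-like strings, "how far apart are two nodes" can be computed as hops through the tree to the closest common ancestor. Below is a minimal sketch of that idea; the `distance` function and the example paths such as `/rack1/dn1` are assumptions for illustration, not Hadoop's own implementation.

```python
def distance(path_a, path_b):
    """Count tree hops between two path-like locations, e.g. '/rack1/dn1'.

    Each node climbs to the closest common ancestor, so the result is
    0 for the same node, 2 for the same rack, 4 for different racks.
    """
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)
```

A placement policy can then prefer targets whose distance from the first replica is larger (a different rack) when it wants to survive a rack failure.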
Somebody needs to know where the Data Nodes are located in the network topology and use that information to make an intelligent decision about where data replicas should exist in the cluster. That "somebody" is the Name Node.
The Name Node stores this information in the namespace.
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.

name node Vs secondary name node

Hadoop is consistent and partition tolerant, i.e. it falls under the CP category of the CAP theorem.
Hadoop is not available because all the nodes depend on the name node. If the name node fails, the cluster goes down.
But considering that an HDFS cluster has a secondary name node, why can't we call Hadoop available? If the name node is down, the secondary name node can be used for the writes.
What is the major difference between the name node and the secondary name node that makes Hadoop unavailable?
Thanks in advance.
The namenode stores the HDFS filesystem information in a file named fsimage. Updates to the file system (add/remove blocks) do not update the fsimage file; instead they are logged to a separate file, so the I/O is fast, append-only streaming as opposed to random file writes. When restarting, the namenode reads the fsimage and then applies all the changes from the log file to bring the filesystem state up to date in memory. This process takes time.
The secondarynamenode's job is not to be a secondary to the name node, but only to periodically read the filesystem changes log and apply the changes to the fsimage file, thus bringing it up to date. This allows the namenode to start up faster next time.
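The fsimage-plus-log idea can be shown with a toy Python sketch. The data layout here (a dict for the fsimage, tuples for log records) and the names `replay` and `checkpoint` are invented for the example; the real on-disk formats are different.

```python
def replay(fsimage, edits):
    """Apply logged changes to a snapshot and return the current namespace.

    fsimage: dict mapping file path -> list of block ids (toy format).
    edits:   list of (op, path, blocks) records in the order they were logged.
    """
    namespace = dict(fsimage)  # copy, so the snapshot itself is untouched
    for op, path, blocks in edits:
        if op == "add":
            namespace[path] = list(blocks)
        elif op == "delete":
            namespace.pop(path, None)
        else:
            raise ValueError("unknown edit op: %r" % (op,))
    return namespace

def checkpoint(fsimage, edits):
    """Merge the log into a new snapshot, leaving an empty log behind."""
    return replay(fsimage, edits), []
```

Startup cost is proportional to the length of `edits`, which is why merging the log into the image ahead of time makes the next restart faster.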
Unfortunately, the secondarynamenode service is not a standby secondary namenode, despite its name. Specifically, it does not offer HA for the namenode. This is well illustrated here.
See Understanding NameNode Startup Operations in HDFS.
Note that more recent distributions (currently Hadoop 2.6) introduce namenode High Availability using NFS (shared storage) and/or namenode High Availability using the Quorum Journal Manager.
Things have changed over the years, especially with Hadoop 2.x. The Namenode is now highly available, with a failover feature.
The Secondary Namenode is now optional, and a Standby Namenode is used for the failover process.
The Standby NameNode stays up to date with all the file system changes the Active NameNode makes.
HDFS high availability is possible with two options, NFS and the Quorum Journal Manager, but the Quorum Journal Manager is the preferred option.
Have a look at Apache documentation
From slide 8:
When any namespace modification is performed by the Active node, it durably logs a record of the modification to a majority of these JNs. The Standby node reads these edits from the JNs and applies them to its own namespace.
In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.
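The "majority of JNs" rule from the slide can be sketched in a few lines of Python. The `JournalNode` class and `log_edit` function below are hypothetical stand-ins, not the Quorum Journal Manager API, and they ignore epochs, retries and recovery.

```python
class JournalNode:
    """Toy journal node that stores (txid, record) pairs while it is up."""

    def __init__(self, up=True):
        self.up = up
        self.edits = []

    def append(self, txid, record):
        if not self.up:
            return False  # simulate an unreachable node
        self.edits.append((txid, record))
        return True

def log_edit(journal_nodes, txid, record):
    """Send one edit to every journal node; succeed only with a majority of acks."""
    acks = sum(1 for jn in journal_nodes if jn.append(txid, record))
    return acks > len(journal_nodes) // 2
```

With three journal nodes the edit is durable as long as two acknowledge it, so one node can be down without blocking the Active NameNode.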
Have a look at the failover process described in this related SE question:
How does the Hadoop Namenode failover process work?
Regarding your questions about the CAP theorem and Hadoop:
It can be strongly consistent
HDFS is almost always highly available, unless you meet with some bad luck
(if all three replicas of a block are down, you won't get the data)
It supports data partitioning
The Name Node is the primary node, in which all the metadata is stored in the fsimage and editlog files periodically. But when the name node is down, the secondary node will be online, although this node only has read access to the fsimage and editlog files and does not have write access to them. All the secondary node's operations will be stored in a temp folder. When the name node comes back online, this temp folder will be copied to the name node, and the name node will update the fsimage and editlog files.
Even in HDFS High Availability, where there are two NameNodes instead of one NameNode and one SecondaryNameNode, there is not availability in the strict CAP sense. It only applies to the NameNode component, and even there if a network partition separates the client from both of the NameNodes then the cluster is effectively unavailable.
To explain it in a simple way, think of the Name Node as a man (working/live) and the Secondary Name Node as an ATM machine (storage/data storage).
All the functions are carried out by the NN, the man. If it goes down or fails, the SNN is useless and doesn't work, but later it can be used to recover your data or logs.
When the NameNode starts, it loads the FSImage and replays the Edit Logs to create the latest updated namespace. This process may take a long time if the Edit Log file is big, which increases startup time.
The job of the Secondary Name Node is to periodically check the edit log and replay it to create an updated FSImage, which it stores in persistent storage. When the Name Node starts, it doesn't need to replay the edit log to create an updated FSImage; it uses the FSImage created by the secondary name node.
The namenode is a master node that contains metadata in terms of fsimage and also contains the edit log. The edit log contains recently added/removed block information in the namespace of the namenode. The fsimage file contains metadata of the entire hadoop system in a permanent storage. Every time we need to make changes permanently in fsimage, we need to restart namenode so that edit log information can be written at namenode, but it takes a lot of time to do that.
A secondary namenode is used to bring fsimage up to date. The secondary name node will access the edit log and make changes in fsimage permanently so that next time namenode can start up faster.
Basically the secondary namenode is a helper for namenode and performs housekeeping functionality for the namenode.

Does decommissioning a node remove data from that node?

In Hadoop, if I decommission a node, Hadoop will redistribute the files across the cluster so they are properly replicated. Will the data be deleted from the decommissioned node?
I am trying to balance the data across the disks on a particular node. I plan to do this by decommissioning the node and then recommissioning it. Do I need to delete the data from that node after decommissioning is complete, or will it be enough to simply recommission it (remove it from the excludes file and run hadoop dfsadmin -refreshNodes)?
UPDATE: It worked for me to decommission a node, delete all the data on that node, and then recommission it.
AFAIK, data is not removed from a DataNode when you decommission it. Further writes on that DataNode will not be possible though. When you decommission a DataNode, the replicas held by that DataNode are marked as "decommissioned" replicas, which are still eligible for read access.
But why do you want to perform this decommissioning/recommissioning cycle? Why don't you just specify all the disks as a comma-separated value for the property in your hdfs-site.xml and restart the DataNode daemon? Run the balancer after the restart.
Hadoop currently doesn't support doing this automatically, but there might be hacks around to do it automatically.
Decommissioning and then re-replicating will, in my opinion, be slower than manually moving blocks across the different disks.
You can do the balancing manually though across the disks, something like this -
1. Take down HDFS, or only the datanode you are targeting.
2. Use the UNIX mv command to move the individual block and meta pairs from one directory to another on the host machine. E.g. move pairs of blk data files and blk .meta files across the disks on the same host.
3. Restart HDFS or the datanode.
Reference link for the procedure
You probably need to move pairs of blk_* and blk_*.meta files to and from the dfs/current directory of each data disk. E.g. a pair of files: blk_3340211089776584759 and blk_3340211089776584759_1158.meta
If you don't want to do this manually, you can probably write a custom script that detects how much space is occupied in the dfs/current directory of each of your data disks and rebalances them accordingly, i.e. moves pairs of blk_* and blk_*.meta from one disk to another.
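As a starting point for such a script, here is a small Python sketch of the pairing step only. It works purely on file names shaped like the example above; `pair_block_files` and the regular expression are my own assumptions about the naming, so check them against what is actually in your dfs/current directory before moving anything.

```python
import re

# Assumed meta file naming, based on the example pair: blk_<id>_<genstamp>.meta
META_RE = re.compile(r"^(blk_-?\d+)_\d+\.meta$")

def pair_block_files(filenames):
    """Return (block_file, meta_file) pairs that must be moved together.

    Meta files whose block file is missing are skipped rather than moved alone.
    """
    names = set(filenames)
    pairs = []
    for name in sorted(names):
        match = META_RE.match(name)
        if match and match.group(1) in names:
            pairs.append((match.group(1), name))
    return pairs
```

A full script would list each disk's directory (for example with `os.listdir`), compare usage, and move whole pairs only while the datanode is stopped.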

NameNode DataNode communication on read operation

So I'm studying for the CCDH certification, and I found some sample questions online but to be honest, I don't think they are all that accurate so I would like to check here.
Which of the following describes best the read operation on HDFS?
A. The client queries the NameNode for the block location(s). The NameNode returns the
block location(s) to the client. The client reads the data directly off the DataNode(s).
B. The client queries all DataNodes in parallel. The DataNode that contains the requested
data responds directly to the client. The client reads the data directly off the DataNode.
C. The client contacts the NameNode for the block location(s). The NameNode then
queries the DataNodes for block locations. The DataNodes respond to the NameNode,
and the NameNode redirects the client to the DataNode that holds the requested data
block(s). The client then reads the data directly off the DataNode.
D. The client contacts the NameNode for the block location(s). The NameNode contacts
the DataNode that holds the requested data block. Data is transferred from the DataNode
to the NameNode, and then from the NameNode to the client.
I know for sure that B and D are wrong. According to the document, the correct answer is C. But I always thought that the NameNode already had the block locations in RAM and did not need to query the datanodes? So I would expect the correct answer to be A. Am I wrong, or is the document wrong?
The NameNode doesn't query DataNodes in order to get the block locations. Instead, it builds the block map dynamically from the block reports sent by the DNs. Remember, DNs send heartbeats to the NN every few seconds and also send block reports periodically.
So, the correct answer should be option A.
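Here is a toy Python sketch of that idea: the location map is filled in only by reports that the DataNodes push, and a lookup never contacts a DataNode. `BlockMap` and its methods are invented names, not NameNode classes.

```python
class BlockMap:
    """In-memory block -> datanodes lookup, built only from pushed block reports."""

    def __init__(self):
        self._blocks_by_node = {}

    def process_block_report(self, datanode, block_ids):
        # A full report replaces whatever this datanode reported before.
        self._blocks_by_node[datanode] = set(block_ids)

    def locations(self, block_id):
        # Answered from memory; no datanode is queried here.
        return sorted(
            dn for dn, blocks in self._blocks_by_node.items() if block_id in blocks
        )
```

A client request for a block's locations is then just a call to `locations`, which matches option A.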
The reason the namenode seldom initiates communication with the datanodes is that its main job is to serve read/write requests from clients and to update its metadata from what the datanodes report, so it doesn't waste resources and time fetching data from the datanodes. Instead, the datanodes communicate with the namenode, over simple socket-based communication, to provide heartbeats and block reports.
The correct answer should be option A.
NN-> Client - NN stores all file names, block locations in memory and responds to the client with required information.
NN->DN: this seems invalid, because in Hadoop (cheap hardware) a DN is sometimes unavailable in the cluster (due to network or hardware issues), so the NN should not depend on the DNs for metadata.
Hope this helps.
