I was googling rack topology and found this question; it may be a Hadoop certification question:
Your cluster has slave nodes in three different racks, and you have written a rack topology script identifying each machine as being in rack1, rack2, or rack3. A client machine outside of the cluster writes a small (one-block) file to HDFS. The first replica of the block is written to a node on rack2. How is block placement determined for the other two replicas?
The answer given on some sites is: either both will be written to nodes on rack1, or both will be written to nodes on rack3.
Why not write the next replica on rack2 itself and the remaining replica on either rack1 or rack3?
If the client is outside the cluster, the node that receives the first replica is treated as the local node.
From the documentation, Hadoop places replicas on 3 different Data Nodes:
Local Data Node: The Data Node where the client initiates the write (e.g. using the hadoop fs -cp command). The first replica is placed here. If the client is writing from outside the cluster, this node is chosen at random.
Off-Rack Data Node: A Data Node on a different rack. The second replica is placed here.
On-Rack Data Node: A Data Node physically located on the same rack as the first Data Node. The third replica is placed here.
Hence, in your case, since the first replica is written to rack 2, that node is the Local Data Node. The second replica goes to rack 1 or rack 3 [Off-Rack Data Node], and the third replica goes to rack 2 again [On-Rack Data Node].
In HDFS, the block placement policy places one replica on the same rack as the writer and the other two replicas on different nodes of a different rack.
But why doesn't it place one of the other two replicas on the same rack as the original block? Wouldn't that be more efficient, since it would need less bandwidth than writing both of the other replicas to another rack?
Data replication is performed as follows:
The NameNode selects new DataNodes to host the replicas.
The NameNode balances data placement across nodes and compiles a list of nodes for replication.
The 1st replica is placed on the first node from the list.
The 2nd replica is copied to another node in the same server rack.
The 3rd replica is written to an arbitrary node in another server rack.
The remaining replicas are placed arbitrarily.
Balancer iteratively moves replicas from DataNodes with higher utilization to DataNodes with lower utilization.
Will that affect rack awareness?
For example:
I have three machines placed in two racks, and data is placed according to rack awareness.
What would happen if I add a new machine to the cluster and run the balancer command?
Rack awareness & data locality is a YARN concept. The HDFS balancer only cares about leveling out the Datanode usage.
If you have 3 machines, with 3 replicas by default, then every machine could be guaranteed to have 1 replica, therefore with 2 racks, you're practically guaranteed to have rack locality.
Node locality is more performant than rack awareness, anyway.
If you have 10 GB intra cluster speeds between nodes, data locality is a moot point. This is why AWS can still reasonably process data in S3, for example, where data locality processing is not available
If your question is how load balancing is used: Load balancing is helpful in spreading the load equally across the free nodes when a node is loaded above its threshold level.
A cluster is considered balanced if, for each data node, the ratio of used space at the node to the total capacity of the node (the utilization of the node) differs from the ratio of used space in the cluster to the total capacity of the cluster (the utilization of the cluster) by no more than the threshold value.
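The balance criterion above can be written out as a small check. This is only a sketch of the definition, not the balancer's actual code: the function name `is_balanced`, the `(used, capacity)` tuple layout, and the 10 percent default threshold are assumptions made for illustration.

```python
def is_balanced(nodes, threshold_pct=10.0):
    """Return True if every node's utilization is within threshold_pct
    percentage points of the cluster-wide utilization.

    nodes: list of (used_bytes, capacity_bytes) tuples, one per data node
           (hypothetical layout, chosen for this example).
    """
    total_used = sum(used for used, _ in nodes)
    total_capacity = sum(capacity for _, capacity in nodes)
    cluster_util = 100.0 * total_used / total_capacity

    for used, capacity in nodes:
        node_util = 100.0 * used / capacity
        if abs(node_util - cluster_util) > threshold_pct:
            return False
    return True
```

For example, three nodes at 60% utilization are balanced among themselves, but adding an empty fourth node pulls the cluster utilization down to 45%, so every node then falls outside a 10-point threshold until the balancer moves data.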
Load balancing applied at runtime is called dynamic load balancing. It can be done in either a direct or an iterative manner, depending on how the execution node is selected:
In the iterative methods, the final destination node is determined through several iteration steps.
In the direct methods, the final destination node is selected in one step.
Rack Awareness
Rack awareness prevents data loss when an entire rack fails and allows reads to use bandwidth from multiple racks.
On a multi-rack cluster, block replicas are maintained under a policy that places no more than one replica on any one node and no more than two replicas on the same rack, with the constraint that the number of racks used for a block's replicas is always less than the total number of replicas.
For example,
When a new block is created, the first replica is placed on the local node, the second one is placed at a different rack, the third one is on a different node at the local rack.
When re-replicating a block, if the number of existing replicas is one, place the second one on a different rack.
When the number of existing replicas is two, if the two replicas are on the same rack, place the third one on a different rack;
For reading, the name node first checks whether the client's computer is located in the cluster. If so, the block locations on the data nodes closest to the client are returned.
This policy minimizes write cost and maximizes read speed.
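The two per-block constraints stated above (at most one replica per node, at most two replicas per rack) can be checked mechanically. The following is a minimal sketch under the assumption that a block's replicas are given as `(node, rack)` pairs; `placement_ok` is a made-up helper name, not an HDFS API.

```python
from collections import Counter


def placement_ok(replicas):
    """Check one block's replica locations against the stated policy.

    replicas: list of (node, rack) pairs (hypothetical representation).
    """
    per_node = Counter(node for node, _ in replicas)
    per_rack = Counter(rack for _, rack in replicas)

    # No more than one replica of the block on any single node.
    one_per_node = all(count == 1 for count in per_node.values())
    # No more than two replicas of the block on any single rack.
    two_per_rack = all(count <= 2 for count in per_rack.values())
    return one_per_node and two_per_rack
```

A placement such as one replica on rack1 and two on different nodes of rack2 passes, while three replicas on one rack, or two replicas on the same node, does not.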
I am a beginner learning Hadoop. Is it possible that 2 different data blocks from the same file could be stored in the same data node? For example: blk-A and blk-B from file "file.txt" could be placed in the same data node (datanode 1).
Here is the documentation that explains the block placement policy. Currently, HDFS replication is 3 by default, which means there are 3 replicas of each block. They are placed as follows:
The first replica is placed on a datanode on one rack.
The second replica is placed on a datanode on a different rack.
The third replica is placed on a different datanode on the same rack as the second replica.
This policy helps when a datanode dies, a block gets corrupted, and so on.
Is it possible?
Unless you change the source code, there is no property that will let you place two replicas of the same block on the same datanode.
In my opinion, placing two replicas on the same datanode defeats the purpose of HDFS. Blocks are replicated so that HDFS can recover in the cases described above. If both replicas are on the same datanode and that datanode dies, you lose two replicas instead of one.
The answer depends on the cluster topology. Hadoop tries to distribute data among data centers and data nodes. But what if you have only one data center, or a one-node cluster (a pseudo-distributed cluster)? In those cases the optimal distribution cannot happen, and it is possible that all blocks end up on the same data node. In production it is recommended to have more than one data center (physically, not only in configuration) and at least as many data nodes as the replication factor.
I was going through Hadoop, and I am unsure whether there is a difference between rack awareness and the NameNode. Will rack awareness and the NameNode reside on the same box?
As Aviral rightly said, the question has been quite vague. But just quoting for your understanding,
Namenode : The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself.
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.
You can read in detail about this concept here.
Rack Awareness : In simple words rack awareness is the strategy namenode employs to choose the nearest datanode based on rack information. You can read details here
Furthermore, I would like to suggest this blog.
Image credits Brad Hedlund
From Apache HDFS Users Guide
HDFS is the primary distributed storage used by Hadoop applications.
An HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.
Typically, large Hadoop clusters are arranged in racks, and network traffic between different nodes within the same rack is much more desirable than network traffic across racks. In addition, the NameNode tries to place replicas of a block on multiple racks for improved fault tolerance.
From RackAwareness tutorial:
Hadoop components are rack-aware. For example, HDFS block placement will use rack awareness for fault tolerance by placing one block replica on a different rack. This provides data availability in the event of a network switch failure or partition within the cluster.
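Rack information usually reaches Hadoop through a topology script, configured via the `net.topology.script.file.name` property. Hadoop calls the script with one or more host names or IP addresses as arguments and reads one rack path per argument from its standard output. Below is a minimal sketch of such a script; the IP addresses and the `RACK_BY_HOST` table are invented for illustration, and a real script would look the mapping up from your own inventory.

```python
#!/usr/bin/env python3
import sys

# Hypothetical host-to-rack table; replace with your own inventory.
RACK_BY_HOST = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.2.21": "/rack2",
    "10.0.3.7": "/rack3",
}

# Fallback rack for hosts that are not listed in the table.
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Return the rack path for each host, in the same order as given."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # One rack path per line, one line per argument.
    for rack in resolve(sys.argv[1:]):
        print(rack)
```

The script must always print exactly one rack per argument, which is why unknown hosts fall back to a default rack instead of being skipped.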
Let's see how Hadoop writes are implemented.
If the writer is on a datanode, the 1st replica is placed on the local machine, otherwise a random datanode.
The 2nd replica is placed on a datanode that is on a different rack.
The 3rd replica is placed on a different datanode on the same rack as the second replica.
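The three steps above can be sketched as a small simulation. This is not the NameNode's implementation: it ignores node load and free space, and it assumes the cluster has at least two racks and that the chosen remote rack has at least two nodes. The `choose_targets` name and the node-to-rack dictionary are made up for the example.

```python
import random


def choose_targets(topology, writer=None, rng=random):
    """Pick three datanodes for a new block, following the steps above.

    topology: dict mapping node name -> rack name (hypothetical layout).
    writer:   node name of the client if it runs on a datanode, else None.
    rng:      random source, injectable so results can be reproduced.
    """
    nodes = list(topology)

    # 1st replica: the writer's own node, otherwise a random datanode.
    first = writer if writer in topology else rng.choice(nodes)

    # 2nd replica: any datanode on a different rack than the first.
    remote = [n for n in nodes if topology[n] != topology[first]]
    second = rng.choice(remote)

    # 3rd replica: a different datanode on the same rack as the second.
    same_rack = [n for n in nodes
                 if topology[n] == topology[second] and n != second]
    third = rng.choice(same_rack)

    return [first, second, third]
```

With the first replica on rack2, this sketch always puts the other two replicas together on either rack1 or rack3, never split across both and never back on rack2.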
Because data blocks are replicated on three different nodes across two different racks, Hadoop read operations offer high availability of data blocks.
At least one replica is stored on a different rack. If one rack is not accessible, Hadoop can still fetch the data block from the other rack.
When I upload a file to HDFS with the replication factor set to 1, will the file splits reside on a single machine, or will they be distributed across multiple machines on the network?
hadoop fs -D dfs.replication=1 -copyFromLocal file.txt /user/ablimit
According to the Hadoop : Definitive Guide
Hadoop’s default strategy is to place the first replica on the same node as the client (for
clients running outside the cluster, a node is chosen at random, although the system
tries not to pick nodes that are too full or too busy). The second replica is placed on a
different rack from the first (off-rack), chosen at random. The third replica is placed on
the same rack as the second, but on a different node chosen at random. Further replicas
are placed on random nodes on the cluster, although the system tries to avoid placing
too many replicas on the same rack.
This logic makes sense, as it decreases the network chatter between the different nodes. But the book was published in 2009, and there have been a lot of changes in the Hadoop framework since.
I think it depends on whether the client is itself a Hadoop node. If the client is a Hadoop node, all the splits will be on that node. This doesn't provide any better read/write throughput, in spite of having multiple nodes in the cluster. If the client is not a Hadoop node, a node is chosen at random for each split, so the splits are spread across the nodes in the cluster. This provides better read/write throughput.
One advantage of writing to multiple nodes is that even if one of the nodes goes down, a couple of splits might be lost, but at least some data can be recovered from the remaining splits.
If you set replication to be 1, then the file will be present only on the client node, that is the node from where you are uploading the file.
If your cluster is single-node, then when you upload a file it is split according to the block size and all of it remains on the single machine.
If your cluster is multi-node, then when you upload a file it is split according to the block size and distributed to different datanodes in your cluster via a pipeline; the NameNode decides where the data should go in the cluster.
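To make "split according to the block size" concrete, here is a minimal sketch of how a file's length maps to block lengths. The 128 MB default is an assumption matching the usual `dfs.blocksize` setting in recent Hadoop versions; `block_sizes` is a made-up helper, not part of any Hadoop API.

```python
def block_sizes(file_size, block_size=128 * 1024 * 1024):
    """Return the length in bytes of each block a file would be split into.

    All blocks are full-sized except possibly the last one, which only
    holds the remaining bytes.
    """
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes
```

A 300 MB file therefore becomes three blocks of 128 MB, 128 MB and 44 MB, and each of those blocks is placed independently, so with replication set to 1 they may or may not end up on the same datanode.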
The HDFS replication factor controls how many copies of the data are kept; i.e., if your replication factor is 2, all the data you upload to HDFS will have a copy.
A replication factor of 1 is typical of a single-node cluster, which has only one client node (see http://commandstech.com/replication-factor-in-hadoop/). You upload files to that node and use them from there.