Hadoop replication mechanism

In HDFS, the block placement policy places one replica on the same rack as the writer, while the other two replicas go to different nodes of a different rack.
But why doesn't it place one of the other two replicas on the same rack as the original block of data? Wouldn't that be more efficient, since it wouldn't require as much bandwidth to write the other two replicas to the other rack?

Data replication is performed as follows:
The NameNode selects the DataNodes that will host the replicas, balancing data placement across nodes and compiling a list of nodes for the replication pipeline.
The 1st replica is placed on the first node in the list.
The 2nd replica is copied to another node in the same server rack.
The 3rd replica is written to an arbitrary node in another server rack.
Any further replicas are placed on arbitrary nodes.
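To verify where the replicas of a given file actually landed, you can ask the NameNode through the standard FileSystem API. A minimal sketch (the file path argument and class name are assumptions; run it with any HDFS path):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowReplicaPlacement {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path(args[0]));
            // One BlockLocation per block; getHosts() lists the DataNodes
            // holding each replica, getTopologyPaths() includes the rack.
            for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset=" + loc.getOffset()
                        + " hosts=" + String.join(",", loc.getHosts())
                        + " racks=" + String.join(",", loc.getTopologyPaths()));
            }
        }
    }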


How does balancer work in HDFS?

The balancer iteratively moves replicas from DataNodes with higher utilization to DataNodes with lower utilization.
Will that affect the concept of rack awareness?
For example:
I have three machines placed in two racks, and data is placed following the concept of rack awareness.
What would happen if I add a new machine to the cluster and run the balancer command?
Rack awareness & data locality is a YARN concept. The HDFS balancer only cares about leveling out the Datanode usage.
If you have 3 machines, with 3 replicas by default, then every machine could be guaranteed to have 1 replica, therefore with 2 racks, you're practically guaranteed to have rack locality.
Node locality is more performant than rack awareness, anyway.
If you have 10 Gb/s intra-cluster links between nodes, data locality is a moot point. This is why AWS can still reasonably process data in S3, for example, where data-local processing is not available.
If your question is about how load balancing is used: load balancing helps spread work evenly across the free nodes when a node is loaded above its threshold level.
A cluster is considered balanced if, for each DataNode, the ratio of used space to total capacity on that node (the utilization of the node) differs from the ratio of used space to total capacity across the cluster (the utilization of the cluster) by no more than the threshold value.
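As a made-up illustration: if the cluster as a whole is 60% full and the threshold is 10 percentage points, the cluster counts as balanced only when every DataNode's utilization lies between 50% and 70%; running the balancer with a threshold of 10 (its default) enforces exactly that bound.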
When load balancing is applied at runtime, it is called dynamic load balancing, and it can be realized in either a direct or an iterative manner, depending on how the destination node is selected:
In the iterative methods, the final destination node is determined through several iteration steps.
In the direct methods, the final destination node is selected in one step.
Rack Awareness
Rack awareness prevents losing data when an entire rack fails, and allows HDFS to make use of bandwidth from multiple racks when reading a file.
On a multi-rack cluster, block replicas are maintained under the policy that no more than one replica is placed on any one node and no more than two replicas are placed in the same rack, with the constraint that the number of racks used for a block's replicas is always less than the total number of replicas.
For example:
When a new block is created, the first replica is placed on the local node, the second is placed on a different rack, and the third on a different node in the local rack.
When re-replicating a block, if the number of existing replicas is one, place the second on a different rack.
When the number of existing replicas is two and both are on the same rack, place the third on a different rack.
For reads, the NameNode first checks whether the client's machine is located in the cluster. If so, block locations are returned to the client ordered by proximity.
This minimizes write cost while maximizing read speed.
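Under the hood, rack awareness comes from a topology mapping that the NameNode consults: either an external script, or (in Hadoop 2 and later) a Java class implementing org.apache.hadoop.net.DNSToSwitchMapping, configured via net.topology.node.switch.mapping.impl. A minimal sketch, assuming a hypothetical hostname convention like node-<rack>-<id>:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.net.DNSToSwitchMapping;

    // Hypothetical mapping: derive the rack from a hostname convention
    // ("node-<rack>-<id>") instead of calling out to a topology script.
    public class NameBasedRackMapping implements DNSToSwitchMapping {
        @Override
        public List<String> resolve(List<String> names) {
            List<String> racks = new ArrayList<>();
            for (String name : names) {
                String[] parts = name.split("-");
                racks.add(parts.length >= 2 ? "/rack" + parts[1] : "/default-rack");
            }
            return racks;
        }

        @Override
        public void reloadCachedMappings() { }

        @Override
        public void reloadCachedMappings(List<String> names) { }
    }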

Rack topology in Hadoop

I was googling about rack topology and found this question; maybe it's a Hadoop certification question:
Your cluster has slave nodes in three different racks, and you have written a rack topology script identifying each machine as being in rack1, rack2, or rack3. A client machine outside of the cluster writes a small (one-block) file to HDFS. The first replica of the block is written to a node on rack2. How is block placement determined for the other two replicas?
The answer given on some sites is: either both will be written to nodes on rack1, or both will be written to nodes on rack3.
Why not write the next replica on rack2 itself and the remaining one on either rack1 or rack3?
If the client is outside the cluster, the node that receives the first replica is treated as the local node.
From the documentation, Hadoop places replicas on three different DataNodes:
Local DataNode: the DataNode where the client initiates the write (e.g., using the hadoop fs -cp command). The first replica is placed here. If the client is writing the data from outside the cluster, this node is chosen at random.
Off-rack DataNode: a DataNode on another rack. The second replica is placed here.
On-rack DataNode: a DataNode physically on the same rack as the first DataNode. The third replica is placed here.
Hence in your case, since the first replica is written to rack2, that node is the Local DataNode; the second replica goes to either rack1 or rack3 (Off-rack DataNode); and the third replica again to rack2 (On-rack DataNode).

Replica placement logic in Cassandra with multiple datacenters

When a write is performed with consistency level EACH_QUORUM and a total replication factor of 4 across two data centers, DC1 and DC2, with 3 replicas placed in DC1 and 1 in DC2, which class picks the nodes where the secondary and tertiary copies should reside? The snitch is GossipingPropertyFileSnitch and the strategy is NetworkTopologyStrategy. The client creates a new file using FileSystem.create and performs a write to it. The first copy goes to a node based on the token and the hash of the row key. Where do the second and third copies go in DC1, and where does the copy go in DC2?
The consistency level does not have anything to do with the placement strategy. It is simply how many nodes must report back to the coordinator before success or failure is reported to the client.
Each data center places copies according to its replication factor independently. So in DC2, the single copy is stored according to the partitioning function. In DC1, replica placement is done according to this document: http://www.datastax.com/docs/1.0/cluster_architecture/replication#networktopologystrategy
The NetworkTopologyStrategy determines replica placement independently
within each data center as follows:
The first replica is placed according to the partitioner (same as with
SimpleStrategy). Additional replicas are placed by walking the ring
clockwise until a node in a different rack is found. If no such node
exists, additional replicas are placed in different nodes in the same
rack. NetworkTopologyStrategy attempts to place replicas on distinct
racks because nodes in the same rack (or similar physical grouping)
can fail at the same time due to power, cooling, or network issues.
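For context, the per-data-center replica counts in the question come from the keyspace definition, not from the snitch. A sketch using the DataStax Java driver 3.x, where the keyspace name demo and the contact point are assumptions:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class CreateKeyspace {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                // 3 replicas in DC1 and 1 in DC2, as in the question.
                session.execute(
                    "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                  + "{'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 1}");
            }
        }
    }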

What are the possible reasons behind the imbalance of files stored on HDFS?

Sometimes the data blocks are stored in an imbalanced way across the DataNodes. Based on the HDFS block placement policy, the first replica is favored to be stored on the writer node (i.e. the client node), the second replica is stored on a remote rack, and the third one on the local rack. What use cases make the data blocks unbalanced across the DataNodes under this placement policy? One possible reason that comes to mind: if the writer nodes are few, one replica of every block written will land on those nodes. Are there any other reasons?
Here are some potential reasons for data skew:
If some of the DataNodes are unavailable for some time (not accepting requests/writes), the cluster can end up unbalanced.
TaskTrackers are not collocated with DataNodes evenly across cluster nodes. If we write data through MapReduce in this situation, the cluster can be unbalanced because the nodes hosting both a TaskTracker and a DataNode would be preferred.
Same as above, but with the RegionServers of HBase.
Large deletion of data can result in an unbalanced cluster depending on the location of the deleted blocks.
Adding new DataNodes will not automatically rebalance existing blocks across the cluster.
The "hdfs balancer" command allows admins to rebalance the cluster. Also, https://issues.apache.org/jira/browse/HDFS-1804 added a new block storage policy that takes into account free space left on the volume.

HDFS replication factor

When I'm uploading a file to HDFS, if I set the replication factor to 1, will the file's splits reside on one single machine, or will the splits be distributed to multiple machines across the network?
hadoop fs -D dfs.replication=1 -copyFromLocal file.txt /user/ablimit
According to Hadoop: The Definitive Guide:
Hadoop’s default strategy is to place the first replica on the same node as the client (for
clients running outside the cluster, a node is chosen at random, although the system
tries not to pick nodes that are too full or too busy). The second replica is placed on a
different rack from the first (off-rack), chosen at random. The third replica is placed on
the same rack as the second, but on a different node chosen at random. Further replicas
are placed on random nodes on the cluster, although the system tries to avoid placing
too many replicas on the same rack.
This logic makes sense as it decreases network chatter between the different nodes. But the book was published in 2009, and there have been a lot of changes in the Hadoop framework since.
I think it depends on whether the client is also a Hadoop node. If the client is a Hadoop node, then all the splits will be on that same node. This doesn't provide any better read/write throughput despite having multiple nodes in the cluster. If the client is not a Hadoop node, then a node is chosen at random for each split, so the splits are spread across the nodes of the cluster, which does give better read/write throughput.
One advantage of writing to multiple nodes is that even if one of the nodes goes down, only a couple of splits might be lost, and the data in the remaining splits can still be recovered.
If you set replication to 1, then the file will be present only on the client node, that is, the node from which you are uploading the file.
If your cluster is a single node, then when you upload a file it will be split according to the block size, and all the blocks remain on that single machine.
If your cluster is multi-node, then when you upload a file it will be split according to the block size and distributed to different DataNodes in your cluster via the write pipeline; the NameNode decides where in the cluster the data should go.
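As a made-up illustration: with the default block size of 128 MB (Hadoop 2 onward), a 300 MB file is stored as three blocks (128 MB + 128 MB + 44 MB). With a replication factor of 1, each block exists exactly once, and on a multi-node cluster those single copies may end up on three different DataNodes.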
The HDFS replication factor controls how many copies of the data are kept: if your replication factor is 2, then every block you upload to HDFS will have a second copy.
A replication factor of 1 is what you would typically use on a single-node cluster, where there is only the one node to hold the data: http://commandstech.com/replication-factor-in-hadoop/
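For completeness, the upload from the question can also be done through the Java API; a minimal sketch (the class name is an assumption, the paths are taken from the question's command, and replication is pinned to 1 the same way the -D flag does):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SingleReplicaUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("dfs.replication", "1"); // same effect as -D dfs.replication=1
            FileSystem fs = FileSystem.get(conf);
            fs.copyFromLocalFile(new Path("file.txt"), new Path("/user/ablimit/file.txt"));
            // The factor of an existing file can also be changed after the fact:
            fs.setReplication(new Path("/user/ablimit/file.txt"), (short) 1);
        }
    }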
