How does balancer work in HDFS? - hadoop

Balancer iteratively moves replicas from DataNodes with higher utilization to DataNodes with lower utilization.
Will that affect the concept of rack awareness?
For example:
I have three machines placed in two racks, and data is placed following the concept of rack awareness.
What would happen if I add a new machine to the cluster and run the balancer command?

Rack awareness & data locality is a YARN concept. The HDFS balancer only cares about leveling out the Datanode usage.
If you have 3 machines, with 3 replicas by default, then every machine could be guaranteed to have 1 replica, therefore with 2 racks, you're practically guaranteed to have rack locality.
Node locality is more performant than rack awareness, anyway.
If you have 10 GB intra cluster speeds between nodes, data locality is a moot point. This is why AWS can still reasonably process data in S3, for example, where data locality processing is not available

If your question is how load balancing works: load balancing is helpful in spreading the load equally across the free nodes when a node is loaded above its threshold level.
A cluster is considered balanced if, for each DataNode, the ratio of used space on the node to the node's total capacity (the utilization of the node) differs from the ratio of used space in the cluster to the cluster's total capacity (the utilization of the cluster) by no more than the threshold value.
When load balancing is applied at runtime, it is called dynamic load balancing, and it can be realized in either a direct or an iterative manner according to how the execution node is selected:
In the iterative methods, the final destination node is determined through several iteration steps.
In the direct methods, the final destination node is selected in one step.
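In HDFS specifically, that threshold is the value you pass to the balancer when you start it. A typical invocation might look like this (the 10 percent value shown is just an illustration):
# Run the HDFS balancer; a DataNode is considered balanced when its utilization
# is within 10 percentage points of the overall cluster utilization.
hdfs balancer -threshold 10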
Rack Awareness
Rack Awareness prevents losing data when an entire rack fails and allows making use of bandwidth from multiple racks when reading a file.
On a multi-rack cluster, block replicas are maintained with the policy that no more than one replica is placed on any one node and no more than two replicas are placed in the same rack, with the constraint that the number of racks used for a block's replicas is always less than the total number of replicas.
For example,
When a new block is created, the first replica is placed on the local node, the second one is placed on a different rack, and the third one on a different node in the local rack.
When re-replicating a block, if the number of existing replicas is one, place the second one on a different rack.
When the number of existing replicas is two and the two replicas are on the same rack, place the third one on a different rack.
For reading, the NameNode first checks whether the client's machine is located in the cluster. If so, the block locations closest to the client are returned first.
This minimizes the write cost while maximizing read speed.
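For reference, rack awareness is driven by a topology mapping configured via net.topology.script.file.name in core-site.xml; a minimal sketch of such a script is below (the hostnames and rack paths are made up). The resulting topology can be checked with hdfs dfsadmin -printTopology.
#!/bin/bash
# Hypothetical topology script: HDFS passes DataNode hostnames/IPs as arguments
# and expects one rack path printed per argument.
for host in "$@"; do
  case "$host" in
    node1|node2) echo "/rack1" ;;
    node3|node4) echo "/rack2" ;;
    *)           echo "/default-rack" ;;
  esac
done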

Related

Hadoop Replication mechanism

In HDFS, the block placement policy is to place one replica on the same rack as the writer and the other two replicas on different nodes of a different rack.
But why doesn't it place one of the other two replicas on the same rack as the original block of data? Wouldn't that be more efficient, since it wouldn't require as much cross-rack bandwidth to write the other two replicas?
Data replication is performed as follows:
The NameNode selects new DataNodes to host the replicas: it balances data placement across nodes and compiles a list of nodes for replication.
The 1st replica is placed on the first node from the list.
The 2nd replica is copied to another node in the same server rack.
The 3rd replica is written to an arbitrary node in another server rack.
The rest of the replicas are placed in an arbitrary way.
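To see where the replicas of a particular file actually ended up (including rack information), you can run fsck; the path below is just an example:
# List every block of the file and the DataNodes/racks holding each replica.
hdfs fsck /user/ablimit/file.txt -files -blocks -locations -racks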

Configuring an Elasticsearch cluster with machines of different capacity (CPU, RAM) for rolling upgrades

Due to cost restrictions, I only have the following types of machines at my disposal for setting up an ES cluster.
Node A: Lean (w.r.t. CPU, RAM) Instance
Node B: Beefy (w.r.t. CPU, RAM) Instance
Node M: "Leaner than A" (w.r.t. CPU, RAM) Instance
Disk-wise, both A and B have the same size.
My plan is to set up Node A and Node B as master-eligible data nodes, and Node M as a master-eligible-only node (no data storage).
Because the two data nodes are NOT identical, what would be the implications?
I am going to make it a cluster of 3 machines only for the possibility of rolling upgrades (the current volume of data and expected growth for a few years can be managed with vertical scaling, and leaving the default number of shards and replicas would let me scale horizontally if there is a need).
There is absolutely no need for your machines to have the same specs. You will need 3 master-eligible nodes not just for rolling upgrades, but for high availability in general.
If you want to scale horizontally you can do so by either creating more indices to hold your data, or by configuring your index to have multiple primary and/or replica shards. Since version 7 the default for new indices is to be created with 1 primary and 1 replica shard. A single index like this does not really allow you to scale horizontally.
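For illustration, overriding those defaults at index-creation time might look like the request below (the index name and shard counts are arbitrary):
# Hypothetical index with 3 primary shards and 1 replica per primary.
curl -X PUT "localhost:9200/my-index" -H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}'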
Update:
With respect to load and shard allocation (where to put data), Elasticsearch by default will simply consider the amount of storage available. When you start up an instance of Elasticsearch, it introspects the hardware and configures its thread pools (number of threads & queue size) for various tasks accordingly. So the number of available threads to process tasks can vary. If I'm not mistaken, the coordinating node (the node receiving the external request) will distribute indexing/write requests in a round-robin fashion, not taking load into consideration. Depending on your version of Elasticsearch, this is different for search/read requests, where the coordinating node will leverage adaptive replica selection, taking into account the load/response time of the various replicas when distributing requests.
Besides this, sizing and scaling is too complex a topic to be answered comprehensively in a simple response. It typically also involves testing to figure out the limits/boundaries of a single node.
BTW: the default number of primary shards was changed in v7.x of Elasticsearch, as oversharding was one of the most common issues Elasticsearch users were facing. A “reasonable” shard size is in the tens of gigabytes.
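As a starting point for such testing, the actual shard sizes and how they are spread across the nodes can be inspected with the cat shards API:
# Show each shard, whether it is a primary or replica, its size on disk, and the node it lives on.
curl -s "localhost:9200/_cat/shards?v"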

Replica placement logic in Cassandra with multiple datacenters

When a write is performed with consistency EACH_QUORUM and a replication factor of 4 across two data centers DC1 and DC2, with replica placement 3 in DC1 and 1 in DC2, which class picks the nodes where the secondary and tertiary copies should reside? The snitch is GossipingPropertyFileSnitch and the strategy is NetworkTopologyStrategy. The client creates a new file using FileSystem.create and performs a write to it. The first copy will go to a node based on the token and row key hash. Where do the second and third copies go in DC1 and in DC2?
The consistency level does not have anything to do with the placement strategy. It is simply how many nodes must report back to the coordinator before success or failure is reported back to the client.
Each DC places copies according to its replication factor independently. So in DC2, the only copy will be stored according to the partitioning function. In DC1, replica placement is done according to this document: http://www.datastax.com/docs/1.0/cluster_architecture/replication#networktopologystrategy
The NetworkTopologyStrategy determines replica placement independently within each data center as follows: the first replica is placed according to the partitioner (same as with SimpleStrategy). Additional replicas are placed by walking the ring clockwise until a node in a different rack is found. If no such node exists, additional replicas are placed on different nodes in the same rack. NetworkTopologyStrategy attempts to place replicas on distinct racks because nodes in the same rack (or similar physical grouping) can fail at the same time due to power, cooling, or network issues.
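For reference, the replication layout described in the question corresponds to a keyspace definition along these lines (the keyspace name and datacenter names are illustrative):
# Hypothetical keyspace: 3 replicas in DC1 and 1 replica in DC2 under NetworkTopologyStrategy.
cqlsh -e "CREATE KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 1};"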

What are the possible reasons behind the imbalance of files stored on HDFS?

Sometimes, data blocks end up stored in an imbalanced way across the DataNodes. Under the HDFS block placement policy, the first replica is favored to be stored on the writer node (i.e. the client node), the second replica is stored on a remote rack, and the third one is stored on the local rack. What use cases make the data blocks unbalanced across the DataNodes under this placement policy? One possible reason that comes to mind is that if the writer nodes are few, one replica of every data block will be stored on those nodes. Are there any other reasons?
Here are some potential reasons for data skew:
If some of the DataNodes are unavailable for some time (not accepting requests/writes), the cluster can end up unbalanced.
TaskTrackers are not collocated with DataNodes evenly across cluster nodes. If we write data through MapReduce in this situation, the cluster can become unbalanced because the nodes hosting both a TaskTracker and a DataNode will be preferred as write targets.
Same as above, but with the RegionServers of HBase.
Large deletion of data can result in an unbalanced cluster depending on the location of the deleted blocks.
Adding new DataNodes will not automatically rebalance existing blocks across the cluster.
The "hdfs balancer" command allows admins to rebalance the cluster. Also, https://issues.apache.org/jira/browse/HDFS-1804 added a new block storage policy that takes into account free space left on the volume.

HDFS replication factor

When I'm uploading a file to HDFS, if I set the replication factor to 1, will the file's splits reside on one single machine, or will the splits be distributed to multiple machines across the network?
hadoop fs -D dfs.replication=1 -copyFromLocal file.txt /user/ablimit
According to Hadoop: The Definitive Guide:
Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack.
This logic makes sense as it decreases network chatter between the different nodes. But the book was published in 2009, and there have been a lot of changes in the Hadoop framework since.
I think it depends on whether the client is itself a Hadoop node or not. If the client is a Hadoop node, then all the splits will be on the same node. This doesn't provide any better read/write throughput despite having multiple nodes in the cluster. If the client is not a Hadoop node, then a node is chosen at random for each split, so the splits are spread across the nodes in the cluster, which does provide better read/write throughput.
One advantage of writing to multiple nodes is that even if one of the nodes goes down, a couple of splits might be lost, but the remaining splits can still be read.
If you set replication to 1, then the file will be present only on the client node, that is, the node from which you are uploading the file.
If your cluster is a single node, then when you upload a file it will be split according to the block size and it remains on that single machine.
If your cluster is multi-node, then when you upload a file it will be split according to the block size and distributed to different DataNodes in your cluster via the write pipeline; the NameNode decides where in the cluster the data should be placed.
The HDFS replication factor is used to make copies of the data, i.e. if your replication factor is 2, then all the data you upload to HDFS will have one extra copy.
If you set the replication factor to 1, that effectively means a single copy of the data, as on a single-node cluster (http://commandstech.com/replication-factor-in-hadoop/), where files are uploaded and then used on that single node / client node.
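For completeness, the replication factor of a file can also be inspected and changed after upload; the path below is just an example:
# The second column of the listing shows the file's current replication factor.
hdfs dfs -ls /user/ablimit/file.txt
# Raise the replication factor to 3 and wait for the extra replicas to be created.
hdfs dfs -setrep -w 3 /user/ablimit/file.txt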
