I found a similar question:
Hadoop HDFS is not distributing blocks of data evenly
but my question is about the case where the replication factor = 1.
I still want to understand why HDFS does not distribute a file's blocks evenly across the cluster nodes. This results in data skew from the start when I load the file and run DataFrame operations on it. Am I missing something?
Even if the replication factor is one, files are still split and stored in multiples of the HDFS block size. Block placement is best effort, AFAIK, not perfectly balanced; with a replication factor of 3, the default policy places the first replica on the writer's node (or a random node if the client is outside the cluster), the second replica on a node on a different rack, and the third on another node on that same remote rack.
You'll need to clarify how large your files are and where you are looking to see if data is being split
Note: not all file formats are splittable
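If you want to see where the blocks of a particular file actually landed, fsck can list every block together with the DataNodes holding its replicas. A minimal sketch, assuming a hypothetical path /user/me/data.csv:

# List each block of the file and the DataNodes that hold its replicas
hdfs fsck /user/me/data.csv -files -blocks -locations

If most blocks report the same DataNode, the skew you are seeing comes from placement (e.g. writing from one of the DataNodes), not from how the file was split.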
For example, I wrote a file into HDFS using replication factor 2. The node I was writing to now has all the blocks of the file, and the other copies of those blocks are scattered across the remaining nodes in the cluster. That's the default HDFS policy.
What exactly happens if I lower the replication factor of the file to 1?
How does HDFS decide which blocks to delete from which nodes? I hope it tries to delete blocks from the nodes that hold the most blocks of the file?
Why I'm asking: if it does, it would make sense, because it would make processing the file easier. If there is only one copy of each block and all the blocks are located on the same node, it would be harder to process the file using MapReduce because of the data transfer to other nodes in the cluster.
When a block becomes over-replicated, the name node chooses a replica to remove. The name node will prefer not to reduce the number of racks that host replicas, and secondly prefer to remove a replica from the data node with the least amount of available disk space. This may help rebalancing the load over the cluster.
Source: The Architecture of Open Source Applications
HDFS removes over-replicated blocks from different nodes following the rules above, which keeps the data spread out; the excess replicas are not simply removed from the current node.
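To see this in action, you can lower the replication factor of an existing file and then check which replica of each block survived. A minimal sketch with a hypothetical path:

# Drop the file's replication factor to 1; -w waits until the excess replicas have actually been removed
hdfs dfs -setrep -w 1 /user/me/file.txt

# Show which DataNode still holds the remaining replica of each block
hdfs fsck /user/me/file.txt -files -blocks -locations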
How does HDFS determine which data block is stored on which node? There must be some algorithm for choosing the data nodes for data blocks; I would like to learn about it.
HDFS replica placement is rack-aware, i.e. it will try to place replicas on different racks to allow for better reliability. There is also work to allow HDFS to run with multi-tiered storage and in virtualized environments, and these will also affect the placement algorithm.
You can read about the current replica placement policy in the Hadoop architecture guide.
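To see what rack information the NameNode is actually working with on your cluster, you can print its view of the topology. A minimal sketch; without a topology script configured, every DataNode shows up under /default-rack:

# Print the rack each registered DataNode has been assigned to
hdfs dfsadmin -printTopology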
I am trying to estimate the resources based on the data size.
Is there a thumb rule for deciding the number of required data nodes based on the data size?
No, not really. Usually in a typical Hadoop cluster, there is one DataNode per node.
Sorry for the short answer, but that's it! :)
Just keep in mind that Hadoop prefers to deal with a small number of huge files.
Keep in mind that data (by default) is replicated 3 times (original copy + 2 more). That is, if you have 15TB of data, you will need at least 45TB of disk space to accommodate the replicas.
Replicas cannot be on the same node, so you would need at least 3 Datanodes with 15TB of storage, assuming default configuration.
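As a rough back-of-envelope sketch of that calculation (the disk size per node below is an assumption, not a recommendation):

# 15 TB of data, default 3x replication, assumed 12 TB of usable disk per DataNode
DATA_TB=15
REPLICATION=3
PER_NODE_TB=12
RAW_TB=$((DATA_TB * REPLICATION))                      # 45 TB of raw HDFS capacity needed
NODES=$(( (RAW_TB + PER_NODE_TB - 1) / PER_NODE_TB ))  # round up -> 4 DataNodes
echo "Raw capacity: ${RAW_TB} TB, DataNodes needed: at least ${NODES}"

In practice you would leave extra headroom for intermediate data and OS overhead, and with 3x replication you need at least 3 DataNodes no matter how large their disks are.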
When I'm uploading a file to HDFS, if I set the replication factor to 1, will the file splits reside on one single machine, or will the splits be distributed to multiple machines across the network?
hadoop fs -D dfs.replication=1 -copyFromLocal file.txt /user/ablimit
According to Hadoop: The Definitive Guide:
Hadoop's default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack.
This logic makes sense as it decreases the network chatter between the different nodes. But the book was published in 2009, and there have been a lot of changes in the Hadoop framework since then.
I think it depends on whether the client is the same as a Hadoop node or not. If the client is a Hadoop node, then all the splits will be on that same node. This doesn't provide any better read/write throughput in spite of having multiple nodes in the cluster. If the client is not a Hadoop node, then a node is chosen at random for each split, so the splits are spread across the nodes in the cluster, which provides better read/write throughput.
One advantage of writing to multiple nodes is that even if one of the nodes goes down, a couple of splits might be lost, but at least some data can be recovered from the remaining splits.
If you set the replication to 1, then the file will be present only on the client node, that is, the node from which you are uploading the file.
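One way to check this behaviour on your own cluster is to run the upload from a DataNode and from a machine outside the cluster, then compare where the blocks land. A minimal sketch reusing the command from the question; the paths are just examples:

# Upload with a single replica per block
hadoop fs -D dfs.replication=1 -copyFromLocal file.txt /user/ablimit/file.txt

# Inspect the block locations: run from a DataNode, every block should report that node;
# run from an outside client, the blocks should be spread across different DataNodes
hdfs fsck /user/ablimit/file.txt -blocks -locations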
If your cluster is a single node, then when you upload a file it will be split according to the block size and it all remains on that single machine.
If your cluster is multi-node, then when you upload a file it will be split according to the block size and distributed to different DataNodes in your cluster via the write pipeline; the NameNode decides where in the cluster the data should go.
The HDFS replication factor is used to make copies of the data, i.e. if your replication factor is 2 then every block you upload to HDFS will have one extra copy.
If you set the replication factor to 1, only a single copy of each block is kept, as you would typically do on a single-node cluster (see http://commandstech.com/replication-factor-in-hadoop/), where the files you upload live only on that single node or client node.
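Both the block size and the replication factor can be overridden per upload, which is an easy way to experiment with how a file gets split and copied. A minimal sketch using the Hadoop 2.x+ property names (dfs.blocksize, dfs.replication) and an example path:

# Upload with a 128 MB block size (given in bytes) and 2 replicas of every block
hadoop fs -D dfs.blocksize=134217728 -D dfs.replication=2 -put bigfile.dat /user/me/bigfile.dat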
If I copy data from the local system to HDFS, can I be sure that it is distributed evenly across the nodes?
PS: HDFS guarantees that each block will be stored on 3 different nodes. But does this mean that all blocks of my files will be stored on the same 3 nodes? Or will HDFS select nodes at random for each new block?
If your replication is set to 3, each block will be put on 3 separate nodes. The number of nodes a block is placed on is controlled by your replication factor. If you want greater distribution, you can increase the replication number by editing $HADOOP_HOME/conf/hadoop-site.xml and changing the dfs.replication value.
I believe new blocks are placed almost randomly. There is some consideration for distribution across different racks (when Hadoop is made aware of racks). There is an example (can't find the link) that if you have replication of 3 and 2 racks, 2 replicas will be placed in one rack and the third replica in the other rack. I would guess that there is no preference shown for which node within the rack gets the replicas.
I haven't seen anything indicating or stating a preference to store blocks of the same file on the same nodes.
If you are looking for ways to force data to be balanced across nodes (with replication at whatever value), a simple option is $HADOOP_HOME/bin/start-balancer.sh, which will run a balancing process to move blocks around the cluster automatically.
This and a few other balancing options can be found in the Hadoop FAQs.
Hope that helps.
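If you do run the balancer by hand, it also takes a threshold: the allowed deviation, in percent, of each DataNode's utilization from the cluster average. A minimal sketch, with 5 chosen only as an example value:

# Move blocks around until every DataNode is within 5% of the average utilization
hdfs balancer -threshold 5
# Or via the wrapper script mentioned above
$HADOOP_HOME/bin/start-balancer.sh -threshold 5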
You can open the HDFS Web UI on port 50070 of your NameNode. It will show you information about the DataNodes, including the used space per node.
If you do not have the UI, you can look at the space used in the HDFS data directories on each DataNode.
If you have data skew, you can run the rebalancer, which will resolve it gradually.
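If you prefer the command line to the Web UI, the same per-node usage figures are available from dfsadmin:

# Capacity, DFS used and DFS remaining, reported per DataNode
hdfs dfsadmin -report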
Now, with the HDFS-385 patch, we can choose the block placement policy, so as to place all blocks of a file on the same node (and similarly for the replicated nodes). Read this blog post on the topic; see the comments section in particular.
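The placement policy is taken from the NameNode configuration, so you can check which policy class is currently in effect. A minimal sketch, assuming the dfs.block.replicator.classname property that recent Hadoop versions use for the pluggable policy:

# Print the block placement policy class the NameNode is configured with
hdfs getconf -confKey dfs.block.replicator.classname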
Yes, Hadoop distributes data per block, so each block would be distributed separately.