Replication factor

Replication factor - hadoop

I am new to Hadoop and I want to understand how do we determine the highest replication factor we can have for any given cluster. I know that the default setting is 3 replicas, but if I have a cluster with 5 node what is the highest replication factor that I can user in that case. Is there a formula that we have to follow to determine the replication factor?
Thank you

The highest replication factor that you can use is a function of the number of nodes in your cluster (as #Tarik said, you cannot have more replicas than nodes in your cluster), your expected usage (how much data do you plan to store) AND your cluster's storage capacity.
This other SO question has some calculations on capacity and storage use.

Obviously you cannot have more replicas than nodes as storing two copies on the same node is useless. It seems to me to be the upper limit.

In the Hadoop environment, the default replication factor is 3 for 3 slave machines or more than that. Here is a simple formula for that is 'N' Replication Factor = 'N' Slave Nodes. Here is more info about replication http://commandstech.com/replication-factor-in-hadoop/

Related

Hadoop Optimization Suggestion

Consider a scenario:
If I increase the replication factor of the data I have in HDFS ; suppose in a 10 node cluster I make the RF = 5 instead of 3(default), will it increase the performance of my data processing tasks?
Will the map phase complete sooner compared to the default replication setting?
Will there be any effect on the reduce phase?

Impact of Replication on Storage:
Replication factor has a huge impact on the storage of the cluster. It's obvious that: Larger the replication factor, lesser the number of files you can store in the cluster.
If replication factor is 5, then for every 1 GB of data ingested into cluster, you will need 5 GB of storage space and you will quickly run out of space in the cluster.
Since NameNode stores all the meta information in memory, it will quickly run of space to store the meta data. Hence, your NameNode will have to be allocated more memory (check HADOOP_NAMENODE_OPTS).
Data copy operation will take more time, since data copy is daisy-chained across Data Nodes. Instead of 3 Data Nodes, now 5 Data Nodes will have to confirm data storage, before a write/append is committed
Impact of Replication on Computation:
Mapper:
With a higher replication factor, there are more options to schedule a mapper. With a replication factor of 3, you can schedule a mapper on 3 different nodes. But, with a factor of 5, you will have 5 choices
You may be able to achieve better data locality, with increase in the replication factor. Each of the mapper could get scheduled on the same node where the data is present (since now there are 5 choices compared to the default 3), thus improving the performance.
Since there is a better data locality, lesser number of mappers will copy off-node or off-rack data
Due to these reasons, its possible that, with a higher replication factor, the mappers could complete earlier than with a lower replication factor.
Since typically the number of mappers are always higher than the number of reducers, you may see an overall improvement in your job performance.
Reducer:
Since the output of the reducer directly gets written into HDFS, its possible that your reducers will take more time to execute, with a higher replication factor.
Overall, your mappers may execute faster with a higher replication factor. But, actual performance improvement depends on various factors like, the size of your cluster, bandwidth, NameNode memory etc.
After answering this question, I came across another similar question in SO here: Map Job Performance on cluster. This also contains some more information, with links to various research papers.

Setting the replication factor to 5 will cause the HDFS namenode to maintain 5 total copies of the file blocks on the available datanodes in the cluster. This copy operation performed by the namenode will result in higher network bandwidth usage depending on the size of the files to be replicated and the speed of your network.
The replication factor has no direct effect in the either the map or reduce phase. You may see a performance hit initially while blocks are being replicated while running a map-reduce job - this could cause significant network latency depending on the size of the files and your network bandwidth.
A replication factor of 5 across your cluster means that 4 of your data nodes can disappear from your cluster, and you'll still have enough nodes to access to all files in HDFS with no file corruption or missing blocks. If your RF = 4 then you can loose 3 servers and still have access to all files in HDFS.
Setting a higher replication factor increases your overall HDFS usage so if your total data size is 1TB a RF=3 means your HDFS usage will be 3TB since the chopped up blocks are duplicated n-1 (3-1 = 2) times across the cluster.

Hadoop - Maintaining the replication factor after failure and recovery

Say a data node goes down. The replication factor has been configured to be 2.
Would the namenode try to maintain the replication factor, and copy over the lost data blocks over to another machine?
In case the above is true, then say the same data node comes back online. Would the namenode then delete the extra data blocks, because now the replication factor would be 3

Yes, namenode will try to maintain the replication factor.
Over-replicated blocks will be randomly removed from the nodes. See this FAQ