Replication Factor in Hadoop - hadoop

I have a data of 5 TB and actual size of the whole size of combined cluster is 7 TB and I have set the Replication factor to 2.
In this case how it will replicate the data?
Due to the Replication factor the minimum size of the storage on the cluster(Nodes) should be always double the size of the Data,Do you think this is a drawback in Hadoop?

If your minimum size of storage on the cluster is not double the size of your data, then you will end up having under-replicated blocks. Under-replicated block are those which are replicated < replication factor, so if you're replication factor is 2, you will have blocks will have replication factor of 1.
And replicating data is not a drawback of Hadoop at all, in fact it is an integral part of what makes Hadoop effective. Not only does it provide you with a good degree of fault tolerance, but it also helps in running your map tasks close to the data to avoid putting extra load on the network (read about data locality).
Consider that one of the nodes in your cluster goes down. That node would have some data stored in it and if you do not replicate your data, then a part of your data will not be available due to the node failure. However, if your data is replicated, the data which was on the node which went down will still be accessible to you from other nodes.
If you do not feel the need to replicate your data, you can always set your replication factor = 1.

Replication of the data is not a drawback of Hadoop -- it's the factor that increases the efficiency of Hadoop (HDFS). Replication of data to a larger number of slave nodes provides high availability and good fault tolerance to the cluster. If we consider the losses incurred by the client due to downtime of nodes in the cluster (typically will be in millions of $), the cost spent for buying the extra storage facility required for replication of data is much less. So the replication of data is justified.

This is the case of under replication. Assume you have 5 blocks. HDFS was able to create the replicas only for first 3 blocks because of space constraint. Now the other two blocks are under replicated. When the HDFS finds sufficient space, it will try to replicate the 2 blocks also.

Related

Hadoop Optimization Suggestion

Consider a scenario:
If I increase the replication factor of the data I have in HDFS ; suppose in a 10 node cluster I make the RF = 5 instead of 3(default), will it increase the performance of my data processing tasks?
Will the map phase complete sooner compared to the default replication setting?
Will there be any effect on the reduce phase?
Impact of Replication on Storage:
Replication factor has a huge impact on the storage of the cluster. It's obvious that: Larger the replication factor, lesser the number of files you can store in the cluster.
If replication factor is 5, then for every 1 GB of data ingested into cluster, you will need 5 GB of storage space and you will quickly run out of space in the cluster.
Since NameNode stores all the meta information in memory, it will quickly run of space to store the meta data. Hence, your NameNode will have to be allocated more memory (check HADOOP_NAMENODE_OPTS).
Data copy operation will take more time, since data copy is daisy-chained across Data Nodes. Instead of 3 Data Nodes, now 5 Data Nodes will have to confirm data storage, before a write/append is committed
Impact of Replication on Computation:
Mapper:
With a higher replication factor, there are more options to schedule a mapper. With a replication factor of 3, you can schedule a mapper on 3 different nodes. But, with a factor of 5, you will have 5 choices
You may be able to achieve better data locality, with increase in the replication factor. Each of the mapper could get scheduled on the same node where the data is present (since now there are 5 choices compared to the default 3), thus improving the performance.
Since there is a better data locality, lesser number of mappers will copy off-node or off-rack data
Due to these reasons, its possible that, with a higher replication factor, the mappers could complete earlier than with a lower replication factor.
Since typically the number of mappers are always higher than the number of reducers, you may see an overall improvement in your job performance.
Reducer:
Since the output of the reducer directly gets written into HDFS, its possible that your reducers will take more time to execute, with a higher replication factor.
Overall, your mappers may execute faster with a higher replication factor. But, actual performance improvement depends on various factors like, the size of your cluster, bandwidth, NameNode memory etc.
After answering this question, I came across another similar question in SO here: Map Job Performance on cluster. This also contains some more information, with links to various research papers.
Setting the replication factor to 5 will cause the HDFS namenode to maintain 5 total copies of the file blocks on the available datanodes in the cluster. This copy operation performed by the namenode will result in higher network bandwidth usage depending on the size of the files to be replicated and the speed of your network.
The replication factor has no direct effect in the either the map or reduce phase. You may see a performance hit initially while blocks are being replicated while running a map-reduce job - this could cause significant network latency depending on the size of the files and your network bandwidth.
A replication factor of 5 across your cluster means that 4 of your data nodes can disappear from your cluster, and you'll still have enough nodes to access to all files in HDFS with no file corruption or missing blocks. If your RF = 4 then you can loose 3 servers and still have access to all files in HDFS.
Setting a higher replication factor increases your overall HDFS usage so if your total data size is 1TB a RF=3 means your HDFS usage will be 3TB since the chopped up blocks are duplicated n-1 (3-1 = 2) times across the cluster.

Is it faster to replicate your data in hdfs for all your nodes?

If I have 6 data nodes, is it faster to turn replication to 6 so all the data is replicated across all my nodes so the cluster can split up queries (say in hive) without having to move data around? I believe that if you have a replication of 3 and you put a 300GB file into HDFS, it splits it just across 3 of the data nodes and then when the 6 nodes need to be used for a query it has to move data around to the other 3 nodes that the data doesn't exist on, causing slower responses.. is that accurate?
I understand your means, you are talking about the data-locality. Generally speaking, the data-locality can reduce the run time, because it can save the time that block transmission by network. But in fact, if you don't open the "HDFS Short-Circuit Local Reads"(default it is off, please visit here), the MapTask will also read the block by the TCP protocol, it means by network, even if block and MapTask both on the same node.
Recently, I optimize hadoop and HDFS, we use SSD to instead the HDD disk, but we found the effect is not good and time is not shorter.Because the disk is not the bottleneck and network load is not heavy. According to the result, we conclude the cpu is very heavy. If you want you know the hadoop cluster situation clearly, I advise you to use ganglia to monitoring the cluster, it can help you to analysis your cluster bottleneck.please see here.
At last, hadoop is a very large and complicated system, the disk performance, cpu performance, network bandwidth, parameters values and also, there are many factor to consider. If you want to save time, you have much work to do, not just the replication factor.

Hadoop - Maintaining the replication factor after failure and recovery

Say a data node goes down. The replication factor has been configured to be 2.
Would the namenode try to maintain the replication factor, and copy over the lost data blocks over to another machine?
In case the above is true, then say the same data node comes back online. Would the namenode then delete the extra data blocks, because now the replication factor would be 3
Yes, namenode will try to maintain the replication factor.
Over-replicated blocks will be randomly removed from the nodes. See this FAQ

Replication factor

I am new to Hadoop and I want to understand how do we determine the highest replication factor we can have for any given cluster. I know that the default setting is 3 replicas, but if I have a cluster with 5 node what is the highest replication factor that I can user in that case. Is there a formula that we have to follow to determine the replication factor?
Thank you
The highest replication factor that you can use is a function of the number of nodes in your cluster (as #Tarik said, you cannot have more replicas than nodes in your cluster), your expected usage (how much data do you plan to store) AND your cluster's storage capacity.
This other SO question has some calculations on capacity and storage use.
Obviously you cannot have more replicas than nodes as storing two copies on the same node is useless. It seems to me to be the upper limit.
In the Hadoop environment, the default replication factor is 3 for 3 slave machines or more than that. Here is a simple formula for that is 'N' Replication Factor = 'N' Slave Nodes. Here is more info about replication http://commandstech.com/replication-factor-in-hadoop/

How does Hadoop dfs.replicate work?

I have a 2 node hadoop (1 is the master/slave and another slave) setup and 4 input files each of size 1GB.
When i set dfs.replicate to 2, then the entire data is copied over to both the nodes which is understandable. But my question is that, how do i see an improved performance (almost twice as better) over a single node setup since in the 2 node case, map-reduce will still run over the complete data set on both the systems along with the added overhead of channeling the inputs from 2 mappers to reducers.
Also when i set the replication as 1, the entire data exists only on the master node which is also understandable to avoid ethernet overhead. But even in this case, i see a performance improvement compared to single node setup which i find confusing, since map-reduce runs on local data sets, this scenario should essentially be similar to single node setup with one map-reduce program running on master node on the entire data set ??
Can someone help me understand what i am missing here ???
Thanks
Pawan
Pawan,
In the two node case the map reduce job will not run on entire dataset. MapReduce operates in HDFS blocks which will be of size 64 MB or more based on your configuration. Your 1 GB is split into blocks and distributed on the cluster nodes. some of these blocks are processed on node 1 and the other on node 2 but no duplications. The replication factor only increases the availability of data and more tolerance towards node failures. It will not duplicate the tasks.
resultantly what's happening is, from the processing perspective the data is split between the node 1 and node 2 and being processed. Which means, if your are utilizing your processing power fully and rightly, your are doubling your speed theoritically.
Cheers
Rags

Resources