Hadoop Optimization Suggestion

Consider a scenario:
If I increase the replication factor of the data I have in HDFS; suppose in a 10-node cluster I make the RF = 5 instead of the default 3, will it increase the performance of my data processing tasks?
Will the map phase complete sooner compared to the default replication setting?
Will there be any effect on the reduce phase?

Impact of Replication on Storage:
Replication factor has a huge impact on the storage of the cluster: the larger the replication factor, the less data you can store in the cluster.
If the replication factor is 5, then for every 1 GB of data ingested into the cluster you will need 5 GB of storage space, so you will run out of space in the cluster much sooner.
Since the NameNode keeps all the metadata (including the block locations of every replica) in memory, a higher replication factor also increases its memory footprint. Hence, your NameNode may have to be allocated more memory (check HADOOP_NAMENODE_OPTS).
Writes will take more time, since the data copy is daisy-chained across DataNodes in the write pipeline. Instead of 3 DataNodes, now 5 DataNodes will have to confirm storage of each block before a write/append is committed.
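For concreteness, here is a minimal sketch of the two knobs mentioned above, assuming a standard Hadoop 2.x layout; the heap size and the path are placeholders, not recommendations:

    # Raise the replication factor of data already in HDFS (-w waits until re-replication finishes)
    hdfs dfs -setrep -w 5 /user/data

    # hadoop-env.sh: give the NameNode extra heap for the additional block-location metadata
    export HADOOP_NAMENODE_OPTS="-Xmx4g ${HADOOP_NAMENODE_OPTS}"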
Impact of Replication on Computation:
Mapper:
With a higher replication factor, the scheduler has more options for placing a mapper. With a replication factor of 3, a map task can be scheduled data-locally on 3 different nodes; with a factor of 5, it has 5 choices.
You may be able to achieve better data locality with an increase in the replication factor. Each mapper is more likely to get scheduled on a node where its data is present (since there are now 5 candidate nodes compared to the default 3), thus improving performance.
With better data locality, fewer mappers will have to read off-node or off-rack data.
For these reasons, it is possible that with a higher replication factor the mappers complete earlier than with a lower one.
Since the number of mappers is typically much higher than the number of reducers, you may see an overall improvement in your job performance.
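One way to check whether the extra replicas actually translate into locality is to compare a job's locality counters under both replication settings. A rough sketch, assuming a Hadoop 2.x mapred CLI (the job id is a placeholder; older releases use the hadoop job command and a different counter group name):

    # Map tasks that ran on a node holding their input block
    mapred job -counter job_1400000000000_0001 org.apache.hadoop.mapreduce.JobCounter DATA_LOCAL_MAPS
    # Map tasks that only achieved rack-level locality
    mapred job -counter job_1400000000000_0001 org.apache.hadoop.mapreduce.JobCounter RACK_LOCAL_MAPS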
Reducer:
Since the output of the reducers is written directly into HDFS, it is possible that your reducers will take more time to execute with a higher replication factor, because every output block now has to be written to more DataNodes.
Overall, your mappers may execute faster with a higher replication factor, but the actual performance improvement depends on various factors such as the size of your cluster, network bandwidth, NameNode memory, etc.
After answering this question, I came across another similar question on SO: Map Job Performance on cluster (see below). It contains some more information, with links to various research papers.

Setting the replication factor to 5 will cause HDFS to maintain 5 total copies of each file's blocks on the available DataNodes in the cluster. The NameNode only coordinates this; the copying itself happens between DataNodes and will result in higher network bandwidth usage, depending on the size of the files to be replicated and the speed of your network.
The replication factor has no direct effect on either the map or reduce phase. You may see a performance hit if blocks are still being re-replicated while a MapReduce job is running; this could cause significant network latency, depending on the size of the files and your network bandwidth.
A replication factor of 5 across your cluster means that up to 4 of your DataNodes can disappear from the cluster and you will still have access to all files in HDFS with no file corruption or missing blocks. With RF = 4 you can lose 3 servers and still have access to all files in HDFS.
Setting a higher replication factor increases your overall HDFS usage: if your total data size is 1 TB, RF = 3 means your HDFS usage will be 3 TB, since each block is duplicated n-1 (3-1 = 2) additional times across the cluster.
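If you want to see this overhead on a live cluster, the standard hdfs commands below give a quick picture (the path is a placeholder; on recent Hadoop releases the second column of -du already includes the replication overhead):

    # Logical size vs. raw disk space consumed across all replicas
    hdfs dfs -du -s -h /user/data
    # Cluster-wide capacity, DFS used/remaining, and per-DataNode breakdown
    hdfs dfsadmin -report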

Related

Hadoop, uneven load between machines

I have a cluster of 4 machines that I need to run a benchmark against.
I decide to use Terasort to benchmark.
However, when I run the benchmark, only one of the four machines is under load, while the other three are completely idle.
If I run the test another time, a different machine is fully loaded while the other three are idle.
When I create the dataset with Teragen, everything works just fine; the load is evenly distributed across all four machines.
What can be wrong in this configuration?
Thanks
I hope your cluster is laid out properly as 4 nodes (1 NameNode, 1 Secondary NameNode, 2 DataNodes).
The process starts with the NameNode, and the JobTracker schedules the job's tasks on the TaskTrackers that hold the data blocks.
The usage of the DataNodes depends on a few factors, such as the replication factor, the number of mappers, and the number of blocks.
If there are many blocks, they will be placed fairly evenly across all the DataNodes of your cluster. If the replication factor is 2, the blocks will be available on both DataNodes, so both can run the mappers that deal with those blocks.
If a file has two blocks, two mappers can run simultaneously on the DataNodes and utilize the resources properly.
In your case, it seems the block size is the problem. Try reducing it so that there are at least 2 blocks; utilization will then be better, and so will the performance.
Hadoop can be tuned to your needs with the settings below.
dfs.replication in hdfs-site.xml
dfs.block.size in hdfs-site.xml
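As a sketch, an hdfs-site.xml fragment with both settings might look like the following; the values are only examples, and dfs.block.size is the legacy property name (dfs.blocksize on newer releases). Note that the block size only applies to files written after the change, so you would need to regenerate the Teragen data:

    <!-- hdfs-site.xml (illustrative values) -->
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>
    <property>
      <name>dfs.block.size</name>
      <value>67108864</value> <!-- 64 MB, so the input spans enough blocks to occupy both DataNodes -->
    </property>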
Good luck !!!

Is it faster to replicate your data in hdfs for all your nodes?

If I have 6 data nodes, is it faster to set replication to 6 so all the data is replicated across all my nodes, so the cluster can split up queries (say in Hive) without having to move data around? I believe that if you have a replication of 3 and you put a 300 GB file into HDFS, it splits it across just 3 of the data nodes, and then when all 6 nodes are needed for a query, the data has to be moved to the other 3 nodes where it doesn't exist, causing slower responses. Is that accurate?
I understand what you mean; you are talking about data locality. Generally speaking, data locality can reduce run time, because it saves the time spent transmitting blocks over the network. But in fact, if you don't enable "HDFS Short-Circuit Local Reads" (it is off by default, please see here), a map task will still read its block over the TCP protocol, that is, over the network stack, even if the block and the map task are on the same node.
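For reference, enabling short-circuit local reads is an hdfs-site.xml change on both the DataNodes and the clients; a rough sketch, assuming the native libhadoop library is installed (the socket path is the commonly used example, not a requirement):

    <property>
      <name>dfs.client.read.shortcircuit</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.domain.socket.path</name>
      <value>/var/lib/hadoop-hdfs/dn_socket</value>
    </property>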
Recently, I optimized Hadoop and HDFS: we used SSDs instead of HDD disks, but we found the effect was not good and the run time was not shorter, because the disk was not the bottleneck and the network load was not heavy. From the results we concluded that the CPU load was very heavy. If you want to understand your Hadoop cluster's situation clearly, I advise you to use Ganglia to monitor the cluster; it can help you analyze your cluster's bottleneck. Please see here.
Finally, Hadoop is a very large and complicated system; disk performance, CPU performance, network bandwidth, and parameter values are all factors to consider, among many others. If you want to save time, you have a lot of work to do beyond just the replication factor.
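If Ganglia is not available, a first-pass check with ordinary Linux tools on each node while a job runs can already tell you whether disk, CPU, or the network is the bottleneck (package names vary by distribution):

    iostat -x 5      # per-disk utilization and wait times
    top              # CPU saturation and iowait
    sar -n DEV 5     # per-interface network throughput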

Map Job Performance on cluster

Suppose I have 15 blocks of data and two clusters. The first cluster has 5 nodes and a replication factor of 1, while the second one has a replication factor of 3. If I run my map job, should I expect any change in the performance or the execution time of the map job?
In other words, how does replication affect the performance of the mapper on a cluster?
When the JobTracker assigns tasks to TaskTrackers, a task is assigned to a particular node based on the locality of its data (the preference is the same node first, then the same network switch/rack). With a lower replication factor, you limit the JobTracker's ability to pick a node that is local to the data (it will still assign the task to some node, but without the benefits of locality). The effect is to restrict the number of TaskTracker nodes that are local to the data (either the data is on the task's node, or on the same switch/rack), thus affecting performance for your task's work (reducing parallelization).
Your smaller cluster likely has a single switch, so the data is local to the network/rack, and the only bottleneck you might experience would be data transfer from one TaskTracker to another, as the JobTracker is likely to assign tasks to all available TaskTrackers.
But with a larger Hadoop cluster, a replication factor of 1 would limit the number of TaskTracker nodes local to the data and thus able to operate efficiently on your data.
There are several papers which support data locality: http://web.eecs.umich.edu/~michjc/papers/tandon_hpdic_minimizeRemoteAccess.pdf; the paper you cited, http://assured-cloud-computing.illinois.edu/sites/default/files/PID1974767.pdf, also supports data locality; and so does http://www.eng.auburn.edu/~xqin/pubs/hcw10.pdf (which tested a 5-node cluster, same as the OP).
This paper quotes significant benefits to data locality, http://grids.ucs.indiana.edu/ptliupages/publications/InvestigationDataLocalityInMapReduce_CCGrid12_Submitted.pdf, and observes that an increase in replication factor gives better locality.
Note that this paper claims little difference between network throughput and local disk access (8%), http://www.cs.berkeley.edu/~ganesha/disk-irrelevant_hotos2011.pdf, but reports orders of magnitude difference in performance between local memory access and either disk or network access. Furthermore, the paper notes that a large fraction of jobs (64%) find their data cached in memory, "in large part due to the heavy-tailed nature of the workload", as most jobs "access only a small fraction of the blocks".
EDIT: This part of my answer is obsolete now that the other answer was edited: "The other answer is not entirely correct." This was meant to address the incorrect implication that fewer replicas = less parallelism. The rest of my answer (below) still applies.
Any node can execute your tasks, regardless of whether the data is located on that node or not. Hadoop will try to achieve data locality (the preference order is: node-local, then rack-local, then any node), but if it can't, it will choose any node that has available compute capacity to run your task.
Performance-wise, in a typical multi-rack installation, rack-local performs almost as well as node-local, since the bottleneck occurs when transmitting data across racks. However, with high-end networking equipment (i.e., full-bisection bandwidth), it wouldn't matter whether your computation is rack-local or not. For more details on this, read this paper.
How much performance improvement can you expect from having more replicas (and thus achieving higher data locality)? Not much; a 5-20% improvement at most. And that is an upper limit, reached only when you implement additional popularity-based replication as in this and this project. NOTE: I did not just make up those performance-improvement numbers; they come from the papers I linked.
Since vanilla Hadoop does not have these mechanisms in place, I would expect your performance to improve by at most 1-5%. This is just a ballpark guess, but you can easily run some tests yourself. The reason for this is that more replicas could improve the performance of some of your map tasks (the ones that are now able to run with a data-local copy of the block), but it would not improve your shuffle and reduce phases. Furthermore, even if just one mapper takes longer than the rest, that one will determine the length of your whole map phase; so for many jobs, it is likely that increasing locality will not improve their running times at all. Finally, I/O-bound jobs can be map-input IO bound, shuffle IO bound (map-output heavy), or reduce-output IO bound. Only the first type (map-input IO bound) would benefit from locality. More details on MapReduce workload characterization in this paper.
If you are further interested in this, you can also read this paper, in which they improve the running times of mappers by having the input data of ALL the mappers in memory.

Replication Factor in Hadoop

I have 5 TB of data, the total capacity of the combined cluster is 7 TB, and I have set the replication factor to 2.
In this case, how will it replicate the data?
Because of the replication factor, the minimum storage on the cluster (nodes) should always be at least double the size of the data. Do you think this is a drawback of Hadoop?
If the storage on your cluster is not at least double the size of your data, you will end up with under-replicated blocks. Under-replicated blocks are those with fewer replicas than the target replication factor, so if your replication factor is 2, some of your blocks will effectively have a replication factor of 1.
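You can see the under-replicated blocks directly with fsck; a quick check looks like this (the path is a placeholder):

    # Summarized health report, including the count of under-replicated blocks
    hdfs fsck /user/data
    # Per-file detail showing each block and the number of replicas it actually has
    hdfs fsck /user/data -files -blocks -locations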
And replicating data is not a drawback of Hadoop at all, in fact it is an integral part of what makes Hadoop effective. Not only does it provide you with a good degree of fault tolerance, but it also helps in running your map tasks close to the data to avoid putting extra load on the network (read about data locality).
Consider that one of the nodes in your cluster goes down. That node would have some data stored in it and if you do not replicate your data, then a part of your data will not be available due to the node failure. However, if your data is replicated, the data which was on the node which went down will still be accessible to you from other nodes.
If you do not feel the need to replicate your data, you can always set your replication factor = 1.
Replication of the data is not a drawback of Hadoop; it is a factor that increases the effectiveness of Hadoop (HDFS). Replicating data across a larger number of slave nodes provides high availability and good fault tolerance to the cluster. If we consider the losses a client incurs due to downtime of nodes in the cluster (which can run into millions of dollars), the cost of the extra storage required for replication is much smaller. So the replication of data is justified.
This is a case of under-replication. Assume you have 5 blocks, and HDFS was able to create the second replica only for the first 3 blocks because of the space constraint. The other two blocks are then under-replicated. When HDFS finds sufficient space, it will try to replicate those 2 blocks as well.

How does Hadoop dfs.replicate work?

I have a 2-node Hadoop setup (one node is both master and slave, the other is a slave only) and 4 input files, each 1 GB in size.
When I set dfs.replication to 2, the entire data set is copied to both nodes, which is understandable. But my question is: how do I see an improved performance (almost twice as good) over a single-node setup, since in the 2-node case MapReduce will still run over the complete data set on both systems, along with the added overhead of channeling the inputs from the 2 nodes' mappers to the reducers?
Also, when I set the replication to 1, the entire data set exists only on the master node, which is also understandable, to avoid the Ethernet overhead. But even in this case I see a performance improvement compared to the single-node setup, which I find confusing: since MapReduce runs on local data sets, this scenario should essentially be similar to the single-node setup, with one MapReduce program running on the master node over the entire data set, shouldn't it?
Can someone help me understand what I am missing here?
Thanks
Pawan
Pawan,
In the two-node case, the MapReduce job will not run over the entire dataset on each node. MapReduce operates on HDFS blocks, which will be 64 MB or larger depending on your configuration. Each 1 GB file is split into blocks and distributed across the cluster nodes; some of these blocks are processed on node 1 and the others on node 2, with no duplication of work. The replication factor only increases the availability of the data and the tolerance to node failures; it does not duplicate the tasks.
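A back-of-the-envelope view of that split, assuming the default 64 MB block size of that era and the default one-mapper-per-block input format:

    4 files x 1 GB each, block size 64 MB:
      blocks per file  = 1024 MB / 64 MB = 16
      total blocks     = 4 x 16          = 64
      map tasks        ~ 64  (one per block)
      blocks per node  ~ 32  (work is split roughly evenly, regardless of how many replicas exist)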
As a result, from the processing perspective, the data is split between node 1 and node 2 and processed in parallel. This means that if you are utilizing your processing power fully and correctly, you are theoretically doubling your speed.
Cheers
Rags

Resources