How many mappers will run? - hadoop

I have this query.
Let's say I have a cluster of 3 nodes, each running a DataNode and a NodeManager, with a replication factor of 3. On the first node we have 4 blocks, so by default 4 mappers will run in parallel on that node.
Then, since the replication factor is 3, will we have 12 mappers running at the start?

The number of blocks depends on the file size. A 1 GB file makes 8 blocks of 128 MB each.
All 8 blocks are replicated three times, following data locality and rack awareness, but that does not mean all 24 (8 x 3) block copies are processed when you run a job against this file. Replication exists to recover from scenarios such as disk failures.
So to answer your questions:
Number of mappers = number of input splits (in most cases, the number of blocks).
Only 8 mappers will run on the cluster. Hadoop decides which node each mapper runs on based on data locality, i.e. the closest block replica in the cluster.
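To make the "mappers = splits" point concrete, here is a minimal plain-Java sketch of the split-size rule FileInputFormat uses (split size = max(minSize, min(maxSize, blockSize))); the file and block sizes below are the illustrative values from this answer, not anything read from a real cluster:

```java
public class SplitCountEstimate {
    public static void main(String[] args) {
        long fileSize  = 1024L * 1024 * 1024;  // 1 GB input file (assumed)
        long blockSize = 128L * 1024 * 1024;   // dfs.blocksize = 128 MB
        long minSize   = 1L;                   // mapreduce.input.fileinputformat.split.minsize
        long maxSize   = Long.MAX_VALUE;       // mapreduce.input.fileinputformat.split.maxsize

        // splitSize = max(minSize, min(maxSize, blockSize)) -> usually just the block size
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        long splits    = (fileSize + splitSize - 1) / splitSize;  // ceiling division

        System.out.println("input splits (and map tasks): " + splits);  // prints 8
    }
}
```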
The case is different if speculative execution is enabled for the cluster - see hadoop-speculative-task-execution.
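If you want to rule speculative execution out, a hedged driver sketch like the one below turns it off for a single job using the standard MRv2 property names (the class and job name are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NoSpeculationDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Disable duplicate ("speculative") attempts for map and reduce tasks
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);

        Job job = Job.getInstance(conf, "job-without-speculation");
        // ... set mapper, reducer, input and output paths as usual ...
    }
}
```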

Related

Hadoop Running reducers in parallel

I have a 4 GB file with ~16 million lines. The maps run distributed, with 6 of the 15 maps in parallel, and generate 35,000 keys. I am using MultipleTextOutput so each reducer generates output independently of the other reducers.
I have configured the job with 25-50 reducers, but it always runs only 1 reducer at a time.
Machine: a single 4-core, 32 GB RAM machine running the Hortonworks stack.
How do I get more than one reduce task to run in parallel?
Have a look at the Hadoop MapReduce Tutorial:
How Many Reduces?
The right number of reduces seems to be 0.95 or 1.75 multiplied by (&lt;no. of nodes&gt; * &lt;no. of maximum containers per node&gt;).
With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.
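As a rough illustration of that heuristic, here is a small plain-Java sketch; the node and per-node container counts are assumptions you would replace with your cluster's real numbers:

```java
public class ReducerCountEstimate {
    public static void main(String[] args) {
        int nodes = 4;                   // worker nodes (assumed)
        int reduceContainersPerNode = 2; // max reduce containers per node (assumed)

        int singleWave = (int) (0.95 * nodes * reduceContainersPerNode); // all launch at once
        int twoWaves   = (int) (1.75 * nodes * reduceContainersPerNode); // faster nodes take a 2nd wave

        System.out.println("0.95 heuristic: " + singleWave + " reducers"); // 7
        System.out.println("1.75 heuristic: " + twoWaves   + " reducers"); // 14
    }
}
```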
Have a look at related SE questions:
How hadoop decides how many nodes will do map and reduce tasks
What is Ideal number of reducers on Hadoop?
After specifying a lower reducer memory of 2 GB (the default in mapred-site.xml was 6 GB), the framework brings up 3 reducers in parallel rather than 1.
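For reference, a hedged driver excerpt along these lines requests more reduce tasks and a smaller per-reducer container, so several reducers can fit on a single 32 GB node at once (the values are illustrative, not tuned for any particular cluster):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ParallelReducersDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.reduce.memory.mb", 2048);     // per-reducer container size
        conf.set("mapreduce.reduce.java.opts", "-Xmx1638m"); // heap a bit below the container

        Job job = Job.getInstance(conf, "parallel-reducers-job");
        job.setNumReduceTasks(25);  // same knob as mapreduce.job.reduces
        // ... mapper, reducer and multiple-output setup as usual ...
    }
}
```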

Hadoop, uneven load between machines

I have a cluster of 4 machines that I need to run a benchmark against.
I decided to use TeraSort as the benchmark.
However, when I run the benchmark, only one out of the four machines is under load, while the other three are completely idle.
If I run the test again, a different machine is fully loaded while the other three stay idle.
When I create the dataset with TeraGen, everything works just fine; the load is evenly distributed across all four machines.
What could be wrong in this configuration?
Thanks
I assume your cluster is laid out properly as 4 nodes (1 NameNode, 1 Secondary NameNode, 2 DataNodes).
The flow starts at the NameNode, and the JobTracker schedules the job on the TaskTrackers that hold the data blocks.
How much the DataNodes are used depends on a few factors: the replication factor, the number of mappers, and the number of blocks.
If there are many blocks, they will be placed evenly across all the DataNodes in your cluster. If the replication factor is 2, the blocks will be available on both DataNodes, so both can run the mappers that deal with those blocks.
If a file has two blocks, two mappers can run simultaneously on the DataNodes and use the resources properly.
In your case, the block size seems to be the problem. Try reducing it so that there are at least 2 blocks; utilization will improve, and so will performance.
Hadoop can be tuned to your needs with the settings below:
dfs.replication in hdfs-site.xml
dfs.block.size in hdfs-site.xml
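As an alternative to editing hdfs-site.xml cluster-wide, a sketch like the following writes a file with an explicit replication factor and a smaller block size through the FileSystem API; the path and sizes are placeholders for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallBlockWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path out = new Path("/benchmarks/sample.dat");  // hypothetical output path
        short replication = 2;                          // overrides dfs.replication for this file
        long blockSize = 64L * 1024 * 1024;             // 64 MB instead of the 128 MB default

        try (FSDataOutputStream stream = fs.create(out, true, 4096, replication, blockSize)) {
            stream.writeBytes("benchmark payload goes here\n");
        }
    }
}
```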
Good luck !!!

Hadoop Optimization Suggestion

Consider a scenario:
If I increase the replication factor of the data I have in HDFS, say in a 10-node cluster I set RF = 5 instead of the default 3, will it increase the performance of my data processing tasks?
Will the map phase complete sooner compared to the default replication setting?
Will there be any effect on the reduce phase?
Impact of Replication on Storage:
The replication factor has a huge impact on the storage of the cluster: the larger the replication factor, the less data you can store in it.
If the replication factor is 5, then for every 1 GB of data ingested into the cluster you need 5 GB of storage space, and you will quickly run out of space.
Since the NameNode stores all the metadata in memory, it will quickly run out of space to store it, and your NameNode will have to be allocated more memory (check HADOOP_NAMENODE_OPTS).
The data copy operation will take more time, since writes are daisy-chained across DataNodes. Instead of 3 DataNodes, 5 DataNodes now have to confirm data storage before a write/append is committed.
Impact of Replication on Computation:
Mapper:
With a higher replication factor, there are more options for scheduling a mapper. With a replication factor of 3, you can schedule a mapper on 3 different nodes; with a factor of 5, you have 5 choices.
You may be able to achieve better data locality with the increased replication factor. Each mapper can get scheduled on a node where the data is present (since there are now 5 choices compared to the default 3), thus improving performance.
Because of the better data locality, fewer mappers have to copy off-node or off-rack data.
For these reasons, it is possible that with a higher replication factor the mappers complete earlier than with a lower one.
Since the number of mappers is typically higher than the number of reducers, you may see an overall improvement in job performance.
Reducer:
Since reducer output is written directly to HDFS, it is possible that your reducers will take more time to execute with a higher replication factor.
Overall, your mappers may execute faster with a higher replication factor, but the actual improvement depends on factors such as cluster size, bandwidth, NameNode memory, etc.
After answering this question, I came across a similar question on SO: Map Job Performance on cluster. It contains some more information, with links to various research papers.
Setting the replication factor to 5 causes HDFS to maintain 5 total copies of each file's blocks on the available DataNodes in the cluster. This replication results in higher network bandwidth usage, depending on the size of the files to be replicated and the speed of your network.
The replication factor has no direct effect on either the map or the reduce phase. You may see a performance hit initially, while blocks are being replicated as a MapReduce job runs; this can cause significant network latency depending on the file sizes and your network bandwidth.
A replication factor of 5 across your cluster means that 4 of your DataNodes can disappear and you will still be able to access all files in HDFS with no corruption or missing blocks. If RF = 4, you can lose 3 servers and still have access to all files.
Setting a higher replication factor increases your overall HDFS usage: if your total data size is 1 TB, RF = 3 means your HDFS usage will be 3 TB, since each block is duplicated n-1 (3-1 = 2) additional times across the cluster.
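A quick back-of-the-envelope sketch of both effects (raw storage consumed and how many DataNode losses you can tolerate), using illustrative numbers:

```java
public class ReplicationCost {
    public static void main(String[] args) {
        double dataTb = 1.0;       // logical data ingested (assumed)
        int replicationFactor = 5; // dfs.replication

        double rawTb = dataTb * replicationFactor;      // disk actually consumed
        int tolerableFailures = replicationFactor - 1;  // DataNodes you can lose

        System.out.printf("raw HDFS usage: %.1f TB%n", rawTb);                    // 5.0 TB
        System.out.println("DataNode failures tolerated: " + tolerableFailures);  // 4
    }
}
```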

Need help in understanding MR data processing for small data sets using Hadoop

Please consider the following hypothetical scenario:
1) Input Data to be processed : 100 MB
2) Block Size : 64 MB
3) Replication Factor : 2
4) Cluster Size : 2 (Data Node 1 and Data Node 2)
The data on Data Node 1 will be split as 64 MB + 36 MB (100 MB of input data in total).
The replicated data will be available on Data Node 2 as well (64 MB + 36 MB).
Question:
Please help me understand how the 64 MB and 36 MB of data will be processed.
Will the entire data be processed only on DataNode1, with DataNode2 used just as a backup in case DataNode1 goes down?
OR
Will DataNode2 also be used for processing the data?
Please let me know if more explanation is required on the question.
Yes, it will use both DataNodes. The number of mappers will always be equal to the number of splits (unless you restrict it using properties or driver code). See this for details.
It depends. If you have a gzipped file as input, then regardless of it occupying 2 blocks it will be processed entirely by a single mapper on a single node. If you are running a YARN NodeManager on both DataNodes, they have enough memory to start 2 mapper tasks, and the cluster is quiet (no other tasks are running), then most likely both mappers will be started on the same node.
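The gzip caveat comes from the splittability check that input formats perform before creating splits. A hedged sketch of that check (the input path is hypothetical) looks like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

public class SplittabilityCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        Path input = new Path("/data/input.gz");  // hypothetical input file

        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(input);
        // Uncompressed files (codec == null) and splittable codecs can be split across mappers
        boolean splittable = codec == null || codec instanceof SplittableCompressionCodec;

        System.out.println("splittable: " + splittable);  // false for plain .gz
    }
}
```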

HDFS Replication - Data Stored

I am a relative newbie to Hadoop and want to get a better understanding of how replication works in HDFS.
Say I have a 10-node system (1 TB per node), giving me a total capacity of 10 TB. If I have a replication factor of 3, then I have 1 original copy and 3 replicas of each file. So, in essence, only 25% of my storage is original data, and my 10 TB cluster effectively holds only 2.5 TB of original (un-replicated) data.
Please let me know if my train of thought is correct.
Your thinking is a little off. A replication factor of 3 means that you have 3 total copies of your data. More specifically, there will be 3 copies of each block for your file, so if your file is made up of 10 blocks there will be 30 total blocks across your 10 nodes, or about 3 blocks per node.
You are correct in thinking that a 10 x 1 TB cluster has less than 10 TB of capacity: with a replication factor of 3, it has a functional capacity of about 3.3 TB, and a little less in practice because of the space needed for processing, temporary files, etc.
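A tiny worked example of that capacity math, using the numbers from the question:

```java
public class EffectiveCapacity {
    public static void main(String[] args) {
        double rawCapacityTb = 10.0;  // 10 nodes x 1 TB each
        int replicationFactor = 3;

        double usableTb = rawCapacityTb / replicationFactor;  // unique (un-replicated) data
        System.out.printf("usable capacity: ~%.1f TB%n", usableTb);  // ~3.3 TB
    }
}
```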
