Default / finding the number of mappers and reducers in Hadoop 1.x

Can somebody please help me understand the questions below, related to Hadoop 1.x?
Say I have just a single node with 8 GB of RAM, 40 TB of hard disk, and a quad-core processor. The block size is 64 MB. We need to process 4 TB of data.
How do we decide the number of mappers and reducers?
Can someone please explain in detail? Please let me know if I need to consider any other parameters for the calculation.
Say I have 10 data nodes in a cluster and each node has 8 GB of RAM, 40 TB of hard disk, and a quad-core processor. The block size is 64 MB. We need to process 40 TB of data. How do we decide the number of mappers and reducers?
What is the default number of mapper and reducer slots on a data node with a quad-core processor?
Many Thanks,
Manish

Number of mappers = number of input splits.
The input file is divided into splits, and each split contains a set of records. On average, a split is about one block in size (64 MB here), so in your case you would have around 62,500 mappers (or splits): 4 TB / 64 MB. You also have the option of configuring the input split size yourself, which changes how many splits, and therefore mappers, are created and how many records each mapper processes.
Number of reducers: useful reduce-side parallelism is capped by the number of unique keys in the mapper output, but the number of reducers itself is something you choose, by configuring it in the job class or on the job-submission command (the default is 1). With the default hash partitioner, keys are spread across that many reducers; you can also write your own partitioner to control which keys go to which reducer.
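As a concrete illustration of both knobs, here is a minimal new-API driver sketch (the class name, the 128 MB minimum split size, and the choice of 8 reducers are made up for illustration, not taken from the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplitAndReducerDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "split-and-reducer-demo"); // Hadoop 1.x style constructor
        job.setJarByClass(SplitAndReducerDemo.class);

        // Identity map/reduce (the framework defaults), so the job runs as-is
        // and simply copies (offset, line) pairs through.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Splits default to roughly one HDFS block (64 MB in the question);
        // raising the minimum split size to 128 MB roughly halves the mappers.
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);

        // Reducers are an explicit choice; the default is 1. With ToolRunner
        // this could also come from the command line via -D mapred.reduce.tasks=8.
        job.setNumReduceTasks(8);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}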

Related

Number of reducers in hadoop

I was learning Hadoop, and I found the number of reducers very confusing:
1) The number of reducers is the same as the number of partitions.
2) The number of reducers is 0.95 or 1.75 multiplied by (no. of nodes) * (no. of maximum containers per node).
3) The number of reducers is set by mapred.reduce.tasks.
4) The number of reducers should be the closest to: a multiple of the block size, a task time between 5 and 15 minutes, and the fewest files possible.
I am very confused. Do we explicitly set the number of reducers, or is it done by the MapReduce program itself?
How is the number of reducers calculated? Please tell me how to calculate the number of reducers.
1 - The number of reducers is the same as the number of partitions - False. A single reducer might work on one or more partitions, but a given partition is processed entirely by the reducer it was started on.
2 - That is just a theoretical maximum number of reducers you can configure for a Hadoop cluster. It is also very much dependent on the kind of data you are processing (which decides how much heavy lifting the reducers are burdened with).
3 - The mapred-site.xml configuration is just a suggestion to YARN. Internally the ResourceManager runs its own algorithm, optimizing things on the go, so that value is not necessarily the number of reducer tasks that run every time.
4 - This one seems a bit unrealistic. My block size might be 128 MB, and I can't always have 128*5 as the minimum number of reducers. That again is false, I believe.
There is no fixed number of reducer tasks that can be configured or calculated. It depends on how many resources are actually available to allocate at that moment.
The number of reducers is calculated internally from the size of the data being processed, unless you explicitly specify it with the API below in the driver program:
job.setNumReduceTasks(x)
By default, one reducer is used per 1 GB of data.
So if you are working with less than 1 GB of data and you do not explicitly set the number of reducers, 1 reducer is used.
Similarly, if your data is 10 GB, 10 reducers are used.
You can change this configuration as well: instead of 1 GB you can specify a bigger or smaller size.
The property in Hive that sets the data size per reducer is:
hive.exec.reducers.bytes.per.reducer
You can view this property by firing the set command in the Hive CLI.
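For intuition, the size-based rule this answer describes can be sketched as a small calculation; this mirrors the bytes-per-reducer idea rather than Hive's actual implementation, and the 1 GB figure is just the default quoted above:

public class ReducerEstimate {
    // Estimate reducers as ceil(totalInputBytes / bytesPerReducer), never less than 1.
    static int estimateReducers(long totalInputBytes, long bytesPerReducer) {
        long estimate = (totalInputBytes + bytesPerReducer - 1) / bytesPerReducer;
        return (int) Math.max(1, estimate);
    }

    public static void main(String[] args) {
        long oneGB = 1024L * 1024 * 1024;
        System.out.println(estimateReducers(500L * 1024 * 1024, oneGB)); // under 1 GB -> 1
        System.out.println(estimateReducers(10L * oneGB, oneGB));        // 10 GB      -> 10
    }
}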
The partitioner only decides which data goes to which reducer.
Your job may or may not need reducers; it depends on what you are trying to do. When there are multiple reducers, the map tasks partition their output, each creating one partition per reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. One rule of thumb is to aim for reducers that each run for five minutes or so and produce at least one HDFS block's worth of output. With too many reducers you end up with lots of small files.
The partitioner makes sure that the same keys from multiple mappers go to the same reducer. This does not mean that the number of partitions equals the number of reducers. However, you can specify the number of reduce tasks in the driver program using the job instance, e.g. job.setNumReduceTasks(2). If you don't specify the number of reduce tasks in the driver program, it is picked up from mapred.reduce.tasks, which has a default value of 1 (https://hadoop.apache.org/docs/r1.0.4/mapred-default.html), i.e. all mapper output goes to the same reducer.
Also note that the programmer does not have direct control over the number of mappers, as it depends on the input splits, whereas the programmer can control the number of reducers for any job.
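To make the partitioner point concrete, here is a sketch of a custom partitioner (AlphabetPartitioner is a made-up example, not from the question): it sends keys starting with A-M to one reducer and everything else to the other, while still guaranteeing that identical keys always reach the same reducer.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: keys beginning with A-M go to partition 0,
// all other keys to partition 1. Identical keys always get the same
// partition number, so they always end up on the same reducer.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0 || key.toString().isEmpty()) {
            return 0; // map-only job or empty key: nothing meaningful to split
        }
        char first = Character.toUpperCase(key.toString().charAt(0));
        int partition = (first >= 'A' && first <= 'M') ? 0 : 1;
        return partition % numReduceTasks; // stay within the configured reducer count
    }
}

It would be registered in the driver with job.setPartitionerClass(AlphabetPartitioner.class) together with job.setNumReduceTasks(2): the partitioner decides where each key goes, while the job configuration decides how many reducers exist.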

Can Hadoop map/reduce be sped up by splitting the input data into smaller chunks?

Can I improve the performance of my Hadoop map/reduce job by splitting the input data into smaller chunks?
First question:
For example, I have 1GB of input file for mapping task. My default block size is 250MB. So only 4 mappers will be assigned to do the job. If I split the data into 10 pieces, each piece will be 100MB, then I have 10 mappers to do the work. But then each split piece will occupy 1 block in the storage, which means 150MB will be wasted for each split data block. What should I do in this case if I don't want to change the block size of my storage?
Second question: if splitting the input data before the map job can increase its performance, and I want to do the same for the reduce job, should I ask the mapper to split the data before giving it to the reducer, or should I let the reducer do it?
Thank you very much. Please correct me if I also misunderstand something. Hadoop is quite new to me. So any help is appreciated.
When your pieces are only 100 MB, the remaining 150 MB per block is not wasted: an HDFS block only occupies as much disk space as the data it actually holds, so that space stays available to the system.
Increasing the number of mappers does not necessarily increase performance, because it also depends on the number of datanodes you have. For example, 10 datanodes -> 10 mappers is a good deal, but with 4 datanodes -> 10 mappers, obviously all the mappers cannot run simultaneously. So if you have 4 data nodes, it is better to have 4 blocks (with a 250 MB block size).
A reducer is something like a merge of all your mappers' output, and you can't ask the mapper to split the data. What you can do instead is ask the mapper to do a mini-reduce by defining a combiner. A combiner is nothing but a reducer that runs on the same node where the mapper executed, before the data is sent to the actual reducer. This minimises the I/O and, with it, the work of the actual reducer. Introducing a combiner is a better option for improving performance.
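As a sketch of that idea, a word-count style sum reducer (the class name is illustrative, not from the question) can double as a combiner, because summing partial counts gives the same result whether it runs on the map side or the reduce side:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts for each key; usable both as the combiner and as the
// final reducer because addition is associative and commutative.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

In the driver, job.setCombinerClass(SumReducer.class) alongside job.setReducerClass(SumReducer.class) enables the map-side mini-reduce; a combiner is only safe when, as with summing, applying it zero, one, or more times gives the same final result.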
Good luck with Hadoop !!
There can be multiple mappers running in parallel on a node for the same job, based on the number of map slots available on the node. So yes, making smaller pieces of the input should give you more parallel mappers and speed up the process. (How do you input all the pieces as a single input? Put them all in one directory and add that directory as the input path.)
On the reducer side, if you are OK with combining multiple output files after processing, you can set a higher number of reducers; the maximum number of reducers running in parallel is the number of reduce slots available in your cluster. This should improve cluster utilisation and speed up the reduce phase.
If possible, you can also use a combiner to reduce disk and network I/O overhead.

Need help in understanding MR data processing for small data sets using Hadoop

Please consider the following hypothetical scenario:
1) Input Data to be processed : 100 MB
2) Block Size : 64 MB
3) Replication Factor : 2
4) Cluster Size : 2 (Data Node 1 and Data Node 2)
The data in Data Node 1 will be split as 64 MB + 36 MB (total 100 MB of input data)
The replicated data will be available in Data Node 2 as well (64 MB + 36 MB)
Question:
Please help me understand how the 64 MB and 36 MB of data will be processed.
Will the entire data be processed only from DataNode1, with DataNode2 just used as a backup in case DataNode1 goes down?
OR
Will DataNode2 also be used for processing the data?
Please let me know if more explanation is required on the question.
Yes, it will use both datanodes. The number of mappers will always be equal to the number of splits (unless you restrict it using properties or driver code). See this for details.
It depends. If you have a gzipped file as input, then even though it spans 2 blocks it will be processed entirely by a single mapper on a single node. If you are running a YARN NodeManager on both datanodes, they have enough memory to start 2 mapper tasks, and the cluster is quiet (no other jobs running), then most likely both mappers would be started on the same node.

pseudo distributed mode: number of map and reduce tasks

I am a newbie to Hadoop. I have successfully configured a Hadoop setup in pseudo-distributed mode. Now I would like to know the logic behind choosing the number of map and reduce tasks. What do we refer to?
Thanks
You cannot generalize how the number of mappers/reducers should be set.
Number of Mappers:
You cannot explicitly set the number of mappers to a certain number (there are parameters for this, but they do not take effect). It is decided by the number of input splits created by Hadoop for your given input. You may control this by setting the mapred.min.split.size parameter. For more, read the InputSplit section here. If a lot of mappers are being generated due to a huge number of small files and you want to reduce the number of mappers, you will need to combine data from more than one file. Read this: How to combine input files to get to a single mapper and control number of mappers.
To quote from the wiki page:
The number of maps is usually driven by the number of DFS blocks in the input files. Although that causes people to adjust their DFS block size to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps/node, although we have taken it up to 300 or so for very cpu-light map tasks. Task setup takes awhile, so it is best if the maps take at least a minute to execute.
Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps.
The number of map tasks can also be increased manually using the JobConf's conf.setNumMapTasks(int num). This can be used to increase the number of map tasks, but will not set the number below that which Hadoop determines via splitting the input data.
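Since the quote refers to the old org.apache.hadoop.mapred API, here is a minimal driver sketch in that style; the class name, the 256 MB minimum split size, and the value 100 are illustrative, and, as the quote says, setNumMapTasks is only a hint:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class OldApiMapCountDemo {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(OldApiMapCountDemo.class);
        conf.setJobName("old-api-map-count-demo");

        // Identity mapper/reducer defaults keep the example self-contained.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        // Lower bound on split size: with 128 MB blocks this asks for splits
        // of at least 256 MB, which roughly halves the number of map tasks.
        conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);

        // Only a hint to the InputFormat, as explained in the quote above.
        conf.setNumMapTasks(100);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}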
Number of Reducers:
You can explicitly set the number of reducers. Just set the parameter mapred.reduce.tasks. There are guidelines for setting this number, but usually the default number of reducers should be good enough. At times a single report file is required; in such cases you might want to set the number of reducers to 1.
Again to quote from wiki:
The right number of reduces seems to be 0.95 or 1.75 * (nodes * mapred.tasktracker.tasks.maximum). At 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. At 1.75 the faster nodes will finish their first round of reduces and launch a second round of reduces doing a much better job of load balancing.
Currently the number of reduces is limited to roughly 1000 by the buffer size for the output files (io.buffer.size * 2 * numReduces << heapSize). This will be fixed at some point, but until it is it provides a pretty firm upper bound.
The number of reduces also controls the number of output files in the output directory, but usually that is not important because the next map/reduce step will split them into even smaller splits for the maps.
The number of reduce tasks can also be increased in the same way as the map tasks, via JobConf's conf.setNumReduceTasks(int num).
Actually the no. of mappers is primarily governed by the no. of InputSplits created by the InputFormat you are using, and the no. of reducers by the no. of partitions you get after the map phase. Having said that, you should also keep in mind the number of slots available per slave, along with the available memory. But as a rule of thumb you could use this approach:
Take the no. of virtual CPUs * 0.75 and that's the no. of slots you can configure. For example, if you have 12 physical cores (24 virtual cores), you would have 24 * 0.75 = 18 slots. Now, based on your requirements, you can choose how many mappers and reducers you want to use. With 18 MR slots, you could have 9 mappers and 9 reducers, or 12 mappers and 6 reducers, or whatever split works for you.
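The same rule of thumb as a tiny, self-contained calculation (the 24-core figure is the example above; the even split is just one possible choice):

public class SlotEstimate {
    public static void main(String[] args) {
        int virtualCores = 24;                   // 12 physical cores with hyper-threading
        int slots = (int) (virtualCores * 0.75); // rule of thumb: 24 * 0.75 = 18 slots
        int reducers = slots / 2;                // one possible split: 9 and 9
        int mappers = slots - reducers;
        System.out.println(slots + " slots -> " + mappers + " mappers, " + reducers + " reducers");
    }
}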
HTH

how to deal with large map output in hadoop?

I am new to Hadoop and I'm working with 3 nodes in a cluster (each of them has 2 GB of RAM).
The input file is small (5 MB), but the map output is very large (about 6 GB).
In the map phase my memory becomes full and the tasks run very slowly.
What is the reason for this?
Can anyone help me make my program faster?
Use NLineInputFormat, where N refers to the number of lines of input each mapper will receive. This way more splits are created, forcing smaller input data onto each mapper task. Otherwise, the entire 5 MB will go to one map task.
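A sketch of how NLineInputFormat might be wired up in a driver; the class name and the 10,000-lines-per-split figure are illustrative, the mapper/reducer are left at the framework defaults so the example stays self-contained, and NLineInputFormat here is the new-API class from org.apache.hadoop.mapreduce.lib.input:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NLineDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "nline-demo");
        job.setJarByClass(NLineDemo.class);

        // Each split carries at most 10,000 input lines, so even a small file
        // is spread across several map tasks instead of a single one.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        NLineInputFormat.setNumLinesPerSplit(job, 10000);

        // Identity map/reduce defaults; keys are byte offsets, values are lines.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}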
The size of the map output by itself does not cause memory problems, since the mapper can work in "streaming" mode: it consumes records, processes them, and writes them to its output. Hadoop stores some amount of data in memory and then spills it to disk.
So your problems can be caused by one of two things:
a) Your mapper algorithm somehow accumulates data during processing.
b) The cumulative memory given to your mappers exceeds the RAM of the nodes. Then the OS starts swapping and your performance can fall by orders of magnitude.
Case b is more likely, since 2 GB is really very little for a usual Hadoop configuration. If you are going to work with it, I would suggest configuring 1, at most 2, mapper slots per node.
