Tips to improve MapReduce job performance in Hadoop

I have 100 mappers and 1 reducer running in a job. How can I improve the job's performance?
As per my understanding, using a combiner can improve performance to a great extent. But what else do we need to configure to improve the job's performance?

With the limited data in this question (input file size, HDFS block size, average map processing time, number of mapper and reducer slots in the cluster, etc.), we can't suggest specific tips.
But there are some general guidelines to improve performance.
If each task takes less than 30-40 seconds, reduce the number of tasks
If a job has more than 1TB of input, consider increasing the block size of the input dataset to 256M or even 512M so that the number of tasks will be smaller.
So long as each task runs for at least 30-40 seconds, increase the number of mapper tasks to some multiple of the number of mapper slots in the cluster
The number of reduce tasks per job should be equal to, or a bit less than, the number of reduce slots in the cluster.
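The block-size guideline above can be sketched numerically. This is plain Python for illustration, not the Hadoop API; it assumes the common rule of thumb that each HDFS block produces roughly one map task:

```python
def num_map_tasks(input_bytes, block_bytes):
    """Roughly one map task per HDFS block (assuming default input splits)."""
    return -(-input_bytes // block_bytes)  # ceiling division

TB = 1024 ** 4
MB = 1024 ** 2

# A 1 TB input at a 128 MB block size vs. a 512 MB block size:
print(num_map_tasks(TB, 128 * MB))  # 8192 tasks
print(num_map_tasks(TB, 512 * MB))  # 2048 tasks
```

Quadrupling the block size cuts the task count to a quarter, which is exactly why the answer suggests 256M or 512M blocks for inputs over 1TB.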
Some more tips:
Configure the cluster properly with the right diagnostic tools
Use compression when you are writing intermediate data to disk
Tune the number of map and reduce tasks as per the tips above
Incorporate a combiner wherever it is appropriate
Use the most appropriate data types for the output (do not use LongWritable when the output values fit in the Integer range; IntWritable is the right choice in that case)
Reuse Writables
Use the right profiling tools
Have a look at this Cloudera article for some more tips.
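The combiner tip is the one with the biggest effect on shuffle volume. Here is a small plain-Python simulation (not the Hadoop API) of why: in a word count, a combiner applies the reducer's summing logic locally on the mapper's node, so far fewer key-value pairs cross the network:

```python
from collections import Counter

# Hypothetical map output for a word-count job: one ("word", 1) pair per token.
map_output = [("the", 1), ("cat", 1), ("the", 1), ("the", 1), ("cat", 1)]

# Without a combiner, every (word, 1) pair is shuffled to the reducer.
shuffled_without = len(map_output)

# With a combiner (here, the same summing logic as the reducer),
# only one pair per distinct key leaves the mapper's node.
combined = Counter()
for word, count in map_output:
    combined[word] += count
shuffled_with = len(combined)

print(shuffled_without, shuffled_with)  # 5 2
```

Five intermediate records shrink to two, and the reducer still computes the same final counts. This only works when the reduce function is associative and commutative, which is why the tip says "wherever it is appropriate".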

Related

How is the number of map and reduce tasks determined?

When running a certain file on Hadoop using MapReduce, sometimes it creates 1 map task and 1 reduce task, while another file can use 4 map tasks and 1 reduce task.
My question is: on what basis is the number of map and reduce tasks decided?
Is there a certain map/reduce size after which a new map/reduce task is created?
Many Thanks Folks.
From the official doc:
The number of maps is usually driven by the number of DFS blocks in
the input files. Although that causes people to adjust their DFS block
size to adjust the number of maps. The right level of parallelism for
maps seems to be around 10-100 maps/node, although we have taken it up
to 300 or so for very cpu-light map tasks. Task setup takes awhile, so
it is best if the maps take at least a minute to execute.
The ideal reducers should be the optimal value that gets them closest to:
A multiple of the block size
A task time between 5 and 15 minutes
Creates the fewest files possible
Anything other than that means there is a good chance your reducers are less than great. There is a tremendous tendency for users to use a REALLY high value ("More parallelism means faster!") or a REALLY low value ("I don't want to blow my namespace quota!"). Both are equally dangerous, resulting in one or more of:
Terrible performance on the next phase of the workflow
Terrible performance due to the shuffle
Terrible overall performance because you've overloaded the namenode with objects that are ultimately useless
Destroying disk IO for no really sane reason
Lots of network transfers
The number of mappers is equal to the number of HDFS blocks for the input file that will be processed.
The number of reducers should ideally be about 10% of your total mappers. Say you have 100 mappers; then ideally the number of reducers should be somewhere around 10.
However, it is possible to specify the number of reducers in your MapReduce job.
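The 10% rule of thumb from this answer can be written down as a tiny helper. This is illustrative plain Python, not a Hadoop API call; in a real job you would pass the result to `setNumReduceTasks()`:

```python
def suggested_reducers(num_mappers, fraction=0.10):
    """Rule of thumb from the answer above: reducers ~= 10% of mappers,
    but never fewer than one."""
    return max(1, round(num_mappers * fraction))

print(suggested_reducers(100))  # 10
print(suggested_reducers(8))    # 1
```

For the question at the top of this page (100 mappers, 1 reducer), this heuristic would suggest about 10 reducers rather than 1, which by itself could relieve a serious bottleneck.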

Understanding number of map and reduce tasks in Hadoop MapReduce

Assume that 8GB of memory is available on a node in a Hadoop cluster.
If the TaskTracker and DataNode daemons consume 2GB, and the memory required for each task is 200MB, how many map and reduce tasks can be started?
8 - 2 = 6GB
So, 6144MB / 200MB = 30.72
So, 30 map and reduce tasks in total will be started.
Am I right, or am I missing something?
The number of mappers and reducers is not determined by the resources available. You have to set the number of reducers in your code by calling setNumReduceTasks().
For the number of mappers, it is more complicated, as they are set by Hadoop. By default, there is roughly one map task per input split. You can tweak that by changing the default block size, record reader, number of input files.
You should also set in the hadoop configuration files the maximum number of map tasks and reduce tasks that run concurrently, as well as the memory allocated to each task. Those last two configurations are the ones that are based on the available resources. Keep in mind that map and reduce tasks run on your CPU, so you are practically restricted by the number of available cores (one core cannot run two tasks at the same time).
This guide may help you with more details.
The number of concurrent tasks is not decided just based on the memory available on a node; it depends on the number of cores as well. If your node has 8 vcores and each of your tasks takes 1 core, then only 8 tasks can run at a time.
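Putting the two answers together: concurrent task capacity is the minimum of what memory allows and what cores allow. A plain-Python sketch (illustrative, not a Hadoop scheduler) using the question's own numbers:

```python
def concurrent_task_slots(node_mem_mb, daemon_mem_mb, task_mem_mb, vcores):
    """Concurrent tasks are capped by BOTH memory and cores, not memory alone.
    Assumes each task needs one core."""
    by_memory = (node_mem_mb - daemon_mem_mb) // task_mem_mb
    by_cores = vcores
    return min(by_memory, by_cores)

# The question's numbers: 8 GB node, 2 GB for daemons, 200 MB per task, 8 vcores.
print(concurrent_task_slots(8192, 2048, 200, 8))  # 8, not 30
```

The memory arithmetic alone gives 30 slots, but with 8 vcores the node can only run 8 tasks at once, which is the point the answer makes.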

Can Hadoop map/reduce be sped up by splitting the input data?

Can I increase the performance time of my hadoop map/reduce job by splitting the input data into smaller chunks?
First question:
For example, I have 1GB of input file for mapping task. My default block size is 250MB. So only 4 mappers will be assigned to do the job. If I split the data into 10 pieces, each piece will be 100MB, then I have 10 mappers to do the work. But then each split piece will occupy 1 block in the storage, which means 150MB will be wasted for each split data block. What should I do in this case if I don't want to change the block size of my storage?
Second question: If I split input data before mapping job, it can increase the performance of the mapping job. So If I want to do the same for reducing job, should I ask mapper to split the data before giving it to reducer or should I let reducer do it ?
Thank you very much. Please correct me if I also misunderstand something. Hadoop is quite new to me. So any help is appreciated.
When you change your block size to 100 MB, the 150 MB is not wasted. It is still available storage for the system, since an HDFS block only occupies as much disk space as the data it actually contains.
If Mappers are increased, it does not mean that it will definitely increase performance. Because it depends on the number of datanodes you have. For example, if you have 10 DataNode -> 10 Mapper, it is a good deal. But if you have 4 datanode -> 10 Mapper, obviously all mappers cannot run simultaneously. So if you have 4 data nodes, it is better to have 4 blocks (with a 250MB block size).
The reducer is something like a merge of all your mappers' output, and you can't ask a mapper to split the data. Instead, you can ask a mapper to do a mini-reduce by defining a combiner. A combiner is nothing but a reducer that runs on the same node where the mapper executed, before the data is sent to the actual reducer. This minimizes I/O, and so the work of the actual reducer is reduced as well. Introducing a combiner will be a better option to improve performance.
Good luck with Hadoop !!
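The datanode argument in the answer above ("4 datanodes -> 10 mappers means not all mappers run simultaneously") comes down to counting scheduling waves. A plain-Python sketch of that arithmetic (illustrative only; real slot counts vary per cluster):

```python
def map_waves(num_mappers, num_nodes, slots_per_node=1):
    """Number of scheduling 'waves' needed to run all map tasks,
    assuming (hypothetically) one map slot per node."""
    capacity = num_nodes * slots_per_node
    return -(-num_mappers // capacity)  # ceiling division

# 10 mappers on 4 datanodes need 3 waves of execution;
# 4 mappers (250 MB blocks) finish in a single wave.
print(map_waves(10, 4))  # 3
print(map_waves(4, 4))   # 1
```

More mappers than cluster capacity just queues work into extra waves, so the smaller splits may add scheduling overhead without adding real parallelism.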
There can be multiple parallel mappers running on a node for the same job, based on the number of map slots available on the node. So yes, making smaller pieces of the input should give you more parallel mappers and speed up the process. (How to input all the pieces as a single input? Put all of them in one directory and add that as the input path.)
On the reducer side, if you are OK with combining multiple output files in post-processing, you can set a higher number of reducers; the maximum number of reducers running in parallel will be the number of reduce slots available in your cluster. This should improve cluster utilisation and speed up the reduce phase.
If possible, you may also use a combiner to reduce disk and network I/O overhead.

number of map and reduce task does not change in M/R program

I have a question. I have a MapReduce program that gets its input from Cassandra. My input is a little big, about 100,000,000 records. My problem is that my program takes too long to process, but I think MapReduce is good and fast for large volumes of data, so I think maybe I have a problem with the number of map and reduce tasks. I set the number of map and reduce tasks with JobConf, with Job, and also in conf/mapred-site.xml, but I don't see any changes. In my logs, at first there is map 0% reduce 0%, and after about 2 hours of working it shows map 1% reduce 0%!
What should I do? Please help me, I'm really confused...
Please consider these points to check where the bottleneck might be:
Merely configuring an increase in the number of map or reduce tasks won't do; you need hardware to support that. Hadoop is fast, but to process a huge file, as you have mentioned, you need more parallel map and reduce tasks running. To achieve that you need more processors, and to get more processors you need more machines (nodes). For example, if you have 2 machines with 8 processors each, you get a total processing power of around 16. So a total of 16 map and reduce tasks can run in parallel, and the next set of tasks comes in as soon as slots become unoccupied out of the 16 slots you have. When you add one more machine with 8 processors, you have 24.
The algorithms you used for map and reduce. Even if you have processing power, that doesn't mean your Hadoop application will perform unless your algorithm performs. It might be the case that a single map task takes forever to complete.

Hadoop slowstart configuration

What's an ideal value for "mapred.reduce.slowstart.completed.maps" for a Hadoop job? What are the rules to follow to set it appropriately?
Thanks!
It depends on a number of characteristics of your job, cluster and utilization:
How many map slots your job will require vs. maximum map capacity: if you have a job that spawns thousands of map tasks but you only have 10 map slots in total (an extreme case to demonstrate a point), then starting your reducers early could deprive other reduce tasks of executing. In this case I would set your slowstart to a large value (0.999 or 1.0). This is also true if your mappers take an age to complete - let someone else use the reducers.
If your cluster is relatively lightly loaded (there isn't contention for the reducer slots) and your mappers output a good volume of data, then a low value for slowstart will assist in getting your job to finish earlier (while other map tasks execute, get the map output data moved to the reducers).
There are probably more factors to consider.
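To make the property's semantics concrete, here is a small plain-Python sketch (not the Hadoop scheduler itself) of how `mapred.reduce.slowstart.completed.maps` gates when reducers are scheduled:

```python
import math

def maps_before_reducers_start(total_maps, slowstart_fraction):
    """Number of map tasks that must complete before reduce tasks
    are scheduled, per the slowstart fraction."""
    return math.ceil(total_maps * slowstart_fraction)

# For a job with 1000 map tasks:
print(maps_before_reducers_start(1000, 0.05))  # 50  - reducers start early
print(maps_before_reducers_start(1000, 1.0))   # 1000 - reducers wait for all maps
```

A low fraction lets reducers begin pulling map output while the remaining maps run (good on a lightly loaded cluster); a fraction near 1.0 keeps reduce slots free for other jobs until your maps are nearly done, matching the two cases described in the answer.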
