Reducing number of Map tasks during Hadoop Streaming - hadoop

I have a folder with 3072 files, each of ~50mb. I'm running a Python script over this input using Hadoop Streaming and extracting some data.
On a single file, the script doesn't take more than 2 seconds. However, running this on an EMR cluster with 40 m1.large task nodes and 3072 files takes 12 minutes.
Hadoop streaming does this:
14/11/11 09:58:51 INFO mapred.FileInputFormat: Total input paths to process : 3072
14/11/11 09:58:52 INFO mapreduce.JobSubmitter: number of splits:3072
And hence 3072 map tasks are created.
Of course the Map Reduce overhead comes into play. From some initial research, it seems that it's very inefficient if map tasks take less than 30-40 seconds.
What can I do to reduce the number of map tasks here? Ideally, if each task handled around 10-20 files it would greatly reduce the overhead.
I've tried playing around with the block size; but since the files are all around 50mb in size, they're already in separate blocks and increasing the block size makes no differenece.

Unfortunately you can't. The number of map tasks for a given job is driven by the number of input splits. For each input split a map task is spawned. So, over the lifetime of a mapreduce job the number of map tasks is equal to the number of input splits.

mapred.min.split.size will specify the minimum split size to process by a mapper.
So, increasing split size should reduce the no of mappers.
Check out the link
Behavior of the parameter "mapred.min.split.size" in HDFS

Related

Controlling number of map and reduce jobs spawned?

I am trying to understand how may map reduce jobs get started for a task and how to control the number of MR jobs.
Say I have a 1TB file in HDFS and my block size is 128MB.
For a MR task on this 1TB file if I specify the input split size as 256MB then how many Map and Reduce jobs gets started. From my understanding this is dependent on split size. i.e number of Map jobs = total size of file / split size and in this case it works out to be 1024 * 1024 MB / 256 MB = 4096. So the number of map task started by hadoop is 4096.
1) Am I right?
2) If I think that this is an inappropriate number, can I inform hadoop to start less number of jobs or even more number of jobs. If yes how?
And how about the number of reducer jobs spawned, I think this is totally controlled by the user.
3) But how and where should I mention the number of reducer jobs required.
1. Yes, you're right. No of mappers=(size of data)/(input split size). So, in your case it would be 4096
As per my understanding ,Before hadoop-2.7 you can only hint system to create number of mapper by conf.setNumMapTasks(int num) but mapper will created by their own. From hadoop-2.7 you can limit number of mapper by mapreduce.job.running.map.limit. See this JIRA ticket
By default number of reducer is 1. You can change it by job.setNumReduceTasks(integer_numer);
You can also provide this parameter from cli
-Dmapred.reduce.tasks=<num reduce tasks>

Understanding number of map and reduce tasks in Hadoop MapReduce

Assume that 8GB memory is available with a node in hadoop system.
If the task tracker and data nodes consume 2GB and if the memory required for each task is 200MB, how many map and reduce can get started?
8-2 = 6GB
So, 6144MB/200MB = 30.72
So, 30 total map and reduce tasks will be started.
Am I right or am I missing something?
The number of mappers and reducers is not determined by the resources available. You have to set the number of reducers in your code by calling setNumReduceTasks().
For the number of mappers, it is more complicated, as they are set by Hadoop. By default, there is roughly one map task per input split. You can tweak that by changing the default block size, record reader, number of input files.
You should also set in the hadoop configuration files the maximum number of map tasks and reduce tasks that run concurrently, as well as the memory allocated to each task. Those last two configurations are the ones that are based on the available resources. Keep in mind that map and reduce tasks run on your CPU, so you are practically restricted by the number of available cores (one core cannot run two tasks at the same time).
This guide may help you with more details.
The number of concurrent task is not decided just based on the memory available on a node. it depends on the number of cores as well. If your node has 8 vcores and each of your task takes 1 core then at a time only 8 task can run.

Reducing number of splits in map reduce job using hadoop streaming jar.

How to reduce number of splits in map reduce job using hadoop streaming jar. I tried to modify mapreduce.input.fileinputformat.split.minsize/maxsize but the number is still the same.
Number of split in my job is 85000 and job takes around 12 hours to complete. In order to reduce time, reducing number of split should be good option since launching a container is an expensive process.
Any other suggestions are also welcome regarding reducing job time apart from increasing the server.

Can hadoop map/reduce be speeded up by splitting data size?

Can I increase the performance time of my hadoop map/reduce job by splitting the input data into smaller chunks?
First question:
For example, I have 1GB of input file for mapping task. My default block size is 250MB. So only 4 mappers will be assigned to do the job. If I split the data into 10 pieces, each piece will be 100MB, then I have 10 mappers to do the work. But then each split piece will occupy 1 block in the storage, which means 150MB will be wasted for each split data block. What should I do in this case if I don't want to change the block size of my storage?
Second question: If I split input data before mapping job, it can increase the performance of the mapping job. So If I want to do the same for reducing job, should I ask mapper to split the data before giving it to reducer or should I let reducer do it ?
Thank you very much. Please correct me if I also misunderstand something. Hadoop is quite new to me. So any help is appreciated.
When you change your block size to 100 MB, 150 MB is not wasted. It is still available memory for the system.
If Mappers are increased, it does not mean that it will definitely increase performance. Because it depends on the number of datanodes you have. For example, if you have 10 DataNode -> 10 Mapper, it is a good deal. But if you have 4 datanode -> 10 Mapper, obviously all mappers cannot run simultaneously. So if you have 4 data nodes, it is better to have 4 blocks (with a 250MB block size).
Reducer is something like a merge of all your mappers' output and you can't ask Mapper to split the data. In reverse, you can ask Mapper to do a mini-reduce by defining a Combiner. Combiner is nothing but a reducer in the same node where the mapper was executed, run before sending to the actual reducer. So the I/O will be minimized and so is the work of actual reducer. Introducing a Combiner will be a better option to improve performance
Good luck with Hadoop !!
There can be multiple parallel mappers running in a node for the same job based on the number of map slots available in the node. So, yes making smaller pieces of the input should give you more parallel mappers and speed up the process.(how to input all the pieces as single input? - put all of them in one directory and add that as input path)
On the reducer side of you are OK to combine multiple output files post processing, you can set more number of reducers and max parallel reducers running could be the number of reduce shots available in your cluster. This should improve cluster utilisation and speed up reduce phase.
If possible you may use combiner also to reduce disk and network i/o overhead.

Default number of reducers

In Hadoop, if we have not set number of reducers, then how many number of reducers will be created?
Like number of mappers is dependent on (total data size)/(input split size),
E.g. if data size is 1 TB and input split size is 100 MB. Then number of mappers will be (1000*1000)/100 = 10000(Ten thousand).
The number of reducer is dependent on which factors ? How many reducers are created for a job?
How Many Reduces? ( From official documentation)
The right number of reduces seems to be 0.95 or 1.75 multiplied by
(no. of nodes) * (no. of maximum containers per node).
With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.
Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures.
The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative-tasks and failed tasks.
This article covers about Mapper count too.
How Many Maps?
The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
Thus, if you expect 10TB of input data and have a blocksize of 128MB, you’ll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.
If you want to change the default value of 1 for number of reducers, you can set below property (From hadoop 2.x version) as a command line parameter
mapreduce.job.reduces
OR
you can set programmatically with
job.setNumReduceTasks(integer_numer);
Have a look at one more related SE question: What is Ideal number of reducers on Hadoop?
By default the no of reducers is set to 1.
You can change it by adding a parameter
mapred.reduce.tasks in the command line or in the Driver code or in the conf file that you pass.
e.g: Command Line Argument: bin/hadoop jar ... -Dmapred.reduce.tasks=<num reduce tasks>
or, in Driver code as: conf.setNumReduceTasks(int num);
Recommended read:
https://wiki.apache.org/hadoop/HowManyMapsAndReduces

Resources