Reducing the number of splits in a MapReduce job using the Hadoop streaming jar - hadoop

How can I reduce the number of splits in a MapReduce job run with the Hadoop streaming jar? I tried modifying mapreduce.input.fileinputformat.split.minsize/maxsize, but the number of splits stays the same.
My job has 85,000 splits and takes around 12 hours to complete. Since launching a container is an expensive process, reducing the number of splits should be a good way to cut that time.
Any other suggestions for reducing the job time, apart from adding more server capacity, are also welcome.
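In case it helps others reading this: mapreduce.input.fileinputformat.split.minsize/maxsize only control how large files are cut up, and a plain FileInputFormat still emits at least one split per input file, so if the 85,000 splits come from many small files those settings alone won't lower the count. Below is a minimal, untested Java driver sketch using CombineTextInputFormat (the class name FewerSplitsDriver and the 512 MB cap are assumptions for illustration, not anyone's actual job). For the streaming jar, the same idea can be tried by passing an old-API combine input format via -inputformat, if your Hadoop version ships one.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FewerSplitsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "fewer-splits");
        job.setJarByClass(FewerSplitsDriver.class);

        // CombineTextInputFormat packs many small files into each split,
        // unlike TextInputFormat, which emits at least one split per file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap the bytes packed into one split
        // (sets mapreduce.input.fileinputformat.split.maxsize under the hood).
        CombineTextInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);

        job.setMapperClass(Mapper.class); // identity mapper as a placeholder
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        CombineTextInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}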

Related

Why does the time of a Hadoop job decrease significantly when the reducers reach a certain number?

I am testing the scalability of a MapReduce-based algorithm with an increasing number of reducers. It generally looks fine (the time decreases as the number of reducers increases), but the job time always drops significantly when the reducers reach a certain number (30 in my Hadoop cluster) instead of decreasing gradually. What are the possible causes?
Some details about my Hadoop job:
(1) Light map phase. There are only a few hundred input lines, and each line generates around five thousand key-value pairs. The whole map phase takes no more than 2 minutes.
(2) Heavy reduce phase. Each key in the reduce function matches 1-2 thousand values, and the algorithm in the reduce phase is very compute-intensive. The reduce phase generally takes around 30 minutes to finish.
(Time performance plot omitted.)
It should be because of the high number of key-value pairs. At a specific number of reducers, the pairs get distributed roughly evenly across the reducers, so all reducers finish their work at almost the same time. Otherwise the job may end up waiting for one or two heavily loaded reducers to finish their work.
IMHO it could be that, with a sufficient number of reducers available, the network I/O needed to transfer intermediate results to each reducer decreases. Since network I/O is usually the bottleneck in most MapReduce programs, this reduction in required network I/O gives a significant improvement.
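One way to sanity-check the "keys spread evenly at a certain reducer count" explanation is to replay the default HashPartitioner formula over your own keys. A small stand-alone sketch (the sample keys and the class name below are made up; substitute your real keys):

import java.util.HashMap;
import java.util.Map;

public class PartitionSkewCheck {
    // Same formula as Hadoop's default HashPartitioner.getPartition().
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        String[] sampleKeys = {"alpha", "beta", "gamma", "delta", "epsilon"}; // stand-ins for real keys
        int numReducers = 30; // the reducer count where the jump was observed

        Map<Integer, Integer> keysPerReducer = new HashMap<>();
        for (String k : sampleKeys) {
            keysPerReducer.merge(partitionFor(k, numReducers), 1, Integer::sum);
        }
        // A very uneven histogram means a few reducers get most of the keys,
        // and with a 30-minute compute-heavy reduce phase they dominate the job time.
        System.out.println(keysPerReducer);
    }
}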

Hadoop: Does using CombineFileInputFormat for small files give a performance improvement?

I am new to Hadoop and am performing some tests on my local machine.
There have been many solutions for dealing with many small files. I am using CombinedInputFormat, which extends CombineFileInputFormat.
I see that the number of mappers has changed from 100 to 25 with CombinedInputFormat. Should I also expect a performance gain, since the number of mappers has been reduced?
When I ran the map-reduce job on the many small files without CombinedInputFormat, 100 mappers took 10 minutes.
But when the map-reduce job was executed with CombinedInputFormat, 25 mappers took 33 minutes.
Any help will be appreciated.
Hadoop performs better with a small number of large files than with a huge number of small files. ("Small" here means significantly smaller than a Hadoop Distributed File System (HDFS) block; "huge number" means ranging into the thousands.)
That means that if you have 1,000 files of 1 MB each, a map-reduce job based on the normal TextInputFormat will create 1,000 map tasks, and each of these map tasks requires a certain amount of time to start and finish. This task-creation latency can reduce the performance of the job.
In a multi-tenant cluster with limited resources, getting a large number of map slots will also be difficult.
Please refer to this link for more details and benchmark results.
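For context, here is a minimal sketch of what a CombinedInputFormat built on CombineFileInputFormat usually looks like (the class and method names below are assumptions for illustration, not the asker's actual code): it packs several small files into one split and reads each file with the ordinary LineRecordReader.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class CombinedInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException {
        // Delegates each file inside the combined split to a SingleFileReader.
        return new CombineFileRecordReader<>((CombineFileSplit) split, context, SingleFileReader.class);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // small files are read whole
    }

    // Reads the index-th file of a combined split with the ordinary LineRecordReader.
    public static class SingleFileReader extends RecordReader<LongWritable, Text> {
        private final LineRecordReader reader = new LineRecordReader();
        private final CombineFileSplit split;
        private final int index;

        public SingleFileReader(CombineFileSplit split, TaskAttemptContext context, Integer index) {
            this.split = split;
            this.index = index;
        }

        @Override
        public void initialize(InputSplit genericSplit, TaskAttemptContext ctx)
                throws IOException, InterruptedException {
            // Re-wrap the index-th file as a normal FileSplit for the line reader.
            FileSplit fileSplit = new FileSplit(split.getPath(index), split.getOffset(index),
                    split.getLength(index), split.getLocations());
            reader.initialize(fileSplit, ctx);
        }

        @Override public boolean nextKeyValue() throws IOException { return reader.nextKeyValue(); }
        @Override public LongWritable getCurrentKey() { return reader.getCurrentKey(); }
        @Override public Text getCurrentValue() { return reader.getCurrentValue(); }
        @Override public float getProgress() throws IOException { return reader.getProgress(); }
        @Override public void close() throws IOException { reader.close(); }
    }
}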

Reducing the number of map tasks during Hadoop Streaming

I have a folder with 3072 files, each ~50 MB. I'm running a Python script over this input using Hadoop Streaming and extracting some data.
On a single file, the script doesn't take more than 2 seconds. However, running this on an EMR cluster with 40 m1.large task nodes and 3072 files takes 12 minutes.
Hadoop streaming does this:
14/11/11 09:58:51 INFO mapred.FileInputFormat: Total input paths to process : 3072
14/11/11 09:58:52 INFO mapreduce.JobSubmitter: number of splits:3072
And hence 3072 map tasks are created.
Of course the MapReduce overhead comes into play. From some initial research, it seems that it's very inefficient if map tasks take less than 30-40 seconds.
What can I do to reduce the number of map tasks here? Ideally, if each task handled around 10-20 files it would greatly reduce the overhead.
I've tried playing around with the block size; but since the files are all around 50 MB in size, they're already in separate blocks, and increasing the block size makes no difference.
Unfortunately you can't. The number of map tasks for a given job is driven by the number of input splits. For each input split, a map task is spawned, so over the lifetime of a MapReduce job the number of map tasks equals the number of input splits.
mapred.min.split.size specifies the minimum split size to be processed by a mapper.
So increasing the split size should reduce the number of mappers.
Check out this link:
Behavior of the parameter "mapred.min.split.size" in HDFS
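As a side note, here is a tiny stand-alone illustration of the split-size formula the new-API FileInputFormat uses (the class name, the 128 MB block size, and the 256 MB minimum below are just example numbers). It shows what raising the minimum split size changes, and also why splits are computed per file, so separate ~50 MB files still yield one split each:

public class SplitMath {
    // Same formula as FileInputFormat.computeSplitSize(blockSize, minSize, maxSize).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // assumed HDFS block size
        long minSize   = 256L * 1024 * 1024; // a large minimum split size
        long maxSize   = Long.MAX_VALUE;     // no maximum configured
        long fileSize  = 50L * 1024 * 1024;  // each input file is ~50 MB

        long splitSize = computeSplitSize(blockSize, minSize, maxSize); // 256 MB here
        // getSplits() walks the input files one at a time, so a file smaller than
        // the split size still contributes exactly one split of its own.
        long splitsForThisFile = Math.max(1, (long) Math.ceil((double) fileSize / splitSize));
        System.out.println("split size = " + splitSize + " bytes, splits for a 50 MB file = " + splitsForThisFile);
    }
}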

Why Hadoop shuffling takes longer than expected

I am trying to figure out how much time each step takes in the simple Hadoop wordcount example.
In this example, 3 maps and 1 reducer are used, and each map generates ~7 MB of shuffle data. I have a cluster that is connected via 1 Gb switches. When I looked at the job details, I realized that shuffling takes ~7 seconds after all map tasks are completed, which is more than I expected for transferring such a small amount of data. What could be the reason behind this? Thanks.
Hadoop uses heartbeats to communicate with nodes. By default, Hadoop uses a minimal heartbeat interval of 3 seconds. Consequently, Hadoop completes your task within two heartbeats (roughly 6 seconds).
More details: https://issues.apache.org/jira/browse/MAPREDUCE-1906
The transfer is not the only thing that has to complete after the map step. Each mapper writes its part of the map output locally and sorts it. The reducer responsible for a given partition then gathers its portion of each mapper's output, each fetch requiring a transfer of about 7 MB here, and then has to merge these segments into a final sorted file.
Honestly though, the scale you are testing on is absolutely tiny. I don't know all the parts of the Hadoop shuffle step, which I understand has some involved details, but you shouldn't expect the performance on such small files to be indicative of actual performance on larger files.
I think the shuffling started after the first mapper finished but then had to wait for the next two mappers.
There is an option to start the reduce phase (which begins with shuffling) only after all the mappers have finished, but that doesn't really speed anything up.
(BTW, 7 seconds is considered fast in Hadoop. Hadoop's performance is poor, especially for small files. Unless somebody else is paying for it, don't use Hadoop.)
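For completeness, the "start the reduce phase only after all mappers have finished" option mentioned above is, as far as I know, the slow-start setting. A rough driver fragment (the class name and the wordcount wiring are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SlowStartExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Fraction of map tasks that must complete before reducers (and their
        // shuffle fetches) are launched; 1.0 waits for every map, while the
        // default (around 0.05) starts them almost immediately.
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 1.0f);

        Job job = Job.getInstance(conf, "wordcount-slowstart");
        // ... set the usual wordcount mapper, reducer, and input/output paths here ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}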

Number of map and reduce tasks does not change in M/R program

I have a question. I have a MapReduce program that gets its input from Cassandra. My input is fairly big, about 100,000,000 records. My problem is that my program takes too long to process it, yet I thought MapReduce was good and fast for large volumes of data, so I think maybe the problem is the number of map and reduce tasks. I set the number of map and reduce tasks with JobConf, with Job, and also in conf/mapred-site.xml, but I don't see any changes. In my logs, at first it shows map 0% reduce 0%, and after about 2 hours of work it shows map 1% reduce 0%!
What should I do? Please help me, I am really confused.
Please consider these points to check where the bottleneck might be:
(1) Merely configuring a higher number of map or reduce tasks won't do; you need hardware to support it. Hadoop is fast, but to process a huge input like the one you mention, you need more map and reduce tasks running in parallel. To achieve that you need more processors, and to get more processors you need more machines (nodes). For example, if you have 2 machines with 8 processors each, you get a total processing power of around 16, so 16 map and reduce tasks can run in parallel, and the next set of tasks starts as soon as any of those 16 slots becomes free. When you add one more machine with 8 processors, you have 24.
(2) The algorithms you use for map and reduce. Even if you have the processing power, that doesn't mean your Hadoop application will perform well unless your algorithm performs well. It might be the case that a single map task takes forever to complete.
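As a rough illustration of why the settings in JobConf and mapred-site.xml seemed to have no effect (the class name and numbers below are assumptions, not the asker's code): the reduce count can be set directly on the job, while the map count is always derived from the input splits produced by the configured InputFormat, here Cassandra's.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TaskCountExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "cassandra-job");

        // Honored: equivalent to -D mapreduce.job.reduces=16 on the command line.
        job.setNumReduceTasks(16);

        // Only a hint: the actual number of map tasks equals the number of
        // input splits returned by the InputFormat, not this value.
        job.getConfiguration().setInt("mapreduce.job.maps", 100);

        // ... configure the Cassandra input format, mapper, reducer, and output as usual ...
    }
}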
