I'm running canopy clustering algorithm using mahout.
This is the command I'm running through mahout Command line.
mahout canopy -i /mahout/o_seqsparse/tfidf-vectors -o /mahout/o_canopy -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure -ow -t1 100 -t2 50
Below is the number of map and reduce tasks running:
Number of map tasks running --> 6
Number of reduce tasks running --> 1
But this is taking too much time because of the single reducer. I think that if I can increase the number of reduce tasks, I will get better performance.
I also tried increasing the number of map and reduce tasks via mapred-site.xml (mapred.map.tasks, mapred.reduce.tasks), but this has no effect; it still runs with 1 reducer.
As Abhiroop Sarkar mentions in his answer, the single reducer is hard-coded. However, it is not simply a matter of how much you would benefit from more reducers: you must not use more than one reducer, or the algorithm will not run correctly. The reason is that the single reducer at this step compares all the canopy centers to each other, making sure that no two of them are "too close".
So, what you correctly identified as the bottleneck of this algorithm cannot be changed. In fact, if you have too many canopy centers, it will also run out of memory. Not an ideal translation of the original sequential algorithm IMHO, since it cannot fully exploit parallelism, but it's the only one available (and/or possible) in MapReduce.
In a nutshell, the single reducer is what guarantees that the returned canopy centers are far enough away from each other. Using more reducers would give wrong results.
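To make the intuition concrete, here is a conceptual sketch of the center-filtering step that forces a single reducer. This is not Mahout's actual reducer code; the class, method, and variable names are made up for illustration. The point is that every candidate must be compared against the centers accepted so far, which only works if one task sees the full candidate set; split the loop across several reducers and two candidates within T2 of each other could land in different partitions and both be accepted.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.math.Vector;

// Conceptual sketch (NOT Mahout's actual code) of the "keep centers apart" step.
final class CanopyCenterFilterSketch {
  static List<Vector> filterCenters(Iterable<Vector> candidates,
                                    DistanceMeasure measure,
                                    double t2) {
    List<Vector> accepted = new ArrayList<>();
    for (Vector candidate : candidates) {       // must iterate over ALL candidates
      boolean tooClose = false;
      for (Vector center : accepted) {
        if (measure.distance(candidate, center) < t2) {  // within T2 of an accepted center
          tooClose = true;
          break;
        }
      }
      if (!tooClose) {
        accepted.add(candidate);
      }
    }
    return accepted;
  }
}
```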
You didn't specify the version of Mahout you are using, but looking at the source code of 0.4 here: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.4/org/apache/mahout/clustering/canopy/CanopyDriver.java
you can see that 1 reducer is hard-coded. I don't think you can override it.
EDIT
For version 0.9, as you specified, check here: http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.9/org/apache/mahout/clustering/canopy/CanopyDriver.java/ at line 354:
job.setNumReduceTasks(1);
Modify this and build again. However, the map output must be sent to one reducer; in the case of clustering I don't believe you will benefit much from increasing the number of reducers.
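For orientation, the job setup around that line looks roughly like the sketch below. This is a paraphrase, not a verbatim copy of the linked 0.9 source, and the surrounding wiring (input/output formats, key/value classes) is omitted:

```java
// Paraphrased sketch of the buildClusters job setup in Mahout 0.9's CanopyDriver.
Job job = new Job(conf, "Canopy Driver running buildClusters over input: " + input);
job.setMapperClass(CanopyMapper.class);
job.setReducerClass(CanopyReducer.class);
// Hard-coded single reducer: all candidate centers from every mapper must meet in
// one reduce task so that centers closer than T2 to each other can be filtered out.
job.setNumReduceTasks(1);
```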
I am trying to figure out how much time each step takes in the simple Hadoop wordcount example.
In this example 3 mappers and 1 reducer are used, and each map generates ~7 MB of shuffle data. I have a cluster connected via 1 Gb switches. When I look at the job details, I see that shuffling takes ~7 sec after all map tasks are completed, which is more than I would expect for transferring such a small amount of data. What could be the reason behind this? Thanks
Hadoop uses heartbeats to communicate with nodes. By default Hadoop uses a minimum heartbeat interval of 3 seconds. Consequently Hadoop completes your task within two heartbeats (roughly 6 seconds).
More details: https://issues.apache.org/jira/browse/MAPREDUCE-1906
The transfer is not the only thing that happens after the map step. Each mapper writes its output locally and sorts it, partitioned by reducer. The reducer tasked with a particular partition then gathers its piece from each mapper's output, each requiring a transfer of 7 MB. The reducer then has to merge these segments into a final sorted file.
Honestly though, the scale you are testing on is absolutely tiny. I don't know all the parts of the Hadoop shuffle step, which I understand has some involved details, but you shouldn't expect performance of such small files to be indicative of actual performance on larger files.
I think the shuffling started once the first mapper was done, but then waited for the remaining two mappers.
There is an option to start the reduce phase (which begins with shuffling) only after all the mappers have finished, but that does not really speed anything up.
(BTW, 7 seconds is considered fast in Hadoop. Hadoop is poor in performance, especially for small files. Unless somebody else is paying for this, don't use Hadoop.)
In Hadoop MapReduce, no reducer starts before all mappers are finished. Can someone please explain to me in which part/class/code line this logic is implemented? I am talking about Hadoop MapReduce version 1 (NOT YARN). I have searched the MapReduce framework, but there are so many classes and I don't really understand the method calls and their ordering.
In other words, I need (first for test purposes) to let the reducers start reducing even while there are still mappers working. I know that this way I will get false results for the job, but for now this is the start of some work on changing parts of the framework. So where should I start to look and make changes?
This is done in the shuffle phase. For Hadoop 1.x, take a look at org.apache.hadoop.mapred.ReduceTask.ReduceCopier, which implements ShuffleConsumerPlugin. You may also want to read the "Breaking the MapReduce Stage Barrier" research paper by Verma et al.
EDIT:
After reading @chris-white's answer, I realized that my answer needed an extra explanation. In the MapReduce model, you need to wait for all mappers to finish, since the keys need to be grouped and sorted; plus, you may have some speculative mappers running and you do not yet know which of the duplicate mappers will finish first. However, as the "Breaking the MapReduce Stage Barrier" paper indicates, for some applications it may make sense not to wait for all of the mappers' output. If you want to implement this sort of behavior (most likely for research purposes), then you should take a look at the classes I mentioned above.
Some points for clarification:
A reducer cannot start reducing until all mappers have finished, their partitions have been copied to the node where the reducer task is running, and those partitions have finally been sorted.
What you may see is a reducer pre-emptively copying map outputs while other map tasks are still running. This is controlled via a configuration property known as slowstart (mapred.reduce.slowstart.completed.maps). This value represents the ratio (0.0 - 1.0) of map tasks that need to have completed before the reduce tasks start up (copying over the map outputs from those map tasks that have completed). The default value is usually around 0.9, meaning that if you have 100 map tasks for your job, 90 of them would need to finish before the job tracker can start launching the reduce tasks.
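If you want to experiment with slowstart, it can also be set per job. Below is a minimal hedged sketch; the class name, job name, and the 0.5 ratio are illustrative choices, not values from any particular job:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Hedged sketch: let this job's reducers start copying map output once half of the
// map tasks have finished. Pick whatever ratio suits your job.
public class SlowstartExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.5f);
    Job job = new Job(conf, "slowstart-example");
    // ... set mapper/reducer/input/output as usual, then job.waitForCompletion(true)
  }
}
```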
This is all controlled by the job tracker, in the JobInProgress class, lines 775, 1610, 1664.
I am using Mahout 0.7's MatrixMultiplicationJob for multiplying a large matrix, but it always uses 1 map task, which makes it slow. It's probably due to the InputSplit, which forces the number of mappers to be 1.
Is there a way I can efficiently multiply matrices in Hadoop / Mahout or change the number of mappers?
Ultimately, it is Hadoop that decides how many mappers to use. Generally it will use one mapper per HDFS block (typically 64 or 128MB). If your data is smaller than that, it's too small to bother with more than 1 mapper.
You can encourage it to use more anyway by setting mapred.max.split.size to something smaller than 64MB (remember the value is set in bytes, not MB). But, are you sure you want to? It is much more common to need more reducers, not mappers, since Hadoop will never use more than 1 unless you (or your job) tells it to.
Also know that Hadoop will not be able to use more than one mapper on a single compressed file. So if your input is one huge compressed file, it will only ever use 1 mapper on that file. You can however split it up yourself into many smaller compressed files.
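As a hedged sketch of that split-size tweak (the class name, job name, and the 16 MB value are illustrative, not from Mahout):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Hedged sketch: cap the split size at 16 MB (the value is in bytes) so Hadoop is
// willing to launch several mappers on data smaller than one HDFS block. This only
// helps for splittable inputs, not for a single gzipped file.
public class SmallSplitJob {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.setLong("mapred.max.split.size", 16L * 1024 * 1024);
    Job job = new Job(conf, "matrix-multiply-small-splits");
    // ... set mapper/reducer/input/output as usual, then job.waitForCompletion(true)
  }
}
```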
Have you tried specifying the number of mappers on the command line with the -Dmapred.map.tasks=N option? I haven't tried it, but it should work. If it doesn't, then try setting this parameter in the MAHOUT_OPTS environment variable...
For a client, I've been scoping out the short-term feasibility of running a Cloudera-flavor Hadoop cluster on AWS EC2. For the most part the results have been as expected, with the performance of the logical volumes being mostly unreliable; that said, doing what I can, I've got the cluster running reasonably well for the circumstances.
Last night I ran a full test of their importer script, which pulls data from a specified HDFS path and pushes it into HBase. Their data is somewhat unusual in that the records are less than 1 KB apiece and have been condensed together into 9 MB gzipped blocks. In total there are about 500K text records that get extracted from the gzips, sanity checked, then pushed on to the reducer phase.
The job runs within expectations for the environment (the amount of spilled records is expected by me), but one really odd problem is that the job runs with 8 reducers, yet 2 reducers do 99% of the work while the remaining 6 do a fraction of it.
My so-far-untested hypothesis is that I'm missing a crucial shuffle or block-size setting in the job configuration, which causes most of the data to be pushed into partitions that can only be consumed by 2 reducers. Unfortunately, the last time I worked on Hadoop, another client's data set was in 256 GB LZO files on a physically hosted cluster.
To clarify, my question: is there a way to tweak an M/R job to actually utilize more of the available reducers, either by lowering the output size of the maps or by causing each reducer to cut down the amount of data it will parse? Even an improvement to 4 reducers from the current 2 would be a major improvement.
It seems like you are getting hotspots in your reducers. This is likely because a particular key is very popular. What are the keys as the output of the mapper?
You have a couple of options here:
Try more reducers. Sometimes, you get weird artifacts in the randomness of the hashes, so having a prime number of reducers sometimes helps. This will likely not fix it.
Write a custom partitioner that spreads out the work better (see the sketch just after this list).
Figure out why a bunch of your data is getting binned into two keys. Is there a way to make your keys more unique to split up the work?
Is there anything you can do with a combiner to reduce the amount of traffic going to the reducers?
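For option 2, here is a minimal sketch of a custom partitioner. The key/value types and class name are placeholders, not taken from your job. Note that this only helps if distinct keys happen to collide under the default hash; if a single key genuinely carries most of the records, you would instead append a small salt to the key in the mapper and strip it again in the reducer.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Minimal sketch of a custom partitioner: re-mixes the key's hash so that keys
// which collide under the default HashPartitioner spread over more reducers.
public class SpreadingPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    int h = key.toString().hashCode();
    h ^= (h >>> 16);                          // extra mixing of the high bits
    return (h & Integer.MAX_VALUE) % numPartitions;
  }
}
```

You would register it on the job with job.setPartitionerClass(SpreadingPartitioner.class).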
I am currently using the wordcount application in Hadoop as a benchmark. I find that the CPU usage is fairly constant, around 80-90%. I would like to have a fluctuating CPU usage. Is there any Hadoop application that can give me this capability? Thanks a lot.
I don't think there's a way to throttle or specify a range for hadoop to use. Hadoop will use the CPU available to it. When I'm running a lot of jobs, I'm constantly in the 90%+ range.
One way you can control the CPU usage is to change the maximum number of mappers/reducers each tasktracker can run simultaneously. This is done through the
mapred.tasktracker.{map|reduce}.tasks.maximum setting in $HADOOP_HOME/conf/mapred-site.xml.
A tasktracker will use less CPU when the number of mappers/reducers it may run is limited.
Another way is to set the configuration values mapred.{map|reduce}.tasks when setting up the job. This asks the job to use that many mappers/reducers (for maps it is only a hint; for reduces it is honored). This number is split across the available tasktrackers, so if you have 4 nodes and want each node to run 1 mapper, you'd set mapred.map.tasks to 4. It's also possible that if one node can run 4 mappers it will run all 4; I don't know exactly how Hadoop will split out the tasks, but forcing a number per job is an option.
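A hedged sketch of the per-job approach (the class name, job name, and the counts are placeholders):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Hedged sketch: cap how much parallel work a single benchmark job asks for.
// mapred.map.tasks is only a hint to the framework; mapred.reduce.tasks is honored.
public class LowParallelismJob {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    conf.setInt("mapred.map.tasks", 4);      // hint: roughly one map per node on a 4-node cluster
    conf.setInt("mapred.reduce.tasks", 1);   // exactly one reducer
    Job job = new Job(conf, "wordcount-low-parallelism");
    // ... set mapper/reducer/input/output as usual, then job.waitForCompletion(true)
  }
}
```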
I hope that helps get you to where you're going. I still don't quite understand what you are looking for. :)