I wrote a relatively simple map-reduce program on the Hadoop platform (Cloudera distribution). Each map and reduce task writes some diagnostic information to standard output in addition to its regular work.
However, when I looked at these log files, I found that the map tasks are spread fairly evenly across the nodes (I have 8 nodes), but the reduce tasks' standard output logs can be found on only one machine.
I guess that means all the reduce tasks ended up executing on a single machine, which is problematic and confusing.
Does anybody have any idea what's happening here? Is it a configuration problem?
How can I make the reduce tasks distribute evenly as well?
If all the output from your mappers has the same key, it will all go to a single reducer.
If your job has multiple reducers, but they all queue up on a single machine, then you have a configuration issue.
Use the web interface (http://MACHINE_NAME:50030) to monitor the job and see how many reducers it has and which machines are running them. You can drill into the other information there, which should help in figuring out the issue.
A couple of questions about your configuration:
How many reducers are running for the job?
How many reducers are available on each node?
Is the node running the reducer better hardware than the other nodes?
Hadoop decides which reducer will process which output keys by using a Partitioner.
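To make the partitioning step concrete, here is a minimal plain-Java sketch of the arithmetic that the default HashPartitioner applies: (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. It is an illustration, not Hadoop code. The class name HashPartitionDemo and the String keys are assumptions made for the example, and a real Text key hashes differently from a String. The point is only that a given key always maps to the same reducer, so a job that emits a single distinct key keeps one reducer busy no matter how many are configured.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; it only mimics the default hash partitioning arithmetic.
class HashPartitionDemo {
    // Clear the sign bit so the result is never negative, then take the modulus.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 8; // assumption: one reducer per node in an 8-node cluster
        List<String> keys = Arrays.asList("A", "B", "C", "D", "A", "A");
        for (String key : keys) {
            // Every occurrence of the same key prints the same reducer index.
            System.out.println(key + " -> reducer " + partitionFor(key, numReduceTasks));
        }
    }
}
```

If all of your keys land on one or two indices here, a custom Partitioner (or more varied keys) is probably what you need.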
If you are only outputting a few keys and want an even distribution across your reducers, you may be better off implementing a custom Partitioner for your output data, e.g.
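import org.apache.hadoop.mapreduce.Partitioner;

// KEY and VALUE stand for your map output key and value types.
public class MyCustomPartitioner extends Partitioner<KEY, VALUE>
{
    @Override
    public int getPartition(KEY key, VALUE value, int numPartitions) {
        // do something based on the key or value to determine which
        // partition we want it to go to; the result must be in [0, numPartitions).
        // Placeholder: the same arithmetic as the default hash partitioner.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}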
You can then set this custom partitioner in the job configuration with
Job job = new Job(conf, "My Job Name");
job.setPartitionerClass(MyCustomPartitioner.class);
You can also implement the Configurable interface in your custom Partitioner if you want to do any further configuration based on job settings.
Also, check that you haven't set the number of reduce tasks to 1 anywhere in the configuration (look for "mapred.reduce.tasks") or in code, e.g.
job.setNumReduceTasks(1);
I have a huge data set and I need to perform four different operations on the same data.
I would like to have four output files. Since the four operations are different, can I use four partitioners and four reducers to implement this? Is that possible, or do I need to write four jobs? Please help me!
First Approach
I think you should implement the code in a single reduce method and emit n keys, one for each process performed. For example, if you implement techniques A, B, C and D, then in your mapper you could do this (pseudo-code):
dataA = ProcessA(key,value)
context.write("A", dataA)
dataB = ProcessB(key,value)
context.write("B", dataB)
dataC = ProcessC(key,value)
context.write("C", dataC)
dataD = ProcessD(key,value)
context.write("D", dataD)
You should be careful about the data types of the output. Also, the output key could be more complex than a single letter.
Second Approach
You could create N MapReduce applications in the same Java project, re-using the same Map class and developing N reducers.
In each main class, you set the corresponding Reducer with job.setReducerClass. The Map will be the same.
You just need to specify the number of reducers in your MapReduce job config. The default partitioner will distribute data to the reducers based on the hash of the key modulo the number of reducers specified.
To override the behavior of the default partitioner, you can implement your own custom partitioner that specifies how your data should be distributed across the reducers.
---Edit to answer questions in the comments section---
How can i specify more than one reducer class in the Map-reduce driver
A job has only one reducer class; what you can set is the number of reducer tasks. In the job conf you can set it like below -
int numReducers = /*number of reducers you want*/;
job.setNumReduceTasks(numReducers);
Whether I should write four different Jobs for this. Or can I do this with a single Job
Hadoop MR jobs are I/O intensive, so in your MR job design you should work on minimizing I/O and maximizing parallel processing as much as possible.
If your reducers need the same input to generate all 4 outputs, it will be better to keep a single job, but another consideration is data skew towards one of the outputs.
For example, output1 may take more processing time, and most of the incoming data may end up being processed for output1.
If you have a scenario where the time taken to process output1 is much higher than the total time taken to process output2 + output3 + output4, then you should consider splitting the processing of output1 into multiple steps.
However, if all 4 outputs have more or less equal processing times and consume the same data throughout, it will be better to have some conditional processing logic in the reducer and let your custom partitioner decide which data goes to which reducer.
Your custom partitioner can have a check along the lines of: this incoming data contributes to "GC content", so let it go to Reducer 3.
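As a rough sketch of that kind of check, here is the routing logic in plain Java, outside of Hadoop. Everything named here is an assumption for illustration: the class TagPartitionDemo, the tags "A" to "D" (borrowed from the first approach above), and the String key and value types. In a real job the same logic would live in the getPartition method of a class extending Partitioner.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; it mirrors the shape of Partitioner.getPartition with plain Strings.
class TagPartitionDemo {
    // Assumed operation tags, one per desired output.
    static final List<String> TAGS = Arrays.asList("A", "B", "C", "D");

    static int getPartition(String key, String value, int numPartitions) {
        int index = TAGS.indexOf(key);
        if (index < 0) {
            // Unknown tag: fall back to hash partitioning so the record still goes somewhere.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
        // Known tag: each tag gets its own reducer when numPartitions >= TAGS.size().
        return index % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 4; // assumption: job.setNumReduceTasks(4)
        for (String tag : TAGS) {
            System.out.println(tag + " -> reducer " + getPartition(tag, "some value", numPartitions));
        }
    }
}
```

With four reducers, each tag ends up on its own reducer, so each reducer's output file holds the result of exactly one operation.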
But if your incoming data needs to be processed for more than one output/distribution, use conditional processing, and use "MultipleOutputs" to write multiple output files from the same reducer.
You can search for usage examples; it lets you write output to multiple folders/files at the same time from within a Mapper or Reducer.
Hadoop lets you specify the number of reducer tasks from the job driver with job.setNumReduceTasks(num_reducers). Since you want four outputs, you would set int num_reducers = 4. Here's an example driver class.
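import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class run {
    public static void main(String[] args) throws Exception {
        int num_reducers = 4; // one reducer task per desired output
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Run NB Count");
        job.setJarByClass(NB_train_hadoop.class);
        // set mappers, reducers, other stuff
        job.setNumReduceTasks(num_reducers);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}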
While this is handy, you have to understand that there is an optimal number of reducers you can choose which is dependent on the number of nodes in your cluster.
For example, running on 4 Amazon m3.xlarge instances (1 master, 3 slaves, 4 cores per instance) shows a clear relationship between wall time and the number of reducer tasks used in the MapReduce job: more isn't necessarily better, and if you use too many, well then you might as well crunch your data with your mother's hair curler because it would be faster that way.
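For a starting point, the Hadoop MapReduce documentation suggests a rule of thumb: 0.95 or 1.75 times (number of worker nodes × maximum reduce slots per node). The sketch below just does that arithmetic. The class name ReducerCountDemo is made up, and treating each of the 4 cores as one reduce slot is my assumption, not something measured on the cluster above, so take the numbers as a first guess to benchmark against.

```java
// Hypothetical helper; it only evaluates the documented rule of thumb.
class ReducerCountDemo {
    static int suggestedReducers(int workerNodes, int reduceSlotsPerNode, double factor) {
        return (int) Math.floor(factor * workerNodes * reduceSlotsPerNode);
    }

    public static void main(String[] args) {
        int workerNodes = 3;        // the 3 slaves from the example cluster
        int reduceSlotsPerNode = 4; // assumption: one reduce slot per core
        // 0.95: all reducers can start at once and finish in a single wave.
        System.out.println("single wave: " + suggestedReducers(workerNodes, reduceSlotsPerNode, 0.95));
        // 1.75: faster nodes pick up a second wave, which helps load balancing.
        System.out.println("two waves:   " + suggestedReducers(workerNodes, reduceSlotsPerNode, 1.75));
    }
}
```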
Hope this is helpful!!
After a job has completed, how can I find out how many nodes actually ran it, and how many map tasks and reduce tasks it used?
Thanks....
You can use the JobTracker web UI for that. It runs on port 50030 by default, so the URL would look like http://myhost:50030/.
Once you go there, you can see how many mappers and how many reducers were used by your job. You can explore further by clicking on the job link itself.
I want to run one task (mapper) per node on my Hadoop cluster, but I cannot modify the configuration the tasktracker runs with (I'm just a user).
For this reason, I need to be able to push the option through the job configuration. I tried setting mapred.tasktracker.map.tasks.maximum=1 on the hadoop jar command line, but the tasktracker ignores it because it has a different setting in its own configuration file.
By the way, the cluster uses the Capacity Scheduler.
Is there any way I can force 1 task per node?
Edited:
Why? I have a memory-bound task, so I want each task to use all the memory available to the node.
When you set the number of mappers, whether through the configuration files or by some other means, it's just a hint to the framework; it doesn't guarantee that you'll get only that many mappers. The number of mappers is actually governed by the number of input splits, and the splits are created by the logic in your InputFormat. If you really want a single mapper to process an entire file, override isSplitable() in the InputFormat class you are using so that it returns false. But why would you do that? The power of Hadoop lies in distributed parallel processing.
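To show how splits rather than the requested mapper count drive the number of map tasks, here is a simplified plain-Java model of what FileInputFormat does for one file: the split size is max(minSize, min(maxSize, blockSize)), and the file is carved into chunks of that size, with a final chunk of up to 1.1 split sizes left whole. This is a sketch under assumptions, not the Hadoop source: the class name SplitCountDemo is made up, the file is assumed to be non-empty, and block locations and compression are ignored.

```java
// Hypothetical model of per-file split counting; not Hadoop code.
class SplitCountDemo {
    static final double SPLIT_SLOP = 1.1; // the last split may be up to 10% larger

    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Assumes fileLength > 0.
    static int countSplits(long fileLength, long splitSize, boolean splittable) {
        if (!splittable) {
            return 1; // isSplitable() returning false means one split, hence one mapper
        }
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // whatever is left becomes the final split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        System.out.println("1 GB file, splittable:     " + countSplits(1024 * mb, splitSize, true));
        System.out.println("1 GB file, not splittable: " + countSplits(1024 * mb, splitSize, false));
    }
}
```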
I have a 4-core desktop and want to use all my cores for local data processing with Hadoop
(i.e. sometimes I have enough power to process data locally, and sometimes I submit the same jobs to a cluster).
By default, Hadoop local mode runs only one mapper and one reducer, so my local jobs are really slow.
I do not want to set up a cluster on a single machine, first because of the "painful" configuration and second because I would have to create a jar each time. So the perfect solution would be a way to run embedded Hadoop on a single machine.
PS: pseudo-distributed mode is a bad option, since it will create a cluster with a single node, so I will get only one mapper and I will have to spend time on additional configuration.
You need to use MultithreadedMapRunner: just set it with JobConf's setMapRunnerClass method, and don't forget to set mapred.map.multithreadedrunner.threads to the desired concurrency level.
There is also another way. You should:
set MultithreadedMapper as your mapper class on the Job object
call MultithreadedMapper.setMapperClass with your actual mapper class
call MultithreadedMapper.setNumberOfThreads with the desired concurrency level
But be careful: your mapper class must be thread safe, and its setup and cleanup methods will be called several times, so it isn't a smart idea to mix MultithreadedMapper with MultipleOutputs unless you implement your own MultithreadedMapper-inspired class.
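To illustrate the warning above, here is a plain-Java model (no Hadoop involved) of several worker threads each going through their own setup, map, cleanup cycle. The class name MultithreadedModelDemo and the counters are invented for the sketch, and it assumes one cycle per worker thread as described above. It shows two things: setup and cleanup run once per thread rather than once per task, and anything the threads share has to be thread safe (here, atomic counters).

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of a multithreaded map phase; not Hadoop code.
class MultithreadedModelDemo {
    // Returns {setupCalls, cleanupCalls, recordsProcessed}.
    static long[] runWorkers(int threads, int recordsPerThread) {
        AtomicInteger setupCalls = new AtomicInteger();
        AtomicInteger cleanupCalls = new AtomicInteger();
        AtomicLong records = new AtomicLong(); // shared state, so it must be thread safe
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                setupCalls.incrementAndGet();         // "setup" runs in every worker
                for (int i = 0; i < recordsPerThread; i++) {
                    records.incrementAndGet();        // "map" one record
                }
                cleanupCalls.incrementAndGet();       // "cleanup" runs in every worker
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] {setupCalls.get(), cleanupCalls.get(), records.get()};
    }

    public static void main(String[] args) {
        long[] result = runWorkers(4, 1000);
        System.out.println("setup calls: " + result[0] + ", cleanup calls: " + result[1]
                + ", records: " + result[2]);
    }
}
```

Anything opened in setup and closed in cleanup, such as an output writer, is therefore opened and closed once per thread, which is why sharing one across threads needs care.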
Hadoop purposely does not run more than one task at the same time in one JVM for isolation purposes. And in stand-alone (local) mode, only one JVM is ever used. If you want to make use of your four cores, you should run in pseudo-distributed mode, and increase the max number of concurrent tasks to four. You can do this with the mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum properties.
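Configuration conf = new Configuration();
Job job = new Job(conf, "SolerRandomHit");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(MultithreadedMapper.class);
// MyMapper is a placeholder for your actual (thread-safe) mapper class.
MultithreadedMapper.setMapperClass(job, MyMapper.class);
MultithreadedMapper.setNumberOfThreads(job, 4); // e.g. one thread per core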
I want to add something to the Hadoop counters from outside a mapper.
So, I want to call getCounter on the context object like this:
context.getCounter(counter, key).increment(amount)
I'm not able to get the context object from where I start the job. I can only do
job.getCounters().findCounter()
which doesn't let me add something to the hadoop counters.
You can only use/write to the counters from within the mapper/reducer tasks. The job tracker has built-in capabilities to interact with the counters, and you don't really want to interfere with what is already a complex setup.
I had exactly this issue a few months ago, trying to use the counters to store interim information, but I decided to write the information I needed to a defined HDFS directory and read it once my job was complete.
EDIT: why, and for what, do you want to use the counter outside of the mapper?
EDIT #2: if you want stats from a finished job, then counters are not the right place for that, as a) they don't seem to be writable once the job tracker is done collecting data and b) they are intended to be used for aggregating metrics across tasks. I had a similar need recently and ended up doing my stats sums in the job set-up class (on my edge node) and then writing the data to the logs.