Hadoop mapreduce.job.reduces in generic option syntax?

I am trying to set the number of reducers from the command line, but it seems I am using the wrong syntax. I am using Hadoop 2.5 (YARN) with MR2.
hadoop jar mrjobs-0.1.jar com.example.Weather -D mapreduce.job.reduces=2 datasets/inputs output
The command fails when I add the -D option; without it, it works fine.
Any help appreciated!
Thanks!

The syntax looks correct. I tested the following against Hadoop 2.5 (YARN, MR2) and it works:
hadoop jar hadoop-mapreduce-examples.jar wordcount -Dmapreduce.job.reduces=5 input output
Most probably the problem is that your driver class doesn't implement the Tool interface, which works in coordination with GenericOptionsParser to parse generic command-line arguments.
Here is an example of how to implement Tool (and run it through ToolRunner) in your MapReduce driver class:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ExampleDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: ExampleDriver <in> <out>");
            System.exit(2);
        }
        Configuration conf = getConf(); // already contains values parsed from -D options
        Job job = Job.getInstance(conf);
        job.setJobName("example driver");
        job.setJarByClass(ExampleDriver.class);
        job.setMapperClass(YourMapper.class);    // substitute your mapper
        job.setReducerClass(YourReducer.class);  // substitute your reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the generic options (-D, -files, -libjars, ...)
        // before handing the remaining arguments to run()
        int res = ToolRunner.run(new Configuration(), new ExampleDriver(), args);
        System.exit(res);
    }
}
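With Tool and ToolRunner in place, the command from the question should work as-is, since the generic options come before the job's own arguments:
hadoop jar mrjobs-0.1.jar com.example.Weather -D mapreduce.job.reduces=2 datasets/inputs output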

Related

Is the generic option -D not supported in Hadoop 0.20.2?

I am trying to set a configuration property from the console using the -D generic option.
Here is my console input:
$ hadoop jar hadoop-0.20.2/gtee.jar dd.MaxTemperature -D file.pattern=2007.* /inputdata /outputdata
But when I cross-verified from the code with
Configuration conf;
System.out.println(conf.get("file.pattern"));
the result is null. What would be the problem here? Why is the value of the property "file.pattern" not displaying? Can anyone please help me.
Thanks
EDITED SECTION:
Driver Code:
public int run(String[] args) throws Exception {
    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("MaxTemperature");
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
    // System.out.println(conf.get("file.pattern"));
    if (fs.exists(new Path(args[1]))) {
        fs.delete(new Path(args[1]), true);
    }
    System.out.println(conf.get("file.pattern"));
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileInputFormat.setInputPathFilter(job, RegexExcludePathFilter.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setMapperClass(MapMapper.class);
    job.setCombinerClass(Mapreducers.class);
    job.setReducerClass(Mapreducers.class);
    return job.waitForCompletion(true) ? 0 : 1;
}

public static void main(String[] args) throws Exception {
    int xx = ToolRunner.run(new Configuration(), new MaxTemperature(), args);
    System.exit(xx);
}
Path filter implementation:
public static class RegexExcludePathFilter extends Configured implements PathFilter {
    // String pattern = "2007.[0-1]?[0-2].[0-9][0-9].txt";
    Configuration conf;
    Pattern pattern;

    @Override
    public boolean accept(Path path) {
        Matcher m = pattern.matcher(path.toString());
        return m.matches();
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
        // Pattern.compile throws a NullPointerException if "file.pattern" was never set
        pattern = Pattern.compile(conf.get("file.pattern"));
        System.out.println(pattern);
    }
}
To confirm: yes, the -D option is supported in version 0.20.2, but it requires you to implement the Tool interface so that the variables are read from the command line:
Configuration conf = new Configuration(); // this is the issue: a fresh Configuration never sees the -D values
// when implementing Tool, use this instead:
Configuration conf = this.getConf();
You are passing it with a space in between; that's not how you should do it. Instead, try:
-Dfile.pattern=2007.*
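Putting the two fixes together, the invocation from the question would look like this (the pattern is quoted so the shell does not expand it):
hadoop jar hadoop-0.20.2/gtee.jar dd.MaxTemperature -Dfile.pattern='2007.*' /inputdata /outputdata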

How to run the word count job on Hadoop YARN from Java code?

I have a requirement like below:
There is a 30-node Hadoop YARN cluster and a client machine for job submission.
Let's use the wordcount MR example, since it's world famous. I'd like to submit and run the wordcount MR job from a Java method.
So what code is required to submit the job? Is anything specific to be configured on the client machine?
Hadoop should be installed on your client machine, with the same configuration as the other machines in your Hadoop cluster.
To submit the MR job from a Java method, look at Java's ProcessBuilder and pass it the hadoop command that launches your wordcount example.
The command and the application-specific requirements for wordcount can be found here.
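A minimal sketch of that approach (the jar name and HDFS paths are placeholders, and the hadoop binary is assumed to be on the client's PATH):
import java.io.IOException;

public class WordCountLauncher {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Build the same command line you would type in a shell
        ProcessBuilder pb = new ProcessBuilder(
                "hadoop", "jar", "hadoop-mapreduce-examples.jar",
                "wordcount", "/input", "/output");
        pb.inheritIO(); // stream the job client's console output to this process
        Process process = pb.start();
        int exitCode = process.waitFor(); // 0 means the job client exited cleanly
        System.out.println("hadoop exited with code " + exitCode);
    }
}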
You should make a class that implements Tool. Here is an example:
public class AggregateJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());
        job.setJarByClass(getClass());
        job.setJobName(getClass().getSimpleName());
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(ProjectionMapper.class);
        job.setCombinerClass(LongSumReducer.class);
        job.setReducerClass(LongSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int rc = ToolRunner.run(new AggregateJob(), args);
        System.exit(rc);
    }
}
This example was obtained from here. As @hamsa-zafar already says, the client machine should have the Hadoop configuration present, like any other node in the cluster.
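Packed into a jar (the jar name below is a placeholder), it would then be submitted with:
hadoop jar aggregatejob.jar AggregateJob /input /output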

Hadoop: how to include third-party jars when running a MapReduce job

As we know, we need to pack all required classes into the job jar and upload it to the server, which is slow. I would like to know whether there is a way to specify third-party jars when executing a MapReduce job, so that I only have to pack my own classes, without the dependencies.
PS: I found there is a "-libjars" option, but I couldn't figure out how to use it. Here is the link: http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
Those are called generic options.
So, to support them, your job should implement Tool.
Run your job like:
hadoop jar yourfile.jar [mainClass] args -libjars <comma-separated list of jars>
Edit:
To implement Tool and extend Configured, you do something like this in your MapReduce application:
public class YourClass extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new YourClass(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        // parse your normal arguments here
        Configuration conf = getConf();
        Job job = new Job(conf, "Name of job");
        // set the mapper/reducer class names etc.
        // set the output data type classes etc.
        // accept the HDFS input and output dirs at run time
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
For me, I had to specify the -libjars option before the arguments; otherwise it was treated as a job argument.
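In other words, the generic option goes right after the main class and before the job's own arguments (jar names here are placeholders):
hadoop jar yourfile.jar [mainClass] -libjars lib1.jar,lib2.jar /input /output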

Multiple ways to write the driver of a Hadoop program: which one to choose?

I have observed that there are multiple ways to write the driver method of a Hadoop program.
The following method is given in the Hadoop Tutorial by Yahoo:
public void run(String inputPath, String outputPath) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    // the keys are words (strings)
    conf.setOutputKeyClass(Text.class);
    // the values are counts (ints)
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.addInputPath(conf, new Path(inputPath));
    FileOutputFormat.setOutputPath(conf, new Path(outputPath));
    JobClient.runJob(conf);
}
and this method is given in the book Hadoop: The Definitive Guide (O'Reilly, 2012):
public static void main(String[] args) throws Exception {
    if (args.length != 2) {
        System.err.println("Usage: MaxTemperature <input path> <output path>");
        System.exit(-1);
    }
    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
While trying the program given in the O'Reilly book, I found that the constructors of the Job class are deprecated. Since the book is based on Hadoop 2 (YARN), I was surprised to see that they used a deprecated class.
I would like to know which method everyone uses.
I use the former approach. If we go with overriding the run() method, we can use hadoop jar options like -D, -libjars, -files, etc. All of these are necessary in almost any Hadoop project.
I am not sure whether we can use them through the main() method alone.
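For example, a Tool-based driver accepts all of these in one invocation (the jar and class names are placeholders):
hadoop jar myjob.jar MyDriver -D mapreduce.job.reduces=4 -libjars extra.jar -files lookup.txt input output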
Slightly different from your first (Yahoo) block: you should be using the ToolRunner / Tool classes, which take advantage of GenericOptionsParser (as noted in Eswara's answer).
A template pattern would be something like:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ToolExample extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // old API
        JobConf jobConf = new JobConf(getConf());
        // new API
        Job job = new Job(getConf());

        // rest of your config here

        // determine success / failure (depending on your choice of old / new API)
        // return 0 for success, non-zero for an error
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new ToolExample(), args));
    }
}

How to use JobControl in Hadoop

I want to merge two files into one.
I made two mappers to read them and one reducer to join them.
JobConf classifiedConf = new JobConf(new Configuration());
classifiedConf.setJarByClass(myjob.class);
classifiedConf.setJobName("classifiedjob");
FileInputFormat.setInputPaths(classifiedConf, classifiedInputPath);
classifiedConf.setMapperClass(ClassifiedMapper.class);
classifiedConf.setMapOutputKeyClass(TextPair.class);
classifiedConf.setMapOutputValueClass(Text.class);
Job classifiedJob = new Job(classifiedConf);
// first mapper config

JobConf featureConf = new JobConf(new Configuration());
featureConf.setJobName("featureJob");
featureConf.setJarByClass(myjob.class);
FileInputFormat.setInputPaths(featureConf, featuresInputPath);
featureConf.setMapperClass(FeatureMapper.class);
featureConf.setMapOutputKeyClass(TextPair.class);
featureConf.setMapOutputValueClass(Text.class);
Job featureJob = new Job(featureConf);
// second mapper config

JobConf joinConf = new JobConf(new Configuration());
joinConf.setJobName("joinJob");
joinConf.setJarByClass(myjob.class);
joinConf.setReducerClass(JoinReducer.class);
joinConf.setOutputKeyClass(Text.class);
joinConf.setOutputValueClass(Text.class);
Job joinJob = new Job(joinConf);
// reducer config

// JobControl config (secondJob is configured elsewhere in the original code)
joinJob.addDependingJob(featureJob);
joinJob.addDependingJob(classifiedJob);
secondJob.addDependingJob(joinJob);
JobControl jobControl = new JobControl("jobControl");
jobControl.addJob(classifiedJob);
jobControl.addJob(featureJob);
jobControl.addJob(secondJob);
Thread thread = new Thread(jobControl);
thread.start();
// poll until all jobs are done, then stop the control thread
// (the original looped while allFinished() was true, which exits immediately)
while (!jobControl.allFinished()) {
    Thread.sleep(500);
}
jobControl.stop();
But I get these messages:
WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
Can anyone help, please?
Which version of Hadoop are you using?
Does the warning you get stop the program?
You don't need to use setJarByClass(). You can see from my snippet that I run the job without calling it:
JobConf job = new JobConf(PageRankJob.class);
job.setJobName("PageRankJob");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(PageRankMapper.class);
job.setReducerClass(PageRankReducer.class);
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
JobClient.runJob(job);
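Note that new JobConf(PageRankJob.class) is what sets the job jar here; it is the JobConf(Class) alternative that the "No job jar file set" warning itself points at.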
You should implement your Job this way:
public class MyApp extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        // Configuration processed by ToolRunner
        Configuration conf = getConf();

        // Create a JobConf using the processed conf
        JobConf job = new JobConf(conf, MyApp.class);

        // Process custom command-line options
        Path in = new Path(args[1]);
        Path out = new Path(args[2]);

        // Specify various job-specific parameters
        job.setJobName("my-app");
        job.setInputPath(in);
        job.setOutputPath(out);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // Submit the job, then poll for progress until the job is complete
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // Let ToolRunner handle generic command-line options
        int res = ToolRunner.run(new Configuration(), new MyApp(), args);
        System.exit(res);
    }
}
This comes straight out of Hadoop's documentation here.
So basically your job needs to inherit from Configured and implement Tool. This will force you to implement run(). Then start your job from your main class using ToolRunner.run(<your job>, <args>) and the warning will disappear.
You need to have this code in your driver: job.setJarByClass(MapperClassName.class);
