Multiple ways to write the driver of a Hadoop program: which one to choose?

I have observed that there are multiple ways to write the driver method of a Hadoop program.
The following method is given in the Hadoop Tutorial by Yahoo:
public void run(String inputPath, String outputPath) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
// the keys are words (strings)
conf.setOutputKeyClass(Text.class);
// the values are counts (ints)
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(MapClass.class);
conf.setReducerClass(Reduce.class);
FileInputFormat.addInputPath(conf, new Path(inputPath));
FileOutputFormat.setOutputPath(conf, new Path(outputPath));
JobClient.runJob(conf);
}
and this method is given in the book Hadoop: The Definitive Guide (O'Reilly, 2012):
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: MaxTemperature <input path> <output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("Max temperature");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
While trying the program given in the O'Reilly book, I found that the constructors of the Job class are deprecated. As the O'Reilly book is based on Hadoop 2 (YARN), I was surprised to see that it uses deprecated constructors.
I would like to know which method everyone uses.

I use the former approach. If we override the run() method (and launch the driver through ToolRunner), we can use hadoop jar options such as -D, -libjars and -files. These are necessary in almost any Hadoop project.
I am not sure whether we can use them through the main() method alone.

Slightly different from your first (Yahoo) block: you should be using the ToolRunner / Tool classes, which take advantage of the GenericOptionsParser (as noted in Eswara's answer).
A template pattern would be something like:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class ToolExample extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
// old API
JobConf jobConf = new JobConf(getConf());
// new API (Job.getInstance replaces the deprecated Job constructors)
Job job = Job.getInstance(getConf());
// rest of your config here
// determine success / failure (depending on your choice of old / new api)
// return 0 for success, non-zero for an error
return 0;
}
public static void main(String args[]) throws Exception {
System.exit(ToolRunner.run(new ToolExample(), args));
}
}

Related

Map-reduce job giving ClassNotFound exception even though mapper is present when running with yarn?

I am running a Hadoop job that works fine without YARN in pseudo-distributed mode, but it throws a ClassNotFoundException when I run it with YARN:
16/03/24 01:43:40 INFO mapreduce.Job: Task Id : attempt_1458775953882_0002_m_000003_1, Status : FAILED
Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.hadoop.keyword.count.ItemMapper not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:186)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:745)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class com.hadoop.keyword.count.ItemMapper not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
... 8 more
Here is the source-code for the job
Configuration conf = new Configuration();
conf.set("keywords", args[2]);
Job job = Job.getInstance(conf, "item count");
job.setJarByClass(ItemImpl.class);
job.setMapperClass(ItemMapper.class);
job.setReducerClass(ItemReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
Here is the command I am running
hadoop jar ~/itemcount.jar /user/rohit/tweets /home/rohit/outputs/23mar-yarn13 vodka,wine,whisky
Edit: code after the suggestion
package com.hadoop.keyword.count;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.Mapper.Context;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
public class ItemImpl {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.set("keywords", args[2]);
Job job = Job.getInstance(conf, "item count");
job.setJarByClass(ItemImpl.class);
job.setMapperClass(ItemMapper.class);
job.setReducerClass(ItemReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
public static class ItemMapper extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
JSONParser parser = new JSONParser();
@Override
public void map(Object key, Text value, Context output) throws IOException,
InterruptedException {
JSONObject tweetObject = null;
String[] keywords = this.getKeyWords(output);
try {
tweetObject = (JSONObject) parser.parse(value.toString());
} catch (ParseException e) {
e.printStackTrace();
}
if (tweetObject != null) {
String tweetText = (String) tweetObject.get("text");
if(tweetText == null){
return;
}
tweetText = tweetText.toLowerCase();
/* StringTokenizer st = new StringTokenizer(tweetText);
ArrayList<String> tokens = new ArrayList<String>();
while (st.hasMoreTokens()) {
tokens.add(st.nextToken());
}*/
for (String keyword : keywords) {
keyword = keyword.toLowerCase();
if (tweetText.contains(keyword)) {
output.write(new Text(keyword), one);
}
}
output.write(new Text("count"), one);
}
}
String[] getKeyWords(Mapper<Object, Text, Text, IntWritable>.Context context) {
Configuration conf = (Configuration) context.getConfiguration();
String param = conf.get("keywords");
return param.split(",");
}
}
public static class ItemReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
protected void reduce(Text key, Iterable<IntWritable> values, Context output)
throws IOException, InterruptedException {
int wordCount = 0;
for (IntWritable value : values) {
wordCount += value.get();
}
output.write(key, new IntWritable(wordCount));
}
}
}
Running in fully distributed mode, your TaskTracker/NodeManager (the thing running your mapper) runs in a separate JVM, and it sounds like your class is not making it onto that JVM's classpath.
Try using the -libjars <csv,list,of,jars> command-line argument on job invocation. This makes Hadoop distribute the jar to the TaskTracker JVM and load your classes from it. (Note that this copies the jar out to each node in your cluster and makes it available only for that specific job. If you have common libraries that are needed by a lot of jobs, look into the Hadoop distributed cache.)
You may also want to try yarn jar ... instead of hadoop jar ... when launching your job, since that's the new/preferred way to launch YARN jobs.
Can you check the contents of your itemcount.jar (jar -tvf itemcount.jar)? I faced this issue once, only to find that the .class file was missing from the jar.
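The same check can be scripted. Below is a minimal plain-Java sketch that looks for one class entry inside a jar, which is roughly what scanning the `jar -tvf` listing by eye does. The class and method names (`JarClassCheck`, `entryNameFor`, `containsClass`) are hypothetical helpers, not part of Hadoop; only `java.util.jar.JarFile` from the JDK is used.

```java
import java.io.IOException;
import java.util.jar.JarFile;

// Hypothetical helper: checks whether a jar contains the .class entry for a class.
class JarClassCheck {

    // Converts a binary class name to its entry name inside a jar, e.g.
    // "a.b.Outer$Inner" -> "a/b/Outer$Inner.class".
    static String entryNameFor(String binaryClassName) {
        return binaryClassName.replace('.', '/') + ".class";
    }

    // Returns true if the jar at jarPath has an entry for the given class.
    static boolean containsClass(String jarPath, String binaryClassName) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getJarEntry(entryNameFor(binaryClassName)) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: JarClassCheck <jar> <binary class name>");
            return;
        }
        System.out.println(containsClass(args[0], args[1]) ? "present" : "missing");
    }
}
```

For the jar in the question, the second argument would be com.hadoop.keyword.count.ItemMapper. Note that a nested class is stored under its binary name, so a mapper declared inside ItemImpl would appear as com.hadoop.keyword.count.ItemImpl$ItemMapper (quote the `$` in a shell).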
I had the same error a few days ago.
Changing the map and reduce classes to static fixed my problem.
Make your map and reduce classes static inner classes.
Check the constructors and signatures of the map and reduce classes (input/output types and @Override annotations).
Check your jar command:
Old:
hadoop jar ~/itemcount.jar /user/rohit/tweets /home/rohit/outputs/23mar-yarn13 vodka,wine,whisky
New:
hadoop jar ~/itemcount.jar com.hadoop.keyword.count.ItemImpl /user/rohit/tweets /home/rohit/outputs/23mar-yarn13 vodka,wine,whisky
Add packageName.MainClass after the .jar file.
Try-catch:
try {
tweetObject = (JSONObject) parser.parse(value.toString());
} catch (Exception e) { // catch Exception instead of ParseException if you expect more than parse errors
e.printStackTrace();
return; // return from the function in case of any error
}
}
Extend Configured and implement Tool:
public class ItemImpl extends Configured implements Tool {
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new ItemImpl(), args);
System.exit(res);
}
@Override
public int run(String[] args) throws Exception {
// keyword list from the question's command line; set it before the job copies the configuration
getConf().set("keywords", args[2]);
Job job = Job.getInstance(getConf(), "ItemImpl");
job.setJarByClass(ItemImpl.class);
job.setMapperClass(ItemMapper.class);
job.setReducerClass(ItemReducer.class);
job.setMapOutputKeyClass(Text.class); // probably not essential, but makes it explicit
job.setMapOutputValueClass(IntWritable.class); // probably not essential, but makes it explicit
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// return the status so that main() can pass it to System.exit()
return job.waitForCompletion(true) ? 0 : 1;
}
// add the public static mapper class here
// add the public static reducer class here
}
I'm not an expert on this topic, but this implementation is from one of my working projects. If it doesn't work for you, I would suggest checking the libraries you added to your project.
The first step will probably solve it, but if these steps don't work, share the code with us.

How can I have multiple mappers and reducers?

I have this code, in which I have set one mapper and one reducer. I want to include one more mapper and reducer for further processing.
The problem is that I have to take the output file of the first MapReduce job as the input to the next MapReduce job. Is that possible? If yes, how can I do it?
public int run(String[] args) throws Exception
{
JobConf conf = new JobConf(getConf(),DecisionTreec45.class);
conf.setJobName("c4.5");
// the keys are words (strings)
conf.setOutputKeyClass(Text.class);
// the values are counts (ints)
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(MyMapper.class);
conf.setReducerClass(MyReducer.class);
//set your input file path below
FileInputFormat.setInputPaths(conf, "/home/hduser/Id3_hds/playtennis.txt");
FileOutputFormat.setOutputPath(conf, new Path("/home/hduser/Id3_hds/1/output"+current_index));
JobClient.runJob(conf);
return 0;
}
Yes, it is possible. You can check the following tutorial to see how chaining works: http://gandhigeet.blogspot.com/2012/12/as-discussed-in-previous-post-hadoop.html
Make sure you delete the intermediate output data in HDFS that each MR phase creates, for example with fs.delete(intermediateOutputPath, true);
Look at how this works.
You need to have two jobs. Job 2 depends on job 1.
public class ChainJobs extends Configured implements Tool {
private static final String OUTPUT_PATH = "intermediate_output";
@Override
public int run(String[] args) throws Exception {
/*
* Job 1
*/
Configuration conf = getConf();
FileSystem fs = FileSystem.get(conf);
Job job = new Job(conf, "Job1");
job.setJarByClass(ChainJobs.class);
job.setMapperClass(MyMapper1.class);
job.setReducerClass(MyReducer1.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.addInputPath(job, new Path(args[0]));
TextOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
// Wait for job 1 to finish. Job 2 reads its output, so stop here if job 1 fails.
if (!job.waitForCompletion(true)) {
return 1;
}
/*
* Job 2
*/
Configuration conf2 = getConf();
Job job2 = new Job(conf2, "Job 2");
job2.setJarByClass(ChainJobs.class);
job2.setMapperClass(MyMapper2.class);
job2.setReducerClass(MyReducer2.class);
job2.setOutputKeyClass(Text.class);
job2.setOutputValueClass(Text.class);
job2.setInputFormatClass(TextInputFormat.class);
job2.setOutputFormatClass(TextOutputFormat.class);
TextInputFormat.addInputPath(job2, new Path(OUTPUT_PATH));
TextOutputFormat.setOutputPath(job2, new Path(args[1]));
return job2.waitForCompletion(true) ? 0 : 1;
}
/**
* Method Name: main Return type: none Purpose:Read the arguments from
* command line and run the Job till completion
*
*/
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Enter valid number of arguments <Inputdirectory> <Outputlocation>");
System.exit(2); // non-zero exit code signals the usage error
}
// propagate the job status as the process exit code
System.exit(ToolRunner.run(new Configuration(), new ChainJobs(), args));
}
}
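The dependency between the two jobs comes down to simple control flow: run the stages in order and stop at the first failure, because later stages would read incomplete output. Here is a minimal plain-Java model of that flow. It uses no Hadoop classes; each `BooleanSupplier` is a stand-in for a call such as job.waitForCompletion(true), and the names `ChainSketch` and `runChain` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BooleanSupplier;

// Control-flow model of chained jobs. Each stage stands in for a call such as
// job.waitForCompletion(true); no Hadoop classes are used here.
class ChainSketch {

    // Runs the stages in order. Returns 0 if all succeed, otherwise the
    // 1-based position of the first stage that failed.
    static int runChain(List<BooleanSupplier> stages) {
        for (int i = 0; i < stages.size(); i++) {
            if (!stages.get(i).getAsBoolean()) {
                return i + 1; // later stages depend on this one, so stop
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        BooleanSupplier job1 = () -> true; // pretend job 1 succeeded
        BooleanSupplier job2 = () -> true; // pretend job 2 succeeded
        System.out.println("exit code: " + runChain(Arrays.asList(job1, job2)));
    }
}
```

In a real driver the returned value would be what run() hands back to ToolRunner, so that a failure in the first job surfaces as a non-zero exit code.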

Hadoop mapreduce.job.reduces in Generic Option Syntax?

I am trying to set the number of reducers via the command line. It seems I am using the wrong syntax. I am using Hadoop 2.5 (YARN) MR2.
hadoop jar mrjobs-0.1.jar com.example.Weather -D mapreduce.job.reduces=2 datasets/inputs output
This command does not work when I add the -D option; without it, it works fine.
Any help appreciated!
Thanks!
The syntax looks correct. I tested the following against 2.5 YARN MR2 and it works:
hadoop jar hadoop-mapreduce-examples.jar wordcount -Dmapreduce.job.reduces=5 input output
Most probably the problem is that your driver class does not implement Tool and is not launched through ToolRunner, which works together with GenericOptionsParser to parse generic command-line arguments.
Here is an example of how to use ToolRunner in your MapReduce driver class:
// imports ignored
public class ExampleDriver extends Configured implements Tool {
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: ExampleDriver <in> <out>");
System.exit(2);
}
Configuration conf = getConf();
Job job = Job.getInstance(conf);
job.setJobName("example driver");
job.setJarByClass(ExampleDriver.class);
job.setMapperClass(YourMapper.class);
job.setReducerClass(YourReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
int ret = job.waitForCompletion(true) ? 0 : 1;
return ret;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new ExampleDriver(), args);
System.exit(res);
}
}
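To make the division of labour concrete, here is a minimal plain-Java sketch of the idea behind generic option parsing: -D key=value pairs (in either spelling used in this thread) are peeled off the argument list into a property map, and only the remaining arguments reach run(). This is a simplified model for illustration, not Hadoop's actual GenericOptionsParser; the names `GenericOptionsSketch`, `Parsed` and `parse` are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of generic option handling; NOT Hadoop's real parser.
class GenericOptionsSketch {

    // Holds the properties taken from -D options and the arguments left over.
    static final class Parsed {
        final Map<String, String> properties = new LinkedHashMap<>();
        final List<String> remainingArgs = new ArrayList<>();
    }

    static Parsed parse(String[] args) {
        Parsed parsed = new Parsed();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String pair = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                pair = args[++i];          // form: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                pair = arg.substring(2);   // form: -Dkey=value
            }
            if (pair == null) {
                parsed.remainingArgs.add(arg); // ordinary application argument
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq > 0) {
                parsed.properties.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
            // a -D option without "key=value" is silently ignored in this sketch
        }
        return parsed;
    }

    public static void main(String[] args) {
        Parsed p = parse(new String[] {
            "-D", "mapreduce.job.reduces=2", "datasets/inputs", "output"});
        System.out.println(p.properties);    // {mapreduce.job.reduces=2}
        System.out.println(p.remainingArgs); // [datasets/inputs, output]
    }
}
```

A driver that builds its own Configuration and reads the raw args skips this step entirely, which is why the -D token then shows up as an unexpected extra argument.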

Is the generic option -D not supported in Hadoop 0.20.2?

I am trying to set a configuration property from the console using the -D generic option.
Here is my console input:
$ hadoop jar hadoop-0.20.2/gtee.jar dd.MaxTemperature -D file.pattern=2007.* /inputdata /outputdata
but when I cross-checked it from the code with
Configuration conf;
System.out.println(conf.get("file.pattern"));
the output is null. What could be the problem here? Why is the value of the property "file.pattern" not displayed? Can anyone please help me?
Thanks
EDITED SECTION:
Driver Code:
public int run(String[] args) throws Exception {
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("MaxTemperature");
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
// System.out.println(conf.get("file.pattern"));
if (fs.exists(new Path(args[1]))) {
fs.delete(new Path(args[1]), true);
}
System.out.println(conf.get("file.pattern"));
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileInputFormat.setInputPathFilter(job, RegexExcludePathFilter.class);
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapperClass(MapMapper.class);
job.setCombinerClass(Mapreducers.class);
job.setReducerClass(Mapreducers.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int xx = ToolRunner.run(new Configuration(),new MaxTemperature(), args);
System.exit(xx);
}
path filter implementation:
public static class RegexExcludePathFilter extends Configured implements
PathFilter {
//String pattern = "2007.[0-1]?[0-2].[0-9][0-9].txt" ;
Configuration conf;
Pattern pattern;
@Override
public boolean accept(Path path) {
Matcher m = pattern.matcher(path.toString());
return m.matches();
}
@Override
public void setConf(Configuration conf) {
this.conf = conf;
pattern = Pattern.compile(conf.get("file.pattern"));
System.out.println(pattern);
}
}
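To confirm, the -D option is supported in version 0.20.2; however, your driver has to implement the Tool interface (and be launched through ToolRunner) to read properties from the command line. Inside run(), use the configuration that ToolRunner has already populated instead of creating a new one:
Configuration conf = new Configuration(); // this is the issue
// When implementing Tool, use this instead
Configuration conf = this.getConf();
// and build the job from it, e.g. new Job(conf), so that the job and its path filter see the property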
You are passing it with a space in between; try it without the space instead:
-Dfile.pattern=2007.*
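One more thing worth checking once the property does reach the filter: accept() calls Matcher.matches() on path.toString(), and matches() only succeeds when the whole string matches the pattern. A pattern such as 2007.* would therefore not match a string that begins with a scheme or a directory. The plain-Java sketch below shows the difference between matches() and find(); the sample path and the names `PathPatternSketch`, `matchesWhole` and `foundAnywhere` are made up for illustration.

```java
import java.util.regex.Pattern;

// Plain-Java illustration of matches() versus find(); the path below is hypothetical.
class PathPatternSketch {

    // Same check as the filter in the question: the whole string must match.
    static boolean matchesWhole(String regex, String path) {
        return Pattern.compile(regex).matcher(path).matches();
    }

    // Alternative: succeed if the pattern occurs anywhere in the string.
    static boolean foundAnywhere(String regex, String path) {
        return Pattern.compile(regex).matcher(path).find();
    }

    public static void main(String[] args) {
        String path = "hdfs://namenode/inputdata/2007.01.01.txt"; // made-up example
        System.out.println(matchesWhole("2007.*", path));   // false
        System.out.println(matchesWhole(".*2007.*", path)); // true
        System.out.println(foundAnywhere("2007.*", path));  // true
    }
}
```

If the filter is meant to look at file names only, matching against path.getName() rather than the full path string is another option.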

How to use JobControl in hadoop

I want to merge two files into one.
I made two mappers to read, and one reducer to join.
JobConf classifiedConf = new JobConf(new Configuration());
classifiedConf.setJarByClass(myjob.class);
classifiedConf.setJobName("classifiedjob");
FileInputFormat.setInputPaths(classifiedConf,classifiedInputPath );
classifiedConf.setMapperClass(ClassifiedMapper.class);
classifiedConf.setMapOutputKeyClass(TextPair.class);
classifiedConf.setMapOutputValueClass(Text.class);
Job classifiedJob = new Job(classifiedConf);
//first mapper config
JobConf featureConf = new JobConf(new Configuration());
featureConf.setJobName("featureJob");
featureConf.setJarByClass(myjob.class);
FileInputFormat.setInputPaths(featureConf, featuresInputPath);
featureConf.setMapperClass(FeatureMapper.class);
featureConf.setMapOutputKeyClass(TextPair.class);
featureConf.setMapOutputValueClass(Text.class);
Job featureJob = new Job(featureConf);
//second mapper config
JobConf joinConf = new JobConf(new Configuration());
joinConf.setJobName("joinJob");
joinConf.setJarByClass(myjob.class);
joinConf.setReducerClass(JoinReducer.class);
joinConf.setOutputKeyClass(Text.class);
joinConf.setOutputValueClass(Text.class);
Job joinJob = new Job(joinConf);
//reducer config
//JobControl config
joinJob.addDependingJob(featureJob);
joinJob.addDependingJob(classifiedJob);
secondJob.addDependingJob(joinJob);
JobControl jobControl = new JobControl("jobControl");
jobControl.addJob(classifiedJob);
jobControl.addJob(featureJob);
jobControl.addJob(secondJob);
Thread thread = new Thread(jobControl);
thread.start();
while(jobControl.allFinished()){
jobControl.stop();
}
But, I get this message:
WARN mapred.JobClient:
Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
Can anyone help, please?
Which version of Hadoop are you using?
Does the warning actually stop the program?
You don't need to call setJarByClass(). In my snippet below the job class is passed to the JobConf constructor instead, and it runs without calling setJarByClass():
JobConf job = new JobConf(PageRankJob.class);
job.setJobName("PageRankJob");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(PageRankMapper.class);
job.setReducerClass(PageRankReducer.class);
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
JobClient.runJob(job);
You should implement your Job this way:
public class MyApp extends Configured implements Tool {
public int run(String[] args) throws Exception {
// Configuration processed by ToolRunner
Configuration conf = getConf();
// Create a JobConf using the processed conf
JobConf job = new JobConf(conf, MyApp.class);
// Process custom command-line options
Path in = new Path(args[1]);
Path out = new Path(args[2]);
// Specify various job-specific parameters
job.setJobName("my-app");
job.setInputPath(in);
job.setOutputPath(out);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
// Submit the job, then poll for progress until the job is complete
JobClient.runJob(job);
return 0;
}
public static void main(String[] args) throws Exception {
// Let ToolRunner handle generic command-line options
int res = ToolRunner.run(new Configuration(), new MyApp(), args);
System.exit(res);
}
}
This comes straight out of Hadoop's documentation.
So basically your job needs to inherit from Configured and implement Tool. This will force you to implement run(). Then start your job from your main class using ToolRunner.run(<your job>, <args>) and the warning will disappear.
You need to have this code in the driver: job.setJarByClass(MapperClassName.class);

Resources