How to use JobControl in Hadoop

I want to merge two files into one.
I made two mappers to read, and one reducer to join.
// first mapper config
JobConf classifiedConf = new JobConf(new Configuration());
classifiedConf.setJarByClass(myjob.class);
classifiedConf.setJobName("classifiedjob");
FileInputFormat.setInputPaths(classifiedConf, classifiedInputPath);
classifiedConf.setMapperClass(ClassifiedMapper.class);
classifiedConf.setMapOutputKeyClass(TextPair.class);
classifiedConf.setMapOutputValueClass(Text.class);
Job classifiedJob = new Job(classifiedConf);
// second mapper config
JobConf featureConf = new JobConf(new Configuration());
featureConf.setJobName("featureJob");
featureConf.setJarByClass(myjob.class);
FileInputFormat.setInputPaths(featureConf, featuresInputPath);
featureConf.setMapperClass(FeatureMapper.class);
featureConf.setMapOutputKeyClass(TextPair.class);
featureConf.setMapOutputValueClass(Text.class);
Job featureJob = new Job(featureConf);
// reducer config
JobConf joinConf = new JobConf(new Configuration());
joinConf.setJobName("joinJob");
joinConf.setJarByClass(myjob.class);
joinConf.setReducerClass(JoinReducer.class);
joinConf.setOutputKeyClass(Text.class);
joinConf.setOutputValueClass(Text.class);
Job joinJob = new Job(joinConf);
// JobControl config
joinJob.addDependingJob(featureJob);
joinJob.addDependingJob(classifiedJob);
secondJob.addDependingJob(joinJob);
JobControl jobControl = new JobControl("jobControl");
jobControl.addJob(classifiedJob);
jobControl.addJob(featureJob);
jobControl.addJob(secondJob);
Thread thread = new Thread(jobControl);
thread.start();
while(jobControl.allFinished()){
jobControl.stop();
}
But I get these messages:
WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
Can anyone help, please?

Which version of Hadoop are you using?
Does the warning actually stop the program?
You don't need to call setJarByClass(). In my snippet the job class is passed to the JobConf constructor instead, and it runs without setJarByClass():
JobConf job = new JobConf(PageRankJob.class);
job.setJobName("PageRankJob");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(PageRankMapper.class);
job.setReducerClass(PageRankReducer.class);
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
JobClient.runJob(job);

You should implement your job this way:
public class MyApp extends Configured implements Tool {
    public int run(String[] args) throws Exception {
        // Configuration processed by ToolRunner
        Configuration conf = getConf();
        // Create a JobConf using the processed conf
        JobConf job = new JobConf(conf, MyApp.class);
        // Process custom command-line options
        Path in = new Path(args[1]);
        Path out = new Path(args[2]);
        // Specify various job-specific parameters
        job.setJobName("my-app");
        // Note: on releases where JobConf no longer has these two setters, use
        // FileInputFormat.setInputPaths(job, in) and FileOutputFormat.setOutputPath(job, out)
        job.setInputPath(in);
        job.setOutputPath(out);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        // Submit the job, then poll for progress until the job is complete
        JobClient.runJob(job);
        return 0;
    }
    public static void main(String[] args) throws Exception {
        // Let ToolRunner handle generic command-line options
        int res = ToolRunner.run(new Configuration(), new MyApp(), args);
        System.exit(res);
    }
}
This comes straight out of Hadoop's documentation for the Tool interface.
So basically your job class needs to extend Configured and implement Tool, which forces you to implement run(). Then start the job from main using ToolRunner.run(<your job>, <args>), and the GenericOptionsParser warning will disappear.

You need to have this code in the driver: job.setJarByClass(MapperClassName.class);
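On the JobControl part of the question: addDependingJob declares that a job may start only after the jobs it depends on have finished. The plain-Java sketch below models just that ordering rule. It uses no Hadoop classes, the names (JobOrderSketch, runOrder, the job-name strings) are made up for illustration, and it is not how JobControl is implemented internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Stand-in model of dependency-ordered jobs; not Hadoop code.
class JobOrderSketch {

    // dependsOn maps each job name to the jobs that must finish before it.
    // Returns one valid execution order, or throws if there is a cycle
    // or a dependency on a job that was never registered.
    static List<String> runOrder(Map<String, List<String>> dependsOn) {
        List<String> order = new ArrayList<>();
        Set<String> done = new LinkedHashSet<>();
        while (done.size() < dependsOn.size()) {
            boolean progressed = false;
            for (Map.Entry<String, List<String>> e : dependsOn.entrySet()) {
                String job = e.getKey();
                // A job is ready once every job it depends on has finished.
                if (!done.contains(job) && done.containsAll(e.getValue())) {
                    done.add(job);
                    order.add(job);
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("cyclic or unknown dependency");
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("joinJob", Arrays.asList("classifiedJob", "featureJob"));
        deps.put("classifiedJob", Collections.<String>emptyList());
        deps.put("featureJob", Collections.<String>emptyList());
        System.out.println(runOrder(deps)); // [classifiedJob, featureJob, joinJob]
    }
}
```

Two things in the question's driver are worth checking against this model: every job named in a dependency should also be added to the JobControl (joinJob is not), and a polling loop would normally wait while !jobControl.allFinished() and call stop() only after the loop.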

Related

How can I have multiple mappers and reducers?

I have this code in which I have set one mapper and one reducer. I want to include one more mapper and reducer for further processing.
The problem is that I have to use the output file of the first MapReduce job as the input to the next MapReduce job. Is that possible? If yes, how can I do it?
public int run(String[] args) throws Exception
{
    JobConf conf = new JobConf(getConf(), DecisionTreec45.class);
    conf.setJobName("c4.5");
    // the keys are words (strings)
    conf.setOutputKeyClass(Text.class);
    // the values are counts (ints)
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(MyMapper.class);
    conf.setReducerClass(MyReducer.class);
    // set your input file path below
    FileInputFormat.setInputPaths(conf, "/home/hduser/Id3_hds/playtennis.txt");
    FileOutputFormat.setOutputPath(conf, new Path("/home/hduser/Id3_hds/1/output" + current_index));
    JobClient.runJob(conf);
    return 0;
}
Yes, it's possible. The following tutorial shows how chaining works: http://gandhigeet.blogspot.com/2012/12/as-discussed-in-previous-post-hadoop.html
Make sure you delete the intermediate output that each MR phase creates in HDFS, using fs.delete(intermediateOutputPath, true);
Look at how this works.
You need two jobs, where job2 depends on job1.
public class ChainJobs extends Configured implements Tool {
    private static final String OUTPUT_PATH = "intermediate_output";
    @Override
    public int run(String[] args) throws Exception {
        /*
         * Job 1
         */
        Configuration conf = getConf();
        FileSystem fs = FileSystem.get(conf);
        Job job = new Job(conf, "Job1");
        job.setJarByClass(ChainJobs.class);
        job.setMapperClass(MyMapper1.class);
        job.setReducerClass(MyReducer1.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));
        // Blocks until job 1 completes. Job 2 reads job 1's output, so stop here if job 1 failed.
        if (!job.waitForCompletion(true)) {
            return 1;
        }
        /*
         * Job 2
         */
        Configuration conf2 = getConf();
        Job job2 = new Job(conf2, "Job 2");
        job2.setJarByClass(ChainJobs.class);
        job2.setMapperClass(MyMapper2.class);
        job2.setReducerClass(MyReducer2.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        job2.setInputFormatClass(TextInputFormat.class);
        job2.setOutputFormatClass(TextOutputFormat.class);
        TextInputFormat.addInputPath(job2, new Path(OUTPUT_PATH));
        TextOutputFormat.setOutputPath(job2, new Path(args[1]));
        return job2.waitForCompletion(true) ? 0 : 1;
    }
    /**
     * Reads the arguments from the command line and runs the jobs to completion.
     */
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Enter valid number of arguments <Inputdirectory> <Outputlocation>");
            System.exit(2);
        }
        // Propagate the result of run() as the process exit code
        System.exit(ToolRunner.run(new Configuration(), new ChainJobs(), args));
    }
}
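The control flow of such a chained driver can be sketched without Hadoop. In the minimal model below each stage stands in for a job's waitForCompletion(true) call, and the chain stops at the first failure so that a later stage never reads incomplete intermediate output. ChainSketch and runChain are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Stand-in for a driver that runs dependent stages one after another; not Hadoop code.
class ChainSketch {

    // Runs the stages in order and stops at the first one that reports failure.
    // Returns 0 if every stage succeeded and 1 otherwise (usual exit-code convention).
    static int runChain(List<BooleanSupplier> stages) {
        for (BooleanSupplier stage : stages) {
            if (!stage.getAsBoolean()) {
                return 1;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        List<BooleanSupplier> stages = new ArrayList<>();
        stages.add(() -> { log.add("job1"); return false; }); // simulated failure
        stages.add(() -> { log.add("job2"); return true; });  // must not run
        int exit = runChain(stages);
        System.out.println(exit + " " + log); // 1 [job1]
    }
}
```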

MultipleInputs not working - Hadoop 2.5.0

I'm trying to write a program that has 2 mappers that are executed simultaneously and one reducer. Each mapper has a different input file. Basically, I'm trying to do a reduce-side join. But I am getting errors when I declare my job the following way:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 3) {
        System.err.println("Usage: MovieRatings <in1> <in2> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "movieratings");
    job.setJarByClass(MovieRatings.class);
    job.setMapperClass(MovieIDJoinMapper.class);
    job.setMapperClass(MovieNameJoinMapper.class);
    MultipleInputs.addInputPath(job, new Path("/temp2"), TextInputFormat.class, MovieIDJoinMapper.class);
    MultipleInputs.addInputPath(job, new Path(otherArgs[1]), TextInputFormat.class, MovieNameJoinMapper.class);
    job.setReducerClass(ReduceSideJoin.class);
    job.setNumReduceTasks(1);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
The error I can't get rid of is:
The method addInputPath(JobConf, Path, Class<? extends InputFormat>, Class<? extends Mapper>) in the type MultipleInputs is not applicable for the arguments (Job, Path, Class<TextInputFormat>, Class<MovieRatings.MovieIDJoinMapper>) MovieRatings.java /homework2/src
Now I get that it should work if I do:
JobConf job = new JobConf();
But that doesn't work either. I am using Hadoop 2.5.0. I know this might be a problem due to mismatch between the version and the API but I've tried different ways and nothing seems to work. Can someone help me please? Thanks!
This is an API mismatch. You are using the newer types but have imported the old org.apache.hadoop.mapred.lib.MultipleInputs class. Change the import to the following and the errors should go away:
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
I got the same error. The likely cause is that you are using the mapred and mapreduce libraries at the same time.
Replace
import org.apache.hadoop.mapred.TextInputFormat;
with
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

Hadoop: how to include third-party jars when running a MapReduce job

As we know, we need to pack all required classes into the job jar and upload it to the server, which is slow. Is there a way to specify third-party jars when executing a MapReduce job, so that I only have to package my own classes without their dependencies?
PS: I found there is a "-libjars" option, but I couldn't figure out how to use it. Here is the link: http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
Those are called generic options.
To support them, your job should implement Tool.
Run your job like this, with the generic options placed before your own arguments:
hadoop jar yourfile.jar [mainClass] -libjars <comma separated list of jars> args
Edit:
To implement Tool and extend Configured, you do something like this in your MapReduce application --
public class YourClass extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new YourClass(), args);
        System.exit(res);
    }
    public int run(String[] args) throws Exception
    {
        // parse your normal arguments here
        Configuration conf = getConf();
        Job job = new Job(conf, "Name of job");
        // set the class names etc.
        // set the output data type classes etc.
        // accept the HDFS input and output directories at run time
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
For me, I had to specify the -libjars option before the arguments. Otherwise it was treated as an ordinary argument.
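Why the position matters can be illustrated with a deliberately simplified model: option/value pairs are consumed only while they appear at the front of the argument list, and everything from the first non-option token onward is handed to the application unchanged. This is not Hadoop's GenericOptionsParser, the option set is an illustrative subset, and the class name GenericArgsSketch is made up; it only mirrors the behaviour described in the comment above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model of "generic options first, then application arguments".
// NOT Hadoop's GenericOptionsParser; it only illustrates the ordering rule.
class GenericArgsSketch {

    // A few generic option names; in this model each takes exactly one value.
    static final Set<String> GENERIC = new HashSet<>(
            Arrays.asList("-libjars", "-files", "-archives", "-D", "-conf"));

    final Map<String, String> options = new LinkedHashMap<>();
    final List<String> remaining = new ArrayList<>();

    GenericArgsSketch(String[] args) {
        int i = 0;
        // Consume option/value pairs only while they appear at the front.
        while (i + 1 < args.length && GENERIC.contains(args[i])) {
            options.put(args[i], args[i + 1]);
            i += 2;
        }
        // From the first non-option token on, everything is an application argument.
        remaining.addAll(Arrays.asList(args).subList(i, args.length));
    }

    public static void main(String[] args) {
        GenericArgsSketch first = new GenericArgsSketch(
                new String[] {"-libjars", "a.jar,b.jar", "in", "out"});
        System.out.println(first.options + " " + first.remaining);

        GenericArgsSketch late = new GenericArgsSketch(
                new String[] {"in", "out", "-libjars", "a.jar,b.jar"});
        System.out.println(late.options + " " + late.remaining);
    }
}
```

In the second call the model recognises no options at all, so "-libjars" and its value end up among the application arguments.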

Multiple ways to write driver of Hadoop program - Which one to choose?

I have observed that there are multiple ways to write the driver method of a Hadoop program.
The following method is given in the Hadoop Tutorial by Yahoo:
public void run(String inputPath, String outputPath) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    // the keys are words (strings)
    conf.setOutputKeyClass(Text.class);
    // the values are counts (ints)
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.addInputPath(conf, new Path(inputPath));
    FileOutputFormat.setOutputPath(conf, new Path(outputPath));
    JobClient.runJob(conf);
}
and this method is given in the book Hadoop: The Definitive Guide (2012) by O'Reilly:
public static void main(String[] args) throws Exception {
    if (args.length != 2) {
        System.err.println("Usage: MaxTemperature <input path> <output path>");
        System.exit(-1);
    }
    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
While trying the program from the O'Reilly book, I found that the constructors of the Job class are deprecated. As the O'Reilly book is based on Hadoop 2 (YARN), I was surprised to see that it uses deprecated constructors.
Which method does everyone use?
I use the former approach. If we go with overriding the run() method, we can use hadoop jar options like -D, -libjars, -files, etc. These are necessary in almost any Hadoop project.
Not sure if we can use them through the main() method.
Slightly different to your first (Yahoo) block - you should be using the ToolRunner / Tool classes which take advantage of the GenericOptionsParser (as noted in Eswara's answer)
A template pattern would be something like:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class ToolExample extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // old API
        JobConf jobConf = new JobConf(getConf());
        // new API
        Job job = new Job(getConf());
        // rest of your config here
        // determine success / failure (depending on your choice of old / new api)
        // return 0 for success, non-zero for an error
        return 0;
    }
    public static void main(String args[]) throws Exception {
        System.exit(ToolRunner.run(new ToolExample(), args));
    }
}

Accessing files in hadoop distributed cache

I want to use the distributed cache to allow my mappers to access data. In main, I'm using the command
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
Where /user/peter/cacheFile/testCache1 is a file that exists in hdfs
Then, my setup function looks like this:
public void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    //etc
}
However, this localFiles array is always null.
I was initially running on a single-host cluster for testing, but I read that this prevents the distributed cache from working. I tried a pseudo-distributed setup, but that didn't work either.
I'm using hadoop 1.0.3
thanks
Peter
Problem here was that I was doing the following:
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
Since the Job constructor makes an internal copy of the conf instance, adding the cache file afterwards doesn't affect things. Instead, I should do this:
Configuration conf = new Configuration();
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
Job job = new Job(conf, "wordcount");
And now it works. Thanks to Harsh on hadoop user list for the help.
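The copy semantics described here can be reproduced with two small stand-in classes. Conf and Job below are hypothetical miniatures written for this illustration, not Hadoop's Configuration and Job. They only assume what the answer states: the job keeps its own copy of the configuration it was constructed with.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-ins for Configuration and Job; hypothetical classes, not Hadoop's.
class ConfCopySketch {

    static class Conf {
        final Map<String, String> props = new HashMap<>();
        Conf() {}
        Conf(Conf other) { props.putAll(other.props); } // copy constructor
        void set(String key, String value) { props.put(key, value); }
        String get(String key) { return props.get(key); }
    }

    static class Job {
        private final Conf conf;
        Job(Conf conf) { this.conf = new Conf(conf); }  // the job keeps its own copy
        Conf getConfiguration() { return conf; }
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        conf.set("before", "1");
        Job job = new Job(conf);
        conf.set("after", "2");                    // too late: only the original changes
        job.getConfiguration().set("viaJob", "3"); // changes the job's own copy

        System.out.println(job.getConfiguration().get("before")); // 1
        System.out.println(job.getConfiguration().get("after"));  // null
        System.out.println(job.getConfiguration().get("viaJob")); // 3
    }
}
```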
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), job.getConfiguration());
You can also do it in this way.
Once the Job has been created from a configuration object, i.e.
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
any later change made to conf itself, e.g.
conf.set("demiliter","|");
or
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), conf);
is not reflected on a pseudo-distributed or fully distributed cluster, although it does work in the local environment.
This version of the code (slightly different from the constructs mentioned above) has always worked for me.
// in main(String[] args)
Job job = new Job(conf, "Word Count");
...
DistributedCache.addCacheFile(new URI("/user/peter/cacheFile/testCache1"), job.getConfiguration());
I didn't see the complete setup() function in your Mapper code, so here is one:
public void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path[] dataFile = DistributedCache.getLocalCacheFiles(conf);
    // [0] because we added just one file.
    BufferedReader cacheReader = new BufferedReader(new InputStreamReader(fs.open(dataFile[0])));
    try {
        // now one can use BufferedReader's readLine() to read data
    } finally {
        cacheReader.close(); // release the file handle when done
    }
}
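Once setup() has the local path of the cached file, reading it is ordinary Java I/O. The sketch below loads a small lookup file into a map using only the JDK. The tab-separated "key, value" layout and the names CacheFileLoaderSketch and load are assumptions made for the example, and Path here is java.nio.file.Path, not Hadoop's Path.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of loading a small lookup file, as a mapper's setup() might do
// after obtaining the local path of the cached file. No Hadoop classes are used.
class CacheFileLoaderSketch {

    // Assumes one "key<TAB>value" pair per line (an assumption for this example);
    // lines that do not contain a non-empty key followed by a tab are skipped.
    static Map<String, String> load(Path localFile) throws IOException {
        Map<String, String> lookup = new HashMap<>();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = Files.newBufferedReader(localFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab > 0) {
                    lookup.put(line.substring(0, tab), line.substring(tab + 1));
                }
            }
        }
        return lookup;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("testCache", ".txt");
        Files.write(tmp, Arrays.asList("1\tToy Story", "2\tJumanji"), StandardCharsets.UTF_8);
        System.out.println(load(tmp).get("2")); // Jumanji
        Files.delete(tmp);
    }
}
```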

Resources