hadoop, how to include 3part jar while try to run mapred job - hadoop

As we know, new need to pack all needed class into the job-jar and upload it to server. it's so slow, i will to know whether there is a way which to specify the thirdpart jar include executing map-red job, so that i could only pack my classes with out dependencies.
PS(i found there is a "-libjar" command, but i doesn't figure out how to use it. Here is the link http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/)

Those are called generic options.
So, to support those, your job should implement Tool.
Run your job like --
hadoop jar yourfile.jar [mainClass] args -libjars <comma seperated list of jars>
Edit:
To implement Tool and extend Configured, you do something like this in your MapReduce application --
public class YourClass extends Configured implements Tool {
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new YourClass(), args);
System.exit(res);
}
public int run(String[] args) throws Exception
{
//parse you normal arguments here.
Configuration conf = getConf();
Job job = new Job(conf, "Name of job");
//set the class names etc
//set the output data type classes etc
//to accept the hdfs input and outpur dir at run time
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
}
}

For me I had to specify -libjar option before the arguments. Otherwise it was considered as an argument.

Related

Specify job properties and override properties in hadoop jobs

I have a hadoop (2.2.0) map-reduce job which reads text from a specified path (say INPUT_PATH), and does some processing. I don't want to hardcode the input path (since it comes from some other source which changes each week).
I believe there should be a way in hadoop to specify an xml properties file while running though the command line. How should I do it?
One way I thought was to set an environment variable which points to the location of the properties file and then read this env variable in code and subsequently read the property file. This could work because the value of the env variable can be changed each week without changing the code. But I feel this is an ugly way of loading properties and overrides.
Please let me know the least hacky way of doing this.
There is no inbuilt way to read any configuration file for input/output.
One way I can suggest is to implement a Java M/R Driver program that does the following,
Read the configuration (XML/properties/anything) (Probably generated / updated by the other process)
Set the Job Properties
Submit the Job using your hadoop command (pass the configuration file as argument)
Something like this,
public class SampleMRDriver
extends Configured implements Tool {
#Override
public int run(
String[] args)
throws Exception {
// Read from args the configuration file
Properties prop = new Properties();
prop.loadFromXML(new FileInputStream(args[0]));
Job job = Job.getInstance(getConf(), "Test Job");
job.setJarByClass(SampleMRDriver.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(TestMapper.class);
job.setReducerClass(TestReducer.class);
FileInputFormat.setInputPaths(job, new Path(prop.get("input_path")));
FileOutputFormat.setOutputPath(job, new Path(prop.get("output_path")));
boolean success = job.waitForCompletion(true);
return success ? 0 : 1;
}
public static void main(
String[] args)
throws Exception {
ToolRunner.run(new BatteryAnomalyDetection(), args);
}
}

Hadoop mapreduce.job.reduces in Generic Option Syntax?

I am trying to set the number of reducers to use via command line. It seems like I am using wrong syntax. I am using hadoop 2.5 (yarn) MR2.
hadoop jar mrjobs-0.1.jar com.example.Weather -D mapreduce.job.reduces=2 datasets/inputs output
This commands is not working when I added -D option else its working fine.
Any help appreciated !
thanks!
Syntax looks proper, I have tested against 2.5 YARN MR2 with the following it's working:
hadoop jar hadoop-mapreduce-examples.jar wordcount -Dmapreduce.job.reduces=5 input output
Most probably the problem could be your Driver class hasn't implemented ToolRunner which works in coordination with GenericOptionsParser to parse generic command line arguments.
Here is an example of how to implement ToolRunner in your MapReduce Driver class:
// imports ignored
public class ExampleDriver extends Configured implements Tool {
public int run(String[] args) throws Exception {
if (args.length != 2) {
System.err.println("Usage: ExampleDriver <in> <out>");
System.exit(2);
}
Configuration conf = getConf();
Job job = Job.getInstance(conf);
job.setJobName("example driver");
job.setJarByClass(ExampleDriver.class);
job.setMapperClass(YourMapper.class);
job.setReducerClass(YourReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
int ret = job.waitForCompletion(true) ? 0 : 1;
return ret;
}
public static void main(String[] args) throws Exception {
int res = ToolRunner.run(new Configuration(), new ExampleDriver(), args);
System.exit(res);
}
}

ClassNotFoundException when running HBase map reduce job on cluster

I have been testing a map reduce job on a single node and it seems to work but now that I am trying to run it on a remote cluster I am getting a ClassNotFoundExcepton. My code is structured as follows:
public class Pivot {
public static class Mapper extends TableMapper<ImmutableBytesWritable, ImmutableBytesWritable> {
#Override
public void map(ImmutableBytesWritable rowkey, Result values, Context context) throws IOException {
(map code)
}
}
public static class Reducer extends TableReducer<ImmutableBytesWritable, ImmutableBytesWritable, ImmutableBytesWritable> {
public void reduce(ImmutableBytesWritable key, Iterable<ImmutableBytesWritable> values, Context context) throws IOException, InterruptedException {
(reduce code)
}
}
public static void main(String[] args) {
Configuration conf = HBaseConfiguration.create();
conf.set("fs.default.name", "hdfs://hadoop-master:9000");
conf.set("mapred.job.tracker", "hdfs://hadoop-master:9001");
conf.set("hbase.master", "hadoop-master:60000");
conf.set("hbase.zookeeper.quorum", "hadoop-master");
conf.set("hbase.zookeeper.property.clientPort", "2222");
Job job = new Job(conf);
job.setJobName("Pivot");
job.setJarByClass(Pivot.class);
Scan scan = new Scan();
TableMapReduceUtil.initTableMapperJob("InputTable", scan, Mapper.class, ImmutableBytesWritable.class, ImmutableBytesWritable.class, job);
TableMapReduceUtil.initTableReducerJob("OutputTable", Reducer.class, job);
job.waitForCompletion(true);
}
}
The error I am receiving when I try to run this job is the following:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Pivot$Mapper
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
...
Is there something I'm missing? Why is the job having difficulty finding the mapper?
When running a job from Eclipse it's important to note that Hadoop requires you to launch your job from a jar. Hadoop requires this so it can send your code up to HDFS / JobTracker.
In your case i imagine you haven't bundled up your job classes into a jar, and then run the program 'from the jar' - resulting in a CNFE.
Try building a jar and running from the command line using hadoop jar myjar.jar ..., once this works then you can test running from within Eclipse

How to run the word count job on hadoop yarn from Java code?

I have a requirement like below:
there is a 30 node hadoop YARN cluster, and a client machine for job submission.
Let's use the wordcount MR example, since it's world famous. I'd like to submit and run the wordcount MR job from a java method.
So what's the code required to submit the job? anything specific to configurations on the client machine?
Hadoop should be present on your client machine, with the same configurations as other machines in your hadoop cluster.
To submit the MR job from a java method, please refer to java ProcessBuilder and pass the hadoop command to launch you wordcount example.
The command and necessary application specific requirements for wordcount can be found here
You should make a class that implements Tool. An example here:
public class AggregateJob extends Configured implements Tool {
#Override
public int run(String[] args) throws Exception {
Job job = new Job(getConf());
job.setJarByClass(getClass());
job.setJobName(getClass().getSimpleName());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(ProjectionMapper.class);
job.setCombinerClass(LongSumReducer.class);
job.setReducerClass(LongSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int rc = ToolRunner.run(new AggregateJob(), args);
System.exit(rc);
}
}
This example was obtained from here. As #hamsa-zafar already say, the client machine should have present hadoop configuration, as any other node in the cluster.

How to use JobControl in hadoop

I want to merge two files into one.
I made two mappers to read, and one reducer to join.
JobConf classifiedConf = new JobConf(new Configuration());
classifiedConf.setJarByClass(myjob.class);
classifiedConf.setJobName("classifiedjob");
FileInputFormat.setInputPaths(classifiedConf,classifiedInputPath );
classifiedConf.setMapperClass(ClassifiedMapper.class);
classifiedConf.setMapOutputKeyClass(TextPair.class);
classifiedConf.setMapOutputValueClass(Text.class);
Job classifiedJob = new Job(classifiedConf);
//first mapper config
JobConf featureConf = new JobConf(new Configuration());
featureConf.setJobName("featureJob");
featureConf.setJarByClass(myjob.class);
FileInputFormat.setInputPaths(featureConf, featuresInputPath);
featureConf.setMapperClass(FeatureMapper.class);
featureConf.setMapOutputKeyClass(TextPair.class);
featureConf.setMapOutputValueClass(Text.class);
Job featureJob = new Job(featureConf);
//second mapper config
JobConf joinConf = new JobConf(new Configuration());
joinConf.setJobName("joinJob");
joinConf.setJarByClass(myjob.class);
joinConf.setReducerClass(JoinReducer.class);
joinConf.setOutputKeyClass(Text.class);
joinConf.setOutputValueClass(Text.class);
Job joinJob = new Job(joinConf);
//reducer config
//JobControl config
joinJob.addDependingJob(featureJob);
joinJob.addDependingJob(classifiedJob);
secondJob.addDependingJob(joinJob);
JobControl jobControl = new JobControl("jobControl");
jobControl.addJob(classifiedJob);
jobControl.addJob(featureJob);
jobControl.addJob(secondJob);
Thread thread = new Thread(jobControl);
thread.start();
while(jobControl.allFinished()){
jobControl.stop();
}
But, I get this message:
WARN mapred.JobClient:
Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
anyone help please..................
Which version of Hadoop are you using?
The warning you get will stop the program?
You don't need to use setJarByClass(). You can see my snippet, I can run it without using setJarByClass() method.
JobConf job = new JobConf(PageRankJob.class);
job.setJobName("PageRankJob");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapperClass(PageRankMapper.class);
job.setReducerClass(PageRankReducer.class);
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
JobClient.runJob(job);
You should implement your Job this way:
public class MyApp extends Configured implements Tool {
public int run(String[] args) throws Exception {
// Configuration processed by ToolRunner
Configuration conf = getConf();
// Create a JobConf using the processed conf
JobConf job = new JobConf(conf, MyApp.class);
// Process custom command-line options
Path in = new Path(args[1]);
Path out = new Path(args[2]);
// Specify various job-specific parameters
job.setJobName("my-app");
job.setInputPath(in);
job.setOutputPath(out);
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
// Submit the job, then poll for progress until the job is complete
JobClient.runJob(job);
return 0;
}
public static void main(String[] args) throws Exception {
// Let ToolRunner handle generic command-line options
int res = ToolRunner.run(new Configuration(), new MyApp(), args);
System.exit(res);
}
}
This comes straight out of Hadoop's documentation here.
So basically your job needs to inherit from Configured and implement Tool. This will force you to implement run(). Then start your job from your main class using Toolrunner.run(<your job>, <args>) and the warning will disappear.
You need to have this code in the driver job.setJarByClass(MapperClassName.class);

Resources