error using -libjars while running map reduce job - hadoop

I am trying to run a MapReduce job using the hadoop jar command.
I am trying to include external libraries using the -libjars option.
The command that I am running currently is
hadoop jar mapR.jar com.ms.hadoop.poc.CsvParser -libjars google-gson.jar Test1.txt output
But I am receiving this as the output:
usage: [input] [output]
Can anyone please help me out?
I have included the external libraries in my classpath as well.

Can you list the contents of your main(String args[]) method? Are you using ToolRunner to launch your job? The parsing of the -libjars argument is a function of the GenericOptionsParser, which is invoked for you via the ToolRunner utility class:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Driver extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        // ToolRunner invokes GenericOptionsParser, which consumes -libjars, -files, -D, etc.
        System.exit(ToolRunner.run(new Driver(), args));
    }

    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());
        Configuration conf = job.getConfiguration();
        // other job configuration
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
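With driver code like this, your original command line should work as written, because ToolRunner / GenericOptionsParser consumes the -libjars option and passes only the remaining arguments (Test1.txt and output) on to your run() method:
hadoop jar mapR.jar com.ms.hadoop.poc.CsvParser -libjars google-gson.jar Test1.txt output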

Related

fix - warning "Use GenericOptionsParser for parsing the arguments" when running hadoop job?

When I submit a Hadoop job, it always says:
WARN [JobClient] Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same
How can I fix this?
I am using CDH 4.6.0.
You should use driver code like the following to start your MapReduce job to get rid of the warning (although it doesn't do any harm):
import java.io.IOException;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyClass extends Configured implements Tool {
    public int run(String[] args) throws IOException {
        JobConf conf = new JobConf(getConf(), MyClass.class);
        // configure and run the job here.
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int status = ToolRunner.run(new MyClass(), args); // calls your run() method.
        System.exit(status);
    }
}
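Once the job is launched through ToolRunner like this, the generic options are parsed for you, so you can pass things like -D, -files or -libjars on the command line (the jar and path names below are just placeholders):
hadoop jar myjob.jar MyClass -D mapred.reduce.tasks=2 -libjars extra.jar /input /output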

ClassNotFoundException when running HBase map reduce job on cluster

I have been testing a MapReduce job on a single node and it seems to work, but now that I am trying to run it on a remote cluster I am getting a ClassNotFoundException. My code is structured as follows:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.mapreduce.Job;

public class Pivot {
    public static class Mapper extends TableMapper<ImmutableBytesWritable, ImmutableBytesWritable> {
        @Override
        public void map(ImmutableBytesWritable rowkey, Result values, Context context) throws IOException {
            // (map code)
        }
    }

    public static class Reducer extends TableReducer<ImmutableBytesWritable, ImmutableBytesWritable, ImmutableBytesWritable> {
        public void reduce(ImmutableBytesWritable key, Iterable<ImmutableBytesWritable> values, Context context)
                throws IOException, InterruptedException {
            // (reduce code)
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("fs.default.name", "hdfs://hadoop-master:9000");
        conf.set("mapred.job.tracker", "hdfs://hadoop-master:9001");
        conf.set("hbase.master", "hadoop-master:60000");
        conf.set("hbase.zookeeper.quorum", "hadoop-master");
        conf.set("hbase.zookeeper.property.clientPort", "2222");

        Job job = new Job(conf);
        job.setJobName("Pivot");
        job.setJarByClass(Pivot.class);

        Scan scan = new Scan();
        TableMapReduceUtil.initTableMapperJob("InputTable", scan, Mapper.class, ImmutableBytesWritable.class, ImmutableBytesWritable.class, job);
        TableMapReduceUtil.initTableReducerJob("OutputTable", Reducer.class, job);
        job.waitForCompletion(true);
    }
}
The error I am receiving when I try to run this job is the following:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Pivot$Mapper
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
...
Is there something I'm missing? Why is the job having difficulty finding the mapper?
When running a job from Eclipse, it's important to note that Hadoop requires you to launch your job from a jar. Hadoop requires this so it can ship your code up to HDFS / the JobTracker.
In your case I imagine you haven't bundled your job classes into a jar and run the program 'from the jar', resulting in a ClassNotFoundException.
Try building a jar and running it from the command line with hadoop jar myjar.jar .... Once this works, you can test running from within Eclipse.
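As a minimal sketch (the bin/ directory and jar name are placeholders for wherever your compiled classes end up), you could package and submit the job like this:
jar cf pivot.jar -C bin/ .
hadoop jar pivot.jar Pivot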

How to run the word count job on hadoop yarn from Java code?

I have a requirement like below:
There is a 30-node Hadoop YARN cluster and a client machine for job submission.
Let's use the wordcount MR example, since it's well known. I'd like to submit and run the wordcount MR job from a Java method.
So what's the code required to submit the job? Is there anything specific to configure on the client machine?
Hadoop should be present on your client machine, with the same configuration as the other machines in your Hadoop cluster.
To submit the MR job from a Java method, you can use java.lang.ProcessBuilder and pass it the hadoop command that launches your wordcount example, as sketched below.
The command and the necessary application-specific requirements for wordcount can be found here.
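For illustration, here is a minimal sketch of the ProcessBuilder approach; the hadoop executable is assumed to be on the PATH, and the class name (WordCountLauncher), jar path and HDFS paths are placeholders you would replace with your own:
import java.io.IOException;

public class WordCountLauncher {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "hadoop", "jar",
                "/path/to/hadoop-mapreduce-examples.jar",    // placeholder: examples jar shipped with Hadoop
                "wordcount",
                "/user/me/input",                            // placeholder: HDFS input dir
                "/user/me/output");                          // placeholder: HDFS output dir (must not exist yet)
        pb.redirectErrorStream(true);                        // merge stderr into stdout
        pb.redirectOutput(ProcessBuilder.Redirect.INHERIT);  // stream the job client's output to this console
        int exitCode = pb.start().waitFor();                 // block until the hadoop command returns
        System.out.println("hadoop exited with code " + exitCode);
    }
}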
You should make a class that implements Tool. An example here:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AggregateJob extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());
        job.setJarByClass(getClass());
        job.setJobName(getClass().getSimpleName());
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(ProjectionMapper.class); // ProjectionMapper comes from the example's own project
        job.setCombinerClass(LongSumReducer.class);
        job.setReducerClass(LongSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int rc = ToolRunner.run(new AggregateJob(), args);
        System.exit(rc);
    }
}
This example was obtained from here. As @hamsa-zafar already says, the client machine should have the Hadoop configuration present, just like any other node in the cluster.
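With that driver in place, the job can then be submitted from the client machine like any other (the jar name and HDFS paths here are placeholders):
hadoop jar aggregate-job.jar AggregateJob /input/path /output/path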

hadoop, how to include a third-party jar when running a mapred job

As we know, we need to pack all required classes into the job jar and upload it to the server, which is slow. I would like to know whether there is a way to specify third-party jars when executing a MapReduce job, so that I only pack my own classes, without their dependencies.
PS: I found there is a "-libjars" option, but I couldn't figure out how to use it. Here is the link: http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
Those are called generic options.
So, to support those, your job should implement Tool.
Run your job like --
hadoop jar yourfile.jar [mainClass] args -libjars <comma separated list of jars>
Edit:
To implement Tool and extend Configured, you do something like this in your MapReduce application --
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class YourClass extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new YourClass(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        // parse your normal (non-generic) arguments here.
        Configuration conf = getConf();
        Job job = new Job(conf, "Name of job");
        // set the mapper/reducer class names etc.
        // set the output data type classes etc.
        // accept the HDFS input and output dirs at run time
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
For me, I had to specify the -libjars option before the other arguments; otherwise it was treated as an ordinary application argument.
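In other words, put the generic options right after the main class and before your own arguments (jar and path names here are placeholders):
hadoop jar yourfile.jar [mainClass] -libjars <comma separated list of jars> /input/dir /output/dir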

How to set -file option for java hadoop?

How do I copy a file that is required by a Hadoop program to all compute nodes? I am aware that the -file option does that for Hadoop streaming. How do I do this for Java + Hadoop?
Exactly the same way.
Assuming you use the ToolRunner / Configured / Tool pattern, the files you specify after the -files option will be in the local dir when your mapper / reducer / combiner tasks run:
import java.io.File;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Driver extends Configured implements Tool {
    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Driver(), args));
    }

    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());
        // ...
        return job.waitForCompletion(true) ? 0 : 1;
    }
}

public class MyMapper extends Mapper<K1, V1, K2, V2> { // K1..V2 stand in for your key/value types
    public void setup(Context context) {
        // a file passed with -files is available by name in the task's working directory
        File myFile = new File("file.csv");
        // do something with the file
    }
    // ...
}
You can then execute with:
#> hadoop jar myJar.jar Driver -files file.csv ......
See the Javadoc for GenericOptionsParser for more info.
