Giraph Shortest Paths Example ClassNotFoundException - hadoop

I am trying to run the shortest paths example from the Giraph incubator (https://cwiki.apache.org/confluence/display/GIRAPH/Shortest+Paths+Example). However, instead of executing the example from the giraph-*-dependencies.jar, I created my own job jar. When I created a single job file as presented in the example, I got
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.test.giraph.Test$SimpleShortestPathsVertexInputFormat
Then I moved the inner classes (SimpleShortestPathsVertexInputFormat and SimpleShortestPathsVertexOutputFormat) to separate files and renamed them just in case (SimpleShortestPathsVertexInputFormat_v2, SimpleShortestPathsVertexOutputFormat_v2); the classes are not static anymore. This solved the class-not-found issue for SimpleShortestPathsVertexInputFormat_v2; however, I am still getting the same error for SimpleShortestPathsVertexOutputFormat_v2. Below is my stack trace.
INFO mapred.JobClient: Running job: job_201205221101_0003
INFO mapred.JobClient: map 0% reduce 0%
INFO mapred.JobClient: Task Id : attempt_201205221101_0003_m_000005_0, Status : FAILED
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.test.giraph.utils.SimpleShortestPathsVertexOutputFormat_v2
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:898)
at org.apache.giraph.graph.BspUtils.getVertexOutputFormatClass(BspUtils.java:134)
at org.apache.giraph.bsp.BspOutputFormat.getOutputCommitter(BspOutputFormat.java:56)
at org.apache.hadoop.mapred.Task.initialize(Task.java:490)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:352)
at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.test.giraph.utils.SimpleShortestPathsVertexOutputFormat_v2
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:866)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:890)
... 9 more
I have inspected my job jar and all classes are there. Furthermore, I am using Hadoop 0.20.203 in pseudo-distributed mode. The way I launch my job is presented below.
hadoop jar giraphJobs.jar org.test.giraph.Test -libjars /path/to/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar /path/to/input /path/to/output 0 3
Also, I have defined HADOOP_CLASSPATH for the giraph-*-dependencies.jar. I can run the PageRankBenchmark example without a problem (directly from the giraph-*-dependencies.jar), and the shortest paths example works as well (also directly from the giraph-*-dependencies.jar). Other Hadoop jobs run without a problem (somewhere I read to test whether my "cluster" works correctly). Has anyone come across a similar problem? Any help will be appreciated.
Solution (sorry to post it like this, but I can't answer my own question for a couple more hours)
To solve this issue I had to add my job jar to the -libjars option (no changes to HADOOP_CLASSPATH were made). The command to launch the job now looks like this.
hadoop jar giraphJobs.jar org.test.giraph.Test -libjars /path/to/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar,/path/to/job.jar /path/to/input /path/to/output 0 3
The list of jars has to be comma-separated. This has solved my problem, but I am still curious why I have to pass my job jar as a "classpath" parameter. Can someone explain the rationale behind this? I find it strange (to say the least) to invoke my job jar and then pass it again as a "classpath" jar. I am really curious about the explanation.

I found an alternative programmatic solution to the problem.
We need to modify the run() method in the following way -
...
@Override
public int run(String[] argArray) throws Exception {
    Preconditions.checkArgument(argArray.length == 4,
        "run: Must have 4 arguments <input path> <output path> " +
        "<source vertex id> <# of workers>");
    GiraphJob job = new GiraphJob(getConf(), getClass().getName());
    // This is the addition: it makes Hadoop look for other classes
    // in the same jar that contains this class
    job.getInternalJob().setJarByClass(getClass());
    job.setVertexClass(getClass());
    ...
}
setJarByClass() makes Hadoop look for the missing classes in the same jar that contains the class returned by getClass(), so we no longer need to add the job jar separately to the -libjars option.
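For what it's worth, plain MapReduce drivers have the same call directly on Job. A minimal sketch (MyDriver is an illustrative class name, not from the original post):

// Ship the jar that contains MyDriver with the job, so task JVMs can load
// the mapper/reducer and format classes packaged alongside it.
Job job = new Job(getConf(), "my job");
job.setJarByClass(MyDriver.class);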

Related

Hadoop WordCount MapReduce: Getting invalid argument error for setInputFormatClass

I am trying to run a word count program, but I am getting an error for the code below:
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Error:- "The method setInputFormatClass(Class)
in the type Job is not applicable for the arguments
(Class)"
The likely problem (without seeing all of your code) is that you've mixed the two MapReduce APIs: the old mapred API and the new mapreduce API.
Check the imports for the two classes. I'm guessing yours probably looks like:
org.apache.hadoop.mapred.TextInputFormat
When it should be:
org.apache.hadoop.mapreduce.lib.input.TextInputFormat
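A minimal sketch of the corrected new-API imports, assuming the rest of the job uses org.apache.hadoop.mapreduce throughout:

// New-API imports; mixing these with org.apache.hadoop.mapred classes
// produces the "not applicable for the arguments" compile error above.
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);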

Error when running Hadoop Map Reduce for map-only job

I want to run a map-only job in Hadoop MapReduce, here's my code:
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJobName("import");
job.setMapperClass(Map.class); // custom Mapper
job.setInputFormatClass(TextInputFormat.class);
job.setNumReduceTasks(0);
TextInputFormat.setInputPaths(job, new Path("/home/jonathan/input"));
But I get the error:
13/07/17 18:22:48 ERROR security.UserGroupInformation: PriviledgedActionException
as: jonathan cause:org.apache.hadoop.mapred.InvalidJobConfException:
Output directory not set.
Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException:
Output directory not set.
Then I tried to use this:
job.setOutputFormatClass(org.apache.hadoop.mapred.lib.NullOutputFormat.class);
But it gives me a compilation error:
java: method setOutputFormatClass in class org.apache.hadoop.mapreduce.Job
cannot be applied to given types;
required: java.lang.Class<? extends org.apache.hadoop.mapreduce.OutputFormat>
found: java.lang.Class<org.apache.hadoop.mapred.lib.NullOutputFormat>
reason: actual argument java.lang.Class
<org.apache.hadoop.mapred.lib.NullOutputFormat> cannot be converted to
java.lang.Class<? extends org.apache.hadoop.mapreduce.OutputFormat>
by method invocation conversion
What am I doing wrong?
Map-only jobs still need an output location specified. As the error says, you're not specifying this.
I think you mean that your job produces no output at all. Hadoop still wants you to specify an output location, though nothing need be written.
You want org.apache.hadoop.mapreduce.lib.output.NullOutputFormat, not org.apache.hadoop.mapred.lib.NullOutputFormat, which is what the second error indicates, though it's subtle.
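A minimal sketch of the fix, assuming the job otherwise uses the new API:

// The new-API NullOutputFormat satisfies the "Output directory not set"
// check without writing anything.
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

job.setOutputFormatClass(NullOutputFormat.class);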

Pattern match input files for Amazon Elastic MapReduce

I am trying to run a MapReduce streaming job that takes input files from directories in an S3 bucket that match a given pattern. The pattern is something like bucket-name/[date]/product/logs/[hour]/[logfilename]. An example log would be at a path like bucket-name/2013-05-02/product/logs/05/log123456789.
I can get the job to work by passing only the hour portion of the file name as a wildcard. For example: bucket-name/2013-05-02/product/logs/*/. This successfully picks each log file from each hour, and passes them individually to mappers.
The problem comes when I also try to make the date a wildcard, for example: bucket-name/*/product/logs/*/. When I do this, the job gets created but no tasks are created and eventually it fails. This error is printed in the syslog.
2013-05-02 08:03:41,549 ERROR org.apache.hadoop.streaming.StreamJob (main): Job not successful. Error: Job initialization failed:
java.lang.OutOfMemoryError: Java heap space
at java.util.regex.Matcher.<init>(Matcher.java:207)
at java.util.regex.Pattern.matcher(Pattern.java:888)
at org.apache.hadoop.conf.Configuration.substituteVars(Configuration.java:378)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:418)
at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:523)
at org.apache.hadoop.mapred.SkipBadRecords.getMapperMaxSkipRecords(SkipBadRecords.java:247)
at org.apache.hadoop.mapred.TaskInProgress.<init>(TaskInProgress.java:146)
at org.apache.hadoop.mapred.JobInProgress.initTasks(JobInProgress.java:722)
at org.apache.hadoop.mapred.JobTracker.initJob(JobTracker.java:4238)
at org.apache.hadoop.mapred.EagerTaskInitializationListener$InitJob.run(EagerTaskInitializationListener.java:79)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
2013-05-02 08:03:41,549 INFO org.apache.hadoop.streaming.StreamJob (main): killJob...
On further testing, it looks like the multiple-wildcard syntax works as expected in the command-line client. I had trouble getting it to work at first, before realizing that requiring Ruby 1.8.7 meant it requires exactly Ruby 1.8.7, and nothing later.
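For reference, a sketch of passing the double-wildcard input through the elastic-mapreduce Ruby client (bucket and script paths are illustrative; the quotes keep the local shell from expanding the globs):

elastic-mapreduce --create --stream \
  --input 's3://bucket-name/*/product/logs/*/' \
  --output 's3://bucket-name/output/' \
  --mapper 's3://bucket-name/scripts/mapper.py' \
  --reducer 's3://bucket-name/scripts/reducer.py'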

MapReduce: Passing external jar files using libjars option does not work

My MapReduce program needs external jar files. I am using the "-libjars" option to provide those external jar files.
I used the Tool, Configured, and ToolRunner utilities provided by Hadoop.
public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new MapReduce(), args);
    System.exit(res);
}

@Override
public int run(String[] args) throws Exception {
    // Configuration processed by ToolRunner
    Configuration conf = getConf();
    Job job = new Job(conf, "MapReduce");
    ....
}
When I tried to run the job -
$ hadoop jar myjob.jar jobClassName -libjars external.jar
It threw the following exception.
12/11/21 16:26:02 INFO mapred.JobClient: Task Id :
attempt_201211211620_0001_m_000000_1, Status : FAILED Error:
java.lang.ClassNotFoundException:
org.joda.time.format.DateTimeFormatterBuilder
I have been trying to resolve it for a while. Nothing seems to work so far. I am using CDH 4.1.1.
It seems it cannot find Joda-Time. Open /etc/hbase/hbase-env.sh and add your extra jar to HADOOP_CLASSPATH.
export HADOOP_CLASSPATH="<extra_entries>:$HADOOP_CLASSPATH"
Another idea, less efficient and sometimes not possible, is to copy your required jar to /usr/share/hadoop/lib.
Try invoking the command using the fully qualified absolute file name for the external.jar. Also confirm that the missing class and all of its prerequisite classes are in the external.jar.
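For example (paths are illustrative):

$ hadoop jar /full/path/to/myjob.jar jobClassName -libjars /full/path/to/external.jar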

my pig UDF runs in local mode but fails with "Deserialization error: could not instantiate" on my cluster

I have a Pig UDF which runs perfectly in local mode, but fails with "could not instantiate 'com.bla.myFunc' with arguments 'null'" when I try it on the cluster.
My mistake was not digging hard enough in the task logs.
When you dig there through the JobTracker UI, you can see that the root cause was:
Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Maps
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
So, besides the usual:
pigServer.registerFunction("myFunc", new FuncSpec("com.bla.myFunc"));
we should add:
registerJar(pigServer, Maps.class);
and so on for any jar used by the UDF.
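registerJar is not a built-in Pig API call; a sketch of such a helper, assuming the class was loaded from a jar on the local filesystem:

// Hypothetical helper: locate the jar that contains the given class and
// register it with the PigServer so it ships to the cluster.
private static void registerJar(PigServer pigServer, Class<?> clazz) throws IOException {
    String jarPath = clazz.getProtectionDomain().getCodeSource().getLocation().getPath();
    pigServer.registerJar(jarPath);
}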
Another option is to build a jar-with-dependencies, but then you have to put pig.jar before yours in the classpath, or else you'll run into this one: embedded hadoop-pig: what's the correct way to use the automatic addContainingJar for UDFs?
