MapReduce: Passing external jar files using libjars option does not work - hadoop

My MapReduce program needs external jar files. I am using the
"-libjars" option to provide those external jar files.
I am using the Tool, Configured and ToolRunner utilities provided by Hadoop.
public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new MapReduce(), args);
    System.exit(res);
}

@Override
public int run(String[] args) throws Exception {
    // Configuration processed by ToolRunner
    Configuration conf = getConf();
    Job job = new Job(conf, "MapReduce");
    ....
}
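For reference, a minimal sketch of the class declaration these methods assume (the class name comes from the snippet above; extending Configured and implementing Tool is what lets ToolRunner's GenericOptionsParser pick up -libjars):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;

public class MapReduce extends Configured implements Tool {
    // main() and run() as shown above
}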
When I tried to run the job -
$ hadoop jar myjob.jar jobClassName -libjars external.jar
It threw the following exception.
12/11/21 16:26:02 INFO mapred.JobClient: Task Id :
attempt_201211211620_0001_m_000000_1, Status : FAILED Error:
java.lang.ClassNotFoundException:
org.joda.time.format.DateTimeFormatterBuilder
I have been trying to resolve it for a while. Nothing seems to work so far. I am using CDH 4.1.1.

It seems it cannot find JodaTime. Open /etc/hbase/hbase-env.sh and add your extra jar to HADOOP_CLASSPATH:
export HADOOP_CLASSPATH="<extra_entries>:$HADOOP_CLASSPATH"
Another idea, less efficient and sometimes not possible, is to copy the required jar to /usr/share/hadoop/lib.

Try invoking the command using the fully qualified absolute file name for the external.jar. Also confirm that the missing class and all of its prerequisite classes are in the external.jar.
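If -libjars still does not reach the task JVMs, the jar can also be shipped programmatically through the distributed cache from the run() method. A minimal sketch, assuming the jar has already been copied to HDFS (the /libs path below is only an example):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

// Inside run(), before submitting the job:
Configuration conf = getConf();
// Adds the jar (already in HDFS) to the classpath of every map and reduce task
DistributedCache.addFileToClassPath(new Path("/libs/joda-time.jar"), conf);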

Related

Error when running Hadoop Map Reduce for map-only job

I want to run a map-only job in Hadoop MapReduce; here's my code:
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJobName("import");
job.setMapperClass(Map.class);//Custom Mapper
job.setInputFormatClass(TextInputFormat.class);
job.setNumReduceTasks(0);
TextInputFormat.setInputPaths(job, new Path("/home/jonathan/input"));
But I get the error:
13/07/17 18:22:48 ERROR security.UserGroupInformation: PriviledgedActionException
as: jonathan cause:org.apache.hadoop.mapred.InvalidJobConfException:
Output directory not set.
Exception in thread "main" org.apache.hadoop.mapred.InvalidJobConfException:
Output directory not set.
Then I tried to use this:
job.setOutputFormatClass(org.apache.hadoop.mapred.lib.NullOutputFormat.class);
But it gives me a compilation error:
java: method setOutputFormatClass in class org.apache.hadoop.mapreduce.Job
cannot be applied to given types;
required: java.lang.Class<? extends org.apache.hadoop.mapreduce.OutputFormat>
found: java.lang.Class<org.apache.hadoop.mapred.lib.NullOutputFormat>
reason: actual argument java.lang.Class
<org.apache.hadoop.mapred.lib.NullOutputFormat> cannot be converted to
java.lang.Class<? extends org.apache.hadoop.mapreduce.OutputFormat>
by method invocation conversion
What am I doing wrong?
Map-only jobs still need an output location specified. As the error says, you're not specifying this.
I think you mean that your job produces no output at all. Hadoop still wants you to specify an output location, though nothing need be written.
You want org.apache.hadoop.mapreduce.lib.output.NullOutputFormat, not org.apache.hadoop.mapred.lib.NullOutputFormat; that package mismatch is what the second error indicates, though it's subtle.
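A minimal sketch of the fix (the import is the key difference; the rest mirrors the code above):

import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;  // mapreduce, not mapred

// Satisfies the output-directory requirement while writing nothing
job.setOutputFormatClass(NullOutputFormat.class);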

Load shared library from distributed cache

I have a shared library that I copied to hdfs at
/user/uokuyucu/lib/libxxx.so
and I have a WordCount.java with the identical code from tutorial plus my own FileInputFormat class called MyFileInputFormat that has nothing in it except the constructor modified as follows:
public MyFileInputFormat() {
System.loadLibrary("xxx");
}
and I'm also adding my shared library to distributed cache like this in job setup (main):
DistributedCache.addCacheFile(new URI("/user/uokuyucu/lib/libxxx.so"),
job.getConfiguration());
I run it as;
hadoop jar mywordcount.jar mywordcount.WordCount input output
and got java.lang.UnsatisfiedLinkError: no far_jni_interface in java.library.path exception.
How can I load a shared library in my hadoop job?
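One approach (not verified against this setup, but a common pattern with the old DistributedCache API) is to request a symlink to the .so in the task's working directory and load it by absolute path, since System.loadLibrary only searches java.library.path. A sketch, with the driver and task sides shown as helper methods:

import java.io.File;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class NativeLibHelper {
    // Driver side: ship the .so and request a symlink in the task working directory.
    public static void shipLibrary(Job job) throws Exception {
        Configuration conf = job.getConfiguration();
        DistributedCache.createSymlink(conf);
        // The "#libxxx.so" fragment names the symlink created next to the task
        DistributedCache.addCacheFile(
                new URI("/user/uokuyucu/lib/libxxx.so#libxxx.so"), conf);
    }

    // Task side (e.g. in the mapper's setup or the input format constructor),
    // instead of System.loadLibrary("xxx"):
    public static void loadLibrary() {
        System.load(new File("libxxx.so").getAbsolutePath());
    }
}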

Nutch in Windows: Failed to set permissions of path

I'm trying to use Solr with Nutch on a Windows machine and I'm getting the following error:
Exception in thread "main" java.io.IOException: Failed to set permissions of path: c:\temp\mapred\staging\admin-1654213299\.staging to 0700
From a lot of threads I learned that Hadoop, which seems to be used by Nutch, does some chmod magic that works on Unix machines but not on Windows.
This problem has existed for more than a year now. I found one thread where the code line is shown and a fix proposed. Am I really the only one who has this problem? Is everyone else creating a custom build in order to run Nutch on Windows? Or is there some option to disable the Hadoop stuff, or another solution? Maybe another crawler than Nutch?
Here's the stack trace of what I'm doing:
admin#WIN-G1BPD00JH42 /cygdrive/c/solr/apache-nutch-1.6
$ bin/nutch crawl urls -dir crawl -depth 3 -topN 5 -solr http://localhost:8080/solr-4.1.0
cygpath: can't convert empty path
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=http://localhost:8080/solr-4.1.0
topN = 5
Injector: starting at 2013-03-03 17:43:15
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Exception in thread "main" java.io.IOException: Failed to set permissions of path: c:\temp\mapred\staging\admin-1654213299\.staging to 0700
at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
at org.apache.nutch.crawl.Injector.inject(Injector.java:281)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:127)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
It took me a while to get this working, but here's the solution, which works on Nutch 1.7.
Download Hadoop Core 0.20.2 from the maven repository
Replace $NUTCH_HOME/lib/hadoop-core-1.2.0.jar with the downloaded file, renaming it to the same name.
That should be it.
Explanation
This issue is caused by Hadoop, since it assumes you're running on Unix and abides by its file permission rules. The issue was actually resolved in 2011, but Nutch didn't update the Hadoop version it uses. The relevant fixes are here and here.
We are using Nutch too, but it is not supported for running on Windows; on Cygwin, our 1.4 version had similar problems to yours, something MapReduce-related too.
We solved it by using a VM (VirtualBox) with Ubuntu and a shared directory between Windows and Linux, so we can develop and build on Windows and run Nutch (crawling) on Linux.
I have Nutch running on Windows with no custom build. It's been a long time since I used it, though. One thing that took me a while to catch is that you need to run Cygwin as a Windows admin to get the necessary rights.
I suggest a different approach. Check this link out. It explains how to swallow the error on Windows and does not require you to downgrade Hadoop or rebuild Nutch. I tested it on Nutch 2.1, but it applies to other versions as well.
I also made a simple .bat for starting the crawler and indexer, but it is meant for Nutch 2.x and might not be applicable to Nutch 1.x.
For the sake of posterity, the approach entails:
Making a custom LocalFileSystem implementation:
package com.conga.services.hadoop.patch.HADOOP_7682;

import java.io.IOException;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class WinLocalFileSystem extends LocalFileSystem {

    public WinLocalFileSystem() {
        super();
        System.err.println("Patch for HADOOP-7682: " +
                "Instantiating workaround file system");
    }

    /**
     * Delegates to <code>super.mkdirs(Path)</code> and separately calls
     * <code>this.setPermission(Path,FsPermission)</code>
     */
    @Override
    public boolean mkdirs(Path path, FsPermission permission)
            throws IOException {
        boolean result = super.mkdirs(path);
        this.setPermission(path, permission);
        return result;
    }

    /**
     * Ignores IOException when attempting to set the permission
     */
    @Override
    public void setPermission(Path path, FsPermission permission)
            throws IOException {
        try {
            super.setPermission(path, permission);
        }
        catch (IOException e) {
            System.err.println("Patch for HADOOP-7682: " +
                    "Ignoring IOException setting permission for path \"" + path +
                    "\": " + e.getMessage());
        }
    }
}
Compiling it and placing the JAR under ${HADOOP_HOME}/lib
And then registering it by modifying ${HADOOP_HOME}/conf/core-site.xml:
<property>
  <name>fs.file.impl</name>
  <value>com.conga.services.hadoop.patch.HADOOP_7682.WinLocalFileSystem</value>
  <description>Enables patch for issue HADOOP-7682 on Windows</description>
</property>
You have to change the project dependencies hadoop-core and hadoop-tools. I'm using the 0.20.2 version and it works fine.

DistributedCache unable to access archives

I am able to access individual files using DistributedCache but unable to access archives.
In the main method I am adding the archive as
DistributedCache.addCacheArchive(new Path("/stocks.gz").toUri(), job.getConfiguration());
where /stocks.gz is in hdfs. In the mapper I use,
Path[] paths = DistributedCache.getLocalCacheArchives(context.getConfiguration());
File localFile = new File(paths[0].toString());
which throws the exception,
java.io.FileNotFoundException: /tmp/hadoop-user/mapred/local/taskTracker/distcache/-8696401910194823450_622739733_1347031628/localhost/stocks.gz (No such file or directory)
I am expecting the DistributedCache to unzip /stocks.gz and the mapper to use the underlying file, but it throws a FileNotFoundException.
DistributedCache.addCacheFile and DistributedCache.getLocalCacheFiles work correctly when passing a single file; however, passing an archive does not. What am I doing wrong here?
Can you try giving stocks.gz with the absolute path?
DistributedCache.addCacheArchive(new Path("<Absolute Path To>/stocks.gz").toUri(), job.getConfiguration());
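For example, with a fully qualified HDFS URI (the namenode host, port, and /user/hadoop directory below are placeholders):

DistributedCache.addCacheArchive(
        new Path("hdfs://namenode:8020/user/hadoop/stocks.gz").toUri(),
        job.getConfiguration());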

Giraph Shortest Paths Example ClassNotFoundException

I am trying to run the shortest paths example from the Giraph incubator (https://cwiki.apache.org/confluence/display/GIRAPH/Shortest+Paths+Example). However, instead of executing the example from the giraph-*-dependencies.jar, I have created my own job jar. When I created a single Job file as presented in the example, I was getting
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.test.giraph.Test$SimpleShortestPathsVertexInputFormat
Then I moved the inner classes (SimpleShortestPathsVertexInputFormat and SimpleShortestPathsVertexOutputFormat) to separate files and renamed them just in case (SimpleShortestPathsVertexInputFormat_v2, SimpleShortestPathsVertexOutputFormat_v2); the classes are no longer static. This solved the class-not-found issue for SimpleShortestPathsVertexInputFormat_v2, but I am still getting the same error for SimpleShortestPathsVertexOutputFormat_v2. Below is my stack trace.
INFO mapred.JobClient: Running job: job_201205221101_0003
INFO mapred.JobClient: map 0% reduce 0%
INFO mapred.JobClient: Task Id : attempt_201205221101_0003_m_000005_0, Status : FAILED
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.test.giraph.utils.SimpleShortestPathsVertexOutputFormat_v2
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:898)
at org.apache.giraph.graph.BspUtils.getVertexOutputFormatClass(BspUtils.java:134)
at org.apache.giraph.bsp.BspOutputFormat.getOutputCommitter(BspOutputFormat.java:56)
at org.apache.hadoop.mapred.Task.initialize(Task.java:490)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:352)
at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.test.giraph.utils.SimpleShortestPathsVertexOutputFormat_v2
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:866)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:890)
... 9 more
I have inspected my job jar and all classes are there. Furthermore, I am using Hadoop 0.20.203 in pseudo-distributed mode. The way I launch my job is shown below.
hadoop jar giraphJobs.jar org.test.giraph.Test -libjars /path/to/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar /path/to/input /path/to/output 0 3
Also, I have defined HADOOP_CLASSPATH for the giraph-*-dependencies.jar. I can run the PageRankBenchmark example without a problem (directly from the giraph-*-dependencies.jar), and the shortest paths example works as well (also directly from the giraph-*-dependencies.jar). Other Hadoop jobs work without a problem (somewhere I read to test whether my "cluster" works correctly). Has anyone come across a similar problem? Any help will be appreciated.
Solution (sorry to post it like this, but I can't answer my own question for a couple more hours)
To solve this issue I had to add my job jar to -libjars (no changes to HADOOP_CLASSPATH were made). The command to launch the job now looks like this.
hadoop jar giraphJobs.jar org.test.giraph.Test -libjars /path/to/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar,/path/to/job.jar /path/to/input /path/to/output 0 3
The list of jars has to be comma separated. This solved my problem, but I am still curious why I have to pass my job jar as a "classpath" parameter. Can someone explain the rationale behind this? I find it strange (to say the least) to invoke my job jar and then pass it again as a "classpath" jar. I am really curious about the explanation.
I found an alternative programmatic solution to the problem.
We need to modify the run() method in the following way -
...
@Override
public int run(String[] argArray) throws Exception {
Preconditions.checkArgument(argArray.length == 4,
"run: Must have 4 arguments <input path> <output path> " +
"<source vertex id> <# of workers>");
GiraphJob job = new GiraphJob(getConf(), getClass().getName());
// This is the addition - it will make hadoop look for other classes in the same jar that contains this class
job.getInternalJob().setJarByClass(getClass());
job.setVertexClass(getClass());
...
}
setJarByClass() makes Hadoop look for the missing classes in the same jar that contains the class returned by getClass(), so we no longer need to add the job jar separately to the -libjars option.
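For a plain MapReduce driver (outside Giraph) the equivalent call is Job.setJarByClass; a minimal sketch, where MyDriver is a placeholder for your driver class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MyDriver {
    public static Job createJob(Configuration conf) throws Exception {
        Job job = new Job(conf, "MyJob");
        // Ship the jar containing MyDriver (and its nested classes) to the tasks
        job.setJarByClass(MyDriver.class);
        return job;
    }
}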
