Load shared library from distributed cache - hadoop

I have a shared library that I copied to hdfs at
/user/uokuyucu/lib/libxxx.so
I also have a WordCount.java with code identical to the tutorial's, plus my own FileInputFormat subclass called MyFileInputFormat, which is empty except for the constructor, modified as follows:
public MyFileInputFormat() {
    System.loadLibrary("xxx");
}
I'm also adding my shared library to the distributed cache like this in the job setup (main):
DistributedCache.addCacheFile(new URI("/user/uokuyucu/lib/libxxx.so"),
job.getConfiguration());
I run it as:
hadoop jar mywordcount.jar mywordcount.WordCount input output
and got a java.lang.UnsatisfiedLinkError: no far_jni_interface in java.library.path exception.
How can I load a shared library in my hadoop job?
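One common pattern (a sketch, not a verified fix for this particular error) is to request a symlink for the cached .so via a URI fragment and then load it by its absolute local path with System.load(), so java.library.path does not need to be changed. A minimal sketch, assuming the old DistributedCache API (imports: java.io.File, java.net.URI, org.apache.hadoop.filecache.DistributedCache):
// Driver side: cache the library and request a symlink named libxxx.so
// in each task's working directory ("#libxxx.so" is the symlink name).
DistributedCache.createSymlink(job.getConfiguration());
DistributedCache.addCacheFile(
    new URI("/user/uokuyucu/lib/libxxx.so#libxxx.so"),
    job.getConfiguration());

// Task side (e.g. in a setup() method): load the symlinked file by absolute
// path instead of calling System.loadLibrary().
System.load(new File("libxxx.so").getAbsolutePath());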

Related

How to trigger a Spark job from Java code with SparkSubmit.scala

I saw that Oozie uses
List<String> sparkArgs = new ArrayList<String>();
sparkArgs.add("--master");
sparkArgs.add("yarn-cluster");
sparkArgs.add("--class");
sparkArgs.add("com.sample.spark.HelloSpark");
...
SparkSubmit.main(sparkArgs.toArray(new String[sparkArgs.size()]));
But when I ran this on the cluster, I always got
Error: Could not load YARN classes. This copy of Spark may not have been compiled with YARN support.
I think that is because my program cannot find HADOOP_CONF_DIR. But how do I make SparkSubmit aware of that setting from Java code?
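As the message itself suggests, that error usually points at the Spark build or its classpath rather than HADOOP_CONF_DIR alone. Still, if the goal is to control the environment from Java, one option to explore is Spark's SparkLauncher API, which runs spark-submit in a child process whose environment (including HADOOP_CONF_DIR) you can set explicitly. A minimal sketch, with the paths below being hypothetical:
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.launcher.SparkLauncher;

public class HelloSparkLauncher {
    public static void main(String[] args) throws Exception {
        Map<String, String> env = new HashMap<String, String>();
        env.put("HADOOP_CONF_DIR", "/etc/hadoop/conf"); // hypothetical location

        Process spark = new SparkLauncher(env)
            .setAppResource("/path/to/hello-spark.jar") // hypothetical app jar
            .setMainClass("com.sample.spark.HelloSpark")
            .setMaster("yarn-cluster")
            .launch();

        spark.waitFor(); // wait for spark-submit to finish
    }
}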

DistributedCache unable to access archives

I am able to access individual files using DistributedCache but unable to access archives.
In the main method I am adding the archive as
DistributedCache.addCacheArchive(new Path("/stocks.gz").toUri(), job.getConfiguration());
where /stocks.gz is in hdfs. In the mapper I use,
Path[] paths = DistributedCache.getLocalCacheArchives(context.getConfiguration());
File localFile = new File(paths[0].toString());
which throws the exception,
java.io.FileNotFoundException: /tmp/hadoop-user/mapred/local/taskTracker/distcache/-8696401910194823450_622739733_1347031628/localhost/stocks.gz (No such file or directory)
I am expecting DistributedCache to unpack /stocks.gz and the mapper to use the underlying file, but it throws a FileNotFoundException.
DistributedCache.addCacheFile and DistributedCache.getLocalCacheFiles work correctly when passing a single file; however, passing an archive does not work. What am I doing wrong here?
Can you try specifying stocks.gz with its absolute path?
DistributedCache.addCacheArchive(new Path("<Absolute Path To>/stocks.gz").toUri(), job.getConfiguration());
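For comparison, a mapper-side sketch of reading a localized archive; this assumes the cached entry is one of the archive formats the framework unpacks (zip, tar, tgz, tar.gz), in which case the localized path is a directory named after the archive rather than the archive file itself:
import java.io.File;
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StocksMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) throws IOException {
        Path[] archives = DistributedCache.getLocalCacheArchives(context.getConfiguration());
        // e.g. .../distcache/.../stocks.tar.gz/ -- a directory holding the extracted files
        File unpackedDir = new File(archives[0].toString());
        for (File extracted : unpackedDir.listFiles()) {
            // read each extracted file here
        }
    }
}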

MapReduce: Passing external jar files using libjars option does not work

My MapReduce program needs external jar files. I am using the -libjars option to provide those external jar files.
I used the Tool, Configured, and ToolRunner utilities provided by Hadoop.
public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new MapReduce(), args);
    System.exit(res);
}

@Override
public int run(String[] args) throws Exception {
    // Configuration processed by ToolRunner
    Configuration conf = getConf();
    Job job = new Job(conf, "MapReduce");
    ....
}
When I tried to run the job:
$ hadoop jar myjob.jar jobClassName -libjars external.jar
It threw the following exception.
12/11/21 16:26:02 INFO mapred.JobClient: Task Id :
attempt_201211211620_0001_m_000000_1, Status : FAILED Error:
java.lang.ClassNotFoundException:
org.joda.time.format.DateTimeFormatterBuilder
I have been trying to resolve it for a while. Nothing seems to work so far. I am using CDH 4.1.1.
It seems it cannot find JodaTime. Open /etc/hbase/hbase-env.sh and add your extra jar to HADOOP_CLASSPATH.
export HADOOP_CLASSPATH="<extra_entries>:$HADOOP_CLASSPATH"
Another idea, less efficient and sometimes not possible, is to copy your required jar to /usr/share/hadoop/lib.
Try invoking the command using the fully qualified absolute file name for the external.jar. Also confirm that the missing class and all of its prerequisite classes are in the external.jar.
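For example, a sketch of the same invocation with a hypothetical absolute path to the jar:
$ hadoop jar myjob.jar jobClassName -libjars /home/user/lib/external.jar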

Giraph Shortest Paths Example ClassNotFoundException

I am trying to run the shortest paths example from the Giraph incubator (https://cwiki.apache.org/confluence/display/GIRAPH/Shortest+Paths+Example). However, instead of executing the example from the giraph-*-dependencies.jar, I have created my own job jar. When I created a single Job file as presented in the example, I was getting
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.test.giraph.Test$SimpleShortestPathsVertexInputFormat
Then I moved the inner classes (SimpleShortestPathsVertexInputFormat and SimpleShortestPathsVertexOutputFormat) to separate files and renamed them just in case (SimpleShortestPathsVertexInputFormat_v2, SimpleShortestPathsVertexOutputFormat_v2); the classes are no longer static. This solved the class-not-found issue for SimpleShortestPathsVertexInputFormat_v2; however, I am still getting the same error for SimpleShortestPathsVertexOutputFormat_v2. Below is my stack trace.
INFO mapred.JobClient: Running job: job_201205221101_0003
INFO mapred.JobClient: map 0% reduce 0%
INFO mapred.JobClient: Task Id : attempt_201205221101_0003_m_000005_0, Status : FAILED
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.test.giraph.utils.SimpleShortestPathsVertexOutputFormat_v2
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:898)
at org.apache.giraph.graph.BspUtils.getVertexOutputFormatClass(BspUtils.java:134)
at org.apache.giraph.bsp.BspOutputFormat.getOutputCommitter(BspOutputFormat.java:56)
at org.apache.hadoop.mapred.Task.initialize(Task.java:490)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:352)
at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.test.giraph.utils.SimpleShortestPathsVertexOutputFormat_v2
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:866)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:890)
... 9 more
I have inspected my job jar and all classes are there. Furthermore, I am using Hadoop 0.20.203 in pseudo-distributed mode. The way I launch my job is presented below.
hadoop jar giraphJobs.jar org.test.giraph.Test -libjars /path/to/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar /path/to/input /path/to/output 0 3
Also, I have defined HADOOP_CLASSPATH for the giraph-*-dependencies.jar. I can run the PageRankBenchmark example without a problem (directly from the giraph-*-dependencies.jar), and the shortest paths example works as well (also directly from the giraph-*-dependencies.jar). Other Hadoop jobs work without a problem (somewhere I read that this is a way to test whether my "cluster" works correctly). Has anyone come across a similar problem? Any help will be appreciated.
Solution (sorry to post it like this, but I can't answer my own question for a couple more hours):
To solve this issue I had to add my job jar to -libjars (no changes to HADOOP_CLASSPATH were made). The command to launch the job now looks like this:
hadoop jar giraphJobs.jar org.test.giraph.Test -libjars /path/to/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar,/path/to/job.jar /path/to/input /path/to/output 0 3
The list of jars has to be comma-separated. Though this has solved my problem, I am still curious why I have to pass my job jar as a "classpath" parameter. Can someone explain the rationale behind this? I found it strange (to say the least) to invoke my job jar and then have to pass it again as a "classpath" jar. I am really curious about the explanation.
I found an alternative programmatic solution to the problem.
We need to modify the run() method in the following way:
...
@Override
public int run(String[] argArray) throws Exception {
    Preconditions.checkArgument(argArray.length == 4,
        "run: Must have 4 arguments <input path> <output path> " +
        "<source vertex id> <# of workers>");
    GiraphJob job = new GiraphJob(getConf(), getClass().getName());
    // This is the addition - it will make Hadoop look for other classes in the same jar that contains this class
    job.getInternalJob().setJarByClass(getClass());
    job.setVertexClass(getClass());
    ...
}
setJarByClass() makes Hadoop look for the missing classes in the same jar that contains the class returned by getClass(), so we no longer need to add the job jar separately to the -libjars option.
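With this change, the original launch command should presumably work without repeating the job jar in -libjars, i.e.:
hadoop jar giraphJobs.jar org.test.giraph.Test -libjars /path/to/giraph-0.2-SNAPSHOT-jar-with-dependencies.jar /path/to/input /path/to/output 0 3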

Generating job and topology traces from history folder of multinode cluster using Rumen

I have a single-node cluster from which I got logs and gave them as input to TraceBuilder, and it works.
I have grouped a 5-node cluster under the default rack and got logs; here the job and topology traces are generated properly.
I have now set up a 5-node cluster with each node mapped to a different rack.
I have hadoop-0.20.2 set up in my Eclipse Helios, so I ran TraceBuilder using
Main Class: org.apache.hadoop.tools.rumen.TraceBuilder
I ran some jobs on the cluster and used a copy of the /usr/local/hadoop/logs/history folder of the master node as input to TraceBuilder.
Arguments: /home/arun/job.json /home/arun/topology.json /home/ubuntu/Documents/testlog
But I get
11/12/16 12:02:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
11/12/16 12:02:38 WARN rumen.TraceBuilder: TraceBuilder got an error while processing the [possibly virtual] file master_1324011575958_job_201112161029_0001_hduser_word+count within Path file:/home/ubuntu/Documents/testlog/master_1324011575958_job_201112161029_0001_hduser_word+count
java.lang.NullPointerException
at org.apache.hadoop.tools.rumen.JobBuilder.processTaskAttemptFinishedEvent(JobBuilder.java:492)
at org.apache.hadoop.tools.rumen.JobBuilder.process(JobBuilder.java:149)
at org.apache.hadoop.tools.rumen.TraceBuilder.processJobHistory(TraceBuilder.java:310)
at org.apache.hadoop.tools.rumen.TraceBuilder.run(TraceBuilder.java:264)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:83)
at org.apache.hadoop.tools.rumen.TraceBuilder.main(TraceBuilder.java:142)
.....................
It generates the job trace JSON file, but fields like hostname and location are "null" in it, and the topology trace JSON file doesn't contain the 5 nodes' info; it looks like this:
{
"name" : "<root>",
"children" : [ ]
}
Can anyone help me out?
This error occurs because no expected input file was found in the input directory.
The input directory must contain job history files, for example job_201205192032_0006_conf.xml. These files are stored inside the logs/history folder, but under subdirectories generated according to the job execution and execution date.
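For illustration, a sketch of what the input directory might need to contain for the job in the log above (the conf file name is inferred from the usual job_<jobid>_conf.xml pattern and is hypothetical here):
$ ls /home/ubuntu/Documents/testlog
job_201112161029_0001_conf.xml
master_1324011575958_job_201112161029_0001_hduser_word+count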
