Running Cascading using Oozie - hadoop

I'm trying to run a Cascading job using Oozie.
I am getting a java.lang.ClassNotFoundException: cascading.tap.hadoop.MultiInputSplit
I am including the Cascading jar in the workflow lib, but it is not being included when Cascading launches the MapReduce job.
Is there anyone out there using Cascading along with Oozie?

You should combine the Cascading jar with your own jar and put the result in the workflow's lib directory.
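For example (every path and file name below is a placeholder, not something from the question), assuming the workflow application lives at /user/me/apps/cascading-wf in HDFS and the combined jar built with, say, the Maven Shade or Assembly plugin is target/my-cascading-job.jar:
# create the workflow's lib directory and upload the combined jar into it
hdfs dfs -mkdir -p /user/me/apps/cascading-wf/lib
hdfs dfs -put -f target/my-cascading-job.jar /user/me/apps/cascading-wf/lib/
The idea is that Cascading submits the jar containing your flow classes as the job jar of the MapReduce jobs it launches, so bundling the Cascading classes into that same jar is what makes classes like cascading.tap.hadoop.MultiInputSplit visible to the tasks as well.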

Related

How HBase adds its dependency jars and uses HADOOP_CLASSPATH

48. HBase, MapReduce, and the CLASSPATH
By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either the HBase configuration under $HBASE_CONF_DIR or the HBase classes.
To give the MapReduce jobs the access they need, you could add hbase-site.xml to $HADOOP_HOME/conf and add HBase jars to the $HADOOP_HOME/lib directory. You would then need to copy these changes across your cluster. Or you could edit $HADOOP_HOME/conf/hadoop-env.sh and add HBase dependencies to the HADOOP_CLASSPATH variable. Neither of these approaches is recommended because it will pollute your Hadoop install with HBase references. It also requires you to restart the Hadoop cluster before Hadoop can use the HBase data.
The recommended approach is to let HBase add its dependency jars and use HADOOP_CLASSPATH or -libjars.
I'm learning how HBase interacts with MapReduce.
I know what the above two ways mean, but I don't know how to configure the recommended way.
Could anyone tell me how to configure it in the recommended way?
As the docs show, prior to running hadoop jar you can export HADOOP_CLASSPATH=$(hbase classpath), and you can use hadoop jar ... -libjars [...].
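A minimal sketch of both, with a placeholder jar and driver class (my-hbase-mr.jar, com.example.MyDriver) that are not from the question:
# put the full HBase classpath on the JVM that launches the job
export HADOOP_CLASSPATH="$(hbase classpath)"
hadoop jar my-hbase-mr.jar com.example.MyDriver input output
# or ship only the HBase MapReduce dependencies with the job via -libjars;
# 'hbase mapredcp' prints them colon-separated, -libjars wants commas, and the
# driver must go through ToolRunner/GenericOptionsParser for -libjars to be picked up
export HADOOP_CLASSPATH="$(hbase mapredcp)"
hadoop jar my-hbase-mr.jar com.example.MyDriver -libjars "$(hbase mapredcp | tr ':' ',')" input output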
The true recommended way would be to bundle your HBase dependencies as an uber JAR in your MapReduce application.
The only caveat is that you need to ensure that your project uses the same/compatible hbase-mapreduce client versions as the server.
That way, you don't need any extra configuration, except maybe specifying the hbase-site.xml.
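A rough sketch of that route, where the build tool, paths, and class names are all assumptions:
# build one jar that bundles your code plus the hbase-mapreduce client and its
# dependencies, e.g. with the Maven Shade or Assembly plugin
mvn -DskipTests package
# keep the directory holding hbase-site.xml on the client classpath so the
# driver picks up the cluster connection details
export HADOOP_CLASSPATH=/etc/hbase/conf
hadoop jar target/my-hbase-mr-shaded.jar com.example.MyDriver input output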

Correct way of submitting a job to a YARN cluster when the job has dependencies on external jars?

I am trying to understand the correct way of submitting an MR (or, for that matter, a Spark-based Java) job to a YARN cluster.
Consider the situation below:
A developer writes (MR or Spark) jobs on a client machine, and say the code uses third-party jars. When the developer has to submit the job to the YARN cluster, what is the correct way of submitting it so that there is no runtime exception of class not found? Since the job is submitted as a jar file, how can a developer "put" in the third-party jars?
I am having difficulty understanding this; can anyone help me understand it?
You have to simply build a "fat jar," with Gradle or Maven, that contains not only your compiled code but also all transitive dependencies.
You can use either the Maven Assembly Plugin or any of the Gradle plugins like the Shadow Plugin.
The output of these is what you should supply to spark-submit.
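For instance, with a Gradle project that applies the Shadow plugin and a placeholder main class com.example.MyDriver (neither is from the question):
# build a self-contained jar with all transitive dependencies baked in
./gradlew shadowJar
# submit it; no extra --jars / -libjars flags are needed because everything is inside the jar
spark-submit --master yarn --deploy-mode cluster --class com.example.MyDriver build/libs/myapp-all.jar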

Add Spark to Oozie shared lib

By default, the Oozie shared lib directory provides libraries for Hive, Pig, and MapReduce. If I want to run a Spark job on Oozie, it might be better to add the Spark lib jars to Oozie's shared lib instead of copying them to the app's lib directory.
How can I add Spark lib jars (including spark-core and its dependencies) to Oozie's shared lib? Any comment / answer is appreciated.
The Spark action is scheduled to be released with Oozie 4.2.0, even though the docs seem to be a bit behind. See the related JIRA here:
Oozie JIRA - Add spark action executor
Cloudera's CDH 5.4 release has it already, though; see the official doc here:
CDH 5.4 oozie doc - Oozie Spark Action Extension
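On those versions the ShareLib bundled with Oozie should already contain a spark directory, so it is mostly a matter of (re)installing the ShareLib and telling the server to pick it up; the paths and the Oozie URL below are assumptions:
# install (or reinstall) the bundled ShareLib into HDFS
oozie-setup.sh sharelib create -fs hdfs://namenode:8020 -locallib /usr/lib/oozie/oozie-sharelib.tar.gz
# make the running server refresh its view of the ShareLib, then verify
oozie admin -oozie http://oozie-host:11000/oozie -sharelibupdate
oozie admin -oozie http://oozie-host:11000/oozie -shareliblist spark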
With older versions of Oozie, the jars can be shared in various ways; the first approach below may work best. Here is the complete list anyway:
Below are the various ways to include a jar with your workflow (a short shell sketch of the first two options follows this list):
1. Set oozie.libpath=/path/to/jars,another/path/to/jars in job.properties.
This is useful if you have many workflows that all need the same jar; you can put it in one place in HDFS and use it with many workflows. The jars will be available to all actions in that workflow.
There is no need to ever point this at the ShareLib location. (I see that in a lot of workflows.) Oozie knows where the ShareLib is and will include it automatically if you set oozie.use.system.libpath=true in job.properties.
2. Create a directory named “lib” next to your workflow.xml in HDFS and put jars in there.
This is useful if you have some jars that you only need for one workflow. Oozie will automatically make those jars available to all actions in that workflow.
3. Specify the <archive> tag in an action with the path to a single jar; you can have multiple <archive> tags.
This is useful if you want some jars only for a specific action and not all actions in a workflow.
The downside is that you have to specify them in your workflow.xml, so if you ever need to add/remove some jars, you have to change your workflow.xml.
4. Add jars to the ShareLib (e.g. /user/oozie/share/lib/lib_<timestamp>/pig).
While this will work, it’s not recommended for two reasons:
The additional jars will be included with every workflow using that ShareLib, which may be unexpected to those workflows and users.
When upgrading the ShareLib, you’ll have to recopy the additional jars to the new ShareLib.
quoted from Robert Kanter's blog here: How-to: Use the ShareLib in Apache Oozie (CDH 5)
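As promised above, a small sketch covering the first two options, where every path, host name, and file name is a placeholder:
# option 2: a lib directory next to workflow.xml; every jar in it ends up on the
# classpath of all actions in the workflow
hdfs dfs -put -f my-extra.jar /user/me/apps/my-wf/lib/
# option 1: point oozie.libpath at shared jar directories in HDFS, and let Oozie
# pull in the ShareLib itself via oozie.use.system.libpath
cat > job.properties <<'EOF'
oozie.wf.application.path=hdfs://namenode:8020/user/me/apps/my-wf
oozie.libpath=/user/me/shared-jars
oozie.use.system.libpath=true
EOF
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run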

How to use the Apache Oozie Java lib

I'm new to Apache Oozie. As far as I understand, we can define a workflow of actions using either a workflow file or a coordinator file, both in XML format, and submit it to the Oozie engine. However, there is an Oozie Java library as well, and I'd like to know how I can use it. Is it for programmatically generating the XML files and submitting them to the engine? Can someone point me to an example, please?

Adding dependent jars for a UDF in Pig

I have a UDF which I use to do custom processing on the records. In the eval function I am using a third-party jar for the processing. I looked at the job jar file, but it does not include this dependency. Is there any way to include the dependent jar in the job jar?
(For testing I am running the cluster in the local mode).
Or can I use the distributed cache to make the dependent jar available to the UDF?
I've tried registering the dependent jars in Pig. With the first registered jar (all the UDFs are bundled in it) I do not face any issues, but with the second jar I run into problems when the UDF tries to access a class from it.
REGISTER '/home/user/pig/udfrepository/projectUDF.jar';
REGISTER '/home/user/thridpartyjars/xyz.jar';
The logs I get on the console are like this:
2013-08-11 10:35:02,485 [Thread-14] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
java.lang.NoSuchMethodError: org.xyz.abc.convertToOtherFormat(Lorg/DateTimeZone;)Lorg/DateTime;
at com.myproject.MyUDF.exec(MyUDF.java:70)
Any help on this is highly appreciated.
Thanks in advance.
I found the same issue resolved and documented here:
http://hadooptips.wordpress.com/2013/08/13/nosuchmethoderror-while-using-joda-time-2-2-jar-in-pig/
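A NoSuchMethodError for a method that definitely exists in the registered jar usually means an older copy of the same library sits earlier on the classpath; a quick, hedged way to check for a conflicting joda-time (paths are placeholders, and PIG_HOME may not be set on your machine):
# list every joda-time jar visible to Hadoop (and hence to Pig's MapReduce jobs)
hadoop classpath | tr ':' '\n' | grep -i joda
# also check the Pig installation itself
ls "$PIG_HOME"/lib 2>/dev/null | grep -i joda
If an older copy shows up, the error comes from that jar being loaded ahead of the one registered in the script.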
