I want to create an Oozie workflow that would use the MapReduceIndexerTool to take my data and index it. I've managed to get it working using a Shell action, which calls my script to execute the following command:
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool \
-D 'mapred.child.java.opts=-Xmx500m' \
--morphline-file morphline.conf \
--output-dir hdfs://cloudera1:8020/user/nicolas/outdir \
--verbose --go-live --zk-host cloudera2:2181/solr \
--collection Test_Collection hdfs:///user/nicolas/indir
It finds all the files and directories it needs, and the workflow will finish successfully. However, I would like to add my custom Morphlines command to modify some of the data. I have been following the kitesdk guide to do just that. I packaged my code into a jar and uploaded it to hdfs://cloudera1:8020/user/nicolas/custom-command.jar through the Hue File Browser. I've also updated my morphline.conf so that I import my package, and use my command. If I just include the file in my workflow, the following error occurs:
Error: org.kitesdk.morphline.api.MorphlineCompilationException: No command builder registered for name: tweakData ...
I'm assuming that the MapReduceIndexerTool is having trouble finding my jar. So, I decided to add the --libjars parameter to my script:
hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar \
org.apache.solr.hadoop.MapReduceIndexerTool \
--libjars hdfs://cloudera1:8020/user/nicolas/custom-command.jar ...
When I do that, a different error occurs:
WARN org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException
as:yarn (auth:SIMPLE) cause:java.io.FileNotFoundException: File does not exist:
TL;DR: How do I include the jar for my custom Morphlines command so that it is found by Oozie / YARN?
You can add the jars to the lib subdirectory of the Oozie workflow's HDFS directory, or add the jar via the <file> tag in the Oozie shell action.
--libjars requires a file on the local file system (rather than an HDFS file).
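Following that advice, the script called by the shell action could copy the jar out of HDFS before passing it to --libjars. This is only a sketch: the helper name local_jar_path and the staging directory /tmp/libjars are made up for illustration, and the cluster commands are left commented out.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hadoop): map a jar location to the local
# path that should be passed to --libjars. HDFS URIs are mapped into a local
# staging directory; anything else is assumed to be a local path already.
local_jar_path() {
  local jar="$1" staging="${2:-/tmp/libjars}"
  case "$jar" in
    hdfs://*) printf '%s/%s\n' "$staging" "${jar##*/}" ;;
    *)        printf '%s\n' "$jar" ;;
  esac
}

# Intended use inside the action's script (needs a cluster, so commented out):
# jar=hdfs://cloudera1:8020/user/nicolas/custom-command.jar
# mkdir -p /tmp/libjars
# hdfs dfs -get "$jar" "$(local_jar_path "$jar")"
# ...then pass --libjars "$(local_jar_path "$jar")" to MapReduceIndexerTool
```

Since the shell action can run on any node, the copy has to happen inside the script itself rather than once on an edge node.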
I am trying to run a Sqoop action in Oozie, but mysql-connector-java.jar is not present in /user/oozie/share/lib/sqoop, and because I have no permission I am not able to add the jar there for now.
Is there any way or workaround to include mysql-connector-java.jar in workflow.xml?
I have placed the jar in the sqoop apps/lib directory, but it's not working.
In general, the Hadoop administrator should keep all common libraries in the Hadoop distribution to make usage more efficient. If that is not the case, try the following -libjars option:
sqoop import \
-libjars /file/location/path/mysql-connector-java.jar \
--connect jdbc:mysql://localhost:3306/retail_db \
--username root \
--password xyzpwd \
--table order_items \
--target-dir /user/cloudera/landing_zone/sqoop_import/order_items
As per the Sqoop documentation:
-libjars specify a comma separated jar files to include in the classpath. The -files, -libjars, and -archives arguments are not typically used with Sqoop, but they are included as part of Hadoop’s internal argument-parsing system.
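If more than one jar is needed, the list has to be comma separated. A small sketch of building that list in a script; join_jars is a made-up helper name and the second jar path is a placeholder, not something from this thread.

```shell
#!/usr/bin/env bash
# Join any number of jar paths with commas, the separator -libjars expects.
join_jars() {
  local IFS=','
  printf '%s\n' "$*"
}

# other-driver.jar is a placeholder path used only for illustration.
LIBJARS=$(join_jars /file/location/path/mysql-connector-java.jar /file/location/path/other-driver.jar)
printf '%s\n' "-libjars $LIBJARS"
```

The resulting value goes directly after the tool name (sqoop import -libjars ...), before the Sqoop-specific arguments.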
I am getting the error below while executing a sqoop export command (in a shell script) with Oozie.
"java.lang.RuntimeException: Could not load db driver class: oracle.jdbc.OracleDriver"
Running sqoop export from the CLI (edge node) works fine.
I have added ojdbc6.jar to the locations below.
(HDFS locations)
/user/oozie/share/lib/sqoop/ and
I have also set oozie.use.system.libpath=true in my Oozie job.properties file.
Please guide me if I am missing any setting.
log content
Make sure that you upload the file to the directory /user/oozie/share/lib/sqoop (it could look like /user/oozie/share/lib/lib_${timestamp}/sqoop for Cloudera and HDP).
Check that the ojdbc6.jar file is correct: check that it contains OracleDriver.class and make sure the size of the file is OK. There could have been an error while downloading.
Check the permissions on the ojdbc6.jar file (if necessary, you can try giving this file 755 permissions). Check who the owner of the file is; it should be oozie by default.
Update the Oozie sharelib by executing the command below (run this command on the host where the Oozie Server is located):
sudo -u oozie oozie admin -oozie http://<Oozie_Server_Host>:11000/oozie -sharelibupdate
Verify sharelib for sqoop:
sudo -u oozie oozie admin -oozie http://<Oozie_Server_Host>:11000/oozie -shareliblist sqoop*
You can always restart Oozie service. It should update sharelib.
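The "does the jar contain OracleDriver.class" check from the list above can be scripted roughly like this. The function name listing_has_class is hypothetical; it only inspects an archive listing on stdin, so the listing itself has to come from a tool such as unzip -l or jar tf.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read an archive listing on stdin and succeed only if
# the given fully qualified class name appears in it as a .class entry.
listing_has_class() {
  local class_path
  class_path="$(printf '%s' "$1" | tr '.' '/').class"
  grep -qF "$class_path"
}

# Intended use (needs unzip and the real jar, so commented out):
# unzip -l ojdbc6.jar | listing_has_class oracle.jdbc.OracleDriver && echo "driver class found"
```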
Create a directory named lib next to your workflow.xml in HDFS and put jars in there. Oozie will automatically make those jars available to all actions in that workflow.
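As a sketch of what that looks like in practice, the function below prints the HDFS commands instead of running them, so the output can be reviewed first. stage_workflow_libs and the workflow path in the example call are assumptions for illustration; hdfs dfs -mkdir -p and hdfs dfs -put -f are the standard HDFS shell commands.

```shell
#!/usr/bin/env bash
# Print (rather than run) the commands that stage jars in <workflow dir>/lib.
# Pipe the output to sh on a machine with the hdfs client to execute them.
stage_workflow_libs() {
  local wf_dir="$1"
  shift
  printf 'hdfs dfs -mkdir -p %s/lib\n' "$wf_dir"
  local jar
  for jar in "$@"; do
    printf 'hdfs dfs -put -f %s %s/lib/\n' "$jar" "$wf_dir"
  done
}

# /user/oozie/apps/sqoop-export is a made-up workflow directory.
stage_workflow_libs /user/oozie/apps/sqoop-export ojdbc6.jar
```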
Cloudera users should check this article. Especially paragraph 'One Last Thing'.
The page here (http://spark.apache.org/docs/latest/programming-guide.html) indicates packages can be included when the shell is launched via:
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.4.0
What is the syntax for including local packages (that are downloaded manually say)? Something to do with Maven coords?
If the jars are present on the master/workers, you simply need to specify them on the classpath in spark-submit:
spark-shell \
--conf spark.driver.extraClassPath="/path/to/jar/spark-csv_2.11.jar"
If the jars are only present on the master and you want them to be sent to the workers (only works for client mode), you can add the --jars flag, which takes a comma-separated list:
spark-shell \
--conf spark.driver.extraClassPath="/path/to/jar/spark-csv_2.11.jar" \
--conf spark.executor.extraClassPath="spark-csv_2.11.jar" \
--jars "/path/to/jar/jary.jar,/path/to/other/other.jar"
For a more elaborated answer see Add jars to a Spark Job - spark-submit
Please use:
./spark-shell --jars my_jars_to_be_included
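If the manually downloaded jars all sit in one local directory, the --jars value can be assembled from that directory. A minimal sketch; jars_in_dir is a made-up helper, and /path/to/jar in the usage comment is the placeholder path from the answer above.

```shell
#!/usr/bin/env bash
# Build a comma-separated list of every .jar file in a directory,
# suitable as the value of --jars.
jars_in_dir() {
  local dir="$1" list="" jar
  for jar in "$dir"/*.jar; do
    [ -e "$jar" ] || continue   # skip the literal pattern when nothing matches
    list="${list:+$list,}$jar"
  done
  printf '%s\n' "$list"
}

# Intended use (needs Spark, so commented out):
# ./spark-shell --jars "$(jars_in_dir /path/to/jar)"
```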
I have a Spark (Spark 1.5.2) application that streams data from Kafka to HDFS. My application contains two Typesafe config files to configure certain things like Kafka topic etc.
Now I want to run my application with spark-submit (cluster mode) in a cluster.
The jar file with all dependencies of my project is stored on HDFS.
As long as my config files are included in the jar file everything works fine. But this is impractical for testing purposes because I always have to rebuild the jar.
Therefore I excluded the config files from my project and added them via "driver-class-path". This worked in client mode, but if I now move the config files to HDFS and run my application in cluster mode, it can't find the settings. Below you can find my spark-submit command:
/usr/local/spark/bin/spark-submit \
--total-executor-cores 10 \
--executor-memory 15g \
--verbose \
--deploy-mode cluster \
--class com.hdp.speedlayer.SpeedLayerApp \
--driver-class-path hdfs://iot-master:8020/user/spark/config \
--master spark://spark-master:6066 \
I already tried it with the --files parameter, but that didn't work either. Does anybody know how I can fix this?
I did some further research and figured out that it could be related to the HDFS path. I changed the HDFS path to "hdfs:///iot-master:8020//user//spark//config", but unfortunately that didn't work either. Maybe this helps you, though.
Below you can also see the error I get when I run the driver program in cluster mode:
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.ExceptionInInitializerError
at com.speedlayer.SpeedLayerApp.main(SpeedLayerApp.scala)
... 6 more
Caused by: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'application'
at com.typesafe.config.impl.SimpleConfig.findKey(SimpleConfig.java:124)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:145)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:159)
at com.typesafe.config.impl.SimpleConfig.find(SimpleConfig.java:164)
Trying to achieve the same result, I found out the following:
--files: applies only to local files on the machine running the spark-submit command and is converted to conf.addFile(), so HDFS files won't work unless you are able to run hdfs dfs -get <....> beforehand to retrieve the file. In my case I want to run it from Oozie, so I don't know which machine it is going to run on, and I don't want to add a copy-file action to my workflow.
The quote @Yuval_Itzchakov took refers to --jars, which only handles jars since it is converted to conf.addJar().
So as far as I know there is no straightforward way to load a configuration file from HDFS.
My approach was to pass the path to my app, read the configuration file, and merge it into the reference file:
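import java.io.File
import java.net.URI
import com.typesafe.config.{Config, ConfigFactory}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

private val HDFS_IMPL_KEY = "fs.hdfs.impl"

def loadConf(pathToConf: String): Config = {
  val path = new Path(pathToConf)
  val confFile = File.createTempFile(path.getName, "tmp")
  // Copy the HDFS file to a local temp file so Typesafe Config can read it.
  getFileSystemByUri(path.toUri).copyToLocalFile(path, new Path(confFile.getAbsolutePath))
  // Assumed final step (missing from the snippet as posted): parse the copy
  // and layer it over the reference configuration.
  ConfigFactory.load(ConfigFactory.parseFile(confFile))
}

def getFileSystemByUri(uri: URI): FileSystem = {
  val hdfsConf = new Configuration()
  hdfsConf.set(HDFS_IMPL_KEY, classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
  FileSystem.get(uri, hdfsConf)
}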
P.S. The error only means that ConfigFactory didn't find any configuration file, so it couldn't find the property you are looking for.
One option is to use the --files flag with the HDFS location and point the executors at the file with -Dconfig.file, passed through spark.executor.extraJavaOptions (it is a JVM option, not a classpath entry):
Spark uses the following URL scheme to allow different strategies for disseminating jars:
file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.
hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected
local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
Also, you can see it when looking at the help documentation for spark-submit:
--files FILES Comma-separated list of files to be placed in the working
directory of each executor.
Running with spark-submit:
/usr/local/spark/bin/spark-submit \
--total-executor-cores 10 \
--executor-memory 15g \
--conf "spark.executor.extraJavaOptions=-Dconfig.file=application.conf" \
--verbose \
--deploy-mode cluster \
--class com.hdp.speedlayer.SpeedLayerApp \
--driver-class-path hdfs://iot-master:8020/user/spark/config \
--files hdfs:/path/to/conf \
--master spark://spark-master:6066 \
I am getting the following error while running the following Oozie command:
hadoop@master1:~/work/oozie-4.1.0/bin$ oozie-setup.sh -hadoop 0.20.200 $HADOOP_HOME -extjs /home/hadoop/work/ext-2.2.zip
Usage : oozie-setup.sh
prepare-war [-d directory] [-secure] (-d identifies an alternative directory
for processing jars
-secure will configure the war file to use HTTPS (SSL))
sharelib create -fs FS_URI [-locallib SHARED_LIBRARY] (create sharelib for
FS_URI is the fs.default.name
for hdfs uri; SHARED_LIBRARY, path to the
Oozie sharelib to install, it can be a tarball
or an expanded version of it. If ommited,
the Oozie sharelib tarball from the Oozie
installation directory will be used)
(action failes if sharelib is already installed
in HDFS)
sharelib upgrade -fs FS_URI [-locallib SHARED_LIBRARY] (upgrade existing
sharelib, fails if there
is no existing sharelib installed in HDFS)
db create|upgrade|postupgrade -run [-sqlfile ] (create, upgrade or postupgrade oozie db with an
optional sql File)
(without options prints this usage information)
EXTJS can be downloaded from http://www.extjs.com/learn/Ext_Version_Archives
Any idea how to solve this?
ExtJS is for the web UI, right?
The usage output shows that you need to prepare the WAR, which means adding ExtJS by running the command
oozie-setup.sh prepare-war
or by
addtowar.sh -extjs EXTJS_PATH
Make sure that the ExtJS file is present in the libext directory under the Oozie home when using prepare-war.
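A rough sketch of the libext step, assuming the Oozie home and zip locations from the question. stage_extjs is a made-up helper that only copies the archive into place; the prepare-war call is left commented out because it needs the actual Oozie installation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: copy the ExtJS archive into <oozie home>/libext,
# creating the directory if it does not exist yet.
stage_extjs() {
  local oozie_home="$1" extjs_zip="$2"
  mkdir -p "$oozie_home/libext" && cp "$extjs_zip" "$oozie_home/libext/"
}

# Intended use with the paths from the question (commented out):
# stage_extjs ~/work/oozie-4.1.0 /home/hadoop/work/ext-2.2.zip
# ~/work/oozie-4.1.0/bin/oozie-setup.sh prepare-war
```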