Oozie + Sqoop: JDBC Driver Jar Location

I have a 6-node Cloudera-based Hadoop cluster, and I'm trying to connect to an Oracle database from a Sqoop action in Oozie.
I have copied my ojdbc6.jar into the Sqoop lib location (which for me happens to be at /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/sqoop/lib/) on all the nodes and have verified that I can run a simple 'sqoop eval' from all 6 nodes.
Now when I run the same command using Oozie's sqoop action, I get "Could not load db driver class: oracle.jdbc.OracleDriver"
I have read this article about using shared libs, and it makes sense to me when we're talking about my task/action/workflow-specific dependencies. But I see a JDBC driver installation as an extension to Sqoop, and so I think it belongs in the Sqoop installation lib.
Now the question is: while Sqoop sees this ojdbc6 jar I have put into its lib folder, how come my Oozie workflow doesn't see it?
Is this something expected or am I missing something?
As an aside, what do you guys think is the appropriate location for a JDBC driver jar?
Thanks in advance!

The JDBC driver jar (and any jars it depends on) should go in your Oozie sharelib folder on HDFS. I'm running Hortonworks Data Platform 1.2 instead of Cloudera 4.2, so the details may vary, but my JDBC driver is located in /user/oozie/share/lib/sqoop. This should allow you to run Sqoop with the JDBC driver via Oozie.
It is not necessary to put the JDBC driver jar in the sqoop lib on the data nodes. In my setup I can't even run a simple sqoop eval from the command line on my data nodes. I understand the logic for why you thought this would work. The reason the JDBC driver jar needs to be on HDFS is so that all the data nodes have access to it. Your solution should accomplish the same goal. I'm not familiar enough with the inner workings of Oozie to say why using the sharelib works but your solution does not.
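For example, a minimal sketch of the upload (assuming the HDP 1.2 layout above; the jar name and sharelib path may differ on your cluster):
# put the Oracle driver into the Oozie sharelib's sqoop folder on HDFS
hadoop fs -put ojdbc6.jar /user/oozie/share/lib/sqoop/
# verify it is there
hadoop fs -ls /user/oozie/share/lib/sqoop/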

In CDH 5, you should put the jar in '/user/oozie/share/lib/lib_${timestamp}/sqoop', and after that you must update the sharelib or restart Oozie.
To update the sharelib:
oozie admin -oozie http://localhost:11000/oozie -sharelibupdate
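For example, a sketch of the whole sequence (the lib_20160101000000 directory name below is hypothetical; list the parent folder to find the real timestamp on your cluster):
# find the current timestamped sharelib directory
hadoop fs -ls /user/oozie/share/lib
# upload the driver into its sqoop subfolder (hypothetical timestamp)
hadoop fs -put ojdbc6.jar /user/oozie/share/lib/lib_20160101000000/sqoop/
# tell Oozie to pick up the new jar
oozie admin -oozie http://localhost:11000/oozie -sharelibupdate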

If you are using CDH 5, the JDBC driver jar (and any jars it depends on) should go in the '/user/oozie/share/lib/lib_${timestamp}/sqoop' folder on HDFS.

I was facing the same issue: it was not able to find the MySQL jar. I am using Cloudera 4.4, and in this version even the oozie admin -oozie http://localhost:11000/oozie -sharelibupdate command will not work.
To resolve the issue I followed the steps below:
Create a user in Hue with HDFS admin privileges.
Using the Hue UI, upload the jar into the /user/oozie/share/lib/sqoop HDFS path,
or you can use the command below:
hadoop fs -put /var/lib/sqoop2/mysql-connector-java.jar /user/oozie/share/lib/sqoop
Once the jar is placed, run the Oozie job.

Related

How to add jar files for Hue in Cloudera?

I'm running a SQL query on a JSON SerDe table. It works in the Hive CLI, but it fails in Hue with the error:
Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
I guess it's due to a missing jar file; any idea how to add the jar file hive-hcatalog-core-1.2.1.jar for Hue?
Place your jar in HDFS and add that same path by using ADD JAR hdfs:///user/hive/lib/hive-hcatalog-core-1.2.1.jar;
Run the ADD JAR statement in Hue before your query; the jar will stay available for as long as your current session persists.
For the benefit of others who might face the same issue, either with this particular jar hive-hcatalog-core-1.2.1.jar or with any UDF jar:
In the HUE - Query Editor, run the following command:
add jar hdfs:/hive-hcatalog-core-1.2.1.jar;
Please note that single quotes are not required, as is the case with the Hive CLI.
The exact command Cloudera gives is ADD JAR {{lib_dir}}/hive/lib/hive-contrib.jar;
I am unable to find the hive/lib directory on CDH 5.
The {{lib_dir}} on CDH installed environments for Hive would either be /usr/lib/hive/ or /opt/cloudera/parcels/CDH/lib/hive/ (depending on packages or parcels being in use).
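A quick way to resolve {{lib_dir}} on a parcel-based install (a sketch; /opt/cloudera/parcels/CDH is the usual parcel symlink but may differ on your cluster):
# locate the contrib jar under the parcel layout
ls /opt/cloudera/parcels/CDH/lib/hive/lib/ | grep hive-contrib
# then add it by its full path from the query editor:
# ADD JAR /opt/cloudera/parcels/CDH/lib/hive/lib/hive-contrib.jar;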
This is the way to add the jar in Cloudera. For this you have to switch to the superuser by using this command:
sudo su
It will switch you to the superuser.

Running Spark Jobs via Oozie

Is it possible to run Spark jobs, e.g. Spark SQL jobs, via Oozie?
In the past we have used Oozie with Hadoop. Since we are now using Spark SQL on top of YARN, we are looking for a way to use Oozie to schedule jobs.
Thanks.
Yup, it's possible... The procedure is the same: you have to provide Oozie a directory structure containing coordinator.xml, workflow.xml, and a lib directory containing your jar files.
But remember that Oozie starts the job with a java -cp command, not with spark-submit, so if you have to run it with Oozie, here is a trick:
Run your jar with spark-submit in the background.
Look for that process in the process list. It will be running under a java -cp command, but with some additional jars that were added by spark-submit. Add those jars to the classpath, and that's it. Now you can run your Spark applications through Oozie.
1. nohup spark-submit --class package.to.MainClass /path/to/App.jar &
2. ps aux | grep '/path/to/App.jar'
EDIT: You can also use the latest Oozie, which has a Spark action as well.
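If you go the Spark action route, a quick sanity check (a sketch, assuming Oozie 4.1 or later and the default server URL) is to confirm that a spark sharelib actually exists:
# list the jars registered in the spark sharelib
oozie admin -oozie http://localhost:11000/oozie -shareliblist spark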
To run Spark SQL via Oozie you need to use the Oozie Spark action.
You can locate the Oozie examples archive on your distribution. On Cloudera it is usually at the path below:
]$ locate oozie.gz
/usr/share/doc/oozie-4.1.0+cdh5.7.0+267/oozie-examples.tar.gz
Spark SQL needs the hive-site.xml file for execution, which you need to provide in workflow.xml:
<spark-opts>--files /hive-site.xml</spark-opts>
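A minimal sketch of staging that file so the action can ship it (the local conf path and HDFS destination here are assumptions; adjust them to your workflow's app directory):
# copy the cluster's hive-site.xml up to HDFS next to the workflow app
hadoop fs -put /etc/hive/conf/hive-site.xml /user/oozie/apps/spark-sql/hive-site.xml
# then reference it from <spark-opts> with --files, as above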

Oozie can't find JDBC drivers in Sqoop

Reference to the previously asked question Oozie + Sqoop: JDBC Driver Jar Location,
but I am not able to find the jar in the HDFS /user/oozie/share/lib/sqoop location.
I have also tried putting the driver jars in my workflow app's lib directory. The driver-not-found error still occurs.
You need to add all lib files, like JDBC drivers etc., in the Oozie sharelib folder inside the sqoop folder.
This should resolve your issue.
To check the library files invoked/used by the job, go to the job tracker for the corresponding job; in the syslogs you will see which jars have been used.
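To see what is actually in the sharelib, a quick listing helps (a sketch; newer Oozie versions nest everything under a timestamped lib_* directory):
# older, flat sharelib layout
hadoop fs -ls /user/oozie/share/lib/sqoop
# newer, timestamped layout
hadoop fs -ls /user/oozie/share/lib/lib_*/sqoop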
The exact problem was the single quotes ('). Because of the single quotes, Oozie takes the value as a literal string, even though the same quoting was working fine when I used it in the plain Sqoop command. Use
................. --driver com.microsoft.sqlserver.jdbc.SQLServer...................
instead of:
.................. --driver 'com.microsoft.sqlserver.jdbc.SQLServer'................

Does HCatalog require installation before being used?

Can anyone please tell me whether HCatalog requires installation before being used? Or can it be used just as a jar file?
I have Cloudera running on a VM, and I can use HCatalog for my MR job, Pig, and Hive with no problem. I thought the same MR code would work on another installed Hadoop platform, but obviously that's not the case: an exception is thrown on HCatInputFormat.setInput(). When I use pig -useHCatalog, I was prompted that the usage was wrong, meaning it didn't recognize -useHCatalog as a parameter.
I hadn't thought about this before, as I had been using HCatalog on Cloudera...
Yes, you need to install and start the HCatalog server. HCatalog comes with the latest Hive tar package.
Check the Apache Hive documentation for details.
Basically you need to:
Set up a MySQL database for HCatalog
Run the server install script:
share/hcatalog/scripts/hcat_server_install.sh -r root -d dbroot -h hadoop_home -p portnum
Start the HCatalog server:
export HIVE_HOME=hive_home
$HIVE_HOME/sbin/hcat_server.sh start
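Once the server is running, a quick check (a sketch; it assumes the hcat script sits under $HIVE_HOME/hcatalog/bin, which can vary by version):
# should print the table list if the HCatalog server/metastore is reachable
$HIVE_HOME/hcatalog/bin/hcat -e "SHOW TABLES;"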
As pointed out, you do not need to install HCatalog separately if you are working with Hive 0.12 or later versions.

Using different hadoop-mapreduce-client-core.jar to run hadoop cluster

I'm working on a Hadoop cluster with CDH 4.2.0 installed and ran into this error. It has been fixed in later versions of Hadoop, but I don't have access to update the cluster. Is there a way to tell Hadoop to use this jar when running my job, via a command-line argument like
hadoop jar MyJob.jar -D hadoop.mapreduce.client=hadoop-mapreduce-client-core-2.0.0-cdh4.2.0.jar
where the new mapreduce-client-core.jar file is the patched jar from the ticket? Or must Hadoop be completely recompiled with this new jar? I'm new to Hadoop, so I don't know all the command-line options that are possible.
I'm not sure how that would work, as when you're executing the hadoop command you're actually executing code in the client jar.
Can you not use MR1? The ticket says this issue only occurs when you're using MR2, so unless you really need YARN, you're probably better off using the MR1 library to run your map/reduce jobs.
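That said, one client-side workaround sometimes worth trying (a sketch, not verified against CDH 4.2.0; the jar path is hypothetical) is to prepend the patched jar to the classpath:
# put the patched jar ahead of the bundled one for the client JVM
export HADOOP_CLASSPATH=/path/to/patched/hadoop-mapreduce-client-core.jar
export HADOOP_USER_CLASSPATH_FIRST=true
hadoop jar MyJob.jar
Note this only affects the client JVM; tasks running on the cluster still use the installed jars unless you also ship the patched jar with the job.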
