How to access a remote Hive and Hadoop file system using the sqoop import command with Kerberos authentication? - hadoop

I am using Sqoop 1.4.5 and Hadoop 3.3.4. My requirement is to connect to a remote Hive and a remote Hadoop file system using Kerberos authentication, without changing the configuration files.
Is it possible to do this without amending the Hadoop and Sqoop configuration files? If yes, which parameters need to be changed in the configuration files?

Related

How is Hive running without a hive-site.xml file?

I am trying to set up Hive on my local machine. I started all Hadoop processes and set up the {hive}/bin path. From the command prompt I can run Hive commands and create and read tables. My questions are:
1) Is hive-site.xml an optional file?
2) In the absence of hive-site.xml, how does Hive get information regarding the metastore and other configuration?
If you're running Hive queries from your local machine, which has Hadoop installed, hive-site.xml is not needed because you are talking directly to hive/bin in the Hive installation directory. You don't need to tell Hive where to find Hive. In that case Hive falls back to its built-in defaults, which include an embedded Derby metastore.
If you wanted to run Hive commands from another machine while interacting with Hive on your local machine, you'd need hive-site.xml.

Error when trying to execute kylin.sh start in HDP Sandbox 2.6

I installed Apache Kylin in the HDP sandbox 2.6, following the official installation guide (http://kylin.apache.org/docs/install/index.html).
When I run the script $KYLIN_HOME/bin/kylin.sh start, I get the error below:
What can I do to fix this error?
Thanks in advance
Check whether the Hive service is up in Ambari; when the Hive service is down, Kylin cannot find it and reports this error. Check your .bash_profile as well. Once those two issues are addressed, Kylin should be able to find the location of the Hive dependency.
Kylin uses the find-hive-dependency.sh script to set up the CLASSPATH. This script uses a Hive CLI command (I tested it with beeline) to query the Hive environment variables and extract the CLASSPATH from them.
beeline connects to Hive using the properties in kylin_hive_conf.xml, but for some reason (probably due to the Hive version included in HDP 2.6) some of the loaded Hive properties cannot be set when the connection is established.
The Hive properties that cause the issue are not needed when connecting to Hive just to query the CLASSPATH, so they can be discarded. To fix this issue:
Edit $KYLIN_HOME/conf/kylin.properties and set kylin.source.hive.client=beeline
Open the find-hive-dependency.sh script, go to approximately line 34, and modify the line
hive_env=`${beeline_shell} ${hive_conf_properties} ${beeline_params} --outputformat=dsv -e "set;" 2>&1 | grep 'env:CLASSPATH'`
Just remove ${hive_conf_properties} from that line.
Check that the Hive dependencies have been configured by running find-hive-dependency.sh.
Now $KYLIN_HOME/bin/kylin.sh start should work.
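
The two edits described in this answer can be scripted. The following is only a sketch: the function name patch_kylin_for_beeline is hypothetical, and it assumes the line to change contains the text hive_env= and the literal token ${hive_conf_properties}, as quoted above. It keeps a .bak copy of each file it touches.

```shell
# Hypothetical helper (not part of Kylin): apply the two edits from the answer above.
# Usage: patch_kylin_for_beeline <path to kylin.properties> <path to find-hive-dependency.sh>
patch_kylin_for_beeline() {
    props="$1"
    script="$2"

    # Step 1: set kylin.source.hive.client=beeline (replace an existing entry or append one).
    if grep -q '^kylin\.source\.hive\.client=' "$props"; then
        sed -i.bak 's/^kylin\.source\.hive\.client=.*/kylin.source.hive.client=beeline/' "$props"
    else
        echo 'kylin.source.hive.client=beeline' >> "$props"
    fi

    # Step 2: remove the literal ${hive_conf_properties} token from the hive_env line.
    sed -i.bak '/hive_env=/s/ *\${hive_conf_properties}//' "$script"
}
```

It would be called as patch_kylin_for_beeline "$KYLIN_HOME/conf/kylin.properties" "$KYLIN_HOME/bin/find-hive-dependency.sh"; the location of the script under bin is an assumption, so check both paths first.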

Configure hadoop-client to connect to hadoop in other machine/server

On server A I have Hadoop and Python scripts for performing tasks on Hadoop.
On server B I have Hive/Hadoop.
Is it possible to configure hadoop-client on server A to connect to Hadoop on server B?
It's not clear what Python library you are using, but assuming PySpark, you can copy or configure the HADOOP_CONF_DIR on your client machine, and it can communicate with any external Hadoop system.
At the very least, you'll need to configure a core-site.xml to communicate with HDFS and a hive-site.xml to communicate with Hive.
If you are using the PyHive library, you just connect to user@hiveserver2:10000 (10000 is the default HiveServer2 port).
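
As a rough sketch of the HADOOP_CONF_DIR approach: the helper below checks that a directory holds the two files named above and then exports it. The function name use_remote_hadoop_conf and the directory you pass in are hypothetical; the files themselves would be copied from server B.

```shell
# Hypothetical helper: point this shell at a copy of server B's client configuration.
# Usage: use_remote_hadoop_conf <directory containing core-site.xml and hive-site.xml>
use_remote_hadoop_conf() {
    conf_dir="$1"

    # Refuse to continue if either file named in the answer is missing.
    for f in core-site.xml hive-site.xml; do
        if [ ! -f "$conf_dir/$f" ]; then
            echo "missing $conf_dir/$f" >&2
            return 1
        fi
    done

    # Hadoop client tools (and Spark) read their cluster settings from this directory.
    export HADOOP_CONF_DIR="$conf_dir"
}
```

After a successful call, a command such as hdfs dfs -ls / run on server A should list the file system of server B, assuming the network and any authentication between the two machines are already in place.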

Import data between Hadoop clusters with different versions using the command line

Can you tell me the exact command to import data from HDFS between two different Hadoop versions, one running Hadoop 2.0.4-alpha and the other 2.4.0? How can I use the distcp command in this case?
When the clusters run different versions, use hftp instead of hdfs for the source. You can see examples on the Cloudera website. Use an hftp:// address for your source cluster and an hdfs:// address for your destination cluster. Because hftp is read-only, run distcp on the destination cluster.
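
A minimal sketch of the command shape this answer describes. The helper only prints the command so it can be inspected before running; the function name distcp_cmd, the host names, ports and paths are all placeholders. hftp talks to the NameNode's HTTP port (50070 by default in Hadoop 2.x), while the hdfs port shown is only a common choice.

```shell
# Hypothetical helper: print a distcp command that reads over hftp and writes over hdfs.
# Usage: distcp_cmd <source namenode host:http-port> <dest namenode host:port> <source path> <dest path>
distcp_cmd() {
    src_nn="$1"
    dst_nn="$2"
    src_path="$3"
    dst_path="$4"
    echo "hadoop distcp hftp://${src_nn}${src_path} hdfs://${dst_nn}${dst_path}"
}

# Placeholder hosts and paths; replace them with the real ones for both clusters.
distcp_cmd source-nn.example.com:50070 dest-nn.example.com:8020 /data/in /data/in
```

Run the printed command on the destination (2.4.0) cluster.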

Oozie + Sqoop: JDBC Driver Jar Location

I have a 6-node Cloudera-based Hadoop cluster and I'm trying to connect to an Oracle database from a Sqoop action in Oozie.
I have copied my ojdbc6.jar into the Sqoop lib location (which for me happens to be at: /opt/cloudera/parcels/CDH-4.2.0-1.cdh4.2.0.p0.10/lib/sqoop/lib/ ) on all the nodes and have verified that I can run a simple 'sqoop eval' from all 6 nodes.
Now when I run the same command using Oozie's Sqoop action, I get "Could not load db driver class: oracle.jdbc.OracleDriver".
I have read this article about using shared libs and it makes sense to me when we're talking about my task/action/workflow specific dependencies. But I see a JDBC driver installation as an extension to Sqoop, so I think it belongs in the Sqoop installation lib.
Now the question is: while Sqoop sees the ojdbc6 jar I have put into its lib folder, how come my Oozie workflow doesn't see it?
Is this expected, or am I missing something?
As an aside, what do you think is the appropriate location for a JDBC driver jar?
Thanks in advance!
The JDBC driver jar (and any jars it depends on) should go in your Oozie sharelib folder on HDFS. I'm running Hortonworks Data Platform 1.2 instead of Cloudera 4.2 so the details may vary, but my JDBC driver is located in /user/oozie/share/lib/sqoop. This should allow you to run Sqoop with the JDBC driver via Oozie.
It is not necessary to put the JDBC driver jar in the Sqoop lib on the data nodes. In my setup I can't run a simple sqoop eval from the command line on my data nodes. I understand the logic for why you thought this would work. The reason the JDBC driver jar needs to be on HDFS is so that all the data nodes have access to it. Your solution should accomplish the same goal. I'm not familiar enough with the inner workings of Oozie to say why using the sharelib works but your solution does not.
In CDH5, you should put the jar in '/user/oozie/share/lib/lib_${timestamp}/sqoop'; after that, you must update the sharelib or restart Oozie.
To update the sharelib:
oozie admin -oozie http://localhost:11000/oozie -sharelibupdate
If you are using CDH 5, the JDBC driver jar (and any jars it depends on) should go in the '/user/oozie/share/lib/lib_timestamp/sqoop' folder on HDFS.
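
Since the lib_<timestamp> directory name differs per installation, a small helper can pick the newest one from a directory listing. This is a sketch only: the function name latest_sharelib_dir is hypothetical, and it assumes the timestamps are digits of equal length so that a plain string sort orders them correctly.

```shell
# Hypothetical helper: read `hadoop fs -ls` output on stdin and print the newest
# lib_<timestamp> directory (assumes equal-length numeric timestamps).
latest_sharelib_dir() {
    # The path is the last field of each listing line; the "Found N items" line is filtered out.
    awk '{print $NF}' | grep '/lib_[0-9][0-9]*$' | sort | tail -n 1
}

# Intended use on a cluster (paths and URL are the ones quoted in the answers above):
#   dir=$(hadoop fs -ls /user/oozie/share/lib | latest_sharelib_dir)
#   hadoop fs -put ojdbc6.jar "$dir/sqoop/"
#   oozie admin -oozie http://localhost:11000/oozie -sharelibupdate
```

The commented commands are left inactive so the helper can be tried on saved listing output before touching HDFS.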
I was facing the same issue: Oozie was not able to find the MySQL jar. I am using Cloudera 4.4, and in this version even the oozie admin -oozie http://localhost:11000/oozie -sharelibupdate command will not work.
To resolve the issue I followed the steps below:
Create a user in Hue with hdfs and grant it admin privileges
Using the Hue UI, upload the jar into the /user/oozie/share/lib/sqoop HDFS path
or use the command below:
hadoop fs -put /var/lib/sqoop2/mysql-connector-java.jar /user/oozie/share/lib/sqoop
Once the jar is in place, run the Oozie command.
