I have set up an AWS EMR cluster with Hive. I want to connect to the Hive Thrift server from my local machine using Java. I tried the following code:
Class.forName("com.amazon.hive.jdbc3.HS2Driver");
con = DriverManager.getConnection("jdbc:hive2://ec2XXXX.compute-1.amazonaws.com:10000/default","hadoop", "");
As mentioned in the developer guide (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/HiveJDBCDriver.html), I added the Hive JDBC driver jars to the classpath.
But I am getting an exception when trying to get the connection.
I was able to connect to the Hive server on a plain Hadoop cluster using the above code (with a different JDBC driver).
Can someone please suggest if I am missing something?
Is it possible to connect to the Hive server on AWS EMR from a local machine using Hive JDBC?
(Merged Answer from the comments)
Hive is running on port 10000, but only locally, so you have to create an SSH tunnel to the EMR master node.
The following is from the documentation for Hive 0.13.1.
Create Tunnel
ssh -o ServerAliveInterval=10 -i path-to-key-file -N -L 10000:localhost:10000 hadoop@master-public-dns-name
Connect to JDBC
jdbc:hive2://localhost:10000/default
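Once the tunnel is up, a plain JDBC connection against the forwarded local port should work. A minimal Java sketch, reusing the driver class and credentials from the question (the driver jars from the developer guide are assumed to be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveTunnelConnect {
    public static void main(String[] args) throws Exception {
        // Load the Amazon Hive JDBC driver (same class as in the question).
        Class.forName("com.amazon.hive.jdbc3.HS2Driver");

        // localhost:10000 is forwarded to the EMR master through the SSH tunnel.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hadoop", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}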
You can set up the tunnel programmatically using the JSch library:
// Assumes session (com.jcraft.jsch.Session), PATH_TO_SSH_KEY_PEM, USER, REMOTE_HOST,
// LPORT, RHOST and RPORT are defined elsewhere in the class.
public static void portForwardForHive() {
    try {
        // Reuse an existing tunnel if one is already open.
        if (session != null && session.isConnected()) {
            return;
        }
        JSch jsch = new JSch();
        jsch.addIdentity(PATH_TO_SSH_KEY_PEM);
        String host = REMOTE_HOST;
        session = jsch.getSession(USER, host, 22);
        // Username and password are supplied via the UserInfo interface.
        UserInfo ui = new MyUserInfo();
        session.setUserInfo(ui);
        session.connect();
        // Forward LPORT on localhost to RHOST:RPORT on the remote side.
        int assignedPort = session.setPortForwardingL(LPORT, RHOST, RPORT);
        System.out.println("Port forwarding done for the port: " + assignedPort);
    } catch (Exception e) {
        System.out.println(e);
    }
}
Not sure if you've resolved this yet, but it's a bug in EMR that has just bitten me.
For direct JDBC connectivity like you are doing, you must include the JDBC drivers in your shaded uber-jar. For JDBC access from within DataFrames, you cannot use the jar in your uber-jar (another, unrelated bug), so you must specify it on the command line (S3 is a convenient place to keep them):
--files s3://mybucketJAR/postgresql-9.4-1201.jdbc4.jar
However, even after this you will run into another problem if you are specifically trying to access Hive. Amazon has built its own JDBC drivers with a different class hierarchy from the normal Hive driver (com.amazon.hive.jdbc41.HS2Driver), while the EMR cluster includes the standard Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) on its standard path.
The standard driver is automatically registered as being capable of handling the jdbc:hive and jdbc:hive2 URLs, so when you try to connect to a Hive URL it is found first and used, even if you specifically register the Amazon one. Unfortunately, that driver is not compatible with Amazon's EMR build of Hive.
There are two possible solutions:
1: Find the offending driver and unregister it:
Scala example:
import java.sql.DriverManager
import java.util.Collections

// Deregister the pre-installed Apache Hive driver so it no longer claims jdbc:hive2 URLs.
val jdbcDrv = Collections.list(DriverManager.getDrivers)
for (i <- 0 until jdbcDrv.size) {
  val drv = jdbcDrv.get(i)
  val drvName = drv.getClass.getName
  if (drvName == "org.apache.hive.jdbc.HiveDriver") {
    log.info(s"Deregistering JDBC Driver: ${drvName}")
    DriverManager.deregisterDriver(drv)
  }
}
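For completeness, since the original question is in Java, roughly the same loop there would look like this (a sketch, not taken from the original answer):

import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Collections;

public static void deregisterApacheHiveDriver() throws SQLException {
    // Remove the stock Apache Hive driver so it no longer claims jdbc:hive2 URLs.
    for (Driver drv : Collections.list(DriverManager.getDrivers())) {
        if ("org.apache.hive.jdbc.HiveDriver".equals(drv.getClass().getName())) {
            System.out.println("Deregistering JDBC Driver: " + drv.getClass().getName());
            DriverManager.deregisterDriver(drv);
        }
    }
}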
Or
2: As I found out later, you can specify the driver as part of the connect properties when you attempt to connect:
Scala example:
val hiveCredentials = new java.util.Properties
hiveCredentials.setProperty("user", hiveDBUser)
hiveCredentials.setProperty("password", hiveDBPassword)
hiveCredentials.setProperty("driver", "com.amazon.hive.jdbc41.HS2Driver")
val conn = DriverManager.getConnection(hiveDBURL, hiveCredentials)
This is a more "correct" version as it should override any preregistered handlers even if they have completely different class hierarchies.
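And a direct Java translation of the same approach for the original question's setup; the URL and credentials below are placeholders, and this assumes, as described above, that the Amazon driver is on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

// Placeholders: substitute your own endpoint and credentials.
String hiveDBURL = "jdbc:hive2://localhost:10000/default";
String hiveDBUser = "hadoop";
String hiveDBPassword = "";

Properties hiveCredentials = new Properties();
hiveCredentials.setProperty("user", hiveDBUser);
hiveCredentials.setProperty("password", hiveDBPassword);
hiveCredentials.setProperty("driver", "com.amazon.hive.jdbc41.HS2Driver");
Connection conn = DriverManager.getConnection(hiveDBURL, hiveCredentials);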
I'm trying to get the data from my Oracle Database to a Databricks Cluster. But I think I'm doing it wrong:
In the cluster's libraries I just installed ojdbc8.jar, and then I opened a notebook and ran this to connect:
CREATE TABLE oracle_table
USING org.apache.spark.sql.jdbc
OPTIONS (
dbtable 'table_name',
driver 'oracle.jdbc.driver.OracleDriver',
user 'username',
password 'pasword',
url 'jdbc:oracle:thin://#<hostname>:1521/<db>')
And it says:
java.sql.SQLException: Invalid Oracle URL specified
Can someone help? I've been reading the documentation, but there are no clear step-by-step instructions on how I should actually install this jar. Am I using the wrong jar? Thanks!
I have managed to set this up in Python/PySpark as follows:
# Service-name notation: jdbc:oracle:thin:@//host:port/service_name
jdbcUrl = "jdbc:oracle:thin:@//hostName:port/databaseName"
connectionProperties = {
    "user": username,
    "password": password,
    "driver": "oracle.jdbc.driver.OracleDriver"
}
query = "(select * from mySchema.myTable )"
df = spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties)
I am using the Oracle JDBC Thin Driver from instantclient-basic-linux.x64-21.5.0.0.0, as available on the Oracle web pages. The current version is 21.7, I think, but it should work the same way.
Check this link to understand the two different notations for JDBC URLs.
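For quick reference, the two notations address the database either by service name or by SID; a hedged illustration with made-up host and database names:

public class OracleJdbcUrls {
    // Service-name notation (note the //); this is what the PySpark example above uses.
    static final String SERVICE_NAME_URL = "jdbc:oracle:thin:@//dbhost.example.com:1521/ORCLPDB1";

    // SID notation (a colon before the SID, no //).
    static final String SID_URL = "jdbc:oracle:thin:@dbhost.example.com:1521:ORCL";
}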
I am trying to establish a secure connection between Spark and Oracle, as well as between Sqoop and Oracle. After my research I have found two different options for the two setups:
1. Connecting Spark to Oracle, where the password is encrypted using spark.jdbc.b64password and then decrypted in the Spark code and used in the JDBC URL.
2. Using the Hadoop credential provider to create a password file, which is then used by Sqoop to connect to Oracle.
Keeping the password in two different files doesn't seem like good practice. My question is: can we use the Hadoop credential provider in Spark to reuse the same credential profile created for Sqoop?
If you have any other option to make this better, please help.
The recommended way is to use Kerberos authentication both in Spark and Hadoop and with Oracle. The Oracle JDBC thin driver supports Kerberos authentication. A single Kerberos principal is then used to authenticate the user all the way from Spark or Hadoop to the Oracle database.
You can read the JCEKS password from inside your code in any language supported by Spark:
Python:
spark1 = (SparkSession.builder
          .appName("xyz")
          .master("yarn")
          .enableHiveSupport()
          .config("hive.exec.dynamic.partition", "true")
          .config("hive.exec.dynamic.partition.mode", "nonstrict")
          .getOrCreate())

# Point the Hadoop configuration at the credential store and read the aliased password.
x = spark1.sparkContext._jsc.hadoopConfiguration()
x.set("hadoop.security.credential.provider.path", "jceks://file///localpathtopassword")
a = x.getPassword("<password alias>")
passw = ""
for i in range(a.__len__()):
    passw = passw + str(a.__getitem__(i))
After running the above code, the password string is available in passw.
Scala:
import org.apache.hadoop.security.alias.CredentialProvider
import org.apache.hadoop.security.alias.CredentialProvider.CredentialEntry
import org.apache.hadoop.security.alias.CredentialProviderFactory
import org.apache.hadoop.conf.Configuration

val conf_H: Configuration = new org.apache.hadoop.conf.Configuration()
val alias = password_alias
val jceksPath = security_credential_provider_path
conf_H.set(CredentialProviderFactory.CREDENTIAL_PROVIDER_PATH, jceksPath)
// getPassword returns null when the alias is not found, so check before converting.
val passChars = conf_H.getPassword(alias)
if (passChars != null && passChars.nonEmpty) {
  jdbcPassword = passChars.mkString
}
You can also let Spark set hadoop.security.credential.provider.path in the Hadoop configuration like this:
"""
Create java key store with following command:
> keytool -genseckey -alias duke -keypass 123456 -storetype jceks -keystore keystore.jceks
> export HADOOP_CREDSTORE_PASSWORD=123456
"""
jceks = os.path.join(os.path.dirname(__file__), "keystore.jceks")
print(jceks)
assert os.path.isfile(jceks)
spark_session = lambda: (SparkSession
.builder
.enableHiveSupport()
.config('spark.ui.enabled', False)
.config("spark.hadoop.hadoop.security.credential.provider.path",
"jceks://file//" + jceks)
.getOrCreate())
with spark_session() as spark:
hc = spark.sparkContext._jsc.hadoopConfiguration()
jo = hc.getPassword("duke")
expected_password = ''.join(jo)
assert len(retrieved_password) > 0
The spark.hadoop.hadoop.security.credential.provider.path key looks a little weird, but Spark strips the spark.hadoop. prefix when it loads Hadoop settings.
I am trying to connect to Sybase database from a Spring Boot application.
I am using the jConnect (jconn4) JDBC driver.
The login password is encrypted at the Sybase server, so I am using the Bouncy Castle cryptography API to encrypt the password.
Below is the block of code.
This code uses plain JDBC to test the connection; later I plan to modify it to use Spring's JdbcTemplate.
Connection con = null;
try {
    Class.forName("com.sybase.jdbc4.jdbc.SybDriver");
    logger.info("Loaded Sybase driver successfully.....");
} catch (ClassNotFoundException cnfe) {
    cnfe.printStackTrace();
}
and
String url = "jdbc:sybase:Tds:<url>:<port>/<databasename>";
Properties props = new Properties();
props.put("ENCRYPT_PASSWORD", "true");
props.put("JCE_PROVIDER_CLASS", "org.bouncycastle.jce.provider.BouncyCastleProvider");
props.put("user", "username");
props.put("password", "pwd");
and
con = DriverManager.getConnection(url, props);
Login to the Sybase server succeeds when:
Running the application within the Eclipse IDE.
Login to the Sybase server fails when:
Running the generated WAR file from the Windows command prompt.
Running the generated WAR file from a Linux shell (Dev/QA server).
What I tried already:
1. Compared the versions of the jar files below that the application uses at runtime, because multiple versions of the same jar exist on the classpath (from various dependencies):
jconn4-7.07.jar (sybase jdbc driver).
bcprov-jdk16-1.43.jar (bouncycastle crypto API).
I used the code block below to find the jar used by the application at runtime:
Class<?> clazz = null;
try {
    clazz = Class.forName("org.bouncycastle.jce.provider.BouncyCastleProvider");
    logger.info("BouncyCastleProvider is ... " + clazz.toString());
    if (clazz != null && clazz.getProtectionDomain() != null
            && clazz.getProtectionDomain().getCodeSource() != null) {
        URL codeLocation = clazz.getProtectionDomain().getCodeSource().getLocation();
        logger.info("BouncyCastleProvider jar is ... " + codeLocation.toString());
    }
} catch (ClassNotFoundException e) {
    logger.info(e.getMessage());
}
2. Found 4 versions of bcprov-jdk1x-x.xx.jar through the pom.xml Dependency Hierarchy window in Eclipse and excluded three of them in the POM, to avoid version conflicts between jar files.
But it is not working. :-(
Why is it able to connect from within Eclipse but not while running as a WAR?
Any help/direction would be much appreciated.
After a lot of research and almost 10 hours of effort, I was able to solve this by adding 'CHARSET' to the connection properties:
props.put("CHARSET", "iso_1");
Moral of the story: problems may look big, but solutions often aren't. :-)
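Putting the pieces together, a consolidated sketch of the working setup with the CHARSET property included (the URL and credentials are the placeholders from the snippets above):

import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Properties;

public class SybaseConnectionTest {
    public static void main(String[] args) throws Exception {
        // Load the jConnect driver.
        Class.forName("com.sybase.jdbc4.jdbc.SybDriver");

        String url = "jdbc:sybase:Tds:<url>:<port>/<databasename>";
        Properties props = new Properties();
        props.put("user", "username");
        props.put("password", "pwd");
        props.put("ENCRYPT_PASSWORD", "true");
        props.put("JCE_PROVIDER_CLASS", "org.bouncycastle.jce.provider.BouncyCastleProvider");
        props.put("CHARSET", "iso_1"); // the property that fixed the WAR-deployment login failure

        try (Connection con = DriverManager.getConnection(url, props)) {
            System.out.println("Connected: " + !con.isClosed());
        }
    }
}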
What are the benefits of using both
hbase.master
hbase.zookeeper.quorum & hbase.zookeeper.property.clientPort
when creating a connection with HBase using the Java API?
Sample code:
Configuration hBaseConfig = HBaseConfiguration.create();
hBaseConfig.set("hbase.master", hbaseHost +":"+ port);
hBaseConfig.set("hbase.zookeeper.quorum",zookeeperHost);
hBaseConfig.set("hbase.zookeeper.property.clientPort", "2181");
Which of these settings is sufficient, or do I need both?
I'll answer your query by splitting it up.
Q1. What are the benefits of using both
hbase.master
hbase.zookeeper.quorum & hbase.zookeeper.property.clientPort
when creating a connection with HBase using the Java API?
Solution: The benefit is that you are able to access HBase through the Java API; for that, the HBase master and ZooKeeper services must be up on your server, which is mandatory.
Q2 : Sample code:
Configuration hBaseConfig = HBaseConfiguration.create();
hBaseConfig.set("hbase.master", hbaseHost +":"+ port);
hBaseConfig.set("hbase.zookeeper.quorum",zookeeperHost);
hBaseConfig.set("hbase.zookeeper.property.clientPort", "2181");
Which of these settings is sufficient, or do I need both?
Solution: You would require both. However, you could also add hbase-site.xml to your classpath; it is available under the hbase/conf directory on the machine where HBase is installed. Along with that, you need to add core-site.xml from the hadoop/conf directory.
For more information, check out https://hbase.apache.org/book.html#_examples
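For illustration, a minimal sketch of opening a connection and a table with the client API (ConnectionFactory), assuming hbase-site.xml is on the classpath or the ZooKeeper quorum is set explicitly; the hostname and table name below are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;

public class HBaseConnectionExample {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml / core-site.xml from the classpath if present.
        Configuration hBaseConfig = HBaseConfiguration.create();
        // Otherwise, set the ZooKeeper quorum and client port explicitly, as in the question.
        hBaseConfig.set("hbase.zookeeper.quorum", "zk-host.example.com");
        hBaseConfig.set("hbase.zookeeper.property.clientPort", "2181");

        try (Connection connection = ConnectionFactory.createConnection(hBaseConfig);
             Table table = connection.getTable(TableName.valueOf("my_table"))) {
            System.out.println("Connected to table: " + table.getName());
        }
    }
}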
I am using Hadoop 0.20.0 and Hive 0.8.0. I now have data in a Hive table and want to generate reports from it. For that I am using iReport 4.5.0, and I also downloaded HivePlugin-0.5.nbm for iReport.
Now I am going to set up the Hive connection in iReport:
Create New Data source --> New --> Hive Connection
Jdbc Driver: org.apache.hadoop.hive.jdbc.HiveDriver
Jdbc URl: jdbc:hive//localhost:10000/default
Server Address: localhost
Database: default
user name: root
password: somepassword
Then I click on the Test connection button.
I am getting an error like:
Exception
Message:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: java.lang.RuntimeException: Illegal Hadoop Version: Unknown (expected A.B.* format)
Level:
SEVERE
Stack Trace:
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException:
java.lang.RuntimeException: Illegal Hadoop Version: Unknown (expected A.B.* format)
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:226)
org.apache.hadoop.hive.jdbc.HiveConnection.<init>(HiveConnection.java:72)
org.apache.hadoop.hive.jdbc.HiveDriver.connect(HiveDriver.java:110)
com.jaspersoft.ireport.designer.connection.JDBCConnection.getConnection(JDBCConnection.java:140)
com.jaspersoft.ireport.hadoop.hive.connection.HiveConnection.getConnection(HiveConnection.java:48)
com.jaspersoft.ireport.designer.connection.JDBCConnection.test(JDBCConnection.java:447)
com.jaspersoft.ireport.designer.connection.gui.ConnectionDialog.jButtonTestActionPerformed(ConnectionDialog.java:335)
com.jaspersoft.ireport.designer.connection.gui.ConnectionDialog.access$300(ConnectionDialog.java:43)
Can anyone help me with this? Where am I going wrong, or what am I missing?
"I also download HivePlugin-0.5.nbm in iReport."
This isn't clear. iReport 4.5 has the Hadoop Hive connector pre-installed. Why did you download the connector separately? Did you install this plugin?
Create New Data source --> New --> Hive Connection
Jdbc Driver: org.apache.hadoop.hive.jdbc.HiveDriver
...
This isn't possible with the current Hadoop Hive connector. When you create a new "Hadoop Hive Connection" you are given only one parameter to fill out: the url.
I'm guessing that you created a JDBC connection when you meant to create a Hadoop Hive connection. This is a logical thing to do. Hive is accessed via JDBC. But the Hive JDBC driver is still pretty new. It has a number of shortcomings. That's why the Hive connector was added to iReport. It is based on the Hive JDBC driver, but it includes a wrapper around it to avoid some problems.
Or maybe you installed an old Hive connector over the top of the one that's already included with iReport 4.5. At some point in the past the Hive connector let you fill in extra information like the JDBC Driver.
Start with a fresh iReport installation, and make sure you use the Hadoop Hive Connection. That should clear it up.
The error "java.lang.RuntimeException: Illegal Hadoop Version: Unknown (expected A.B.* format)" happens because the VersionInfo class in hadoop-common.jar attempts to locate the version info using the current thread's class loader.
https://github.com/apache/hadoop/blob/release-2.6.0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/VersionInfo.java#L41-L58
The code in question looks like this...
package org.apache.hadoop.util;
...
public class VersionInfo {
  ...
  protected VersionInfo(String component) {
    info = new Properties();
    String versionInfoFile = component + "-version-info.properties";
    InputStream is = null;
    try {
      is = Thread.currentThread().getContextClassLoader()
          .getResourceAsStream(versionInfoFile);
      if (is == null) {
        throw new IOException("Resource not found");
      }
      info.load(is);
    } catch (IOException ex) {
      LogFactory.getLog(getClass()).warn("Could not read '" +
          versionInfoFile + "', " + ex.toString(), ex);
    } finally {
      IOUtils.closeStream(is);
    }
  }
If your tool attempts to connect to the data source from a separate thread whose context class loader cannot see that resource, it will generate this error.
The easiest way to work around the issue is to put the hadoop-common.jar library in $JAVA_HOME/lib/ext, or to use the command-line setting -Djava.endorsed.dirs to point to the hadoop-common.jar library. Then the thread's class loader will always be able to find this information.
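If you want to confirm this is the cause before changing classpaths, a small check you can drop into the thread that performs the connection is to ask its context class loader for the same resource VersionInfo loads (for hadoop-common the component name is "common", so the file is common-version-info.properties):

// Prints null if the connecting thread's context class loader cannot see the
// Hadoop version resource, in which case VersionInfo reports "Unknown".
java.net.URL res = Thread.currentThread().getContextClassLoader()
        .getResource("common-version-info.properties");
System.out.println("common-version-info.properties resolved to: " + res);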