I have Hadoop 3.1.2 and Hive 3.1.2 on a cluster, and I want to connect to Hive with presto-server-0.265.1.
I have just one catalog file, /opt/presto/etc/catalog/hive.properties, which contains:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://192.168.49.13:9083
The Presto service runs, but it cannot connect to Hive. I suspect this is because I use Hadoop 3, and when I change hive.properties the Presto service no longer starts.
How can I connect to Hadoop 3?
Update:
It wasn't about Hadoop after all. The Hive metastore was not installed correctly, so Presto had a problem connecting to it.
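For anyone hitting the same thing: before touching the Presto catalog, it helps to confirm the metastore Thrift port is reachable at all. A minimal sketch in Python, probing the URI from hive.properties above:

import socket

# Probe the metastore Thrift endpoint configured in hive.properties.
try:
    with socket.create_connection(("192.168.49.13", 9083), timeout=5):
        print("Metastore port is reachable")
except OSError as err:
    print(f"Cannot reach the metastore: {err}")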
I am stuck on how to use PySpark to fetch data from Hive Server using JDBC.
I am trying to connect to HiveServer2 running on my local machine from PySpark using JDBC. All components (HDFS, PySpark, HiveServer2) are on the same machine.
Here is the code I am using to connect:
connProps = {"username": "hive", "password": "", "driver": "org.apache.hive.jdbc.HiveDriver"}
sqlContext.read.jdbc(url="jdbc:hive2://127.0.0.1:10000/default", table="pokes", properties=connProps)

dataframe_mysql = sqlContext.read.format("jdbc").option("url", "jdbc:hive2://localhost:10000/default") \
    .option("driver", "org.apache.hive.jdbc.HiveDriver").option("dbtable", "pokes") \
    .option("user", "hive").option("password", "").load()
Both methods above give me the same error:
org.apache.spark.sql.AnalysisException: java.lang.RuntimeException:
java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
javax.jdo.JDOFatalDataStoreException: Unable to open a test connection
to the given database. JDBC url =
jdbc:derby:;databaseName=metastore_db;create=true, username = APP.
Terminating connection pool (set lazyInit to true if you expect to
start your database after your app).
ERROR XSDB6: Another instance of Derby may have already booted the database /home///jupyter-notebooks/metastore_db
metastore_db is located in the same directory where my Jupyter notebooks are created, but hive-site.xml points to a different metastore location.
I have already checked other questions about this error; they say another spark-shell or similar process is running, but it isn't. Even if I run the following command while HiveServer2 and HDFS are down, I get the same error:
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
I am able to connect to Hive from a Java program using JDBC. Am I missing something here? Please help. Thanks in advance.
Spark should not use JDBC to connect to Hive.
It reads from the metastore directly and skips HiveServer2.
However, "Another instance of Derby may have already booted the database" means that you're running Spark from another session, such as another Jupyter kernel that's still running. Try setting a different metastore location, or set up a remote Hive metastore backed by a local MySQL or Postgres database and edit $SPARK_HOME/conf/hive-site.xml with that information.
From SparkSQL - Hive tables
from os.path import abspath
from pyspark.sql import SparkSession

# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath("spark-warehouse")

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

# spark is an existing SparkSession
spark.sql("CREATE TABLE...")
I'm trying to use Sqoop in Hue, but there's an error:
Sqoop error: Could not get connectors.
and no Sqoop wizard appears on the page.
However, I can import data from Oracle using the sqoop shell (not sqoop2).
My questions are:
Is there anything else to configure besides putting the Oracle JDBC driver in /opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/sqoop2/client-lib/?
Which directories need to be permitted for the sqoop2 user (besides /var/lib/sqoop)?
Note: I still have no clue after reading this post: https://community.cloudera.com/t5/Web-UI-Hue-Beeswax/Blank-Sqoop-page-in-Hue-2-5-0/td-p/2581
Environment: Hue 3.9 / Sqoop2 / CDH 5.5.0 / CM 5.5.0 / LDAP, Kerberos & Sentry installed
I am experimenting with Apache Drill 1.4 in embedded mode and trying to connect to Hive running on EMR; Drill runs on a server outside EMR.
I have some basic questions I want clarified and some configuration issues to fix.
Here is what I have so far:
A running AWS EMR cluster.
A running Drill embedded server.
According to the documentation on configuring the Hive storage plugin, https://drill.apache.org/docs/hive-storage-plugin/, I am confused about whether to use a remote metastore or an embedded metastore. What is the difference?
Next, my EMR cluster is running, and here is what hive-site.xml looks like:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:9083</value>
  <description>Thrift URI for the remote metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
There are other properties defined, like the MySQL username and password, but I think these are the important ones here.
Which one should I use to connect to Hive? I have tried putting both of these in the storage plugin, but Drill doesn't accept it.
The storage plugins I have tried look like this:
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:9083",
    "fs.default.name": "hdfs://ec2-XX-XX-XX-XX.compute-1.amazonaws.com/",
    "hive.metastore.sasl.enabled": "false"
  }
}
and
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:9083",
    "javax.jdo.option.ConnectionURL": "jdbc:derby:ec2-XX-XX-XX-XX.compute-1.amazonaws.com;databaseName=data;create=true",
    "hive.metastore.warehouse.dir": "/user/hive/warehouse",
    "fs.default.name": "file:///",
    "hive.metastore.sasl.enabled": "false"
  }
}
It would be of great help if you could guide me in setting this up.
Thanks!
Whether to use a remote metastore or an embedded metastore?
Embedded mode: recommended for testing or experimental purposes only. In this mode, the metastore uses a Derby database, and both the database and the metastore service are embedded in the main HiveServer process; both start when you start the HiveServer process.
Remote mode: the Hive metastore service runs in its own JVM process. HiveServer2, HCatalog, and other processes communicate with it via the Thrift network API (configured via the hive.metastore.uris property). The metastore service communicates with the metastore database over JDBC (configured via the javax.jdo.option.ConnectionURL property). This mode should be used in production.
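For reference, a minimal hive-site.xml sketch of the properties that distinguish the two modes; the Derby URL shown is Hive's stock embedded default, and the Thrift host is a placeholder:

<!-- Embedded mode: Derby database in the local working directory;
     only one JVM can open it at a time. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
</property>

<!-- Remote mode: clients reach a standalone metastore service over Thrift. -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>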
You are using MySQL to store metadata for Hive, so Drill also needs javax.jdo.option.ConnectionUserName and javax.jdo.option.ConnectionPassword to create the connection.
Sample Hive plugin (remote mode):
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": <--->,
    "javax.jdo.option.ConnectionURL": <--->,
    "javax.jdo.option.ConnectionDriverName": <--->,
    "javax.jdo.option.ConnectionUserName": <--->,
    "javax.jdo.option.ConnectionPassword": <--->,
    "hive.metastore.warehouse.dir": <--->,
    "fs.default.name": <--->
  }
}
<--->: the values can be taken from hive-site.xml.
I was facing several problems:
VPC issue: my EMR cluster and MySQL host were in different VPCs. Trivial.
The MySQL connection was not happening from the EMR cluster to the MySQL host: the binding was strict to localhost, so I removed it.
When I then restarted hive --service metastore, I saw an error that the driver name was not correct and the driver class com.mysql.jdbc.Driver was not found, so I had to download the MySQL Connector driver as instructed in Step 2 here.
After MySQL could connect, the metastore could reach the database but failed to initialize it: the error was "Database initialization failed; direct SQL is disabled, but initial tables need to be present". So the tables had to be created with the command from here: Getting MissingTableException: Required table missing VERSION when starting hive on mysql
Go to $HIVE_HOME and run the initSchema option on the schematool:
bin/schematool -dbType mysql -initSchema
Make sure you have cleaned up the MySQL database that this metastore will use: none of the tables or schemas that Hive needs should already be present. (A quick way to verify the result is sketched below.)
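Once initSchema succeeds, you can sanity-check the outcome; a minimal sketch, assuming your Hive distribution ships schematool's standard -info option:

# Prints the metastore schema version recorded in the MySQL database
bin/schematool -dbType mysql -info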
After these steps, the metastore was able to connect to the external database, and Hive was up and running with a remote metastore.
I then hosted Drill (embedded) on a new EC2 host to connect to this metastore, and it worked like a charm!
curl -X POST -H "Content-Type: application/json" \
  -d '{
    "name": "hive",
    "config": {
      "type": "hive",
      "enabled": true,
      "configProps": {
        "hive.metastore.uris": "thrift://ip-XX.XX.XX.XX.ec2.internal:9083",
        "javax.jdo.option.ConnectionURL": "jdbc:mysql://ip-XX.XX.XX.XX:3306/hive?createDatabaseIfNotExist=true",
        "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": "root",
        "javax.jdo.option.ConnectionPassword": "blah",
        "hive.metastore.warehouse.dir": "/user/hive/warehouse",
        "fs.default.name": "hdfs://ip-XX.XX.XX.XX.ec2.internal:8020"
      }
    }
  }' \
  http://localhost:8047/storage/hive.json
I have installed Cloudera's Hadoop QuickStart VM, and I am attempting to pass records from my local database to HDFS using a PowerCenter mapping.
I've set up the Hadoop_HDFS_Connection in PowerCenter Workflow Manager, but when I run the workflow I get the following error: "Unable to establish a connection with the specified HDFS host". It throws a java.net.ConnectException when trying to connect to the host name and port.
I think the error may be in the hostname notation. In Cloudera Manager on the VM, the host name is listed as 'localhost.localdomain', but I don't know how to translate this into the PowerCenter connection settings.
Has anybody got this connection to work?
Many thanks.
Brian