Connect to Hive on EMR using Apache Drill Embedded

I am trying to experiment with Apache Drill 1.4 in embedded mode and trying to connect to Hive running on EMR - Drill is running on a server outside EMR.
I have some basic questions that I want to get clarified and some configuration issues to be fixed.
Here is what I have so far -
Running AWS EMR cluster.
Running Drill Embedded server.
According to the documentation on configuring a storage plugin for Hive, https://drill.apache.org/docs/hive-storage-plugin/ , I am confused about whether to use a remote metastore or an embedded metastore. What is the difference?
Next, my EMR cluster is running and here is what hive-site.xml looks like -
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:9083</value>
  <description>Thrift URI for the remote metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
There are other properties defined, like the MySQL username and password, but I guess these are the important ones here.
Which one should I use to connect to Hive? I have tried putting both of these in the storage plugin, but Drill doesn't accept it.
Storage plugins I have tried look like this -
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:9083",
    "fs.default.name": "hdfs://ec2-XX-XX-XX-XX.compute-1.amazonaws.com/",
    "hive.metastore.sasl.enabled": "false"
  }
}
and
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": "thrift://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:9083",
    "javax.jdo.option.ConnectionURL": "jdbc:derby:ec2-XX-XX-XX-XX.compute-1.amazonaws.com;databaseName=data;create=true",
    "hive.metastore.warehouse.dir": "/user/hive/warehouse",
    "fs.default.name": "file:///",
    "hive.metastore.sasl.enabled": "false"
  }
}
It would be of great help if you could guide me in setting this up.
Thanks!

Whether to use a remote metastore or an embedded metastore?
Embedded mode: This is recommended for testing or experimental purposes only. In this mode, the metastore uses a Derby database, and both the database and the metastore service are embedded in the main HiveServer process; both are started for you when you start the HiveServer process.
Remote mode: The Hive metastore service runs in its own JVM process. HiveServer2, HCatalog, and other processes communicate with it via the Thrift network API (configured via the hive.metastore.uris property). The metastore service communicates with the metastore database over JDBC (configured via the javax.jdo.option.ConnectionURL property). This mode should be used in production.
You are using MySQL to store Hive's metadata, so Drill also needs javax.jdo.option.ConnectionUserName and javax.jdo.option.ConnectionPassword to create the connection.
Sample hive plugin (Remote Mode):
{
  "type": "hive",
  "enabled": true,
  "configProps": {
    "hive.metastore.uris": <--->,
    "javax.jdo.option.ConnectionURL": <--->,
    "javax.jdo.option.ConnectionDriverName": <--->,
    "javax.jdo.option.ConnectionUserName": <--->,
    "javax.jdo.option.ConnectionPassword": <--->,
    "hive.metastore.warehouse.dir": <--->,
    "fs.default.name": <--->
  }
}
<---> : can be taken from hive-site.xml.
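If you prefer scripting the setup, the plugin can also be registered through Drill's REST API - the same endpoint the curl command further down uses. Below is a minimal sketch in Python (the requests library is an assumption); every configProps value is a placeholder to be replaced with the real one from your hive-site.xml:

import requests

# Register the Hive storage plugin via Drill's REST API (Drill Web UI port 8047).
# All configProps values below are placeholders - copy the real ones from hive-site.xml.
plugin = {
    "name": "hive",
    "config": {
        "type": "hive",
        "enabled": True,
        "configProps": {
            "hive.metastore.uris": "thrift://<metastore-host>:9083",
            "javax.jdo.option.ConnectionURL": "jdbc:mysql://<mysql-host>:3306/hive",
            "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
            "javax.jdo.option.ConnectionUserName": "<user>",
            "javax.jdo.option.ConnectionPassword": "<password>",
            "hive.metastore.warehouse.dir": "/user/hive/warehouse",
            "fs.default.name": "hdfs://<namenode-host>:8020",
        },
    },
}
resp = requests.post("http://localhost:8047/storage/hive.json", json=plugin)
print(resp.json())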

I was facing several problems -
VPC issue - my EMR cluster and MySQL host were in different VPCs. Trivial.
The MySQL connection from the EMR cluster to the MySQL host was failing - MySQL was bound strictly to localhost. I removed the binding.
When I then restarted hive --service metastore, I saw an error that the driver name was not correct and the driver class com.mysql.jdbc.Driver was not found - so I had to download the MySQL Connector driver as instructed in Step 2 here.
Once MySQL connectivity worked, the metastore still could not initialize the database; the error was "Database initialization failed; direct SQL is disabled, but initial tables need to be present". So the table creation had to be done with the command from Getting MissingTableException: Required table missing VERSION when starting hive on mysql:
Go to $HIVE_HOME and run the initSchema option on the schematool:
bin/schematool -dbType mysql -initSchema
Make sure the MySQL database you are moving this metastore to has been cleaned up: no tables or schemas that Hive needs should already be present.
After these steps, the metastore was able to connect to the external database. Now Hive is up and running with a remote metastore.
I then hosted Drill (embedded) on a new EC2 host to connect to this metastore, and it worked like a charm!
curl -X POST -H "Content-Type: application/json" -d '{
  "name": "hive",
  "config": {
    "type": "hive",
    "enabled": true,
    "configProps": {
      "hive.metastore.uris": "thrift://ip-XX.XX.XX.XX.ec2.internal:9083",
      "javax.jdo.option.ConnectionURL": "jdbc:mysql://ip-XX.XX.XX.XX:3306/hive?createDatabaseIfNotExist=true",
      "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
      "javax.jdo.option.ConnectionUserName": "root",
      "javax.jdo.option.ConnectionPassword": "blah",
      "hive.metastore.warehouse.dir": "/user/hive/warehouse",
      "fs.default.name": "hdfs://ip-XX.XX.XX.XX.ec2.internal:8020"
    }
  }
}' http://localhost:8047/storage/hive.json
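As a quick sanity check that the plugin really works, a query can be submitted through the same REST interface (the /query.json endpoint is from Drill's REST API documentation; this is a sketch, not part of the original setup):

import requests

# Run a test query against Drill's REST API; the Hive databases should
# appear in the result if the plugin registration above succeeded.
q = {"queryType": "SQL", "query": "SHOW DATABASES"}
resp = requests.post("http://localhost:8047/query.json", json=q)
print(resp.json())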

Related

Talend 8.0 DBConnection to Hive (Cloudera)

I'm trying to create a connection to access Hive from Talend.
My Hive is running in cloudera-quickstart 5.12.0.0 (using VirtualBox),
and in VirtualBox my ifconfig address is 192.168.100.245.
My parameters in the Talend DBConnection metadata are:
DBType : Hive
Distribution : Cloudera
Version : Cloudera CDH6.1.1
Hive Model : Standalone
Hive Server Version : Hive Server2 --jdbc:hive2://
string of connection : jdbc:hive2://192.168.100.245:10000/default
login : cloudera
password : cloudera
server : 192.168.100.245
Port:1000
Database: default
Additional JDBC Settings : Empty
Right now the result is "Connection failure, must change database setting".
Can anyone help me fix this issue?
Or does anyone know of a resource or tutorial for connecting Talend to Cloudera Hadoop?
I'm using Talend 8.0, Windows 10, and VirtualBox 6 + Cloudera 5.12.0.
Thank you for any advice.
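Before digging into the Talend metadata settings, it can help to confirm that HiveServer2 at the address from the connection string is reachable at all from the host machine. A minimal sketch using the PyHive library (an assumption - not part of the question; note that the connection string uses port 10000 while the Port field above says 1000):

from pyhive import hive  # pip install pyhive - assumed library, not from the question

# Sanity-check that HiveServer2 is reachable outside Talend; the auth mode
# may need adjusting depending on the HiveServer2 configuration.
conn = hive.connect(host="192.168.100.245", port=10000, username="cloudera")
cur = conn.cursor()
cur.execute("SHOW TABLES")
print(cur.fetchall())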

Connecting HiveServer2 from pyspark

I am stuck on how to use pyspark to fetch data from Hive using JDBC.
I am trying to connect to HiveServer2 running on my local machine from pyspark using JDBC. All components (HDFS, pyspark, HiveServer2) are on the same machine.
Following is the code I am using to connect:
connProps = {"username": "hive", "password": "", "driver": "org.apache.hive.jdbc.HiveDriver"}
sqlContext.read.jdbc(url='jdbc:hive2://127.0.0.1:10000/default', table='pokes', properties=connProps)

dataframe_mysql = sqlContext.read.format("jdbc") \
    .option("url", "jdbc:hive://localhost:10000/default") \
    .option("driver", "org.apache.hive.jdbc.HiveDriver") \
    .option("dbtable", "pokes") \
    .option("user", "hive") \
    .option("password", "") \
    .load()
Both methods above give me the same error:
org.apache.spark.sql.AnalysisException: java.lang.RuntimeException:
java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
javax.jdo.JDOFatalDataStoreException: Unable to open a test connection
to the given database. JDBC url =
jdbc:derby:;databaseName=metastore_db;create=true, username = APP.
Terminating connection pool (set lazyInit to true if you expect to
start your database after your app).
ERROR XSDB6: Another instance of Derby may have already booted the database /home///jupyter-notebooks/metastore_db
metastore_db is located in the same directory where my Jupyter notebooks are created, but hive-site.xml has a different metastore location.
I have already checked other questions about this error; they say another spark-shell or similar process is running, but it's not. Even if I try the following command when HiveServer2 and HDFS are down, I get the same error:
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
I am able to connect to Hive from a Java program using JDBC. Am I missing something here? Please help. Thanks in advance.
Spark should not use JDBC to connect to Hive.
It reads from the metastore directly and skips HiveServer2.
However, "Another instance of Derby may have already booted the database" means that you're running Spark from another session, such as another Jupyter kernel that's still running. Try setting a different metastore location, or set up a remote Hive metastore using a local MySQL or Postgres database, and edit $SPARK_HOME/conf/hive-site.xml with that information.
From SparkSQL - Hive tables
from os.path import abspath
from pyspark.sql import SparkSession

# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('spark-warehouse')

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

# spark is an existing SparkSession
spark.sql("CREATE TABLE...")

Pyspark: remote Hive warehouse location

I need to read/write tables stored in a remote Hive server from Pyspark. All I know about this remote Hive is that it runs under Docker. From Hadoop Hue I have found two URLs for an iris table that I try to select some data from:
I have a table metastore url:
http://xxx.yyy.net:8888/metastore/table/mytest/iris
and table location url:
hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytest.db/iris
I have no idea why the last URL contains quickstart.cloudera:8020. Maybe this is because Hive runs under Docker?
Discussing access to Hive tables, the Pyspark tutorial says:
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
When working with Hive, one must instantiate SparkSession with Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions. Users who do not have an existing Hive deployment can still enable Hive support. When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse. You may need to grant write privilege to the user who starts the Spark application.
In my case, the hive-site.xml that I managed to get has neither the hive.metastore.warehouse.dir nor the spark.sql.warehouse.dir property.
The Spark tutorial suggests using the following code to access remote Hive tables:
from os.path import expanduser, join, abspath
from pyspark.sql import SparkSession
from pyspark.sql import Row

# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('spark-warehouse')

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
And in my case, after running code similar to the above, but with the correct value for warehouse_location, I think I can then do:
spark.sql("use mytest")
spark.sql("SELECT * FROM iris").show()
So where can I find the remote Hive warehouse location? How do I make Pyspark work with remote Hive tables?
Update
hive-site.xml has the following properties:
...
...
...
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
...
...
...
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://127.0.0.1:9083</value>
  <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
So it looks like 127.0.0.1 is the Docker localhost of the Cloudera Docker app. That does not help to get to the Hive warehouse at all.
How do I access the Hive warehouse when Cloudera Hive runs as a Docker app?
At https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_hive_metastore_configure.html , under "Remote Mode", you'll find that the Hive metastore runs in its own JVM process; other processes such as HiveServer2, HCatalog, and Cloudera Impala communicate with it through the Thrift API using the hive.metastore.uris property in hive-site.xml:
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://xxx.yyy.net:8888</value>
</property>
(Not sure about the way you have to specify the address)
And maybe this property too:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://xxx.yyy.net/hive</value>
</property>
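If the metastore URI is known, it can also be passed straight to the SparkSession instead of going through hive-site.xml. A sketch under assumptions - the host and port below are placeholders, and the database and table names are taken from the question:

from pyspark.sql import SparkSession

# Point Spark at the remote Hive metastore directly; setting
# hive.metastore.uris here plays the same role as the hive-site.xml entry.
spark = SparkSession \
    .builder \
    .appName("remote-hive-example") \
    .config("hive.metastore.uris", "thrift://xxx.yyy.net:9083") \
    .enableHiveSupport() \
    .getOrCreate()

spark.sql("use mytest")
spark.sql("SELECT * FROM iris").show()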

Hive : The application won't work without a running HiveServer2

I am new to this field. I was checking the CDH 5.8 quickstart VM to try some basic Hive/Impala examples.
But I hit an issue: when I open HUE, it gives the error below. I searched for a solution but didn't find anything that resolves my issue.
Configuration files located in /etc/hue/conf.empty
Potential misconfiguration detected. Fix and restart Hue.
Hive The application won't work without a running HiveServer2.
I checked the service and it's up and running. Tried restarting the service and CDH; it didn't help.
Hive Server2 is running [ OK ]
When I navigated to Hive and tried some commands, it gave me the error below.
Could not connect to quickstart.cloudera:10000 (code THRIFTTRANSPORT): TTransportException('Could not connect to quickstart.cloudera:10000',)
For Impala I am getting:
AnalysisException: This Impala daemon is not ready to accept user requests. Status: Waiting for catalog update from the StateStore.
I tried starting hive --service metastore but got an error:
[cloudera@quickstart conf.empty]$ hive --service metastore
2017-03-03 05:37:14,502 WARN [main] mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
Starting Hive Metastore Server
org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address 0.0.0.0/0.0.0.0:9083.
Not sure what is wrong or if I need to change some config. Can anyone guide me towards the solution?
Your HiveServer2 requires the metastore to be up and running. It seems your metastore server cannot start because port 9083 is already used by some other service. Check it:
netstat -tulpn | grep 9083
If something is using this port, you need to either change your metastore's port in the Hive configuration or stop the application that already uses this port.
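If netstat isn't available, the same check can be done from Python - a minimal sketch, assuming the metastore host is reachable as localhost:

import socket

# Probe the default metastore port; connect_ex returns 0 when something
# is already listening there.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    in_use = s.connect_ex(("localhost", 9083)) == 0
print("port 9083 in use:", in_use)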

how to connect hive with multiple users

I am very new to Hadoop, and somehow we managed to install it with the Apache distribution and a Derby database.
My requirement is to have multiple users accessing Hive at the same time, but right now we can only allow a single user at a time.
I searched some blogs but haven't found a solution.
Could someone help me with a solution?
Derby only allows a single connection (process) to access the database at a given time, hence only one user can access Hive.
Upgrade your Hive metastore to either MySQL or PostgreSQL to support multiple concurrent connections to Hive.
For upgrading your metastore from Derby to MySQL/PostgreSQL there are lots of resources online; here are some of them:
From Cloudera
From Apache Hive Wiki
There are different ways to let multiple users access the metastore concurrently:
Embedded metastore (default metastore: Derby)
Local metastore
Remote metastore
Let's look at the usage of each of these.
Embedded metastore:
This metastore is intended only for unit tests. Its limitation is that it allows only one user to access Hive at a time (multiple sessions are not allowed, and opening a second one throws an error).
Local metastore (using a MySQL or Oracle DB):
To overcome the embedded metastore's limitation, the local metastore can be used; it allows multiple users in the same JVM (multiple sessions on the same machine). To set up this mode, see the steps below.
Remote metastore (this is what production uses):
When multiple Hive users work on the same project, they can use Hive concurrently from different machines, while the metadata is stored centrally in MySQL, Oracle, etc. Hive runs in each user's JVM, and each process communicates with the centralized metastore over the Thrift network API. To set up this mode, see the steps below.
METASTORE SETUP FOR MULTIPLE USERS:
Step 1: Download and install the MySQL server:
sudo apt-get install mysql-server
Step 2: Download and install the JDBC driver:
sudo apt-get install libmysql-java
Step 3: Copy the downloaded JDBC driver to hive/lib/, or link the JDBC location into hive/lib. Go to the $HIVE_HOME/lib folder and create a link to the MySQL JDBC library:
ln -s /usr/share/java/mysql-connector-java.jar
Step 4: Create users on the metastore database that can connect remotely and locally:
mysql -u root -p <the password you gave while installing the DB>
mysql> CREATE USER 'user1'@'%' IDENTIFIED BY 'user1pass';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'user1'@'%' WITH GRANT OPTION;
mysql> flush privileges;
If you want multiple users to have access, repeat Step 4 with a different user name and password for each.
Step 5: Go to hive/conf/hive-site.xml (if it's not there, create it):
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
    <description>replace localhost with your database host name</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>MySQL JDBC driver class</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>user1</value>
    <description>user name for connecting to the MySQL server</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>user1pass</value>
    <description>password for connecting to the MySQL server</description>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://slave2:9083</value>
    <description>use your metastore host name here so it can be reached from other machines</description>
  </property>
</configuration>
Repeat only Step 5 on every user's machine, changing the user name and password accordingly.
Step 6: From Hive 2.x onwards, you must run this command:
slave@ubuntu~$: schematool -initSchema -dbType mysql
Step 7: Start the Hive metastore server:
~$: hive --service metastore &
Now check Hive concurrently as different users from different machines.
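A quick concurrency check - a minimal sketch using the PyHive library (an assumption; not part of the original answer). It opens two HiveServer2 sessions at once, which is exactly what the Derby-backed setup cannot handle. Host, port, and user names are placeholders:

from pyhive import hive  # pip install pyhive - assumed library

# Two concurrent sessions; with the MySQL-backed metastore both should succeed.
conn_a = hive.connect(host="slave2", port=10000, username="user1")
conn_b = hive.connect(host="slave2", port=10000, username="user2")
for conn in (conn_a, conn_b):
    cur = conn.cursor()
    cur.execute("SHOW DATABASES")
    print(cur.fetchall())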
