Hive metastore database details missing in hive-site.xml - hadoop

We are using CDH 5.4.6. I am able to find the Hive Metastore details in the Cloudera UI, but I am trying to find the same details in the configuration file.
I can only find the hive.metastore.uris parameter in /etc/hive/conf/hive-site.xml. The hive-site.xml conf file is supposed to have javax.jdo.option.ConnectionURL / ConnectionDriverName / ConnectionUserName / ConnectionPassword. Where can I find those details?
<!--Autogenerated by Cloudera Manager-->
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://xxxxx.com:9083</value>
</property>

JDO details are only applicable to the Hive Metastore, so for security reasons they are not included in the client configuration version of hive-site.xml. The settings that you see in the Cloudera Manager UI are stored in Cloudera Manager's database. CM retrieves those values and adds them dynamically to a special server-side hive-site.xml which it generates before the HMS process is started. That file can be seen in the configuration directory /var/run/cloudera-scm-agent/process/nnn-hive-HIVEMETASTORE/ on the node running the HMS role (with proper permissions; nnn here is an incremental process counter).
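If you have shell access to that node, a quick way to pull the JDO values out of the generated file could look like this (just a sketch; the directory pattern is the one above, and you will typically need root to read it):
proc_dir=$(sudo ls -dt /var/run/cloudera-scm-agent/process/*HIVEMETASTORE* | head -1)
sudo grep -A1 'javax.jdo.option.Connection' "$proc_dir/hive-site.xml"
The grep prints each matching <name> line together with the <value> line that follows it.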
By the way, CDH 5.4.6 has been EOL'ed for ages. Why aren't you upgrading?

Related

Pyspark: remote Hive warehouse location

I need to read/write tables stored in a remote Hive server from PySpark. All I know about this remote Hive is that it runs under Docker. From Hadoop Hue I have found two URLs for an iris table that I try to select some data from:
I have a table metastore url:
http://xxx.yyy.net:8888/metastore/table/mytest/iris
and table location url:
hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytest.db/iris
I have no idea why the last URL contains quickstart.cloudera:8020. Maybe this is because Hive runs under Docker?
Discussing access to Hive tables, the PySpark tutorial says:
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
When working with Hive, one must instantiate SparkSession with Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions. Users who do not have an existing Hive deployment can still enable Hive support. When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse. You may need to grant write privilege to the user who starts the Spark application.
In my case, the hive-site.xml that I managed to get has neither the hive.metastore.warehouse.dir nor the spark.sql.warehouse.dir property.
The Spark tutorial suggests using the following code to access remote Hive tables:
from os.path import expanduser, join, abspath
from pyspark.sql import SparkSession
from pyspark.sql import Row
# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('spark-warehouse')
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
And in my case, after running code similar to the above, but with the correct value for warehouse_location, I think I can then do:
spark.sql("use mytest")
spark.sql("SELECT * FROM iris").show()
So where can I find the remote Hive warehouse location? How do I make PySpark work with remote Hive tables?
Update
hive-site.xml has the following properties:
...
...
...
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
...
...
...
<property>
<name>hive.metastore.uris</name>
<value>thrift://127.0.0.1:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
So it looks like 127.0.0.1 is the Docker localhost running the Cloudera docker app. That does not help me get to the Hive warehouse at all.
How do I access the Hive warehouse when Cloudera Hive runs as a Docker app?
At https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_hive_metastore_configure.html, under "Remote Mode", you'll find that the Hive metastore runs in its own JVM process; other processes such as HiveServer2, HCatalog, and Cloudera Impala communicate with it through the Thrift API, using the hive.metastore.uris property in hive-site.xml:
<property>
<name>hive.metastore.uris</name>
<value>thrift://xxx.yyy.net:8888</value>
</property>
(Not sure about the way you have to specify the address)
And maybe this property too:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://xxx.yyy.net/hive</value>
</property>
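On the PySpark side, one common way to pick up that remote metastore is to drop the cluster's hive-site.xml into Spark's conf directory, or to pass the thrift URI explicitly when launching PySpark. A rough sketch (the host name is from your Hue URL, but port 9083 is an assumption based on the metastore default):
# option 1: let Spark read the Hive client configuration
cp /path/to/hive-site.xml $SPARK_HOME/conf/
# option 2: pass the metastore URI on the command line
pyspark --conf spark.hadoop.hive.metastore.uris=thrift://xxx.yyy.net:9083
Once the session is created with enableHiveSupport(), spark.sql("SELECT * FROM mytest.iris") should resolve the table location through the metastore, so you don't need to know the warehouse path yourself.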

Issue on configure hive on spark

I have downloaded spark-2.0.0-bin-hadoop2.7. Could anyone advise how to configure Hive on this and use it from the Scala console? Right now I am able to run RDDs on files using Scala (the spark-shell console).
Follow the official Hive on Spark documentation:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
You can set Spark as the execution engine in Hive by using the following command:
set hive.execution.engine=spark;
or by adding it to hive-site.xml (refer to kanishka's post).
Then prior to Hive 2.2.0, copy the spark-assembly jar to HIVE_HOME/lib.
Since Hive 2.2.0, Hive on Spark runs with Spark 2.0.0 and above, which doesn't have an assembly jar.
To run with YARN mode (either yarn-client or yarn-cluster), copy the following jars to HIVE_HOME/lib.
scala-library
spark-core
spark-network-common
Set SPARK_HOME:
export SPARK_HOME=/path-to-spark
Start Spark Master and Workers:
spark-class org.apache.spark.deploy.master.Master
spark-class org.apache.spark.deploy.worker.Worker spark://MASTER_IP:PORT
Configure Spark:
set spark.master=<Spark Master URL>;
set spark.executor.memory=512m;
set spark.yarn.executor.memoryOverhead=<10-20% of spark.executor.memory>;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
Put your hive-site.xml in the Spark conf directory.
Hive can support multiple execution engines, like Tez and Spark. You can set the engine in hive-site.xml:
<property>
<name>hive.execution.engine</name>
<value>spark</value>
<description>
I am choosing Spark as the execution engine
</description>
</property>
Copy the spark-assembly jar to HIVE_HOME/lib.
Set SPARK_HOME.
Set the properties below:
set spark.master=<Spark Master URL>
set spark.eventLog.enabled=true;
set spark.eventLog.dir=<Spark event log folder (must exist)>
set spark.executor.memory=512m;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
The above steps should suffice, I think.

how to connect hive with multiple users

I am very new to Hadoop, and somehow we managed to install it with the Apache distribution and a Derby database.
My requirement is to have multiple users access Hive at the same time. But right now we can only allow a single user at a time.
I searched some blogs but haven't found a solution.
Could someone help me with a solution?
Derby only allows a single connection (process) to access the database at a given time, hence only one user can access Hive.
Upgrade your Hive metastore to either MySQL or PostgreSQL to support multiple concurrent connections to Hive.
For upgrading your metastore from Derby to MySQL/PostgreSQL there are a lot of resources online; here are some of them:
From Cloudera
From Apache Hive Wiki
There are different metastore modes that determine how multiple users can access the metastore concurrently:
Embedded metastore (the default metastore: Derby)
Local metastore
Remote metastore
Let's look at the usage of each of the above metastore modes.
Embedded metastore:
This metastore is only used for unit tests. Its limitation is that it allows only one user to access Hive at a time (multiple sessions are not allowed, and it throws an error).
Local metastore (using a MySQL or Oracle DB):
To overcome the default metastore's limitation, the local metastore is used; it allows multiple users in the same JVM (multiple sessions on the same machine). To set up this mode, see the steps below in this answer.
Remote metastore (this is the metastore used in production):
In the same project, multiple Hive users need to work concurrently, and they can use Hive from different machines, but the metadata should be stored centrally, in MySQL, Oracle, etc. Here Hive runs in each user's JVM, and when users run queries they communicate with the centralized metastore over the Thrift network API. To set up this mode, see the steps below in this answer.
METASTORE SETUP FOR MULTIPLE USERS:
Step 1 : Download and install mysql server
sudo apt-get install mysql-server
Step 2 : Download and install JDBC driver.
sudo apt-get install libmysql-java
Step 3 : We need to copy the downloaded JDBC driver to hive/lib/ or link the JDBC location to hive/lib.
-Goto to the $HIVE_HOME/lib folder and create a link to the MySQL JDBC library.
ln -s /usr/share/java/mysql-connector-java.jar
Step 4 : Create users on the metastore DB for remote and local access.
mysql -u root -p <Give password while installing DB>
mysql> CREATE USER 'user1'@'%' IDENTIFIED BY 'user1pass';
mysql> GRANT ALL PRIVILEGES ON *.* TO 'user1'@'%' WITH GRANT OPTION;
mysql> flush privileges;
If you want multiple users to have access, repeat step 4 with a different user name and password.
Step 5 : Go to hive/conf/hive-site.xml (if it's not available, create it.)
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/metastore?createDatabaseIfNotExist=true</value>
<description>replace localhost with your database hostname</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
<description>MySQL JDBC driver class</description>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>user1</value>
<description>user name for connecting to mysql server</description>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>user1pass</value>
<description>password for connecting to mysql server</description>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://slave2:9083</value>
<description>Here use your metasore host name to access from different machine</description>
</property>
</configuration>
Repeat only step 5 on each user's machine, changing the user name and password accordingly.
Step 6 : From Hive 2.x onwards we must run this command:
slave#ubuntu~$: schematool -initSchema -dbType mysql
Step 7 : To start hive metastore server
~$: hive --service metastore &
Now, check Hive with different users concurrently from different machines.
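For a quick concurrency check, assuming both machines have the Hive client installed and the hive-site.xml from step 5 in place, running something like this at the same time from two different users should now work (the database name is only an example):
# on user1's machine
hive -e "CREATE DATABASE IF NOT EXISTS shared_db; SHOW DATABASES;"
# on user2's machine, at the same time
hive -e "SHOW DATABASES;"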

How to use Hive without hadoop

I am new to NoSQL solutions and want to play with Hive. But installing HDFS/Hadoop takes a lot of resources and time (maybe it's just my lack of experience, but I have no time for this).
Are there ways to install and use Hive on a local machine without HDFS/Hadoop?
Yes, you can run Hive without Hadoop:
1. Create your warehouse on your local system.
2. Give the default fs as file:///
Then you can run Hive in local mode without a Hadoop installation.
In hive-site.xml:
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<!-- this should eventually be deprecated since the metastore should supply this -->
<name>hive.metastore.warehouse.dir</name>
<value>file:///tmp</value>
<description></description>
</property>
<property>
<name>fs.default.name</name>
<value>file:///tmp</value>
</property>
</configuration>
If you are just talking about trying out Hive before making a decision, you can just use a preconfigured VM as @Maltram suggested (Hortonworks, Cloudera, IBM and others all offer such VMs).
What you should keep in mind is that you will not be able to use Hive in production without Hadoop and HDFS, so if that is a problem for you, you should consider alternatives to Hive.
You can't. If you just download Hive and run:
./bin/hiveserver2
you will get:
Cannot find hadoop installation: $HADOOP_HOME or $HADOOP_PREFIX must be set or hadoop must be in the path
Hadoop is like a core, and Hive needs some libraries from it.
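Other answers to this question work around that by unpacking a Hadoop tarball purely for its libraries, without starting any daemons. A minimal sketch of that idea (the version number and paths are only examples):
tar xzf hadoop-2.10.2.tar.gz -C /opt
export HADOOP_HOME=/opt/hadoop-2.10.2
export PATH=$HADOOP_HOME/bin:$PATH
./bin/hiveserver2    # the Hadoop libraries can now be found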
Update: this answer is out of date: with Hive on Spark it is no longer necessary to have HDFS support.
Hive requires HDFS and MapReduce, so you will need them. The other answer has some merit in the sense of recommending a simple, pre-configured means of getting all of the components there for you.
But the gist of it is: Hive needs Hadoop and M/R, so to some degree you will need to deal with it.
The top answer works for me, but it needs a few more setup steps. I spent quite some time searching around to fix multiple problems until I finally got it set up. Here I summarize the steps from scratch:
Download hive, decompress it
Download hadoop, decompress it, put it in the same parent folder as hive
Setup hive-env.sh
$ cd hive/conf
$ cp hive-env.sh.template hive-env.sh
Add the following environment variables in hive-env.sh (change the paths according to your actual Java/Hadoop versions):
JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_281.jdk/Contents/Home
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=${bin}/../../hadoop-3.3.1
Setup hive-site.xml
$ cd hive/conf
$ cp hive-default.xml.template hive-site.xml
Replace all the ${system:***} variables with constant paths (not sure why these are not recognized on my system).
Set the database path to local with the following properties (copied from the top answer):
<configuration>
<property>
<name>hive.metastore.schema.verification</name>
<value>false</value>
</property>
<property>
<!-- this should eventually be deprecated since the metastore should supply this -->
<name>hive.metastore.warehouse.dir</name>
<value>file:///tmp</value>
<description></description>
</property>
<property>
<name>fs.default.name</name>
<value>file:///tmp</value>
</property>
</configuration>
Setup hive-log4j2.properties (optional, good for troubleshooting)
cp hive-log4j2.properties.template hive-log4j2.properties
Replace all the ${sys:***} variables with constant paths.
Setup metastore_db
If you run Hive directly and then do any DDL, you will get an error like:
FAILED: HiveException org.apache.hadoop.hive.ql.metadata.HiveException:MetaException(message:Hive metastore database is not initialized. Please use schematool (e.g. ./schematool -initSchema -dbType ...) to create the schema. If needed, don't forget to include the option to auto-create the underlying database in your JDBC connection string (e.g. ? createDatabaseIfNotExist=true for mysql))
In that case we need to recreate metastore_db with the following commands:
$ cd hive/bin
$ rm -rf metastore_db
$ ./schematool -initSchema -dbType derby
Start hive
$ cd hive/bin
$ ./hive
Now you should be able to run Hive on your local file system. One thing to note: metastore_db will always be created in your current directory. If you start Hive in a different directory, you need to recreate it again.
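As a quick sanity check of the local setup, something like this should succeed (the table name is only an example):
$ cd hive/bin
$ ./hive -e "CREATE TABLE IF NOT EXISTS local_test (id INT); SHOW TABLES;"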
Although there are some details that you have to keep in mind, it's completely normal to use Hive without HDFS. A few of them:
As a few commenters mentioned above, you'll still need some .jar files from hadoop-common.
As of today (Dec 2020) it's difficult to run Hive with Hadoop 3. Use stable Hadoop 2 with Hive 2.
Make sure POSIX permissions are set correctly, so your local Hive can access the warehouse and, eventually, the Derby database location.
Initialize your database with a manual call to schematool.
You can use a hive-site.xml file pointing to the local POSIX filesystem, but you can also set those options in the HIVE_OPTS environment variable (see the sketch after this list).
I covered that, with examples of errors I've seen, in my blog post.
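For the HIVE_OPTS route, a minimal sketch, assuming the same local-filesystem values used in the hive-site.xml snippets above:
export HIVE_OPTS="--hiveconf fs.default.name=file:///tmp --hiveconf hive.metastore.warehouse.dir=file:///tmp --hiveconf hive.metastore.schema.verification=false"
hive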

Best place for json Serde JAR in CDH Hadoop for use with Hive/Hue/MapReduce

I'm using Hive/Hue/MapReduce with a json Serde. To get this working I have copied the json_serde.jar to several lib directories on every cluster node:
/opt/cloudera/parcels/CDH/lib/hive/lib
/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib
/opt/cloudera/parcels/CDH/lib/hadoop/lib
/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/lib
...
On every CDH update of the cluster I have to do that again.
Is there a more elegant way where the distribution of the Serde in the cluster would be automatic and resistant to updates?
If you are using HiveServer2 (the default in Cloudera 5.0+), the following configuration will work across your entire cluster without having to copy the jar to each node.
Add this to your hive-site.xml config file, or, if you're using Cloudera Manager, to the "HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml" config box:
<property>
<name>hive.aux.jars.path</name>
<value>/user/hive/aux_jars/hive-serdes-1.0-snapshot.jar</value>
</property>
Then create the directory in your HDFS filesystem (/user/hive/aux_jars) and place the jar file in it. If you are running Hue, you can do this part via the web UI: just click on File Browser at the top right.
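From the command line, that HDFS part is just (the jar name matches the value above; adjust it to your own SerDe jar):
hdfs dfs -mkdir -p /user/hive/aux_jars
hdfs dfs -put hive-serdes-1.0-snapshot.jar /user/hive/aux_jars/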
It depends on the version of Hue and if using Beeswax or HiveServer2:
Beeswax: there is a workaround with the HIVE_AUX_JARS_PATH https://issues.cloudera.org/browse/HUE-1127
HiveServer2 supports a hive.aux.jars.path property in the hive-site.xml. HiveServer2 does not support a .hiverc and Hue is looking at providing an equivalent at some point: https://issues.cloudera.org/browse/HUE-1066
