Hive not storing Warehouse in HDFS - hadoop

I have downloaded and installed Hive on my local system and copied hive-site.xml into Spark's conf directory. I tried to create a managed table in a Hive context using the Spark shell.
I have put the following property in hive-site.xml (present in Spark's conf directory):
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive/warehouse</value>
</property>
I have also set HADOOP_CONF_DIR in spark-env.sh:
export HADOOP_CONF_DIR=/opt/hadoop/conf
As per the Hive documentation, the Hive warehouse should be stored in HDFS, but the warehouse is being created on the local drive (/user/hive/warehouse).
Please help me understand why Hive is not storing the warehouse directory in HDFS.

Please define your Spark dependencies using version 2.0.2:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.2"
You can then use hive.metastore.warehouse.dir or spark.sql.warehouse.dir to set the Spark warehouse and point it to the HDFS location where your other Hive tables live.
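For example, a minimal sketch for Spark 2.x (the NameNode host and port below are placeholders, not values taken from your cluster):

import org.apache.spark.sql.SparkSession

// Point the Spark SQL warehouse at an HDFS location explicitly.
// "namenode:8020" is a placeholder for your actual NameNode host and port.
val spark = SparkSession.builder()
  .appName("Hive warehouse on HDFS")
  .config("spark.sql.warehouse.dir", "hdfs://namenode:8020/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()

// Managed tables created through this session should now land under the HDFS path above.
spark.sql("CREATE TABLE IF NOT EXISTS demo_managed (id INT, name STRING)")
spark.sql("DESCRIBE FORMATTED demo_managed").show(false)

If you are in spark-shell, where the session already exists, the same setting can be passed at launch, e.g. spark-shell --conf spark.sql.warehouse.dir=hdfs://namenode:8020/user/hive/warehouse.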

Related

Pyspark: remote Hive warehouse location

I need to read/write tables stored in a remote Hive server from PySpark. All I know about this remote Hive is that it runs under Docker. From Hadoop Hue I have found two URLs for an iris table that I am trying to select some data from:
I have a table metastore URL:
http://xxx.yyy.net:8888/metastore/table/mytest/iris
and a table location URL:
hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytest.db/iris
I have no idea why the last URL contains quickstart.cloudera:8020. Maybe this is because Hive runs under Docker?
Discussing access to Hive tables, the PySpark tutorial says:
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
When working with Hive, one must instantiate SparkSession with Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions. Users who do not have an existing Hive deployment can still enable Hive support. When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse. You may need to grant write privilege to the user who starts the Spark application.
In my case, the hive-site.xml that I managed to get has neither the hive.metastore.warehouse.dir nor the spark.sql.warehouse.dir property.
The Spark tutorial suggests using the following code to access remote Hive tables:
from os.path import expanduser, join, abspath

from pyspark.sql import SparkSession
from pyspark.sql import Row

# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('spark-warehouse')

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
And in my case, after running code similar to the above but with the correct value for warehouse_location, I think I can then do:
spark.sql("use mytest")
spark.sql("SELECT * FROM iris").show()
So where can I find the remote Hive warehouse location? How can I make PySpark work with remote Hive tables?
Update
hive-site.xml has the following properties:
...
...
...
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
...
...
...
<property>
<name>hive.metastore.uris</name>
<value>thrift://127.0.0.1:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
So it looks like 127.0.0.1 is the Docker localhost on which the Cloudera docker app runs, which does not help me reach the Hive warehouse at all.
How do I access the Hive warehouse when Cloudera Hive runs as a Docker app?
Here https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_hive_metastore_configure.html under "Remote Mode" you'll find that the Hive metastore runs in its own JVM process; other processes such as HiveServer2, HCatalog, and Cloudera Impala communicate with it through the Thrift API using the hive.metastore.uris property in hive-site.xml:
<property>
<name>hive.metastore.uris</name>
<value>thrift://xxx.yyy.net:8888</value>
</property>
(Not sure about the way you have to specify the address)
And maybe this property too:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://xxx.yyy.net/hive</value>
</property>
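As a client-side sketch (written for spark-shell in Scala; the same .config() keys can be set from pyspark.sql.SparkSession.builder), you could also pass the metastore URI directly when building the session. The host below comes from the question and 9083 is the usual metastore Thrift port; treat both as assumptions to verify:

import org.apache.spark.sql.SparkSession

// Sketch only: the metastore host/port and the warehouse path are assumptions.
// 9083 is the common metastore Thrift port (8888 in the question is the Hue port).
val spark = SparkSession.builder()
  .appName("Remote Hive metastore example")
  .config("hive.metastore.uris", "thrift://xxx.yyy.net:9083")
  .config("spark.sql.warehouse.dir", "hdfs://quickstart.cloudera:8020/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("use mytest")
spark.sql("SELECT * FROM iris").show()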

Do we need to run HiveServer2 on our client machine to access the Hive metastore?

I am using Spark (Java) to access the Hive metastore. On my machine only Spark is installed and nothing else; I have no Hadoop or Hive directory. I have created hive-site.xml, hdfs-site.xml, core-site.xml and yarn-site.xml inside the spark/conf directory. My Hive metastore is set up on another machine, which is part of a Hadoop cluster and is the namenode. I can access the Hive metastore from spark/bin/beeline and spark/bin/spark-shell on my desktop, but when I try to access the Hive metastore from the Java API, I get a metastore_db folder and a derby.log file created in my project, which means it cannot reach the Hive metastore.
SparkSession spark = SparkSession
.builder()
.appName("Java Spark SQL basic example")
.enableHiveSupport()
.config("spark.sql.warehouse.dir", "hdfs://bigdata-namenode:9000/user/hive/warehouse")
.config("mapred.input.dir.recursive", true)
.config("hive.mapred.supports.subdirectories", true)
.config("spark.sql.hive.thriftServer.singleSession", true)
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.master("local")
.getOrCreate();
spark.sql("show databases").show();
When I start the Thrift server on my desktop (i.e. the client machine), I get this log, thriftserver.log, which says spark.sql.warehouse.dir is set to my local file system path, i.e. not to HDFS where the actual warehouse is located.
/spark/conf/core-site.xml
/spark/conf/hive-site.xml
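One way to check whether the remote metastore is reachable at all, sketched here in Scala for spark-shell (the Java builder calls mirror it), is to pass the metastore URI explicitly. The host comes from the question; port 9083 is the usual metastore Thrift port and is an assumption:

import org.apache.spark.sql.SparkSession

// Sketch: "bigdata-namenode" is taken from the question; port 9083 is an
// assumption and must match where the metastore Thrift service listens.
val spark = SparkSession.builder()
  .appName("Remote metastore check")
  .master("local")
  .config("hive.metastore.uris", "thrift://bigdata-namenode:9083")
  .config("spark.sql.warehouse.dir", "hdfs://bigdata-namenode:9000/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()

// If this lists the remote databases and no local metastore_db/derby.log appears,
// the metastore is reachable and the problem is that hive-site.xml is not being
// picked up on the application classpath.
spark.sql("show databases").show()

Spark talks to the metastore service directly over Thrift, so HiveServer2 is not required on the client for this.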

Issue configuring Hive on Spark

I have downloaded spark-2.0.0-bin-hadoop2.7. Could anyone advise how to configure Hive on this and use it from the Scala console? Currently I am able to run RDDs on a file using Scala (the spark-shell console).
Follow the official Hive on Spark documentation:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
You can set Spark as the Hive execution engine by using the following command:
set hive.execution.engine=spark;
or by adding it to hive-site.xml (refer to kanishka's post below).
Then, prior to Hive 2.2.0, copy the spark-assembly jar to HIVE_HOME/lib.
Since Hive 2.2.0, Hive on Spark runs with Spark 2.0.0 and above, which does not have an assembly jar.
To run in YARN mode (either yarn-client or yarn-cluster), copy the following jars to HIVE_HOME/lib:
scala-library
spark-core
spark-network-common
Set SPARK_HOME:
export SPARK_HOME=/path-to-spark
Start Spark Master and Workers:
spark-class org.apache.spark.deploy.master.Master
spark-class org.apache.spark.deploy.worker.Worker spark://MASTER_IP:PORT
Configure Spark:
set spark.master=<Spark Master URL>;
set spark.executor.memory=512m;
set spark.yarn.executor.memoryOverhead=<10-20% of the spark.executor.memory value>;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
Put your hive-site.xml in Spark's conf directory.
Hive can support multiple execution engines, such as Tez and Spark. You can set the property in hive-site.xml:
<property>
<name>hive.execution.engine</name>
<value>spark</value>
<description>
I am choosing Spark as the execution engine
</description>
</property>
Copy the spark-assembly jar to HIVE_HOME/lib.
Set SPARK_HOME.
Set the properties below:
set spark.master=<Spark Master URL>
set spark.eventLog.enabled=true;
set spark.eventLog.dir=<Spark event log folder (must exist)>
set spark.executor.memory=512m;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
The above steps should suffice, I think.
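Note that the steps above configure Hive to use Spark as its execution engine. If the goal is instead to use Hive tables from the Scala console (spark-shell) with spark-2.0.0-bin-hadoop2.7, a minimal sketch (assuming hive-site.xml has been copied into Spark's conf directory) looks like this:

import org.apache.spark.sql.SparkSession

// In spark-shell the session already exists as `spark`; in a standalone app
// you would build it with Hive support enabled:
val spark = SparkSession.builder()
  .appName("Hive from the Scala console")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
spark.sql("SELECT COUNT(*) FROM src").show()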

Error while adding a UDF in Hive

I have to add a UDF in Hive.
The query I am trying is:
create function strip1 as 'com.hadoopbook.hive.Strip' using jar '/home/hduser/Hadoop-tutorial/hadoop-book-master/ch17-hive/src/main/java/com/hadoopbook/hive/Strip.jar'
But I am getting an exception:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask. Hive warehouse is non-local, but /home/hduser/Hadoop-tutorial/hadoop-book-master/ch17-hive/src/main/java/com/hadoopbook/hive/Strip.jar specifies file on local filesystem. Resources on non-local warehouse should specify a non-local scheme/path
Can anyone tell me how to solve this?
Three options:
Copy the jar to HDFS and use that path.
OR
As the error is telling you: in the $HIVE_HOME/conf directory there is hive-default.xml and/or hive-site.xml, which has the hive.metastore.warehouse.dir property. Add hdfs:// to this path, then restart/re-run the Hive shell/script:
<property>
<name>hive.metastore.warehouse.dir</name>
<value>hdfs://user/hive/warehouse</value>
<description>location of the warehouse directory</description>
</property>
OR
if you are running hive queries from hive shell then:
hive> set hive.metastore.warehouse.dir;
hive.metastore.warehouse.dir=/user/hive/warehouse
The above command prints the path; just prefix hdfs:// to it as below and then re-run your Hive command(s):
hive> set hive.metastore.warehouse.dir="hdfs://user/hive/warehouse";
You could set the configuration hive.aux.jars.path to /home/hduser/Hadoop-tutorial/hadoop-book-master/ch17-hive/src/main/java/com/hadoopbook/hive/
and create the Hive UDF via the command below:
create function strip1 as 'com.hadoopbook.hive.Strip'
You can first try adding the UDF jar from an HDFS location instead of the local directory:
hive> add jar hdfs://user/cloudera/hive/udf/Strip.jar;
and then create the Hive function as below:
hive> create function test_function as 'com.hadoopbook.hive.Strip';
Hope this helps :)

Best place for json Serde JAR in CDH Hadoop for use with Hive/Hue/MapReduce

I'm using Hive/Hue/MapReduce with a json Serde. To get this working I have copied the json_serde.jar to several lib directories on every cluster node:
/opt/cloudera/parcels/CDH/lib/hive/lib
/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/lib
/opt/cloudera/parcels/CDH/lib/hadoop/lib
/opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/lib
...
On every CDH update of the cluster I have to do that again.
Is there a more elegant way in which the distribution of the SerDe across the cluster would be automatic and resistant to updates?
If you are using HiveServer2 (the default in Cloudera 5.0+), the following configuration will work across your entire cluster without having to copy the jar to each node.
Add it to your hive-site.xml config file, or, if you're using Cloudera Manager, to the "HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml" config box:
<property>
<name>hive.aux.jars.path</name>
<value>/user/hive/aux_jars/hive-serdes-1.0-snapshot.jar</value>
</property>
Then create the directory in your HDFS filesystem (/user/hive/aux_jars) and place the jar file in it. If you are running Hue, you can do this part via the web UI: just click on File Browser at the top right.
It depends on the version of Hue and whether you are using Beeswax or HiveServer2:
Beeswax: there is a workaround with HIVE_AUX_JARS_PATH: https://issues.cloudera.org/browse/HUE-1127
HiveServer2: it supports a hive.aux.jars.path property in hive-site.xml. HiveServer2 does not support a .hiverc, and Hue is looking at providing an equivalent at some point: https://issues.cloudera.org/browse/HUE-1066
