I am stuck on how to use PySpark to fetch data from a Hive server using JDBC.
I am trying to connect from PySpark, over JDBC, to HiveServer2 running on my local machine. All components (HDFS, PySpark, HiveServer2) are on the same machine.
Following is the code I am using to connect:
connProps = {"username": 'hive', "password": '', "driver": "org.apache.hive.jdbc.HiveDriver"}
sqlContext.read.jdbc(url='jdbc:hive2://127.0.0.1:10000/default', table='pokes', properties=connProps)
dataframe_mysql = sqlContext.read.format("jdbc") \
    .option("url", "jdbc:hive://localhost:10000/default") \
    .option("driver", "org.apache.hive.jdbc.HiveDriver") \
    .option("dbtable", "pokes").option("user", "hive").option("password", "").load()
Both methods above give me the same error:
org.apache.spark.sql.AnalysisException: java.lang.RuntimeException:
java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;
javax.jdo.JDOFatalDataStoreException: Unable to open a test connection
to the given database. JDBC url =
jdbc:derby:;databaseName=metastore_db;create=true, username = APP.
Terminating connection pool (set lazyInit to true if you expect to
start your database after your app).
ERROR XSDB6: Another instance of Derby may have already booted the database /home///jupyter-notebooks/metastore_db
metastore_db is located in the same directory where my Jupyter notebooks are created, but hive-site.xml specifies a different metastore location.
I have already checked other questions about the same error, which say that another spark-shell or similar process is running, but that is not the case here. Even if I try the following command while HiveServer2 and HDFS are down, I get the same error:
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive")
I am able to connect to Hive from a Java program using JDBC. Am I missing something here? Please help. Thanks in advance.
Spark should not use JDBC to connect to Hive.
It reads directly from the metastore and skips HiveServer2.
However, "Another instance of Derby may have already booted the database" means that you're running Spark from another session, such as another Jupyter kernel that's still running. Try setting a different metastore location, or set up a remote Hive metastore backed by a local MySQL or Postgres database and edit $SPARK_HOME/conf/hive-site.xml with that information.
From Spark SQL - Hive Tables:
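from os.path import abspath
from pyspark.sql import SparkSession

# warehouse_location is not defined in the snippet above; a local
# 'spark-warehouse' directory is assumed here for illustration.
warehouse_location = abspath('spark-warehouse')

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

# spark is an existing SparkSession
spark.sql("CREATE TABLE...")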
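To try the "different metastore location" suggestion without editing hive-site.xml, the same settings can be passed while the session is being built. This is only a minimal sketch under stated assumptions: the helper names are hypothetical, the thrift://localhost:9083 URI and the warehouse path are placeholders, and it presumes a standalone metastore service is already running at that address.

```python
def hive_session_options(metastore_uri, warehouse_dir):
    """Collect the options that point a Spark session at a remote Hive metastore.

    Both values are assumptions supplied by the caller; nothing is validated here.
    """
    return {
        "hive.metastore.uris": metastore_uri,      # e.g. thrift://host:9083
        "spark.sql.warehouse.dir": warehouse_dir,  # where managed tables live
    }


def build_hive_session(options, app_name="hive-metastore-example"):
    """Create (or reuse) a SparkSession with Hive support and the given options."""
    # Imported lazily so the option helper above works without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.enableHiveSupport().getOrCreate()
```

With PySpark installed, something like build_hive_session(hive_session_options("thrift://localhost:9083", "/user/hive/warehouse")).sql("SELECT * FROM pokes LIMIT 5").show() would then query the table through the metastore rather than through JDBC.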
I tried to run a Sqoop eval script through the AWS EMR CLI for a Teradata connection but got this error:
Error loading ManagerFactory information from file /usr/lib/sqoop/conf/managers.d/td_connector.txt: java.io.IOException: Could not load jar $SQOOP_HOME/lib/teradata-connector-1.6.5.jar into JVM. (Could not find class org.apache.sqoop.teradata.TeradataConnManager.)
Steps I followed:
Logged in over SSH to an EMR cluster, version emr-6.2.0, configured with Hadoop 3 and Sqoop 1.4.7.
Downloaded the Teradata Hadoop connector 3.x from Teradata downloads.
Moved the Teradata Hadoop connector to $SQOOP_HOME/lib and installed it.
Created the text file td_connect at /usr/lib/sqoop/conf/managers.d/ containing the line org.apache.sqoop.teradata.TeradataConnManager=$SQOOP_HOME/lib/teradata-connector-1.6.5.jar
Ran the script:
sqoop eval --connection-manager org.apache.sqoop.teradata.TeradataConnManager --connect jdbc:teradata://host/database= --username username --password password --query 'select top 5 * from table'
Could you please help identify the issue?
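One thing that may be worth checking: the error message prints the jar path with $SQOOP_HOME still in it, which suggests the path from the managers.d file might be used literally rather than expanded. The sketch below assumes each line of that file has the form class=jar; the function name and the returned keys are hypothetical. It only reports whether the jar path in such a line points at a real file.

```python
import os


def check_manager_entry(line):
    """Split a 'class=jar' line and report whether the jar path looks usable."""
    class_name, _, jar_path = line.strip().partition("=")
    return {
        "class_name": class_name,
        "jar_path": jar_path,
        # A '$' left in the path hints that a shell variable was never expanded.
        "has_unexpanded_variable": "$" in jar_path,
        # True only if the path, taken literally, is an existing file.
        "jar_exists": os.path.isfile(jar_path),
    }
```

If has_unexpanded_variable comes back True and jar_exists is False, writing the absolute path to the jar in the managers.d file would be one thing to try.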
I tried restarting my system, checked whether there is enough disk space, and made sure HiveServer2 is running. But I get these errors when I run hive at the shell prompt in Cloudera:
Logging initialized using configuration in
file:/etc/hive/conf.dist/hive-log4j.properties
WARN: The method class
org.apache.commons.logging.impl.SLF4JLogFactory#release() was invoked.
Exception in thread "main" java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.RuntimeException:
org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
The way to start Hive has changed: the old Hive CLI is deprecated, and Beeline is the recommended client for HiveServer2.
Beeline was developed specifically to interact with the new server. Unlike Hive CLI, which is an Apache Thrift-based client, Beeline is a JDBC client based on the SQLLine CLI — although the JDBC driver used communicates with HiveServer2 using HiveServer2’s Thrift APIs.
As Hive development has shifted from the original Hive server (HiveServer1) to the new server (HiveServer2), users and developers accordingly need to switch to the new client tool. However, there’s more to this process than simply switching the executable name from “hive” to “beeline”.
More information provided over here
Use the command below to enter interactive mode. Beeline supports the same commands as Hive, so you can run the same scripts in Beeline without any modifications.
beeline -u jdbc:hive2://
To start the Hive metastore:
sudo service hive-metastore start
I am using Spark's Java API to access a Hive metastore. Only Spark is installed on my machine; there is no Hadoop or Hive installation. I have created hive-site.xml, hdfs-site.xml, core-site.xml and yarn-site.xml inside the spark/conf directory. My Hive metastore is set up on another machine, which is part of a Hadoop cluster and is the namenode. I can access the Hive metastore from spark/bin/beeline and spark/bin/spark-shell on my desktop, but when I try to access it from the Java API, a metastore_db folder and a derby.log file are created in my project, which means the application is not reaching the remote metastore.
SparkSession spark = SparkSession
    .builder()
    .appName("Java Spark SQL basic example")
    .enableHiveSupport()
    .config("spark.sql.warehouse.dir", "hdfs://bigdata-namenode:9000/user/hive/warehouse")
    .config("mapred.input.dir.recursive", true)
    .config("hive.mapred.supports.subdirectories", true)
    .config("spark.sql.hive.thriftServer.singleSession", true)
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .master("local")
    .getOrCreate();
spark.sql("show databases").show();
When I start the Thrift server on my desktop (i.e. the client machine), I get this log, thriftserver.log,
which says spark.sql.warehouse.dir is set to a path on my local file system, i.e. not on HDFS, where the actual warehouse is located.
/spark/conf/core-site.xml
/spark/conf/hive-site.xml
I am new to this field. I was trying the CDH 5.8 QuickStart VM to run some basic Hive/Impala examples.
But I hit an issue: when I open Hue it gives the error below. I searched for a solution but didn't find anything that resolves my issue.
Configuration files located in /etc/hue/conf.empty
Potential misconfiguration detected. Fix and restart Hue.
Hive The application won't work without a running HiveServer2.
I checked HiveServer2 and it's up and running. I tried restarting the service and CDH, but it didn't help.
Hive Server2 is running [ OK ]
When I navigated to Hive and tried some commands, it gave me the error below.
Could not connect to quickstart.cloudera:10000 (code THRIFTTRANSPORT): TTransportException('Could not connect to quickstart.cloudera:10000',)
For Impala I am getting:
AnalysisException: This Impala daemon is not ready to accept user requests. Status: Waiting for catalog update from the StateStore.
I tried starting the metastore with hive --service metastore but got an error:
[cloudera@quickstart conf.empty]$ hive --service metastore
2017-03-03 05:37:14,502 WARN [main] mapreduce.TableMapReduceUtil: The hbase-prefix-tree module jar containing PrefixTreeCodec is not present. Continuing without it.
Starting Hive Metastore Server
org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address 0.0.0.0/0.0.0.0:9083.
I'm not sure what is wrong or whether I need to change some config. Can anyone guide me towards a solution?
Your HiveServer2 requires the Metastore to be up and running. It seems your Metastore server cannot start because port 9083 is already in use by some other service. Check it:
netstat -tulpn | grep 9083
If something is using this port, you need to either change the port of your metastore in the Hive configuration or stop the application that is already using it.
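The same check can be sketched in Python by attempting the bind that the metastore itself would attempt. This is an illustration rather than a replacement for netstat: the function name is hypothetical, and unlike netstat -p it cannot tell you which process holds the port.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP server socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "Address already in use": some other process owns the port.
            return False
        return True
```

Calling port_is_free(9083) before starting the metastore would return False in the situation described above.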
I am connecting to a Hive installation from JDBC client code. I have created a test table with two columns (column1, column2), both of string type. When I execute simple queries like "select * from test" I get results in the Java program, but queries with where clauses and other more complex queries throw the following exception:
"Query returned non-zero code: 1, cause: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MapRedTask"
I have tried changing permissions on the HDFS directories where the file is stored, and on /tmp on the local file system, but this didn't work.
This is my connection code:
Connection con = DriverManager.getConnection("jdbc:hive://"+host+":"+port+"/default", "", "");
Statement stmt = con.createStatement();
The error is thrown at the executeQuery() method.
Checking the logs on the server gives the following exception:
java.io.IOException: Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.
at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83)
at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76)
at org.apache.hadoop.mapred.JobClient.init(JobClient.java:478)
at org.apache.hadoop.mapred.JobClient.<init>(JobClient.java:457)
at org.apache.hadoop.hive.ql.exec.ExecDriver.execute(ExecDriver.java:426)
at org.apache.hadoop.hive.ql.exec.MapRedTask.execute(MapRedTask.java:138)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:138)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:57)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1374)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1160)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:973)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:893)
at org.apache.hadoop.hive.service.HiveServer$HiveServerHandler.execute(HiveServer.java:198)
at org.apache.hadoop.hive.service.ThriftHive$Processor$execute.getResult(ThriftHive.java:644)
at org.apache.hadoop.hive.service.ThriftHive$Processor$execute.getResult(ThriftHive.java:628)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Job Submission failed with exception 'java.io.IOException(Cannot initialize Cluster. Please check your configuration for mapreduce.framework.name and the correspond server addresses.)'
The queries work when run from the command prompt but not from the JDBC client.
I am stuck on this. Any suggestions would be helpful.
UPDATE
I am using the Cloudera CDH4 Hadoop/Hive distribution. The script that I ran is as follows:
#!/bin/bash
HADOOP_HOME=/usr/lib/hadoop/client
HIVE_HOME=/usr/lib/hive
echo -e '1\x01foo' > /tmp/a.txt
echo -e '2\x01bar' >> /tmp/a.txt
HADOOP_CORE={{ls $HADOOP_HOME/hadoop*core*.jar}}
CLASSPATH=.:$HADOOP_CORE:$HIVE_HOME/conf
for i in ${HIVE_HOME}/lib/*.jar ; do
CLASSPATH=$CLASSPATH:$i
done
for i in ${HADOOP_HOME}/*.jar ; do
CLASSPATH=$CLASSPATH:$i
done
java -cp $CLASSPATH com.hive.test.HiveConnect
I changed HADOOP_CORE={{ls $HADOOP_HOME/hadoop-*-core.jar}} to HADOOP_CORE={{ls $HADOOP_HOME/hadoop*core*.jar}} because there was no jar file in my HADOOP_HOME starting with hadoop- and ending with -core.jar. Is this correct? Also, running the script gives the following error:
/usr/lib/hadoop/client/hadoop*core*.jar}}: No such file or directory
I also modified the script to add the Hadoop client jars to the classpath, because the script threw an error that the Hadoop fileReader was not found. So I added the following as well:
for i in ${HADOOP_HOME}/*.jar ; do
CLASSPATH=$CLASSPATH:$i
done
This executes the class file and runs the query "select * from test" but fails on "select column1 from test".
Still no success and the same error.
Since it runs fine from the Hive shell, can you check whether the user running the Hive shell and the user running the Java program (with JDBC) are the same?
Next, start the Thrift server.
cd to the directory where Hive is installed and issue this command:
bin/hive --service hiveserver &
You should see:
Starting Hive Thrift Server
A quick way to ensure the HiveServer is running is to use the netstat command to determine whether port 10000 is open and listening for connections:
netstat -nl | grep 10000
tcp 0 0 :::10000 :::* LISTEN
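If netstat is not available, a rough equivalent is to attempt a TCP connection to the port. The sketch below is only a liveness probe under that assumption: the function name is hypothetical, and a successful connect shows that something is listening, not that it is actually the Hive Thrift server.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port is accepted within timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```

For the setup above, is_listening("127.0.0.1", 10000) should return True once the server has started.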
Next, create a file called myhivetest.sh with the following contents, replacing HADOOP_HOME, HIVE_HOME and package.youMainClass according to your setup:
#!/bin/bash
HADOOP_HOME=/your/path/to/hadoop
HIVE_HOME=/your/path/to/hive
echo -e '1\x01foo' > /tmp/a.txt
echo -e '2\x01bar' >> /tmp/a.txt
HADOOP_CORE=$(ls $HADOOP_HOME/hadoop-*-core.jar)
CLASSPATH=.:$HADOOP_CORE:$HIVE_HOME/conf
for i in ${HIVE_HOME}/lib/*.jar ; do
CLASSPATH=$CLASSPATH:$i
done
java -cp $CLASSPATH package.youMainClass
Save myhivetest.sh and run chmod +x myhivetest.sh. You can then run the bash script with ./myhivetest.sh, which builds your classpath before invoking your Hive program.
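The classpath assembly in the script can also be sketched in Python, which makes it easier to see what happens when a glob pattern matches nothing (in that case the pattern simply contributes no entries). This is an illustrative sketch that mirrors the script's layout; the function name is hypothetical and the directory structure under HIVE_HOME and HADOOP_HOME is assumed to be the one the script expects.

```python
import glob
import os


def build_classpath(hive_home, hadoop_home):
    """Assemble a Java classpath string the way the shell script does."""
    entries = ["."]
    # Hadoop core jar(s); an empty match adds nothing instead of a broken entry.
    entries += sorted(glob.glob(os.path.join(hadoop_home, "hadoop-*-core.jar")))
    # Hive configuration directory, so hive-site.xml is found on the classpath.
    entries.append(os.path.join(hive_home, "conf"))
    # Every jar shipped in Hive's lib directory.
    entries += sorted(glob.glob(os.path.join(hive_home, "lib", "*.jar")))
    return os.pathsep.join(entries)
```

The resulting string could then be passed to java -cp in place of $CLASSPATH.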
Please follow the instructions here for details.
There are two modes: embedded mode and standalone mode.
You should look at standalone mode.
For your information:
Hive is not a full-featured query engine like DBMSs such as MySQL, Oracle and Teradata.
Hive has limitations on how complex your queries can be, for example very complex joins.
Hive runs Hadoop MapReduce jobs when you do a query.
Check this tutorial for what type of queries are supported and which are not.
Hope this helps.
I had the same issue and managed to resolve it.
This error popped up when I was running the Hive JDBC client on a Hadoop cluster with /user accounts set up.
In such an environment, the ability to run MapReduce jobs is based entirely on permissions.
With the wrong user in the connection string, the MapReduce framework was not able to set up staging directories and launch the job.
Please look at your connection string (if this error is popping up in a Hadoop cluster setup).
If the connection string looks like this:
Connection con = DriverManager
.getConnection(
"jdbc:hive2://cluster.xyz.com:10000/default",
"hive", "");
change it to:
Connection con = DriverManager
.getConnection(
"jdbc:hive2://cluster.xyz.com:10000/default",
"user1", "");
where user1 is a configured user on the cluster setup.
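The same idea as a small sketch: build the connection arguments from a user that exists on the cluster instead of hard-coding hive. The function name is hypothetical, and falling back to the local login name is an assumption that only helps if that account also exists on the cluster.

```python
import getpass


def hive2_connection_args(host, port=10000, database="default", user=None):
    """Return (url, user, password) as they would be passed to getConnection."""
    if user is None:
        # Assumption: the local login name is also a configured cluster user.
        user = getpass.getuser()
    url = "jdbc:hive2://{}:{}/{}".format(host, port, database)
    # Empty password, matching the connection examples above.
    return url, user, ""
```

For example, hive2_connection_args("cluster.xyz.com", user="user1") yields the URL, user and empty password used in the corrected connection above.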
I was having similar issues. I am trying to query Hive using Oracle SQL Developer (http://www.oracle.com/technetwork/developer-tools/sql-developer/overview/index.html) combined with a third-party JDBC driver as described here: https://blogs.oracle.com/datawarehousing/entry/oracle_sql_developer_data_modeler. Yes, I know that I could use Hue to do this but I interact with many other databases, including Oracle, and it is nice to have a rich client that I can save SQL queries and simple reports directly on my machine.
I am running the latest version of Cloudera CDH (5.4) on a cluster on AWS.
I was able to issue simple queries such as "SELECT * FROM SAMPLE_07" and receive a result, but running "SELECT COUNT(*) FROM SAMPLE_07" would throw a JDBC error. I was able to solve this by creating a user in Hue, and entering this user information in the Oracle SQL Developer connection information dialog. After doing this, I was able to run both queries.
What was confusing about this is that I was able to run a simple SELECT statement and received no error. What I am used to is that either a) I can log into a system and run queries or b) I can't. It is strange that it "sort of" works without the correct user ID, but I guess it is one of those strange Hadoop things.