HDP: Make Spark RDD Tables accessible via JDBC using Hive Thrift

I'm using Spark Streaming to analyze tweets in a sliding window. As I don't want to save all the data, only the current contents of the window, I want to query the data directly from memory.
My problem is pretty much identical to this one:
How to Access RDD Tables via Spark SQL as a JDBC Distributed Query Engine?
This is the important part of my code:
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

sentimentedWords.foreachRDD { rdd =>
  val hiveContext = new HiveContext(SparkContext.getOrCreate())
  import hiveContext.implicits._
  val dataFrame = rdd.toDF("sentiment", "tweet")
  dataFrame.registerTempTable("tweets")
  HiveThriftServer2.startWithContext(hiveContext)
}
As I found out, the HiveThriftServer2.startWithContext(hiveContext) line starts a new Thrift server that should provide access to the temp table via JDBC. However, I get the following exception in my console:
org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address 0.0.0.0/0.0.0.0:10000.
at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:93)
at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:79)
at org.apache.hive.service.auth.HiveAuthFactory.getServerSocket(HiveAuthFactory.java:236)
at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:69)
at java.lang.Thread.run(Thread.java:745)
As I'm using the Hortonworks Data Platform (HDP), port 10000 is already in use by the default Hive Thrift server! I logged into Ambari and changed the ports as follows:
<property>
  <name>hive.server2.thrift.http.port</name>
  <value>12345</value>
</property>

<property>
  <name>hive.server2.thrift.port</name>
  <value>12345</value>
</property>
But this made things worse. Now Ambari shows that it can't start the service due to a ConnectionRefused error. Other ports like 10001 don't work either, and port 10000 is still in use after restarting Hive.
I assume that if I can use port 10000 for my Spark application's Thrift server and move the default Hive Thrift server to some other port, everything should be fine. Alternatively, I could tell my application to start its Thrift server on a different port, but I don't know if that's possible.
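Starting the Thrift server on a different port from the application does appear to be possible: HiveThriftServer2 reads hive.server2.thrift.port from the context's configuration. A minimal sketch, assuming the Spark 1.x HiveContext from the code above; 10001 is just an example value, not a recommendation:

// Sketch: set a free port on the context before starting the Thrift server.
hiveContext.setConf("hive.server2.thrift.port", "10001")
HiveThriftServer2.startWithContext(hiveContext)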
Any ideas?
Additional comment:
Killing the service listening on port 10000 has no effect.
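For reference, the process holding the port can be identified with a standard tool such as lsof (a sketch, assuming a Linux sandbox):

lsof -i :10000    # shows the PID bound to port 10000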

I finally fixed the problem as follows:
As I'm using Spark Streaming, my job runs in an infinite loop, and inside the loop I had the line that starts the Thrift server:
HiveThriftServer2.startWithContext(hiveContext)
This resulted in my console being spammed with the "Could not create ServerSocket" messages. I had overlooked that my code was working fine and that I was just accidentally trying to start multiple servers... awkward.
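A minimal sketch of the corrected structure, assuming the same streaming setup as in the question (sentimentedWords and the column names are taken from the code above):

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Create the context and start the Thrift server exactly once,
// outside the streaming loop.
val hiveContext = new HiveContext(SparkContext.getOrCreate())
import hiveContext.implicits._
HiveThriftServer2.startWithContext(hiveContext)

// Inside the loop, only refresh the temp table with the current window.
sentimentedWords.foreachRDD { rdd =>
  val dataFrame = rdd.toDF("sentiment", "tweet")
  dataFrame.registerTempTable("tweets")
}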
What's also important to mention:
If you are using Hortonworks HDP: do not use the system-wide beeline command. Use the "correct" Beeline found at $SPARK_HOME/bin/beeline. This took me hours to find out! I don't know what's wrong with the regular beeline, and at this point, to be honest, I don't care anymore...
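For example, connecting with Spark's Beeline might look like this (localhost and 10000 are assumptions based on the defaults):

$SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000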
Besides that: after a restart of my HDP sandbox, the ConnectionRefused issue with Ambari was also gone.

Related

Connecting Apache Superset with Hive

I have my Hadoop cluster running in an AWS environment, where the schema is mapped in Hive, and I can see the complete data in Hive.
Now, here is the problem: I am trying to connect Hive to Superset, but I am unable to connect.
This is how I have provided my URI:
jdbc+hive://MYIP:PORT
Also tried:
hive://username:password@MYIP:PORT
Make sure HiveServer2 is up and running.
You can also try this one:
hive://hostname:10000/default?auth=NOSASL
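Worth noting: Superset connects through SQLAlchemy rather than JDBC, which is why a jdbc+hive:// URL won't work. The hive:// dialect is provided by the PyHive package, which must be installed wherever Superset runs:

pip install pyhive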

How to configure HUE to be connected to remote Hive server?

I'm trying to use HUE Beeswax to connect to my company's Hive database. First, is it possible to use HUE installed on my Mac to connect to a remote Hive server? If so, how am I supposed to find the address of the Hive server running on our private server? The only thing I can do is type 'hive' and run some SQL queries in the Hive shell. I have already installed HUE but can't figure out how to connect it to the remote Hive server. Any tips would be much appreciated.
If all you want is a desktop connection to Hive, you only need a JDBC client, not a full web app like Hue.
In any case, Hive CLI is deprecated. Beeline is preferred. To use Beeline and Hue, you need a HiveServer2 running.
To find the address of the HiveServer2, if you have one, find your hive-site.xml file on the Hadoop cluster and export it. Other ways to get this information are available in Ambari or Cloudera Manager (but if you're using a Cloudera CDH cluster, you already have Hue). The Thrift interface is what you want; the default port is 10000.
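If HiveServer2 is not already running on the cluster, it can typically be started with the standard Hive launcher:

hive --service hiveserver2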
When you set up Hue, you will need to find the hue.ini file and edit the section that starts with [beeswax], filling in the necessary values. Personally, I find that section fairly straightforward.
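A minimal sketch of that section, assuming HiveServer2 is reachable at hive-host on the default Thrift port (substitute your own values):

[beeswax]
  # Host and Thrift port of the remote HiveServer2
  hive_server_host=hive-host
  hive_server_port=10000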
You can read the Hue GitHub repo to find the requirements for running it on a Mac.

Query Hive remotely using shell

Let's imagine I have access to a Hive data warehouse that I can query using some web service. The problem is that I cannot automate queries using this service, so I would like to query Hive from an external script (which I could automate).
So far I've only seen people running Hive on their local machine and querying it; I was wondering whether it's possible to do this remotely, and if so, how?
Thanks a lot!
As far as I understand, you are asking whether there are ways to connect to Hive from a remote machine.
You could install the Hive client (Beeline) on any remote machine and connect to Hive via JDBC.
Take a look here:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients
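For example, a remote connection with Beeline might look like this (host, port, and user are placeholders):

beeline -u "jdbc:hive2://hiveserver2-host:10000/default" -n myuser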
An easy way to do this is to deploy the Hadoop/YARN client configuration on the remote machine. If the remote cluster is secured with firewalls and Kerberos, you will need access through those first. After that it's just a matter of starting a Hive shell or submitting a job to YARN.
If you use Cloudera, you might be able to add the host to the cluster and install a "gateway" role for YARN and Hive on the target machine. This is very straightforward and requires just a few minutes of work.
Alternatively using the JDBC connector should also work, as stated in Facha's answer.

How to start Spark (with Thrift Server) in non-blocking mode so that Hive can update and reload data into Spark (table locking)

We have problems with table locking. We need simultaneous access to tables from Hive and Spark (with Thrift Server). However, running Spark with Thrift Server results in a table lock.
We're running on an Amazon AWS EMR cluster with Hive, Spark, and Thrift Server 2.
We'd like to use Hive to update an S3 store and periodically load this aggregated data into Spark in the background. Meanwhile, Spark is always on with the Thrift server loaded and the same data loaded from S3, to do real-time aggregations on this data. Spark does not need write access to this data.
The problem is that running the periodic data-loading tasks in Hive results in the job freezing.
We think the metastore may be locked by Spark / Thrift Server, blocking Hive from updating and reloading data into Spark (but we're not sure about this).
Is it possible to start spark and thrift server in read only non-blocking mode?
What may cause the problem? Anyone experienced similar problems?
How is your metastore configured? Does it use Derby for the metastore?
With the default configuration it uses Derby, which does not support multiple concurrent users.
If so, you should change it to something like MySQL, which does support multiple users.
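A sketch of the relevant hive-site.xml properties for a MySQL-backed metastore (host, database name, and credentials are placeholders):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>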

Connect Tableau with Elasticsearch via Spark SQL

I found a post that discusses connecting Tableau to Elasticsearch via Hive SQL. I was wondering if there is a way to connect to Elasticsearch via Spark SQL instead, as I am not very familiar with Hive.
Thanks.
@busybug91,
The right driver is here; please try this one. It could solve your issue.
@NicholasY I got it resolved after a couple of trials. Two steps that I took:
I wasn't using the right driver for the connection. I was using the DataStax Enterprise driver. However, they have a driver for Spark SQL as well. I used the Windows 64-bit version of the driver. Using the MapR Hadoop Hive and Hortonworks Hadoop Hive drivers didn't work, as I have Apache Hive.
When I used the right driver (from DataStax), I realized that my Hive metastore and Spark Thrift server were running on the same port. I changed the Spark Thrift server's port to 10001 and a successful connection was established.
A new problem: I've created an external table in Hive, and I am able to query the data. I start the Hive metastore as a service. However, as mentioned on this link, I am not able to see my Hive tables in Spark SQL. My connection from Tableau to Spark SQL is of no use unless I can see the tables from the Hive metastore! When I run show tables; in Spark SQL (via the spark-sql shell, with the Hive metastore running as a service at the same time), it runs a job that reports a completion time but no table names. Monitoring it via localhost:4040, I see that the input and output sizes are 0.0. I believe I am not getting the Hive tables in Spark SQL, which is why I don't see any tables after the connection is established from Tableau to Spark SQL.
EDIT
I changed the metastore from Derby to MySQL for both Hive and Spark SQL.
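For Spark SQL to see tables from the Hive metastore, the hive-site.xml in spark/conf should also point at the running metastore service; a sketch, assuming the metastore listens on its default port 9083 (the host is a placeholder):

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>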
I'm trying to do the same thing, so maybe I can help warn you about a few things.
First, compile a Spark SQL version with Hive and Thrift Server support (Hive version 0.13):
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
You need a properly configured hive-site.xml to work with Hive, and you need to copy it to the spark/conf folder.
Then you have to add the elasticsearch-hadoop jar path to $CLASSPATH.
Be careful! Spark SQL 1.2.0 does not work with elasticsearch-hadoop-2.0.x. You have to use elasticsearch-hadoop-2.1.0.Beta4 or a BUILD-SNAPSHOT available here.
Finally, you have to run the Thrift server with something like this:
./start-thriftserver.sh --master spark://master:7077 --driver-class-path $CLASSPATH --jars /root/spark-sql/spark-1.2.0/lib/elasticsearch-hadoop-2.1.0.Beta4.jar --hiveconf hive.server2.thrift.bind.host=0.0.0.0 --hiveconf hive.server2.thrift.port=10000
It works for me, but only on a small docType (5000 rows); the data colocation doesn't seem to be working. I'm looking for a solution to place the elasticsearch-hadoop jar on each Spark worker, as ryrobes did for Hadoop.
If you find a way to get data-local access to Elasticsearch, let me know ;)
HTH,
