Avoid starting HiveThriftServer2 with created context programmatically - hadoop

We are trying to use the Thrift Server to query data from Spark temp tables, in Spark 2.0.0.
First, we created a SparkSession with Hive support enabled.
Currently, we start the Thrift Server with the sqlContext like this:
HiveThriftServer2.startWithContext(spark.sqlContext());
We have a Spark stream with a registered temp table "spark_temp_table":
StreamingQuery streamingQuery = streamedData.writeStream()
    .format("memory")
    .queryName("spark_temp_table")
    .start();
With beeline we are able to see the temp tables (by running SHOW TABLES).
When we want to run a second job (with a second SparkSession) using this approach, we have to start a second Thrift Server on a different port.
I have two questions here:
Is there any way to have one Thrift Server on one port with access to all temp tables across different SparkSessions?
HiveThriftServer2.startWithContext(spark.sqlContext()); is annotated with @DeveloperApi. Is there any way to start the Thrift Server with a context other than programmatically in the code?
I saw there is a configuration option --conf spark.sql.hive.thriftServer.singleSession=true passed to the Thrift Server on startup (sbin/start-thriftserver.sh), but I don't understand how to define this for a job. I tried to set this configuration property in the SparkSession builder, but beeline didn't display the temp tables.

Is there any way to have one Thrift Server on one port with access to all temp tables across different SparkSessions?
No. The Thrift Server uses a specific session, and temporary tables can be accessed only within that session. This is why:
beeline didn't display temp tables.
when you start an independent server with sbin/start-thriftserver.sh.
spark.sql.hive.thriftServer.singleSession doesn't mean you get a single session shared across multiple servers. It means the same session is used for all connections to a single Thrift Server. A possible use case:
you start a Thrift Server;
client1 connects to this server and creates a temp table foo;
client2 connects to this server and reads foo.
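For completeness, a minimal sketch of starting a single Thrift Server from code with that property set (class name and app name below are illustrative, not from the question):

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2;

public class SingleSessionThriftServer {
    public static void main(String[] args) {
        // One SparkSession with singleSession enabled, one Thrift Server.
        // All JDBC connections to this server then share the same session state,
        // including temp views registered through this SparkSession.
        SparkSession spark = SparkSession.builder()
            .appName("thrift-with-temp-tables") // hypothetical app name
            .config("spark.sql.hive.thriftServer.singleSession", "true")
            .enableHiveSupport()
            .getOrCreate();

        HiveThriftServer2.startWithContext(spark.sqlContext());
    }
}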

Related

Multi-Region Aurora with write forwarding from Spring Boot Application

I have created a multi-region (e.g. two-region) Aurora cluster based on the MySQL engine. It has a primary cluster with 1 writer and 1 reader instance, and a secondary cluster with only reader instances.
As per the Aurora documentation here, the following command, run on a reader instance in the secondary region, can forward any write call to the primary cluster's writer instance:
SET aurora_replica_read_consistency = 'session';
This works fine when I do the same via the mysql client, and I can then use the secondary reader instance for write operations too.
Now, I have created an application with a separate instance for each of these two regions. The primary application instance is connected to the primary Aurora cluster with the writer and reader, so I can do both read and write operations there.
For the secondary application instance, which is connected to the secondary Aurora cluster with only reader instances, only read operations are working.
As a solution I created writeForward.sql in the Spring Boot application to execute and set aurora_replica_read_consistency during application initialisation on the secondary cluster only. For this, I added the following property to the parameter store in the secondary region only:
spring.datasource.data=classpath:writeForward.sql
But this is somehow not working, and the secondary application is still not able to do any write operations.
I am looking for some help on how to handle this.
After reading through the Aurora documentation again, I realised that write forwarding from the secondary region only works when the property aurora_replica_read_consistency is set for each session:
Always set the aurora_replica_read_consistency parameter for any session for which you want to forward writes. If you don't, Aurora doesn't enable write forwarding for that session.
To make this possible, each DB connection made by the application needs to execute this command:
SET aurora_replica_read_consistency = 'session';
For a Spring Boot application using the Hikari DB connection pool, I used the following property, which automatically executes the above SQL command for each connection that is maintained with the DB:
spring.datasource.hikari.connection-init-sql= SET aurora_replica_read_consistency = 'session'
Details about the Hikari connection pool can be found here, which mentions the connectionInitSql property.
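If you configure Hikari programmatically rather than through application properties, a rough sketch of the equivalent uses setConnectionInitSql (the JDBC URL and credentials below are placeholders):

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class SecondaryRegionDataSource {
    public static HikariDataSource build() {
        HikariConfig config = new HikariConfig();
        // Placeholder endpoint for the secondary (read-only) cluster.
        config.setJdbcUrl("jdbc:mysql://secondary-cluster.cluster-ro-example.rds.amazonaws.com:3306/mydb");
        config.setUsername("app_user");
        config.setPassword("app_password");
        // Executed on every new connection, so write forwarding is enabled per session.
        config.setConnectionInitSql("SET aurora_replica_read_consistency = 'session'");
        return new HikariDataSource(config);
    }
}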

Can we install multiple hive servers in the same cluster

I would like to enable two application instances to share a single HDFS cluster, but each instance of the application requires its own Hive database.
Is there a way to configure multiple independent Hive Servers/Metastores within a cluster so that each application can use the data in the cluster?
each instance of the application requires its own Hive database
Then do CREATE DATABASE my_own_database; in Hive.
Before any queries in the other app, run USE my_own_database; or SELECT * FROM my_own_database.table.
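A rough sketch of doing this from application code over JDBC (the HiveServer2 host, port and credentials below are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveDatabasePerApp {
    public static void main(String[] args) throws Exception {
        // Requires the Hive JDBC driver on the classpath; adjust host/port for your cluster.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hiveserver-host:10000/default", "app_user", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE DATABASE IF NOT EXISTS my_own_database");
            stmt.execute("USE my_own_database");
            // Application queries from here on run against my_own_database.
        }
    }
}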
Otherwise, sure, you would have to install and configure a separate Hive metastore Java process pointing at a different backing database (or even a separate server) in hive-site.xml:
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:<protocol>://<host>:<port>/<databasename></value>
</property>
Then your applications would have to set hive.metastore.uris to point at that instance:
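For example, the client-side hive-site.xml entry would look roughly like this (the host is a placeholder; 9083 is the usual metastore default port):

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>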

cache spark table in thrift server

When using Jupyter to cache some data in Spark (using sqlContext.cacheTable) I can see the table cached for the SparkContext running within Jupyter. But now I want to access those cached tables from BI tools via ODBC using the Thrift Server. When checking the Thrift Server cache I don't see any tables. The question is: how do I get those tables cached so they can be consumed from BI tools?
Do I have to send the same Spark commands via JDBC? In that case, is the context related to the current session?
I found the solution. In order to have the tables cached so they can be used by JDBC/ODBC clients via the Thrift Server, I have to run CACHE TABLE from one of those clients, for example from beeline. Once this is done, the table is in memory for all the different sessions.
It is also important to be sure you are using the right Spark Thrift Server. To check, just run SHOW TABLES; in beeline: if you get just one column back, you are not using the Spark one and CACHE TABLE won't work.
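If you prefer to trigger the caching from application code instead of beeline, a rough sketch over plain JDBC against the Spark Thrift Server might look like this (host, port and table name are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CacheTableViaThrift {
    public static void main(String[] args) throws Exception {
        // The Spark Thrift Server speaks the HiveServer2 protocol, so the Hive JDBC driver is used.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://spark-thrift-host:10000/default", "app_user", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CACHE TABLE my_table");
            // The table is now served from memory for all sessions of this Thrift Server.
        }
    }
}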

HDP: Make Spark RDD Tables accessible via JDBC using Hive Thrift

I'm using Spark Streaming to analyze tweets in a sliding window. As I don't want to save all the data but just the current data of the window, I want to query the data directly from memory.
My problem is pretty much identical to this one:
How to Access RDD Tables via Spark SQL as a JDBC Distributed Query Engine?
This is the important part of my code:
sentimentedWords.foreachRDD { rdd =>
  val hiveContext = new HiveContext(SparkContext.getOrCreate())
  import hiveContext.implicits._
  val dataFrame = rdd.toDF("sentiment", "tweet")
  dataFrame.registerTempTable("tweets")
  HiveThriftServer2.startWithContext(hiveContext)
}
As I found out, the HiveThriftServer2.startWithContext(hiveContext) line starts up a new Thrift Server that should provide access to the temp table via JDBC. However, I get the following exception in my console:
org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address 0.0.0.0/0.0.0.0:10000.
at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:93)
at org.apache.thrift.transport.TServerSocket.<init>(TServerSocket.java:79)
at org.apache.hive.service.auth.HiveAuthFactory.getServerSocket(HiveAuthFactory.java:236)
at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:69)
at java.lang.Thread.run(Thread.java:745)
As I'm using the Hortonworks Data Platform (HDP), port 10000 is already in use by the default Hive Thrift Server! I logged into Ambari and changed the ports as follows:
<property>
  <name>hive.server2.thrift.http.port</name>
  <value>12345</value>
</property>
<property>
  <name>hive.server2.thrift.port</name>
  <value>12345</value>
</property>
But this made it worse. Now Ambari shows that it can't start the service due to some ConnectionRefused error. Other ports like 10001 don't work either, and port 10000 is still in use after restarting Hive.
I assume that if I can use port 10000 for my Spark application/Thrift Server and move the default Hive Thrift Server to some other port, then everything should be fine. Alternatively, I could tell my application to start the Thrift Server on a different port, but I don't know if that's possible.
Any ideas?
Additional comment:
Killing the service listening on port 10000 has no effect.
I finally fixed the problem as follows:
As I'm using Spark Streaming, my job runs in an infinite loop. Inside the loop I had the line that starts the Thrift Server:
HiveThriftServer2.startWithContext(hiveContext)
This resulted in my console being spammed with the "Could not create ServerSocket" messages. I overlooked that my code was working fine and that I had just accidentally tried to start multiple servers... awkward.
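A rough Java-flavoured sketch of the corrected shape, starting the server once and only refreshing the temp view per batch (class, bean and variable names here are illustrative, using the Spark 2.x API rather than the deprecated HiveContext):

import java.io.Serializable;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2;
import org.apache.spark.streaming.api.java.JavaDStream;

public class TweetSentimentJob {

    // Simple bean with the two columns used in the question's code.
    public static class SentimentedTweet implements Serializable {
        private String sentiment;
        private String tweet;
        public String getSentiment() { return sentiment; }
        public void setSentiment(String sentiment) { this.sentiment = sentiment; }
        public String getTweet() { return tweet; }
        public void setTweet(String tweet) { this.tweet = tweet; }
    }

    public static void run(SparkSession spark, JavaDStream<SentimentedTweet> sentimentedWords) {
        // Start the Thrift Server exactly once; starting it inside foreachRDD is
        // what produced the repeated "Could not create ServerSocket" errors.
        HiveThriftServer2.startWithContext(spark.sqlContext());

        sentimentedWords.foreachRDD((JavaRDD<SentimentedTweet> rdd) -> {
            // Per batch, only refresh the temp view with the current window's data.
            Dataset<Row> df = spark.createDataFrame(rdd, SentimentedTweet.class);
            df.createOrReplaceTempView("tweets");
        });
    }
}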
What's also important to mention:
If you are using Hortonworks HDP: do not use the default beeline command on your PATH. Start the "correct" beeline that can be found in $SPARK_HOME/bin/beeline. This took me hours to find out! I don't know what's wrong with the regular beeline, and at this point I honestly don't care anymore...
Besides that: after a restart of my HDP Sandbox, the ConnectionRefused issue with Ambari was also gone.

hadoop hive question

I'm trying to create tables programmatically using JDBC. However, I can't see the tables I created from the Hive shell. What's worse, when I access the Hive shell from different directories, I see different contents of the database.
Is there any setting I need to configure?
Thanks in advance.
Make sure you run Hive from the same directory every time, because when you launch the Hive CLI for the first time it creates a Derby metastore DB in the current directory. This Derby DB contains the metadata of your Hive tables; if you change directories, you end up with scattered metadata for your Hive tables. The Derby DB also cannot handle multiple sessions. To allow concurrent Hive access you need a real database to manage the metastore rather than the wimpy little Derby DB that comes with it. You can install MySQL for this and change the Hive properties to use a JDBC connection with the MySQL Type 4 pure Java driver.
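A typical hive-site.xml fragment for that switch looks roughly like this (host, database name and credentials are placeholders):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mysql-host:3306/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>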
Try emailing the Hive userlist or the IRC channel.
You probably need to set up the central Hive metastore (by default Derby, but it can be MySQL/Oracle/Postgres). The metastore is the "glue" between Hive and HDFS. It tells Hive where your data files live in HDFS, what type of data they contain, what tables they belong to, etc.
For more information, see http://wiki.apache.org/hadoop/HiveDerbyServerMode
Examine your Hadoop logs. For me this happened when my Hadoop system was not set up properly; the namenode was not able to contact the datanodes on other machines, etc.
Yeah, it's due to the metastore not being set up properly. The metastore stores the metadata associated with your Hive tables (e.g. the table name, table location, column names, column types, bucketing/sorting information, partitioning information, SerDe information, etc.).
The default metastore is an embedded Derby database which can only be used by one client at any given time. This is obviously not good enough for most practical purposes. You, like most users, should configure your Hive installation to use a different metastore. MySQL seems to be a popular choice. I used this link from Cloudera's website to successfully configure my MySQL metastore.
