I have an application that runs a SQL query on a Spark DataFrame like this:
DataFrame sqlDataFrame = sqlContext.createDataFrame(accessLogs, ApacheAccessLog.class);
sqlDataFrame.registerTempTable("logs");
sqlContext.cacheTable("logs");
Row contentSizeStats = sqlContext.sql( "SELECT SUM(contentSize), COUNT(*), MIN(contentSize), MAX(contentSize) FROM logs")
.javaRDD()
.collect()
.get(0);
I can submit this application to Spark using spark-submit, and it works fine.
But now I want to develop a web application (using Spring or another framework): users write a SQL script in the front end, click a Query button, and the web server sends the SQL script to Apache Spark to run the query, just like spark-submit did above. After the SQL executes, I would like Spark to send the result back to the web server.
The official documentation mentions that we can use the Thrift JDBC/ODBC server, but it only shows how to connect to the Thrift server. There is no other information about how to perform the query. Did I miss anything? Is there an example I can look at?
Thanks in advance!
Yes, the Thrift JDBC/ODBC server is the better option. You can use the HiveServer2 service.
Here is the code:
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2;

// SparkConnection is the application's own helper that builds the HiveContext
HiveContext hiveContext = SparkConnection.getHiveContext();
hiveContext.setConf("hive.server2.thrift.port", "10002");
hiveContext.setConf("hive.server2.thrift.bind.host", "192.168.1.25");
HiveThriftServer2.startWithContext(hiveContext);
It will open a JDBC port, and you can connect to it with the Hive JDBC driver.
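On the web-server side, a minimal sketch of running the query from the question over that port with the Hive JDBC driver could look like the following (host, port, and the "logs" table are taken from the snippets above; this assumes hive-jdbc is on the web app's classpath and that the temp table is registered in the same context that backs the Thrift server):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// minimal sketch: connect to the Thrift server started above and run the stats query
Class.forName("org.apache.hive.jdbc.HiveDriver");
Connection conn = DriverManager.getConnection("jdbc:hive2://192.168.1.25:10002/default", "", "");
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery(
    "SELECT SUM(contentSize), COUNT(*), MIN(contentSize), MAX(contentSize) FROM logs");
if (rs.next()) {
    // these values can be serialized and sent back to the front end
    System.out.println(rs.getLong(1) + ", " + rs.getLong(2) + ", " + rs.getLong(3) + ", " + rs.getLong(4));
}
conn.close();

In the web application the SQL string would of course come from whatever the user typed in the front end.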
I'm trying to get LibreOffice Base v5.1.4.2, running on Ubuntu 16.04, to connect to a Hive v1.2.1 database via JDBC. I added the following jars, downloaded from Maven Central, to LibreOffice's class path ('Tools -> LibreOffice -> Advanced -> Class Path'):
hive-common-1.2.1.jar
hive-jdbc-1.2.1.jar
hive-metastore-1.2.1.jar
hive-service-1.2.1.jar
hadoop-common-2.6.2.jar
httpclient-4.4.jar
httpcore-4.4.jar
libthrift-0.9.2.jar
commons-logging-1.1.3.jar
slf4j-api-1.7.5.jar
I then restarted LibreOffice, opened Base, selected 'Connect to an existing database' -> 'JDBC', and set the connection properties.
I entered the credentials and clicked the 'Test Connection' button, which returned a "the connection was established successfully" message. Great!
In the LibreOffice Base UI, the options under the 'Tables' panel were grayed out. The options in the 'Queries' tab were not, so I tried to query Hive from there.
The 'Use Wizard to Create Query' option prompts for a password and then returns "The field names from 'airline.on_time_performance' could not be retrieved."
The JDBC connection is able to connect to Hive and list the tables, though it seems to have problems retrieving the columns. When I try to execute a simple select statement, the 'Create Query in SQL View' option returns a somewhat cryptic "Method not supported" message.
The error message is a bit vague. I suspect that I may be missing a dependency since I am able to connect to Hive from Java using JDBC.
I'm curious to know if anyone in the community has LibreOffice Base working with Hive. If so, what am I missing?
The Apache JDBC driver reports "Method not supported" for most metadata features, simply because the Apache committers did not bother to implement the long list of simple yes/no API calls. Duh.
If you want to see it for yourself, just download DBVisualizer Free, configure the Apache Hive driver, open a connection, and check the Database Info tab.
Now, DBVis is quite permissive with lame drivers, but it seems that LibreOffice is not.
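If you prefer to check without a GUI tool, here is a rough, self-contained probe (the driver URL and credentials are placeholders for your own HiveServer2) that calls a few DatabaseMetaData methods and prints which ones the driver answers and which ones it rejects:

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.SQLException;

public class HiveMetaDataProbe {
    interface Probe { Object call() throws SQLException; }

    static void check(String name, Probe p) {
        try {
            System.out.println(name + " -> " + p.call());
        } catch (SQLException e) {
            // calls the driver never implemented come back as "Method not supported"
            System.out.println(name + " -> failed: " + e.getMessage());
        }
    }

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // placeholder URL: point this at your own HiveServer2 instance
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")) {
            DatabaseMetaData md = conn.getMetaData();
            check("supportsTransactions", () -> md.supportsTransactions());
            check("supportsBatchUpdates", () -> md.supportsBatchUpdates());
            check("getMaxColumnNameLength", () -> md.getMaxColumnNameLength());
        }
    }
}

A front end like Base apparently makes some of these calls up front, which would explain why a single rejected call is enough to gray out a whole feature.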
You can try the Cloudera Hive JDBC driver as an alternative. You just have to "register" -- i.e. leave your e-mail address -- to access the download URL; it's simpler to deploy than the Apache driver (it's based on the Simba SDK, with all Hive-specific JARs bundled) and it works with just about any BI tool. So hopefully it works with LibreThing too.
Disclaimer: I wish the Apache distro had a proper JDBC driver that anyone could use instead of relying on "free" commercial software. But for now it's just a wish.
I'm trying to load data from an Oracle table into a Cassandra table using Pentaho Data Integration 5.1 (Community Edition), but I can't tell whether a connection has been established between Oracle and Cassandra. I'm using Cassandra 2.2.3 and Oracle 11gR2.
I've added the following jars to the lib folder of data-integration:
--cassandra-thrift-1.0.0
--apache-cassandra-cql-1.0.0
--libthrift-0.6.jar
--guava-r08.jar
--cassandra_driver.jar
Can anyone help me figure out how to check whether the connection has been established in Pentaho?
There are several ways to check whether a connection to a database has been established. I don't know if all of them are valid for Cassandra, but I'll add one that is specific to it.
1) The test button
By simply clicking the test button on the connection edit screen.
2) Logs with a high detail level may help
Another way to test is to run your transformation with a high-detail log level:
sh pan.sh -file=my_cassandra_transformation.ktr -level=Rowlevel
3) The input preview
For Cassandra specifically, I would try creating a simple read with the Cassandra Input step and clicking the 'Preview' button.
4) The controlled output test
Or you can try a simpler transformation first, to make sure it's running fine, e.g. a small controlled write whose output you can check afterwards.
Here is how I am running queries through Hive JDBC:
Class.forName(DRIVER);
Connection connection = DriverManager.getConnection(CONNECTION_URL, USERNAME, PASSWORD);
Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery(query);
I can see the application details in the YARN UI. But now I want to get the application ID for this job through Java code. Is it possible to do so? If yes, then how?
AFAIK the short answer is: not in older versions of Hive; possibly with recent versions, which let you retrieve some execution logs that may contain the YARN ID.
Starting with Hive 0.14 you can set up HiveServer2 to publish the execution logs for the current Statement; and in your client code you can use a Hive-specific API to fetch these logs (asynchronously just like Beeline client does, or just once when execution is over).
Quoting the Hive documentation:
Starting with Hive 0.14.0, HiveServer2 operation logs are available for Beeline clients. These parameters configure logging:
hive.server2.logging.operation.enabled
hive.server2.logging.operation.log.location
hive.server2.logging.operation.verbose (Hive 0.14 to 1.1)
hive.server2.logging.operation.level (Hive 1.2 onward)
Hive 2.0 adds the support of logging queryId and sessionId to HiveServer2 log file (...)
The source code for HiveStatement shows several non-JDBC methods such as getQueryLog and hasMoreLogs -- also getYarnATSGuid for Hive 2+ and other stuff for Hive 3+.
Here is the link to the "master" branch on GitHub; switch to whichever version you are using (possibly an old 1.2 for compatibility with Spark).
For a dummy demo of how to tap the "Logs" methods, have a look at that SO post with a snippet.
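As a rough sketch (written against the Hive 1.2-era API and not tested on every version), once hive.server2.logging.operation.enabled is on you can unwrap the JDBC statement to the Hive-specific class and scan the operation log for the YARN application ID; the substring heuristic below is only illustrative:

import java.sql.Statement;
import org.apache.hive.jdbc.HiveStatement;

// reusing "connection" and "query" from the snippet in the question
Statement stmt = connection.createStatement();
stmt.execute(query);  // run the query, then read the operation log once it has finished
if (stmt instanceof HiveStatement) {
    HiveStatement hiveStmt = (HiveStatement) stmt;
    // getQueryLog() is one of the non-JDBC methods exposed by HiveStatement
    for (String line : hiveStmt.getQueryLog()) {
        // the YARN application ID normally shows up in the submission lines of the log,
        // so a crude substring scan is enough for a demo
        if (line.contains("application_")) {
            System.out.println(line);
        }
    }
}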
I've installed the ODBC driver from http://hortonworks.com/hdp/addons/ and configured it to use my HiveServer2 on an HDP installation.
I'm using MicroStrategy Analytics Desktop to run queries. It works fine until I try to use server-side properties.
I've configured ODBC Data Source Administrator / System DSN / Hortonworks Hive ODBC Driver Setup / Advanced Options / Server Side Properties as follows:
SSP_mapred.job.queue.name = pr
SSP_tez.queue.name = pr
But in the Applications view on HDP I can see that MSTR is using the 'default' queue instead of 'pr'.
What am I doing wrong? In the installation guide for the Hortonworks driver (as well as for the Simba driver) the property is called SSP_mapred.queue.names=myQueue, but that doesn't work either.
Is there any place I can see the log of this connection and check whether the properties are sent to the server at all?
Regards,
Pawel
I have an Amazon Redshift database that supports connecting with a PostgreSQL client over JDBC.
Google Apps Script supports connecting to a database with JDBC, but only with the MySQL, MS SQL, and Oracle protocols, not PostgreSQL. If I try, not surprisingly, I get this error:
'Connection URL uses an unsupported JDBC protocol.'
Looking at some Google forums, this has been an issue for several years, with no response from Google.
Is there any workaround?
thanks
I use Kloudio, a Google Sheets extension, to get this done. I can run and schedule my Redshift queries in Kloudio.
If you are using Amazon Redshift, you can connect to it through an Amazon Redshift client.
Here are the steps:
Write the SQL query in the Redshift client and save it as a report.
Use the generated API key and the report number in the embedded link.
Use the IMPORTDATA function in the Google spreadsheet to import the data automatically (see the example below); it will refresh by default every hour.
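For step 3, the formula looks roughly like this (the URL shape is only a guess; paste whatever embed link the client generates for your report, with the report number and API key filled in):

=IMPORTDATA("https://<client-host>/report/<report_no>?api_key=<your_api_key>")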
Thanks