Here is how I am running queries through Hive JDBC:
Class.forName(DRIVER);
Connection connection = DriverManager.getConnection(CONNECTION_URL, USERNAME, PASSWORD);
Statement statement = connection.createStatement();
ResultSet resultSet = statement.executeQuery(query);
I can see the application details in the YARN UI, but now I want to get the application ID for this job through Java code. Is it possible to do so? If yes, then how?
AFAIK the short answer is: not in older versions of Hive; possibly with recent versions, which let you retrieve some execution logs that may contain the YARN application ID.
Starting with Hive 0.14 you can set up HiveServer2 to publish the execution logs for the current Statement; and in your client code you can use a Hive-specific API to fetch these logs (asynchronously just like Beeline client does, or just once when execution is over).
Quoting the Hive documentation:
Starting with Hive 0.14.0, HiveServer2 operation logs are available
for Beeline clients. These parameters configure logging:
hive.server2.logging.operation.enabled
hive.server2.logging.operation.log.location
hive.server2.logging.operation.verbose (Hive 0.14 to 1.1)
hive.server2.logging.operation.level (Hive 1.2 onward)
Hive 2.0 adds the support of logging queryId and sessionId to HiveServer2 log file (...)
The source code for HiveStatement shows several non-JDBC methods such as getQueryLog and hasMoreLogs -- also getYarnATSGuid for Hive 2+ and other stuff for Hive 3+.
Here is the link to the "master" branch on GitHub, switch to whichever version you are using (possibly an old 1.2 for compatibility with Spark).
For a dummy demo about how to tap the "Logs" methods, have a look at that SO post with a snippet.
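In case that link goes stale, here is a minimal sketch of the idea (untested; it assumes a Hive 0.14+ JDBC driver on the classpath and operation logging enabled on the server as described above):
import org.apache.hive.jdbc.HiveStatement;
// "connection" and "query" are the same as in the question's snippet
Statement stmt = connection.createStatement();
ResultSet rs = stmt.executeQuery(query);
if (stmt instanceof HiveStatement) {
    HiveStatement hiveStmt = (HiveStatement) stmt;
    // fetch the accumulated operation log once execution is over; if the
    // query ran on YARN, a line should mention an ID of the form
    // application_<timestamp>_<sequence>
    for (String line : hiveStmt.getQueryLog()) {
        System.out.println(line);
    }
    // on Hive 2+, hiveStmt.getYarnATSGuid() is another handle on the execution
}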
I had been using the Databricks JDBC driver version 2.6.22 and tried to upgrade to 2.6.27. However, after upgrading I get messages saying my JDBC URLs are invalid when trying to connect. These JDBC URLs work fine with the old version of the driver and I pull them directly from the Databricks SQL endpoint info, so I expect something else is going on.
Example JDBC URL:
jdbc:spark://[workspace domain]:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/endpoints/[identifier]
I noticed that between versions the package name went from SimbaSparkJDBC42-2.6.22.1040 to DatabricksJDBC42-2.6.27.1048 and the driver class name went from com.simba.spark.jdbc.Driver to com.databricks.client.jdbc.Driver. Does dropping Simba mean there was a more major change? Do I need to correct my JDBC URLs somehow?
I'm downloading my driver from here
I'm using DBeaver as my SQL client if that makes a difference.
JDBC URLs for the new Databricks driver start with jdbc:databricks: instead of jdbc:spark:. As of now, the JDBC URL details in the UI still use the old format; just replace spark with databricks and they should work. Mentioned here.
Databricks has a different URL format; check the documentation here.
Basically, in the URL replace spark with databricks and add the PWD parameter:
jdbc:databricks://[workspace domain]:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/endpoints/[identifier];PWD=[your_access_token]
PWD is the personal access token. Instructions to get an access token.
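If you are connecting from your own code rather than DBeaver, the same substitution applies; a minimal sketch with the renamed driver class (the bracketed placeholders must be filled in as above):
import java.sql.Connection;
import java.sql.DriverManager;

Class.forName("com.databricks.client.jdbc.Driver");
String url = "jdbc:databricks://[workspace domain]:443/default;"
        + "transportMode=http;ssl=1;AuthMech=3;"
        + "httpPath=/sql/1.0/endpoints/[identifier];"
        + "PWD=[your_access_token]";
Connection conn = DriverManager.getConnection(url);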
I'm running Hive 2.1.1 on EMR 5.5.0 with a remote MySQL metastore DB. I need to enable transactions on Hive, but when I follow the configuration here and run any query, I get the following error:
FAILED: Error in acquiring locks: Error communicating with the metastore
Settings on the metastore:
hive.compactor.worker.threads = 0
hive.compactor.initiator.on = true
Settings in the hive client:
SET hive.support.concurrency=true;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
This only happens when I set hive.txn.manager, so my Hive metastore is definitely online.
I've tried some of the old suggestions of turning Hive test features on, which didn't work, but I don't think this is a test feature anymore. I can't turn off concurrency, as a similar post on SO suggests, because I need concurrency. It seems the problem is that either DbTxnManager isn't getting the remote metastore connection info properly from the Hive config, or the MySQL DB is missing some tables required by DbTxnManager. I have datanucleus.autoCreateTables=true.
It looks like Hive wasn't properly creating the tables needed for the transaction manager. I'm not sure where it was getting its schema, but it was definitely wrong.
So we just ran the hive-txn-schema script to set up the schema manually. We'll do this at the start of any of our clusters from now on.
https://github.com/apache/hive/blob/master/metastore/scripts/upgrade/mysql/hive-txn-schema-2.1.0.mysql.sql
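If you want to script that step, here is a rough sketch of applying the schema file over plain JDBC (the host, database name, and credentials are placeholders; it assumes MySQL Connector/J on the classpath and that a naive semicolon split is good enough for this particular script):
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ApplyTxnSchema {
    public static void main(String[] args) throws Exception {
        String script = new String(Files.readAllBytes(
                Paths.get("hive-txn-schema-2.1.0.mysql.sql")));
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://metastore-host:3306/hive", "hiveuser", "hivepass");
             Statement st = conn.createStatement()) {
            // naive split on ';' -- fine for this script, not for SQL in general
            for (String sql : script.split(";")) {
                if (!sql.trim().isEmpty()) {
                    st.execute(sql);
                }
            }
        }
    }
}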
The error
FAILED: Error in acquiring locks: Error communicating with the metastore
sometimes shows up because the tables involved haven't been initialized for transactions; you need to create your tables accordingly, for example:
create table t1(id int, name string)
clustered by (id) into 8 buckets
stored as orc TBLPROPERTIES ('transactional'='true');
I have a cluster running CDH 5.7.0 and configured the following setup:
Hadoop with Kerberos
Hive with LDAP authentication
Hive with Sentry authorization (rules stored in JDBC Derby)
My goal is to restrict which databases users can see in my system.
E.g.:
User-A should only see database DB-A when executing show databases
User-B should only see database DB-B when executing show databases
I followed the article https://blog.cloudera.com/blog/2013/12/how-to-get-started-with-sentry-in-hive/ to make that happen, but without success.
What I achieved was that
User-A can only select tables from DB-A and not from DB-B.
User-B can only select tables from DB-B and not from DB-A.
But both can still see DB-A and DB-B when executing show databases, which is what I want to avoid.
Any hints on what the rules or the setup should look like to get that working?
Thanks
Marko
According to your description, and from what I've learned from existing setups, with Sentry v1.6+ you need to add the following property to your hive-site.xml:
<property>
  <name>hive.metastore.filter.hook</name>
  <value>org.apache.sentry.binding.metastore.SentryMetaStoreFilterHook</value>
</property>
Even though you are on CDH 5.7, the MapR 5 documentation provides some context, as does Sentry Service Interactions.
After restarting the Hive service you should see the result you are expecting.
I'm trying to get LibreOffice Base v5.1.4.2, running on Ubuntu v16.04, to connect to a Hive v1.2.1 database via JDBC. I added the following jars, downloaded from Maven Central, to LibreOffice's class path ('Tools -> LibreOffice -> Advanced -> Class Path'):
hive-common-1.2.1.jar
hive-jdbc-1.2.1.jar
hive-metastore-1.2.1.jar
hive-service-1.2.1.jar
hadoop-common-2.6.2.jar
httpclient-4.4.jar
httpcore-4.4.jar
libthrift-0.9.2.jar
commons-logging-1.1.3.jar
slf4j-api-1.7.5.jar
I then restarted LibreOffice, opened Base, selected 'Connect to an existing database' -> 'JDBC', and set the connection properties (the driver class and the JDBC URL).
I entered the credentials and clicked the 'Test Connection' button, which returned a "the connection was established successfully" message. Great!
In the LibreOffice Base UI, the options under the 'Tables' panel were grayed out. The options in the Queries tab were not, so I tried querying Hive that way.
The 'Use Wizard to Create Query' option prompts for a password and then returns "The field names from 'airline.on_time_performance' could not be retrieved."
The JDBC connection is able to connect to Hive and list the tables, though it seems to have problems retrieving the columns. When I try to execute a simple select statement, the 'Create Query in SQL View' option returns a somewhat cryptic "Method not supported" message.
The error message is a bit vague. I suspect that I may be missing a dependency since I am able to connect to Hive from Java using JDBC.
I'm curious to know if anyone in the community has LibreOffice Base working with Hive. If so, what am I missing?
The Apache JDBC driver reports "Method not supported" for most features, just because the Apache committers did not bother to handle the list of simple yes/no API calls. Duh.
If you want to see by yourself, just download DBVisualizer Free, configure the Apache Hive driver, open a connection, and check the Database Info tab.
Now, DBVis is quite permissive with lame drivers, but it seems that LibreOffice is not.
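To see the failure mode from code, here is a hypothetical probe using standard java.sql.DatabaseMetaData calls (not necessarily the exact ones LibreOffice makes):
DatabaseMetaData md = connection.getMetaData();
try {
    // many of the yes/no capability methods in the Apache driver throw
    // instead of returning a boolean
    System.out.println("supportsUnion: " + md.supportsUnion());
} catch (SQLException e) {
    System.out.println("supportsUnion: " + e.getMessage()); // e.g. "Method not supported"
}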
You can try the Cloudera Hive JDBC driver as an alternative. You just have to "register" -- i.e. leave your e-mail address -- to access the download URL; it's simpler to deploy than the Apache thing (based on the Simba SDK, all Hive-specific JARs are bundled) and it works with about any BI tool. So hopefully it works with LibreThing too.
Disclaimer: I wish the Apache distro had a proper JDBC driver that anyone could use, instead of relying on "free" commercial software. But for now it's just a wish.
I have an application that performs SQL queries on a Spark DataFrame like this:
DataFrame sqlDataFrame = sqlContext.createDataFrame(accessLogs, ApacheAccessLog.class);
sqlDataFrame.registerTempTable("logs");
sqlContext.cacheTable("logs");
Row contentSizeStats = sqlContext.sql("SELECT SUM(contentSize), COUNT(*), MIN(contentSize), MAX(contentSize) FROM logs")
.javaRDD()
.collect()
.get(0);
I can submit this application to Spark using spark-submit and it works totally fine.
But now what I want is to develop a web application (using Spring or another framework) where users write a SQL script in the front end and click a Query button; the web server then sends the SQL script to Apache Spark to run the query, just as spark-submit did above. After the SQL executes, I would like Spark to send the result back to the web server.
The official documentation mentions that we can use the Thrift JDBC/ODBC server, but it only describes how to connect to the Thrift server; there is no information about how to actually run a query through it. Did I miss anything? Is there an example I can take a look at?
Thanks in advance!
Yes, the Thrift JDBC/ODBC server is the better option. You can use the HiveServer2 service.
Here is the code:
HiveContext hiveContext = SparkConnection.getHiveContext();
// expose this context over HiveServer2's Thrift protocol
hiveContext.setConf("hive.server2.thrift.port", "10002");
hiveContext.setConf("hive.server2.thrift.bind.host", "192.168.1.25");
HiveThriftServer2.startWithContext(hiveContext);
It will open a JDBC port, and you can connect to it via the Hive JDBC driver.
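From the web server side the query then goes over a plain JDBC connection; a minimal sketch (host and port taken from the snippet above, table name from the question, credentials are placeholders):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

Class.forName("org.apache.hive.jdbc.HiveDriver");
try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://192.168.1.25:10002/default", "user", "");
     Statement st = conn.createStatement();
     ResultSet rs = st.executeQuery(
             "SELECT SUM(contentSize), COUNT(*) FROM logs")) {
    while (rs.next()) {
        System.out.println(rs.getLong(1) + ", " + rs.getLong(2));
    }
}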