Hive JDBC vs. CLI client

I need to access Hive data programmatically (on the order of GBs per query) and am evaluating the CLI driver vs. the Hive JDBC driver.
With JDBC there is the extra overhead of the Thrift server, and I am trying to understand how heavy that is. Can it also become a single-point bottleneck when multiple clients connect to one Thrift server? Or is it common practice to configure multiple Thrift servers on a Hadoop cluster and do some load balancing?
I am looking for better performance rather than faster prototyping.
Thanks in advance.

Shengjie's link doesn't work as posted; reposting it here so it linkifies properly:
http://blog.milford.io/2011/07/productionizing-the-hive-thrift-server/

From a performance point of view, yes, the Thrift server can potentially become the bottleneck and a single point of failure. I've seen people set up multiple Thrift servers talking to a MySQL metastore. Take a look at http://blog.milford.io/2011/07/productionizing-the-hive-thrift-server/. Hope it helps.
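Since a single Thrift endpoint can become a choke point, one common client-side mitigation is to spread connections across several Thrift servers that share one metastore. A minimal round-robin sketch (the hostnames and ports here are hypothetical placeholders, not real servers):

```python
import itertools

# Hypothetical HiveServer/Thrift endpoints; in practice these would be real
# hosts, each configured against the same shared metastore.
ENDPOINTS = [
    ("hive-thrift-1", 10000),
    ("hive-thrift-2", 10000),
    ("hive-thrift-3", 10000),
]

class RoundRobinPicker:
    """Cycle through Thrift endpoints so clients spread their load."""
    def __init__(self, endpoints):
        self._cycle = itertools.cycle(endpoints)

    def next_endpoint(self):
        # Each call hands back the next server in rotation.
        return next(self._cycle)

picker = RoundRobinPicker(ENDPOINTS)
print(picker.next_endpoint())  # ('hive-thrift-1', 10000)
print(picker.next_endpoint())  # ('hive-thrift-2', 10000)
```

In production you would more likely put a TCP load balancer (HAProxy or similar) in front of the servers, but the client-side version illustrates the idea.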

You can try using connection pooling. I had a similar issue where submitting a Hive query through JDBC was taking more time than the Hive CLI.
Also, mention a few parameters in your connection string, as below (note that Hive configuration properties belong after the ? separator in a HiveServer2 URL):
jdbc:hive2://servername:portno/default?hive.execution.engine=tez;tez.queue.name=alt;hive.exec.parallel=true;hive.vectorized.execution.enabled=true;hive.vectorized.execution.reduce.enabled=true
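For reference, a HiveServer2 JDBC URL takes the form jdbc:hive2://host:port/db, with Hive configuration properties appended after a ? separator. A small helper that assembles such a URL (the host, port, and property values are placeholders, not a real deployment):

```python
def hive2_url(host, port, db="default", hive_confs=None):
    """Build a HiveServer2 JDBC URL of the form
    jdbc:hive2://<host>:<port>/<db>?<hive_conf_list>,
    where the conf list is semicolon-separated key=value pairs."""
    url = "jdbc:hive2://%s:%d/%s" % (host, port, db)
    if hive_confs:
        url += "?" + ";".join("%s=%s" % (k, v) for k, v in hive_confs)
    return url

print(hive2_url("servername", 10000, hive_confs=[
    ("hive.execution.engine", "tez"),
    ("hive.exec.parallel", "true"),
]))
# jdbc:hive2://servername:10000/default?hive.execution.engine=tez;hive.exec.parallel=true
```

Building the URL in one place keeps tuning parameters like the execution engine out of scattered string literals.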

Related

Impala streaming over JDBC is really slow

I have run several large queries using impala-shell and found the performance to be satisfactory. These queries typically write 100k-1m rows to disk. However, when I run the very same queries programmatically using JDBC, the results take much, much longer to write to disk. For example, a query which takes five minutes from impala-shell takes up to thirty minutes over JDBC.
I have tried both the Hive and Cloudera JDBC drivers but get similarly bad performance. I have tried various fetch sizes, but it made no difference. Is Impala streaming over JDBC fundamentally slow, or could I do something else to speed up the streaming?
This is on CDH 5.9.1.
This turned out to be a client-side issue. I was using curl to test a web application that was making the Impala queries. Switching from curl to a client written in Scala removed the latency.
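Even though the root cause here was client-side, fetch size is worth understanding when streaming large result sets: DB-API and JDBC clients pull rows in batches, and the batch size controls how many round trips the stream costs. A sketch of the batching loop, using a stand-in cursor rather than a real Impala connection:

```python
class FakeCursor:
    """Stand-in for a DB-API cursor, to illustrate batched fetching."""
    def __init__(self, rows, arraysize=1000):
        self._rows = list(rows)
        self.arraysize = arraysize  # DB-API's analogue of the JDBC fetch size

    def fetchmany(self, size=None):
        # Return up to `size` rows per call, like a driver's batched fetch.
        size = size or self.arraysize
        batch, self._rows = self._rows[:size], self._rows[size:]
        return batch

def stream_rows(cursor):
    """Consume a cursor batch by batch; a larger batch size means
    fewer round trips for the same number of rows."""
    while True:
        batch = cursor.fetchmany()
        if not batch:
            return
        for row in batch:
            yield row

cur = FakeCursor(range(2500), arraysize=1000)
print(sum(1 for _ in stream_rows(cur)))  # 2500
```

With a real driver, only the connection setup changes; the consumption loop stays the same.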

Difference Between HWI and "HiveServer" in Hive

I am going through Apache Hive these days and the following thing is confusing me quite a bit -
There is a Hive Web Interface (hive --service hwi) that listens on a port (default 9999), lets a client submit a query and come back later for the results, supports authorization, etc.
There is also a HiveServer (hive --service HiveServer) that runs a server, allows remote clients to connect and submit Hive queries, and is likewise protected by authorization, etc.
How are they different (or are they not)? If they are different but offer the same kind of features, what is the difference?
There is also a HiveServer2 and a Thrift server; I am not sure, but I think these are an improvement over HiveServer?
Can someone clarify what is unique to each of them and the bigger problem each one solves?
HWI
Hive's HWI (Hive Web Interface) is an alternative to the Hive command line interface. It provides features such as:
Schema browsing
Detached query execution
Session management
No local installation
HiveServer
HiveServer, on the other hand, allows remote clients to submit requests to Hive using Thrift's various programming-language bindings. Because HiveServer uses Thrift, it is sometimes called the Thrift server.
HiveServer v1 cannot handle concurrent requests from more than one client. This limitation is addressed in HiveServer2, which allows multiple concurrent client connections. HiveServer2 also provides:
authentication using Kerberos & LDAP
SSL encryption
PAM
HiveServer2 provides various client interfaces like:
Beeline command line shell
JDBC
Python & Ruby clients
The HiveServer2 JDBC driver can be used to connect BI tools like Tableau, Talend, etc., to perform ETL.
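Because HiveServer2's Python clients follow the standard DB-API, code written against a generic connection ports across backends. The sketch below exercises a query helper against sqlite3 purely as a stand-in; with Hive you would obtain the connection from a HiveServer2 client library (for example PyHive) instead:

```python
import sqlite3

def run_query(conn, sql, params=()):
    """Run a query over any DB-API connection and return all rows."""
    cur = conn.cursor()
    cur.execute(sql, params)
    rows = cur.fetchall()
    cur.close()
    return rows

# sqlite3 stands in for a HiveServer2 connection in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
print(run_query(conn, "SELECT SUM(x) FROM t"))  # [(6,)]
```

This transparency is exactly what lets BI tools swap one JDBC/ODBC/DB-API backend for another without changing the querying code.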

Using Hive metastore for client application performance

I am new to Hadoop. Please help me with the concept below.
It is considered good practice in production to use an external Hive metastore (in another DB such as MySQL).
What is the exact role of, and need for, storing metadata in an RDBMS?
If we create a client application to show Hive data in a UI, will this metadata store help improve the performance of fetching the data?
If yes, what would the architecture of this kind of client application be? Would it hit the RDBMS metastore first? How would that differ from querying Hive directly some other way, such as through Thrift?
Hadoop experts, please help.
Thanks
You can use PrestoDB, which allows you to run/translate SQL queries against Hive. It has a MySQL connector that you can use to exploit your stored Hive schema.
Thus, from your client application, you just need a JDBC driver, as with any RDBMS out there.
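For context on the external metastore itself: pointing it at MySQL is done in hive-site.xml via the javax.jdo connection properties. A typical fragment looks roughly like this (the host, database name, and credentials are placeholders):

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>secret</value>
</property>
```

The metastore only holds table definitions, partitions, and locations; the data itself stays on HDFS, which is why an RDBMS-backed metastore speeds up planning and schema lookups rather than the data reads themselves.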

Connect to Spark SQL via ODBC

According to this page: https://spark.apache.org/sql/ you can connect existing BI tools to Spark SQL via ODBC or JDBC:
I don't mean Shark as this is basically EOL:
It is for this reason that we are ending development in Shark as a separate project and moving all our development resources to Spark SQL, a new component in Spark.
How would a BI tool (like Tableau) connect to Spark SQL via ODBC?
With the release of Spark SQL 1.1 you also have a Thrift JDBC server; see https://spark.apache.org/docs/latest/sql-programming-guide.html#distributed-sql-engine
Simba provides the ODBC driver that Databricks uses; however, that is only for the Databricks distribution. We are launching the public version for use with Apache Spark tomorrow (Wed, Dec 3rd) at www.simba.com. You'll be able to download and trial the driver for use with Tableau then.
As Carlos said, Stratio Meta is a module that acts as a parser, validator, planner, and coordinator layer over different persistence layers (currently only Cassandra and Mongo, but HDFS in the short term as well). This module offers a shell with a SQL-like language, a Java/Scala API, a REST API, and ODBC (JDBC shortly). It also uses another Stratio module, Stratio Deep, which allows us to use Apache Spark in order to execute queries in an efficient and fast way.
Disclaimer: I am currently employed by Stratio Big Data
Please take a look at: http://www.openstratio.org/blog/connecting-to-the-stratio-big-data-platform-using-odbc-2/
Stratio is a platform that includes a certified Spark distribution that allows you to connect Spark to any type of data repository (like Cassandra, MongoDB, ...). It has an ODBC driver, so you can write SQL queries that will be translated into Spark jobs, or, even faster, direct queries to Cassandra (or whichever database you connect to it) when possible. This way, it is pretty simple to connect Tableau to Spark and your data repository. If you need any help, we will be more than glad to assist you.
Disclaimer: I'm one of Stratio's ODBC developers
Simba will offer one: http://databricks.com/blog/2014/04/30/Databricks-selects-Simba-ODBC-driver-for-shark.html. No known official release date.
[update]
Use Hive's ODBC driver to connect to Spark SQL as described here and here.
For Spark on Azure HDInsight, you can connect Tableau (or PowerBI) as described here https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-use-bi-tools/. The ODBC driver is here: http://www.microsoft.com/en-us/download/details.aspx?id=47713

Spark in Business Intelligence

I am currently doing a project in the Business Intelligence and Big Data area, two areas in which, in all honesty, I am new and very green.
I was planning to build a Hive data warehouse using MongoDB and connect it to a Business Intelligence platform like Pentaho. While researching, I came across Spark and got interested in its Shark module due to its in-memory functionality and the performance increase when running queries.
I know that I can connect Hive to Pentaho, but what I was wondering is whether I could use Shark queries between them for performance. If not, does anyone know of another BI platform that would allow that?
As I said, I am pretty new to these areas, so feel free to correct me, since there is a good chance I have some concepts mixed up and have said something idiotic.
I think you should build a Hive data warehouse using Hive, or a MongoDB data warehouse using MongoDB. I didn't understand how you are going to mix them, but I will try to answer the question anyway.
Usually, you configure in the BI tool a JDBC driver for the DB of your choice (e.g. Hive), and the BI tool fetches the data using that JDBC driver. How the driver fetches the data from the DB is completely transparent to the BI tool.
Thus, you can use Hive, Shark or any other DB which comes with a JDBC driver.
I can summarize your options this way:
Hive: the most complete feature set and the most compatible tool. It can be used over plain data, or you can ETL the data into its ORC format, boosting performance.
Impala: claims to be faster than Hive but has a less complete feature set. It can be used over plain data, or you can ETL the data into its Parquet format, boosting performance.
Shark: cutting edge, not mainstream yet. Performance depends on what percentage of your data can fit into RAM across your cluster.
First of all, Shark is being absorbed into Spark SQL.
Spark SQL provides a JDBC/ODBC connector. That should allow you to integrate it with most of your existing platforms.