Difference Between HWI and "HiveServer" in Hive - hadoop

I have been going through Apache Hive lately, and the following is confusing me quite a bit:
There is a Hive Web Interface (hive --service hwi) that listens on a port (default 9999) and lets a client submit a query and come back later for the results; it also supports authorization, etc.
There is also a HiveServer (hive --service hiveserver) that runs a server, allows remote clients to connect and submit Hive queries, and is also protected by authorization, etc.
How are they different (or are they not)? If they are different but offer the same kind of features, what sets them apart?
There is also a HiveServer2 and a Thrift server, which I am not sure about but think is an improvement over HiveServer?
Can someone explain them and clarify what is unique about each and what bigger problem each solves?
Regards,
(*Vipul)() ;

HWI
Hive's HWI (Hive Web Interface) is an alternative to the Hive command line interface. It provides features such as:
Schema browsing
Detached query execution
Session management
No local installation required
HiveServer
HiveServer, on the other hand, allows remote clients to submit requests to Hive using Thrift's various programming language bindings. Because HiveServer is built on Thrift, it is sometimes called the Thrift server.
HiveServer v1 cannot handle concurrent requests from more than one client. This limitation is addressed in HiveServer v2 (HiveServer2), which allows multiple concurrent client connections. HiveServer2 also provides:
authentication using Kerberos & LDAP
SSL encryption
PAM
HiveServer2 provides various client interfaces like:
Beeline command line shell
JDBC
Python & Ruby clients
The HiveServer2 JDBC driver lets BI and ETL tools such as Tableau and Talend connect to Hive.

Related

Remote Kerberos Authentication for Hive via Intellij

Suppose I had a configuration that included 1) a local Windows client, 2) a remote Unix server without windowing capability, and 3) a separate remote Hadoop cluster that houses data to be queried with (among other things) Hive. I am seeking a way to establish the Hive Metastore as a data source in a JetBrains IDE installed on the Windows client (specifically IntelliJ).
The wrinkle in this configuration is Kerberos, which is installed on the remote Unix server, but not on the local Windows machine. Typically, the Hive Metastore is accessed from the Unix server. It should be assumed that installing Kerberos on the Windows client is not a feasible scenario, and it isn't clear to me how IntelliJ could feasibly be used on a windowless Unix environment in any scenario. However, I really want the features it provides to be available.
Is it actually possible to get IntelliJ to somehow leverage the ability to initialize a Kerberos ticket on the Unix server to connect to Hive?
Is it possible to get IntelliJ to prompt for my Kerberos credentials when it initializes a connection to the Hive Metastore?
This seems less than likely, but any ideas would be greatly appreciated.

Hive Server2,Beeline not able to understand

Q1: What is HiveServer2 in Hive?
Q2: What is the use of JDBC or ODBC with HiveServer2? For what purpose is HiveServer2 used with JDBC or ODBC?
Q3: If I want to connect to HiveServer2 over JDBC or ODBC, how can I do it? Can I connect on my Cloudera installation, which is a single node? Please guide me on how to connect.
Q4: How do I connect with Beeline in Cloudera? Are the Beeline commands the same, or is there any difference? How do I connect Beeline with JDBC and ODBC?
Please help me with these questions. I searched the internet but was unable to understand it. Thanks in advance.
Please find answers below:
A1. HiveServer2 is simply version 2 of the Hive server. The enhanced server is designed for multi-client concurrency and improved authentication, and it encourages clients to connect through JDBC and ODBC rather than through the Thrift protocol directly.
A2. JDBC/ODBC is the standard, recommended way to interact with SQL engines from programming languages. Apart from the command line (i.e. Beeline), clients can interact with Hive programmatically or through external applications such as Tableau or Qlik, which need the corresponding JDBC/ODBC drivers. The process should be the same whether it is a single node or a distributed cluster.
A3. Please refer to the Cloudera documentation on how to set up and execute Hive commands using JDBC/ODBC. Check the link below:
http://www.cloudera.com/documentation/other/connectors/hive-jdbc/latest/Cloudera-JDBC-Driver-for-Apache-Hive-Install-Guide.pdf
A4. Check the link for complete example - http://hadooptutorial.info/hiveserver2-beeline-introduction/
Hope that helps!!
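As a rough illustration of A3 and A4, here is a minimal Python sketch that assembles a Beeline command line for a single-node setup. It only builds and prints the command; it does not run it. The host localhost, the database default, and the user/password pair cloudera/cloudera are assumptions for illustration, not values from the answers above. Port 10000 is the usual HiveServer2 default, and -u, -n and -p are Beeline's options for the JDBC URL, user name and password.

```python
import shlex


def beeline_command(host="localhost", port=10000, database="default",
                    user=None, password=None):
    """Build the argument list for an interactive Beeline session.

    Beeline talks to HiveServer2 through a JDBC URL of the form
    jdbc:hive2://host:port/database.
    """
    url = f"jdbc:hive2://{host}:{port}/{database}"
    cmd = ["beeline", "-u", url]
    if user is not None:
        cmd += ["-n", user]        # -n: user name
    if password is not None:
        cmd += ["-p", password]    # -p: password
    return cmd


# Hypothetical single-node setup: HiveServer2 runs on this machine and
# accepts the (assumed) credentials cloudera/cloudera.
cmd = beeline_command(user="cloudera", password="cloudera")
print(shlex.join(cmd))
```

The printed line can be pasted into a terminal on a machine where Beeline is installed. The same JDBC URL is what a JDBC-based tool would be given, so nothing here is specific to a single node beyond the host name.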

Query Hive remotely using shell

Let's imagine I have access to a Hive data warehouse that I can query through a web service. The problem is that I cannot automate queries through this service, so I would like to be able to query Hive from an external script (which I would be able to automate).
So far, I've only seen people running Hive on their local machine and querying it there. Is it possible to do this remotely? If yes, how?
Thanks a lot !
As far as I understand, you are asking whether there is a way to connect to Hive from a remote machine.
You could install the Hive client (Beeline) on any remote machine and connect to Hive via JDBC.
Take a look here:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients
An easy way to do this is to deploy the Hadoop/YARN client configuration on the remote machine. If the remote cluster is secured with firewalls and Kerberos, you will need access through those first. After that, it's just a matter of starting up a Hive shell or submitting a job to YARN.
If you use Cloudera, you might be able to add the host to the cluster and install a "gateway" role for YARN and Hive on the target machine. This is very straightforward and requires just a few minutes of work.
Alternatively using the JDBC connector should also work, as stated in Facha's answer.
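For the automation part of the question, a script can call Beeline non-interactively and parse what it prints. The sketch below is one way to do that in Python, assuming Beeline is installed on the machine running the script and can reach HiveServer2. The function names, the JDBC URL and the sample output are made up for illustration. The options -e (run one query), --silent=true and --outputformat=csv2 are standard Beeline options.

```python
import csv
import io
import subprocess


def build_query_command(jdbc_url, query, user=None):
    """Argument list that makes Beeline run one query and print CSV."""
    cmd = ["beeline", "-u", jdbc_url,
           "--silent=true",          # suppress log chatter around the result
           "--outputformat=csv2",    # header row, then comma-separated rows
           "-e", query]
    if user is not None:
        cmd += ["-n", user]
    return cmd


def parse_csv2(text):
    """Turn csv2 output (header row first) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def run_query(jdbc_url, query, user=None):
    """Run the query through Beeline; raises CalledProcessError on failure."""
    result = subprocess.run(build_query_command(jdbc_url, query, user),
                            capture_output=True, text=True, check=True)
    return parse_csv2(result.stdout)


# Parsing demonstrated on made-up output, so the sketch runs without a cluster.
sample_output = "id,name\n1,alice\n2,bob\n"
rows = parse_csv2(sample_output)
print(rows)
```

With a reachable server, a call such as run_query("jdbc:hive2://servername:10000/default", "SELECT 1") would return the rows as dictionaries, and the script can then be scheduled with cron or any other job runner. On a Kerberos-secured cluster, a valid ticket and a principal in the JDBC URL would be needed as well, which this sketch does not cover.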

Cloudera beeswax server and hive server

I have a fundamental question regarding the two servers mentioned in the context of the Cloudera CDH4 distribution.
Are the two interchangeable, i.e. could you run Beeswax in place of Hive Server?
I'm trying to use a Thrift client to connect, and in my setup only Beeswax is running, not Hive Server. In that case, can I connect to the Beeswax server?
Hive Server is the default process and Beeswax is a newer process designed to better support concurrency and provide authentication using Kerberos. You should run one or the other.
And yes, you should definitely be able to connect to beeswax using Thrift. You can find clients for Beeswax and Hive server here.
What is the difference between HiveServer2 and Beeswax? They are both designed to better support concurrency and security.

Hive JDBC Vs CLI client

I need to access data using Hive programmatically (data on the order of GBs per query). I am evaluating the CLI driver vs. the Hive JDBC driver.
With JDBC there is the extra overhead of the Thrift server, and I am trying to understand how heavy that is. Also, can it become a single-point bottleneck if multiple clients connect to a single Thrift server? Or is it common practice to configure multiple Thrift servers on Hadoop and load-balance across them?
I am looking for better performance rather than faster prototyping.
Thanks in advance.
Shengjie's link doesn't work. This one might linkify properly:
http://blog.milford.io/2011/07/productionizing-the-hive-thrift-server/
From a performance point of view, yes, the Thrift server can potentially be the bottleneck and a single point of failure. I've seen people set up multiple Thrift servers talking to a MySQL metastore. Take a look at http://blog.milford.io/2011/07/productionizing-the-hive-thrift-server/. Hope it helps.
You can try using connection pooling. I had a similar issue where submitting a Hive query through JDBC took more time than through the Hive CLI.
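Also, pass a few Hive settings in your connection string. In the HiveServer2 JDBC URL format, session variables follow a ";" and Hive configuration properties follow a "?", so the settings go after the "?":
jdbc:hive2://servername:portno/?hive.execution.engine=tez;tez.queue.name=alt;hive.exec.parallel=true;hive.vectorized.execution.enabled=true;hive.vectorized.execution.reduce.enabled=true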
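A connection string like this can also be assembled in code rather than typed by hand. The sketch below assumes the URL layout described in the HiveServer2 clients documentation, jdbc:hive2://host:port/dbName;sess_var_list?hive_conf_list#hive_var_list, in which Hive configuration properties follow the "?" and are separated by ";". The helper name hive2_jdbc_url is hypothetical, and port 10000 stands in for the "portno" placeholder as an assumption.

```python
def hive2_jdbc_url(host, port, database="", session_vars=None, hive_conf=None):
    """Assemble a HiveServer2 JDBC URL.

    session_vars: connection-level settings (for example ssl or principal),
                  appended as ;key=value after the database name.
    hive_conf:    Hive configuration properties, appended after '?' and
                  separated by ';'.
    """
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if session_vars:
        url += "".join(f";{key}={value}" for key, value in session_vars.items())
    if hive_conf:
        url += "?" + ";".join(f"{key}={value}" for key, value in hive_conf.items())
    return url


# The settings suggested in the answer; host and port are placeholders.
url = hive2_jdbc_url("servername", 10000, hive_conf={
    "hive.execution.engine": "tez",
    "tez.queue.name": "alt",
    "hive.exec.parallel": "true",
    "hive.vectorized.execution.enabled": "true",
    "hive.vectorized.execution.reduce.enabled": "true",
})
print(url)
```

Whether these particular settings help depends on the cluster: the Tez engine and the queue named alt have to exist there, and some servers restrict which properties a client may override.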
