Let's imagine I have access to a Hive data warehouse that I can query through some web service. The problem is that I cannot automate queries through this service, so I would like to be able to query Hive from an external script (which I could then automate).
So far, I've only seen people running Hive on their local machine and querying it there. Is it possible to do this remotely? If so, how?
Thanks a lot!
As far as I understand, you are asking whether there are ways to connect to Hive from a remote machine?
You could install the Hive client (Beeline) on any remote machine and connect to Hive via JDBC.
Take a look here:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients
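For example, a minimal Beeline connection might look like this (the hostname, username, and password are placeholders; 10000 is only the default HiveServer2 port):
beeline -u "jdbc:hive2://hiveserver2.example.com:10000/default" -n myuser -p mypassword   # Interactive session against a remote HiveServer2
beeline -u "jdbc:hive2://hiveserver2.example.com:10000/default" -n myuser -e "SHOW TABLES;"   # Single query, suitable for scripting/automation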
An easy way to do this is to deploy the Hadoop/YARN client configuration on the remote machine. If the remote cluster is secured with firewalls and Kerberos, you will need access to those first. After that it's just a matter of starting up a Hive shell or submitting a job to YARN.
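For instance, once the client configuration is in place, a rough sketch of what this looks like (the jar path and main class are placeholders):
hive                                             # Start an interactive Hive shell using the deployed client configuration
yarn jar /path/to/my-job.jar com.example.MyJob   # Submit a jar to YARN (path and class are placeholders)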
If you use Cloudera, you might be able to add the host to the cluster and install a "gateway" role for YARN and Hive on the target machine. This is very straightforward and requires just a few minutes of work.
Alternatively, using the JDBC connector should also work, as stated in Facha's answer.
Suppose I had a configuration that included 1) a local Windows client, 2) a remote Unix server without windowing capability, and 3) a separate remote Hadoop cluster that houses data to be queried with (among other things) Hive. I am seeking a way to establish the Hive Metastore as a data source in a JetBrains IDE installed on the Windows client (specifically IntelliJ).
The wrinkle in this configuration is Kerberos, which is installed on the remote Unix server but not on the local Windows machine. Typically, the Hive Metastore is accessed from the Unix server. It should be assumed that installing Kerberos on the Windows client is not feasible, and it isn't clear to me how IntelliJ could feasibly be used in a windowless Unix environment in any scenario. However, I really want the features it provides to be available.
Is it actually possible to get IntelliJ to somehow leverage the ability to initialize a Kerberos ticket on the Unix server in order to connect to Hive?
Is it possible to get IntelliJ to prompt for my Kerberos credentials when a connection to the Hive Metastore is initialized?
This seems less than likely, but any ideas would be greatly appreciated.
I'm trying to use Hue's Beeswax app to connect to my company's Hive database. First, is it possible for Hue installed on my Mac to connect to a remote Hive server? If so, how am I supposed to find the address of the Hive server running on our private server? The only thing I can do right now is type 'hive' and run some SQL queries in the Hive shell. I already installed Hue but can't figure out how to connect it to the remote Hive server. Any tips would be much appreciated.
If all you want is a desktop connection to Hive, you only need a JDBC client, not a full web app like Hue.
In any case, the Hive CLI is deprecated and Beeline is preferred. To use either Beeline or Hue, you need HiveServer2 running.
To find the address of HiveServer2 (if you have one running), you need to locate the hive-site.xml file on the Hadoop cluster and export it. You can also get this information from Ambari or Cloudera Manager (but if you're using a Cloudera CDH cluster, you already have Hue). The Thrift interface is what you want; the default port is 10000.
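As a quick sketch, you could grep the relevant properties out of hive-site.xml (the /etc/hive/conf path is only a common default and may differ on your cluster):
grep -E -A1 'hive.server2.thrift.(port|bind.host)' /etc/hive/conf/hive-site.xml   # Shows the HiveServer2 host and Thrift port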
When you set up Hue, you will need to find the hue.ini file and edit the section that starts with [beeswax], filling in the necessary values. Personally, I find that section fairly straightforward.
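For reference, a minimal sketch of that section could look like the following (the hostname is a placeholder; check the comments in your own hue.ini for the exact keys your Hue version supports):
[beeswax]
  # Host and port where HiveServer2 is listening (values below are placeholders)
  hive_server_host=hiveserver2.example.com
  hive_server_port=10000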
You can read the Hue GitHub page to find the requirements for running it on a Mac.
My question is pretty trivial, but I didn't find anyone actually asking it.
We have an Ambari cluster with Spark, Storm, HBase, and HDFS (among other things).
I don't understand how a user who wants to use that cluster is supposed to use it.
For example, a user wants to copy a file to HDFS, run a spark-shell, or create a new table in the HBase shell.
Should he get a local account on the server that runs the corresponding service? Shouldn't he use a third-party machine (his own laptop, for example)?
If so, how should one use hadoop fs? There is no way to specify the server IP the way spark-shell allows.
What is the normal/right/expected way to run all these tasks from a user's perspective?
Thanks.
The expected way to run the described tasks from the command line is as follows.
First, gain access to the command line of a server that has the required clients installed for the services you want to use, e.g. HDFS, Spark, or HBase.
During the process of provisioning a cluster via Ambari, it is possible to define one or more servers where the clients will be installed.
Here you can see an example of an Ambari provisioning process step. I decided to install the clients on all servers.
Afterwards, one way to figure out which servers have the required clients installed is to check your hosts views in Ambari. Here you can find an example of an Ambari hosts view: check the green rectangle to see the installed clients.
Once you have installed the clients on one or more servers, these servers will be able to utilize the services of your cluster via the command line.
Just to be clear, a client can use a service regardless of which server the service is actually running on.
Second, make sure that you comply with the security mechanisms of your cluster. For HDFS, this can influence which users you are allowed to act as and which directories you can access with them. If you do not use security mechanisms such as Kerberos or Ranger, you should be able to run your stated tasks directly from the command line.
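If the cluster is kerberized, a minimal sketch of the extra step would be obtaining a ticket first (the principal below is a placeholder):
kinit myuser@EXAMPLE.COM   # Obtain a Kerberos ticket before running any client commands (principal is a placeholder)
klist                      # Verify that a valid ticket was granted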
Third, execute your tasks via command line.
Here is a short example of how to access HDFS without considering security mechanisms:
ssh user@hostxyz # Connect to the server that has the required HDFS client installed
hdfs dfs -ls /tmp # Command to list the contents of the HDFS tmp directory
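Similarly, assuming the Spark and HBase clients are also installed on that server, the other tasks from the question would look roughly like this (the file name is a placeholder):
hdfs dfs -put localfile.txt /tmp   # Copy a local file into HDFS (file name is a placeholder)
spark-shell --master yarn          # Start a Spark shell that runs against the cluster's YARN resource manager
hbase shell                        # Start an interactive HBase shell, e.g. to create a new table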
Take a look at Ambari Views, especially the Files view, which allows browsing HDFS.
I'm new to the Hortonworks Sandbox, and I was wondering if it allows an external connection to be made to it using curl from a remote machine to retrieve finished jobs or to initiate a new job. If it can, an example of how to use it would be greatly appreciated.
I have tried for some time and I keep getting a login page as the response to the curl request (even when I pass the appropriate credentials using curl's user/password option).
I am using the Hortonworks Sandbox v1.3 VirtualBox image that Hortonworks provides as a free download to run the environment.
Pig is not a service, so you can't connect to it directly.
Instead, you can consider connecting to WebHCat (formerly Templeton), which has a REST API for Hive, Pig, and several other components. Documentation is here: http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.2.4/bk_dataintegration/content/ch_using_hcatalog_1.html
If you're using the VirtualBox version, use 127.0.0.1:9090 to connect to WebHCat; for other versions, use the sandbox's address with port 9090.
In particular, look at templeton/v1/queue/:jobid to retrieve job status and templeton/v1/pig to initiate a Pig job.
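As a rough sketch (the user name, job ID, and Pig statement are placeholders, and the exact parameters can vary by WebHCat version):
curl -s 'http://127.0.0.1:9090/templeton/v1/queue/job_201801010000_0001?user.name=guest'   # Check the status of a submitted job (job ID and user are placeholders)
curl -s -d user.name=guest --data-urlencode execute="A = LOAD '/tmp/input'; DUMP A;" 'http://127.0.0.1:9090/templeton/v1/pig'   # Submit a Pig job by posting the Pig Latin to run (statement is a placeholder)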
I have a fundamental question regarding the two servers mentioned in the context of the Cloudera CDH4 distribution.
Are the two interchangeable/replaceable, i.e. could you run Beeswax in place of Hive Server?
I'm trying to use a Thrift client to connect, and in my setup only Beeswax is running, not Hive Server. In that case, can I connect to the Beeswax server?
Hive Server is the default process and Beeswax is a newer process designed to better support concurrency and provide authentication using Kerberos. You should run one or the other.
And yes, you should definitely be able to connect to Beeswax using Thrift. You can find clients for Beeswax and Hive Server here.
What is the difference between HiveServer2 and Beeswax? They are both designed to better support concurrency and security.