Sqoop vs Sqoop2 commands - sqoop2

One of the advantages of migrating to Sqoop2 is that we no longer share database credentials with clients.
Currently, when we execute Sqoop commands, they look like this:
sqoop import --connect ... --username ... --table ...
When we upgrade to Sqoop2, we still run the same command, except that the connection string points to the Sqoop2 server rather than the actual RDBMS involved in the transfer, and the credentials are those of the Sqoop2 server.
So we are still sharing the Sqoop2 server's credentials with all the clients. Doesn't that violate the basic principle for which Sqoop2 was created?

In Sqoop, anyone who has access to the Hadoop cluster will know the database credentials, as they have to be hard-coded.
In Sqoop2, the database credentials are known only to the admins who manage the cluster. Developers need not know the password.
In Sqoop, the client submits jobs directly to the cluster; there is no server concept. This means you need the JDBC jar files on the Sqoop client. Once someone has the database credentials and the jar files inside the same firewall, security can easily be breached outside Sqoop.
In Sqoop2, the client does not submit jobs directly; it points to the server, and the server submits the jobs. So the Sqoop server, the database, and the Hadoop cluster can all sit behind the firewall, with only the Sqoop server ports opened, and only to Sqoop2 clients. Hence users cannot breach security by logging in to the database outside Sqoop (even if they know the database credentials and have the JDBC jars).
On top of the additional security, there is also this major difference:
Sqoop cannot be integrated with web interfaces such as Hue, as it follows a client-only architecture.
Sqoop2 runs on a client-server architecture. The server runs as a web application, and hence tools like Hue can actually be used to develop Sqoop-based scripts.

Related

Remote Kerberos Authentication for Hive via IntelliJ

Suppose I have a configuration that includes 1) a local Windows client, 2) a remote Unix server without windowing capability, and 3) a separate remote Hadoop cluster that houses data to be queried with (among other things) Hive. I am seeking a way to set up the Hive Metastore as a data source in a JetBrains IDE installed on the Windows client (specifically IntelliJ).
The wrinkle in this configuration is Kerberos, which is installed on the remote Unix server but not on the local Windows machine. Typically, the Hive Metastore is accessed from the Unix server. Assume that installing Kerberos on the Windows client is not feasible, and it isn't clear to me how IntelliJ could be used in a windowless Unix environment at all. However, I really want the features it provides.
Is it actually possible to get IntelliJ to somehow use a Kerberos ticket initialized on the Unix server to connect to Hive?
Is it possible to get IntelliJ to prompt for my Kerberos credentials when it initializes a connection with the Hive Metastore?
This seems less than likely, but any ideas would be greatly appreciated.

How to configure HUE to be connected to remote Hive server?

I'm trying to use Hue Beeswax to connect to my company's Hive database. First, is it possible to connect Hue installed on my Mac to a remote Hive server? If so, how do I find the address of the Hive server, which is running on our private server? The only thing I can do is type 'hive' and run SQL queries in the Hive shell. I have already installed Hue but can't figure out how to connect it to the remote Hive server. Any tips would be much appreciated.
If all you want is a desktop connection to Hive, you only need a JDBC client, not a full web app like Hue.
In any case, Hive CLI is deprecated. Beeline is preferred. To use Beeline and Hue, you need a HiveServer2 running.
To find the address of the HiveServer2, if you have one, you need to find the hive-site.xml file on the Hadoop cluster and export it. This information is also available in Ambari or Cloudera Manager (but if you're using a Cloudera CDH cluster, you already have Hue). The Thrift interface is what you want. The default port is 10000.
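As a rough illustration of reading the address out of an exported hive-site.xml, here is a minimal Python sketch. It assumes the file uses the standard Hadoop property layout and that the HiveServer2 Thrift settings are stored under hive.server2.thrift.bind.host and hive.server2.thrift.port. The function name hiveserver2_address is made up for this example. The bind host is often not set at all, in which case you still need to know which machine runs HiveServer2.

```python
import xml.etree.ElementTree as ET


def hiveserver2_address(hive_site_xml, default_port=10000):
    """Return (host, port) for HiveServer2 from the text of a hive-site.xml.

    host is None when hive.server2.thrift.bind.host is not set;
    port falls back to default_port when hive.server2.thrift.port is not set.
    """
    root = ET.fromstring(hive_site_xml)
    props = {}
    # Each setting is a <property> element with <name> and <value> children.
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    host = props.get("hive.server2.thrift.bind.host") or None
    port = int(props.get("hive.server2.thrift.port") or default_port)
    return host, port
```

To try it, read the exported file into a string (for example with open("hive-site.xml").read()) and pass that string to hiveserver2_address.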
When you set up Hue, find the hue.ini file, edit the section that starts with [beeswax], and fill in the necessary values. Personally, I find that section fairly straightforward.
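For orientation, the values that usually matter in that section are the HiveServer2 host and port. The Python sketch below just renders such a fragment. It assumes the keys are named hive_server_host and hive_server_port, as in the hue.ini template shipped with Hue; check the template of your Hue version before relying on that. The host hs2.example.com is a placeholder and beeswax_section is a helper name invented for this example.

```python
def beeswax_section(host, port=10000):
    """Render a minimal [beeswax] fragment for hue.ini as text."""
    lines = [
        "[beeswax]",
        f"hive_server_host={host}",   # machine running HiveServer2
        f"hive_server_port={port}",   # HiveServer2 Thrift port
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder host; substitute the address of your own HiveServer2.
    print(beeswax_section("hs2.example.com"))
```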
You can read the Hue GitHub repository to find the requirements for running it on a Mac.

How to push data from SQL to HDFS

I have the following use case:
We have several SQL databases in different locations, and we need to load some data from them into HDFS.
The problem is that we cannot reach the database servers from our Hadoop cluster (due to security concerns), but we can push data to our cluster.
Is there any tool like Apache Sqoop for this kind of bulk loading?
Dump the data from your SQL databases as files in some delimited format, for instance CSV, and then use a simple hadoop put command to put all the files into HDFS.
That's it.
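A minimal Python sketch of that dump-then-put flow follows. It uses sqlite3 purely as a stand-in for whatever SQL database you actually have, and the helper names dump_table_to_csv and hdfs_put_command are made up for this example. The second helper only builds the hadoop fs -put command line; it does not run it.

```python
import csv
import sqlite3


def dump_table_to_csv(conn, table, out):
    """Write every row of `table` to the text stream `out` as CSV.

    No header row is written, since Hadoop tools typically treat every
    line as data. The table name is assumed to come from a trusted source.
    Returns the number of rows written.
    """
    cur = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(out, lineterminator="\n")
    count = 0
    for row in cur:
        writer.writerow(row)
        count += 1
    return count


def hdfs_put_command(local_path, hdfs_dir):
    """Build (but do not run) the command that copies a local file to HDFS."""
    return ["hadoop", "fs", "-put", local_path, hdfs_dir]
```

On a machine that can push to the cluster, you would open a local file with open(path, "w", newline=""), pass it as `out`, and then hand the result of hdfs_put_command to subprocess.run.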
Let us assume I work in a small company with a 30-node cluster that processes 100 GB of data daily. This data comes from different sources, such as RDBMSs like Oracle, MySQL, IBM Netezza, DB2, etc. We do not need to install Sqoop on all 30 nodes; the minimum number of nodes on which Sqoop must be installed is 1. After installing it on one machine, we access the source databases from that machine and use Sqoop to import the data.
For security reasons, no import can be done unless the administrator runs the following two commands:
MYSQL>grant all privileges on mydb.table to ''@'IP Address of Sqoop Machine'
MYSQL>grant all privileges on mydb.table to '%'@'IP Address of Sqoop Machine'
These two commands must be run by the admin.
Then we can use our sqoop import commands, etc.
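As a small sketch of what such an import invocation looks like once the grants are in place, the Python helper below assembles a Sqoop 1 command line from its parts. The helper name sqoop_import_command and the host, user, and table values in the usage line are placeholders. The optional --password-file flag is included on the assumption that your Sqoop 1 version supports it, so the password need not appear on the command line.

```python
def sqoop_import_command(connect, username, table, password_file=None):
    """Build (but do not run) a `sqoop import` argument list."""
    cmd = [
        "sqoop", "import",
        "--connect", connect,     # JDBC URL of the source database
        "--username", username,
        "--table", table,
    ]
    if password_file:
        # Path to a file holding the password instead of typing it inline.
        cmd += ["--password-file", password_file]
    return cmd


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(" ".join(sqoop_import_command(
        "jdbc:mysql://db.example.com/mydb", "etl_user", "orders")))
```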

Difference Between HWI and "HiveServer" in Hive

I have been going through Apache Hive these days, and the following is confusing me quite a bit:
There is a Hive Web Interface (hive --service hwi) that listens on a port (default 9999), lets the client submit a query and come back later, is equipped with authorization, etc.
There is also a HiveServer (hive --service hiveserver) that runs a server, allows remote clients to connect and submit Hive queries, and is also protected by authorization, etc.
How are they different (or are they not)? If they are different but offer the same kind of features, what sets them apart?
There is also a HiveServer2 and a Thrift server, which I am not sure about, but I think they are an improvement over HiveServer?
Can someone talk about them and clarify what is unique about each and what bigger problem they solve?
Regards,
(*Vipul)() ;
HWI
Hive's HWI (Hive Web Interface) is an alternative to the Hive command line interface. It provides features such as:
Schema browsing
Detached query execution
Manage sessions
No local installation
HiveServer
HiveServer, on the other hand, allows remote clients to submit requests to Hive using Thrift's various programming language bindings. As HiveServer uses Thrift, it is sometimes called ThriftServer.
HiveServer v1 cannot handle concurrent requests from more than one client. This limitation is addressed in HiveServer v2, which allows multiple concurrent connections from clients. HiveServer2 also provides:
authentication using Kerberos & LDAP
SSL encryption
PAM
HiveServer2 provides various client interfaces like:
Beeline command line shell
JDBC
Python & Ruby clients
The HiveServer2 JDBC driver can be used to connect BI tools like Tableau, Talend, etc. to Hive, for example to perform ETL.
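JDBC clients reach HiveServer2 through a URL of the form jdbc:hive2://host:port/database, and on a Kerberos-secured cluster the server's principal is appended as a ;principal=... parameter. The Python sketch below only assembles such a URL. The helper name hive2_jdbc_url and the host and principal values are placeholders for illustration, not values from any real cluster.

```python
def hive2_jdbc_url(host, port=10000, database="default", principal=None):
    """Assemble a HiveServer2 JDBC URL.

    principal is the HiveServer2 Kerberos principal, needed only when the
    cluster uses Kerberos authentication.
    """
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if principal:
        url += f";principal={principal}"
    return url


if __name__ == "__main__":
    # Placeholder host and principal.
    print(hive2_jdbc_url("hs2.example.com"))
    print(hive2_jdbc_url("hs2.example.com",
                         principal="hive/hs2.example.com@EXAMPLE.COM"))
```

The resulting string is what you would paste into the connection settings of Beeline or a BI tool that uses the Hive JDBC driver.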

Cloudera beeswax server and hive server

I have a fundamental question regarding the two servers mentioned in the context of the Cloudera CDH4 distribution.
Are the two interchangeable, i.e. could you run Beeswax in place of the Hive server?
I'm trying to connect with a Thrift client, and in my setup only Beeswax is running, not the Hive server. In that case, can I connect to the Beeswax server?
Hive Server is the default process, and Beeswax is a newer process designed to better support concurrency and provide authentication using Kerberos. You should run one or the other.
And yes, you should definitely be able to connect to Beeswax using Thrift. You can find clients for Beeswax and Hive Server here.
What is the difference between HiveServer2 and Beeswax? They are both designed to better support concurrency and security.
