Best practices to work with Sqoop, HDFS and Hive - hadoop

I have to use Sqoop to import all tables from a MySQL database into HDFS and into external tables in Hive (no filters, same structure).
With each import I want to bring in:
New data for existing tables
Updated data for existing tables (using only the id column)
New tables created in MySQL (and create the corresponding external table in Hive)
Then I want to create a Sqoop job to do it all automatically.
(I have a MySQL database with approximately 60 tables, and with each new client going into production a new table is created, so I need Sqoop to work as automatically as possible.)
The first command executed to import all the tables was:
sqoop import-all-tables \
--connect jdbc:mysql://IP/db_name \
--username user \
--password pass \
--warehouse-dir /user/hdfs/db_name \
-m 1
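For the incremental part, this is roughly the shape of the saved job I have in mind (just a sketch; the table name customers is a placeholder, and it assumes id is a monotonically increasing key, as described above):
sqoop job --create import_customers \
-- import \
--connect jdbc:mysql://IP/db_name \
--username user \
--password pass \
--table customers \
--target-dir /user/hdfs/db_name/customers \
--incremental append \
--check-column id \
--last-value 0 \
-m 1
A saved job keeps track of the last imported value of the check column, so each sqoop job --exec import_customers run should only append the new rows.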
Here Sqoop and support for external Hive tables says that support was added for creating external tables in Hive, but I did not find documentation or examples of the commands mentioned there.
What are the best practices for working with Sqoop so that it picks up all the updates from a MySQL database and passes them on to HDFS and Hive?
Any ideas would be good.
Thanks in advance.
Edit: Sqoop and support for external Hive tables (SQOOP-816) is still unresolved.
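In the meantime, the kind of statement I would have to run by hand for each new table looks like this (a sketch; table and column names are placeholders, and it assumes Sqoop's default comma field delimiter):
CREATE EXTERNAL TABLE db_name.customers (
id INT,
name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hdfs/db_name/customers';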

Related

Is it possible to use/query data using Pig/Tableau or some other tool from HDFS which was inserted/loaded using a HIVE Managed table?

Is it possible to use or query data using Pig, Drill, Tableau or some other tool from HDFS which was inserted/loaded using a Hive managed table, or is that only possible with data in HDFS which was inserted/loaded using a Hive external table?
Edit 1: Is the data associated with Managed Hive Tables locked to Hive?
Managed and external tables only differ in how the underlying files are handled on the file system, not in their visibility to clients; both can be accessed with Hive clients.
Hive has HiveServer2 (which uses Thrift), a service that lets clients execute queries against Hive. It provides JDBC/ODBC access.
So you can query data in Hive whether it is stored in managed tables or external tables.
DBeaver/Tableau can query Hive once connected to HiveServer2.
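For example, a JDBC client such as Beeline connects along these lines (hostname and user are placeholders):
beeline -u jdbc:hive2://hiveserver-host:10000/default -n hiveuser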
For Pig you can use HCatalog
pig -useHCatalog
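Once Pig is started that way, a Hive table can be read directly through HCatalog, roughly like this (a sketch; default.my_table is a placeholder, and older Hive versions use the org.apache.hcatalog.pig.HCatLoader class name instead):
A = LOAD 'default.my_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
B = LIMIT A 10;
DUMP B;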

Importing data to hbase using sqoop

When I want to import data to Hive using Sqoop I can specify --hive-home <dir> and Sqoop will call that specified copy of Hive installed on the machine where the script is being executed. But what about HBase? How does Sqoop know which HBase instance/database I want the data to be imported into?
Maybe the documentation helps?
By specifying --hbase-table, you instruct Sqoop to import to a table in HBase rather than a directory in HDFS
Every example I see just shows that option along with a column family and so on, so I assume the connection details depend on whatever variables are set in sqoop-env.sh, as the Hortonworks docs describe.
When you give the Hive home directory, that's not telling it any database or table information either, but rather where the Hive configuration files exist on the machine you're running Sqoop on. By default, that's set from the environment variable $HIVE_HOME.
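For reference, a typical HBase import looks something like this (a sketch; it assumes the hbase-site.xml for the target cluster is on Sqoop's classpath, and the table, column family and key names are placeholders):
sqoop import \
--connect jdbc:mysql://db-host/mydb \
--username user \
--password pass \
--table my_table \
--hbase-table my_table \
--column-family cf \
--hbase-row-key id \
--hbase-create-table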

How to push data from SQL to HDFS

I have the following use case:
We have several SQL databases in different locations and we need to load some data from them into HDFS.
The problem is that we do not have access to the servers from our Hadoop cluster (due to security concerns), but we can push data to our cluster.
Is there any tool like Apache Sqoop to do such bulk loading?
Dump data as files from your SQL databases in some delimited format, for instance CSV, and then do a simple hadoop fs -put to load all the files into HDFS.
That's it.
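A minimal sketch of that approach, assuming a MySQL source and an edge node that can reach HDFS (database, table and path names are placeholders):
# dump a table to a tab-delimited file (MySQL used as an example)
mysql --batch -u user -p -e "SELECT * FROM mydb.my_table" > my_table.tsv
# push the file into HDFS from the edge node
hdfs dfs -mkdir -p /data/mydb
hdfs dfs -put my_table.tsv /data/mydb/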
Let us assume I am working at a small company with a 30-node cluster processing roughly 100 GB of data daily. This data comes from different sources, RDBMSs such as Oracle, MySQL, IBM Netezza, DB2, etc. We do not need to install Sqoop on all 30 nodes; the minimum number of nodes Sqoop has to be installed on is 1. Once it is installed on one machine, we can reach those source databases and import their data using Sqoop.
As far as security is concerned, no import can be done until the administrator runs the following two commands:
mysql> GRANT ALL PRIVILEGES ON mydb.table TO ''@'IP address of Sqoop machine';
mysql> GRANT ALL PRIVILEGES ON mydb.table TO '%'@'IP address of Sqoop machine';
These two commands have to be run by the admin.
Then we can use our Sqoop import commands, etc.

sqoop imports to a different directory than the Hive warehouse directory

When I use Sqoop to import tables into Hive, the tables go to a different HDFS directory than /user/hive/warehouse. I'm using the default Derby database for the Hive metastore. How can I make the imports go to the Hive warehouse directory by default?
Try using --hive-home /user/hive/warehouse. Generally, when you are importing data from a relational database, hive-home is picked up by default. Since you mention it is not using the warehouse path, try setting the parameter explicitly with --hive-home.
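A sketch of the full command with that suggestion applied (connection details and table name are placeholders):
sqoop import \
--connect jdbc:mysql://db-host/db_name \
--username user \
--password pass \
--table my_table \
--hive-import \
--hive-home /user/hive/warehouse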

Can't Query Hive Tables after Sqoop import

I imported several tables of an Oracle DB into Hive via Sqoop. The command looked something like this:
./sqoop import --connect jdbc:oracle:thin:@//185.2.252.52:1521/orcl --username USER_NAME --password test --table TABLENAME --hive-import
I'm using an embedded metastore (at least I think so; I have not changed the default conf in that regard). When I do SHOW TABLES in Hive, the imported tables do not show up, but some tables I've created for testing via the command line do. The tables are all in the same warehouse directory on HDFS. It seems like the Sqoop import is not using the same metastore.
But where is it? And how can I switch to it when using the command line for querying?
thanks
I think the entire problem is the embedded metastore: by default, Hive creates it in the current working directory if it does not already exist, so Sqoop ends up using a different metastore than Hive. I would recommend configuring MySQL as the backend for the metastore.
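A minimal sketch of the relevant hive-site.xml properties for a MySQL-backed metastore (host, database name and credentials are placeholders, and the MySQL JDBC driver jar has to be on Hive's classpath):
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://metastore-host/metastore_db?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hiveuser</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>hivepassword</value>
</property>
Both Hive and Sqoop need to see the same hive-site.xml so they end up talking to the same metastore.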
