Using Sqoop to dynamically create a partitioned Hive table from Oracle and import data

I have a table in Oracle named TRCUS that holds customer details and is partitioned by year and month.
Partition names in Oracle:
PERIOD_JAN_13,
PERIOD_FEB_13,
PERIOD_JAN_14,
PERIOD_FEB_14, etc.
Now I want to import this table's data into Hive directly using Sqoop.
The Sqoop job should create a Hive table, dynamically create partitions based on the Oracle table's partitions, and then import the data into the respective Hive partitions.
How can this be achieved using Sqoop?

Unfortunately, Sqoop cannot do this directly. A workaround is to import the data into an unpartitioned Hive table first and then let Hive's dynamic partitioning distribute the rows into a partitioned table:
1. Create a table in Hive without any partitions.
2. Enable dynamic partitioning:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
3. Import the data into the unpartitioned Hive table using Sqoop (this example reads from a MySQL source; change the connection string for Oracle):
sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/database1" --username root --password cloudera --query 'SELECT DISTINCT id, count, name FROM test WHERE $CONDITIONS' --target-dir /user/hive/warehouse/ --hive-table pd_withoutpartition --hive-database database1 --hive-import --hive-overwrite -m 1 --direct
4. Create a second table that is partitioned.
5. Overwrite the partitioned table from the unpartitioned one:
INSERT OVERWRITE TABLE pd_partition PARTITION (name) SELECT id, count, name FROM pd_withoutpartition;
Note: The partition column must come last in the SELECT list, because Hive assigns dynamic partition values by position.
Hive Version : Hive 1.1.0-cdh5.13.1
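The ordering rule for the partition column is easy to get wrong when the statement is written by hand. Here is a minimal Python sketch that only builds the HiveQL string and runs nothing. The helper name build_dynamic_partition_insert is hypothetical; the table and column names are the ones from the answer above.

```python
def build_dynamic_partition_insert(target, source, columns, partition_col):
    """Return a HiveQL INSERT OVERWRITE that relies on dynamic partitioning.

    Hive maps the trailing SELECT column to the PARTITION clause by position,
    so the partition column is moved to the end of the select list.
    """
    if partition_col not in columns:
        raise ValueError(f"{partition_col!r} is not a column of {source}")
    # Keep the original order of the data columns, partition column last.
    ordered = [c for c in columns if c != partition_col] + [partition_col]
    return (
        f"INSERT OVERWRITE TABLE {target} PARTITION ({partition_col}) "
        f"SELECT {', '.join(ordered)} FROM {source};"
    )


# Names taken from the answer above; the column order given here is deliberately "wrong".
statement = build_dynamic_partition_insert(
    "pd_partition", "pd_withoutpartition", ["name", "id", "count"], "name"
)
print(statement)
```

The generated statement could then be passed to Hive in whatever way your environment supports, after the two SET commands have been issued in the same session.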

Related

How to specify multiple columns for incremental data in Sqoop?

I am using the following command to fetch incremental data with Sqoop:
bin/sqoop job --create JOB_NAME -- import --connect jdbc:oracle:thin:/system@HOST:PORT:ORACLE_SERVICE --username USERNAME --password-file /PASSWORD_FILE.txt --fields-terminated-by ',' --enclosed-by '"' --table SCHEMA.TABLE_NAME --target-dir /TARGET_DIR -m 2 --incremental append --check-column NVL(UPDATE_DATE,INSERT_DATE) --last-value '2019-01-01 00:00:00.000' --split-by PRIMARY_KEY --direct
It throws an error because of the multiple columns in the --check-column parameter.
Is there any approach to specify multiple columns in the --check-column parameter?
If the UPDATE_DATE field is null, the import should fall back to the INSERT_DATE column.
I want to extract transaction records from a table that is updated daily. When a record is inserted for the first time, its UPDATE_DATE column has no value, which is why I need to compare both columns while fetching data from the table.
Any help regarding this would be highly appreciated.
As far as I understand, an incremental import cannot use two check columns, so the only way to get this done is with two separate imports:
1. An incremental import with the insert date as the check column, for first-time records.
2. An incremental import with the update date as the check column, for updated records.
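To make the two-import idea concrete, here is a small Python sketch that only assembles the two argument lists and prints them; it does not run Sqoop. The connection placeholders come from the question. The separate target directories, the choice of append for inserts and lastmodified for updates, and the helper name incremental_import_args are my assumptions, not something the answer specifies.

```python
import shlex


def incremental_import_args(check_column, last_value, mode, target_dir):
    """Build the argument list for one `sqoop import` run (nothing is executed)."""
    if mode not in ("append", "lastmodified"):
        raise ValueError("mode must be 'append' or 'lastmodified'")
    return [
        "sqoop", "import",
        "--connect", "jdbc:oracle:thin:@HOST:PORT:ORACLE_SERVICE",  # placeholder
        "--username", "USERNAME",
        "--password-file", "/PASSWORD_FILE.txt",
        "--table", "SCHEMA.TABLE_NAME",
        "--target-dir", target_dir,  # assumed: one directory per import
        "--incremental", mode,
        "--check-column", check_column,  # exactly one column per import
        "--last-value", last_value,
        "--split-by", "PRIMARY_KEY",
        "-m", "2",
    ]


LAST_RUN = "2019-01-01 00:00:00.000"

# Import 1: rows inserted since the last run.
new_rows = incremental_import_args("INSERT_DATE", LAST_RUN, "append", "/TARGET_DIR/inserted")
# Import 2: rows updated since the last run.
updated_rows = incremental_import_args("UPDATE_DATE", LAST_RUN, "lastmodified", "/TARGET_DIR/updated")

for args in (new_rows, updated_rows):
    print(shlex.join(args))
```

A row that was both inserted and updated after the last run would presumably show up in both imports, so the two result sets may still need to be de-duplicated on the primary key afterwards.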

Incremental Sqoop import into a Hive table

It is known that the --incremental switch of sqoop import does not work for Hive imports. What is the workaround for that?
1) One approach I can think of is to create a Hive table, bring the incremental data into HDFS with Sqoop, and then load it manually. But if we do that, the data would be overwritten on every load. Please correct me if I am wrong.
2) How effective is --query when importing data into Hive with Sqoop?
Thank you
You can do an incremental append into a Hive table with Sqoop, but there is no direct option for it. One way to achieve it is described below.
Store the incremental data as an external table in Hive.
The more common pattern is to import the changes made since the last update and then merge them. In the following example, --check-column is used to fetch records newer than last_import_date, which is the date of the last incremental update:
sqoop import --connect jdbc:teradata://{host name}/Database=retail --connection-manager org.apache.sqoop.teradata.TeradataConnManager --username dbc --password dbc --table SOURCE_TBL --target-dir /user/hive/incremental_table -m 1 --check-column modified_date --incremental lastmodified --last-value {last_import_date}
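The merge step itself is not spelled out here, so this is a rough Python sketch of the reconciliation logic: keep one row per key, and for each key keep the row with the newest modified_date. In practice this would more likely be done in Hive; the column names id and amount and the function name merge_latest are hypothetical, only modified_date comes from the command above.

```python
from datetime import datetime


def merge_latest(base_rows, incremental_rows, key="id", ts="modified_date"):
    """Return one row per key, keeping the row with the newest timestamp."""
    merged = {}
    for row in list(base_rows) + list(incremental_rows):
        current = merged.get(row[key])
        # On equal timestamps the later row (the incremental one) wins.
        if current is None or row[ts] >= current[ts]:
            merged[row[key]] = row
    return sorted(merged.values(), key=lambda r: r[key])


# Hypothetical sample data: the existing table and a freshly imported delta.
base = [
    {"id": 1, "modified_date": datetime(2019, 1, 1), "amount": 10},
    {"id": 2, "modified_date": datetime(2019, 1, 1), "amount": 20},
]
incremental = [
    {"id": 2, "modified_date": datetime(2019, 1, 5), "amount": 25},  # updated row
    {"id": 3, "modified_date": datetime(2019, 1, 6), "amount": 30},  # new row
]

result = merge_latest(base, incremental)
for row in result:
    print(row["id"], row["modified_date"].date(), row["amount"])
```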
On the second part of your question:
--query is also a very useful argument for sqoop import. It gives you the flexibility to do basic joins on the RDBMS tables and to adjust date and time formats. In your position, I would use --query to import the data in the shape I need and then append it to my original table; while loading from the temporary table into the main table, I can transform the data further. I would suggest using --query if the updates are not too frequent.

Sqoop Direct Import Netezza Table Permissions

We are using Netezza direct mode to import data from Netezza into Hadoop as part of a POC.
We have a couple of questions about Netezza itself and about the Netezza-Sqoop integration.
Q1. Does Sqoop direct mode always require the CREATE EXTERNAL TABLE and DROP privileges to perform a direct transfer?
Q2. Is an external table actually created in Netezza? If yes, in which database? I see Sqoop using the query below:
CREATE EXTERNAL TABLE '/yarn/local/usercache/someuser/appcache/application_1483624176418_42787/work/task_1483624176418_42787_m_000000/nzexttable-0.txt'
USING (REMOTESOURCE 'JDBC'
BOOLSTYLE 'T_F'
CRINSTRING FALSE DELIMITER 44 ENCODING
'internal' FORMAT 'Text' INCLUDEZEROSECONDS TRUE
NULLVALUE 'null' MAXERRORS 1)
AS SELECT * FROM SOME_TBL WHERE (DATASLICEID % 3)
Is it created in the database selected in the JDBC URL (jdbc:netezza://somehostname:5480/SOME_DB_1)?
Q3. If Netezza needs to create external tables, can it create them in a different database from the one that holds the table being pulled into Hadoop? What configuration change is needed?
Q4. Does Sqoop run DROP TABLE on the external tables created by the individual mappers?
Sqoop command used:
export HADOOP_CLASSPATH=/opt/nz/lib/nzjdbc3.jar
sqoop import -D mapreduce.job.queuename=some_queue \
-D yarn.nodemanager.local-dirs=/tmp -D mapreduce.map.log.level=DEBUG \
--direct --connect jdbc:netezza://somehost:5480/SOME_DB --table SOME_TBL_1 \
--username SOMEUSER --password xxxxxxx --target-dir /tmp/netezza/some_tbl_file \
--num-mappers 2 --verbose
This is the reply I got from the Sqoop user community (thanks, Szabolcs Vasas):
In case of Netezza direct imports, Sqoop executes a CREATE EXTERNAL TABLE command (so you will need the CREATE EXTERNAL TABLE privilege) to write a backup of the table's content to a temporary file, and it then copies the content of this file to the final output on HDFS.
The SQL command you pasted in your email is indeed the one executed by Sqoop, but as far as I understand from the Netezza documentation (http://www.ibm.com/support/knowledgecenter/SSULQD_7.2.1/com.ibm.nz.load.doc/c_load_create_external_tbl_expls.html, 6th example), it does not really create a new external table in any schema; it just backs up the content of the table, and because of that no DROP TABLE statement is executed.
Q1. Yes, Sqoop needs the CREATE EXTERNAL TABLE privilege, but not the DROP privilege.
Q2. Sqoop does not really create a new external table in any schema; it just backs up the content of the table (http://www.ibm.com/support/knowledgecenter/SSULQD_7.2.1/com.ibm.nz.load.doc/c_load_create_external_tbl_expls.html, 6th example).
Q3. It is not possible to create the external table in a specific schema.
Q4. No, Sqoop does not run a DROP command.
Moreover, the table created by the Sqoop direct process is a Netezza transient external table (TET). The external REMOTESOURCE 'JDBC' table is therefore dropped once the mapper has received the data as a NamedFifo, so no tables remain in Netezza after the transfer.

How to deal with a primary key while exporting data from Hive to an RDBMS using Sqoop

Here is my scenario: I have data in the Hive warehouse and I want to export it into a table named "sample" in the "test" database in MySQL. What happens if one column is the primary key in test.sample and the Hive data being exported has duplicate values for that key? Obviously the job will fail, so how can I handle this kind of scenario?
Thanks in Advance
If you want your MySQL table to contain only the last row among the duplicates, you can use the following:
sqoop export --connect jdbc:mysql://<*ip*>/test --table sample --username root -P --export-dir /user/hive/warehouse/sample --update-key <*primary key column*> --update-mode allowinsert
While exporting, Sqoop converts each row into an insert statement by default. By specifying --update-key, each row can be converted into an update statement. However, if a particular row is not present for update, the row is skipped by default. This can be overridden by using --update-mode allowinsert, which allows such rows to be converted into insert statements.
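To show why this leaves only the last row among the duplicates, here is a toy Python simulation of the update-or-insert behaviour described above. It models the target table as a dict keyed by the primary key and processes rows strictly in order; the function name export_with_update_key and the sample rows are hypothetical, and a real export with several mappers gives no such ordering guarantee.

```python
def export_with_update_key(table, rows, update_key, allow_insert=True):
    """Apply rows to `table` (a dict keyed by primary key) and return it.

    Mirrors the description above: a row whose key already exists becomes an
    UPDATE; a row whose key is missing is inserted only if allow_insert is set,
    otherwise it is skipped.
    """
    for row in rows:
        key = row[update_key]
        if key in table:
            table[key] = {**table[key], **row}  # UPDATE existing row
        elif allow_insert:
            table[key] = dict(row)  # INSERT new row
        # else: row is skipped (update-only behaviour)
    return table


# Hypothetical export data with a duplicate primary key (id = 1).
rows = [
    {"id": 1, "name": "first"},
    {"id": 1, "name": "last"},
    {"id": 2, "name": "new"},
]

upserted = export_with_update_key({}, rows, "id", allow_insert=True)
update_only = export_with_update_key({}, rows, "id", allow_insert=False)
print(upserted)
print(update_only)
```

With allow_insert enabled the duplicate key no longer causes a failure: the first occurrence is inserted and the later one overwrites it. Without it, rows whose key is not already in the table are silently dropped.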
Before performing the export, clean your data by removing duplicates on the primary key: select the distinct values of that column and then export to MySQL.

How to create an external table in Hive using Sqoop? Need suggestions

Using Sqoop I can create a managed table, but not an external table.
Please let me know the best practices for unloading data from a data warehouse and loading it into a Hive external table.
1. The tables in the warehouse are partitioned. Some are partitioned by date, some by state.
Please share your thoughts or the practices you use in production environments.
Sqoop does not support creating Hive external tables. Instead you might:
Use the Sqoop codegen command to generate the SQL for creating the Hive internal table that matches your remote RDBMS table (see http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_codegen_literal)
Modify the generated SQL to create a Hive external table
Execute the modified SQL in Hive
Run your Sqoop import command, loading into the pre-created Hive external table
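The "modify the generated SQL" step can be sketched in Python as a small string rewrite. This is only an illustration under a few assumptions: the generated statement starts with CREATE TABLE, it has no trailing TBLPROPERTIES clause, and the sample DDL and HDFS location below are made up rather than real Sqoop output. The function name to_external_ddl is hypothetical.

```python
import re


def to_external_ddl(create_table_sql, location):
    """Rewrite a Hive CREATE TABLE statement into CREATE EXTERNAL TABLE ... LOCATION."""
    ddl = create_table_sql.strip().rstrip(";")
    ddl, replaced = re.subn(
        r"^CREATE\s+TABLE\b", "CREATE EXTERNAL TABLE", ddl, count=1, flags=re.IGNORECASE
    )
    if replaced == 0:
        raise ValueError("expected a statement starting with CREATE TABLE")
    # LOCATION goes after the storage clauses; assumes no TBLPROPERTIES follows.
    return f"{ddl} LOCATION '{location}';"


# Made-up DDL standing in for the generated statement.
generated = (
    "CREATE TABLE IF NOT EXISTS `sample` (`id` INT, `name` STRING) "
    "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;"
)
external = to_external_ddl(generated, "/data/external/sample")
print(external)
```

After running the rewritten statement in Hive, the Sqoop import can write into the pre-created table or its directory, as in the last step above.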
Step 1: Import the data from MySQL into a Hive table.
sqoop import \
--connect jdbc:mysql://localhost/<database> \
--username training --password training \
--table <table-name> --hive-import --hive-table <hive-table-name> -m 1 \
--fields-terminated-by ','
Step 2: In Hive, change the table type from managed to external.
ALTER TABLE <Table-name> SET TBLPROPERTIES('EXTERNAL'='TRUE');
Note: You can import directly into the Hive table, or into the HDFS directory behind it.
My best suggestion is to Sqoop your data into HDFS and create EXTERNAL tables on top of it for raw operations and transformations.
Finally, load the transformed data into an internal table. I believe this is one of the best practices for getting this done properly.
Hope this helps!!!
Refer to these links:
https://mapr.com/blog/what-kind-hive-table-best-your-data/
In the link above, section 2.2.1 "External or Internal" gets straight to the point.
https://hadoopsters.net/2016/07/15/hive-tables-internal-and-external-explained/
After reading the first link, the second will clarify most of your questions.
Cheers!!
