How to specify multiple columns for incremental data in Sqoop? - hadoop

I am using the following command to fetch incremental data in sqoop:
bin/sqoop job --create JOB_NAME -- import --connect jdbc:oracle:thin:/system#HOST:PORT:ORACLE_SERVICE --username USERNAME --password-file /PASSWORD_FILE.txt --fields-terminated-by ',' --enclosed-by '"' --table SCHEMA.TABLE_NAME --target-dir /TARGET_DIR -m 2 --incremental append --check-column NVL(UPDATE_DATE,INSERT_DATE) --last-value '2019-01-01 00:00:00.000' --split-by PRIMARY_KEY --direct
It throws an error for multiple columns in the --check-column parameter.
Is there any approach to specify multiple columns in the --check-column parameter?
I want to fetch the data such that, if the UPDATE_DATE field contains a null value, the import falls back to the INSERT_DATE column.
I want to extract transaction records from a table which is updated daily; when a record is inserted for the first time there is no value in the UPDATE_DATE column. That's why I need to compare both columns while fetching data from the table.
Any help regarding this would be highly appreciated.

As per my understanding, it doesn't look like it's possible to have two check columns when doing incremental imports, so the only way we can manage to get this done is with two separate imports, as sketched below:
Incremental import with the insert date as check column, for first-time records
Incremental import with the update date as check column, for updated records
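A minimal sketch of that two-job approach, reusing the placeholders from the question; the job names JOB_INSERTS and JOB_UPDATES are hypothetical, and the standard thin-driver URL form jdbc:oracle:thin:@HOST:PORT:ORACLE_SERVICE is assumed:
bin/sqoop job --create JOB_INSERTS -- import \
--connect jdbc:oracle:thin:@HOST:PORT:ORACLE_SERVICE \
--username USERNAME --password-file /PASSWORD_FILE.txt \
--fields-terminated-by ',' --enclosed-by '"' \
--table SCHEMA.TABLE_NAME --target-dir /TARGET_DIR -m 2 \
--incremental append --check-column INSERT_DATE \
--last-value '2019-01-01 00:00:00.000' --split-by PRIMARY_KEY

bin/sqoop job --create JOB_UPDATES -- import \
--connect jdbc:oracle:thin:@HOST:PORT:ORACLE_SERVICE \
--username USERNAME --password-file /PASSWORD_FILE.txt \
--fields-terminated-by ',' --enclosed-by '"' \
--table SCHEMA.TABLE_NAME --target-dir /TARGET_DIR -m 2 \
--incremental lastmodified --check-column UPDATE_DATE --merge-key PRIMARY_KEY \
--last-value '2019-01-01 00:00:00.000' --split-by PRIMARY_KEY
The second job uses lastmodified mode with --merge-key so that re-imported updated rows replace their earlier versions rather than piling up as duplicates; the answer above does not dictate the mode, so plain append on UPDATE_DATE is also possible if duplicates are acceptable.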

Related

incremental sqoop to HIVE table

It is known that the --incremental switch doesn't work for Hive imports through Sqoop. But what is the workaround for that?
1) One thing I could come up with is to create a Hive table, bring the incremental data into HDFS through Sqoop, and then manually load it. But if we do it that way, each time we do that load the data would be overwritten. Please correct me if I am wrong.
2) How effective is --query when sqooping data into Hive?
Thank you
You can do a sqoop incremental append to a Hive table, but there is no straight option; below is one of the ways you can achieve it.
Store the incremental table as an external table in Hive.
It is more common to import the incremental changes since the last time the data was updated and then merge them. In the following example, --check-column is used to fetch records newer than last_import_date, which is the date of the last incremental data update:
sqoop import --connect jdbc:teradata://{host name}/Database=retail --connection-manager org.apache.sqoop.teradata.TeradataConnManager --username dbc --password dbc --table SOURCE_TBL --target-dir /user/hive/incremental_table -m 1 --check-column modified_date --incremental lastmodified --last-value {last_import_date}
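To make that target directory queryable, one option is an external Hive table over it; a minimal sketch, assuming hypothetical columns (id, modified_date) and the default comma delimiter produced by the import above:
CREATE EXTERNAL TABLE incremental_table (
  id INT,
  modified_date TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hive/incremental_table';
The merge into the base table is then typically an INSERT OVERWRITE that keeps the newest row per key across the base table and incremental_table.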
Second part of your question:
--query is also a very useful argument you can leverage in sqoop import: it gives you the flexibility of basic joins on the RDBMS tables and the flexibility to play with date and time formats. If I were in your shoes I would do this: using a query, import the data the way I need it, then append it to my original table, and while loading from the temporary table to the main table I can play more with the data. I would suggest using --query if the updates are not too frequent.
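A rough sketch of that staging pattern, assuming hypothetical names (stg_dir for the staging directory, incr_stage and main_table for the Hive tables) and the same Teradata connection as above:
sqoop import --connect jdbc:teradata://{host name}/Database=retail --connection-manager org.apache.sqoop.teradata.TeradataConnManager --username dbc --password dbc --query 'SELECT * FROM SOURCE_TBL WHERE modified_date > {last_import_date} AND $CONDITIONS' --target-dir /user/hive/stg_dir -m 1
Then append the staged rows to the main table, assuming incr_stage is an external table defined over /user/hive/stg_dir as in the earlier sketch:
INSERT INTO TABLE main_table SELECT * FROM incr_stage;
Any reshaping of dates, formats, or joins can be applied in this INSERT step before the data lands in main_table.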

Sqoop Incremental Import for New Records

I have a table with emp_id from 1 to 10 in which emp_id = 6 is not present, and I have done an incremental sqoop import in append mode by creating a sqoop job. After this, two new emp_ids arrived, i.e. emp_id = 6 and emp_id = 12. Now my question is: will the incremental sqoop import pick up emp_id = 6, or will it import only emp_id = 12?
If you have defined --check-column emp_id with --incremental append (and emp_id = 6 was written while the other values were null), your job will write only the emp_id = 12 record. But if the --check-column parameter does not find a value 6 among the existing records, the job will also write the emp_id = 6 record.
If you want to append new records and/or modify existing ones, consider the --last-value parameter.
Be careful: with a created job, the last value is already saved in the /tmp sqoop directory. In this case I suggest you simply use sqoop import; I verified that it works better.
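For reference, a minimal sketch of such a plain import, assuming a hypothetical MySQL source (HOST, DB, emp are placeholders) and that the first import left a maximum emp_id of 10:
sqoop import \
--connect jdbc:mysql://HOST/DB \
--username USER -P \
--table emp \
--target-dir /user/emp \
--incremental append \
--check-column emp_id \
--last-value 10 \
-m 1
With --last-value 10, append mode only imports rows whose emp_id is greater than 10, so emp_id = 12 is picked up; to also pick up the late-arriving emp_id = 6 you would have to lower --last-value, at the cost of re-importing everything above the new value.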

Sqoop to dynamically create hive partitioned table from oracle and import data

I have a table in Oracle (table name is TRCUS) with customer details, partitioned by year & month.
Partitions name in Oracle:
PERIOD_JAN_13,
PERIOD_FEB_13,
PERIOD_JAN_14,
PERIOD_FEB_14 etc
Now I want to import this table's data into HIVE using SQOOP directly.
The Sqoop job should create a Hive table, dynamically create partitions based on the Oracle table partitions, and then import the data into Hive, into the respective partitions.
How can this be achieved using Sqoop?
Unfortunately, it cannot be achieved using Sqoop. However, there is one method which I guess you might not know.
Create a table in Hive without any partitions.
Set dynamic partition modes
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Import data into the non-partitioned Hive table using Sqoop
sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/database1" --username root --password cloudera --query 'SELECT DISTINCT id, count from test WHERE $CONDITIONS' --target-dir /user/hive/warehouse/ --hive-table pd_withoutpartition --hive-database database1 --hive-import --hive-overwrite -m 1 --direct
Create another table with partitions
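A hedged sketch of that partitioned table, assuming the same columns as the unpartitioned table above (id, count) with name as the dynamic partition column:
CREATE TABLE pd_partition (
  id INT,
  count INT
)
-- name is the dynamic partition column filled in by the INSERT OVERWRITE below
PARTITIONED BY (name STRING);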
Overwrite into the partitioned table from the previous table
INSERT OVERWRITE TABLE pd_partition partition(name) SELECT id, count, name from pd_withoutpartition;
Note: Make sure that the column by which you want to partition is mentioned last in the SELECT statement during the overwrite.
Hive Version : Hive 1.1.0-cdh5.13.1

Oracle ROWID for Sqoop Split-By Column

I have a huge Oracle table (Transaction). The data in my Oracle table is skewed on the column "Customer ID", due to which a few mappers take hours to finish the job while the other mappers finish in minutes. I couldn't see any other option to avoid the skewed data, as this is the only column that can be split by. We could combine other columns like Customer ID, Batch ID and SEQ NUM to come up with a multi-column split, but I understand that sqoop doesn't support multiple columns in split-by.
My objective is to pull the transaction data for a specific period (i.e. batch date unique for a month of data).
I tried the below options in sqoop with 10 mappers.
--split-by "my column name" //for example customer id
--where "my query condition" //for example batch date
Now I am thinking of using ROWID, which might split the rows evenly between the mappers. I thought of using the boundary query to get the MIN & MAX ROWID. Below is the Sqoop command I want to use.
sqoop import \
--table Transaction \
--split-by ROWID \
--where "BATCH_DT=TO_DATE('03/31/2016','MM/DD/YYYY')" \
--boundary-query "SELECT MIN(ROWID) AS MIN, MAX(ROWID) AS MAXL FROM Transaction WHERE BATCH_DT=TO_DATE('03/31/2016','MM/DD/YYYY') GROUP BY CUSTOMERID, BATCHNO,BATCHSEQNO " \
--num-mappers 10 \
--target-dir /user/trans
Need advice on whether this would be the right option, or whether there is any other way.
Also, I would like to know if we can use multiple split-by columns by any chance.
Providing --boundary-query will only save you the time spent evaluating the minimum and maximum values; each mapper will still run the same range-based query.
In your case, sqoop will generate a boundary query like:
SELECT MIN(ROWID), MAX(ROWID) FROM (Select * From Transaction WHERE BATCH_DT=TO_DATE('03/31/2016','MM/DD/YYYY') ) t1
You can try this query and your custom boundary query in your JDBC client to check which one is faster, and use that one.
Now coming to the uneven mapper load.
Yes, you are right. Currently, sqoop doesn't support multiple columns in split-by; you have to choose one column. If ROWID is evenly distributed (I am assuming it is), you should use it.
So, your query looks good. Just do the --boundary-query comparison above.
Edit
There is an issue with Oracle's ROWID type: sqoop has no proper Java type mapping for it.
Add --map-column-java ROWID=String in your import command to map this to Java's String.
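Putting that together with the command from the question, a sketch (connection details omitted, as in the question):
sqoop import \
--table Transaction \
--where "BATCH_DT=TO_DATE('03/31/2016','MM/DD/YYYY')" \
--split-by ROWID \
--map-column-java ROWID=String \
--num-mappers 10 \
--target-dir /user/trans
Depending on the Sqoop version, splitting on a text-typed column may also require passing -Dorg.apache.sqoop.splitter.allow_text_splitter=true as a Hadoop property.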
Do you have an index on SEQ_NUM? If so, you can use SEQ_NUM in --split-by (I am assuming SEQ_NUM is not generated randomly but is populated in an incremental fashion for each transaction). Your sqoop command may then look like this:
sqoop import \
--table Transaction \
--split-by SEQ_NUM \
--where "BATCH_DT=TO_DATE('03/31/2016','MM/DD/YYYY')" \
--num-mappers 10 \
--target-dir /user/trans

How to deal with primary key while exporting data from Hive to RDBMS using sqoop

Here is my scenario: I have data in the Hive warehouse and I want to export this data into a table named "sample" in the "test" database in MySQL. What happens if one column is the primary key in test.sample and the data in Hive (which we are exporting) has duplicate values under that key? Then obviously the job will fail, so how could I handle this kind of scenario?
Thanks in Advance
If you want your mysql table to contain only the last row among the duplicates, you can use the following:
sqoop export --connect jdbc:mysql://<*ip*>/test --table sample --username root -P --export-dir /user/hive/warehouse/sample --update-key <*primary key column*> --update-mode allowinsert
While exporting, Sqoop converts each row into an insert statement by default. By specifying --update-key, each row can be converted into an update statement. However, if a particular row is not present for update, the row is skipped by default. This can be overridden by using --update-mode allowinsert, which allows such rows to be converted into insert statements.
Before performing the export operation, massage your data by removing duplicates on the primary key: take a distinct on that primary key column and then export to MySQL.
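A hedged HiveQL sketch of that de-duplication step; the staging table sample_dedup, the key column id, and the other columns (col1, col2) are illustrative, not from the question:
-- keep exactly one row per primary key value before exporting
CREATE TABLE sample_dedup AS
SELECT id, col1, col2
FROM (
  SELECT id, col1, col2,
         row_number() OVER (PARTITION BY id ORDER BY id) AS rn
  FROM sample
) t
WHERE rn = 1;
You would then point --export-dir at the warehouse directory of sample_dedup instead of sample.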
