Sqoop Incremental Import for New Records - hadoop

I have a table with emp_id values from 1 to 10, in which emp_id = 6 is not present, and I have done an incremental Sqoop import in append mode by creating a Sqoop job. After this, two new rows arrived: emp_id = 6 and emp_id = 12. Now my question is: will the incremental Sqoop import pick up emp_id = 6, or will it import only emp_id = 12?

If you have defined --check-column emp_id with --incremental append, your job will write only the emp_id = 12 record, because append mode only imports rows whose check-column value is greater than the saved last value. The emp_id = 6 record would only be written if the saved last value were still below 6.
If you want to append new records and/or pick up modified ones, consider the --last-value parameter.
Be careful: once the job has been created, the last value is already saved by the Sqoop metastore. In this case I suggest you simply use sqoop import; I verified that it works better.
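For reference, a minimal sketch of such a saved job (the connection string, paths and table name here are placeholders, not from the question):
sqoop job --create emp_incr_import -- import --connect jdbc:mysql://HOST/DB --username USER --password-file /user/USER/.sqoop-password --table emp --target-dir /user/USER/emp --incremental append --check-column emp_id --last-value 0 -m 1
Each sqoop job --exec emp_incr_import run appends only rows whose emp_id is greater than the last value stored by the job, which is why a back-filled emp_id = 6 is skipped once the stored last value has moved past it, while emp_id = 12 is imported.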

Related

How to specify multiple columns for incremental data in Sqoop?

I am using the following command to fetch incremental data in Sqoop:
bin/sqoop job --create JOB_NAME -- import --connect jdbc:oracle:thin:/system#HOST:PORT:ORACLE_SERVICE --username USERNAME --password-file /PASSWORD_FILE.txt --fields-terminated-by ',' --enclosed-by '"' --table SCHEMA.TABLE_NAME --target-dir /TARGET_DIR -m 2 --incremental append --check-column NVL(UPDATE_DATE,INSERT_DATE) --last-value '2019-01-01 00:00:00.000' --split-by PRIMARY_KEY --direct
It throws an error for multiple columns in the --check-column parameter.
Is there any approach to specify multiple columns in the --check-column parameter?
I want to fetch data such that, if the UPDATE_DATE field contains a null value, the data is fetched on the basis of the INSERT_DATE column.
I want to extract transaction records from a table which is updated daily, and when a record is inserted for the first time there is no value in the UPDATE_DATE column. That's why I need to compare both columns while fetching data from the table.
Any help regarding this would be highly appreciated.
As per my understanding it doesn't look like it's possible to have two check columns when doing incremental imports, so the only way to get this done is with two separate imports, as sketched below:
Incremental import with INSERT_DATE as the check column, for first-time records
Incremental import with UPDATE_DATE as the check column, for updated records
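A hedged sketch of what those two imports could look like (the connection string, directories and last value are placeholders, and using lastmodified with --merge-key for the second import is just one possible way to fold updates into the existing data):
sqoop import --connect jdbc:oracle:thin:@HOST:PORT:ORACLE_SERVICE --username USERNAME --password-file /PASSWORD_FILE.txt --table SCHEMA.TABLE_NAME --target-dir /TARGET_DIR_NEW --incremental append --check-column INSERT_DATE --last-value '2019-01-01 00:00:00.000' --split-by PRIMARY_KEY -m 2
sqoop import --connect jdbc:oracle:thin:@HOST:PORT:ORACLE_SERVICE --username USERNAME --password-file /PASSWORD_FILE.txt --table SCHEMA.TABLE_NAME --target-dir /TARGET_DIR_UPD --incremental lastmodified --check-column UPDATE_DATE --last-value '2019-01-01 00:00:00.000' --merge-key PRIMARY_KEY --split-by PRIMARY_KEY -m 2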

Sqoop import to route records with null values in a particular columns into another table

I am trying to move records with null values in a particular column into one table and non-null records into another during a Sqoop import. I tried to explore this on Google but there is not much beyond the --null-string and --null-non-string parameters, and those just replace nulls with the defined characters.
I can think of the following ways to handle it:
Once the data is imported into Hive, run a dedup to filter out the records, but this is something to try only in the worst case.
Handle it at the Sqoop level itself (no clue on this).
Could any expert here help me with the above ask?
ENV details: it's a plain Apache Hadoop cluster, Sqoop version 1.4.6.
We can try making use of the --query option along with the sqoop import command:
--query 'SELECT * FROM table WHERE column IS NULL AND $CONDITIONS'
And in a similar way for the NOT NULL condition.
There will be two Sqoop import jobs here, as sketched below.
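For instance, a minimal sketch of the two imports (the connection string, table, column and directory names are assumptions, not from the question):
sqoop import --connect jdbc:mysql://HOST/DB --username USER -P --query 'SELECT * FROM source_table WHERE some_column IS NULL AND $CONDITIONS' --target-dir /data/source_table_null --split-by id -m 2
sqoop import --connect jdbc:mysql://HOST/DB --username USER -P --query 'SELECT * FROM source_table WHERE some_column IS NOT NULL AND $CONDITIONS' --target-dir /data/source_table_not_null --split-by id -m 2
Note that with --query, Sqoop requires a --target-dir and either --split-by or -m 1, and $CONDITIONS must be kept literal (hence the single quotes); each import can also be pointed at its own Hive table with --hive-import --hive-table.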

Oracle table incremental import to HDFS

I have an Oracle table of 520 GB on which insert, update and delete operations are performed frequently. This table is partitioned on the ID column; however, there is no primary key defined and there is also no timestamp column available.
Can you please let me know the best way to perform an incremental import of this table to HDFS?
This totally depends on what your ID column is. If it is generated by an ordered sequence, that's easy: just load the table with --incremental append --check-column ID.
If the ID column is generated with a NOORDER sequence, allow for some overlap and filter it on the Hadoop side.
If ID is not unique, your only choice is a CDC tool: Oracle GoldenGate, Informatica PWX and so on. There are no open-source/free solutions that I'm aware of.
Also, you don't need any index to perform an incremental load with Sqoop, but an index will definitely help, as its absence will lead to full scan(s) of the source (and possibly very big) table.
Your problem is not that hard to solve; just look for some key things in your DB.
1. Check whether your ID column is NOT NULL; if so, use Sqoop for your task with the following options:
--incremental append/lastmodified --check-column [id column]
--split-by [id column] // this is useful if there is no primary key, as it allows you to run multiple mappers; without a split column and no primary key you have to specify -m 1 for one mapper only.
The preferred way is to do this task with a Sqoop job created via the --create tool, as in the sketch below.
For more information check https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_purpose_6
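A minimal sketch of such a saved job, assuming ID comes from an ordered sequence (the connection string, table and directory names are placeholders):
sqoop job --create big_table_incr -- import --connect jdbc:oracle:thin:@HOST:PORT:SERVICE --username USER --password-file /user/USER/.sqoop-password --table SCHEMA.BIG_TABLE --target-dir /data/big_table --incremental append --check-column ID --last-value 0 --split-by ID -m 4
Running sqoop job --exec big_table_incr appends only rows with ID greater than the last value stored by the job; deletes and in-place updates are not captured this way, which is where the CDC tools mentioned above come in.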
Hope this helps!

Sqoop to dynamically create hive partitioned table from oracle and import data

I have a table in Oracle (table name is TRCUS) with customer details, partitioned based on year & month.
Partition names in Oracle:
PERIOD_JAN_13,
PERIOD_FEB_13,
PERIOD_JAN_14,
PERIOD_FEB_14 etc
Now I want to import this table's data into Hive using Sqoop directly.
The Sqoop job should create a Hive table, dynamically create partitions based on the Oracle table partitions and then import the data into Hive, into the respective partitions.
How can this be achieved using Sqoop?
Unfortunately, it cannot be achieved directly using Sqoop. However, there is a workaround which I guess you might not know.
Create a table in Hive without any partitions.
Set dynamic partition modes
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
Import data into the non-partitioned Hive table using Sqoop
sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/database1" --username root --password cloudera --query 'SELECT DISTINCT id, count, name from test WHERE $CONDITIONS' --target-dir /user/hive/warehouse/ --hive-table pd_withoutpartition --hive-database database1 --hive-import --hive-overwrite -m 1 --direct
Create another table with partitions
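A minimal sketch of what this table could look like, assuming the columns from the query above (id and count, with assumed types) and name as the dynamic partition column:
CREATE TABLE pd_partition (id INT, count INT) PARTITIONED BY (name STRING) STORED AS TEXTFILE;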
Overwrite into the partitioned table from the previous table
INSERT OVERWRITE TABLE pd_partition partition(name) SELECT id, count, name from pd_withoutpartition;
Note: make sure that the column you want to partition by is mentioned last in the SELECT statement of the overwrite.
Hive Version : Hive 1.1.0-cdh5.13.1

How to deal with primary key while exporting data from Hive to RDBMS using Sqoop

Here is my scenario: I have data in the Hive warehouse and I want to export this data into a table named "sample" of the "test" database in MySQL. What happens if one column is the primary key in test.sample and the data in Hive (which we are exporting) has duplicate values under that key? Obviously the job will fail, so how can I handle this kind of scenario?
Thanks in Advance
If you want your mysql table to contain only the last row among the duplicates, you can use the following:
sqoop export --connect jdbc:mysql://<*ip*>/test --table sample --username root -P --export-dir /user/hive/warehouse/sample --update-key <*primary key column*> --update-mode allowinsert
While exporting, Sqoop converts each row into an insert statement by default. By specifying --update-key, each row can be converted into an update statement. However, if a particular row is not present for update, the row is skipped by default. This can be overridden by using --update-mode allowinsert, which allows such rows to be converted into insert statements.
Before performing the export operation, massage your data by removing duplicates on the primary key: take a DISTINCT on that primary column (or otherwise deduplicate) and then export to MySQL, for example as sketched below.
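A sketch of that deduplication in Hive, assuming the Hive table is sample, the primary-key column is id, and a ts column decides which duplicate row wins (the column names here are assumptions):
CREATE TABLE sample_dedup AS SELECT id, col1, col2 FROM (SELECT id, col1, col2, ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts DESC) AS rn FROM sample) t WHERE rn = 1;
The export would then point --export-dir at the warehouse directory of sample_dedup instead of the original table.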
