Oracle table incremental import to HDFS - hadoop

I have an Oracle table of 520 GB on which insert, update, and delete operations are performed frequently. The table is partitioned on an ID column, but there is no primary key defined and no timestamp column available.
Can you please let me know the best way to perform an incremental import of this table to HDFS?

This totally depends on what your "ID" column is. If it is generated by an ordered sequence, that's easy: just load the table with --incremental append --check-column ID.
If the ID column is generated by a NOORDER sequence, allow for some overlap and filter it out on the Hadoop side.
If ID is not unique, your only choice is a CDC tool: Oracle GG, Informatica PWX, and so on. There are no open-source/free solutions that I'm aware of.
Also, you don't need an index to perform an incremental load with Sqoop, but an index will definitely help, since its absence leads to full scans of the (possibly very big) source table.
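For the ordered-sequence case, a minimal sketch of such an import; the connection details, table, directory, and last value below are placeholders, not taken from the question:
# Hypothetical names throughout; only rows whose ID is greater than the
# last value recorded from the previous run are appended. With no primary
# key, --split-by is needed once more than one mapper is used.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --table BIG_TABLE \
  --incremental append \
  --check-column ID \
  --last-value 123456789 \
  --split-by ID \
  --target-dir /data/big_table \
  -m 4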

Your problem is not that hard to solve; just look for a few key things in your DB.
1. Check whether your ID column satisfies the condition "not NULL and 1=1"; if so, use Sqoop for your task with the following options:
--incremental append/lastmodified --check-column [id column]
--split-by [id column] // useful when there is no primary key, since it still lets you run multiple mappers; without a primary key and without --split-by, you have to specify -m 1 to run a single mapper.
The preferred way is to do this task as a saved Sqoop job, created with sqoop job --create.
For more information, check https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_purpose_6
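A rough sketch of that saved-job approach, with made-up job, table, and connection names; a saved job keeps the incremental state (the last imported value) in the Sqoop metastore, so each --exec continues where the previous run stopped:
# Create the job once (hypothetical names).
sqoop job --create big_table_incr -- import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --table BIG_TABLE \
  --incremental append \
  --check-column ID \
  --split-by ID \
  --target-dir /data/big_table
# Run it on each schedule; the stored last-value is updated automatically.
sqoop job --exec big_table_incr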
Hope this helps!

Related

How to sqoop big table from oracle db to hdfs?

One of my Oracle tables contains 265 million records. I need to push that table from the Oracle database to HDFS, but it doesn't have any primary key or unique column, so I can't use multiple mappers; if I use multiple mappers, I have to specify a split-by column.
What's the best way to sqoop the table?
Any leads are appreciated.
In order to use multiple mappers, you will need a --split-by parameter. The best column to choose is one that is not null in all 265M rows and evenly distributed. A primary key meets those criteria because it is sequential and present in every row.
Any column that is evenly distributed across the data set could be a good --split-by choice. The link #yammanuruarun posted includes the --boundary-query argument, which helps limit the work the RDBMS has to do to return those rows. I suggest trying Fibonacci values for the mapper count: -m 1, 2, 3, 5, 8.
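As an illustration (table, column, and connection details are assumptions, not from the question), a split on an evenly distributed numeric column combined with --boundary-query might look like this:
# --boundary-query controls how the split boundaries are computed; here
# it is a simple MIN/MAX, but it could return pre-computed or hard-coded
# values to spare the source table a full scan. Eight mappers share the range.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --table BIG_FACT \
  --split-by CREATED_ID \
  --boundary-query "SELECT MIN(CREATED_ID), MAX(CREATED_ID) FROM BIG_FACT" \
  --target-dir /data/big_fact \
  -m 8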
Also, check out:
How to find optimal number of mappers when running Sqoop import and export?

Incremental import of Oracle tables which does not have primary key to HDFS

I have an Oracle database with almost 300 tables; about 200 of them don't have any primary key and a few have composite primary keys. My requirement is to import all tables' data incrementally into HDFS. Can you please let me know how this can be achieved using Sqoop? It would be a great help if any other option is suggested.
Unfortunately, being unable to recognize updated rows (you indicate that you do not track update timestamps) makes it practically impossible to use incremental loads to capture the changes.
Some possibilities:
Add timestamps (see the sketch after this list)
Do a full load
Use the rownumber to identify new records, and don't process updated records
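If a timestamp column can in fact be added (the first option above), a lastmodified-mode import becomes possible. A minimal sketch, assuming a hypothetical LAST_UPD column that is maintained on every insert and update, with made-up connection details:
# Without a unique key, changed rows arrive as fresh copies, so --append
# is used here and deduplication has to happen downstream.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username scott -P \
  --table SOME_TABLE \
  --incremental lastmodified \
  --check-column LAST_UPD \
  --last-value "2019-01-01 00:00:00" \
  --append \
  --target-dir /data/some_table \
  -m 1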

Perform Incremental Sqoop on table that contains joins?

I have some very large tables that I am trying to sqoop from a source system data warehouse into HDFS, but I have limited bandwidth to do so. I would like to pull only the columns I need and minimize the run-time for getting the tables stood up.
The Sqoop import currently pulls something like this:
SELECT
ColumnA,
ColumnB,
....
ColumnN
FROM
TABLE_A
LEFT JOIN
TABLE_B
ON
...
LEFT JOIN
TABLE_N
....
Is it possible to perform an incremental sqoop, given that the data is stored in a star-schema format and the dimensions could update independently of the facts?
Or is the only solution to sqoop the entire table incrementally, for the columns I need, and perform the joins on the HDFS side?
For incremental imports you need to use the --incremental flag. Please refer to the link below for more info:
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports
You need to specify --incremental to tell Sqoop that you want an incremental load, --check-column to specify which column is used for incremental sqooping, and --last-value to say from which value you want to start sqooping the next load.
This is just half the picture; there are more ways to do this. For example, you can use the --query option, and your query would be something like SELECT * FROM table WHERE column > 123. This is basically the same thing: you would need to record the last/max value of the selected column and use it for the next import.
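For the join case, a sketch of that --query route (all table, column, and directory names are invented), following the free-form query pattern from the Sqoop user guide; $CONDITIONS is replaced by Sqoop with each mapper's split predicate:
# Only fact rows above the recorded last value are pulled, and the joins
# run on the source side so only the needed columns cross the wire.
sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/DWH \
  --username scott -P \
  --query "SELECT a.load_id, a.col_a, b.col_b FROM table_a a LEFT JOIN table_b b ON a.b_id = b.id WHERE a.load_id > 123 AND \$CONDITIONS" \
  --split-by a.load_id \
  --target-dir /data/table_a_incr \
  -m 4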

Sqoop: How to deal with duplicate values while importing data from RDBMS to Hive tables

How do you deal with duplicate values while importing data from an RDBMS into Hive tables using Sqoop?
Or, how do you handle redundancy if the values are already present in the Hive tables?
If your data has a unique identifier and you are running incremental imports, you can specify it as the --merge-key value of the import. This will merge the values that were already in the table with the newest ones; the newer rows override the older ones.
If you are not running incremental imports, you can use sqoop merge to unify the data.
From the Sqoop docs:
When merging the datasets, it is assumed that there is a unique primary key value in each record. The column for the primary key is specified with --merge-key. Multiple rows in the same dataset should not have the same primary key, or else data loss may occur.
The important thing is that you have a single unique primary key for each record. Otherwise, you can generate one when importing the data: run the import with --query and build the new unique-key column in the SELECT by concatenating existing columns until you get a unique combination, for example:
--query "SELECT CONVERT(VARCHAR(128), [colum1]) + '_' + CONVERT(VARCHAR(128), [column2]) AS CompoundKey ,* FROM [dbo].[tableName] WHERE \$CONDITIONS" \
There is no direct Sqoop option that provides the solution you are looking for. You will have to set up an EDW-style process to achieve your goal:
1. Import the data into a staging table (create a staging database in Hive for this purpose). This should be a copy of the target table, though the data types may vary depending on your transformation requirements.
2. Load the data from the staging table (Hive) into the target table (Hive), applying the transformations. In your case:
INSERT INTO TABLE trgt.table
SELECT * FROM stg.table stg_tbl
WHERE stg_tbl.col1 NOT IN (SELECT col1 FROM trgt.table);
Here trgt is the target database and stg is the staging database; both are in Hive.

How to deal with a primary key while exporting data from Hive to an RDBMS using Sqoop

Here is my scenario: I have data in the Hive warehouse and I want to export it into a table named "sample" in the "test" database in MySQL. What happens if one column is the primary key in test.sample and the Hive data we are exporting has duplicate values under that key? Obviously the job will fail, so how can I handle this kind of scenario?
Thanks in advance.
If you want your MySQL table to contain only the last row among the duplicates, you can use the following:
sqoop export --connect jdbc:mysql://<*ip*>/test --table sample --username root -P --export-dir /user/hive/warehouse/sample --update-key <*primary key column*> --update-mode allowinsert
While exporting, Sqoop converts each row into an insert statement by default. By specifying --update-key, each row can be converted into an update statement. However, if a particular row is not present for update, the row is skipped by default. This can be overridden by using --update-mode allowinsert, which allows such rows to be converted into insert statements.
Before performing the export operation, massage your data by removing the duplicates on the primary key: take a distinct on that primary column and then export to MySQL.
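One way to take that distinct is to keep a single row per key in a Hive staging table and export that directory instead; a sketch in HiveQL, with invented table and column names:
-- Keep one arbitrary row per primary-key value, then point
-- sqoop export's --export-dir at this table's location.
CREATE TABLE sample_dedup AS
SELECT id, col1, col2
FROM (
  SELECT id, col1, col2,
         row_number() OVER (PARTITION BY id ORDER BY col1) AS rn
  FROM sample
) t
WHERE rn = 1;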
