Sqoop: How to deal with duplicate values while importing data from RDBMS to Hive tables

Sqoop: How to deal with duplicate values while importing data from RDBMS to Hive tables.
Or, how to handle redundancy if the values are already present in the Hive tables?

If your data has a unique identifier and you are running incremental imports, you can pass it as the --merge-key value of the import. This will merge the rows that were already in the table with the newer ones; the newer record overrides the older one.
If you are not running incremental imports, you can use sqoop merge to unify the data.
From the sqoop docs:
When merging the datasets, it is assumed that there is a unique primary key value in each record. The column for the primary key is specified with --merge-key. Multiple rows in the same dataset should not have the same primary key, or else data loss may occur.
The important thing is that you have a single unique primary key for each record. Otherwise you can generate one when importing the data. To do so, run the import with --query and build the new key column in the select, concatenating existing columns until you get a unique combination:
--query "SELECT CONVERT(VARCHAR(128), [colum1]) + '_' + CONVERT(VARCHAR(128), [column2]) AS CompoundKey ,* FROM [dbo].[tableName] WHERE \$CONDITIONS" \

There is no direct sqoop option that provides the solution you are looking for. You will have to set up an EDW-style process to achieve your goal:
import the data into a staging table (hive - create a staging database for this purpose). This should be a copy of the target table, but the data types may vary depending on your transformation requirements (a rough sqoop command for this step is sketched below, after the example query).
load the data from the staging database table (hive) into the target database table (hive), applying your transformations. In your case:
Insert into table trgt.table
select * from stg.table stg_tbl
where stg_tbl.col1 not in (select col1 from trgt.table);
Here trgt is the target database and stg is the staging database - both are in hive.
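As a rough sketch of the first step (the connection string, table and database names are placeholders), the staging table could be loaded with a plain sqoop import:

sqoop import \
  --connect jdbc:mysql://dbhost/sourcedb \
  --username dbuser -P \
  --table source_table \
  --hive-import \
  --hive-table stg.source_table \
  --hive-overwrite

The insert-select above then moves only the rows whose key is not yet present in trgt.table.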

Related

Drop Columns while importing Data in sqoop

I am importing data from oracle to hive. My table doesn't have any integer column that can be used as a primary key, so I am not able to use it as my split-by column.
As an alternative I created a row_num column for all rows present in the table, and this row_num column is used as the split-by column. Finally, I want to drop this column from my hive table.
The column list is huge; I don't want to select all columns using --columns, nor do I want to create a temporary table for this purpose.
Please let me know whether we can handle this with sqoop arguments.
Can a little tweak of the --query parameter help you?
Something like below:
sqoop import --query 'query string'
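For example, something along these lines (Oracle connection string, schema, table and column names are placeholders; you still have to list the wanted columns in the query, but row_num never reaches the hive table). Because row_num is not in the select list, a --boundary-query is used so sqoop can still compute the split ranges:

sqoop import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username dbuser -P \
  --query 'SELECT col1, col2, col3 FROM schema_name.table_name WHERE $CONDITIONS' \
  --boundary-query 'SELECT MIN(row_num), MAX(row_num) FROM schema_name.table_name' \
  --split-by row_num \
  --target-dir /user/hive/warehouse/table_name \
  --hive-import \
  --hive-table table_name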

Oracle table incremental import to HDFS

I have an Oracle table of 520 GB, and insert, update and delete operations are performed on it frequently. The table is partitioned on an ID column, however there is no primary key defined and no timestamp column available.
Can you please let me know the best way to perform an incremental import of this table to HDFS?
This totally depends on what your "id" column is. If it is generated by an ordered sequence, that's easy: just load the table with --incremental append --check-column ID.
If the ID column is generated with a NOORDER sequence, allow for some overlap and filter it on the hadoop side.
If the ID is not unique, your only choice is a CDC tool: Oracle GoldenGate, Informatica PWX and so on. There are no open-source/free solutions that I'm aware of.
You also don't need an index to perform an incremental load with sqoop, but an index will definitely help, since its absence will lead to full scan(s) of the source (and possibly very big) table.
Your problem is not that hard to solve; just look for a few key things in your db.
1. Check that your id column is not NULL for every row. If so, you can use sqoop for your task with the following options:
--incremental append/lastmodified --check-column [column id]
--split-by [column id] // useful when there is no primary key; it lets you run multiple mappers. Without a primary key and without --split-by, you have to specify -m 1 to use a single mapper.
The preferred way is to run this as a saved sqoop job, created with the --create tool (see the sketch below).
For more information check https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_purpose_6
Hope this helps!
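A rough sketch of such a saved job, with placeholder connection details, table and directory names:

sqoop job --create oracle_id_incremental \
  -- import \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username dbuser -P \
  --table BIG_TABLE \
  --target-dir /data/big_table \
  --incremental append \
  --check-column ID \
  --last-value 0 \
  --split-by ID

sqoop job --exec oracle_id_incremental

The advantage of the saved job is that sqoop stores the last imported --last-value in its metastore and updates it after every --exec run, so you don't have to track it yourself.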

Perform Incremental Sqoop on table that contains joins?

I have some very large tables that I am trying to sqoop from a source system data warehouse into HDFS, but I have limited bandwidth to do so. I would like to pull only the columns I need and minimize the run time for getting the tables stood up.
The sqoop currently pulls something like this:
SELECT
ColumnA,
ColumnB,
....
ColumnN
FROM
TABLE_A
LEFT JOIN
TABLE_B
ON
...
LEFT JOIN
TABLE_N
....
Is it possible to perform an incremental sqoop, given that the data is stored in a star-schema format and the dimensions can update independently of the facts?
Or is the only solution to sqoop the entire table, for the columns that I need, incrementally, and perform the joins on the HDFS side?
For incremental imports you need to use the --incremental flag. Please refer to the link below for more info:
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_incremental_imports
You need to specify --incremental to tell sqoop that you want an incremental load, --check-column to specify which column is used for incremental sqooping, and --last-value to say from which value you want to start sqooping the next load.
This is just half the picture; there are more ways to do this. For example, you can use the --query option, and your query would be something like SELECT * FROM table WHERE column > 123. This is basically the same thing: you need to record the last/max value of the selected column and use it for the next import.
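A rough sketch of that --query variant (connection details, table and column names are placeholders; 123 stands for the last value recorded from the previous run):

sqoop import \
  --connect jdbc:oracle:thin:@//dwhost:1521/DWH \
  --username dbuser -P \
  --query 'SELECT * FROM fact_table WHERE id_column > 123 AND $CONDITIONS' \
  --split-by id_column \
  --target-dir /data/fact_table_increment

Note the single quotes around the query: they keep the shell from expanding $CONDITIONS, which sqoop replaces with the per-mapper split predicate.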

Does sqoop preserve the order of imported rows as in the database

I am sqooping a table from an oracle database to AWS S3 and then creating a hive table over it.
After importing the data, is the order of records present in the database preserved in the hive table?
I want to fetch a few hundred rows from the database, and from hive, using Java JDBC, and then compare each row present in the ResultSet. Assuming I don't have a primary key, can I compare the rows from both ResultSets as they appear (sequentially, using resultSet.next()), or does the order get changed due to the parallel import?
If the order isn't preserved, is ORDER BY a good option?
Order is not preserved during the import, and order is also not deterministic when selecting without ORDER BY or DISTRIBUTE BY + SORT BY, because the select is processed in parallel.
You need to specify ORDER BY when selecting the data; it does not matter how it was inserted.
ORDER BY orders all data and runs on a single reducer; DISTRIBUTE BY + SORT BY orders per reducer and works in distributed mode.
Also see this answer: https://stackoverflow.com/a/40264715/2700344
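As a small illustration in hive (table and column names are placeholders):

-- total ordering, runs on a single reducer
SELECT * FROM my_table ORDER BY id;

-- per-reducer ordering, runs in parallel
SELECT * FROM my_table DISTRIBUTE BY id SORT BY id;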

How to deal with a primary key while exporting data from Hive to RDBMS using sqoop

Here is my scenario: I have data in the hive warehouse and I want to export it into a table named "sample" of the "test" database in mysql. What happens if one column is the primary key in test.sample and the data in hive (which we are exporting) has duplicate values under that key? Obviously the job will fail, so how could I handle this kind of scenario?
Thanks in advance
If you want your mysql table to contain only the last row among the duplicates, you can use the following:
sqoop export --connect jdbc:mysql://<*ip*>/test --table sample --username root -P --export-dir /user/hive/warehouse/sample --update-key <*primary key column*> --update-mode allowinsert
While exporting, Sqoop converts each row into an insert statement by default. By specifying --update-key, each row can be converted into an update statement. However, if a particular row is not present for update, the row is skipped by default. This can be overridden by using --update-mode allowinsert, which allows such rows to be converted into insert statements.
Before performing the export operation, massage your data by removing the duplicates on the primary key. Take a distinct on that primary key column and then export to mysql.
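One possible way to do that deduplication in hive before the export (table and column names are placeholders; the ORDER BY inside the window decides which duplicate survives, so pick a meaningful column there):

INSERT OVERWRITE TABLE sample_dedup
SELECT id, col_a, col_b
FROM (
  SELECT id, col_a, col_b,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY col_a DESC) AS rn
  FROM sample_src
) t
WHERE rn = 1;

Then point --export-dir at the deduplicated table's location.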
