how to with deal primarykey while exporting data from Hive to rdbms using sqoop - hadoop

Here is a my scenario i have a data in hive warehouse and i want to export this data into a table named "sample" of "test" database in mysql. What happens if one column is primary key in sample.test and and the data in hive(which we are exporting) is having duplicate values under that key ,then obviously the job will fail , so how could i handle this kind of scenario ?
Thanks in Advance

If you want your mysql table to contain only the last row among the duplicates, you can use the following:
sqoop export --connect jdbc:mysql://<*ip*>/test -table sample --username root -P --export-dir /user/hive/warehouse/sample --update-key <*primary key column*> --update-mode allowinsert
While exporting, Sqoop converts each row into an insert statement by default. By specifying --update-key, each row can be converted into an update statement. However, if a particular row is not present for update, the row is skipped by default. This can be overridden by using --update-mode allowinsert, which allows such rows to be converted into insert statements.

Beforing performing export operation ,massage your data by removing duplicates from primary key. Take distinct on that primary column and then export to mysql.

Related

Drop Columns while importing Data in sqoop

I am importing data from oracle to hive . My table doesn't have any integer columns which can be used in my primary keys .So I am not able to use it in my split-by column.
Alternatively I created a row_num column for all rows present in the table . Then this row_num column will be used in split-by column. Finally I want to drop this column from my hive table.
Column list is huge ,I dont want to select all columns using --columns neither I want to create any temporary table for this purpose.
Please let me know whether we can handle this in sqoop arguments.
Can Any little tweek on the --query parameter help you?
Something below.
sqoop import --query 'query string'

Oracle table incremental import to HDFS

I have Oracle table of 520 GB and on this table insert, Update and delete operations are performed frequently.This table is partitioned on ID column however there is no primary key defined and also there is no timestamp column available.
Can you please let me know what is best way I can perform incremental import to HDFS on this table.
This totally depends on what is your "id" column. If it is generated by ordered sequence, that's easy, just load the table with --incremental append --check-column ID.
If ID column is generated with noorder sequence, allow for some overlap and filter it on hadoop side.
If ID is not unique, your only choice is a CDC tool. Oracle GG, Informatica PWX and so on. There are no opensource/free solitions that I'm aware of.
Also don't need any index to perform incremental load with sqoop but an index will definitely help as its absence will lead to fullscan(s) to the source (and possibly very big) table.
your problem is not that hard to solve, just look for some key things in you db.
1. is you column id run by conditions "not NULL and 1=1 ", if so then use sqoop for you task
using following sqoop tools
--incremental append/lastmodified -check-column [column id]
--split-by [column id] // this is useful if there is not primary key also allows you to run multiple mappers in case of no primary key, you have to specify -m 1 for one mapper only.
prefered way is to do this task using sqoop job using --create tool.
for more information check https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_purpose_6
Hope this Helps !

incremental sqoop to HIVE table

It is known that --incremental sqoop import switch doesn't work for HIVE import through SQOOP. But what is the workaround for that?
1)One thing I could make up is that we can create a HIVE table, and bring incremental data to HDFS through SQOOP, and then manually load them. but if we are doing it , each time do that load, the data would be overwritten. Please correct me if I am wrong.
2) How effective --query is when sqooping data to HIVE?
Thank you
You can do the sqoop incremental append to the hive table, but there is no straight option, below is one of the way you can achieve it.
Store the incremental table as an external table in Hive.
It is more common to be importing incremental changes since the last time data was updated and then merging it.In the following example, --check-column is used to fetch records newer than last_import_date, which is the date of the last incremental data update:
sqoop import --connect jdbc:teradata://{host name}/Database=retail —connection manager org.apache.sqoop.teradata.TeradataConnManager --username dbc -password dbc --table SOURCE_TBL --target-dir /user/hive/incremental_table -m 1 --check-column modified_date --incremental lastmodified --last-value {last_import_date}
second part of your question
Query is also a very useful argument you can leverage in swoop import, that will give you the flexibility of basic joins on the rdbms table and flexibility to play with the date and time formats. If I were in your shoes I would do this, using the query I will import the data in the way I need and than I will append it to my original table and while loading from temporary to main table I can play more with the data. I would suggest using query if the updates are not too frequent.

Sqoop import all the table primary and non primary key tables at a time

Can we import tables using sqoop having primary and non primary at a time.Eg i am hvaing 200 primary key tables and 200 non priamry key tables in database .How can we import 400 tables at a time?
In addition to Jamie's answer:
You can add --autoreset-to-one-mapper tag in your sqoop import-all-tables... command.
Say you are using 8 mappers (-m 8) in your command. Then using above tag tables with primary keys will split as per the number of mappers and tables without primary keys will be loaded using 1 mappers.
so, overall your efficiency will improve.
Check 1st point of sqoop documentation for details.
Yes, you can add the flag --m 1 to the import command of all your tables (both the 200 with primary key and the 200 without it).
By adding this option, Sqoop will only use one mapper to retrieve all the data from the tables, so your command would look like something like this:
sqoop import-all-tables --connect your-database --username user --password pwd --m 1

Sqoop: How to deal with duplicate values while importing data from RDBMS to Hive tables

Sqoop: How to deal with duplicate values while importing data from RDBMS to Hive tables.
Or to deal with redundancy option if the values are already available in Hive Tables?
If your data has a unique identifier and you are running incremental imports you can specify it on the -mergeKey value of the import. This will merge the values that where already on the table with the newest one. The newer will override the oldest.
If you are not running incremental imports you can use sqoop merge to unify data.
From sqoop docs :
When merging the datasets, it is assumed that there is a unique primary key value in each record. The column for the primary key is specified with --merge-key. Multiple rows in the same dataset should not have the same primary key, or else data loss may occur.
The important is that you do have a single unique primary key for each record. Otherwise you might generate one when importing the data. To do so you could generate the import with the --query and generate the new column with the unique key on the select of the data concatenating existing columns until you get a unique combination.
--query "SELECT CONVERT(VARCHAR(128), [colum1]) + '_' + CONVERT(VARCHAR(128), [column2]) AS CompoundKey ,* FROM [dbo].[tableName] WHERE \$CONDITIONS" \
There is no direct option from sqoop that will provide the solution that you are looking for. You will have to set up EDW kind of process to achieve your goal:
import data in staging table(hive - create staging database for this purpose) - this should be copy of target table, but data type may vary as per your transformations requirements.
load data from staging database table(hive) to target database table(hive) by doing transformations. in your case:
Insert into table trgt.table
select * from stg.table stg_tbl
where stg_tbl.col1 not in (select col1 from trgt.table);
here trgt is target database, stg is staging database - both are in hive.

Resources