Sqoop export from Hive to Oracle with different column names, number of columns and order of columns

The scenario is this: I have a Hive table with 10 columns and I want to export its data to an Oracle table using Sqoop.
However, the target Oracle table has 30 columns with different names than the Hive table's columns, and the column positions in the Oracle table are not the same as in the Hive table.
Can anyone suggest how to write the Sqoop export command for this case?

Try the command below. It is assumed that your Hive table is created as an external table, your data is located at /myhivetable/data/, fields are terminated by '|' and lines are terminated by '\n'.
In your RDBMS table, the 20 columns which are not going to be populated from the Hive data on HDFS should have default values or allow NULL values.
Suppose your database columns are DC1, DC2, DC3, ..., DC20, your Hive columns are c1, c2, c3, ..., c10, and your mapping is as below.
DC1 -- c8
DC2 -- c1
DC3 -- c2
DC4 -- c4
DC5 -- c3
DC6 -- c7
DC7 -- c10
DC8 -- c9
DC9 -- c5
DC10 -- c6
sqoop export \
--connect jdbc:oracle:thin:@//10.10.11.11:1234/db \
--table table1 \
--username user \
--password pwd \
--export-dir /myhivetable/data/ \
--columns "DC2,DC3,DC5,DC4,DC9,DC10,DC6,DC1,DC8,DC7" \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--input-null-string "\\\\N" \
--input-null-non-string "\\\\N"

First of all, you can't export data directly from Hive to Oracle.
You need to EXPORT the Hive table to HDFS first.
Sample command:
export table mytable to 'some_hdfs_location'
Or use the HDFS data location of your Hive table.
Command to check the location:
show create table mytable
So now you have the location of the data for your Hive table.
You can use the --columns argument in the Sqoop export command to choose the column order and the number of columns.
Different column names are not a problem.
Let me take a simple example.
Say you have a Hive table with columns c1, c2, c3
and an Oracle table with columns col1, col2, col3, col4, col5.
I want to map c1 to col2, c2 to col5 and c3 to col1.
I will use --columns "col2,col5,col1" in my sqoop command.
As per the Sqoop docs:
By default, all columns within a table are selected for export. You can select a subset of columns and control their ordering by using the --columns argument. This should include a comma-delimited list of columns to export. For example: --columns "col1,col2,col3". Note that columns that are not included in the --columns parameter need to have either defined default value or allow NULL values. Otherwise your database will reject the imported data which in turn will make Sqoop job fail.
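Putting the two steps together, a minimal sketch of the whole flow could look like the commands below; the Oracle connection details, table names and paths are placeholders, and it assumes the Hive table is stored as delimited text with the default ^A (\001) field separator:
# copy the Hive table's data to a known HDFS location (EXPORT writes a _metadata file and a data/ subdirectory)
hive -e "EXPORT TABLE mytable TO '/tmp/mytable_export';"
# export the files, binding the HDFS columns to the Oracle columns in the order given by --columns
sqoop export \
--connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
--username user \
--password pwd \
--table oracle_target \
--columns "col2,col5,col1" \
--export-dir /tmp/mytable_export/data \
--input-fields-terminated-by '\001' \
--input-lines-terminated-by '\n'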

There are 2 options:
1. As of now, sqoop export is quite limited (probably because this is not its main intended use case, which is the other direction); it only gives you the option to specify --export-dir, which is the table's warehouse directory, and it loads all columns. So you may need to load into a staging table first and then load that into the original base table with the relevant column mapping.
2. You can export the data from Hive using:
INSERT OVERWRITE DIRECTORY '/user/hive_exp/orders' select column1, column2 from hivetable;
Then use Oracle's native import tooling. This gives more flexibility.
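A rough sketch of this second option, assuming Hive 0.11+ (which allows a ROW FORMAT clause on INSERT OVERWRITE DIRECTORY); the local path and the mention of SQL*Loader are illustrative assumptions:
# dump the needed columns to HDFS with an explicit delimiter
hive -e "INSERT OVERWRITE DIRECTORY '/user/hive_exp/orders'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT column1, column2 FROM hivetable;"
# pull the delimited files to the local filesystem so an Oracle-side tool such as SQL*Loader can load them
hadoop fs -getmerge /user/hive_exp/orders /tmp/orders.csv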
Please update if you have a better solution.

Related

Can Sqoop update records in an Oracle RDBMS table that has a different column structure than the Hive table?

I'm a Hadoop newcomer trying to export data from Hive to Oracle. Can Sqoop update data in an Oracle table? Let's say
the Oracle table has columns A, B, C, D, E
and I stored data in the Hive table as B, C, E.
Can a Sqoop export update (just update, not upsert) with B and C as update keys, updating just the E column from Hive?
Please mention --update-key Prim_key_col_in_table. Note that --update-mode defaults to updateonly, so you don't have to mention it.
You can also add the --input-fields-terminated-by argument if you want to.
Here is a sample command:
sqoop export --connect jdbc:mysql://xxxxxx/mytable --username xxxxx --password xxxxx --table export_sqoop_mytable --update-key Prim_key_col_in_table --export-dir /user/ingenieroandresangel/datasets/mytable.txt -m 1
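Adapted to the Oracle case in the question (Hive data holding B, C, E in that order, with B and C as update keys), a sketch would look like this; the connection string, table name and export directory are placeholders:
sqoop export \
--connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
--username user \
--password pwd \
--table ORACLE_TABLE \
--columns "B,C,E" \
--update-key "B,C" \
--update-mode updateonly \
--export-dir /user/hive/warehouse/mydb.db/mytable \
--input-fields-terminated-by '\001' \
-m 1
--update-key accepts a comma-separated list of columns, and in updateonly mode rows whose B/C values find no match in Oracle are simply skipped rather than inserted.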

Sqoop export command for HDFS data that has leading spaces

I have data stored in HDFS, and the values have spaces before and after them. When I try to export to MySQL, it gives a NumberFormatException, but when I create the data without spaces it is inserted into MySQL successfully.
My question is: can't we export data which has spaces in it from HDFS to MySQL using the sqoop export command?
The data which I used:
1201, adi, sen manager, 30000, it
1201, pavan, jun manager, 5000, cs
1203, santhosh, junior, 60000, mech
I created the table like:
create table emp(id BIGINT,name varchar(20),desg varchar(20),salary BIGINT,dept varchar(20));
The sqoop command:
sqoop export \
--connect jdbc:mysql://127.0.0.1/mydb \
--username root \
--table emp \
--m 1 \
--export-dir /mydir \
--input-fields-terminated-by ',' \
--input-lines-terminated-by '\n'
Result: NumberFormatException, input string: '1201'
It can't parse the data.
I discussed this in a forum and they said to trim the spaces, but I want to know whether the spaces can be trimmed automatically while performing the sqoop export.
Can somebody give suggestions on this?
You can do one simple thing:
Create a temporary table in MySQL with all VARCHAR columns:
create table emp_temp(id varchar(20), name varchar(20), desg varchar(20), salary varchar(20), dept varchar(20));
Now create another table with numeric fields after TRIM() and CAST():
create table emp as select CAST(TRIM(id) AS UNSIGNED) AS id, name, desg, CAST(TRIM(salary) AS UNSIGNED) AS salary, dept FROM emp_temp;
Sqoop internally runs a MapReduce job.
A simple solution is to run a mapper that trims the spaces in your data, write the output to a different file, and run the sqoop export on the new file.
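If the data is small, one lightweight way to do that trim pass without writing a Java mapper is a shell pipeline; the /mydir_trimmed path and the part-* file pattern below are assumptions for the example (for large data the same sed expression could serve as the mapper of a Hadoop Streaming job):
# trim spaces around the delimiter and at line ends, write a cleaned copy back to HDFS
hadoop fs -mkdir -p /mydir_trimmed
hadoop fs -cat /mydir/part-* | sed 's/ *, */,/g; s/^ *//; s/ *$//' | hadoop fs -put -f - /mydir_trimmed/data.csv
# export the cleaned directory instead of the original one
sqoop export \
--connect jdbc:mysql://127.0.0.1/mydb \
--username root \
--table emp \
-m 1 \
--export-dir /mydir_trimmed \
--input-fields-terminated-by ',' \
--input-lines-terminated-by '\n'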

Sqoop incremental export using hcatalog?

Is there a way to use sqoop to do incremental exports? I am using the HCatalog integration for sqoop. I tried using the --last-value and --check-column options which are used for incremental import, but sqoop gave me an error saying the options were invalid.
I have not seen incremental arguments for sqoop export. The other way you could try is to create a control_table in Hive where you keep a log of the table name and the timestamp at which it was last exported, every time you export.
create table if not exists control_table (
table_name string,
export_date timestamp
);
insert into table control_table select 'export_table1' as table_name, from_unixtime(unix_timestamp()) as export_date;
export_table1 is the table you want to export incrementally; assume you have already executed the two statements above.
-- execute the statements below together each time you export
-- get the timestamp at which the table was last exported
create temporary table control_table_now as select table_name, max(export_date) as last_export_date from control_table group by table_name;
-- get the incremental rows
create table new_export_table1 as select field1, field2, field3, .... timestamp1 from export_table1 e, control_table_now c where c.table_name = 'export_table1' and e.timestamp1 >= c.last_export_date;
-- append to the control_table for the next run
insert into table control_table select 'export_table1' as table_name, from_unixtime(unix_timestamp()) as export_date;
Now export the new_export_table1 table, which contains only the incremental rows, using the sqoop export command.
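A minimal sketch of that final export, assuming the HCatalog integration mentioned in the question (the connection details and the target table name are placeholders):
sqoop export \
--connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
--username user \
--password pwd \
--table TARGET_TABLE \
--hcatalog-database default \
--hcatalog-table new_export_table1 \
-m 4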
By default sqoop does not support incremental updates with the HCatalog integration; when we try, it gives the following error:
Append mode for imports is not compatible with HCatalog. Please remove the parameter--append-mode
at org.apache.sqoop.tool.BaseSqoopTool.validateHCatalogOptions(BaseSqoopTool.java:1561)
You can use the query option to make it work, as described in this Hortonworks document.
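For reference, when the job in question is an import into an HCatalog table, the query-option workaround typically takes the shape sketched below: the incremental filter is pushed into a free-form --query instead of using --incremental/--append. Everything here (connection details, table and column names, the LAST_RUN shell variable) is a placeholder:
sqoop import \
--connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
--username user \
--password pwd \
--query "SELECT * FROM source_table WHERE updated_at > TO_TIMESTAMP('${LAST_RUN}', 'YYYY-MM-DD HH24:MI:SS') AND \$CONDITIONS" \
--hcatalog-database default \
--hcatalog-table my_hcat_table \
--split-by id \
-m 4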

Sqoop export of specific columns from HDFS to MySQL is not working properly

My HDFS file contains 5 columns:
emp_no,birth_date,first_name,last_name,hire_date
I want to export only 3 of them:
emp_no,first_name,last_name
I am doing it with:
sqoop export
--connect jdbc:mysql://mysql.example.com/sqoop
--username sqoop
--password sqoop
--table employees
--columns "emp_no,first_name,last_name"
--export-dir /user/dataset/employees
But I am getting emp_no, birth_date and first_name in the MySQL table.
I do get 3 columns in my table, but skipping the one column I don't want is not working with --columns in sqoop export.
I solved my problem. Actually, I had misunderstood the --columns option for export.
With the --columns option for export, we can select a subset of columns or control the ordering of the destination table's columns (e.g. the MySQL columns), not the HDFS columns.
The option decides the binding of the HDFS source columns to the destination-table columns listed in --columns.
E.g. if I mention --columns "col2,col3,col1" in the sqoop command,
where col1, col2, col3 are the MySQL table's columns,
then it will bind col2 with the first column of the HDFS source, col3 with the second column of the HDFS source, and so on.
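In other words, to actually skip a column you have to drop it from the HDFS data first and then export the narrower files. A rough sketch, assuming the file is comma-delimited; the /user/dataset/employees_3cols directory and the part-* pattern are made up for the example:
# keep only fields 1, 3 and 4 (emp_no, first_name, last_name) in a new HDFS directory
hadoop fs -mkdir -p /user/dataset/employees_3cols
hadoop fs -cat /user/dataset/employees/part-* | cut -d',' -f1,3,4 | hadoop fs -put -f - /user/dataset/employees_3cols/data.csv
# now --columns lines up one-to-one with the HDFS fields
sqoop export \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table employees \
--columns "emp_no,first_name,last_name" \
--export-dir /user/dataset/employees_3cols \
--input-fields-terminated-by ','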

Importing tables from multiple databases into Hadoop and Union

I have this specific scenario:
There are year-wise databases in SQL Server with names like "FOOXXYY", where XXYY denotes the fiscal year. Now I want to take a specific table "bar" from all these databases, union them into a single table in Hive and store it in HDFS.
What will be the best and fastest approach to go about it?
You need to create a database, create a partitioned table, add partitions, and run a separate sqoop import for each of the source databases to load its data into the corresponding partition. Here are sample code snippets.
Create the database and then a partitioned table like this:
CREATE TABLE `order_items`(
`order_item_id` int,
`order_item_order_id` int,
`order_item_order_date` string,
`order_item_product_id` int,
`order_item_quantity` smallint,
`order_item_subtotal` float,
`order_item_product_price` float)
PARTITIONED BY (
`order_month` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';
You can then add partitions using these commands:
alter table order_items add partition (order_month=201301);
alter table order_items add partition (order_month=201302);
Once the table is created, you can run describe formatted order_items. It will give the path of the table, and you can validate it using the dfs -ls command in hive.
From describe formatted
Location: hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/retail_ods.db/order_items
dfs -ls /apps/hive/warehouse/retail_ods.db/order_items
You will get 2 directories; capture the path.
By now you have the table and partitions for each of the years (in your case). Now you can use a sqoop import command for each database to query its table and copy the data into the respective partition.
You can find sample sqoop commands here. You can even pass a query as part of the sqoop import command (Google the Sqoop user guide).
sqoop import \
--connect "jdbc:mysql://sandbox.hortonworks.com:3306/retail_db" \
--username=retail_dba \
--password=hadoop \
--table order_items \
--target-dir /apps/hive/warehouse/retail_ods.db/order_items/order_month=201301 \
--append \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--outdir java_files
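For the original question (one "bar" table per fiscal-year database in SQL Server), the same import is simply repeated with a different databaseName and target partition. A rough sketch, assuming a Hive table partitioned by fiscal year whose partitions have already been added with alter table ... add partition as shown above, and assuming the SQL Server JDBC driver jar is on Sqoop's classpath; the host, credentials, year values and paths are placeholders:
for fy in 2013 2014 2015 2016; do
sqoop import \
--connect "jdbc:sqlserver://sqlhost:1433;databaseName=FOO${fy}" \
--username sql_user \
--password sql_pwd \
--table bar \
--target-dir /apps/hive/warehouse/mydb.db/bar_all/fiscal_year=${fy} \
--append \
--fields-terminated-by '|' \
--lines-terminated-by '\n'
done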
