Importing tables from multiple databases into Hadoop and Union - hadoop

I have this specific scenario:
There are year-wise databases in SQL Server with names like "FOOXXYY", where XXYY denotes the fiscal year. I want to take a specific table "bar" from all of these databases, union them into a single table in Hive, and store it in HDFS.
What will be the best and fastest approach to go about it?

You need to create a database, create a partitioned table, add partitions, and run four different Sqoop commands to connect to each of the databases and load data into the partitions. Here are sample code snippets.
Create the database and then the partitioned table like this (a sketch of the CREATE DATABASE step is shown after the DDL):
CREATE TABLE `order_items`(
`order_item_id` int,
`order_item_order_id` int,
`order_item_order_date` string,
`order_item_product_id` int,
`order_item_quantity` smallint,
`order_item_subtotal` float,
`order_item_product_price` float)
PARTITIONED BY (
`order_month` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|';
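As mentioned, create the database first; a minimal sketch, assuming the retail_ods name that appears in the warehouse paths below:
CREATE DATABASE IF NOT EXISTS retail_ods;
USE retail_ods;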
You can then add partitions using these commands:
alter table order_items add partition (order_month='201301');
alter table order_items add partition (order_month='201302');
Once the table is created, you can run describe formatted order_items. It will give the path of the table, which you can validate using the dfs -ls command in Hive.
From describe formatted
Location: hdfs://sandbox.hortonworks.com:8020/apps/hive/warehouse/retail_ods.db/order_items
dfs -ls /apps/hive/warehouse/retail_ods.db/order_items
You will get the two partition directories; capture the path.
By now you have the table and a partition for each of the years (in your case). Now you can use a Sqoop import command for each database to query the table and copy the data into the respective partition.
A sample Sqoop import command is shown below. You can even pass a query as part of the Sqoop import command (see the --query option in the Sqoop user guide).
sqoop import \
--connect "jdbc:mysql://sandbox.hortonworks.com:3306/retail_db" \
--username=retail_dba \
--password=hadoop \
--table order_items \
--target-dir /apps/hive/warehouse/retail_ods.db/order_items/order_month=201301 \
--append \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--outdir java_files
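Adapting that sample to the original question (SQL Server databases named FOOXXYY and a table bar), a rough sketch of the per-database loop; the host, credentials, fiscal-year values, and Hive database/partition names are assumptions to adjust for your environment:
# one Sqoop import per fiscal-year database, each landing in its own partition directory
# (requires the SQL Server JDBC driver on Sqoop's classpath)
for FY in 1516 1617 1718 1819; do
  sqoop import \
  --connect "jdbc:sqlserver://dbhost:1433;databaseName=FOO${FY}" \
  --username your_user \
  --password your_password \
  --table bar \
  --target-dir /apps/hive/warehouse/yourdb.db/bar/fiscal_year=${FY} \
  --append \
  --fields-terminated-by '|' \
  --lines-terminated-by '\n' \
  -m 1
done
Each target directory must correspond to a partition added beforehand with ALTER TABLE ... ADD PARTITION, exactly as shown for order_items above.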

Related

Does Sqoop support extracting data from partitioned oracle table

I have a very large Oracle table which is partitioned. I would like to ask whether, or how, Sqoop supports splitting based on Oracle partitions, e.g. one mapper doing the import from one Oracle partition.
Sqoop supports importing from an Oracle partitioned table via the Data Connector for Oracle and Hadoop (OraOop); see the Sqoop documentation.
The syntax is something like this:
sqoop import \
-Doraoop.disabled=false \
-Doraoop.import.partitions='"PARTITION-NAME","PARTITION-NAME1","PARTITION-NAME2"' \
--connect jdbc:oracle:thin:@XXX.XXX.XXX.XXX:15XX:SCHEMA_NAME \
--username user \
--password password \
--table SCHEMA.TABLE_NAME \
--target-dir /HDFS/PATH/ \
-m 1
A single mapper will be assigned to each partition, and the mappers will write data to HDFS simultaneously.
Make sure you have the dynamic-partitions property enabled, and that the maximum-number-of-partitions property is set higher than the number of partitions existing in Oracle, when you create the Hive table.
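For reference, these are the usual Hive properties for that; the limits below are assumed values, size them against your Oracle partition count:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=2000;
SET hive.exec.max.dynamic.partitions.pernode=1000;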

Copying data from HDFS to hive using SQOOP

I want to copy data from HDFS into a Hive table. I tried the code below; it doesn't throw any error, but the data is not copied into the mentioned Hive table. Here is my code:
sqoop import --connect jdbc:mysql://localhost/sampleOne \
--username root \
--password root \
--external-table-dir "/WithFields" \
--hive-import \
--hive-table "sampleone.customers"
where sampleone is the database in Hive, customers is the newly created table in Hive, and --external-table-dir is the HDFS path from which I want to load data into the Hive table. What else am I missing in the above code?
If the data is already in HDFS, you do not need Sqoop to populate a Hive table. The steps to do this are below:
This is the data in HDFS
# hadoop fs -ls /example_hive/country
/example_hive/country/country1.csv
# hadoop fs -cat /example_hive/country/*
1,USA
2,Canada
3,USA
4,Brazil
5,Brazil
6,USA
7,Canada
This is the Hive table creation DDL
CREATE TABLE sampleone.customers
(
id int,
country string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
Verify Hive table is empty
hive (sampleone)> select * from sampleone.customers;
<no rows>
Load Hive table
hive (sampleone)> LOAD DATA INPATH '/example_hive/country' INTO TABLE sampleone.customers;
Verify Hive table has data
hive (sampleone)> select * from sampleone.customers;
1 USA
2 Canada
3 USA
4 Brazil
5 Brazil
6 USA
7 Canada
Note: this approach will move the data from the /example_hive/country location on HDFS into the Hive warehouse directory backing the table (which is again on HDFS).
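If the files need to stay under /example_hive/country instead of being moved, a common alternative (not part of the original answer) is an external table pointing at that directory; a minimal sketch with the same schema:
CREATE EXTERNAL TABLE sampleone.customers_ext
(
id int,
country string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
LOCATION '/example_hive/country';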

sqoop export commands for the data which has spaces before in hdfs

I have data stored in HDFS, and the data has spaces before and after the values. When I try to export it to MySQL, it gives a NumberFormatException, but when I create the data without spaces, it is inserted into MySQL successfully.
My question is: can't we export data which has spaces from HDFS to MySQL using the Sqoop export command?
The data which I used:
1201, adi, sen manager, 30000, it
1201, pavan, jun manager, 5000, cs
1203, santhosh, junior, 60000, mech
I created the table like this:
create table emp(id BIGINT,name varchar(20),desg varchar(20),salary BIGINT,dept varchar(20));
The sqoop command I used:
sqoop export \
--connect jdbc:mysql://127.0.0.1/mydb \
--username root \
--table emp \
--m 1 \
--export-dir /mydir \
--input-fields-terminated-by ',' \
--input-lines-terminated-by '\n'
Result: NumberFormatException, input string: '1201'. It can't parse the data.
I discussed this in a forum and they said to trim the spaces, but I want to know whether the spaces can be trimmed automatically while performing the Sqoop export.
Can somebody give suggestions on this?
You can do one simple thing:
Create a temporary table in MySQL with all VARCHAR columns:
create table emp_temp(id varchar(20), name varchar(20), desg varchar(20), salary varchar(20), dept varchar(20));
Now create another table with numeric fields, applying TRIM() and CAST():
create table emp as select CAST(TRIM(id) AS UNSIGNED) AS id, name, desg, CAST(TRIM(salary) AS UNSIGNED) AS salary, dept FROM emp_temp;
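The step implied between those two statements is the Sqoop export into the staging table; a sketch that simply reuses the command from the question with the target table switched to emp_temp (paths and credentials unchanged):
sqoop export \
--connect jdbc:mysql://127.0.0.1/mydb \
--username root \
--table emp_temp \
--m 1 \
--export-dir /mydir \
--input-fields-terminated-by ',' \
--input-lines-terminated-by '\n'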
Sqoop internally runs a MapReduce job.
A simple solution is to run a mapper that trims the spaces in your data, write the output to a different location, and run the Sqoop export on the new files, as sketched below.
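A minimal shell sketch of that trim-then-re-export idea, done here with a simple HDFS pipe rather than a full MapReduce job (so it only suits modest data sizes); /mydir comes from the question, /mydir_trimmed is an assumed output path:
# strip leading/trailing spaces and spaces around the comma delimiter
hadoop fs -mkdir -p /mydir_trimmed
hadoop fs -cat /mydir/* \
  | sed 's/^[[:space:]]*//; s/[[:space:]]*$//; s/[[:space:]]*,[[:space:]]*/,/g' \
  | hadoop fs -put - /mydir_trimmed/part-00000
# then run the same sqoop export against the cleaned directory
sqoop export --connect jdbc:mysql://127.0.0.1/mydb --username root \
  --table emp --export-dir /mydir_trimmed \
  --input-fields-terminated-by ',' --input-lines-terminated-by '\n' -m 1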

Sqoop export from hive to oracle with different col names, number of columns and order of columns

The scenario is this: I have a Hive table with 10 columns, and I want to export the data from my Hive table to an Oracle table using Sqoop.
But the target Oracle table has 30 columns with different names than the Hive table's columns. Also, the column positions in the Oracle table are not the same as in the Hive table.
Can anyone please suggest how I can write the Sqoop export command for this case?
Try the below. It is assumed that your Hive table is created as an external table, your data is located at /myhivetable/data/, fields are terminated by '|', and lines are terminated by '\n'.
In your RDBMS table, the 20 columns which are not going to be populated from the Hive data on HDFS should have default values or allow null values.
Let us suppose your database columns are DC1, DC2, DC3, DC4, DC5, ..., your Hive columns are c1, c2, c3, ..., c10, and your mapping is as below.
DC1 -- c8
DC2 -- c1
DC3 -- c2
DC4 -- c4
DC5 -- c3
DC6 -- c7
DC7 -- c10
DC8 -- c9
DC9 -- c5
DC10 -- c6
sqoop export \
--connect jdbc:oracle:thin:@//10.10.11.11:1234/db \
--table table1 \
--username user \
--password pwd \
--export-dir /myhivetable/data/ \
--columns "DC2,DC3,DC5,DC4,DC9,DC10,DC6,DC1,DC8,DC7" \
--fields-terminated-by '|' \
--lines-terminated-by '\n' \
--input-null-string "\\\\N" \
--input-null-non-string "\\\\N"
First of all, you can't export data directly from Hive to Oracle.
You either need to EXPORT the Hive table to HDFS,
sample command:
export table mytable to 'some_hdfs_location';
or use the HDFS data location of your Hive table.
Command to check the location:
show create table mytable;
So now you have the location of your Hive table's data.
You can use the --columns argument in the Sqoop export command to choose the number and order of columns.
There is no problem with having different column names.
Let me take a simple example.
Say you have a Hive table with columns c1, c2, c3,
and an Oracle table with columns col1, col2, col3, col4, col5,
and you want to map c1 to col2, c2 to col5, and c3 to col1.
You would use --columns "col2,col5,col1" in the Sqoop command; a sketch is shown after the documentation quote below.
As per Sqoop docs,
By default, all columns within a table are selected for export. You can select a subset of columns and control their ordering by using the --columns argument. This should include a comma-delimited list of columns to export. For example: --columns "col1,col2,col3". Note that columns that are not included in the --columns parameter need to have either defined default value or allow NULL values. Otherwise your database will reject the imported data which in turn will make Sqoop job fail.
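A sketch of that command for the example above; the connection string, table name, export directory, and delimiter are placeholders, not taken from the original answer:
sqoop export \
--connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
--username user \
--password pwd \
--table ORACLE_TABLE \
--columns "col2,col5,col1" \
--export-dir /user/hive/warehouse/yourdb.db/hivetable \
--input-fields-terminated-by '\001' \
-m 1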
There are 2 options:
As of now, Sqoop export is fairly limited (presumably because export is not the primary intended use case, but rather the other way around); it only gives the option to specify --export-dir, which is the table's warehouse directory, and it loads all columns. So you may need to load into a staging table and then load that into the original base table with the relevant column mapping.
You can export the data from Hive using:
INSERT OVERWRITE DIRECTORY '/user/hive_exp/orders' select column1, column2 from hivetable;
Then use Oracle's native import tool. This gives more flexibility.
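A sketch of that second route, assuming the native tool meant here is SQL*Loader and that the query writes two columns; directory, file, and table names are placeholders:
Copy the exported files out of HDFS:
hadoop fs -getmerge /user/hive_exp/orders /tmp/orders.csv
Contents of orders.ctl (note: add ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' to the INSERT above, or change the terminator here to X'01' to match Hive's default output delimiter):
LOAD DATA
INFILE '/tmp/orders.csv'
APPEND INTO TABLE target_table
FIELDS TERMINATED BY ','
(column1, column2)
Run the load:
sqlldr user/password@ORCL control=orders.ctl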
Please update if you have a better solution.

how to define hive table structure using sqoop import-mainframe --create-hive-table command

We are trying to import a flat mainframe file and load it into a Hive table. I was able to import it and load it into a Hive table using sqoop import-mainframe, but my entire file is placed in one column, and that column does not even have a name.
Is there a possibility to define the table structure in the Sqoop import command itself?
We are using the below command to import from the mainframe and load it into the Hive table:
sqoop import-mainframe --connect mainframe.com --dataset mainframedataset --username xxxxx -P --hive-import --create-hive-table --hive-table table1 --warehouse-dir /warehouse/
Sample mainframe data:
ASWIN|1234|1000.00
XXXX|1235|200.00
YYYY|1236|150.00
Hive table create script generated by sqoop:
CREATE TABLE Employee ( DEFAULT_COLUMN STRING) COMMENT 'Imported by sqoop on 2016/08/26 02:12:04' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\012' STORED AS TEXTFILE
As per Sqoop docs,
By default, each record in a dataset is stored as a text record with a newline at the end. Each record is assumed to contain a single text field with the name DEFAULT_COLUMN. When Sqoop imports data to HDFS, it generates a Java class which can reinterpret the text files that it creates.
Your pipe-separated (PSV) file will be loaded to HDFS.
Now create table1 (the Hive table) yourself using:
CREATE TABLE table1 (Name string, Empid int, Amount float) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\012' STORED AS TEXTFILE;
Now run your sqoop import command without the --create-hive-table flag. It should work.
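For reference, that is the same command from the question minus the flag (everything else unchanged):
sqoop import-mainframe --connect mainframe.com --dataset mainframedataset --username xxxxx -P --hive-import --hive-table table1 --warehouse-dir /warehouse/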
