Wrong data types in Hive with Sqoop import from Oracle

I am trying to import Oracle tables into Hive directly with Sqoop.
The Oracle tables use the data types NUMBER, VARCHAR2, and RAW.
When I tried:
sqoop import ... --hive-import --hive-overwrite --hive-database default --fields-terminated-by '|' --hive-drop-import-delims --null-string '\\N' --null-non-string '\\N' --warehouse-dir "/test"
All columns in the Hive tables come out as either double or string, but I want int, date, etc. for the NUMBER(1) and DATE columns.
I have tried adding flags like
--map-column-hive O_abc=INT,O_def=DATE,pqr=INT,O_uvw=INT,O_xyz=INT.
Is there any way to do this automatically? I need to import 150 to 200 tables, and it is tedious to spell out the map-column overrides for every table.
Environment:
Hadoop-2.6.0
Sqoop-1.4.6
Hive-2.3.0
Java-1.8
two node cluster
Thanks in advance!

You could import all tables from Oracle to HDFS (sqoop import-all-tables {generic-args} {import-args}) and then create external or managed Hive tables on top of them, depending on your requirements.
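A rough sketch of that approach, assuming hypothetical connection details, a /test warehouse directory, and illustrative table and column names:
# Pull every table into HDFS under one warehouse directory (one subdirectory per table).
sqoop import-all-tables \
  --connect jdbc:oracle:thin:@//oraclehost:1521/ORCL \
  --username SCOTT --password tiger \
  --warehouse-dir /test \
  --fields-terminated-by '|' \
  -m 4

# Then declare each Hive table over the imported files with the exact types you want;
# TIMESTAMP is used here because Sqoop writes Oracle DATE values with a time component.
hive -e "CREATE EXTERNAL TABLE default.my_table (o_abc INT, o_def TIMESTAMP, pqr INT)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
         LOCATION '/test/MY_TABLE'"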

Related

Is it possible to import a table with sqoop and add an extra timestamp column?

Is it possible to use the sqoop import command to import a table from an Oracle database to a Hadoop cluster and add an extra column with the current timestamp (for troubleshooting purposes)? So far, I have the following command:
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect jdbc:oracle:thin:@//MY_ORACLE_SERVER --username USERNAME --password PASSWORD --target-dir /MyDIR --fields-terminated-by '\b' --table SOURCE_TABLE --hive-table DESTINATION_TABLE --hive-import --hive-overwrite --hive-delims-replacement '<newline>'
I would like to add a timestamp column to the table so that I know when that data was loaded. Is it possible?
Thanks in advance
You can use the free-form query import instead of the table import and call the timestamp function:
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect jdbc:oracle:thin:@//MY_ORACLE_SERVER --username USERNAME --password PASSWORD --target-dir /MyDIR --fields-terminated-by '\b' --query 'SELECT a.*, systimestamp FROM SOURCE_TABLE a WHERE $CONDITIONS' --hive-table DESTINATION_TABLE --hive-import --hive-overwrite --hive-delims-replacement '<newline>'
Maybe you could use sysdate instead of systimestamp (a smaller datatype, but less precision).
Alternatively, you can create a temporary Hive table with Sqoop, and then create a new Hive table from it with the extra required columns.
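A minimal Hive sketch of that second approach, assuming the Sqoop-loaded staging table is called destination_table_tmp (both table names are hypothetical):
# Build the final table from the staging table and stamp every row with the load time.
hive -e "CREATE TABLE destination_table AS
         SELECT t.*, current_timestamp() AS load_ts
         FROM destination_table_tmp t"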

How can I import a column of type SDO_GEOMETRY from Oracle to HDFS with Sqoop?

ISSUE
I'm using Sqoop to fetch data from Oracle and put it into HDFS. Unlike the other basic datatypes, I understand SDO_GEOMETRY is meant for spatial data.
My Sqoop job fails while fetching the SDO_GEOMETRY datatype.
I need help importing the column Shape, which has the SDO_GEOMETRY datatype, from Oracle into HDFS.
I have more than 1000 tables that use the SDO_GEOMETRY datatype; how can I handle the datatype in general while the Sqoop imports happen?
I have tried --map-column-java and --map-column-hive, but I still get the error.
Error:
ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Hive does not support the SQL type for column
SHAPE
SQOOP COMMAND
Below is the sqoop command that I have:
sqoop import --connect 'jdbc:oracle:thin:XXXXX/xxxxx@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(Host=xxxxxxx)(Port=1521))(CONNECT_DATA=(SID=xxxxx)))' -m 1 --create-hive-table --hive-import --fields-terminated-by '^' --null-string '\\\\N' --null-non-string '\\\\N' --hive-overwrite --hive-table PROD.PLAN1 --target-dir test/PLAN1 --table PROD.PLAN --map-column-hive SE_XAO_CAD_DATA=BINARY --map-column-java SHAPE=String --map-column-hive SHAPE=STRING --delete-target-dir
The default type mapping that Sqoop provides between relational databases and Hadoop does not work in your case, which is why the Sqoop job fails. You need to override the mapping, since geometry datatypes are not supported by Sqoop.
Use the parameter below in your sqoop job.
Syntax: --map-column-java col1=javaDatatype,col2=javaDatatype,...
sqoop import
.......
........
--map-column-java columnNameforSDO_GEOMETRY=String
Since your column name is Shape:
--map-column-java Shape=String
Sqoop import to HDFS
Sqoop does not support all of the RDBMS datatypes.
If a particular datatype is not supported, you will get an error like:
No Java type for SQL type .....
Solution
Add --map-column-java in your sqoop command.
Syntax: --map-column-java col-name=java-type,...
For example, --map-column-java col1=String,col2=String
Sqoop import to HIVE
You need the same --map-column-java option mentioned above.
By default, Sqoop supports these JDBC types and converts them into the corresponding Hive types:
INTEGER
SMALLINT
VARCHAR
CHAR
LONGVARCHAR
NVARCHAR
NCHAR
LONGNVARCHAR
DATE
TIME
TIMESTAMP
CLOB
NUMERIC
DECIMAL
FLOAT
DOUBLE
REAL
BIT
BOOLEAN
TINYINT
BIGINT
If your datatype is not in this list, you get an error like:
Hive does not support the SQL type for .....
Solution
You need to add --map-column-hive in your sqoop import command.
Syntax: --map-column-hive col-name=hive-type,...
For example, --map-column-hive col1=string,col2='varchar(100)'
Add --map-column-java SE_XAO_CAD_DATA=String,SHAPE=String --map-column-hive SE_XAO_CAD_DATA=BINARY,SHAPE=STRING in your command.
Don't repeat --map-column-java or --map-column-hive; put all the mappings in a single occurrence of each option.
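Applied to the command in the question, the relevant part looks roughly like this (the connection string is elided and the remaining options stay as they were):
sqoop import \
  --connect '<oracle-connection-string>' \
  --table PROD.PLAN \
  --target-dir test/PLAN1 --delete-target-dir \
  --hive-import --hive-overwrite --hive-table PROD.PLAN1 \
  --map-column-java SE_XAO_CAD_DATA=String,SHAPE=String \
  --map-column-hive SE_XAO_CAD_DATA=BINARY,SHAPE=STRING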
For importing SDO_GEOMETRY from Oracle to Hive through Sqoop,
use the Sqoop free-form query option along with Oracle's SDO_UTIL.TO_GEOJSON and SDO_UTIL.TO_WKTGEOMETRY functions.
The Sqoop --query option allows us to supply a SELECT query so that we fetch only the required data from the table, and in that query we can include SDO_UTIL package functions like TO_GEOJSON and TO_WKTGEOMETRY. It looks something like:
sqoop import \
...
--query 'SELECT ID, NAME,
    SDO_UTIL.TO_GEOJSON(MYSHAPECOLUMN),
    SDO_UTIL.TO_WKTGEOMETRY(MYSHAPECOLUMN)
FROM MYTABLE WHERE $CONDITIONS' \
...
This returns the SDO_GEOMETRY as GeoJSON and WKT, as per the function definitions, and the values can be inserted directly into the corresponding Hive STRING columns without any other type mapping in the Sqoop command.
Choose GeoJSON or WKT as required; this approach can also be extended to the other spatial functions available.
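Putting that together, a fuller sketch (host, credentials, target directory, and the MYTABLE/MYSHAPECOLUMN names are placeholders; the column aliases are added so the Hive columns get clean names):
sqoop import \
  --connect jdbc:oracle:thin:@//oraclehost:1521/ORCL \
  --username USERNAME --password PASSWORD \
  --query 'SELECT ID, NAME,
             SDO_UTIL.TO_GEOJSON(MYSHAPECOLUMN) AS SHAPE_GEOJSON,
             SDO_UTIL.TO_WKTGEOMETRY(MYSHAPECOLUMN) AS SHAPE_WKT
           FROM MYTABLE WHERE $CONDITIONS' \
  --split-by ID \
  --target-dir /tmp/mytable_geo \
  --hive-import --hive-table default.mytable_geo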

sqoop import as parquet file to target dir, but can't find the file

I have been using Sqoop to import data from MySQL into Hive; the command I used is below:
sqoop import --connect jdbc:mysql://localhost:3306/datasync \
--username root --password 654321 \
--query 'SELECT id,name FROM test WHERE $CONDITIONS' --split-by id \
--hive-import --hive-database default --hive-table a \
--target-dir /tmp/yfr --as-parquetfile
The Hive table is created and the data is inserted; however, I cannot find the parquet file.
Does anyone know where it is?
Best regards,
Feiran
Sqoop import to Hive works in two steps:
Fetching data from the RDBMS into HDFS
Creating the Hive table (if it does not exist) and loading the data into it
In your case,
First, the data is stored at the --target-dir, i.e. /tmp/yfr.
Then, it is loaded into the Hive table a using the
LOAD DATA INPATH ... INTO TABLE ...
command.
As mentioned in the comments, the data is moved to the Hive warehouse directory; that is why there is no data left in --target-dir.
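If you want to confirm where the files ended up, a quick check (assuming the default database and the usual warehouse location):
# Ask Hive where table "a" is stored, then list the files at that location.
hive -e "DESCRIBE FORMATTED default.a" | grep -i location
hdfs dfs -ls /user/hive/warehouse/a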

Preserving datatypes while importing tables using Sqoop

Is there any way to preserve data types while importing tables from Netezza to Hive using Sqoop? For example, Column1 with the INT datatype in Netezza should be imported as INT, not STRING. I am using
sqoop import --connect jdbc:netezza://server/db --username --password --table table_name --hive-table table_nm
Using --map-column-hive I can manually map a column, but I want to import more than 50 tables from Netezza to Hive and I want to automate the process.

Configuring Sqoop with MySQL?

I have successfully installed Sqoop; the problem now is how to use it with an RDBMS and how to load data from the RDBMS into HDFS using Sqoop.
Using Sqoop, you can load data directly into Hive tables or store the data in a target directory in HDFS.
If you need to copy data from the RDBMS into a directory:
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password {In case no password Do not Specify it}
--table tableName
--columns column_name(s) {in case you need only specific columns}
--target-dir '/tmp/myfolder'
--boundary-query 'Select min,max from table name'
--m 5 {set number of mappers to 5}
--fields-terminated-by ',' {how do you want your data to look in target file}
Boundary query: this is something you can specify. If you do not specify it, then by default it is run as an inner query, which adds up to a more complex query.
If you specify it explicitly, it runs as a normal query, and hence performance is improved.
You may also want to restrict the number of rows, say based on a column ID, and suppose you need the data with IDs from 1 to 1000. Then, using the boundary condition and --split-by, you can restrict the imported data:
--boundary-query "select 0,1000 from employee'
--split-by ID
Split-by: you use --split-by on a Sqoop import to specify the column on which the split is based. By default, if you do not specify it, Sqoop picks the table's primary key as the split-by column.
Split-by distributes the table's rows across the mappers, and each mapper writes its own output files. By default, the number of mappers is 4.
This may seem unimportant, but if you have a composite primary key or no primary key at all, Sqoop cannot pick a split column and may error out.
Note: you may not face any issue if you set the number of mappers to 1. In that case no split-by condition is used, since there is only one mapper, so the query runs fine. This can be done using
--m 1
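A short sketch of the no-primary-key case (ConnectionString, tableName, and ID are placeholders, as above):
# Option 1: name a split column explicitly when there is no usable primary key.
sqoop import --connect ConnectionString --username username --password pwd \
  --table tableName --split-by ID -m 4 --target-dir '/tmp/myfolder'

# Option 2: no suitable split column at all, so fall back to a single mapper.
sqoop import --connect ConnectionString --username username --password pwd \
  --table tableName -m 1 --target-dir '/tmp/myfolder'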
If you need to copy data from the RDBMS into a Hive table:
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password {In case no password Do not Specify it}
--table tableName
--boundary-query 'Select min,max from table name'
--m 5 {set number of mappers to 5}
--hive-import
--hive-table serviceorderdb.productinfo
Running a query instead of importing the entire table:
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password
--query "select name from employees where name like '%s' and \$CONDITIONS"
--m 5 {set number of mappers to 5}
--target-dir '/tmp/myfolder'
--fields-terminated-by ',' {how do you want your data to look in target file}
You may have noticed the extra parameter $CONDITIONS. This is because this time you specified no table, only an explicit query. When Sqoop runs, it looks for boundary conditions, which it does not find; it then looks for a table and a primary key to build a boundary query from, which again it does not find. Hence we include $CONDITIONS in the query as a placeholder into which Sqoop injects its boundary condition at run time.
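To make that concrete, a small sketch (database, table, and column names are placeholders); $CONDITIONS is escaped because the query is wrapped in double quotes:
# With --split-by id and -m 4, Sqoop replaces $CONDITIONS with a different range
# predicate per mapper, for example "id >= 1 AND id < 251" for the first split.
sqoop import \
  --connect jdbc:mysql://localhost/test_database \
  --username root --password pwd \
  --query "select id, name from employees where \$CONDITIONS" \
  --split-by id \
  --target-dir '/tmp/myfolder' \
  -m 4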
Checking if your connection is set up properly: for this you can just run list-databases, and if you see your databases listed, then your connection is fine.
$ sqoop list-databases
--connect jdbc:mysql://localhost/
--username root
--password pwd
Connection strings for different databases:
MySQL: jdbc:mysql://<hostname>:<port>/<dbname>
jdbc:mysql://127.0.0.1:3306/test_database
Oracle: jdbc:oracle:thin:@//<host_name>:<port_number>/<service_name>
jdbc:oracle:thin:scott/tiger@//myhost:1521/myservicename
You may learn more about sqoop imports from : https://sqoop.apache.org/docs/1.4.1-incubating/SqoopUserGuide.html
By using the sqoop import command you can import data from an RDBMS into HDFS, Hive, or HBase.
sqoop import --connect jdbc:mysql://localhost:portnumber/DBName --username root --table emp --password root -m 1
With this command the data will be stored in HDFS.
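To verify the import, a quick sketch assuming the default layout, where the files land in a directory named after the table under your HDFS home directory:
hdfs dfs -ls emp
hdfs dfs -cat emp/part-m-00000 | head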
Sample commands to run sqoop import (load data from RDBMS to HDFS):
Postgres
sqoop import --connect jdbc:postgresql://postgresHost/databaseName
--username username --password 123 --table tableName
MySQL
sqoop import --connect jdbc:mysql://mysqlHost/databaseName --username username --password 123 --table tableName
Oracle*
sqoop import --connect jdbc:oracle:thin:@oracleHost:1521/databaseName --username USERNAME --password 123 --table TABLENAME
SQL Server
sqoop import --connect 'jdbc:sqlserver://sqlserverhost:1433;database=dbname;username=<username>;password=<password>' --table tableName
*Sqoop won't find any columns from a table if you don't specify both the username and the table in correct case. Usually, specifying both in uppercase will resolve the issue.
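For example, a hedged sketch with both the username and the table uppercased (SCOTT and EMP are placeholder names):
sqoop import --connect jdbc:oracle:thin:@oracleHost:1521/databaseName \
  --username SCOTT --password 123 --table EMP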
Read the Sqoop User's Guide: https://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html
I also recommend the Apache Sqoop Cookbook. You will learn how to use import and export tools, do incremental import jobs, save jobs, solve problems with jdbc drivers and much more. http://shop.oreilly.com/product/0636920029519.do
