How can I import a column of type SDO_GEOMETRY from Oracle to HDFS with Sqoop?

ISSUE
I'm using Sqoop to fetch data from Oracle and put it into HDFS. Unlike the other basic datatypes, I understand that SDO_GEOMETRY is meant for spatial data.
My Sqoop job fails while fetching the SDO_GEOMETRY datatype.
I need help importing the column Shape, which has the SDO_GEOMETRY datatype, from Oracle to HDFS.
I have more than 1000 tables that have the SDO_GEOMETRY datatype; how can I handle the datatype in general while the Sqoop imports happen?
I have tried --map-column-java and --map-column-hive, but I still get the error.
Error:
ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Hive does not support the SQL type for column SHAPE
SQOOP COMMAND
Below is the sqoop command that I have:
sqoop import --connect 'jdbc:oracle:thin:XXXXX/xxxxx@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(Host=xxxxxxx)(Port=1521))(CONNECT_DATA=(SID=xxxxx)))' -m 1 --create-hive-table --hive-import --fields-terminated-by '^' --null-string '\\\\N' --null-non-string '\\\\N' --hive-overwrite --hive-table PROD.PLAN1 --target-dir test/PLAN1 --table PROD.PLAN --map-column-hive SE_XAO_CAD_DATA=BINARY --map-column-java SHAPE=String --map-column-hive SHAPE=STRING --delete-target-dir

The default type mapping that Sqoop provides between relational databases and Hadoop does not work in your case; that is why the Sqoop job fails. You need to override the mapping, because geometry datatypes are not supported by Sqoop.
Use the parameter below in your sqoop job.
Syntax: --map-column-java col1=javaDatatype,col2=javaDatatype,...
sqoop import
.......
........
--map-column-java columnNameforSDO_GEOMETRY=String
Since your column name is SHAPE:
--map-column-java SHAPE=String

Sqoop import to HDFS
Sqoop does not support all of the RDBMS datatypes.
If a particular datatype is not supported, you will get an error like:
No Java type for SQL type .....
Solution
Add --map-column-java in your sqoop command.
Syntax: --map-column-java col-name=java-type,...
For example, --map-column-java col1=String,col2=String
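For instance, a minimal HDFS-only import might look like the sketch below (the connection string, credentials, and target directory are placeholders rather than values from the question; the two mapped columns are the ones the question mentions):
sqoop import \
--connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
--username scott -P \
--table PROD.PLAN \
--target-dir /user/hadoop/plan \
--map-column-java SHAPE=String,SE_XAO_CAD_DATA=String \
-m 1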
Sqoop import to HIVE
You need the same --map-column-java mentioned above.
By default, Sqoop supports these JDBC types and converts them to the corresponding Hive types:
INTEGER
SMALLINT
VARCHAR
CHAR
LONGVARCHAR
NVARCHAR
NCHAR
LONGNVARCHAR
DATE
TIME
TIMESTAMP
CLOB
NUMERIC
DECIMAL
FLOAT
DOUBLE
REAL
BIT
BOOLEAN
TINYINT
BIGINT
If your datatype is not in this list, you will get an error like:
Hive does not support the SQL type for .....
Solution
You need to add --map-column-hive in your sqoop import command.
Syntax: --map-column-hive col-name=hive-type,...
For example, --map-column-hive col1=string,col2='varchar(100)'
Add --map-column-java SE_XAO_CAD_DATA=String,SHAPE=String --map-column-hive SE_XAO_CAD_DATA=BINARY,SHAPE=STRING to your command.
Don't use multiple --map-column-java or --map-column-hive options; combine all the mappings into a single occurrence of each.
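Applied to the command in the question, a corrected invocation might look like this sketch (connection details masked as in the original; note the single --map-column-java and single --map-column-hive, each carrying both columns):
sqoop import \
--connect 'jdbc:oracle:thin:XXXXX/xxxxx@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(Host=xxxxxxx)(Port=1521))(CONNECT_DATA=(SID=xxxxx)))' \
-m 1 \
--table PROD.PLAN \
--target-dir test/PLAN1 \
--delete-target-dir \
--hive-import --create-hive-table --hive-overwrite \
--hive-table PROD.PLAN1 \
--fields-terminated-by '^' \
--null-string '\\N' --null-non-string '\\N' \
--map-column-java SE_XAO_CAD_DATA=String,SHAPE=String \
--map-column-hive SE_XAO_CAD_DATA=BINARY,SHAPE=STRING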

To import SDO_GEOMETRY from Oracle to Hive through Sqoop,
use the Sqoop free-form query option along with Oracle's SDO_UTIL.TO_GEOJSON and SDO_UTIL.TO_WKTGEOMETRY functions.
The Sqoop --query option lets us supply a SELECT statement so that we fetch only the required data from the table, and in that SQL query we can include SDO_UTIL package functions such as TO_GEOJSON and TO_WKTGEOMETRY. It looks something like:
sqoop import \
...
--query 'SELECT ID, NAME, \
SDO_UTIL.TO_GEOJSON(MYSHAPECOLUMN), \
SDO_UTIL.TO_WKTGEOMETRY(MYSHAPECOLUMN) \
FROM MYTABLE WHERE $CONDITIONS' \
...
This returns the SDO_GEOMETRY in GeoJSON and WKT formats, as per the function definitions, and the values can be loaded directly into Hive STRING columns without any other type mapping in the Sqoop command.
Choose GeoJSON or WKT as required; the approach can also be extended to the other spatial functions available.
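A slightly fuller sketch of that import (the connection details, table, and column names are hypothetical; the column aliases keep the generated Hive column names readable, and --target-dir is required when --query is used):
sqoop import \
--connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
--username scott -P \
--query 'SELECT ID, NAME, SDO_UTIL.TO_GEOJSON(MYSHAPECOLUMN) AS SHAPE_GEOJSON, SDO_UTIL.TO_WKTGEOMETRY(MYSHAPECOLUMN) AS SHAPE_WKT FROM MYTABLE WHERE $CONDITIONS' \
--map-column-java SHAPE_GEOJSON=String,SHAPE_WKT=String \
--hive-import --hive-table prod.mytable \
--target-dir /user/hadoop/mytable \
-m 1
Both SDO_UTIL functions return CLOBs, so mapping the aliased columns to String lands them in Hive STRING columns.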

Related

Sqoop Import of Data having new line character in avro format and then query using hive

My requirement is to load data from an RDBMS into HDFS (backed by CDH 5.9.x) via Sqoop (1.4.6) in Avro format and then use an external Hive (1.1) table to query the data.
Unfortunately, the data in the RDBMS contains some newline characters.
We all know that Hive can't parse newline characters in the data, and the data mapping fails when the whole dataset is selected via Hive. However, Hive's select count(*) works fine.
I used the options below during the Sqoop import and checked, but they didn't work:
--hive-drop-import-delims
--hive-delims-replacement
The above options work for the text format, but storing the data in text format is not a viable option for me.
The above options are applied correctly in the toString method of the Sqoop-generated (codegen) POJO class (obviously, since the text format works as expected), so I feel this method is not used at all during the Avro import, probably because Avro has no problem dealing with newline characters whereas Hive does.
I am surprised that nobody else seems to face such a common scenario; any table with a remarks or comments field is prone to this problem.
Can anyone suggest a solution, please?
My command:
sqoop import \
-Dmapred.job.queue.name=XXXX \
--connect jdbc:oracle:thin:@Masked:61901/AgainMasked \
--table masked.masked \
--username masked \
-P \
--target-dir /user/masked/ \
--as-avrodatafile \
--map-column-java CREATED=String,LAST_UPD=String,END_DT=String,INFO_RECORD_DT=String,START_DT=String,DB_LAST_UPD=String,ADDR_LINE_3=String \
--hive-delims-replacement ' ' \
--null-string '\\N' \
--null-non-string '\\N' \
--fields-terminated-by '\001' \
-m 1
This looks like an issue with the Avro SerDe. It is an open bug:
https://issues.apache.org/jira/browse/HIVE-14044
Can you try the same in Hive 2.0?
As mentioned by VJ, there is an open issue for newline characters in Avro.
An alternative approach that you can try is:
Sqoop the data into a Hive staging table stored as textfile.
Create an Avro table.
Insert the data from the staging table into the main Avro table in Hive.
Newline characters are handled very well in the textfile format; a sketch of this flow is shown below.
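A rough sketch of that staging flow (the Hive database and table names, and the example columns, are hypothetical; step 1 cleans delimiters and newlines on the way in, steps 2 and 3 move the data into an Avro-backed table):
# 1) Stage as a text-format Hive table, replacing delimiters/newlines during import
sqoop import \
--connect jdbc:oracle:thin:@Masked:61901/AgainMasked \
--table masked.masked --username masked -P \
--hive-import --hive-table staging.masked_txt \
--hive-delims-replacement ' ' \
--fields-terminated-by '\001' \
-m 1
# 2) Create the Avro table with matching columns (schema abbreviated here)
hive -e "CREATE TABLE IF NOT EXISTS prod.masked_avro (id INT, remarks STRING) STORED AS AVRO"
# 3) Copy from the staging table into the Avro table
hive -e "INSERT INTO TABLE prod.masked_avro SELECT * FROM staging.masked_txt"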

Wrong data types in hive with sqoop import from Oracle

I am trying to import Oracle tables directly into Hive with Sqoop.
The Oracle tables use the data types NUMBER, VARCHAR2, and RAW.
When I tried:
sqoop import ... --hive-import --hive-overwrite --hive-database default --fields-terminated-by '|' --hive-drop-import-delims --null-string '\\N' --null-non-string '\\N' --warehouse-dir "/test"
all data types in the Hive tables end up as either double or string, but I want int, date, etc. for NUMBER(1), DATE, and similar types.
I have tried adding a few mappings like
--map-column-hive O_abc=INT,O_def=DATE,pqr=INT,O_uvw=INT,O_xyz=INT
Is there any way I can automate this? I need to import 150 to 200 tables, and it's tedious to specify all the map-column options for every table.
Environment:
Hadoop-2.6.0
Sqoop-1.4.6
Hive-2.3.0
Java-1.8
two node cluster
Thanks in advance!
You could import all tables from Oracle to HDFS (sqoop import-all-tables {generic-args} {import-args}) and then create external or internal Hive tables based on your requirements.
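For example (a sketch; the connection details and warehouse directory are placeholders, and --warehouse-dir is the parent HDFS directory under which Sqoop creates one subdirectory per table):
sqoop import-all-tables \
--connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
--username scott -P \
--warehouse-dir /user/hadoop/oracle_import \
-m 1
Hive external tables can then be pointed at the per-table directories.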

How do I import an Oracle xmlType column to HDFS with Sqoop?

I have an Oracle table with XML data stored in it (xmlType). I'm trying to Sqoop it to HDFS with the command below, but the XML field shows up as null in the HDFS file.
sqoop import --connect jdbc:oracle:thin:@DBconnString \
--username uname --password pwd \
--delete-target-dir \
--table sample \
--map-column-java column1=String
Can anyone suggest what I am doing wrong?
It is a Sqoop limitation: xmlType is not supported.
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_supported_data_types
There is a workaround described at https://issues.apache.org/jira/browse/SQOOP-2749, which is essentially to convert your xmlType column to a CLOB and then map it to String using the following option:
--map-column-java "XMLRECORD=String"

incremental "lastmodified" not working in sqoop

I'm trying to use Sqoop to perform an incremental import from a Teradata DB to Hive. Below is the command:
sqoop import --connect jdbc:teradata://xxx.xxx.x.xx/DATABASE=DBN --driver com.teradata.jdbc.TeraDriver --username userN --password pass --query "SELECT alias.colA, alias.call_date, alias.colB, alias.colC FROM tableName alias where \$CONDITIONS" --target-dir /apps/hive/warehouse/staging.db/tableName -m 26 --check-column call_date --incremental append --split-by alias.colA --last-value '2016-02-01'
The column call_date is of DATE type, values in the format 'YYYY-MM-DD'.
When I use 'append' for --incremental, everything works fine. But when I put 'lastmodified', the following error is thrown:
ERROR util.SqlTypeMap: It seems like you are looking up a column that does not
ERROR util.SqlTypeMap: exist in the table. Please ensure that you've specified
ERROR util.SqlTypeMap: correct column names in Sqoop options.
ERROR tool.ImportTool: Imported Failed: column not found: call_date
I'm using Sqoop 1.4.4.2.1 on HDP 2.1,
while the Teradata DB is 14.10.
Any pointers will be helpful.
I think, in the case of a free-form query, you can perform the last-value check in the query itself, something like this:
"SELECT alias.colA, alias.call_date, alias.colB, alias.colC FROM tableName alias WHERE call_date > '2016-02-01' AND \$CONDITIONS"
Reference (see the section Incrementally Updating Data in Hive > 1. Ingest the data):
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.0/bk_dataintegration/content/incrementally-updating-hive-table-with-sqoop-and-ext-table.html
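Putting that together with the command from the question, a sketch with the date filter pushed into the query (the ANSI date literal DATE '2016-02-01' is one way to express the cut-off; adjust the filter and mapper count to your needs):
sqoop import \
--connect jdbc:teradata://xxx.xxx.x.xx/DATABASE=DBN \
--driver com.teradata.jdbc.TeraDriver \
--username userN --password pass \
--query "SELECT alias.colA, alias.call_date, alias.colB, alias.colC FROM tableName alias WHERE alias.call_date > DATE '2016-02-01' AND \$CONDITIONS" \
--target-dir /apps/hive/warehouse/staging.db/tableName \
--split-by alias.colA \
-m 26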

Sqoop Hive table import, Table dataType doesn't match with database

Using Sqoop to import data from Oracle to Hive works fine, but it creates the Hive table with only two data types, STRING and DOUBLE. I want to use TIMESTAMP as the data type for some columns.
How can I do that?
bin/sqoop import --table TEST_TABLE --connect jdbc:oracle:thin:@HOST:PORT:orcl --username USER1 --password password --hive-import --hive-home /user/lib/Hive/
In addition to the above answers, we may also have to observe when the error occurs, e.g.:
In my case I had two types of columns that caused errors: JSON and binary.
For the JSON column, the error came during code generation (orm.ClassWriter), at the very beginning of the import process:
16/04/19 09:37:58 ERROR orm.ClassWriter: Cannot resolve SQL type
For the binary column, the error was thrown while importing into the Hive table (after the data had been imported and written to HDFS files):
16/04/19 09:51:22 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Hive does not support the SQL type for column featured_binary
To get rid of these two errors, I had to provide the following options
--map-column-java column1_json=String,column2_json=String,featured_binary=String --map-column-hive column1_json=STRING,column2_json=STRING,featured_binary=STRING
In summary, we may have to provide --map-column-java or --map-column-hive, depending on where the failure occurs.
You can use the parameter --map-column-hive to override the default mapping. This parameter expects a comma-separated list of key-value pairs, each separated by =, to specify which column should be mapped to which type in Hive.
sqoop import \
...
--hive-import \
--map-column-hive id=STRING,price=DECIMAL
A new feature was added with SQOOP-2103 (Sqoop 1.4.5) that lets you specify the decimal precision and scale with the --map-column-hive parameter. Example:
--map-column-hive 'TESTDOLLAR_AMT=DECIMAL(20%2C2)'
This syntax defines the field as DECIMAL(20,2). The %2C is used as a comma, and the parameter needs to be in single quotes if submitted from the bash shell.
I tried using DECIMAL with no modification and got DECIMAL(10,0) as the default.
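For example, inside a full import (a sketch; the UNIT_PRICE column is hypothetical, and everything before the mapping is elided):
sqoop import \
... \
--hive-import \
--map-column-hive 'TESTDOLLAR_AMT=DECIMAL(20%2C2),UNIT_PRICE=DECIMAL(10%2C2)'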
