Sqoop Import of Data having new line character in avro format and then query using hive - hadoop

My requirement is to load data from an RDBMS into HDFS (backed by CDH 5.9.x) via Sqoop (1.4.6) in Avro format and then query it through an external Hive (1.1) table.
Unfortunately, the data in the RDBMS contains some newline characters.
We all know that Hive can't parse newline characters in the data, and the data mapping fails when the whole data is selected via Hive. However, Hive's select count(*) works fine.
I used the options below during the Sqoop import and checked, but they didn't work:
--hive-drop-import-delims
--hive-delims-replacement
The above options work for text format, but storing the data in text format is not a viable option for me.
The above options are converted properly in the toString method of the Sqoop-generated (codegen) POJO class (obviously, since text format works as expected), so I feel this method is not used at all during the Avro import, probably because Avro has no problem dealing with newline characters, whereas Hive does.
I am surprised no one else has faced such a common scenario; any table with a remark or comment field is prone to this problem.
Can anyone suggest a solution, please?
My command:
sqoop import \
-Dmapred.job.queue.name=XXXX \
--connect jdbc:oracle:thin:@Masked:61901/AgainMasked \
--table masked.masked \
--username masked \
-P \
--target-dir /user/masked/ \
--as-avrodatafile \
--map-column-java CREATED=String,LAST_UPD=String,END_DT=String,INFO_RECORD_DT=String,START_DT=String,DB_LAST_UPD=String,ADDR_LINE_3=String \
--hive-delims-replacement ' ' \
--null-string '\\N' \
--null-non-string '\\N' \
--fields-terminated-by '\001' \
-m 1
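
For context, the external Avro-backed Hive table over that target directory might be defined along the lines of the sketch below; the table name and the .avsc location are hypothetical, the schema file being the one Sqoop's codegen step produces:

# Hypothetical DDL for the external Avro table queried afterwards; names and paths are placeholders.
hive -e "
CREATE EXTERNAL TABLE masked_avro
STORED AS AVRO
LOCATION '/user/masked/'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/masked/schemas/masked.avsc');
"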

This looks like an issue with the Avro SerDe. It is an open bug:
https://issues.apache.org/jira/browse/HIVE-14044
Can you try the same in Hive 2.0?

As mentioned by VJ, there is an open issue for newline characters in Avro.
An alternative approach you can try (sketched below) is:
1. Sqoop the data into a Hive staging table stored as textfile.
2. Create an Avro table.
3. Insert the data from the staging table into the main Avro table in Hive.
This works because, with the delimiter-cleaning options, newline characters are handled well in the textfile format.
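
A minimal sketch of that flow, assuming hypothetical staging and target database/table names (staging_db.masked_text_stg and avro_db.masked_avro) and reusing the connection details from the question:

sqoop import \
-Dmapred.job.queue.name=XXXX \
--connect jdbc:oracle:thin:@Masked:61901/AgainMasked \
--table masked.masked \
--username masked \
-P \
--hive-import \
--hive-table staging_db.masked_text_stg \
--as-textfile \
--hive-drop-import-delims \
--fields-terminated-by '\001' \
-m 1

# Copy from the text staging table into the Avro-backed table:
hive -e "INSERT OVERWRITE TABLE avro_db.masked_avro SELECT * FROM staging_db.masked_text_stg;"

The --hive-drop-import-delims option strips the embedded newlines while the data is still text, so the rows line up correctly before they reach the Avro table.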

Related

Special characters are not proper after sqooping data into Hive from teradata

I'm trying to sqoop a Teradata table into Hive using the "sqoop-import" command below.
sqoop tdimport \
-Dtdch.output.hdfs.avro.schema.file=/tmp/data/country.avsc \
--connect jdbc:teradata://tdserver/database=SALES \
--username tduser --password tdpw \
--as-avrodatafile --target-dir /tmp/data/country_avro \
--table COUNTRY --split-by SALESCOUNTRYCODE --num-mappers 1
The Teradata table contains special characters in some columns. After sqooping into Hive, the special characters do not come through properly.
Is there any way to preserve the special characters when firing the sqoop import command?
Do we need to use UTF-8 to resolve this issue?
Can anyone please advise on this issue?

How can I import a column of type SDO_GEOMETRY from Oracle to HDFS with Sqoop?

ISSUE
I'm using Sqoop to fetch data from Oracle and put it into HDFS. Unlike other basic datatypes, I understand SDO_GEOMETRY is meant for spatial data.
My Sqoop job fails while fetching the SDO_GEOMETRY datatype.
I need help importing the column SHAPE, of SDO_GEOMETRY datatype, from Oracle to HDFS.
I have more than 1000 tables that have the SDO_GEOMETRY datatype; how can I handle the datatype in general while the Sqoop imports happen?
I have tried --map-column-java and --map-column-hive, but I still get the error.
Error:
ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Hive does not support the SQL type for column SHAPE
SQOOP COMMAND
Below is the sqoop command that I have:
sqoop import \
--connect 'jdbc:oracle:thin:XXXXX/xxxxx@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(Host=xxxxxxx)(Port=1521))(CONNECT_DATA=(SID=xxxxx)))' \
-m 1 --create-hive-table --hive-import \
--fields-terminated-by '^' \
--null-string '\\\\N' --null-non-string '\\\\N' \
--hive-overwrite --hive-table PROD.PLAN1 \
--target-dir test/PLAN1 --table PROD.PLAN \
--map-column-hive SE_XAO_CAD_DATA=BINARY \
--map-column-java SHAPE=String --map-column-hive SHAPE=STRING \
--delete-target-dir
The default type mapping that Sqoop provides between relational databases and Hadoop does not work in your case; that is why the Sqoop job fails. You need to override the mapping, because geometry datatypes are not supported by Sqoop.
Use the parameter below in your Sqoop job.
Syntax: --map-column-java col1=javaDatatype,col2=javaDatatype,...
sqoop import
.......
........
--map-column-java columnNameforSDO_GEOMETRY=String
As your column name is SHAPE:
--map-column-java SHAPE=String
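
For example, a plain HDFS import with just that override could look like the sketch below (the connection string and table names are the masked placeholders from the question):

sqoop import \
--connect 'jdbc:oracle:thin:XXXXX/xxxxx@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(Host=xxxxxxx)(Port=1521))(CONNECT_DATA=(SID=xxxxx)))' \
--table PROD.PLAN \
--target-dir test/PLAN1 \
--delete-target-dir \
--map-column-java SHAPE=String \
-m 1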
Sqoop import to HDFS
Sqoop does not support all RDBMS datatypes.
If a particular datatype is not supported, you will get an error like:
No Java type for SQL type .....
Solution
Add --map-column-java to your sqoop command.
Syntax: --map-column-java col-name=java-type,...
For example, --map-column-java col1=String,col2=String
Sqoop import to HIVE
You need the same --map-column-java mentioned above.
By default, Sqoop supports these JDBC types and converts them to the corresponding Hive types:
INTEGER
SMALLINT
VARCHAR
CHAR
LONGVARCHAR
NVARCHAR
NCHAR
LONGNVARCHAR
DATE
TIME
TIMESTAMP
CLOB
NUMERIC
DECIMAL
FLOAT
DOUBLE
REAL
BIT
BOOLEAN
TINYINT
BIGINT
If your datatype is not in this list, you will get an error like:
Hive does not support the SQL type for .....
Solution
You need to add --map-column-hive to your sqoop import command.
Syntax: --map-column-hive col-name=hive-type,...
For example, --map-column-hive col1=string,col2='varchar(100)'
Add --map-column-java SE_XAO_CAD_DATA=String,SHAPE=String --map-column-hive SE_XAO_CAD_DATA=BINARY,SHAPE=STRING to your command.
Don't use multiple --map-column-java or --map-column-hive options; combine the mappings into a single option of each, as in the example below.
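
Applied to the command in the question, the combined mappings would look roughly like this (connection string and table names are the masked placeholders from the post; the null-handling options are omitted for brevity):

sqoop import \
--connect 'jdbc:oracle:thin:XXXXX/xxxxx@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(Host=xxxxxxx)(Port=1521))(CONNECT_DATA=(SID=xxxxx)))' \
--table PROD.PLAN \
--hive-import --hive-overwrite --create-hive-table \
--hive-table PROD.PLAN1 \
--target-dir test/PLAN1 --delete-target-dir \
--fields-terminated-by '^' \
--map-column-java SE_XAO_CAD_DATA=String,SHAPE=String \
--map-column-hive SE_XAO_CAD_DATA=BINARY,SHAPE=STRING \
-m 1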
For importing SDO_GEOMETRY from Oracle to Hive through Sqoop, use Sqoop's free-form query option along with Oracle's SDO_UTIL.TO_GEOJSON and SDO_UTIL.TO_WKTGEOMETRY functions.
The Sqoop --query option lets us supply a SELECT SQL query so that we fetch only the required data from the table, and in that SQL query we can include SDO_UTIL package functions such as TO_GEOJSON and TO_WKTGEOMETRY. It looks something like this:
sqoop import \
...
--query 'SELECT ID, NAME,
SDO_UTIL.TO_GEOJSON(MYSHAPECOLUMN),
SDO_UTIL.TO_WKTGEOMETRY(MYSHAPECOLUMN)
FROM MYTABLE WHERE $CONDITIONS' \
...
This returns the SDO_GEOMETRY as GeoJSON and WKT, per the definitions of those functions, and the values can be inserted directly into Hive STRING-type columns without any other type mapping in the Sqoop command.
Choose GeoJSON or WKT as required; this approach can also be extended to the other spatial functions available. A fuller example is sketched below.
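
A fuller hedged version of that command; the connection string, credentials, alias column name and target directory are placeholders, and note that a free-form --query import needs the literal WHERE $CONDITIONS token plus an explicit --target-dir and either --split-by or a single mapper:

sqoop import \
--connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
--username scott -P \
--query 'SELECT ID, NAME, SDO_UTIL.TO_WKTGEOMETRY(MYSHAPECOLUMN) AS SHAPE_WKT FROM MYTABLE WHERE $CONDITIONS' \
--target-dir /tmp/data/mytable_wkt \
-m 1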

Sqoop import of an Oracle xmlType column to HDFS

I have an Oracle table with XML data stored in it (xmlType). I'm trying to sqoop it to HDFS with the command below, but the XML field comes out as null in the HDFS file.
sqoop import --connect jdbc:oracle:thin:@DBconnString \
--username uname --password pwd \
--delete-target-dir \
--table sample \
--map-column-java column1=String
Can anyone suggest what I am doing wrong?
It is a Sqoop limitation: xmlType is not supported.
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_supported_data_types
There is a workaround in https://issues.apache.org/jira/browse/SQOOP-2749, which is essentially to convert your xmlType to a CLOB and then map it to String using the following option:
--map-column-java "XMLRECORD=String"
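
A hedged sketch of that workaround, using a free-form query with Oracle's getClobVal() to turn the xmlType column into a CLOB before mapping it to String; the ID column, the XMLRECORD column name and the target directory are illustrative placeholders:

sqoop import \
--connect jdbc:oracle:thin:@DBconnString \
--username uname -P \
--query 'SELECT t.ID, t.XMLRECORD.GETCLOBVAL() AS XMLRECORD FROM sample t WHERE $CONDITIONS' \
--map-column-java XMLRECORD=String \
--target-dir /user/uname/sample_xml \
--delete-target-dir \
-m 1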

Sqoop function '--map-column-hive' being ignored

I am trying to import a file into Hive as Parquet, and the --map-column-hive column_name=timestamp option is being ignored. The column 'column_name' is originally of type datetime in SQL Server, and it is converted into bigint in Parquet. I want to convert it to timestamp format through Sqoop, but it is not working.
sqoop import \
--table table_name \
--driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
--connect jdbc:sqlserver://servername \
--username user --password pw \
--map-column-hive column_name=timestamp \
--as-parquetfile \
--hive-import \
--hive-table table_name -m 1
When I view the table in Hive, it still shows the column with its original datatype.
I tried column_name=string and that did not work either.
I think this may be an issue with converting the files to Parquet, but I am not sure. Does anyone have a solution to fix this?
I get no errors when running the command; it just completes the import as if the option did not exist.
Before Hive 1.2, timestamp support in the Parquet SerDe is not available; only the binary data type is supported in 1.1.0.
Please upgrade your Hive version to 1.2 or later and it should work.
Please check the issue log and release notes below:
https://issues.apache.org/jira/browse/HIVE-6384
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12329345&styleName=Text&projectId=12310843
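
If upgrading is not immediately possible, one hedged workaround is to leave the Parquet column as bigint and expose it through a Hive view. This assumes the bigint written by Sqoop holds epoch milliseconds, which is how Sqoop typically encodes datetimes in Parquet; the view and column names are placeholders:

# from_unixtime() expects seconds, so the epoch-millisecond value is divided by 1000;
# the view exposes the datetime as a readable timestamp string.
hive -e "CREATE VIEW table_name_ts_v AS
SELECT t.*, from_unixtime(CAST(t.column_name / 1000 AS BIGINT)) AS column_name_ts
FROM table_name t;"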

Encoding columns in Hive

I'm importing a table from MySQL to Hive using Sqoop. Some columns are latin1 encoded. Is there any way to do either of the following:
Set the encoding for those columns as latin1 in Hive, OR
Convert the columns to UTF-8 while importing with Sqoop?
The --default-character-set argument (passed through to the MySQL tools in Sqoop's direct mode, as shown below) sets the character set for the whole transfer, not for specific columns. I was not able to find a Sqoop parameter that converts table columns to UTF-8 on the fly; the column character sets are expected to be fixed beforehand.
$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
--direct -- --default-character-set=latin1
I believe you would need to convert the latin1 columns to UTF-8 first in MySQL, and then you can import with Sqoop. You can use the following script to convert all the columns to UTF-8, which I found here.
mysql --database=dbname -B -N -e "SHOW TABLES" | \
awk '{print "ALTER TABLE", $1, "CONVERT TO CHARACTER SET utf8 COLLATE utf8_general_ci;"}' | \
mysql --database=dbname &
It turned out the problem was unrelated. The column works fine regardless of encoding... but the table's schema had changed in MySQL. I assumed that since I'm passing in the overwrite flag, Sqoop would remake the table in Hive every time. Not so! The schema changes in MySQL didn't get transferred to Hive, so the data in the md5 column was actually data from a different column.
The "fix" we settled on was: before every Sqoop import, check for schema changes, and if there was a change, drop the table and re-import. This forces a schema update in Hive.
Edit: my original sqoop command was something like:
sqoop import --connect jdbc:mysql://HOST:PORT/DB --username USERNAME --password PASSWORD --table uploads --hive-table uploads --hive-import --hive-overwrite --split-by id --num-mappers 8 --hive-drop-import-delims --null-string '\\N' --null-non-string '\\N'
But now I manually issue a drop table uploads in Hive first if the schema changes, along the lines of the sketch below.
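
A minimal sketch of that drop-then-reimport step, assuming the schema-change check has already decided a rebuild is needed (the connection placeholders are the ones from the command above):

# Drop the Hive table so the next --hive-import recreates it with the current MySQL schema
hive -e "DROP TABLE IF EXISTS uploads;"

sqoop import --connect jdbc:mysql://HOST:PORT/DB --username USERNAME --password PASSWORD \
--table uploads --hive-table uploads --hive-import --hive-overwrite \
--split-by id --num-mappers 8 --hive-drop-import-delims \
--null-string '\\N' --null-non-string '\\N'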
