Special characters are not preserved after sqooping data into Hive from Teradata - hadoop

I'm trying to sqoop a Teradata table into Hive using the "sqoop-import" command below.
sqoop tdimport \
-Dtdch.output.hdfs.avro.schema.file=/tmp/data/country.avsc --connect jdbc:teradata://tdserver/database=SALES --username tduser \
--password tdpw --as-avrodatafile --target-dir /tmp/data/country_avro --table COUNTRY \
--split-by SALESCOUNTRYCODE --num-mappers 1
The Teradata table contains special characters in some columns. After sqooping into Hive, the special characters do not come through correctly.
Is there any way to preserve the special characters when running the sqoop import command?
Do we need to use UTF-8 to resolve this issue?
Can anyone please advise on this issue?
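One thing worth trying (an assumption on my part, not a confirmed fix): the Teradata JDBC driver accepts a CHARSET connection parameter, so requesting UTF8 on the connection may keep the special characters intact end to end, for example:
# assumption: CHARSET=UTF8 asks the Teradata JDBC driver for a UTF-8 session
sqoop tdimport \
-Dtdch.output.hdfs.avro.schema.file=/tmp/data/country.avsc \
--connect "jdbc:teradata://tdserver/database=SALES,CHARSET=UTF8" \
--username tduser --password tdpw \
--as-avrodatafile --target-dir /tmp/data/country_avro \
--table COUNTRY --split-by SALESCOUNTRYCODE --num-mappers 1
Avro stores strings as UTF-8, so also check that whatever client you use to view the data in Hive is rendering UTF-8 itself.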

Related

Sqoop Import of Data having new line character in avro format and then query using hive

My requirement is to load the data from an RDBMS into HDFS (backed by CDH 5.9.X) via Sqoop (1.4.6) in Avro format and then use an external Hive (1.1) table to query the data.
Unfortunately, the data in RDBMS has some new line characters.
We all know that Hive can't parse newline characters in the data, and the column-to-data mapping breaks when the whole data set is selected via Hive. However, Hive's select count(*) works fine.
I used the options below during the sqoop import and checked the result, but they didn't work:
--hive-drop-import-delims
--hive-delims-replacement
The above options work for text format, but storing data in text format is not a viable option for me.
The above options are applied correctly in the Sqoop-generated (codegen) POJO class's toString method (obviously, since text format works as expected), so I suspect this method is not used at all during Avro import, probably because Avro has no problem dealing with newline characters, whereas Hive does.
I am surprised that no one seems to have faced such a common scenario; any table with a remark or comment field is prone to this problem.
Can anyone suggest a solution, please?
My command:
sqoop import \
-Dmapred.job.queue.name=XXXX \
--connect jdbc:oracle:thin:@Masked:61901/AgainMasked \
--table masked.masked \
--username masked \
-P \
--target-dir /user/masked/ \
--as-avrodatafile \
--map-column-java CREATED=String,LAST_UPD=String,END_DT=String,INFO_RECORD_DT=String,START_DT=String,DB_LAST_UPD=String,ADDR_LINE_3=String \
--hive-delims-replacement ' ' \
--null-string '\\N' \
--null-non-string '\\N' \
--fields-terminated-by '\001' \
-m 1
This looks like an issue with the Avro SerDe. It is an open bug:
https://issues.apache.org/jira/browse/HIVE-14044.
Can you try the same in Hive 2.0?
As mentioned by VJ, there is an open issue for newline characters in Avro.
An alternate approach that you can try (sketched below) is:
Sqoop the data into a Hive staging table stored as textfile format.
Create an Avro table.
Insert the data from the staging table into the main Avro table in Hive.
Newline characters are handled very well in textfile format.
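A minimal sketch of that staging flow, reusing the masked connection details from the command above and assuming hypothetical staging and target table names (stagingdb.masked_txt and stagingdb.masked_avro):
# step 1 (sketch): land the data in a textfile-backed Hive staging table, replacing the newlines
sqoop import \
--connect jdbc:oracle:thin:@Masked:61901/AgainMasked \
--table masked.masked \
--username masked \
-P \
--hive-import \
--hive-table stagingdb.masked_txt \
--hive-delims-replacement ' ' \
--null-string '\\N' \
--null-non-string '\\N' \
-m 1
# step 2 (sketch): copy the cleaned rows into an Avro-backed table (STORED AS AVRO needs Hive 0.14+)
hive -e "CREATE TABLE stagingdb.masked_avro STORED AS AVRO AS SELECT * FROM stagingdb.masked_txt;"
The delimiters are cleaned on the way into the textfile staging table, and the CTAS into the Avro-backed table then carries the cleaned values across.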

How can I import a column of type SDO_GEOMETRY from Oracle to HDFS with Sqoop?

ISSUE
I'm using Sqoop to fetch data from Oracle and put it into HDFS. Unlike other basic datatypes, I understand SDO_GEOMETRY is meant for spatial data.
My Sqoop job fails while fetching the SDO_GEOMETRY datatype.
I need help importing the column Shape, which has the SDO_GEOMETRY datatype, from Oracle to HDFS.
I have more than 1000 tables that have the SDO_GEOMETRY datatype; how can I handle the datatype in general while the sqoop imports happen?
I have tried --map-column-java and --map-column-hive, but I still get the error.
Error:
ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Hive does not support the SQL type for column
SHAPE
SQOOP COMMAND
Below is the sqoop command that I have:
sqoop import --connect 'jdbc:oracle:thin:XXXXX/xxxxx@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(Host=xxxxxxx)(Port=1521))(CONNECT_DATA=(SID=xxxxx)))' -m 1 --create-hive-table --hive-import --fields-terminated-by '^' --null-string '\\\\N' --null-non-string '\\\\N' --hive-overwrite --hive-table PROD.PLAN1 --target-dir test/PLAN1 --table PROD.PLAN --map-column-hive SE_XAO_CAD_DATA=BINARY --map-column-java SHAPE=String --map-column-hive SHAPE=STRING --delete-target-dir
The default type mapping that Sqoop provides between relational databases and Hadoop does not work in your case, which is why the Sqoop job fails. You need to override the mapping, as geometry datatypes are not supported by Sqoop.
Use the parameter below in your sqoop job.
Syntax: --map-column-java col1=javaDatatype,col2=javaDatatype,...
sqoop import
.......
........
--map-column-java columnNameforSDO_GEOMETRY=String
As your column name is Shape
--map-column-java Shape=String
Sqoop import to HDFS
Sqoop does not support all of the RDBMS datatypes.
If a particular datatype is not supported, you will get an error like:
No Java type for SQL type .....
Solution
Add --map-column-java in your sqoop command.
Syntax: --map-column-java col-name=java-type,...
For example, --map-column-java col1=String,col2=String
Sqoop import to HIVE
You need the same --map-column-java mentioned above.
By default, Sqoop supports these JDBC types and converts them into the corresponding Hive types:
INTEGER
SMALLINT
VARCHAR
CHAR
LONGVARCHAR
NVARCHAR
NCHAR
LONGNVARCHAR
DATE
TIME
TIMESTAMP
CLOB
NUMERIC
DECIMAL
FLOAT
DOUBLE
REAL
BIT
BOOLEAN
TINYINT
BIGINT
If your datatype is not in this list, you get an error like:
Hive does not support the SQL type for .....
Solution
You need to add --map-column-hive in your sqoop import command.
Syntax: --map-column-hive col-name=hive-type,...
For example, --map-column-hive col1=string,col2='varchar(100)'
Add --map-column-java SE_XAO_CAD_DATA=String,SHAPE=String --map-column-hive SE_XAO_CAD_DATA=BINARY,SHAPE=STRING in your command.
Don't use multiple --map-column-java or --map-column-hive options; specify each option once, with all the columns combined, as in the full command sketched below.
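Applied to the command from the question (connection details masked as in the original), the single pair of mapping options would look something like this:
# sketch: the original command with the java and hive mappings each combined into one option
sqoop import --connect 'jdbc:oracle:thin:XXXXX/xxxxx@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(Host=xxxxxxx)(Port=1521))(CONNECT_DATA=(SID=xxxxx)))' \
-m 1 --create-hive-table --hive-import --fields-terminated-by '^' \
--null-string '\\\\N' --null-non-string '\\\\N' --hive-overwrite \
--hive-table PROD.PLAN1 --target-dir test/PLAN1 --table PROD.PLAN \
--map-column-java SE_XAO_CAD_DATA=String,SHAPE=String \
--map-column-hive SE_XAO_CAD_DATA=BINARY,SHAPE=STRING \
--delete-target-dir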
For importing SDO_GEOMETRY from Oracle to Hive through Sqoop,
use the Sqoop free-form query option along with Oracle's SDO_UTIL.TO_GEOJSON and SDO_UTIL.TO_WKTGEOMETRY functions.
The Sqoop --query option allows us to supply a SELECT SQL query so that we fetch only the required data from the table, and in that query we can include SDO_UTIL package functions like TO_GEOJSON and TO_WKTGEOMETRY. It looks something like:
sqoop import \
...
--query 'SELECT ID, NAME, \
SDO_UTIL.TO_GEOJSON(MYSHAPECOLUMN), \
SDO_UTIL.TO_WKTGEOMETRY(MYSHAPECOLUMN) \
FROM MYTABLE WHERE $CONDITIONS' \
...
This returns the SDO_GEOMETRY as GeoJSON and WKT, as per the definitions of the functions, and the results can be inserted directly into the corresponding Hive STRING-type columns without any other type mapping in the Sqoop command.
Choose GeoJSON or WKT as your requirements dictate; this approach can also be extended to the other spatial functions available.

How do I access data from a dataframe using the column name

I have an Oracle table with XML data stored in it (xmlType). I'm trying to sqoop it to HDFS with the command below. The XML field is coming through as null in the HDFS file.
sqoop import --connect jdbc:oracle:thin:@DBconnString
--username uname --password pwd
--delete-target-dir
--table sample
--map-column-java column1=String
Can anyone suggest what I am doing wrong?
It is a Sqoop limitation; xmlType is not supported.
https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_supported_data_types
There is a workaround here, https://issues.apache.org/jira/browse/SQOOP-2749, which is essentially to convert your xmlType to a CLOB and then map it to a String using the following option:
--map-column-java "XMLRECORD=String"

Configuring Sqoop with MySQL?

I have successfully installed Sqoop; now the problem is how to use it with an RDBMS and how to load data from the RDBMS into HDFS using Sqoop.
Using Sqoop, you can load data directly into Hive tables or store the data in a target directory in HDFS.
If you need to copy data from the RDBMS into a directory in HDFS:
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password {In case no password Do not Specify it}
--table tableName
--columns column_name(s) {In case you need to call specific columns}
--target-dir '/tmp/myfolder'
--boundary-query 'Select min,max from table name'
--m 5 {set number of mappers to 5}
--fields-terminated-by ',' {how do you want your data to look in target file}
Boundary Query: This is something you can specify. If you do not specify it, then by default it is run as an inner query, which adds up to a more complex query.
If you specify it explicitly, it runs as a normal query, and hence performance is improved.
You may also want to restrict the number of rows, say based on a column ID, where you need the data from ID 1 to 1000. Using the boundary query together with split-by, you can restrict the data you import:
--boundary-query "select 0,1000 from employee'
--split-by ID
Split-By: You use split-by on a Sqoop import to specify the column on whose basis the data is split. By default, if you do not specify it, Sqoop picks up the table's primary key as the split-by column.
Split-by distributes the data across the mappers and stores it in different files based on the number of mappers; by default the number of mappers is 4.
This may seem unimportant, but if you have a composite primary key or no primary key at all, Sqoop fails to pick a split column and may error out.
Note: You may not face any issue if you set the number of mappers to 1. In this case no split-by condition is used since there is only one mapper, so the query runs fine. This can be done using
--m 1
If you need to copy data from the RDBMS into a Hive table:
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password {In case no password Do not Specify it}
--table tableName
--boundary-query 'Select min,max from table name'
--m 5 {set number of mappers to 5}
--hive-import
--hive-table serviceorderdb.productinfo
Running a query instead of importing the entire table:
sqoop import
--connect ConnectionString
--username username
--password Your_Database_Password
--query "select name from employees where name like '%s' and \$CONDITIONS"
--m 5 {set number of mappers to 5}
--target-dir '/tmp/myfolder'
--fields-terminated-by ',' {how do you want your data to look in target file}
You may notice the extra parameter $CONDITIONS. This is because this time you specified no table and instead specified a query explicitly. When Sqoop runs, it looks for boundary conditions, which it does not find; it then looks for a table and a primary key on which to apply a boundary query, which again it does not find. Hence we include $CONDITIONS as a placeholder so that Sqoop can substitute its own boundary/split condition into the query.
Checking if your connection is set up properly: for this you can just run list-databases, and if you see your databases listed, then your connection is fine.
$ sqoop list-databases
--connect jdbc:mysql://localhost/
--username root
--password pwd
Connection String for Different Databases :
MYSQL: jdbc:mysql://<hostname>:<port>/<dbname>
jdbc:mysql://127.0.0.1:3306/test_database
Oracle: jdbc:oracle:thin:@//host_name:port_number/service_name
jdbc:oracle:thin:scott/tiger@//myhost:1521/myservicename
You may learn more about sqoop imports from : https://sqoop.apache.org/docs/1.4.1-incubating/SqoopUserGuide.html
By using the sqoop import command you can import data from an RDBMS into HDFS, Hive and HBase.
sqoop import --connect jdbc:mysql://localhost:portnumber/DBName --username root --table emp --password root -m 1
With this command the data will be stored in HDFS.
Sample commands to run sqoop import (load data from RDBMS to HDFS):
Postgres
sqoop import --connect jdbc:postgresql://postgresHost/databaseName
--username username --password 123 --table tableName
MySQL
sqoop import --connect jdbc:mysql://mysqlHost/databaseName --username username --password 123 --table tableName
Oracle*
sqoop import --connect jdbc:oracle:thin:@oracleHost:1521/databaseName --username USERNAME --password 123 --table TABLENAME
SQL Server
sqoop import --connect 'jdbc:sqlserver://sqlserverhost:1433;database=dbname;username=<username>;password=<password>' --table tableName
*Sqoop won't find any columns in a table if you don't specify both the username and the table name in the correct case. Usually, specifying both in uppercase will resolve the issue.
Read the Sqoop User's Guide: https://sqoop.apache.org/docs/1.4.5/SqoopUserGuide.html
I also recommend the Apache Sqoop Cookbook. You will learn how to use import and export tools, do incremental import jobs, save jobs, solve problems with jdbc drivers and much more. http://shop.oreilly.com/product/0636920029519.do

Encoding columns in Hive

I'm importing a table from mysql to hive using Sqoop. Some columns are latin1 encoded. Is there any way to do either:
Set the encoding for those columns as latin1 in Hive. OR
Convert the columns to utf-8 while importing with sqoop?
The --default-character-set option (passed through to the MySQL tools in direct mode, as in the example below) sets the character set for the whole transfer, not for specific columns. I was not able to find a Sqoop parameter that converts table columns to UTF-8 on the fly; rather, the columns are expected to already be in the desired character set.
$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
--direct -- --default-character-set=latin1
I believe you would need to convert the latin1 columns to UTF-8 first in your MySQL database, and then you can import with Sqoop. You can use the following script, which I found here, to convert all the columns to UTF-8.
mysql --database=dbname -B -N -e "SHOW TABLES" | \
awk '{print "ALTER TABLE", $1, "CONVERT TO CHARACTER SET utf8 COLLATE \
utf8_general_ci;"}' | mysql --database=dbname &
It turned out the problem was unrelated. The column works fine regardless of encoding... but the table's schema had changed in MySQL. I assumed that since I'm passing in the overwrite flag, Sqoop would remake the table in Hive every time. Not so! The schema changes in MySQL didn't get transferred to Hive, so the data in the md5 column was actually data from a different column.
The "fix" we settled on was to check for schema changes before every sqoop import and, if there was a change, drop the table and re-import. This forces a schema update in Hive.
Edit: my original sqoop command was something like:
sqoop import --connect jdbc:mysql://HOST:PORT/DB --username USERNAME --password PASSWORD --table uploads --hive-table uploads --hive-import --hive-overwrite --split-by id --num-mappers 8 --hive-drop-import-delims --null-string '\\N' --null-non-string '\\N'
But now I manually issue a drop table uploads in Hive first if the schema changes, along the lines of the sketch below.
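A rough sketch of that manual step, with the same placeholders as the command above:
# drop the stale Hive table so --hive-import recreates it with the current MySQL schema
hive -e 'DROP TABLE IF EXISTS uploads;'
# then re-run the original import
sqoop import --connect jdbc:mysql://HOST:PORT/DB --username USERNAME --password PASSWORD \
--table uploads --hive-table uploads --hive-import --hive-overwrite \
--split-by id --num-mappers 8 --hive-drop-import-delims \
--null-string '\\N' --null-non-string '\\N'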
