sqoop inserting data into wrong hive column from rdbms table - hadoop

I have a table called 'employee' in SQL Server :
ID NAME ADDRESS DESIGNATION
1 Jack XXX Clerk
2 John YYY Engineer
I have created an external table (emp) in Hive and imported data from employee into it through Sqoop using the --query argument. If I use --query as 'select * from employee', the data is inserted into the Hive table correctly. But if I use --query as 'select ID,NAME,DESIGNATION from employee', the data from the DESIGNATION column of the 'employee' table (RDBMS) ends up in the ADDRESS column of the 'emp' table instead of the DESIGNATION column. When I run the Hive query below:
select designation from emp;
I get values as :
NULL
NULL
instead of:
Clerk
Engineer
But if I run the hive query as :
select address from emp;
I get values as :
Clerk
Engineer
instead of:
NULL
NULL
Any ideas for fixing this incorrect data would be of great help. I am currently using Hive 0.11, so I can't use Hive INSERT queries, which are only available from Hive 0.14 onwards.

OK, here is a sample:
sqoop import --connect jdbc:mysql://host:port/db'?useUnicode=true&characterEncoding=utf-8' \
--username 'xxxx' \
--password 'xxxx' \
--table employee \
--columns 'ID,NAME,DESIGNATION' \
--where 'aaa=bbb' \
-m 1 \
--target-dir hdfs://nameservice1/dir \
--fields-terminated-by '\t' \
--hive-import \
--hive-overwrite \
--hive-drop-import-delims \
--null-non-string '\\N' \
--null-string '\\N' \
--hive-table 'hive_db.hive_tb' \
--hive-partition-key 'pt' \
--hive-partition-value '2016-01-20'
Some of these parameters are optional.
Sqoop syntax details:
http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_import_literal

The Sqoop statement will import the data into the HDFS directory as follows (assuming the field separator is ,):
1,Jack,Clerk
2,John,Engineer
Since Hive maps the delimited fields to table columns by position, the ADDRESS column will hold the DESIGNATION data and the DESIGNATION column will be NULL.
You can try --query "select ID,NAME,'',DESIGNATION from employee" so that an empty literal fills the ADDRESS position; this should work.
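As an illustration only (the connection details below are placeholders, not taken from the question), the full import with the empty literal holding the ADDRESS position might look like this; note that a where \$CONDITIONS clause is required by Sqoop whenever --query is used:
sqoop import \
--connect 'jdbc:sqlserver://host:1433;databaseName=mydb' \
--username xxxx \
--password xxxx \
--query "select ID, NAME, '' as ADDRESS, DESIGNATION from employee where \$CONDITIONS" \
--split-by ID \
--fields-terminated-by ',' \
--target-dir /user/hive/warehouse/emp
This keeps the imported fields positionally aligned with the Hive table's ID, NAME, ADDRESS, DESIGNATION columns.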

Related

Date to Timestamp issue while sqooping data from Oracle

I'm doing a Sqoop import from Oracle to HDFS and creating a Hive table in Parquet format.
I have a DATE field in the Oracle table (mm/dd/yyyy format) which I need to bring in as a TIMESTAMP (yyyy-mm-dd hh24:mi:ss) in Hive.
I used cast(xyz_date as timestamp) in the Sqoop select query, but it is saved as a long in the Parquet file.
I checked the Hive table and NULLs are stored in the xyz_date field.
I do not want to store it as a string. Please help.
sqoop import -D mapDateToTimestamp=true \
--connect \
--username abc \
--password-file file:location \
--query "select X,Y,TO_DATE(to_char(XYZ,'MM/DD/YYYY'),'MM/DD/YYYY') from TABLE1 where \$CONDITIONS" \
--split-by Y \
--target-dir /location \
--delete-target-dir \
--as-parquetfile \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--map-column-java Y=Long
Here is my Hive table format:
CREATE EXTERNAL TABLE IF NOT EXISTS abc (
  X STRING,
  Y BIGINT,
  XYZ TIMESTAMP
)
STORED AS PARQUET
LOCATION '/location'
TBLPROPERTIES ("parquet.compression"="SNAPPY");
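No answer is recorded for this one; a commonly used workaround, sketched here under the assumption that Sqoop writes the value into the Parquet file as epoch milliseconds in a long, is to land the column as BIGINT and expose the TIMESTAMP through a Hive view (abc_raw and the view name are hypothetical):
CREATE EXTERNAL TABLE IF NOT EXISTS abc_raw (
  X STRING,
  Y BIGINT,
  XYZ BIGINT  -- epoch milliseconds as written by Sqoop (assumption)
)
STORED AS PARQUET
LOCATION '/location'
TBLPROPERTIES ("parquet.compression"="SNAPPY");

CREATE VIEW abc AS
SELECT X, Y, CAST(from_unixtime(XYZ DIV 1000) AS TIMESTAMP) AS XYZ
FROM abc_raw;
from_unixtime expects seconds, hence the division by 1000.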

White spaces instead of NULL in Hive table after Sqoop import

I created a Sqoop process which imports data from MS SQL to Hive, but I have a problem with 'char' type fields. Sqoop import code:
sqoop import \
--create-hcatalog-table \
--connect "connection_parameters" \
--username USER \
--driver net.sourceforge.jtds.jdbc.Driver \
--null-string '' \
--null-non-string '' \
--class-name TABLE_X \
--hcatalog-table TABLE_X_TEST \
--hcatalog-database default \
--hcatalog-storage-stanza "stored as orc tblproperties ('orc.compress'='SNAPPY')" \
--map-column-hive "column_1=char(10),column_2=char(35)" \
--num-mappers 1 \
--query "select top 10 "column_1", "column_2" from TABLE_X where \$CONDITIONS" \
--outdir "/tmp"
column_1, which is of type char(10), should be NULL if there is no data, but Hive fills the field with 10 spaces.
column_2, which is of type char(35), should be NULL too, but there are 35 spaces instead.
It is a huge problem because I cannot run a query like this:
select count(*) from TABLE_X_TEST where column_1 is NULL and column_2 is NULL;
But I have to use this one:
select count(*) from TABLE_X_TEST where column_1 = ' ' and column_2 = ' ';
I tried changing the query parameter and using the trim function:
--query "select top 10 rtrim(ltrim("column_1")), rtrim(ltrim("column_2")) from TABLE_X where \$CONDITIONS"
but it does not work, so I suppose it is not a problem with the source, but with Hive.
How can I prevent Hive from inserting spaces in empty fields?
You need to change these parameters:
--null-string '\\N' \
--null-non-string '\\N' \
Hive, by default, expects the NULL value to be encoded using the string constant \N. Sqoop, by default, encodes it using the string constant null. To rectify the mismatch, you'll need to override Sqoop's default behavior with Hive's using the --null-string and --null-non-string parameters (which is what you are doing, but with incorrect values). For details, see the docs.
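As a sketch only, the two parameters would slot into the question's command like this (everything else left as in the question); note the follow-up comment below, which reports that simply omitting both options also produced real NULLs with the HCatalog/ORC import:
sqoop import \
--create-hcatalog-table \
--connect "connection_parameters" \
--null-string '\\N' \
--null-non-string '\\N' \
--hcatalog-table TABLE_X_TEST \
--hcatalog-database default \
--hcatalog-storage-stanza "stored as orc tblproperties ('orc.compress'='SNAPPY')" \
--query "select top 10 \"column_1\", \"column_2\" from TABLE_X where \$CONDITIONS"
with the remaining arguments (--username, --driver, --class-name, --map-column-hive, --num-mappers, --outdir) kept exactly as in the question.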
I tried without giving the options of null-string and null-non-string for creating orc tables using Sqoop hcatalog, all the nulls in source are reflecting as NULL and I am able to query using is null function.
Let me know if you found any other solution to handle null's.

Sqoop import split by column and different database to split that data

I have a requirement.
I need to get data from Teradata database a into Hadoop.
I have access only to a view in that database and need to pull data from that view. As I don't have read/write access to that database, I cannot use the --split-by column option in my Sqoop import query.
So is there an option where I can tell Sqoop to use database b for storing the split data and then move the data into Hadoop?
Query:
sqoop import \
--connect "jdbc:sqlserver://xx.aa.dd.aa;databaseName=a" \
--connection-manager org.apache.sqoop.manager.SQLServerManager \
--username XXXX \
--password XXXX \
--num-mappers 20 \
--query "select * from (select ID,name,x,y,z from TABLE1 where DT between '2018/01/01' and '2018/01/31') as temp_table where updt_date <'2018/01/31' AND \$CONDITIONS" \
--split-by id \
--target-dir /user/XXXX/sqoop_import/XYZ/2018/TABLE1
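No answer is shown for this one. Two points worth noting, offered as a sketch rather than a verified fix: --split-by only issues read-only SELECT statements against the source (a MIN/MAX boundary query plus range-filtered selects per mapper), so it does not require write access; and if splitting is to be avoided entirely, the import can run with a single mapper, in which case no split column is needed at all:
sqoop import \
--connect "jdbc:sqlserver://xx.aa.dd.aa;databaseName=a" \
--connection-manager org.apache.sqoop.manager.SQLServerManager \
--username XXXX \
--password XXXX \
--num-mappers 1 \
--query "select * from (select ID,name,x,y,z from TABLE1 where DT between '2018/01/01' and '2018/01/31') as temp_table where updt_date <'2018/01/31' AND \$CONDITIONS" \
--target-dir /user/XXXX/sqoop_import/XYZ/2018/TABLE1
The trade-off is a single-threaded import; Sqoop never writes split data back to the source database either way.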

Sqoop - FileAlreadyExists exception

I need some help with sqoop.
First of all, I'm sorry, my English isn't very good.
Using the following command:
sqoop import -D mapreduce.output.fileoutputformat.compress=false --num-mappers 1 --connection-manager "com.quest.oraoop.OraOopConnManager" --connect "jdbc:Oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=myserver)(PORT=1534)))(CONNECT_DATA=(SERVICE_NAME=myservice)))" --username "rodrigo" --password pwd \
--query "SELECT column1, column2 from myTable where \$CONDITIONS" \
--null-string '' --null-non-string '' --fields-terminated-by '|' \
--lines-terminated-by '\n' --as-textfile --target-dir /data/rodrigo/myTable \
--hive-import --hive-partition-key yearmonthday --hive-partition-value '20180101' --hive-overwrite --verbose -P --m 1 --hive-table myTable
My table is already created, because I must submit a request to have a table created in my Hive database, so I can't create it dynamically inside the Sqoop command.
I have permission to create the directory in HDFS.
When I remove the directory, Sqoop logs an error saying that I have no CREATE TABLE permissions, and when the directory already exists, it returns a FileAlreadyExistsException.
What can I do to solve that?
Thanks from Brazil.
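No answer is recorded here; as a hedged suggestion, the FileAlreadyExistsException on --target-dir is usually avoided with --delete-target-dir, which tells Sqoop to remove the existing import directory before the job starts and requires no table-creation permissions (connection details below are placeholders):
sqoop import \
--connect "jdbc:Oracle:thin:@(DESCRIPTION=...)" \
--username rodrigo \
-P \
--query "SELECT column1, column2 from myTable where \$CONDITIONS" \
--target-dir /data/rodrigo/myTable \
--delete-target-dir \
--hive-import \
--hive-overwrite \
--hive-table myTable \
--hive-partition-key yearmonthday \
--hive-partition-value '20180101' \
--m 1
The --hive-overwrite flag overwrites any existing data in the Hive table rather than appending to it.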

impala incremental last modified

I have a Sqoop import to bring in data from Oracle with a join on two tables. I need to do an --incremental lastmodified import based on a column which is common to both tables:
--query "SELECT customer_info.customer_id, customer_info.customer_name,
customer.date_created, sales_info.last_update_date as sales_last_update_date
from customer_info
inner join
sales_info ON customer_info.customer_id = sales_info.customer_id
AND \$CONDITIONS" \
--split-by "customer_id" \
--fields-terminated-by '\t' \
--target-dir (name_of_dir) \
--incremental lastmodified \
--check-column sales_last_update_date \
The last_update_date column is common to both tables.
But I get the error:
ORA-00904: "sales_last_update_date": invalid identifier
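No answer is attached to this one. A sketch of one likely cause and fix, assuming Sqoop appends its lastmodified filter on the --check-column name and Oracle rejects sales_last_update_date because a select-list alias cannot be referenced in a WHERE clause at the same query level (hence ORA-00904): wrap the join in an inline view so the alias becomes a real column that the generated predicate can reference (customer.date_created from the question is assumed to mean customer_info.date_created):
--query "SELECT * FROM (
    SELECT customer_info.customer_id, customer_info.customer_name,
           customer_info.date_created, sales_info.last_update_date AS sales_last_update_date
    FROM customer_info
    INNER JOIN sales_info ON customer_info.customer_id = sales_info.customer_id
  ) t WHERE \$CONDITIONS" \
--split-by "customer_id" \
--incremental lastmodified \
--check-column sales_last_update_date
Alternatively, pointing --check-column at the real column name, and selecting it under that same name, avoids the alias altogether.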
