White spaces instead of NULL in Hive table after Sqoop import - hadoop

I created a sqoop process which imports data from MS SQL to Hive, but I have a problem with 'char' type fields. Sqoop import code:
sqoop import \
--create-hcatalog-table \
--connect "connection_parameters" \
--username USER \
--driver net.sourceforge.jtds.jdbc.Driver \
--null-string '' \
--null-non-string '' \
--class-name TABLE_X \
--hcatalog-table TABLE_X_TEST \
--hcatalog-database default \
--hcatalog-storage-stanza "stored as orc tblproperties ('orc.compress'='SNAPPY')" \
--map-column-hive "column_1=char(10),column_2=char(35)" \
--num-mappers 1 \
--query "select top 10 "column_1", "column_2" from TABLE_X where \$CONDITIONS" \
--outdir "/tmp"
column_1 which is type char(10) should be NULL if there is no data. But Hive fills the field with 10 spaces.
column_2 which is type char(35) should be NULL too, but there are 35 spaces.
It is a huge problem because I cannot run a query like this:
select count(*) from TABLE_X_TEST where column_1 is NULL and column_2 is NULL;
But I have to use this one:
select count(*) from TABLE_X_TEST where column_1 = ' ' and column_2 = ' ';
I tried changing the query parameter and using the trim function:
--query "select top 10 rtrim(ltrim("column_1")), rtrim(ltrim("column_2")) from TABLE_X where \$CONDITIONS"
but it does not work, so I suppose it is not a problem with the source, but with Hive.
How can I prevent Hive from inserting spaces into empty fields?

You need to change these parameters:
--null-string '\\N' \
--null-non-string '\\N' \
Hive, by default, expects that the NULL value will be encoded using the string constant \N. Sqoop, by default, encodes it using the string constant null. To rectify the mismatch, you need to override Sqoop's default behavior to match Hive's using the parameters --null-string and --null-non-string (which is what you are doing, but with incorrect values). For details, see the docs.
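For example, a minimal sketch of the command from the question with only the NULL handling changed (connection string left as a placeholder, everything else as in the original):
sqoop import \
--create-hcatalog-table \
--connect "connection_parameters" \
--username USER \
--driver net.sourceforge.jtds.jdbc.Driver \
--null-string '\\N' \
--null-non-string '\\N' \
--class-name TABLE_X \
--hcatalog-table TABLE_X_TEST \
--hcatalog-database default \
--hcatalog-storage-stanza "stored as orc tblproperties ('orc.compress'='SNAPPY')" \
--map-column-hive "column_1=char(10),column_2=char(35)" \
--num-mappers 1 \
--query "select top 10 \"column_1\", \"column_2\" from TABLE_X where \$CONDITIONS" \
--outdir "/tmp"
After re-importing, the count query from the question (where column_1 is NULL and column_2 is NULL) should then be able to match the missing values.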

I tried without the --null-string and --null-non-string options when creating ORC tables using Sqoop HCatalog; all the NULLs in the source are reflected as NULL and I am able to query them with IS NULL.
Let me know if you found any other solution to handle NULLs.

Related

Date to Timestamp issue while sqooping data from Oracle

I'm doing a sqoop import from Oracle to HDFS and creating a Hive table in Parquet format.
I have a Date field in the Oracle table (mm/dd/yyyy format) which I need to bring in as a Timestamp (yyyy-mm-dd hh24:mi:ss) in Hive.
I used cast(xyz_date as Timestamp) in the sqoop select query but it is saved as a Long type in the Parquet file.
I checked the Hive table and NULLs are stored in the xyz_date field.
I do not want to store it as a string. Please help.
sqoop import -D mapDateToTimestamp=true \
--connect \
--username abc \
--password-file file:location \
--query "select X,Y,TO_DATE(to_char(XYZ,'MM/DD/YYYY'),'MM/DD/YYYY') from TABLE1 where \$CONDITIONS" \
--split-by Y \
--target-dir /location \
--delete-target-dir \
--as-parquetfile \
--compress \
--compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--map-column-java Y=Long
Here is my Hive table definition:
CREATE external TABLE IF NOT EXISTS abc ( X STRING, Y BIGINT, XYZ TIMESTAMP ) STORED AS PARQUET LOCATION '/location' TBLPROPERTIES ("parquet.compression"="SNAPPY");

sqoop issue while importing data from SAP HANA

We are currently moving data from SAP HANA to Hadoop using sqoop.
SAP HANA table and column names use the '/' character. Our regular sqoop command is working, but it fails when I use --split-by. Can anyone please help.
code:
/usr/hdp/sqoop/bin/sqoop import \
--connect "jdbc:sap://***-***.**.*****.com:30015" \
--username DFIT_SUPP_USR --password **** \
--driver com.sap.db.jdbc.Driver \
--query "select '\"/BA1/C55LGENT/\"' FROM \"_SYS_BIC\".\"sap.fs.frdp.300.RDL/BV_RDL_ZAFI______Z_SLPD\" where \$CONDITIONS and (\"/BA1/C55LGENT\") IN ('0000000671','0000000615') and (\"/BA1/C55LGENT\" != '0000000022') AND (\"/BIC/ZCINTEIND\" ='01') AND (\"/BA1/IGL_ACCOUNT\") IN ( '0000401077', '0000401035') AND (\"/BA1/C55POSTD\">= '20170101' AND \"/BA1/C55POSTD\" <='20170101')" \
--target-dir /user/arekapalli/pfit_export_test12 \
--delete-target-dir \
--split-by //BA1//C55LGENT// \
-m 10
Below is the error we got..
Caused by: com.sap.db.jdbc.exceptions.JDBCDriverException: SAP DBTech JDBC: [257] (at 12): sql syntax error: incorrect syntax near "/": line 1 col 12 (at pos 12)
Your problem is probably here:
--query "select '\"/BA1/C55LGENT/\"' FROM \"_SYS_BIC\".\"sap.fs.frdp.300.RDL/BV_RDL_ZAFI______Z_SLPD\" where \$CONDITIONS and (\"/BA1/C55LGENT\") IN ('0000000671','0000000615') and (\"/BA1/C55LGENT\" != '0000000022') AND (\"/BIC/ZCINTEIND\" ='01') AND (\"/BA1/IGL_ACCOUNT\") IN ( '0000401077', '0000401035') AND (\"/BA1/C55POSTD\">= '20170101' AND \"/BA1/C55POSTD\" <='20170101')" \
You are assuming that the "\" is an escape character interpreted by the terminal, which is probably wrong. Try the following:
--query 'select "/BA1/C55LGENT/" FROM "_SYS_BIC"."sap.fs.frdp.300.RDL/BV_RDL_ZAFI______Z_SLPD" where \$CONDITIONS and ("/BA1/C55LGENT") IN ("0000000671","0000000615") and ("/BA1/C55LGENT" != "0000000022") AND ("/BIC/ZCINTEIND" ="01") AND ("/BA1/IGL_ACCOUNT") IN ( "0000401077", "0000401035") AND ("/BA1/C55POSTD">= "20170101" AND "/BA1/C55POSTD" <="20170101")' \
I am not a SAP user, so there could still be something wrong with the query; anyway, you can see that I removed all of your ' characters from inside the query and used ' as the delimiter of the whole query.
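Separately, the --split-by //BA1//C55LGENT// value in the question is what produces the min/max boundary query that fails near the first '/'. A hedged sketch, not from the original answer and not verified against HANA: pass the column as a double-quoted identifier so Sqoop's boundary query stays valid SQL, for example
--split-by '"/BA1/C55LGENT"' \
which should make Sqoop build something like select min("/BA1/C55LGENT"), max("/BA1/C55LGENT") ... instead of hitting the unquoted slash that the error message points at.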

Sqoop --split-by error while importing despite of having primary key in table

MySQL table with dept_id as primary key
| dept_id | dept_name |
| 2       | Fitness   |
| 3       | Footwear  |
| 4       | Apparel   |
| 5       | Golf      |
| 6       | Outdoors  |
| 7       | Fan Shop  |
Sqoop Query
sqoop import \
-m 2 \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
-P \
--query 'select * from departments where dept_id < 6 AND $CONDITIONS' \
--target-dir /user/cloudera/sqoop_import/departments;
It results in an error on the console:
When importing query results in parallel, you must specify --split-by
---Question point!---
Even though the table has a primary key and the splits can be equally distributed between 2 mappers, why do we need --split-by or -m 1?
Guide me for the same.
Thanks.
Sqoop import needs --split-by when you use --query because, with a free-form query, Sqoop cannot know or guess the primary key of the result set: the query can join two or more tables, which will have multiple keys and fields, so Sqoop cannot tell which of those keys it should split by.
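For example, a minimal sketch of the command from the question with a split column supplied explicitly (assuming dept_id, the primary key, is the column you want to split on):
sqoop import \
-m 2 \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username retail_dba \
-P \
--query 'select * from departments where dept_id < 6 AND $CONDITIONS' \
--split-by dept_id \
--target-dir /user/cloudera/sqoop_import/departments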
The error is not about the primary key. You are seeing it because you use the --query option, which MUST be combined with --split-by, --target-dir, and $CONDITIONS in the query.
free_form_query_imports documentation
When importing a free-form query, you must specify a destination directory with --target-dir.
If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.
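To make the quoted passage concrete, here is a rough illustration of what the two mappers might run for the command above; the exact split ranges are chosen by Sqoop, so treat the numbers as an assumption:
-- boundary query Sqoop issues first (roughly):
select min(dept_id), max(dept_id) from (select * from departments where dept_id < 6 AND (1 = 1)) t1;
-- mapper 1, with $CONDITIONS replaced by its range:
select * from departments where dept_id < 6 AND ( dept_id >= 2 AND dept_id < 4 );
-- mapper 2:
select * from departments where dept_id < 6 AND ( dept_id >= 4 AND dept_id <= 5 );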
You can use the --where option if you don't want to use --split-by and --query:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
-P \
--table departments \
--target-dir /user/cloudera/departments \
-m 2 \
--where "department_id < 6"
If you use the --boundary-query option, then you don't need the --split-by and --query options:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
-P \
--table departments \
--target-dir /user/cloudera/departments \
-m 2 \
--boundary-query "select 2, 6 from departments limit 1"
selecting_the_data_to_import
By default sqoop will use the query select min(<split-by>), max(<split-by>) from <table name> to find out boundaries for creating splits. In some cases this query is not the most optimal, so you can specify any arbitrary query returning two numeric columns using the --boundary-query argument.
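For comparison with the custom --boundary-query used above, a sketch of the default query Sqoop would build for --table departments (assuming dept_id, the primary key, is picked as the split column):
-- default boundary query:
select min(dept_id), max(dept_id) from departments;
-- the custom --boundary-query simply hard-codes the two bounds instead:
select 2, 6 from departments limit 1;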
As per sqoop docs,
If you want to import the results of a query in parallel, then each map task will need to execute a copy of the query, with results partitioned by bounding conditions inferred by Sqoop. Your query must include the token $CONDITIONS which each Sqoop process will replace with a unique condition expression. You must also select a splitting column with --split-by.
So you have to specify your primary key in the --split-by argument.
If you choose 1 mapper, Sqoop will not split the task in parallel and will perform the complete import with 1 mapper.
Check my other answer (if you want) to understand the need for $CONDITIONS and the number of mappers.

How to specify multiple conditions in sqoop?

Sqoop version: 1.4.6.2.3.4.0-3485
I have been trying to import data using sqoop using the following command:
sqoop import -libjars /usr/local/bfm/lib/java/jConnect-6/6.0.0/jconn3-6.0.0.jar \
--connect jdbc:sybase:db \
--username user \
--password 'pwd' \
--driver com.sybase.jdbc3.jdbc.SybDriver \
--query 'SELECT a.* from table1 a,table2 b where b.run_group=a.run_group and a.date<"7/22/2016" AND $CONDITIONS' \
--target-dir /user/user/a/ \
--verbose \
--hive-import \
--hive-table default.temp_a \
--split-by id
I get the following error:
Invalid column name '7/22/2016'
I have tried enclosing the query in double quotes, but then it says:
CONDITIONS: Undefined variable.
Tried several combinations of single/double quotes and escaping $CONDITIONS and using a --where switch as well.
PS: The conditions are non-numeric. (It works for cases like where x<10 or so, but not when the value is a string or date.)
In your command, --split-by id should be --split-by a.id. I would use a join instead of an extra where condition, and I would convert the date to a string with the Sybase-specific convert(varchar, ...) function:
sqoop import -libjars /usr/local/bfm/lib/java/jConnect-6/6.0.0/jconn3-6.0.0.jar \
--connect jdbc:sybase:db \
--username user \
--password 'pwd' \
--driver com.sybase.jdbc3.jdbc.SybDriver \
--query "SELECT a.* from table1 a join table2 b on a.id=b.id where a.run_group=b.run_group and convert(varchar, a.date, 101) < '7/22/2016' AND \$CONDITIONS" \
--target-dir /user/user/a/ \
--verbose \
--hive-import \
--hive-table default.temp_a \
--split-by a.id
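For context on the original error (this is my reading, not part of the answer above): with the query wrapped in single quotes, the double-quoted "7/22/2016" reaches Sybase as a quoted identifier rather than a string literal, so the server looks for a column of that name, roughly:
SELECT a.* from table1 a,table2 b
where b.run_group=a.run_group
and a.date < "7/22/2016"   -- parsed as a column name here, hence: Invalid column name '7/22/2016'
AND ( ... )                -- $CONDITIONS gets substituted here by Sqoop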
A workaround that can be used: --options-file
Copy the query into your options file and use that switch.
The options file might look like this:
--query
select * \
from table t1 \
where t1.field="text" \
and t1.value="value" \
and $CONDITIONS
Note: Not sure if it was a version-specific issue, but --query directly on the command line just refused to work with $CONDITIONS. (Yes, I tried escaping it with \ and several other combinations of quoting.)
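Assuming the options above are saved to a file (the path below is just a placeholder, not from the original answer), the remaining flags can stay on the command line and reference it with --options-file:
sqoop import \
-libjars /usr/local/bfm/lib/java/jConnect-6/6.0.0/jconn3-6.0.0.jar \
--connect jdbc:sybase:db \
--username user \
-P \
--driver com.sybase.jdbc3.jdbc.SybDriver \
--options-file /home/user/query_options.txt \
--target-dir /user/user/a/ \
--hive-import \
--hive-table default.temp_a \
--split-by a.id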

sqoop inserting data into wrong hive column from rdbms table

I have a table called 'employee' in SQL Server :
ID NAME ADDRESS DESIGNATION
1 Jack XXX Clerk
2 John YYY Engineer
I have created an external table (emp) in Hive, and through sqoop import I imported data from employee to the Hive table using the --query argument of sqoop. If I pass --query as 'select * from employee' then the data gets inserted into the Hive table correctly. But if I pass --query as 'select ID,NAME,DESIGNATION from employee' then the data in the DESIGNATION column of the 'employee' table (RDBMS) gets inserted into the ADDRESS column of the 'emp' table instead of the DESIGNATION column. When I run the Hive query below:
select designation from emp;
I get values as :
NULL
NULL
instead of : Clerk
Engineer
But if I run the hive query as :
select address from emp;
I get values as :
Clerk
Engineer
instead of :NULL
NULL
Any ideas on fixing this incorrect data would be of great help. I am currently using Hive 0.11, so I can't use the Hive insert queries that are available from Hive 0.14.
OK, I'll show you a sample.
sqoop import --connect jdbc:mysql://host:port/db'?useUnicode=true&characterEncoding=utf-8' \
--username 'xxxx' \
--password 'xxxx' \
--table employee \
--columns 'ID,NAME,DESIGNATION' \
--where 'aaa=bbb' \
-m 1 \
--target-dir hdfs://nameservice1/dir \
--fields-terminated-by '\t' \
--hive-import \
--hive-overwrite \
--hive-drop-import-delims \
--null-non-string '\\N' \
--null-string '\\N' \
--hive-table 'hive_db.hive_tb' \
--hive-partition-key 'pt' \
--hive-partition-value '2016-01-20'
Some of these parameters are optional.
Sqoop syntax details:
http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_import_literal
The sqoop statement will import the data to the HDFS directory as follows (assuming the field separator is ','):
1,Jack,Clerk
2,John,Engineer
So the ADDRESS column will have the DESIGNATION data and the DESIGNATION column will be NULL.
You can try --query "select ID,NAME,'',DESIGNATION from employee"; this should work.
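With that query, the rows written to HDFS carry an empty third field for ADDRESS, so every value lines up with the Hive columns again, roughly:
1,Jack,,Clerk
2,John,,Engineer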
