How to specify multiple conditions in sqoop? - hadoop

Sqoop version: 1.4.6.2.3.4.0-3485
I have been trying to import data using sqoop using the following command:
sqoop import -libjars /usr/local/bfm/lib/java/jConnect-6/6.0.0/jconn3-6.0.0.jar \
--connect jdbc:sybase:db \
--username user \
--password 'pwd' \
--driver com.sybase.jdbc3.jdbc.SybDriver \
--query 'SELECT a.* from table1 a,table2 b where b.run_group=a.run_group and a.date<"7/22/2016" AND $CONDITIONS' \
--target-dir /user/user/a/ \
--verbose \
--hive-import \
--hive-table default.temp_a \
--split-by id
I get the following error:
Invalid column name '7/22/2016'
I have tried enclosing the query in double quotes, but then it says:
CONDITIONS: Undefined variable.
I have tried several combinations of single and double quotes, escaping $CONDITIONS, and using a --where switch as well.
PS: The conditions are non-numeric. (It works for cases like where x<10, but not when the value is a string or a date.)

In your command, --split-by id should be --split-by a.id. I would also use an explicit join instead of an extra where condition, and convert the date to a VARCHAR (using a Sybase-specific function) so it can be compared against the string value:
sqoop import -libjars /usr/local/bfm/lib/java/jConnect-6/6.0.0/jconn3-6.0.0.jar \
--connect jdbc:sybase:db \
--username user \
--password 'pwd' \
--driver com.sybase.jdbc3.jdbc.SybDriver \
--query "SELECT a.* from table1 a join table2 b on a.id=b.id where a.run_group=b.run_group and convert(varchar, a.date, 101) < '7/22/2016' AND \$CONDITIONS" \
--target-dir /user/user/a/ \
--verbose \
--hive-import \
--hive-table default.temp_a \
--split-by a.id
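One caveat, as an assumption about Sybase's convert styles: style 101 produces mm/dd/yyyy, which does not order chronologically when compared as a string. If a.date is a date/datetime column, style 112 (yyyymmdd) does sort correctly, so a variant of the --query line like this may be safer (untested sketch):
--query "SELECT a.* from table1 a join table2 b on a.id=b.id where a.run_group=b.run_group and convert(varchar, a.date, 112) < '20160722' AND \$CONDITIONS" \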

A workaround that can be used: --options-file
Copy the query into an options file and pass that file with the switch.
The options file might look like this:
--query
select * \
from table t1 \
where t1.field='text' \
and t1.value='value' \
and $CONDITIONS
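The import then just references the file; a minimal sketch, assuming the options file is saved at a hypothetical path /home/user/query.opts:
sqoop import \
--options-file /home/user/query.opts \
--connect jdbc:sybase:db \
--username user \
--password 'pwd' \
--target-dir /user/user/a/ \
--split-by a.id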
Note: Not sure whether it was a version-specific issue, but --query directly on the command line simply refused to work with $CONDITIONS. (Yes, I tried escaping it with \ and several other quoting combinations.)
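For reference, the Sqoop user guide documents two quoting forms for --query: single quotes around a literal $CONDITIONS, or double quotes with the dollar sign escaped so the shell does not expand it. A minimal sketch of both, with a placeholder query:
--query 'select * from t where $CONDITIONS'
--query "select * from t where \$CONDITIONS"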

Related

White spaces instead of NULL in Hive table after Sqoop import

I created a Sqoop process which imports data from MS SQL to Hive, but I have a problem with 'char' type fields. Sqoop import code:
sqoop import \
--create-hcatalog-table \
--connect "connection_parameters" \
--username USER \
--driver net.sourceforge.jtds.jdbc.Driver \
--null-string '' \
--null-non-string '' \
--class-name TABLE_X \
--hcatalog-table TABLE_X_TEST \
--hcatalog-database default \
--hcatalog-storage-stanza "stored as orc tblproperties ('orc.compress'='SNAPPY')" \
--map-column-hive "column_1=char(10),column_2=char(35)" \
--num-mappers 1 \
--query "select top 10 "column_1", "column_2" from TABLE_X where \$CONDITIONS" \
--outdir "/tmp"
column_1 which is type char(10) should be NULL if there is no data. But Hive fills the field with 10 spaces.
column_2 which is type char(35) should be NULL too, but there are 35 spaces.
It is a huge problem because I cannot run a query like this:
select count(*) from TABLE_X_TEST where column_1 is NULL and column_2 is NULL;
But I have to use this one:
select count(*) from TABLE_X_TEST where column_1 = ' ' and column_2 = ' ';
I tried changing the query parameter to use trim functions:
--query "select top 10 rtrim(ltrim(\"column_1\")), rtrim(ltrim(\"column_2\")) from TABLE_X where \$CONDITIONS"
but it does not work, so I suppose the problem is not with the source but with Hive.
How can I prevent Hive from inserting spaces in empty fields?
You need to change these parameters:
--null-string '\\N' \
--null-non-string '\\N' \
Hive, by default, expects the NULL value to be encoded as the string constant \N. Sqoop, by default, encodes it as the string constant null. To rectify the mismatch, you need to override Sqoop's default with Hive's, using the --null-string and --null-non-string parameters (which you are already doing, but with incorrect values). For details, see the Sqoop docs.
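Folded into the asker's command, the corrected import would read as follows (a sketch; only the two null flags differ from the original):
sqoop import \
--create-hcatalog-table \
--connect "connection_parameters" \
--username USER \
--driver net.sourceforge.jtds.jdbc.Driver \
--null-string '\\N' \
--null-non-string '\\N' \
--class-name TABLE_X \
--hcatalog-table TABLE_X_TEST \
--hcatalog-database default \
--hcatalog-storage-stanza "stored as orc tblproperties ('orc.compress'='SNAPPY')" \
--map-column-hive "column_1=char(10),column_2=char(35)" \
--num-mappers 1 \
--query "select top 10 \"column_1\", \"column_2\" from TABLE_X where \$CONDITIONS" \
--outdir "/tmp"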
I tried omitting the --null-string and --null-non-string options when creating ORC tables via Sqoop hcatalog; all the NULLs in the source come through as NULL, and I am able to query them with IS NULL.
Let me know if you found any other solution to handle NULLs.

Sqoop import split by column and different database to split that data

I have a requirement: I need to get data from a Teradata database (a) into Hadoop.
I have access only to a view in that database and need to pull data from it. Since I don't have read/write access to that database, I cannot use the --split-by column option in my Sqoop import query.
So is there an option to tell Sqoop to use database b for storing the split data and then move the data into Hadoop?
Query:
sqoop import \
--connect "jdbc:sqlserver://xx.aa.dd.aa;databaseName=a" \
--connection-manager org.apache.sqoop.manager.SQLServerManager \
--username XXXX \
--password XXXX \
--num-mappers 20 \
--query "select * from (select ID,name,x,y,z from TABLE1 where DT between '2018/01/01' and '2018/01/31') as temp_table where updt_date <'2018/01/31' AND \$CONDITIONS" \
--split-by id \
--target-dir /user/XXXX/sqoop_import/XYZ/2018/TABLE1

Sqoop - FileAlreadyExists exception

I need some help with sqoop.
First of all, I'm sorry, my english isn't very good.
I'm using the following command:
sqoop import -D mapreduce.output.fileoutputformat.compress=false --num-mappers 1 \
--connection-manager "com.quest.oraoop.OraOopConnManager" \
--connect "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=myserver)(PORT=1534)))(CONNECT_DATA=(SERVICE_NAME=myservice)))" \
--username "rodrigo" --password pwd \
--query "SELECT column1, column2 from myTable where \$CONDITIONS" \
--null-string '' --null-non-string '' --fields-terminated-by '|' \
--lines-terminated-by '\n' --as-textfile --target-dir /data/rodrigo/myTable \
--hive-import --hive-partition-key yearmonthday --hive-partition-value '20180101' --hive-overwrite --verbose -P --m 1 --hive-table myTable
My table is already created, because I have to file a request to have tables created in my Hive database, so I can't create it dynamically inside the Sqoop command.
I have permission to create the directory in hdfs.
When I remove the directory, Sqoop logs an error saying that I have no create-table permission, and when the directory already exists, it returns a FileAlreadyExistsException.
What can I do to solve that?
Thanks from Brazil.

sqoop issue while importing data from SAP HANA

We are currently moving data from SAP HANA to Hadoop using Sqoop.
SAP HANA table and column names use the '/' character. Our regular Sqoop command works, but it fails when I use --split-by. Can anyone please help?
code:
/usr/hdp/sqoop/bin/sqoop import \
--connect "jdbc:sap://***-***.**.*****.com:30015" \
--username DFIT_SUPP_USR --password **** \
--driver com.sap.db.jdbc.Driver \
--query "select '\"/BA1/C55LGENT/\"' FROM \"_SYS_BIC\".\"sap.fs.frdp.300.RDL/BV_RDL_ZAFI______Z_SLPD\" where \$CONDITIONS and (\"/BA1/C55LGENT\") IN ('0000000671','0000000615') and (\"/BA1/C55LGENT\" != '0000000022') AND (\"/BIC/ZCINTEIND\" ='01') AND (\"/BA1/IGL_ACCOUNT\") IN ( '0000401077', '0000401035') AND (\"/BA1/C55POSTD\">= '20170101' AND \"/BA1/C55POSTD\" <='20170101')" \
--target-dir /user/arekapalli/pfit_export_test12 \
--delete-target-dir \
--split-by //BA1//C55LGENT// \
-m 10
Below is the error we got:
Caused by: com.sap.db.jdbc.exceptions.JDBCDriverException: SAP DBTech JDBC: [257] (at 12): sql syntax error: incorrect syntax near "/": line 1 col 12 (at pos 12)
Your problem is probably here:
--query "select '\"/BA1/C55LGENT/\"' FROM \"_SYS_BIC\".\"sap.fs.frdp.300.RDL/BV_RDL_ZAFI______Z_SLPD\" where \$CONDITIONS and (\"/BA1/C55LGENT\") IN ('0000000671','0000000615') and (\"/BA1/C55LGENT\" != '0000000022') AND (\"/BIC/ZCINTEIND\" ='01') AND (\"/BA1/IGL_ACCOUNT\") IN ( '0000401077', '0000401035') AND (\"/BA1/C55POSTD\">= '20170101' AND \"/BA1/C55POSTD\" <='20170101')" \
You are assuming that the "\" is an escape character interpreted by the terminal; that is probably wrong. Try the following:
--query 'select "/BA1/C55LGENT/" FROM "_SYS_BIC"."sap.fs.frdp.300.RDL/BV_RDL_ZAFI______Z_SLPD" where $CONDITIONS and ("/BA1/C55LGENT") IN ("0000000671","0000000615") and ("/BA1/C55LGENT" != "0000000022") AND ("/BIC/ZCINTEIND" ="01") AND ("/BA1/IGL_ACCOUNT") IN ( "0000401077", "0000401035") AND ("/BA1/C55POSTD">= "20170101" AND "/BA1/C55POSTD" <="20170101")' \
I am not a SAP user, so there could be something wrong with the query, but you can see that I removed all of your single quotes from inside the query and used the single quote as the outer delimiter instead. (Inside single quotes the shell passes $CONDITIONS through literally, so the backslash escape is no longer needed.)
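Note also that the reported syntax error likely comes from the generated split query: the --split-by value is pasted verbatim into Sqoop's MIN/MAX bounds query, and //BA1//C55LGENT// is not a valid HANA identifier (column 12 of "SELECT MIN(...)" is exactly where the first "/" would land). As an untested assumption about HANA identifier quoting, passing the column already double-quoted may help:
--split-by '"/BA1/C55LGENT"' \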

Sqoop with duplicate column name

I wrote a Sqoop import with a duplicate column name (with aliases), but it threw the error message "Duplicate Column identifier specified: 'id'". I modified the query to use a concat function, and now it gives me the error "Hive does not support the SQL type for column a".
sqoop import \
--connect jdbc:mysql://foo.test.net/mfg \
--username pingp \
--password 987yjd \
--hive-import \
--hive-table third_map \
--query "select concat(r.id,'') a, concat(p.id,'') b from tblDimMfg r join tblDimMfg p on r.id = p.id where r.Name = 'bbp' and p.Name = 'bbt' and \$CONDITIONS" \
--target-dir /user/test/hivehome/mysql/third_map \
--fields-terminated-by '\t' \
--hive-drop-import-delims \
-m 1
Any suggestion?
Thank you,
Rio
The resolution is to create a sub-select in which the duplicate column names are aliased; then it works.
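A minimal sketch of what that sub-select might look like, adapted from the query above (the alias names are illustrative and untested):
--query "select t.a, t.b from (select r.id as a, p.id as b from tblDimMfg r join tblDimMfg p on r.id = p.id where r.Name = 'bbp' and p.Name = 'bbt') t where \$CONDITIONS" \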
