Impala incremental last modified - Sqoop

I have a Sqoop import that brings in data from Oracle with a join on two tables. I need to do an --incremental lastmodified import based on a column that is common to both tables:
--query "SELECT customer_info.customer_id, customer_info.customer_name,
customer_info.date_created, sales_info.last_update_date as sales_last_update_date
from customer_info
inner join
sales_info ON customer_info.customer_id = sales_info.customer_id
AND \$CONDITIONS" \
--split-by "customer_id" \
--fields-terminated-by '\t' \
--target-dir (name_of_dir) \
--incremental lastmodified \
--check-column sales_last_update_date \
The last_update_date column is common to both tables.
But I get the error:
ORA-00904: "sales_last_update_date": invalid identifier
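A likely cause: with --incremental lastmodified, Sqoop appends a comparison against the --check-column to the generated WHERE clause, and Oracle does not allow a SELECT alias to be referenced there, hence ORA-00904. A minimal sketch of one workaround, wrapping the join in a sub-select so the alias becomes a real column of the derived table (the alias t and the --last-value timestamp are placeholders, not from the original command):
# t and the --last-value timestamp below are placeholders
--query "SELECT * FROM (SELECT customer_info.customer_id, customer_info.customer_name,
customer_info.date_created, sales_info.last_update_date AS sales_last_update_date
FROM customer_info
INNER JOIN sales_info ON customer_info.customer_id = sales_info.customer_id) t
WHERE \$CONDITIONS" \
--split-by "customer_id" \
--incremental lastmodified \
--check-column sales_last_update_date \
--last-value "2018-01-01 00:00:00"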

Related

Sqoop import split by column and different database to split that data

I have a requirement.
I need to get data from Teradata database a into Hadoop.
I have access to only a view in that database and need to pull data from that view. Since I don't have read/write access to that database, I cannot use the --split-by column option in my Sqoop import query.
So is there an option where I can tell Sqoop to use database b for storing the split data and then move the data into Hadoop?
Query:
sqoop import \
--connect "jdbc:sqlserver://xx.aa.dd.aa;databaseName=a" \
--connection-manager org.apache.sqoop.manager.SQLServerManager \
--username XXXX \
--password XXXX \
--num-mappers 20 \
--query "select * from (select ID,name,x,y,z from TABLE1 where DT between '2018/01/01' and '2018/01/31') as temp_table where updt_date <'2018/01/31' AND \$CONDITIONS" \
--split-by id \
--target-dir /user/XXXX/sqoop_import/XYZ/2018/TABLE1
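For what it's worth, --split-by on its own only makes Sqoop issue a read-only SELECT min/max boundary query against the source, not a write, so read access to the view is normally enough. A sketch under that assumption, using Sqoop's --boundary-query option to state the split range explicitly instead of letting Sqoop derive it:
# the boundary query below is illustrative; any query returning the min and max split values works
sqoop import \
--connect "jdbc:sqlserver://xx.aa.dd.aa;databaseName=a" \
--connection-manager org.apache.sqoop.manager.SQLServerManager \
--username XXXX \
--password XXXX \
--num-mappers 20 \
--query "select * from (select ID,name,x,y,z from TABLE1 where DT between '2018/01/01' and '2018/01/31') as temp_table where updt_date <'2018/01/31' AND \$CONDITIONS" \
--split-by id \
--boundary-query "select min(ID), max(ID) from TABLE1" \
--target-dir /user/XXXX/sqoop_import/XYZ/2018/TABLE1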

Sqoop - FileAlreadyExists exception

I need some help with Sqoop.
First of all, I'm sorry, my English isn't very good.
Using the following command:
sqoop import -D mapreduce.output.fileoutputformat.compress=false --num-mappers 1 --connection-manager "com.quest.oraoop.OraOopConnManager" --connect "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=myserver)(PORT=1534)))(CONNECT_DATA=(SERVICE_NAME=myservice)))" --username "rodrigo" --password pwd \
--query "SELECT column1, column2 from myTable where \$CONDITIONS" \
--null-string '' --null-non-string '' --fields-terminated-by '|' \
--lines-terminated-by '\n' --as-textfile --target-dir /data/rodrigo/myTable \
--hive-import --hive-partition-key yearmonthday --hive-partition-value '20180101' --hive-overwrite --verbose -P --m 1 --hive-table myTable
My table is already created, because I must file a request to have a table created in my Hive database, so I can't create it dynamically inside the Sqoop command.
I have permission to create the directory in hdfs.
When I remove the directory, Sqoop logs an error saying that I have no create-table permissions, and when the directory already exists, it returns a FileAlreadyExistsException.
What can I do to solve that?
Thanks from Brazil.
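One possible workaround, assuming the leftover target directory is what triggers the exception: Sqoop's --delete-target-dir flag clears the target directory before the import runs. A sketch, with the duplicated -P/--m flags from the original command dropped (the create-table permission error would still need a Hive-side grant):
# --delete-target-dir is the addition; everything else mirrors the command above
sqoop import -D mapreduce.output.fileoutputformat.compress=false --num-mappers 1 \
--connection-manager "com.quest.oraoop.OraOopConnManager" \
--connect "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=myserver)(PORT=1534)))(CONNECT_DATA=(SERVICE_NAME=myservice)))" \
--username "rodrigo" --password pwd \
--query "SELECT column1, column2 from myTable where \$CONDITIONS" \
--null-string '' --null-non-string '' --fields-terminated-by '|' \
--lines-terminated-by '\n' --as-textfile \
--delete-target-dir \
--target-dir /data/rodrigo/myTable \
--hive-import --hive-partition-key yearmonthday --hive-partition-value '20180101' --hive-overwrite --hive-table myTable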

Incremental import - to avoid duplication of rows

Consider a table departments with the below data:
ID    Name
1     A
2     B
3     C
8000  D
I imported the data into HDFS using Sqoop.
I added 2 new rows with IDs 4 and 5 into MySQL.
I performed an incremental import with last value 3 and mode append.
The imported data has two rows for ID 8000, as the condition used is department_id > 3.
How can I tweak the below command to make sure duplicate rows are not created?
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
--password=cloudera \
--table departments \
--target-dir /user/cloudera/dep1 \
--append \
--check-column "department_id" \
--incremental append \
--last-value 3
You cannot tweak this command.
--incremental append is for appending new data where --check-column > --last-value.
For your use case you should use --incremental lastmodified.
--check-column should be of a date, time, datetime, or timestamp data type.
If records were added or updated after --last-value, it will fetch all of those records (new or updated).
Sample command:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
--password=cloudera \
--table departments \
--target-dir /user/cloudera/dep1 \
--incremental lastmodified \
--check-column last_update_date \
--last-value "2015-10-20 06:00:01"
All the records added or updated after "2015-10-20 06:00:01" will be imported.
Check the Sqoop documentation for more details.
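If existing rows can also be updated in place, a lastmodified import can additionally merge the fresh version of a row over the old one via Sqoop's --merge-key option. A sketch assuming department_id is the table's primary key and a last_update_date column exists as above:
# assumes department_id uniquely identifies a row; updated rows then replace their older copies
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
--password=cloudera \
--table departments \
--target-dir /user/cloudera/dep1 \
--incremental lastmodified \
--check-column last_update_date \
--last-value "2015-10-20 06:00:01" \
--merge-key department_id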

Sqoop inserting data into wrong Hive column from RDBMS table

I have a table called 'employee' in SQL Server:
ID  NAME  ADDRESS  DESIGNATION
1   Jack  XXX      Clerk
2   John  YYY      Engineer
I have created an external table (emp) in Hive, and through a Sqoop import I imported data from employee to the Hive table using the --query argument of Sqoop. If I mention --query as 'select * from employee', then the data gets inserted into the Hive table correctly. But if I mention --query as 'select ID,NAME,DESIGNATION from employee', then the data in the DESIGNATION column of the 'employee' table (RDBMS) gets inserted into the ADDRESS column of the 'emp' table instead of the DESIGNATION column. When I run the below Hive query:
select designation from emp;
I get the values:
NULL
NULL
instead of:
Clerk
Engineer
But if I run the Hive query:
select address from emp;
I get the values:
Clerk
Engineer
instead of:
NULL
NULL
Any ideas for fixing this incorrect data would be of great help. I am currently using Hive 0.11, so I can't use Hive insert queries, which are available from Hive 0.14.
OK, I'll show you a sample.
sqoop import --connect jdbc:mysql://host:port/db'?useUnicode=true&characterEncoding=utf-8' \
--username 'xxxx' \
--password 'xxxx' \
--table employee \
--columns 'ID,NAME,DESIGNATION' \
--where 'aaa=bbb' \
-m 1 \
--target-dir hdfs://nameservice1/dir \
--fields-terminated-by '\t' \
--hive-import \
--hive-overwrite \
--hive-drop-import-delims \
--null-non-string '\\N' \
--null-string '\\N' \
--hive-table 'hive_db.hive_tb' \
--hive-partition-key 'pt' \
--hive-partition-value '2016-01-20'
Some of the parameters are optional.
Sqoop syntax details:
http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_import_literal
The Sqoop statement will import the data to the HDFS directory as (assuming the field separator is ','):
1,Jack,Clerk
2,John,Engineer
So the ADDRESS column will have the DESIGNATION data, and the DESIGNATION column will be null.
You can try --query "select ID, NAME, '', DESIGNATION from employee where \$CONDITIONS"; this should work.

Sqoop with duplicate column name

I wrote a Sqoop import with a duplicate column name (with aliases), but it threw the error message "Duplicate Column identifier specified: 'id'". I modified the Sqoop query to use the concat function, and now it gives me the error "Hive does not support the SQL type for column a".
sqoop import \
--connect jdbc:mysql://foo.test.net/mfg \
--username pingp \
--password 987yjd \
--hive-import \
--hive-table third_map \
--query "select concat(r.id,'') a, concat(p.id,'') b from tblDimMfg r join tblDimMfg p on r.id = p.id where r.Name = 'bbp' and p.Name = 'bbt' and \$CONDITIONS" \
--target-dir /user/test/hivehome/mysql/third_map \
--fields-terminated-by '\t' \
--hive-drop-import-delims \
-m 1
Any suggestion?
Thank you,
Rio
The resolution is to create a sub-select that aliases the duplicate column names; then it works.
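A sketch of that sub-select, reusing the query from the question (the derived-table alias t is made up; aliases a and b are assigned inside the sub-select, so the outer query no longer sees two columns named id):
# t is an arbitrary alias for the derived table
--query "select t.a, t.b from (select r.id a, p.id b from tblDimMfg r join tblDimMfg p on r.id = p.id where r.Name = 'bbp' and p.Name = 'bbt') t where \$CONDITIONS" \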
