Incremental Sqoop from Oracle to HDFS with condition - hadoop

I am doing an incremental Sqoop import from Oracle to HDFS with a where condition like
(LST_UPD_TMST > TO_TIMESTAMP('2016-05-31T18:55Z', 'YYYY-MM-DD"T"HH24:MI"Z"')
AND LST_UPD_TMST <= TO_TIMESTAMP('2016-09-13T08:51Z', 'YYYY-MM-DD"T"HH24:MI"Z"'))
But it is not using the index. How can I force an index so that Sqoop is faster by reading only the filtered records?
What is the best option for an incremental Sqoop import? The table size in Oracle is in TBs.
The table has billions of rows, and after the where condition only some millions remain.

You can use --where, or --query with a where condition in the select, to filter the import results.
I am not sure about your full Sqoop command, but try something like this:
sqoop import \
--connect jdbc:oracle:thin:@//db.example.com/dbname \
--username dbusername \
--password dbpassword \
--table tablename \
--columns "column,names,to,select,in,comma,separated" \
--where "(LST_UPD_TMST > TO_TIMESTAMP('2016-05-31T18:55Z', 'YYYY-MM-DD\"T\"HH24:MI\"Z\"') AND LST_UPD_TMST <= TO_TIMESTAMP('2016-09-13T08:51Z', 'YYYY-MM-DD\"T\"HH24:MI\"Z\"'))" \
--target-dir {hdfs/location/to/save/data/from/oracle} \
--incremental lastmodified \
--check-column LST_UPD_TMST \
--last-value {from Date/Timestamp to Sqoop in incremental}
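If the Oracle optimizer still ignores the index on LST_UPD_TMST, a free-form --query import lets you embed an index hint directly in the SELECT. This is only a minimal sketch: the index name IDX_LST_UPD_TMST and the numeric split column ID are hypothetical placeholders, not from the original post, and Oracle treats hints as advice, so check the execution plan actually changes.
sqoop import \
--connect jdbc:oracle:thin:@//db.example.com/dbname \
--username dbusername \
--password dbpassword \
--query "SELECT /*+ INDEX(t IDX_LST_UPD_TMST) */ t.* FROM tablename t WHERE t.LST_UPD_TMST > TO_TIMESTAMP('2016-05-31T18:55Z', 'YYYY-MM-DD\"T\"HH24:MI\"Z\"') AND t.LST_UPD_TMST <= TO_TIMESTAMP('2016-09-13T08:51Z', 'YYYY-MM-DD\"T\"HH24:MI\"Z\"') AND \$CONDITIONS" \
--split-by ID \
--target-dir {hdfs/location/to/save/data/from/oracle}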
Check the Sqoop documentation for more details about incremental loads.
Update
For incremental imports, a Sqoop saved job is recommended, since it maintains --last-value automatically.
sqoop job --create {incremental job name} \
-- import \
--connect jdbc:oracle:thin:@//db.example.com/dbname \
--username dbusername \
--password dbpassword \
--table tablename \
--columns "column,names,to,select,in,comma,separeted" \
--incremental lastmodified \
--check-column LST_UPD_TMST \
--last-value 0
Here --last-value 0 imports everything the first time; on each subsequent invocation the sqoop job passes the latest stored value automatically.
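After the job is created, each incremental run is launched with --exec, and --show lets you inspect the stored parameters (including the saved last value), using the same job-name placeholder as above:
sqoop job --exec {incremental job name}
sqoop job --show {incremental job name}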

Related

Sqoop import split by column and different database to split that data

I have a requirement:
I need to get data from a Teradata database a into Hadoop.
I have access only to a view in that database and need to pull data from that view. Since I don't have read/write access to that database, I cannot use the --split-by column option in my sqoop import query.
So is there an option to tell Sqoop to use database b for storing the split data and then move the data into Hadoop?
Query:
sqoop import \
--connect "jdbc:sqlserver://xx.aa.dd.aa;databaseName=a" \
--connection-manager org.apache.sqoop.manager.SQLServerManager \
--username XXXX \
--password XXXX \
--num-mappers 20 \
--query "select * from (select ID,name,x,y,z from TABLE1 where DT between '2018/01/01' and '2018/01/31') as temp_table where updt_date <'2018/01/31' AND \$CONDITIONS" \
--split-by id \
--target-dir /user/XXXX/sqoop_import/XYZ/2018/TABLE1
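Note that --split-by itself needs only read access: Sqoop just runs a MIN/MAX boundary query on the split column and never writes split data back into the source database, so a second database b should not be needed. If even that automatic boundary query is undesirable on the view, one possible workaround (an assumption, not from the original thread) is to supply the boundaries yourself with --boundary-query, roughly like this:
sqoop import \
--connect "jdbc:sqlserver://xx.aa.dd.aa;databaseName=a" \
--connection-manager org.apache.sqoop.manager.SQLServerManager \
--username XXXX \
--password XXXX \
--num-mappers 20 \
--query "select ID,name,x,y,z from TABLE1 where DT between '2018/01/01' and '2018/01/31' and updt_date < '2018/01/31' AND \$CONDITIONS" \
--split-by ID \
--boundary-query "select min(ID), max(ID) from TABLE1 where DT between '2018/01/01' and '2018/01/31'" \
--target-dir /user/XXXX/sqoop_import/XYZ/2018/TABLE1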

Sqoop - FileAlreadyExists exception

I need some help with Sqoop.
First of all, I'm sorry, my English isn't very good.
I am using the following command:
sqoop import -D mapreduce.output.fileoutputformat.compress=false --num-mappers 1 \
--connection-manager "com.quest.oraoop.OraOopConnManager" \
--connect "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=myserver)(PORT=1534)))(CONNECT_DATA=(SERVICE_NAME=myservice)))" \
--username "rodrigo" --password pwd \
--query "SELECT column1, column2 from myTable where \$CONDITIONS" \
--null-string '' --null-non-string '' --fields-terminated-by '|' \
--lines-terminated-by '\n' --as-textfile --target-dir /data/rodrigo/myTable \
--hive-import --hive-partition-key yearmonthday --hive-partition-value '20180101' --hive-overwrite --verbose -P --m 1 --hive-table myTable
My table is already created, because I must open a request to have a table created in my Hive database, so I can't create it dynamically inside the Sqoop command.
I have permission to create the directory in HDFS.
When I remove the directory, Sqoop logs an error saying that I have no create-table permissions, and when I create the directory beforehand, it returns a FileAlreadyExistsException.
What can I do to solve that?
Thanks from Brazil.
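One thing that may be worth trying (a suggestion, not from the original thread, and only under the assumption that the staging directory holds nothing you need to keep) is letting Sqoop clear the existing target directory itself with --delete-target-dir instead of removing it manually, e.g.:
sqoop import -D mapreduce.output.fileoutputformat.compress=false --num-mappers 1 \
--connection-manager "com.quest.oraoop.OraOopConnManager" \
--connect "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=myserver)(PORT=1534)))(CONNECT_DATA=(SERVICE_NAME=myservice)))" \
--username "rodrigo" -P \
--query "SELECT column1, column2 from myTable where \$CONDITIONS" \
--null-string '' --null-non-string '' --fields-terminated-by '|' \
--lines-terminated-by '\n' --as-textfile \
--delete-target-dir --target-dir /data/rodrigo/myTable \
--hive-import --hive-table myTable --hive-overwrite \
--hive-partition-key yearmonthday --hive-partition-value '20180101'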

Sqoop job incremental lastmodified wrong timestamp value

I am trying to create a Sqoop Job using incremental lastmodified
sqoop job --create job_import_test8_by_query_update -- import \
--bindir ./ --connect 'jdbc:mysql://localhost/db?serverTimezone=UTC&useSSL=false' \
--username user \
--password pass \
--table test8 -m 2 \
--incremental lastmodified \
--check-column "timestamp_field" \
--last-value 0 \
--split-by "id" \
--merge-key "id" \
--verbose \
--target-dir /usr/local/sqlImport/1
In this example I am having a problem with --last-value.
The first run, when --last-value is "0", works fine. But the stored last value is then automatically set to the current local time + 4 hours, so I am losing some records.
It seems that the last value is taken from the server timezone instead of from the last record's value in the database.
Thanks for any help!
Try adding the useTimezone option to your connection string:
--connect 'jdbc:mysql://localhost/db?useTimezone=true&serverTimezone=UTC'
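Since the saved job stores the whole connect string, you would then drop and recreate the job with the adjusted URL (sqoop job --show lets you check the stored options, including the saved last value), for example:
sqoop job --show job_import_test8_by_query_update
sqoop job --delete job_import_test8_by_query_update
sqoop job --create job_import_test8_by_query_update -- import \
--bindir ./ --connect 'jdbc:mysql://localhost/db?useTimezone=true&serverTimezone=UTC&useSSL=false' \
--username user \
--password pass \
--table test8 -m 2 \
--incremental lastmodified \
--check-column "timestamp_field" \
--last-value 0 \
--split-by "id" \
--merge-key "id" \
--target-dir /usr/local/sqlImport/1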

Incremental import - to avoid duplication of rows

Consider a table departments with the data below:
ID   - 1, 2, 3, 8000
Name - A, B, C, D
I imported the data into HDFS using Sqoop.
Added 2 new rows with IDs 4 and 5 into MySQL.
Performed an incremental import with last value 3 and mode=append.
The imported data has two rows for ID 8000, as the condition used is department_id > 3.
How can I tweak the command below to make sure duplicate rows are not created?
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
--password=cloudera \
--table departments \
--target-dir /user/cloudera/dep1 \
--append \
--check-column "department_id" \
--incremental append \
--last-value 3
You cannot tweak this command.
--incremental append is for appending new data where --check-column > --last-value.
For your use case you should use --incremental lastmodified.
--check-column should be of a date, time, datetime or timestamp data type.
If you added new records after --last-value, it will fetch all the records (new or updated).
Sample command:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
--password=cloudera \
--table departments \
--target-dir /user/cloudera/dep1 \
--incremental lastmodified \
--check-column last_update_date \
--last-value "2015-10-20 06:00:01"
All the records added after "2015-10-20 06:00:01" will be imported.
Check the Sqoop documentation for more details.
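Note that when a lastmodified import targets a directory that already contains earlier data, Sqoop requires either --append or --merge-key; with --merge-key it runs a merge pass that keeps only the newest version of each row, which is what actually prevents duplicates. A sketch extending the sample above, assuming department_id is the primary key:
sqoop import \
--connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
--username=retail_dba \
--password=cloudera \
--table departments \
--target-dir /user/cloudera/dep1 \
--incremental lastmodified \
--check-column last_update_date \
--last-value "2015-10-20 06:00:01" \
--merge-key department_id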

Run a sqoop job on a specific queue

I'm trying to create a Sqoop job that runs in a specific queue, but it doesn't work.
I've tried two things:
1st: declare the queue in the job creation
sqoop job \
--create myjob \
-- import \
--connect jdbc:teradata://RCT/DATABASE=MYDB \
-Dmapred.job.queue.name=shortduration \
--driver com.teradata.jdbc.TeraDriver \
--username DBUSER -P \
--query "$query" \
--target-dir /data/source/dest/$i \
--check-column DAT_CRN_AGG \
--incremental append \
--last-value 2001-01-01 \
--split-by NUM_CTR
But it throws an argument parsing error due to -Dmapred.job.queue.name=shortduration.
2nd: remove -Dmapred.job.queue.name=shortduration from the job creation. The job creation works well, but then I am unable to specify which queue should be used.
I'm losing hope of running my job in this queue.
Thanks for any help provided !
EDIT: I got an import working with sqoop import -Dmapred.job.queue.name=shortduration, but the sqoop job still does not work.
I think you have an error in your command:
-Dmapreduce.job.queuename=NameOfTheQueue
Note that queuename is one word, and the order matters: based on the documentation, the generic -D arguments need to come before any tool-specific arguments, directly after the tool name.
https://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_using_generic_and_specific_arguments
Generic Hadoop command-line arguments:
(must precede any tool-specific arguments)
Generic options supported are:
-conf <configuration file>   specify an application configuration file
-D <property=value>          use value for given property
sqoop job -Dmapreduce.job.queuename=shortduration \
--create myjob \
-- import \
--connect jdbc:teradata://RCT/DATABASE=MYDB \
--driver com.teradata.jdbc.TeraDriver \
--username DBUSER -P \
--query "$query" \
--target-dir /data/source/dest/$i \
--check-column DAT_CRN_AGG \
--incremental append \
--last-value 2001-01-01 \
--split-by NUM_CTR
You might just want to try it with the import tool first to see if it is working correctly, and then create the job, i.e.
sqoop import -Dmapreduce.job.queuename=shortduration \
--connect jdbc:teradata://RCT/DATABASE=MYDB \
--driver com.teradata.jdbc.TeraDriver \
--username DBUSER -P \
--query "$query" \
--target-dir /data/source/dest/$i \
--check-column DAT_CRN_AGG \
--incremental append \
--last-value 2001-01-01 \
--split-by NUM_CTR
