Sqoop export inserting duplicate entries into a table without a primary key - hadoop

I have a table with the columns department_id, department_name, LastModifieddate.
When I run the command below:
sqoop export \
--connect "jdbc:mysql://ip-172-31-13-154:3306/sqoopex" \
--username sqoopuser \
--password NHkkP876rp \
--table dep_prasad \
--input-fields-terminated-by '|' \
--input-lines-terminated-by '\n' \
--export-dir /user/venkateswarlujvs2821/dep_prasad/ \
--num-mappers 2 \
--outdir /user/venkateswarlujvs2821/dep_prasad
It works fine and inserts the records. But when I modify the file in HDFS, add some more records, and try to export it again, it inserts duplicate entries into MySQL. I am using the following sqoop command the second time:
sqoop export \
--connect "jdbc:mysql://ip-172-31-13-154:3306/sqoopex" \
--username sqoopuser \
--password NHkkP876rp \
--table dep_prasad \
--input-fields-terminated-by '|' \
--input-lines-terminated-by '\n' \
--update-key department_id \
--update-mode allowinsert \
--export-dir /user/venkateswarlujvs2821/dep_prasad/ \
--num-mappers 2 \
--outdir /user/venkateswarlujvs2821/dep_prasad
Note: my table DOES NOT HAVE a PRIMARY KEY.
I want to update only the new records. How can I do that?
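For what it's worth, the likely cause is that against MySQL, --update-mode allowinsert makes Sqoop generate INSERT ... ON DUPLICATE KEY UPDATE statements, and those can only detect an existing row when there is a PRIMARY KEY or UNIQUE index on the --update-key column. A minimal workaround sketch, assuming department_id is meant to be unique (the index name uq_department_id is made up here):
# Add a unique index so MySQL can recognise existing department_id values;
# without it, ON DUPLICATE KEY UPDATE never fires and every row is inserted again.
mysql -h ip-172-31-13-154 -u sqoopuser -p sqoopex \
  -e "ALTER TABLE dep_prasad ADD UNIQUE KEY uq_department_id (department_id);"
After that, re-running the second export command unchanged should update the existing departments and insert only the new ones.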

Related

Sqoop - FileAlreadyExists exception

I need some help with sqoop.
First of all, I'm sorry, my English isn't very good.
Using the following command:
sqoop import -D mapreduce.output.fileoutputformat.compress=false \
--num-mappers 1 \
--connection-manager "com.quest.oraoop.OraOopConnManager" \
--connect "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=myserver)(PORT=1534)))(CONNECT_DATA=(SERVICE_NAME=myservice)))" \
--username "rodrigo" --password pwd \
--query "SELECT column1, column2 from myTable where \$CONDITIONS" \
--null-string '' --null-non-string '' --fields-terminated-by '|' \
--lines-terminated-by '\n' --as-textfile --target-dir /data/rodrigo/myTable \
--hive-import --hive-partition-key yearmonthday --hive-partition-value '20180101' --hive-overwrite --verbose -P --m 1 --hive-table myTable
My table is already created, because I have to file a request to get a table created in my Hive database, so I can't create it dynamically from the sqoop command.
I have permission to create the directory in HDFS.
When I remove the directory, sqoop logs an error saying that I have no create-table permission, and when the directory already exists, it returns a FileAlreadyExistsException.
What can I do to solve that?
Thanks from Brazil.
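Not a definitive answer, but since the Hive table already exists, one thing worth trying is to drop --hive-import/--target-dir in favour of the HCatalog options, which write directly into the existing table and skip both the staging directory and the CREATE TABLE step. A sketch only, assuming the table lives in the default database ("default" is a placeholder) and is partitioned by yearmonthday:
# Load straight into the pre-created Hive table via HCatalog; no staging dir is used.
sqoop import \
--connection-manager "com.quest.oraoop.OraOopConnManager" \
--connect "jdbc:oracle:thin:@(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=myserver)(PORT=1534)))(CONNECT_DATA=(SERVICE_NAME=myservice)))" \
--username "rodrigo" -P \
--query "SELECT column1, column2 FROM myTable WHERE \$CONDITIONS" \
--hcatalog-database default \
--hcatalog-table myTable \
--hcatalog-partition-keys yearmonthday \
--hcatalog-partition-values '20180101' \
--num-mappers 1 --verbose
Note that with HCatalog the delimiters and file format come from the existing table definition rather than from command-line options.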

Sqoop job incremental lastmodified wrong timestamp value

I am trying to create a Sqoop job using incremental lastmodified:
sqoop job --create job_import_test8_by_query_update -- import \
--bindir ./ --connect 'jdbc:mysql://localhost/db?serverTimezone=UTC&useSSL=false' \
--username user \
--password pass \
--table test8 -m 2 \
--incremental lastmodified \
--check-column "timestamp_field" \
--last-value 0 \
--split-by "id" \
--merge-key "id" \
--verbose \
--target-dir /usr/local/sqlImport/1
In this example I am having a problem with --last-value.
The first run, with --last-value "0", works fine. But then the saved last value is automatically set to the current local time + 4 hours, so I am losing some records.
It seems that the last value is taken from the server timezone instead of from the last record value in the database.
Thanks for any help!
Try adding the useTimezone option to your connection string:
--connect 'jdbc:mysql://localhost/db?useTimezone=true&serverTimezone=UTC'
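Applied to the job above, only the connection string changes, but the saved job has to be recreated for the new value to be stored with it. A rough sketch:
# Drop the old job definition and recreate it with the amended JDBC options.
sqoop job --delete job_import_test8_by_query_update
sqoop job --create job_import_test8_by_query_update -- import \
--connect 'jdbc:mysql://localhost/db?useTimezone=true&serverTimezone=UTC&useSSL=false' \
--username user --password pass \
--table test8 -m 2 \
--incremental lastmodified --check-column "timestamp_field" --last-value 0 \
--split-by "id" --merge-key "id" \
--target-dir /usr/local/sqlImport/1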

sqoop hive import with partitions

I have some sqoop jobs importing into Hive that I want to partition, but I can't get it to work. The import itself succeeds: the table is sqooped, it's visible in Hive, and there's data, but the partition parameters I'm expecting to see don't appear when I describe the table. I HAVE sqooped this table as a CSV, created an external parquet table, and inserted the data into that (which works), but I want to avoid the extra steps if possible. Here's my current code. Am I missing something, or am I trying to do the impossible? Thanks!
sqoop import -Doraoop.import.hint=" " \
--options-file /home/[user]/pass.txt \
--verbose \
--connect jdbc:oracle:thin:@ldap://oid:389/cn=OracleContext,dc=[employer],dc=com/SQSOP051 \
--username [user] \
--num-mappers 10 \
--hive-import \
--query "select DISC_PROF_SK_ID, CLM_RT_DISC_IND, EASY_PAY_PLN_DISC_IND, TO_CHAR(L40_ATOMIC_TS,'YYYY') as YEAR, TO_CHAR(L40_ATOMIC_TS,'MM') as MONTH from ${DataSource[index]}.$TableName where \$CONDITIONS" \
--hive-database [dru_user] \
--hcatalog-partition-keys YEAR \
--hcatalog-partition-values '2015' \
--target-dir hdfs://nameservice1/data/res/warehouse/finance/[dru_user]/Claims_Data/$TableName \
--hive-table $TableName'testing' \
--split-by ${SplitBy[index]} \
--delete-target-dir \
--direct \
--null-string '\\N' \
--null-non-string '\\N' \
--as-parquetfile
You can replace the options file with --password-file. However, that will not solve the partition problem. For the partition problem, try creating the partitioned table $TableName first, before the import.
sqoop import -Doraoop.import.hint=" " \
--password-file /home/[user]/pass.txt \
--verbose \
--connect jdbc:oracle:thin:@ldap://oid:389/cn=OracleContext,dc=[employer],dc=com/SQSOP051 \
--username [user] \
--num-mappers 10 \
--hive-import \
--query "SELECT disc_prof_sk_id,
clm_rt_disc_ind,
easy_pay_pln_disc_ind,
To_char(l40_atomic_ts,'YYYY') AS year,
To_char(l40_atomic_ts,'MM') AS month
FROM ${DataSource[index]}.$tablename
WHERE \$CONDITIONS" \
--hcatalog-database [dru_user] \
--hcatalog-partition-keys YEAR \
--hcatalog-partition-values '2015' \
--target-dir hdfs://nameservice1/data/res/warehouse/finance/[dru_user]/Claims_Data/$TableName \
--hcatalog-table $TableName \
--split-by ${SplitBy[index]} \
--delete-target-dir \
--direct \
--null-string '\\N' \
--null-non-string '\\N' \
--as-parquetfile
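If the partitioned table does not exist yet, it can be pre-created from the shell before running the import. A rough sketch, where the column types are guesses from the query and [dru_user]/$TableName follow the placeholders used above:
# Pre-create the partitioned parquet table that the hcatalog import will load into.
hive -e "
CREATE TABLE IF NOT EXISTS [dru_user].${TableName} (
  disc_prof_sk_id       BIGINT,
  clm_rt_disc_ind       STRING,
  easy_pay_pln_disc_ind STRING,
  month                 STRING
)
PARTITIONED BY (year STRING)
STORED AS PARQUET;
"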

Incremental sqoop from oracle to hdfs with condition

I am doing an incremental sqoop from Oracle to HDFS with a where condition like
(LST_UPD_TMST >TO_TIMESTAMP('2016-05-31T18:55Z', 'YYYY-MM-DD"T"HH24:MI"Z"')
AND LST_UPD_TMST <= TO_TIMESTAMP('2016-09-13T08:51Z', 'YYYY-MM-DD"T"HH24:MI"Z"'))
But it is not using the index. How can I force an index so that sqoop is faster by considering only the filtered records?
What is the best option for doing an incremental sqoop? The table size in Oracle is in the TBs; the table has billions of rows, and after the where condition it is down to a few million.
You can use --where or --query with a where condition in the select to filter the import results.
I am not sure about your full sqoop command, so just give it a try this way:
sqoop import \
--connect jdbc:oracle:thin:@//db.example.com/dbname \
--username dbusername \
--password dbpassword \
--table tablename \
--columns "column,names,to,select,in,comma,separeted" \
--where "(LST_UPD_TMST >TO_TIMESTAMP('2016-05-31T18:55Z', 'YYYY-MM-DD\"T\"HH24:MI\"Z\"') AND LST_UPD_TMST <= TO_TIMESTAMP('2016-09-13T08:51Z', 'YYYY-MM-DD\"T\"HH24:MI\"Z\"'))" \
--target-dir {hdfs/location/to/save/data/from/oracle} \
--incremental lastmodified \
--check-column LST_UPD_TMST \
--last-value {from Date/Timestamp to Sqoop in incremental}
Check more details about sqoop incremental load
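On the index part of the question: --where simply becomes part of the generated SELECT, so whether the index is used is up to the Oracle optimizer. If you really need to force it, one option is to switch to --query and embed an Oracle index hint. A sketch only, where IDX_LST_UPD_TMST is a made-up index name and t is the table alias:
sqoop import \
--connect jdbc:oracle:thin:@//db.example.com/dbname \
--username dbusername \
--password dbpassword \
--query "SELECT /*+ INDEX(t IDX_LST_UPD_TMST) */ t.* FROM tablename t
         WHERE LST_UPD_TMST > TO_TIMESTAMP('2016-05-31T18:55Z', 'YYYY-MM-DD\"T\"HH24:MI\"Z\"')
         AND LST_UPD_TMST <= TO_TIMESTAMP('2016-09-13T08:51Z', 'YYYY-MM-DD\"T\"HH24:MI\"Z\"')
         AND \$CONDITIONS" \
--split-by {numeric column to split on} \
--target-dir {hdfs/location/to/save/data/from/oracle}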
Update
For incremental imports, a Sqoop saved job is recommended, so that --last-value is maintained automatically.
sqoop job --create {incremental job name} \
-- import \
--connect jdbc:oracle:thin:@//db.example.com/dbname \
--username dbusername \
--password dbpassword \
--table tablename \
--columns "column,names,to,select,in,comma,separeted" \
--incremental lastmodified \
--check-column LST_UPD_TMST \
--last-value 0
Here --last-value 0 imports everything from the start the first time; on subsequent invocations the latest value is passed automatically by the sqoop job.
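Once created, the saved job is run with --exec, and --show lets you check what it has stored. Something like:
# Execute the incremental import; the stored last value is advanced after a successful run.
sqoop job --exec {incremental job name}
# Inspect the stored job definition, including the saved last value.
sqoop job --show {incremental job name}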

Run a sqoop job on a specific queue

I'm trying to create a Sqoop job that runs in a specific queue, but it doesn't work.
I've tried two things:
1st: declare the queue in the job creation
sqoop job \
--create myjob \
-- import \
--connect jdbc:teradata://RCT/DATABASE=MYDB \
-Dmapred.job.queue.name=shortduration \
--driver com.teradata.jdbc.TeraDriver \
--username DBUSER -P \
--query "$query" \
--target-dir /data/source/dest/$i \
--check-column DAT_CRN_AGG \
--incremental append \
--last-value 2001-01-01 \
--split-by NUM_CTR
But it throws an argument-parsing error due to -Dmapred.job.queue.name=shortduration.
2nd: remove -Dmapred.job.queue.name=shortduration from the job creation. The job creation then works fine, but I am unable to specify which queue should be used.
I'm losing hope of running my job in this queue.
Thanks for any help provided!
EDIT: I got an import working with sqoop import -Dmapred.job.queue.name=shortduration, but the sqoop job still does not work.
I think you have an error in your command; it should be
-Dmapreduce.job.queuename=NameOfTheQueue
Note that queuename is one word, and that the order matters: based on the documentation, the generic VM args need to go directly after the import.
https://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_using_generic_and_specific_arguments
Generic Hadoop command-line arguments:
(must preceed any tool-specific arguments)
Generic options supported are
-conf <configuration file>    specify an application configuration file
-D <property=value>           use value for given property
sqoop job -Dmapreduce.job.queuename=shortduration \
--create myjob \
-- import \
--connect jdbc:teradata://RCT/DATABASE=MYDB \
--driver com.teradata.jdbc.TeraDriver \
--username DBUSER -P \
--query "$query" \
--target-dir /data/source/dest/$i \
--check-column DAT_CRN_AGG \
--incremental append \
--last-value 2001-01-01 \
--split-by NUM_CTR
You might just want to try it with the import tool first to see whether it works correctly, and then do the job command, i.e.:
sqoop import -Dmapreduce.job.queuename=shortduration \
--connect jdbc:teradata://RCT/DATABASE=MYDB \
--driver com.teradata.jdbc.TeraDriver \
--username DBUSER -P \
--query "$query" \
--target-dir /data/source/dest/$i \
--check-column DAT_CRN_AGG \
--incremental append \
--last-value 2001-01-01 \
--split-by NUM_CTR
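If the property saved at creation time still does not take effect, it may also be worth passing the generic option again when executing the saved job (an untested sketch; generic -D options go immediately after "sqoop job", before tool-specific arguments such as --exec):
# Run the saved job, setting the queue at execution time.
sqoop job -Dmapreduce.job.queuename=shortduration --exec myjob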
