Exporting data from Teradata to HDFS using TDCH - hadoop

I'm trying to export a table from Teradata into a file in HDFS using TDCH.
I'm using the following parameters:
hadoop jar $TDCH_JAR com.teradata.connector.common.tool.ConnectorImportTool \
-libjars $LIB_JARS \
-Dmapred.job.queue.name=default \
-Dtez.queue.name=default \
-Dmapred.job.name=TDCH \
-classname com.teradata.jdbc.TeraDriver \
-url jdbc:teradata://$ipServer/logmech=ldap,database=$database,charset=UTF16 \
-jobtype hdfs \
-fileformat textfile \
-separator ',' \
-enclosedby '"' \
-targettable ${targetTable} \
-username ${userName} \
-password ${password} \
-sourcequery "select * from ${database}.${targetTable}" \
-nummappers 1 \
-sourcefieldnames "" \
-targetpaths ${targetPaths}
It's working, but I need headers in the file. When I add the parameter:
-targetfieldnames "ID","JOB","DESC","DT","REG" \
it doesn't work; the file isn't even generated anymore.
Can anyone help me?

The -targetfieldnames option is only valid for -jobtype hive.
It does not put headers in the HDFS file; it specifies the Hive column names.
(There is no TDCH option to prefix the CSV output with a header record.)
Also, the value supplied for -targetfieldnames should be a single comma-separated string like "ID,JOB,DESC,DT,REG" rather than a list of quoted strings.
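If you need a header record, one workaround is to prepend it yourself after the TDCH job finishes. A minimal sketch, assuming -nummappers 1 so there is a single part file; the part file name, temp paths, and output file name are illustrative:
# copy the single part file out of HDFS, prepend a header line, and put it back
hdfs dfs -cat ${targetPaths}/part-m-00000 > /tmp/export.csv
echo 'ID,JOB,DESC,DT,REG' | cat - /tmp/export.csv > /tmp/export_with_header.csv
hdfs dfs -put -f /tmp/export_with_header.csv ${targetPaths}/export_with_header.csv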

Related

Sqoop export having duplicate entries into table without primary key

I have a table with columns department_id, department_name, LastModifieddate.
When I run the command below:
sqoop export \
--connect "jdbc:mysql://ip-172-31-13-154:3306/sqoopex" \
--username sqoopuser \
--password NHkkP876rp \
--table dep_prasad \
--input-fields-terminated-by '|' \
--input-lines-terminated-by '\n' \
--export-dir /user/venkateswarlujvs2821/dep_prasad/ \
--num-mappers 2 \
--outdir /user/venkateswarlujvs2821/dep_prasad
It works fine and inserts the records.
When I modify the file in HDFS to add some more records and then try to export it again, it inserts duplicate entries into MySQL.
I am using the following sqoop command the second time:
sqoop export \
--connect "jdbc:mysql://ip-172-31-13-154:3306/sqoopex" \
--username sqoopuser \
--password NHkkP876rp \
--table dep_prasad \
--input-fields-terminated-by '|' \
--input-lines-terminated-by '\n' \
--update-key department_id \
--update-mode allowinsert \
--export-dir /user/venkateswarlujvs2821/dep_prasad/ \
--num-mappers 2 \
--outdir /user/venkateswarlujvs2821/dep_prasad
Note: my table DOES NOT HAVE a primary key.
I want to update only the new records; how can I do that?

sqoop hive import with partitions

I have some Sqoop jobs importing into Hive that I want to partition, but I can't get it to work. The import itself does work: the table is sqooped, it's visible in Hive, and there's data, but the partition parameters I'm expecting to see don't appear when I describe the table. I HAVE sqooped this table as a CSV, created an external Parquet table, and inserted the data into that (which works), but I want to avoid the extra steps if possible. Here's my current code. Am I missing something, or am I trying to do the impossible? Thanks!
sqoop import -Doraoop.import.hint=" " \
--options-file /home/[user]/pass.txt \
--verbose \
--connect jdbc:oracle:thin:@ldap://oid:389/cn=OracleContext,dc=[employer],dc=com/SQSOP051 \
--username [user] \
--num-mappers 10 \
--hive-import \
--query "select DISC_PROF_SK_ID, CLM_RT_DISC_IND, EASY_PAY_PLN_DISC_IND, TO_CHAR(L40_ATOMIC_TS,'YYYY') as YEAR, TO_CHAR(L40_ATOMIC_TS,'MM') as MONTH from ${DataSource[index]}.$TableName where \$CONDITIONS" \
--hive-database [dru_user] \
--hcatalog-partition-keys YEAR \
--hcatalog-partition-values '2015' \
--target-dir hdfs://nameservice1/data/res/warehouse/finance/[dru_user]/Claims_Data/$TableName \
--hive-table $TableName'testing' \
--split-by ${SplitBy[index]} \
--delete-target-dir \
--direct \
--null-string '\\N' \
--null-non-string '\\N' \
--as-parquetfile
You can replace the options file with --password-file; however, that will not solve the partition problem. For the partition problem, you can try creating the partitioned table $TableName first, before the import.
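For example, a minimal sketch of pre-creating the partitioned table through the Hive CLI; the column types are assumptions (STRING placeholders), and only YEAR is used as the partition column since that is the partition key passed to Sqoop:
hive -e "CREATE TABLE IF NOT EXISTS [dru_user].${TableName} (
  disc_prof_sk_id        STRING,
  clm_rt_disc_ind        STRING,
  easy_pay_pln_disc_ind  STRING,
  month                  STRING
) PARTITIONED BY (year STRING) STORED AS PARQUET"
Then run the import with the HCatalog options: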
sqoop import -Doraoop.import.hint=" " \
--password-file /home/[user]/pass.txt \
--verbose \
--connect jdbc:oracle:thin:@ldap://oid:389/cn=OracleContext,dc=[employer],dc=com/SQSOP051 \
--username [user] \
--num-mappers 10 \
--hive-import \
--query "SELECT disc_prof_sk_id,
clm_rt_disc_ind,
easy_pay_pln_disc_ind,
To_char(l40_atomic_ts,'YYYY') AS year,
To_char(l40_atomic_ts,'MM') AS month
FROM ${DataSource[index]}.$TableName
WHERE \$CONDITIONS" \
--hcatalog-database [dru_user] \
--hcatalog-partition-keys YEAR \
--hcatalog-partition-values '2015' \
--target-dir hdfs://nameservice1/data/res/warehouse/finance/[dru_user]/Claims_Data/$TableName \
--hcatalog-table $TableName \
--split-by ${SplitBy[index]} \
--delete-target-dir \
--direct \
--null-string '\\N' \
--null-non-string '\\N' \
--as-parquetfile

Run a sqoop job on a specific queue

I'm trying to create a Sqoop job that runs in a specific queue, but it doesn't work.
I've tried two things:
1st: declare the queue in the job creation
sqoop job \
--create myjob \
-- import \
--connect jdbc:teradata://RCT/DATABASE=MYDB \
-Dmapred.job.queue.name=shortduration \
--driver com.teradata.jdbc.TeraDriver \
--username DBUSER -P \
--query "$query" \
--target-dir /data/source/dest/$i \
--check-column DAT_CRN_AGG \
--incremental append \
--last-value 2001-01-01 \
--split-by NUM_CTR
But it throws an argument parsing error due to -Dmapred.job.queue.name=shortduration.
2nd: remove -Dmapred.job.queue.name=shortduration from the job creation. The job creation works fine, but then I'm unable to specify which queue should be used.
I'm losing hope of running my job in this queue.
Thanks for any help provided!
EDIT: I got an import working with sqoop import -Dmapred.job.queue.name=shortduration, but sqoop job is still not working.
I think you have an error in your command. It should be:
-Dmapreduce.job.queuename=NameOfTheQueue
Note that queuename is one word, and check the order: based on the documentation, the VM args need to go directly after the tool name.
https://sqoop.apache.org/docs/1.4.3/SqoopUserGuide.html#_using_generic_and_specific_arguments
Generic Hadoop command-line arguments:
(must preceed any tool-specific arguments)
Generic options supported are
-conf specify an application configuration file
-D use value for given property
sqoop job -Dmapreduce.job.queuename=shortduration \
--create myjob \
-- import \
--connect jdbc:teradata://RCT/DATABASE=MYDB \
--driver com.teradata.jdbc.TeraDriver \
--username DBUSER -P \
--query "$query" \
--target-dir /data/source/dest/$i \
--check-column DAT_CRN_AGG \
--incremental append \
--last-value 2001-01-01 \
--split-by NUM_CTR
You might just want to try it with the import tool first to see if it is working correctly, and then do the job command, i.e.:
sqoop import -Dmapreduce.job.queuename=shortduration \
--connect jdbc:teradata://RCT/DATABASE=MYDB \
--driver com.teradata.jdbc.TeraDriver \
--username DBUSER -P \
--query "$query" \
--target-dir /data/source/dest/$i \
--check-column DAT_CRN_AGG \
--incremental append \
--last-value 2001-01-01 \
--split-by NUM_CTR
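If the import does submit, you can also confirm which queue the application actually landed in; the -appStates value and the grep pattern below are just illustrative:
yarn application -list -appStates ALL | grep -i shortduration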

query with special characters in sqoop

Can we use a query in a Sqoop options file? I used a query with special characters, and when I run it I get an error: incorrect syntax near "\". Should I use an escape character in the properties file?
In the options file I have put the query, and I use the file in the sqoop import command.
Properties file:
--query "select top 10 source_system,company_code,gl_document,***************negative_posting_flag, to_number(to_varchar(to_date(create_tmstmp),'yyyymm')) as part_date from c_fin_a.gl_transaction_data where to_number(to_varchar(to_date(create_tmstmp),'yyyymm'))=201602 and \$CONDITIONS"
Sqoop import command:
sudo sqoop import \
--options-file /home/emaarae/sqoop_shell/sqoop_hdfs.properties \
--append \
--null-string '' \
--null-non-string '' \
--fields-terminated-by '\001' \
--lines-terminated-by '\n' \
--m 15
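One thing worth checking: per the Sqoop user guide, an options file puts the tool name (if included) and each option and its value on separate lines, quoted strings must not span lines, and the shell does not expand the file's contents, so the backslash before $CONDITIONS is passed through literally, which may be what triggers the incorrect syntax near "\" error. A hedged sketch of that layout, with the query abbreviated:
--query
"select top 10 source_system, company_code, gl_document, ... , part_date from c_fin_a.gl_transaction_data where ... = 201602 and $CONDITIONS"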

Teradata export fails when using TDCH

When exporting 2 billion+ records from Hadoop into Teradata using TDCH (Teradata Connector for Hadoop) with "batch.insert", I use the command below:
hadoop jar teradata-connector-1.3.2-hadoop210.jar com.teradata.connector.common.tool.ConnectorExportTool \
-D mapreduce.job.queuename=<queuename> \
-libjars <LIB_JARS_PATH> \
-classname com.teradata.jdbc.TeraDriver \
-url <jdbc_connection_string> \
-username <user_id> \
-password "********" \
-jobtype hive \
-sourcedatabase <hive_src_dbase> \
-sourcetable <hive_src_table> \
-fileformat orcfile \
-stagedatabase <stg_db_in_tdata> \
-stagetablename <stg_tbl_in_tdata> \
-targettable <target_tbl_in_tdata> \
-nummappers 25 \
-batchsize 13000 \
-method batch.insert \
-usexviews false \
-keepstagetable true \
-queryband '<queryband>'
Data loads successfully into the stage table, but then the export job fails with "Connection Reset" before the records in the stage table are inserted into the target table.
Can someone please help me identify the reason for this and how to fix it?
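Not a root-cause fix, but since -keepstagetable true is set, the stage table survives the failed run; assuming it was fully populated, one way to salvage the load is to finish the stage-to-target step manually, for example with BTEQ. <teradata_host> below is the Teradata TDPID/hostname from the JDBC URL; the other placeholders are the same as in the command above:
bteq <<EOF
.LOGON <teradata_host>/<user_id>,<password>
INSERT INTO <target_tbl_in_tdata> SELECT * FROM <stg_db_in_tdata>.<stg_tbl_in_tdata>;
.LOGOFF
.QUIT
EOF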
