Do the attributes of sqoop command follow some syntactical order? - hadoop

For example, is this command:
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
--m 1 \
--where "city = 'abcd'" \
--target-dir /whereque
the same as this one?
$ sqoop import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
--where "city = 'abcd'" \
--target-dir /whereque \
--m 1
I tried both of the above (and also with --num-mappers 10 in place of --m 1) and they worked. But my question is: can we reorder the arguments like this in all cases?

A Sqoop command generally follows this syntax:
sqoop COMMAND [GENERIC-ARGS] [TOOL-ARGS]
You cannot change the order of those two groups (generic arguments must come before tool arguments), but you can reorder the tool arguments among themselves.
For more details, have a look at the documentation.

Actually, you don't have any generic arguments in your code. Generic arguments relate to Hadoop "configuration" settings. They are listed below:
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.

The Sqoop import command is shown below:
sqoop import [GENERIC-ARGS] [TOOL-ARGS]
A few points on the order in which the arguments must appear (see the sketch after this list):
1. Generic arguments must always be placed immediately after the tool name.
2. All generic arguments must come before any tool arguments.
3. Generic arguments are preceded by a single dash (-).
4. Tool arguments are preceded by two dashes (--); single-character arguments such as -m and -P are the exception.
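For illustration, here is a minimal sketch of that ordering, reusing the arguments from the question (the -D property and its value are illustrative only, not taken from the original post):
# generic argument (-D) directly after the tool name, tool arguments (--*) afterwards in any order
sqoop import \
-D mapreduce.job.name=example_import \
--connect jdbc:mysql://localhost/userdb \
--username root \
--table emp_add \
--where "city = 'abcd'" \
--target-dir /whereque \
-m 1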

Related

sqoop script error "... is not a valid DFS filename"

*Running this in a Linux environment via PuTTY from Windows.
I have a Sqoop script that tries to copy a table from Oracle to Hive, and I get an error about my destination path: .../hdfs://myserver/apps/hive/warehouse/new_schema/new_table is not a valid DFS filename
Can anyone please tell me if my destination path looks correct? I am not trying to set up a file; I just want to copy a table from Oracle to Hive and put it in a schema that already exists in Hive. Below is my script.
#!/bin/bash
sqoop import \
-Dmapred.map.child.java.opts='-Doracle.net.tns_admin=. -Doracle.net.wallet_location=.' \
-files $WALLET_LOCATION/cwallet.sso,$WALLET_LOCATION/ewallet.p12,$TNS_ADMIN/sqlnet.ora,$TNS_ADMIN/tnsnames.ora \
--connect jdbc:oracle:thin:/@MY_ORACLE_DATABASE \
--table orignal_schema.orignal_table \
--hive-drop-import-delims \
--hive-import \
--hive-table new_schema.new_table \
--num-mappers 1 \
--hive-overwrite \
--mapreduce-job-name my_sqoop_job \
--delete-target-dir \
--target-dir /hdfs://myserver/apps/hive/warehouse/new_schema.db \
--create-hive-table
I think what's causing that error is the "/" before the hdfs:// scheme in your path. The correct path should be:
--target-dir hdfs://myserver/apps/hive/warehouse/new_schema.db
Also, always make sure the hostname is correct, to avoid further errors.
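If HDFS is already the cluster's default filesystem (fs.defaultFS), the scheme and hostname can usually be dropped altogether. A minimal sketch of that variant of the last lines of the script, using the same warehouse path:
--delete-target-dir \
--target-dir /apps/hive/warehouse/new_schema.db \
--create-hive-table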

Can we use Sqoop2 import to create only a file, and not a Hive table?

I have tried running the commands below in Sqoop2.
This one works, and tab-separated part files (part-m-00000, part-m-00001, etc.) were created:
sqoop import --connect jdbc:oracle:thin:@999.999.999.999:1521/SIDNAME --username god --table TABLENAME --fields-terminated-by '\t' --lines-terminated-by '\n' -P
This one fails:
sqoop import -Dmapreduce.job.user.classpath.first=true \
-Dmapreduce.output.basename=`date +%Y-%m-%d` \
--connect jdbc:oracle:thin:@999.999.999.999:1521/SIDNAME \
--username nbkeplo \
--P \
--table TABLENAME \
--columns "COL1, COL2, COL3" \
--target-dir /usr/data/sqoop \
-–as-parquetfile \
-m 10
Error:
20/01/08 09:21:23 ERROR tool.BaseSqoopTool: Error parsing arguments for import:
20/01/08 09:21:23 ERROR tool.BaseSqoopTool: Unrecognized argument: -–as-parquetfile
20/01/08 09:21:23 ERROR tool.BaseSqoopTool: Unrecognized argument: -m
20/01/08 09:21:23 ERROR tool.BaseSqoopTool: Unrecognized argument: 10
Try --help for usage instructions.
I want the output to be a .parquet file and not a Hive table (I want to use it with Apache Spark directly, without going through Hive). Is this .parquet file creation possible with Sqoop import?
Importing directly to HDFS (as Avro, SequenceFile, text, or Parquet) is possible with Sqoop. When you output to Hive, the data is still written to HDFS, just inside the Hive warehouse for managed tables. Also, Spark is able to read from any HDFS location it has permission to.
Your code snippets are not the same, and you didn't mention troubleshooting steps you have tried.
I would add the --split-by, --fields-terminated-by, and --lines-terminated-by arguments to your command.
The below works:
sqoop import \
--connect jdbc:oracle:thin:@999.999.999.999:1521/SIDNAME \
--username user \
--target-dir /xxx/yyy/zzz \
--as-parquetfile \
--table TABLE1 \
-P
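If you want more than one mapper, a hedged variant of the working command adds --split-by with a suitable key column (SPLIT_COLUMN below is a placeholder, not a column from the original question):
sqoop import \
--connect jdbc:oracle:thin:@999.999.999.999:1521/SIDNAME \
--username user \
--split-by SPLIT_COLUMN \
--target-dir /xxx/yyy/zzz \
--as-parquetfile \
--table TABLE1 \
-m 10 \
-P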

--mapreduce-name not working with sqoop

When I sqoop the data and use --mapreduce-name, both in a free-form query and in a normal import, Sqoop still gives the jar its generic name: QueryResult.jar for the free-form query, and the default table-name jar for the plain import.
Why is --mapreduce-name not taking effect? Could anyone help me out with this?
Use -D mapred.job.name=customJobName to set the name of the MR job Sqoop launches.
If it is not specified, the name defaults to the jar name for the job, which is derived from the table name.
Sqoop command syntax:
sqoop import [GENERIC-ARGS] [TOOL-ARGS]
As per the Sqoop documentation:
Use -D mapred.job.name=<job_name> to set the name of the MR job that Sqoop launches.
Sample command:
sqoop import \
-D mapred.job.name=my_custom_name \
--connect 'jdbc:db2://192.xxx.xxx.xx:50000/BENCHDS' \
--driver com.ibm.db2.jcc.DB2Driver \
--username db2inst1 \
--password db2inst1 \
--table db2inst1.table1 \
--hive-import \
--hive-table hive_table1 \
--null-string '\\N' \
--null-non-string '\\N' \
--verbose
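Sqoop also exposes a tool argument for this, --mapreduce-job-name (it appears in one of the other scripts on this page); a hedged sketch reusing the same connection details, with an assumed --target-dir:
sqoop import \
--connect 'jdbc:db2://192.xxx.xxx.xx:50000/BENCHDS' \
--driver com.ibm.db2.jcc.DB2Driver \
--username db2inst1 \
--password db2inst1 \
--table db2inst1.table1 \
--mapreduce-job-name my_custom_name \
--target-dir /tmp/db2inst1_table1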

sqoop create impala parquet table

I'm relatively new to the process of sqooping, so pardon any ignorance. I have been trying to sqoop a table from a data source as a Parquet file and create an Impala table (also as Parquet) into which I will insert the sqooped data. The code runs without issue, but when I try to select a couple of rows for testing I get the error:
.../EWT_CALL_PROF_DIM_SQOOP/ec2fe2b0-c9fa-4ef9-91f8-46cf0e12e272.parquet' has an incompatible Parquet schema for column 'dru_id.test_ewt_call_prof_dim_parquet.call_prof_sk_id'. Column type: INT, Parquet schema: optional byte_array CALL_PROF_SK_ID [i:0 d:1 r:0]
I was mirroring the process I found in a Cloudera guide here: https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_create_table.html, mainly the "Internal and External Tables" section. I've been trying to avoid having to infer the schema from a particular parquet file, since this whole thing will be kicked off every month by a bash script (and I also can't think of a way to point it at just one file if I use more than one mapper).
Here's the code I used. I feel like I'm either missing something small and stupid, or I've screwed up everything major without realizing it. Any and all help appreciated. Thanks!
sqoop import -Doraoop.import.hint=" " \
--options-file /home/kemri/pass.txt \
--verbose \
--connect jdbc:oracle:thin:@ldap://oid:389/cn=OracleContext,dc=[employer],dc=com/EWSOP000 \
--username [userid] \
--num-mappers 1 \
--target-dir hdfs://nameservice1/data/res/warehouse/finance/[dru_userid]/EWT_CALL_PROF_DIM_SQOOP \
--delete-target-dir \
--table DMPROD.EWT_CALL_PROF_DIM \
--direct \
--null-string '\\N' \
--null-non-string '\\N' \
--as-parquetfile
impala-shell -k -i hrtimpslb.[employer].com
create external table test_EWT_CALL_PROF_DIM_parquet(
CALL_PROF_SK_ID INT,
SRC_SKL_CD_ID STRING,
SPLIT_NM STRING,
SPLIT_DESC STRING,
CLM_SYS_CD STRING,
CLM_SYS_NM STRING,
LOB_CD STRING,
LOB_NM STRING,
CAT_IND STRING,
CALL_TY_CD STRING,
CALL_TY_NM STRING,
CALL_DIR_CD STRING,
CALL_DIR_NM STRING,
LANG_CD STRING,
LANG_NM STRING,
K71_ATOMIC_TS TIMESTAMP)
stored as parquet location '/data/res/warehouse/finance/[dru_userid]/EWT_CALL_PROF_DIM_SQOOP';
As requested in the comments, here is an example of how you could achieve the same thing with a single sqoop import using --hive-import. For obvious reasons I haven't tested it against your specific requirements, so it may need some more tuning, as is often the case with these Sqoop commands.
In my experience, importing as Parquet forces you to use the --query option, since it doesn't allow you to pass schema.table via --table.
sqoop import -Doraoop.import.hint=" " \
--verbose \
--connect jdbc:oracle:thin:@ldap://oid:389/cn=OracleContext,dc=[employer],dc=com/EWSOP000 \
--username [userid] \
-m 1 \
--password [ifNecessary] \
--hive-import \
--query 'SELECT * FROM DMPROD.EWT_CALL_PROF_DIM WHERE $CONDITIONS' \
--hive-database [database you want to use] \
--hive-table test_EWT_CALL_PROF_DIM_parquet \
--target-dir hdfs://nameservice1/data/res/warehouse/finance/[dru_userid]/EWT_CALL_PROF_DIM_SQOOP \
--null-string '\\N' \
--null-non-string '\\N' \
--as-parquetfile
Basically, what you need for --hive-import is --hive-database, --hive-table and --query.
If you don't want all your columns to appear in Hive as strings, you must also include:
--map-column-hive [column_name1=timestamp,column_name2=int,...]
You might need a similar --map-column-java argument as well, but I'm never sure when it is required.
You will need --split-by if you want multiple mappers.
As discussed in the comments, you will need to run invalidate metadata db.table to make sure Impala sees these changes. You can issue both commands from the command line, or from a single bash script where you run the Impala statement with impala-shell -q [query].
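A minimal bash sketch of that combined approach (the database placeholder and table name are those from above; the Impala host is the one in the question and may differ in your environment):
#!/bin/bash
# 1) run the sqoop import shown above (omitted here for brevity)
# 2) tell Impala to pick up the new/changed table metadata
impala-shell -k -i hrtimpslb.[employer].com \
-q "invalidate metadata [database you want to use].test_EWT_CALL_PROF_DIM_parquet"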

How to manage property file in Sqoop scripts

I need to provide certain user-defined placeholders/parameters to a Sqoop script, which should be picked up at execution time.
For example my current sqoop script looks like:
sqoop job --create dummy_sqoop_job_name -- \
--options-file $USER_HOME/option_files/import.txt \
--password-file $USER_HOME/sqoop.password \
--table SCHEMA_NAME.TABLE_NAME \
--split-by SPLIT_KEY_COL \
--incremental append \
--check-column $CHECK_COLUMN \
--last-value $LAST_VALUE \
--target-dir $USER_HOME/staged/input/
I need to replace every placeholder starting with $ with the corresponding value from a property file.
The property file will be part of the build package.
The placeholder marker could be anything; here I have used $ to identify a placeholder.
Please let me know whether I can achieve this in Sqoop or not.
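As far as I know, Sqoop itself does not substitute such placeholders; the shell normally does that before Sqoop sees the arguments. A minimal sketch of one common approach, assuming a hypothetical sqoop_job.properties with simple KEY=value lines (no spaces) that define USER_HOME, CHECK_COLUMN and LAST_VALUE:
#!/bin/bash
# hypothetical file with lines such as USER_HOME=/home/etl_user, CHECK_COLUMN=LAST_UPD_TS, LAST_VALUE=0
source /path/to/sqoop_job.properties

sqoop job --create dummy_sqoop_job_name -- \
--options-file $USER_HOME/option_files/import.txt \
--password-file $USER_HOME/sqoop.password \
--table SCHEMA_NAME.TABLE_NAME \
--split-by SPLIT_KEY_COL \
--incremental append \
--check-column $CHECK_COLUMN \
--last-value $LAST_VALUE \
--target-dir $USER_HOME/staged/input/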
