What is the intention of Sqoop options? Single - vs double -- minus difference? - hadoop

In the example below, -username and -password are preceded by a single dash, whereas --connect, --table, and the other options are preceded by a double dash. What is the intention of such Sqoop options? Where should I use a single dash and where a double dash?
sqoop-import --connect jdbc:mysql://localhost:3306/db1 \
-username root -password password \
--table tableName --hive-table tableName \
--create-hive-table --hive-import --hive-home path/to/hive_home

Generic Hadoop arguments are preceded by a single dash character (-), whereas Sqoop-specific arguments start with two dashes (--), unless they are single-character arguments such as -P.
The generic Hadoop command-line arguments supported are:
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
FYI: You must supply the generic hadoop arguments -conf, -D, and so on after the tool name but before any tool-specific arguments (such as --connect).
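For instance, here is a minimal sketch of that ordering (db1, tableName, and the queue name are placeholders, not from a real cluster): the generic -D argument comes right after the tool name, before any of the Sqoop-specific -- options, and -P is one of the single-character arguments mentioned above.
sqoop import \
-D mapreduce.job.queuename=default \
--connect jdbc:mysql://localhost:3306/db1 \
--username root \
-P \
--table tableName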

Related

How to make Hadoop Snappy output files the same format as those generated by Spark

We are using Spark and, up until now, the output has been PSV files. Now, in order to save space, we'd like to compress the output. To do so, we will save the JavaRDD using the SnappyCodec, like this:
objectRDD.saveAsTextFile(rddOutputFolder, org.apache.hadoop.io.compress.SnappyCodec.class);
We will then use Sqoop to import the output into a database. The whole process works fine.
For previously generated PSV files in HDFS, we'd like to compress them in Snappy format as well. This is the command we tried:
hadoop jar /usr/hdp/2.6.5.106-2/hadoop-mapreduce/hadoop-streaming-2.7.3.2.6.5.106-2.jar \
-Dmapred.output.compress=true -Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
-Dmapred.reduce.tasks=0 \
-input input-path \
-output output-path
The command works fine, but the issue is that Sqoop can't parse the Snappy output files.
When we use a command like "hdfs dfs -text hdfs-file-name" to view the generated files, the output looks like below, with an "index"-like field added to each line:
0 2019-05-02|AMRS||5072||||3540||MMPT|0|
41 2019-05-02|AMRS||5538|HK|51218||1000||Dummy|45276|
118 2019-05-02|AMRS||5448|US|51218|TRADING|2282|HFT|NCR|45119|
I.e., an extra value like "0 ", "41 ", "118 " is added at the beginning of each line. Note that the .snappy files generated by Spark don't have this extra field.
Any idea how to prevent this extra field being inserted?
Thanks a lot!
These are not indexes but rather keys generated by TextInputFormat, as explained in the Hadoop Streaming documentation:
The class you supply for the input format should return key/value
pairs of Text class. If you do not specify an input format class, the
TextInputFormat is used as the default. Since the TextInputFormat
returns keys of LongWritable class, which are actually not part of the
input data, the keys will be discarded; only the values will be piped
to the streaming mapper.
And since you do not have any mapper defined in your job, those key/value pairs are written straight out to the file system. So, as the above excerpt hints, you need some sort of mapper that discards the keys. A quick-and-dirty fix is to use something already available as a pass-through, such as the shell cat command (keeping the generic -D options before the streaming-specific ones):
hadoop jar /usr/hdp/2.6.5.106-2/hadoop-mapreduce/hadoop-streaming-2.7.3.2.6.5.106-2.jar \
-Dmapred.output.compress=true -Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec \
-Dmapred.reduce.tasks=0 \
-mapper /bin/cat \
-input input-path \
-output output-path
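As a quick sanity check (a suggestion beyond the original answer; output-path is the same placeholder used above), re-reading the compressed output should now show the lines without the leading key:
hdfs dfs -text output-path/part-* | head -5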

How does --options-file differ from --connection-param-file?

The Sqoop documentation shows an example for --options-file as:
#
# Options file for Sqoop import
#
# Specifies the tool being invoked
import
# Connect parameter and value
--connect
jdbc:mysql://localhost/db
# Username parameter and value
--username
foo
#
# Remaining options should be specified in the command line.
#
As per the above, if the options file contains only the connection information, and as per the comment all remaining options should be specified on the command line, why is it called --options-file and not --connection-param-file?
The comment "Remaining options should be specified in the command line" is misleading. It is there only to show that it is possible to have comments in the options file; it does not mean you cannot specify more options in the file.
I am using options files for Sqoop and they contain connection details as well as options like --num-mappers or --fields-terminated-by.
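For example, here is a hedged sketch of such an options file (the file name import.options, the mapper count, and the delimiter are illustrative, not taken from the original post), followed by the invocation that loads it; note that in an options file each option and its value sit on their own lines, just like in the documentation example above.
# import.options -- hypothetical options file
import
--connect
jdbc:mysql://localhost/db
--username
foo
--num-mappers
4
--fields-terminated-by
','
It would then be invoked as:
sqoop --options-file /path/to/import.options --table TableName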

common.AbstractJob: Unexpected -libjars while processing Job-Specific Options

Hi all!
When I use RecommenderJob in my project, I run into an unexpected error. The arguments passed to the job are a String array with the following values:
[-libjars, /path/to/xxx.jar,/path/to/yyy.jar,
--input, hdfs://localhost:9000/tmp/x,
--output, hdfs://localhost:9000/tmp/y,
--similarityClassname,
org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.TanimotoCoefficientSimilarity,
--numRecommendations, 6,
--tempDir, hdfs://localhost:9000/tmp/z]
After I run the job via the following code:
job.run(args);
It prints an ERROR as follows:
ERROR common.AbstractJob: Unexpected -libjars while processing Job-Specific Options:
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
-archives <paths> comma separated archives to be unarchived
on the compute machines.
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-files <paths> comma separated files to be copied to the
map reduce cluster
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-libjars <paths> comma separated jar files to include in the
classpath.
Unexpected -libjars while processing Job-Specific Options:
Usage:
...
Does anybody know how to solve it? Thanks in advance!
Finally, I found the solution myself. We should not use
job.run(args);
to run the job, because it only processes the Job-Specific Options. It is correct to use ToolRunner to run the job, which processes the Generic Options before the Job-Specific Options, and hence solves the problem:
ToolRunner.run(conf, job, args);

Sqoop Hive table import, Table dataType doesn't match with database

Using Sqoop to import data from Oracle to Hive works fine, but it creates the table in Hive with only two data types, String and Double. I want to use timestamp as the data type for some columns.
How can I do it?
bin/sqoop import --table TEST_TABLE --connect jdbc:oracle:thin:@HOST:PORT:orcl --username USER1 -password password -hive-import --hive-home /user/lib/Hive/
In addition to the above answers, we may also have to observe when the error occurs. For example, in my case I had two types of data columns that caused errors: json and binary.
For the json columns, the error came while a Java class was being generated, at the very beginning of the import process:
16/04/19 09:37:58 ERROR orm.ClassWriter: Cannot resolve SQL type
For the binary column, the error was thrown while importing into the Hive tables (after the data had been imported and put into HDFS files):
16/04/19 09:51:22 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Hive does not support the SQL type for column featured_binary
To get rid of these two errors, I had to provide the following options
--map-column-java column1_json=String,column2_json=String,featured_binary=String --map-column-hive column1_json=STRING,column2_json=STRING,featured_binary=STRING
In summary, we may have to provide --map-column-java or --map-column-hive depending upon the failure.
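To make this concrete, here is a hedged sketch of how those options could sit in a full import command; the connection string, credentials, and table name are placeholders, and the column names are simply the illustrative ones from the options above.
sqoop import \
--connect jdbc:postgresql://dbhost/dbname \
--username dbuser \
-P \
--table source_table \
--hive-import \
--map-column-java column1_json=String,column2_json=String,featured_binary=String \
--map-column-hive column1_json=STRING,column2_json=STRING,featured_binary=STRING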
You can use the parameter --map-column-hive to override the default mapping. This parameter expects a comma-separated list of key-value pairs, with each column name and Hive type separated by =, to specify which column should be mapped to which type in Hive.
sqoop import \
...
--hive-import \
--map-column-hive id=STRING,price=DECIMAL
A new feature was added with SQOOP-2103 / Sqoop 1.4.5 that lets you specify the decimal precision with the --map-column-hive parameter. Example:
--map-column-hive 'TESTDOLLAR_AMT=DECIMAL(20%2C2)'
This syntax would define the field as DECIMAL(20,2). The %2C is used in place of a comma, and the parameter needs to be in single quotes if submitted from the bash shell.
I tried using DECIMAL with no precision specified and got DECIMAL(10,0) as the default.
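For illustration only (the connection string and table name are reused from the question above as placeholders, and TESTDOLLAR_AMT is the column from the example), the quoted parameter would be placed in a command like this:
sqoop import \
--connect jdbc:oracle:thin:@HOST:PORT:orcl \
--username USER1 \
-P \
--table TEST_TABLE \
--hive-import \
--map-column-hive 'TESTDOLLAR_AMT=DECIMAL(20%2C2)'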

How to use sqoop to export the default hive delimited output?

I have a hive query:
insert overwrite directory '/x'
select ...
Then I try to export the data with Sqoop:
sqoop export --connect jdbc:mysql://mysqlm/site --username site --password site --table x_data --export-dir /x --input-fields-terminated-by 0x01 --lines-terminated-by '\n'
But this seems to fail to parse the fields according to the delimiter.
What am I missing?
I think the --input-fields-terminated-by 0x01 part doesn't work as expected?
I do not want to create additional tables in Hive that contain the query results.
Stack trace:
2013-09-24 05:39:21,705 ERROR org.apache.sqoop.mapreduce.TextExportMapper: Exception:
java.lang.NumberFormatException: For input string: "9-2"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Integer.parseInt(Integer.java:458)
...
The vi view of the output:
16-09-2013 23^A1182^A-1^APub_X^A21782^AIT^A1^A0^A0^A0^A0^A0.0^A0.0^A0.0
16-09-2013 23^A1182^A6975^ASoMo Audience Corp^A2336143^AUS^A1^A1^A0^A0^A0^A0.2^A0.0^A0.0
16-09-2013 23^A1183^A-1^APub_UK, Inc.^A1564001^AGB^A1^A0^A0^A0^A0^A0.0^A0.0^A0.0
17-09-2013 00^A1120^A-1^APub_US^A911^A--^A181^A0^A0^A0^A0^A0.0^A0.0^A0.0
I've found the correct way to pass that special character in bash:
#!/bin/bash
# ... your script
hive_char=$( printf "\x01" )
sqoop export --connect jdbc:mysql://mysqlm/site --username site --password site --table x_data --export-dir /x --input-fields-terminated-by ${hive_char} --lines-terminated-by '\n'
The problem was correct separator recognition (nothing to do with types or schema), and that is what hive_char achieves.
Another way to enter this special character on the Linux command line is to type Ctrl+V followed by Ctrl+A.
Using
--input-fields-terminated-by '\001' --lines-terminated-by '\n'
as flags in the sqoop export command seems to do the trick for me.
So, in your example, the full command would be:
sqoop export --connect jdbc:mysql://mysqlm/site --username site --password site --table x_data --export-dir /x --input-fields-terminated-by '\001' --lines-terminated-by '\n'
I think it's a data type mismatch with your RDBMS schema.
Try to find the column that holds the "9-2" value and check its data type in the RDBMS schema.
If it's int or numeric, Sqoop will try to parse the value and insert it, and "9-2" is clearly not a numeric value.
Let me know if this doesn't work.
It seems like Sqoop is taking '0' as the delimiter.
You are getting the error because the first column in your MySQL table could be varchar and the second column is a number.
As per the string below:
16- 0 9-2 0 13 23^A1182^A-1^APub_X^A21782^AIT^A1^A0^A0^A0^A0^A0.0^A0.0^A0.0
the first column parsed by Sqoop is "16-"
and the second column is "9-2".
So it is better to specify the delimiter in quotes ('0x01'),
or (it is always easier and gives better control) to use a Hive create table command such as:
create table tablename row format delimited fields terminated by '\t' as select ...
and specify '\t' as the delimiter in your sqoop export command, as shown in the sketch below.
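Here is a hedged sketch of that second approach, reusing the MySQL connection and table from the question; the staging table x_export and its warehouse path are hypothetical, and the select ... is left elided as in the original query.
# Re-materialize the query results with an explicit tab delimiter (hypothetical staging table).
hive -e "create table x_export row format delimited fields terminated by '\t' as select ..."
# Export the staging table's files, telling Sqoop the fields are tab-separated.
sqoop export --connect jdbc:mysql://mysqlm/site --username site --password site \
--table x_data --export-dir /user/hive/warehouse/x_export \
--input-fields-terminated-by '\t' --lines-terminated-by '\n'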
