Sqoop job through oozie - hadoop

I have created a Sqoop job called TeamMemsImportJob which pulls data from SQL Server into Hive.
I can execute the Sqoop job from the Unix command line by running the following command:
sqoop job --exec TeamMemsImportJob
If I create an Oozie job with the actual Sqoop import command in it, it runs fine.
However, if I create the Oozie job and run the saved Sqoop job through it, I get the following error:
oozie job -config TeamMemsImportJob.properties -run
>>> Invoking Sqoop command line now >>>
4273 [main] WARN org.apache.sqoop.tool.SqoopTool - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
4329 [main] INFO org.apache.sqoop.Sqoop - Running Sqoop version: 1.4.4.2.1.1.0-385
5172 [main] ERROR org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage - Cannot restore job: TeamMemsImportJob
5172 [main] ERROR org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage - (No such job)
5172 [main] ERROR org.apache.sqoop.tool.JobTool - I/O error performing job operation: java.io.IOException: Cannot restore missing job TeamMemsImportJob
at org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage.read(HsqldbJobStorage.java:256)
at org.apache.sqoop.tool.JobTool.execJob(JobTool.java:198)
It looks as if it cannot find the job. However, I can see the job when I list the saved jobs:
[root@sandbox ~]# sqoop job --list
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
14/06/25 08:12:08 INFO sqoop.Sqoop: Running Sqoop version: 1.4.4.2.1.1.0-385
Available jobs:
TeamMemsImportJob
How do I resolve this?

You have to pass the --meta-connect flag when creating the job, so that it is stored in a custom Sqoop metastore database that Oozie can access.
sqoop \
job \
--meta-connect \
"jdbc:hsqldb:file:/on/server/not/hdfs/sqoop-metastore/sqoop-meta.db;shutdown=true" \
--create \
jobName \
-- \
import \
--connect jdbc:oracle:thin:@server:port:sid \
--username username \
--password-file /path/on/hdfs/server.password \
--table TABLE \
--incremental append \
--check-column ID \
--last-value "0" \
--target-dir /path/on/hdfs/TABLE
When you need to execute jobs, you can do it from Oozie the regular way, but make sure to include --meta-connect to indicate where the job is stored.
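As a rough sketch of the Oozie side, the saved job could be run from a Sqoop action that passes the same metastore URL used at creation time. This assumes the sqoop-action 0.2 schema and the usual ${jobTracker} and ${nameNode} workflow properties. The action name and the "end" and "fail" transitions are hypothetical, and the metastore location has to be reachable from the node where Oozie launches the action.

```xml
<!-- Sketch only: action and transition names are placeholders. -->
<action name="run-saved-sqoop-job">
  <sqoop xmlns="uri:oozie:sqoop-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <!-- One <arg> per token, so the ';' in the URL needs no shell quoting. -->
    <arg>job</arg>
    <arg>--meta-connect</arg>
    <arg>jdbc:hsqldb:file:/on/server/not/hdfs/sqoop-metastore/sqoop-meta.db;shutdown=true</arg>
    <arg>--exec</arg>
    <arg>jobName</arg>
  </sqoop>
  <ok to="end"/>
  <error to="fail"/>
</action>
```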

The log shows that Sqoop cannot find the stored job, because you are using the default local HSQLDB metastore.
To make Sqoop jobs available to other systems, you should configure a shared metastore database (for example MySQL) that all systems can access.
From the documentation:
Running sqoop-metastore launches a shared HSQLDB database instance on
the current machine. Clients can connect to this metastore and create
jobs which can be shared between users for execution
The location of the metastore’s files on disk is controlled by the
sqoop.metastore.server.location property in conf/sqoop-site.xml. This
should point to a directory on the local filesystem.
The metastore is available over TCP/IP. The port is controlled by the
sqoop.metastore.server.port configuration parameter, and defaults to
16000.
Clients should connect to the metastore by specifying
sqoop.metastore.client.autoconnect.url or --meta-connect with the
value jdbc:hsqldb:hsql://<server-name>:<port>/sqoop. For example,
jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop.
This metastore may be hosted on a machine within the Hadoop cluster,
or elsewhere on the network.
Can you check whether that database is accessible from the other systems?
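For illustration, the properties named in the quoted documentation would go into conf/sqoop-site.xml roughly as follows. This is a sketch, not a tested configuration: the storage path is a hypothetical value, and the host name is the documentation's own example.

```xml
<!-- On the machine that runs sqoop-metastore -->
<property>
  <name>sqoop.metastore.server.location</name>
  <!-- Hypothetical local path; choose one that suits your host. -->
  <value>/var/lib/sqoop/metastore/shared.db</value>
</property>
<property>
  <name>sqoop.metastore.server.port</name>
  <value>16000</value>
</property>

<!-- On each client that should use the shared metastore by default -->
<property>
  <name>sqoop.metastore.client.autoconnect.url</name>
  <value>jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop</value>
</property>
```

With the client property set, sqoop job commands on that client should not need an explicit --meta-connect; passing the flag works where editing sqoop-site.xml is not practical.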

Related

Sqoop eval throwing error when I tried to check the connection due to java.io.IOException: Could not load jar into JVM

I tried to run a sqoop eval script from the AWS EMR CLI against a Teradata connection, but got this error:
Error loading ManagerFactory information from file /usr/lib/sqoop/conf/managers.d/td_connector.txt: java.io.IOException: Could not load jar $SQOOP_HOME/lib/teradata-connector-1.6.5.jar into JVM. (Could not find class org.apache.sqoop.teradata.TeradataConnManager.)
Steps I have followed:
Logged in over SSH to EMR release emr-6.2.0, configured with Hadoop 3 and Sqoop 1.4.7.
Downloaded the Teradata Hadoop connector 3.x from Teradata downloads.
Moved the Teradata Hadoop connector to $SQOOP_HOME/lib and installed it.
Created the text file td_connect at /usr/lib/sqoop/conf/managers.d/ containing the text org.apache.sqoop.teradata.TeradataConnManager=$SQOOP_HOME/lib/teradata-connector-1.6.5.jar
Ran the script:
sqoop eval --connection-manager org.apache.sqoop.teradata.TeradataConnManager --connect jdbc:teradata://host/database= --username username --password password --query 'select top 5 * from table'
Could you please help identify the issue?

Sqoop create hive table ERROR - Encountered IOException running create table job

I am running Sqoop on a CentOS 7 machine that has Hadoop/MapReduce and Hive already installed. I read in a tutorial that when importing data from an RDBMS (SQL Server in my case) to HDFS, I need to run the following command:
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect 'jdbc:sqlserver://hostname;database=databasename' --username admin --password admin123 --table tableA
Everything works perfectly with this step. The next step is creating a Hive table that has the same structure as the RDBMS table (SQL Server in my case), using a sqoop command:
sqoop create-hive-table --connect 'jdbc:sqlserver://hostname;database=databasename' --username admin --password admin123 --table tableA --hive-table hivetablename --fields-terminated-by ','
However, whenever I run the above command I get the following error:
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
com.fasterxml.jackson.databind.ObjectMapper.readerFor(Ljava/lang
/Class;)Lcom/fasterxml/jackson/databind/ObjectReader;
18/04/01 19:37:52 ERROR ql.Driver: FAILED: Execution Error, return code 1
from org.apache.hadoop.hive.ql.exec.DDLTask.
com.fasterxml.jackson.databind.ObjectMapper.readerFor(Ljava/lang
/Class;)Lcom/fasterxml/jackson/databind/ObjectReader;
18/04/01 19:37:52 INFO ql.Driver: Completed executing
command(queryId=hadoop_20180401193745_1f3cf07d-ca16-40dd-
8f8d-1e426ecd5860); Time taken: 0.212 seconds
18/04/01 19:37:52 INFO conf.HiveConf: Using the default value passed in
for log id: 0813b5c9-f374-4920-b8c6-b8541449a6eb
18/04/01 19:37:52 INFO session.SessionState: Resetting thread name to
main
18/04/01 19:37:52 INFO conf.HiveConf: Using the default value passed in
for log id: 0813b5c9-f374-4920-b8c6-b8541449a6eb
18/04/01 19:37:52 INFO session.SessionState: Deleted directory: /tmp/hive
/hadoop/0813b5c9-f374-4920-b8c6-b8541449a6eb on fs with scheme hdfs
18/04/01 19:37:52 INFO session.SessionState: Deleted directory: /tmp/hive
/java/hadoop/0813b5c9-f374-4920-b8c6-b8541449a6eb on fs with scheme file
18/04/01 19:37:52 ERROR tool.CreateHiveTableTool: Encountered IOException
running create table job: java.io.IOException: Hive CliDriver exited with
status=1
I am not a Java expert, but I would like to know if you have any idea what causes this result.
I've faced the same issue. It seems that there are some compatibility issues between my versions of sqoop (1.4.7) and hive (2.3.4).
The problem arises from the versions of the jackson-* jar files within $SQOOP_HOME/lib: some of them are too old for Hive, which needs version 2.6 or newer.
The solution that I found was to replace the following files in $SQOOP_HOME/lib by their counterpart in $HIVE_HOME/lib:
jackson-core-*.jar
jackson-databind-*.jar
jackson-annotations-*.jar
They are all from versions 2.6+ and this seems to work. Not sure it's good practice though.
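A minimal sketch of the jar swap described above, assuming the two arguments are the Sqoop and Hive install roots. The function name swap_jackson_jars and the jackson-backup directory are hypothetical. Old jars are moved aside rather than deleted so the change can be undone, and the glob requires a digit after the artifact name so that unrelated jars such as jackson-core-asl are left alone.

```shell
# Replace Sqoop's jackson-core/databind/annotations jars with Hive's copies.
# Usage: swap_jackson_jars <sqoop_home> <hive_home>
swap_jackson_jars() {
  sqoop_lib="$1/lib"
  hive_lib="$2/lib"
  backup="$sqoop_lib/jackson-backup"   # hypothetical backup location
  mkdir -p "$backup" || return 1
  for name in jackson-core jackson-databind jackson-annotations; do
    # Move Sqoop's existing copies out of the way.
    for old in "$sqoop_lib/$name"-[0-9]*.jar; do
      if [ -e "$old" ]; then
        mv "$old" "$backup/" || return 1
      fi
    done
    # Copy in whatever version Hive ships.
    for new in "$hive_lib/$name"-[0-9]*.jar; do
      if [ -e "$new" ]; then
        cp "$new" "$sqoop_lib/" || return 1
      fi
    done
  done
  return 0
}
```

It would be called as swap_jackson_jars "$SQOOP_HOME" "$HIVE_HOME". To roll back, move the jars from jackson-backup into the lib directory again and remove the copied ones.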
I was facing the same issue; downgrading Hive to 1.2.2 solved it.
I am not sure that is acceptable if you need to use Sqoop with Hive 2 only, though.
Instead of writing two different statements, you can put the whole thing in one statement, which will fetch the data from SQL Server and then create a Hive table too.
sqoop import -Dorg.apache.sqoop.splitter.allow_text_splitter=true --connect 'jdbc:sqlserver://hostname;database=databasename' --username admin --password admin123 --table tableA --hive-import --hive-overwrite --hive-table hivetablename --fields-terminated-by ',' --hive-drop-import-delims --null-string '\\N' --null-non-string '\\N'
For this, check the jackson-core, jackson-databind and jackson-annotations jars. They should be the latest version; this error usually comes from an older version. Place these jars in both the Hive lib and Sqoop lib directories. Also check the libthrift jar: it should be the same version in Hive and HBase, and the same jar should be copied into the Sqoop lib directory.

passing mysql properties via sqoop eval

sqoop eval command :
sqoop eval --connect 'jdbc:mysql://<connection url>' --driver com.mysql.jdbc.Driver --query "select max(rdate) from test.sqoop_test"
gives me output:
Warning: /usr/hdp/2.3.2.0-2950/accumulo does not exist! Accumulo
imports will fail. Please set $ACCUMULO_HOME to the root of your
Accumulo installation. Warning: /usr/hdp/2.3.2.0-2950/zookeeper does
not exist! Accumulo imports will fail. Please set $ZOOKEEPER_HOME to
the root of your Zookeeper installation. 16/10/05 18:38:17 INFO
sqoop.Sqoop: Running Sqoop version: 1.4.6.2.3.2.0-2950 16/10/05
18:38:17 WARN tool.BaseSqoopTool: Setting your password on the
command-line is insecure. Consider using -P instead. 16/10/05 18:38:17
WARN sqoop.ConnFactory: Parameter --driver is set to an explicit
driver however appropriate connection manager is not being set (via
--connection-manager). Sqoop is going to fall back to org.apache.sqoop.manager.GenericJdbcManager. Please specify explicitly
which connection manager should be used next time. 16/10/05 18:38:17
INFO manager.SqlManager: Using default fetchSize of 1000
--------------
| max(rdate) |
--------------
| 2014-01-25 |
But I want the output without the warnings and table boundaries, like this:
max(rdate) 2014-01-25
I basically want to store this output in a file.
Thanks in advance.
You can perform a Sqoop import operation to save the output in HDFS.
The warnings are straightforward:
You can set $ACCUMULO_HOME and $ZOOKEEPER_HOME if those installations are available.
You can set --connection-manager to the one corresponding to MySQL.
For the sake of security, it is recommended to use -P for the password rather than writing it in the command.
These are not errors; you can live with these warnings.
You can create a .sh file, write your sqoop commands into it, then run it as
sh shell_file_name.sh > your_output_file.txt
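Redirecting alone still leaves the table borders (and possibly the warnings) in the file. As a sketch, the output can be filtered so that only the result rows remain, assuming those rows are the only lines that start with "|", as in the output shown in the question. The function name strip_eval_table is hypothetical.

```shell
# Keep only the result rows of `sqoop eval` output read on stdin and
# strip the leading "| " and trailing " |" from each row.
strip_eval_table() {
  grep '^|' | sed -e 's/^| *//' -e 's/ *|[[:space:]]*$//'
}

# Intended use (requires a working sqoop install, so left as a comment):
#   sqoop eval --connect 'jdbc:mysql://<connection url>' \
#     --query "select max(rdate) from test.sqoop_test" 2>/dev/null \
#     | strip_eval_table > your_output_file.txt
```

The filter matches on the leading "|" instead of relying on 2>/dev/null alone, because some of the warning lines may be written to standard output and not to standard error.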
There are two ways to get the query results:
One way is to write to HDFS by importing the query results (--target-dir /path) and reading them from there.
The other way is to change the file system option in the sqoop command, so that the import query results are stored on the local file system rather than in HDFS.
e.g.: sqoop import -fs local -jt local --connect "connection string" --username root --password root --query 'SELECT * FROM table WHERE $CONDITIONS' -m 1 --target-dir /home/output
https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1762587

Sqoop job fails with KiteSDK validation error for Oracle import

I am attempting to run a Sqoop job that loads data from an Oracle DB into Parquet format on a Hadoop cluster. The job is incremental.
Sqoop version is 1.4.6. Oracle version is 12c. Hadoop version is 2.6.0 (distro is Cloudera 5.5.1).
The Sqoop command is (this creates the job, and executes it):
$ sqoop job -fs hdfs://<HADOOPNAMENODE>:8020 \
--create myJob \
-- import \
--connect jdbc:oracle:thin:@<DBHOST>:<DBPORT>/<DBNAME> \
--username <USERNAME> \
-P \
--as-parquetfile \
--table <USERNAME>.<TABLENAME> \
--target-dir <HDFSPATH> \
--incremental append \
--check-column <TABLEPRIMARYKEY>
$ sqoop job --exec myJob
Error on execute:
16/02/05 11:25:30 ERROR sqoop.Sqoop: Got exception running Sqoop:
org.kitesdk.data.ValidationException: Dataset name
05112528000000918_2088_<USERNAME>.<TABLENAME>
is not alphanumeric (plus '_')
at org.kitesdk.data.ValidationException.check(ValidationException.java:55)
at org.kitesdk.data.spi.Compatibility.checkDatasetName(Compatibility.java:103)
at org.kitesdk.data.spi.Compatibility.check(Compatibility.java:66)
at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.create(FileSystemMetadataProvider.java:209)
at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.create(FileSystemDatasetRepository.java:137)
at org.kitesdk.data.Datasets.create(Datasets.java:239)
at org.kitesdk.data.Datasets.create(Datasets.java:307)
at org.apache.sqoop.mapreduce.ParquetJob.createDataset(ParquetJob.java:107)
at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:80)
at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:106)
at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:260)
at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:668)
at org.apache.sqoop.manager.OracleManager.importTable(OracleManager.java:444)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
at org.apache.sqoop.tool.JobTool.execJob(JobTool.java:228)
at org.apache.sqoop.tool.JobTool.run(JobTool.java:283)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
Troubleshooting Steps:
0) HDFS is stable, other Sqoop jobs are functional, Oracle source DB is up and the connection has been tested.
1) I tried creating a synonym in Oracle, that way I could simply have the --table option as:
--table TABLENAME (without the username)
This gave me an error that the table name was not correct. It needs the full USERNAME.TABLENAME for the --table option.
Error:
16/02/05 12:04:46 ERROR tool.ImportTool: Imported Failed: There is no column found in the target table <TABLENAME>. Please ensure that your table name is correct.
2) I made sure that this is a Parquet issue. I removed the --as-parquetfile option and the job was successful.
3) I wondered if this is somehow caused by the incremental options. I removed the --incremental append & --check-column options and the job was successful. This confuses me.
4) I tried the job with MySQL and it was successful.
Has anyone run into something similar? Is there a way (or is it even advisable) to disable the Kite validation? It seems that the dataset is being created with dots ("."), which then Kite SDK complains about - but this is an assumption on my part as I am not too familiar with Kite SDK.
Thanks in advance,
Jose
Resolved. There seems to be a known issue with the JDBC connectivity to Oracle 12c. Using a specific OJDBC6 (instead of 7) did the trick. FYI - the OJDBC is installed in /usr/share/java/ and a symbolic link is created in /installpath.../lib/sqoop/lib/
As reported by user @Remya Senan,
breaking the parameter
--hive-table my_hive_db_name.my_hive_table_name
into separate params
--hive-database my_hive_db_name
--hive-table my_hive_table_name
did the trick for me
My environment was
Sqoop v1.4.7
Hive 2.3.3
Tip: I was on emr-5.19.0
I also got this error when I was using Sqoop to import all tables as Parquet files on CDH 5.8. From the error message, I suspected that this implementation does not support directories with "-" in their name. Based on this understanding, I removed the "-" from the directory name and re-ran the sqoop import command, and everything worked fine. Hope this helps!
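The reports above share one pattern: a dot or hyphen ends up in the dataset name that Kite validates. A quick pre-check can catch this before the job is submitted. The sketch below follows only the wording of the error message ("alphanumeric (plus '_')"), not the Kite source code, and the function name is_kite_safe_name is hypothetical.

```shell
# Succeeds (status 0) when the name is non-empty and contains only
# letters, digits and underscores; fails (status 1) otherwise.
is_kite_safe_name() {
  case "$1" in
    ''|*[!A-Za-z0-9_]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Example: a schema-qualified name is rejected because of the dot.
if is_kite_safe_name "MYUSER.MYTABLE"; then
  echo "name looks acceptable"
else
  echo "name contains characters other than letters, digits and '_'"
fi
```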

Passing parameter to sqoop job

I'm creating a Sqoop job which will be scheduled in Oozie to load daily data into Hive.
I want to do an incremental load into Hive based on a date, which will be passed to the Sqoop job as a parameter.
After a lot of research, I have been unable to find a way to pass a parameter to a Sqoop job.
You do this by passing the date down through two stages:
Coordinator to workflow
In your coordinator you can pass the date to the workflow that it executes as a <property>, like this:
<coordinator-app name="schedule" frequency="${coord:days(1)}"
                 start="2015-01-01T00:00Z" end="2025-01-01T00:00Z"
                 timezone="Etc/UTC" xmlns="uri:oozie:coordinator:0.2">
  ...
  <action>
    <workflow>
      <app-path>${nameNode}/your/workflow.xml</app-path>
      <configuration>
        <property>
          <name>workflow_date</name>
          <value>${coord:formatTime(coord:nominalTime(), 'yyyyMMdd')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
  ...
</coordinator-app>
Workflow to Sqoop
In your workflow you can reference that property in your Sqoop call using the ${workflow_date} variable, like this:
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
  ...
  <command>import --connect jdbc:connect:string:here --table tablename --target-dir /your/import/dir/${workflow_date}/ -m 1</command>
  ...
</sqoop>
The solution below is from the Apache Sqoop Cookbook.
Preserving the Last Imported Value
Problem
Incremental import is a great feature that you're using a lot. Shouldering the responsibility for remembering the last imported value is getting to be a hassle.
Solution
You can take advantage of the built-in Sqoop metastore that allows you to save all parameters for later reuse. You can create a simple incremental import job with the following command:
sqoop job \
--create visits \
-- import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table visits \
--incremental append \
--check-column id \
--last-value 0
And start it with the --exec parameter:
sqoop job --exec visits
Discussion
The Sqoop metastore is a powerful part of Sqoop that allows you to retain your job definitions and to easily run them anytime. Each saved job has a logical name that is used for referencing. You can list all retained jobs using the --list parameter:
sqoop job --list
You can remove the old job definitions that are no longer needed with the --delete parameter, for example:
sqoop job --delete visits
And finally, you can also view content of the saved job definitions using the --show parameter, for example:
sqoop job --show visits
Output of the --show command will be in the form of properties. Unfortunately, Sqoop currently can't rebuild the command line that you used to create the saved job.
The most important benefit of the built-in Sqoop metastore is in conjunction with incremental import. Sqoop will automatically serialize the last imported value back into the metastore after each successful incremental job. This way, users do not need to remember the last imported value after each execution; everything is handled automatically.

Resources