I'm creating a Sqoop job which will be scheduled in Oozie to load daily data into Hive.
I want to do an incremental load into Hive based on a date parameter that will be passed to the Sqoop job.
After researching a lot, I'm unable to find a way to pass a parameter to a Sqoop job.
You do this by passing the date down through two stages:
Coordinator to workflow
In your coordinator you can pass the date to the workflow that it executes as a <property>, like this:
<coordinator-app name="schedule" frequency="${coord:days(1)}"
                 start="2015-01-01T00:00Z" end="2025-01-01T00:00Z"
                 timezone="Etc/UTC" xmlns="uri:oozie:coordinator:0.2">
    ...
    <action>
        <workflow>
            <app-path>${nameNode}/your/workflow.xml</app-path>
            <configuration>
                <property>
                    <name>workflow_date</name>
                    <value>${coord:formatTime(coord:nominalTime(), 'yyyyMMdd')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
    ...
</coordinator-app>
Workflow to Sqoop
In your workflow you can reference that property in your Sqoop call using the ${workflow_date} variable, like this:
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
    ...
    <command>import --connect jdbc:connect:string:here --table tablename --target-dir /your/import/dir/${workflow_date}/ -m 1</command>
    ...
</sqoop>
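To tie the two stages together, the coordinator is submitted with a properties file. Below is a minimal sketch; the Oozie URL, host names and paths are placeholders you would replace with your own:
# Hypothetical job properties for the coordinator above; hosts and paths are placeholders.
cat > coordinator.properties <<'EOF'
nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8032
oozie.coord.application.path=${nameNode}/path/to/coordinator.xml
EOF

# Submit and start the coordinator; Oozie will pass workflow_date to every workflow run.
oozie job -oozie http://oozie-host:11000/oozie -config coordinator.properties -run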
The solution below is from the Apache Sqoop Cookbook.
Preserving the Last Imported Value
Problem
Incremental import is a great feature that you're using a lot. Shouldering the responsibility for remembering the last imported value is getting to be a hassle.
Solution
You can take advantage of the built-in Sqoop metastore that allows you to save all parameters for later reuse. You can create a simple incremental import job with the following command:
sqoop job \
--create visits \
-- import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table visits \
--incremental append \
--check-column id \
--last-value 0
And start it with the --exec parameter:
sqoop job --exec visits
Discussion
The Sqoop metastore is a powerful part of Sqoop that allows you to retain your job definitions and to easily run them anytime. Each saved job has a logical name that is used for referencing. You can list all retained jobs using the --list parameter:
sqoop job --list
You can remove the old job definitions that are no longer needed with the --delete parameter, for example:
sqoop job --delete visits
And finally, you can also view the content of the saved job definitions using the --show parameter, for example:
sqoop job --show visits
Output of the --show command will be in the form of properties. Unfortunately, Sqoop currently can't rebuild the command line that you used to create the saved job.
The most important benefit of the built-in Sqoop metastore is in conjunction with incremental import. Sqoop will automatically serialize the last imported value back into the metastore after each successful incremental job. This way, users do not need to remember the last imported value after each execution; everything is handled automatically.
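For example, assuming the visits job above was created against a reachable database, you can watch the metastore do this for you:
# Run the saved incremental job; on success Sqoop writes the new last value back to the metastore.
sqoop job --exec visits

# Inspect the saved definition; the stored incremental last value is updated after each successful run.
sqoop job --show visits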
Related
Does sqoop import/export create Java classes? If so, where can I see these generated classes, and what is the location of these class files?
Does sqoop import/export create java classes?
Yes
If so, where can I see these generated classes, and what is the location of these class files?
It automatically generates a Java file with the same name as the table in the current directory on the local system.
You can use --outdir to provide your own path.
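For example, a small sketch (connection details are placeholders) that keeps the generated sources out of the current directory:
# --outdir controls where the generated .java file is written;
# --bindir (if you need it) controls where the compiled .class and .jar files go.
sqoop import \
  --connect jdbc:mysql://localhost/databasename \
  --username username \
  --password password \
  --table tablename \
  --outdir /tmp/sqoop/generated-src \
  -m 1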
Updated as per comment
You can use codegen command for this:
sqoop codegen \
  --connect jdbc:mysql://localhost/databasename \
  --username username \
  --password password \
  --table tablename
After the command executes successfully, the path where the generated Java files were written is printed at the end of the output.
This is the complete flow of Sqoop commands:
User ---> Sqoop CLI cmd ---> Sqoop code gen ---> Sqoop JAR writer ---> JAR submission ---> ResourceManager ---> MR operation (5 phases) ---> HDFS ---> Ack to Sqoop by MR program
Sqoop internally uses MapReduce (v1 or v2) for its execution: it gets the data from the database and stores it in HDFS as comma-delimited values. It first creates a .java source file for the MapReduce program, packages it into a JAR, and then submits it.
The .java file is created in the current local directory with the name of the table.
sqoop import --connect jdbc:mysql://localhost/hadoop --table employee -m 1
In this case an "employee.java" file is created.
When using sqoop import it is possible to pass Java properties.
In my case, I need to pass
-Doraoop.oracle.rac.service.name=myservice
together with --direct to use the Sqoop direct connector against an Oracle RAC.
Now I need to create a Sqoop job with the same parameter, but when I try issuing
sqoop job --create myjob -- import -Doraoop.oracle.rac.service.name=myservice --direct --connect...
It complains saying
ERROR tool.BaseSqoopTool: Error parsing arguments for import:
ERROR tool.BaseSqoopTool: Unrecognized argument: -Doraoop.oracle.rac.service.name=myservice
....
Wherever I put the -D it doesn't work, whereas with a plain sqoop import it works.
It works only in the following way:
sqoop job -Doraoop.oracle.rac.service.name=myservice --create myjob -- import ...
but in this way, the property is passed to the current execution only and not to subsequent executions of the saved job.
Is there a way to pass java properties through -D to a sqoop job --create myjob -- import command?
Trying with sqoop 1.4.6 on cdh 5.5
As per the Sqoop docs, the syntax for the sqoop job command is:
sqoop job (generic-args) (job-args) [-- [subtool-name] (subtool-args)]
Now -D is a generic-arg and --create myjob is a job-arg.
So you have to use a command like:
sqoop job -Doraoop.oracle.rac.service.name=myservice --create myjob ....
For the current and all subsequent job executions, it should behave in the same manner.
Inspect the configuration of the job using:
sqoop job --show myjob
Check if you find any difference between the first execution and subsequent executions.
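Putting it together, a sketch of creating and running the job with the generic argument up front; the connection string, credentials and table are placeholders:
# -D must come right after "sqoop job", before --create and the tool-specific arguments.
sqoop job -Doraoop.oracle.rac.service.name=myservice \
  --create myjob \
  -- import \
  --direct \
  --connect jdbc:oracle:thin:@rac-scan-host:1521/myservice \
  --username user \
  --password-file /user/me/oracle.password \
  --table MYTABLE

# If needed, the same generic argument can also be passed when executing the saved job.
sqoop job -Doraoop.oracle.rac.service.name=myservice --exec myjob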
I have a lot of Sqoop jobs running in AWS EMR, but sometimes I need to turn off the instance.
Is there a way to save the last id from the incremental import, maybe locally, and upload it to S3 via a cron job?
My first idea is, when I create the job, to just send a request to Redshift, where my data is stored, and get the last id or last_modified via a bash script.
Another idea is to get the output of sqoop job --show $jobid, filter out the last-value parameter, and use it to create the job again.
But I don't know if Sqoop offers a way to do this more easily.
As per the Sqoop docs,
If an incremental import is run from the command line, the value which should be specified as --last-value in a subsequent incremental import will be printed to the screen for your reference. If an incremental import is run from a saved job, this value will be retained in the saved job. Subsequent runs of sqoop job --exec someIncrementalJob will continue to import only newer rows than those previously imported.
So, you need to store nothing. Sqoop's metastore will take care of saving last value and avail for next incremental import job.
For example:
sqoop job \
--create new_job \
-- \
import \
--connect jdbc:mysql://localhost/testdb \
--username xxxx \
--password xxxx \
--table employee \
--incremental append \
--check-column id \
--last-value 0
And start this job with the --exec parameter:
sqoop job --exec new_job
Solution
I changed the sqoop-site.xml file and added the endpoint of my MySQL instance.
Steps
Create the MySQL instance and run these queries:
CREATE TABLE SQOOP_ROOT (version INT, propname VARCHAR(128) NOT NULL, propval VARCHAR(256), CONSTRAINT SQOOP_ROOT_unq UNIQUE (version, propname));
INSERT INTO SQOOP_ROOT VALUES(NULL, 'sqoop.hsqldb.job.storage.version', '0');
Change the original sqoop-site.xml adding your MySQL endpoint, user and password.
<property>
<name>sqoop.metastore.client.enable.autoconnect</name>
<value>true</value>
<description>If true, Sqoop will connect to a local metastore
for job management when no other metastore arguments are
provided.
</description>
</property>
<!--
The auto-connect metastore is stored in ~/.sqoop/. Uncomment
these next arguments to control the auto-connect process with
greater precision.
-->
<property>
<name>sqoop.metastore.client.autoconnect.url</name>
<value>jdbc:mysql://your-mysql-instance-endpoint:3306/database</value>
<description>The connect string to use when connecting to a
job-management metastore. If unspecified, uses ~/.sqoop/.
You can specify a different path here.
</description>
</property>
<property>
<name>sqoop.metastore.client.autoconnect.username</name>
<value>${sqoop-user}</value>
<description>The username to bind to the metastore.
</description>
</property>
<property>
<name>sqoop.metastore.client.autoconnect.password</name>
<value>${sqoop-pass}</value>
<description>The password to bind to the metastore.
</description>
</property>
When you execute the command sqoop job --list for the first time it will return no jobs. But after creating the jobs, even if you shut down the EMR cluster, you don't lose the Sqoop metadata for the existing jobs.
In EMR, we can use a bootstrap action to automate this operation at cluster creation.
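A minimal sketch of such a script, assuming the customized sqoop-site.xml has been uploaded to S3 and that Sqoop's configuration lives under /etc/sqoop/conf on your EMR release (on releases where applications are installed after bootstrap, run this as a step instead):
#!/bin/bash
# Hypothetical bootstrap/step script: replace the default sqoop-site.xml with the
# version that points the Sqoop metastore at the external MySQL instance.
set -euo pipefail
aws s3 cp s3://my-bucket/config/sqoop-site.xml /tmp/sqoop-site.xml
sudo cp /tmp/sqoop-site.xml /etc/sqoop/conf/sqoop-site.xml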
I'm using the Oozie Sqoop action to import data into the data lake.
I need an HDFS folder for each table of the source database. I have more than 300 tables.
I could hardcode all 300 Sqoop actions in a workflow, but then the workflow would be too big for the Oozie configuration:
Error submitting job /user/me/workflow.xml
E0736: Workflow definition length [107,123] exceeded maximum allowed length [100,000]
Having a big file like that isn't a good idea because it slows down the system (it is saved in the database) and it's hard to maintain.
The question is: how do I call a sub-workflow for each table name?
Equivalent shell script would be something like:
while read TABLE; do
    sqoop import --connect ${CONNECT} --username ${USERNAME} --password ${PASSWORD} --table ${TABLE} --target-dir ${HDFS_LOCATION}/${TABLE} --num-mappers ${NUM_MAPPERS}
done < tables.data
Where tables.data contains a list of table names which is a subset of the source database's table names. For example:
TABLE_ONE
TABLE_TWO
TABLE_SIX
TABLE_TEN
And here is the sub-workflow I want to call for each table:
<workflow-app name="sub-workflow-import-table" xmlns="uri:oozie:workflow:0.5">
    <start to="sqoop-import"/>
    <action name="sqoop-import">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <command>import --connect ${CONNECT} --username ${USERNAME} --password ${PASSWORD} --table ${TABLE} --target-dir ${HDFS_LOCATION}/${TABLE} --num-mappers ${NUM_MAPPERS}</command>
        </sqoop>
        <ok to="end"/>
        <error to="log-and-kill"/>
    </action>
    <kill name="log-and-kill">
        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
Let me know if you need more details.
Thanks!
David
There's sadly no way to do this nicely in Oozie - you'd need to hardcode all 300 Sqoop actions into an Oozie XML. This is because Oozie deals with directed acyclic graphs, which means loops (like your shell script) don't have an Oozie equivalent.
However I don't think Oozie is the right tool here. Oozie requires one container per action to use as a launcher, which means your cluster will need to allocate 300 additional containers over the course of a single run. This can effectively deadlock a cluster, as you end up in situations where the launchers prevent the actual jobs from running! I've worked on a large cluster with > 1000 tables and we used Bash there to avoid this issue.
If you do want to go ahead with this in Oozie, you can't avoid generating a workflow with 300 actions. I would do it as 300 actions rather than 300 calls to sub-workflows which each call one action, else you're going to generate even more overhead. You can either create this file manually, or preferably write some code to generate the Oozie workflow XML file given a list of tables. The latter is more flexible as it allows tables to be included or excluded on a per-run basis.
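As a rough sketch of that generation step (file names and variable names are assumptions, and the action body reuses the placeholders from the question), a Bash script could emit one chained Sqoop action per table listed in tables.data:
#!/bin/bash
# Hypothetical generator: writes a workflow with one Sqoop action per table,
# each action chaining to the next; the last one goes to "end".
# Assumes tables.data has one table name per line, no blank lines.
OUT=generated-workflow.xml
mapfile -t TABLES < tables.data

{
  echo '<workflow-app name="import-all-tables" xmlns="uri:oozie:workflow:0.5">'
  echo "    <start to=\"sqoop-${TABLES[0]}\"/>"
  for i in "${!TABLES[@]}"; do
    TABLE=${TABLES[$i]}
    NEXT=${TABLES[$((i + 1))]:-}
    if [ -n "$NEXT" ]; then TO="sqoop-${NEXT}"; else TO="end"; fi
    cat <<EOF
    <action name="sqoop-${TABLE}">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>\${jobTracker}</job-tracker>
            <name-node>\${nameNode}</name-node>
            <command>import --connect \${CONNECT} --username \${USERNAME} --password \${PASSWORD} --table ${TABLE} --target-dir \${HDFS_LOCATION}/${TABLE} --num-mappers \${NUM_MAPPERS}</command>
        </sqoop>
        <ok to="${TO}"/>
        <error to="log-and-kill"/>
    </action>
EOF
  done
  echo '    <kill name="log-and-kill">'
  echo '        <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>'
  echo '    </kill>'
  echo '    <end name="end"/>'
  echo '</workflow-app>'
} > "$OUT"
Run it once per deployment and submit the generated file as your workflow; including or excluding tables is then just a matter of editing tables.data.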
But as I initially said, I'd stick to Bash for this one unless you have a very very good reason.
My suggestion would be to create one workflow per batch of 50 table imports, so you have 6 of them. Call all 6 workflows as sub-workflows from a master or parent workflow. This way you have control at a single point and it is easy to schedule a single workflow.
I have created a Sqoop job called TeamMemsImportJob which basically pulls data from SQL Server into Hive.
I can execute the sqoop job through the unix command line by running the following command:
sqoop job --exec TeamMemsImportJob
If I create an Oozie job with the actual Sqoop import command in it, it runs through fine.
However if I create the oozie job and run the sqoop job through it, I get the following error:
oozie job -config TeamMemsImportJob.properties -run
>>> Invoking Sqoop command line now >>>
4273 [main] WARN org.apache.sqoop.tool.SqoopTool - $SQOOP_CONF_DIR has not been set in the environment. Cannot check for additional configuration.
4329 [main] INFO org.apache.sqoop.Sqoop - Running Sqoop version: 1.4.4.2.1.1.0-385
5172 [main] ERROR org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage - Cannot restore job: TeamMemsImportJob
5172 [main] ERROR org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage - (No such job)
5172 [main] ERROR org.apache.sqoop.tool.JobTool - I/O error performing job operation: java.io.IOException: Cannot restore missing job TeamMemsImportJob
at org.apache.sqoop.metastore.hsqldb.HsqldbJobStorage.read(HsqldbJobStorage.java:256)
at org.apache.sqoop.tool.JobTool.execJob(JobTool.java:198)
It looks as if it cannot find the job. However, I can see the job listed:
[root@sandbox ~]# sqoop job --list
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
14/06/25 08:12:08 INFO sqoop.Sqoop: Running Sqoop version: 1.4.4.2.1.1.0-385
Available jobs:
TeamMemsImportJob
How do I resolve this?
You have to use the --meta-connect flag while creating a job, to create a custom Sqoop metastore database that Oozie can access.
sqoop \
job \
--meta-connect \
"jdbc:hsqldb:file:/on/server/not/hdfs/sqoop-metastore/sqoop-meta.db;shutdown=true" \
--create \
jobName \
-- \
import \
--connect jdbc:oracle:thin:@server:port:sid \
--username username \
--password-file /path/on/hdfs/server.password \
--table TABLE \
--incremental append \
--check-column ID \
--last-value "0" \
--target-dir /path/on/hdfs/TABLE
When you need to execute jobs, you can do it from Oozie the regular way, but make sure to include --meta-connect to indicate where the job is stored.
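For example, executing the saved job against the same metastore used at creation time (the JDBC path mirrors the create command above):
# Without --meta-connect, Sqoop would look in the private ~/.sqoop metastore and not find the job.
sqoop job \
  --meta-connect "jdbc:hsqldb:file:/on/server/not/hdfs/sqoop-metastore/sqoop-meta.db;shutdown=true" \
  --exec jobName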
From the log we can see that Sqoop cannot find the stored job.
This is because you are using the native HSQLDB metastore, which by default lives in the local user's ~/.sqoop/ directory.
To make Sqoop jobs available to other systems you should configure another database, for example MySQL, which can be accessed by all systems.
From the documentation:
Running sqoop-metastore launches a shared HSQLDB database instance on
the current machine. Clients can connect to this metastore and create
jobs which can be shared between users for execution
The location of the metastore’s files on disk is controlled by the
sqoop.metastore.server.location property in conf/sqoop-site.xml. This
should point to a directory on the local filesystem.
The metastore is available over TCP/IP. The port is controlled by the
sqoop.metastore.server.port configuration parameter, and defaults to
16000.
Clients should connect to the metastore by specifying
sqoop.metastore.client.autoconnect.url or --meta-connect with the
value jdbc:hsqldb:hsql://<server-name>:<port>/sqoop. For example,
jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop.
This metastore may be hosted on a machine within the Hadoop cluster,
or elsewhere on the network.
Can you check if that DB is accessible from other systems?
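A quick way to verify, assuming the shared metastore runs on metaserver.example.com with the default port 16000:
# On the metastore host: start the shared HSQLDB metastore service.
sqoop metastore &

# From any other machine (or from the host running the Oozie launcher): list the jobs stored there.
sqoop job --meta-connect jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop --list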