Execute a sub-workflow for each line of a file - hadoop

I'm using the Oozie Sqoop Action to import data into the data lake.
I need an HDFS folder for each table of the source database. I have more than 300 tables.
I could hardcode all 300 Sqoop Actions in one Workflow, but then the Workflow would be too big for the Oozie configuration.
Error submitting job /user/me/workflow.xml
E0736: Workflow definition length [107,123] exceeded maximum allowed length [100,000]
Having a big file like that isn't a good idea anyway, because it slows the system down (it is saved in the database) and it's hard to maintain.
The question is: how do I call a sub-workflow for each table name?
The equivalent shell script would be something like:
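# NUM_MAPPERS uses an underscore: in bash, ${NUM-MAPPERS} means "NUM, or the literal MAPPERS if NUM is unset"
while read -r TABLE; do
  sqoop import --connect "${CONNECT}" --username "${USERNAME}" --password "${PASSWORD}" --table "${TABLE}" --target-dir "${HDFS_LOCATION}/${TABLE}" --num-mappers "${NUM_MAPPERS}"
done < tables.data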
Where tables.data contains a list of table names, which is a subset of the source database's tables. For example:
TABLE_ONE
TABLE_TWO
TABLE_SIX
TABLE_TEN
And here is the sub-workflow I want to call for each table:
<workflow-app name="sub-workflow-import-table" xmlns="uri:oozie:workflow:0.5">
<start to="sqoop-import"/>
<action name="sqoop-import">
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<command>import --connect ${CONNECT} --username ${USERNAME} --password ${PASSWORD} --table ${TABLE} --target-dir ${HDFS_LOCATION}/${TABLE} --num-mappers ${NUM_MAPPERS}</command>
</sqoop>
<ok to="end"/>
<error to="log-and-kill"/>
</action>
<end name="end"/>
<kill name="log-and-kill">
<message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
</workflow-app>
Let me know if you need more precision.
Thanks!
David

There's sadly no way to do this nicely in Oozie - you'd need to hardcode all 300 Sqoop actions into an Oozie XML. This is because Oozie deals with directed acyclic graphs, which means loops (like your shell script) don't have an Oozie equivalent.
However I don't think Oozie is the right tool here. Oozie requires one container per action to use as a launcher, which means your cluster will need to allocate 300 additional containers over the space of a single run. This can effectively deadlock a cluster as you end up in situations where launchers prevent the actual jobs running! I've worked on a large cluster with > 1000 tables and we used Bash there to avoid this issue.
If you do want to go ahead with this in Oozie, you can't avoid generating a workflow with 300 actions. I would do it as 300 actions rather than 300 calls to sub-workflows which each call one action, else you're going to generate even more overhead. You can either create this file manually, or preferably write some code to generate the Oozie workflow XML file given a list of tables. The latter is more flexible as it allows tables to be included or excluded on a per-run basis.
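As a rough illustration of the "write some code to generate the workflow" option, here is a minimal Python sketch. It assumes the table list is a plain text file with one name per line (like the question's tables.data), that table names contain no XML special characters, and that the actions run one after another. The function names, the workflow name and the NUM_MAPPERS property (written with an underscore) are choices made for this example, not anything Oozie prescribes.

```python
import xml.etree.ElementTree as ET


def read_tables(path):
    """Return the non-empty lines of a table list such as tables.data."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def build_workflow(tables, name="import-all-tables"):
    """Build workflow XML with one Sqoop action per table, chained in order."""
    if not tables:
        raise ValueError("at least one table is required")
    parts = [
        f'<workflow-app name="{name}" xmlns="uri:oozie:workflow:0.5">',
        f'<start to="import-{tables[0]}"/>',
    ]
    for i, table in enumerate(tables):
        # Each action hands over to the next table; the last one goes to "end".
        ok_to = f"import-{tables[i + 1]}" if i + 1 < len(tables) else "end"
        parts.append(
            f'<action name="import-{table}">\n'
            '<sqoop xmlns="uri:oozie:sqoop-action:0.2">\n'
            "<job-tracker>${jobTracker}</job-tracker>\n"
            "<name-node>${nameNode}</name-node>\n"
            "<command>import --connect ${CONNECT} --username ${USERNAME} "
            "--password ${PASSWORD} "
            f"--table {table} --target-dir ${{HDFS_LOCATION}}/{table} "
            "--num-mappers ${NUM_MAPPERS}</command>\n"
            "</sqoop>\n"
            f'<ok to="{ok_to}"/>\n'
            '<error to="log-and-kill"/>\n'
            "</action>"
        )
    parts += [
        '<kill name="log-and-kill">',
        "<message>Workflow failed, error message"
        "[${wf:errorMessage(wf:lastErrorNode())}]</message>",
        "</kill>",
        '<end name="end"/>',
        "</workflow-app>",
    ]
    document = "\n".join(parts)
    ET.fromstring(document)  # fail fast if the result is not well-formed XML
    return document


if __name__ == "__main__":
    print(build_workflow(["TABLE_ONE", "TABLE_TWO", "TABLE_SIX", "TABLE_TEN"]))
```

Regenerating the file from a different table list is how tables would be included or excluded per run. The generated definition is still subject to the length limit from the question.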
But as I initially said, I'd stick to Bash for this one unless you have a very very good reason.

My suggestion would be to create workflows that each import 50 tables, which gives you 6 of them. Call all 6 as sub-workflows from a master (parent) workflow. This way control stays in one place, and it is easy to schedule a single workflow.
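The grouping step of this suggestion could be sketched in Python as follows. The table names are made up for the example, and the batch size of 50 is simply the number from the suggestion above.

```python
def batches(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    # Hypothetical table names, only to have 300 entries to split.
    tables = [f"TABLE_{n:03d}" for n in range(1, 301)]
    groups = batches(tables, 50)
    for number, group in enumerate(groups, start=1):
        print(f"workflow {number}: {group[0]} .. {group[-1]} ({len(group)} tables)")
```

Each group would then become one child workflow, and the parent workflow would hold one sub-workflow action per group.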

Related

Save sqoop incremental import id

I have a lot of sqoop jobs running in AWS EMR, but sometimes I need to turn off this instance.
Is there a way to save the last id from an incremental import, maybe locally, and upload it to S3 via a cron job?
My first idea is that when I create the job, I send a request via a bash script to Redshift, where my data is stored, and get the last id or last_modified.
Another idea is to get the output of sqoop job --show $jobid, filter out the last_id parameter, and use it to create the job again.
But I don't know if sqoop offers a way to do this more easily.
As per the Sqoop docs,
If an incremental import is run from the command line, the value which should be specified as --last-value in a subsequent incremental import will be printed to the screen for your reference. If an incremental import is run from a saved job, this value will be retained in the saved job. Subsequent runs of sqoop job --exec someIncrementalJob will continue to import only newer rows than those previously imported.
So you don't need to store anything yourself. Sqoop's metastore takes care of saving the last value and making it available to the next incremental import job.
Example:
sqoop job \
--create new_job \
-- \
import \
--connect jdbc:mysql://localhost/testdb \
--username xxxx \
--password xxxx \
--table employee \
--incremental append \
--check-column id \
--last-value 0
And start this job with the --exec parameter:
sqoop job --exec new_job
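If you still want to read the stored value yourself (the second idea in the question), a small Python sketch could parse the text printed by sqoop job --show. It assumes the output contains lines of the form "name = value" and that the last value is stored under the property name incremental.last.value. Both are assumptions to check against your Sqoop version, and the sample text below is illustrative, not captured output.

```python
def last_value_from_show(show_output, key="incremental.last.value"):
    """Return the value stored under `key` in `sqoop job --show` text, or None."""
    for line in show_output.splitlines():
        name, separator, value = line.partition("=")
        if separator and name.strip() == key:
            return value.strip()
    return None


if __name__ == "__main__":
    # Illustrative text only; real output has many more properties.
    sample = "\n".join([
        "Job: new_job",
        "Tool: import",
        "Options:",
        "incremental.col = id",
        "incremental.last.value = 1042",
    ])
    print(last_value_from_show(sample))
```

In practice the text would come from running sqoop job --show new_job and capturing its standard output.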
Solution
I changed the file sqoop-site.xml and added the endpoint of my MySQL instance.
Steps
Create the MySQL instance and run these queries:
CREATE TABLE SQOOP_ROOT (version INT, propname VARCHAR(128) NOT NULL, propval VARCHAR(256), CONSTRAINT SQOOP_ROOT_unq UNIQUE (version, propname));
INSERT INTO SQOOP_ROOT VALUES(NULL, 'sqoop.hsqldb.job.storage.version', '0');
Then change the original sqoop-site.xml, adding your MySQL endpoint, user and password:
<property>
<name>sqoop.metastore.client.enable.autoconnect</name>
<value>true</value>
<description>If true, Sqoop will connect to a local metastore
for job management when no other metastore arguments are
provided.
</description>
</property>
<!--
The auto-connect metastore is stored in ~/.sqoop/. Uncomment
these next arguments to control the auto-connect process with
greater precision.
-->
<property>
<name>sqoop.metastore.client.autoconnect.url</name>
<value>jdbc:mysql://your-mysql-instance-endpoint:3306/database</value>
<description>The connect string to use when connecting to a
job-management metastore. If unspecified, uses ~/.sqoop/.
You can specify a different path here.
</description>
</property>
<property>
<name>sqoop.metastore.client.autoconnect.username</name>
<value>${sqoop-user}</value>
<description>The username to bind to the metastore.
</description>
</property>
<property>
<name>sqoop.metastore.client.autoconnect.password</name>
<value>${sqoop-pass}</value>
<description>The password to bind to the metastore.
</description>
</property>
The first time you execute sqoop job --list, it will return no jobs. But once you have created the jobs, shutting down the EMR cluster no longer loses the Sqoop metadata from executed jobs.
In EMR, we can use a Bootstrap Action to automate this operation at cluster creation.

Passing parameter to sqoop job

I'm creating a sqoop job which will be scheduled in Oozie to load daily data into Hive.
I want to do an incremental load into Hive based on a date, which will be passed to the sqoop job as a parameter.
After a lot of research I'm unable to find a way to pass a parameter to a Sqoop job.
You do this by passing the date down through two stages:
Coordinator to workflow
In your coordinator you can pass the date to the workflow that it executes as a <property>, like this:
<coordinator-app name="schedule" frequency="${coord:days(1)}"
start="2015-01-01T00:00Z" end="2025-01-01T00:00Z"
timezone="Etc/UTC" xmlns="uri:oozie:coordinator:0.2">
...
<action>
<workflow>
<app-path>${nameNode}/your/workflow.xml</app-path>
<configuration>
<property>
<name>workflow_date</name>
<value>${coord:formatTime(coord:nominalTime(), 'yyyyMMdd')}</value>
</property>
</configuration>
</workflow>
</action>
...
</coordinator-app>
Workflow to Sqoop
In your workflow you can reference that property in your Sqoop call using the ${workflow_date} variable, like this:
<sqoop xmlns="uri:oozie:sqoop-action:0.2">
...
<command>import --connect jdbc:connect:string:here --table tablename --target-dir /your/import/dir/${workflow_date}/ -m 1</command>
...
</sqoop>
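To make the two stages concrete, this small Python sketch reproduces what the value looks like for one run. It assumes the coordinator's first nominal time (2015-01-01T00:00Z from the example above) and the target directory from the Sqoop command. The function names are made up for the illustration, and strftime("%Y%m%d") stands in for the 'yyyyMMdd' pattern.

```python
from datetime import datetime, timezone


def workflow_date(nominal_time):
    """Format a nominal time the way the 'yyyyMMdd' pattern would."""
    return nominal_time.strftime("%Y%m%d")


def target_dir(nominal_time, base="/your/import/dir"):
    """Build the per-run import directory used in the Sqoop command."""
    return f"{base}/{workflow_date(nominal_time)}/"


if __name__ == "__main__":
    first_run = datetime(2015, 1, 1, tzinfo=timezone.utc)
    print(workflow_date(first_run))  # 20150101
    print(target_dir(first_run))     # /your/import/dir/20150101/
```

Each daily run would get its own value, so successive imports land in separate directories.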
The solution below is from the Apache Sqoop Cookbook.
Preserving the Last Imported Value
Problem
Incremental import is a great feature that you're using a lot. Shouldering the responsibility for remembering the last imported value is getting to be a hassle.
Solution
You can take advantage of the built-in Sqoop metastore that allows you to save all parameters for later reuse. You can create a simple incremental import job with the following command:
sqoop job \
--create visits \
-- import \
--connect jdbc:mysql://mysql.example.com/sqoop \
--username sqoop \
--password sqoop \
--table visits \
--incremental append \
--check-column id \
--last-value 0
And start it with the --exec parameter:
sqoop job --exec visits
Discussion
The Sqoop metastore is a powerful part of Sqoop that allows you to retain your job definitions and to easily run them anytime. Each saved job has a logical name that is used for referencing. You can list all retained jobs using the --list parameter:
sqoop job --list
You can remove the old job definitions that are no longer needed with the --delete parameter, for example:
sqoop job --delete visits
And finally, you can also view content of the saved job definitions using the --show parameter, for example:
sqoop job --show visits
Output of the --show command will be in the form of properties. Unfortunately, Sqoop currently can't rebuild the command line that you used to create the saved job.
The most important benefit of the built-in Sqoop metastore is in conjunction with incremental import. Sqoop will automatically serialize the last imported value back into the metastore after each successful incremental job. This way, users do not need to remember the last imported value after each execution; everything is handled automatically.

Oozie shell script action

I am exploring the capabilities of Oozie for managing Hadoop workflows. I am trying to set up a shell action which invokes some hive commands. My shell script hive.sh looks like:
#!/bin/bash
hive -f hivescript
Where the hive script (which has been tested independently) creates some tables and so on. My question is where to keep the hivescript and then how to reference it from the shell script.
I've tried two ways, first using a local path, like hive -f /local/path/to/file, and using a relative path like above, hive -f hivescript, in which case I keep my hivescript in the oozie app path directory (same as hive.sh and workflow.xml) and set it to go to the distributed cache via the workflow.xml.
With both methods I get the error message:
"Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]" on the oozie web console. Additionally I've tried using hdfs paths in shell scripts and this does not work as far as I know.
My job.properties file:
nameNode=hdfs://sandbox:8020
jobTracker=hdfs://sandbox:50300
queueName=default
oozie.libpath=${nameNode}/user/oozie/share/lib
oozie.use.system.libpath=true
oozieProjectRoot=${nameNode}/user/sandbox/poc1
appPath=${oozieProjectRoot}/testwf
oozie.wf.application.path=${appPath}
And workflow.xml:
<shell xmlns="uri:oozie:shell-action:0.1">
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.job.queue.name</name>
<value>${queueName}</value>
</property>
</configuration>
<exec>${appPath}/hive.sh</exec>
<file>${appPath}/hive.sh</file>
<file>${appPath}/hive_pill</file>
</shell>
<ok to="end"/>
<error to="end"/>
</action>
<end name="end"/>
My objective is to use oozie to call a hive script through a shell script, please give your suggestions.
One thing that has always been tricky about Oozie workflows is the execution of bash scripts.
Hadoop was created to be massively parallel, so the architecture behaves very differently from what you might expect.
When an oozie workflow executes a shell action, it will receive resources from your job tracker or YARN on any of the nodes in your cluster. This means that using a local location for your file will not work, since the local storage is exclusively on your edge node. If the job happened to spawn on your edge node then it would work, but any other time it would fail, and this distribution is random.
To get around this, I found it best to have the files I needed (including the sh scripts) in hdfs in either a lib space or the same location as my workflow.
Here is a good way to approach what you are trying to achieve.
<shell xmlns="uri:oozie:shell-action:0.1">
<exec>hive.sh</exec>
<file>/user/lib/hive.sh#hive.sh</file>
<file>ETL_file1.hql#hivescript</file>
</shell>
One thing you will notice is that the exec is just hive.sh, since we are assuming that the file will be moved to the base directory where the shell action is executed.
To make sure that last note is true, you must include the file's HDFS path; this forces Oozie to distribute that file with the action. In your case, the hive script launcher should only be coded once and simply fed different files. Since we have a one-to-many relationship, hive.sh should be kept in a lib directory and not distributed with every workflow.
Lastly you see the line:
<file>ETL_file1.hql#hivescript</file>
This line does two things. Before the # we have the location of the file. It is just the file name, since we should distribute our distinct hive files with our workflows:
user/directory/workflow.xml
user/directory/ETL_file1.hql
The node running the sh will have this file distributed to it automagically. Lastly, the part after the # is the name under which the file appears in the working directory of the sh script (Oozie creates a symlink with that name). This gives you the ability to reuse the same script over and over and simply feed it different files.
HDFS directory notes:
If the file is nested inside the same directory as the workflow, then you only need to specify child paths:
user/directory/workflow.xml
user/directory/hive/ETL_file1.hql
Would yield:
<file>hive/ETL_file1.hql#hivescript</file>
But if the path is outside of the workflow directory you will need the full path:
user/directory/workflow.xml
user/lib/hive.sh
would yield:
<file>/user/lib/hive.sh#hive.sh</file>
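The path rules above can be summarised in a short Python sketch. It only models the behaviour described here (relative paths resolved against the workflow directory, the part after # used as the name in the working directory) and does not call Oozie. The function name and the /user/directory application path are assumptions for the example.

```python
import posixpath


def resolve_file_spec(spec, app_dir):
    """Split a <file> entry into (HDFS path, name seen by the script)."""
    path, _, link_name = spec.partition("#")
    # Absolute paths and full URIs are kept; anything else is relative
    # to the directory that holds workflow.xml.
    if not path.startswith("/") and "://" not in path:
        path = posixpath.join(app_dir, path)
    # Without a "#name" part, the file keeps its own name.
    return path, link_name or posixpath.basename(path)


if __name__ == "__main__":
    app_dir = "/user/directory"  # hypothetical workflow directory
    for spec in ("ETL_file1.hql#hivescript",
                 "hive/ETL_file1.hql#hivescript",
                 "/user/lib/hive.sh#hive.sh"):
        print(spec, "->", resolve_file_spec(spec, app_dir))
```

With the first two entries the script sees a file called hivescript, which is why hive -f hivescript in hive.sh works no matter which .hql file is supplied.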
I hope this helps everyone.
From
http://oozie.apache.org/docs/3.3.0/DG_ShellActionExtension.html#Shell_Action_Schema_Version_0.2
If you keep both your shell script and your hive script in a folder inside the workflow directory, then you can execute them.
See the command in the sample:
<exec>${EXEC}</exec>
<argument>A</argument>
<argument>B</argument>
<file>${EXEC}#${EXEC}</file> <!--Copy the executable to compute node's current working directory -->
You can write whatever commands you want in the file.
You can also use the hive action directly:
http://oozie.apache.org/docs/3.3.0/DG_HiveActionExtension.html

Can I rename the oozie job name dynamically

We have a Hadoop service in which we have multiple applications. We need to process the data for each of the applications by re-executing the same workflow. These are scheduled to execute at the same time of the day. The issue is that when these jobs are running it's hard to know for which application the job is running/failed/succeeded. Of course, I can open the job configuration and find out, but that takes time since there are tens of applications running under that service.
Is there any option in oozie to dynamically pass the name of the workflow (or part of it) when executing the job such as
oozie job -run -config <filename> -name "<NameIWishToGive>"
OR
oozie job -run -config <filename> -nameSuffix "<MyApplicationNameUnderTheService>"
Also, we don't wish to create multiple job folders to execute separately, as that would be too much copy-paste.
Please suggest.
It looks to me like you should be able to just use properties set in the job config.
I was able to get a dynamic name by doing the following.
Here's an example of my workflow.xml:
<workflow-app xmlns="uri:oozie:workflow:0.2" name="map-reduce-wf-${environment}">
...
</workflow-app>
And in my job.properties I had:
...
environment=test
...
The name ended up being: "map-reduce-wf-test"
You will find a whole bunch of Oozie command lines here in the Apache docs. I'm not sure which one exactly you are looking for, so I thought I'd just paste the link. Hope this helps!
I couldn't find anything in oozie to do that. Here is the script that does find/replace of #{appName} and #{frequency} in *.xml files + uploads all files to hdfs. Values are taken from the properties file passed to the script as the 3rd argument.
Gist - https://gist.github.com/epishkin/5952522
Example:
./upload.sh simple_reports namenode01 simple_reports/coordinator_script-1.properties
where 'simple_reports' is a folder with workflow.xml and coordinator.xml files.
workflow.xml:
<workflow-app name="#{appName}" xmlns="uri:oozie:workflow:0.3">
...
</workflow-app>
coordinator.xml:
<coordinator-app name="#{appName}-coord" xmlns="uri:oozie:coordinator:0.2"
frequency="#{frequency}"
start="${start}"
end="${end}"
timezone="America/New_York">
...
</coordinator-app>
coordinator_script-1.properties:
appName=multi_network
frequency=${coord:days(7)}
...
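For readers who prefer not to open the gist, the find/replace step can be approximated in a few lines of Python. This is not the gist's code, only a sketch of the same idea: it assumes a simple key=value properties format and #{name} placeholders, leaves unknown placeholders untouched, and skips the HDFS upload. The function names are made up for the example.

```python
import re

PLACEHOLDER = re.compile(r"#\{(\w+)\}")


def parse_properties(text):
    """Read key=value lines, ignoring blanks, comments and other lines."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def fill_placeholders(xml_text, props):
    """Replace each #{name} with its property value, if one is defined."""
    return PLACEHOLDER.sub(lambda m: props.get(m.group(1), m.group(0)), xml_text)


if __name__ == "__main__":
    props = parse_properties("appName=multi_network\nfrequency=${coord:days(7)}\n")
    template = '<coordinator-app name="#{appName}-coord" frequency="#{frequency}">'
    print(fill_placeholders(template, props))
```

Because only #{...} is rewritten, the regular ${...} expressions are left for Oozie to resolve at run time.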
Hope this helps.
I recently faced this issue: all the tables use the same workflow, but the name of the Oozie application should reflect the name of the table it is processing.
Put a parameter in the workflow name and pass that parameter from job.properties; the name of the Oozie application will then follow the pattern dataload_tablename.

I couldn't import the tables from my sql server to hive through sqoop

When I pass the command:
$sqoop create-hive-table --connect 'jdbc:sqlserver://10.100.0.18:1433;username=cloud;password=cloud123;database=hadoop' --table cluster
Some errors and warnings appear and at the end it says,
Failed to start database '/var/lib/hive/metastore/metastore_db', see the next exception for details [again a list of import errors displayed]
Finally it says hive exited with status 9.
What is the problem here? I am new to sqoop and hive. Can anyone please help me?
The correct syntax would be
sqoop import --connect 'jdbc:sqlserver://10.100.0.18:1433/hadoop' --username cloud --password cloud123 --table cluster --hive-import
I think you might want to check whether you have write permissions to the specified directory and whether a directory named metastore_db is being created.
This message is usually shown when you're running Sqoop with the default Hive configuration. By default Hive uses a Derby datastore, which is usable only in very basic test cases. I would recommend reconfiguring your Hive instance to use some other relational database as the datastore back end (MySQL, PostgreSQL, Oracle).
Your syntax is all wrong. Syntax is $sqoop tool-name [tool-arguments]
$sqoop import --create-hive-table --connect 'jdbc:sqlserver://10.100.0.18:1433/hadoop' --username cloud --password cloud123 --table cluster
Pasting a sample call of hive import using sqoop. This might help you to correct your syntax further. Remember that essentially you need to give minimum the below command to make it work.
sqoop import --connect jdbc:mysql://localhost/RAWDATA --table geolocation --username root --password hadoop --hive-import --create-hive-table --driver com.mysql.jdbc.Driver -m 1 --delete-target-dir
--connect, in this the part which reads /RAWDATA is the database name from your mysql instance which contains the geolocation table. You can execute 'show databases' and 'show tables' command in mysql to check for your databases and tables.
--delete-target-dir is used for safety. It makes Sqoop delete the target directory, where it writes the imported files before moving them into Hive, if that directory already exists. This avoids unnecessary "directory already exists" errors in case you retry the command.
--create-hive-table is required only if you did not create the target table in hive already. If your previous runs of sqoop command created the table already, then you can ignore this option completely. Check your hive database for existence of target hive table.
--driver names the JDBC driver class used for the database connection. Make sure the driver library is available to Sqoop. You can first try the one pasted above to see if it does the trick, and come back to this forum if you need more help.
Remember that we did not specify which Hive database the table will be created in, so it will go into Hive's default database. I am not giving that option since you are just starting out with Sqoop.

Resources