I have an Oozie installation as part of my Cloudera installation.
I'm trying to execute the coordinator workflow from the examples, with the following configuration in coordinator.xml:
<coordinator-app name="cron-coord" frequency="${coord:minutes(60)}" start="${start}" end="${end}" timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
With this configuration I expected the workflow to be executed every hour, but it seems the workflow is being executed every 5 minutes. Does anyone have an answer for this issue?
Are you setting the start time prior to the current time? If so, Oozie will work in catch-up mode until all delayed actions have been scheduled; the "frequency" setting does not apply while catching up.
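If the intent is simply one run per hour from now on, a common fix is to push the start time to the present or future so there is no backlog to catch up. A hedged sketch of the relevant job.properties entries (the dates are placeholders; start and end are the variables referenced by the coordinator above):
# placeholder dates -- keep start at or after the current time to avoid catch-up runs
start=2015-06-01T00:00Z
end=2015-12-31T00:00Z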
You may also express the frequency in hours instead of minutes:
<coordinator-app name="cron-coord" frequency="${coord:hours(1)}" start="${start}" end="${end}" timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
I'm using a Spring Boot 2.4.x app with Spring Batch 4.3.x, and I've created a simple job.
It has a FlatFileItemReader which reads from a CSV file and an ImportKafkaItemWriter which writes to a Kafka topic, combined in one step. I'm using SimpleJobLauncher with a ThreadPoolTaskExecutor set as the JobLauncher's TaskExecutor, and it works as expected. But I have one resilience use case: if I kill the app, then restart it and trigger the job again, it should carry on and finish the remaining work. Unfortunately, that is not happening. I investigated further and found that when I forcibly close the app, the key Spring Batch job repository tables look like this:
BATCH_JOB_EXECUTION (columns with no value shown were null):
job_execution_id = 1
version = 1
job_instance_id = 1
create_time = 2021-06-16 09:32:43
start_time = 2021-06-16 09:32:43
end_time = (null)
status = STARTED
exit_code = UNKNOWN
exit_message = (null)
last_updated = 2021-06-16 09:32:43
job_configuration_location = (null)
and
BATCH_STEP_EXECUTION (columns with no value shown were null):
step_execution_id = 1
version = 4
step_name = productImportStep
job_execution_id = 1
start_time = 2021-06-16 09:32:43
end_time = (null)
status = STARTED
commit_count = 3
read_count = 6
filter_count = 0
write_count = 6
read_skip_count = 0
write_skip_count = 0
process_skip_count = 0
rollback_count = 0
exit_code = EXECUTING
exit_message = (null)
last_updated = 2021-06-16 09:32:50
If I manually update these tables, setting a valid end_time and the status to FAILED, then I can restart the job and it works absolutely fine. How can I make Spring Batch update these repository tables appropriately so that I can avoid this manual step? I can provide more information about the code if needed.
When a job is killed abruptly, Spring Batch won't have a chance to update its status in the Job repository, so the status is stuck at STARTED. Now when the job is restarted, the only information that Spring Batch has is the status in the job repository. By just looking at the status in the database, Spring Batch cannot distinguish between a job that is effectively running and a job that has been killed abruptly (in both cases, the status is STARTED).
The way to go is indeed manually updating the tables, either marking the status as FAILED to be able to restart the job, or as ABANDONED to abandon it. This is a business decision that you have to make, and there is no way to automate it on the framework side. For more details, please refer to the reference documentation here: Aborting a Job.
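For reference, the manual fix usually boils down to something like the following SQL. This is a sketch only, assuming the default BATCH_ table prefix and the execution id 1 shown in the tables above:
UPDATE BATCH_STEP_EXECUTION
SET STATUS = 'FAILED', END_TIME = CURRENT_TIMESTAMP, EXIT_CODE = 'FAILED'
WHERE JOB_EXECUTION_ID = 1 AND STATUS = 'STARTED';

UPDATE BATCH_JOB_EXECUTION
SET STATUS = 'FAILED', END_TIME = CURRENT_TIMESTAMP, EXIT_CODE = 'FAILED'
WHERE JOB_EXECUTION_ID = 1 AND STATUS = 'STARTED';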
You can add a fake parameter, for example a Version counter that you increment for every new job execution, so you don't have to touch the job tables in the database.
What I mean is: build with mvn clean package and then launch the program like this:
java -jar my-jarfile.jar dest=/tmp/foo Version="0"
java -jar my-jarfile.jar dest=/tmp/foo Version="1"
java -jar my-jarfile.jar dest=/tmp/foo Version="2"
and so on. Or:
You can use JobParameters to launch the job programmatically via the JobLauncher and use a date parameter, date = new Date().toString(), which gives a new value on every job execution (see the sketch below).
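A hedged sketch of that programmatic variant, assuming jobLauncher and a job bean named productImportJob are already wired in the application:
// jobLauncher and productImportJob are assumed to be injected Spring beans
JobParameters params = new JobParametersBuilder()
        .addDate("date", new Date()) // new value on every launch, so every run is a new JobInstance
        .toJobParameters();
JobExecution execution = jobLauncher.run(productImportJob, params);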
You can use "JVM Shutdown Hook":
Something like this:
// jobExecution and jobRepository must be reachable here, e.g. fields of the bean that launched the job
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    if (jobExecution.isRunning()) {
        // mark the execution as failed so the job can be restarted later
        jobExecution.setEndTime(new Date());
        jobExecution.setStatus(BatchStatus.FAILED);
        jobExecution.setExitStatus(ExitStatus.FAILED);
        jobRepository.update(jobExecution);
    }
}));
I have scheduled a coordinator using the cron expression
frequency = "20 3 * * 2-4", but it gives an error.
The Oozie coordinator logs say: java.lang.IllegalArgumentException: parameter [frequency]=[20 3 * * 2-4] must be an integer. Parsing error for input String: "20 3 * * 2-4"
HDP version : 2.5.3
Oozie Client build version : 4.2.0.2.5.3.0-37
You are requesting Oozie to apply the XML schema for Coordinator in version 0.2 of that schema.
The documentation hints that CRON syntax worked with schema 0.2 but I'm pretty sure that CRON scheduling was introduced in Oozie V4.0 (and documented in V4.1) -- and since Oozie V4.0 introduced schema 0.4 I believe that the documentation is wrong.
Bottom line: requesting xmlns="uri:oozie:coordinator:0.4" should allow Oozie to parse your CRON schedule correctly.
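A hedged sketch of what the coordinator header would look like with the newer schema (everything else stays as in your existing coordinator):
<coordinator-app name="cron-coord" frequency="20 3 * * 2-4" start="${start}" end="${end}" timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    ...
</coordinator-app>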
We are planning to create an Oozie job which runs a Sqoop command to import data from SQL Server to HDFS on an hourly basis. But we are facing a challenge: how do we get an alert if the job fails in between, and how will Sqoop check which data was imported successfully and which is still pending? Is there any process to maintain transactions and a retry mechanism during the Sqoop import, and also to get an alert on failure?
You can configure the Oozie workflow to send an email on failure.
You can achieve this by redirecting the error transition of any action to a send-email action.
An example email action configuration might be the following:
<action name="send-email">
<email xmlns="uri:oozie:email-action:0.1">
<to>${emailToAddress}</to>
<subject>Failed to import table.</subject>
<body>The following import has failed.
failed the workflow that was trying to perform job --exec import-${tableName}-${environment}-${format}-${db} --verbose
ID= ${wf:id()}
NAME= ${wf:name()}
APP PATH= ${wf:appPath()}
USER= ${wf:user()}
GROUP= ${wf:group()}
NAMENODE= ${nameNode}
JOBTRACKER = ${jobTracker}
QUEUE = ${queueName}
START DATE = ${start}
error message[${wf:errorMessage(wf:lastErrorNode())}]</body>
</email>
<ok to="fail-job"/>
<error to="fail-email"/>
</action>
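To wire up the redirect, the action doing the import just needs its error transition pointed at that email action. A hedged sketch, where the action name, the next node, and the command are illustrative:
<action name="sqoop-import">
    <sqoop xmlns="uri:oozie:sqoop-action:0.2">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <command>job --exec import-${tableName}-${environment}-${format}-${db}</command>
    </sqoop>
    <ok to="end"/>
    <error to="send-email"/>
</action>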
Notice that the email addresses can be multiple, comma-separated values.
For the email to be sent properly you also need to configure the Oozie email client in the custom oozie-site. The parameters that you might need to configure are the following:
Custom oozie-site
oozie.email.smtp.password
oozie.email.from.address
oozie.email.smtp.auth
oozie.email.smtp.host
oozie.email.smtp.port
oozie.email.smtp.username
oozie.service.ProxyUserService.proxyuser.falcon.groups
oozie.service.ProxyUserService.proxyuser.falcon.hosts
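A hedged example of how a few of those SMTP properties might look in oozie-site.xml (host, port, and addresses are placeholders for your own mail server):
<property>
    <name>oozie.email.smtp.host</name>
    <value>smtp.example.com</value> <!-- placeholder -->
</property>
<property>
    <name>oozie.email.smtp.port</name>
    <value>25</value>
</property>
<property>
    <name>oozie.email.from.address</name>
    <value>oozie@example.com</value> <!-- placeholder -->
</property>
<property>
    <name>oozie.email.smtp.auth</name>
    <value>false</value>
</property>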
As for retries, from Oozie 3.1 onwards you can configure a retry count and retry interval for every action. To achieve this, set the following attributes on the action tag:
<action name="a" retry-max="2" retry-interval="1">
....
</action>
More information is available in Oozie's documentation.
You can find or modify the retry and retry-interval defaults in oozie-default.xml. Generic defaults are specified here.
I'm trying to run a workflow using a coordinator, but when I try to set the workflow and coordinator XML file paths together, I get an error.
This is what my job.properties file looks like:
nameNode=hdfs://10.74.6.155:9000
jobTracker=10.74.6.155:9010
queueName=default
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/examples/apps/test/
oozie.coord.application.path=${nameNode}/user/${user.name}/examples/apps/test/
When I run my workflow from the command line:
bin\oozie job -oozie http://localhost:11000/oozie -config examples\apps\test\job.properties -run
I get the following error:
Error: E0302 : E0302: Invalid parameter [{0}]
What am I doing wrong?
Thanks!
The workflow and coordinator application paths cannot both exist in job.properties at the same time. You can either run a job as a workflow or as a coordinator.
Use only your coordinator path in your properties file and reference your workflow path from coordinator.xml:
oozie.use.system.libpath=true
workflowpath=${nameNode}/user/${user.name}/examples/apps/test/
oozie.coord.application.path=${nameNode}/user/${user.name}/examples/apps/test/
In your coordinator.xml file, add this line:
<app-path>${workflowpath}</app-path>
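For context, that app-path line lives inside the coordinator's action/workflow element; a hedged sketch of that fragment:
<action>
    <workflow>
        <app-path>${workflowpath}</app-path>
    </workflow>
</action>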
I wrote a mapreduce job to extract some info from a dataset. The dataset is users' rating about movies. The number of users is about 250K and the number of movies is about 300k. The output of map is <user, <movie, rating>*> and <movie,<user,rating>*>. In the reducer, I will process these pairs.
When I run the job, the mappers complete as expected, but the reducers always complain that
Task attempt_* failed to report status for 600 seconds.
I know this is because the task fails to report status, so I added a call to context.progress() in my code like this:
int count = 0;
while (values.hasNext()) {
    if (count++ % 100 == 0) {
        context.progress();
    }
    /* other code here */
}
Unfortunately, this does not help; many reduce tasks still fail.
Here is the log:
Task attempt_201104251139_0295_r_000014_1 failed to report status for 600 seconds. Killing!
11/05/03 10:09:09 INFO mapred.JobClient: Task Id : attempt_201104251139_0295_r_000012_1, Status : FAILED
Task attempt_201104251139_0295_r_000012_1 failed to report status for 600 seconds. Killing!
11/05/03 10:09:09 INFO mapred.JobClient: Task Id : attempt_201104251139_0295_r_000006_1, Status : FAILED
Task attempt_201104251139_0295_r_000006_1 failed to report status for 600 seconds. Killing!
By the way, the error happened in the reduce copy phase; the log says:
reduce > copy (28 of 31 at 26.69 MB/s) > :Lost task tracker: tracker_hadoop-56:localhost/127.0.0.1:34385
Thanks for the help.
The easiest way is to set this configuration parameter:
<property>
<name>mapred.task.timeout</name>
<value>1800000</value> <!-- 30 minutes -->
</property>
in mapred-site.xml
Another easy way is to set it in the job Configuration inside your program:
Configuration conf = new Configuration();
long milliSeconds = 1000 * 60 * 60; // 1 hour; the default is 600000 (10 minutes), but you can use any value
conf.setLong("mapred.task.timeout", milliSeconds);
Before setting it, please check the job file (job.xml) in the JobTracker GUI for the correct property name, whether it is mapred.task.timeout or mapreduce.task.timeout, and while the job is running check the job file again to confirm that the property has changed to the value you set.
In newer versions, the name of the parameter has been changed to mapreduce.task.timeout as described in this link (search for task.timeout). In addition, you can also disable this timeout as described in the above link:
The number of milliseconds before a task will be terminated if it
neither reads an input, writes an output, nor updates its status
string. A value of 0 disables the timeout.
Below is an example setting in the mapred-site.xml:
<property>
<name>mapreduce.task.timeout</name>
<value>0</value> <!-- A value of 0 disables the timeout -->
</property>
If you have a Hive query that is timing out, you can set the above configurations in the following way:
set mapred.tasktracker.expiry.interval=1800000;
set mapred.task.timeout= 1800000;
From https://issues.apache.org/jira/browse/HADOOP-1763
The causes might be:
1. TaskTrackers run the maps successfully.
2. Map outputs are served by Jetty servers on the TaskTrackers.
3. All the reduce tasks connect to all the TaskTrackers where maps ran.
4. Since there are lots of reducers wanting to connect to the map output server, the Jetty servers run out of threads (default 40); see the sketch after this list.
5. TaskTrackers continue to send periodic heartbeats to the JobTracker, so they are not marked dead, but their Jetty servers are (temporarily) down.
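If that is indeed what is happening, one mitigation in that generation of Hadoop is to raise the TaskTracker HTTP server thread count. A hedged mapred-site.xml sketch (the property name and default are from Hadoop 1.x; verify against your distribution):
<property>
    <name>tasktracker.http.threads</name>
    <value>80</value> <!-- default is 40 -->
</property>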