Spring Batch doesn't update Job repository after force termination


I'm using a Spring Boot 2.4.x app with Spring Batch 4.3.x. I've created a simple job with a FlatFileItemReader that reads from a CSV file and an ImportKafkaItemWriter that writes to a Kafka topic, combined in a single step. I'm using SimpleJobLauncher, and I've set a ThreadPoolTaskExecutor as the TaskExecutor of the JobLauncher. It works as I expected. However, I have one resilience use case: if I kill the app, restart it, and trigger the job again, it should carry on and finish the remaining work. Unfortunately, that is not happening. I investigated further and found that when I forcibly close the app, the key Spring Batch job repository tables look like this:
job_execution_id:           1
version:                    1
job_instance_id:            1
create_time:                2021-06-16 09:32:43
start_time:                 2021-06-16 09:32:43
end_time:                   (empty)
status:                     STARTED
exit_code:                  UNKNOWN
exit_message:               (empty)
last_updated:               2021-06-16 09:32:43
job_configuration_location: (empty)
and
step_execution_id:  1
version:            4
step_name:          productImportStep
job_execution_id:   1
start_time:         2021-06-16 09:32:43
end_time:           (empty)
status:             STARTED
commit_count:       3
read_count:         6
filter_count:       0
write_count:        6
read_skip_count:    0
write_skip_count:   0
process_skip_count: 0
rollback_count:     0
exit_code:          EXECUTING
exit_message:       (empty)
last_updated:       2021-06-16 09:32:50
If I manually update these tables, setting a valid end_time and changing the status to FAILED, then I can restart the job and it works absolutely fine. What do I need to do so that Spring Batch updates the relevant repository tables appropriately and I can avoid these manual steps? I can provide more information about the code if needed.

"If I manually update these tables where I set a valid end_time and status to FAILED then I can restart the job and works absolutely fine. May I know what I need to do so that Spring Batch can update those relevant repositories appropriately and I can avoid this manual steps"
When a job is killed abruptly, Spring Batch doesn't get a chance to update its status in the job repository, so the status is stuck at STARTED. When the job is restarted, the only information Spring Batch has is the status in the job repository. By just looking at the status in the database, Spring Batch cannot distinguish between a job that is actually running and a job that has been killed abruptly (in both cases, the status is STARTED).
The way to go is indeed to manually update the tables, marking the status either as FAILED so that the job can be restarted, or as ABANDONED to abandon it. This is a business decision that you have to make, and there is no way to automate it on the framework side. For more details, please refer to the reference documentation here: Aborting a Job.
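For illustration, here is a minimal JDBC sketch of that manual update. It assumes the default Spring Batch schema (table prefix BATCH_) and that you have already confirmed the process is really dead. The class and method names are hypothetical and not part of Spring Batch. Mark the execution FAILED only if the job is restartable and its restart data is valid; a similar update with ABANDONED would apply if it should never be restarted.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;

// Hypothetical helper (not part of Spring Batch): marks a job execution that was
// left in STARTED by a killed JVM as FAILED, so that it can be restarted.
class StuckExecutionFixer {

    // Assumption: default Spring Batch table names (BATCH_ prefix).
    static final String FAIL_STEPS_SQL =
            "UPDATE BATCH_STEP_EXECUTION "
            + "SET STATUS = 'FAILED', EXIT_CODE = 'FAILED', END_TIME = ?, LAST_UPDATED = ? "
            + "WHERE JOB_EXECUTION_ID = ? AND END_TIME IS NULL";

    static final String FAIL_JOB_SQL =
            "UPDATE BATCH_JOB_EXECUTION "
            + "SET STATUS = 'FAILED', EXIT_CODE = 'FAILED', END_TIME = ?, LAST_UPDATED = ? "
            + "WHERE JOB_EXECUTION_ID = ? AND END_TIME IS NULL";

    // Updates the unfinished steps and the job execution in one transaction.
    // Returns the total number of rows changed.
    static int markFailed(Connection connection, long jobExecutionId) throws SQLException {
        Timestamp now = Timestamp.from(Instant.now());
        boolean previousAutoCommit = connection.getAutoCommit();
        connection.setAutoCommit(false);
        try {
            int rows = update(connection, FAIL_STEPS_SQL, now, jobExecutionId)
                    + update(connection, FAIL_JOB_SQL, now, jobExecutionId);
            connection.commit();
            return rows;
        } catch (SQLException e) {
            connection.rollback();
            throw e;
        } finally {
            connection.setAutoCommit(previousAutoCommit);
        }
    }

    private static int update(Connection connection, String sql, Timestamp now, long id)
            throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setTimestamp(1, now); // END_TIME
            statement.setTimestamp(2, now); // LAST_UPDATED
            statement.setLong(3, id);       // JOB_EXECUTION_ID
            return statement.executeUpdate();
        }
    }
}
```

With the data shown in the question, StuckExecutionFixer.markFailed(connection, 1L) would touch one row in each table. The END_TIME IS NULL condition is there so that steps which already finished are left alone.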

You can add a dummy parameter, for example a Version counter that you increment for every new job execution, so that you don't have to touch the job tables in the database. Note that this starts a new job instance each time rather than resuming the interrupted one.
What I mean is: build with mvn clean package,
then launch the program like this:
java -jar my-jarfile.jar dest=/tmp/foo Version="0"
java -jar my-jarfile.jar dest=/tmp/foo Version="1"
java -jar my-jarfile.jar dest=/tmp/foo Version="2"
and so on. Alternatively, you can launch the job programmatically via the JobLauncher with JobParameters and add a date parameter, date = new Date().toString(), which gives a new timestamp on every new job execution.

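You can use a JVM shutdown hook. Keep in mind that shutdown hooks run on an orderly shutdown (for example SIGTERM), but not when the process is killed with kill -9.
Something like this:
// Assumes jobExecution and jobRepository are in scope, e.g. captured when the job starts.
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    if (jobExecution.isRunning()) {
        // Record the interruption so that the execution can be restarted later
        jobExecution.setEndTime(new Date());
        jobExecution.setStatus(BatchStatus.FAILED);
        jobExecution.setExitStatus(ExitStatus.FAILED);
        jobRepository.update(jobExecution);
    }
}));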

Related

Flyway not running migration scripts

I am running gradle flywayMigrate and getting this output that doesn't show any errors although it is not running my migration scripts:
Database: jdbc:mysql://localhost:3306 (MySQL 8.0)
Successfully validated 1 migration (execution time 00:00.006s)
Current version of schema `userdb`: null
Schema `userdb` is up to date. No migration necessary.
:flywayMigrate (Thread[Daemon worker Thread 3,5,main]) completed. Took 1.025 secs.
My configuration in Gradle is as follows:
flyway {
    url = 'jdbc:mysql://localhost:3306?&serverTimezone=UTC'
    user = 'root'
    password = 'password'
    schemas = ['userdb']
    locations = ['filesystem:src/main/resources/db/migration/']
}
and my script is in F:......\src\main\resources\db\migration\v1__Create_user_table.sql:
create table USERS (
    ID int not null,
    NAME varchar(100) not null
);
I can't figure out why it is not carrying out the migration. It did, however, create the Flyway history table.

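I realized what the problem was. I had to capitalize the 'v' in my script name, renaming "v1__Create_user_table.sql" to "V1__Create_user_table.sql". Flyway's default prefix for versioned migrations is an uppercase V. An amazing waste of time spent debugging.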
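To catch this kind of naming mistake early, a small check like the one below can be run over the migration folder. It is only a sketch: the regular expression is a simplified approximation of Flyway's default naming convention for versioned migrations (prefix V, a version, a double underscore, a description, and the .sql suffix), not Flyway's own parser, and the class name is hypothetical.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: approximates Flyway's default naming rule for
// versioned migrations, e.g. V1__Create_user_table.sql or V1.1__Add_index.sql.
class MigrationNameCheck {

    // Uppercase "V", a version made of digits separated by "." or "_",
    // the "__" separator, a non-empty description, and the ".sql" suffix.
    private static final Pattern VERSIONED =
            Pattern.compile("V\\d+(?:[._]\\d+)*__.+\\.sql");

    static boolean isVersionedMigration(String fileName) {
        return VERSIONED.matcher(fileName).matches();
    }

    public static void main(String[] args) {
        // The file name from the question and its corrected form
        for (String name : List.of("v1__Create_user_table.sql", "V1__Create_user_table.sql")) {
            System.out.println(name + " -> " + isVersionedMigration(name));
        }
    }
}
```

The match is deliberately case-sensitive, so the lowercase v1__ name from the question is rejected while the renamed file is accepted. If you have changed Flyway's prefix, separator, or suffix settings, the pattern has to be adjusted to match.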

SCDF: Restart and resume a composed task

The SCDF Composed Task Runner gives us the option to turn on --increment-instance-enabled. This option creates an artificial run.id parameter, which increments for every run. Each run is therefore unique for Spring Batch, and the task will start again.
The problem with the IdIncrementer arises when I mix executions with and without it. When a task does not finish, I want to resume it. The problem I encountered is that once the task finishes without the IdIncrementer, I cannot start the task again with the IdIncrementer.
I was wondering what would be the best way to restart with the option to resume?
My idea would be to create a new IdResumer, which uses the same run.id as the last execution.
We are running SCDF 2.2.1 on OpenShift v3.11.98 and we use CTR 2.1.1.
The steps to reproduce this:
1. Create a new SCDF Task Definition with the following definition: dummy1:dummy && dummy2: dummy && dummy3: dummy. The dummy app is a Docker container that fails randomly with a 50% chance.
2. Execute the SCDF Task with --increment-instance-enabled=true and wait for one of the dummy tasks to fail (restart if needed).
3. To resume the same failed execution, execute the SCDF Task with --increment-instance-enabled=false and let it finish successfully (redo if needed).
4. Start the SCDF Task again with --increment-instance-enabled=true.
At step 4 the composed task throws the JobInstanceAlreadyCompleteException, even though --increment-instance-enabled is enabled again.
Caused by:
org.springframework.batch.core.repository.JobInstanceAlreadyCompleteException:
A job instance already exists and is complete for
parameters={-spring.cloud.data.flow.taskappname=composed-task-runner,
-spring.cloud.task.executionid=3190, -spring.datasource.username=testuser, -graph=aaa-stackoverflow-dummy2 && aaa-stackoverflow-dummy3, -spring.cloud.data.flow.platformname=default, -spring.datasource.url=jdbc:postgresql://10.10.10.10:5432/tms_efa?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory&currentSchema=dev,
-spring.datasource.driverClassName=org.postgresql.Driver, -spring.datasource.password=pass1234, -spring.cloud.task.name=aaa-stackoverflow, -dataflowServerUri=https://scdf-dev.company.com:443/ , -increment-instance-enabled=true}. If you want to run this job again, change the parameters.
Is there a better way to resume and restart the task?

Spring Batch Integration job instance already exists on start up

I am using Spring Batch Integration to poll for a file and process it, and I am looking for some guidance on the job parameters aspect of it. I am using the following to turn a file into a job launch request:
@Transformer
public JobLaunchRequest toRequest(Message<File> message) {
    JobParametersBuilder jobParametersBuilder = new JobParametersBuilder();
    jobParametersBuilder.addString(fileParameterName,
            message.getPayload().getAbsolutePath());
    // The current time makes the parameters different for every launch
    jobParametersBuilder.addLong("time", new Date().getTime());
    return new JobLaunchRequest(job, jobParametersBuilder.toJobParameters());
}
When the application starts for the first time, there is only one parameter, run.id. If I add a file to the directory that the file poller is watching, it creates two parameters in the database: fileParameterName and time. If I start the application again, it reuses the previous values of fileParameterName and time and adds a new run.id. The message on the initial start-up is:
Job: ... launched with the following parameters: [{run.id=1}]
If I add a file my application handles the file correctly:
Job: ... launched with the following parameters:[{input.file.name=C:\Temp\test.csv, time=1472051531556}]
but if I stop and start the application again I get the following message:
Job: ... launched with the following parameters: [{time=1472051531556, run.id=1, input.file.name=C:\Temp\test.csv}]
My question is: why does it look at the previous parameters on this start-up? Is there a way to add the current time as a parameter on start-up instead of the previous time, so that I don't get "A job instance already exists and is complete for parameters={}"? Or a way to stop the jobs from running on start-up?
Also, if the application is running and I add a file, it enters the toRequest method, but it does not do so on start-up.
Any help would be great.
Thanks

We should pass a 'run.id' parameter set to the current timestamp when we kick off the Spring Batch job. This is how we launch a Spring Batch job from a shell script:
RUN_ID=$(date +"%Y-%m-%d %H:%M:%S")
JOB_PARAMS="filename=XXX"
# Assumption: CLASSPATH already lists the application and Spring Batch jars.
"$JAVA_HOME/bin/java" -cp "$CLASSPATH" \
    org.springframework.batch.core.launch.support.CommandLineJobRunner \
    springbatch_XXX.xml SpringBatchJob run.id="$RUN_ID" ${JOB_PARAMS}

Informatica error 1417 :: Task not yet registered with this service process

I am getting the following error while running a workflow in Informatica.
Session task instance [worklet.session] : [TM_6775 The master DTM process was unable to connect to the master service process to update the session status with the following message: error message [ERROR: The session run for [Session task instance [worklet.session]] and [ folder id = 206, workflow id = 16042, workflow run id = 65095209, worklet run id = 65095337, task instance id = 13272 ] is not yet registered with this service process.] and error code [1417].]
This error occurs randomly for many other sessions when they are run through the workflow as a whole. However, if I use "start task" on the failed task afterwards, it runs successfully.
Any help is much appreciated.

Just an idea to try if you use versioning: check that everything is checked in correctly. If the mapping, workflow, or worklet is checked out, then you and Informatica will run different versions, which may cause the behaviour to differ when you start it manually.
Informatica will always use the checked-in version, and you will always use the checked-out version.

How to fix "Task attempt_201104251139_0295_r_000006_0 failed to report status for 600 seconds."

I wrote a MapReduce job to extract some information from a dataset of users' ratings of movies. There are about 250K users and about 300K movies. The output of the map phase is <user, <movie, rating>*> and <movie, <user, rating>*>. In the reducer, I process these pairs.
When I run the job, the mappers complete as expected, but the reducers always complain that
Task attempt_* failed to report status for 600 seconds.
I know this is caused by the task failing to update its status, so I added a call to context.progress() in my code like this:
int count = 0;
while (values.hasNext()) {
    // Report progress every 100 values
    if (count++ % 100 == 0) {
        context.progress();
    }
    /*other code here*/
}
Unfortunately, this does not help. Many reduce tasks still fail.
Here is the log:
Task attempt_201104251139_0295_r_000014_1 failed to report status for 600 seconds. Killing!
11/05/03 10:09:09 INFO mapred.JobClient: Task Id : attempt_201104251139_0295_r_000012_1, Status : FAILED
Task attempt_201104251139_0295_r_000012_1 failed to report status for 600 seconds. Killing!
11/05/03 10:09:09 INFO mapred.JobClient: Task Id : attempt_201104251139_0295_r_000006_1, Status : FAILED
Task attempt_201104251139_0295_r_000006_1 failed to report status for 600 seconds. Killing!
BTW, the error happened in the copy phase of the reduce, and the log says:
reduce > copy (28 of 31 at 26.69 MB/s) > :Lost task tracker: tracker_hadoop-56:localhost/127.0.0.1:34385
Thanks for the help.

The easiest way is to set this configuration parameter in mapred-site.xml:
<property>
    <name>mapred.task.timeout</name>
    <value>1800000</value> <!-- 30 minutes -->
</property>

Another easy way is to set it in your job configuration inside the program:
Configuration conf = new Configuration();
long milliSeconds = 1000 * 60 * 60; // 1 hour; the default is 600000 (10 minutes), any value can be used
conf.setLong("mapred.task.timeout", milliSeconds);
Before setting it, check the job file (job.xml) in the JobTracker GUI for the correct property name, i.e. whether it is mapred.task.timeout or mapreduce.task.timeout.
While the job is running, check the job file again to confirm that the property has changed to the value you set.

In newer versions, the name of the parameter has been changed to mapreduce.task.timeout as described in this link (search for task.timeout). In addition, you can also disable this timeout as described in the above link:
The number of milliseconds before a task will be terminated if it
neither reads an input, writes an output, nor updates its status
string. A value of 0 disables the timeout.
Below is an example setting in the mapred-site.xml:
<property>
    <name>mapreduce.task.timeout</name>
    <value>0</value> <!-- A value of 0 disables the timeout -->
</property>

If you have a Hive query that is timing out, you can set the above configuration in the following way:
set mapred.tasktracker.expiry.interval=1800000;
set mapred.task.timeout=1800000;

From https://issues.apache.org/jira/browse/HADOOP-1763, the cause might be:
1. TaskTrackers run the maps successfully.
2. Map outputs are served by Jetty servers on the TTs.
3. All the reduce tasks connect to all the TTs where maps were run.
4. Since there are lots of reduces wanting to connect to the map output server, the Jetty servers run out of threads (default 40).
5. TaskTrackers continue to send periodic heartbeats to the JT, so they are not considered dead, but their Jetty servers are (temporarily) down.
