Spring Batch: Terminating the current running job

I am having an issue terminating a running Spring Batch job. After going through the Spring documentation, I wrote the following in my code:
Set<Long> executions = jobOperator.getRunningExecutions("Job-Builder");
jobOperator.stop(executions.iterator().next());
The problem I am facing is that the termination sometimes happens as expected and sometimes does not. Every time I call stop on the JobOperator it updates the BATCH_JOB_EXECUTION table. When the termination succeeds, the jobExecution in my batch process is killed and the status is updated to STOPPED. When it fails, the remaining flows of the batch run to completion and the status is updated to FAILED in the BATCH_JOB_EXECUTION table.
But every time I call stop on the job operator I see a message like this in my console:
2020-09-30 18:14:29.780 [http-nio-8081-exec-5] INFO o.s.b.c.l.s.SimpleJobOperator:428 - Aborting job execution: JobExecution: id=33058, version=2, startTime=2020-09-30 18:14:25.79, endTime=null, lastUpdated=2020-09-30 18:14:28.9, status=STOPPING, exitStatus=exitCode=UNKNOWN;exitDescription=, job=[JobInstance: id=32922, version=0, Job=[Job-Builder]], jobParameters=[{date=1601504064263, time=1601504064262, extractType=false, JobId=1601504064262}]
My project has a series of flows and steps within it.
Overall, my batch process looks like this:
The JobBuilderFactory has 3 flows.
Each flow has a step builder and two tasklets.
Each step builder has a partitioner and a chunk-based (chunk size 100) ItemReader, ItemProcessor and ItemWriter.
I am calling the stop method while the very first flow is executing. The overall process takes about 30 minutes to complete, so there are roughly 20-25 minutes left from the time I call the stop method; the chunk size is 100 within every flow, and I am dealing with more than 500k records.
So, my question is: why does the jobExecution stop at times when I call the stop method (which is what I want), and why is it unable to stop the jobExecution the remaining times?
Thanks in advance

So, my question is: why does the jobExecution stop at times when I call the stop method (which is what I want), and why is it unable to stop the jobExecution the remaining times?
It's not easy to figure out the reason for that from what you shared, but I can give you a couple of notes about stopping jobs:
jobOperator.stop does not guarantee that the job stops; it only sends a stop signal to the job execution. From what you shared, you are not checking the returned boolean that indicates whether the signal was successfully sent, so you should do that first.
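For illustration, a minimal sketch of that check (log is a placeholder logger; the catch clause lists the checked exceptions declared by getRunningExecutions and stop):
try {
    Set<Long> executions = jobOperator.getRunningExecutions("Job-Builder");
    Long executionId = executions.iterator().next();
    // stop() only sends the signal; the boolean tells you whether it was actually sent
    boolean stopSignalSent = jobOperator.stop(executionId);
    if (!stopSignalSent) {
        log.warn("Stop signal could not be sent for execution " + executionId);
    }
} catch (NoSuchJobException | NoSuchJobExecutionException | JobExecutionNotRunningException e) {
    log.warn("Could not stop job", e);
}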
You did not share your code, but you need to use StoppableTasklet instead of Tasklet to make sure the stop signal is correctly propagated to your steps.
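A minimal shape for such a tasklet, with doSomeWork standing in for your real logic:
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.StoppableTasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class MyStoppableTasklet implements StoppableTasklet {

    private volatile boolean stopped = false;

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        if (!stopped) {
            doSomeWork(); // placeholder for the tasklet's real logic
        }
        return RepeatStatus.FINISHED;
    }

    @Override
    public void stop() {
        // called by SimpleJobOperator while execute() may still be running
        stopped = true;
    }

    private void doSomeWork() {
    }
}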

Related

Spring integration inboundChannelAdapter stops polling unexpectedly

In our project we need to retrieve prices from a remote FTP server. During office hours this works fine: prices are retrieved and successfully processed. After office hours there are no new prices published on the FTP server, so as expected we don't find anything new.
Our problem is that after a few hours of not finding new prices, the poller just stops polling. There is no error in the logfiles (even when running org.springframework.integration on debug level) and no exceptions. We are now using a separate TaskExecutor to isolate the issue, but the poller still just stops. In the meantime we adjusted the cron expression to match these hours, to limit the resource use, but the poller still just stops when it is supposed to run.
Any help to troubleshoot this issue is very much appreciated!
We use an @InboundChannelAdapter on an FtpStreamingMessageSource, which is configured like this:
@Bean
@InboundChannelAdapter(
    value = FTP_PRICES_INBOUND,
    poller = [Poller(
        maxMessagesPerPoll = "\${ftp.fetch.size}",
        cron = "\${ftp.poll.cron}",
        taskExecutor = "ftpTaskExecutor"
    )],
    autoStartup = "\${ftp.fetch.enabled:false}"
)
fun ftpInboundFlow(
    @Value("\${ftp.remote.prices.dir}") pricesDir: String,
    @Value("\${ftp.remote.prices.file.pattern}") remoteFilePattern: String,
    @Value("\${ftp.fetch.size}") fetchSize: Int,
    @Value("\${ftp.fetch.enabled:false}") fetchEnabled: Boolean,
    clock: Clock,
    remoteFileTemplate: RemoteFileTemplate<FTPFile>,
    priceParseService: PriceParseService,
    ftpFilterOnlyFilesFromMaxDurationAgo: FtpFilterOnlyFilesFromMaxDurationAgo
): FtpStreamingMessageSource {
    val messageSource = FtpStreamingMessageSource(remoteFileTemplate, null)
    messageSource.setRemoteDirectory(pricesDir)
    messageSource.maxFetchSize = fetchSize
    messageSource.setFilter(
        inboundFilters(
            remoteFilePattern,
            ftpFilterOnlyFilesFromMaxDurationAgo
        )
    )
    return messageSource
}
The property values are:
poll.cron: "*/30 * 4-20 * * MON-FRI"
fetch.size: 10
fetch.enabled: true
To limit the resource use we restricted the poll.cron; before that we used to retrieve every minute.
In the related DefaultFtpSessionFactory, the timeouts are set to 60 seconds to override the default value of -1 (which means no timeout at all):
sessionFactory.setDataTimeout(timeOut)
sessionFactory.setConnectTimeout(timeOut)
sessionFactory.setDefaultTimeout(timeOut)
Maybe my answer seems a bit too easy, but is it because your cron expression states that it should schedule the job between 04:00 and 20:00? After 8:00 PM it will not schedule the job anymore, and it will start polling again at 4:00 AM.
It turned out that the processing took longer than the scheduled interval, so a new task was already being executed while the previous one was still processing. Eventually multiple tasks were trying to accomplish the same thing.
We solved this by using a fixedDelay on the poller instead of a fixedRate.
The difference is that fixedRate schedules on a regular interval regardless of whether the previous task has finished, while fixedDelay schedules the next run a fixed delay after the task finishes.
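An illustrative sketch of that change (in Java for brevity; FTP_PRICES_INBOUND, ftpTaskExecutor and the remote directory stand in for the question's own configuration):
import org.apache.commons.net.ftp.FTPFile;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.annotation.InboundChannelAdapter;
import org.springframework.integration.annotation.Poller;
import org.springframework.integration.file.remote.RemoteFileTemplate;
import org.springframework.integration.ftp.inbound.FtpStreamingMessageSource;

@Configuration
public class FtpPollingConfig {

    @Bean
    @InboundChannelAdapter(
            value = FTP_PRICES_INBOUND, // channel name constant from the question
            poller = @Poller(
                    maxMessagesPerPoll = "${ftp.fetch.size}",
                    fixedDelay = "30000", // next poll starts 30s AFTER the previous one completes
                    taskExecutor = "ftpTaskExecutor"))
    public FtpStreamingMessageSource ftpInboundFlow(RemoteFileTemplate<FTPFile> remoteFileTemplate) {
        FtpStreamingMessageSource messageSource = new FtpStreamingMessageSource(remoteFileTemplate, null);
        messageSource.setRemoteDirectory("/prices"); // placeholder for ${ftp.remote.prices.dir}
        return messageSource;
    }
}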

KafkaConsumer poll() behavior understanding

Trying to understand (I am new to Kafka) how the poll event loop in Kafka works.
Use case: 25 records on the topic, max poll size is set to 5.
max.poll.interval.ms = 5000 // 5 seconds
max.poll.records = 5
Sequence of tasks:
Poll the records from the topic.
Process the records in a for loop.
Some processing logic that would either pass or fail.
If the logic passes, the record (with its offset) is added to a map.
Then it is committed using a commitSync call.
If it fails, the loop breaks and whatever succeeded before this is committed. The problem starts after this.
The next poll just keeps moving in batches of 5 even after the error. Is this expected?
What we basically expect is that the loop breaks, the offsets up to the successfully processed message get committed, and the next poll continues from the failed message.
Example: in the first poll, 5 messages are polled; offsets 1 and 2 succeed and are committed, then the 3rd fails. Yet the poll call keeps moving to the next batches (5-10, 10-15). If there is an error in between, we expect it to stop at that point: poll should start again from offset 3 in the first case, or, if it fails at offset 8 in the 2nd batch, the next poll should start from offset 8, not from the next max-poll batch. If it matters: this is a Spring Boot project and enable.auto.commit is false.
I have tried finding this in the documentation, but no help.
I tried tweaking max.poll.interval.ms, but that did not help either.
EDIT: I did not accept the answer because there is no direct solution for a custom consumer. Keeping this for informational purposes.
max.poll.interval.ms is milliseconds, not seconds, so it should be 5000.
Once the records have been returned by the poll (and offsets not committed), they won't be returned again unless you restart the consumer or perform seek() operations on the consumer to reset the offset to the unprocessed ones.
The Spring for Apache Kafka project provides a SeekToCurrentErrorHandler to perform this task for you.
If you are using the consumer yourself (which it sounds like), you must do the seeks.
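For reference, if you were using Spring's listener containers, wiring that error handler in a Spring Kafka 2.x application looks roughly like this (bean names and generic types are illustrative):
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.SeekToCurrentErrorHandler;

@Configuration
public class KafkaConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        // on a listener error, seek the unprocessed partitions back so the
        // failed record is redelivered on the next poll instead of skipped
        factory.setErrorHandler(new SeekToCurrentErrorHandler());
        return factory;
    }
}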
You can manually seek to the beginning offset of the poll for all the assigned partitions on failure. I am not sure how to do this with the Spring consumer.
Sample code for seeking the offset back to the beginning for a plain consumer:
In the code below I am getting the records list per partition and then getting the offset of the first record to seek to.
import scala.collection.JavaConverters._

def seekBack(records: ConsumerRecords[String, String]): Unit = {
  // for each partition in this poll, seek back to the offset of its first record
  records.partitions().asScala.foreach { partition =>
    val partitionedRecords = records.records(partition)
    val offset = partitionedRecords.get(0).offset()
    consumer.seek(partition, offset)
  }
}
One caveat: doing this unconditionally in production is a bad idea. You only want to seek back when you hit a transient error; otherwise you will end up retrying infinitely.

Spring @Scheduled, how to run once at a time (not concurrently)

I have a method annotated with @Scheduled and a cron of */15 * * * * ? (run every 15 seconds).
Sometimes this process takes more than 15 seconds to run.
Is there any way to avoid the call of the @Scheduled method if it's already running?
My current workaround is a flag field in the class that signals whether the process is running; if it is set, the code exits before executing the main logic.
I think that's already the case: if the first job hasn't finished, the second will not start.
See :
How to prevent overlapping schedules in Spring?
If it isn't working, you can also use an AtomicBoolean to check if you must start the process or not.
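A minimal sketch of that guard, with doWork standing in for the actual task:
import java.util.concurrent.atomic.AtomicBoolean;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class GuardedTask {

    private final AtomicBoolean running = new AtomicBoolean(false);

    @Scheduled(cron = "*/15 * * * * ?")
    public void runIfIdle() {
        // compareAndSet wins only if no other invocation is in flight
        if (!running.compareAndSet(false, true)) {
            return; // previous run still busy: skip this trigger
        }
        try {
            doWork();
        } finally {
            running.set(false);
        }
    }

    private void doWork() {
        // long-running work goes here
    }
}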

Hive job gets killed and query execute() remains hanging

I am using hive-jdbc-0.7.1-cdh3u5.jar. I have some memory-intensive queries running on EMR which occasionally fail. When I look at the job tracker I see that the query has been killed and I see the following error:
java.io.IOException: Task process exit with nonzero status of 137
However, the Hive JDBC driver execute() call does not detect this; instead it is left hanging. No exception is caught. Any ideas? Thanks.
ST stQuery = MY_QUERY;
try {
    Statement stmt = conn.createStatement();
    stmt.execute(stQuery.render()); // Hangs here without knowing that the job has been killed. No exception is raised.
} catch (SQLException sqle) {
    sqle.printStackTrace();
    log.error("Failed to run query");
    return;
}
It is perhaps due to the fact that Hadoop will kill the task after 10 minutes (600 seconds) if it doesn't get a response; by setting the parameter mapred.task.timeout=0 we can avoid killing tasks which run for more than 10 minutes.
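If you only want this for the session issuing the query (and assuming your cluster allows overriding the parameter per session), you can try setting it over the same JDBC connection before running the query:
Statement stmt = conn.createStatement();     // same java.sql.Connection as the query
stmt.execute("SET mapred.task.timeout=0");   // 0 disables the per-task timeout for this session
stmt.execute(stQuery.render());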
Also, in these cases one can write the mapper/reducer in such a way as to report progress on a regular basis (more frequently than every 10 minutes). This may be achieved in a number of ways, as the sketch after this list shows:
Call setStatus() on Reporter to set a human-readable description of the task's progress
Call incrCounter() on Reporter to increment a user counter
Call progress() on Reporter to tell Hadoop that your task is still there (and making progress)
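For example, a sketch using the old org.apache.hadoop.mapred API (split, process and the Counters enum are placeholders for your own code):
// inside a class that implements org.apache.hadoop.mapred.Mapper<LongWritable, Text, Text, Text>
public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    for (Text chunk : split(value)) {                  // placeholder: do the expensive work in pieces
        process(chunk, output);                        // placeholder
        reporter.progress();                           // heartbeat: resets the 10-minute task timeout
        reporter.setStatus("processing key " + key);   // human-readable progress
        reporter.incrCounter(Counters.CHUNKS_DONE, 1); // placeholder user counter
    }
}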

Quartz.NET: Need CronTrigger on an IStatefulJob instance to *delay instead of skip* if running job while schedule matures

Greetings, your friendly neighborhood Quartz.NET n00b is back!
I have a Windows Service running IStatefulJob instances on a Quartz.NET CronTrigger-based schedule... The CRON string used to schedule the job: "0 0/1 * * * ? *"
Everything works great. However, if I have a job that is set to run, say, at the X:00 mark of every minute, and that job happens to run for MORE than a minute, I notice that the subsequent job runs IMMEDIATELY after the first finishes executing, rather than waiting until its next scheduled run, effectively "queuing" up instead of merely skipping to the next scheduled run.
I set a CronTrigger MisfireInstruction of DONOTHING on the trigger, but the exact same thing happens when a job overruns its next scheduled execution.
How do I get an IStatefulJob instance to merely SKIP a scheduled execution trigger if it is currently running, rather than have it delayed until the first execution completes?
I explicitly set the trigger.MisfireInstruction = MisfireInstruction.CronTrigger.DoNothing;
...But instead of "doing nothing", for a job scheduled to run every minute that takes 90 seconds to complete, I experience the following execution log:
Job runs at 9:00:00am, finishes at 9:01:30am <- job runs for 1:30
Job runs at 9:01:30am, finishes at 9:03:00am <- subsequent job that should have run at 9:01:00
Job runs at 9:04:00am, finishes at 9:05:30am <- shouldn't this one have run at 9:03:00?
Job runs at 9:05:30am, finishes at 9:07:00am <- subsequent job that should have run at 9:05:00
Job runs at 9:08:00am, finishes at 9:09:30am <- shouldn't this have run at 9:07:00?
... it seems like it runs correctly the first time, on the minute... delays for 30 seconds as the 90-second job execution time expires, and then, instead of waiting till the NEXT full minute, EXECUTES IMMEDIATELY at the 30-second mark... Doubly odd is that it then finishes the SECOND job on the minute mark, but waits till the NEXT minute mark to execute instead of running back-2-back...
Pretty much seems like it works correctly EVERY OTHER RUN, when it is not running on the :30 marks...
What's the best way to get a job not to delay/queue, but to just SKIP until it is idle and the next schedule matures?
EDIT: I tried going back to IJob instead of IStatefulJob using the same DONOTHING trigger misfire instruction, but the job executes EVERY MINUTE despite the prior execution still being active. I can't seem to get it to skip a scheduled run if it is currently running with either IJob or IStatefulJob...
EDIT #2: I think that my triggers are NEVER misfiring, which is why DoNothing as a misfire instruction is useless... Given that's the case, I guess I need another mechanism to detect whether a job instance of a schedule is running, to ensure the job SKIPS its next execution until its following scheduled time rather than delaying it until the first instance completes...
EDIT #3: I tried adding an element called "IsRunning" to the IStatefulJob JobDataMap... I set it to TRUE when the execute sequence starts and return it to false after job completion. Before executing, the job checks the element, which is persisted between runs, and prematurely quits the execution (logging "JOB SKIPPED!") if it detects it to be true... This unfortunately doesn't work, for probably obvious reasons: if the jobs run following the bulleted schedule above, the job is never SIMULTANEOUSLY running alongside itself, as each run is delayed until the previous job ends, so this check is useless. According to the documentation, returning to IJob from IStatefulJob would not help here, as the JobDataMap is only persisted between runs in the stateful job type...
I still haven't solved how to SKIP a scheduled job instead of delaying it until its current iteration completes... If anyone has ideas, you're a lifesaver! :)
It should be caused by the misfireThreshold of RAMJobStore (http://quartznet.sourceforge.net/apidoc/topic2722.html):
The time span by which a trigger must have missed its next-fire-time, in order for it to be considered "misfired" and thus have its misfire instruction applied.
It is 60 seconds by default, so a job isn't considered "misfired" until it is late by more than the misfireThreshold value.
To resolve the problem, just decrease this threshold (the code below sets it to 1 ms):
var properties = new NameValueCollection();
properties["quartz.jobStore.misfireThreshold"] = "1";
schedulerFactory = new StdSchedulerFactory(properties);
It should resolve the issue.
