Spring Batch - Job remains blocked in STARTED status

I have a job with a single chunk-oriented step executed in parallel across 8 partitions:
Reader: JdbcCursorItemReader
Processor: uses JdbcTemplate to call the database (one thread per partition)
Writer: writes to a file
I use a JdbcCursorItemReader to read millions of rows from a shared PostgreSQL database (v9.2) that other users are querying at the same time.
Spring Batch version: 3.0.6
The problem is that the job and its steps remain blocked in STARTED status after a few hours of execution, without any errors in the logs.
Before the block everything runs normally; after the block the pg_stat_activity table is empty (I think the processor threads were killed) and the job status is still STARTED.
Does anyone have any idea why the job and its parallel steps stay blocked in STARTED status?
Thank you for your help.

Job configuration
My problem is that the job stays blocked in STARTED status, because this causes problems when I try to start a new execution.

Related

Spring Batch: Terminating the current running job

I am having an issue terminating a currently running Spring Batch job. I wrote
Set<Long> executions = jobOperator.getRunningExecutions("Job-Builder");
jobOperator.stop(executions.iterator().next());
in my code after going through the Spring documentation.
The problem I am facing is that sometimes the termination of the job happens as expected and other times it does not. Every time I call stop on the JobOperator it updates the BATCH_JOB_EXECUTION table. When the termination succeeds, the status of the job is updated to STOPPED and the jobExecution in my batch process is killed. When it fails, the batch completes the rest of its flows and the status is updated to FAILED in the BATCH_JOB_EXECUTION table.
But every time I call stop on the JobOperator I see a message like this in my console:
2020-09-30 18:14:29.780 [http-nio-8081-exec-5] INFO o.s.b.c.l.s.SimpleJobOperator:428 - Aborting job execution: JobExecution: id=33058, version=2, startTime=2020-09-30 18:14:25.79, endTime=null, lastUpdated=2020-09-30 18:14:28.9, status=STOPPING, exitStatus=exitCode=UNKNOWN;exitDescription=, job=[JobInstance: id=32922, version=0, Job=[Job-Builder]], jobParameters=[{date=1601504064263, time=1601504064262, extractType=false, JobId=1601504064262}]
My project has a series of flows and steps within it.
Overall, my batch process looks like this:
The job built with JobBuilderFactory has 3 flows.
Each flow has a step builder and two tasklets.
Each step builder has a partitioner and a chunk-oriented (chunk size 100) ItemReader, ItemProcessor and ItemWriter.
I call the stop method while the very first flow is executing. The overall process takes about 30 minutes to complete, so there are still around 20-25 minutes left from the time I call stop; the chunk size is 100 within every flow and I am dealing with more than 500k records.
So, my question is: why does the jobExecution stop sometimes when I call the stop method (which is what I want), and why is it unable to stop the jobExecution the other times?
Thanks in advance
So, my question is: why does the jobExecution stop sometimes when I call the stop method (which is what I want), and why is it unable to stop the jobExecution the other times?
It's not easy to figure out the reason for that from what you shared, but I can give you a couple of notes about stopping jobs:
jobOperator.stop does not guarantee that the job stops; it only sends a stop signal to the job execution. From what you shared, you are not checking the returned boolean that indicates whether the signal was sent correctly, so you should do that first.
You did not share your code, but you need to use StoppableTasklet instead of Tasklet to make sure the stop signal is correctly sent to your steps.
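A minimal sketch of both points, reusing the "Job-Builder" job name from the question; the class names and the hasMoreWork/processNextUnit placeholders are made up for illustration:
import java.util.Set;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.launch.JobOperator;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.StoppableTasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class JobStopper {

    private final JobOperator jobOperator;

    public JobStopper(JobOperator jobOperator) {
        this.jobOperator = jobOperator;
    }

    public void stopRunningJob() throws Exception {
        Set<Long> executions = jobOperator.getRunningExecutions("Job-Builder");
        if (!executions.isEmpty()) {
            // stop() only sends a signal; the boolean says whether it was accepted.
            boolean signalSent = jobOperator.stop(executions.iterator().next());
            if (!signalSent) {
                // log it and investigate why the execution could not be signalled
            }
        }
    }
}

// A tasklet that honours the stop signal between units of work.
class MyStoppableTasklet implements StoppableTasklet {

    private volatile boolean stopped = false;

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        // Process work in small units and re-check the flag between them,
        // so the step can finish promptly once stop() is called.
        while (!stopped && hasMoreWork()) {
            processNextUnit();
        }
        return RepeatStatus.FINISHED;
    }

    @Override
    public void stop() {
        stopped = true;
    }

    private boolean hasMoreWork() { return false; } // placeholder
    private void processNextUnit() { }              // placeholder
}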

Spring Batch: alert with Grafana & Prometheus if a job failed in the last xx minutes

I am using spring batch (4.2.2.RELEASE) together with the spring actuator (2.2.6 RELEASE). Since version 4.2, spring batch provides support for batch monitoring and metrics based on micrometer (https://docs.spring.io/spring-batch/docs/4.2.x/reference/html/monitoring-and-metrics.html).
For example, with the metric named spring_batch_job I am able to see how often a job was executed, its status and its duration.
I want to monitor this metric with Grafana & Prometheus and alert if a job failed in the last xx minutes.
If the Spring Batch application runs as a service, it seems to sum up all the metrics until the service is stopped. For example, if a job was started 12 times in the last hour, the metrics output could be the following:
spring_batch_job_seconds_count{name="mainJob",status="COMPLETED",} 10.0
spring_batch_job_seconds_sum{name="mainJob",status="COMPLETED",} 354.354538083
spring_batch_job_seconds_count{name="mainJob",status="FAILED",} 2.0
spring_batch_job_seconds_sum{name="mainJob",status="FAILED",} 0.880157862
So two instances of the mainJob failed. Assumed in the next hour all 12 jobs will be successful, the metrics output would be:
spring_batch_job_seconds_count{name="mainJob",status="COMPLETED",} 22.0
spring_batch_job_seconds_sum{name="mainJob",status="COMPLETED",} 708.704538083
spring_batch_job_seconds_count{name="mainJob",status="FAILED",} 2.0
spring_batch_job_seconds_sum{name="mainJob",status="FAILED",} 0.880157862
How can I check whether a job failed in the last xx minutes? The following expression would still return the two failed job instances: spring_batch_job_seconds_count{status="FAILED"}[15m]
I'm not familiar with Prometheus QL but I will try to help.
What you can do is to calculate the difference of this counter between the last hour and the hour before. If you see an increase in the number of failed instances, then at least one instance has failed and you can raise an alert. Otherwise, no job has failed in the previous hour.
Prometheus provides the increase function that is designed specifically for that. So you should be able to answer your question and raise an alert when:
increase(spring_batch_job_seconds_count{name="mainJob",status="FAILED"}[15m]) > 0
As I said, I'm not an expert in Prometheus, so I will let you check the syntax, but that's the idea.

kafka spark streaming job with many active jobs

I am running into a "many active jobs" issue when using direct Kafka streaming on YARN (Spark 1.5, Hadoop 2.6, CDH 5.5.1).
The problem happens when Kafka has almost NO traffic.
From the application UI, I see many 'active' jobs that keep running for hours, until eventually the driver reports "Requesting 4 new executors because tasks are backlogged".
But when I look at the driver log of one of these 'active' jobs, the log says the job has finished. So why does the application UI show this job as active seemingly forever?
Thanks!
Here is the related log output for one of the 'active' jobs.
There are two stages: a reduceByKey that follows a flatMap. The log says both stages finish in about 20 ms and the job itself finishes in 64 ms.
Got job 6567
Final stage: ResultStage 9851(foreachRDD at
Parents of final stage: List(ShuffleMapStage 9850)
Missing parents: List(ShuffleMapStage 9850)
…
Finished task 0.0 in stage 9850.0 (TID 29551) in 20 ms
Removed TaskSet 9850.0, whose tasks have all completed, from pool
ShuffleMapStage 9850 (flatMap at OpaTransLogAnalyzeWithShuffle.scala:83) finished in 0.022 s
…
Submitting ResultStage 9851 (ShuffledRDD[16419] at reduceByKey at OpaTransLogAnalyzeWithShuffle.scala:83), which is now runnable
…
ResultStage 9851 (foreachRDD at OpaTransLogAnalyzeWithShuffle.scala:84) finished in 0.023 s
Job 6567 finished: foreachRDD at OpaTransLogAnalyzeWithShuffle.scala:84, took 0.064372 s
Finished job streaming job 1468592373000 ms.1 from job set of time 1468592373000 ms
I am facing a similar issue. Mine is a Spark Streaming application where the only action is a write to a Cassandra table, and this write fails due to an SSL authentication problem. Ideally such batches should be shown as failed in the Streaming tab, but they remain in the active state forever; inside the batch the jobs complete successfully, yet ideally the batch should have been marked as failed.

Run @Scheduled task only on one WebLogic cluster node?

We are running a Spring 3.0.x web application (.war) with a nightly @Scheduled job in a clustered WebLogic 10.3.4 environment. However, as the application is deployed to each node (using the deployment wizard in the AdminServer's web console), the job is started on each node every night, thus running multiple times concurrently.
How can we prevent this from happening?
I know that libraries like Quartz allow coordinating jobs in a clustered environment by means of a database lock table, or I could even implement something like this myself. But since this seems to be a fairly common scenario, I wonder whether Spring already comes with an option to easily circumvent this problem without having to add new libraries to my project or put in manual workarounds.
We are not able to upgrade to Spring 3.1 with configuration profiles, as mentioned here
Please let me know if there are any open questions. I also asked this question on the Spring Community forums. Thanks a lot for your help.
We only have one task that sends a daily summary email. To avoid extra dependencies, we simply check whether the hostname of each node matches a configured system property.
// requires java.net.InetAddress and java.net.UnknownHostException
private boolean isTriggerNode() {
    try {
        String triggerHostname = System.getProperty("trigger.hostname");
        String hostName = InetAddress.getLocalHost().getHostName();
        return hostName.equals(triggerHostname);
    } catch (UnknownHostException e) {
        return false; // cannot determine the local hostname, so do not trigger
    }
}
public void execute() {
    if (isTriggerNode()) {
        // send email
    }
}
We are implementing our own synchronization logic using a shared lock table inside the application database. This allows all cluster nodes to check whether a job is already running before actually starting it.
Be careful: with the shared-lock-table approach you still have the concurrency issue of two cluster nodes reading from and writing to the table at the same time.
It is best to perform the following steps in one DB transaction (a minimal sketch follows the list):
- read the value in the shared lock table
- if no other node holds the lock, take the lock
- update the table to indicate that you have taken the lock
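Below is one way to do that with Spring's JdbcTemplate; the job_lock table, its columns and the JobLockService class are made up for illustration, and the sketch assumes one row per job name has been inserted up front. Running the method in a single transaction matters because SELECT ... FOR UPDATE keeps the row locked until commit, so two nodes can never both see an expired lock.
import java.sql.Timestamp;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.transaction.annotation.Transactional;

public class JobLockService {

    private final JdbcTemplate jdbcTemplate;

    public JobLockService(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Returns true if this node acquired the lock and should run the job.
    @Transactional
    public boolean tryAcquire(String jobName, int lockMinutes) {
        // Lock the row for this job; concurrent nodes block here until commit.
        Timestamp lockedUntil = jdbcTemplate.queryForObject(
                "SELECT locked_until FROM job_lock WHERE job_name = ? FOR UPDATE",
                Timestamp.class, jobName);
        if (lockedUntil != null && lockedUntil.after(new Timestamp(System.currentTimeMillis()))) {
            return false; // another node holds a non-expired lock
        }
        jdbcTemplate.update(
                "UPDATE job_lock SET locked_until = ? WHERE job_name = ?",
                new Timestamp(System.currentTimeMillis() + lockMinutes * 60000L), jobName);
        return true;
    }
}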
I solved this problem by making one of the boxes the master.
Basically, set an environment variable on one of the boxes, like master=true,
and read it in your Java code through System.getenv("master").
If it is present and true, run your code.
Basic snippet:
@Scheduled(fixedDelay = 60000) // fixedDelay chosen arbitrarily; use whatever schedule you need
void process() {
    boolean master = Boolean.parseBoolean(System.getenv("master"));
    if (master) {
        // your logic
    }
}
You can try using the WebLogic TimerManager (its job scheduler for clustered environments) as the TaskScheduler implementation, via TimerManagerTaskScheduler. It should work in a clustered environment.
Andrea
I've recently implemented a simple annotation library, dlock, to execute a scheduled task only once over multiple nodes. You can simply do something like the following:
@Scheduled(cron = "59 59 8 * * *" /* Every day at 8:59:59am */)
@TryLock(name = "emailLock", owner = NODE_NAME, lockFor = TEN_MINUTE)
public void sendEmails() {
    List<Email> emails = emailDAO.getEmails();
    emails.forEach(email -> sendEmail(email));
}
See my blog post about using it.
You don't need to synchronize your job start using a DB.
In a WebLogic application you can get the name of the instance where the application is running:
String serverName = System.getProperty("weblogic.Name");
Simply put a condition on executing the job:
if (serverName.equals(".....")) {
execute my job;
}
If you want to bounce the job from one machine to the other, you can take the current day of the year: if it is odd, execute the job on one machine; if it is even, execute it on the other (see the sketch below).
This way a different machine takes the load each day.
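A minimal sketch of that alternation, assuming exactly two nodes whose weblogic.Name values are known ahead of time (the server names used here are hypothetical):
import java.util.Calendar;

public class AlternatingScheduler {

    // Hypothetical managed-server names; substitute the real weblogic.Name values of your cluster.
    private static final String[] NODES = {"ManagedServer1", "ManagedServer2"};

    private boolean isMyTurn() {
        String serverName = System.getProperty("weblogic.Name");
        int dayOfYear = Calendar.getInstance().get(Calendar.DAY_OF_YEAR);
        // Even days of the year run on the first node, odd days on the second.
        return NODES[dayOfYear % 2].equals(serverName);
    }

    public void runNightlyJob() {
        if (isMyTurn()) {
            // execute the job
        }
    }
}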
We can make the other machines in the cluster not run the batch job by using the following cron string. It will not run until 2099.
0 0 0 1 1 ? 2099

Hadoop streaming jobs fail to report?

All jobs were running successfully using hadoop-streaming, but all of a sudden I started to see errors from one of the worker machines:
Hadoop job_201110302152_0002 failures on master
Attempt: attempt_201110302152_0002_m_000037_0
Task: task_201110302152_0002_m_000037
Machine: worker2
State: FAILED
Errors:
Task attempt_201110302152_0002_m_000037_0 failed to report status for 622 seconds. Killing!
Task attempt_201110302152_0002_m_000037_0 failed to report status for 601 seconds. Killing!
Questions:
- Why is this happening?
- How can I handle such issues?
Thank you
The description of mapred.task.timeout, which defaults to 600 seconds, says: "The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string."
Increasing the value of mapred.task.timeout might solve the problem, but you need to figure out whether more than 600 seconds is actually required for the map task to finish processing the input data, or whether there is a bug in the code that needs to be debugged.
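For a hadoop-streaming job, the usual way to raise the timeout is to pass it as a generic option on the command line, for example -D mapred.task.timeout=1200000 placed before the streaming-specific arguments. If you submit jobs from Java instead, the same property can be set on the job configuration; here is a minimal sketch, assuming the legacy property name used by this Hadoop version (newer releases spell it mapreduce.task.timeout):
import org.apache.hadoop.conf.Configuration;

public class TimeoutConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // 20 minutes in milliseconds. A value of 0 disables the timeout entirely,
        // which also means genuinely hung tasks are never killed.
        conf.setLong("mapred.task.timeout", 20 * 60 * 1000L);
        // ... build and submit the job with this configuration.
    }
}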
According to the Hadoop best practices, on average a map task should take a minute or so to process an InputSplit.
