Spring Batch master is waiting but worker startup failed in remote partitioning - spring-boot

I am stuck in a scenario with a Spring Batch remote partitioning job where the master started successfully but the worker failed to start. The job is deployed on AWS Batch, so the master is waiting indefinitely for the workers to finish since the worker cannot come up.
Can anyone suggest a way to handle such a scenario? I don't want my master node to wait until the timeout has occurred.

The manager can be configured with a timeout so that it fails if the workers do not reply in time, so it won't wait indefinitely.
If that happens, the job instance will fail and you can either:
restart it (only the failed partitions will be re-executed)
or abandon it and start a new instance.
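For reference, here is a minimal sketch of how that timeout could be configured on the manager side. It assumes the RemotePartitioningManagerStepBuilderFactory from spring-batch-integration (named RemotePartitioningMasterStepBuilderFactory in older versions) together with @EnableBatchIntegration; the bean names, grid size, and timeout values are illustrative assumptions, not part of the original answer.

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.explore.JobExplorer;
    import org.springframework.batch.core.partition.support.Partitioner;
    import org.springframework.batch.integration.config.annotation.EnableBatchIntegration;
    import org.springframework.batch.integration.partition.RemotePartitioningManagerStepBuilderFactory;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.integration.channel.DirectChannel;

    @Configuration
    @EnableBatchIntegration
    public class ManagerStepConfiguration {

        @Bean
        public Step managerStep(RemotePartitioningManagerStepBuilderFactory managerStepBuilderFactory,
                                Partitioner partitioner,
                                DirectChannel requests,
                                JobExplorer jobExplorer) {
            return managerStepBuilderFactory.get("managerStep")
                    .partitioner("workerStep", partitioner)
                    .gridSize(4)
                    .outputChannel(requests)   // partition requests sent to the workers
                    .jobExplorer(jobExplorer)  // manager polls the job repository for worker progress
                    .pollInterval(5_000)       // check worker step executions every 5 seconds
                    .timeout(600_000)          // fail the manager step if workers have not finished after 10 minutes
                    .build();
        }
    }

With such a timeout in place, a worker that never comes up on AWS Batch causes the manager step (and therefore the job execution) to fail instead of hanging, after which the job can be restarted or abandoned as described above.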

Related

Multiple instances of a partitioned Spring Batch job

I have a partitioned Spring Batch job. The job is always started with a unique set of parameters, so it is always a new job instance.
My remoting fabric is JMS with request/response queues configured for communication between the masters and slaves.
One instance of this partitioned job processes files in a given folder. Master step gets the file names from the folder and submits the file names to the slaves; each slave instance processes one of the files.
Job works fine.
Recently, I started to execute multiple instances (completely separate JVMs) of this job to process files from multiple folders. So I essentially have multiple master steps running but the same set of slaves.
Randomly, I sometimes notice the following behavior: the slaves will finish their work, but the master keeps spinning, thinking the slaves are still doing something. The step status shows successful in the job repository, but at the job level the status is STARTING with an exit code of UNKNOWN.
All masters share the set of request/response queues; one queue for requests and one for responses.
Is this a supported configuration? Can multiple master steps sharing the same set of queues run concurrently? Because of the behavior above, I suspect the responses coming back from the workers are being delivered to the incorrect master.
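No answer is recorded here, but for context, this is roughly how the manager side of such a JMS-backed setup can be wired with MessageChannelPartitionHandler from spring-batch-integration. A minimal sketch, with assumed bean and step names (slaveStep, etc.): it has each master poll the job repository for partition completion instead of consuming a shared reply queue, which is the aggregation step where the question suspects replies from one master's workers are being picked up by another master.

    import org.springframework.batch.core.explore.JobExplorer;
    import org.springframework.batch.integration.partition.MessageChannelPartitionHandler;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.integration.core.MessagingTemplate;

    @Configuration
    public class ManagerPartitionHandlerConfig {

        @Bean
        public MessageChannelPartitionHandler partitionHandler(MessagingTemplate messagingTemplate,
                                                               JobExplorer jobExplorer) {
            MessageChannelPartitionHandler handler = new MessageChannelPartitionHandler();
            handler.setStepName("slaveStep");                  // the step executed on the slaves
            handler.setGridSize(10);                           // e.g. one partition per file
            handler.setMessagingOperations(messagingTemplate); // template bound to the shared request queue
            // Poll the job repository for the state of the partitioned step executions instead of
            // aggregating reply messages, so a shared response queue is not needed at all.
            handler.setJobExplorer(jobExplorer);
            handler.setPollInterval(10_000);
            return handler;
        }
    }

The MessagingTemplate is assumed to be bound (via a Spring Integration JMS outbound adapter) to the shared request queue; whether this resolves the STARTING/UNKNOWN behaviour described above is not confirmed here.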

Spring Cloud Task - Remote Partitioning Concerns

We have a local Spring Cloud Data Flow setup, and the task runs a Spring Batch job that reads from a database and writes to AWS S3; all of this works fine.
When it comes to stopping the job, the task stops, but resuming the job is not possible since the status is stuck in "STARTED". I think we can handle this in code by setting the batch status to 'STOPPED' when the stop is triggered; correct me if this can't be handled.
Also when trying to stop an individual slave task, there's an error:
    2020-03-27 10:48:48.140  INFO 11258 --- [nio-9393-exec-7] .s.c.d.s.s.i.DefaultTaskExecutionService : Task execution stop request for id 192 for platform default has been submitted
    2020-03-27 10:48:48.144 ERROR 11258 --- [nio-9393-exec-7] o.s.c.d.s.c.RestControllerAdvice         : Caught exception while handling a request
    java.lang.NullPointerException: null
        at org.springframework.cloud.dataflow.server.service.impl.DefaultTaskExecutionService.cancelTaskExecution(DefaultTaskExecutionService.java:669) ~[spring-cloud-dataflow-server-core-2.3.0.RELEASE.jar!/:2.3.0.RELEASE]
        at org.springframework.cloud.dataflow.server.service.impl.DefaultTaskExecutionService.lambda$stopTaskExecution$0(DefaultTaskExecutionService.java:583) ~[spring-cloud-dataflow-server-core-2.3.0.RELEASE.jar!/:2.3.0.RELEASE]
How do we implement this in a distributed environment where we have a master server that starts the master step on the master server and starts the workers on the respective slave servers?
1) You are correct; you will need to change your status from STARTED to FAILED (a sketch follows below).
2) Since remote partitioning uses Spring Cloud Deployer (not Spring Cloud Data Flow) to launch the worker tasks, SCDF does not have a way to determine the platform information needed to properly stop the worker task. I've added GH issue spring-cloud/spring-cloud-dataflow#3857 to resolve this problem.
3) The current implementation does not let a user launch the workers on specific servers themselves; rather, it lets the platform (Kubernetes, Cloud Foundry) distribute the worker tasks. You can implement your own deployer to add this feature.
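Regarding point 1, here is a minimal sketch of how an execution stuck in STARTED can be flipped so that the job instance becomes restartable. It assumes direct access to the JobExplorer and JobRepository beans and the Spring Batch 4.x API; the class name and execution id are illustrative.

    import java.util.Date;
    import org.springframework.batch.core.BatchStatus;
    import org.springframework.batch.core.ExitStatus;
    import org.springframework.batch.core.JobExecution;
    import org.springframework.batch.core.explore.JobExplorer;
    import org.springframework.batch.core.repository.JobRepository;

    public class StuckExecutionFixer {

        private final JobExplorer jobExplorer;
        private final JobRepository jobRepository;

        public StuckExecutionFixer(JobExplorer jobExplorer, JobRepository jobRepository) {
            this.jobExplorer = jobExplorer;
            this.jobRepository = jobRepository;
        }

        // Mark an execution that is stuck in STARTED as FAILED so the job instance can be restarted.
        public void markFailed(long executionId) {
            JobExecution execution = jobExplorer.getJobExecution(executionId);
            execution.setStatus(BatchStatus.FAILED);
            execution.setExitStatus(ExitStatus.FAILED);
            execution.setEndTime(new Date()); // an end time is required for the execution to be restartable
            jobRepository.update(execution);
            // Any step executions still marked STARTED may need the same treatment via
            // jobRepository.update(stepExecution).
        }
    }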

Stopping an EC2 instance to scale down before it completes the process running on it

We have an application that runs on EC2 instances used as Docker hosts in an ECS cluster. There are multiple tasks running on each EC2 instance. Each task picks up one message from SQS and processes an event (which converts data from one format to another and uploads it to a file system); this may take from a few seconds to 12-15 hours depending on the size of the data. Once event processing is completed, the task is stopped, and a new task is created for the next message (event). Whenever there is a huge number of messages in SQS, we scale up the instances to process the messages (to avoid wait time). When (number of messages) < (number of running tasks) for a certain duration, we need to scale down, i.e. terminate EC2 instances.
For EC2 scale-down we need to make sure no task is running, i.e. no container is processing an event. There is no way to find out which EC2 instances are free (not processing any event), so we mark the container instance as DRAINING and then terminate the EC2 instance. But when we mark a container instance as DRAINING, the tasks running on it are stopped (hence event processing is killed midway and data is lost). Is there any way we can complete the processing before the tasks are stopped, or can anyone suggest a better approach?
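No accepted answer is recorded here, but since putting a container instance into DRAINING makes ECS stop its tasks with a SIGTERM followed by a SIGKILL after the stop timeout, one commonly used pattern is to raise the stop timeout (where the launch type allows it) and handle SIGTERM gracefully so the in-flight event can finish. A minimal Java sketch of the in-container side; the class and method names are illustrative, not part of the question.

    import java.util.concurrent.CountDownLatch;

    public class EventWorker {

        // Completes when the in-flight event has been fully processed and uploaded.
        private static final CountDownLatch finished = new CountDownLatch(1);

        public static void main(String[] args) {
            // ECS stops the task with SIGTERM when its instance is DRAINING; this hook keeps the
            // JVM alive until the current event is done (provided the stop timeout is large enough).
            Runtime.getRuntime().addShutdownHook(new Thread(() -> {
                try {
                    finished.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));

            processEvent();        // placeholder for the SQS-message-driven conversion and upload
            finished.countDown();  // from here on, a stop request can proceed immediately
        }

        private static void processEvent() {
            // ... convert the data and upload it to the file system ...
        }
    }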

What is the job status when the NameNode fails in YARN?

When a job is running in the cluster and the NameNode suddenly fails, what will be the status of the job (failed or killed)?
If it is failed, who updates the job status?
How does this work internally?
The standby NameNode will become the active NameNode through the failover process. Have a look at How does Hadoop Namenode failover process works?
The YARN architecture revolves around the ResourceManager, the NodeManager, and the ApplicationMaster. Jobs will continue without any impact from a NameNode failure. If any of the above three processes fails, job recovery is done depending on the recovery of the respective process.
Resource Manager recovery:
With ResourceManager Restart enabled, the RM being promoted (the current standby) to the active state loads the RM internal state and continues to operate from where the previous active left off as much as possible, depending on the RM restart feature. A new attempt is spawned for each managed application previously submitted to the RM.
Application Master recovery:
For MapReduce running on YARN (aka MR2), the MR ApplicationMaster plays the role of a per-job jobtracker. MRAM failure recovery is controlled by the property, mapreduce.am.max-attempts. This property may be set per job. If its value is greater than 1, then when the ApplicationMaster dies, a new one is spun up for a new application attempt, up to the max-attempts. When a new application attempt is started, in-flight tasks are aborted and rerun but completed tasks are not rerun.
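To make the per-job setting above concrete, here is a small sketch using the Hadoop Java client API. The job name and the value of 4 are illustrative; note that the effective limit is also capped by the cluster-wide yarn.resourcemanager.am.max-attempts.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class AmRecoveryExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Allow the MR ApplicationMaster to be restarted up to 4 times for this job;
            // on a new attempt, completed tasks are recovered and only in-flight tasks are rerun.
            conf.setInt("mapreduce.am.max-attempts", 4);
            Job job = Job.getInstance(conf, "am-recovery-example");
            // ... set mapper, reducer, input and output paths as usual, then job.waitForCompletion(true)
        }
    }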
Node Manager Recovery:
During the recovery, the NM loads the applications’ state from the state store. The state for each application indicates whether the application has finished or not. Note that for a finished application no more containers will be launched, but it may still be undergoing log aggregation. As each application is recovered, a new Application object is created and initialization events are triggered to reinitialize the bookkeeping for the application within the NM.
During all these phases, Job History plays a critical role. The status of successfully completed map and reduce tasks will be restored from the Job History Server. This status helps avoid re-launching already completed map/reduce tasks.
Have a look at the Resource Manager HA article, the Node Manager restart article, and the YARN HA article.
I'm not completely sure of the following since I haven't tested it out. But it can't hurt to fire up a VM and test it out for yourself.
The NameNode does not handle the status of jobs; that is what YARN does.
If the NameNode is not HA and it dies, you will lose your connection to HDFS (and maybe even have data loss). YARN will try to re-contact HDFS a few times by default and eventually time out and fail the job.

How is container failure handled for a YARN MapReduce job?

How are software/hardware failures handled in YARN? Specifically, what happens in case of container(s) failure/crash?
Container and task failures are handled by the NodeManager. When a container fails or dies, the NodeManager detects the failure event and launches a new container to replace the failing one and restart the task execution in the new container.
In the event of an ApplicationMaster failure, the ResourceManager detects the failure and starts a new instance of the ApplicationMaster in a new container.
Find the details here
The ApplicationMaster will re-attempt tasks that complete with an exception or stop responding (4 times by default).
A job with too many failed tasks is considered a failed job.
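To illustrate where those retry limits live, here is a small sketch using the Hadoop Java client API. The values shown are the usual defaults; the failure-percentage property is optional and included only as an example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class TaskRetryExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // A failed task attempt is retried in a new container, up to these limits (default 4).
            conf.setInt("mapreduce.map.maxattempts", 4);
            conf.setInt("mapreduce.reduce.maxattempts", 4);
            // Optionally tolerate a percentage of failed map tasks before the whole job is marked failed.
            conf.setInt("mapreduce.map.failures.maxpercent", 5);
            Job job = Job.getInstance(conf, "task-retry-example");
            // ... configure mapper, reducer, input and output paths as usual ...
        }
    }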
