Spring Cloud Task - Remote Partitioning Concerns

We have a local Spring Cloud Data Flow setup, and the task runs a Spring Batch job that reads from a database and writes to AWS S3; all of this works fine.
When it comes to stopping the job, the task stops, but resuming the job is not possible because its status remains "STARTED". I think we can handle this in code by setting the batch status to "STOPPED" when the stop is triggered; correct me if this can't be handled.
Also, when trying to stop an individual slave task, there's an error:
2020-03-27 10:48:48.140  INFO 11258 --- [nio-9393-exec-7] .s.c.d.s.s.i.DefaultTaskExecutionService : Task execution stop request for id 192 for platform default has been submitted
2020-03-27 10:48:48.144 ERROR 11258 --- [nio-9393-exec-7] o.s.c.d.s.c.RestControllerAdvice : Caught exception while handling a request
java.lang.NullPointerException: null
    at org.springframework.cloud.dataflow.server.service.impl.DefaultTaskExecutionService.cancelTaskExecution(DefaultTaskExecutionService.java:669) ~[spring-cloud-dataflow-server-core-2.3.0.RELEASE.jar!/:2.3.0.RELEASE]
    at org.springframework.cloud.dataflow.server.service.impl.DefaultTaskExecutionService.lambda$stopTaskExecution$0(DefaultTaskExecutionService.java:583) ~[spring-cloud-dataflow-server-core-2.3.0.RELEASE.jar!/:2.3.0.RELEASE]
How do we implement this in a distributed environment, where we have a master server that starts the master step locally and starts the workers on their respective slave servers?

1) You are correct; you will need to change the status from STARTED to FAILED (see the sketch after this list).
2) Since remote partitioning uses Spring Cloud Deployer (not Spring Cloud Data Flow) to launch the worker tasks, SCDF has no way to determine the platform information needed to properly stop a worker task. I've added GH issue spring-cloud/spring-cloud-dataflow#3857 to resolve this problem.
3) The current implementation prevents a user from launching on multiple servers; rather, it lets the platform (Kubernetes, Cloud Foundry) distribute the worker tasks. You can implement your own deployer to add this feature.
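For point 1, the status flip can be done through the standard Spring Batch APIs. A minimal sketch, assuming JobExplorer and JobRepository beans are available (the class name is hypothetical, for illustration only):

import java.util.Date;
import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.core.repository.JobRepository;

public class StuckExecutionFixer {

    private final JobExplorer jobExplorer;
    private final JobRepository jobRepository;

    public StuckExecutionFixer(JobExplorer jobExplorer, JobRepository jobRepository) {
        this.jobExplorer = jobExplorer;
        this.jobRepository = jobRepository;
    }

    // Flip a stuck execution (and its still-running steps) from STARTED to FAILED
    // so that the job becomes restartable.
    public void markFailed(long executionId) {
        JobExecution execution = jobExplorer.getJobExecution(executionId);
        if (execution == null || !execution.isRunning()) {
            return; // nothing to fix
        }
        for (StepExecution step : execution.getStepExecutions()) {
            if (step.getStatus().isRunning()) {
                step.setStatus(BatchStatus.FAILED);
                step.setEndTime(new Date());
                jobRepository.update(step);
            }
        }
        execution.setStatus(BatchStatus.FAILED);
        execution.setEndTime(new Date()); // a non-null END_TIME marks it as finished
        jobRepository.update(execution);
    }
}

After the update, JobOperator#restart(executionId) should be able to pick the job up again.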

Related

How to update Spring Batch status on unexpected shutdown

I'm implementing a service that would reject new job requests if an existing job is already running. Unfortunately, I'm not sure there is a way to tell the difference between a job that is actively running and a job that ended due to an unexpected shutdown, such as Tomcat being turned off. The statuses in the tables are the same: status = STARTED and exit_code = UNKNOWN.
Set<JobExecution> jobExecutions = jobExplorer.findRunningJobExecutions("MY_JOB");
Is there a way to tell the two apart, or an implementation that would change such stale job statuses to something like ABANDONED?
There is indeed no way, by just looking at the database, to distinguish between a job that is effectively running and a job that has been abruptly killed (in both cases, the status is STARTED).
What you need to do is, in addition to checking status in the database, find a way to see if a job is effectively running. This really depends on how you run your jobs. For example, if you run your jobs in separate JVMs, you can write some code that checks if there is a JVM currently running your job. If you deploy your jobs to Kubernetes, you could ask Kubernetes if there is a pod currently running your job, etc.
However, if you can identify an execution that has been abruptly stopped and whose status is stuck at STARTED (because Spring Batch did not get the chance of a graceful shutdown to update its status to FAILED), then you can manually update the status to ABANDONED and set its END_TIME to a non-null value. This way, JobExplorer#findRunningJobExecutions will no longer return it as a running execution.
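A hedged sketch of that last suggestion, assuming the usual JobExplorer and JobRepository beans; the liveness check is a deliberate placeholder that you would back with your own JVM or Kubernetes probe:

import java.util.Date;
import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.core.repository.JobRepository;

public class AbandonedExecutionCleaner {

    private final JobExplorer jobExplorer;
    private final JobRepository jobRepository;

    public AbandonedExecutionCleaner(JobExplorer jobExplorer, JobRepository jobRepository) {
        this.jobExplorer = jobExplorer;
        this.jobRepository = jobRepository;
    }

    // Mark every STARTED execution of the given job that is not effectively
    // running as ABANDONED, so findRunningJobExecutions stops returning it.
    public void abandonDeadExecutions(String jobName) {
        for (JobExecution execution : jobExplorer.findRunningJobExecutions(jobName)) {
            if (isEffectivelyRunning(execution)) {
                continue; // a live JVM/pod really owns this execution
            }
            execution.setStatus(BatchStatus.ABANDONED);
            execution.setEndTime(new Date()); // a non-null END_TIME is required
            jobRepository.update(execution);
        }
    }

    // Hypothetical placeholder: ask your runtime (process list, Kubernetes API, ...)
    // whether a JVM or pod is actually executing this job.
    private boolean isEffectivelyRunning(JobExecution execution) {
        return false;
    }
}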

Does the ECS update-service command mark the container instance state as DRAINING when used with the --force-new-deployment option?

The command:
aws ecs update-service --service my-http-service --task-definition amazon-ecs-sample --force-new-deployment
As per the AWS docs: you can use this option (--force-new-deployment) to trigger a new deployment with no service definition changes. For example, you can update a service's tasks to use a newer Docker image with the same image/tag combination (my_image:latest) or to roll Fargate tasks onto a newer platform version.
My question: if I use --force-new-deployment (as I will use the existing tag or definition), will the underlying ECS instance automatically be set to the DRAINING state, so that no new task will start on the EXISTING ECS instance that is supposed to go away during the rolling-update deployment strategy (or deployment controller)?
In other words, will there be any chance for a new task to be created on the existing/old container instance that is supposed to go away during the rolling update?
Also, what would happen to the ongoing tasks running on this existing/old container instance that is supposed to go away during the rolling update?
Ref: https://docs.aws.amazon.com/cli/latest/reference/ecs/update-service.html
Please note that no container instance is going anywhere with this update-service command. It only creates a new deployment under the ECS service and, once the new tasks become healthy, removes the old task(s).
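For reference, the equivalent call through the AWS SDK for Java v2, as a sketch (the service and task definition names come from the question; the region is an assumption):

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.ecs.EcsClient;
import software.amazon.awssdk.services.ecs.model.UpdateServiceRequest;
import software.amazon.awssdk.services.ecs.model.UpdateServiceResponse;

public class ForceNewDeployment {

    public static void main(String[] args) {
        try (EcsClient ecs = EcsClient.builder().region(Region.US_EAST_1).build()) {
            // Triggers a new deployment without changing the service definition;
            // container instances are NOT drained by this call.
            UpdateServiceResponse response = ecs.updateService(UpdateServiceRequest.builder()
                    .service("my-http-service")
                    .taskDefinition("amazon-ecs-sample")
                    .forceNewDeployment(true)
                    .build());
            System.out.println(response.service().deployments());
        }
    }
}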
Edit 1:
What about the requests that were being served by the old tasks?
I am assuming the tasks are behind an Application Load Balancer. In this case, old tasks will be deregistered from the ALB.
Note: In the following discussion, target is the ECS Task.
To give you a brief description of how the deregistration delay works with ECS, the following is the sequence of events when a task connected to an ALB is stopped. This can be due to a scale-in event, deployment of a new task definition, a decrease in the number of tasks, a forced deployment, etc.
1) ECS sends a DeregisterTargets call and the targets change their status to "draining". New connections will not be routed to these targets.
2) If the deregistration delay elapses and there are still in-flight requests, the ALB terminates them and clients receive 5XX responses originating from the ALB.
3) The targets are deregistered from the target group.
4) ECS sends the stop call to the tasks and the ECS agent gracefully stops the containers (SIGTERM).
5) If the containers are not stopped within the stop timeout period (ECS_CONTAINER_STOP_TIMEOUT, 30s by default), they are force-stopped (SIGKILL).
As per the ELB documentation [1], if a deregistering target has no in-flight requests and no active connections, Elastic Load Balancing immediately completes the deregistration process without waiting for the deregistration delay to elapse. However, even though deregistration is then complete, the target's status is displayed as draining, and you can see the registered target of the old task still present in the target group console with status draining until the deregistration delay elapses.
ECS is designed to stop the task after the draining process completes, as described in the ECS documentation [2]; hence the ECS service waits for the target group to finish draining before issuing the stop call.
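If the default 300-second deregistration delay holds your deployments open longer than you'd like, it can be shortened on the target group. A sketch with the AWS SDK for Java v2 (the target group ARN is a placeholder):

import software.amazon.awssdk.services.elasticloadbalancingv2.ElasticLoadBalancingV2Client;
import software.amazon.awssdk.services.elasticloadbalancingv2.model.ModifyTargetGroupAttributesRequest;
import software.amazon.awssdk.services.elasticloadbalancingv2.model.TargetGroupAttribute;

public class LowerDeregistrationDelay {

    public static void main(String[] args) {
        try (ElasticLoadBalancingV2Client elb = ElasticLoadBalancingV2Client.create()) {
            // Shorten the draining window from the 300s default to 30s.
            elb.modifyTargetGroupAttributes(ModifyTargetGroupAttributesRequest.builder()
                    .targetGroupArn("arn:aws:elasticloadbalancing:...:targetgroup/my-tg/abc123") // placeholder
                    .attributes(TargetGroupAttribute.builder()
                            .key("deregistration_delay.timeout_seconds")
                            .value("30")
                            .build())
                    .build());
        }
    }
}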
Ref:
[1] Target Groups for Your Application Load Balancers - Deregistration Delay - https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-target-groups.html#deregistration-delay
[2] Updating a Service - https://docs.aws.amazon.com/AmazonECS/latest/developerguide/update-service.html

Apache Flink on Windows

First, I am a complete newbie with Flink. I have installed Apache Flink on Windows.
I start Flink with start-cluster.bat. It prints out
Starting a local cluster with one JobManager process and one TaskManager process. You can terminate the processes via CTRL-C in the spawned shell windows. Web interface by default on http://localhost:8081/.
Anyway, when I submit the job, I have a bunch of messages:
DEBUG org.apache.flink.runtime.rest.RestClient - Received response {"status":{"id":"IN_PROGRESS"}}.
In the log in the web UI at http://localhost:8081/, I see:
2019-02-15 16:04:23.571 [flink-akka.actor.default-dispatcher-4] WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-6 - Association with remote system [akka.tcp://flink@127.0.0.1:56404] has failed, address is now gated for [50] ms. Reason: [Disassociated]
If I go to the Task Manager tab, it is empty.
I tried to find out whether any port needed by Flink was already in use, but that does not seem to be the case.
Any ideas on how to solve this?
I was running Flink locally using IntelliJ, with the Maven archetype that gives you ready-to-go examples:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/projectsetup/java_api_quickstart.html
You do not necessarily have to install Flink unless you are running it as a service on a cluster. For a single run, the IDE will compile the job and execute it against an embedded, in-process Flink instance.
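To illustrate, with the archetype's dependencies on the classpath, a job this small runs entirely inside the IDE on an embedded local environment, with no installation needed (a minimal sketch against the Flink 1.x DataStream Java API; the class and job names are made up):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class WordLengths {

    public static void main(String[] args) throws Exception {
        // In the IDE this returns an embedded local environment; the same code
        // picks up the real cluster environment when submitted to one.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("flink", "on", "windows")
                .map(word -> word.length())
                .print();

        env.execute("word-lengths");
    }
}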

Easier way to find pods and look at logs for a job in Spring Cloud Data Flow on Kubernetes

Currently, when I launch a task in Spring Cloud Data Flow, it starts a pod inside which the task and its jobs run. That pod's name follows a convention of the task name followed by a random ID. I wanted to know if I can map it to the task execution ID or job execution ID, so that it is easier to locate the pod and look at its logs in case a job fails.
In Spring Cloud Data Flow, this is the expected behavior.
The task execution ID and job execution ID are generated only after the task launch request is sent to the corresponding target deployment environment (local, CF, k8s).
Feel free to create a feature request (it would be great if you have any other ideas) in the SCDF GitHub repo and we can track it from there.

I am not sure whether the application is running on just the master or the whole cluster for Spark on EC2

I am using Spark 1.1.1. I followed the instructions given at https://spark.apache.org/docs/1.1.1/ec2-scripts.html and have a cluster of one master node and one worker running on EC2.
I have made a jar of the application and rsynced it to the slaves. When I run the application using spark-submit with deploy-mode client, the application works. However, when I do so with deploy-mode cluster, it gives me an error saying it cannot find the jar on the worker. The permissions on the jar are 755 on both the master and the worker.
I am not sure whether, when I run the application with deploy-mode=client, the application is using the workers. I don't think it is, since the worker UI does not show any completed jobs. But it does show failed jobs during deploy-mode=cluster.
Am I doing something wrong? Thank you for your help.
You can check whether executors are assigned to the application on the /executors page on port 4040 (e.g. http://localhost:4040/executors/). If you only see <driver>, then you are not using the worker. If you see one line for <driver> and one other line (with ID 0, unless it has restarted), then the worker is also providing an executor to your application. There you can also see how many tasks it has completed for your application, among other stats.
