Spring Cloud Dataflow with Kubernetes: BackoffLimit

Kubernetes Pod backoff failure policy
From the k8s documentation:
There are situations where you want to fail a Job after some amount of retries due to a logical error in configuration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as failed. The back-off limit is set by default to 6.
Spring cloud dataflow:
When a job has failed, we actually don't want a retry. In other words, we want to set backoffLimit: 1 in our Spring Cloud Dataflow config file.
We have tried to set it like the following:
deployer.kubernetes.spec.backoffLimit: 1
or even
deployer.kubernetes.backoffLimit: 1
But neither is transmitted to our Kubernetes cluster.
After 6 tries, we see the following message:
status:
  conditions:
  - lastProbeTime: '2019-10-22T17:45:46Z'
    lastTransitionTime: '2019-10-22T17:45:46Z'
    message: Job has reached the specified backoff limit
    reason: BackoffLimitExceeded
    status: 'True'
    type: Failed
  failed: 6
  startTime: '2019-10-22T17:33:01Z'
We actually want to fail fast (1 or 2 tries maximum).
Question: How can we properly set this property, so that all tasks triggered by SCDF fail at most once on Kubernetes?
Update (23.10.2019)
We have also tried the property:
deployer:
  kubernetes:
    maxCrashLoopBackOffRestarts: Never # No retry for failed tasks
But the jobs are still failing 6 times instead of 1.
Update (26.10.2019)
For completeness' sake:
I am scheduling a task in SCDF.
The task is triggered on Kubernetes (more specifically, OpenShift).
When I check the configuration on the K8s platform, I see that it still has a backoffLimit of 6, instead of 1.
YAML config snippet taken from the running pod:
spec:
  backoffLimit: 6
  completions: 1
  parallelism: 1
In the official documentation, it says:
`maxCrashLoopBackOffRestarts` - Maximum allowed restarts for app that is in a CrashLoopBackOff. Values are `Always`, `IfNotPresent`, `Never`
But maxCrashLoopBackOffRestarts takes an integer. So I guess the documentation is not accurate.
The pod is then restarted 6 times.
I have tried to set those properties unsuccessfully:
spring.cloud.dataflow.task.platform.kubernetes.accounts.defaults.maxCrashLoopBackOffRestarts: 0
spring.cloud.deployer.kubernetes.maxCrashLoopBackOffRestarts: 0
spring.cloud.scheduler.kubernetes.maxCrashLoopBackOffRestarts: 0
None of those has worked.
Any idea?

To override the default restart limit, you'd have to use SCDF's maxCrashLoopBackOffRestarts deployer property. All of the supported properties are documented in the ref. guide.
You can override this property "globally" in SCDF, or individually at each stream/task deployment, as well. More info here.
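For example, here is a minimal sketch of the "global" server-level override, assuming the Kubernetes task platform account is named default (the account name is an assumption; adjust it to yours):

# SCDF server configuration (sketch; account name "default" is an assumption)
spring:
  cloud:
    dataflow:
      task:
        platform:
          kubernetes:
            accounts:
              default:
                maxCrashLoopBackOffRestarts: 1

At an individual task launch, the equivalent per-deployment property would typically be passed as deployer.<appName>.kubernetes.maxCrashLoopBackOffRestarts=1.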

Thanks to ilayaperumalg, it's much clearer why it's not working:
It looks like the property maxCrashLoopBackOffRestarts is applicable for determining the status of the runtime application instance, while the property you refer to as backoffLimit is applicable to the JobSpec, which is currently not being supported. We can add this as a feature to support your case.
GitHub link
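For context, what we are effectively asking for is for the generated Job to carry a spec along these lines (a plain Kubernetes Job manifest sketch with hypothetical names; per the comment above, SCDF does not currently expose backoffLimit):

apiVersion: batch/v1
kind: Job
metadata:
  name: my-scdf-task              # hypothetical Job name
spec:
  backoffLimit: 1                 # fail the Job after one retry instead of the default 6
  completions: 1
  parallelism: 1
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: task
        image: my-task-image:latest   # hypothetical task image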

Related

Datadog skip ingestion of Spring actuator health endpoint

I was trying to configure my application to not report my health endpoint in Datadog APM. I checked the documentation here: https://docs.datadoghq.com/tracing/guide/ignoring_apm_resources/?tab=kuberneteshelm&code-lang=java
And tried adding the config in my helm deployment.yaml file:
- name: DD_APM_IGNORE_RESOURCES
  value: GET /actuator/health
This had no effect. Traces were still showing up in Datadog. The method and path are correct. I changed the value a few times with different combinations (tried a few regex options). No go.
Then I tried the DD_APM_FILTER_TAGS_REJECT environment variable, trying to ignore http.route:/actuator/health. Also without success.
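For reference, that second attempt would look roughly like this as an environment variable (typically on the Datadog Agent container, per the linked guide); this is only a sketch of what was described above, not a confirmed fix:

# sketch of the DD_APM_FILTER_TAGS_REJECT attempt described above
- name: DD_APM_FILTER_TAGS_REJECT
  value: http.route:/actuator/health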
I even ran the agent and application locally to see if there was anything to do with the environment, but the configs were not applied.
What are more options to try in this scenario?
This is the span detail:

Updating service [default] (this may take several minutes)...failed

This used to work perfectly until exactly four days ago. When I run gcloud app deploy now, it completes the build and then, straight after completing the build, it hangs on Updating service.
Here is the output:
Updating service [default] (this may take several minutes)...failed.
ERROR: (gcloud.app.deploy) Error Response: [13] Flex operation projects/just-sleek/regions/us-central1/operations/8260bef8-b882-4313-bf97-efff8d603c5f error [INTERNAL]: An internal error occurred while processing task /appengine-flex-v1/insert_flex_deployment/flex_create_resources>2020-05-26T05:20:44.032Z4316.jc.11: Deployment Manager operation just-sleek/operation-1590470444486-5a68641de8da1-5dfcfe5c-b041c398 errors: [
code: "RESOURCE_ERROR"
location: "/deployments/aef-default-20200526t070946/resources/aef-default-20200526t070946"
message: {
\"ResourceType\":\"compute.beta.regionAutoscaler\",
\"ResourceErrorCode\":\"403\",
\"ResourceErrorMessage\":{
\"code\":403,
\"errors\":[{
\"domain\":\"usageLimits\",
\"message\":\"Exceeded limit \'QUOTA_FOR_INSTANCES\' on resource \'aef-default-20200526t070946\'. Limit: 8.0\",
\"reason\":\"limitExceeded\"
}],
\"message\":\"Exceeded limit \'QUOTA_FOR_INSTANCES\' on resource \'aef-default-20200526t070946\'. Limit: 8.0\",
\"statusMessage\":\"Forbidden\",
\"requestPath\":\"https://compute.googleapis.com/compute/beta/projects/just-sleek/regions/us-central1/autoscalers\",
\"httpMethod\":\"POST\"
}
}"]
I tried the following ways to resolve the error:
I deleted all my previous versions and left only the running version.
I ran gcloud components update; it still fails.
I created a new project, changed the region from [REGION1] to [REGION2], deployed, and am still getting the same error.
I also ran gcloud app deploy --verbosity=debug, which does not give me any different result.
I have no clue what is causing this issue or how to solve it. Please assist.
Google is already aware of this issue and it is currently being investigated.
There is a Public Issue Tracker you may 'star' and follow to receive any further updates on this. In addition, you may see workarounds posted there that you could apply temporarily if they suit your needs.
Currently there is no ETA for the resolution, but an update will be provided as soon as the team makes progress on the issue.
I resolved this by adding the following to my app.yaml (keeping max_num_instances below the quota limit of 8 reported in the error):
automatic_scaling:
  min_num_instances: 1
  max_num_instances: 7
I found the solution here:
https://issuetracker.google.com/issues/157449521
And I was also redirected to:
gcloud app deploy - updating service default fails with code 13 Quota for instances limit exceeded, and 401 unathorizeed

GKE Ingress Timeout Values

I'd like to use websockets in my web application. Right now my websocket disconnects and reconnects every 30 seconds, which is the default timeout in GKE Ingress. I tried the following to change timeout values:
metadata:
  name: my-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.org/proxy-connect-timeout: "300"
    nginx.org/proxy-read-timeout: "3600"
    nginx.org/proxy-send-timeout: "3600"
After recreating the ingress through kubectl, the timeout value remains 30 seconds.
I also tried to create a backend configuration as described here: https://cloud.google.com/kubernetes-engine/docs/how-to/configure-backend-service
The timeout value still remained unchanged at 30 seconds.
Is there a way to increase timeout value through annotations in .yml file? I could edit the timeout value through the web interface but I'd rather use .yml files.
Fixed. I upgraded my master and its nodes to version 1.14 and then the backend config approach worked.
This doesn't seem like an issue with the version.
As long as the GKE version is 1.11.3-gke.18 and above as mentioned here, you should be able to update the timeoutSec value by configuring the 'BackendConfig' as explained in the help center article.
I changed the timeoutSec value by editing the example manifest and then updating the BackendConfig (in my GKE 1.13.11-gke.14 cluster) using the "kubectl apply -f my-bsc-backendconfig.yaml" command.

Run Flink with parallelism more than 1

Maybe I'm just missing something, but I have no more ideas where to look.
I read messages from 2 sources, join them on a common key, and sink it all to Kafka.
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(3)
...
source1
  .keyBy(_.searchId)
  .connect(source2.keyBy(_.searchId))
  .process(new SearchResultsJoinFunction)
  .addSink(KafkaSink.sink)
It works perfectly when I launch it locally, and it also works on the cluster with parallelism set to 1, but with 3 it does not anymore.
When I deploy it to 1 job manager and 3 task managers and every task reaches the "RUNNING" state, after 2 minutes (when nothing is coming to the sink) one of the task managers gets the following log:
https://gist.github.com/zavalit/1b1bf6621bed2a3848a05c1ef84c689c#file-gistfile1-txt-L108
and the whole thing just shuts down.
I'll appreciate any hint. Thanks in advance.
The problem appears to be that this task manager -- flink-taskmanager-12-2qvcd (10.81.53.209) -- is unable to talk to at least one of the other task managers, namely flink-taskmanager-12-57jzd (10.81.40.124:46240). This is why the job never really starts to run.
I would check in the logs for this other task manager to see what it says, and I would also review your network configuration. Perhaps a firewall is getting in the way?

Keeping a Ruby Service running on Elastic Beanstalk

I have been looking for a while now at how to set up worker nodes in a cloud-native application. I plan to have an autoscaling group of worker nodes pulling jobs from a queue, nothing special there.
I am just wondering, is there any best-practice way to ensure that a (e.g. Ruby) script is running at all times? My current assumption is that you have a script running that polls the queue for jobs and sleeps for a few seconds if a job query returns no new job.
What really caught my attention was the Services key in the Linux Custom Config section of AWS Elastic Beanstalk Documentation.
00_start_service.config

services:
  sysvinit:
    <name of service>:
      enabled: true
      ensureRunning: true
      files: "<file name>"
      sources: "<directory>"
      packages:
        <name of package manager>:
          <package name>: <version>
      commands:
        <name of command>:
http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/customize-containers-ec2.html
The example they give is this:
services:
  sysvinit:
    myservice:
      enabled: true
      ensureRunning: true
I find the example and documentation extremely vague and I have no idea how to get my own service up and running using this config key, which means I do not even know if this is what I want or need to use. I have tried creating a ruby executable file and putting the name in the field, but no luck.
I asked the AWS forums for more clarification and have received no response.
If anyone has any insight or direction on how this can be achieved, I would greatly appreciate it. Thank you!
I decided not to use the "services" section of the EB config files, instead just using "commands".
I built a service monitor in Ruby that monitors a given system process (in this case my service).
The service itself is a script looping infinitely, with delays based on long polling times to the queue service.
A cron job runs the monitor every minute, and if the service is down it is restarted.
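A minimal sketch of how that cron piece could be wired up in an .ebextensions config file (paths and file names here are hypothetical, not my actual files):

# .ebextensions/10_service_monitor.config (sketch with hypothetical paths)
files:
  "/etc/cron.d/service_monitor":
    mode: "000644"
    owner: root
    group: root
    content: |
      # run the Ruby monitor every minute; it restarts the worker script if it is not running
      * * * * * root /usr/bin/ruby /opt/app/monitor.rb >> /var/log/service_monitor.log 2>&1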
The syntax for files in the documentation seems to be wrong. The following works for me (note square brackets instead of quotation marks):
services:
  sysvinit:
    my_service:
      enabled: true
      ensureRunning: true
      files: [/etc/init.d/my_service]
