HA - Pacemaker - Is there a way to automatically clean failed actions after X sec/min/hour? - high-availability

I'm using Pacemaker + Corosync on CentOS 7.
When one of my resources fails/stops, I get a failed-action message:
 Master/Slave Set: myoptClone01 [myopt_data01]
     Masters: [ pcmk01-cr ]
     Slaves: [ pcmk02-cr ]
 myopt_fs01         (ocf::heartbeat:Filesystem):        Started pcmk01-cr
 myopt_VIP01        (ocf::heartbeat:IPaddr2):           Started pcmk01-cr
 ServicesResource   (ocf::heartbeat:RADviewServices):   Started pcmk01-cr

Failed Actions:
* ServicesResource_monitor_120000 on pcmk02-cr 'unknown error' (1): call=141, status=complete, exitreason='none',
    last-rc-change='Mon Jan 30 10:19:36 2017', queued=0ms, exec=142ms
Is there a way to automatically clean the failed actions after X sec/min/hour?

Look into the 'failure-timeout' resource option. This will automatically clean up the failed action if no further failures have occurred for that resource within the failure-timeout window.
I believe the failure-timeout is only evaluated at the cluster-recheck-interval, which means that even if you configure failure-timeout to 1 minute, it may still take up to 15 minutes and 59 seconds to clear the failed action with Pacemaker's default 15-minute cluster-recheck-interval.
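With pcs this could look roughly like the following (a sketch; the resource name comes from the status output above and the timeout values are placeholders to adjust):
# Clear the failure record automatically once no new failures have occurred for 10 minutes
pcs resource update ServicesResource meta failure-timeout=10min
# Optionally make Pacemaker re-evaluate more often than the default 15 minutes
pcs property set cluster-recheck-interval=5min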
More information:
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-failure-migration.html
http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-resource-options.html

Related

Kubernetes Pod terminates with Exit Code 143

I am running a containerized Spring Boot application in Kubernetes, but the application automatically exits and restarts with exit code 143 and the error message "Error".
I am not sure how to identify the reason for this error.
My first idea was that Kubernetes stopped the container due to excessive resource usage, as described here, but I can't see the corresponding kubelet logs.
Is there any way to identify the cause/origin of the SIGTERM? Maybe from Spring Boot itself, or from the JVM?
Exit Code 143
It denotes that the process was terminated by an external signal.
The number 143 is the sum 128 + x, where x is the number of the signal sent to the process that caused it to terminate.
In this case, x equals 15, which is the number of the SIGTERM signal, meaning the process was told to shut down by a SIGTERM (a graceful termination request, as opposed to SIGKILL).
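As a quick check, you can read the last termination state straight from the pod status (a sketch; <pod-name> is a placeholder for your pod):
# Exit code, reason and signal of the last terminated container in the pod
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
# The Events section also shows probe failures and kill reasons
kubectl describe pod <pod-name>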
Hope this helps.
I've just run into this exact same problem. I was able to track down the origin of the exit code 143 by looking at the logs on the Kubernetes nodes (note: the logs on the node, not the pod). (I use Lens as an easy way to get a node shell, but there are other ways.)
Then, if you search /var/log/messages for terminated, you'll see something like this:
Feb 2 11:52:27 np-26992252-3 kubelet[23125]: I0202 11:52:27.541751 23125 kubelet.go:2214] "SyncLoop (probe)" probe="liveness" status="unhealthy" pod="default/app-compute-deployment-56ccffd87f-8s78v"
Feb 2 11:52:27 np-26992252-3 kubelet[23125]: I0202 11:52:27.541920 23125 kubelet.go:2214] "SyncLoop (probe)" probe="readiness" status="" pod="default/app-compute-deployment-56ccffd87f-8s78v"
Feb 2 11:52:27 np-26992252-3 kubelet[23125]: I0202 11:52:27.543274 23125 kuberuntime_manager.go:707] "Message for Container of pod" containerName="app" containerStatusID={Type:containerd ID:c3426d6b07fe3bd60bcbe675bab73b6b4b3619ef4639e1c23bca82692633765e} pod="default/app-compute-deployment-56ccffd87f-8s78v" containerMessage="Container app failed liveness probe, will be restarted"
Feb 2 11:52:27 np-26992252-3 kubelet[23125]: I0202 11:52:27.543374 23125 kuberuntime_container.go:723] "Killing container with a grace period" pod="default/app-compute-deployment-56ccffd87f-8s78v" podUID=89fdc1a2-3a3b-4d57-8a4d-ab115e52dc85 containerName="app" containerID="containerd://c3426d6b07fe3bd60bcbe675bab73b6b4b3619ef4639e1c23bca82692633765e" gracePeriod=30
Feb 2 11:52:27 np-26992252-3 containerd[22741]: time="2023-02-02T11:52:27.543834687Z" level=info msg="StopContainer for \"c3426d6b07fe3bd60bcbe675bab73b6b4b3619ef4639e1c23bca82692633765e\" with timeout 30 (s)"
Feb 2 11:52:27 np-26992252-3 containerd[22741]: time="2023-02-02T11:52:27.544593294Z" level=info msg="Stop container \"c3426d6b07fe3bd60bcbe675bab73b6b4b3619ef4639e1c23bca82692633765e\" with signal terminated"
The bit to look out for is containerMessage="Container app failed liveness probe, will be restarted"
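If you only want the relevant lines, a quick grep on the node does the job (a sketch, assuming the same log path as above):
# Pull out probe failures and container kill messages from the kubelet log
grep -iE 'liveness|terminated' /var/log/messages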

PIPE connection to Jenkins timing out

Any timely help would be greatly appreciated.
I am using a pipe to connect to a Jenkins pipeline from Bitbucket (BB), with the below code in my BB.yml:
- step: &functionalTest
    name: functional test
    image: python:3.9
    script:
      - pipe: atlassian/jenkins-job-trigger:0.1.1
        variables:
          JENKINS_URL: '<<myJenkinsURL>>'
          JENKINS_USER: '<<myJenkinsUser>>'
          JENKINS_TOKEN: $JENKINS_USER_TOKEN
          JOB_NAME: '<<myJenkinsJob>>'
          WAIT: 'true'
          WAIT_MAX_TIMEOUT: 500
It was working fine until last week. However, since Friday I have been seeing a number of failures in the Bitbucket pipeline with a timeout, even though the Jenkins job succeeds and takes only 4 minutes and 56 seconds to execute all test cases. I also have WAIT_MAX_TIMEOUT: 500 (almost 8 minutes of maximum wait).
Exception:
✖ Timeout while waiting for jenkins job with build number 254 to be completed
PS: The Jenkins job for this build number completes successfully in 5 minutes (including the Sonar report generation).

GCP / exporting disk image to Storage bucket fails

I'm trying to export a disk image I've built in GCP as a VMDK to a storage bucket.
The export throws an error message complaining that a service account was not found. I can't remember having deleted such a service account; as far as I know it should have existed since the creation of the project.
How can I re-create a default service account without risking losing all my Compute Engine resources? Which roles should I give to this service account?
[image-export-ext.export-disk.setup-disks]: 2021-10-06T18:52:00Z CreateDisks: Creating disk "disk-export-disk-os-image-export-ext-export-disk-j8vpl".
[image-export-ext.export-disk.setup-disks]: 2021-10-06T18:52:00Z CreateDisks: Creating disk "disk-export-disk-buffer-j8vpl".
[image-export-ext.export-disk]: 2021-10-06T18:52:01Z Step "setup-disks" (CreateDisks) successfully finished.
[image-export-ext.export-disk]: 2021-10-06T18:52:01Z Running step "run-export-disk" (CreateInstances)
[image-export-ext.export-disk.run-export-disk]: 2021-10-06T18:52:01Z CreateInstances: Creating instance "inst-export-disk-image-export-ext-export-disk-j8vpl".
[image-export-ext]: 2021-10-06T18:52:07Z Error running workflow: step "export-disk" run error: step "run-export-disk" run error: operation failed &{ClientOperationId: CreationTimestamp: Description: EndTime:2021-10-06T11:52:07.153-07:00 Error:0xc000712230 HttpErrorMessage:BAD REQUEST HttpErrorStatusCode:400 Id:5314937137696624317 InsertTime:2021-10-06T11:52:02.707-07:00 Kind:compute#operation Name:operation-1633546321707-5cdb3a43ac385-839c7747-2ca655ee OperationGroupId: OperationType:insert Progress:100 Region: SelfLink:https://www.googleapis.com/compute/v1/projects/savvy-bonito-207708/zones/us-east1-b/operations/operation-1633546321707-5cdb3a43ac385-839c7747-2ca655ee StartTime:2021-10-06T11:52:02.708-07:00 Status:DONE StatusMessage: TargetId:840687976797195965 TargetLink:https://www.googleapis.com/compute/v1/projects/savvy-bonito-207708/zones/us-east1-b/instances/inst-export-disk-image-export-ext-export-disk-j8vpl User:494995903825#cloudbuild.gserviceaccount.com Warnings:[] Zone:https://www.googleapis.com/compute/v1/projects/savvy-bonito-207708/zones/us-east1-b ServerResponse:{HTTPStatusCode:200 Header:map[Cache-Control:[private] Content-Type:[application/json; charset=UTF-8] Date:[Wed, 06 Oct 2021 18:52:07 GMT] Server:[ESF] Vary:[Origin X-Origin Referer] X-Content-Type-Options:[nosniff] X-Frame-Options:[SAMEORIGIN] X-Xss-Protection:[0]]} ForceSendFields:[] NullFields:[]}:
Code: EXTERNAL_RESOURCE_NOT_FOUND
Message: The resource '494995903825-compute#developer.gserviceaccount.com' of type 'serviceAccount' was not found.
[image-export-ext]: 2021-10-06T18:52:07Z Workflow "image-export-ext" cleaning up (this may take up to 2 minutes).
[image-export-ext]: 2021-10-06T18:52:08Z Workflow "image-export-ext" finished cleanup.
[image-export] 2021/10/06 18:52:08 step "export-disk" run error: step "run-export-disk" run error: operation failed &{ClientOperationId: CreationTimestamp: Description: EndTime:2021-10-06T11:52:07.153-07:00 Error:0xc000712230 HttpErrorMessage:BAD REQUEST HttpErrorStatusCode:400 Id:5314937137696624317 InsertTime:2021-10-06T11:52:02.707-07:00 Kind:compute#operation Name:operation-1633546321707-5cdb3a43ac385-839c7747-2ca655ee OperationGroupId: OperationType:insert Progress:100 Region: SelfLink:https://www.googleapis.com/compute/v1/projects/savvy-bonito-207708/zones/us-east1-b/operations/operation-1633546321707-5cdb3a43ac385-839c7747-2ca655ee StartTime:2021-10-06T11:52:02.708-07:00 Status:DONE StatusMessage: TargetId:840687976797195965 TargetLink:https://www.googleapis.com/compute/v1/projects/savvy-bonito-207708/zones/us-east1-b/instances/inst-export-disk-image-export-ext-export-disk-j8vpl **User:494995903825#cloudbuild.gserviceaccount.com** Warnings:[] Zone:https://www.googleapis.com/compute/v1/projects/savvy-bonito-207708/zones/us-east1-b ServerResponse:{HTTPStatusCode:200 Header:map[Cache-Control:[private] Content-Type:[application/json; charset=UTF-8] Date:[Wed, 06 Oct 2021 18:52:07 GMT] Server:[ESF] Vary:[Origin X-Origin Referer] X-Content-Type-Options:[nosniff] X-Frame-Options:[SAMEORIGIN] X-Xss-Protection:[0]]} ForceSendFields:[] NullFields:[]}: Code: EXTERNAL_RESOURCE_NOT_FOUND; Message: The resource **'494995903825-compute#developer.gserviceaccount.com' of type 'serviceAccount' was not found.**
ERROR
ERROR: build step 0 "gcr.io/compute-image-tools/gce_vm_image_export:release" failed: step exited with non-zero status: 1
Go to IAM & Admin > IAM and check whether your default Compute Engine service account (PROJECT_NUMBER-compute@developer.gserviceaccount.com) is there.
If it was deleted, you can recover it within 30 days of the deletion; after 30 days a default compute service account cannot be recovered.
If all the above fails, then you might need to go the custom service-account route, or share the image with a project that still has a default service account.
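A rough sketch with gcloud (the project ID and the numeric unique ID are placeholders you would substitute; the unique ID of a deleted account can usually be found in the Cloud Audit Logs entry for its deletion):
# Check whether the default compute service account still exists
gcloud iam service-accounts list --project my-project --filter="email:compute@developer.gserviceaccount.com"
# If it was deleted less than 30 days ago, undelete it by its numeric unique ID
gcloud beta iam service-accounts undelete ACCOUNT_UNIQUE_ID --project my-project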

NiFi FetchFTP delete failing for some of the files

NiFi version 1.5.
I use FetchFTP configured as shown below:
Hostname: x.x.x.x
Port: 21
Username: yyy
Password: zzz
Remote File: ${path}/${file_name}
Completion Strategy: Delete File
Run Schedule: 5 sec
50 files are processed by FetchFTP, but only 46 of them are successfully deleted from the FTP server.
For the others, the processor immediately showed the below error message, which also appears in the log file:
2019-11-20 23:33:25,542 WARN [Timer-Driven Process Thread-6] o.a.nifi.processors.standard.FetchFTP FetchFTP[id=53ee29a5] Successfully fetched the content for StandardFlowFileRecord[uuid=8ec8219e,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=15742604, container=default, section=175], offset=49, length=598026],offset=0,name=20191118145221190.pdf,size=598026] from x.x.x.x:21/<folder>/20191118145221190.pdf but failed to remove the remote file due to java.io.IOException: Failed to remove file /<folder>/20191118145221190.pdf due to 550 The process cannot access the file because it is being used by another process.
I would appreciate any help regarding this.

Issue installing openwhisk with incubator-openwhisk-devtools

I have a blocking issue installing OpenWhisk with Docker.
I typed make quick-start right after a git pull of the incubator-openwhisk-devtools project. My OS is Fedora 29, Docker version 18.09.0, docker-compose version 1.22.0, Oracle JDK 8.
I get the following error:
[...]
adding the function to whisk ...
ok: created action hello
invoking the function ...
error: Unable to invoke action 'hello': The server is currently unavailable (because it is overloaded or down for maintenance). (code ciOZDS8VySDyVuETF14n8QqB9wifUboT)
[...]
[ERROR] [#tid_sid_unknown] [Invoker] failed to ping the controller: org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for health-0: 30069 ms has passed since batch creation plus linger time
[ERROR] [#tid_sid_unknown] [KafkaProducerConnector] sending message on topic 'health' failed: Expiring 1 record(s) for health-0: 30009 ms has passed since batch creation plus linger time
Please note that controller-local-logs.log is never created.
If I create the file with touch controller-local-logs.log in the right directory, it is still empty after I run make quick-start again.
http://localhost:8888/ping gives me the right answer: pong.
http://localhost:9222 is not reachable.
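For reference, the checks above look roughly like this (the docker ps name filters are guesses and may differ on your setup):
curl http://localhost:8888/ping        # answers: pong
curl http://localhost:9222             # not reachable
docker ps --filter name=controller     # check whether the controller container is actually running
docker ps --filter name=invoker        # same for the invoker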
Where am I going wrong?
Thank you in advance.
