OpenShift/Kubernetes liveness probe with Spring Actuator - spring

I have implemented the Spring Actuator health endpoints and added this in the liveness probe:
http-get http://:8080/actuator/health
When I describe the pod, I don't see the #success counter increasing:
http-get http://:8080/actuator/health delay=60s timeout=20s period=10s #success=1 #failure=3
How do I know whether the liveness probe is actually running against the default Actuator health endpoint?

The value of #success here is the successThreshold; it is not a counter field.
kubectl explain pod.spec.containers.livenessProbe.successThreshold
DESCRIPTION:
Minimum consecutive successes for the probe to be considered successful
after having failed. Defaults to 1. Must be 1 for liveness. Minimum value
is 1.
Similarly, there is a failureThreshold:
kubectl explain pod.spec.containers.livenessProbe.failureThreshold
When a Pod starts and the probe fails, Kubernetes will try failureThreshold times before giving up. Giving up in the case of a liveness probe means restarting the container; in the case of a readiness probe, the Pod will be marked Unready. Defaults to 3. Minimum value is 1. (See the Kubernetes documentation on configuring probes.)
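For reference, the settings shown in the describe output above correspond to a probe definition roughly like the following (a minimal sketch using the endpoint and timings from the question):
livenessProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  initialDelaySeconds: 60
  timeoutSeconds: 20
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 3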
How do I know whether the liveness probe is actually running against the default Actuator health endpoint?
Check the logs of your pod:
kubectl logs -f $pod
Or check the logs of the kubelet, which is the component doing the probing:
journalctl -f -u kubelet
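At its default verbosity the kubelet typically logs probe failures rather than successes, so filtering for probe-related lines is a quick sanity check (the grep pattern is an approximation):
journalctl -u kubelet | grep -i probe
If nothing shows up and the container is not being restarted, the probe is most likely passing.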

Related

OpenFaaS: Receiving timeout errors during health check of function Pod

Issue:
We have added a health check configuration to our function, following the health check configuration reference. However, the pod becomes Unhealthy due to timeout errors in the liveness and readiness checks and consequently gets restarted.
However, if I hit the same health check URL using curl or a browser, it returns the correct response.
We are using Kubernetes HPAv2 for auto-scaling.
test-function.yml
test-function:
  lang: quarkus-java-with-fonts
  handler: ./test-function
  image: repo.azurecr.io/test-function:0.1
  labels:
    agentpool: openfaas
    com.openfaas.scale.min: "2"
    com.openfaas.scale.max: "10"
    com.openfaas.scale.factor: "0"
  annotations:
    com.openfaas.health.http.path: "/health"
    com.openfaas.health.http.initialDelay: "30s"
  environment:
    secret_name: environment-variables
  secrets:
    - environment-variables
  constraints:
    - agentpool=openfaas
  limits:
    cpu: 1500m
    memory: 1Gi
  requests:
    cpu: 500m
    memory: 500Mi
Error trace:
Liveness probe failed: Get "http://XX.XXX.XX.XX:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Readiness probe failed: Get "http://XX.XXX.XX.XX:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Any idea what could be wrong?
These errors:
Liveness probe failed: Get "http://XX.XXX.XX.XX:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Readiness probe failed: Get "http://XX.XXX.XX.XX:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
mean that the HTTP request has failed; specifically, it did not complete within the probe's timeout. For the readiness and liveness probes to work properly, this request must succeed.
To find out where the problem is, you need to get the pod IP address. Run:
kubectl get pods -o wide
You should see an output similar to this:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
<my-pod-name> 1/1 Running 0 25d 10.92.3.4 <my-node-name> <none> 1/1
Take your IP and run the check from another pod, using the port and path from your probe configuration (8080 and /health in this case):
kubectl exec -t <another_pod> -- curl -I http://<pod's cluster IP>:8080/health
If you get a 200 response code, it means the endpoint is properly created and configured. Any other response suggests there is a problem with your image.
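Since the error here is a timeout rather than a refusal, it can also help to measure how long the endpoint takes to respond from inside the cluster (a sketch; it assumes curl is available in the image and that the function pod runs in the openfaas-fn namespace):
kubectl exec -t <function-pod> -n openfaas-fn -- curl -sS -o /dev/null -w "%{http_code} %{time_total}s\n" http://localhost:8080/health
If the response time is close to or above the probe timeout, raise the timeout or make the health endpoint cheaper.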
See also the Kubernetes guide on configuring liveness, readiness and startup probes: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/

Openshift container health probe connection refused

Hi OpenShift community,
I am currently migrating an app to OpenShift and have encountered failing health probes due to connection refused. What I find strange is that if I SSH into the pod and use
curl localhost:10080/xxx-service/info
it returns HTTP 200, but if I use the pod's IP address it fails with connection refused.
These are the details:
- Pod status
- Logs in OpenShift showing Spring Boot started successfully
- OpenShift events showing probes failed due to connection refused
- SSH'd into the pod to check using localhost, which works
Not sure why the IP address is not working at the pod level. Does anyone know the answer or has encountered this?
It is hard to say in your case what exactly the issue is, as it is environment specific.
In general, you should avoid using IP addresses when working with Kubernetes, as these will change whenever a Pod is restarted (which may be the root cause for the issue you are seeing).
When defining readiness and liveness probes for your container, I would recommend that you always use the following syntax to define your checks (note that it does not specify the host):
...
readinessProbe:
  httpGet:
    path: /xxx-service/info
    port: 10080
  initialDelaySeconds: 15
  timeoutSeconds: 1
...
See also the Kubernetes or OpenShift documentation for more information:
https://docs.openshift.com/container-platform/3.11/dev_guide/application_health.html
https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-a-liveness-http-request
I found the root cause; it turned out to be Spring related.
It was a Spring Boot app packaged as a WAR file and deployed to a Tomcat server. In application.properties it had this field:
server.address=127.0.0.1
With this setting Tomcat binds only to the loopback interface, which is why curl against localhost succeeded while the kubelet's probe against the pod IP was refused. Removing the property fixed the issue.
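If the property is actually needed, binding to all interfaces should also keep the probes working; this is an alternative to removing it, not what was done here:
server.address=0.0.0.0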
https://docs.spring.io/spring-boot/docs/current/reference/html/appendix-application-properties.html#common-application-properties-server

What does the Age such as "10m (x64 over 24h)" mean in k8s events?

What does the Age such as "10m (x64 over 24h)" mean in k8s events?
e.g.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 10m (x64 over 24h) kubelet, worker-pool1-9a9436te-ccdamf Readiness probe failed: Get http://192.168.177.153:8088/readiness: dial tcp 192.168.177.153:8088: connect: connection refused
Normal Pulled 10m (x9 over 27h) kubelet, worker-pool1-9a9436te-ccdamf Container image "k8s-registry.local/image/image1:1.100.0-51" already present on machine
Normal Created 10m (x9 over 27h) kubelet, worker-pool1-9a9436te-ccdamf Created container mm-controller
Is it healthy now or not?
From the description:
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
It seems to be Ready, but since there are no timestamps, I am confused.
10m (x64 over 24h) means that the last occurrence of this event was 10 minutes ago and that the event has occurred 64 times over the last 24 hours.
The Pod is ready to serve traffic because Ready is True. Conditions only show the latest status.
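If you want absolute timestamps rather than relative ages, the underlying Event objects carry them (firstTimestamp, lastTimestamp and count); for example:
kubectl get events --field-selector involvedObject.name=<pod-name> -o yaml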
The readiness probe will succeed once the app is running and Kubernetes is able to validate that by hitting the endpoint (or running the command) defined in the probe. The app may take some time to actually start, and the readiness probe will fail until then.
You can provide an initialDelaySeconds to avoid that:
readinessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5

How can I troubleshoot/fix an issue interacting with a running Kubernetes pod (timeout error)?

I have two EC2 instances, one running a Kubernetes Master node and the other running the Worker node. I can successfully create a pod from a deployment file that pulls a docker image and it starts with a status of "Running". However when I try to interact with it I get a timeout error.
Ex: kubectl logs <pod-name> -v6
Output:
Config loaded from file /home/ec2-user/.kube/config
GET https://<master-node-ip>:6443/api/v1/namespaces/default/pods/<pod-name> 200 OK in 11 milliseconds
GET https://<master-node-ip>:6443/api/v1/namespaces/default/pods/<pod-name>/log 500 Internal Server Error in 30002 milliseconds
Server response object: [{"status": "Failure", "message": "Get https://<worker-node-ip>:10250/containerLogs/default/<pod-name>/<container-name>: dial tcp <worker-node-ip>:10250: i/o timeout", "code": 500 }]
I can get information about the pod by running kubectl describe pod <pod-name> and confirm the status as Running. Any ideas on how to identify exactly what is causing this error and/or how to fix it?
Probably you didn't install a network add-on in your Kubernetes cluster. It's not included in a kubeadm installation, but it's required for communication between pods scheduled on different nodes. The most popular are Calico and Flannel. As you already have a cluster, you may want to choose the network add-on that uses the same subnet you specified with kubeadm init --pod-network-cidr=xx.xx.xx.xx/xx during cluster initialization:
192.168.0.0/16 is the default for the Calico network add-on
10.244.0.0/16 is the default for the Flannel network add-on
You can change the subnet by downloading the corresponding YAML file, replacing the default subnet with the one you want, and applying it with kubectl apply -f filename.yaml
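Independently of the add-on, the failing call in the trace goes from the master to the worker's kubelet on port 10250, so it is also worth confirming that this port is reachable between the two EC2 instances; a security group blocking it produces the same i/o timeout. A quick check from the master node (nc is the standard netcat utility):
nc -zv <worker-node-ip> 10250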

IBM Cloud Private monitoring gets 502 bad gateway

The following containers are not starting after installing IBM Cloud Private. I had previously installed ICP without a Management node and was doing a new install after having done an uninstall, and I did restart the Docker service on all nodes.
Installed a second time with a Management node defined, Master/Proxy on a single node, and two Worker nodes.
Selecting the menu option Platform / Monitoring returns 502 Bad Gateway.
Event messages from deployed containers
Deployment - monitoring-prometheus
TYPE SOURCE COUNT REASON MESSAGE
Warning default-scheduler 2113 FailedScheduling
No nodes are available that match all of the following predicates:: MatchNodeSelector (3), NoVolumeNodeConflict (4).
Deployment - monitoring-grafana
TYPE SOURCE COUNT REASON MESSAGE
Warning default-scheduler 2097 FailedScheduling
No nodes are available that match all of the following predicates:: MatchNodeSelector (3), NoVolumeNodeConflict (4).
Deployment - rootkit-annotator
TYPE SOURCE COUNT REASON MESSAGE
Normal kubelet 169.53.226.142 125 Pulled
Container image "ibmcom/rootkit-annotator:20171011" already present on machine
Normal kubelet 169.53.226.142 125 Created
Created container
Normal kubelet 169.53.226.142 125 Started
Started container
Warning kubelet 169.53.226.142 2770 BackOff
Back-off restarting failed container
Warning kubelet 169.53.226.142 2770 FailedSync
Error syncing pod
The management console sometimes displays a 502 Bad Gateway Error after installation or rebooting the master node. If you recently installed IBM Cloud Private, wait a few minutes and reload the page.
If you rebooted the master node, take the following steps:
1. Configure the kubectl command line interface. See Accessing your IBM Cloud Private cluster by using the kubectl CLI.
2. Obtain the IP addresses of the icp-ds pods. Run the following command:
kubectl get pods -o wide -n kube-system | grep "icp-ds"
The output resembles the following text:
icp-ds-0 1/1 Running 0 1d 10.1.231.171 10.10.25.134
In this example, 10.1.231.171 is the IP address of the pod.
In high availability (HA) environments, an icp-ds pod exists for each master node.
3. From the master node, ping the icp-ds pods. Check each icp-ds pod by running the following command for each IP address:
ping 10.1.231.171
If the output resembles the following text, you must delete the pod:
connect: Invalid argument
4. Delete each pod that you cannot reach:
kubectl delete pods icp-ds-0 -n kube-system
In this example, icp-ds-0 is the name of the unresponsive pod.
In HA installations, you might have to delete the pod for each master node.
5. Obtain the IP address of the replacement pod or pods. Run the following command:
kubectl get pods -o wide -n kube-system | grep "icp-ds"
The output resembles the following text:
icp-ds-0 1/1 Running 0 1d 10.1.231.172 10.10.2
6. Ping the pods again. Check each icp-ds pod by running the following command for each IP address:
ping 10.1.231.172
If you can reach all icp-ds pods, you can access the IBM Cloud Private management console when that pod enters the available state.
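In HA clusters with several icp-ds pods, a small shell loop can run the same reachability check in one pass (a sketch; column 6 of the kubectl get pods -o wide output is the pod IP):
for ip in $(kubectl get pods -o wide -n kube-system | grep "icp-ds" | awk '{print $6}'); do ping -c 1 "$ip"; done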
