Issue:
We have added a health check configuration to our function. However, the pod becomes Unhealthy due to timeout errors in the liveness and readiness checks and is consequently restarted.
If I hit the same health check URL with curl or a browser, it returns the correct response.
Health check configuration reference.
We are using Kubernetes HPAv2 for auto-scaling (Reference).
test-function.yml
test-function:
  lang: quarkus-java-with-fonts
  handler: ./test-function
  image: repo.azurecr.io/test-function:0.1
  labels:
    agentpool: openfaas
    com.openfaas.scale.min: "2"
    com.openfaas.scale.max: "10"
    com.openfaas.scale.factor: 0
  annotations:
    com.openfaas.health.http.path: "/health"
    com.openfaas.health.http.initialDelay: "30s"
  environment:
    secret_name: environment-variables
  secrets:
    - environment-variables
  constraints:
    - agentpool=openfaas
  limits:
    cpu: 1500m
    memory: 1Gi
  requests:
    cpu: 500m
    memory: 500Mi
Error trace:
Liveness probe failed: Get "http://XX.XXX.XX.XX:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Readiness probe failed: Get "http://XX.XXX.XX.XX:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Any idea what could be wrong?
These errors:
Liveness probe failed: Get "http://XX.XXX.XX.XX:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Readiness probe failed: Get "http://XX.XXX.XX.XX:8080/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
mean that the HTTP request to the probe endpoint timed out. For the readiness and liveness probes to work, this request has to complete successfully within the probe's timeout.
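For reference, the kubelet is running an HTTP probe roughly like the sketch below (values are illustrative: the path and port come from the error above, and timeoutSeconds defaults to 1 second in Kubernetes, so a /health endpoint that answers slowly will trip the "Client.Timeout exceeded" error):
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30   # from com.openfaas.health.http.initialDelay
  timeoutSeconds: 1         # Kubernetes default; a slow /health response exceeds this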
To find out where the problem is, you need to get the pod IP address. Run:
kubectl get pods -o wide
You should see an output similar to this:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
<my-pod-name> 1/1 Running 0 25d 10.92.3.4 <my-node-name> <none> 1/1
Take that IP and, from another pod, run:
kubectl exec -t <another_pod> -- curl -I http://<pod-ip>:8080/health
If you get a 200 response code, it means the endpoint is properly created and configured. Any other response suggests there is a problem with your image.
See also:
- this similar question (with solution) on GitHub
- a very similar question on Stack Overflow
- a guide on how to set up liveness and readiness probes
- this nice article
Related
I'm trying to use a GKE private cluster with the standard config and the Anthos Service Mesh managed profile. However, when I try to deploy the "Iris" model for a test, the deployment gets stuck calling "storage.googleapis.com":
$ kubectl get all -n test
NAME READY STATUS RESTARTS AGE
pod/iris-model-default-0-classifier-dfb586df4-ltt29 0/3 Init:1/2 0 30s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/iris-model-default ClusterIP xxx.xxx.65.194 <none> 8000/TCP,5001/TCP 30s
service/iris-model-default-classifier ClusterIP xxx.xxx.79.206 <none> 9000/TCP,9500/TCP 30s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/iris-model-default-0-classifier 0/1 1 0 31s
NAME DESIRED CURRENT READY AGE
replicaset.apps/iris-model-default-0-classifier-dfb586df4 1 1 0 31s
$ kubectl logs -f -n test pod/iris-model-default-0-classifier-dfb586df4-ltt29 -c classifier-model-initializer
2022/11/19 20:59:34 NOTICE: Config file "/.rclone.conf" not found - using defaults
2022/11/19 20:59:57 ERROR : GCS bucket seldon-models path v1.15.0-dev/sklearn/iris: error reading source root directory: Get "https://storage.googleapis.com/storage/v1/b/seldon-models/o?alt=json&delimiter=%2F&maxResults=1000&prefix=v1.15.0-dev%2Fsklearn%2Firis%2F&prettyPrint=false": dial tcp 199.36.153.8:443: connect: connection refused
2022/11/19 20:59:57 ERROR : Attempt 1/3 failed with 1 errors and: Get "https://storage.googleapis.com/storage/v1/b/seldon-models/o?alt=json&delimiter=%2F&maxResults=1000&prefix=v1.15.0-dev%2Fsklearn%2Firis%2F&prettyPrint=false": dial tcp 199.36.153.8:443: connect: connection refused
2022/11/19 21:00:17 ERROR : GCS bucket seldon-models path v1.15.0-dev/sklearn/iris: error reading source root directory: Get "https://storage.googleapis.com/storage/v1/b/seldon-models/o?alt=json&delimiter=%2F&maxResults=1000&prefix=v1.15.0-dev%2Fsklearn%2Firis%2F&prettyPrint=false": dial tcp 199.36.153.8:443: connect: connection refused
2022/11/19 21:00:17 ERROR : Attempt 2/3 failed with 1 errors and: Get "https://storage.googleapis.com/storage/v1/b/seldon-models/o?alt=json&delimiter=%2F&maxResults=1000&prefix=v1.15.0-dev%2Fsklearn%2Firis%2F&prettyPrint=false": dial tcp 199.36.153.8:443: connect: connection refused
I used "sidecar injection" with the namespace labeling:
kubectl create namespace test
kubectl label namespace test istio-injection- istio.io/rev=asm-managed --overwrite
kubectl annotate --overwrite namespace test mesh.cloud.google.com/proxy='{"managed":"true"}'
When I don't use "sidecar injection", the deployment was quite successful. But in this case I need to inject the proxy manually to get the accesss to the model API. I wonder if this is the intended operation or not.
Istio sidecars block connectivity for other init containers; this is unfortunately a known issue with Istio sidecars. A potential workaround is to ask Istio not to "filter" traffic going to storage.googleapis.com (i.e. not to route that traffic through Istio's egress), which can be done through Istio's excludeIPRanges setting.
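As a rough per-workload sketch, the same effect can be scoped with the sidecar annotation for excluded outbound IP ranges on the Deployment's Pod template; the CIDR below is an assumption based on the 199.36.153.8 address in the error above (the private.googleapis.com range):
spec:
  template:
    metadata:
      annotations:
        # assumption: 199.36.153.8/30 covers the private.googleapis.com addresses hit above
        traffic.sidecar.istio.io/excludeOutboundIPRanges: "199.36.153.8/30"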
In the longer term, due to these shortcomings, Istio seems to be moving away from sidecars into their new "Ambient mesh".
What does the Age such as "10m (x64 over 24h)" mean in k8s events?
e.g.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 10m (x64 over 24h) kubelet, worker-pool1-9a9436te-ccdamf Readiness probe failed: Get http://192.168.177.153:8088/readiness: dial tcp 192.168.177.153:8088: connect: connection refused
Normal Pulled 10m (x9 over 27h) kubelet, worker-pool1-9a9436te-ccdamf Container image "k8s-registry.local/image/image1:1.100.0-51" already present on machine
Normal Created 10m (x9 over 27h) kubelet, worker-pool1-9a9436te-ccdamf Created container mm-controller
Is it healthy now or not?
From the pod description:
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
It seems to be Ready, but since there are no timestamps, I am confused.
10m (x64 over 24h) means the last time this event happened was 10 minutes ago, and the event has occurred 64 times over the last 24 hours.
The Pod is ready to serve traffic, since Ready is True. Conditions only show the latest status.
The readiness probe succeeds once the app is running and Kubernetes can validate that by hitting the endpoint defined in the readiness probe. The app may take some time to actually start, and the readiness probe will fail until then.
You can provide an initialDelaySeconds to avoid that, for example:
readinessProbe:
  exec:
    command:
    - cat
    - /tmp/healthy
  initialDelaySeconds: 5
  periodSeconds: 5
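Since the probe failing in the events above is an HTTP check, the equivalent httpGet form would look roughly like this (path and port taken from the event message; the timing values are only examples):
readinessProbe:
  httpGet:
    path: /readiness
    port: 8088
  initialDelaySeconds: 15   # example: give the app time to start before probing
  periodSeconds: 5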
I have implemented the Spring Actuator health endpoints and added this in the liveness probe:
http-get http://:8080/actuator/health
When I describe the pod, I don't see the #success counter increasing:
http-get http://:8080/actuator/health delay=60s timeout=20s period=10s #success=1 #failure=3
How can I know whether the liveness probe is actually running against the default Actuator health endpoint?
The #success value here is the successThreshold. It is not a counter field:
kubectl explain pod.spec.containers.livenessProbe.successThreshold
DESCRIPTION:
Minimum consecutive successes for the probe to be considered successful
after having failed. Defaults to 1. Must be 1 for liveness. Minimum value
is 1.
Similarly, there is a failureThreshold:
kubectl explain pod.spec.containers.livenessProbe.failureThreshold
When a Pod starts and the probe fails, Kubernetes will try failureThreshold times before giving up. Giving up in the case of a liveness probe means restarting the container; in the case of a readiness probe, the Pod is marked Unready. Defaults to 3. Minimum value is 1. (See the configure-probes section of the Kubernetes docs.)
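For illustration, this is roughly how the values shown in your describe output (delay=60s timeout=20s period=10s #success=1 #failure=3) map onto the probe fields; it's a sketch, not your exact manifest:
livenessProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  initialDelaySeconds: 60
  timeoutSeconds: 20
  periodSeconds: 10
  successThreshold: 1    # must be 1 for a liveness probe
  failureThreshold: 3    # the container is restarted after 3 consecutive failures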
How can I know whether the liveness probe is actually running against the default Actuator health endpoint?
Check the logs of your pod:
kubectl logs -f $pod
or check the logs of the kubelet, which is the component probing the pod:
journalctl -f -u kubelet
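Probe failures also show up as Pod events, so you can additionally run:
kubectl describe pod $pod
kubectl get events --field-selector involvedObject.name=$pod
and look for Liveness/Readiness warning events; if none appear and the container is not restarting, the probe is passing.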
I have two EC2 instances, one running a Kubernetes Master node and the other running the Worker node. I can successfully create a pod from a deployment file that pulls a docker image and it starts with a status of "Running". However when I try to interact with it I get a timeout error.
Ex: kubectl logs <pod-name> -v6
Output:
Config loaded from file /home/ec2-user/.kube/config
GET https://<master-node-ip>:6443/api/v1/namespaces/default/pods/<pod-name> 200 OK in 11 milliseconds
GET https://<master-node-ip>:6443/api/v1/namespaces/default/pods/<pod-name>/log 500 Internal Server Error in 30002 milliseconds
Server response object: [{"status": "Failure", "message": "Get https://<worker-node-ip>:10250/containerLogs/default/<pod-name>/<container-name>: dial tcp <worker-node-ip>:10250: i/o timeout", "code": 500 }]
I can get information about the pod by running kubectl describe pod <pod-name> and confirm the status as Running. Any ideas on how to identify exactly what is causing this error and/or how to fix it?
Probably, you didn't install a network add-on in your Kubernetes cluster. It's not included in a kubeadm installation, but it's required for communication between pods scheduled on different nodes. The most popular are Calico and Flannel. As you already have a cluster, you may want to choose the network add-on that uses the same subnet you specified with kubeadm init --pod-network-cidr=xx.xx.xx.xx/xx during cluster initialization:
- 192.168.0.0/16 is the default for the Calico network add-on
- 10.244.0.0/16 is the default for the Flannel network add-on
You can change it by downloading the corresponding YAML file and replacing the default subnet with the subnet you want, then applying it with kubectl apply -f filename.yaml.
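For example, a rough sketch with Flannel (the manifest URL below is an assumption and may change between releases, so verify it against the Flannel documentation):
# download the manifest
curl -LO https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
# if your --pod-network-cidr was not 10.244.0.0/16, edit the "Network" field
# in the kube-flannel-cfg ConfigMap inside the file before applying it
kubectl apply -f kube-flannel.yml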
I have two servers, one for production and one for test, and I've been trying to install Metricbeat on both.
I installed it on my test server and set it to send logs to my Amazon Elasticsearch Service domain, and now I can see all the data from that server in Kibana. I followed the same process on my production server, and when I run sudo metricbeat -e it starts, but I get:
ERROR pipeline/output.go:74 Failed to connect: Get https://amazon-endpoint request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
But the config for both servers is the same.