ICP fails to start after machine reboot - ibm-cloud-private

I have ICP V2.1 installed in a RHEL VMware image. After rebooting the image, ICP fails to start, in what appears to be the first known issue in the documentation (Kubernetes controller manager fails to start after a master or cluster restart). However, the prescribed resolution does not get my system going.
Here is the running pod list:
NAME                                             READY   STATUS             RESTARTS   AGE
calico-node-amd64-dtl47                          2/2     Running            14         20h
filebeat-ds-amd64-mvcsj                          1/1     Running            8          20h
k8s-etcd-192.168.232.131                         1/1     Running            7          20h
k8s-mariadb-192.168.232.131                      1/1     Running            7          20h
k8s-master-192.168.232.131                       2/3     CrashLoopBackOff   15         17m
k8s-proxy-192.168.232.131                        1/1     Running            7          20h
metering-reader-amd64-gkwt4                      1/1     Running            7          20h
monitoring-prometheus-nodeexporter-amd64-sghrv   1/1     Running            7          20h
Removing the k8s-master-192.168.232.131 pod and allowing it to restart only puts it back into the CrashLoopBackOff state. Here is the last line of the controller manager log:
F1029 23:55:07.345341 1 controllermanager.go:176] error building controller context: failed to get supported resources from server: unable to retrieve the complete list of server APIs: servicecatalog.k8s.io/v1alpha1: an error on the server ("Error: 'dial tcp 10.0.0.145:443: getsockopt: connection refused'\nTrying to reach: 'https://10.0.0.145:443/apis/servicecatalog.k8s.io/v1alpha1'") has prevented the request from succeeding
Removing the pod, or removing the failed controller manager docker container directly, has no effect. It seems like another service hasn't started yet, or failed to start. I've waited several hours to see if the issue resolves itself, but to no avail.
Thanks...

Before the fix in https://github.com/kubernetes/kubernetes/pull/49495, the Kubernetes controller manager failed to start if a registered extension API server was not ready. In ICP, the service catalog is implemented as an extension API server.
Usually, after the ICP master is restarted, kubelet first starts the Kubernetes management services as static pods. It then gets the pod, node, and service information from the Kubernetes API server and starts all the pods, including the catalog API service. In that case, the whole cluster recovers.
In your case, however, there is a race condition: when kubelet gets the pod information from the Kubernetes API server and starts the pods, it has not yet received the node information. As a result, kubelet fails to start the catalog API service because its nodeSelector is not met, and the cluster fails to recover.
In the next release, ICP 2.1.0.1, Kubernetes will be upgraded to 1.8.2, which includes the fix from https://github.com/kubernetes/kubernetes/pull/49495. That should resolve the issue completely.
Until then, you could try the following workaround.
Use the -s form of each kubectl command if your token has expired after the restart and you no longer have access to the GUI to re-establish it.
1. Delete the v1alpha1.servicecatalog.k8s.io apiservice:
kubectl delete apiservices v1alpha1.servicecatalog.k8s.io
or, with the -s flag:
kubectl -s 127.0.0.1:8888 delete apiservices v1alpha1.servicecatalog.k8s.io
2. Delete the dead controller manager container:
docker rm <k8s controller manager>
3. Wait until the service catalog has started.
4. Recover the service catalog by re-registering the v1alpha1.servicecatalog.k8s.io apiservice:
kubectl apply -f cluster/cfc-components/service-catalog/apiregistration.yaml
or, with the -s flag:
kubectl -s 127.0.0.1:8888 apply -f cluster/cfc-components/service-catalog/apiregistration.yaml
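If you need to repeat the workaround, the commands above could be wrapped in a small script. This is only a sketch: the variables DRY_RUN, KUBECTL_SERVER and CM_CONTAINER and the function names are made up for illustration and are not part of ICP. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper for the workaround above. DRY_RUN, KUBECTL_SERVER and
# CM_CONTAINER are names invented for this sketch, not ICP settings.
DRY_RUN="${DRY_RUN:-1}"                # 1 = only print the commands
KUBECTL_SERVER="${KUBECTL_SERVER:-}"   # e.g. 127.0.0.1:8888 if the token has expired

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Call kubectl, adding -s only when KUBECTL_SERVER is set.
kc() {
  if [ -n "$KUBECTL_SERVER" ]; then
    run kubectl -s "$KUBECTL_SERVER" "$@"
  else
    run kubectl "$@"
  fi
}

# Unregister the apiservice, then remove the dead controller manager container.
# CM_CONTAINER must hold that container's name or ID (look it up with docker ps -a).
unregister_catalog() {
  kc delete apiservices v1alpha1.servicecatalog.k8s.io
  run docker rm "${CM_CONTAINER:?set CM_CONTAINER first}"
}

# Re-register the apiservice; call this only once the service catalog is up again.
reregister_catalog() {
  kc apply -f cluster/cfc-components/service-catalog/apiregistration.yaml
}
```

After sourcing the file, you would set CM_CONTAINER, call unregister_catalog, wait for the service catalog to start, and then call reregister_catalog. Set DRY_RUN=0 to actually execute the commands.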

Related

Error syncing pod on starting Beam - Dataflow pipeline from docker

We are constantly getting an error when starting our Beam Golang SDK pipeline (driver program) from a Docker image; the same pipeline works when started from a local machine or a VM instance. We are using the Dataflow runner for our pipeline and Kubernetes to deploy.
LOCAL SETUP:
We have the GOOGLE_APPLICATION_CREDENTIALS variable set with the service account for our GCP cluster. When we run the job locally, it is submitted to Dataflow and completes successfully.
DOCKER SETUP:
The build image is golang:1.14-alpine (FROM golang:1.14-alpine). When we package the same program with a Dockerfile and try to run it, it fails with this error:
User program exited: fork/exec /bin/worker: no such file or directory
On checking Stackdriver logs for more details, we see this:
Error syncing pod 00014c7112b5049966a4242e323b7850 ("dataflow-go-job-1-1611314272307727-01220317-27at-harness-jv3l_default(00014c7112b5049966a4242e323b7850)"), skipping: failed to "StartContainer" for "sdk" with CrashLoopBackOff: "back-off 2m40s restarting failed container=sdk pod=dataflow-go-job-1-1611314272307727-01220317-27at-harness-jv3l_default(00014c7112b5049966a4242e323b7850)"
We found a reference to this error in the Dataflow common errors doc, but it is too generic to work out what is failing. After multiple retries, we were able to rule out any permission or access issues from the pods. We are not sure what else could be the problem.
After multiple attempts, we decided to start the job manually from a new Debian 10 based VM instance, and it worked. This made us realize that the Alpine-based golang image we were using in Docker may not have all the dependencies required to start the job.
On the golang Docker Hub page, we found golang:1.14-buster, where buster is the codename for Debian 10. Using that image for the docker build solved the issue. Self-answering here to help anyone else facing the same problem.

Failed to create kubernetes pod on a windows server 2019 node

I set up a master node on an Ubuntu 18.04 machine and followed the guide https://kubernetes.io/ja/docs/setup/production-environment/windows/user-guide-windows-nodes/ to register a Windows Server 2019 node to the cluster successfully.
The kubelet is now running in PowerShell and both nodes are Ready.
On the Windows machine, I ran "kubectl create deployment --image=XXX(A windows server 2019 image) webadmin-app" to create a deployment on the Windows node.
When creating the pod, kubelet reports the following log messages:
W0821 17:37:03.003768 99524 pod_container_deletor.go:77] Container "dee4daa76a9e60e0e68af75597092aa5cff517c7021a6ef7579f77f662f2a163" not found in pod's containers
W0821 17:37:03.071774 99524 helpers.go:289] Unable to create pod sandbox due to conflict. Attempting to remove sandbox "dee4daa76a9e60e0e68af75597092aa5cff517c7021a6ef7579f77f662f2a163"
E0821 17:37:03.108764 99524 remote_runtime.go:200] CreateContainer in sandbox "62ff282461eba2fae24a66b7d38ccca43b224c74320dbb5a0a4659b4c4446eb7" from runtime service failed: rpc error: code = Unknown desc = Error response from daemon: Conflict. The container name "/k8s_webadmin-site_webadmin-app-757c7455cf-nms75_default_7ac60567-f9e2-4c04-aead-c6957200c961_0" is already in use by container "dee4daa76a9e60e0e68af75597092aa5cff517c7021a6ef7579f77f662f2a163". You have to remove (or rename) that container to be able to reuse that name.
E0821 17:37:03.109762 99524 kuberuntime_manager.go:801] container start failed: CreateContainerError: Error response from daemon: Conflict. The container name "/k8s_webadmin-site_webadmin-app-757c7455cf-nms75_default_7ac60567-f9e2-4c04-aead-c6957200c961_0" is already in use by container "dee4daa76a9e60e0e68af75597092aa5cff517c7021a6ef7579f77f662f2a163". You have to remove (or rename) that container to be able to reuse that name.
E0821 17:37:03.113766 99524 pod_workers.go:191] Error syncing pod 7ac60567-f9e2-4c04-aead-c6957200c961 ("webadmin-app-757c7455cf-nms75_default(7ac60567-f9e2-4c04-aead-c6957200c961)"), skipping: failed to "StartContainer" for "webadmin-site" with CreateContainerError: "Error response from daemon: Conflict. The container name "/k8s_webadmin-site_webadmin-app-757c7455cf-nms75_default_7ac60567-f9e2-4c04-aead-c6957200c961_0" is already in use by container "dee4daa76a9e60e0e68af75597092aa5cff517c7021a6ef7579f77f662f2a163". You have to remove (or rename) that container to be able to reuse that name."
These failure messages keep appearing while the pod is being created.
So I listed the Docker containers with "docker ps" during pod creation. The kubelet seems to keep creating and removing containers (based on the specified image XXX).
How can I resolve this failure to create a deployment on the Windows node?
Update:
I can create a deployment if --image is set to mcr.microsoft.com/k8s/core/pause:1.2.0.
However, if I use another image based on Windows Nano Server, the deployment fails.
I captured some error logs from the kubelet output:
W0826 17:58:14.903662 127340 cri_stats_provider_windows.go:89] Failed to get HNS endpoint "" with error 'json: cannot unmarshal array into Go value of type hns.HNSEndpoint', continue to get stats for other endpoints
E0826 17:58:14.940665 127340 cni.go:364] Error adding default_webadmin-app-757c7455cf-5c5nf/9a37dab965d1b006c4c2ec70619b3a4bb42dd39dad8c0ce1baf39eab634da3d5 to network flannel/vxlan0: error while ProvisionEndpoint(9a37dab965d1b006c4c2ec70619b3a4bb42dd39dad8c0ce1baf39eab634da3d5_vxlan0,39EA65C2-C0F9-4870-8B7A-E2A1DBF5CD9D,9a37dab965d1b006c4c2ec70619b3a4bb42dd39dad8c0ce1baf39eab634da3d5): The virtual machine or container was forcefully exited.
E0826 17:58:14.942663 127340 cni_windows.go:59] error while adding to cni network: error while ProvisionEndpoint(9a37dab965d1b006c4c2ec70619b3a4bb42dd39dad8c0ce1baf39eab634da3d5_vxlan0,39EA65C2-C0F9-4870-8B7A-E2A1DBF5CD9D,9a37dab965d1b006c4c2ec70619b3a4bb42dd39dad8c0ce1baf39eab634da3d5): The virtual machine or container was forcefully exited.
W0826 17:58:14.943663 127340 docker_sandbox.go:400] failed to read pod IP from plugin/docker: networkPlugin cni failed on the status hook for pod "webadmin-app-757c7455cf-5c5nf_default": error while ProvisionEndpoint(9a37dab965d1b006c4c2ec70619b3a4bb42dd39dad8c0ce1baf39eab634da3d5_vxlan0,39EA65C2-C0F9-4870-8B7A-E2A1DBF5CD9D,9a37dab965d1b006c4c2ec70619b3a4bb42dd39dad8c0ce1baf39eab634da3d5): The virtual machine or container was forcefully exited.
From the error messages, it seems that the container image caused the pod to fail. Did I miss anything when setting up my own image? Why does the network plugin fail for my image but work for mcr.microsoft.com/k8s/core/pause:1.2.0?

helm chart installation of cert-manager timed out for AKS

I'm trying to create two Kubernetes (AKS) clusters for a test environment using Azure DevOps. These clusters use Let's Encrypt certificates for their endpoints, so I'm automating the creation of these certificates with helm charts.
For some reason, the cert-manager installation helm task times out if I create the two clusters at around the same time.
I have tested the same release process with a single cluster, and there isn't a problem when I run my deployment.
The helm cert-manager installation command that runs is:
c:\agent\_work\_tool\helm\2.11.0\x64\windows-amd64\helm.exe install --set ingressShim.defaultIssuerName=letsencrypt-prod,ingressShim.defaultIssuerKind=ClusterIssuer,rbac.create=false,serviceAccount.create=false --name appl-cert-manager --wait stable/cert-manager
As I said, this command succeeds for the first cluster. I receive the message:
16:20:26.4583241Z cert-manager has been deployed successfully!
However, the second command takes about 5 minutes. Then I receive this message:
2018-11-08T16:28:14.4988796Z ##[error]Error: release appl-cert-manager failed: timed out waiting for the condition
Is this happening because the name has to be globally unique?
thanks
In case someone has the same problem, there's a simple solution that works consistently for me.
Add a timeout argument to helm, for example:
--timeout 600
which I assume sets a 10-minute timeout.
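For reference, here is how the flag fits into the install command from the question. This is a sketch for Helm 2 (the question uses the 2.11.0 client), where --timeout is given in seconds and, as far as I know, defaults to 300, which would match the roughly 5-minute wait described above. The TIMEOUT_MINUTES and TIMEOUT_SECONDS variables are just illustrative names.

```shell
# Helm 2 syntax: --timeout takes a number of seconds.
TIMEOUT_MINUTES=10                         # hypothetical variable for this sketch
TIMEOUT_SECONDS=$((TIMEOUT_MINUTES * 60))  # 10 minutes expressed in seconds

# Print the command instead of running it; drop the leading "echo" to execute.
echo helm install \
  --set ingressShim.defaultIssuerName=letsencrypt-prod,ingressShim.defaultIssuerKind=ClusterIssuer,rbac.create=false,serviceAccount.create=false \
  --name appl-cert-manager \
  --wait --timeout "$TIMEOUT_SECONDS" \
  stable/cert-manager
```

Note that this only gives the slower deployment more time to become ready; it does not explain why the second cluster takes longer.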

Could not delete DC/OS service that failed to deploy

I deployed a service in DC/OS (the service is Cassandra). The deployment failed and kept retrying. Under DC/OS > Services > Tasks I could see a new task being created every few minutes, but they all had the status "Failed". Under the Debug tab I could see the TASK_FAILED state with an error message about how I had misconfigured the service (I picked a user that does not exist).
So I wanted to destroy the service and start over.
Under Services, I clicked the menu on the service and selected "Delete". The command was accepted and the status changed to "Deleting", but then it stayed there forever.
When I checked the Tasks tab, I could see that DC/OS was still attempting to start the server every few minutes.
Now how do I delete the service? Thanks!
According to the latest DC/OS Cassandra service docs, you should uninstall it using the dcos CLI:
dcos package uninstall --app-id=<service-name> cassandra
If you are using DC/OS 1.9 or older, follow the steps below to uninstall the service:
$ MY_SERVICE_NAME=<service-name>
$ dcos package uninstall --app-id=$MY_SERVICE_NAME cassandra
$ dcos node ssh --master-proxy --leader "docker run mesosphere/janitor /janitor.py \
-r $MY_SERVICE_NAME-role \
-p $MY_SERVICE_NAME-principal \
-z dcos-service-$MY_SERVICE_NAME"

osx openshift MountVolume.SetUp failed for volume with: exit status 1

I use the OpenShift client on macOS to get an OpenShift cluster up and running. I can log in on both the command line and the web console. But when I try to install an application, e.g. my own simple spring-boot application or the openshift/jenkins template, the deployment process gets stuck and I get a couple of errors:
Failed Sync: Error syncing pod
Failed mount: MountVolume.SetUp failed for volume "kubernetes.io/secret/972d63f8-bc0e-11e7-b3e2-025000000001-deployer-token-n8bdv" (spec.Name: "deployer-token-n8bdv") pod "972d63f8-bc0e-11e7-b3e2-025000000001" (UID: "972d63f8-bc0e-11e7-b3e2-025000000001") with: exit status 1
Failed mount: Unable to mount volumes for pod "jenkins-1-deploy_myproject(972d63f8-bc0e-11e7-b3e2-025000000001)": timeout expired waiting for volumes to attach/mount for pod "myproject"/"jenkins-1-deploy". list of unattached/unmounted volumes=[deployer-token-n8bdv]
Any hints? What does the exit status 1 mean? Or can someone point me to the source code where I can look up the MountVolume.SetUp method?
$ oc version
oc v3.6.0+c4dd4cf
kubernetes v1.6.1+5115d708d7
features: Basic-Auth
Server https://127.0.0.1:8443
openshift v3.6.0+c4dd4cf
kubernetes v1.6.1+5115d708d7
