Helm chart installation of cert-manager timed out for AKS - Let's Encrypt

I'm trying to create two Kubernetes (AKS) clusters for a test environment using Azure DevOps. These clusters use Let's Encrypt certificates for their endpoints, so I'm automating the creation of these certificates using Helm charts.
For some reason, the cert-manager Helm installation task times out if I create two clusters around the same time.
I have tested the same release process with a single cluster, and there isn't a problem when I run my deployment.
The helm cert-manager installation command that runs is:
c:\agent\_work\_tool\helm\2.11.0\x64\windows-amd64\helm.exe install --set ingressShim.defaultIssuerName=letsencrypt-prod,ingressShim.defaultIssuerKind=ClusterIssuer,rbac.create=false,serviceAccount.create=false --name appl-cert-manager --wait stable/cert-manager
As I said, this command succeeds for the first cluster. I receive the message:
16:20:26.4583241Z cert-manager has been deployed successfully!
However, the second command runs for about 5 minutes and then I receive this message:
2018-11-08T16:28:14.4988796Z ##[error]Error: release appl-cert-manager failed: timed out waiting for the condition
Is this happening because the release name has to be globally unique?
Thanks!

In case someone has the same problem, there's a simple solution that works consistently for me.
Add a timeout argument to helm:
--timeout 600
The value is in seconds, so 600 gives a 10-minute timeout.
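For reference, this is roughly what the install command from the question looks like with the timeout added (same release name and values as above; adjust as needed):
helm install --set ingressShim.defaultIssuerName=letsencrypt-prod,ingressShim.defaultIssuerKind=ClusterIssuer,rbac.create=false,serviceAccount.create=false --name appl-cert-manager --wait --timeout 600 stable/cert-manager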

Related

DataHub installation on Minikube failing: "no matches for kind "PodDisruptionBudget" in version "policy/v1beta1"" on elasticsearch setup

I'm following the deployment guide for DataHub with Kubernetes from the documentation: https://datahubproject.io/docs/deploy/kubernetes
Setting up the local cluster with Minikube, I started by following the prerequisites section of the guide.
At first I tried to change some of the default values to try it locally (I had already installed it successfully on Google Kubernetes Engine, so I was trying different setups).
But on the first step of the installation I received the error:
Error: INSTALLATION FAILED: unable to build kubernetes objects from release manifest: resource mapping not found for name: "elasticsearch-master-pdb" namespace: "" from "": no matches for kind "PodDisruptionBudget" in version "policy/v1beta1"
ensure CRDs are installed first
The steps I followed after installing Minikube were the exact steps presented on the page:
helm repo add datahub https://helm.datahubproject.io/
helm install prerequisites datahub/datahub-prerequisites
The error happens on step 2.
At first I went back to the default configuration to check that the error wasn't caused by my new values, but it remained.
I expected that after following the exact default steps the installation would succeed locally, just like it did on GKE.
I got help browsing the DataHub Slack community and figured out a way to fix this error.
It was simply a Kubernetes version issue; I was able to fix it by forcing Minikube to start with Kubernetes version 1.19.0:
minikube start --kubernetes-version=v1.19.0
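If you want to confirm whether your cluster serves the API version the chart expects, a quick diagnostic (not part of the official guide) is:
kubectl api-versions | grep policy
On Kubernetes 1.19 this lists policy/v1beta1, which is what the elasticsearch-master-pdb manifest asks for; on newer clusters where policy/v1beta1 has been removed, only policy/v1 shows up.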

How can I make sure that Cloud Run waits for my Spring Boot application to start before denying the health check?

I am deploying my Spring Boot application as a compiled jar file running in a Docker container on GCP, and I deploy it through the gcloud CLI in my pipeline:
gcloud beta run deploy $SERVICE_NAME --image $IMAGE_NAME --region europe-north1 --project
This works and gives me the correct response when the application starts successfully. However, when there's an error and the application fails to start, I get:
Cloud Run error: The user-provided container failed to start and listen on the port defined provided by the PORT=8080 environment variable.
The next time the pipeline runs (with the errors fixed), the gcloud beta run deploy command fails and gives the same error as above, while the actual application runs without issues in Cloud Run. How can I solve this?
Currently I have to check Cloud Run manually, as I cannot trust my pipeline, and I have to run it twice to make it succeed. Any help will be appreciated. Let me know if you want any extra information.

Error syncing pod on starting Beam - Dataflow pipeline from docker

We are constantly getting an error while starting our Beam Golang SDK pipeline (driver program) from a Docker image, although it works when started from a local machine / VM instance. We are using the Dataflow runner for our pipeline and Kubernetes to deploy.
LOCAL SETUP:
We have the GOOGLE_APPLICATION_CREDENTIALS variable set with a service account for our GCP cluster. When running the job locally, it gets submitted to Dataflow and completes successfully.
DOCKER SETUP:
The build image used is FROM golang:1.14-alpine. When we package the same program with a Dockerfile and try to run it, it fails with the error:
User program exited: fork/exec /bin/worker: no such file or directory
Checking the Stackdriver logs for more details, we see this:
Error syncing pod 00014c7112b5049966a4242e323b7850 ("dataflow-go-job-1-1611314272307727-
01220317-27at-harness-jv3l_default(00014c7112b5049966a4242e323b7850)"),
skipping: failed to "StartContainer" for "sdk" with CrashLoopBackOff:
"back-off 2m40s restarting failed container=sdk pod=dataflow-go-job-1-
1611314272307727-01220317-27at-harness-jv3l_default(00014c7112b5049966a4242e323b7850)"
We found a reference to this error in the Dataflow common errors doc, but it is too generic to figure out what's failing. After multiple retries, we were able to rule out any permission / access related issues with the pods. Not sure what else could be the problem here.
After multiple attempts, we decided to start the job manually from a new Debian 10 based VM instance, and it worked. This brought to our attention that we were using an Alpine-based golang image in Docker, which may not have all the dependencies required to start the job.
On the golang Docker Hub page we found golang:1.14-buster, where buster is the codename for Debian 10. Using that for the docker build solved the issue. Self-answering here to help anyone else facing the same problem.
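As a rough illustration, the only change needed in our case was the base image; a minimal Dockerfile along these lines worked once it was based on buster (the binary name and paths here are made up for the example):
# Debian 10 based Go image instead of golang:1.14-alpine
FROM golang:1.14-buster
WORKDIR /app
COPY . .
# build the Beam driver program; the output name is just an example
RUN go build -o /app/driver .
ENTRYPOINT ["/app/driver"]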

Could not delete a DC/OS service that failed to deploy

I deployed a service in DC/OS (the service is Cassandra). The deployment failed and it kept retrying. Under DC/OS > Services > Tasks I could see that a new task was created every few minutes, but they all had the status "Failed". Under the Debug tab I could see the TASK_FAILED state with an error message about how I had misconfigured the service (I picked a user that does not exist).
So I wanted to destroy the service and start over again.
Under Services, I clicked the menu on the service and selected "Delete". The command was accepted and the status changed to "Deleting", but then it stayed there forever.
If I checked the Tasks tab, I could see that DC/OS was still attempting to start the service every few minutes.
Now how do I delete the service? Thanks!
As per the latest DC/OS Cassandra service docs, you should uninstall it using the dcos CLI:
dcos package uninstall --app-id=<service-name> cassandra
If you are using DC/OS 1.9 or an older version, then follow the steps below to uninstall the service:
$ MY_SERVICE_NAME=<service-name>
$ dcos package uninstall --app-id=$MY_SERVICE_NAME cassandra
$ dcos node ssh --master-proxy --leader "docker run mesosphere/janitor /janitor.py \
-r $MY_SERVICE_NAME-role \
-p $MY_SERVICE_NAME-principal \
-z dcos-service-$MY_SERVICE_NAME"
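To confirm the uninstall actually finished (just a sanity check, not from the docs), the package and service listings should no longer show the service:
$ dcos package list
$ dcos service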

ICP fails to start after machine reboot

I have ICP V2.1 installed in a RHEL VMware image. After rebooting the image, ICP fails to start with what appears to be the first known issue in the documentation (Kubernetes controller manager fails to start after a master or cluster restart). However, the prescribed resolution does not get my system going.
Here is the running pod list:
NAME READY STATUS RESTARTS AGE
calico-node-amd64-dtl47 2/2 Running 14 20h
filebeat-ds-amd64-mvcsj 1/1 Running 8 20h
k8s-etcd-192.168.232.131 1/1 Running 7 20h
k8s-mariadb-192.168.232.131 1/1 Running 7 20h
k8s-master-192.168.232.131 2/3 CrashLoopBackOff 15 17m
k8s-proxy-192.168.232.131 1/1 Running 7 20h
metering-reader-amd64-gkwt4 1/1 Running 7 20h
monitoring-prometheus-nodeexporter-amd64-sghrv 1/1 Running 7 20h
Deleting the k8s-master-192.168.232.131 pod and allowing it to restart only puts it back into the CrashLoopBackOff state. Here is the last line of the controller manager log:
F1029 23:55:07.345341 1 controllermanager.go:176] error building controller context: failed to get supported resources from server: unable to retrieve the complete list of server APIs: servicecatalog.k8s.io/v1alpha1: an error on the server ("Error: 'dial tcp 10.0.0.145:443: getsockopt: connection refused'\nTrying to reach: 'https://10.0.0.145:443/apis/servicecatalog.k8s.io/v1alpha1'") has prevented the request from succeeding
Removing the pod or removing the failed controller manager docker container directly has no effect. It seems like another service hasn't started yet, or failed to start. I've waited several hours to see if the issue resolves itself, but to no avail.
Thanks...
Before the fix in https://github.com/kubernetes/kubernetes/pull/49495, the Kubernetes controller manager failed to start if a registered extension apiserver was not ready. In ICP, the service catalog is implemented as an extension apiserver.
Usually after the ICP master is restarted, the kubelet starts the k8s management services first as static pods. After that, it gets pod/node/service information from the Kubernetes API server and then starts all the pods, including the catalog API service. In that case, the whole cluster recovers.
In your case, however, there is a race condition: when the kubelet gets the pod information from the Kubernetes API server and starts all the pods, it has not yet received the node information from the API server. As a result, the kubelet fails to start the catalog API service because its nodeSelector is not met, and the whole cluster fails to recover.
In the next release, ICP 2.1.0.1, Kubernetes will be upgraded to 1.8.2, which includes the fix from https://github.com/kubernetes/kubernetes/pull/49495, and the issue will be resolved completely.
Until then, you can try the following workaround.
Use the -s form of the kubectl commands below if your token has expired after the restart and you no longer have access to the GUI to re-establish it.
Delete the apiservices entry for v1alpha1.servicecatalog.k8s.io:
kubectl delete apiservices v1alpha1.servicecatalog.k8s.io
kubectl -s 127.0.0.1:8888 delete apiservices v1alpha1.servicecatalog.k8s.io
Delete the dead controller manager container:
docker rm <k8s controller manager>
Wait until the service catalog has started.
Recover the service catalog apiservices by re-registering the apiservice for v1alpha1.servicecatalog.k8s.io:
kubectl apply -f cluster/cfc-components/service-catalog/apiregistration.yaml
kubectl -s 127.0.0.1:8888 apply -f cluster/cfc-components/service-catalog/apiregistration.yaml
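Once re-registered, you can check that the apiservice is being served again (same -s form if needed):
kubectl get apiservices v1alpha1.servicecatalog.k8s.io
kubectl -s 127.0.0.1:8888 get apiservices v1alpha1.servicecatalog.k8s.io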
