IBM Cloud Private monitoring gets 502 bad gateway - ibm-cloud-private

The following containers are not starting after installing IBM Cloud Private. I had previously installed ICP without a Management node, then ran an 'uninstall', restarted the Docker service on all nodes, and started a new install.
The second install defines a Management node, Master/Proxy on a single node, and two Worker nodes.
Selecting the menu option Platform / Monitoring returns 502 Bad Gateway.
Event messages from the deployed containers:
Deployment - monitoring-prometheus
TYPE     SOURCE             COUNT  REASON            MESSAGE
Warning  default-scheduler  2113   FailedScheduling  No nodes are available that match all of the following predicates:: MatchNodeSelector (3), NoVolumeNodeConflict (4).
Deployment - monitoring-grafana
TYPE     SOURCE             COUNT  REASON            MESSAGE
Warning  default-scheduler  2097   FailedScheduling  No nodes are available that match all of the following predicates:: MatchNodeSelector (3), NoVolumeNodeConflict (4).
Deployment - rootkit-annotator
TYPE     SOURCE                  COUNT  REASON      MESSAGE
Normal   kubelet 169.53.226.142  125    Pulled      Container image "ibmcom/rootkit-annotator:20171011" already present on machine
Normal   kubelet 169.53.226.142  125    Created     Created container
Normal   kubelet 169.53.226.142  125    Started     Started container
Warning  kubelet 169.53.226.142  2770   BackOff     Back-off restarting failed container
Warning  kubelet 169.53.226.142  2770   FailedSync  Error syncing pod

The management console sometimes displays a 502 Bad Gateway Error after installation or rebooting the master node. If you recently installed IBM Cloud Private, wait a few minutes and reload the page.
If you rebooted the master node, take the following steps:
Configure the kubectl command line interface. See Accessing your IBM Cloud Private cluster by using the kubectl CLI.
Obtain the IP addresses of the icp-ds pods. Run the following command:
kubectl get pods -o wide -n kube-system | grep "icp-ds"
The output resembles the following text:
icp-ds-0 1/1 Running 0 1d 10.1.231.171 10.10.25.134
In this example, 10.1.231.171 is the IP address of the pod.
In high availability (HA) environments, an icp-ds pod exists for each master node.
From the master node, ping the icp-ds pods. Check the IP address for each icp-ds pod by running the following command for each IP address:
ping 10.1.231.171
If the output resembles the following text, you must delete the pod:
connect: Invalid argument
Delete each pod that you cannot reach:
kubectl delete pods icp-ds-0 -n kube-system
In this example, icp-ds-0 is the name of the unresponsive pod.
In HA installations, you might have to delete the pod for each master node.
Obtain the IP address of the replacement pod or pods. Run the following command:
kubectl get pods -o wide -n kube-system | grep "icp-ds"
The output resembles the following text:
icp-ds-0 1/1 Running 0 1d 10.1.231.172 10.10.2
Ping the pods again. Check the IP address for each icp-ds pod by running the following command for each IP address:
ping 10.1.231.172
If you can reach all icp-ds pods, you can access the IBM Cloud Private management console when that pod enters the available state.
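When there are several icp-ds pods, as in an HA cluster, the ping check in these steps can be scripted. The following is only a minimal sketch and not part of the documented procedure: the function names and the PING_CMD override are made up for illustration, and it assumes the pod IP is the sixth column of the -o wide output, as in the sample output.

```shell
# Print "<pod-name> <pod-ip>" for every icp-ds pod.
# Assumes the IP is column 6 of `kubectl get pods -o wide`, as in the sample output.
list_icp_ds_pods() {
  kubectl get pods -o wide -n kube-system | awk '/icp-ds/ { print $1, $6 }'
}

# Read "<pod-name> <pod-ip>" pairs on stdin and print the names of pods
# that do not answer a single ping. PING_CMD is a hypothetical override hook.
unreachable_pods() {
  while read -r name ip; do
    ${PING_CMD:-ping -c 1 -W 2} "$ip" </dev/null >/dev/null 2>&1 || echo "$name"
  done
}
```

Running list_icp_ds_pods | unreachable_pods on the master node prints the pods that would need kubectl delete pods <name> -n kube-system. Deletion is left as a manual step on purpose.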

kubernetes nodeport external ip not accessible

I have been trying to deploy a Spring Boot application on a Kubernetes cluster, but I cannot access the REST endpoint from outside the cluster.
Here are the steps I performed:
Set up the Kubernetes cluster using kubespray, following the guide - Kubernetes Cluster setup using Kubespray
Pushed the Spring Boot Docker image to Docker Hub
Created a Kubernetes deployment
vagrant@node1:~/spring-boot$ kubectl create deployment demo --image=rahulwagh17/kubernetes:jhooq-k8s-springboot
deployment.apps/demo created
Exposed the deployment with external IP = 1.1.1.1
kubectl expose deployment demo --type=LoadBalancer --name=demo-service --external-ip=1.1.1.1 --port=8080
service/demo-service exposed
This is how my deployment looks:
vagrant@node1:~/spring-boot$ kubectl get deployment
NAME READY UP-TO-DATE AVAILABLE AGE
demo 1/1 1 1 24s
This is how my services look:
vagrant@node1:~/spring-boot$ kubectl get service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
demo-service LoadBalancer 10.233.31.159 1.1.1.1 8080:30099/TCP 13s
kubernetes ClusterIP 10.233.0.1 <none> 443/TCP 23h
I can curl the REST endpoint from within the cluster without a problem:
vagrant@node1:~/spring-boot$ curl 10.233.31.159:8080/hello
Hello - Jhooq-k8s
The problem I am facing: when I try to curl the REST endpoint from outside the cluster, the request fails:
$ curl http://1.1.1.1:30099/hello
curl: (7) Failed to connect to 1.1.1.1 port 30099: Operation timed out
I am fairly new to Kubernetes, so any leads or suggestions are highly appreciated.
Please try the approach below.
Use the NodePort, which means NodeIP:NodePort. In this case, get the IP address of any node and then run:
curl http://$NODE_IP:30099/hello
You should then be able to access your service.
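To avoid looking up the node IP and port by hand, the same check can be sketched as a pair of shell functions. This is an illustration under assumptions rather than a definitive recipe: the function names are hypothetical, the nodes are assumed to report an InternalIP address, and that address has to be routable from wherever curl runs.

```shell
# Read whitespace-separated node IPs on stdin and print one URL per node.
# usage: nodeport_urls <node-port> <path>
nodeport_urls() {
  tr ' ' '\n' | while read -r ip || [ -n "$ip" ]; do
    if [ -n "$ip" ]; then
      echo "http://${ip}:$1$2"
    fi
  done
}

# Look up the service's node port and every node's InternalIP, then curl each URL.
# usage: probe_nodeport <service-name> <path>
probe_nodeport() {
  port=$(kubectl get service "$1" -o jsonpath='{.spec.ports[0].nodePort}')
  kubectl get nodes \
    -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}' |
    nodeport_urls "$port" "$2" |
    while read -r url; do
      curl --max-time 5 "$url"
    done
}
```

For the service in the question, the call would be probe_nodeport demo-service /hello.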

Why does search name resolution fail for elasticsearch-master-headless on Kubernetes 1.16?

I'm trying to get elasticsearch running on Kubernetes 1.16 with Helm 3 on GKE. I'm aware that both 1.16 and 3 aren't supported yet. I want to prepare a PR to make it compatible. I'm using the helm charts from https://github.com/elastic/helm-charts.
If I use the original chart 7.6.1, pod creation fails with the error create Pod elasticsearch-master-0 in StatefulSet elasticsearch-master failed error: pods "elasticsearch-master-0" is forbidden: unable to validate against any pod security policy: [spec.volumes[1]: Invalid value: "projected": projected volumes are not allowed to be used]. Therefore I created the following patch:
diff --git a/elasticsearch/values.yaml b/elasticsearch/values.yaml
index 053c020..fd9c37b 100755
--- a/elasticsearch/values.yaml
+++ b/elasticsearch/values.yaml
@@ -107,6 +107,7 @@ podSecurityPolicy:
       - secret
       - configMap
       - persistentVolumeClaim
+      - projected
 
 persistence:
   enabled: true
With this patch on master/d9ccb5a and tag 7.6.1 (tried both), the pods quickly become unhealthy due to failed to resolve host [elasticsearch-master-headless], caused by a java.net.UnknownHostException: elasticsearch-master-headless.
I don't understand why name resolution doesn't work, since as far as I know 1.16 introduces no change to how Kubernetes names are resolved. If I try to ping elasticsearch-master-headless from a shell in the pod started with kubectl exec, I can't reach it either.
I tried to contact the nameserver in /etc/resolv.conf with telnet because it allows specifying a specific port:
[elasticsearch@elasticsearch-master-1 ~]$ cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local us-central1-a.c.myproject.internal c.myproject.internal google.internal
nameserver 10.23.240.10
options ndots:5
[elasticsearch@elasticsearch-master-1 ~]$ telnet 10.23.240.10
Trying 10.23.240.10...
^C
[elasticsearch@elasticsearch-master-1 ~]$ telnet 10.23.240.10 53
Trying 10.23.240.10...
telnet: connect to address 10.23.240.10: Connection refused
I obfuscated the project ID with myproject.
The patch is already proposed to be merged upstream together with other changes at https://github.com/elastic/helm-charts/pull/496.
This is caused by the kube-dns pod crashing with:
F0315 20:01:02.464839 1 server.go:61] Failed to create a kubernetes client: open /var/run/secrets/kubernetes.io/serviceaccount/token: permission denied
Since Kubernetes 1.16 is only available in the rapid channel of GKE and it's a system pod, I consider this a bug.
I'll update this answer if I find the energy to file a bug.
Chances are that there is a firewall (firewalld) blocking 53/udp and 53/tcp, or an issue with the CoreDNS pod in the cluster where you are performing the test.
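Both answers point at the cluster DNS pods, so a quick way to confirm is to look at their readiness and recent logs. The sketch below is an illustration only: the function names are hypothetical, and it assumes the DNS pods carry the conventional k8s-app=kube-dns label, which your cluster may or may not use.

```shell
# Read `kubectl get pods` output on stdin (header line included) and print
# the names of pods whose READY column "x/y" has x different from y.
not_ready_pods() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# List DNS pods that are not fully ready, then show their recent log lines.
# Assumes the conventional k8s-app=kube-dns label on the DNS pods.
check_cluster_dns() {
  kubectl get pods -n kube-system -l k8s-app=kube-dns | not_ready_pods
  kubectl logs -n kube-system -l k8s-app=kube-dns --all-containers --tail=20
}
```

If check_cluster_dns prints a pod name, a permission error like the one quoted in the first answer should show up in the log lines that follow.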

How can I troubleshoot/fix an issue interacting with a running Kubernetes pod (timeout error)?

I have two EC2 instances, one running a Kubernetes Master node and the other running the Worker node. I can successfully create a pod from a deployment file that pulls a docker image and it starts with a status of "Running". However when I try to interact with it I get a timeout error.
Ex: kubectl logs <pod-name> -v6
Output:
Config loaded from file /home/ec2-user/.kube/config
GET https://<master-node-ip>:6443/api/v1/namespaces/default/pods/<pod-name> 200 OK in 11 milliseconds
GET https://<master-node-ip>:6443/api/v1/namespaces/default/pods/<pod-name>/log 500 Internal Server Error in 30002 milliseconds
Server response object: [{"status": "Failure", "message": "Get https://<worker-node-ip>:10250/containerLogs/default/<pod-name>/<container-name>: dial tcp <worker-node-ip>:10250: i/o timeout", "code": 500 }]
I can get information about the pod by running kubectl describe pod <pod-name> and confirm the status as Running. Any ideas on how to identify exactly what is causing this error and/or how to fix it?
You probably didn't install a network add-on in your Kubernetes cluster. It is not included in the kubeadm installation, but it is required for communication between pods scheduled on different nodes. The most popular are Calico and Flannel. As you already have a cluster, you may want to choose the network add-on that uses the same subnet you specified with kubeadm init --pod-network-cidr=xx.xx.xx.xx/xx during cluster initialization.
192.168.0.0/16 is the default for the Calico network add-on
10.244.0.0/16 is the default for the Flannel network add-on
You can change it by downloading the corresponding YAML file and replacing the default subnet with the subnet you want. Then apply it with kubectl apply -f filename.yaml
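The subnet replacement can be done with sed instead of editing the file by hand. This is a minimal sketch, assuming the manifest has already been downloaded to a local file; the function name set_pod_cidr and the file name kube-flannel.yml in the usage note are placeholders, not something taken from the answer.

```shell
# Print a copy of an add-on manifest with the default pod subnet replaced.
# usage: set_pod_cidr <manifest-file> <default-cidr> <new-cidr>
set_pod_cidr() {
  # Escape the dots so the default CIDR is matched literally, not as a regex.
  old=$(printf '%s' "$2" | sed 's/\./\\./g')
  sed "s|$old|$3|g" "$1"
}
```

For example, set_pod_cidr kube-flannel.yml 10.244.0.0/16 10.32.0.0/12 > filename.yaml writes the patched manifest, which can then be applied with kubectl apply -f filename.yaml. The new CIDR should be the one that was passed to kubeadm init.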

ibm-cloud-private DNS or Internet issues from inside the pods

I've been experimenting with an ICP instance (ICP 2.1.0.2): 1 master node and 2 worker nodes.
I noticed that the pods in my ICP Kubernetes cluster don't have outbound Internet connectivity (or are having DNS lookup issues).
For example, if I start a busybox pod in my cluster and try "nslookup github.com" or "ping google.com", it fails:
kubectl run curl --image=radial/busyboxplus:curl -i --tty
[ root@curl-545bbf5f9c-gssbg:/ ]$ nslookup github.com
Server: 10.0.0.10
Address 1: 10.0.0.10
nslookup: can't resolve 'github.com'
I checked and saw that "kube-dns" (service, pod, daemonset.extensions, daemonset.apps) does appear to be running.
When I'm logged in (e.g. via SSH) to the ICP master and worker node machines, I am able to ping these external sites successfully.
Any suggestions for how to troubleshoot this problem? Thanks!
We had roughly the reverse problem: we could look up anything on the Internet or in other domains, but not the domain in which the cluster was deployed.
That turned out to be caused by the vague documentation around what cluster_domain and cluster_CA_domain mean in config.yaml. As a plus, we got to learn a bit more about those settings and about configuring kube-dns.
Basically, cluster_domain should be a private virtual domain for the cluster, for which kube-dns will be authoritative. For anything else, it should use the nameservers in the host's resolv.conf as upstream servers. If you suspect that your DNS servers are not being used for public DNS, you can update the kube-dns ConfigMap to specify the upstream servers that it should use.
https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/
This assumes you have configured cluster_domain and cluster_CA_domain correctly, of course.
They should look something like
cluster_domain = mycluster.icp <----- could be "Mickey-mouse" for all it matters
cluster_CA_domain = icp.mycompany.com <----- the endpoint that portal/registry/api etc are accessible to users on
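For the kube-dns ConfigMap update mentioned above, the Kubernetes page linked in this answer describes an upstreamNameservers key holding a JSON list of addresses. The sketch below only prints such a ConfigMap; the function name is hypothetical, the addresses passed to it are placeholders for your own DNS servers, and whether ICP 2.1.0.2 reads this key is an assumption to verify against your cluster.

```shell
# Print a kube-dns ConfigMap that sets the upstream nameservers.
# usage: kube_dns_upstream_configmap <nameserver-ip> [<nameserver-ip> ...]
kube_dns_upstream_configmap() {
  # Build a JSON list such as "8.8.8.8", "8.8.4.4" from the arguments.
  list=$(printf '"%s", ' "$@")
  list=${list%, }
  cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  upstreamNameservers: |
    [$list]
EOF
}
```

A possible use is kube_dns_upstream_configmap 8.8.8.8 8.8.4.4 | kubectl apply -f -, after reviewing the existing ConfigMap with kubectl -n kube-system get configmap kube-dns -o yaml so that nothing already configured there is lost.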

Worker nodes not available

I have set up and installed IBM Cloud Private CE with two Ubuntu images in VirtualBox. I can ssh into both images, and from each one ssh into the other. The ICP dashboard shows only one active node; I was expecting two.
I explicitly ran the command (from a root user on master node):
docker run -e LICENSE=accept --net=host \
-v "$(pwd)":/installer/cluster \
ibmcom/cfc-installer install -l \
192.168.27.101
The result of this command seemed to be a successful addition of the worker node:
PLAY RECAP *********************************************************************
192.168.27.101 : ok=45 changed=11 unreachable=0 failed=0
But still the worker node isn't showing in the dashboard.
What should I be checking to ensure the worker node will work for the master node?
If you're using Vagrant to configure IBM Cloud Private, I'd highly recommend trying https://github.com/IBM/deploy-ibm-cloud-private
The project will use a Vagrantfile to configure a master/proxy and then provision 2 workers within the image using LXD. You'll get better density and performance on your laptop with this configuration over running two full Virtual Box images (1 for master/proxy, 1 for the worker).
You can check on your worker node with the following steps.
Check the status of the cluster nodes:
kubectl get nodes to check the status of the newly added worker node
If it is NotReady, check the kubelet log for an error message explaining why kubelet is not running properly:
ICP 2.1
systemctl status kubelet
ICP 1.2
docker ps -a|grep kubelet to get kubelet_containerid,
docker logs kubelet_containerid
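The node status check can be reduced to a small filter over the kubectl get nodes output. This is a sketch with a hypothetical function name, assuming the usual column layout where STATUS is the second column.

```shell
# Read `kubectl get nodes` output on stdin (header line included) and print
# the names of nodes whose STATUS column does not start with "Ready".
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}
```

Run it as kubectl get nodes | not_ready_nodes. A worker that never registered with the master does not appear in the kubectl get nodes output at all, so an empty result still needs to be compared against the nodes you expect.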
Run this to get kubectl working:
ln -sf /opt/kubernetes/hyperkube /usr/local/bin/kubectl
Then, on the master node, identify any failed pods in the setup. Run this to get details of the pods running in the environment:
kubectl -n kube-system get pods -o wide
To restart any failed ICP pods (this deletes every kube-system pod whose READY column starts with 0/):
txt="0/";ns="kube-system";type="pods"; kubectl -n $ns get $type | awk -v t="$txt" 'index($2, t) == 1 { print $1 }' | xargs kubectl -n $ns delete $type
Now check the nodes:
kubectl get nodes
Then check the cluster info reported by kubectl. Check whether it gives you https://localhost:8080 or https://masternodeip:8001:
kubectl cluster-info
Do you get any output? If not, log in to https://masternodeip:8443 with the admin login, click admin on the panel, and copy the "configure client" CLI settings. Paste them into a shell on your master node and run the command again:
kubectl cluster-info
