How to diagnose 'stuck at provisioning' problems in Rancher 2.x?

I'm attempting to set up a Rancher / Kubernetes dev lab on a set of four local virtual machines; however, when I attempt to add nodes to the cluster, Rancher seems to be permanently stuck at 'Waiting to register with Kubernetes'.
From extensive Googling I suspect there is some kind of communications problem between the Rancher node and the other three, but I can't find how to diagnose it. The instructions for finding logs in Rancher 1.x don't apply to 2.x, and all the information I've found so far for 2.x appears to be about configuring logging for a working cluster, as opposed to where to find Rancher's own logs of its attempts to set up clusters.
So effectively two questions:
What is the best way to go about diagnosing this problem?
Where can I find Rancher's logs of its cluster-building activities?
Details of my setup:
Four identical VMs, all with Ubuntu 20.04 and Docker 20.10.5, all running under Proxmox on the same host and all can ping and ssh to each other. All have full Internet access.
Rancher 2.5.7 is installed on 192.168.0.180 with the other three nodes being 181-183.
Using "Global > Cluster > Add Cluster" I created a new cluster, using the default settings.
Rancher gives me the following command to execute on the nodes; this has been done, with no errors reported:
sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.5.7 --server https://192.168.0.180 --token (token) --ca-checksum (checksum) --etcd --controlplane --worker
According to the Rancher setup instructions, Rancher should now configure and take control of the nodes; however, nothing happens and the nodes continue to show "Waiting to register with Kubernetes".
I've exec'd into the Rancher container on .180 (docker exec -it (container-id) bash) and searched for the logs; however, the /var/lib/cattle directory, where the debug logs were found in older versions, is empty.
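From what I can tell, the 2.x server writes its logs to the container's stdout rather than to files, so presumably something like the following should show them (a guess on my part; the container id is whatever docker ps reports for the rancher/rancher image):
sudo docker logs -f (container-id)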
Update 2021-06-23
Having got nowhere with this, I deleted the existing cluster attempt in Rancher, stopped all existing Docker processes on the nodes, and tried to create a new cluster, this time using one node each for etcd, controlplane, and worker, instead of all three nodes doing all three tasks.
Exactly the same thing happens: Rancher just says "Waiting to register with Kubernetes" forever. Looking at the logs on node-1 (181), using docker ps to find the container id and then docker logs to view them, I get this:
root@knode-1:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
3ca92e0ea581 rancher/rancher-agent:v2.5.7 "run.sh --server htt…" About a minute ago Up About a minute epic_goldberg
root@knode-1:~# docker logs 3ca92e0ea581
INFO: Arguments: --server https://192.168.0.180 --token REDACTED --ca-checksum 151f030e78c10cf8e2dad63679f6d07c166d2da25b979407a606dc195d08855e --etcd
INFO: Environment: CATTLE_ADDRESS=192.168.0.181 CATTLE_INTERNAL_ADDRESS= CATTLE_NODE_NAME=knode-1 CATTLE_ROLE=,etcd CATTLE_SERVER=https://192.168.0.180 CATTLE_TOKEN=REDACTED
INFO: Using resolv.conf: nameserver 127.0.0.53 options edns0
WARN: Loopback address found in /etc/resolv.conf, please refer to the documentation how to configure your cluster to resolve DNS properly
INFO: https://192.168.0.180/ping is accessible
INFO: Value from https://192.168.0.180/v3/settings/cacerts is an x509 certificate
time="2021-06-23T09:46:36Z" level=info msg="Listening on /tmp/log.sock"
time="2021-06-23T09:46:36Z" level=info msg="Rancher agent version v2.5.7 is starting"
time="2021-06-23T09:46:36Z" level=info msg="Option customConfig=map[address:192.168.0.181 internalAddress: label:map[] roles:[etcd] taints:[]]"
time="2021-06-23T09:46:36Z" level=info msg="Option etcd=true"
time="2021-06-23T09:46:36Z" level=info msg="Option controlPlane=false"
time="2021-06-23T09:46:36Z" level=info msg="Option worker=false"
time="2021-06-23T09:46:36Z" level=info msg="Option requestedHostname=knode-1"
time="2021-06-23T09:46:36Z" level=info msg="Connecting to wss://192.168.0.180/v3/connect/register with token rbdbrk8r7ncbvb9ktw9w669tj7q9xppb9scwxp9wj8zj25nhfq24s9"
time="2021-06-23T09:46:36Z" level=info msg="Connecting to proxy" url="wss://192.168.0.180/v3/connect/register"
time="2021-06-23T09:46:36Z" level=info msg="Waiting for node to register. Either cluster is not ready for registering or etcd and controlplane node have to be registered first"
time="2021-06-23T09:46:38Z" level=info msg="Starting plan monitor, checking every 15 seconds"
The only error showing appears to be the DNS one. I originally set the nodes' resolv.conf to use 1.1.1.1 and 8.8.4.4, so presumably the Docker install changed it; however, testing 127.0.0.53 against a range of domains and record types, it resolves DNS correctly, so I don't think that's the problem.
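For what it's worth, 127.0.0.53 is the local stub resolver run by systemd-resolved on Ubuntu 20.04 (so it was probably systemd-resolved rather than Docker that put it there); the upstream servers it actually forwards to can be checked with something like:
resolvectl status
cat /run/systemd/resolve/resolv.conf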
Help?

Related

Windows RKE2 nodes networking isn't working

In AWS, I have created an RKE2 cluster using the Rancher 2.6.2 UI.
There are two Ubuntu 20.04 control plane nodes, and pods on these hosts can reach other pods and the internet.
My Windows node (Server 2019, 1809 Datacenter) joins the cluster without any issues; however, the Windows containers cannot seem to reach any other network with curl or ping.
Checking the process that RKE2 uses:
Get-Process -Name rke2,containerd,calico-node,kube-proxy,kubelet
calico-node is not running. I checked Event Viewer for the RKE2 application logs and found:
Felix exited: exit status 129
Is there somewhere to check for more detailed logs? RKE2 seems to have moved things around, and it doesn't line up with the Calico docs.
I have disabled source/dest checks on the network interfaces of all of the Kubernetes nodes in AWS.
I also created a Windows firewall rule to allow Kubelet on 10250.

Unable to setup external etcd cluster in Kubernetes v1.15 using kubeadm

I'm trying to set up a Kubernetes cluster with multiple masters and an external etcd cluster, following these steps as described on kubernetes.io. I was able to create the static pod manifest files on all 3 hosts in the /etc/kubernetes/manifests folder after executing Step 7.
After that, when I executed 'sudo kubeadm init', the initialization failed because of kubelet errors. I also checked the journalctl logs; the error points to a misconfiguration of the cgroup driver, which is similar to this SO link.
I tried what is suggested in the above SO link but was not able to resolve it.
Please help me in resolving this issue.
For the installation of Docker, kubeadm, kubectl and kubelet, I followed the kubernetes.io site only.
Environment:
Cloud: AWS
EC2 instance OS: Ubuntu 18.04
Docker version: 18.09.7
Thanks
After searching a few links and doing a few trials, I was able to resolve this issue.
As given in the container runtime setup, the Docker cgroup driver is systemd, but the default cgroup driver of the kubelet is cgroupfs. Since the kubelet cannot detect the cgroup driver automatically (as noted in the kubernetes.io docs), we have to pass the cgroup driver explicitly when running the kubelet, like below:
cat << EOF > /etc/systemd/system/kubelet.service.d/20-etcd-service-manager.conf
[Service]
ExecStart=
ExecStart=/usr/bin/kubelet --cgroup-driver=systemd --address=127.0.0.1 --pod-manifest-path=/etc/kubernetes/manifests
Restart=always
EOF
systemctl daemon-reload
systemctl restart kubelet
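To confirm the two drivers now match, a quick check (assuming Docker is the container runtime) is to compare Docker's reported driver with the flag passed to the kubelet:
docker info 2>/dev/null | grep -i 'cgroup driver'     # should print: Cgroup Driver: systemd
ps aux | grep -o 'kubelet.*--cgroup-driver=[a-z]*'    # should show --cgroup-driver=systemd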
Moreover, there is no need to run sudo kubeadm init: since we are providing --pod-manifest-path to the kubelet, it runs etcd as a static pod.
For debugging, the kubelet logs can be checked using the command below:
journalctl -u kubelet -r
Hope it helps. Thanks.

How can I troubleshoot/fix an issue interacting with a running Kubernetes pod (timeout error)?

I have two EC2 instances, one running a Kubernetes master node and the other running the worker node. I can successfully create a pod from a deployment file that pulls a Docker image, and it starts with a status of "Running". However, when I try to interact with it, I get a timeout error.
Ex: kubectl logs <pod-name> -v6
Output:
Config loaded from file /home/ec2-user/.kube/config
GET https://<master-node-ip>:6443/api/v1/namespaces/default/pods/<pod-name> 200 OK in 11 milliseconds
GET https://<master-node-ip>:6443/api/v1/namespaces/default/pods/<pod-name>/log 500 Internal Server Error in 30002 milliseconds
Server response object: [{"status": "Failure", "message": "Get https://<worker-node-ip>:10250/containerLogs/default/<pod-name>/<container-name>: dial tcp <worker-node-ip>:10250: i/o timeout", "code": 500 }]
I can get information about the pod by running kubectl describe pod <pod-name> and confirm the status as Running. Any ideas on how to identify exactly what is causing this error and/or how to fix it?
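The 500 error above is the API server timing out when it connects to the kubelet on the worker to fetch the logs, so one way to narrow it down (a basic check; the IP is a placeholder) is to test that port directly from the master:
nc -vz <worker-node-ip> 10250
If that also times out, the kubelet port is most likely being blocked by a security group or firewall rather than by anything inside the pod.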
Probably you didn't install a network add-on in your Kubernetes cluster. It's not included in the kubeadm installation, but it's required for communication between pods scheduled on different nodes. The most popular are Calico and Flannel. As you already have a cluster, you may want to choose the network add-on that uses the same subnet as you stated with kubeadm init --pod-network-cidr=xx.xx.xx.xx/xx during cluster initialization.
192.168.0.0/16 is the default for the Calico network add-on
10.244.0.0/16 is the default for the Flannel network add-on
You can change it by downloading the corresponding YAML file and replacing the default subnet with the subnet you want. Then just apply it with kubectl apply -f filename.yaml
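For example, with Flannel the flow looks roughly like this (a sketch: kube-flannel.yml stands for whatever manifest the add-on's documentation points you at, and the CIDR edit only applies if yours differs from the default):
# edit the "Network" value under net-conf.json in kube-flannel.yml to match your --pod-network-cidr
kubectl apply -f kube-flannel.yml
# then watch the add-on pods come up on every node
kubectl get pods -n kube-system -o wide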

ibm-cloud-private DNS or Internet issues from inside the pods

I've been experimenting with an ICP instance (ICP 2.1.0.2): 1 master node and 2 worker nodes.
I noticed that the pods in my ICP Kubernetes cluster don't have outbound Internet connectivity (or are having DNS lookup issues).
For example, if I start up a busybox pod in my cluster and try to do "nslookup github.com" or "ping google.com", it fails.
kubectl run curl --image=radial/busyboxplus:curl -i --tty
[ root@curl-545bbf5f9c-gssbg:/ ]$ nslookup github.com
Server: 10.0.0.10
Address 1: 10.0.0.10
nslookup: can't resolve 'github.com'
I checked and saw that "kube-dns" (service, pod, daemonset.extensions, daemonset.apps) does appear to be running.
When I'm logged in (e.g. via SSH) to the ICP master and worker node machines, I am able to ping these external sites successfully.
Any suggestions for how to troubleshoot this problem? Thanks!
We had kind of the reverse problem: we could look up anything on the internet or other domains, but not the domain in which the cluster was deployed.
That turned out to be due to the vague documentation around what cluster_domain and cluster_CA_domain mean in config.yaml. But as a plus, we got to learn a bit more about those and about configuring kube-dns.
Basically, cluster_domain should be a private virtual domain for the cluster, for which kube-dns will be authoritative. For anything else it should use the host's resolv.conf nameservers as upstream servers. If you suspect that your DNS servers are not being used for public DNS, then you can update the kube-dns ConfigMap to specify the upstream servers that it should use.
https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/
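For example, the upstream servers can be set by patching the kube-dns ConfigMap, roughly like this (a sketch following the linked docs; the nameserver IPs are placeholders for whatever your hosts actually use):
cat << EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  upstreamNameservers: |
    ["8.8.8.8", "8.8.4.4"]
EOF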
This is assuming you have configured cluster_domain and cluster_CA_domain correctly, of course.
They should look something like
cluster_domain = mycluster.icp <----- could be "Mickey-mouse" for all it matters
cluster_CA_domain = icp.mycompany.com <----- the endpoint that portal/registry/api etc are accessible to users on

Why does the clock offset error on the host keep occurring again and again : cloudera

I have stopped ntpd and restarted it again, and have done an ntpdate pool.ntp.org. The error went away once and the hosts were healthy, but after some time the clock offset error came back.
I also observed that after doing the ntpdate, the Cloudera web interface stopped working. It reports a potential configuration mismatch and asks me to fix and restart Hue.
I have the Cloudera QuickStart VM with CentOS, set up on VMware.
Check that the /etc/ntp.conf file is the same across all nodes/masters
restart ntp
add the daemon with chkconfig and set it to on
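On CentOS (as in the QuickStart VM), those last two steps come down to roughly the following (a sketch; ntpq -p is only there to confirm the daemon is actually syncing with its peers):
chkconfig ntpd on
service ntpd restart
ntpq -p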
You can fix it by restarting the NTP service, which synchronizes the time with a central source.
You can do this by logging in as root from the command line and running service ntpd restart.
After about a minute the error in CM should go away.
Host Terminal
sudo su
service ntpd restart
The clock offset error occurs in Cloudera Manager if the host/node's NTP service could not be located or did not respond to a request for the clock offset.
Solution:
1) Identify the NTP server IP, or get the details of the NTP server IP, for your Hadoop cluster
2) On your Hadoop cluster nodes, edit /etc/ntp.conf
3) Add entries in ntp.conf:
server [NTP Server IP]
server xxx.xx.xx.x
4) Restart services. Execute:
service ntpd restart
5) Restart the cluster from Cloudera Manager
Note: If the problem still persists, reboot your Hadoop nodes and check the process.
Check with $ cat /etc/ntp.conf and make sure the configuration file is the same as on the other nodes
$ systemctl restart ntpd
$ ntpdc -np
$ ntpdate -u 0.centos.pool.ntp.org
$ hwclock --systohc
$ systemctl restart cloudera-scm-agent
After that, wait a few seconds to let it auto-configure.
