Windows RKE2 nodes networking isn't working - windows

In AWS, I have created a RKE2 cluster using the Rancher 2.6.2 UI.
There are two Ubuntu 20.04 control plane nodes, and pods on these hosts can reach other pods/ the internet.
My Windows node (Server 2019, 1809 Datacenter) joins the cluster without any issues, however, the Windows containers cannot seem to reach any other network with curl or ping.
Checking the process that RKE2 uses:
Get-Process -Name rke2,containerd,calico-node,kube-proxy,kubelet
Calico-Node is not running. I checked event viewer for the RKE2 application logs and found:
Felix exited: exit status 129
Is there somewhere to check for more detailed logs? RKE2 seems to have things moved around, and it doesn't line up with the Calico docs.
I have disabled source/dest checks on the network interfaces of all of the Kubernetes nodes in AWS.
I also created a Windows firewall rule to allow Kubelet on 10250.

Related

Kubernetes - Add a Windows node to a Windows-based control plane

I have installed Docker Desktop and Kubernetes on a Windows machine.
When i run the kubectl get nodes command, I get the following output:
NAME STATUS ROLES AGE VERSION
docker-desktop Ready control-plane 2d1h v1.24.0
So my cluster/control-plane is running properly.
I have a second Windows machine on the same network (in fact its a VM) and I'm trying to add this second machine to the existing cluster.
From what I've seen the control-plane node has to have kubeadm installed but it seems it's only available for Linux.
Is there another tool for Windows-based clusters or is it not possible to do this?
Below are details of docker desktop from docker documentation.
Docker Desktop includes a standalone Kubernetes server and client, as well as Docker CLI integration that runs on your machine. The Kubernetes server runs locally within your Docker instance, is not configurable, and is a single-node cluster..
You can refer kubernetes documentation and create kubernetes cluster with all your windows machines.
The other windows machine can be joined into cluster. Please refer Kubernetes documentation for windows and install kubeadm and run kubeadm join ,which will bootstrap and join the node into kubernetes cluster.
It turns out that the control-plane can only run on a Linux node.
I suspect that the output from the kubectl get nodes command was from a control-plane running on the WSL that Docker-Desktop uses.
So the only option for running a master node on Windows, is to run in a Linux VM.

How to diagnose 'stuck at provisioning' problems in Rancher 2.x?

I'm attempting to set up a Rancher / Kubernetes dev lab on a set of four local virtual machines, however when attempting to add nodes to the cluster Rancher seems to be permanently stuck at 'Waiting to register with Kubernetes'.
From extensive Googling I suspect there is some kind of a communications problem between the Rancher node and the other three, however I can't find how to attempt to diagnose it, the instructions for finding logs in Rancher 1.x don't apply for 2.x and all the information I've so far found for 2.x appears to be on how to configure logging for a working cluster, as opposed to where to find Rancher's own logs of it's attempts to set up clusters.
So effectively two questions:
What is the best way to go about diagnosing this problem?
Where can I find Rancher's logs of it's cluster-building activities?
Details of my setup:
Four identical VMs, all with Ubuntu 20.04 and Docker 20.10.5, all running under Proxmox on the same host and all can ping and ssh to each other. All have full Internet access.
Rancher 2.5.7 is installed on 192.168.0.180 with the other three nodes being 181-183.
Using "Global > Cluster > Add Cluster" I created a new cluster, using the default settings.
Rancher gives me the following code to execute on the nodes, this has been done, with no errors reported:
sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.5.7 --server https://192.168.0.180 --token (token) --ca-checksum (checksum) --etcd --controlplane --worker
According to the Rancher setup instructions Rancher should now configure and take control of the nodes, however nothing happens and the nodes continue to show "Waiting to register with Kubernetes".
I've execed into the Rancher container on .180 "docker exec -it (container-id) bash" and searched for the logs, however the /var/lib/cattle directory where in older versions the debug logs were found, is empty.
Update 2021-06-23
Having got nowhere with this I deleted the existing cluster attempt in Rancher, stopped all existing Docker processes on the nodes, and tried to create a new cluster, this time using one node each for etcd, controlplane, and worker, instead of all three doing all three tasks.
Exactly the same thing happens, Rancher just forever says "Waiting to register with Kubernetes." Looking at the logs on node-1 (181) using docker ps to find the id and then docker logs to view them, I get this:
root#knode-1:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
3ca92e0ea581 rancher/rancher-agent:v2.5.7 "run.sh --server htt…" About a minute ago Up About a minute epic_goldberg
root#knode-1:~# docker logs 3ca92e0ea581
INFO: Arguments: --server https://192.168.0.180 --token REDACTED --ca-checksum 151f030e78c10cf8e2dad63679f6d07c166d2da25b979407a606dc195d08855e --etcd
INFO: Environment: CATTLE_ADDRESS=192.168.0.181 CATTLE_INTERNAL_ADDRESS= CATTLE_NODE_NAME=knode-1 CATTLE_ROLE=,etcd CATTLE_SERVER=https://192.168.0.180 CATTLE_TOKEN=REDACTED
INFO: Using resolv.conf: nameserver 127.0.0.53 options edns0
WARN: Loopback address found in /etc/resolv.conf, please refer to the documentation how to configure your cluster to resolve DNS properly
INFO: https://192.168.0.180/ping is accessible
INFO: Value from https://192.168.0.180/v3/settings/cacerts is an x509 certificate
time="2021-06-23T09:46:36Z" level=info msg="Listening on /tmp/log.sock"
time="2021-06-23T09:46:36Z" level=info msg="Rancher agent version v2.5.7 is starting"
time="2021-06-23T09:46:36Z" level=info msg="Option customConfig=map[address:192.168.0.181 internalAddress: label:map[] roles:[etcd] taints:[]]"
time="2021-06-23T09:46:36Z" level=info msg="Option etcd=true"
time="2021-06-23T09:46:36Z" level=info msg="Option controlPlane=false"
time="2021-06-23T09:46:36Z" level=info msg="Option worker=false"
time="2021-06-23T09:46:36Z" level=info msg="Option requestedHostname=knode-1"
time="2021-06-23T09:46:36Z" level=info msg="Connecting to wss://192.168.0.180/v3/connect/register with token rbdbrk8r7ncbvb9ktw9w669tj7q9xppb9scwxp9wj8zj25nhfq24s9"
time="2021-06-23T09:46:36Z" level=info msg="Connecting to proxy" url="wss://192.168.0.180/v3/connect/register"
time="2021-06-23T09:46:36Z" level=info msg="Waiting for node to register. Either cluster is not ready for registering or etcd and controlplane node have to be registered first"
time="2021-06-23T09:46:38Z" level=info msg="Starting plan monitor, checking every 15 seconds"
The only error showing appears to be the DNS one - I originally set the node's resolv.conf to use 1.1.1.1 and 8.8.4.4, so presumably the Docker install changed it, however testing 127.0.0.53 on a range of domains and records it resolves DNS correctly so I don't think that's the problem.
Help?

How can I troubleshoot/fix an issue interacting with a running Kubernetes pod (timeout error)?

I have two EC2 instances, one running a Kubernetes Master node and the other running the Worker node. I can successfully create a pod from a deployment file that pulls a docker image and it starts with a status of "Running". However when I try to interact with it I get a timeout error.
Ex: kubectl logs <pod-name> -v6
Output:
Config loaded from file /home/ec2-user/.kube/config
GET https://<master-node-ip>:6443/api/v1/namespaces/default/pods/<pod-name> 200 OK in 11 milliseconds
GET https://<master-node-ip>:6443/api/v1/namespaces/default/pods/<pod-name>/log 500 Internal Server Error in 30002 milliseconds
Server response object: [{"status": "Failure", "message": "Get https://<worker-node-ip>:10250/containerLogs/default/<pod-name>/<container-name>: dial tcp <worker-node-ip>:10250: i/o timeout", "code": 500 }]
I can get information about the pod by running kubectl describe pod <pod-name> and confirm the status as Running. Any ideas on how to identify exactly what is causing this error and/or how to fix it?
Probably, you didn't install any network add-on to your Kubernetes cluster. It's not included in kubeadm installation, but it's required to communicate between pods scheduled on different nodes. The most popular are Calico and Flannel. As you already have a cluster, you may want to chose the network add-on that uses the same subnet as you stated with kubeadm init --pod-network-cidr=xx.xx.xx.xx/xx during cluster initialization.
192.168.0.0/16 is default for Calico network addon
10.244.0.0/16 is default for Flannel network addon
You can change it by downloading corresponded YAML file and by replacing the default subnet with the subnet you want. Then just apply it with kubectl apply -f filename.yaml

ibm-cloud-private DNS or Internet issues from inside the pods

I've been experimenting with an ICP instance (ICP 2.1.0.2): 1 master node and 2 worker nodes.
I noticed that the pods in my ICP Kubernetes cluster don't have outbound Internet connectivity (or are having DNS lookup issues)
For example, If I start up a busybox pod in my cluster, and try to do "nslookup github.com" or "ping google.com" .. it fails..
kubectl run curl --image=radial/busyboxplus:curl -i --tty
root#curl-545bbf5f9c-gssbg:/ ]$ nslookup github.com
Server: 10.0.0.10
Address 1: 10.0.0.10
nslookup: can't resolve 'github.com'
I checked and saw that "kube-dns" (service, pod, daemonset.extensions, daemonset.apps) does appear to be running.
When I'm logged into (eg. SSH) to the ICP master and the worker nodes machines, I am able to ping these external sites successfully.
Any suggestions for how to troubleshoot this problem? Thanks!
We had kind of the reverse problem - where we could look up anything on internet or other domains, but not the domain in which the cluster was deployed.
That turned out to be the vague documentation around what cluster_domain and cluster_CA_domain mean in the config.yaml. But as a plus we got to learn a bit more about those and about configuring kube-dns.
Basically cluster_domain should be a private virtual domain to the cluster for which kube-dns will be authoritative. Anything else it should use the host's resolve.conf nameservers as upstream servers. If you suspect that your DNS servers are not being utilised for public DNS then you can update the kube-dns configMap to specify the upstream servers that it should use.
https://kubernetes.io/docs/tasks/administer-cluster/dns-custom-nameservers/
This is assuming you have configure cluster_domain, cluster_CA_domain correctly of course.
They should look something like
cluster_domain = mycluster.icp <----- could be "Mickey-mouse" for all it matters
cluster_CA_domain = icp.mycompany.com <----- the endpoint that portal/registry/api etc are accessible to users on

connecting to a mesos development cluster on ubuntu 14 machine via opevpn

can someone who has successfully settup a mesos development cluster on google cluster help me out.. i have the clutser running however I am having a hard time creating a vpn to conect to the clutser even though I have openvpn installed on my machine and I have downloded the openvpn file provided by mesos. I am using ubuntu 14. basically i have followed instruction to create the cluster but in order to access mesos, marathon I need to configure a vpn connection by using the openvpn file provided by mesosphere but I do not how to do it on ubuntu 14..
Have you tried openvpn --config <file.ovpn>?

Resources