DCOS Mesos DNS is not resolving masters and leaders - mesos

Could you please help me resolve this issue? I have been struggling with it for more than three days and have not been able to fix it.
I am configuring a DC/OS installation following the guidance at https://dcos.io/docs/1.7/administration/installing/custom/advanced/
Unfortunately, the DC/OS DNS server is not working properly.
1) Below is the output of the nslookup command:
# nslookup leader.mesos
;; Warning: query response not set
;; Got SERVFAIL reply from 198.51.100.1, trying next server
;; Warning: query response not set
;; Got SERVFAIL reply from 198.51.100.2, trying next server
;; Warning: query response not set
Server: 198.51.100.3
Address: 198.51.100.3#53
** server can't find leader.mesos: SERVFAIL
2) Below are the contents of /opt/mesosphere/etc/mesos-dns.json:
{
"zk": "zk://127.0.0.1:2181/mesos",
"refreshSeconds": 30,
"ttl": 60,
"domain": "mesos",
"port": 61053,
"resolvers": ["172.31.0.2"],
"timeout": 5,
"listener": "0.0.0.0",
"email": "root.mesos-dns.mesos",
"IPSources": ["host", "netinfo"]
}
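Since the config above shows mesos-dns listening on port 61053, querying it directly (bypassing the resolver chain that nslookup walks) helps separate a broken mesos-dns from a broken resolver setup. Where dig is available, `dig @127.0.0.1 -p 61053 leader.mesos` does this; below is a stdlib-only sketch of the same idea (the helper names are mine, not part of DC/OS):

```python
import random
import socket
import struct

def build_query(name, txid=None):
    """Build a minimal DNS A-record query packet for `name`."""
    if txid is None:
        txid = random.randint(0, 0xFFFF)
    # Header: id, flags (only RD set), 1 question, 0 answer/authority/additional.
    header = struct.pack(">HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    # QNAME is a sequence of length-prefixed labels terminated by a zero byte.
    qname = b"".join(bytes([len(p)]) + p.encode() for p in name.split(".")) + b"\x00"
    return header + qname + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN

def query_rcode(name, server, port=53, timeout=3):
    """Return the response RCODE: 0 = NOERROR, 2 = SERVFAIL, 3 = NXDOMAIN."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(build_query(name), (server, port))
        data, _ = s.recvfrom(512)
    return data[3] & 0x0F  # low 4 bits of the flags' second byte

# e.g. query_rcode("leader.mesos", "127.0.0.1", 61053) on a master node
```

If the direct query succeeds while the system resolvers return SERVFAIL, the problem is in the resolver chain; if it also fails, mesos-dns itself (or its ZooKeeper connection, as the journal below suggests) is the culprit.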
3) Below is the output of journalctl -u dcos-mesos-dns -b:
19:29:50 Authentication failed: EOF
19:29:51 Failed to connect to 127.0.0.1:2181: dial tcp 127.0.0.1:2181: getsockopt: connection refused
19:29:52 Connected to 127.0.0.1:2181
19:29:52 Authenticated: id=98693002200481794, timeout=40000
service: Main process exited, code=exited, status=1/FAILURE
service: Unit entered failed state.
service: Failed with result 'exit-code'.
7/09/20 19:30:14 generator.go:124: no master
7/09/20 19:30:14 resolver.go:156: Warning: Error generating records: no master; keeping old DNS state
7/09/20 19:30:14 main.go:80: master detection timed out after 30s
service: Service hold-off time over, scheduling restart.
NS: DNS based Service Discovery.
DNS: DNS based Service Discovery...
19:30:19 Connected to 127.0.0.1:2181
19:30:19 Authenticated: id=98693008095641600, timeout=40000
DNS: DNS based Service Discovery...
service: Main process exited, code=exited, status=2/INVALIDARGUMENT
service: Unit entered failed state.
service: Failed with result 'exit-code'.
NS: DNS based Service Discovery.
DNS: DNS based Service Discovery...
19:39:57 Connected to 127.0.0.1:2181
19:39:57 Authenticated: id=98693008095641610, timeout=40000
Please ask if you need more logs...
Thank you very much!

I was able to fix this issue by following the latest version of the DC/OS installation guide: https://dcos.io/docs/1.10/installing/custom/advanced/
My genconf/config.yaml:
---
bootstrap_url: http://xx.xx.xx.xx:9090
cluster_name: 'ProjectName'
exhibitor_storage_backend: aws_s3
aws_access_key_id: xxxxxxxxxxxxxxxxxx
aws_secret_access_key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
s3_bucket: cp-v902
aws_region: us-east-1
exhibitor_explicit_keys: 'false'
s3_prefix: dcos
ip_detect_filename: genconf/ip-detect
dns_search: ec2.internal
master_discovery: master_http_loadbalancer
exhibitor_address: internal-osdc-123456789.us-east-1.elb.amazonaws.com
num_masters: 1
resolvers:
- XXXX.XXX.XX.XX

Related

Consul servers with TLS enabled but not required: how to check if clients are using TLS?

I'm enabling TLS connections for a Consul datacenter with 3 servers and more than 100 client nodes.
So I configured the servers' auto_encrypt with allow_tls set to true:
"ca_file": "/etc/consul/consul-agent-ca.pem",
"cert_file": "/etc/consul/dc1-server-consul-0.pem",
"key_file": "/etc/consul/dc1-server-consul-0-key.pem",
"auto_encrypt": {
"allow_tls": true
}
But I did not require TLS verification, because it's a dynamic environment: I add and remove clients on a daily basis, and I cannot afford to break reporting of node status.
"verify_outgoing": false,
"verify_server_hostname": false,
I need to identify which client nodes are connecting to the servers using TLS, and which are not, so I can configure the missing client nodes.
I have tried a few consul commands, without success:
consul members
consul info
consul operator raft list-peers
Any tips?
There is a workaround: the service log shows the Non-TLS connection attempts.
systemctl -l status consul
Sep 29 20:04:38 dev consul[25527]: 2021-09-29T20:04:38.280+0300 [WARN] agent.server.rpc: Non-TLS connection attempted with VerifyIncoming set: conn=from=57.243.24.234:57367
Sep 29 20:05:35 dev consul[25527]: 2021-09-29T20:05:35.845+0300 [WARN] agent.server.rpc: Non-TLS connection attempted with VerifyIncoming set: conn=from=127.45.234.453:60991
Sep 29 20:06:33 dev consul[25527]: 2021-09-29T20:06:33.193+0300 [WARN] agent.server.rpc: Non-TLS connection attempted with VerifyIncoming set: conn=from=45.34.30.345:62555
Sep 29 20:07:12 dev consul[25527]: 2021-09-29T20:07:12.051+0300 [WARN] agent.server.rpc: Non-TLS connection attempted with VerifyIncoming set: conn=from=234.45.24.345:57433
This is not a perfect solution, because I cannot query for TLS vs. non-TLS directly, but it helps to check whether there are clients without TLS enabled.
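Since the workaround leaves you scanning the journal by eye, a small script can reduce the log to a per-client count of non-TLS attempts, giving a worklist of nodes still to configure. A sketch, with the regex keyed to the WARN lines shown above:

```python
import re
from collections import Counter

# Matches the source IP in lines like:
#   ... [WARN] agent.server.rpc: Non-TLS connection attempted with
#   VerifyIncoming set: conn=from=57.243.24.234:57367
PATTERN = re.compile(r"Non-TLS connection attempted .*?conn=from=([\d.]+):\d+")

def non_tls_clients(log_text):
    """Return a Counter of client IP -> number of non-TLS attempts."""
    return Counter(PATTERN.findall(log_text))

sample = """\
[WARN] agent.server.rpc: Non-TLS connection attempted with VerifyIncoming set: conn=from=57.243.24.234:57367
[WARN] agent.server.rpc: Non-TLS connection attempted with VerifyIncoming set: conn=from=57.243.24.234:60991
[WARN] agent.server.rpc: Non-TLS connection attempted with VerifyIncoming set: conn=from=45.34.30.345:62555
"""
print(non_tls_clients(sample))  # counts per client IP
```

Feeding it `journalctl -u consul --no-pager` output (e.g. via `sys.stdin.read()`) lists every offending client once, instead of one log line per attempt.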

How to diagnose 'stuck at provisioning' problems in Rancher 2.x?

I'm attempting to set up a Rancher / Kubernetes dev lab on a set of four local virtual machines; however, when attempting to add nodes to the cluster, Rancher seems to be permanently stuck at 'Waiting to register with Kubernetes'.
From extensive Googling I suspect there is some kind of communications problem between the Rancher node and the other three, but I can't find how to diagnose it. The instructions for finding logs in Rancher 1.x don't apply to 2.x, and all the information I've found so far for 2.x appears to cover configuring logging for a working cluster, as opposed to where to find Rancher's own logs of its attempts to set up clusters.
So effectively two questions:
What is the best way to go about diagnosing this problem?
Where can I find Rancher's logs of its cluster-building activities?
Details of my setup:
Four identical VMs, all with Ubuntu 20.04 and Docker 20.10.5, all running under Proxmox on the same host and all can ping and ssh to each other. All have full Internet access.
Rancher 2.5.7 is installed on 192.168.0.180 with the other three nodes being 181-183.
Using "Global > Cluster > Add Cluster" I created a new cluster, using the default settings.
Rancher gives me the following command to execute on the nodes; this has been done, with no errors reported:
sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.5.7 --server https://192.168.0.180 --token (token) --ca-checksum (checksum) --etcd --controlplane --worker
According to the Rancher setup instructions Rancher should now configure and take control of the nodes, however nothing happens and the nodes continue to show "Waiting to register with Kubernetes".
I've exec'd into the Rancher container on .180 (docker exec -it (container-id) bash) and searched for the logs; however, the /var/lib/cattle directory, where the debug logs were found in older versions, is empty.
Update 2021-06-23
Having got nowhere with this I deleted the existing cluster attempt in Rancher, stopped all existing Docker processes on the nodes, and tried to create a new cluster, this time using one node each for etcd, controlplane, and worker, instead of all three doing all three tasks.
Exactly the same thing happens: Rancher forever says "Waiting to register with Kubernetes." Looking at the logs on node-1 (181), using docker ps to find the id and then docker logs to view them, I get this:
root@knode-1:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
3ca92e0ea581 rancher/rancher-agent:v2.5.7 "run.sh --server htt…" About a minute ago Up About a minute epic_goldberg
root@knode-1:~# docker logs 3ca92e0ea581
INFO: Arguments: --server https://192.168.0.180 --token REDACTED --ca-checksum 151f030e78c10cf8e2dad63679f6d07c166d2da25b979407a606dc195d08855e --etcd
INFO: Environment: CATTLE_ADDRESS=192.168.0.181 CATTLE_INTERNAL_ADDRESS= CATTLE_NODE_NAME=knode-1 CATTLE_ROLE=,etcd CATTLE_SERVER=https://192.168.0.180 CATTLE_TOKEN=REDACTED
INFO: Using resolv.conf: nameserver 127.0.0.53 options edns0
WARN: Loopback address found in /etc/resolv.conf, please refer to the documentation how to configure your cluster to resolve DNS properly
INFO: https://192.168.0.180/ping is accessible
INFO: Value from https://192.168.0.180/v3/settings/cacerts is an x509 certificate
time="2021-06-23T09:46:36Z" level=info msg="Listening on /tmp/log.sock"
time="2021-06-23T09:46:36Z" level=info msg="Rancher agent version v2.5.7 is starting"
time="2021-06-23T09:46:36Z" level=info msg="Option customConfig=map[address:192.168.0.181 internalAddress: label:map[] roles:[etcd] taints:[]]"
time="2021-06-23T09:46:36Z" level=info msg="Option etcd=true"
time="2021-06-23T09:46:36Z" level=info msg="Option controlPlane=false"
time="2021-06-23T09:46:36Z" level=info msg="Option worker=false"
time="2021-06-23T09:46:36Z" level=info msg="Option requestedHostname=knode-1"
time="2021-06-23T09:46:36Z" level=info msg="Connecting to wss://192.168.0.180/v3/connect/register with token rbdbrk8r7ncbvb9ktw9w669tj7q9xppb9scwxp9wj8zj25nhfq24s9"
time="2021-06-23T09:46:36Z" level=info msg="Connecting to proxy" url="wss://192.168.0.180/v3/connect/register"
time="2021-06-23T09:46:36Z" level=info msg="Waiting for node to register. Either cluster is not ready for registering or etcd and controlplane node have to be registered first"
time="2021-06-23T09:46:38Z" level=info msg="Starting plan monitor, checking every 15 seconds"
The only error showing appears to be the DNS one. I originally set the node's resolv.conf to use 1.1.1.1 and 8.8.4.4, so presumably the Docker install changed it; however, testing 127.0.0.53 against a range of domains and record types, it resolves DNS correctly, so I don't think that's the problem.
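For what it's worth, the agent's loopback warning is usually about containers rather than the host: 127.0.0.53 is systemd-resolved's stub listener, which works for processes on the host (and for the agent, which runs with --net=host) but is unreachable from containers with their own network namespace, which can matter once Kubernetes components start. A sketch of the same check the warning performs (the helper is mine, assuming only standard `nameserver` lines in resolv.conf):

```python
import ipaddress

def loopback_nameservers(resolv_conf_text):
    """Return the loopback nameserver addresses found in resolv.conf content."""
    hits = []
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            try:
                if ipaddress.ip_address(parts[1]).is_loopback:
                    hits.append(parts[1])
            except ValueError:
                pass  # skip malformed addresses
    return hits

print(loopback_nameservers("nameserver 127.0.0.53\noptions edns0\n"))  # → ['127.0.0.53']
```

So "resolves fine when tested on the host" and "the warning is legitimate" are not contradictory; on Ubuntu, /run/systemd/resolve/resolv.conf holds the real upstream servers.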
Help?

How to resolve microk8s port forwarding error on Vagrant VMs?

I have a 2-node microk8s cluster running on two Vagrant VMs (Ubuntu 20.04). I am trying to forward port 443 so I can connect to the dashboard from the host PC over the private VM network.
sudo microk8s kubectl port-forward -n kube-system service/kubernetes-dashboard 10443:443
receive the following error:
error: error upgrading connection: error dialing backend: dial tcp: lookup node-1: Temporary failure in name resolution
I also noticed that the internal IPs for the nodes are not correct: the master node is provisioned with an IP of 10.0.1.5 and the worker node with 10.0.1.10, but in the listing from kubectl both nodes have the same IP of 10.0.2.15.
I am not sure how to resolve this issue.
Note that I am able to access the dashboard login screen over HTTP on port 8001 by connecting to 10.0.1.5, but submitting the token does not do anything, as per the K8s security design:
Logging in is only available when accessing Dashboard over HTTPS or when domain is either localhost
or 127.0.0.1. It's done this way for security reasons.
I was able to get past this issue by adding the nodes to the /etc/hosts file on each node:
10.1.0.10 node-1
10.1.0.5 k8s-master
Then I was able to restart and issue the port-forward command:
sudo microk8s kubectl port-forward -n kube-system service/kubernetes-dashboard 10443:443 --address 0.0.0.0
Forwarding from 0.0.0.0:10443 -> 8443
Then I was able to access the K8s dashboard via the token auth method.
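Since the fix was hand-editing /etc/hosts on each node, a quick consistency check can confirm that every expected node name maps to the intended IP before retrying the port-forward. A sketch (the names and IPs are the ones from the answer above):

```python
def check_hosts(hosts_text, expected):
    """Return the subset of `expected` (name -> ip) not satisfied by the hosts file."""
    entries = {}
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, then tokenize
        if len(fields) >= 2:
            for name in fields[1:]:  # one IP may carry several hostnames
                entries[name] = fields[0]
    return {name: ip for name, ip in expected.items() if entries.get(name) != ip}

expected = {"node-1": "10.1.0.10", "k8s-master": "10.1.0.5"}
print(check_hosts("10.1.0.10 node-1\n", expected))  # → {'k8s-master': '10.1.0.5'}
```

Running it against each node's actual /etc/hosts (e.g. `check_hosts(open("/etc/hosts").read(), expected)`) returns empty when every entry is in place.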

Docker service tasks stuck in Preparing state after reboot on Windows

Restarting a Windows server that is a swarm worker causes Windows containers to get stuck in a "Preparing" state indefinitely once the server and docker daemon are back online.
Image of tasks/containers stuck in preparing state:
https://user-images.githubusercontent.com/4528753/65180353-4e5d6e80-da22-11e9-8060-451150865177.png
Steps to reproduce the issue:
1. Create a swarm (in my case I have CentOS 7 managers and a few Windows Server 1903 workers).
2. Create a "global" docker service that only runs on the Windows machines. They should start up fine initially and work just fine.
3. Drain one or more of the Windows nodes that are running the Windows container(s) from step 2 (docker node update --availability=drain nodename).
4. Restart one or more of the nodes that were drained in step 3, and wait for them to come back up.
5. Set the Windows node(s) back to active (docker node update --availability=active nodename).
At this point, just observe that the docker service created in step 2 will be "Preparing" the containers to start up on these nodes, and there it will stay (docker service ps servicename --no-trunc) -- you can observe this and run these commands from any master node:
memberlist: Refuting a suspect message (from: c9347e85405d)
memberlist: Failed to send ping: write udp 10.60.3.40:7946->10.60.3.110:7946: wsasendto: The requested address is not valid in its
context.
grpc: addrConn.createTransport failed to connect to {10.60.3.110:2377 0 <nil>}. Err :connection error: desc = "transport: Error while
dialing dial tcp 10.60.3.110:2377: connectex: A socket operation was attempted to an unreachable host.". Reconnecting... [module=grpc]
memberlist: Failed to send ping: write udp 10.60.3.40:7946->10.60.3.186:7946: wsasendto: The requested address is not valid in its
context.
grpc: addrConn.createTransport failed to connect to {10.60.3.110:2377 0 <nil>}. Err :connection error: desc = "transport: Error while
dialing dial tcp 10.60.3.110:2377: connectex: A socket operation was attempted to an unreachable host.". Reconnecting... [module=grpc]
agent: session failed [node.id=wuhifvg9li3v5zuq2xu7c6hxa module=node/agent error=rpc error: code = Unavailable desc = all SubConns are
in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.60.3.69:2377:
connectex: A socket operation was attempted to an unreachable host." backoff=6.3s]
Failed to send gossip to 10.60.3.110: write udp 10.60.3.40:7946->10.60.3.110:7946: wsasendto: The requested address is not valid in its
context.
Failed to send gossip to 10.60.3.69: write udp 10.60.3.40:7946->10.60.3.69:7946: wsasendto: The requested address is not valid in its
context.
Failed to send gossip to 10.60.3.105: write udp 10.60.3.40:7946->10.60.3.105:7946: wsasendto: The requested address is not valid in its
context.
Failed to send gossip to 10.60.3.69: write udp 10.60.3.40:7946->10.60.3.69:7946: wsasendto: The requested address is not valid in its
context.
Failed to send gossip to 10.60.3.186: write udp 10.60.3.40:7946->10.60.3.186:7946: wsasendto: The requested address is not valid in its
context.
Failed to send gossip to 10.60.3.105: write udp 10.60.3.40:7946->10.60.3.105:7946: wsasendto: The requested address is not valid in its
context.
Failed to send gossip to 10.60.3.186: write udp 10.60.3.40:7946->10.60.3.186:7946: wsasendto: The requested address is not valid in its
context.
Failed to send gossip to 10.60.3.69: write udp 10.60.3.40:7946->10.60.3.69:7946: wsasendto: The requested address is not valid in its
context.
Failed to send gossip to 10.60.3.105: write udp 10.60.3.40:7946->10.60.3.105:7946: wsasendto: The requested address is not valid in its
context.
Failed to send gossip to 10.60.3.109: write udp 10.60.3.40:7946->10.60.3.109:7946: wsasendto: The requested address is not valid in its
context.
Failed to send gossip to 10.60.3.69: write udp 10.60.3.40:7946->10.60.3.69:7946: wsasendto: The requested address is not valid in its
context.
Failed to send gossip to 10.60.3.110: write udp 10.60.3.40:7946->10.60.3.110:7946: wsasendto: The requested address is not valid in its
context.
memberlist: Failed to send gossip to 10.60.3.105:7946: write udp 10.60.3.40:7946->10.60.3.105:7946: wsasendto: The requested address is
not valid in its context.
memberlist: Failed to send gossip to 10.60.3.186:7946: write udp 10.60.3.40:7946->10.60.3.186:7946: wsasendto: The requested address is
not valid in its context.
Many of these errors are odd; for example, port 7946 is completely open between the cluster nodes (telnet confirms this).
I expect the docker service containers to start promptly, not get stuck in a Preparing state. The docker image is already pulled, so it should be fast.
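To watch for this condition across many services without scanning the full table by eye, `docker service ps` accepts a --format Go template; the sketch below filters that output down to the nodes whose tasks are stuck (the format string and helper are my own suggestion, not part of the original report):

```python
# Feed this the output of:
#   docker service ps servicename --format "{{.Node}} {{.CurrentState}}"
# CurrentState renders as e.g. "Preparing 2 hours ago" or "Running 5 minutes ago".
def stuck_preparing(ps_output):
    """Return the node names whose task state starts with 'Preparing'."""
    stuck = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)  # node name, then the rest of the state text
        if len(parts) == 2 and parts[1].startswith("Preparing"):
            stuck.append(parts[0])
    return stuck

sample = "SWARMWORKER1 Preparing 2 hours ago\nSWARMWORKER2 Running 2 hours ago\n"
print(stuck_preparing(sample))  # → ['SWARMWORKER1']
```

Run from a manager in a loop (or cron), this at least turns "is it stuck again?" into a one-line report per reboot.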
docker version output
Client: Docker Engine - Enterprise
Version: 19.03.2
API version: 1.40
Go version: go1.12.8
Git commit: c92ab06ed9
Built: 09/03/2019 16:38:11
OS/Arch: windows/amd64
Experimental: false
Server: Docker Engine - Enterprise
Engine:
Version: 19.03.2
API version: 1.40 (minimum version 1.24)
Go version: go1.12.8
Git commit: c92ab06ed9
Built: 09/03/2019 16:35:47
OS/Arch: windows/amd64
Experimental: false
docker info output
Client:
Debug Mode: false
Plugins:
cluster: Manage Docker clusters (Docker Inc., v1.1.0-8c33de7)
Server:
Containers: 4
Running: 0
Paused: 0
Stopped: 4
Images: 4
Server Version: 19.03.2
Storage Driver: windowsfilter
Windows:
Logging Driver: json-file
Plugins:
Volume: local
Network: ics l2bridge l2tunnel nat null overlay transparent
Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
Swarm: active
NodeID: wuhifvg9li3v5zuq2xu7c6hxa
Is Manager: false
Node Address: 10.60.3.40
Manager Addresses:
10.60.3.110:2377
10.60.3.186:2377
10.60.3.69:2377
Default Isolation: process
Kernel Version: 10.0 18362 (18362.1.amd64fre.19h1_release.190318-1202)
Operating System: Windows Server Datacenter Version 1903 (OS Build 18362.356)
OSType: windows
Architecture: x86_64
CPUs: 4
Total Memory: 8GiB
Name: SWARMWORKER1
ID: V2WJ:OEUM:7TUQ:WPIO:UOK4:IAHA:KWMN:RQFF:CAUO:LUB6:DJIJ:OVBX
Docker Root Dir: E:\docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Product License: this node is not a swarm manager - check license status on a manager node
Additional Details
These nodes are not using Docker Desktop for Windows. I am provisioning docker on the box primarily based on the PowerShell instructions here: https://docs.docker.com/install/windows/docker-ee/
Windows firewall is disabled
iptables/firewalld is disabled
Communication is completely open between the cluster nodes
Totally up-to-date on cumulative updates
I posted on the moby repo issues but never heard a peep:
https://github.com/moby/moby/issues/39955
The ONLY way I've found to temporarily fix the issue is to drain the node from the swarm, delete the docker files, reinstall the Windows "Containers" feature, and then rejoin the swarm. But it happens again on the next reboot.
What's interesting is that when I see a swarm task in a "Preparing" state on the Windows worker, the server doesn't seem to be doing anything at all; it's like the manager thinks the worker is preparing the container, but it isn't...
Anyone have any suggestions??

Kibana stopped working and now the server is not getting ready, although kibana.service starts up nicely

Without any major system update of my Ubuntu (4.4.0-142-generic #168-Ubuntu SMP), Kibana 7.2.0 stopped working. I am still able to start the service with sudo systemctl start kibana.service, and the corresponding status looks fine. There is only a warning and no error, so this does not seem to be the issue:
# sudo systemctl status kibana.service
● kibana.service - Kibana
Loaded: loaded (/etc/systemd/system/kibana.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2019-07-10 09:43:49 CEST; 22min ago
Main PID: 14856 (node)
Tasks: 21
Memory: 583.2M
CPU: 1min 30.067s
CGroup: /system.slice/kibana.service
└─14856 /usr/share/kibana/bin/../node/bin/node --no-warnings --max-http-header-size=65536 /usr/share/kibana/bin/../src/cli -c /etc/kibana/kibana.yml
Jul 10 09:56:36 srv003 kibana[14856]: {"type":"log","@timestamp":"2019-07-10T07:56:36Z","tags":["warning","task_manager"],"pid":14856,"message":"The task maps_telemetry \"Maps-maps_telemetry\" is not cancellable."}
Nevertheless, when I visit http://srv003:5601/ on my client machine, I keep seeing only (even after waiting 20 minutes):
Kibana server is not ready yet
On the server srv003 itself, I see
me@srv003:# curl -XGET http://localhost:5601/status -I
curl: (7) Failed to connect to localhost port 5601: Connection refused
This is strange, since Kibana seems to really be listening on that port, and the firewall is disabled for testing purposes:
root@srv003# sudo lsof -nP -i | grep 5601
node 14856 kibana 18u IPv4 115911041 0t0 TCP 10.0.0.72:5601 (LISTEN)
root@srv003# sudo ufw status verbose
Status: inactive
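One detail worth pulling out of that lsof line: the listen address is 10.0.0.72:5601, not 127.0.0.1:5601 or *:5601. Binding to a single non-loopback interface would by itself explain a connection refused on localhost (in Kibana, server.host in kibana.yml controls the bind address). A small sketch that extracts the bind addresses from lsof output (the helper is mine):

```python
import re

def listen_addresses(lsof_output, port):
    """Return the IPv4 addresses on which something LISTENs at `port`."""
    return re.findall(r"(\d+\.\d+\.\d+\.\d+):%d \(LISTEN\)" % port, lsof_output)

sample = "node 14856 kibana 18u IPv4 115911041 0t0 TCP 10.0.0.72:5601 (LISTEN)\n"
addrs = listen_addresses(sample, 5601)
print(addrs)                 # → ['10.0.0.72']
print("127.0.0.1" in addrs)  # → False: curl to localhost would be refused
```

If that's the cause, `curl http://10.0.0.72:5601/status -I` should answer even while localhost refuses.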
There is nothing suspicious in the log of kibana.service either:
root@srv003:/var/log# journalctl -u kibana.service | grep -A 99 "Jul 10 10:09:14"
Jul 10 10:09:14 srv003 systemd[1]: Started Kibana.
Jul 10 10:09:38 srv003 kibana[14856]: {"type":"log","@timestamp":"2019-07-10T08:09:38Z","tags":["warning","task_manager"],"pid":14856,"message":"The task maps_telemetry \"Maps-maps_telemetry\" is not cancellable."}
My Elasticsearch is still up and running. There is nothing interesting in the corresponding log files about Kibana:
root@srv003:/var/log# cat elasticsearch/elasticsearch.log | grep kibana
[2019-07-10T09:46:25,158][INFO ][o.e.c.m.MetaDataIndexTemplateService] [srv003] adding template [.kibana_task_manager] for index patterns [.kibana_task_manager]
[2019-07-10T09:47:32,955][INFO ][o.e.c.m.MetaDataCreateIndexService] [srv003] [.monitoring-kibana-7-2019.07.10] creating index, cause [auto(bulk api)], templates [.monitoring-kibana], shards [1]/[0], mappings [_doc]
Now I am running out of options, and I hope somebody can give me another hint.
Edit: I do not have any Kibana plugins installed.
Consulted sources:
How to fix "Kibana server is not ready yet" error when using AKS
Kibana service is running but can not access via browser to console
Why won't Kibana Node server start up?
https://discuss.elastic.co/t/failed-to-start-kibana-7-0-1/180259/3 - most promising thread, but nobody ever answered
https://discuss.elastic.co/t/kibana-server-is-not-ready-yet-issue-after-upgrade-to-6-5-0/157021
https://discuss.elastic.co/t/kibana-server-not-ready/162075
It looks like, if Kibana enters the described undefined state, a simple reboot of the machine is necessary. That is of course not acceptable for a (virtual or physical) machine where other services are running.
