I have a CoreOS cluster with 3 AWS EC2 instances. The cluster was set up using the CoreOS CloudFormation stack. After the cluster was up and running, I needed to update the auto-scaling configuration so the EC2 instances pick up an instance profile. I copied the existing auto-scaling launch configuration and updated the IAM role for the instances. Then I terminated the EC2 instances in the fleet, letting auto-scaling fire up new instances. The new instances indeed assumed their new roles; however, the cluster seems to have lost its machine information:
ip-10-214-156-29 ~ # systemctl -l status etcd.service
● etcd.service - etcd
Loaded: loaded (/usr/lib64/systemd/system/etcd.service; disabled)
Drop-In: /run/systemd/system/etcd.service.d
└─10-oem.conf, 20-cloudinit.conf
Active: activating (auto-restart) (Result: exit-code) since Wed 2014-09-24 18:28:58 UTC; 9s ago
Process: 14124 ExecStart=/usr/bin/etcd (code=exited, status=1/FAILURE)
Main PID: 14124 (code=exited, status=1/FAILURE)
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal systemd[1]: Unit etcd.service entered failed state.
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 INFO | d9a7cb8df4a049689de452b6858399e9 attempted to join via 10.252.78.43:7001 failed: fail checking join version: Client Internal Error (Get http://10.252.78.43:7001/version: dial tcp 10.252.78.43:7001: connection refused)
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 WARNING | d9a7cb8df4a049689de452b6858399e9 cannot connect to existing peers [10.214.135.35:7001 10.16.142.108:7001 10.248.7.66:7001 10.35.142.159:7001 10.252.78.43:7001]: fail joining the cluster via given peers after 3 retries
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 CRITICAL | fail joining the cluster via given peers after 3 retries
The same discovery token was used from cloud-init. https://discovery.etcd.io/<cluster token> shows 6 machines: 3 dead ones and 3 new ones. So it looks like the 3 new instances joined the cluster all right. The journalctl -u etcd.service logs show that etcd timed out on the dead instances and got connection refused from the new ones.
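(For reference, the raw machine list behind the discovery URL can be dumped directly; the token placeholder is the same one used in cloud-init:)
# lists the machines registered under this discovery token
curl -s https://discovery.etcd.io/<cluster token>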
journalctl -u etcd.service shows:
...
Sep 24 06:01:11 ip-10-35-142-159.us-west-2.compute.internal etcd[574]: [etcd] Sep 24 06:01:11.198 INFO | 5c4531d885df4d06ae2d369c94f4de11 attempted to join via 10.214.156.29:7001 failed: fail checking join version: Client Internal Error (Get http://10.214.156.29:7001/version: dial tcp 10.214.156.29:7001: connection refused)
etcdctl --debug ls
Cluster-Peers: http://127.0.0.1:4001 http://10.35.142.159:4001
Curl-Example: curl -X GET http://127.0.0.1:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://10.35.142.159:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://127.0.0.1:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://10.35.142.159:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Error: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Maybe this is not the right process for updating a cluster's configuration, but if the cluster does need auto-scaling for whatever reason (load-triggered, for example), will the fleet still be able to function with dead instances and new instances mixed in the pool?
How can I recover from this situation without tearing down and rebuilding?
Xueshan
In this scheme etcd will not retain a quorum of machines and can't operate successfully. The best approach for autoscaling would be to set up two groups of machines:
A fixed number (1-9) of etcd machines that will always be up. These are set up with a discovery token or static networking like normal (see the sketch at the end of this answer).
Your autoscaling group, which doesn't start etcd, but instead configures fleet (and any other tool) to use the fixed etcd cluster. You can do this in cloud-config. Here's an example that also sets some fleet metadata so you can schedule jobs specifically to the autoscaled machines if desired:
#cloud-config
coreos:
  fleet:
    metadata: "role=autoscale"
    etcd_servers: "http://:4001,http://:4001,http://:4001,http://:4001,http://:4001,http://:4001"
  units:
    - name: fleet.service
      command: start
The validator wouldn't let me put in any 10.x IP addresses in my answer (wtf!?) so be sure to replace those.
You must have at least one machine always running with the discovery token. As soon as all of them go down, the heartbeat will fail, no new machine will be able to join, and you will need a new token for machines to join the cluster.
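For completeness, the fixed etcd group mentioned above would use the usual discovery-token bootstrap. A minimal cloud-config sketch for those machines might look like this (generate a fresh token for it; $private_ipv4 is substituted by CoreOS on EC2):
#cloud-config
coreos:
  etcd:
    # new discovery token for the fixed etcd group (placeholder)
    discovery: https://discovery.etcd.io/<new cluster token>
    addr: $private_ipv4:4001
    peer-addr: $private_ipv4:7001
  units:
    - name: etcd.service
      command: start
    - name: fleet.service
      command: start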
I'm attempting to set up a Rancher / Kubernetes dev lab on a set of four local virtual machines; however, when attempting to add nodes to the cluster, Rancher seems to be permanently stuck at 'Waiting to register with Kubernetes'.
From extensive Googling I suspect there is some kind of communications problem between the Rancher node and the other three, but I can't find out how to diagnose it. The instructions for finding logs in Rancher 1.x don't apply to 2.x, and all the information I've found so far for 2.x appears to be about configuring logging for a working cluster, as opposed to where to find Rancher's own logs of its attempts to set up clusters.
So effectively two questions:
What is the best way to go about diagnosing this problem?
Where can I find Rancher's logs of its cluster-building activities?
Details of my setup:
Four identical VMs, all with Ubuntu 20.04 and Docker 20.10.5, all running under Proxmox on the same host and all can ping and ssh to each other. All have full Internet access.
Rancher 2.5.7 is installed on 192.168.0.180 with the other three nodes being 181-183.
Using "Global > Cluster > Add Cluster" I created a new cluster, using the default settings.
Rancher gives me the following command to execute on the nodes; this has been done, with no errors reported:
sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.5.7 --server https://192.168.0.180 --token (token) --ca-checksum (checksum) --etcd --controlplane --worker
According to the Rancher setup instructions, Rancher should now configure and take control of the nodes; however, nothing happens and the nodes continue to show "Waiting to register with Kubernetes".
I've exec'd into the Rancher container on .180 ("docker exec -it (container-id) bash") and searched for the logs; however, the /var/lib/cattle directory, where the debug logs were found in older versions, is empty.
Update 2021-06-23
Having got nowhere with this I deleted the existing cluster attempt in Rancher, stopped all existing Docker processes on the nodes, and tried to create a new cluster, this time using one node each for etcd, controlplane, and worker, instead of all three doing all three tasks.
Exactly the same thing happens: Rancher just forever says "Waiting to register with Kubernetes." Looking at the logs on node-1 (181), using docker ps to find the container id and then docker logs to view them, I get this:
root@knode-1:~# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
3ca92e0ea581 rancher/rancher-agent:v2.5.7 "run.sh --server htt…" About a minute ago Up About a minute epic_goldberg
root@knode-1:~# docker logs 3ca92e0ea581
INFO: Arguments: --server https://192.168.0.180 --token REDACTED --ca-checksum 151f030e78c10cf8e2dad63679f6d07c166d2da25b979407a606dc195d08855e --etcd
INFO: Environment: CATTLE_ADDRESS=192.168.0.181 CATTLE_INTERNAL_ADDRESS= CATTLE_NODE_NAME=knode-1 CATTLE_ROLE=,etcd CATTLE_SERVER=https://192.168.0.180 CATTLE_TOKEN=REDACTED
INFO: Using resolv.conf: nameserver 127.0.0.53 options edns0
WARN: Loopback address found in /etc/resolv.conf, please refer to the documentation how to configure your cluster to resolve DNS properly
INFO: https://192.168.0.180/ping is accessible
INFO: Value from https://192.168.0.180/v3/settings/cacerts is an x509 certificate
time="2021-06-23T09:46:36Z" level=info msg="Listening on /tmp/log.sock"
time="2021-06-23T09:46:36Z" level=info msg="Rancher agent version v2.5.7 is starting"
time="2021-06-23T09:46:36Z" level=info msg="Option customConfig=map[address:192.168.0.181 internalAddress: label:map[] roles:[etcd] taints:[]]"
time="2021-06-23T09:46:36Z" level=info msg="Option etcd=true"
time="2021-06-23T09:46:36Z" level=info msg="Option controlPlane=false"
time="2021-06-23T09:46:36Z" level=info msg="Option worker=false"
time="2021-06-23T09:46:36Z" level=info msg="Option requestedHostname=knode-1"
time="2021-06-23T09:46:36Z" level=info msg="Connecting to wss://192.168.0.180/v3/connect/register with token rbdbrk8r7ncbvb9ktw9w669tj7q9xppb9scwxp9wj8zj25nhfq24s9"
time="2021-06-23T09:46:36Z" level=info msg="Connecting to proxy" url="wss://192.168.0.180/v3/connect/register"
time="2021-06-23T09:46:36Z" level=info msg="Waiting for node to register. Either cluster is not ready for registering or etcd and controlplane node have to be registered first"
time="2021-06-23T09:46:38Z" level=info msg="Starting plan monitor, checking every 15 seconds"
The only error showing appears to be the DNS one. I originally set the nodes' resolv.conf to use 1.1.1.1 and 8.8.4.4, so presumably the Docker install changed it; however, testing 127.0.0.53 against a range of domains and record types shows it resolves DNS correctly, so I don't think that's the problem.
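(For reference, this is the kind of check meant here, assuming dig and resolvectl are available on the nodes; the test domain is arbitrary:)
# show which upstream resolvers the 127.0.0.53 stub forwards to
resolvectl status
# query the stub resolver directly for an arbitrary record
dig @127.0.0.53 rancher.com A +short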
Help?
I have set up HashiCorp Vault (v1.5.4) on Ubuntu 18.04. My backend is Consul (a single node running on the same server as Vault); the consul service is up.
My vault service fails to start
systemctl list-units --type=service | grep "vault"
vault.service loaded failed failed vault service
journalctl -xe -u vault
Oct 03 00:21:33 ubuntu2 systemd[1]: vault.service: Scheduled restart job, restart counter is at 5.
-- Subject: Automatic restarting of a unit has been scheduled
-- Unit vault.service has finished shutting down.
Oct 03 00:21:33 ubuntu2 systemd[1]: vault.service: Start request repeated too quickly.
Oct 03 00:21:33 ubuntu2 systemd[1]: vault.service: Failed with result 'exit-code'.
Oct 03 00:21:33 ubuntu2 systemd[1]: Failed to start vault service.
-- Subject: Unit vault.service has failed
vault config.json
"api_addr": "http://<my-ip>:8200",
storage "consul" {
address = "127.0.0.1:8500"
path = "vault"
},
Service config
StandardOutput=/opt/vault/logs/output.log
StandardError=/opt/vault/logs/error.log
cat /opt/vault/logs/error.log
cat: /opt/vault/logs/error.log: No such file or directory
cat /opt/vault/logs/output.log
cat: /opt/vault/logs/output.log: No such file or directory
sudo tail -f /opt/vault/logs/error.log
tail: cannot open '/opt/vault/logs/error.log' for reading: No such file or
directory
:/opt/vault/logs$ ls -al
total 8
drwxrwxr-x 2 vault vault 4096 Oct 2 13:38 .
drwxrwxr-x 5 vault vault 4096 Oct 2 13:38 ..
After much debugging, the issue turned out to be a silly goof-up mixing .hcl and .json syntax (they are so similar, yet different) from cutting and pasting between examples: since the config file is JSON, the storage stanza (as posted) needs to be in JSON format. The problem is of course compounded when the error message says nothing and there is nothing in the logs.
"storage": {
"consul": {
"address": "127.0.0.1:8500",
"path" : "vault"
}
},
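For contrast, the same stanza in pure HCL (valid only if the whole config file is written as HCL rather than JSON) would be:
# HCL form of the storage stanza; values match the JSON above
storage "consul" {
  address = "127.0.0.1:8500"
  path    = "vault"
}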
There were a couple of additional issues to sort out to get it going: setting disable_mlock: true and opening the firewall for 8200 with sudo ufw allow 8200/tcp.
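Putting the pieces together, the relevant parts of config.json in consistent JSON form end up looking roughly like this (my listener stanza is not shown in the question, so it is omitted here and sits alongside these keys):
{
  "api_addr": "http://<my-ip>:8200",
  "disable_mlock": true,
  "storage": {
    "consul": {
      "address": "127.0.0.1:8500",
      "path": "vault"
    }
  }
}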
Finally got it done (or rather, started).
Without any major system update of my Ubuntu (4.4.0-142-generic #168-Ubuntu SMP), Kibana 7.2.0 stopped working. I am still able to start the service with sudo systemctl start kibana.service, and the corresponding status looks fine. There is only a warning and no error, so this does not seem to be the issue:
# sudo systemctl status kibana.service
● kibana.service - Kibana
Loaded: loaded (/etc/systemd/system/kibana.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2019-07-10 09:43:49 CEST; 22min ago
Main PID: 14856 (node)
Tasks: 21
Memory: 583.2M
CPU: 1min 30.067s
CGroup: /system.slice/kibana.service
└─14856 /usr/share/kibana/bin/../node/bin/node --no-warnings --max-http-header-size=65536 /usr/share/kibana/bin/../src/cli -c /etc/kibana/kibana.yml
Jul 10 09:56:36 srv003 kibana[14856]: {"type":"log","@timestamp":"2019-07-10T07:56:36Z","tags":["warning","task_manager"],"pid":14856,"message":"The task maps_telemetry \"Maps-maps_telemetry\" is not cancellable."}
Nevertheless, when I visit http://srv003:5601/ on my client machine, I keep seeing only (even after waiting 20 minutes):
Kibana server is not ready yet
On the server srv003 itself, I see
me@srv003:# curl -XGET http://localhost:5601/status -I
curl: (7) Failed to connect to localhost port 5601: Connection refused
This is strange, since Kibana really seems to be listening on that port and the firewall is disabled for testing purposes:
root@srv003# sudo lsof -nP -i | grep 5601
node 14856 kibana 18u IPv4 115911041 0t0 TCP 10.0.0.72:5601 (LISTEN)
root@srv003# sudo ufw status verbose
Status: inactive
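(Note that the lsof line above shows the listener bound to 10.0.0.72 rather than 127.0.0.1, which on its own would explain the refused connection on localhost; the same status check against the bound address is probably the fairer test:)
# same check as above, but against the address Kibana is actually bound to
curl -XGET http://10.0.0.72:5601/status -I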
There is nothing suspicious in the log of kibana.service either:
root@srv003:/var/log# journalctl -u kibana.service | grep -A 99 "Jul 10 10:09:14"
Jul 10 10:09:14 srv003 systemd[1]: Started Kibana.
Jul 10 10:09:38 srv003 kibana[14856]: {"type":"log","@timestamp":"2019-07-10T08:09:38Z","tags":["warning","task_manager"],"pid":14856,"message":"The task maps_telemetry \"Maps-maps_telemetry\" is not cancellable."}
My Elasticsearch is still up and running. There is nothing interesting in the corresponding log files about Kibana:
root@srv003:/var/log# cat elasticsearch/elasticsearch.log |grep kibana
[2019-07-10T09:46:25,158][INFO ][o.e.c.m.MetaDataIndexTemplateService] [srv003] adding template [.kibana_task_manager] for index patterns [.kibana_task_manager]
[2019-07-10T09:47:32,955][INFO ][o.e.c.m.MetaDataCreateIndexService] [srv003] [.monitoring-kibana-7-2019.07.10] creating index, cause [auto(bulk api)], templates [.monitoring-kibana], shards [1]/[0], mappings [_doc]
Now I am running a bit out of options, and I hope somebody can give me another hint.
Edit: I do not have any Kibana plugins installed.
Consulted sources:
How to fix "Kibana server is not ready yet" error when using AKS
Kibana service is running but can not access via browser to console
Why won't Kibana Node server start up?
https://discuss.elastic.co/t/failed-to-start-kibana-7-0-1/180259/3 - most promising thread, but nobody ever answered
https://discuss.elastic.co/t/kibana-server-is-not-ready-yet-issue-after-upgrade-to-6-5-0/157021
https://discuss.elastic.co/t/kibana-server-not-ready/162075
It looks like once Kibana enters the described undefined state, a simple reboot of the machine is necessary. This is of course not acceptable for a (virtual or physical) machine where other services are running.
I'm trying to run Vagrant using libvirt as my provider. Using rsync is unbearable since I'm working with a huge shared directory, but Vagrant does succeed when the NFS setting is commented out and the standard rsync config is set:
config.vm.synced_folder ".", "/vagrant", mount_options: ['dmode=777','fmode=777']
Vagrant hangs forever on this step after running vagrant up:
==> default: Mounting NFS shared folders...
In my Vagrantfile I have this uncommented and the rsync config commented out, which turns NFS on.
config.vm.synced_folder ".", "/vagrant", type: "nfs"
When Vagrant is running, it echoes this out to the terminal:
Redirecting to /bin/systemctl status nfs-server.service
● nfs-server.service - NFS server and services
Loaded: loaded (/usr/lib/systemd/system/nfs-server.service; disabled; vendor preset: disabled)
Active: inactive (dead)
Redirecting to /bin/systemctl start nfs-server.service
Job for nfs-server.service failed. See "systemctl status nfs-server.service" and "journalctl -xe" for details.
Results of systemctl status nfs-server.service
dillon#localhost ~ $ systemctl status nfs-server.service
● nfs-server.service - NFS server and services
Loaded: loaded (/usr/lib/systemd/system/nfs-server.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Fri 2015-05-29 22:24:47 PDT; 22s ago
Process: 3044 ExecStart=/usr/sbin/rpc.nfsd $RPCNFSDARGS (code=exited, status=1/FAILURE)
Process: 3040 ExecStartPre=/usr/sbin/exportfs -r (code=exited, status=0/SUCCESS)
Main PID: 3044 (code=exited, status=1/FAILURE)
May 29 22:24:47 localhost.sulfur systemd[1]: Starting NFS server and services...
May 29 22:24:47 localhost.sulfur rpc.nfsd[3044]: rpc.nfsd: writing fd to kernel failed: errno 111 (Connection refused)
May 29 22:24:47 localhost.sulfur rpc.nfsd[3044]: rpc.nfsd: unable to set any sockets for nfsd
May 29 22:24:47 localhost.sulfur systemd[1]: nfs-server.service: main process exited, code=exited, status=1/FAILURE
May 29 22:24:47 localhost.sulfur systemd[1]: Failed to start NFS server and services.
May 29 22:24:47 localhost.sulfur systemd[1]: Unit nfs-server.service entered failed state.
May 29 22:24:47 localhost.sulfur systemd[1]: nfs-server.service failed.
The journalctl -xe log has a ton of stuff in it, so I won't post all of it here, but there are some lines in bold red:
May 29 22:24:47 localhost.sulfur rpc.mountd[3024]: Could not bind socket: (98) Address already in use
May 29 22:24:47 localhost.sulfur rpc.mountd[3024]: Could not bind socket: (98) Address already in use
May 29 22:24:47 localhost.sulfur rpc.statd[3028]: failed to create RPC listeners, exiting
May 29 22:24:47 localhost.sulfur systemd[1]: Failed to start NFS status monitor for NFSv2/3 locking..
Before I ran vagrant up, I used netstat -tulpn to check whether any process was binding to port 98 and did not see anything; in fact, while Vagrant was hanging I ran netstat -tulpn again to see what was binding to port 98 and still didn't see anything (checked as both the current user and root).
UPDATE: Haven't gotten any responses.
I haven't been able to figure out the issue. I tried using lxc instead, but it gets stuck on booting. I'd also prefer not to use VirtualBox, but the problem seems to lie with NFS, not the hypervisor. I'm going to try the rsync-auto feature Vagrant provides, but I'd prefer to get NFS working.
Looks like when using libvirt the user is given control over nfs and rpcbind, and Vagrant doesn't even try to touch those things like I had assumed it did. Running these solved my issue:
service rpcbind start
service nfs stop
service nfs start
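On a systemd-based host (which the status output in the question suggests), it is probably also worth enabling the units so they come up on boot; a small sketch using the unit names from the logs above:
# persist across reboots; unit names taken from the systemctl output in the question
sudo systemctl enable rpcbind.service nfs-server.service
sudo systemctl start rpcbind.service nfs-server.service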
The systemd unit dependencies of nfs-server.service contain rpcbind.target but not rpcbind.service.
One simple solution is to create a file /etc/systemd/system/nfs-server.service containing:
.include /usr/lib/systemd/system/nfs-server.service
[Unit]
Requires=rpcbind.service
After=rpcbind.service
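After creating that file, systemd needs to re-read its configuration before the new dependency takes effect:
# reload unit definitions and restart the service so Requires/After apply
sudo systemctl daemon-reload
sudo systemctl restart nfs-server.service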
On CentOS 7, all I needed to do was install the missing rpcbind, like this:
yum -y install rpcbind
systemctl enable rpcbind
systemctl start rpcbind
systemctl restart nfs-server
Took me over an hour to find out and try this though :)
Michel
I've had issues with NFS mounts using both the libvirt and the VirtualBox provider on Fedora 22. After a lot of gnashing of teeth, I managed to figure out that it was a firewall issue. Fedora seems to ship with a firewalld service by default. Stopping that service - sudo systemctl stop firewalld - did the trick for me.
Of course, ideally you would configure this firewall rather than disable it entirely, but I don't know how to do that.
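For anyone who prefers to keep firewalld running, opening the NFS-related services should in principle be enough; a sketch, assuming the stock firewalld service definitions (nfs, rpc-bind, mountd) are present:
# allow NFS and its RPC helpers through firewalld permanently
sudo firewall-cmd --permanent --add-service=nfs
sudo firewall-cmd --permanent --add-service=rpc-bind
sudo firewall-cmd --permanent --add-service=mountd
sudo firewall-cmd --reload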
I am creating a Fedora PCS cluster for HAProxy. I have it running on VMware, am following this guide, and get to the step of adding an IPaddr2 resource: http://clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/_adding_a_resource.html
The only difference is that I need my cluster heartbeat/comms on one NIC/subnet, and my shared resource IP on a different NIC/subnet.
My internal comms are Node1=192.168.0.1 and Node2=192.168.0.2, and my resource IP is 10.0.0.1.
How do I use this command in this situation:
pcs resource create ClusterIP ocf:heartbeat:IPaddr2 \
ip=192.168.0.120 cidr_netmask=32 op monitor interval=30s
If I add it as above, I get this:
[root@node-01 .pcs]# pcs status
Cluster name: mycluster
Last updated: Tue Oct 28 09:10:13 2014
Last change: Tue Oct 28 09:00:13 2014 via cibadmin on node-02
Stack: corosync
Current DC: node-02 (2) - partition with quorum
Version: 1.1.11-1.fc20-9d39a6b
2 Nodes configured
1 Resources configured
Online: [ node-01 node-02 ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Stopped
Failed actions:
ClusterIP_start_0 on node-01 'unknown error' (1): call=7, status=complete, last-rc-change='Tue Oct 28 09:00:13 2014', queued=0ms, exec=27ms
ClusterIP_start_0 on node-02 'unknown error' (1): call=6, status=complete, last-rc-change='Tue Oct 28 09:00:13 2014', queued=0ms, exec=27ms
First, you need to specify the network device as mentioned by Daniel, e.g.
pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=10.0.0.1 cidr_netmask=32 nic=eth0 op monitor interval=30s
Since you are running a two-node cluster, you don't have a fencing device, so you have to disable the STONITH setting; note that this is not recommended for a production environment.
pcs property set stonith-enabled=false
The virtual IP address should then be activated automatically.
#pcs status resources
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started:node-01
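To double-check that the address is actually plumbed on the active node, something like this on node-01 should show it (assuming the eth0 device passed above):
# the floating IP should appear as a secondary address on eth0
ip addr show eth0 | grep 10.0.0.1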
You need to specify the NIC. If your first NIC is eth0 and your second is eth1, you can create the resource like this:
pcs resource create ClusterIP ocf:heartbeat:IPaddr2 ip=10.0.0.1 cidr_netmask=32 nic=eth1:0 op monitor interval=30s
You can also use just eth1, but I prefer to use a sub-interface for my floating IP address. You can create more than one floating IP address on one NIC, but you need to configure each one on a unique sub-interface.
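For example, two floating addresses on the same physical NIC could be created along these lines (the resource names and the second address are purely illustrative):
# each floating IP gets its own sub-interface label (eth1:0, eth1:1)
pcs resource create ClusterIP-1 ocf:heartbeat:IPaddr2 ip=10.0.0.1 cidr_netmask=32 nic=eth1:0 op monitor interval=30s
pcs resource create ClusterIP-2 ocf:heartbeat:IPaddr2 ip=10.0.0.2 cidr_netmask=32 nic=eth1:1 op monitor interval=30s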