NFS Vagrant on Fedora 22 - vagrant

I'm trying to run Vagrant using libvirt as my provider. Syncing with rsync is unbearable since I'm working with a huge shared directory, but vagrant up does succeed when the NFS setting is commented out and the standard rsync config is set:
config.vm.synced_folder ".", "/vagrant", mount_options: ['dmode=777','fmode=777']
After running vagrant up, Vagrant hangs forever on this step:
==> default: Mounting NFS shared folders...
In my Vagrantfile I have this line uncommented and the rsync config commented out, which turns NFS on:
config.vm.synced_folder ".", "/vagrant", type: "nfs"
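For reference, Vagrant also exposes NFS tuning options on the synced folder; nfs_version and nfs_udp are documented options, and the values below are purely illustrative rather than a known fix:
config.vm.synced_folder ".", "/vagrant", type: "nfs", nfs_version: 4, nfs_udp: false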
While Vagrant is running, it echoes this to the terminal:
Redirecting to /bin/systemctl status nfs-server.service
● nfs-server.service - NFS server and services
Loaded: loaded (/usr/lib/systemd/system/nfs-server.service; disabled; vendor preset: disabled)
Active: inactive (dead)
Redirecting to /bin/systemctl start nfs-server.service
Job for nfs-server.service failed. See "systemctl status nfs-server.service" and "journalctl -xe" for details.
Results of systemctl status nfs-server.service
dillon@localhost ~ $ systemctl status nfs-server.service
● nfs-server.service - NFS server and services
Loaded: loaded (/usr/lib/systemd/system/nfs-server.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Fri 2015-05-29 22:24:47 PDT; 22s ago
Process: 3044 ExecStart=/usr/sbin/rpc.nfsd $RPCNFSDARGS (code=exited, status=1/FAILURE)
Process: 3040 ExecStartPre=/usr/sbin/exportfs -r (code=exited, status=0/SUCCESS)
Main PID: 3044 (code=exited, status=1/FAILURE)
May 29 22:24:47 localhost.sulfur systemd[1]: Starting NFS server and services...
May 29 22:24:47 localhost.sulfur rpc.nfsd[3044]: rpc.nfsd: writing fd to kernel failed: errno 111 (Connection refused)
May 29 22:24:47 localhost.sulfur rpc.nfsd[3044]: rpc.nfsd: unable to set any sockets for nfsd
May 29 22:24:47 localhost.sulfur systemd[1]: nfs-server.service: main process exited, code=exited, status=1/FAILURE
May 29 22:24:47 localhost.sulfur systemd[1]: Failed to start NFS server and services.
May 29 22:24:47 localhost.sulfur systemd[1]: Unit nfs-server.service entered failed state.
May 29 22:24:47 localhost.sulfur systemd[1]: nfs-server.service failed.
The journalctl -xe log has a ton of stuff in it, so I won't post all of it here, but a few lines stand out in bold red:
May 29 22:24:47 localhost.sulfur rpc.mountd[3024]: Could not bind socket: (98) Address already in use
May 29 22:24:47 localhost.sulfur rpc.mountd[3024]: Could not bind socket: (98) Address already in use
May 29 22:24:47 localhost.sulfur rpc.statd[3028]: failed to create RPC listeners, exiting
May 29 22:24:47 localhost.sulfur systemd[1]: Failed to start NFS status monitor for NFSv2/3 locking..
Before I ran vagrant up I checked with netstat -tulpn whether any process was bound to port 98 and did not see anything; while Vagrant was hanging I ran netstat -tulpn again to see what was binding to port 98 and still saw nothing (checked as both the current user and root).
UPDATE: I haven't gotten any responses and wasn't able to figure out the issue. I tried using lxc instead, but it gets stuck on booting. I'd also prefer not to use VirtualBox; the issue seems to lie with NFS rather than the hypervisor. I'm going to try the rsync-auto feature Vagrant provides, but I'd prefer to get NFS working.

It looks like when using libvirt the user is left in control of nfs and rpcbind, and Vagrant doesn't even try to touch those services like I had assumed it did. Running these solved my issue:
service rpcbind start
service nfs stop
service nfs start
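On Fedora's systemd those map to rpcbind.service and nfs-server.service (both visible in the logs above), so the equivalent should be:
sudo systemctl start rpcbind
sudo systemctl restart nfs-server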

The systemd unit dependencies of nfs-server.service contain rpcbind.target but not rpcbind.service.
One simple solution is to create a file /etc/systemd/system/nfs-server.service containing:
.include /usr/lib/systemd/system/nfs-server.service
[Unit]
Requires=rpcbind.service
After=rpcbind.service
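If you'd rather not shadow the whole unit file, the same two lines should also work as a drop-in (a sketch, assuming a systemd recent enough to provide systemctl edit):
sudo systemctl edit nfs-server.service
# add in the editor that opens:
[Unit]
Requires=rpcbind.service
After=rpcbind.service
# then apply it:
sudo systemctl restart nfs-server.service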

On CentOS 7, all I needed to do was install the missing rpcbind, like this:
yum -y install rpcbind
systemctl enable rpcbind
systemctl start rpcbind
systemctl restart nfs-server
Took me over an hour to find out and try this though :)
Michel

I've had issues with NFS mounts using both the libvirt and the VirtualBox provider on Fedora 22. After a lot of gnashing of teeth, I managed to figure out that it was a firewall issue. Fedora seems to ship with a firewalld service by default. Stopping that service - sudo systemctl stop firewalld - did the trick for me.
Of course, ideally you would configure the firewall rather than disable it entirely; I haven't worked out the exact rules myself.
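Something along these lines is probably the place to start (a sketch, assuming firewalld's stock nfs, rpc-bind and mountd service definitions; I haven't verified it):
sudo firewall-cmd --permanent --add-service=nfs --add-service=rpc-bind --add-service=mountd
sudo firewall-cmd --reload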

Related

dnsmasq can't bind listen-address

I tried to set up hostapd and dnsmasq to broadcast a WiFi network from a Raspberry Pi 3. I only want to broadcast the WiFi so devices can connect to an HTTP server running on the Raspberry Pi; no ethernet bridge is required.
I installed hostapd and dnsmasq and configured them as follows:
dhcpcd.conf
# A sample configuration for dhcpcd.
# See dhcpcd.conf(5) for details.
# Allow users of this group to interact with dhcpcd via the control socket.
#controlgroup wheel
# Inform the DHCP server of our hostname for DDNS.
hostname
# Use the hardware address of the interface for the Client ID.
clientid
# or
# Use the same DUID + IAID as set in DHCPv6 for DHCPv4 ClientID as per RFC4361.
# Some non-RFC compliant DHCP servers do not reply with this set.
# In this case, comment out duid and enable clientid above.
#duid
# Persist interface configuration when dhcpcd exits.
persistent
# Rapid commit support.
# Safe to enable by default because it requires the equivalent option set
# on the server to actually work.
option rapid_commit
# A list of options to request from the DHCP server.
option domain_name_servers, domain_name, domain_search, host_name
option classless_static_routes
# Respect the network MTU. This is applied to DHCP routes.
option interface_mtu
# Most distributions have NTP support.
#option ntp_servers
# A ServerID is required by RFC2131.
require dhcp_server_identifier
# Generate SLAAC address using the Hardware Address of the interface
#slaac hwaddr
# OR generate Stable Private IPv6 Addresses based from the DUID
slaac private
#denyinterfaces wlan0
# Example static IP configuration:
#interface eth0
#static ip_address=192.168.0.5/24
#static ip6_address=fd51:42f8:caae:d92e::ff/64
#static routers=192.168.0.5
#static domain_name_servers=192.168.0.5
interface wlan0
allow-hotplug wlan0
#iface wlan0 inet static
static ip_address=192.168.0.5/24
nohook wpa_supplicant
#netmask 255.255.255.0
#network 192.168.0.0
#broadcast 192.168.0.255
# It is possible to fall back to a static IP if DHCP fails:
# define static profile
#profile static_eth0
#static ip_address=192.168.1.23/24
#static routers=192.168.1.1
#static domain_name_servers=192.168.1.1
# fallback to static profile on eth0
#interface eth0
#fallback static_eth0
As you can see, I already tried different options using denyinterfaces and other things I found in different tutorials, but none of them worked.
hostapd.conf
interface=wlan0
driver=nl80211
ssid=****
hw_mode=g
channel=6
ieee80211n=1
wmm_enabled=1
ht_capab=[HT40][SHORT-GI-20][DSSS_CCK-40]
macaddr_acl=0
auth_algs=1
ignore_broadcast_ssid=0
wpa=2
wpa_key_mgmt=WPA-PSK
wpa_passphrase=****
rsn_pairwise=CCMP
dnsmasq.conf
interface=wlan0
listen-address=192.168.0.5
bind-interfaces
server=8.8.8.8
domain-needed
bogus-priv
dhcp-range=192.168.0.25,192.168.0.150,255.255.255.0,240h
Now I have two problems:
hostapd does not run on startup, although I did define the daemon_conf (see the snippet just below) and it works when I run hostapd /path/to/config
My main problem: dnsmasq is not running. When I try to start the service, it fails with the error cannot bind listen-address.
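For reference, the daemon_conf I mean is the DAEMON_CONF line in /etc/default/hostapd; this is the line I have there (the path is the usual default on Raspberry Pi OS):
# /etc/default/hostapd -- uncommented so the init scripts pick up the config
DAEMON_CONF="/etc/hostapd/hostapd.conf"
Back to the main problem, here is the service status: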
service dnsmasq status
● dnsmasq.service - dnsmasq - A lightweight DHCP and caching DNS server
Loaded: loaded (/lib/systemd/system/dnsmasq.service; enabled; vendor preset:
Active: failed (Result: exit-code) since Tue 2022-03-29 12:58:34 CEST; 17min
Process: 483 ExecStartPre=/usr/sbin/dnsmasq --test (code=exited, status=0/SUCC
Process: 491 ExecStart=/etc/init.d/dnsmasq systemd-exec (code=exited, status=2
Mar 29 12:58:33 raspberrypitop systemd[1]: Starting dnsmasq - A lightweight DHCP
Mar 29 12:58:33 raspberrypitop dnsmasq[483]: dnsmasq: syntax check OK.
Mar 29 12:58:34 raspberrypitop dnsmasq[491]: dnsmasq: failed to create listening socket for
Mar 29 12:58:34 raspberrypitop dnsmasq[491]: failed to create listening socket for 192.168.
Mar 29 12:58:34 raspberrypitop dnsmasq[491]: FAILED to start up
Mar 29 12:58:34 raspberrypitop systemd[1]: dnsmasq.service: Control process exit
Mar 29 12:58:34 raspberrypitop systemd[1]: dnsmasq.service: Failed with result '
Mar 29 12:58:34 raspberrypitop systemd[1]: Failed to start dnsmasq - A lightweig
I guess I messed up the configuration somehow, but since this is something new for me and there are many different tutorials for several different kinds of OSs and OS versions, it's very hard to understand what is going wrong.
Ok, figured it out myself.
In my case hostapd not starting automatically was actually causing the second issue, since it prevented the wlan0 interface from coming up.
I had to sudo systemctl unmask hostapd and reboot. dnsmasq would still not start, since it tried to start before hostapd had finished setting everything up, even when told to wait for hostapd.service. So I edited the dnsmasq.service systemd unit and added:
[Service]
Restart=always
RestartSec=2
So it tries to restart every 2 seconds until hostapd has done its job, and then everything works.
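Rather than editing the shipped unit file directly, the same two lines can also live in a drop-in so a package upgrade doesn't overwrite them (a sketch, assuming systemctl edit is available):
sudo systemctl edit dnsmasq.service
# add in the editor that opens:
[Service]
Restart=always
RestartSec=2
# then restart:
sudo systemctl restart dnsmasq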

Error starting Laravel Homestead after updating to 2.2.9

I have recently updated my Vagrant version to 2.2.9. When running the command vagrant up, I am now getting this error:
homestead: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)
homestead: Job for mariadb.service failed because the control process exited with error code.
homestead: See "systemctl status mariadb.service" and "journalctl -xe" for details.
I'm not sure what is causing this issue; I've updated VirtualBox, Vagrant and the Homestead package many times in the past without a problem.
My machine is on macOS Catalina 10.15.5.
I have tried uninstalling and re-installing, and I've also tried installing an older version of Vagrant. Everything results in the same error above. I'm not sure what to do next - any suggestions are greatly appreciated!
EDIT
Thank you, @Aminul!
Here is the output I get:
Status: "MariaDB server is down"
Jun 20 19:17:53 homestead mysqld[42962]: 2020-06-20 19:17:53 0 [Note] InnoDB: Starting shutdown...
Jun 20 19:17:54 homestead mysqld[42962]: 2020-06-20 19:17:54 0 [ERROR] Plugin 'InnoDB' init function returned error.
Jun 20 19:17:54 homestead mysqld[42962]: 2020-06-20 19:17:54 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
Jun 20 19:17:54 homestead mysqld[42962]: 2020-06-20 19:17:54 0 [Note] Plugin 'FEEDBACK' is disabled.
Jun 20 19:17:54 homestead mysqld[42962]: 2020-06-20 19:17:54 0 [ERROR] Could not open mysql.plugin table. Some plugins may be not loaded
Jun 20 19:17:54 homestead mysqld[42962]: 2020-06-20 19:17:54 0 [ERROR] Unknown/unsupported storage engine: InnoDB
Jun 20 19:17:54 homestead mysqld[42962]: 2020-06-20 19:17:54 0 [ERROR] Aborting
Jun 20 19:17:54 homestead systemd[1]: mariadb.service: Main process exited, code=exited, status=1/FAILURE
Jun 20 19:17:54 homestead systemd[1]: mariadb.service: Failed with result 'exit-code'.
Jun 20 19:17:54 homestead systemd[1]: Failed to start MariaDB 10.4.13 database server.
Running: mysql --version returns:
mysql Ver 15.1 Distrib 10.4.13-MariaDB, for debian-linux-gnu (x86_64) using readline 5.2
So clearly, it's saying that MariaDB is not started. I can research how to start that. I'm more curious though -- is this something that's happened to homestead? Or is this a result of something else? Normally, I can just vagrant up and everything is good to go. I worry that if I mess with things I'm setting myself up for failure down the road.
EDIT 2
When running this:
vagrant@homestead:~$ systemctl start mysqld.service
This is what I am prompted with:
==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ===
Authentication is required to start 'mariadb.service'.
Authenticating as: vagrant,,, (vagrant)
Password:
I'm not sure what the credentials are to keep testing.
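If the box follows the usual Vagrant convention, the vagrant user's password should simply be vagrant. Alternatively, prefixing the command with sudo sidesteps the polkit prompt, and journalctl shows the recent log directly:
sudo systemctl restart mariadb.service
sudo journalctl -u mariadb.service -n 50 --no-pager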
ADDITIONAL SOLUTION
Thank you, Raphy963!
I didn't want to answer my own question, and I was able to find another work-around that hopefully will help someone else.
The application I am working on is not yet in production, so I was able to change my database from MySQL to PostgreSQL.
I removed/uninstalled all instances of virtualbox, vagrant & homestead. I also removed the "VirtualBox VMs" directory.
I re-installed everything, starting with VirtualBox, Vagrant & then laravel/homestead. I am now running the latest versions of everything; using the Laravel documentation for instructions.
After everything was installed, running vagrant up did not produce errors; however, I was still not able to connect to MySQL.
I updated my Homestead.yaml file to the following:
---
ip: "10.10.10.10"
memory: 2048
cpus: 2
provider: virtualbox
authorize: ~/.ssh/id_rsa.pub
keys:
    - ~/.ssh/id_rsa
folders:
    - map: /Users/<username>/Sites
      to: /home/vagrant/sites
sites:
    - map: blog.test
      to: /home/vagrant/sites/blog/public
databases:
    - blog
    - homestead
features:
    - mariadb: false
    - ohmyzsh: false
    - webdriver: false
I updated my hosts file to this:
10.10.10.10 blog.test
Finally, using TablePlus I was able to connect with the settings below. My .env file in my Laravel application looks like this:
DB_CONNECTION=pgsql
DB_HOST=127.0.0.1
DB_PORT=5432
DB_DATABASE=blog
DB_USERNAME=homestead
DB_PASSWORD=secret
I am now able to connect using TablePlus and from my application.
Hope this helps someone!!
I was having the same issue and spent way too much time trying to fix it. I tried using the new release of Homestead from their GitHub repo (https://github.com/laravel/homestead) which claims to fix this exact issue but it didn't work.
After investigating on my own, I realized the scripts Vagrant uses for Homestead (this repo over here: https://github.com/laravel/settler) have been updated to "10.0.0-beta". I did the following to put it back to "9.5.1":
vagrant box remove laravel/homestead
vagrant box add laravel/homestead --box-version 9.5.1
Afterwards, I remade my instance by using vagrant destroy and vagrant up and MariaDB was up and running once more.
While this might not be the best solution, at least I got it to work which is good enough for me.
Hope it helped!
You will need to investigate what the cause is.
Log in to your instance by running vagrant ssh and run systemctl status mariadb.service to check the error log.
Check what the error is, and reply here if you don't understand it.
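For example (journalctl pulls the recent MariaDB log inside the box):
vagrant ssh
# then, inside the VM:
systemctl status mariadb.service
sudo journalctl -u mariadb.service -n 100 --no-pager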

Kibana stopped working and now server not getting ready although kibana.service starts up nicely

Without any major system update of my Ubuntu (4.4.0-142-generic #168-Ubuntu SMP), Kibana 7.2.0 stopped working. I am still able to start the service with sudo systemctl start kibana.service, and the corresponding status looks fine. There is only a warning and no error, so this does not seem to be the issue:
# sudo systemctl status kibana.service
● kibana.service - Kibana
Loaded: loaded (/etc/systemd/system/kibana.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2019-07-10 09:43:49 CEST; 22min ago
Main PID: 14856 (node)
Tasks: 21
Memory: 583.2M
CPU: 1min 30.067s
CGroup: /system.slice/kibana.service
└─14856 /usr/share/kibana/bin/../node/bin/node --no-warnings --max-http-header-size=65536 /usr/share/kibana/bin/../src/cli -c /etc/kibana/kibana.yml
Jul 10 09:56:36 srv003 kibana[14856]: {"type":"log","@timestamp":"2019-07-10T07:56:36Z","tags":["warning","task_manager"],"pid":14856,"message":"The task maps_telemetry \"Maps-maps_telemetry\" is not cancellable."}
Nevertheless, when I visit http://srv003:5601/ on my client machine, I keep seeing only (even after waiting 20 minutes):
Kibana server is not ready yet
On the server srv003 itself, I see
me@srv003:# curl -XGET http://localhost:5601/status -I
curl: (7) Failed to connect to localhost port 5601: Connection refused
This is strange, since Kibana really does seem to be listening on that port, and the firewall is disabled for testing purposes:
root@srv003# sudo lsof -nP -i | grep 5601
node 14856 kibana 18u IPv4 115911041 0t0 TCP 10.0.0.72:5601 (LISTEN)
root@srv003# sudo ufw status verbose
Status: inactive
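One aside about that lsof output: the listener is bound to 10.0.0.72 rather than 127.0.0.1, so the loopback curl above would be refused regardless of Kibana's state; checking against the bound address (substituting whatever IP lsof actually reports) separates a binding issue from a real failure:
curl -I http://10.0.0.72:5601/status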
There is nothing suspicious in the log of kibana.service either:
root@srv003:/var/log# journalctl -u kibana.service | grep -A 99 "Jul 10 10:09:14"
Jul 10 10:09:14 srv003 systemd[1]: Started Kibana.
Jul 10 10:09:38 srv003 kibana[14856]: {"type":"log","@timestamp":"2019-07-10T08:09:38Z","tags":["warning","task_manager"],"pid":14856,"message":"The task maps_telemetry \"Maps-maps_telemetry\" is not cancellable."}
My Elasticsearch is still up and running. There is nothing interesting in the corresponding log files about Kibana:
root@srv003:/var/log# cat elasticsearch/elasticsearch.log | grep kibana
[2019-07-10T09:46:25,158][INFO ][o.e.c.m.MetaDataIndexTemplateService] [srv003] adding template [.kibana_task_manager] for index patterns [.kibana_task_manager]
[2019-07-10T09:47:32,955][INFO ][o.e.c.m.MetaDataCreateIndexService] [srv003] [.monitoring-kibana-7-2019.07.10] creating index, cause [auto(bulk api)], templates [.monitoring-kibana], shards [1]/[0], mappings [_doc]
Now I am running a bit out of options, and I hope somebody can give me another hint.
Edit: I do not have any Kibana plugins installed.
Consulted sources:
How to fix "Kibana server is not ready yet" error when using AKS
Kibana service is running but can not access via browser to console
Why won't Kibana Node server start up?
https://discuss.elastic.co/t/failed-to-start-kibana-7-0-1/180259/3 - most promising thread, but nobody ever answered
https://discuss.elastic.co/t/kibana-server-is-not-ready-yet-issue-after-upgrade-to-6-5-0/157021
https://discuss.elastic.co/t/kibana-server-not-ready/162075
It looks like once Kibana enters the described undefined state, a simple reboot of the machine is necessary. This is of course not acceptable for a (virtual or physical) machine on which other services are running.

Fedora 24 Vagrant issue. mount.nfs access denied by server

I started using Fedora 24 last year on my study/work computer. This is the first time I've run into an issue I cannot figure out within a reasonable amount of time.
We need to use Vagrant for a project, and I'm trying to get it running on my computer. The command vagrant up fails when mounting the NFS share. Here's the output of the command:
Bringing machine 'default' up with 'libvirt' provider...
==> default: Starting domain.
==> default: Waiting for domain to get an IP address...
==> default: Waiting for SSH to become available...
==> default: Creating shared folders metadata...
==> default: Exporting NFS shared folders...
==> default: Preparing to edit /etc/exports. Administrator privileges will be required...
[sudo] password for feilz:
Redirecting to /bin/systemctl status nfs-server.service
● nfs-server.service - NFS server and services
Loaded: loaded (/etc/systemd/system/nfs-server.service; enabled; vendor preset: disabled)
Drop-In: /run/systemd/generator/nfs-server.service.d
└─order-with-mounts.conf
Active: active (exited) since Wed 2017-02-15 15:17:58 EET; 19h ago
Main PID: 16889 (code=exited, status=0/SUCCESS)
Tasks: 0 (limit: 512)
CGroup: /system.slice/nfs-server.service
Feb 15 15:17:58 feilz systemd[1]: Starting NFS server and services...
Feb 15 15:17:58 feilz systemd[1]: Started NFS server and services.
==> default: Mounting NFS shared folders...
The following SSH command responded with a non-zero exit status.
Vagrant assumes that this means the command failed!
mount -o 'vers=4' 192.168.121.1:'/home/feilz/env/debian64' /vagrant
Stdout from the command:
Stderr from the command:
stdin: is not a tty
mount.nfs: access denied by server while mounting 192.168.121.1:/home/feilz/env/debian64
My Vagrantfile looks like this (I skipped the commented-out lines):
Vagrant.configure(2) do |config|
  config.vm.box = "debian/jessie64"
  config.vm.provider :libvirt do |libvirt|
    libvirt.driver = "qemu"
  end
end
I can run the vagrant ssh command to log in and then run the command
sudo mount -o 'vers=4' 192.168.121.1:'/home/feilz/env/debian64' /vagrant
inside the VM to try again. Then the output becomes:
mount.nfs: access denied by server while mounting 192.168.121.1:/home/feilz/env/debian64
I've gone through loads of webpages. I fixed missing ruby gems (nokogiri and libffi). I tried modifying the /etc/exports file; it doesn't help, and the file gets reset after I run vagrant halt / vagrant up.
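For completeness, these standard nfs-utils commands on the Fedora host should show whether the export Vagrant writes actually reaches the NFS server:
cat /etc/exports
sudo exportfs -v
showmount -e localhost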
I have installed the vagrant plugin vagrant-libvirt
What haven't I tried yet that would allow me to use NFS file sharing with Vagrant?

CoreOS fleet not working after auto-scaling

I have a CoreOS cluster with 3 AWS EC2 instances. The cluster was set up using the CoreOS stack CloudFormation template. After the cluster was up and running, I needed to update the auto-scaling policy to pick up an EC2 instance profile. I copied the existing auto-scaling configuration and updated the IAM role for the EC2s. Then I terminated the EC2s in the fleet, letting auto-scaling fire up new instances. The new instances indeed assumed their new roles; however, the cluster seems to have lost its machine information:
ip-10-214-156-29 ~ # systemctl -l status etcd.service
● etcd.service - etcd
Loaded: loaded (/usr/lib64/systemd/system/etcd.service; disabled)
Drop-In: /run/systemd/system/etcd.service.d
└─10-oem.conf, 20-cloudinit.conf
Active: activating (auto-restart) (Result: exit-code) since Wed 2014-09-24 18:28:58 UTC; 9s ago
Process: 14124 ExecStart=/usr/bin/etcd (code=exited, status=1/FAILURE)
Main PID: 14124 (code=exited, status=1/FAILURE)
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal systemd[1]: Unit etcd.service entered failed state.
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 INFO | d9a7cb8df4a049689de452b6858399e9 attempted to join via 10.252.78.43:7001 failed: fail checking join version: Client Internal Error (Get http://10.252.78.43:7001/version: dial tcp 10.252.78.43:7001: connection refused)
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 WARNING | d9a7cb8df4a049689de452b6858399e9 cannot connect to existing peers [10.214.135.35:7001 10.16.142.108:7001 10.248.7.66:7001 10.35.142.159:7001 10.252.78.43:7001]: fail joining the cluster via given peers after 3 retries
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 CRITICAL | fail joining the cluster via given peers after 3 retries
The same token was used from cloud-init. https://discovery.etcd.io/<cluster token> shows 6 machines: 3 dead ones and 3 new ones. So it looks like the 3 new instances did register with the discovery URL. The journalctl -u etcd.service logs show that etcd timed out on the dead instances and got connection refused from the new ones.
journalctl -u etcd.service shows:
...
Sep 24 06:01:11 ip-10-35-142-159.us-west-2.compute.internal etcd[574]: [etcd] Sep 24 06:01:11.198 INFO | 5c4531d885df4d06ae2d369c94f4de11 attempted to join via 10.214.156.29:7001 failed: fail checking join version: Client Internal Error (Get http://10.214.156.29:7001/version: dial tcp 10.214.156.29:7001: connection refused)
etcdctl --debug ls
Cluster-Peers: http://127.0.0.1:4001 http://10.35.142.159:4001
Curl-Example: curl -X GET http://127.0.0.1:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://10.35.142.159:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://127.0.0.1:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://10.35.142.159:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Error: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Maybe this is not the right process for updating a cluster's configuration, but IF the cluster does need auto-scaling for whatever reason (load-triggered, for example), will the fleet still be able to function with dead instances and new instances mixed in the pool?
How can I recover from this situation without tearing down and rebuilding?
Xueshan
In this scheme etcd will not retain a quorum of machines and can't operate successfully. The best scheme for autoscaling would be to set up two groups of machines:
A fixed number (1-9) of etcd machines that will always be up. These are set up with a discovery token or static networking like normal.
Your autoscaling group, which doesn't start etcd, but instead configures fleet (and any other tool) to use the fixed etcd cluster. You can do this in cloud-config. Here's an example that also sets some fleet metadata so you can schedule jobs specifically to the autoscaled machines if desired:
#cloud-config
coreos:
  fleet:
    metadata: "role=autoscale"
    etcd_servers: "http://:4001,http://:4001,http://:4001,http://:4001,http://:4001,http://:4001"
  units:
    - name: fleet.service
      command: start
The validator wouldn't let me put in any 10.x IP addresses in my answer (wtf!?) so be sure to replace those.
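For the fixed etcd group, the cloud-config would look something along these lines (a sketch using the etcd 0.4-era keys this cluster appears to be running, with a placeholder discovery token and the EC2 $private_ipv4 substitution):
#cloud-config
coreos:
  etcd:
    # placeholder token; generate your own at https://discovery.etcd.io/new
    discovery: https://discovery.etcd.io/<token>
    addr: $private_ipv4:4001
    peer-addr: $private_ipv4:7001
  units:
    - name: etcd.service
      command: start
    - name: fleet.service
      command: start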
You must have at least one machine always running with the discovery token; as soon as all of them go down, the heartbeat will fail, no new machine will be able to join, and you will need a new token for the cluster to form.
