Error Keepalived - FAULT state - shell

I'm trying to use keepalived to monitor some service.
See my configuration file "keepalived.conf":
! Configuration File for keepalived
vrrp_script check_haproxy {
script "/usr/local/bin/check_haproxy.sh"
interval 2
fail 2
rise 2
}
vrrp_instance Vtest_2 {
interface eno16777736
track_interface {
eno16777736
}
state BACKUP
virtual_router_id 178
priority 100
nopreempt
unicast_src_ip 172.28.7.132
unicast_peer {
172.28.7.133
}
virtual_ipaddress {
195.221.2.14/32 dev eno16777736
}
track_script {
check_haproxy
}
notify /usr/local/bin/keepalived.state.sh
}
In my script i to compare the number of the pid file and the process of the service :
#!/bin/bash
# check if haproxy is running, return 1 if not.
# Used by keepalived to initiate a failover in case haproxy is down
process=""
HAPROXY_STATUS=""
if [ -f "/var/run/haproxy.pid" ]
then
process=$(cat /var/run/haproxy.pid)
HAPROXY_STATUS=$(ps -aux | grep -w "$process.*[h]aproxy")
fi
if [ "$HAPROXY_STATUS" ]
then
exit 0
else
exit 1
fi
My proble is that when i start keepalived when the command :
keepalived -f /etc/keepalived/keepalived.conf --dont-fork --log-console --log-detail
Everythings works goods keepalived failover works good. But when i start the service with systemd:
systemctl start keepalived
I have this lines in my "/var/log/messages" file:
Mar 15 17:05:51 localhost systemd: Starting Session 10 of user root.
Mar 15 17:06:23 localhost systemd: Starting LVS and VRRP High Availability Monitor...
Mar 15 17:06:23 localhost Keepalived[34182]: Starting Keepalived v1.2.13 (11/20,2015)
Mar 15 17:06:23 localhost systemd: Started LVS and VRRP High Availability Monitor.
Mar 15 17:06:23 localhost Keepalived[34183]: Starting Healthcheck child process, pid=34184
Mar 15 17:06:23 localhost Keepalived[34183]: Starting VRRP child process, pid=34185
Mar 15 17:06:23 localhost Keepalived_vrrp[34185]: Netlink reflector reports IP 172.28.7.132 added
Mar 15 17:06:23 localhost Keepalived_vrrp[34185]: Netlink reflector reports IP fe80::20c:29ff:fe22:a18e added
Mar 15 17:06:23 localhost Keepalived_vrrp[34185]: Registering Kernel netlink reflector
Mar 15 17:06:23 localhost Keepalived_vrrp[34185]: Registering Kernel netlink command channel
Mar 15 17:06:23 localhost Keepalived_vrrp[34185]: Registering gratuitous ARP shared channel
Mar 15 17:06:23 localhost Keepalived_vrrp[34185]: Opening file '/etc/keepalived/keepalived.conf'.
Mar 15 17:06:23 localhost Keepalived_vrrp[34185]: Configuration is using : 64365 Bytes
Mar 15 17:06:23 localhost Keepalived_vrrp[34185]: Using LinkWatch kernel netlink reflector...
Mar 15 17:06:23 localhost Keepalived_vrrp[34185]: VRRP_Instance(Vtest_2) Entering BACKUP STATE
Mar 15 17:06:23 localhost Keepalived_vrrp[34185]: VRRP sockpool: [ifindex(2), proto(112), unicast(1), fd(10,11)]
Mar 15 17:06:23 localhost Keepalived_healthcheckers[34184]: Netlink reflector reports IP 172.28.7.132 added
Mar 15 17:06:23 localhost Keepalived_healthcheckers[34184]: Netlink reflector reports IP fe80::20c:29ff:fe22:a18e added
Mar 15 17:06:23 localhost Keepalived_healthcheckers[34184]: Registering Kernel netlink reflector
Mar 15 17:06:23 localhost Keepalived_healthcheckers[34184]: Registering Kernel netlink command channel
Mar 15 17:06:23 localhost Keepalived_healthcheckers[34184]: Opening file '/etc/keepalived/keepalived.conf'.
Mar 15 17:06:23 localhost Keepalived_healthcheckers[34184]: Configuration is using : 5344 Bytes
Mar 15 17:06:23 localhost Keepalived_healthcheckers[34184]: Using LinkWatch kernel netlink reflector...
Mar 15 17:06:26 localhost Keepalived_vrrp[34185]: VRRP_Instance(Vtest_2) Now in FAULT state
Anyone have an idea ?

I'm a using much simple configuration for the same purpose. Utility pidof must be on most OSs Linux by default. Try this for example:
vrrp_script check_haproxy
{
script "pidof haproxy"
interval 2
weight 2
}
It's a working solution buddy.

I had a similar issue with my track_script not running when starting keepalived as a systemctl service.
I found out that putting my track_script in /usr/libexec/keepalived solved the issue. It is probably related to this issue.
! Configuration File for keepalived
global_defs {
}
vrrp_script odr_check {
script "/usr/libexec/keepalived/health_check.sh"
interval 2
weight 50
fall 1
rise 1
}
vrrp_instance VI_1 {
state BACKUP
interface eno16777984
virtual_router_id 42
priority 90
track_script {
odr_check
}
virtual_ipaddress {
192.168.110.84
}
}

Related

Stopped Laravel queue worker

Using AWS elasticbeanstalk.
Setting my deploy_config file like below:
08_queue_service_restart:
command: "systemctl restart laravel_worker"
files:
/opt/elasticbeanstalk/tasks/taillogs.d/laravel-logs.conf:
content: /var/app/current/storage/logs/laravel.log
group: root
mode: "000755"
owner: root
/etc/systemd/system/laravel_worker.service:
mode: "000755"
owner: root
group: root
content: |
# Laravel queue worker using systemd
# ----------------------------------
#
# /lib/systemd/system/queue.service
#
# run this command to enable service:
# systemctl enable queue.service
[Unit]
Description=Laravel queue worker
[Service]
User=nginx
Group=nginx
Restart=always
ExecStart=/usr/bin/nohup /usr/bin/php /var/app/current/artisan queue:work --tries=3
[Install]
WantedBy=multi-user.target
And it return error like below:
Aug 17 04:21:34 ip-blabla systemd: laravel_worker.service: main process exited, code=exited, status=1/FAILURE
Aug 17 04:21:34 ip-blabla systemd: Unit laravel_worker.service entered failed state.
Aug 17 04:21:34 ip-blabla systemd: laravel_worker.service failed.
Aug 17 04:21:34 ip-blabla systemd: laravel_worker.service holdoff time over, scheduling restart.
Aug 17 04:21:34 ip-blabla systemd: Stopped Laravel queue worker.
Aug 17 04:21:34 ip-blabla systemd: Started Laravel queue worker.
It was working with no error for months. But this morning, started to return error like that. And I tried making rebuild, but nothing changed.
It might stopped working when server restart. You have to manually run command in project directory via terminal to restart laravel queue worker.
for me this command worked
nohup php artisan queue:work --daemon &
Run this command into laravel-project-root-directory

Why can't write certificate.crt with acme?

root#vultr:~# systemctl status nginx
● nginx.service - A high performance web server and a reverse proxy server
Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2021-07-28 02:16:44 UTC; 23min ago
Docs: man:nginx(8)
Main PID: 12999 (nginx)
Tasks: 2 (limit: 1148)
Memory: 8.2M
CGroup: /system.slice/nginx.service
├─12999 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
└─13000 nginx: worker process
Jul 28 02:16:44 vultr.guest systemd[1]: Starting A high performance web server and a reverse proxy server...
Jul 28 02:16:44 vultr.guest systemd[1]: nginx.service: Failed to parse PID from file /run/nginx.pid: Invalid argument
Jul 28 02:16:44 vultr.guest systemd[1]: Started A high performance web server and a reverse proxy server.
The nginx is in good status.
I want to create and write certificate.crt with acme:
sudo su -l -s /bin/bash acme
curl https://get.acme.sh | sh
export CF_Key="xxxx"
export CF_Email="yyyy#yahoo.com"
CF_Key is my global api key in cloudflare,CF_Email is the register email to login cloudflare.
acme#vultr:~$ acme.sh --issue --dns dns_cf -d domain.com --debug 2
The output content is so long that i can't post here,so i upload into the termbin.com ,we share the link below:
https://termbin.com/taxl
Please open the webpage,you can get the whole output info,and check which result in error,there are two main issues:
1.My nginx server is in good status,acme.sh can't detect it.
2.How can set the config file?
[Wed Jul 28 03:04:38 UTC 2021] config file is empty, can not read CA_EAB_KEY_ID
[Wed Jul 28 03:04:38 UTC 2021] config file is empty, can not read CA_EAB_HMAC_KEY
[Wed Jul 28 03:04:38 UTC 2021] config file is empty, can not read CA_EMAIL
To write key into specified directory:
acme.sh --install-cert -d domain.com
--key-file /usr/local/etc/certfiles/private.key
--fullchain-file /usr/local/etc/certfiles/certificate.crt
It encounter problem:
[Tue Jul 27 01:12:15 UTC 2021] Installing key to:/usr/local/etc/certfiles/private.key
cat: /home/acme/.acme.sh/domain.com/domain.com.key: No such file or directory
To check files in /usr/local/etc/certfiles/
ls /usr/local/etc/certfiles/
private.key
No certificate.crt in /usr/local/etc/certfiles/.
How to fix then?
From acme.sh v3.0.0, acme.sh is using Zerossl as default ca, you must
register the account first(one-time) before you can issue new certs.
Here is how ZeroSSL compares with LetsEncrypt.
With ZeroSSL as CA
You must register at ZeroSSL before issuing a certificate. To register run the below command (assuming yyyy#yahoo.com is email with which you want to register)
acme.sh --register-account -m yyyy#yahoo.com
Now you can issue a new certificate (assuming you have set CF_Key & CF_Email or CF_Token & CF_Account_ID)
acme.sh --issue --dns dns_cf -d domain.com
Without ZeroSSL as CA
If you don't want to use ZeroSSL and say want to use LetsEncrypt instead, then you can provide the server option to issue a certificate
acme.sh --issue --dns dns_cf -d domain.com --server letsencrypt
Here are more options for the CA server.

Hashicorp - vault service fails to start

I have setup Hashicorp - vault (Vault v1.5.4) on Ubuntu 18.04. My backend is Consul (single node running on same server as vault) - consul service is up.
My vault service fails to start
systemctl list-units --type=service | grep "vault"
vault.service loaded failed failed vault service
journalctl -xe -u vault
Oct 03 00:21:33 ubuntu2 systemd[1]: vault.service: Scheduled restart job, restart counter is at 5.
-- Subject: Automatic restarting of a unit has been scheduled
- Unit vault.service has finished shutting down.
Oct 03 00:21:33 ubuntu2 systemd[1]: vault.service: Start request repeated too quickly.
Oct 03 00:21:33 ubuntu2 systemd[1]: vault.service: Failed with result 'exit-code'.
Oct 03 00:21:33 ubuntu2 systemd[1]: Failed to start vault service.
-- Subject: Unit vault.service has failed
vault config.json
"api_addr": "http://<my-ip>:8200",
storage "consul" {
address = "127.0.0.1:8500"
path = "vault"
},
Service config
StandardOutput=/opt/vault/logs/output.log
StandardError=/opt/vault/logs/error.log
cat /opt/vault/logs/error.log
cat: /opt/vault/logs/error.log: No such file or directory
cat /opt/vault/logs/output.log
cat: /opt/vault/logs/output.log: No such file or directory
sudo tail -f /opt/vault/logs/error.log
tail: cannot open '/opt/vault/logs/error.log' for reading: No such file or
directory
:/opt/vault/logs$ ls -al
total 8
drwxrwxr-x 2 vault vault 4096 Oct 2 13:38 .
drwxrwxr-x 5 vault vault 4096 Oct 2 13:38 ..
After much debugging, the issue was silly goofup mixing .hcl and .json (they are so similar - but different) - cut-n-paste between stuff the storage (as posted) needs to be in json format. The problem is of course compounded when the error message saying nothing and there is nothing in the logs.
"storage": {
"consul": {
"address": "127.0.0.1:8500",
"path" : "vault"
}
},
There were a couple of other additional issues to sort out to get it going- disable_mlock : true, opening the firewall for 8200: sudo ufw allow 8200/tcp.
Finally got done (rather started).

Is docker checkpoint supported for windows containers, or does it rely on CRIU?

The goal is to make container create or start faster from a "clean" state for an image whose build contains a long-running application installation. I thought two options would be to docker export immediately following the container creation (not supported on windows) or docker checkpoint immediately following the container creation.
Trying to checkpoint a windows container,
docker checkpoint create my_container my_container_checkpoint
the command seems to hang and event log shows:
Index Time EntryType Source InstanceID Message
----- ---- --------- ------ ---------- -------
3595658 Feb 05 11:31 Information docker 2 debug: Calling GET /_ping
3595659 Feb 05 11:31 Information docker 2 debug: Calling GET /v1.35/containers/my_container/export
3595666 Feb 05 11:33 Information docker 2 debug: Calling POST /v1.35/containers/my_container/checkpoints
3595667 Feb 05 11:33 Information docker 2 debug: form data: {"CheckpointDir":"","CheckpointID":"my_container_checkpoint","Exit":true}
3595665 Feb 05 11:33 Information docker 2 debug: Calling GET /_ping
Version info:
Version 17.12.0-ce-win47 (15139)
Channel: stable
9c692cd

Can't kill processes (originating in a docker container)

I run a docker cluster with a few thousand containers and a few times per day randomly I have a process that gets "stuck" blocking a container from stopping. Below is an example container with its corresponding process and all things I have tried to kill the container / process.
The container:
# docker ps | grep 950677e2317f
950677e2317f 7e553d1d9f6f "/bin/sh -c /minecraf" 2 days ago Up 2 days 0.0.0.0:22661->22661/tcp, 0.0.0.0:22661->22661/udp, 0.0.0.0:37681->37681/tcp, 0.0.0.0:37681->37681/udp gloomy_jennings
Try to stop container using docker daemon (it tries forever without result):
# time docker stop --time=1 950677e2317f
^C
real 0m13.508s
user 0m0.036s
sys 0m0.008s
Daemon log while trying to stop:
# journalctl -fu docker.service
-- Logs begin at Fri 2015-12-11 15:40:55 CET. --
Dec 31 23:30:33 m3561.contabo.host docker[9988]: time="2015-12-31T23:30:33.164731953+01:00" level=info msg="POST /v1.21/containers/950677e2317f/stop?t=1"
Dec 31 23:30:34 m3561.contabo.host docker[9988]: time="2015-12-31T23:30:34.165531990+01:00" level=info msg="Container 950677e2317fcd2403ef5b5ffad37204e880136e91f76b0a8682e04a93e80942 failed to exit within 1 seconds of SIGTERM - using the force"
Dec 31 23:30:44 m3561.contabo.host docker[9988]: time="2015-12-31T23:30:44.165954266+01:00" level=info msg="Container 950677e2317f failed to exit within 10 seconds of kill - trying direct SIGKILL"
Looking into the processes running on the machine reveals the zombie process (pid 11991 on host machine):
# ps aux | grep [1]1991
root 11991 84.3 0.0 5836 132 ? R Dec30 1300:19 bash -c (echo stop > /tmp/minecraft &)
# top -b | grep [1]1991
11991 root 20 0 5836 132 20 R 89.5 0.0 1300:29 bash
And it is indeed a process running within our container (check container id):
# cat /proc/11991/mountinfo
...
/var/lib/docker/containers/950677e2317fcd2403ef5b5ffad37204e880136e91f76b0a8682e04a93e80942/resolv.conf /etc/resolv.conf rw,relatime - ext4 /dev/sda2 rw,errors=remount-ro,data=ordered
Trying to kill the process yields nothing:
# kill -9 11991
# ps aux | grep [1]1991
root 11991 84.3 0.0 5836 132 ? R Dec30 1303:58 bash -c (echo stop > /tmp/minecraft &)
Some overview data:
# docker version
Client:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5
Built: Fri Nov 20 13:20:08 UTC 2015
OS/Arch: linux/amd64
Server:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5
Built: Fri Nov 20 13:20:08 UTC 2015
OS/Arch: linux/amd64
# docker info
Containers: 189
Images: 322
Server Version: 1.9.1
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 700
Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.0-19-generic
Operating System: Ubuntu 15.10
CPUs: 24
Total Memory: 125.8 GiB
Name: m3561.contabo.host
ID: ZM2Q:RA6Q:E4NM:5Q2Q:R7E4:BFPQ:EEVK:7MEO:YRH6:SVS6:RIHA:3I2K
# uname -a
Linux m3561.contabo.host 4.2.0-19-generic #23-Ubuntu SMP Wed Nov 11 11:39:30 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
If stopping the docker daemon the process still lives. The only way to get rid of the process is to restart the host machine. As this happens fairly frequently (requires every node to restart every 3-7 days) it has a serious impact on the uptime of the overall cluster.
Any ideas on what to do here?
Okay, I think I found the root cause of this. The folks over at Docker helped me out, check out this thread on GitHub.
It turns out this most likely is a bug in the Linux Kernel 4.19+. I'll be rolling back to an older version until it is fixed.
UPDATE: I've been running 3.* only in my cluster for several days now without any issues. This was most certainly a kernel bug.
I had a similar problem and switching to use overlay2 storage driver made the problem go away. Changing the storage driver will loose all docker state (images & containers). It seems that the aufs storage driver has some problems that might be a source of lock ups.

Resources