HashiCorp Vault service fails to start - Consul backend

I have set up HashiCorp Vault (v1.5.4) on Ubuntu 18.04. My backend is Consul (a single node running on the same server as Vault), and the consul service is up.
My vault service fails to start:
systemctl list-units --type=service | grep "vault"
vault.service loaded failed failed vault service
journalctl -xe -u vault
Oct 03 00:21:33 ubuntu2 systemd[1]: vault.service: Scheduled restart job, restart counter is at 5.
-- Subject: Automatic restarting of a unit has been scheduled
- Unit vault.service has finished shutting down.
Oct 03 00:21:33 ubuntu2 systemd[1]: vault.service: Start request repeated too quickly.
Oct 03 00:21:33 ubuntu2 systemd[1]: vault.service: Failed with result 'exit-code'.
Oct 03 00:21:33 ubuntu2 systemd[1]: Failed to start vault service.
-- Subject: Unit vault.service has failed
vault config.json
"api_addr": "http://<my-ip>:8200",
storage "consul" {
address = "127.0.0.1:8500"
path = "vault"
},
Service config
StandardOutput=/opt/vault/logs/output.log
StandardError=/opt/vault/logs/error.log
cat /opt/vault/logs/error.log
cat: /opt/vault/logs/error.log: No such file or directory
cat /opt/vault/logs/output.log
cat: /opt/vault/logs/output.log: No such file or directory
sudo tail -f /opt/vault/logs/error.log
tail: cannot open '/opt/vault/logs/error.log' for reading: No such file or
directory
:/opt/vault/logs$ ls -al
total 8
drwxrwxr-x 2 vault vault 4096 Oct 2 13:38 .
drwxrwxr-x 5 vault vault 4096 Oct 2 13:38 ..
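(Side note: I believe StandardOutput= and StandardError= do not accept a bare file path like the one above, which would explain why the log files were never created. On Ubuntu 18.04's systemd the file: prefix should work; a hedged sketch of the relevant unit lines:)
[Service]
StandardOutput=file:/opt/vault/logs/output.log
StandardError=file:/opt/vault/logs/error.log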

After much debugging, the issue turned out to be a silly goof-up mixing .hcl and .json syntax (they are so similar, but different) from cutting and pasting between examples: in a config.json file, the storage stanza (as posted above) needs to be in JSON format. The problem is of course compounded when the error message says nothing and there is nothing in the logs.
"storage": {
"consul": {
"address": "127.0.0.1:8500",
"path" : "vault"
}
},
There were a couple of additional issues to sort out to get it going: setting disable_mlock: true and opening the firewall for port 8200 (sudo ufw allow 8200/tcp).
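For reference, the full config.json ended up looking roughly like this (the listener block and its values are illustrative, not a verbatim copy of my file):
{
  "api_addr": "http://<my-ip>:8200",
  "disable_mlock": true,
  "listener": {
    "tcp": {
      "address": "0.0.0.0:8200",
      "tls_disable": 1
    }
  },
  "storage": {
    "consul": {
      "address": "127.0.0.1:8500",
      "path": "vault"
    }
  }
}
The key point is that every stanza, including listener and storage, uses JSON syntax rather than the HCL-style braces from the original attempt.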
Finally got it done (or rather, started).
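For anyone else hitting this, a quick way to confirm the fix (standard commands; VAULT_ADDR assumes the TLS-disabled listener from the sketch above):
sudo systemctl restart vault
systemctl status vault                  # should now show active (running)
journalctl -u vault --no-pager -n 20    # recent startup log lines
export VAULT_ADDR=http://127.0.0.1:8200
vault status                            # reports initialized/sealed state once the server responds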

Related

Stopped Laravel queue worker

I'm using AWS Elastic Beanstalk.
My deploy config file is set up like below:
08_queue_service_restart:
  command: "systemctl restart laravel_worker"
files:
  /opt/elasticbeanstalk/tasks/taillogs.d/laravel-logs.conf:
    content: /var/app/current/storage/logs/laravel.log
    group: root
    mode: "000755"
    owner: root
  /etc/systemd/system/laravel_worker.service:
    mode: "000755"
    owner: root
    group: root
    content: |
      # Laravel queue worker using systemd
      # ----------------------------------
      #
      # /lib/systemd/system/queue.service
      #
      # run this command to enable service:
      # systemctl enable queue.service
      [Unit]
      Description=Laravel queue worker
      [Service]
      User=nginx
      Group=nginx
      Restart=always
      ExecStart=/usr/bin/nohup /usr/bin/php /var/app/current/artisan queue:work --tries=3
      [Install]
      WantedBy=multi-user.target
And it returns errors like below:
Aug 17 04:21:34 ip-blabla systemd: laravel_worker.service: main process exited, code=exited, status=1/FAILURE
Aug 17 04:21:34 ip-blabla systemd: Unit laravel_worker.service entered failed state.
Aug 17 04:21:34 ip-blabla systemd: laravel_worker.service failed.
Aug 17 04:21:34 ip-blabla systemd: laravel_worker.service holdoff time over, scheduling restart.
Aug 17 04:21:34 ip-blabla systemd: Stopped Laravel queue worker.
Aug 17 04:21:34 ip-blabla systemd: Started Laravel queue worker.
It was working with no errors for months, but this morning it started returning errors like that. I tried rebuilding, but nothing changed.
It might have stopped working when the server restarted. You have to manually run a command in the project directory via the terminal to restart the Laravel queue worker.
For me this command worked:
nohup php artisan queue:work --daemon &
Run this command from the Laravel project root directory.
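If you would rather keep the systemd unit from the question instead of a manual nohup, the equivalent is usually just reloading and re-enabling the service so it also survives reboots (standard systemd commands; the unit name matches the one defined above):
sudo systemctl daemon-reload
sudo systemctl enable laravel_worker
sudo systemctl restart laravel_worker
systemctl status laravel_worker    # the worker should stay in 'active (running)'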

Why can't I write certificate.crt with acme.sh?

root@vultr:~# systemctl status nginx
● nginx.service - A high performance web server and a reverse proxy server
Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: enabled)
Active: active (running) since Wed 2021-07-28 02:16:44 UTC; 23min ago
Docs: man:nginx(8)
Main PID: 12999 (nginx)
Tasks: 2 (limit: 1148)
Memory: 8.2M
CGroup: /system.slice/nginx.service
├─12999 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
└─13000 nginx: worker process
Jul 28 02:16:44 vultr.guest systemd[1]: Starting A high performance web server and a reverse proxy server...
Jul 28 02:16:44 vultr.guest systemd[1]: nginx.service: Failed to parse PID from file /run/nginx.pid: Invalid argument
Jul 28 02:16:44 vultr.guest systemd[1]: Started A high performance web server and a reverse proxy server.
Nginx is in good status.
I want to create and write certificate.crt with acme.sh:
sudo su -l -s /bin/bash acme
curl https://get.acme.sh | sh
export CF_Key="xxxx"
export CF_Email="yyyy@yahoo.com"
CF_Key is my global API key in Cloudflare, and CF_Email is the email registered to log in to Cloudflare.
acme@vultr:~$ acme.sh --issue --dns dns_cf -d domain.com --debug 2
The output is so long that I can't post it here, so I uploaded it to termbin.com; the link is below:
https://termbin.com/taxl
Please open the page to see the whole output and check what causes the error. There are two main issues:
1. My nginx server is in good status, but acme.sh can't detect it.
2. How can I set the config file?
[Wed Jul 28 03:04:38 UTC 2021] config file is empty, can not read CA_EAB_KEY_ID
[Wed Jul 28 03:04:38 UTC 2021] config file is empty, can not read CA_EAB_HMAC_KEY
[Wed Jul 28 03:04:38 UTC 2021] config file is empty, can not read CA_EMAIL
To write the key into a specified directory:
acme.sh --install-cert -d domain.com \
  --key-file /usr/local/etc/certfiles/private.key \
  --fullchain-file /usr/local/etc/certfiles/certificate.crt
It encounters a problem:
[Tue Jul 27 01:12:15 UTC 2021] Installing key to:/usr/local/etc/certfiles/private.key
cat: /home/acme/.acme.sh/domain.com/domain.com.key: No such file or directory
Checking the files in /usr/local/etc/certfiles/:
ls /usr/local/etc/certfiles/
private.key
There is no certificate.crt in /usr/local/etc/certfiles/.
How can I fix this?
As of acme.sh v3.0.0, acme.sh uses ZeroSSL as its default CA, and you must register an account first (one time) before you can issue new certs.
Here is how ZeroSSL compares with Let's Encrypt.
With ZeroSSL as CA
You must register at ZeroSSL before issuing a certificate. To register, run the command below (assuming yyyy@yahoo.com is the email with which you want to register):
acme.sh --register-account -m yyyy@yahoo.com
Now you can issue a new certificate (assuming you have set CF_Key & CF_Email, or CF_Token & CF_Account_ID):
acme.sh --issue --dns dns_cf -d domain.com
Without ZeroSSL as CA
If you don't want to use ZeroSSL and would rather use Let's Encrypt instead, you can pass the server option when issuing a certificate:
acme.sh --issue --dns dns_cf -d domain.com --server letsencrypt
Here are more options for the CA server.
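Putting it together, the end-to-end flow looks roughly like this (domain, email, and target paths are the placeholders from the question; drop --server letsencrypt to stay on ZeroSSL, and the --reloadcmd is optional):
# one-time account registration (required for the default ZeroSSL CA)
acme.sh --register-account -m yyyy@yahoo.com
# issue via Cloudflare DNS (requires CF_Key/CF_Email or CF_Token/CF_Account_ID in the environment)
acme.sh --issue --dns dns_cf -d domain.com --server letsencrypt
# copy the key and full chain to where nginx expects them
acme.sh --install-cert -d domain.com \
  --key-file /usr/local/etc/certfiles/private.key \
  --fullchain-file /usr/local/etc/certfiles/certificate.crt \
  --reloadcmd "systemctl reload nginx"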

Is docker checkpoint supported for windows containers, or does it rely on CRIU?

The goal is to make container create or start faster from a "clean" state, for an image whose build includes a long-running application installation. I thought two options would be docker export immediately after container creation (not supported on Windows) or docker checkpoint immediately after container creation.
Trying to checkpoint a Windows container with
docker checkpoint create my_container my_container_checkpoint
the command seems to hang, and the event log shows:
Index Time EntryType Source InstanceID Message
----- ---- --------- ------ ---------- -------
3595658 Feb 05 11:31 Information docker 2 debug: Calling GET /_ping
3595659 Feb 05 11:31 Information docker 2 debug: Calling GET /v1.35/containers/my_container/export
3595666 Feb 05 11:33 Information docker 2 debug: Calling POST /v1.35/containers/my_container/checkpoints
3595667 Feb 05 11:33 Information docker 2 debug: form data: {"CheckpointDir":"","CheckpointID":"my_container_checkpoint","Exit":true}
3595665 Feb 05 11:33 Information docker 2 debug: Calling GET /_ping
Version info:
Version 17.12.0-ce-win47 (15139)
Channel: stable
9c692cd

Error Keepalived - FAULT state

I'm trying to use keepalived to monitor a service.
Here is my configuration file keepalived.conf:
! Configuration File for keepalived

vrrp_script check_haproxy {
    script "/usr/local/bin/check_haproxy.sh"
    interval 2
    fail 2
    rise 2
}

vrrp_instance Vtest_2 {
    interface eno16777736
    track_interface {
        eno16777736
    }
    state BACKUP
    virtual_router_id 178
    priority 100
    nopreempt
    unicast_src_ip 172.28.7.132
    unicast_peer {
        172.28.7.133
    }
    virtual_ipaddress {
        195.221.2.14/32 dev eno16777736
    }
    track_script {
        check_haproxy
    }
    notify /usr/local/bin/keepalived.state.sh
}
In my script, I compare the PID from the pid file with the running process of the service:
#!/bin/bash
# check if haproxy is running, return 1 if not.
# Used by keepalived to initiate a failover in case haproxy is down
process=""
HAPROXY_STATUS=""
if [ -f "/var/run/haproxy.pid" ]
then
    process=$(cat /var/run/haproxy.pid)
    HAPROXY_STATUS=$(ps -aux | grep -w "$process.*[h]aproxy")
fi
if [ "$HAPROXY_STATUS" ]
then
    exit 0
else
    exit 1
fi
My problem is that when I start keepalived with the command:
keepalived -f /etc/keepalived/keepalived.conf --dont-fork --log-console --log-detail
everything works and keepalived failover works well. But when I start the service with systemd:
systemctl start keepalived
I get these lines in my /var/log/messages file:
Mar 15 17:05:51 localhost systemd: Starting Session 10 of user root.
Mar 15 17:06:23 localhost systemd: Starting LVS and VRRP High Availability Monitor...
Mar 15 17:06:23 localhost Keepalived[34182]: Starting Keepalived v1.2.13 (11/20,2015)
Mar 15 17:06:23 localhost systemd: Started LVS and VRRP High Availability Monitor.
Mar 15 17:06:23 localhost Keepalived[34183]: Starting Healthcheck child process, pid=34184
Mar 15 17:06:23 localhost Keepalived[34183]: Starting VRRP child process, pid=34185
Mar 15 17:06:23 localhost Keepalived_vrrp[34185]: Netlink reflector reports IP 172.28.7.132 added
Mar 15 17:06:23 localhost Keepalived_vrrp[34185]: Netlink reflector reports IP fe80::20c:29ff:fe22:a18e added
Mar 15 17:06:23 localhost Keepalived_vrrp[34185]: Registering Kernel netlink reflector
Mar 15 17:06:23 localhost Keepalived_vrrp[34185]: Registering Kernel netlink command channel
Mar 15 17:06:23 localhost Keepalived_vrrp[34185]: Registering gratuitous ARP shared channel
Mar 15 17:06:23 localhost Keepalived_vrrp[34185]: Opening file '/etc/keepalived/keepalived.conf'.
Mar 15 17:06:23 localhost Keepalived_vrrp[34185]: Configuration is using : 64365 Bytes
Mar 15 17:06:23 localhost Keepalived_vrrp[34185]: Using LinkWatch kernel netlink reflector...
Mar 15 17:06:23 localhost Keepalived_vrrp[34185]: VRRP_Instance(Vtest_2) Entering BACKUP STATE
Mar 15 17:06:23 localhost Keepalived_vrrp[34185]: VRRP sockpool: [ifindex(2), proto(112), unicast(1), fd(10,11)]
Mar 15 17:06:23 localhost Keepalived_healthcheckers[34184]: Netlink reflector reports IP 172.28.7.132 added
Mar 15 17:06:23 localhost Keepalived_healthcheckers[34184]: Netlink reflector reports IP fe80::20c:29ff:fe22:a18e added
Mar 15 17:06:23 localhost Keepalived_healthcheckers[34184]: Registering Kernel netlink reflector
Mar 15 17:06:23 localhost Keepalived_healthcheckers[34184]: Registering Kernel netlink command channel
Mar 15 17:06:23 localhost Keepalived_healthcheckers[34184]: Opening file '/etc/keepalived/keepalived.conf'.
Mar 15 17:06:23 localhost Keepalived_healthcheckers[34184]: Configuration is using : 5344 Bytes
Mar 15 17:06:23 localhost Keepalived_healthcheckers[34184]: Using LinkWatch kernel netlink reflector...
Mar 15 17:06:26 localhost Keepalived_vrrp[34185]: VRRP_Instance(Vtest_2) Now in FAULT state
Does anyone have an idea?
I'm using a much simpler configuration for the same purpose. The pidof utility should be available on most Linux systems by default. Try this, for example:
vrrp_script check_haproxy {
    script "pidof haproxy"
    interval 2
    weight 2
}
It's a working solution buddy.
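To sanity-check the script logic outside of keepalived first (plain shell; keepalived only cares about the exit code):
pidof haproxy
echo $?    # 0 when haproxy is running (script passes), non-zero otherwise (script fails)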
I had a similar issue with my track_script not running when starting keepalived as a systemd service.
I found out that putting my track_script in /usr/libexec/keepalived solved the issue. It is probably related to this issue.
! Configuration File for keepalived

global_defs {
}

vrrp_script odr_check {
    script "/usr/libexec/keepalived/health_check.sh"
    interval 2
    weight 50
    fall 1
    rise 1
}

vrrp_instance VI_1 {
    state BACKUP
    interface eno16777984
    virtual_router_id 42
    priority 90
    track_script {
        odr_check
    }
    virtual_ipaddress {
        192.168.110.84
    }
}
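If your script currently lives somewhere else (for example /usr/local/bin/check_haproxy.sh as in the question), moving it is just a copy plus a service restart; the paths below simply mirror the config above and are otherwise illustrative:
sudo mkdir -p /usr/libexec/keepalived
sudo cp /usr/local/bin/check_haproxy.sh /usr/libexec/keepalived/health_check.sh
sudo chmod 755 /usr/libexec/keepalived/health_check.sh
sudo systemctl restart keepalived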

CoreOS fleet not working after auto-scaling

I have a CoreOS cluster with 3 AWS EC2 instances. The cluster was set up using the CoreOS stack CloudFormation template. After the cluster was up and running, I needed to update the auto-scaling policy to pick up an EC2 instance profile. I copied the existing auto-scaling configuration and updated the IAM role for the EC2 instances. Then I terminated the EC2 instances in the fleet, letting auto-scaling fire up new instances. The new instances indeed assumed their new roles; however, the cluster seems to have lost its machine information:
ip-10-214-156-29 ~ # systemctl -l status etcd.service
● etcd.service - etcd
Loaded: loaded (/usr/lib64/systemd/system/etcd.service; disabled)
Drop-In: /run/systemd/system/etcd.service.d
└─10-oem.conf, 20-cloudinit.conf
Active: activating (auto-restart) (Result: exit-code) since Wed 2014-09-24 18:28:58 UTC; 9s ago
Process: 14124 ExecStart=/usr/bin/etcd (code=exited, status=1/FAILURE)
Main PID: 14124 (code=exited, status=1/FAILURE)
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILURE
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal systemd[1]: Unit etcd.service entered failed state.
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 INFO | d9a7cb8df4a049689de452b6858399e9 attempted to join via 10.252.78.43:7001 failed: fail checking join version: Client Internal Error (Get http://10.252.78.43:7001/version: dial tcp 10.252.78.43:7001: connection refused)
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 WARNING | d9a7cb8df4a049689de452b6858399e9 cannot connect to existing peers [10.214.135.35:7001 10.16.142.108:7001 10.248.7.66:7001 10.35.142.159:7001 10.252.78.43:7001]: fail joining the cluster via given peers after 3 retries
Sep 24 18:28:58 ip-10-214-156-29.us-west-2.compute.internal etcd[14124]: [etcd] Sep 24 18:28:58.206 CRITICAL | fail joining the cluster via given peers after 3 retries
The same token was used from cloud-init. https://discovery.etcd.io/<cluster token> shows 6 machines: 3 dead ones and 3 new ones. So it looks like the 3 new instances joined the cluster all right. The journalctl -u etcd.service logs show that etcd timed out on the dead instances and got connection refused from the new ones.
journalctl -u etcd.service shows:
...
Sep 24 06:01:11 ip-10-35-142-159.us-west-2.compute.internal etcd[574]: [etcd] Sep 24 06:01:11.198 INFO | 5c4531d885df4d06ae2d369c94f4de11 attempted to join via 10.214.156.29:7001 failed: fail checking join version: Client Internal Error (Get http://10.214.156.29:7001/version: dial tcp 10.214.156.29:7001: connection refused)
etcdctl --debug ls
Cluster-Peers: http://127.0.0.1:4001 http://10.35.142.159:4001
Curl-Example: curl -X GET http://127.0.0.1:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://10.35.142.159:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://127.0.0.1:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Curl-Example: curl -X GET http://10.35.142.159:4001/v2/keys/?consistent=true&recursive=false&sorted=false
Error: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Maybe this is not the right process for updating a cluster's configuration, but if the cluster does need auto-scaling for whatever reason (load-triggered, for example), will the fleet still be able to function with dead instances and new instances mixed in the pool?
How do I recover from this situation without tearing down and rebuilding?
Xueshan
With this scheme etcd will not keep a quorum of machines and can't operate successfully. The best approach for auto-scaling would be to set up two groups of machines:
A fixed number (1-9) of etcd machines that will always be up. These are set up with a discovery token or static networking like normal.
Your autoscaling group, which doesn't start etcd, but instead configures fleet (and any other tool) to use the fixed etcd cluster. You can do this in cloud-config. Here's an example that also sets some fleet metadata so you can schedule jobs specifically to the autoscaled machines if desired:
#cloud-config
coreos:
  fleet:
    metadata: "role=autoscale"
    etcd_servers: "http://:4001,http://:4001,http://:4001,http://:4001,http://:4001,http://:4001"
  units:
    - name: fleet.service
      command: start
The validator wouldn't let me put in any 10.x IP addresses in my answer (wtf!?) so be sure to replace those.
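If you then want to pin units to the autoscaled machines via that role=autoscale metadata, a fleet unit can carry an [X-Fleet] section; the unit below is only an illustrative example:
[Unit]
Description=Example service pinned to autoscaled machines

[Service]
ExecStart=/usr/bin/sleep infinity

[X-Fleet]
MachineMetadata=role=autoscale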
You must have at least one machine with the discovery token running at all times; as soon as all of them go down, the heartbeat will fail, no new machine will be able to join, and you will need a new token for machines to join the cluster.
