Ping constraint not moving DRBD primary - pacemaker

I have a two-node cluster (CentOS 7 based), intended to be active/passive, with DRBD resources, app resources dependent on them, and a cluster IP dependent on the apps through ordering constraints. I have no colocation constraints; instead, all my resources are in the same group so they migrate together.
There are 2 network interfaces on each node: one on the LAN and the other a private point-to-point connection. DRBD is configured to use the point-to-point link. Both networks are configured into RRP, with the LAN as the primary Pacemaker/Corosync ring and the point-to-point link serving as backup (RRP mode set to passive).
Failover by rebooting or powering down the active node works fine, and all resources successfully migrate to the survivor. This is where the good news stops, though.
I have a ping resource pinging a host reachable on the LAN interface, with a location constraint based on the ping to move the resource group to the passive node should the active node lose connectivity to the ping host. This part, however, does not work correctly.
When I pull the LAN network cable on the active node, that node can no longer ping the ping host and the resources get stopped on it, as expected. Bear in mind that the Corosync nodes can still communicate with one another, because they fall back onto the private network thanks to RRP. The resources, however, can't be started on the previously passive node (the one that can still reach the gateway and should now become active) because the DRBD resources remain Primary on the node that had its cable pulled, so the file systems can't be mounted on the node that should take over. Keep in mind that DRBD stays connected over the private network during this time, as that cable was not pulled.
I can't figure out why the ping-based location constraint is not migrating the resource group correctly, all the way down to the DRBD Primary/Secondary level. I am hoping someone here can assist.
The following is the state after I pulled the cable and the cluster migrated as far as it could before getting stuck.
[root@za-mycluster1 ~]# pcs status
Cluster name: MY_HA
Stack: corosync
Current DC: za-mycluster1.sMY.co.za (version 1.1.20-5.el7-3c4c782f70) - partition with quorum
Last updated: Fri Apr 24 19:12:57 2020
Last change: Fri Apr 24 16:39:45 2020 by hacluster via crmd on za-mycluster1.sMY.co.za
2 nodes configured
14 resources configured
Online: [ za-mycluster1.sMY.co.za za-mycluster2.sMY.co.za ]
Full list of resources:
Master/Slave Set: LV_DATAClone [LV_DATA]
Masters: [ za-mycluster1.sMY.co.za ]
Slaves: [ za-mycluster2.sMY.co.za ]
Resource Group: mygroup
LV_DATAFS (ocf::heartbeat:Filesystem): Stopped
LV_POSTGRESFS (ocf::heartbeat:Filesystem): Stopped
postgresql_9.6 (systemd:postgresql-9.6): Stopped
LV_HOMEFS (ocf::heartbeat:Filesystem): Stopped
myapp (lsb:myapp): Stopped
ClusterIP (ocf::heartbeat:IPaddr2): Stopped
Master/Slave Set: LV_POSTGRESClone [LV_POSTGRES]
Masters: [ za-mycluster1.sMY.co.za ]
Slaves: [ za-mycluster2.sMY.co.za ]
Master/Slave Set: LV_HOMEClone [LV_HOME]
Masters: [ za-mycluster1.sMY.co.za ]
Slaves: [ za-mycluster2.sMY.co.za ]
Clone Set: pingd-clone [pingd]
Started: [ za-mycluster1.sMY.co.za za-mycluster2.sMY.co.za ]
Failed Resource Actions:
* LV_DATAFS_start_0 on za-mycluster2.sMY.co.za 'unknown error' (1): call=57, status=complete, exitreason='Couldn't mount device [/dev/drbd0] as /data',
last-rc-change='Fri Apr 24 16:59:10 2020', queued=0ms, exec=75ms
Daemon Status:
corosync: active/enabled
pacemaker: active/enabled
pcsd: active/enabled
Note the error mounting the DRBD filesystem on the migration target. Looking at the DRBD status at this point shows node 1 is still Primary, so the DRBD resources never got demoted to Secondary when the other resources were stopped.
[root@za-mycluster1 ~]# cat /proc/drbd
version: 8.4.11-1 (api:1/proto:86-101)
GIT-hash: 66145a308421e9c124ec391a7848ac20203bb03c build by mockbuild@, 2018-11-03 01:26:55
0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:169816 nr:0 dw:169944 dr:257781 al:8 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:6108 nr:0 dw:10324 dr:17553 al:14 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
2: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
ns:3368 nr:0 dw:4380 dr:72609 al:6 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
This is what the configuration looks like:
[root@za-mycluster1 ~]# pcs config
Cluster Name: MY_HA
Corosync Nodes:
za-mycluster1.sMY.co.za za-mycluster2.sMY.co.za
Pacemaker Nodes:
za-mycluster1.sMY.co.za za-mycluster2.sMY.co.za
Resources:
Master: LV_DATAClone
Meta Attrs: clone-max=2 clone-node-max=1 master-max=1 master-node-max=1 notify=true
Resource: LV_DATA (class=ocf provider=linbit type=drbd)
Attributes: drbd_resource=lv_DATA
Operations: demote interval=0s timeout=90 (LV_DATA-demote-interval-0s)
monitor interval=60s (LV_DATA-monitor-interval-60s)
notify interval=0s timeout=90 (LV_DATA-notify-interval-0s)
promote interval=0s timeout=90 (LV_DATA-promote-interval-0s)
reload interval=0s timeout=30 (LV_DATA-reload-interval-0s)
start interval=0s timeout=240 (LV_DATA-start-interval-0s)
stop interval=0s timeout=100 (LV_DATA-stop-interval-0s)
Group: mygroup
Resource: LV_DATAFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: device=/dev/drbd0 directory=/data fstype=ext4
Operations: monitor interval=20s timeout=40s (LV_DATAFS-monitor-interval-20s)
notify interval=0s timeout=60s (LV_DATAFS-notify-interval-0s)
start interval=0s timeout=60s (LV_DATAFS-start-interval-0s)
stop interval=0s timeout=60s (LV_DATAFS-stop-interval-0s)
Resource: LV_POSTGRESFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: device=/dev/drbd1 directory=/var/lib/pgsql fstype=ext4
Operations: monitor interval=20s timeout=40s (LV_POSTGRESFS-monitor-interval-20s)
notify interval=0s timeout=60s (LV_POSTGRESFS-notify-interval-0s)
start interval=0s timeout=60s (LV_POSTGRESFS-start-interval-0s)
stop interval=0s timeout=60s (LV_POSTGRESFS-stop-interval-0s)
Resource: postgresql_9.6 (class=systemd type=postgresql-9.6)
Operations: monitor interval=60s (postgresql_9.6-monitor-interval-60s)
start interval=0s timeout=100 (postgresql_9.6-start-interval-0s)
stop interval=0s timeout=100 (postgresql_9.6-stop-interval-0s)
Resource: LV_HOMEFS (class=ocf provider=heartbeat type=Filesystem)
Attributes: device=/dev/drbd2 directory=/home fstype=ext4
Operations: monitor interval=20s timeout=40s (LV_HOMEFS-monitor-interval-20s)
notify interval=0s timeout=60s (LV_HOMEFS-notify-interval-0s)
start interval=0s timeout=60s (LV_HOMEFS-start-interval-0s)
stop interval=0s timeout=60s (LV_HOMEFS-stop-interval-0s)
Resource: myapp (class=lsb type=myapp)
Operations: force-reload interval=0s timeout=15 (myapp-force-reload-interval-0s)
monitor interval=60s on-fail=standby timeout=10s (myapp-monitor-interval-60s)
restart interval=0s timeout=120s (myapp-restart-interval-0s)
start interval=0s timeout=60s (myapp-start-interval-0s)
stop interval=0s timeout=60s (myapp-stop-interval-0s)
Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
Attributes: cidr_netmask=32 ip=192.168.51.185
Operations: monitor interval=30s (ClusterIP-monitor-interval-30s)
start interval=0s timeout=20s (ClusterIP-start-interval-0s)
stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
Master: LV_POSTGRESClone
Meta Attrs: clone-max=2 clone-node-max=1 master-max=1 master-node-max=1 notify=true
Resource: LV_POSTGRES (class=ocf provider=linbit type=drbd)
Attributes: drbd_resource=lv_postgres
Operations: demote interval=0s timeout=90 (LV_POSTGRES-demote-interval-0s)
monitor interval=60s (LV_POSTGRES-monitor-interval-60s)
notify interval=0s timeout=90 (LV_POSTGRES-notify-interval-0s)
promote interval=0s timeout=90 (LV_POSTGRES-promote-interval-0s)
reload interval=0s timeout=30 (LV_POSTGRES-reload-interval-0s)
start interval=0s timeout=240 (LV_POSTGRES-start-interval-0s)
stop interval=0s timeout=100 (LV_POSTGRES-stop-interval-0s)
Master: LV_HOMEClone
Meta Attrs: clone-max=2 clone-node-max=1 master-max=1 master-node-max=1 notify=true
Resource: LV_HOME (class=ocf provider=linbit type=drbd)
Attributes: drbd_resource=lv_home
Operations: demote interval=0s timeout=90 (LV_HOME-demote-interval-0s)
monitor interval=60s (LV_HOME-monitor-interval-60s)
notify interval=0s timeout=90 (LV_HOME-notify-interval-0s)
promote interval=0s timeout=90 (LV_HOME-promote-interval-0s)
reload interval=0s timeout=30 (LV_HOME-reload-interval-0s)
start interval=0s timeout=240 (LV_HOME-start-interval-0s)
stop interval=0s timeout=100 (LV_HOME-stop-interval-0s)
Clone: pingd-clone
Resource: pingd (class=ocf provider=pacemaker type=ping)
Attributes: dampen=5s host_list=192.168.51.1 multiplier=1000
Operations: monitor interval=10 timeout=60 (pingd-monitor-interval-10)
start interval=0s timeout=60 (pingd-start-interval-0s)
stop interval=0s timeout=20 (pingd-stop-interval-0s)
Stonith Devices:
Fencing Levels:
Location Constraints:
Resource: mygroup
Constraint: location-mygroup
Rule: boolean-op=or score=-INFINITY (id:location-mygroup-rule)
Expression: pingd lt 1 (id:location-mygroup-rule-expr)
Expression: not_defined pingd (id:location-mygroup-rule-expr-1)
Ordering Constraints:
promote LV_DATAClone then start LV_DATAFS (kind:Mandatory) (id:order-LV_DATAClone-LV_DATAFS-mandatory)
promote LV_POSTGRESClone then start LV_POSTGRESFS (kind:Mandatory) (id:order-LV_POSTGRESClone-LV_POSTGRESFS-mandatory)
start LV_POSTGRESFS then start postgresql_9.6 (kind:Mandatory) (id:order-LV_POSTGRESFS-postgresql_9.6-mandatory)
promote LV_HOMEClone then start LV_HOMEFS (kind:Mandatory) (id:order-LV_HOMEClone-LV_HOMEFS-mandatory)
start LV_HOMEFS then start myapp (kind:Mandatory) (id:order-LV_HOMEFS-myapp-mandatory)
start myapp then start ClusterIP (kind:Mandatory) (id:order-myapp-ClusterIP-mandatory)
Colocation Constraints:
Ticket Constraints:
Alerts:
No alerts defined
Resources Defaults:
resource-stickiness=INFINITY
Operations Defaults:
timeout=240s
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: MY_HA
dc-version: 1.1.20-5.el7-3c4c782f70
have-watchdog: false
no-quorum-policy: ignore
stonith-enabled: false
Quorum:
Options:
Any insight will be welcomed.

Assuming the Filesystem resources in your group live on the DRBD devices defined outside of the group, you will need at least one order and one colocation constraint per DRBD device, telling the cluster that it may only start mygroup after the DRBD devices are promoted, and only on the node where they are promoted to Primary. Your ping resource is working, since you see mygroup stop and attempt to start on the peer, but the start fails because nothing tells the DRBD Primary role to move with the group, and that is where the filesystems live.
Try adding the following constraints to the cluster:
# pcs cluster cib drbd_constraints
# pcs -f drbd_constraints constraint colocation add mygroup with LV_DATAClone INFINITY with-rsc-role=Master
# pcs -f drbd_constraints constraint order promote LV_DATAClone then start mygroup
# pcs -f drbd_constraints constraint colocation add mygroup with LV_POSTGRESClone INFINITY with-rsc-role=Master
# pcs -f drbd_constraints constraint order promote LV_POSTGRESClone then start mygroup
# pcs -f drbd_constraints constraint colocation add mygroup with LV_HOMEClone INFINITY with-rsc-role=Master
# pcs -f drbd_constraints constraint order promote LV_HOMEClone then start mygroup
# pcs cluster cib-push drbd_constraints
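Once the CIB is pushed, the new colocation and order constraints should show up next to the existing ping rule, and the earlier failed start of LV_DATAFS may need to be cleared before the group can move, for example:
# pcs constraint show --full
# pcs resource cleanup LV_DATAFS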

Related

High CPU usage on idle AMQ Artemis cluster, related to locks with shared-store HA

I have an AMQ Artemis cluster with shared-store HA (master-slave), version 2.17.0.
I noticed that all my clusters (active servers only) that are idle (no one is using them) use from 10% to 20% CPU, except one, which uses around 1% (totally normal). I started investigating...
Long story short, only one cluster has completely normal CPU usage. The only difference I've managed to find is that if I connect to that normal cluster's master node and attempt telnet slave 61616, it shows as connected. If I do the same in any other cluster (those with high CPU usage), it shows as rejected.
In order to better understand what is happening, I enabled DEBUG logs in instance/etc/logging.properties. Here is what the master node is spamming:
2021-05-07 13:54:31,857 DEBUG [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl] Backup is not active, trying original connection configuration now.
2021-05-07 13:54:32,357 DEBUG [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl] Trying reconnection attempt 0/1
2021-05-07 13:54:32,357 DEBUG [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl] Trying to connect with connectorFactory = org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnectorFactory@6cf71172, connectorConfig=TransportConfiguration(name=slave-connector, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory) ?trustStorePassword=****&port=61616&keyStorePassword=****&sslEnabled=true&host=slave-com&trustStorePath=/path/to/ssl/truststore-jks&keyStorePath=/path/to/ssl/keystore-jks
2021-05-07 13:54:32,357 DEBUG [org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector] Connector NettyConnector [host=slave.com, port=61616, httpEnabled=false, httpUpgradeEnabled=false, useServlet=false, servletPath=/messaging/ActiveMQServlet, sslEnabled=true, useNio=true] using native epoll
2021-05-07 13:54:32,357 DEBUG [org.apache.activemq.artemis.core.client] AMQ211002: Started EPOLL Netty Connector version 4.1.51.Final to slave.com:61616
2021-05-07 13:54:32,358 DEBUG [org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector] Remote destination: slave.com/123.123.123.123:61616
2021-05-07 13:54:32,358 DEBUG [org.apache.activemq.artemis.spi.core.remoting.ssl.SSLContextFactory] Creating SSL context with configuration
trustStorePassword=****
port=61616
keyStorePassword=****
sslEnabled=true
host=slave.com
trustStorePath=/path/to/ssl/truststore.jks
keyStorePath=/path/to/ssl/keystore.jks
2021-05-07 13:54:32,448 DEBUG [org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnector] Added ActiveMQClientChannelHandler to Channel with id = 77c078c2
2021-05-07 13:54:32,448 DEBUG [org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl] Connector towards NettyConnector [host=slave.com, port=61616, httpEnabled=false, httpUpgradeEnabled=false, useServlet=false, servletPath=/messaging/ActiveMQServlet, sslEnabled=true, useNio=true] failed
This is what the slave is spamming:
2021-05-07 14:06:53,177 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] trying to lock position: 1
2021-05-07 14:06:53,178 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] failed to lock position: 1
If I attempt to telnet from master node to slave node (same if I do it from slave to slave):
[root@master]# telnet slave.com 61616
Trying 123.123.123.123...
telnet: connect to address 123.123.123.123: Connection refused
However, if I attempt the same telnet in the only working cluster, I can successfully "connect" from master to slave...
Here is what I suspect:
The master acquires the lock in instance/data/journal/server.lock.
The master keeps trying to connect to the slave server.
The slave is unable to start because it cannot acquire the same server.lock on the shared storage.
The master uses high CPU because it keeps trying so hard to connect to the slave, which is not running (a rough way to test the lock from the slave is sketched below).
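One way to probe that last point is to try taking the same lock by hand from the slave node with flock (the paths are the placeholders from above, and whether an flock over NFS actually conflicts with the Java file lock depends on the client's lock mapping, so treat it only as a rough indication):
[root@slave]# flock -n /path/to/artemis/instance/data/journal/server.lock -c 'echo lock is free' || echo lock is held elsewhere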
What am I doing wrong?
EDIT: This is what my NFS mounts look like (taken from the mount command):
some_server:/some_dir on /path/to/artemis/instance/data type nfs4 (rw,relatime,sync,vers=4.1,rsize=65536,wsize=65536,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,soft,noac,proto=tcp,timeo=50,retrans=1,sec=sys,clientaddr=123.123.123.123,local_lock=none,addr=123.123.123.123)
Turns out the issue was in the broker.xml configuration. In static-connectors I had somehow decided to list only the "non-current" server (e.g. I have srv0 and srv1; in srv0 I only added the connector of srv1, and vice versa).
What it used to be (on 1st master node):
<cluster-connections>
  <cluster-connection name="abc">
    <connector-ref>srv0-connector</connector-ref>
    <message-load-balancing>ON_DEMAND</message-load-balancing>
    <max-hops>1</max-hops>
    <static-connectors>
      <connector-ref>srv1-connector</connector-ref>
    </static-connectors>
  </cluster-connection>
</cluster-connections>
How it is now (on 1st master node):
<cluster-connections>
  <cluster-connection name="abc">
    <connector-ref>srv0-connector</connector-ref>
    <message-load-balancing>ON_DEMAND</message-load-balancing>
    <max-hops>1</max-hops>
    <static-connectors>
      <connector-ref>srv0-connector</connector-ref>
      <connector-ref>srv1-connector</connector-ref>
    </static-connectors>
  </cluster-connection>
</cluster-connections>
After listing all of the cluster's nodes, the CPU usage normalized and is now only ~1% on the active node. The issue was not related to AMQ Artemis connection spamming or file locks after all.

Consul UI does not show

Running a single-node Consul (v1.8.4) on Ubuntu 18.04. The consul service is up, and I have set ui to true (the default).
But when I try to access http://192.168.37.128:8500/ui I get:
This site can’t be reached 192.168.37.128 took too long to respond.
ui.json
{
"addresses": {
"http": "0.0.0.0"
}
}
consul.service file:
[Unit]
Description=Consul
Documentation=https://www.consul.io/
[Service]
ExecStart=/usr/bin/consul agent -server -ui -data-dir=/temp/consul -bootstrap-expect=1 -node=vault -bind= -config-dir=/etc/consul.d/
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
systemctl status consul
● consul.service - Consul
Loaded: loaded (/etc/systemd/system/consul.service; disabled; vendor preset: enabled)
Active: active (running) since Sun 2020-10-04 19:19:08 CDT; 50min ago
Docs: https://www.consul.io/
Main PID: 9477 (consul)
Tasks: 9 (limit: 4980)
CGroup: /system.slice/consul.service
└─9477 /opt/consul/bin/consul agent -server -ui -data-dir=/temp/consul -bootstrap-expect=1 -node=vault -bind=1
agent.server.raft: heartbeat timeout reached, starting election: last-leader=
agent.server.raft: entering candidate state: node="Node at 192.168.37.128:8300 [Candid
agent.server.raft: election won: tally=1
agent.server.raft: entering leader state: leader="Node at 192.168.37.128:8300 [Leader]
agent.server: cluster leadership acquired
agent.server: New leader elected: payload=vault
agent.leader: started routine: routine="federation state anti-entropy"
agent.leader: started routine: routine="federation state pruning"
agent.leader: started routine: routine="CA root pruning"
agent: Synced node info
This shows the agent bound at 192.168.37.128:8300.
The issue was the firewall; I had to open port 8500:
sudo ufw allow 8500/tcp
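To double-check, ufw should now list the rule and the UI endpoint should answer over HTTP:
sudo ufw status | grep 8500
curl -I http://192.168.37.128:8500/ui/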

microk8s.enable dns gets stuck in ContainerCreating

I have installed the microk8s snap on Ubuntu 19 in a VirtualBox VM. When I run microk8s.enable dns, the pod for the deployment does not get past the ContainerCreating state.
It used to work before. I have also re-installed microk8s; this helped in the past, but not anymore.
Output from microk8s.kubectl get all --all-namespaces shows that something is wrong with the volume for the secrets. I don't know how I can investigate further, so any help is appreciated.
Cheers
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system pod/coredns-9b8997588-z88lz 0/1 ContainerCreating 0 16m
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
default service/kubernetes ClusterIP 10.152.183.1 <none> 443/TCP 20m
kube-system service/kube-dns ClusterIP 10.152.183.10 <none> 53/UDP,53/TCP,9153/TCP 16m
NAMESPACE NAME READY UP-TO-DATE AVAILABLE AGE
kube-system deployment.apps/coredns 0/1 1 0 16m
NAMESPACE NAME DESIRED CURRENT READY AGE
kube-system replicaset.apps/coredns-9b8997588 1 1 0 16m
Output from microk8s.kubectl describe pod/coredns-9b8997588-z88lz -n kube-system
Name: coredns-9b8997588-z88lz
Namespace: kube-system
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: peza-ubuntu-19/10.0.2.15
Start Time: Sun, 29 Sep 2019 15:49:27 +0200
Labels: k8s-app=kube-dns
pod-template-hash=9b8997588
Annotations: scheduler.alpha.kubernetes.io/critical-pod:
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/coredns-9b8997588
Containers:
coredns:
Container ID:
Image: coredns/coredns:1.5.0
Image ID:
Ports: 53/UDP, 53/TCP, 9153/TCP
Host Ports: 0/UDP, 0/TCP, 0/TCP
Args:
-conf
/etc/coredns/Corefile
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
memory: 170Mi
Requests:
cpu: 100m
memory: 70Mi
Liveness: http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5
Readiness: http-get http://:8181/ready delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/coredns from config-volume (ro)
/var/run/secrets/kubernetes.io/serviceaccount from coredns-token-h6qlm (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
config-volume:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: coredns
Optional: false
coredns-token-h6qlm:
Type: Secret (a volume populated by a Secret)
SecretName: coredns-token-h6qlm
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: CriticalAddonsOnly
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned kube-system/coredns-9b8997588-z88lz to peza-ubuntu-19
Warning FailedMount 5m59s kubelet, peza-ubuntu-19 Unable to attach or mount volumes: unmounted volumes=[coredns-token-h6qlm config-volume], unattached volumes=[coredns-token-h6qlm config-volume]: timed out waiting for the condition
Warning FailedMount 3m56s (x11 over 10m) kubelet, peza-ubuntu-19 MountVolume.SetUp failed for volume "coredns-token-h6qlm" : failed to sync secret cache: timed out waiting for the condition
Warning FailedMount 3m44s (x2 over 8m16s) kubelet, peza-ubuntu-19 Unable to attach or mount volumes: unmounted volumes=[config-volume coredns-token-h6qlm], unattached volumes=[config-volume coredns-token-h6qlm]: timed out waiting for the condition
Warning FailedMount 113s (x12 over 10m) kubelet, peza-ubuntu-19 MountVolume.SetUp failed for volume "config-volume" : failed to sync configmap cache: timed out waiting for the condition
I spent my morning fighting with this on Ubuntu 19.04. None of the microk8s add-ons worked. Their containers got stuck in "ContainerCreating" status with something like "MountVolume.SetUp failed for volume "kubernetes-dashboard-token-764ml" : failed to sync secret cache: timed out waiting for the condition" in their descriptions.
I tried to start/stop/reset/reinstall microk8s a few times. Nothing worked. Once I downgraded it to the previous version, the problem went away.
sudo snap install microk8s --classic --channel=1.15/stable
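After the downgrade, re-enable the add-on and watch the kube-system pods; they should now move past ContainerCreating:
microk8s.enable dns
microk8s.kubectl -n kube-system get pods -w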

Elasticsearch: 'timed out waiting for all nodes to process published state' and cluster unavailability

I am setting up a 3-node Elasticsearch cluster with Docker. This is my docker-compose file:
version: '2.0'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch-oss:6.3.0
    environment:
      - cluster.name=test-cluster
      - node.name=elastic_1
      - ES_JAVA_OPTS=-Xms512m -Xmx512m
      - bootstrap.memory_lock=true
      - discovery.zen.minimum_master_nodes=2
      - discovery.zen.ping.unicast.hosts=elasticsearch,elasticsearch2,elasticsearch3
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - test_es_cluster_data:/usr/share/elasticsearch/data
    networks:
      - esnet
  elasticsearch2:
    extends:
      file: ./docker-compose.yml
      service: elasticsearch
    environment:
      - node.name=elastic_2
    volumes:
      - test_es_cluster2_data:/usr/share/elasticsearch/data
  elasticsearch3:
    extends:
      file: ./docker-compose.yml
      service: elasticsearch
    environment:
      - node.name=elastic_3
    volumes:
      - test_es_cluster3_data:/usr/share/elasticsearch/data
volumes:
  test_es_cluster_data:
  test_es_cluster2_data:
  test_es_cluster3_data:
networks:
  esnet:
Once the cluster is up, I kill the master (elastic_1) to test failover. I expect a new master to be elected, while the cluster should keep responding to read requests the whole time.
Well, a master is elected, but the cluster does not respond for quite a long time (~45s).
Please find logs from elastic_2 and elastic_3 after master is stopped (docker stop escluster_elasticsearch_1):
elastic_2:
...
[2018-07-04T14:47:04,495][INFO ][o.e.d.z.ZenDiscovery ] [elastic_2] master_left [{elastic_1}{...}{172.24.0.3}{172.24.0.3:9300}], reason [shut_down]
...
[2018-07-04T14:47:04,509][WARN ][o.e.c.NodeConnectionsService] [elastic_2] failed to connect to node {elastic_1}{...}{172.24.0.3}{172.24.0.3:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [elastic_1][172.24.0.3:9300] connect_exception
...
[2018-07-04T14:47:07,565][INFO ][o.e.c.s.ClusterApplierService] [elastic_2] detected_master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300}, reason: apply cluster state (from master [master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300} committed version [4]])
[2018-07-04T14:47:35,301][WARN ][r.suppressed ] path: /_cat/health, params: {}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
...
[2018-07-04T14:47:53,933][WARN ][o.e.c.s.ClusterApplierService] [elastic_2] cluster state applier task [apply cluster state (from master [master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300} committed version [4]])] took [46.3s] above the warn threshold of 30s
[2018-07-04T14:47:53,934][INFO ][o.e.c.s.ClusterApplierService] [elastic_2] removed {{elastic_1}{...}{172.24.0.3}{172.24.0.3:9300},}, reason: apply cluster state (from master [master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300} committed version [5]])
[2018-07-04T14:47:56,931][WARN ][o.e.t.TransportService ] [elastic_2] Received response for a request that has timed out, sent [48367ms] ago, timed out [18366ms] ago, action [internal:discovery/zen/fd/master_ping], node [{elastic_3}{...}{172.24.0.4}{172.24.0.4:9300}], id [1035]
elastic_3:
[2018-07-04T14:47:04,494][INFO ][o.e.d.z.ZenDiscovery ] [elastic_3] master_left [{elastic_1}{...}{172.24.0.3}{172.24.0.3:9300}], reason [shut_down]
...
[2018-07-04T14:47:04,519][WARN ][o.e.c.NodeConnectionsService] [elastic_3] failed to connect to node {elastic_1}{...}{172.24.0.3}{172.24.0.3:9300} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [elastic_1][172.24.0.3:9300] connect_exception
...
[2018-07-04T14:47:07,550][INFO ][o.e.c.s.MasterService ] [elastic_3] zen-disco-elected-as-master ([1] nodes joined)[, ], reason: new_master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300}
[2018-07-04T14:47:35,026][WARN ][r.suppressed ] path: /_cat/nodes, params: {v=}
org.elasticsearch.discovery.MasterNotDiscoveredException: null
...
[2018-07-04T14:47:37,560][WARN ][o.e.d.z.PublishClusterStateAction] [elastic_3] timed out waiting for all nodes to process published state [4] (timeout [30s], pending nodes: [{elastic_2}{...}{172.24.0.2}{172.24.0.2:9300}])
[2018-07-04T14:47:37,561][INFO ][o.e.c.s.ClusterApplierService] [elastic_3] new_master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300}, reason: apply cluster state (from master [master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300} committed version [4] source [zen-disco-elected-as-master ([1] nodes joined)[, ]]])
[2018-07-04T14:47:41,021][WARN ][o.e.c.s.MasterService ] [elastic_3] cluster state update task [zen-disco-elected-as-master ([1] nodes joined)[, ]] took [33.4s] above the warn threshold of 30s
[2018-07-04T14:47:41,022][INFO ][o.e.c.s.MasterService ] [elastic_3] zen-disco-node-failed({elastic_1}{...}{172.24.0.3}{172.24.0.3:9300}), reason(transport disconnected), reason: removed {{elastic_1}{...}{172.24.0.3}{172.24.0.3:9300},}
[2018-07-04T14:47:56,929][INFO ][o.e.c.s.ClusterApplierService] [elastic_3] removed {{elastic_1}{...}{172.24.0.3}{172.24.0.3:9300},}, reason: apply cluster state (from master [master {elastic_3}{...}{172.24.0.4}{172.24.0.4:9300} committed version [5] source [zen-disco-node-failed({elastic_1}{...}{172.24.0.3}{172.24.0.3:9300}), reason(transport disconnected)]])
Why does it take so long for the cluster to stabilize and respond to requests?
It is puzzling that:
a) new master is elected (elastic_3):
[2018-07-04T14:47:07,550][INFO ] ... [elastic_3] zen-disco-elected-as-master ([1] nodes joined)[, ], reason: new_master {elastic_3}...
b) then, it is detected by elastic_2:
[2018-07-04T14:47:07,565][INFO ] ... [elastic_2] detected_master {elastic_3}...
c) then, master times out on waiting to process published state:
[2018-07-04T14:47:37,560][WARN ] ... [elastic_3] timed out waiting for all nodes to process published state [4] (timeout [30s], pending nodes: [{elastic_2}...])
d) elastic_2 applies cluster state with warn:
[2018-07-04T14:47:53,933][WARN ] ... [elastic_2] cluster state applier task [apply cluster state (from master [master {elastic_3}...])] took [46.3s] above the warn threshold of 30s
What can cause the timeout in (c)? All of this runs on a local machine (no network issues). Am I missing any configuration?
Meanwhile, requests to both elastic_2 and elastic_3 end up with MasterNotDiscoveredException. According to the documentation, the cluster is expected to respond (https://www.elastic.co/guide/en/elasticsearch/reference/6.3/modules-discovery-zen.html#no-master-block).
Has anyone experienced this? I would appreciate any advice on this issue.
Using docker restart instead of docker stop solves the problem. See: https://discuss.elastic.co/t/timed-out-waiting-for-all-nodes-to-process-published-state-and-cluster-unavailability/138590
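In practice that means the failover test becomes a restart of the master container, after which the surviving nodes can be queried almost immediately (assuming port 9200 is reachable from wherever curl runs, e.g. published to the host or called from another container on esnet):
docker restart escluster_elasticsearch_1
curl -s 'http://localhost:9200/_cat/health?v'
curl -s 'http://localhost:9200/_cat/master?v'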

How to register Apache Storm supervisors to Apache Zookeeper

I'm following http://jayatiatblogs.blogspot.com/2011/11/storm-installation.html & http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html#sc_zkMulitServerSetup to set up an Apache Storm cluster on Ubuntu 14.04 LTS at AWS EC2.
My master node is 10.0.0.185. My slave nodes are 10.0.0.83, 10.0.0.124 & 10.0.0.84, with myid of 1, 2 and 3 in their zookeeper-data respectively. I set up an Apache ZooKeeper ensemble consisting of all 3 slave nodes.
Below is the zoo.cfg for my master node & slave nodes:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/ubuntu/zookeeper-data
clientPort=2181
server.1=10.0.0.83:2888:3888
server.2=10.0.0.124:2888:3888
server.3=10.0.0.84:2888:3888
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
Below is my storm.yaml for the slave nodes:
########### These MUST be filled in for a storm configuration
storm.zookeeper.server:
- "10.0.0.83"
- "10.0.0.124"
- "10.0.0.84"
# - "localhost"
storm.zookeeper.port: 2181
# nimbus.host: "localhost"
nimbus.host: "10.0.0.185"
storm.local.dir: "/home/ubuntu/storm/data"
java.library.path: "/usr/lib/jvm/java-7-oracle"
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
# worker.childopts: "-Xmx768m"
# nimbus.childopts: "-Xmx512m"
# supervisor.childopts: "-Xmx256m"
The storm.yaml file on the master node is similar to the slave nodes', except that supervisor.slots.ports is commented out with #.
Below is the /etc/hosts file on both the master node & the slave nodes:
127.0.0.1 localhost
10.0.0.185 ip-10-0-0-185.ap-southeast-1.compute.internal stormNimbus
10.0.0.124 ip-10-0-0-124.ap-southeast-1.compute.internal slaveRain
10.0.0.83 ip-10-0-0-83.ap-southeast-1.compute.internal slaveCloud
10.0.0.84 ip-10-0-0-84.ap-southeast-1.compute.internal slaveLightning
When I issue the command echo status | nc 10.0.0.124 2181 on slave node 10.0.0.124, I get:
Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT
Clients:
/10.0.0.124:53790[0](queued=0,recved=1,sent=0)
I start up ZooKeeper & the supervisor on all 3 slave nodes, but the other 2 slave nodes don't register with ZooKeeper, although I set ZooKeeper up in cluster mode.
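As a sanity check, it may help to confirm that the ensemble actually formed (each node should report Mode: leader or follower) and to see which supervisors registered under Storm's ZooKeeper root (the default root is /storm; zkCli.sh ships in ZooKeeper's bin directory):
echo srvr | nc 10.0.0.83 2181 | grep Mode
echo srvr | nc 10.0.0.124 2181 | grep Mode
echo srvr | nc 10.0.0.84 2181 | grep Mode
bin/zkCli.sh -server 10.0.0.83:2181 ls /storm/supervisors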
When I issue storm rebalance in the Topology Summary of the Storm UI, instead of allowing a new slave node to be added and run concurrently with the old slave node, the new slave node replaces the old one in running the topology.
Each time, only a single slave node is running the topology.
What are the possible issues & how should I solve them?
