corosync/pacemaker treating OCF_RUNNING_MASTER as error - pacemaker

I created an ocf resource agent and I want to run it as a Master/Slave set. At first my monitor function returned OCF_SUCCESS on a running node (regardless of whether it was a master or a slave) which did actually work, but pacemaker did not know which one was the current master (both instances reported as slaves).
That's why I changed the monitor function to return OCF_RUNNING_MASTER on the master and OCF_SUCCESS on the slave (because I saw it in the code of drbd). Unfortunately pacemaker seems to interpret this as an error, kills the master, promotes the second node to master, and so on.
Does anyone know how I can make pacemaker interpret OCF_RUNNING_MASTER as success?
crm config:
node 3232286770: VStorage1 \
attributes standby=off
node 3232286771: VStorage2
primitive virtual_ip IPaddr2 \
params ip= cidr_netmask=32 nic=ens256 \
op monitor interval=10s \
meta migration-threshold=10
primitive filecluster ocf:msn:cluster \
op start timeout=120 interval=0 \
op stop timeout=120 interval=0 \
op promote timeout=120 interval=0 \
op demote timeout=120 interval=0 \
op monitor interval=20s role=Slave \
op monitor interval=10s role=Master \
meta migration-threshold=10
ms ms filecluster
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.14-70404b0 \
cluster-infrastructure=corosync \
cluster-name=debian \
stonith-enabled=false \
crm status output:
root@VStorage1:/usr/lib/ocf/resource.d# crm status
Last updated: Mon Nov 5 11:21:34 2018 Last change: Fri Nov 2 20:22:53 2018 by root via cibadmin on VStorage1
Stack: corosync
Current DC: VStorage1 (version 1.1.14-70404b0) - partition with quorum
2 nodes and 3 resources configured
Online: [ VStorage1 VStorage2 ]
Full list of resources:
virtual_ip (ocf::heartbeat:IPaddr2): Started VStorage1
Master/Slave Set: ms [filecluster]
Slaves: [ VStorage1 ]
Stopped: [ VStorage2 ]
Failed Actions:
* filecluster_monitor_20000 on VStorage1 'master' (8): call=153, status=complete, exitreason='none',
last-rc-change='Fri Nov 2 20:27:28 2018', queued=0ms, exec=0ms
* filecluster_monitor_20000 on VStorage2 'master' (8): call=135, status=complete, exitreason='none',
last-rc-change='Fri Nov 2 20:27:11 2018', queued=0ms, exec=0ms

A master/slave resource agent will report both instances as slaves only if the promotion to master fails.
What is the condition in your OCF agent for promoting to master?
See the drbd agent for the conditions under which the resource is promoted to master.
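The failed actions in the status output come from the 20-second monitor, which is the one configured with role=Slave, and they report rc 8 ('master'). A recurring monitor only counts as successful when its return code matches the role Pacemaker currently assigns to that instance, so OCF_RUNNING_MASTER from an instance Pacemaker still regards as a slave is treated as a failure. A minimal sketch of that comparison follows; the exit code values are the standard OCF ones, while the function names are made up for illustration and are not Pacemaker's actual code.

```python
# Standard OCF exit codes relevant to a master/slave monitor.
OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7
OCF_RUNNING_MASTER = 8


def expected_monitor_rc(role):
    """Return code a healthy instance should report for the given role."""
    return OCF_RUNNING_MASTER if role == "Master" else OCF_SUCCESS


def monitor_is_failure(role, rc):
    """Hypothetical helper: True when rc contradicts the assigned role."""
    return rc != expected_monitor_rc(role)


# Slave-role monitor reporting 'master' (8), as in the Failed Actions list.
print(monitor_is_failure("Slave", OCF_RUNNING_MASTER))   # True
# Master-role monitor reporting 'master' (8) is the expected result.
print(monitor_is_failure("Master", OCF_RUNNING_MASTER))  # False
```

In practice this suggests the agent should report OCF_RUNNING_MASTER only after Pacemaker has called promote on that node. The drbd agent, for example, advertises a promotion preference with crm_master and then waits to be promoted rather than deciding on its own.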


etcd v2: etcd-server is healthy but etcd-events is not joining ("cluster ID mismatch" and "unmatched member while checking PeerURLs" errors)

I have a legacy Kubernetes cluster running etcd v2 with 3 masters (etcd-a, etcd-b, etcd-c). We attempted an upgrade to etcd v3, but this broke the first master (etcd-a) and it was no longer able to join the cluster. After some time I was able to restore it:
removed etcd-a from the etcd cluster with etcdctl member rm
added a new etcd-a1 with a clean state and added it to the cluster with etcdctl member add
started kubelet with ETCD_INITIAL_CLUSTER_STATE set to existing, then started protokube. At this point the master was able to join the cluster.
At the beginning I thought the cluster was healthy:
/ # etcdctl member list
a4***b2: name=etcd-c peerURLs= clientURLs=
cf***97: name=etcd-a1 peerURLs= clientURLs=
d3***59: name=etcd-b peerURLs= clientURLs=
/ # etcdctl cluster-health
member a4***b2 is healthy: got healthy result from
member cf***97 is healthy: got healthy result from
member d3***59 is healthy: got healthy result from
cluster is healthy
Yet the status of etcd-events is not great: etcd-events for a1 is not running.
etcd-server-events-ip-a1 0/1 CrashLoopBackOff 430
etcd-server-events-ip-b 1/1 Running 3
etcd-server-events-ip-c 1/1 Running 0
Logs from etcd-events-a1:
flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=
flags: recognized and used environment variable ETCD_DATA_DIR=/var/etcd/data-events
flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=
flags: recognized and used environment variable ETCD_INITIAL_CLUSTER=etcd-events-a1=,etcd-events-b=,etcd-events-c=
flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-token-etcd-events
flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=
flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=
flags: recognized and used environment variable ETCD_NAME=etcd-events-a1
etcdmain: etcd Version: 2.2.1
etcdmain: Git SHA: 75f8282
etcdmain: Go Version: go1.5.1
etcdmain: Go OS/Arch: linux/amd64
etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
etcdmain: the server is already initialized as member before, starting as etcd member...
etcdmain: listening for peers on
etcdmain: listening for client requests on
netutil: resolving to 10.15.***:2381
netutil: resolving to 10.15.***:2381
etcdmain: stopping listening for client requests on
etcdmain: stopping listening for peers on
etcdmain: error validating peerURLs {ClusterID:5a***b3 Members:[&{ID:a7***32 RaftAttributes:{PeerURLs:[]} Attributes:{Name:etcd-events-b ClientURLs:[]}} &{ID:cc***b3 RaftAttributes:{PeerURLs:[]} Attributes:{Name:etcd-events-a ClientURLs:[]}} &{ID:7f***2ca RaftAttributes:{PeerURLs:[]} Attributes:{Name:etcd-events-c ClientURLs:[]}}] RemovedMemberIDs:[]}: unmatched member while checking PeerURLs
# restart
etcdserver: restarting member eb***3a in cluster 96***07 at commit index 3
raft: eb***a3a became follower at term 12407
raft: newRaft eb***3a [peers: [], term: 12407, commit: 3, applied: 0, lastindex: 3, lastterm: 1]
etcdserver: starting server... [version: 2.2.1, cluster version: to_be_decided]
etcdserver: added local member eb***3a [] to cluster 96***07
etcdserver: added member 7f***ca [] to cluster 96***07
rafthttp: request sent was ignored (cluster ID mismatch: remote[7f***ca]=5a***b3, local=96***07)
rafthttp: request sent was ignored (cluster ID mismatch: remote[7f***ca]=5a***3, local=96***07)
rafthttp: failed to dial 7f***ca on stream Message (cluster ID mismatch)
rafthttp: failed to dial 7f***ca on stream MsgApp v2 (cluster ID mismatch)
etcdserver: added member a7***32 [] to cluster 96***07
rafthttp: request sent was ignored (cluster ID mismatch: remote[a7***32]=5a***b3, local=96***07)
rafthttp: failed to dial a7***32 on stream MsgApp v2 (cluster ID mismatch)
rafthttp: request sent was ignored (cluster ID mismatch: remote[a7***32]=5a***b3, local=96***07)
osutil: received terminated signal, shutting down...
etcdserver: aborting publish because server is stopped
Logs from etcd-events-b:
rafthttp: streaming request ignored (cluster ID mismatch got 96***07 want 5a***b3)
rafthttp: the connection to peer cc***b3 is unhealthy
Logs from etcd-events-c:
etcdserver: failed to reach the peerURL( of member cc***b3 (Get dial tcp i/o timeout)
etcdserver: cannot get the version of member cc***b3 (Get dial tcp i/o timeout)
From the log I saw two problems:
etcd-events on a1 seems to ignore the existing cluster (the cluster IDs don't match).
the other nodes (b and c) still somehow remember the removed old node a.
I'm short of ideas on how to fix this. Any suggestion?
If you tried to upgrade etcd2 and did not restart all the masters at the same time, you will fail the upgrade.
Make sure you read through
I also strongly recommend using the latest possible version of kOps as there are quite a few migration bugs fixed along the way.
There may be multiple reasons why the cluster ID changed, but if I remember correctly, replacing members like that was never really supported and with etcd2 your options are limited. Trying to get to etcd-manager and etcdv3 may be the best way to get your cluster in a working state again.
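To see at a glance which cluster ID each side believes it belongs to, the rafthttp lines quoted in the question can be grouped by (local, remote) ID pair. This is only a diagnostic sketch: the regular expression is written against the log format shown above, and the function name is an assumption for illustration, not part of etcd.

```python
import re
from collections import defaultdict

# Matches the rafthttp lines quoted in the question, e.g.
# "... (cluster ID mismatch: remote[7f***ca]=5a***b3, local=96***07)"
MISMATCH_RE = re.compile(
    r"cluster ID mismatch: remote\[(?P<peer>[^\]]+)\]="
    r"(?P<remote>[^,]+), local=(?P<local>[^)]+)\)"
)


def summarize_mismatches(log_lines):
    """Map each (local, remote) cluster ID pair to the peers reporting it."""
    pairs = defaultdict(set)
    for line in log_lines:
        match = MISMATCH_RE.search(line)
        if match:
            key = (match.group("local"), match.group("remote"))
            pairs[key].add(match.group("peer"))
    return dict(pairs)


# Lines copied from the etcd-events-a1 log above.
sample = [
    "rafthttp: request sent was ignored (cluster ID mismatch: remote[7f***ca]=5a***b3, local=96***07)",
    "rafthttp: request sent was ignored (cluster ID mismatch: remote[a7***32]=5a***b3, local=96***07)",
    "etcdserver: added member a7***32 [] to cluster 96***07",
]
for (local, remote), peers in summarize_mismatches(sample).items():
    print("local=%s remote=%s peers=%s" % (local, remote, sorted(peers)))
```

With the quoted lines, both surviving peers report cluster 5a***b3 while the new member reports 96***07, which is consistent with the new member having bootstrapped its own cluster instead of joining the existing one.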

Substrate Parsing mdns packet failed

I am currently doing this tutorial. And on the same machine it worked as expected: The nodes are connecting and are creating and finalizing blocks. But now I want to do the same over the internet. So I have a server (Ubuntu 16.04 xenial) with open port 30333 on which I am running this command:
./target/release/node-template \
--base-path /tmp/alice \
--chain ./customSpecRaw.json \
--alice \
--rpc-methods Unsafe \
--port 30333 \
--ws-port 9945 \
--rpc-port 9933 \
--node-key 0000000000000000000000000000000000000000000000000000000000000001 \
--telemetry-url 'wss:// 0' \
--validator \
--name Node01
And my PC (Manjaro 20.2.1 Nibia) with no open ports on which I am running this command:
./target/release/node-template \
--base-path /tmp/bob \
--chain ./customSpecRaw.json \
--port 30334 \
--ws-port 9946 \
--rpc-port 9934 \
--telemetry-url 'wss:// 0' \
--rpc-methods Unsafe \
--name Node02 \
--bootnodes /ip4/<SERVER IP>/tcp/30333/p2p/<BOOTNODE P2P ID>
In the terminal I see network traffic on both nodes so networking should not be the problem. But there are 0 peers on both nodes and there are no blocks created/finalized. But I am getting two errors on the bootnodes terminal printed repeatedly:
Error while dialing /dns/ Custom { kind: Other, error: Timeout }
Parsing mdns packet failed: LabelIsNotAscii
Both errors are already output before I try to connect to the bootnode from my PC.
Both nodes are compiled from the same code and are using the same custom chain spec file generated on the server.
So my questions are:
What do the errors/warnings mean?
How to fix them in order to get the expected results?
If the errors/warnings are not causing the problem what else could it be?
I re-cloned and recompiled both nodes and somehow it is working now. The only thing I changed in the command was the --no-mdns flag.
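For the "what else could it be" part of the question, one check that is independent of the node software is whether the bootnode's p2p port accepts TCP connections from the second machine. The helper below is a generic sketch using only the Python standard library; the function name is made up for illustration, and a successful connection shows only that the port is open, not that the libp2p handshake will succeed.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, timed out, unreachable, or name resolution failed.
        return False
    sock.close()
    return True
```

From the PC this would be called as tcp_reachable("<SERVER IP>", 30333), with the placeholder replaced by the server's real address.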

New relic infra agent not restarted

We have been using the New Relic infrastructure agent for the last 2 years, but after 13 Nov 2019 it suddenly stopped working. I then updated newrelic, but the problem was not resolved. The problem is that I'm unable to restart the New Relic infra agent.
I used the commands below:
echo "license_key: ${NEW_RELIC_LICENSE_KEY}" | sudo tee /etc/newrelic-infra.yml
sudo curl -o /etc/yum.repos.d/newrelic-infra.repo
sudo yum -q makecache -y --disablerepo='*' --enablerepo='newrelic-infra'
sudo yum install newrelic-infra -y
sudo initctl restart newrelic-infra
The application is hosted on AWS Elastic Beanstalk.
I'm getting initctl: Unknown instance.
The detailed error is below:
INFO [7168] - [Application update pem.pem-staging.f6e105eb760.20191117-164558#668/AppDeployStage0/EbExtensionPostBuild/Infra-EmbeddedPostBuild/postbuild_1_PEM/Command 04-configure_new_relic] : Activity execution failed, because: license_key: XXXXXXXXXXXXXXXX
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 239 100 239 0 0 2091 0 --:--:-- --:--:-- --:--:-- 2096
Loaded plugins: priorities, update-motd, upgrade-helper
Package newrelic-infra-1.7.1-1.x86_64 already installed and latest version
Nothing to do
initctl: Unknown instance:
In my case the configuration looks like this:
newrelic-sysmond: []
command: nrsysmond-config --set license_key=XXXXXXXXXXXXXXXXXXXX
command: echo hostname=$SERVER_URL >> /etc/newrelic/nrsysmond.cfg
command: /etc/init.d/newrelic-sysmond start
You might not need the '02' command.
The problem is on New Relic's end. I have worked around it temporarily with the commands below:
cat /etc/newrelic-infra.yml
ps aux | grep newrelic-infra
Kill all processes returned by the ps command above with kill -9 pid1 pid2...
Then start the service with sudo initctl start newrelic-infra
It's now working fine.
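The "find the stale processes" step of the workaround above can be sketched as a small parser over ps aux output. The function name and the sample output are made up for illustration; the parser assumes the usual eleven-column ps aux layout and skips the grep process itself, which the plain ps aux | grep pipeline would otherwise include.

```python
def newrelic_infra_pids(ps_aux_output, name="newrelic-infra"):
    """Return PIDs from `ps aux` text whose command mentions name."""
    pids = []
    for line in ps_aux_output.splitlines():
        # USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header or malformed line
        command = fields[10]
        if name in command and not command.startswith("grep"):
            pids.append(int(fields[1]))
    return pids


# Invented sample output, for illustration only.
sample = (
    "USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND\n"
    "root 2301 0.1 0.5 123456 20480 ? Ssl 10:00 0:01 /usr/bin/newrelic-infra\n"
    "root 2345 0.0 0.0 11000 900 pts/0 S+ 10:05 0:00 grep newrelic-infra\n"
)
print(newrelic_infra_pids(sample))  # [2301]
```

The resulting PIDs are the ones to pass to kill before running sudo initctl start newrelic-infra.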

H2O cluster startup frequently timing out

Trying to start an h2o cluster on (MapR) hadoop via python
# startup hadoop h2o cluster
import os
import subprocess
import sys
import h2o
import shlex
import re
from Queue import Queue, Empty
from threading import Thread

def enqueue_output(out, queue):
    """Function for communicating streaming text lines from separate thread."""
    for line in iter(out.readline, b''):
        queue.put(line)
    out.close()

# clear legacy temp. dir.
hdfs_legacy_dir = '/mapr/clustername/user/mapr/hdfsOutputDir'
if os.path.isdir(hdfs_legacy_dir):
    print subprocess.check_output(shlex.split('rm -r %s' % hdfs_legacy_dir))

# start h2o service in background thread
local_h2o_start_path = '/home/mapr/h2o-'
startup_p = subprocess.Popen(shlex.split('/bin/hadoop jar {}h2odriver.jar -nodes 4 -mapperXmx 6g -timeout 300 -output hdfsOutputDir'.format(local_h2o_start_path)),
                             stdout=subprocess.PIPE, stderr=subprocess.PIPE)

# setup message passing queue
q = Queue()
t = Thread(target=enqueue_output, args=(startup_p.stdout, q))
t.daemon = True  # thread dies with the program
t.start()

# read line without blocking
h2o_url_out = ''
while True:
    try:
        line = q.get_nowait()  # or q.get(timeout=.1)
    except Empty:
        continue  # nothing to read yet, poll again
    else:  # got line
        print line
        # check for first instance connection url output
        if re.search('Open H2O Flow in your web browser', line) is not None:
            h2o_url_out = line
            break
        if re.search('Error', line) is not None:
            print 'Error generated: %s' % line
            sys.exit(1)  # stop polling once the driver reports an error

print 'Connection url output line: %s' % h2o_url_out
h2o_cnxn_ip = re.search(r'(?<=Open H2O Flow in your web browser: http://)(.*?)(?=:)', h2o_url_out).group(1)
print 'H2O connection ip: %s' % h2o_cnxn_ip
frequently throws a timeout error
Waiting for H2O cluster to come up...
H2O node requested flatfile
H2O node requested flatfile
H2O node requested flatfile
ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
Error generated: ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
Shutting down h2o cluster
Looking at the docs (and just doing a word search for "timeout"), I was unable to find anything that helped with the problem (e.g. extending the timeout via hadoop jar h2odriver.jar -timeout <some time> did nothing but extend the time until the timeout error popped up).
I have noticed that this often happens when another h2o cluster instance is already up and running (which I don't understand, since I would think that YARN could support multiple instances), yet it also sometimes happens when no other cluster is initialized.
Does anyone know anything else that can be tried to solve this problem, or how to get more debugging info beyond the error message thrown by h2o?
Trying to recreate the problem from the commandline, getting
[me@mnode01 project]$ /bin/hadoop jar /home/me/h2o- -nodes 4 -mapperXmx 6g -timeout 300 -output hdfsOutputDir
Determining driver host interface for mapper->driver callback...
[Possible callback IP address:]
[Possible callback IP address:]
Using mapper->driver callback IP address and port:
(You can override these with -driverif and -driverport/-driverportrange.)
Memory Settings: -Xms6g -Xmx6g -XX:PermSize=256m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
Extra memory percent: 10 6758
18/08/15 09:18:46 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to mnode03.cluster.local/
18/08/15 09:18:48 INFO mapreduce.JobSubmitter: number of splits:4
18/08/15 09:18:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1523404089784_7404
18/08/15 09:18:48 INFO security.ExternalTokenManagerFactory: Initialized external token manager class -
18/08/15 09:18:48 INFO impl.YarnClientImpl: Submitted application application_1523404089784_7404
18/08/15 09:18:48 INFO mapreduce.Job: The url to track the job: https://mnode03.cluster.local:8090/proxy/application_1523404089784_7404/
Job name 'H2O_66888' submitted
JobTracker job ID is 'job_1523404089784_7404'
For YARN users, logs command is 'yarn logs -applicationId application_1523404089784_7404'
Waiting for H2O cluster to come up...
H2O node requested flatfile
H2O node requested flatfile
H2O node requested flatfile
ERROR: Timed out waiting for H2O cluster to come up (300 seconds)
ERROR: (Try specifying the -timeout option to increase the waiting time limit)
Attempting to clean up hadoop job...
18/08/15 09:23:54 INFO client.MapRZKBasedRMFailoverProxyProvider: Updated RM address to mnode03.cluster.local/
----- YARN cluster metrics -----
Number of YARN worker nodes: 6
----- Nodes -----
Node: http://mnode03.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 7.0 GB used, 0 / 2 vcores used
Node: http://mnode05.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 10.4 GB used, 0 / 2 vcores used
Node: http://mnode06.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 10.4 GB used, 0 / 2 vcores used
Node: http://mnode01.cluster.local:8044 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 5.0 GB used, 0 / 2 vcores used
Node: http://mnode04.cluster.local:8044 Rack: /default-rack, RUNNING, 1 containers used, 7.0 / 10.4 GB used, 1 / 2 vcores used
Node: http://mnode02.cluster.local:8044 Rack: /default-rack, RUNNING, 1 containers used, 2.0 / 8.7 GB used, 1 / 2 vcores used
----- Queues -----
Queue name: root.default
Queue state: RUNNING
Current capacity: 0.00
Capacity: 0.00
Maximum capacity: -1.00
Application count: 0
Queue 'root.default' approximate utilization: 0.0 / 0.0 GB used, 0 / 0 vcores used
WARNING: Job memory request (26.4 GB) exceeds queue available memory capacity (0.0 GB)
WARNING: Job virtual cores request (4) exceeds queue available virtual cores capacity (0)
ERROR: Only 3 out of the requested 4 worker containers were started due to YARN cluster resource limitations
For YARN users, logs command is 'yarn logs -applicationId application_1523404089784_7404'
and noticing the later outputs
WARNING: Job memory request (26.4 GB) exceeds queue available memory capacity (0.0 GB)
WARNING: Job virtual cores request (4) exceeds queue available virtual cores capacity (0)
ERROR: Only 3 out of the requested 4 worker containers were started due to YARN cluster
I am confused by the reported 0 GB of memory and 0 vcores, because there are no other applications running on the cluster. The cluster details in the YARN RM web UI were shown in an image (not reproduced here); I used an image because I could not find a single place in the log files with this information. Why the memory availability is so uneven despite there being no other running applications, I do not know. I should mention that I don't have much experience tinkering with or examining YARN configs, so it's difficult for me to find relevant information.
Could it be that I am starting the h2o cluster with -mapperXmx 6g, but (as shown in the image) one of the nodes only has 5 GB of memory available, so if this node is selected to contribute to the h2o application, it does not have enough memory to support the requested mapper memory? Changing the startup command to /bin/hadoop jar /home/me/h2o- -nodes 4 -mapperXmx 5g -timeout 300 -output hdfsOutputDir and starting/stopping multiple times without error seems to support this theory (though I need to check further to determine whether I'm interpreting things correctly).
This is most likely because your Hadoop cluster is busy, and there just isn't space to start new yarn containers.
If you ask for N nodes, then you either get all N nodes, or the launch process times out like you are seeing. You can optionally use the -timeout command line flag to increase the timeout.
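The numbers in the driver output above can be reproduced with simple arithmetic: each container asks for the -mapperXmx heap plus the "Extra memory percent" overhead (10 in the log), and the job asks for that amount once per node. The sketch below only mirrors that calculation. The function name is made up, and truncating to whole megabytes is an assumption chosen because it matches the 6758 printed in the log.

```python
def h2o_yarn_request(nodes, mapper_xmx_gb, extra_mem_percent=10):
    """Return (per-container MB, total GB) requested for an H2O launch."""
    factor = 1 + extra_mem_percent / 100.0
    # Assumption: the per-container figure is truncated to whole MB.
    per_container_mb = int(mapper_xmx_gb * 1024 * factor)
    total_gb = round(nodes * mapper_xmx_gb * factor, 1)
    return per_container_mb, total_gb


print(h2o_yarn_request(4, 6))  # (6758, 26.4)
print(h2o_yarn_request(4, 5))  # (5632, 22.0)
```

So -nodes 4 -mapperXmx 6g needs four containers of roughly 6.6 GB each (26.4 GB in total, as in the warning), while -mapperXmx 5g lowers that to about 5.5 GB per container. Since all N containers must start, a single node that cannot fit one container is enough to make the launch time out.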

HA - Pacemaker - Is there a way to clean automatically failed actions after X sec/min/hour?

I'm using Pacemaker + Corosync in Centos7
When one of my resources fails or stops, I get a failed action message:
Master/Slave Set: myoptClone01 [myopt_data01]
Masters: [ pcmk01-cr ]
Slaves: [ pcmk02-cr ]
myopt_fs01 (ocf::heartbeat:Filesystem): Started pcmk01-cr
myopt_VIP01 (ocf::heartbeat:IPaddr2): Started pcmk01-cr
ServicesResource (ocf::heartbeat:RADviewServices): Started pcmk01-cr
Failed Actions:
* ServicesResource_monitor_120000 on pcmk02-cr 'unknown error' (1): call=141, status=complete, exitreason='none',
last-rc-change='Mon Jan 30 10:19:36 2017', queued=0ms, exec=142ms
Is there a way to automatically clean up the failed actions after X sec/min/hour?
Look into the 'failure-timeout' resource option. This will automatically clean up the failed action if no further failures for the particular resource have occurred within the value of failure-timeout.
I believe the failure-timeout is evaluated at each cluster-recheck-interval, which means that even if you have failure-timeout configured to 1 minute, it may still take up to 15 minutes and 59 seconds to clear the failed action with Pacemaker's default 15-minute cluster-recheck-interval.
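The worst case described above is simply the sum of the two settings: the failure must first age past failure-timeout, and then the cluster has to wait for its next recheck. A small sketch of that upper bound follows; the function name is hypothetical, and the model assumes expiry is only noticed at recheck time, as the answer suggests.

```python
def worst_case_clear_delay(failure_timeout_s, recheck_interval_s=15 * 60):
    """Upper bound in seconds before an expired failure is cleared."""
    # The failure expires after failure_timeout_s, then waits for the
    # next recheck, which is at most one full interval away.
    return failure_timeout_s + recheck_interval_s


delay = worst_case_clear_delay(60)
print("%d min %d s" % divmod(delay, 60))  # 16 min 0 s
```

The actual delay stays just under this bound, which is where the "15 minutes and 59 seconds" figure comes from. Lowering cluster-recheck-interval tightens the bound, at the cost of the cluster re-evaluating its state more often.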
