Diagnosing High Availability -- ActiveMQ Artemis - high-availability

Is there a way to diagnose HA issues in ActiveMQ Artemis? I have a pair of shared-store servers that work really well. When I shut down the primary, the secondary takes over until the primary announces it's back up; then the primary takes over again and the secondary goes back to being a secondary.
I took the configuration and basically copied it to another pair of servers, but this one isn't working.
Everything looks fine, as far as I can tell. The cluster appears in the console, and the two servers connect. When I shut down the primary, the secondary logs this message:
2020-12-06 16:59:26,379 WARN [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure to <Primary IP>/<Primary IP>:61616 has been detected: AMQ219015: The connection was disconnected because of server shutdown [code=DISCONNECTED]
In the working pair, right after this message the secondary speedily deploys all my addresses and queues and takes over. But in the new pair, the secondary does nothing after this message.
I'm not sure where to start looking. I just keep comparing the configuration of the non-working pair with the working pair.
I'm using an NFS mount; the shared filesystem is Azure NetApp Files.
Here are my broker configurations. These should be correct, though, because they work on the other pair...
Primary:
<connectors>
   <connector name="artemis">tcp://<primary URL>:61616</connector>
   <connector name="artemis-backup">tcp://<secondary URL>:61616</connector>
</connectors>
<cluster-user>activemq</cluster-user>
<cluster-password>artemis123</cluster-password>
<ha-policy>
   <shared-store>
      <master>
         <failover-on-shutdown>true</failover-on-shutdown>
      </master>
   </shared-store>
</ha-policy>
<cluster-connections>
   <cluster-connection name="cluster-1">
      <connector-ref>artemis</connector-ref>
      <static-connectors>
         <connector-ref>artemis-backup</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>
Secondary:
<connectors>
   <connector name="artemis-live">tcp://<primary URL>:61616</connector>
   <connector name="artemis">tcp://<secondary URL>:61616</connector>
</connectors>
<cluster-user>activemq</cluster-user>
<cluster-password>artemis123</cluster-password>
<ha-policy>
   <shared-store>
      <slave>
         <allow-failback>true</allow-failback>
         <failover-on-shutdown>true</failover-on-shutdown>
      </slave>
   </shared-store>
</ha-policy>
<cluster-connections>
   <cluster-connection name="cluster-1">
      <connector-ref>artemis</connector-ref>
      <static-connectors>
         <connector-ref>artemis-live</connector-ref>
      </static-connectors>
   </cluster-connection>
</cluster-connections>

In the shared-store configuration the backup broker continuously attempts to acquire a file lock on the journal. However, since the master broker already holds the lock, the backup won't be able to acquire it until the master dies. Therefore, I would look at the shared storage and ensure that file locking is working properly.
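One quick way to sanity-check locking across the two servers is the flock(1) utility. This is only a rough test (the broker uses Java NIO file locks rather than flock, and the mount path below is hypothetical), but if exclusive locks don't exclude each other across servers here, the broker's won't either:
# On the primary: take an exclusive lock on a test file in the shared mount and hold it for a minute
flock -x /mnt/nfs/artemis/lock-test -c 'echo "lock acquired on primary"; sleep 60'
# Meanwhile on the backup: a non-blocking attempt on the same file should fail while the primary holds it
flock -xn /mnt/nfs/artemis/lock-test -c 'echo "lock acquired on backup"' || echo "lock held elsewhere - locking works"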
Since you're using NFS, the NFS client configuration options are worth inspecting as well. Here are the mount options I would recommend to enable reasonable fail-over times (an example mount invocation follows the list):
timeo=50 - sets the NFS timeout to 5 seconds (the value is in tenths of a second)
retrans=1 - allows only one retry
soft - soft-mounting the NFS share disables the retry-forever logic, letting NFS errors surface in the application stack after the timeouts above
noac - turns off caching of file attributes and also enforces sync writes to the NFS share, which further reduces the time for NFS errors to surface
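For example, mounting the share with these options would look roughly like this (server name and paths are placeholders):
mount -t nfs -o soft,timeo=50,retrans=1,noac <nfs server>:/artemis-share /var/lib/artemis/shared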

Related

Troubleshooting ActiveMQ Artemis Shared Storage HA deployment

We have 2 ActiveMQ Artemis servers in a single cluster configured with the shared-storage HA strategy. The shared storage mount is NFS.
The servers keep shutting down: the master server shuts down first; the backup server acquires the live lock and works fine for some time, then it also shuts down.
The exception we get on the master server is:
2022-04-07 21:56:22,892 WARN [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Failure when accessing a lock file: java.io.IOException: The system cannot find the file specified
The exception we get on the slave server is:
2022-04-09 02:43:02,234 WARN [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Failure when accessing a lock file: java.io.IOException: The system cannot find the file specified
2022-04-09 03:00:10,995 WARN [org.apache.activemq.artemis.core.server] AMQ222010: Critical IO Error, shutting down the server. file=NIOSequentialFile \\xx.xxxx-dns.com\NAS\pri\prod\data\bindings\activemq-bindings-2.bindings, message=The system cannot find the file specified: ActiveMQIOErrorException[errorType=IO_ERROR message=The system cannot find the file specified]
2022-04-09 03:00:11,292 WARN [org.apache.activemq.artemis.core.server] AMQ222008: unable to restart server, please kill and restart manually: org.apache.activemq.artemis.core.server.NodeManager$NodeManagerException: java.io.IOException: An unexpected network error occurred
Referring to this existing question, where the following mount options are recommended:
Since you're using NFS, the NFS client configuration options are worth inspecting as well. Here are the mount options I would recommend to enable reasonable fail-over times:
timeo=50 - sets the NFS timeout to 5 seconds (the value is in tenths of a second)
retrans=1 - allows only one retry
soft - soft-mounting the NFS share disables the retry-forever logic, letting NFS errors surface in the application stack after the timeouts above
noac - turns off caching of file attributes and also enforces sync writes to the NFS share, which further reduces the time for NFS errors to surface
Can these issues be fixed by applying these mount options?
I wouldn't expect the recommended NFS mount options to solve the problems you're having. The main goal of those settings is to ensure NFS responds quickly to error conditions and reports them to the broker. In your case here you're already getting those errors (e.g. java.io.IOException: The system cannot find the file specified).
What you really need to do is track down why NFS is failing to find that file. The broker has no control over this. The exception is coming from the JVM which is, in turn, responding to an error from NFS. There is some problem with NFS itself here (e.g. a network issue).
To be clear, file-system errors like this are deemed "critical" by the broker and will cause it to shut down, so the behavior you are observing is the broker's normal response to such an error.

Who rewrites redis configuration slaveof of slave redis instances?

Consider a Redis Sentinel setup with 5 machines. Each machine runs a Sentinel process (s1, s2, s3, s4, s5) and a Redis instance (r1, r2, r3, r4, r5). One is the master (r1) and the others are slaves (r2...r5). During failover of master r1, the slaveof directive in the Redis configuration must be overridden to point to the new master r3.
Who overrides the Redis configuration of the slave instances (r2, r4, r5)? Does the Sentinel elected to handle the failover (say s2 is the elected Sentinel) override the configuration on r2, r4, and r5, or does the Sentinel running on each machine override its local Redis configuration (sn overrides the configuration of rn)?
The elected Sentinel updates the configuration. This is the full list of Sentinel capabilities at a high level:
Monitoring: Sentinel constantly checks if your master and slave instances are working as expected.
Notification: Sentinel can notify the system administrator, or other computer programs, via an API, that something is wrong with one of the monitored Redis instances.
Automatic failover: If a master is not working as expected, Sentinel can start a failover process in which a slave is promoted to master, the other slaves are reconfigured to use the new master, and the applications using the Redis server are informed of the new address to use when connecting.
Configuration provider: Sentinel acts as a source of authority for client service discovery: clients connect to Sentinels to ask for the address of the current Redis master responsible for a given service. If a failover occurs, Sentinels will report the new address.
For more details, refer to the docs.
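To make the first point concrete, here is roughly what the elected Sentinel's reconfiguration of each remaining slave amounts to (hostnames are hypothetical):
# In each slave's redis.conf before the failover:
slaveof r1.example.com 6379
# The elected Sentinel (s2) sends a SLAVEOF command to r2, r4, and r5:
SLAVEOF r3.example.com 6379
# After the Sentinel triggers a config rewrite, each slave's redis.conf reads:
slaveof r3.example.com 6379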

DRBD - automatic recovery after disconnect

I have a high-availability cluster configured with a DRBD resource.
Master/Slave Set: RVClone01 [RV_data01]
Masters: [ rvpcmk01-cr ]
Slaves: [ rvpcmk02-cr ]
I performed a test that disconnects one of the network adapters serving the DRBD replication link (for example, by shutting down the network adapter).
The cluster now reports that everything is OK, BUT running "drbd-overview" shows the following status on the primary server:
[root@rvpcmk01 ~]# drbd-overview
0:drbd0/0 WFConnection Primary/Unknown UpToDate/DUnknown /opt ext4 30G 13G 16G 45%
and on the secondary server:
[root@rvpcmk02 ~]# drbd-overview
0:drbd0/0 StandAlone Secondary/Unknown UpToDate/DUnknown
Now I have few questions:
1. Why doesn't the cluster know about the problem with DRBD?
2. Why, when I bring the network adapter back up and the DRBD link is reconnected, doesn't DRBD handle this failure and resync once the connection is OK?
3. I saw an article about solving a DRBD split-brain - https://www.hastexo.com/resources/hints-and-kinks/solve-drbd-split-brain-4-steps/ - which explains how to recover from a disconnection and resync the DRBD resource. BUT how would I know that this kind of problem exists?
I hope I've explained my case clearly and provided enough information about what I have and what I need...
1) You aren't using fencing/STONITH devices in Pacemaker or DRBD, which is why nothing happens when you unplug your network interface that DRBD is using. This isn't a scenario that Pacemaker will react to without defining fencing policies within DRBD, and STONITH devices within Pacemaker.
2) You likely are only using one ring for the Corosync communications (the same network as the DRBD device), which will cause the Secondary to promote to Primary (introducing a split-brain in DRBD) until the cluster communications are reconnected and the nodes realize they have two masters, demoting one to Secondary. Again, fencing/STONITH would prevent/handle this; so would a second Corosync ring (see the corosync.conf sketch after the DRBD configuration below).
3) You can set up the split-brain notification handler in your DRBD configuration.
Once you have STONITH/fencing devices setup in Pacemaker, you would add the following definitions to your DRBD configuration to "fix" all the issues you mentioned in your question:
resource <resource> {
   handlers {
      split-brain "/usr/lib/drbd/notify-split-brain.sh root";
      fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
      ...
   }
   disk {
      fencing resource-and-stonith;
      ...
   }
   ...
}
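Regarding item 2, a second Corosync ring on a separate network avoids losing cluster communications together with the DRBD link. As a sketch only (the addresses and ports are hypothetical, and the exact syntax depends on your Corosync version), the totem section of corosync.conf could look like:
totem {
   version: 2
   rrp_mode: passive
   interface {
      ringnumber: 0
      bindnetaddr: 192.168.0.0
      mcastport: 5405
   }
   interface {
      ringnumber: 1
      bindnetaddr: 10.0.0.0
      mcastport: 5407
   }
}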
Setting up fencing/STONITH in Pacemaker is a little too dependent on your hardware/software for me to give you pointers on setting that up for your cluster. This should get you pointed in the right direction:
http://clusterlabs.org/doc/crm_fencing.html
Hope that helps!

Redis on Windows - Sentinels not communicating

I am setting up my first Redis framework, and so far I have the following:
Server1:
- Redis master
- 3 Redis Sentinels (quorum set to 2)
Server2:
- Redis slave
- 3 Redis Sentinels (quorum set to 2)
The master and slave appear to be working properly and data is syncing from the master to the slave. When I install and start the sentinels, they too seem to run OK, in that if I connect to any of them and run sentinel masters, it shows the sentinel is pointed at my Redis master and displays the various properties.
However, the actual failover doesn't seem to work. For example, if I connect to my Redis master and run debug segfault to force a failure, the failover to the slave does not occur. None of the sentinels log anything, so it appears they are not actually connected. Here is the configuration for my sentinels:
port 26381
sentinel monitor redismaster ServerName 26380 2
sentinel down-after-milliseconds redismaster 10000
sentinel failover-timeout redismaster 180000
sentinel parallel-syncs redismaster 1
logfile "nodes/sentinel1/sentinel.log"
As you can see, this sentinel runs on 26381 (and subsequent sentinels run on 26382 and 26383). My Redis master runs on 26380. All of the ports are open, names/IPs resolve correctly, etc., so I don't think it is an infrastructure issue. In case it is useful, I am running Redis (2.8.17) which I downloaded from the MS Open Tech page.
Does anyone have any thoughts on what might be the problem, or suggestions on how to troubleshoot? I am having a hard time finding accurate documentation for setting up an H.A. instance of Redis on Windows, so any commands useful for troubleshooting these types of issues would be greatly appreciated.
I figured this out. One thing I neglected to mention in my question is that I have the masterauth configuration specified in my Redis master config file, so my clients have to provide a password to connect. I missed this in my sentinel configuration, and did not provide a password. The sentinel logging does not indicate this, so it was not obvious to me. Once I added this:
sentinel auth-pass redismaster <myPassword>
to my sentinel configuration file, everything started working as it should.
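Putting it together, the corrected sentinel configuration (with the password placeholder left as-is) reads:
port 26381
sentinel monitor redismaster ServerName 26380 2
sentinel auth-pass redismaster <myPassword>
sentinel down-after-milliseconds redismaster 10000
sentinel failover-timeout redismaster 180000
sentinel parallel-syncs redismaster 1
logfile "nodes/sentinel1/sentinel.log"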

Openfire Cluster Hazelcast Plugin Issues

Windows Server 2003R2/2008R2/2012, Openfire 3.8.1, Hazelcast 1.0.4, MySQL 5.5.30-ndb-7.2.12-cluster-gpl-log
We've set up 5 servers in an Openfire cluster. Each is in a different subnet; the subnets are located in different cities and are interconnected through VPN routers (2-8 Mbps):
192.168.0.1 - node0
192.168.1.1 - node1
192.168.2.1 - node2
192.168.3.1 - node3
192.168.4.1 - node4
Openfire is configured to use a MySQL database, which is successfully replicating from the master node0 to all slave nodes (each node uses its own local database server, functioning as a slave).
In Openfire Web Admin > Server Manager > Clustering we are able to see all cluster nodes.
Openfire custom settings for Hazelcast:
hazelcast.max.execution.seconds - 30
hazelcast.startup.delay.seconds - 3
hazelcast.startup.retry.count - 3
hazelcast.startup.retry.seconds - 10
Hazelcast config for node0 (similar on other nodes except for interface section) (%PROGRAMFILES%\Openfire\plugins\hazelcast\classes\hazelcast-cache-config.xml):
<join>
   <multicast enabled="false" />
   <tcp-ip enabled="true">
      <hostname>192.168.0.1:5701</hostname>
      <hostname>192.168.1.1:5701</hostname>
      <hostname>192.168.2.1:5701</hostname>
      <hostname>192.168.3.1:5701</hostname>
      <hostname>192.168.4.1:5701</hostname>
   </tcp-ip>
   <aws enabled="false" />
</join>
<interfaces enabled="true">
   <interface>192.168.0.1</interface>
</interfaces>
These are the only settings changed from default ones.
The problem is that XMPP clients take far too long to authorize, about 3-4 minutes. After authorization, other users in the roster appear inactive for 5-7 minutes, and during this time the logged-in user is shown as Offline in Openfire Web Admin > Sessions. Even once a user can see other logged-in users as active, messages are not delivered, or are delivered after 5-10 minutes or after a few Openfire restarts...
We appreciate any help. We spent about 5 days trying to set up this monster and have run out of ideas... :(
Thanks a lot in advance!
UPD 1: Installed Openfire 3.8.2 alpha with Hazelcast 2.5.1 Build 20130427 - same problem.
UPD 2: Tried starting the cluster on two servers that are in the same city, separated by probably 1-2 hops @ 1-5 ms ping. Everything works perfectly! Then we stopped one of those servers and started one in another city (3-4 hops @ 80-100 ms ping), and the problem occurred again... Slow authorizations, logged-off users in the roster, messages not delivered on time, etc.
UPD 3: Installed Openfire 3.8.2 without the bundled JRE, and Java SDK 1.7.0_25.
Here are JMX screenshots for node 0 and node 1 (images not reproduced here).
The red line marks the first client connection (after an Openfire restart). Tested with two users, same thing... The first user (node0) connected instantly; the second user (node1) took 5 seconds to connect.
Rosters showed offline users on both sides for 20-30 seconds; then online users started appearing in them.
The first user sends a message to the second user. The second user waits 20 seconds, then receives the first message. The reply and all subsequent messages are transferred instantly.
UPD 4:
While digging through the JConsole "Threads" tab, we discovered threads in various states:
For example hz.openfire.cached.thread-3:
WAITING on java.util.concurrent.SynchronousQueue$TransferStack@8a5325
Total blocked: 0 Total waited: 449
Maybe this could help... We actually don't know where to look.
Thanks!
[UPDATE] Note per the Hazelcast documentation - WAN replication is supported in their enterprise version only, not in the community version that is shipped with Openfire. You must obtain an enterprise license key from Hazelcast if you would like to use this feature.
You may opt to set up multiple LAN-based Openfire clusters and then federate them using S2S integration across separate XMPP domains. This is the preferred approach for scaling up Openfire for a very large user base.
[Original post follows]
My guess is that the longer network latency in your remote cluster configuration might be tying up the Hazelcast executor threads (for queries and events). Some of these events and queries are invoked synchronously within an Openfire cluster. Try tuning the following properties:
hazelcast.executor.query.thread.count (default: 8)
hazelcast.executor.event.thread.count (default: 16)
I would start by setting these values to 40/80 (5x) respectively to see if there is any improvement in overall application responsiveness, and potentially go even higher based on your expected load. Additional Hazelcast settings (including other thread pools), plus instructions for adding these properties to the configuration XML, can be found here:
Hazelcast configuration properties
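As a sketch only (assuming the hazelcast-cache-config.xml shipped with the plugin accepts the standard Hazelcast <properties> element), the tuning could look like:
<properties>
   <property name="hazelcast.executor.query.thread.count">40</property>
   <property name="hazelcast.executor.event.thread.count">80</property>
</properties>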
Hope that helps ... and good luck!
