Achieve high availability and failover in an Artemis cluster with the shared-store HA policy through the JGroups protocol

The ActiveMQ Artemis documentation states that if high availability is configured with the replication HA policy then you can specify a group of live servers that a backup server can connect to. This is done by configuring group-name in the master and slave elements of broker.xml. A backup server will only connect to a live server that shares the same node group name.
But with shared-store there is no such concept of a group-name, which confuses me. If I have to achieve high availability with shared-store over JGroups, how can it be done?
Also, when I tried it with the replication HA policy and a group-name, the cluster formed and failover worked, but I got warnings like these:
2020-10-02 16:35:21,517 WARN [org.apache.activemq.artemis.core.client] AMQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=220da24b-049c-11eb-8da6-0050569b585d
2020-10-02 16:35:21,517 WARN [org.apache.activemq.artemis.core.client] AMQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=220da24b-049c-11eb-8da6-0050569b585d
2020-10-02 16:35:25,350 WARN [org.apache.activemq.artemis.core.server] AMQ224078: The size of duplicate cache detection (<id_cache-size/>) appears to be too large 20,000. It should be no greater than the number of messages that can be squeezed into confirmation window buffer (<confirmation-window-size/>) 32,000.

As the name "shared-store" indicates, the live and the backup broker become a logical pair that supports high availability and failover because they share the same data store. Since the store itself pairs them, there is no need for any kind of group-name configuration; such an option would be confusing, redundant, and ultimately useless.
The JGroups configuration (and the cluster-connection more generally) exists because the two brokers need to exchange information with each other about their respective network locations so that the live broker can inform clients how to connect to the backup in case of a failure.
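To make that concrete, here is a minimal sketch of the relevant parts of a shared-store master's broker.xml using JGroups-based discovery. The directory paths, connector address, channel name, and jgroups.xml file name are placeholders, and element order must follow the broker.xml schema; the backup uses the same configuration except that <master/> becomes <slave/> and its connector points at its own host:

   <core xmlns="urn:activemq:core">
      <!-- both brokers point their journals at the same shared file system -->
      <paging-directory>/shared/artemis/paging</paging-directory>
      <bindings-directory>/shared/artemis/bindings</bindings-directory>
      <journal-directory>/shared/artemis/journal</journal-directory>
      <large-messages-directory>/shared/artemis/large-messages</large-messages-directory>

      <ha-policy>
         <shared-store>
            <master>
               <failover-on-shutdown>true</failover-on-shutdown>
            </master>
         </shared-store>
      </ha-policy>

      <connectors>
         <!-- how this broker advertises itself to the cluster and to clients -->
         <connector name="netty-connector">tcp://master-host:61616</connector>
      </connectors>

      <broadcast-groups>
         <broadcast-group name="bg-group1">
            <jgroups-file>jgroups.xml</jgroups-file>
            <jgroups-channel>artemis_broadcast</jgroups-channel>
            <connector-ref>netty-connector</connector-ref>
         </broadcast-group>
      </broadcast-groups>

      <discovery-groups>
         <discovery-group name="dg-group1">
            <jgroups-file>jgroups.xml</jgroups-file>
            <jgroups-channel>artemis_broadcast</jgroups-channel>
            <refresh-timeout>10000</refresh-timeout>
         </discovery-group>
      </discovery-groups>

      <cluster-connections>
         <!-- this is how the pair exchange their network locations -->
         <cluster-connection name="my-cluster">
            <connector-ref>netty-connector</connector-ref>
            <discovery-group-ref discovery-group-name="dg-group1"/>
         </cluster-connection>
      </cluster-connections>
   </core>

Note that there is no group-name anywhere: the shared journal directories are what pair the two brokers.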
Regarding the WARN message about duplicate node IDs on the network: you might see that warning once, possibly twice, during failover or fail-back, but if you see it more often than that then something is wrong. If you're using shared-store it indicates a problem with the locks on the shared file system. If you're using replication it indicates a potential misconfiguration or possibly split-brain.

Related

How to configure JMS group handling in Wildfly cluster?

I have two servers, server1 and server2.
server1 is the master server and server2 is the slave.
Both are running in a clustered environment.
If two messages with the same group ID arrive simultaneously on node 1 and node 2, the nodes don't know which consumer the messages should be sent to. As a result the messages end up being processed by different consumers, and sometimes the message which arrived first gets processed later, which is not desirable.
I would like to configure the system so that both nodes agree on which consumer should process the messages of a given group.
Solution I tried:
Configured server1 with a LOCAL grouping-handler and server2 with a REMOTE one.
Now whenever a message arrives, the LOCAL grouping-handler determines which node the consumer for that group is on, and the message is routed accordingly.
This solution works as long as server1 is running fine. However, if server1 goes down, messages are no longer processed.
To fix this I added a backup server to the messaging-activemq subsystem, backing up server1 on server2, and similarly did the same for server2.
/profile=garima/subsystem=messaging-activemq/server=backup:add
And I added the same cluster-connection, discovery-group, http-connector, and broadcast-group to this backup server, but this did not seem to fix the failover, and messages were not processed on the other node.
Please suggest another approach, or explain how I can handle the scenario where the server with the LOCAL grouping-handler stops.
The recommended solution for clustered grouping is what you have configured - a backup for the node with the LOCAL grouping-handler. The bottom line here is that if there isn't an active node in the cluster with a LOCAL grouping-handler then a decision about which consumer should handle which group simply can't be made. It sounds to me like your backup broker simply isn't working as expected (which is probably a subject for a different question).
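For reference, a rough sketch of the CLI commands involved, reusing your profile name; the server names, grouping-handler name, and address are placeholders, and the exact attributes can vary between WildFly versions:

On the node that should make the grouping decisions:
/profile=garima/subsystem=messaging-activemq/server=default/grouping-handler=my-grouping-handler:add(type=LOCAL, grouping-handler-address=jms)
On the other node:
/profile=garima/subsystem=messaging-activemq/server=default/grouping-handler=my-grouping-handler:add(type=REMOTE, grouping-handler-address=jms)
And a backup broker for the LOCAL node (assuming a shared store; the live server it backs up would use ha-policy=shared-store-master):
/profile=garima/subsystem=messaging-activemq/server=backup/ha-policy=shared-store-slave:add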
Aside from having a backup you might consider eliminating the cluster altogether. Clusters are a way to improve overall message throughput using horizontal scaling. However, message grouping naturally serializes message consumption for each group which then decreases overall message throughput (perhaps severely depending on the use-case). It may be that you don't need the performance scalability of a cluster since you're grouping messages. Have you performed any benchmarking to determine your performance bottlenecks? If so, was clustering the proven solution to these bottlenecks?

Solace application HA across regions

Currently, within a region we achieve HA (hot/hot) between applications by using exclusive queues to ensure one application is active and the rest are standby.
How do I achieve the same thing across regions when the appliances are linked via CSPF neighbour links? Since queues are local to an appliance, the approach above doesn't work.
This is not possible with your design of CSPF neighbours - they are meant for direct messages, not guaranteed ones.
Are you able to provide more details about your use case?
Solace can easily do active/standby across regions using Data Center Replication.
Solace can also allow consumers to consume messages from the endpoints on both the active and standby regions by Allowing Clients to Connect to Standby Sites. However, this means two consumers will be active - one on the active site and one on the standby site.

What are the implications of using NFS3 file system for multi-instance queue managers in WebSphere MQ

We are stuck in a difficult scenario in our new MQ infrastructure implementation, which uses multi-instance queue managers with WebSphere MQ v7.5 on Linux.
The concern is that our network team is not able to configure NFSv4, so we are still on NFSv3. We understand multi-instance queue managers will not function properly with NFSv3. But are there any issues if we define the queue managers in multi-instance fashion on NFSv3 and expect them to work perfectly in single-instance mode?
Thanks
I would not expect you to have issues running single-node queue managers on NFSv3; we do so on a regular basis. The requirement for NFSv4 is for the file locking mechanism that multi-instance queue managers use to determine when the primary instance has lost control and a secondary queue manager should take over.
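If you want to see how your file system actually behaves, IBM ships the amqmfsck tool for exactly this purpose. A sketch, assuming the shared directory is mounted at /MQHA (the path is a placeholder, and the locking tests should be run from both nodes at the same time):

amqmfsck /MQHA/qmgrs       (basic file system integrity check)
amqmfsck -c /MQHA/qmgrs    (tests writing to the directory concurrently)
amqmfsck -w /MQHA/qmgrs    (tests waiting for and releasing locks)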
If you do define the queue manager as multi-instance and the queue manager attempts to fail over, it may not do so successfully; at worst it may corrupt your queue manager files.
If you control the failover yourself - as in, shut down the queue manager on one node and start it again on another node - that should work for you, as there is no file sharing taking place and all files would be closed on the primary node before being opened on the secondary node. You would have to make sure the secondary queue manager is NOT running in standby mode -- ever.
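A sketch of that manual failover, assuming a queue manager named QM1 whose data and logs live on the NFSv3 share and which is already defined (e.g. via addmqinf) on both nodes:

On node A:  endmqm -w QM1     (end the queue manager and wait until it has fully stopped)
On node B:  strmqm QM1        (plain strmqm - never strmqm -x, so no standby instance is ever started)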
I hope this helps.
Dave

Websphere MQ and High Availability

When I read about HA in WebSphere MQ I always come to the point where the best practice is to create two queue managers handling the same queue and to use the out-of-the-box load balancing. Then, when one goes down, the other takes over its job.
Well, this is great, but what about the messages in the queue that belong to the queue manager that went down? Do these messages stay there (when the messages are persistent, of course) until that queue manager is up and running again?
Furthermore, is it possible to create common storage for these doubled queue managers? Then no message would have to wait for the queue manager to come back up, and every message would be delivered in the proper order. Is this correct?
WebSphere MQ provides different capabilities for HA, depending on your requirements. WebSphere MQ clustering uses parallelism to distribute load across multiple instances of a queue. This provides availability of the service but not for in-flight messages.
Hardware clustering and Multi-Instance Queue Manager (MIQM) are both designs using multiple instances of a queue manager that see a single disk image of that queue manager's state. These provide availability of in-flight messages but the service is briefly unavailable while the cluster fails over.
Using these in combination it is possible to provide recovery of in-flight messages as well as availability of the service across multiple queue instances.
In the hardware cluster model the disk is mounted on only one server, and the cluster software monitors for failure and swaps the disk, IP address, and possibly other resources to the secondary node. This requires a hardware cluster monitor such as PowerHA to manage the cluster.
The Multi-Instance QMgr is implemented entirely within WebSphere MQ and needs no other software. It works by having two running instances of the QMgr pointing to the same NFS 4 shared disk mount. Both instances compete for locks on the files. The first one to acquire a lock becomes the active QMgr. Because there is no hardware cluster monitor to perform IP address takeover this type of cluster will have multiple IP addresses. Any modern version of WMQ allows for this using multi-instance CONNAME where you can supply a comma-separated list of IP or DNS names. Client applications that previously used Client Channel Definition Tables (CCDT) to manage failover across multiple QMgrs will continue to work and CCDT continues to be supported in current versions of WMQ.
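For illustration, the client side of that multi-instance CONNAME might look like this in MQSC (the channel name, host names, and port are placeholders):

DEFINE CHANNEL(APP.SVRCONN) CHLTYPE(CLNTCONN) +
       CONNAME('nodeA.example.com(1414),nodeB.example.com(1414)') +
       QMNAME(QM1)

The client tries the addresses in order, so after a failover it simply reconnects to whichever instance is currently active.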
Please see the Infocenter topic Using WebSphere MQ with high availability configurations for details of hardware cluster and MIQM support.
Client Channel Definition Table files are discussed in the Infocenter topic Client Channel Definition Table file.

How to distribute load across a RabbitMQ cluster?

Hi, I have created three RabbitMQ servers running in a cluster on EC2.
I want to scale out the RabbitMQ cluster based on CPU utilization, but when I publish messages only one server's CPU is utilized and the other RabbitMQ servers' CPUs are not.
So how can I distribute the load across the RabbitMQ cluster?
RabbitMQ clusters are designed to improve scalability, but the system is not completely automatic.
When you declare a queue on a node in a cluster, the queue is only created on that one node. So, if you have one queue, the messages will end up on the node where the queue resides, regardless of which node you publish to.
To properly use RabbitMQ clusters, you need to make sure you do the following things:
have multiple queues distributed across the nodes, such that work is distributed somewhat evenly,
connect your clients to different nodes (otherwise, you might end up funneling all messages through one node), and
if you can, try to have publishers/consumers connect to the node which holds the queue they're using (in order to minimize message transfers within the cluster).
Alternatively, have a look at High Availability Queues. They're like normal queues, but the queue contents are mirrored across several nodes. So, in your case, you would publish to one node, RabbitMQ will mirror the publishes to the other node, and consumers will be able to connect to either node without worrying about bogging down the cluster with internal transfers.
That is not really true. Check out the documentation on that subject.
Messages published to the queue are replicated to all mirrors. Consumers are connected to the master regardless of which node they connect to, with mirrors dropping messages that have been acknowledged at the master. Queue mirroring therefore enhances availability, but does not distribute load across nodes (all participating nodes each do all the work).
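For completeness, mirroring is enabled via a policy rather than at queue declaration time. A sketch, where the policy name and the queue-name pattern are just examples (and note that classic mirrored queues are deprecated in recent RabbitMQ releases in favour of quorum queues):

rabbitmqctl set_policy ha-all "^ha\." '{"ha-mode":"all","ha-sync-mode":"automatic"}'

This mirrors every queue whose name starts with "ha." across all nodes; as the quote above says, that buys availability, not load distribution.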
