What's the difference between failover and takeover in HA? - high-availability

The two concepts seem so similar. Is there any difference between them?
In what situations is each term used?

I've seen them used interchangeably, so some confusion exists. This is how I use them:
Takeover is normally a dynamic action associated with an Active/Active configuration: one active node takes over pending or in-progress work from the other (failed) active node, often with minimal or no outage impact.
Failover is normally associated with an Active/Passive configuration and implies some level of automated or manual action is required for the passive node to begin processing work if the active node fails.

Related

ActiveMQ Artemis - how does slave become live if there's only one master in the group?

I'm trying to understand a couple of items relating to failover in the ActiveMQ Artemis documentation. Specifically, there is a section (I'm sure I'm reading it wrong) that makes it seem as if it's impossible for a slave to take over for a master:
Specifically, the backup will become active when it loses connection to its live server. This can be problematic because this can also happen because of a temporary network problem. In order to address this issue, the backup will try to determine whether it still can connect to the other servers in the cluster. If it can connect to more than half the servers, it will become active, if more than half the servers also disappeared with the live, the backup will wait and try reconnecting with the live. This avoids a split brain situation
If there is only one other server, the master, the slave will not be able to connect to it. Since that is 100% of the other servers, it will stay passive. How can this work?
I did see that a pluggable quorum vote replication could be configured, but before I delve into that, I'd like to know what I'm missing here.
When using replication with only a single primary/backup pair there is no mitigation against split brain. When the backup loses its connection with the primary it will activate since it knows there are no other primary brokers in the cluster. Otherwise it would never activate, as you note.
The documentation should be clarified to remove this ambiguity.
Lastly, the documentation you referenced does not appear to be the most recent; the latest documentation reads slightly differently from what you quoted (although it still contains this ambiguity).
Generally speaking, a single primary/backup pair with replication is only recommended with the new pluggable quorum voting since the risk of split brain is so high otherwise.
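For reference, here is a minimal broker.xml sketch of the replicated primary/backup pair being discussed (element names vary by Artemis version; newer releases use primary/backup and the pluggable quorum manager instead of master/slave):

    <!-- live (primary) broker.xml -->
    <ha-policy>
      <replication>
        <master>
          <!-- on restart, check for a current live server before activating -->
          <check-for-live-server>true</check-for-live-server>
        </master>
      </replication>
    </ha-policy>

    <!-- backup broker.xml -->
    <ha-policy>
      <replication>
        <slave>
          <!-- hand activation back to the original live when it returns -->
          <allow-failback>true</allow-failback>
        </slave>
      </replication>
    </ha-policy>

With a lone pair like this there are no other servers to poll, so the quorum-counting behaviour quoted above does not come into play; the backup simply activates when the replication connection drops.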

Mulesoft on-prem cluster has node in 'unknown' state

From time to time one of my nodes goes into an 'unknown' state. Where can I get technical information on how the cluster works? Specifically:
what controls the state in the cluster?
how do discovery and health information flow?
and what is the mechanism for consensus?
My cluster consists of two machines sharing an Oracle database.
The status of the cluster depends on connectivity between the nodes.
The status in Runtime Manager depends on connectivity from the Runtime Manager Agent, installed on each node, to Runtime Manager in the Anypoint Platform. An 'unknown' status probably means the latter. There are several possible causes, such as network connectivity issues, bugs in older versions of the agent, expired certificates, etc.
I'm not quite sure what consensus you are referring to, but I don't think a consensus mechanism applies here. There is a quorum mechanism, but with only 2 nodes I don't think it is applicable.

What is the recommended way of creating a distributed Lock with Redis on Azure?

I'm looking to create a distributed Lock within Redis on Azure for our multi-instance Worker Role. I need a way of creating "critical sections" for which only a single thread can have access at a time across multiple-instances of the Worker Role.
I am using the StackExchange.Redis client to do this and, helpfully, it already has an implementation of LockTake/LockRelease, and this answer on SO gives me a good idea of the pattern to use and details about how to create a lock.
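For concreteness, here is a rough sketch of the single-instance pattern I mean (the host name, key and timeout are placeholders):

    using System;
    using StackExchange.Redis;

    class SingleInstanceLockSketch
    {
        static void Main()
        {
            // Placeholder connection string; an Azure cache would also need ssl and the access key.
            var db = ConnectionMultiplexer.Connect("mycache.example.com:6379").GetDatabase();

            RedisKey key = "lock:my-critical-section";
            RedisValue token = Guid.NewGuid().ToString(); // identifies this lock holder

            if (db.LockTake(key, token, TimeSpan.FromSeconds(30)))
            {
                try
                {
                    // critical section
                }
                finally
                {
                    db.LockRelease(key, token); // released only by the holder of this token
                }
            }
        }
    }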
Reading further around the subject, I also read this Redis article regarding distlock which describes the weaknesses of failover-based Redis nodes when trying to implement a distributed lock mechanism.
The Azure Redis cache implements master/slave failover (apart from the Basic tier) so does this mean that I will need to implement the redlock pattern in order to guarantee that only one thing will ever have the lock?
Additionally, I am wondering:
Why do Azure Redis example connection strings not seem to list the master and slave in them? Has Azure implemented master/slave failover in a different way?
Why has one .NET implementation of redlock chosen not to support using master/slaves (see the Usage section, first paragraph)? Is this by choice, or is it because master/slave is not a valid usage of redlock (that would not seem to be the case in the redis article)?
I'm the author of the RedLock.net library that you linked in your question. The reason the documentation specifies connecting to independent redis instances is based on the reasoning in the Redis Distlock documentation. By forcing writes only to master nodes, we hopefully avoid the situation where a user might misconfigure Redlock to connect to multiple replicated hosts.
According to Azure Redis Cache 103 - Failover and Monitoring there is a load balancer in front of an Azure Redis Cache (at the standard tier and above) that ensures that you are always connected to the master.
Connecting to multiple redis instances (either replicated or not) should give a fairly good guarantee that no two processes end up holding the lock at the same time (more so than a single replicated instance).
In order for another process to 'steal' the lock before the first had finished, more than half of the independent redis instances would need to lose their lock keys (e.g. by restarting without persistence), and then the second process would have to gain the lock before the first process's extend timer reacquired it.
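For reference, a minimal sketch of the multi-endpoint usage described above, based on the factory/lock API in the library's README (the host names are placeholders; check the README of the version you use for the exact namespaces):

    using System;
    using System.Collections.Generic;
    using System.Threading.Tasks;
    using RedLockNet.SERedis;
    using RedLockNet.SERedis.Configuration;
    using StackExchange.Redis;

    class RedLockSketch
    {
        static async Task Main()
        {
            // Three independent, non-replicated Redis endpoints -- placeholders.
            var multiplexers = new List<RedLockMultiplexer>
            {
                ConnectionMultiplexer.Connect("redis1.example.com:6379"),
                ConnectionMultiplexer.Connect("redis2.example.com:6379"),
                ConnectionMultiplexer.Connect("redis3.example.com:6379")
            };

            using (var redlockFactory = RedLockFactory.Create(multiplexers))
            {
                // The lock expires after 'expiry' unless the factory extends it in the background.
                var expiry = TimeSpan.FromSeconds(30);

                using (var redLock = await redlockFactory.CreateLockAsync("my-critical-section", expiry))
                {
                    if (redLock.IsAcquired)
                    {
                        // Only one worker instance should reach this point at a time.
                    }
                }
            }
        }
    }

On Azure that means pointing each multiplexer at a separate, independent cache rather than at the replicas of a single cache.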

Is there a way for Asterisk to reconnect calls when the internet connection is lost

To be specific, I am using Asterisk with a Heartbeat active/passive cluster. There are 2 nodes in the cluster; let's call them Asterisk1 and Asterisk2. Everything is well configured in my cluster. When one of the nodes loses its internet connection, the Asterisk service fails, or Asterisk1 is turned off, the Asterisk service and the failover IP migrate to the surviving node (Asterisk2).
The problem is that if we were actually processing a call when Asterisk1 went down, Asterisk drops the call and I cannot redial until the Asterisk service is up on Asterisk2 (5 seconds, not a bad time).
But my question is: is there a way to make Asterisk behave like Skype when it loses the connection during a call? I mean, not dropping the call, but trying to reconnect it, and reconnecting it once the Asterisk service is up on Asterisk2?
There are some commercial systems that support such behaviour.
If you want to do it on a non-commercial system, there are 2 ways:
1) Force a callback to all phones with the auto-answer flag. Requirement: guru-level knowledge of Asterisk.
2) Use Xen and a memory mapping/mirroring system to maintain a VPS on the other node with the same memory state (the same running Asterisk). Requirement: guru-level knowledge of Xen. See for example: http://adrianotto.com/2009/11/remus-project-full-memory-mirroring/
Sorry, both methods require guru-level knowledge.
Note: if you run SIP through an OpenVPN tunnel, you will very likely not lose calls inside the tunnel if the internet goes down for up to 20 seconds. That is not exactly what you asked, but it can work.
Since there is no accepted answer after almost 2 years I'll provide one: NO. Here's why.
If you fail over from Asterisk server 1 to Asterisk server 2, then Asterisk server 2 has no idea what calls (i.e. endpoint to endpoint) were in progress (even if you share a database of called numbers, use Asterisk Realtime, etc.). If Asterisk tried to bring up both legs of the call to the same numbers, these might not be the same endpoints of the call.
Another server cannot resume the SIP TCP session of the failed server, since that session was closed with the failed server.
The source/destination addresses and ports may be identical, and your firewall will not know you are trying to continue the same session.
etc.
If your goal is high availability of phone services, take a look at the VoIP Info web site. All the rest (network redundancy, disk redundancy, shared block storage devices, router failover protocols, etc.) is a distraction; focus instead on early DETECTION of failures across all trunks/routes/devices involved in providing phone service, and then on providing the highest degree of recovery without sharing ANY DEVICES. (Too many HA solutions share a disk, channel bank, etc. that creates a single point of failure.)
Your solution would require a shared database that is updated in realtime on both servers. The database would be managed by an event logger that keeps track of all calls in progress, flagged as LINEUP perhaps. In the event a failure is detected, all calls that were on the failed server would be flagged as DROPPEDCALL. When your fail-over server spins up and takes over -- using heartbeat monitoring or some such -- the first thing it would do is generate a set of call files from all database records flagged as DROPPEDCALL. These calls can then be conferenced together; see the sketch below.
The hardest part about it is the event monitor: ensuring that you don't miss any RING or HANGUP events, potentially leaving a "ghost" call in the system to be erroneously dialed in a recovery operation.
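As a rough illustration of that recovery step, a hypothetical call file the fail-over server could drop into /var/spool/asterisk/outgoing for each DROPPEDCALL record (the channel, caller ID and conference name are placeholders):

    Channel: SIP/1001
    CallerID: "Call Recovery" <5550100>
    MaxRetries: 2
    RetryTime: 10
    WaitTime: 30
    Application: ConfBridge
    Data: recovered-call-42

Asterisk originates a call on the given channel and, once answered, drops it into the named conference, so two recovered legs written with the same conference name end up bridged together.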
You likely should also have a mechanism to build your Asterisk config on a "management" machine that then pushes changes out to your farm of call-manager AST boxen. That way any node is replaceable with any other.
What you should likely have is 2 DB servers using replication techniques and Linux High-Availability (LHA) (1). Alternately, DNS round-robin or load balancing with a "public" IP would do well, too. These machines will likely be under light enough load to host your configuration manager as well, with the benefit of getting LHA for "free".
Then, at least N+1 AST Boxen for call handling. N is the number of calls you plan on handling per second divided by 300. The "+1" is your fail-over node. Using node-polling, you can then set up a mechanism where the fail-over node adopts the identity of the failed machine by pulling the correct configuration from the config manager.
If hardware is cheap/free, then 1:1 LHA node redundancy is always an option. However, generally speaking, the failure rate for PC hardware and Asterisk software is fairly low; 3 or 4 "9s" out of the can. So, really, you're trying to get that last bit of distance to the "5th 9".
I hope that gives you some ideas about which way to go. Let me know if you have any questions, and please take the time to "accept" whichever answer does what you need.
(1) http://www.linuxjournal.com/content/ahead-pack-pacemaker-high-availability-stack

WebSphere 7 cluster

I have a WebSphere 7 cluster with nodes running on different servers.
When a server with one node loses its connection to the network, it takes about a minute before WebSphere knows that the member is unavailable.
How can I speed up the status updates?
UPD: The cluster is used only for EJB. The EJBs are called from the local network.
I think this is always going to be a tradeoff between performance during normal operations and how quickly a down cluster member is detected.
See this article, Understanding HTTP plug-in failover in a clustered environment and this plugin-cfg.xml reference in the WebSphere 7 InfoCenter.
From the article, the answer will involve the ConnectTimeout, ServerIOTimeout, and RetryInterval settings, but note the warning that:
In an environment with busy workload or a slow network connection, setting this value too low could make the HTTP plug-in mark a cluster member down falsely. Therefore, caution should be used whenever choosing a value for ConnectTimeout.
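As a rough sketch of where those settings live in plugin-cfg.xml for the HTTP plug-in scenario the article covers (cluster, server and host names are placeholders), RetryInterval sits on the ServerCluster element while ConnectTimeout and ServerIOTimeout sit on each Server element:

    <ServerCluster Name="MyCluster" LoadBalance="Round Robin" RetryInterval="60">
      <Server Name="node1_member1" ConnectTimeout="5" ServerIOTimeout="60">
        <Transport Hostname="node1.example.com" Port="9080" Protocol="http"/>
      </Server>
      <Server Name="node2_member2" ConnectTimeout="5" ServerIOTimeout="60">
        <Transport Hostname="node2.example.com" Port="9080" Protocol="http"/>
      </Server>
    </ServerCluster>

Lower timeouts detect a down member sooner but, as the warning above says, risk marking a busy member down falsely.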
