WebSphere 7 cluster

I have a WebSphere 7 cluster with nodes running on different servers.
When a server hosting one of the nodes loses its network connection, it takes about a minute before WebSphere recognizes that the member is unavailable.
How can I speed up the status updates?
Update: the cluster is used only for EJB, and the EJBs are called from the local network.

I think this is always going to be a tradeoff between performance during normal operations and how quickly a down cluster member is detected.
See the article Understanding HTTP plug-in failover in a clustered environment, and the plugin-cfg.xml reference in the WebSphere 7 InfoCenter.
From the article, the answer will involve the ConnectTimeout, ServerIOTimeout, and RetryInterval settings, but note the warning:
In an environment with busy workload or a slow network connection, setting this value too low could make the HTTP plug-in mark a cluster member down falsely. Therefore, caution should be used whenever choosing a value for ConnectTimeout.
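For illustration, here is a minimal sketch of where those three settings live in plugin-cfg.xml, assuming a hypothetical two-member cluster (the cluster name, member names, hosts, and values below are made up; tune them to your own workload and network):

    <!-- Hypothetical plugin-cfg.xml fragment; names and values are examples only. -->
    <ServerCluster Name="EJBAppCluster" LoadBalance="Round Robin" RetryInterval="60">
        <!-- ConnectTimeout: seconds to wait for a TCP connection before marking the member down. -->
        <!-- ServerIOTimeout: seconds to wait for a response on an established connection. -->
        <Server Name="member1" ConnectTimeout="5" ServerIOTimeout="60">
            <Transport Hostname="node1.example.com" Port="9081" Protocol="http"/>
        </Server>
        <Server Name="member2" ConnectTimeout="5" ServerIOTimeout="60">
            <Transport Hostname="node2.example.com" Port="9081" Protocol="http"/>
        </Server>
        <!-- RetryInterval (on the cluster element): seconds before a marked-down member is retried. -->
    </ServerCluster>

Lowering ConnectTimeout and ServerIOTimeout makes a dead member get detected sooner, at the cost of the false-positive risk described in the warning above. Note that these settings govern the HTTP plug-in path that the referenced article centres on.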

Related

Mulesoft on-prem cluster has node in 'unknown' state

This happens from time to time: one of my nodes goes into an 'unknown' state. Where can I find technical information on how the cluster works? Specifically:
what controls the state in the cluster?
how do discovery and health information flow?
and what is the mechanism for consensus?
My cluster is made up of two machines around a shared Oracle database.
The status of the cluster depends on connectivity between the nodes.
The status in Runtime Manager depends on connectivity from the Runtime Manager Agent, installed on each node, to Runtime Manager in the Anypoint Platform. An 'unknown' status probably means the latter. There are several possible causes, such as network connectivity issues, bugs in older versions of the agent, expired certificates, etc.
I'm not quite sure what consensus you are referring to, but I don't think a consensus mechanism applies here. There is a quorum mechanism, but with only 2 nodes I don't think it is applicable.

What is the difference between failover and high availability?

According to my reading of the JBoss documentation, it says:
We define high availability as the ability for the system to continue functioning after failure of one or more of the servers. A part of high availability is failover which we define as the ability for client connections to migrate from one server to another in event of server failure so client applications can continue to operate.
Is failover part of high availability? How can we differentiate failover from high availability?
Failover is a means of achieving high availability (HA). Think of HA as a feature and failover as one possible implementation of that feature. Failover is not the only consideration when achieving HA.
For example, Cassandra achieves HA through replication, but the degree of availability is determined by data consistency settings. In essence, these settings dictate how many nodes need to respond for an action (a read or a write) to succeed. Requiring more nodes to respond means less availability, and requiring fewer nodes means more availability. That's an example of HA that has nothing to do with failover, strictly speaking.
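As a hedged illustration of that trade-off, here is a small sketch using the DataStax Java driver 4.x (the keyspace, table, and query are invented, and a replication factor of 3 is assumed): requiring QUORUM means two of the three replicas must answer, so one node can fail without affecting the request, whereas ALL would refuse to serve it.

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    public class ConsistencyDemo {
        public static void main(String[] args) {
            // Connects to a local contact point by default; hypothetical schema shop.orders.
            try (CqlSession session = CqlSession.builder().build()) {
                SimpleStatement read = SimpleStatement.builder(
                        "SELECT * FROM shop.orders WHERE id = 42")
                    .setConsistencyLevel(DefaultConsistencyLevel.QUORUM) // 2 of 3 replicas must respond
                    .build();
                session.execute(read);
            }
        }
    }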
High Availability
Refers to a server system that is in some way tolerant of failure.
Most of the time this is done with hardware redundancy: if a machine has redundant power supplies and one fails, the machine keeps running.
Failover
Then you have application redundancy (failover), which usually refers to the ability for an application running on multiple hardware installations to respond to clients in a consistent manner from any of those hardware installations. That way, if the hardware does totally fail, or the O/S dies on a particular machine, another machine can carry on.
SQL Server deals with application redundancy in four ways:
Clustering
Mirroring
Replication
Log Shipping
High availability (HA for short) is a broad term, so when I think about it I tend to think of HA clusters.
From Wikipedia High-availability cluster:
High-availability clusters are groups of computers that support server applications that can be reliably utilized with a minimum amount of down-time. They operate by using high availability software to harness redundant computers in groups or clusters that provide continued service when system components fail. Without clustering, if a server running a particular application crashes, the application will be unavailable until the crashed server is fixed.
So the takeaway from the description above is that HA clusters provide you with the minimum amount of down-time during a failover. Let me explain the two types of failover that HA clusters can provide:
Hot-Hot / Active-Active: The redundant computers are truly operating in parallel, producing the exact same state, and the exact same output. They are all active nodes, operating as a perfect mirror of each other. In this scenario, your failover down-time is zero, and you can simply pull the power plug from any machine in the cluster without any downtime or disruption to your service.
Hot-Warm / Active-Passive: Only one primary computer is active, while the other computers in the cluster are passively rebuilding the same state as the primary. When the primary computer fails, it has to be disabled or killed (automatically or by an operator) and then a passive computer from the cluster needs to be made active (automatically or by an operator).
So what is the catch? The catch is that applications that can operate in a HA cluster are not trivial to design as they need to be true deterministic finite-state machines. A classic problem is when your application needs to use the clock to build state based on time, as clocks are very non-deterministic by nature.
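A minimal sketch of that determinism rule, with invented class names: time is not read from the local clock inside the replica; it arrives as a field on each input event (stamped once, e.g. by whichever node sequences the events), so every replica that replays the same stream computes the exact same state.

    // Hypothetical replicated state machine core: state depends only on the event stream.
    enum EventType { NEW_ORDER, CANCEL }

    record Event(EventType type, long timestampMillis) { }

    final class OrderBook {
        private long lastEventTimeMillis; // identical on every replica
        private long openOrders;

        void onEvent(Event e) {
            // No System.currentTimeMillis() here: the timestamp travels with the event.
            lastEventTimeMillis = e.timestampMillis();
            switch (e.type()) {
                case NEW_ORDER -> openOrders++;
                case CANCEL    -> openOrders--;
            }
        }
    }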
Disclaimer: I am one of the developers of CoralSequencer.

Is there a way to make Asterisk reconnect calls when the internet connection is lost?

To be specific, I am using Asterisk with a Heartbeat active/passive cluster. There are 2 nodes in the cluster; let's call them Asterisk1 and Asterisk2. Everything is well configured in my cluster. When one of the nodes loses its internet connection, the Asterisk service fails, or Asterisk1 is turned off, the Asterisk service and the failover IP migrate to the surviving node (Asterisk2).
The problem is that if we were actually processing a call when Asterisk1 went down, Asterisk stops the call and I can't redial until the Asterisk service is up on Asterisk2 (5 seconds, not a bad time).
But my question is: is there a way to make Asterisk work like Skype does when it loses connection during a call? I mean, not stopping the call but trying to reconnect it, and reconnecting it once the Asterisk service is up on Asterisk2?
There are some commercial systems that support such behaviour.
If you want to do it on a non-commercial system, there are two ways:
1) Force a callback to all phones with the auto-answer flag set. Requirement: an Asterisk guru.
2) Use Xen and a memory mirroring system to maintain, on the other node, a VPS with the same memory state (i.e. the same running Asterisk). Requirement: a Xen guru. See for example: http://adrianotto.com/2009/11/remus-project-full-memory-mirroring/
Sorry, both methods require guru-level knowledge.
Note that if you run SIP through an OpenVPN tunnel, you very likely will not lose calls inside the tunnel if the internet goes down for up to 20 seconds. That is not exactly what you asked, but it can work.
Since there is no accepted answer after almost 2 years I'll provide one: NO. Here's why.
If you fail over from Asterisk server 1 to Asterisk server 2, then Asterisk server 2 has no idea which calls (i.e. endpoint to endpoint) were in progress, even if you share a database of called numbers, use Asterisk Realtime, etc. If Asterisk tried to bring up both legs of the call to the same numbers, these might not be the same endpoints of the call.
Another server cannot resume the SIP TCP session of the failed server, since that session was closed with the failed server.
Even if the source/destination addresses and ports were identical, your firewall will not know you are trying to continue the same session.
etc.....
If your goal is high availability of phone services, take a look at the VoIP Info web site. All the rest (network redundancy, disk redundancy, shared block storage devices, router failover protocols, etc.) is a distraction... focus instead on early DETECTION of failures across all trunks/routes/devices involved with providing phone service, and then on providing the highest degree of recovery without sharing ANY DEVICES. (Too many HA solutions share a disk, channel bank, etc. that creates a single point of failure.)
Your solution would require a shared database that is updated in real time on both servers. The database would be managed by an event logger that keeps track of all calls in progress, flagged as LINEUP perhaps. In the event a failure is detected, all calls that were on the failed server would be flagged as DROPPEDCALL. When your fail-over server spins up and takes over (using heartbeat monitoring or some such) the first thing it would do is generate a set of call files from all database records flagged as DROPPEDCALL. These calls can then be conferenced together.
The hardest part about it is the event monitor: ensuring that you don't miss any RING or HANGUP events, potentially leaving a "ghost" call in the system to be erroneously dialed in a recovery operation.
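As a hedged sketch of just the recovery step (the table, column names, connection string, and dialplan context below are invented for illustration), the fail-over node could turn each DROPPEDCALL row into an Asterisk call file dropped into the outgoing spool directory:

    import java.nio.file.*;
    import java.sql.*;

    // Hypothetical recovery job: rebuild dropped calls as Asterisk call files.
    // Assumes a table calls(id, caller_channel, callee_exten, status).
    public class DroppedCallRecovery {
        public static void main(String[] args) throws Exception {
            Path spool = Paths.get("/var/spool/asterisk/outgoing");
            try (Connection db = DriverManager.getConnection(
                     "jdbc:postgresql://dbhost/pbx", "ast", "secret");
                 Statement st = db.createStatement();
                 ResultSet rs = st.executeQuery(
                     "SELECT id, caller_channel, callee_exten FROM calls WHERE status = 'DROPPEDCALL'")) {
                while (rs.next()) {
                    String callFile =
                        "Channel: " + rs.getString("caller_channel") + "\n" + // e.g. SIP/1001
                        "Context: recovery-conference\n" +                    // dialplan context that bridges the legs
                        "Extension: " + rs.getString("callee_exten") + "\n" +
                        "Priority: 1\n";
                    // Write the file elsewhere first, then move it in, so Asterisk never sees a half-written file.
                    Path tmp = Files.writeString(Files.createTempFile("recover", ".call"), callFile);
                    Files.move(tmp, spool.resolve("recover-" + rs.getLong("id") + ".call"),
                               StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }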
You likely should also have a mechanism to build your Asterisk config on a "management" machine that then pushes changes out to your farm of call-manager AST boxen. That way any node is replaceable with any other.
What you should likely have is 2 DB servers using replication techniques and Linux High-Availability (LHA) (1). Alternately, DNS round-robin or load balancing with a "public" IP would do well, too. These machines will likely be under a light enough load to host your configuration manager as well, with the benefit of getting LHA for "free".
Then, at least N+1 AST boxen for call handling, where N is the number of calls you plan on handling per second divided by 300. The "+1" is your fail-over node. Using node polling, you can then set up a mechanism where the fail-over node adopts the identity of the failed machine by pulling the correct configuration from the config manager.
If hardware is cheap/free, then 1:1 LHA node redundancy is always an option. However, generally speaking, the failure rate for PC hardware and Asterisk software is fairly low; 3 or 4 "9s" out of the can. So, really, you're trying to get that last bit of distance to the "5th 9".
I hope that gives you some ideas about which way to go. Let me know if you have any questions, and please take the time to "accept" whichever answer does what you need.
(1) http://www.linuxjournal.com/content/ahead-pack-pacemaker-high-availability-stack

AppFabric Redundancy

We just tested an AppFabric cluster of 2 servers where we removed the "lead" server. The second server times out on any request to it with the error:
Microsoft.ApplicationServer.Caching.DataCacheException: ErrorCode<ERRCA0017>:SubStatus<ES0006>:
There is a temporary failure. Please retry later.
(One or more specified Cache servers are unavailable, which could be caused by busy network or servers. Ensure that security permission has been granted for this client account on the cluster and that the AppFabric Caching Service is allowed through the firewall on all cache hosts. Retry later.)
In practice this means that if one server in the cluster goes down, they all go down. (Note that we are not using Windows clustering, only linking multiple AppFabric cache servers to each other.)
I need the cluster to continue operating even if a single server goes down. How do I do this?
(I realize this question is borderline Server Fault material, but IMHO developers should know this.)
You'll have to install the AppFabric cache on at least three lead servers for the cache to survive a single server crash. The docs state that the cluster will only go down if the "majority" of the lead servers go down, but in the fine print, they explain that 1 out of 2 constitutes a majority. I've verified that removing a server from a three lead-node cluster works as advertised.
This is a typical distributed-systems concept: for a read or write quorum to be possible, an ensemble needs 2f + 1 servers in order to tolerate f failing servers. I think AppFabric, like any CP (in CAP-theorem terms) consensus-based system, needs this for the cluster to keep working.
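For a concrete, purely illustrative example of the arithmetic (this is not AppFabric API), a majority quorum needs strictly more than half of the lead hosts, which is why two lead hosts cannot tolerate losing either of them while three can tolerate losing one:

    // Illustrative only: majority quorum size for n lead hosts.
    public class Quorum {
        static int quorumSize(int leadHosts) {
            return leadHosts / 2 + 1;
        }

        public static void main(String[] args) {
            System.out.println(quorumSize(2)); // 2 -> losing either host breaks the cluster
            System.out.println(quorumSize(3)); // 2 -> one host may fail and quorum survives
        }
    }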
That's actually a problem with the AppFabric architecture, and it is rather confusing in terms of the "lead host" concept. The idea is that a majority of the lead hosts must be running for the cluster to stay up. So if you had three servers, you'd have at least two lead hosts constantly communicating with each other and eating up server resources, and if both go down the whole cluster fails. The alternative is a peer-to-peer architecture where all servers act as peers, meaning that even if two servers go down the cluster keeps functioning with no application downtime. Try NCache:
http://www.alachisoft.com/ncache/

Common Issues in Developing Cluster Aware non-web-based Enterprise Applications

I have to move a Windows-based multi-threaded application (which uses global variables as well as an RDBMS for storage) to an NLB (i.e., Network Load Balancing) cluster. The common architectural issues that immediately come to mind are:
Global variables (which are both read and written) will have to be moved to shared storage. What are the best practices here? Is there anything available in the Windows Clustering API to manage such things?
My application uses sockets, and persistent connections are the norm in the field I work in. I believe persistent connections cannot be load balanced. Again, what are the architectural recommendations in this regard?
I'll answer the persistent connection part of the question first since it's easier. All good network load-balancing solutions (including Microsoft's NLB service built into Windows Server, but also including load balancing devices like F5 BigIP) have the ability to "stick" individual connections from clients to particular cluster nodes for the duration of the connection. In Microsoft's NLB this is called "Single Affinity", while other load balancers call it "Sticky Sessions". Sometimes there are caveats (for example, Microsoft's NLB will break connections if a new member is added to the cluster, although a single connection is never moved from one host to another).
Regarding global variables: they are the bane of load-balanced systems. Most designers of load-balanced apps will do a lot of re-architecture to minimize dependence on shared state, since it impedes the scalability and availability of a load-balanced application. Most of these approaches come down to a two-step strategy: first, move shared state to a highly-available location, and second, change the app to minimize the number of times that shared state must be accessed.
Most clustered apps I've seen will store shared state (even shared, volatile state like global variables) in an RDBMS. This is mostly out of convenience. You can also use an in-memory database for maximum performance. But the simplicity of using an RDBMS for all shared state (transient and durable), plus the use of existing database tools for high-availability, tends to work out for many services. Perf of an RDBMS is of course orders of magnitude slower than global variables in memory, but if shared state is small you'll be reading out of the RDBMS's cache anyways, and if you're making a network hop to read/write the data the difference is relatively less. You can also make a big difference by optimizing your database schema for fast reading/writing, for example by removing unneeded indexes and using NOLOCK for all read queries where exact, up-to-the-millisecond accuracy is not required.
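A minimal sketch of that kind of read, assuming SQL Server and a hypothetical SharedState table (the JDBC URL, schema, and method are made up): the NOLOCK hint trades read accuracy for reduced blocking.

    import java.sql.*;

    // Hypothetical dirty-read of shared state: acceptable only where slightly
    // stale or uncommitted values are tolerable.
    public class SharedStateReader {
        public static int readCounter(String name) throws SQLException {
            String url = "jdbc:sqlserver://dbhost;databaseName=appstate;integratedSecurity=true";
            try (Connection c = DriverManager.getConnection(url);
                 PreparedStatement ps = c.prepareStatement(
                     "SELECT value FROM SharedState WITH (NOLOCK) WHERE name = ?")) {
                ps.setString(1, name);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getInt("value") : 0;
                }
            }
        }
    }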
I'm not saying an RDBMS will always be the best solution for shared state, only that improving shared-state access times are usually not the way that load-balanced apps get their performance-- instead, they get performance by removing the need to synchronously access (and, especially, write to) shared state on every request. That's the second thing I noted above: changing your app to reduce dependence on shared state.
For example, for simple "counters" and similar metrics, apps will often queue up their updates and have a single thread in charge of updating shared state asynchronously from the queue.
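A hedged sketch of that pattern, with invented names: the request path only enqueues a delta, and a single background thread is the only writer of the shared state, so requests never block on it.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical asynchronous counter: requests enqueue cheaply; one updater
    // thread flushes deltas to the shared store (database, cache, ...).
    public class AsyncCounter {
        private final BlockingQueue<Integer> deltas = new LinkedBlockingQueue<>();

        public AsyncCounter() {
            Thread updater = new Thread(() -> {
                try {
                    while (true) {
                        int delta = deltas.take();   // block until there is work
                        writeToSharedState(delta);   // single writer -> no contention
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "counter-updater");
            updater.setDaemon(true);
            updater.start();
        }

        public void increment() { deltas.add(1); }   // called on the request path

        private void writeToSharedState(int delta) {
            // placeholder: e.g. UPDATE Counters SET value = value + ? WHERE name = ...
        }
    }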
For more complex cases, apps may switch from Pessimistic Concurrency (checking that a resource is available beforehand) to Optimistic Concurrency (assuming it's available, and then backing out the work later if you ended up, for example, selling the same item to two different clients!).
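As a hedged illustration of the optimistic variant (the items table and its columns are invented), each row carries a version number and the write succeeds only if nobody changed the row in the meantime; a failed update is the signal to retry or compensate.

    import java.sql.*;

    // Hypothetical optimistic update: read the version, do the work, then write
    // back only if the version is unchanged.
    public class OptimisticStockUpdate {
        public static boolean sellOne(Connection db, long itemId) throws SQLException {
            int stock, version;
            try (PreparedStatement read = db.prepareStatement(
                    "SELECT stock, version FROM items WHERE id = ?")) {
                read.setLong(1, itemId);
                try (ResultSet rs = read.executeQuery()) {
                    if (!rs.next()) return false;
                    stock = rs.getInt("stock");
                    version = rs.getInt("version");
                }
            }
            if (stock < 1) return false;
            try (PreparedStatement write = db.prepareStatement(
                    "UPDATE items SET stock = ?, version = version + 1 WHERE id = ? AND version = ?")) {
                write.setInt(1, stock - 1);
                write.setLong(2, itemId);
                write.setInt(3, version);
                return write.executeUpdate() == 1; // 0 rows -> another client won; retry or back out
            }
        }
    }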
Net-net, in load-balanced situations, brute force solutions often don't work as well as thinking creatively about your dependency on shared state and coming up with inventive ways to prevent having to wait for synchronous reading or writing shared state on every request.
I would not bother with using MSCS (Microsoft Cluster Service) in your scenario. MSCS is a failover solution, meaning it's good at keeping a one-server app highly available even if one of the cluster nodes goes down, but you won't get the scalability and simplicity you'll get from a true load-balanced service. I suspect MSCS does have ways to share state (on a shared disk) but they require setting up an MSCS cluster which involves setting up failover, using a shared disk, and other complexity which isn't appropriate for most load-balanced apps. You're better off using a database or a specialized in-memory solution to store your shared state.
Regarding persistent connections, look into the port rules, because port rules determine which TCP/IP ports are handled and how.
MSDN:
When a port rule uses multiple-host load balancing, one of three client affinity modes is selected. When no client affinity mode is selected, Network Load Balancing load-balances client traffic from one IP address and different source ports on multiple-cluster hosts. This maximizes the granularity of load balancing and minimizes response time to clients. To assist in managing client sessions, the default single-client affinity mode load-balances all network traffic from a given client's IP address on a single-cluster host. The class C affinity mode further constrains this to load-balance all client traffic from a single class C address space.
In an ASP.NET app, what allows session state to persist is having the client affinity setting enabled: NLB directs all TCP connections from one client IP address to the same cluster host, which allows session state to be maintained in host memory.
The client affinity parameter makes sure that a connection always routes to the server it initially landed on, thereby maintaining the application state.
Therefore I believe the same would happen for your Windows-based multi-threaded app if you use the affinity parameter.
The articles Network Load Balancing Best Practices and Web Farming with the Network Load Balancing Service in Windows Server 2003 might give you some insight.
Concurrency (Check out Apache Cassandra, et al)
Speed of light issues (if going cross-country or international you'll want heavy use of transactions)
Backups and deduplication (Companies like FalconStor or EMC can help here in a distributed system. I wouldn't underestimate the need for consulting here)
