DataPower B2BGW with HA pollers - high-availability

In DP XB62, a B2B Persistence store can be set up to run in an HA configuration with a primary node that has write access and a standby/slave node with read-only access. This is tightly connected with Virtual IPs and standby control. This works fine for inbound connections (HTTP, for instance), but how can I put pollers under active/standby control?
That is, the MQ, SFTP, and FTP polling Front Side Handlers should be deactivated when the appliance is in standby mode (and the B2B persistence store is in standby mode).
Can this be achieved in XB62 firmware 6.0.0.2?

Sorry, no, it can't...
As the standby control of the DataPower boxes isn't a "real" cluster, it won't deactivate the passive box; it merely removes the IP address from it.
The pollers will still poll on the standby box, and there is unfortunately no way around that.
For customers who want all processing to be done on one box, I normally set up a "poller MPGW" that has its backside set to the VIP. That way, anything a poller picks up is sent to the "active" box and the processing happens there.
This is most convenient if, for example, you only want to monitor a single B2B Transaction Viewer.
I have also been testing a few scripts that enable/disable the FSHs depending on the events sent at failover, but I have found that there are a few too many events to monitor for that to be a "safe" approach...

Related

How to Route Messages to Microservice Instances Dynamically Based on Key/Value?

I'm building a system where client IoT devices will be making persistent websocket connections to a single instance of a microservice. We'll call it the "hardware gateway". End devices will connect to one of these service instances and may migrate between instances at any time (perhaps due to a reboot or network interruption).
Other services will be pushing notifications to these hardware clients via some hardware gateway instance. I need a way to route these requests to the specific instance that is maintaining a connection to a specific IoT device. At the moment, my solution is to maintain an external KV store where I can map an IoT device's UUID to a service instance, but that puts an extra dependency on all other services to know about this KV store. Not to mention the additional latency introduced by this query.
Maybe there's some reverse proxy that allows me to dynamically update its matching criteria? I've also looked into using a message broker like RabbitMQ, but it doesn't seem to support this use case.
There's a reasonable solution in JVM land for this: Akka.
The instances form an Akka cluster. When a device makes a websocket connection, an actor is spawned to handle the interactions over the websocket. That actor registers itself, with a cluster-sharded actor keyed by the device's ID, as the actor interacting with the device (and likely re-registers with the sharded actor periodically). As instances are deployed, etc., the cluster rebalances. An important feature of this is that the service is stateful, but the instances deploy in a way that looks stateless to the outside world: requests can go to any node.
For pushing notifications to the devices, the HTTP endpoint or message-bus consumer in the service looks up the cluster-sharded actor, which forwards the notification to the websocket actor (you'll want to think about whether you want at-least-once or at-most-once delivery, which will govern whether some portion of the cluster-sharded actor should be persistent).
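A minimal sketch of the pattern using the Akka Typed Java API (Akka 2.6); it assumes the usual cluster configuration (cluster actor provider, seed nodes) is in application.conf, and the DeviceCommand/PushNotification types and deviceSession behavior are illustrative stand-ins for the real websocket-handling actor, not part of the original answer:

    import akka.actor.typed.ActorSystem;
    import akka.actor.typed.Behavior;
    import akka.actor.typed.javadsl.Behaviors;
    import akka.cluster.sharding.typed.javadsl.ClusterSharding;
    import akka.cluster.sharding.typed.javadsl.Entity;
    import akka.cluster.sharding.typed.javadsl.EntityRef;
    import akka.cluster.sharding.typed.javadsl.EntityTypeKey;

    public class HardwareGateway {

        // One sharded entity per IoT device, keyed by the device's UUID.
        interface DeviceCommand {}
        public static final class PushNotification implements DeviceCommand {
            public final String payload;
            public PushNotification(String payload) { this.payload = payload; }
        }

        public static final EntityTypeKey<DeviceCommand> DEVICE_KEY =
            EntityTypeKey.create(DeviceCommand.class, "Device");

        // Hypothetical session behavior: in a real gateway this actor would hold
        // the websocket connection and write notifications out to it.
        static Behavior<DeviceCommand> deviceSession(String deviceId) {
            return Behaviors.receive(DeviceCommand.class)
                .onMessage(PushNotification.class, msg -> {
                    System.out.println("device " + deviceId + " <- " + msg.payload);
                    return Behaviors.same();
                })
                .build();
        }

        public static void main(String[] args) {
            ActorSystem<Void> system = ActorSystem.create(Behaviors.empty(), "gateway");

            // Initialize sharding on every instance; the cluster decides which
            // node actually hosts a given device entity and rebalances on deploys.
            ClusterSharding sharding = ClusterSharding.get(system);
            sharding.init(Entity.of(DEVICE_KEY, ctx -> deviceSession(ctx.getEntityId())));

            // Any node can route a notification: look it up by device UUID and
            // the cluster forwards it to whichever instance hosts that actor.
            EntityRef<DeviceCommand> device =
                sharding.entityRefFor(DEVICE_KEY, "4f8e-device-uuid");
            device.tell(new PushNotification("reboot-scheduled"));
        }
    }

The point of the sketch is the last two calls: no external KV store is needed, because the sharding coordinator already knows which node owns the entity for a given device ID.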

Consul connectivity in network partition(DMZ zone)

Suppose a data centre has network partitions (e.g. a DMZ zone), so that some sets of hosts can't contact other sets of hosts. If I want to propagate a message to all hosts in the datacenter, can gossip/Consul work for this use case?
One solution I am considering: all hosts in the DMZ zones could be allowed to connect to the Consul servers (a few hosts only). Some sets of hosts still couldn't contact other sets of hosts, but every host in the datacenter could talk to the Consul servers. I am not sure, however, whether even with this a message can be propagated to all the hosts in the datacenter.
Gossip is just used by Consul, which in turn is just used for service registration, service discovery, and key/value data related to configuration.
The Event mechanism is probably what you want; from the python-consul docs:
Event = <class 'consul.base.Consul.Event'>
The event command provides a mechanism to fire a custom user event to an entire datacenter. These events are opaque to Consul, but they can be used to build scripting infrastructure to do automated deploys, restart services, or perform any other orchestration action.
Unlike most Consul data, which is replicated using consensus, event data is purely peer-to-peer over gossip. This means it is not persisted and does not have a total ordering. In practice, this means you cannot rely on the order of message delivery.
An advantage, however, is that events can still be used even in the absence of server nodes or during an outage.
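For example, firing such an event through the local agent's HTTP API (PUT /v1/event/fire/<name>) looks roughly like this; the sketch assumes a Consul agent listening on localhost:8500, and the event name "deploy" and payload are just illustrations:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class FireConsulEvent {
        public static void main(String[] args) throws Exception {
            // Ask the local agent to propagate a user event to the whole
            // datacenter over gossip; the body is an opaque payload.
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8500/v1/event/fire/deploy"))
                .PUT(HttpRequest.BodyPublishers.ofString("release-42"))
                .build();

            HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }

On the receiving side, each host would typically run a local Consul agent and react with a watch, e.g. `consul watch -type=event -name=deploy /usr/local/bin/handler.sh` (the handler path is illustrative).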

Connecting to a Grid Cluster With GridGain

I know that out of the box that GridGain connects to the other clients through multicast, but is there a way to configure GridGain to accept connections outside of the local network? Also is there a way to enable encryption for the communication as well?
The Discovery SPI and Communication SPI allow you to plug in alternative discovery and communication mechanisms.
For more detail, refer to the comprehensive API documentation (GridGain 3).
This is necessary on Amazon EC2, which doesn't support multicast. Here's an article discussing this setup.
Multicast only works well within a certain network segment (and in some cases it isn't even allowed, for security reasons). So if you want to connect nodes to your grid that are outside your local network, you have to resort to other transports such as JMS or mail (if performance is an issue, you might get away with unicast/static IPs and JGroups).
I think that encryption is possible with both the JMS and mail transport, depending on your message broker and mail setup.
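As an illustration of the static-IP discovery approach, here is roughly what it looks like in the current Apache Ignite API (the open-source descendant of GridGain; GridGain 3's own SPI classes differ, so treat the class names and addresses below as an approximation, not the exact 3.x API):

    import java.util.Arrays;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
    import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

    public class StaticDiscoveryNode {
        public static void main(String[] args) {
            // Replace multicast discovery with a static list of node addresses,
            // which is what you need on networks (like EC2) without multicast.
            TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
            ipFinder.setAddresses(Arrays.asList("10.0.1.10:47500..47509",
                                                "10.0.1.11:47500..47509"));

            TcpDiscoverySpi discovery = new TcpDiscoverySpi();
            discovery.setIpFinder(ipFinder);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setDiscoverySpi(discovery);

            // Node-to-node encryption can be enabled separately via
            // cfg.setSslContextFactory(...) if required.
            Ignite ignite = Ignition.start(cfg);
            System.out.println("Joined topology of size " +
                               ignite.cluster().nodes().size());
        }
    }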

When would you need multiple servers to host one web application?

Is that called "clustering" of servers? When a web request is sent, does it go through the main server, and if the main server can't handle the extra load, then it forwards it to the secondary servers that can handle the load? Also, is one "server" that's up and running the application called an "instance"?
[...] Is that called "clustering" of servers?
Clustering is indeed the transparent use of multiple nodes that are seen as a single entity: the cluster. Clustering allows you to scale: you can spread your load over all the nodes and, if you need more power, you can add more nodes (short version). Clustering also allows you to be fault tolerant: if one node (physical or logical) goes down, the other nodes can still process requests and your service remains available (short version).
When a web request is sent, does it go through the main server, and if the main server can't handle the extra load, then it forwards it to the secondary servers that can handle the load?
In general, this is the job of a dedicated component called a "load balancer" (hardware or software) that can use many algorithms to balance requests: round-robin, FIFO, LIFO, load based...
In the case of EC2, you previously had to load balance with round-robin DNS and/or HA Proxy. See Introduction to Software Load Balancing with Amazon EC2. But for some time now, Amazon has launched load balancing and auto-scaling (beta) as part of their EC2 offerings. See Elastic Load Balancing.
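As a toy illustration of the simplest of those algorithms, round-robin balancing just cycles through the list of backends (a sketch; the backend addresses are made up):

    import java.util.List;
    import java.util.concurrent.atomic.AtomicLong;

    // Minimal round-robin selection: each request is handed to the next backend
    // in the list, wrapping around, so load spreads evenly across the cluster.
    public class RoundRobinBalancer {
        private final List<String> backends;
        private final AtomicLong counter = new AtomicLong();

        public RoundRobinBalancer(List<String> backends) {
            this.backends = backends;
        }

        public String nextBackend() {
            int index = (int) Math.floorMod(counter.getAndIncrement(), backends.size());
            return backends.get(index);
        }

        public static void main(String[] args) {
            RoundRobinBalancer lb = new RoundRobinBalancer(
                List.of("10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"));
            for (int i = 0; i < 5; i++) {
                System.out.println("request " + i + " -> " + lb.nextBackend());
            }
        }
    }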
Also, is one "server" that's up and running the application called an "instance"?
Actually, an instance can be many things (depending on who's speaking): a machine, a virtual machine, a server (software) up and running, etc.
In the case of EC2, you might want to read Amazon EC2 Instance Types.
Here is a real example:
This specific configuration is hosted at RackSpace in their Managed Colo group.
Requests pass through a Cisco firewall. They are then routed across a Gigabit LAN to a Cisco CSS 11501 Content Services Switch (i.e. a load balancer). The load balancer matches the incoming content to a content rule, handles the SSL decryption if necessary, and then forwards the traffic to one of several back-end web servers.
Every 5 seconds, the load balancer requests a URL on each webserver. If the webserver fails (two times in a row, IIRC) to respond with the correct value, that server is not sent any traffic until the URL starts responding correctly.
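That probe logic is easy to reproduce yourself; a rough sketch of it (the health-check URL, 5-second interval, and two-consecutive-failure threshold here are illustrative, matching the description above rather than the CSS 11501's actual configuration):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    // Polls a health URL on one backend every 5 seconds and marks the backend
    // down after two consecutive failures, up again once it responds correctly.
    public class HealthProbe {
        public static void main(String[] args) throws InterruptedException {
            String healthUrl = "http://10.0.0.1:8080/healthcheck"; // hypothetical
            HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();

            int consecutiveFailures = 0;
            boolean inRotation = true;

            while (true) {
                boolean ok;
                try {
                    HttpResponse<String> response = client.send(
                        HttpRequest.newBuilder(URI.create(healthUrl))
                            .timeout(Duration.ofSeconds(2))
                            .GET().build(),
                        HttpResponse.BodyHandlers.ofString());
                    ok = response.statusCode() == 200
                         && "OK".equals(response.body().trim());
                } catch (Exception e) {
                    ok = false;
                }

                consecutiveFailures = ok ? 0 : consecutiveFailures + 1;
                if (!ok && consecutiveFailures >= 2 && inRotation) {
                    inRotation = false;   // stop sending traffic to this server
                    System.out.println("backend marked DOWN");
                } else if (ok && !inRotation) {
                    inRotation = true;    // URL responds correctly again
                    System.out.println("backend marked UP");
                }
                Thread.sleep(5_000);
            }
        }
    }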
Further behind the webservers is a MySQL master/slave configuration. Connections may be made to the master (for transactions) or to the slaves for read-only requests.
Memcached is installed on each of the webservers, with 1 GB of RAM dedicated to caching. Each web application may utilize the cluster of memcache servers to cache all kinds of content.
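As a sketch of what that cache usage looks like from the application side (using the spymemcached client as one possible example; the host names, key, and renderHomepage helper are made up):

    import java.net.InetSocketAddress;
    import java.util.List;
    import net.spy.memcached.MemcachedClient;

    public class PageCache {
        public static void main(String[] args) throws Exception {
            // The client hashes keys across all webservers' memcached instances,
            // so the 1 GB on each box behaves like one pooled cache.
            MemcachedClient cache = new MemcachedClient(
                List.of(new InetSocketAddress("web1", 11211),
                        new InetSocketAddress("web2", 11211)));

            String key = "homepage:rendered";
            String page = (String) cache.get(key);
            if (page == null) {
                page = renderHomepage();     // expensive work, e.g. DB queries
                cache.set(key, 300, page);   // cache for 5 minutes
            }
            System.out.println(page);
            cache.shutdown();
        }

        private static String renderHomepage() {
            return "<html>...</html>";
        }
    }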
Deployment is handled using rsync to sync specific directories on a management server out to each webserver. Apache restarts, etc.. are handled through similar scripting over ssh from the management server.
The amount of traffic that can be handled through this configuration is significant. The advantages of easy scaling and easy maintenance are great as well.
For clustering, any web request would be handled by a load balancer which, being kept up to date on the current load of the servers forming the cluster, sends the request to the least burdened server. As for whether it's called an "instance"... I believe so, but I'd wait for confirmation on that first.
You'd need a very large application to be bothered with thinking about clustering and the "fun" that comes with it, software- and hardware-wise, though. Unless you're looking to start something big, or are already running something big, it wouldn't be anything to worry about.
Yes, it can be required for clustering. Typically, as the load goes up, you might find yourself with a frontend server that does URL rewriting, HTTPS if required, and caching with, say, Squid. The requests get passed on to multiple backend servers, probably using cookies to associate a session with a particular backend if necessary. You might also have the database on a separate server.
I should add that there are other reasons why you might need multiple servers; for instance, there may be a requirement that the database not be on the frontend server, for security reasons.

Detecting dead applications while server is alive in NLB

Windows NLB works great and removes a computer from the cluster when the computer is dead.
But what happens if the application dies but the server still works fine? How have you solved this issue?
Thanks
By not using NLB.
Hardware load balancers often have configurable "probe" functions to determine if a server is responding to requests. This can be by accessing the real application port/URL, or some specific "healthcheck" URL that returns only if the application is healthy.
Other options on these look at the queue length or the time taken to respond to requests.
Cisco put it like this:
The Cisco CSM continually monitors server and application availability using a variety of probes, in-band health monitoring, return code checking, and the Dynamic Feedback Protocol (DFP). When a real server or gateway failure occurs, the Cisco CSM redirects traffic to a different location. Servers are added and removed without disrupting service; systems easily are scaled up or down.
(from here: http://www.cisco.com/en/US/products/hw/modules/ps2706/products_data_sheet09186a00800887f3.html#wp1002630)
Presumably with Windows NLB there is some way to programmatically set the weight of nodes? The nodes should self-monitor and, if there is some problem (e.g. a particular node is low on disk space), set their weight to zero so they receive no further traffic.
However, this needs to be carefully engineered and have further human monitoring to ensure that you don't end up with a situation where one fault causes the entire cluster to announce itself down.
You can't really hope to deal with a "byzantine general" situation in network load balancing; an appropriately broken node may think it's fine and appear fine while being completely unable to do any actual work. The trick is to try to minimise the possibility of these situations happening in production.
There are multiple levels of health check for a network application.
1. Is the server machine up?
2. Is the application (service) running?
3. Is the service accepting network connections?
4. Does the service respond appropriately to an "are you ok" request?
5. Does the service perform real work? (This will also check the back-end systems behind the service you are probing.)
My experience with NLB may be incomplete, but I'll describe what I know. NLB can do 1 and 2. With custom coding you can add the other levels with varying difficulty. With some network architectures this can be very difficult.
Most hardware load balancers from vendors like Cisco or F5 can be easily configured to do 3 or 4. Level 5 testing still requires custom coding.
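Level 4/5 checks are usually implemented as a small health endpoint in the application itself, which the load balancer or monitor probes. A rough sketch (the /healthcheck path, port, and database check are illustrative; the level-5 part assumes a MySQL JDBC driver on the classpath and made-up connection details):

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;
    import java.sql.Connection;
    import java.sql.DriverManager;

    public class HealthEndpoint {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);

            // Levels 3/4: the service accepts connections and answers "are you ok".
            // Level 5: it also touches a back-end dependency before saying so.
            server.createContext("/healthcheck", exchange -> {
                boolean healthy = backendReachable();
                byte[] body = (healthy ? "OK" : "FAIL").getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(healthy ? 200 : 503, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
        }

        private static boolean backendReachable() {
            // Illustrative level-5 check: can we still reach the database?
            try (Connection c = DriverManager.getConnection(
                    "jdbc:mysql://db-master/app", "probe", "secret")) {
                return c.isValid(2);
            } catch (Exception e) {
                return false;
            }
        }
    }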
We start in the situation where all nodes are part of the cluster but inactive.
We run a custom service monitor which makes a request on the service locally via the external interface. If the response was successful we start the node (allow it to start handling NLB traffic). If the response failed we stop the node from receiving traffic.
All the intermediate steps described by Darron are irrelevant. Whether it worked or not is the only thing we care about. If the machine is inaccessible, then the rest of the NLB cluster will treat it as failed.
