WebSphere web server plug-in to automatically propagate cluster node shutdown

Does the WebSphere web server plug-in automatically pick up the new configuration when a node in the application server cluster is shut down manually? I've been going through the documentation, and it looks like the only way for the web server to act on this is by detecting the node state by itself.
Is there any workaround?

By default, the WAS Plug-in only detects that a JVM is down by failing to send it a request or failing to establish a new TCP connection.
If you use the "Intelligent Management for WebServers" features available in 8.5 and later, there is a control connection between the cell and the Plug-in that proactively tells the Plug-in that a server is down.
Backing up to the non-IM case, here's what happens during an unplanned shutdown of a JVM (from http://publib.boulder.ibm.com/httpserv/ihsdiag/plugin_questions.html#failover):
If an application server terminates unexpectedly, several things unfold. This is largely WebSphere edition independent.
The application server's operating system closes all open sockets.
WebServer threads waiting for the response in the WAS Plug-in are notified of EOF or ECONNRESET.
If the error occurred on a new connection to the application server, it will be marked down in the current webserver process. This server will not be retried until a configurable interval expires (RetryInterval).
If the error occurred on an existing connection to the application server, it will not be marked down.
Retryable requests that were in-flight are retried by the WAS Plug-in, as permitted.
If the backend servers use memory to memory session replication (ND only), the WLM component will tell the WAS Plug-in to use a specific replacement affinity server.
If the backend servers use any kind of session persistence, the failover is transparent. Session persistence is available in all WebSphere editions.
New requests, with or without affinity, are routed to the remaining servers.
After the RetryInterval expires, the WAS Plug-in will try to establish new connections to the server. If it remains down, the failure will be relatively fast and will put the server back into the marked-down state.
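The mark-down and retry behaviour described above can be sketched as a toy model. This is not the Plug-in's actual code: the class, the function, and the "first eligible server" selection rule are assumptions made for illustration, and the 60-second value is just an example setting. The clock is passed in explicitly so the logic is easy to follow.

```python
class ServerState:
    """Toy model of one backend as seen by a single web server process."""

    def __init__(self, retry_interval=60.0):
        self.retry_interval = retry_interval  # seconds, stands in for RetryInterval
        self.marked_down_at = None            # None means the server is considered up

    def is_eligible(self, now):
        """A server is eligible if it is up or its retry interval has expired."""
        if self.marked_down_at is None:
            return True
        return now - self.marked_down_at >= self.retry_interval

    def record_failure(self, now, new_connection):
        """Only a failure on a new connection marks the server down."""
        if new_connection:
            self.marked_down_at = now

    def record_success(self):
        """A successful new connection clears the marked-down state."""
        self.marked_down_at = None


def pick_server(servers, now):
    """Return the name of the first eligible server, or None if all are down.

    Assumption: a real router balances across servers; taking the first
    eligible one just keeps the sketch short.
    """
    for name, state in servers.items():
        if state.is_eligible(now):
            return name
    return None
```

In this model a failure on an existing connection leaves the server eligible, while a failure on a new connection takes it out of rotation until the interval has passed, at which point the next attempt either clears the state or marks it down again.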

Related

Can you check to see if an IBM MQ topic is up and available through a Java application before attempting to create a connection?

I would like to add some conditional logic to our Java application code for attempting to create a JMS Topic Connection. I have seen problems in the past stemming from attempting to create a connection when the MQ server had been restarted or was currently down. One improvement I added was to check for the quiescent state, and another was to increase the timer before attempting reconnection to our durable topic queue.
Is there a way to confirm with the MQ server/topic/channel that it is up and running and a connection request can safely be made?
The best way to confirm that a queue manager (and the channel you are using to connect to the queue manager) is up and running is to attempt to connect to it.
If your connection attempt fails, you will get an MQ Reason code telling you exactly why. This is a much better way to confirm than any administrative command, because it also confirms that your application and its security context are correctly set up and able to connect to the queue manager. It is entirely possible to have an up-and-running queue manager but an application that is not yet correctly configured to use it. So connect from the application, and if it works, the queue manager is up and running.
Your point about increasing the timer before attempting to reconnect after a failure is well made. It doesn't help anyone if you hammer the queue manager with lots of closely spaced connection attempts until it is ready to accept your connection. Still, anything that is going to test the availability of the queue manager ultimately needs to connect to it, so very simply, just connect.
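A minimal sketch of the "just connect, but back off between attempts" advice. The `connect` argument is a hypothetical zero-argument callable standing in for whatever creates your connection (for example, a wrapper around your JMS connection factory); the delay values are arbitrary, and none of this is MQ API.

```python
import time


def connect_with_backoff(connect, max_attempts=5, initial_delay=1.0,
                         max_delay=60.0, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure.

    connect is a hypothetical zero-argument callable that returns a connection
    or raises an exception carrying the failure reason.
    """
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # give up and surface the last reason to the caller
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
            delay = min(delay * 2, max_delay)  # cap the wait between attempts
```

In a real application you would probably catch only the connection exception type and inspect the reason code, since some failures (an authorization error, for instance) will not go away by waiting.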

Kubernetes pods graceful shutdown with TCP connections (Spring boot)

I am hosting my services on the Azure cloud. Sometimes I get "BackendConnectionFailure" without any apparent reason. After investigation I found a correlation between this exception and autoscaling (scaling down): in most cases they occur almost in the same second.
According to the documentation, the termination grace period is 30 seconds by default, which is the case here. The pod will be marked terminating and the load balancer will not consider it anymore, so it receives no more requests. According to this, if my service takes far less than 30 seconds, I should not need a preStop hook or any special implementation in my application (please correct me if I am wrong).
If the previous paragraph is correct, why does this exception occur relatively frequently? My thought is that the pod is marked terminating and the load balancer stops forwarding requests to it while it should still be doing so.
Edit 1:
The Architecture is simply like this
Client -> Firewall(azure) -> API(azure APIM) -> Microservices(Spring boot) -> backend(third party) or azure RDB depending on the service
I think the Exception comes from APIM, I found two patterns for this exception:
Message The underlying connection was closed: The connection was closed unexpectedly.
Exception type BackendConnectionFailure
Failed method forward-request
Response time 10.0 s
Message The underlying connection was closed: A connection that was expected to be kept alive was closed by the server.
Exception type BackendConnectionFailure
Failed method forward-request
Response time 3.6 ms
Spring Boot doesn't do graceful termination by default.
The Spring Boot app and its application container (not the Linux container) are in control of what happens to existing connections during the termination grace period. The protocols being used and how a client reacts to a "close" also have a part to play.
If you get to the end of the grace period, then everything gets a hard reset.
Kubernetes
When a pod is deleted in k8s, the Pod Endpoint removal from Services is triggered at the same time as the SIGTERM signal to the container(s).
At this point the cluster nodes will be reconfigured to remove any rules directing new traffic to the Pod. Any existing TCP connections to the Pod/containers will remain in connection tracking until they are closed (by the client, server or network stack).
For HTTP Keep Alive or HTTP/2 services, the client will continue hitting the same Pod Endpoint until it is told to close the connection (or it is forcibly reset).
App
The basic rules are, on SIGTERM the application should:
Allow running transactions to complete
Do any application cleanup required
Stop accepting new connections, just in case
Close any inactive connections it can (keep alive requests, websockets)
Some circumstances you might not be able to handle (depends on the client)
A keep alive connection that doesn't complete a request in the grace period can't get a Connection: close header. It will need a TCP level FIN close.
A slow client with a long transfer: in a one-way HTTP transfer, these will have to be waited for or forcibly closed.
Although keepalive clients should respect a TCP FIN close, every client reacts differently. Microsoft APIM might be sensitive and produce the error even though there was no real world impact. It's best to load test your setup while scaling to see if there is a real world impact.
For more Spring Boot info see:
https://github.com/spring-projects/spring-boot/issues/4657
https://github.com/corentin59/spring-boot-graceful-shutdown
https://github.com/SchweizerischeBundesbahnen/springboot-graceful-shutdown
You can use a preStop sleep if needed. While the pod is removed from the service endpoints immediately, it still takes time (10-100ms) for the endpoint update to be sent to every node and for them to update iptables.
When your application receives a SIGTERM (from the Pod termination), it first needs to stop reporting that it is ready (fail the readinessProbe) while still serving requests as they come in from clients. After a certain time (depending on your readinessProbe settings) you can shut down the application.
For Spring Boot there is a small library doing exactly that: springboot-graceful-shutdown
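The sequence described in these answers (fail readiness on SIGTERM, keep serving, let in-flight requests finish, then exit) can be sketched in a language-neutral way as follows. This is only an illustration of the idea, not how Spring Boot or the linked library implements it; the class and method names are made up for the example.

```python
import threading


class GracefulShutdown:
    """Tracks readiness and in-flight requests for a server process."""

    def __init__(self):
        self._lock = threading.Condition()
        self._ready = True
        self._in_flight = 0

    def is_ready(self):
        """What a readiness probe endpoint would report."""
        with self._lock:
            return self._ready

    def begin_shutdown(self):
        """Call when SIGTERM arrives: fail readiness, but keep serving."""
        with self._lock:
            self._ready = False

    def request_started(self):
        with self._lock:
            self._in_flight += 1

    def request_finished(self):
        with self._lock:
            self._in_flight -= 1
            self._lock.notify_all()  # wake anyone waiting for the drain

    def wait_for_drain(self, timeout):
        """Block until no requests are in flight; return False on timeout."""
        with self._lock:
            return self._lock.wait_for(lambda: self._in_flight == 0, timeout)


# Typical wiring (shown as a comment, not executed here):
# signal.signal(signal.SIGTERM, lambda signum, frame: state.begin_shutdown())
```

The timeout passed to `wait_for_drain` would need to stay below the pod's termination grace period, since anything still running at the end of that period is reset regardless.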

Load balancer and WebSockets

Our infrastructure is composed of
1 F5 load balancer
3 nodes
We have an application which uses websockets, so when a user goes to our site, it opens a websocket to the balancer, which connects it to the first available node, and it works as expected.
Our troubles arrive with maintenance tasks. When we have to update our software, we need to take one node offline at a time, deploy the new release and then bring it back online. While doing this, the balancer drops the open websocket connections to that node and the clients retry after a few seconds, connecting to the first available node. This is an inconvenience for the client, because it could miss a signal (or more).
How can we keep the connection between the client and the balancer while changing the backend websocket server? Is the load balancer enough to achieve our goal, or do we need to change our infrastructure?
To avoid this kind of problem, I recommend reading about Azure SignalR. With it you don't need to think about things like the load balancer, a Redis backplane and other infrastructure that you might otherwise need for WebSockets connections.
Basically, the clients will not be connected to your nodes directly but redirected to Azure SignalR. You can read more about it here: https://learn.microsoft.com/en-us/azure/azure-signalr/signalr-overview
Since it is important to your application to maintain the connection, I don't see any other way to achieve no connection drops to your nodes, since you need to shut them down.
It's important to understand that the F5 is a full TCP proxy. This means that the F5 is the server to the client and the client to the server. If you are using the websockets protocol then you must apply a websockets profile to the F5 Virtual Server in order for the websockets application to be handled properly by the Load Balancer.
Details of the websockets profile can be found here: https://support.f5.com/csp/article/K14754
If a websockets profile and an HTTP profile are applied to the Virtual Server - meaning that you have websockets and web traffic using the same port and LB nodes - then the F5 will allow the websockets traffic as passthrough. Also keep in mind that if this is an HTTPS virtual server, you will need to ensure that a client-side and a server-side HTTPS profile (SSL offload) are applied to the Virtual Server.
While there are a variety of ways that you can fiddle with load balancers to minimize the downtime caused by a software upgrade, none of them solve the problem, which is that your application-layer protocol seems to not tolerate some small network outages.
Even if you have a perfect load balancer and your software deploys cause zero downtime, the customer's computer may be on flaky wifi which causes a network dropout for half a second - or going over ethernet and someone reconfigures some routing on their LAN, etc.
I'd suggest having your server maintain a queue of messages for clients (up to some size/time limit) so that when a client drops a connection, whether due to load balancers, upgrades, or any other reason, it can reconnect and continue without disruption.
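As a rough sketch of such a per-client queue: each message gets a sequence number, the server keeps the most recent ones, and a reconnecting client reports the last number it saw. The class name, the method names and the size limit are all assumptions for illustration.

```python
from collections import deque


class ReplayBuffer:
    """Keeps the most recent messages for one client so they can be replayed."""

    def __init__(self, max_messages=100):
        self._messages = deque(maxlen=max_messages)  # oldest entries fall off
        self._next_seq = 1

    def publish(self, payload):
        """Store a message and return the sequence number assigned to it."""
        seq = self._next_seq
        self._next_seq += 1
        self._messages.append((seq, payload))
        return seq

    def replay_after(self, last_seen_seq):
        """Return buffered (seq, payload) pairs newer than last_seen_seq."""
        return [(seq, p) for seq, p in self._messages if seq > last_seen_seq]
```

On reconnect the client would send its last seen sequence number and the server would send back `replay_after(n)` before resuming live delivery. With several nodes behind the balancer, the buffer would have to live somewhere every node can reach, otherwise a client that lands on a different node has nothing to replay from.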

HUGE Number of MQGET and MQINQS requests logged against a MQ channel

We have a BatchJob application which is configured in WebSphere Application Server (8.0.0.7). The application processes the requests put on the source MQ queue. Through WAS, the MQ queue is polled to see if there are any new requests available for processing.
We have recently been notified by an MQ resource that there is high CPU utilization due to the MQ channel used by our application. Looking at the numbers, the MQGETS and MQINQS request counts are enormous. This is not a one-off incident; it has been like that since the day our application was installed. So I believe there is some configuration in WebSphere that is causing this high volume of MQGETS and MQINQS requests.
Can somebody give any pointers on which configs need to be checked? I am from the application development side, so I don't have detailed knowledge of WAS.
Thanks in advance.

Understanding effects of Domino command to restart HTTP server

We have a Domino cluster which consists of two servers. Recently we have seen that one of the servers has memory problems, and the HTTP service goes down after 2 hours. So we plan to implement a scheduled server task which runs the command nserver -c "restart task http" until we find a fix for the memory leak. The HTTP service restarts in, say, 15 seconds. But what would happen if a user submits data during this short period? Will the cluster manager automatically move the user session to the other server, and hence load balance the submit? Failover runs fine in the normal case: when one of the servers goes down, the other server takes over the load. But we are not sure about the behavior of the "restart task http" command. Does the restart finish all the pending threads, or does the Domino cluster manager switch to the other server to handle the request?
Thanks in advance
The server should close out all HTTP requests prior to shutting down and restarting.
