Intermittent issues with MSMQ - windows

We have intermittent issues with MSMQ communication cross domain.
Applications on residing on Domain (client)A and (server)B are communicating. The client can always send messages to the server. The queues (DIRECT:OS format) are created and in "Connected" state on the server. Sometimes, however, when the server responds it generates a
"BadDestinationQueue" error.
This can go on for anywhere between 10min - 1hr after which the error is gone and communication is normal. We see nothing abnormal, no obvious temporal pattern.
Clients in B talking to the server in B have no such issues.
Are there any settings that could cause this? We assume the intermittency suggests some sort of cache, somewhere? The DNS entries are up-to-date, and machines are discoverable by hostname even while MSMQ insists the destination queue does not exist.

Related

Load balancer and WebSockets

Our infrastructure is composed by
1 F5 load balancer
3 nodes
We have an application which uses websockets, so when a user goes to our site, it opens a websocket to the balancer which it connects to the first available node, and it works as expected.
Our truobles arrives with maintenance tasks, when we have to update our software, we need to turn offline 1 node at a time, deploy the new release and then turn it on again. Doing this task, the balancer drops the open websocket connections to the node and the clients retries to connect after few seconds to the first available nodes, creating an inconvenience for the client because he could miss a signal (or more).
How we can keep the connection between the client and the balancer, changing the backend websocket server? Is the load balancer enough to achieve our goal or we need to change our infrastructure?
To avoid this kind of problems I recommend to read about the Azure SignalR. With this you don't need to thing about stuff like load balancer, redis backplane and other infrastructures that you possibly need to a WebSockets connection.
Basically the clients will not connected to your node directly but redirected to Azure SignalR. You can read more about it here: https://learn.microsoft.com/en-us/azure/azure-signalr/signalr-overview
Since it is important to your application to maintain the connection, I don't see how any other way to archive no connection drop to your nodes, since you need to shut them down.
It's important to understand that the F5 is a full TCP proxy. This means that the F5 is the server to the client and the client to the server. If you are using the websockets protocol then you must apply a websockets profile to the F5 Virtual Server in order for the websockets application to be handled properly by the Load Balancer.
Details of the websockets profile can be found here: https://support.f5.com/csp/article/K14754
If a websockets and an HTTP profile are applied to the Virtual Server - meaning that you have websockets and web traffic using the same port and LB nodes - then the F5 will allow the websockets traffic as passthrough. Also keep in mind that if this is an HTTPS virtual sever that you will need to ensure a client and server side HTTPS profile (SSL offload) are applied to the Virtual Server.
While there are a variety of ways that you can fiddle with load balancers to minimize the downtime caused by a software upgrade, none of them solve the problem, which is that your application-layer protocol seems to not tolerate some small network outages.
Even if you have a perfect load balancer and your software deploys cause zero downtime, the customer's computer may be on flaky wifi which causes a network dropout for half a second - or going over ethernet and someone reconfigures some routing on their LAN, etc.
I'd suggest having your server maintain a queue of messages for clients (up to some size/time limit) so that when a client drops a connection - whether it be due to load balancers/upgrades - or any other reason, it can continue without disruption.

When to choose a remote queue design versus local queue for get/put activities

I'm trying to figure out under what conditions I would want to implement a remote queue versus a local one for 2 endpoint applications.
Consider this scenario: App A on Server A needs to send messages to App B on Server B via MQServer1.
It seems like the simplest configuration would be to create a single local queue on MQServer1 and configure AppA to put messages to the local queue while configuring AppB to get messages from the same local queue. Both AppA and AppB would connect to the same Queue Manager but execute different commands.
What sort of circumstances would require the need to install another MQ server (e.g. MQServer2) and configure a remote queue on MQServer1 which instead sends the messages from AppA over a channel to a local queue on MQServer2 to be consumed by AppB?
I believe I understand the benefit of remote queuing but I'm not sure when it's best used over a more simpler design.
Here are some problems with what you call the simpler design that you don't have with remote queuing:-
Time Independance - Server1 has to be available all the time, whereas with a remote queue, once the messages have been moved to Server B, Server A and Server 1 don't need to be online when App B wants to get its messages.
Network Efficiency - with two client applications putting or getting from a central queue, you have two inefficient network hops, instead of one efficient channel batched network connection from Server A to Server B (no need for Server 1 in the middle)
Network Problems - No network, no messages. Whereas when they are stored locally, any that have already arrived can be processed even while the network is down. Likewise, the application putting messages is also not held up by a network problem, the messages sit on the transmit queue easy to be moved, and the application can get on with the next thing.
Of course your applications should be written so that they aren't even aware of the difference, and it's just configuration changes that switch you from one design to the other.
Here we can have separate Queue Manager for both the application.Application A will send the message on to the queue defined on local Queue Manager, which in turn transmit it to the Transmission queue via defined channels (Need to do configuration for that in the QueueManager) which in turn send it to the Local queue of the Application B.

Do I *really* need RPC and NETBIOS to use transactional NServiceBus queues between local servers and Amazon EC2?

We have been trying - without success - to get transactional message queues working between local servers and our cloud servers up in Amazon EC2.
We're using NServiceBus, and have got the pub/sub examples and various other trivial apps working locally between here and EC2, but trying to spin up the components of our actual application is proving... vexatious.
As far as I can work out, to allow a local server (DYLAN-PC) to send a message transactionally via a queue on an Amazon EC2 instance, I will need to:
Enable NETBIOS name resolution (e.g. via the /etc/lmhosts file) at both ends
Allow RPC connections to be initiated from either end (so open port 135 for RPC plus various other ports)
Configure MSTDC on both systems, enabling remote connections and inbound/outbound connections
Have I missed something? In particular, the requirement to allow NetBIOS in an age where everything (including Active Directory!) runs on DNS seems particularly archaic. Are we doing something stupid trying to use MSMQ between sites like this? This is the first big project where we've tried this kind of distributed architecture, and the deployment/configuration is starting to hurt so much I'm convinced we've taken a wrong turn somewhere... a little perspective or advice would be gratefully received!
If you're look to build a geographically distributed system, where you can't arrange a VPN between these sites, you should be using the gateway capabilities of NServiceBus to communicate over alternate transports (like HTTP) between those sites.
RPC is required for reading from remote queues.
If you push to remote queues and pull from local queues, you won't be using RPC.

Build durable architecture with Websphere MQ clients

How can you create a durable architecture environment using MQ Client and server if the clients don't allow you to persist messages nor do they allow for assured delivery?
Just trying to figure out how you can build a salable / durable architecture if the clients don't appear to contain any of the necessary components required to persist data.
Thanks,
S
Middleware messaging was born of the need to persist data locally to mitigate the effects of failures of the remote node or of the network. The idea at the time was that the queue manager was installed locally on the box where the application lives and was treated as part of the transport stack. For instance you might install TCP and WMQ as a transport and some apps would use TCP while others used WMQ.
In the intervening 20 years, the original problems that led to the creation of MQSeries (Now WebSphere MQ) have largely been solved. The networks have improved by several nines of availability and high availability hardware and software clustering have provided options to keep the different components available 24x7.
So the practices in widespread use today to address your question follow two basic approaches. Either make the components highly available so that the client can always find a messaging server, or put a QMgr where the application lives in order to provide local queueing.
The default operation of MQ is that when a message is sent (MQPUT or in JMS terms producer.send), the application does not get a response back on the MQPUT call until the message has reached a queue on a queue manager. i.e. MQPUT is a synchronous call, and if you get a completion code of OK, that means that the queue manager to which the client application is connected has received the message successfully. It may not yet have reached its ultimate destination, but it has reached the protection of an MQ Server, and therefore you can rely on MQ to look after the message and forward it on to where it needs to get to.
Whether client connected, or locally bound to the queue manager, applications sending messages are responsible for their data until an MQPUT call returns successfully. Similarly, receiving applications are responsible for their data once they get it from a successful MQGET (or JMS consumer.receive) call.
There are multiple levels of message protection are available.
If you are using non-persistent messages and asynchronous PUTs, then you are effectively saying it doesn't matter too much whether the messages reach their destination (although they generally will).
If you want MQ to really look after your messages, use synchronous PUTs as described above, persistent messages, and perform your PUTs and GETs within transactions (aka syncpoint) so you have full application control over the commit points.
If you have very unreliable networks such that you expect to regularly fail to get the messages to a server, and expect to need regular retries such that you need client-side message protection, one option you could investigate is MQ Telemetry (e.g. in WebSphere MQ V7.1) which is designed for low bandwidth and/or unreliable network communications, as a route into the wider MQ.

Detecting dead applications while server is alive in NLB

Windows NLB works great and removes computer from the cluster when the computer is dead.
But what happens if the application dies but the server still works fine? How have you solved this issue?
Thanks
By not using NLB.
Hardware load balancers often have configurable "probe" functions to determine if a server is responding to requests. This can be by accessing the real application port/URL, or some specific "healthcheck" URL that returns only if the application is healthy.
Other options on these look at the queue/time taken to respond to requests
Cisco put it like this:
The Cisco CSM continually monitors server and application availability
using a variety of probes, in-band
health monitoring, return code
checking, and the Dynamic Feedback
Protocol (DFP). When a real server or
gateway failure occurs, the Cisco CSM
redirects traffic to a different
location. Servers are added and
removed without disrupting
service—systems easily are scaled up
or down.
(from here: http://www.cisco.com/en/US/products/hw/modules/ps2706/products_data_sheet09186a00800887f3.html#wp1002630)
Presumably with Windows NLB there is some way to programmatically set the weight of nodes? The nodes should self-monitor and if there is some problem (e.g. a particular node is low on disc space), set its weight to zero so it receives no further traffic.
However, this needs to be carefully engineered and have further human monitoring to ensure that you don't end up with a situation where one fault causes the entire cluster to announce itself down.
You can't really hope to deal with a "byzantine general" situation in network load balancing; an appropriately broken node may think it's fine, appear fine, but while being completely unable to do any actual work. The trick is to try to minimise the possibility of these situations happening in production.
There are multiple levels of health check for a network application.
is the server machine up?
is the application (service) running?
is the service accepting network connections?
does the service respond appropriately to a "are you ok" request?
does the service perform real work? (this will also check back-end systems behind the service your are probing)
My experience with NLB may be incomplete, but I'll describe what I know. NLB can do 1 and 2. With custom coding you can add the other levels with varying difficulty. With some network architectures this can be very difficult.
Most hardware load balancers from vendors like Cisco or F5 can be easily configured to do 3 or 4. Level 5 testing still requires custom coding.
We start in the situation where all nodes are part of the cluster but inactive.
We run a custom service monitor which makes a request on the service locally via the external interface. If the response was successful we start the node (allow it to start handling NLB traffic). If the response failed we stop the node from receiving traffic.
All the intermediate steps described by Darron are irrelevant. Did it work or not is the only thing we care about. If the machine is inaccessible then the rest of the NLB cluster will treat it as failed.

Resources