The Cluster refresh solution - websphere

Update: We are using AIX environment.
We have been facing some random issues with our queues (cluster queues), like:
2189 Cluster resolution error (Most frequent one)
2270 MQRC_NO_DESTINATIONS_AVAILABLE
2053 Queue full error(Weirdest) : Post one message, it will be successfully posted, post some 3-4 messages, it will throw this error
for the rest of the messages.
All these issues get resolved once we do a cluster refresh. But, I want to know the root cause, why we get these errors. What goes wrong?
How cluster refresh resolve these errors?

Could be a socket issue. You can monitor sockets according to your OS - like on windows can do
netstat -a -b -o >/newfile.txt
You could also use TCP Viewer on windows (one exe from Microsoft/ sysinternals) http://technet.microsoft.com/en-us/sysinternals/bb897437.aspx actually all the sys internal toos should be in your prod box if windows.
For sockets in linux/Un* there are other tools, some just ls commands into the RAM, depending on the version. Maybe a google will help.
Also if using windows consider moving some stuff to linux, you will have some pain in the beggining but will get better.
If this did not help you should post yor environment on your quesiton and give any other details. And if you get a jprofiler into production and use it when the issue happens.
At the very least you can do a jstack and jmap
What is version/ name of OS and of java, websphere?
If it is a socket issue can try increasing sockets (registry) and then profiling your code to see who is making too many sockets, what needs to be throttled or re-written.
Remember every page, every db connection, external cache hit (if you use) or any other URL work/ remote connection is usually a socket.

Related

Automatic reconnect in case of network failures

I am testing .NET version of ZeroMQ to understand how to handle network failures. I put the server (pub socket) to one external machine and debugging the client (sub socket). If I stop my local Wi-Fi connection for seconds, then ZeroMQ automatically recovers and I even get remaining values. However, if I disable Wi-Fi for longer time like a minute, then it just gets stuck on a frame waiting. How can I configure this period when ZeroMQ is still able to recover? And how can I reconnect manually after, say, several minutes? How can I understand that the socket is locked and I need to kill/open again?
Q :" How can I configure this ... ?"
A :Use the .NET versions of zmq_setsockopt() detailed parameter settings - family of link-management parameters alike ZMQ_RECONNECT_IVL, ZMQ_RCVTIMEO and the likes.
All other questions depend on your code.
If using blocking-forms of the .recv()-methods, you can easily throw yourself into unsalvageable deadlocks, best never block your own code ( why one would ever deliberately lose one's own code domain-of-control ).
If in a need to indeed understand low-level internal link-management details, do not hesitate to use zmq_socket_monitor() instrumentation ( if not available in .NET binding, still may use another language to see details the monitor-instance reports about link-state and related events ).
I was able to find an answer on their GitHub https://github.com/zeromq/netmq/issues/845. Seems that the behavior is by design as I got the same with native zmq lib via .NET binding.

WSO2 log4j and elasticsearch: all carbon apps freeze

I've noticed a very strange behavior in my wso2 apps ( esb 4.9, AM 1.10 and GREG 5.0.0)
Every single time the elasticsearch/logstash is stopped all the carbon apps freeze.
They become completely unresponsive and the only way to stop them is send a kill -9
My conf is pretty standard (see below) so I was wondering if I'm missing something or if someone else noticed the same issue.
log4j.rootLogger=INFO, CARBON_CONSOLE, CARBON_LOGFILE, CARBON_MEMORY,tcp
log4j.appender.tcp=org.apache.log4j.net.SocketAppender
log4j.appender.tcp.layout=org.wso2.carbon.utils.logging.TenantAwarePatternLayout
log4j.appender.tcp.layout.ConversionPattern=[%d] %P%5p {%c} – %x %m%n
log4j.appender.tcp.layout.TenantPattern=%U%#%D[%T]
log4j.appender.tcp.Port=6000
log4j.appender.tcp.RemoteHost=localhost
log4j.appender.tcp.ReconnectionDelay=10000
log4j.appender.tcp.threshold=DEBUG
log4j.appender.tcp.Application=esb500wso2carbon
What says the documentation :
Logging events are automatically buffered by the native TCP
implementation. This means that if the link to server is slow but
still faster than the rate of (log) event production by the client,
the client will not be affected by the slow network connection.
However, if the network connection is slower then the rate of event
production, then the client can only progress at the network rate. In
particular, if the network link to the the server is down, the client
will be blocked.
On the other hand, if the network link is up, but the
server is down, the client will not be blocked when making log
requests but the log events will be lost due to server unavailability.
But in my case, even when the "server is down", the client is blocked sometimes because many java threads are blocked on the same lock object
Have a look to JMSAppender or AsyncAppender
According to WS02 is a bug.
It doesn't affect version 5.x
The suggested workaround, successfully tested, is to use filebeat instead :(
Not ideal, but it works

Getting specific errors when TCP connections disconnect in Windows

I'm trying to improve the usefulness of the error reporting in a server I am working on. The server uses TCP sockets, and it runs on Windows.
The problem is that when a TCP link drops due to some sort of network failure, the error code that I can get from WSARecv() (or the other Windows socket APIs) is not very descriptive. For most network hiccups, I get either WSAECONNRESET (10054) or WSAETIMEDOUT (10060). But there are about a million things that can cause both of these: the local machine is having a problem, the remote machine or process is having a problem, some intermediate router has a problem, etc. This is a problem because the server operator doesn't have a definitive way to investigate the problem, because they don't necessarily even know where the problem is, or who might be responsible.
At the IP level, it's a different story. If the server operator happens to have a network sniffer attached when something bad happens, it's usually pretty easy to sort of what went wrong. For instance, if an intermediate router sent an ICMP unreachable, the router that sent it will put its IP address in there, and that's usually enough to track it down. Put another way, Windows killed the connection for a reason, probably because it got a specific packet that had a specific problem.
However, a large number of failures are experienced in the field, unexpected. It is not realistic to always have a network sniffer attached to a production server. There needs to be a way to track down problems that happen only rarely, intermittently, or randomly.
How can I solve this problem programmatically?
Is there a way to get Windows to cough up a more specific error message? Is there some easy way to capture and mine recent Windows events (perhaps the one Microsoft Network Monitor uses)? One way I've "solved it" before is to keep dumpcap (from Wireshark) running in ring buffer mode, and force it to stop capturing when a bad event happens, that I can mine later.
I'm also open to the possibility that this is not the right way to solve this problem. For instance, perhaps there is some special Windows mode that can be turned on to cause it to log useful information, that a network administrator could use to track this down after-the-fact.

EnumPrinters() + error RPC_S_SERVER_UNAVAILABLE (1722)

I am working on a sample to get the list of printer connected to machine. For that I am using EnumPrinters() API to get the printers. Randomly it gives the error RPC_S_SERVER_UNAVAILABLE (1722). I tried to search in the net, but I could not get the solution.
Please help me to fix this issue.
How are you calling EnumPrinters (hint - post the code)?
For some modes of API invocation, the local system will RPC to the target servers in turn - this uses RPC, so you can get RPC errors back. You may be able to get the info you need via a less heavyweight call that uses different parameters to EnumPrinters.
From the docs:
when EnumPrinters is called with a
level 2 (PRINTER_INFO_2) data
structure, it performs an OpenPrinter
call on each remote connection. If a
remote connection is down, or the
remote server no longer exists, or the
remote printer no longer exists, the
function must wait for RPC to time out
and consequently fail the OpenPrinter
call. This can take a while.
I had this problem recently with my Windows 10 PC. I spent a lot of time with debugging of EnumPrinters, with all different sorts of levels, but nothing worked and I always got the error RPC_S_SERVER_UNAVAILABLE (1722). It turned out that something has stopped the Spooler service and even after a reboot it was disabled. After enabling the Spooler service, everything worked. You can notice the Spooler service failure by looking at the Win10 printer settings: All printers would show "not connected", even Print to PDF.

Meaning/cause of RPC Exception 'No interfaces have been exported.'

We have a fairly standard client/server application built using MS RPC. Both client and server are implemented in C++. The client establishes a session to the server, then makes repeated calls to it over a period of time before finally closing the session.
Periodically, however, especially under heavy load conditions, we are seeing an RPC exception show up with code 1754: RPC_S_NOTHING_TO_EXPORT.
It appears that this happens in the middle of a session. The user is logged on for a while, making successful calls, then one of the calls inexplicably returns this error. As far as we can tell, the server receives no indication that anything went wrong - and it definitely doesn't see the call the client made.
The error code appears to have permanent implications, as well. Having the client retry the connection doesn't work, either. However, if the user has multiple user sessions active simultaneously between the same client and server, the other connections are unaffected.
In essence, I have two questions:
Does anyone know what RPC_S_NOTHING_TO_EXPORT means? The MSDN documentation simply says: "No interfaces have been exported." ... Huh? The session was working fine for numerous instances of the same call up until this point...
Does anyone have any ideas as to how to identify the real problem? Note: Capturing network traffic is something we would rather avoid, if possible, as the problem is sporadic enough that we would likely go through multiple gigabytes of traffic before running into an occurrence.
Capturing network traffic would be one of the best ways to tackle this issue. If you can't do that, could you dump the client process and debug with WinDBG or Visual Studio? Perhaps compare a dump when operating normally versus in the error state?

Resources