WMQ Uncommitted msgs stuck in queue status data - ibm-mq

I have a queue whose reader consumes messages under syncpoint, and the reader ended abruptly. This left 2 messages in an uncommitted state, so the "msgage" property keeps increasing for the messages and the "uncom" property of the queue stays at 2, even though we restarted the consumer application and there is no long-running UOW.
Is there any way we can reset these properties without restarting MQ?

Presuming your application is connecting in client mode (over TCP), I expect that although your application has gone away, from the queue manager's point of view it is still active.
When the network socket the application opened closes, then MQ should roll back the 2 messages so they're eligible for consumption by another application.
The network socket will close when the operating system eventually notices the remote end of the TCP connection is unresponsive - this triggers a 'connection reset by peer' type socket closure. It's the operating system the queue manager is running on which will do this, not the remote one.
Some operating systems can take hours to notice a duff socket in their default configuration. Look into 'TCP keepalive' settings on your operating system to tune how long this takes.
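For example, on Linux the relevant kernel settings look roughly like this (the values shown are illustrative, not recommendations), and the queue manager also has to be told to request keepalive on its sockets via the TCP stanza of qm.ini:
# Linux sysctls controlling TCP keepalive (example values only):
# idle time before the first probe, interval between probes,
# and how many unanswered probes close the socket.
sysctl -w net.ipv4.tcp_keepalive_time=600
sysctl -w net.ipv4.tcp_keepalive_intvl=60
sysctl -w net.ipv4.tcp_keepalive_probes=5

# In the queue manager's qm.ini, enable keepalive on MQ's sockets:
TCP:
   KeepAlive=Yes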

Related

The FIX client can receive incoming messages but cannot send outgoing heartbeat messages

We have built a FIX client. The client can receive incoming messages but cannot send outgoing heartbeat messages or reply to a TestRequest message; after the last heartbeat was sent, something on the client side stops any further heartbeats from being sent.
FIX version: FIX 5.0
The same incident happened before, and we have a tcpdump for one session from that time.
We deploy every FIX session to a separate k8s pod.
We suspected a CPU resource issue because the load average was high around the time of the incident, but it was not solved after we added more CPU cores; we think the load average was high because of the FIX reconnections.
We suspected an I/O issue because we use AWS EFS, which is shared by the 3 sessions for logging and the message store, but it was still not solved after we used pod affinity to assign the 3 sessions to different nodes.
It does not look like a network issue either, since we can still receive FIX messages and the other sessions worked well at that time. We have also disabled SNAT in the k8s cluster.
We are using QuickFIX/J 2.2.0 to create the FIX client. We have 3 sessions, which are deployed to k8s pods on separate nodes:
a rate session to get FX prices from the server
an order session to get transaction (execution report) messages from the server; we only send logon/heartbeat/logout messages to the server
a backoffice session to get market status
We use the Apache Camel QuickFIX/J component to simplify our programming. It works well most of the time, but the 3 sessions keep reconnecting to the FIX servers, roughly once a month; usually only 2 of the sessions have issues.
heartbeatInt = 30s
The FIX event messages at the client side:
20201004-21:10:53.203 Already disconnected: Verifying message failed: quickfix.SessionException: Logon state is not valid for message (MsgType=1)
20201004-21:10:53.271 MINA session created: local=/172.28.65.164:44974, class org.apache.mina.transport.socket.nio.NioSocketSession, remote=/10.60.45.132:11050
20201004-21:10:53.537 Initiated logon request
20201004-21:10:53.643 Setting DefaultApplVerID (1137=9) from Logon
20201004-21:10:53.643 Logon contains ResetSeqNumFlag=Y, resetting sequence numbers to 1
20201004-21:10:53.643 Received logon
The FIX incoming messages at the client side:
8=FIXT.1.1☺9=65☺35=0☺34=2513☺49=Quote1☺52=20201004-21:09:02.887☺56=TA_Quote1☺10=186☺
8=FIXT.1.1☺9=65☺35=0☺34=2514☺49=Quote1☺52=20201004-21:09:33.089☺56=TA_Quote1☺10=185☺
8=FIXT.1.1☺9=74☺35=1☺34=2515☺49=Quote1☺52=20201004-21:09:48.090☺56=TA_Quote1☺112=TEST☺10=203☺
----- 21:10:53.203 Already disconnected ----
8=FIXT.1.1☺9=87☺35=A☺34=1☺49=Quote1☺52=20201004-21:10:53.639☺56=TA_Quote1☺98=0☺108=30☺141=Y☺1137=9☺10=183☺
8=FIXT.1.1☺9=62☺35=0☺34=2☺49=Quote1☺52=20201004-21:11:23.887☺56=TA_Quote1☺10=026☺
The FIX outgoing messages at the client side:
8=FIXT.1.1☺9=65☺35=0☺34=2513☺49=TA_Quote1☺52=20201004-21:09:02.884☺56=Quote1☺10=183☺
---- no heartbeat message around 21:09:32 ----
---- 21:10:53.203 Already disconnected ---
8=FIXT.1.1☺9=134☺35=A☺34=1☺49=TA_Quote1☺52=20201004-21:10:53.433☺56=Quote1☺98=0☺108=30☺141=Y☺553=xxxx☺554=xxxxx☺1137=9☺10=098☺
8=FIXT.1.1☺9=62☺35=0☺34=2☺49=TA_Quote1☺52=20201004-21:11:23.884☺56=Quote1☺10=023☺
8=FIXT.1.1☺9=62☺35=0☺34=3☺49=TA_Quote1☺52=20201004-21:11:53.884☺56=Quote1☺10=027☺
Thread dump when the TEST message from the server was received. BTW, the gist is from our development environment, which has the same deployment:
https://gist.github.com/hitxiang/345c8f699b4ad1271749e00b7517bef6
We had enabled the debug log in QuickFIX/J, but it did not give much information, only logs for the messages received.
The sequence of events in time order:
20201101-23:56:02.742 the outgoing heartbeat should have been sent at this time; it looks like it was being sent but hung on I/O writing - in Running state
20201101-23:56:18.651 TEST message from the server side to trigger a thread dump
20201101-22:57:45.654 the server side began to close the connection
20201101-22:57:46.727 thread dump - right
20201101-23:57:48.363 logon message
20201101-22:58:56.515 thread dump - left
The right (2020-11-01T22:57:46.727Z): when it hangs. The left (2020-11-01T22:58:56.515Z): after reconnection.
It looks like the storage we were using - AWS EFS - is what made the issue happen.
But the feedback from AWS support was that nothing is wrong on the AWS EFS side.
Maybe it is a network issue between the k8s EC2 instances and AWS EFS.
First, we made logging asynchronous for all sessions, which made the disconnections happen less often.
Second, for the market session, we wrote the sequence files to local disk, and the disconnections at that session went away.
Third, we finally replaced AWS EFS with AWS EBS (a persistent volume in k8s) for all sessions. It works great now.
BTW, AWS EBS is not highly available across zones, but that is better than FIX disconnections.
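For reference, a minimal sketch of the kind of QuickFIX/J session settings this change implies, assuming the stock file-based store and log components; the paths are illustrative, while the CompIDs, host and port are taken from the logs above:
# Hypothetical QuickFIX/J settings: keep the message store and logs on a
# local/EBS-backed volume instead of a shared EFS mount, so a slow NFS write
# cannot block the session thread that should be sending heartbeats.
[DEFAULT]
ConnectionType=initiator
HeartBtInt=30
StartTime=00:00:00
EndTime=00:00:00
FileStorePath=/var/local/quickfixj/store
FileLogPath=/var/local/quickfixj/log

[SESSION]
BeginString=FIXT.1.1
# 1137=9 in the logon above corresponds to FIX 5.0 SP2
DefaultApplVerID=FIX.5.0SP2
SenderCompID=TA_Quote1
TargetCompID=Quote1
SocketConnectHost=10.60.45.132
SocketConnectPort=11050
ResetOnLogon=Y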

What would happen if a process established multiple PostgreSQL connections and terminated without closing them?

I'm writing a DLL for purchased software.
The software will perform multi-threaded calculations on certain tasks.
My job is to output the relative result into a database.
However, due to the limited support of the software, it is kind of difficult to do multi-threaded output of the data.
The key problem is that there is no info on the last execution of the DLL function.
Therefore, the database connection will not be closed.
So may I ask if I leave the connection open and terminate the process, what would be the potential problems?
My platform is Windows Server 2008, and PostgreSQL 10.
I don't understand the background information you are giving, but I can answer the question:
If a PostgreSQL client process dies without closing the database (and TCP) connection, the PostgreSQL server process (“backend process”) that serves this connection will not realize this immediately.
Of course, as soon as the server tries to communicate with the client, e.g. to return some results, it will notice that the partner has gone away and will get an error.
However, often the backend process is idle, waiting for the client to send the next request. In this case, it would never notice that its partner has died. This could eventually cause max_connections to be exhausted with dead connections.
Because this is a common problem in networking, TCP provides the “keepalive” functionality: when a connection has been idle for a while (2 hours by default), the operating system will send a so-called “keepalive packet” and wait for a response from the other side. Sending keepalive packets is repeated several times (5 times by default) in short intervals (1 second by default), and if no response is received, the connection is closed by the operating system, the backend process receives an error message and terminates.
PostgreSQL provides parameters with which you can configure these settings on the server side: tcp_keepalives_idle, tcp_keepalives_count and tcp_keepalives_interval. If you set tcp_keepalives_idle to a shorter value, dead connections will be detected and removed faster, at the cost of a little added network traffic.
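For illustration, a rough sketch of tightening those server-side settings (the values are examples, not recommendations; note that tcp_keepalives_count is not supported on Windows and should stay at 0 there):
-- Example only: probe idle client connections sooner so dead ones are
-- detected and cleaned up faster.
ALTER SYSTEM SET tcp_keepalives_idle = 60;       -- seconds of idle time before the first probe
ALTER SYSTEM SET tcp_keepalives_interval = 10;   -- seconds between unanswered probes
SELECT pg_reload_conf();                         -- pick up the change without a restart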

Why won't the RabbitMQ connection stay open when not in use?

I have used the http://github.com/streadway/amqp package in my application to handle connections to a remote RabbitMQ server. Everything is OK and works fine, but when a connection is idle for a long period of time, e.g. 6 hours, it gets closed. I check NotifyClose(make(chan *amqp.Error)) all the time in my goroutine and it returns:
Exception (501) Reason: "write tcp
192.168.133.53:55424->192.168.134.34:5672: write: broken pipe"
Why does this error happen? (Is there a problem in my code?)
How long can a connection be idle?
How can I prevent this problem?
As Cosmic Ossifrage says, the error is saying your RabbitMQ client has disconnected.
There are so many things that could sit between your client and server that can/will drop dormant connections that it's not worth focusing on how long your connection can be dormant for. You want to set the requested heartbeat interval in your connection manager.
https://www.rabbitmq.com/heartbeats.html
I'm not familiar with the framework you're using, but I see it has a defaultHeartbeat field in connection.go. You might need to experiment with the value to find the best balance: short enough to stop the connection from being killed, but not so short that you hit the server too often with keep-alive traffic.
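As a rough sketch with streadway/amqp (not tested against your setup; the URL and the 30-second interval are placeholders), the heartbeat can be requested explicitly via DialConfig instead of relying on the library default:
package main

import (
	"log"
	"time"

	"github.com/streadway/amqp"
)

func main() {
	// Ask the broker for a 30s heartbeat so the connection never looks idle
	// to firewalls/NAT devices sitting between client and server.
	conn, err := amqp.DialConfig("amqp://guest:guest@rabbit.example.com:5672/", amqp.Config{
		Heartbeat: 30 * time.Second,
		Locale:    "en_US",
	})
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	// Still watch for closes, as in the question, to log why a connection died.
	closeCh := conn.NotifyClose(make(chan *amqp.Error))
	go func() {
		for e := range closeCh {
			log.Printf("connection closed: %v", e)
		}
	}()

	// ... application work ...
	time.Sleep(time.Minute)
}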

AMQ9999 occurring in AMQERR01.LOG

I have the following error showing up in AMQERR01.LOG
AMQ9999: Channel 'MGATESrvChannel' to host 'Mgate (127.0.0.1)' ended
abnormally.
EXPLANATION:
The channel program running under process ID 1060(4364) for channel
'MGATESrvChannel' ended abnormally. The host name is 'Mgate (127.0.0.1)'; in
some cases the host name cannot be determined and so is shown as '????'.
This error is preceded by the following message:
AMQ9508: Program cannot connect to the queue manager.
EXPLANATION:
The connection attempt to queue manager 'MGATE.QM' failed with reason code
2059.
ACTION:
Ensure that the queue manager is available and operational.
According to what I have been told, this can be caused by an application that is using the queue manager; however, it seems to me that it has more to do with the way the queue manager was set up, or something similar. Can anyone please shed some light on this?
Thanks in advance!
The 2059 says that a connection request was received and refused because the QMgr was not available. We used to see this a lot when the listener was run as a separate process or when inetd was used to start channels. This is because the listener was there to accept the connection but the QMgr processes were not.
Now that the listener is normally run as a child process of the QMgr, it is quite rare to see this in the WMQ error logs, though clients still commonly see it. That is because when the listener runs as a child process of the QMgr and the QMgr is down, there is nothing listening to receive the connection request, so it bounces off the host's IP stack before ever getting to MQ code.
The AMQ9999 message says that a channel program, one of the QMgr's child processes, died or was killed, and this caused the channel to terminate. There are many reasons for a channel process to die, including being killed by the OS if resources are short, or being killed by a human operator. Other than that, the most common way they die is when they run in trusted (fastpath) mode and the attached program corrupts them.
It would help to narrow down the field to know the details of the QMgr in question - version and fix pack, how the listeners are started, channel settings, etc.
Start your listener up. You can also check the Control property for that listener so that it starts automatically when the queue manager restarts.
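For example, a hedged MQSC sketch (the listener name and port are illustrative):
* Define a TCP listener whose lifecycle follows the queue manager,
* then start it immediately.
DEFINE LISTENER(TCP.LISTENER) TRPTYPE(TCP) PORT(1414) CONTROL(QMGR)
START LISTENER(TCP.LISTENER)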

What is the difference between "ORA-12571: TNS packet writer failure" and "ORA-03135: connection lost contact"?

I am working in an environment where we get production issues from time to time related to Oracle connections. We use ODP.NET from ASP.NET applications, and we suspect the firewall closes connections that have been in the connection pool too long.
Sometimes we get an "ORA-12571: TNS packet writer failure" error, and sometimes we get "ORA-03135: connection lost contact."
I was wondering if someone has run into this and/or has an understanding of the difference between the 2 errors.
Using a mobile phone analogy:
ORA-12571 (failure): the call was dropped.
ORA-03135 (connection lost): the other party hung up.
My understanding is that 3135 occurs when a connection is lost. This doesn't tell you why the connection was lost, though. It may have been terminated by the server because the server failed to receive a response to a probe for a certain amount of time, and assumed that the connection was dead. Or (I'm not sure about this) the exact reverse of that: the client failed to receive a probe response from the server for a certain amount of time, so it assumed the connection was lost. The "certain amount of time" is controlled by SQLNET.EXPIRE_TIME=[minutes] in sqlnet.ora.
As for 12571, my (again vague) understanding is that there was a sudden failure to send a packet during communication with the server, and that this is typically caused by some software or hardware interfering with the connection (either by design, or by error). For instance, if you pull out your ethernet cable and then try to execute a query, you'll probably get this. Or if a firewall or anti-malware application decides to block the traffic.
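To make the parameter mentioned above concrete, a minimal server-side sqlnet.ora sketch (the 10-minute value is only an example):
# Send a probe on idle connections every 10 minutes so dead sessions are
# detected; depending on how the firewall counts traffic, the probes can also
# keep a pooled connection from looking idle to it.
SQLNET.EXPIRE_TIME = 10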

Resources