IBM WebSphere MQ cluster channels ending abnormally and resuming frequently

In a cluster environment, I see channels to a particular server ending abnormally and resuming several times a day.
E.g.: QMGR A has several QMGRs (B, C, D, E, F) connected to it (each on a different server).
The cluster receiver channels from QMGRs B, C, D, E, F end abnormally on QMGR A and then resume, quite frequently during the day.
QMGR A LOGS
-------------------------------------------------------------------------------
08/04/12 08:44:41 - Process(1720412.1165) User(mqad) Program(amqrmppa)
AMQ9209: Connection to host 'HOST.B (139.120.210.19)' closed.
EXPLANATION:
An error occurred receiving data from 'HOST.B (139.120.210.19)' over TCP/IP.
The connection to the remote host has unexpectedly terminated.
ACTION:
Tell the systems administrator.
----- amqccita.c : 3094 -------------------------------------------------------
08/04/12 08:44:41 - Process(1720412.1165) User(mqad) Program(amqrmppa)
AMQ9999: Channel program ended abnormally.
EXPLANATION:
Channel program 'CHANNEL.TO.B' ended abnormally.
ACTION:
Look at previous error messages for channel program 'CHANNEL.TO.B' in the
error files to determine the cause of the failure.
----- amqrccca.c : 777 --------------------------------------------------------
08/04/12 08:44:41 - Process(1720412.1175) User(mqad) Program(amqrmppa)
AMQ9209: Connection to host 'HOST.C (155.10.186.20)' closed.
EXPLANATION:
An error occurred receiving data from 'HOST.C (155.10.186.20)' over TCP/IP.
The connection to the remote host has unexpectedly terminated.
ACTION:
Tell the systems administrator.
----- amqccita.c : 3094 -------------------------------------------------------
08/04/12 08:44:41 - Process(1720412.1175) User(mqad) Program(amqrmppa)
AMQ9999: Channel program ended abnormally.
EXPLANATION:
Channel program 'CHANNEL.TO.C' ended abnormally.
ACTION:
Look at previous error messages for channel program 'CHANNEL.TO.C' in the
error files to determine the cause of the failure.
-------------------------------------------------------------------------------
QMGR LOG on HOST B
08/04/2012 08:44:09 AM - Process(17174.16023) User(mqad) Program(amqrmppa)
AMQ9259: Connection timed out from host 'HOST.A'.
EXPLANATION:
A connection from host 'HOST.A' over TCP/IP timed out.
ACTION:
Check to see why data was not received in the expected time. Correct the
problem. Reconnect the channel, or wait for a retrying channel to reconnect
itself.
----- amqccita.c : 3546 -------------------------------------------------------
08/04/2012 08:44:09 AM - Process(17174.16023) User(mqad) Program(amqrmppa)
AMQ9999: Channel program ended abnormally.
EXPLANATION:
Channel program 'CHANNEL.TO.B' ended abnormally.
ACTION:
Look at previous error messages for channel program 'CHANNEL.TO.B' in the
error files to determine the cause of the failure.
QMGR LOG on HOST C
-------------------------------------------------------------------------------
08/04/12 08:44:35 - Process(462890.4658) User(mqad) Program(amqrmppa)
AMQ9259: Connection timed out from host 'HOST.A'.
EXPLANATION:
A connection from host 'HOST.A' over TCP/IP timed out.
ACTION:
Check to see why data was not received in the expected time. Correct the
problem. Reconnect the channel, or wait for a retrying channel to reconnect
itself.
----- amqccita.c : 3341 -------------------------------------------------------
08/04/12 08:44:35 - Process(462890.4658) User(mqad) Program(amqrmppa)
AMQ9999: Channel program ended abnormally.
EXPLANATION:
Channel program 'CHANNEL.TO.C' ended abnormally.
ACTION:
Look at previous error messages for channel program 'CHANNEL.TO.C' in the
error files to determine the cause of the failure.
----- amqrmrsa.c : 468 --------------------------------------------------------
I'm trying to understand what is causing this. Could it be caused by queue manager A being overloaded with that many connections? I don't see any TCP/IP error code logged in the qmgr log.

It looks like you are running a pre-V7.1 version of MQ. In MQ V7.1 that error message was updated from:
AMQ9259: Connection timed out from host 'HOST.A'.
EXPLANATION:
A connection from host 'HOST.A' over TCP/IP timed out.
ACTION:
Check to see why data was not received in the expected time. Correct the
problem. Reconnect the channel, or wait for a retrying channel to reconnect
itself.
to
AMQ9259: Connection timed out from host 'HOST.A'.
EXPLANATION:
A connection from host 'HOST.A' over TCP/IP timed out.
ACTION:
The select() [TIMEOUT] 60 seconds call timed out. Check to see why data was
not received in the expected time. Correct the problem. Reconnect the channel,
or wait for a retrying channel to reconnect itself.
as an example. The most likely reason for the AMQ9259 error message is that your receive timeout settings have caused the channel to pop out of its receive call and close the channel. I suggest you review the receive timeout settings in your qm.ini file to see whether they are set to something shorter than your heartbeat intervals.
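For reference, the relevant attributes live in the TCP stanza of qm.ini. Broadly, ReceiveTimeoutType controls whether ReceiveTimeout is applied as a multiplier of the negotiated heartbeat (MULTIPLY), as seconds added to it (ADD), or as an absolute number of seconds (EQUAL), and ReceiveTimeoutMin puts a floor on the calculated wait. The values below are purely illustrative, not recommendations:
TCP:
   ReceiveTimeoutType=MULTIPLY
   ReceiveTimeout=0
   ReceiveTimeoutMin=0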
The channels restart again automatically because you have retry intervals defined on them. This is good!
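To see what heartbeat interval is actually in play on one of the failing channels (for comparison with the receive timeout settings above), you could check the channel status in runmqsc on queue manager A; the channel name below is taken from the logs, and QMGR.A is a stand-in for your real queue manager name:
echo "DISPLAY CHSTATUS(CHANNEL.TO.B) HBINT" | runmqsc QMGR.A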

Related

Pre-login handshake error while connecting to Azure SQL Edge from ADS

I'm setting up Azure SQL Database on the local machine (Windows 11) using Azure Data Studio.
I followed the below article to create an Azure SQL Edge instance:
https://learn.microsoft.com/en-us/azure/azure-sql/database/local-dev-experience-quickstart?view=azuresql
After publishing (i.e. after step 11 in the above article) I'm getting the below error logs:
Waiting for 2 seconds before another attempt for operation 'Validating the docker container'
Running operation 'Validating the docker container' Attempt 0 of 10
> docker ps -q -a --filter label=source=sqldbproject-choicemls -q
stdout: 142c44a8b420
stdout:
>>> docker ps -q -a --filter label=source=sqldbproject-choicemls -q … exited with code: 0
Operation 'Validating the docker container' completed successfully. Result: 142c44a8b420
Docker created id: '142c44a8b420
'
Waiting for 10 seconds before another attempt for operation 'Connecting to SQL Server'
Running operation 'Connecting to SQL Server' Attempt 0 of 3
Operation 'Connecting to SQL Server' failed. Re-trying... Current Result: undefined. Error: 'Connection failed error: 'A connection was successfully established with the server, but then an error occurred during the pre-login handshake. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.)''
Waiting for 10 seconds before another attempt for operation 'Connecting to SQL Server'
Running operation 'Connecting to SQL Server' Attempt 1 of 3
Operation 'Connecting to SQL Server' failed. Re-trying... Current Result: undefined. Error: 'Connection failed error: 'A connection was successfully established with the server, but then an error occurred during the pre-login handshake. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.)''
Waiting for 10 seconds before another attempt for operation 'Connecting to SQL Server'
Running operation 'Connecting to SQL Server' Attempt 2 of 3
Operation 'Connecting to SQL Server' failed. Re-trying... Current Result: undefined. Error: 'Connection failed error: 'A connection was successfully established with the server, but then an error occurred during the pre-login handshake. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.)''
Please give suggestions to solve this issue.
Thanks,
Saurabh

Parse IBM MQ v9.1 Error Logs using Splunk

I'm forwarding my IBM MQ v9.1 error logs with the Splunk forwarder to a centralized cluster, to see trends in the common errors occurring across my distributed messaging systems.
However, I'm unable to parse the required fields, since the format of the MQ error logs varies: the severity of a message can be error, warning, informational, severe, or termination, and each has its own set of fields, so they are not consistent.
Please let me know if anyone has used regex in Splunk to parse the fields of IBM MQ v9.1 error logs.
I have tried a few regex patterns but they weren't parsing as expected.
I have already referred to the link below, but that is for v8 and the format of the error logs is different in v9:
https://t-rob.net/2017/12/18/parsing-mq-error-logs-in-splunk/
Also, the splunk user is unable to access the error logs. I have added the below stanza to qm.ini:
Filesystem:
   ValidateAuth=No
and have also run chmod -R 755 on the /var/mqm/qmgrs/qmName/errors folder.
Although the permissions on the error logs don't change when they are updated, when the logs rotate the permissions are reset and the splunk user is no longer able to read them.
Please let me know how to overcome this without adding the splunk user to the mqm group.
I would suggest enabling JSON logging and forwarding those logs to Splunk, which should be able to parse that format.
In the IBM MQ v9.0.4 CD release IBM added the ability to write a JSON formatted log; MQ will always log to the original AMQERR0x.LOG files even if you enable JSON logging. This is included in all MQ 9.1 LTS and CD releases.
The IBM MQ v9.1 Knowledge Center page IBM MQ > Configuring > Changing IBM MQ and queue manager configuration information > Attributes for changing queue manager configuration information > Diagnostic message logging > Diagnostic message service stanzas > Diagnostic message services has information on the topic. You can add the following to your qm.ini to have it write the log information to a JSON formatted file called AMQERR0x.json in the standard queue manager errors directory:
DiagnosticMessages:
   Service = File
   Name = JSONLogs
   Format = json
   FilePrefix = AMQERR
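On the Splunk side, a minimal sketch of what picking up those JSON files might look like (the sourcetype name here is made up, and the monitor path assumes the standard errors directory mentioned in the question):
# inputs.conf on the forwarder
[monitor:///var/mqm/qmgrs/*/errors/AMQERR0*.json]
sourcetype = mq_json_errorlog
# props.conf on the forwarder (INDEXED_EXTRACTIONS is applied at input time)
[mq_json_errorlog]
INDEXED_EXTRACTIONS = json
Alternatively, KV_MODE = json in props.conf on the search head gives search-time extraction of the same fields.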
As noted by the OP the JSON formatted logs do not contain the EXPLANATION or ACTION portion that you see in the normal logs.
In IBM MQ v9.1 you can use the mqrc command to convert the JSON format to the familiar format you see in AMQERR01.LOG.
One simple example is below:
cat <<EOL |mqrc -i json -o text -
{"ibm_messageId":"AMQ9209E","ibm_arithInsert1":0,"ibm_arithInsert2":0,"ibm_commentInsert1":"localhost (127.0.0.1)","ibm_commentInsert2":"TCP/IP","ibm_commentInsert3":"SYSTEM.DEF.SVRCONN","ibm_datetime":"2018-02-22T06:54:53.942Z","ibm_serverName":"QM1","type":"mq_log","host":"0df0ce19c711","loglevel":"ERROR","module":"amqccita.c:4214","ibm_sequence":"1519282493_947814358","ibm_remoteHost":"127.0.0.1","ibm_qmgrId":"QM1_2018-02-13_10.49.57","ibm_processId":4927,"ibm_threadId":4,"ibm_version":"9.1.0.5","ibm_processName":"amqrmppa","ibm_userName":"johndoe","ibm_installationName":"Installation1","ibm_installationDir":"/opt/mqm","message":"AMQ9209E: Connection to host 'localhost (127.0.0.1)' for channel 'SYSTEM.DEF.SVRCONN' closed."}
EOL
The output will be:
02/22/2018 06:54:53 AM - User(johndoe) Program(amqrmppa)
Host(0df0ce19c711) Installation(Installation1)
VRMF(9.1.0.5) QMgr(QM1)
Time(2018-02-22T11:54:53.942Z)
RemoteHost(127.0.0.1)
CommentInsert1(localhost (127.0.0.1))
CommentInsert2(TCP/IP)
CommentInsert3(SYSTEM.DEF.SVRCONN)
AMQ9209E: Connection to host 'localhost (127.0.0.1)' for channel
'SYSTEM.DEF.SVRCONN' closed.
EXPLANATION:
An error occurred receiving data from 'localhost (127.0.0.1)' over TCP/IP. The
connection to the remote host has unexpectedly terminated.
The channel name is 'SYSTEM.DEF.SVRCONN'; in some cases it cannot be determined
and so is shown as '????'.
ACTION:
Tell the systems administrator.
----- amqccita.c : 4214 -------------------------------------------------------
You can also use mqrc with just the error message ID from the JSON, for example AMQ9209E; you can run the command like this:
mqrc AMQ9209E
The output will be:
536908297 0x20009209 rrcE_CONNECTION_CLOSED
536908297 0x20009209 urcMS_CONN_CLOSED
MESSAGE:
Connection to host '<insert one>' for channel '<insert three>' closed.
EXPLANATION:
An error occurred receiving data from '<insert one>' over <insert two>. The
connection to the remote host has unexpectedly terminated.
The channel name is '<insert three>'; in some cases it cannot be determined and
so is shown as '????'.
ACTION:
Tell the systems administrator.
You could take it further and specify the inserts from the JSON:
Example portion of the JSON log:
"ibm_messageId":"AMQ9209E","ibm_arithInsert1":0,"ibm_arithInsert2":0,"ibm_commentInsert1":"localhost (127.0.0.1)","ibm_commentInsert2":"TCP/IP","ibm_commentInsert3":"SYSTEM.DEF.SVRCONN"
In the command below each ibm_arithInsert is specified in order with a preceding -n flag, followed by each ibm_commentInsert with a preceding -c flag:
mqrc AMQ9209E -n 0 -n 0 -c "localhost (127.0.0.1)" -c "TCP/IP" -c "SYSTEM.DEF.SVRCONN"
The output is below:
536908297 0x20009209 rrcE_CONNECTION_CLOSED
536908297 0x20009209 urcMS_CONN_CLOSED
MESSAGE:
Connection to host 'localhost (127.0.0.1)' for channel 'SYSTEM.DEF.SVRCONN'
closed.
EXPLANATION:
An error occurred receiving data from 'localhost (127.0.0.1)' over TCP/IP. The
connection to the remote host has unexpectedly terminated.
The channel name is 'SYSTEM.DEF.SVRCONN'; in some cases it cannot be determined
and so is shown as '????'.
ACTION:
Tell the systems administrator.

drbdadm not creating block device

We are in the process of building an active-passive cluster via DRBD installed on CentOS 7.4, running kernel 3.10.0-862.el7. While creating the cluster, drbdadm is unable to create a volume and gives the error below. Can you please help me out?
open(/dev/vdb) failed: Invalid argument
could not open with O_DIRECT, retrying without
'/dev/vdb' is not a block device!
open(/dev/vdb) failed: Invalid argument
could not open with O_DIRECT, retrying without
'/dev/vdb' is not a block device!
Command 'drbdmeta 0 v08 /dev/vdb internal create-md' terminated with exit code 20

MS MPI Permission errors

I have two machines both with MS MPI 7.1 installed, one called SERVER and one called COMPUTE.
The machines are set up on a LAN in a simple Windows workgroup (no AD), and both have an account with the same name and password.
Both are running the MSMPILaunchSvc service.
Both machines can execute MPI jobs locally, verified by testing with the hostname command
SERVER> mpiexec -hosts 1 SERVER 1 hostname
SERVER
or
COMPUTE> mpiexec -hosts 1 COMPUTE 1 hostname
COMPUTE
in a terminal on the machines themselves.
I have disabled the firewall on both machines to make things easier.
My problem is that I cannot get MPI to run jobs from SERVER on a remote host:
1: SERVER with MSMPILaunchSvc -> COMPUTE with MSMPILaunchSvc
SERVER> mpiexec -hosts 1 COMPUTE 1 hostname -pwd
ERROR: Failed RpcCliCreateContext error 1722
Aborting: mpiexec on SERVER is unable to connect to the smpd service on COMPUTE:8677
Other MPI error, error stack:
connect failed - The RPC server is unavailable. (errno 1722)
What's even more frustrating here is that I only sometimes get prompted to enter a password. It suggests SERVER\Maarten as the user for COMPUTE, the account I am already logged in as on SERVER, which shouldn't exist on COMPUTE (should it be COMPUTE\Maarten then?). Nonetheless it also fails:
SERVER>mpiexec -hosts 1 COMPUTE 1 hostname.exe -pwd
Enter Password for SERVER\Maarten:
Save Credentials[y|n]? n
ERROR: Failed to connect to SMPD Manager Instance error 1726
Aborting: mpiexec on SERVER is unable to connect to the
smpd manager on COMPUTE:50915 error 1726
2: COMPUTE with MSMPILaunchSvc -> SERVER with MSMPILaunchSvc
COMPUTE> mpiexec -hosts 1 SERVER 1 hostname -pwd
ERROR: Failed RpcCliCreateContext error 5
Aborting: mpiexec on COMPUTE is unable to connect to the smpd service on SERVER:8677
Other MPI error, error stack:
connect failed - Access is denied. (errno 5)
3: COMPUTE with MSMPILaunchSvc -> SERVER with smpd daemon
Aborting: mpiexec on COMPUTE is unable to connect to the smpd service on SERVER:8677
Other MPI error, error stack:
connect failed - Access is denied. (errno 5)
4: SERVER with MSMPILaunchSvc -> COMPUTE with smpd daemon
ERROR: Failed to connect to SMPD Manager Instance error 1726
Aborting: mpiexec on SERVER is unable to connect to the smpd manager on
COMPUTE:51022 error 1726
Update:
Trying with smpd daemon on both nodes I get this error:
[-1:9796] Authentication completed. Successfully obtained Context for Client.
[-1:9796] version check complete, using PMP version 3.
[-1:9796] create manager process (using smpd daemon credentials)
[-1:9796] smpd reading the port string from the manager
[-1:9848] Launching smpd manager instance.
[-1:9848] created set for manager listener, 376
[-1:9848] smpd manager listening on port 51149
[-1:9796] closing the pipe to the manager
[-1:9848] Authentication completed. Successfully obtained Context for Client.
[-1:9848] Authorization completed.
[-1:9848] version check complete, using PMP version 3.
[-1:9848] Received session header from parent id=1, parent=0, level=0
[01:9848] Connecting back to parent using host SERVER and endpoint 17979
[01:9848] Previous attempt failed with error 5, trying to authenticate without Kerberos
[01:9848] Failed to connect back to parent error 5.
[01:9848] ERROR: Failed to connect back to parent 'ncacn_ip_tcp:SERVER:17979' error 5
[01:9848] smpd manager successfully stopped listening.
[01:9848] SMPD exiting with error code 4294967293.
and on the host:
[-1:12264] Launching SMPD service.
[-1:12264] smpd listening on port 8677
[-1:12264] Authentication completed. Successfully obtained Context for Client.
[-1:12264] version check complete, using PMP version 3.
[-1:12264] create manager process (using smpd daemon credentials)
[-1:12264] smpd reading the port string from the manager
[-1:16668] Launching smpd manager instance.
[-1:16668] created set for manager listener, 364
[-1:16668] smpd manager listening on port 18033
[-1:12264] closing the pipe to the manager
[-1:16668] Authentication completed. Successfully obtained Context for Client.
[-1:16668] Authorization completed.
[-1:16668] version check complete, using PMP version 3.
[-1:16668] Received session header from parent id=1, parent=0, level=0
[01:16668] Connecting back to parent using host SERVER and endpoint 18031
[01:16668] Authentication completed. Successfully obtained Context for Client.
[01:16668] Authorization completed.
[01:16668] handling command SMPD_CONNECT src=0
[01:16668] now connecting to COMPUTE
[01:16668] 1 -> 2 : returning SMPD_CONTEXT_LEFT_CHILD
[01:16668] using spn msmpi/COMPUTE to contact server
[01:16668] SERVER posting a re-connect to COMPUTE:51161 in left child context.
[01:16668] ERROR: Failed to connect to SMPD Manager Instance error 1726
[01:16668] sending abort command to parent context.
[01:16668] posting command SMPD_ABORT to parent, src=1, dest=0.
[01:16668] ERROR: smpd running on SERVER is unable to connect to smpd service on COMPUTE:8677
[01:16668] Handling cmd=SMPD_ABORT result
[01:16668] cmd=SMPD_ABORT result will be handled locally
[01:16668] parent terminated unexpectedly - initiating cleaning up.
[01:16668] no child processes to kill - exiting with error code -1
I found after trial and error that these and other nonspecific errors come up when trying to run MS MPI with mixed configurations (in my case a mix of HPC Cluster 2008 and HPC Cluster 2012 with MS MPI).
The solution was to downgrade all nodes to Windows Server 2008 R2 with HPC Cluster 2008. Because I don't use AD, I had to fall back to using the SMPD daemon and add firewall rules for it (skipping the cluster management tools altogether).
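For the firewall rules, a sketch of the kind of inbound rules involved (the rule names are made up, the program path assumes the default MS MPI install location, and port 8677 is the SMPD service port seen in the logs above; the manager instances use dynamic ports, which is why a program-based rule is useful):
netsh advfirewall firewall add rule name="MS-MPI smpd port" dir=in action=allow protocol=TCP localport=8677
netsh advfirewall firewall add rule name="MS-MPI smpd program" dir=in action=allow program="C:\Program Files\Microsoft MPI\Bin\smpd.exe" enable=yes
Run these on each node, then retry the remote mpiexec.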

akka camel ftp. Exception checking connection status: File operation failed: null Connection is not open. Code: 0

My Akka Camel FTP app worked fine and I hadn't touched it for a long time. But a few days ago it stopped consuming files (it works fine locally with the same FTP server). No errors, etc.
Settings:
ftp://user#myip.com?delay=600000&filter=%23datFileFilter&initialDelay=1000&password=xxxxxx&throwExceptionOnConnectFailed=true&username=user
I switched on trace and found:
Exception checking connection status: File operation failed: null Connection is not open. Code: 0
,
Exception checking connection status: File operation failed: 250 OK. Current directory is /
Connection reset. Code: 250
,
User XXXX logged in: true
,
No files found in directory:
which is not true, because the SAME code works fine on localhost.
Some time ago the server crashed and restarted; maybe something was left over from that. I rebuilt the RPM and restarted.
