Troubleshooting ActiveMQ Artemis Shared Storage HA deployment - high-availability

We have two ActiveMQ Artemis servers in a single cluster configured with the shared-storage HA strategy. The shared storage is an NFS mount.
Both servers keep shutting down: the master shuts down first, the backup acquires the live lock and runs fine for a while, and then the backup shuts down as well.
The exception we get on the master server is:
2022-04-07 21:56:22,892 WARN [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Failure when accessing a lock file: java.io.IOException: The system cannot find the file specified
The exception we get on the slave server is:
2022-04-09 02:43:02,234 WARN [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Failure when accessing a lock file: java.io.IOException: The system cannot find the file specified
2022-04-09 03:00:10,995 WARN [org.apache.activemq.artemis.core.server] AMQ222010: Critical IO Error, shutting down the server. file=NIOSequentialFile \\xx.xxxx-dns.com\NAS\pri\prod\data\bindings\activemq-bindings-2.bindings, message=The system cannot find the file specified: ActiveMQIOErrorException[errorType=IO_ERROR message=The system cannot find the file specified]
2022-04-09 03:00:11,292 WARN [org.apache.activemq.artemis.core.server] AMQ222008: unable to restart server, please kill and restart manually: org.apache.activemq.artemis.core.server.NodeManager$NodeManagerException: java.io.IOException: An unexpected network error occurred
I'm referring to an existing question in which the following mount options are recommended:
Since you're using NFS the NFS client configuration options are worth inspecting as well. Here are the configuration options I would recommend to enable reasonable fail-over times:
timeo=50 - NFS timeout of 5 seconds (the value is in tenths of a second)
retrans=1 - allows only one retry
soft - soft-mounting the NFS share disables the retry-forever logic, allowing NFS errors to surface to the application stack after the timeouts above
noac - turns off caching of file attributes and also enforces synchronous writes to the NFS share, which further reduces the time it takes for NFS errors to surface
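For reference, applied to the share these options would look roughly like the fstab entry below (the server, export path and mount point are placeholders, not our actual NAS paths):
nas.example.com:/pri/prod/data  /opt/artemis/shared  nfs  soft,noac,timeo=50,retrans=1  0  0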
Can these issues be fixed by applying these mount options?

I wouldn't expect the recommended NFS mount options to solve the problems you're having. The main goal of those settings is to ensure NFS responds quickly to error conditions and reports them to the broker. In your case here you're already getting those errors (e.g. java.io.IOException: The system cannot find the file specified).
What you really need to do is track down why NFS is failing to find that file. The broker has no control over this. The exception is coming from the JVM which is, in turn, responding to an error from NFS. There is some problem with NFS itself here (e.g. a network issue).
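To narrow that down it's worth looking at the NFS client itself on the broker hosts. Assuming a Linux NFS client (on Windows you'd look at the NFS client event logs instead), commands along these lines are a reasonable starting point:
mount | grep nfs          # confirm how the share is mounted and with which options
nfsstat -c                # client-side retransmission and timeout counters
dmesg | grep -i nfs       # kernel messages about NFS server timeouts or stale handles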
To be clear, file-system errors like this are deemed "critical" by the broker and will cause it to shut down, so the response to the error that you are observing is the broker's normal behavior.

Related

Hadoop distcp: what ports are used?

If I want to use DistCp on an on-prem Hadoop cluster so it can 'push' data to external cloud storage, what firewall considerations must be made in order to leverage this tool? On what ports does the actual data transfer take place? Is it via SSH, and/or port 8020? I need to make sure network connectivity is provided from source to destination, but with the least amount of privilege ascribed to it (i.e., only opening the ports that are absolutely needed).
I do not believe SSH is used for the actual data transfer; it is only involved in things like logging into the cluster and starting the command.
At a minimum, you would need the RPC and data-transfer ports of the NameNodes and DataNodes, i.e. whatever you've configured for fs.defaultFS, dfs.namenode.rpc-address, and dfs.datanode.address.
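You can confirm which values are actually in effect on your cluster rather than assuming the defaults (which differ between Hadoop 2 and 3), for example:
hdfs getconf -confKey fs.defaultFS
hdfs getconf -confKey dfs.namenode.rpc-address
hdfs getconf -confKey dfs.datanode.address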

NIFI secure 3 node cluster

I am seeing some errors in my NiFi cluster. I have a 3-node secured NiFi cluster, and I am seeing the errors below on two of the nodes.
ERROR [main] org.apache.nifi.web.server.JettyServer Unable to load flow due to:
java.io.IOException: org.apache.nifi.cluster.ConnectionException:
Failed to connect node to cluster due to: java.io.IOException:
Could not begin listening for incoming connections in order to load balance data across the cluster.
Please verify the values of the 'nifi.cluster.load.balance.port' and 'nifi.cluster.load.balance.host'
properties as well as the 'nifi.security.*' properties
See the clustering configuration guide for the list of clustering options you have to configure. For load balancing, you'll need to specify ports that are open in your firewall so that the nodes can communicate. You'll also need to make sure that each host has its node hostname property set, its host ports set, and that there are no firewall restrictions between the nodes and your Apache ZooKeeper cluster.
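For reference, the cluster and load-balance entries in nifi.properties look roughly like this on each node (the hostnames and ports here are placeholders; each node uses its own hostname):
nifi.cluster.is.node=true
nifi.cluster.node.address=nifi-node1.example.com
nifi.cluster.node.protocol.port=11443
nifi.cluster.load.balance.host=nifi-node1.example.com
nifi.cluster.load.balance.port=6342
nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181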
If you want to simplify the setup to play around, you can use the information in the clustering configuration section of the admin guide to set up an embedded ZooKeeper node within each NiFi instance. However, I would recommend setting up an external ZooKeeper cluster. A little more work, but ultimately worth it.

The fix client can receive incoming messages but cannot send outgoing heartbeat message

We have built a FIX client. The client can receive incoming messages but cannot send outgoing heartbeat messages or reply to TestRequest messages; after the last heartbeat was sent, something stops the client from sending any more heartbeats.
FIX version: FIX 5.0.
The same incident happened before, and we have a tcpdump for one session from that time.
We deploy every FIX session to a separate k8s pod.
We suspected a CPU resource issue because the load average was high around the time of the issue, but adding more CPU cores did not solve it; we now think the load average was high because of the FIX reconnections.
We suspected an IO issue because we use AWS EFS, shared by the 3 sessions, for logging and the message store, but the problem remained after we used pod affinity to place the 3 sessions on different nodes.
It's not a network issue either, since we can receive FIX messages and other sessions worked fine at that time. We have also disabled SNAT in the k8s cluster.
We are using QuickFIX/J 2.2.0 to create the FIX client. We have 3 sessions, which are deployed to k8s pods on separate nodes:
a rate session to get FX prices from the server
an order session to get transaction (execution report) messages from the server; we only send logon/heartbeat/logout messages to the server
a back-office session to get market status
We use the Apache Camel quickfixj component to simplify our programming. It works well most of the time, but the 3 sessions keep reconnecting to the FIX servers; the frequency is roughly once a month, and usually only 2 of the sessions are affected.
heartbeatInt = 30s
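For context, each session is configured with QuickFIX/J settings roughly like the ones below; the comp IDs and port are taken from the logs further down, while the host and the store/log paths are placeholders rather than our real values:
[DEFAULT]
ConnectionType=initiator
BeginString=FIXT.1.1
DefaultApplVerID=9
HeartBtInt=30
ReconnectInterval=5
FileStorePath=/data/fix/store
FileLogPath=/data/fix/log

[SESSION]
SenderCompID=TA_Quote1
TargetCompID=Quote1
SocketConnectHost=fix.example.com
SocketConnectPort=11050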
The FIX event messages at the client side:
20201004-21:10:53.203 Already disconnected: Verifying message failed: quickfix.SessionException: Logon state is not valid for message (MsgType=1)
20201004-21:10:53.271 MINA session created: local=/172.28.65.164:44974, class org.apache.mina.transport.socket.nio.NioSocketSession, remote=/10.60.45.132:11050
20201004-21:10:53.537 Initiated logon request
20201004-21:10:53.643 Setting DefaultApplVerID (1137=9) from Logon
20201004-21:10:53.643 Logon contains ResetSeqNumFlag=Y, resetting sequence numbers to 1
20201004-21:10:53.643 Received logon
The FIX incoming messages at the client side:
8=FIXT.1.1☺9=65☺35=0☺34=2513☺49=Quote1☺52=20201004-21:09:02.887☺56=TA_Quote1☺10=186☺
8=FIXT.1.1☺9=65☺35=0☺34=2514☺49=Quote1☺52=20201004-21:09:33.089☺56=TA_Quote1☺10=185☺
8=FIXT.1.1☺9=74☺35=1☺34=2515☺49=Quote1☺52=20201004-21:09:48.090☺56=TA_Quote1☺112=TEST☺10=203☺
----- 21:10:53.203 Already disconnected ----
8=FIXT.1.1☺9=87☺35=A☺34=1☺49=Quote1☺52=20201004-21:10:53.639☺56=TA_Quote1☺98=0☺108=30☺141=Y☺1137=9☺10=183☺
8=FIXT.1.1☺9=62☺35=0☺34=2☺49=Quote1☺52=20201004-21:11:23.887☺56=TA_Quote1☺10=026☺
The FIX outgoing messages at the client side:
8=FIXT.1.1☺9=65☺35=0☺34=2513☺49=TA_Quote1☺52=20201004-21:09:02.884☺56=Quote1☺10=183☺
---- no heartbeat message around 21:09:32 ----
---- 21:10:53.203 Already disconnected ---
8=FIXT.1.1☺9=134☺35=A☺34=1☺49=TA_Quote1☺52=20201004-21:10:53.433☺56=Quote1☺98=0☺108=30☺141=Y☺553=xxxx☺554=xxxxx☺1137=9☺10=098☺
8=FIXT.1.1☺9=62☺35=0☺34=2☺49=TA_Quote1☺52=20201004-21:11:23.884☺56=Quote1☺10=023☺
8=FIXT.1.1☺9=62☺35=0☺34=3☺49=TA_Quote1☺52=20201004-21:11:53.884☺56=Quote1☺10=027☺
Thread dump taken when the TEST message from the server was received. BTW, the gist is from our development environment, which has the same deployment:
https://gist.github.com/hitxiang/345c8f699b4ad1271749e00b7517bef6
We had enabled debug logging in QuickFIX/J, but it did not give much information, only logs for the messages received.
The sequence in time order:
20201101-23:56:02.742 The outgoing heartbeat should be sent at this time; it looks like it is being sent but is hung on the IO write - the thread is in the Running state
20201101-23:56:18.651 test message from server side to trigger thread dump
20201101-22:57:45.654 server side began to close the connection
20201101-22:57:46.727 thread dump - right
20201101-23:57:48.363 logon message
20201101-22:58:56.515 thread dump - left
The right one (2020-11-01T22:57:46.727Z) is when it hangs; the left one (2020-11-01T22:58:56.515Z) is after reconnection.
It looks like the storage we are using - AWS EFS - is what caused the issue.
But the feedback from AWS support is that nothing is wrong on the AWS EFS side.
Maybe it's a network issue between the k8s EC2 instances and AWS EFS.
First, we made logging asynchronous for all sessions, which made the disconnections happen less often.
Second, for the market session, we wrote the sequence files to local disk, and the disconnections on that session went away.
Third, we finally replaced AWS EFS with AWS EBS (a persistent volume in k8s) for all sessions. It works great now.
BTW, AWS EBS is not highly available across zones, but that is better than FIX disconnections.
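For reference, wiring the EBS-backed store into a session pod is just a standard persistent volume claim along the lines of the sketch below (the claim name and storage class are placeholders; ReadWriteOnce is what keeps the volume attached to a single node):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fix-session-store
spec:
  accessModes:
    - ReadWriteOnce        # an EBS volume attaches to one node at a time
  storageClassName: gp2    # EBS-backed storage class (placeholder)
  resources:
    requests:
      storage: 10Gi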

Websphere mq listener available but showing not found error

We are facing an error: the application is unable to connect to the queue manager, with reason code MQRC 2538.
WebSphere MQ version: v7.0.1.2.
Operating system: Solaris.
I have started the listener manually with:
runmqlsr -m qmname -t tcp -p port
Afterwards I checked the status of the listener with the command:
display lsstatus(listener name)
The listener is available, but when I try to display its status it shows "MQ object not found".
We have checked the error logs, but there is no information related to the client failures; since we started the listener manually, the listener information is only available in the error logs.
We also checked /var/mqm/errors and found FDC files with probe ID XY132002. We contacted the sysadmin and they mounted additional disk space.
After increasing the /var/mqm/ disk space we are still facing the same issue.
I have already issued "start lstr(lstr name)" in script mode; it accepts the request, but when I try to display the status of this listener it shows "MQ object not found".
I have checked the queue manager error logs and the FDC files.
Below are the errors written in /var/mqm/errors/AMQERR01.LOG:
Explanation: An attempt has been made to run the broker (SFMSICREQMGR) but the broker has ended for reason '6119:xecF_E_UNEXPECTED_SYSTEM_RC'.
Error: AMQ6119: An internal WebSphere MQ error has occurred (failed to get memory segment: shmget(0x00000000, 16384) [rc=1 errno=28] no space left on device).
The error below is written in the queue-manager-level error log:
AMQ5008: An essential WebSphere MQ process 10063 (amqfgpub) cannot be found and is assumed to be terminated.
These are the errors written in the queue-manager-level and system-level error logs.
We have added the values below:
process.max-file-descriptor=(basic,10000,deny)
project.max-sem-ids=(priv,1024,deny)
project.max-shm-ids=(priv,1024,deny)
project.max-shm-memory=(priv,4294967296,deny)
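(For reference, on Solaris these resource controls are typically applied with projmod to the project that the queue manager runs under; the project name group.mqm below is an assumption, so adjust it to whatever project your mqm user actually belongs to.)
projmod -s -K "process.max-file-descriptor=(basic,10000,deny)" group.mqm
projmod -s -K "project.max-sem-ids=(priv,1024,deny)" group.mqm
projmod -s -K "project.max-shm-ids=(priv,1024,deny)" group.mqm
projmod -s -K "project.max-shm-memory=(priv,4294967296,deny)" group.mqm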
After adding these parameters we restarted the queue managers.
We have four queue managers on the server; three queue managers and their listeners are in running state, while the fourth queue manager faces the same error.
We stopped one of the running queue managers and started the fourth one; the fourth queue manager is now running and its listener is also in running state.
However, the queue manager we stopped will not start again; we are facing the same error for that queue manager.
Now all queue managers and listeners are running fine.
We have created a local queue named error_local_queue, but when the application tries to get a message from this queue it gets error MQRC 2033.
Kindly help with this issue.
Thank you so much to all, the issue got resolved.
If you start a listener using the following command (as per your question):-
runmqlsr -m qmname -t tcp -p port
Then you have not specified a name for the listener anywhere (because this command does not have that capability).
It will however still show up in a DISPLAY LSSTATUS command with a system generated name. If you use the following command:-
DISPLAY LSSTATUS(*)
that will show all running listeners, and you will see that there is one with a name something like SYSTEM.LISTENER.TCP.1 which is your runmqlsr one.
Alternatively, if you want to give your listener a specific name, then you must define a listener as follows (replacing nnnn with your port number):-
DEFINE LISTENER(TCP.LSTR) TRPTYPE(TCP) CONTROL(QMGR) PORT(nnnn)
Then you are able to start it as follows:-
START LISTENER(TCP.LSTR)
and show its status as follows:-
DISPLAY LSSTATUS(TCP.LSTR) ALL
N.B. I used the name TCP.LSTR but you may choose any name you wish.
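As a usage note, DEFINE LISTENER, START LISTENER and DISPLAY LSSTATUS are MQSC commands, so they are issued from inside runmqsc rather than directly at the shell prompt, for example (replace QMNAME and nnnn with your queue manager name and port):
runmqsc QMNAME
DEFINE LISTENER(TCP.LSTR) TRPTYPE(TCP) CONTROL(QMGR) PORT(nnnn)
START LISTENER(TCP.LSTR)
DISPLAY LSSTATUS(TCP.LSTR) ALL
END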
The errors you mention at the end of your question are unrelated to listeners. Please open a separate question for those.
MQ v7.0 has been out of support since September 30th 2015.
The errors you found indicate the queue manager is short on shared memory; this could cause the entire queue manager to have issues, including your listener. The current values, along with IBM's recommendations, can be found using the mqconfig script.
MQ v7.0 did not come with the mqconfig script. Download the script and verify which kernel settings are not correct; the download page is "How to configure UNIX and Linux systems for IBM MQ".
You can find more information on setting these in the IBM MQ v7 Knowledge Center page "Resource limit configuration".
The values in the Knowledge Center are recommended values for an average server with a couple of queue managers and should be treated as minimum values. If you can't run 4 queue managers then I would suggest going to higher values. I would start by setting max-sem-ids and max-shm-ids to 10240 and see if that solves it; if not, then try adding 50% to the max-shm-memory value.
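Once downloaded, running the script is enough to get a report; it prints each relevant kernel parameter with its current value alongside the value IBM recommends (the exact output format varies a little between script revisions):
./mqconfig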

ora-12528: TNS:Listener: All Appropriate instances are blocking new connections

I am getting this error when I try to connect to my database:
ora-12528: TNS:Listener: All Appropriate instances are blocking new connections
I tried the following, with no success:
Stop and Start the Listener.
Shutdown and Startup database.
Restart the Oracle services.
How might I resolve this?
You might have a problem with either the network and/or the archive logs - the above usually happens when the area/disk where the archive logs are stored is full; Oracle then simply refuses new connections.
Another possibility is that you have maxed out the number of allowed connections - this is usually a warning sign that you might have an application which leaks connections.
If you are 100% sure that you are not leaking connections then you could configure Oracle to accept more connections (BEWARE of licensing, RAM etc.!).
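If you can still get a local SYSDBA connection (a bequeath connection bypasses the listener), a few quick checks like the ones below will usually show which of the two it is; the views and parameters used are standard, but adjust to your version:
sqlplus / as sysdba
SQL> archive log list;
SQL> SELECT * FROM v$recovery_area_usage;   -- how full the recovery/archive area is
SQL> SELECT COUNT(*) FROM v$session;        -- current session count
SQL> show parameter sessions                -- configured session limit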
