I am running RabbitMQ through the AMQP gem on 3 worker machines. When the machines are rebooted, my queue shows that workers are only ever added, never unsubscribed. For example, say each machine runs 5 workers:
When I boot 3 machines, I have 15 workers subscribed to the queue
When I shut down all 3 machines, I still have 15 workers subscribed to the queue
When I reboot the 3 machines, I now have 30 workers subscribed to the queue
In reality, I should only have 15 workers.
How can I ensure that my connection to my task queue closes when the machine reboots/shuts down? I have tried:
Signal.trap("INT") do #handles the ctrl c case
connection.close do
EM.stop { exit }
end
end
Signal.trap("TERM") do #handles the reboot and shut down case
connection.close do
EM.stop { exit }
end
end
This does NOT work.
I think what you are looking for is the Consumer Cancel Notification extension.
In your case, the clients have not been notified of the machines rebooting (in other words, they have not received a 'basic.cancel' notification from the RabbitMQ broker when the machines rebooted).
See this excerpt from the link above:
an extension in which the broker will send to the client a basic.cancel in the case of such unexpected consumer cancellations. This is not sent in the case of the broker receiving a basic.cancel from the client. AMQP 0-9-1 clients don't by default expect to receive basic.cancel methods from the broker asynchronously, and so in order to enable this behaviour, the client must present a capabilities table in its client-properties in which there is a key consumer_cancel_notify and a boolean value true
I'm not a Ruby programmer, but I reckon the Java example in the link above should give you the full picture.
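To give you a flavour, here is a rough sketch of that in Java with the RabbitMQ Java client (the host and the "task_queue" name are placeholders, and recent Java clients advertise the consumer_cancel_notify capability by default):

import com.rabbitmq.client.*;
import java.io.IOException;

public class CancelAwareWorker {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // placeholder: your broker host

        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // "task_queue" is a placeholder queue name for illustration only
        channel.basicConsume("task_queue", false, new DefaultConsumer(channel) {
            @Override
            public void handleDelivery(String consumerTag, Envelope envelope,
                                       AMQP.BasicProperties properties, byte[] body) throws IOException {
                // process the message, then acknowledge it
                getChannel().basicAck(envelope.getDeliveryTag(), false);
            }

            @Override
            public void handleCancel(String consumerTag) {
                // Called when the broker cancels the consumer unexpectedly
                // (e.g. the queue was deleted); re-subscribe or shut down cleanly here.
                System.out.println("Consumer " + consumerTag + " was cancelled by the broker");
            }
        });
    }
}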
We have built a FIX client. The client can receive incoming messages, but it cannot send outgoing heartbeat messages or reply to TestRequest messages; after the last heartbeat was sent, something stops the client from sending any further heartbeats.
FIX version: FIX 5.0
The same incident has happened before, and we have a tcpdump for one session from that time.
We deploy every FIX session to a separate k8s pod.
We suspected a CPU resource issue, because the load average is high around the time of the issue, but adding more CPU cores did not solve it. We think the load average is high because of the FIX reconnections.
We suspected an I/O issue, because we use AWS EFS, which is shared by the 3 sessions for logging and the message store, but the problem persisted after we used pod affinity to place the 3 sessions on different nodes.
It does not appear to be a network issue either, since we can still receive FIX messages and the other sessions worked well at that time. We have also disabled SNAT in the k8s cluster.
We are using QuickFIX/J 2.2.0 to create a FIX client. We have 3 sessions, which are deployed to k8s pods on separate nodes:
a rate session to get FX prices from the server
an order session to get transaction (execution report) messages from the server; on this session we only send logon/heartbeat/logout messages to the server
a back-office session to get market status
We use the Apache Camel QuickFIX/J component to simplify our programming. It works well most of the time, but the 3 sessions keep reconnecting to the FIX servers; the frequency is roughly once a month, and usually only 2 of the sessions have issues.
heartbeatInt = 30s
The FIX event messages at the client side:
20201004-21:10:53.203 Already disconnected: Verifying message failed: quickfix.SessionException: Logon state is not valid for message (MsgType=1)
20201004-21:10:53.271 MINA session created: local=/172.28.65.164:44974, class org.apache.mina.transport.socket.nio.NioSocketSession, remote=/10.60.45.132:11050
20201004-21:10:53.537 Initiated logon request
20201004-21:10:53.643 Setting DefaultApplVerID (1137=9) from Logon
20201004-21:10:53.643 Logon contains ResetSeqNumFlag=Y, resetting sequence numbers to 1
20201004-21:10:53.643 Received logon
The incoming FIX messages at the client side:
8=FIXT.1.1☺9=65☺35=0☺34=2513☺49=Quote1☺52=20201004-21:09:02.887☺56=TA_Quote1☺10=186☺
8=FIXT.1.1☺9=65☺35=0☺34=2514☺49=Quote1☺52=20201004-21:09:33.089☺56=TA_Quote1☺10=185☺
8=FIXT.1.1☺9=74☺35=1☺34=2515☺49=Quote1☺52=20201004-21:09:48.090☺56=TA_Quote1☺112=TEST☺10=203☺
----- 21:10:53.203 Already disconnected ----
8=FIXT.1.1☺9=87☺35=A☺34=1☺49=Quote1☺52=20201004-21:10:53.639☺56=TA_Quote1☺98=0☺108=30☺141=Y☺1137=9☺10=183☺
8=FIXT.1.1☺9=62☺35=0☺34=2☺49=Quote1☺52=20201004-21:11:23.887☺56=TA_Quote1☺10=026☺
The outgoing FIX messages at the client side:
8=FIXT.1.1☺9=65☺35=0☺34=2513☺49=TA_Quote1☺52=20201004-21:09:02.884☺56=Quote1☺10=183☺
---- no heartbeat message around 21:09:32 ----
---- 21:10:53.203 Already disconnected ---
8=FIXT.1.1☺9=134☺35=A☺34=1☺49=TA_Quote1☺52=20201004-21:10:53.433☺56=Quote1☺98=0☺108=30☺141=Y☺553=xxxx☺554=xxxxx☺1137=9☺10=098☺
8=FIXT.1.1☺9=62☺35=0☺34=2☺49=TA_Quote1☺52=20201004-21:11:23.884☺56=Quote1☺10=023☺
8=FIXT.1.1☺9=62☺35=0☺34=3☺49=TA_Quote1☺52=20201004-21:11:53.884☺56=Quote1☺10=027☺
Thread dump taken when the TEST message from the server was received. By the way, the gist is from our development environment, which has the same deployment:
https://gist.github.com/hitxiang/345c8f699b4ad1271749e00b7517bef6
We had enabled debug logging in QuickFIX/J, but it did not give much information, only logs for the messages received.
The sequence of events in time order:
20201101-23:56:02.742 an outgoing heartbeat should be sent at this time; it looks like it is being sent but hangs on the I/O write - the thread is in a running state
20201101-23:56:18.651 TEST message from the server side, used to trigger the thread dump
20201101-22:57:45.654 the server side began to close the connection
20201101-22:57:46.727 thread dump - right
20201101-23:57:48.363 logon message
20201101-22:58:56.515 thread dump - left
The right one (2020-11-01T22:57:46.727Z) is when it hangs; the left one (2020-11-01T22:58:56.515Z) is after reconnection.
It looks like the storage we were using, AWS EFS, caused the issue.
However, the feedback from AWS support was that nothing was wrong on the EFS side.
Maybe it is a network issue between the k8s EC2 instances and AWS EFS.
First, we made logging asynchronous for all sessions, which made the disconnections less frequent.
Second, for the market session, we wrote the sequence files to local disk; the disconnections stopped for that session.
Third, we finally replaced AWS EFS with AWS EBS (a persistent volume in k8s) for all sessions. It works great now.
By the way, AWS EBS is not highly available across availability zones, but that is better than FIX disconnections.
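For reference, a minimal sketch of the kind of initiator setup we ended up with (the config file name and the ApplicationAdapter are placeholders, not our exact code); the important part is that FileStorePath and FileLogPath in the settings point at a local, EBS-backed volume instead of the EFS mount, so a slow write cannot stall the session:

import quickfix.*;

public class FixSessionBootstrap {
    public static void main(String[] args) throws Exception {
        // quickfixj.cfg is assumed to set FileStorePath/FileLogPath to a local,
        // EBS-backed path (e.g. /data/fix) rather than an EFS mount.
        SessionSettings settings = new SessionSettings("quickfixj.cfg");
        MessageStoreFactory storeFactory = new FileStoreFactory(settings);
        // Log writes go through SLF4J, so the logging backend can be configured to be asynchronous.
        LogFactory logFactory = new SLF4JLogFactory(settings);
        Application application = new ApplicationAdapter(); // replace with your real Application callbacks
        Initiator initiator = new SocketInitiator(application, storeFactory, settings,
                logFactory, new DefaultMessageFactory());
        initiator.start();
    }
}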
I'm using Apache ActiveMQ Artemis 2.6.3 as an MQTT broker embedded in a Spring 5 application:
@Bean(initMethod = "start", destroyMethod = "stop")
fun embeddedActiveMQ(securityManager: ActiveMQJAASSecurityManager) =
    EmbeddedActiveMQ().apply {
        setConfiguration(getEmbeddedActiveMQConfiguration())
        setConfigResourcePath("activemq-broker.xml")
        setSecurityManager(securityManager)
    }

private fun getEmbeddedActiveMQConfiguration() =
    ConfigurationImpl().apply {
        addAcceptorConfiguration("netty", DefaultConnectionProperties.DEFAULT_BROKER_URL)
        addAcceptorConfiguration("mqtt", "tcp://$host:$mqttPort?protocols=MQTT")
        name = brokerName
        bindingsDirectory = "$dataDir${File.separator}bindings"
        journalDirectory = "$dataDir${File.separator}journal"
        pagingDirectory = "$dataDir${File.separator}paging"
        largeMessagesDirectory = "$dataDir${File.separator}largemessages"
        isPersistenceEnabled = persistence
        connectionTTLOverride = 60000
    }
Although I'm setting the connection TTL to 60 seconds in the Kotlin code above, as suggested in the documentation, and the client disconnected and terminated an hour ago, the log still shows the following entries:
2020-06-22 10:57:03,890 [Thread-29 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$5#ade4717)] DEBUG o.a.a.a.core.server.impl.QueueImpl - Scanning for expires on client1.some-topic
2020-06-22 10:58:03,889 [Thread-35 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$5#ade4717)] DEBUG o.a.a.a.core.server.impl.QueueImpl - Scanning for expires on client1.some-topic
Based on these log entries, I'm afraid that "dead" connection resources are never cleaned up by the server.
What should I do to actually remove the "dead" connections from the server to avoid leaking resources?
The broker will often create resources like addresses, queues, etc. to deal with clients. In the case of MQTT clients the broker will create queues which essentially represent the client's subscriptions.
In this particular case a queue named client1.some-topic has been created for an MQTT subscription and the broker is scanning that queue for expired messages. At this point it looks like the broker is working as designed.
When a client disconnects without unsubscribing, what the broker does with the subscription depends on whether the client used a clean session or not.
If the client used a clean session then the broker will delete the subscription queue when the client disconnects (even in the event of a failure).
Otherwise the broker is obliged to hold on to the subscription queue and route messages to it. If the client never reconnects to unsubscribe then the subscription may fill up with lots of messages and trigger the broker's paging mode and eventually even limit message production altogether. In this case the client can either reconnect and unsubscribe or the subscription queue can be removed administratively.
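For example, here is a minimal sketch using the Eclipse Paho Java client (the broker URL, client id, and topic are placeholders chosen to match the queue name in the log above); with a clean session the broker removes the client1.some-topic subscription queue as soon as the client disconnects:

import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence;

public class CleanSessionSubscriber {
    public static void main(String[] args) throws Exception {
        // "client1" and "some-topic" are placeholders for illustration
        MqttClient client = new MqttClient("tcp://localhost:1883", "client1", new MemoryPersistence());
        MqttConnectOptions options = new MqttConnectOptions();
        options.setCleanSession(true); // broker deletes the subscription queue on disconnect
        client.connect(options);
        client.subscribe("some-topic", 1);
        // ... consume messages ...
        client.disconnect(); // with a clean session, client1.some-topic is removed now
    }
}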
I have a queue whose reader consumes messages under syncpoint, and the reader ended abruptly. This left 2 messages in an uncommitted state, so the "msgage" property keeps increasing for the messages and the "uncom" property of the queue stays at 2, even though we restarted the consumer application and there is no long-running unit of work.
Is there any way we can reset these properties without restarting MQ?
Presuming your application is connecting in client mode (over TCP), I expect that although your application has gone away, from the queue manager's point of view it is still active.
When the network socket the application opened closes, then MQ should roll back the 2 messages so they're eligible for consumption by another application.
The network socket will close when the operating system eventually notices the remote end of the TCP connection is unresponsive - this triggers a 'connection reset by peer' type socket closure. It's the operating system the queue manager is running on which will do this, not the remote one.
Some operating systems can take hours to notice a duff socket in their default configuration. Look into 'TCP keepalive' settings on your operating system to tune how long this takes.
In an Amazon VPC, I have installed RabbitMQ on two nodes.
On Node 1, I ran the following commands:
#Node 1
/etc/init.d/rabbitmq-server stop
rabbitmq-server -detached
rabbitmqctl start_app
rabbitmqctl set_policy HA '^(?!amq\.).*' '{"ha-mode": "all"}'
On Node 2, I ran the following commands to set up the cluster:
/etc/init.d/rabbitmq-server stop
rabbitmq-server -detached
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@<PrivateIP>
rabbitmqctl start_app
rabbitmqctl set_policy HA '^(?!amq\.).*' '{"ha-mode": "all"}'
The RabbitMQ nodes are behind an Elastic Load Balancer. I ran a Java program to keep pushing messages into RabbitMQ.
Case 1: rabbitmqctl list_queues showed the queue name and the same message count on both nodes while the Java program was pushing messages to the queue.
Case 2: I stopped RabbitMQ on node 2 and then started it again, then checked the cluster status and queue message counts. The message count was correct (3330 on both node 1 and node 2).
Case 3: I stopped RabbitMQ on node 1 while the Java program was pushing messages to the queue.
I checked the queue message count on node 2; the count was 70.
I started RabbitMQ on node 1 again, and then the queue count was 75.
I want to set up a RabbitMQ high-availability cluster and ensure no message loss. I have enabled sync_queue on RabbitMQ start in /etc/init.d/rabbitmq-server.
I would appreciate it if you could point out why the message count dropped from approximately 3330 to 70, and also what the best way is to set up and ensure HA.
A few tips:
Does your app use publisher confirms? If you don't want to lose messages, it should (see the sketch after this list).
Is automatic syncing of queues enabled? If not, you have to manually initiate queue syncing for any queue.
You should not restart any node while queues are being synced, or messages might be lost.
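Here is a rough sketch of publisher confirms with the RabbitMQ Java client (the host and queue name are placeholders); the publisher only treats a message as safe once the broker has confirmed it:

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.MessageProperties;

public class ConfirmedPublisher {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // placeholder: your broker or load balancer host
        try (Connection conn = factory.newConnection();
             Channel channel = conn.createChannel()) {
            channel.confirmSelect(); // enable publisher confirms on this channel
            channel.queueDeclare("ha.task_queue", true, false, false, null); // placeholder durable queue
            channel.basicPublish("", "ha.task_queue",
                    MessageProperties.PERSISTENT_TEXT_PLAIN, "hello".getBytes());
            // Block until the broker confirms (or nacks) everything published so far.
            if (!channel.waitForConfirms(5000)) {
                System.err.println("Message was nacked by the broker - republish or raise an alert");
            }
        }
    }
}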
If you don't want to lose messages, you should consider using transactions (tx):
channel.txSelect();  // put the channel in transactional mode
channel.basicPublish("", yourQueue,
        MessageProperties.PERSISTENT_TEXT_PLAIN,
        message.getBytes());
channel.txCommit();  // the message is only safe on the broker after the commit
This can kill performance if you have a high message rate.
Visit
http://www.rabbitmq.com/blog/2011/02/10/introducing-publisher-confirms/
Hi, I created two queue managers. I put a message on the remote queue (PVN.TO.NAG) in the queue manager pvn,
but when I try to get the message from the local queue (LQ.NAG) on the other queue manager (nag), it shows "no more messages". Why?
-bash-3.00$ /opt/mqm/samp/bin/amqsput PVN.TO.NAG pvn
Sample AMQSPUT0 start
target queue is PVN.TO.NAG
hi this is pvn
Sample AMQSPUT0 end
-bash-3.00$ /opt/mqm/samp/bin/amqsget LQ.NAG nag
Sample AMQSGET0 start
no more messages
1) Have you linked the pvn and nag queue managers using sender and receiver channels and a transmission queue?
2) Verify that the remote queue definition PVN.TO.NAG is correct, i.e. that it points to LQ.NAG on the nag queue manager.
3) Check whether the nag queue manager has a dead-letter queue (DLQ) set. If it does, check whether your message is ending up there because the nag queue manager is unable to route it to the target queue. If the message is there, its dead-letter header will contain the reason why it ended up in the DLQ.