AMQP 1.0 Qpid JMS and an Issue with Failover/Reconnect - jms

I'm using the Qpid JMS 0.8.0 library to implement a standalone Java AMQP client. Because the underlying transport connection tends to break every couple of hours, I have configured reconnection with the following failover URI:
failover:(amqps://someurl:5671)?failover.reconnectDelay=2000&failover.warnAfterReconnectAttempts=1
According to the Qpid client configuration page, I expect my client to keep trying to reconnect, doubling the delay between attempts (starting with 2 seconds). Instead, the log shows that only two reconnect attempts were made after the connection failure was detected, and then the whole client application terminated, which is exactly what I want to avoid. Here is the log file:
2016-03-22 14:29:40 INFO AmqpProvider:1190 - IdleTimeoutCheck closed the transport due to the peer exceeding our requested idle-timeout.
2016-03-22 14:29:40 DEBUG FailoverProvider:761 - Failover: the provider reports failure: Transport closed due to the peer exceeding our requested idle-timeout
2016-03-22 14:29:40 DEBUG FailoverProvider:519 - handling Provider failure: Transport closed due to the peer exceeding our requested idle-timeout
2016-03-22 14:29:40 DEBUG FailoverProvider:653 - Connection attempt:[1] to: amqps://publish.preops.nm.eurocontrol.int:5671 in-progress
2016-03-22 14:29:40 INFO FailoverProvider:659 - Connection attempt:[1] to: amqps://publish.preops.nm.eurocontrol.int:5671 failed
2016-03-22 14:29:40 WARN FailoverProvider:686 - Failed to connect after: 1 attempt(s) continuing to retry.
2016-03-22 14:29:42 DEBUG FailoverProvider:653 - Connection attempt:[2] to: amqps://publish.preops.nm.eurocontrol.int:5671 in-progress
2016-03-22 14:29:42 INFO FailoverProvider:659 - Connection attempt:[2] to: amqps://publish.preops.nm.eurocontrol.int:5671 failed
2016-03-22 14:29:42 WARN FailoverProvider:686 - Failed to connect after: 2 attempt(s) continuing to retry.
2016-03-22 14:29:43 DEBUG ThreadPoolUtils:156 - Shutdown of ExecutorService: java.util.concurrent.ThreadPoolExecutor#778970af[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0] is shutdown: true and terminated: true took: 0.000 seconds.
2016-03-22 14:29:45 DEBUG ThreadPoolUtils:192 - Waited 2.004 seconds for ExecutorService: java.util.concurrent.ScheduledThreadPoolExecutor#877a470[Shutting down, pool size = 1, active threads = 0, queued tasks = 1, completed tasks = 3] to terminate...
2016-03-22 14:29:46 DEBUG ThreadPoolUtils:156 - Shutdown of ExecutorService: java.util.concurrent.ScheduledThreadPoolExecutor#877a470[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 4] is shutdown: true and terminated: true took: 2.889 seconds.
Any idea what I'm doing wrong here? Basically, what I want is a client that detects a transport connection failure and keeps trying to reconnect every 5-10 seconds.
Many thanks!
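For reference, here is a minimal sketch (not taken from the post) of the kind of client being described: a failover URI with a fixed delay and unlimited attempts, plus an ExceptionListener so connection problems are reported to the application instead of silently ending it. The failover.* option names come from the Qpid JMS failover documentation and should be verified against the 0.8.0 release; the URI itself is a placeholder.
import javax.jms.Connection;
import javax.jms.JMSException;
import org.apache.qpid.jms.JmsConnectionFactory;

public class ReconnectingClient {
    public static void main(String[] args) throws JMSException {
        // Fixed 5 second delay between attempts, no exponential back-off,
        // unlimited reconnect attempts (the documented default).
        String uri = "failover:(amqps://someurl:5671)"
                + "?failover.reconnectDelay=5000"
                + "&failover.useReconnectBackOff=false"
                + "&failover.maxReconnectAttempts=-1";

        JmsConnectionFactory factory = new JmsConnectionFactory(uri);
        Connection connection = factory.createConnection();
        connection.setExceptionListener(ex ->
                System.err.println("Connection problem reported: " + ex.getMessage()));
        connection.start();
        // ... create sessions, producers and consumers as usual ...
    }
}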

Related

Spring Boot application unable to recover after JMS connection failure

We have a Spring Boot application that stops retrying to connect to the Solace queues after 3 connection attempts. The following information is logged, after which the application simply stops responding and we have to restart it:
2021-09-15 16:49:08.021 INFO 4444 --- [recovery-thread] bitronix.tm.recovery.Recoverer : recoverer is already running, abandoning this recovery request
2021-09-15 16:50:04.862 INFO 4444 --- [connect_service] c.s.j.protocol.impl.TcpClientChannel : Connection attempt failed to host '<<hostname>>' ReconnectException com.solacesystems.jcsmp.JCSMPSecurityException: Error performing login to LoginContext (*****) cause: javax.security.auth.login.LoginException: *****
2021-09-15 16:50:07.865 INFO 4444 --- [connect_service] c.s.j.protocol.impl.TcpClientChannel : Connecting to host 'orig=tcp://<<hostname>>:55555, scheme=tcp://, host=<<hostname>>, port=55555' (host 1 of 1, smfclient 2, attempt 3 of 3, this_host_attempt: 1 of 1)
2021-09-15 16:50:07.877 INFO 4444 --- [connect_service] c.s.j.protocol.impl.TcpClientChannel : Connection attempt failed to host '<<hostname>>' ReconnectException com.solacesystems.jcsmp.JCSMPSecurityException: Error performing login to LoginContext (*****) cause: javax.security.auth.login.LoginException: *****
2021-09-15 16:50:10.878 INFO 4444 --- [connect_service] c.s.j.protocol.impl.TcpClientChannel : Stale reconnect task, aborting reconnect.
Below is our configuration for connecting to the Solace queues:
spring.jta.bitronix.connectionfactory.className=com.solacesystems.jms.SolXAConnectionFactoryImpl
spring.jta.bitronix.connectionfactory.driverProperties.host=smf://<<hostname>>:55555
spring.jta.bitronix.connectionfactory.driverProperties.VPN=<<vpn>>
spring.jta.bitronix.connectionfactory.driverProperties.authenticationScheme=AUTHENTICATION_SCHEME_GSS_KRB
spring.jta.bitronix.connectionfactory.driverProperties.KRBServiceName=HOST
In our service class we simply autowire a JmsTemplate and publish messages to the queue.
I went through some documentation and tried adding the following configuration:
spring.jta.bitronix.connectionfactory.ignore-recovery-failures=true
But I still face the same issue. Any suggestions?
Edit:
I face this issue only when I put my laptop in airplane mode and reconnect. If I just disconnect from the VPN and connect back, the Solace connection is re-established.
The SolXAConnectionFactory interface allows you to tune the connect and reconnect parameters. Docs here.
You'll want to check out these and maybe a few others; I suggest searching the javadoc for "retry" and "retries" (see the sketch after this list):
connectRetries
connectRetriesPerHost
connectTimeoutInMillis
reconnectRetries
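As a hedged sketch of how those could be wired up (the property names are taken from the Solace JMS javadoc and assumed to be settable through the Bitronix driverProperties like the ones above; -1 conventionally means retry forever, and the other values are arbitrary examples):
spring.jta.bitronix.connectionfactory.driverProperties.connectRetries = -1
spring.jta.bitronix.connectionfactory.driverProperties.connectRetriesPerHost = 5
spring.jta.bitronix.connectionfactory.driverProperties.reconnectRetries = -1
spring.jta.bitronix.connectionfactory.driverProperties.reconnectRetryWaitInMillis = 3000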
I did more research and found the following helpful and will try it in my application: https://solace.community/discussion/917/why-won-t-my-solace-enterprise-application-reconnect-after-an-ha-failover. To set this via JNDI, I think it should also be configured in SolAdmin -> JMS Administration -> connection factory -> Transport Properties.
After going through the various documentation and some trial and error, the properties below turned out to be useful. Hope this helps somebody:
spring.jta.bitronix.connectionfactory.driverProperties.reconnectRetries = -1
spring.jta.bitronix.connectionfactory.driverProperties.connectRetries = -1

Wrong Atomikos state ABORTING after timeout in transaction

I am testing how Atomikos handles a timeout in a transaction. I have a 30s timeout on an effectively infinite read from the database. After 30s I get this exception:
14:08:40.329 [Atomikos:4] WARN c.a.icatch.imp.ActiveStateHandler - Transaction rob-app-b1c7b95a3b0efb82dfb516b04620a213154159609015500001 has timed out - rolling back...
2018-11-07 14:08:40.354 [pool-3-thread-1] WARN c.a.jdbc.JdbcConnectionProxyHelper - Error enlisting in transaction - connection might be broken? Please check the logs for more information...
java.lang.IllegalStateException: wrong state: ABORTING
at com.atomikos.icatch.imp.CoordinatorImp.registerSynchronization(CoordinatorImp.java:420)
at com.atomikos.icatch.imp.TransactionStateHandler.registerSynchronization(TransactionStateHandler.java:129)
at com.atomikos.icatch.imp.CompositeTransactionImp.registerSynchronization(CompositeTransactionImp.java:177)
at com.atomikos.jdbc.AtomikosConnectionProxy.enlist(AtomikosConnectionProxy.java:211)
at com.atomikos.jdbc.AtomikosConnectionProxy.invoke(AtomikosConnectionProxy.java:122)
at com.sun.proxy.$Proxy133.prepareStatement(Unknown Source)
Why do I get this error and not an Atomikos exception from AtomikosConnectionProxy's enlist method, i.e. something like:
AtomikosSQLException.throwAtomikosSQLException("The transaction has timed out - try increasing the timeout if needed");
Seems like there is a timeout because your application has been using the transaction for too long. Try increasing the transaction timeout?
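If the work genuinely needs more than 30 seconds, one way to raise the limit is through the standard Atomikos properties file (a sketch under the assumption that jta.properties / transactions.properties is being used; values are in milliseconds and are examples only):
# jta.properties - raise the default transaction timeout to 5 minutes
com.atomikos.icatch.default_jta_timeout=300000
# ceiling that per-transaction setTransactionTimeout(...) calls may reach
com.atomikos.icatch.max_timeout=600000
Alternatively, a single transaction can be given a longer timeout by calling UserTransaction.setTransactionTimeout(seconds) before begin().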

Setting up a RocketMQ cluster: slave not visible & not replicating

I'm trying to set up a RocketMQ cluster with a single name server, 1 master and 2 slaves, but I'm running into some problems.
The version I'm running is downloaded from github/rocketmq-all-4.1.0-incubating.zip.
The brokers are run using mqbroker -c broker.conf, where broker.conf differs for master and slave. For the master I have:
listenPort=10911
brokerName=mybroker
brokerClusterName=mybrokercluster
brokerId=0
deleteWhen=04
fileReservedTime=48
brokerRole=SYNC_MASTER
flushDiskType=ASYNC_FLUSH
And for slaves:
listenPort=10911
brokerName=mybroker
brokerClusterName=mybrokercluster
brokerId=1
deleteWhen=04
fileReservedTime=48
brokerRole=SLAVE
flushDiskType=ASYNC_FLUSH
The second slave has brokerId=2.
The brokers start up fine; here are some parts of the log for a slave:
2017-10-02 20:31:35 INFO main - brokerRole=ASYNC_MASTER
2017-10-02 20:31:35 INFO main - flushDiskType=ASYNC_FLUSH
(...)
2017-10-02 20:31:35 INFO main - Replace, key: brokerId, value: 0 -> 1
2017-10-02 20:31:35 INFO main - Replace, key: brokerRole, value: ASYNC_MASTER -> SLAVE
(...)
2017-10-02 20:31:37 INFO main - Set user specified name server address: 172.22.1.38:9876
2017-10-02 20:31:37 INFO ShutdownHook - Shutdown hook was invoked, 1
2017-10-02 20:31:37 INFO ShutdownHook - shutdown thread PullRequestHoldService interrupt false
2017-10-02 20:31:37 INFO ShutdownHook - join thread PullRequestHoldService eclipse time(ms) 0 90000
2017-10-02 20:31:37 WARN ShutdownHook - unregisterBroker Exception, 172.22.1.38:9876
org.apache.rocketmq.remoting.exception.RemotingConnectException: connect to <172.22.1.38:9876> failed
at org.apache.rocketmq.remoting.netty.NettyRemotingClient.invokeSync(NettyRemotingClient.java:359) ~[rocketmq-remoting-4.1.0-incubating.jar:4.1.0-incubating]
at org.apache.rocketmq.broker.out.BrokerOuterAPI.unregisterBroker(BrokerOuterAPI.java:221) ~[rocketmq-broker-4.1.0-incubating.jar:4.1.0-incubating]
at org.apache.rocketmq.broker.out.BrokerOuterAPI.unregisterBrokerAll(BrokerOuterAPI.java:198) ~[rocketmq-broker-4.1.0-incubating.jar:4.1.0-incubating]
at org.apache.rocketmq.broker.BrokerController.unregisterBrokerAll(BrokerController.java:623) [rocketmq-broker-4.1.0-incubating.jar:4.1.0-incubating]
at org.apache.rocketmq.broker.BrokerController.shutdown(BrokerController.java:589) [rocketmq-broker-4.1.0-incubating.jar:4.1.0-incubating]
at org.apache.rocketmq.broker.BrokerStartup$1.run(BrokerStartup.java:218) [rocketmq-broker-4.1.0-incubating.jar:4.1.0-incubating]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_141]
2017-10-02 20:31:37 INFO ShutdownHook - Shutdown hook over, consuming total time(ms): 25
2017-10-02 20:31:45 INFO BrokerControllerScheduledThread1 - dispatch behind commit log 0 bytes
2017-10-02 20:31:45 INFO BrokerControllerScheduledThread1 - Slave fall behind master: 0 bytes
2017-10-02 20:31:45 INFO BrokerControllerScheduledThread1 - register broker to name server 172.22.1.38:9876 OK
2017-10-02 20:32:15 INFO BrokerControllerScheduledThread1 - register broker to name server 172.22.1.38:9876 OK
I suspect the broker is trying to connect to the name server, which isn't running initially, so it retries and eventually succeeds. However, later, when running clusterList, I only see one broker listed, which happens to be a slave (172.22.1.17) that has brokerId=2 in its configuration (although here it is listed as 0):
$ ./mqadmin clusterList -n 172.22.1.38:9876
#Cluster Name    #Broker Name  #BID  #Addr              #Version         #InTPS(LOAD)  #OutTPS(LOAD)  #PCWait(ms)  #Hour      #SPACE
mybrokercluster  mybroker      0     172.22.1.17:10911  V4_1_0_SNAPSHOT  0.00(0,0ms)   0.00(0,0ms)    0            418597.80  -1.0000
Moreover, when sending messages to the master, I get SLAVE_NOT_AVAILABLE. Why is that? Are the brokers configured properly? If so, why does clusterList report them incorrectly?
You should change the slave port: as you can see, 10911 is already used by another process (the master node), so each slave should use a different TCP port (e.g. 10921, 10931, and so on); see the config sketch below.
Tip: my cluster is deployed on one machine, so I changed the TCP ports and startup succeeded. If your master and slaves are deployed on different machines and startup still fails, check the RocketMQ error log for more information.
Note: when one master has more than one slave, each slave's brokerId must be different.
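For illustration, a sketch of what the first slave's broker.conf could look like when it shares a machine with the master (10921 is just an arbitrary free port, not something prescribed by RocketMQ):
listenPort=10921
brokerName=mybroker
brokerClusterName=mybrokercluster
brokerId=1
deleteWhen=04
fileReservedTime=48
brokerRole=SLAVE
flushDiskType=ASYNC_FLUSH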

Hadoop HiveServer2 worker threads - how to get info about them

I was getting
HiveServer2 Process - Thrift server getting "Connection refused"
and hiveserver2.log was flooded with messages similar to:
2016-01-06 12:19:57,617 WARN [Thread-8]: server.TThreadPoolServer (TThreadPoolServer.java:serve(184)) - Task has been rejected by ExecutorService 9 times till timedout, reason: java.util.concurrent.RejectedExecutionException: Task org.apache.thrift.server.TThreadPoolServer$WorkerProcess#753dd4d2 rejected from java.util.concurrent.ThreadPoolExecutor#4408d67c[Running, pool size = 500, active threads = 500, queued tasks = 0, completed tasks = 12772]
those 500 active threads being the default maximum set by hive.server2.thrift.http.max.worker.threads.
This was sorted by restarting HS2, but how can I gather more info about which YARN job a worker thread maps to, or how long it has been running?
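No answer is recorded here, but as a generic JVM-level starting point (an assumption on my part, not Hive-specific advice), a thread dump of the HiveServer2 process lists every pooled worker thread and its current stack, which at least shows whether threads are stuck and where:
$ jstack <hiveserver2-pid> > hs2-threads.txt
Here <hiveserver2-pid> is a placeholder for the HiveServer2 process id (for example as reported by jps).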

What causes Cometd to time out a session so soon?

I'm having trouble with a websocket client connecting to our Cometd server. I can see it is connecting ok and the handshake is good. But after 160 ms, Cometd thinks the session is timed out and removes it.
05:45:45.597 [00003] INFO - canHandshake() result true
05:45:45.597 [00003] INFO - Websocket session-added : 86db64o1yu7v5elbwdkgg595e
05:45:45.597 [00003] INFO - Registering websocket session : session {id=86db64o1yu7v5elbwdkgg595e,cid=0,appId=GMCN01,rid=90C3301D-0295-633
05:45:45.598 [00003] INFO - Registering websocket session : 86db64o1yu7v5elbwdkgg595e for registrationId 90C3301D-0295-6336-351C-45C8884DD
05:45:45.598 [00003] INFO - storing session info for client 0 # with host http://10.200.1.87:8081/websocket/transmit?sId=86db64o1yu7v5elbw
05:45:45.605 [00003] INFO - << {minimumVersion=1.0, supportedConnectionTypes=[websocket, callback-polling, long-polling], successful=true,
05:45:45.606 [00003] INFO - < {minimumVersion=1.0, supportedConnectionTypes=[websocket, callback-polling, long-polling], successful=true,
05:45:46.135 [00004] INFO - > {connectionType=websocket, channel=/meta/connect, clientId=86db64o1yu7v5elbwdkgg595e} 86db64o1yu7v5elbwdkgg
05:45:46.136 [00004] INFO - >> {connectionType=websocket, channel=/meta/connect, clientId=86db64o1yu7v5elbwdkgg595e}
05:45:46.136 [00004] INFO - << {successful=true, advice={interval=0, reconnect=retry, timeout=30000}, channel=/meta/connect}
05:45:46.136 [00004] INFO - < {successful=true, advice={interval=0, reconnect=retry, timeout=30000}, channel=/meta/connect}
05:45:46.296 [00007] INFO - Removing session 86db64o1yu7v5elbwdkgg595e - last connect 160 ms ago, timed out: true <--- THIS IS VERY ODD
05:45:46.296 [00007] INFO - Websocket session-removed : (t/o=true) 86db64o1yu7v5elbwdkgg595e
My own test client appears to work OK, but perhaps that is because I am closer to the servers and the latency is lower. The client that is failing is in another region, yet the logs do not indicate any latency. The 160 ms timeout seems way too small.
I'm using Java Cometd 2.6.0 embedded with jetty 8.1.12.
I suspect there is a timeout setting that is too small, but I'm not sure which one controls this, or whether there are other reasons behind the timeout.
Anyone else seen this or can explain why this is happening?
Embarrassingly, I found the ws.maxInterval was set to 25 instead of 25000. That was the issue.
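For reference, a minimal, assumed sketch (not taken from the post) of how such a value can end up configured when CometD 2.x runs embedded in Jetty 8: ws.maxInterval is passed as a servlet init-param and is interpreted in milliseconds, so "25" expires a websocket session 25 ms after its last /meta/connect, matching the log above, while "25000" gives the intended 25 seconds.
import org.cometd.server.CometdServlet;
import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.servlet.ServletContextHandler;
import org.eclipse.jetty.servlet.ServletHolder;

public class EmbeddedCometd {
    public static void main(String[] args) throws Exception {
        Server server = new Server(8081);
        ServletContextHandler context = new ServletContextHandler(server, "/");

        ServletHolder cometd = new ServletHolder(new CometdServlet());
        // Enable the websocket transport (class name as documented for CometD 2.x;
        // verify against the exact CometD version in use).
        cometd.setInitParameter("transports", "org.cometd.websocket.server.WebSocketTransport");
        cometd.setInitParameter("ws.maxInterval", "25000"); // milliseconds, not 25!
        cometd.setInitOrder(1);
        context.addServlet(cometd, "/cometd/*");

        server.start();
        server.join();
    }
}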
