Retry connection not attempted for span exporter - open-telemetry

We are using OpenTelemetry java api and OpenTelemetry auto configuration. If span exporter is down then i want do retry or catch below exception.
SEVERE: Failed to export spans. Server is UNAVAILABLE. Make sure your collector is running and reachable from this network. Full error message:UNAVAILABLE: io exception
can anyone help me?

You should have the collector instance running and reachable from the application host where the java instrumentation exporter is running. The OTLP exporter already has a retry mechanism for transient errors as dictated by the opentelemetry-specification. In your case the collector is not even reachable. These examples might help you https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/examples.

Related

Spring Cloud Stream Binder Kafka broker is not available

I have a Spring Config Server application which is working with Kafka in dev environment but in local environment I keep getting:
{host} could not be established. Broker may not be available.
Is there any way to start the application in local environment even if the broker is not available and not get the warning logs?
The desired behaviour: If broker is not available, application should not have explicit warnings and should continue working.
I have tried to set fatalIfBrokerNotAvailable to false and missingTopicsFatal to false but it does not have any effect.
Unfortunately, the kafka-clients (used by the binder) does not have any API to get the status of the broker(s). So there is no way to determine the state without trying to connect. Those logs are emitted by the kafka-clients code.

How to restart kubernetes pod when issue because of Rabbit MQ connectivity in logs

I have a Spring Boot 2 standalone application( not REST service) which connect to rabbit MQ and process message. The application is deployed in kubernetes. While it work great, but when Rabbit MQ remain down for little longer and in logs I see hearbeat exception 60sec and eventually connection get drop even if the rabbit mq comes up after certain time:
Automatic retry connection to broker by spring-rabbitmq
https://www.rabbitmq.com/heartbeats.html
While I try to manage above issue by increasing number of retry :https://stackoverflow.com/questions/45385119/how-configure-timeouts-retries-or-max-attempts-in-differents-queues-with-spring
but after expiry of retry still above issue comes.
How can I reboot/delete-recreate pod if I see above issue in logs from kubernetes.
The easiest way is to use actuator, which has a /actuator/health endpoint. (Note that the recent version also add /actuator/health/liveness and /actuator/health/readiness).
You can assign the endpoint to livenessProbe property of k8s. Then it will automatically restart when it is necessary. You can parameterize, when your app is down if necessary.
See the docs:
Kubernetes liveness probe
Spring actuator health

Tracing memory leak in Spring Azure qPID JMS code

Im trying to trace and identify root cause for memory leak in our very small and simple Spring Boot application.
It uses following:
- Spring Boot 2.2.4
- azure-servicebus-jms-spring-boot-starter 2.2.1
- MSSQL
Function:
The app only dispatches Azure ServiceBus queue and stores data and sends data to other destination.
It is a small app so it starts easily with 64 megs of memory, despite I give it up to 256 megs via Xmx option. Important note is the queue is being dispatched using Spring default transacted mode with dedicated JmsTransactionManager who is actually inner TM of ChainedTransactionManager along with dbTM and additional outbound JMS TM. Both JMS ConnectionFactory objects are created as CachingConnectionFactory.
Behavior:
Once the app is started it seems OK. There is no traffic so I can see in the log it is opening transactions and closing when checking the queue (jms:message-driven-channel-adapter).
However after some time when there is still no traffic, no single message was consumed the memory starts climbing as monitored via JVVM.
There is an error thrown:
--2020-04-24 11:17:01.443 - WARN 39892 --- [er.container-10] o.s.j.l.DefaultMessageListenerContainer : Setup of JMS message listener invoker failed for destination 'MY QUEUE NAME HERE' - trying to recover. Cause: Heuristic completion: outcome state is rolled back; nested exception is org.springframework.transaction.TransactionSystemException: Could not commit JMS transaction; nested exception is javax.jms.IllegalStateException: The Session was closed due to an unrecoverable error.
... and after several minutes it reaches MAX of the heap and since that moment it is failing on OutOfMemory error in the thread opening JMS connections.
--2020-04-24 11:20:04.564 - WARN 39892 --- [windows.net:-1]] i.n.u.concurrent.AbstractEventExecutor : A task raised an exception. Task: org.apache.qpid.jms.provider.amqp.AmqpProvider$$Lambda$871/0x000000080199f840#1ed8f2b9
-
java.lang.OutOfMemoryError: Java heap space
at java.base/java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:61)
at java.base/java.nio.ByteBuffer.allocate(ByteBuffer.java:348)
at org.apache.qpid.proton.engine.impl.ByteBufferUtils.newWriteableBuffer(ByteBufferUtils.java:99)
at org.apache.qpid.proton.engine.impl.TransportOutputAdaptor.init_buffers(TransportOutputAdaptor.java:108)
at org.apache.qpid.proton.engine.impl.TransportOutputAdaptor.pending(TransportOutputAdaptor.java:56)
at org.apache.qpid.proton.engine.impl.SaslImpl$SwitchingSaslTransportWrapper.pending(SaslImpl.java:842)
at org.apache.qpid.proton.engine.impl.HandshakeSniffingTransportWrapper.pending(HandshakeSniffingTransportWrapper.java:138)
at org.apache.qpid.proton.engine.impl.TransportImpl.pending(TransportImpl.java:1577)
at org.apache.qpid.proton.engine.impl.TransportImpl.getOutputBuffer(TransportImpl.java:1526)
at org.apache.qpid.jms.provider.amqp.AmqpProvider.pumpToProtonTransport(AmqpProvider.java:994)
at org.apache.qpid.jms.provider.amqp.AmqpProvider.pumpToProtonTransport(AmqpProvider.java:985)
at org.apache.qpid.jms.provider.amqp.AmqpProvider.lambda$close$3(AmqpProvider.java:351)
at org.apache.qpid.jms.provider.amqp.AmqpProvider$$Lambda$871/0x000000080199f840.run(Unknown Source)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518)
at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.base/java.lang.Thread.run(Thread.java:835)
HeapDumps:
I took couple of heap snapshots during this whole process and looked at what gets increased.
I can see suspicious amount of ConcurrentHashMap/String/Byte[] objects.
Has anyone some clue/hint what can be wrong in this setup and libs: Spring Boot, Apache qPid used under the hood of the Azure JMS dependency etc.? Many thanks.
Update #1
I have clear evidence that the problem is either in Spring or azure service bus starter library - not automatically qPid client used. I would say the library has the bug rather than Spring, just my guess. This is how the failing setup looks like:
There are two JMS destinations and one DB, each having its transaction manager
There is ChainedTransactionManager wrapping above three TMs.
Spring integration app which connects to Azure ServiceBus queue via jms:message-driven-channel-adapter and setting the transaction manager on this component (as created in point 2)
Start the app., no traffic on the queue is needed, after 10 minutes the app will crash due to OutOfMemoryError ... within those 10 minutes I watch log on debug level and only thing which is happening is opening and closing transactions using ChainedTransactionManager ... also as written in the comments another important condition is the third JMS TransactionManager ... with 2 TMs it works and is stable, with 3 it will crash ...
Additional research and steps taken identified the most likely root cause Spring CachingConnectionFactory class. Once I removed that and used only native types the problem went away and memory consumption profile is very different and healthy.
I have to say I created CachingConnectionFactory using standard constructor and didnt further configure the behavior. However these Spring defaults clearly lead to memory leak as per my experience.
In past I had memory leak with ActiveMq which had to be resolved by using CachingConnectionFactory and now I have memory leak with Azure ServiceBus when using CachingConnectionFactory .. strange :) In both cases I see that as bugs because memory management should be correct regardless caching involved or not.
Marking this as my answer.
Tested case: The problem occurs when receiving and sending message both with its own TM and both JMS connectionFactories are type CachedConnectionFactory. At the end I tested the app. with inbound connection factory of type CachedConnectionFactory and outbound just native type ... no memory leak as well.

Gemfire ClusterConfigurationNotAvailableException: Unable to retrieve cluster configuration from the locator

Gemfire cluster suddenly goes down because of ClusterConfigurationNotAvailableException: Unable to retrieve cluster configuration from the locator
We have a 2 locator and 2 server Gemfire cluster. We bootstrap Gemfire cache server using cache.xml and spring data gemfire xml using spring boot initializer.
We have a client spring boot service which connect to cluster.
Gemfire cluster suddenly goes down randomly due to ClusterConfigurationNotAvailableException: Unable to retrieve cluster configuration from the locator. What could be the reason for it?. After restart it works fine for a day or 2 without issues and then this issue comes. It impacts our High availability. Please help us fixing this.
org.apache.geode.GemFireConfigException: cluster configuration service not available
at org.apache.geode.internal.cache.GemFireCacheImpl.requestSharedConfiguration(GemFireCacheImpl.java:1025)
at org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1149)
at org.apache.geode.internal.cache.GemFireCacheImpl.basicCreate(GemFireCacheImpl.java:758)
at org.apache.geode.internal.cache.GemFireCacheImpl.create(GemFireCacheImpl.java:735)
at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2748)
at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2518)
at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:993)
at org.apache.geode.distributed.internal.DistributionManager$MyListener.membershipFailure(DistributionManager.java:4354)
at org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.uncleanShutdown(GMSMembershipManager.java:1556)
at org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.lambda$forceDisconnect$0(GMSMembershipManager.java:2593)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.geode.internal.config.ClusterConfigurationNotAvailableException: Unable to retrieve cluster configuration from the locator.
at org.apache.geode.internal.cache.ClusterConfigurationLoader.requestConfigurationFromLocators(ClusterConfigurationLoader.java:259)
at org.apache.geode.internal.cache.GemFireCacheImpl.requestSharedConfiguration(GemFireCacheImpl.java:988)
... 10 more
Expected behavior is high availability of Gemfire cluster
By default, whenever a GemFire server starts up (or automatically reconnects to the cluster after an unexpected shutdown), it tries to recover the Cluster Configuration from any locator, if it fails to do so then the member will just shutdown itself, which is what's happening looking at the stack trace attached (see the occurrence of org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect in the stack). I'd focus my analysis in why the member was disconnected in the first place, the subsequent failure to reconnect is just a consequence and not the root cause of the issue.
Either way, if you're just using individual xml files to configure your members and don't want to use the Cluster Configuration Service at all, then you can just start your locator with the property --enable-cluster-configuration=false (the default is true) and your servers with --use-cluster-configuration=false (the default is also true), this will prevent the servers from trying to start up using the cluster configuration from the locators.
Hope this helps. Cheers.

IBM WAS 9, MDB deployment fail the entire application

We have an IBM WebSphere AS 9.0.0.7 and when we want to deploy an application containing an MDB - which listens to a remote WebShpere MQ server - while the MQ server is down, then WAS reports an error
Caused by: com.ibm.mq.connector.DetailedResourceAdapterInternalException: MQJCA1011: Failed to allocate a JMS connection., error code: MQJCA1011 An internalerror caused an attempt to allocate a connection to fail. See the linked exception for details of the failure.
and stops the deployment, i.e. application does not start. Which is a big problem as it is a critical hub for other operations. We want to force WAS to start the application and retry the JMS connection later. Is it possible?
You can try setting custom property WAS_EndpointInitialState property to INACTIVE, see here and here, and also may want to look through here.
We've found a solution here: Configuring properties for the IBM MQ resource adapter
Trick was to set startupRetryCount and startupRetryInterval. When the MQ server is not available, the app starts, however it is reported as "Partial start". All other parts of the application seems to be running just fine.

Resources