Infinispan server stops responding (near cache configured, clustered environment, client-server configuration) - caching

We have been facing an Infinispan server issue while testing our application with 100 concurrent active sessions on each cluster node of our application.
Infinispan server configuration:
Two domains with two nodes
Near cache is configured, INVALIDATED type, with no max-entries limit
Our application has two cluster nodes as well.
Infinispan version: 8.2.3
Error we are seeing in the Infinispan server:
2017-06-08 18:49:18,464 DEBUG [org.infinispan.server.hotrod.HotRodExceptionHandler] (HotRodServerWorker-8-7) Exception caught: java.io.IOException: An existing connection was forcibly closed by the remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:898)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:485)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:399)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:371)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
After getting this error, we tried disabling the near cache. That configuration fixes the problem, but it also results in very slow performance, which is not acceptable since we are using Infinispan server purely for performance reasons.
The template we use to create the cache container in Infinispan:
<distributed-cache-configuration name="OurTempateName" owners="2" segments="20" mode="SYNC" remote-timeout="30000" start="EAGER">
    <locking striping="false" acquire-timeout="30000" concurrency-level="1000"/>
    <transaction mode="NONE"/>
    <security/>
</distributed-cache-configuration>
With a client-server Infinispan setup, can anyone suggest a configuration that can overcome this issue?

If you see that kind of message, it might indicate a problem on the client side. Did you look into the client logs?
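If the unbounded near cache turns out to be the culprit, a middle ground between "unlimited" and "disabled" is to cap it on the client. A minimal sketch, assuming the Java Hot Rod client that ships with Infinispan 8.x; host names and the entry limit are placeholders:

import org.infinispan.client.hotrod.RemoteCacheManager;
import org.infinispan.client.hotrod.configuration.ConfigurationBuilder;
import org.infinispan.client.hotrod.configuration.NearCacheMode;

public final class HotRodClientFactory {

    public static RemoteCacheManager create() {
        ConfigurationBuilder builder = new ConfigurationBuilder();
        builder.addServer().host("infinispan-node-1").port(11222); // placeholder host
        builder.addServer().host("infinispan-node-2").port(11222); // placeholder host
        // Keep the INVALIDATED near cache, but bound it so the client heap
        // (and the invalidation traffic) cannot grow without limit.
        builder.nearCache()
               .mode(NearCacheMode.INVALIDATED)
               .maxEntries(10_000); // illustrative limit, tune for your session size
        return new RemoteCacheManager(builder.build());
    }
}

Bounding the near cache keeps most of the read-performance benefit while limiting how much state each client holds and invalidates.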

Related

DNS Resolution for Redis, Elasticsearch and Postgres hosts not working after introducing Spring Webflux

Everything was working until I introduced WebFlux into our Spring Boot application running on an external Tomcat inside a Docker container. This was needed when I wrote an Elasticsearch repository using ReactiveCrudRepository to implement asynchronous save/fetch operations. Now, our Spring Boot apps hosted on Docker are not able to connect to any of the datastores.
For Redis the exception is -
Caused by: java.net.UnknownHostException: Failed to resolve 'redis1' after 2 queries
For Elasticsearch -
Caused by: org.springframework.data.elasticsearch.client.NoReachableHostException: Host 'es_host:9200' not reachable. Cluster state is offline.
The data nodes are all hosted on separate instances.
After researching a bit, I came across several issues with netty's DNS resolver, but there is no standard fix mentioned anywhere. Please advise.
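For the Redis side, one workaround that is sometimes suggested when netty's async DNS resolver misbehaves in containers is to fall back to the JVM's blocking resolver. A minimal sketch, assuming the app uses Lettuce as the Redis driver under Spring Boot; the class and bean names are illustrative:

import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;
import io.lettuce.core.resource.DnsResolvers;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class RedisDnsConfig {

    // Spring Boot's Lettuce auto-configuration uses a user-supplied
    // ClientResources bean instead of creating its own.
    @Bean(destroyMethod = "shutdown")
    public ClientResources lettuceClientResources() {
        return DefaultClientResources.builder()
                // Resolve 'redis1' via plain InetAddress lookups, which honour the
                // container's /etc/hosts and resolv.conf, instead of netty's resolver.
                .dnsResolver(DnsResolvers.JVM_DEFAULT)
                .build();
    }
}

Whether this helps depends on which resolver your Lettuce and Elasticsearch client versions actually use, so treat it as a diagnostic step rather than a guaranteed fix.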

HConnection Closed - JDBC - Connection-Pool

We're using JDBC connections through HikariCP to connect to Apache Phoenix, and we're facing "HConnection closed" issues. It's caused by stale connections in the pool, and the pool does not get cleared until we restart the applications.
Has anyone faced something like the above (maybe not with the same DB)?
Is there any recommended approach to connect to Phoenix from Spring applications?
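One mitigation that is often tried for stale pooled connections is to cap their lifetime and validate them on borrow, so Hikari retires connections before the server side drops them. A minimal sketch with HikariCP; the JDBC URL, pool size and test query are placeholders, and whether an explicit test query is needed depends on your Phoenix driver version:

import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public final class PhoenixPool {

    public static HikariDataSource create() {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:phoenix:zk-host:2181"); // placeholder URL
        config.setMaximumPoolSize(10);
        // Retire connections well before HBase/Phoenix closes them on the server,
        // so the pool never hands out an already-dead HConnection.
        config.setMaxLifetime(600_000); // 10 minutes
        config.setIdleTimeout(300_000); // 5 minutes
        // If the driver does not implement JDBC4 isValid(), give Hikari a cheap
        // explicit liveness check; SYSTEM.CATALOG always exists in Phoenix.
        config.setConnectionTestQuery("SELECT 1 FROM SYSTEM.CATALOG LIMIT 1");
        return new HikariDataSource(config);
    }
}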

Hikari CP (Spring Boot) Connection Recovery Problem After DB Failure

We have several microservices built on Spring Boot (2.2.4) and HikariCP (3.4.2) with PostgreSQL.
Recently we faced a DB failure of around 30 seconds. After the connections were lost, some of the containers failed to recover their connections, while others with exactly the same configuration and application were just fine. Unfortunately we don't have logs indicating the pool sizes (idle, active, waiting) at the time of the error.
We received some broken-pipe and connection-lost errors on all containers when the connections were lost. After the DB recovered, we got the following exception only on the few (2 of 18) containers that failed to recover.
StackTrace:
at org.springframework.orm.jpa.JpaTransactionManager.doBegin(JpaTransactionManager.java:402)
... 20 more
Caused by: java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
at com.zaxxer.hikari.pool.HikariPool.createTimeoutException(HikariPool.java:689)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:196)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:161)
at com.zaxxer.hikari.HikariDataSource.getConnection(HikariDataSource.java:128)
at org.hibernate.engine.jdbc.connections.internal.DatasourceConnectionProviderImpl.getConnection(DatasourceConnectionProviderImpl.java:122)
at org.hibernate.internal.NonContextualJdbcConnectionAccess.obtainConnection(NonContextualJdbcConnectionAccess.java:38)
at org.hibernate.resource.jdbc.internal.LogicalConnectionManagedImpl.acquireConnectionIfNeeded(LogicalConnectionManagedImpl.java:104)
... 30 more
Caused by: org.postgresql.util.PSQLException: This connection has been closed.
at org.postgresql.jdbc.PgConnection.checkClosed(PgConnection.java:857)
at org.postgresql.jdbc.PgConnection.setNetworkTimeout(PgConnection.java:1639)
at com.zaxxer.hikari.pool.PoolBase.setNetworkTimeout(PoolBase.java:556)
at com.zaxxer.hikari.pool.PoolBase.isConnectionAlive(PoolBase.java:169)
at com.zaxxer.hikari.pool.HikariPool.getConnection(HikariPool.java:185)
... 35 more
We have seen similar situations (on the same system) and tests where the DB fails over and the connections are restored by Hikari without any problem. But in this case one of the containers recovered by itself after 1 hour, and the others only after a restart.
As far as we know, Hikari does not hand out broken connections from the pool and evicts them once they are marked as broken or closed. Any ideas what might have happened to those containers while the others (exactly the same image and configuration) were just fine?
PS: we cannot reproduce the problem.
Hikari configuration:
allowPoolSuspension.............false
connectionInitSql...............none
connectionTestQuery.............none
connectionTimeout...............30000
idleTimeout.....................600000
initializationFailTimeout.......1
isolateInternalQueries..........false
leakDetectionThreshold..........0
maxLifetime.....................1800000
maximumPoolSize.................15
minimumIdle.....................15
validationTimeout...............5000
You can configure something like:
connectionTestQuery=select 1
This way Hikari tests that the connection is still alive before handing it over to Hibernate.
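If you ever need to force recovery without restarting a container, Hikari also exposes a pool MXBean that can evict the current connections. A minimal sketch, assuming the DataSource is Hikari-backed; when and how you trigger it (health check, JMX, admin endpoint) is up to you:

import javax.sql.DataSource;
import com.zaxxer.hikari.HikariDataSource;

public class PoolRecovery {

    private final DataSource dataSource;

    public PoolRecovery(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Call this once the DB is reachable again to drop suspect connections.
    public void evictAll() {
        if (dataSource instanceof HikariDataSource) {
            // Marks every current connection for eviction; each one is closed as it
            // returns to the pool and replaced with a freshly opened connection.
            ((HikariDataSource) dataSource).getHikariPoolMXBean().softEvictConnections();
        }
    }
}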

Gemfire ClusterConfigurationNotAvailableException: Unable to retrieve cluster configuration from the locator

Our GemFire cluster suddenly goes down because of ClusterConfigurationNotAvailableException: Unable to retrieve cluster configuration from the locator.
We have a GemFire cluster with 2 locators and 2 servers. We bootstrap the GemFire cache servers using cache.xml and Spring Data GemFire XML via Spring Boot initializers.
We have a client Spring Boot service which connects to the cluster.
The cluster goes down randomly with this ClusterConfigurationNotAvailableException. What could be the reason for it? After a restart it works fine for a day or two without issues, and then the problem comes back. It impacts our high availability. Please help us fix this.
org.apache.geode.GemFireConfigException: cluster configuration service not available
at org.apache.geode.internal.cache.GemFireCacheImpl.requestSharedConfiguration(GemFireCacheImpl.java:1025)
at org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1149)
at org.apache.geode.internal.cache.GemFireCacheImpl.basicCreate(GemFireCacheImpl.java:758)
at org.apache.geode.internal.cache.GemFireCacheImpl.create(GemFireCacheImpl.java:735)
at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2748)
at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2518)
at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:993)
at org.apache.geode.distributed.internal.DistributionManager$MyListener.membershipFailure(DistributionManager.java:4354)
at org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.uncleanShutdown(GMSMembershipManager.java:1556)
at org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.lambda$forceDisconnect$0(GMSMembershipManager.java:2593)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.geode.internal.config.ClusterConfigurationNotAvailableException: Unable to retrieve cluster configuration from the locator.
at org.apache.geode.internal.cache.ClusterConfigurationLoader.requestConfigurationFromLocators(ClusterConfigurationLoader.java:259)
at org.apache.geode.internal.cache.GemFireCacheImpl.requestSharedConfiguration(GemFireCacheImpl.java:988)
... 10 more
The expected behavior is high availability of the GemFire cluster.
By default, whenever a GemFire server starts up (or automatically reconnects to the cluster after an unexpected shutdown), it tries to recover the cluster configuration from any locator. If it fails to do so, the member simply shuts itself down, which is what's happening in the attached stack trace (note the occurrence of org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect). I'd focus the analysis on why the member was disconnected in the first place; the subsequent failure to reconnect is just a consequence, not the root cause of the issue.
Either way, if you're just using individual XML files to configure your members and don't want to use the Cluster Configuration Service at all, then you can start your locators with the property --enable-cluster-configuration=false (the default is true) and your servers with --use-cluster-configuration=false (the default is also true). This prevents the servers from trying to start up using the cluster configuration from the locators.
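On the server side, the same property can also be set programmatically if you create the cache through the Geode API. A minimal sketch; the member name and locator addresses are placeholders, and if Spring Data GemFire builds the cache for you the property would instead go into its GemFire properties:

import java.util.Properties;
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;

public final class ServerBootstrap {

    public static Cache startCache() {
        Properties props = new Properties();
        props.setProperty("name", "cache-server-1");                      // placeholder member name
        props.setProperty("locators", "locator1[10334],locator2[10334]"); // placeholder locators
        // The servers are fully configured via cache.xml / Spring XML, so skip
        // the Cluster Configuration Service lookup on startup and reconnect.
        props.setProperty("use-cluster-configuration", "false");
        return new CacheFactory(props).create();
    }
}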
Hope this helps. Cheers.

Database failure shuts down ActiveMQ windows service using JDBC persistence

I have an ActiveMQ broker running as a Windows service. It's using a jdbcPersistenceAdapter with an Oracle data source and Oracle's Universal Connection Pool (UCP).
When the database is down (due to network problems or scheduled maintenance), the ActiveMQ Windows service shuts down completely. This, of course, makes the broker unavailable even after the database is restored.
I have tried connection validation in UCP, DBCP with connection validation, and even a MySQL data source, without any success. The service shuts down within 30 seconds of the database failure (I believe this is because the default cleanupInterval is 30 seconds).
Is there a way to prevent the Windows service from shutting down and make it wait for the database to become available again?
Any help is greatly appreciated.
Here is my current configuration from activemq.xml:
<persistenceAdapter>
    <jdbcPersistenceAdapter dataSource="#oracle-ds"/>
</persistenceAdapter>

<bean id="oracle-ds" class="oracle.ucp.jdbc.PoolDataSourceFactory"
      factory-method="getPoolDataSource" p:URL="jdbc:oracle:thin:#localhost:1521:amq"
      p:connectionFactoryClassName="oracle.jdbc.pool.OracleDataSource"
      p:validateConnectionOnBorrow="true" p:user="appuser" p:password="userspassword" />
Generally, you should use JDBC master/slave to support failover to another broker when the database becomes unavailable...
see http://activemq.apache.org/jdbc-master-slave.html
That said, there is a known issue with JDBC master/slave failover that was fixed in 5.6.0...
see https://issues.apache.org/jira/browse/AMQ-1958
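With a master/slave pair, clients need the failover transport so they migrate to whichever broker currently holds the database lock. A minimal sketch of the client side; broker host names are placeholders:

import javax.jms.Connection;
import org.apache.activemq.ActiveMQConnectionFactory;

public final class FailoverClient {

    public static Connection connect() throws Exception {
        // Only the broker holding the JDBC lock accepts connections; the failover
        // transport keeps retrying and moves clients to the broker that is master.
        ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(
                "failover:(tcp://broker-a:61616,tcp://broker-b:61616)?randomize=false");
        return factory.createConnection();
    }
}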
