Getting PartitionOfflineException with Spring + Geode + Kubernetes - spring-boot

Setup:
Spring Boot 2.7.8
Geode version: 1.14.4
I have one service (app) with a PARTITION_PROXY region and a second one (cache) with PARTITION_PERSISTENT_OVERFLOW. Both run as separate pods in Kubernetes.
Problem:
When I patch these services, Kubernetes keeps the old pods running and starts new ones in parallel. Once the new pods are ready, it terminates the old ones. This process runs simultaneously for both services. But somehow during this rollout the new service (app) still remembers the old service (cache), and I get this exception:
Exception:
Region /REGION_NAME bucket 89 has persistent data that is no longer online stored at these locations
Question:
How would you solve this in a way that lets me stick with parallel deployment? Is there a configuration that says "forget offline partitioned buckets"? Can I force the service (app) to forget the offline bucket?
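For reference, Apache Geode does expose gfsh commands to inspect and explicitly revoke persistent disk stores that the cluster believes are offline; a hedged sketch (the disk-store ID is a placeholder, and revoking tells the cluster to stop waiting for that member's copy of the data, so it is only safe when that data is truly gone):

gfsh> show missing-disk-stores
gfsh> revoke missing-disk-store --id=<missing-disk-store-id>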

Related

[Ignite] Problem upgrading Ignite 2.7 Services to 2.10... Services always getting canceled?

I completed the code upgrade and removed all deprecated calls from my service code.
This is a Java Spring application and all services come up (as reported in the log files) when the Cluster (of 5 nodes) starts up.
However, when I try to get a serviceProxy to each service, all services get cancelled because ServiceDeploymentTask attempts to redeploy them. The redeploy fails, cancels all of the services, and fails to restart them. This can be demonstrated with both a thick and a thin client.
Why is Ignite trying to redeploy the services? (And why don't the services restart?)
Is there something I'm missing in the move to Ignite 2.10?
Finally, why does a Java Thin Client create a NODE_JOIN event?
Thanks in advance.
Greg

Gemfire ClusterConfigurationNotAvailableException: Unable to retrieve cluster configuration from the locator

Gemfire cluster suddenly goes down because of ClusterConfigurationNotAvailableException: Unable to retrieve cluster configuration from the locator
We have a 2-locator, 2-server GemFire cluster. We bootstrap the GemFire cache server using cache.xml and Spring Data GemFire XML via the Spring Boot initializer.
We have a client Spring Boot service which connects to the cluster.
The GemFire cluster suddenly goes down at random due to ClusterConfigurationNotAvailableException: Unable to retrieve cluster configuration from the locator. What could be the reason for it? After a restart it works fine for a day or two without issues, and then the problem returns. It impacts our high availability. Please help us fix this.
org.apache.geode.GemFireConfigException: cluster configuration service not available
at org.apache.geode.internal.cache.GemFireCacheImpl.requestSharedConfiguration(GemFireCacheImpl.java:1025)
at org.apache.geode.internal.cache.GemFireCacheImpl.initialize(GemFireCacheImpl.java:1149)
at org.apache.geode.internal.cache.GemFireCacheImpl.basicCreate(GemFireCacheImpl.java:758)
at org.apache.geode.internal.cache.GemFireCacheImpl.create(GemFireCacheImpl.java:735)
at org.apache.geode.distributed.internal.InternalDistributedSystem.reconnect(InternalDistributedSystem.java:2748)
at org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect(InternalDistributedSystem.java:2518)
at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:993)
at org.apache.geode.distributed.internal.DistributionManager$MyListener.membershipFailure(DistributionManager.java:4354)
at org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.uncleanShutdown(GMSMembershipManager.java:1556)
at org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.lambda$forceDisconnect$0(GMSMembershipManager.java:2593)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.geode.internal.config.ClusterConfigurationNotAvailableException: Unable to retrieve cluster configuration from the locator.
at org.apache.geode.internal.cache.ClusterConfigurationLoader.requestConfigurationFromLocators(ClusterConfigurationLoader.java:259)
at org.apache.geode.internal.cache.GemFireCacheImpl.requestSharedConfiguration(GemFireCacheImpl.java:988)
... 10 more
The expected behavior is high availability of the GemFire cluster.
By default, whenever a GemFire server starts up (or automatically reconnects to the cluster after an unexpected shutdown), it tries to recover the cluster configuration from any locator; if it fails to do so, the member simply shuts itself down, which is exactly what's happening in the attached stack trace (note the occurrence of org.apache.geode.distributed.internal.InternalDistributedSystem.tryReconnect). I'd focus my analysis on why the member was disconnected in the first place; the subsequent failure to reconnect is just a consequence, not the root cause of the issue.
Either way, if you're just using individual XML files to configure your members and don't want to use the Cluster Configuration Service at all, you can start your locators with the property --enable-cluster-configuration=false (the default is true) and your servers with --use-cluster-configuration=false (the default is also true). This will prevent the servers from trying to start up using the cluster configuration from the locators.
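As a concrete illustration (the member names are placeholders), the corresponding gfsh start commands would look roughly like this:

gfsh> start locator --name=locator1 --enable-cluster-configuration=false
gfsh> start server --name=server1 --use-cluster-configuration=false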
Hope this helps. Cheers.

How to check if docker Cassandra instance is ready to take connections

I have two docker instances that I launch with docker-compose.
One holds a Cassandra instance
One holds a Spring Boot application that tries to connect to that Cassandra instance.
However, the Spring Boot application will always fail, because it's trying to connect to a Cassandra instance that is not yet ready to take connections.
I have tried:
Using restart:always in Docker-compose
This still doesn't always work, because Cassandra might be up 'enough' to no longer crash the Spring Boot application, but not up 'enough' to have successfully created the table/column family. On top of that, this is a very hacky solution.
Using healthcheck
It seems like healthcheck in compose doesn't have restart capabilities
Using a bash script as entrypoint
In the hope that I could use netstat, ping, ... whatever to determine the readiness state of Cassandra.
Right now the only thing that really works is using that same bash script to sleep for x seconds and then start the jar. This is even more hacky...
Does anyone have an idea on how to solve this?
Thanks!
Does the Spring Boot service defined in the docker-compose.yml declare depends_on the Cassandra service? If it does, and the dependency is combined with a healthcheck (condition: service_healthy), the service is started only once the Cassandra service is ready.
https://docs.docker.com/compose/compose-file/#depends_on
Take a look at this github repository, to find a healthcheck for the cassandra service.
https://github.com/docker-library/healthcheck
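Combining both suggestions, a minimal docker-compose sketch could look like the following (service names, the image tag, and the cqlsh probe are assumptions, not taken from the question; note that depends_on with a condition requires compose file format 2.1+ or the newer Compose Spec):

version: "2.1"
services:
  cassandra:
    image: cassandra:3.11
    healthcheck:
      # succeeds only once Cassandra actually accepts CQL connections
      test: ["CMD-SHELL", "cqlsh -e 'DESCRIBE KEYSPACES'"]
      interval: 15s
      timeout: 10s
      retries: 10
  app:
    build: .
    depends_on:
      cassandra:
        condition: service_healthy  # wait for the healthcheck above to pass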
CONCLUSION
After some discussion we found out that docker-compose does not provide functionality for waiting until services are up and healthy, the way Kubernetes and OpenShift do (see the comments below). The recommended workaround is a wrapper script (docker-entrypoint.sh) that waits for the depending service to come up, but that requires binaries the actual service shouldn't need, such as the Cassandra client binary. Additionally, the service depending on Cassandra could never come up if Cassandra doesn't, which shouldn't happen.
A main principle of microservices is that they have to be resilient to failures and are not supposed to die, or fail to come up, just because a depending service is currently unavailable or unexpectedly disappears. "Unexpected" is actually the wrong word in this context, because you should always expect such issues in a distributed environment; even with docker-compose you will face issues like the ones discussed in this topic. Therefore the microservice should be implemented so that it retries the connection after startup or after an unexpected disappearance.
The following link points to a tutorial which helped to integrate Cassandra properly into a Spring Boot application. It shows how to implement retrieval of a Cassandra connection with retry behavior, so the service is resilient to a not-yet-available Cassandra database and will no longer fail to start. Hope this helps others as well.
https://dzone.com/articles/containerising-a-spring-data-cassandra-application
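A minimal sketch of such a retry loop, assuming the DataStax Java driver 4.x (the class name, attempt count, and backoff are illustrative, not taken from the linked tutorial):

import java.net.InetSocketAddress;
import com.datastax.oss.driver.api.core.CqlSession;

public final class ResilientCassandraConnector {

    private static final int MAX_ATTEMPTS = 30;        // illustrative
    private static final long BACKOFF_MILLIS = 2_000L; // illustrative

    // Keep trying to open a session until Cassandra accepts connections.
    public static CqlSession connectWithRetry(String host, int port) throws InterruptedException {
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return CqlSession.builder()
                        .addContactPoint(new InetSocketAddress(host, port))
                        .withLocalDatacenter("datacenter1") // default DC of the official image
                        .build();
            } catch (RuntimeException e) { // e.g. AllNodesFailedException while Cassandra boots
                lastFailure = e;
                Thread.sleep(BACKOFF_MILLIS);
            }
        }
        throw lastFailure;
    }
}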

EC2 check in Auto Scaling group

I have an ASG with a min=2, max=4 configuration. In the boot-up script of each EC2 instance, I have a series of yum installs and the startup of 2 Spring Boot applications. When the load increases and the ASG spins up a new EC2 instance, it performs all of these steps in the boot-up script.
Could anybody suggest a good method to validate whether all these yum installs have been successful, and also whether the 2 Spring Boot applications are currently running? If there is any problem with these, I don't want the EC2 instance to be attached to the ELB.
I used cfn-signal to send information back to CloudFormation after performing my application- and infrastructure-level checks.
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-signal.html
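A hedged sketch of how that could look in the instance's user data, paired with a CreationPolicy on the ASG so instances that never signal success are rolled back (stack/resource names, ports, jar paths, and the health probe are all placeholders):

#!/bin/bash
STACK=MY_STACK; RESOURCE=MyAutoScalingGroup; REGION=us-east-1

# Signal the given exit code back to CloudFormation, then exit with it.
signal_and_exit() {
  /opt/aws/bin/cfn-signal -e "$1" --stack "$STACK" --resource "$RESOURCE" --region "$REGION"
  exit "$1"
}

# Fail fast if any install step breaks.
yum install -y java-11-amazon-corretto || signal_and_exit 1

# Start both Spring Boot apps (paths are placeholders).
nohup java -jar /opt/app/service-a.jar > /var/log/service-a.log 2>&1 &
nohup java -jar /opt/app/service-b.jar > /var/log/service-b.log 2>&1 &

# Wait until both apps answer on their (assumed) actuator health endpoints.
for port in 8080 8081; do
  ok=0
  for i in $(seq 1 30); do
    curl -sf "http://localhost:${port}/actuator/health" && { ok=1; break; }
    sleep 5
  done
  [ "$ok" -eq 1 ] || signal_and_exit 1
done

signal_and_exit 0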

How to make sure there will be a fixed DB server across multiple deployments in Cloud Foundry?

I am a newbie with CF microservices and I am trying to deploy a service multiple times. As far as I understand, each time I deploy into a space the application gets a different database server and schema. Is there a way to tell Cloud Foundry to use one fixed DB server every time, across multiple deployments in one environment?
The keyword for your case is "service instance".
You can create a service instance of a database server within the environment, specific to your application, and bind it via the application manifest.
e.g. (the example uses a RabbitMQ plan, but the same applies to any database service):
cf create-service rabbitmq small-plan myapplication-rabbitmq-instance
As long as you have a binding to myapplication-rabbitmq-instance in your application manifest, it will be preserved/stay the same between application deployments within this space.
e.g. in your application manifest:
---
...
services:
- myapplication-rabbitmq-instance
More on https://docs.cloudfoundry.org/devguide/services/
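For completeness, a typical redeploy flow could then look like this (the application name is a placeholder; the instance survives pushes because it lives in the space, not in the app):

cf push myapplication   # redeploys the app; the manifest re-binds the same instance
cf services             # still lists myapplication-rabbitmq-instance as bound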
