Understanding Spring Cloud Eureka Server self preservation and renew threshold

I am new to developing microservices, although I have been researching it for a while, reading both Spring's docs and Netflix's.
I have started a simple project available on GitHub. It is basically a Eureka server (Archimedes) and three Eureka client microservices (one public API and two private). Check the GitHub readme for a detailed description.
The point is that, when everything is running, I would like the Eureka server to notice when one of the private microservices is killed and remove it from the registry.
I found this question on Stackoverflow, where the solution is to set enableSelfPreservation: false in the Eureka server config. With that, the killed service disappears after a while, as expected.
However I can see the following message:
THE SELF PRESERVATION MODE IS TURNED OFF.THIS MAY NOT PROTECT INSTANCE
EXPIRY IN CASE OF NETWORK/OTHER PROBLEMS.
1. What is the purpose of self-preservation? The doc states that with self-preservation on "clients can get the instances that do not exist anymore". So when is it advisable to have it on/off?
Furthermore, when self-preservation is on, you may get the following warning in the Eureka Server console:
EMERGENCY! EUREKA MAY BE INCORRECTLY CLAIMING INSTANCES ARE UP WHEN
THEY'RE NOT. RENEWALS ARE LESSER THAN THRESHOLD AND HENCE THE
INSTANCES ARE NOT BEING EXPIRED JUST TO BE SAFE.
Now, moving on to the Spring Eureka console, which shows:
Lease expiration enabled true/false
Renews threshold 5
Renews (last min) 4
I have come across a weird behaviour of the threshold count: when I start the Eureka Server alone, the threshold is 1.
2. I have a single Eureka server and it is configured with registerWithEureka: false to prevent it from registering on another server. Then why does it show up in the threshold count?
3. For every client I start the threshold count increases by +2. I guess it is because they send 2 renew messages per min, am I right?
4. The Eureka server never sends a renew, so the renews in the last minute are always below the threshold. Is this normal?
renew threshold: 5
renews last min: (client1) 2 + (client2) 2 -> 4
Server cfg:
server:
  port: ${PORT:8761}
eureka:
  instance:
    hostname: localhost
  client:
    registerWithEureka: false
    fetchRegistry: false
    serviceUrl:
      defaultZone: http://${eureka.instance.hostname}:${server.port}/eureka/
  server:
    enableSelfPreservation: false
    # waitTimeInMsWhenSyncEmpty: 0
Client 1 cfg:
spring:
  application:
    name: random-image-microservice
server:
  port: 9999
eureka:
  client:
    serviceUrl:
      defaultZone: http://localhost:8761/eureka/
    healthcheck:
      enabled: true

I had the same question as #codependent, googled a lot and did some experimenting, so here I'll contribute some knowledge about how the Eureka server and instances work.
Every instance needs to renew its lease with the Eureka server once every 30 seconds, which can be defined via eureka.instance.leaseRenewalIntervalInSeconds.
Renews (last min): how many renews the Eureka server received from instances in the last minute.
Renews threshold: the number of renews per minute the Eureka server expects to receive from instances.
For example, if registerWithEureka is set to false, eureka.instance.leaseRenewalIntervalInSeconds is set to 30 and two Eureka instances are running: the two instances send 4 renews to the Eureka server per minute, the server's minimal threshold is 1 (written in the code), so the threshold is 5 (this number is multiplied by the factor eureka.server.renewalPercentThreshold, discussed later).
SELF PRESERVATION MODE: if Renews (last min) is less than Renews threshold, self-preservation mode is activated.
So in the example above, self-preservation mode is activated, because the threshold is 5 but the Eureka server can only receive 4 renews/min.
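As a minimal sketch of where that renewal interval is configured (this is the standard Spring Cloud Netflix client-side property named above; 30 is simply the default value):
eureka:
  instance:
    # how often this instance sends a renew/heartbeat to the Eureka server (default: 30 s)
    leaseRenewalIntervalInSeconds: 30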
Question 1:
Self-preservation mode is designed to cope with poor network connectivity. Suppose connectivity between Eureka instances A and B is fine, but B fails to renew its lease with the Eureka server for a short period due to connectivity hiccups; at that point the Eureka server can't simply kick out instance B. If it did, instance A would no longer get B from the registry even though B is available. So that is the purpose of self-preservation mode, and it's better to turn it on.
Question 2:
The minimal threshold of 1 is written in the code. registerWithEureka is set to false, so no Eureka instance registers and the threshold stays at 1.
In a production environment we generally deploy two Eureka servers and registerWithEureka is set to true. The threshold will then be 2, the Eureka servers will renew their leases with each other twice per minute, and RENEWALS ARE LESSER THAN THRESHOLD won't be a problem.
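A minimal sketch of such a two-server setup, assuming hostnames peer1 and peer2 (illustrative names, not from the question); this file would go on peer1, with peer2 mirroring it and the hostnames swapped:
eureka:
  instance:
    hostname: peer1
  client:
    # register with and fetch from the other server so the pair replicate each other
    registerWithEureka: true
    fetchRegistry: true
    serviceUrl:
      defaultZone: http://peer2:8761/eureka/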
Question 3:
Yes, you are right. eureka.instance.leaseRenewalIntervalInSeconds determines how many renews each instance sends to the server per minute, and the expected total is multiplied by the factor eureka.server.renewalPercentThreshold mentioned above, whose default value is 0.85.
Question 4:
Yes, it's normal, because the threshold's initial value is set to 1. So if registerWithEureka is set to false, the renews count will always be below the threshold.
I have two suggestions for this:
Deploy two Eureka servers and enable registerWithEureka.
If you just want to deploy in a demo/dev environment, you can set eureka.server.renewalPercentThreshold to 0.49, so when you start up a Eureka server alone the threshold will be 0 (a sketch of that setting follows).
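A sketch of that demo/dev-only tweak in YAML form (eureka.server.renewal-percent-threshold is the relaxed-binding spelling of the same property; 0.49 is the value suggested above):
eureka:
  server:
    # demo/dev only: with a single standalone server this drops the threshold to 0
    renewal-percent-threshold: 0.49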

I've created a blog post with the details of Eureka here, which fills in some details missing from the Spring docs and the Netflix blog. It is the result of several days of debugging and digging through source code. I understand it's preferable to copy and paste rather than link to an external URL, but the content is too big for an SO answer.

You can try setting the renewal threshold limit in your Eureka server properties. If you have around 3 to 4 microservices to register with Eureka, then you can set it to this:
eureka.server.renewalPercentThreshold=0.33

server:
  enableSelfPreservation: false
Regardless of this setting, Eureka expects service instances to register themselves and to keep sending registration renewal requests every 30 s. Normally, if Eureka doesn't receive a renewal from a service for three renewal periods (or 90 s), it deregisters that instance.
If this is set to true (the default), then when renewals drop below the expected threshold Eureka assumes there's a network problem, enters self-preservation mode, and won't deregister service instances. If set to false, instances that stop renewing are expired as usual.
Even if you decide to disable self-preservation mode for development, you should leave it enabled when you go into production.
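One hedged way to keep that split is a profile-specific override; the sketch below uses multi-document YAML with the older spring.profiles activation key (newer Spring Boot versions use spring.config.activate.on-profile instead):
eureka:
  server:
    enable-self-preservation: true   # production: leave the protection on

---
spring:
  profiles: dev
eureka:
  server:
    enable-self-preservation: false  # dev only: expire dead instances promptly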

Related

Azure VM is closing the idle sessions in my postgres database

I am creating a VM in Azure to run a Postgres instance in Docker and connect to it from my local Spring backend. What happens is that once connected to the DB, after some time of inactivity, when trying to make a request I get the following: "HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection#f162126 (This connection has been closed.). Possibly consider using a shorter maxLifetime value." Digging around, I realized that my VM seems to behave as if it closes any connection that becomes inactive, causing the above error. The curious thing is that the sessions are not closed, as you can see in the following image: even after shutting down my backend the sessions remain, and the only way to delete them is to restart the container in which the DB is hosted.
I have tried to reproduce this behavior locally but it never happens: even if I leave the backend idle for an hour, a request to the DB works as if nothing happened. It only happens with my VM in Azure.
I want to clarify that the sessions shown in the attached image no longer work, i.e. if I try to query the DB from Spring, the error I mentioned appears and Hikari automatically creates new sessions for its pool. I can reproduce this behavior until I reach 100 sessions, which after a while stop working again and which Spring never closes when shutting down the backend.
HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection#f162126 (This connection has been closed.). Possibly consider using a shorter maxLifetime value.
This error is thrown by the isConnectionDead method, which checks whether a connection is still alive and usable; it raises the above error if the connection has already been closed.
You can adjust your maxLifetime setting to resolve this problem. 30000 ms (30 seconds) is the shortest permitted value, and 1800000 ms (30 minutes) is the default.
A connection in the pool can only last for a certain amount of time, which is controlled by the maxLifetime attribute. The value of this property should be several seconds below any connection time limit imposed by the infrastructure or the database.
Reference: HikariCP configuration on GitHub.
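As a sketch of where that setting lives in a Spring Boot application.yml (the 240000 ms value is only an example, chosen to sit below the 4-minute idle timeout discussed in the next answer):
spring:
  datasource:
    hikari:
      # retire pooled connections before the infrastructure can silently drop them
      max-lifetime: 240000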
Well, after much research and reviewing various sources, it turns out that when Azure creates a VM it applies certain security policies, as Pedro Perez explains in the following post on Stack Exchange: Azure closing idle network connections
You're hitting a design feature of the software load balancer in front of your VMs. By default it will close any idle connections after 4 minutes, but you can configure the timeout to be anything between those 4 and 30 minutes
So in order to override the policy that governs your VM, you must create a load balancer, do all the relevant configuration, and create a load-balancing rule for port 5432 (the default Postgres port), setting the idle timeout somewhere in the 4 to 30 minute range according to your needs.
Finally, configure your VM so its public IP points to the LB (load balancer) public IP and everything will work normally.
It is true that if you simply want to live with Azure's default policies on the VMs you create, you should set maxLifetime to a maximum of 4 minutes in your Spring application.properties or application.yml, as #PratikLad says.
In my case I prefer to keep the default Hikari configuration (maxLifetime of 30 minutes), so I need to create the LB; but if you prefer to change the property, setting it to a maximum of 4 minutes, you won't need to do any of the load balancer setup described above.

Eureka client sometimes registers with wrong host name

I have a question about Eureka similar to this question, but the solutions to that issue were of no help at all. See the similar issue here:
Another similar issue
Well, in my case I'm trying to build a graceful release module based on Eureka: pull a service down in Eureka before actually shutting it down, to ensure there is no load-balancing exception when the application is closed.
I have tested setting eureka.instance.preferIpAddress to both false and true.
With eureka.instance.preferIpAddress=false, Ribbon does not recognize applications registered with the machine name and throws a no-load-balancer exception.
With eureka.instance.preferIpAddress=true, Ribbon recognizes those applications and everything works; that is, Ribbon can get the real IP address of the applications.
Here is my case: I need to figure out not only why in both situations the instanceId of applications in Eureka still shows the machine name, but also why the same application can end up with a different instanceId even after a simple restart!
Here is what I observed:
The server IP is 192.168.24.201, with the hosts file mapping its name to localhost.
Restarting the same application several times, it can be seen that the instanceId of this application sometimes alternates between localhost:applicationName:8005 and 192.168.24.201:applicationName:8005.
But both instanceIds carry the same IP address, which means neither of them leads to a load-balancing exception; it only makes manually controlling the Eureka server more difficult, and that is still acceptable.
The biggest problem is that sometimes the instanceId of a different server is also localhost:applicationName:8005, and that leads to conflicts! Restarting the application sometimes fixes it, but not always! So if I'm using Eureka as a cluster of several servers, I cannot ensure my applications register correctly with Eureka!
Here is the Eureka client config of application8005:
eureka:
  instance:
    lease-renewal-interval-in-seconds: ${my-config.eureka.instance.heartbeatInterval:5}
    lease-expiration-duration-in-seconds: ${my-config.eureka.instance.deadInterval:15}
    preferIpAddress: true
  client:
    service-url:
      defaultZone: http://192.168.24.201:8008/eureka/
    registry-fetch-interval-seconds: ${my-config.eureka.client.fetchRegistryInterval:20}
Here is the Eureka server config of EurekaServer:
eureka:
  server:
    eviction-interval-timer-in-ms: ${my-config.eureka.server.refreshInterval:5000}
    enable-self-preservation: false
    responseCacheUpdateIntervalMs: 5000
I don't know why the applications' instanceIds sometimes start with localhost instead of the IP.
The problem was solved by using prefer-ip-address: true and instance-id: ${spring.cloud.client.ip-address}:${spring.application.name}:${server.port}:${spring.cloud.nacos.config.group}
I have made it a rule that each server can run only one instance of the same app.
In this way each instance gets its own unique id.
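For readability, here are those two settings laid out as YAML (the ${spring.cloud.nacos.config.group} segment comes from this answer's Nacos setup; drop or replace it if you don't use Nacos):
eureka:
  instance:
    prefer-ip-address: true
    # deterministic, IP-based instance id, so restarts don't flip between hostname and IP
    instance-id: ${spring.cloud.client.ip-address}:${spring.application.name}:${server.port}:${spring.cloud.nacos.config.group}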

GCP Postgres connections consumed when stressing spring boot service (using SQL Cloud proxy)

I'm working on a Spring Boot service to post data and persist it in a GCP Cloud SQL Postgres database. The problem is that when I stress the service with requests, I get an SQL exception about consuming all available connections:
"FATAL: remaining connection slots are reserved for non-replication superuser connections"
I figured out the issue and added a proper Hikari configuration to limit the connections used and set a limit on when to close idle connections; here is my properties.yml configuration:
type: com.zaxxer.hikari.HikariDataSource
hikari:
  initializationFailTimeout: 30000
  idle-timeout: 30000
  minimum-idle: 5
  maximum-pool-size: 15
  connection-timeout: 20000
  max-lifetime: 1000
The service works fine with this setup when I run it locally against the same database, but it consumes all available connections when I run it from my cloud setup, and then I get the same exception.
IMPORTANT! I'm using SQL cloud proxy to connect to the database.
Here are screenshots for the database available connections:
1- before running the service
2- after running the service locally
3- after running the service from the cloud
After a few days of investigating this problem, we found a solution that mitigates it but does not solve it completely (the ideal solution is mentioned at the end).
If you want to keep using the SQL Cloud proxy, then you need to accept that you don't have full control over your database connection configuration, as the SQL Cloud proxy might keep those connections alive for longer than you've configured (source).
To mitigate the problem, we used SQL Cloud proxy 1.19.2 from the Docker registry, and this Hikari configuration:
hikari:
  idle-timeout: 30000 # maximum amount of time (in milliseconds) that a connection is allowed to sit idle in the pool
  minimum-idle: 1 # minimum number of idle connections that HikariCP tries to maintain in the pool; if idle connections dip below this value, HikariCP will make a best effort to restore them quickly and efficiently
  maximum-pool-size: 15 # maximum size that the pool is allowed to reach, including both idle and in-use connections; basically this value determines the maximum number of actual connections to the database backend
  connection-timeout: 20000 # maximum number of milliseconds that a client will wait for a connection
  max-lifetime: 20000 # maximum lifetime in milliseconds of a connection in the pool; once exceeded, the connection is closed
The proper solution in this case is to use a Shared VPC to get a private connection to your database, where you rely on your database driver to make those connections.

Customize retries behaviour with Ribbon and Hystrix

Objective
I have a task to write an API gateway & load balancer with the following objectives:
The gateway/LB should redirect requests to instances of a 3rd-party service (no code change = client-side service discovery).
Each service instance is able to process only a single request at a time; a concurrent request gets an immediate error response.
Service response latency is 0-5 seconds. I can't cache their responses, and therefore, as I understand it, a fallback is not an option for me. A timeout is not an option either, because latency is random and there is no guarantee you'll get a better one on another instance.
My solution
Spring Boot with Spring Cloud Netflix: Zuul-Hystrix-Ribbon. Two approaches:
Retry. Ribbon retry with a fixed interval or exponential backoff. I failed to make it work; the best result I achieved was MaxAutoRetriesNextServer: 1000, where Ribbon fires retries immediately, spamming the downstream services.
Circuit breaker. Instead of adding an exponential wait period in Ribbon, I can open the circuit after a few failures and redirect requests to other service instances. This is also not the best approach, for two reasons: a) having only a few instances, each with 0-5 s latency, means all circuits open very quickly and the request fails to be served; b) my configuration doesn't work for some reason.
Question
How can I make Ribbon wait between retries? Or can I solve my problem with Circuit Breaker?
My configuration
The full config can be found on GitHub.
ribbon:
  eureka:
    enabled: false
  # Obsolete option (Apache HttpClient by default), but without this Ribbon doesn't retry against other instances
  restclient:
    enabled: true
hystrix:
  command:
    my-service:
      circuitBreaker:
        sleepWindowInMilliseconds: 3000
        errorThresholdPercentage: 50
        requestVolumeThreshold: 5
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 5500
my-service:
  ribbon:
    OkToRetryOnAllOperations: false
    NFLoadBalancerRuleClassName: com.netflix.loadbalancer.WeightedResponseTimeRule
    listOfServers: ${LIST_OF_SERVERS}
    ConnectTimeout: 500
    ReadTimeout: 4500
    MaxAutoRetries: 0
    MaxAutoRetriesNextServer: 1000
    retryableStatusCodes: 404,502,503,504
Tests
In order to check your assumptions, you can play with the test on GitHub, which simulates single-threaded service instances with different latencies.

Eureka slow to remove instances

I am doing a POC with Eureka. When I shut down a service instance, it currently takes about 3 minutes for it to stop showing in the Eureka console. I presume (but am not confident) this also means the downed instance can still be discovered by clients?
With debugging on, I can see the server running the evict task several times before it determines the lease has expired on the instance I shut down.
My client settings are:
eureka.client.serviceUrl.defaultZone=http://localhost:8761/eureka/
eureka.instance.statusPageUrlPath=${management.context-path}/info
eureka.instance.healthCheckUrlPath=${management.context-path}/health
eureka.instance.leaseRenewalIntervalInSeconds=5
eureka.client.healthcheck.enabled=true
eureka.client.lease.duration=2
eureka.client.leaseExpirationDurationInSeconds=5
logging.level.com.netflix.eureka=DEBUG
logging.level.com.netflix.discovery=DEBUG
Server:
server.port=8761
eureka.client.register-with-eureka=false
eureka.client.fetch-registry=false
logging.level.com.netflix.eureka=DEBUG
logging.level.com.netflix.discovery=DEBUG
eureka.server.enableSelfPreservation=false
I have also tried these settings on the server:
eureka.server.response-cache-update-interval-ms: 500
eureka.server.eviction-interval-timer-in-ms: 500
These seem to increase the frequency of the checks but do not decrease the time it takes the server to recognize that the instance is down.
Am I missing a setting? Is there a best practice for shutting down instances in production to make this near-instantaneous?
Thanks!
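A hedged aside on the settings above: the lease properties belong to the eureka.instance namespace rather than eureka.client, so the client-side values used above would usually be spelled like this (names per the standard Spring Cloud Netflix properties; the 5-second values are just the ones from the question):
eureka.instance.lease-renewal-interval-in-seconds=5
eureka.instance.lease-expiration-duration-in-seconds=5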
