I am doing a POC with Eureka. When I shut down a service instance, it currently takes about 3 minutes to stop showing in the Eureka console. I presume (but am not confident) this also means the downed instance can still be discovered by clients?
With debugging on, I can see the server running the evict task several times before it determines that the lease on the instance I shut down has expired.
My client settings are:
eureka.client.serviceUrl.defaultZone=http://localhost:8761/eureka/
eureka.instance.statusPageUrlPath=${management.context-path}/info
eureka.instance.healthCheckUrlPath=${management.context-path}/health
eureka.instance.leaseRenewalIntervalInSeconds=5
eureka.client.healthcheck.enabled=true
eureka.client.lease.duration=2
eureka.client.leaseExpirationDurationInSeconds=5
logging.level.com.netflix.eureka=DEBUG
logging.level.com.netflix.discovery=DEBUG
Server:
server.port=8761
eureka.client.register-with-eureka=false
eureka.client.fetch-registry=false
logging.level.com.netflix.eureka=DEBUG
logging.level.com.netflix.discovery=DEBUG
eureka.server.enableSelfPreservation=false
I have also tried these settings on server:
eureka.server.response-cache-update-interval-ms: 500
eureka.server.eviction-interval-timer-in-ms: 500
These seem to increase the frequency of the checks but do not decrease the time it takes the server to recognize that the instance is down.
Am I missing a setting? Is there a best practice for shutting down instances in production so that this is (near) instantaneous?
Thanks!
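For reference, here is my understanding (a hedged sketch, values illustrative) of the knobs that govern how quickly a downed instance disappears; note the expiration duration is an eureka.instance property, not eureka.client:
# Instance side: how often it heartbeats, and how long the server keeps the lease after the last heartbeat
eureka.instance.lease-renewal-interval-in-seconds=5
eureka.instance.lease-expiration-duration-in-seconds=10
# Server side: how often it runs eviction and refreshes its response cache
eureka.server.eviction-interval-timer-in-ms=1000
eureka.server.response-cache-update-interval-ms=1000
# Consumer side: clients cache the registry too, so their fetch interval adds to the delay
eureka.client.registry-fetch-interval-seconds=5
From what I've read, a graceful shutdown (closing the Spring context rather than killing the process) should also make the Eureka client send an explicit cancel to the server, removing the instance immediately instead of waiting for the lease to expire.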
I am creating a VM in Azure to host a Postgres instance in Docker and connect to it from my local Spring backend. Once connected to the DB, after some period of inactivity, the next request fails with the following: "HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection#f162126 (This connection has been closed.). Possibly consider using a shorter maxLifetime value." Digging around, I realized that my VM behaves as if it closes connections once they become inactive, which causes the error above. The curious thing is that the sessions are not closed on the database side, as you can see in the attached image: even after shutting down my backend the sessions remain, and the only way to remove them is to restart the container hosting the DB.
I have tried to reproduce this behaviour locally, but it never happens: even if I leave the backend idle for an hour, requests to the DB work as if nothing had happened. It only occurs with my VM in Azure.
I want to clarify that the sessions shown in the attached image no longer work; that is, if I query the DB from Spring, the error I mentioned appears and Hikari automatically creates new sessions for its pool. I can repeat this behaviour until I reach 100 sessions, which after a while stop working again and which Spring never closes when shutting down the backend.
HikariPool-1 - Failed to validate connection org.postgresql.jdbc.PgConnection#f162126 (This connection has been closed.). Possibly consider using a shorter maxLifetime value.
This error is thrown by the isConnectionDead method, which checks whether the connection is still alive and usable; the error above is issued if the connection has already been closed.
You can adjust your maxLifetime setting to resolve this problem. The shortest permitted value is 30000 ms (30 seconds); the default is 1800000 ms (30 minutes).
A connection in the pool can only live for the amount of time controlled by the maxLifetime property, and its value should be several seconds below any connection time limit imposed by the infrastructure or the database.
Reference: Hikari Configuration Github.
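As a rough example (the value is illustrative; keep it within the range above and below any external idle timeout), in Spring Boot this can be set with:
# application.properties -- illustrative value in milliseconds (10 minutes)
spring.datasource.hikari.max-lifetime=600000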
Well, after much research and reviewing various sources, it turns out that when you create a VM, Azure applies certain security policies, as Pedro Perez explains in the following post on Stack Exchange: Azure closing idle network connections
You're hitting a design feature of the software load balancer in front of your VMs. By default it will close any idle connections after 4 minutes, but you can configure the timeout to be anything between those 4 and 30 minutes
So, to override the policy that governs your VM, you must create a load balancer, do all the relevant configuration, and add a load-balancing rule for port 5432 (the default Postgres port), setting the idle timeout anywhere in the 4-to-30-minute range according to your needs.
Finally, configure your VM so that its public IP points to the LB's (load balancer's) public IP, and everything will work normally.
It is true that if you simply want to keep Azure's default security policies on the VMs you create, you should set maxLifetime to at most 4 minutes in your Spring application.properties or application.yml, as @PratikLad says.
In my case I prefer to keep the default Hikari configuration (maxLifetime of 30 min), so I need to create the LB; but if you prefer to change the property, setting it to a maximum of 4 min means you don't need any of the LB setup described above.
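If you take the maxLifetime route instead of creating the LB, a minimal application.yml sketch (the exact value is just an example, kept under Azure's 4-minute idle timeout) would be:
spring:
  datasource:
    hikari:
      max-lifetime: 230000   # milliseconds, roughly 3.8 minutes, below the 4-minute idle timeout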
I have a question about Eureka similar to this question, but the solution in that issue was of no help at all. See the similar issue here:
Another similar issue
Well, in my case, I'm trying to build a graceful-release module based on Eureka: pull a service down in Eureka before actually shutting it down, so that there is no load-balancing exception when the application is closed.
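(Hedged aside: the usual way to "pull a service down" like this is Eureka's status-override REST endpoint; the host, app name, and instance id below are placeholders based on the example in this question.)
# Mark the instance OUT_OF_SERVICE so callers stop routing to it
curl -X PUT "http://192.168.24.201:8008/eureka/apps/APPLICATION8005/192.168.24.201:application8005:8005/status?value=OUT_OF_SERVICE"
# Remove the override again once the instance is back
curl -X DELETE "http://192.168.24.201:8008/eureka/apps/APPLICATION8005/192.168.24.201:application8005:8005/status"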
I have tested the behaviour with eureka.instance.preferIpAddress set to false and to true.
With eureka.instance.preferIpAddress=false, Ribbon will not recognize applications registered with a machine name and will throw a no-load-balancer exception.
With eureka.instance.preferIpAddress=true, Ribbon will recognize applications registered with a machine name and everything works correctly; that is, Ribbon can resolve the real IP address of those applications.
Here is my case: I need to figure out not only why, in both situations, the instanceId of applications in Eureka still shows the machine name, but also why the same application can end up with a different instanceId even after a simple restart!
Here is what I observed:
The server IP is 192.168.24.201, with a hosts entry mapping its name to localhost.
Restarting the same application several times, I can see that its instanceId sometimes alternates between localhost:applicationName:8005 and 192.168.24.201:applicationName:8005.
But both instanceIds point to the same IP address, which means neither of them leads to a load-balancing exception; it only makes manually controlling the Eureka server more difficult, and that is still acceptable.
The biggest problem is that sometimes the instanceId of a different server will also be localhost:applicationName:8005, and that leads to conflicts! Restarting the application sometimes fixes it, but not always. So if I use Eureka as a cluster of several servers, I cannot ensure that my applications are registered correctly.
Here is the eureka client setting of application8005:
eureka:
  instance:
    lease-renewal-interval-in-seconds: ${my-config.eureka.instance.heartbeatInterval:5}
    lease-expiration-duration-in-seconds: ${my-config.eureka.instance.deadInterval:15}
    preferIpAddress: true
  client:
    service-url:
      defaultZone: http://192.168.24.201:8008/eureka/
    registry-fetch-interval-seconds: ${my-config.eureka.client.fetchRegistryInterval:20}
Here is the eureka server setting of EurekaServer:
eureka:
  server:
    eviction-interval-timer-in-ms: ${my-config.eureka.server.refreshInterval:5000}
    enable-self-preservation: false
    responseCacheUpdateIntervalMs: 5000
I don't know why the applications' instanceId sometimes starts with localhost instead of the IP address.
The problem was solved by using prefer-ip-address: true and instance-id: ${spring.cloud.client.ip-address}:${spring.application.name}:${server.port}:${spring.cloud.nacos.config.group}
I have a rule that each server runs only one instance of a given app.
In this case each instance will have its own unique id.
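For reference, a minimal sketch of that client configuration (standard Spring Cloud Eureka properties; the last segment of the instance-id comes from this setup's Nacos config group and can be dropped if you don't use it):
eureka:
  instance:
    prefer-ip-address: true
    # unique per host as long as only one instance of a given app runs per server
    instance-id: ${spring.cloud.client.ip-address}:${spring.application.name}:${server.port}:${spring.cloud.nacos.config.group}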
I am new to developing microservices, although I have been researching the topic for a while, reading both Spring's docs and Netflix's.
I have started a simple project, available on GitHub. It is basically a Eureka server (Archimedes) and three Eureka client microservices (one public API and two private). Check the GitHub readme for a detailed description.
The point is that, when everything is running, I would like the Eureka server to notice when one of the private microservices is killed and remove it from the registry.
I found this question on Stack Overflow, and the solution involves setting enableSelfPreservation: false in the Eureka server config. After doing this, the killed service disappears after a while, as expected.
However I can see the following message:
THE SELF PRESERVATION MODE IS TURNED OFF.THIS MAY NOT PROTECT INSTANCE
EXPIRY IN CASE OF NETWORK/OTHER PROBLEMS.
1. What is the purpose of self-preservation? The docs state that with self-preservation on "clients can get the instances that do not exist anymore". So when is it advisable to have it on/off?
Furthermore, when self-preservation is on, you may get a prominent warning in the Eureka server console:
EMERGENCY! EUREKA MAY BE INCORRECTLY CLAIMING INSTANCES ARE UP WHEN
THEY'RE NOT. RENEWALS ARE LESSER THAN THRESHOLD AND HENCE THE
INSTANCES ARE NOT BEING EXPIRED JUST TO BE SAFE.
Now, moving on to the Spring Eureka console:
Lease expiration enabled true/false
Renews threshold 5
Renews (last min) 4
I have come across a weird behaviour of the threshold count: when I start the Eureka Server alone, the threshold is 1.
2. I have a single Eureka server, and it is configured with registerWithEureka: false to prevent it from registering on another server. Then why does it show up in the threshold count?
3. For every client I start the threshold count increases by +2. I guess it is because they send 2 renew messages per min, am I right?
4. The Eureka server never sends a renew so the last min renews is always below the threshold. Is this normal?
renew threshold 5
renews last min: (client1) +2 + (client2) +2 -> 4
Server cfg:
server:
  port: ${PORT:8761}

eureka:
  instance:
    hostname: localhost
  client:
    registerWithEureka: false
    fetchRegistry: false
    serviceUrl:
      defaultZone: http://${eureka.instance.hostname}:${server.port}/eureka/
  server:
    enableSelfPreservation: false
    # waitTimeInMsWhenSyncEmpty: 0
Client 1 cfg:
spring:
  application:
    name: random-image-microservice

server:
  port: 9999

eureka:
  client:
    serviceUrl:
      defaultZone: http://localhost:8761/eureka/
    healthcheck:
      enabled: true
I had the same question as @codependent, googled a lot, and did some experiments; here I'd like to contribute some knowledge about how the Eureka server and instances work.
Every instance needs to renew its lease with the Eureka server once every 30 seconds, which can be defined via eureka.instance.leaseRenewalIntervalInSeconds.
Renews (last min): how many renewals the Eureka server received from instances during the last minute.
Renews threshold: the number of renewals per minute the Eureka server expects to receive from instances.
For example, if registerWithEureka is set to false, eureka.instance.leaseRenewalIntervalInSeconds is set to 30, and you run 2 Eureka instances: the two instances send 4 renewals to the Eureka server per minute, and the server's minimal count is 1 (written in the code), so the threshold is 5 (this number is then multiplied by the factor eureka.server.renewalPercentThreshold, discussed later).
SELF PRESERVATION MODE: if Renews (last min) is less than Renews threshold, self-preservation mode is activated.
So in the example above, SELF PRESERVATION MODE is activated, because the threshold is 5 but the Eureka server can only receive 4 renewals/min.
Question 1:
SELF PRESERVATION MODE is designed to guard against poor network connectivity. Say connectivity between Eureka instances A and B is good, but B fails to renew its lease with the Eureka server for a short period due to connectivity hiccups; the Eureka server can't simply kick out instance B. If it did, instance A would no longer get B from the registry even though B is available. That is the purpose of SELF PRESERVATION MODE, and it's better to turn it on.
Question 2:
The minimal threshold of 1 is written in the code. Since registerWithEureka is set to false, no Eureka instance registers the server itself, so the threshold stays at 1.
In a production environment we generally deploy two Eureka servers with registerWithEureka set to true. The threshold will then be 2, and each Eureka server renews its lease with the other twice a minute, so RENEWALS ARE LESSER THAN THRESHOLD won't be a problem.
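A minimal sketch of such a two-node setup (hostnames are placeholders); each server registers with, and fetches from, its peer:
# peer1's application.yml -- peer2 mirrors this, pointing its defaultZone back at peer1
eureka:
  client:
    register-with-eureka: true
    fetch-registry: true
    service-url:
      defaultZone: http://peer2:8761/eureka/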
Question 3:
Yes, you are right. eureka.instance.leaseRenewalIntervalInSeconds defines how often each instance sends a renewal (and hence how many renewals reach the server per minute); the expected total is then multiplied by the factor eureka.server.renewalPercentThreshold mentioned above, whose default value is 0.85.
Question 4:
Yes, it's normal, because the threshold's initial value is set to 1. So if registerWithEureka is set to false, renewals will always be below the threshold.
I have two suggestions for this:
Deploy two Eureka servers and enable registerWithEureka.
If you just want to deploy in a demo/dev environment, you can set eureka.server.renewalPercentThreshold to 0.49, so that when you start up a single Eureka server the threshold will be 0.
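As a sketch of the second, dev-only option (values illustrative), a standalone server could use:
eureka:
  client:
    register-with-eureka: false
    fetch-registry: false
  server:
    # with a single non-registering server the expected renewal count stays at the minimum,
    # so a factor below 0.5 brings the effective threshold down to 0
    renewal-percent-threshold: 0.49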
I've created a blog post with the details of Eureka here, which fills in some details missing from the Spring docs and the Netflix blog. It is the result of several days of debugging and digging through source code. I understand it's preferable to copy-paste rather than link to an external URL, but the content is too big for an SO answer.
You can try setting the renewal threshold in your Eureka server properties. If you have around 3 to 4 microservices to register on Eureka, you can set it like this:
eureka.server.renewalPercentThreshold=0.33
eureka:
  server:
    enableSelfPreservation: false
If self-preservation is disabled (set to false), Eureka expects service instances to register themselves and to keep sending renewal requests every 30 s; if it doesn't receive a renewal from a service for three renewal periods (90 s), it deregisters that instance.
If self-preservation is enabled (set to true) and renewals drop below the expected threshold, Eureka assumes there's a network problem, enters self-preservation mode, and won't deregister service instances.
Even if you decide to disable self-preservation mode for development, you should leave it enabled when you go into production.
We have a web instance (nginx) behind an ELB, which we manually power on when required.
The web app starts up quickly and returns a successful 200 response when we run wget locally.
However the website will not load as the ELB isn't sending healthcheck requests to the instance. I can confirm this by viewing the nginx access logs.
The workaround I've been using is to remove the web instance from the ELB and add it back in.
This seems to activate the health checks again, and they are visible in our access logs.
I've edited our health check settings to allow a longer timeout and raised the Unhealthy Threshold to 3, but this has made no difference.
Currently our Health Check Config is:
Ping Target: HTTPS:443/login
Timeout: 10 sec
Interval: 12 sec
Unhealthy: 2
Healthy: 2
Listener:
HTTPS 443 to HTTPS 443 SSL Cert
The ELB and web instance are both in the same public VPC security group, which has HTTP/HTTPS open to 0.0.0.0/0.
Can anyone help me figure out why the ELB Health checks aren't kicking in as soon as the web instance has started? Is this by design or is there a way of automatically initiating the checks? Thank you.
Niall
Does your instance come up with a different IP address each time you start it?
Elastic Load Balancing registers your load balancer with your EC2 instances using the IP addresses that are associated with your instances. When an instance is stopped and then restarted, the IP address associated with your instance changes. Your load balancer cannot recognize the new IP address, which prevents it from routing traffic to your instances.
— http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/TerminologyandKeyConcepts.html#registerinstance
It would seem like the appropriate approach to getting the instance re-associated would be for code running on the web server instance to programmatically register itself with the load balancer via the API when the startup process determines that the instance is ready for web traffic.
Update:
Luke#AWS: "You should be de-registering from your ELB during a stop/start."
— https://forums.aws.amazon.com/thread.jspa?messageID=463835
I'm curious what the console shows as the reason why the instance isn't active in the ELB. There does appear to be some kind of interaction between ELB and EC2 where ELB has some kind of awareness of an instance's EC2 state (e.g. "stopped") that goes beyond just the health checks. This isn't well-documented, but I would speculate that ELB, based on that awareness, decides that it isn't worth bothering with the health checks, and the console may provide something useful to at least confirm this.
It's possible that, given sufficient time, ELB might become aware that the instance is running again and start sending health checks, but it's also possible that instances have a hidden global meta-identifier separate from i-xxxxxx and that a stopped and restarted instance is, from the perspective of this identifier, a different instance.
...but the answer seems to be that stopping an instance and restarting it requires re-registration with the ELB.
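A hedged sketch of that re-registration with the classic ELB API via the AWS CLI (the load balancer name and instance id are placeholders):
# Run from the instance (or a boot script) once the app is ready for traffic
aws elb deregister-instances-from-load-balancer \
    --load-balancer-name my-elb \
    --instances i-0123456789abcdef0

aws elb register-instances-with-load-balancer \
    --load-balancer-name my-elb \
    --instances i-0123456789abcdef0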
I'm having an issue with an Amazon EC2 instance during auto scaling. Every command I typed worked and I found no errors, but when testing whether auto scaling works I found that it only works until the new instance has started: the newly spawned instance sits under my load balancer, but its status is OutOfService. One more issue: when I copy and paste the public DNS name into the browser, it does not respond and an error appears like "Firefox can't find ...".
I doubt that there is a problem with the image or the Linux configuration.
Thanks in advance.
Although it's been a long time since you posted this, try adjusting the health check of the load balancer.
If your health check is like this:
Ping Target: HTTP:80/index.php
Timeout: 10 seconds
Interval: 30 seconds
Unhealthy Threshold: 4
Healthy Threshold: 2
that means an instance will be marked out of service if the ping target doesn't respond within 10 seconds on 4 consecutive checks, while the ELB tries to reach it every 30 seconds.
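If you'd rather set these values from the CLI than the console, a hedged sketch for a classic ELB (the load balancer name is a placeholder) is:
aws elb configure-health-check \
    --load-balancer-name my-elb \
    --health-check Target=HTTP:80/index.php,Interval=30,Timeout=10,UnhealthyThreshold=4,HealthyThreshold=2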
Usually the fact that you get "Firefox can't find ..." when you try to access the instance directly means that the service is down. Try to log in to the instance and check whether the service is alive; also check the firewall rules, which might block internet/ELB requests. Check your ELB health check as well, it's a good place to start. If you still have issues, post some debug information such as the instance's netstat output, the ELB description, and its parameters.
Rules on the security groups assigned to the instance and the load balancer were not allowing traffic to pass between the two. This caused the health check to fail, so the load balancer marked the instance as out of service.
If you don't have index.html in the document root of the instance, the default health check will fail. In my experience, you can set a custom protocol, port, and path for the health check when creating the load balancer.