What would be the best way to create a liveness probe based on incoming TCP traffic on a particular port?
tcpdump and bash are available inside the container, so this could be achieved by a script that checks whether there is incoming traffic on that port, but I wonder if there are better (cleaner) ways?
The example desired behaviour:
if there is no incoming traffic on port 1234 for the last 10 seconds, the container crashes
If there is no incoming traffic on port 1234 for the last 10 seconds, the container will be restarted with the configuration below. Also note that there is no probe that makes a container crash.
livenessProbe:
  tcpSocket:
    port: 1234
  periodSeconds: 10
  failureThreshold: 1
Here is the documentation
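If you do want the probe to key off observed traffic (the tcpdump/bash idea from the question) rather than port reachability, an exec probe along these lines might work. This is an untested sketch: it assumes tcpdump can capture inside the container (e.g. the container has the NET_RAW capability) and that the timeout utility is available.

livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # succeed only if at least one packet to port 1234 arrives within 10 seconds
      - timeout 10 tcpdump -ni any -c 1 dst port 1234 >/dev/null 2>&1
  periodSeconds: 10
  failureThreshold: 1

With -c 1, tcpdump exits successfully as soon as one matching packet is captured; if nothing arrives, timeout kills it after 10 seconds and the non-zero exit code fails the probe.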
Objective
I have a task to write an API gateway & load balancer with the following objectives:
The gateway/LB should redirect requests to instances of a 3rd-party service (no code changes in the service, hence client-side service discovery).
Each service instance can process only a single request at a time; a concurrent request gets an immediate error response.
Service response latency is 0-5 seconds. I can't cache their responses, so as I understand it a fallback is not an option for me. A timeout is not an option either, because latency is random and there is no guarantee you'll get a better one on another instance.
My solution
Spring Cloud Netflix (on Spring Boot): Zuul + Hystrix + Ribbon. Two approaches:
Retry. Ribbon retries with a fixed interval or exponential backoff. I failed to make this work; the best result I achieved is MaxAutoRetriesNextServer: 1000, where Ribbon fires retries immediately and spams the downstream services.
Circuit breaker. Instead of adding an exponential wait period in Ribbon, I can open the circuit after a few failures and redirect requests to other instances. This is also not the best approach, for two reasons: a) with only a few instances, each with 0-5 s latency, all circuits open very quickly and the request fails to be served; b) my configuration doesn't work for some reason.
Question
How can I make Ribbon wait between retries? Or can I solve my problem with a circuit breaker?
My configuration
The full config can be found on GitHub.
ribbon:
  eureka:
    enabled: false
  # Obsolete option (Apache HttpClient by default), but without this Ribbon doesn't retry against other instances
  restclient:
    enabled: true
hystrix:
  command:
    my-service:
      circuitBreaker:
        sleepWindowInMilliseconds: 3000
        errorThresholdPercentage: 50
        requestVolumeThreshold: 5
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 5500
my-service:
  ribbon:
    OkToRetryOnAllOperations: false
    NFLoadBalancerRuleClassName: com.netflix.loadbalancer.WeightedResponseTimeRule
    listOfServers: ${LIST_OF_SERVERS}
    ConnectTimeout: 500
    ReadTimeout: 4500
    MaxAutoRetries: 0
    MaxAutoRetriesNextServer: 1000
    retryableStatusCodes: 404,502,503,504
Tests
To check your assumptions, you can play with the test on GitHub, which simulates single-threaded service instances with different latencies.
I am doing a POC with Eureka. When I shut down a service instance, it currently takes about 3 minutes for it to stop showing in the Eureka console. I presume (but am not certain) this also means the downed instance can still be discovered by clients?
With debugging on, I can see the server running the evict task several times before it determines that the lease on the instance I shut down has expired.
My settings are as follows. Client:
eureka.client.serviceUrl.defaultZone=http://localhost:8761/eureka/
eureka.instance.statusPageUrlPath=${management.context-path}/info
eureka.instance.healthCheckUrlPath=${management.context-path}/health
eureka.instance.leaseRenewalIntervalInSeconds=5
eureka.client.healthcheck.enabled=true
eureka.client.lease.duration=2
eureka.client.leaseExpirationDurationInSeconds=5
logging.level.com.netflix.eureka=DEBUG
logging.level.com.netflix.discovery=DEBUG
Server:
server.port=8761
eureka.client.register-with-eureka=false
eureka.client.fetch-registry=false
logging.level.com.netflix.eureka=DEBUG
logging.level.com.netflix.discovery=DEBUG
eureka.server.enableSelfPreservation=false
I have also tried these settings on the server:
eureka.server.response-cache-update-interval-ms: 500
eureka.server.eviction-interval-timer-in-ms: 500
These seem to increase the frequency of the checks but do not decrease the time it takes for the server to recognize that the instance is down.
Am I missing a setting? Is there a best practice for shutting down instances in production so that this happens near-instantaneously?
Thanks!
I am new to developing microservices, although I have been researching the topic for a while, reading both Spring's docs and Netflix's.
I have started a simple project available on GitHub. It is basically a Eureka server (Archimedes) and three Eureka client microservices (one public API and two private). Check GitHub's readme for a detailed description.
The point is that, when everything is running, if one of the private microservices is killed, I would like the Eureka server to notice and remove it from the registry.
I found this question on Stack Overflow, and the solution is to set enableSelfPreservation: false in the Eureka server config. After doing this, the killed service disappears after a while, as expected.
However I can see the following message:
THE SELF PRESERVATION MODE IS TURNED OFF.THIS MAY NOT PROTECT INSTANCE
EXPIRY IN CASE OF NETWORK/OTHER PROBLEMS.
1. What is the purpose of self-preservation? The docs state that with self-preservation on, "clients can get the instances that do not exist anymore". So when is it advisable to have it on/off?
Furthermore, when self-preservation is on, you may get a prominent warning in the Eureka server console:
EMERGENCY! EUREKA MAY BE INCORRECTLY CLAIMING INSTANCES ARE UP WHEN
THEY'RE NOT. RENEWALS ARE LESSER THAN THRESHOLD AND HENCE THE
INSTANCES ARE NOT BEING EXPIRED JUST TO BE SAFE.
Now, moving on to the Spring Eureka console:
Lease expiration enabled true/false
Renews threshold 5
Renews (last min) 4
I have come across a weird behaviour of the threshold count: when I start the Eureka server alone, the threshold is 1.
2. I have a single Eureka server and it is configured with registerWithEureka: false to prevent it from registering on another server. So why does it show up in the threshold count?
3. For every client I start, the threshold count increases by 2. I guess this is because they send 2 renew messages per minute, am I right?
4. The Eureka server never sends a renew, so the renews in the last minute are always below the threshold. Is this normal?
renews threshold: 5
renews (last min): (client1) 2 + (client2) 2 -> 4
Server cfg:
server:
  port: ${PORT:8761}
eureka:
  instance:
    hostname: localhost
  client:
    registerWithEureka: false
    fetchRegistry: false
    serviceUrl:
      defaultZone: http://${eureka.instance.hostname}:${server.port}/eureka/
  server:
    enableSelfPreservation: false
    # waitTimeInMsWhenSyncEmpty: 0
Client 1 cfg:
spring:
  application:
    name: random-image-microservice
server:
  port: 9999
eureka:
  client:
    serviceUrl:
      defaultZone: http://localhost:8761/eureka/
    healthcheck:
      enabled: true
I had the same question as @codependent, googled a lot and did some experiments, so here I'll contribute some knowledge about how the Eureka server and instances work.
Every instance needs to renew its lease with the Eureka server once every 30 seconds; the interval can be defined via eureka.instance.leaseRenewalIntervalInSeconds.
Renews (last min): the number of renewals the Eureka server received from instances in the last minute.
Renews threshold: the number of renewals the Eureka server expects to receive from instances per minute.
For example, say registerWithEureka is set to false, eureka.instance.leaseRenewalIntervalInSeconds is set to 30, and 2 Eureka client instances are running. The two instances will send 4 renewals to the Eureka server per minute; the server's minimal threshold is 1 (hard-coded), so the threshold is 5 (this number is also multiplied by the factor eureka.server.renewalPercentThreshold, which is discussed later).
SELF PRESERVATION MODE: if Renews (last min) is less than the Renews threshold, self-preservation mode is activated.
So in the example above, SELF PRESERVATION MODE is activated, because the threshold is 5 but the Eureka server only receives 4 renewals/min.
Question 1:
SELF PRESERVATION MODE is designed to guard against failures caused by poor network connectivity. Say connectivity between Eureka instances A and B is good, but B fails to renew its lease with the Eureka server for a short period due to connectivity hiccups. The Eureka server can't simply kick instance B out, because if it did, instance A would no longer get B from the registry even though B is available. That is the purpose of SELF PRESERVATION MODE, and it's better to turn it on.
Question 2:
The minimal threshold of 1 is hard-coded. Since registerWithEureka is set to false, no Eureka instance registers itself, so the threshold stays at 1.
In a production environment we generally deploy two Eureka servers with registerWithEureka set to true. The threshold will then be 2, and each Eureka server renews its lease with the other twice per minute, so RENEWALS ARE LESSER THAN THRESHOLD won't be a problem.
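A minimal sketch of that kind of setup for one of the two peers (hostnames are placeholders; the second server mirrors this, pointing its defaultZone back at the first):

server:
  port: 8761
eureka:
  instance:
    hostname: peer1
  client:
    registerWithEureka: true
    fetchRegistry: true
    serviceUrl:
      defaultZone: http://peer2:8761/eureka/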
Question 3:
Yes, you are right. eureka.instance.leaseRenewalIntervalInSeconds determines how many renewals are sent to the server per minute, and the expected count is multiplied by the factor eureka.server.renewalPercentThreshold mentioned above, whose default value is 0.85.
Question 4:
Yes, it's normal, because the threshold's initial value is set to 1. So if registerWithEureka is set to false, the renewals count is always below the threshold.
I have two suggestions for this:
Deploy two Eureka servers and enable registerWithEureka.
If you just want to deploy in a demo/dev environment, you can set eureka.server.renewalPercentThreshold to 0.49, so that when you start a Eureka server alone the threshold will be 0 (see the sketch below).
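In a Spring Cloud Eureka server that could look roughly like this in application.yml (a sketch; with a single standalone server, 1 * 0.49 is truncated to a threshold of 0, as described above):

eureka:
  server:
    renewalPercentThreshold: 0.49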
I've created a blog post with the details of Eureka here; it fills in some details missing from the Spring docs and the Netflix blog. It is the result of several days of debugging and digging through source code. I understand it's preferable to copy-paste rather than link to an external URL, but the content is too big for an SO answer.
You can try setting the renewal threshold limit in your Eureka server properties. If you have around 3 to 4 microservices to register on Eureka, then you can set it like this:
eureka.server.renewalPercentThreshold=0.33
server:
  enableSelfPreservation: false
If set to true (the default): Eureka expects service instances to register themselves and to keep sending renewal requests every 30 s; when renewals drop below the expected threshold, Eureka assumes there's a network problem, enters self-preservation mode, and won't deregister service instances.
If set to false: self-preservation is disabled, so if Eureka doesn't receive a renewal from a service for three renewal periods (90 s), it deregisters that instance regardless.
Even if you decide to disable self-preservation mode for development, you should leave it enabled when you go into production.
I'm having an issue with an Amazon EC2 instance during auto scaling. Every command I typed worked and I found no errors. But when testing whether auto scaling works, I found that it only works up to the point where the new instance starts. The newly spawned instance does not work after that: it is attached to my load balancer, but its status is "out of service". One more issue: when I copy and paste the public DNS link into the browser, it does not respond and an error is triggered like "Firefox can't find ..."
I doubt that there is a problem with the image or the Linux configuration.
Thanks in advance.
Although it's been a long time since you posted this, try adjusting the health check of the load balancer.
If your health check is like this:
Ping Target: HTTP:80/index.php
Timeout: 10 seconds
Interval: 30 seconds
Unhealthy Threshold: 4
Healthy Threshold: 2
that means an instance will be marked out of service if the ping target doesn't respond within 10 seconds on 4 consecutive checks, while the ELB tries to reach it every 30 seconds.
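For reference, the same health check expressed as a classic ELB CloudFormation snippet would look roughly like this (a sketch: only the HealthCheck property is shown, listeners/subnets are omitted, and the target simply mirrors the example above):

MyLoadBalancer:
  Type: AWS::ElasticLoadBalancing::LoadBalancer
  Properties:
    HealthCheck:
      Target: HTTP:80/index.php
      Interval: '30'
      Timeout: '10'
      UnhealthyThreshold: '4'
      HealthyThreshold: '2'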
Usually the fact that you get "Firefox can't find ..." when you try to access the instance directly means that the service is down. Try to log in to the instance and check whether the service is alive, and also check the firewall rules, which might block internet/ELB requests. Check your ELB health check too; it's a good place to start. If you still have issues, try to post some debug information such as the instance's netstat output and the ELB description and parameters.
The rules on the security groups assigned to the instance and the load balancer were not allowing traffic to pass between the two. This caused the health check to fail, so the instance shows as out of service under your load balancer.
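In CloudFormation terms, the missing piece would be an ingress rule like the following (hypothetical resource names; port 80 is assumed to match the health check target), letting the load balancer's security group reach the instances:

InstanceIngressFromELB:
  Type: AWS::EC2::SecurityGroupIngress
  Properties:
    GroupId: !Ref InstanceSecurityGroup            # security group on the EC2 instances
    SourceSecurityGroupId: !Ref LoadBalancerSecurityGroup
    IpProtocol: tcp
    FromPort: 80
    ToPort: 80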
If you don't have index.html in the document root of the instance, the default health check will fail. In my experience, you can set a custom protocol, port and path for the health check when creating the load balancer.