K8s probes for Spring Boot services, best practices

I'm configuring startup/liveness/readiness probes for Kubernetes deployments serving Spring Boot services. According to the Spring Boot documentation, it's best practice to use the corresponding liveness and readiness actuator endpoints, as described here:
https://spring.io/blog/2020/03/25/liveness-and-readiness-probes-with-spring-boot
What do you use for your startup probe?
What are your recommendations for failureThreshold, delay, period and timeout values?
Did you encounter issues when deploying Istio sidecars to an existing setup?

I use the paths /actuator/health/readiness and /actuator/health/liveness:
readinessProbe:
  initialDelaySeconds: 120
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
  failureThreshold: 3
  httpGet:
    scheme: HTTP
    path: /actuator/health/readiness
    port: 8080
livenessProbe:
  initialDelaySeconds: 120
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
  failureThreshold: 3
  httpGet:
    scheme: HTTP
    path: /actuator/health/liveness
    port: 8080
For the recommendations, it depends on your needs and policies, actually (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/).
No Istio sidecar issues with this :)
Do not forget to activate the endpoints in your properties (cf. https://www.baeldung.com/spring-liveness-readiness-probes):
management.endpoint.health.probes.enabled=true
management.health.livenessState.enabled=true
management.health.readinessState.enabled=true
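If you prefer YAML configuration, a minimal application.yml sketch of the same three properties would be:
management:
  endpoint:
    health:
      probes:
        enabled: true
  health:
    livenessState:
      enabled: true
    readinessState:
      enabled: true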

In addition to @Bguess's answer, and regarding the part:
for the recommendations, it depends on your needs and policies
Some of our microservices communicate with heavy/slow containers which need time to boot up, therefore we have to protect them with startup probes. The way recommended by https://kubernetes.io/ is to point the startup probe at the liveness endpoint, which in combination with the Spring Boot actuator endpoints results in:
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: http
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: http
  periodSeconds: 5
startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: http
  failureThreshold: 25
  periodSeconds: 10
The above setup makes sure that we only start probing liveness and readiness once the application is fully started (it has 10*25 = 250 seconds to do so). As the doc says:
If such a probe is configured, it disables liveness and readiness checks until it succeeds, making sure those probes don't interfere with the application startup
Please note that
management.endpoint.health.probes.enabled=true
is not needed for applications running in Kubernetes (doc):
These health groups are automatically enabled only if the application runs in a Kubernetes environment. You can enable them in any environment by using the management.endpoint.health.probes.enabled configuration property.
Hence you only need it if you want to check the probes outside Kubernetes, for example locally.

The startup probe is optional.
Originally, there were two types of probes in Kubernetes: readiness and liveness. However, people have encountered issues with slow-start containers. When a container takes a long time to start, Kubernetes does the first check on the liveness probe after initialDelaySeconds. If the check fails, Kubernetes attempts failureThreshold times with an interval of periodSeconds. If the liveness probe still fails, Kubernetes assumes that the container is not alive and restarts it. Unfortunately, the container will likely fail again, resulting in an endless cycle of restarting.
You may want to increase failureThreshold and periodSeconds to avoid the endless restarting, but it can cause longer detection and recovery times in case of a thread deadlock.
You may want to make the initialDelaySeconds longer to allow sufficient time for the container to start. However, it can be challenging to determine the appropriate delay since your application can run on various hardware. For instance, increasing initialDelaySeconds to 60 seconds to avoid this problem in one environment may cause unnecessary slow startup when deploying the service to a more advanced hardware that only requires 20 seconds to start. In such a scenario, Kubernetes waits for 60 seconds for the first liveness check, causing the pod to be idle for 40 seconds, and it still takes 60 seconds to serve.
To address this issue, Kubernetes introduced the startup probe in 1.16, which defers all other probes until a pod completes its startup process. For slow-starting pods, the startup probe can poll at short intervals with a high failure threshold until it is satisfied, at which point the other probes can begin.
If all of a container's components except the API component take a long time to become ready, the container can simply report 200 on the liveness probe, and the startup probe is not needed. Because the API component becomes ready and reports 200 very soon, Kubernetes will not restart the container endlessly; it will patiently wait until all the readiness probes indicate that the containers are all "ready" before sending traffic to the pod.
The startup probe can be implemented in the same way as the liveness probe. Once the startup probe confirms that the container is initialized, the liveness probe will immediately report that the container is alive, leaving no room for Kubernetes to mistakenly restart the container.
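For example, a minimal sketch that reuses the liveness endpoint from the snippets above for the startup probe (the port and threshold values are illustrative only):
startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 5
  failureThreshold: 30    # up to 5*30 = 150 seconds to start
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
Once the startup probe succeeds, the liveness probe takes over with its normal, tighter settings.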
Regarding initialDelaySeconds, periodSeconds, failureThreshold, and timeout, it is really a balance between sensitivity and false positives. E.g., if you have a high failureThreshold and a high periodSeconds for the readiness probe, k8s cannot detect issues in the container in time and your pod continues to take traffic, hence many requests will fail. If you set a low failureThreshold and a low periodSeconds for the readiness probe, a temporary problem could take the pod out of traffic; that's a false positive. I kind of prefer keeping failureThreshold at 3, periodSeconds at 5, and successThreshold at 1 or 2.
BTW, don't use the default health checks from Spring Boot; you always need to customize them. More details here: https://danielw.cn/health-check-probes-in-k8s
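For instance, one way to customize the probes without writing code is to add extra health indicators to the probe groups via configuration; the db indicator below is only an example and assumes your application auto-configures a DataSource health indicator:
management:
  endpoint:
    health:
      group:
        readiness:
          include: readinessState,db
With this, readiness also turns DOWN when the database is unreachable, so the pod is taken out of traffic rather than restarted.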

Related

Kubernetes pod - liveness probe based on incoming port traffic

What would be the best way to create a liveness probe based on incoming TCP traffic on a particular port?
tcpdump and bash are available inside so it could be achieved by some script checking if there is incoming traffic on that port, but I wonder if there are better (cleaner) ways?
The example desired behaviour:
if there is no incoming traffic on port 1234 for the last 10 seconds the container crashes
if there is no incoming traffic on port 1234 for the last 10 seconds, the container will be restarted with the configuration below. Also, note that there is no probe that makes the container crash
livenessProbe:
  tcpSocket:
    port: 1234
  periodSeconds: 10
  failureThreshold: 1
Here is the documentation
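If you really need to detect incoming traffic rather than just an open socket, a possible sketch is an exec probe built around the tcpdump/bash approach mentioned in the question (this assumes tcpdump is available in the image and the container is allowed to capture packets, e.g. via the NET_RAW capability):
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # exit non-zero if no packet arrives on port 1234 within 10 seconds
      - timeout 10 tcpdump -i any -c 1 'dst port 1234' > /dev/null 2>&1
  periodSeconds: 10
  timeoutSeconds: 12    # must be longer than the 10-second capture window
  failureThreshold: 1
Whether this is cleaner than a plain tcpSocket check is debatable, but it does observe actual traffic instead of only checking that the port is open.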

Customize retries behaviour with Ribbon and Hystrix

Objective
I have a task to write an API gateway & load balancer with the following objectives:
The gateway/LB should redirect requests to instances of a 3rd-party service (no code change = client-side service discovery)
Each service instance is able to process only single response simultaneously, concurrent request = immediate error response.
Service response latency is 0-5 seconds. I can't cache their responses, so as I understand it a fallback is not an option for me. A timeout is not an option either, because latency is random and there is no guarantee you'll get a better one on another instance.
My solution
Spring Boot Cloud Netflix: Zuul-Hystrix-Ribbon. Two approaches:
Retry. Ribbon retry with a fixed interval or exponential backoff. I failed to make it work; the best result I achieved is MaxAutoRetriesNextServer: 1000, where Ribbon fires retries immediately, spamming the downstream services.
Circuit Breaker. Instead of adding an exponential wait period in Ribbon, I can open the circuit after a few failures and redirect requests to other services. This is also not the best approach, for two reasons: a) having only a few instances, each with 0-5 s latency, means all circuits open very quickly and requests fail to be served; b) my configuration doesn't work for some reason.
Question
How can I make Ribbon wait between retries? Or can I solve my problem with Circuit Breaker?
My configuration
The full config can be found on GitHub.
ribbon:
  eureka:
    enabled: false
  # Obsolete option (Apache HttpClient by default), but without this Ribbon doesn't retry against other instances
  restclient:
    enabled: true
hystrix:
  command:
    my-service:
      circuitBreaker:
        sleepWindowInMilliseconds: 3000
        errorThresholdPercentage: 50
        requestVolumeThreshold: 5
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 5500
my-service:
  ribbon:
    OkToRetryOnAllOperations: false
    NFLoadBalancerRuleClassName: com.netflix.loadbalancer.WeightedResponseTimeRule
    listOfServers: ${LIST_OF_SERVERS}
    ConnectTimeout: 500
    ReadTimeout: 4500
    MaxAutoRetries: 0
    MaxAutoRetriesNextServer: 1000
    retryableStatusCodes: 404,502,503,504
Tests
In order to check your assumptions, you can play with the test on GitHub, which simulates single-threaded service instances with different latencies.

Eureka slow to remove instances

I am doing a POC with Eureka. When I shut down the service instance, it is currently taking about 3 minutes to no longer show in the Eureka console. I am presuming (but not confident) this also means this downed instance can still be discovered by clients?
With debugging on, I can see server running the evict task several times before it determines the lease is expired on the instance I shut down.
My settings are as follows. Client:
eureka.client.serviceUrl.defaultZone=http://localhost:8761/eureka/
eureka.instance.statusPageUrlPath=${management.context-path}/info
eureka.instance.healthCheckUrlPath=${management.context-path}/health
eureka.instance.leaseRenewalIntervalInSeconds=5
eureka.client.healthcheck.enabled=true
eureka.client.lease.duration=2
eureka.client.leaseExpirationDurationInSeconds=5
logging.level.com.netflix.eureka=DEBUG
logging.level.com.netflix.discovery=DEBUG
Server:
server.port=8761
eureka.client.register-with-eureka=false
eureka.client.fetch-registry=false
logging.level.com.netflix.eureka=DEBUG
logging.level.com.netflix.discovery=DEBUG
eureka.server.enableSelfPreservation=false
I have also tried these settings on server:
eureka.server.response-cache-update-interval-ms: 500
eureka.server.eviction-interval-timer-in-ms: 500
These seem to increase the frequency of the checks but do not decrease the time it takes the server to recognize that the instance is down.
Am I missing a setting? Is there a best practice for shutting down instances in production to make this instantaneous?
Thanks!

Understanding Spring Cloud Eureka Server self preservation and renew threshold

I am new to developing microservices, although I have been researching it for a while, reading both Spring's docs and Netflix's.
I have started a simple project available on GitHub. It is basically a Eureka server (Archimedes) and three Eureka client microservices (one public API and two private). Check the GitHub readme for a detailed description.
The point is that when everything is running, I would like the Eureka server to notice when one of the private microservices is killed and remove it from the registry.
I found this question on Stack Overflow, and the solution is to use enableSelfPreservation: false in the Eureka server config. Doing this, after a while the killed service disappears as expected.
However I can see the following message:
THE SELF PRESERVATION MODE IS TURNED OFF.THIS MAY NOT PROTECT INSTANCE
EXPIRY IN CASE OF NETWORK/OTHER PROBLEMS.
1. What is the purpose of self preservation? The doc states that with self preservation on, "clients can get the instances that do not exist anymore". So when is it advisable to have it on/off?
Furthermore, when self preservation is on, you may get a prominent warning message in the Eureka Server console:
EMERGENCY! EUREKA MAY BE INCORRECTLY CLAIMING INSTANCES ARE UP WHEN
THEY'RE NOT. RENEWALS ARE LESSER THAN THRESHOLD AND HENCE THE
INSTANCES ARE NOT BEING EXPIRED JUST TO BE SAFE.
Now, going on with the Spring Eureka Console.
Lease expiration enabled true/false
Renews threshold 5
Renews (last min) 4
I have come across a weird behaviour of the threshold count: when I start the Eureka Server alone, the threshold is 1.
2. I have a single Eureka server and it is configured with registerWithEureka: false to prevent it from registering on another server. Then, why does it show up in the threshold count?
3. For every client I start the threshold count increases by +2. I guess it is because they send 2 renew messages per min, am I right?
4. The Eureka server never sends a renew so the last min renews is always below the threshold. Is this normal?
renew threshold 5
renews last min: (client1) +2 + (client2) +2 -> 4
Server cfg:
server:
  port: ${PORT:8761}
eureka:
  instance:
    hostname: localhost
  client:
    registerWithEureka: false
    fetchRegistry: false
    serviceUrl:
      defaultZone: http://${eureka.instance.hostname}:${server.port}/eureka/
  server:
    enableSelfPreservation: false
    # waitTimeInMsWhenSyncEmpty: 0
Client 1 cfg:
spring:
  application:
    name: random-image-microservice
server:
  port: 9999
eureka:
  client:
    serviceUrl:
      defaultZone: http://localhost:8761/eureka/
    healthcheck:
      enabled: true
I had the same question as @codependent. I googled a lot and did some experiments; here I'd like to contribute some knowledge about how the Eureka server and instances work.
Every instance needs to renew its lease with the Eureka server once every 30 seconds, which can be defined via eureka.instance.leaseRenewalIntervalInSeconds.
Renews (last min): represents how many renewals were received from Eureka instances in the last minute.
Renews threshold: the number of renewals the Eureka server expects to receive from Eureka instances per minute.
For example, say registerWithEureka is set to false, eureka.instance.leaseRenewalIntervalInSeconds is set to 30, and you run 2 Eureka instances. The two instances will send 4 renewals to the Eureka server per minute. The Eureka server's minimal threshold is 1 (written in the code), so the threshold is 5 (this number is multiplied by the factor eureka.server.renewalPercentThreshold, which is discussed later).
SELF PRESERVATION MODE: if Renews (last min) is less than Renews threshold, self preservation mode will be activated.
So in the example above, SELF PRESERVATION MODE is activated, because the threshold is 5 but the Eureka server can only receive 4 renewals/min.
Question 1:
SELF PRESERVATION MODE is designed to cope with poor network connectivity. Say connectivity between Eureka instances A and B is good, but B fails to renew its lease with the Eureka server for a short period due to connectivity hiccups; at this point the Eureka server can't simply kick out instance B. If it did, instance A would not get the available registered service from the Eureka server even though B is available. So this is the purpose of SELF PRESERVATION MODE, and it's better to turn it on.
Question 2:
The minimal threshold of 1 is written in the code. registerWithEureka is set to false, so no Eureka instance registers, and the threshold stays at 1.
In a production environment we generally deploy two Eureka servers and registerWithEureka is set to true. So the threshold will be 2, and the Eureka servers renew their leases with each other twice a minute, so RENEWALS ARE LESSER THAN THRESHOLD won't be a problem.
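For illustration, a minimal sketch of one of the two peer-aware servers (peer1/peer2 are hypothetical hostnames; the second server mirrors this with the hostnames swapped):
eureka:
  instance:
    hostname: peer1
  client:
    registerWithEureka: true
    fetchRegistry: true
    serviceUrl:
      defaultZone: http://peer2:8761/eureka/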
Question 3:
Yes, you are right. eureka.instance.leaseRenewalIntervalInSeconds determines how many renewals are sent to the server per minute, and the expected count is multiplied by the factor eureka.server.renewalPercentThreshold mentioned above, whose default value is 0.85.
Question 4:
Yes, it's normal, because the threshold's initial value is set to 1. So if registerWithEureka is set to false, the renewals are always below the threshold.
I have two suggestions for this:
Deploy two Eureka servers and enable registerWithEureka (see the peer setup sketch above).
If you just want to deploy in a demo/dev environment, you can set eureka.server.renewalPercentThreshold to 0.49, so that when you start up a single Eureka server the threshold will be 0 (a minimal sketch follows below).
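A minimal sketch of that demo/dev tweak in the Eureka server's YAML config:
eureka:
  server:
    enableSelfPreservation: true    # keep self-preservation on (the default)
    renewalPercentThreshold: 0.49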
I've created a blog post with the details of Eureka here, that fills in some missing detail from Spring doc or Netflix blog. It is the result of several days of debugging and digging through source code. I understand it's preferable to copy-paste rather than linking to an external URL, but the content is too big for an SO answer.
You can try setting the renewal threshold limit in your Eureka server properties. If you have around 3 to 4 microservices to register on Eureka, then you can set it like this:
eureka.server.renewalPercentThreshold=0.33
server:
  enableSelfPreservation: false
If set to true, Eureka expects service instances to register themselves and to keep sending registration renewal requests every 30 s. Normally, if Eureka doesn't receive a renewal from a service for three renewal periods (90 s), it deregisters that instance. But if renewals drop below the expected threshold, Eureka assumes there's a network problem, enters self-preservation mode, and won't deregister service instances.
If set to false, self-preservation is disabled and instances that stop renewing are deregistered regardless of the renewal rate.
Even if you decide to disable self-preservation mode for development, you should leave it enabled when you go into production.

Why is my auto scaling EC2 instance reported as 'out of service' by the load balancer?

I'm having an issue with an Amazon EC2 instance during auto scaling. Every command I typed worked and I found no errors, but when testing whether auto scaling works I found that it only works until the new instance has started. The newly spawned instance does not work afterwards: it's attached to my load balancer, but its status is out of service. One more issue: when I copy and paste the public DNS name into the browser it does not respond, and an error like "Firefox can't find ..." is shown.
I doubt that there is a problem with the image or the Linux configuration.
Thanks in advance.
Although it's been a long time since you posted this, try adjusting the health check of the load balancer.
If your health check looks like this:
Ping Target: HTTP:80/index.php
Timeout: 10 seconds
Interval: 30 seconds
Unhealthy Threshold: 4
Healthy Threshold: 2
that means an instance will be marked out of service if the ping target doesn't respond within 10 seconds on 4 consecutive checks, while the ELB tries to reach it every 30 seconds.
Usually the fact that you get "Firefox can't find ..." when you try to access the instance directly means that the service is down. Try to log in to the instance and check whether the service is alive, and also check the firewall rules, which might block internet/ELB requests. Also check your ELB health check; it's a good place to start. If you still have issues, try to post some debug information such as the instance's netstat output, the ELB description, and its parameters.
In my case, rules on the security groups assigned to the instance and the load balancer were not allowing traffic to pass between the two. This caused the health check to fail, so the load balancer marked the instance as out of service.
If you don't have index.html in the document root of the instance, the default health check will fail. In my experience you can set a custom protocol, port, and path for the health check when creating the load balancer, as sketched below.
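For example, a hedged sketch of such a custom health check, expressed as the HealthCheck block of an AWS::ElasticLoadBalancing::LoadBalancer resource in CloudFormation YAML (the /health path is just an illustrative choice):
HealthCheck:
  Target: HTTP:80/health      # protocol:port/path the ELB pings
  Interval: 30                # seconds between checks
  Timeout: 10                 # seconds to wait for a response
  UnhealthyThreshold: 4       # consecutive failures before marking out of service
  HealthyThreshold: 2         # consecutive successes before marking in service again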
