Customize retry behaviour with Ribbon and Hystrix - microservices

Objective
I have a task to write API Gateway & load balancer with the following objectives:
Gateway/LB should redirect requests to instances of a 3rd-party service (no code changes on their side, i.e. client-side service discovery)
Each service instance can process only a single request at a time; a concurrent request gets an immediate error response.
Service response latency is 0-5 seconds. I can't cache their responses, so as I understand it a fallback is not an option for me. A timeout is not an option either, because latency is random and there is no guarantee of getting a better one from another instance.
My solution
Spring Boot Cloud Netflix: Zuul-Hystrix-Ribbon. Two approaches:
Retry. Ribbon retries with a fixed interval or exponential backoff. I failed to make it work; the best result I achieved is MaxAutoRetriesNextServer: 1000, where Ribbon fires retries immediately and spams the downstream services.
Circuit Breaker. Instead of adding an exponential wait period in Ribbon, I can open the circuit after a few failures and redirect requests to other instances. This is also not the best approach, for two reasons: a) having only a few instances, each with 0-5 s latency, means all circuits open very quickly and the request fails to be served; b) my configuration doesn't work for some reason.
Question
How can I make Ribbon wait between retries? Or can I solve my problem with Circuit Breaker?
My configuration
The full config can be found on GitHub.
ribbon:
  eureka:
    enabled: false
  # Obsolete option (Apache HttpClient by default), but without this Ribbon doesn't retry against other instances
  restclient:
    enabled: true
hystrix:
  command:
    my-service:
      circuitBreaker:
        sleepWindowInMilliseconds: 3000
        errorThresholdPercentage: 50
        requestVolumeThreshold: 5
      execution:
        isolation:
          thread:
            timeoutInMilliseconds: 5500
my-service:
  ribbon:
    OkToRetryOnAllOperations: false
    NFLoadBalancerRuleClassName: com.netflix.loadbalancer.WeightedResponseTimeRule
    listOfServers: ${LIST_OF_SERVERS}
    ConnectTimeout: 500
    ReadTimeout: 4500
    MaxAutoRetries: 0
    MaxAutoRetriesNextServer: 1000
    retryableStatusCodes: 404,502,503,504
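For what it's worth, the plain Ribbon properties above don't expose a wait between retries at all; according to the Spring Cloud Netflix docs, with spring-retry on the classpath a BackOffPolicy can be plugged in by declaring a LoadBalancedRetryFactory bean and overriding its createBackOffPolicy method. A hedged sketch of the property side only (the flags below are assumptions about the desired setup, not part of my current config):
zuul:
  retryable: true        # let Zuul retry failed requests through Ribbon/Spring Retry
spring:
  cloud:
    loadbalancer:
      retry:
        enabled: true    # enabled by default when spring-retry is on the classpath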
Tests
To check your assumptions, you can play with the test on GitHub, which simulates single-threaded service instances with different latencies.

Related

k8s probes for spring boot services, best practices

I'm configuring startup/liveness/readiness probes for Kubernetes deployments serving Spring Boot services. According to the Spring Boot documentation, it's best practice to use the corresponding liveness & readiness actuator endpoints, as described here:
https://spring.io/blog/2020/03/25/liveness-and-readiness-probes-with-spring-boot
What do you use for your startup probe?
What are your recommendations for failureThreshold, delay, period and timeout values?
Did you encounter issues when deploying istio sidecars to an existing setup?
I use the paths /actuator/health/readiness and /actuator/health/liveness:
readinessProbe:
  initialDelaySeconds: 120
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
  failureThreshold: 3
  httpGet:
    scheme: HTTP
    path: /actuator/health/readiness
    port: 8080
livenessProbe:
  initialDelaySeconds: 120
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5
  failureThreshold: 3
  httpGet:
    scheme: HTTP
    path: /actuator/health/liveness
    port: 8080
For the recommendations, it depends on your needs and policies, actually (https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/).
No istio sidecar issues with this :)
Do not forget to activate the endpoints in the properties (cf. https://www.baeldung.com/spring-liveness-readiness-probes):
management.endpoint.health.probes.enabled=true
management.health.livenessState.enabled=true
management.health.readinessState.enabled=true
In addition to Bguess's answer and the part:
for the recommendations, it depends on your needs and policies
Some of our microservices communicate with heavy/slow containers that need time to boot up, so we have to protect them with startup probes. The recommended way from https://kubernetes.io/ is to reuse the liveness probe, which, combined with the Spring Boot actuator endpoints, results in:
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: http
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: http
  periodSeconds: 5
startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: http
  failureThreshold: 25
  periodSeconds: 10
The above setup makes sure that we only probe liveness and readiness once the application is fully started (it has 25 * 10 = 250 seconds to do so). As the doc says:
If such a probe is configured, it disables liveness and readiness checks until it succeeds, making sure those probes don't interfere with the application startup
Please note that
management.endpoint.health.probes.enabled=true
is not needed for the applications running in Kubernetes (doc)
These health groups are automatically enabled only if the application runs in a Kubernetes environment. You can enable them in any environment by using the management.endpoint.health.probes.enabled configuration property.
Hence you only need it if you want to check the probes locally, for example.
The startup probe is optional.
Originally, there were two types of probes in Kubernetes: readiness and liveness. However, people encountered issues with slow-starting containers. When a container takes a long time to start, Kubernetes performs the first liveness check after initialDelaySeconds. If the check fails, Kubernetes retries failureThreshold times with an interval of periodSeconds. If the liveness probe still fails, Kubernetes assumes that the container is not alive and restarts it. Unfortunately, the container will likely fail again, resulting in an endless cycle of restarts.
You may want to increase failureThreshold and periodSeconds to avoid the endless restarting, but it can cause longer detection and recovery times in case of a thread deadlock.
You may want to make initialDelaySeconds longer to allow the container sufficient time to start. However, it can be challenging to determine the appropriate delay, since your application may run on various kinds of hardware. For instance, increasing initialDelaySeconds to 60 seconds to avoid this problem in one environment may cause an unnecessarily slow startup when the service is deployed on more advanced hardware that only needs 20 seconds to start. In that scenario, Kubernetes waits 60 seconds for the first liveness check, the pod sits idle for 40 seconds, and it still takes 60 seconds before it can serve traffic.
To address this issue, Kubernetes introduced the startup probe in 1.16, which defers all other probes until a pod completes its startup process. For slow-starting pods, the startup probe can poll at short intervals with a high failure threshold until it is satisfied, at which point the other probes can begin.
If a container's components take a long time to become ready, except for the API component, the container can simply report 200 in the liveness probe and the startup probe is not needed. Because the API component will be ready and report 200 very soon, Kubernetes will not restart the container endlessly; it will patiently wait until the readiness probes indicate that the containers are all "ready" and only then route traffic to the pod.
The startup probe can be implemented in the same way as the liveness probe. Once the startup probe confirms that the container is initialized, the liveness probe will immediately report that the container is alive, leaving no room for Kubernetes to mistakenly restart the container.
Regarding initialDelaySeconds, periodSeconds, failureThreshold, and the timeout, it is really a balance between sensitivity and false positives. For example, with a high failureThreshold and a high periodSeconds on the readiness probe, k8s cannot detect issues in the container in a timely way and your pod continues to take traffic, so many requests will fail. With a low failureThreshold and a low periodSeconds, a temporary problem could take the pod out of traffic, which is a false positive. I tend to prefer a failureThreshold of 3, a periodSeconds of 5, and a successThreshold of 1 or 2.
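Expressed as a probe spec, that preference would look something like the sketch below (the port is an assumption, adjust it to your container):
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080        # assumed container port
  periodSeconds: 5
  failureThreshold: 3
  successThreshold: 1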
BTW, don't use the default health checks from Spring Boot as-is; you always need to customize them. More details here: https://danielw.cn/health-check-probes-in-k8s
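To give one concrete flavour of such a customization, Spring Boot lets you add your own health indicators to the probe groups via configuration. A hedged sketch, assuming an indicator named db (e.g. the auto-configured DataSource check) exists in the application:
management:
  endpoint:
    health:
      group:
        readiness:
          include: readinessState,db   # report readiness only when the assumed 'db' indicator is also up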

GCP Postgres connections consumed when stressing spring boot service (using SQL Cloud proxy)

I'm working on a Spring Boot service to post data and persist it in a GCP Cloud SQL Postgres database. The problem is that when I stress the service with requests, I get an SQL exception about all available connections being consumed:
"FATAL: remaining connection slots are reserved for non-replication superuser connections"
I figured out the issue and added a proper Hikari configuration to limit the connections used and to set a limit for when idle connections are closed; here is my properties.yml configuration:
type: com.zaxxer.hikari.HikariDataSource
hikari:
  initializationFailTimeout: 30000
  idle-timeout: 30000
  minimum-idle: 5
  maximum-pool-size: 15
  connection-timeout: 20000
  max-lifetime: 1000
The service works fine with this setup when I run it locally against the same database, but it consumes all available connections when I run it from my cloud setup, and I then get the same exception.
IMPORTANT! I'm using the Cloud SQL proxy to connect to the database.
Here are screenshots of the available database connections:
1- before running the service
2- after running the service locally
3- after running the service from the cloud
After a few days of investigating this problem, we found a solution that mitigates it but does not solve it completely (the ideal solution is mentioned at the end).
If you want to keep using the Cloud SQL proxy, then you need to accept that you don't have full control over your database connection configuration, as the Cloud SQL proxy might keep those connections alive for longer than you've configured (source).
To mitigate this problem, we used Cloud SQL proxy 1.19.2 from the Docker registry together with this Hikari configuration:
hikari:
  idle-timeout: 30000       # maximum amount of time (in milliseconds) that a connection is allowed to sit idle in the pool
  minimum-idle: 1           # minimum number of idle connections that HikariCP tries to maintain in the pool; if idle connections dip below this value, HikariCP will make a best effort to add more quickly and efficiently
  maximum-pool-size: 15     # maximum size that the pool is allowed to reach, including both idle and in-use connections; this value determines the maximum number of actual connections to the database backend
  connection-timeout: 20000 # maximum number of milliseconds that a client will wait for a connection
  max-lifetime: 20000       # maximum lifetime in milliseconds of a connection in the pool; an in-use connection is never retired, it is only removed after it is closed
The proper solution in this case is to use a Shared VPC to get a private connection to your database, where you rely on your database driver alone to make those connections.
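For completeness, a hedged sketch of what the datasource could look like once the database is reachable over its private IP through the Shared VPC (the IP, database name, and credentials below are placeholders, not values from this setup):
spring:
  datasource:
    url: jdbc:postgresql://10.0.0.5:5432/mydb   # placeholder private IP and database name
    username: app_user                          # placeholder
    password: ${DB_PASSWORD}
    hikari:
      maximum-pool-size: 15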

Observing more “dial tcp : I/O timeout” during the performance test execution in K6

In k6, I'm observing more failed requests in my performance test execution, with "dial tcp: I/O timeout" errors. Please suggest any fine-tuning I may have missed in k6.
With low concurrency, say 225 users, there are no issues, but when I increase the users to 300 I face this issue. I'm using a MacBook for the test execution.
This error indicates that your server under test isn't able to keep up with TCP connection attempts from k6, which is usually a hint that you're reaching the performance limits of what your server is able to deliver.
At this point you would have to tweak your server settings or improve the app's performance to reach the levels you're aiming for. One sanity check you can do on the server side is to confirm that the maximum number of file descriptors (which includes network sockets) is sufficient for your test. See ulimit.
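If the server under test runs in containers, one place where that file-descriptor limit can be raised is the container spec itself. A hedged docker-compose sketch (the service name, image, and limits are assumptions):
services:
  server-under-test:
    image: my-server:latest   # placeholder image
    ulimits:
      nofile:
        soft: 65536
        hard: 65536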

Eureka slow to remove instances

I am doing a POC with Eureka. When I shut down a service instance, it currently takes about 3 minutes for it to stop showing in the Eureka console. I presume (but am not confident) this also means the downed instance can still be discovered by clients?
With debugging on, I can see the server running the evict task several times before it determines that the lease on the instance I shut down has expired.
My client settings:
eureka.client.serviceUrl.defaultZone=http://localhost:8761/eureka/
eureka.instance.statusPageUrlPath=${management.context-path}/info
eureka.instance.healthCheckUrlPath=${management.context-path}/health
eureka.instance.leaseRenewalIntervalInSeconds=5
eureka.client.healthcheck.enabled=true
eureka.client.lease.duration=2
eureka.client.leaseExpirationDurationInSeconds=5
logging.level.com.netflix.eureka=DEBUG
logging.level.com.netflix.discovery=DEBUG
Server:
server.port=8761
eureka.client.register-with-eureka=false
eureka.client.fetch-registry=false
logging.level.com.netflix.eureka=DEBUG
logging.level.com.netflix.discovery=DEBUG
eureka.server.enableSelfPreservation=false
I have also tried these settings on the server:
eureka.server.response-cache-update-interval-ms: 500
eureka.server.eviction-interval-timer-in-ms: 500
These seem to increase the frequency of the checks but do not decrease the time it takes the server to recognize that the instance is down.
Am I missing a setting? Is there a best practice for shutting down instances in production to make this (nearly) instantaneous?
Thanks!
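A hedged side note: the lease-related properties are normally read from the eureka.instance.* namespace on the client rather than eureka.client.*; an illustrative sketch (the values are examples only):
eureka:
  instance:
    leaseRenewalIntervalInSeconds: 5        # heartbeat interval
    leaseExpirationDurationInSeconds: 15    # how long the server waits after the last heartbeat before considering the lease expired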

Understanding Spring Cloud Eureka Server self preservation and renew threshold

I am new to developing microservices, although I have been researching them for a while, reading both Spring's docs and Netflix's.
I have started a simple project available on GitHub. It is basically a Eureka server (Archimedes) and three Eureka client microservices (one public API and two private). Check the GitHub readme for a detailed description.
The point is that when everything is running, I would like the Eureka server to realize when one of the private microservices is killed and remove it from the registry.
I found this question on Stack Overflow, and the solution involves setting enableSelfPreservation: false in the Eureka server config. After doing this, the killed service disappears after a while, as expected.
However I can see the following message:
THE SELF PRESERVATION MODE IS TURNED OFF.THIS MAY NOT PROTECT INSTANCE
EXPIRY IN CASE OF NETWORK/OTHER PROBLEMS.
1. What is the purpose of self-preservation? The doc states that with self-preservation on, "clients can get the instances that do not exist anymore". So when is it advisable to have it on/off?
Furthermore, when self-preservation is on, you may get a prominent warning in the Eureka server console:
EMERGENCY! EUREKA MAY BE INCORRECTLY CLAIMING INSTANCES ARE UP WHEN
THEY'RE NOT. RENEWALS ARE LESSER THAN THRESHOLD AND HENCE THE
INSTANCES ARE NOT BEING EXPIRED JUST TO BE SAFE.
Now, moving on to the Spring Eureka console:
Lease expiration enabled true/false
Renews threshold 5
Renews (last min) 4
I have come across a weird behaviour of the threshold count: when I start the Eureka Server alone, the threshold is 1.
2. I have a single Eureka server and it is configured with registerWithEureka: false to prevent it from registering on another server. Then why does it show up in the threshold count?
3. For every client I start, the threshold count increases by 2. I guess it is because they send 2 renew messages per minute, am I right?
4. The Eureka server never sends a renew, so the last-minute renews count is always below the threshold. Is this normal?
renews threshold: 5
renews (last min): (client1) 2 + (client2) 2 -> 4
Server cfg:
server:
  port: ${PORT:8761}
eureka:
  instance:
    hostname: localhost
  client:
    registerWithEureka: false
    fetchRegistry: false
    serviceUrl:
      defaultZone: http://${eureka.instance.hostname}:${server.port}/eureka/
  server:
    enableSelfPreservation: false
    # waitTimeInMsWhenSyncEmpty: 0
Client 1 cfg:
spring:
  application:
    name: random-image-microservice
server:
  port: 9999
eureka:
  client:
    serviceUrl:
      defaultZone: http://localhost:8761/eureka/
    healthcheck:
      enabled: true
I had the same questions as codependent, googled a lot, and did some experiments. Here I'd like to contribute some knowledge about how the Eureka server and its instances work.
Every instance needs to renew its lease with the Eureka server once every 30 seconds, which can be defined via eureka.instance.leaseRenewalIntervalInSeconds.
Renews (last min): how many renewals the Eureka server received from instances in the last minute.
Renews threshold: the number of renewals the Eureka server expects to receive from instances per minute.
For example, say registerWithEureka is set to false, eureka.instance.leaseRenewalIntervalInSeconds is set to 30, and you run 2 Eureka instances. The two instances will send 4 renewals to the Eureka server per minute; the Eureka server's minimal threshold is 1 (written in the code), so the threshold is 5 (this number is multiplied by the factor eureka.server.renewalPercentThreshold, which is discussed later).
SELF PRESERVATION MODE: if Renews (last min) is less than the Renews threshold, self-preservation mode is activated.
So in the example above, self-preservation mode is activated, because the threshold is 5 but the Eureka server can only receive 4 renewals/min.
Question 1:
SELF PRESERVATION MODE is designed to cope with poor network connectivity. Say connectivity between Eureka instances A and B is good, but B fails to renew its lease with the Eureka server for a short period due to connectivity hiccups; at this point the Eureka server can't simply kick out instance B. If it did, instance A would not get the registered service from the Eureka server even though B is available. So this is the purpose of SELF PRESERVATION MODE, and it's better to turn it on.
Question 2:
The minimal threshold of 1 is written in the code. registerWithEureka is set to false, so no Eureka instance registers itself, and the threshold stays at 1.
In a production environment we generally deploy two Eureka servers, and registerWithEureka is set to true. So the threshold becomes 2, each Eureka server renews its lease twice per minute, and RENEWALS ARE LESSER THAN THRESHOLD won't be a problem.
Question 3:
Yes, you are right. eureka.instance.leaseRenewalIntervalInSeconds determines how many renewals are sent to the server per minute, but this is multiplied by the factor eureka.server.renewalPercentThreshold mentioned above, whose default value is 0.85.
Question 4:
Yes, it's normal, because the threshold's initial value is set to 1. So if registerWithEureka is set to false, the renewals are always below the threshold.
I have two suggestions for this:
Deploy two Eureka servers and enable registerWithEureka.
If you just want to deploy to a demo/dev environment, you can set eureka.server.renewalPercentThreshold to 0.49, so that when you start up a Eureka server alone, the threshold will be 0.
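For the first suggestion, a minimal sketch of a peer-aware setup, assuming hostnames peer1 and peer2 (not part of the original project):
# application.yml on peer1 (mirror it on peer2 with the hostnames swapped)
eureka:
  instance:
    hostname: peer1
  client:
    registerWithEureka: true
    fetchRegistry: true
    serviceUrl:
      defaultZone: http://peer2:8761/eureka/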
I've created a blog post with the details of Eureka here, which fills in some details missing from the Spring docs and the Netflix blog. It is the result of several days of debugging and digging through source code. I understand it's preferable to copy-paste rather than link to an external URL, but the content is too big for an SO answer.
You can try setting the renewal threshold limit in your Eureka server properties. If you have around 3 to 4 microservices to register on Eureka, then you can set it to this:
eureka.server.renewalPercentThreshold=0.33
server:
  enableSelfPreservation: false
Eureka expects service instances to register themselves and to keep sending registration renewal requests every 30 s. Normally, if Eureka doesn't receive a renewal from a service for three renewal periods (or 90 s), it deregisters that instance.
If self-preservation is set to true and the renewals received drop below the expected threshold, Eureka assumes there's a network problem, enters self-preservation mode, and won't deregister service instances. If it is set to false, this protection is disabled and instances are deregistered as described above.
Even if you decide to disable self-preservation mode for development, you should leave it enabled when you go into production.
