Load balancing problems with Spring Cloud Kubernetes - spring-boot

We have Spring Boot services running in Kubernetes and are using the Spring Cloud Kubernetes load balancer functionality with RestTemplate to make calls to other Spring Boot services. One of the main reasons we have this in place is historical: we previously ran our services in EC2 using Eureka for service discovery, and after the migration we kept the Spring discovery client/client-side load balancing in place (updating dependencies etc. for it to work with the Spring Cloud Kubernetes project).
We have a problem that when one of the target pods goes down we get multiple request failures for a period of time with java.net.NoRouteToHostException, i.e. the Spring load balancer is still trying to send requests to that pod.
So I have a few questions on this:
Shouldn't the target instance get removed automatically when this happens? So a request might fail once, but after that the target pod list would be repaired?
Or, if not, is there some other configuration we need to add to handle this - e.g. retry / circuit breaker, etc.?
A more general question is what benefit does Spring's client-side load balancing bring with Kubernetes? Without it, our service would still be able to call other services using Kubernetes built-in service / load-balancing functionality and this should handle the issue of pods going down automatically. The Spring documentation also talks about being able to switch from POD mode to SERVICE mode (https://docs.spring.io/spring-cloud-kubernetes/docs/current/reference/html/index.html#loadbalancer-for-kubernetes). But isn't this service mode just what Kubernetes does automatically? I'm wondering if the simplest solution here isn't to remove the Spring Load Balancer altogether? What would we lose then?

An update on this: we had the spring-retry dependency in place, but the retry was not working because by default it only applies to GET requests, and most of our calls are POSTs (which are safe to call again). Adding the configuration spring.cloud.loadbalancer.retry.retryOnAllOperations: true fixed this, and hence most of these failures should be avoided by the retry using an alternative instance on the second attempt.
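For reference, here is how those retry settings might look in application.yml. The retry counts shown are illustrative Spring Cloud LoadBalancer defaults, not values from our setup:

```yaml
spring:
  cloud:
    loadbalancer:
      retry:
        enabled: true
        # retry non-idempotent operations (POST etc.) too, not just GETs
        retry-on-all-operations: true
        max-retries-on-same-service-instance: 0
        max-retries-on-next-service-instance: 1
```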
We have also added a RetryListener that clears the load balancer cache for the service on certain connection exceptions:
@Configuration
public class RetryConfig {

    private static final Logger logger = LoggerFactory.getLogger(RetryConfig.class);

    // Need to use the bean factory here as we can't autowire LoadBalancerCacheManager -
    // it's set to 'autowireCandidate = false' in LoadBalancerCacheAutoConfiguration
    @Autowired
    private BeanFactory beanFactory;

    @Bean
    public CacheClearingLoadBalancedRetryFactory cacheClearingLoadBalancedRetryFactory(ReactiveLoadBalancer.Factory<ServiceInstance> loadBalancerFactory) {
        return new CacheClearingLoadBalancedRetryFactory(loadBalancerFactory);
    }

    // Extension of the default bean that defines a retry listener
    public class CacheClearingLoadBalancedRetryFactory extends BlockingLoadBalancedRetryFactory {

        public CacheClearingLoadBalancedRetryFactory(ReactiveLoadBalancer.Factory<ServiceInstance> loadBalancerFactory) {
            super(loadBalancerFactory);
        }

        @Override
        public RetryListener[] createRetryListeners(String service) {
            RetryListener cacheClearingRetryListener = new RetryListener() {
                @Override
                public <T, E extends Throwable> boolean open(RetryContext context, RetryCallback<T, E> callback) { return true; }

                @Override
                public <T, E extends Throwable> void close(RetryContext context, RetryCallback<T, E> callback, Throwable throwable) {}

                @Override
                public <T, E extends Throwable> void onError(RetryContext context, RetryCallback<T, E> callback, Throwable throwable) {
                    logger.warn("Retry for service {} picked up exception: context {}, throwable class {}", service, context, throwable.getClass());
                    if (throwable instanceof ConnectTimeoutException || throwable instanceof NoRouteToHostException) {
                        try {
                            LoadBalancerCacheManager loadBalancerCacheManager = beanFactory.getBean(LoadBalancerCacheManager.class);
                            Cache loadBalancerCache = loadBalancerCacheManager.getCache(CachingServiceInstanceListSupplier.SERVICE_INSTANCE_CACHE_NAME);
                            if (loadBalancerCache != null) {
                                boolean result = loadBalancerCache.evictIfPresent(service);
                                logger.warn("Load Balancer Cache evictIfPresent result for service {} is {}", service, result);
                            }
                        } catch (Exception e) {
                            logger.error("Failed to clear load balancer cache", e);
                        }
                    }
                }
            };
            return new RetryListener[] { cacheClearingRetryListener };
        }
    }
}
Are there any issues with this approach? Could something like this be added to the built in functionality?

Shouldn't the target instance get removed automatically when this
happens? So it might happen once but after that the target pod list
will be repaired?
To resolve this issue you have to use readiness and liveness probes in Kubernetes.
The readiness probe checks the health of an endpoint your application exposes, at a configured interval. If the check fails, Kubernetes marks the pod as unready to accept traffic, so no traffic goes to that pod (replica).
The liveness probe restarts your application if it fails, so the container (pod) comes up again; once Kubernetes gets a 200 response from the app it marks the pod as ready to accept traffic again.
You can create a simple endpoint in the application that returns 200 or 204 as needed.
Read more at: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
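As a minimal sketch of the probes described above (a fragment of a pod/deployment spec; the paths and port are assumptions based on Spring Boot Actuator and must match your application's actual health endpoint):

```yaml
# assumes Spring Boot Actuator health groups on port 8080
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```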
Make sure your applications use the Kubernetes Service to talk to each other:
Application 1 > Kubernetes Service of App 2 > Application 2 pods
To enable load balancing based on the Kubernetes Service name, use the following property. The load balancer will then call the application using an address such as service-a.default.svc.cluster.local:
spring.cloud.kubernetes.loadbalancer.mode=SERVICE
The most typical way to use Spring Cloud LoadBalancer on Kubernetes is
with service discovery. If you have any DiscoveryClient on your
classpath, the default Spring Cloud LoadBalancer configuration uses it
to check for service instances. As a result, it only chooses from
instances that are up and running. All that is needed is to annotate
your Spring Boot application with @EnableDiscoveryClient to enable
K8s-native Service Discovery.
References : https://stackoverflow.com/a/68536834/5525824

Related

Standalone hazelcast expired event not working, but embedded hazelcast expired event does work

I used embedded Hazelcast on localhost:5701.
Later I changed the config. Now I use standalone Hazelcast instances on app-server:5701, 3 machines built into 1 cluster. Everything works fine, except the expired event.
The cache configuration is rather simple and common:
@Slf4j
@RequiredArgsConstructor
@Component
public class ExpiredListener implements EntryEvictedListener<String, Application>, EntryExpiredListener<String, Application> {

    private final ApplicationEventPublisher publisher;

    @Override
    public void entryEvicted(EntryEvent<String, Application> event) {
        publisher.publishEvent(new ApplicationCacheExpireEvent(this, event.getOldValue()));
    }

    @Override
    public void entryExpired(EntryEvent<String, Application> event) {
        publisher.publishEvent(new ApplicationCacheExpireEvent(this, event.getOldValue()));
    }
}
and
@Bean
public List<MapConfig> mapConfigs(final ExpiredListener expiredListener) {
    final MapConfig applicationCache = getMapConfig(APPLICATIONS_CACHE_NAME)
            .setMaxSizeConfig(new MaxSizeConfig()
                    .setMaxSizePolicy(MaxSizeConfig.MaxSizePolicy.USED_HEAP_SIZE)
                    .setSize(properties.getMaxMapHeapSize()))
            .setTimeToLiveSeconds(properties.getTtlBeforeSave())
            .addEntryListenerConfig(new EntryListenerConfig(expiredListener, false, true));
    final MapConfig warmingCacheMapConfig = getMapConfig(LOCK_MAP_NAME);
    return List.of(applicationCache, warmingCacheMapConfig);
}
UPDATE
The Hazelcast service is run in a Docker container.
UPDATE 2
I used setTtl on my IMap, but that broke my server (the REST API is not supported, even though the lib version and Hazelcast version are the same).
I used put(K, V, ttl, TimeUnit), and... it did not help!
I put <time-to-live-seconds> in the server XML config file, restarted it, and Hazelcast started to send events!
But then I removed those lines from the XML config, restarted the server, and it kept working... (I don't use map backups right now.)
So, tomorrow I will recheck everything. It seems like either the Hazelcast Docker container hangs or put(K, V, ttl, TimeUnit) works.
UPDATE 3 (Solution)
This exact combination works for me:
Set up the TTL in the config bean (see code above)
Use put(K, V, ttl, TimeUnit) instead of put(K, V)
Explicitly add the event listener to the Hazelcast map:
...
applications = hazelcastInstance.getMap(APPLICATIONS_CACHE_NAME);
applications.addEntryListener(new ExpiredListener(this.applicationEventPublisher), true);
...
Tested both with and without hazelcast server restart.

Fail-fast behavior for Eureka client

It seems this problem doesn't have a common solution, so I am trying to approach it from another side. The microservices infrastructure consists of Spring Boot microservices with Eureka-Zuul-Config-Admin servers as the service mesh. All microservices run inside Docker containers on the Kubernetes platform. Kubernetes monitors the application health check (liveness/readiness probes) and redeploys the application when the health check stays down longer than the liveness probe timeout.
The problem is the following: sometimes a microservice doesn't get the correct Eureka server address after redeployment. Service discovery registration fails, but the microservice continues working with its health check 'UP', and dependent microservices miss it.
The microservices are interdependent, and the failure of one microservice causes a cascade failure of all dependent microservices. I don't use Hystrix for various reasons, and it would not actually resolve my problem: missing data from the failed microservice just disables the entire functionality related to the set of dependent microservices.
Question: Is it possible to configure something like 'fail-fast' behavior for the Eureka client without writing a custom HealthIndicator? The actuator health check should be in the 'DOWN' state while the Eureka client hasn't received a 204 successful registration response from Eureka.
Here is an example of how I fixed it in code. It has pretty simple behavior: the health check goes down 'forever' after exceeding a timeout for successful registration in Eureka on start and/or during runtime. The main goal is that Kubernetes will redeploy the microservice when the liveness probe timeout is exceeded.
@Component
public class CustomHealthIndicator implements HealthIndicator {

    private static final Logger logger = LoggerFactory.getLogger(CustomHealthIndicator.class);

    @Autowired
    @Qualifier("eurekaClient")
    private EurekaClient eurekaClient;

    private static final int HEALTH_CHECK_DOWN_LIMIT_MIN = 15;
    private LocalDateTime healthCheckDownTimeLimit = getHealthCheckDownLimit();

    @Override
    public Health health() {
        int errCode = registeredInEureka();
        return errCode != 0
                ? Health.down().withDetail("Eureka registration fails", errCode).build()
                : Health.up().build();
    }

    private int registeredInEureka() {
        int status = 0;
        if (isStatusUp()) {
            healthCheckDownTimeLimit = getHealthCheckDownLimit();
        } else if (LocalDateTime.now().isAfter(healthCheckDownTimeLimit)) {
            logger.error("Exceeded {} min. limit for getting 'UP' state in Eureka", HEALTH_CHECK_DOWN_LIMIT_MIN);
            status = HttpStatus.GONE.value();
        }
        return status;
    }

    private boolean isStatusUp() {
        return eurekaClient.getInstanceRemoteStatus().compareTo(InstanceInfo.InstanceStatus.UP) == 0;
    }

    private LocalDateTime getHealthCheckDownLimit() {
        return LocalDateTime.now().plus(HEALTH_CHECK_DOWN_LIMIT_MIN, ChronoUnit.MINUTES);
    }
}
Is it possible to do the same by just configuring Spring components?

Understanding difference between Custom Handler and SpringBootApiGatewayRequestHandler

I'm new to Spring Cloud Function and came across it as one of the best solutions for developing FaaS-based applications. I am specifically writing an application for the AWS Lambda service behind API Gateway. I ran into a very interesting problem with my test application, around the Handler. My test application works well with a custom handler written as:
public class UserProfileHandler extends SpringBootRequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {
}
which works well when configured as the Handler in AWS Lambda. Then I came across org.springframework.cloud.function.adapter.aws.SpringBootApiGatewayRequestHandler, which is available in the Spring Cloud Function dependency. I wanted to get rid of UserProfileHandler, so I changed the Handler configuration in AWS Lambda to org.springframework.cloud.function.adapter.aws.SpringBootApiGatewayRequestHandler instead of ...UserProfileHandler, and now the lambda fails with the following error message. Has anyone run into this problem?
{
    "errorMessage": "java.util.Optional cannot be cast to com.amazonaws.services.lambda.runtime.events.APIGatewayProxyRequestEvent",
    "errorType": "java.lang.ClassCastException",
    "stackTrace": [
        "com.transformco.hs.css.userprofile.function.UserProfileFunction.apply(UserProfileFunction.java:16)",
        "org.springframework.cloud.function.context.catalog.BeanFactoryAwareFunctionRegistry$FunctionInvocationWrapper.invokeFunction(BeanFactoryAwareFunctionRegistry.java:499)",
        "org.springframework.cloud.function.context.catalog.BeanFactoryAwareFunctionRegistry$FunctionInvocationWrapper.lambda$doApply$1(BeanFactoryAwareFunctionRegistry.java:543)",
        "reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:107)",
        "reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onNext(FluxMapFuseable.java:121)",
        "reactor.core.publisher.FluxJust$WeakScalarSubscription.request(FluxJust.java:99)",
        "reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.request(FluxMapFuseable.java:162)",
        "reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.request(FluxMapFuseable.java:162)",
        "reactor.core.publisher.BlockingIterable$SubscriberIterator.onSubscribe(BlockingIterable.java:218)",
        "reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onSubscribe(FluxMapFuseable.java:90)",
        "reactor.core.publisher.FluxMapFuseable$MapFuseableSubscriber.onSubscribe(FluxMapFuseable.java:90)",
        "reactor.core.publisher.FluxJust.subscribe(FluxJust.java:70)",
        "reactor.core.publisher.InternalFluxOperator.subscribe(InternalFluxOperator.java:53)",
        "reactor.core.publisher.BlockingIterable.iterator(BlockingIterable.java:80)",
        "org.springframework.cloud.function.adapter.aws.SpringBootRequestHandler.result(SpringBootRequestHandler.java:59)",
        "org.springframework.cloud.function.adapter.aws.SpringBootRequestHandler.handleRequest(SpringBootRequestHandler.java:52)",
        "org.springframework.cloud.function.adapter.aws.SpringBootApiGatewayRequestHandler.handleRequest(SpringBootApiGatewayRequestHandler.java:140)",
        "org.springframework.cloud.function.adapter.aws.SpringBootApiGatewayRequestHandler.handleRequest(SpringBootApiGatewayRequestHandler.java:43)"
    ]
}
Ganesh, I believe you have already raised this issue in the Spring Cloud Function GitHub repository. As I stated there, we have recently made several enhancements, polished the sample, and updated the documentation by adding a Getting Started guide.
That said, with the new generic request handler you no longer need to provide an implementation of the AWS request handler, including SpringBootApiGatewayRequestHandler.
Simply write your Boot application to contain a function bean:
@SpringBootApplication
public class FunctionConfiguration {

    public static void main(String[] args) {
        SpringApplication.run(FunctionConfiguration.class, args);
    }

    @Bean
    public Function<String, String> uppercase() {
        return value -> value.toUpperCase();
    }
}
. . . and specify org.springframework.cloud.function.adapter.aws.FunctionInvoker as the handler in the AWS dashboard. We'll do the rest for you.
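When the application contains more than one function bean, the one FunctionInvoker routes to can be selected via the standard Spring Cloud Function property; a sketch, where uppercase is the bean name from the example above:

```yaml
spring:
  cloud:
    function:
      definition: uppercase
```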

Propagating errors between Hazelcast Server and Hazelcast Client

I have the following scenario:
- a Hazelcast server as a microservice, which performs some computations when it receives a method call.
- a Hazelcast client as another microservice, which calls the Hazelcast server through the specified method call.
I want that when I throw an exception from the Hazelcast server, I receive it on the Hazelcast client side as it is (currently, I'm receiving something like this: java.util.concurrent.ExecutionException: com.hazelcast.client.UndefinedErrorCodeException: Class name: ro.orange.eshop.personalisationengineapi.application.exception.ValidationException).
I've dug a little into the APIs, and on the Hazelcast client side I've found a way to register a new exception:
@Bean
fun addHazelcastKnownExceptions(hazelcastInstance: HazelcastInstance): Int {
    val hazelcastClientInstance = (hazelcastInstance as HazelcastClientProxy).client
    hazelcastClientInstance.clientExceptionFactory.register(400, ValidationException::class.java) { message, cause -> ValidationException(message, cause) }
    return 1
}
But it seems that this exception must also be registered on the server side. And here comes the problem! On the server side I've found a class called ClientExceptions, which has a method public void register(int errorCode, Class clazz), but I can't find a way to obtain a ClientExceptions instance (I should mention that I'm using Hazelcast Spring).
Thank you!
Registering a custom exception factory is not supported as a public API as of 3.12.x.
Related issue to follow: https://github.com/hazelcast/hazelcast/issues/9753
As a workaround, I could suggest using the class name (UndefinedErrorCodeException.getOriginalClassName()) to recreate exception classes on the client side.
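As a sketch of that workaround (a hypothetical helper, not part of the Hazelcast API): on the client, catch the UndefinedErrorCodeException, read the original class name, and rebuild the exception reflectively, falling back to a generic exception when the class has no (String) constructor or is not on the client classpath:

```java
import java.lang.reflect.Constructor;

public class ExceptionRecreator {

    // Try to rebuild the server-side exception from its class name and message.
    // Falls back to a RuntimeException when the class is missing on the client
    // classpath, is not a Throwable, or has no (String) constructor.
    public static Throwable recreate(String originalClassName, String message) {
        try {
            Class<?> clazz = Class.forName(originalClassName);
            if (Throwable.class.isAssignableFrom(clazz)) {
                Constructor<?> ctor = clazz.getConstructor(String.class);
                return (Throwable) ctor.newInstance(message);
            }
        } catch (ReflectiveOperationException ignored) {
            // fall through to the generic fallback below
        }
        return new RuntimeException(originalClassName + ": " + message);
    }

    public static void main(String[] args) {
        Throwable t = recreate("java.lang.IllegalStateException", "validation failed");
        System.out.println(t.getClass().getName());
        System.out.println(t.getMessage());
    }
}
```

In the catch block you would call something like recreate(e.getOriginalClassName(), e.getMessage()) and rethrow the result, so callers see the original exception type whenever it is available locally.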
== EDIT ==
The client API does not support it. You have found the private API.
If you are OK with relying on the private API, here is the hack for registering classes on the Hazelcast server.
Note that I DO NOT recommend this solution, since it relies on a private API that can change:
HazelcastInstance instance = Hazelcast.newHazelcastInstance();
if (instance instanceof HazelcastInstanceProxy) {
    HazelcastInstanceImpl original = ((HazelcastInstanceProxy) instance).getOriginal();
    ClientExceptions clientExceptions = original.node.getClientEngine().getClientExceptions();
    clientExceptions.register(USER_EXCEPTIONS_RANGE_START + 1, UndefinedCustomFormatException.class);
}

TimeoutException in edge using Feign and Hystrix

I'm building a training app with the Netflix microservices APIs.
This is my edge service, starting on localhost:9999:
@EnableHystrix
@EnableZuulProxy
@EnableEurekaClient
@SpringBootApplication
public class EdgeApplication {

    public static void main(String[] args) {
        SpringApplication.run(EdgeApplication.class, args);
    }
}
I defined the two following apps:
app-a exposes a simple web service service-a and starts on localhost:8081
app-b exposes a web service service-b, which calls service-a, and starts on localhost:8082
service-b calls service-a using Netflix Feign:
@FeignClient(value = "app-a", fallback = AppAFallback.class)
public interface AppAClient {

    @RequestMapping(value = "service-a", method = RequestMethod.GET)
    List<Entity> serviceA();
}

@Component
public class AppAFallback implements AppAClient {

    private static final Entity DEFAULT_ENTITY = new Entity();

    @Override
    public List<Entity> serviceA() {
        return Collections.singletonList(DEFAULT_ENTITY);
    }
}
While app-a and app-b are running, every service answers as expected:
http://localhost:8081/service-a
http://localhost:8082/service-b
http://localhost:9999/app-a/service-a (through edge)
http://localhost:9999/app-b/service-b (through edge)
The fallback AppAFallback should be called when app-b is down. However, I have to wait about 1 minute before that happens.
Just after the app-b is down:
http://localhost:8081/service-a works well and the fallback is called
http://localhost:8082/service-b is not reachable
http://localhost:9999/app-a/service-a TIMEOUT : HystrixRuntimeException: app-a timed-out and no fallback available.
http://localhost:9999/app-b/service-b TIMEOUT : HystrixRuntimeException: app-b timed-out and no fallback available.
And 1 minute after app-b is down:
http://localhost:8081/service-a works well and the fallback is called
http://localhost:8082/service-b is not reachable
http://localhost:9999/app-a/service-a works well and the fallback is called
http://localhost:9999/app-b/service-b GENERAL : load balancer does not have available server for client: app-b
And this is the result I expected. Any idea why the calls to app-a/service-a just after app-b goes down give me a TIMEOUT?
Thanks in advance for your help.
I've experienced the same problem, and I think (not tested) it's Eureka's update frequency that causes it. Just after app-b goes down, Eureka still thinks app-b is up (it hasn't yet checked the heartbeat). About a minute after app-b goes down, Eureka knows app-b is down and tells app-a there is no app-b, so the fallback fires immediately.
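The delay comes from several refresh intervals stacked on top of each other. If faster failover matters more than the extra load on Eureka, these intervals can be tightened; the values below are illustrative, not recommendations:

```yaml
eureka:
  instance:
    leaseRenewalIntervalInSeconds: 5        # heartbeat frequency (default 30)
    leaseExpirationDurationInSeconds: 15    # server evicts after this long without heartbeats (default 90)
  client:
    registryFetchIntervalSeconds: 5         # how often clients refresh the registry (default 30)
ribbon:
  ServerListRefreshInterval: 5000           # Ribbon server list refresh in ms (default 30000)
```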
