Different Retry objects based on error type in Spring WebFlux

I've written a WebClient-based client that talks to an external service, using Spring and Project Reactor. The external service sometimes throttles incoming requests. How can I apply a different Retry depending on the response or exception?
For instance:
If the response has a 429 status code, create a Retry that waits for the duration given in a response header (e.g. Retry-After); or else,
If it is some other error, for example a 5XX, retry with exponential backoff:
Retry.backoff(MAX_ATTEMPTS, Duration.ofMillis(MILLIS))
The client API call code is:
.bodyValue(inputQuery)
.retrieve()
.bodyToMono(QueryResult.class)
.retryWhen(customStrategy)
.doOnError(ex -> log.debug("API invocation error: ", ex));
customStrategy would hold the logic that decides which Retry object to create. Is there any way to achieve this?

You can chain more than one .retryWhen in the statement and give each Retry object a filter that matches only the errors it should handle.
Something like this:
.bodyToMono(String.class)
.retryWhen(
    Retry
        .backoff(3, Duration.ofSeconds(5))
        .filter(throwable -> throwable instanceof WebClientResponseException.TooManyRequests)
)
.retryWhen(
    Retry
        .fixedDelay(3, Duration.ofSeconds(1))
        // instanceof guard: without it, any non-HTTP error reaching this filter would cause a ClassCastException
        .filter(throwable -> throwable instanceof WebClientResponseException
            && ((WebClientResponseException) throwable).getStatusCode().is5xxServerError())
)
.block();
Here, you can customize the filters and the Retry objects any way you want.
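The question also asks for the 429 delay to come from the Retry-After header. A minimal sketch of that delay selection follows, written against the JDK only so that it does not depend on a particular Reactor version. The class name RetryDelays, the method delayFor and the BASE/MAX constants are hypothetical, and the sketch assumes the header carries a number of seconds (the HTTP-date form is not handled).

```java
import java.time.Duration;

class RetryDelays {
    // Assumed values for illustration; tune them for the real service.
    static final Duration BASE = Duration.ofMillis(500);
    static final Duration MAX = Duration.ofSeconds(30);

    /**
     * Picks the wait before the next retry.
     *
     * @param status     HTTP status code of the failed response
     * @param retryAfter raw Retry-After header value, or null if absent
     * @param attempt    zero-based index of the retry about to be made
     */
    static Duration delayFor(int status, String retryAfter, int attempt) {
        if (status == 429 && retryAfter != null) {
            try {
                // Honour the server-provided delay, never going below zero.
                return Duration.ofSeconds(Math.max(0L, Long.parseLong(retryAfter.trim())));
            } catch (NumberFormatException e) {
                // Unparseable header: fall through to exponential backoff.
            }
        }
        // Exponential backoff: BASE * 2^attempt, capped at MAX.
        long factor = 1L << Math.min(Math.max(attempt, 0), 20);
        Duration delay = BASE.multipliedBy(factor);
        return delay.compareTo(MAX) > 0 ? MAX : delay;
    }
}
```

In a Reactor pipeline, a helper like this could presumably be called from a custom Retry (for example one built with Retry.from), reading the status and headers from the WebClientResponseException carried by the retry signal.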

Related

Project reactor - react to timeout happened downstream

Project Reactor has a variety of timeout() operators.
The most basic variant raises a TimeoutException if no item arrives within the given Duration. The exception is propagated downstream, and a cancel signal is sent upstream.
Basically, my question is: is it possible to react (and do something) specifically to a timeout that happened downstream, rather than just to the cancellation that is sent after the timeout?
The question comes from a requirement in a real business case, and I'm also wondering whether there is a straightforward solution.
I'll simplify my code to make it clearer what I want to achieve.
Let's say I have the following reactive pipeline:
Flux.fromIterable(List.of(firstClient, secondClient))
    .concatMap(Client::callApi) // making API calls sequentially
    .collectList() // collecting results of API calls for further processing
    .timeout(Duration.ofMillis(3000)) // the entire process should not take more than duration specified
    .subscribe();
I have multiple clients for making API calls. The business requirement is to call them sequentially, so I call them with concatMap(). Then I need to collect all the results, and the entire process should not take longer than a given Duration.
The Client interface:
interface Client {
    Mono<Result> callApi();
}
And the implementations:
Client firstClient = () ->
    Mono.delay(Duration.ofMillis(2000L)) // simulating delay of first api call
        .map(__ -> new Result())
        // !!! Pseudo-operator just to demonstrate what I want to achieve
        .doOnTimeoutDownstream(() ->
            log.info("First API call canceled due to downstream timeout!")
        );
Client secondClient = () ->
    Mono.delay(Duration.ofMillis(1500L)) // simulating delay of second api call
        .map(__ -> new Result())
        // !!! Pseudo-operator just to demonstrate what I want to achieve
        .doOnTimeoutDownstream(() ->
            log.info("Second API call canceled due to downstream timeout!")
        );
So, if I have not received and collected all the results within the specified time, I need to know which API call was actually canceled because of the downstream timeout, and have a callback for this "event".
I know I could put a doOnCancel() callback on every client call (instead of the pseudo-operator shown above) and it would work, but that callback reacts to any cancellation, which may be triggered by any error, not only by a timeout.
Of course, with proper exception handling (onErrorResume(), for example) it would work as I expect; however, I'm interested in whether there is a more direct way to react specifically to a timeout in this case.

Spring WebFlux: onErrorResume not being called when exception is thrown halfway during the webclient reactive chain

I have written code that uses webclient to call another endpoint and want to add reactive error handling. However, it seems my understanding of doOnError or onErrorResume may not be correct:
webClient
    .get()
    .uri(someUri)
    .retrieve()
    .bodyToFlux(Some.class)
    .onErrorResume(throwable -> {
        log.error("Error occurred when calling other service: {}", throwable.getMessage());
        return Flux.error(new RuntimeException("Exception type: " + throwable.getClass() + " Exception message: " + throwable.getMessage()));
    });
The intention is that this call is part of a larger reactive chain, and if an exception is thrown while running the API call (.get().retrieve()), onErrorResume should wrap the exception and pass it on to the caller higher up the chain.
I tried to unit test the validity of this by:
Mockito.when(webClient.get().uri(URI.create(uri)).retrieve()).thenThrow(new RuntimeException("Hello world exception thrown"));
But I noticed that the exception is simply thrown and the code terminates at the .retrieve step of the reactive chain, rather than proceeding to the onErrorResume step.
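This is because your test throws the exception while the pipeline is being assembled (the mocked retrieve() call itself throws), instead of returning a publisher that emits an error once it is subscribed to. So it is not your data flow that is in error, but the code that builds the pipeline, and onErrorResume only sees errors signalled through the data flow.
You can solve this by stubbing the call that produces the publisher so that it returns Flux.error (retrieve() itself returns a ResponseSpec rather than a Mono, so the error has to come from bodyToFlux):
Mockito.when(webClient.get().uri(URI.create(uri)).retrieve().bodyToFlux(Some.class))
    .thenReturn(Flux.error(new RuntimeException("Hello world exception thrown")));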
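The difference between an exception thrown while the pipeline is built and an error delivered through the pipeline can be shown with a small JDK-only analogy. This is a sketch using CompletableFuture rather than Reactor, so it only mirrors the idea; the class and method names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;

class AssemblyVsSignal {
    // Throws while the pipeline is being built: no future is ever returned,
    // so nothing attached afterwards gets a chance to handle the failure.
    static CompletableFuture<String> throwsDuringAssembly() {
        throw new IllegalStateException("thrown during assembly");
    }

    // Returns a future that carries the failure as an error signal.
    static CompletableFuture<String> failsViaSignal() {
        return CompletableFuture.failedFuture(new IllegalStateException("delivered as error signal"));
    }

    static String callWithFallback(boolean throwDuringAssembly) {
        try {
            CompletableFuture<String> source =
                throwDuringAssembly ? throwsDuringAssembly() : failsViaSignal();
            // The in-pipeline handler, comparable in role to onErrorResume.
            return source.exceptionally(ex -> "handled in pipeline").join();
        } catch (IllegalStateException ex) {
            // Reached only when the exception bypassed the pipeline entirely.
            return "escaped the pipeline";
        }
    }
}
```

Calling callWithFallback(true) corresponds to the thenThrow stub in the test: the handler is never attached, so the exception surfaces to the caller directly.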

Webflux - hanging requests when using bounded elastic Scheduler

I have a service written with WebFlux that is under high load (40 requests per second),
and I'm seeing really bad latency and performance problems that I can't explain: at some point during peaks, the service hangs in random places, as if it had no threads left to handle the request.
The service makes several calls, using WebClient, to other services that aren't reactive, plus a call to a main service that retrieves the main data through an SDK wrapped in Mono.fromCallable(..).publishOn(Schedulers.boundedElastic()).
So the flow is:
1. receive a request as Mono<Request>
2. convert it to an internal object, Mono<RequestAggregator>
3. call GCP to get a JWT token, then call some service to get data using WebClient
4. call the main service using Mono.fromCallable(MainService.getData(RequestAggregator)).publishOn(Schedulers.boundedElastic())
5. call another service to get more data (same as 3)
6. call another service to get more data (same as 3)
7. do some manipulation with all the data and return a Mono<Response>
The WebClient calls look something like this:
Mono.fromCallable(() -> GoogleService.getToken(account, clientId)
        .buildIapRequest(REQUEST_URL))
    .map(httpRequest -> httpRequest.getHeaders().getAuthorization())
    .flatMap(authToken -> webClient.post()
        .uri("/call/some/endpoint")
        .header(HttpHeaders.AUTHORIZATION, authToken)
        .header(HttpHeaders.CONTENT_TYPE, MediaType.APPLICATION_JSON_VALUE)
        .header(HttpHeaders.ACCEPT, MediaType.APPLICATION_JSON_VALUE)
        .body(BodyInserters.fromValue(countries))
        .retrieve()
        .onStatus(HttpStatus::isError, clientResponse -> {
            log.error("{} got status code: {}",
                ERROR_MSG_ERROR, clientResponse.statusCode());
            return Mono.error(new SomeWebClientException(STATE_ABBREVIATIONS_ERROR));
        })
        .bodyToMono(SomeData.class));
Sometimes step 6 hangs for more than 11 minutes, even though the service it calls does not have any issues. That service is not reactive, but its responses take ~400ms.
Another thing worth mentioning is that MainService is a heavy I/O operation that might take 1 minute or more.
I suspect that a lot of requests hang on MainService and there aren't any threads left for the other operations. Does that make sense? If so, how does one solve something like that?
Can someone suggest any reason for this issue? I'm all out of ideas.
It's not possible to tell for sure without knowing the full application, but the blocking I/O operation is indeed the most likely culprit.
Schedulers.boundedElastic(), as its name suggests, is bounded. By default the bound is "ten times the number of available CPU cores", so on a 2-core machine it would be 20. If you have more concurrent blocking calls than the limit, the rest are put into a queue, where they wait indefinitely for a free thread. If you need more concurrency than that, you should consider setting up your own scheduler with a higher limit, for example using Schedulers.fromExecutor.
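To put rough numbers on this, the sketch below compares the default cap with the concurrency the question implies, using Little's law (threads in use ≈ arrival rate × time each call blocks). It uses only the JDK; the class and method names are hypothetical, and the pool and queue sizes are assumptions to be tuned, not recommendations.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BlockingPoolSizing {
    // Default bound described above: ten times the number of CPU cores.
    static int defaultBoundedElasticCap(int cores) {
        return 10 * cores;
    }

    // Little's law estimate of how many threads are busy at once.
    static int threadsNeeded(double requestsPerSecond, double blockingSeconds) {
        return (int) Math.ceil(requestsPerSecond * blockingSeconds);
    }

    // A dedicated, explicitly sized pool for the blocking SDK call.
    // Idle threads are released after 60 seconds.
    static ThreadPoolExecutor blockingPool(int threads, int queueCapacity) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            threads, threads, 60, TimeUnit.SECONDS,
            new LinkedBlockingQueue<>(queueCapacity));
        pool.allowCoreThreadTimeOut(true);
        return pool;
    }
}
```

With the figures from the question, threadsNeeded(40, 60) gives 2400 concurrent blocking calls in the worst case, against a default cap of 20 on a 2-core machine, which is consistent with requests piling up in the queue. An executor built this way could then be wrapped with Schedulers.fromExecutor; whether a pool that large is reasonable, or whether the call should be made faster or rate-limited instead, depends on the application.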

Using onErrorResume to handle problematic payloads posted to Kafka using Reactor Kafka

I am using Reactor Kafka to send Kafka messages and to receive and process them.
While receiving a Kafka payload, I do some deserialization, and if there is an exception, I want to just log that payload (by saving it to Mongo) and then continue receiving other payloads.
For this I am using the approach below:
@EventListener(ApplicationStartedEvent.class)
public void kafkaReceiving() {
    for (Flux<ReceiverRecord<String, Object>> flux : kafkaService.getFluxReceives()) {
        flux.delayUntil(/* some function to do something */)
            .doOnNext(r -> r.receiverOffset().acknowledge())
            .onErrorResume(this::handleException) // here I'll just save to mongo
            .subscribe();
    }
}
private Publisher<? extends ReceiverRecord<String, Object>> handleException(Throwable ex) {
    // save to mongo
    return Flux.empty();
}
Here I expect that whenever I hit an exception while receiving a payload, onErrorResume catches it and logs to Mongo, and then I can continue receiving further messages as I send them to the Kafka topic. However, I see that after the exception, even though the onErrorResume method gets invoked, I am not able to process any more messages sent to the Kafka topic.
Is there anything I might be missing here?
If you need to handle the error gracefully, you can add onErrorResume inside delayUntil:
flux
    .delayUntil(r -> process(r)
        // recover per record, so a failure never reaches the outer sequence
        // (assumes saveToMongo returns a Publisher)
        .onErrorResume(e -> saveToMongo(r)))
    .doOnNext(r -> r.receiverOffset().acknowledge())
    .subscribe();
Reactive operators treat an error as a terminal signal. If your inner logic (inside delayUntil) emits an error, delayUntil terminates the sequence, and an onErrorResume placed after delayUntil will not make it continue processing events from Kafka.
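The same point can be made with a plain-Java analogy: a handler around the whole stream stops consumption at the first bad record, while a handler around each record lets the rest through. This is only a sketch with ordinary loops, not Reactor Kafka code; process, the dead-letter list and the class name are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

class PerRecordRecovery {
    // Hypothetical processing step: fails on payloads it cannot parse.
    static int process(String payload) {
        return Integer.parseInt(payload);
    }

    // Handler wraps the whole stream: the first failure ends consumption.
    static List<Integer> recoverOutside(List<String> records, List<String> deadLetters) {
        List<Integer> out = new ArrayList<>();
        try {
            for (String record : records) {
                out.add(process(record));
            }
        } catch (NumberFormatException e) {
            deadLetters.add("stream terminated: " + e.getMessage());
        }
        return out;
    }

    // Handler wraps each record: the bad one is set aside, the rest continue.
    static List<Integer> recoverInside(List<String> records, List<String> deadLetters) {
        List<Integer> out = new ArrayList<>();
        for (String record : records) {
            try {
                out.add(process(record));
            } catch (NumberFormatException e) {
                deadLetters.add(record);
            }
        }
        return out;
    }
}
```

For the input "1", "x", "3", recoverOutside stops after the first record, whereas recoverInside processes "1" and "3" and records "x" as a dead letter, which is the behaviour that placing onErrorResume inside delayUntil aims for.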
As mentioned by @bsideup too, I ultimately decided not to throw an exception from the deserializer. Kafka is not able to commit the offset for such a record, and there is no clean way of ignoring the record and carrying on consuming, because we don't have the offset information for it (since it is malformed). So even if I try to ignore the record using reactive error operators, the poll fetches the same record again, and the consumer ends up stuck.

Reactive WebClient not emitting a response

I have a question about the Spring reactive WebClient...
A few days ago I decided to play with the new reactive stuff in Spring Framework and made a small project for scraping data, for personal purposes only (making multiple requests to one web page and combining the results).
I started using the new reactive WebClient for making requests, but the problem I found is that the client does not emit a response for every request. Sounds strange. Here is what I did to fetch the data:
private Mono<String> fetchData(String uri) {
    return this.client
        .get()
        .uri(uri)
        .header("X-Fsign","SW9D1eZo")
        .retrieve()
        .bodyToMono(String.class)
        .timeout(Duration.ofSeconds(35))
        .log("category", Level.ALL, SignalType.ON_ERROR, SignalType.ON_COMPLETE, SignalType.CANCEL, SignalType.REQUEST);
}
And the function that calls fetchData:
public Mono<List<Stat>> fetch() {
    return fetchData(URL)
        .map(this::extractUrls)
        .doOnNext(System.out::println)
        .doOnNext(s -> System.out.println("all ids are " + s.size()))
        .flatMapIterable(q -> q)
        .map(s -> s.substring(7, 15))
        .map(s -> "http://d.flashscore.com/x/feed/d_hh_" + s + "_en_1") // list of N-length urls
        .flatMap(this::fetchData)
        .map(this::extractHeadToHead)
        .collectList();
}
and the subscriber:
FlashScoreService bean = ctx.getBean(FlashScoreService.class);
bean.fetch().subscribe(s -> {
    System.out.println("finished !!! " + s.size()); // expecting same N-length list size
}, Throwable::printStackTrace);
The problem appears when I make a larger number of requests (more than 100).
I don't get responses for all of them; no error is thrown, no error response code is returned, and the subscribe callback is invoked with a list whose size differs from the number of requests.
The requests are based on a List of Strings (URLs), and after all responses have been emitted I should receive all of them as a list, because I'm using collectList(). When I execute 100 requests, I expect to receive a list of 100 responses, but I actually receive sometimes 100, sometimes 96, etc. Maybe something fails silently.
This is easily reproducible; here is my github project link.
Sample output:
all ids are 176
finished !!! 171
Please give me suggestions on how I can debug this, or tell me what I'm doing wrong. Help is appreciated.
Update:
The log shows the following if I pass 126 URLs, for example:
onNext(ReactorClientHttpResponse{request=[GET/some_url],status=200}) is called 121 times. Maybe this is where the problem is.
onComplete() is called 126 times, which is exactly the length of the passed list of URLs.
But how is it possible for some of the requests to complete without calling onNext() or onError()? (success and error in Mono)
I think the problem is not in the WebClient but somewhere else: the environment, or the server blocking the request. But then maybe I should see some error log.
P.S. Thanks for the help!
This is a tricky one. Debugging the actual HTTP frames received, it seems we're really not getting responses for some requests. Debugging a little more with Wireshark, it looks like the remote server is requesting the end of the connection with a FIN, ACK TCP packet and that the client acknowledges it. The problem is this connection is still taken from the pool to send another GET request after the first FIN, ACK TCP packet.
Maybe the remote server is closing connections after they've served a number of requests; in any case it's perfectly legal behavior. Note that I'm not reproducing this consistently.
Workaround
You can disable connection pooling on the client; this will be slower and apparently doesn't trigger this issue. For that, use the following:
this.client = WebClient.builder()
    .clientConnector(new ReactorClientHttpConnector(new Consumer<HttpClientOptions.Builder>() {
        @Override
        public void accept(HttpClientOptions.Builder builder) {
            builder.disablePool();
        }
    }))
    .build();
Underlying issue
The root problem is that the HTTP client should not signal onComplete when the TCP connection is closed without a response having been sent. Or better, the HTTP client should not reuse a connection while it is being closed. I'll report back here when I know more.
