Spring Integration message released twice from aggregator

I have a Spring Integration flow that starts with a file inbound channel adapter, which picks up files and passes them through the system as messages.
After a few components, the messages are aggregated at an "Aggregator", from where they are released either by a release strategy or by a group timeout of 30 seconds.
The downstream processing has another set of components, up to the final one.
The problem I am facing is this: when I send 33 files, which create 33 groups/buckets based on correlation IDs and are aggregated at the aggregator, some of the messages seem to be released twice. I conclude that because a channel interceptor shows a few messages passing through the "released" channel (the channel right after the aggregator) a second time, after they have already completed the downstream processing successfully the first time. Additionally, this behavior causes my application to not find the file and throw an exception, which I can see. This leads me to conclude that the message bucket/group/correlation ID is somehow being released twice.
I have tried to debug this in many ways, but essentially I want to know how a correlation ID/bucket, after being released and having successfully gone through all downstream components in a single thread, can be released again.
My question is: how can I debug this? I want to know what is making this message/bucket reappear in the aggregator.
My aggregator is as follows,
<int:aggregator id="bufferedFiles" input-channel="inQueueForStage"
    output-channel="released" expire-groups-upon-completion="true"
    send-partial-result-on-expiry="true" release-strategy="releaseHandler"
    release-strategy-method="canRelease"
    group-timeout-expression="size() > 0 ? T(com.att.datalake.ifr.loader.utils.MessageUtils).getAggregatorTimeout(one, #sourceSnapshot) : -1">
    <int:poller fixed-delay="${files.pickup.delay:3000}"
        max-messages-per-poll="${num.files.pickup.per.poll:10}"
        task-executor="executor" />
</int:aggregator>
Explanation of the aggregator: the size() > 0 applies to EACH correlation bucket. Each of the 33 files I am sending will create a new bucket because of the file name, so the aggregator will have 33 buckets/groups/correlation IDs, and each bucket will contain only one file.
So the aggregator SpEL expression simply says: if the release strategy has not released the bucket/group, release it after 30 seconds, provided the group has at least one file.
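For reference, the releaseHandler bean itself is not shown here; a release-strategy-method such as canRelease is typically just a POJO method that receives the group's messages and returns a boolean. A hypothetical sketch (the "lastFile" header is purely illustrative, not part of my flow):

import java.util.Collection;
import org.springframework.messaging.Message;

public class ReleaseHandler {

    // POJO release strategy method: return true to release the group immediately;
    // otherwise the group-timeout-expression releases it after 30 seconds
    public boolean canRelease(Collection<Message<?>> group) {
        // hypothetical condition; the real logic of releaseHandler is not shown in this question
        return group.stream()
                .anyMatch(m -> Boolean.TRUE.equals(m.getHeaders().get("lastFile")));
    }
}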
My Channel inbound adapter is as follows:
<int-file:inbound-channel-adapter id="files"
    channel="dispatchFiles" directory="${source.dir}" scanner="directoryScanner">
    <int:poller fixed-delay="${files.pickup.delay:3000}"
        max-messages-per-poll="${num.files.pickup.per.poll:10}" />
</int-file:inbound-channel-adapter>
Logs
Here is the log of the message completing the flow the first time. The completion time logged suggests it reached the last component, a "completionHandler" service activator (SA).
Explanation of the log: "cor" is the bucket/correlation ID that is being released twice. I get the final exception because, the first time around, the file is removed from its original location and processed; so the second time, when this erroneous release happens, there is nothing left to process.
From the pictures it can be seen that the first batch/corrId/bucket is processed and finished around 11:09, and the second one is started around 11:10.
An important point I noticed: this behavior only happens when I have a global channel interceptor in which I do somewhat long processing. When this interceptor is commented out, the errors go away.
Question:
Is it possible for the aggregator to double-release a batch/correlation ID under any circumstance? How can I make the aggregator emit any logs?
Thanks
Edit 10:15pm
My channel following the aggregator has an interceptor as follows,
@Override
public Message<?> preSend(Message<?> message, MessageChannel channel) {
    LOGGER.info("******** Releasing from aggregator (interceptor), corrID: {} at time: {} ********",
            MessageUtils.getCorrelationId(message), new Date());
    finalReporter.callback(channel.toString(), message);
    return message;
}
From the aggregator down to the final completionHandler SA, I have single-threaded processing:
Aggregator -> releasedChannel -> some SA1 -> some channel -> ... -> completionChannel -> completeSA
When I run with 33 partitions, let's follow corrId = "alh". The first time it is released, it looks like the following.
What it shows is that thread-5 released it and should process all the downstream components. But it leaves the work midway, starts doing other things, and the message is picked up again by a different thread a little later, as follows.
That seems to be the problem.
Solution Update:
I did the following three things as a workaround, for the moment:
For some reason, my interceptors were doing return super.preSend(message, channel) instead of simply return message. I changed it to the latter.
I had a global channel interceptor; I removed the global one and kept individual ones.
If the channel interceptors had any issues before returning, would that cause a new release?
Although I still see the scenario depicted in the pictures above, I am no longer getting double processing attempts, which avoids the errors. I am still trying to make sense of this.
I understand it's very specific and difficult to explain; still, thanks for the time and comments.

However, yes, I think @GaryRussell is right: since you use expire-groups-upon-completion="true", some partial groups may be released by the group-timeout-expression, and new messages with the same correlationId will then form a new group, which is released by the next group timeout. Your size() > 0 isn't good either: it means the partial group is going to be released after that group-timeout. Maybe size() > 1? The group can't be size() == 0, though, because it is created on the first message; so, if the group exists, it contains at least one message. A group can be empty, but only when the aggregator is marked with expire-groups-upon-completion="false": in that case the empty group is kept, marked as completed, and doesn't allow new messages.

After struggling with debugging and various blind scenarios, I believe that at least I have a workaround and a possible root cause. I will try to outline all the things that I modified.
Root Cause:
My interceptors were calling a common class with a common callback method. This method, based on the channel name the request was coming from, would decide the appropriate action to take. The actions were essentially collecting data, incrementing counters, and persisting some information to a database.
It seems that some of these actions were throwing errors and, consequently, the thread was dying and the message was re-released. I am not entirely sure about this, so please correct me if that's not the case.
But after I fixed those errors, the re-release issue seems to have subsided or vanished altogether.
It was hard to diagnose because I could not see the errors thrown during the callback method invocations; maybe I was catching them, or maybe they were lost.
I also found that the issue only occurred on channel interceptors AFTER the aggregator. Interceptors before the aggregator did not present any issues, maybe because they were simpler.
To debug, I removed the interceptors and made the callback directly from various components (SAs), removed global interceptors, and added individual interceptors for specific channels.
Thanks for all the help.
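Given that root cause, one defensive option is to make sure the interceptor's reporting callback can never break the message flow. A minimal sketch, assuming a plain ChannelInterceptor (Spring 5 default methods) and a hypothetical FinalReporter type for the finalReporter bean shown in the edit above:

import java.util.Date;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.messaging.Message;
import org.springframework.messaging.MessageChannel;
import org.springframework.messaging.support.ChannelInterceptor;
import com.att.datalake.ifr.loader.utils.MessageUtils;

public class ReportingInterceptor implements ChannelInterceptor {

    private static final Logger LOGGER = LoggerFactory.getLogger(ReportingInterceptor.class);

    private final FinalReporter finalReporter; // hypothetical type of the reporting collaborator

    public ReportingInterceptor(FinalReporter finalReporter) {
        this.finalReporter = finalReporter;
    }

    @Override
    public Message<?> preSend(Message<?> message, MessageChannel channel) {
        try {
            LOGGER.info("Releasing from aggregator (interceptor), corrID: {} at time: {}",
                    MessageUtils.getCorrelationId(message), new Date());
            finalReporter.callback(channel.toString(), message);
        }
        catch (Exception e) {
            // log and carry on; an error in the reporting code must not disturb the flow
            LOGGER.error("Reporting callback failed", e);
        }
        return message;
    }
}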

Related

Spring Integration can’t use multiple Outbound Channel Adapters

I want to write to a channel adapter only if the previous channel adapter write succeeded. I'm trying to do this with:
@Bean
public IntegrationFlow buildFlow() {
    return IntegrationFlows.from(someChannelAdapter)
            .handle(outboundChannelAdapter1)
            .handle(outboundChannelAdapter2)
            .get();
}
But I’m getting the following exception: The ‘currentComponent’ (…ReactiveMessageHandlerAdapter) is a one-way 'MessageHandler’ and it isn’t appropriate to configure ‘outputChannel’. This is the end of the integration flow.
How can I perform this?
If your handler implementation is one-way, fire-and-forget, then indeed there is no justification to continue the flow. The configuration can go on only if the current handler is reply-producing, so that there is something from which to build a message to send to the next channel.
In your case .handle(outboundChannelAdapter1) is just void, so the next .handle(outboundChannelAdapter2) is not going to have anything with which to continue the flow. So the framework gives you a hint that such a configuration is wrong. It is called a flow for a reason: the result of the current endpoint becomes the input for the next one. No result, no continuation. How else could it work, in your opinion?
The point is that there needs to be something to write to your channel adapter. One solution is a PublishSubscribeChannel, which distributes the same input message to all its subscribers (a small sketch follows at the end of this answer). If that fits your expectations, take a look at its support in the Java DSL: https://docs.spring.io/spring-integration/docs/current/reference/html/dsl.html#java-dsl-subflows.
Another way is the RecipientListRouter pattern: https://docs.spring.io/spring-integration/docs/current/reference/html/message-routing.html#router-implementations-recipientlistrouter.
You may achieve the same with a WireTap as well, but that depends on the business logic of your solution: https://docs.spring.io/spring-integration/docs/current/reference/html/core.html#channel-wiretap.
But anyway: you need to understand that the second handler can be called only if there is an input message for its channel. In all the cases I showed, it is exactly the same message you send to the first handler. If your expectations are different, please elaborate on what kind of message you'd like the second handler to receive if the first one does not return anything.
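To make the PublishSubscribeChannel option concrete, here is a minimal sketch (assuming someChannelAdapter, outboundChannelAdapter1 and outboundChannelAdapter2 are the beans from the question); each subscriber subflow receives the same input message:

@Bean
public IntegrationFlow buildFlow() {
    return IntegrationFlows.from(someChannelAdapter)
            .publishSubscribeChannel(pubSub -> pubSub
                    // each subscriber subflow gets the same input message
                    .subscribe(flow -> flow.handle(outboundChannelAdapter1))
                    .subscribe(flow -> flow.handle(outboundChannelAdapter2)))
            .get();
}

With no executor configured on the pub-sub channel, the subscribers are invoked sequentially on the calling thread, so an exception from the first adapter prevents the second from being called, which loosely matches the "only if the previous write succeeded" requirement.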

How do I debug a Mono that never completes

I have a Spring Boot application which contains a complex reactive flow (it involves MongoDB and RabbitMQ operations). Most of the time it works, but...
Some of the methods return a Mono<Void>. This is a typical pattern, in multiple layers:
fun workflowStep(things: List<Thing>): Mono<Void> =
Flux.fromIterable(things).flatMap { thing -> doSomethingTo(thing) }.collectList().then()
Let's say doSomethingTo() returns a Mono<Void> (it writes something to the database, sends a message, etc.). If I just replace it with Mono.empty(), then everything works as expected, but otherwise it doesn't. More specifically, the Mono never completes; it runs through all the processing but misses the termination signal at the end. So the things are actually written to the database, messages are actually sent, etc.
To prove that the lack of termination is the problem, here is a hack that works:
val hackedDelayedMono = Mono.empty<Void>().delayElement(Duration.ofSeconds(1))
return Mono.first(
workflowStep(things),
hackedDelayedMono
)
The question is: what can I do with a Mono that never completes, to figure out what's going on? There is nowhere I could put a logging statement or a breakpoint, because:
there are no errors
there are no signals emitted
How could I check what the Mono is waiting for to be completed?
ps. I could not reproduce this behaviour outside the application, with simple Mono workflows.
You can trace and log events in your reactive stream by using the log() operator. This is useful for gaining a better understanding of what events are occurring within your app.
Flux.fromIterable(things)
.flatMap(thing -> doSomethingTo(thing))
.log()
.collectList()
.then()
Chained inside a sequence, it peeks at every event of the Flux or Mono upstream of it (including onNext, onError, and onComplete as well as subscriptions, cancellations, and requests).
Reactor Reference Documentation - Logging a Sequence
The Reactor reference documentation also contains other helpful advice for debugging a reactive stream and can be found here: Debugging Reactor
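As a complement to log(), the Reactor debugging documentation referenced above also describes the checkpoint() operator and the global assembly-tracing hook. These mostly help once an error actually surfaces; for a silent hang, the doOnEach approach in the answer below is more direct. A minimal sketch (the checkpoint description is illustrative):

import reactor.core.publisher.Hooks;
import reactor.core.publisher.Mono;

public class DebugExample {
    public static void main(String[] args) {
        // enable assembly-time tracing globally (costly; use only while debugging)
        Hooks.onOperatorDebug();

        Mono.error(new IllegalStateException("boom"))
                // checkpoint() adds a description to the assembly trace,
                // so a failing operator is easier to locate in the stack trace
                .checkpoint("after workflowStep")
                .subscribe(
                        value -> System.out.println("value: " + value),
                        error -> error.printStackTrace());
    }
}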
(We managed to fix the problem - it was not directly in the code I was working on, but for some reason my changes triggered it. I still don't understand the root cause, but higher up the chain we found a Mono.zip() zipping a Mono<Void>. Although this used to work before, it stopped working at some point. Why is a Mono<Void> even zippable, why don't we get a compiler error, and even worse, why does it work sometimes?)
To answer my own question here, the tool used for debugging was adding the following to all Monos in the chain, until it didn't produce any output:
mono.doOnEach { x ->
    logger.info("signal: ${x}")
}
.then(Mono.defer {
    logger.info("then()")
    Mono.empty<Void>()
})
I also experimented with .log(); it is also a fine tool, but maybe too detailed, and it is not easy to tell which Mono produces which log messages, because they are logged with dynamic scope rather than the lexical scope that the approach above gives you unambiguously.
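One way to mitigate the "which Mono logged this" ambiguity is the log(String category) overload, which tags every signal with a category you choose. A small sketch (category names are illustrative):

import java.time.Duration;
import reactor.core.publisher.Flux;

public class LogCategoryExample {
    public static void main(String[] args) {
        Flux.just("a", "b", "c")
                // the category string appears in every log line,
                // so output from different publishers can be told apart
                .log("workflowStep.inner")
                .collectList()
                .log("workflowStep.outer")
                .block();
    }
}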

How to retry a kafka message when there is an error - spring cloud stream

I'm pretty new to Kafka. I'm using Spring Cloud Stream Kafka to produce and consume:
@StreamListener(Sink.INPUT)
public void process(Order order) {
    try {
        // my message processing
    }
    catch (Exception e) {
        // retry that record here..
    }
}
I just want to know how I can implement a retry. Any help on this is highly appreciated.
Hi,
There are multiple ways to handle "retries", and it depends on the kind of errors you encounter.
For basic issues, the Kafka framework will retry for you to recover from an error condition; for example, in case of a short network outage, the consumer and producer APIs implement automatic retries.
In particular, Kafka supports "built-in producer/consumer retries" to correctly handle a large variety of errors without message loss, but as a developer you must still be able to handle other types of errors with the try-catch block you mention.
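For example, with the plain Kafka producer client (not the Spring Cloud Stream binder used in the question), the built-in retries are controlled by configuration along these lines; the broker address and values shown are illustrative:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class RetryingProducerConfig {
    public static Producer<String, String> create() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        // retriable broker errors (e.g. transient network problems) are retried automatically
        props.put(ProducerConfig.RETRIES_CONFIG, 5);
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 500);
        // upper bound on the total time spent sending, including retries
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000);
        return new KafkaProducer<>(props);
    }
}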
Errors in Kafka can be divided into the following categories:
(producer & consumer side) Non-retriable broker errors, such as errors regarding message size, authorization errors, etc. -> you must handle them in the "design phase" of your app.
(producer side) Errors that occur before the message is sent to the broker, for example serialization errors -> you must handle them at runtime in the app.
(producer & consumer side) Errors that occur when the producer has exhausted all retry attempts, or when the memory available to the producer is filled to the limit because it is all used to store messages while retrying -> you should handle these errors.
Another point of attention regarding "how to retry" is how to correctly handle the order of commits when the auto-commit option is set to false.
A common and simple pattern to get commit order right is to use a monotonically increasing sequence number. Increase the sequence number every time you commit, and pass the sequence number at the time of the commit to the commit callback.
When you're getting ready to send a retry, check whether the commit sequence number the callback got is equal to the instance variable; if it is, there was no newer commit and it is safe to retry. If the instance sequence number is higher, don't retry, because a newer commit was already sent.
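A rough sketch of that sequence-number pattern, assuming the plain Kafka consumer API and commitAsync (the class and field names here are illustrative):

import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class SequencedCommitter {

    private final AtomicLong commitSeq = new AtomicLong();

    public void commit(KafkaConsumer<String, String> consumer,
                       Map<TopicPartition, OffsetAndMetadata> offsets) {
        // remember the sequence number this commit was issued with
        final long seqAtCommit = commitSeq.incrementAndGet();
        consumer.commitAsync(offsets, (committedOffsets, exception) -> {
            if (exception != null && seqAtCommit == commitSeq.get()) {
                // no newer commit was issued in the meantime, so retrying is safe
                consumer.commitAsync(offsets, null);
            }
            // if commitSeq has moved on, a newer commit superseded this one: do not retry
        });
    }
}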

Spring Integration Flow : Circuit breaker for each endpoints or at flow level

I have successfully implemented a Spring Integration flow.
I am looking to add a circuit breaker, either the same one for each endpoint or one at the flow level.
I have already read this documentation https://docs.spring.io/spring-integration/reference/html/handler-advice.html, but I haven't found my answer.
Should I use some AOP?
Thanks
G.
I'm not sure what you have missed in the mentioned docs, but RequestHandlerCircuitBreakerAdvice is indeed over there: https://docs.spring.io/spring-integration/reference/html/handler-advice.html#circuit-breaker-advice
Advices like this should be applied in the Java DSL with this configuration option:
.transform(..., c -> c.advice(expressionAdvice()))
Pay attention to that advice(expressionAdvice()) call. The expressionAdvice() is a bean method. So, you can do something similar for the RequestHandlerCircuitBreakerAdvice and any of your endpoints in the flow that need to be guarded by the circuit breaker.
And yes, you can use only a single bean for the RequestHandlerCircuitBreakerAdvice. It keeps state per endpoint it is called against:
protected Object doInvoke(ExecutionCallback callback, Object target, Message<?> message) {
    AdvisedMetadata metadata = this.metadataMap.get(target);
    if (metadata == null) {
        this.metadataMap.putIfAbsent(target, new AdvisedMetadata());
        metadata = this.metadataMap.get(target);
    }
    ...
Thanks for your answer @artem-bilan.
I really appreciate that a Spring Integration team member answered this.
After more thought, I have reformulated my problem.
Given an IntegrationFlow with a specific error channel, if there are more than a given number of errors in a given time span (more than 10 errors in 10 s), I want to stop polling the input channel.
So I redirect all the errors for this flow to the flow-specific error channel.
An error counter is incremented, and if the threshold is reached within the given time span, I stop the poller.
I have a second flow that monitors "stopped" pollers and restarts them after some time.
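A minimal sketch of the stop-the-poller part; the adapter bean name "inboundAdapter", the channel name "flowErrorChannel", and the plain counter are illustrative (the 10-errors-in-10-seconds window logic is omitted for brevity):

import java.util.concurrent.atomic.AtomicInteger;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.integration.annotation.ServiceActivator;
import org.springframework.integration.endpoint.SourcePollingChannelAdapter;
import org.springframework.messaging.Message;
import org.springframework.stereotype.Component;

@Component
public class PollerGuard {

    private final AtomicInteger errorCount = new AtomicInteger();
    private final SourcePollingChannelAdapter inboundAdapter;

    public PollerGuard(@Qualifier("inboundAdapter") SourcePollingChannelAdapter inboundAdapter) {
        this.inboundAdapter = inboundAdapter;
    }

    @ServiceActivator(inputChannel = "flowErrorChannel")
    public void onError(Message<?> errorMessage) {
        if (errorCount.incrementAndGet() > 10 && inboundAdapter.isRunning()) {
            // a separate monitoring flow is expected to restart the adapter later
            inboundAdapter.stop();
        }
    }
}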
[UPDATE]
I did use your recommendations.
Mainly because if the framework doesn't solve your problem, you're probably doing something wrong.
And I was wrong.
Thanks!

Project reactor processors v3.X

We are trying to migrate from 2.X to 3.X.
https://github.com/reactor/reactor-core/issues/375
We have used the EventBus as the event manager in our application (a low-latency FX system) and it worked very well for us.
After the change, we decided to give every module its own processor to handle events.
1. Does this usage seem correct from your point of view? Because of the lack of documentation at the current stage, and after reviewing everything we could, we don't really know what to do here.
2. We have tried to use Flux in order to perform an action every X interval.
For example: market data arrives 1000 times per second, but we want to process an update only 4 times a second. After upgrading, we are using:
A processor with a buffer, sending to another method.
In this method we have a Flux that gets the list and tries to work in parallel in order to complete its task.
We had 2 major problems:
1. Sometimes we received a null event, which we cannot find anywhere in our system being sent; I suppose maybe we are misusing the processor.
// Definition of processor
ReplayProcessor<Event> classAEventProcessor = ReplayProcessor.create();

// Event handler subscribing
public void onMyEventX(Consumer<Event> consumer) {
    Flux<Event> handler = classAEventProcessor.filter(event -> event.getType().equals(EVENT_X));
    handler.subscribe(consumer);
}
In the example above, the event in the handler sometimes comes out null. Once it does, the stream stops working until we restart the server (because we only create the processor on restart).
2. We have tried to use parallel, but sometimes some of the messages disappeared, so maybe we are misusing the framework.
// On constructor
tickProcessor.buffer(1024, Duration.of(250, ChronoUnit.MILLIS))
        .subscribe(markets -> handleMarkets(markets));

// Handler
Flux.fromIterable(getListToProcess())
        .parallel()
        .runOn(Schedulers.parallel())
        .doOnNext(entryMap -> DoBlockingWork(entryMap))
        .sequential()
        .subscribe();
The intention is that the processor will wake up every 250 ms and invoke the handler. The handler works with a parallel Flux in order to make processing better and faster.
*In case DoBlockingWork takes more than 250 ms, I couldn't work out what the behavior will be.
UPDATE:
We wrapped the EventBus, and every event was subscribed through the wrapped event manager.
Now we have tried to create an event processor for every module, but it works very slowly. We have used a TopicProcessor with a ThreadExecutor and it is still very slow. The EventBus did the same work at high speed.
Does anyone have any idea? BTW, when I tried to use a DirectProcessor, it seemed to work much better than the TopicProcessor.
Reactor 3 is built around the concept that you should avoid blocking as much as you can, so in your second snippet DoBlockingWork doesn't look good.
How are the events generated? Do you maybe have a listener-based asynchronous API to get them? If so, you could try using Flux.create.
For your use case of "we have 1000 events in 1 second, but only want to process 4", I'd chain a sample operator. For instance, sample(Duration.ofMillis(250)) will divide each second into 4 windows, from which it will only emit the last element.
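A small sketch of the sample approach; the tick source here is illustrative, standing in for the market feed:

import java.time.Duration;
import reactor.core.publisher.Flux;

public class SampleExample {
    public static void main(String[] args) throws InterruptedException {
        // roughly 1000 "market ticks" per second
        Flux<Long> ticks = Flux.interval(Duration.ofMillis(1));

        ticks
                // keep only the latest element of each 250 ms window -> about 4 updates per second
                .sample(Duration.ofMillis(250))
                .subscribe(tick -> System.out.println("processing tick " + tick));

        Thread.sleep(2000); // keep the demo alive long enough to see a few samples
    }
}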
The reference guide is being written, as well as a page where you can find links to external articles and learning material. There's a preview of the WIP reference guide here and the learning resources page here.
