How do I debug a Mono that never completes - spring-boot

I have a Spring Boot application which contains a complex reactive flow (it involves MongoDB and RabbitMQ operations). Most of the time it works, but...
Some of the methods return a Mono<Void>. This is a typical pattern, in multiple layers:
fun workflowStep(things: List<Thing>): Mono<Void> =
    Flux.fromIterable(things).flatMap { thing -> doSomethingTo(thing) }.collectList().then()
Let's say doSomethingTo() returns a Mono<Void> (it writes something to the database, sends a message etc). If I just replace it with Mono.empty() then everything works as expected, but otherwise it doesn't. More specifically the Mono never completes, it runs through all processing but misses the termination signal at the end. So the things are actually written in the database, messages are actually sent, etc.
To prove that the lack of termination is the problem, here is a hack that works:
val hackedDelayedMono = Mono.empty<Void>().delayElement(Duration.ofSeconds(1))
return Mono.first(
    workflowStep(things),
    hackedDelayedMono
)
The question is, what can I do with a Mono that never completes, to figure out what's going on? There is nowhere I could put a logging statement or a breakpoint, because:
there are no errors
there are no signals emitted
How could I check what the Mono is waiting for to be completed?
ps. I could not reproduce this behaviour outside the application, with simple Mono workflows.

You can trace and log events in your stream by using the log() operator in your reactive stream. This is useful for gaining a better understanding about what events are occurring within your app.
Flux.fromIterable(things)
    .flatMap(thing -> doSomethingTo(thing))
    .log()
    .collectList()
    .then()
Chained inside a sequence, it peeks at every event of the Flux or Mono
upstream of it (including onNext, onError, and onComplete as well as
subscriptions, cancellations, and requests).
Reactor Reference Documentation - Logging a Sequence
The Reactor reference documentation also contains other helpful advice for debugging a reactive stream and can be found here: Debugging Reactor

(We managed to fix the problem - it was not directly in the code I was working on, but for some reason my changes triggered it. I still don't understand the root cause, but higher up the chain we found a Mono.zip() zipping a Mono<Void>. Although this used to work before, it stopped working at some point. Why is a Mono<Void> even zippable, why don't we get a compiler error, and even worse, why does it work sometimes?)
To answer my own question here, the tool used for debugging was adding the following to all Monos in the chain, until it didn't produce any output:
mono
    .doOnEach { x ->
        logger.info("signal: ${x}")
    }
    .then(Mono.defer {
        logger.info("then()")
        Mono.empty<Void>()
    })
I also experimented with .log(). It is also a fine tool, but maybe too detailed, and it is not very easy to understand which Mono produces which log messages, as these are logged with the dynamic scope, not the lexical scope, which the above method gives you unambiguously.
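On the Mono.zip() question raised above: to the compiler, Void is just another type argument, so a Mono<Void> satisfies the signature of zip() like any other Mono<T>. The Reactor documentation states that an empty completion of any source makes the zipped Mono complete empty as well, and a Mono<Void> can never emit a value. The sketch below models that rule in plain Java by treating each Mono as an Optional. ZipModel and its zip() method are hypothetical names for illustration only, not Reactor code, and the model says nothing about timing or cancellation.

```java
import java.util.Map;
import java.util.Optional;

// Simplified model: a Mono is treated as "zero or one value", i.e. an Optional.
// This is an analogy for illustration, not Reactor's implementation.
final class ZipModel {

    // A pair can only be built when BOTH sources produced a value.
    static <A, B> Optional<Map.Entry<A, B>> zip(Optional<A> left, Optional<B> right) {
        return left.flatMap(a -> right.map(b -> Map.entry(a, b)));
    }

    public static void main(String[] args) {
        Optional<String> saved = Optional.of("saved");
        // Models a Mono<Void>: it can only complete empty, because Void has no instances.
        Optional<Void> done = Optional.empty();

        System.out.println(zip(saved, Optional.of(42)).isPresent()); // prints true
        System.out.println(zip(saved, done).isPresent());            // prints false
    }
}
```

In this model, zipping with a Void source never yields a pair, so any step that waits for the combined value is skipped.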

Related

What could cause Realm to lock up when attempting a write?

My team is currently facing an issue in our Xamarin.Forms app across all platforms(Android, iOS, and UWP). Realm will frequently become unresponsive, where the only way to use it again is to close the app. Over the past few months it's become more frequent and easy to reproduce, yet we have not been able to determine the cause or a workaround.
We have identified a few patterns that may help identify what's happening. We've noticed that whenever something needs information from the database, the worker thread handling it is stuck on a Realm.Write() call. This behavior seems almost as if there's a deadlock occurring within the Realm system. It's not consistent as to which Write() call it's stuck on, seeming to be random based on when the Realm fails. At that point, any other attempts to access this realm through any method, such as Find(), All(), Remove(), etc. also get stuck. We've also confirmed that the code within the Write() is never being run at this point, since we can put a realm-independent logging call on the first line and never see it in our logs.
Once this issue occurs, some other issues can happen in addition to this. We have two other Realms in our app that handle completely separate data, and as such have no overlapping code. These Realms are never the cause of this issue, but when the problem Realm gets stuck, it sometimes causes the other Realms to get stuck on their next calls as well. This issue also sometimes persists between uses of the app, causing the very first call to Realm to get stuck and requires a complete reinstall to fix.
Due to our app using Reactive based programming, we've had to structure how we handle our database a bit differently. For the problem Realm, we have a service that keeps a single instance active in an observable stream, which can then be subscribed to for watching changes. I've added some examples of this architecture at the end of this post. We also route all our other non-observable actions through this stream, however during debugging we've been able to move these calls to their own independent realm instances with little issue/no change to functionality.
Currently, we're thinking it's most likely an issue related either to how we're converting Realm to an observable system, or with our Realms crashing/becoming corrupted somehow.
RealmStream declaration:
_realmStream = Observable
    .Start(() => Realm.GetInstance(_dbConfig), _scheduler)
    .Do(_ => logger.LogTrace("Realm created"), () => logger.LogTrace("Realm stream completed"))
    .Replay()
    .AutoConnect();
RealmStream use example:
public IObservable<IChangeSet<TResult>> GetChangeSetStream<TSource, TResult>(Func<Realm, IQueryable<TSource>> selector, Func<TSource, TResult> transformFactory) where TSource : RealmObject
{
    return _realmStream
        .Select(realm =>
            selector(realm)
                .AsRealmCollection()
                .ToObservableChangeSet<IRealmCollection<TSource>, TSource>()
                .SubscribeOn(_scheduler)
                .Transform(transformFactory)
                .DisposeMany())
        .Switch()
        .Catch<IChangeSet<TResult>, Exception>(ex =>
        {
            _logger.LogError(ex, "Error getting property change stream");
            return Observable.Return<IChangeSet<TResult>>(default);
        })
        .SubscribeOn(_scheduler);
}
Non-Observable realm methods:
public async Task Run(Action<Realm> action)
{
    await _realmStream
        .Do(action)
        .SubscribeOn(_scheduler);
}
public async Task<TResult> Run<TResult>(Func<Realm, TResult> action)
{
    return await _realmStream
        .Select(action)
        .SubscribeOn(_scheduler);
}
So far, we've attempted the following:
Made sure Realm and Xamarin are both on the most recent versions
Reducing the number of Realm.Write()s (Minor improvement)
Moving every Realm function into our observable system (No noticeable change, most of our functions already do this)
Attempted moving everything that does not require observables to using independent realm instances (increased frequency of locking)
Attempted to move everything away from our single instance of Realm. We weren't able to do this, as we could not determine how to properly handle some observable events, such as a RealmObject being deleted, without causing major performance issues
realm.Write needs to acquire a write lock and, based on your description, it appears that you get a deadlock where a thread with an open write transaction waits for another thread that is stuck on the realm.Write call. If you're able to reproduce the hang with a debugger attached, you can inspect the threads window and try to pinpoint the offending code.
This article provides some tips about debugging deadlocks. Unfortunately, without the whole project and a repro case, it'd be hard to pinpoint the cause.

How does the ktor websocket flow API work?

I'm using ktor for server side development with websockets.
The documentation shows this example of using the incoming channel:
for (frame in incoming.mapNotNull { it as? Frame.Text }) {
    // some
}
But mapNotNull is marked as deprecated in favor of Flow. How should I use this API and what problems could there be? For example, a Flow is a cold stream, which means that the producer function is called on each collect. How does that work in the context of a websocket? Will it be reopened on the second collect call, or will old messages be delivered again on the next collect? How can I collect N messages, then stop collecting, then collect again?
Thanks in advance :)
How should I use this API and what problems could be there?
What I am using and what I have seen in one of the examples somewhere in the docs is the consumeAsFlow() method called on ReceiveChannel. Here is the entire snippet:
webSocket("/websocket") { //this: DefaultWebSocketServerSession
    incoming
        .consumeAsFlow()
        .map { receive(it) }
        .collect()
}
Haven't seen major issues with this approach. One thing you should be aware of (but that goes for the non-flow approach as well) is that if you throw inside your flow, then it will break the WebSocket connection, which is usually not something you'd like to do. It might be worth considering wrapping the entire thing in a try-catch.
Will it be reopened on second collect call, or maybe old messages will be delivered once after the next collect?
You open the websocket before you even start consuming the messages from the flow. You can see that inside webSocket() {} you are in the context of DefaultWebSocketServerSession. This is your connection management. Inside your flow you are simply receiving messages one by one as they arrive (after the connection has been established). If the connection breaks, then you're out of the flow. It needs to be re-established before you can process your messages. This establishing bit is done by the Route.webSocket() method. I do recommend taking a look at its Javadoc.
If you wish to add some clean up after the connection is closed you can add a finally block like so:
webSocket("/chat") {
    try {
        incoming
            .consumeAsFlow()
            .map { receive(it, client) }
            .collect()
    } finally {
        // cleanup
    }
}
In short: collect() is called once and suspends until the incoming channel is closed, while the map block runs once per received message. If there is no connection (or it was broken), no further messages reach the flow.
How can I collect N messages, then stop collecting, then collect again?
What is the use case for this? I don't think you should be doing this with any flow. You can of course take(n) items from a flow, but you won't be able to take any more from it again.

Spring Boot Webflux/Netty - Detect closed connection

I've been working with spring-boot 2.0.0.RC1 using the webflux starter (spring-boot-starter-webflux). I created a simple controller that returns an infinite Flux. I would like the Publisher to do its work only if there is a client (Subscriber). Let's say I have a controller like this one:
@RestController
public class Demo {
    @GetMapping(value = "/")
    public Flux<String> getEvents(){
        return Flux.create((FluxSink<String> sink) -> {
            while(!sink.isCancelled()){
                // TODO e.g. fetch data from somewhere
                sink.next("DATA");
            }
            sink.complete();
        }).doFinally(signal -> System.out.println("END"));
    }
}
Now, when I try to run that code and access the endpoint http://localhost:8080/ with Chrome, then I can see the data. However, once I close the browser the while-loop continues since no cancel event has been fired. How can I terminate/cancel the streaming as soon as I close the browser?
From this answer I quote that:
Currently with HTTP, the exact backpressure information is not
transmitted over the network, since the HTTP protocol doesn't support
this. This can change if we use a different wire protocol.
I assume that, since backpressure is not supported by the HTTP protocol, it means that no cancel request will be made either.
Investigating a little bit further, by analyzing the network traffic, showed that the browser sends a TCP FIN as soon as I close the browser. Is there a way to configure Netty (or something else) so that a half-closed connection will trigger a cancel event on the publisher, making the while-loop stop?
Or do I have to write my own adapter similar to org.springframework.http.server.reactive.ServletHttpHandlerAdapter where I implement my own Subscriber?
Thanks for any help.
EDIT:
An IOException will be raised on the attempt to write data to the socket if there is no client. As you can see in the stack trace.
But that's not good enough, since it might take a while before the next chunk of data will be ready to send and therefore it takes the same amount of time to detect the gone client. As pointed out in Brian Clozel's answer it is a known issue in Reactor Netty. I tried to use Tomcat instead by adding the dependency to the POM.xml. Like this:
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-tomcat</artifactId>
</dependency>
Although it replaces Netty and uses Tomcat instead, it does not seem reactive due to the fact that the browser does not show any data. However, there is no warning/info/exception in the console. Is spring-boot-starter-webflux as of this version (2.0.0.RC1) supposed to work together with Tomcat?
Since this is a known issue (see Brian Clozel's answer), I ended up using one Flux to fetch my real data and having another one in order to implement some sort of ping/heartbeat mechanism. As a result, I merge both together with Flux.merge().
Here you can see a simplified version of my solution:
@RestController
public class Demo {
    public interface Notification{}
    public static class MyData implements Notification{
        …
        public boolean isEmpty(){…}
    }
    @GetMapping(value = "/", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<ServerSentEvent<? extends Notification>> getNotificationStream() {
        return Flux.merge(getEventMessageStream(), getHeartbeatStream());
    }
    private Flux<ServerSentEvent<Notification>> getHeartbeatStream() {
        return Flux.interval(Duration.ofSeconds(2))
            .map(i -> ServerSentEvent.<Notification>builder().event("ping").build())
            .doFinally(signalType -> System.out.println("END"));
    }
    private Flux<ServerSentEvent<MyData>> getEventMessageStream() {
        return Flux.interval(Duration.ofSeconds(30))
            .map(i -> {
                // TODO e.g. fetch data from somewhere,
                // if there is no data return an empty object
                return data;
            })
            .filter(data -> !data.isEmpty())
            .map(data -> ServerSentEvent
                .builder(data)
                .event("message").build());
    }
}
I wrap everything up as ServerSentEvent<? extends Notification>. Notification is just a marker interface. I use the event field from the ServerSentEvent class in order to distinguish between data and ping events. Since the heartbeat Flux sends events constantly and at short intervals, the time it takes to detect that the client is gone is at most the length of that interval. Remember, I need that because it might take a while before I get some real data that can be sent and, as a result, it might also take a while before it detects that the client is gone. This way, it will detect that the client is gone as soon as it can't send the ping (or possibly the message event).
One last note on the marker interface, which I called Notification. This is not really necessary, but it gives some type safety. Without that, we could write Flux<ServerSentEvent<?>> instead of Flux<ServerSentEvent<? extends Notification>> as return type for the getNotificationStream() method. Or also possible, make getHeartbeatStream() return Flux<ServerSentEvent<MyData>>. However, like this it would allow that any object could be sent, which I don’t want. As a consequence, I added the interface.
I'm not sure why this behaves like this, but I suspect it is because of the choice of generation operator. I think using the following would work:
return Flux.interval(Duration.ofMillis(500))
    .map(input -> {
        return "DATA";
    });
According to Reactor's reference documentation, you're probably hitting the key difference between generate and push (I believe a quite similar approach using generate would probably work as well).
My comment was referring to the backpressure information (how many elements a Subscriber is willing to accept), but the success/error information is communicated over the network.
Depending on your choice of web server (Reactor Netty, Tomcat, Jetty, etc), closing the client connection might result in:
a cancel signal being received on the server side (I think this is supported by Netty)
an error signal being received by the server when it's trying to write on a connection that's been closed (I believe the Servlet spec does not provide that callback and we're missing the cancel information).
In short: you don't need to do anything special, it should be supported already, but your Flux implementation might be the actual problem here.
Update: this is a known issue in Reactor Netty

Project reactor processors v3.X

We are trying to migrate from 2.X to 3.X.
https://github.com/reactor/reactor-core/issues/375
We have used the EventBus as the event manager in our application (a low-latency FX system) and it works very well for us.
After the change we decided to give every module its own processor to handle events.
1. Does this usage seem correct from your point of view? Because of the lack of documentation at the current stage, and after reviewing everything we could find, we don't really know what to do here.
2. We have tried to use Flux in order to perform an action every X interval.
For example: market updates arrive at 1000 per second, but we want to process an update only 4 times a second. After upgrading we are using:
A processor with a buffer, sending to another method.
In this method we have a Flux that gets the list and tries to work in parallel in order to complete its task.
We had 2 major problems:
1. Sometimes we received a null event, which we cannot find our system sending, so I suppose we may be misusing the processor.
//Definition of processor
ReplayProcessor<Event> classAEventProcessor = ReplayProcessor.create();
//Event handler subscribing
public void onMyEventX(Consumer<Event> consumer) {
    Flux<Event> handler = classAEventProcessor.filter(event -> event.getType().equals(EVENT_X));
    handler.subscribe(consumer);
}
In the example above, the event in the handler is sometimes null. Once that happens, the stream stops working until we restart the server (because the processor is only created on restart).
2. We have tried to use parallel, but sometimes some of the messages disappeared, so maybe we are misusing the framework.
//On constructor
tickProcessor.buffer(1024, Duration.of(250, ChronoUnit.MILLIS)).subscribe(markets ->
    handleMarkets(markets));
//Handler
Flux.fromIterable(getListToProcess())
    .parallel()
    .runOn(Schedulers.parallel())
    .doOnNext(entryMap -> {
        DoBlockingWork(entryMap);
    })
    .sequential()
    .subscribe();
The intention is that the processor wakes up every 250 ms and invokes the handler. The handler then works with a parallel Flux in order to process faster.
*If DoBlockingWork takes more than 250 ms, I couldn't work out what the behavior will be.
UPDATE:
The EventBus was wrapped by us and every event was subscribed through the wrapped event manager.
Now we have tried to create an event processor for every module, but it works very slowly. We have used TopicProcessor with ThreadExecutor and it is still very slow. EventBus did the same work at high speed.
Does anyone have any idea? BTW, when I tried DirectProcessor it seemed to work much better than the TopicProcessor.
Reactor 3 is built around the concept that you should avoid blocking as much as you can, so in your second snippet DoBlockingWork doesn't look good.
How are the events generated? Do you maybe have an listener-based asynchronous API to get them? If so, you could try using Flux.create.
For your use case of "we have 1000 events in 1 second, but only want to process 4", I'd chain a sample operator. For instance, sample(Duration.ofMillis(250)) will divide each second into 4 windows, from which it will only emit the last element.
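As a rough illustration of the "last element per window" idea described above, here is a plain-Java model. It is a sketch under stated assumptions, not how Reactor implements sample(): SampleModel and sampleLast are hypothetical names, windows are assumed to be aligned to multiples of the window length, and arrival times are assumed to be sorted.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of "keep only the last element of each time window".
// Illustration only: this is not how Reactor implements sample().
final class SampleModel {

    // arrivalMillis must be sorted ascending; returns the indices of the kept elements.
    static List<Integer> sampleLast(long[] arrivalMillis, long windowMillis) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < arrivalMillis.length; i++) {
            boolean isLastOverall = i == arrivalMillis.length - 1;
            // An element survives if the next one already belongs to a later window.
            if (isLastOverall
                    || arrivalMillis[i] / windowMillis != arrivalMillis[i + 1] / windowMillis) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        long[] ticks = new long[1000];
        for (int i = 0; i < ticks.length; i++) {
            ticks[i] = i; // one market update per millisecond
        }
        System.out.println(sampleLast(ticks, 250)); // prints [249, 499, 749, 999]
    }
}
```

With 1000 updates spread over one second and a 250 ms window, only four updates survive, which matches the "process only 4 times a second" requirement from the question.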
The reference guide is being written, as well as a page where you can find links to external articles and learning material. There's a preview of the WIP reference guide here and the learning resources page here.

spring integration message released twice from aggregator

I have a Spring Integration flow that starts with an inbound channel adapter, which picks up files and passes them through the system as messages.
After a few components, the messages are aggregated at an "Aggregator" from where they are released based on release strategies or by group timeout of 30 sec.
The downstream processing has another bunch of components till the final one.
The problem I am facing is this,
When I send 33 files, which create 33 "groups/buckets" based on correlation IDs, aggregated at the "Aggregator", some of the files or messages seem to be "released" twice. I conclude that because I have a channel interceptor which shows a few messages passing through the "released" channel (appearing right after the aggregator) a second time, after they completed the downstream processing successfully the first time. Additionally, this behavior causes my application to not find a file and throw an exception, which I see. This leads me to conclude that the message bucket/group/corrID is somehow being "released" twice.
I have tried to debug this in many ways, but essentially I want to know how a corrID/bucket, after being released and having successfully gone through all downstream components in a single thread, can be "released" again.
My question is, how can I debug this? I want to know what is making this message/bucket re-appear in the aggregator.
My aggregator is as follows,
<int:aggregator id="bufferedFiles" input-channel="inQueueForStage"
output-channel="released" expire-groups-upon-completion="true"
send-partial-result-on-expiry="true" release-strategy="releaseHandler"
release-strategy-method="canRelease"
group-timeout-expression="size() > 0 ? T(com.att.datalake.ifr.loader.utils.MessageUtils).getAggregatorTimeout(one, #sourceSnapshot) : -1">
<int:poller fixed-delay="${files.pickup.delay:3000}"
max-messages-per-poll="${num.files.pickup.per.poll:10}"
task-executor="executor" />
</int:aggregator>
Explanation of aggregator: The size()>0 applies to EACH correlation bucket. each of the 33 files I am sending will spawn/generate/create a new bucket because of the file name, so the aggregator will have 33 buckets/groups/corrIds, each bucket will contain only one file.
So the aggregator SpEL expression simply says that if there are no release strategies, then release the bucket/group after 30 secs, provided the group has at least one file.
My Channel inbound adapter is as follows:
<int-file:inbound-channel-adapter id="files"
channel="dispatchFiles" directory="${source.dir}" scanner="directoryScanner">
<int:poller fixed-delay="${files.pickup.delay:3000}"
max-messages-per-poll="${num.files.pickup.per.poll:10}" />
</int-file:inbound-channel-adapter>
Logs
here is the log of message completing the flow the first time. The completion time invoked suggests reaching the last component a "completionHandler" SA.
Explanation of Log: "cor" is the bucket/corrId that is being released twice. The reason I get the final exception is because during the first time, the file is removed from that original location and processed. So the second time around when this erroneous release happens, there is nothing to process there.
From the pictures it can be seen that the first batch/corrId/bucket is processed and finished around 11:09, and the second one is started around 11:10
An important point I noticed is that this behavior only happens when I have a global channel interceptor in which I am doing somewhat long processing. When this interceptor is commented out, the errors go away.
Question:
Is it possible for the aggregator to double release a batch/corrId under any circumstance? How can I make the aggregator emit any logs?
Thanks
Edit 10:15pm
My channel following the aggregator has an interceptor as follows,
public Message<?> preSend(Message<?> message, MessageChannel channel) {
    LOGGER.info("******** Releasing from aggregator(interceptor) , corrID:{} at time:{} ********", MessageUtils.getCorrelationId(message), new Date());
    finalReporter.callback(channel.toString(), message);
    return message;
}
From the Aggregator down to the final completionHandler SA, I have single-threaded processing
Aggregator -> releasedChannel -> some SA1 -> some channel -> ..... -> completionChannel->completeSA
When I run for 33 partitions, let's follow corrId = "alh" The first time it is released, it looks like following,
What it shows is that thread-5 released it and it should process all the downstream components. But it leaves it midway and starts doing other things, and the message is picked up again by a different thread a little later as follows,
That seems/seemed to be the problem,
Solution Update:
I did the following 3 things as a workaround for the moment:
For some reason, my interceptors were doing return super.preSend(message, channel) instead of simply return message. I changed it to the latter.
I had global channel interceptors. I removed the global ones and kept the individual ones.
If the channel interceptors had any issues before returning, would that cause a new release?
Although I still see the above scenario depicted in pictures, I am not getting double processing attempts and as such it avoids the errors. I am still trying to make sense out of this.
I understand it's too specific and difficult to explain; still thanks for the time and comments...
However, yes. I think @GaryRussell is right: since you use expire-groups-upon-completion="true", some partial groups may be released by group-timeout-expression, and new messages with the same correlationId will form a new group, which is released by the next group-timeout. Your size() > 0 isn't good either: it means that a partial group is going to be released after that group-timeout. Maybe size() > 1? The group can't be size() == 0 anyway, because it is created on the first message, so if the group exists, it contains at least one message. Yes, a group can be empty, but in that case the aggregator should be marked with expire-groups-upon-completion="false". In that case the group is marked as completed and doesn't allow new messages.
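The effect described above can be sketched with a small plain-Java model of the group bookkeeping. This is a simplified, hypothetical model and not Spring Integration code: AggregatorModel, onMessage and releaseOpenGroups are made-up names, the trigger for a release (release strategy or timeout) is collapsed into one method call, and other aggregator attributes that influence the real behavior are not modelled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical model of an aggregator's group bookkeeping.
// It only tracks whether a completed group is retained or forgotten.
final class AggregatorModel {
    private final boolean expireGroupsUponCompletion;
    private final Map<String, List<String>> openGroups = new HashMap<>();
    private final Set<String> completedIds = new HashSet<>();
    final List<String> released = new ArrayList<>(); // correlation IDs in release order

    AggregatorModel(boolean expireGroupsUponCompletion) {
        this.expireGroupsUponCompletion = expireGroupsUponCompletion;
    }

    void onMessage(String correlationId, String payload) {
        if (completedIds.contains(correlationId)) {
            return; // the completed group is still known, so the late message is rejected
        }
        // Otherwise the message joins (or creates) the open group for this correlation ID.
        openGroups.computeIfAbsent(correlationId, id -> new ArrayList<>()).add(payload);
    }

    // Stands in for whatever triggers a release (release strategy or timeout).
    void releaseOpenGroups() {
        for (String id : new ArrayList<>(openGroups.keySet())) {
            released.add(id);
            openGroups.remove(id);
            if (!expireGroupsUponCompletion) {
                completedIds.add(id); // remember the group so it cannot be re-formed
            }
        }
    }

    public static void main(String[] args) {
        AggregatorModel expiring = new AggregatorModel(true);
        expiring.onMessage("alh", "file-1");
        expiring.releaseOpenGroups();
        expiring.onMessage("alh", "late message"); // forms a brand-new group
        expiring.releaseOpenGroups();
        System.out.println(expiring.released); // prints [alh, alh]
    }
}
```

Constructing the model with false instead makes the second onMessage call a no-op, so "alh" is released only once.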
After struggling with debugging and various blind scenarios, I believe that at least I have a workaround and a possible root cause. I will try to outline all the things that I modified,
Root Cause:
My interceptors were calling a Common class with a common callback method. This method, based on the channel name from which the request was coming from, would decide the appropriate action to take. The actions were essentially collecting data, incrementing counters and persisting to database some information.
It seems that some of them were having errors and, consequently, the thread was dying and the message was re-released. I am not entirely sure about it, so please correct me if that's not the case.
But after I fixed those errors, the re-release issue seems to have subsided or vanished altogether.
The reason it was hard to diagnose was that I could not see those errors thrown during callback method invocations; maybe I was catching them or maybe they were lost.
I also found that the issue was only on channel interceptors AFTER the aggregator. Interceptors before the aggregator did not present any issues, maybe because they were simpler...
To debug,
I removed the interceptors and made the callback directly from various components (SAs), removed global interceptors and tried to add individual interceptors for specific channels.
Thanks for all the help.
