TPL Dataflow block with an inner life - tpl-dataflow

I've been doing a bit of TPL dataflow coding and am quite happy with the basics. The question I have is, how would I go about doing a TPL block that, besides reacting to its queue, also has a life on its own?
Like, a background task that runs on a permanent loop, polling the odd webservice or database every few seconds and emitting messages on its outputs when it sees fit?
The interface would be a source- and target block, but with no apparent connection between source messages and target messages.
Basically an "active" block.

Blocks aren't background tasks or workers. They don't have a lifetime of their own outside the pipeline and their messages. That doesn't mean you can't do what you ask though. You'll have to somehow trigger the head block from the outside though.
One way to do what you want is to create a timer that "pings" the head block of a pipeline:
var head=new TransformBlock<int,...>(...);
...
var timer=new System.Threading.Timer(_=>head.PostAsync(0),null,0,5000);
You can start, stop or change the timer as needed.
You could use this eg to process the files in a folder periodically :
var crawler=new TransformManyBlock<string,string>(root=>{
return Directory.EnumerateFiles(root,"*.csv");
});
var parseCsv=new TransformBlock<string,Record[]>(filePath=>{
var records=await parseCsvAsync(filePath);
return records;
}
...
var timer=new System.Threading.Timer(_=>head.PostAsync(rootFolder),null,0,5000);

So, here's what I did in the end:
My FSM class consisted of a broadcast block, working as input, a bufferblock working as output and an async method with an endless loop running in the threadpool. The async method gets the current input value from the broadcastblock whenever needed, does its thing and posts a result to the bufferblock, in my case every second.
A user of that FSM class can link to the broadcast block's input and the bufferblock's output. The async method polls the input when needed and gets new values. Whatever it does, the result goes to the buffer block.
The interface then looks like
public interface IFsmBlock : IAsyncDisposable
{
ITargetBlock<string?> SearchParameterInputPort { get; }
ISourceBlock<string> FoundEntriesOutputPort { get; }
}
With possibly more than one input or output block. In this instance, the FSM treats the input port basically as a variable containing a value. If the input port should trigger something, that's easily possible by using an action block instead of the buffer block.

Related

Using yield in nested object in Kotlin sequence

I want to stream result objects captured by Spring JDBC RowCallbackHandler using via a Kotlin Sequence.
The code looks basically like this:
fun findManyObjects(): Sequence<Thing> = sequence {
val rowHandler = object : RowCallbackHandler {
override fun processRow(resultSet: ResultSet) {
val thing = // create from resultSet
yield(thing) // ERROR! No coroutine scope
}
}
jdbcTemplate.query("select * from ...", rowHandler)
}
But I get the compilation error:
Suspension functions can be called only within coroutine body.
However, exactly this "coroutine body" should exist, because the whole block is wrapped in a sequence builder. But it doesn't seem to work with a nested object.
Minimal example to show that it doesn't compile with a nested object:
// compiles
sequence {
yield(1)
}
// doesn't compile
sequence {
object {
fun doit() {
yield(1) // Suspension functions can be called only within coroutine body.
}
}
}
How can I pass an object from the ResultSet into the Sequence?
Use Flow for asynchronous data streams
The reason you can't call yield inside your RowCallbackHandler object is twofold.
The processRow function isn't a suspending function (and can't be, because it's declared in and called by Java). A suspending function like yield can only be called by another suspending function.
A sequence always ends when the sequence { ... } builder returns. Even if you and I know that the query method will invoke the RowCallbackHandler before returning from the sequence, the Kotlin compiler has no way of knowing that. Yielding sequence values from functions and objects other than the body of the sequence itself is never allowed, because there's no way of knowing where or when they will run.
To solve this problem, we need to introduce a different kind of coroutine: one that can suspend itself while it waits for the RowCallbackHandler to be invoked.
Unfortunately, because we're talking about JDBC here, there may not be much to gain by introducing full-blown coroutines. Under the hood, calls to the database will always be made in a blocking way, removing a lot of the benefit. It might well be simpler not to try and 'stream' results, and just iterate over them in a boring, old-fashioned way. But let's explore the possibilities all the same.
The problem with sequences
Sequences are designed for on-demand computation, and are not asynchronous. They can't wait for other asynchronous operations, such as callbacks. The sequence builder's yield function simply suspends while waiting for the caller to retrieve the next item, and it's the only suspending function a sequence is ever allowed to call. You can demonstrate this if you try to use a simple suspending call like delay inside a sequence. You'll get a compile error letting you know that you're operating in a restricted coroutine scope.
sequence<String> { delay(1000) } // doesn't compile
Without the ability to call suspending functions, there's no way to wait for a callback to be invoked. Recognising this limitation, Kotlin provides an alternative mechanism for streams of on-demand values that do provide data in an asynchronous way. It's called a Flow.
Callback flows
The mechanism for using Flows to provide values from a callback interface is described very nicely by Roman Elizarov in his Medium article Callbacks and Kotlin Flows.
If you did want to use a callback flow, you'd simply replace sequence with callbackFlow, and replace yield with sendBlocking.
Your code might look something like this:
fun findManyObjects(): Flow<Thing> = callbackFlow {
val rowHandler = object : RowCallbackHandler {
override fun processRow(resultSet: ResultSet) {
val thing = // create from resultSet
sendBlocking(thing)
}
}
jdbcTemplate.query("select * from ...", rowHandler)
close() // the query is finished, so there are no more rows
}
A simpler flow
While that's the idiomatic way to stream values provided by a callback, it might not be the simplest approach to this problem. By avoiding callbacks altogether, you can use the much more common flow builder, passing each value to its emit function. But now that you have asynchrony in the form of coroutines, you can't just return a flow and then allow Spring to immediately close the result set. You need to be able to delay the closing of the result set until the flow has actually been consumed. That means peeling back the abstractions provided by RowCallbackHandler or ResultSetExtractor, which expect to process all the results in a blocking way, and instead providing your own implementation.
fun Connection.findManyObjects(): Flow<Thing> = flow {
prepareStatement("select * from ...").use { statement ->
statement.executeQuery().use { resultSet ->
while (resultSet.next()) {
val thing = // create from resultSet
emit(thing)
}
}
}
}
Note the use blocks, which will deal with closing the statement and result set. Because we don't reach the end of the use blocks until the while loop has completed and all the values have been emitted, the flow is free to suspend while the result set remains open.
So why use a flow at all?
You might notice that if you do it this way, you can actually replace flow and emit with sequence and yield. So have we come full circle? Well, sort of. The difference is that a flow can only be consumed from a coroutine, whereas with sequence, you can iterate over the resulting values without suspending at all. In this particular case, it's a hard call to make, because JDBC operations are always blocking.
If you use a sequence, the calling thread will block as it waits to receive the data. Values in a sequence are always computed by the thing consuming the sequence, so if the sequence invokes a blocking function, the consumer's thread will block waiting for the value. In a non-coroutine application, that might be okay, but if you're using coroutines, you really want to avoid hiding blocking calls inside innocuous-looking sequences.
If you use a flow, you can at least isolate the blocking calls by having the flow run on a particular dispatcher. For example, you could use the built-in IO dispatcher to perform the JDBC call, then switch back to the default dispatcher for any further processing. If you definitely want to stream values, I think this is a better approach than using a sequence.
With all this in mind, you'll need to be careful with your use of coroutines and dispatchers if you do choose one of these solutions. If you'd rather not worry about that, there's nothing wrong with using a regular ResultSetExtractor and forgetting about both sequences and flows for now.

Are hot non completing database observables a Rx usecase? Side-effect writing issue

I have more of a opinions question, asi if this, what many people do, should be a Rx use case.
In apps there is usually sql database, which is queried by UI as a observable, which emits after the query is loaded + anytime data changes (Room / SqlDelight etc)
Reads sound okay, however, is it possible to have "pure" writes to the database?
Writing to the database might look like this
fun sync() = Completable.fromCallable {
// do something
database.writeSomethingSynchronously()
}
SomeUi {
init {
database.someQueryObservable()
.subscribe { show list }
}
}
Imagine you want to display progressbar while this Completable is in flight.
What is effectively happening here is sideffecting to the database. Which means the opened database observable will re-emit when the data is written, but still before the sync() returns (assuming single threaded for simplicity)
Now there is point in time where there is new data in the UI and the progressbar is shown. (and worse with multithreading timings) This is invalid state.
In imperative world, sync would provide a completion callback, in which one would reload the query manually + show/hide progressbar synchronously. (And somehow block the database change listener for duration of the sync writes?)
Is there a way around this at all?

Concurrent Collection, reporting custom progress data to UI when parallel tasking

I have a concurrent collection that contains 100K items. The processing of each item in the collection can take as little as 100ms or as long as 10 seconds. I want to speed things up by parallelizing the processing, and have a 100 minions doing the work simultaneously. I also have to report some specific data to the UI as this processing occurs, not simply a percentage complete.
I want the parallelized sub-tasks to nibble away at the concurrent collection like a school of minnows attacking a piece of bread tossed into a pond. How do I expose the concurrent collection to the parallelized tasks? Can I have a normal loop and simply launch an async task inside the loop and pass it an IProgress? Do I even need the concurrent collection for this?
It has been recommended to me that I use Parallel.ForEach but I don't see how each sub-process established by the degrees of parallelism could report a custom object back to the UI with each item it processes, not only after it has finished processing its share of the 100K items.
The framework already provides the IProgress inteface for this purpose, and an implementation in Progress. To report progress, call IProgress.Report with a progressvalue. The value T can be any type, not just a number.
Each IProgress implementation can work in its own way. Progress raises an event and calls a callback you pass to it when you create it.
Additionally, Progress.Report executes asynchronously. Under the covers, it uses SychronizationContext.Post to execute its callback and all event handlers on the thread that created the Progress instance.
Assuming you create a progress value class like this:
class ProgressValue
{
public long Step{get;set;}
public string Message {get;set;}
}
You could write something like this:
IProgress<ProgressValue> myProgress=new Progress<ProgressValue>(p=>
{
myProgressBar.Value=p.Step;
});
IList<int> myVeryLargeList=...;
Parallel.ForEach(myVeryLargeList,item,state,step=>
{
//Do some heavy work
myProgress.Report(new ProgressValue
{
Step=step,
Message=String.Format("Processed step {0}",step);
});
});
EDIT
Oops! Progress implements IProgress explicitly. You have to cast it to IProgress , as #Tim noticed.
Fixed the code to explicitly declare myProgress as an IProgress.

What approach should I take for creating a "lobby" in Node.js?

I have users connecting to a Node.js server, and when they join, I add them into a Lobby (essentially a queue). Any time there are 2 users in the lobby, I want them to pair off and be removed from the lobby. So essentially, it's just a simple queue.
I started off by trying to implement this with a Lobby.run method, which has an infinite loop (started within a process.nextTick call), and any time there are more than two entries in the queue, I remove them form the queue. However, I found that this was eating all my memory and that infinite loops like this are generally ill-advised.
I'm now assuming that emitting events via EventEmitter is the way to go. However, my concern is with synchronization. Let's assuming my Lobby is pretty simple:
Lobby = {
users: []
, join: function (user) {
this.users.push(user);
emitter.emit('lobby.join', user);
}
, leave: function (user) {
var index = this.users.indexOf(user);
this.users.splice(index, 1);
emitter.emit('lobby.leave', user);
}
};
Now essentially I assume I want to watch for users joining the lobby and pair them up, maybe something like this:
Lobby = {
...
, run: function () {
emitter.on('lobby.join', function (user) {
// TODO: determine if this.users contains other users,
// pair them off, and remove them from the array
});
}
}
As I mentioned, this does not account for synchronization. Multiple users can join the lobby at the same time, and so the event listener might pair up a single user with multiple other users instead of just one.
Can someone with more Node.js experience tell me if I am right to be concerned with this event-based approach? Any insight for improvement on this approach would be much appreciated.
You are wrong to be concerned with this. This is because Node.JS is single-threaded, there is no concurrency at all! Whenever a block of code is fired no other code (including event handlers) can be fired until the block finishes what it does. In particular if you define this empty loop in your app:
while(true) { }
then your server is crashed, no other code will ever fire, no other request will be ever handled. So be careful with blocks of code, make sure that each block will eventually end.
Back to the question... So in your case it is impossible for multiple users to be paired with the same user. And let me say one more time: this is simply because there is no concurrency in Node.JS!
On the other hand this only applies to one instance of Node.JS. If you want to scale it to many machines, then obviously you will have to implement some locking mechanism (which ensures that no other process can work with the data at the same time).

Can someone explain callback/event firing

In a previous SO question it was recommended to me to use callback/event firing instead of polling. Can someone explain this in a little more detail, perhaps with references to online tutorials that show how this can be done for Java based web apps.
Thanks.
The definition of a callback from Wikipedia is:
In computer programming, a callback is
executable code that is passed as an
argument to other code. It allows a
lower-level software layer to call a
subroutine (or function) defined in a
higher-level layer.
In it's very basic form a callback could be used like this (pseudocode):
void function Foo()
{
MessageBox.Show("Operation Complete");
}
void function Bar(Method myCallback)
{
//Perform some operation
//When completed execute the callback method
myCallBack().Invoke();
}
static int Main()
{
Bar(Foo); //Pops a message box when Bar is completed
}
Modern languages like Java and c# have a standardized way of doing this and they call it events. An event is simply a special type of property added to a class that contains a list of Delegates / Method Pointers / Callbacks (all three of these things are the same thing. When the event gets "fired" it simply iterates through it's list of callbacks and executes them. These are also referred to as listeners.
Here's an example
public class Button
{
public event Clicked;
void override OnMouseUp()
{
//User has clicked on the button. Let's notify anyone listening to this event.
Clicked(); //Iterates through all the callbacks in it's list and calls Invoke();
}
}
public class MyForm
{
private _Button;
public Constructor()
{
_Button = new Button();
//Different languages provide different ways of registering listeners to events.
// _Button.Clicked += Button_Clicked_Handler;
// _Button.Clicked.AddListener(Button_Clicked_Handler);
}
public void Button_Clicked_Handler()
{
MessageBox.Show("Button Was Clicked");
}
}
In this example the Button class has an event called Clicked. It allows anyone who wants to be notified when is clicked to register a callback method. In this case the "Button_Clicked_Handler" method would be executed by Clicked event.
Eventing/Callback architecture is very handy whenever you need to be notified that something has occurred elsewhere in the program and you have no direct knowledge of when or how this happens.
This greatly simplifies notification. Polling makes it much more difficult because you are responsible for checking every so often whether or not an operation has completed. A simple polling mechanism would be like this:
static void CheckIfDone()
{
while(!Button.IsClicked)
{
//Sleep
}
//Button has been clicked.
}
The problem is that this particular situation would block your existing thread and have to continue checking until Button.IsClicked is true. The nice thing about eventing architecture is that it is asynchronous and let's the Acting Item (button) notify the listener when it is completed instead of the listener having to keep checking,
The difference between polling and callback/event is simple:
Polling: You are asking, continuously or every fixed amount of time, if some condition is meet, for example, if some keyboard key have been pressed.
Callback: You say to some driver, other code or whatever: When something happens (the keyboard have been pressed in our example), call this function, and you pass it what function you want to be called when the event happens. This way, you can "forget" about that event, knowing that it will be handled correctly when it happens.
Callback is when you pass a function/object to be called/notified when something it cares about happens. This is used a lot in UI - A function is passed to a button that is called whenever the button is pressed, for example.
There are two players involved in this scenario. First you have the "observed" which from time to time does things in which other players are interested. These other players are called "observers". The "observed" could be a timer, the "observers" could be tasks, interested in alarm events.
This "pattern" is described in the book "Design Patterns, Elements of Reusable Object-Oriented Software" by Gamma, Helm, Johnson and Vlissides.
Two examples:
The SAX parser to parse XML walks
trough an XML file and raises events
each time an element is encountered.
A listener can listen to these
elements and do something with it.
Swing and AWT are based on this
pattern. When the user moves the
mouse, clicks or types something on
the keyboard, these actions are
converted into events. The UI
components listen to these
events and react to them.
Being notified via an event is almost always preferable to polling, especially if hardware is involved and that event originates from a driver issuing a CPU interrupt. In that case, you're not using ANY cpu at all while you wait for some piece of hardware to complete a task.

Resources