Synchronous calls in Akka / actor model

I've been looking into Akka lately, and it looks like a great framework for building scalable servers on the JVM. However, most of the libraries on the JVM are blocking (e.g. JDBC), so don't you lose out on the performance benefits of an event-based model because your threads will always be blocked? Does Akka do something to get around this? Or is it just something you have to live with until we get more non-blocking libraries on the JVM?

Have a look at CQRS, it greatly improves scalability by separating reads from writes. This means that you can scale your reads separately from your writes.

With the types of IO blocking issues you mentioned Scala provides a language embedded solution that matches perfectly: Futures. For example:
def expensiveDBQuery(key : Key) = Future {
  //...query the database
}
val dbResult : Future[Result] =
  expensiveDBQuery(...) //non-blocking call
dbResult is returned immediately from the function call. The Result will be available in the "Future". The cool part about a Future is that you can think of it like any old collection, except that you can never call .size on it. Other than that, all collection-ish functions (e.g. map, filter, foreach, ...) are fair game. Simply think of dbResult as a list of Results. What would you do with such a list?
dbResult.map(_.getValues)
  .filter(values => someTestOnValues(values))
  ...
That sequence of calls sets up a computation pipeline that is invoked whenever the Result is actually returned from the database. You can give a sequence of computing steps before the data has arrived. All asynchronously.
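The answer's idea is not specific to Scala. As a rough sketch in Python (not Akka or Scala code), a blocking call can be pushed onto a dedicated thread pool so that the event loop itself never blocks. Here expensive_db_query, the pool size and the sample data are all hypothetical stand-ins for a real blocking driver:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Dedicated pool for blocking calls (the size is an assumption; in practice
# it would be tuned to the number of connections the database can serve).
blocking_pool = ThreadPoolExecutor(max_workers=4)


def expensive_db_query(key: str) -> dict:
    """Hypothetical stand-in for a blocking driver call such as JDBC."""
    time.sleep(0.05)  # simulates waiting on the database
    return {"key": key, "values": [1, 2, 3]}


async def load_values(key: str) -> list:
    loop = asyncio.get_running_loop()
    # The blocking call runs on the pool; the event loop stays free meanwhile.
    result = await loop.run_in_executor(blocking_pool, expensive_db_query, key)
    # Follow-up steps run once the result has arrived.
    return [v for v in result["values"] if v > 1]


async def main() -> list:
    # Several queries overlap even though each one blocks its pool thread.
    return await asyncio.gather(*(load_values(k) for k in ("a", "b", "c")))
```

The threads in the pool still block, so this only isolates the blocking work; it does not make the driver itself non-blocking.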

Related

How to avoid using Kotlin Coroutines' GlobalScope in a Spring WebFlux controller that performs long-running computations

I have a Rest API that is implemented using Spring WebFlux and Kotlin with an endpoint that is used to start a long running computation. As it's not really elegant to just let the caller wait until the computation is done, it should immediately return an ID which the caller can use to fetch the result on a different endpoint once it's available. The computation is started in the background and should just complete whenever it's ready - I don't really care about when exactly it's done, as it's the caller's job to poll for it.
As I'm using Kotlin, I thought the canonical way of solving this is using Coroutines. Here's a minimal example of what my implementation looks like (using Spring's Kotlin DSL instead of traditional controllers):
import org.springframework.web.reactive.function.server.coRouter
// ...
fun route() = coRouter {
    POST("/big-computation") { request: ServerRequest ->
        val params = request.awaitBody<LongRunningComputationParams>()
        val runId = GlobalResultStorage.prepareRun(params)
        coroutineScope {
            launch(Dispatchers.Default) {
                GlobalResultStorage.addResult(runId, longRunningComputation(params))
            }
        }
        ok().bodyValueAndAwait(runId)
    }
}
This doesn't do what I want, though, as the outer Coroutine (the block after POST("/big-computation")) waits until its inner Coroutine has finished executing, thus only returning runId after it's no longer needed in the first place.
The only possible way I could find to make this work is by using GlobalScope.launch, which spawns a Coroutine without a parent that awaits its result, but I read everywhere that you are strongly discouraged from using it. Just to be clear, the code that works would look like this:
POST("/big-computation") { request: ServerRequest ->
    val params = request.awaitBody<LongRunningComputationParams>()
    val runId = GlobalResultStorage.prepareRun(params)
    GlobalScope.launch {
        GlobalResultStorage.addResult(runId, longRunningComputation(params))
    }
    ok().bodyValueAndAwait(runId)
}
Am I missing something painfully obvious that would make my example work using proper structured concurrency or is this really a legitimate use case for GlobalScope? Is there maybe a way to launch the Coroutine of the long running computation in a scope that is not attached to the one it's launched from? The only idea I could come up with is to launch both the computation and the request handler from the same coroutineScope, but because the computation depends on the request handler, I don't see how this would be possible.
Thanks a lot in advance!
Maybe others won't agree with me, but I think this whole aversion to GlobalScope is a little exaggerated. I often have the impression that some people don't really understand what the problem with GlobalScope is, and they replace it with solutions that share similar drawbacks or are effectively the same. But well, at least they don't use evil GlobalScope anymore...
Don't get me wrong: GlobalScope is bad. Especially because it is just too easy to use, so it is tempting to overuse it. But there are many cases when we don't really care about its disadvantages.
Main goals of structured concurrency are:
Automatically wait for subtasks, so we don't accidentally go ahead before our subtasks finish.
Cancelling of individual jobs.
Cancelling/shutting down of the service/component that schedules background tasks.
Propagating of failures between asynchronous tasks.
These features are critical for building reliable concurrent applications, but there are surprisingly many cases where none of them really matters. Take your example: if your request handler lives for the whole lifetime of the application, then you need neither the waiting for subtasks nor the shutdown feature. You don't want to propagate failures. Cancelling individual subtasks is not really applicable here either, because whether we use GlobalScope or a "proper" solution, we do it exactly the same way: by storing the task's Job somewhere.
Therefore, I would say that the main reasons why GlobalScope is discouraged do not apply to your case.
Having said that, I still think it may be worth implementing the solution that is usually suggested as a proper replacement for GlobalScope. Just create a property with your own CoroutineScope and use it to launch coroutines:
private val scope = CoroutineScope(Dispatchers.Default)
fun route() = coRouter {
    POST("/big-computation") { request: ServerRequest ->
        ...
        scope.launch {
            GlobalResultStorage.addResult(runId, longRunningComputation(params))
        }
        ...
    }
}
You won't get too much from it. It won't help you with leaking of resources, and it won't make your code more reliable. But at least it will help keep background tasks somewhat categorized. It will be technically possible to determine who the owner of the background tasks is. You can easily configure all background tasks in one place, for example provide a CoroutineName or switch to another thread pool. You can count how many active subtasks you have at the moment. It will make it easier to add graceful shutdown should you need it. And so on.
But most importantly: it is cheap to implement. You won't get too much, but it won't cost you much either, so why not.
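To make the "owner of background tasks" idea concrete outside Kotlin, here is a rough Python asyncio analogue (not the Kotlin API). BackgroundRunner and long_running_computation are hypothetical names; the object plays the role of the dedicated scope: it starts tasks without waiting, can count the active ones, and offers one place to wait for them on shutdown.

```python
import asyncio
import itertools


class BackgroundRunner:
    """Owns fire-and-forget tasks, roughly like a dedicated CoroutineScope."""

    def __init__(self) -> None:
        self._tasks: set = set()
        self._ids = itertools.count(1)
        self.results: dict = {}

    @property
    def active(self) -> int:
        return len(self._tasks)

    def start(self, params: int) -> int:
        """Schedule the computation and return its id without waiting."""
        run_id = next(self._ids)
        task = asyncio.create_task(self._run(run_id, params))
        self._tasks.add(task)  # keep a strong reference until it finishes
        task.add_done_callback(self._tasks.discard)
        return run_id

    async def _run(self, run_id: int, params: int) -> None:
        self.results[run_id] = await long_running_computation(params)

    async def shutdown(self) -> None:
        """Graceful shutdown: wait for everything that is still running."""
        await asyncio.gather(*self._tasks)


async def long_running_computation(params: int) -> int:
    await asyncio.sleep(0.05)  # hypothetical stand-in for the real work
    return params * 2


async def demo() -> tuple:
    runner = BackgroundRunner()
    run_id = runner.start(21)  # returns immediately
    done_early = run_id in runner.results  # the result is not there yet
    active_before = runner.active
    await runner.shutdown()
    return run_id, done_early, active_before, runner.results[run_id], runner.active
```

As in the Kotlin version, this buys little beyond bookkeeping, but the bookkeeping lives in one place.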

Reactor on-demand flux or a sink

Consider an HTTP controller endpoint that accepts requests, validates them and then returns an ack, but in the meantime does some "heavy work" in the background.
There are 2 approaches with Reactor (that I'm interested in) to achieve this:
First approach
@PostMapping(..)
fun acceptRequest(request: Request): Response {
    if (isValid(request)) {
        Mono.just(request)
            .flatMap(service::doHeavyWork)
            .subscribe(...)
        return Response(202)
    } else {
        return Response(400)
    }
}
Second approach
class Controller {
    private val service = ...
    private val sink = Sinks.many().unicast().onBackpressureBuffer<Request>()
    private val stream = sink.asFlux().flatMap(service::doHeavyWork).subscribe(..)
    fun acceptRequest(request: Request): Response {
        if (isValid(request)) {
            sink.tryEmitNext(request) // for simplicity not handling errors
            return Response(202)
        } else {
            return Response(400)
        }
    }
}
Which approach is better/worse and why?
The reason I'm asking is that in Akka, building streams on demand was not really effective, since the stream needed to materialize every time, so it was better to have the "sink approach". I'm wondering if this might be the case for Reactor as well, or whether there are other advantages / disadvantages to these approaches.
I'm not too familiar with Akka, but building a reactive chain definitely doesn't attract a huge overhead with Reactor - that's the "normal" way of handling a request. So I don't see the need to use a separate sink like in your second approach - that just seems to be adding complexity for little gain. I'd therefore say the first approach is better.
That being said, generally, subscribing yourself as you do in both examples isn't recommended - but this kind of "fire and forget" work is one of the few cases it might make sense. There's just a couple of other potential caveats I'd raise here that may be worth considering:
You call the work "heavy", and I'm not sure if that means it's CPU heavy, or just IO heavy or takes a long time. If it just takes a long time due to firing off a bunch of requests, then that's no big deal. If it's CPU heavy however, then that could cause an issue if you're not careful - you probably don't want to execute CPU heavy tasks on your event loop threads. In this case, I'd probably create a separate scheduler backed by your own executor service, and then use subscribeOn() to offload those CPU heavy tasks.
Remember the "fire and forget" pattern in this case really is "forget" - you have absolutely no way of knowing if the heavy task you've offloaded has worked or not, since you've essentially thrown that information away by self-subscribing. Depending on your use-case that might be fine, but if the task is critical or you need some kind of feedback if it fails, then it's worth considering that may not be the best approach.
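Both caveats can be illustrated with a small Python sketch (this is not Reactor code). It assumes a hypothetical do_heavy_work task, offloads it to a separate pool instead of the request thread, and attaches a callback so that a failure is at least recorded rather than silently discarded. Note that for truly CPU-bound Python code a process pool would usually be the more realistic choice; a thread pool is used here only to keep the sketch short.

```python
import logging
from concurrent.futures import Future, ThreadPoolExecutor

log = logging.getLogger("heavy-work")

# Separate pool so heavy work never runs on the request-handling threads
# (the pool size is an assumption).
heavy_pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="heavy")
failures: list = []


def do_heavy_work(request: dict) -> int:
    """Hypothetical heavy task; fails for negative input."""
    if request["n"] < 0:
        raise ValueError("negative input")
    return sum(i * i for i in range(request["n"]))


def _record_outcome(request: dict, future: Future) -> None:
    error = future.exception()
    if error is not None:
        # Without this callback the failure would vanish silently.
        failures.append((request, error))
        log.error("heavy work failed for %r: %s", request, error)


def accept_request(request: dict) -> int:
    if "n" not in request:
        return 400
    future = heavy_pool.submit(do_heavy_work, request)
    future.add_done_callback(lambda f: _record_outcome(request, f))
    return 202  # acknowledged; the work continues in the background
```

The caller still gets its 202 immediately; the difference is only that the outcome of the background work is observed somewhere.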

Run async function in specific thread

I would like to run specific long-running functions (which execute database queries) on a separate thread. However, let's assume that the underlying database engine only allows one connection at a time and the connection struct isn't Sync (I think at least the latter is true for diesel).
My solution would be to have a single separate thread (as opposed to a thread pool) where all the database-work happens and which runs as long as the main thread is alive.
I think I know how I would have to do this with passing messages over channels, but that requires quite some boilerplate code (e.g. explicitly sending the function arguments over the channel etc.).
Is there a more direct way of achieving something like this with Rust (and possibly tokio and the new async/await notation that is in nightly)?
I'm hoping to do something along the lines of:
let handle = spawn_thread_with_runtime(...);
let future = run_on_thread!(handle, query_function, argument1, argument2);
where query_function would be a function that immediately returns a future and does the work on the other thread.
Rust nightly and external crates / macros would be ok.
If external crates are an option, I'd consider taking a look at actix, an Actor Framework for Rust.
This will let you spawn an Actor in a separate thread that effectively owns the connection to the DB. It can then listen for messages, execute work/queries based on those messages, and return either sync results or futures.
It takes care of most of the boilerplate for message passing, spawning, etc. at a higher level.
There's also a Diesel example in the actix documentation, which sounds quite close to the use case you had in mind.
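The underlying pattern (one long-lived thread that owns the connection, with callers receiving a future) can be sketched without an actor framework. The following Python version is only an illustration of that pattern, not actix or Diesel code; it uses sqlite3 because its connections refuse use from other threads by default, and the table, data and function names are hypothetical.

```python
import sqlite3
import threading
from concurrent.futures import Future, ThreadPoolExecutor

_local = threading.local()


def _open_connection() -> None:
    # Runs once, on the worker thread, so the connection never leaves it.
    _local.conn = sqlite3.connect(":memory:")
    _local.conn.execute("CREATE TABLE items (name TEXT)")
    _local.conn.executemany(
        "INSERT INTO items VALUES (?)", [("apple",), ("apricot",), ("banana",)]
    )


# One worker means every query is serialized on the same thread.
db_thread = ThreadPoolExecutor(max_workers=1, initializer=_open_connection)


def _query(prefix: str) -> list:
    rows = _local.conn.execute(
        "SELECT name FROM items WHERE name LIKE ? ORDER BY name", (prefix + "%",)
    )
    return [name for (name,) in rows]


def query_function(prefix: str) -> Future:
    """Returns a future immediately; the work happens on the database thread."""
    return db_thread.submit(_query, prefix)
```

The executor's internal queue plays the role of the channel, so the arguments are passed as ordinary function arguments rather than hand-written messages.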

std::promise/std::future vs std::condition_variable in C++

Signaling between threads can be achieved with std::promise/std::future or with good old condition variables. Can someone provide examples/use cases where one would be a better choice than the other?
I know that CVs can be used to signal multiple times between threads. Can you give an example with std::future/promise that signals multiple times?
Also, is std::future::wait_for equivalent in performance to std::condition_variable::wait?
Let's say I need to wait on multiple futures in a queue as a consumer; does it make sense to go through each of them and check whether they are ready, like below?
for (auto it = activeFutures.begin(); it != activeFutures.end();) {
    if (it->valid() && it->wait_for(std::chrono::milliseconds(1)) == std::future_status::ready) {
        Printer::print(std::string("+++ Value " + std::to_string(it->get()->getBalance())));
        it = activeFutures.erase(it); // erase invalidates it; continue from the returned iterator
    } else {
        ++it;
    }
}
"Can someone provide examples/use cases where one would be a better choice than the other?"
These are 2 different tools of the standard library.
In order to give an example where 1 would be better over the other you'd have to come up with a scenario where both tools are a good fit.
However, these are different levels of abstractions to what they do and what they are good for.
from cppreference (emphasis mine):
Condition variables
A condition variable is a synchronization primitive that allows multiple threads to communicate with each other. It allows some number of threads to wait (possibly with a timeout) for notification from another thread that they may proceed. A condition variable is always associated with a mutex.
Futures
The standard library provides facilities to obtain values that are returned and to catch exceptions that are thrown by asynchronous tasks (i.e. functions launched in separate threads). These values are communicated in a shared state, in which the asynchronous task may write its return value or store an exception, and which may be examined, waited for, and otherwise manipulated by other threads that hold instances of std::future or std::shared_future that reference that shared state.
As you can see, a condition variable is a synchronization primitive whereas a future is a facility used to communicate results of asynchronous tasks.
A condition variable can be used in a variety of scenarios where you need to synchronize multiple threads, whereas you would typically use a std::future when you have tasks/jobs/work to do and you need it done without interrupting your main flow, i.e. asynchronously.
So in my opinion, a good example of where you would use a future + promise is when you need to run a long-running calculation and get/wait_for the result at a later point in time. With a condition variable, you would basically have to implement std::future + std::promise on your own, possibly using std::condition_variable somewhere internally.
"Can you give an example with std::future/promise that signals multiple times?"
Have a look at the toy example from shared_future.
"Also, is std::future::wait_for equivalent in performance to std::condition_variable::wait?"
Well, GCC's implementation of std::future::wait_for uses std::condition_variable::wait_for, which correlates with my explanation of the difference between the two. So, as you can understand, std::future::wait_for adds a very small performance overhead to std::condition_variable::wait_for.
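The distinction between a one-shot result and repeated signaling shows up in most threading libraries. The sketch below uses Python's standard library rather than C++, purely to illustrate the two roles: a future delivers a single value per task, while a condition variable guarding a queue can be notified any number of times. compute_balance, Mailbox and relay are hypothetical names.

```python
import threading
from collections import deque
from concurrent.futures import ThreadPoolExecutor, as_completed


def compute_balance(account: int) -> int:
    """Hypothetical one-shot task: a future delivers its result exactly once."""
    return account * 100


def collect_balances(accounts: list) -> list:
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(compute_balance, a) for a in accounts]
        # as_completed blocks until the next future is ready, so there is no
        # need to poll each one with a short timeout.
        return sorted(f.result() for f in as_completed(futures))


class Mailbox:
    """Repeated signaling: one condition variable guards a shared queue."""

    def __init__(self) -> None:
        self._items: deque = deque()
        self._cond = threading.Condition()

    def put(self, item: int) -> None:
        with self._cond:
            self._items.append(item)
            self._cond.notify()  # can be called any number of times

    def get(self) -> int:
        with self._cond:
            # wait_for re-checks the predicate, which handles spurious wakeups.
            self._cond.wait_for(lambda: len(self._items) > 0)
            return self._items.popleft()


def relay(count: int) -> list:
    box = Mailbox()
    producer = threading.Thread(target=lambda: [box.put(i) for i in range(count)])
    producer.start()
    received = [box.get() for _ in range(count)]
    producer.join()
    return received
```

Calling collect_balances([3, 1, 2]) gathers one result per future, while relay(5) passes five separate notifications through the same condition variable.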

How to await a ParallelQuery with LINQ?

I have an async method that should look up database entries. It filters by name, and thus is a candidate for parallel execution.
However, I can not find a simple way to support both parallel execution and asynchronous tasks.
Here's what I have:
private async Task<List<Item>> GetMatchingItems(string name) {
    using (var entities = new Entities()) {
        var items = from item in entities.Item.AsParallel()
                    where item.Name.Contains(name)
                    select item;
        return await items.ToListAsync(); //complains "ParallelQuery<Item> does not contain a definition for ToListAsync..."
    }
}
When I remove AsParallel() it will compile. Am I not supposed to use both features at the same time? Or do I understand something wrong?
IMHO, both make sense:
AsParallel() would indicate that the database query may get split up into several sub-queries running at the same time, because the individual matching of any item is not dependent on any other item. UPDATE: Bad idea in this example, see comments and answer!
ToListAsync() would support the asynchronous execution of this method, to allow other match methods (on other data) to start executing immediately.
How to use both parallel execution (with LINQ) and asynchronous tasks at the same time?
You can't, and shouldn't. PLINQ isn't for database access. The database knows best how to parallelize the query and does that just fine on its own using normal LINQ. PLINQ is for accessing objects and such where you are doing computationally expensive calculations within the LINQ query, so it can parallelize it on the client side (vs parallelizing it on the server/database server side).
A better answer might be:
PLINQ is for distributing a query that is compute intensive across multiple threads.
Async is for returning the thread back so that others can use it because you are going to be waiting on an external resource (database, Disk I/O, network I/O, etc).
Async PLINQ doesn't have a very strong use case where you want to return the thread while you wait AND you have a lot of calculations to do... If you are busy calculating, you NEED the thread (or multiple threads). They are almost completely on different ends of optimization. If you need something like this, there are much better mechanisms like Tasks, etc.
Well, you can't. These are 2 different options that don't go together.
You can use Task.Run with async-await to parallelize the synchronous part of the asynchronous operation (i.e. what comes before the first await) on multiple ThreadPool threads, but without all the partition logic built into PLINQ, for example:
var tasks = enumerable.Select(item => Task.Run(async () =>
{
    LongSynchronousPart(item);
    await AsynchronousPartAsync(item);
}));
await Task.WhenAll(tasks);
In your case however (assuming it's Entity Framework) there's no value in using PLINQ as there's no actual work to parallelize. The query itself is executed in the DB.
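For comparison, the same shape (a synchronous part moved to a worker thread, followed by an awaited asynchronous part, fanned out over all items) can be sketched in Python. This is only an analogue of the Task.Run pattern above, not .NET code; the two step functions are hypothetical, and asyncio.to_thread requires Python 3.9 or later.

```python
import asyncio
import time


def long_synchronous_part(item: int) -> int:
    time.sleep(0.02)  # hypothetical blocking step
    return item * item


async def asynchronous_part(value: int) -> int:
    await asyncio.sleep(0.02)  # hypothetical awaitable I/O step
    return value + 1


async def process(item: int) -> int:
    # to_thread moves the synchronous part off the event loop thread.
    squared = await asyncio.to_thread(long_synchronous_part, item)
    return await asynchronous_part(squared)


async def process_all(items: list) -> list:
    # gather starts every item at once and preserves the input order.
    return await asyncio.gather(*(process(i) for i in items))
```

As with the C# snippet, there is no partitioning logic here; every item simply gets its own task.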

Resources