Run async function in specific thread - async-await

I would like to run specific long-running functions (which execute database queries) on a separate thread. However, let's assume that the underlying database engine only allows one connection at a time and the connection struct isn't Sync (I think at least the latter is true for diesel).
My solution would be to have a single separate thread (as opposed to a thread pool) where all the database-work happens and which runs as long as the main thread is alive.
I think I know how I would have to do this with passing messages over channels, but that requires quite some boilerplate code (e.g. explicitly sending the function arguments over the channel etc.).
Is there a more direct way of achieving something like this with Rust (and possibly tokio and the new async/await notation that is in nightly)?
I'm hoping to do something along the lines of:
let handle = spawn_thread_with_runtime(...);
let future = run_on_thread!(handle, query_function, argument1, argument2);
where query_function would be a function that immediately returns a future and does the work on the other thread.
Rust nightly and external crates / macros would be ok.

If external crates are an option, I'd consider taking a look at actix, an Actor Framework for Rust.
This will let you spawn an Actor in a separate thread that effectively owns the connection to the DB. It can then listen for messages, execute work/queries based on those messages, and return either sync results or futures.
It takes care of most of the boilerplate for message passing, spawning, etc. at a higher level.
There's also a Diesel example in the actix documentation, which sounds quite close to the use case you had in mind.

How to terminate long running function after a timeout

I am attempting to shut down a long-running function if something takes too long. Maybe this only treats the symptoms rather than the cause, but in any case it didn't work out in my situation.
I did it like this:
func foo(abort <-chan struct{}) {
	for {
		select {
		case <-abort:
			return
		default:
			// long running code
		}
	}
}
In a separate function, I close the passed channel after some time. The close does happen, and if I cut out the body, the function returns. However, if there is some long-running code, closing the channel does not affect the outcome: the work simply continues as if nothing had happened.
I am pretty new to Go, and it feels like this should work, but it does not. Is there anything I am missing? After all, router frameworks have timeout functions, after which whatever is running is terminated. So maybe this is just curiosity, but I would really like to know how to do it.
Your code only checks whether the channel was closed once per iteration, before executing the long running code. There's no opportunity to check the abort chan after the long running code starts, so it will run to completion.
You need to occasionally check whether to exit early in the body of the long running code, and this is more idiomatically accomplished using context.Context and WithTimeout for example: https://pkg.go.dev/context#example-WithTimeout
In your "long running code" you have to periodically check that abort channel.
The usual approach to implement that "periodically" is to split the code into chunks each of which completes in a reasonably short time frame (given that the system the process runs on is not overloaded).
After executing each such chunk you check whether the termination condition holds and then terminate execution if it is.
The idiomatic approach to perform such a check is "select with default":
select {
case <-channel:
	// terminate processing
default:
}
Here, the no-op default branch is taken immediately if channel is neither ready to be received from nor closed.
Some algorithms make such chunking easier because they employ a loop where each iteration takes roughly the same time to execute.
If your algorithm is not like this, you'd have to chunk it manually; in this case, it's best to create a separate function (or a method) for each chunk.
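The chunking described above could look roughly like this. The chunk and run functions are hypothetical names, and the sleep is a placeholder for a real piece of work; the abort check after each chunk is the "select with default" shown earlier.

```go
package main

import (
	"fmt"
	"time"
)

// chunk is one short piece of the long-running job (hypothetical stand-in).
func chunk(i int) {
	time.Sleep(5 * time.Millisecond)
}

// run executes up to n chunks and checks abort after each one.
// It returns how many chunks completed and whether it was aborted.
func run(abort <-chan struct{}, n int) (completed int, aborted bool) {
	for i := 0; i < n; i++ {
		chunk(i)
		completed++
		select {
		case <-abort:
			return completed, true // terminate processing
		default:
			// abort is not closed yet; carry on with the next chunk
		}
	}
	return completed, false
}

func main() {
	abort := make(chan struct{})
	// Close the channel after a timeout; every receive on it then succeeds.
	time.AfterFunc(30*time.Millisecond, func() { close(abort) })

	completed, aborted := run(abort, 1000)
	fmt.Println(completed, aborted)
}
```

How quickly run reacts depends on the longest chunk, which is why each chunk should finish in a reasonably short time.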
Further points.
Consider using contexts: they provide a useful framework to solve the style of problems like the one you're solving.
What's better, the fact that they can "inherit" from one another allows one to easily implement two neat things:
You can combine various ways to cancel contexts: say, it's possible to create a context which is cancelled either when some timeout passes or explicitly by some other code.
They make it possible to create "cancellation trees" — when cancelling the root context propagates this signal to all the inheriting contexts — making them cancel what other goroutines are doing.
Sometimes, when people say "long-running code" they do not mean code actually crunching numbers on a CPU all that time, but rather the code which performs requests to slow entities — such as databases, HTTP servers etc, — in which case the code is not actually running but sleeping on the I/O to deliver some data to be processed.
If this is your case, note that all well-written Go packages (of course, this includes all the packages of the Go standard library which deal with networked services) accept contexts in those functions of their APIs which actually make calls to such slow entities. This means that if you make your function accept a context, you can (and actually should) pass this context down the stack of calls where applicable, so that all the code you call can be cancelled in the same way as yours.
Further reading:
https://go.dev/blog/pipelines
https://blog.golang.org/advanced-go-concurrency-patterns

Idiomatic Golang goroutines

In Go, if we have a type with a method that starts some looped mechanism (polling A and doing B forever) is it best to express this as:
// Run does stuff, you probably want to run this as a goroutine
func (t Type) Run() {
// Do long-running stuff
}
and document that this probably wants to be launched as a goroutine (and let the caller deal with that)
Or to hide this from the caller:
// Run does stuff concurrently
func (t Type) Run() {
go DoRunStuff()
}
I'm new to Go and unsure if convention says let the caller prefix with 'go' or do it for them when the code is designed to run async.
My current view is that we should document and give the caller a choice. My thinking is that in Go the concurrency isn't actually part of the exposed interface, but a property of using it. Is this right?
I had your opinion on this until I started writing an adapter for a web service that I want to make concurrent. I have a goroutine that must be started to parse results that are returned to the channel from the web calls. There is absolutely no case in which this API would work without using it as a goroutine.
I then began to look at packages like net/http. There is mandatory concurrency within that package. It is documented at the interface level that it should be able to be used concurrently, however the default implementations automatically use goroutines.
Because Go's standard library commonly fires off goroutines within its own packages, I think that if your package or API warrants it, you can handle them on your own.
"My current view is that we should document and give the caller a choice."
I tend to agree with you.
Since Go makes it so easy to run code concurrently, you should try to avoid concurrency in your API (which forces clients to use it concurrently). Instead, create a synchronous API, and then clients have the option to run it synchronously or concurrently.
This was discussed in a talk a couple years ago: Twelve Go Best Practices
Slide 26, in particular, shows code more like your first example.
I would view the net/http package as an exception because in this case, the concurrency is almost mandatory. If the package didn't use concurrency internally, the client code would almost certainly have to. For example, http.Client doesn't (to my knowledge) start any goroutines. It is only the server that does so.
In most cases, it's going to be one line of the code for the caller either way:
go Run() or StartGoroutine()
The synchronous API is no harder to use concurrently and gives the caller more options.
There is no 'right' answer because circumstances differ.
Obviously there are cases where an API might contain utilities, simple algorithms, data collections etc that would look odd if packaged up as goroutines.
Conversely, there are cases where it is natural to expect 'under-the-hood' concurrency, such as a rich IO library (http server being the obvious example).
For a more extreme case, consider you were to produce a library of plug-n-play concurrent services. Such an API consists of modules each having a well-described interface via channels. Clearly, in this case it would inevitably involve goroutines starting as part of the API.
One clue might well be the presence or absence of channels in the function parameters. But I would expect clear documentation of what to expect either way.

How do I wait for both threads to finish and files to be ready for reading without polling?

What I'd like to achieve is as follows (pseudocode):
f, t = select(files, threads)
if f
<read from files>
elsif t
<do something else>
end
Where select is a method similar to IO.select. But it seems unlikely to be possible.
The big picture is that I'm trying to write a program which has to perform several types of jobs. The idea was to pass job data using a database, but also to inform the program about new jobs using pipes (by sending just the type of the job), so that it wouldn't need to poll for jobs. So I was planning to create a loop waiting either for new notifications from pipes or for worker threads to finish. After a thread finishes, I check whether there was at least one notification about this particular type of job and run the worker thread again if needed. I'm not really sure what the best route to take here is, so if you've got suggestions I'd like to hear them.
Don't reinvent the wheel mate :) check out https://github.com/eventmachine/eventmachine (IO lib based on reactor pattern like node.js etc) or (perhaps preferably) https://github.com/celluloid/celluloid-io (IO lib based on actor pattern, better docs and active maintainers)
OPTION 1 - use EM or Celluloid to handle non-blocking sockets
EM and Celluloid are quite different, EM is reactor pattern ("same thing" as node.js, with a threadpool as workaround for blocking calls) and Celluloid is actor pattern (an actor thread pool).
Both can do non-blocking IO to/from a lot of sockets and delegate work to a lot of threads, depending on how you go about to do it. Both libs are very robust, efficient and battle tested, EM has more history but seems to have fallen slightly out of maintenance (https://www.youtube.com/watch?v=mPDs-xQhPb0), celluloid has nicer API and more active community (http://www.youtube.com/watch?v=KilbFPvLBaI).
Best advice I can give is to play with code samples that both projects provide and see what feels best. I'd go with celluloid for a new project, but that's a personal opinion - you may find that EM has more IO-related features (such as handling files, keyboard, unix sockets, ...)
OPTION 2 - use background job queues
I may have been misguided by the low level of your question :) Have you considered using some of the job queues available under ruby? There's a TON of decent and different options available, see https://www.ruby-toolbox.com/categories/Background_Jobs
OPTION 3 - DIY (not recommended)
There is a pure ruby implementation of EM, it uses IO selectables to handle sockets so it offers a pattern for what you're trying to do, check it out: https://github.com/eventmachine/eventmachine/blob/master/lib/em/pure_ruby.rb#L311 (see selectables handling).
However, given the amount of other options, hopefully you shouldn't need to resort to such low level coding.

Synchronous calls in akka / actor model

I've been looking into Akka lately and it looks like a great framework for building scalable servers on the JVM. However, most of the libraries on the JVM are blocking (e.g. JDBC), so don't you lose out on the performance benefits of using an event-based model because your threads will always be blocked? Does Akka do something to get around this? Or is it just something you have to live with until we get more non-blocking libraries on the JVM?
Have a look at CQRS, it greatly improves scalability by separating reads from writes. This means that you can scale your reads separately from your writes.
With the types of IO blocking issues you mentioned Scala provides a language embedded solution that matches perfectly: Futures. For example:
def expensiveDBQuery(key: Key) = Future {
  //...query the database
}
val dbResult: Future[Result] =
  expensiveDBQuery(...) //non-blocking call
The dbResult returns immediately from the function call. The Result will be available in the "Future". The cool part about a Future is that you can think about them like any old collection, except you can never call .size on the Future. Other than that all collection-ish functions (e.g. map, filter, foreach, ...) are fair game. Simply think of the dbResult as a list of Results. What would you do with such a list:
dbResult.map(_.getValues)
.filter(values => someTestOnValues(values))
...
That sequence of calls sets up a computation pipeline that is invoked whenever the Result is actually returned from the database. You can give a sequence of computing steps before the data has arrived. All asynchronously.

Inter-thread communication (worker threads)

I've created two threads A & B using the CreateThread Windows API. I'm trying to send data from thread A to B.
I know I can use an Event object and wait for it in another thread using the "WaitForSingleObject" method. All this event does is signal the thread. That's it! But how can I send data? Also, I don't want thread B to wait until thread A signals. It has its own job to do. I can't make it wait.
I can't find a Windows function that will allow me to send data to / from the worker thread and main thread referencing the worker thread either by thread ID or by the returned HANDLE. I do not want to introduce the MFC dependency in my project and would like to hear any suggestions as to how others would or have done in this situation. Thanks in advance for any help!
First of all, you should keep in mind that Windows provides a number of mechanisms to deal with threading for you: I/O Completion Ports, old thread pools and new thread pools. Depending on what you're doing any of them might be useful for your purposes.
As to "sending" data from one thread to another, you have a couple of choices. Windows message queues are thread-safe, and a thread (even if it doesn't have a window) can have a message queue, which you can post messages to using PostThreadMessage.
I've also posted code for a thread-safe queue in another answer.
As far as having the thread continue executing, but take note when a change has happened, the typical method is to have it call WaitForSingleObject with a timeout value of 0, then check the return value -- if it's WAIT_OBJECT_0, the Event (or whatever) has been set, so it needs to take note of the change. If it's WAIT_TIMEOUT, there's been no change, and it can continue executing. Either way, WaitForSingleObject returns immediately.
Since the two threads are in the same process (at least that's what it sounds like), then it is not necessary to "send" data. They can share it (e.g., a simple global variable). You do need to synchronize access to it via either an event, semaphore, mutex, etc.
Depending on what you are doing, it can be very simple.
Thread1Func() {
Set some global data
Signal semaphore to indicate it is available
}
Thread2Func() {
WaitForSingleObject to check/wait if data is available
use the data
}
If you are concerned with minimizing Windows dependencies, and assuming you are coding in C++, then I recommend using Boost.Threads, which is a pretty nice, Posix-like C++ threading interface. This will give you easy portability between Windows and Linux.
If you go this route, then use a mutex to protect any data shared across threads, and a condition variable (combined with the mutex) to signal one thread from the other.
Don't use a mutex when working within a single process, because it has more overhead (it is a kernel object that can be shared across processes). Place a critical section around your data and try to enter it (as Jerry Coffin did in his code for the thread-safe queue).
