Is NVIDIA DALI thread-safe when multiple threads run the same pipeline?

I have a function like this:
run() {
  pipe->SetExternalInput("raw_jpegs", data);
}
When I call run() from multiple threads, the results suggest that DALI is not thread-safe.
Or am I using DALI the wrong way?


Wrap blocking code into a Mono flatMap, is this still a non-blocking operation?

If I wrap blocking code in a flatMap, is the operation still non-blocking?
public Mono<String> foo() {
  return Mono.empty().flatMap(obj -> {
    try {
      Object temp = f.get(); // is the thread blocked at this point or not?
    } catch (Exception e) {
      throw new RuntimeException(e); // checked exceptions cannot escape the lambda
    }
    return Mono.just("test");
  });
}
I assumed that when I wrap blocking code in reactive code, the operation is still non-blocking. If I am wrong, please explain why.
flatMap doesn't create new threads for you. For example:
Mono.just("abc").flatMap(val -> Mono.just("cba")).subscribe();
All of the code above is executed by the thread that called subscribe. So if the mapper function contains a long blocking operation, the thread that called subscribe is blocked as well.
To make the operation asynchronous, you can move the subscription to another thread with subscribeOn(Schedulers.elastic()) (in Reactor 3.4 and later, Schedulers.boundedElastic() replaces the deprecated elastic scheduler):
Mono.just("abc").flatMap(val -> Mono.just("cba")).subscribeOn(Schedulers.elastic()).subscribe();
Mono and Flux do not create threads on their own. Some operators, such as interval, take a Scheduler as an extra argument, and others, such as subscribeOn, change the threading model altogether.
One more thing: in your example the mapper function is never going to be called, since you are applying flatMap to an empty Mono, which completes immediately without emitting a value.

Ruby thread synchronization

My process has two threads like the following:
@semaphore = Mutex.new
thread_a = Thread.new {
  loop do
    # some work
    # when the shutdown event is observed:
    @semaphore.synchronize {
      @thread_b_running = false
    }
  end
}
thread_b = Thread.new {
  while (@semaphore.synchronize { @thread_b_running }) do
    # thread_b's work
  end
}
Basically, thread_a and thread_b do some work in parallel, but when thread_a sees an event happen it needs to shut down thread_b. As you can see, right now I am doing this with a boolean protected by a mutex. I think this approach is not too bad performance-wise, since thread_b will almost always get the lock without waiting for it. However, since I have not written a lot of multithreaded code, I was wondering whether there is a better way of doing this.
If only one of the threads is writing the variable, there is no need for a mutex. So a better way in your example is just removing the mutex.

C++ memory management patterns for objects used in callback chains

A couple codebases I use include classes that manually call new and delete in the following pattern:
class Worker {
 public:
  static void DoWork(ArgT arg, std::function<void()> done) {
    (new Worker(std::move(arg), std::move(done)))->Start();
  }

 private:
  Worker(ArgT arg, std::function<void()> done)
      : arg_(std::move(arg)),
        done_(std::move(done)),
        latch_(2) {}  // The error-prone Latch interface isn't the point of this question. :)

  void Start() {
    Async1(<args>, [this]() { this->Method1(); });
  }
  void Method1() {
    StartParallel(<args>, [this]() { this->latch_.count_down(); });
    StartParallel(<other_args>, [this]() { this->latch_.count_down(); });
    latch_.then([this]() { this->Finish(); });
  }
  void Finish() {
    done_();
    // Note manual memory management!
    delete this;
  }

  ArgT arg_;
  std::function<void()> done_;
  Latch latch_;
};
Now, in modern C++, explicit delete is a code smell, as is, to some extent, delete this. However, I think this pattern (creating an object to represent a chunk of work managed by a callback chain) is fundamentally a good, or at least not a bad, idea.
So my question is, how should I rewrite instances of this pattern to encapsulate the memory management?
One option that I don't think is a good idea is storing the Worker in a shared_ptr: fundamentally, ownership is not shared here, so the overhead of reference counting is unnecessary. Furthermore, in order to keep a copy of the shared_ptr alive across the callbacks, I'd need to inherit from enable_shared_from_this, and remember to call that outside the lambdas and capture the shared_ptr into the callbacks. If I ever wrote the simple code using this directly, or called shared_from_this() inside the callback lambda, the object could be deleted early.
I agree that delete this is a code smell, and to a lesser extent delete on its own. But I think that here it is a natural part of continuation-passing style, which (to me) is itself something of a code smell.
The root problem is that the design of this API assumes unbounded control-flow: it acknowledges that the caller is interested in what happens when the call completes, but signals that completion via an arbitrarily-complex callback rather than simply returning from a synchronous call. Better to structure it synchronously and let the caller determine an appropriate parallelization and memory-management regime:
class Worker {
 public:
  static void DoWork(ArgT arg) {
    // Async1 is a mistake; fix it later. For now, synchronize explicitly.
    Latch async_done(1);
    Async1(<args>, [&]() { async_done.count_down(); });
    async_done.wait();  // assumes Latch has a blocking wait(); the lambda holds a reference to it

    Latch parallel_done(2);
    RunParallel([&]() { DoStuff(<args>); parallel_done.count_down(); });
    RunParallel([&]() { DoStuff(<other_args>); parallel_done.count_down(); });
    parallel_done.wait();
  }
};
On the caller side, it might look something like this:
Latch latch(tasks.size());
for (auto& task : tasks) {
  // Capture the latch by reference: counting down a copy would never release the caller.
  RunParallel([&]() { DoWork(<args>); latch.count_down(); });
}
latch.wait();
Here RunParallel can use std::thread or whatever other mechanism you like for dispatching parallel work.
The advantage of this approach is that object lifetimes are much simpler. The ArgT object lives for exactly the scope of the DoWork call. The arguments to DoWork live exactly as long as the closures containing them. This also makes it much easier to add return-values (such as error codes) to DoWork calls: the caller can just switch from a latch to a thread-safe queue and read the results as they complete.
The disadvantage of this approach is that it requires actual threading, not just boost::asio::io_service. (For example, the RunParallel calls within DoWork() can't block on waiting for the RunParallel calls from the caller side to return.) So you either have to structure your code into strictly-hierarchical thread pools, or you have to allow a potentially-unbounded number of threads.
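As a rough illustration of the synchronous structure described above, here is a minimal sketch using plain std::thread, where join() stands in for the latch. The body of DoWork (it just doubles its argument) and the RunAll helper are hypothetical placeholders, not part of the original code, and spawning one thread per task corresponds to the "potentially-unbounded number of threads" case rather than a thread pool.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the real work: computes arg + arg in two halves.
int DoWork(int arg) {
  int left = 0;
  int right = 0;
  // The two halves run in parallel; joining both replaces the inner latch.
  std::thread a([&left, arg] { left = arg; });
  std::thread b([&right, arg] { right = arg; });
  a.join();
  b.join();
  return left + right;  // safe to read: both threads have finished
}

// Caller side (hypothetical helper): one thread per task, results stored by index.
std::vector<int> RunAll(const std::vector<int>& tasks) {
  std::vector<int> results(tasks.size());
  std::vector<std::thread> threads;
  threads.reserve(tasks.size());
  for (std::size_t i = 0; i < tasks.size(); ++i) {
    // Each thread writes only its own slot, so no lock is needed.
    threads.emplace_back([&results, &tasks, i] { results[i] = DoWork(tasks[i]); });
  }
  assert(threads.size() == tasks.size());
  for (auto& t : threads) t.join();  // plays the role of latch.wait()
  return results;
}
```

Because every local object outlives the threads that reference it, nothing here needs new, delete, or a shared_ptr. Return values come back through the results vector, which is the simplest version of the "read the results as they complete" idea.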
Another view is that the delete this here is not a code smell. At most, it should be wrapped in a small library that detects when all the continuation callbacks have been destroyed without calling done_().
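One way such a detection helper might look is sketched below. DoneGuard, MakeCallback, and the on_dropped hook are hypothetical names invented for this illustration. The sketch assumes the callbacks run on a single thread (the called_ flag is not atomic), and it uses a shared_ptr only for the small guard object, not for the worker itself.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical helper shared by every continuation callback of one chain.
class DoneGuard {
 public:
  DoneGuard(std::function<void()> done, std::function<void()> on_dropped)
      : done_(std::move(done)), on_dropped_(std::move(on_dropped)) {}

  // Runs when the last callback holding the guard is destroyed.
  ~DoneGuard() {
    if (!called_ && on_dropped_) on_dropped_();  // chain died without calling done
  }

  DoneGuard(const DoneGuard&) = delete;
  DoneGuard& operator=(const DoneGuard&) = delete;

  // Invokes the completion callback at most once.
  void Done() {
    if (called_) return;
    called_ = true;
    if (done_) done_();
  }

 private:
  std::function<void()> done_;
  std::function<void()> on_dropped_;
  bool called_ = false;
};

// Builds a callback that keeps the guard alive for as long as the callback exists.
std::function<void()> MakeCallback(std::shared_ptr<DoneGuard> guard) {
  assert(guard != nullptr);
  return [guard = std::move(guard)] { guard->Done(); };
}
```

If every callback created by MakeCallback is destroyed without being invoked, the guard's destructor calls on_dropped, which is where a real library could log the leak or report an error.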


I came across some code that adds a timer with a timeout of 0:
EventMachine.add_timer(0) { ... }
Does this make sense? How can it be useful? Is it any different from using next_tick?
EventMachine.next_tick { ... }
Since I was curious myself, I took a quick look at the EventMachine source code, where I found this inside the event loop:
if @next_tick_queue && !@next_tick_queue.empty?
  add_timer(0) { signal_loopbreak }
end
This pretty much means that when you define a next_tick, EventMachine internally uses add_timer(0) { ... } for it.
The only difference might be the execution order; I'm not sure in which order the queued timers are currently executed.

Critical region for the threads of current team

I want a piece of code to be critical only for the current team of threads, not globally critical.
How can I achieve this?
spawn threads #x
// code
Here the OpenMP critical construct makes every thread wait before entering the critical region. However, I have no problem with two threads being inside the critical region at once, as long as they were not spawned at the same time (that is, they belong to different teams). I want a solution for OpenMP.
You cannot do that with #pragma omp critical, but you may use OpenMP locks:
omp_lock_t lock;               // one lock instance per invocation
omp_init_lock(&lock);
#pragma omp parallel           // spawn the team of threads
{
    omp_set_lock(&lock);       // start of the team-local critical region
    // code
    omp_unset_lock(&lock);     // end of the critical region
}
omp_destroy_lock(&lock);
Since each invocation of quick-sort will declare its own lock object, it will give you what you want.
However, from your pseudo-code it seems that you will never have two different thread teams running at the same time, unless there are OpenMP parallel regions in other functions. If the only code that has a parallel region ("spawns threads") is in quick-sort, you would need to have a recursive call to that function from inside the parallel region, which you do not.
