OpenMP split-join model

I am parallelizing several separate for-loops using OpenMP. While debugging in gdb, I found that multiple threads are created when execution reaches the first parallel region, and that they only exit at the end of the whole program. This is contrary to my understanding of OpenMP's split-join (fork-join) model, in which the threads should join back into the master thread and terminate at the end of each parallel region rather than at the end of the whole program.
Am I wrong?
Thanks!

It is implementation-specific, but most likely the implementation keeps the worker threads in a thread pool between parallel regions instead of destroying them at each join.

Related

More OpenMP threads running than set via omp_set_num_threads()?

I have a program that loops through a sequence of multiple for-loops (with other things in between). Some of the for-loops are parallelized using "#pragma omp parallel for". The number of threads is set via omp_set_num_threads() at the beginning.
It appears, however, that some of those for-loops make OpenMP start a second team of threads, and the process ends up with twice as many OpenMP threads as were set with omp_set_num_threads().
What could cause this?

Is there a way to end idle threads in GNU OpenMP?

I use OpenMP for parallel sorting at start of my program. Once data is loaded and sorted, the program runs as a daemon and OpenMP is not used any more. Is there a way to turn off the idle threads created by OpenMP? omp_set_num_threads() doesn't affect the idle threads which have already been created for a task.
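Please look up OMP_WAIT_POLICY, which has been part of the specification since OpenMP 3.0 [https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fWAIT_005fPOLICY.html]. Setting it to passive asks waiting threads to sleep rather than spin.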
There are non-portable alternatives like GOMP_SPINCOUNT if your OpenMP implementation isn't recent enough. I recall from OpenMP specification discussions that at least Intel, IBM, Cray, and Oracle support their own implementation of this feature already.
I don't believe there is a way to trigger the threads' destruction. Modern OpenMP implementations tend to keep threads around in a pool to speed up starting future parallel sections.
In your case I would recommend a two-program solution (one parallel program to sort and one serial program for the daemon). How you communicate the data between them is up to you. You could do something simple like writing it to a file and then reading it again. This may not be as slow as it sounds, since a modern Linux distribution might keep that file in memory in the file cache.
If you really want to be sure it stays in memory, you could launch the two processes simultaneously and allow them to share memory and allow the first parallel sort process to exit when it is done.
In theory, OpenMP has an implicit synchronization at the end of the "pragma" clauses. So, when the OpenMP parallel work ends, all the threads are deleted. You don't need to kill them or free them: OpenMP does that automatically.
Maybe "omp_get_num_threads()" is telling you the configuration of the program, not the number of active threads. I mean: if you set the number of threads to 4, OpenMP will tell you that the configuration is "4 threads", but this does not mean that there are actually 4 threads running.

Ruby multithreading Performance issues

I am building a Ruby application. I have a set of images that I want to greyscale. My code used to be like this:
def Tools.grayscale_all_frames(frames_dir,output_dir)
  number_of_frames = get_frames_count(frames_dir)
  img_processor = ImageProcessor.new(frames_dir)
  create_dir(output_dir)
  for i in 1..number_of_frames
    img_processor.load_image(frames_dir+"/frame_%04d.png"%+i)
    img_processor.greyscale_image
    img_processor.save_image_in_dir(output_dir,"frame_%04d"%+i)
  end
end
After threading the code:
def Tools.greyscale_all_frames_threaded(frames_dir,output_dir)
  number_of_frames = get_frames_count(frames_dir)
  img_processor = ImageProcessor.new(frames_dir)
  create_dir(output_dir)
  greyscale_frames_threads = []
  for frame_index in 1..3
    greyscale_frames_threads << Thread.new(frame_index) { |frame_number|
      puts "Loading Image #{frame_number}"
      img_processor.load_image(frames_dir+"/frame_%04d.png"%+frame_number)
      img_processor.greyscale_image
      img_processor.save_image_in_dir(output_dir,"frame_%04d"%+frame_number)
      puts "Greyscaled Image #{frame_number}"
    }
  end
  puts "Starting Threads"
  greyscale_frames_threads.each { |thread| thread.join }
end
What I expected is a thread being spawned for each image. I have 1000 images at a resolution of 1920*1080. This is how I see things: I have an array of threads, and I call .join on each of them. So join will take all the threads and start them, one after the other? Does that mean that it will wait until thread 1 is done and then start thread 2? What is the point of multithreading then?
What I want is this:
Run all the threads at the same time and not one after the other. So mathematically, it will finish all the 1000 frames in the same time it will take to finish 1 frame, right?
Also, can somebody explain to me what .join does?
From my understanding .join will stop the main thread until your thread(s) is or are done?
If you don't use .join, then the thread will run in the background and the main thread will just continue.
So what is the point of using .join? I want my main thread to continue running and have the other threads in the background doing stuff?
Thanks for any help/clarification!!
Finishing all 1000 frames in the time it takes to finish one is only possible if you have 1000 CPU cores and massive amounts of RAM.
The point of join is not to start the thread, but to wait until the thread has finished. So calling join on an array of threads is a common pattern for waiting for them all to finish.
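To make that concrete, here is a minimal sketch. The sleep call is only a stand-in for real per-frame work, and the variable names are made up for illustration. It shows that the threads are already alive before join is ever called, and that join merely waits for them to finish.

```ruby
# Thread-safe log of what each thread did (Queue is safe to share).
events = Queue.new

threads = (1..3).map do |n|
  Thread.new(n) do |frame_number|
    events << "started #{frame_number}"   # the thread exists as soon as Thread.new returns
    sleep 0.1                             # assumption: stand-in for the real work
    events << "finished #{frame_number}"
  end
end

# No join has been called yet, but the threads already exist and are alive.
alive_before_join = threads.any?(&:alive?)

# join only blocks the calling thread until each thread has finished.
threads.each(&:join)
alive_after_join = threads.any?(&:alive?)

log = []
log << events.pop until events.empty?
puts log
puts "alive before join: #{alive_before_join}, after join: #{alive_after_join}"
```

The order of the "started" and "finished" lines can differ from run to run, which is exactly the scheduling nondeterminism discussed in this thread.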
Explaining all of this and clarifying your misconception requires digging a little deeper. At the C/assembler level, most modern OSes (Windows, Mac, Linux, and some others) use a preemptive scheduler. If you have only one core, two programs running in parallel is a complete illusion. In reality, the kernel is switching between the two every few milliseconds, giving all of us slow-processing humans the illusion of parallel processing.
Newer CPUs often have more than one core. The most powerful CPUs today can go up to (I think) 16 real cores + 16 hyperthreaded cores. This means that you could actually run 32 tasks completely in parallel. But even this does not ensure that if you start 32 threads they will all finish at the same time.
Because of competition for resources that are shared between cores (some cache, all the RAM, harddrive, network card, etc.), and the essentially random nature of preemptive scheduling, the amount of time your thread takes can be estimated in a certain range, but not exactly.
Unfortunately, all of this breaks down when you get to Ruby. Because of some hairy internal details about the threading model and compatibility, only one thread can execute Ruby code at a time. So, if your image processing is done in C, happy joy joy. If it's written in Ruby, well, all the threads in the world aren't going to help you now.
To be able to actually run Ruby code in parallel, you have to use fork. fork is only available on Linux and Mac, and not Windows, but you can think of it as a fork in a road. One process goes in, two processes come out. Multiple processes can run on all your different cores at once.
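A rough sketch of the fork approach, assuming a Unix-like system where Process.fork is available. The slow_square method is a hypothetical stand-in for CPU-heavy Ruby code, and a pipe is just one of several ways to get a result back from a child process.

```ruby
# Hypothetical stand-in for CPU-heavy Ruby code.
def slow_square(n)
  n * n
end

# Start one child process per job; each child sends its result back over a pipe.
children = (1..4).map do |n|
  reader, writer = IO.pipe
  pid = fork do
    reader.close                       # the child only writes
    writer.write(slow_square(n).to_s)
    writer.close
    exit!(0)                           # skip at_exit handlers inherited from the parent
  end
  writer.close                         # the parent only reads
  [pid, reader]
end

# Collect the results in job order and reap each child.
results = children.map do |pid, reader|
  value = reader.read.to_i             # blocks until the child closes its end
  reader.close
  Process.wait(pid)
  value
end

puts results.inspect   # prints [1, 4, 9, 16]
```

Each child is a separate process with its own interpreter, so the children can run on different cores at the same time. The cost is that they do not share memory, which is why the results have to be passed back explicitly.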
So, take @Stefan's advice: use a queue and a number of worker threads equal to the number of CPU cores. And don't expect so much of your computer. Now you know why ;).
So join will take all the threads and start them, one after the other?
No, the threads are started when invoking Thread.new. It creates a new thread and executes the given block within that thread.
join will stop the main thread until your thread(s) is or are done?
Yes, it will suspend execution of the calling thread until the receiver (each of your threads) exits.
So what is the point of using join?
Sometimes you want to start some tasks in parallel but you have to wait for each task to finish before you can continue.
I want my main thread to continue running and have the other threads in the background doing stuff
Then don't call join.
After all it's not a good idea to start 1,000 threads in parallel. Your machine is only capable of running as many tasks in parallel as CPUs are available. So instead of starting 1,000 threads, place your jobs / tasks in a queue / pool and process them using some worker threads (number of CPUs = number of workers).
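The queue-plus-workers idea could look roughly like this. It is a sketch, not a drop-in replacement for the code in the question: process_frame is a hypothetical stand-in for loading, greyscaling and saving one image, and the :done marker is just one way to tell the workers to stop.

```ruby
require "etc"

# Hypothetical stand-in for load_image / greyscale_image / save_image_in_dir.
def process_frame(frame_number)
  format("frame_%04d", frame_number)
end

# Fill the job queue with frame numbers.
jobs = Queue.new
(1..20).each { |i| jobs << i }

# One worker per CPU, plus one stop marker per worker.
worker_count = Etc.nprocessors
worker_count.times { jobs << :done }

results = Queue.new
workers = Array.new(worker_count) do
  Thread.new do
    # Each worker keeps pulling jobs until it sees a stop marker.
    while (job = jobs.pop) != :done
      results << process_frame(job)
    end
  end
end
workers.each(&:join)

processed = []
processed << results.pop until results.empty?
puts "#{processed.size} frames processed by #{worker_count} workers"
```

Note that each worker should use its own image processor object if the real processor keeps per-image state. Sharing one instance across threads, as in the question's threaded version, would let the threads overwrite each other's loaded image.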

Multithreading within a hash of hashes ruby

I have a code snippet like this
myhash.each_value { |subhash|
  subhash['key'].each { |subsubhash|
    # statement that modifies the subsubhash and takes about 0.07 s to execute
  }
}
This loop runs 100+ times and, needless to say, slows down my application tremendously (it takes about 7 seconds to run).
Any pointers on how to make this faster? I have no control over the really expensive statement. Is there a way I can multi thread within the loop so the statements can be executed in parallel?
threads = []
myhash.each_value { |subhash|
  threads << Thread.start do
    subhash['key'].each { |subsubhash|
      threads << Thread.start do
        # statement that modifies the subsubhash and takes about 0.07 s to execute
      end
    }
  end
}
threads.each { |t| t.join }
Note that MRI 1.8.x doesn't use real threads, but rather green ones which do not correspond to real OS threads. However, if you use JRuby you might see a performance boost as it supports real threads.
You could run each subhash processing loop in a separate thread but whether or not this results in a performance boost may depend on either (1) the Ruby interpreter you are using or (2) whether the innermost block is IO-bound or compute-bound.
The reason for #1 is that some Ruby interpreters (such as CRuby/MRI 1.8) use green threads, which typically do not benefit from any actual parallel processing, even on multicore machines. However, YARV and JRuby both use native OS threads (JRuby even for 1.8, since the JVM uses native threads), so if you can target those interpreters specifically then you might see an improvement. Keep in mind that YARV's global interpreter lock still lets only one thread execute Ruby code at a time, so for compute-bound Ruby code the gain there is limited.
The reason for #2 is that if the innermost block is IO-bound then even a green thread based interpreter might improve performance since most OSes do a good job of scheduling threads around blocking IO calls. If the block is strictly compute-bound then only a native-thread based interpreter will likely show a performance boost using multiple threads.
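A small experiment can illustrate the IO-bound case. This is only a sketch: fake_io_call is a hypothetical stand-in, and sleep is used to imitate a blocking call during which the interpreter is free to run other threads.

```ruby
require "benchmark"

# Hypothetical stand-in for a blocking IO call (network request, disk read, ...).
def fake_io_call
  sleep 0.2
end

# Four calls one after the other.
serial = Benchmark.realtime { 4.times { fake_io_call } }

# The same four calls, each in its own thread, so the waits overlap.
threaded = Benchmark.realtime do
  Array.new(4) { Thread.new { fake_io_call } }.each(&:join)
end

puts format("serial: %.2fs, threaded: %.2fs", serial, threaded)
```

If the expensive statement in the question spends its 0.07 s waiting on something external, the threaded timing should come out well below the serial one. If it is pure Ruby computation, expect little or no difference on MRI.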

Can Ruby Fibers be Concurrent?

I'm trying to get some speed up in my program and I've been told that Ruby Fibers are faster than threads and can take advantage of multiple cores. I've looked around, but I just can't find how to actually run different fibers concurrently. With threads you can do this:
threads = []
threads << Thread.new {Do something}
threads << Thread.new {Do something}
threads.each {|thread| thread.join}
I can't see how to do something like this with fibers. All I can find is yield and resume which seems like just a bunch of starting and stopping between the fibers. Is there a way to do true concurrency with fibers?
No, you cannot do concurrency with Fibers. Fibers simply aren't a concurrency construct, they are a control-flow construct, like Exceptions. That's the whole point of Fibers: they never run in parallel, they are cooperative and they are deterministic. Fibers are coroutines. (In fact, I never understood why they aren't simply called Coroutines.)
The only concurrency construct in Ruby is Thread.
There seems to be a terminology issue between concurrency and parallelism.
I just can't find how to actually run different fibers concurrently.
I think you actually talk about parallelism, not about concurrency:
Concurrency is when two tasks can start, run, and complete in overlapping time periods. It doesn't necessarily mean they'll ever both be running at the same instant, e.g. multitasking on a single-core machine. Parallelism is when tasks literally run at the same time, e.g. on a multicore processor.
Quoting: Concurrency vs Parallelism - What is the difference?.
Also well illustrated here:
http://concur.rspace.googlecode.com/hg/talk/concur.html#title-slide
So to answer the question:
Fibers are primitives for implementing light weight cooperative concurrency in Ruby.
http://www.ruby-doc.org/core-2.1.1/Fiber.html
Which doesn't mean it can run in parallel.
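Here is a small sketch of what cooperative means in practice. The fiber names and log messages are made up for illustration. Each fiber runs only when it is resumed and hands control back explicitly with Fiber.yield, so the interleaving is fully determined by the code.

```ruby
log = []

ping = Fiber.new do
  log << "ping 1"
  Fiber.yield          # hand control back to whoever called resume
  log << "ping 2"
end

pong = Fiber.new do
  log << "pong 1"
  Fiber.yield
  log << "pong 2"
end

# The caller decides which fiber runs next; nothing ever runs simultaneously.
2.times do
  ping.resume
  pong.resume
end

puts log.inspect   # prints ["ping 1", "pong 1", "ping 2", "pong 2"]
```

The two fibers make progress in overlapping time periods, but only one of them is ever executing at a given moment, and the order is the same on every run.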
If you want true concurrency you'll want to use threads with JRuby (which doesn't actually have fibers, it only has threads, one per fiber).
Another option is to "fork" to new processes, which could run things in true parallel on MRI.

Resources