Ruby multithreading Performance issues - ruby

I am building a Ruby application. I have a set of images that I want to greyscale. My code used to be like this:
def Tools.grayscale_all_frames(frames_dir, output_dir)
  number_of_frames = get_frames_count(frames_dir)
  img_processor = ImageProcessor.new(frames_dir)
  create_dir(output_dir)
  for i in 1..number_of_frames
    img_processor.load_image(frames_dir + "/frame_%04d.png" % i)
    img_processor.greyscale_image
    img_processor.save_image_in_dir(output_dir, "frame_%04d" % i)
  end
end
After threading the code:
def Tools.greyscale_all_frames_threaded(frames_dir, output_dir)
  number_of_frames = get_frames_count(frames_dir)
  img_processor = ImageProcessor.new(frames_dir)
  create_dir(output_dir)
  greyscale_frames_threads = []
  for frame_index in 1..3
    greyscale_frames_threads << Thread.new(frame_index) { |frame_number|
      puts "Loading Image #{frame_number}"
      img_processor.load_image(frames_dir + "/frame_%04d.png" % frame_number)
      img_processor.greyscale_image
      img_processor.save_image_in_dir(output_dir, "frame_%04d" % frame_number)
      puts "Greyscaled Image #{frame_number}"
    }
  end
  puts "Starting Threads"
  greyscale_frames_threads.each { |thread| thread.join }
end
What I expected is a thread being spawned for each image. I have 1000 images. The resolution is 1920*1080. So how I see things is like this. I have an array of threads that I call .join on it. So join will take all the threads and start them, one after the other? Does that mean that it will wait until thread 1 is done and then start thread 2? What is the point of multithreading then?
What I want is this:
Run all the threads at the same time and not one after the other. So mathematically, it will finish all the 1000 frames in the same time it will take to finish 1 frame, right?
Also, can somebody explain to me what .join does?
From my understanding .join will stop the main thread until your thread(s) is or are done?
If you don't use .join, then the thread will run the background and the main thread will just continue.
So what is the point of using .join? I want my main thread to continue running and have the other threads in the background doing stuff?
Thanks for any help/clarification!!

Finishing all 1000 frames in the time it takes to finish one is only possible if you have 1000 CPU cores and massive amounts (read: hundreds and hundreds) of RAM.
The point of join is not to start the thread, but to wait until the thread has finished. So calling join on an array of threads is a common pattern for waiting for them all to finish.
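A minimal sketch of that pattern, using only Ruby's built-in Thread and Mutex classes. The sleep call and the variable names are placeholders of my own, not part of the code in the question:

```ruby
# Each thread begins running as soon as Thread.new is called;
# join only makes the calling thread wait until it has finished.
results = []
lock = Mutex.new

threads = (1..3).map do |n|
  Thread.new(n) do |frame_number|
    sleep 0.1                                  # placeholder for real work
    lock.synchronize { results << frame_number }
  end
end

threads.each(&:join)                           # wait for all three threads
puts "finished frames: #{results.sort.inspect}"
```

Without the join loop, the main thread could reach the puts (or exit the program) before any worker had recorded its result.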
Explaining all of this, and clarifying your misconception, requires digging a little deeper. At the C/Assembler level, most modern OSes (Win, Mac, Linux, and some others) use a preemptive scheduler. If you have only one core, two programs running in parallel is a complete illusion. In reality, the kernel is switching between the two every few milliseconds, giving all of us slow-processing humans the illusion of parallel processing.
In newer, more modern CPUs, there is often more than one core. The most powerful CPUs today can go up to (I think) 16 real cores + 16 hyperthreaded cores (see here). This means that you could actually run 32 tasks completely in parallel. But even this does not ensure that if you start 32 threads they will all finish at the same time.
Because of competition for resources that are shared between cores (some cache, all the RAM, harddrive, network card, etc.), and the essentially random nature of preemptive scheduling, the amount of time your thread takes can be estimated in a certain range, but not exactly.
Unfortunately, all of this breaks down when you get to Ruby. Because of some hairy internal details about the threading model and compatibility, only one thread can execute Ruby code at a time. So, if your image processing is done in C, happy joy joy. If it's written in Ruby, well, all the threads in the world aren't going to help you now.
To be able to actually run Ruby code in parallel, you have to use fork. fork is only available on Linux and Mac, and not Windows, but you can think of it as a fork in a road. One process goes in, two processes come out. Multiple processes can run on all your different cores at once.
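A rough sketch of the fork approach, assuming a Unix-like system. The square_sum method is a hypothetical stand-in for CPU-heavy Ruby code, and each child sends its result back to the parent through a pipe:

```ruby
# Assumes Linux or macOS; Process.fork is not implemented on Windows.
# square_sum is a hypothetical stand-in for CPU-heavy Ruby code.
def square_sum(limit)
  (1..limit).sum { |x| x * x }
end

readers = [1_000, 2_000].map do |limit|
  reader, writer = IO.pipe
  fork do
    reader.close                          # the child only writes
    writer.write(square_sum(limit).to_s)
    writer.close
    exit!(0)                              # leave the child without running exit hooks
  end
  writer.close                            # the parent only reads
  reader
end

# Read each child's result, then reap the child processes.
totals = readers.map do |reader|
  value = reader.read.to_i
  reader.close
  value
end
Process.waitall
puts totals.inspect
```

Each child is a separate process with its own interpreter, so the children can run on different cores at the same time. The price is that results have to be passed back explicitly, here as text over a pipe.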
So, take @Stefan's advice: use a queue and a number of worker threads equal to the number of CPU cores. And don't expect so much of your computer. Now you know why ;).

So join will take all the threads and start them, one after the other?
No, the threads are started when invoking Thread.new. It creates a new thread and executes the given block within that thread.
join will stop the main thread until your thread(s) is or are done?
Yes, it will suspend execution until the receiver (each of your threads) exits.
So what is the point of using join?
Sometimes you want to start some tasks in parallel but you have to wait for each task to finish before you can continue.
I want my main thread to continue running and have the other threads in the background doing stuff
Then don't call join.
After all it's not a good idea to start 1,000 threads in parallel. Your machine is only capable of running as many tasks in parallel as CPUs are available. So instead of starting 1,000 threads, place your jobs / tasks in a queue / pool and process them using some worker threads (number of CPUs = number of workers).
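One way the queue-plus-workers idea could look, as a sketch only. The process_frame method is a hypothetical placeholder for the load, greyscale and save steps, and the frame count of 20 is arbitrary:

```ruby
require "etc"

# process_frame is a hypothetical placeholder for load + greyscale + save.
def process_frame(frame_number)
  format("frame_%04d", frame_number)
end

jobs = Queue.new
(1..20).each { |n| jobs << n }
jobs.close                          # pop returns nil once the closed queue is empty

worker_count = Etc.nprocessors      # one worker per CPU core
done = Queue.new                    # thread-safe place to collect results

workers = Array.new(worker_count) do
  Thread.new do
    # Each worker keeps taking frames until the queue runs dry.
    while (frame_number = jobs.pop)
      done << process_frame(frame_number)
    end
  end
end

workers.each(&:join)
puts "processed #{done.size} frames with #{worker_count} workers"
```

The number of threads stays fixed no matter how many frames are queued, so 1,000 images never means 1,000 threads.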

Related

Benefits of green threads vs a simple loop

Is there any benefit to using green threads / lightweight threads over a simple loop or sequential code, assuming only non-blocking operations are used in both?
for i := 0; i < 5; i++ {
    go doSomethingExpensive() // using golang example
}
// versus
for i := 0; i < 5; i++ {
    doSomethingExpensive()
}
As far as I can think of
- green threads help avoid a bit of callback hell on async operations
- allow scheduling of M green threads on N kernel threads
But
- add a bit of complexity and performance overhead by requiring a scheduler
- easier cross thread communication when the language supports it and the execution was split to different cpu's (otherwise sequential code is simpler)
No, green threads have no performance benefits at all.
If the threads are performing non-blocking operations:
Multiple threads have no benefits if you have only one physical core (since the same core has to execute everything, threads only make things slower because of the overhead)
Using up to as many threads as you have CPU cores gives a performance benefit, since multiple cores can execute your threads physically in parallel (see Play! framework)
Green threads have no benefits, since a sub-scheduler runs them all on the same single real thread, so effectively green threads == 1 thread
If the threads are performing blocking operations, things may look different:
multiple threads make sense, since one thread can be blocked, but the others can go on, so blocking slows down only one thread
you can avoid the callback-hell by just implementing your partially blocking process as one thread. Since you're free to block from one thread while e.g. waiting for IO, you get much simpler code.
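A rough Ruby illustration of the blocking case. The blocking_call method is a made-up stand-in that sleeps instead of waiting on real I/O, and the timings depend on the machine:

```ruby
# blocking_call is a stand-in: sleep plays the role of waiting on I/O.
def blocking_call
  sleep 0.2
end

def elapsed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# Four threads block at the same time, so the waits overlap.
threaded = elapsed { Array.new(4) { Thread.new { blocking_call } }.each(&:join) }

# The same four calls one after another: the waits add up.
sequential = elapsed { 4.times { blocking_call } }

puts format("threaded: %.2fs, sequential: %.2fs", threaded, sequential)
```

Each thread body is still plain linear code with no callbacks, yet a blocked thread does not hold up the others.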
Green threads
Green threads are not real threads by design, so they won't be split amongst multiple CPUs and are not intended to work in parallel. This can give a false understanding that you can avoid synchronization - however once you upgrade to real threads the lack of proper synchronization will introduce a good set of issues.
Green threads were widely used in early Java days, when the JVM did not support real OS threads. A variant of green threads, called Fibers, is part of the Windows operating system, and e.g. MS SQL Server uses them heavily to handle various blocking scenarios without the heavy overhead of using real threads.
You can choose not only amongst green threads and real threads, but may also consider continuations (https://www.playframework.com/documentation/1.3.x/asynchronous)
Continuations give you the best of both worlds:
your code reads as if it were linear, with no callback hell
in reality the code is executed by real threads, however if a thread is getting blocked it suspends its execution and can switch to executing other code. Once the blocking condition signals, the thread can switch back and continue your code.
This approach is quite resource friendly. Play! framework uses as many threads as CPU cores you have (4-8) but beats all high-end Java application servers in terms of performance.

Preventing Windows from changing process affinity

I have a multithreaded code that I want to run on all 4 cores that my processor has. I.e. I create four threads, and I want each of them to run on a separate core.
What happens is that it starts running on four cores, but occasionally would switch to only three cores. The only things running are the OS and my exe. This is somewhat disappointing, since it decreases performance by a quarter, which is significant enough for me.
The process affinity that I see in Task Manager allows the process to use any core. I tried restricting thread affinities, but it didn't help. I also tried increasing the priority of the process, but it did not help the case either.
So the question is, is there any way to force Windows to keep it running on all four cores? If this is not possible, can I reduce the frequency of these interruptions? Thanks!
This is not an issue of affinity unless I am very much mistaken. Certainly the system will not restrict your process to affinity with a specific set of processors. Some other program in the system would have to do that, if indeed that is happening.
Much more likely however is that, simply, there is another thread that is ready to run that the system is scheduling in a round-robin fashion. You have four threads that are always ready to run. If there is another thread that is ready to run, it will get its turn. Now there are 5 threads sharing 4 processors. When the other thread is running, only 3 of yours are able to run.
If you want to be sure that such other threads won't run then you need to do one of the following:
Stop running the other program that wants to use CPU resource.
Make the relative thread priorities such that your threads always run in preference to the other thread.
Now, of these options, the first is to be preferred. If you prioritize your threads above others, then the other threads don't get to run at all. Is that really what you want to happen?
In the question you say that there are no other processes running. If that is the case, and nobody is meddling with processor affinity, and only a subset of your threads are executing, then the only conclusion is that not all of your threads are ready to run and have work to do. That might happen if you, for instance, join your threads at the end of one part of work, before continuing on to the next.
Perhaps the next step for you is to narrow things down a little. Use a tool like Process Explorer to diagnose which threads are actually running.
If this is Windows, try SetThreadAffinityMask():
http://msdn.microsoft.com/en-us/library/windows/desktop/ms686247(v=vs.85).aspx
I would assume that if you only set a single bit, then that forces the thread to run only on the selected processor (core).
Other process / thread functions:
http://msdn.microsoft.com/en-us/library/windows/desktop/ms684847(v=vs.85).aspx
I use a Windows video program, and it's able to keep all the cores running at near max when rendering video.

Does the Task Parallel Library (or PLINQ) take other processes into account?

In particular, I'm looking at using TPL to start (and wait for) external processes. Does the TPL look at total machine load (both CPU and I/O) before deciding to start another task (hence -- in my case -- another external process)?
For example:
I've got about 100 media files that need to be encoded or transcoded (e.g. from WAV to FLAC or from FLAC to MP3). The encoding is done by launching an external process (e.g. FLAC.EXE or LAME.EXE). Each file takes about 30 seconds. Each process is mostly CPU-bound, but there's some I/O in there. I've got 4 cores, so the worst case (transcoding by piping the decoder into the encoder) still only uses 2 cores. I'd like to do something like:
Parallel.ForEach(sourceFiles,
    sourceFile =>
        TranscodeUsingPipedExternalProcesses(sourceFile));
Will this kick off 100 tasks (and hence 200 external processes competing for the CPU)? Or will it see that the CPU's busy and only do 2-3 at a time?
You're going to run into a couple of issues here. The starvation avoidance mechanism of the scheduler will see your tasks as blocked as they wait on processes. It will find it hard to distinguish between a deadlocked thread and one simply waiting for a process to complete. As a result it may schedule new tasks if your tasks run for a long time (see below). The hill-climbing heuristic should take into account the overall load on the system, both from your application and others. It simply tries to maximize work done, so it will add more work until the overall throughput of the system stops increasing and then it will back off. I don't think this will affect your application but the starvation avoidance issue probably will.
You can find more detail as to how this all works in Parallel Programming with Microsoft®.NET, Colin Campbell, Ralph Johnson, Ade Miller, Stephen Toub (an earlier draft is online).
"The .NET thread pool automatically manages the number of worker
threads in the pool. It adds and removes threads according to built-in
heuristics. The .NET thread pool has two main mechanisms for injecting
threads: a starvation-avoidance mechanism that adds worker
threads if it sees no progress being made on queued items and a hill-climbing
heuristic that tries to maximize throughput while using as
few threads as possible.
The goal of starvation avoidance is to prevent deadlock. This kind
of deadlock can occur when a worker thread waits for a synchronization
event that can only be satisfied by a work item that is still pending
in the thread pool’s global or local queues. If there were a fixed
number of worker threads, and all of those threads were similarly
blocked, the system would be unable to ever make further progress.
Adding a new worker thread resolves the problem.
A goal of the hill-climbing heuristic is to improve the utilization
of cores when threads are blocked by I/O or other wait conditions
that stall the processor. By default, the managed thread pool has one
worker thread per core. If one of these worker threads becomes
blocked, there’s a chance that a core might be underutilized, depending
on the computer’s overall workload. The thread injection logic
doesn’t distinguish between a thread that’s blocked and a thread
that’s performing a lengthy, processor-intensive operation. Therefore,
whenever the thread pool’s global or local queues contain pending
work items, active work items that take a long time to run (more than
a half second) can trigger the creation of new thread pool worker
threads.
The .NET thread pool has an opportunity to inject threads every
time a work item completes or at 500 millisecond intervals, whichever
is shorter. The thread pool uses this opportunity to try adding threads
(or taking them away), guided by feedback from previous changes in
the thread count. If adding threads seems to be helping throughput,
the thread pool adds more; otherwise, it reduces the number of
worker threads. This technique is called the hill-climbing heuristic.
Therefore, one reason to keep individual tasks short is to avoid
“starvation detection,” but another reason to keep them short is to
give the thread pool more opportunities to improve throughput by
adjusting the thread count. The shorter the duration of individual
tasks, the more often the thread pool can measure throughput and
adjust the thread count accordingly.
To make this concrete, consider an extreme example. Suppose
that you have a complex financial simulation with 500 processor-intensive
operations, each one of which takes ten minutes on average
to complete. If you create top-level tasks in the global queue for each
of these operations, you will find that after about five minutes the
thread pool will grow to 500 worker threads. The reason is that the
thread pool sees all of the tasks as blocked and begins to add new
threads at the rate of approximately two threads per second.
What’s wrong with 500 worker threads? In principle, nothing, if
you have 500 cores for them to use and vast amounts of system
memory. In fact, this is the long-term vision of parallel computing.
However, if you don’t have that many cores on your computer, you are
in a situation where many threads are competing for time slices. This
situation is known as processor oversubscription. Allowing many
processor-intensive threads to compete for time on a single core adds
context switching overhead that can severely reduce overall system
throughput. Even if you don’t run out of memory, performance in this
situation can be much, much worse than in sequential computation.
(Each context switch takes between 6,000 and 8,000 processor cycles.)
The cost of context switching is not the only source of overhead.
A managed thread in .NET consumes roughly a megabyte of stack
space, whether or not that space is used for currently executing functions.
It takes about 200,000 CPU cycles to create a new thread, and
about 100,000 cycles to retire a thread. These are expensive operations.
As long as your tasks don’t each take minutes, the thread pool’s
hill-climbing algorithm will eventually realize it has too many threads
and cut back on its own accord. However, if you do have tasks that
occupy a worker thread for many seconds or minutes or hours, that
will throw off the thread pool’s heuristics, and at that point you
should consider an alternative.
The first option is to decompose your application into shorter
tasks that complete fast enough for the thread pool to successfully
control the number of threads for optimal throughput.
A second possibility is to implement your own task scheduler
object that does not perform thread injection. If your tasks are of long
duration, you don’t need a highly optimized task scheduler because
the cost of scheduling will be negligible compared to the execution
time of the task. MSDN® developer program has an example of a
simple task scheduler implementation that limits the maximum degree
of concurrency. For more information, see the section, “Further Reading,”
at the end of this chapter.
As a last resort, you can use the SetMaxThreads method to
configure the ThreadPool class with an upper limit for the number
of worker threads, usually equal to the number of cores (this is the
Environment.ProcessorCount property). This upper limit applies for
the entire process, including all AppDomains."
The short answer is: no.
Internally, the TPL uses the standard ThreadPool to schedule its tasks. So you're actually asking whether the ThreadPool takes machine load into account and it doesn't. The only thing that limits the number of tasks simultaneously running is the number of threads in the thread pool, nothing else.
Is it possible to have the external processes report back to your application once they are ready? In that case you do not have to wait for them (keeping threads occupied).
Ran a test using TPL/ThreadPool to schedule a great number of tasks doing looped spins. Using an external app I've loaded one of the cores to 100% using proc affinity. The number of active tasks never decreased.
Even better, I ran multiple instances of the same CPU intensive .NET TPL enabled app. The number of threads for all the apps was the same, and never went below the number of cores, even though my machine was barely usable.
So theory aside, TPL uses the number of cores available, but never checks on their actual load. A very poor implementation in my opinion.

Can Ruby Fibers be Concurrent?

I'm trying to get some speed up in my program and I've been told that Ruby Fibers are faster than threads and can take advantage of multiple cores. I've looked around, but I just can't find how to actually run different fibers concurrently. With threads you can do this:
threads = []
threads << Thread.new {Do something}
threads << Thread.new {Do something}
threads.each {|thread| thread.join}
I can't see how to do something like this with fibers. All I can find is yield and resume which seems like just a bunch of starting and stopping between the fibers. Is there a way to do true concurrency with fibers?
No, you cannot do concurrency with Fibers. Fibers simply aren't a concurrency construct, they are a control-flow construct, like Exceptions. That's the whole point of Fibers: they never run in parallel, they are cooperative and they are deterministic. Fibers are coroutines. (In fact, I never understood why they aren't simply called Coroutines.)
The only concurrency construct in Ruby is Thread.
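A small sketch of what "cooperative and deterministic" means in practice, using only the core Fiber class. The ping and pong fibers and the log array are illustrative names of my own:

```ruby
log = []

ping = Fiber.new do
  log << "ping 1"
  Fiber.yield              # hand control back to whoever called resume
  log << "ping 2"
end

pong = Fiber.new do
  log << "pong 1"
  Fiber.yield
  log << "pong 2"
end

# Nothing runs until resume is called, and the caller picks the order.
ping.resume
pong.resume
ping.resume
pong.resume

puts log.inspect
# → ["ping 1", "pong 1", "ping 2", "pong 2"]
```

The interleaving is the same on every run because only one fiber is ever executing, and it keeps running until it yields or finishes.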
There seems to be a terminology issue between concurrency and parallelism.
I just can't find how to actually run different fibers concurrently.
I think you actually talk about parallelism, not about concurrency:
Concurrency is when two tasks can start, run, and complete in overlapping time periods. It doesn't necessarily mean they'll ever both be running at the same instant. Eg. multitasking on a single-core machine. Parallelism is when tasks literally run at the same time, eg. on a multicore processor
Quoting: Concurrency vs Parallelism - What is the difference?.
Also well illustrated here:
http://concur.rspace.googlecode.com/hg/talk/concur.html#title-slide
So to answer the question:
Fibers are primitives for implementing light weight cooperative concurrency in Ruby.
http://www.ruby-doc.org/core-2.1.1/Fiber.html
Which doesn't mean it can run in parallel.
If you want true concurrency you'll want to use threads with JRuby (which doesn't actually have fibers, it only has threads, one per fiber).
Another option is to "fork" to new processes, which could run things in true parallel on MRI.

Win32 Thread scheduling

As I understand, the Windows thread scheduler does not discriminate between threads belonging to two different processes, provided all of them have the same base priority. My question is: if I have two applications, one with only one thread and the other with, say, 50 threads, all with the same base priority, does it mean that the second process enjoys more CPU time than the first one?
Scheduling in Windows is at the thread granularity. The basic idea behind this approach is that processes don't run but only provide resources and a context in which their threads run. Coming back to your question, because scheduling decisions are made strictly on a thread basis, no consideration is given to what process the thread belongs to. In your example, if process A has 1 runnable thread and process B has 50 runnable threads, and all 51 threads are at the same priority, each thread would receive 1/51 of the CPU time—Windows wouldn't give 50 percent of the CPU to process A and 50 percent to process B.
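The arithmetic behind that example, as a short Ruby sketch. It assumes an idealized scheduler that hands every runnable thread an equal time slice and ignores dynamic priority adjustments; the process names are just labels:

```ruby
# Idealized model: every runnable thread at the same priority gets an equal share.
runnable = { "A" => 1, "B" => 50 }      # runnable threads per process
total = runnable.values.sum             # 51 threads competing in all

shares = runnable.transform_values { |count| Rational(count, total) }

shares.each do |name, share|
  puts format("process %s: %s of CPU time (%.1f%%)", name, share, share.to_f * 100)
end
```

Under this model process B ends up with 50/51 of the CPU time simply because it owns 50 of the 51 runnable threads.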
To understand the thread-scheduling algorithms, you must first understand the priority levels that Windows uses. You can refer here for quick reference.
Try reading Windows Internals for in depth understanding.
All of the above are accurate, but if you're worried about the 50-thread process hogging all the CPU, there ARE techniques you can use to ensure that no single process overwhelms the CPU.
IMHO the best way to do this is to use job objects to manage the usage of a process. First call CreateJobObject, then SetInformationJobObject to limit the max CPU usage of the processes in the job object and AssignProcessToJobObject to assign the process with 50 threads to the job object. You can then let the OS ensure that the 50 thread process doesn't consume too much CPU time.
The unit of scheduling is a thread, not a process, so a process with 50 threads, all in a tight loop, will get much more of the CPU than a process with only a single thread, provided all are running at the same priority. This is normally not a concern since most threads in the system are not in a runnable state and will not be up for scheduling; they are waiting on I/O, waiting for input from the user, and so on.
Windows Internals is a great book for learning more about the Windows thread scheduler.
That depends on the behavior of the threads. In general with a 50 : 1 difference in thread count, yes, the application with more threads is going to get a lot more time. However, Windows also uses dynamic thread prioritization, which can change this somewhat. Dynamic thread prioritization is described here:
https://web.archive.org/web/20130312225716/http://support.microsoft.com/kb/109228
Relevant excerpt:
The base priority of a thread is the base level from which these upward adjustments are made. The current priority of a thread is called its dynamic priority. Interactive threads that yield before their time slice is up will tend to be adjusted upward in priority from their base priority. Compute-bound threads that do not yield, consuming their entire time slice, will tend to have their priority decreased, but not below the base level. This arrangement is often called heuristic scheduling. It provides better interactive performance and tends to lessen the system impact of "CPU hog" threads.
There is a local 'advanced' setting that purportedly can be used to shade scheduling slightly in favor of the app with focus. With the 'services' setting, there is no preference. In previous versions of Windows, this setting used to be somewhat more granular than just 'applications with focus' (slight preference to app with focus) and 'services' (all equal weighting).
As this can be set by the user on the target machine, it seems like it is asking for grief to depend on this setting...

Resources