Benefits of green threads vs a simple loop - green-threads

Is there any benefit to using green threads / lightweight threads over a simple loop or sequential code, assuming only non blocking operations are used in both?
    for i := 0; i < 5; i++ {
        go doSomethingExpensive() // using golang example
    }
    // versus
    for i := 0; i < 5; i++ {
        doSomethingExpensive()
    }
As far as I can think of
- green threads help avoid a bit of callback hell on async operations
- allow scheduling of M green threads on N kernel threads
But
- add a bit of complexity and performance overhead, since they require a scheduler
- easier cross-thread communication when the language supports it and the execution is split across different CPUs (otherwise sequential code is simpler)

No, green threads by themselves have no performance benefit.
If the threads are performing non-blocking operations:
Multiple threads have no benefit if you have only one physical core (since the same core has to execute everything, threads only make things slower because of the overhead)
Up to as many threads as you have CPU cores do bring a performance benefit, since multiple cores can execute your threads physically in parallel (see Play! framework)
Classic green threads have no benefit, since a sub-scheduler runs all of them on the same one real thread, so in effect green threads == 1 thread. (Runtimes that multiplex lightweight threads over several OS threads, as Go does with goroutines, are the exception: those can use multiple cores.)
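For the Go case from the question, here is a minimal sketch of CPU-bound work split across goroutines. The names sumRange and parallelSum are made up for illustration, and the summing loop merely stands in for doSomethingExpensive. Whether it actually runs faster than the plain loop depends on how many cores the runtime may use.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// sumRange adds the integers in [from, to); it stands in for CPU-bound work.
func sumRange(from, to int) int {
	total := 0
	for i := from; i < to; i++ {
		total += i
	}
	return total
}

// parallelSum splits [0, n) into `workers` chunks and sums each chunk
// in its own goroutine, then combines the partial results.
func parallelSum(n, workers int) int {
	partial := make([]int, workers)
	chunk := (n + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		from := w * chunk
		to := from + chunk
		if to > n {
			to = n
		}
		wg.Add(1)
		go func(w, from, to int) {
			defer wg.Done()
			partial[w] = sumRange(from, to) // each goroutine writes only its own slot
		}(w, from, to)
	}
	wg.Wait()
	total := 0
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	n := 1000000
	// One goroutine per logical CPU: more would not add parallelism for pure CPU work.
	fmt.Println(sumRange(0, n) == parallelSum(n, runtime.NumCPU()))
}
```

Each goroutine writes to its own slice element, so no lock is needed; the WaitGroup is the only synchronization.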
If the threads are performing blocking operations, things may look different:
multiple threads make sense, since one thread can be blocked while the others go on, so blocking slows down only one thread
you can avoid the callback-hell by just implementing your partially blocking process as one thread. Since you're free to block from one thread while e.g. waiting for IO, you get much simpler code.
Green threads
Green threads are not real threads by design, so they won't be split amongst multiple CPUs and are not intended to work in parallel. This can give the false impression that you can avoid synchronization - however once you upgrade to real threads the lack of proper synchronization will introduce a good set of issues.
Green threads were widely used in early Java days, when the JVM did not support real OS threads. A variant of green threads, called Fibers are part of the Windows operating system, and e.g. the MS SQL server uses them heavily to handle various blocking scenarios without the heavy overhead of using real threads.
You can choose not only amongst green threads and real threads, but may also consider continuations (https://www.playframework.com/documentation/1.3.x/asynchronous)
Continuations give you the best of both worlds:
your code logically looks like linear code, with no callback hell
in reality the code is executed by real threads, however if a thread is getting blocked it suspends its execution and can switch to executing other code. Once the blocking condition signals, the thread can switch back and continue your code.
This approach is quite resource friendly. Play! framework uses about as many threads as you have CPU cores (4-8), yet it can hold its own against high-end Java application servers in terms of performance.

Related

Does task switching in concurrent code result in faster code than synchronous execution?

I understand that concurrency is not parallelism, but I believe that is my source of confusion about the speed of concurrency in environments that only use a single thread (go/node).
If everything is running in a single process, and a scheduler is constantly switching between different concurrent tasks wouldn't the overhead generated by this constant switching lead to slower execution of code than if everything was done synchronously?
I know that concurrency has its advantages when you want non-blocking code, for example a web server that switches between servicing thousands of requests instead of just focusing on one, and it shines in that regard; however, I'm having difficulty understanding if it actually is faster, or if concurrency just appears to be faster.
Concurrent code is efficient when there are some IO-bound activities (e.g. sending data to and receiving data from the network). Without concurrency your single thread has to wait, doing nothing, for the call to complete. Pure CPU-bound activities do not benefit from concurrency on a single thread (which may add unnecessary overhead) but can benefit from multi-threading if the workload can be distributed across multiple CPUs working in parallel.
Another advantage of async IO is that it is threadless. That saves memory and OS resources. It's the only way to solve, for instance, the C10M problem.
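To make the IO-bound case concrete, here is a small Go sketch in which time.Sleep stands in for a network call. The helpers fakeIO, runSequential and runConcurrent are invented for this illustration. The waits overlap in the concurrent version, so it should finish in roughly the time of one call rather than five.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// fakeIO simulates a blocking network call by sleeping for d.
func fakeIO(d time.Duration) {
	time.Sleep(d)
}

// runSequential performs n simulated calls one after another.
func runSequential(n int, d time.Duration) time.Duration {
	start := time.Now()
	for i := 0; i < n; i++ {
		fakeIO(d)
	}
	return time.Since(start)
}

// runConcurrent performs n simulated calls in separate goroutines,
// so their waiting periods overlap.
func runConcurrent(n int, d time.Duration) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			fakeIO(d)
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	const n = 5
	d := 50 * time.Millisecond
	fmt.Println("sequential:", runSequential(n, d))
	fmt.Println("concurrent:", runConcurrent(n, d))
}
```

No extra CPU work gets done in the concurrent version; the gain comes purely from not idling while a call is outstanding.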

Goroutines 8kb and windows OS thread 1 mb

As a Windows user, I know that OS threads consume ~1 MB of memory each, because by default Windows allocates 1 MB of memory for each thread's user-mode stack. How does Go use only ~8 KB of memory for each goroutine, if an OS thread is so much more gluttonous? Are goroutines a sort of virtual thread?
Goroutines are not threads, they are (from the spec):
...an independent concurrent thread of control, or goroutine, within the same address space.
Effective Go defines them as:
They're called goroutines because the existing terms—threads, coroutines, processes, and so on—convey inaccurate connotations. A goroutine has a simple model: it is a function executing concurrently with other goroutines in the same address space. It is lightweight, costing little more than the allocation of stack space. And the stacks start small, so they are cheap, and grow by allocating (and freeing) heap storage as required.
Goroutines don't have their own threads. Instead multiple goroutines are (may be) multiplexed onto the same OS threads so if one should block (e.g. waiting for I/O or a blocking channel operation), others continue to run.
The actual number of threads executing goroutines simultaneously can be set with the runtime.GOMAXPROCS() function. Quoting from the runtime package documentation:
The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit.
Note that in early implementations only 1 thread was used to execute goroutines by default; since Go 1.5 the default is the number of logical CPUs available.
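To check what a given Go build and machine actually use, the value can be queried at run time. A minimal sketch (schedulerInfo is a name made up for this example):

```go
package main

import (
	"fmt"
	"runtime"
)

// schedulerInfo reports the number of logical CPUs and the current GOMAXPROCS value.
// Calling runtime.GOMAXPROCS with an argument below 1 only queries the setting.
func schedulerInfo() (cpus, maxProcs int) {
	return runtime.NumCPU(), runtime.GOMAXPROCS(0)
}

func main() {
	cpus, maxProcs := schedulerInfo()
	fmt.Printf("logical CPUs: %d, GOMAXPROCS: %d\n", cpus, maxProcs)
}
```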
1 MiB is the default, as you correctly noted. You can pick your own stack size easily (however, the minimum is still a lot higher than ~8 kiB).
That said, goroutines aren't threads. They're just tasks with coöperative multi-tasking, similar to Python's (newer Go runtimes can also preempt long-running goroutines). The goroutine itself is just the code and data required to do what you want; there's also a separate scheduler (which runs on one or more OS threads), which actually executes that code.
In pseudo-code:
    loop forever
        take job from queue
        execute job
    end loop
Of course, the execute job part can be very simple, or very complicated. The simplest thing you can do is just execute a given delegate (if your language supports something like that). In effect, this is simply a method call. In more complicated scenarios, there can be also stuff like restoring some kind of context, handling continuations and coöperative task yields, for example.
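The loop above can be sketched in Go in its simplest form, where "execute job" is just a function call. This is only an illustration of the idea, not how the Go runtime scheduler is implemented. The names job, runLoop and runAll are invented here, and the loop ends when the queue is closed instead of running forever.

```go
package main

import "fmt"

// job is a unit of work; here it is simply a function to call.
type job func()

// runLoop takes jobs from the queue and executes them one at a time
// until the queue is closed and drained.
func runLoop(queue <-chan job) {
	for j := range queue {
		j()
	}
}

// runAll enqueues n jobs that each record their index, runs the loop,
// and returns the order in which the jobs executed.
func runAll(n int) []int {
	queue := make(chan job, n)
	var order []int
	for i := 0; i < n; i++ {
		i := i // capture the current value for the closure
		queue <- func() { order = append(order, i) }
	}
	close(queue)
	runLoop(queue)
	return order
}

func main() {
	fmt.Println(runAll(3)) // prints [0 1 2]
}
```

A real scheduler would add the context restoring and yielding mentioned above; this version runs every job to completion on the calling goroutine.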
This is a very light-weight approach, and very useful when doing asynchronous programming (which is almost everything nowadays :)). Many languages now support something similar - Python is the first one I've seen with this ("tasklets"), long before go. Of course, in an environment without pre-emptive multi-threading, this was pretty much the default.
In C#, for example, there are Tasks. They're not entirely the same as goroutines, but in practice, they come pretty close - the main difference being that Tasks use threads from the thread pool (usually), rather than separate dedicated "scheduler" threads. This means that if you start 1000 tasks, it is possible for them to be run by 1000 separate threads; in practice, it would require you to write very bad Task code (e.g. using only blocking I/O, sleeping threads, waiting on wait handles etc.). If you use Tasks for asynchronous non-blocking I/O and CPU work, they come pretty close to goroutines - in actual practice. The theory is a bit different :)
EDIT:
To clear up some confusion, here is what a typical C# asynchronous method might look like:
    async Task<string> GetData()
    {
        // httpClient is assumed to be a shared HttpClient instance;
        // Parse and DataLayer are application-specific placeholders.
        var html = await httpClient.GetStringAsync("http://www.google.com");
        var parsedStructure = Parse(html);
        var dbData = await DataLayer.GetSomeStuffAsync(parsedStructure.ElementId);
        return dbData.First().Description;
    }
From point of view of the GetData method, the entire processing is synchronous - it's just as if you didn't use the asynchronous methods at all. The crucial difference is that you're not using up threads while you're doing the "waiting"; but ignoring that, it's almost exactly the same as writing synchronous blocking code. This also applies to any issues with shared state, of course - there isn't much of a difference between multi-threading issues in await and in blocking multi-threaded I/O. It's easier to avoid with Tasks, but just because of the tools you have, not because of any "magic" that Tasks do.
The main difference from goroutines in this aspect is that Go doesn't really have blocking methods in the usual sense of the word. Instead of blocking, they queue their particular asynchronous request, and yield. When the OS (and any other layers in Go - I don't have deep knowledge about the inner workings) receives the response, it posts it to the goroutine scheduler, which in turn knows that the goroutine that "waits" for the response is now ready to resume execution; when it actually gets a slot, it will continue on from the "blocking" call as if it had really been blocking - but in effect, it's very similar to what C#'s await does. There's no fundamental difference - there are quite a few differences between C#'s approach and Go's, but they're not all that huge.
And also note that this is fundamentally the same approach used on old Windows systems without pre-emptive multi-tasking - any "blocking" method would simply yield the thread's execution back to the scheduler. Of course, on those systems, you only had a single CPU core, so you couldn't execute multiple threads at once, but the principle is still the same.
Goroutines are what we call green threads. They are not OS threads; the Go scheduler is responsible for them. This is why they can have much smaller memory footprints.
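A rough way to see the difference in footprint is to start far more goroutines than you would ever want OS threads. The sketch below uses a made-up helper, spawnParked, and parks 10,000 goroutines on a channel. With a 1 MB stack each, the same number of OS threads would reserve on the order of 10 GB of stack space.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// spawnParked starts n goroutines that all wait on a channel. It returns the
// number of goroutines alive while they wait, plus a function that releases
// them and waits for them to finish.
func spawnParked(n int) (alive int, release func()) {
	stop := make(chan struct{})
	var started, done sync.WaitGroup
	started.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			started.Done() // signal that this goroutine is running
			<-stop         // park until released
		}()
	}
	started.Wait()
	return runtime.NumGoroutine(), func() {
		close(stop)
		done.Wait()
	}
}

func main() {
	alive, release := spawnParked(10000)
	fmt.Println("goroutines alive:", alive)
	release()
}
```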

I/O Completion Ports vs. RegisterWaitForSingleObject?

What's the difference between using I/O completion ports, versus just using RegisterWaitForSingleObject to have a thread pool thread wait for I/O to complete?
Is one of them faster, and if so, why?
IOCPs are generally the fastest performing IO turn-around mechanism you will find for one reason above all else: blocking detection.
The simple example of this is a server that is responsible for serving up files from a disk. An IOCP is generally made up of three primary things:
The pool of N threads for servicing the IOCP requests.
A limit of M threads (M is typically < N) that tells the IOCP how many concurrent, non-blocked threads to allow.
A completion-status loop that all threads run on.
The difference between N and M in this is very important. The general philosophy is to configure M to be the number of cores on the machine, and N to be larger. How much larger depends on the amount of time your worker threads spend in a blocked-state. If you're reading disk files, your threads will be bound to the speed of the disk IO channel. When you make that call to ReadFile() you've just introduced a blocking call. If M == N, then as soon as you hit all threads reading disk files, you're utterly stalled, with all threads on the disk IO channel.
But what if there was a way for some fancy scheduler to "know" that this thread is (a) participating in an IOCP thread pool, and (b) just stalled because it issued an API call that will be time consuming? What if, when that happens, that fancy scheduler could temporarily "move" that thread into a special "running-but-stalled" group, and then "release" an extra thread that has volunteered to work while there are threads stalled?
That is exactly what IOCP brings. When N is greater than M, the IOCP will put the thread that just issued the stall into a special running-but-stalled state, and then temporarily "borrow" an additional thread from your pool of N. It will continue to do this until the N pool is exhausted, or threads that were stalled begin returning from their blocking requests.
So under that light, an IOCP configured to have, say 8 threads concurrently running on an 8-core machine could actually have a few hundred threads in the real pool. Only 8 will ever be "allowed" to be concurrently running in non-blocked state, though you may pop over that temporarily when blocked threads return from their blocks and you already have borrowed threads servicing additional requests.
Finally, though not as important for your cause, it is still important: An IOCP thread will NOT block, nor context switch, if there is pending work on the queue when it finishes its current work and issues its next GetQueuedCompletionStatus() call. If there is work waiting, it will pick it up and continue executing with no mandated preemption. Of course the OS scheduler may preempt anyway, but only as part of the general scheduler; not because of the specific call to GetQueuedCompletionStatus(). The lone exception to this is if there are already over M threads running and non-blocked. In that case, GetQueuedCompletionStatus() will block the calling thread until it is needed again for slack-work when enough threads once again become blocked.
The description you gave indicates you will be heavily disk-io-bound. For absolute performance-critical io-server architectures, it is near-impossible to beat the benefits of IOCP, especially the OS-level block-detection that allows the scheduler to know it can temporarily release extra threads from your master-pool to keep things pumping while other threads are stalled.
You simply cannot replicate that specific feature of IOCPs using Windows thread pools. If all of your threads were number crunchers with little or no IO, I would say thread-pools would be a better fit, but your specificity of disk-IO tells me you should be using an IOCP instead.

Windows, multiple process vs multiple threads

We have to make our system highly scalable; it has been developed for the Windows platform using VC++. Say that initially we would like to process 100 requests (from MSMQ) simultaneously. What would be the best approach: a single process with 100 threads, or 2 processes with 50 threads each? What is the gain, apart from process memory, in the second approach? In Windows, is CPU time first allocated to a process and then split between that process's threads, or does the OS count the threads of each process and allocate CPU on the basis of threads rather than processes? We notice that in the first case CPU utilization is 15-25%, and we want to consume more CPU. Remember that we would like to get optimal performance, so 100 requests is just an example. We have also noticed that if we increase the number of threads in the process above 120, performance degrades due to context switches.
One more point; our product already supports clustering, but we want to utilize more CPU on the single node.
Any suggestions will be highly appreciated.
You can't process more requests simultaneously than you have CPU cores. "Fast" scalable solutions involve setting up thread pools, where the number of active (not blocked on IO) threads == the number of CPU cores. So creating 100 threads because you want to service 100 MSMQ requests is not good design.
Windows has a thread pooling mechanism called IO Completion Ports.
Using IO Completion ports does push the design to a single process as, in a multi process design, each process would have its own IO Completion Port thread pool that it would manage independently and hence you could get a lot more threads contending for CPU cores.
The "core" idea of an IO Completion Port is that it's a kernel-mode queue - you can manually post events to the queue, or get asynchronous IO completions posted to it automatically by associating file (file, socket, pipe) handles with the port.
On the other side, the IO Completion Port mechanism automatically dequeues events onto waiting worker threads - but it does NOT dequeue jobs if it detects that the current "active" threads in the thread pool >= the number of CPU cores.
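As a language-neutral illustration of the queue-plus-bounded-workers idea (this is not IOCP itself, which is a Windows kernel facility), here is a hedged sketch in Go. A fixed number of workers, typically one per core, drain a shared queue of requests. processAll and the integer "requests" are made up for the example.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// processAll pushes n requests through a queue served by a fixed number of
// workers and returns how many requests each worker handled.
func processAll(n, workers int) []int {
	queue := make(chan int)
	perWorker := make([]int, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for range queue {
				perWorker[w]++ // stand-in for handling one request
			}
		}(w)
	}
	for i := 0; i < n; i++ {
		queue <- i // blocks until some worker is free to take it
	}
	close(queue)
	wg.Wait()
	return perWorker
}

func main() {
	counts := processAll(100, runtime.NumCPU())
	total := 0
	for _, c := range counts {
		total += c
	}
	fmt.Println("workers:", len(counts), "requests handled:", total)
}
```

The number of requests in flight can never exceed the number of workers, no matter how many requests are queued, which is the property the answer is after.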
Using IO Completion Ports can potentially increase the scalability of a service a lot. Usually, however, the gain is a lot smaller than expected, as other factors quickly come into play when all the CPU cores are contending for the service's other resources.
If your services are developed in C++, you might find that serialized access to the heap is a big performance minus - although Windows version 6.1 seems to have implemented a low contention heap so this might be less of an issue.
To summarize - theoretically your biggest performance gains would be from a design using thread pools managed in a single process. But you are heavily dependent on the libraries you are using not serializing access to critical resources, which can quickly lose you all the theoretical performance gains.
If you do have library code serializing your nicely threadpooled service (as in the case of C++ object creation and destruction being serialized because of heap contention) then you need to change your use of the library, switch to a low contention version of the library, or just scale out to multiple processes.
The only way to know is to write test cases that stress the server in various ways and measure the results.
The standard approach on Windows is multiple threads. I'm not saying that is always your best solution, but there is a price to be paid for each thread or process, and on Windows a process is more expensive. As for the scheduler, I'm not sure, but you can set the priority of the process and threads. The real benefit of threads is their shared address space and the ability to communicate without IPC; however, synchronization must be carefully maintained.
If your system is already developed, which it appears to be, it is likely to be easier to implement a multiple-process solution, especially if there is a chance that later more than one machine may be utilized, as your IPC between 2 processes on one machine can scale to multiple machines in the general case. Most attempts at massive parallelization fail because the entire system is not evaluated for bottlenecks. For example, if you implement 100 threads that all write to the same database, you may gain little in actual performance and just wait on your database.
just my .02

Does the Task Parallel Library (or PLINQ) take other processes into account?

In particular, I'm looking at using TPL to start (and wait for) external processes. Does the TPL look at total machine load (both CPU and I/O) before deciding to start another task (hence -- in my case -- another external process)?
For example:
I've got about 100 media files that need to be encoded or transcoded (e.g. from WAV to FLAC or from FLAC to MP3). The encoding is done by launching an external process (e.g. FLAC.EXE or LAME.EXE). Each file takes about 30 seconds. Each process is mostly CPU-bound, but there's some I/O in there. I've got 4 cores, so the worst case (transcoding by piping the decoder into the encoder) still only uses 2 cores. I'd like to do something like:
    Parallel.ForEach(sourceFiles,
        sourceFile =>
            TranscodeUsingPipedExternalProcesses(sourceFile));
Will this kick off 100 tasks (and hence 200 external processes competing for the CPU)? Or will it see that the CPU's busy and only do 2-3 at a time?
You're going to run into a couple of issues here. The starvation avoidance mechanism of the scheduler will see your tasks as blocked as they wait on processes. It will find it hard to distinguish between a deadlocked thread and one simply waiting for a process to complete. As a result it may schedule new tasks if your tasks run for a long time (see below). The hill-climbing heuristic should take into account the overall load on the system, both from your application and others. It simply tries to maximize work done, so it will add more work until the overall throughput of the system stops increasing and then it will back off. I don't think this will affect your application, but the starvation avoidance issue probably will.
You can find more detail as to how this all works in Parallel Programming with Microsoft®.NET, Colin Campbell, Ralph Johnson, Ade Miller, Stephen Toub (an earlier draft is online).
"The .NET thread pool automatically manages the number of worker
threads in the pool. It adds and removes threads according to built-in
heuristics. The .NET thread pool has two main mechanisms for injecting
threads: a starvation-avoidance mechanism that adds worker
threads if it sees no progress being made on queued items and a hill-climbing
heuristic that tries to maximize throughput while using as
few threads as possible.
The goal of starvation avoidance is to prevent deadlock. This kind
of deadlock can occur when a worker thread waits for a synchronization
event that can only be satisfied by a work item that is still pending
in the thread pool’s global or local queues. If there were a fixed
number of worker threads, and all of those threads were similarly
blocked, the system would be unable to ever make further progress.
Adding a new worker thread resolves the problem.
A goal of the hill-climbing heuristic is to improve the utilization
of cores when threads are blocked by I/O or other wait conditions
that stall the processor. By default, the managed thread pool has one
worker thread per core. If one of these worker threads becomes
blocked, there’s a chance that a core might be underutilized, depending
on the computer’s overall workload. The thread injection logic
doesn’t distinguish between a thread that’s blocked and a thread
that’s performing a lengthy, processor-intensive operation. Therefore,
whenever the thread pool’s global or local queues contain pending
work items, active work items that take a long time to run (more than
a half second) can trigger the creation of new thread pool worker
threads.
The .NET thread pool has an opportunity to inject threads every
time a work item completes or at 500 millisecond intervals, whichever
is shorter. The thread pool uses this opportunity to try adding threads
(or taking them away), guided by feedback from previous changes in
the thread count. If adding threads seems to be helping throughput,
the thread pool adds more; otherwise, it reduces the number of
worker threads. This technique is called the hill-climbing heuristic.
Therefore, one reason to keep individual tasks short is to avoid
“starvation detection,” but another reason to keep them short is to
give the thread pool more opportunities to improve throughput by
adjusting the thread count. The shorter the duration of individual
tasks, the more often the thread pool can measure throughput and
adjust the thread count accordingly.
To make this concrete, consider an extreme example. Suppose
that you have a complex financial simulation with 500 processor-intensive
operations, each one of which takes ten minutes on average
to complete. If you create top-level tasks in the global queue for each
of these operations, you will find that after about five minutes the
thread pool will grow to 500 worker threads. The reason is that the
thread pool sees all of the tasks as blocked and begins to add new
threads at the rate of approximately two threads per second.
What’s wrong with 500 worker threads? In principle, nothing, if
you have 500 cores for them to use and vast amounts of system
memory. In fact, this is the long-term vision of parallel computing.
However, if you don’t have that many cores on your computer, you are
in a situation where many threads are competing for time slices. This
situation is known as processor oversubscription. Allowing many
processor-intensive threads to compete for time on a single core adds
context switching overhead that can severely reduce overall system
throughput. Even if you don’t run out of memory, performance in this
situation can be much, much worse than in sequential computation.
(Each context switch takes between 6,000 and 8,000 processor cycles.)
The cost of context switching is not the only source of overhead.
A managed thread in .NET consumes roughly a megabyte of stack
space, whether or not that space is used for currently executing functions.
It takes about 200,000 CPU cycles to create a new thread, and
about 100,000 cycles to retire a thread. These are expensive operations.
As long as your tasks don’t each take minutes, the thread pool’s
hill-climbing algorithm will eventually realize it has too many threads
and cut back on its own accord. However, if you do have tasks that
occupy a worker thread for many seconds or minutes or hours, that
will throw off the thread pool’s heuristics, and at that point you
should consider an alternative.
The first option is to decompose your application into shorter
tasks that complete fast enough for the thread pool to successfully
control the number of threads for optimal throughput.
A second possibility is to implement your own task scheduler
object that does not perform thread injection. If your tasks are of long
duration, you don’t need a highly optimized task scheduler because
the cost of scheduling will be negligible compared to the execution
time of the task. MSDN® developer program has an example of a
simple task scheduler implementation that limits the maximum degree
of concurrency. For more information, see the section, “Further Reading,”
at the end of this chapter.
As a last resort, you can use the SetMaxThreads method to
configure the ThreadPool class with an upper limit for the number
of worker threads, usually equal to the number of cores (this is the
Environment.ProcessorCount property). This upper limit applies for
the entire process, including all AppDomains."
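The excerpt's advice to cap the maximum degree of concurrency is not specific to .NET. As an illustration in Go (not the MSDN scheduler the book refers to), a buffered channel can act as a counting semaphore so that at most `limit` work items run at once. runBounded, the file names and the empty work function are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// runBounded calls work(item) for every item but lets at most `limit` calls
// run at the same time. It returns the peak number of concurrent calls seen.
func runBounded(items []string, limit int, work func(string)) int {
	sem := make(chan struct{}, limit) // counting semaphore with `limit` slots
	var (
		mu           sync.Mutex
		active, peak int
		wg           sync.WaitGroup
	)
	for _, item := range items {
		wg.Add(1)
		sem <- struct{}{} // blocks while `limit` calls are already in flight
		go func(item string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when this call finishes

			mu.Lock()
			active++
			if active > peak {
				peak = active
			}
			mu.Unlock()

			work(item)

			mu.Lock()
			active--
			mu.Unlock()
		}(item)
	}
	wg.Wait()
	return peak
}

func main() {
	files := []string{"a.wav", "b.wav", "c.wav", "d.wav", "e.wav"} // hypothetical inputs
	peak := runBounded(files, 2, func(file string) {
		// A real program might start an external encoder process here.
	})
	fmt.Println("peak concurrency:", peak)
}
```

With the limit set to the number of cores divided by the cores each piped transcode uses, the 100-file case from the question would never have more than a couple of encoder pairs running together.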
The short answer is: no.
Internally, the TPL uses the standard ThreadPool to schedule its tasks. So you're actually asking whether the ThreadPool takes machine load into account and it doesn't. The only thing that limits the number of tasks simultaneously running is the number of threads in the thread pool, nothing else.
Is it possible to have the external processes report back to your application once they are ready? In that case you do not have to wait for them (keeping threads occupied).
Ran a test using TPL/ThreadPool to schedule a great number of tasks doing looped spins. Using an external app I've loaded one of the cores to 100% using proc affinity. The number of active tasks never decreased.
Even better, I ran multiple instances of the same CPU intensive .NET TPL enabled app. The number of threads for all the apps was the same, and never went below the number of cores, even though my machine was barely usable.
So theory aside, TPL uses the number of cores available, but never checks on their actual load. A very poor implementation in my opinion.

Resources