Multiple threads and performance on a single CPU

Is there any performance benefit to using multiple threads on a computer with a single CPU that does not have hyperthreading?

In terms of speed of computation, no. In fact, things will slow down due to the overhead of managing the threads.
In terms of responsiveness, yes. You can, for example, have one thread wait on an IO operation while another runs a GUI at the same time.

It depends on your application. If it spends all its time using the CPU, then multithreading will just slow things down - though you may be able to use it to be more responsive to the user and thus give the impression of better performance.
However, if your code is limited by other things, for example using the file system, the network, or any other resource, then multithreading can help, since it allows your application to behave asynchronously. So while one thread is waiting for a file to load from disk, another can be querying a remote webserver and another redrawing the GUI, while another is doing various calculations.
Working with multiple threads can also simplify your business logic, since you don't have to pay so much attention to how various independent tasks need to interleave. If the operating system's scheduling logic is better than yours, then you may indeed see improved performance.
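As a minimal sketch of this (in C#, since later questions in this collection target .NET; the file name and the busy-work loop are made up for illustration), one thread can block on I/O while another keeps doing useful work:

using System;
using System.IO;
using System.Threading;

class IoOverlapDemo
{
    static void Main()
    {
        // One thread blocks on a (hypothetical) large file read...
        var reader = new Thread(() =>
        {
            string text = File.ReadAllText("big-input.txt");   // placeholder file name
            Console.WriteLine($"Loaded {text.Length} characters");
        });
        reader.Start();

        // ...while the main thread stays busy, e.g. keeping a GUI responsive
        // or doing unrelated computation, instead of sitting idle.
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) sum += i;
        Console.WriteLine($"Computed {sum} while the read was in flight");

        reader.Join();   // wait for the I/O thread before exiting
    }
}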

You can consider using multithreading on a single CPU:
If you use network resources
If you do IO-intensive operations
If you pull data from a database
If you use anything else that involves waiting or possible delays
If you want your app to react quickly and stay responsive
When you should not use multithreading on a single CPU:
For compute-intensive operations that keep the CPU at almost 100% usage
If you are not sure how to use threads and synchronization
If your application cannot be divided into several parallel tasks

Yes, there is a benefit of using multiple threads (or processes) on a single CPU - if one thread is busy waiting for something, others can continue doing useful work.
However this can be offset by the overhead of task switching. You'll have to benchmark and/or profile your application on production grade hardware to find out.

Regardless of the number of CPUs available, if you require preemptive multitasking and/or applications with asynchronous components (i.e. pretty much anything that combines a responsive GUI with a non-trivial amount of computation or continuous I/O processing), multithreading performs much better than the alternative, which is to use multiple processes for each application.
This is because threads in the same process can exchange data much more efficiently than can multiple processes, because they share the same memory context.
See this Wikipedia article on computer multitasking for a fairly concise discussion of these issues.

Absolutely! If you do any kind of I/O, there is a great advantage to having a multithreaded system. While one thread waits for an I/O operation (which is relatively slow), another thread can do useful work.

Related

Does task switching in concurrent code result in faster code than synchronous execution?

I understand that concurrency is not parallelism, but I believe that is my source of confusion about the speed of concurrency in environments that only use a single thread (go/node).
If everything is running in a single process, and a scheduler is constantly switching between different concurrent tasks wouldn't the overhead generated by this constant switching lead to slower execution of code than if everything was done synchronously?
I know that concurrency has its advantages when you want non-blocking code, for example a web server that switches between servicing thousands of requests instead of just focusing on one, and it shines in that regard; however, I'm having difficulty understanding whether it actually is faster, or whether concurrency just appears to be faster.
Concurrent code is efficient when there are IO-bound activities (e.g. sending data to and receiving data from the network). Without concurrency, your single thread has to wait doing nothing for the call to complete. Pure CPU-bound activities do not benefit from concurrency on a single thread (which may add unnecessary overhead) but can benefit from multi-threading if the workload can be distributed across multiple CPUs working in parallel.
Another advantage of async IO is that it is threadless. That saves memory and OS resources. It's the only way to solve, for instance, the C10M problem.
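As a small illustration of this threadless style in C# (the URLs are placeholders, so this is a sketch rather than anything from the answers above): the awaits below let the two network waits overlap without dedicating a thread to each request.

using System;
using System.Net.Http;
using System.Threading.Tasks;

class AsyncIoDemo
{
    static async Task Main()
    {
        using var client = new HttpClient();

        // Start both requests; no thread sits blocked while the servers respond.
        Task<string> first = client.GetStringAsync("https://example.com/a");    // placeholder URL
        Task<string> second = client.GetStringAsync("https://example.com/b");   // placeholder URL

        string[] bodies = await Task.WhenAll(first, second);
        Console.WriteLine($"Got {bodies[0].Length} and {bodies[1].Length} characters");
    }
}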

When should I use parallel-programming?

What would be a typical or real-world problem where parallel programming is worth using? It can be quite challenging to implement. On the internet people explain how to use it, but not why.
Performance is the most common reason to use parallel programming. But: not all programs will become faster by using parallel programming. In most cases your algorithm consists of parts that are parallelizable and parts that are inherently sequential. You always have to reason about the potential performance gain of using parallel programming. In some cases the overhead of using it will actually make your program slower. Have a look at Amdahl's law to learn more about the potential performance improvements you can achieve.
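As a quick worked example of Amdahl's law (the numbers here are chosen purely for illustration): if a fraction p of the runtime can be parallelized over n processors, the best-case speedup is

speedup(n) = 1 / ((1 - p) + p / n)

so with p = 0.9 and n = 4 the speedup is at most 1 / (0.1 + 0.225) ≈ 3.1, and even with unlimited processors it can never exceed 1 / (1 - p) = 10.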
If you only want some examples of the use of parallel computation: there are some classes of algorithms that are inherently parallel; see this article on the dwarfs of Berkeley.
Another reason for using a multithreaded application architecture is its responsiveness. There are certain functions which block program execution for a certain amount of time, e.g. reads from files or the network, waiting for user input, etc. While waiting like this does not consume CPU power, it often blocks or slows program flow.
Using threads in such a case is simply good practice to make the code clearer. Instead of using (often complex or unintuitive) checks for input, integrating those checks into program flow, and manually switching between handling input and other tasks, a programmer may choose to use threads and let one thread wait for input while another, for example, performs calculations.
In other words, multiple threads sometimes allow for better use of different resources at your computer's disposal: network, disk, input devices or simply monitor.
Generalization: using multiple threads (including parallel data processing) is advisable when the speed and responsiveness gains outweigh the synchronization costs and work required to parallelize the application.
The reason why there is increased interest in parallel programming is partly because the hardware we use today is more parallel. (multicore processors, many-core GPU). To fully benefit from this hardware you need to program in parallel.
Interestingly, parallel processing also improves battery life:
Having four cores at 1 GHz draws less power than a single core at 4 GHz.
A phone with a multicore CPU will try to run as many tasks as possible simultaneously, so it can turn off the CPU when all work is done. This is sometimes called "the rush to idle".
Now, some programs are easier to parallelize than others. You should not randomly try to parallelize your entire code base. But it can be a useful exercise to do so even if there is no business reason: then you will be better prepared for the day when you really need it.
There are very few problems which can't be solved more quickly by a parallel program than by a serial program. There are very few computers which do not have multiple processing units.
I conclude, therefore, that you should use parallel programming all the time.

How expensive is a context switch? Is it better to implement a manual task switch than to rely on OS threads?

Imagine I have two (three, four, whatever) tasks that have to run in parallel. Now, the easy way to do this would be to create separate threads and forget about it. But on a plain old single-core CPU that would mean a lot of context switching - and we all know that context switching is big, bad, slow, and generally simply Evil. It should be avoided, right?
On that note, if I'm writing the software from the ground up anyway, I could go the extra mile and implement my own task-switching. Split each task into parts, save the state in between, and then switch among them within a single thread. Or, if I detect that there are multiple CPU cores, I could just give each task to a separate thread and all would be well.
The second solution does have the advantage of adapting to the number of available CPU cores, but will the manual task-switch really be faster than the one in the OS core? Especially if I'm trying to make the whole thing generic with a TaskManager and an ITask, etc?
Clarification: I'm a Windows developer so I'm primarily interested in the answer for this OS, but it would be most interesting to find out about other OSes as well. When you write your answer, please state for which OS it is.
More clarification: OK, so this isn't in the context of a particular application. It's really a general question, the result on my musings about scalability. If I want my application to scale and effectively utilize future CPUs (and even different CPUs of today) I must make it multithreaded. But how many threads? If I make a constant number of threads, then the program will perform suboptimally on all CPUs which do not have the same number of cores.
Ideally the number of threads would be determined at runtime, but few are the tasks that can truly be split into arbitrary number of parts at runtime. Many tasks however can be split in a pretty large constant number of threads at design time. So, for instance, if my program could spawn 32 threads, it would already utilize all cores of up to 32-core CPUs, which is pretty far in the future yet (I think). But on a simple single-core or dual-core CPU it would mean a LOT of context switching, which would slow things down.
Thus my idea about manual task switching. This way one could make 32 "virtual" threads which would be mapped to as many real threads as is optimal, and the "context switching" would be done manually. The question just is - would the overhead of my manual "context switching" be less than that of OS context switching?
Naturally, all this applies to processes which are CPU-bound, like games. For your run-of-the-mill CRUD application this has little value. Such an application is best made with one thread (at most two).
I don't see how a manual task switch could be faster, since the OS kernel is still switching other processes, including switching yours in and out of the running state too. Seems like a premature optimization and a potentially huge waste of effort.
If the system isn't doing anything else, chances are you won't have a huge number of context switches anyway. The thread will use its timeslice, the kernel scheduler will see that nothing else needs to run and switch right back to your thread. Also the OS will make a best effort to keep from moving threads between CPUs so you benefit there with caching.
If you are really CPU bound, detect the number of CPUs and start that many threads. You should see nearly 100% CPU utilization. If not, you aren't completely CPU bound and maybe the answer is to start N + X threads. For very IO-bound processes, you would be starting a (large) multiple of the CPU count (e.g. high-traffic webservers run 1000+ threads).
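A minimal sketch of that idea in C# (the work items and the Crunch method are placeholders): detect the core count and run exactly that many worker threads over a shared queue.

using System;
using System.Collections.Concurrent;
using System.Threading;

class WorkerPoolDemo
{
    static void Main()
    {
        var work = new BlockingCollection<int>();
        for (int i = 0; i < 1000; i++) work.Add(i);   // placeholder work items
        work.CompleteAdding();

        int workers = Environment.ProcessorCount;     // one thread per core for CPU-bound work
        var threads = new Thread[workers];
        for (int t = 0; t < workers; t++)
        {
            threads[t] = new Thread(() =>
            {
                foreach (int item in work.GetConsumingEnumerable())
                    Crunch(item);                      // hypothetical CPU-bound task
            });
            threads[t].Start();
        }
        foreach (var thread in threads) thread.Join();
    }

    static void Crunch(int item) { /* CPU-bound work goes here */ }
}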
Finally, for reference, both Windows and Linux schedulers wake up every millisecond to check if another process needs to run. So, even on an idle system you will see 1000+ context switches per second. On heavily loaded systems, I have seen over 10,000 per second per CPU without any significant issues.
The only advantage of a manual switch that I can see is that you have better control of where and when the switch happens. The ideal place is, of course, after a unit of work has been completed, so that its state can be thrown away all at once; this saves you cache misses.
I advise not to spend your effort on this.
Single-core Windows machines are going to become extinct in the next few years, so I generally write new code with the assumption that multi-core is the common case. I'd say go with OS thread management, which will automatically take care of whatever concurrency the hardware provides, now and in the future.
I don't know what your application does, but unless you have multiple compute-bound tasks, I doubt that context switches are a significant bottleneck in most applications. If your tasks block on I/O, then you are not going to get much benefit from trying to out-do the OS.

What to avoid for performance reasons in multithreaded code?

I'm currently reviewing/refactoring a multithreaded application which is supposed to be multithreaded in order to be able to use all the available cores and theoretically deliver better / superior performance (superior is the commercial term for better :P)
What are the things I should be aware when programming multithreaded applications?
I mean things that will greatly impact performance, maybe even to the point where you don't gain anything with multithreading at all but lose a lot by design complexity. What are the big red flags for multithreading applications?
Should I start questioning the locks and looking to a lock-free strategy or are there other points more important that should light a warning light?
Edit: The kind of answers I'd like are similar to the answer by Janusz. I want red flags to look for in code. I know the application doesn't perform as well as it should; I need to know where to start looking, what should worry me, and where I should put my efforts. I know it's kind of a general question, but I can't post the entire program, and if I could choose one section of code then I wouldn't need to ask in the first place.
I'm using Delphi 7, although the application will be ported / remade in .NET (C#) next year, so I'd rather hear comments that are applicable as general practice, or, if they must be specific, that apply to either one of those languages.
One thing to definitely avoid is lots of write access to the same cache lines from threads.
For example: if you use a counter variable to count the number of items processed by all threads, this will really hurt performance because the CPU cache lines have to synchronize whenever another CPU writes to the variable.
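A rough sketch in C# of that shared-counter problem (the names and loop sizes are invented for illustration): every thread hammering one counter forces its cache line to bounce between cores, whereas per-thread counters combined once at the end avoid the contention.

using System;
using System.Threading;
using System.Threading.Tasks;

class CounterContentionDemo
{
    static long sharedCount;   // every thread writes to this one variable (and its cache line)

    static void Main()
    {
        int threads = Environment.ProcessorCount;

        // Contended version: all threads update the same variable on every item.
        Parallel.For(0, threads, _ =>
        {
            for (int i = 0; i < 1_000_000; i++)
                Interlocked.Increment(ref sharedCount);
        });

        // Friendlier version: each thread counts locally and the totals are combined once.
        long total = 0;
        Parallel.For(0, threads,
            () => 0L,                                    // per-thread local counter
            (t, state, local) =>
            {
                for (int i = 0; i < 1_000_000; i++)
                    local++;                             // no shared write in the hot loop
                return local;
            },
            local => Interlocked.Add(ref total, local)); // one shared write per thread

        Console.WriteLine($"{sharedCount} vs {total}");
    }
}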
One thing that decreases performance is having two threads both doing heavy hard drive access. The hard drive would jump back and forth between providing data for one thread and the other, and both threads would spend all their time waiting for the disk.
Something to keep in mind when locking: lock for as short a time as possible. For example, instead of this:
lock (syncObject)
{
    bool value = askSomeSharedResourceForSomeValue();
    if (value)
        DoSomethingIfTrue();
    else
        DoSomethingIfFalse();
}
Do this (if possible):
bool value = false;
lock (syncObject)
{
    value = askSomeSharedResourceForSomeValue();
}
if (value)
    DoSomethingIfTrue();
else
    DoSomethingIfFalse();
Of course, this example only works if DoSomethingIfTrue() and DoSomethingIfFalse() don't require synchronization, but it illustrates this point: locking for as short a time as possible, while maybe not always improving your performance, will improve the safety of your code in that it reduces surface area for synchronization problems.
And in certain cases, it will improve performance. Staying locked for long lengths of time means that other threads waiting for access to some resource are going to be waiting longer.
More threads than there are cores typically means that the program is not performing optimally.
So a program which spawns loads of threads usually is not designed in the best fashion. A good example of this practice are the classic socket examples where every incoming connection gets its own thread to handle the connection. It is a very non-scalable way to do things. The more threads there are, the more time the OS has to spend on context switching between them.
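For contrast, a rough sketch in C# of servicing connections without a dedicated thread per socket (the port and the handler body are placeholders): asynchronous accepts and reads let idle connections wait without each pinning a thread.

using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

class AsyncServerSketch
{
    static async Task Main()
    {
        var listener = new TcpListener(IPAddress.Any, 8080);   // placeholder port
        listener.Start();

        while (true)
        {
            TcpClient client = await listener.AcceptTcpClientAsync();
            _ = HandleClientAsync(client);   // serviced on the thread pool, not by a new Thread per connection
        }
    }

    static async Task HandleClientAsync(TcpClient client)
    {
        using (client)
        {
            var buffer = new byte[4096];
            int read = await client.GetStream().ReadAsync(buffer, 0, buffer.Length);
            // ... parse the request and write a response here ...
        }
    }
}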
You should first be familiar with Amdahl's law.
If you are using Java, I recommend the book Java Concurrency in Practice; however, most of its help is specific to the Java language (Java 5 or later).
In general, reducing the amount of shared memory increases the amount of parallelism possible, and for performance that should be a major consideration.
Threading with GUI's is another thing to be aware of, but it looks like it is not relevant for this particular problem.
What kills performance is when two or more threads share the same resources. This could be an object that both use, a file that both use, a network connection that both use, or a processor that both use. You cannot always avoid these dependencies on shared resources, but if possible, try to avoid sharing them.
Run-time profilers may not work well with a multi-threaded application. Still, anything that makes a single-threaded application slow will also make a multi-threaded application slow. It may be an idea to run your application as a single-threaded application, and use a profiler, to find out where its performance hotspots (bottlenecks) are.
When it's running as a multi-threaded application, you can use the system's performance-monitoring tool to see whether locks are a problem. Assuming that your threads would lock instead of busy-wait, seeing 100% CPU across several threads is a sign that locking isn't a problem. Conversely, something that looks like 50% total CPU utilization on a dual-processor machine is a sign that only one thread is running, and so maybe your locking is a problem that's preventing more than one thread from running concurrently (when counting the number of CPUs in your machine, beware of multi-core and hyperthreading).
Locks aren't only in your code but also in the APIs you use: e.g. the heap manager (whenever you allocate and delete memory), maybe in your logger implementation, maybe in some of the O/S APIs, etc.
Should I start questioning the locks and looking to a lock-free strategy
I always question the locks, but have never used a lock-free strategy; instead my ambition is to use locks where necessary, so that it's always threadsafe but will never deadlock, and to ensure that locks are acquired for a tiny amount of time (e.g. for no more than the amount of time it takes to push or pop a pointer on a thread-safe queue), so that the maximum amount of time that a thread may be blocked is insignificant compared to the time it spends doing useful work.
You don't mention the language you're using, so I'll make a general statement on locking. Locking is fairly expensive, especially the naive locking that is native to many languages. In many cases you are reading a shared variable (as opposed to writing). Reading is threadsafe as long as it is not taking place simultaneously with a write. However, you still have to lock it down. The most naive form of this locking is to treat the read and the write as the same type of operation, restricting access to the shared variable from other reads as well as writes. A read/writer lock can dramatically improve performance. One writer, infinite readers. On an app I've worked on, I saw a 35% performance improvement when switching to this construct. If you are working in .NET, the correct lock is the ReaderWriterLockSlim.
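A short sketch of that pattern with .NET's ReaderWriterLockSlim (the shared dictionary and its keys are made up): many readers can proceed in parallel, and only writers take the exclusive lock.

using System.Collections.Generic;
using System.Threading;

class SettingsCache
{
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();
    private readonly Dictionary<string, string> _settings = new Dictionary<string, string>();

    public string Get(string key)
    {
        _lock.EnterReadLock();       // many threads may hold the read lock at the same time
        try
        {
            return _settings.TryGetValue(key, out var value) ? value : null;
        }
        finally
        {
            _lock.ExitReadLock();
        }
    }

    public void Set(string key, string value)
    {
        _lock.EnterWriteLock();      // writers get exclusive access
        try
        {
            _settings[key] = value;
        }
        finally
        {
            _lock.ExitWriteLock();
        }
    }
}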
I recommend looking into running multiple processes rather than multiple threads within the same process, if it is a server application.
The benefit of dividing the work between several processes on one machine is that it is easy to increase the number of servers when more performance is needed than a single server can deliver.
You also reduce the risks involved with complex multithreaded applications where deadlocks, bottlenecks etc reduce the total performance.
There are commercial frameworks that simplify server software development when it comes to load balancing and distributed queue processing, but developing your own load-sharing infrastructure is not that complicated compared with what you will encounter in general in a multi-threaded application.
I'm using Delphi 7
You might be using COM objects, then, explicitly or implicitly; if you are, COM objects have their own complications and restrictions on threading: Processes, Threads, and Apartments.
You should first get a tool to monitor threads that is specific to your language, framework and IDE. Your own logger might do fine too (resume time, sleep time + duration). From there you can check for poorly performing threads that don't execute much or wait too long for something to happen; you might want to make the event they are waiting for occur as early as possible.
As you want to use both cores you should check the usage of the cores with a tool that can graph the processor usage on both cores for your application only, or just make sure your computer is as idle as possible.
Besides that, you should profile your application just to make sure that the things performed within the threads are efficient, but watch out for premature optimization. There is no sense in optimizing your multiprocessing if the threads themselves are performing badly.
Looking for a lock-free strategy can help a lot, but it is not always possible to get your application to perform in a lock-free way.
Threads don't always equal performance.
Things are a lot better in some operating systems than in others, but if you can have something sleep or relinquish its time until it's signaled, or avoid starting a new process for virtually everything, you save yourself from bogging the application down in context switching.

Performance impact of Processes vs Threads

Clearly if performance is critical it makes sense to prototype and profile. But all the same, wisdom and advice can be sought on StackOverflow :)
For the handling of highly parallel tasks where inter-task communication is infrequent or suits message-passing, is there a performance disadvantage to using processes (fork() etc) or threads?
Is the context switch between threads cheaper than that between processes? Some processors have single-instruction context-switching don't they? Do the mainstream operating systems better utilise SMP with multiple threads or processes? Is the COW overhead of fork() more expensive than threads if the process never writes to those pages?
And so on. Thanks!
The idea that processes are slow to create is an old one, and was much more true in the past. Google's Chrome team wrote a little piece somewhere about how it's not as big an impact anymore, and here is Scott Hanselman on the subject: http://www.hanselman.com/blog/MicrosoftIE8AndGoogleChromeProcessesAreTheNewThreads.aspx
My take on it is that threads are faster, but only moderately so, and currently it's easier to make mistakes with threads.
I have heard that .NET 4.0 is going to extend the thread library... Something about system.threading.thread.For ? And I can think of a few places I'd want to do that... For each item in this thousand item list go do something.
http://reedcopsey.com/?p=87
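(For reference, the API that eventually shipped in .NET 4.0 is Parallel.For / Parallel.ForEach in System.Threading.Tasks; a minimal sketch with a made-up per-item method:)

using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

class ParallelForEachDemo
{
    static void Main()
    {
        List<int> items = Enumerable.Range(0, 1000).ToList();   // the "thousand item list"

        // Partitions the items across the available cores.
        Parallel.ForEach(items, item => DoSomething(item));
    }

    static void DoSomething(int item) { /* per-item work goes here */ }
}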
At the following URL you will find a real-world benchmark and a comparison of fork vs. pthread_create in a real-world application, though it's from 2003 and things may have changed a bit. Quickly reasoning from this benchmark, it looks like fork scales better if you have more than 500 processes or threads.
http://bulk.fefe.de/scalable-networking.pdf - pages 29 to 32
My guess would be that threads are faster, since they are the more lightweight solution. Processes are designed to be isolated from each other. Each process uses its own TLB, whereas threads share one virtual address space (afaik), so this could be an argument. Processes are useful if you want to do some kind of distributed computing.
In general, about threading and such, I suggest you look into OpenMP or Intel TBB. These guys really know their stuff with multithreading and high-performance computing.
It comes down to the isolation cost: processes are isolated from each other (e.g. separate memory resources, protection, separate file handles, etc.) whereas threads can share resources within a process. It takes time and resources to support and enforce this isolation.
As with anything in this universe, you have to "pay" for what you get.
According to this book: http://reiber.org/nxt/pub/Linux/LinuxKernelDevelopment/Linux.Kernel.Development.3rd.Edition.pdf Linux implements all threads as standard processes. Since you're writing about COW, that's Linux. There is more on this on pages 33-34.
