Why should I use Grand Central Dispatch over OpenMP?

Why should I use Grand Central Dispatch over OpenMP? - macos

Apple introduced Grand Central Dispatch (a thread pool) in Snow Leopard, but haven't gone into why one should use it over OpenMP, which is cross-platform and also works on Leopard. They're both pretty easy to use and look similar in capability. So, any ideas?

GCD is much better at runtime evaluation of the appropriate level of resources to throw at a problem - OpenMP decides how many threads to invoke for a set of parallel tasks based on information like environment variables. GCD looks at the current system load and number of available cores and allows an appropriate number of threads to run - scaling up and back as the resource usage changes in real time. That means that a GCD program ought to get better results in the general case. Of course, if you've bought a cluster of dedicated boxes to run your code, then this is moot because there will be little else for your code to conflict with.

Now that GCD has been open sourced, it's a matter of putting both tools side by side and see who lives in the end.

Performance and OS Level Integration?

Related

Multithreaded programming for CPUs with E/P cores on Alder Lake architecture

This is more a conceptual/high level question rather than a language specific one.
In terms of writing software for high performance applications that will use multiple threads, would it be advisable to manually check for and only allocate threads to P cores, or rather just simply allocate threads as they were done before Alder Lake, and let the OS scheduler decide where to put them?
To be more specific, my program will be a computer game with separate computationally expensive CPU threads for AI, pathfinding, etc. Ideally I don't want these threads on E cores but I'm wondering if I should be leaving this sort of thing up to the OS to decide instead of ensuring it manually.

This is probably not a good question for SO, and might fit better on https://softwareengineering.stackexchange.com, but here goes with a conceptual/high-level answer anyways.
In terms of writing software for high performance applications you will get best performance by writing platform-dependent code, that is by writing programs which are informed by, and take advantage of, the particular features of the hardware+o/s+runtime on which the programs are to execute.
The costs of this approach are that the code will, by definition, be less-than-optimal-performancewise on any other platform; and that writing codes to squeeze out every last drop of performance for a particular problem can be quite difficult and time consuming.
Personally (so this might be an opinion, which is something SO doesn't like) I would first write the platform-neutral version of the code and test it. Only when I was convinced that I couldn't achieve necessary performance (or other) goals would I roll up my sleeves and develop that first version into a platform-dependent version. (Well, I might do this extra work for fun, but you catch my drift).
Later, if you want to move the program to another platform you already have the platform-neutral version to start with.

Are all cores used automatically?

I am a mere astronomer, so this is quite probably an obvious question.
Having no experience with parallel computing and hardly any with optimizing performance in general: my machine has four cores. If I ignorantly run my code, will all four be utilized automatically?

Chances are your code is not executed on all cores but just one. It depends if you code using a specific platform/library that already abstracts thread management.
Depending on the language, you want to have a look at specific libraries. But thread programming is a general subject you have to further explore before choosing any.

What to avoid for performance reasons in multithreaded code?

I'm currently reviewing/refactoring a multithreaded application which is supposed to be multithreaded in order to be able to use all the available cores and theoretically deliver a better / superior performance (superior is the commercial term for better :P)
What are the things I should be aware when programming multithreaded applications?
I mean things that will greatly impact performance, maybe even to the point where you don't gain anything with multithreading at all but lose a lot by design complexity. What are the big red flags for multithreading applications?
Should I start questioning the locks and looking to a lock-free strategy or are there other points more important that should light a warning light?
Edit: The kind of answers I'd like are similar to the answer by Janusz, I want red warnings to look up in code, I know the application doesn't perform as well as it should, I need to know where to start looking, what should worry me and where should I put my efforts. I know it's kind of a general question but I can't post the entire program and if I could choose one section of code then I wouldn't be needing to ask in the first place.
I'm using Delphi 7, although the application will be ported / remake in .NET (c#) for the next year so I'd rather hear comments that are applicable as a general practice, and if they must be specific to either one of those languages

One thing to definitely avoid is lots of write access to the same cache lines from threads.
For example: If you use a counter variable to count the number of items processed by all threads, this will really hurt performance because the CPU cache lines have to synchronize whenever the other CPU writes to the variable.

One thing that decreases performance is having two threads with much hard drive access. The hard drive would jump from providing data for one thread to the other and both threads would wait for the disk all the time.

Something to keep in mind when locking: lock for as short a time as possible. For example, instead of this:
lock(syncObject)
{
bool value = askSomeSharedResourceForSomeValue();
if (value)
DoSomethingIfTrue();
else
DoSomtehingIfFalse();
}
Do this (if possible):
bool value = false;
lock(syncObject)
{
value = askSomeSharedResourceForSomeValue();
}
if (value)
DoSomethingIfTrue();
else
DoSomtehingIfFalse();
Of course, this example only works if DoSomethingIfTrue() and DoSomethingIfFalse() don't require synchronization, but it illustrates this point: locking for as short a time as possible, while maybe not always improving your performance, will improve the safety of your code in that it reduces surface area for synchronization problems.
And in certain cases, it will improve performance. Staying locked for long lengths of time means that other threads waiting for access to some resource are going to be waiting longer.

More threads then there are cores, typically means that the program is not performing optimally.
So a program which spawns loads of threads usually is not designed in the best fashion. A good example of this practice are the classic Socket examples where every incoming connection got it's own thread to handle of the connection. It is a very non scalable way to do things. The more threads there are, the more time the OS will have to use for context switching between threads.

You should first be familiar with Amdahl's law.
If you are using Java, I recommend the book Java Concurrency in Practice; however, most of its help is specific to the Java language (Java 5 or later).
In general, reducing the amount of shared memory increases the amount of parallelism possible, and for performance that should be a major consideration.
Threading with GUI's is another thing to be aware of, but it looks like it is not relevant for this particular problem.

What kills performance is when two or more threads share the same resources. This could be an object that both use, or a file that both use, a network both use or a processor that both use. You cannot avoid these dependencies on shared resources but if possible, try to avoid sharing resources.

Run-time profilers may not work well with a multi-threaded application. Still, anything that makes a single-threaded application slow will also make a multi-threaded application slow. It may be an idea to run your application as a single-threaded application, and use a profiler, to find out where its performance hotspots (bottlenecks) are.
When it's running as a multi-threaded aplication, you can use the system's performance-monitoring tool to see whether locks are a problem. Assuming that your threads would lock instead of busy-wait, then having 100% CPU for several threads is a sign that locking isn't a problem. Conversely, something that looks like 50% total CPU utilitization on a dual-processor machine is a sign that only one thread is running, and so maybe your locking is a problem that's preventing more than one concurrent thread (when counting the number of CPUs in your machine, beware multi-core and hyperthreading).
Locks aren't only in your code but also in the APIs you use: e.g. the heap manager (whenever you allocate and delete memory), maybe in your logger implementation, maybe in some of the O/S APIs, etc.
Should I start questioning the locks and looking to a lock-free strategy
I always question the locks, but have never used a lock-free strategy; instead my ambition is to use locks where necessary, so that it's always threadsafe but will never deadlock, and to ensure that locks are acquired for a tiny amount of time (e.g. for no more than the amount of time it takes to push or pop a pointer on a thread-safe queue), so that the maximum amount of time that a thread may be blocked is insignificant compared to the time it spends doing useful work.

You don't mention the language you're using, so I'll make a general statement on locking. Locking is fairly expensive, especially the naive locking that is native to many languages. In many cases you are reading a shared variable (as opposed to writing). Reading is threadsafe as long as it is not taking place simultaneously with a write. However, you still have to lock it down. The most naive form of this locking is to treat the read and the write as the same type of operation, restricting access to the shared variable from other reads as well as writes. A read/writer lock can dramatically improve performance. One writer, infinite readers. On an app I've worked on, I saw a 35% performance improvement when switching to this construct. If you are working in .NET, the correct lock is the ReaderWriterLockSlim.

I recommend looking into running multiple processes rather than multiple threads within the same process, if it is a server application.
The benefit of dividing the work between several processes on one machine is that it is easy to increase the number of servers when more performance is needed than a single server can deliver.
You also reduce the risks involved with complex multithreaded applications where deadlocks, bottlenecks etc reduce the total performance.
There are commercial frameworks that simplifies server software development when it comes to load balancing and distributed queue processing, but developing your own load sharing infrastructure is not that complicated compared with what you will encounter in general in a multi-threaded application.

I'm using Delphi 7
You might be using COM objects, then, explicitly or implicitly; if you are, COM objects have their own complications and restrictions on threading: Processes, Threads, and Apartments.

You should first get a tool to monitor threads specific to your language, framework and IDE. Your own logger might do fine too (Resume Time, Sleep Time + Duration). From there you can check for bad performing threads that don't execute much or are waiting too long for something to happen, you might want to make the event they are waiting for to occur as early as possible.
As you want to use both cores you should check the usage of the cores with a tool that can graph the processor usage on both cores for your application only, or just make sure your computer is as idle as possible.
Besides that you should profile your application just to make sure that the things performed within the threads are efficient, but watch out for premature optimization. No sense to optimize your multiprocessing if the threads themselves are performing bad.
Looking for a lock-free strategy can help a lot, but it is not always possible to get your application to perform in a lock-free way.

Threads don't equal performance, always.
Things are a lot better in certain operating systems as opposed to others, but if you can have something sleep or relinquish its time until it's signaled...or not start a new process for virtually everything, you're saving yourself from bogging the application down in context switching.

Converting a parallel program to a cluster program. From OpenMP to?

I want to write a code converter that takes an OpenMP based parallel program and runs it on a cluster.
How do I go about this problem? What libraries do I use? How do I set up a small cluster for this?
I'm finding it extremely hard to find good material about cluster computing on the internet.
EDIT: If it's impossible then how does Intel do it? The Intel compiler seems to do exactly what I want to. I don't have any specific application that I would like to run. I want to write the "converter/compiler", not the application. I understand that shared memory is different from distributed memory, but there has to be a way to sync memory, if not for all cases, then for some specific cases, even if it means that application is written with custom constructs.

Intel has an implementation of OpenMP that works with their C++ and Fortran compilers for x86 64-bit clusters. You can get a 30-day eval version of these compilers for free. Other than that, Zifre is mostly right. If you are concerned with scalability, bite the bullet and write your parallel program in another programming model (MPI, CUDA, Cilk, ...) which is designed with distributed systems in mind. If you provide a little more information about your application, we may be able to provide more useful guidance on that front.

It seems to me that this is not a good idea.
The basic idea behind OpenMP is data-shared parallel execution. It works well, when accessing shared data costs you nothing. Every thread can access a variable in shared cache or RAM.
The cluster computations exploit message-passing, because computers in cluster have distributed memory. When one process needs data from another one then you should manage data passing over the network. It is time-consuming operation.
So, if you want to write such compiler, you should implement data broadcasting operations (e.g. MPI_Bcast from MPI) for each data access in OpenMP. This will kill parallel performance at all.

This is simply not possible. You have to structure your code in a completely different way to get it to work on a cluster (programming multiple machines is very different from programming one machine).
There is no magic pixie dust to do this.
On the other hand, if you write your program with clusters in mind, it is possible to run it on a single machine (although it will obviously be slower).

SCORE/SCASH and Omni OpenMP compiler

Best practice regarding number of threads in GUI applications

In the past I've worked with a number of programmers who have worked exclusively writing GUI applications.
And I've been given the impression that they have almost universally minimised the use of multiple threads in their applications. In some cases they seem to have gone to extreme lengths to ensure that they use a single thread.
Is this common? Is this the generally accepted philosophy for gui application design?
And if so, why?
[edit]
There are a number of answers saying that thread usage should be minimised to reduce complexity. Reducing complexity in general is a good thing.
But if you look at any number of applications where response to external events is of paramount importance (eg. web servers, any number of embedded applications) there seems to be a world of difference in the attitude toward thread usage.

Generally speaking, GUI frameworks aren't thread safe. For things like Swing(Java's GUI API), only one thread can be updating the UI (or bad things can happen). Only one thread handles dispatching events. If you have multiple threads updating the screen, you can get some ugly flicker and incorrect drawing.
That doesn't mean the application needs to be single threaded, however. There are certainly circumstances when you don't want this to be the case. If you click on a button that calculates pi to 1000 digits, you don't want the UI to be locked up and the button to be depressed for the next couple of days. This is when things like SwingWorker come in handy. It has two parts a doInBackground() which runs in a seperate thread and a done() that gets called by the thread that handles updating the UI sometime after the doInBackground thread has finished. This allows events to be handled quickly, or events that would take a long time to process in the background, while still having the single thread updating the screen.

I think in terms of windows you are limited to all GUI operations happening on a single thread - because of the way the windows message pump works, to increase responsivness most apps add at least one additional worker thread for longer running tasks that would otherwise block and make the ui unresponsive.
Threading is fundamentally hard and so thinking in terms or more than a couple threads can often result in a lot of debugging effort - there is a quote that escapes me right now that goes something like - "if you think you understand threading then you really dont"

I've seen the same thing. Ideally you should perform any operation that is going to take longer then a few hundred ms in a background thread. Anything sorter than 100ms and a human probably wont notice the difference.
A lot of GUI programmers I've worked with in the past are scared of threads because they are "hard". In some GUI frameworks such as the Delphi VCL there are warnings about using the VCL from multiple threads, and this tends to scare some people (others take it as a challenge ;) )
One interesting example of multi-threaded GUI coding is the BeOS API. Every window in an application gets its own thread. From my experience this made BeOS apps feel more responsive, but it did make programming things a little more tricky. Fortunately since BeOS was designed to be multi-threaded by default there was a lot of stuff in the API to make things easier than on some other OSs I've used.

Most GUI frameworks are not thread safe, meaning that all controls have to me accessed from the same thread that created them. Still, it's a good practice to create worker threads to have responsive applications, but you need to be careful to delegate GUI updates to the GUI thread.

Yes.
GUI applications should minimize the the number of threads that they use for the following reasons:
Thread programming is very hard and complicated
In general, GUI applications do at most 2 things at once : a) Respond to User Input, and b) Perform a background task (such as load in data) in response to a user action or an anticipated user action
In general therefore, the added complexity of using multiple threads is not justified by the needs of the application.
There are of course exceptions to the rule.

GUIs generally don't use a whole lot of threads, but they often do throw off another thread for interacting with certain sub-systems especially if those systems take awhile or are very shared resources.
For example, if you're going to print, you'll often want to throw off another thread to interact with the printer pool as it may be very busy for awhile and there's no reason not to keep working.
Another example would be database loads where you're interacting with SQL server or something like that and because of the latency involved you may want to create another thread so your main UI processing thread can continue to respond to commands.

The more threads you have in an application, (generally) the more complex the solution is. By attempting to minimise the number of threads being utilised within a GUI, there are less potential areas for problems.
The other issue is the biggest problem in GUI design: the human. Humans are notorious in their inability to want to do multiple things at the same time. Users have a habit of clicking multiple butons/controls in quick sucession in order to attempt to get something done quicker. Computers cannot generally keep up with this (this is only componded by the GUIs apparent ability to keep up by using multiple threads), so to minimise this effect GUIs will respond to input on a first come first serve basis on a single thread. By doing this, the GUI is forced to wait until system resorces are free untill it can move on. Therefore elimating all the nasty deadlock situations that can arise. Obviously if the program logic and the GUI are on different threads, then this goes out the window.
From a personal preference, I prefer to keep things simple on one thread but not to the detriment of the responsivness of the GUI. If a task is taking too long, then Ill use a different thread, otherwise Ill stick to just one.

As the prior comments said, GUI Frameworks (at least on Windows) are single threaded, thus the single thread. Another recommendation (that is difficult to code in practice) is to limit the number of the threads to the number of available cores on the machine. Your CPU can only do one operation at a time with one core. If there are two threads, a context switch has to happen at some point. If you've got too many threads, the computer can sometimes spend more time swapping between threads than letting threads work.
As Moore's Law changes with more cores, this will change and hopefully programming frameworks will evolve to help us use threads more effectively, depending on the number of cores available to the program, such as the TPL.

Generally all the windowing messages from the window manager / OS will go to a single queue so its natural to have all UI elements in a single thread. Some frameworks, such as .Net, actually throw exceptions if you attempt to directly access UI elements from a thread other than the thread that created it.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio