Would I need semaphores to synchronize processes in a multi-process matrix multiplication when the processes are dealing with different rows? Would that still create a race condition?
If different threads (or processes) write to different rows of the result, and the input matrices are only read, then there shouldn't be any danger of two of them writing to the same memory location, so there shouldn't be a reason to worry about race conditions.
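As a rough illustration of that partitioning, here is a minimal Python sketch. The names `multiply_rows` and `parallel_matmul` are made up for this example, and it uses threads rather than processes to keep it short. With separate processes the result matrix would have to live in shared memory, but the disjoint-rows argument is the same. Note that no lock or semaphore appears anywhere: each worker only writes to its own slice of rows.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_rows(a, b, c, row_start, row_end):
    """Fill rows [row_start, row_end) of c with the corresponding rows of a * b."""
    inner, cols = len(b), len(b[0])
    for i in range(row_start, row_end):
        for j in range(cols):
            # Only row i of c is written here; a and b are read-only.
            c[i][j] = sum(a[i][k] * b[k][j] for k in range(inner))


def parallel_matmul(a, b, workers=2):
    """Multiply a by b, giving each worker a disjoint block of output rows."""
    rows = len(a)
    c = [[0] * len(b[0]) for _ in range(rows)]
    chunk = max(1, (rows + workers - 1) // workers)  # ceiling division
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(multiply_rows, a, b, c, start, min(start + chunk, rows))
            for start in range(0, rows, chunk)
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return c


a = [[1, 2], [3, 4], [5, 6]]
b = [[7, 8], [9, 10]]
product = parallel_matmul(a, b)
```

The moment two workers could write to the same row (for example, if the row ranges overlapped), this reasoning would no longer hold and some form of synchronization would be needed.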
My understanding was that each workgroup is executed on the GPU and then the next one is executed.
Unfortunately, my observations lead to the conclusion that this is not correct.
In my implementation, all workgroups share a big global memory buffer.
All workgroups perform read and write operations to various positions on this buffer.
If the kernel operates on it directly, no conflicts arise.
If a workgroup loads a chunk into local memory, performs some computation and copies the result back, the global memory gets corrupted by other workgroups.
So how can I avoid this behaviour?
Can I somehow tell OpenCL to only execute one workgroup at once or rearrange the execution order, so that I somehow don't get conflicts?
The answer is that it depends. A whole workgroup must be executed concurrently (though not necessarily in parallel) on the device, at least when barriers are present, because the workgroup must be able to synchronize and communicate. There is no rule that says work-groups must be concurrent - but there is no rule that says they cannot. Usually hardware will place a single work-group on a single compute core. Most hardware has multiple cores, which will each get a work-group, and to cover latency a lot of hardware will also place multiple work-groups on a single core if there is capacity available.
You have no way to control the order in which work-groups execute. If you want them to serialize you would be better off launching just one work-group and writing a loop inside to serialize the series of work chunks in that same work-group. This is often a good strategy in general even with multiple work-groups.
If you really only want one work-group at a time, though, you will probably be using only a tiny part of the hardware. Most hardware cannot spread a single work-group across the entire device - so if you're stuck to one core on a 32-core GPU you're not getting much use of the device.
You need to set the global size and dimensions to those of a single work-group and enqueue a new NDRange for each group, essentially breaking up the call to your kernel into many smaller calls. Make sure your command queue does not allow out-of-order execution, so that each kernel finishes before the next one starts.
This will likely result in poorer performance, but you will get the dedicated global memory access you are looking for.
Yes, the groups can be executed in parallel; this is normally a very good thing.
The number of workgroups that can run concurrently on a Compute Unit (AMD) or SMX (Nvidia) depends on the availability of GPU hardware resources, the important ones being vector registers and workgroup-level memory** (called LDS on AMD and shared memory on Nvidia). If you want just one workgroup to run on a CU/SMX, make sure that the workgroup consumes the bulk of these resources and so blocks further workgroups from being placed on the same CU/SMX. You would, however, still have other workgroups executing on other CUs/SMXs; a GPU normally has several of these.
I am not aware of any API which lets you pin a kernel to a single CU/SMX.
** It also depends on the number of concurrent wavefronts/warps the scheduler can handle.
I understand that creating many processes may yield no benefit, depending on how many cores your processor has (if the tasks are CPU-bound), or depending on how many IO operations you can do simultaneously (if your tasks are IO-bound). In such cases, creating too many processes simply has no effect.
However, can creating too many processes have a negative effect on performance? If yes, why?
Short answer: yes.
A process that isn't active has some overhead in memory and CPU time -- not a lot, but not none. So if you have an extremely large number of processes, you will see negatives.
On a modern system, multiple processes of the same executable will share code and read-only data, but each needs its own copy of mutable data, each needs its own stack, etc. Thus, each additional process takes up some amount of memory; this means more cache pressure, and in the extreme case, more swapfile activity or outright running out of memory. There may be a hard limit to the number of processes as well.
The OS process scheduler will have more overhead working through a longer list of processes (though this probably won't be linearly bad; if heap-based it might be O(log n)).
Cache pressure is probably the biggest factor in practice. Assume your processes are all processing similar workloads. Some of the data they will need while processing will be shared across multiple work units, while not being known at compile time; each process will wind up having its own copy of that data. Thus two work units being handled by two processes will use up twice as much cache space for that kind of data.
In a parallel MPI program running on, for example, 100 processors:
Suppose there is a global counter that must be known to all MPI processes. Each process can add to this number, and the others should see the change instantly and add to the updated value.
Synchronization is not feasible, as it would cause a lot of latency.
Would it be OK to open a shared memory region among all the processes and use it both to read and to update this number?
Would it be OK to use MPI_WIN_ALLOCATE_SHARED or something like that, or is this not a good solution?
Your question suggests to me that you want to have your cake and eat it too. This will end in tears.
I write that you want to have your cake and eat it too because you state that you want to synchronise the activities of 100 processes without synchronisation. You want 100 processes to increment a shared counter, (presumably) to have all the updates applied correctly and consistently, and to have increments propagated to all processes instantly. No matter how you tackle this problem, it is one of synchronisation; either you write synchronised code or you offload the task to a library or run-time which does it for you.
Is it reasonable to expect MPI RMA to provide automatic synchronisation for you? No, not really. Note first that mpi_win_allocate_shared is only valid if all the processes in the communicator which make the call are in shared memory. Even if you have the hardware to support 100 processes in the same shared memory, you still have to write code to ensure synchronisation; MPI won't do it for you. If you do have 100 processes, any or all of which may increment the shared counter, there is nothing in the MPI standard, or in any implementation that I am familiar with, which will prevent a data race on that counter.
Even genuinely shared-memory parallel programs (as opposed to MPI programs that emulate shared memory) have to take measures to avoid data races and other similar issues.
You could certainly write an MPI program to synchronise accesses to the shared counter but a better approach would be to rethink your program's structure to avoid too-tight synchronisation between processes.
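One way such a restructuring might look, assuming (and this is only an assumption about your program) that the algorithm needs the total at a few well-defined points rather than after every single increment: each worker keeps a private count and the partial counts are combined in one step. The Python sketch below uses threads as stand-ins for MPI ranks, and `count_matches` and `distributed_count` are hypothetical names. In a real MPI code the combining step would correspond to a collective reduction such as MPI_Allreduce.

```python
from concurrent.futures import ThreadPoolExecutor


def count_matches(chunk, predicate):
    """Count matching items in one worker's chunk using only local state."""
    local_count = 0
    for item in chunk:
        if predicate(item):
            local_count += 1  # private to this worker, so no synchronisation
    return local_count


def distributed_count(data, predicate, workers=4):
    """Split data across workers, count locally, then combine once."""
    chunks = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(
            pool.map(count_matches, chunks, [predicate] * workers)
        )
    # The only point where the workers' results meet.
    return sum(partial_counts)


total_even = distributed_count(list(range(1000)), lambda x: x % 2 == 0)
```

The workers synchronise exactly once, at the reduction, instead of on every increment.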
I have written a CUDA kernel in which each thread makes an update to a particular memory address (of int size). Some threads might want to update this address simultaneously.
How does CUDA handle this? Does the operation become atomic? Does this increase the latency of my application in any way? If so, how?
The operation does not become atomic, and it is essentially undefined behavior. When two or more threads write to the same location, one of the values will end up in the location, but there is no way to predict which one.
It can be especially problematic if you are reading and writing, such as to increment a variable.
CUDA provides a set of atomic operations to help.
You may also use other coding techniques such as parallel reductions, to help when there are multiple updates to the same location, such as finding a max or min value.
If you don't care about the order of updates, it should not be a performance issue for newer GPUs which automatically condense writes or reads to a single location in global memory or shared memory, but this is also not specified behavior.
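To make the read-modify-write problem concrete, here is a small CPU-side analogy in Python (not CUDA code; the `SharedCounter` class is invented for this sketch). The lock makes each increment one indivisible step, which is roughly the guarantee that a CUDA atomic such as `atomicAdd` gives you for a single address.

```python
import threading


class SharedCounter:
    """A counter whose updates are protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def add(self, amount):
        # The read, the add and the write happen as one indivisible step.
        with self._lock:
            self.value += amount


def worker(counter, times):
    for _ in range(times):
        counter.add(1)


counter = SharedCounter()
threads = [threading.Thread(target=worker, args=(counter, 10_000)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The price is that conflicting updates are applied one after another, so heavy contention on a single location costs time. That is the usual argument for combining values per thread or per block first and only then touching the shared location.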
I wrote a C program which reads a dataset from a file and then applies a data mining algorithm to find the clusters and classes in the data. At the moment I am trying to rewrite this sequential program as a multithreaded one with PThreads. I am new to parallel programming, and I have a question about the number of worker threads that has been puzzling me:
What is the best practice for choosing the number of worker threads when you do parallel programming, and how do you determine it? Do you try different numbers of threads, look at the results and then decide, or is there a procedure for finding the optimum number of threads? Of course, I'm investigating this question from the performance point of view.
There are a couple of issues here.
As Alex says, the number of threads you can use is application-specific. But there are also constraints that come from the type of problem you are trying to solve. Do your threads need to communicate with one another, or can they all work in isolation on individual parts of the problem? If they need to exchange data, then there will be a maximum number of threads beyond which inter-thread communication will dominate, and you will see no further speed-up (in fact, the code will get slower!). If they don't need to exchange data then threads equal to the number of processors will probably be close to optimal.
Dynamically adjusting the thread pool to the underlying architecture for speed at runtime is not an easy task! You would need a whole lot of additional code to do runtime profiling of your functions. See for example the way FFTW works in parallel. This is certainly possible, but is pretty advanced, and will be hard if you are new to parallel programming. If instead the number of cores estimate is sufficient, then trying to determine this number from the OS at runtime and spawning your threads accordingly will be a much easier job.
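The simpler option could look something like the following sketch. It is in Python only for brevity; `default_worker_count` and `process_chunk` are hypothetical names, and the sum of squares is just a placeholder workload. In a C/PThreads program on Linux, the core count is commonly obtained with `sysconf(_SC_NPROCESSORS_ONLN)` instead.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def default_worker_count():
    """Number of logical CPUs reported by the OS, or 1 if it is unknown."""
    return os.cpu_count() or 1  # os.cpu_count() may return None


def process_chunk(chunk):
    """Placeholder for the real per-thread work."""
    return sum(x * x for x in chunk)


data = list(range(10_000))
n_workers = default_worker_count()

# One chunk per worker, so every thread gets a share of the data.
chunks = [data[i::n_workers] for i in range(n_workers)]
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    partial_sums = list(pool.map(process_chunk, chunks))
total = sum(partial_sums)
```

Bear in mind that the reported number counts logical CPUs, which may include hyperthreads and may be more than your process is actually allowed to use.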
To answer your question about technique: Most big parallel codes run on supercomputers with a known architecture and take a long time to run. The best number of processors is not just a function of number, but also of the communication topology (how the processors are linked). They therefore benefit from a testing phase where the best number of processors is determined by measuring the time taken on small problems. This is normally done by hand. If possible, profiling should always be preferred to guessing based on theoretical considerations.
You basically want to have as many ready-to-run threads as you have cores available, or at most 1 or 2 more to ensure no core that's available to you will ever be left idle. The trick is in estimating how many threads will typically be blocked waiting for something else (mostly I/O), as that is totally dependent on your application and even on external entities beyond your control (databases, other distributed services, etc, etc).
In the end, once you've determined about how many threads should be optimal, running benchmarks for thread pool sizes around your estimated value, as you suggest, is good practice (at the very least, it lets you double check your assumptions), especially if, as it appears, you do need to get the last drop of performance out of your system!
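A minimal sketch of such a benchmark, under some loud assumptions: the starting estimate is simply the OS-reported CPU count, the set of candidate sizes is arbitrary, and `simulated_task` fakes a blocking call with a short sleep where your real work would go. All names are made up for the example.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def simulated_task(_):
    """Stand-in for a unit of work that mostly waits (e.g. on I/O)."""
    time.sleep(0.01)


def time_pool(size, n_tasks=40):
    """Return the wall-clock time needed to run n_tasks on a pool of `size`."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=size) as pool:
        list(pool.map(simulated_task, range(n_tasks)))
    return time.perf_counter() - start


estimate = os.cpu_count() or 1
# Try a few pool sizes around the estimate; the exact set is a judgment call.
candidates = sorted({max(1, estimate // 2), estimate, estimate + 2, estimate * 2})
timings = {size: time_pool(size) for size in candidates}
best_size = min(timings, key=timings.get)
```

For a real decision you would run each size several times on a representative workload and compare medians, since a single timing is noisy.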