OpenMP omp_get_max_threads() return value - openmp

Does omp_get_max_threads() return the value of $OMP_NUM_THREADS when OMP_NUM_THREADS environment variable is defined?
It look like so on my system (Linux/gcc). Even if OMP_NUM_THREADS is set to a ridiculously high value...
Calling it in the beginning of main, i.e., not inside any (nested) parallel region (if it makes a difference).

You are right that omp_get_max_threads() returns the value of OMP_NUM_THREADS. And you can set this as high as you want: a thread is a software construct and has nothing to do with the number of cores you have. For that, use omp_get_num_procs(). Normally it wouldn't make much sense to have more threads than cores, but it could: if the work in the threads is very different, or threads can pause because of locks and such, you could get a higher processor utilization by having more threads than cores.

Related

What if GOMAXPROCS is too large?

We all know that runtime.GOMAXPROCS is set to CPU core number by default, what if this property has been set too large?
Will program have more context switches?
Will garbage collector be triggered more frequently?
GOMAXPROCS is set to the number of available logical CPUs by default for a reason: this gives best performance in most cases.
GOMAXPROCS only limits the number of "active" threads, if a thread's goroutine gets blocked (e.g. by a syscall), a new thread might be started. There is no direct correclation, see Number of threads used by Go runtime.
If GOMAXPROCS is greater than the number of available CPUs, then there will be more active threads than CPU cores, which means active threads have to be "multiplexed" to the available processing units, so yes, there will be more context switches if there are more active threads than cores, which is not necessarily the case.
Garbage collections are not directly related to the number of threads, so you shouldn't worry about that. Quoting from package runtime:
The GOGC variable sets the initial garbage collection target percentage. A collection is triggered when the ratio of freshly allocated data to live data remaining after the previous collection reaches this percentage. The default is GOGC=100. Setting GOGC=off disables the garbage collector entirely. The runtime/debug package's SetGCPercent function allows changing this percentage at run time. See https://golang.org/pkg/runtime/debug/#SetGCPercent.
If you have more threads that don't allocate / release memory, that shouldn't effect how frequently collections are triggered.
There might be cases when setting GOMAXPROCS above the number of CPUs increases the performance of your app, but they are rare. Measure to find out if it helps in your case.

How to remove Fortran race condition?

Forgive me if this is not actually a race condition; I'm not that familiar with the nomenclature.
The problem I'm having is that this code runs slower with OpenMP enabled. I think the loop should be plenty big enough (k=100,000), so I don't think overhead is the issue.
As I understand it, a race condition is occurring here because all the loops are trying to access the same v(i,j) values all the time, slowing down the code.
Would the best fix here be to create as many copies of the v() array as threads and have each thread access a different one?
I'm using intel compiler on 16 cores, and it runs just slightly slower than on a single core.
Thanks all!
!$OMP PARALLEL DO
Do 500, k=1,n
Do 10, i=-(b-1),b-1
Do 20, j=-(b-1),b-1
if (abs(i).le.l.and.abs(j).eq.d) then
cycle
endif
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
if (k.eq.n-1) then
vtest(i,j,1)=v(i,j)
endif
if (k.eq.n) then
vtest(i,j,2)=v(i,j)
endif
20 continue
10 continue
500 continue
!$OMP END PARALLEL DO
You certainly have programmed a race condition though I'm not sure that that is the cause of your program's failure to execute more quickly. This line
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
which will be executed by all threads for the same (set of) values for i and j is where the racing happens. Given that your program does nothing to coordinate reads and writes to the elements of v your program is, in practice, not deterministic as there is no way to know the order in which updates to v are made.
You should have observed this non-determinism on inspecting the results of the program, and have noticed that changing the number of threads has an impact on the results too. Then again, with a long-running stencil operation over an array the results may have converged to the same (or similar enough) values.
OpenMP gives you the tools to coordinate access to variables but it doesn't automatically implement them; there is definitely nothing going on under the hood to prevent quasi-simultaneous reads from and writes to v. So the explanation for the lack of performance improvement lies elsewhere. It may be down to the impact of multiple threads on cache at some level in your system's memory hierarchy. A nice, cache-friendly, run over every element of an array in memory order for a serial program becomes a blizzard of (as far as the cache is concerned) random accesses to memory requiring access to RAM at every go.
It's possible that the explanation lies elsewhere. If the time to execute the OpenMP version is slightly longer than the time to execute a serial version I suspect that the program is not, in fact, being executed in parallel. Failure to compile properly is a common (here on SO) cause of that.
How to fix this ?
Well the usual pattern of OpenMP across an array is to parallelise on one of the array indices. The statements
!$omp parallel do
do i=-(b-1),b-1
....
end do
ensure that each thread gets a different set of values for i which means that they write to different elements of v, removing (almost) the data race. As you've written the program each thread gets a different set of values of k but that's not used (much) in the inner loops.
In passing, testing
if (k==n-1) then
and
if (k==n) then
in every iteration looks like you are tying an anchor to your program, why not just
do k=1,n-2
and deal with the updates to vtest at the end of the loop.
You could separate the !$omp parallel do like this
!$omp parallel
do k=1,n-2
!$omp do
do i=-(b-1),b-1
(and make the corresponding changes at the end of the parallel loop and region). Now all threads execute the entire contents of the parallel region but each gets its own set of i values to use. I recommend that you add clauses to your directives to specify the accessibility (eg private or shared) of each variable; but this answer is getting a bit too long and I won't go into more detail on these. Or on using a schedule clause.
Finally, of course, even with the changes I've suggested your program will be non-deterministic because this statement
v(i,j)=.25*(v(i+1,j)+v(i-1,j)+v(i,j+1)+v(i,j-1))
will read neighbouring elements from v which are updated (at a time you have no control over) by another thread. To sort that out ... got to go back to work.

CUDA __syncthreads() usage within a warp

If it was absolutely required for all the threads in a block to be at the same point in the code, do we require the __syncthreads function if the number of threads being launched is equal to the number of threads in a warp?
Note: No extra threads or blocks, just a single warp for the kernel.
Example code:
shared _voltatile_ sdata[16];
int index = some_number_between_0_and_15;
sdata[tid] = some_number;
output[tid] = x ^ y ^ z ^ sdata[index];
Updated with more information about using volatile
Presumably you want all threads to be at the same point since they are reading data written by other threads into shared memory, if you are launching a single warp (in each block) then you know that all threads are executing together. On the face of it this means you can omit the __syncthreads(), a practice known as "warp-synchronous programming". However, there are a few things to look out for.
Remember that a compiler will assume that it can optimise providing the intra-thread semantics remain correct, including delaying stores to memory where the data can be kept in registers. __syncthreads() acts as a barrier to this and therefore ensures that the data is written to shared memory before other threads read the data. Using volatile causes the compiler to perform the memory write rather than keep in registers, however this has some risks and is more of a hack (meaning I don't know how this will be affected in the future)
Technically, you should always use __syncthreads() to conform with the CUDA Programming Model
The warp size is and always has been 32, but you can:
At compile time use the special variable warpSize in device code (documented in the CUDA Programming Guide, under "built-in variables", section B.4 in the 4.1 version)
At run time use the warpSize field of the cudaDeviceProp struct (documented in the CUDA Reference Manual)
Note that some of the SDK samples (notably reduction and scan) use this warp-synchronous technique.
You still need __syncthreads() even if warps are being executed in parallel. The actual execution in hardware may not be parallel because the number of cores within a SM (Stream Multiprocessor) can be less than 32. For example, GT200 architecture has 8 cores in each SM, so you can never be sure all threads are in the same point in the code.

How do you limit a process' CPU usage on Windows? (need code, not an app)

There is programs that is able to limit the CPU usage of processes in Windows. For example BES and ThreadMaster. I need to write my own program that does the same thing as these programs but with different configuration capabilities. Does anybody know how the CPU throttling of a process is done (code)? I'm not talking about setting the priority of a process, but rather how to limit it's CPU usage to for example 15% even if there is no other processes competing for CPU time.
Update: I need to be able to throttle any processes that is already running and that I have no source code access to.
You probably want to run the process(es) in a job object, and set the maximum CPU usage for the job object with SetInformationJobObject, with JOBOBJECT_CPU_RATE_CONTROL_INFORMATION.
Very simplified, it could work somehow like this:
Create a periodic waitable timer with some reasonable small wait time (maybe 100ms). Get a "last" value for each relevant process by calling GetProcessTimes once.
Loop forever, blocking on the timer.
Each time you wake up:
if GetProcessAffinityMask returns 0, call SetProcessAffinityMask(old_value). This means we've suspended that process in our last iteration, we're now giving it a chance to run again.
else call GetProcessTimes to get the "current" value
call GetSystemTimeAsFileTime
calculate delta by subtracting last from current
cpu_usage = (deltaKernelTime + deltaUserTime) / (deltaTime)
if that's more than you want call old_value = GetProcessAffinityMask followed by SetProcessAffinityMask(0) which will take the process offline.
This is basically a very primitive version of the scheduler that runs in the kernel, implemented in userland. It puts a process "to sleep" for a small amount of time if it has used more CPU time than what you deem right. A more sophisticated measurement maybe going over a second or 5 seconds would be possible (and probably desirable).
You might be tempted to suspend all threads in the process instead. However, it is important not to fiddle with priorities and not to use SuspendThread unless you know exactly what a program is doing, as this can easily lead to deadlocks and other nasty side effects. Think for example of suspending a thread holding a critical section while another thread is still running and trying to acquire the same object. Or imagine your process gets swapped out in the middle of suspending a dozen threads, leaving half of them running and the other half dead.
Setting the affinity mask to zero on the other hand simply means that from now on no single thread in the process gets any more time slices on any processor. Resetting the affinity gives -- atomically, at the same time -- all threads the possibility to run again.
Unluckily, SetProcessAffinityMask does not return the old mask as SetThreadAffinityMask does, at least according to the documentation. Therefore an extra Get... call is necessary.
CPU usage is fairly simple to estimate using QueryProcessCycleTime. The machine's processor speed can be obtained from HKLM\HARDWARE\DESCRIPTION\System\CentralProcessor\\~MHz (where is the processor number, one entry for each processor present). With these values, you can estimate your process' CPU usage and yield the CPU as necessary using Sleep() to keep your usage in bounds.

OpenMP thread mapping to physical cores

So I've looked around online for some time to no avail. I'm new to using OpenMP and so not sure of the terminology here, but is there a way to figure out a specific machine's mapping from OMPThread (given by omp_get_thread_num();) and the physical cores on which the threads will run?
Also I was interested in how exactly OMP assigned threads, for example is thread 0 always going to run in the same location when the same code is run on the same machine? Thanks.
Typically, the OS takes care of assigning threads to cores, including with OpenMP. This is by design, and a good thing - you normally would want the OS to be able to move a thread across cores (transparently to your application) as required, since it will interrupt your application at times.
Certain operating system APIs will allow thread affinity to be set. For example, on Windows, you can use SetThreadAffinityMask to force a thread onto a specific core.
Most of the time Reed is correct, OpenMP doesn't care about the assignment of threads to cores (or processors). However, because of things like cache reuse and data locality we have found that there are many cases where having the threads assigned to cores increases the performance of OpenMP. Therefore if you look at most OpenMP implementations, you will find that there is usually some environment variable that can be set to "bind" threads to cores. The OpenMP ARB has not yet specified any "standard" way of doing this, so at this time it is left up to an OpenMP implementation to decide if and how this should be done. There has been a great deal of discussion about whether this should be included in the OpenMP spec or not and if so how it could best be done.

Resources