Fewer threads updating value than expected with omp_set_num_threads() - parallel-processing

Why does this program print 64 rather than 5000? Since the count variable is updated inside a critical section, I would expect that only one thread has access to it at any given point in time, so each thread should be able to increment count and the result should be 5000. Why do I get 64 instead?
#include <iostream>
#include <cstdlib>   // for system()
#include <omp.h>
using namespace std;
int main()
{
    int count = 0;
    omp_set_num_threads(5000);
    #pragma omp parallel
    {
        #pragma omp critical
        {
            count++;
        }
    }
    cout << "count = " << count << endl;
    system("pause");
    return 0;
}

As Michael Dussere points out, you're getting 64 as an answer because your implementation is only launching 64 threads. It may be using an internal default value to limit the maximum number of threads (try varying the environment variable OMP_THREAD_LIMIT, or calling omp_get_thread_limit(), to see if that is the case).
The reason for such a limit is that creating threads requires resources - each thread has to have its own stack space, process table entries on Linux, and so on. These aren't lightweight stateless Erlang threads that are scheduled in user space. On my 8-core system using gcc or icpc, setting the thread number to 1024 or above simply fails due to lack of resources, although tuning system parameters can shift that limitation around.
Between the resources required by the threads and the fact that most single-image systems have significantly fewer than 5000 cores, it's not clear what you'd be able to accomplish with 5000 threads on most systems.

The value you can set with omp_set_num_threads is not unlimited.
It depends on the OpenMP implementation you use, the number of cores in your computer, and so on.
You get 64 because there are 64 threads in the current thread team. You can check with omp_get_num_threads().
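To see what your runtime actually allows, a minimal sketch along these lines (assuming an OpenMP 3.0 or later compiler, since omp_get_thread_limit() was introduced in that version) prints the thread limit and the size of the team that actually gets created:
#include <stdio.h>
#include <omp.h>
int main()
{
    omp_set_num_threads(5000);   // request far more threads than most runtimes will grant

    printf("thread limit = %d\n", omp_get_thread_limit());

    #pragma omp parallel
    {
        #pragma omp single       // one thread reports the size of the team actually created
        printf("team size    = %d\n", omp_get_num_threads());
    }
    return 0;
}
If the reported limit is 64, that explains the output; otherwise the cap comes from some other implementation default.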

Related

Why is it a Bad Thing for one process to be able to read, or even write, to memory occupied by a different process? [duplicate]

It's my understanding that if two threads are reading from the same piece of memory, and no thread is writing to that memory, then the operation is safe. However, I'm not sure what happens if one thread is reading and the other is writing. What would happen? Is the result undefined? Or would the read just be stale? If a stale read is not a concern, is it OK to have unsynchronized read-write access to a variable? Or is it possible the data would be corrupted, so that neither the read nor the write would be correct and one should always synchronize in this case?
I want to say that I've learned it is the latter case, that a race on memory access leaves the state undefined... but I don't remember where I may have learned that and I'm having a hard time finding the answer on Google. My intuition is that a variable is operated on in registers, and that true (as in hardware) concurrency is impossible (or is it?), so that the worst that could happen is stale data, i.e. the following:
WriteThread: copy value from memory to register
WriteThread: update value in register
ReadThread: copy value of memory to register
WriteThread: write new value to memory
At which point the read thread has stale data.
Usually memory is read or written in atomic units determined by the CPU architecture (32-bit and 64-bit items aligned on 32-bit and 64-bit boundaries are common these days).
In this case, what happens depends on the amount of data being written.
Let's consider the case of 32 bit atomic read/write cells.
If two threads write 32 bits into such an aligned cell, then it is absolutely well defined what happens: one of the two written values is retained. Unfortunately for you (well, the program), you don't know which value. By extremely clever programming, you can actually use this atomicity of reads and writes to build synchronization algorithms (e.g., Dekker's algorithm), but it is typically faster to use architecturally defined locks instead.
If two threads write more than an atomic unit (e.g., they both write a 128-bit value), then the atomic-unit-sized pieces of the values will each be stored in an absolutely well-defined way, but you won't know which pieces of which value get written in what order. So what may end up in storage is the value from the first thread, the value from the second thread, or a mix of atomic-unit-sized pieces from both.
Similar ideas hold for one thread reading and one thread writing, in atomic units and larger.
Basically, you don't want to do unsynchronized reads and writes to memory locations, because you won't know the outcome, even though it may be very well defined by the architecture.
The result is undefined. Corrupted data is entirely possible. For an obvious example, consider a 64-bit value being manipulated by a 32-bit processor. Let's assume the value is a simple counter, and we increment it when the lower 32 bits contain 0xffffffff. The increment produces 0x00000000. When we detect that, we increment the upper word. If, however, some other thread reads the value between the time the lower word is incremented and the time the upper word is incremented, it gets a value with an un-incremented upper word but a lower word of 0 -- a value completely different from what it would have been either before or after the increment completed.
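As a minimal sketch of the synchronized alternative (my own illustration, assuming a C++11 compiler, not code from the answers above): std::atomic makes the outcome well defined, and an atomic 64-bit object cannot be observed half-written even on a 32-bit target:
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>

// With a plain std::uint64_t here, this program would contain a data race
// (undefined behaviour in C++), and on a 32-bit target a reader could also
// observe a torn value. std::atomic guarantees that every load returns a
// value that some thread actually stored.
std::atomic<std::uint64_t> counter{0};

int main()
{
    std::thread writer([] {
        for (std::uint64_t i = 1; i <= 1000000; ++i)
            counter.store(i, std::memory_order_relaxed);
    });
    std::thread reader([] {
        std::uint64_t last = 0;
        for (int i = 0; i < 1000000; ++i)
            last = counter.load(std::memory_order_relaxed);
        std::printf("reader last saw %llu\n", (unsigned long long)last);
    });
    writer.join();
    reader.join();
    std::printf("final value = %llu\n", (unsigned long long)counter.load());
    return 0;
}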
As I hinted in Ira Baxter's answer, CPU cache also plays a part on multicore systems. Consider the following test code:
DANGER, WILL ROBINSON!
The following code boosts the process priority to realtime to achieve somewhat more consistent results. Doing so requires admin privileges, and be careful about running the code on dual- or single-core systems, since your machine will lock up for the duration of the test run.
#include <windows.h>
#include <stdio.h>

const int RUNFOR = 5000;
volatile bool terminating = false;
volatile int value;

static DWORD WINAPI CountErrors(LPVOID parm)
{
    int errors = 0;
    while (!terminating)
    {
        value = (int) parm;
        if (value != (int) parm)
            errors++;
    }
    printf("\tThread %08X: %d errors\n", parm, errors);
    return 0;
}

static void RunTest(int affinity1, int affinity2)
{
    terminating = false;
    DWORD dummy;
    HANDLE t1 = CreateThread(0, 0, CountErrors, (void*)0x1000, CREATE_SUSPENDED, &dummy);
    HANDLE t2 = CreateThread(0, 0, CountErrors, (void*)0x2000, CREATE_SUSPENDED, &dummy);
    SetThreadAffinityMask(t1, affinity1);
    SetThreadAffinityMask(t2, affinity2);
    ResumeThread(t1);
    ResumeThread(t2);
    printf("Running test for %d milliseconds with affinity %d and %d\n", RUNFOR, affinity1, affinity2);
    Sleep(RUNFOR);
    terminating = true;
    Sleep(100); // let threads have a chance of picking up the "terminating" flag.
}

int main()
{
    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
    RunTest(1, 2); // core 1 & 2
    RunTest(1, 4); // core 1 & 3
    RunTest(4, 8); // core 3 & 4
    RunTest(1, 8); // core 1 & 4
}
On my quad-core Intel Q6600 system (which, IIRC, has two pairs of cores where each pair shares an L2 cache - that would explain the results, anyway ;)), I get the following results:
Running test for 5000 milliseconds with affinity 1 and 2
Thread 00002000: 351883 errors
Thread 00001000: 343523 errors
Running test for 5000 milliseconds with affinity 1 and 4
Thread 00001000: 48073 errors
Thread 00002000: 59813 errors
Running test for 5000 milliseconds with affinity 4 and 8
Thread 00002000: 337199 errors
Thread 00001000: 335467 errors
Running test for 5000 milliseconds with affinity 1 and 8
Thread 00001000: 55736 errors
Thread 00002000: 72441 errors

Different OpenMP output on different machines

When I try to run the following code on my system (CentOS running in a virtual machine) I get the right output, but when I run the same code on the compact supercomputer "Param Shavak" I get incorrect output... :(
#include <stdio.h>
#include <omp.h>
int main()
{
    int p=1, s=1, tid;
    #pragma omp parallel private(p,tid) shared(s)
    {
        p = 1;
        tid = omp_get_thread_num();
        p = p + tid;
        s = s + tid;
        printf("Thread %d P=%d S=%d\n", tid, p, s);
    }
    return 0;
}
If your program runs correctly on one machine, it must be because it's actually not running in parallel on that machine.
Your program suffers from a race condition in the s=s+tid; line of code. s is a shared variable, so several threads at the same time try to update it, which results in data loss.
You can fix the problem by turning that line of code into an atomic operation:
#pragma omp atomic
s=s+tid;
That way only one thread at a time can read and update the variable s, and the race condition is no more.
In more complex programs you should use atomic operations or critical regions only when necessary, because you don't have parallelism in those regions and that hurts performance.
EDIT: As suggested by user High Performance Mark, I must remark that the program above is very inefficient because of the atomic operation. The proper way to do that kind of calculation (adding to the same variable in all iterations of a loop) is to implement a reduction. OpenMP makes it easy by using the reduction clause:
#pragma omp parallel reduction(operator : variables)
Try this version of your program, using reduction:
#include <stdio.h>
#include <omp.h>
int main()
{
    int p=1, s=1, tid;
    #pragma omp parallel reduction(+:s) private(p,tid)
    {
        p = 1;
        tid = omp_get_thread_num();
        p = p + tid;
        s = s + tid;
        printf("Thread %d P=%d S=%d\n", tid, p, s);
    }
    return 0;
}
The following link explains critical sections, atomic operations and reduction in a more verbose way: http://www.lindonslog.com/programming/openmp/openmp-tutorial-critical-atomic-and-reduction/
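For the common case of accumulating into one variable inside a loop, the same reduction clause attaches to a parallel for. A small sketch of my own, separate from the program above:
#include <stdio.h>
#include <omp.h>
int main()
{
    int sum = 0;

    // Each thread accumulates into a private copy of sum;
    // the copies are combined with + when the loop finishes.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000; i++)
        sum += i;

    printf("sum = %d\n", sum);   // 499500
    return 0;
}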

OpenMP output for "for" loop

I am new to OpenMP and I just tried to write a small program with the parallel for construct. I have trouble understanding the output of my program. I don't understand why thread number 3 prints the output before 1 and 2. Could someone offer me an explanation?
So, the program is:
#pragma omp parallel for
for (i = 0; i < 7; i++) {
    printf("We are in thread number %d and are printing %d\n",
           omp_get_thread_num(), i);
}
and the output is:
We are in thread number 0 and are printing 0
We are in thread number 0 and are printing 1
We are in thread number 3 and are printing 6
We are in thread number 1 and are printing 2
We are in thread number 1 and are printing 3
We are in thread number 2 and are printing 4
We are in thread number 2 and are printing 5
My processor is an Intel(R) Core(TM) i5-2410M CPU with 4 cores.
Thank you!
OpenMP makes no guarantees of the relative ordering, in time, of the execution of statements by different threads. OpenMP leaves it to the programmer to impose such ordering if it is required. In general it is not required, in many cases not even desirable, which is why OpenMP's default behaviour is as it is. The cost, in time, of imposing such an ordering is likely to be significant.
I suggest you run much larger tests several times; you should observe that the cross-thread sequencing of events is, essentially, random.
If you want to print in order then you can use the ordered construct
#pragma omp parallel for ordered
for (i = 0; i < 7; i++) {
    #pragma omp ordered
    printf("We are in thread number %d and are printing %d\n",
           omp_get_thread_num(), i);
}
I assume this requires threads handling larger iterations to wait for the ones handling lower iterations, so it will have an effect on performance. You can see it used here: http://bisqwit.iki.fi/story/howto/openmp/#ExampleCalculatingTheMandelbrotFractalInParallel
That draws the Mandelbrot set as characters using ordered. A much faster solution than using ordered is to fill an array of characters in parallel and then print it serially (try the code). Since one uses OpenMP for performance, I have never found a good reason to use ordered, but I'm sure it has its uses somewhere.
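As a rough illustration of the fill-then-print idea (a minimal sketch of my own, not the code from the linked page):
#include <stdio.h>
#include <omp.h>

#define N 7

int main()
{
    int thread_of[N];                            // which thread handled iteration i

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        thread_of[i] = omp_get_thread_num();     // no ordering needed: disjoint slots

    for (int i = 0; i < N; i++)                  // serial loop, so output comes out in order
        printf("Iteration %d was run by thread %d\n", i, thread_of[i]);

    return 0;
}
Each iteration writes to a distinct element, so no synchronization is needed inside the parallel loop.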

Why is this simple OpenCL kernel running so slowly?

I'm looking into OpenCL, and I'm a little confused why this kernel is running so slowly, compared to how I would expect it to run. Here's the kernel:
__kernel void copy(
    const __global char* pSrc,
    __global __write_only char* pDst,
    int length)
{
    const int tid = get_global_id(0);
    if(tid < length) {
        pDst[tid] = pSrc[tid];
    }
}
I've created the buffers in the following way:
char* out = new char[2048*2048];
cl::Buffer(
    context,
    CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY,
    length,
    out);
Ditto for the input buffer, except that I've initialized the in pointer to random values. Finally, I run the kernel this way:
cl::Event event;
queue.enqueueNDRangeKernel(
    kernel,
    cl::NullRange,
    cl::NDRange(length),
    cl::NDRange(1),
    NULL,
    &event);
event.wait();
On average, the time is around 75 milliseconds, as calculated by:
cl_ulong startTime = event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong endTime = event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
std::cout << (endTime - startTime) * SECONDS_PER_NANO / SECONDS_PER_MILLI << "\n";
I'm running Windows 7, with an Intel i5-3450 chip (Sandy Bridge architecture). For comparison, the "direct" way of doing the copy takes less than 5 milliseconds. I don't think the event.getProfilingInfo includes the communication time between the host and device. Thoughts?
EDIT:
At the suggestion of ananthonline, I changed the kernel to use float4s instead of chars, and that dropped the average run time to about 50 millis. Still not as fast as I would have hoped, but an improvement. Thanks ananthonline!
I think your main problem is the 2048*2048 work groups you are using. The OpenCL drivers on your system have to manage a lot more overhead if you have this many single-item work groups. This would be especially bad if you were to execute this program on a GPU, because you would get a very low level of saturation of the hardware.
Optimization: call your kernel with larger work groups. You don't even have to change your existing kernel. What should this size be? I have used 64 below as an example; 64 happens to be a decent number on most hardware.
size_t myOptimalGroupSize = 64;   // a plain size_t; cl::size_t in the old C++ wrapper is an array template, not a scalar
cl::Event event;
queue.enqueueNDRangeKernel(
    kernel,
    cl::NullRange,
    cl::NDRange(length),
    cl::NDRange(myOptimalGroupSize),
    NULL,
    &event);
event.wait();
You should also get your kernel to do more than copy a single value. I have given an answer to a similar question about global memory over here.
CPUs are very different from GPUs. Running this on an x86 CPU, the best way to achieve decent performance would be to use double16 (the largest data type) instead of char or float4 (as suggested by someone else).
In my limited experience with OpenCL on CPUs, I have never reached performance levels that I could get with an OpenMP parallelization.
The best way to do a copy in parallel on a CPU would be to divide the block to copy into a small number of large sub-blocks, and let each thread copy one sub-block.
The GPU approach is orthogonal: each thread participates in the copy of the same block.
This is because on GPUs, different threads can access contiguous memory regions efficiently (coalescing).
To do an efficient copy on the CPU with OpenCL, use a loop inside your kernel to copy contiguous data. Then use a workgroup size no larger than the number of available cores.
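As a rough sketch of that approach (the kernel below is my own illustration, not code from the question): each work-item copies one contiguous sub-block, so on a CPU you would launch roughly one work-item per core.
// Hypothetical chunked copy: each work-item handles one contiguous sub-block.
__kernel void copy_chunked(
    const __global char* pSrc,
    __global char* pDst,
    int length)
{
    const int gid    = get_global_id(0);
    const int nItems = get_global_size(0);
    const int chunk  = (length + nItems - 1) / nItems;   // ceil(length / nItems)
    const int begin  = gid * chunk;
    const int end    = min(begin + chunk, length);

    for (int i = begin; i < end; i++)
        pDst[i] = pSrc[i];
}
Enqueue it with a global size of roughly the number of cores, so each core streams one large contiguous block.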
I believe it is the cl::NDRange(1) which is telling the runtime to use single-item work groups. This is not efficient. In the C API you can pass NULL for this to leave the work-group size up to the runtime; there should be a way to do that in the C++ API as well (perhaps also just NULL). This should be faster on the CPU; it certainly will be on a GPU.
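In the C++ wrapper, the counterpart of passing NULL in the C API is cl::NullRange for the local size. A small sketch of the call, reusing the kernel, queue, length and event names from the question:
cl::Event event;
// Passing cl::NullRange as the local work-group size lets the runtime pick
// a work-group size itself, instead of forcing single-item groups.
queue.enqueueNDRangeKernel(
    kernel,
    cl::NullRange,          // offset
    cl::NDRange(length),    // global size
    cl::NullRange,          // local size: chosen by the runtime
    NULL,
    &event);
event.wait();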

OpenMP slows down program instead of speeding it up: a bug in gcc?

I will first give some background about the problem I'm having so you know what I'm trying to do. I have been helping out with the development of a certain software tool and found out that we could benefit greatly from using OpenMP to parallelize some of the biggest loops in this software. We actually parallelized the loops successfully, and with just two cores the loops executed 30% faster, which was an OK improvement. On the other hand, we noticed a weird phenomenon in a function that traverses a tree structure using recursive calls. The program actually slowed down here with OpenMP on, and the execution time of this function more than doubled. We thought that maybe the tree structure was not balanced enough for parallelization and commented out the OpenMP pragmas in this function. This appeared to have no effect on the execution time, though. We are currently using GCC 4.4.6 with the -fopenmp flag for OpenMP support. And here is the current problem:
If we don't use any omp pragmas in the code, all runs fine. But if we add just the following to the beginning of the program's main function, the execution time of the tree traversal function more than doubles, from 35 seconds to 75 seconds:
//beginning of main function
...
#pragma omp parallel
{
#pragma omp single
{}
}
//main function continues
...
Does anyone have any clues about why this happens? I don't understand why the program slows down so greatly just from using the OpenMP pragmas. If we take off all the omp pragmas, the execution time of the tree traversal function drops back to 35 seconds again. I would guess that this is some sort of compiler bug as I have no other explanation on my mind right now.
Not everything that can be parallelized should be parallelized. If you are using a single, then only one thread executes it and the rest have to wait until the region is done. They can either spin-wait or sleep. Most implementations start out with a spin-wait, hoping that the single region will not take too long and the waiting threads will see the completion faster than if they were sleeping. Spin-waits eat up a lot of processor cycles. You can try specifying that the wait should be passive - but this is only in OpenMP V3.0 and is only a hint to the implementation (so it might not have any effect). Basically, unless you have a lot of work in the parallel region that can compensate for the single, the single is going to increase the parallel overhead substantially and may well make it too expensive to parallelize.
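For reference, with an OpenMP 3.0 runtime the passive-wait hint can be requested through an environment variable before the program is launched (whether it helps at all is implementation dependent; the program name below is just a placeholder):
export OMP_WAIT_POLICY=passive   # ask waiting threads to sleep rather than spin-wait
./your_program                   # placeholder for the actual executable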
First, OpenMP often reduces performance on the first try. It can be tricky to use omp parallel if you don't understand it inside-out. I may be able to help if you can tell me a little more about the program structure, specifically the following questions annotated by ????.
//beginning of main function
...
#pragma omp parallel
{
???? What goes here, is this a loop? if so, for loop, while loop?
#pragma omp single
{
???? What goes here, how long does it run?
}
}
//main function continues
....
???? Does the performance degrade in this code, or somewhere else?
Thanks.
Thank you everyone. We were able to fix the issue today by linking with TCMalloc, one of the solutions ejd offered. The execution time dropped immediately and we got around a 40% improvement in execution time over the non-threaded version, using 2 cores. It seems that when using OpenMP on Unix with GCC, you should also pick a replacement for the standard memory management solution. Otherwise the program may just slow down.
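For anyone hitting the same problem, linking against TCMalloc only means adding the library to the link line (a sketch with placeholder file names; libtcmalloc, e.g. from gperftools, must be installed):
g++ -O2 -fopenmp main.cpp -o my_tool -ltcmalloc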
I did some more testing and made a small test program to check whether the issue could be related to memory operations. I was unable to replicate the issue of an empty parallel-single region causing the program to slow down in my small test program, but I was able to replicate the slowdown by parallelizing some malloc calls.
When running the test program on Windows 7 64-bit with 2 CPU cores, no noticeable slowdown was caused by compiling with the -fopenmp flag in gcc (g++), compared to running the program without OpenMP support.
Doing the same on Kubuntu 11.04 64-bit on the same computer, however, raised the execution time to over 4 times that of the non-OpenMP version. This issue seems to appear only on Unix systems and not on Windows.
The source of my test program is below. I have also uploaded the zipped source for the Windows and Unix versions, as well as the assembly output for both, with and without OpenMP support. The zip can be downloaded here: http://www.2shared.com/file/0thqReHk/omp_speed_test_2011_05_11.html
#include <stdio.h>
#include <windows.h>
#include <list>
#include <sys/time.h>
//#include <cstdlib>

using namespace std;

int main(int argc, char* argv[])
{
    // #pragma omp parallel
    // #pragma omp single
    // {}
    int start = GetTickCount();
    /*
    struct timeval begin, end;
    int usecs;
    gettimeofday(&begin, NULL);
    */
    list<void *> pointers;

    // Allocate in parallel. Note that pushing to the shared list from several
    // threads is itself unsynchronized; the point of the test is the malloc calls.
    #pragma omp parallel for default(shared)
    for(int i = 0; i < 10000; i++)
        //pointers.push_back(calloc(20000, sizeof(void *)));
        pointers.push_back(malloc(20000));

    for(list<void *>::iterator i = pointers.begin(); i != pointers.end(); i++)
        free(*i);
    /*
    gettimeofday(&end, NULL);
    if (end.tv_usec < begin.tv_usec) {
        end.tv_usec += 1000000;
        begin.tv_sec += 1;
    }
    usecs = (end.tv_sec - begin.tv_sec) * 1000000;
    usecs += (end.tv_usec - begin.tv_usec);
    */
    printf("It took %d milliseconds to finish the memory operations", GetTickCount() - start);
    //printf("It took %d milliseconds to finish the memory operations", usecs/1000);
    return 0;
}
What remains unanswered now is: what can I do to avoid issues like these on the Unix platform?
