OpenMP number of threads is higher than asked for

I'm implementing an OpenMP version of a sequential program, and for a function that distributes a list among the threads, that function needs to know the number of threads.
Boiled down, the code looks like this:
int numberOfThreads = 0;
#pragma omp parallel
{
    // split nodeQueue
    omp_set_num_threads(NUM_THREADS);
    #pragma omp master
    {
        cout << "Asked for " << NUM_THREADS << endl;
        numberOfThreads = omp_get_num_threads();
        cout << "Got " << numberOfThreads << " threads" << endl;
        splitNodeQueue(numberOfThreads);
    }
}
No matter what I set NUM_THREADS to, it seems to get 4 threads, and outputs:
Asked for 1
Got 4 threads
Shouldn't it get at most NUM_THREADS threads when I use omp_set_num_threads(NUM_THREADS)?
It doesn't matter what number of threads I ask for - it always gets 4 (which is the number of threads available on the CPU)...
Can't I force it to use the specified number of threads as a maximum?

I think setting the number of threads from within a parallel region does not change the number of threads for the fork at the start of that region; it only changes the thread count for parallel regions encountered afterwards, such as nested ones, and nested parallelism defaults to disabled (one thread) per the OpenMP spec.
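For illustration, a minimal sketch of the fix that follows from this (NUM_THREADS is assumed to be a macro, as in the question): call omp_set_num_threads before the parallel region, or request the count on the directive itself with num_threads.

#include <iostream>
#include <omp.h>

#define NUM_THREADS 2   // assumed value, for illustration only

int main() {
    omp_set_num_threads(NUM_THREADS);   // set BEFORE the parallel region
    #pragma omp parallel                // or: #pragma omp parallel num_threads(NUM_THREADS)
    {
        #pragma omp master
        std::cout << "Got " << omp_get_num_threads() << " threads\n";
    }
    return 0;
}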

Related

What kind of problem does this program have (OpenMP), and how do I avoid it?

Following is my code; I want to know why a race condition happens here and how to solve it:
#include <iostream>
#include <omp.h>

int main() {
    int a = 123;
    #pragma omp parallel num_threads(2)
    {
        int thread_id = omp_get_thread_num();
        int b = (thread_id + 1) * 10;
        a += b;   // unsynchronized update of a shared variable
    }
    std::cout << "a = " << a << "\n";
    return 0;
}
A race condition occurs when two or more threads access shared data and at least one of them changes its value at the same time. In your code this line causes a race condition:
a += b;
a is a shared variable updated by two threads simultaneously, so the final result may be incorrect. Note that, depending on the hardware, a possible race condition does not necessarily mean that a data race will actually occur, so the result may happen to be correct, but it is still a semantic error in your code.
To fix it you have two options:
use an atomic operation:
    #pragma omp atomic
    a += b;
use a reduction:
    #pragma omp parallel num_threads(2) reduction(+:a)
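For completeness, a minimal sketch of the whole program with the reduction applied (same code as the question; only the pragma changes, and the arithmetic in the final comment is mine):

#include <iostream>
#include <omp.h>

int main() {
    int a = 123;
    // each thread gets a private copy of a (initialized to 0, the identity for +),
    // and the copies are combined into the original after the region ends
    #pragma omp parallel num_threads(2) reduction(+:a)
    {
        int thread_id = omp_get_thread_num();
        int b = (thread_id + 1) * 10;
        a += b;
    }
    std::cout << "a = " << a << "\n";   // deterministically 123 + 10 + 20 = 153
    return 0;
}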

Counting time or CPU ticks in Cygwin

I can't measure CPU processing time in Cygwin. Why is that? Do I need a special command? Counting CPU clock ticks is done with the clock() function after including time.h. Still, after I get it working in Visual Studio, I just can't get it to run under Cygwin. Why is that?
Here is the code.
#include <iostream>
#include <cstdlib>   // for system()
#include <time.h>
using namespace std;

int main()
{
    clock_t t1, t2;
    int x = 0;
    int num;
    cout << "0 to get out of program, else, number of iterations" << endl;
    cin >> num;
    if (num == 0)
        return 0;          // was system(0), which does not exit the program
    t1 = clock();
    while (x != num)
    {
        cout << "Number " << x << " e" << endl;
        if (x % 2 == 0)
            cout << "Even" << endl;
        else
            cout << "Odd" << endl;
        x = x + 1;
    }
    t2 = clock();
    float diff((float)t2 - (float)t1);
    cout << diff << endl;
    float seconds = diff / CLOCKS_PER_SEC;
    cout << seconds << endl;
    system("pause");
    return 0;
}
Sorry for the bad English.
Looks like the clock() function is defined differently for Windows and POSIX (and hence Cygwin). MSDN says that the Windows clock() returns "the elapsed wall-clock time since the start of the process", whereas the POSIX version returns "the implementation's best approximation to the processor time used by the process". In your example, the process will be spending almost its entire time waiting for output to the terminal to complete, which doesn't count towards the processing time.
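If wall-clock time is what you actually want on both platforms, a minimal sketch using std::chrono (C++11) sidesteps the clock() discrepancy altogether; the timed section here is just a placeholder:

#include <iostream>
#include <chrono>

int main()
{
    auto t1 = std::chrono::steady_clock::now();
    // ... work to be timed goes here ...
    auto t2 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    std::cout << ms << " milliseconds elapsed\n";
    return 0;
}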

Can compiler reorder code over calls to std::chrono::system_clock::now()?

While playing with the VS11 beta I noticed something weird:
this code prints
f() took 0 milliseconds
#include <iostream>
#include <vector>
#include <chrono>
#include <cstdint>
#include <cstdlib>

int main()
{
    std::vector<int> v;
    size_t length = 64 * 1024 * 1024;
    for (size_t i = 0; i < length; i++)
    {
        v.push_back(rand());
    }
    uint64_t sum = 0;
    auto t1 = std::chrono::system_clock::now();
    for (size_t i = 0; i < v.size(); ++i)
        sum += v[i];
    //std::cout << sum << std::endl;
    auto t2 = std::chrono::system_clock::now();
    std::cout << "f() took "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()
              << " milliseconds\n";
}
But when I uncomment the line that prints the sum, it prints a reasonable number.
This is the behaviour I get with optimizations enabled; with them disabled I get the "normal" output:
f() took 471 milliseconds
So is this standard-compliant behaviour?
Important: it is not that the dead code gets optimized away entirely - I can see the lag when running from the console, and I can see a CPU spike in Task Manager.
My guess is that this is dead-code elimination: your CPU spike is the work initializing the vector, which isn't being optimized away, but the computation of your unused sum variable is.
But when I uncomment the line that prints the sum, it prints a reasonable number.
That goes along with my theory, yes - when you're forced to use the result of the computation, the computation itself can't be optimized away.
If you want to confirm that further, make your program announce when it's ready and pause for you to press return - that lets you wait until any CPU spike is clearly over before pressing return, which gives you more confidence about what's causing it.
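A minimal sketch of one common way to keep the summation from being eliminated without printing it: store the result into a volatile sink. The sink variable is my own addition, not part of the question's code.

#include <iostream>
#include <vector>
#include <chrono>
#include <cstdint>
#include <cstdlib>

volatile uint64_t sink;   // a volatile store is an observable side effect the optimizer must keep

int main()
{
    std::vector<int> v(64 * 1024 * 1024);
    for (size_t i = 0; i < v.size(); i++)
        v[i] = rand();

    uint64_t sum = 0;
    auto t1 = std::chrono::system_clock::now();
    for (size_t i = 0; i < v.size(); ++i)
        sum += v[i];
    sink = sum;           // the loop's result is now used, so the loop must run
    auto t2 = std::chrono::system_clock::now();
    std::cout << "f() took "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()
              << " milliseconds\n";
}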

pthread failure to join with unknown error

I'm planning to use pthreads and Mach semaphores to farm out a parallel computation to a limited number of CPUs, and I can't quite get a test program to work. Right now I have something that just runs through the threads and prints an identifier, so that I can verify that it works. The code is pretty simple, except that I'm on OS X, so I have to use Mach semaphores instead of POSIX ones. My code is below:
#include <iostream>
#include <cstring>   // for strerror
#include <cstdlib>   // for exit
#include <pthread.h>
#include <semaphore.h>
#include <errno.h>
#include <mach/semaphore.h>
#include <mach/mach.h>

#define MAX_THREADS 256

semaphore_t free_CPU = 0;

void* t_function(void *arg) {
    int* cur_number = (int*) arg;
    kern_return_t test = semaphore_wait(free_CPU);
    std::cout << "I am thread # " << *cur_number << ". Kernel return is " << test << std::endl;
    semaphore_signal(free_CPU);
    std::cout << "I am thread # " << *cur_number << ". I just signaled the semaphore." << std::endl;
    pthread_exit(NULL);
}

int main (int argc, char * const argv[]) {
    int num_reps = 10;
    int n_threads = 1;
    if (n_threads < MAX_THREADS) {
        n_threads += 0;
    } else {
        n_threads = MAX_THREADS;
    }
    pthread_t threads[n_threads];
    semaphore_create(mach_task_self(), &free_CPU, SYNC_POLICY_FIFO, 1);
    // Loop over a bunch of things, feeding out to only n_threads threads at a time!
    int i;
    int* numbers = new int[num_reps];
    for (i = 0; i < num_reps; i++) {
        numbers[i] = i;
        std::cout << "Throwing thread " << numbers[i] << std::endl;
        int rc = pthread_create(&threads[i], NULL, &t_function, &numbers[i]);
        if (rc) {
            std::cout << "Failed to throw thread " << i << ". Error: " << strerror(errno) << std::endl;
            exit(1);
        }
    }
    std::cout << "Threw all threads" << std::endl;
    // Loop over threads to join
    for (i = 0; i < num_reps; i++) {
        std::cout << "Joining thread " << i << std::endl;
        int rc = pthread_join(threads[i], NULL);
        if (rc) {
            std::cout << "Failed to join thread " << i << ". Error: " << strerror(errno) << std::endl;
            exit(1);
        }
    }
    semaphore_destroy(mach_task_self(), free_CPU);
    delete[] numbers;
    return 0;
}
Running this code gives me:
Throwing thread 0
Throwing thread 1
Throwing thread 2
Throwing thread 3
Throwing thread 4
Throwing thread 5
Throwing thread 6
Throwing thread 7
Throwing thread 8
Throwing thread 9
Threw all threads
Joining thread 0
I am thread # 0. Kernel return is 0
I am thread # 0. I just signaled the semaphore.
I am thread # 1. Kernel return is 0
I am thread # 1. I just signaled the semaphore.
I am thread # 2. Kernel return is 0
I am thread # 2. I just signaled the semaphore.
I am thread # 3. Kernel return is 0
I am thread # 3. I just signaled the semaphore.
I am thread # 4. Kernel return is 0
I am thread # 4. I just signaled the semaphore.
I am thread # 5. Kernel return is 0
I am thread # 5. I just signaled the semaphore.
I am thread # 6. Kernel return is 0
I am thread # 6. I just signaled the semaphore.
I am thread # 7. Kernel return is 0
I am thread # 7. I just signaled the semaphore.
I am thread # 8. Kernel return is 0
I am thread # 8. I just signaled the semaphore.
I am thread # 9. Kernel return is 0
I am thread # 9. I just signaled the semaphore.
Joining thread 1
Joining thread 2
Joining thread 3
Joining thread 4
Joining thread 5
Joining thread 6
Joining thread 7
Joining thread 8
Failed to join thread 8. Error: Unknown error: 0
To me, it looks like everything is totally fine, except it just bites the dust when it tries to join thread 8. I have no clue what's going on.
Your problem lies here:
#define MAX_THREADS 256
:
int n_threads = 1;
if (n_threads < MAX_THREADS) {
    n_threads += 0;
} else {
    n_threads = MAX_THREADS;
}
pthread_t threads[n_threads];
This gives you an array of one thread ID, and you're then trying to populate ten of them.
I'm not entirely certain what you're trying to achieve with that. It seems to me that, if you just used num_reps to dimension your array, it would work fine (you'd get an array of ten elements).
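For illustration, a minimal sketch of the corrected sizing, as a stripped-down stand-in for the question's program (the semaphore machinery is omitted; build with -pthread):

#include <iostream>
#include <pthread.h>

void* t_function(void* arg) {
    std::cout << "I am thread # " << *(int*)arg << std::endl;
    return NULL;
}

int main() {
    const int num_reps = 10;
    pthread_t threads[num_reps];   // dimensioned by num_reps, not by n_threads
    int numbers[num_reps];
    for (int i = 0; i < num_reps; i++) {
        numbers[i] = i;
        pthread_create(&threads[i], NULL, &t_function, &numbers[i]);
    }
    for (int i = 0; i < num_reps; i++)
        pthread_join(threads[i], NULL);   // every join now targets a valid slot
    return 0;
}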

OpenMP - executing threads on chunks

I have the following piece of code, which I want to parallelize in a certain way. I'm making a mistake somewhere, and as a result not all threads are running the loop the way I thought they would. It would be great if somebody could help me identify that mistake.
This is code to calculate histograms.
#pragma omp parallel default(shared) private(iIndex2, iIndex1, fDist) shared(iSize, dense) reduction(+:iCount)
{
    chunk = (unsigned int)(iSize / omp_get_num_threads());
    threadID = omp_get_thread_num();
    svtout << "Number of threads available " << omp_get_num_threads() << endl;
    svtout << "The threadID is " << threadID << endl;

    // want each of the threads to execute the loop
    for (iIndex1 = 0; iIndex1 < chunk; iIndex1++)
    {
        for (iIndex2 = iIndex1 + 1; iIndex2 < chunk; iIndex2++)
        {
            iCount++;
            fDist = (*this)[iIndex1 + threadID*chunk].distance( (*this)[iIndex2 + threadID*chunk] );
            idx = (int)(fDist/fWidth);
            if ((int)fDist % (int)fWidth >= 0)
            {
                #pragma omp atomic
                dense[idx] += 1;
            }
        }
    }
}
The iCount variable keeps track of the number of iterations, and I noticed that there is a marked difference between the serial and parallel versions. I guess not all threads are running, and hence the histogram values I'm obtaining from the parallel program are much lower than the actual readings (the dense array stores the histogram values).
Thanks,
Sayan
You are looping over chunk rather than iSize when more than one thread is running: each thread only forms pairs within its own chunk, so pairs that span two chunks are never counted.
Try replacing the loop bounds with iSize.
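For illustration, a minimal self-contained sketch of that suggestion, letting OpenMP distribute the full outer range across threads instead of hand-slicing chunks. The container, distance function, bin count, and sizes here are stand-ins of mine, not the question's actual types (build with -fopenmp):

#include <cmath>
#include <cstdio>
#include <vector>
#include <omp.h>

int main() {
    const int iSize = 1000;          // assumed problem size
    const double fWidth = 0.1;       // assumed bin width
    std::vector<double> data(iSize); // stand-in for the question's container
    for (int i = 0; i < iSize; i++)
        data[i] = std::sin(i);       // arbitrary sample values

    std::vector<long long> dense(64, 0);
    long long iCount = 0;

    // Distribute the FULL outer range across threads; every pair (i, j) with
    // i < j is visited exactly once, unlike the per-chunk version above.
    #pragma omp parallel for schedule(dynamic) reduction(+:iCount)
    for (int i = 0; i < iSize; i++) {
        for (int j = i + 1; j < iSize; j++) {
            iCount++;
            double fDist = std::fabs(data[i] - data[j]);   // stand-in distance
            int idx = (int)(fDist / fWidth);
            if (idx >= 0 && idx < (int)dense.size()) {
                #pragma omp atomic
                dense[idx] += 1;
            }
        }
    }
    std::printf("pairs counted: %lld\n", iCount);
    return 0;
}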
