The following code (p. 187 of Computational Geometry in C by O'Rourke) takes the same time to run in parallel (2 processors) as it does serially.
Please help me identify the problem.
Here's the parallel portion:
int chunk;
chunk = 10;
#pragma omp parallel private(i,j,k,xn,yn,zn)
{
    #pragma omp for schedule(static,chunk)
    for (i = 0; i < n - 2; i++)
    {
        for (j = i + 1; j < n; j++)
            for (k = i + 1; k < n; k++)
                if (j != k)
                {
                    xn = (y[j]-y[i])*(z[k]-z[i]) - (y[k]-y[i])*(z[j]-z[i]);
                    yn = (x[k]-x[i])*(z[j]-z[i]) - (x[j]-x[i])*(z[k]-z[i]);
                    zn = (x[j]-x[i])*(y[k]-y[i]) - (x[k]-x[i])*(y[j]-y[i]);
                    if (flag = (zn < 0))
                        for (m = 0; m < n; m++)
                            flag = flag && ((x[m]-x[i])*xn + (y[m]-y[i])*yn + (z[m]-z[i])*zn <= 0);
                    if (flag)
                        printf("%d\t%d\t%d\n", i, j, k);
                }
    }
}
OK, solved it.
I just had to make the remaining variables, m and flag, private too. How silly of me!
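For reference, a minimal sketch of the corrected directive (the loop body is unchanged from the code above):

// m and flag are written by every thread, so they must be private too;
// leaving them shared creates a data race (and contention that kept the
// parallel run as slow as the serial one).
#pragma omp parallel private(i,j,k,m,xn,yn,zn,flag)
{
    #pragma omp for schedule(static,chunk)
    for (i = 0; i < n - 2; i++)
    {
        /* ... body as above ... */
    }
}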
I am writing a Kalman-filter-based realtime program in C that uses a large number of realizations. To generate the realization output I have to execute an external program (groundwater simulation software) about 100 times, so I am using OpenMP with fork and execlp to parallelize this block:
#pragma omp parallel for private(temp)
for (i = 1; i <= n_real; i++) {
    /* Start ensemble model run */
    pid_t pid = fork();
    int ffork = 0;
    int status;
    if (pid == 0) {
        // Child process
        log_info("Start Dadia for Ens %d on thread %d", i, omp_get_thread_num());
        sprintf(temp, "ens_%d", i);
        chdir(temp);
        execlp("../SPRING/dadia", "dadia", "-return", "-replace_kenn", "73", "ini.csv", "CSV", NULL);
        log_info("Could not execute function dadia - exit anyway");
        ffork = 1;
        _exit(1);
    }
    else {
        // Parent process
        wait(NULL);
        if (ffork == 0) {
            log_info("DADIA run ens_%d successfully finished", i);
        }
    }
}
In general the code runs smoothly for a small number of realizations (with 6 threads). However, sometimes the code hangs in the last cycle of parallel iterations. This only happens when the number of iterations is much larger than the number of threads. I tried scheduling the for loop with different options, but it didn't solve the problem. I know that fork is not the best thing to use with OpenMP, but I am wondering why it sometimes hangs at arbitrary points.
Thanks a lot for any kind of feedback.
I tried different Ubuntu versions (including different compiler versions).
This version finally seems to run stably. Thanks for your hints:
#pragma omp parallel for private(temp) schedule(dynamic)
for (i = 1; i <= n_real; i++) {
    /* Start ensemble model run */
    /* Calling shell commands inside an OpenMP environment requires using fork(),
       otherwise the parallel process is terminated.
       Note that exec automatically ends the child process after termination.
       If Sitra fails, the child process needs to be terminated manually. */
    pid_t pid = 0;
    int status;
    pid = fork();
    if (pid == 0) {
        log_info("Start SITRA for Ens %d on thread %d", i, omp_get_thread_num());
        sprintf(temp, "ens_%d", i);
        chdir(temp);
        execl("../SPRING/sitra", "sitra", "-return", (char *) 0);
        perror("In exec(): ");
        _exit(1); /* terminate the child manually if exec fails */
    }
    if (pid > 0) {
        pid = wait(&status);
        log_info("End of sitra for ensemble %d", i);
        if (WIFEXITED(status)) {
            printf("Sitra ended with exit(%d).\n", WEXITSTATUS(status));
        }
        if (WIFSIGNALED(status)) {
            printf("Sitra ended with kill -%d.\n", WTERMSIG(status));
        }
    }
    if (pid < 0) {
        perror("In fork():");
    }
}
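One caveat worth keeping in mind: wait() reaps any child of the process, so when several OpenMP threads fork concurrently, one thread can reap a child that belongs to another thread and leave that thread blocked in wait() forever, which would explain hangs at arbitrary points. A minimal sketch of the parent branch using waitpid() instead, so each thread waits only for its own child:

#include <sys/wait.h>
#include <errno.h>

/* Parent branch: wait specifically for our own child, retrying if the
   call is interrupted by a signal. */
pid_t r;
do {
    r = waitpid(pid, &status, 0);
} while (r == -1 && errno == EINTR);
if (r == pid && WIFEXITED(status)) {
    printf("Sitra ended with exit(%d).\n", WEXITSTATUS(status));
}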
I don't see this program having any practical use, but while experimenting with C++11 concurrency and condition variables I stumbled across something I don't fully understand.
At first I assumed that using notify_one() would allow the program below to work. In actuality, the program just froze after printing one. When I switched to notify_all(), the program did what I wanted it to do (print all natural numbers in order). I am sure this question has been asked in various forms already, but my specific question is: where in the documentation did I read wrong?
I assume notify_one() should work because of the following statement:
If any threads are waiting on *this, calling notify_one unblocks one of the waiting threads.
Looking below, only one of the threads will be blocked at any given time, correct?
class natural_number_printer
{
public:
    void run()
    {
        m_odd_thread = std::thread(
            std::bind(&natural_number_printer::print_odd_natural_numbers, this));
        m_even_thread = std::thread(
            std::bind(&natural_number_printer::print_even_natural_numbers, this));
        m_odd_thread.join();
        m_even_thread.join();
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_condition;
    std::thread m_even_thread;
    std::thread m_odd_thread;

private:
    void print_odd_natural_numbers()
    {
        for (unsigned int i = 1; i < 100; ++i) {
            if (i % 2 == 1) {
                std::cout << i << " ";
                m_condition.notify_all();
            } else {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_condition.wait(lock);
            }
        }
    }

    void print_even_natural_numbers()
    {
        for (unsigned int i = 1; i < 100; ++i) {
            if (i % 2 == 0) {
                std::cout << i << " ";
                m_condition.notify_all();
            } else {
                std::unique_lock<std::mutex> lock(m_mutex);
                m_condition.wait(lock);
            }
        }
    }
};
The provided code "works" correctly and gets stuck by design. The cause is described in the documentation
The effects of notify_one()/notify_all() and wait()/wait_for()/wait_until() take place in a single total order, so it's impossible for notify_one() to, for example, be delayed and unblock a thread that started waiting just after the call to notify_one() was made.
The step-by-step logic is:
1. The print_odd_natural_numbers thread is started.
2. The print_even_natural_numbers thread is started as well.
3. The m_condition.notify_all(); line of print_even_natural_numbers is executed before the print_odd_natural_numbers thread reaches the m_condition.wait(lock); line.
4. The m_condition.wait(lock); line of print_odd_natural_numbers is executed, and the thread gets stuck.
5. The m_condition.wait(lock); line of print_even_natural_numbers is executed, and the thread gets stuck as well.
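The usual cure for this lost-wakeup pattern is to wait on a predicate over shared state instead of on the bare notification. A minimal sketch of the odd-printing thread rewritten that way (the shared counter next is my addition, not part of the original class; the even thread is symmetric with next % 2 == 0):

unsigned int next = 1; // new member: next number to print, guarded by m_mutex

void print_odd_natural_numbers()
{
    while (true) {
        std::unique_lock<std::mutex> lock(m_mutex);
        // The predicate is re-checked on every wakeup, so a notify_all()
        // that fires before this thread starts waiting cannot be lost.
        m_condition.wait(lock, [this] { return next % 2 == 1 || next >= 100; });
        if (next >= 100) break;
        std::cout << next << " ";
        ++next;
        m_condition.notify_all();
    }
}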
Here is some OpenMP 3.0 code:
#pragma omp parallel
{
    #pragma omp single nowait
    {
        while (!texture_patches.empty())
        {
            int texture_size = calculate_texture_size(texture_patches, BORDER);
            mve::ByteImage::Ptr texture = mve::ByteImage::create(texture_size, texture_size, 3);
            mve::ByteImage::Ptr validity_mask = mve::ByteImage::create(texture_size, texture_size, 1);
            //...
            //the content of texture_patches changes here
            //...
            #pragma omp task
            dilate_valid_pixel(texture, validity_mask); // has no effect on the content of texture_patches
        }
        #pragma omp taskwait
        std::cout << "done. (Took: " << timer.get_elapsed_sec() << "s)" << std::endl;
    }
}
As is known, VS2010 does not support OpenMP 3.0. Could anyone tell me how to adjust the above code so that it can be used in VS2010? Thanks!
It is not entirely clear to me how dilate_valid_pixel(texture, validity_mask) relates to the content of texture_patches, but the fact that taskwait is located outside the while loop, together with the lack of locking directives/calls, hints that the dilate_valid_pixel() tasks in no way affect the container. What one typically does in such cases is prepopulate an array of work items and then process the array with a parallel loop. Code follows:
// Create work items in serial
std::vector<WorkItem*> work_items;
while (!texture_patches.empty()) {
    /* Try to insert each of the texture patches into the texture atlas. */
    for (std::list<TexturePatch>::iterator it = texture_patches.begin(); it != texture_patches.end();) {
        TexturePatch texture_patch = *it;
        // ...
        if (bin.insert(&rect)) {
            it = texture_patches.erase(it); // Attention!!! texture_patches changes
            remaining_patches--;
        } else {
            it++;
        }
    }
    //#pragma omp task
    //dilate_valid_pixel(texture, validity_mask);
    work_items.push_back(new WorkItem(texture, validity_mask));
    groups.push_back(group);
    material_lib.add_material(group.material_name, texture);
}

// Process the work items in parallel
#pragma omp parallel for
for (int i = 0; i < (int)work_items.size(); i++)
    dilate_valid_pixel(work_items[i]->texture, work_items[i]->validity_mask);

std::cout << "done. (Took: " << timer.get_elapsed_sec() << "s)" << std::endl;
Freeing the memory for the work items is not shown. The original task code is retained in comments.
One important aspect: if dilate_valid_pixel() does not always take the same amount of time to complete, add a schedule(dynamic,1) clause to the parallel for construct and play with the chunk size (the 1) to achieve the best performance, as in the sketch below.
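A minimal sketch of that clause applied to the loop above (nothing else changes):

// Dynamic scheduling hands out one work item at a time, so a thread that
// finishes a cheap item immediately grabs the next one instead of idling.
#pragma omp parallel for schedule(dynamic, 1)
for (int i = 0; i < (int)work_items.size(); i++)
    dilate_valid_pixel(work_items[i]->texture, work_items[i]->validity_mask);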
This code is not exactly equivalent to the version using explicit tasks: here all tasks are created first and then executed, while the original code creates tasks in parallel with their execution. The code above will therefore perform somewhere between a bit worse (if task creation is very fast) and a lot worse (if task creation takes a long time) than the original code.
I populate a very large array using a stxxl::VECTOR_GENERATOR<MyData>::result::bufwriter_type (something like 100M entries), which I then need to sort in parallel.
I use the stxxl::sort(vector->begin(), vector->end(), cmp(), memoryAmount) method, which in theory should do what I need: sort the elements very efficiently.
However, during the execution of this method I noticed that only one processor is fully utilised, and all the other cores are quite idle (I suspect there is a little activity to fetch the input, but in practice they don't do anything).
This is my question: is it possible to exploit more cores during the sorting phase, or is the parallelism used only to fetch the input asynchronously? If so, are there documents that explain how to enable it? (I looked extensively at the documentation on the website, but couldn't find anything.)
Thanks very much!
EDIT
Thanks for the suggestion. I provide some more information below.
First of all, I use macOS for my experiments. I launch the following program and study its behaviour.
#include <stxxl/vector>
#include <stxxl/sort>
#include <limits>
#include <cstdlib>
#include <iostream>

typedef struct Triple {
    long t1, t2, t3;
    Triple(long s, long p, long o) {
        this->t1 = s;
        this->t2 = p;
        this->t3 = o;
    }
    Triple() {
        t1 = t2 = t3 = 0;
    }
} Triple;

const Triple minv(std::numeric_limits<long>::min(),
                  std::numeric_limits<long>::min(), std::numeric_limits<long>::min());
const Triple maxv(std::numeric_limits<long>::max(),
                  std::numeric_limits<long>::max(), std::numeric_limits<long>::max());

struct cmp: std::less<Triple> {
    bool operator ()(const Triple& a, const Triple& b) const {
        if (a.t1 < b.t1) {
            return true;
        } else if (a.t1 == b.t1) {
            if (a.t2 < b.t2) {
                return true;
            } else if (a.t2 == b.t2) {
                return a.t3 < b.t3;
            }
        }
        return false;
    }
    Triple min_value() const {
        return minv;
    }
    Triple max_value() const {
        return maxv;
    }
};

typedef stxxl::VECTOR_GENERATOR<Triple>::result vector_type;

int main(int argc, const char** argv) {
    vector_type vector;
    vector_type::bufwriter_type writer(vector);
    for (int i = 0; i < 1000000000; ++i) {
        if (i % 10000000 == 0)
            std::cout << "Inserting element " << i << std::endl;
        Triple t;
        t.t1 = rand();
        t.t2 = rand();
        t.t3 = rand();
        writer << t;
    }
    writer.finish();

    // Sort the vector
    stxxl::sort(vector.begin(), vector.end(), cmp(), 1024*1024*1024);
    std::cout << vector.size() << std::endl;
}
Indeed, there seem to be only one or at most two threads working during the execution of this program. Note that the machine has only a single disk.
Can you please confirm whether the parallelism works on macOS? If not, I will try Linux to see what happens. Or is it perhaps because there is only one disk?
In principle what you are doing should work out of the box. With everything working you should see all cores doing processing.
Since it doesn't work, we'll have to find the error, and debugging why we see no parallel speedup is still a tricky business these days.
The main idea is to go from small to large examples:
- What platform is this? There is no parallelism on MSVC, only on Linux/gcc.
- By default STXXL builds on Linux/gcc with USE_GNU_PARALLEL. You can turn it off to see if it has an effect.
- Try reproducing the example values shown in http://stxxl.sourceforge.net/tags/master/stxxl_tool.html - with and without USE_GNU_PARALLEL.
- See if plain in-memory parallel sorting scales on your processor/system (see the sketch below).
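A minimal sketch for that last check, assuming GCC's libstdc++ parallel mode (the __gnu_parallel namespace is GCC-specific and not part of STXXL; compile with -fopenmp):

#include <parallel/algorithm> // GCC libstdc++ parallel mode
#include <vector>
#include <cstdlib>

int main() {
    // Sort 100M longs purely in memory; watch whether all cores load up.
    std::vector<long> v(100000000);
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = rand();
    __gnu_parallel::sort(v.begin(), v.end());
    return 0;
}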
I am trying to instrument Java synchronized blocks using ASM. The problem is that after instrumenting, the execution of the synchronized block takes more time: it increases from 2 ms to 200 ms on a Linux box.
I am implementing this by identifying the MonitorEnter and MonitorExit opcodes.
I try to instrument at three points: 1. just before the MonitorEnter, 2. after the MonitorEnter, 3. before the MonitorExit.
1 and 3 together work fine, but when I do 2, the execution time increases dramatically.
Even if I instrument just another single SOP (System.out.println) statement, which is intended to be executed only once, it gives higher values.
Here is the sample code (prime numbers, 10 loops):
for (int w = 0; w < 10; w++) {
    synchronized (s) {
        long t1 = System.currentTimeMillis();
        long num = 2000;
        for (long i = 1; i < num; i++) {
            long p = i;
            int j;
            for (j = 2; j < p; j++) {
                long n = p % i;
            }
        }
        long t2 = System.currentTimeMillis();
        System.out.println("Time>>>>>>>>>>>> " + (t2 - t1));
    }
}
Here is the code for the instrumentation (here System.currentTimeMillis() gives the time at which instrumentation happened; it is not the measure of execution time, the execution time comes from the SOP statement above):
public void visitInsn(int opcode)
{
    switch (opcode)
    {
    // Scenario 1: just before MONITORENTER (opcode 194)
    case Opcodes.MONITORENTER:
        visitFieldInsn(Opcodes.GETSTATIC, "java/lang/System", "out", "Ljava/io/PrintStream;");
        visitLdcInsn("TIME Arrive: " + System.currentTimeMillis());
        visitMethodInsn(Opcodes.INVOKEVIRTUAL, "java/io/PrintStream", "println", "(Ljava/lang/String;)V");
        break;
    // Scenario 3: before MONITOREXIT (opcode 195)
    case Opcodes.MONITOREXIT:
        visitFieldInsn(Opcodes.GETSTATIC, "java/lang/System", "out", "Ljava/io/PrintStream;");
        visitLdcInsn("TIME exit : " + System.currentTimeMillis());
        visitMethodInsn(Opcodes.INVOKEVIRTUAL, "java/io/PrintStream", "println", "(Ljava/lang/String;)V");
        break;
    }
    super.visitInsn(opcode);
    // Scenario 2: after MONITORENTER
    if (opcode == Opcodes.MONITORENTER)
    {
        visitFieldInsn(Opcodes.GETSTATIC, "java/lang/System", "out", "Ljava/io/PrintStream;");
        visitLdcInsn("TIME enter: " + System.currentTimeMillis());
        visitMethodInsn(Opcodes.INVOKEVIRTUAL, "java/io/PrintStream", "println", "(Ljava/lang/String;)V");
    }
}
I am not able to find the reason why this is happening or how to correct it.
Thanks in advance.
The reason lies in the internals of the JVM you were using to run the code. I assume that this was a HotSpot JVM, but the following points are equally valid for most other implementations.
If you trigger the following code:
int result = 0;
for(int i = 0; i < 1000; i++) {
result += i;
}
This will be translated directly into Java byte code by the Java compiler, but at run time the JVM will easily see that this code is not doing anything. Executing it has no effect on the outside (application) world, so why should the JVM execute it? This consideration is exactly what compiler optimization does for you.
If, however, you trigger the following code:
int result = 0;
for(int i = 0; i < 1000; i++) {
System.out.println(result);
}
the Java runtime can no longer optimize away your code. The whole loop must always run, since the System.out.println(int) method always does something observable, so your code will run slower.
Now let's look at your example. In your first example, you basically write this code:
synchronized(s) {
// do nothing useful
}
This entire code block can easily be removed by the Java runtime. This means: there will be no synchronization! In the second example, you write this instead:
synchronized(s) {
long t1 = System.currentTimeMillis();
// do nothing useful
long t2 = System.currentTimeMillis();
System.out.println("Time>>>>>>>>>>>> " + (t2-t1));
}
This means that the effective code might look like this:
synchronized(s) {
long t1 = System.currentTimeMillis();
long t2 = System.currentTimeMillis();
System.out.println("Time>>>>>>>>>>>> " + (t2-t1));
}
What is important here is that this optimized code is still effectively synchronized, which is an important difference with respect to execution time. Basically, you are measuring the time it costs to synchronize something (and even that might be optimized away after a couple of runs if the JVM realizes that s is not locked elsewhere in your code; buzzword: temporary optimization with the possibility of deoptimization if code loaded in the future also synchronizes on s).
You should really read this:
http://www.ibm.com/developerworks/java/library/j-jtp02225/
http://www.ibm.com/developerworks/library/j-jtp12214/
Your test, for example, lacks a warm-up phase, so you are also measuring how much time the JVM spends on byte-code-to-machine-code optimization.
On a side note: synchronizing on a String is almost always a bad idea. Your strings might or might not be interned, which means you cannot be absolutely sure about their identity. This means that synchronization might or might not work, and you might even inflict synchronization on other parts of your code.