I'm debugging an OpenMP program. Its behavior is strange.
1) If a simple program P (a while(1) loop) occupies one core at 100%, the OpenMP program pauses even though it occupies all the remaining cores. Once I terminate program P, the OpenMP program continues to execute.
2) The OpenMP program executes successfully in situation 1 if I set OMP_NUM_THREADS to 32/16/8.
I tested on both an 8-core x64 machine and a 32-core Itanium machine. The former uses GCC and libgomp; the latter uses the proprietary aCC compiler and libraries. So it is unlikely to be related to the compiler/library.
Could you point out any possible reasons for this behavior? Why can the OpenMP program be affected by another program?
Thanks.
I am afraid that you need to give more information.
What is the OS you are running on?
When you run using 16 threads, are you doing this on the 8-core or the 32-core machine?
What is the simple while(1) program doing in its loop?
What is the OpenMP program doing (in general terms - if you can't be specific)?
Have you tried using a profiling tool to see what the OpenMP program is doing?
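In the meantime, a quick diagnostic you could run on both machines is a minimal sketch like the one below (purely illustrative, not your program) that prints what the OpenMP runtime actually sees; comparing its output with your OMP_NUM_THREADS setting and the core count may help narrow things down:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        /* Logical processors visible to the OpenMP runtime. */
        printf("omp_get_num_procs()   = %d\n", omp_get_num_procs());
        /* Upper bound on the team size; reflects OMP_NUM_THREADS if it is set. */
        printf("omp_get_max_threads() = %d\n", omp_get_max_threads());

        #pragma omp parallel
        {
            #pragma omp single
            printf("actual team size      = %d\n", omp_get_num_threads());
        }
        return 0;
    }

Compile with gcc -fopenmp (or the equivalent aCC option) and run it both with and without program P busy-waiting on a core.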
I'm reading an article about CUDA and it says "A CUDA program is a serial program with parallel kernels". My questions are:
What does it mean for it to be a serial program? I know that serial is the opposite of parallel, but what does that mean in terms of CUDA's code being run on different processors, different cores, etc? I know the point of CUDA is that it facilitates parallel programming, so I'm interested to know which part of it is serial.
What does it mean to have multiple kernels? I've always understood the kernel to be a part of the operating system, and I think CUDA is just software that runs within the operating system, right? How does CUDA have multiple kernels and how does it use them to achieve parallelism?
A CUDA kernel is written from the standpoint of a single thread: it answers the question "what will each thread do?" and gives a single definition of that work for every thread. From the standpoint of a single thread it appears to be a serial program. However, it becomes parallel at launch time, when many threads execute the same code, "in parallel".
I think you're misinterpreting. "A CUDA program has parallel kernels" means that each kernel has the opportunity to express parallelism (according to how it is written, using CUDA concepts such as the built-in index variables) and to manifest it (at launch time, across many threads of execution). It does not mean that CUDA requires multiple kernels to express parallelism: a single CUDA kernel launch is inherently parallel.
You may wish to read the CUDA programming guide.
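To make the "serial program with parallel kernels" idea concrete, here is a minimal vector-add sketch (my own illustration, not from the article): the host code in main runs serially, while the single kernel definition is executed by many threads at launch time, each picking its own element via the built-in index variables.

    #include <cstdio>

    // One kernel definition, written from the standpoint of a single thread.
    __global__ void add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
        if (i < n)                                      // guard threads past the end
            c[i] = a[i] + b[i];
    }

    int main()
    {
        const int n = 1 << 20;
        float *a, *b, *c;
        // Unified (managed) memory keeps the serial host code short.
        cudaMallocManaged(&a, n * sizeof(float));
        cudaMallocManaged(&b, n * sizeof(float));
        cudaMallocManaged(&c, n * sizeof(float));
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        // Serial host code launches the parallel kernel: many threads, one definition.
        add<<<(n + 255) / 256, 256>>>(a, b, c, n);
        cudaDeviceSynchronize();

        printf("c[0] = %f\n", c[0]);
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }

Everything in main is the "serial program"; the parallelism lives entirely in how the one kernel is launched.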
First of all, here's just something I'm curious about.
I've made a little program that fills some templates with values, and I noticed that every time I run it the execution time changes a little bit, ranging from 0.550 s to 0.600 s. My CPU runs at 2.9 GHz, if that is useful.
The instructions are always the same; is it something to do with physics, or something more software-oriented?
It has to do with Java running on a virtual machine; even a C program might run slightly longer or shorter on different runs. The operating system also decides when a program gets resources (CPU time, memory, …) to execute.
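To see the same effect outside the JVM, a small C sketch like this (just an illustration; the workload is arbitrary) usually reports slightly different wall-clock times on every run, even though the instructions are identical:

    #include <stdio.h>
    #include <time.h>

    /* A fixed, deterministic workload. */
    static double work(void)
    {
        volatile double acc = 0.0;
        for (long i = 0; i < 50000000L; ++i)
            acc += (double)i * 1e-9;
        return acc;
    }

    int main(void)
    {
        for (int run = 0; run < 5; ++run) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            work();
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            /* Scheduling, cache state and frequency scaling make this vary a little. */
            printf("run %d: %.3f s\n", run, s);
        }
        return 0;
    }

(On older glibc you may need to link with -lrt for clock_gettime.)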
I am running 60 MPI processes with MKL_NUM_THREADS set to 4 to get to the full 240 hardware threads on the Xeon Phi. My code runs, but I want to make sure that MKL is actually using 4 threads. What is the best way to check this with the limited Xeon Phi Linux kernel?
You can set MKL_NUM_THREADS to 4 if you like. However, using every single thread does not necessarily give the best performance. In some cases the MKL library knows things about the algorithm that mean fewer threads is better; in those cases the library routines can choose to use fewer threads.
You should only use 60 MPI ranks if you have 61 cores. If you are going to use that many MPI ranks, you will want to set the I_MPI_PIN_DOMAIN environment variable to "core". Remember to leave one core free for the OS and system-level processes. This will put one rank per core on the coprocessor and allow all the OpenMP threads of each MPI process to reside on the same core, giving you better cache behavior. If you do this, you can also use micsmc in GUI mode on the host processor to continuously monitor the activity on all the cores. With one MPI rank per core, you can see how much of the time all the threads on a core are being used.
Set MKL_NUM_THREADS to 4. You can use the environment variable or a runtime call. This value will be respected, so there is nothing to check.
The Linux kernel on KNC is not stripped down, so I don't know why you think that's a limitation. You should not need any system calls for this anyway, though.
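If you still want to confirm it programmatically, a minimal sketch like the following (assuming you can build a small test against MKL for the coprocessor) reports the number of threads MKL will use; setting MKL_VERBOSE=1, if your MKL version supports it, also prints per-call threading information.

    #include <stdio.h>
    #include <mkl.h>

    int main(void)
    {
        /* Number of threads MKL will try to use for its threaded routines;
           reflects MKL_NUM_THREADS / mkl_set_num_threads(). */
        printf("MKL max threads: %d\n", mkl_get_max_threads());
        return 0;
    }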
I have a C program that I am compiling with mingw, but it runs on only one core of my 8-core machine. How do I compile it to run on multiple cores?
(To clarify: I am not looking to use multiple cores to compile, as compilation time is low. It's runtime where I want to use my full CPU capacity.)
There is no other way but to write a multithreaded program. You first need to work out how to split your task into independent parts that can then be run in threads simultaneously.
It cannot be fully automated. You might consider the threading support added in the C11 standard (<threads.h>), or take a look at pthreads or OpenMP.
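For example, a minimal OpenMP sketch (the loop body is only a placeholder for your real work) that spreads independent iterations across cores could look like this; MinGW-w64 builds of GCC generally accept -fopenmp, assuming libgomp is included:

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        const long n = 100000000L;
        double sum = 0.0;

        /* Iterations are independent, so OpenMP can split them across threads;
           the reduction combines the per-thread partial sums at the end. */
        #pragma omp parallel for reduction(+:sum)
        for (long i = 1; i <= n; ++i)
            sum += 1.0 / (double)i;

        printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
        return 0;
    }

Compile with gcc -fopenmp and you should see all cores being used while it runs.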
We're moving from one local computational server with 2*Xeon X5650 to another one with 2*Opteron 4280... Today I tried to launch my wonderful C programs on the new (AMD) machine and discovered a performance drop of more than 50%, keeping all possible parameters the same (even the seed for the random number generator). I started digging into this problem: googling "amd opteron 4200 compiler options" gave me a couple of suggestions, i.e. flags (options) for the GCC 4.6.3 compiler available to me. I played with these flags and summarized my findings in the plots below...
I'm not allowed to upload pictures, so the charts are here https://plus.google.com/117744944962358260676/posts/EY6djhKK9ab
I'm wondering if anyone (coding folks) could give me any comments on the subject. In particular, I'm interested in why "... -march=bdver1 -fprefetch-loop-arrays" and "... -fprefetch-loop-arrays -march=bdver1" yield different runtimes.
I'm also not sure whether, say, "-funroll-all-loops" is already included in "-O3" or "-Ofast"; if it is, why does adding this flag one more time make any difference at all?
Why do any additional flags make the performance on the Intel processor even worse (except "-ffast-math", which is kind of obvious, because it enables less precise and by definition faster floating-point arithmetic, as I understand it)?
A bit more detail about the machines and my program:
The 2*Xeon X5650 machine is an Ubuntu server with GCC 4.4.3. It is a 2 (CPUs on the motherboard) × 6 (real cores each) × 2 (Hyper-Threading) = 24-thread machine, and there was something else running on it during my "experiments" or benchmarks...
The 2*Opteron 4280 machine is an Ubuntu server with GCC 4.6.3. It is a 2 (CPUs on the motherboard) × 4 (Bulldozer modules each, i.e. the "real cores") × 2 (AMD's Bulldozer flavour of threading, each roughly a core) = 16-thread machine, and I was using it solely for my wonderful "benchmarks"...
My benchmarking program is just a Monte Carlo simulation: it does some I/O at the beginning and then ~10^5 Monte Carlo loops to produce the result. So I assume it mixes integer and floating-point calculations, looping every now and then and checking whether the randomly generated "result" is "good" enough for me or not... The program is single-threaded, and I launched it with the very same parameters for every benchmark (obvious, but I should mention it anyway), including the random generator seed (so the results were 100% identical)... The program IS NOT memory intensive. The reported runtime is just the "user" time from the standard /usr/bin/time command.