I am currently trying to parallelize some multigrid code written in Fortran using OpenMP, and I have found that the OpenMP scheduling clauses have a huge impact on performance. Recall that the OpenMP schedule kinds are static, dynamic, guided, and runtime (plus auto since OpenMP 3.0), and they determine how the iterations of a loop are divided between threads. For example, an OpenMP-parallelized SAXPY loop with a scheduling clause would look like the following:
!$OMP Parallel Do Schedule(Static)
Do i=1,n
z(i)=a*x(i)+y(i)
End Do
!$OMP End Parallel Do
Now imagine that we have many parallelized loops in a piece of code, and have no way of determining a priori which of these scheduling clauses will get the program running the fastest. Changing each scheduling clause by hand would be a pain in the ass, so here's what I thought I would do:
Character(Len=10)::sched="Dynamic"
!$OMP Parallel Do Schedule(sched)
Do i=1,n
z(i)=a*x(i)+y(i)
End Do
!$OMP End Parallel Do
and then I could simply put that character variable 'sched' in every parallelized loop and change them all at once by, say, setting sched="Static", and then do a runtime test to see which one is fastest! Of course, it doesn't work, at least not with gfortran or the Absoft compiler. So my question is any or all of the following: Why doesn't this work? How can I get it to work? Or how can I avoid using this construct to solve this problem? Any help is greatly appreciated.
This won't work, because the schedule kinds are not character strings but keywords that the compiler parses when it processes the directive, so no variable evaluation takes place at that point. The best thing I can think of is using a preprocessor such as CoCo or the C preprocessor to achieve exactly this.
Alternatively, you could use schedule(runtime) and then select the actual schedule with either the OMP_SCHEDULE environment variable or the omp_set_schedule routine.
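For example, a minimal sketch of that approach (the SAXPY loop is reused from the question; the dynamic kind and the chunk size of 64 are just illustrative choices):

Program saxpy_runtime
   Use omp_lib
   Implicit None
   Integer, Parameter :: n = 1000000
   Real :: a, x(n), y(n), z(n)
   Integer :: i
   a = 2.0; x = 1.0; y = 3.0
   ! Equivalent to setting the environment variable OMP_SCHEDULE="dynamic,64"
   Call omp_set_schedule(omp_sched_dynamic, 64)
   !$OMP Parallel Do Schedule(Runtime)
   Do i = 1, n
      z(i) = a*x(i) + y(i)
   End Do
   !$OMP End Parallel Do
   Print *, z(1)
End Program saxpy_runtime

Every loop declared with Schedule(Runtime) picks up whatever schedule is currently in effect, so a single call to omp_set_schedule (or a single OMP_SCHEDULE setting) switches all of them at once.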
The scheduling clause that you specify has a real effect on how the loop is compiled to machine code. Once the code is compiled with a specific schedule kind, that choice is fixed and cannot be changed at runtime. I agree with haraldkl: use a preprocessor.
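If you do go the preprocessor route, a hypothetical sketch with the C preprocessor (the macro name SCHED is made up here; the file needs preprocessing, e.g. a .F90 extension or gfortran -cpp) could look like this:

#define SCHED Dynamic

!$OMP Parallel Do Schedule(SCHED)
Do i = 1, n
   z(i) = a*x(i) + y(i)
End Do
!$OMP End Parallel Do

Changing the single #define then switches every loop that uses the macro at once.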
"taskloop" is introduced in OpenMP 4.5. It can take clauses from both loop and task constructs (except depend clause AFAIK).
However, I'm wondering whether the "taskloop" and "omp for" constructs differ performance-wise too.
I think it may depend on the actual problem. To parallelize a for loop, omp for can be faster than tasks, because it offers several different scheduling schemes for your needs. In my experience (solving a particular problem with the clang 12 compiler), omp for produces slightly faster code than tasks (on a Ryzen 5 7800X).
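In Fortran terms (where omp for is spelled omp do), the two variants of a simple loop would look roughly like this; the grainsize of 1024 is just an illustrative choice:

! Worksharing loop: iterations are distributed according to the schedule clause
!$OMP Parallel Do Schedule(Static)
Do i = 1, n
   z(i) = a*x(i) + y(i)
End Do
!$OMP End Parallel Do

! Taskloop: one thread creates the tasks, the whole team executes them
!$OMP Parallel
!$OMP Single
!$OMP Taskloop Grainsize(1024)
Do i = 1, n
   z(i) = a*x(i) + y(i)
End Do
!$OMP End Taskloop
!$OMP End Single
!$OMP End Parallel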
I have just started to learn parallel programming with OpenMP, using the OpenMP tutorial by Blaise Barney at Lawrence Livermore National Laboratory. There, in many places it is specified that it is illegal to branch into or out of a parallel region, but I do not have even the slightest clue why.
If someone can explain why that is so, it will be really helpful to be comfortable with OpenMP. Thanks!
A parallel region will require some set-up and take-down to operate correctly. For example, entering the region may require spawning threads, while exiting may require synchronization. The compiler generates the material "in the middle" of the parallel region with the assumption that this set-up and take-down have occurred.
If you were to branch into a parallel region, then you've skipped the set-up and it's hard to say what would actually happen. I.e., where would the threads be? Would you even be in the function call that, e.g., pthread was supposed to invoke for you?
And if you were to branch out, would you even be in the non-parallel section of your code? What if all the threads were to execute this section? What about race conditions?
So because the compiler must make assumptions of your behavior to generate parallel code correctly, you would do well to honor those assumptions.
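To make this concrete, here is a hypothetical Fortran fragment of the kind the standard forbids: the GOTO jumps out of the structured block, skipping the implicit barrier and thread take-down, so compilers reject it.

!$OMP Parallel Do
Do i = 1, n
   If (z(i) < 0.0) Goto 100   ! illegal: branches out of the parallel region
   z(i) = Sqrt(z(i))
End Do
!$OMP End Parallel Do
100 Continue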
I have a program with more than 100 subroutines, and I am trying to make this code run faster by compiling these subroutines with the parallel flag. I was wondering what variables or parameters I need to define in the program if I want to use the parallel flag. Just using the parallel optimization flag increased the run time of my program compared to the version without it.
Any suggestions are highly appreciated. Thanks a lot.
Best Regards,
Jdbaba
I can give you some general guidelines, but without knowing your specific compiler and platform/OS I won't be able to help you specifically. As far as I know, all of the auto-parallelization schemes used in Fortran compilers end up using either OpenMP or MPI under the hood to split loops out into threads or processes. The issue is that there is a certain amount of overhead associated with those schemes. For instance, in one case I had a program that used an optimization library which a vendor supplied as a compiled library, with no way to optimize inside it. Since all of my subroutines and functions were either entirely outside or entirely inside the optimizer's main loop, and since only object code was available for the library, the auto-parallelizer was unable to perform interprocedural optimization (IPO) and so it failed to use more than one core. In that case, because of the OpenMP runtime DLL that got loaded, /Qparallel actually added roughly 10% to the run time.
As a note, auto-parallelizers aren't magic. Essentially they do the same kind of thing as auto-vectorization: they look for loops whose iterations do not depend on data from previous iterations. If the compiler detects that values are carried between iterations, or if it simply can't tell, it will not attempt to parallelize the loop.
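As an illustration (with made-up arrays a and b and bound n), the first loop below has independent iterations and is a candidate for auto-parallelization, while the second carries a dependence from one iteration to the next and will be left serial:

Do i = 1, n
   a(i) = 2.0*b(i)        ! each iteration is independent
End Do

Do i = 2, n
   a(i) = a(i-1) + b(i)   ! a(i) needs a(i-1): loop-carried dependence
End Do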
If you are using the Intel Fortran compiler, you can turn on a diagnostic switch ("/Qpar-report3" on Windows or "-par-report3" on Linux) to get a report on the dependency analysis of each loop and see why it failed to parallelize. If you don't have access to large sections of the code you are using, in particular the parts with the major loops, there is a good chance there won't be much opportunity for the auto-parallelizer in your code.
In any case, you can always attempt to reduce dependencies and reformulate your code such that it is more friendly to autoparallelization.
As is so common with Fortran, I'm writing a massively parallel scientific code. At the beginning of my code I read a configuration file which tells me which type of solver I want to use. That means that in a subroutine (during the main run) I have
if (solver.eq.1) then
   call solver1()
elseif (solver.eq.2) then
   call solver2()
else
   call solver3()
endif
Edit to avoid some confusion: This if is inside my time integration loop and I have one that is inside 3 nested loops.
Now my question is: wouldn't it be more efficient to use function pointers instead, since the solver variable will not change during execution except in the initialisation procedure?
Obviously function pointers are F2003. That shouldn't be a problem as long as I use gfortran 4.6. But I'm mainly running on a BlueGene/P; there is an F2003 compiler there, so I suppose it will work there as well, although I couldn't find any conclusive evidence on the web.
Knowing nothing about Fortran, here is my answer: the main problem with branches is that a CPU potentially cannot speculatively execute code across them. To mitigate this problem, branch prediction was introduced (and it is very sophisticated in modern CPUs).
Indirect calls through a function pointer can be a problem for the prediction unit of the CPU. If it can't predict where the call will actually go, this will stall the pipeline.
I am quite sure that the CPU will correctly predict that your branch will always be taken or not taken because it is a trivial case of prediction.
Maybe the CPU can speculate across the indirect call, maybe it can't. This is why you need to test which is better.
If it cannot, you will certainly notice in your benchmark.
In addition, maybe you can hoist the if test out of your inner loop so it isn't evaluated as often. That would make the actual performance of the branch irrelevant.
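For example, a sketch of hoisting it out of the time integration loop (step and nsteps are hypothetical names for that loop; the solver routines are from the question):

Select Case (solver)
Case (1)
   Do step = 1, nsteps
      Call solver1()
   End Do
Case (2)
   Do step = 1, nsteps
      Call solver2()
   End Do
Case Default
   Do step = 1, nsteps
      Call solver3()
   End Do
End Select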
If you only plan to set the function pointers once, at initialisation, and you are running code on a BlueGene, isn't your concern about efficiency misdirected? Generally, any initialisation that works is OK; if it takes 1 s instead of 1 ms it's probably going to have zero impact on total execution time.
Code initialisation routines for clarity, ease of modification, that sort of thing.
EDIT
My guess is that using function pointers rather than your current code will have no impact on execution speed. But it's just a guess (an educated one, perhaps), and I'll be very interested in any data you gather on this question.
If your solver routines take a non-trivial time to run, then the trivial runtime of the IF statements is likely to be immaterial. If the solver routines have a runtime comparable to that of the IF statement, then the total runtime is very short anyway, so why do you care? This seems an optimization unlikely to pay off.
The first rule of runtime optimization is to profile your code is see what portions are consuming the runtime. Otherwise you are likely to optimize portions that are unimportant, which will accomplish nothing.
For what it's worth, someone else recently had a very similar concern: Fortran Subroutine Pointers for Mismatching Array Dimensions
After a brief search I couldn't find the answer to the question, so I ran a little benchmark myself (see this link for the Makefile & dependencies). The benchmark consists of:
Draw a random number to select method a, b, or c, each of which performs a simple addition on its single integer argument
Call the chosen method 100 million times, using either a procedure pointer or if-statements
Repeat the above 5 times
The result with gfortran 4.8.5 on an Intel Xeon E5-2630 v3 CPU @ 2.40 GHz is:
Time per call (proc. pointer): 1.89 ns
Time per call (if statement): 1.89 ns
In other words, there is not much of a performance difference!
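For completeness, the procedure-pointer dispatch itself looks roughly like this (a sketch with placeholder names; the stub routines stand in for the real solvers from the question):

Module solver_dispatch
   Implicit None
   Abstract Interface
      Subroutine solver_iface()
      End Subroutine solver_iface
   End Interface
   Procedure(solver_iface), Pointer :: solve => Null()
Contains
   Subroutine init_solver(solver)
      Integer, Intent(In) :: solver
      ! Chosen once at initialisation; afterwards the hot loop only does "Call solve()"
      Select Case (solver)
      Case (1)
         solve => solver1
      Case (2)
         solve => solver2
      Case Default
         solve => solver3
      End Select
   End Subroutine init_solver

   ! Stand-ins for the real solver routines
   Subroutine solver1()
      Print *, 'solver 1'
   End Subroutine solver1
   Subroutine solver2()
      Print *, 'solver 2'
   End Subroutine solver2
   Subroutine solver3()
      Print *, 'solver 3'
   End Subroutine solver3
End Module solver_dispatch

Program test_dispatch
   Use solver_dispatch
   Implicit None
   Integer :: step
   Call init_solver(2)
   Do step = 1, 3
      Call solve()   ! the indirect call replaces the if/elseif chain
   End Do
End Program test_dispatch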
Do for loops in Verilog execute in parallel? I need to call a module several times, but they have to execute at the same time. Instead of writing them out one by one, I was thinking of using a for loop. Will it work the same?
Verilog describes hardware, so it doesn't make sense to think in terms of executing loops or calling modules in this context. If I understand the intent of your question correctly, you'd like to have multiple instantiations of the same module with distinct inputs and outputs.
To accomplish this you can use Verilog's generate statements to generate the instantiations automatically.
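For example, a minimal generate sketch (the module and signal names here are made up for illustration; my_adder stands in for the module you want to instantiate) that creates four instances of the same module:

module adder_bank (
  input  wire [3:0] a,
  input  wire [3:0] b,
  output wire [3:0] sum
);
  genvar i;
  generate
    // One instance of my_adder per bit; each block gets a unique name adder_gen[i]
    for (i = 0; i < 4; i = i + 1) begin : adder_gen
      my_adder u_add (
        .a   (a[i]),
        .b   (b[i]),
        .sum (sum[i])
      );
    end
  endgenerate
endmodule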
You can also use the auto_template functionality in Emacs' excellent verilog-mode. I prefer this approach as each instantiation appears explicitly in my source code and I find it easier to detect errors.
As jlf answered, you're looking for a generate statement. You would use a for-loop to model combinational logic, such as going through all of the bits in a register and computing an output. This would be in an always block or even an initial block in your testbench.