OpenMP #pragma num_threads

I am trying to understand OpenMP a bit, but I am confused about how I am allowed to pass variables to num_threads() when it is part of a #pragma, which is a place to give information to the compiler. I was expecting that num_threads would not accept a variable as its parameter, but it looks like I am allowed to use one. How is that? How does it work?

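The compiler converts the pragma into a call to the OpenMP runtime, which is why a variable is allowed here. The num_threads expression is evaluated at run time, when the parallel region is encountered, and its value is passed to the runtime.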
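To make that concrete, here is a minimal sketch of a parallel region whose thread count comes from an ordinary variable. The function name run_with_threads is made up for illustration, and the _OPENMP guards are only there so the code also builds when OpenMP is switched off (in which case it reports a team of one).

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: runs a parallel region with a thread count chosen at
// run time and returns how many threads actually took part.
int run_with_threads(int requested) {
    assert(requested > 0);
    int team_size = 1;  // value reported when compiled without OpenMP

    // `requested` is a normal variable. The clause expression is evaluated
    // when the region is entered and handed to the OpenMP runtime.
    #pragma omp parallel num_threads(requested)
    {
#ifdef _OPENMP
        // One thread records the team size; the others wait at the
        // implicit barrier that ends the single construct.
        #pragma omp single
        team_size = omp_get_num_threads();
#endif
    }
    return team_size;
}
```

The runtime is allowed to give you fewer threads than you asked for, so the return value is at most `requested`, not necessarily equal to it.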

Don't hard-wire thread counts into your code :-). Leave it to the runtime to do the right thing, or set it from the environment (use OMP_NUM_THREADS, or KMP_HW_SUBSET with the LLVM or Intel compilers).
Certainly never put in num_threads(constant).
Or, at least, consider some questions before you do...
How can you choose the right number of threads?
If you answer: "It's the number of cores in my machine", OK, next questions:
Will you always use that machine, and only that machine?
Is no-one else ever going to use this code?
If you answer: "It's the number that performs best", next question:
As above: will that be true on all machines you ever use? On machines other people use to run your code?
Answering like that implies you have run scaling studies, which clearly required changing the number of threads. If you hard-wire this (with a constant), how can you ever run them again without editing and recompiling?
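As a rough sketch of the "leave it to the runtime" approach: the region below has no num_threads clause, so the team size follows whatever the environment says (for example OMP_NUM_THREADS=4 ./a.out). The function names are made up for illustration, and the _OPENMP guards only keep the code buildable without OpenMP.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Upper bound on the team size the runtime would use for a parallel region
// that has no num_threads clause (reflects OMP_NUM_THREADS when it is set).
int runtime_thread_limit() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Enters a parallel region without naming a thread count and reports the
// team size the runtime picked.
int observed_team_size() {
    int team_size = 1;
    #pragma omp parallel
    {
#ifdef _OPENMP
        #pragma omp single
        team_size = omp_get_num_threads();
#endif
    }
    assert(team_size >= 1);
    return team_size;
}
```

With this shape a scaling study is just a loop over OMP_NUM_THREADS values in the shell, with no editing or recompiling.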

Related

Best practices to determine stack usage in Ravenscar program

I am writing an Ada program using the Ravenscar subset (thus, I am aware of the number of running tasks at execution time). The code is compiled by gcc with the -fstack-check switch enabled. This should cause the program to raise a STORAGE_ERROR at runtime if any of my tasks exceeds its stack.
Ada allows you to set the upper limit for those (task-specific) stacks in the specification of the respective task, like so:
pragma Storage_Size (Some_Value);
Now I was wondering what options I have to determine Some_Value. What I have heard of so far:
Do wild guesses until no STORAGE_ERROR is raised anymore. This is more or less what the OP suggests here.
Feed the output of -fstack-usage in there.
Use some GNAT-specific extensions as outlined here (how does this technically differ from item #2?).
Get a stack analyzer like gnatstack and let it do the work for you.
If I understand this correctly all the above techniques are dynamic (i.e. they require the program to run in order to work). Are static approaches also conceivable? E.g. by restricting myself further through some of Ada's high integrity options (such as No_Recursion, what else?).
Perhaps any of you can name some best practices to tackle this problem and/or extend/comment on my (surely incomplete) list.
Bonus question: What is the default size of a task's stack when the above pragma is not specified? GCC's docs only state this value depends on the runtime, without giving any concrete numbers.
You can generally check the stack space required by individual types with the 'Size attribute (which counts in bits).
Once you have tabulated this (you may need to round it up to whole words/double words), you can add up how much stack space is used by each declarative region, and then walk through your calls to find the maximum stack usage.
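The "add up and walk through your calls" step can be sketched as a small worst-case search over a call graph. This is written in C++ purely as an illustration, not as Ada tooling: the Subprogram record, the frame sizes, and the function name are all hypothetical, and the sketch assumes the call graph has no cycles (which is what a restriction such as No_Recursion would give you).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical record for one subprogram in the tabulation.
struct Subprogram {
    std::size_t frame_bytes;           // stack used by its own declarative region
    std::vector<std::string> callees;  // subprograms it may call
};

using CallGraph = std::map<std::string, Subprogram>;

// Worst-case stack use when execution starts at `name`: the subprogram's own
// frame plus the deepest chain among its callees.
// Assumes an acyclic call graph; a cycle would recurse forever.
std::size_t worst_case_stack(const CallGraph& graph, const std::string& name) {
    assert(graph.count(name) == 1 && "unknown subprogram");
    const Subprogram& sp = graph.at(name);
    std::size_t deepest_callee = 0;
    for (const std::string& callee : sp.callees) {
        deepest_callee = std::max(deepest_callee, worst_case_stack(graph, callee));
    }
    return sp.frame_bytes + deepest_callee;
}
```

Calling worst_case_stack on a task body gives a lower bound for that task's Storage_Size; you would still want some margin on top for things the tabulation does not capture, such as runtime overhead.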

ATmegaXXX V, P, does it matter for compilation?

I used an ATmega649 before but then switched to an ATmega649V.
Does it matter which MCU version is given to the compiler: ATmega649, ATmega649V or ATmega649P?
As I understand it, the architecture is exactly the same, and the only difference is some power saving that is achieved without changing the architecture. Is that right?
Using avr-gcc.
Well, you can use an "almost" compatible architecture with no harm, but you have to triple-check the datasheet to make sure there is no difference in the way the registers are set up; otherwise your program won't work or, worse, will work until some feature fails. It is usually a source of frustration when you have forgotten that you are using a close-enough architecture, but not exactly the one you are targeting.
I don't know the ATmega649X well enough, and I won't read the lengthy datasheets thoroughly to find those differences. So if you decide to do it, be careful, and don't forget about that!
Usually the additional letters indicate differences in maximum speed, supply voltage ratings, or power consumption. The core itself is compatible, so if the numbers match, there is no difference from the compiler's point of view.
However, the flash tool may recognize them as different parts and require the correct settings.

Why is it illegal to branch into or out of a parallel region?

I have just started to learn parallel programming with OpenMP, using the OpenMP tutorial by Blaise Barney at Lawrence Livermore National Laboratory. In many places it says that it is illegal to branch into or out of a parallel region, but I do not have even the slightest clue why.
If someone can explain why that is so, it will be really helpful to be comfortable with OpenMP. Thanks!
A parallel region will require some set-up and take-down to operate correctly. For example, entering the region may require spawning threads, while exiting may require synchronization. The compiler generates the material "in the middle" of the parallel region with the assumption that this set-up and take-down have occurred.
If you were to branch into a parallel region, then you've skipped the set-up and it's hard to say what would actually happen. I.e., where would the threads be? Would you even be in the function call that, e.g., pthread was supposed to invoke for you?
And if you were to branch out, would you even be in the non-parallel section of your code? What if all the threads were to execute this section? What about race conditions?
So because the compiler must make assumptions of your behavior to generate parallel code correctly, you would do well to honor those assumptions.
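A common case where people want to branch out is an early exit from a search. Here is a minimal sketch of the usual workaround, assuming a made-up function named contains: instead of a `return` or `break` inside the region, each thread records the result in a shared flag, and the function returns only after the region has finished normally.

```cpp
#include <cassert>
#include <vector>

// Returns true if `target` occurs in `data`.
bool contains(const std::vector<int>& data, int target) {
    bool found = false;
    const long n = static_cast<long>(data.size());

    #pragma omp parallel for shared(found)
    for (long i = 0; i < n; ++i) {
        // A `return true;` or `break;` here would branch out of the
        // region, which is not allowed. Record the hit instead.
        if (data[i] == target) {
            #pragma omp atomic write
            found = true;
        }
    }
    // Reached by the original thread only, after the implicit barrier
    // and the region's take-down have run.
    return found;
}

// Usage example.
void contains_example() {
    const std::vector<int> data{4, 8, 15, 16, 23, 42};
    assert(contains(data, 15));
    assert(!contains(data, 7));
}
```

The price is that every iteration still runs even after a hit; the point of the sketch is only that control leaves the region through its normal exit.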

What do we need to define while using parallel optimization flag?

I have a program with more than 100 subroutines, and I am trying to make it run faster by compiling these subroutines with the parallel flag. What variables or parameters do I need to define in the program if I want to use the parallel flag? Just using the parallel optimization flag increased the run time of my program compared to the build without it.
Any suggestions are highly appreciated. Thanks a lot.
Best Regards,
Jdbaba
I can give you some general guidelines, but without knowing your specific compiler and platform/OS I can't help with specifics. As far as I know, all of the autoparallelization schemes used in Fortran compilers end up using either OpenMP or MPI calls to split loops across threads or processes. The issue is that these schemes carry a certain amount of overhead. For instance, in one case I had a program that used an optimization library supplied by a vendor as a precompiled library, built without optimization. Since all of my subroutines and functions were either outside or inside the optimizer's large loop, and since only object code was available for it, the autoparallelizer wasn't able to perform IPO, and as a result it failed to use more than one core. In that case, because of the DLL that was loaded for OpenMP, /qparallel actually added ~10% to the run time.
As a note, autoparallelizers aren't magic. Essentially all they are doing is the same type of thing that the autovectorization techniques do, which is to look for loops that have no data that are dependent upon the previous iteration. If it detects that variables are changed between iterations or if the compiler can't tell, then it will not attempt to parallelize the loop.
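The difference is easier to see side by side. The two loops below are a hand-written illustration in C++ (not compiler output, and the function names are made up): the first has independent iterations, the second carries a value from one iteration to the next.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Independent iterations: v[i] depends only on its own old value, so the
// iterations could run in any order or on different cores.
void scale(std::vector<double>& v, double factor) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        v[i] *= factor;
    }
}

// Loop-carried dependency: v[i] needs the value that iteration i-1 just
// wrote, so the iterations must run in order. An autoparallelizer would
// typically leave this loop serial.
void prefix_sum(std::vector<double>& v) {
    for (std::size_t i = 1; i < v.size(); ++i) {
        v[i] += v[i - 1];
    }
}

// Usage example.
void dependency_example() {
    std::vector<double> v{1.0, 2.0, 3.0};
    scale(v, 2.0);  // v is now {2, 4, 6}
    assert(v[0] == 2.0 && v[1] == 4.0 && v[2] == 6.0);
    prefix_sum(v);  // v is now {2, 6, 12}
    assert(v[0] == 2.0 && v[1] == 6.0 && v[2] == 12.0);
}
```

Reformulating code to be friendlier to the autoparallelizer mostly means turning loops of the second kind into loops of the first kind where the algorithm allows it.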
If you are using the Intel Fortran compiler, you can turn on a diagnostic switch "/qpar-report3" or "-par-report3" to give you information as to the dependency tree of loops to see why they failed to optimize. If you don't have access to large sections of the code you are using, in particular parts with major loops, there is a good chance that there won't be much opportunity in your code to use the auto-parallelizer.
In any case, you can always attempt to reduce dependencies and reformulate your code such that it is more friendly to autoparallelization.

Parallel STL algorithms in OS X

I am working on converting an existing program to take advantage of some of the parallel functionality of the STL.
Specifically, I've re-written a big loop to work with std::accumulate. It runs, nicely.
Now, I want to have that accumulate operation run in parallel.
The documentation I've seen for GCC outlines two specific steps.
Include the compiler flag -D_GLIBCXX_PARALLEL
Possibly add the header <parallel/algorithm>
Adding the compiler flag doesn't seem to change anything. The execution time is the same, and I don't see any indication of multiple core usage when monitoring the system.
I get an error when adding the parallel/algorithm header. I thought it would be included with the latest version of gcc (4.7).
So, a few questions:
Is there some way to definitively determine if code is actually running in parallel?
Is there a "best practices" way of doing this on OS X? (Ideal compiler flags, header, etc?)
Any and all suggestions are welcome.
Thanks!
See http://threadingbuildingblocks.org/
If you only ever parallelize STL algorithms, you are going to be disappointed in the results in general. Those algorithms generally only begin to show a scalability advantage when working over very large datasets (e.g. N > 10 million).
TBB (and others like it) work at a higher level, focusing on the overall algorithm design, not just the leaf functions (like std::accumulate()).
A second alternative is to use OpenMP, which is supported by both GCC and Clang. It is not STL by any means, but it is cross-platform.
A third alternative is to use Grand Central Dispatch, the official multicore API in OS X; again, hardly STL.
A fourth alternative is to wait for C++17, which will have a Parallelism module.
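For the OpenMP route, a sum like the one std::accumulate computes could look roughly like the sketch below. The function name parallel_sum is made up, and the sketch assumes integer data so that the order in which partial sums are combined does not affect the result. Built without OpenMP support, the pragma is ignored and the loop simply runs serially.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Sums `data` with an OpenMP reduction: each thread accumulates a private
// partial total, and the partial totals are added together at the end.
long long parallel_sum(const std::vector<int>& data) {
    long long total = 0;
    const long n = static_cast<long>(data.size());

    #pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}

// Usage example: the result should match the serial std::accumulate.
void parallel_sum_example() {
    std::vector<int> data(1000);
    std::iota(data.begin(), data.end(), 1);  // 1, 2, ..., 1000
    assert(parallel_sum(data) == std::accumulate(data.begin(), data.end(), 0LL));
}
```

To check whether this really runs in parallel, compile with OpenMP enabled, vary OMP_NUM_THREADS, and compare timings on a large input; for small inputs the threading overhead tends to dominate.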
