Schedule clause in OpenMP

I have a piece of code (part of a larger application) that I'm trying to optimize using OpenMP, and I am trying out various scheduling policies. In my case, I noticed that the schedule(RUNTIME) clause has an edge over the others (I am not specifying a chunk_size). I have two questions:
When I do not specify chunk_size, is there a difference between schedule(DYNAMIC) and schedule(GUIDED)?
How does OpenMP determine the default implementation-specific scheduling that is stored in the OMP_SCHEDULE variable?
I learned that if no scheduling scheme is specified, then by default schedule(STATIC) is used. So if I don't modify the OMP_SCHEDULE variable and use schedule(RUNTIME) in my program, will the scheduling scheme be schedule(STATIC) all the time, or does OpenMP have some intelligent way to devise the schedule strategy dynamically and change it from time to time?

Yes. If you do not specify a chunk size, DYNAMIC uses a chunk size of 1, so every chunk contains a single iteration. GUIDED also uses a minimum chunk size of 1, but starts with larger, implementation-dependent chunks whose size decreases as iterations are handed out. You could figure out what happens in your situation by running some experiments or reading your compiler's documentation.
As I understand the situation: if the environment variable OMP_SCHEDULE is not set, then the runtime schedule is implementation dependent. I would be very surprised if the same schedule were not chosen for each execution of the program. I do not believe that OpenMP, which is essentially a set of compile-time directives, has any way to understand the run-time performance of your program and to choose a schedule based on that information.
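For concreteness, here is a minimal sketch of how schedule(runtime) interacts with OMP_SCHEDULE; the array size and loop body are placeholders:

#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    /* The schedule is read from OMP_SCHEDULE at run time, e.g.
       OMP_SCHEDULE="guided" or OMP_SCHEDULE="dynamic,2".
       If the variable is unset, the choice is implementation defined. */
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;          /* placeholder work */
    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}

You would then run, say, OMP_SCHEDULE=guided ./a.out and compare timings against OMP_SCHEDULE=dynamic,2 or an explicit schedule(static).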

Related

Use of threads in SCIP

In the list of SCIP's parameters I see three types of references to the use of threads:
lp/threads (threads used for solving the LP, which don't matter when using SoPlex, according to this question).
parallel/{min, max}threads (number of threads during parallel solve).
concurrent/* (parameters related to the use of threads in concurrent mode).
My questions are: how are threads used in SCIP with a default installation? Are the parallel/{min, max}threads parameters related only to the concurrent solver? If I don't turn on the concurrent solver, will SCIP use the available threads to solve the branch-and-bound subproblems in parallel?
Thanks in advance!
All parameters in the "parallel/" and "concurrent/" sections of the SCIP parameter space only affect the concurrent mode.
SCIP is by default single-threaded, but can be parallelized using the UG-framework or by enabling the concurrent mode.
Concurrent optimization is started by "concurrentopt" in the interactive shell or method "SCIPsolveConcurrent()" in the C API. The number of threads used can be controlled by parallel/minnthreads and parallel/maxnthreads. If a tight memory limit is set, then the thread count may be reduced further.
When you look at the unscaled solving times on Hans Mittelmann's webpage, you see that FiberSCIP achieves a slight speedup. Your statement that there would be a slowdown probably comes from the fact that the performance relative to the best solver is worse.
Generally: for both parallelization schemes, whether parallelization has the potential to improve performance is highly instance-dependent.
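For illustration, a minimal C sketch of starting concurrent optimization through the C API might look like the following; the instance file name and the thread limit are placeholders, and error handling is reduced to SCIP's SCIP_CALL macro:

#include <scip/scip.h>
#include <scip/scipdefplugins.h>

static SCIP_RETCODE run_concurrent(void)
{
   SCIP* scip = NULL;
   SCIP_CALL( SCIPcreate(&scip) );
   SCIP_CALL( SCIPincludeDefaultPlugins(scip) );
   SCIP_CALL( SCIPreadProb(scip, "model.mps", NULL) );            /* placeholder instance */

   /* limit the concurrent solver to at most 4 threads (placeholder value) */
   SCIP_CALL( SCIPsetIntParam(scip, "parallel/maxnthreads", 4) );

   /* concurrent counterpart of SCIPsolve(); same as "concurrentopt" in the shell */
   SCIP_CALL( SCIPsolveConcurrent(scip) );

   SCIP_CALL( SCIPfree(&scip) );
   return SCIP_OKAY;
}

int main(void)
{
   return run_concurrent() != SCIP_OKAY;
}

Note that the plain SCIPsolve() path remains single-threaded; only this concurrent entry point (or the UG framework) uses multiple threads.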

Are there options that would boost the amount of memory that NuSMV can use during a specification run?

Running a specification in NuSMV takes several hours and the process is eventually killed (signal 9). How can I speed up the execution?
Are there options that would boost the amount of memory that NuSMV can use during a specification run?
You might want to use nuXmv (https://es-static.fbk.eu/tools/nuxmv/), which is the successor of NuSMV. It provides newer SAT-based model-checking algorithms that often use less memory than BDD-based ones, and it accepts the same model specifications as NuSMV.
Overall, it depends on why NuSMV runs out of memory. Most of the time it will not manage to build the model, which means you'll have to reduce your model size. For this, you might want to look at whether some state variables could become Boolean signals without state, or whether you can reduce the range of some of the variables.
If you have a parametric model, e.g., where a variable number of modules is used or the bit width of some variables can be changed, you can try to get a simpler variant to run and then find out which part makes the memory demand grow. This part should then be modelled in a different way.

Cilk or Cilk++ or OpenMP

I'm creating a multi-threaded application on Linux. Here is the scenario:
Suppose I have x instances of a class BloomFilter and y GB of data (more than the available memory). I need to test membership of this y GB of data against each of the bloom filter instances. It is pretty clear that parallel programming will help to speed up the task; moreover, since I am only reading the data, it can be shared across all processes or threads.
Now I am confused about which one to use: Cilk, Cilk++ or OpenMP (which one is better)? I am also confused about whether to go for multithreading or multiprocessing.
Cilk Plus is the current implementation of Cilk by Intel.
Both Cilk Plus and OpenMP are multithreaded environments, i.e., multiple threads are spawned during execution.
If you are new to parallel programming, OpenMP is probably the better choice for you, since it allows easier parallelization of already developed sequential code. Do you already have a sequential version of your code?
OpenMP uses pragmas to tell the compiler which portions of the code have to run in parallel. If I understand your problem correctly, you probably need something like this:
#pragma omp parallel for firstprivate(array_of_bloom_filters)
for (size_t i = 0; i < data.size(); ++i)
    check(data[i], array_of_bloom_filters);
The bloom filter instances are replicated in every thread (via firstprivate) to avoid contention, while the data is shared among the threads.
Update:
The paper actually considers an application which is very unbalanced, i.e., different tasks (allocated to different threads) may incur very different workloads. Citing the paper you mentioned: "a highly unbalanced task graph that challenges scheduling, load balancing, termination detection, and task coarsening strategies". Consider that in order to balance computation among threads it is necessary to reduce the task size and therefore increase the time spent in synchronization.
In other words, good load balancing always comes at a cost. The description of your problem is not very detailed, but it seems to me that your problem is quite balanced. If this is not the case, then go for Cilk: its work-stealing approach is probably the best solution for unbalanced workloads.
At the time this was posted, Intel was putting a lot of effort into boosting Cilk(tm) Plus; more recently, some effort has been diverted toward OpenMP 4.0.
It's difficult in general to contrast OpenMP with Cilk(tm) Plus.
If it's not possible to distribute work evenly across threads, one would likely set schedule(runtime) in an OpenMP version, and then at run time try various values of the OMP_SCHEDULE environment variable, such as OMP_SCHEDULE=guided, OMP_SCHEDULE=dynamic,2 or OMP_SCHEDULE=auto. Those are the closest OpenMP analogues to the way Cilk(tm) Plus work stealing works.
Some sparse matrix functions in Intel MKL library do actually scan the job first and determine how much to allocate to each thread so as to balance work. For this method to be useful, the time spent in serial scanning and allocating has to be of lower order than the time spent in parallel work.
Work-stealing, or dynamic scheduling, may lose much of the advantage OpenMP can gain from promoting cache locality by pinning threads to cores, e.g. via OMP_PROC_BIND=close.
Poor cache locality becomes a bigger issue on a NUMA architecture where it may lead to significant time spent on remote memory access.
Both OpenMP and Cilk(tm) Plus have facilities for switching between serial and parallel execution.
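To illustrate that on the OpenMP side, here is a small sketch using the if clause of the parallel construct to fall back to serial execution for small inputs; the threshold value is an arbitrary placeholder:

void scale(double* data, int n)
{
    /* spawn a thread team only when the loop is large enough to amortize
       the parallelization overhead; 10000 is a placeholder threshold */
    #pragma omp parallel for if(n > 10000) schedule(runtime)
    for (int i = 0; i < n; i++)
        data[i] *= 2.0;
}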

omp_set_dynamic - how does the runtime determine the number of threads?

How does the OpenMP runtime determine the best number of threads when omp_set_dynamic is used?
For example, are some sort of timing mechanisms used, or does the compiler give the runtime hints about how large the tasks are?
I don't think that OpenMP determines the 'best' number of threads for an application, in any likely sense of the word 'best'. As @aaa commented, the runtime's behaviour when dynamic thread adjustment is enabled via omp_set_dynamic is implementation specific.
I don't think that current Fortran/C/C++ compilers could provide information such as timings or task sizes to the runtime.
I believe this function exists so that schedulers (and similar tools) can manage the programs running on a machine, for example to optimize overall throughput.
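For reference, a small sketch of enabling dynamic thread adjustment; how many threads are actually delivered is entirely up to the implementation, and the requested count of 8 is a placeholder:

#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_dynamic(1);        /* allow the runtime to give us fewer threads */
    omp_set_num_threads(8);    /* request at most 8 (placeholder value) */

    #pragma omp parallel
    {
        #pragma omp single
        printf("runtime delivered %d threads (dynamic = %d)\n",
               omp_get_num_threads(), omp_get_dynamic());
    }
    return 0;
}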

What is easier to learn and debug OpenMP or MPI?

I have a number-crunching C/C++ application. It is basically a main loop over different data sets. We have access to a 100-node cluster with OpenMP and MPI available. I would like to speed up the application, but I am an absolute newbie at both MPI and OpenMP. I just wonder which is the easiest one to learn and to debug, even if the performance is not the best.
I also wonder which is the most suitable for my main-loop application.
Thanks
If your program is just one big loop, using OpenMP can be as simple as writing:
#pragma omp parallel for
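In context, a minimal sketch of what that looks like for a loop over data sets; dataset_t and process_dataset stand in for the application's own types and work:

#include <stdio.h>

typedef struct { int id; } dataset_t;        /* placeholder data set type */

static void process_dataset(dataset_t* d)    /* hypothetical per-data-set work */
{
    printf("finished data set %d\n", d->id);
}

int main(void) {
    enum { COUNT = 16 };                     /* placeholder number of data sets */
    dataset_t sets[COUNT];
    for (int i = 0; i < COUNT; i++)
        sets[i].id = i;

    /* each iteration handles one independent data set */
    #pragma omp parallel for
    for (int i = 0; i < COUNT; i++)
        process_dataset(&sets[i]);
    return 0;
}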
OpenMP is only useful for shared-memory programming, which, unless your cluster is running something like Kerrighed, means that the OpenMP version will only run on at most one node at a time.
MPI is based around message passing and is slightly more complicated to get started with. The advantage, though, is that your program can run on several nodes at once, passing messages between them as needed.
Given that you said "for different data sets", it sounds like your problem might actually fall into the "embarrassingly parallel" category: provided you have more than 100 data sets, you could just set up the scheduler to run one data set per node until they are all completed, with no need to modify your code and almost a 100x speedup over using a single node.
For example, if your cluster uses Condor as the scheduler, you could submit one job per data item to the "vanilla" universe, varying only the "Arguments =" line of the job description. (There are other ways to do this with Condor which may be more sensible, and there are similar mechanisms for Torque, SGE, etc.)
OpenMP is essentially for SMP machines, so if you want to scale to hundreds of nodes you will need MPI anyhow. You can, however, use both: MPI to distribute work across nodes and OpenMP to handle parallelism across the cores or CPUs within each node. I would say OpenMP is a lot easier than messing with pthreads, but being coarser grained, the speedup you get from OpenMP will usually be lower than that of a hand-optimized pthreads implementation.
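A minimal sketch of that hybrid pattern, assuming the items can be indexed and processed independently (do_work is a placeholder); MPI calls stay outside the OpenMP region, so plain MPI_Init is sufficient here:

#include <mpi.h>
#include <stdio.h>

static double do_work(int item) { return 2.0 * item; }   /* hypothetical per-item work */

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_items = 1000;                 /* placeholder problem size */
    double local_sum = 0.0;

    /* MPI: each rank takes every size-th item; OpenMP: threads share that slice */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = rank; i < n_items; i += size)
        local_sum += do_work(i);

    double total = 0.0;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}

Compile with the MPI wrapper plus the compiler's OpenMP flag (e.g. mpicc -fopenmp) and launch with one MPI process per node, letting OpenMP use the cores within each node.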
