Use of threads in SCIP - parallel-processing

In the list of SCIP's parameters I see three types of references to the use of threads:
lp/threads (threads used for solving the LP, which don't matter when using SoPlex, according to this question).
parallel/{min, max}threads (number of threads during parallel solve).
concurrent/* (parameters related to the use of threads in concurrent mode).
My question(s) is(are): How are threads used in SCIP when using a default installation? Are the parallel/{min, max}threads parameters related only to the concurrent solver? If I don't turn on the concurrent solver, would SCIP use the available threads for solving the branch-and-bound subproblems in parallel?
Thanks in advance!

All parameters in the "parallel/" and "concurrent/" sections of the SCIP parameter space only affect the concurrent mode.
SCIP is by default single-threaded, but can be parallelized using the UG-framework or by enabling the concurrent mode.

Concurrent optimization is started by "concurrentopt" in the interactive shell or method "SCIPsolveConcurrent()" in the C API. The number of threads used can be controlled by parallel/minnthreads and parallel/maxnthreads. If a tight memory limit is set, then the thread count may be reduced further.
When you look at the unscaled solving time on Hans Mittelmann's webpage you see that FiberSCIP achieves a slight speedup. Your statement that there would be a slowdown probably comes from the fact that the relative performance to the best solver is worse.
Generally: For both parallelization schemes, it is highly instance-dependent on whether parallelization holds potential to improve performance.

Related

OpenMP and MPI Interaction

Hi currently I'm working on a program that I have working in parallel using MPI. I was wondering if I could gain additional speed in the for loops using OpenMP so I could get more out of each processor. Would I gain anything out of doing this? Also how would I go about it?
From experience it really depend on your problem and on how many MPI processes you are using.
Using large amount of MPI processes usually improve data locality, but your parallelization might not allow large amount of processes.
The thought that you will gain for sure a decent speedup is very often wrong :-(... But then if you reach the point where you cant use more MPI processes due to lack of parallel efficiency you will probably gain the possibility of using more cores efficiently.
From experience you should target a small number of thread (4-8, 1/2 of the socket cores count), especially if you have only small loops (which should be the case if you reach the max number of MPI processes).
A good intro of hybrid parallelism:
http://www.openmp.org/press-release/sc13-tutorial-hybrid-mpi-openmp-parallel-programming/

Emulate a very fast (virtual) CPU core

I know that the usual method when we want to make a big math computation faster is to use multiprocessing / parallel processing: we split the job in for example 4 parts, and we let 4 CPU cores run in parallel (parallelization). This is possible for example in Python with multiprocessing module: on a 4-core CPU, it would allow to use 100% of the processing power of the computer instead of only 25% for a single-process job.
But let's say we want to make faster a non-easily-splittable computation job.
Example: we are given a number generator function generate(n) that takes the previously-generated number as input, and "it is said to have 10^20 as period". We want to check this assertion with the following pseudo-code:
a = 17
for i = 1..10^20
a = generate(a)
check if a == 17
Instead of having a computer's 4 CPU cores (3.3 Ghz) running "in parallel" with a total of 4 processes, is it possible to emulate one very fast single-core CPU of 13.2 Ghz (4*3.3) running one single process with the previous code?
Is such technique available for a desktop computer? If not, is it available on cloud computing platforms (AWS EC2, etc.)?
Single-threaded performance is extremely valuable; it's much easier to write sequential code than to explicitly expose thread-level parallelism.
If there was an easy and efficient general-purpose way to do what you're asking which works when there is no parallelism in the code, it would already be in widespread use. Either internally inside multi-core CPUs, or in software if it required higher-level / larger-scale code transformations.
Out-of-order CPUs can find and exploit instruction-level parallelism within a single thread (over short distances, like a couple hundred instructions), but you need explicit thread-level parallelism to take advantage of multiple cores.
This is similar to How does a single thread run on multiple cores? over on SoftwareEnginnering.SE, except that you've already ruled out any easy-to-find parallelism including instruction-level parallelism. (And the answer is: it doesn't. It's the hardware of a single core that finds the instruction-level parallelism in a single thread; my answer there explains some of the microarchitectural details of how that works.)
The reverse process: turning one big CPU into multiple weaker CPUs does exist, and is useful for running multiple threads which don't have much instruction-level parallelism. It's called SMT (Simultaneous MultiThreading). You've probably heard of Intel's Hyperthreading, the most widely known implementation of SMT. It trades single-threaded performance for more throughput, keeping more execution units fed with useful work more of the time. The cost of building a single wide core grows at least quadratically, which is why typical desktop CPUs don't just have a single massive core with 8-way SMT. (And note that a really wide CPU still wouldn't help with a totally dependent instruction stream, unless the generate function has some internal instruction-level parallelism.)
SMT would be good if you wanted to test 8 different generate() functions at once on a quad-core CPU. Without SMT, you could alternate in software between two generate chains in one thread, so out-of-order execution could be working on instructions from both dependency chains in parallel.
Auto-parallelization by compilers at compile time is possible for source that has some visible parallelism, but if generate(a) isn't "separable" (not the correct technical term, I think) then you're out of luck.
e.g. if it's return a + hidden_array[static_counter++]; then the compiler can use math to prove that summing chunks of the array in parallel and adding the partial sums will still give the same result.
But if there's truly a serial dependency through a (like even a simple LCG PRNG), and the software doesn't know any mathematical tricks to break the dependency or reduce it to a closed form, you're out of luck. Compilers do know tricks like sum(0..n) = n*(n+1)/2 (evaluated slightly differently to avoid integer overflow in a partial result), or a+a+a+... (n times) is a * n, but that doesn't help here.
There is a scheme studied mostly in the academy called "Thread Decomposition". It aims to do more or less what you ask about - given a single-threaded code, it tries to break it down into multiple threads in order to divide the work on a multicore system. This process can be done by a compiler (although this requires figuring out all possible side effects at compile time which is very hard), by a JIT runtime, or through HW binary-translation, but each of these methods has complicated limitations and drawbacks.
Unfortunately, other than being automated, this process has very little appeal as it can hardly match true manual parallelization done by a person how understands the code. It also doesn't simply scale performance according to the number of threads, since it usually incurs a large overhead in the form of code that has to be duplicated.
Example paper by some nice folks from UPC in Barcelona: http://ieeexplore.ieee.org/abstract/document/5260571/

Can parallelization have a negative performance impact?

With the abundance of techniques being employed to increase parallelization in today's compiler-tools (especially auto-parallelization of certain viable for-constructs, c.f. the Intel C++ Compiler, Microsoft Visual Studio 2011, alongside various others), I wondered if parallelization is always guaranteed to improve or have no impact on performance.
Are there any cases in which parallelization would have a distinctly negative impact on performance?
A quick internet search didn't yield much hope, so I decided to turn here to see if anyone has any knowledge of cases where parallelization has a detrimental impact on performance, or better yet, experience in a project where parallelization actually caused difficulties.
I am also curious about whether there are any negative performance implication of auto-vectorization, although I find it quite unlikely that there would be.
Thanks in advance!
Parallelisation usually involves some abstract data exchange between the different processing elements since not all of them have exclusive access to all the data that it needs in order to complete its part of the computation. It could either be messages passed between different processes in an MPI job or it could be synchronisation actions in a multithreaded program. Passing data around or synchronising things takes time and that's why it is usually called communication or synchronisation overhead. There are different classes of problems depending on the ratio between overhead and computation.
Parallel algorithms that require no communication or synchronisation at all are called trivially (or "embarrassingly") parallel problems. An example of this class is a ray-tracing application: each pixel can be computed independently of all the others. Problems in this class scale linearly with the number of processing elements used (and sometimes even superlinearly because of caching effects) - give it twice as many processing elements and it will take twice as less time to perform the computation.
If any amount of communication or synchronisation is involved then things get progressively worse as the ratio between communication/synchronisation and computation increases. Usually this is the case when the problem size is kept fixed as one increases the number of processing elements. Usually the overhead increases with the number of processing elements while the amount of computation per element decreases.
Auto-vectorization can theoretically fall into "traps" where the overhead of getting all the elements in the right places is actually bigger than the time saved by doing things in parallel. Analyzing how much time a piece of code will take is hard, so it's hard for compilers to make the right decision.
Towards the end of these slides are some examples and statistics about auto-vectorization making the performance worse.
Usually with reasonable usage parallelization (mean parallel processing) gives positive performance imact.
But in some cases, from developer point of view, it could cause negative effects:
When allocating to many thread for parallel and/or multithreading processing.
Fork/join parallelism and loops parallelization when iteration is to small and allocating threads costs more time and resources than simple to process items synchronously
Typical multithreading/parallel execution problems like deadlocks, livelocks, threads stravation, race conditions etc.
Debugging and diagnostic, it's harder to find bugs
So all should be used reasonably.
And some links. Sorry they are .NET/Microsoft specific but problems described there are same:
Potential Pitfalls in Data and Task Parallelism
Potential Pitfalls with Parallel LINQ (PLINQ)
Good book where common problems and pitfalls are described:
Patterns for Parallel Programming: Understanding and Applying Parallel Patterns with the .NET Framework 4
From a more theoretical point of view, you may be interested in problems that are not in NC, i.e. the class of decision problems decidable in polylogarithmic time on a parallel computer with a polynomial number of processors.
Off the top of my head, I cannot think of any computational problem that is not, in some way or another, parallelizable. What I have encountered many times though are problems that have been badly parallelized.
Badly parallelized programs can easily be slower than their sequential versions. This can be a result of:
Massive overheads due to the parallelism being too fine-grained, e.g. the amount of work performed per thread is negligible compared to the overhead of starting/scheduling the operation. In OpenMP, this could be the case of a #pragma omp parallel for schedule(dynamic,k) for a small chunk size k.
Repeated concurrent access to shared resources, e.g. if all threads have to wait to access some resource or memory location sequentially. In OpenMP, this can be caused by too many or too large #pragma omp critical sections.
Over-use of slow atomic operations to update variables shared between threads, e.g. using #pragma omp atomic where, in the sequential case, faster regular memory access would be used.
In summary, and in my opinion, there are few inherently sequential problems, but mountains of badly-implemented parallel solutions.

Cilk or Cilk++ or OpenMP

I'm creating a multi-threaded application in Linux. here is the scenario:
Suppose I am having x instance of a class BloomFilter and I have some y GB of data(greater than memory available). I need to test membership for this y GB of data in each of the bloom filter instance. It is pretty much clear that parallel programming will help to speed up the task moreover since I am only reading the data so it can be shared across all processes or threads.
Now I am confused about which one to use Cilk, Cilk++ or OpenMP(which one is better). Also I am confused about which one to go for Multithreading or Multiprocessing
Cilk Plus is the current implementation of Cilk by Intel.
They both are multithreaded environment, i.e., multiple threads are spawned during execution.
If you are new to parallel programming probably OpenMP is better for you since it allows an easier parallelization of already developed sequential code. Do you already have a sequential version of your code?
OpenMP uses pragma to instruct the compiler which portions of the code has to run in parallel. If I understand your problem correctly you probably need something like this:
#pragma omp parallel for firstprivate(array_of_bloom_filters)
for i in DATA:
check(i,array_of_bloom_filters);
the instances of different bloom filters are replicated in every thread in order to avoid contention while data is shared among thread.
update:
The paper actually consider an application which is very unbalanced, i.e., different taks (allocated on different thread) may incur in very different workload. Citing the paper that you mentioned "a highly unbalanced task graph that challenges scheduling,
load balancing, termination detection, and task coarsening strategies". Consider that in order to balance computation among threads it is necessary to reduce the task size and therefore increase the time spent in synchronizations.
In other words, good load balancing comes always at a cost. The description of your problem is not very detailed but it seems to me that the problem you have is quite balanced. If this is not the case then go for Cilk, its work stealing approach its probably the best solution for unbalanced workloads.
At the time this was posted, Intel was putting a lot of effort into boosting Cilk(tm) Plus; more recently, some effort has been diverted toward OpenMP 4.0.
It's difficult in general to contrast OpenMP with Cilk(tm) Plus.
If it's not possible to distribute work evenly across threads, one would likely set schedule(runtime) in an OpenMP version, and then at run time try various values of environment variable, such as OMP_SCHEDULE=guided, OMP_SCHEDULE=dynamic,2 or OMP_SCHEDULE=auto. Those are the closest OpenMP analogies to the way Cilk(tm) Plus work stealing works.
Some sparse matrix functions in Intel MKL library do actually scan the job first and determine how much to allocate to each thread so as to balance work. For this method to be useful, the time spent in serial scanning and allocating has to be of lower order than the time spent in parallel work.
Work-stealing, or dynamic scheduling, may lose much of the potential advantage of OpenMP in promoting cache locality by pinning threads with cache locality e.g. by OMP_PROC_BIND=close.
Poor cache locality becomes a bigger issue on a NUMA architecture where it may lead to significant time spent on remote memory access.
Both OpenMP and Cilk(tm) Plus have facilities for switching between serial and parallel execution.

How to determine the optimum number of worker threads

I wrote a C program which reads a dataset from a file and then applies a data mining algorithm to find the clusters and classes in the data. At the moment I am trying to rewrite this sequential program multithreaded with PThreads and I am newbie to a parallel programming and I have a question about the number of worker threads which struggled my mind:
What is the best practice to find the number of worker threads when you do parallel programming and how do you determine it? Do you try different number of threads and see its results then determine or is there a procedure to find out the optimum number of threads. Of course I'm investigating this question from the performance point of view.
There are a couple of issues here.
As Alex says, the number of threads you can use is application-specific. But there are also constraints that come from the type of problem you are trying to solve. Do your threads need to communicate with one another, or can they all work in isolation on individual parts of the problem? If they need to exchange data, then there will be a maximum number of threads beyond which inter-thread communication will dominate, and you will see no further speed-up (in fact, the code will get slower!). If they don't need to exchange data then threads equal to the number of processors will probably be close to optimal.
Dynamically adjusting the thread pool to the underlying architecture for speed at runtime is not an easy task! You would need a whole lot of additional code to do runtime profiling of your functions. See for example the way FFTW works in parallel. This is certainly possible, but is pretty advanced, and will be hard if you are new to parallel programming. If instead the number of cores estimate is sufficient, then trying to determine this number from the OS at runtime and spawning your threads accordingly will be a much easier job.
To answer your question about technique: Most big parallel codes run on supercomputers with a known architecture and take a long time to run. The best number of processors is not just a function of number, but also of the communication topology (how the processors are linked). They therefore benefit from a testing phase where the best number of processors is determined by measuring the time taken on small problems. This is normally done by hand. If possible, profiling should always be preferred to guessing based on theoretical considerations.
You basically want to have as many ready-to-run threads as you have cores available, or at most 1 or 2 more to ensure no core that's available to you will ever be left idle. The trick is in estimating how many threads will typically be blocked waiting for something else (mostly I/O), as that is totally dependent on your application and even on external entities beyond your control (databases, other distributed services, etc, etc).
In the end, once you've determined about how many threads should be optimal, running benchmarks for thread pool sizes around your estimated value, as you suggest, is good practice (at the very least, it lets you double check your assumptions), especially if, as it appears, you do need to get the last drop of performance out of your system!

Resources