Can the Intel MKL Pardiso library be run with MPI? - thread safety

I'm currently using the Intel MKL library to solve a linear system.
As far as I know, the Intel MKL library does not support MPI parallelization.
Previously:
I had one big target system to calculate => thus I built one big linear system to solve.
What I'm planning:
Split the big system into pieces (one per MPI process) => building a few small linear systems (independent of each other) to solve.
This is different from what parallel solvers generally do (distribute 'one' big matrix across MPI processes). I will have a few independent small matrices, and the MPI processes will solve them independently.
My question is: could the Intel MKL Pardiso solver be utilized in this way (called simultaneously from several MPI processes for independent problems)?
I've added a picture to describe what I'm going to do in a more understandable way.

The current version of MKL provides a cluster version of Intel MKL Pardiso (cluster_sparse_solver). You may check the MKL Reference at this link: https://software.intel.com/content/www/us/en/develop/documentation/mkl-developer-reference-c/top/sparse-solver-routines/parallel-direct-sparse-solver-for-clusters-interface.html.
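That said, for the pattern in the question (fully independent systems, one per rank), the cluster solver is not strictly required: MPI processes have separate address spaces, so each rank can call the ordinary shared-memory PARDISO on its own local matrix with its own handle. A minimal sketch of that pattern (the 3x3 system and its values are purely illustrative, not from the original post):

    // Each MPI rank builds and solves its OWN small sparse system with the
    // ordinary shared-memory PARDISO; nothing is shared between ranks, so
    // the calls are completely independent.
    #include <mpi.h>
    #include <cstdio>
    #include "mkl_pardiso.h"
    #include "mkl_types.h"

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // Illustrative 3x3 SPD tridiagonal system in 1-based CSR,
        // upper triangle only (as required for mtype = 2).
        MKL_INT n = 3;
        MKL_INT ia[4] = {1, 3, 5, 6};
        MKL_INT ja[5] = {1, 2, 2, 3, 3};
        double  a[5]  = { 2.0, -1.0, 2.0, -1.0, 2.0 };
        double  b[3]  = { 1.0, 1.0 + rank, 1.0 };   // RHS differs per rank
        double  x[3];

        void*   pt[64] = {0};          // per-rank solver handle
        MKL_INT iparm[64];
        MKL_INT mtype = 2;             // real symmetric positive definite
        pardisoinit(pt, &mtype, iparm);

        MKL_INT maxfct = 1, mnum = 1, nrhs = 1, msglvl = 0, error = 0, idum = 0;
        double  ddum = 0.0;
        MKL_INT phase = 13;            // analysis + factorization + solve
        pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n,
                a, ia, ja, &idum, &nrhs, iparm, &msglvl, b, x, &error);
        std::printf("rank %d: error = %d, x = (%g, %g, %g)\n",
                    rank, (int)error, x[0], x[1], x[2]);

        phase = -1;                    // release internal memory
        pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n,
                &ddum, ia, ja, &idum, &nrhs, iparm, &msglvl, &ddum, &ddum, &error);

        MPI_Finalize();
        return 0;
    }

Build with the MPI compiler wrapper and link against MKL. Since PARDISO is itself OpenMP-threaded, with several ranks per node you would typically also set MKL_NUM_THREADS (or OMP_NUM_THREADS) so the ranks do not oversubscribe the cores.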

Related

Speedup comparison between BLAS and OpenBLAS in Armadillo

I've been testing various open source codes for solving a linear system of equations in C++. So far the fastest I've found is Armadillo, used together with the OpenBLAS package. Solving a dense linear NxN system with N=5000 takes around 8.3 seconds on my system, which is really fast (without OpenBLAS installed, it takes around 30 seconds).
One reason for this speedup is that Armadillo+OpenBLAS enables using multiple threads. It runs on two of my cores, whereas Armadillo without OpenBLAS only uses one. I have an i7 processor, so I want to increase the number of cores and test it further. I'm using Ubuntu, so per the OpenBLAS documentation I can do in the terminal:
export OPENBLAS_NUM_THREADS=4
However, running the code again doesn't seem to increase the number of cores being used or the speed. Am I doing something wrong, or is 2 the maximum for Armadillo's "solve(A,b)" command? I wasn't able to find Armadillo's source code anywhere to take a look.
Incidentally, does anybody know the method Armadillo/OpenBLAS use for solving Ax=b (standard LU decomposition with parallelism, or something else)? Thanks!
Edit: Actually, the number of cores being stuck at 2 seems to be a bug when installing OpenBLAS with the Synaptic package manager, see here. Reinstalling from source allows it to detect how many cores I actually have (8). Now I can use export OPENBLAS_NUM_THREADS=4 etc. to govern it.
Armadillo doesn't prevent OpenBLAS from using more cores. It's possible that the current implementation of OpenBLAS simply chooses 2 cores for certain operations.
You can see Armadillo's source code directly in the downloadable package (it's open source), in the folder "include". Specifically, have a look at the file "include/armadillo_bits/fn_solve.hpp" (which contains the user-accessible solve() function), and the file "include/armadillo_bits/auxlib_meat.hpp" (which contains the wrapper and housekeeping code for calling the torturous BLAS and LAPACK functions).
If you already have Armadillo installed on your machine, have a look at "/usr/include/armadillo_bits" or "/usr/local/include/armadillo_bits".
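If you want a self-contained way to test the thread scaling, here is a minimal sketch of the benchmark described in the question (the matrix contents are illustrative; for a general square dense system, Armadillo's solve() typically hands the work to LAPACK's LU-based routines, which also addresses the "which method" question above):

    // Solve a dense N x N system with Armadillo and time it. OpenBLAS
    // reads OPENBLAS_NUM_THREADS when the program starts, so set it in
    // the shell beforehand.
    #include <armadillo>
    #include <iostream>

    int main() {
        const arma::uword N = 5000;
        // Diagonally dominant random matrix, so the system is well conditioned
        arma::mat A = arma::randu<arma::mat>(N, N) + N * arma::eye<arma::mat>(N, N);
        arma::vec b = arma::randu<arma::vec>(N);

        arma::wall_clock timer;
        timer.tic();
        arma::vec x = arma::solve(A, b);   // typically LAPACK LU underneath
        std::cout << "solve: " << timer.toc() << " s, residual: "
                  << arma::norm(A * x - b) << std::endl;
        return 0;
    }

Compile with something like g++ -O2 bench.cpp -o bench -larmadillo, then run OPENBLAS_NUM_THREADS=4 ./bench and watch the core usage in top.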

Numerical differences between older Mac Mini and newer Macbook

I have a project that I compile on both my Mac Mini (Core 2 Duo) and a 2014 MacBook with a quad-core i7. Both are running the latest version of Yosemite. The application is single-threaded, and I am compiling the tool and libraries using the exact same version of CMake and the Clang (Xcode) compiler. I am getting test failures due to slight numeric differences.
I am wondering if the inconsistency is coming from the Clang compiler automatically doing processor-specific optimizations (which I did not select in CMake)? Could the difference be between the processors? Do the frameworks use processor-specific optimizations? I am using the BLAS/LAPACK routines from the Accelerate framework. They are called from the SuperLU sparse matrix factorization package.
In general you should not expect results from BLAS or LAPACK to be bitwise reproducible across machines. There are a number of factors that implementors tune to get the best performance, all of which result in small differences in rounding:
your two machines have different numbers of processors, which will result in work being divided differently for threading purposes (even if your application is single threaded, BLAS may use multiple threads internally).
your two machines handle hyper threading quite differently, which may also cause BLAS to use different numbers of threads.
the cache and TLB hierarchy is different between your two machines, which means that different block sizes are optimal for data reuse.
the SIMD vector size on the newer machine is twice as large as that on the older machine, which again will affect how arithmetic is grouped.
finally, the newer machine supports FMA (and using FMA is necessary to get the best performance on it); this also contributes to small differences in rounding.
Any one of these factors would be enough to result in small differences; taken together it should be expected that the results will not be bitwise identical. And that's OK, so long as both results satisfy the error bounds of the computation.
Making the results identical would require severely limiting the performance on the newer machine, which would result in your shiny expensive hardware going to waste.
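To make the FMA point concrete, here is a tiny illustration (mine, not from the original thread) of how a fused multiply-add changes rounding: std::fma(a, b, c) rounds once, while a * b + c rounds the product first.

    // With these values the exact product (1 + 2^-27)(1 - 2^-27) = 1 - 2^-54
    // rounds to exactly 1.0 as a double, so the two expressions differ.
    #include <cmath>
    #include <cstdio>

    int main() {
        double a = 1.0 + 0x1p-27;
        double b = 1.0 - 0x1p-27;
        double c = -1.0;
        double separate = a * b + c;          // product rounds to 1.0 -> result 0.0
        double fused    = std::fma(a, b, c);  // single rounding -> exact -2^-54
        std::printf("separate = %g\nfused    = %g\n", separate, fused);
        return 0;
    }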

Developing Software For Multi-Core CPUs - Do Programs Have To Be Manually Optimised To Use All Cores? Or Does This Occur Automatically?

The vast majority of CPUs coming out nowadays contain multiple cores which can operate at the same time - in parallel.
I'm just wondering: from the point of view of executing a program as quickly as possible using all available CPU cores, does a programmer need to take into consideration that the software being developed will be running on a multi-core CPU? For instance, would the software have to be manually configured to assign different tasks to each CPU core? Or does the OS/CPU automatically identify and choose which parts of a program can run in parallel on different cores?
Apologies if this seems like a simple or silly question. I'm completely new to the topic of parallel programming, and I've come across some conflicting information early in my research: some sources state that the programmer must manually configure their software in order to utilise more than one CPU core (the more believable option in my opinion), while other sources state that the OS/CPU automatically identifies and chooses which tasks can run in parallel on different CPU cores (the less believable option in my opinion, given the complexity involved in identifying this automatically).
Just in case different Operating Systems, CPUs or Programming Languages perform differently in a parallel computing or multi-core environment - I will be using Windows 7 as my OS, an Intel Dual Core i7 Processor, and OpenCL as the programming language.
Any help is much appreciated.
In practice this occurs semi-automatically.
A more detailed answer will depend on the nature of your application, your preferred programming model, and the target architecture.
More explanation:
In order to exploit multicore hardware efficiently (in your case, keeping as many cores busy as possible), you first of all 1) need to "parallelize" the algorithm itself - make it "concurrent" - and 2) use one of the multi-threading (most often) or multi-process (rarer) parallel programming APIs, such as "OpenMP", "Intel TBB", "OpenCL", "POSIX Threads" or (for multi-process) "MPI", in order to efficiently, and often automatically, assign different "pieces" of your concurrent program to different threads (or, in the rarer case, processes).
One of the simplest possible examples of such kind of parallel programming (using OpenMP) is given here.
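In that spirit, a minimal OpenMP sketch (my illustration, not the linked example; compile with the compiler's OpenMP flag, e.g. -fopenmp):

    // The pragma asks the OpenMP runtime to split the loop iterations
    // across the available cores automatically; the loop body is unchanged.
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        std::vector<double> a(n, 1.0), b(n, 2.0), c(n, 0.0);

        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            c[i] = a[i] + b[i];

        std::printf("max threads: %d, c[0] = %g\n", omp_get_max_threads(), c[0]);
        return 0;
    }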
Now, you've said that you are using OpenCL as a programming model for the CPU. In certain cases, when you use a vendor-provided OpenCL implementation (like Intel OpenCL), you can semi-automatically assign an OpenCL kernel to be executed by various threads using "NDRange" and other OpenCL concepts, as explained here for the Intel Xeon Phi co-processor (not exactly CPU programming, but a similar idea) or here (a more general, but more advanced, article).
However, using OpenCL as a general-purpose multi-threading programming API for the CPU is definitely not the simplest approach, and it is not always optimal in terms of final performance. There are certain application types where OpenCL makes some sense for general-purpose CPU multi-threading, but again it very much depends on the nature of your algorithm and the target architecture.
There is one fairly old, but still reasonable, post about OpenCL vs. OpenMP/TBB on Stack Overflow. It is dated in the sense that OpenMP 4.0 now also provides solid capabilities for threading + SIMD programming (which may interest you in the future if you explore this topic in more detail). That's why I would say that OpenMP seems to be the number-one choice nowadays, but TBB, MPI or OpenCL might also be appropriate in certain cases.

Could GPU accelerate gcc/g++ compilation

When I'm building my Gentoo system, my NVIDIA GPU is usually unused - can I make some use of it?
No, you cannot.
GPUs are typically best at accelerating massively parallel math-heavy tasks that involve little branching. Compiling software is basically the exact opposite of this - it's branch-heavy and does not parallelize well beyond the file level.

How to get the best performance from an 8-core system using Intel Fortran

Please let me know how to set Intel Fortran compiler options to get the best performance out of an 8-core system, for both IA-32 and x64. I want to execute a Fortran program and take advantage of all the CPU time available on the 8-core system. Right now the program is only using 13% of the CPU time.
You can learn about the autovectorization and guided auto-parallelization features of Intel Fortran in this tutorial: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/start/win/tutorial_comp_for_win.pdf.
If you are doing linear algebra, solvers, or FFTs, you might get the best results if you map your problem onto calls into the Intel Math Kernel Library: http://software.intel.com/en-us/articles/intel-mkl/
which is already multithreaded, vectorized, and cache-optimized.
If you are doing media/signal processing, you might map your problem onto calls into the Intel Performance Primitives library: http://software.intel.com/en-us/articles/intel-ipp/
Happy hacking!
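To illustrate the MKL suggestion with a concrete sketch (in C/C++ for brevity; the Fortran interfaces behave the same way): a single BLAS call such as dgemm is already threaded inside MKL, so all eight cores can be used without any explicit threading in the calling code. The size and values below are illustrative.

    // One dgemm call; MKL parallelizes it internally across cores.
    #include <mkl.h>
    #include <vector>
    #include <cstdio>

    int main() {
        mkl_set_num_threads(8);   // allow MKL to use all 8 cores
        const MKL_INT n = 2000;
        std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);

        // C = 1.0 * A * B + 0.0 * C, computed in parallel inside MKL
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A.data(), n, B.data(), n, 0.0, C.data(), n);

        std::printf("C[0] = %g (expected %d)\n", C[0], (int)n);
        return 0;
    }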
In my specific application - a computational network model containing several loops running throughout 20k iterations, each iteration accessing a number of nested ifs - just enabling /Q2-level optimization in the compiler was sufficient to reduce the computing time drastically, while keeping the CPU load at around 15%.
On a similar note, I noticed that raising the optimization setting to the highest level (/Q3) did do what you were asking (running all CPUs at about full load), but the computing time was NOT reduced at all.
Therefore, if one has a small problem and several cases to test, and processing capacity is the only bottleneck, it could be a good idea to open more than one Fortran solution and run those cases simultaneously.
