PETSc solving linear system with ksp guide - parallel-processing

I am starting use PETSc library to solve linear system of equations in parallel. I have installed all packages, build and run successfully the examples in petsc/src/ksp/ksp/examples/tutorials/ folder, for example ex.c
But I couldn't understand how to fill matrices A,X an B by reading them for example from file.
Here I provide the code within ex2.c file:
/* Program usage: mpiexec -n <procs> ex2 [-help] [all PETSc options] */
static char help[] = "Solves a linear system in parallel with KSP.\n\
Input parameters include:\n\
-random_exact_sol : use a random exact solution vector\n\
-view_exact_sol : write exact solution vector to stdout\n\
-m <mesh_x> : number of mesh points in x-direction\n\
-n <mesh_n> : number of mesh points in y-direction\n\n";
/*T
Concepts: KSP^basic parallel example;
Concepts: KSP^Laplacian, 2d
Concepts: Laplacian, 2d
Processors: n
T*/
/*
Include "petscksp.h" so that we can use KSP solvers. Note that this file
automatically includes:
petscsys.h - base PETSc routines petscvec.h - vectors
petscmat.h - matrices
petscis.h - index sets petscksp.h - Krylov subspace methods
petscviewer.h - viewers petscpc.h - preconditioners
*/
#include <C:\PETSC\include\petscksp.h>
#undef __FUNCT__
#define __FUNCT__ "main"
int main(int argc,char **args)
{
Vec x,b,u; /* approx solution, RHS, exact solution */
Mat A; /* linear system matrix */
KSP ksp; /* linear solver context */
PetscRandom rctx; /* random number generator context */
PetscReal norm; /* norm of solution error */
PetscInt i,j,Ii,J,Istart,Iend,m = 8,n = 7,its;
PetscErrorCode ierr;
PetscBool flg = PETSC_FALSE;
PetscScalar v;
#if defined(PETSC_USE_LOG)
PetscLogStage stage;
#endif
PetscInitialize(&argc,&args,(char *)0,help);
ierr = PetscOptionsGetInt(PETSC_NULL,"-m",&m,PETSC_NULL);CHKERRQ(ierr);
ierr = PetscOptionsGetInt(PETSC_NULL,"-n",&n,PETSC_NULL);CHKERRQ(ierr);
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Compute the matrix and right-hand-side vector that define
the linear system, Ax = b.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/*
Create parallel matrix, specifying only its global dimensions.
When using MatCreate(), the matrix format can be specified at
runtime. Also, the parallel partitioning of the matrix is
determined by PETSc at runtime.
Performance tuning note: For problems of substantial size,
preallocation of matrix memory is crucial for attaining good
performance. See the matrix chapter of the users manual for details.
*/
ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,m*n,m*n);CHKERRQ(ierr);
ierr = MatSetFromOptions(A);CHKERRQ(ierr);
ierr = MatMPIAIJSetPreallocation(A,5,PETSC_NULL,5,PETSC_NULL);CHKERRQ(ierr);
ierr = MatSeqAIJSetPreallocation(A,5,PETSC_NULL);CHKERRQ(ierr);
/*
Currently, all PETSc parallel matrix formats are partitioned by
contiguous chunks of rows across the processors. Determine which
rows of the matrix are locally owned.
*/
ierr = MatGetOwnershipRange(A,&Istart,&Iend);CHKERRQ(ierr);
/*
Set matrix elements for the 2-D, five-point stencil in parallel.
- Each processor needs to insert only elements that it owns
locally (but any non-local elements will be sent to the
appropriate processor during matrix assembly).
- Always specify global rows and columns of matrix entries.
Note: this uses the less common natural ordering that orders first
all the unknowns for x = h then for x = 2h etc; Hence you see J = Ii +- n
instead of J = I +- m as you might expect. The more standard ordering
would first do all variables for y = h, then y = 2h etc.
*/
ierr = PetscLogStageRegister("Assembly", &stage);CHKERRQ(ierr);
ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
for (Ii=Istart; Ii<Iend; Ii++) {
v = -1.0; i = Ii/n; j = Ii - i*n;
if (i>0) {J = Ii - n; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
if (i<m-1) {J = Ii + n; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
if (j>0) {J = Ii - 1; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
if (j<n-1) {J = Ii + 1; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
v = 4.0; ierr = MatSetValues(A,1,&Ii,1,&Ii,&v,INSERT_VALUES);CHKERRQ(ierr);
}
/*
Assemble matrix, using the 2-step process:
MatAssemblyBegin(), MatAssemblyEnd()
Computations can be done while messages are in transition
by placing code between these two statements.
*/
ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
ierr = PetscLogStagePop();CHKERRQ(ierr);
/* A is symmetric. Set symmetric flag to enable ICC/Cholesky preconditioner */
ierr = MatSetOption(A,MAT_SYMMETRIC,PETSC_TRUE);CHKERRQ(ierr);
/*
Create parallel vectors.
- We form 1 vector from scratch and then duplicate as needed.
- When using VecCreate(), VecSetSizes and VecSetFromOptions()
in this example, we specify only the
vector's global dimension; the parallel partitioning is determined
at runtime.
- When solving a linear system, the vectors and matrices MUST
be partitioned accordingly. PETSc automatically generates
appropriately partitioned matrices and vectors when MatCreate()
and VecCreate() are used with the same communicator.
- The user can alternatively specify the local vector and matrix
dimensions when more sophisticated partitioning is needed
(replacing the PETSC_DECIDE argument in the VecSetSizes() statement
below).
*/
ierr = VecCreate(PETSC_COMM_WORLD,&u);CHKERRQ(ierr);
ierr = VecSetSizes(u,PETSC_DECIDE,m*n);CHKERRQ(ierr);
ierr = VecSetFromOptions(u);CHKERRQ(ierr);
ierr = VecDuplicate(u,&b);CHKERRQ(ierr);
ierr = VecDuplicate(b,&x);CHKERRQ(ierr);
/*
Set exact solution; then compute right-hand-side vector.
By default we use an exact solution of a vector with all
elements of 1.0; Alternatively, using the runtime option
-random_sol forms a solution vector with random components.
*/
ierr = PetscOptionsGetBool(PETSC_NULL,"-random_exact_sol",&flg,PETSC_NULL);CHKERRQ(ierr);
if (flg) {
ierr = PetscRandomCreate(PETSC_COMM_WORLD,&rctx);CHKERRQ(ierr);
ierr = PetscRandomSetFromOptions(rctx);CHKERRQ(ierr);
ierr = VecSetRandom(u,rctx);CHKERRQ(ierr);
ierr = PetscRandomDestroy(&rctx);CHKERRQ(ierr);
} else {
ierr = VecSet(u,1.0);CHKERRQ(ierr);
}
ierr = MatMult(A,u,b);CHKERRQ(ierr);
/*
View the exact solution vector if desired
*/
flg = PETSC_FALSE;
ierr = PetscOptionsGetBool(PETSC_NULL,"-view_exact_sol",&flg,PETSC_NULL);CHKERRQ(ierr);
if (flg) {ierr = VecView(u,PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);}
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Create the linear solver and set various options
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/*
Create linear solver context
*/
ierr = KSPCreate(PETSC_COMM_WORLD,&ksp);CHKERRQ(ierr);
/*
Set operators. Here the matrix that defines the linear system
also serves as the preconditioning matrix.
*/
ierr = KSPSetOperators(ksp,A,A,DIFFERENT_NONZERO_PATTERN);CHKERRQ(ierr);
/*
Set linear solver defaults for this problem (optional).
- By extracting the KSP and PC contexts from the KSP context,
we can then directly call any KSP and PC routines to set
various options.
- The following two statements are optional; all of these
parameters could alternatively be specified at runtime via
KSPSetFromOptions(). All of these defaults can be
overridden at runtime, as indicated below.
*/
ierr = KSPSetTolerances(ksp,1.e-2/((m+1)*(n+1)),1.e-50,PETSC_DEFAULT,
PETSC_DEFAULT);CHKERRQ(ierr);
/*
Set runtime options, e.g.,
-ksp_type <type> -pc_type <type> -ksp_monitor -ksp_rtol <rtol>
These options will override those specified above as long as
KSPSetFromOptions() is called _after_ any other customization
routines.
*/
ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Solve the linear system
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Check solution and clean up
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/*
Check the error
*/
ierr = VecAXPY(x,-1.0,u);CHKERRQ(ierr);
ierr = VecNorm(x,NORM_2,&norm);CHKERRQ(ierr);
ierr = KSPGetIterationNumber(ksp,&its);CHKERRQ(ierr);
/* Scale the norm */
/* norm *= sqrt(1.0/((m+1)*(n+1))); */
/*
Print convergence information. PetscPrintf() produces a single
print statement from all processes that share a communicator.
An alternative is PetscFPrintf(), which prints to a file.
*/
ierr = PetscPrintf(PETSC_COMM_WORLD,"Norm of error %A iterations %D\n",
norm,its);CHKERRQ(ierr);
/*
Free work space. All PETSc objects should be destroyed when they
are no longer needed.
*/
ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
ierr = VecDestroy(&u);CHKERRQ(ierr); ierr = VecDestroy(&x);CHKERRQ(ierr);
ierr = VecDestroy(&b);CHKERRQ(ierr); ierr = MatDestroy(&A);CHKERRQ(ierr);
/*
Always call PetscFinalize() before exiting a program. This routine
- finalizes the PETSc libraries as well as MPI
- provides summary and diagnostic information if certain runtime
options are chosen (e.g., -log_summary).
*/
ierr = PetscFinalize();
return 0;
}
Does someone know how to fill own matrices within examples?

Yeah, this can be a little daunting when you're getting started. There's a good walk-through of the process in this ACTS tutorial from 2006; the tutorials listed on the PetSC web page are generally quite good.
The key parts of this are:
ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
Actually create the PetSC matrix object, Mat A;
ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,m*n,m*n);CHKERRQ(ierr);
set the sizes; here, the matrix is m*n x m*n, as it's a stencil for operating on an m x n 2d grid
ierr = MatSetFromOptions(A);CHKERRQ(ierr);
This just takes any PetSC command line options that you might have supplied at run time and apply them to the matrix, if you wanted to control how A was set up; otherwise, you could just, have eg, used MatCreateMPIAIJ() to create it as an AIJ-format matrix (the default), MatCreateMPIDense() if it was going to be a dense matrix.
ierr = MatMPIAIJSetPreallocation(A,5,PETSC_NULL,5,PETSC_NULL);CHKERRQ(ierr);
ierr = MatSeqAIJSetPreallocation(A,5,PETSC_NULL);CHKERRQ(ierr);
Now that we've gotten an AIJ matrix, these calls just pre-allocates the sparse matrix, assuming 5 non-zeros per row. This is for performance. Note that both the MPI and Seq functions must be called to make sure this works with both 1 processor and multiple processors; this always seemed weird to be, but there you go.
Ok, now that the matrix is all set up, here's where we start getting into the actual meat of the matter.
First, we find out which rows this particular process owns. The distribution is by rows, which is a good distribution for typical sparse matrices.
ierr = MatGetOwnershipRange(A,&Istart,&Iend);CHKERRQ(ierr);
So after this call, each processor has its own version of Istart and Iend, and its this processors job to update rows starting at Istart end ending just before Iend, as you see in this for loop:
for (Ii=Istart; Ii<Iend; Ii++) {
v = -1.0; i = Ii/n; j = Ii - i*n;
Ok, so if we're operating on row Ii, this corresponds to grid location (i,j) where i = Ii/n and j = Ii % n. Eg, grid location (i,j) corresponds to row Ii = i*n + j. Makes sense?
I'm going to strip out the if statements here because they're important but they're just dealing with the boundary values and they make things more complicated.
In this row, there will be a +4 on the diagonal, and -1s at columns corresponding to (i-1,j), (i+1,j), (i,j-1), and (i,j+1). Assuming that we haven't gone off the end of the grid for these (eg, 1 < i < m-1 and 1 < j < n-1), that means
J = Ii - n; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);
J = Ii + n; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);
J = Ii - 1; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);
J = Ii + 1; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);
v = 4.0; ierr = MatSetValues(A,1,&Ii,1,&Ii,&v,INSERT_VALUES);CHKERRQ(ierr);
}
The if statements I took out just avoid setting those values if they don't exist, and the CHKERRQ macro just prints out a useful error if ierr != 0, eg the set values call failed (because we tried to set an invalid value).
Now we've set local values; the MatAssembly calls start communication to ensure any necessary values are exchanged between processors. If you have any unrelated work to do, it can be stuck between the Begin and End to try to overlap communication and computation:
ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
And now you're done and can call your solvers.
So a typical workflow is:
Create your matrix (MatCreate)
Set its size (MatSetSizes)
Set various matrix options (MatSetFromOptions is a good choice, rather than hardcoding things)
For sparse matrices, set the preallocation to reasonable guesses for the number of non-zeros per row; you can do this with a single value (as here), or with an array representing the number of non-zeros per row (here filled in with PETSC_NULL): (MatMPIAIJSetPreallocation, MatSeqAIJSetPreallocation)
Find out which rows are your responsibility: (MatGetOwnershipRange)
Set the values (calling MatSetValues either once per value, or passing in a chunk of values; INSERT_VALUES sets new elements, ADD_VALUES increments any existing elements)
Then do the assembly (MatAssemblyBegin,MatAssemblyEnd).
Other more complicated use cases are possible.

Related

Finite difference method for solving the Klein-Gordon equation in Matlab

I am trying to numerically solve the Klein-Gordon equation that can be found here. To make sure I solved it correctly, I am comparing it with an analytical solution that can be found on the same link. I am using the finite difference method and Matlab. The initial spatial conditions are known, not the initial time conditions.
I start off by initializing the constants and the space-time coordinate system:
close all
clear
clc
%% Constant parameters
A = 2;
B = 3;
lambda = 2;
mu = 3;
a = 4;
b = - (lambda^2 / a^2) + mu^2;
%% Coordinate system
number_of_discrete_time_steps = 300;
t = linspace(0, 2, number_of_discrete_time_steps);
dt = t(2) - t(1);
number_of_discrete_space_steps = 100;
x = transpose( linspace(0, 1, number_of_discrete_space_steps) );
dx = x(2) - x(1);
Next, I define and plot the analitical solution:
%% Analitical solution
Wa = cos(lambda * x) * ( A * cos(mu * t) + B * sin(mu * t) );
figure('Name', 'Analitical solution');
surface(t, x, Wa, 'edgecolor', 'none');
colormap(jet(256));
colorbar;
xlabel('t');
ylabel('x');
title('Wa(x, t) - analitical solution');
The plot of the analytical solution is shown here.
In the end, I define the initial spatial conditions, execute the finite difference method algorithm and plot the solution:
%% Numerical solution
Wn = zeros(number_of_discrete_space_steps, number_of_discrete_time_steps);
Wn(1, :) = Wa(1, :);
Wn(2, :) = Wa(2, :);
for j = 2 : (number_of_discrete_time_steps - 1)
for i = 2 : (number_of_discrete_space_steps - 1)
Wn(i + 1, j) = dx^2 / a^2 ...
* ( ( Wn(i, j + 1) - 2 * Wn(i, j) + Wn(i, j - 1) ) / dt^2 + b * Wn(i - 1, j - 1) ) ...
+ 2 * Wn(i, j) - Wn(i - 1, j);
end
end
figure('Name', 'Numerical solution');
surface(t, x, Wn, 'edgecolor', 'none');
colormap(jet(256));
colorbar;
xlabel('t');
ylabel('x');
title('Wn(x, t) - numerical solution');
The plot of the numerical solution is shown here.
The two plotted graphs are not the same, which is proof that I did something wrong in the algorithm. The problem is, I can't find the errors. Please help me find them.
To summarize, please help me change the code so that the two plotted graphs become approximately the same. Thank you for your time.
The finite difference discretization of w_tt = a^2 * w_xx - b*w is
( w(i,j+1) - 2*w(i,j) + w(i,j-1) ) / dt^2
= a^2 * ( w(i+1,j) - 2*w(i,j) + w(i-1,j) ) / dx^2 - b*w(i,j)
In your order this gives the recursion equation
w(i,j+1) = dt^2 * ( (a/dx)^2 * ( w(i+1,j) - 2*w(i,j) + w(i-1,j) ) - b*w(i,j) )
+2*w(i,j) - w(i,j-1)
The stability condition is that at least a*dt/dx < 1. For the present parameters this is not satisfied, they give this ratio as 2.6. Increasing the time discretization to 1000 points is sufficient.
Next up is the boundary conditions. Besides the two leading columns for times 0 and dt one also needs to set the values at the boundaries for x=0 and x=1. Copy also them from the exact solution.
Wn(:,1:2) = Wa(:,1:2);
Wn(1,:)=Wa(1,:);
Wn(end,:)=Wa(end,:);
Then also correct the definition (and use) of b to that in the source
b = - (lambda^2 * a^2) + mu^2;
and the resulting numerical image looks identical to the analytical image in the color plot. The difference plot confirms the closeness

How to vectorize this five-point difference code to fast calculate derivatives of a matrix?

I am trying to calculate the partial first derivatives with respect to each of the two dimensions in a 2D matrix, i.e dF/dx and dF/dy, using the five point method. I have managed to do this successfully by looping over the points:
dF_dx = zeros(size(F));
dF_dy = zeros(size(F));
% Derivative with respect to y for each x value (Apply to all columns simultaneously)
dF_dy(2,1:(Nx-1)) = ( F(3,1:(Nx-1)) - F(1,1:(Nx-1)) )/(2*dy);
for m = 3:(Ny-2)
dF_dy(m,1:(Nx-1)) = ( F(m-2,1:(Nx-1)) - F(m+2,1:(Nx-1)) + 8*F(m+1,1:(Nx-1)) - 8*F(m-1,1:(Nx-1)) )/(12*dy);
end
dF_dy(Ny-1,1:(Nx-1)) = ( F(Ny,1:(Nx-1)) - F(Ny-2,1:(Nx-1)) )/(2*dy);
% Derivative with respect to x for each y value (Apply to all rows simultaneously)
dF_dx(2:(Ny-1),2) = ( F(2:(Ny-1),3) - F(2:(Ny-1),1) )/(2*dx);
for n = 3:(Nx-2)
dF_dx(2:(Ny-1),n) = ( F(2:(Ny-1),n-2) - F(2:(Ny-1),n+2) + 8*F(2:(Ny-1),n+1) - 8*F(2:(Ny-1),n-1) )/(12*dx);
end
dF_dx(2:(Ny-1),(Nx-1)) = ( F(2:(Ny-1),Nx) - F(2:(Ny-1),Nx-2) )/(2*dx);
Is there a clever Matlab way to vectorize this to make it much faster to execute? I have read several functions which seem like they might do the trick (such as diff(), circshift(), or maybe even kron() ), but am not sure how they might be used to solve this problem.
(My plan is to the implement this as a gpuArray later, if that is relevant for the solution).
Thank you!
1st Edit
By looking at the source code for gradient(), I was able to make the following version which is vectorized (i.e without the loop):
F = rand(5,8);
Nx = size(F,2);
Ny = size(F,1);
dx = 2;
dy = 3;
dF_dx = zeros(size(F));
dF_dy = zeros(size(F));
dF_dx(2:(Ny-1),3:Nx-2) = (F(2:(Ny-1),1:Nx-4) - F(2:(Ny-1),5:Nx) + 8*F(2:(Ny-1),4:Nx-1) - 8*F(2:(Ny-1),2:Nx-3))/(12*dx);
dF_dx(2:(Ny-1),2) = ( F(2:(Ny-1),3) - F(2:(Ny-1),1) )/(2*dx);
dF_dx(2:(Ny-1),(Nx-1)) = ( F(2:(Ny-1),Nx) - F(2:(Ny-1),Nx-2) )/(2*dx);
dF_dy(3:Ny-2,1:(Nx-1)) = (F(1:Ny-4,1:(Nx-1)) - F(5:Ny,1:(Nx-1)) + 8*F(4:Ny-1,1:(Nx-1)) - 8*F(2:Ny-3,1:(Nx-1)))/(12*dy);
dF_dy(2,1:(Nx-1)) = ( F(3,1:(Nx-1)) - F(1,1:(Nx-1)) )/(2*dy);
dF_dy((Ny-1),1:(Nx-1)) = ( F(Ny,1:(Nx-1)) - F(Ny-2,1:(Nx-1)) )/(2*dy);
Is there a way to do it without all the indexing (which I suspect is what is taking the most time)?
2nd Edit
I have now implemented Cris and chtz's suggestion and achieved this using a convolution with conv2(), as follows:
F = rand(500,600);
Nx = size(F,2);
Ny = size(F,1);
dx = 2;
dy = 3;
[dF_dx, dF_dy] = partial_derivatives(F, dx, dy, Nx, Ny)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [dF_dx, dF_dy] = partial_derivatives(F, dx, dy, Nx, Ny)
kernx = [-1 8 0 -8 1]/(12*dx); % Convolution kernel for x dimension
kerny = [-1;8;0;-8;1]/(12*dy); % Convolution kernel for y dimension
%%%%%%%% Partial derivative across x dimension %%%%%%%%
dF_dx = conv2(F, kernx, 'same') ; % Internal mesh points (five-point method)
% Second and penultimate mesh points (two-point method)
dF_dx(2:(Ny-1),2) = ( F(2:(Ny-1),3) - F(2:(Ny-1),1) )/(2*dx);
dF_dx(2:(Ny-1),(Nx-1)) = ( F(2:(Ny-1),Nx) - F(2:(Ny-1),Nx-2) )/(2*dx);
dF_dx(:,[1 Nx]) = 0; % Set boundary conditions
dF_dx([1 Ny],:) = 0; %
%%%%%%%% Partial derivative across x dimension %%%%%%%%
dF_dy = conv2(F, kerny, 'same') ; % Internal mesh points (five-point method)
% Second and penultimate mesh points (two-point method)
dF_dy(2,1:(Nx-1)) = ( F(3,1:(Nx-1)) - F(1,1:(Nx-1)) )/(2*dy);
dF_dy((Ny-1),1:(Nx-1)) = ( F(Ny,1:(Nx-1)) - F(Ny-2,1:(Nx-1)) )/(2*dy);
dF_dy(:,Nx) = 0; % Set boundary conditions
dF_dy([1 Ny],:) = 0; %
end
As a gpuArray, this does not provide any noticable improvement over the vectorized version in the 1st Edit. Is there an obvious way to improve it? Thanks.

CLAPACK f2c vs MKL : Matrix multiplication performance issue

I am looking for a solution to accelerate the performances of my program with a lot of matrix multiplications. So I hace replaced the CLAPACK f2c libraries with the MKL. Unfortunately, the performances results was not the expected ones.
After investigation, I faced to a block triangular matrix which gives bad performances principaly when i try to multiply it with its transpose.
In order to simplify the problem I did my tests with an identity matrix of 5000 elements ( I found the same comportment )
NAME
Matrix [Size,Size]
CLAPACK f2c(second)
MKL_GNU_THREAD (second)
Multiplication of an identity matrix by itself
5000
0.076536
1.090167
Multiplication of dense matrix by its transpose
5000*5000
93.71569
1.113872
We can see that the CLAPACK f2c multiplication of an identity matrix is faster ( x14) than the MKL.
We can note an acceleration multipliy by 84 between the MKL and CLAPACK f2c dense matrix multiplication.
Moreover, the difference of the time consumption during the muliplication of a dense*denseT and an identity matrix is very slim.
So I tried to found in CLAPACK f2c DGEMM where is the optimization for the multiplication of a parse matrix, and I found a condition on null values.
/* Form C := alpha*A*B + beta*C. */
i__1 = *n;
for (j = 1; j <= i__1; ++j) {
if (*beta == 0.) {
i__2 = *m;
for (i__ = 1; i__ <= i__2; ++i__) {
c__[i__ + j * c_dim1] = 0.;
/* L50: */
}
} else if (*beta != 1.) {
i__2 = *m;
for (i__ = 1; i__ <= i__2; ++i__) {
c__[i__ + j * c_dim1] = *beta * c__[i__ + j * c_dim1];
/* L60: */
}
}
i__2 = *k;
for (l = 1; l <= i__2; ++l) {
if (b[l + j * b_dim1] != 0.) { // HERE THE CONDITION
temp = *alpha * b[l + j * b_dim1];
i__3 = *m;
for (i__ = 1; i__ <= i__3; ++i__) {
c__[i__ + j * c_dim1] += temp * a[i__ + l *
a_dim1];
/* L70: */
} // ENF of condition
}
When I removed this condition I got this kind of results :
NAME
Matrix [Size,Size]
CLAPACK f2c (second)
MKL_GNU_THREAD (second)
Multiplication of an identity matrix by itself
5000
93.210873
1.090167
Multiplication of dense matrix by its transpose
5000*5000
93.71569
1.113872
Here we note that the multiplication of a dense and an identity is
very clause in term of performances, and now the MKL shows the best
performances.
The MKL multiplication seems to be faster than CLAPACK f2c but only with
the same number of non-null elements.
I have two ideas on this results :
The 0 optimization is not activated by default in the MKL
The MKL cannot see the 0 (double) values inside my sparse matrices .
May you tell me why the MKL shows performance issues ?
Do you have any tips in order to bypass the multiplication on null elements with dgemm ?
I did a conservion in CSR and it shows better performances but in is case why lapacke_dgemm is worst than f2c_dgemmm.
Thank you for your help :)
MKL_VERBOSE Intel(R) MKL 2021.0 Update 1 Product build 20201104 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 3.50GHz lp64 gnu_thread

Why Amdahl Law on serial and parallel fractions does not provide a theoretical speedup of 4 on quad-core CPU?

I have a code ( the Floyd-Warshall algorithm for the shortest path in an NxN matrix ),with three for-loops, one within the other and with the same number of cycles.
In the last for I have an assignment via a ternary-operation = <bool> ? <val1> : <val2> - based on a comparison and if it is True or not.
I used an OpenMP to parallelize the second for with a #pragma omp parallel for.
I can't compute the parallel percentage and serial percentage of the code to successfully apply the Amdahl Law to recover the theoretical speedup.
for (k = 0; k < N; k++)
#pragma omp parallel for private(i,j) shared(x) num_threads(4)
for (i = 0; i < N; i++){
for (j = 0; j < N; j++) {
x[i][j] = x[i][j] < (x[i][k] + x[k][j]) ? x[i][j] : (x[i][k] + x[k][j]) ;
}
}
I'm using four cores, so I expect a theoretical speedup of 4.
X[i][j] is the matrix, where every element acts as the weight of the edge which connects nodes i and j; it is the macro INF ( infinite ) if they're not connected.
TL;DR:
it is great that universities spend more time in Amdahl's Law practical examples to show, how easily the marketing girls and boys create the false expectations on multi-core and many-core toys.
That said, let's define the test-case:
The problem in Floyd-Warshall Processing could be structured into:
process launch overheads
data-structures memory allocations + setup
data-values initialisations
Floyd-Warshall specific conditions ( Zeroised diagonal, etc. )
Section timing tools
Section with Floyd-Warshall O(N^3) process with a potential Improvement-Under-Test [IUT]
Amdahl's Law declares an ultimate limit for any Process overall "improvement", given the section [6] contains an [IUT] to be evaluated, while the overall "improvement" will NEVER become better than ~ 1 / ( ( 1 - IUT ) + ( IUT / N ) ).
Kind readers are left to test and record the timing for the ( 1 - IUT ) part of the experiment.
How to compute an effect of the potentially parallelised [IUT] in the section [6] of the code?
First, let's focus on what happens in the originally posted code, in a pure SEQ ( serial ) code-execution flow:
The inital snippet already had some space for performance improvement, even without OpenMP based attempt to distribute the task onto larger resources-base:
for ( k = 0; k < N; k++ )
for ( i = 0; i < N; i++ ){
for ( j = 0; j < N; j++ ){
x[i][j] = x[i][j] > ( x[i][k] + x[k][j] ) // .TEST <bool>
? ( x[i][k] + x[k][j] ) // .ASSIGN <val1>
: x[i][j]; // .ASSIGN <val2>
}
}
If this were run as a purely SEQ solo or under an attempt to harness the #pragma omp, as was posted in the original question in both cases the Amdahl's Law will show ZERO or even "negative" improvement.
Why? Why not a speed-up of 4? Why even no improvement at all?
Because the code was instructed to run "mechanically" repeated on all resources, running exactly the same, identical scope of the task for full 4 times, shoulder-on-shoulder, each one besides the others, so the 4-times more resources did not bring any expected positive effect, as they have together spent the same time to co-run all the parts of the task 4-times independently each on the others' potential "help" ( if not worse, due to some cases, when a resource contention was observed during the whole task running ).
So, let's rather use the OpenMP strengths to split the task and let each of the resource process just the adequate portion of the scope of the algorithm ( thanks to the Floyd-Warshall algorithm, as this is a lot forgiving in this direction and allows that, because it's processing scheme, even when negative weights are allowed, is non-intervening, so no hostile barriers, syncs, critical-section are needed to propagate anything among the threads )
So, can we get any OpenMP benefit here? Oh yes, a lot:
#include "omp.h" // .MUST SET a gcc directive // "-fopenmp"
// --------------------------------------------------------[1] ref. above
void main(){
int i, j, k;
const int N = 100;
int x[100][100];
// --------------------------------------------------------[2] ref. above
// --------------------------------------------------------[3] ref. above
// --------------------------------------------------------[4] ref. above
for ( k = 0; k < N; k++ )
{
// --------------------------------------------------------[5] ref. above
//------------------------------------------------------[6]----- OMP
// ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
// PARALLEL is not precise, "just"-CONCURRENT is EXACT IN THE SECTION LEVEL BELOW
#pragma omp parallel for private(i,j) shared(x) num_threads(4)
for ( i = 0; i < N; i++ ){ // .MUST incl.actual k-th ROW, in case NEG weights are permitted
int nTHREADs = omp_get_num_threads(); // .GET "total" number of spawned threads
int tID = omp_get_thread_num(); // .GET "own" tID# {0,1,..omp_get_num_threads()-1} .AVOID dumb repeating the same .JOB by all spawned threads
for ( j = tID; j < N; j += nTHREADs ){ // .FOR WITH tID#-offset start + strided .INC STEP // .MUST incl.actual k-th COL, in case NEG weights are permitted
// - - - - - - - - - - - - - - - - - - - - - - - -
// SINCE HERE: // .JOB WAS SPLIT 2 tID#-ed, NON-OVERLAPPING tasks
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ // .N.B: dumb "just"-CONCURRENT processing is O.K. here
// ................................................ // 1-thread .INC STEP +1 a sure ZERO Amdahl-Law effect ( will bear an adverse penalty from use-less omp_get_*() calls )
// °.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°. // 2-threads .INC STEP +2 may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP / 2 ) ) if enough free CPU-resources
// '-.'-.'-.'-.'-.'-.'-.'-.'-.'-.'-.'-.'-.'-.'-.'-. // 3-threads .INC STEP +3 may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP / 3 ) ) if enough free CPU-resources
// ^'-.^'-.^'-.^'-.^'-.^'-.^'-.^'-.^'-.^'-.^'-.^'-. // 4-threads .INC STEP +4 may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP / 4 ) ) if enough free CPU-resources
// o1234567o1234567o1234567o1234567o1234567o1234567 // 8-threads .INC STEP +8 may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP / 8 ) ) if enough free CPU-resources
// o123456789ABCDEFo123456789ABCDEFo123456789ABCDEF // 16-threads .INC STEP +16 may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP / 16 ) ) if enough free CPU-resources
// o123456789ABCDEFGHIJKLMNOPQRSTUVo123456789ABCDEF // 32-threads .INC STEP +32 may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP / 32 ) ) if enough free CPU-resources
// o123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijkl // 64-threads .INC STEP +64 may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP / 64 ) ) if enough free CPU-resources
// o123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijkl // 128-threads .INC STEP +128 may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP / 128 ) ) if enough free CPU-resources
int aPair = x[i][k] + x[k][j]; // .MUST .CALC ADD( x_ik, x_kj ) to TEST // .MAY smart re-use in case .GT. and ASSIGN will have to take a due place
if ( x[i][j] > aPair ) x[i][j] = aPair; // .IFF .UPD // .AVOID dumb re-ASSIGN(s) of self.value(s) to self
// - - - - - - - - - - - - - - - - - - - - - - - -
}
}
// ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
}// --------------------------------------------------------------- OMP
return;
}
Understanding the OpenMP beyond an Amdahl's Law predicted limit:
The proposed approach opens some potential for further exploration by some funny experimentation:
setup the number of threads as 1 ( to use as an experimentation baseline )
setup the number of threads as ( nCPUcores / 2 )
setup the number of threads as ( nCPUcores - 1 )
setup the number of threads as ( nCPUcores ) + run disk defragmentation/compression
setup the number of threads as ( nCPUcores * 2 )
setup the number of threads as ( nCPUcores * 2 ) + enforce CPU-affinity on just 2 CPU-cores
setup the number of threads as ( nCPUcores * 20 )
setup the number of rows/cols N ~ { 1.000 | 10.000 | 100.000 | 1.000.000 } and check the effects
The two inner loops and the body of the innermost loop are executed in serial on each core.That's because you marked the outer loop to be executed in parallel.
But:
I would expect far less than a speedup of 4. There is always a communication-overhead
The body in the innermost loop uses the same matrix for all cores and also modifies the matrix. therefore the changes in the matrix must be propageted to all other cores. This might in the lead to the following problems:
CPU-caches might be useless because the cached array-elements might be changed by another core which might have a different cache.
In worst case all modifications in the matrix depend on the previous change (I haven't check this for your case). In this case no parallel execution is possible => no speedup at all for more than one core.
You should check if it is possible to change your algorithm in a way that not intersecting partial matrixes can be processed. You will get the best speedup if the cores work on separate not intersecting data.
Since there is nearly no effort in doing so you should definitively profile the code and variants of it.

Explanation of the calc_delta_mine function

I am currently reading "Linux Kernel Development" by Robert Love, and I got a few questions about the CFS.
My question is how calc_delta_mine calculates :
delta_exec_weighted= (delta_exec * weight)/lw->weight
I guess it is done by two steps :
calculation the (delta_exec * 1024) :
if (likely(weight > (1UL << SCHED_LOAD_RESOLUTION)))
tmp = (u64)delta_exec * scale_load_down(weight);
else
tmp = (u64)delta_exec;
calculate the /lw->weight ( or * lw->inv_weight ) :
if (!lw->inv_weight) {
unsigned long w = scale_load_down(lw->weight);
if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
lw->inv_weight = 1;
else if (unlikely(!w))
lw->inv_weight = WMULT_CONST;
else
lw->inv_weight = WMULT_CONST / w;
}
/*
* Check whether we'd overflow the 64-bit multiplication:
*/
if (unlikely(tmp > WMULT_CONST))
tmp = SRR(SRR(tmp, WMULT_SHIFT/2) * lw->inv_weight,
WMULT_SHIFT/2);
else
tmp = SRR(tmp * lw->inv_weight, WMULT_SHIFT);
return (unsigned long)min(tmp, (u64)(unsigned long)LONG_MAX);
The SRR (Shift right and round) macro is defined via :
#define SRR(x, y) (((x) + (1UL << ((y) - 1))) >> (y))
And the other MACROS are defined :
#if BITS_PER_LONG == 32
# define WMULT_CONST (~0UL)
#else
# define WMULT_CONST (1UL << 32)
#endif
#define WMULT_SHIFT 32
Can someone please explain how exactly the SRR works and how does this check the 64-bit multiplication overflow?
And please explain the definition of the MACROS in this function((~0UL) ,(1UL << 32))?
The code you posted is basically doing calculations using 32.32 fixed-point arithmetic, where a single 64-bit quantity holds the integer part of the number in the high 32 bits, and the decimal part of the number in the low 32 bits (so, for example, 1.5 is 0x0000000180000000 in this system). WMULT_CONST is thus an approximation of 1.0 (using a value that can fit in a long for platform efficiency considerations), and so dividing WMULT_CONST by w computes 1/w as a 32.32 value.
Note that multiplying two 32.32 values together as integers produces a result that is 232 times too large; thus, WMULT_SHIFT (=32) is the right shift value needed to normalize the result of multiplying two 32.32 values together back down to 32.32.
The necessity of using this improved precision for scheduling purposes is explained in a comment in sched/sched.h:
/*
* Increase resolution of nice-level calculations for 64-bit architectures.
* The extra resolution improves shares distribution and load balancing of
* low-weight task groups (eg. nice +19 on an autogroup), deeper taskgroup
* hierarchies, especially on larger systems. This is not a user-visible change
* and does not change the user-interface for setting shares/weights.
*
* We increase resolution only if we have enough bits to allow this increased
* resolution (i.e. BITS_PER_LONG > 32). The costs for increasing resolution
* when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
* increased costs.
*/
As for SRR, mathematically, it computes the rounded result of x / 2y.
To round the result of a division x/q you can calculate x + q/2 floor-divided by q; this is what SRR does by calculating x + 2y-1 floor-divided by 2y.

Resources