I am building my first large-scale MATLAB program, and I've managed to write original vectorized code for everything so far, until I tried to create an image representing vector density in stereographic projection. After a couple of failed attempts I went to the MathWorks File Exchange site and found an open-source program that fits my needs, courtesy of Malcolm Mclean. With a test matrix his function produces something like this:
And while this is almost exactly what I wanted, his code relies on a triply nested for-loop. On my workstation, a test data matrix of size 25000x2 took 65 seconds in this section of code. This is unacceptable, since I will be scaling up to data matrices of size 500000x2 in my project.
So far I've been able to vectorize the innermost loop (which was the longest/worst loop), but I would like to continue and be rid of the loops entirely if possible. Here is Malcolm's original code that I need to vectorize:
dmap = zeros(height, width); % height, width: scalar with default value = 32
for ii = 0: height - 1 % 32 iterations of this loop
    yi = limits(3) + ii * deltay + deltay/2; % limits(3) & deltay: scalars
    for jj = 0 : width - 1 % 32 iterations of this loop
        xi = limits(1) + jj * deltax + deltax/2; % limits(1) & deltax: scalars
        dd = 0;
        for kk = 1: length(x) % up to 500,000 iterations in this loop
            dist2 = (x(kk) - xi)^2 + (y(kk) - yi)^2;
            dd = dd + 1 / ( dist2 + fudge); % fudge is a scalar
        end
        dmap(ii+1,jj+1) = dd;
    end
end
And here it is with the changes I've already made to the innermost loop (which was the biggest drain on efficiency). This cuts the time from 65 seconds down to 12 seconds on my machine for the same test matrix, which is better but still far slower than I would like.
dmap = zeros(height, width);
for ii = 0: height - 1
    yi = limits(3) + ii * deltay + deltay/2;
    for jj = 0 : width - 1
        xi = limits(1) + jj * deltax + deltax/2;
        dist2 = (x - xi) .^ 2 + (y - yi) .^ 2;
        dmap(ii + 1, jj + 1) = sum(1 ./ (dist2 + fudge));
    end
end
So my main question: are there any further changes I can make to optimize this code? Or is there an alternative way to approach the problem? I've considered using C++ or F# instead of MATLAB for this section of the program, and I may do so if I cannot get to a reasonable efficiency level with the MATLAB code.
Please also note that at this point I don't have ANY additional toolboxes, if I did then I know this would be trivial (using hist3 from the statistics toolbox for example).
Memory-consuming solution
yi = limits(3) + deltay * ( 1:height ) - .5 * deltay;
xi = limits(1) + deltax * ( 1:width ) - .5 * deltax;
dx = bsxfun( @minus, x(:), xi ) .^ 2;
dy = bsxfun( @minus, y(:), yi ) .^ 2;
dist2 = bsxfun( @plus, permute( dy, [2 3 1] ), permute( dx, [3 2 1] ) );
dmap = sum( 1./(dist2 + fudge ) , 3 );
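To get a feel for why this version is memory hungry: dist2 holds height x width x numel(x) values at once. A rough estimate, sketched here in Python rather than MATLAB, assuming 8-byte doubles and the default 32x32 grid from the question (the function name is made up for illustration):

```python
def dist2_bytes(height, width, n_points, bytes_per_element=8):
    """Size in bytes of a height x width x n_points array of doubles."""
    return height * width * n_points * bytes_per_element

# Grid size and point counts taken from the question.
for n in (25_000, 500_000):
    gib = dist2_bytes(32, 32, n) / 2**30
    print(f"N = {n}: about {gib:.2f} GiB for dist2 alone")
```

Temporaries created while evaluating the expression come on top of that, so the real peak is likely higher.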
EDIT
Handling extremely large x and y by breaking the operation into blocks:
blockSize = 50000; % process up to XX elements at once
dmap = 0;
yi = limits(3) + deltay * ( 1:height ) - .5 * deltay;
xi = limits(1) + deltax * ( 1:width ) - .5 * deltax;
bi = 1;
while bi <= numel(x)
    % take a block of x and y
    bx = x( bi:min(end, bi + blockSize - 1) );
    by = y( bi:min(end, bi + blockSize - 1) );
    dx = bsxfun( @minus, bx(:), xi ) .^ 2;
    dy = bsxfun( @minus, by(:), yi ) .^ 2;
    dist2 = bsxfun( @plus, permute( dy, [2 3 1] ), permute( dx, [3 2 1] ) );
    dmap = dmap + sum( 1./(dist2 + fudge ) , 3 );
    bi = bi + blockSize;
end
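The blocking works because each grid cell is just a sum over points, so the sum can be split into chunks and accumulated. A small pure-Python sketch of that idea (not a translation of the bsxfun code; function names and test data are made up for illustration) that checks the blocked result against the plain loops:

```python
import random

def density_map_naive(xs, ys, x_centers, y_centers, fudge):
    """Reference: for every grid cell, sum over every point."""
    return [[sum(1.0 / ((x - xc) ** 2 + (y - yc) ** 2 + fudge)
                 for x, y in zip(xs, ys))
             for xc in x_centers]
            for yc in y_centers]

def density_map_blocked(xs, ys, x_centers, y_centers, fudge, block_size):
    """Accumulate the same sums one block of points at a time."""
    dmap = [[0.0] * len(x_centers) for _ in y_centers]
    for start in range(0, len(xs), block_size):
        bx = xs[start:start + block_size]   # last block may be shorter
        by = ys[start:start + block_size]
        for i, yc in enumerate(y_centers):
            for j, xc in enumerate(x_centers):
                dmap[i][j] += sum(1.0 / ((x - xc) ** 2 + (y - yc) ** 2 + fudge)
                                  for x, y in zip(bx, by))
    return dmap

random.seed(0)
xs = [random.uniform(-1, 1) for _ in range(1000)]
ys = [random.uniform(-1, 1) for _ in range(1000)]
centers = [-0.75, -0.25, 0.25, 0.75]
a = density_map_naive(xs, ys, centers, centers, 0.01)
b = density_map_blocked(xs, ys, centers, centers, 0.01, block_size=300)
max_diff = max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))
print(max_diff)  # only floating-point rounding differences are expected
```

The block size only trades memory for the number of passes; it should not change the result beyond rounding.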
This is a good example of why starting a loop from 1 matters. The only reason that ii and jj start at 0 is to make the ii * deltay and jj * deltax terms vanish on the first iteration, but this introduces sequentiality in the dmap indexing, which prevents parallelization.
Now, by rewriting the loops you could use parfor() after opening a matlabpool (parpool in newer MATLAB releases):
dmap = zeros(height, width);
yi = limits(3) + deltay*(1:height) - .5*deltay;
matlabpool 8
parfor ii = 1: height
    for jj = 1: width
        xi = limits(1) + (jj-1) * deltax + deltax/2;
        dist2 = (x - xi) .^ 2 + (y - yi(ii)) .^ 2;
        dmap(ii, jj) = sum(1 ./ (dist2 + fudge));
    end
end
matlabpool close
Keep in mind that opening and closing the pool has significant overhead (10 seconds on my Intel Core Duo T9300, Vista 32-bit, MATLAB 2013a).
PS. I am not sure whether the inner loop instead of the outer one can be meaningfully parallelized. You can try to switch the parfor to the inner one and compare speeds (I would recommend going for the big matrix immediately since you are already running in 12 seconds and the overhead is almost as big).
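The reason the outer loop parallelizes at all is that each row of dmap depends only on its own yi and on the shared, read-only x and y. A minimal Python sketch of that structure, using a thread pool purely for illustration (function names are made up, and in standard CPython threads will typically not speed up this kind of arithmetic; the point is only that rows can be handed out independently):

```python
from concurrent.futures import ThreadPoolExecutor

def density_row(yc, xs, ys, x_centers, fudge):
    """One row of the map: uses its own yc and no other row's result."""
    return [sum(1.0 / ((x - xc) ** 2 + (y - yc) ** 2 + fudge)
                for x, y in zip(xs, ys))
            for xc in x_centers]

def density_map_parallel(xs, ys, x_centers, y_centers, fudge, workers=4):
    """Hand each row to a worker; map() returns the rows in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rows = pool.map(lambda yc: density_row(yc, xs, ys, x_centers, fudge),
                        y_centers)
        return list(rows)

xs = [0.0, 0.5, -0.5]
ys = [0.0, 0.5, 0.5]
grid = [-0.5, 0.0, 0.5]
print(density_map_parallel(xs, ys, grid, grid, fudge=0.1))
```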
Alternatively, this problem can be solved using kernel density estimation techniques. This is part of the Statistics Toolbox, or there's this KDE implementation by Zdravko Botev (no toolboxes required).
For the example code below, I get 0.3 seconds for N = 500000, or 0.7 seconds for N = 1000000.
N = 500000;
data = [randn(N,2); rand(N,1)+3.5, randn(N,1);]; % 2 overlaid distrib
tic; [bandwidth,density,X,Y] = kde2d(data); toc;
imagesc(density);
I'm aware of how to check if two circles are intersecting one another. However, sometimes the circles move too fast and end up avoiding collision on the next frame.
My current solution to the problem is to check circle-circle collision an arbitrary number of times between the previous position and the current position.
Is there a mathematical way to find the time it takes for the two circles to collide? If I could get that time value, I could move the circles to their positions at that time and then collide them at that point.
Edit: Constant Velocity
I'm assuming the motion of the circles is linear. Let's say the position of circle A's centre is given by the vector equation Ca = Oa + t*Da where
Ca = (Cax, Cay) is the current position
Oa = (Oax, Oay) is the starting position
t is the elapsed time
Da = (Dax, Day) is the displacement per unit of time (velocity).
Likewise for circle B's centre: Cb = Ob + t*Db.
Then you want to find t such that ||Ca - Cb|| = (ra + rb) where ra and rb are the radii of circles A and B respectively.
Squaring both sides:
||Ca-Cb||^2 = (ra+rb)^2
and expanding:
(Oax + t*Dax - Obx - t*Dbx)^2 + (Oay + t*Day - Oby - t*Dby)^2 = (ra + rb)^2
From that you should get a quadratic polynomial that you can solve for t (if such a t exists).
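Writing the relative start position and relative velocity as single vectors makes the quadratic easier to see. The symbols \Delta O, \Delta D and R below are shorthand introduced just for this sketch, not part of the notation above:

```latex
\begin{aligned}
\Delta O &= O_a - O_b, \qquad \Delta D = D_a - D_b, \qquad R = r_a + r_b \\
\lVert \Delta O + t\,\Delta D \rVert^2 &= R^2 \\
(\Delta D \cdot \Delta D)\,t^2 + 2\,(\Delta O \cdot \Delta D)\,t + (\Delta O \cdot \Delta O - R^2) &= 0
\end{aligned}
```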
Here is a way to solve for t the equation in Andrew Durward's excellent answer.
To just plug in values one can skip to the bottom.
(Oax + t*Dax - Obx - t*Dbx)^2 + (Oay + t*Day - Oby - t*Dby)^2 = (ra + rb)^2
(Oax * (Oax + t*Dax - Obx - t*Dbx) + t*Dax * (Oax + t*Dax - Obx - t*Dbx)
- Obx * (Oax + t*Dax - Obx - t*Dbx) - t*Dbx * (Oax + t*Dax - Obx - t*Dbx))
+
(Oay * (Oay + t*Day - Oby - t*Dby) + t*Day * (Oay + t*Day - Oby - t*Dby)
- Oby * (Oay + t*Day - Oby - t*Dby) - t*Dby * (Oay + t*Day - Oby - t*Dby))
=
(ra + rb)^2
Oax^2 + (Oax * t*Dax) - (Oax * Obx) - (Oax * t*Dbx)
+ (t*Dax * Oax) + (t*Dax)^2 - (t*Dax * Obx) - (t*Dax * t*Dbx)
- (Obx * Oax) - (Obx * t*Dax) + Obx^2 + (Obx * t*Dbx)
- (t*Dbx * Oax) - (t*Dbx * t*Dax) + (t*Dbx * Obx) + (t*Dbx)^2
+
Oay^2 + (Oay * t*Day) - (Oay * Oby) - (Oay * t*Dby)
+ (t*Day * Oay) + (t*Day)^2 - (t*Day * Oby) - (t*Day * t*Dby)
- (Oby * Oay) - (Oby * t*Day) + Oby^2 + (Oby * t*Dby)
- (t*Dby * Oay) - (t*Dby * t*Day) + (t*Dby * Oby) + (t*Dby)^2
=
(ra + rb)^2
t^2 * (Dax^2 + Dbx^2 - (Dax * Dbx) - (Dbx * Dax)
+ Day^2 + Dby^2 - (Day * Dby) - (Dby * Day))
+
t * ((Oax * Dax) - (Oax * Dbx) + (Dax * Oax) - (Dax * Obx)
- (Obx * Dax) + (Obx * Dbx) - (Dbx * Oax) + (Dbx * Obx)
+ (Oay * Day) - (Oay * Dby) + (Day * Oay) - (Day * Oby)
- (Oby * Day) + (Oby * Dby) - (Dby * Oay) + (Dby * Oby))
+
Oax^2 - (Oax * Obx) - (Obx * Oax) + Obx^2
+ Oay^2 - (Oay * Oby) - (Oby * Oay) + Oby^2 - (ra + rb)^2
=
0
Now it's a standard form quadratic equation:
ax^2 + bx + c = 0
solved like this:
x = (-b ± sqrt(b^2 - 4ac)) / (2a) // this x here is t
where--
a = Dax^2 + Dbx^2 + Day^2 + Dby^2 - (2 * Dax * Dbx) - (2 * Day * Dby)
b = (2 * Oax * Dax) - (2 * Oax * Dbx) - (2 * Obx * Dax) + (2 * Obx * Dbx)
+ (2 * Oay * Day) - (2 * Oay * Dby) - (2 * Oby * Day) + (2 * Oby * Dby)
c = Oax^2 + Obx^2 + Oay^2 + Oby^2
- (2 * Oax * Obx) - (2 * Oay * Oby) - (ra + rb)^2
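t exists if the condition below holds; the smaller root is then the time of first contact, and the collision only happens in the future if that root is non-negative (and falls within the current frame)--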
(a != 0) && (b^2 >= 4ac)
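For plugging in values, here is a minimal Python sketch of the same calculation. It groups the terms by relative position and relative velocity, which gives the same a, b and c as above. The function name and the tuple-based arguments are my own choices, and the early return for circles that already overlap at t = 0 is an assumption about what a caller would want:

```python
import math

def time_of_impact(oa, da, ra, ob, db, rb):
    """Earliest t >= 0 at which the circles touch, or None if they never do."""
    ox, oy = oa[0] - ob[0], oa[1] - ob[1]   # relative start position
    dx, dy = da[0] - db[0], da[1] - db[1]   # relative velocity
    r = ra + rb
    a = dx * dx + dy * dy
    b = 2.0 * (ox * dx + oy * dy)
    c = ox * ox + oy * oy - r * r
    if c <= 0.0:
        return 0.0                          # already touching or overlapping
    if a == 0.0:
        return None                         # no relative motion
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return None                         # paths never get within ra + rb
    t = (-b - math.sqrt(disc)) / (2.0 * a)  # smaller root = first contact
    return t if t >= 0.0 else None          # negative root: contact was in the past

# Circle A moves right at 1 unit per time step towards a stationary circle B.
print(time_of_impact((0, 0), (1, 0), 1, (10, 0), (0, 0), 1))
```

In a game loop you would compare the returned t with the frame duration and only resolve the collision if it falls inside the current frame.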
You can predict the collision by using each circle's direction vector and speed: this gives you the next steps, and when they will collide (if they do at all).
You just need to check a line-crossing algorithm on those paths to detect that...
I am starting to use the PETSc library to solve linear systems of equations in parallel. I have installed all the packages and have successfully built and run the examples in the petsc/src/ksp/ksp/examples/tutorials/ folder, for example ex.c
But I couldn't understand how to fill the matrices A, X and B by reading them, for example, from a file.
Here is the code from the ex2.c file:
/* Program usage: mpiexec -n <procs> ex2 [-help] [all PETSc options] */
static char help[] = "Solves a linear system in parallel with KSP.\n\
Input parameters include:\n\
-random_exact_sol : use a random exact solution vector\n\
-view_exact_sol : write exact solution vector to stdout\n\
-m <mesh_x> : number of mesh points in x-direction\n\
-n <mesh_n> : number of mesh points in y-direction\n\n";
/*T
Concepts: KSP^basic parallel example;
Concepts: KSP^Laplacian, 2d
Concepts: Laplacian, 2d
Processors: n
T*/
/*
Include "petscksp.h" so that we can use KSP solvers. Note that this file
automatically includes:
petscsys.h - base PETSc routines petscvec.h - vectors
petscmat.h - matrices
petscis.h - index sets petscksp.h - Krylov subspace methods
petscviewer.h - viewers petscpc.h - preconditioners
*/
#include <C:\PETSC\include\petscksp.h>
#undef __FUNCT__
#define __FUNCT__ "main"
int main(int argc,char **args)
{
Vec x,b,u; /* approx solution, RHS, exact solution */
Mat A; /* linear system matrix */
KSP ksp; /* linear solver context */
PetscRandom rctx; /* random number generator context */
PetscReal norm; /* norm of solution error */
PetscInt i,j,Ii,J,Istart,Iend,m = 8,n = 7,its;
PetscErrorCode ierr;
PetscBool flg = PETSC_FALSE;
PetscScalar v;
#if defined(PETSC_USE_LOG)
PetscLogStage stage;
#endif
PetscInitialize(&argc,&args,(char *)0,help);
ierr = PetscOptionsGetInt(PETSC_NULL,"-m",&m,PETSC_NULL);CHKERRQ(ierr);
ierr = PetscOptionsGetInt(PETSC_NULL,"-n",&n,PETSC_NULL);CHKERRQ(ierr);
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Compute the matrix and right-hand-side vector that define
the linear system, Ax = b.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/*
Create parallel matrix, specifying only its global dimensions.
When using MatCreate(), the matrix format can be specified at
runtime. Also, the parallel partitioning of the matrix is
determined by PETSc at runtime.
Performance tuning note: For problems of substantial size,
preallocation of matrix memory is crucial for attaining good
performance. See the matrix chapter of the users manual for details.
*/
ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,m*n,m*n);CHKERRQ(ierr);
ierr = MatSetFromOptions(A);CHKERRQ(ierr);
ierr = MatMPIAIJSetPreallocation(A,5,PETSC_NULL,5,PETSC_NULL);CHKERRQ(ierr);
ierr = MatSeqAIJSetPreallocation(A,5,PETSC_NULL);CHKERRQ(ierr);
/*
Currently, all PETSc parallel matrix formats are partitioned by
contiguous chunks of rows across the processors. Determine which
rows of the matrix are locally owned.
*/
ierr = MatGetOwnershipRange(A,&Istart,&Iend);CHKERRQ(ierr);
/*
Set matrix elements for the 2-D, five-point stencil in parallel.
- Each processor needs to insert only elements that it owns
locally (but any non-local elements will be sent to the
appropriate processor during matrix assembly).
- Always specify global rows and columns of matrix entries.
Note: this uses the less common natural ordering that orders first
all the unknowns for x = h then for x = 2h etc; Hence you see J = Ii +- n
instead of J = I +- m as you might expect. The more standard ordering
would first do all variables for y = h, then y = 2h etc.
*/
ierr = PetscLogStageRegister("Assembly", &stage);CHKERRQ(ierr);
ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
for (Ii=Istart; Ii<Iend; Ii++) {
v = -1.0; i = Ii/n; j = Ii - i*n;
if (i>0) {J = Ii - n; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
if (i<m-1) {J = Ii + n; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
if (j>0) {J = Ii - 1; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
if (j<n-1) {J = Ii + 1; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
v = 4.0; ierr = MatSetValues(A,1,&Ii,1,&Ii,&v,INSERT_VALUES);CHKERRQ(ierr);
}
/*
Assemble matrix, using the 2-step process:
MatAssemblyBegin(), MatAssemblyEnd()
Computations can be done while messages are in transition
by placing code between these two statements.
*/
ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
ierr = PetscLogStagePop();CHKERRQ(ierr);
/* A is symmetric. Set symmetric flag to enable ICC/Cholesky preconditioner */
ierr = MatSetOption(A,MAT_SYMMETRIC,PETSC_TRUE);CHKERRQ(ierr);
/*
Create parallel vectors.
- We form 1 vector from scratch and then duplicate as needed.
- When using VecCreate(), VecSetSizes and VecSetFromOptions()
in this example, we specify only the
vector's global dimension; the parallel partitioning is determined
at runtime.
- When solving a linear system, the vectors and matrices MUST
be partitioned accordingly. PETSc automatically generates
appropriately partitioned matrices and vectors when MatCreate()
and VecCreate() are used with the same communicator.
- The user can alternatively specify the local vector and matrix
dimensions when more sophisticated partitioning is needed
(replacing the PETSC_DECIDE argument in the VecSetSizes() statement
below).
*/
ierr = VecCreate(PETSC_COMM_WORLD,&u);CHKERRQ(ierr);
ierr = VecSetSizes(u,PETSC_DECIDE,m*n);CHKERRQ(ierr);
ierr = VecSetFromOptions(u);CHKERRQ(ierr);
ierr = VecDuplicate(u,&b);CHKERRQ(ierr);
ierr = VecDuplicate(b,&x);CHKERRQ(ierr);
/*
Set exact solution; then compute right-hand-side vector.
By default we use an exact solution of a vector with all
elements of 1.0; Alternatively, using the runtime option
-random_exact_sol forms a solution vector with random components.
*/
ierr = PetscOptionsGetBool(PETSC_NULL,"-random_exact_sol",&flg,PETSC_NULL);CHKERRQ(ierr);
if (flg) {
ierr = PetscRandomCreate(PETSC_COMM_WORLD,&rctx);CHKERRQ(ierr);
ierr = PetscRandomSetFromOptions(rctx);CHKERRQ(ierr);
ierr = VecSetRandom(u,rctx);CHKERRQ(ierr);
ierr = PetscRandomDestroy(&rctx);CHKERRQ(ierr);
} else {
ierr = VecSet(u,1.0);CHKERRQ(ierr);
}
ierr = MatMult(A,u,b);CHKERRQ(ierr);
/*
View the exact solution vector if desired
*/
flg = PETSC_FALSE;
ierr = PetscOptionsGetBool(PETSC_NULL,"-view_exact_sol",&flg,PETSC_NULL);CHKERRQ(ierr);
if (flg) {ierr = VecView(u,PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);}
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Create the linear solver and set various options
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/*
Create linear solver context
*/
ierr = KSPCreate(PETSC_COMM_WORLD,&ksp);CHKERRQ(ierr);
/*
Set operators. Here the matrix that defines the linear system
also serves as the preconditioning matrix.
*/
ierr = KSPSetOperators(ksp,A,A,DIFFERENT_NONZERO_PATTERN);CHKERRQ(ierr);
/*
Set linear solver defaults for this problem (optional).
- By extracting the KSP and PC contexts from the KSP context,
we can then directly call any KSP and PC routines to set
various options.
- The following two statements are optional; all of these
parameters could alternatively be specified at runtime via
KSPSetFromOptions(). All of these defaults can be
overridden at runtime, as indicated below.
*/
ierr = KSPSetTolerances(ksp,1.e-2/((m+1)*(n+1)),1.e-50,PETSC_DEFAULT,
PETSC_DEFAULT);CHKERRQ(ierr);
/*
Set runtime options, e.g.,
-ksp_type <type> -pc_type <type> -ksp_monitor -ksp_rtol <rtol>
These options will override those specified above as long as
KSPSetFromOptions() is called _after_ any other customization
routines.
*/
ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Solve the linear system
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Check solution and clean up
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/*
Check the error
*/
ierr = VecAXPY(x,-1.0,u);CHKERRQ(ierr);
ierr = VecNorm(x,NORM_2,&norm);CHKERRQ(ierr);
ierr = KSPGetIterationNumber(ksp,&its);CHKERRQ(ierr);
/* Scale the norm */
/* norm *= sqrt(1.0/((m+1)*(n+1))); */
/*
Print convergence information. PetscPrintf() produces a single
print statement from all processes that share a communicator.
An alternative is PetscFPrintf(), which prints to a file.
*/
ierr = PetscPrintf(PETSC_COMM_WORLD,"Norm of error %A iterations %D\n",
norm,its);CHKERRQ(ierr);
/*
Free work space. All PETSc objects should be destroyed when they
are no longer needed.
*/
ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
ierr = VecDestroy(&u);CHKERRQ(ierr); ierr = VecDestroy(&x);CHKERRQ(ierr);
ierr = VecDestroy(&b);CHKERRQ(ierr); ierr = MatDestroy(&A);CHKERRQ(ierr);
/*
Always call PetscFinalize() before exiting a program. This routine
- finalizes the PETSc libraries as well as MPI
- provides summary and diagnostic information if certain runtime
options are chosen (e.g., -log_summary).
*/
ierr = PetscFinalize();
return 0;
}
Does anyone know how to fill in my own matrices in these examples?
Yeah, this can be a little daunting when you're getting started. There's a good walk-through of the process in this ACTS tutorial from 2006; the tutorials listed on the PETSc web page are generally quite good.
The key parts of this are:
ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
Actually create the PETSc matrix object, Mat A;
ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,m*n,m*n);CHKERRQ(ierr);
Set the sizes; here, the matrix is m*n x m*n, as it's a stencil for operating on an m x n 2d grid.
ierr = MatSetFromOptions(A);CHKERRQ(ierr);
This just takes any PETSc command-line options that you might have supplied at run time and applies them to the matrix, in case you want to control how A is set up; otherwise, you could have used, e.g., MatCreateMPIAIJ() to create it as an AIJ-format matrix (the default), or MatCreateMPIDense() if it was going to be a dense matrix.
ierr = MatMPIAIJSetPreallocation(A,5,PETSC_NULL,5,PETSC_NULL);CHKERRQ(ierr);
ierr = MatSeqAIJSetPreallocation(A,5,PETSC_NULL);CHKERRQ(ierr);
Now that we've got an AIJ matrix, these calls just preallocate the sparse matrix, assuming 5 non-zeros per row. This is for performance. Note that both the MPI and Seq functions must be called to make sure this works with both 1 processor and multiple processors; this always seemed weird to me, but there you go.
Ok, now that the matrix is all set up, here's where we start getting into the actual meat of the matter.
First, we find out which rows this particular process owns. The distribution is by rows, which is a good distribution for typical sparse matrices.
ierr = MatGetOwnershipRange(A,&Istart,&Iend);CHKERRQ(ierr);
So after this call, each processor has its own version of Istart and Iend, and it's this processor's job to update rows starting at Istart and ending just before Iend, as you see in this for loop:
for (Ii=Istart; Ii<Iend; Ii++) {
v = -1.0; i = Ii/n; j = Ii - i*n;
Ok, so if we're operating on row Ii, this corresponds to grid location (i,j) where i = Ii/n (integer division) and j = Ii % n. Equivalently, grid location (i,j) corresponds to row Ii = i*n + j. Makes sense?
I'm going to strip out the if statements here; they're important, but they just deal with the boundary values and they make things more complicated.
In this row, there will be a +4 on the diagonal, and -1s at columns corresponding to (i-1,j), (i+1,j), (i,j-1), and (i,j+1). Assuming that we haven't gone off the end of the grid for these (i.e., 0 < i < m-1 and 0 < j < n-1), that means
J = Ii - n; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);
J = Ii + n; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);
J = Ii - 1; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);
J = Ii + 1; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);
v = 4.0; ierr = MatSetValues(A,1,&Ii,1,&Ii,&v,INSERT_VALUES);CHKERRQ(ierr);
}
The if statements I took out just avoid setting those values if they don't exist, and the CHKERRQ macro just prints out a useful error if ierr != 0, eg the set values call failed (because we tried to set an invalid value).
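If it helps to see which entries end up in a given row, here is a small Python sketch of the same five-point stencil logic, including the boundary checks. It uses a plain dict instead of any PETSc calls, and the function name is made up for illustration:

```python
def stencil_row(Ii, m, n):
    """Column -> value entries of row Ii for an m x n grid, five-point stencil."""
    i, j = divmod(Ii, n)          # i = Ii // n, j = Ii % n
    row = {Ii: 4.0}               # diagonal entry
    if i > 0:
        row[Ii - n] = -1.0        # neighbour (i-1, j)
    if i < m - 1:
        row[Ii + n] = -1.0        # neighbour (i+1, j)
    if j > 0:
        row[Ii - 1] = -1.0        # neighbour (i, j-1)
    if j < n - 1:
        row[Ii + 1] = -1.0        # neighbour (i, j+1)
    return row

m, n = 3, 4
print(stencil_row(0, m, n))       # corner point: diagonal plus two neighbours
print(stencil_row(5, m, n))       # interior point (i=1, j=1): five entries
```

Each key/value pair here corresponds to one MatSetValues call in the C loop.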
Now we've set local values; the MatAssembly calls start communication to ensure any necessary values are exchanged between processors. If you have any unrelated work to do, it can be stuck between the Begin and End to try to overlap communication and computation:
ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
And now you're done and can call your solvers.
So a typical workflow is:
Create your matrix (MatCreate)
Set its size (MatSetSizes)
Set various matrix options (MatSetFromOptions is a good choice, rather than hardcoding things)
For sparse matrices, set the preallocation to reasonable guesses for the number of non-zeros per row; you can do this with a single value (as here), or with an array representing the number of non-zeros per row (here filled in with PETSC_NULL): (MatMPIAIJSetPreallocation, MatSeqAIJSetPreallocation)
Find out which rows are your responsibility: (MatGetOwnershipRange)
Set the values (calling MatSetValues either once per value, or passing in a chunk of values; INSERT_VALUES sets new elements, ADD_VALUES increments any existing elements)
Then do the assembly (MatAssemblyBegin,MatAssemblyEnd).
Other more complicated use cases are possible.
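For the original question of filling the matrix from a file, the workflow stays the same; only the "set the values" step changes. Below is a rough Python sketch of the logic, not PETSc code. The "row col value" text format is made up for illustration, and the even row split in ownership_range is my assumption about how the rows get divided; in a real program you would take the range from MatGetOwnershipRange. (As far as I know PETSc can also read matrices stored in its own binary format with MatLoad(), which may be simpler if you can produce that format.)

```python
import io

def ownership_range(n_rows, n_procs, rank):
    """Contiguous [start, end) row block for one rank (assumed even split)."""
    base, extra = divmod(n_rows, n_procs)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return start, end

def local_entries(lines, start, end):
    """Keep only the 'row col value' triples whose row lies in [start, end)."""
    entries = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue                          # skip blank or malformed lines
        row, col, val = int(parts[0]), int(parts[1]), float(parts[2])
        if start <= row < end:
            entries.append((row, col, val))   # would become one MatSetValues call
    return entries

# Hypothetical file contents: a 4 x 4 matrix in "row col value" form.
text = """0 0 4.0
0 1 -1.0
1 0 -1.0
1 1 4.0
2 2 4.0
3 3 4.0
3 2 -1.0
"""
for rank in range(2):
    start, end = ownership_range(4, 2, rank)
    print(rank, (start, end), local_entries(io.StringIO(text), start, end))
```

Every process scans the whole file here and keeps its own rows, which is simple but reads the file once per process; the right-hand side vector could be filled the same way.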