Huge slow down using openmp - parallel-processing

I am trying to test the speed up for a small piece of code as follows:
for(i=0;i<imgDim;i++)
{
X[0][i] = Z[i] - U1[i] * rhoinv;
X[1][i] = Z[i] - U2[i] * rhoinv;
X[2][i] = Z[i] - U3[i] * rhoinv;
}
The loop is run about 200 times and imgDim is 1000000. This piece of code takes around 2 seconds in total, and the whole program takes about 15 seconds.
But after I parallelize this piece of code with OpenMP like this:
omp_set_num_threads(max_threads);
#pragma omp parallel shared(X,Z,U1,U2,U3,imgDim,rhoinv) private(i)
{
#pragma omp for schedule(dynamic)
for(i=0;i<imgDim;i++)
{
X[0][i] = Z[i] - U1[i] * rhoinv;
X[1][i] = Z[i] - U2[i] * rhoinv;
X[2][i] = Z[i] - U3[i] * rhoinv;
}
}
max_threads is 8. This small piece of code alone now needs around 11 seconds, and the entire program takes around 27 seconds. The strangest thing is that the time drops to 6 seconds if I change max_threads to 1, which is still much longer than the sequential code.
I have spent a lot of time on this and cannot find the problem. I would deeply appreciate it if anyone could help me with it.

schedule(dynamic) introduces a huge run-time overhead. With no chunk size given, it hands out iterations one at a time, so every one of the 1000000 iterations requires a synchronised request for more work. It should only be used for loops where iterations can take different amounts of time and the improved load balancing justifies the overhead. For regular loops like yours, dynamic scheduling is overkill: the overhead is unnecessary and slows the computation down.
Change the schedule type to static:
#pragma omp parallel for schedule(static)
for(i=0;i<imgDim;i++)
{
X[0][i] = Z[i] - U1[i] * rhoinv;
X[1][i] = Z[i] - U2[i] * rhoinv;
X[2][i] = Z[i] - U3[i] * rhoinv;
}
(Note: variables declared in outer scopes are shared by default and the parallel loop control variable is implicitly private)
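For reference, a minimal self-contained sketch of the statically scheduled loop as a function. The question does not show the declarations, so the element type (double), the layout of X as three row pointers, and the function name update_x are assumptions made for illustration only.

```c
/* Hypothetical helper: X is assumed to be three separately allocated rows
   of length n; all arrays are assumed to hold doubles. */
void update_x(double *X[3], const double *Z,
              const double *U1, const double *U2, const double *U3,
              double rhoinv, long n)
{
    long i;
    /* schedule(static) assigns each thread one contiguous block of
       iterations up front, so there is no per-iteration dispatch cost. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < n; i++) {
        X[0][i] = Z[i] - U1[i] * rhoinv;
        X[1][i] = Z[i] - U2[i] * rhoinv;
        X[2][i] = Z[i] - U3[i] * rhoinv;
    }
}
```

Compiled with OpenMP enabled (for example -fopenmp with gcc) the loop is shared among the threads; without it the pragma is ignored and the same code runs serially, which makes a like-for-like timing comparison easy.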

Related

Why Amdahl Law on serial and parallel fractions does not provide a theoretical speedup of 4 on quad-core CPU?

I have some code (the Floyd-Warshall algorithm for shortest paths in an NxN matrix) with three nested for loops, all with the same number of iterations.
In the innermost for I have an assignment via a ternary operation, = <bool> ? <val1> : <val2>, based on whether a comparison is true.
I used OpenMP to parallelize the second for with a #pragma omp parallel for.
I can't work out the parallel and serial percentages of the code, which I need in order to apply Amdahl's Law and get the theoretical speedup.
for (k = 0; k < N; k++)
#pragma omp parallel for private(i,j) shared(x) num_threads(4)
for (i = 0; i < N; i++){
for (j = 0; j < N; j++) {
x[i][j] = x[i][j] < (x[i][k] + x[k][j]) ? x[i][j] : (x[i][k] + x[k][j]) ;
}
}
I'm using four cores, so I expect a theoretical speedup of 4.
x[i][j] is the matrix, where every element is the weight of the edge connecting nodes i and j; it is the macro INF (infinity) if they are not connected.

TL;DR:
It is great that universities spend more time on practical Amdahl's Law examples, which show how easily the marketing girls and boys create false expectations about multi-core and many-core toys.
That said, let's define the test-case:
The problem in Floyd-Warshall processing could be structured into these sections:
[1] process launch overheads
[2] data-structures memory allocations + setup
[3] data-values initialisations
[4] Floyd-Warshall specific conditions ( zeroised diagonal, etc. )
[5] section timing tools
[6] the Floyd-Warshall O(N^3) process with a potential Improvement-Under-Test [IUT]
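Amdahl's Law sets an ultimate limit on any overall process "improvement": given that section [6] contains an [IUT] to be evaluated, the overall speed-up will NEVER be better than ~ 1 / ( ( 1 - IUT ) + ( IUT / N ) ), where IUT is the fraction of the total run time spent in the improved section and N is, in this formula only, the number of processing units ( not the matrix dimension ).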
Kind readers are left to test and record the timing for the ( 1 - IUT ) part of the experiment.
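As a small illustration of the bound above, here is a sketch of the formula as a C function. The name amdahl_speedup and its parameter names are chosen here for illustration; p plays the role of the IUT fraction and n the number of processing units.

```c
/* Upper bound on the overall speed-up when a fraction p of the run time
   (0 <= p <= 1) is spread perfectly over n processing units. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

For example, if 90 % of the run time is in the parallelisable section, four units give at most 1 / ( 0.1 + 0.225 ), i.e. a little under 3.08, and a speed-up of exactly 4 would require the serial fraction to be zero.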
How to compute an effect of the potentially parallelised [IUT] in the section [6] of the code?
First, let's focus on what happens in the originally posted code, in a pure SEQ ( serial ) code-execution flow:
The initial snippet already had some room for performance improvement, even without an OpenMP-based attempt to distribute the task over a larger resource base:
for ( k = 0; k < N; k++ )
for ( i = 0; i < N; i++ ){
for ( j = 0; j < N; j++ ){
x[i][j] = x[i][j] > ( x[i][k] + x[k][j] ) // .TEST <bool>
? ( x[i][k] + x[k][j] ) // .ASSIGN <val1>
: x[i][j]; // .ASSIGN <val2>
}
}
Run purely serially, this code shows ZERO improvement by definition. Run with the #pragma omp parallel for from the original question, the measured improvement may still fall far short of 4 and, for small N, can even turn "negative".
Why? Why not a speed-up of 4? Why possibly no improvement at all?
Not because the work is duplicated: #pragma omp parallel for does divide the i iterations among the 4 threads. The cost lies elsewhere. The parallel region is entered and left once per value of k, so the fork/join and barrier overheads are paid N times, and each time only one N x N sweep of work is there to amortise them. On top of that, all threads read and write the same matrix, so contention for caches and memory bandwidth can eat into whatever gain is left.
So let's use OpenMP to split the task explicitly and let each thread process just its own portion of the scope of the algorithm. The Floyd-Warshall algorithm is quite forgiving in this respect: within one k step its processing scheme, even when negative weights are allowed, is non-intervening, so no hostile barriers, syncs or critical sections are needed to propagate anything among the threads during that step.
So, can we get any OpenMP benefit here? Oh yes, a lot:
#include "omp.h" // .MUST SET a gcc directive // "-fopenmp"
// --------------------------------------------------------[1] ref. above
void main(){
int i, j, k;
const int N = 100;
int x[100][100];
// --------------------------------------------------------[2] ref. above
// --------------------------------------------------------[3] ref. above
// --------------------------------------------------------[4] ref. above
for ( k = 0; k < N; k++ )
{
// --------------------------------------------------------[5] ref. above
//------------------------------------------------------[6]----- OMP
// ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
// PARALLEL is not precise, "just"-CONCURRENT is EXACT IN THE SECTION LEVEL BELOW
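#pragma omp parallel private(i,j) shared(x) num_threads(4) // .NOT "parallel for": each thread must walk ALL rows i and take only its own tID#-strided columns j, otherwise most ( i, j ) pairs are never visited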
for ( i = 0; i < N; i++ ){ // .MUST incl.actual k-th ROW, in case NEG weights are permitted
int nTHREADs = omp_get_num_threads(); // .GET "total" number of spawned threads
int tID = omp_get_thread_num(); // .GET "own" tID# {0,1,..omp_get_num_threads()-1} .AVOID dumb repeating the same .JOB by all spawned threads
for ( j = tID; j < N; j += nTHREADs ){ // .FOR WITH tID#-offset start + strided .INC STEP // .MUST incl.actual k-th COL, in case NEG weights are permitted
// - - - - - - - - - - - - - - - - - - - - - - - -
// SINCE HERE: // .JOB WAS SPLIT 2 tID#-ed, NON-OVERLAPPING tasks
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ // .N.B: dumb "just"-CONCURRENT processing is O.K. here
// ................................................ // 1-thread .INC STEP +1 a sure ZERO Amdahl-Law effect ( will bear an adverse penalty from use-less omp_get_*() calls )
// °.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°.°. // 2-threads .INC STEP +2 may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP / 2 ) ) if enough free CPU-resources
// '-.'-.'-.'-.'-.'-.'-.'-.'-.'-.'-.'-.'-.'-.'-.'-. // 3-threads .INC STEP +3 may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP / 3 ) ) if enough free CPU-resources
// ^'-.^'-.^'-.^'-.^'-.^'-.^'-.^'-.^'-.^'-.^'-.^'-. // 4-threads .INC STEP +4 may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP / 4 ) ) if enough free CPU-resources
// o1234567o1234567o1234567o1234567o1234567o1234567 // 8-threads .INC STEP +8 may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP / 8 ) ) if enough free CPU-resources
// o123456789ABCDEFo123456789ABCDEFo123456789ABCDEF // 16-threads .INC STEP +16 may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP / 16 ) ) if enough free CPU-resources
// o123456789ABCDEFGHIJKLMNOPQRSTUVo123456789ABCDEF // 32-threads .INC STEP +32 may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP / 32 ) ) if enough free CPU-resources
// o123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijkl // 64-threads .INC STEP +64 may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP / 64 ) ) if enough free CPU-resources
// o123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijkl // 128-threads .INC STEP +128 may have Amdahl-Law effect ~ 1 / ( ( 1 - OMP ) + ( OMP / 128 ) ) if enough free CPU-resources
int aPair = x[i][k] + x[k][j]; // .MUST .CALC ADD( x_ik, x_kj ) to TEST // .MAY smart re-use in case .GT. and ASSIGN will have to take a due place
if ( x[i][j] > aPair ) x[i][j] = aPair; // .IFF .UPD // .AVOID dumb re-ASSIGN(s) of self.value(s) to self
// - - - - - - - - - - - - - - - - - - - - - - - -
}
}
// ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
}// --------------------------------------------------------------- OMP
return;
}
Understanding the OpenMP beyond an Amdahl's Law predicted limit:
The proposed approach opens some potential for further exploration by some funny experimentation:
setup the number of threads as 1 ( to use as an experimentation baseline )
setup the number of threads as ( nCPUcores / 2 )
setup the number of threads as ( nCPUcores - 1 )
setup the number of threads as ( nCPUcores ) + run disk defragmentation/compression
setup the number of threads as ( nCPUcores * 2 )
setup the number of threads as ( nCPUcores * 2 ) + enforce CPU-affinity on just 2 CPU-cores
setup the number of threads as ( nCPUcores * 20 )
setup the number of rows/cols N ~ { 1.000 | 10.000 | 100.000 | 1.000.000 } and check the effects

The innermost loop and its body are executed serially on each core. That's because you marked only the i loop to be executed in parallel.
But:
I would expect far less than a speedup of 4. There is always a communication overhead.
The body of the innermost loop uses the same matrix on all cores and also modifies it. Therefore the changes in the matrix must be propagated to all other cores. This might lead to the following problems:
CPU caches might be useless, because the cached array elements might be changed by another core which has a different cache.
In the worst case all modifications to the matrix depend on the previous change (I haven't checked this for your case). In that case no parallel execution is possible => no speedup at all for more than one core.
You should check whether it is possible to change your algorithm so that non-intersecting partial matrices can be processed. You will get the best speedup if the cores work on separate, non-intersecting data.
Since there is nearly no effort in doing so, you should definitely profile the code and variants of it.
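One way to give each core separate, non-intersecting data is to split the matrix by rows, so that every thread writes only rows it owns. The following is a minimal sketch under stated assumptions: the fixed size NV, the value chosen for INF and the function name are illustrative, and the diagonal is assumed to be zero with no negative cycles, so that row k and column k do not change during step k.

```c
#define NV  4         /* hypothetical small vertex count for this sketch */
#define INF 1000000   /* "not connected"; small enough that INF + INF fits in int */

void floyd_warshall(int x[NV][NV])
{
    int i, j, k;
    for (k = 0; k < NV; k++) {
        /* Rows are divided among the threads, so each thread writes only
           its own rows i. Row k and column k are read by every thread but
           stay unchanged during step k as long as x[k][k] == 0. */
        #pragma omp parallel for private(j) schedule(static)
        for (i = 0; i < NV; i++) {
            for (j = 0; j < NV; j++) {
                int via_k = x[i][k] + x[k][j];
                if (via_k < x[i][j]) x[i][j] = via_k;
            }
        }
    }
}
```

The k loop itself has to stay sequential, because step k + 1 depends on the results of step k. Timing this against the serial build (the same file compiled without OpenMP) gives the profile the answer asks for.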

OpenMP program freezing before starting loop?

I have a program I am trying to parallelize using OpenMP - it makes a very large loop over some data. Since incrementing a shared variable (so I can report progress as it goes) is somewhat of an issue, I thought I'd break the loop up into smaller chunks, loop over those multiple times, and just report the status at the end of/outside the openmp loop.
Problem is, before the OpenMP for loop starts for the 3rd time, the program locks up. Just sits there, does nothing. I've stripped out all but the simplest code. Here it is:
some other variable declarations for removed code above here
int dbl = 0;
int lasttime = 0;
int seedbase = 0;
const char *pl = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
const double mm = 62.0 / 2147483647.0;
for(dbl = 0; dbl < 2048 && !abort; dbl++) {
seedbase = dbl; //(dbl * 2097152) - 2147483648;
printf("Loop %d %d\n", dbl, abort);
#pragma omp parallel for private(seed) shared(dbl)
for(seed = 0; seed < 20971; seed++) { //52
if(dbl == 2)
printf("oo\n");
}
if(abort)
break;
lasttime = time();
hps = (double)((dbl*2097152) * clk_tck) / (double)((times(&tms) - start_time));
printf("So far: %0.2fsec (%0.2fhps) %0.2f sec left\n", (double)(times(&tms) - start_time) / (double)clk_tck, hps, (((long)1 << 32) - (dbl * 2097152)) / hps);
}
}
When compiled and run, I get:
Loop 0 0
So far: 0.02sec (0.00hps) inf sec left
Loop 1 0
So far: 0.02sec (104857600.00hps) 40.94 sec left
Loop 2 0
^C
Loop 0 starts, and the openmp runs (and does nothing) then exits, and the "So far:" is printed.
Loop 1 starts, same thing.
Loop 2 starts, and everything hangs. The printf("oo"); never happens. If I change the line to be if(dbl <= 2) my screen fills with looped "oo"'s as the loop runs.
But before the seed loop ever happens the third time - it's dead. Just sits there chewing up CPU time doing nothing.
Can you not quickly loop over an OpenMP loop? Is that the issue? I find it odd that it ALWAYS stops before the 3rd run, regardless of how complex the code inside the seed loop is (I removed 200 lines of code and it had no effect).
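For comparison, the chunking scheme described in the question (an outer serial loop that reports progress, with one OpenMP loop per chunk) can be sketched in isolation as follows. This does not diagnose the hang; do_work, run_in_chunks and the counter names are hypothetical stand-ins for the removed code, and chunk is assumed to be greater than zero.

```c
#include <stdio.h>

/* Hypothetical per-item work; stands in for the real loop body. */
static int do_work(long item) { return (int)(item % 2); }

/* Processes `total` items in chunks of `chunk` items. Each chunk is one
   OpenMP loop; progress is printed by a single thread between chunks.
   Returns the number of items for which do_work() returned 1. */
long run_in_chunks(long total, long chunk)
{
    long done = 0, hits = 0;
    while (done < total) {
        long end = (done + chunk < total) ? done + chunk : total;
        long chunk_hits = 0;
        long s;
        /* The reduction gives each thread a private counter and sums them
           at the end, so no shared variable is incremented inside the loop. */
        #pragma omp parallel for reduction(+:chunk_hits)
        for (s = done; s < end; s++) {
            chunk_hits += do_work(s);
        }
        hits += chunk_hits;
        done = end;
        printf("progress: %ld / %ld\n", done, total);
    }
    return hits;
}
```

Because the loop variable and the counter are declared locally and the counter is combined through a reduction clause, the only state shared between iterations of the outer loop is updated outside the parallel region.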

speeding up some for loops in matlab

Basically I am trying to solve a 2nd order differential equation with the forward euler method. I have some for loops inside my code, which take considerable time to solve and I would like to speed things up a bit. Does anyone have any suggestions how could I do this?
And also when looking at the time it takes, I notice that my end at line 14 takes 45 % of my total time. What is end actually doing and why is it taking so much time?
Here is my simplified code:
t = 0:0.01:100;
dt = t(2)-t(1);
B = 3.5 * t;
F0 = 2 * t;
BB=zeros(1,length(t)); % Preallocation
x = 2; % Initial value
u = 0; % Initial value
for ii = 1:length(t)
for kk = 1:ii
BB(ii) = BB(ii) + B(kk) * u(ii-kk+1)*dt; % This line takes the most time
end % This end takes 45% of the other time
x(ii+1) = x(ii) + dt*u(ii);
u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
Running the code it takes me 8.552 sec.

You can remove the inner loop, I think:
for ii = 1:length(t)
for kk = 1:ii
BB(ii) = BB(ii) + B(kk) * u(ii-kk+1)*dt; % This line takes the most time
end % This end takes 45% of the other time
x(ii+1) = x(ii) + dt*u(ii);
u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
So BB(ii) = BB(ii) (zero at initialisation) + sum for kk = 1 to ii of B(kk) * u(ii-kk+1) * dt
but kk = 1:ii, so for a given ii, ii-kk+1 → ii-(1:ii)+1 → ii:-1:1
So I think this is equivalent to:
for ii = 1:length(t)
BB(ii) = sum(B(1:ii).*u(ii:-1:1)*dt);
x(ii+1) = x(ii) + dt*u(ii);
u(ii+1) = u(ii) + dt * (F0(ii) - BB(ii));
end
It doesn't take as long as 8 seconds for me using either method, but the version with only one loop is about 2x as fast (the output of BB appears to be the same).
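Written out, the inner loop accumulates a discrete convolution sum, which is why it collapses into a single vectorized expression. Using the variable names from the code (with BB initialised to zero):

```latex
BB(ii) \;=\; dt \sum_{kk=1}^{ii} B(kk)\, u(ii-kk+1)
       \;=\; dt \,\bigl[\, B(1)\,u(ii) + B(2)\,u(ii-1) + \dots + B(ii)\,u(1) \,\bigr]
```

The bracketed term pairs B(1:ii) element by element with u(ii:-1:1), which is exactly what sum(B(1:ii).*u(ii:-1:1)*dt) computes.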

Is the sum loop of B(kk) * u(ii-kk+1) just conv(B(1:ii),u(1:ii),'same')?

The best way to speed up loops in Matlab is to try to avoid them. Check whether you can perform a matrix operation instead of the inner loop. For example, try to break the calculation you do there into small parts, then decide whether there are parts you can compute in advance without knowing the results of the next iteration of the loop.
As to the second part of your question, my guess: the end contains the check whether the loop runs for another round. This check by itself does not take long, but it is called 50,015,001 times!

Memory and execution speed in Matlab

I am trying to create random lines and select some of them, which are really rare. My code is rather simple, but to get something that I can use I need to create very large vectors (i.e.: <100000000 x 1, the tracks variable in my code). Is there any way to create larger vectors and to reduce the time needed for all those calculations?
My code is
%Initial line values
tracks=input('Give me the number of muon tracks: ');
width=1e-4;
height=2e-4;
Ystart=15.*ones(tracks,1);
Xstart=-40+80.*rand(tracks,1);
%Xend=-40+80.*rand(tracks,1);
Xend=laprnd(tracks,1,Xstart,15);
X=[Xstart';Xend'];
Y=[Ystart';zeros(1,tracks)];
b=(Ystart.*Xend)./(Xend-Xstart);
hot=0;
cold=0;
for i=1:tracks
if ((Xend(i,1)<width/2 && Xend(i,1)>-width/2)||(b(i,1)<height && b(i,1)>0))
plot(X(:, i),Y(:, i),'r');%the chosen ones!
hold all
hot=hot+1;
else
%plot(X(:, i),Y(:, i),'b');%the rest of them
%hold all
cold=cold+1;
end
end
I am also calling a Laplace distribution generator made by Elvis Chen:
function y = laprnd(m, n, mu, sigma)
%LAPRND generate i.i.d. laplacian random number drawn from laplacian distribution
% with mean mu and standard deviation sigma.
% mu : mean
% sigma : standard deviation
% [m, n] : the dimension of y.
% Default mu = 0, sigma = 1.
% For more information, refer to
% http://en.wikipedia.org./wiki/Laplace_distribution
% Author : Elvis Chen (bee33#sjtu.edu.cn)
% Date : 01/19/07
%Check inputs
if nargin < 2
error('At least two inputs are required');
end
if nargin == 2
mu = 0; sigma = 1;
end
if nargin == 3
sigma = 1;
end
% Generate Laplacian noise
u = rand(m, n)-0.5;
b = sigma / sqrt(2);
y = mu - b * sign(u).* log(1- 2* abs(u));
The result plot is

As you indicate, your problem is two-fold. On the one hand, you have memory issues because you need to do so many trials. On the other hand, you have performance issues, because you have to process all those trials.
Solutions to each issue often have a negative impact on the other issue. IMHO, the best approach would be to find a compromise.
More trials are only possible if you get rid of those gargantuan arrays that are required for vectorization, and use a different strategy to do the loop. I will give priority to the possibility of using more trials, possibly at the cost of optimal performance.
When I execute your code as-is in the Matlab profiler, it immediately shows that the initial memory allocation for all your variables takes a lot of time. It also shows that the plot and hold all commands are the most time-consuming lines of them all. Some more trial-and-error shows that there is a disappointingly low maximum value for the trials you can do before OUT OF MEMORY errors start appearing.
The loop can be accelerated tremendously if you know a few things about its limitations in Matlab. In older versions of Matlab, it used to be true that loops should be avoided completely in favor of 'vectorized' code. In recent versions (I believe R2008a and up), the Mathworks introduced a piece of technology called the JIT accelerator (Just-in-Time compiler) which translates M-code into machine language on the fly during execution. Simply put, the JIT accelerator allows your code to bypass Matlab's interpreter and talk much more directly with the underlying hardware, which can save a lot of time.
The advice you'll hear a lot that loops should be avoided in Matlab, is no longer generally true. While vectorization still has its value, any procedure of sizable complexity that is implemented using only vectorized code is often illegible, hard to understand, hard to change and hard to upkeep. An implementation of the same procedure that uses loops, often has none of these drawbacks, and moreover, it will quite often be faster and require less memory.
Unfortunately, the JIT accelerator has a few nasty (and IMHO, unnecessary) limitations that you'll have to learn about.
One such thing is plot; it's generally a better idea to let a loop do nothing other than collect and manipulate data, and delay any plotting commands etc. until after the loop.
Another such thing is hold; the hold function is not a Matlab built-in function, meaning, it is implemented in M-language. Matlab's JIT accelerator is not able to accelerate non-builtin functions when used in a loop, meaning, your entire loop will run at Matlab's interpretation speed, rather than machine-language speed! Therefore, also delay this command until after the loop :)
Now, in case you're wondering, this last step can make a HUGE difference -- I know of one case where copy-pasting a function body into the upper-level loop caused a 1200x performance improvement. Days of execution time had been reduced to minutes!
There is actually another minor issue in your loop (which is really small, and rather inconvenient, I will immediately agree with) -- the name of the loop variable should not be i. The name i is the name of the imaginary unit in Matlab, and the name resolution will also unnecessarily consume time on each iteration. It's small, but non-negligible.
Now, considering all this, I've come to the following implementation:
function [hot, cold, h] = MuonTracks(tracks)
% NOTE: no variables larger than 1x1 are initialized
width = 1e-4;
height = 2e-4;
% constant used for Laplacian noise distribution
bL = 15 / sqrt(2);
% Loop through all tracks
X = [];
hot = 0;
ii = 0;
while ii < tracks
ii = ii + 1;
% Note that I've inlined (== copy-pasted) the original laprnd()
% function call. This was necessary to work around limitations
% in loops in Matlab, and prevent the necessity of those HUGE
% variables.
%
% Of course, you can still easily generalize all of this:
% the new data
u = rand-0.5;
Ystart = 15;
Xstart = 800*rand-400;
Xend = Xstart - bL*sign(u)*log(1-2*abs(u));
b = (Ystart*Xend)/(Xend-Xstart);
% the test
if ((b < height && b > 0)) ||...
(Xend < width/2 && Xend > -width/2)
hot = hot+1;
% growing an array is perfectly fine when the chances of it
% happening are so slim
X = [X [Xstart; Xend]]; %#ok
end
end
% This is trivial to do here, and prevents an 'else' in the loop
cold = tracks - hot;
% Now plot the chosen ones
h = figure;
hold all
Y = repmat([15;0], 1, size(X,2));
plot(X, Y, 'r');
end
With this implementation, I can do this:
>> tic, MuonTracks(1e8); toc
Elapsed time is 24.738725 seconds.
with a completely negligible memory footprint.
The profiler now also shows a nice and even distribution of effort along the code; no lines that really stand out because of their memory use or performance.
It's possibly not the fastest possible implementation (if anyone sees obvious improvements, please, feel free to edit them in). But, if you're willing to wait, you'll be able to do MuonTracks(1e23) (or higher :)
I've also done an implementation in C, which can be compiled into a Matlab MEX file:
/* DoMuonCounting.c */
#include <math.h>
#include <matrix.h>
#include <mex.h>
#include <time.h>
#include <stdlib.h>
void CountMuons(
unsigned long long tracks,
unsigned long long *hot, unsigned long long *cold, double *Xout);
/* simple little helper functions */
double sign(double x) { return (x>0)-(x<0); }
double rand_double() { return (double)rand()/(double)RAND_MAX; }
/* the gateway function */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
int
dims[] = {1,1};
const mxArray
/* Output arguments */
*hot_out = plhs[0] = mxCreateNumericArray(2,dims, mxUINT64_CLASS,0),
*cold_out = plhs[1] = mxCreateNumericArray(2,dims, mxUINT64_CLASS,0),
*X_out = plhs[2] = mxCreateDoubleMatrix(2,10000, mxREAL);
const unsigned long long
tracks = (const unsigned long long)mxGetPr(prhs[0])[0];
unsigned long long
*hot = (unsigned long long*)mxGetPr(hot_out),
*cold = (unsigned long long*)mxGetPr(cold_out);
double
*Xout = mxGetPr(X_out);
/* call the actual function, and return */
CountMuons(tracks, hot,cold, Xout);
}
// The actual muon counting
void CountMuons(
unsigned long long tracks,
unsigned long long *hot, unsigned long long *cold, double *Xout)
{
const double
width = 1.0e-4,
height = 2.0e-4,
bL = 15.0/sqrt(2.0),
Ystart = 15.0;
double
Xstart,
Xend,
u,
b;
unsigned long long
i = 0ul;
*hot = 0ul;
*cold = tracks;
/* seed the RNG */
srand((unsigned)time(NULL));
/* aaaand start! */
while (i++ < tracks)
{
u = rand_double() - 0.5;
Xstart = 800.0*rand_double() - 400.0;
Xend = Xstart - bL*sign(u)*log(1.0-2.0*fabs(u));
b = (Ystart*Xend)/(Xend-Xstart);
if ((b < height && b > 0.0) || (Xend < width/2.0 && Xend > -width/2.0))
{
Xout[0 + *hot*2] = Xstart;
Xout[1 + *hot*2] = Xend;
++(*hot);
--(*cold);
}
}
}
compile in Matlab with
mex DoMuonCounting.c
(after having run mex -setup :) and then use it in conjunction with a small M-wrapper like this:
function [hot,cold, h] = MuonTrack2(tracks)
% call the MEX function
[hot,cold, Xtmp] = DoMuonCounting(tracks);
% process outputs, and generate plots
hot = uint32(hot); % circumvents limitations in 32-bit matlab
X = Xtmp(:,1:hot);
clear Xtmp
h = NaN;
if ~isempty(X)
h = figure;
hold all
Y = repmat([15;0], 1, hot);
plot(X, Y, 'r');
end
end
which allows me to do
>> tic, MuonTrack2(1e8); toc
Elapsed time is 14.496355 seconds.
Note that the memory footprint of the MEX version is slightly larger, but I think that's nothing to worry about.
The only flaw I see is the fixed maximum number of Muon counts (hard-coded as 10000 as the initial array size of Xout; needed because there are no dynamically growing arrays in standard C)...if you're worried this limit could be broken, simply increase it, change it to be equal to a fraction of tracks, or do some smarter (but more painful) dynamic array-growing tricks.
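The "dynamic array-growing tricks" mentioned above could look roughly like the following sketch: a buffer that doubles its capacity whenever it is full. The type PairBuffer and the function pair_buffer_push are hypothetical names introduced here; plain realloc is used so the sketch stays self-contained, whereas inside a MEX file the mx allocation routines would normally be used and the result copied into the output mxArray afterwards.

```c
#include <stdlib.h>

/* Hypothetical growable buffer of (Xstart, Xend) pairs. */
typedef struct {
    double *data;      /* holds 2 * capacity doubles */
    size_t  count;     /* pairs stored so far */
    size_t  capacity;  /* pairs that fit before the next reallocation */
} PairBuffer;

/* Appends one pair, doubling the capacity when the buffer is full.
   Returns 0 on success, -1 if memory could not be obtained. */
int pair_buffer_push(PairBuffer *buf, double xstart, double xend)
{
    if (buf->count == buf->capacity) {
        size_t new_cap = buf->capacity ? 2 * buf->capacity : 16;
        double *p = realloc(buf->data, 2 * new_cap * sizeof *p);
        if (p == NULL) return -1;   /* the old block is still valid */
        buf->data = p;
        buf->capacity = new_cap;
    }
    buf->data[2 * buf->count]     = xstart;
    buf->data[2 * buf->count + 1] = xend;
    buf->count++;
    return 0;
}
```

A buffer initialised as { NULL, 0, 0 } can be pushed to directly, since realloc on a null pointer behaves like malloc. Doubling keeps the number of reallocations logarithmic in the final count, which matters little here because hits are rare.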

In Matlab, it is sometimes faster to vectorize rather than use a for loop. For example, this expression:
(Xend(i,1) < width/2 && Xend(i,1) > -width/2) || (b(i,1) < height && b(i,1) > 0)
which is defined for each value of i, can be rewritten in a vectorised manner like this:
isChosen = (Xend(:,1) < width/2 & Xend(:,1) > -width/2) | (b(:,1) < height & b(:,1)>0)
Expressions like Xend(:,1) will give you a column vector, so Xend(:,1) < width/2 will give you a column vector of boolean values. Note that I have used & rather than &&: this is because & performs an element-wise logical AND, unlike &&, which only works on scalar values. In this way you can build the entire expression, such that the variable isChosen holds a column vector of boolean values, one for each row of your Xend/b vectors.
Getting counts is now as simple as this:
hot = sum(isChosen);
since true is represented by 1. And:
cold = sum(~isChosen);
Finally, you can get the data points by using the boolean vector to select rows:
plot(X(:, isChosen),Y(:, isChosen),'r'); % Plot chosen values
hold all;
plot(X(:, ~isChosen),Y(:, ~isChosen),'b'); % Plot unchosen values
EDIT: The code should look like this:
isChosen = (Xend(:,1) < width/2 & Xend(:,1) > -width/2) | (b(:,1) < height & b(:,1)>0);
hot = sum(isChosen);
cold = sum(~isChosen);
plot(X(:, isChosen),Y(:, isChosen),'r'); % Plot chosen values

PETSc solving linear system with ksp guide

I am starting to use the PETSc library to solve linear systems of equations in parallel. I have installed all packages and successfully built and run the examples in the petsc/src/ksp/ksp/examples/tutorials/ folder, for example ex2.c.
But I can't understand how to fill the matrix A and the vectors x and b by reading them, for example, from a file.
Here is the code from the ex2.c file:
/* Program usage: mpiexec -n <procs> ex2 [-help] [all PETSc options] */
static char help[] = "Solves a linear system in parallel with KSP.\n\
Input parameters include:\n\
-random_exact_sol : use a random exact solution vector\n\
-view_exact_sol : write exact solution vector to stdout\n\
-m <mesh_x> : number of mesh points in x-direction\n\
-n <mesh_n> : number of mesh points in y-direction\n\n";
/*T
Concepts: KSP^basic parallel example;
Concepts: KSP^Laplacian, 2d
Concepts: Laplacian, 2d
Processors: n
T*/
/*
Include "petscksp.h" so that we can use KSP solvers. Note that this file
automatically includes:
petscsys.h - base PETSc routines petscvec.h - vectors
petscmat.h - matrices
petscis.h - index sets petscksp.h - Krylov subspace methods
petscviewer.h - viewers petscpc.h - preconditioners
*/
#include <C:\PETSC\include\petscksp.h>
#undef __FUNCT__
#define __FUNCT__ "main"
int main(int argc,char **args)
{
Vec x,b,u; /* approx solution, RHS, exact solution */
Mat A; /* linear system matrix */
KSP ksp; /* linear solver context */
PetscRandom rctx; /* random number generator context */
PetscReal norm; /* norm of solution error */
PetscInt i,j,Ii,J,Istart,Iend,m = 8,n = 7,its;
PetscErrorCode ierr;
PetscBool flg = PETSC_FALSE;
PetscScalar v;
#if defined(PETSC_USE_LOG)
PetscLogStage stage;
#endif
PetscInitialize(&argc,&args,(char *)0,help);
ierr = PetscOptionsGetInt(PETSC_NULL,"-m",&m,PETSC_NULL);CHKERRQ(ierr);
ierr = PetscOptionsGetInt(PETSC_NULL,"-n",&n,PETSC_NULL);CHKERRQ(ierr);
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Compute the matrix and right-hand-side vector that define
the linear system, Ax = b.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/*
Create parallel matrix, specifying only its global dimensions.
When using MatCreate(), the matrix format can be specified at
runtime. Also, the parallel partitioning of the matrix is
determined by PETSc at runtime.
Performance tuning note: For problems of substantial size,
preallocation of matrix memory is crucial for attaining good
performance. See the matrix chapter of the users manual for details.
*/
ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,m*n,m*n);CHKERRQ(ierr);
ierr = MatSetFromOptions(A);CHKERRQ(ierr);
ierr = MatMPIAIJSetPreallocation(A,5,PETSC_NULL,5,PETSC_NULL);CHKERRQ(ierr);
ierr = MatSeqAIJSetPreallocation(A,5,PETSC_NULL);CHKERRQ(ierr);
/*
Currently, all PETSc parallel matrix formats are partitioned by
contiguous chunks of rows across the processors. Determine which
rows of the matrix are locally owned.
*/
ierr = MatGetOwnershipRange(A,&Istart,&Iend);CHKERRQ(ierr);
/*
Set matrix elements for the 2-D, five-point stencil in parallel.
- Each processor needs to insert only elements that it owns
locally (but any non-local elements will be sent to the
appropriate processor during matrix assembly).
- Always specify global rows and columns of matrix entries.
Note: this uses the less common natural ordering that orders first
all the unknowns for x = h then for x = 2h etc; Hence you see J = Ii +- n
instead of J = I +- m as you might expect. The more standard ordering
would first do all variables for y = h, then y = 2h etc.
*/
ierr = PetscLogStageRegister("Assembly", &stage);CHKERRQ(ierr);
ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
for (Ii=Istart; Ii<Iend; Ii++) {
v = -1.0; i = Ii/n; j = Ii - i*n;
if (i>0) {J = Ii - n; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
if (i<m-1) {J = Ii + n; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
if (j>0) {J = Ii - 1; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
if (j<n-1) {J = Ii + 1; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
v = 4.0; ierr = MatSetValues(A,1,&Ii,1,&Ii,&v,INSERT_VALUES);CHKERRQ(ierr);
}
/*
Assemble matrix, using the 2-step process:
MatAssemblyBegin(), MatAssemblyEnd()
Computations can be done while messages are in transition
by placing code between these two statements.
*/
ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
ierr = PetscLogStagePop();CHKERRQ(ierr);
/* A is symmetric. Set symmetric flag to enable ICC/Cholesky preconditioner */
ierr = MatSetOption(A,MAT_SYMMETRIC,PETSC_TRUE);CHKERRQ(ierr);
/*
Create parallel vectors.
- We form 1 vector from scratch and then duplicate as needed.
- When using VecCreate(), VecSetSizes and VecSetFromOptions()
in this example, we specify only the
vector's global dimension; the parallel partitioning is determined
at runtime.
- When solving a linear system, the vectors and matrices MUST
be partitioned accordingly. PETSc automatically generates
appropriately partitioned matrices and vectors when MatCreate()
and VecCreate() are used with the same communicator.
- The user can alternatively specify the local vector and matrix
dimensions when more sophisticated partitioning is needed
(replacing the PETSC_DECIDE argument in the VecSetSizes() statement
below).
*/
ierr = VecCreate(PETSC_COMM_WORLD,&u);CHKERRQ(ierr);
ierr = VecSetSizes(u,PETSC_DECIDE,m*n);CHKERRQ(ierr);
ierr = VecSetFromOptions(u);CHKERRQ(ierr);
ierr = VecDuplicate(u,&b);CHKERRQ(ierr);
ierr = VecDuplicate(b,&x);CHKERRQ(ierr);
/*
Set exact solution; then compute right-hand-side vector.
By default we use an exact solution of a vector with all
elements of 1.0; alternatively, using the runtime option
-random_exact_sol forms a solution vector with random components.
*/
ierr = PetscOptionsGetBool(PETSC_NULL,"-random_exact_sol",&flg,PETSC_NULL);CHKERRQ(ierr);
if (flg) {
ierr = PetscRandomCreate(PETSC_COMM_WORLD,&rctx);CHKERRQ(ierr);
ierr = PetscRandomSetFromOptions(rctx);CHKERRQ(ierr);
ierr = VecSetRandom(u,rctx);CHKERRQ(ierr);
ierr = PetscRandomDestroy(&rctx);CHKERRQ(ierr);
} else {
ierr = VecSet(u,1.0);CHKERRQ(ierr);
}
ierr = MatMult(A,u,b);CHKERRQ(ierr);
/*
View the exact solution vector if desired
*/
flg = PETSC_FALSE;
ierr = PetscOptionsGetBool(PETSC_NULL,"-view_exact_sol",&flg,PETSC_NULL);CHKERRQ(ierr);
if (flg) {ierr = VecView(u,PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);}
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Create the linear solver and set various options
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/*
Create linear solver context
*/
ierr = KSPCreate(PETSC_COMM_WORLD,&ksp);CHKERRQ(ierr);
/*
Set operators. Here the matrix that defines the linear system
also serves as the preconditioning matrix.
*/
ierr = KSPSetOperators(ksp,A,A,DIFFERENT_NONZERO_PATTERN);CHKERRQ(ierr);
/*
Set linear solver defaults for this problem (optional).
- By extracting the KSP and PC contexts from the KSP context,
we can then directly call any KSP and PC routines to set
various options.
- The following two statements are optional; all of these
parameters could alternatively be specified at runtime via
KSPSetFromOptions(). All of these defaults can be
overridden at runtime, as indicated below.
*/
ierr = KSPSetTolerances(ksp,1.e-2/((m+1)*(n+1)),1.e-50,PETSC_DEFAULT,
PETSC_DEFAULT);CHKERRQ(ierr);
/*
Set runtime options, e.g.,
-ksp_type <type> -pc_type <type> -ksp_monitor -ksp_rtol <rtol>
These options will override those specified above as long as
KSPSetFromOptions() is called _after_ any other customization
routines.
*/
ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Solve the linear system
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Check solution and clean up
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/*
Check the error
*/
ierr = VecAXPY(x,-1.0,u);CHKERRQ(ierr);
ierr = VecNorm(x,NORM_2,&norm);CHKERRQ(ierr);
ierr = KSPGetIterationNumber(ksp,&its);CHKERRQ(ierr);
/* Scale the norm */
/* norm *= sqrt(1.0/((m+1)*(n+1))); */
/*
Print convergence information. PetscPrintf() produces a single
print statement from all processes that share a communicator.
An alternative is PetscFPrintf(), which prints to a file.
*/
ierr = PetscPrintf(PETSC_COMM_WORLD,"Norm of error %A iterations %D\n",
norm,its);CHKERRQ(ierr);
/*
Free work space. All PETSc objects should be destroyed when they
are no longer needed.
*/
ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
ierr = VecDestroy(&u);CHKERRQ(ierr); ierr = VecDestroy(&x);CHKERRQ(ierr);
ierr = VecDestroy(&b);CHKERRQ(ierr); ierr = MatDestroy(&A);CHKERRQ(ierr);
/*
Always call PetscFinalize() before exiting a program. This routine
- finalizes the PETSc libraries as well as MPI
- provides summary and diagnostic information if certain runtime
options are chosen (e.g., -log_summary).
*/
ierr = PetscFinalize();
return 0;
}
Does anyone know how to fill in my own matrices in examples like this one?

Yeah, this can be a little daunting when you're getting started. There's a good walk-through of the process in this ACTS tutorial from 2006; the tutorials listed on the PETSc web page are generally quite good.
The key parts of this are:
ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
This actually creates the PETSc matrix object, Mat A.
ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,m*n,m*n);CHKERRQ(ierr);
This sets the sizes; here, the matrix is m*n x m*n, as it's a stencil for operating on an m x n 2D grid.
ierr = MatSetFromOptions(A);CHKERRQ(ierr);
This just takes any PETSc command-line options that you might have supplied at run time and applies them to the matrix, in case you want to control how A is set up. Otherwise, you could have used, e.g., MatCreateMPIAIJ() to create it as an AIJ-format matrix (the default), or MatCreateMPIDense() if it was going to be a dense matrix.
ierr = MatMPIAIJSetPreallocation(A,5,PETSC_NULL,5,PETSC_NULL);CHKERRQ(ierr);
ierr = MatSeqAIJSetPreallocation(A,5,PETSC_NULL);CHKERRQ(ierr);
Now that we've got an AIJ matrix, these calls just preallocate the sparse matrix, assuming 5 non-zeros per row. This is for performance. Note that both the MPI and Seq functions must be called to make sure this works with both 1 processor and multiple processors; this always seemed weird to me, but there you go.
Ok, now that the matrix is all set up, here's where we start getting into the actual meat of the matter.
First, we find out which rows this particular process owns. The distribution is by rows, which is a good distribution for typical sparse matrices.
ierr = MatGetOwnershipRange(A,&Istart,&Iend);CHKERRQ(ierr);
So after this call, each processor has its own version of Istart and Iend, and it's this processor's job to update the rows starting at Istart and ending just before Iend, as you see in this for loop:
for (Ii=Istart; Ii<Iend; Ii++) {
v = -1.0; i = Ii/n; j = Ii - i*n;
OK, so if we're operating on row Ii, this corresponds to grid location (i,j) where i = Ii/n and j = Ii % n. Conversely, grid location (i,j) corresponds to row Ii = i*n + j. Makes sense?
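The index mapping described above can be sketched in plain C, without PETSc. The names GridPoint, row_to_grid and grid_to_row are hypothetical helpers made up for illustration; the sketch assumes the m x n grid is numbered row by row, as in the example.

```c
#include <assert.h>

/* Hypothetical helpers for illustration; not part of PETSc. */
typedef struct { int i; int j; } GridPoint;

/* Global matrix row Ii -> grid location (i,j), for a grid with n columns. */
GridPoint row_to_grid(int Ii, int n)
{
    assert(n > 0 && Ii >= 0);
    GridPoint p;
    p.i = Ii / n;        /* grid row */
    p.j = Ii - p.i * n;  /* grid column, same as Ii % n */
    return p;
}

/* Grid location (i,j) -> global matrix row. */
int grid_to_row(int i, int j, int n)
{
    assert(i >= 0 && j >= 0 && j < n);
    return i * n + j;
}
```

For example, with n = 3, row 7 maps to grid location (2,1), and grid_to_row(2, 1, 3) gives back 7.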
I'm going to strip out the if statements here; they're important, but they just deal with the boundary values and they make things more complicated.
In this row, there will be a +4 on the diagonal, and -1s at columns corresponding to (i-1,j), (i+1,j), (i,j-1), and (i,j+1). Assuming that we haven't gone off the end of the grid for these (i.e., 0 < i < m-1 and 0 < j < n-1), that means
J = Ii - n; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);
J = Ii + n; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);
J = Ii - 1; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);
J = Ii + 1; ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);
v = 4.0; ierr = MatSetValues(A,1,&Ii,1,&Ii,&v,INSERT_VALUES);CHKERRQ(ierr);
}
The if statements I took out just avoid setting those values if they don't exist, and the CHKERRQ macro just prints out a useful error and returns from the function if ierr != 0, e.g. if the set-values call failed (because we tried to set an invalid value).
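To see the boundary checks and the stencil together, here is a minimal plain-C sketch that collects the column indices and values of one row. The function stencil_row is a hypothetical helper, not a PETSc routine, and it uses int and double where real PETSc code would use PetscInt and PetscScalar.

```c
#include <assert.h>

/* Hypothetical helper, not a PETSc routine: writes the entries of row Ii of
   the 5-point stencil on an m x n grid into cols[] and vals[], skipping
   neighbours that fall off the grid. Returns the number of entries written. */
int stencil_row(int Ii, int m, int n, int cols[5], double vals[5])
{
    assert(m > 0 && n > 0 && Ii >= 0 && Ii < m * n);
    int i = Ii / n, j = Ii - i * n, k = 0;
    if (i > 0)     { cols[k] = Ii - n; vals[k] = -1.0; k++; }  /* (i-1,j) */
    if (i < m - 1) { cols[k] = Ii + n; vals[k] = -1.0; k++; }  /* (i+1,j) */
    if (j > 0)     { cols[k] = Ii - 1; vals[k] = -1.0; k++; }  /* (i,j-1) */
    if (j < n - 1) { cols[k] = Ii + 1; vals[k] = -1.0; k++; }  /* (i,j+1) */
    cols[k] = Ii; vals[k] = 4.0; k++;                          /* diagonal */
    return k;
}
```

With PetscInt and PetscScalar arrays, the whole row could then presumably be set in one call, along the lines of MatSetValues(A,1,&Ii,k,cols,vals,INSERT_VALUES), instead of one call per entry.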
Now we've set local values; the MatAssembly calls start communication to ensure any necessary values are exchanged between processors. If you have any unrelated work to do, it can be stuck between the Begin and End to try to overlap communication and computation:
ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
And now you're done and can call your solvers.
So a typical workflow is:
Create your matrix (MatCreate)
Set its size (MatSetSizes)
Set various matrix options (MatSetFromOptions is a good choice, rather than hardcoding things)
For sparse matrices, set the preallocation to reasonable guesses for the number of non-zeros per row (MatMPIAIJSetPreallocation, MatSeqAIJSetPreallocation); you can do this with a single value (as here), or with an array giving the number of non-zeros for each row (the argument passed as PETSC_NULL here)
Find out which rows are your responsibility: (MatGetOwnershipRange)
Set the values (calling MatSetValues either once per value, or passing in a chunk of values; INSERT_VALUES sets new elements, ADD_VALUES increments any existing elements)
Then do the assembly (MatAssemblyBegin,MatAssemblyEnd).
Other more complicated use cases are possible.
