Proper implementation of triangular solver for sparse matrix with OpenACC - solver

I'm currently implementing triangular solver for sparse matrix and I've try to accelerate then using OpenACC directives. Given my matrix factors LU in sparse CSR format, OpenACC have managed to solve the L factor properly but the U factor gives completely wrong when compared with the true solution of the application. Here is the code of the accelerated kernel for the backward substitution task:
#pragma acc kernels deviceptr( ia, ja, factorValsU, y, x )
{
for ( int idx = size; idx > 0; idx-- )
{
double temp = 0.0;
int rowInit = ia[ idx - 1];
int rowEnd = ia[ idx ];
#pragma acc loop vector reduction( + : temp)
for ( int k = rowInit + 1; k < rowEnd; k++ )
{
temp += factorValsU[ k ] * x[ ja[ k ] ];
}
x[ idx ] = (y[ idx ] - temp) / factorValsU[ rowInit ];
}
}
I'm clueless about why this kernels does produce incorrect result. I've already tried a different version for the kernel where the matrix is saved backwards, i.e from down to top, which in principle could be solved with the follow kernel:
#pragma acc kernels deviceptr( ia, ja, factorValsU, y, x )
{
for ( int idx = 0; idx < size; idx++ )
{
double temp = 0.0;
int rowInit = ia[ idx ];
int rowEnd = ia[ idx + 1 ];
#pragma acc loop vector reduction( + : temp)
for ( int k = rowInit + 1; k < rowEnd; k++ )
{
temp += factorValsU[ k ] * x[ ja[ k ] ];
}
x[ size - idx ] = (y[ size - idx ] - temp) / factorValsU[ rowInit ];
}
}
But the result is always wrong. Did I miss something fundamental about decorating regular code with OpenACC directives to achieve proper results?
As mentioned before, the forward substitution of the L factor is properly working so, for completeness I do post the code here.
#pragma acc kernels deviceptr( ia, ja, factorValsL, y, x )
{
for ( int idx = 0; idx < size; idx++ )
{
double temp = 0.0;
int rowInit = ia[ idx ];
int rowEnd = ia[ idx + 1 ];
#pragma acc loop vector reduction( + : temp)
for ( int k = rowInit; k < rowEnd; k++ )
{
temp += factorValsL[ k ] * x[ ja[ k ] ];
}
x[ idx ] = y[ idx ] - temp;
}
}
Note the subtle difference between the kernel for the forward substitution (works) and backward substitution (both not working), is memory region where the result are saved:
x[ idx ] = y[ idx ] - temp for the L factor
x[ size - idx ] = (y[ size - idx ] - temp) / factorValsU[ rowInit ] for the U factor;
Is there some reason for the U factor solver to compute mistaken results cause of the order in which the assignation (and lecture) in memory is made?.
For completeness, the information provided by the pgi18.4 compiler about the kernel is:
triangularSolverU_acc(int, const int *, const int *, const double *, const double *, double *, bool):
614, Complex loop carried dependence of y->,x->,factorVals-> prevents parallelization
Loop carried dependence of x-> prevents parallelization
Loop carried backward dependence of x-> prevents vectorization
Accelerator kernel generated
Generating Tesla code
614, #pragma acc loop seq
621, #pragma acc loop vector(128) /* threadIdx.x */
Generating reduction(+:temp)
621, Loop is parallelizable
Which shows that the external loop have been serialized and the inner loop is a reduction.

With "kernels", the compiler must prove that the loop does not contain any dependencies and therefor is safe to parallelize. However since your code contains pointers, and that pointers may be aliased to the same memory, the compiler can't prove this so does the safe thing and make the loop run sequential. To override the compiler analysis, you can add a "#pragma acc loop independent" before the outer for loop. The "independent" is an assertion to the compiler that the loop is safe to parallelize. Alternatively, you can use the "parallel loop" directive in place of "kernels" since "parallel" implies "independent".
For the wrong answers, these are often due to the data not properly being synchronized between the host and device copies. How are you managing the data movement? Since you're using "deviceptr", this implies that you're using CUDA.
Also, if you can post a complete reproducing example, it would be easier to help determine the issue.

Related

Is there any optimization function in Rcpp

The following is my Rcpp code, and I want to minimize the objective function logtpoi(x,theta) respect to theta in R by 'nlminb'. I found it is slow.
I have two question:
Anyone can improve my Rcpp code? Thank you very much.
Is there any optimization functions in Rcpp? If yes,maybe I can use them in Rcpp directly. And how to use them? Thank you very much.
My code:
#include <RcppArmadillo.h>
using namespace Rcpp;
using namespace arma;
// [[Rcpp::depends("RcppArmadillo")]]
// [[Rcpp::export]]
List dtpoi0(const IntegerVector& x, const NumericVector& theta){
//x is 3-dim vector; theta is a 6-dim parameter vector.
//be careful the order of theta1,...,theta6.
double theta1 = theta[0]; double theta2 = theta[1];
double theta3 = theta[2]; double theta4 = theta[3];
double theta5 = theta[4]; double theta6 = theta[5];
int x1 = x[0]; int x2 = x[1]; int x3 = x[2];
IntegerVector z1 = IntegerVector::create(x1,x2);
IntegerVector z2 = IntegerVector::create(x1,x3);
IntegerVector z3 = IntegerVector::create(x2,x3);
int s1 = min(z1); int s2 = min(z2); int s3 = min(z3);
arma::imat missy(1,3,fill::zeros); arma::irowvec ijk={0,0,0};
for (int i = 0; i <= s1; ++i) {
for (int j = 0; j <= s2; ++j) {
for (int k = 0; k <= s3; ++k) {
if ((i+j <= s1) & (i+k <= s2) & ( j+k <= s3))
{ ijk = {i,j,k};
missy = join_cols(missy,ijk);}
}
}
}
IntegerMatrix misy = as<IntegerMatrix>(wrap(missy));
IntegerVector u1 = IntegerVector::create(0);
IntegerVector u2 = IntegerVector::create(0);
IntegerVector u3 = IntegerVector::create(0);
IntegerVector u4 = IntegerVector::create(0);
IntegerVector u5 = IntegerVector::create(0);
IntegerVector u6 = IntegerVector::create(0);
int total = misy.nrow();
double fvalue = 0;
NumericVector part1(1); NumericVector part2(1);
NumericVector part3(1); NumericVector part4(1);
NumericVector part5(1); NumericVector part6(1);
for (int l = 1; l < total; ++l) {
u1 = IntegerVector::create(x1-misy(l,0)-misy(l,1));
u2 = IntegerVector::create(x2-misy(l,0)-misy(l,2));
u3 = IntegerVector::create(x3-misy(l,1)-misy(l,2));
u4 = IntegerVector::create(misy(l,0));
u5 = IntegerVector::create(misy(l,1));
u6 = IntegerVector::create(misy(l,2));
part1 = dpois(u1,theta1);
part2 = dpois(u2,theta2);
part3 = dpois(u3,theta3);
part4 = dpois(u4,theta4);
part5 = dpois(u5,theta5);
part6 = dpois(u6,theta6);
fvalue = fvalue + (part1*part2*part3*part4*part5*part6)[0]; }
return(List::create(Named("misy") = misy,Named("fvalue") = fvalue));
}
// [[Rcpp::export]]
NumericVector dtpoi(const IntegerMatrix& x, const NumericVector& theta){
//x is n*3 matrix, n is the number of observations.
int n = x.nrow();
NumericVector density(n);
for (int i = 0; i < n; ++i){
density(i) = dtpoi0(x.row(i),theta)["fvalue"];
}
return(density);
}
// [[Rcpp::export]]
double logtpoi0(const IntegerMatrix& x,const NumericVector theta){
// theta must be a 6-dimiension parameter.
double nln = -sum(log( dtpoi(x,theta) + 1e-60 ));
if(arma::is_finite(nln)) {nln = nln;} else {nln = -1e10;}
return(nln);
}
Huge caveat ahead: I don’t really know Armadillo. But I’ve had a stab at it because the code looks interesting.
A few general things:
You don’t need to declare things before you assign them for the first time. In particular, it’s generally not necessary to declare vectors outside a loop if they’re only used inside the loop. This is probably no less efficient than declaring them inside the loop. However, if your code is too slow it makes sense to carefully profile this, and test whether the assumption holds.
Many of your declarations are just aliases for vector elements and don’t seem necessary.
Your z{1…3} vectors aren’t necessary. C++ has a min function to find the minimum of two elements.
dtpoi0 contains two main loops. Both of these have been heavily modified in my code:
The first loop iterates over many ks that can are never used, due to the internal if that tests whether i + j exceeds s2. By pulling this check into the loop condition of j, we perform fewer k loops.
Your if uses & instead of &&. Like in R, using && rather than & causes short-circuiting. While this is probably not more efficient in this case, using && is idiomatic, whereas & causes head-scratching (my code uses and which is an alternative way of spelling && in C++; I prefer its readability).
The second loops effectively performs a matrix operation manually. I feel that there should be a way of expressing this purely with matrix operations — but as mentioned I’m not an Armadillo user. Still, my changes attempt to vectorise as much of this operation as possible (if nothing else this makes the code shorter). The dpois inner product is unfortunately still inside a loop.
The logic of logtpoi0 can be made more idiomatic and (IMHO) more readable by using the conditional operator instead of if.
const-correctness is a big deal in C++, since it weeds out accidental modifications. Use const liberally when declaring variables that are not supposed to change.
In terms of efficiency, the biggest hit when calling dtpoi or logtpoi0 is probably the conversion of missy to misy, which causes allocations and memory copies. Only convert to IntegerMatrix when necessary, i.e. when actually returning that value to R. For that reason, I’ve split dtpoi0 into two parts.
Another inefficiency is the fact that the first loop in dtpoi0 grows a matrix by appending columns. That’s a big no-no. However, rewriting the code to avoid this isn’t trivial.
#include <algorithm>
#include <RcppArmadillo.h>
// [[Rcpp::depends("RcppArmadillo")]]
using namespace Rcpp;
using namespace arma;
imat dtpoi0_mat(const IntegerVector& x) {
const int s1 = std::min(x[0], x[1]);
const int s2 = std::min(x[0], x[2]);
const int s3 = std::min(x[1], x[2]);
imat missy(1, 3, fill::zeros);
for (int i = 0; i <= s1; ++i) {
for (int j = 0; j <= s2 and i + j <= s1; ++j) {
for (int k = 0; k <= s3 and i + k <= s2 and j + k <= s3; ++k) {
missy = join_cols(missy, irowvec{i, j, k});
}
}
}
return missy;
}
double dtpoi0_fvalue(const IntegerVector& x, const NumericVector& theta, imat& missy) {
double fvalue = 0.0;
ivec xx = as<ivec>(x);
missy.each_row([&](irowvec& v) {
const ivec u(join_cols(xx - v(uvec{0, 0, 1}) - v(uvec{1, 2, 3}), v));
double prod = 1;
for (int i = 0; i < u.n_elem; ++i) {
prod *= R::dpois(u[i], theta[i], 0);
}
fvalue += prod;
});
return fvalue;
}
double dtpoi0_fvalue(const IntegerVector& x, const NumericVector& theta) {
imat missy = dtpoi0_mat(x);
return dtpoi0_fvalue(x, theta, missy);
}
// [[Rcpp::export]]
List dtpoi0(const IntegerVector& x, const NumericVector& theta) {
imat missy = dtpoi0_mat(x);
const double fvalue = dtpoi0_fvalue(x, theta, missy);
return List::create(Named("misy") = as<IntegerMatrix>(wrap(missy)), Named("fvalue") = fvalue);
}
// [[Rcpp::export]]
NumericVector dtpoi(const IntegerMatrix& x, const NumericVector& theta) {
//x is n*3 matrix, n is the number of observations.
int n = x.nrow();
NumericVector density(n);
for (int i = 0; i < n; ++i){
density(i) = dtpoi0_fvalue(x.row(i), theta);
}
return density;
}
// [[Rcpp::export]]
double logtpoi0(const IntegerMatrix& x, const NumericVector theta) {
// theta must be a 6-dimension parameter.
const double nln = -sum(log(dtpoi(x, theta) + 1e-60));
return is_finite(nln) ? nln : -1e10;
}
Important: This compiles, but I can’t test its correctness. It’s entirely possible (even likely!) that my refactor introduced errors. It should therefore only be viewed as a solution sketch, and should by no means be copied and pasted into an application.

Count the no of shifts in insertion sort using Fenwick tree (Binary index tree) in C++

I was attempting to solve the hackerrank problem on insertion sort that asks to calculate the number of shifts while sorting the array using insertion sort - https://www.hackerrank.com/challenges/insertion-sort/leaderboard. Seeing on the disccussion forums I get to know that using concept of binary index tree , the solution can be optimized for this question. I even understood the concept of BIT using link - https://www.hackerearth.com/practice/notes/binary-indexed-tree-or-fenwick-tree/. But still I was not able to understand how it applies in this question. I may be sounding dumb , but any intutive explanation along with the below code that I found on the forum that uses BIT concept, but some more lucid explanation will help me a lot to actually grasp the solution.
/* Enter your code here. Read input from STDIN. Print output to STDOUT */
#include<iostream>
#include<stdio.h>
#include<string.h>
using namespace std ;
#define MAXN 100002
#define MAX 1000002
int n,a[MAXN],c[MAX] ;
int main()
{
int runs ;
scanf("%d",&runs) ;
while(runs--)
{
scanf("%d",&n) ;
for(int i = 0;i < n;i++) scanf("%d",&a[i]) ;
long long ret = 1LL * n * (n - 1) / 2 ;
memset(c,0,sizeof c) ;
for(int i = 0;i < n;i++)
{
for(int j = a[i];j > 0;j -= j & -j) ret -= c[j] ;
for(int j = a[i];j < MAX;j += j & -j) c[j]++ ;
}
cout << ret << endl ;
}
return 0 ;
}
it is in python. might it help you
def insertionSort(arr):
count = 0
bit = [0] * 10000001
bit_sum = 0
count = 0
for n in arr:
count += bit_sum
idx = n
while idx:
count -= bit[idx]
idx -= idx & -idx
idx = n
while idx < 10000001:
bit[idx] += 1
idx += idx & -idx
bit_sum += 1
print(count)
python code here

Portable efficient alternative to PDEP without using BMI2?

The documentation for the parallel deposit instruction (PDEP) in Intel's Bit Manipulation Instruction Set 2 (BMI2) describes the following serial implementation for the instruction (C-like pseudocode):
U64 _pdep_u64(U64 val, U64 mask) {
U64 res = 0;
for (U64 bb = 1; mask; bb += bb) {
if (val & bb)
res |= mask & -mask;
mask &= mask - 1;
}
return res;
}
See also Intel's pdep insn ref manual entry.
This algorithm is O(n), where n is the number of set bits in mask, which obviously has a worst case of O(k) where k is the total number of bits in mask.
Is a more efficient worst case algorithm possible?
Is it possible to make a faster version that assumes that val has at most one bit set, ie either equals 0 or equals 1<<r for some value of r from 0 to 63?
The second part of the question, about the special case of a 1-bit deposit, requires two steps. In the first step, we need to determine the bit index r of the single 1-bit in val, with a suitable response in case val is zero. This can easily be accomplished via the POSIX function ffs, or if r is known by other means, as alluded to by the asker in comments. In the second step we need to identify bit index i of the r-th 1-bit in mask, if it exists. We can then deposit the r-th bit of val at bit i.
One way of finding the index of the r-th 1-bit in mask is to tally the 1-bits using a classical population count algorithm based on binary partitioning, and record all of the intermediate group-wise bit counts. We then perform a binary search on the recorded bit-count data to identify the position of the desired bit.
The following C-code demonstrates this using 64-bit data. Whether this is actually faster than the iterative method will very much depend on typical values of mask and val.
#include <stdint.h>
/* Find the index of the n-th 1-bit in mask, n >= 0
The index of the least significant bit is 0
Return -1 if there is no such bit
*/
int find_nth_set_bit (uint64_t mask, int n)
{
int t, i = n, r = 0;
const uint64_t m1 = 0x5555555555555555ULL; // even bits
const uint64_t m2 = 0x3333333333333333ULL; // even 2-bit groups
const uint64_t m4 = 0x0f0f0f0f0f0f0f0fULL; // even nibbles
const uint64_t m8 = 0x00ff00ff00ff00ffULL; // even bytes
uint64_t c1 = mask;
uint64_t c2 = c1 - ((c1 >> 1) & m1);
uint64_t c4 = ((c2 >> 2) & m2) + (c2 & m2);
uint64_t c8 = ((c4 >> 4) + c4) & m4;
uint64_t c16 = ((c8 >> 8) + c8) & m8;
uint64_t c32 = (c16 >> 16) + c16;
int c64 = (int)(((c32 >> 32) + c32) & 0x7f);
t = (c32 ) & 0x3f; if (i >= t) { r += 32; i -= t; }
t = (c16>> r) & 0x1f; if (i >= t) { r += 16; i -= t; }
t = (c8 >> r) & 0x0f; if (i >= t) { r += 8; i -= t; }
t = (c4 >> r) & 0x07; if (i >= t) { r += 4; i -= t; }
t = (c2 >> r) & 0x03; if (i >= t) { r += 2; i -= t; }
t = (c1 >> r) & 0x01; if (i >= t) { r += 1; }
if (n >= c64) r = -1;
return r;
}
/* val is either zero or has a single 1-bit.
Return -1 if val is zero, otherwise the index of the 1-bit
The index of the least significant bit is 0
*/
int find_bit_index (uint64_t val)
{
return ffsll (val) - 1;
}
uint64_t deposit_single_bit (uint64_t val, uint64_t mask)
{
uint64_t res = (uint64_t)0;
int r = find_bit_index (val);
if (r >= 0) {
int i = find_nth_set_bit (mask, r);
if (i >= 0) res = (uint64_t)1 << i;
}
return res;
}

How many moves to reach a destination? Efficient flood filling

I want to compute the distance of cells from a destination cell, using number of four-way movements to reach something. So the the four cells immediately adjacent to the destination have a distance of 1, and those on the four cardinal directions of each of them have a distance of 2 and so on. There is a maximum distance that might be around 16 or 20, and there are cells that are occupied by barriers; the distance can flow around them but not through them.
I want to store the output into a 2D array, and I want to be able to compute this 'distance map' for any destination on a bigger maze map very quickly.
I am successfully doing it with a variation on a flood fill where the I place incremental distance of the adjacent unfilled cells in a priority queue (using C++ STL).
I am happy with the functionality and now want to focus on optimizing the code, as it is very performance sensitive.
What cunning and fast approaches might there be?
I think you have done everything right. If you coded it correct it takes O(n) time and O(n) memory to compute flood fill, where n is the number of cells, and it can be proven that it's impossible to do better (in general case). And after fill is complete you just return distance for any destination with O(1), it easy to see that it also can be done better.
So if you want to optimize performance, you can only focused on CODE LOCAL OPTIMIZATION. Which will not affect asymptotic but can significantly improve your real execution time. But it's hard to give you any advice for code optimization without actually seeing source.
So if you really want to see optimized code see the following (Pure C):
include
int* BFS()
{
int N, M; // Assume we have NxM grid.
int X, Y; // Start position. X, Y are unit based.
int i, j;
int movex[4] = {0, 0, 1, -1}; // Move on x dimension.
int movey[4] = {1, -1, 0, 0}; // Move on y dimension.
// TO DO: Read N, M, X, Y
// To reduce redundant functions calls and memory reallocation
// allocate all needed memory once and use a simple arrays.
int* map = (int*)malloc((N + 2) * (M + 2));
int leadDim = M + 2;
// Our map. We use one dimension array. map[x][y] = map[leadDim * x + y];
// If (x,y) is occupied then map[leadDim*x + y] = -1;
// If (x,y) is not visited map[leadDim*x + y] = -2;
int* queue = (int*)malloc(N*M);
int first = 0, last =1;
// Fill the boarders to simplify the code and reduce conditions
for (i = 0; i < N+2; ++i)
{
map[i * leadDim + 0] = -1;
map[i * leadDim + M + 1] = -1;
}
for (j = 0; j < M+2; ++j)
{
map[j] = -1;
map[(N + 1) * leadDim + j] = -1;
}
// TO DO: Read the map.
queue[first] = X * leadDim + Y;
map[X * leadDim + Y] = 0;
// Very simple optimized process loop.
while (first < last)
{
int current = queue[first];
int step = map[current];
for (i = 0; i < 4; ++i)
{
int temp = current + movex[i] * leadDim + movey[i];
if (map[temp] == -2) // only one condition in internal loop.
{
map[temp] = step + 1;
queue[last++] = temp;
}
}
++first;
}
free(queue);
return map;
}
Code may seems tricky. And of course, it doesn't look like OOP (I actually think that OOP fans will hate it) but if you want something really fast that's what you need.
It's common task for BFS. Complexity is O(cellsCount)
My c++ implementation:
vector<vector<int> > GetDistance(int x, int y, vector<vector<int> > cells)
{
const int INF = 0x7FFFFF;
vector<vector<int> > distance(cells.size());
for(int i = 0; i < distance.size(); i++)
distance[i].assign(cells[i].size(), INF);
queue<pair<int, int> > q;
q.push(make_pair(x, y));
distance[x][y] = 0;
while(!q.empty())
{
pair<int, int> curPoint = q.front();
q.pop();
int curDistance = distance[curPoint.first][curPoint.second];
for(int i = -1; i <= 1; i++)
for(int j = -1; j <= 1; j++)
{
if( (i + j) % 2 == 0 ) continue;
pair<int, int> nextPoint(curPoint.first + i, curPoint.second + j);
if(nextPoint.first >= 0 && nextPoint.first < cells.size()
&& nextPoint.second >= 0 && nextPoint.second < cells[nextPoint.first].size()
&& cells[nextPoint.first][nextPoint.second] != BARRIER
&& distance[nextPoint.first][nextPoint.second] > curDistance + 1)
{
distance[nextPoint.first][nextPoint.second] = curDistance + 1;
q.push(nextPoint);
}
}
}
return distance;
}
Start with a recursive implementation: (untested code)
int visit( int xy, int dist) {
int ret =1;
if (array[xy] <= dist) return 0;
array[xy] = dist;
if (dist == maxdist) return ret;
ret += visit ( RIGHT(xy) , dist+1);
...
same for left, up, down
...
return ret;
}
You'l need to handle the initalisation and the edge-cases. And you have to decide if you want a two dimentional array or a one dimensonal array.
A next step could be to use a todo list and remove the recursion, and a third step could be to add some bitmasking.
8-bit computers in the 1970s did this with an optimization that has the same algorithmic complexity, but in the typical case is much faster on actual hardware.
Starting from the initial square, scan to the left and right until "walls" are found. Now you have a "span" that is one square tall and N squares wide. Mark the span as "filled," in this case each square with the distance to the initial square.
For each square above and below the current span, if it's not a "wall" or already filled, pick it as the new origin of a span.
Repeat until no new spans are found.
Since horizontal rows tend to be stored contiguously in memory, this algorithm tends to thrash the cache far less than one that has no bias for horizontal searches.
Also, since in the most common cases far fewer items are pushed and popped from a stack (spans instead of individual blocks) there is less time spent maintaining the stack.

A Cache Efficient Matrix Transpose Program?

So the obvious way to transpose a matrix is to use :
for( int i = 0; i < n; i++ )
for( int j = 0; j < n; j++ )
destination[j+i*n] = source[i+j*n];
but I want something that will take advantage of locality and cache blocking. I was looking it up and can't find code that would do this, but I'm told it should be a very simple modification to the original. Any ideas?
Edit: I have a 2000x2000 matrix, and I want to know how can I change the code using two for loops, basically splitting the matrix into blocks that I transpose individually, say 2x2 blocks, or 40x40 blocks, and see which block size is most efficient.
Edit2: The matrices are stored in column major order, that is to say for a matrix
a1 a2
a3 a4
is stored as a1 a3 a2 a4.
You're probably going to want four loops - two to iterate over the blocks, and then another two to perform the transpose-copy of a single block. Assuming for simplicity a block size that divides the size of the matrix, something like this I think, although I'd want to draw some pictures on the backs of envelopes to be sure:
for (int i = 0; i < n; i += blocksize) {
for (int j = 0; j < n; j += blocksize) {
// transpose the block beginning at [i,j]
for (int k = i; k < i + blocksize; ++k) {
for (int l = j; l < j + blocksize; ++l) {
dst[k + l*n] = src[l + k*n];
}
}
}
}
An important further insight is that there's actually a cache-oblivious algorithm for this (see http://en.wikipedia.org/wiki/Cache-oblivious_algorithm, which uses this exact problem as an example). The informal definition of "cache-oblivious" is that you don't need to experiment tweaking any parameters (in this case the blocksize) in order to hit good/optimal cache performance. The solution in this case is to transpose by recursively dividing the matrix in half, and transposing the halves into their correct position in the destination.
Whatever the cache size actually is, this recursion takes advantage of it. I expect there's a bit of extra management overhead compared with your strategy, which is to use performance experiments to, in effect, jump straight to the point in the recursion at which the cache really kicks in, and go no further. On the other hand, your performance experiments might give you an answer that works on your machine but not on your customers' machines.
I had the exact same problem yesterday.
I ended up with this solution:
void transpose(double *dst, const double *src, size_t n, size_t p) noexcept {
THROWS();
size_t block = 32;
for (size_t i = 0; i < n; i += block) {
for(size_t j = 0; j < p; ++j) {
for(size_t b = 0; b < block && i + b < n; ++b) {
dst[j*n + i + b] = src[(i + b)*p + j];
}
}
}
}
This is 4 time faster than the obvious solution on my machine.
This solution takes care of a rectangular matrix with dimensions which are not a multiple of the block size.
if dst and src are the same square matrix an in place function should really be used instead:
void transpose(double*m,size_t n)noexcept{
size_t block=0,size=8;
for(block=0;block+size-1<n;block+=size){
for(size_t i=block;i<block+size;++i){
for(size_t j=i+1;j<block+size;++j){
std::swap(m[i*n+j],m[j*n+i]);}}
for(size_t i=block+size;i<n;++i){
for(size_t j=block;j<block+size;++j){
std::swap(m[i*n+j],m[j*n+i]);}}}
for(size_t i=block;i<n;++i){
for(size_t j=i+1;j<n;++j){
std::swap(m[i*n+j],m[j*n+i]);}}}
I used C++11 but this could be easily translated in other languages.
Instead of transposing the matrix in memory, why not collapse the transposition operation into the next operation you're going to do on the matrix?
Steve Jessop mentioned a cache oblivious matrix transpose algorithm.
For the record, I want to share an possible implementation of a cache oblivious matrix transpose.
public class Matrix {
protected double data[];
protected int rows, columns;
public Matrix(int rows, int columns) {
this.rows = rows;
this.columns = columns;
this.data = new double[rows * columns];
}
public Matrix transpose() {
Matrix C = new Matrix(columns, rows);
cachetranspose(0, rows, 0, columns, C);
return C;
}
public void cachetranspose(int rb, int re, int cb, int ce, Matrix T) {
int r = re - rb, c = ce - cb;
if (r <= 16 && c <= 16) {
for (int i = rb; i < re; i++) {
for (int j = cb; j < ce; j++) {
T.data[j * rows + i] = data[i * columns + j];
}
}
} else if (r >= c) {
cachetranspose(rb, rb + (r / 2), cb, ce, T);
cachetranspose(rb + (r / 2), re, cb, ce, T);
} else {
cachetranspose(rb, re, cb, cb + (c / 2), T);
cachetranspose(rb, re, cb + (c / 2), ce, T);
}
}
}
More details on cache oblivious algorithms can be found here.
Matrix multiplication comes to mind, but the cache issue there is much more pronounced, because each element is read N times.
With matrix transpose, you are reading in a single linear pass and there's no way to optimize that. But you can simultaneously process several rows so that you write several columns and so fill complete cache lines. You will only need three loops.
Or do it the other way around and read in columns while writing linearly.
With a large matrix, possibly a large sparse matrix, it might be an idea to decompose it into smaller cache friendly chunks (Say, 4x4 sub matrices). You can also flag sub matrices as identity which will help you in creating optimized code paths.

Resources