A Cache Efficient Matrix Transpose Program?

So the obvious way to transpose a matrix is to use:
for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        destination[j + i*n] = source[i + j*n];
but I want something that will take advantage of locality and cache blocking. I was looking it up and can't find code that would do this, but I'm told it should be a very simple modification to the original. Any ideas?
Edit: I have a 2000x2000 matrix, and I want to know how I can change the code using two for loops, basically splitting the matrix into blocks that I transpose individually, say 2x2 blocks or 40x40 blocks, and see which block size is most efficient.
Edit2: The matrices are stored in column-major order, that is to say for a matrix
a1 a2
a3 a4
is stored as a1 a3 a2 a4.

You're probably going to want four loops - two to iterate over the blocks, and then another two to perform the transpose-copy of a single block. Assuming for simplicity a block size that divides the size of the matrix, something like this I think, although I'd want to draw some pictures on the backs of envelopes to be sure:
for (int i = 0; i < n; i += blocksize) {
    for (int j = 0; j < n; j += blocksize) {
        // transpose the block beginning at [i,j]
        for (int k = i; k < i + blocksize; ++k) {
            for (int l = j; l < j + blocksize; ++l) {
                dst[k + l*n] = src[l + k*n];
            }
        }
    }
}
An important further insight is that there's actually a cache-oblivious algorithm for this (see http://en.wikipedia.org/wiki/Cache-oblivious_algorithm, which uses this exact problem as an example). The informal definition of "cache-oblivious" is that you don't need to experiment tweaking any parameters (in this case the blocksize) in order to hit good/optimal cache performance. The solution in this case is to transpose by recursively dividing the matrix in half, and transposing the halves into their correct position in the destination.
Whatever the cache size actually is, this recursion takes advantage of it. I expect there's a bit of extra management overhead compared with your strategy, which is to use performance experiments to, in effect, jump straight to the point in the recursion at which the cache really kicks in, and go no further. On the other hand, your performance experiments might give you an answer that works on your machine but not on your customers' machines.
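For illustration, here is a minimal sketch of that recursive scheme (untested, and assuming the column-major layout from the question; the base-case tile size of 16 is an arbitrary small constant rather than a tuned parameter, which is the whole point):
// Cache-oblivious sketch: transpose the rows x cols submatrix of src starting
// at (r0, c0) into dst. n is the full matrix dimension; element (r, c) lives
// at index r + c*n (column-major).
void transpose_rec(double *dst, const double *src, int n,
                   int r0, int c0, int rows, int cols) {
    if (rows <= 16 && cols <= 16) {
        // Base case: a tile small enough to be cache-friendly on any machine.
        for (int c = c0; c < c0 + cols; ++c)
            for (int r = r0; r < r0 + rows; ++r)
                dst[c + r*n] = src[r + c*n];
    } else if (rows >= cols) {
        // Split the longer dimension in half and recurse on both halves.
        transpose_rec(dst, src, n, r0, c0, rows / 2, cols);
        transpose_rec(dst, src, n, r0 + rows / 2, c0, rows - rows / 2, cols);
    } else {
        transpose_rec(dst, src, n, r0, c0, rows, cols / 2);
        transpose_rec(dst, src, n, r0, c0 + cols / 2, rows, cols - cols / 2);
    }
}
// Whole matrix: transpose_rec(dst, src, n, 0, 0, n, n);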

I had the exact same problem yesterday.
I ended up with this solution:
void transpose(double *dst, const double *src, size_t n, size_t p) noexcept {
    const size_t block = 32;
    for (size_t i = 0; i < n; i += block) {
        for (size_t j = 0; j < p; ++j) {
            for (size_t b = 0; b < block && i + b < n; ++b) {
                dst[j*n + i + b] = src[(i + b)*p + j];
            }
        }
    }
}
This is 4 times faster than the obvious solution on my machine.
This solution handles a rectangular matrix whose dimensions are not a multiple of the block size.
If dst and src are the same square matrix, an in-place function should really be used instead:
#include <cstddef> // size_t
#include <utility> // std::swap

void transpose(double *m, size_t n) noexcept {
    size_t block = 0, size = 8;
    // Transpose each diagonal block, then swap the off-diagonal blocks
    // below it against their mirror images across the diagonal.
    for (block = 0; block + size - 1 < n; block += size) {
        for (size_t i = block; i < block + size; ++i)
            for (size_t j = i + 1; j < block + size; ++j)
                std::swap(m[i*n + j], m[j*n + i]);
        for (size_t i = block + size; i < n; ++i)
            for (size_t j = block; j < block + size; ++j)
                std::swap(m[i*n + j], m[j*n + i]);
    }
    // Handle the remainder when n is not a multiple of the block size.
    for (size_t i = block; i < n; ++i)
        for (size_t j = i + 1; j < n; ++j)
            std::swap(m[i*n + j], m[j*n + i]);
}
I used C++11, but this could easily be translated into other languages.

Instead of transposing the matrix in memory, why not collapse the transposition operation into the next operation you're going to do on the matrix?
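As a sketch of that idea (a hypothetical matrix-vector product, column-major as in the question): to compute y = A^T x, don't build A^T at all, just swap the index roles in the next operation:
// y = A^T * x without materializing A^T. A is n x n, column-major, so
// A(j, i) = a[j + i*n] and column i of A is contiguous in memory.
for (int i = 0; i < n; ++i) {
    double s = 0.0;
    for (int j = 0; j < n; ++j)
        s += a[j + i*n] * x[j];   // walks column i of A linearly
    y[i] = s;
}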

Steve Jessop mentioned a cache-oblivious matrix transpose algorithm.
For the record, I want to share a possible implementation of a cache-oblivious matrix transpose.
public class Matrix {
    protected double data[];
    protected int rows, columns;

    public Matrix(int rows, int columns) {
        this.rows = rows;
        this.columns = columns;
        this.data = new double[rows * columns];
    }

    public Matrix transpose() {
        Matrix C = new Matrix(columns, rows);
        cachetranspose(0, rows, 0, columns, C);
        return C;
    }

    public void cachetranspose(int rb, int re, int cb, int ce, Matrix T) {
        int r = re - rb, c = ce - cb;
        if (r <= 16 && c <= 16) {
            for (int i = rb; i < re; i++) {
                for (int j = cb; j < ce; j++) {
                    T.data[j * rows + i] = data[i * columns + j];
                }
            }
        } else if (r >= c) {
            cachetranspose(rb, rb + (r / 2), cb, ce, T);
            cachetranspose(rb + (r / 2), re, cb, ce, T);
        } else {
            cachetranspose(rb, re, cb, cb + (c / 2), T);
            cachetranspose(rb, re, cb + (c / 2), ce, T);
        }
    }
}
More details on cache-oblivious algorithms can be found in the Wikipedia article linked in Steve Jessop's answer above.

Matrix multiplication comes to mind, but the cache issue there is much more pronounced, because each element is read N times.
With matrix transpose, you are reading in a single linear pass and there's no way to optimize that. But you can simultaneously process several rows so that you write several columns and so fill complete cache lines. You will only need three loops.
Or do it the other way around and read in columns while writing linearly.
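A sketch of the first variant (row-major here for concreteness; the strip width B is an assumed tuning parameter, and the result is essentially the blocked loop shown in the answer above):
// Process B source rows per strip: the innermost loop writes B consecutive
// elements of one destination row, filling complete cache lines on the write
// side, while each of the B active source rows is read linearly.
const int B = 32; // assumed strip width, tune to the cache line / cache size
for (int i = 0; i < n; i += B)
    for (int j = 0; j < n; ++j)
        for (int b = 0; b < B && i + b < n; ++b)
            dst[j*n + i + b] = src[(i + b)*n + j];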

With a large matrix, possibly a large sparse matrix, it might be an idea to decompose it into smaller cache-friendly chunks (say, 4x4 sub-matrices). You can also flag sub-matrices as identity, which will help you create optimized code paths.

Related

Merge sort gives poor efficiency and isn't affected by compiler optimizations

While trying to measure the time various sorting algorithms require to sort a random array of unsigned integers, I've observed some peculiar behavior in top-down Merge sort that doesn't seem to be caused by a bad implementation.
On arrays of length up to 1 million values, Merge sort behaves a lot worse than random-pivot Quicksort and even Shell sort. This is unexpected, so I've tried multiple online implementations of Merge sort, but the result still seems to be about the same.
Graph 1, optimizations ON
This is the implementation I used for these graphs:
void merge(int *array, int l, int m, int r) {
    int i, j, k, nl, nr;
    nl = m - l + 1; nr = r - m;
    int *larr = new int[nl], *rarr = new int[nr];
    for (i = 0; i < nl; i++)
        larr[i] = array[l + i];
    for (j = 0; j < nr; j++)
        rarr[j] = array[m + 1 + j];
    i = 0; j = 0; k = l;
    while (i < nl && j < nr) {
        if (larr[i] <= rarr[j]) {
            array[k] = larr[i];
            i++;
        }
        else {
            array[k] = rarr[j];
            j++;
        }
        k++;
    }
    while (i < nl) {
        array[k] = larr[i];
        i++; k++;
    }
    while (j < nr) {
        array[k] = rarr[j];
        j++; k++;
    }
    delete[] larr;
    delete[] rarr;
}

void mergeSort(int *array, int l, int r) {
    if (l < r) {
        int m = l + (r - l) / 2;
        mergeSort(array, l, m);
        mergeSort(array, m + 1, r);
        merge(array, l, m, r);
    }
}
I have also tried removing compiler optimizations (Visual C++ 15), favoring size instead of speed, and this seems to have affected all the other algorithms but not Merge sort. Nonetheless, it still got the worst time.
Graph 2, optimizations OFF
The only time Merge sort didn't give the worst time was in a test with arrays of 15 million elements, where it got slightly better performance than Heap sort, but still far from the others.
The values that I plot are the averages of 100 tests with random arrays, so I don't think this is just a particular case. I also don't think the use of dynamic memory in Merge sort is the cause of these results; 16 GB of RAM is plenty for these tests and everything else.
Does anybody know why Merge sort behaves so badly and why compiler optimizations don't seem to affect it?

Faster way to find overlapping position in 2D Arrays

Suppose there are two 2D matrices, A and B:
A = [
[1,2,3],
[4,5,6],
[7,8,9]
]
B = [
[2,3,9],
[5,6,7],
[8,9,0]
]
A and B have the same dimensions, and their elements are 64-bit integers.
We can see that matrix A overlaps with B starting from the 2nd column of A. So the overlap position is 1 and the overlap length is 2.
A naive approach would be as follows:
int N = 3; // rows
int M = 3; // cols
int A[3][3] = {
    {1,2,3},
    {4,5,6},
    {7,8,9}
};
int B[3][3] = {
    {2,3,9},
    {5,6,7},
    {8,9,0}
};

// returns the overlap position, or -1 if there is none
int findOverlap() {
    for (int i = 0; i < M; i++) {
        int j = 0;
        for (j = 0; j < M - i; j++) {
            int k = 0;
            for (k = 0; k < N; k++) {
                if (A[k][i + j] != B[k][j]) {
                    break;
                }
            }
            if (k < N) break;
        }
        if (j == M - i) return i;
    }
    return -1;
}
A practical approach is to replace the matrices A and B with vectors a and b containing hashes of each column. Then you check for overlaps in the vectors, and only when you find one do you check to see if the full matrices match with the same overlap.
If your hash function is decent, then the probability of failing a full matrix check will be low.
To find the vector overlaps, you can use a similar strategy, checking hashes of suffixes of a against hashes of prefixes of b, and only checking the full vectors, and then full matrices, when they match.
To make this an optimization, you need to be able to calculate those prefix and suffix hashes incrementally, so you can get the hash of the next suffix by adding one character to the hash of the previous suffix in constant time. A common polynomial hash function makes that pretty easy.
For example, if your hash function is:
h = 0;
for item in vec:
    h = h*31 + item;
return h;
Then you have hash(concat(x,y)) = hash(x)*(31^y.length) + hash(y)
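As a sketch of the incremental computation (hypothetical helper functions for vectors a and b of column hashes; unsigned 64-bit arithmetic is used so the multiplications wrap harmlessly, as is usual for polynomial hashes):
#include <cstdint>
#include <vector>

// Prefix hashes of b: h[j] = hash(b[0..j-1]); each extends the previous in O(1).
std::vector<uint64_t> prefixHashes(const std::vector<uint64_t>& b) {
    std::vector<uint64_t> h(b.size() + 1, 0);
    for (size_t j = 0; j < b.size(); ++j)
        h[j + 1] = h[j] * 31 + b[j];
    return h;
}

// Suffix hashes of a: h[i] = hash(a[i..n-1]). By the concat identity above,
// hash(a[i..n-1]) = a[i] * 31^(n-1-i) + hash(a[i+1..n-1]); the power is
// maintained incrementally while sweeping right to left.
std::vector<uint64_t> suffixHashes(const std::vector<uint64_t>& a) {
    size_t n = a.size();
    std::vector<uint64_t> h(n + 1, 0);
    uint64_t pw = 1;
    for (size_t i = n; i-- > 0; pw *= 31)
        h[i] = a[i] * pw + h[i + 1];
    return h;
}
An overlap at position i is then a candidate exactly when suffixHashes(a)[i] == prefixHashes(b)[n - i], to be confirmed by comparing the full columns.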

How to print values in a memoization method - Dynamic programming

I know that a problem which can be solved using DP can be solved by either a tabulation (bottom-up) approach or a memoization (top-down) approach. Personally, I find memoization the easier and even the more efficient approach (analysis is required just to get the recursive formula; once the recursive formula is obtained, a brute-force recursive method can easily be converted to store each sub-problem's result and reuse it). The only problem I am facing with this approach is that I am not able to construct the actual result from the table that I filled on demand.
For example, in the Matrix Product Parenthesization problem (deciding in which order to perform the multiplications on matrices so that the cost of multiplication is minimum), I am able to calculate the minimum cost but not able to generate the order in the algorithm.
For example, suppose A is a 10 × 30 matrix, B is a 30 × 5 matrix, and C is a 5 × 60 matrix. Then,
(AB)C = (10×30×5) + (10×5×60) = 1500 + 3000 = 4500 operations
A(BC) = (30×5×60) + (10×30×60) = 9000 + 18000 = 27000 operations.
Here I am able to get the min cost, 4500, but unable to get the order, which is (AB)C.
I used this. Suppose F[i, j] represents the least number of multiplications needed to multiply Ai...Aj, and an array p[] is given which represents the chain of matrices such that the ith matrix Ai has dimensions p[i-1] x p[i]. So
F[i,j] = 0                                                           if i = j
F[i,j] = min{ F[i,k] + F[k+1,j] + p[i-1]*p[k]*p[j] : k in [i, j) }   otherwise
Below is the implementation that I have created.
#include <stdio.h>
#include <limits.h>
#include <string.h>
#define MAX 5 // must exceed the largest index used: lookup[1..n-1][1..n-1]

int lookup[MAX][MAX];

int MatrixChainOrder(int p[], int i, int j)
{
    if (i == j) return 0;
    int min = INT_MAX;
    int k, count;
    if (lookup[i][j] == 0) {
        // recursively calculate the counts of multiplications and keep the minimum
        for (k = i; k < j; k++) {
            if (lookup[i][k] == 0)
                lookup[i][k] = MatrixChainOrder(p, i, k);
            if (lookup[k+1][j] == 0)
                lookup[k+1][j] = MatrixChainOrder(p, k+1, j);
            count = lookup[i][k] + lookup[k+1][j] + p[i-1]*p[k]*p[j];
            if (count < min) {
                min = count;
                printf("\n****%d ", k); // I think something has to be done here to
                                        // represent the correct answer, e.g. ((AB)C)D,
                                        // where the first matrix is A, the second B, and so on.
            }
        }
        lookup[i][j] = min;
    }
    return lookup[i][j];
}

// Driver program to test the above function
int main()
{
    int arr[] = {2,3,6,4,5};
    int n = sizeof(arr)/sizeof(arr[0]);
    memset(lookup, 0, sizeof(lookup));
    int width = 10;
    printf("Minimum number of multiplications is %d ", MatrixChainOrder(arr, 1, n-1));
    printf("\n ---->");
    for (int l = 0; l < MAX; ++l)
        printf(" %*d ", width, l);
    printf("\n");
    for (int z = 0; z < MAX; z++) {
        printf("\n %d--->", z);
        for (int x = 0; x < MAX; x++)
            printf(" %*d ", width, lookup[z][x]);
    }
    return 0;
}
I know that with the tabulation approach printing the solution is much easier, but I want to do it with the memoization technique.
Thanks.
Your code correctly computes the minimum number of multiplications, but you're struggling to display the optimal chain of matrix multiplications.
There are two possibilities:
When you compute the table, you can store the best index found in another memoization array.
You can recompute the optimal splitting points from the results in the memoization array.
The first would involve creating the split points in a separate array:
int lookup_splits[MAX][MAX];
And then updating it inside your MatrixChainOrder function:
...
if (count < min) {
    min = count;
    lookup_splits[i][j] = k;
}
You can then generate the multiplication chain recursively like this:
void print_mult_chain(int i, int j) {
    if (i == j) {
        putchar('A' + i - 1);
        return;
    }
    putchar('(');
    print_mult_chain(i, lookup_splits[i][j]);
    print_mult_chain(lookup_splits[i][j] + 1, j);
    putchar(')');
}
You can call the function with print_mult_chain(1, n - 1) from main.
The second possibility is that you don't cache lookup_splits and recompute it as necessary.
int get_lookup_splits(int p[], int i, int j) {
    int best = INT_MAX;
    int k_best = i;
    for (int k = i; k < j; k++) {
        int count = lookup[i][k] + lookup[k+1][j] + p[i-1]*p[k]*p[j];
        if (count < best) {
            best = count;
            k_best = k;
        }
    }
    return k_best;
}
This is essentially the same computation you did inside MatrixChainOrder, so if you go with this solution you should factor the code appropriately to avoid having two copies.
With this function, you can adapt print_mult_chain above to use it rather than the lookup_splits array. (You'll need to pass the p array in).
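Adapted that way, the printer might look like this (a sketch using the recomputing helper above; note that it now takes p):
void print_mult_chain(int p[], int i, int j) {
    if (i == j) {
        putchar('A' + i - 1);
        return;
    }
    int k = get_lookup_splits(p, i, j); // recompute the split point on demand
    putchar('(');
    print_mult_chain(p, i, k);
    print_mult_chain(p, k + 1, j);
    putchar(')');
}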
[None of this code is tested, so you may need to edit the answer to fix bugs].

Backtracking optimization

Recently I was trying to solve the famous little bishops algorithmic problem. On one of the websites I read that I should divide the chessboard into black and white parts to optimize the execution. After that, I should use backtracking to count the number of possible ways to put bishops on black squares and white squares separately.
In the following code I try to put 6 bishops ONLY ON WHITE squares of an 8 by 8 chessboard. I do it only to verify that the technique is really working.
// inside main function
int k = 6; // number of bishops
int n = 8; // length of one side of chessboard
Integer[] positions = new Integer[k];
long result = backtrack(positions, 0, n);
// find how many times we double-count each possible combination of bishops
int factor = 1;
for (int i = k; i > 0; i--) {
    factor = factor * i;
}
System.out.println("The result is " + result/factor);
//implementation of other functions
public long backtrack(Integer[] prevPositions, int k, int n) {
    if (k == 6) {
        return 1;
    }
    long sum = 0;
    Integer[] candidates = new Integer[n*n];
    int length = getCandidates(prevPositions, k, candidates, n);
    for (int i = 0; i < length; i++) {
        prevPositions[k] = candidates[i];
        sum += backtrack(prevPositions, k+1, n);
    }
    return sum;
}
public Integer getCandidates(Integer[] prevPositions, int k, Integer[] candidates, int n) {
    int length = 0;
    // only white squares are considered as candidates, hence i += 2
    for (int i = 0; i < n*n; i += 2) {
        boolean isGood = true;
        int iRow = i / n;
        int iCol = i % n;
        for (int j = 0; j < k; j++) {
            int prev = prevPositions[j];
            if (i == prev) {
                isGood = false;
                break;
            } else {
                int prevRow = prev / n;
                int prevCol = prev % n;
                if (Math.abs(iRow - prevRow) == Math.abs(iCol - prevCol)) {
                    isGood = false;
                    break;
                }
            }
        }
        if (isGood) {
            candidates[length] = new Integer(i);
            length++;
        }
    }
    return length;
}
Even though I can see why dividing the chessboard into white and black squares optimizes the problem, it still takes around 11 seconds to count the number of possible ways to put all bishops ONLY ON WHITE SQUARES. Can you help me please? What am I doing wrong?
Here are a few ways to improve your search.
(1) Instead of generate-and-test, you could consider finite domain search, where every bishop has a "domain" of possible places. Whenever you place a bishop, you prune the domains of the remaining bishops. If a bishop's domain becomes empty, you must backtrack.
(2) As a refinement, if you have n bishops to place and m < n places left, you must backtrack.
(3) Use dynamic programming/memoization, where you store solutions for 1 bishop, 2 bishops, ..., and compute the set of n + 1 bishop solutions from the set of n bishop solutions.
(4) Exploit symmetry to reduce your search space. In this case there is (at least) black/white symmetry and rotational/reflective symmetry.
(5) Try to find a better representation. For example, bit patterns (see the sketch after this list).
(6) If you use a different representation, look into using a "trail" (cf. Prolog) to track the operations you need to undo on backtracking.
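For example, here is a hypothetical C++ sketch of (5), keeping the question's even-index candidate squares: two bitmasks record the attacked diagonals, so the per-square test is O(1), and trying squares in increasing order counts each set exactly once, which also removes the division by k!:
#include <cstdint>

// Sketch: count placements of `left` bishops on the even-index squares of an
// n x n board. diag1/diag2 hold the attacked r+c and r-c+(n-1) diagonals as
// bit patterns.
long long place(int sq, int left, int n, uint32_t diag1, uint32_t diag2) {
    if (left == 0) return 1;
    long long sum = 0;
    for (; sq < n * n; sq += 2) {            // even-index squares, ascending
        int r = sq / n, c = sq % n;
        uint32_t d1 = 1u << (r + c);
        uint32_t d2 = 1u << (r - c + n - 1);
        if (!(diag1 & d1) && !(diag2 & d2))
            sum += place(sq + 2, left - 1, n, diag1 | d1, diag2 | d2);
    }
    return sum;
}
// Usage: place(0, 6, 8, 0, 0) -- no division by 6! afterwards.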
Cheers!

How many moves to reach a destination? Efficient flood filling

I want to compute the distance of cells from a destination cell, using the number of four-way movements needed to reach it. So the four cells immediately adjacent to the destination have a distance of 1, those in the four cardinal directions from each of them have a distance of 2, and so on. There is a maximum distance that might be around 16 or 20, and there are cells that are occupied by barriers; the distance can flow around them but not through them.
I want to store the output into a 2D array, and I want to be able to compute this 'distance map' for any destination on a bigger maze map very quickly.
I am successfully doing it with a variation on flood fill, where I place the incremental distances of the adjacent unfilled cells in a priority queue (using the C++ STL).
I am happy with the functionality and now want to focus on optimizing the code, as it is very performance sensitive.
What cunning and fast approaches might there be?
I think you have done everything right. If you coded it correctly, it takes O(n) time and O(n) memory to compute the flood fill, where n is the number of cells, and it can be proven that it's impossible to do better (in the general case). And after the fill is complete, you just return the distance for any destination in O(1); it is easy to see that this, too, cannot be done better.
So if you want to optimize performance, you can only focus on LOCAL CODE OPTIMIZATION, which will not affect the asymptotics but can significantly improve your real execution time. But it's hard to give any advice for code optimization without actually seeing the source.
So if you really want to see optimized code, see the following (pure C):
#include <stdio.h>
#include <stdlib.h>

int* BFS()
{
    int N, M; // Assume we have an N x M grid.
    int X, Y; // Start position. X, Y are unit based.
    int i, j;
    int movex[4] = {0, 0, 1, -1}; // Move on x dimension.
    int movey[4] = {1, -1, 0, 0}; // Move on y dimension.

    // TO DO: Read N, M, X, Y

    // To reduce redundant function calls and memory reallocation,
    // allocate all needed memory once and use simple arrays.
    int* map = (int*)malloc((N + 2) * (M + 2) * sizeof(int));
    int leadDim = M + 2;
    // Our map. We use a one dimensional array. map[x][y] = map[leadDim * x + y];
    // If (x,y) is occupied then map[leadDim*x + y] = -1;
    // If (x,y) is not visited map[leadDim*x + y] = -2;
    int* queue = (int*)malloc(N * M * sizeof(int));
    int first = 0, last = 1;

    // Fill the borders to simplify the code and reduce conditions
    for (i = 0; i < N + 2; ++i)
    {
        map[i * leadDim + 0] = -1;
        map[i * leadDim + M + 1] = -1;
    }
    for (j = 0; j < M + 2; ++j)
    {
        map[j] = -1;
        map[(N + 1) * leadDim + j] = -1;
    }

    // TO DO: Read the map.

    queue[first] = X * leadDim + Y;
    map[X * leadDim + Y] = 0;

    // Very simple optimized process loop.
    while (first < last)
    {
        int current = queue[first];
        int step = map[current];
        for (i = 0; i < 4; ++i)
        {
            int temp = current + movex[i] * leadDim + movey[i];
            if (map[temp] == -2) // only one condition in the inner loop
            {
                map[temp] = step + 1;
                queue[last++] = temp;
            }
        }
        ++first;
    }

    free(queue);
    return map;
}
The code may seem tricky, and of course it doesn't look like OOP (I actually think OOP fans will hate it), but if you want something really fast, that's what you need.
It's a common task for BFS. The complexity is O(cellsCount).
My C++ implementation:
#include <vector>
#include <queue>
#include <utility>
using namespace std;

const int BARRIER = -1; // assumed encoding for blocked cells in `cells`

vector<vector<int> > GetDistance(int x, int y, vector<vector<int> > cells)
{
    const int INF = 0x7FFFFF;
    vector<vector<int> > distance(cells.size());
    for (int i = 0; i < (int)distance.size(); i++)
        distance[i].assign(cells[i].size(), INF);
    queue<pair<int, int> > q;
    q.push(make_pair(x, y));
    distance[x][y] = 0;
    while (!q.empty())
    {
        pair<int, int> curPoint = q.front();
        q.pop();
        int curDistance = distance[curPoint.first][curPoint.second];
        for (int i = -1; i <= 1; i++)
            for (int j = -1; j <= 1; j++)
            {
                if ((i + j) % 2 == 0) continue; // keep only the 4 orthogonal moves
                pair<int, int> nextPoint(curPoint.first + i, curPoint.second + j);
                if (nextPoint.first >= 0 && nextPoint.first < (int)cells.size()
                    && nextPoint.second >= 0 && nextPoint.second < (int)cells[nextPoint.first].size()
                    && cells[nextPoint.first][nextPoint.second] != BARRIER
                    && distance[nextPoint.first][nextPoint.second] > curDistance + 1)
                {
                    distance[nextPoint.first][nextPoint.second] = curDistance + 1;
                    q.push(nextPoint);
                }
            }
    }
    return distance;
}
Start with a recursive implementation: (untested code)
int visit(int xy, int dist) {
    int ret = 1;
    if (array[xy] <= dist) return 0;
    array[xy] = dist;
    if (dist == maxdist) return ret;
    ret += visit(RIGHT(xy), dist + 1);
    // ... same for left, up, down ...
    return ret;
}
You'll need to handle the initialisation and the edge cases. And you have to decide if you want a two-dimensional array or a one-dimensional array.
A next step could be to use a todo list and remove the recursion, and a third step could be to add some bitmasking.
8-bit computers in the 1970s did this with an optimization that has the same algorithmic complexity, but in the typical case is much faster on actual hardware.
Starting from the initial square, scan to the left and right until "walls" are found. Now you have a "span" that is one square tall and N squares wide. Mark the span as "filled," in this case each square with the distance to the initial square.
For each square above and below the current span, if it's not a "wall" or already filled, pick it as the new origin of a span.
Repeat until no new spans are found.
Since horizontal rows tend to be stored contiguously in memory, this algorithm tends to thrash the cache far less than one that has no bias for horizontal searches.
Also, since in the most common cases far fewer items are pushed and popped from a stack (spans instead of individual blocks) there is less time spent maintaining the stack.
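A minimal sketch of the span idea (region fill only; the per-square distance bookkeeping and the classic one-seed-per-run refinement are omitted for brevity, and WALL/FILLED are assumed cell markers):
#include <vector>
#include <utility>

// Scanline (span) fill: flood the 4-connected region containing (r0, c0).
// Whole horizontal runs are filled at once and used as seeds, so memory is
// touched row-wise and far fewer items pass through the stack than with a
// cell-by-cell fill.
void spanFill(std::vector<std::vector<int>>& grid, int r0, int c0,
              int WALL, int FILLED) {
    int H = (int)grid.size(), W = (int)grid[0].size();
    std::vector<std::pair<int, int>> stack = { {r0, c0} };
    while (!stack.empty()) {
        auto [r, c] = stack.back();
        stack.pop_back();
        if (grid[r][c] == WALL || grid[r][c] == FILLED) continue;
        int left = c, right = c;                      // scan out to the walls
        while (left > 0 && grid[r][left - 1] != WALL && grid[r][left - 1] != FILLED)
            --left;
        while (right + 1 < W && grid[r][right + 1] != WALL && grid[r][right + 1] != FILLED)
            ++right;
        for (int x = left; x <= right; ++x)           // mark the whole span
            grid[r][x] = FILLED;
        for (int x = left; x <= right; ++x) {         // seed rows above/below
            if (r > 0 && grid[r - 1][x] != WALL && grid[r - 1][x] != FILLED)
                stack.push_back({r - 1, x});
            if (r + 1 < H && grid[r + 1][x] != WALL && grid[r + 1][x] != FILLED)
                stack.push_back({r + 1, x});
        }
    }
}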
