Accelerating incomplete LDL^T preconditioner with OpenACC - openacc

Original question:
I am relatively new to OpenACC, but have so far managed to accelerate my suite of iterative Fortran solvers (based on CG) with relative success, and am getting speed-ups of around 7x on my Nvidia GeForce RTX card. All works fine if no preconditioning is used, or if diagonal preconditioning is used. But problems start when I want to speed up slightly more sophisticated preconditioners, the incomplete LDL^T being my favorite.
The piece of code which performs incomplete LDL^T factorization looks like this:
do i = 1, n                             ! browse through rows
  sum1 = a % val(a % dia(i))            ! take diagonal entry
  do j = a % row(i), a % dia(i)-1       ! browse only lower triangular part
    k = a % col(j)                      ! fetch the column
    sum1 = sum1 - f % val(f % dia(k)) * a % val(j) * a % val(j)
  end do

  ! Keep only the diagonal from LDL decomposition
  f % val(f % dia(i)) = 1.0 / sum1
end do
This piece of code is inherently sequential. As I browse through rows, I use the factorization performed in previous rows, so it is not easily parallelizable from the outset. The only way I found to compile it with OpenACC directives, keeping its sequential nature in mind, is:
!$acc parallel loop seq &
!$acc& present(a, a % row, a % col, a % dia, a % val) &
!$acc& present(f, f % row, f % col, f % dia, f % val)
do i = 1, n                             ! browse through rows
  sum1 = a % val(a % dia(i))            ! take diagonal entry
  !$acc loop vector reduction(+:sum1)
  do j = a % row(i), a % dia(i)-1       ! browse only lower triangular part
    k = a % col(j)                      ! fetch the column
    sum1 = sum1 - f % val(f % dia(k)) * a % val(j) * a % val(j)
  end do

  ! Keep only the diagonal from LDL decomposition
  f % val(f % dia(i)) = 1.0 / sum1
end do
Although these OpenACC directives keep the calculations on the GPU and give the same results as the non-GPU variant, those of you more experienced than me will already see that there is not much parallelism or speed-up here. The outer loop (i, through rows) is sequential, and the inner loop (j and k) browses through only a few elements.
Overall, the GPU variant of incomplete LDL^T preconditioned CG solver is several times slower than the non-GPU version.
Does anyone have an idea how to work around this? I understand that it might be far from trivial, but maybe some work has already been done on this issue that I am unaware of, or maybe there is a better way to use OpenACC.
Update a couple of days later
I did my homework, read some articles on Nvidia site and beyond, and yeah, the factorization algorithm is indeed sequential and it seems to be an open question how to go about it on GPUs. In this recent blog: the author is using cuBLAS and cuSPARSE for CG, but still does factorization on CPU. The average speed-up he reports is around two, which is slightly underwhelming I dare to say.
So, I decided on a workaround: coloring the matrix rows and performing the factorization color by color, like this:
n_colors = 8

do color = 1, n_colors

  c_low = (color-1) * n / n_colors + 1  ! color's lower bound
  c_upp =  color    * n / n_colors      ! color's upper bound

  do i = c_low, c_upp                   ! browse your color band
    sum1 = a % val(a % dia(i))          ! take diagonal entry
    do j = a % row(i), a % dia(i)-1     ! only lower triangular
      k = a % col(j)
      if(k >= c_low) then               ! limit entries to your color
        sum1 = sum1 - f % val(f % dia(k)) * a % val(j) * a % val(j)
      end if
    end do

    ! This is only the diagonal from LDL decomposition
    f % val(f % dia(i)) = 1.0 / sum1
  end do
end do  ! colors
With the test if(k >= c_low) I limit the matrix entries to those belonging to the current color. Clearly, since the factorized matrix is now even more incomplete (more entries are neglected), convergence properties are poorer, but still better than with a simple diagonal preconditioner.
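To make the band-limited factorization concrete, here is a minimal Python sketch of the same loop. It assumes 0-based CSR arrays with column indices sorted within each row (so the lower-triangular entries precede the diagonal); the function name `factorize_diagonal` and the tiny test matrix are made up for illustration and are not part of the original Fortran code.

```python
def factorize_diagonal(row, col, val, dia, n, n_colors=1):
    """Return the inverted diagonal of an incomplete LDL^T factorization.

    row, col, val are 0-based CSR arrays; dia[i] is the position of the
    diagonal entry of row i in val. Rows are split into n_colors contiguous
    bands, and couplings to rows outside the current band are dropped.
    """
    f = [0.0] * n
    for color in range(n_colors):
        c_low = color * n // n_colors          # first row of this band
        c_upp = (color + 1) * n // n_colors    # one past the last row
        for i in range(c_low, c_upp):          # sequential inside a band
            s = val[dia[i]]
            for j in range(row[i], dia[i]):    # strictly lower triangular part
                k = col[j]
                if k >= c_low:                 # ignore entries outside the band
                    s -= f[k] * val[j] * val[j]
            f[i] = 1.0 / s
    return f


# 3x3 tridiagonal test matrix: 2 on the diagonal, -1 next to it
row = [0, 2, 5, 7]
col = [0, 1, 0, 1, 2, 1, 2]
val = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]
dia = [0, 3, 6]

full = factorize_diagonal(row, col, val, dia, 3, n_colors=1)    # one band: plain sequential loop
banded = factorize_diagonal(row, col, val, dia, 3, n_colors=3)  # one row per band
```

With one row per band every off-diagonal coupling is dropped and the result degenerates to the diagonal preconditioner, which illustrates the trade-off: more bands give more independent work but a weaker preconditioner.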
I did the following with OpenACC:
n_colors = 8

! do each tile in its own color
!$acc parallel loop num_gangs(8) tile(8) &
!$acc& present(a, a % row, a % col, a % dia, a % val) &
!$acc& present(f, f % row, f % col, f % dia, f % val)
do color = 1, n_colors

  c_low = (color-1) * n / n_colors + 1  ! color's lower bound
  c_upp =  color    * n / n_colors      ! color's upper bound

  !$acc loop seq                        ! inherently sequential
  do i = c_low, c_upp                   ! browse your color band
    sum1 = a % val(a % dia(i))          ! take diagonal entry
    !$acc loop vector reduction(+:sum1)
    do j = a % row(i), a % dia(i)-1     ! only lower triangular
      k = a % col(j)
      if(k >= c_low) then               ! limit entries to your color
        sum1 = sum1 - f % val(f % dia(k)) * a % val(j) * a % val(j)
      end if
    end do

    ! This is only the diagonal from LDL decomposition
    f % val(f % dia(i)) = 1.0 / sum1
  end do
end do  ! colors
The results I obtain from OpenACC variant are identical to those from CPU only, it is just that performance is still much slower on GPUs. Yet, it seems that tile directive works as I expected, each gang seems to be working on its own color since results I get from GPUs are identical.
Any advice on how to improve performance? Am I doing something utterly silly in the above code? (Profiler shows that the computation is GPU-bound since the entire system is on GPU, but performance is really poor.)
Best regards

Not sure this will help, but this paper describes a method for solving CG on GPUs. It uses cuSPARSE and CUDA for the implementations, but you might be able to get ideas that could be applied to your code.


How to approach and understand a math related DSA question

I found this question online and I really have no idea what the question is even asking. I would really appreciate some help in first understanding the question, and a solution if possible. Thanks!
To see if a number is divisible by 3, you need to add up the digits of its decimal notation, and check if the sum is divisible by 3.
To see if a number is divisible by 11, you need to split its decimal notation into pairs of digits (starting from the right end), add up corresponding numbers and check if the sum is divisible by 11.
For any prime p (except for 2 and 5) there exists an integer r such that a similar divisibility test exists: to check if a number is divisible by p, you need to split its decimal notation into r-tuples of digits (starting from the right end), add up these r-tuples and check whether their sum is divisible by p.
Given a prime int p, find the minimal r for which such divisibility test is valid and output it.
The input consists of a single integer p - a prime between 3 and 999983, inclusive, not equal to 5.
This is a very cool problem! It uses modular arithmetic and some basic number theory to devise the solution.
Let's say we have p = 11. What divisibility rule applies here? How many digits at once do we need to take, to have a divisibility rule?
Well, let's try a single digit at a time. That would mean, that if we have 121 and we sum its digits 1 + 2 + 1, then we get 4. However we see, that although 121 is divisible by 11, 4 isn't and so the rule doesn't work.
What if we take two digits at a time? With 121 we get 1 + 21 = 22. We see that 22 IS divisible by 11, so the rule might work here. And in fact, it does. For p = 11, we have r = 2.
This requires a bit of intuition which I am unable to convey in text (I really have tried) but it can be proven that for a given prime p other than 2 and 5, the divisibility rule works for tuples of digits of length r if and only if the number 99...9 (with r nines) is divisible by p. And indeed, for p = 3 we have 9 % 3 = 0, while for p = 11 we have 9 % 11 = 9 (this is bad) and 99 % 11 = 0 (this is what we want).
If we want to find such an r, we start with r = 1. We check if 9 is divisible by p. If it is, then we found the r. Otherwise, we go further and we check if 99 is divisible by p. If it is, then we return r = 2. Then, we check if 999 is divisible by p and if so, return r = 3 and so on. However, the 99...9 numbers can get very large. Thankfully, to check divisibility by p we only need to store the remainder modulo p, which we know is small (at least smaller than 999983). So the code in C++ would look something like this:
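int r(int p) {
    int result = 1;                 // number of nines tried so far
    int remainder = 9 % p;          // 9 mod p
    while (remainder != 0) {
        // append one more 9: 99...9 -> 99...99, keeping only the remainder
        remainder = (remainder * 10 + 9) % p;
        result++;
    }
    return result;
}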
I have no idea how they expect a random programmer with no background to figure out the answer from this.
But here is the brief introduction to modulo arithmetic that should make this doable.
In programming, n % k is the modulo operator. It refers to taking the remainder of n / k. It satisfies the following two important properties:
(n + m) % k = ((n % k) + (m % k)) % k
(n * m) % k = ((n % k) * (m % k)) % k
Because of this, for any k we can think of all numbers with the same remainder as somehow being the same. The result is something called "the integers modulo k". And it satisfies most of the rules of algebra that you're used to. You have the associative property, the commutative property, distributive law, addition by 0, and multiplication by 1.
However if k is a composite number like 10, you have the unfortunate fact that 2 * 5 = 10 which means that modulo 10, 2 * 5 = 0. That's kind of a problem for division.
BUT if k = p, a prime, then things become massively easier. If (a*m) % p = (b*m) % p then ((a-b) * m) % p = 0 so (a-b) * m is divisible by p. And therefore either (a-b) or m is divisible by p.
For any non-zero remainder m, let's look at the sequence m % p, m^2 % p, m^3 % p, .... This sequence is infinitely long and can only take on p values. So we must have a repeat where a < b and m^a % p = m^b % p. So (1 * m^a) % p = (m^(b-a) * m^a) % p. Since p doesn't divide m, it doesn't divide m^a either, and therefore m^(b-a) % p = 1. Furthermore m^(b-a-1) % p acts just like m^(-1) = 1/m. (If you take enough math, you'll find that the non-zero remainders under multiplication form a finite group, and all the remainders form a field. But let's ignore that.)
(I'm going to drop the % p everywhere. Just assume it is there in any calculation.)
Now let's let a be the smallest positive number such that m^a = 1. Then 1, m, m^2, ..., m^(a-1) forms a cycle of length a. For any n in 1, ..., p-1 we can form a cycle (possibly the same, possibly different) n, n*m, n*m^2, ..., n*m^(a-1). It can be shown that these cycles partition 1, 2, ..., p-1 where every number is in a cycle, and each cycle has length a. THEREFORE, a divides p-1. As a side note, since a divides p-1, we easily get Fermat's little theorem that m^(p-1) has remainder 1 and therefore m^p = m.
OK, enough theory. Now to your problem. Suppose we have a base b = 10^i. The divisibility test that they are discussing is that a_0 + a_1 * b + a_2 * b^2 + ... + a_k * b^k is divisible by a prime p if and only if a_0 + a_1 + ... + a_k is divisible by p. Looking at (p-1) + b, this can only happen if b % p is 1. And if b % p is 1, then in modulo arithmetic b to any power is 1, and the test works.
So we're looking for the smallest i such that 10^i % p is 1. From what I showed above, i always exists, and divides p-1. So you just need to factor p-1, and try 10 to the power of each divisor of p-1, in increasing order, until you find the smallest i that works.
Note that you should % p at every step you can to keep those powers from getting too big. And with repeated squaring you can speed up the calculation. So, for example, calculating 10^20 % p could be done by calculating each of the following in turn.
10 % p
10^2 % p
10^4 % p
10^5 % p
10^10 % p
10^20 % p
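As a rough illustration of the recipe above (try only the divisors of p-1, reducing modulo p at every step), here is a small Python sketch. The names `power_mod` and `smallest_block_size` are invented for this example; Python's built-in three-argument `pow` would do the same job as `power_mod`, which is spelled out here only to show the repeated squaring.

```python
def power_mod(base, exponent, p):
    """Compute base**exponent % p by repeated squaring."""
    result = 1
    base %= p
    while exponent > 0:
        if exponent & 1:                 # this bit of the exponent is set
            result = result * base % p
        base = base * base % p           # square for the next bit
        exponent >>= 1
    return result


def smallest_block_size(p):
    """Smallest i > 0 with 10**i % p == 1, for a prime p other than 2 and 5."""
    # Collect the divisors of p - 1 by trial division.
    divisors = set()
    d = 1
    while d * d <= p - 1:
        if (p - 1) % d == 0:
            divisors.add(d)
            divisors.add((p - 1) // d)
        d += 1
    # The answer divides p - 1, so the first divisor that works is minimal.
    for i in sorted(divisors):
        if power_mod(10, i, p) == 1:
            return i
    raise ValueError("p must be a prime other than 2 and 5")
```

Trial division up to the square root of p-1 is more than fast enough for primes below one million; for much larger primes a proper factorization of p-1 would be the better starting point.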
This is an almost direct application of Fermat's little theorem.
First, you have to reformulate the "split decimal notation into tuples [...]"-condition into something you can work with:
to check if a number is divisible by p, you need to split its decimal notation into r-tuples of digits (starting from the right end), add up these r-tuples and check whether their sum is divisible by p
When you translate it from prose into a formula, what it essentially says is that you want
$$\sum_{i \ge 0} b_i \cdot 10^{ri} \equiv \sum_{i \ge 0} b_i \pmod{p}$$
for any choice of "r-tuples of digits" b_i from { 0, ..., 10^r - 1 } (with only finitely many b_i being non-zero).
Taking b_1 = 1 and all other b_i = 0, it's easy to see that it is necessary that
$$10^r \equiv 1 \pmod{p}.$$
It's even easier to see that this is also sufficient (all 10^ri on the left hand side simply transform into factor 1 that does nothing).
Now, if p is neither 2 nor 5, then 10 will not be divisible by p, so that Fermat's little theorem guarantees us that
$$10^{p-1} \equiv 1 \pmod{p},$$
that is, at least the solution r = p - 1 exists. This might not be the smallest such r though, and computing the smallest one is hard if you don't have a quantum computer handy.
Despite it being hard in general, for very small p, you can simply use an algorithm that is linear in p (you simply look at the sequence
10 mod p
100 mod p
1000 mod p
10000 mod p
and stop as soon as you find something that equals 1 mod p).
Written out as code, for example, in Scala:
def blockSize(p: Int, n: Int = 10, r: Int = 1): Int =
  if n % p == 1 then r else blockSize(p, n * 10 % p, r + 1)
println(blockSize(3)) // 1
println(blockSize(11)) // 2
println(blockSize(19)) // 18
or in Python:
def blockSize(p: int, n: int = 10, r: int = 1) -> int:
    return r if n % p == 1 else blockSize(p, n * 10 % p, r + 1)
print(blockSize(3)) # 1
print(blockSize(11)) # 2
print(blockSize(19)) # 18
A wall of numbers, just in case someone else wants to sanity-check alternative approaches:
11 -> 2
13 -> 6
17 -> 16
19 -> 18
23 -> 22
29 -> 28
31 -> 15
37 -> 3
41 -> 5
43 -> 21
47 -> 46
53 -> 13
59 -> 58
61 -> 60
67 -> 33
71 -> 35
73 -> 8
79 -> 13
83 -> 41
89 -> 44
97 -> 96
101 -> 4
103 -> 34
107 -> 53
109 -> 108
113 -> 112
127 -> 42
131 -> 130
137 -> 8
139 -> 46
149 -> 148
151 -> 75
157 -> 78
163 -> 81
167 -> 166
173 -> 43
179 -> 178
181 -> 180
191 -> 95
193 -> 192
197 -> 98
199 -> 99
Thank you andrey tyukin.
Simple terms to remember:
When x%y =z then (x%y)%y again =z
(X+y)%z == (x%z + y%z)%z
keep this in mind.
So you break any number into blocks of r digits at a time. For example, with r = 6, 3456733 breaks into 3 * 10 power (6 * 1) + 456733 * 10 power (6 * 0).
And 12536382626373 breaks into 12 * 10 power (6 * 2) + 536382 * 10 power (6 * 1) + 626373 * 10 power (6 * 0).
Observe that here r is 6.
So when we say we combine the digits r at a time, sum the blocks and apply the modulo, we are applying the modulo to the coefficients of the above breakdown.
So how come the sum of the coefficients represents the whole number?
When "10 power (6 * anything)" in the above breakdown is 1 modulo p, that term's remainder equals its coefficient's remainder. That means the 10 power (r * anything) factor has no effect. You can check why by using the two formulas above.
The other terms 10 power (r * anything) also have remainder 1: if you can prove that (10 power r) modulo p is 1, then (10 power (r * anything)) modulo p is also 1.
So the important thing is that 10 power r leaves remainder 1. Then every 10 power (r * anything) leaves remainder 1, which makes the remainder of the number equal to the remainder of the sum of its r-digit blocks.
Conclusion: find the smallest r such that 10 power r leaves 1 as remainder when divided by the given prime.
That also means the smallest 9...9 which is divisible by the given prime number decides r.
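The breakdown into r-digit blocks is easy to try out. The following Python sketch (the helper name `block_sum` is made up for this example) splits a number into blocks from the right and adds them, so the block sum can be compared with the original number modulo a prime for which 10 power r leaves remainder 1.

```python
def block_sum(number, r):
    """Split a non-negative integer into r-digit blocks from the right and add them."""
    total = 0
    base = 10 ** r
    while number > 0:
        total += number % base   # lowest r digits
        number //= base          # drop them
    return total


# r = 6 works for p = 7 and p = 13, because 10**6 % 7 == 1 and 10**6 % 13 == 1
small = block_sum(3456733, 6)          # 3 + 456733
large = block_sum(12536382626373, 6)   # 12 + 536382 + 626373
same_mod_7 = (3456733 % 7 == small % 7)
same_mod_13 = (12536382626373 % 13 == large % 13)
```

Both comparisons come out true here, while the same check with a prime for which 10 power 6 does not leave remainder 1 (for example 17) is not guaranteed to agree.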

Calculate Reverse Modulus

I'm trying to calculate the value of x in this equation:
(4 + 11111111/x)mod95 = 54
I've tried solving it using the top answer here: how to calculate reverse modulus
However, it provides the lowest possible value for x (145, if helpful to anyone.)
In addition, whenever 11111111/x is calculated, it removes any decimal places from the answer.
I guess you are referring to the bash code
(4 + 11111111 / x) % 95 # == 54
Where / yields the int part of the division.
If you simplify that, an x that satisfies this also satisfies:
(11111111 / x) % 95 # == 50
And so also:
(11111111 / x) == 95 * i + 50 # for integer i
If we further look at the division that rounds towards the next lowest integer, we have
r= 11111111 % x
(11111111 - r)/x == 95*i + 50 # for integer i
(11111111 - r) == 5*(19*i + 10)*x # for integer i
So it can be rewritten as two conditions, which have to be met by any solution at once:
2222222 = (19*i + 10)*x
0 < 11111111 % x < x-1 # -1 because 11111111 % 5 == 1 and 11111111 % x < x
In other words, to find x you just need to check the two conditions for all divisors of 2222222.
In general, if you have questions like:
(a + b/x) mod m = c
transform it to
g=gcd(m, c-a)
c'= (c-a)/g
(b/x) mod m = g*c'
m= g*m'
b/x = g*c' + g*m'*i
r= b%x
r'= b%g
# now search for an x that divides (b-r')/g
# and complies with the following conditions:
(b-r')/g = (c' + m'*i)*x
r' <= r < x-r'
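For checking candidate answers, a direct enumeration can be handy. The Python sketch below does not use the divisor-based conditions above; it relies on a different, standard observation: the integer quotient b // x takes only about 2*sqrt(b) distinct values, so all x with the same quotient can be handled as one range. The function name `solution_ranges` is an assumption of this example, and only 1 <= x <= b is covered (for x > b the quotient is simply 0).

```python
def solution_ranges(a, b, m, c):
    """Yield (x_low, x_high) ranges of x in 1..b with (a + b // x) % m == c."""
    x = 1
    while x <= b:
        q = b // x          # integer quotient for this x
        x_high = b // q     # largest x that still gives the same quotient
        if (a + q) % m == c:
            yield (x, x_high)
        x = x_high + 1


ranges = list(solution_ranges(4, 11111111, 95, 54))
```

Each returned pair is an inclusive range of x values that all satisfy the equation with truncating division.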

How to search the minimum n that 10^n ≡ 1 mod(9x) for given x

For given x, I need to calculate the minimum n that equates true for the formula 10^n ≡ 1 (mod 9x)
My algorithm is simple. For i = 1 to inf, I loop it until I get a result. There is always a result if gcd(10, x) = 1. Meanwhile if I don't get a result, I increase i by 1 .
This is really slow for big primes or numbers with a factorization of big values, so I ask if there is another way to calculate it faster. I have tried with threads, getting each thread the next 10^i to calculate. Performance is a bit better, but big primes still don't finish.
You can use Euler's theorem, a generalization of Fermat's Little Theorem.
Assuming your x is relatively prime with 10, the following holds:
10 ^ φ(9x) ≡ 1 (mod 9x)
Here φ is Euler's totient function. So you can easily calculate at least one n (not necessarily the smallest) for which your equation holds. The smallest such n must divide φ(9x), so to find it, just iterate through the divisors of φ(9x) in increasing order.
Example: x = 89 (a prime number just for simplicity).
9x = 801
φ(9x) = 6 * (89 - 1) = 528 (easy to calculate for a prime number)
The list of divisors of 528: 1, 2, 3, 4, 6, 8, 11, 12, 16, 22, 24, 33, 44, 48, 66, 88, 132, 176, 264, 528
Trying each one, you can find that your equation holds for 44:
10 ^ 44 ≡ 1 (mod 801)
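The divisor search described here can be sketched in a few lines of Python. This is only an illustration under the stated assumption gcd(10, x) = 1; the function names `euler_phi` and `smallest_exponent` are invented for the example, and the totient is computed by plain trial division.

```python
import math


def euler_phi(m):
    """Euler's totient function via trial division."""
    result = m
    d = 2
    while d * d <= m:
        if m % d == 0:
            while m % d == 0:
                m //= d
            result -= result // d    # multiply result by (1 - 1/d)
        d += 1
    if m > 1:                        # one prime factor above sqrt is left
        result -= result // m
    return result


def smallest_exponent(x):
    """Smallest n > 0 with 10**n % (9*x) == 1, assuming gcd(10, x) == 1."""
    modulus = 9 * x
    phi = euler_phi(modulus)
    divisors = set()
    for i in range(1, math.isqrt(phi) + 1):
        if phi % i == 0:
            divisors.add(i)
            divisors.add(phi // i)
    # The smallest n divides phi, so test the divisors in increasing order.
    for n in sorted(divisors):
        if pow(10, n, modulus) == 1:
            return n
    raise ValueError("x must be coprime to 10")
```

For x = 89 this walks through the divisors of 528 and stops at 44, matching the example above. The number of modular exponentiations is bounded by the number of divisors of φ(9x) rather than by the answer itself.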
I just tried the example, it runs in less than one second:
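public class Main {
    public static void main(String[] args) {
        int n = 1;
        int x = 954661;
        int v = 10 % (9 * x);           // 10^1 mod 9x
        while (v != 1) {
            v = (v * 10) % (9 * x);     // 10^(n+1) mod 9x
            n++;
        }
        System.out.println(n);          // smallest n with 10^n = 1 (mod 9x)
    }
}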
For larger values of x the variables should be of type long.
As you specified you are actually trying to get modulus with 1 that is 1mod(9x).
That will always give you 1.
And you don't have to calculate that part exactly which might reduce your calculation.
On the other hand for 10^n = 1, it will always be 0.
So can you exactly specify what you are trying to do

What is the most efficient way to implement zig-zag ordering in MATLAB? [duplicate]

I have an NxM matrix in MATLAB that I would like to reorder in similar fashion to the way JPEG reorders its subblock pixels:
(image from Wikipedia)
I would like the algorithm to be generic such that I can pass in a 2D matrix with any dimensions. I am a C++ programmer by trade and am very tempted to write an old school loop to accomplish this, but I suspect there is a better way to do it in MATLAB.
I'd be rather want an algorithm that worked on an NxN matrix and go from there.
1 2 3
4 5 6 --> 1 2 4 7 5 3 6 8 9
7 8 9
Consider the code:
M = randi(100, [3 4]); %# input matrix
ind = reshape(1:numel(M), size(M)); %# indices of elements
ind = fliplr( spdiags( fliplr(ind) ) ); %# get the anti-diagonals
ind(:,1:2:end) = flipud( ind(:,1:2:end) ); %# reverse order of odd columns
ind(ind==0) = []; %# keep non-zero indices
M(ind) %# get elements in zigzag order
An example with a 4x4 matrix:
» M
M =
17 35 26 96
12 59 51 55
50 23 70 14
96 76 90 15
» M(ind)
ans =
17 35 12 50 59 26 96 51 23 96 76 70 55 14 90 15
and an example with a non-square matrix:
M =
69 9 16 100
75 23 83 8
46 92 54 45
ans =
69 9 75 46 23 16 100 83 92 54 8 45
This approach is pretty fast:
X = randn(500,2000); %// example input matrix
[r, c] = size(X);
M = bsxfun(@plus, (1:r).', 0:c-1);
M = M + bsxfun(@times, (1:r).'/(r+c), (-1).^M);
[~, ind] = sort(M(:));
y = X(ind).'; %'// output row vector
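For readers who do not have MATLAB at hand, the sort-key idea can be sketched in plain Python: every cell is sorted first by its anti-diagonal and then by its row, with the row direction alternating between anti-diagonals. The function name `zigzag` and the list-of-lists input are assumptions of this example, not a translation of any particular answer.

```python
def zigzag(matrix):
    """Return the elements of a rectangular list-of-lists in zig-zag order."""
    rows, cols = len(matrix), len(matrix[0])
    cells = [(i, j) for i in range(rows) for j in range(cols)]
    # Primary key: anti-diagonal index i + j (0-based).
    # Secondary key: increasing row on odd anti-diagonals,
    # decreasing row on even ones.
    cells.sort(key=lambda ij: (ij[0] + ij[1],
                               ij[0] if (ij[0] + ij[1]) % 2 else -ij[0]))
    return [matrix[i][j] for i, j in cells]


square = zigzag([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])
# square == [1, 2, 4, 7, 5, 3, 6, 8, 9]
```

The sort makes this O(N log N) in the number of entries, which is fine for a sketch; it works for non-square inputs as well.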
The following code compares running time with that of Amro's excellent answer, using timeit. It tests different combinations of matrix size (number of entries) and matrix shape (number of rows to number of columns ratio).
%// Amro's approach
function y = zigzag_Amro(M)
ind = reshape(1:numel(M), size(M));
ind = fliplr( spdiags( fliplr(ind) ) );
ind(:,1:2:end) = flipud( ind(:,1:2:end) );
ind(ind==0) = [];
y = M(ind);
%// Luis' approach
function y = zigzag_Luis(X)
[r, c] = size(X);
M = bsxfun(@plus, (1:r).', 0:c-1);
M = M + bsxfun(@times, (1:r).'/(r+c), (-1).^M);
[~, ind] = sort(M(:));
y = X(ind).';
%// Benchmarking code:
S = [10 30 100 300 1000 3000]; %// reference to generate matrix size
f = [1 1]; %// number of rows is S*f(1); number of cols is S*f(2)
%// f = [0.5 2]; %// plotted with '--'
%// f = [2 0.5]; %// plotted with ':'
t_Amro = NaN(size(S));
t_Luis = NaN(size(S));
for n = 1:numel(S)
    X = rand(f(1)*S(n), f(2)*S(n));
    f_Amro = @() zigzag_Amro(X);
    f_Luis = @() zigzag_Luis(X);
    t_Amro(n) = timeit(f_Amro);
    t_Luis(n) = timeit(f_Luis);
end
loglog(S.^2*prod(f), t_Amro, '.b-');
hold on
loglog(S.^2*prod(f), t_Luis, '.r-');
xlabel('number of matrix entries')
The figure below has been obtained with Matlab R2014b on Windows 7 64 bits. Results in R2010b are very similar. It is seen that the new approach reduces running time by a factor between 2.5 (for small matrices) and 1.4 (for large matrices). Results are seen to be almost insensitive to matrix shape, given a total number of entries.
Here's a non-loop solution zig_zag.m. It looks ugly but it works!:
function [M,index] = zig_zag(M)
[r,c] = size(M);
checker = rem(hankel(1:r,r-1+(1:c)),2);
[rEven,cEven] = find(checker);
[cOdd,rOdd] = find(~checker.'); %'#
rTotal = [rEven; rOdd];
cTotal = [cEven; cOdd];
[junk,sortIndex] = sort(rTotal+cTotal);
rSort = rTotal(sortIndex);
cSort = cTotal(sortIndex);
index = sub2ind([r c],rSort,cSort);
M = M(index);
And a test matrix:
>> M = [magic(4) zeros(4,1)];
M =
16 2 3 13 0
5 11 10 8 0
9 7 6 12 0
4 14 15 1 0
>> newM = zig_zag(M) %# Zig-zag sampled elements
newM =
Here's a way how to do this. Basically, your array is a hankel matrix plus vectors of 1:m, where m is the number of elements in each diagonal. Maybe someone else has a neat idea on how to create the diagonal arrays that have to be added to the flipped hankel array without a loop.
I think this should be generalizeable to a non-square array.
% for a 3x3 array
n = 3;
numElementsPerDiagonal = [1:n,n-1:-1:1];
hadaRC = cumsum([0,numElementsPerDiagonal(1:end-1)]);
array2add = fliplr(hankel(hadaRC(1:n),hadaRC(end-n+1:end)));
% loop through the hankel array and add numbers counting either up or down
% if they are even or odd
for d = 1:(2*n-1)
    if floor(d/2)==d/2
        % even, count down
        array2add = array2add + diag(1:numElementsPerDiagonal(d),d-n);
    else
        % odd, count up
        array2add = array2add + diag(numElementsPerDiagonal(d):-1:1,d-n);
    end
end
% now flip to get the result
indexMatrix = fliplr(array2add)
result =
1 2 6
3 5 7
4 8 9
Afterward, you just call reshape(image(indexMatrix),[],1) to get the vector of reordered elements.
Ok, from your comment it looks like you need to use sort like Marc suggested.
indexMatrixT = indexMatrix'; % ' SO formatting
[dummy,sortedIdx] = sort(indexMatrixT(:));
sortedIdx =
1 2 4 7 5 3 6 8 9
Note that you'd need to transpose your input matrix first before you index, because Matlab counts first down, then right.
Assuming X to be the input 2D matrix and that is square or landscape-shaped, this seems to be pretty efficient -
[m,n] = size(X);
nlim = m*n;
n = n+mod(n-m,2);
mask = bsxfun(@le,[1:m]',[n:-1:1]);
start_vec = m:m-1:m*(m-1)+1;
a = bsxfun(@plus,start_vec',[0:n-1]*m);
offset_startcol = 2- mod(m+1,2);
[~,idx] = min(mask,[],1);
idx = idx - 1;
idx(idx==0) = m;
end_ind = a([0:n-1]*m + idx);
offsets = a(1,offset_startcol:2:end) + end_ind(offset_startcol:2:end);
a(:,offset_startcol:2:end) = bsxfun(@minus,offsets,a(:,offset_startcol:2:end));
out = a(mask);
out2 = m*n+1 - out(end:-1:1+m*(n-m+1));
result = X([out2 ; out(out<=nlim)]);
Quick runtime tests against Luis's approach -
Datasize: 500 x 2000
------------------------------------- With Proposed Approach
Elapsed time is 0.037145 seconds.
------------------------------------- With Luis Approach
Elapsed time is 0.045900 seconds.
Datasize: 5000 x 20000
------------------------------------- With Proposed Approach
Elapsed time is 3.947325 seconds.
------------------------------------- With Luis Approach
Elapsed time is 6.370463 seconds.
Let's assume for a moment that you have a 2-D matrix that's the same size as your image specifying the correct index. Call this array idx; then the matlab commands to reorder your image would be
[~,I] = sort (idx(:)); %sort the 1D indices of the image into ascending order according to idx
reorderedim = im(I);
I don't see an obvious solution to generate idx without using for loops or recursion, but I'll think some more.

Fast modulo 3 or division algorithm?

Is there a fast algorithm, similar to the bit tricks for powers of 2, which can be used with 3, i.e. for n % 3?
Perhaps something that uses the fact that if sum of digits is divisible by three, then the number is also divisible.
This leads to a next question. What is the fast way to add digits in a number? I.e. 37 -> 3 +7 -> 10
I am looking for something that does not have conditionals as those tend to inhibit vectorization
4 % 3 == 1, so (4^k * a + b) % 3 == (a + b) % 3. You can use this fact to evaluate x%3 for a 32-bit x:
x = (x >> 16) + (x & 0xffff);
x = (x >> 10) + (x & 0x3ff);
x = (x >> 6) + (x & 0x3f);
x = (x >> 4) + (x & 0xf);
x = (x >> 2) + (x & 0x3);
x = (x >> 2) + (x & 0x3);
x = (x >> 2) + (x & 0x3);
if (x == 3) x = 0;
(Untested - you might need a few more reductions.) Is this faster than your hardware can do x%3? If it is, it probably isn't by much.
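Since the snippet is marked as untested, here is a small Python port that makes it easy to compare against the built-in `%` operator. It is a sketch for non-negative 32-bit inputs only, and the function name `mod3` is chosen for this example.

```python
def mod3(x):
    """x % 3 for 0 <= x < 2**32 using only shifts, masks and adds."""
    # Every shift amount is even, so each step splits x as 4**k * a + b
    # and replaces it by a + b, which has the same remainder modulo 3.
    x = (x >> 16) + (x & 0xFFFF)
    x = (x >> 10) + (x & 0x3FF)
    x = (x >> 6) + (x & 0x3F)
    x = (x >> 4) + (x & 0xF)
    x = (x >> 2) + (x & 0x3)
    x = (x >> 2) + (x & 0x3)
    x = (x >> 2) + (x & 0x3)
    return 0 if x == 3 else x   # x is now in 0..3
```

Tracking the upper bound after each step (131070, 1150, 80, 20, 7, 4, 3) suggests that seven reductions are enough for any 32-bit input, so no extra steps should be needed in that range.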
This comp.compilers item has a specific recommendation for computing modulo 3.
An alternative, especially if the maximum size of the dividend is modest, is to multiply by the reciprocal of 3 as a fixed-point value, with enough bits of precision to handle the maximum size dividend to compute the quotient, and then subtract 3*quotient from the dividend to get the remainder. All of these multiplies can be implemented with a fixed sequence of shifts-and-adds. The number of instructions will depend on the bit pattern of the reciprocal. This works pretty well when the dividend max is modest in size.
Regarding adding digits in the number... if you want to add the decimal digits, you're going to end up doing what amounts to a number-conversion-to-decimal, which involves divide by 10 somewhere. If you're willing to settle for adding up the digits in base2, you can do this with an easy shift-right and add loop. Various clever tricks can be used to do this in chunks of N bits to speed it up further.
Not sure for your first question, but for your second, you can take advantage of the % operator and integer division:
int num = 12345;
int sum = 0;
while (num) {
    sum += num % 10;   // add the last decimal digit
    num /= 10;         // and drop it
}
This works because 12345 % 10 = 5, 12345 / 10 = 1234 and keep going until num == 0
If you are happy with 1 byte integer division, here's a trick. You could extend it to 2 bytes, 4 bytes, etc.
Division by 3 is essentially multiplication by 0.3333. If you want to simulate this with integer arithmetic then you need the closest approximation on the 256 (decimal) boundary. This is 85, because 85 / 256 = 0.332. So if you multiply your value by 85, you should get a value close to the result in the high 8 bits.
Multiplying a value with 85 fast is easy. n * 85 = n * 64 + n * 16 + n * 4 + n. Now all these factors are powers of 2 so you can calculate n * 4 by shifting, then use this value to calculate n * 16, etc. So you have max 5 shifts and 4 additions.
As said, this'll give you approximation. To know how good it is you'll need to check the lower byte of the next value using this rule
n ... is the 16 bit number you want to divide
approx = HI(n*85)
if LO(n*85)>LO((n+1)*85)THEN approx++
And that should do the trick.
Example 1:
3 / 3 =?
3 * 85 = 00000000 11111111 (approx=0)
4 * 85 = 00000001 01010100 (LO(3*85)>LO(4*85)=>approx=1)
result approx=1
Example 2:
254 / 3
254 * 85 = 01010100 01010110 (approx=84)
255 * 85 = 01010100 10101011 (LO(254*85)<LO(255*85), don't increase)
result approx=84
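The rule can be checked exhaustively for one-byte inputs with a short Python sketch. The function name `div3_byte` is made up for this example; the multiplication by 85 is written as the shifts and adds described above.

```python
def div3_byte(n):
    """n // 3 for 0 <= n <= 255 using multiply-by-85 plus the low-byte correction."""
    # n * 85 built from shifts and adds: 85 = 64 + 16 + 4 + 1
    product = (n << 6) + (n << 4) + (n << 2) + n
    next_product = product + 85          # (n + 1) * 85
    approx = product >> 8                # HI(n * 85)
    if (product & 0xFF) > (next_product & 0xFF):
        approx += 1                      # low byte wrapped around
    return approx
```

Comparing `div3_byte(n)` with `n // 3` for every n from 0 to 255 is a quick way to confirm the rule on the whole one-byte range; wider inputs would need a more precise reciprocal than 85 / 256.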
If you're dealing with big-integers, one very fast method is realizing the fact for all
bases 10 +/- multiple-of-3
4,7,10,13,16,19,22…. etc
All you have to do is count the digits, then % 3. something like :
** note : x ^ y is power, not bit-wise XOR,
x ** y being the python equivalent
function mod3(__,_) {
# can handle bases
# { 4, 7,10,13,16,19,
# 22,25,28,31,34 } w/o conversion
# assuming base digits :
# 0-9A-X for any base,
# or 0-9a-f for base-16
return \
&& (__~"^[0-9]+$") )\
? (substr(__,_~_,_+_*_+_)+\
+ length(__) \
+ gsub("[258BbEeHKNQTW]","",__))%+_
This isn't the fastest method possible, but it's one of the more agile methods.
