Is there an algorithm to multiply square matrices in-place?

The naive algorithm for multiplying 4x4 matrices looks like this:
void matrix_mul(double out[4][4], double lhs[4][4], double rhs[4][4]) {
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            out[i][j] = 0.0;
            for (int k = 0; k < 4; ++k) {
                out[i][j] += lhs[i][k] * rhs[k][j];
            }
        }
    }
}
Obviously, this algorithm gives bogus results if out == lhs or out == rhs (here == means reference equality). Is there a version that allows one or both of those cases that doesn't simply copy the matrix? I'm happy to have different functions for each case if necessary.
I found this paper, but it discusses the Strassen-Winograd algorithm, which is overkill for my small matrices. The answers to this question seem to indicate that if out == lhs && out == rhs (i.e., we're attempting to square the matrix), it can't be done in place, but even there no convincing evidence or proof is given.

I'm not thrilled with this answer (I'm posting it mainly to silence the "it obviously can't be done" crowd), but I'm skeptical that it's possible to do much better with a true in-place algorithm (O(1) extra words of storage for multiplying two n x n matrices). Let's call the two matrices to be multiplied A and B. Assume that A and B are not aliased.
If A were upper-triangular, then the multiplication problem would look like this.
[a11 a12 a13 a14] [b11 b12 b13 b14]
[  0 a22 a23 a24] [b21 b22 b23 b24]
[  0   0 a33 a34] [b31 b32 b33 b34]
[  0   0   0 a44] [b41 b42 b43 b44]
We can compute the product into B as follows. Multiply the first row of B by a11. Add a12 times the second row of B to the first. Add a13 times the third row of B to the first. Add a14 times the fourth row of B to the first.
Now, we've overwritten the first row of B with the correct product. Fortunately, we don't need it any more. Multiply the second row of B by a22. Add a23 times the third row of B to the second. (You get the idea.)
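To make the scheme concrete, here is a minimal C sketch of that row-by-row update for a general n x n upper-triangular A; the function name and C99 VLA signature are my own illustration, not part of the original answer.
void upper_tri_mul_inplace(int n, double A[n][n], double B[n][n]) {
    for (int i = 0; i < n; ++i) {            /* rows top to bottom */
        for (int j = 0; j < n; ++j)
            B[i][j] *= A[i][i];              /* scale row i by the diagonal */
        for (int k = i + 1; k < n; ++k)      /* rows below i are still intact */
            for (int j = 0; j < n; ++j)
                B[i][j] += A[i][k] * B[k][j];
    }
}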
Likewise, if A were unit lower-triangular, then the multiplication problem would look like this.
[  1   0   0   0 ] [b11 b12 b13 b14]
[a21   1   0   0 ] [b21 b22 b23 b24]
[a31 a32   1   0 ] [b31 b32 b33 b34]
[a41 a42 a43   1 ] [b41 b42 b43 b44]
Add a43 times the third row of B to the fourth. Add a42 times the second row of B to the fourth. Add a41 times the first row of B to the fourth. Add a32 times the second row of B to the third. (You get the idea.)
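A matching C sketch for the unit lower-triangular case, processing rows bottom to top so that the rows still needed remain intact (again, the name and signature are my own illustration):
void unit_lower_tri_mul_inplace(int n, double A[n][n], double B[n][n]) {
    for (int i = n - 1; i >= 0; --i)         /* rows bottom to top */
        for (int k = 0; k < i; ++k)          /* rows above i are still intact */
            for (int j = 0; j < n; ++j)
                B[i][j] += A[i][k] * B[k][j];
}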
The complete algorithm is to LU-decompose A in place, multiply U B into B, multiply L B into B, and then LU-undecompose A in place (I'm not sure if anyone ever does this, but it seems easy enough to reverse the steps). There are about a million reasons not to implement this in practice, two being that A may not be LU-decomposable and that A won't be reconstructed exactly in general with floating-point arithmetic.

This answer is more sensible than my other one, though it uses one whole column of additional storage and has the same amount of data movement as the naive copying algorithm. To multiply A with B, storing the product in B (again assuming that A and B are stored separately):
For each column of B,
Copy it into the auxiliary storage column
Compute the product of A and the auxiliary storage column into that column of B
I switched the pseudocode to do the copy first because for large matrices, caching effects may result in it being more efficient to multiply A by the contiguous auxiliary column as opposed to the non-contiguous entries in B.
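A minimal C sketch of this scheme, assuming A and B are stored separately and the caller supplies the n-element auxiliary column (the names and C99 VLA signature are mine):
void matmul_into_rhs(int n, double A[n][n], double B[n][n], double col[n]) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i)
            col[i] = B[i][j];                /* copy column j of B aside */
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            for (int k = 0; k < n; ++k)
                s += A[i][k] * col[k];       /* row i of A times the saved column */
            B[i][j] = s;
        }
    }
}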

This answer is about 4x4 matrices. Assuming, as you propose, that out may reference either lhs or rhs, and that A and B have cells of uniform bit length, then in order to technically be able to perform the multiplication in place, elements of A and B, as signed integers, generally cannot be greater or smaller than ±floor(sqrt(2^(cellbitlength - 1) / 4)).
In this case, we can hack the elements of A into B (or vice versa) in the form of a bit shift or a combination of bit flags and modular arithmetic, and compute the product into the former matrix. If A and B were tightly packed, then, save for special cases or limits, we could not allow out to reference either lhs or rhs.
Using the naive method now would not be unlike David's second algorithm description, just with the extra column stored in A or B itself. Alternatively, we could implement the Strassen-Winograd algorithm according to the schedule below, again with no storage outside of lhs and rhs. (The formulation of p0,...,p6 and C is taken from page 166 of Jonathan Golan's The Linear Algebra a Beginning Graduate Student Ought to Know.)
p0 = (a11 + a22)(b11 + b22), p1 = (a21 + a22)b11, p2 = a11(b12 - b22),
p3 = (a21 - a11)(b11 + b12), p4 = (a11 + a12)b22, p5 = a22(b21 - b11),
p6 = (a12 - a22)(b21 + b22)
    ┌                                        ┐
c = │ p0 + p5 - p4 + p6    p2 + p4           │
    │ p1 + p5              p0 - p1 + p2 + p3 │
    └                                        ┘
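As a sanity check of the seven-product formulas (not in place, with separate output storage), a direct C transcription looks like this; the function name is mine:
void strassen2x2(double a[2][2], double b[2][2], double c[2][2]) {
    double p0 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
    double p1 = (a[1][0] + a[1][1]) * b[0][0];
    double p2 = a[0][0] * (b[0][1] - b[1][1]);
    double p3 = (a[1][0] - a[0][0]) * (b[0][0] + b[0][1]);
    double p4 = (a[0][0] + a[0][1]) * b[1][1];
    double p5 = a[1][1] * (b[1][0] - b[0][0]);
    double p6 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);
    c[0][0] = p0 + p5 - p4 + p6;             /* c11 = a11*b11 + a12*b21 */
    c[0][1] = p2 + p4;                       /* c12 = a11*b12 + a12*b22 */
    c[1][0] = p1 + p5;                       /* c21 = a21*b11 + a22*b21 */
    c[1][1] = p0 - p1 + p2 + p3;             /* c22 = a21*b12 + a22*b22 */
}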
Schedule:
Each p below is a 2x2 quadrant; "x" means unassigned; "nc", no change. To compute each p, we use an unassigned 2x2 quadrant to superimpose the (one or) two results of 2x2 block matrix addition or subtraction, using the same bit-shift or modular method above; we then add their product (the seven multiplications resulting in single elements) directly into the target block in any order (note that for the 2x2-sized p2 and p4, we use the southwest quadrant of rhs, which is no longer needed at that point). For example, to write the first 2x2-sized p6, we superimpose the block matrix subtraction, rhs(a12) - rhs(a22), and the block matrix addition, rhs(b21) + rhs(b22), onto the lhs21 submatrix; we then add each of the seven single-element p's for that block multiplication, (a12 - a22) x (b21 + b22), directly into the lhs11 submatrix.
      LHS             RHS (contains A and B)
(1)
 p6    x
 x     p3
(2)
 +p0   x
 p0    +p0
(3)
 +p5   x
 p5    nc
(4)
 nc    p1
 +p1   -p1
(5)
 -p4   p4             p4 (B21)
 nc    nc
(6)
 nc    +p2            p2 (B21)
 nc    +p2

Related

Algorithm for square root calculation

I have been implementing control software in C, and one of the control algorithms requires a square root calculation. I have been looking for a suitable square root algorithm with constant execution time irrespective of the radicand value. This requirement rules out the sqrt function from the standard library.
As for my platform, I am working with a 32-bit floating-point ARM Cortex-A9 based machine. The algorithms are calculated in physical units, so I expect radicands in the range <0, 400>. As for the required accuracy, an error of about 1% should be sufficient. Can anybody recommend a square root algorithm suitable for my purposes?
My initial approach would be to use a Taylor series for the square root, with precalculated coefficients at a number of fixed points. This reduces the calculation to a subtraction and a number of multiplications.
The look-up table would be a 2D array like:
point | C0 | C1 | C2 | C3 | C4 | ...
-----------------------------------------
0.5 | f00 | f01 | f02 | f03 | f04 |
-----------------------------------------
1.0 | f10 | f11 | f12 | f13 | f14 |
-----------------------------------------
1.5 | f20 | f21 | f22 | f23 | f24 |
-----------------------------------------
....
So when calculating sqrt(x) use the table row with the point closest to x.
Example:
sqrt(1.1) (i.e. use the point 1.0 coefficients)
f10 +
f11 * (1.1 - 1.0) +
f12 * (1.1 - 1.0) ^ 2 +
f13 * (1.1 - 1.0) ^ 3 +
f14 * (1.1 - 1.0) ^ 4
The table above suggests a fixed distance between the points at which you precalculate coefficients (i.e. 0.5 between each point). However, due to the nature of the square root, you may find that the distance between points should differ for different ranges of x. For instance, x in [0 - 1] -> distance 0.1, x in [1 - 2] -> distance 0.25, x in [2 - 10] -> distance 0.5, and so on.
Another thing is the number of terms needed to get the desired precision. Here you may also find that different ranges of x may require a different number of coefficients.
All this is easy to precalculate on a normal computer (e.g. using Excel).
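For illustration, here is a minimal C sketch of the whole scheme with uniform 0.5 spacing covering the stated <0, 400> range. The closed-form Taylor coefficients of sqrt about each point are standard; the table size, spacing, and the clamping near zero are assumptions of mine, not part of the answer.
#include <math.h>

#define NPOINTS 800              /* expansion points 0.5, 1.0, ..., 400.0 */
#define SPACING 0.5f
#define NCOEFFS 5

static float coeff[NPOINTS][NCOEFFS];

/* Build the table once at startup; only this step calls libm. */
void sqrt_table_init(void) {
    for (int r = 0; r < NPOINTS; ++r) {
        float p = (r + 1) * SPACING;
        float s = sqrtf(p);
        /* Taylor coefficients of sqrt(x) about p */
        coeff[r][0] = s;
        coeff[r][1] = 1.0f / (2.0f * s);
        coeff[r][2] = -1.0f / (8.0f * p * s);
        coeff[r][3] = 1.0f / (16.0f * p * p * s);
        coeff[r][4] = -5.0f / (128.0f * p * p * p * s);
    }
}

/* Constant-time evaluation: nearest point, then Horner's rule. */
float sqrt_approx(float x) {
    int r = (int)(x / SPACING + 0.5f) - 1;   /* nearest expansion point */
    if (r < 0) r = 0;                        /* poor near zero, as noted below */
    if (r >= NPOINTS) r = NPOINTS - 1;
    float d = x - (r + 1) * SPACING;
    float acc = coeff[r][NCOEFFS - 1];
    for (int i = NCOEFFS - 2; i >= 0; --i)
        acc = acc * d + coeff[r][i];
    return acc;
}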
Note: For values very close to zero this method isn't good. Maybe Newton's method would be a better choice.
Taylor series: https://en.wikipedia.org/wiki/Taylor_series
Newton's method: https://en.wikipedia.org/wiki/Newton%27s_method
Also relevant: https://math.stackexchange.com/questions/291168/algorithms-for-approximating-sqrt2
The Arm v7 instruction set provides a fast reciprocal square root estimate instruction: vrsqrte_f32 for two simultaneous approximations and vrsqrteq_f32 for four. (The scalar variant vrsqrtes_f32 is only available on Arm64 v8.2.)
The result can then be calculated simply as x * vrsqrte_f32(x), which has better than 0.33% relative accuracy over the whole range of positive values x. See https://www.mdpi.com/2079-3197/9/2/21/pdf
ARM NEON instruction FRSQRTE gives 8.25 correct bits of the result.
At x == 0, vrsqrtes_f32(x) == Inf, so x * vrsqrtes_f32(x) would be NaN.
If the value x == 0 is unavoidable, the optimal two-instruction sequence needs a bit more adjustment:
#include <arm_neon.h>

float sqrtest(float a) {
    // "transfer" the scalar input to both lanes of a vector --
    // optimally we would not need a separate instruction for this,
    // we would just let the processor calculate all lanes in the register
    float32x2_t a2 = vdup_n_f32(a);
    // next we create a mask that is all ones for the legal
    // domain of 1/sqrt(x)
    uint32x2_t is_legal = vcgt_f32(a2, vdup_n_f32(0.0f));
    // calculate two reciprocal square root estimates in parallel
    float32x2_t a2est = vrsqrte_f32(a2);
    // mask the result, so that all non-legal values of a2est are zeroed
    a2est = vreinterpret_f32_u32(vand_u32(is_legal, vreinterpret_u32_f32(a2est)));
    // x * 1/sqrt(x) == sqrt(x)
    a2 = vmul_f32(a2, a2est);
    // finally take lane 0 of the result, discarding the other half
    return vget_lane_f32(a2, 0);
}
Surely this method will have almost twice the throughput with
void sqrtest2(float &a, float &b) {
    // pack the two scalars into lanes 0 and 1
    float32x2_t a2 = vset_lane_f32(b, vdup_n_f32(a), 1);
    uint32x2_t is_legal = vcgt_f32(a2, vdup_n_f32(0.0f));
    float32x2_t a2est = vrsqrte_f32(a2);
    a2est = vreinterpret_f32_u32(vand_u32(is_legal, vreinterpret_u32_f32(a2est)));
    a2 = vmul_f32(a2, a2est);
    a = vget_lane_f32(a2, 0);
    b = vget_lane_f32(a2, 1);
}
And even better, if you can work directly with float32x2_t or float32x4_t inputs and outputs.
float32x2_t sqrtest2(float32x2_t a2) {
    uint32x2_t is_legal = vcgt_f32(a2, vdup_n_f32(0.0f));
    float32x2_t a2est = vrsqrte_f32(a2);
    a2est = vreinterpret_f32_u32(vand_u32(is_legal, vreinterpret_u32_f32(a2est)));
    return vmul_f32(a2, a2est);
}
This implementation gives sqrtest2(1) == 0.998 and sqrtest2(400) == 19.97 (tested on a MacBook M1 with arm64). Being branchless and LUT-free, it likely has constant execution time, assuming that all the instructions execute in a constant number of cycles.
I have decided to use the following approach: I chose Newton's method and then experimentally set a fixed number of iterations so that the error over the whole radicand range, i.e. <0, 400>, does not exceed the prescribed value. I ended up with six iterations. For a radicand with value 0, I decided to return 0 without any calculation.
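The answer doesn't include code; a minimal C sketch of that approach might look like the following. The initial guess is my assumption (the answer doesn't specify one), and how many iterations you need depends on it; the zero test is taken from the answer.
float sqrt_newton(float x) {
    if (x == 0.0f) return 0.0f;              /* radicand 0: return 0 directly */
    float y = x > 1.0f ? 0.5f * x : 1.0f;    /* crude initial guess (assumed) */
    for (int i = 0; i < 6; ++i)              /* fixed six iterations */
        y = 0.5f * (y + x / y);              /* Newton step for y*y - x = 0 */
    return y;
}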

Finding a solution for a linear equation system which has more variables than equations

Let's divide the problem into 2 parts; the second one is optional.
Part 1
I have 3 linear equations with N variables, where N is usually bigger than 3.
x1*a + x2*b + x3*c + x4*d + ... + xN*p = B1
y1*a + y2*b + y3*c + y4*d + ... + yN*p = B2
z1*a + z2*b + z3*c + z4*d + ... + zN*p = B3
Looking for (a, b, c, d, ..., p); the others are constants.
The standard Gaussian way won't work because the matrix is wider than it is tall. Of course I can use it to eliminate 2 variables. Do you know an algorithm to find a solution? (I only need one.) More 0s among the solution coefficients are better but not required.
Part 2
The coefficients in the solution must be non-negative.
Requirements:
The algorithm must be fast enough to run in real time (1800 runs per second on an average PC), so trial and error is a no-go.
The algorithm will be implemented in C#, but feel free to use pseudo-language if you want to write code.
Set the extra variables to zero. Now we have the matrix equation
A.x = b, where
    [x1 x2 x3]
A = [y1 y2 y3]
    [z1 z2 z3]
b = (B1, B2, B3), as a column vector.
Now invert A. The solution is
x = A^-1 . b
Enter matrix formulas in Excel with Ctrl+Shift+Enter.
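The question asks for C# but allows pseudo-language; here is a small C sketch of the same idea, zeroing all but the first three variables and solving the remaining 3x3 system by Cramer's rule. The names and the singularity handling are mine (if A is singular, pick a different set of three columns), and, like the answer, this ignores the non-negativity requirement of Part 2.
static double det3(double m[3][3]) {
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

/* Solve A.x = b for the first three unknowns; all others are set to zero.
 * Returns 0 if A is singular. */
int solve3(double A[3][3], double b[3], double x[3]) {
    double d = det3(A);
    if (d == 0.0) return 0;
    for (int j = 0; j < 3; ++j) {
        double Aj[3][3];
        for (int r = 0; r < 3; ++r)
            for (int c = 0; c < 3; ++c)
                Aj[r][c] = (c == j) ? b[r] : A[r][c];   /* replace column j by b */
        x[j] = det3(Aj) / d;                            /* Cramer's rule */
    }
    return 1;
}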

m cups, pour water, graph algorithm, n = (a1+1)(a2+1)...(am+1)

This problem has bothered me for a long time. I think the solution should be a graph algorithm. Thanks a lot.
Given m cups c1, c2, ..., cm with integer capacities a1, a2, ..., am mls respectively, you are only allowed to perform the following three types of operations:
Completely fill one cup.
Empty one cup.
Pour water from cup ci to cup cj until either ci is empty or cj is full.
Starting from the state in which all cups are empty, you would like to reach the final state in which cup c1 has x mls of water and all other cups are empty (for some given x). Design an algorithm to find the minimum number of operations required, or report that the desired final state is not reachable.
Your algorithm must run in time polynomial in n = (a1 + 1)(a2 + 1) ... (am + 1).
Let's say c1 has a capacity of 5 litres and c2 has a capacity of 3 litres. Their difference is 2 litres.
So 2 litres of water can be obtained in cup c1 by the following steps:
1) Fill c1, i.e. 5 litres.
2) Pour it into c2 until c2 is full.
3) Empty c2.
4) You now have 2 litres in c1.
For these m cups, you have m choose 2 = m!/((m-2)! 2!) = m(m-1)/2 combinations.
Calculate all of those and fill them into a hash table together with the operation count.
Because 2 litres can be the capacity of a cup as well as the difference of two cups' capacities, in that case we store operations = 1 instead of 3.
Now we have a hash set of all possible amounts of water we can hold, with their numbers of operations.
All you need is to find, from that collection, the minimum-length sequence of operations.
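A small C sketch of the enumeration this answer describes; the Entry structure and the flat array output are illustrative assumptions of mine (a real implementation would use a hash table keyed by amount, keeping the minimum operation count when an amount is both a capacity and a difference):
typedef struct { int amount; int ops; } Entry;

/* Record each capacity a[i] (1 operation: fill) and each positive pairwise
 * difference a[i] - a[j] (3 operations: fill, pour, empty). Returns count. */
int enumerate_amounts(const int a[], int m, Entry out[]) {
    int n = 0;
    for (int i = 0; i < m; ++i)
        out[n++] = (Entry){ a[i], 1 };
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < m; ++j)
            if (a[i] > a[j])
                out[n++] = (Entry){ a[i] - a[j], 3 };
    return n;
}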

Maximizing a Trigonometric Function of Many Variables in Mathematica

Just to give some context, my motivation for this programming question is to understand the derivation of the CHSH inequality, and it basically entails maximizing the following function:
Abs[c1 Cos[2(a1-b1)] + c2 Cos[2(a1-b2)] + c3 Cos[2(a2-b1)] + c4 Cos[2(a2-b2)]]
where a1, b1, b2, and a2 are arbitrary angles and c1, c2, c3, c4 = +/- 1 ONLY. I want to determine the maximum value of this function, along with the combination of angles that leads to this maximum.
Eventually, I also want to repeat the calculation for a1, a2, a3, b1, b2, b3 (which will have a total of nine cosine terms).
When I tried putting the following code into Mathematica, it simply spat the input back at me and did not perform any computation. Can someone help me out? (Note that my code didn't include the c1, c2, c3, c4 parameters; I wasn't quite sure how to incorporate them.)
Maximize[{Abs[Cos[2 (a1 - b1)] - Cos[2 (a1 - b2)] + Cos[2 (a2 - b1)] + Cos[2 (a2 - b2)]],
  0 <= a1 <= 2 \[Pi], 0 <= b1 <= 2 \[Pi], 0 <= a2 <= 2 \[Pi], 0 <= b2 <= 2 \[Pi]},
 {a1, b2, a2, b1}]
The answer is 4. This is because each Cos can be made to equal 1. You have 4 variables a1, a2, b1 and b2, and four cosines, so there are going to be several ways of making the combinations 2(a1-b1), 2(a1-b2), 2(a2-b1) and 2(a2-b2) equal 0 (hence choosing the corresponding c1/c2/c3/c4 to be +1), or equal to pi (hence choosing the corresponding c1/c2/c3/c4 to be -1).
For one set of angles that give the max, the obvious answer is a1=a2=b1=b2=0. For the 9 cosine case, the max will be 9, and one possible answer is a1=a2=a3=b1=b2=b3=0.
Regarding using Mathematica, I think the lesson is that it's always best to think about the maths itself before using tools to help with it.

Z3: Performing Matrix Operations

My Situation
I'm working on a project which needs to:
Prove the correctness of 3D matrix transformation formulas involving matrix operations
Find a model with the values of the unknown matrix entries.
My Question
What's the best way to express formulas using matrix operations so that they can be solved by Z3? (The way used in Z3Py's Sudoku Example isn't very elegant and doesn't seem suitable for more complex matrix operations.)
Thanks. - If anything's unclear, please leave a question comment.
Z3 has no support for matrices like this, so the best way to encode them is to encode the formulas they represent. This is roughly the same as how the Sudoku example encodes things. Here is a simple example using, e.g., a 2x2 real matrix (Z3Py link: http://rise4fun.com/Z3Py/MYnB ):
from z3 import *

# nonlinear version, constants a_ij, b_i are variables
# x_1, x_2, a_11, a_12, a_21, a_22, b_1, b_2 = Reals('x_1 x_2 a_11 a_12 a_21 a_22 b_1 b_2')
# linear version (all numbers are defined values)
x_1, x_2 = Reals('x_1 x_2')
# A matrix
a_11 = 1
a_12 = 2
a_21 = 3
a_22 = 5
# b vector
b_1 = 7
b_2 = 11
newx_1 = a_11 * x_1 + a_12 * x_2 + b_1
newx_2 = a_21 * x_1 + a_22 * x_2 + b_2
print(newx_1)
print(newx_2)
# solve using model generation
s = Solver()
s.add(newx_1 == 0)  # pointers to equations
s.add(newx_2 == 5)
print(s.check())
print(s.model())
# solve using "solve"
solve(And(newx_1 == 0, newx_2 == 5))
To get Z3 to solve for the unknown matrix entries, uncomment the line with the symbolic names for a_11, a_12, etc., comment out the definition of x_1, x_2 alone below it, and comment out the specific assignments a_11 = 1, etc. Z3 will then solve for the unknowns by finding satisfying assignments to these variables. Note that you may need to enable model completion for your purposes (e.g., if you need assignments to all of the unknown matrix parameters or x_i variables); see, e.g.: Z3 4.0: get complete model.
However, based on the link you shared, you are interested in performing operations using sinusoids (the rotations), which are in general transcendental, and Z3 at this point does not have support for transcendental functions (general exponentials, etc.). This will be the challenging part for you, e.g., proving for any choice of rotation angle that the operation works, or even just encoding the rotations. The scaling and translations should not be too hard to encode.
Also, see the following answer for how to encode linear differential equations, which are equations of the form x' = Ax, where A is an n * n matrix and x is an n-dimensional vector: Encoding of first order differential equation as First order formula
