In my shader I have variable b and need to determine within which range it lies and from that assign the right value to variable a. I ended up with a lot of if statements:
float a = const1;
if (b >= 2.0 && b < 4.0) {
a = const2;
} else if (b >= 4.0 && b < 6.0) {
a = const3;
} else if (b >= 6.0 && b < 8.0) {
a = const4;
} else if (b >= 8.0) {
a = const5;
}
My question is could this lead to performance issues (branching) and how can I optimize it? I've looked at the step and smoothstep functions but haven't figured out a good way to accomplish this.
To solve the problem depicted and avoid branching the usual techniques is to find a series of math functions, one for each condition, that evaluate to 0 for all the conditions except the one the variable satisfies. We can use these functions as gains to build a sum that evaluates to the right value each time.
In this case the conditions are simple intervals, so using the step functions we could write:
x in [a,b] as step(a,x)*step(x,b) (notice the inversion of x and b to get x<=b)
Or
x in [a,b[ as step(a,x)-step(x,b) as explained in this other post: GLSL point inside box test
Using this technique we obtain:
float a = (step(x,2.0)-((step(2.0,x)*step(x,2.0)))*const1 +
(step(2.0,x)-step(4.0,x))*const2 +
(step(4.0,x)-step(6.0,x))*const3 +
(step(6.0,x)-step(8.0,x))*const4 +
step(8.0,x)*const5
This works for general disjoint intervals, but in the case of a step or staircase function as in this question, we can simplify it as:
float a = const1 + step(2.0,x)*(const2-const1) +
step(4.0,x)*(const3-const2) +
step(6.0,x)*(const4-const3) +
step(8.0,x)*(const5-const4)
We could also use a 'bool conversion to float' as means to express our conditions, so as an example step(8.0,x)*(const5-const4) is equivalent to float(x>=8.0)*(const5-const4)
You can avoid branching by creating kind of a lookup table:
float table[5] = {const1, const2, const3, const4, const5};
float a = table[int(clamp(b, 0.0, 8.0) / 2)];
But the performance will depend on whether the lookup table will have to be created in every shader or if it's some kind of uniform... As always, measure first...
It turned out Jaa-cs answere wasn't viable for me as I'm targeting WebGL which doesn't allow variables as indexes (unless it's a loop index). His solution might work great for other OpenGL implementations though.
I came up with this solution using mix and step functions:
//Outside of main function:
uniform vec3 constArray[5]; // Values are sent in to shader
//Inside main function:
float a = constArray[0];
a = mix(a, constArray[1], step(2.0, b));
a = mix(a, constArray[2], step(4.0, b));
a = mix(a, constArray[3], step(6.0, b));
a = mix(a, constArray[4], step(8.0, b));
But after some testing it didn't give any visible performance boost. I finally ended up with this solution:
float a = constArray[0];
if (b >= 2.0)
a = constArray[1];
if (b >= 4.0)
a = constArray[2];
if (b >= 6.0)
a = constArray[3];
if (b >= 8.0)
a = constArray[4];
Which is both compact and easily readable. In my case both these alternatives and my original code performed equally, but at least here are some options to try out.
Related
I develop code performing Schur decomposition.
I test it with eigen corresponding stuff.
I found out that my code gives result different form those of eigen.
Most of matrix elements of my and egein's output are the same to 2 or 4 decimal places.
However worst difference available is about 70%: for example my code gives certain matrix element equal to 0.3 and eigen 0.19.
I decided to look deeper into eigen sources and figured out that if I change following eigen code
void ::applyHouseholderOnTheLeft(....)
{
......
Map<typename internal::plain_row_type<PlainObject>::type> tmp(workspace,cols());
Block<Derived, EssentialPart::SizeAtCompileTime, Derived::ColsAtCompileTime> bottom(derived(), 1, 0, rows()-1, cols());
tmp.noalias() = essential.adjoint() * bottom;
tmp += this->row(0);
this->row(0) -= tau * tmp;
bottom.noalias() -= tau * essential * tmp;
}
to this one (same was done for applyHouseholderOnTheRight):
void ::applyHouseholderOnTheLeft(....)
{
......
Map<typename internal::plain_row_type<PlainObject>::type> tmp(workspace,cols());
Block<Derived, EssentialPart::SizeAtCompileTime, Derived::ColsAtCompileTime> bottom(derived(), 1, 0, rows()-1, cols());
tmp.noalias() = essential.adjoint() * bottom;
tmp += this->row(0);
tmp *= tau;
this->row(0) -= tmp;
bottom.noalias() -= essential * tmp;
}
i get eigen output equal to mine (within 7-6 decimal places) !!
Mathematically these two pieces of code are equivalent.
So the question is - why there is so big difference in outputs of equivalent code ?
And what result is actually true (0.3 or 0.19 :-) ) ?
Original test code:
Matrix<double, Dynamic, Dynamic, RowMajor> A(10,10);
A<<6.9 ,4.8 ,9.5 ,3.1 ,6.5 ,5.8 ,-0.9 ,-7.3 ,-8.1 ,3.0 ,0.1 ,9.9 ,-3.2 ,6.4 ,6.2 ,-7.0 ,5.5 ,-2.2 ,-4.0 ,3.7 ,-3.6 ,9.0 ,-1.4 ,-2.4 ,1.7 ,-6.1 ,-4.2 , -2.5 ,-5.6 ,-0.4 ,0.4 ,9.1 ,-2.1 ,-5.4 ,7.3 ,3.6 ,-1.7 ,-5.7 ,-8.0 ,8.8 ,-3.0 ,-0.5 ,1.1 ,10.0 ,8.0 ,0.8 ,1.0 ,7.5 ,3.5 ,-1.8 ,0.3 ,-0.6 ,-6.3 ,-4.5 , -1.1 ,1.8 ,0.6 ,9.6 ,9.2 ,9.7 ,-2.6 ,4.3 ,-3.4 ,0.0 ,-6.7 ,5.0 ,10.5 ,1.5 ,-7.8 ,-4.1 ,-5.3 ,-5.0 ,2.0 ,-4.4 ,-8.4 ,6.0 ,-9.4 ,-4.8 ,8.2 ,7.8 ,5.2 ,-9.5 , -3.9 ,0.2 ,6.8 ,5.7 ,-8.5 ,-1.9 ,-0.3 ,7.4 ,-8.7 ,7.2 ,1.3 ,6.3 ,-3.7 ,3.9 ,3.3 ,-6.0 ,-9.1 ,5.9;
RealSchur<Matrix<double, Dynamic, Dynamic, RowMajor>>schur(A);
Matrix<double, Dynamic, Dynamic, RowMajor> T = schur.matrixT();
// and ,for example, element T(row_0, col_2) has notable difference: 0.19 (first code), 0.3 (second code)
P.S. vectorization in eigen is disabled in my case (macros EIGEN_DONT_VECTORIZE is defined)
Here is my little script for simulating Levy motion:
clear all;
clc; close all;
t = 0; T = 1000; I = T-t;
dT = T/I; t = 0:dT:T; tau = T/I;
alpha = 1.5;
sigma = dT^(1/alpha);
mu = 0; beta = 0;
N = 1000;
X = zeros(N, length(I));
for k=1:N
L = zeros(1,I);
for i = 1:I-1
L( (i + 1) * tau ) = L(i*tau) + stable2( alpha, beta, sigma, mu, 1);
end
X(k,1:length(L)) = L;
end
q = 0.1:0.1:0.9;
quant = qlines2(X, q, t(1:length(X)), tau);
hold all
for i = 1:length(quant)
plot( t, quant(i) * t.^(1/alpha), ':k' );
end
Where stable2 returns a stable random variable with given parameters (you may replace it with normrnd(mu, sigma) for this case, it's not crucial); qlines2 returns quantiles needed for plotting.
But I don't want to talk about math here. My problem is that this implementation is pretty slow, and I would like to speed it up. Unfortunately, computer science is not my main field - I heard something about methods like memoization, vectorization and that there is a lot of other techniques, but I don't know how to use them.
For example, I'm pretty sure I should replace this filthy double for-loop somehow, but I'm not sure what to do instead.
EDIT: Maybe I should use (and learn...) another language (Python, C, any functional one)? I always though that Matlab/OCTAVE is designed for numerical computation, but if change, then for which one?
The crucial bit is, as you said, the for loops, Matlab does not like those, so vectorization is indeed the keyword. (Together with preallocating the space.
I just altered you for loop section somewhat so that you do not have to reset L over and over again, instead we save all Ls in a bigger matrix (also I elimiated the length(L) command).
L = zeros(N,I);
for k=1:N
for i = 1:I-1
L(k,(i + 1) * tau ) = L(k,i*tau) + normrnd(mu, sigma);
end
X(k,1:I) = L(k,1:I);
end
Now you can already see that X(k,1:I) = L(k,1:I); in the loop is obsolete and that also means that we can switch the order of the loops. This is crucial, because the i-steps are recursive (depend on the previous step) that means we cannot vectorize this loop, we can only vectorize the k-loop.
Now your original code needed 9.3 seconds on my machine, the new code still needs about the same time)
L = zeros(N,I);
for i = 1:I-1
for k=1:N
L(k,(i + 1) * tau ) = L(k,i*tau) + normrnd(mu, sigma);
end
end
X = L;
But now we can apply the vectorization, instead of looping throu all rows (the loop over k) we can instead eliminate this loop, and doing all rows at "once".
L = zeros(N,I);
for i = 1:I-1
L(:,(i + 1) * tau ) = L(:,i*tau) + normrnd(mu, sigma); %<- this is not yet what you want, see comment below
end
X = L;
This code need only 0.045 seconds on my machine. I hope you still get the same output, because I have no idea what you are calculating, but I also hope you could see how you go about vectorizing code.
PS: I just noticed that we now use the same random number in the last example for the whole column, this is obviously not what you want. Instad you should generate a whole vector of random numbers, e.g:
L = zeros(N,I);
for i = 1:I-1
L(:,(i + 1) * tau ) = L(:,i*tau) + normrnd(mu, sigma,N,1);
end
X = L;
PPS: Great question!
I am trying to create random lines and select some of them, which are really rare. My code is rather simple, but to get something that I can use I need to create very large vectors(i.e.: <100000000 x 1, tracks variable in my code). Is there any way to be able to creater larger vectors and to reduce the time needed for all those calculations?
My code is
%Initial line values
tracks=input('Give me the number of muon tracks: ');
width=1e-4;
height=2e-4;
Ystart=15.*ones(tracks,1);
Xstart=-40+80.*rand(tracks,1);
%Xend=-40+80.*rand(tracks,1);
Xend=laprnd(tracks,1,Xstart,15);
X=[Xstart';Xend'];
Y=[Ystart';zeros(1,tracks)];
b=(Ystart.*Xend)./(Xend-Xstart);
hot=0;
cold=0;
for i=1:tracks
if ((Xend(i,1)<width/2 && Xend(i,1)>-width/2)||(b(i,1)<height && b(i,1)>0))
plot(X(:, i),Y(:, i),'r');%the chosen ones!
hold all
hot=hot+1;
else
%plot(X(:, i),Y(:, i),'b');%the rest of them
%hold all
cold=cold+1;
end
end
I am also using and calling a Laplace distribution generator made my Elvis Chen which can be found here
function y = laprnd(m, n, mu, sigma)
%LAPRND generate i.i.d. laplacian random number drawn from laplacian distribution
% with mean mu and standard deviation sigma.
% mu : mean
% sigma : standard deviation
% [m, n] : the dimension of y.
% Default mu = 0, sigma = 1.
% For more information, refer to
% http://en.wikipedia.org./wiki/Laplace_distribution
% Author : Elvis Chen (bee33#sjtu.edu.cn)
% Date : 01/19/07
%Check inputs
if nargin < 2
error('At least two inputs are required');
end
if nargin == 2
mu = 0; sigma = 1;
end
if nargin == 3
sigma = 1;
end
% Generate Laplacian noise
u = rand(m, n)-0.5;
b = sigma / sqrt(2);
y = mu - b * sign(u).* log(1- 2* abs(u));
The result plot is
As you indicate, your problem is two-fold. On the one hand, you have memory issues because you need to do so many trials. On the other hand, you have performance issues, because you have to process all those trials.
Solutions to each issue often have a negative impact on the other issue. IMHO, the best approach would be to find a compromise.
More trials are only possible of you get rid of those gargantuan arrays that are required for vectorization, and use a different strategy to do the loop. I will give priority to the possibility of using more trials, possibly at the cost of optimal performance.
When I execute your code as-is in the Matlab profiler, it immediately shows that the initial memory allocation for all your variables takes a lot of time. It also shows that the plot and hold all commands are the most time-consuming lines of them all. Some more trial-and-error shows that there is a disappointingly low maximum value for the trials you can do before OUT OF MEMORY errors start appearing.
The loop can be accelerated tremendously if you know a few things about its limitations in Matlab. In older versions of Matlab, it used to be true that loops should be avoided completely in favor of 'vectorized' code. In recent versions (I believe R2008a and up), the Mathworks introduced a piece of technology called the JIT accelerator (Just-in-Time compiler) which translates M-code into machine language on the fly during execution. Simply put, the JIT accelerator allows your code to bypass Matlab's interpreter and talk much more directly with the underlying hardware, which can save a lot of time.
The advice you'll hear a lot that loops should be avoided in Matlab, is no longer generally true. While vectorization still has its value, any procedure of sizable complexity that is implemented using only vectorized code is often illegible, hard to understand, hard to change and hard to upkeep. An implementation of the same procedure that uses loops, often has none of these drawbacks, and moreover, it will quite often be faster and require less memory.
Unfortunately, the JIT accelerator has a few nasty (and IMHO, unnecessary) limitations that you'll have to learn about.
One such thing is plot; it's generally a better idea to let a loop do nothing other than collect and manipulate data, and delay any plotting commands etc. until after the loop.
Another such thing is hold; the hold function is not a Matlab built-in function, meaning, it is implemented in M-language. Matlab's JIT accelerator is not able to accelerate non-builtin functions when used in a loop, meaning, your entire loop will run at Matlab's interpretation speed, rather than machine-language speed! Therefore, also delay this command until after the loop :)
Now, in case you're wondering, this last step can make a HUGE difference -- I know of one case where copy-pasting a function body into the upper-level loop caused a 1200x performance improvement. Days of execution time had been reduced to minutes!).
There is actually another minor issue in your loop (which is really small, and rather inconvenient, I will immediately agree with) -- the name of the loop variable should not be i. The name i is the name of the imaginary unit in Matlab, and the name resolution will also unnecessarily consume time on each iteration. It's small, but non-negligible.
Now, considering all this, I've come to the following implementation:
function [hot, cold, h] = MuonTracks(tracks)
% NOTE: no variables larger than 1x1 are initialized
width = 1e-4;
height = 2e-4;
% constant used for Laplacian noise distribution
bL = 15 / sqrt(2);
% Loop through all tracks
X = [];
hot = 0;
ii = 0;
while ii <= tracks
ii = ii + 1;
% Note that I've inlined (== copy-pasted) the original laprnd()
% function call. This was necessary to work around limitations
% in loops in Matlab, and prevent the nececessity of those HUGE
% variables.
%
% Of course, you can still easily generalize all of this:
% the new data
u = rand-0.5;
Ystart = 15;
Xstart = 800*rand-400;
Xend = Xstart - bL*sign(u)*log(1-2*abs(u));
b = (Ystart*Xend)/(Xend-Xstart);
% the test
if ((b < height && b > 0)) ||...
(Xend < width/2 && Xend > -width/2)
hot = hot+1;
% growing an array is perfectly fine when the chances of it
% happening are so slim
X = [X [Xstart; Xend]]; %#ok
end
end
% This is trivial to do here, and prevents an 'else' in the loop
cold = tracks - hot;
% Now plot the chosen ones
h = figure;
hold all
Y = repmat([15;0], 1, size(X,2));
plot(X, Y, 'r');
end
With this implementation, I can do this:
>> tic, MuonTracks(1e8); toc
Elapsed time is 24.738725 seconds.
with a completely negligible memory footprint.
The profiler now also shows a nice and even distribution of effort along the code; no lines that really stand out because of their memory use or performance.
It's possibly not the fastest possible implementation (if anyone sees obvious improvements, please, feel free to edit them in). But, if you're willing to wait, you'll be able to do MuonTracks(1e23) (or higher :)
I've also done an implementation in C, which can be compiled into a Matlab MEX file:
/* DoMuonCounting.c */
#include <math.h>
#include <matrix.h>
#include <mex.h>
#include <time.h>
#include <stdlib.h>
void CountMuons(
unsigned long long tracks,
unsigned long long *hot, unsigned long long *cold, double *Xout);
/* simple little helper functions */
double sign(double x) { return (x>0)-(x<0); }
double rand_double() { return (double)rand()/(double)RAND_MAX; }
/* the gateway function */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
int
dims[] = {1,1};
const mxArray
/* Output arguments */
*hot_out = plhs[0] = mxCreateNumericArray(2,dims, mxUINT64_CLASS,0),
*cold_out = plhs[1] = mxCreateNumericArray(2,dims, mxUINT64_CLASS,0),
*X_out = plhs[2] = mxCreateDoubleMatrix(2,10000, mxREAL);
const unsigned long long
tracks = (const unsigned long long)mxGetPr(prhs[0])[0];
unsigned long long
*hot = (unsigned long long*)mxGetPr(hot_out),
*cold = (unsigned long long*)mxGetPr(cold_out);
double
*Xout = mxGetPr(X_out);
/* call the actual function, and return */
CountMuons(tracks, hot,cold, Xout);
}
// The actual muon counting
void CountMuons(
unsigned long long tracks,
unsigned long long *hot, unsigned long long *cold, double *Xout)
{
const double
width = 1.0e-4,
height = 2.0e-4,
bL = 15.0/sqrt(2.0),
Ystart = 15.0;
double
Xstart,
Xend,
u,
b;
unsigned long long
i = 0ul;
*hot = 0ul;
*cold = tracks;
/* seed the RNG */
srand((unsigned)time(NULL));
/* aaaand start! */
while (i++ < tracks)
{
u = rand_double() - 0.5;
Xstart = 800.0*rand_double() - 400.0;
Xend = Xstart - bL*sign(u)*log(1.0-2.0*fabs(u));
b = (Ystart*Xend)/(Xend-Xstart);
if ((b < height && b > 0.0) || (Xend < width/2.0 && Xend > -width/2.0))
{
Xout[0 + *hot*2] = Xstart;
Xout[1 + *hot*2] = Xend;
++(*hot);
--(*cold);
}
}
}
compile in Matlab with
mex DoMuonCounting.c
(after having run mex setup :) and then use it in conjunction with a small M-wrapper like this:
function [hot,cold, h] = MuonTrack2(tracks)
% call the MEX function
[hot,cold, Xtmp] = DoMuonCounting(tracks);
% process outputs, and generate plots
hot = uint32(hot); % circumvents limitations in 32-bit matlab
X = Xtmp(:,1:hot);
clear Xtmp
h = NaN;
if ~isempty(X)
h = figure;
hold all
Y = repmat([15;0], 1, hot);
plot(X, Y, 'r');
end
end
which allows me to do
>> tic, MuonTrack2(1e8); toc
Elapsed time is 14.496355 seconds.
Note that the memory footprint of the MEX version is slightly larger, but I think that's nothing to worry about.
The only flaw I see is the fixed maximum number of Muon counts (hard-coded as 10000 as the initial array size of Xout; needed because there are no dynamically growing arrays in standard C)...if you're worried this limit could be broken, simply increase it, change it to be equal to a fraction of tracks, or do some smarter (but more painful) dynamic array-growing tricks.
In Matlab, it is sometimes faster to vectorize rather than use a for loop. For example, this expression:
(Xend(i,1) < width/2 && Xend(i,1) > -width/2) || (b(i,1) < height && b(i,1) > 0)
which is defined for each value of i, can be rewritten in a vectorised manner like this:
isChosen = (Xend(:,1) < width/2 & Xend(:,1) > -width/2) | (b(:,1) < height & b(:,1)>0)
Expessions like Xend(:,1) will give you a column vector, so Xend(:,1) < width/2 will give you a column vector of boolean values. Note then that I have used & rather than && - this is because & performs an element-wise logical AND, unlike && which only works on scalar values. In this way you can build the entire expression, such that the variable isChosen holds a column vector of boolean values, one for each row of your Xend/b vectors.
Getting counts is now as simple as this:
hot = sum(isChosen);
since true is represented by 1. And:
cold = sum(~isChosen);
Finally, you can get the data points by using the boolean vector to select rows:
plot(X(:, isChosen),Y(:, isChosen),'r'); % Plot chosen values
hold all;
plot(X(:, ~isChosen),Y(:, ~isChosen),'b'); % Plot unchosen values
EDIT: The code should look like this:
isChosen = (Xend(:,1) < width/2 & Xend(:,1) > -width/2) | (b(:,1) < height & b(:,1)>0);
hot = sum(isChosen);
cold = sum(~isChosen);
plot(X(:, isChosen),Y(:, isChosen),'r'); % Plot chosen values
Simplified example of my slowly working code (the function rbf is from the kernlab package) that needs speeding up:
install.packages('kernlab')
library('kernlab')
rbf <- rbfdot(sigma=1)
test <- matrix(NaN,nrow=5,ncol=10)
for (i in 1:5) {
for (j in 1:10) { test[i,j] <- rbf(i,j)}
}
I've tried outer() but it doesn't work because the rbf function doesn't return the required length (50). I need to speed this code up because I have a huge amount of data. I've read that vectorization would be the holy grail to speeding this up but I don't know how.
Could you please point me in the right direction?
If rbf is indeed the return value from a call to rbfdot, then body(rbf) looks something like
{
if (!is(x, "vector"))
stop("x must be a vector")
if (!is(y, "vector") && !is.null(y))
stop("y must a vector")
if (is(x, "vector") && is.null(y)) {
return(1)
}
if (is(x, "vector") && is(y, "vector")) {
if (!length(x) == length(y))
stop("number of dimension must be the same on both data points")
return(exp(sigma * (2 * crossprod(x, y) - crossprod(x) -
crossprod(y))))
}
}
Since most of this is consists of check functions, and crossprod simplifies when you are only passing in scalars, I think your function simplifies to
rbf <- function(x, y, sigma = 1)
{
exp(- sigma * (x - y) ^ 2)
}
For a possible further speedup, use the compiler package (requires R-2.14.0 or later).
rbf_loop <- function(m, n)
{
out <- matrix(NaN, nrow = m, ncol = n)
for (i in seq_len(m))
{
for (j in seq_len(n))
{
out[i,j] <- rbf(i,j)
}
}
out
)
library(compiler)
rbf_loop_cmp <- cmpfun(rbf_loop)
Then compare the timing of rbf_loop_cmp(m, n) to what you had before.
The simplification step is easier to see in reverse. If you expand (x - y) ^ 2 you get x ^ 2 - 2 * x * y + y ^ 2, which is minus the thing in the rbf function.
Use the function kernelMatrix() in kernlab,
it should be a couple a couple of order of magnitudes
faster then looping over the kernel function:
library(kernlab)
rbf <- rbfdot(sigma=1)
kernelMatrix(rbf, 1:5, 1:10)
I have to implement asin, acos and atan in environment where I have only following math tools:
sine
cosine
elementary fixed point arithmetic (floating point numbers are not available)
I also already have reasonably good square root function.
Can I use those to implement reasonably efficient inverse trigonometric functions?
I don't need too big precision (the floating point numbers have very limited precision anyways), basic approximation will do.
I'm already half decided to go with table lookup, but I would like to know if there is some neater option (that doesn't need several hundred lines of code just to implement basic math).
EDIT:
To clear things up: I need to run the function hundreds of times per frame at 35 frames per second.
In a fixed-point environment (S15.16) I successfully used the CORDIC algorithm (see Wikipedia for a general description) to compute atan2(y,x), then derived asin() and acos() from that using well-known functional identities that involve the square root:
asin(x) = atan2 (x, sqrt ((1.0 + x) * (1.0 - x)))
acos(x) = atan2 (sqrt ((1.0 + x) * (1.0 - x)), x)
It turns out that finding a useful description of the CORDIC iteration for atan2() on the double is harder than I thought. The following website appears to contain a sufficiently detailed description, and also discusses two alternative approaches, polynomial approximation and lookup tables:
http://ch.mathworks.com/examples/matlab-fixed-point-designer/615-calculate-fixed-point-arctangent
Do you need a large precision for arcsin(x) function? If no you may calculate arcsin in N nodes, and keep values in memory. I suggest using line aproximation. if x = A*x_(N) + (1-A)*x_(N+1) then x = A*arcsin(x_(N)) + (1-A)*arcsin(x_(N+1)) where arcsin(x_(N)) is known.
you might want to use approximation: use an infinite series until the solution is close enough for you.
for example:
arcsin(z) = Sigma((2n!)/((2^2n)*(n!)^2)*((z^(2n+1))/(2n+1))) where n in [0,infinity)
http://en.wikipedia.org/wiki/Inverse_trigonometric_functions#Expression_as_definite_integrals
You could do that integration numerically with your square root function, approximating with an infinite series:
Submitting here my answer from this other similar question.
nVidia has some great resources I've used for my own uses, few examples: acos asin atan2 etc etc...
These algorithms produce precise enough results. Here's a straight up Python example with their code copy pasted in:
import math
def nVidia_acos(x):
negate = float(x<0)
x=abs(x)
ret = -0.0187293
ret = ret * x
ret = ret + 0.0742610
ret = ret * x
ret = ret - 0.2121144
ret = ret * x
ret = ret + 1.5707288
ret = ret * math.sqrt(1.0-x)
ret = ret - 2 * negate * ret
return negate * 3.14159265358979 + ret
And here are the results for comparison:
nVidia_acos(0.5) result: 1.0471513828611643
math.acos(0.5) result: 1.0471975511965976
That's pretty close! Multiply by 57.29577951 to get results in degrees, which is also from their "degrees" formula.
It should be easy to addapt the following code to fixed point. It employs a rational approximation to calculate the arctangent normalized to the [0 1) interval (you can multiply it by Pi/2 to get the real arctangent). Then, you can use well known identities to get the arcsin/arccos from the arctangent.
normalized_atan(x) ~ (b x + x^2) / (1 + 2 b x + x^2)
where b = 0.596227
The maximum error is 0.1620º
#include <stdint.h>
#include <math.h>
// Approximates atan(x) normalized to the [-1,1] range
// with a maximum error of 0.1620 degrees.
float norm_atan( float x )
{
static const uint32_t sign_mask = 0x80000000;
static const float b = 0.596227f;
// Extract the sign bit
uint32_t ux_s = sign_mask & (uint32_t &)x;
// Calculate the arctangent in the first quadrant
float bx_a = ::fabs( b * x );
float num = bx_a + x * x;
float atan_1q = num / ( 1.f + bx_a + num );
// Restore the sign bit
uint32_t atan_2q = ux_s | (uint32_t &)atan_1q;
return (float &)atan_2q;
}
// Approximates atan2(y, x) normalized to the [0,4) range
// with a maximum error of 0.1620 degrees
float norm_atan2( float y, float x )
{
static const uint32_t sign_mask = 0x80000000;
static const float b = 0.596227f;
// Extract the sign bits
uint32_t ux_s = sign_mask & (uint32_t &)x;
uint32_t uy_s = sign_mask & (uint32_t &)y;
// Determine the quadrant offset
float q = (float)( ( ~ux_s & uy_s ) >> 29 | ux_s >> 30 );
// Calculate the arctangent in the first quadrant
float bxy_a = ::fabs( b * x * y );
float num = bxy_a + y * y;
float atan_1q = num / ( x * x + bxy_a + num );
// Translate it to the proper quadrant
uint32_t uatan_2q = (ux_s ^ uy_s) | (uint32_t &)atan_1q;
return q + (float &)uatan_2q;
}
In case you need more precision, there is a 3rd order rational function:
normalized_atan(x) ~ ( c x + x^2 + x^3) / ( 1 + (c + 1) x + (c + 1) x^2 + x^3)
where c = (1 + sqrt(17)) / 8
which has a maximum approximation error of 0.00811º
Maybe some kind of intelligent brute force like newton rapson.
So for solving asin() you go with steepest descent on sin()
Use a polynomial approximation. Least-squares fit is easiest (Microsoft Excel has it) and Chebyshev approximation is more accurate.
This question has been covered before: How do Trigonometric functions work?
Only continous functions are approximable by polynomials. And arcsin(x) is discontinous in point x=1.same arccos(x).But a range reduction to interval 1,sqrt(1/2) in that case avoid this situation. We have arcsin(x)=pi/2- arccos(x),arccos(x)=pi/2-arcsin(x).you can use matlab for minimax approximation.Aproximate only in range [0,sqrt(1/2)](if angle for that arcsin is request is bigger that sqrt(1/2) find cos(x).arctangent function only for x<1.arctan(x)=pi/2-arctan(1/x).