So in high school math, and probably college, we are taught how to use trig functions, what they do, and what kinds of problems they solve. But they have always been presented to me as a black box. If you need the Sine or Cosine of something, you hit the sin or cos button on your calculator and you're set. Which is fine.
What I'm wondering is how trigonometric functions are typically implemented.

First, you have to do some sort of range reduction. Trig functions are periodic, so you need to reduce arguments down to a standard interval. For starters, you could reduce angles to be between 0 and 360 degrees. But by using a few identities, you realize you could get by with less. If you calculate sines and cosines for angles between 0 and 45 degrees, you can bootstrap your way to calculating all trig functions for all angles.
Once you've reduced your argument, most chips use a CORDIC algorithm to compute the sines and cosines. You may hear people say that computers use Taylor series. That sounds reasonable, but it's not true. The CORDIC algorithms are much better suited to efficient hardware implementation. (Software libraries may use Taylor series, say on hardware that doesn't support trig functions.) There may be some additional processing, using the CORDIC algorithm to get fairly good answers but then doing something else to improve accuracy.
There are some refinements to the above. For example, for very small angles theta (in radians), sin(theta) = theta to all the precision you have, so it's more efficient to simply return theta than to use some other algorithm. So in practice there is a lot of special case logic to squeeze out all the performance and accuracy possible. Chips with smaller markets may not go to as much optimization effort.
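To make the CORDIC idea above concrete, here is a minimal floating-point sketch of rotation-mode CORDIC. It is only an illustration: real hardware works in fixed point and replaces the multiplications by 2^-i with bit shifts, and the names `cordic_sin_cos`, `ANGLES` and `GAIN` are made up for this example. The table is built with math.atan for brevity, where a chip would store it as constants.

```python
import math

ITERATIONS = 40
# Table of the elementary rotation angles atan(2^-i).
ANGLES = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
# Each pseudo-rotation stretches the vector by sqrt(1 + 2^-2i);
# GAIN pre-compensates for the accumulated stretch.
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN /= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_sin_cos(theta):
    """Return (sin(theta), cos(theta)) for |theta| <= pi/2."""
    x, y, z = GAIN, 0.0, theta
    for i in range(ITERATIONS):
        d = 1.0 if z >= 0.0 else -1.0       # rotate towards z == 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ANGLES[i]
    return y, x


s, c = cordic_sin_cos(0.5)
print(s, c)
```

The argument is assumed to be range-reduced already; outside roughly ±1.74 radians the iteration no longer converges.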

edit: Jack Ganssle has a decent discussion in his book on embedded systems, "The Firmware Handbook".
FYI: If you have accuracy and performance constraints, Taylor series should not be used to approximate functions for numerical purposes. (Save them for your Calculus courses.) They make use of the analyticity of a function at a single point, e.g. the fact that all its derivatives exist at that point. They don't necessarily converge in the interval of interest. Often they do a lousy job of distributing the function approximation's accuracy in order to be "perfect" right near the evaluation point; the error generally zooms upwards as you get away from it. And if you have a function with any noncontinuous derivative (e.g. square waves, triangle waves, and their integrals), a Taylor series will give you the wrong answer.
The best "easy" solution, when using a polynomial of maximum degree N to approximate a given function f(x) over an interval x0 < x < x1, is from Chebyshev approximation; see Numerical Recipes for a good discussion. Note that although the Tj(x) and Tk(x) in the Wolfram article I linked to are written in terms of the cosine and inverse cosine, they are polynomials, and in practice you use a recurrence formula to get the coefficients. Again, see Numerical Recipes.
edit: Wikipedia has a semi-decent article on approximation theory. One of the sources they cite (Hart, "Computer Approximations") is out of print (& used copies tend to be expensive) but goes into a lot of detail about stuff like this. (Jack Ganssle mentions this in issue 39 of his newsletter The Embedded Muse.)
edit 2: Here are some tangible error metrics (see below) for Taylor vs. Chebyshev for sin(x). Some important points to note:
The maximum error of a Taylor series approximation over a given range is much larger than the maximum error of a Chebyshev approximation of the same degree. (For about the same error, you can get away with one fewer term with Chebyshev, which means faster performance.)
Range reduction is a huge win. This is because the contribution of higher order polynomials shrinks down when the interval of the approximation is smaller.
If you can't get away with range reduction, your coefficients need to be stored with more precision.
Don't get me wrong: Taylor series will work properly for sine/cosine (with reasonable precision for the range -pi/2 to +pi/2; technically, with enough terms, you can reach any desired precision for all real inputs, but try to calculate cos(100) using Taylor series and you can't do it unless you use arbitrary-precision arithmetic). If I were stuck on a desert island with a nonscientific calculator, and I needed to calculate sine and cosine, I would probably use Taylor series since the coefficients are easy to remember. But the real world applications for having to write your own sin() or cos() functions are rare enough that you'd be best off using an efficient implementation to reach a desired accuracy -- which the Taylor series is not.
Range = -pi/2 to +pi/2, degree 5 (3 terms)
Taylor: max error around 4.5e-3, f(x) = x - x^3/6 + x^5/120
Chebyshev: max error around 7e-5, f(x) = 0.9996949x - 0.1656700x^3 + 0.0075134x^5
Range = -pi/2 to +pi/2, degree 7 (4 terms)
Taylor: max error around 1.5e-4, f(x) = x - x^3/6 + x^5/120 - x^7/5040
Chebyshev: max error around 6e-7, f(x) = 0.99999660x - 0.16664824x^3 + 0.00830629x^5 - 0.00018363x^7
Range = -pi/4 to +pi/4, degree 3 (2 terms)
Taylor: max error around 2.5e-3, f(x) = x - x^3/6
Chebyshev: max error around 1.5e-4, f(x) = 0.999x - 0.1603x^3
Range = -pi/4 to +pi/4, degree 5 (3 terms)
Taylor: max error around 3.5e-5, f(x) = x - x^3/6 + x^5/120
Chebyshev: max error around 6e-7, f(x) = 0.999995x - 0.1666016x^3 + 0.0081215x^5
Range = -pi/4 to +pi/4, degree 7 (4 terms)
Taylor: max error around 3e-7, f(x) = x - x^3/6 + x^5/120 - x^7/5040
Chebyshev: max error around 1.2e-9, f(x) = 0.999999986x - 0.166666367x^3 + 0.008331584x^5 - 0.000194621x^7
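If you want to check figures like these yourself, a small script along the following lines should do. It is a rough sketch: it samples the interval densely and takes the largest deviation from math.sin, using the degree-5 coefficients for the range -pi/2 to +pi/2 quoted above. The helper names `poly_odd` and `max_error` are just for this example.

```python
import math


def poly_odd(coeffs, x):
    """Evaluate c1*x + c3*x^3 + c5*x^5 + ... using Horner's rule in x^2."""
    x2 = x * x
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x2 + c
    return acc * x


def max_error(coeffs, half_width, samples=10001):
    """Largest |poly(x) - sin(x)| over a dense grid on [-half_width, half_width]."""
    worst = 0.0
    for i in range(samples):
        x = -half_width + 2.0 * half_width * i / (samples - 1)
        worst = max(worst, abs(poly_odd(coeffs, x) - math.sin(x)))
    return worst


TAYLOR_5 = [1.0, -1.0 / 6.0, 1.0 / 120.0]
CHEBYSHEV_5 = [0.9996949, -0.1656700, 0.0075134]

taylor_err = max_error(TAYLOR_5, math.pi / 2)
cheb_err = max_error(CHEBYSHEV_5, math.pi / 2)
print(f"Taylor    degree 5: {taylor_err:.2e}")
print(f"Chebyshev degree 5: {cheb_err:.2e}")
```

Swapping in the other coefficient sets and half-widths from the list reproduces the remaining rows.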

I believe they're calculated using Taylor Series or CORDIC. Some applications which make heavy use of trig functions (games, graphics) construct trig tables when they start up so they can just look up values rather than recalculating them over and over.

Check out the Wikipedia article on trig functions. A good place to learn about actually implementing them in code is Numerical Recipes.
I'm not much of a mathematician, but my understanding of where sin, cos, and tan "come from" is that they are, in a sense, observed when you're working with right-angle triangles. If you take measurements of the lengths of sides of a bunch of different right-angle triangles and plot the points on a graph, you can get sin, cos, and tan out of that. As Harper Shelby points out, the functions are simply defined as properties of right-angle triangles.
A more sophisticated understanding is achieved by understanding how these ratios relate to the geometry of the circle, which leads to radians and all of that goodness. It's all there in the Wikipedia entry.

Most commonly for computers, a power series representation is used to calculate sines and cosines, and these are used for the other trig functions. Expanding these series out to about 8 terms computes the values needed to an accuracy close to the machine epsilon (the gap between 1 and the next larger representable floating point number).
The CORDIC method is faster since it is implemented on hardware, but it is primarily used for embedded systems and not standard computers.

I would like to extend the answer provided by @Jason S. Using a domain subdivision method similar to the one described there, together with Maclaurin series approximations, I achieved an average 2-3x speedup over the tan(), sin(), cos(), atan(), asin(), and acos() functions built into the gcc compiler with -O3 optimization. The best Maclaurin series approximating functions described below achieved double precision accuracy.
For the tan(), sin(), and cos() functions, and for simplicity, an overlapping 0 to 2pi+pi/80 domain was divided into 81 equal intervals with "anchor points" at pi/80, 3pi/80, ..., 161pi/80. Then tan(), sin(), and cos() of these 81 anchor points were evaluated and stored. With the help of trig identities, a single Maclaurin series function was developed for each trig function. Any angle between ±infinity may be submitted to the trig approximating functions because the functions first translate the input angle to the 0 to 2pi domain. This translation overhead is included in the approximation overhead.
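A rough sketch of how such an anchor-point scheme could look for sin and cos is given below. The answer does not say which trig identities were used, so the angle-addition formulas here are my assumption, as are the names `sincos_anchor`, `ANCHORS`, `SIN_TABLE` and `COS_TABLE`. The tables are filled with the library functions once at start-up.

```python
import math

STEP = math.pi / 40.0                                   # interval width
ANCHORS = [(2 * k + 1) * math.pi / 80.0 for k in range(81)]
SIN_TABLE = [math.sin(a) for a in ANCHORS]
COS_TABLE = [math.cos(a) for a in ANCHORS]
TWO_PI = 2.0 * math.pi


def sincos_anchor(x):
    """Return (sin(x), cos(x)) from a table lookup plus 4-term Maclaurin series."""
    r = math.fmod(x, TWO_PI)          # translate to the 0..2pi domain
    if r < 0.0:
        r += TWO_PI
    k = min(int(r / STEP), 80)        # index of the enclosing interval
    d = r - ANCHORS[k]                # offset from the anchor, |d| <= pi/80
    d2 = d * d
    # d - d^3/6 + d^5/120 - d^7/5040, written in nested form
    sin_d = d * (1.0 - d2 / 6.0 * (1.0 - d2 / 20.0 * (1.0 - d2 / 42.0)))
    # 1 - d^2/2 + d^4/24 - d^6/720, written in nested form
    cos_d = 1.0 - d2 / 2.0 * (1.0 - d2 / 12.0 * (1.0 - d2 / 30.0))
    # Angle-addition identities: sin(a + d) and cos(a + d)
    s = SIN_TABLE[k] * cos_d + COS_TABLE[k] * sin_d
    c = COS_TABLE[k] * cos_d - SIN_TABLE[k] * sin_d
    return s, c


print(sincos_anchor(1.0))
```

Because the offset from the nearest anchor never exceeds pi/80, a handful of series terms is enough; that is the whole point of the subdivision.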
Similar methods were developed for the atan(), asin(), and acos() functions, where an overlapping -1.0 to 1.1 domain was divided into 21 equal intervals with anchor points at -19/20, -17/20, ..., 19/20, 21/20. Then only atan() of these 21 anchor points was stored. Again, with the help of inverse trig identities, a single Maclaurin series function was developed for the atan() function. Results of the atan() function were then used to approximate asin() and acos().
Since all inverse trig approximating functions are based on the atan() approximating function, any double-precision argument input value is allowed. However the argument input to the asin() and acos() approximating functions is truncated to the ±1 domain because any value outside it is meaningless.
To test the approximating functions, a billion random function evaluations were forced to be evaluated (that is, the -O3 optimizing compiler was not allowed to skip an evaluation because its result would not be used). To remove the overhead of generating a billion random numbers and processing the results, a run without any trig or inverse trig function evaluation was timed first. This overhead was then subtracted from each test to obtain a more representative estimate of the actual function evaluation time.
Table 2. Time spent in seconds executing the indicated function or functions one billion times. The estimates are obtained by subtracting the time cost of evaluating one billion random numbers shown in the first row of Table 1 from the remaining rows in Table 1.
Time spent in tan(): 18.0515 18.2545
Time spent in TAN3(): 5.93853 6.02349
Time spent in TAN4(): 6.72216 6.99134
Time spent in sin() and cos(): 19.4052 19.4311
Time spent in SINCOS3(): 7.85564 7.92844
Time spent in SINCOS4(): 9.36672 9.57946
Time spent in atan(): 15.7160 15.6599
Time spent in ATAN1(): 6.47800 6.55230
Time spent in ATAN2(): 7.26730 7.24885
Time spent in ATAN3(): 8.15299 8.21284
Time spent in asin() and acos(): 36.8833 36.9496
Time spent in ASINCOS1(): 10.1655 9.78479
Time spent in ASINCOS2(): 10.6236 10.6000
Time spent in ASINCOS3(): 12.8430 12.0707
(In the interest of saving space, Table 1 is not shown.) Table 2 shows the results of two separate runs of a billion evaluations of each approximating function. The first column is the first run and the second column is the second run. The numbers '1', '2', '3' or '4' in the function names indicate the number of terms used in the Maclaurin series function to evaluate the particular trig or inverse trig approximation. SINCOS#() means that both sin and cos were evaluated at the same time. Likewise, ASINCOS#() means both asin and acos were evaluated at the same time. There is little extra overhead in evaluating both quantities at the same time.
The results show that increasing the number of terms slightly increases execution time as would be expected. Even the smallest number of terms gave around 12-14 digit accuracy everywhere except for the tan() approximation near where its value approaches ±infinity. One would expect even the tan() function to have problems there.
Similar results were obtained on a high-end MacBook Pro laptop in Unix and on a high-end desktop computer in Linux.

If you're asking for a more physical explanation of sin, cos, and tan, consider how they relate to right-angle triangles. The actual numeric value of cos(lambda) can be found by forming a right-angle triangle with one of the angles being lambda and dividing the length of the triangle's side adjacent to lambda by the length of the hypotenuse. Similarly, for sin use the opposite side divided by the hypotenuse. For tangent use the opposite side divided by the adjacent side. The classic mnemonic to remember this is SOHCAHTOA (pronounced socatoa).


Should one calculate QR decomposition before Least Squares to speed up the process?

I am reading the book "Introduction to Linear Algebra" by Gilbert Strang. The section is called "Orthonormal Bases and Gram-Schmidt". The author emphasises several times that with an orthonormal basis it is very easy and fast to calculate the least squares solution, since Qᵀ*Q = I, where Q is a design matrix with orthonormal columns. So the equation becomes x̂ = Qᵀb.
I got the impression that it is a good idea to always calculate the QR decomposition before applying least squares. But later I worked out the time complexity of the QR decomposition, and it turns out that calculating the QR decomposition and then applying least squares is more expensive than the regular x̂ = inv(AᵀA)Aᵀb.
Is it right that there is no point in using the QR decomposition to speed up least squares? Or did I get something wrong?
So the only purpose of QR decomposition regarding Least Squares is numerical stability?
There are many ways to do least squares; typically these vary in applicability, accuracy and speed.
Perhaps the Rolls-Royce method is to use SVD. This can be used to solve under-determined (fewer obs than states) and singular systems (where A'*A is not invertible) and is very accurate. It is also the slowest.
QR can only be used to solve non-singular systems (that is we must have A'*A invertible, ie A must be of full rank), and though perhaps not as accurate as SVD is also a good deal faster.
The normal equations ie
compute P = A'*A
solve P*x = A'*b
is the fastest (perhaps by a large margin if P can be computed efficiently, for example if A is sparse) but is also the least accurate. This too can only be used to solve non singular systems.
Inaccuracy should not be taken lightly nor dismissed as some academic fanciness. If you happen to know that the problems you will be solving are nicely behaved, then it might well be fine to use an inaccurate method. But otherwise the inaccurate routine might well fail (ie say there is no solution when there is, or worse come up with a totally bogus answer).
I'm a bit confused that you seem to be suggesting forming and solving the normal equations after performing the QR decomposition. The usual way to use QR in least squares is, if A is nObs x nStates:
decompose A as A = Q*[R; 0], where R (nStates x nStates) is upper triangular and sits on top of an (nObs - nStates) x nStates block of zeros
transform b into b~ = Q'*b
solve R*x = b# for x, where b# is the first nStates entries of b~
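The three steps above can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: it builds the thin factorisation (just the first nStates columns of Q) with modified Gram-Schmidt for brevity, whereas library routines typically use Householder reflections, and it assumes A has full column rank. The function name `qr_least_squares` is made up for the example.

```python
def qr_least_squares(A, b):
    """Solve min ||A x - b|| via a thin QR factorisation (modified Gram-Schmidt)."""
    n_obs, n_states = len(A), len(A[0])
    # Work on the columns of A; they are turned into the columns of Q in place.
    cols = [[A[i][j] for i in range(n_obs)] for j in range(n_states)]
    R = [[0.0] * n_states for _ in range(n_states)]
    for j in range(n_states):
        R[j][j] = sum(v * v for v in cols[j]) ** 0.5
        cols[j] = [v / R[j][j] for v in cols[j]]            # j-th column of Q
        for k in range(j + 1, n_states):
            R[j][k] = sum(q * v for q, v in zip(cols[j], cols[k]))
            cols[k] = [v - R[j][k] * q for q, v in zip(cols[j], cols[k])]
    # b# = first n_states entries of Q' * b
    b_hash = [sum(q * v for q, v in zip(cols[j], b)) for j in range(n_states)]
    # Back substitution on the upper triangular system R * x = b#
    x = [0.0] * n_states
    for j in reversed(range(n_states)):
        s = b_hash[j] - sum(R[j][k] * x[k] for k in range(j + 1, n_states))
        x[j] = s / R[j][j]
    return x


# Fit y = x0 + x1 * t to four observations at t = 0, 1, 2, 3.
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
b = [1.0, 3.0, 5.0, 8.0]
print(qr_least_squares(A, b))
```

Note that A'*A is never formed, which is where the accuracy advantage over the normal equations comes from.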

optimize integral f(x)exp(-x) from x=0,infinity

I need a robust integration algorithm for f(x)exp(-x) between x=0 and infinity, with f(x) a positive, differentiable function.
I do not know the array x a priori (it's an intermediate output of my routine). The x array is typically ~log-equispaced, but highly irregular.
Currently, I'm using the Simpson algorithm, but my problem is that often the domain is highly undersampled by the x array, which produces unrealistic values for the integral.
On each run of my code I need to do this integration thousands of times (each with a different set of x values), so I need to find an efficient and robust way to integrate this function.
More details:
The x array can have between 2 and N points (N known). The first value is always x[0] = 0.0. The last point is always a value greater than a tunable threshold x_max (such that exp(-x_max) approx 0). I only know the values of f at the points x[i] (though the function is a smooth function).
My first idea was to do a Laguerre-Gauss quadrature integration. However, this algorithm seems to be highly unreliable when one does not use the optimal quadrature points.
My current idea is to add a set of auxiliary points, interpolating f, such that the Simpson algorithm becomes more stable. If I do this, is there an optimal selection of auxiliary points?
I'd appreciate any advice.
Set t=1-exp(-x), then dt = exp(-x) dx and the integral value is equal to
integral[ f(-log(1-t)) , t=0..1 ]
which you can evaluate with the standard Simpson formula and hopefully get good results.
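As a minimal sketch of this substitution on sampled data: since the x[i] are irregular, the t[i] are irregular too, so the example below uses the trapezoid rule on the non-uniform t grid instead of Simpson (my simplification, not part of the suggestion above). It also assumes the contribution beyond the last sample is negligible. The function name `integrate_weighted` is made up.

```python
import math


def integrate_weighted(x, f):
    """Approximate the integral of f(x)*exp(-x) over [x[0], x[-1]] from samples.

    x must be increasing; f[i] is the sampled value of f at x[i].
    """
    # t = 1 - exp(-x); expm1 keeps precision for small x.
    t = [-math.expm1(-xi) for xi in x]
    total = 0.0
    for i in range(len(x) - 1):
        # Trapezoid rule in the t variable: the weight exp(-x) dx became dt.
        total += 0.5 * (f[i] + f[i + 1]) * (t[i + 1] - t[i])
    return total


xs = [0.01 * i for i in range(4001)]            # 0 .. 40
print(integrate_weighted(xs, [1.0] * len(xs)))  # integral of exp(-x), close to 1
print(integrate_weighted(xs, xs))               # integral of x*exp(-x), close to 1
```

The nice property is that the exponential weight is handled exactly by the change of variable, so only f itself has to be approximated between samples.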
Note that piecewise linear interpolation will always result in an order 2 error for the integral, as the result amounts to a trapezoid formula even if the method was Simpson. For better errors in the Simpson method you will need higher interpolation degrees, ideally cubic splines. Cubic Bezier polynomials with estimated derivatives to compute the control points could be a fast compromise.

Theory on how to find the equation of a curve given a variable number of data points

I have recently started working on a project. One of the problems I ran into was converting changing accelerations into velocity. Accelerations at different points in time are provided through sensors. If you get the equation of these data points, the derivative of a certain time (x) on that equation will be the velocity.
I know how to do this on the computer, but how would I get the equation to start with? I have searched around but I have not found any existing programs that can form an equation given a set of points. In the past, I have created a neural net algorithm to form an equation, but it takes an incredibly long time to run.
If someone can link me a program or explain the process of doing this, that would be fantastic.
Sorry if this is in the wrong forum. I would post into math, but a programming background will be needed to know the realm of possibility of what a computer can do quickly.
This started out as a comment but ended up being too big.
Just to make sure you're familiar with the terminology...
Differentiation takes a function f(t) and spits out a new function f'(t) that tells you how f(t) changes with time (i.e. f'(t) gives the slope of f(t) at time t). This takes you from displacement to velocity or from velocity to acceleration.
Integration takes a function f(t) and spits out a new function F(t) which measures the area under the function f(t) from the beginning of time up until a given point t. What's not obvious at first is that integration is actually the reverse of differentiation, a fact called the Fundamental Theorem of Calculus. So integration takes you from acceleration to velocity or velocity to displacement.
You don't need to understand the rules of calculus to do numerical integration. The simplest (and most naive) method for integrating a function numerically is just by approximating the area by dividing it up into small slices between time points and summing the area of rectangles. This approximating sum is called a Riemann sum.
This tends to really overshoot and undershoot certain parts of the function. A more accurate but still very simple method is the trapezoid rule, which also approximates the function with a series of slices, except the tops of the slices are straight lines between the function values rather than constant values.
Still more complicated, but a better approximation yet, is Simpson's rule, which approximates the function with parabolas between time points.
You can think of each of these methods as getting a better approximation of the integral because they each use more information about the function. The first method uses just one data point per area (a constant flat line), the second method uses two data points per area (a straight line), and the third method uses three data points per area (a parabola).
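Here is a small sketch of the three rules applied to sampled data, where t holds the sample times and a the sampled values (for example acceleration, giving the change in velocity). The function names are made up for illustration, and the Simpson version assumes equally spaced samples with an even number of intervals.

```python
def riemann_left(t, a):
    """Left Riemann sum: each slice uses the value at its left edge."""
    return sum(a[i] * (t[i + 1] - t[i]) for i in range(len(t) - 1))


def trapezoid(t, a):
    """Trapezoid rule: each slice uses the straight line between its two edges."""
    return sum(0.5 * (a[i] + a[i + 1]) * (t[i + 1] - t[i])
               for i in range(len(t) - 1))


def simpson(t, a):
    """Composite Simpson's rule: a parabola through each group of three points."""
    n = len(t) - 1
    if n < 2 or n % 2:
        raise ValueError("Simpson's rule needs an even number of intervals")
    h = (t[-1] - t[0]) / n
    return h / 3.0 * (a[0] + a[-1] + 4.0 * sum(a[1:-1:2]) + 2.0 * sum(a[2:-1:2]))


t = [0.0, 0.5, 1.0, 1.5, 2.0]
a = [ti * ti for ti in t]            # samples of a(t) = t^2, exact integral 8/3
print(riemann_left(t, a), trapezoid(t, a), simpson(t, a))
```

On this parabola Simpson's rule is exact, while the other two are visibly off with so few samples.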
You could read up on the math behind these methods here or in the first page of this pdf.
I agree with the comments that numerical integration is probably what you want. In case you still want a function going through your data, let me further argue against doing that.
It's usually a bad idea to find a curve that goes exactly through some given points. In almost any applied math context you have to accept that there is a little noise in the inputs, and a curve going exactly through the points may be very sensitive to noise. This can produce garbage outputs. Finding a curve going exactly through a set of points is asking for overfitting to get a function that memorizes rather than understands the data, and does not generalize.
For example, take the points (0,0), (1,1), (2,4), (3,9), (4,16), (5,25), (6,36). These are seven points on y=x^2, which is fine. The value of x^2 at x=-1 is 1. Now what happens if you replace (3,9) with (2.9,9.1)? There is a sixth order polynomial passing through all 7 points,
4.66329x - 8.87063x^2 + 7.2281x^3 - 2.35108x^4 + 0.349747x^5 - 0.0194304x^6.
The value of this at x=-1 is -23.4823, very far from 1. While the curve looks ok between 0 and 2, in other examples you can see large oscillations between the data points.
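You can reproduce this sensitivity without writing out the polynomial, by evaluating the interpolant directly with the Lagrange formula. The sketch below is only meant to illustrate the example above; `lagrange_eval` is a made-up helper name.

```python
def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial passing through all the given points."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        term = yi
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


clean = [(float(k), float(k * k)) for k in range(7)]
# Same data, but (3, 9) is nudged to (2.9, 9.1).
perturbed = [(2.9, 9.1) if p == (3.0, 9.0) else p for p in clean]

print(lagrange_eval(clean, -1.0))      # the parabola itself
print(lagrange_eval(perturbed, -1.0))  # far away from 1
```

A tiny change in one data point moves the extrapolated value by more than twenty units.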
Once you accept that you want an approximation, not a curve going exactly through the points, you have what is known as a regression problem. There are many types of regression. Typically, you choose a set of functions and a way to measure how well a function approximates the data. If you use a simple set of functions like lines (linear regression), you just find the best fit. If you use a more complicated family of functions, you should use regularization to penalize overly complicated functions such as high degree polynomials with large coefficients that memorize the data. If you either use a simple family or regularization, the function tends not to change much when you add or withhold a few data points, which indicates that it is a meaningful trend in the data.
Unfortunately, integrating accelerometer data to get velocity is a numerically unstable problem. For most applications, your error will diverge far too soon to get results of any practical value.
Recall that:
However well you fit a function to your accelerometer data, you will still essentially be doing a piecewise interpolation of the underlying acceleration function:
Where the error terms from each integration will add!
Typically you will see wildly inaccurate results after just a few seconds.
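To get a feel for how quickly the error accumulates, here is a toy sketch. It assumes a sensor at rest whose readings carry a small constant bias; the bias value, sample rate and duration are hypothetical numbers picked for illustration, not measurements.

```python
def cumulative_velocity(accel, dt, v0=0.0):
    """Integrate equally spaced acceleration samples with the trapezoid rule."""
    v = [v0]
    for i in range(1, len(accel)):
        v.append(v[-1] + 0.5 * (accel[i - 1] + accel[i]) * dt)
    return v


bias = 0.05                  # hypothetical constant sensor bias in m/s^2
dt = 0.01                    # hypothetical 100 Hz sample rate
samples = [bias] * 6001      # 60 s of readings; the true acceleration is zero
velocity = cumulative_velocity(samples, dt)
print(velocity[100], velocity[-1])   # velocity error after 1 s and after 60 s
```

The velocity error grows linearly with time (bias times elapsed time), and integrating once more to get position makes it grow quadratically.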

Fast find of all local maximums in C++

I have a formula for calculation of 1D polynomial, joint function. I want to find all local maximums of that function within a given range.
My approach
My current solution is to evaluate my function at a certain number of points in the range, then go through these points and remember the points where the function changes from rising to falling. Of course I can change the number of samples within the interval, but I want to find all maxima with as few samples as possible.
Can you suggest an effective algorithm?
Finding all the maxima of an unknown function is hard. You can never be sure that a maximum you found is really just one maximum or that you have not overlooked a maximum somewhere.
However, if something is known about the function, you can try to exploit that. The simplest case, of course, is when the function is known to be a polynomial of bounded degree. For a polynomial of degree up to five, all four extrema can be derived from a closed formula, because the derivative is at most a quartic. Most likely you don't want to implement that, but for linear, quadratic, and cubic equations the closed formula is feasible and can be used to find the maxima of a quartic function.
That is only the simplest information that might be known. Another interesting piece of information is whether you can give a bound on the second derivative. This would allow you to reduce the sampling density when you find a strong slope.
You may also be able to exploit information from how you intend to use the maxima you found. It can give you clues about how much precision you need. Is it sufficient to know that a point is near a maximum? Or that a point is flat? Is it really a problem if a saddle point is classified as a maximum? Or if a maximum right next to a turning point is overlooked? And how much is the allowable error margin?
If you cannot exploit information like this, you are thrown back to sampling your function in small steps and hoping you don't make too much of an error.
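For that fallback, a rough sketch of "sample, then refine" could look like the following. It is written in Python for brevity rather than C++, the name `local_maxima` and its parameters are made up, and it assumes each bracket found by the coarse scan contains a single maximum. Maxima at the interval end points and maxima narrower than the sample spacing are not detected.

```python
import math


def local_maxima(f, lo, hi, samples=200, tol=1e-9):
    """Coarse scan for rising-then-falling samples, refined by golden-section search."""
    xs = [lo + (hi - lo) * i / (samples - 1) for i in range(samples)]
    ys = [f(x) for x in xs]
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    maxima = []
    for i in range(1, samples - 1):
        if ys[i - 1] < ys[i] >= ys[i + 1]:
            a, b = xs[i - 1], xs[i + 1]          # bracket around the maximum
            while b - a > tol:
                c = b - inv_phi * (b - a)
                d = a + inv_phi * (b - a)
                if f(c) > f(d):
                    b = d                        # maximum lies in [a, d]
                else:
                    a = c                        # maximum lies in [c, b]
            maxima.append(0.5 * (a + b))
    return maxima


# Quartic with local maxima at x = -1 and x = 1 (and a minimum at x = 0).
print(local_maxima(lambda x: -(x * x - 1.0) ** 2, -2.0, 2.0))
```

The coarse scan decides how many samples you pay for; the refinement step is cheap by comparison, so it is the sampling density that the extra knowledge discussed here helps to reduce.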
You mention in the comments that your function is in fact a kernel density estimation. This gives you at least the following information:
Unless the kernel is unlimited in extent, your estimated function will be a piecewise function: any point on it will only be influenced by a precisely calculable number of measurement points.
If the kernel is based on a polynomial, the resulting estimated function will be piecewise polynomial. And it will be of the same degree as the kernel!
If the kernel is the uniform kernel, your estimated function will be a step function.
This case needs special handling because there won't be any maxima in the mathematical sense. However, it also makes your job really easy.
If the kernel is the triangular kernel, your estimated function will be a piecewise linear function.
If the kernel is the Epanechnikov kernel, your estimated function will be a piecewise quadratic function.
In all these cases it is next to trivial to produce the piecewise functions and to find their maxima.
If the kernel is of too high a degree or transcendental, you still know the measurements that your estimation is based on, and you know the kernel properties. This allows you to derive a heuristic on how dense your maxima can get.
At the very least, you know the first and second derivative of the kernel.
In principle, this allows you to calculate the first and second derivative of the estimated function at any point.
In the case of a local kernel, it might be more prudent to calculate the first derivative and an upper bound to the second derivative of the estimated function at any point.
With this information, it should be possible to constrain the search to the regions where there are maxima and avoid oversampling of the slopes.
As you see, there is a lot of useful information that you can derive from the knowledge of your function, and which you can use to your advantage.
The local maxima are among the roots of the first derivative. To isolate those roots in your working interval you can use the Sturm theorem, and proceed by dichotomy. In theory (using exact arithmetic) it gives you all real roots.
An equivalent approach is to express your polynomial in the Bezier/Bernstein basis and look for changes of signs of the coefficients (hull property). Dichotomic search can be efficiently implemented by recursive subdivision of the Bezier.
There are several classical algorithms available for polynomials, such as Laguerre, that usually look for the complex roots as well.

Frequency determination from sparsely sampled data

I'm observing a sinusoidally-varying source, i.e. f(x) = a sin (bx + d) + c, and want to determine the amplitude a, offset c and period/frequency b - the shift d is unimportant. Measurements are sparse, with each source measured typically between 6 and 12 times, and observations are at (effectively) random times, with intervals between observations roughly between a quarter and ten times the period (just to stress, the spacing of observations is not constant for each source). In each source the offset c is typically quite large compared to the measurement error, while amplitudes vary - at one extreme they are only on the order of the measurement error, while at the other extreme they are about twenty times the error. Hopefully that fully outlines the problem; if not, please ask and I'll clarify.
Thinking naively about the problem, the average of the measurements will be a good estimate of the offset c, while half the range between the minimum and maximum value of the measured f(x) will be a reasonable estimate of the amplitude, especially as the number of measurements increases so that the prospects of having observed the maximum offset from the mean improve. However, if the amplitude is small then it seems to me that there is little chance of accurately determining b, while the prospects should be better for large-amplitude sources even if they are only observed the minimum number of times.
Anyway, I wrote some code to do a least-squares fit to the data for the range of periods, and it identifies best-fit values of a, b and d quite effectively for the larger-amplitude sources. However, I see it finding a number of possible periods, and while one is the 'best' (in as much as it gives the minimum error-weighted residual) in the majority of cases the difference in the residuals for different candidate periods is not large. So what I would like to do now is quantify the possibility that the derived period is a 'false positive' (or, to put it slightly differently, what confidence I can have that the derived period is correct).
Does anybody have any suggestions on how best to proceed? One thought I had was to use a Monte-Carlo algorithm to construct a large number of sources with known values for a, b and c, construct samples that correspond to my measurement times, fit the resultant sample with my fitting code, and see what percentage of the time I recover the correct period. But that seems quite heavyweight, and I'm not sure that it's particularly useful other than giving a general feel for the false-positive rate.
And any advice for frameworks that might help? I have a feeling this is something that can likely be done in a line or two in Mathematica, but (a) I don't know it, and (b) don't have access to it. I'm fluent in Java, competent in IDL and can probably figure out other things...
This looks tailor-made for working in the frequency domain. Apply a Fourier transform and identify the frequency based on where the power is located, which should be clear for a sinusoidal source.
ADDENDUM To get an idea of how accurate your estimate is, I'd try a resampling approach such as cross-validation. I think this is the direction that you're heading with the Monte Carlo idea; lots of work is out there, so hopefully that's a wheel you won't need to re-invent.
The trick here is to do what might seem at first to make the problem more difficult. Rewrite f in the similar form:
f(x) = a1*sin(b*x) + a2*cos(b*x) + c
This is based on the identity for the sin(u+v).
Recognize that if b is known, then the problem of estimating {a1, a2, c} is a simple LINEAR regression problem. So all you need to do is use a 1-variable minimization tool, working on the value of b, to minimize the sum of squares of the residuals from that linear regression model. There are many such univariate optimizers to be found.
Once you have those parameters, it is easy to find the parameter a in your original model, since that is all you care about.
a = sqrt(a1^2 + a2^2)
The scheme I have described is called partitioned least squares.
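A bare-bones sketch of the scheme might look like this. Two simplifications are mine: the outer search over b is a plain grid scan, where a real univariate optimizer would be the better tool, and the inner three-parameter linear fit is solved through its 3x3 normal equations. All function names and the sample data are invented for illustration.

```python
import math


def solve3(M, v):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    A = [row[:] + [rhs] for row, rhs in zip(M, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, 3):
            factor = A[r][col] / A[col][col]
            for k in range(col, 4):
                A[r][k] -= factor * A[col][k]
    x = [0.0] * 3
    for r in (2, 1, 0):
        x[r] = (A[r][3] - sum(A[r][k] * x[k] for k in range(r + 1, 3))) / A[r][r]
    return x


def fit_linear_part(xs, ys, b):
    """For a fixed b, fit a1*sin(b*x) + a2*cos(b*x) + c; returns (a1, a2, c, sse)."""
    rows = [(math.sin(b * x), math.cos(b * x), 1.0) for x in xs]
    M = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    v = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    a1, a2, c = solve3(M, v)
    sse = sum((a1 * r[0] + a2 * r[1] + c - y) ** 2 for r, y in zip(rows, ys))
    return a1, a2, c, sse


def fit_sinusoid(xs, ys, b_min, b_max, steps=5000):
    """Scan b, solving the linear sub-problem at each value; returns (b, a, c)."""
    best = None
    for i in range(steps + 1):
        b = b_min + (b_max - b_min) * i / steps
        a1, a2, c, sse = fit_linear_part(xs, ys, b)
        if best is None or sse < best[0]:
            best = (sse, b, math.hypot(a1, a2), c)
    return best[1], best[2], best[3]


# Noise-free synthetic data at irregular times (made up for illustration).
xs = [0.0, 1.7, 4.1, 7.9, 9.2, 13.6, 18.3, 21.0, 26.4, 30.5]
ys = [2.0 * math.sin(1.3 * x + 0.4) + 5.0 for x in xs]
print(fit_sinusoid(xs, ys, 0.5, 3.0))
```

Recording the residual sum of squares at every scanned b, instead of only the best one, also shows directly how close the competing candidate periods are.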
If you have a reasonable estimate of the size and the nature of your noise (e.g. white Gaussian with SD sigma), you can
(a) invert the Hessian matrix to get an estimate of the error in your position and
(b) should be able to easily derive a significance statistic for your fit residuals.
For (a), compare
For (b), assume that your measurement errors are independent and thus the variance of their sum is the sum of their variances.
