Approximation Ratios - algorithm

I've got a question about how best to characterise the performance of an approximation algorithm. I'm trying to find the 'correct' value of a graph problem instance whose cost function combines an objective term and a penalty term. I've set up my method so that the optimal solution of the problem has the highest value: fulfilling the objective criterion adds to the cost function value, and the penalty subtracts from it. This means that some poor solutions have negative values.
Typically the performance of an approximate algorithm is measured as:
r = found_cost_function_value / best_possible_solution
I'm a little confused about where to set my baseline of 0, however. Should I translate my objective values so that they are all positive, i.e. add |most_negative_solution_cost_function_value| to all my cost function values and to best_possible_solution? Or is it best practice to calculate r from the values as found?
EDIT: This is an optimisation problem. Each instance can be assigned an objective value. In my setup I have a bit of an issue: I found it works best for the execution of the algorithm to make the penalty terms much larger than my reward terms, so I have a distribution of solutions with a small head of positive values and a long tail of grossly negative ones, which skews things if I force everything into 0 < r < 1.
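To make the difference concrete, here is a tiny illustration (entirely hypothetical numbers, not from my actual instances) of how shifting by |most_negative_solution_cost_function_value| changes r:

```python
# Hypothetical numbers, just to illustrate the two ways of computing r.
best = 40.0             # best_possible_solution
found = 25.0            # value returned by the approximation algorithm
most_negative = -300.0  # worst solution in the distribution (penalty-dominated tail)

# Ratio on the raw values: can be negative or misleadingly scaled.
r_raw = found / best                          # 0.625

# Ratio after translating so the worst solution sits at 0.
shift = abs(most_negative)
r_shifted = (found + shift) / (best + shift)  # ~0.956, compressed towards 1

print(r_raw, r_shifted)
```

The shift keeps r inside (0, 1], but because the shift is large relative to the spread of the good solutions it compresses every ratio towards 1, which is exactly the skew described in the edit above.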

Related

Algorithm to determine the global minima of a blackbox function

I recently got this question in an interview, and it's kind of making me mad thinking about it.
Suppose you have a family of functions that each take a fixed number of parameters (different functions can take different numbers of parameters), each with the following properties:
Each input is between 0-1
Each output is between 0-1
The function is continuous
The function is a blackbox (i.e you cannot look at the equation for it)
He then asked me to create an algorithm to find the global minima of this function.
To me, this question read like being asked to solve the basis of machine learning. Obviously, if there were some way to guarantee finding the global minimum of a function, then we'd have perfect machine learning algorithms. Obviously we don't, so this question seems kind of impossible.
Anyways, the answer I gave was a mixture of divide and conquer with stochastic gradient descent. Since all functions are continuous, you'll always be able to calculate the partial gradient with respect to a certain dimension. You split each dimension in half and once you've reached a certain granularity, you apply stochastic gradient descent. In gradient descent, you initialize a certain start point, and evaluate the left and right side of that point based on a small delta with respect to every dimension to get the slope at that point. Then you update your point based on a certain learning rate and recalculate your partial derivatives until you've reached a point where the distance between old and new point is below a certain threshold. Then you re-merge and return the minimum of the two sections until you return the minimum value from all your divisions. My hope was to get around the fact that SGD can get stuck in local minima, so I thought dividing the dimension space would reduce the chance of that happening.
He seemed pretty unimpressed with my algorithm in the end. Does anybody have a faster/more accurate way of solving this problem?
The range is [0, 1], so any x in R^n with f(x) = 0 is a global minimum, and 0 is the best value you could ever hope for. Moreover, knowing only the domain, range, and continuity, there is no guarantee that the function is convex.
For example, f(x) = sqrt(x) with x in [0, 1] is concave, not convex (its minimum sits on the boundary rather than at a stationary point).
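For concreteness, here is a minimal Python sketch of the multi-start, finite-difference descent idea described in the question (the toy objective, learning rate, and restart count are arbitrary assumptions of mine); as the answer above notes, no finite sampling scheme can guarantee the global minimum of an arbitrary continuous black box.

```python
import random

def finite_diff_grad(f, x, eps=1e-4):
    """Central-difference gradient estimate of a black-box function on [0, 1]^n."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] = min(1.0, x[i] + eps)
        xm[i] = max(0.0, x[i] - eps)
        g.append((f(xp) - f(xm)) / (xp[i] - xm[i]))
    return g

def local_descent(f, x0, lr=0.05, tol=1e-6, max_iter=500):
    """Plain gradient descent, clipped to the [0, 1]^n box."""
    x = list(x0)
    for _ in range(max_iter):
        g = finite_diff_grad(f, x)
        x_new = [min(1.0, max(0.0, xi - lr * gi)) for xi, gi in zip(x, g)]
        if sum((a - b) ** 2 for a, b in zip(x, x_new)) < tol ** 2:
            break
        x = x_new
    return x, f(x)

def multistart_minimize(f, n_dims, n_starts=50, seed=0):
    """Random restarts stand in for the 'divide the space' step."""
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    for _ in range(n_starts):
        x0 = [rng.random() for _ in range(n_dims)]
        x, val = local_descent(f, x0)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Example: minimum of a smooth test function on [0, 1]^2.
f = lambda x: (x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2
print(multistart_minimize(f, 2))
```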

Optimization algorithms for piecewise-constant and similar ill-defined functions

I have a function which takes as inputs n-dimensional (say n = 10) vectors whose components are real numbers ranging from 0 to a large positive number A, say 50,000, ends included. For any such vector the function outputs an integer from 1 to, say, B = 100. I have this function and want to find its global minimum.
Broadly speaking there are algorithmic, iterative and heuristics-based approaches to tackle such optimization problems. What are the best techniques for solving this problem? I am looking for suggestions of algorithms or active research papers that I can implement from scratch to solve such problems. I have already given up hope on the existing optimization functions that ship with Matlab/python. I am hoping to read about the experience of others working with approximation/heuristic algorithms to optimize such ill-defined functions.
I ran fmincon, fminsearch, fminunc in Matlab but they fail to optimize the function. The function is ill-defined according to their definitions. Matlab says this for fmincon:
Initial point is a local minimum that satisfies the constraints.
Optimization completed because at the initial point, the objective function is non-decreasing
in feasible directions to within the selected value of the optimality tolerance, and
constraints are satisfied to within the selected value of the constraint tolerance.
The problem arises because this function has piecewise-constant behavior. If a vector V is assigned to a number, say 65, changing its components very slightly may not change the output at all. Such ill-defined behavior is to be expected because of the pigeonhole principle: the domain of the function is enormous, whereas the range is just a small set of integers.
I also wish to clarify one issue that may arise. Suppose I do gradient descent from a starting point x0 and the next x that I get from a GD iteration has some components lying outside the domain [0, 50000]; what happens then? In fact the domain is circular, so a vector of size 3 like [30; 5432; 50432] becomes [30; 5432; 432]. This is handled automatically, so there is no worry about iterations producing a vector outside the domain.
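For illustration only, a minimal Python sketch of a derivative-free local search that respects the circular (wrap-around) domain described above; the toy objective, step size, and acceptance rule are assumptions of mine, not a recommendation of any particular algorithm:

```python
import random

A = 50_000.0  # upper edge of each (circular) coordinate

def wrap(v):
    """Wrap every component back into [0, A) -- the 'circular' domain."""
    return [x % A for x in v]

def random_local_search(f, x0, step=1000.0, iters=20000, seed=0):
    """Derivative-free search suited to piecewise-constant objectives:
    propose a random perturbation, wrap it, keep it if it is no worse."""
    rng = random.Random(seed)
    x = wrap(x0)
    fx = f(x)
    for _ in range(iters):
        cand = wrap([xi + rng.gauss(0.0, step) for xi in x])
        fc = f(cand)
        if fc <= fx:          # accept ties so plateaus can still be crossed
            x, fx = cand, fc
    return x, fx

# Toy stand-in for the real black box: an integer label in 1..100.
def toy_objective(v):
    return 1 + int(sum(v)) % 100

x0 = [random.uniform(0, A) for _ in range(10)]
print(random_local_search(toy_objective, x0))
```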

Fast find of all local maximums in C++

Problem
I have a formula for evaluating a 1D polynomial joint function, and I want to find all local maxima of that function within a given range.
My approach
My current solution is that I evaluate my function at a certain number of points in the range, then go through these points and remember the points where the function changes from rising to falling. Of course I can change the number of samples within the interval, but I want to find all maxima with as few samples as possible.
Question
Can you suggest an effective algorithm?
Finding all the maxima of an unknown function is hard. You can never be sure that a maximum you found is really just one maximum or that you have not overlooked a maximum somewhere.
However, if something is known about the function, you can try to exploit that. The simplest case is, of course, if the function is known to be polynomial and bounded in degree. Up to degree five it is possible to derive all (up to four) extrema from a closed formula, because the derivative is at most a quartic; see http://en.wikipedia.org/wiki/Quartic_equation#General_formula_for_roots for details. Most likely you don't want to implement that, but for linear, quadratic, and cubic derivatives the closed formulas are quite feasible and can be used to find the maxima of anything up to a quartic function.
That is only the most simple information that might be known, other interesting information is whether you can give a bound to the second derivative. This would allow you to reduce the sampling density when you find a strong slope.
You may also be able to exploit information from how you intend to use the maxima you found. It can give you clues about how much precision you need. Is it sufficient to know that a point is near a maximum? Or that a point is flat? Is it really a problem if a saddle point is classified as a maximum? Or if a maximum right next to a turning point is overlooked? And how much is the allowable error margin?
If you cannot exploit information like this, you are thrown back to sampling your function in small steps and hoping you don't make too much of an error.
Edit:
You mention in the comments that your function is in fact a kernel density estimation. This gives you at least the following information:
If the kernel has limited extent (compact support), your estimated function will be piecewise: any point on it is influenced by only a precisely calculable number of measurement points.
If the kernel is based on a rational function, the resulting estimated function will be piecewise rational, and of the same degree as the kernel!
If the kernel is the uniform kernel, your estimated function will be a step function.
This case needs special handling because there won't be any maxima in the mathematical sense. However, it also makes your job really easy.
If the kernel is the triangular kernel, your estimated function will be a piecewise linear function.
If the kernel is the Epanechnikov kernel, your estimated function will be a piecewise quadratic function.
In all these cases it is next to trivial to produce the piecewise functions and to find their maxima.
If the kernel is of too high a degree or transcendental, you still know the measurements that your estimation is based on, and you know the kernel's properties. This allows you to derive a heuristic for how densely packed your maxima can be.
At the very least, you know the first and second derivative of the kernel.
In principle, this allows you to calculate the first and second derivative of the estimated function at any point.
In the case of a local kernel, it might be more prudent to calculate the first derivative and an upper bound to the second derivative of the estimated function at any point.
With this information, it should be possible to constrain the search to the regions where there are maxima and avoid oversampling of the slopes.
As you see, there is a lot of useful information that you can derive from the knowledge of your function, and which you can use to your advantage.
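As an illustration of the piecewise-quadratic (Epanechnikov) case mentioned above, here is a small Python/NumPy sketch; the helper name and test data are mine, and maxima that fall exactly on a kink between pieces are deliberately ignored:

```python
import numpy as np

def epanechnikov_kde_maxima(data, h):
    """Exact local maxima of a 1-D Epanechnikov KDE.

    Between consecutive breakpoints x_i +- h the set of contributing
    points is fixed, so the estimate is a single downward quadratic
    there; its vertex is simply the mean of the active points.
    """
    data = np.asarray(data, dtype=float)
    n = data.size
    breaks = np.unique(np.concatenate([data - h, data + h]))
    maxima = []
    for lo, hi in zip(breaks[:-1], breaks[1:]):
        mid = 0.5 * (lo + hi)
        active = data[np.abs(mid - data) < h]
        if active.size == 0:
            continue
        vertex = active.mean()            # argmax of the local quadratic
        if lo < vertex < hi:              # ignore maxima sitting on a kink
            density = 0.75 / (n * h) * np.sum(1.0 - ((vertex - active) / h) ** 2)
            maxima.append((vertex, density))
    return maxima

print(epanechnikov_kde_maxima([0.0, 0.1, 0.15, 1.0, 1.05], h=0.3))
```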
The local maxima are among the roots of the first derivative. To isolate those roots in your working interval you can use the Sturm theorem, and proceed by dichotomy. In theory (using exact arithmetic) it gives you all real roots.
An equivalent approach is to express your polynomial in the Bezier/Bernstein basis and look for changes of signs of the coefficients (hull property). Dichotomic search can be efficiently implemented by recursive subdivision of the Bezier.
There are several classical algorithms available for polynomials, such as Laguerre, that usually look for the complex roots as well.
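Although the question is about C++, the "roots of the first derivative" idea above is easiest to show in a few lines of Python/NumPy (a sketch, not production code; in C++ you would substitute a companion-matrix eigenvalue solver or the Sturm/Bezier subdivision described above):

```python
import numpy as np

def local_maxima_of_polynomial(coeffs, lo, hi):
    """Local maxima of a polynomial on [lo, hi].

    coeffs are in descending-power order, as for numpy.poly1d.
    The maxima are the real roots of p' at which p'' < 0.
    """
    p = np.poly1d(coeffs)
    dp, ddp = p.deriv(), p.deriv(2)
    maxima = []
    for r in dp.roots:
        if abs(r.imag) > 1e-9:          # keep only (numerically) real roots
            continue
        x = r.real
        if lo <= x <= hi and ddp(x) < 0:
            maxima.append((x, p(x)))
    return sorted(maxima)

# Example: p(x) = x^4 - 2x^2 + 0.5 has one local maximum at x = 0 on [-2, 2].
print(local_maxima_of_polynomial([1, 0, -2, 0, 0.5], -2.0, 2.0))
```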

Frequency determination from sparsely sampled data

I'm observing a sinusoidally-varying source, i.e. f(x) = a sin(bx + d) + c, and want to determine the amplitude a, offset c and period/frequency b - the shift d is unimportant. Measurements are sparse, with each source measured typically between 6 and 12 times, and observations are at (effectively) random times, with intervals between observations roughly between a quarter and ten times the period (just to stress, the spacing of observations is not constant for each source). In each source the offset c is typically quite large compared to the measurement error, while amplitudes vary - at one extreme they are only on the order of the measurement error, while at the other extreme they are about twenty times the error. Hopefully that fully outlines the problem; if not, please ask and I'll clarify.
Thinking naively about the problem, the average of the measurements will be a good estimate of the offset c, while half the range between the minimum and maximum value of the measured f(x) will be a reasonable estimate of the amplitude, especially as the number of measurements increase so that the prospects of having observed the maximum offset from the mean improve. However, if the amplitude is small then it seems to me that there is little chance of accurately determining b, while the prospects should be better for large-amplitude sources even if they are only observed the minimum number of times.
Anyway, I wrote some code to do a least-squares fit to the data for the range of periods, and it identifies best-fit values of a, b and d quite effectively for the larger-amplitude sources. However, I see it finding a number of possible periods, and while one is the 'best' (in as much as it gives the minimum error-weighted residual) in the majority of cases the difference in the residuals for different candidate periods is not large. So what I would like to do now is quantify the possibility that the derived period is a 'false positive' (or, to put it slightly differently, what confidence I can have that the derived period is correct).
Does anybody have any suggestions on how best to proceed? One thought I had was to use a Monte-Carlo algorithm to construct a large number of sources with known values for a, b and c, construct samples that correspond to my measurement times, fit the resultant samples with my fitting code, and see what percentage of the time I recover the correct period. But that seems quite heavyweight, and I'm not sure that it's particularly useful other than giving a general feel for the false-positive rate.
And any advice for frameworks that might help? I have a feeling this is something that can likely be done in a line or two in Mathematica, but (a) I don't know it, and (b) I don't have access to it. I'm fluent in Java, competent in IDL and can probably figure out other things...
This looks tailor-made for working in the frequency domain. Apply a Fourier transform and identify the frequency based on where the power is located, which should be clear for a sinusoidal source.
ADDENDUM: To get an idea of how accurate your estimate is, I'd try a resampling approach such as cross-validation. I think this is the direction you're heading with the Monte Carlo idea; a lot of work is out there, so hopefully that's a wheel you won't need to re-invent.
The trick here is to do what might at first seem to make the problem more difficult. Rewrite f in the equivalent form:
f(x) = a1*sin(b*x) + a2*cos(b*x) + c
This is based on the angle-addition identity for sin(u + v).
Recognize that if b is known, then the problem of estimating {a1, a2, c} is a simple LINEAR regression problem. So all you need to do is use a 1-variable minimization tool, working on the value of b, to minimize the sum of squares of the residuals from that linear regression model. There are many such univariate optimizers to be found.
Once you have those parameters, it is easy to find the parameter a in your original model, since that is all you care about.
a = sqrt(a1^2 + a2^2)
The scheme I have described is called partitioned least squares.
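A minimal NumPy/SciPy sketch of this partitioned-least-squares scheme (the frequency bounds, grid density, and synthetic data are assumptions of mine, not part of the answer):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_sinusoid(t, y, b_lo, b_hi):
    """Partitioned least squares for y ~ a*sin(b*t + d) + c.

    For a fixed frequency b the model is linear in (a1, a2, c), so only b
    needs a 1-D search.  b_lo/b_hi bound the candidate angular frequency.
    """
    def residual_ss(b):
        X = np.column_stack([np.sin(b * t), np.cos(b * t), np.ones_like(t)])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.sum((y - X @ coef) ** 2)

    # Coarse scan + local polish: with sparse sampling the residual is very
    # multimodal in b, so a single local search is not enough.
    bs = np.linspace(b_lo, b_hi, 2000)
    b0 = bs[np.argmin([residual_ss(b) for b in bs])]
    res = minimize_scalar(residual_ss, bounds=(max(b_lo, 0.9 * b0), 1.1 * b0),
                          method="bounded")
    b = res.x
    X = np.column_stack([np.sin(b * t), np.cos(b * t), np.ones_like(t)])
    a1, a2, c = np.linalg.lstsq(X, y, rcond=None)[0]
    return np.hypot(a1, a2), b, c        # amplitude, frequency, offset

# Synthetic sparse example: 8 random observation times, known parameters.
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 20, 8))
y = 3.0 * np.sin(1.3 * t + 0.4) + 10.0 + rng.normal(0, 0.1, t.size)
print(fit_sinusoid(t, y, 0.5, 3.0))
```

The coarse scan before the 1-D polish matters more than the choice of univariate optimizer, precisely because of the candidate-period ambiguity the question describes.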
If you have a reasonable estimate of the size and nature of your noise (e.g. white Gaussian with standard deviation sigma), you can
(a) invert the Hessian matrix to get an estimate of the error in your fitted parameters, and
(b) derive a significance statistic for your fit residuals.
For (a), compare http://www.physics.utah.edu/~detar/phys6720/handouts/curve_fit/curve_fit/node6.html
For (b), assume that your measurement errors are independent and thus the variance of their sum is the sum of their variances.
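To make (a) and (b) concrete, here is a sketch using SciPy's curve_fit, whose reported covariance comes from the same Jacobian/Hessian approximation; the synthetic data, noise level, and starting guess are assumptions of mine, and with sparse data a reasonable initial b is essential:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(t, a, b, c, d):
    return a * np.sin(b * t + d) + c

# Synthetic data with known white Gaussian noise (sigma = 0.2).
rng = np.random.default_rng(2)
t = np.sort(rng.uniform(0, 20, 10))
sigma = 0.2
y = model(t, 3.0, 1.3, 10.0, 0.4) + rng.normal(0, sigma, t.size)

popt, pcov = curve_fit(model, t, y, p0=[2.5, 1.2, 9.5, 0.0],
                       sigma=np.full_like(t, sigma), absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))     # (a) 1-sigma parameter uncertainties
chi2 = np.sum(((y - model(t, *popt)) / sigma) ** 2)
dof = t.size - len(popt)          # residual degrees of freedom
print(popt, perr, chi2 / dof)     # (b) reduced chi^2 ~ 1 for a plausible fit
```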

How do Trigonometric functions work? [closed]

So in high school math, and probably college, we are taught how to use trig functions, what they do, and what kinds of problems they solve. But they have always been presented to me as a black box. If you need the Sine or Cosine of something, you hit the sin or cos button on your calculator and you're set. Which is fine.
What I'm wondering is how trigonometric functions are typically implemented.
First, you have to do some sort of range reduction. Trig functions are periodic, so you need to reduce arguments down to a standard interval. For starters, you could reduce angles to be between 0 and 360 degrees. But by using a few identities, you realize you could get by with less. If you calculate sines and cosines for angles between 0 and 45 degrees, you can bootstrap your way to calculating all trig functions for all angles.
Once you've reduced your argument, most chips use a CORDIC algorithm to compute the sines and cosines. You may hear people say that computers use Taylor series. That sounds reasonable, but it's not true. The CORDIC algorithms are much better suited to efficient hardware implementation. (Software libraries may use Taylor series, say on hardware that doesn't support trig functions.) There may be some additional processing, using the CORDIC algorithm to get fairly good answers but then doing something else to improve accuracy.
There are some refinements to the above. For example, for very small angles theta (in radians), sin(theta) = theta to all the precision you have, so it's more efficient to simply return theta than to use some other algorithm. So in practice there is a lot of special case logic to squeeze out all the performance and accuracy possible. Chips with smaller markets may not go to as much optimization effort.
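As a rough illustration of the rotation-mode CORDIC idea (floating point here for clarity; real hardware uses fixed-point shifts and adds, and this is not any particular chip's implementation):

```python
import math

def cordic_sin_cos(theta, n_iter=32):
    """Rotation-mode CORDIC for sin/cos of theta in [-pi/2, pi/2].

    Only shifts, adds, and a small table of arctangents are needed,
    which is why hardware favours it over Taylor series.
    """
    atans = [math.atan(2.0 ** -i) for i in range(n_iter)]
    # Scale factor K = prod 1/sqrt(1 + 2^-2i); precomputed once.
    k = 1.0
    for i in range(n_iter):
        k /= math.sqrt(1.0 + 2.0 ** (-2 * i))

    x, y, z = k, 0.0, theta
    for i in range(n_iter):
        d = 1.0 if z >= 0 else -1.0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * atans[i]
    return y, x          # (sin(theta), cos(theta))

print(cordic_sin_cos(0.5), (math.sin(0.5), math.cos(0.5)))
```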
edit: Jack Ganssle has a decent discussion in his book on embedded systems, "The Firmware Handbook".
FYI: If you have accuracy and performance constraints, Taylor series should not be used to approximate functions for numerical purposes. (Save them for your Calculus courses.) They make use of the analyticity of a function at a single point, e.g. the fact that all its derivatives exist at that point. They don't necessarily converge in the interval of interest. Often they do a lousy job of distributing the function approximation's accuracy in order to be "perfect" right near the evaluation point; the error generally zooms upwards as you get away from it. And if you have a function with any noncontinuous derivative (e.g. square waves, triangle waves, and their integrals), a Taylor series will give you the wrong answer.
The best "easy" solution, when using a polynomial of maximum degree N to approximate a given function f(x) over an interval x0 < x < x1, is Chebyshev approximation; see Numerical Recipes for a good discussion. Note that the Tj(x) and Tk(x) in the Wolfram article I linked to use cos and inverse cosine in their definitions, but they are polynomials, and in practice you use a recurrence formula to get the coefficients. Again, see Numerical Recipes.
edit: Wikipedia has a semi-decent article on approximation theory. One of the sources they cite (Hart, "Computer Approximations") is out of print (& used copies tend to be expensive) but goes into a lot of detail about stuff like this. (Jack Ganssle mentions this in issue 39 of his newsletter The Embedded Muse.)
edit 2: Here are some tangible error metrics (see below) for Taylor vs. Chebyshev approximations of sin(x). Some important points to note:
The maximum error of a Taylor series approximation over a given range is much larger than the maximum error of a Chebyshev approximation of the same degree. (For about the same error, you can get away with one fewer term with Chebyshev, which means faster evaluation.)
Range reduction is a huge win. This is because the contribution of higher order polynomials shrinks down when the interval of the approximation is smaller.
If you can't get away with range reduction, your coefficients need to be stored with more precision.
Don't get me wrong: Taylor series will work properly for sine/cosine (with reasonable precision for the range -pi/2 to +pi/2; technically, with enough terms, you can reach any desired precision for all real inputs, but try to calculate cos(100) using Taylor series and you can't do it unless you use arbitrary-precision arithmetic). If I were stuck on a desert island with a nonscientific calculator, and I needed to calculate sine and cosine, I would probably use Taylor series since the coefficients are easy to remember. But the real world applications for having to write your own sin() or cos() functions are rare enough that you'd be best off using an efficient implementation to reach a desired accuracy -- which the Taylor series is not.
Range = -pi/2 to +pi/2, degree 5 (3 terms)
Taylor: max error around 4.5e-3, f(x) = x - x^3/6 + x^5/120
Chebyshev: max error around 7e-5, f(x) = 0.9996949x - 0.1656700x^3 + 0.0075134x^5

Range = -pi/2 to +pi/2, degree 7 (4 terms)
Taylor: max error around 1.5e-4, f(x) = x - x^3/6 + x^5/120 - x^7/5040
Chebyshev: max error around 6e-7, f(x) = 0.99999660x - 0.16664824x^3 + 0.00830629x^5 - 0.00018363x^7

Range = -pi/4 to +pi/4, degree 3 (2 terms)
Taylor: max error around 2.5e-3, f(x) = x - x^3/6
Chebyshev: max error around 1.5e-4, f(x) = 0.999x - 0.1603x^3

Range = -pi/4 to +pi/4, degree 5 (3 terms)
Taylor: max error around 3.5e-5, f(x) = x - x^3/6 + x^5/120
Chebyshev: max error around 6e-7, f(x) = 0.999995x - 0.1666016x^3 + 0.0081215x^5

Range = -pi/4 to +pi/4, degree 7 (4 terms)
Taylor: max error around 3e-7, f(x) = x - x^3/6 + x^5/120 - x^7/5040
Chebyshev: max error around 1.2e-9, f(x) = 0.999999986x - 0.166666367x^3 + 0.008331584x^5 - 0.000194621x^7
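A quick NumPy check of the degree-5 numbers in the first block above; note that Chebyshev.fit is a least-squares fit in the Chebyshev basis rather than the true minimax polynomial, so its maximum error lands slightly above the quoted ~7e-5:

```python
import numpy as np

# Compare degree-5 Taylor vs Chebyshev approximations of sin on [-pi/2, pi/2].
x = np.linspace(-np.pi / 2, np.pi / 2, 10001)
exact = np.sin(x)

taylor = x - x**3 / 6 + x**5 / 120           # max error ~4.5e-3, as quoted

# Least-squares fit in the Chebyshev basis over the same interval.
cheb = np.polynomial.Chebyshev.fit(x, exact, 5)

print("Taylor    max error:", np.max(np.abs(exact - taylor)))
print("Chebyshev max error:", np.max(np.abs(exact - cheb(x))))
# Orders of magnitude below Taylor; the minimax fit quoted above reaches ~7e-5.
```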
I believe they're calculated using Taylor Series or CORDIC. Some applications which make heavy use of trig functions (games, graphics) construct trig tables when they start up so they can just look up values rather than recalculating them over and over.
Check out the Wikipedia article on trig functions. A good place to learn about actually implementing them in code is Numerical Recipes.
I'm not much of a mathematician, but my understanding of where sin, cos, and tan "come from" is that they are, in a sense, observed when you're working with right-angle triangles. If you take measurements of the lengths of sides of a bunch of different right-angle triangles and plot the points on a graph, you can get sin, cos, and tan out of that. As Harper Shelby points out, the functions are simply defined as properties of right-angle triangles.
A more sophisticated understanding is achieved by understanding how these ratios relate to the geometry of circle, which leads to radians and all of that goodness. It's all there in the Wikipedia entry.
Most commonly on computers, a power series representation is used to calculate sines and cosines, and these are used for the other trig functions. Expanding these series out to about 8 terms computes the values to an accuracy close to machine epsilon (the relative spacing of floating-point numbers, i.e. the smallest eps for which 1 + eps is distinguishable from 1).
The CORDIC method is faster since it is implemented on hardware, but it is primarily used for embedded systems and not standard computers.
I would like to extend the answer provided by @Jason S. Using a domain-subdivision method similar to the one described by @Jason S, together with Maclaurin series approximations, I achieved an average 2-3x speedup over the tan(), sin(), cos(), atan(), asin(), and acos() functions built into the gcc compiler with -O3 optimization. The best Maclaurin series approximating functions described below achieve double-precision accuracy.
For the tan(), sin(), and cos() functions, and for simplicity, an overlapping 0 to 2pi+pi/80 domain was divided into 81 equal intervals with "anchor points" at pi/80, 3pi/80, ..., 161pi/80. Then tan(), sin(), and cos() of these 81 anchor points were evaluated and stored. With the help of trig identities, a single Maclaurin series function was developed for each trig function. Any angle between ±infinity may be submitted to the trig approximating functions because the functions first translate the input angle to the 0 to 2pi domain. This translation overhead is included in the approximation overhead.
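A simplified Python sketch of the anchor-point idea for sin(); the anchor layout and series lengths here are illustrative choices of mine, not the author's actual code, and reach roughly 1e-9 rather than full double precision:

```python
import math

N_ANCHORS = 81
STEP = 2 * math.pi / (N_ANCHORS - 1)   # hypothetical spacing, not the author's
ANCHOR_SIN = [math.sin(i * STEP) for i in range(N_ANCHORS)]
ANCHOR_COS = [math.cos(i * STEP) for i in range(N_ANCHORS)]

def fast_sin(x):
    """sin(x) via a stored anchor plus a short Maclaurin series in the small
    residual h, using sin(a + h) = sin(a)*cos(h) + cos(a)*sin(h)."""
    x = x % (2 * math.pi)                 # translate to the base interval
    i = int(round(x / STEP))
    h = x - i * STEP                      # |h| <= STEP/2, so the series is short
    sin_h = h - h**3 / 6                  # 2-term Maclaurin series
    cos_h = 1 - h**2 / 2 + h**4 / 24      # 3-term Maclaurin series
    return ANCHOR_SIN[i] * cos_h + ANCHOR_COS[i] * sin_h

print(fast_sin(12.345), math.sin(12.345))
```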
Similar methods were developed for the atan(), asin(), and acos() functions, where an overlapping -1.0 to 1.1 domain was divided into 21 equal intervals with anchor points at -19/20, -17/20, ..., 19/20, 21/20. Then only atan() of these 21 anchor points was stored. Again, with the help of inverse trig identities, a single Maclaurin series function was developed for the atan() function. Results of the atan() function were then used to approximate asin() and acos().
Since all inverse trig approximating functions are based on the atan() approximating function, any double-precision argument input value is allowed. However the argument input to the asin() and acos() approximating functions is truncated to the ±1 domain because any value outside it is meaningless.
To test the approximating functions, a billion random function evaluations were forced to be evaluated (that is, the -O3 optimizing compiler was not allowed to bypass evaluating something because some computed result would not be used.) To remove the bias of evaluating a billion random numbers and processing the results, the cost of a run without evaluating any trig or inverse trig function was performed first. This bias was then subtracted off each test to obtain a more representative approximation of actual function evaluation time.
Table 2. Time spent, in seconds, executing the indicated function or functions one billion times, over two separate runs. The estimates are obtained by subtracting the time cost of evaluating one billion random numbers (the first row of Table 1) from the remaining rows in Table 1.

Function             Run 1 (s)   Run 2 (s)
tan()                18.0515     18.2545
TAN3()                5.93853     6.02349
TAN4()                6.72216     6.99134
sin() and cos()      19.4052     19.4311
SINCOS3()             7.85564     7.92844
SINCOS4()             9.36672     9.57946
atan()               15.7160     15.6599
ATAN1()               6.47800     6.55230
ATAN2()               7.26730     7.24885
ATAN3()               8.15299     8.21284
asin() and acos()    36.8833     36.9496
ASINCOS1()           10.1655      9.78479
ASINCOS2()           10.6236     10.6000
ASINCOS3()           12.8430     12.0707
(In the interest of saving space, Table 1 is not shown.) Table 2 shows the results of two separate runs of a billion evaluations of each approximating function; the two timing columns correspond to the first and second run. The numbers '1', '2', '3' or '4' in the function names indicate the number of terms used in the Maclaurin series to evaluate the particular trig or inverse-trig approximation. SINCOS#() means that both sin and cos were evaluated at the same time. Likewise, ASINCOS#() means both asin and acos were evaluated at the same time. There is little extra overhead in evaluating both quantities at the same time.
The results show that increasing the number of terms slightly increases execution time as would be expected. Even the smallest number of terms gave around 12-14 digit accuracy everywhere except for the tan() approximation near where its value approaches ±infinity. One would expect even the tan() function to have problems there.
Similar results were obtained on a high-end MacBook Pro laptop in Unix and on a high-end desktop computer in Linux.
If you're asking for a more physical explanation of sin, cos, and tan, consider how they relate to right-angle triangles. The actual numeric value of cos(lambda) can be found by forming a right-angle triangle with one of the angles being lambda and dividing the length of the triangle's side adjacent to lambda by the length of the hypotenuse. Similarly, for sin use the opposite side divided by the hypotenuse. For tangent use the opposite side divided by the adjacent side. The classic mnemonic for remembering this is SOHCAHTOA (pronounced "so-cah-toa").
