Minimizing the effect of rounding errors caused by repeated operations effectively - algorithm

I just recently came across the Kahan (or compensated) summation algorithm for minimizing roundoff, and I'd like to know if there are equivalent algorithms for division and/or multiplication, as well as subtraction (if there happens to be one, I know about associativity). Implementation examples in any language, pseudo-code or links would be great!
Thanks

Subtraction is usually handled via the Kahan method.
For multiplication, there are algorithms to convert a product of two floating-point numbers into a sum of two floating-point numbers without rounding, at which point you can use Kahan summation or some other method, depending on what you need to do next with the product.
If you have FMA (fused multiply-add) available, this can easily be accomplished as follows:
p = a*b;
r = fma(a,b,-p);
After these two operations, if no overflow or underflow occurs, p + r is exactly equal to a * b without rounding. This can also be accomplished without FMA, but it is rather more difficult. If you're interested in these algorithms, you might start by downloading the crlibm documentation, which details several of them.
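For readers without FMA, here is a minimal sketch of the non-FMA route mentioned above, assuming IEEE 754 double precision with round-to-nearest and no overflow or underflow. It uses the classic Veltkamp split and Dekker product; the function names (split, two_product, is_exact) are made up for illustration and are not taken from crlibm.

```python
from fractions import Fraction

def split(a):
    """Veltkamp split: return (hi, lo) with a == hi + lo exactly,
    where each part is short enough that products of parts are exact."""
    c = 134217729.0 * a          # 2**27 + 1, the splitter for 53-bit doubles
    hi = c - (c - a)
    lo = a - hi
    return hi, lo

def two_product(a, b):
    """Dekker product: return (p, r) with p = fl(a*b) and p + r == a*b."""
    p = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    # Accumulate the partial products against p, largest terms first.
    r = ((a_hi * b_hi - p) + a_hi * b_lo + a_lo * b_hi) + a_lo * b_lo
    return p, r

def is_exact(a, b):
    """Check p + r == a * b using exact rational arithmetic."""
    p, r = two_product(a, b)
    return Fraction(p) + Fraction(r) == Fraction(a) * Fraction(b)
```

The pair (p, r) plays the same role as the p and r computed with fma above: p is the rounded product and r is the rounding error, so the two can be fed into a compensated sum.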
Division... well, division is best avoided. Division is slow, and compensated division is even slower. You can do it, but it's brutally hard without FMA, and non-trivial with it. Better to design your algorithms to avoid it as much as possible.
Note that all of this becomes a losing battle pretty quickly. There's a very narrow band of situations where these tricks are beneficial--for anything more complicated, it's much better to just use a wider-precision floating point library like mpfr. Unless you're an expert in the field (or want to become one), it's usually best to just learn to use such a library.

Designing algorithms to be numerically stable is an academic discipline and field of research in its own right. It's not something you can do (or learn) meaningfully via "cheat sheets" - it requires specific mathematical knowledge and needs to be done for each specific algorithm. If you want to learn how to do this, the reference in the Wikipedia article sounds pretty good: Nicholas J. Higham, Accuracy and Stability of Numerical Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, 1996. ISBN 0-89871-355-2.
A relatively simple way to diagnose the stability of an algorithm is to use interval arithmetic.
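As a rough illustration of the interval-arithmetic idea, the sketch below tracks a lower and upper bound through each operation and widens the result by one floating-point step on each side, so the true value always stays inside. The Interval class is hypothetical and deliberately tiny (a real interval library would use directed rounding); it assumes Python 3.9+ for math.nextafter.

```python
import math
from fractions import Fraction

class Interval:
    """Closed interval [lo, hi] that is guaranteed to contain the true value."""

    def __init__(self, lo, hi=None):
        self.lo = lo
        self.hi = lo if hi is None else hi

    @staticmethod
    def _widen(lo, hi):
        # Each rounded result is within one ulp of the exact one, so stepping
        # one float outward on both sides keeps the exact value enclosed.
        return Interval(math.nextafter(lo, -math.inf),
                        math.nextafter(hi, math.inf))

    def __add__(self, other):
        return Interval._widen(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        return Interval._widen(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval._widen(min(products), max(products))

    def width(self):
        return self.hi - self.lo

    def contains(self, exact):
        """True if the exact rational value lies inside the interval."""
        return Fraction(self.lo) <= exact <= Fraction(self.hi)

def repeated_sum(x, count):
    """Add the float x to itself count times using interval arithmetic."""
    total = Interval(0.0)
    step = Interval(x)
    for _ in range(count):
        total = total + step
    return total
```

If the width of the final interval is large relative to the result, the computation is a candidate for closer stability analysis.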

You could use bignums and rational fractions rather than floating point numbers, in which case you are limited only by the finite availability of memory to hold the required precision.
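A small sketch of what that looks like in practice, using Python's standard fractions module as one possible rational type (the helper name repeated_add is made up for this example). Adding one tenth ten times is exact with rationals and slightly off with binary floating point.

```python
from fractions import Fraction

def repeated_add(step, count):
    """Add step to a zero of the same type count times, one addition at a time."""
    total = step - step   # zero of the same numeric type as step
    for _ in range(count):
        total += step
    return total

exact_total = repeated_add(Fraction(1, 10), 10)   # rational arithmetic
float_total = repeated_add(0.1, 10)               # binary floating point
```

The trade-off is speed and memory: numerators and denominators can grow without bound over a long computation.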

Related

Unexpectedly low error for a numerical integrator with certain equations of motion

I have an RKF7(8) integrator whose output I've verified with several simple test functions. However, when I use it on the equation of motion I'm interested in, the local truncation errors are suddenly very small. For a timestep of around 1e-1, my errors are all around 1e-18 or 1e-19. For the simple test functions (so far, sine and an exponential), the errors are always reasonable, i.e. 1e-7 or so for the same timestep.
The only difference between the simple test functions and the problem function is that it's a huge chunk of code, like maybe 1000 different terms, some with relatively large exponents (like 9 or 10). Could that possibly affect the precision? Should I change my code to use long doubles?
Very interesting question. The problem you are facing might be related to issues (or limitations) of the floating point arithmetic. Since your function contains coefficients in a wide numerical interval, it is likely that you have some loss of precision in your calculations. In general, these problems can come in the form of:
Overflow
Underflow
Multiplication and division
Adding numbers of very different magnitudes
Subtracting numbers of similar magnitudes
Overflow and underflow occur when the numbers you are dealing with are too large or too small in magnitude for the range the floating-point format can represent, and it would be my bet that this is not what happens in your system. Nevertheless, one must take into account that multiplication and division operations can lead to overflow and underflow. On the other hand, adding numbers of very different magnitudes (or subtracting numbers of similar magnitudes) can lead to severe loss of precision due to roundoff errors. From my experience in optimization problems that involve large and small numbers, I would say this could be a reasonable explanation of the behavior of your integrator.
I have two suggestions for you. The first one is of course increasing the precision of your numbers to the maximum available. This might help or not depending on how ill-conditioned your problem is. The second one is to use a better algorithm to perform the sums in your numerical method. In contrast to the naive addition of all numbers sequentially, you could use a more elaborate strategy that reduces roundoff errors, either by dividing your sums into sub-sums or by carrying a running correction term. Notable examples of these algorithms are pairwise summation and Kahan summation, respectively.
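To make the second suggestion concrete, here is a minimal sketch of both summation strategies. The function names and the block size of 8 are arbitrary choices for this example. The naive reference should be written as an explicit loop, because Python's built-in sum may itself use compensated summation for floats in recent versions.

```python
import math

def kahan_sum(values):
    """Kahan (compensated) summation: carry the lost low-order bits along."""
    total = 0.0
    compensation = 0.0                    # running estimate of the lost part
    for x in values:
        y = x - compensation              # apply the correction to the new term
        t = total + y                     # low-order bits of y may be lost here
        compensation = (t - total) - y    # recover what was just lost
        total = t
    return total

def pairwise_sum(values, lo=0, hi=None, block=8):
    """Pairwise summation: split the range in half and add the two sub-sums."""
    if hi is None:
        hi = len(values)
    if hi - lo <= block:                  # small ranges are summed naively
        total = 0.0
        for i in range(lo, hi):
            total += values[i]
        return total
    mid = (lo + hi) // 2
    return (pairwise_sum(values, lo, mid, block)
            + pairwise_sum(values, mid, hi, block))

def naive_sum(values):
    """Plain left-to-right accumulation, for comparison."""
    total = 0.0
    for x in values:
        total += x
    return total
```

A quick way to compare them is to sum a long list such as [0.1] * 1_000_000 and measure each result against math.fsum, which returns the correctly rounded sum.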
I hope this answer offers you some clues. Good luck!

Floating point algorithms with potential for performance optimization

For a university lecture I am looking for floating point algorithms with known asymptotic runtime, but potential for low-level (micro-)optimization. This means optimizations such as minimizing cache misses and register spillages, maximizing instruction level parallelism and taking advantage of SIMD (vector) instructions on new CPUs. The optimizations are going to be CPU-specific and will make use of applicable instruction set extensions.
The classic textbook example for this is matrix multiplication, where great speedups can be achieved by simply reordering the sequence of memory accesses (among other tricks). Another example is FFT. Unfortunately, I am not allowed to choose either of these.
Anyone have any ideas, or an algorithm/method that could use a boost?
I am only interested in algorithms where a per-thread speedup is conceivable. Parallelizing problems by multi-threading them is fine, but not the scope of this lecture.
Edit 1: I am taking the course, not teaching it. In the past years, there were quite a few projects that succeeded in surpassing the current best implementations in terms of performance.
Edit 2: This paper lists (from page 11 onwards) seven classes of important numerical methods and some associated algorithms that use them. At least some of the mentioned algorithms are candidates, it is however difficult to see which.
Edit 3: Thank you everyone for your great suggestions! We proposed to implement the exposure fusion algorithm (paper from 2007) and our proposal was accepted. The algorithm creates HDR-like images and consists mainly of small kernel convolutions followed by weighted multiresolution blending (on the Laplacian pyramid) of the source images. Interesting for us is the fact that the algorithm is already implemented in the widely used Enfuse tool, which is now at version 4.1. So we will be able to validate and compare our results with the original and also potentially contribute to the development of the tool itself. I will update this post in the future with the results if I can.
The simplest possible example:
Accumulation of a sum. Unrolling with multiple accumulators and vectorization allow a speedup of (ADD latency)*(SIMD vector width) on typical pipelined architectures (if the data is in cache; because there's no data reuse, it typically won't help if you're reading from memory), which can easily be an order of magnitude. Cute thing to note: this also decreases the average error of the result! The same techniques apply to any similar reduction operation.
A few classics from image/signal processing:
convolution with small kernels (especially small 2d convolves like a 3x3 or 5x5 kernel). In some sense this is cheating, because convolution is matrix multiplication, and is intimately related to the FFT, but in reality the nitty-gritty algorithmic techniques of high-performance small kernel convolutions are quite different from either.
erode and dilate.
what image people call a "gamma correction"; this is really evaluation of an exponential function (maybe with a piecewise linear segment near zero). Here you can take advantage of the fact that image data is often entirely in a nice bounded range like [0,1] and sub-ulp accuracy is rarely needed to use much cheaper function approximations (low-order piecewise minimax polynomials are common).
Stephen Canon's image processing examples would each make for instructive projects. Taking a different tack, though, you might look at certain amenable geometry problems:
Closest pair of points in moderately high dimension---say 50000 or so points in 16 or so dimensions. This may have too much in common with matrix multiplication for your purposes. (Take the dimension too much higher and dimensionality reduction silliness starts mattering; much lower and spatial data structures dominate. Brute force, or something simple using a brute-force kernel, is what I would want to use for this.)
Variation: For each point, find the closest neighbour.
Variation: Red points and blue points; find the closest red point to each blue point.
Welzl's smallest containing circle algorithm is fairly straightforward to implement, and the really costly step (check for points outside the current circle) is amenable to vectorisation. (I suspect you can kill it in two dimensions with just a little effort.)
Be warned that computational geometry stuff is usually more annoying to implement than it looks at first; don't just grab a random paper without understanding what degenerate cases exist and how careful your programming needs to be.
Have a look at other linear algebra problems, too. They're also hugely important. Dense Cholesky factorisation is a natural thing to look at here (much more so than LU factorisation) since you don't need to mess around with pivoting to make it work.
There is a free benchmark called c-ray.
It is a small ray-tracer for spheres designed to be a benchmark for floating-point performance.
A few random stackshots show that it spends nearly all its time in a function called ray_sphere that determines if a ray intersects a sphere and if so, where.
They also show some opportunities for larger speedup, such as:
It does a linear search through all the spheres in the scene to try to find the nearest intersection. That represents a possible area for speedup, by doing a quick test to see if a sphere is farther away than the best seen so far, before doing all the 3-d geometry math.
It does not try to exploit similarity from one pixel to the next. This could gain a huge speedup.
So if all you want to look at is chip-level performance, it could be a decent example.
However, it also shows how there can be much bigger opportunities.
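The quick-rejection idea from the list above can be sketched as follows. This is not c-ray's actual code: the function names, the tuple-based sphere representation, and the assumption that the ray direction has unit length are all choices made for this illustration.

```python
import math

def ray_sphere(origin, direction, center, radius):
    """Distance along a unit-length ray to the nearest hit, or None on a miss."""
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(d * v for d, v in zip(direction, oc))
    c = sum(v * v for v in oc) - radius * radius
    disc = b * b - c                      # discriminant of t^2 + 2bt + c = 0
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    t = -b - root                         # nearer intersection
    if t < 0.0:
        t = -b + root                     # ray starts inside the sphere
    return t if t >= 0.0 else None

def nearest_hit(origin, direction, spheres):
    """Linear search with a cheap test that skips spheres too far away."""
    best_t, best_index = math.inf, None
    for index, (center, radius) in enumerate(spheres):
        dist_sq = sum((o - c) ** 2 for o, c in zip(origin, center))
        reach = best_t + radius
        # Any hit is at least (distance to center - radius) away, so a sphere
        # with distance >= best_t + radius cannot beat the current best.
        if dist_sq >= reach * reach:
            continue
        t = ray_sphere(origin, direction, center, radius)
        if t is not None and t < best_t:
            best_t, best_index = t, index
    return best_index, best_t
```

The rejection test only pays off once a reasonably close hit has been found, so sorting or bucketing spheres by distance would be the natural next step.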

What's faster for 3D? Perlin or Simplex noise?

Okay, there are a lot of comparisons between Perlin and Simplex noise to be found on the web. But I really couldn't find one where there was a simple processing time comparison between both for three dimensions, which is what I am mostly interested in. I've read that popular PDF (and even understood most of it - yay!) but I cannot answer the simple question: Which one is faster for 3D, assuming an optimal implementation?
This stackoverflow question answer suggests that Simplex is a pretty clear winner for my case. Of course, there are other resources claiming the exact opposite.
However, the general statement seems to be that Perlin noise has a complexity of O(2^N), while Simplex has O(N^2). Which for 3D would mean 8 for Perlin and 9 for Simplex. But, on some site I found the statement that Simplex is actually O(N). So what is true here, and what does that really mean for speed in 3D?
I am at a loss here, I'm really mainly interested in 3D application (for random terrain generation including caves) usage, and I cannot find a good answer to the question which one I should use if I want it to be as fast as possible.
So maybe someone can help me here :)
1) http://www.fundza.com/c4serious/noise/perlin/perlin.html
2) http://www.6by9.net/b/2012/02/03/simplex-noise-for-c-and-python
Execution times on my laptop for 8M samples of noise using these two implementations:
(g++ -O6)
1) 1.389s i.e. 5.7M ops per second
2) 0.607s i.e. 13.2M ops per second
But...
When really, really going for the optimizations, one should study
Higher level optimizations (what really is done in each stage: are there alternatives?)
Branches
Memory patterns
Dependencies
LUT sizes
Individual arithmetic operations needed, their latencies and throughputs
exploitable parallelisms using SIMD
number of live variables
Simplex noise is better looking, but not necessarily faster. It all depends on the implementation. As a rule of thumb, it is "about the same speed", and there shouldn't be a big penalty from using either variant if your code is good.
Note that most of the code I have written that is floating around on the Internet is not optimized for speed, but written for clarity. The GLSL implementations by Ian McEwan and myself from a couple of years ago are reasonably optimized for speed, but they were optimized for hardware which is now outdated, and the versions of GLSL that were current at the time. Important changes to GLSL since then include integer types and bit-wise logical operations, which make some of the hash functions look awkward and unnecessarily complicated. The need for a permutation polynomial was motivated by the lack of bit-wise logic operators in GLSL. They are still lacking in GLSL for WebGL, but all other platforms now have integer support.
Simplex noise in 4D is mostly faster than classic noise in 4D. All other cases depend on the language, the platform and the amount of code optimization.
Simplex noise has a simple analytic derivative. Classic noise is more tricky in that respect. In many cases, like antialiasing and terrain mapping, an analytic derivative is very useful.

Performance Testing for Calculation-Heavy Programs

What are some good tips and/or techniques for optimizing and improving the performance of calculation-heavy programs? I'm talking about things like complicated graphics calculations or mathematical and simulation types of programming where every second saved is useful, as opposed to IO-heavy programs where only a certain amount of speedup is helpful.
While changing the algorithm is frequently mentioned as the most effective method here, I'm trying to find out how effective different algorithms are in the first place, so I want to create as much efficiency with each algorithm as is possible. The "problem" I'm solving isn't something that's well known, so there are few if any algorithms on the web, but I'm looking for any good advice on how to proceed and what to look for.
I am exploring the differences in effectiveness between evolutionary algorithms and more straightforward approaches for a particular group of related problems. I have written three evolutionary algorithms for the problem already and now I have written a brute-force technique that I am trying to make as fast as possible.
Edit: To specify a bit more. I am using C# and my algorithms all revolve around calculating and solving constraint type problems for expressions (using expression trees). By expressions I mean things like x^2 + 4 or anything else like that which would be parsed into an expression tree. My algorithms all create and manipulate these trees to try to find better approximations. But I wanted to put the question out there in a general way in case it would help anyone else.
I am trying to find out if it is possible to write a useful evolutionary algorithm for finding expressions that are a good approximation for various properties. Both because I want to know what a good approximation would be and to see how the evolutionary stuff compares to traditional methods.
It's pretty much the same process as any other optimization: profile, experiment, benchmark, repeat.
First you have to figure out what sections of your code are taking up the time. Then try different methods to speed them up (trying methods based on merit would be a better idea than trying things at random). Benchmark to find out if you actually did speed them up. If you did, replace the old method with the new one. Profile again.
I would recommend against a brute force approach if it's at all possible to do it some other way. But, here are some guidelines that should help you speed your code up either way.
There are many, many different optimizations you could apply to your code, but before you do anything, you should profile to figure out where the bottleneck is. Here are some profilers that should give you a good idea about where the hot spots are in your code:
GProf
PerfMon2
OProfile
HPCToolkit
These all use sampling to get their data, so the overhead of running them with your code should be minimal. Only GProf requires that you recompile your code. Also, the last three let you do both time and hardware performance counter profiles, so once you do a time (or CPU cycle) profile, you can zoom in on the hotter regions and find out why they might be running slow (cache misses, FP instruction counts, etc.).
Beyond that, it's a matter of thinking about how best to restructure your code, and this depends on what the problem is. It may be that you've just got a loop that the compiler doesn't optimize well, and you can inline or move things in/out of the loop to help the compiler out. Or, if you're running as fast as you can with basic arithmetic ops, you may want to try to exploit vector instructions (SSE, etc.). If your code is parallel, you might have load balance problems, and you may need to restructure your code so that data is better distributed across cores.
These are just a few examples. Performance optimization is complex, and it might not help you nearly enough if you're doing a brute force approach to begin with.
For more information on ways people have optimized things, there were some pretty good examples in the recent Why do you program in assembly? question.
If your optimization problem is (quasi-)convex or can be transformed into such a form, there are far more efficient algorithms than evolutionary search.
If you have large matrices, pay attention to your linear algebra routines. The right algorithm can shave an order of magnitude off the computation time, especially if your matrices are sparse.
Think about how data is loaded into memory. Even when you think you're spending most of your time on pure arithmetic, you're actually spending a lot of time moving things between levels of cache etc. Do as much as you can with the data while it's in the fastest memory.
Try to avoid unnecessary memory allocation and de-allocation. Here's where it can make sense to back away from a purely OO approach.
This is more of a tip to find holes in the algorithm itself...
To realize maximum performance, simplify everything inside the innermost loop at the expense of everything else.
One example of keeping things simple is the classic bouncing ball animation. You can implement gravity by looking up the definition in your physics book and plugging in the numbers, or you can do it like this and save precious clock cycles:
initialize:
    float y = 0;  // y coordinate
    float yi = 0; // incremental variable
loop:
    y += yi;
    yi += 0.001;
    if (y > 10)
        yi = -yi;
But now let's say you're having to do this with nested loops in an N-body simulation where every particle is attracted to every other particle. This can be an enormously processor intensive task when you're dealing with thousands of particles.
You should of course take the same approach and simplify everything inside the innermost loop. But more than that, at the very simplest level you should also use data types wisely. For example, math operations are often faster when working with integers than with floating-point variables, depending on the hardware. Also, addition is generally faster than multiplication, and multiplication is faster than division.
So with all of that in mind, you should be able to simplify the innermost loop using primarily addition and multiplication of integers. And then any scaling down you might need to do can be done afterwards. To take the y and yi example, if yi is an integer that you modify inside the inner loop then you could scale it down after the loop like this:
y += yi * 0.01;
These are very basic low-level performance tips, but they're all things I try to keep in mind whenever I'm working with processor intensive algorithms. Of course, if you then take these ideas and apply them to parallel processing on a GPU then you can take your algorithm to a whole new level. =)
Well, how you do this depends most on which language you are using. Still, the key in any language is the profiler. Profile your code. See which functions/operations are taking the most time and then determine if you can make these costly operations more efficient.
Standard bottlenecks in numerical algorithms are memory usage (do you access matrices in the order in which the elements are stored in memory?), communication overhead, etc. They can be a little different from those of non-numerical programs.
Moreover, many other factors such as preconditioning, etc. can lead to drastically different performance behavior of the SAME algorithm on the same problem. Make sure you determine optimal parameters for your implementations.
As for comparing different algorithms, I recommend reading the paper "Benchmarking optimization software with performance profiles," Jorge Moré and Elizabeth D. Dolan, Mathematical Programming 91 (2002), 201-213. It provides a nice, uniform way to compare different algorithms being applied to the same problem set. It really should be better known outside of the optimization community (in my not so humble opinion at least).
Good luck!

Efficiency/speed for trigonometric functions

In a game I'm making, I've got two points, pt1 and pt2, and I want to work out the angle between them. I've already worked out the distance, in an earlier calculation. The obvious way would be to arctan the horizontal distance over the vertical distance (tan(theta) = opp/adj).
I'm wondering though, as I've already calculated the distance, would it be quicker to use arcsine/arccosine with the distance and dx or dy?
Also, might I be better off pre-calculating in a table?
I suspect there's a risk of premature optimization here. Also, be careful about your geometry. Your opposite/adjacent approach is a property of right angle triangles, is that what you actually have?
I'm assuming your points are planar, and so for the general case you have them implicitly representing two vectors from the origin (call these v1 and v2), so your angle is
theta=arccos(dot(v1,v2)/(|v1||v2|)) where |.| is vector length.
Making this faster (assuming the need) will depend on a lot of things. Do you know the vector lengths, or have to compute them? How fast can you do a dot product in your architecture. How fast is acos? At some point tricks like table lookup (probably interpolated) might help but that will cost you accuracy.
It's all trade-offs though, there really isn't a general answer to your question.
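For reference, the two readings of "the angle" discussed above can be sketched like this. Both function names are made up for this example, and points are assumed to be 2D (x, y) tuples. The clamp guards against a rounded cosine landing just outside [-1, 1].

```python
import math

def angle_between(v1, v2):
    """Angle in radians between two 2D vectors from the origin, in [0, pi]."""
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norms = math.hypot(v1[0], v1[1]) * math.hypot(v2[0], v2[1])
    cos_theta = max(-1.0, min(1.0, dot / norms))   # clamp rounding overshoot
    return math.acos(cos_theta)

def direction_angle(pt1, pt2):
    """Heading in radians of the line from pt1 to pt2, in (-pi, pi]."""
    return math.atan2(pt2[1] - pt1[1], pt2[0] - pt1[0])
```

Note that angle_between is undefined (division by zero) if either vector has zero length, so a caller would need to handle that case.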
[edit: added commentary]
I'd like to re-emphasize that often playing "x is fastest" is a bit of a mug's game with modern CPUs and compilers anyway. You won't know until you measure it and grovel through the generated code. When you hit the point that you really care about it at this level for a (hopefully small) piece of code, you can find out in detail what your system is doing. But it's painstaking. Maybe a table is good. But maybe you've got fast vector computations and a small cache. etc. etc. etc. It all amounts to "it depends". Sorry 'bout that. On the other hand, if you haven't reached the point that you really care so much about this bit of code... you probably shouldn't be thinking about it at this level at all. Make it right. Make it clean (which means abstraction as well as code). Then worry about the overhead.
Aside from all of the wise comments regarding premature optimization, let's just assume this is the hotspot and do a frigg'n benchmark:
Times are in nanoseconds, scaled to normalize 'acos' between the systems.
'acos' simply assumes unit radius i.e. acos(adj), whereas 'acos+div' means acos(adj/hyp).
System 1 is a 2.4GHz i5 running Mac OS X 10.6.4 (gcc 4.2.1)
System 2 is a 2.83GHz Core2 Quad running Red Hat 7 Linux 2.6.28 (gcc 4.1.2)
System 3 is a 1.66GHz Atom N280 running Ubuntu 10.04 2.6.32 (gcc 4.4.3)
System 4 is a 2.40GHz Pentium 4 running Ubuntu 10.04 2.6.32 (gcc 4.4.3)
Summary: Relative performance is all over the map. Sometimes atan2 is faster, sometimes it's slower. Very strangely, on some systems doing acos with a division is faster than doing it without. Test on your own system :-/
If you're going to be doing this many times, pre-calculate in a table. Performance will be much better this way.
Tons of good answers here.
By the way, if you use Math.atan2, you get a full 2π of angles out of it.
I would just do it, then run it flat out. If you don't like the speed, and if samples show that you're actually in that code most of the time and not someplace else,
try replacing it with table lookup. If you don't need precision closer than 1 degree, you could use a pretty small table and interpolation.
Also, you may want to memoize the function. Why recompute something you already did recently?
Added: If you use a table, it only has to cover angles from 0-45 degrees (and it can be hard-coded). You can get everything else by symmetry.
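A minimal sketch of the table-plus-symmetry idea, assuming a hypothetical table of 257 entries covering ratios from 0 to 1 (that is, angles from 0 to 45 degrees) with linear interpolation between entries. The names TABLE_SIZE, ATAN_TABLE, atan_unit and fast_atan2 are made up for this example, and whether it is actually faster than the library call has to be measured.

```python
import math

TABLE_SIZE = 256   # hypothetical resolution; larger means more accurate
ATAN_TABLE = [math.atan(i / TABLE_SIZE) for i in range(TABLE_SIZE + 1)]

def atan_unit(t):
    """Approximate atan(t) for 0 <= t <= 1 by interpolating in the table."""
    pos = t * TABLE_SIZE
    i = min(int(pos), TABLE_SIZE - 1)     # keep i + 1 inside the table
    frac = pos - i
    return ATAN_TABLE[i] + frac * (ATAN_TABLE[i + 1] - ATAN_TABLE[i])

def fast_atan2(y, x):
    """Approximate atan2(y, x) using only the 0 to 45 degree table."""
    if x == 0.0 and y == 0.0:
        return 0.0
    ax, ay = abs(x), abs(y)
    if ay <= ax:
        angle = atan_unit(ay / ax)                  # first octant
    else:
        angle = math.pi / 2 - atan_unit(ax / ay)    # mirror about 45 degrees
    if x < 0.0:
        angle = math.pi - angle                     # mirror into left half-plane
    return -angle if y < 0.0 else angle             # mirror below the x axis
```

With this table size the interpolation error is on the order of a millionth of a radian, far finer than the 1 degree mentioned above, so a much smaller table would do for that requirement.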
From a pure speed standpoint, a precalculated table and a closest-match lookup would be best. It involves some overhead, of course, depending on how fine-grained you need the angle to be, but it's more than worth it if you're doing this calculation a lot (or in a tight loop), as those are going to be expensive calculations.
Get it right first!
And then profile and optimize. Table lookup is a good candidate for sure, but be sure to have your calculation right before doing anything fancy.
If you're interested in big-O notation, all the methods you might use are O(1).
If you're interested in what works fastest, test it. Write a wrapper function, one that calls your preferred method but can be easily changed, and test with that. Make sure that your application spends a noticeable amount of time doing this, so you aren't wasting your own time. Try whatever ways occur to you. Ideally, run it on more than one different CPU.
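The wrapper-and-test approach described above might look something like the sketch below, using Python's standard timeit module. The candidate functions and the benchmark helper are hypothetical names for this example, and the acos variant only covers angles in [0, pi], so it is not a drop-in replacement for atan2.

```python
import math
import timeit

def angle_atan2(dx, dy, dist):
    """Candidate 1: ignore the precomputed distance and use atan2."""
    return math.atan2(dy, dx)

def angle_acos(dx, dy, dist):
    """Candidate 2: reuse the precomputed distance (result is in [0, pi])."""
    return math.acos(max(-1.0, min(1.0, dx / dist)))

def benchmark(fn, samples, repeat=5, number=10):
    """Best observed time per call, in seconds, over the given samples."""
    def run():
        for dx, dy, dist in samples:
            fn(dx, dy, dist)
    best = min(timeit.repeat(run, repeat=repeat, number=number))
    return best / (number * len(samples))

def make_samples(count):
    """Deterministic (dx, dy, dist) triples spread around the upper half-plane."""
    samples = []
    for i in range(count):
        ang = math.pi * (i + 0.5) / count
        dx, dy = 3.0 * math.cos(ang), 3.0 * math.sin(ang)
        samples.append((dx, dy, math.hypot(dx, dy)))
    return samples
```

Calling benchmark(angle_atan2, make_samples(1000)) and benchmark(angle_acos, make_samples(1000)) on each target machine gives numbers that can be compared directly; the absolute values will differ from system to system.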
I've become very leery of predicting what will take more or less time on modern processors. Lookup tables used to be the answer if you needed speed, but you don't know a priori the effects on caching or how long it's going to take to normalize and look up versus how long it's going to take to do a trig function on a particular CPU.
Given that this is for a game, you probably care about speed. A lookup table is definitely the fastest but you trade accuracy for speed with this method. So how accurate must you be to meet requirements? Only you can answer that. Before you trade accuracy, determine first if you have a speed problem. All of the trigonometric functions are calculated using numerical methods (research numerical analysis to learn more). Some trig functions have more expensive methods than others because they rely on series that converge more slowly and, who knows, your computer may have different implementations for these functions than another computer. At any rate, you can find out for yourself how expensive these functions are by writing some small programs that loop through as many iterations as you desire, with increments of your choosing, all the while timing the outcomes. Then you can pick the fastest method.
While others are very right to mention that you are almost certainly falling into the pit of premature optimization, when they say that trigonometric functions are O(1) they're not telling the whole story.
Most trigonometric function implementations are actually O(N) in the magnitude of the input value. This is because the trig functions are most efficiently calculated on a small interval like [0, 2π) (or, for the best implementations, even smaller parts of this interval, but that one suffices to explain things). So the algorithm looks something like this, in pseudo-Python:
def Cosine_0to2Pi(x):
    # a series approximation of some kind, or CORDIC, or perhaps a table
    # this function requires 0 <= x < 2Pi
def MyCosine(x):
    if x < 0:
        x = -x
    while x >= TwoPi:
        x -= TwoPi
    return Cosine_0to2Pi(x)
Even microcoded CPU instructions like the x87's FSINCOS end up doing something like this internally. So trig functions, because they are periodic, usually take O(N) time to do the argument reduction. There are two caveats, however:
If you have to calculate a ton of values off the principal domain of the trig functions, your math is probably not very well thought out.
Big-O notation hides a constant factor. Argument reduction has a very small constant factor, because it's simple to do. Thus the O(1) part is going to dominate the O(N) part for just about every input.
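A runnable version of the pseudo-Python above, as a sketch only: the core approximation is replaced by a stand-in that simply calls math.cos, and reduce_argument (a name invented here) also returns the number of subtractions so the growth with the size of the input can be observed directly.

```python
import math

TWO_PI = 2.0 * math.pi

def cosine_0_to_2pi(x):
    """Stand-in for the core approximation; requires 0 <= x < 2*pi."""
    assert 0.0 <= x < TWO_PI
    return math.cos(x)

def reduce_argument(x):
    """Subtract 2*pi until x is in [0, 2*pi); return (reduced x, step count)."""
    x = abs(x)                  # cosine is an even function
    steps = 0
    while x >= TWO_PI:
        x -= TWO_PI
        steps += 1
    return x, steps

def my_cosine(x):
    """Cosine via loop-based argument reduction, as in the pseudocode."""
    reduced, _ = reduce_argument(x)
    return cosine_0_to_2pi(reduced)
```

The step count grows in proportion to the input (about x divided by 2π), which is the O(N) behaviour being described; math.fmod(x, TWO_PI) performs the same reduction in a single call.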

Resources