I am using MATLAB to run a for loop in which variable-length portions of a large vector are updated at each iteration with the content of another vector; something like:
for k=1:N
vec1(idx_start1(k):idx_end1(k)) = vec1(idx_start1(k):idx_end1(k)) +...
a(k)*vec2(idx_start2(k):idx_end2(k));
end
The selected portions of vec1 and vec2 are not so small and N can be quite large; moreover, if this can be useful, idx_end(k)<idx_start(k+1) does not necessarily hold (i.e. vec1's edited portions may be partially re-updated in subsequent iterations). As a consequence, the above is by far the slowest portion of code in my script and I would like to speed it up, if possible.
Is there any way to vectorize the above for loop in order to make it run faster? Or, are there any alternative approaches to improve its execution speed?
EDIT:
As requested in the comments, here are some example values: Using the profiler to check execution times, the loop above runs in about 3.3 s with N=5e4, length(vec1)=3e6, length(vec2)=1.7e3 and the portions indexed by idx_start/end are slightly shorter on average than the latter, although not significantly.
Of course, 3.3 s is not particularly worrying in itself, but I would like to increase N and length(vec1) in particular by one or two orders of magnitude, at which point such a loop will take considerably longer to run.
Sorry, I couldn't find a way to speed up your code. This is the code I created to try to speed it up:
N = 5e4;
vec1 = 1:3e6;
vec2 = 1:1.7e3;
rng(0)
a = randn(N, 1);
idx_start1 = randi([1, 2.9e6], N, 1);
idx_end1 = idx_start1 + 1000;
idx_start2 = randi([1, 0.6e3], N, 1);
idx_end2 = idx_start2 + 1000;
for k=1:N
vec1(idx_start1(k):idx_end1(k)) = vec1(idx_start1(k):idx_end1(k)) + a(k) * vec2(idx_start2(k):idx_end2(k));
% use = idx_start1(k):idx_end1(k);
% vec1(use) = vec1(use) + a(k) * vec2(idx_start2(k):idx_end2(k));
end
The two commented-out lines of code in the for loop were my attempt to speed it up, but it actually made it slower, much to my surprise. Generally, I would create a variable for an array that is used more than once thinking that is faster, but it is not. The code that is not commented out runs in 0.24 s versus 0.67 seconds for the code that is commented out.
How can I find the cube root of a number in an efficient way?
I think Newton-Raphson method can be used, but I don't know how to guess the initial solution programmatically to minimize the number of iterations.
This is a deceptively complex question. Here is a nice survey of some possible approaches.
In view of the "link rot" that overtook the Accepted Answer, I'll give a more self-contained answer focusing on the topic of quickly obtaining an initial guess suitable for superlinear iteration.
The "survey" by metamerist (Wayback link) provided some timing comparisons for various starting value/iteration combinations (both Newton and Halley methods are included). Its references are to works by W. Kahan, "Computing a Real Cube Root", and by K. Turkowski, "Computing the Cube Root".
metamerist updates the DEC-VAX era bit-fiddling technique of W. Kahan with this snippet, which "assumes 32-bit integers" and relies on IEEE 754 format for doubles "to generate initial estimates with 5 bits of precision":
inline double cbrt_5d(double d)
{
const unsigned int B1 = 715094163;
double t = 0.0;
unsigned int* pt = (unsigned int*) &t;
unsigned int* px = (unsigned int*) &d;
pt[1]=px[1]/3+B1;
return t;
}
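For readers who want to experiment without a C compiler, the same bit manipulation can be sketched in Python with the struct module. This is only an illustration under the snippet's own assumptions (IEEE 754 doubles) plus one of mine: the input is a positive, normal, finite number. The names cbrt_seed and cbrt_newton are hypothetical, not taken from the survey.

```python
import struct

B1 = 715094163  # magic constant from the snippet above


def cbrt_seed(d):
    """Rough cube-root estimate for a positive, normal, finite double."""
    bits = struct.unpack("<Q", struct.pack("<d", d))[0]
    hi = bits >> 32          # upper 32 bits: sign, exponent, top of the mantissa
    hi = hi // 3 + B1        # same integer arithmetic as pt[1] = px[1]/3 + B1
    # The lower 32 bits are left at zero, as in the C version (t starts as 0.0).
    return struct.unpack("<d", struct.pack("<Q", hi << 32))[0]


def cbrt_newton(d, steps=4):
    """Refine the seed with a few Newton steps for x**3 - d = 0."""
    x = cbrt_seed(d)
    for _ in range(steps):
        x = (2.0 * x + d / (x * x)) / 3.0
    return x


if __name__ == "__main__":
    for d in (0.001, 2.0, 27.0, 1.0e15):
        print(d, cbrt_seed(d), cbrt_newton(d))
```

Because Newton's method roughly squares the relative error at each step, a seed that is good to a few percent needs only a handful of steps regardless of the size of the input.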
The code by K. Turkowski provides slightly more precision ("approximately 6 bits") by a conventional powers-of-two scaling on float fr, followed by a quadratic approximation to its cube root over interval [0.125,1.0):
/* Compute seed with a quadratic approximation */
fr = (-0.46946116F * fr + 1.072302F) * fr + 0.3812513F;/* 0.5<=fr<1 */
and a subsequent restoration of the exponent of two (adjusted to one-third). The exponent/mantissa extraction and restoration make use of math library calls to frexp and ldexp.
Comparison with other cube root "seed" approximations
To appreciate those cube root approximations we need to compare them with other possible forms. First the criteria for judging: we consider the approximation on the interval [1/8,1], and we use best (minimizing the maximum) relative error.
That is, if f(x) is a proposed approximation to x^{1/3}, we find its relative error:
error_rel = max | f(x)/x^(1/3) - 1 | on [1/8,1]
The simplest approximation would of course be to use a single constant on the interval, and the best relative error in that case is achieved by picking f_0(x) = sqrt(2)/2, the geometric mean of the values at the endpoints. This gives 1.27 bits of relative accuracy, a quick but dirty starting point for a Newton iteration.
A better approximation would be the best first-degree polynomial:
f_1(x) = 0.6042181313*x + 0.4531635984
This gives 4.12 bits of relative accuracy, a big improvement but short of the 5-6 bits of relative accuracy promised by the respective methods of Kahan and Turkowski. But it's in the ballpark and uses only one multiplication (and one addition).
Finally, what if we allow ourselves a division instead of a multiplication? It turns out that with one division and two "additions" we can have the best linear-fractional function:
f_M(x) = 1.4774329094 - 0.8414323527/(x+0.7387320679)
which gives 7.265 bits of relative accuracy.
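The accuracy figures quoted above can be checked numerically. The following is a rough sketch, not a proof: it samples the interval [1/8, 1] on a uniform grid (the grid size is my own choice) and reports minus the base-2 logarithm of the largest relative error seen. The function names are hypothetical.

```python
import math


def f0(x):
    return math.sqrt(2.0) / 2.0


def f1(x):
    return 0.6042181313 * x + 0.4531635984


def fM(x):
    return 1.4774329094 - 0.8414323527 / (x + 0.7387320679)


def bits_of_accuracy(f, lo=0.125, hi=1.0, samples=20001):
    """-log2 of the largest sampled relative error of f against x**(1/3)."""
    worst = 0.0
    for k in range(samples):
        x = lo + (hi - lo) * k / (samples - 1)
        worst = max(worst, abs(f(x) / x ** (1.0 / 3.0) - 1.0))
    return -math.log2(worst)


if __name__ == "__main__":
    for name, f in (("f_0", f0), ("f_1", f1), ("f_M", fM)):
        print(name, round(bits_of_accuracy(f), 3))
```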
At a glance this seems like an attractive approach, but an old rule of thumb was to treat the cost of a FP division like three FP multiplications (and to mostly ignore the additions and subtractions). However with current FPU designs this is not realistic. While the relative cost of multiplications to adds/subtracts has come down, in most cases to a factor of two or even equality, the cost of division has not fallen but often gone up to 7-10 times the cost of multiplication. Therefore we must be miserly with our division operations.
static double cubeRoot(double num) {
double x = num;
if(num >= 0) {
for(int i = 0; i < 10 ; i++) {
x = ((2 * x * x * x) + num ) / (3 * x * x);
}
}
return x;
}
It seems like the optimization question has already been addressed, but I'd like to add an improvement to the cubeRoot() function posted here, for other people stumbling on this page looking for a quick cube root algorithm.
The existing algorithm works well, but outside the range of 0-100 it gives incorrect results.
Here's a revised version that works with numbers between -/+1 quadrillion (1E15). If you need to work with larger numbers, just use more iterations.
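static double cubeRoot( double num ){
    if( num == 0 ){ return num; } // avoid 0/0 in the update below
    boolean neg = ( num < 0 );
    double a = Math.abs( num ); // iterate on |num| so that x stays positive
    double x = a;
    for( int i = 0, iterations = 60; i < iterations; i++ ){
        x = ( ( 2 * x * x * x ) + a ) / ( 3 * x * x );
    }
    if( neg ){ return 0 - x; }
    return x;
}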
Regarding optimization, I'm guessing the original poster was asking how to predict the minimum number of iterations for an accurate result, given an arbitrary input size. But it seems like for most general cases the gain from optimization isn't worth the added complexity. Even with the function above, 100 iterations takes less than 0.2 ms on average consumer hardware. If speed was of utmost importance, I'd consider using pre-computed lookup tables. But this is coming from a desktop developer, not an embedded systems engineer.
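Instead of choosing a fixed iteration count, one option is to stop when the update no longer changes the estimate. The sketch below is in Python rather than Java and is only an illustration; the tolerance, the iteration cap and the starting value for inputs below 1 are my own assumptions. It also returns the number of steps taken, which shows why large inputs need more of them: far above the root, each Newton step only shrinks x by a factor of about 2/3.

```python
import math


def cube_root(num, tol=1e-14, max_iter=500):
    """Newton's method for the real cube root; returns (root, steps used)."""
    if num == 0.0:
        return 0.0, 0
    a = abs(num)
    # Start at or above the root so that the iterates decrease monotonically.
    x = a if a >= 1.0 else 1.0
    steps = 0
    for steps in range(1, max_iter + 1):
        x_new = (2.0 * x + a / (x * x)) / 3.0
        converged = abs(x_new - x) <= tol * x_new
        x = x_new
        if converged:
            break
    return math.copysign(x, num), steps


if __name__ == "__main__":
    for value in (-8.0, 27.0, 1.0e15, 1.0e-15):
        print(value, cube_root(value))
```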
I have to calculate the following:
float2 y = CONSTANT;
for (int i = 0; i < totalN; i++)
h[i] = cos(y*i);
totalN is a large number, so I would like to make this in a more efficient way. Is there any way to improve this? I suspect there is, because, after all, we know what's the result of cos(n), for n=1..N, so maybe there's some theorem that allows me to compute this in a faster way. I would really appreciate any hint.
Thanks in advance,
Federico
Using one of the most beautiful formulas of mathematics, Euler's formula
exp(i*x) = cos(x) + i*sin(x),
substituting x := n * phi:
cos(n*phi) = Re( exp(i*n*phi) )
sin(n*phi) = Im( exp(i*n*phi) )
exp(i*n*phi) = exp(i*phi) ^ n
Power ^n is n repeated multiplications.
Therefore you can calculate cos(n*phi) and simultaneously sin(n*phi) by repeated complex multiplication by exp(i*phi) starting with (1+i*0).
Code examples:
Python:
from math import cos, sin, pi
DEG2RAD = pi/180.0 # conversion factor degrees --> radians
phi = 10*DEG2RAD # constant e.g. 10 degrees
c = cos(phi)+1j*sin(phi) # = exp(1j*phi)
h = 1+0j
for i in range(1,10):
    h = h*c
    print("%d %8.3f" % (i, h.real))
or C:
#include <stdio.h>
#include <math.h>
// number of values to calculate:
#define N 10
// conversion factor degrees --> radians:
#define DEG2RAD (3.14159265/180.0)
// e.g. constant is 10 degrees:
#define PHI (10*DEG2RAD)
typedef struct
{
double re,im;
} complex_t;
int main(int argc, char **argv)
{
complex_t c;
complex_t h[N];
int index;
c.re=cos(PHI);
c.im=sin(PHI);
h[0].re=1.0;
h[0].im=0.0;
for(index=1; index<N; index++)
{
// complex multiplication h[index] = h[index-1] * c;
h[index].re=h[index-1].re*c.re - h[index-1].im*c.im;
h[index].im=h[index-1].re*c.im + h[index-1].im*c.re;
printf("%d: %8.3f\n",index,h[index].re);
}
}
I'm not sure what kind of accuracy vs. performance compromises you're willing to make, but there are extensive discussions of various sinusoid approximation techniques at these links:
Fun with Sinusoids - http://www.audiomulch.com/~rossb/code/sinusoids/
Fast and accurate sine/cosine - http://www.devmaster.net/forums/showthread.php?t=5784
Edit (I think this is the "Don Cross" link that's broken on the "Fun with Sinusoids" page):
Optimizing Trig Calculations - http://groovit.disjunkt.com/analog/time-domain/fasttrig.html
Maybe the simplest formula is
cos(n+y) = 2cos(n)cos(y) - cos(n-y).
If you precompute the constant 2*cos(y) then each value cos(n+y) can be computed from the previous 2 values with one single multiplication and one subtraction.
I.e., in pseudocode
h[0] = 1.0
h[1] = cos(y)
m = 2*h[1]
for (int i = 2; i < totalN; ++i)
h[i] = m*h[i-1] - h[i-2]
Here's a method, but it uses a little bit of memory for the sin. It uses the trig identities:
cos(a + b) = cos(a)cos(b)-sin(a)sin(b)
sin(a + b) = sin(a)cos(b)+cos(a)sin(b)
Then here's the code:
h[0] = 1.0;
double g1 = sin(y);
double glast = g1;
h[1] = cos(y);
for (int i = 2; i < totalN; i++){
h[i] = h[i-1]*h[1]-glast*g1;
glast = glast*h[1]+h[i-1]*g1;
}
If I didn't make any errors then that should do it. Of course there could be round-off problems so be aware of that. I implemented this in Python and it is quite accurate.
There are some good answers here but they are all recursive. Recursive calculation will not work for cosine function when using floating point arithmetic; you will invariably get rounding errors which quickly compound.
Consider calculation y = 45 degrees, totalN 10 000. You won't end up with 1 as the final result.
To address Kirk's concerns: all of the solutions based on the recurrence for cos and sin boil down to computing
x(k) = R x(k - 1),
where R is the matrix that rotates by y and x(0) is the unit vector (1, 0). If the true result for k - 1 is x'(k - 1) and the true result for k is x'(k), then the error goes from e(k - 1) = x(k - 1) - x'(k - 1) to e(k) = R x(k - 1) - R x'(k - 1) = R e(k - 1) by linearity. Since R is what's called an orthogonal matrix, R e(k - 1) has the same norm as e(k - 1), and the error grows very slowly. (The reason it grows at all is due to round-off; the computer representation of R is in general almost, but not quite orthogonal, so it will be necessary to restart the recurrence using the trig operations from time to time depending on the accuracy required. This is still much, much faster than using the trig ops to compute each value.)
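The "restart from time to time" idea can be sketched as follows. This is a minimal illustration in Python, not a tuned implementation; the function name and the resync_every parameter are hypothetical, and the restart interval of 1000 is an arbitrary choice.

```python
import math


def cos_table(y, total_n, resync_every=1000):
    """Return [cos(y*i) for i in range(total_n)] via the rotation recurrence."""
    c, s = math.cos(y), math.sin(y)   # entries of the rotation matrix R
    h = [0.0] * total_n
    re, im = 1.0, 0.0                 # x(0) = (cos 0, sin 0)
    for i in range(total_n):
        if i % resync_every == 0:
            # Restart from the library functions to discard accumulated round-off.
            re, im = math.cos(y * i), math.sin(y * i)
        h[i] = re
        re, im = re * c - im * s, re * s + im * c   # x(k) = R x(k-1)
    return h


if __name__ == "__main__":
    y = math.radians(45.0)
    h = cos_table(y, 10000)
    print(max(abs(h[i] - math.cos(y * i)) for i in range(len(h))))
```

Only one pair of cos/sin calls is made per resync_every values, so the cost per value stays close to that of the four multiplies and two adds of the rotation itself.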
You can do this using complex numbers.
If you define x = cos(y) + j sin(y), where j is the imaginary unit, cos(y*i) will be the real part of x^i.
You can compute this for all i iteratively. A complex multiply is four real multiplies plus two adds.
Knowing cos(n) doesn't help -- your math library already does these kind of trivial things for you.
Knowing that cos((i+1)y)=cos(iy+y)=cos(iy)cos(y)-sin(iy)sin(y) can help, if you precompute cos(y) and sin(y), and keep track of both cos(iy) and sin(i*y) along the way. It may result in some loss of precision, though - you'll have to check.
How accurate do you need the resulting cos(x) to be? If you can live with some error, you could create a lookup table, sampling the unit circle at 2*PI/N intervals, and then interpolate between two adjacent points. N would be chosen to achieve some desired level of accuracy.
What I don't know is whether an interpolation is actually less costly than computing a cosine. Since it's usually done in microcode in modern CPUs, it may not be.
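A minimal sketch of the lookup-table idea, assuming linear interpolation and a table of 1024 entries (both are my choices for illustration; TABLE_SIZE and cos_lookup are hypothetical names). For linear interpolation the error is bounded by roughly step^2/8, which is about 5e-6 for this table size.

```python
import math

TABLE_SIZE = 1024                      # hypothetical choice of N
STEP = 2.0 * math.pi / TABLE_SIZE
# One extra entry so that index k + 1 is always valid.
TABLE = [math.cos(k * STEP) for k in range(TABLE_SIZE + 1)]


def cos_lookup(x):
    """Approximate cos(x) by linear interpolation in a precomputed table."""
    t = (x % (2.0 * math.pi)) / STEP   # position in table units
    k = min(int(t), TABLE_SIZE - 1)    # guard against t == TABLE_SIZE from rounding
    frac = t - k
    return TABLE[k] + frac * (TABLE[k + 1] - TABLE[k])


if __name__ == "__main__":
    for x in (0.0, 1.0, 2.5, -7.0):
        print(x, cos_lookup(x), math.cos(x))
```

Whether this beats the library cosine depends on the hardware and on cache behaviour, so it would need to be measured rather than assumed.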
John Carmack has a special function in the Quake III source code which calculates the inverse square root of a float, 4x faster than regular (float)(1.0/sqrt(x)), including a strange 0x5f3759df constant. See the code below. Can someone explain line by line what exactly is going on here and why this works so much faster than the regular implementation?
float Q_rsqrt( float number )
{
long i;
float x2, y;
const float threehalfs = 1.5F;
x2 = number * 0.5F;
y = number;
i = * ( long * ) &y;
i = 0x5f3759df - ( i >> 1 );
y = * ( float * ) &i;
y = y * ( threehalfs - ( x2 * y * y ) );
#ifndef Q3_VM
#ifdef __linux__
assert( !isnan(y) );
#endif
#endif
return y;
}
FYI. Carmack didn't write it. Terje Mathisen and Gary Tarolli both take partial (and very modest) credit for it, as well as crediting some other sources.
How the mythical constant was derived is something of a mystery.
To quote Gary Tarolli:
Which actually is doing a floating
point computation in integer - it took
a long time to figure out how and why
this works, and I can't remember the
details anymore.
A slightly better constant, developed by an expert mathematician (Chris Lomont) trying to work out how the original algorithm worked is:
float InvSqrt(float x)
{
float xhalf = 0.5f * x;
int i = *(int*)&x; // get bits for floating value
i = 0x5f375a86 - (i >> 1); // gives initial guess y0
x = *(float*)&i; // convert bits back to float
x = x * (1.5f - xhalf * x * x); // Newton step, repeating increases accuracy
return x;
}
In spite of this, his initial attempt at a mathematically 'superior' version of id's sqrt (which came to almost the same constant) proved inferior to the one initially developed by Gary despite being mathematically much 'purer'. He couldn't explain why id's was so excellent iirc.
Of course these days, it turns out to be much slower than just using an FPU's sqrt (especially on 360/PS3), because swapping between float and int registers induces a load-hit-store, while the floating point unit can do reciprocal square root in hardware.
It just shows how optimizations have to evolve as the nature of underlying hardware changes.
Greg Hewgill and IllidanS4 gave a link with excellent mathematical explanation.
I'll try to sum it up here for ones who don't want to go too much into details.
Any mathematical function, with some exceptions, can be represented by a polynomial sum:
y = f(x)
can be exactly transformed into:
y = a0 + a1*x + a2*(x^2) + a3*(x^3) + a4*(x^4) + ...
Where a0, a1, a2,... are constants. The problem is that for many functions, like square root, for exact value this sum has infinite number of members, it does not end at some x^n. But, if we stop at some x^n we would still have a result up to some precision.
So, if we have:
y = 1/sqrt(x)
In this particular case they decided to discard all polynomial members above second, probably because of calculation speed:
y = a0 + a1*x + [...discarded...]
And the task has now come down to calculating a0 and a1 so that y differs as little as possible from the exact value. They have calculated that the most appropriate values are:
a0 = 0x5f375a86
a1 = -0.5
So when you put this into equation you get:
y = 0x5f375a86 - 0.5*x
Which is the same as the line you see in the code:
i = 0x5f375a86 - (i >> 1);
Edit: actually here y = 0x5f375a86 - 0.5*x is not the same as i = 0x5f375a86 - (i >> 1); since shifting float as integer not only divides by two but also divides exponent by two and causes some other artifacts, but it still comes down to calculating some coefficients a0, a1, a2... .
At this point they've found out that this result's precision is not enough for the purpose. So they additionally did only one step of Newton's iteration to improve the result accuracy:
x = x * (1.5f - xhalf * x * x)
They could have done some more iterations in a loop, each one improving result, until required accuracy is met. This is exactly how it works in CPU/FPU! But it seems that only one iteration was enough, which was also a blessing for the speed. CPU/FPU does as many iterations as needed to reach the accuracy for the floating point number in which the result is stored and it has more general algorithm which works for all cases.
So in short, what they did is:
Use (almost) the same algorithm as CPU/FPU, exploit the improvement of initial conditions for the special case of 1/sqrt(x) and don't calculate all the way to precision CPU/FPU will go to but stop earlier, thus gaining in calculation speed.
I was curious to see what the constant was as a float so I simply wrote this bit of code and googled the integer that popped out.
long i = 0x5F3759DF;
float* fp = (float*)&i;
printf("(2^127)^(1/2) = %f\n", *fp);
//Output
//(2^127)^(1/2) = 13211836172961054720.000000
It looks like the constant is "An integer approximation to the square root of 2^127 better known by the hexadecimal form of its floating-point representation, 0x5f3759df" https://mrob.com/pub/math/numbers-18.html
On the same site it explains the whole thing. https://mrob.com/pub/math/numbers-16.html#le009_16
According to this nice article written a while back...
The magic of the code, even if you
can't follow it, stands out as the i =
0x5f3759df - (i>>1); line. Simplified,
Newton-Raphson is an approximation
that starts off with a guess and
refines it with iteration. Taking
advantage of the nature of 32-bit x86
processors, i, an integer, is
initially set to the value of the
floating point number you want to take
the inverse square of, using an
integer cast. i is then set to
0x5f3759df, minus itself shifted one
bit to the right. The right shift
drops the least significant bit of i,
essentially halving it.
It's a really good read. This is only a tiny piece of it.
The code consists of two major parts. Part one calculates an approximation for 1/sqrt(y), and part two takes that number and runs one iteration of Newton's method to get a better approximation.
Calculating an approximation for 1/sqrt(y)
i = * ( long * ) &y;
i = 0x5f3759df - ( i >> 1 );
y = * ( float * ) &i;
Line 1 takes the floating point representation of y and treats it as an integer i. Line 2 shifts i over one bit and subtracts it from a mysterious constant. Line 3 takes the resulting number and converts it back to a standard float32. Now why does this work?
Let g be a function that maps a floating point number to its floating point representation, read as an integer. Line 1 above is setting i = g(y).
The following good approximation of g exists(*):
g(y) ≈ C*log_2(y) + D for some constants C and D. An intuition for why such a good approximation exists is that the floating point representation of y is roughly linear in the exponent.
The purpose of line 2 is to map from g(y) to g(1/sqrt(y)), after which line 3 can use g^-1 to map that number to 1/sqrt(y). Using the approximation above, we have g(1/sqrt(y)) ≈ C*log_2(1/sqrt(y)) + D = -C/2 * log_2(y) + D. We can use these formulas to calculate the map from g(y) to g(1/sqrt(y)), which is g(1/sqrt(y)) ≈ 3D/2 - 1/2 * g(y). In line 2, we have 0x5f3759df ≈ 3D/2, and i >> 1 ≈ 1/2 * g(y).
The constant 0x5f3759df is slightly smaller than the constant that gives the best possible approximation for g(1/sqrt(y)). That is because this step is not done in isolation. Due to the direction that Newton's method tends to miss in, using a slightly smaller constant tends to yield better results. The exact optimal constant to use in this setting depends on your input distribution of y, but 0x5f3759df is one such constant that gives good results over a fairly broad range.
A more detailed description of this process can be found on Wikipedia: https://en.wikipedia.org/wiki/Fast_inverse_square_root#Algorithm
(*) More explicitly, let y = 2^e*(1+f). Taking the log of both sides, we get log_2 y = e + log_2(1+f), which can be approximated as log_2 y ≈ e + f + σ for a small constant sigma. Separately, the float32 encoding of y expressed as an integer is g(y) ≈ 2^23 * (e+127) + f * 2^23. Combining the two equations, we get g(y) ≈ 2^23 * log_2 y + 2^23 * (127 - σ).
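The linear model in the footnote can be checked with a few lines of Python. This is a sketch only: the value SIGMA = 0.0450466 is an assumption of mine, chosen because 3D/2 then rounds to the constant in the code; it is not stated in the text above. The helper names are hypothetical.

```python
import math
import struct


def float_bits(y):
    """g(y): the float32 encoding of y read back as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", y))[0]


SIGMA = 0.0450466             # assumed value, see the note above
C = 2 ** 23
D = 2 ** 23 * (127 - SIGMA)


def g_approx(y):
    """The linear model g(y) ~ C*log2(y) + D from the footnote."""
    return C * math.log2(y) + D


magic = round(1.5 * D)        # the 3D/2 term that appears in line 2

if __name__ == "__main__":
    for y in (0.1, 1.0, 3.0, 1000.0):
        print(y, float_bits(y), round(g_approx(y)))
    print(hex(magic))
```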
Using Newton's method
y = y * ( threehalfs - ( x2 * y * y ) );
Consider the function f(y) = 1/y^2 - num. The positive zero of f is y = 1/sqrt(num), which is what we are interested in calculating.
Newton's method is an iterative algorithm for taking an approximation y_n for the zero of a function f, and calculating a better approximation y_n+1, using the following equation: y_n+1 = y_n - f(y_n)/f'(y_n).
Calculating what that looks like for our function f gives the following equation: y_n+1 = y_n - (-y_n+y_n^3*num)/2 = y_n * (3/2 - num/2 * y_n * y_n). This is exactly what the line of code above is doing.
You can learn more about the details of Newton's method here: https://en.wikipedia.org/wiki/Newton%27s_method
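To see how much each of the two parts contributes, here is a hedged Python port of the routine that exposes the initial guess separately. It assumes a positive, normal float32 input; note that Python does the Newton step in double precision, so the last digits will differ slightly from the C version. The function names are my own.

```python
import struct


def initial_guess(x):
    """Lines 1-3 of the snippet: reinterpret, shift-and-subtract, reinterpret."""
    i = struct.unpack("<I", struct.pack("<f", x))[0]   # float32 bits as an integer
    i = 0x5F3759DF - (i >> 1)
    return struct.unpack("<f", struct.pack("<I", i))[0]


def q_rsqrt(x):
    """Initial guess followed by one Newton step."""
    y = initial_guess(x)
    return y * (1.5 - 0.5 * x * y * y)


if __name__ == "__main__":
    for x in (0.25, 2.0, 10.0, 12345.678):
        exact = x ** -0.5
        print(x,
              abs(initial_guess(x) / exact - 1.0),   # relative error of the guess
              abs(q_rsqrt(x) / exact - 1.0))         # relative error after one step
```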
I've been profiling some of our core math on an Intel Core Duo, and while looking at various approaches to square root I've noticed something odd: using the SSE scalar operations, it is faster to take a reciprocal square root and multiply it to get the sqrt, than it is to use the native sqrt opcode!
I'm testing it with a loop something like:
inline float TestSqrtFunction( float in );
void TestFunc()
{
#define ARRAYSIZE 4096
#define NUMITERS 16386
float flIn[ ARRAYSIZE ]; // filled with random numbers ( 0 .. 2^22 )
float flOut [ ARRAYSIZE ]; // filled with 0 to force fetch into L1 cache
cyclecounter.Start();
for ( int i = 0 ; i < NUMITERS ; ++i )
for ( int j = 0 ; j < ARRAYSIZE ; ++j )
{
flOut[j] = TestSqrtFunction( flIn[j] );
// unrolling this loop makes no difference -- I tested it.
}
cyclecounter.Stop();
printf( "%d loops over %d floats took %.3f milliseconds",
NUMITERS, ARRAYSIZE, cyclecounter.Milliseconds() );
}
I've tried this with a few different bodies for the TestSqrtFunction, and I've got some timings that are really scratching my head. The worst of all by far was using the native sqrt() function and letting the "smart" compiler "optimize". At 24ns/float, using the x87 FPU this was pathetically bad:
inline float TestSqrtFunction( float in )
{ return sqrt(in); }
The next thing I tried was using an intrinsic to force the compiler to use SSE's scalar sqrt opcode:
inline void SSESqrt( float * restrict pOut, float * restrict pIn )
{
_mm_store_ss( pOut, _mm_sqrt_ss( _mm_load_ss( pIn ) ) );
// compiles to movss, sqrtss, movss
}
This was better, at 11.9ns/float. I also tried Carmack's wacky Newton-Raphson approximation technique, which ran even better than the hardware, at 4.3ns/float, although with an error of 1 in 2^10 (which is too much for my purposes).
The doozy was when I tried the SSE op for reciprocal square root, and then used a multiply to get the square root ( x * 1/√x = √x ). Even though this takes two dependent operations, it was the fastest solution by far, at 1.24ns/float and accurate to 2^-14:
inline void SSESqrt_Recip_Times_X( float * restrict pOut, float * restrict pIn )
{
__m128 in = _mm_load_ss( pIn );
_mm_store_ss( pOut, _mm_mul_ss( in, _mm_rsqrt_ss( in ) ) );
// compiles to movss, movaps, rsqrtss, mulss, movss
}
My question is basically what gives? Why is SSE's built-in-to-hardware square root opcode slower than synthesizing it out of two other math operations?
I'm sure that this is really the cost of the op itself, because I've verified:
All data fits in cache, and
accesses are sequential
the functions are inlined
unrolling the loop makes no difference
compiler flags are set to full optimization (and the assembly is good, I checked)
(edit: stephentyrone correctly points out that operations on long strings of numbers should use the vectorizing SIMD packed ops, like rsqrtps — but the array data structure here is for testing purposes only: what I am really trying to measure is scalar performance for use in code that can't be vectorized.)
sqrtss gives a correctly rounded result. rsqrtss gives an approximation to the reciprocal, accurate to about 11 bits.
sqrtss is generating a far more accurate result, for when accuracy is required. rsqrtss exists for the cases when an approximation suffices, but speed is required. If you read Intel's documentation, you will also find an instruction sequence (reciprocal square-root approximation followed by a single Newton-Raphson step) that gives nearly full precision (~23 bits of accuracy, if I remember properly), and is still somewhat faster than sqrtss.
edit: If speed is critical, and you're really calling this in a loop for many values, you should be using the vectorized versions of these instructions, rsqrtps or sqrtps, both of which process four floats per instruction.
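The effect of the single Newton-Raphson step can be illustrated without SSE. The sketch below is a simulation, not the real instruction: approx_rsqrt is a hypothetical stand-in that takes the exact reciprocal square root and truncates its mantissa to 12 bits, which leaves about 11 good bits. One refinement step then roughly doubles the number of correct bits.

```python
import math


def approx_rsqrt(x, bits=12):
    """Stand-in for rsqrtss: the exact value with its mantissa truncated."""
    m, e = math.frexp(1.0 / math.sqrt(x))       # value = m * 2**e, 0.5 <= m < 1
    m = math.floor(m * 2 ** bits) / 2 ** bits   # keep only `bits` mantissa bits
    return math.ldexp(m, e)


def refined_rsqrt(x):
    """Approximation followed by one Newton-Raphson step."""
    y = approx_rsqrt(x)
    return y * (1.5 - 0.5 * x * y * y)


if __name__ == "__main__":
    for x in (0.3, 2.0, 7.0, 1234.5):
        exact = 1.0 / math.sqrt(x)
        print(x,
              abs(approx_rsqrt(x) / exact - 1.0),
              abs(refined_rsqrt(x) / exact - 1.0))
```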
There are a number of other answers to this already from a few years ago. Here's what the consensus got right:
The rsqrt* instructions compute an approximation to the reciprocal square root, good to about 11-12 bits.
It's implemented with a lookup table (i.e. a ROM) indexed by the mantissa. (In fact, it's a compressed lookup table, similar to mathematical tables of old, using adjustments to the low-order bits to save on transistors.)
The reason why it's available is that it is the initial estimate used by the FPU for the "real" square root algorithm.
There's also an approximate reciprocal instruction, rcp. Both of these instructions are a clue to how the FPU implements square root and division.
Here's what the consensus got wrong:
SSE-era FPUs do not use Newton-Raphson to compute square roots. It's a great method in software, but it would be a mistake to implement it that way in hardware.
The N-R algorithm to compute reciprocal square root has this update step, as others have noted:
x' = 0.5 * x * (3 - n*x*x);
That's a lot of data-dependent multiplications and one subtraction.
What follows is the algorithm that modern FPUs actually use.
Given b[0] = n, suppose we can find a series of numbers Y[i] such that b[n] = b[0] * Y[0]^2 * Y[1]^2 * ... * Y[n]^2 approaches 1. Then consider:
x[n] = b[0] * Y[0] * Y[1] * ... * Y[n]
y[n] = Y[0] * Y[1] * ... * Y[n]
Clearly x[n] approaches sqrt(n) and y[n] approaches 1/sqrt(n).
We can use the Newton-Raphson update step for reciprocal square root to get a good Y[i]:
b[i] = b[i-1] * Y[i-1]^2
Y[i] = 0.5 * (3 - b[i])
Then:
x[0] = n * Y[0]
x[i] = x[i-1] * Y[i]
and:
y[0] = Y[0]
y[i] = y[i-1] * Y[i]
The next key observation is that b[i] = x[i-1] * y[i-1]. So:
Y[i] = 0.5 * (3 - x[i-1] * y[i-1])
= 1 + 0.5 * (1 - x[i-1] * y[i-1])
Then:
x[i] = x[i-1] * (1 + 0.5 * (1 - x[i-1] * y[i-1]))
= x[i-1] + x[i-1] * 0.5 * (1 - x[i-1] * y[i-1])
y[i] = y[i-1] * (1 + 0.5 * (1 - x[i-1] * y[i-1]))
= y[i-1] + y[i-1] * 0.5 * (1 - x[i-1] * y[i-1])
That is, given initial x and y, we can use the following update step:
r = 0.5 * (1 - x * y)
x' = x + x * r
y' = y + y * r
Or, even fancier, we can set h = 0.5 * y. This is the initialisation:
Y = approx_rsqrt(n)
x = Y * n
h = Y * 0.5
And this is the update step:
r = 0.5 - x * h
x' = x + x * r
h' = h + h * r
This is Goldschmidt's algorithm, and it has a huge advantage if you're implementing it in hardware: the "inner loop" is three multiply-adds and nothing else, and two of them are independent and can be pipelined.
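The initialisation and update step above translate almost directly into code. The following Python sketch is for illustration only and runs in ordinary double precision; the seed function rough_rsqrt is a hypothetical stand-in for a hardware lookup table (it simply truncates the exact value to a few bits), and the iteration count is my own choice.

```python
import math


def rough_rsqrt(n, bits=8):
    """Hypothetical low-precision seed: exact 1/sqrt(n) cut to a few bits."""
    m, e = math.frexp(1.0 / math.sqrt(n))
    return math.ldexp(math.floor(m * 2 ** bits) / 2 ** bits, e)


def goldschmidt_sqrt(n, iterations=4):
    """Return (approx sqrt(n), approx 1/sqrt(n)) using the update step above."""
    y0 = rough_rsqrt(n)
    x = y0 * n          # converges to sqrt(n)
    h = y0 * 0.5        # converges to 1 / (2 * sqrt(n))
    for _ in range(iterations):
        r = 0.5 - x * h
        x = x + x * r   # these two updates do not depend on each other
        h = h + h * r
    return x, 2.0 * h


if __name__ == "__main__":
    for n in (0.03, 2.0, 10.0, 1.0e6):
        print(n, goldschmidt_sqrt(n), math.sqrt(n))
```

Each pass roughly squares the relative error of the seed, and the same loop yields both the square root and its reciprocal.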
In 1999, FPUs already needed a pipelined add/subtract circuit and a pipelined multiply circuit, otherwise SSE would not be very "streaming". Only one of each circuit was needed in 1999 to implement this inner loop in a fully-pipelined way without wasting a lot of hardware just on square root.
Today, of course, we have fused multiply-add exposed to the programmer. Again, the inner loop is three pipelined FMAs, which are (again) generally useful even if you're not computing square roots.
This is also true for division. MULSS(a,RCPSS(b)) is way faster than DIVSS(a,b). In fact it's still faster even when you increase its precision with a Newton-Raphson iteration.
Intel and AMD both recommend this technique in their optimisation manuals. In applications which don't require IEEE-754 compliance, the only reason to use div/sqrt is code readability.
Instead of supplying an answer, that actually might be incorrect (I'm also not going to check or argue about cache and other stuff, let's say they are identical) I'll try to point you to the source that can answer your question.
The difference might lie in how sqrt and rsqrt are computed. You can read more here http://www.intel.com/products/processor/manuals/. I'd suggest starting by reading about the processor instructions you are using; there is some information there, especially about rsqrt (the CPU uses an internal lookup table with a coarse approximation, which makes it much simpler to get the result). It may be that rsqrt is so much faster than sqrt that one additional mul operation (which isn't too costly) does not change the situation here.
Edit: Few facts that might be worth mentioning:
1. Once I was doing some micro-optimizations for my graphics library and I used rsqrt for computing the length of vectors (instead of sqrt, I multiplied my sum of squares by its rsqrt, which is exactly what you've done in your tests), and it performed better.
2. Computing rsqrt using a simple lookup table might be easier, since when x goes to infinity, 1/sqrt(x) goes to 0, so for large x's the function values don't change (a lot), whereas sqrt goes to infinity, so it's that simple case ;).
Also, clarification: I'm not sure where I've found it in books I've linked, but I'm pretty sure I've read that rsqrt is using some lookup table, and it should be used only, when the result doesn't need to be exact, although - I might be wrong as well, as it was some time ago :).
Newton-Raphson converges to a zero of f(x) using increments equal to -f/f', where f' is the derivative.
For x=sqrt(y), you can try to solve f(x) = 0 for x using f(x) = x^2 - y;
Then the increment is: dx = -f/f' = -1/2 (x - y/x) = -1/2 (x^2 - y) / x
which has a slow divide in it.
You can try other functions (like f(x) = 1/y - 1/x^2) but they will be equally complicated.
Let's look at 1/sqrt(y) now. You can try f(x) = x^2 - 1/y, but it will be equally complicated: dx = -(y*x^2 - 1) / (2xy) for instance.
One non-obvious alternate choice for f(x) is: f(x) = y - 1/x^2
Then: dx = -f/f' = -(y - 1/x^2) / (2/x^3) = 1/2 * x * (1 - y * x^2)
Ah! It's not a trivial expression, but you only have multiplies in it, no divide. => Faster!
And: the full update step new_x = x + dx then reads:
x *= 3/2 - y/2 * x * x which is easy too.
It is faster because these instructions ignore rounding modes and do not handle floating-point exceptions or denormalized numbers. For these reasons they are much easier to pipeline, speculate, and execute out of order alongside other FP instructions.