My Java algorithm is spending 30% of its time calculating the expression Math.pow(10, 0.1*x + 1.5) for many different x values (I cannot know these x's beforehand). Is there any way/trick to reduce this bottleneck?
Another option is:
31.6227766017 * Math.exp(0.23025850929 * x)
This is likely faster, because exp is simpler than pow, but I did not test how big the difference (if any) is.
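As a quick sanity check on the two constants above, here is a minimal sketch (in Python rather than Java, purely for illustration; the names C, K, via_pow and via_exp are made up for this example). It only confirms that the two expressions agree numerically; it says nothing about which one is faster on a given JVM.

```python
import math

# Assumed names for illustration: C = 10**1.5 and K = 0.1 * ln(10).
C = 10 ** 1.5
K = 0.1 * math.log(10)

def via_pow(x):
    """The original expression: 10 ** (0.1*x + 1.5)."""
    return math.pow(10, 0.1 * x + 1.5)

def via_exp(x):
    """The rewritten expression: C * exp(K * x)."""
    return C * math.exp(K * x)

for x in (-50.0, 0.0, 3.7, 120.0):
    print(x, via_pow(x), via_exp(x))
```

Any real speed difference would have to be measured in Java itself, on the actual workload.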
I am looking for an algorithm that would efficiently calculate b^e where b and e are rational numbers, ensuring that the approximation error won't exceed given err (rational as well). Explicitly, I am looking for a function:
rational exp(rational base, rational exp, rational err)
that satisfies |exp(b, e, err) - b^e| < err
Rational numbers are represented as pairs of big integers. Let's assume that all rationality preserving operations like addition, multiplication etc. are already defined.
I have found several approaches, but they did not allow me to control the error clearly enough. In this problem I don't care about integer overflow. What is the best approach to achieve this?
This one is complicated, so I'm going to outline the approach that I'd take. I do not promise no errors, and you'll have a lot of work left.
I will rename the variables so that exp(x, y, err) computes x^y to within error err. If y is not in the range 0 <= y < 1, we can multiply by an appropriate x^k, with k an integer, to make it so. So we only need to worry about fractional y.
If all numerators and denominators were small, it would be easy to tackle this by first taking an integer power, and then taking a root using Newton's method. But that naive idea will fall apart painfully when you try to estimate something like (1000001/1000000)^(2000001/1000000). So the challenge is to keep that from blowing up on you.
I would recommend looking at the problem of calculating x^y as x^y = (x0^y0) * (x0^(y-y0)) * (x/x0)^y = (x0^y0) * e^((y-y0) * log(x0)) * e^(y * log(x/x0)). And we will choose x0 and y0 such that the calculations are easier and the errors are bounded.
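The decomposition above can be checked numerically with a small floating-point sketch (Python, with a made-up helper name split_power and arbitrary sample values; the real algorithm would use exact rationals throughout):

```python
import math

def split_power(x, y, x0, y0):
    """Return the three factors of x**y for nearby reference values x0, y0."""
    first = x0 ** y0                             # x0^y0
    second = math.exp((y - y0) * math.log(x0))   # x0^(y - y0)
    third = math.exp(y * math.log(x / x0))       # (x / x0)^y
    return first, second, third

a, b, c = split_power(7.3, 2.6, 7.0, 2.5)
print(a * b * c, 7.3 ** 2.6)  # the two values should agree closely
```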
To bound the errors, we can first come up with a naive upper bound b on x0^y0, something like "the next integer above x raised to the next integer above y". We will pick x0 and y0 close enough to x and y that the latter two terms are each under 2. Then we just need the three terms estimated to within err/12, err/(6*b) and err/(6*b). (You might want to make those error bounds tighter, say half that, and then round the final answer to a nearby rational.)
Now when we pick x0 and y0 we will be aiming for "close rational with smallish numerator/denominator". For that we start calculating the continued fraction. This gives a sequence of rational numbers that quickly converges to a target real. If we just cut off the sequence fairly soon, we can quickly find a rational number that is within any desired distance of a target real while keeping relatively small numerators and denominators.
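A minimal sketch of that continued-fraction search, assuming Python's fractions.Fraction as the rational type (the function names convergents and close_rational are invented for this example):

```python
from fractions import Fraction

def convergents(x):
    """Yield the continued-fraction convergents of a rational x."""
    h_prev, h = 0, 1   # numerators   h_{n-2}, h_{n-1}
    k_prev, k = 1, 0   # denominators k_{n-2}, k_{n-1}
    while True:
        a = x.numerator // x.denominator       # next partial quotient
        h_prev, h = h, a * h + h_prev
        k_prev, k = k, a * k + k_prev
        yield Fraction(h, k)
        frac = x - a
        if frac == 0:                          # expansion of a rational is finite
            return
        x = 1 / frac

def close_rational(target, tol):
    """First convergent within tol of target (small numerator/denominator)."""
    for c in convergents(target):
        if abs(c - target) <= tol:
            return c

print(close_rational(Fraction(314159, 100000), Fraction(1, 100)))  # 22/7
```

Because the last convergent of a rational equals the rational itself, the search always terminates.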
Let's work from the third term backwards.
We want y * log(x/x0) < log(2). But from the Taylor series if x/2 < x0 < 2x then log(x/x0) < x/x0 - 1. So we can search the continued fraction for an appropriate x0.
Once we have found it, we can use the Taylor series for log(1+z) to calculate log(x/x0) to within err/(12*y*b). And then the Taylor series for e^z to calculate the term to our desired error.
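For the e^z part, one way to control the error with exact rationals is sketched below. It assumes |z| <= 1, in which case the tail of the series after the terms already summed is at most twice the next term; the function name exp_taylor is made up for this example, and the log(1+z) series would need its own analogous bound.

```python
from fractions import Fraction

def exp_taylor(z, err):
    """Approximate e**z for a rational |z| <= 1 to within err (exact arithmetic)."""
    if abs(z) > 1:
        raise ValueError("this sketch assumes |z| <= 1")
    total = Fraction(0)
    term = Fraction(1)          # z**n / n!, starting at n = 0
    n = 0
    while True:
        total += term
        n += 1
        term = term * z / n     # next term z**n / n!
        # For |z| <= 1 and n >= 1 the remaining tail is at most 2 * |term|.
        if 2 * abs(term) < err:
            return total

approx = exp_taylor(Fraction(1, 2), Fraction(1, 10**9))
print(float(approx))
```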
The second term is more complicated. We need to estimate log(x0). What we do is find an appropriate integer k such that 1.1^k <= x0 < 1.1^(k+1). And then we can estimate both k * log(1.1) and log(x0 / 1.1^k) fairly precisely. Find a naive upper bound to that log and use it to find a close enough y0 for the second term to be within 2. And then use the Taylor series to estimate e^((y-y0) * log(x0)) to our desired precision.
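Finding the integer k in the step above can be done with exact rationals by simply walking the powers of 1.1 up or down. A small sketch, assuming x0 > 0 and using an invented helper name bracket_power:

```python
from fractions import Fraction

def bracket_power(x0, base=Fraction(11, 10)):
    """Return the integer k with base**k <= x0 < base**(k+1), for x0 > 0."""
    if x0 <= 0:
        raise ValueError("x0 must be positive")
    k, p = 0, Fraction(1)       # invariant: p == base**k
    while p * base <= x0:       # walk up while the next power still fits
        p *= base
        k += 1
    while p > x0:               # walk down if we started above x0
        p /= base
        k -= 1
    return k

print(bracket_power(Fraction(2)))     # 7, since 1.1**7 <= 2 < 1.1**8
```

The linear walk is only meant to show the bracketing condition; for large x0 a doubling search would take fewer steps.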
For the first term we use the naive method of raising x0 to an integer and then Newton's method to take a root, to give x0^y0 to our desired precision.
Then multiply them together, and we have an answer. (If you chose the "tighter errors, nicer answer", then now you'd do a continued fraction on that answer to pick a better rational to return.)
I'm trying to understand the algebra behind Big-O expressions. I have gone through several questions but still don't have a very clear idea how it's done.
When dealing with powers do we always omit the lower powers, for example:
O(10n^4-n^2-10) = O(10n^4)
What difference does it make when multiplication is involved? For example:
O(2n^3+10^2n) * O(n) = O(2n^3) ??
And finally, how do we deal with logs? For example:
O(n^2) + O(5*log(n))
I think we try to get rid of all constants and lower powers. I'm not sure how logarithms are involved in the simplification and what difference a multiplication sign would do. Thank you.
Big-O expressions are more closely related to Calculus, specifically limits, than they are to algebraic concepts/rules. The easiest way I've found to think about expressions like the examples you've provided, is to start by plugging in a small number, and then a really large number, and observe how the result changes:
Expression: O(10n^4-n^2-10)
use n = 2: O(10(2^4) - 2^2 - 10)
O(10 * 16 - 4 - 10) = 146
use n = 100: O(10(100^4) - 100^2- 10)
O(10(100,000,000) - 10,000 - 10) = 999,989,990
What you can see from this, is that the n^4 term overpowers all other terms in the expression. Therefore, this algorithm would be denoted as having a run-time of O(n^4).
So yes, your assumptions are correct: you should generally keep the highest power, drop constant factors, and drop lower-order terms.
Logarithms are effectively "undoing" exponentiation. Because of this, they grow very slowly and will reduce the overall O-run-time of an algorithm. However, when they are added to polynomial terms such as n^2, they get overruled by the larger-order term. In the example you provided, if we again evaluate using real numbers:
Expression: O(n^2) + O(5*log(n))
use n=2: O(2^2) + O(5*log(2))
O(4) + O(3.4657) = 7.47
use n=100: O(100^2) + O(5*log(100))
O(10,000) + O(23.03) = 10,023
You will notice that although the logarithm term is increasing, it isn't a great gain compared to the increase in n's size. However, the n^2 term is still generating a massive increase compared to the increase in n's size. Because of this, the Big O of these expressions combined would still boil down to: O(n^2).
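Since Big-O is really about limits, the same experiment can be repeated by looking at the ratio of each expression to its dominant term as n grows. A small sketch in Python (the names f and g are just labels for the two expressions from the question):

```python
import math

def f(n):
    """10n^4 - n^2 - 10, the first expression from the question."""
    return 10 * n**4 - n**2 - 10

def g(n):
    """n^2 + 5*log(n), the second expression (natural log)."""
    return n**2 + 5 * math.log(n)

for n in (2, 100, 10**4, 10**6):
    # f(n)/n^4 settles near the constant 10, g(n)/n^2 settles near 1.
    print(n, f(n) / n**4, g(n) / n**2)
```

A ratio that settles at a finite non-zero constant is what justifies writing O(n^4) and O(n^2): the constant and the lower-order terms stop mattering.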
If you're interested in further reading about the mathematics side of this, you may want to check out this post: https://secweb.cs.odu.edu/~zeil/cs361/web/website/Lectures/algebra/page/algebra.html
For an NxP matrix x and an Nx1 vector y with N > P, the two expressions
x \ y -- (1)
and
(x' * x) \ (x' * y) -- (2)
both compute the solution b to the matrix equation
x * b = y
in the least squares sense, i.e. so that the quantity
norm(y - x * b)
is minimized. Expression (2) does it using the classic algorithm for the solution of an ordinary least squares regression, where the left-hand argument to the \ operator is square. It is equivalent to writing
inv(x' * x) * (x' * y) -- (3)
but it uses an algorithm which is more numerically stable. It turns out that (3) is moderately faster than (2) even though (2) doesn't have to produce the inverse matrix as a byproduct, but I can accept that given the additional numerical stability.
However, some simple timings (with N=100,000 and P=30) show that expression (2) is more than 5 times faster than expression (1), even though (1) has greater flexibility to choose the algorithm used! For example, any call to (1) could just dispatch on the size of x, and in the case N>P it could reduce to (2), which would add a tiny amount of overhead, but certainly wouldn't take 5 times longer.
What is happening in expression (1) that is causing it to take so much longer?
Edit: Here are my timings
x = randn(1e5, 30);
y = randn(1e5,1);
tic, for i = 1:100; x\y; end; t1=toc;
tic, for i = 1:100; (x'*x)\(x'*y); end; t2=toc;
assert( abs(norm(x\y) - norm((x'*x)\(x'*y))) < 1e-10 );
fprintf('Speedup: %.2f\n', t1/t2)
Speedup: 5.23
Note that in your test:
size(x) == [1e5 30] but size(x'*x) == [30 30]
size(y) == [1e5 1] but size(x'*y) == [30 1]
That means that the matrices entering the mldivide function differ in size by 4 orders of magnitude! For the small problem, any overhead of determining which algorithm to use becomes large relative to the solve itself, and even the same algorithm would take very different times on the two problems.
In other words, you have a biased test. To make a fair test, use something like
x = randn(1e3);
y = randn(1e3,1);
I find (worst of 5 runs):
Speedup: 1.06 %// R2010a
Speedup: 1.16 %// R2010b
Speedup: 0.97 %// R2013a
...the difference has all but evaporated.
But, this does show very well that if you indeed have a regression problem with low dimensionality compared to the number of observations, it really pays off to do the multiplication first :)
mldivide is a catch-all, and really great at that. But often, having knowledge about the problem may make more specific solutions, like pre-multiplication, pre-conditioning, lu, qr, linsolve, etc. orders of magnitude faster.
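To illustrate what "doing the multiplication first" buys, here is a minimal pure-Python sketch (not MATLAB, and not what mldivide does internally) for the smallest interesting case, a straight-line fit with P = 2. The function name normal_equations_fit is made up for this example. The N observations are collapsed into a 2x2 system in one pass, after which the solve is trivial.

```python
def normal_equations_fit(xs, ys):
    """Least-squares fit of y ~ b0 + b1*x via the 2x2 normal equations."""
    n = len(xs)
    # Entries of X'X and X'y for the design matrix X = [1, x].
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx          # determinant of X'X
    if det == 0:
        raise ValueError("x values must not all be equal")
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1

print(normal_equations_fit([0, 1, 2, 3], [2, 5, 8, 11]))  # (2.0, 3.0)
```

The usual caveat applies: forming x'*x squares the condition number of the problem, so this shortcut trades some numerical robustness for speed.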
"even though (1) has greater flexibility to choose the algorithm used! For example, any call to (1) could just dispatch on the size of X, and in the case N>P it could reduce to (2), which would add a tiny amount of overhead, but certainly wouldn't take 5 times longer."
This is not the case. It could take a lot of overhead to choose which algorithm to use, particularly when compared to the computation on relatively small inputs such as these. In this case, because MATLAB can see that you have x'*x, it knows that one of the arguments must be both square and symmetric (yes - that knowledge of linear algebra is built in to MATLAB even at a parser level), and can straight away call one of the appropriate code paths within \.
I can't say whether this fully explains the timing differences you're seeing. I would want to investigate further, at least by:
Making sure to put the code within a function, and warming the function up to ensure that the JIT is engaged - and then trying the same thing with feature('accel', 'off') to remove the effect of the JIT
Trying this on a much bigger range of input sizes to check what contribution an 'algorithm choice overhead' made compared to computation time.
I just got to know about Karatsuba Algorithm, and I tried to implement it in Haskell.
Here is my code:
(***) :: Integer -> Integer -> Integer
x *** y
  | max x y < ub = x * y
  | otherwise    = z2*base*base + ((x1+x0) *** (y1+y0) - z2 - z0)*base + z0
  where
    base     = 10 ^ ((length . show $ max x y) `div` 2)
    z2       = x1 *** y1
    z0       = x0 *** y0
    (x1, x0) = helper x
    (y1, y0) = helper y
    helper n = (n `div` base, n `mod` base)
    ub       = 10000
This gives correct results for the large numbers I checked (20-30 digits) and is fast enough for 10-20 digits. However, it is a lot slower than the normal * for numbers of 100 digits or more. How can I improve this algorithm?
Actually, I doubt you can improve the performance enough to beat the built-in operator. GHC uses GMP under the hood, which should automatically switch to Toom-3 or other algorithms when they work well for the operand sizes. Naive Karatsuba might not even be used, but the Toom family is said to be algorithmically close to it. If you think about it, there's no reason for GHC not to use an advanced multiplication algorithm, since GMP already supports it out of the box.
The last time I checked, GMP is blazing fast and even when used in the normal double range, is at least as fast as gcc's compilation result.
There's a proposal on removing GMP from GHC, but it seems rather inactive.
EDIT: Thanks to @danvari, here are the different algorithms GMP uses: http://gmplib.org/manual/Multiplication-Algorithms.html#Multiplication-Algorithms. It seems Karatsuba does get used when the numbers are small enough, and aside from the usual Toom-Cook family, FFT is also used.
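For comparison, here is a sketch of the same recursion that splits on bits instead of decimal digits, written in Python purely as an illustration (the function name karatsuba and the cutoff_bits parameter are assumptions of this example, not part of the question's code). Splitting with shifts and masks avoids converting the numbers to decimal strings at every level, which is likely one of the costs in the Haskell version. It is not expected to beat a tuned library such as GMP.

```python
def karatsuba(x, y, cutoff_bits=64):
    """Karatsuba product of two non-negative integers, splitting on bits."""
    # Small operands: fall back to the built-in multiplication.
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x1, x0 = x >> half, x & mask      # x = x1 * 2**half + x0
    y1, y0 = y >> half, y & mask      # y = y1 * 2**half + y0
    z2 = karatsuba(x1, y1, cutoff_bits)
    z0 = karatsuba(x0, y0, cutoff_bits)
    # Three recursive products instead of four.
    z1 = karatsuba(x1 + x0, y1 + y0, cutoff_bits) - z2 - z0
    return (z2 << (2 * half)) + (z1 << half) + z0

a, b = 3**400, 7**350
print(karatsuba(a, b) == a * b)
```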
Is there a fast method for taking the modulus of a floating point number?
With integers, there are tricks for Mersenne primes, so that it's possible to calculate y = x MOD (2^31-1) without needing division (the "integer trick").
Can any similar tricks be applied for floating point numbers?
Preferably, in a way that can be converted into vector/SIMD operations, or moved into GPGPU code. This rules out using integer calculations on the floating point data.
The primes I'm interested in would be 2^7-1 and 2^31-1, although if there are more efficient ones for floating point numbers, those would be welcome.
One intended use of this algorithm would be to calculate a running "checksum" of input floating point numbers as they are being read into an algorithm. To avoid taking up too much of the calculation capability, I'd like to keep this lightweight.
Apparently a similar technique is used for larger numbers, particularly 2^127 - 1. Unfortunately, the math in the paper is beyond me, and I haven't been able to figure out how to convert it to smaller primes.
Example of floating point MOD 2^127 - 1 - HASH127
I looked at djb's paper, and you have it easier, since 31 bits fit comfortably into the 53-bit significand of a double. Assuming that your checksum consists of some ring operations over Z/(2**31 - 1), it will be easier (and faster) to solve the relaxed problem of computing a small representative of x mod (2**31 - 1); at the end, you can use integer arithmetic to find the canonical one, which is slow but shouldn't happen too often.
The basic reduction step is to replace an integer x = y + 2**31 * z with y + z. The trick that djb uses is to compute w = (x + L) - L, where L is a large integer carefully chosen to provoke roundoff in such a way that z = 2**-31 * w. Then compute y = x - w and output y + z, which will have magnitude at most 2**32. (I apologize if this operation isn't quite enough; if so, please post your checksum algorithm.)
The choice of L involves knowing how precise the significand is. For the modulus 2**31 - 1, we want the unit of least precision (ulp) to be 2**31. For doubles in the range [1.0, 2.0), the ulp is 2**-52, so L should be 2**52 * 2**31. If you were doing this with the modulus 2**7 - 1, then you'd take L = 2**52 * 2**7. As djb notes, this trick depends crucially on intermediate results not being computed in higher precision.
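The reduction step described above can be sketched as follows. This is a minimal illustration in Python, whose floats are IEEE doubles; the function name reduce_once is made up, and the sketch assumes x is a non-negative integer-valued double well below L and that intermediate results are not kept in extended precision.

```python
L = 2.0**52 * 2.0**31          # forces rounding to a multiple of 2**31
P = 2**31 - 1                  # the Mersenne modulus

def reduce_once(x):
    """One reduction step: returns a small value congruent to x mod 2**31 - 1."""
    w = (x + L) - L            # x rounded to the nearest multiple of 2**31
    y = x - w                  # low part, magnitude at most 2**30
    z = w * 2.0**-31           # high part, since 2**31 is congruent to 1 mod P
    return y + z

x = float(2**40 + 12345)
r = reduce_once(x)
print(r, int(r) % P == int(x) % P)
```

The final int(...) % P is the slow "canonical representative" step; in a running checksum it would only be needed at the very end.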