Implementing the Karatsuba algorithm in Haskell

I just got to know about Karatsuba Algorithm, and I tried to implement it in Haskell.
Here is my code:
(***) :: Integer -> Integer -> Integer
x *** y
  | max x y < ub = x * y
  | otherwise    = z2 * base * base + ((x1 + x0) *** (y1 + y0) - z2 - z0) * base + z0
  where
    base = 10 ^ ((length . show $ max x y) `div` 2)
    z2 = x1 *** y1
    z0 = x0 *** y0
    (x1, x0) = helper x
    (y1, y0) = helper y
    helper n = (n `div` base, n `mod` base)
    ub = 10000
This works correctly as far as I have checked with large numbers (20-30 digits), and it is fast enough for 10-20 digits. However, it is a lot slower than the normal * for 100 digits or even bigger numbers. How can I improve this algorithm?

Actually, I doubt you can improve the performance enough to beat the naive operator: Haskell uses GMP under the hood, which should automatically switch to Toom-3 or other algorithms for the value ranges where they work well. The naive Karatsuba might not even be used, but the Toom family is said to be algorithmically close to it. Come to think of it, there's no reason for GHC not to use an advanced multiplication algorithm, since it already gets one out of the box.
The last time I checked, GMP is blazing fast; even when used on values in the normal double range, it is at least as fast as gcc's compiled code.
There's a proposal to remove GMP from GHC, but it seems rather inactive.
EDIT: Thanks to @danvari, here are the different algorithms GMP uses: http://gmplib.org/manual/Multiplication-Algorithms.html#Multiplication-Algorithms. It seems Karatsuba does get used when the numbers are small enough, and aside from the usual Toom-Cook family, FFT is also used.
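If you still want to speed up a hand-rolled Karatsuba despite GMP, one common tweak is to split on a power of two with shifts and masks instead of recomputing 10^(digits/2) via length . show on every call, since that decimal conversion is itself expensive. Below is a minimal sketch of that idea; it assumes non-negative operands, and karatsuba, cutoff and bitLen are illustrative names rather than anything from the question:
import Data.Bits (shiftL, shiftR, (.&.))

-- Split on a power of two with shifts/masks instead of 10^(digits/2),
-- so no decimal conversion is needed on every recursive call.
karatsuba :: Integer -> Integer -> Integer
karatsuba x y
  | x < cutoff || y < cutoff = x * y
  | otherwise =
      let k        = (max (bitLen x) (bitLen y) + 1) `div` 2  -- split point in bits
          mask     = (1 `shiftL` k) - 1
          (x1, x0) = (x `shiftR` k, x .&. mask)               -- x = x1*2^k + x0
          (y1, y0) = (y `shiftR` k, y .&. mask)
          z2       = karatsuba x1 y1
          z0       = karatsuba x0 y0
          z1       = karatsuba (x1 + x0) (y1 + y0) - z2 - z0  -- the Karatsuba identity
      in (z2 `shiftL` (2 * k)) + (z1 `shiftL` k) + z0
  where
    cutoff = 10000
    bitLen :: Integer -> Int  -- number of bits in a non-negative Integer
    bitLen = go 0
      where
        go acc 0 = acc
        go acc m = go (acc + 1) (m `shiftR` 1)
The structure is the same as in the question's code; only the splitting base changes, which avoids the repeated show/length work on every recursive call.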

Related

Rational approximation of rational exponentiation root with error control

I am looking for an algorithm that would efficiently calculate b^e, where b and e are rational numbers, ensuring that the approximation error won't exceed a given err (rational as well). Explicitly, I am looking for a function:
rational exp(rational base, rational exp, rational err)
that satisfies the law |exp(b, e, err) - b^e| < err
Rational numbers are represented as pairs of big integers. Let's assume that all rationality preserving operations like addition, multiplication etc. are already defined.
I have found several approaches, but they did not allow me to control the error clearly enough. In this problem I don't care about integer overflow. What is the best approach to achieve this?
This one is complicated, so I'm going to outline the approach that I'd take. I do not promise no errors, and you'll have a lot of work left.
I will change variables from your notation to exp(x, y, err), meaning x^y within error err. If y is not in the range 0 <= y < 1, then we can easily multiply by an appropriate x^k, with k an integer, to make it so. So we only need to worry about fractional y.
If all numerators and denominators were small, it would be easy to tackle this by first taking an integer power, and then taking a root using Newton's method. But that naive idea will fall apart painfully when you try to estimate something like (1000001/1000000)^(2000001/1000000). So the challenge is to keep that from blowing up on you.
I would recommend looking at the problem of calculating x^y as x^y = (x0^y0) * (x0^(y-y0)) * (x/x0)^y = (x0^y0) * e^((y-y0) * log(x0)) * e^(y * log(x/x0)). And we will choose x0 and y0 such that the calculations are easier and the errors are bounded.
To bound the errors, we can first come up with a naive upper bound b on x0^y0 - something like "the next highest integer above x to the power of the next highest integer above y". We will pick x0 and y0 close enough to x and y that each of the latter two terms is under 2. Then we just need to estimate the three terms to within err/12, err/(6*b) and err/(6*b) respectively. (You might want to make those error targets tighter, say half that, and then round the final answer to a nearby rational.)
Now when we pick x0 and y0 we will be aiming for "close rational with smallish numerator/denominator". For that we start calculating the continued fraction. This gives a sequence of rational numbers that quickly converges to a target real. If we just cut off the sequence fairly soon, we can quickly find a rational number that is within any desired distance of a target real while keeping relatively small numerators and denominators.
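To make the convergent idea concrete, here is a minimal sketch in Haskell (matching the language of the first question on this page) using the built-in Rational type; convergents and closeRational are made-up names. The convergents converge rapidly to the target while keeping numerators and denominators small, so cutting the scan off early yields a nearby rational of modest size:
import Data.Ratio (numerator, denominator, (%))

-- All continued-fraction convergents of a rational, ending with the value itself.
convergents :: Rational -> [Rational]
convergents r = go (numerator r) (denominator r) (0, 1) (1, 0)
  where
    go _ 0 _ _ = []
    go n d (pPrev, qPrev) (p, q) =
      let a  = n `div` d        -- next continued-fraction digit
          p' = a * p + pPrev    -- standard convergent recurrence
          q' = a * q + qPrev
      in (p' % q') : go d (n - a * d) (p, q) (p', q')

-- First convergent within eps of the target (assumes eps > 0).
closeRational :: Rational -> Rational -> Rational
closeRational eps target =
  head [c | c <- convergents target, abs (c - target) < eps]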
Let's work from the third term backwards.
We want y * log(x/x0) < log(2). But from the Taylor series if x/2 < x0 < 2x then log(x/x0) < x/x0 - 1. So we can search the continued fraction for an appropriate x0.
Once we have found it, we can use the Taylor series for log(1+z) to calculate log(x/x0) to within err/(12*y*b). And then the Taylor series for e^z to calculate the term to our desired error.
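As a sketch of that step, again over Haskell's Rational and with illustrative names: assuming |z| <= 1/2, each term of these series is at most half the previous one, so stopping at the first term of magnitude at most eps/2 keeps the dropped tail below eps. (A real implementation should carry the explicit tail bounds described above rather than this crude rule.)
-- log (1 + z) to within eps, assuming |z| <= 1/2 and eps > 0.
logTaylor :: Rational -> Rational -> Rational
logTaylor eps z = sum (takeWhile (\t -> abs t > eps / 2) terms)
  where
    terms = [ (-1) ^ (n + 1) * z ^ n / fromIntegral n | n <- [1 :: Integer ..] ]

-- e^z to within eps, under the same assumptions.
expTaylor :: Rational -> Rational -> Rational
expTaylor eps z = sum (takeWhile (\t -> abs t > eps / 2) terms)
  where
    terms = scanl (\t n -> t * z / fromIntegral n) 1 [1 :: Integer ..]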
The second term is more complicated. We need to estimate log(x0). What we do is find an appropriate integer k such that 1.1^k <= x0 < 1.1^(k+1). And then we can estimate both k * log(1.1) and log(x0 / 1.1^k) fairly precisely. Find a naive upper bound to that log and use it to find a close enough y0 for the second term to be within 2. And then use the Taylor series to estimate e^((y-y0) * log(x0)) to our desired precision.
For the first term we use the naive method of raising x0 to an integer and then Newton's method to take a root, to give x0^y0 to our desired precision.
Then multiply them together, and we have an answer. (If you chose the "tighter errors, nicer answer", then now you'd do a continued fraction on that answer to pick a better rational to return.)

Prime Factorization of numbers of form x^a + b where x is prime

I need to calculate prime factorization of large numbers, by large numbers I mean of range 10^100.
I get an input a[0] <= 10^5 (whose prime factors I have already calculated using a sieve and other optimizations). After that I get a series of inputs a[1], a[2], a[3], ..., all in the range 2 <= a[i] <= 10^5. I need to calculate the product and get the factors of the new product. I have the following maths.
Let X be the data in memory and X can be represented as:
X = (p[0]^c1)(p[1]^c2)(p[2]^c3) .... where p[i] are its prime factors.
So I save this as,
A[p[0]] = c1, A[p[1]] = c2.... as p[i] <= 100000 this seems to work pretty well.
and as each new number arrives, I just add the powers of the primes of the new number into A.
So this works really well and is also fast enough. Now I am thinking of optimizing space, compensating with a reduction in time efficiency.
So if I can represent any number P as x^a + b, where x is a prime, can I factorize it? P obviously doesn't fit in memory, but 2 <= x, a, b <= 100000. Or is there any other method that would save me the space of A? I am okay with a slower algorithm than the one above.
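For reference, a minimal Haskell sketch of the exponent-map bookkeeping the question describes (using plain trial division rather than the asker's sieve; trialFactor and addFactor are made-up names):
import qualified Data.Map.Strict as M

-- prime -> exponent, playing the role of the array A in the question.
type Factorization = M.Map Int Int

-- Factor a number <= 10^5 by trial division.
trialFactor :: Int -> Factorization
trialFactor = go 2 M.empty
  where
    go d acc n
      | d * d > n      = if n > 1 then M.insertWith (+) n 1 acc else acc
      | n `mod` d == 0 = go d (M.insertWith (+) d 1 acc) (n `div` d)
      | otherwise      = go (d + 1) acc n

-- Multiply the running product by a new a[i]: just add its exponents into the map.
addFactor :: Factorization -> Int -> Factorization
addFactor acc a = M.unionWith (+) acc (trialFactor a)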
I don't think representing a number as x^a + b with prime x makes it any easier to factor.
Factoring hundred-digit numbers isn't all that hard these days. A good personal computer with lots of cores running a good quadratic sieve can factor most hundred-digit numbers in about a day, though you should know that hundred-digit numbers are about at the limit of what is reasonable to factor with a desktop computer. Look at Jason Papadopoulos' program msieve for a cutting edge factorization program.
First, you'd better do some math on paper (perhaps some simplifications are possible; I don't know...).
Then you need to use some arbitrary-precision arithmetic (a.k.a. bignum or bigint) library. I recommend GMPlib, but there are other ones.
See also this answer.

Why is x \ y so much slower than (x' * x) \ (x' * y)?

For an NxP matrix x and an Nx1 vector y with N > P, the two expressions
x \ y -- (1)
and
(x' * x) \ (x' * y) -- (2)
both compute the solution b to the matrix equation
x * b = y
in the least squares sense, i.e. so that the quantity
norm(y - x * b)
is minimized. Expression (2) does it using the classic algorithm for the solution of an ordinary least squares regression, where the left-hand argument to the \ operator is square. It is equivalent to writing
inv(x' * x) * (x' * y) -- (3)
but it uses an algorithm which is more numerically stable. It turns out that (3) is moderately faster than (2) even though (2) doesn't have to produce the inverse matrix as a byproduct, but I can accept that given the additional numerical stability.
However, some simple timings (with N=100,000 and P=30) show that expression (2) is more than 5 times faster than expression (1), even though (1) has greater flexibility to choose the algorithm used! For example, any call to (1) could just dispatch on the size of X, and in the case N>P it could reduce to (2), which would add a tiny amount of overhead, but certainly wouldn't take 5 times longer.
What is happening in expression (1) that is causing it to take so much longer?
Edit: Here are my timings
x = randn(1e5, 30);
y = randn(1e5,1);
tic, for i = 1:100; x\y; end; t1=toc;
tic, for i = 1:100; (x'*x)\(x'*y); end; t2=toc;
assert( abs(norm(x\y) - norm((x'*x)\(x'*y))) < 1e-10 );
fprintf('Speedup: %.2f\n', t1/t2)
Speedup: 5.23
You are aware of the fact that in your test
size(x) == [1e5 30] but size(x'*x) == [30 30]
size(y) == [1e5 1] but size(x'*y) == [30 1]
That means that the matrices entering the mldivide function differ in size by 4 orders of magnitude! This would render any overhead of determining which algorithm to use rather large and significant (and perhaps also running the same algorithm on the two different problems).
In other words, you have a biased test. To make a fair test, use something like
x = randn(1e3);
y = randn(1e3,1);
I find (worst of 5 runs):
Speedup: 1.06 %// R2010a
Speedup: 1.16 %// R2010b
Speedup: 0.97 %// R2013a
...the difference has all but evaporated.
But, this does show very well that if you indeed have a regression problem with low dimensionality compared to the number of observations, it really pays off to do the multiplication first :)
mldivide is a catch-all, and really great at that. But often, having knowledge about the problem may make more specific solutions, like pre-multiplication, pre-conditioning, lu, qr, linsolve, etc. orders of magnitude faster.
even though (1) has greater flexibility to choose the algorithm used!
For example, any call to (1) could just dispatch on the size of X, and
in the case N>P it could reduce to (2), which would add a tiny amount
of overhead, but certainly wouldn't take 5 times longer.
This is not the case. It could take a lot of overhead to choose which algorithm to use, particularly when compared to the computation on relatively small inputs such as these. In this case, because MATLAB can see that you have x'*x, it knows that one of the arguments must be both square and symmetric (yes - that knowledge of linear algebra is built in to MATLAB even at a parser level), and can straight away call one of the appropriate code paths within \.
I can't say whether this fully explains the timing differences you're seeing. I would want to investigate further, at least by:
Making sure to put the code within a function, and warming the function up to ensure that the JIT is engaged - and then trying the same thing with feature('accel', 'off') to remove the effect of the JIT
Trying this on a much bigger range of input sizes to check what contribution an 'algorithm choice overhead' made compared to computation time.

Optimising a recursive brute force into a more mathematical/linear solution

I've written this Haskell program to solve Euler 15 (it uses some very simple dynamic programming to run a tad faster, so I can actually run it, but without that you would expect it to run in O(2^n)).
-- Starting in the top left corner of a 2×2 grid, and only being able to move to
-- the right and down, there are exactly 6 routes to the bottom right corner.
--
-- How many such routes are there through a 20×20 grid?
calcPaths :: Int -> Integer
calcPaths s
  = let go x y
          | x == 0 || y == 0 = 1
          | x == y = 2 * go x (y - 1)
          | otherwise = go (x - 1) y + go x (y - 1)
    in go s s
I later realised this could be done in O(n) by transforming it into an equation, and upon thinking about it a little longer, realised it's actually quite similar to my solution above, except that the recursion (which is slow on our hardware) is replaced by mathematics representing the same thing (which is fast on our hardware).
Is there a systematic way to perform this kind of optimisation (produce and prove an equation to match a recursion) on recursive sets of expressions, specifically one which could be realistically "taught" to a compiler so this reduction is done automatically?
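For reference, the O(n) equation alluded to in the question is the central binomial coefficient: the number of monotone paths through an n×n grid is C(2n, n). A one-line Haskell sketch (paths is an illustrative name):
-- Number of monotone lattice paths through an n-by-n grid: C(2n, n).
paths :: Integer -> Integer
paths n = product [n + 1 .. 2 * n] `div` product [1 .. n]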
Unfortunately, I can't say much about analytical algorithmic optimizations, but in practice there is a useful technique for dynamic programming called memoization. For example, with the memoize library your code can be rewritten as
import Data.Function.Memoize

calcPaths s
  = let go f x y
          | x == 0 || y == 0 = 1
          | x == y = 2 * f x (y - 1)
          | otherwise = f (x - 1) y + f x (y - 1)
    in memoFix2 go s s
so the go function will be calculated only once for any combination of arguments.
You can also use dynamic programming if the problem is divisible into smaller subproblems, e.g.
F(x,y) = F(x-1,y) + F(x,y-1)
Here F(x,y) is divisible into smaller subproblems, hence DP can be used:
int arr[xmax+1][ymax+1];
// base conditions
for (int i = 0; i <= xmax; i++)
    arr[i][0] = 1;
for (int j = 0; j <= ymax; j++)
    arr[0][j] = 1;
// main equation
for (int i = 1; i <= xmax; i++) {
    for (int j = 1; j <= ymax; j++) {
        arr[i][j] = arr[i-1][j] + arr[i][j-1];
    }
}
As for the compiler optimization you mentioned: DP can be used for that too. You just need to teach the compiler to check, given a recursive solution, whether it is divisible into subproblems of smaller size, and if so to use DP with a simple loop build-up like the one above. The most difficult part is optimizing it automatically; for example, the DP above needs O(xmax*ymax) space but can easily be optimized to get the solution in O(xmax+ymax) space, as sketched below.
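A minimal Haskell sketch of that space optimization, under the same recurrence (pathsDP is an illustrative name): keep a single row of the table and update it xmax times, so only O(ymax) cells are live at any point.
-- arr[i][j] = arr[i-1][j] + arr[i][j-1], with the first row and column all 1s,
-- computed one row at a time.
pathsDP :: Int -> Int -> Integer
pathsDP xmax ymax = last (iterate step firstRow !! xmax)
  where
    firstRow = replicate (ymax + 1) 1   -- arr[0][j] = 1
    step row = scanl1 (+) row           -- running sums give the next row
For the 20×20 grid of the problem, pathsDP 20 20 gives the same value as the closed form sketched earlier.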
Example solver: http://www.cs.unipr.it/purrs/
This also seems like somewhat of a philosophical question. It seems that you are asking that the compiler recognize that you would like a more efficient (faster? using less resources?) process to return the value of the function call (rather than the most efficient way to execute your code).
Carrying the idea further, we might have a compiler give suggestions of, in this case, mathematical formulas that might distill code more succinctly/efficiently; alternatively, the compiler might choose to connect to the internet and have another computer (e.g., google or wolfram) conduct the calculation; at the extreme, perhaps the compiler will recognize that what might actually be better to deliver at this time is not the answer to Euler Project 15 but a chocolate cake recipe or instructions for fixing your home heating.
The question makes me think of artificial intelligence and the role of the computer (how much of your math should the computer do for you? how much should it follow the code more closely?). That said, this kind of optimization ought to be an interesting project to think about.

Fast, Vectorizable method of taking floating point number modulus of special primes?

Is there a fast method for taking the modulus of a floating point number?
With integers, there are tricks for Mersenne primes, so that it's possible to calculate y = x MOD 2^31-1 without needing division. integer trick
Can any similar tricks be applied for floating point numbers?
Preferably, in a way that can be converted into vector/SIMD operations, or moved into GPGPU code. This rules out using integer calculations on the floating point data.
The primes I'm interested in would be 2^7-1 and 2^31-1, although if there are more efficient ones for floating point numbers, those would be welcome.
One intended use of this algorithm would be to calculate a running "checksum" of input floating point numbers as they are being read into an algorithm. To avoid taking up too much of the calculation capability, I'd like to keep this lightweight.
Apparently a similar technique is used for larger numbers, particularly 2^127 - 1. Unfortunately, the math in the paper is beyond me, and I haven't been able to figure out how to convert it to smaller primes.
Example of floating point MOD 2^127 - 1 - HASH127
I looked at djb's paper, and you have it easier, since 31 bits fit comfortably into the 53-bit precision of a double significand. Assuming that your checksum consists of some ring operations over Z/(2**31 - 1), it will be easier (and faster) to solve the relaxed problem of computing a small representative of x modulo 2**31 - 1; at the end, you can use integer arithmetic to find a canonical one, which is slow but shouldn't happen too often.
The basic reduction step is to replace an integer x = y + 2**31 * z with y + z. The trick that djb uses is to compute w = (x + L) - L, where L is a large integer carefully chosen to provoke roundoff in such a way that z = 2**-31 * w. Then compute y = x - w and output y + z, which will have magnitude at most 2**32. (I apologize if this operation isn't quite enough; if so, please post your checksum algorithm.)
The choice of L involves knowing how precise the significand is. For the modulus 2**31 - 1, we want the unit of least precision (ulp) to be 2**31. For doubles in the range [1.0, 2.0), the ulp is 2**-52, so L should be 2**52 * 2**31. If you were doing this with the modulus 2**7 - 1, then you'd take L = 2**52 * 2**7. As djb notes, this trick depends crucially on intermediate results not being computed in higher precision.
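A minimal Haskell sketch of that reduction step for the modulus 2**31 - 1 (reduceStep and magic are made-up names; it assumes non-negative inputs below 2**83 and strict double-precision arithmetic, per the caveat above):
-- Split x = y + 2^31 * z using the rounding trick, then return y + z,
-- which is congruent to x modulo 2^31 - 1 and has magnitude at most about 2^32.
reduceStep :: Double -> Double
reduceStep x = y + z
  where
    magic = 2 ^ 52 * 2 ^ 31 :: Double  -- chosen so the ulp at (x + magic) is 2^31
    w = (x + magic) - magic            -- x rounded to a multiple of 2^31, i.e. 2^31 * z
    z = w / 2 ^ 31                     -- high part
    y = x - w                          -- low part
Iterating this step, or finishing with exact integer arithmetic as suggested above, gives a canonical representative.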
