Solve system of polynomial equations - wolfram-mathematica

I have a system consisting of 2 polynomials, in 2 variables, with complex coefficients.
In the general case the solution set consists of a finite number of pairs of complex numbers.
NSolve[{poly1==0,poly2==0},{x,y}]
in Mathematica works for lower-degree polynomials, but the time needed to find all roots
seems to grow exponentially, roughly 2^deg. Is there a more efficient alternative to NSolve, perhaps in another language? The degree we're aiming for is in the range 15-25; higher is better.

I did not find a solution, but it seems that using fewer cores is better (comparing 2, 4 and 50 processor cores), and a 64-bit architecture is about 2 times faster.
All this is with NSolve. A system of two degree-17 polynomials in 2 variables took 24 hours to solve.

Related

How are sparse Ax = b systems solved in practice?

Let A be an n x n sparse matrix, represented by a sequence of m tuples of the form (i,j,a), with indices i,j (between 0 and n-1) and a value a in the underlying field F.
What algorithms are used, in practice, to solve linear systems of equations of the form Ax = b? Please describe them, don't just link somewhere.
Notes:
I'm interested both in exact solutions for finite fields, and in exact and bounded-error solutions for reals or complex numbers using floating-point representation. I suppose exact or bounded-error solutions for rational numbers are also interesting.
I'm particularly interested in parallelizable solutions.
A is not fixed, i.e. you don't just get different b's for the same A.
The main two algorithms that I have used and parallelised are the Wiedemann algorithm and the Lanczos algorithm (and their block variants for GF(2) computations), both of which are better than structured gaussian elimination.
The LaMacchia-Odlyzko paper (the one for the Lanczos algorithm) will tell you what you need to know. The algorithms involve repeatedly multiplying your sparse matrix by a sequence of vectors. To do this efficiently, you need to use the right data structure (linked list) to make the matrix-vector multiply time proportional to the number of non-zero values in the matrix (i.e. the sparsity).
Parallelisation of these algorithms is trivial, but optimisation will depend upon the architecture of your system. The parallelisation of the matrix-vector multiply is done by splitting the matrix into blocks of rows (each processor gets one block); each block of rows is multiplied by the vector separately, and then you combine the results to get the new vector.
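For concreteness, here is a minimal sketch (my own illustration, not the answer's code; the Entry struct, the spmv name and the OpenMP pragma are assumptions) of the row-blocked sparse matrix-vector product these algorithms spend almost all their time in. The point is that the work is proportional to the number of non-zero entries, and each block of rows can be handled by a separate processor as long as the blocks partition the rows:

#include <vector>
#include <cstddef>

// One non-zero of the sparse matrix: row i, column j, value a.
struct Entry { std::size_t i, j; double a; };

// y = A*x for a sparse A given as (i,j,a) triples grouped into blocks of rows.
// Assumes each row appears in exactly one block, so blocks can be processed independently.
std::vector<double> spmv(const std::vector<std::vector<Entry>>& rowBlocks,
                         const std::vector<double>& x, std::size_t n)
{
    std::vector<double> y(n, 0.0);
    #pragma omp parallel for                    // each processor takes whole blocks of rows
    for (long b = 0; b < (long)rowBlocks.size(); ++b)
        for (const Entry& e : rowBlocks[b])
            y[e.i] += e.a * x[e.j];             // cost proportional to the number of non-zeros
    return y;
}

The linked-list remark above is about exactly this: store only the non-zeros, so the multiply touches each of the m entries once per iteration.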
I've done these types of computations extensively. The team that originally broke the RSA-129 factorisation took 6 weeks using structured gaussian elimination on a 16,384-processor MasPar. On the same machine, I worked with Arjen Lenstra (one of the authors) to solve the matrix in 4 days with block Wiedemann and 1 day with block Lanczos. Unfortunately, I never published the result!

What is the fastest way to get the value of e?

What is the most optimised algorithm which finds the value of e with moderate accuracy?
I am looking for a comparison between optimised approaches giving more importance to speed than high precision.
Edit: By moderate accuracy I mean up to 6-7 decimal places.
But if there is a HUGE difference in speed, then I can settle with 4-5 places.
basic datatype
As mentioned in the comments, 6-7 decimal places is too little accuracy to be worth an algorithm. Instead use a constant, which is the fastest way anyway:
const double e=2.7182818284590452353602874713527;
If an FPU is involved, the constant is usually stored there too... Also, a single constant occupies much less space than a function that computes it ...
finite accuracy
Only once bignums are involved does it have any merit to use an algorithm to compute e. The algorithm depends on the target accuracy. Again, for smaller accuracies predefined constants are used:
e=2.71828182845904523536028747135266249775724709369995957496696762772407663035354759457138217852516642742746639193200305992181741359662904357290033429526059563073813232862794349076323382988075319525101901157383418793070215408914993488416750924476146066808226480016847741185374234544243710753907774499206955170189
but usually in hex format for faster and more precise manipulation:
e=2.B7E151628AED2A6ABF7158809CF4F3C762E7160F38B4DA56A784D9045190CFEF324E7738926CFBE5F4BF8D8D8C31D763DA06C80ABB1185EB4F7C7B5757F5958490CFD47D7C19BB42158D9554F7B46BCED55C4D79FD5F24D6613C31C3839A2DDF8A9A276BCFBFA1C877C56284DAB79CD4C2B3293D20E9E5EAF02AC60ACC93ECEBh
For limited/finite accuracy and best speed the PSLQ algorithm is best. My understanding is that it is an algorithm for finding an integer relation for a real number.
Here is my favourite PSLQ example, computing up to 800 digits of Pi.
arbitrary accuracy
For arbitrary or "fixed" precision you need an algorithm with variable precision. This is what I use in my arbnum class:
e=(1+1/x)^x where x -> +infinity
If you choose x as a power of 2, realize that x is then just a single set bit of the number and 1/x has a predictable bit width. So e is obtained with a single division and a single power. Here is an example:
arbnum arithmetics_e() // e computation, min(_arbnum_max_a,_arbnum_max_b)*5 decimals
    {                  // e=(1+1/x)^x ... x -> +inf
    int i; arbnum c,x;
    i=_arbnum_bits_a; if (i>_arbnum_bits_b) i=_arbnum_bits_b; i>>=1; // i = half of the smaller bit count
    c.zero(); c.bitset(_arbnum_bits_b-i);                            // c = 2^-i (single fractional bit set)
    x.one(); x/=c;                                                   // x = 1/c = 2^+i (the single bignum division)
    c++;                                                             // c = 1 + 1/x
    for (;!x.bitget(_arbnum_bits_b);x>>=1) c*=c;                     // c = pow(c,x) by repeated squaring (i squarings)
    return c;
    }
Here _arbnum_bits_a and _arbnum_bits_b are the numbers of bits before and after the binary point. So the computation breaks down to a few bit operations, one bignum division and a single power by squaring. Beware that multiplication and division are not that simple with bignums and usually involve Karatsuba or worse ...
There are also polynomial approaches out there that do not require bignum arithmetic, similar to those used to compute Pi. The idea is to compute a chunk of binary bits per iteration without affecting the previously computed bits (too much). They should be faster, but as usual for any optimization it depends on the implementation and the HW it runs on.
For reference see Brother's formula here : https://www.intmath.com/exponential-logarithmic-functions/calculating-e.php
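For completeness, here is a minimal sketch (plain C++, not tied to the arbnum class above, and using the classic factorial series rather than Brothers' accelerated forms) showing that the straightforward series e = 1/0! + 1/1! + 1/2! + ... already reaches double precision in under 20 terms:

#include <cstdio>

// e = sum of 1/k! -- the plain series; ~18 terms suffice for double precision.
double e_series()
{
    double e = 1.0, term = 1.0;
    for (int k = 1; k < 20; ++k)
    {
        term /= k;   // term = 1/k!
        e += term;
    }
    return e;
}

int main()
{
    std::printf("%.15f\n", e_series());   // 2.718281828459045
}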

Fast algorithm for computing 1/d? (SRT, Goldschmidt, Newton-Raphson, ...)

I want to find a fast algorithm for computing 1/d, where d is a double (although it can be converted to an integer). Which is the best of the many algorithms (SRT, Goldschmidt, Newton-Raphson, ...)? I'm writing my program in C.
Thanks in advance.
The fastest program is: double result = 1 / d;
CPUs already use an iterative root-finding algorithm like the ones you describe to find the reciprocal 1/d, so you will find it difficult to beat it with a software implementation of the same algorithm.
If you have few/known denominators then try a lookup table. This is the usual approach for even slower functions such as trig functions.
Otherwise: just compute 1/d. It will be the fastest you can do. And there is an endless list of things you can do to speed up arithmetic if you have to:
Use 32-bit (single) instead of 64-bit (double) precision. FP division takes a number of cycles proportional to the number of bits.
Vectorize the operations. For example, I believe you can compute four 32-bit float divisions in parallel with SSE2, or even more in parallel by doing it on the GPU.
I asked someone about this and got my answer:
So, you can't add a hardware divider to the FPGA then? Or fast reciprocal support?
Anyway it depends. Does it have fast multiplication? If not, well, that's a problem, you could only implement the slow methods then.
If you have fast multiplication and IEEE floats, you can use the weird trick I linked to in my previous post with a couple of refinement steps. That's really just Newton-Raphson division with a simpler calculation for the initial approximation (but afaik it still only takes 3 refinements for single-precision floats, just like the regular initial approximation). Fast reciprocal support works that way too - give a fast initial approximation (handling the exponent right and getting significant bits from a lookup table; if you get 12 significant bits that way you only need one refinement step for single precision, and 13 bits are enough to need only 2 steps for double precision) and optionally have instructions that help implement the refinement step (like AMD's PFRCPIT1 and PFRCPIT2), for example to calculate Y = (1 - D*X) and to calculate X + X * Y.
Even without those tricks Newton–Raphson division is still not bad, with the linear approximation it takes only 4 refinements for double-precision floats, but it also takes some annoying exponent adjustments to get in the right range first (in hardware that wouldn't be half as annoying).
Goldschmidt division is, afaik, roughly equivalent in performance and might have a slightly less complex implementation. It's really the same sort of deal - trickery with the exponent to get in the right range, the "2 - something" estimation trick (which is rearranged in Newton-Raphson division, but it's really the same thing), and doing the refinement step until all the bits are right. It just looks a little different.
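To make the refinement step concrete, here is a minimal sketch (my own illustration; the crude exponent-only initial guess is an assumption, real hardware uses a small lookup table or the bit trick mentioned above) of software Newton-Raphson reciprocal refinement. The iteration x = x*(2 - d*x) is the standard one and roughly doubles the number of correct bits per pass:

#include <cmath>
#include <cstdio>

// Approximate 1/d (d > 0) by Newton-Raphson iteration on f(x) = 1/x - d.
double reciprocal(double d, int steps = 6)   // 6 steps cover double precision from this crude guess
{
    int exp;
    std::frexp(d, &exp);                     // d = m * 2^exp with 0.5 <= m < 1
    double x = std::ldexp(1.0, -exp);        // initial guess 2^-exp, within a factor of 2 of 1/d
    for (int i = 0; i < steps; ++i)
        x = x * (2.0 - d * x);               // refinement: Y = 1 - d*x folded into x + x*Y
    return x;
}

int main()
{
    std::printf("%.17g vs %.17g\n", reciprocal(3.0), 1.0 / 3.0);
}

With a better starting approximation (a lookup table giving 12-13 significant bits, as described above), 1-2 refinement steps are enough.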

Why is float division slow?

What are the steps in the algorithm to do floating point division?
Why is the result slower than say, multiplication?
Is it done the same way we do division by hand? By repeatedly subtracting multiples of the divisor to obtain a remainder, aligning the number again and continuing until the remainder is less than a particular value?
Also, why do we gain on performance if instead of doing
a = b / c
we do
d = 1 / c
a = b * d
?
Edit:
Basically I was asking because someone asked me to distribute a value among contenders based on assigned weights. I did all this in integers and was later asked to convert to float, which caused a slowdown in performance. I was just interested in knowing how C or C++ would do these operations in a way that would cause the slowness.
FPU division often basically uses Newton-Raphson (or some other algorithm) to get a reciprocal then multiplies by that reciprocal. That's why the reciprocal operation is slightly faster than the general division operation.
This HP paper (which is actually more understandable than most papers I come across talking about Newton-Raphson) has this to say about floating point division:
Floating point division and square root take considerably longer to compute than addition and multiplication. The latter two are computed directly while the former are usually computed with an iterative algorithm. The most common approach is to use a division-free Newton-Raphson iteration to get an approximation to the reciprocal of the denominator (division) or the reciprocal square root, and then multiply by the numerator (division) or input argument (square root).
From a hardware point of view, division is an iterative algorithm, and the time it takes is proportional to the number of bits. The fastest division currently around uses the radix-4 algorithm, which generates 4 bits of result per iteration; for a 32-bit divide you need at least 8 steps.
Multiplication can be done in parallel to a certain degree. Without going into detail you can break up a large multiplication into several smaller, independent ones. These multiplications can again be broken down until you're at a bit-level, or you stop earlier and use a small lookup-table in hardware. This makes the multiplication hardware heavy from a silicon real estate point of view but very fast as well. It's the classic size/speed tradeoff.
You need log2 steps to combine the parallel computed results, so a 32-bit multiply needs 5 logical steps (if you go down to the minimum). Fortunately these 5 steps are a good deal simpler than the division steps (they're just additions). That means in practice multiplies are even faster.
As described in the Wikipedia article Division algorithm, there are two main approaches to division in computers:
Slow Division
Uses the following recurrence and finds one digit per iteration:
partialRemainder[j+1] = radix * partialRemainder[j] - quotientDigit[n-(j+1)]*denominator
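In radix 2 that recurrence is just restoring division, one quotient bit per iteration: shift in the next dividend bit, try to subtract the denominator, and the success or failure of the subtraction is the quotient digit. A minimal integer sketch (illustrative; hardware SRT dividers use a higher radix and redundant quotient digits, but the structure is the same):

#include <cstdint>
#include <cstdio>

// Restoring division: one quotient bit per iteration (the recurrence above with radix = 2).
void divide(uint32_t dividend, uint32_t divisor, uint32_t& quotient, uint32_t& remainder)
{
    uint64_t rem = 0;
    quotient = 0;
    for (int bit = 31; bit >= 0; --bit)
    {
        rem = (rem << 1) | ((dividend >> bit) & 1);   // radix * partialRemainder + next dividend bit
        if (rem >= divisor)                           // trial subtraction decides the quotient digit
        {
            rem -= divisor;
            quotient |= 1u << bit;
        }
    }
    remainder = (uint32_t)rem;
}

int main()
{
    uint32_t q, r;
    divide(1000, 7, q, r);
    std::printf("%u r %u\n", q, r);   // 142 r 6
}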
Fast Division
Starts with an estimation and converges on the quotient. How accurate you are depends on the number of iterations.
Newton-Raphson division (very briefly):
Calculate estimate of the reciprocal.
Compute more accurate estimates of the reciprocal.
Compute quotient by multiplying the dividend by the reciprocal.
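For example, to compute 7/3 this way: start from a rough reciprocal estimate x = 0.3; each refinement x <- x*(2 - 3x) roughly doubles the number of correct digits (0.3 -> 0.33 -> 0.3333 -> 0.33333333), and the quotient is then obtained with one final multiplication, 7 * 0.33333333 ≈ 2.3333333.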
Think of the hardware involved, and you'll understand a lot better why it takes so much longer to divide than multiply. Both operations are done down at the Floating Point Unit (FPU) level, and even in the world of integral ALUs, the division circuit is a far busier place than a multiplication circuit. I would suspect this is only more painful in the world of floating point, as now the data isn't just least to most significant digit ordered, but is instead ordered by the IEEE 754 standard.
As for the round off, it's really about wherever the signals traveling between the gates get soldered to ground; where that happens, you lose digits. Not rounding, so much as truncation.
Or were you asking about simulating floating point arithmetic using just integers?
You won't gain performance by doing
d = 1 / c
a = b * d
You probably mean:
d = 1 / c
a1 = b1 * d
a2 = b2 * d
This way the division is done only once.
Division is per se slower than multiplication; however, I don't know the details. The basic reason is that, similar to functions such as sin or sqrt, it's just mathematically more complex. IIRC, a multiplication takes about 10 cycles on an average CPU, while a division takes about 50 or more.
How it is actually done was nicely explained by John Mulder.
Float division is not much slower than integer division, but the compiler may be unable to do the same optimizations.
For example, the compiler can replace integer division by 3 with a multiplication and a binary shift.
It can also replace float division by 2.0 with a multiplication by 0.5, but it cannot replace division by 3.0 with a multiplication by 1/3.0, as 1/3.0 cannot be represented exactly using binary numbers, so rounding errors may change the result of the division.
As the compiler doesn't know how sensitive your application is to rounding errors (say you were doing a weather simulation; see the butterfly effect), it cannot do that optimization.
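As an illustration of what the integer optimization looks like (you would not normally write this by hand, the compiler emits it for you), an unsigned 32-bit division by 3 can be replaced by a multiplication with a precomputed "magic" constant and a shift; 0xAAAAAAAB is ceil(2^33 / 3):

#include <cstdint>
#include <cassert>

// x / 3 for 32-bit unsigned x using multiply + shift instead of a divide.
// Since 0xAAAAAAAB = ceil(2^33 / 3), (x * 0xAAAAAAAB) >> 33 equals x / 3 for every 32-bit x.
uint32_t div3(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABull) >> 33);
}

int main()
{
    for (uint32_t x = 0; x < 1000000; ++x)
        assert(div3(x) == x / 3);   // spot check; the identity holds over the full 32-bit range
    return 0;
}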

How do Trigonometric functions work? [closed]

So in high school math, and probably college, we are taught how to use trig functions, what they do, and what kinds of problems they solve. But they have always been presented to me as a black box. If you need the Sine or Cosine of something, you hit the sin or cos button on your calculator and you're set. Which is fine.
What I'm wondering is how trigonometric functions are typically implemented.
First, you have to do some sort of range reduction. Trig functions are periodic, so you need to reduce arguments down to a standard interval. For starters, you could reduce angles to be between 0 and 360 degrees. But by using a few identities, you realize you could get by with less. If you calculate sines and cosines for angles between 0 and 45 degrees, you can bootstrap your way to calculating all trig functions for all angles.
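To make the range reduction concrete, here is a minimal sketch (my own illustration, not how any particular libm does it; a production implementation performs the reduction in extended precision to avoid cancellation for huge arguments): remove whole multiples of pi/2 and pick sin or cos of the small remainder according to the quadrant.

#include <cmath>
#include <cstdio>

// Naive range reduction for sin(x): x = k*(pi/2) + r with |r| <= pi/4,
// then sin(x) is +/-sin(r) or +/-cos(r) depending on k mod 4.
double sin_reduced(double x)
{
    const double half_pi = 1.5707963267948966;
    double k = std::nearbyint(x / half_pi);         // nearest multiple of pi/2
    double r = x - k * half_pi;                     // reduced argument, |r| <= pi/4 (up to rounding)
    switch (((long long)k % 4 + 4) % 4)             // quadrant
    {
        case 0:  return  std::sin(r);               // std::sin/std::cos stand in for the core kernel
        case 1:  return  std::cos(r);
        case 2:  return -std::sin(r);
        default: return -std::cos(r);
    }
}

int main()
{
    std::printf("%.15f vs %.15f\n", sin_reduced(100.0), std::sin(100.0));
}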
Once you've reduced your argument, most chips use a CORDIC algorithm to compute the sines and cosines. You may hear people say that computers use Taylor series. That sounds reasonable, but it's not true. The CORDIC algorithms are much better suited to efficient hardware implementation. (Software libraries may use Taylor series, say on hardware that doesn't support trig functions.) There may be some additional processing, using the CORDIC algorithm to get fairly good answers but then doing something else to improve accuracy.
There are some refinements to the above. For example, for very small angles theta (in radians), sin(theta) = theta to all the precision you have, so it's more efficient to simply return theta than to use some other algorithm. So in practice there is a lot of special case logic to squeeze out all the performance and accuracy possible. Chips with smaller markets may not go to as much optimization effort.
edit: Jack Ganssle has a decent discussion in his book on embedded systems, "The Firmware Handbook".
FYI: If you have accuracy and performance constraints, Taylor series should not be used to approximate functions for numerical purposes. (Save them for your Calculus courses.) They make use of the analyticity of a function at a single point, e.g. the fact that all its derivatives exist at that point. They don't necessarily converge in the interval of interest. Often they do a lousy job of distributing the function approximation's accuracy in order to be "perfect" right near the evaluation point; the error generally zooms upwards as you get away from it. And if you have a function with any noncontinuous derivative (e.g. square waves, triangle waves, and their integrals), a Taylor series will give you the wrong answer.
The best "easy" solution, when using a polynomial of maximum degree N to approximate a given function f(x) over an interval x0 < x < x1, is from Chebyshev approximation; see Numerical Recipes for a good discussion. Note that the Tj(x) and Tk(x) in the Wolfram article I linked to used the cos and inverse cosine, these are polynomials and in practice you use a recurrence formula to get the coefficients. Again, see Numerical Recipes.
edit: Wikipedia has a semi-decent article on approximation theory. One of the sources they cite (Hart, "Computer Approximations") is out of print (& used copies tend to be expensive) but goes into a lot of detail about stuff like this. (Jack Ganssle mentions this in issue 39 of his newsletter The Embedded Muse.)
edit 2: Here are some tangible error numbers (see below) for Taylor vs. Chebyshev approximations of sin(x). Some important points to note:
The maximum error of a Taylor series approximation over a given range is much larger than the maximum error of a Chebyshev approximation of the same degree. (For about the same error, you can get away with one fewer term with Chebyshev, which means faster performance.)
Range reduction is a huge win. This is because the contribution of higher order polynomials shrinks down when the interval of the approximation is smaller.
If you can't get away with range reduction, your coefficients need to be stored with more precision.
Don't get me wrong: Taylor series will work properly for sine/cosine (with reasonable precision for the range -pi/2 to +pi/2; technically, with enough terms, you can reach any desired precision for all real inputs, but try to calculate cos(100) using Taylor series and you can't do it unless you use arbitrary-precision arithmetic). If I were stuck on a desert island with a nonscientific calculator, and I needed to calculate sine and cosine, I would probably use Taylor series since the coefficients are easy to remember. But the real world applications for having to write your own sin() or cos() functions are rare enough that you'd be best off using an efficient implementation to reach a desired accuracy -- which the Taylor series is not.
Range = -pi/2 to +pi/2, degree 5 (3 terms)
Taylor: max error around 4.5e-3, f(x) = x - x^3/6 + x^5/120
Chebyshev: max error around 7e-5, f(x) = 0.9996949x - 0.1656700x^3 + 0.0075134x^5
Range = -pi/2 to +pi/2, degree 7 (4 terms)
Taylor: max error around 1.5e-4, f(x) = x - x^3/6 + x^5/120 - x^7/5040
Chebyshev: max error around 6e-7, f(x) = 0.99999660x - 0.16664824x^3 + 0.00830629x^5 - 0.00018363x^7
Range = -pi/4 to +pi/4, degree 3 (2 terms)
Taylor: max error around 2.5e-3, f(x) = x - x^3/6
Chebyshev: max error around 1.5e-4, f(x) = 0.999x - 0.1603x^3
Range = -pi/4 to +pi/4, degree 5 (3 terms)
Taylor: max error around 3.5e-5, f(x) = x - x^3/6 + x^5/120
Chebyshev: max error around 6e-7, f(x) = 0.999995x - 0.1666016x^3 + 0.0081215x^5
Range = -pi/4 to +pi/4, degree 7 (4 terms)
Taylor: max error around 3e-7, f(x) = x - x^3/6 + x^5/120 - x^7/5040
Chebyshev: max error around 1.2e-9, f(x) = 0.999999986x - 0.166666367x^3 + 0.008331584x^5 - 0.000194621x^7
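For instance, the degree-7 coefficients quoted above for the -pi/2 to +pi/2 range drop straight into a Horner evaluation; a minimal sketch that also measures the maximum error:

#include <cmath>
#include <cstdio>

// sin(x) on [-pi/2, pi/2] using the degree-7 coefficients from the table above.
double sin_approx(double x)
{
    double x2 = x * x;
    return x * (0.99999660 + x2 * (-0.16664824 + x2 * (0.00830629 + x2 * (-0.00018363))));
}

int main()
{
    double maxErr = 0.0;
    for (double x = -1.5707963; x <= 1.5707963; x += 1e-4)
        maxErr = std::fmax(maxErr, std::fabs(sin_approx(x) - std::sin(x)));
    std::printf("max error: %g\n", maxErr);   // should come out around 6e-7
}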
I believe they're calculated using Taylor Series or CORDIC. Some applications which make heavy use of trig functions (games, graphics) construct trig tables when they start up so they can just look up values rather than recalculating them over and over.
Check out the Wikipedia article on trig functions. A good place to learn about actually implementing them in code is Numerical Recipes.
I'm not much of a mathematician, but my understanding of where sin, cos, and tan "come from" is that they are, in a sense, observed when you're working with right-angle triangles. If you take measurements of the lengths of sides of a bunch of different right-angle triangles and plot the points on a graph, you can get sin, cos, and tan out of that. As Harper Shelby points out, the functions are simply defined as properties of right-angle triangles.
A more sophisticated understanding is achieved by understanding how these ratios relate to the geometry of circle, which leads to radians and all of that goodness. It's all there in the Wikipedia entry.
Most commonly for computers, a power series representation is used to calculate sines and cosines, and these are used for the other trig functions. Expanding these series out to about 8 terms computes the values needed to an accuracy close to the machine epsilon (the gap between 1.0 and the next representable floating point number).
The CORDIC method is faster since it is implemented on hardware, but it is primarily used for embedded systems and not standard computers.
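A minimal sketch (mine, not from the answer) of what "expanding the series out to about 8 terms" looks like for sin; it is only sensible after range reduction, roughly |x| <= pi/2:

#include <cmath>
#include <cstdio>

// Maclaurin series for sin: x - x^3/3! + x^5/5! - ..., truncated to 8 terms.
double sin_series(double x)
{
    double term = x, sum = x;
    for (int k = 1; k < 8; ++k)
    {
        term *= -x * x / ((2 * k) * (2 * k + 1));   // next odd-power term from the previous one
        sum += term;
    }
    return sum;
}

int main()
{
    std::printf("%.17g vs %.17g\n", sin_series(1.0), std::sin(1.0));
}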
I would like to extend the answer provided by @Jason S. Using a domain subdivision method similar to that described by @Jason S and using Maclaurin series approximations, an average 2-3x speedup over the tan(), sin(), cos(), atan(), asin(), and acos() functions built into the gcc compiler with -O3 optimization was achieved. The best Maclaurin series approximating functions described below achieved double precision accuracy.
For the tan(), sin(), and cos() functions, and for simplicity, an overlapping 0 to 2pi+pi/80 domain was divided into 81 equal intervals with "anchor points" at pi/80, 3pi/80, ..., 161pi/80. Then tan(), sin(), and cos() of these 81 anchor points were evaluated and stored. With the help of trig identities, a single Maclaurin series function was developed for each trig function. Any angle between ±infinity may be submitted to the trig approximating functions because the functions first translate the input angle to the 0 to 2pi domain. This translation overhead is included in the approximation overhead.
Similar methods were developed for the atan(), asin(), and acos() functions, where an overlapping -1.0 to 1.1 domain was divided into 21 equal intervals with anchor points at -19/20, -17/20, ..., 19/20, 21/20. Then only atan() of these 21 anchor points was stored. Again, with the help of inverse trig identities, a single Maclaurin series function was developed for the atan() function. Results of the atan() function were then used to approximate asin() and acos().
Since all inverse trig approximating functions are based on the atan() approximating function, any double-precision argument input value is allowed. However the argument input to the asin() and acos() approximating functions is truncated to the ±1 domain because any value outside it is meaningless.
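Here is a minimal sketch of the anchor-point idea for sin (my reconstruction of the description above, not the author's code; with the 3-term kernels used here the accuracy is roughly 12 digits, in line with the smallest-term results quoted below, rather than the full double precision of the author's best versions):

#include <cmath>
#include <cstdio>

static const double PI = 3.14159265358979323846;
static double SIN_A[81], COS_A[81];

// Precompute sin and cos at the 81 anchor points pi/80, 3pi/80, ..., 161pi/80.
static void init_anchors()
{
    for (int k = 0; k < 81; ++k)
    {
        double a = (2 * k + 1) * PI / 80.0;
        SIN_A[k] = std::sin(a);
        COS_A[k] = std::cos(a);
    }
}

// sin(x) = sin(a)cos(h) + cos(a)sin(h), with a the nearest anchor and |h| <= pi/80,
// using short Maclaurin series for sin(h) and cos(h).
double sin_anchored(double x)
{
    x = std::fmod(x, 2 * PI); if (x < 0) x += 2 * PI;   // translate the angle into [0, 2pi)
    int k = (int)(x / (PI / 40.0));                     // index of the nearest anchor
    double h  = x - (2 * k + 1) * PI / 80.0;            // small offset from the anchor
    double h2 = h * h;
    double sh = h * (1 - h2 / 6 * (1 - h2 / 20));       // 3-term Maclaurin for sin(h)
    double ch = 1 - h2 / 2 * (1 - h2 / 12);             // 3-term Maclaurin for cos(h)
    return SIN_A[k] * ch + COS_A[k] * sh;               // angle-addition identity
}

int main()
{
    init_anchors();
    std::printf("%.15f vs %.15f\n", sin_anchored(12.345), std::sin(12.345));
}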
To test the approximating functions, a billion random function evaluations were forced to be evaluated (that is, the -O3 optimizing compiler was not allowed to bypass evaluating something because some computed result would not be used.) To remove the bias of evaluating a billion random numbers and processing the results, the cost of a run without evaluating any trig or inverse trig function was performed first. This bias was then subtracted off each test to obtain a more representative approximation of actual function evaluation time.
Table 2. Time spent in seconds executing the indicated function or functions one billion times. The estimates are obtained by subtracting the time cost of evaluating one billion random numbers shown in the first row of Table 1 from the remaining rows in Table 1.
Time spent in tan(): 18.0515 18.2545
Time spent in TAN3(): 5.93853 6.02349
Time spent in TAN4(): 6.72216 6.99134
Time spent in sin() and cos(): 19.4052 19.4311
Time spent in SINCOS3(): 7.85564 7.92844
Time spent in SINCOS4(): 9.36672 9.57946
Time spent in atan(): 15.7160 15.6599
Time spent in ATAN1(): 6.47800 6.55230
Time spent in ATAN2(): 7.26730 7.24885
Time spent in ATAN3(): 8.15299 8.21284
Time spent in asin() and acos(): 36.8833 36.9496
Time spent in ASINCOS1(): 10.1655 9.78479
Time spent in ASINCOS2(): 10.6236 10.6000
Time spent in ASINCOS3(): 12.8430 12.0707
(In the interest of saving space, Table 1 is not shown.) Table 2 shows the results of two separate runs of a billion evaluations of each approximating function. The first column is the first run and the second column is the second run. The numbers '1', '2', '3' or '4' in the function names indicate the number of terms used in the Maclaurin series function to evaluate the particular trig or inverse trig approximation. SINCOS#() means that both sin and cos were evaluated at the same time. Likewise, ASINCOS#() means both asin and acos were evaluated at the same time. There is little extra overhead in evaluating both quantities at the same time.
The results show that increasing the number of terms slightly increases execution time as would be expected. Even the smallest number of terms gave around 12-14 digit accuracy everywhere except for the tan() approximation near where its value approaches ±infinity. One would expect even the tan() function to have problems there.
Similar results were obtained on a high-end MacBook Pro laptop in Unix and on a high-end desktop computer in Linux.
If you're asking for a more physical explanation of sin, cos, and tan, consider how they relate to right-angle triangles. The actual numeric value of cos(lambda) can be found by forming a right-angle triangle with one of the angles being lambda and dividing the length of the triangle's side adjacent to lambda by the length of the hypotenuse. Similarly, for sin use the opposite side divided by the hypotenuse. For tangent use the opposite side divided by the adjacent side. The classic mnemonic to remember this is SOHCAHTOA (pronounced socatoa).

Resources