Why is math.Pow performance worse than bitshifting? - go

When solving this exercise on Exercism website, I have used the standard math.Pow package function to get the raising powers of two.
return uint64(math.Pow(2, float64(n-1)))
After checking the community solutions, I found a solution using bit shifting to achieve the same thing:
return uint64(1 << uint(n-1)), nil
What surprised me is that there is a big performance difference between the two:
bit-shifting
math-pow
I thought that the Go compiler would recognize that math.Pow uses a constant 2 as the base and just utilize bit shifting on its own, without me explicitly doing it so. The only other difference I can see is the conversion of the float64 and that math.Pow is operating on floats and not on integers.
Why doesn't the compiler optimize the power operation to achieve performance similar to bit shifting?

First, note that uint64(1) << (n-1) is a better version of the expression uint64(1 << uint(n-1)) that appears in your question. The expression 1<<n is an int, so valid shift values are between 0 and either 30 or 62 depending on the size of int. uint64(1) << n allows n between 0 and 63.
In general, the optimization you suggest is incorrect. The compiler would have to be able to deduce that n is within a particular range.
See this example (on playground)
package main
import (
"fmt"
"math"
)
func main() {
n := 65
fmt.Println(uint64(math.Pow(2, float64(n-1))))
fmt.Println(uint64(1) << uint(n-1))
}
The output demonstrates that the two methods are different:
9223372036854775808
0

math.Pow() is implemented to operate on float64 numbers. Bit shifting to calculate powers of 2 can only be applied on integers, and only on a tiny subset where the result fits into int64 (or uint64).
If you have such special case, you're more than welcome to use bit shifting.
Any other case where the result is bigger than math.MaxInt64 or where the base is not 2 (or a power of 2) requires floating point arithmetic.
Also note that even if detection of the above possible tiny subset would be implemented, the result is in 2's complement format, which also would have to be converted to IEEE 754 format because the return value of math.Pow() is float64 (although these numbers could be cached), which again you'd most likely convert back to int64.
Again: if you need performance, use explicit bit shifting.

Because such optimization was never implemented.
The go compiler aims to have fast compile time. Thus, some optimizations are decided to be not worth it. This saves compilation time at the expense of some run-time.

Related

What is the purpose of arbitrary precision constants in Go?

Go features untyped exact numeric constants with arbitrary size and precision. The spec requires all compilers to support integers to at least 256 bits, and floats to at least 272 bits (256 bits for the mantissa and 16 bits for the exponent). So compilers are required to faithfully and exactly represent expressions like this:
const (
PI = 3.1415926535897932384626433832795028841971
Prime256 = 84028154888444252871881479176271707868370175636848156449781508641811196133203
)
This is interesting...and yet I cannot find any way to actually use any such constant that exceeds the maximum precision of the 64 bit concrete types int64, uint64, float64, complex128 (which is just a pair of float64 values). Even the standard library big number types big.Int and big.Float cannot be initialized from large numeric constants -- they must instead be deserialized from string constants or other expressions.
The underlying mechanics are fairly obvious: the constants exist only at compile time, and must be coerced to some value representable at runtime to be used at runtime. They are a language construct that exists only in code and during compilation. You cannot retrieve the raw value of a constant at runtime; it is is not stored at some address in the compiled program itself.
So the question remains: Why does the language make such a point of supporting enormous constants when they cannot be used in practice?
TLDR; Go's arbitrary precision constants give you the possibility to work with "real" numbers and not with "boxed" numbers, so "artifacts" like overflow, underflow, infinity corner cases are relieved. You have the possibility to work with higher precision, and only the result have to be converted to limited-precision, mitigating the effect of intermediate errors.
The Go Blog: Constants: (emphasizes are mine answering your question)
Numeric constants live in an arbitrary-precision numeric space; they are just regular numbers. But when they are assigned to a variable the value must be able to fit in the destination. We can declare a constant with a very large value:
const Huge = 1e1000
—that's just a number, after all—but we can't assign it or even print it. This statement won't even compile:
fmt.Println(Huge)
The error is, "constant 1.00000e+1000 overflows float64", which is true. But Huge might be useful: we can use it in expressions with other constants and use the value of those expressions if the result can be represented in the range of a float64. The statement,
fmt.Println(Huge / 1e999)
prints 10, as one would expect.
In a related way, floating-point constants may have very high precision, so that arithmetic involving them is more accurate. The constants defined in the math package are given with many more digits than are available in a float64. Here is the definition of math.Pi:
Pi = 3.14159265358979323846264338327950288419716939937510582097494459
When that value is assigned to a variable, some of the precision will be lost; the assignment will create the float64 (or float32) value closest to the high-precision value. This snippet
pi := math.Pi
fmt.Println(pi)
prints 3.141592653589793.
Having so many digits available means that calculations like Pi/2 or other more intricate evaluations can carry more precision until the result is assigned, making calculations involving constants easier to write without losing precision. It also means that there is no occasion in which the floating-point corner cases like infinities, soft underflows, and NaNs arise in constant expressions. (Division by a constant zero is a compile-time error, and when everything is a number there's no such thing as "not a number".)
See related: How does Go perform arithmetic on constants?

64 bit integer and 64 bit float homogeneous representation

Assume we have some sequence as input. For performance reasons we may want to convert it in homogeneous representation. And in order to transform it into homogeneous representation we are trying to convert it to same type. Here lets consider only 2 types in input - int64 and float64 (in my simple code I will use numpy and python; it is not the matter of this question - one may think only about 64-bit integer and 64-bit floats).
First we may try to cast everything to float64.
So we want something like so as input:
31 1.2 -1234
be converted to float64. If we would have all int64 we may left it unchanged ("already homogeneous"), or if something else was found we would return "not homogeneous". Pretty straightforward.
But here is the problem. Consider a bit modified input:
31000000 1.2 -1234
Idea is clear - we need to check that our "caster" is able to handle large by absolute value int64 properly:
format(np.float64(31000000), '.0f') # just convert to float64 and print
'31000000'
Seems like not a problem at all. So lets go to the deal right away:
im = np.iinfo(np.int64).max # maximum of int64 type
format(np.float64(im), '.0f')
format(np.float64(im-100), '.0f')
'9223372036854775808'
'9223372036854775808'
Now its really undesired - we lose some information which maybe needed. I.e. we want to preserve all the information provided in the input sequence.
So our im and im-100 values cast to the same float64 representation. The reason of this is clear - float64 has only 53 significand of total 64 bits. That is why its precision enough to represent log10(2^53) ~= 15.95 i.e. about all 16-length int64 without any information loss. But int64 type contains up to 19 digits.
So we end up with about [10^16; 10^19] (more precisely [10^log10(53); int64.max]) range in which each int64 may be represented with information loss.
Q: What decision in such situation should one made in order to represent int64 and float64 homogeneously.
I see several options for now:
Just convert all int64 range to float64 and "forget" about possible information loss.
Motivation here is "majority of input barely will be > 10^16 int64 values".
EDIT: This clause was misleading. In clear formulation we don't consider such solutions (but left it for completeness).
Do not make such automatic conversions at all. Only if explicitly specified.
I.e. we agree with performance drawbacks. For any int-float arrays. Even with ones as in simplest 1st case.
Calculate threshold for performing conversion to float64 without possible information loss. And use it while making casting decision. If int64 above this threshold found - do not convert (return "not homogeneous").
We've already calculate this threshold. It is log10(2^53) rounded.
Create new type "fint64". This is an exotic decision but I'm considering even this one for completeness.
Motivation here consists of 2 points. First one: it is frequent situation when user wants to store int and float types together. Second - is structure of float64 type. I'm not quite understand why one will need ~308 digits value range if significand consists only of ~16 of them and other ~292 is itself a noise. So we might use one of float64 exponent bits to indicate whether its float or int is stored here. But for int64 it would be definitely drawback to lose 1 bit. Cause would reduce our integer range twice. But we would gain possibility freely store ints along with floats without any additional overhead.
EDIT: While my initial thinking of this was as "exotic" decision in fact it is just a variant of another solution alternative - composite type for our representation (see 5 clause). But need to add here that my 1st composition has definite drawback - losing some range for float64 and for int64. What we rather do - is not to subtract 1 bit but add one bit which represents a flag for int or float type stored in following 64 bits.
As proposed #Brendan one may use composite type consists of "combination of 2 or more primitive types". So using additional primitives we may cover our "problem" range for int64 for example and get homogeneous representation in this "new" type.
EDITs:
Because here question arisen I need to try be very specific: Devised application in question do following thing - convert sequence of int64 or float64 to some homogeneous representation lossless if possible. The solutions are compared by performance (e.g. total excessive RAM needed for representation). That is all. No any other requirements is considered here (cause we should consider a problem in its minimal state - not writing whole application). Correspondingly algo that represents our data in homogeneous state lossless (we are sure we not lost any information) fits into our app.
I've decided to remove words "app" and "user" from question - it was also misleading.
When choosing a data type there are 3 requirements:
if values may have different signs
needed precision
needed range
Of course hardware doesn't provide a lot of types to choose from; so you'll need to select the next largest provided type. For example, if you want to store values ranging from 0 to 500 with 8 bits of precision; then hardware won't provide anything like that and you will need to use either 16-bit integer or 32-bit floating point.
When choosing a homogeneous representation there are 3 requirements:
if values may have different signs; determined from the requirements from all of the original types being represented
needed precision; determined from the requirements from all of the original types being represented
needed range; determined from the requirements from all of the original types being represented
For example, if you have integers from -10 to +10000000000 you need a 35 bit integer type that doesn't exist so you'll use a 64-bit integer, and if you need floating point values from -2 to +2 with 31 bits of precision then you'll need a 33 bit floating point type that doesn't exist so you'll use a 64-bit floating point type; and from the requirements of these two original types you'll know that a homogeneous representation will need a sign flag, a 33 bit significand (with an implied bit), and a 1-bit exponent; which doesn't exist so you'll use a 64-bit floating point type as the homogeneous representation.
However; if you don't know anything about the requirements of the original data types (and only know that whatever the requirements were they led to the selection of a 64-bit integer type and a 64-bit floating point type), then you'll have to assume "worst cases". This leads to needing a homogeneous representation that has a sign flag, 62 bits of precision (plus an implied 1 bit) and an 8 bit exponent. Of course this 71 bit floating point type doesn't exist, so you need to select the next largest type.
Also note that sometimes there is no "next largest type" that hardware supports. When this happens you need to resort to "composed types" - a combination of 2 or more primitive types. This can include anything up to and including "big rational numbers" (numbers represented by 3 big integers in "numerator / divisor * (1 << exponent)" form).
Of course if the original types (the 64-bit integer type and 64-bit floating point type) were primitive types and your homogeneous representation needs to use a "composed type"; then your "for performance reasons we may want to convert it in homogeneous representation" assumption is likely to be false (it's likely that, for performance reasons, you want to avoid using a homogeneous representation).
In other words:
If you don't know anything about the requirements of the original data types, it's likely that, for performance reasons, you want to avoid using a homogeneous representation.
Now...
Let's rephrase your question as "How to deal with design failures (choosing the wrong types which don't meet requirements)?". There is only one answer, and that is to avoid the design failure. Run-time checks (e.g. throwing an exception if the conversion to the homogeneous representation caused precision loss) serve no purpose other than to notify developers of design failures.
It is actually very basic: use 64 bits floating point. Floating point is an approximation, and you will loose precision for many ints. But there are no uncertainties other than "might this originally have been integral" and "does the original value deviates more than 1.0".
I know of one non-standard floating point representation that would be more powerfull (to be found in the net). That might (or might not) help cover the ints.
The only way to have an exact int mapping, would be to reduce the int range, and guarantee (say) 60 bits ints to be precise, and the remaining range approximated by floating point. Floating point would have to be reduced too, either exponential range as mentioned, or precision (the mantissa).

When is it useful to compare floating-point values for equality?

I seem to see people asking all the time around here questions about comparing floating-point numbers. The canonical answer is always: just see if the numbers are within some small number of each other…
So the question is this: Why would you ever need to know if two floating point numbers are equal to each other?
In all my years of coding I have never needed to do this (although I will admit that even I am not omniscient). From my point of view, if you are trying to use floating point numbers for and for some reason want to know if the numbers are equal, you should probably be using an integer type (or a decimal type in languages that support it). Am I just missing something?
A few reasons to compare floating-point numbers for equality are:
Testing software. Given software that should conform to a precise specification, exact results might be known or feasibly computable, so a test program would compare the subject software’s results to the expected results.
Performing exact arithmetic. Carefully designed software can perform exact arithmetic with floating-point. At its simplest, this may simply be integer arithmetic. (On platforms which provide IEEE-754 64-bit double-precision floating-point but only 32-bit integer arithmetic, floating-point arithmetic can be used to perform 53-bit integer arithmetic.) Comparing for equality when performing exact arithmetic is the same as comparing for equality with integer operations.
Searching sorted or structured data. Floating-point values can be used as keys for searching, in which case testing for equality is necessary to determine that the sought item has been found. (There are issues if NaNs may be present, since they report false for any order test.)
Avoiding poles and discontinuities. Functions may have special behaviors at certain points, the most obvious of which is division. Software may need to test for these points and divert execution to alternate methods.
Note that only the last of these tests for equality when using floating-point arithmetic to approximate real arithmetic. (This list of examples is not complete, so I do not expect this is the only such use.) The first three are special situations. Usually when using floating-point arithmetic, one is approximating real arithmetic and working with mostly continuous functions. Continuous functions are “okay” for working with floating-point arithmetic because they transmit errors in “normal” ways. For example, if your calculations so far have produced some a' that approximates an ideal mathematical result a, and you have a b' that approximates an ideal mathematical result b, then the computed sum a'+b' will approximate a+b.
Discontinuous functions, on the other hand, can disrupt this behavior. For example, if we attempt to round a number to the nearest integer, what happens when a is 3.49? Our approximation a' might be 3.48 or 3.51. When the rounding is computed, the approximation may produce 3 or 4, turning a very small error into a very large error. When working with discontinuous functions in floating-point arithmetic, one has to be careful. For example, consider evaluating the quadratic formula, (−b±sqrt(b2−4ac))/(2a). If there is a slight error during the calculations for b2−4ac, the result might be negative, and then sqrt will return NaN. So software cannot simply use floating-point arithmetic as if it easily approximated real arithmetic. The programmer must understand floating-point arithmetic and be wary of the pitfalls, and these issues and their solutions can be specific to the particular software and application.
Testing for equality is a discontinuous function. It is a function f(a, b) that is 0 everywhere except along the line a=b. Since it is a discontinuous function, it can turn small errors into large errors—it can report as equal numbers that are unequal if computed with ideal mathematics, and it can report as unequal numbers that are equal if computed with ideal mathematics.
With this view, we can see testing for equality is a member of a general class of functions. It is not any more special than square root or division—it is continuous in most places but discontinuous in some, and so its use must be treated with care. That care is customized to each application.
I will relate one place where testing for equality was very useful. We implement some math library routines that are specified to be faithfully rounded. The best quality for a routine is that it is correctly rounded. Consider a function whose exact mathematical result (for a particular input x) is y. In some cases, y is exactly representable in the floating-point format, in which case a good routine will return y. Often, y is not exactly representable. In this case, it is between two numbers representable in the floating-point format, some numbers y0 and y1. If a routine is correctly rounded, it returns whichever of y0 and y1 is closer to y. (In case of a tie, it returns the one with an even low digit. Also, I am discussing only the round-to-nearest ties-to-even mode.)
If a routine is faithfully rounded, it is allowed to return either y0 or y1.
Now, here is the problem we wanted to solve: We have some version of a single-precision routine, say sin0, that we know is faithfully rounded. We have a new version, sin1, and we want to test whether it is faithfully rounded. We have multiple-precision software that can evaluate the mathematical sin function to great precision, so we can use that to check whether the results of sin1 are faithfully rounded. However, the multiple-precision software is slow, and we want to test all four billion inputs. sin0 and sin1 are both fast, but sin1 is allowed to have outputs different from sin0, because sin1 is only required to be faithfully rounded, not to be the same as sin0.
However, it happens that most of the sin1 results are the same as sin0. (This is partly a result of how math library routines are designed, using some extra precision to get a very close result before using a few final arithmetic operations to deliver the final result. That tends to get the correctly rounded result most of the time but sometimes slips to the next nearest value.) So what we can do is this:
For each input, calculate both sin0 and sin1.
Compare the results for equality.
If the results are equal, we are done. If they are not, use the extended precision software to test whether the sin1 result is faithfully rounded.
Again, this is a special case for using floating-point arithmetic. But it is one where testing for equality serves very well; the final test program runs in a few minutes instead of many hours.
The only time I needed, it was to check if the GPU was IEEE 754 compliant.
It was not.
Anyway I haven't used a comparison with a programming language. I just run the program on the CPU and on the GPU producing some binary output (no literals) and compared the outputs with a simple diff.
There are plenty possible reasons.
Since I know Squeak/Pharo Smalltalk better, here are a few trivial examples taken out of it (it relies on strict IEEE 754 model):
Float>>isFinite
"simple, byte-order independent test for rejecting Not-a-Number and (Negative)Infinity"
^(self - self) = 0.0
Float>>isInfinite
"Return true if the receiver is positive or negative infinity."
^ self = Infinity or: [self = NegativeInfinity]
Float>>predecessor
| ulp |
self isFinite ifFalse: [
(self isNaN or: [self negative]) ifTrue: [^self].
^Float fmax].
ulp := self ulp.
^self - (0.5 * ulp) = self
ifTrue: [self - ulp]
ifFalse: [self - (0.5 * ulp)]
I'm sure that you would find some more involved == if you open some libm implementation and check... Unfortunately, I don't know how to search == thru github web interface, but manually I found this example in julia libm (a variant of fdlibm)
https://github.com/JuliaLang/openlibm/blob/master/src/s_remquo.c
remquo(double x, double y, int *quo)
{
...
fixup:
INSERT_WORDS(x,hx,lx);
y = fabs(y);
if (y < 0x1p-1021) {
if (x+x>y || (x+x==y && (q & 1))) {
q++;
x-=y;
}
} else if (x>0.5*y || (x==0.5*y && (q & 1))) {
q++;
x-=y;
}
GET_HIGH_WORD(hx,x);
SET_HIGH_WORD(x,hx^sx);
q &= 0x7fffffff;
*quo = (sxy ? -q : q);
return x;
Here, the remainder function answer a result x between -y/2 and y/2. If it is exactly y/2, then there are 2 choices (a tie)... The == test in fixup is here to test the case of exact tie (resolved so as to always have an even quotient).
There are also a few ==zero tests, for example in __ieee754_logf (test for trivial case log(1)) or __ieee754_rem_pio2 (modulo pi/2 used for trigonometric functions).

Help with debugging unexpected takeWhile behaviour with large numbers in Haskell

Firstly, apologies for the vague title, but I'm not sure exactly what I'm asking here(!).
After encountering Haskell at university, I've recently started using it in anger and so am working through the Project Euler problems as an extended Hello World, really. I've encountered a bug in one of my answers that seems to suggest a misunderstanding of a fundamental part of the language, and it's not something I could work out from the tutorials, nor something I know enough about to start Googling for.
A brief description of the issue itself - the solution relates to primes, so I wanted an infinite list of prime numbers which I implemented (without optimisation yet!) thusly:
isPrime :: Int -> Bool
isPrime n = isPrime' 2 n
where isPrime' p n | p >= n = True
| n `mod` p == 0 = False
| otherwise = isPrime' (p+1) n
primes :: [Int]
primes = filter isPrime [2..]
Since infinite lists can be a little tedious to evaluate, I'll of course be using lazy evaluation to ensure that just the bits I want get evaulatued. So, for example, I can ask GHCI for the prime numbers less than 100:
*Main> takeWhile (< 100) primes
[2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97]
Now here's the part that I don't understand at all - when the upper limit gets large enough, I get no answers back at all. In particular:
*Main> takeWhile (< 4000000000) primes
[]
This isn't a problem with takeWhile itself, or the partially-applied function, as takeWhile (< 4000000000) [2..] works as I would expect. It's not a problem with my use of filter (within the definition of primes), since takeWhile (< 4000000000) (filter even [2..]) also returns the expected result.
Through binary search I found that the greatest upper limit that works is 2^31 - 1, so this would certainly seem to imply some kind of space-based constraint (i.e. largest positive signed integer). However:
I was of the impression that Haskell had no language limits on the size of integers, and they were bounded only by the amount of free memory.
This number only appears in the less-than predicate, which I know works as expected in at least some cases. Surely when it's applied to elements of a list, it shouldn't care where they come from? Looking solely at the first element, I know that the predicate returns true for the input 2 when it comes from filter even [2..]; I know that primes returns 2 as its first element. So how can my list be empty, how is this predicate failing "for some values of 2"?
Any thoughts would be grateful as I don't have enough experience to know where to start with this one. Thanks for taking the time to take a look.
There are 2 built-in integral types in haskell: Int and Integer. Integer is the default and is unbounded. Int however is bounded. Since you're explicitly using Int in the type for isPrime 4000000000 is used as an Int and overflows. If you change the type of isPrime to Integer -> Bool or even better Integral a => a -> Bool (read: a function that can take any kind of Integral value and returns a Bool), it will work as expected.
The important thing to take away here (other than the difference between Int and Integer) is that the type of 4000000000 depends on how it is used. If it is used as an argument to a function that takes an Int, it will be an Int (and on 32-bit systems it will overflow). If it is used as an argument to a function that takes an Integer, it will be an Integer (and never overflow). If it is used as an argument to a function that takes any kind of Integral, it will also be an Integer because Integer is the default instance of Integral.
That's an easy answer (...which I see has already been partly answered) - "premature specialization".
The first part of your definition, the type signature, specifies:
isPrime :: Int -> Bool
An Int is not just a "shortcut" way to say Integer - they are different types! To be a nit-picker (which in turn invites every one else to tear apart the many places here, where I am not accurate), there are never "different values of 2" - it has to be of type Int, because that's how you specified the function (you compare 2 to the function's argument n and you're only allowed to compare values of the same type, so your 2 is "pinned down" to the Int type.
Oh, and just as a warning, the Int type is a type just rife with corner case potential. If your system is built in a 64-bit environment, then your Int will also be based on a 64-bit representation, and your example will work up to 2^63-1, instead of 2^31-1 as yours did. Note my phrasing: I have a 64-bit computer with an MS Windows OS, which means that there is not yet an official 64-bit MinGW toolchain - my OS is 64-bit, but the GHC version I have was compiled with 32-bit libraries, so it has 32-bit-based Ints. When I use Linux, even in a VM, it has a 64-bit toolchain, so Ints are 64 bits. If you had used one of those, you may not have even noticed the behavior!
So, I guess that's just one more reason to be careful when reasoning about your types. (Especially in Haskell, anyway....)

Algorithm to find a common multiplier to convert decimal numbers to whole numbers

I have an array of numbers that potentially have up to 8 decimal places and I need to find the smallest common number I can multiply them by so that they are all whole numbers. I need this so all the original numbers can all be multiplied out to the same scale and be processed by a sealed system that will only deal with whole numbers, then I can retrieve the results and divide them by the common multiplier to get my relative results.
Currently we do a few checks on the numbers and multiply by 100 or 1,000,000, but the processing done by the *sealed system can get quite expensive when dealing with large numbers so multiplying everything by a million just for the sake of it isn’t really a great option. As an approximation lets say that the sealed algorithm gets 10 times more expensive every time you multiply by a factor of 10.
What is the most efficient algorithm, that will also give the best possible result, to accomplish what I need and is there a mathematical name and/or formula for what I’m need?
*The sealed system isn’t really sealed. I own/maintain the source code for it but its 100,000 odd lines of proprietary magic and it has been thoroughly bug and performance tested, altering it to deal with floats is not an option for many reasons. It is a system that creates a grid of X by Y cells, then rects that are X by Y are dropped into the grid, “proprietary magic” occurs and results are spat out – obviously this is an extremely simplified version of reality, but it’s a good enough approximation.
So far there are quiet a few good answers and I wondered how I should go about choosing the ‘correct’ one. To begin with I figured the only fair way was to create each solution and performance test it, but I later realised that pure speed wasn’t the only relevant factor – an more accurate solution is also very relevant. I wrote the performance tests anyway, but currently the I’m choosing the correct answer based on speed as well accuracy using a ‘gut feel’ formula.
My performance tests process 1000 different sets of 100 randomly generated numbers.
Each algorithm is tested using the same set of random numbers.
Algorithms are written in .Net 3.5 (although thus far would be 2.0 compatible)
I tried pretty hard to make the tests as fair as possible.
Greg – Multiply by large number
and then divide by GCD – 63
milliseconds
Andy – String Parsing
– 199 milliseconds
Eric – Decimal.GetBits – 160 milliseconds
Eric – Binary search – 32
milliseconds
Ima – sorry I couldn’t
figure out a how to implement your
solution easily in .Net (I didn’t
want to spend too long on it)
Bill – I figure your answer was pretty
close to Greg’s so didn’t implement
it. I’m sure it’d be a smidge faster
but potentially less accurate.
So Greg’s Multiply by large number and then divide by GCD” solution was the second fastest algorithm and it gave the most accurate results so for now I’m calling it correct.
I really wanted the Decimal.GetBits solution to be the fastest, but it was very slow, I’m unsure if this is due to the conversion of a Double to a Decimal or the Bit masking and shifting. There should be a
similar usable solution for a straight Double using the BitConverter.GetBytes and some knowledge contained here: http://blogs.msdn.com/bclteam/archive/2007/05/29/bcl-refresher-floating-point-types-the-good-the-bad-and-the-ugly-inbar-gazit-matthew-greig.aspx but my eyes just kept glazing over every time I read that article and I eventually ran out of time to try to implement a solution.
I’m always open to other solutions if anyone can think of something better.
I'd multiply by something sufficiently large (100,000,000 for 8 decimal places), then divide by the GCD of the resulting numbers. You'll end up with a pile of smallest integers that you can feed to the other algorithm. After getting the result, reverse the process to recover your original range.
Multiple all the numbers by 10
until you have integers.
Divide
by 2,3,5,7 while you still have all
integers.
I think that covers all cases.
2.1 * 10/7 -> 3
0.008 * 10^3/2^3 -> 1
That's assuming your multiplier can be a rational fraction.
If you want to find some integer N so that N*x is also an exact integer for a set of floats x in a given set are all integers, then you have a basically unsolvable problem. Suppose x = the smallest positive float your type can represent, say it's 10^-30. If you multiply all your numbers by 10^30, and then try to represent them in binary (otherwise, why are you even trying so hard to make them ints?), then you'll lose basically all the information of the other numbers due to overflow.
So here are two suggestions:
If you have control over all the related code, find another
approach. For example, if you have some function that takes only
int's, but you have floats, and you want to stuff your floats into
the function, just re-write or overload this function to accept
floats as well.
If you don't have control over the part of your system that requires
int's, then choose a precision to which you care about, accept that
you will simply have to lose some information sometimes (but it will
always be "small" in some sense), and then just multiply all your
float's by that constant, and round to the nearest integer.
By the way, if you're dealing with fractions, rather than float's, then it's a different game. If you have a bunch of fractions a/b, c/d, e/f; and you want a least common multiplier N such that N*(each fraction) = an integer, then N = abc / gcd(a,b,c); and gcd(a,b,c) = gcd(a, gcd(b, c)). You can use Euclid's algorithm to find the gcd of any two numbers.
Greg: Nice solution but won't calculating a GCD that's common in an array of 100+ numbers get a bit expensive? And how would you go about that? Its easy to do GCD for two numbers but for 100 it becomes more complex (I think).
Evil Andy: I'm programing in .Net and the solution you pose is pretty much a match for what we do now. I didn't want to include it in my original question cause I was hoping for some outside the box (or my box anyway) thinking and I didn't want to taint peoples answers with a potential solution. While I don't have any solid performance statistics (because I haven't had any other method to compare it against) I know the string parsing would be relatively expensive and I figured a purely mathematical solution could potentially be more efficient.
To be fair the current string parsing solution is in production and there have been no complaints about its performance yet (its even in production in a separate system in a VB6 format and no complaints there either). It's just that it doesn't feel right, I guess it offends my programing sensibilities - but it may well be the best solution.
That said I'm still open to any other solutions, purely mathematical or otherwise.
What language are you programming in? Something like
myNumber.ToString().Substring(myNumber.ToString().IndexOf(".")+1).Length
would give you the number of decimal places for a double in C#. You could run each number through that and find the largest number of decimal places(x), then multiply each number by 10 to the power of x.
Edit: Out of curiosity, what is this sealed system which you can pass only integers to?
In a loop get mantissa and exponent of each number as integers. You can use frexp for exponent, but I think bit mask will be required for mantissa. Find minimal exponent. Find most significant digits in mantissa (loop through bits looking for last "1") - or simply use predefined number of significant digits.
Your multiple is then something like 2^(numberOfDigits-minMantissa). "Something like" because I don't remember biases/offsets/ranges, but I think idea is clear enough.
So basically you want to determine the number of digits after the decimal point for each number.
This would be rather easier if you had the binary representation of the number. Are the numbers being converted from rationals or scientific notation earlier in your program? If so, you could skip the earlier conversion and have a much easier time. Otherwise you might want to pass each number to a function in an external DLL written in C, where you could work with the floating point representation directly. Or you could cast the numbers to decimal and do some work with Decimal.GetBits.
The fastest approach I can think of in-place and following your conditions would be to find the smallest necessary power-of-ten (or 2, or whatever) as suggested before. But instead of doing it in a loop, save some computation by doing binary search on the possible powers. Assuming a maximum of 8, something like:
int NumDecimals( double d )
{
// make d positive for clarity; it won't change the result
if( d<0 ) d=-d;
// now do binary search on the possible numbers of post-decimal digits to
// determine the actual number as quickly as possible:
if( NeedsMore( d, 10e4 ) )
{
// more than 4 decimals
if( NeedsMore( d, 10e6 ) )
{
// > 6 decimal places
if( NeedsMore( d, 10e7 ) ) return 10e8;
return 10e7;
}
else
{
// <= 6 decimal places
if( NeedsMore( d, 10e5 ) ) return 10e6;
return 10e5;
}
}
else
{
// <= 4 decimal places
// etc...
}
}
bool NeedsMore( double d, double e )
{
// check whether the representation of D has more decimal points than the
// power of 10 represented in e.
return (d*e - Math.Floor( d*e )) > 0;
}
PS: you wouldn't be passing security prices to an option pricing engine would you? It has exactly the flavor...

Resources