Computing the most accurate result for: the sum of a list of fixed-point numbers - addition

Let's say you have to add some 32-bit fixed-point numbers stored in a huge array L, and you'd like to get the most accurate result possible. Furthermore, you're not allowed to use anything other than L and the 32-bit fixed-point numbers (i.e. you're not allowed to convert them to 64-bit). What would be your approach to get the most accurate result for the sum of the numbers in L?
This is my current approach, in pseudocode:
L = sort(L)
result = 0
lastMax = false -- indicates whether we extracted the maximum from L last time
while (not empty(L)) and (result != +INF) and (result != -INF) do:
    if lastMax:
        current = extractMin(L) -- gets and removes the minimum from L
    else:
        current = extractMax(L) -- gets and removes the maximum from L
    result = safeAdd(result, current)
    lastMax = not lastMax

safeAdd(a, b):
    if a = +INF: return +INF
    else if a = -INF: return -INF
    else: return a + b
So I'm alternating between adding the minimum and the maximum of the remaining list L in order to keep the running sum within the range spanned by L. The way safeAdd is implemented means that once we've left the representable range (i.e., the result of a + b has yielded +INF or -INF, as assumed in the sidenote below), the result is never altered again.
Do you have any suggestions on how to improve the approach?
Sidenote: to be very precise, we further assume that the + operation can yield +INF or -INF, and that these values can be represented as fixed-point numbers in the programming language. But we assume that the values +INF and -INF do not occur in L. And we ignore the fact that the fixed-point format may also have a representation for NaN.
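Here is a minimal runnable sketch of the pseudocode above in Python (names are mine; Python integers don't overflow, so 32-bit saturation to INT_MIN/INT_MAX stands in for the -INF/+INF values assumed in the sidenote):
INT_MAX, INT_MIN = 2**31 - 1, -2**31

def safe_add(a, b):
    if a in (INT_MAX, INT_MIN):
        return a                          # result is pinned once saturated
    s = a + b
    return min(max(s, INT_MIN), INT_MAX)  # saturate instead of wrapping

def alternating_sum(L):
    L = sorted(L)
    lo, hi = 0, len(L) - 1
    result, last_max = 0, False
    while lo <= hi and result not in (INT_MAX, INT_MIN):
        if last_max:
            current, lo = L[lo], lo + 1   # extractMin
        else:
            current, hi = L[hi], hi - 1   # extractMax
        result = safe_add(result, current)
        last_max = not last_max
    return result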

If you know that the result is in range, you don't need to care about overflow or underflow.
Here is an example with 4 bits only, to keep it simple:
  7  0111
+ 1  0001
= 1000  <-- -8, overflow
- 1  0001
= 0111  <-- 7 again
If you are not sure whether the result is in range, you need to count overflows and underflows.
For a = b + c:
Overflow if b and c are positive and a is negative.
Underflow if b and c are negative and a is positive.
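A sketch of that counting idea in Python (function names are mine; the 32-bit wrap-around is simulated, since Python integers don't overflow):
def wrap_add32(a, b):
    # add two signed 32-bit values with two's-complement wrap-around
    s = (a + b) & 0xFFFFFFFF
    return s - 2**32 if s >= 2**31 else s

def sum32(values):
    total, wraps = 0, 0
    for v in values:
        s = wrap_add32(total, v)
        if total > 0 and v > 0 and s < 0:
            wraps += 1                    # overflow
        elif total < 0 and v < 0 and s >= 0:
            wraps -= 1                    # underflow
        total = s
    # the true sum is total + wraps * 2**32; it fits in 32 bits iff wraps == 0
    return total, wraps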

Related

How does clearing the least significant bit repeatedly work in this algorithm?

The following function determines the number of bits you would need to flip to convert integer A to integer B. I have solved it by a different method, but when I read this solution I don't understand it. It clearly works, but my question is why?
def bit_swap_required(a: int, b: int) -> int:
    count, c = 0, a ^ b
    while c:
        count, c = count + 1, c & (c - 1)
    return count
I understand why we do a ^ b. It gives us a '1' at every place we need to flip. But how does doing c & (c - 1) repeatedly give you the EXACT number of '1's in the number?
c - 1 unsets the least significant set bit in the binary representation of c and sets all the bits to the right of it.
When you bitwise-AND c - 1 with c, you effectively unset the least significant set bit and all the bits to its right. In other words, the least significant set bit in c and everything to its right become zeros.
You count this as one, and rightly so, because it was just one set bit from a ^ b.
Now, you continue this operation until c becomes zero; the number of operations is the number of set bits in c, which is the number of differing bits between a and b.
To give you an example of what c - 1 does to the binary representation of c:
c = 6, in binary 110
c-1 = 5, in binary 101
(c-1) & (c) = 110 & 101 = 100
The above is one iteration of the loop
Next iteration
c = 4, in binary 100
c-1 = 3, in binary 011
(c-1) & (c) = 100 & 011 = 000
The above successfully counts the number of set bits in 6.
This optimization improves on the algorithm that right-shifts the number at each iteration and checks whether the least significant bit is set. The shift-based method does as many iterations as the position of the most significant set bit: for a power of two such as 2^7, you iterate 8 times until the number becomes zero. With the optimized method you iterate once per set bit, so for 2^7 it is just one iteration.
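For comparison, a sketch of that shift-based variant (mine, not from the question):
def bit_count_shift(c: int) -> int:
    count = 0
    while c:
        count += c & 1   # test the least significant bit
        c >>= 1          # shift it out; loops once per bit position
    return count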
From the structure of the code, you can guess that c & (c-1) has one fewer 1 bit than c, even without investigating the expression.
Indeed, subtracting 1 flips all the bits from the right (there are borrows) up to and including the rightmost 1. So if you bitwise-AND with c, only that rightmost 1 disappears.
E.g. c = 1001010100 -> c-1 = 1001010011 -> c & (c-1) = 1001010000.
Next, c = 1001010000 -> c-1 = 1001001111 -> c & (c-1) = 1001000000.

How to find the multiplier that produces a smaller output for every double value?

How would you find the greatest and positive IEEE-754 binary-64 value C such that every IEEE-754 product of a positive, normalized binary-64 value A with C is smaller than A?
I know it must be close to 0.999999... but I'd like to find exactly the greatest one.
Suppose round-to-nearest, ties to even.
There've been a couple of experimental approaches; here's a proof that C = 1 - ε, where ε is machine epsilon (that is, the distance between 1 and the smallest representable number greater than 1.)
We know that C < 1, of course, so it makes sense to try C = 1 - ε/2 because it's the next representable number smaller than 1. (The ε/2 is because C is in the [0.5, 1) bucket of representable numbers.) Let's see if it works for all A.
I'm going to assume in this paragraph that 1 <= A < 2. If both A and AC are in the "normal" region then it doesn't really matter what the exponent is, the situation will be the same with the exponent 2^0. Now, that choice of C obviously works for A=1, so we are left with the region 1 < A < 2. Looking at A = 1 + ε, we see that AC (the exact value, not the rounded result) is already greater than 1; and for A = 2 - ε we see that it's less than 2. That's important, because if AC is between 1 and 2, we know that the distance between AC and round(AC) (that is, rounding it to the nearest representable value) is at most ε/2. Now, if A - AC < ε/2, then round(AC) = A which we don't want. (If A - AC = ε/2 then it might round to A given the "ties to even" part of the normal FP rounding rules, but let's see if we can do better.) Since we've chosen C = 1 - ε/2, we can see that A - AC = A - A(1 - ε/2) = A * ε/2. Since that's greater than ε/2 (remember, A>1), it's far enough away from A to round away from it.
BUT! The one other value of A we have to check is the minimum representable normal value, since there AC is not in the normal range and so our "relative distance to nearest" rule doesn't apply. And what we find is that in that case A-AC is exactly half of machine epsilon in the region. "Round to nearest, ties to even" kicks in and the product rounds back up to equal A. Drat.
Going through the same thing with C = 1 - ε, we see that round(AC) < A, and that nothing else even comes close to rounding towards A (we end up asking whether A * ε > ε/2, which of course it is). So the punchline is that C = 1-ε/2 almost works but the boundary between normals and denormals screws us up, and C = 1-ε gets us into the end zone.
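The boundary case is easy to check numerically; a quick sketch in Python (whose floats are IEEE-754 binary64 on common platforms):
import sys

eps = sys.float_info.epsilon   # 2**-52, the machine epsilon above
A = sys.float_info.min         # smallest positive normal, 2**-1022

print(A * (1 - eps / 2) < A)   # False: the tie rounds back up to A
print(A * (1 - eps) < A)       # True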
Due to the nature of floating-point types, C will vary depending on how big the value of A is. You can use nextafter to get the largest value less than 1, which will be a rough first value for C.
However, it's possible that if A is too large or too small, A*C will be the same as A. I'm not able to mathematically prove that nextafter(1.0, 0) works for all possible A's, therefore I'm suggesting a solution like this:
#include <math.h>

double largestCfor(double A)
{
    double C = nextafter(1.0, 0);
    while (C * A >= A)
        C = nextafter(C, 0);
    return C;
}
If you want a C value that works for any A, even if C*A might not be the largest possible value, then you'll need to check every exponent that the type can represent:
double C = 1;
for (double A = 0x1p-1022; isfinite(A); A *= 2) // loop through all possible exponents
{
    double c = largestCfor(A);
    if (c < C) C = c;
}
I've tried running it on Ideone and got this result:
C = 0.999999999999999777955395074969
nextafter(1.0, 0) = 0.999999999999999888977697537484
Edit:
0.999999999999999777955395074969 is 0x1.ffffffffffffep-1, which is also 1 - DBL_EPSILON. That aligns with Sneftel's proof above.

convert real number to radicals

Suppose I have a real number. I want to approximate it with something of the form a+sqrt(b) for integers a and b. But I don't know the values of a and b. Of course I would prefer to get a good approximation with small values of a and b. Let's leave it undefined for now what is meant by "good" and "small". Any sensible definitions of those terms will do.
Is there a sane way to find them? Something like the continued fraction algorithm for finding fractional approximations of decimals. For more on the fractions problem, see here.
EDIT: To clarify, it is an arbitrary real number. All I have are a bunch of its digits. So depending on how good an approximation we want, a and b might or might not exist. The best I can think of would be to start adding integers to my real, squaring the result, and seeing if I come close to an integer. That is pretty much brute force, which is naturally not a particularly good algorithm. But if nothing better exists, that would itself be interesting to know.
EDIT: Obviously b has to be zero or positive. But a could be any integer.
No need for continued fractions; just calculate the square root of all "small" values of b (up to whatever value you feel is still "small" enough), remove everything before the decimal point, and sort/store them all (along with the b that generated each one).
Then when you need to approximate a real number, find the radical whose decimal portion is closest to the real number's decimal portion. This gives you b; choosing the correct a is then a simple matter of subtraction.
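A minimal sketch of that table approach (function names are mine):
import bisect, math

def build_table(max_b):
    # fractional part of sqrt(b) for each small b, sorted for binary search
    return sorted((math.sqrt(b) % 1.0, b) for b in range(1, max_b + 1))

def approximate(r, table):
    frac = r % 1.0
    i = bisect.bisect_left(table, (frac, 0))
    neighbours = table[max(0, i - 1):i + 1]
    _, b = min(neighbours, key=lambda e: abs(e[0] - frac))
    a = round(r - math.sqrt(b))
    return a, b

print(approximate(math.pi, build_table(99)))   # (-4, 51): pi ~ sqrt(51) - 4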
This is actually more of a math problem than a computer problem, but to answer the question I think you are right that you can use continued fractions. What you do is first represent the target number as a continued fraction. For example, if you want to approximate pi (3.14159265) then the CF is:
3: 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14 ...
The next step is to create a table of CFs for square roots, then you compare the values in the table to the fractional part of the target value (here: 7, 15, 1, 292, 1, 1, 1, 2, 1, 3, 1, 14...). For example, let's say your table had square roots for 1-99 only. Then you would find that the closest match is sqrt(51), which has a CF of 7: 7,14 repeating. The 7,14 is the closest to pi's 7,15. Thus your answer would be:
sqrt(51)-4
as the closest approximation given b < 100, which is off by 0.00016. If you allow larger values of b then you can get a better approximation.
The advantage of using CFs is that it is faster than working in, say, doubles or using floating point. For example, in the above case you only have to compare two integers (7 and 15), and you can also use indexing to make finding the closest entry in the table very fast.
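A sketch of computing the leading CF terms (my code; double precision only makes the first several terms trustworthy):
import math

def continued_fraction(x, terms):
    cf = []
    for _ in range(terms):
        a = math.floor(x)
        cf.append(a)
        if x == a:
            break
        x = 1.0 / (x - a)
    return cf

print(continued_fraction(math.pi, 5))        # [3, 7, 15, 1, 292]
print(continued_fraction(math.sqrt(51), 5))  # [7, 7, 14, 7, 14]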
This can be done using mixed integer quadratic programming very efficiently (though there are no run-time guarantees as MIQP is NP-complete.)
Define:
d := the real number you wish to approximate
b, a := two integers such that a + sqrt(b) is as "close" to d as possible
r := (d - a)^2 - b, the residual of the approximation
The goal is to minimize r. Set up your quadratic program as:
x := [ s b t ]
D := | 1 0 0 |
| 0 0 0 |
| 0 0 0 |
c := [0 -1 0]^T
with the constraint that s - t = f (where f is the fractional part of d)
and b,t are integers (s is not)
This is a convex (therefore optimally solvable) mixed integer quadratic program since D is positive semi-definite.
Once s, b, t are computed, simply recover the answer: b is b itself, a = d - s (since s = d - a), and t can be ignored.
Your problem may be NP-complete; it would be interesting to prove whether it is.
Some of the previous answers use methods that are of time or space complexity O(n), where n is the largest “small number” that will be accepted. By contrast, the following method is O(sqrt(n)) in time, and O(1) in space.
Suppose that positive real number r = x + y, where x = floor(r) and 0 ≤ y < 1. We want to approximate r by a number of the form a + √b. If x + y ≈ a + √b then x + y - a ≈ √b, so √b ≈ h + y for some integer offset h, and b ≈ (h + y)^2. To make b an integer, we want to minimize the fractional part of (h + y)^2 over all eligible h. There are at most √n eligible values of h. See the following Python code and sample output.
import math, random

def findb(y, rhi):
    bestb = loerror = 1
    for r in range(2, rhi):
        v = (r + y) ** 2
        u = round(v)
        err = abs(v - u)
        if round(math.sqrt(u)) ** 2 == u: continue
        if err < loerror:
            bestb, loerror = u, err
    return bestb
#random.seed(123456) # set a seed if testing repetitively
f = [math.pi - 3] + sorted([random.random() for i in range(24)])
print(' frac sqrt(b) error b')
for frac in f:
    b = findb(frac, 12)
    r = math.sqrt(b)
    t = math.modf(r)[0] # get fractional part of sqrt(b)
    print('{:9.5f} {:9.5f} {:11.7f} {:5.0f}'.format(frac, r, t - frac, b))
(Note 1: This code is in demo form; the parameters to findb() are y, the fractional part of r, and rhi, the square root of the largest small number. You may wish to change how the parameters are used. Note 2: The
if round(math.sqrt(u))**2 == u: continue
line of code prevents findb() from returning perfect-square values of b, except for the value b=1, because no perfect square can improve upon the accuracy offered by b=1.)
Sample output follows. About a dozen lines have been elided in the middle. The first output line shows that this procedure yields b=51 to represent the fractional part of pi, which is the same value reported in some other answers.
frac sqrt(b) error b
0.14159 7.14143 -0.0001642 51
0.11975 4.12311 0.0033593 17
0.12230 4.12311 0.0008085 17
0.22150 9.21954 -0.0019586 85
0.22681 11.22497 -0.0018377 126
0.25946 2.23607 -0.0233893 5
0.30024 5.29150 -0.0087362 28
0.36772 8.36660 -0.0011170 70
0.42452 8.42615 0.0016309 71
...
0.93086 6.92820 -0.0026609 48
0.94677 8.94427 -0.0024960 80
0.96549 11.95826 -0.0072333 143
0.97693 11.95826 -0.0186723 143
With the following code added at the end of the program, the output shown below also appears. This shows closer approximations for the fractional part of pi.
frac, rhi = math.pi - 3, 16
print(' frac sqrt(b) error b bMax')
while rhi < 1000:
    b = findb(frac, rhi)
    r = math.sqrt(b)
    t = math.modf(r)[0] # get fractional part of sqrt(b)
    print('{:11.7f} {:11.7f} {:13.9f} {:7.0f} {:7.0f}'.format(frac, r, t - frac, b, rhi ** 2))
    rhi = 3 * rhi // 2  # integer division keeps rhi valid for range()
frac sqrt(b) error b bMax
0.1415927 7.1414284 -0.000164225 51 256
0.1415927 7.1414284 -0.000164225 51 576
0.1415927 7.1414284 -0.000164225 51 1296
0.1415927 7.1414284 -0.000164225 51 2916
0.1415927 7.1414284 -0.000164225 51 6561
0.1415927 120.1415831 -0.000009511 14434 14641
0.1415927 120.1415831 -0.000009511 14434 32761
0.1415927 233.1415879 -0.000004772 54355 73441
0.1415927 346.1415895 -0.000003127 119814 164836
0.1415927 572.1415909 -0.000001786 327346 370881
0.1415927 911.1415916 -0.000001023 830179 833569
I do not know of a standard algorithm for this kind of problem, but it does intrigue me, so here is my attempt at developing an algorithm that finds the needed approximation.
Call the real number in question r. First, I assume that a can be negative; in that case we can reduce the problem so that we only have to find a b such that the decimal part of sqrt(b) is a good approximation of the decimal part of r. Let us now write r as r = x.y, with x being the integer part and y the decimal part.
Now:
b = r^2
  = (x.y)^2
  = (x + .y)^2
  = x^2 + 2 * x * .y + .y^2
  = 2 * x * .y + .y^2 (mod 1)
We now only have to find an x such that 0 = .y^2 + 2 * x * .y (mod 1) (approximately).
Filling that x into the formulas above we get b, and can then calculate a as a = r - sqrt(b). (All of these calculations have to be carefully rounded, of course.)
Now, for the time being I am not sure if there is a way to find this x without brute-forcing it. But even then, one can simply use a loop to find an x that is good enough.
I am thinking of something like this (semi-pseudocode):
max_diff_low = 0.01 // arbitrary accuracy
max_diff_high = 1 - max_diff_low
y = r % 1
v = y^2
addend = 2 * y
x = 0
while (v < max_diff_high && v > max_diff_low)
    x++
    v = (v + addend) % 1
c = (x + y)^2
b = round(c)
a = round(r - (x + y))
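A runnable Python version of the sketch above, under the same assumptions (the accuracy bound is arbitrary, and, as the edit below notes, termination is not guaranteed, so an iteration cap is added):
import math

def approximate(r, max_diff_low=0.01, max_iters=10**6):
    max_diff_high = 1 - max_diff_low
    y = r % 1
    v = (y * y) % 1
    addend = 2 * y
    x = 0
    while max_diff_low < v < max_diff_high and x < max_iters:
        x += 1
        v = (v + addend) % 1
    b = round((x + y) ** 2)
    a = round(r - (x + y))
    return a, b

print(approximate(math.pi))   # (-4, 51), i.e. pi ~ -4 + sqrt(51)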
Now, I think this algorithm is fairly efficient, while even allowing you to specify the desired accuracy of the approximation. One thing that could be done to turn it into an O(1) algorithm is calculating all the x values and putting them into a lookup table. If one only cares about the first three decimal digits of r (for example), the lookup table would only have 1000 values, which is only 4 KB of memory (assuming 32-bit integers are used).
Hope this is helpful at all. If anyone finds anything wrong with the algorithm, please let me know in a comment and I will fix it.
EDIT:
Upon reflection I retract my claim of efficiency. As far as I can tell, there is in fact no guarantee that the algorithm as outlined above will ever terminate, and even if it does, it might take a long time to find a very large x that solves the equation adequately.
One could maybe keep track of the best x found so far and relax the accuracy bounds over time to make sure the algorithm terminates quickly, at the possible cost of accuracy.
These problems are of course non-existent if one simply pre-calculates a lookup table.

Algorithm - Partition two numbers about a power-of-two

Given two floating point numbers, p and q where 0 < p < q I am interested in writing a function partition(p,q) that finds the 'simplest' number r that is between p and q. For example:
partition(3.0, 4.1) = 4.0 (2^2)
partition(4.2, 7.0) = 6.0 (2^2 + 2^1)
partition(2.0, 4.0) = 3.0 (2^1 + 2^0)
partition(0.3, 0.6) = 0.5 (2^-1)
partition(1.0, 10.0) = 8.0 (2^3)
In the last instance I am interested in the largest such number (so 8 as opposed to 4 or 2).
Let us assume that p and q are both normalized and positive, and that p < q.
If p and q have differing exponents, it appears that the number you are looking for is the number obtained by zeroing the mantissa of q after the leading (and often implicit) 1. The corner cases are left as an exercise, especially the case where q's mantissa is already made of zeroes after the leading, possibly implicit, 1.
If p and q have the same exponent, then we have to look at their mantissas. These mantissas have some bits in common (starting from the most significant end). Let c1 c2 ... ck p(k+1) ... pn be the bits of p's mantissa and c1 c2 ... ck q(k+1) ... qn the bits of q's mantissa, where c1 ... ck are the common bits and p(k+1), q(k+1) differ. Then p(k+1) is zero and q(k+1) is one (because of the hypotheses). The number with the same exponent and mantissa c1 ... ck 1 0 ... 0 is in the interval (p, q) and is the number you are looking for (again, corner cases left as an exercise).
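For the differing-exponent case, a bit-twiddling sketch (my code; corner cases ignored, as above): zeroing the mantissa of a positive double rounds it down to a power of two.
import struct

def zero_mantissa(q):
    # reinterpret the double as its 64-bit pattern and clear the
    # 52 explicit mantissa bits, keeping sign and exponent
    bits = struct.unpack('<Q', struct.pack('<d', q))[0]
    bits &= ~((1 << 52) - 1)
    return struct.unpack('<d', struct.pack('<Q', bits))[0]

print(zero_mantissa(4.1))    # 4.0, so partition(3.0, 4.1) = 4.0
print(zero_mantissa(0.6))    # 0.5, so partition(0.3, 0.6) = 0.5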
Write the numbers in binary (terminating if possible, so 1 is written as 1.0000..., not 0.1111...),
Scan from left to right, "keeping" all digits at which the two numbers are equal
At the first digit where the two numbers differ, p must have a 0 and q must have a 1, since p < q:
If q has any more 1 digits after this point, then put a 1 at this point and you're done.
If q has no more 1 digits after this point, then doing that would result in r == q, which is forbidden, so instead append a 0 digit. Follow that by a 1 digit unless doing so would result in r == p, in which case append another 0 and then a 1.
Basically, we truncate q down to the first place at which p and q differ, then jigger it a bit if necessary to avoid r == p or r == q. The result is certainly less than q and greater than p. It is "simplest" (has the least possible number of 1 digits) since any number between p and q must share their common initial sequence. We have added only one 1 digit to that sequence, which is necessary since the initial sequence alone is <= p, so no value in range (p,q) has fewer 1 digits. We've chosen the "largest" solution because we always place our extra 1 at the first (biggest) possible place.
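A sketch of that construction (my code): greedily build r from the highest power of two down, keeping it at or below p until one more 1-bit lands strictly inside (p, q). Floating-point corner cases are ignored, as in the answers above.
def partition(p, q):
    assert 0 < p < q
    step = 1.0
    while step < q:              # start at a power of two >= q
        step *= 2
    r = 0.0
    while True:
        if p < r + step < q:
            return r + step      # one extra 1-bit, strictly inside (p, q)
        if r + step <= p:
            r += step            # keep this bit of the common prefix
        step /= 2

print([partition(*pq) for pq in
       [(3.0, 4.1), (4.2, 7.0), (2.0, 4.0), (0.3, 0.6), (1.0, 10.0)]])
# [4.0, 6.0, 3.0, 0.5, 8.0]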
It sounds like you just want to convert the binary representation of the largest integer strictly less than your largest argument to the corresponding sum of powers of two.

Number base conversion as a stream operation

Is there a way, in constant working space, to do arbitrary-size and arbitrary-base conversions? That is, to convert a sequence of n numbers in the range [1,m] to a sequence of ceiling(n*log(m)/log(p)) numbers in the range [1,p], using a 1-to-1 mapping that (preferably but not necessarily) preserves lexicographic order and gives sequential results?
I'm particularly interested in solutions that are viable as a pipe function, i.e. able to handle datasets larger than can be stored in RAM.
I have found a number of solutions that require "working space" proportional to the size of the input but none yet that can get away with constant "working space".
Does dropping the sequential constraint make any difference? That is: allow lexicographically sequential inputs to result in non-lexicographically-sequential outputs:
F(1,2,6,4,3,7,8) -> (5,6,3,2,1,3,5,2,4,3)
F(1,2,6,4,3,7,9) -> (5,6,3,2,1,3,5,2,4,5)
some thoughts:
might this work?
streamBasen -> convert(n, lcm(n,p)) -> convert(lcm(n,p), p) -> streamBasep
(where lcm is least common multiple)
I don't think it's possible in the general case. If m is a power of p (or vice versa), or if they're both powers of a common base, you can do it, since each group of log_m(p) digits is then independent. However, in the general case, suppose you're converting the number a1 a2 a3 ... an. The equivalent number in base p is
sum(a_i * m^(i-1) for i in 1..n)
If we've processed the first i digits, then we have the i-th partial sum. To compute the (i+1)-th partial sum, we need to add a_(i+1) * m^i. In the general case, this number will have non-zero digits in most places, so we'll need to modify all of the digits we've processed so far. In other words, we'll have to process all of the input digits before we know what the final output digits will be.
In the special case where m and p are both powers of a common base, or equivalently if log_m(p) is a rational number, then m^i will only have a few non-zero digits in base p near the front, so we can safely output most of the digits we've computed so far.
I think there is a way of doing radix conversion in a stream-oriented fashion in lexicographic order. However, what I've come up with isn't sufficient for actually doing it, and it has a couple of assumptions:
The lengths of the positional numbers are already known.
The numbers described are integers. I've not considered what happens with the maths and negative indices.
We have a sequence of values a of length p, where each value is in the range [0, m-1]. We want a sequence of values b of length q in the range [0, n-1]. We can work out the k-th digit of our output sequence b from a as follows:
b_k = floor[ sum(a_i * m^i for i in 0 to p-1) / n^k ] mod n
Let's rearrange that sum into two parts, splitting it at an arbitrary point z:
b_k = floor[ ( sum(a_i * m^i for i in z to p-1) + sum(a_i * m^i for i in 0 to z-1) ) / n^k ] mod n
Suppose that we don't yet know the values of a between [0, z-1] and can't compute the second sum term. We're left with having to deal with ranges, but that still gives us information about b_k.
The minimum value b_k can be is:
b_k >= floor[ sum(a_i * m^i for i in z to p-1) / n^k ] mod n
and the maximum value b_k can be is:
b_k <= floor[ ( sum(a_i * m^i for i in z to p-1) + m^z - 1 ) / n^k ] mod n
We should be able to do a process like this:
1. Initialise z to p. We will count down from p as we receive each value of a.
2. Initialise k to the index of the most significant value in b. If my brain is still working, ceil[ log_n(m^p) ].
3. Read a value of a. Decrement z.
4. Compute the min and max values for b_k.
5. If the min and max are the same, output b_k and decrement k. Goto 4. (It may be possible that we already have enough values for several consecutive values of b_k.)
6. If z != 0 then we expect more values of a. Goto 3.
7. Hopefully, at this point we're done.
I've not considered how to efficiently compute the range values yet, but I'm reasonably confident that computing the sums from the incoming values of a can be done much more reasonably than storing all of a. Without doing the maths, though, I won't make any hard claims about it!
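A sketch of that process (my code, most significant digit first). It uses Python's big integers for the running bounds, so it demonstrates the idea rather than achieving constant working space:
def stream_convert(in_digits, m, n, out_len):
    p = len(in_digits)
    value = 0
    k = out_len - 1                  # next undetermined output position
    out = []
    for z, d in enumerate(in_digits, 1):
        value = value * m + d
        rem = p - z                  # input digits still unread
        lo = value * m**rem          # all remaining digits are 0
        hi = lo + m**rem - 1         # all remaining digits are m-1
        while k >= 0 and lo // n**k == hi // n**k:
            out.append((lo // n**k) % n)   # this digit is now pinned down
            k -= 1
    return out

print(stream_convert([1, 9, 2, 15, 14], 16, 2, 20))   # 0x192FE in binary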
Yes, it is possible.
For every I input character(s) you read, you will write out O output character(s),
based on Ceiling(Length * log(In) / log(Out)).
Allocate enough space
Set x to 1
Loop over digits from end to beginning # Horner's method
    Set a to x * digit
    Set t to O - 1
    Loop while a > 0 and t >= 0
        Set a to a + out digit at position t
        Set out digit at position t to a mod to-base
        Set a to a / to-base
        Decrement t
    Set x to x * from-base
Return converted digit(s)
Thus, for base 16 to base 2 (which is easy), using "192FE" we read '1' and convert it, then repeat on '9', then '2', and so on, giving us '0001', '1001', '0010', '1111', and '1110'.
Note that for bases that are not common powers, such as base 17 to base 2, you would read 1 character and write 5.
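For the easy common-power case, the stream really is constant-space; a minimal sketch of the base-16-to-base-2 example above:
def hex_to_binary_stream(hex_digits):
    for d in hex_digits:
        yield format(d, '04b')   # each base-16 digit maps to exactly 4 bits

print(list(hex_to_binary_stream([0x1, 0x9, 0x2, 0xF, 0xE])))
# ['0001', '1001', '0010', '1111', '1110']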
