I need a way to compute how many times a fixed-point number B is contained in a fixed-point number A. Something like integer division, but on non-integer operands.
I need to design a hardware block for this operation.
My first guess is to implement division as shift-and-subtract and stop when I reach the fractional part, but maybe you know better ways to do it.

If I understand you correctly you want the integer part of the fractional division, i.e.
C = floor(A / B)
Now, fractional division is no different from integer division, apart from adjusting the binary point. If you represent A = a * 2^-n and B = b * 2^-m, you get
C = floor(A / B) = floor((a / b) * 2^(m-n))
So you can use the division algorithm for integers (which is essentially shift and subtract) unchanged: pre-shift a left by m bits and b left by n bits so that both operands share the same scale (no shift is needed when n = m), then keep only the integer quotient bits, i.e. stop iterating once you've reached the binary point.
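As a rough software model of the shift-and-subtract idea (a sketch, not a hardware description), the Python below walks the dividend bit by bit. It assumes A = a * 2^-n and B = b * 2^-m with a >= 0 and b > 0, and relies on the identity A / B = (a * 2^m) / (b * 2^n). The name `fixed_floor_div` is made up for illustration.

```python
def fixed_floor_div(a: int, n: int, b: int, m: int) -> int:
    """Return floor(A / B) for A = a * 2**-n and B = b * 2**-m (a >= 0, b > 0)."""
    if a < 0 or b <= 0:
        raise ValueError("this sketch only handles a >= 0 and b > 0")
    num = a << m          # align both operands to the same scale:
    den = b << n          # A / B == (a * 2**m) / (b * 2**n)
    quotient = 0
    remainder = 0
    # Restoring division: bring down one dividend bit per step, MSB first.
    for bit in range(num.bit_length() - 1, -1, -1):
        remainder = (remainder << 1) | ((num >> bit) & 1)
        quotient <<= 1
        if remainder >= den:      # the "subtract" step succeeds
            remainder -= den
            quotient |= 1
    return quotient               # integer part only; fractional bits never computed


# A = 5.75 (a = 23, n = 2), B = 1.5 (b = 3, m = 1): 5.75 / 1.5 = 3.83...
print(fixed_floor_div(23, 2, 3, 1))
```

Because the loop stops after the last integer bit of the aligned dividend, no fractional quotient bits are ever produced, which mirrors stopping at the binary point.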


Implementing the square root method through successive approximation

Determining the square root through successive approximation is implemented using the following algorithm:
1. Begin by guessing that the square root is x / 2. Call that guess g.
2. The actual square root must lie between g and x/g. At each step in the successive approximation, generate a new guess by averaging g and x/g.
3. Repeat step 2 until the values of g and x/g are as close together as the precision of the hardware allows. In Java, the best way to check for this condition is to test whether the average is equal to either of the values used to generate it.
What really confuses me is the last statement of step 3. I interpreted it as follows:
private double sqrt(double x) {
    double g = x / 2;
    while (true) {
        double average = (g + x/g) / 2;
        if (average == g || average == x/g) break;
        g = average;
    }
    return g;
}
This seems to just cause an infinite loop. I am following the algorithm exactly, if the average equals either g or x/g (the two values used to generate it) then we have our answer ?
Why would anyone ever use that approach, when they could simply use the formulas (2n)^2 = 4n^2 and (n + 1)^2 = n^2 + 2n + 1 to populate each bit in the mantissa, and divide the exponent by two, multiplying the mantissa by two iff the exponent mod 2 equals 1?
To check whether g and x/g are as close as the hardware allows, look at the relative difference and compare it with the epsilon for your floating-point format. If it is within a small integer multiple of epsilon, you are OK.
Relative difference of x and y, see https://en.wikipedia.org/wiki/Relative_change_and_difference
The epsilon for 32-bit IEEE floats is about 1.0e-7 (for 64-bit doubles, as used in the question's code, it is about 2.2e-16), as in one of the other answers here, but that answer used the absolute rather than the relative difference.
In practice, that means something like:
Math.abs(g-x/g)/Math.max(Math.abs(g),Math.abs(x/g)) < 3.0e-7
Never compare floating point values for equality. The result is not reliable.
Use an epsilon like so:
if(Math.abs(average-g) < 1e-7 || Math.abs(average-x/g) < 1e-7)
You can change the epsilon value to be whatever you need. Probably best is something related to the original x.
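To make the stopping rule concrete, here is a minimal Python sketch of the averaging iteration that stops on a relative difference instead of exact equality. Python floats are 64-bit doubles, so it uses `sys.float_info.epsilon`. The helper name `approx_sqrt`, the default of 4 epsilons, and the iteration cap are assumptions chosen for illustration, not part of the original exercise.

```python
import sys


def approx_sqrt(x: float, ulps: int = 4) -> float:
    """Square root by repeatedly averaging g and x/g (hypothetical helper)."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        return 0.0
    tol = ulps * sys.float_info.epsilon    # small integer multiple of epsilon
    g = x / 2 if x / 2 > 0 else x          # initial guess; avoid underflow to 0
    for _ in range(2000):                  # safety cap (assumed generous)
        q = x / g
        # Relative difference between the two bounds g and x/g.
        if abs(g - q) <= tol * max(abs(g), abs(q)):
            break
        g = (g + q) / 2
    return g


print(approx_sqrt(2.0))
```

The tolerance scales with the magnitude of g, so the same test works for very large and very small x, which a fixed absolute epsilon such as 1e-7 does not.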

Splitting a floating point number as sums of floating point of fixed precision

Suppose I have an algorithm by which I can compute an arbitrarily precise floating point number (depending on a parameter N), let's say in pseudocode:
arbitrary_precision_float f = computeValue(n); // it could be a function which computes a specific value, like PI for instance.
I guess I can implement computeValue(int) with the mpf type of the GMP (GNU MP) library, for example...
Anyway, how can I split such a number into a sum of floating point numbers where each number has L mantissa digits?
f = x1 + x2 + ... + xn;
for i = 1:n
xi = 2^ei * Mi
Mi has exactly L digits.
I don't know if I'm being clear, but I'm looking for something "simple".
You can use a very simple algorithm. Assume without loss of generality that the exponent of your original number is zero; if it's not, then you just add that exponent to all the exponents of the answer.
Split your number f into groups of L digits and treat each group as a separate xi. Any such group can be represented in the form you need: the mantissa will be exactly that group, and the exponent will be the negated start position of the group in the original number (that is, -i*L, where i is the zero-based group number).
If any of the resulting xi starts with a zero, just shift its mantissa and correct the exponent accordingly.
For example, for L=4
f = 1.0010011100 (binary)
-> x1=1.001 *2^0
x2=0.011 *2^{-4} = 1.1*2^{-6}
x3=1.00 *2^{-8}
Another question arises if you want to minimize the amount of numbers you get. In the example above, two numbers are sufficient: 1.001*2^0+1.11*2^{-6}. This is a separate question, and in fact is a simple problem for dynamic programming.
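The grouping step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the number is given as a bit string read as b0.b1b2... with exponent zero, each term is returned as an integer mantissa of at most L bits plus a power-of-two exponent (so (9, -3) means 1001 * 2^-3 = 1.001 * 2^0), and the name `split_into_chunks` is made up. It does not attempt the minimisation mentioned above.

```python
from fractions import Fraction


def split_into_chunks(bits: str, L: int):
    """Split b0.b1b2... (exponent 0) into terms mantissa * 2**exponent.

    Each mantissa is an integer with at most L bits; all-zero groups are skipped.
    """
    terms = []
    for start in range(0, len(bits), L):
        group = bits[start:start + L]
        mantissa = int(group, 2)
        if mantissa:
            # The last bit of the group sits at binary position
            # -(start + len(group) - 1) relative to the binary point.
            terms.append((mantissa, -(start + len(group) - 1)))
    return terms


f_bits = "10010011100"                    # read as 1.0010011100 in binary
terms = split_into_chunks(f_bits, 4)      # [(9, -3), (3, -7), (4, -10)]
total = sum(Fraction(m) * Fraction(2) ** e for m, e in terms)
print(terms, total)
```

Summing the terms with exact rationals confirms that nothing is lost in the split.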

ruby floats add up to different values depending on the order

So this is weird. I'm in Ruby 1.9.3, and float addition is not working as I expect it would.
0.3 + 0.6 + 0.1 = 0.9999999999999999
0.6 + 0.1 + 0.3 = 1
I've tried this on another machine and get the same result. Any idea why this might happen?
Floating point operations are inexact: they round the result to nearest representable float value.
That means that each float operation is:
float(a op b) = mathematical(a op b) + rounding-error( a op b )
As suggested by above equation, the rounding error depends on operands a & b.
Thus, if you perform operations in different order,
float(float( a op b) op c) != float(a op (b op c))
In other words, floating point operations are not associative.
They are commutative though...
As others said, transforming the decimal representation 0.1 (that is, 1/10) into a base 2 representation (that is, 1/16 + 1/32 + 1/256 + ...) would lead to an infinite series of digits. So float(0.1) is not exactly equal to 1/10: it also carries a rounding error, and it leads to a long series of binary digits, which explains why the following operations have a non-zero rounding error (the mathematical result is not representable in floating point).
It has been said many times before but it bears repeating: Floating point numbers are by their very nature approximations of decimal numbers. There are some decimal numbers that cannot be represented precisely due to the way the floating point numbers are stored in binary. Small but perceptible rounding errors will occur.
To avoid this kind of mess, you should always format your numbers to an appropriate number of places for presentation:
'%.3f' % (0.3 + 0.6 + 0.1)
# => "1.000"
'%.3f' % (0.6 + 0.1 + 0.3)
# => "1.000"
This is why using floating point numbers for currency values is risky and you're generally encouraged to use fixed point numbers or regular integers for these things.
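As a small illustration of the decimal alternative (shown in Python rather than Ruby, where BigDecimal plays a similar role), the sketch below adds the same three values as decimal numbers and as integer cents. The variable names are arbitrary.

```python
from decimal import Decimal

# Decimal stores base-10 digits, so 0.3, 0.6 and 0.1 are represented exactly.
first = Decimal("0.3") + Decimal("0.6") + Decimal("0.1")
second = Decimal("0.6") + Decimal("0.1") + Decimal("0.3")

# Plain integers (for example, amounts in cents) are exact as well.
cents = 30 + 60 + 10

print(first, second, cents)
```

Note that the values are passed as strings; constructing a Decimal from a float would carry the float's rounding error along.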
First, the numerals “0.3”, “.6”, and “.1” in the source text are converted to floating-point numbers, which I will call a, b, and c. These values are near .3, .6, and .1 but not equal to them, but that is not directly the reason you see different results.
In each floating-point arithmetic operation, there may be a little rounding error, some small number ei. So the exact mathematical results your two expressions calculate is:
(a + b + e0) + c + e1 and
(b + c + e2) + a + e3.
That is, in the first expression, a is added to b, and there is a slight rounding error e0. Then c is added, and there is a slight rounding error e1. In the second expression, b is added to c, and there is a slight rounding error e2. Finally, a is added, and there is a slight rounding error e3.
The reason your results differ is that e0 + e1 ≠ e2 + e3. That is, the rounding that was necessary when a and b were added was different from the rounding that was necessary when b and c were added and/or the roundings that were necessary in the second additions of the two cases were different.
There are rules that govern these errors. If you know the rules, you can make deductions about them that bound the size of the errors in final results.
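The accumulated errors can be inspected directly. The Python sketch below (Python floats are the same IEEE doubles Ruby uses) converts each float to an exact rational with `fractions.Fraction`, so the difference between the computed sum and the exact sum of a, b, and c is the total rounding error of each ordering. The variable names are chosen here for illustration.

```python
from fractions import Fraction

a, b, c = 0.3, 0.6, 0.1             # already rounded to the nearest doubles

first = (a + b) + c                 # 0.9999999999999999
second = (b + c) + a                # 1.0

# Exact sum of the three stored values, with no rounding at all.
exact = Fraction(a) + Fraction(b) + Fraction(c)

err_first = Fraction(first) - exact     # e0 + e1
err_second = Fraction(second) - exact   # e2 + e3

print(float(err_first), float(err_second))
```

The two error totals differ, which is exactly why the two orderings print different results.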
This is a common limitation of floating point numbers, due to their being encoded in base 2 instead of base 10. It can be difficult to understand, but once you do, you can easily avoid problems like this. I recommend this guide, which goes in depth to explain it.
For this problem specifically, you might try rounding your result to the nearest millionths place:
result = (0.3+0.6+0.1)
=> 0.9999999999999999
result.round(6)
=> 1.0
As for why the order matters, it has to do with rounding. When those numbers are turned into floats, they are converted to binary, and all of them become repeating fractions, like ⅓ is in decimal. Since the result gets rounded during each addition, the final answer depends on the order of the additions. It appears that in one of those, you get a round-up, where in the other, you get a round-down. This explains the discrepancy.
It is worth noting what the actual difference is between those two answers: approximately 0.0000000000000001.
In a Rails view you can also use the number_with_precision helper:
result = 0.3 + 0.6 + 0.1
result = number_with_precision result, :precision => 3

Number base conversion as a stream operation

Is there a way to do arbitrary-size, arbitrary-base conversions in constant working space? That is, to convert a sequence of n numbers in the range [1,m] to a sequence of ceiling(n*log(m)/log(p)) numbers in the range [1,p] using a 1-to-1 mapping that (preferably but not necessarily) preserves lexicographical order and gives sequential results?
I'm particularly interested in solutions that are viable as a pipe function, i.e. are able to handle datasets larger than can be stored in RAM.
I have found a number of solutions that require "working space" proportional to the size of the input but none yet that can get away with constant "working space".
Does dropping the sequential constraint make any difference? That is: allow lexicographically sequential inputs to result in non lexicographically sequential outputs:
F(1,2,6,4,3,7,8) -> (5,6,3,2,1,3,5,2,4,3)
F(1,2,6,4,3,7,9) -> (5,6,3,2,1,3,5,2,4,5)
some thoughts:
might this work?
streamBasen -> convert(n, lcm(n,p)) -> convert(lcm(n,p), p) -> streamBasep
(where lcm is least common multiple)
I don't think it's possible in the general case. If m is a power of p (or vice versa), or if they're both powers of a common base, you can do it, since each group of log_m(p) digits is then independent. However, in the general case, suppose you're converting the number a_1 a_2 a_3 ... a_n. The equivalent number in base p is
sum(a_i * m^(i-1) for i in 1..n)
If we've processed the first i digits, then we have the ith partial sum. To compute the (i+1)th partial sum, we need to add a_(i+1) * m^i. In the general case, this number is going to have non-zero digits in most places, so we'll need to modify all of the digits we've processed so far. In other words, we'll have to process all of the input digits before we'll know what the final output digits will be.
In the special case where m and p are both powers of a common base, or equivalently if log_m(p) is a rational number, then m^i will only have a few non-zero digits in base p near the front, so we can safely output most of the digits we've computed so far.
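For the simplest special case, where the input base is an exact power of the output base (m = p^k), a streaming converter needs no state at all between digits. The Python generator below is a minimal sketch of that case only; the name `stream_convert_power` and its parameters are assumptions for illustration, and digits are taken most significant first.

```python
def stream_convert_power(digits, p, k):
    """Convert a stream of base-(p**k) digits into base-p digits.

    Each input digit expands to exactly k output digits, so the working
    space is constant no matter how long the input stream is.
    """
    for d in digits:
        group = []
        for _ in range(k):
            d, r = divmod(d, p)
            group.append(r)          # least significant digit of the group first
        yield from reversed(group)   # emit the group most significant first


# Hex digits 1, 9, 2, F, E streamed to binary (16 == 2**4).
bits = list(stream_convert_power(iter([1, 9, 2, 15, 14]), 2, 4))
print(bits)
```

The input is consumed through an iterator, so it can be arbitrarily long; the general case with unrelated bases does not decompose this way.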
I think there is a way of doing radix conversion in a stream-oriented fashion in lexicographic order. However, what I've come up with isn't sufficient for actually doing it, and it has a couple of assumptions:
The length of the positional numbers are already known.
The numbers described are integers. I've not considered what happens with the maths and -ive indices.
We have a sequence of values a of length p, where each value is in the range [0,m-1]. We want a sequence of values b of length q in the range [0,n-1]. We can work out the kth digit of our output sequence b from a as follows:
b_k = floor[ sum(a_i * m^i for i in 0 to p-1) / n^k ] mod n
Let's rearrange that sum into two parts, splitting it at an arbitrary point z:
b_k = floor[ ( sum(a_i * m^i for i in z to p-1) + sum(a_i * m^i for i in 0 to z-1) ) / n^k ] mod n
Suppose that we don't yet know the values of a between [0,z-1] and can't compute the second sum term. We're left with having to deal with ranges. But that still gives us information about b_k.
The minimum value b_k can be is:
b_k >= floor[ sum(a_i * m^i for i in z to p-1) / n^k ] mod n
and the maximum value b_k can be is:
b_k <= floor[ ( sum(a_i * m^i for i in z to p-1) + m^z - 1 ) / n^k ] mod n
We should be able to do a process like this:
1. Initialise z to be p. We will count down from p as we receive each character of a.
2. Initialise k to the index of the most significant value in b. If my brain is still working, ceil[ log_n(m^p) ].
3. Read a value of a. Decrement z.
4. Compute the min and max value for b_k.
5. If the min and max are the same, output b_k, and decrement k. Goto 4. (It may be possible that we already have enough values for several consecutive values of b_k)
6. If z != 0 then we expect more values of a. Goto 3.
7. Hopefully, at this point we're done.
I've not yet considered how to efficiently compute the range values, but I'm reasonably confident that computing the sum from the incoming characters of a can be done much more cheaply than storing all of a. Without doing the maths, though, I won't make any hard claims about it!
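The process above can be prototyped directly, as long as constant working space is not yet a requirement. The Python sketch below keeps the running prefix sum as an arbitrary-precision integer (so its memory still grows with the input) and emits an output digit as soon as the lower and upper bounds on the quotient agree. It compares the quotients before taking the final mod n, which is a slightly stricter test than the bounds written above. The function name `stream_radix` is hypothetical, and the input length must be known in advance, as the answer assumes.

```python
def stream_radix(digits, m, n):
    """Yield base-n digits (most significant first) of a base-m number.

    digits: list of base-m digits, most significant first.
    A digit is emitted once the unread input can no longer change it.
    """
    p = len(digits)
    q = 1                              # number of output digits needed
    while n ** q <= m ** p - 1:
        q += 1
    k = q - 1                          # index of the next output digit
    prefix = 0                         # sum(a_i * m**i for i in z..p-1)
    z = p
    for d in digits:
        z -= 1
        prefix += d * m ** z
        slack = m ** z - 1             # largest value the unread digits can add
        # Digit k is settled when both ends of the range give the same quotient.
        while k >= 0 and prefix // n ** k == (prefix + slack) // n ** k:
            yield (prefix // n ** k) % n
            k -= 1


# Hex 192FE (= 103166) to decimal, padded to the worst-case width.
print(list(stream_radix([1, 9, 2, 15, 14], 16, 10)))
```

When the last input digit arrives the slack is zero, so every remaining output digit is flushed; how early digits appear before that depends on the bases and the data.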
Yes, it is possible
For every I character(s) you read in, you will write out O character(s), based on Ceiling(Length * log(In) / log(Out)).
Allocate enough space
Set x to 1
Loop over digits from end to beginning # Horner's method
    Set a to x * digit
    Set t to O - 1
    Loop while a > 0 and t >= 0
        Set a to a + out digit at position t
        Set out digit at position t to a mod to base
        Set a to a / to base
        Set t to t - 1
    Set x to x * from base
Return converted digit(s)
Thus, for base 16 to 2 (which is easy), using "192FE" we read '1' and convert it, then repeat on '9', then '2' and so on giving us '0001', '1001', '0010', '1111', and '1110'.
Note that for bases that are not powers of a common base, such as base 17 to base 2, this would mean reading 1 character and writing 5.
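One way to read the pseudocode above is the Python sketch below. It is an interpretation, not the answerer's exact code: it adds a step that moves the output position left after each carry, sizes the output buffer with integer arithmetic instead of logarithms to avoid floating-point rounding, and uses the made-up name `convert_digits`. The multiplier x grows with the input length, so this version does not run in constant working space.

```python
def convert_digits(digits, from_base, to_base):
    """Convert digits (most significant first) from from_base to to_base."""
    # Smallest buffer with to_base**size >= from_base**len(digits).
    size = 0
    while to_base ** size < from_base ** len(digits):
        size += 1
    out = [0] * size
    x = 1                                  # from_base ** (position of the digit)
    for digit in reversed(digits):         # least significant input digit first
        a = x * digit                      # value contributed by this digit
        t = size - 1
        while a > 0 and t >= 0:            # add it into the output with carries
            a += out[t]
            out[t] = a % to_base
            a //= to_base
            t -= 1
        x *= from_base
    return out


# Hex 192FE to binary: 0001 1001 0010 1111 1110
print(convert_digits([1, 9, 2, 15, 14], 16, 2))
```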

How many digits will be after converting from one numeral system to another

The main question: How many digits?
Let me explain. I have a number in the binary system, 11000000, which is 192 in decimal.
After converting to decimal, how many digits will it have (in decimal)? In my example, it's 3 digits. But that isn't the problem. I've searched the internet and found one algorithm for the integral part and one for the fractional part. I don't quite understand them, but (I think) they work.
When converting from binary to octal, it's easier: every 3 bits give you 1 octal digit. Same for hex: every 4 bits = 1 hex digit.
But I'm very curious what to do if I have a number in the base-P numeral system and want to convert it to base Q. I know how to do it (I think I know :)), but first of all I want to know how many digits it will take in base Q (you know, I must preallocate space).
Writing a positive integer n in base b takes floor(log base b (n)) + 1 digits.
The ratio you noticed (binary/octal) is log base 2 (n) / log base 8 (n) = 3.
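A quick way to check digit counts without floating-point logarithms is to count divisions. The Python helper below is a small sketch (the name `num_digits` is made up); for n >= 1 it returns floor(log_b(n)) + 1, and it treats 0 as a one-digit number.

```python
def num_digits(n: int, base: int) -> int:
    """Number of digits needed to write the non-negative integer n in base."""
    if n < 0 or base < 2:
        raise ValueError("expects n >= 0 and base >= 2")
    count = 1
    while n >= base:       # each division strips one digit
        n //= base
        count += 1
    return count


print(num_digits(192, 10), num_digits(192, 2), num_digits(192, 8))
```

An exact power such as 8 = 1000 (binary) shows why a plain ceiling of the logarithm undercounts: log2(8) is 3, but four bits are needed.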
(From memory, will it stick?)
There was an error in my previous answer: look at the comment by Ben Schwehn.
Sorry for the confusion, I found and explain the error I made in my previous answer below.
Please use the answer provided by Paul Tomblin. (rewritten to use P, Q and n)
Y = ln(P^n) / ln(Q)
Y = n * ln(P) / ln(Q)
So Y (rounded up) is the number of characters you need in system Q to express the highest number you can encode in n characters in system P.
I have no answer (that wouldn't already convert the number and take up that much space in a temporary variable) for getting the bare minimum for a given number: 1000 (bin) = 8 (dec) needs only one decimal digit, while this formula would reserve 2 decimal positions.
If a temporary memory usage isn't a problem, you might cheat and use (Python):
This will give you the number of decimals needed to convert a number in base P, cast as a string (otherBaseStr), into decimals.
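The one-liner this answer refers to is not shown here. One possible reading, offered purely as an assumption, is to let Python parse the string in base P and measure the length of its decimal form. The function name `decimal_digits_needed` is made up; `int(text, base)` accepts bases 2 through 36, and the sketch assumes a non-negative number.

```python
def decimal_digits_needed(other_base_str: str, p: int) -> int:
    """Decimal digits needed for a non-negative number written in base p."""
    # int() does the full conversion, so this costs temporary memory.
    return len(str(int(other_base_str, p)))


print(decimal_digits_needed("11000000", 2))   # 192 has 3 decimal digits
```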
Old WRONG answer:
If you have a number in P numeral system of length n
Then you can calculate the highest number that is possible in n characters:
To express this highest number in number system Q you need to use logarithms (because they are the inverse to exponentiation):
(n-1)*log(P) / log(Q)
For example
11000000 in binary is 8 characters.
To get it in Decimal you would need:
(8-1)*log(2) / log(10) = 2.1 digits (round up to 3)
Reason it was wrong:
The highest number that is possible in n characters is
(P^n) - 1
If you have a number that's X digits long in base B, then the maximum value that can be represented is B^X - 1. So if you want to know how many digits it might take in base C, then you have to find the number Y such that C^Y - 1 is at least as big as B^X - 1. The way to do that is to take the logarithm in base C of B^X - 1. And since the logarithm (log) of a number in base C is the same as the natural log (ln) of that number divided by the natural log of C, that becomes:
Y = ln((B^X)-1) / ln(C) + 1
and since ln(B^X) is X * ln(B), and that's probably faster to calculate than ln(B^X-1) and close enough to the right answer, rewrite that as
Y = X * ln(B) / ln(C) + 1
Convert that to your favourite language. Because we dropped the "-1", we might end up with one digit more than we need in some cases. But even better, you can pre-calculate ln(B)/ln(C) and just multiply it by new "X"s as the length of the number you are trying to convert changes.
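In Python, the estimate might look like the sketch below. It truncates Y = X * ln(B) / ln(C) + 1 to an integer, which is how the formula is being read here; the name `max_digits` is hypothetical. Since it relies on floating-point logarithms, treat the result as a buffer-size estimate that may be one digit more than strictly needed.

```python
import math


def max_digits(x: int, b: int, c: int) -> int:
    """Estimate the digits needed in base c for any x-digit number in base b."""
    ratio = math.log(b) / math.log(c)   # can be precomputed once per (b, c) pair
    return math.floor(x * ratio) + 1


print(max_digits(8, 2, 10))   # 8 bits fit in 3 decimal digits
```

For a 4-bit input this reserves 2 decimal digits, which covers the worst case (1111 = 15) even though a particular value such as 1000 = 8 needs only one.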
The number of digits can be calculated using the formulas given in the other answers. However, it might actually be faster to allocate a buffer of maximum size first and then return the relevant part of that buffer instead of calculating a logarithm.
Note that the worst case for the buffer size happens when you convert to binary, which gives you a buffer size of 32 characters for 32-bit integers.
Converting a number to an arbitrary base could be done using the C# function below (The code would look very similar in other languages like C or Java):
public static string IntToString(int value, char[] baseChars)
{
    // 32 is the worst-case buffer size for base 2 and int.MaxValue
    int i = 32;
    char[] buffer = new char[i];
    int targetBase = baseChars.Length;

    do
    {
        buffer[--i] = baseChars[value % targetBase];
        value = value / targetBase;
    }
    while (value > 0);

    char[] result = new char[32 - i];
    Array.Copy(buffer, i, result, 0, 32 - i);
    return new string(result);
}
The keyword here is "logarithm", here are some suggestive links:
look at the logarithms base P and base Q. Round down to nearest integer.
The logarithm base P can be computed using your favorite base (10 or e): log_P(x) = log_10(x)/log_10(P)
You need to compute the length of the fractional part separately.
For binary to decimal, there are as many decimal digits as there are bits. For example, binary 0.11001101001001 is decimal 0.80133056640625, both 14 digits after the radix point.
For decimal to binary, there are two cases. If the decimal fraction is dyadic, then there are as many bits as decimal digits (same as for binary to decimal above). If the fraction is not dyadic, then the number of bits is infinite.
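The binary-to-decimal claim follows from the identity x / 2^k = x * 5^k / 10^k: a k-bit binary fraction is an integer divided by 10^k, so it has exactly k decimal places when the last bit is 1. The Python sketch below applies that identity with integers only; the function name `binary_fraction_to_decimal` is made up for illustration.

```python
def binary_fraction_to_decimal(bits: str) -> str:
    """Exact decimal expansion of the binary fraction 0.<bits>."""
    k = len(bits)
    # bits / 2**k == bits * 5**k / 10**k, so the decimal digits are an integer
    # padded with leading zeros to k places.
    numerator = int(bits, 2) * 5 ** k
    return "0." + str(numerator).zfill(k)


print(binary_fraction_to_decimal("11001101001001"))   # 0.80133056640625
```

The reverse direction has no such guarantee, because a decimal fraction whose denominator is not a power of two, such as 0.1, never terminates in binary.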
(You can use my decimal/binary converter to experiment with this.)
