How can I convert between a double-double and a decimal string?

One way of increasing precision beyond that of a double (e.g. if my application is doing something space-related that needs to represent accurate positions over distances of many light-years) is to use a double-double: a structure composed of two doubles that represents the value as the sum of the two. Algorithms are known for the various arithmetic operations on such a structure, e.g. double-double + double-double, double × double-double, etc., as given for example in this paper.
(Note that this is not the same format as IEEE 754-2008 binary128, a.k.a. quad precision, and conversion between double-double and binary128 is not guaranteed to round-trip.)
An obvious way to represent such a quantity as a string would be to use strings representing each individual component of the double-double, e.g. "1.0+1.0e-200". My question is: is there a known way to convert to and from strings that represent the value as a single decimal? I.e. given the string "0.3", provide the double-double closest to this representation, or go in the reverse direction. One naïve way would be to use successive multiplications/divisions by 10, but that is insufficient for doubles, so I'm somewhat sceptical that it would work here.

A technique such as summing 2 floating-point variables effectively just doubles the mantissa bit width, so it's enough to store/load a bigger mantissa.
A standard IEEE 754 double has a 52+1 bit mantissa, leading to
log10(2^53) = 15.95 = ~16 [dec digits]
so when you add 2 such variables you get:
log10(2^(53+53)) = 31.9 = ~32 [dec digits]
so just store/load a 32-digit mantissa to/from the string. The exponents of the 2 variables will differ by +/- 53, so it's enough to store just one of them.
To further improve performance and precision you can use hex strings. It's much faster and there is no rounding, as you can directly convert between the mantissa bits and hex string characters.
Any 4 bits form a single hexadecimal digit, so
(53+53) / 4 = 26.5 = ~27 [hex digits]
As you can see it's also more storage efficient. The only problem is the exponent delimiter: hex digits contain E, so you need to distinguish between digits and the exponent separator by upper/lower casing, or use a different character, or use just the sign, for example:
1.23456789ABCDEFe10
1.23456789ABCDEFe+10
1.23456789ABCDEF|+10
1.23456789ABCDEF+10
I usually use the first version. Also keep in mind that the exponent is a bit shift of the mantissa, so the resulting number is:
mantissa << exponent = mantissa * (2^exponent)
Now during loading/storing from/to a string you just load the 53+53 bit integer, then separate it into the 2 mantissas and reconstruct the floating-point values at bit level ... It's important that your mantissas are aligned so that exp1+53 = exp2, give or take 1 ...
All this can be done with integer arithmetic.
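For illustration, here is a minimal C++ sketch of that idea (not the exact format used below): it packs both components of a double-double into a single ~106-bit mantissa and prints it as hex digits plus one binary exponent. It assumes GCC/Clang's __int128, a nonzero value, and components aligned as described above; a lo component much smaller than ulp(hi) would simply get truncated. The reverse direction would parse the hex digits back into a 128-bit integer, split it at bit 53, and rebuild the two doubles with ldexp.
#include <cmath>
#include <cstdio>
#include <string>

struct ddouble { double hi, lo; };          // value = hi + lo

std::string dd_to_hex(ddouble x)            // sketch: assumes hi != 0 and aligned components
{
    int e;
    (void)std::frexp(x.hi, &e);             // hi = m * 2^e with 0.5 <= |m| < 1
    __int128 man = (__int128)std::ldexp(x.hi, 106 - e)
                 + (__int128)std::ldexp(x.lo, 106 - e);      // combined ~106-bit mantissa
    bool neg = man < 0; if (neg) man = -man;
    char digit[40]; int n = 0;
    do { digit[n++] = "0123456789ABCDEF"[(int)(man & 15)]; man >>= 4; } while (man);
    std::string s(1, neg ? '-' : '+');
    while (n) s += digit[--n];              // hex digits, most significant first
    return s + "e" + std::to_string(e - 106);   // binary exponent of the mantissa's LSB
}

int main()
{
    ddouble x{ 1.0, std::ldexp(1.0, -53) }; // 1 + 2^-53, not representable in one double
    std::printf("%s\n", dd_to_hex(x).c_str());
}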
If your exponent is a power of 10 (exp10) instead, you will inflict heavy rounding on the number during both storing and loading to/from the string, as your mantissa will usually be missing many zero bits before or after the decimal point, making the transformation between decimal and binary/hexadecimal very hard and inaccurate (especially if you limit your computation to just 64/80/128/160 bits of mantissa).
Here is a C++ example of just that (printing a 32-bit float in decimal using integer arithmetic only):
//---------------------------------------------------------------------------
AnsiString f32_prn(float fx) // scientific format integers only
{
const int ms=10+5; // mantisa digits
const int es=2; // exponent digits
const int eb=100000;// 10^(es+3)
const int sz=ms+es+5;
char txt[sz],c;
int i=0,i0,i1,m,n,exp,e2,e10;
DWORD x,y,man;
for (i0=0;i0<sz;i0++) txt[i0]=' ';
// float -> DWORD
x=((DWORD*)(&fx))[0];
// sign
if (x>=0x80000000){ txt[i]='-'; i++; x&=0x7FFFFFFF; }
else { txt[i]='+'; i++; }
// exp
exp=((x>>23)&255)-127;
// man
man=x&0x007FFFFF;
if ((exp!=-127)&&(exp!=+128)) man|=0x00800000; // not zero or denormalized or Inf/NaN
// special cases
if ((man==0)&&(exp==-127)){ txt[i]='0'; i++; txt[i]=0; return txt; } // +/- zero
if ((man==0)&&(exp==+128)){ txt[i]='I'; i++; txt[i]='N'; i++; txt[i]='F'; i++; txt[i]=0; return txt; } // +/- Infinity
if ((man!=0)&&(exp==+128)){ txt[i]='N'; i++; txt[i]='A'; i++; txt[i]='N'; i++; txt[i]=0; return txt; } // +/- Not a number
// align man,exp to 4bit
e2=(1+(exp&3))&3;
man<<=e2;
exp-=e2+23; // exp of lsb of mantisa
e10=0; // decimal digits to add/remove
m=0; // mantisa digits
n=ms; // max mantisa digits
// integer part
if (exp>=-28)
{
x=man; y=0; e2=exp;
// shift x to integer part <<
if (x) for (;e2>0;)
{
while (x>0x0FFFFFFF){ y/=10; y+=((x%10)<<28)/10; x/=10; e10++; }
e2-=4; x<<=4; y<<=4;
x+=(y>>28)&15; y&=0x0FFFFFFF;
}
// shift x to integer part >>
for (;e2<0;e2+=4) x>>=4;
// no exponent?
if ((e10>0)&&(e10<=es+3)) n++; // no '.'
// print
for (i0=i;x;)
{
if (m<n){ txt[i]='0'+(x%10); i++; m++; if ((m==n)&&(x<eb)) m+=es+1; } else e10++;
x/=10;
}
// reverse digits
for (i1=i-1;i0<i1;i0++,i1--){ c=txt[i0]; txt[i0]=txt[i1]; txt[i1]=c; }
}
// fractional part
if (exp<0)
{
x=man; y=0; e2=exp;
// shift x to fractional part <<
if (x) for (;e2<-28;)
{
while ((x<=0x19999999)&&(y<=0x19999999)){ y*=10; x*=10; x+=(y>>28)&15; y&=0x0FFFFFFF; e10--; }
y>>=4; y&=0x00FFFFFF; y|=(x&15)<<24;
x>>=4; x&=0x0FFFFFFF; e2+=4;
}
// shift x to fractional part <<
for (;e2>-28;e2-=4) x<<=4;
// print
x&=0x0FFFFFFF;
if ((m)&&(!e10)) n+=es+2; // no exponent means more digits for mantisa
if (x)
{
if (m){ txt[i]='.'; i++; }
for (i0=i;x;)
{
y*=10; x*=10;
x+=(y>>28)&15;
if (m<n)
{
i0=((x>>28)&15);
if (!m)
{
if (i0)
{
txt[i]='0'+i0; i++; m++;
txt[i]='.'; i++;
}
e10--;
if (!e10) n+=es+2; // no exponent means more digits for mantisa
}
else { txt[i]='0'+i0; i++; m++; }
} else break;
y&=0x0FFFFFFF;
x&=0x0FFFFFFF;
}
}
}
else{
// no fractional part
if ((e10>0)&&(e10<sz-i))
for (;e10;e10--){ txt[i]='0'; i++; m++; } // pad with trailing zeros instead of printing an exponent
}
// exponent
if (e10)
{
if (e10>0) // move . after first digit
{
for (i0=i;i0>2;i0--) txt[i0]=txt[i0-1];
txt[2]='.'; i++; e10+=i-3;
}
// sign
txt[i]='E'; i++;
if (e10<0){ txt[i]='-'; i++; e10=-e10; }
else { txt[i]='+'; i++; }
// print
for (i0=i;e10;){ txt[i]='0'+(e10%10); e10/=10; i++; }
// reverse digits
for (i1=i-1;i0<i1;i0++,i1--){ c=txt[i0]; txt[i0]=txt[i1]; txt[i1]=c; }
}
txt[i]=0;
return txt;
}
//---------------------------------------------------------------------------
Just change the AnsiString return type into whatever string type or char* you have at your disposal ...
As you can see it's a lot of code with a lot of hacks, and internally a lot more than 24 bits of mantissa are used to lower the rounding errors inflicted by the decimal exponent.
So I strongly advise using a binary exponent (exp2) and hex digits for the mantissa; it will simplify your problem a lot and get rid of the rounding entirely. The only problem is when you want to print or input a decimal number; in that case you have no choice but to round ... Luckily you can use the hex output and convert it to decimal on strings ... or construct the print from single-variable prints ...
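As a rough sketch of the "convert the hex output to decimal on strings" idea: keep the value as an array of decimal digits and fold in one hex digit at a time (value = value*16 + digit). This only covers the integer mantissa digits; exponent handling is left out.
#include <cctype>
#include <string>
#include <vector>

std::string hex_digits_to_decimal(const std::string& hex)   // e.g. "FF" -> "255"
{
    std::vector<int> dec{ 0 };                               // decimal digits, least significant first
    for (char c : hex) {
        int d = std::isdigit((unsigned char)c) ? c - '0'
                                               : std::toupper((unsigned char)c) - 'A' + 10;
        int carry = d;
        for (int& x : dec) {                                 // dec = dec*16 + d
            int v = x * 16 + carry;
            x = v % 10;
            carry = v / 10;
        }
        while (carry) { dec.push_back(carry % 10); carry /= 10; }
    }
    std::string out;
    for (auto it = dec.rbegin(); it != dec.rend(); ++it) out += char('0' + *it);
    return out;
}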
For more info see related QAs:
How do I convert a very long binary number to decimal?

Related

IEEE754 single precision - General algorithm for representing the half of a number

Suppose N is an arbitrary number represented according to IEEE754 single precision standards. I want to find the most precise possible representation of N/2 again in IEEE754.
I want to find a general algorithm (described in words, I just want the necessary steps and cases to take into account) for obtaining the representation.
My approach is:
Say the number is represented: b_0 b_1 b_2 ... b_31.
Isolate the first bit, which determines the sign (-/+) of the number.
Calculate the power (p) from the unsigned (biased) representation b_1...b_8.
If p = 128 we have a special case. If all the bits of the mantissa are equal to 0 we have, depending on b_0, either minus or plus infinity, and we don't change anything. If the mantissa has at least one bit equal to 1 then we have a NaN value. Again we change nothing.
If p is inside ]-126, 127[ then we have a normalized mantissa m. The new power can be calculated as p' = p - 1 and belongs in the interval ]-127, 126]. We then calculate m/2 and represent it starting from the right, losing any bits that cannot be included in the 23 bits of the mantissa.
If p = -126, then in calculating the half of this number we pass into a denormalized mantissa. We represent p = -127, calculate half of the mantissa and represent it again starting from the right, losing any information that cannot be included.
Finally, if p = -127 we have a denormalized mantissa. As long as m/2 can be represented in the number of bits available in the mantissa without losing information, we represent that and keep p = -127. In any other case we represent the number as a positive or negative 0, depending on b_0.
Any steps I have missed, any improvements (I am sure there are) that can be made, or anything that seems completely wrong?
I implemented a divide by two algorithm in Java and verified it for all 32-bit inputs. I tried to follow your pseudocode, but there were three places where I diverged. First, the infinity/NaN exponent is 128. Second, in case 4 (normal -> normal), there's no need to operate on the fraction. Third, you didn't describe how round half to even works when you do operate on the fraction. LGTM otherwise.
public final class FloatDivision {
    public static float divideFloatByTwo(float value) {
        int bits = Float.floatToIntBits(value);
        int sign = bits >>> 31;
        int biased_exponent = (bits >>> 23) & 0xff;
        int exponent = biased_exponent - 127;
        int fraction = bits & 0x7fffff;
        if (exponent == 128) {
            // value is NaN or infinity
        } else if (exponent == -126) {
            // value is normal, but result is subnormal
            biased_exponent = 0;
            fraction = divideNonNegativeIntByTwo(0x800000 | fraction);
        } else if (exponent == -127) {
            // value is subnormal or zero
            fraction = divideNonNegativeIntByTwo(fraction);
        } else {
            // value and result are normal
            biased_exponent--;
        }
        return Float.intBitsToFloat((sign << 31) | (biased_exponent << 23) | fraction);
    }

    private static int divideNonNegativeIntByTwo(int value) {
        // round half to even
        return (value >>> 1) + ((value >>> 1) & value & 1);
    }

    public static void main(String[] args) {
        int bits = Integer.MIN_VALUE;
        do {
            if (bits % 0x800000 == 0) {
                System.out.println(bits);
            }
            float value = Float.intBitsToFloat(bits);
            if (Float.floatToIntBits(divideFloatByTwo(value)) != Float.floatToIntBits(value / 2)) {
                System.err.println(bits);
                break;
            }
        } while (++bits != Integer.MIN_VALUE);
    }
}

Binary to decimal (on huge numbers)

I am building a C library for big integer numbers. Basically, I'm seeking a fast algorithm to convert any integer from its binary representation to a decimal one.
I saw the JDK's BigInteger.toString() implementation, but it looks quite heavy to me, as it was made to convert the number to any radix (it uses a division for each digit, which should be pretty slow while dealing with thousands of digits).
So if you have any documentations / knowledge to share about it, I would be glad to read it.
EDIT: more precisions about my question:
Let P a memory address
Let N be the number of bytes allocated (and set) at P
How to convert the integer represented by the N bytes at address P (let's say in little endian to make things simpler), to a C string
Example:
N = 1
P = some random memory address storing '00101010'
out string = "42"
Thanks for your answers nonetheless.
The reason the BigInteger.toString method looks heavy is that it does the conversion in chunks.
A trivial algorithm would take the last digit and then divide the whole big integer by the radix until there is nothing left.
One problem with this is that a big integer division is quite expensive, so the number is subdivided into chunks that can be processed with regular integer division (as opposed to BigInt division):
static String toDecimal(BigInteger bigInt) {
    // process nine decimal digits per BigInteger division
    BigInteger chunker = BigInteger.valueOf(1000000000);
    StringBuilder sb = new StringBuilder();
    do {
        int current = bigInt.mod(chunker).intValue();
        bigInt = bigInt.divide(chunker);
        for (int i = 0; i < 9; i++) {
            sb.append((char) ('0' + current % 10));
            current /= 10;
            if (current == 0 && bigInt.signum() == 0) {
                break;
            }
        }
    } while (bigInt.signum() != 0);
    return sb.reverse().toString();
}
That said, for a fixed radix, you are probably even better off with porting the "double dabble" algorithm to your needs, as suggested in the comments: https://en.wikipedia.org/wiki/Double_dabble
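For reference, a compact sketch of double dabble over an unpacked BCD buffer might look like this (an adaptation for illustration, not code from the linked article; the input bytes are taken most significant first):
#include <cstdint>
#include <string>
#include <vector>

std::string double_dabble(const std::vector<uint8_t>& bytes)      // bytes[0] = most significant byte
{
    std::vector<uint8_t> digit(bytes.size() * 302 / 125 + 1, 0);  // ~2.41 decimal digits per byte
    for (uint8_t byte : bytes)
        for (int bit = 7; bit >= 0; --bit) {
            for (uint8_t& d : digit)
                if (d >= 5) d += 3;                               // classic "dabble" adjustment
            int carry = (byte >> bit) & 1;                        // shift in the next input bit
            for (size_t i = digit.size(); i-- > 0; ) {            // digit.back() is least significant
                int v = (digit[i] << 1) | carry;
                digit[i] = v & 0x0F;
                carry = v >> 4;
            }
        }
    std::string out;
    size_t i = 0;
    while (i + 1 < digit.size() && digit[i] == 0) ++i;            // strip leading zeros
    for (; i < digit.size(); ++i) out += char('0' + digit[i]);
    return out;
}
Each input bit costs one pass over the BCD digits, so this is still O(bits × digits), but it avoids big-integer division entirely.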
I recently got the challenge to print a big Mersenne prime: 2**82589933-1. On my CPU that takes ~40 minutes with apcalc and ~120 minutes with Python 2.7. It's a number with a bit over 24 million digits.
Here is my own little C code for the conversion:
// print 2**82589933-1
#include <stdio.h>
#include <math.h>
#include <stdint.h>
#include <inttypes.h>
#include <string.h>
const uint32_t exponent = 82589933;
//const uint32_t exponent = 100;
//outputs 1267650600228229401496703205375
const uint32_t blocks = (exponent + 31) / 32;
const uint32_t digits = (int)(exponent * log(2.0) / log(10.0)) + 10;
uint32_t num[2][blocks];
char out[digits + 1];
// blocks : number of uint32_t in num1 and num2
// num1 : number to convert
// num2 : free space
// out : end of output buffer
void conv(uint32_t blocks, uint32_t *num1, uint32_t *num2, char *out) {
if (blocks == 0) return;
const uint32_t div = 1000000000;
uint64_t t = 0;
for (uint32_t i = 0; i < blocks; ++i) {
t = (t << 32) + num1[i];
num2[i] = t / div;
t = t % div;
}
for (int i = 0; i < 9; ++i) {
*out-- = '0' + (t % 10);
t /= 10;
}
if (num2[0] == 0) {
--blocks;
num2++;
}
conv(blocks, num2, num1, out);
}
int main() {
// prepare number
uint32_t t = exponent % 32;
num[0][0] = (1LLU << t) - 1;
memset(&num[0][1], 0xFF, (blocks - 1) * 4);
// prepare output
memset(out, '0', digits);
out[digits] = 0;
// convert to decimal
conv(blocks, num[0], num[1], &out[digits - 1]);
// output number
char *res = out;
while(*res == '0') ++res;
printf("%s\n", res);
return 0;
}
The conversion is destructive and tail recursive. In each step it divides num1 by 1_000_000_000 and stores the result in num2. The remainder is added to out. Then it calls itself with num1 and num2 switched and often shortened by one (blocks is decremented). out is filled from back to front. You have to allocate it large enough and then strip leading zeroes.
Python seems to be using a similar mechanism for converting big integers to decimal.
Want to do better?
For a large number like in my case, each division by 1_000_000_000 takes rather long. At a certain size a divide&conquer algorithm does better. In my case the first division would be by 10^16777216, splitting the number into quotient and remainder. Then convert each part separately. Now each part is still big, so split again at 10^8388608. Recursively keep splitting till the numbers are small enough, say maybe 1024 digits each. Those are converted with the simple algorithm above. The right definition of "small enough" would have to be tested; 1024 is just a guess.
While the long division of two big integer numbers is expensive, much more so than a division by 1_000_000_000, the time spent there is then saved because each separate chunk requires far fewer divisions by 1_000_000_000 to convert to decimal.
And if you have split the problem into separate and independent chunks it's only a tiny step away from spreading the chunks out among multiple cores. That would really speed up the conversion another step. It looks like apcalc uses divide&conquer but not multi-threading.
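To make the structure concrete, here is a rough sketch of that divide & conquer scheme using Boost.Multiprecision's cpp_int as the big-integer type (an assumption; the 1024-digit cutoff and the use of str() for the base case are placeholders, and the splitting powers of ten would be cached in real code):
#include <boost/multiprecision/cpp_int.hpp>
#include <cstddef>
#include <string>
using boost::multiprecision::cpp_int;

// Convert n (with 0 <= n < 10^digits) to exactly `digits` characters, zero-padded on the left.
static std::string to_dec(const cpp_int& n, std::size_t digits)
{
    if (digits <= 1024) {                                    // "small enough": simple conversion
        std::string s = n.str();
        if (s.size() < digits) s.insert(0, digits - s.size(), '0');
        return s;
    }
    std::size_t half = digits / 2;
    cpp_int splitter = pow(cpp_int(10), (unsigned)half);     // split point 10^half
    return to_dec(n / splitter, digits - half)               // high part
         + to_dec(n % splitter, half);                       // low part, kept zero-padded
}
The top-level caller passes an upper bound on the digit count and strips leading zeros from the result, just like the out buffer above.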

Algorithm Challenge: Arbitrary in-place base conversion for lossless string compression

It might help to start out with a real world example. Say I'm writing a web app that's backed by MongoDB, so my records have a long hex primary key, making the URL to view a record look like /widget/55c460d8e2d6e59da89d08d0. That seems excessively long. URLs can use many more characters than that. While there are just under 8 x 10^28 (16^24) possible values in a 24-digit hex number, if you limit yourself just to the characters matched by a [a-zA-Z0-9] regex class (a YouTube video id uses more), 62 characters, you can get past 8 x 10^28 in only 17 characters.
I want an algorithm that will convert any string that is limited to a specific alphabet of characters to any other string with another alphabet of characters, where the value of each character c could be thought of as alphabet.indexOf(c).
Something of the form:
convert(value, sourceAlphabet, destinationAlphabet)
Assumptions
all parameters are strings
every character in value exists in sourceAlphabet
every character in sourceAlphabet and destinationAlphabet is unique
Simplest example
var hex = "0123456789abcdef";
var base10 = "0123456789";
var result = convert("12245589", base10, hex); // result is "bada55";
But I also want it to work to convert War & Peace from the Russian alphabet plus some punctuation to the entire unicode charset and back again losslessly.
Is this possible?
The only way I was ever taught to do base conversions in Comp Sci 101 was to first convert to a base ten integer by summing digit * base^position and then doing the reverse to convert to the target base. Such a method is insufficient for the conversion of very long strings, because the integers get too big.
It certainly feels intuitively that a base conversion could be done in place, as you step through the string (probably backwards to maintain standard significant digit order), keeping track of a remainder somehow, but I'm not smart enough to work out how.
That's where you come in, StackOverflow. Are you smart enough?
Perhaps this is a solved problem, done on paper by some 18th century mathematician, implemented in LISP on punch cards in 1970 and the first homework assignment in Cryptography 101, but my searches have borne no fruit.
I'd prefer a solution in javascript with a functional style, but any language or style will do, as long as you're not cheating with some big integer library. Bonus points for efficiency, of course.
Please refrain from criticizing the original example. The general nerd cred of solving the problem is more important than any application of the solution.
Here is a solution in C that is very fast, using bit shift operations. It assumes that you know what the length of the decoded string should be. The strings are vectors of integers in the range 0..maximum for each alphabet. It is up to the user to convert to and from strings with restricted ranges of characters. As for the "in-place" in the question title, the source and destination vectors can overlap, but only if the source alphabet is not larger than the destination alphabet.
/*
recode version 1.0, 22 August 2015
Copyright (C) 2015 Mark Adler
This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
Mark Adler
madler#alumni.caltech.edu
*/
/* Recode a vector from one alphabet to another using intermediate
variable-length bit codes. */
/* The approach is to use a Huffman code over equiprobable alphabets in two
directions. First to encode the source alphabet to a string of bits, and
second to encode the string of bits to the destination alphabet. This will
be reasonably close to the efficiency of base-encoding with arbitrary
precision arithmetic. */
#include <stddef.h> // size_t
#include <limits.h> // UINT_MAX, ULLONG_MAX
#if UINT_MAX == ULLONG_MAX
# error recode() assumes that long long has more bits than int
#endif
/* Take a list of integers source[0..slen-1], all in the range 0..smax, and
code them into dest[0..*dlen-1], where each value is in the range 0..dmax.
*dlen returns the length of the result, which will not exceed the value of
*dlen when called. If the original *dlen is not large enough to hold the
full result, then recode() will return non-zero to indicate failure.
Otherwise recode() will return 0. recode() will also return non-zero if
either of the smax or dmax parameters are less than one. The non-zero
return codes are 1 if *dlen is not long enough, 2 for invalid parameters,
and 3 if any of the elements of source are greater than smax.
Using this same operation on the result with smax and dmax reversed reverses
the operation, restoring the original vector. However there may be more
symbols returned than the original, so the number of symbols expected needs
to be known for decoding. (An end symbol could be appended to the source
alphabet to include the length in the coding, but then encoding and decoding
would no longer be symmetric, and the coding efficiency would be reduced.
This is left as an exercise for the reader if that is desired.) */
int recode(unsigned *dest, size_t *dlen, unsigned dmax,
const unsigned *source, size_t slen, unsigned smax)
{
// compute sbits and scut, with which we will recode the source with
// sbits-1 bits for symbols < scut, otherwise with sbits bits (adding scut)
if (smax < 1)
return 2;
unsigned sbits = 0;
unsigned scut = 1; // 2**sbits
while (scut && scut <= smax) {
scut <<= 1;
sbits++;
}
scut -= smax + 1;
// same thing for dbits and dcut
if (dmax < 1)
return 2;
unsigned dbits = 0;
unsigned dcut = 1; // 2**dbits
while (dcut && dcut <= dmax) {
dcut <<= 1;
dbits++;
}
dcut -= dmax + 1;
// recode a base smax+1 vector to a base dmax+1 vector using an
// intermediate bit vector (a sliding window of that bit vector is kept in
// a bit buffer)
unsigned long long buf = 0; // bit buffer
unsigned have = 0; // number of bits in bit buffer
size_t i = 0, n = 0; // source and dest indices
unsigned sym; // symbol being encoded
for (;;) {
// encode enough of source into bits to encode that to dest
while (have < dbits && i < slen) {
sym = source[i++];
if (sym > smax) {
*dlen = n;
return 3;
}
if (sym < scut) {
buf = (buf << (sbits - 1)) + sym;
have += sbits - 1;
}
else {
buf = (buf << sbits) + sym + scut;
have += sbits;
}
}
// if not enough bits to assure one symbol, then break out to a special
// case for coding the final symbol
if (have < dbits)
break;
// encode one symbol to dest
if (n == *dlen)
return 1;
sym = buf >> (have - dbits + 1);
if (sym < dcut) {
dest[n++] = sym;
have -= dbits - 1;
}
else {
sym = buf >> (have - dbits);
dest[n++] = sym - dcut;
have -= dbits;
}
buf &= ((unsigned long long)1 << have) - 1;
}
// if any bits are left in the bit buffer, encode one last symbol to dest
if (have) {
if (n == *dlen)
return 1;
sym = buf;
sym <<= dbits - 1 - have;
if (sym >= dcut)
sym = (sym << 1) - dcut;
dest[n++] = sym;
}
// return recoded vector
*dlen = n;
return 0;
}
/* Test recode(). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <assert.h>
// Return a random vector of len unsigned values in the range 0..max.
static void ranvec(unsigned *vec, size_t len, unsigned max) {
unsigned bits = 0;
unsigned long long mask = 1;
while (mask <= max) {
mask <<= 1;
bits++;
}
mask--;
unsigned long long ran = 0;
unsigned have = 0;
size_t n = 0;
while (n < len) {
while (have < bits) {
ran = (ran << 31) + random();
have += 31;
}
if ((ran & mask) <= max)
vec[n++] = ran & mask;
ran >>= bits;
have -= bits;
}
}
// Get a valid number from str and assign it to var
#define NUM(var, str) \
do { \
char *end; \
unsigned long val = strtoul(str, &end, 0); \
var = val; \
if (*end || var != val) { \
fprintf(stderr, \
"invalid or out of range numeric argument: %s\n", str); \
return 1; \
} \
} while (0)
/* "bet n m len count" generates count test vectors of length len, where each
entry is in the range 0..n. Each vector is recoded to another vector using
only symbols in the range 0..m. That vector is recoded back to a vector
using only symbols in 0..n, and that result is compared with the original
random vector. Report on the average ratio of input and output symbols, as
compared to the optimal ratio for arbitrary precision base encoding. */
int main(int argc, char **argv)
{
// get sizes of alphabets and length of test vector, compute maximum sizes
// of recoded vectors
unsigned smax, dmax, runs;
size_t slen, dsize, bsize;
if (argc != 5) { fputs("need four arguments\n", stderr); return 1; }
NUM(smax, argv[1]);
NUM(dmax, argv[2]);
NUM(slen, argv[3]);
NUM(runs, argv[4]);
dsize = ceil(slen * ceil(log2(smax + 1.)) / floor(log2(dmax + 1.)));
bsize = ceil(dsize * ceil(log2(dmax + 1.)) / floor(log2(smax + 1.)));
// generate random test vectors, encode, decode, and compare
srandomdev();
unsigned source[slen], dest[dsize], back[bsize];
unsigned mis = 0, i;
unsigned long long dtot = 0;
int ret;
for (i = 0; i < runs; i++) {
ranvec(source, slen, smax);
size_t dlen = dsize;
ret = recode(dest, &dlen, dmax, source, slen, smax);
if (ret) {
fprintf(stderr, "encode error %d\n", ret);
break;
}
dtot += dlen;
size_t blen = bsize;
ret = recode(back, &blen, smax, dest, dlen, dmax);
if (ret) {
fprintf(stderr, "decode error %d\n", ret);
break;
}
if (blen < slen || memcmp(source, back, slen)) // blen > slen is ok
mis++;
}
if (mis)
fprintf(stderr, "%u/%u mismatches!\n", mis, i);
if (ret == 0)
printf("mean dest/source symbols = %.4f (optimal = %.4f)\n",
dtot / (i * (double)slen), log(smax + 1.) / log(dmax + 1.));
return 0;
}
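To tie this back to the convert(value, sourceAlphabet, destinationAlphabet) signature from the question, a thin hypothetical C++ wrapper around the recode() listing above might look like this; the output-length bound is deliberately generous rather than exact:
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// recode() from the listing above (add extern "C" if it is compiled separately as C)
int recode(unsigned *dest, std::size_t *dlen, unsigned dmax,
           const unsigned *source, std::size_t slen, unsigned smax);

std::string convert(const std::string& value,
                    const std::string& sourceAlphabet,
                    const std::string& destinationAlphabet)
{
    std::vector<unsigned> src;
    for (char c : value) {
        std::size_t idx = sourceAlphabet.find(c);
        if (idx == std::string::npos)
            throw std::invalid_argument("character not in source alphabet");
        src.push_back((unsigned)idx);
    }
    // each source symbol adds at most 32 bits, each destination symbol consumes at least 1
    std::size_t dlen = value.size() * 32 + 64;               // generous bound, fine for a sketch
    std::vector<unsigned> dst(dlen);
    if (recode(dst.data(), &dlen, (unsigned)destinationAlphabet.size() - 1,
               src.data(), src.size(), (unsigned)sourceAlphabet.size() - 1))
        throw std::runtime_error("recode failed");
    std::string out;
    for (std::size_t i = 0; i < dlen; ++i)
        out += destinationAlphabet[dst[i]];
    return out;
}
As the header comment of recode() warns, converting back with the alphabets swapped can yield extra trailing symbols, so the decoder needs to know the original length and drop the excess.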
As has been pointed out in other StackOverflow answers, try not to think of summing digit * base^position as converting it to base ten; rather, think of it as directing the computer to generate a representation of the quantity represented by the number in its own terms (for most computers probably closer to our concept of base 2). Once the computer has its own representation of the quantity, we can direct it to output the number in any way we like.
By rejecting "big integer" implementations and asking for letter-by-letter conversion you are at the same time arguing that the numerical/alphabetical representation of quantity is not actually what it is, namely that each position represents a quantity of digit * base^position. If the nine-millionth character of War and Peace does represent what you are asking to convert it from, then the computer at some point will need to generate a representation for Д * 33^9000000.
I don't think any solution can work generally, because if n^e != m for every integer e, then there's no way to calculate the value of the target base at a certain place p once n^p > MAX_INT.
You can get away with this for the case where n^e == m for some e, because the problem is recursively doable (the first e digits in base n can be summed and converted into the first digit of base m, then chopped off, and the process repeated).
If you don't have this useful property, then eventually you're going to have to take some part of the original number and perform a modulus by n^p, and n^p is going to be greater than MAX_INT, which means it's impossible.
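As a tiny illustration of the n^e == m case: with n = 4 and m = 16 = 4^2, each pair of base-4 digits maps to one hexadecimal digit independently of the rest of the string, so no big integers are needed. The moment m is not an integer power of n, that locality disappears and you run into the overflow problem described above.
#include <cstddef>
#include <string>

std::string base4_to_hex(std::string s)            // assumes s contains only '0'..'3'
{
    if (s.size() % 2) s.insert(s.begin(), '0');    // pad to an even number of digits
    std::string out;
    for (std::size_t i = 0; i < s.size(); i += 2) {
        int v = (s[i] - '0') * 4 + (s[i + 1] - '0');
        out += "0123456789abcdef"[v];
    }
    return out;
}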

Floating Point Divider Hardware Implementation Details

I am trying to implement a 32-bit floating point divider in hardware and I am wondering if I can get any suggestions as to some tradeoffs between different algorithms?
My floating point unit currently supports multiplication and addition/subtraction, but I am not going to switch it to a fused multiply-add (FMA) floating point architecture, since this is an embedded platform where I am trying to minimize area usage.
A very long time ago I came across this neat and easy-to-implement float/fixed-point division algorithm used in military FPUs of that time period:
the input must be unsigned and shifted so that x < y and both are in the range <0.5 ; 1>
don't forget to store the difference of the shifts sh = shx - shy and the original signs
find f (by iterating) so that y*f -> 1 .... after that x*f -> x/y, which is the division result
shift x*f back by sh and restore the result sign (sig = sigx*sigy)
The x*f can be computed easily like this:
z=1-y
(x*f) = (x/y) = x*(1+z)*(1+z^2)*(1+z^4)*(1+z^8)*(1+z^16)...(1+z^(2^n))
where
n = log2(number of fractional bits for fixed point, or mantissa bit size for floating point)
You can also stop when z^(2^n) becomes zero on fixed-bit-width data types.
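Since the steps above apply to fixed point as well, here is a small Q16.16 sketch of the same iteration (an illustration separate from the float code below); it assumes the operands are already shifted so that 0.5 <= x < y < 1:
#include <cstdint>

// Q16.16 division by multiplication only: x/y = x*(1+z)*(1+z^2)*(1+z^4)*... with z = 1-y
int32_t fx_div(int32_t x, int32_t y)            // both in [0x8000, 0x10000), x < y
{
    auto mul = [](int64_t a, int64_t b) { return (int32_t)((a * b) >> 16); };  // Q16.16 multiply
    const int32_t one = 1 << 16;
    int32_t z = one - y;                        // z = 1 - y, in (0, 0.5)
    int32_t r = mul(x, one + z);                // x * (1 + z)
    for (;;) {
        z = mul(z, z);                          // z -> z^2
        if (z == 0) break;                      // stop once z underflows to zero
        r = mul(r, one + z);
    }
    return r;                                   // ~ x / y in Q16.16
}
Each pass is one squaring plus one multiply, and since z < 0.5 it underflows to zero after a handful of squarings for 16 fractional bits.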
[Edit2] I had a bit of time & mood for this, so here is a 32-bit IEEE 754 C++ implementation.
I removed the old (bignum) examples to avoid confusion for future readers (they are still accessible in the edit history if needed).
//---------------------------------------------------------------------------
// IEEE 754 single masks
const DWORD _f32_sig =0x80000000; // sign
const DWORD _f32_exp =0x7F800000; // exponent
const DWORD _f32_exp_sig=0x40000000; // exponent sign
const DWORD _f32_exp_bia=0x3F800000; // exponent bias
const DWORD _f32_exp_lsb=0x00800000; // exponent LSB
const DWORD _f32_exp_pos= 23; // exponent LSB bit position
const DWORD _f32_man =0x007FFFFF; // mantisa
const DWORD _f32_man_msb=0x00400000; // mantisa MSB
const DWORD _f32_man_bits= 23; // mantisa bits
//---------------------------------------------------------------------------
float f32_div(float x,float y)
{
union _f32 // float bits access
{
float f; // 32bit floating point
DWORD u; // 32 bit uint
};
_f32 xx,yy,zz; int sh; DWORD zsig; float z;
// result signum abs value
xx.f=x; zsig =xx.u&_f32_sig; xx.u&=(0xFFFFFFFF^_f32_sig);
yy.f=y; zsig^=yy.u&_f32_sig; yy.u&=(0xFFFFFFFF^_f32_sig);
// initial exponent difference sh and normalize exponents to speed up shift in range
sh =0;
sh-=((xx.u&_f32_exp)>>_f32_exp_pos)-(_f32_exp_bia>>_f32_exp_pos); xx.u&=(0xFFFFFFFF^_f32_exp); xx.u|=_f32_exp_bia;
sh+=((yy.u&_f32_exp)>>_f32_exp_pos)-(_f32_exp_bia>>_f32_exp_pos); yy.u&=(0xFFFFFFFF^_f32_exp); yy.u|=_f32_exp_bia;
// shift input in range
while (xx.f> 1.0f) { xx.f*=0.5f; sh--; }
while (xx.f< 0.5f) { xx.f*=2.0f; sh++; }
while (yy.f> 1.0f) { yy.f*=0.5f; sh++; }
while (yy.f< 0.5f) { yy.f*=2.0f; sh--; }
while (xx.f<=yy.f) { yy.f*=0.5f; sh++; }
// divider block
z=(1.0f-yy.f);
zz.f=xx.f*(1.0f+z);
for (;;)
{
z*=z; if (z==0.0f) break;
zz.f*=(1.0f+z);
}
// shift result back
for (;sh>0;) { sh--; zz.f*=0.5f; }
for (;sh<0;) { sh++; zz.f*=2.0f; }
// set signum
zz.u&=(0xFFFFFFFF^_f32_sig);
zz.u|=zsig;
return zz.f;
}
//---------------------------------------------------------------------------
I wanted to keep it simple, so it is not optimized yet. You can for example replace all the *=0.5 and *=2.0 with exponent increments/decrements ... If you compare with the FPU result of the float operator /, this will be a bit less precise, because most FPUs compute on an 80-bit internal format and this implementation uses only 32 bits.
As you can see, the only FPU operations used are +, -, *. This can be sped up by using fast sqr algorithms like
Fast bignum square computation
especially if you want to use big bit widths ...
Do not forget to implement normalization and/or overflow/underflow correction.

How to find a binary logarithm very fast? (O(1) at best)

Is there any very fast method to find a binary logarithm of an integer number? For example, given a number
x=52656145834278593348959013841835216159447547700274555627155488768 the algorithm must find y=log(x,2), which is 215. x is always a power of 2.
The problem seems to be really simple. All that is required is to find the position of the most significant 1 bit. There is a well-known method, FloorLog, but it is not very fast, especially for very long multi-word integers.
What is the fastest method?
A quick hack: Most floating-point number representations automatically normalise values, meaning that they effectively perform the loop Christoffer Hammarström mentioned in hardware. So simply converting from an integer to FP and extracting the exponent should do the trick, provided the numbers are within the FP representation's exponent range! (In your case, your integer input requires multiple machine words, so multiple "shifts" will need to be performed in the conversion.)
If the integers are stored in a uint32_t a[], then my obvious solution would be as follows:
Run a linear search over a[] to find the most significant non-zero uint32_t word a[i] in a[] (test using uint64_t reads for that search if your machine has native uint64_t support)
Apply the bit twiddling hacks to find the binary log b of the uint32_t value a[i] you found in step 1.
Evaluate 32*i+b.
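A short sketch of those three steps, using GCC/Clang's __builtin_clz for the step-2 bit twiddling (a substitution, not part of the answer) and assuming a[0] is the least significant word:
#include <cstddef>
#include <cstdint>

int floor_log2(const uint32_t a[], std::size_t words)        // a[0] = least significant word
{
    for (std::size_t i = words; i-- > 0; )                    // step 1: most significant non-zero word
        if (a[i] != 0)
            return 32 * (int)i + (31 - __builtin_clz(a[i]));  // steps 2 and 3 combined
    return -1;                                                // all words zero: log2 undefined
}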
The answer is implementation or language dependent. Any implementation can store the number of significant bits along with the data, as it is often useful. If it must be calculated, then find the most significant word/limb and the most significant bit in that word.
If you're using fixed-width integers then the other answers already have you pretty-well covered.
If you're using arbitrarily large integers, like int in Python or BigInteger in Java, then you can take advantage of the fact that their variable-size representation uses an underlying array, so the base-2 logarithm can be computed easily and quickly in O(1) time using the length of the underlying array. The base-2 logarithm of a power of 2 is simply one less than the number of bits required to represent the number.
So when n is an integer power of 2:
In Python, you can write n.bit_length() - 1 (docs).
In Java, you can write n.bitLength() - 1 (docs).
You can create an array of logarithms beforehand. This will find logarithmic values up to log(N):
#define N 100000
int naj[N+1];
naj[2] = 1;
for ( int i = 3; i <= N; i++ )
{
    naj[i] = naj[i-1];
    if ( (1 << (naj[i]+1)) <= i )
        naj[i]++;
}
The array naj holds your logarithm values, where naj[k] = log(k).
The logarithm is base two.
This uses binary search for finding the closest power of 2.
public static int binLog(int x,boolean shouldRoundResult){
    // assuming 32-bit integer
    int lo=0;
    int hi=31;
    int rangeDelta=hi-lo;
    int expGuess=0;
    int guess;
    while(rangeDelta>1){
        expGuess=(lo+hi)/2; // or (loGuess+hiGuess)>>1
        guess=1<<expGuess;
        if(guess<x){
            lo=expGuess;
        } else if(guess>x){
            hi=expGuess;
        } else {
            lo=hi=expGuess;
        }
        rangeDelta=hi-lo;
    }
    if(shouldRoundResult && hi>lo){
        int loGuess=1<<lo;
        int hiGuess=1<<hi;
        int loDelta=Math.abs(x-loGuess);
        int hiDelta=Math.abs(hiGuess-x);
        if(loDelta<hiDelta)
            expGuess=lo;
        else
            expGuess=hi;
    } else {
        expGuess=lo;
    }
    int result=expGuess;
    return result;
}
The best option off the top of my head would be an O(log(log n)) approach, using binary search. Here is an example for a 64-bit (<= 2^63 - 1) number (in C++):
#include <cstdint> // int64_t, uint64_t

int log2(int64_t num) {
    int res = 0;
    for (int i = 32; i > 0; i >>= 1) {                    // binary search on the trailing-zero count
        res += i;
        if ((((uint64_t)1 << res) - 1) & (uint64_t)num)   // low res bits of num not all zero?
            res -= i;
    }
    return res;
}
This algorithm basically provides the highest number res such that ((2^res - 1) & num) == 0. Of course, for any number, you can work it out in a similar manner:
int log2_better(int64_t num) {
    int res = 0;
    for (int i = 32; i > 0; i >>= 1) {
        if ((1ULL << (res + i)) <= (uint64_t)num)
            res += i;
    }
    return res;
}
Note that this method relies on the fact that the "bitshift" operation is more or less O(1). If this is not the case, you would have to precompute either all the powers of 2, or the numbers of the form 2^(2^i) (2^1, 2^2, 2^4, 2^8, etc.) and do some multiplications (which in this case aren't O(1)).
The example in the OP is an integer string of 65 characters, which is not representable by an INT64 or even an INT128. It is still very easy to get Log(2,x) from this string by converting it to a double-precision number. This at least gives you easy access to integers up to 2^1023.
Below you find some form of pseudocode
# 1. read the string
string="52656145834278593348959013841835216159447547700274555627155488768"
# 2. extract the length of the string
l=length(string) # l = 65
# 3. read the first min(l,17) digits in a float
float=to_float(string(1: min(17,l) ))
# 4. multiply with the correct power of 10
float = float * 10^(l-min(17,l) ) # float = 5.2656145834278593E64
# 5. Take the log2 of this number and round to the nearest integer
log2 = Round( Log(float,2) ) # 215
Note:
Some computer languages can convert arbitrary strings into a double-precision number, so steps 2, 3 and 4 could be replaced by x=to_float(string).
Step 5 could be done quicker by just reading the double-precision exponent (bits 52 up to and including 62) and subtracting 1023 from it.
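In C++ that shortcut might look like the following sketch; it assumes the platform's double is IEEE 754 binary64 and that x is positive:
#include <cstdint>
#include <cstring>

int log2_from_double(double x)                  // x > 0; exact when x is a power of two
{
    uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);        // bit pattern of the IEEE 754 double
    return (int)((bits >> 52) & 0x7FF) - 1023;  // biased exponent (bits 52..62) minus the bias
}
If the whole string is converted in one go (the x=to_float(string) variant above), 2^215 converts exactly and this returns 215.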
Quick example code: If you have awk you can quickly test this algorithm.
The following code creates the first 300 powers of two:
awk 'BEGIN{for(n=0;n<300; n++) print 2^n}'
The following reads the input and does the above algorithm:
awk '{ l=length($0); m = (l > 17 ? 17 : l)
       x = substr($0,1,m) * 10^(l-m)
       print log(x)/log(2)
     }'
So the following bash-command is a convoluted way to create a consecutive list of numbers from 0 to 299:
$ awk 'BEGIN{for(n=0;n<300; n++) print 2^n}' | awk '{ l=length($0); m = (l > 17 ? 17 : l); x = substr($0,1,m) * 10^(l-m); print log(x)/log(2) }'
0
1
2
...
299

Resources