IEEE-754 double precision and splitting method - precision

When you compute elementary functions, you apply corrections involving constants, especially in the implementation of exp(x). In all these implementations any correction with ln(2) is done in two steps: ln(2) is split into two numbers:
static const double ln2p1 = 0.693145751953125;
static const double ln2p2 = 1.42860682030941723212E-6;
// then ln(2) = ln2p1 + ln2p2
Then any computation with ln(2) is done by:
blablabla -= ln2p1
blablabla -= ln2p2
I know it is to avoid rounding effects. But why these two numbers specifically? Does anyone have an idea how to obtain these two numbers?
Thank you!
Following the first comment, I am completing this post with more material and a very weird question. I worked with my team and we agree the point is to potentially double the precision by splitting the number ln(2) into two numbers. For this, two transformations are applied:
1) c_h = floor(2^k ln(2))/2^k
2) c_l = ln(2) - c_h
Here k indicates the precision. It looks like in the Cephes library (~1980) k was fixed at 9 for float, 16 for double, and also 16 for long double (why, I do not know). So for double, c_h has a precision of 16 bits, while c_l keeps 52 bits.
From this, I wrote the following program and determined c_h with 52-bit precision.
#include <iostream>
#include <iomanip>
#include <cmath>
#include <cstdint>
#include <cstdlib>

enum precision { nine = 9, sixteen = 16, fiftytwo = 52 };

int64_t k_helper(double x) {
    return std::floor(x / std::log(2.0));
}

template <class C>
double z_helper(double x, int64_t k) {
    x -= k * C::c_h;
    x -= k * C::c_l;
    return x;
}

template <precision p>
struct coeff {};

template <>
struct coeff<nine> {
    static constexpr double c_h = 0.693359375;
    static constexpr double c_l = -2.12194440e-4;
};

template <>
struct coeff<sixteen> {
    static constexpr double c_h = 6.93145751953125E-1;
    static constexpr double c_l = 1.42860682030941723212E-6;
};

template <>
struct coeff<fiftytwo> {
    static constexpr double c_h = 0.6931471805599453972490664455108344554901123046875;
    static constexpr double c_l = -8.78318343240526578874146121703272447458793199905066E-17;
};

int main(int argc, const char* argv[]) {
    double x = std::atof(argv[1]);
    int64_t k = k_helper(x);
    double z_9  = z_helper<coeff<nine> >(x, k);
    double z_16 = z_helper<coeff<sixteen> >(x, k);
    double z_52 = z_helper<coeff<fiftytwo> >(x, k);
    std::cout << std::setprecision(16) << " 9 bits precisions " << z_9 << "\n"
              << " 16 bits precisions " << z_16 << "\n"
              << " 52 bits precisions " << z_52 << "\n";
    return 0;
}
If I now compute this for a set of different values, I get:
bash-3.2$ g++ -std=c++11 main.cpp
bash-3.2$ ./a.out 1
9 bits precisions 0.30685281944
16 bits precisions 0.3068528194400547
52 bits precisions 0.3068528194400547
bash-3.2$ ./a.out 2
9 bits precisions 0.61370563888
16 bits precisions 0.6137056388801094
52 bits precisions 0.6137056388801094
bash-3.2$ ./a.out 100
9 bits precisions 0.18680599936
16 bits precisions 0.1868059993678755
52 bits precisions 0.1868059993678755
bash-3.2$ ./a.out 200
9 bits precisions 0.37361199872
16 bits precisions 0.3736119987357509
52 bits precisions 0.3736119987357509
bash-3.2$ ./a.out 300
9 bits precisions 0.56041799808
16 bits precisions 0.5604179981036264
52 bits precisions 0.5604179981036548
bash-3.2$ ./a.out 400
9 bits precisions 0.05407681688
16 bits precisions 0.05407681691155647
52 bits precisions 0.05407681691155469
bash-3.2$ ./a.out 500
9 bits precisions 0.24088281624
16 bits precisions 0.2408828162794319
52 bits precisions 0.2408828162794586
bash-3.2$ ./a.out 600
9 bits precisions 0.4276888156
16 bits precisions 0.4276888156473074
52 bits precisions 0.4276888156473056
bash-3.2$ ./a.out 700
9 bits precisions 0.61449481496
16 bits precisions 0.6144948150151828
52 bits precisions 0.6144948150151526
It looks like when x becomes larger than 300 a difference appears. I had a look at the implementation in glibc:
http://osxr.org:8080/glibc/source/sysdeps/ieee754/ldbl-128/s_expm1l.c
Presently it uses the 16-bit precision constant for c_h (line 84).
Well, I am probably missing something with the IEEE standard, and I cannot imagine a precision error in glibc. What do you think?
Best,

ln2p1 is exactly 45426/65536. This can be obtained by round(65536 * ln(2)). ln2p2 is simply the remainder, ln(2) - ln2p1. So what is special about the two numbers is the denominator 65536 (2^16).
From what I found, most algorithms using this constant can be traced back to the Cephes library, which was first released in 1984, when 16-bit computing was still dominant; that probably explains why 2^16 was chosen.

Related

Floats in Visual Studio

Consider a single-precision floating-point number system conforming to the IEEE 754 standard. In Visual Studio, the floating-point model (/fp) was set to Strict.
struct FP {
    unsigned char a : 8;
    unsigned char b : 8;
    unsigned char c : 8;
    unsigned char d : 8;
} *fp;

fp->a = 63;
fp->b = 128;
fp->c = 0;
fp->d = 1;
std::cout << "raw float = " << *reinterpret_cast<float*>(fp) << "\n";
The mathematical value according to the standard is 1.00000011920928955078125.
What Visual Studio prints is raw float = 2.36018991e-38. Why?
Assume the sign bit is 0 and the exponent part is 0111 1111.
In the remaining 23 bits, assume the two least significant bits are 01 and 10 respectively, which gives the mathematical values number1 = 1.00000011920928955078125 and number2 = 1.0000002384185791015625. The midpoint is number3 = 1.000000178813934326171875. So all values between number1 and number3 should be captured by the encoding with 01 in the two least significant bits, and values between number3 and number2 should be captured by the encoding with 10 in the two least significant bits. But Visual Studio captures 1.0000001788139343 (which actually falls between number1 and number3) and greater values with the encoding that has 10 in the two least significant bits. So what am I missing?
If you take a look at https://www.h-schmidt.net/FloatConverter/IEEE754.html
you can see that the binary representation of 2.36018991E-38 is
00000001 00000000 10000000 00111111
which is exactly the bit pattern your struct stores: on a little-endian machine (such as x86) the byte you write to a becomes the least significant byte of the float, so the bytes 63, 128, 0, 1 are read back as the word 0x0100803F rather than the 0x3F800001 you apparently intended.

Qsort comparison

I'm converting C++ code to Go, but I'm having difficulty understanding this comparison function:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <iostream>

using namespace std;

typedef struct SensorIndex {
    double value;
    int index;
} SensorIndex;

int comp(const void* a, const void* b)
{
    SensorIndex* x = (SensorIndex*)a;
    SensorIndex* y = (SensorIndex*)b;
    return abs(y->value) - abs(x->value);
}
int main(int argc, char* argv[])
{
    SensorIndex* s_tmp;
    s_tmp = (SensorIndex*)malloc(sizeof(SensorIndex) * 200);
double q[200] = {8.48359,8.41851,-2.53585,1.69949,0.00358129,-3.19341,3.29215,2.68201,-0.443549,-0.140532,1.64661,-1.84908,0.643066,1.53472,2.63785,-0.754417,0.431077,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256,-0.123256};
    for (int i = 0; i < 200; ++i) {
        s_tmp[i].value = q[i];
        s_tmp[i].index = i;
    }
    qsort(s_tmp, 200, sizeof(SensorIndex), comp);
    for (int i = 0; i < 200; i++) {
        cout << s_tmp[i].index << " " << s_tmp[i].value << endl;
    }
}
I expected that the comp function would sort from the highest (absolute) value down to the lowest, but in my environment (gcc, 32-bit) the result is:
1 8.41851
0 8.48359
2 -2.53585
3 1.69949
11 -1.84908
5 -3.19341
6 3.29215
7 2.68201
10 1.64661
14 2.63785
12 0.643066
13 1.53472
4 0.00358129
9 -0.140532
8 -0.443549
15 -0.754417
16 0.431077
17 -0.123256
18 -0.123256
19 -0.123256
20 -0.123256
...
Moreover, one thing that seems strange to me is that by executing the same code on online services I get a different ordering (cpp.sh, C++98):
0 8.48359
1 8.41851
5 -3.19341
6 3.29215
2 -2.53585
7 2.68201
14 2.63785
3 1.69949
10 1.64661
11 -1.84908
13 1.53472
4 0.00358129
8 -0.443549
9 -0.140532
12 0.643066
15 -0.754417
16 0.431077
17 -0.123256
18 -0.123256
19 -0.123256
20 -0.123256
...
Any help?
This behavior is caused by using abs, a function that works with int, and passing it double arguments. The doubles are being implicitly cast to int, truncating the decimal component before comparing them. Essentially, this means you take the original number, strip off the sign, and then strip off everything to the right of the decimal and compare those values. So 8.123 and -8.9 are both converted to 8, and compare equal. Since the inputs are reversed for the subtraction, the ordering is in descending order by magnitude.
Your cpp.sh output reflects this: all the values with a magnitude between 8 and 9 appear first, then the 3-4s, then the 2-3s, the 1-2s, and finally the values with magnitude less than 1.
If you wanted to fix this to actually sort in descending order in general, you'd need a comparison function that properly used the double-friendly fabs function, e.g.
int comp(const void* a, const void* b)
{
    SensorIndex* x = (SensorIndex*)a;
    SensorIndex* y = (SensorIndex*)b;
    double diff = fabs(y->value) - fabs(x->value);
    if (diff < 0.0) return -1;
    return diff > 0;
}
Update: On further reading, it looks like std::abs from <cmath> has worked with doubles for a long time, but std::abs for doubles was only added to <cstdlib> (where the integer abs functions dwell) in C++17. And implementers got this stuff wrong all the time, so different compilers would behave differently at random. In any event, both the answers given here are right: if you haven't included <cmath> and you're on a pre-C++17 compiler, you should only have access to the integer-based versions of std::abs (or ::abs from math.h), which would truncate each value before the comparison.
Even if you were using the correct std::abs, returning the result of a double subtraction as an int would drop the fractional component of the difference, making any values with a magnitude difference of less than 1.0 appear equal.
Worse, depending on the specific comparisons performed and their ordering (since not all values are compared to each other), the consequences of this effect could chain: comparison-ordering changes could make 1.0 appear equal to 1.6, which would in turn appear equal to 2.5, even though 1.0 would be correctly identified as less than 2.5 if they were compared to each other. In theory, as long as each number is within 1.0 of every other number, the comparisons might evaluate as if they're all equal to each other (a pathological case, yes, but smaller runs of such errors would definitely happen).
Point is, the only way to figure out the real intent of this code is to figure out the exact compiler version and C++ standard it was originally compiled under and test it there.
There is a bug in your comparison function. You return an int, which means you lose the distinction between element values whose absolute difference is less than 1!
int comp(const void* a, const void* b)
{
    SensorIndex* x = (SensorIndex*)a;
    SensorIndex* y = (SensorIndex*)b;
    // what about differences between 0.0 and 1.0?
    return abs(y->value) - abs(x->value);
}
You can fix it like this:
int comp(const void* a, const void* b)
{
    SensorIndex* x = (SensorIndex*)a;
    SensorIndex* y = (SensorIndex*)b;
    if (std::abs(y->value) < std::abs(x->value))
        return -1;
    if (std::abs(x->value) < std::abs(y->value))
        return 1;
    return 0;   // report equal magnitudes as equal, as qsort expects
}
A more modern (and safer) way to do this would be to use std::vector and std::sort:
// use a vector for dynamic arrays
std::vector<SensorIndex> s_tmp;
for (int i = 0; i < 200; ++i) {
    s_tmp.push_back({q[i], i});
}

// use std::sort
std::sort(std::begin(s_tmp), std::end(s_tmp),
          [](SensorIndex const& a, SensorIndex const& b) {
              return std::abs(b.value) < std::abs(a.value);
          });

Algorithm for Simple Squared Squares

I want to split a square into unequal squares.
After some searching on the web I found this Link.
This is the output I need:
Does anyone have an idea for this?
As Yves Daoust said, the algorithm to solve this is going to be slow. The first challenge is to determine what squares COULD be combined to fit into your big square. Then figure out whether they WILL fit in there.
I would first filter by area.
To answer the first part, you need to look for combinations of squares whose areas add up to that of your big one. There are likely multiple combinations, as a 5x5 square takes up the same area as a 3x3 plus a 4x4 square. This is an O(2^n) problem in itself.
Then attempt to arrange them.
I would make a matrix the size of your big square. Then, starting at the topmost, leftmost unoccupied index, add a square by marking the matrix positions as occupied by it. Then move to the next unoccupied space and, following the same rules, add another unused square. If no square fits, remove the previous square and continue with the next candidate. This is a method begging for recursion (see the sketch after this answer).
As I said at the beginning this is a SLOW way to do it but it will give you a solution if one exists.
I used a dynamic programming approach to solve this, but it only works up to n ~ 50. I store each solution as a bitset for efficiency:
You can compile the code yourself with:
$ g++ -O3 -std=c++11 squares.cpp -o squares
#include <bitset>
#include <iostream>
#include <list>
#include <vector>

using namespace std;

constexpr auto N = 116;

class FastSquareList {
public:
    FastSquareList() = default;
    FastSquareList(int i) { mask_.set(i); }

    FastSquareList operator+(int i) const {
        FastSquareList result = *this;
        result.mask_.set(i);
        return result;
    }

    bool has(int i) const { return mask_.test(i); }

    void display() const {
        for (auto i = 1; i <= N; ++i) {
            if (has(i)) {
                cout << i * i << " ";
            }
        }
        cout << endl;
    }

private:
    bitset<N + 1> mask_;
};

int main() {
    int n;
    cin >> n;
    vector<list<FastSquareList> > v(n * n + 1);
    for (int i = 1; i <= n; ++i) {
        v[i * i].push_back(i);
        for (int a = i * i + 1; a <= n * n; ++a) {
            int p = a - i * i;
            for (const auto& l : v[p]) {
                if (l.has(i)) {
                    continue;
                }
                v[a].emplace_back(l + i);
            }
        }
    }
    for (const auto& l : v[n * n]) {
        l.display();
    }
    cout << "solutions count = " << v[n * n].size() << endl;
    return 0;
}
an example:
$ ./Squares
15
9 16 36 64 100
25 36 64 100
1 4 9 16 25 49 121
4 36 64 121
4 100 121
4 16 25 36 144
1 16 64 144
81 144
4 16 36 169
4 9 16 196
4 25 196
225
solutions count = 12

16 bit limit on XDrawString arguments

From the XDrawString man page it seems that it accepts signed 32-bit x and y coordinates:
int XDrawString(Display *display, Drawable d, GC gc,
                int x, int y, char *string, int length);
Note how both x and y are int (i.e. 32-bit signed integers, on gcc/linux 2.6 i386 at least).
The problem is that when I pass y = 32767 (2^15 - 1) the string is drawn in the correct location, but for anything above this value the string is not drawn.
I suspect that internally the coordinates are not 32-bit integers but 16-bit signed integers.
Given that the man pages seem to indicate that the function accepts 32-bit integers, is there some compile option that needs to be turned on to allow the use of longer integers? Or is this a limitation of Xlib?
The X11 protocol does specify 16 bits.
Have a look at the definition for xPolyTextReq in <X11/Xproto.h>
typedef struct {
    CARD8 reqType;
    CARD8 pad;
    CARD16 length B16;
    Drawable drawable B32;
    GContext gc B32;
    INT16 x B16, y B16;    /* items (xTextElt) start after struct */
} xPolyTextReq;

Char conversion in gcc

What are the implicit typecasting rules for char? The following code gives an awkward output of -172.
char x = 200;
char y = 140;
printf("%d", x+y);
My guess is that, being signed, x is cast to 72 and y is cast to 12, which should give 84 as the answer; however, this is not the case, as mentioned above. I am using gcc on Ubuntu.
The following code gives an awkward output of -172.
The behavior of such an overflow is implementation-dependent, but visibly in your case (and mine) char has 8 bits and uses two's complement representation. So the binary representations of the unsigned values 200 and 140 are 11001000 and 10001100, which correspond to the signed char values -56 and -116, and -56 + -116 equals -172 (the chars are promoted to int to do the addition).
An example forcing x and y to be signed regardless of the default signedness of char:
#include <stdio.h>

int main()
{
    signed char x = 200;
    signed char y = 140;
    printf("%d %d %d\n", x, y, x + y);
    return 0;
}
Compilation and execution:
pi@raspberrypi:/tmp $ gcc -Wall c.c
pi@raspberrypi:/tmp $ ./a.out
-56 -116 -172
pi@raspberrypi:/tmp $
My guess is that, being signed, x is cast to 72 and y is cast to 12
You supposed the high bit is simply removed (11001000 -> 1001000 and 10001100 -> 0001100), but this is not the case, contrary to IEEE floats, which use a dedicated sign bit.
