I know that double has more precision than float and all that, but during lecture, my professor said 0.5 is a double. Could it be float too?
short int s;
int i;
long int l;
float f;
double d;
l = 2 * s + i * f - 0.5 * d;
According to this SO question, the type of a floating point literal (i.e. a number with a decimal point in it) is by default double unless it has a suffix of f:
The type of a floating literal is double unless explicitly specified by a suffix. The suffixes f and F specify float, the suffixes l and L specify long double.
So, your professor appears to be spot on. In the above expression 0.5 will be treated like a double by default. I hope that you will get a high grade on the final exam.
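One way to see this for yourself is to ask the compiler. The following is a minimal sketch, assuming a C11 compiler; the macro TYPE_NAME and the function lecture_expression_type are made-up names for illustration, and the variable values are arbitrary.

```c
/* C11 _Generic picks a string based on the static type of its operand.
   The operand is only inspected for its type, it is never evaluated. */
#define TYPE_NAME(x) _Generic((x), \
    float: "float",                \
    double: "double",              \
    long double: "long double",    \
    default: "other")

/* Reports the type of the right-hand side from the question after the
   usual arithmetic conversions have been applied. */
const char *lecture_expression_type(void)
{
    short int s = 1;
    int i = 2;
    float f = 3.0f;
    double d = 4.0;
    return TYPE_NAME(2 * s + i * f - 0.5 * d);
}
```

TYPE_NAME(0.5) selects "double", TYPE_NAME(0.5f) selects "float" and TYPE_NAME(0.5L) selects "long double". Because 0.5 * d is a double, the whole right-hand side is a double before it is converted to long int by the assignment.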
Back story : uniform PRNG with arbitrary endpoints
I've got a fast uniform pseudo random number generator that creates uniform float32 numbers in range [1:2) i.e. u : 1 <= u <= 2-eps. Unfortunately mapping the endpoints [1:2) to that of an arbitrary range [a:b) is non-trivial in floating point math. I'd like to exactly match the endpoints with a simple affine calculation.
Formally stated
I want to make an IEEE-754 32 bit floating point affine function f(x,a,b) for 1<=x<2 and arbitrary a,b that exactly maps
1 -> a and nextlower(2) -> nextlower(b)
where nextlower(q) is the next lower FP representable number (e.g. in C++ std::nextafter(float(q),float(q-1)))
What I've tried
The simple mapping f(x,a,b) = (x-1)*(b-a) + a always achieves the f(1) condition but sometimes fails the f(2) condition due to floating point rounding.
I've tried replacing the 1 with a free design parameter to cancel FP errors in the spirit of Kahan summation.
i.e. with
f(x,c0,c1,c2) = (x-c0)*c1 + c2
one mathematical solution is c0=1,c1=(b-a),c2=a (the simple mapping above),
but the extra parameter lets me play around with constants c0,c1,c2 to match the endpoints. I'm not sure I understand the principles behind Kahan summation well enough to apply them to determine the parameters or even be confident a solution exists. It feels like I'm bumping around in the dark where others might've found the light already.
Aside: I'm fine assuming the following
a < b
both a and b are far from zero, i.e. OK to ignore subnormals
a and b are far enough apart (measured in representable FP values) to mitigate non-uniform quantization and avoid degenerate cases
Update
I'm using a modified form of Chux's answer to avoid the division.
While I'm not 100% certain my refactoring kept all the magic, it does still work in all my test cases.
float lerp12(float x,float a,float b)
{
const float scale = 1.0000001f;
// scale = 1/(nextlower(2) - 1);
const float ascale = a*scale;
const float bscale = nextlower(b)*scale;
return (nextlower(2) - x)*ascale + (x - 1.0f)*bscale;
}
Note that only the last line (5 FLOPS) depends on x, so the others can be reused if (a,b) remain the same.
OP's goal
I want to make an IEEE-754 32 bit floating point affine function f(x,a,b) for 1<=x<2 and arbitrary a,b that exactly maps 1 -> a and nextlower(2) -> nextlower(b)
This differs slightly from "map range of IEEE 32bit float [1:2) to some arbitrary [a:b)".
General case
Map x0 to y0, x1 to y1 and various x in-between to y :
m = (y1 - y0)/(x1 - x0);
y = m*(x - x0) + y0;
OP's case
// x0 = 1.0f;
// x1 = nextafterf(2.0f, 1.0f);
// y0 = a;
// y1 = nextafterf(b, a);
#include <math.h> // for nextafterf()
float x = random_number_1_to_almost_2();
float m = (nextafterf(b, a) - a)/(nextafterf(2.0f, 1.0f) - 1.0f);
float y = m*(x - 1.0f) + a;
nextafterf(2.0f, 1.0f) - 1.0f, x - 1.0f and nextafterf(b, a) are exact, incurring no calculation error.
nextafterf(2.0f, 1.0f) - 1.0f is a value a little less than 1.0f.
Recommendation
Other reformulations are possible with better symmetry and numerical stability at the end-points.
float x = random_number_1_to_almost_2();
float afactor = nextafterf(2.0f, 1.0f) - x; // exact
float bfactor = x - 1.0f; // exact
float xwidth = nextafterf(2.0f, 1.0f) - 1.0f; // exact
// Do not re-order next line of code, perform 2 divisions
float y = (afactor/xwidth)*a + (bfactor/xwidth)*nextafterf(b, a);
Notice afactor/xwidth and bfactor/xwidth are both exactly 0.0 or 1.0 at the end-points, thus meeting "maps 1 -> a and nextlower(2) -> nextlower(b)". Extended precision not needed.
OP's (x-c0)*c1 + c2 has trouble as it divides (x-c0)*c1 by (2.0 - 1.0) or 1.0 (implied), when it should divide by nextafterf(2.0f, 1.0f) - 1.0f.
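The end-point behaviour of the two-division form can be checked in isolation. This is only a sketch: map12 is a hypothetical name, and b_lower is assumed to be computed once by the caller as nextafterf(b, a), which also keeps the example free of any math library dependency.

```c
/* Two-weight interpolation for x in [1, 2).
   b_lower is assumed to be precomputed by the caller as nextafterf(b, a). */
float map12(float x, float a, float b_lower)
{
    const float x_hi = 0x1.fffffep+0f;   /* largest float below 2, i.e. nextafterf(2.0f, 1.0f) */
    const float xwidth = x_hi - 1.0f;    /* exact */
    float wa = (x_hi - x) / xwidth;      /* exactly 1 at x == 1, exactly 0 at x == x_hi */
    float wb = (x - 1.0f) / xwidth;      /* exactly 0 at x == 1, exactly 1 at x == x_hi */
    return wa * a + wb * b_lower;
}
```

For a = 3 and b = 7, b_lower is 0x1.bffffep+2f. With those values map12(1.0f, ...) returns exactly 3.0f and map12(0x1.fffffep+0f, ...) returns exactly b_lower, because one weight is exactly 1 and the other exactly 0 at each end. Values in between are still subject to ordinary rounding.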
Simple lerping based on fused multiply-add can reliably hit the endpoints for interpolation factors 0 and 1. For x in [1, 2) the interpolation factor x - 1 does not reach unity, which can be fixed by slight stretching by multiplying x-1 with (2.0f / nextlower(2.0f)). Obviously the endpoint needs to also be adjusted to the endpoint nextlower(b). For the C code below I have used the definition of nextlower() provided in the question, which may not be what asker desires, since for floating-point q sufficiently large in magnitude, q == (q - 1).
Asker stated in comments that it is understood that this kind of mapping is not going to result in an exactly uniform distribution of the pseudo-random numbers in the interval [a, b), only approximately so, and that pathological mappings may occur when a and b are extremely close together. I have not mathematically proved that the implementation of map() below guarantees the desired behavior, but it seems to do so for a large number of random test cases.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
float nextlowerf (float q)
{
return nextafterf (q, q - 1);
}
float map (float a, float b, float x)
{
float t = (x - 1.0f) * (2.0f / nextlowerf (2.0f));
return fmaf (t, nextlowerf (b), fmaf (-t, a, a));
}
float uint32_as_float (uint32_t a)
{
float r;
memcpy (&r, &a, sizeof(r));
return r;
}
// George Marsaglia's KISS PRNG, period 2**123. Newsgroup sci.math, 21 Jan 1999
// Bug fix: Greg Rose, "KISS: A Bit Too Simple" http://eprint.iacr.org/2011/007
static uint32_t kiss_z=362436069, kiss_w=521288629;
static uint32_t kiss_jsr=123456789, kiss_jcong=380116160;
#define znew (kiss_z=36969*(kiss_z&65535)+(kiss_z>>16))
#define wnew (kiss_w=18000*(kiss_w&65535)+(kiss_w>>16))
#define MWC ((znew<<16)+wnew )
#define SHR3 (kiss_jsr^=(kiss_jsr<<13),kiss_jsr^=(kiss_jsr>>17), \
kiss_jsr^=(kiss_jsr<<5))
#define CONG (kiss_jcong=69069*kiss_jcong+1234567)
#define KISS ((MWC^CONG)+SHR3)
int main (void)
{
float a, b, x, r;
float FP32_MIN_NORM = 0x1.000000p-126f;
float FP32_MAX_NORM = 0x1.fffffep+127f;
do {
do {
a = uint32_as_float (KISS);
} while ((fabsf (a) < FP32_MIN_NORM) || (fabsf (a) > FP32_MAX_NORM) || isnan (a));
do {
b = uint32_as_float (KISS);
} while ((fabsf (b) < FP32_MIN_NORM) || (fabsf (b) > FP32_MAX_NORM) || isnan (b) || (b < a));
x = 1.0f;
r = map (a, b, x);
if (r != a) {
printf ("lower bound failed: a=%12.6a b=%12.6a map=%12.6a\n", a, b, r);
return EXIT_FAILURE;
}
x = nextlowerf (2.0f);
r = map (a, b, x);
if (r != nextlowerf (b)) {
printf ("upper bound failed: a=%12.6a b=%12.6a map=%12.6a\n", a, b, r);
return EXIT_FAILURE;
}
} while (1);
return EXIT_SUCCESS;
}
From CGAL I am currently using the following package:
Boolean operations on polygons
As I am interested in polygons which can have, besides line segments, as edges also circle segments, I use the following build-up for my basic typedefs:
typedef CGAL::Exact_predicates_exact_constructions_kernel Kernel;
typedef Kernel::Point_2 Point_2;
typedef Kernel::Circle_2 Circle_2;
typedef Kernel::Line_2 Line_2;
typedef CGAL::Gps_circle_segment_traits_2<Kernel> Traits_2;
typedef CGAL::General_polygon_set_2<Traits_2> Polygon_set_2;
typedef Traits_2::General_polygon_2 Polygon_2;
typedef Traits_2::General_polygon_with_holes_2 Polygon_with_holes_2;
typedef Traits_2::Curve_2 Curve_2;
typedef Traits_2::X_monotone_curve_2 X_monotone_curve_2;
typedef Traits_2::Point_2 Point_2t;
typedef Traits_2::CoordNT coordnt;
typedef CGAL::Arrangement_2<Traits_2> Arrangement_2;
typedef Arrangement_2::Face_handle Face_handle;
As shown in the types above, I have two Point-types namely Point_2 which is Kernel::Point_2 and what I called Point_2t which is Traits_2::Point_2.
The difference between them is, that Point_2 has rational coordinates x(), y() whereas Point_2t has coordinates in Q(alpha) where Q stands for the rational field and alpha is the square-root of a rational number.
Or, to say it otherwise, the coordinates for Point_2 are in Kernel::FT, whereas the coordinates of Point_2t are in Traits_2::CoordNT.
So converting from Point_2 to Point_2t is no problem, but I have to convert also from Point_2t to Point_2, hopefully in a way which gives control over the lost precision.
Reading through the documentation and using the autocomplete feature of eclipse, I made up the following routines:
const int use_precision = 100;
CGAL::Gmpfr convert(CGAL::Gmpq z)
{
CGAL::Gmpz num = z.numerator();
CGAL::Gmpz den = z.denominator();
CGAL::Gmpfr num_f(num);
CGAL::Gmpfr den_f(den);
return num_f/den_f;
}
CGAL::Gmpfr convert(Traits_2::CoordNT z)
{
Kernel::FT a0_val = z.a0();
Kernel::FT a1_val = z.a1();
Kernel::FT root_val = z.root();
CGAL::Gmpq a0_q = a0_val.exact();
CGAL::Gmpq a1_q = a1_val.exact();
CGAL::Gmpq root_q = root_val.exact();
CGAL::Gmpfr a0_f = convert(a0_q);
CGAL::Gmpfr a1_f = convert(a1_q);
CGAL::Gmpfr root_f = convert(root_q);
CGAL::Gmpfr res = a0_f + a1_f * root_f.sqrt(use_precision);
return res;
}
Point_2 convert(Point_2t p)
{
CGAL::Gmpfr xx = convert(p.x());
CGAL::Gmpfr yy = convert(p.y());
CGAL::Gmpq xx1 = xx;
CGAL::Gmpq yy1 = yy;
Kernel::FT xx2 = xx1;
Kernel::FT yy2 = yy1;
Point_2 pp(xx2, yy2);
return pp;
}
Essentially I convert the coordinates from Traits_2::CoordNT into the form
(*) a0 + a1 * sqrt(root)
with a0, a1, root from Kernel::FT (=rational field), then convert a0, a1, root into Gmpq rationals, these into Gmpfr (taking the square root with a precision of use_precision = 100, which Gmpfr counts in bits rather than decimal digits), then evaluate the expression (*) and convert back into Gmpq and then Kernel::FT. All conversions are (more or less) just done by assignments and automatic conversion by CGAL.
In my tests this seemed to work correctly, but I am still not 100% sure whether, according to the CGAL definitions, the sqrt(root) expression in (*) always means the positive square root.
I looked through the definition:
description of sqrt-extended Number type in CGAL
but even then I am not totally convinced, that only the positive value of sqrt(root) is taken.
So my question to those, who understand the CGAL system in this point fully:
Are my conversion routines above correct in assuming always the positive value of the root to be taken?
Yes, you are right. In CGAL Sqrt_extension, in the expression a0+a1√(root), the square root is always non-negative (positive or zero).
I have a C++ code below,
#include <iostream>
#include <cstdio>
#include <math.h>
using namespace std;
int main ()
{
unsigned long long dec,len;
long double dbl;
while (cin >> dec)
{
len = log10(dec)+1;
dbl = (long double) (dec);
while (len--)
dbl /= 10.0;
dbl += 1e-9;
printf ("%llu in int = %.20Lf in long double. :)\n",dec,dbl);
}
return 0;
}
In this code I wanted to convert an integer to a floating-point number. But for some inputs it gave some precision errors. So I added 1e-9 before printing the result. But still it is showing errors for all the inputs, actually I got some extra digits in the result. Some of them are given below,
stdin
1
12
123
1234
12345
123456
1234567
12345678
123456789
1234567890
stdout
1 in int = 0.10000000100000000000 in long double. :)
12 in int = 0.12000000100000000001 in long double. :)
123 in int = 0.12300000100000000000 in long double. :)
1234 in int = 0.12340000100000000000 in long double. :)
12345 in int = 0.12345000099999999999 in long double. :)
123456 in int = 0.12345600100000000000 in long double. :)
1234567 in int = 0.12345670100000000000 in long double. :)
12345678 in int = 0.12345678099999999998 in long double. :)
123456789 in int = 0.12345679000000000001 in long double. :)
1234567890 in int = 0.12345679000000000001 in long double. :)
Is there any way to avoid or get rid of these errors? :)
No, there is no way around it. A floating point number is basically a fraction with a power of 2 as the denominator. This means that the only non-integers that can be represented exactly are multiples of a (negative) power of 2, i.e. a multiple of 1/2, or of 1/16, or of 1/1048576, or...
Now, 10 has two prime factors: 2 and 5. Thus 1/10 cannot be expressed as a fraction with a power of 2 as the denominator, and you will always end up with a rounding error. Repeatedly dividing by 10 even makes this slightly worse, so one "solution" would be to keep a separate multiplier and divide dbl only once, rather than dividing it by 10 repeatedly:
double multiplier = 1;
while (len--)
multiplier *= 10.;
dbl /= multiplier;
Note that I don't say this will solve the problem, but it might make things slightly more stable. Assuming that you can represent a decimal number exactly in floating point remains wrong.
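To make the single-division idea concrete, here is a minimal sketch in C. It uses double rather than long double, counts digits with an integer loop instead of log10, and to_fraction is a made-up name. It does not make the result exact; it only ensures there is a single rounding step.

```c
/* Converts e.g. 12345 to (approximately) 0.12345 with one division.
   The divisor is a power of ten, and powers of ten up to 1e22 are exactly
   representable in an IEEE-754 double, so building it adds no error. */
double to_fraction(unsigned long long dec)
{
    double divisor = 1.0;
    for (unsigned long long rest = dec; rest != 0; rest /= 10)
        divisor *= 10.0;            /* one factor of ten per decimal digit */
    return (double)dec / divisor;   /* the only operation that rounds */
}
```

As long as dec is below 2^53 (so that the conversion to double is exact), the result is the double closest to the intended decimal fraction, which is the same value the literal 0.12345 denotes. It is still not 0.12345 itself, so printing 20 decimals will continue to show trailing digits; printing with a precision that matches the number of input digits hides them.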
I am using following code to typecast from float to int. I always have float with up to 1 decimal point. First I multiply it with 10 and then typecast it to int
float temp1 = float.Parse(textBox.Text);
int temp = (int)(temp1*10);
for 25.3 I get 252 after typecasting but for 25.2, 25.4 I get correct output 252,254 respectively.
Now performing same operation little bit in different way gives correct output.
float temp1 = float.Parse(textBox.Text);
temp1 = temp1*10;
int temp = (int)temp1;
now for 25.3 I get 253. What is the reason for this because logically first method is also correct? I am using Visual Studio 2010.
It's all down to the precision of float and double and how they are truncated to an integer.
By default, arithmetic operations are performed in double precision.
Here is your first code
float temp1 = float.Parse(textBox.Text);
int temp = (int)(temp1*10);
which gets executed as
float temp1 = float.Parse(textBox.Text);
double xVariable = temp1*10;
int temp = (int)xVariable;
here is your second code which executes as float conversion of multiplication
float temp1 = float.Parse(textBox.Text);
float xVariable = temp1*10;
int temp = (int)xVariable;
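The same effect can be reproduced outside C#. The sketch below is in C and forces the two behaviours with explicit types, so it only illustrates the arithmetic and says nothing about how the .NET runtime decides which precision to use; scale_wide and scale_narrow are made-up names.

```c
/* Product kept in double before truncation, mimicking an intermediate
   result held at higher precision. */
int scale_wide(float value)
{
    double product = (double)value * 10.0;
    return (int)product;    /* truncates toward zero */
}

/* Product rounded to float first, then truncated. */
int scale_narrow(float value)
{
    float product = value * 10.0f;
    return (int)product;    /* truncates toward zero */
}
```

The float nearest to 25.3 is about 25.2999992. Ten times that is about 252.999992, which a double can hold, so scale_wide(25.3f) truncates to 252. That product is not representable as a float and rounds up to exactly 253.0f, so scale_narrow(25.3f) gives 253.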
More Information about precision
http://en.wikipedia.org/wiki/Single-precision_floating-point_format
What range of numbers can be represented in a 16-, 32- and 64-bit IEEE-754 systems?
try
decimal temp1 = decimal.Parse(textBox.Text);
int temp = (int)(temp1*10);
Use this:
float temp1 = 25.3f;
int temp = Convert.ToInt32(temp1);
In the IEEE 754 standard, the minimum strictly positive (subnormal) value is 2^-16494 ≈ 10^-4965 using the quadruple-precision floating-point format. Why does GCC reject anything lower than 10^-4949? I'm looking for an explanation of the different things that could be going on underneath which determine the limit to be 10^-4949 rather than 10^-4965.
#include <stdio.h>
void prt_ldbl(long double decker) {
unsigned char * desmond = (unsigned char *) & decker;
int i;
for (i = 0; i < sizeof (decker); i++) {
printf ("%02X ", desmond[i]);
}
printf ("\n");
}
int main()
{
long double x = 1e-4955L;
prt_ldbl(x);
}
I'm using GNU GCC version 4.8.1 online - not sure which architecture it's running on (which I realize may be the culprit). Please feel free to post your findings from different architectures.
Your long double type may not be(*) quadruple-precision. It may simply be the 387 80-bit extended-double format. This format has the same number of bits for the exponent as quad-precision, but many fewer significand bits, so the minimum value that would be representable in it sounds about right (2^-16445).
(*) Your long double is likely not to be quad-precision, because very few processors implement quad-precision in hardware. The compiler can always implement quad-precision in software, but it is much more likely to map long double to double-precision, to extended-double or to double-double.
The smallest 80-bit long double is around 2^(-16382 - 63) ~= 10^-4951, not 2^-16494. So the compiler is entirely correct; your number is smaller than the smallest subnormal.
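Which format long double uses on a given platform can be read off <float.h>. The following sketch assumes a binary (radix 2) format with subnormal support; both function names are made up for illustration.

```c
#include <float.h>

/* Base-2 exponent of the smallest positive subnormal of a binary format.
   <float.h> defines MIN_EXP so that the smallest normal number is
   2^(MIN_EXP - 1); subnormals reach MANT_DIG - 1 bits further down. */
int min_subnormal_exp2(int min_exp, int mant_dig)
{
    return min_exp - mant_dig;
}

/* The same quantity for whatever long double is on this platform. */
int long_double_min_subnormal_exp2(void)
{
    return min_subnormal_exp2(LDBL_MIN_EXP, LDBL_MANT_DIG);
}
```

With the x87 extended format (LDBL_MIN_EXP of -16381 and LDBL_MANT_DIG of 64) this gives -16445, i.e. about 3.6e-4951. With true quadruple precision (113 significand bits) it gives -16494, i.e. about 6.5e-4966. Printing LDBL_MANT_DIG is therefore a quick way to tell which of the two the compiler is using.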