long double subnormals/denormals get truncated to 0 [-Woverflow] - gcc

In the IEEE 754 standard, the minimum strictly positive (subnormal) value of the quadruple-precision floating-point format is 2^-16493 ≈ 10^-4965. Why does GCC reject anything lower than 10^-4949? I'm looking for an explanation of the different things that could be going on underneath which determine the limit to be 10^-4949 rather than 10^-4965.
#include <stdio.h>

void prt_ldbl(long double decker) {
    unsigned char *desmond = (unsigned char *) &decker;
    int i;

    for (i = 0; i < sizeof (decker); i++) {
        printf("%02X ", desmond[i]);
    }
    printf("\n");
}

int main()
{
    long double x = 1e-4955L;
    prt_ldbl(x);
}
I'm using GNU GCC version 4.8.1 online - not sure which architecture it's running on (which I realize may be the culprit). Please feel free to post your findings from different architectures.

Your long double type may not be(*) quadruple-precision. It may simply be the 387 80-bit extended-double format. This format has the same number of bits for the exponent as quad-precision, but many fewer significand bits, so the minimum value that would be representable in it sounds about right (2^-16445).
(*) Your long double is likely not to be quad-precision, because no processor implements quad-precision in hardware. The compiler can always implement quad-precision in software, but it is much more likely to map long double to double-precision, to extended-double or to double-double.

The smallest 80-bit long double is around 2^(-16382 - 63) = 2^-16445 ≈ 10^-4951, not 2^-16493. So the compiler is entirely correct; your number is smaller than the smallest subnormal.
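A quick way to check which format your toolchain actually uses (a minimal sketch of my own, not from the answers above) is to print the size and limits of long double from <float.h>. LDBL_TRUE_MIN is a C11 addition, so it is guarded with #ifdef:

#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("sizeof(long double) = %zu\n", sizeof(long double));
    printf("LDBL_MANT_DIG       = %d\n", LDBL_MANT_DIG);   /* 64 for x87, 113 for quad */
    printf("LDBL_MIN            = %Lg\n", LDBL_MIN);       /* smallest normal value */
#ifdef LDBL_TRUE_MIN
    printf("LDBL_TRUE_MIN       = %Lg\n", LDBL_TRUE_MIN);  /* smallest subnormal (C11) */
#endif
    return 0;
}

With an x87 80-bit long double this should report LDBL_MANT_DIG as 64 and a smallest subnormal around 3.6e-4951, which matches the limit the compiler enforces.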

Related

Metal Compute function causes GPU timeout error

I am trying to compute the Collatz conjecture for a range of numbers to see how much I can benefit from using the GPU. For some reason, the function seems to fail for integers above one hundred million. I use 64-bit unsigned long for the calculations, so it can't be integer overflow; the largest number reached in the calculations for any integer is well below the maximum representable value for this datatype.
The application is basically Apple's Performing Calculations on a GPU sample, where the array buffers and the compute (kernel) function are the only things changed. The basic idea is to pass an array of integers (say from 1 to 1000) to the kernel, where each integer serves as a starting point for a while loop that performs the Collatz calculation for every number until it reaches a preset upper limit, for example one billion.
kernel void compute_collatz(device const unsigned int *array [[buffer(0)]],
                            device unsigned int *result [[buffer(1)]],
                            uint index [[thread_position_in_grid]])
{
    const unsigned long arrayLength = (unsigned long)1000;
    unsigned long arrayNumber = (unsigned long)array[index];
    unsigned long maxNumber = (unsigned long)1000000000 - arrayLength;

    while (arrayNumber <= maxNumber) {
        unsigned long currentStep = arrayNumber;
        while (currentStep != 1) {
            if (currentStep % 2 == 0) { currentStep = currentStep / 2; }
            else { currentStep = (currentStep * 3) + 1; }
        }
        arrayNumber += arrayLength;
    }
    result[index] = (int)arrayNumber;
}
When the function reaches the maximum limit, it stores that value in a different array. This works well when the maximum value is set to one hundred million or less, but only about one third of the array is changed when I try higher values. The program fails with the following error: Execution of the command buffer was aborted due to an error during execution. Caused GPU Timeout Error (IOAF code 2). I had a similar problem when I first tried multithreading on the CPU, but that problem was related to pointers, which I can't see being an issue here.
I am using Xcode 12.5.1 on macOS 11.5.2 on a MacBook Pro 16 (2016). Any help is much appreciated!

Why is float division faster than integer division in C++?

Consider the following code snippets in C++ (Visual Studio 2015):
First Block
const int size = 500000000;
int sum = 0;
int *num1 = new int[size];   // initialized between 1-250
int *num2 = new int[size];   // initialized between 1-250

for (int i = 0; i < size; i++)
{
    sum += (num1[i] / num2[i]);
}
Second Block
const int size = 500000000;
int sum = 0;
float *num1 = new float[size];   // initialized between 1-250
float *num2 = new float[size];   // initialized between 1-250

for (int i = 0; i < size; i++)
{
    sum += (num1[i] / num2[i]);
}
I expected the first block to run faster because it uses integer operations, but the second block is considerably faster even though it uses floating-point operations. Here are the results of my benchmark:
Division:
Type Time
uint8 879.5ms
uint16 885.284ms
int 982.195ms
float 654.654ms
Floating-point multiplication is likewise faster than integer multiplication. Here are the results of my benchmark:
Multiplication:
Type Time
uint8 166.339ms
uint16 524.045ms
int 432.041ms
float 402.109ms
My system specs: Intel Core i7-7700 CPU, 64 GB RAM, Visual Studio 2015.
Floating-point division is faster than integer division because of the exponent part of the floating-point representation: to combine the exponents during a division, a plain subtraction is all that is needed.
int32_t division requires fast division of 31-bit numbers, whereas float division requires fast division of 24-bit mantissas (the leading one of the mantissa is implied and not stored in a floating-point number) plus a fast subtraction of 8-bit exponents.
See an excellent detailed explanation of how division is performed in the CPU.
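As a small illustration of the exponent-subtraction idea (my own sketch, not part of the original answer): frexpf splits a float into a mantissa and a power-of-two exponent, and dividing the mantissas while subtracting the exponents reproduces the quotient.

#include <stdio.h>
#include <math.h>

int main(void)
{
    float a = 96.0f, b = 6.0f;
    int ea, eb;
    float ma = frexpf(a, &ea);            /* a = ma * 2^ea, with 0.5 <= ma < 1 */
    float mb = frexpf(b, &eb);            /* b = mb * 2^eb */
    float q  = ldexpf(ma / mb, ea - eb);  /* divide mantissas, subtract exponents */
    printf("%g / %g = %g (direct: %g)\n", a, b, q, a / b);
    return 0;
}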
It may be worth mentioning that SSE and AVX instructions only provide floating-point division, but no integer division. SSE intrinsics can easily be used to quadruple the speed of your float calculation.
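For example, here is a hedged sketch of that point (my own code: the function name is made up, size is assumed to be a multiple of 4, and the sum is accumulated in a float rather than the int of the original blocks). _mm_div_ps divides four packed floats per instruction:

#include <immintrin.h>
#include <stddef.h>

static float sum_quotients_sse(const float *num1, const float *num2, size_t size)
{
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i + 4 <= size; i += 4) {
        __m128 a = _mm_loadu_ps(num1 + i);        /* load 4 floats from each array */
        __m128 b = _mm_loadu_ps(num2 + i);
        acc = _mm_add_ps(acc, _mm_div_ps(a, b));  /* 4 divisions per instruction */
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);                    /* horizontal sum of the 4 lanes */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}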
If you look into Agner Fog's instruction tables, for example, for Skylake, the latency of the 32-bit integer division is 26 CPU cycles, whereas the latency of the SSE scalar float division is 11 CPU cycles (and, surprisingly, it takes the same time to divide four packed floats).
Also note that in C and C++ there is no division on types narrower than int, so uint8_t and uint16_t are first promoted to int and the division happens on ints. uint8_t division looks faster than int division because its values have fewer significant bits once converted to int, which causes the division to complete faster.

ARM GCC compiler "buggy" conversion

Problem
I am working on flash memory optimization for an STM32F051. It turned out that conversion between float and int types consumes a lot of flash.
Digging into this, I found that the conversion to int takes around 200 bytes of flash memory, while the conversion to unsigned int takes around 1500 bytes!
It's known that int and unsigned int differ only in the interpretation of the 'sign' bit, so this behavior is a great mystery to me.
Note: Performing the 2-stage conversion float -> int -> unsigned int also consumes only around 200 bytes.
Questions
Analyzing this, I have the following questions:
1) What is the mechanism of the conversion from float to unsigned int? Why does it take so much memory, when at the same time the conversion float -> int -> unsigned int takes so little? Maybe it's connected with the IEEE 754 standard?
2) Are any problems to be expected when the conversion float -> int -> unsigned int is used instead of a direct float -> unsigned int?
3) Are there any methods to wrap the float -> unsigned int conversion while keeping the low memory footprint?
Note: A similar question has already been asked here (Trying to understand how the casting/conversion is done by compiler, e.g., when cast from float to int), but there is still no clear answer, and my question is about the memory usage.
Technical data
Compiler: ARM-NONE-EABI-GCC (gcc version 4.9.3 20141119 (release))
MCU: STM32F051
MCU's core: 32 bit ARM CORTEX-M0
Code example
float -> int (~200 bytes of flash)
int main() {
    volatile float f;
    volatile int i;

    i = f;
    return 0;
}
float -> unsigned int (~1500 bytes! of flash)
int main() {
    volatile float f;
    volatile unsigned int ui;

    ui = f;
    return 0;
}
float -> int -> unsigned int (~200 bytes of flash)
int main() {
    volatile float f;
    volatile int i;
    volatile unsigned int ui;

    i = f;
    ui = i;
    return 0;
}
There is no fundamental reason why the conversion from float to unsigned int should be larger than the conversion from float to signed int; in practice the float to unsigned int conversion can even be made smaller than the float to signed int conversion.
I did some investigation using the GNU Arm Embedded Toolchain (version 7-2018-q2), and as far as I can see the size problem is due to a flaw in the GCC runtime library. For some reason this library does not provide a specialized version of the __aeabi_f2uiz function for Armv6-M; instead it falls back on a much larger general version.
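As a possible workaround for question 3 (a sketch of my own, not from the answer above, and the helper name is made up): route the unsigned conversion through the smaller float -> signed int path and handle values of 2^31 and above separately. For inputs that are negative, NaN or out of range, the behaviour is just as undefined as with a plain cast.

/* Hypothetical helper: float -> unsigned int using only the (smaller)
 * float -> signed int runtime support.  Assumes 32-bit int/unsigned int,
 * as on Cortex-M0. */
static unsigned int f2u_via_int(float f)
{
    if (f >= 2147483648.0f)   /* 2^31 and above do not fit in a signed int */
        return (unsigned int)(int)(f - 2147483648.0f) + 2147483648u;
    return (unsigned int)(int)f;
}

Whether this actually saves flash depends on which library routines the rest of the program already pulls in, so it is worth checking the map file.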

How to avoid precision problems in C++ while using double and long double variables?

I have the C++ code below:
#include <iostream>
#include <cstdio>
#include <math.h>

using namespace std;

int main()
{
    unsigned long long dec, len;
    long double dbl;

    while (cin >> dec)
    {
        len = log10(dec) + 1;
        dbl = (long double) (dec);
        while (len--)
            dbl /= 10.0;
        dbl += 1e-9;
        printf("%llu in int = %.20Lf in long double. :)\n", dec, dbl);
    }
    return 0;
}
In this code I wanted to convert an integer to a floating-point number, but for some inputs it gave precision errors. So I added 1e-9 before printing the result, but it still shows errors for all inputs; in fact I get some extra digits in the result. Some examples are given below:
stdin
1
12
123
1234
12345
123456
1234567
12345678
123456789
1234567890
stdout
1 in int = 0.10000000100000000000 in long double. :)
12 in int = 0.12000000100000000001 in long double. :)
123 in int = 0.12300000100000000000 in long double. :)
1234 in int = 0.12340000100000000000 in long double. :)
12345 in int = 0.12345000099999999999 in long double. :)
123456 in int = 0.12345600100000000000 in long double. :)
1234567 in int = 0.12345670100000000000 in long double. :)
12345678 in int = 0.12345678099999999998 in long double. :)
123456789 in int = 0.12345679000000000001 in long double. :)
1234567890 in int = 0.12345679000000000001 in long double. :)
Is there any way to avoid or get rid of these errors? :)
No, there is no way around it. A floating point number is basically a fraction with a power of 2 as the denominator. This means that the only non-integers that can be represented exactly are multiples of a (negative) power of 2, i.e. a multiple of 1/2, or of 1/16, or of 1/1048576, or...
Now, 10 has two prime factors: 2 and 5. Thus 1/10 cannot be expressed as a fraction with a power of 2 as the denominator, and you will always end up with a rounding error. By repeatedly dividing by 10 you even make this slightly worse, so one "solution" would be, rather than dividing dbl by 10 repeatedly, to keep a separate multiplier:
double multiplier = 1;

while (len--)
    multiplier *= 10.;
dbl /= multiplier;
Note that I don't say this will solve the problem, but it might make things slightly more stable. The assumption that you can represent an arbitrary decimal number exactly in floating point remains wrong.
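To see the power-of-two point directly, here is a small sketch of my own: 1/8 is a power of 2 and prints exactly, while 1/10 is stored as its nearest representable neighbour (the exact tail of digits depends on your long double format):

#include <stdio.h>

int main(void)
{
    long double tenth  = 0.1L;    /* 1/10: not a sum of powers of 2, stored rounded */
    long double eighth = 0.125L;  /* 1/8 = 2^-3: stored exactly */

    printf("%.25Lf\n", tenth);    /* not exact: a tiny error shows up around the 21st digit on x87 */
    printf("%.25Lf\n", eighth);   /* exact: 0.1250000000000000000000000 */
    return 0;
}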

Using GMP for Cryptography: how to get random numbers?

The documentation for GMP seems to list only the following algorithms for random number generation:
gmp_randinit_mt, the Mersenne Twister;
gmp_randinit_lc_2exp and gmp_randinit_lc_2exp_size, linear congruential.
There is also gmp_randinit_default, but it points to gmp_randinit_mt.
Neither the Mersenne Twister nor linear congruential generators should be used for Cryptography.
What do people usually do, then, when they want to use the GMP to build some cryptographic code?
(Using a cryptographic API for encrypting/decrypting/etc doesn't help, because I'd actually implement a new algorithm, which crypto libraries do not have).
Disclaimer: I have only "tinkered" with RNGs, and that was over a year ago.
If you are on a Linux box, the solution is relatively simple and non-deterministic: just open and read a desired number of bits from /dev/urandom. If you need a large number of random bits for your program, however, you might want to read a smaller number of bits from /dev/urandom and use them as seeds for a PRNG.
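A minimal sketch of that idea (my own, POSIX-only, error handling kept to a bare minimum): read raw bytes from /dev/urandom and turn them into an mpz_t with mpz_import.

#include <stdio.h>
#include <gmp.h>

int main(void)
{
    unsigned char buf[32];                       /* 256 random bits */
    FILE *f = fopen("/dev/urandom", "rb");

    if (f == NULL || fread(buf, 1, sizeof buf, f) != sizeof buf) {
        perror("/dev/urandom");
        return 1;
    }
    fclose(f);

    mpz_t r;
    mpz_init(r);
    /* 32 one-byte words, most significant word first, native endianness
       within each word (irrelevant for 1-byte words), no nail bits */
    mpz_import(r, sizeof buf, 1, 1, 0, 0, buf);
    gmp_printf("%Zx\n", r);
    mpz_clear(r);
    return 0;
}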
Boost offers a number of PRNGs and a non-deterministic RNG, random_device. random_device uses the very same /dev/urandom on Linux and a similar (IIRC) facility on Windows, so it is an option if you need Windows support or portability.
Of course, you just might want/need to write a function based on your favored RNG using GMP's types and functions.
Edit:
#include <stdio.h>
#include <gmp.h>
#include <boost/random/random_device.hpp>

int main(int argc, char *argv[])
{
    unsigned min_digits = 30;
    unsigned max_digits = 50;
    unsigned quantity = 1000;   // How many numbers do you want?
    unsigned sequence = 10;     // How many numbers before reseeding?

    mpz_t rmin;
    mpz_init(rmin);
    mpz_ui_pow_ui(rmin, 10, min_digits - 1);

    mpz_t rmax;
    mpz_init(rmax);
    mpz_ui_pow_ui(rmax, 10, max_digits);

    gmp_randstate_t rstate;
    gmp_randinit_mt(rstate);

    mpz_t rnum;
    mpz_init(rnum);

    boost::random::random_device rdev;

    for (unsigned i = 0; i < quantity; i++) {
        if (!(i % sequence))
            gmp_randseed_ui(rstate, rdev());   // reseed the Mersenne Twister from the non-deterministic device

        do {
            mpz_urandomm(rnum, rstate, rmax);
        } while (mpz_cmp(rnum, rmin) < 0);     // retry until the number has at least min_digits digits

        gmp_printf("%Zd\n", rnum);
    }
    return 0;
}
