Floats in visual studio - visual-studio

Consider single precision floating point number system conforming to IEEE 754 standard. In visual studio, FP switch was set to Strict.
struct FP {
unsigned char a : 8;
unsigned char b : 8;
unsigned char c : 8;
unsigned char d : 8;
}*fp;
fp->a = 63;
fp->b = 128;
fp->c = 0;
fp->d = 1;
std::cout << "raw float = " << *reinterpret_cast<float*>(fp) << "\n";
Tha mathematical value according to standard is 1.00000011920928955078125.
What visual studio prints is raw float = 2.36018991e-38. Why?
Assume sign bit it 0. And 0111 1111 in exponent part.
In the remaining 23 bits assume 01 and 10 are the least significant bits, which means the mathematical value is number1 = 1.00000011920928955078125 and number2 = 1.0000002384185791015625 respectively. The mid value is number3 = 1.000000178813934326171875. So, all values between number1 and number3 should be captured by encoding with 01 in least two significant bits and values between number3 and number2 should be captured by encoding with 10 in least significant bits. But visual studio captures 1.0000001788139343(this actually falls between number1 and number3) and greater values in encoding with 10 in least significant bits. So what am I mising?

If you take a look at https://www.h-schmidt.net/FloatConverter/IEEE754.html
then you can see that binary representation of 2.36018991E-38 is
00000001 00000000 10000000 00111111
and that binary value equals to your struct

Related

IEEE-754 double precision and splitting method

When you compute elementary functions, you apply constant modification. Specially in the implementation of exp(x). In all these implementations any correction with ln(2) is done in two steps. ln(2) is split in two numbers:
static const double ln2p1 = 0.693145751953125;
static const double ln2p2 = 1.42860682030941723212E-6;
// then ln(2) = ln2p1 + ln2p2
Then any computation with ln(2) is done by:
blablabla -= ln2p1
blablabla -= ln2p2
I know it is to avoid rounding effect. But why this two numbers in specially ? Some of you have an idea how get these two numbers ?
Thank you !
Following the first comment I complete this post with more material and a very weird question. I worked with my team and we agree the deal is to double potentially the precision by splitting the number ln(2) in two numbers. For this, two transformations are applied, the first one:
1) c_h = floor(2^k ln(2))/2^k
2) c_l = ln(2) - c_h
the k indicates the precisions, in look likes in Cephes library (~1980) for float k has been fixed on 9, 16 for double and also 16 for long long double (why I do not know). So for double c_h has a precision of 16 bits but 52 bits for c_l.
From this, I write the following program, and determine c_h with 52 bit precision.
#include <iostream>
#include <math.h>
#include <iomanip>
enum precision { nine = 9, sixteen = 16, fiftytwo = 52 };
int64_t k_helper(double x){
return floor(x/log(2));
}
template<class C>
double z_helper(double x, int64_t k){
x -= k*C::c_h;
x -= k*C::c_l;
return x;
}
template<precision p>
struct coeff{};
template<>
struct coeff<nine>{
constexpr const static double c_h = 0.693359375;
constexpr const static double c_l = -2.12194440e-4;
};
template<>
struct coeff<sixteen>{
constexpr const static double c_h = 6.93145751953125E-1;
constexpr const static double c_l = 1.42860682030941723212E-6;
};
template<>
struct coeff<fiftytwo>{
constexpr const static double c_h = 0.6931471805599453972490664455108344554901123046875;
constexpr const static double c_l = -8.78318343240526578874146121703272447458793199905066E-17;
};
int main(int argc, const char * argv[]) {
double x = atof(argv[1]);
int64_t k = k_helper(x);
double z_9 = z_helper<coeff<nine> >(x,k);
double z_16 = z_helper<coeff<sixteen> >(x,k);
double z_52 = z_helper<coeff<fiftytwo> >(x,k);
std::cout << std::setprecision(16) << " 9 bits precisions " << z_9 << "\n"
<< " 16 bits precisions " << z_16 << "\n"
<< " 52 bits precisions " << z_52 << "\n";
return 0;
}
If I compute now for a set of different values I get:
bash-3.2$ g++ -std=c++11 main.cpp
bash-3.2$ ./a.out 1
9 bits precisions 0.30685281944
16 bits precisions 0.3068528194400547
52 bits precisions 0.3068528194400547
bash-3.2$ ./a.out 2
9 bits precisions 0.61370563888
16 bits precisions 0.6137056388801094
52 bits precisions 0.6137056388801094
bash-3.2$ ./a.out 100
9 bits precisions 0.18680599936
16 bits precisions 0.1868059993678755
52 bits precisions 0.1868059993678755
bash-3.2$ ./a.out 200
9 bits precisions 0.37361199872
16 bits precisions 0.3736119987357509
52 bits precisions 0.3736119987357509
bash-3.2$ ./a.out 300
9 bits precisions 0.56041799808
16 bits precisions 0.5604179981036264
52 bits precisions 0.5604179981036548
bash-3.2$ ./a.out 400
9 bits precisions 0.05407681688
16 bits precisions 0.05407681691155647
52 bits precisions 0.05407681691155469
bash-3.2$ ./a.out 500
9 bits precisions 0.24088281624
16 bits precisions 0.2408828162794319
52 bits precisions 0.2408828162794586
bash-3.2$ ./a.out 600
9 bits precisions 0.4276888156
16 bits precisions 0.4276888156473074
52 bits precisions 0.4276888156473056
bash-3.2$ ./a.out 700
9 bits precisions 0.61449481496
16 bits precisions 0.6144948150151828
52 bits precisions 0.6144948150151526
It like when x becomes larger than 300 a difference appear. I had a look on the the implementation of gnulibc
http://osxr.org:8080/glibc/source/sysdeps/ieee754/ldbl-128/s_expm1l.c
presently it is using the 16 bits prevision for c_h (line 84)
Well I am probably missing something, with the IEEE standard, and I can not imagine an error of precision in glibc. What do you think ?
Best,
ln2p1 is exactly 45426/65536. This can be obtained by round(65536 * ln(2)). ln2p2 is simply the remainder. So what's so special about the two number is the denominator 65536 (216).
From what I found most algorithms using this constant can be traced back to the cephes library, which was first released in 1984 where 16-bit computing was still dominating, which probably explains why 216 is chosen.

Bit twiddle help: Expanding bits to follow a given bitmask

I'm interested in a fast method for "expanding bits," which can be defined as the following:
Let B be a binary number with n bits, i.e. B \in {0,1}^n
Let P be the position of all 1/true bits in B, i.e. 1 << p[i] & B == 1, and |P|=k
For another given number, A \in {0,1}^k, let Ap be the bit-expanded form of A given B, such that Ap[j] == A[j] << p[j].
The result of the "bit expansion" is Ap.
A couple examples:
Given B: 0010 1110, A: 0110, then Ap should be 0000 1100
Given B: 1001 1001, A: 1101, then Ap should be 1001 0001
Following is a straightforward algorithm, but I can't help shake the feeling that there's a faster/easier way to do this.
unsigned int expand_bits(unsigned int A, unsigned int B, int n) {
int k = popcount(B); // cuda function, but there are good methods for this
unsigned int Ap = 0;
int j = k-1;
// Starting at the most significant bit,
for (int i = n - 1; i >= 0; --i) {
Ap <<= 1;
// if B is 1, add the value at A[j] to Ap, decrement j.
if (B & (1 << i)) {
Ap += (A >> j--) & 1;
}
}
return Ap;
}
The question appears to be asking for a CUDA emulation of the BMI2 instruction PDEP, which takes a source operand a, and deposits its bits based on the positions of the 1-bits of a mask b. There is no hardware support for an identical, or a similar, operation on currently shipping GPUs; that is, up to and including the Maxwell architecture.
I am assuming, based on the two examples given, that the mask b in general is sparse, and that we can minimize work by only iterating over the 1-bits of b. This could cause divergent branches on the GPU, but the exact trade-off in performance is unknown without knowledge of a specific use case. For now, I am assuming that the exploitation of sparsity in the mask b has a stronger positive influence on performance compared to the negative impact of divergence.
In the emulation code below, I have reduced the use of potentially "expensive" shift operations, instead relying mostly on simple ALU instructions. On various GPUs, shift instructions are executed with lower throughput than simple integer arithmetic. I have retained a single shift, off the critical path through the code, to avoid becoming execution limited by the arithmetic units. If desired, the expression 1U << i can be replaced by addition: introduce a variable m that is initialized to 1 before the loop and doubled each time through the loop.
The basic idea is to isolate each 1-bit of mask b in turn (starting at the least significant end), AND it with the value of the i-th bit of a, and incorporate the result into the expanded destination. After a 1-bit from b has been used, we remove it from the mask, and iterate until the mask becomes zero.
In order to avoid shifting the i-th bit of a into place, we simply isolate it and then replicate its value to all more significant bits by simple negation, taking advantage of the two's complement representation of integers.
/* Emulate PDEP: deposit the bits of 'a' (starting with the least significant
bit) at the positions indicated by the set bits of the mask stored in 'b'.
*/
__device__ unsigned int my_pdep (unsigned int a, unsigned int b)
{
unsigned int l, s, r = 0;
int i;
for (i = 0; b; i++) { // iterate over 1-bits in mask, until mask becomes 0
l = b & (0 - b); // extract mask's least significant 1-bit
b = b ^ l; // clear mask's least significant 1-bit
s = 0 - (a & (1U << i)); // spread i-th bit of 'a' to more signif. bits
r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
}
return r;
}
The variant without any shift operations alluded to above looks as follows:
/* Emulate PDEP: deposit the bits of 'a' (starting with the least significant
bit) at the positions indicated by the set bits of the mask stored in 'b'.
*/
__device__ unsigned int my_pdep (unsigned int a, unsigned int b)
{
unsigned int l, s, r = 0, m = 1;
while (b) { // iterate over 1-bits in mask, until mask becomes 0
l = b & (0 - b); // extract mask's least significant 1-bit
b = b ^ l; // clear mask's least significant 1-bit
s = 0 - (a & m); // spread i-th bit of 'a' to more significant bits
r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
m = m + m; // mask for next bit of 'a'
}
return r;
}
In comments below, #Evgeny Kluev pointed to a shift-free PDEP emulation at the chessprogramming website that looks potentially faster than either of my two implementations above; it seems worth a try.

Algorithm Challenge: Arbitrary in-place base conversion for lossless string compression

It might help to start out with a real world example. Say I'm writing a web app that's backed by MongoDB, so my records have a long hex primary key, making my url to view a record look like /widget/55c460d8e2d6e59da89d08d0. That seems excessively long. Urls can use many more characters than that. While there are just under 8 x 10^28 (16^24) possible values in a 24 digit hex number, just limiting yourself to the characters matched by a [a-zA-Z0-9] regex class (a YouTube video id uses more), 62 characters, you can get past 8 x 10^28 in only 17 characters.
I want an algorithm that will convert any string that is limited to a specific alphabet of characters to any other string with another alphabet of characters, where the value of each character c could be thought of as alphabet.indexOf(c).
Something of the form:
convert(value, sourceAlphabet, destinationAlphabet)
Assumptions
all parameters are strings
every character in value exists in sourceAlphabet
every character in sourceAlphabet and destinationAlphabet is unique
Simplest example
var hex = "0123456789abcdef";
var base10 = "0123456789";
var result = convert("12245589", base10, hex); // result is "bada55";
But I also want it to work to convert War & Peace from the Russian alphabet plus some punctuation to the entire unicode charset and back again losslessly.
Is this possible?
The only way I was ever taught to do base conversions in Comp Sci 101 was to first convert to a base ten integer by summing digit * base^position and then doing the reverse to convert to the target base. Such a method is insufficient for the conversion of very long strings, because the integers get too big.
It certainly feels intuitively that a base conversion could be done in place, as you step through the string (probably backwards to maintain standard significant digit order), keeping track of a remainder somehow, but I'm not smart enough to work out how.
That's where you come in, StackOverflow. Are you smart enough?
Perhaps this is a solved problem, done on paper by some 18th century mathematician, implemented in LISP on punch cards in 1970 and the first homework assignment in Cryptography 101, but my searches have borne no fruit.
I'd prefer a solution in javascript with a functional style, but any language or style will do, as long as you're not cheating with some big integer library. Bonus points for efficiency, of course.
Please refrain from criticizing the original example. The general nerd cred of solving the problem is more important than any application of the solution.
Here is a solution in C that is very fast, using bit shift operations. It assumes that you know what the length of the decoded string should be. The strings are vectors of integers in the range 0..maximum for each alphabet. It is up to the user to convert to and from strings with restricted ranges of characters. As for the "in-place" in the question title, the source and destination vectors can overlap, but only if the source alphabet is not larger than the destination alphabet.
/*
recode version 1.0, 22 August 2015
Copyright (C) 2015 Mark Adler
This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
Mark Adler
madler#alumni.caltech.edu
*/
/* Recode a vector from one alphabet to another using intermediate
variable-length bit codes. */
/* The approach is to use a Huffman code over equiprobable alphabets in two
directions. First to encode the source alphabet to a string of bits, and
second to encode the string of bits to the destination alphabet. This will
be reasonably close to the efficiency of base-encoding with arbitrary
precision arithmetic. */
#include <stddef.h> // size_t
#include <limits.h> // UINT_MAX, ULLONG_MAX
#if UINT_MAX == ULLONG_MAX
# error recode() assumes that long long has more bits than int
#endif
/* Take a list of integers source[0..slen-1], all in the range 0..smax, and
code them into dest[0..*dlen-1], where each value is in the range 0..dmax.
*dlen returns the length of the result, which will not exceed the value of
*dlen when called. If the original *dlen is not large enough to hold the
full result, then recode() will return non-zero to indicate failure.
Otherwise recode() will return 0. recode() will also return non-zero if
either of the smax or dmax parameters are less than one. The non-zero
return codes are 1 if *dlen is not long enough, 2 for invalid parameters,
and 3 if any of the elements of source are greater than smax.
Using this same operation on the result with smax and dmax reversed reverses
the operation, restoring the original vector. However there may be more
symbols returned than the original, so the number of symbols expected needs
to be known for decoding. (An end symbol could be appended to the source
alphabet to include the length in the coding, but then encoding and decoding
would no longer be symmetric, and the coding efficiency would be reduced.
This is left as an exercise for the reader if that is desired.) */
int recode(unsigned *dest, size_t *dlen, unsigned dmax,
const unsigned *source, size_t slen, unsigned smax)
{
// compute sbits and scut, with which we will recode the source with
// sbits-1 bits for symbols < scut, otherwise with sbits bits (adding scut)
if (smax < 1)
return 2;
unsigned sbits = 0;
unsigned scut = 1; // 2**sbits
while (scut && scut <= smax) {
scut <<= 1;
sbits++;
}
scut -= smax + 1;
// same thing for dbits and dcut
if (dmax < 1)
return 2;
unsigned dbits = 0;
unsigned dcut = 1; // 2**dbits
while (dcut && dcut <= dmax) {
dcut <<= 1;
dbits++;
}
dcut -= dmax + 1;
// recode a base smax+1 vector to a base dmax+1 vector using an
// intermediate bit vector (a sliding window of that bit vector is kept in
// a bit buffer)
unsigned long long buf = 0; // bit buffer
unsigned have = 0; // number of bits in bit buffer
size_t i = 0, n = 0; // source and dest indices
unsigned sym; // symbol being encoded
for (;;) {
// encode enough of source into bits to encode that to dest
while (have < dbits && i < slen) {
sym = source[i++];
if (sym > smax) {
*dlen = n;
return 3;
}
if (sym < scut) {
buf = (buf << (sbits - 1)) + sym;
have += sbits - 1;
}
else {
buf = (buf << sbits) + sym + scut;
have += sbits;
}
}
// if not enough bits to assure one symbol, then break out to a special
// case for coding the final symbol
if (have < dbits)
break;
// encode one symbol to dest
if (n == *dlen)
return 1;
sym = buf >> (have - dbits + 1);
if (sym < dcut) {
dest[n++] = sym;
have -= dbits - 1;
}
else {
sym = buf >> (have - dbits);
dest[n++] = sym - dcut;
have -= dbits;
}
buf &= ((unsigned long long)1 << have) - 1;
}
// if any bits are left in the bit buffer, encode one last symbol to dest
if (have) {
if (n == *dlen)
return 1;
sym = buf;
sym <<= dbits - 1 - have;
if (sym >= dcut)
sym = (sym << 1) - dcut;
dest[n++] = sym;
}
// return recoded vector
*dlen = n;
return 0;
}
/* Test recode(). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <assert.h>
// Return a random vector of len unsigned values in the range 0..max.
static void ranvec(unsigned *vec, size_t len, unsigned max) {
unsigned bits = 0;
unsigned long long mask = 1;
while (mask <= max) {
mask <<= 1;
bits++;
}
mask--;
unsigned long long ran = 0;
unsigned have = 0;
size_t n = 0;
while (n < len) {
while (have < bits) {
ran = (ran << 31) + random();
have += 31;
}
if ((ran & mask) <= max)
vec[n++] = ran & mask;
ran >>= bits;
have -= bits;
}
}
// Get a valid number from str and assign it to var
#define NUM(var, str) \
do { \
char *end; \
unsigned long val = strtoul(str, &end, 0); \
var = val; \
if (*end || var != val) { \
fprintf(stderr, \
"invalid or out of range numeric argument: %s\n", str); \
return 1; \
} \
} while (0)
/* "bet n m len count" generates count test vectors of length len, where each
entry is in the range 0..n. Each vector is recoded to another vector using
only symbols in the range 0..m. That vector is recoded back to a vector
using only symbols in 0..n, and that result is compared with the original
random vector. Report on the average ratio of input and output symbols, as
compared to the optimal ratio for arbitrary precision base encoding. */
int main(int argc, char **argv)
{
// get sizes of alphabets and length of test vector, compute maximum sizes
// of recoded vectors
unsigned smax, dmax, runs;
size_t slen, dsize, bsize;
if (argc != 5) { fputs("need four arguments\n", stderr); return 1; }
NUM(smax, argv[1]);
NUM(dmax, argv[2]);
NUM(slen, argv[3]);
NUM(runs, argv[4]);
dsize = ceil(slen * ceil(log2(smax + 1.)) / floor(log2(dmax + 1.)));
bsize = ceil(dsize * ceil(log2(dmax + 1.)) / floor(log2(smax + 1.)));
// generate random test vectors, encode, decode, and compare
srandomdev();
unsigned source[slen], dest[dsize], back[bsize];
unsigned mis = 0, i;
unsigned long long dtot = 0;
int ret;
for (i = 0; i < runs; i++) {
ranvec(source, slen, smax);
size_t dlen = dsize;
ret = recode(dest, &dlen, dmax, source, slen, smax);
if (ret) {
fprintf(stderr, "encode error %d\n", ret);
break;
}
dtot += dlen;
size_t blen = bsize;
ret = recode(back, &blen, smax, dest, dlen, dmax);
if (ret) {
fprintf(stderr, "decode error %d\n", ret);
break;
}
if (blen < slen || memcmp(source, back, slen)) // blen > slen is ok
mis++;
}
if (mis)
fprintf(stderr, "%u/%u mismatches!\n", mis, i);
if (ret == 0)
printf("mean dest/source symbols = %.4f (optimal = %.4f)\n",
dtot / (i * (double)slen), log(smax + 1.) / log(dmax + 1.));
return 0;
}
As has been pointed out in other StackOverflow answers, try not to think of summing digit * base^position as converting it to base ten; rather, think of it as directing the computer to generate a representation of the quantity represented by the number in its own terms (for most computers probably closer to our concept of base 2). Once the computer has its own representation of the quantity, we can direct it to output the number in any way we like.
By rejecting "big integer" implementations and asking for letter-by-letter conversion you are at the same time arguing that the numerical/alphabetical representation of quantity is not actually what it is, namely that each position represents a quantity of digit * base^position. If the nine-millionth character of War and Peace does represent what you are asking to convert it from, then the computer at some point will need to generate a representation for Д * 33^9000000.
I don't think any solution can work generally because if ne != m for some integer e and some MAX_INT because there's no way to calculate the value of the target base in a certain place p if np > MAX_INT.
You can get away with this for the case where ne == m for some e because the problem is recursively doable (the first e digits of n can be summed and converted into the first digit of M, and then chopped off and repeated.
If you don't have this useful property, then eventually you're going to have to try to take some part of the original base and try to perform modulus in np and np is going to be greater than MAX_INT, which means it's impossible.

Manually Converting rgba8 to rgba5551

I need to convert rgba8 to rgba5551 manually. I found some helpful code from another post and want to modify it to convert from rgba8 to rgba5551. I don't really have experience with bitewise stuff and haven't had any luck messing with the code myself.
void* rgba8888_to_rgba4444( void* src, int src_bytes)
{
// compute the actual number of pixel elements in the buffer.
int num_pixels = src_bytes / 4;
unsigned long* psrc = (unsigned long*)src;
unsigned short* pdst = (unsigned short*)src;
// convert every pixel
for(int i = 0; i < num_pixels; i++){
// read a source pixel
unsigned px = psrc[i];
// unpack the source data as 8 bit values
unsigned r = (px << 8) & 0xf000;
unsigned g = (px >> 4) & 0x0f00;
unsigned b = (px >> 16) & 0x00f0;
unsigned a = (px >> 28) & 0x000f;
// and store
pdst[i] = r | g | b | a;
}
return pdst;
}
The value of RGBA5551 is that it has color info condensed into 16 bits - or two bytes, with only one bit for the alpha channel (on or off). RGBA8888, on the other hand, uses a byte for each channel. (If you don't need an alpha channel, I hear RGB565 is better - as humans are more sensitive to green). Now, with 5 bits, you get the numbers 0 through 31, so r, g, and b each need to be converted to some number between 0 and 31, and since they are originally a byte each (0-255), we multiply each by 31/255. Here is a function that takes RGBA bytes as input and outputs RGBA5551 as a short:
short int RGBA8888_to_RGBA5551(unsigned char r, unsigned char g, unsigned char b, unsigned char a){
unsigned char r5 = r*31/255; // All arithmetic is integer arithmetic, and so floating points are truncated. If you want to round to the nearest integer, adjust this code accordingly.
unsigned char g5 = g*31/255;
unsigned char b5 = b*31/255;
unsigned char a1 = (a > 0) ? 1 : 0; // 1 if a is positive, 0 else. You must decide what is sensible.
// Now that we have our 5 bit r, g, and b and our 1 bit a, we need to shift them into place before combining.
short int rShift = (short int)r5 << 11; // (short int)r5 looks like 00000000000vwxyz - 11 zeroes. I'm not sure if you need (short int), but I've wasted time tracking down bugs where I didn't typecast properly before shifting.
short int gShift = (short int)g5 << 6;
short int bShift = (short int)b5 << 1;
// Combine and return
return rShift | gShift | bShift | a1;
}
You can, of course condense this code.

Character range in Java

I've read in a book:
..characters are just 16-bit unsigned integers under the hood. That means you can assign a number literal, assuming it will fit into the unsigned 16-bit range (65535 or less).
It gives me the impression that I can assign integers to characters as long as it's within the 16-bit range.
But how come I can do this:
char c = (char) 80000; //80000 is beyond 65535.
I'm aware the cast did the magic. But what exactly happened behind the scenes?
Looks like it's using the int value mod 65536. The following code:
int i = 97 + 65536;
char c = (char)i;
System.out.println(c);
System.out.println(i % 65536);
char d = 'a';
int n = (int)d;
System.out.println(n);
Prints out 'a' and then '97' twice (a is char 97 in ascii).

Resources