Why does D3DCOLORtoUBYTE4 multiply components by 255.001953f?

I’ve compiled a pixel shader that uses the D3DCOLORtoUBYTE4 intrinsic, then decompiled it.
Here’s what I found:
r0.xyzw = float4(255.001953,255.001953,255.001953,255.001953) * r0.zyxw;
o0.xyzw = (int4)r0.xyzw;
The rgba->bgra swizzle is expected, but why does it use 255.001953 instead of 255.0? The Data Conversion Rules documentation is quite specific about what should happen; it says the following:
Convert from float scale to integer scale: c = c * (2^n-1).

The short answer is: This is just part of the definition of the intrinsic. It's implemented that way in all the modern versions of the HLSL compiler.
255.0 as a 32-bit float is represented in binary as
0100`0011`0111`1111`0000`0000`0000`0000
255.001953 as a 32-bit float is actually represented as 255.001953125 which in binary is:
0100`0011`0111`1111`0000`0000`1000`0000
This slight bias helps in specific cases, such as the input value being 0.999999. If we used 255.0 you'd get 254, while with 255.001953 you get 255. In most other cases the answer after converting to integer (using truncation) is the same either way.
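For illustration, here is a small Python/NumPy stand-in for that scaling step in float32 (not the compiler's actual code path), using the 0.999999 example:
import numpy as np

# Python/NumPy stand-in for the scaling step, done in float32 like the shader.
c = np.float32(0.999999)

plain  = int(c * np.float32(255.0))        # truncates to 254
biased = int(c * np.float32(255.001953))   # truncates to 255

print(plain, biased)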
Some useful and interesting musings on floating-point numbers here

Related

Fused fast conversion from int16 to [-1.0, 1.0] float32 range in NumPy

I'm looking for the fastest and most memory-economical conversion routine from int16 to float32 in NumPy. My use case is conversion of audio samples, so real-world arrays are easily in the 100K-1M element range.
I came up with two ways.
The first converts int16 to float32, and then does the division in place. This requires at least two passes over the memory.
The second uses divide directly and specifies an out array that is in float32. Theoretically this should do only one pass over memory, and thus be a bit faster.
My questions:
Does the second way use float32 for division directly? (I hope it does not use float64 as an intermediate dtype)
In general, is there a way to do division in a specified dtype?
Do I need to specify some casting argument?
Same question about converting back from [-1.0, 1.0] float32 into int16
Thanks!
import numpy
a = numpy.array([1,2,3], dtype = 'int16')
# first
b = a.astype(numpy.float32)
c = numpy.divide(b, numpy.float32(32767.0), out = b)
# second
d = numpy.divide(a, numpy.float32(32767.0), dtype = 'float32')
print(c, d)
Does the second way use float32 for division directly? (I hope it does not use float64 as an intermediate dtype)
Yes. You can check that by looking at the code, or more directly by inspecting hardware performance counters, which clearly show that single-precision floating-point arithmetic instructions are executed (at least with NumPy 1.18).
In general, is there a way to do division in a specified dtype?
AFAIK, not directly with NumPy. Type promotion rules always apply. However, it is possible with Numba to perform the conversion element by element, which is much more efficient than using an intermediate array (costly to allocate and to read/write).
Do I need to specify some casting argument?
This is not needed here since there is no loss of precision in this case. Indeed, in the first version the input operands are of type float32, as is the result. For the second version, the type promotion rule is applied automatically and a is implicitly cast to float32 before the division (probably more efficiently than in the first method, as no full intermediate array needs to be created). The casting argument lets you control the level of safety (which is 'same_kind' by default): for example, you can set it to 'no' to be sure that no cast occurs at all (for both the operands and the result; an error is raised if a cast would be needed). See the documentation of can_cast for more information.
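For example, reusing the array from the question, the default casting rule accepts the int16 -> float32 promotion, while casting='no' rejects it (a small illustrative check, not from the original post):
import numpy as np

a = np.array([1, 2, 3], dtype=np.int16)

# The default casting rule allows the safe int16 -> float32 promotion.
d = np.divide(a, np.float32(32767.0), dtype='float32')
print(d.dtype)   # float32

# casting='no' forbids any implicit cast, so the same call raises an error
# because the int16 operand would have to be converted to float32 first.
try:
    np.divide(a, np.float32(32767.0), dtype='float32', casting='no')
except TypeError as err:
    print(err)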
Same question about converting back from [-1.0, 1.0] float32 into int16
A similar answer applies. However, you should take care with the type promotion rules, since float32 * int16 -> float32. Thus, the result of the multiply has to be cast back to int16, and a loss of accuracy appears. You can use the casting argument to explicitly allow the unsafe cast (the implicit behaviour is now deprecated) and maybe get better performance.
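As an illustration (my own sketch, with arbitrary sample values), the back-conversion into a preallocated int16 output needs casting='unsafe':
import numpy as np

f = np.array([-1.0, -0.5, 0.0, 0.5, 1.0], dtype=np.float32)

# Preallocated int16 output; float32 -> int16 is not a "same kind" cast,
# so it has to be allowed explicitly with casting='unsafe'.
out = np.empty(f.shape, dtype=np.int16)
np.multiply(f, np.float32(32767.0), out=out, casting='unsafe')
print(out)   # expected: [-32767 -16383 0 16383 32767] (truncated toward zero)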
Notes & advice:
I advise you to use Numba's @njit to perform the operation efficiently (a rough sketch follows below).
Note that modern processors are able to perform such operations very quickly if SIMD instructions are used. Consequently, the memory bandwidth and the cache allocation policy should be the two main limiting factors. Fast conversions can be achieved by preallocating buffers, by avoiding the creation of new temporary arrays, and by avoiding copies of unnecessary (large) arrays.
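For example, a minimal Numba sketch of a one-pass int16 -> float32 conversion into a preallocated buffer might look like this (the function name and the cache=True option are my own choices, not a prescribed API):
import numpy as np
from numba import njit

@njit(cache=True)
def int16_to_float32(src, dst):
    # One pass: cast each element and scale it, writing into the output buffer.
    scale = np.float32(1.0) / np.float32(32767.0)
    for i in range(src.size):
        dst[i] = np.float32(src[i]) * scale

a = np.array([1, 2, 3], dtype=np.int16)
b = np.empty(a.size, dtype=np.float32)
int16_to_float32(a, b)
print(b)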

When is (x==(x+y)-y) or (x==(x-y)+y) guaranteed for IEEE floats?

In C or another language which uses IEEE floats, I have two variables x and y which are both guaranteed to be finite, non-NaN, basically normal numbers.
I have some code which assumes, in essence, that the following code has no effect:
float x = get_x ();
float y = get_y ();
float old_x = x;
x += y;
x -= y;
assert (old_x == x);
x -= y;
x += y;
assert (old_x == x);
I know that this will be true for certain classes of values, i.e. those which do not have "many" significant figures in the mantissa, but I would like to be clear about the edge cases.
For example, the binary representation of 1.3 will have significant figures all the way down the mantissa, and so will 1.7, and I should not assume that 1.3+1.7==3 exactly, but can I assume that if I add such numbers together and then subtract them, or vice versa, I will get the first value back again?
What are the formal edge conditions for this?
The number of bits in the floating point pipeline is not part of the standard.
From Wikipedia:
The standard also recommends extended format(s) to be used to perform
internal computations at a higher precision than that required for the
final result, to minimise round-off errors: the standard only
specifies minimum precision and exponent requirements for such
formats. The x87 80-bit extended format is the most commonly
implemented extended format that meets these requirements.
So since the internal formats can be extended, and you don't know when intermediate results get truncated back to the standard formats or which rounding method is being used, the assumption that adding a value and then subtracting it again will return the original value is not guaranteed by the standard.
For the trivial case you posted it probably would work most of the time.
Then there is the case of handling NaN.
You may be able to determine the edge cases for the architecture you are currently using, but it's probably easier to just check whether the current value is within a margin of error of the original value.
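As a concrete counter-example in single precision (a NumPy sketch of the failure mode, not from the original answer): pick y much larger than x, so that the intermediate sum x + y rounds away all of x's significant bits.
import numpy as np

# When y is much larger than x, x + y rounds to y exactly, so subtracting y
# afterwards cannot recover the original value of x.
x = np.float32(0.1)
y = np.float32(1e8)

round_trip = (x + y) - y
print(x, round_trip, x == round_trip)   # 0.1 0.0 False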

Overflow in a random number generator and 4-byte vs. 8-byte integers

The famous linear congruential random number generator, also known as the minimal standard generator, uses the formula
x(i+1)=16807*x(i) mod (2^31-1)
I want to implement this using Fortran.
However, as pointed out by "Numerical Recipes", directly implementing the formula with the default integer type (32-bit) will cause 16807*x(i) to overflow.
So the book recommends Schrage's algorithm, which is based on an approximate factorization of m. This method can still be implemented with the default integer type.
However, Fortran actually has an Integer(8) type whose range is -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807, which is much bigger than 16807*x(i) could ever be.
But the book even says the following:
It is not possible to implement equations (7.1.2) and (7.1.3) directly
in a high-level language, since the product of a and m − 1 exceeds the
maximum value for a 32-bit integer.
So why can't we just use Integer(8) type to implement the formula directly?
Whether or not you can have 8-byte integers depends on your compiler and your system. What's worse is that the actual value to pass to kind to get a specific precision is not standardized. While most Fortran compilers I know use the number of bytes (so 8 would be 64 bit), this is not guaranteed.
You can use the selected_int_kind intrinsic function to get an integer kind that has a certain range. This code compiles on my 64-bit computer and works fine:
program ran
    implicit none
    integer, parameter :: i8 = selected_int_kind(R=18)
    integer(kind=i8) :: x
    integer :: i

    x = 100
    do i = 1, 100
        x = my_rand(x)
        write(*, *) x
    end do

contains

    function my_rand(x)
        implicit none
        integer(kind=i8), intent(in) :: x
        integer(kind=i8) :: my_rand
        my_rand = mod(16807_i8 * x, 2_i8**31 - 1)
    end function my_rand

end program ran
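As a quick cross-check of why 64 bits are enough (a Python/NumPy sketch, not part of the Fortran answer above): the worst-case product 16807 * (2^31 - 2) is only about 3.6e13, far below the 64-bit limit of about 9.2e18.
import numpy as np

# The largest possible product fits comfortably in a signed 64-bit integer.
print(16807 * (2**31 - 2), 16807 * (2**31 - 2) < 2**63 - 1)   # ..., True

# The same recurrence carried out in explicit 64-bit arithmetic.
a = np.int64(16807)
m = np.int64(2**31 - 1)
x = np.int64(100)
for _ in range(5):
    x = (a * x) % m
    print(x)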
Update and explanation of @VladimirF's comment below
Modern Fortran delivers an intrinsic module called iso_fortran_env that supplies constants that reference the standard variable types. In your case, one would use this:
program ran
    use, intrinsic :: iso_fortran_env, only: int64
    implicit none
    integer(kind=int64) :: x
and then as above. This code is easier to read than the old selected_int_kind. (Why did R have to be 18 again?)
Yes. The simplest thing is to append _8 to the integer constants to make them 8 bytes. I know it is "old style" Fortran, but it is portable and unambiguous.
By the way, when you write:
16807*x mod (2^31-1)
this is equivalent to taking the result of 16807*x and ANDing it with a 32-bit mask where all the bits are set to one except the sign bit.
The efficient way to write it, avoiding the expensive mod function, is:
iand(16807_8*x, Z'7FFFFFFF')
Update after comment:
or
iand(16807_8*x, 2147483647_8)
if your super modern compiler does not have backwards compatibility.

Why is this Transpose() required in my WorldViewProj matrix?

Given a super-basic vertex shader such as:
output.position = mul(position, _gWorldViewProj);
I was having a great deal of trouble because I was setting _gWorldViewProj as follows; I tried both (a bit of flailing) to make sure it wasn't just backwards.
mWorldViewProj = world * view * proj;
mWorldViewProj = proj * view * world;
My solution turned out to be:
mWorldView = mWorld * mView;
mWorldViewProj = XMMatrixTranspose(mWorldView * proj);
Can someone explain why this XMMatrixTranspose was required? I know there were matrix differences between XNA and HLSL (I think) but not between vanilla C++ and HLSL, though I could be wrong.
Problem is I don't know if I'm wrong or what I'm wrong about! So if someone could tell me precisely why the transpose is required, I hopefully won't make the same mistake again.
On the CPU, 2D arrays are generally stored in row-major ordering, so the order in memory goes x[0][0], x[0][1], ... In HLSL, matrix declarations default to column-major ordering, so the order goes x[0][0], x[1][0], ...
In order to transform the memory from the format defined on the CPU to the order expected in HLSL, you need to transpose the CPU matrix before sending it to the GPU. Alternatively, you can use the row_major keyword in HLSL to declare the matrices as row-major, eliminating the need for a transpose but leading to different codegen in HLSL (you'll often end up with mul-adds instead of dot-products).
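To see the layout mismatch concretely, here is a small NumPy sketch (standing in for the CPU-side buffer, not part of the original answer): the same row-major byte stream, read back in column-major order, comes out as the transpose of the original matrix.
import numpy as np

m = np.arange(16, dtype=np.float32).reshape(4, 4)     # row-major, like the CPU matrix

row_major_bytes = m.tobytes(order='C')                 # the bytes that get uploaded
as_column_major = np.frombuffer(row_major_bytes, dtype=np.float32).reshape(4, 4, order='F')

print(np.array_equal(as_column_major, m.T))            # True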

What has a better performance: multiplication or division?

Which version is faster:
x * 0.5
or
x / 2 ?
I took a course at university called computer systems some time ago. From back then I remember that multiplying two values can be achieved with comparatively "simple" logic gates, but division is not a "native" operation and requires a sum register that is repeatedly increased by the divisor in a loop and compared to the dividend.
Now I have to optimise an algorithm with a lot of divisions. Unfortunately it's not just dividing by two, so binary shifting is not an option. Will it make a difference to change all divisions to multiplications?
Update:
I have changed my code and didn't notice any difference. You're probably right about compiler optimisations. Since all the answers were great, I've upvoted them all. I chose rahul's answer because of the great link.
Usually division is a lot more expensive than multiplication, but a smart compiler will often convert division by a compile-time constant to a multiplication anyway. If your compiler is not smart enough though, or if there are floating point accuracy issues, then you can always do the optimisation explicitly, e.g. change:
float x = y / 2.5f;
to:
const float k = 1.0f / 2.5f;
...
float x = y * k;
Note that this is most likely a case of premature optimisation - you should only do this kind of thing if you have profiled your code and positively identified division as being a performance bottleneck.
Division by a compile-time constant that's a power of 2 is quite fast (comparable to multiplication by a compile-time constant) for both integers and floats (it's basically convertible into a bit shift).
For floats, even dynamic division by powers of two is much faster than regular (dynamic or static) division, as it basically turns into a subtraction on the exponent.
In all other cases, division appears to be several times slower than multiplication.
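As a side note on the power-of-two cases above: dividing a normal float by a power of two only changes the exponent, so the result is exact, which is what makes the cheap path possible. A quick Python check (my own illustration):
import math

# frexp splits a float into mantissa and exponent; dividing by 8 leaves the
# mantissa untouched and lowers the exponent by 3.
x = 3.141592653589793
mant_x, exp_x = math.frexp(x)
mant_d, exp_d = math.frexp(x / 8.0)

print(mant_x == mant_d, exp_x - exp_d)   # True 3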
For a dynamic divisor the slowdown factor on my Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz appears to be about 8, for static ones about 2.
The results are from a little benchmark of mine, which I made because I was somewhat curious about this (notice the aberrations at powers of two):
(The benchmark results chart is not reproduced here. Legend: ulong -- 64-bit unsigned; a 1 in the label means a dynamic argument; a 0 means a statically known argument.)
The results were generated from the following bash template:
#include <stdio.h>
#include <stdlib.h>

typedef unsigned long ulong;

int main(int argc, char** argv){
    $TYPE arg = atoi(argv[1]);
    $TYPE i = 0, res = 0;
    for (i = 0; i < $IT; i++)
        res += i $OP $ARG;
    printf($FMT, res);
    return 0;
}
with the $-variables assigned and the resulting program compiled with -O3 and run (dynamic values came from the command line, as is obvious from the C code).
Well, if it is a single calculation you will hardly notice any difference, but if you are talking about millions of operations then division is definitely costlier than multiplication. You can always use whatever is clearest and most readable.
Please refer to this link: Should I use multiplication or division?
That will likely depend on your specific CPU and the types of your arguments. For instance, in your example you're doing a floating-point multiplication but an integer division. (Probably, at least, in most languages I know of that use C syntax.)
If you are doing work in assembler, you can look up the specific instructions you are using and see how long they take.
If you are not doing work in assembler, you probably don't need to care. All modern compilers with optimization will change your operations in this way to the most appropriate instructions.
Your big wins on optimization will not be from twiddling the arithmetic like this. Instead, focus on how well you are using your cache. Consider whether there are algorithm changes that might speed things up.
One note to make, if you are looking for numerical stability:
Don't recycle the division for solutions that require multiple components/coordinates, e.g. when implementing an n-D vector normalize() function. That is, the following will NOT give you a unit-length vector:
V3d v3d(x,y,z);
float l = v3d.length();
float oneOverL = 1.f / l;
v3d.x *= oneOverL;
v3d.y *= oneOverL;
v3d.z *= oneOverL;
assert(1. == v3d.length()); // fails!
.. but this code will..
V3d v3d(x,y,z);
float l = v3d.length();
v3d.x /= l;
v3d.y /= l;
v3d.z /= l;
assert(1. == v3d.length()); // ok!
I guess the problem in the first code excerpt is the additional floating-point rounding: the up-front division imposes its own rounding on the reciprocal, which is then forced upon each component and introduces additional error.
I didn't look into this for too long, so please share your explanation of why this happens. I tested it with x, y and z being .1f (and with doubles instead of floats).
