The following code performs the same computation either using an Eigen vector purely as a container or using a plain C array. It produces a close but not bit-for-bit identical result.
The final mathematical operation is x * alpha + y * beta.
#include <Eigen/Eigen>
int main()
{
Eigen::VectorXd x(2);
double* y = new double[2];
long long int a = 4603016991731078785;
double ga = *(double*)(&a);
long long int b = -4617595986472363966;
double gb = *(double*)(&b);
long long int x0 = 451;
long long int x1 = -9223372036854775100;
x[0] = *(double*)(&x0);
y[0] = *(double*)(&x0);
x[1] = *(double*)(&x1);
y[1] = *(double*)(&x1);
double r = ga*x[0] + gb*x[1];
double s = ga*y[0] + gb*y[1];
}
Why is this so?
Results differ between MSVC and gcc (64-bit OS).
This is probably because one computation is done completely within the FPU (floating-point unit) with 80 bits of precision, while the other computation partially uses 64 bits of precision (the size of a double). This can also be demonstrated without using Eigen. Look at the following program:
int main()
{
// Load ga, gb, y[0], y[1] as in original program
double* y = new double[2];
long long int a = 4603016991731078785;
double ga = *(double*)(&a);
long long int b = -4617595986472363966;
double gb = *(double*)(&b);
long long int x0 = 451;
long long int x1 = -9223372036854775100;
y[0] = *(double*)(&x0);
y[1] = *(double*)(&x1);
// Compute s as in original program
double s = ga*y[0] + gb*y[1];
// Same computation, but in steps
double r1 = ga*y[0];
double r2 = gb*y[1];
double r = r1+r2;
}
If you compile this without optimization, you will see that r and s have different values (at least, I saw that on my machine). Looking at the assembly code: in the first computation, the values of ga, y[0], gb and y[1] are loaded into the FPU, the calculation ga * y[0] + gb * y[1] is done, and then the result is stored in memory. The FPU does all computations with 80 bits, but when the result is stored in memory the number is rounded so that it fits within the 64 bits of a double variable.
The second computation proceeds differently. First, ga and y[0] are loaded into the FPU, multiplied, rounded to a 64-bit number and stored in memory. Then, gb and y[1] are loaded into the FPU, multiplied, rounded to a 64-bit number and stored in memory. Finally, r1 and r2 are loaded into the FPU, added, rounded to a 64-bit number and stored in memory. This time, the computer rounds the intermediate results, and this leads to the difference.
For this computation, rounding has a fairly large effect because you are working with denormal numbers.
Now, here comes the bit where I am not so certain (and if this was your question, I apologize): what does this have to do with the original program, where x is an Eigen container? There the computation goes as follows: a function from Eigen is called to get x[0], then ga and the result of that call are loaded into the FPU, multiplied, and stored in a temporary memory location (64 bits, so this is rounded). Then gb and x[1] are loaded into the FPU, multiplied, added to the intermediate result from the temporary memory location, and finally stored in r. So in the computation of r in the original program, the result of ga*x[0] is rounded to 64 bits. Perhaps the reason for this is that the floating-point stack is not preserved across function calls.
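To see the effect without Eigen and without reading assembly, here is a hedged sketch (assuming an x87 build, e.g. 32-bit gcc without SSE math; with SSE2 every operation is already rounded to 64 bits and both variants should agree): forcing each intermediate through a volatile double reproduces the "computed in steps" result.
#include <cstring>
int main()
{
    long long ia = 4603016991731078785LL, ib = -4617595986472363966LL;
    long long i0 = 451LL, i1 = -9223372036854775100LL;
    double ga, gb, y0, y1;
    std::memcpy(&ga, &ia, 8);  // memcpy sidesteps the strict-aliasing
    std::memcpy(&gb, &ib, 8);  // concerns of *(double*)(&a)
    std::memcpy(&y0, &i0, 8);
    std::memcpy(&y1, &i1, 8);
    volatile double r1 = ga * y0;  // rounded to 64 bits here...
    volatile double r2 = gb * y1;  // ...and here
    double r = r1 + r2;            // the "in steps" result
    double s = ga * y0 + gb * y1;  // may be evaluated entirely at 80 bits on x87
    return r == s ? 0 : 1;
}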
Related
I would like to compare two vectors of doubles based on their absolute values.
That is, the vector equivalent of the following:
if (fabs(x) < fabs(y)) {
...
}
Is there anything better than just taking the absolute value of each side and following up with a _mm256_cmp_pd?
Interested in all of AVX, AVX2, and AVX-512 flavors.
With AVX-512 you can save one µop: instead of 2x vandpd + vcmppd you can use vpternlogq + vpcmpuq. Note that the solution below assumes that the numbers are not NaN.
IEEE-754 floating point numbers have the nice property that they are encoded such that if x[62:0] < y[62:0] as unsigned integers, then abs(x) < abs(y) as floating point values.
So, instead of setting both sign bits to 0, we can copy the sign bit of x
to the sign bit of y and compare the result as an unsigned integer.
In the (untested) code below, for negative x both xi[63] and yi_sgnx[63] are 1,
while for positive x, both xi[63] and yi_sgnx[63] are 0.
So the unsigned integer compare actually compares xi[62:0] with yi[62:0], which is just what we need for the comparison abs(x)<abs(y).
The vpternlog instruction is suitable for copying the sign bit: immediate 0xCA encodes the bitwise select a ? b : c, so with selector z = 0x7FFFFFFFFFFFFFFF the result takes bits 62:0 from yi and bit 63 from xi.
__mmask8 cmplt_via_ternlog(__m512d x, __m512d y){
    __m512i xi = _mm512_castpd_si512(x);
    __m512i yi = _mm512_castpd_si512(y);
    __m512i z  = _mm512_set1_epi64(0x7FFFFFFFFFFFFFFFull);
    __m512i yi_sgnx = _mm512_ternarylogic_epi64(z, yi, xi, 0xCA); /* y with x's sign bit */
    return _mm512_cmp_epu64_mask(xi, yi_sgnx, 1);                 /* _MM_CMPINT_LT */
}
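For the AVX and AVX2 flavors also asked about (no mask registers there), the plain abs-then-compare route from the question is hard to beat; a minimal untested sketch, with the same "no NaN" assumption:
#include <immintrin.h>
__m256d cmplt_abs_avx2(__m256d x, __m256d y)
{
    const __m256d abs_mask =
        _mm256_castsi256_pd(_mm256_set1_epi64x(0x7FFFFFFFFFFFFFFFll));
    __m256d ax = _mm256_and_pd(x, abs_mask);   // clear the sign bit: |x|
    __m256d ay = _mm256_and_pd(y, abs_mask);   // |y|
    return _mm256_cmp_pd(ax, ay, _CMP_LT_OQ);  // |x| < |y|, ordered, quiet
}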
I am working on some matrix-related problems in C++. I want to solve the problem Y = a*X + Y, where X and Y are matrices and a is a constant. I thought about using the daxpy BLAS routine; however, according to the documentation DAXPY is a vector routine, and I am not getting the same results as when I solve the same problem in MATLAB.
I am currently running this:
F77NAME(daxpy)(N, a, X, 1, Y, 1);
When you need to perform the operation Y = a*X + Y it does not matter whether X and Y are 1D vectors or 2D matrices, since the operation is done element-wise.
So, if you allocated the matrices with single pointers, e.g. double* A = new double[M*N];, then you can use daxpy by defining the dimension of the vector as M*N:
int MN = M*N;
int one = 1;
F77NAME(daxpy)(&MN, &a, &X, &one, &Y, &one);
The same goes for a stack-allocated two-dimensional matrix such as double A[3][2];, since that memory is laid out contiguously.
Otherwise, you need to use a for loop and add each row separately.
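For the contiguous case, a minimal sketch using the CBLAS interface (the F77NAME-style Fortran binding above takes the same arguments by pointer; cblas_daxpy and the cblas.h header are assumed available):
#include <cblas.h>
// Y = a*X + Y on an M x N matrix stored contiguously; row- vs column-major
// does not matter as long as X and Y use the same layout.
void matrix_axpy(int M, int N, double a, const double *X, double *Y)
{
    cblas_daxpy(M * N, a, X, 1, Y, 1);  // treat both matrices as length-M*N vectors
}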
I want to design a synthesizable module in Verilog which will take only one cycle to calculate the square root of a given 32-bit input.
[Edit1] repaired code
Recently I found the results were off even though the tests said all was OK, so I dug deeper and found out that I had a silly bug in my equation; due to name conflicts with my programming environment the tests got false positives, so I overlooked it before. Now it works in all cases as it should.
The best thing I can think of (except approximation or a large LUT) is binary search without multiplication; here is the C++ code:
//---------------------------------------------------------------------------
#include <cstdint>
typedef uint16_t WORD;  // Windows-style aliases so the snippet is self-contained
typedef uint32_t DWORD;

WORD u32_sqrt(DWORD xx) // 16 T
{
DWORD x,m,a0,a1,i;
const DWORD lut[16]=
{
// m*m
0x40000000,
0x10000000,
0x04000000,
0x01000000,
0x00400000,
0x00100000,
0x00040000,
0x00010000,
0x00004000,
0x00001000,
0x00000400,
0x00000100,
0x00000040,
0x00000010,
0x00000004,
0x00000001,
};
for (x=0,a0=0,m=0x8000,i=0;m;m>>=1,i++)
{
a1=a0+lut[i]+(x<<(16-i));
if (a1<=xx) { a0=a1; x|=m; }
}
return x;
}
//---------------------------------------------------------------------------
Standard binary search sqrt(xx) sets the bits of x from MSB to LSB so that the result satisfies x*x <= xx. Luckily we can avoid the multiplication by rewriting the thing as an incremental update: in each iteration the previous x*x result can be reused like this:
x1 = x0+m
x1*x1 = (x0+m)*(x0+m) = (x0*x0) + (2*m*x0) + (m*m)
where x0 is the value of x from the last iteration and x1 is the current value. m is the weight of the currently processed bit. The (m*m) term is a constant usable as a LUT, and (2*m*x0) is just a bit-shift, so no multiplication is needed; only addition. Sadly the iteration is bound to sequential computation, forbidding parallelisation, so the result is 16T at best.
In the code, a0 represents the last x*x and a1 the currently iterated x*x.
As you can see, the sqrt is done in 16 x (BitShiftLeft, BitShiftRight, OR, Plus, Compare), where the bit shifts and LUT can be hardwired.
If you have super fast gates for this in comparison to the rest, you can multiply the input clock by 16 and use that as the internal timing for the SQRT module, similar to the old days when the machine-cycle clock was derived by dividing the source CPU clock in old Intel CPUs/MCUs. This way you can get 1T timing (or a multiple of it, depending on the multiplication ratio).
There is also the route of converting to a logarithm, halving, and converting back, since sqrt(x) = antilog(log(x) / 2).
For an idea how to implement "combinatorial" log and antilog, see Michael Dunn's EDN article showing a priority encoder, barrel shifter & lookup table, with three log variants in SystemVerilog for download.
(A priority encoder, barrel shifter & lookup table also look promising for a one-step Babylonian/Heron/Newton-Raphson iteration. But that would probably still need a 128K by 9 bit lookup table.)
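In software the whole pipeline collapses to a one-liner, which may help as intuition; an illustration only (plain C math functions standing in for the priority-encoder/barrel-shifter log and antilog blocks, not synthesizable hardware):
#include <cmath>
unsigned isqrt_via_log(unsigned x)
{
    // sqrt(x) = antilog(log(x) / 2); the + 0.5 rounds to the nearest integer
    return x ? (unsigned)(std::exp2(std::log2((double)x) / 2.0) + 0.5) : 0;
}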
While not featuring "verilog",
Tole Sutikno: "An Optimized Square Root Algorithm for Implementation in FPGA Hardware" shows a combinatorial implementation of a modified (binary) digit-by-digit algorithm.
In 2018, T. Bagala, A. Fibich, M. Hagara,
P. Kubinec, O. Ondráček, V. Štofanik and R. Stojanović authored Single Clock Square Root Algorithm Based on Binomial Series and its FPGA Implementation.
Local oscillator runs at 50MHz [… For 16 bit input mantissa,] Values from [the hardware] experiment were the same as values from simulation […] Obtained delay averages were 892ps and 906ps respectively.
(No explanation is given for the discrepancy between 50 MHz and ~0.9 ns, nor for the quoted ps resolution versus the use of a 10 Gsps scope. If it was about 18 cycles (due to pipelining rather than looping?) / ~900 ns, the interpretation of Single Clock Square Root… remains open - maybe one result per cycle.)
The paper discloses next to no details about the evaluation of the binomial series.
While the equations are presented in a general form, too, my guess is that the amount of hardware needed for a greater number of bits becomes prohibitive quickly.
I got the code; here it is. (Note that the fixed seeds plus seven Newton iterations can return a rounded rather than truncated root - e.g. a = 3 yields 2 - and a = 0 causes a divide by zero in the iteration, so guard for that.)
module sqrt(
input[31:0]a,
output[15:0]out
);
reg [31:0]temp;
reg[14:0]x;
always @*
begin
if(a<257)x=4;
if(a>256 && a<65537)x=80;
if(a>65536 && a<16777217)x=1000;
if(a>16777216 && a<=4294967295)x=20000;
temp=(x+(a/x))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
temp=(temp+(a/temp))/2;
end
assign out = temp[15:0];
endmodule
The usual means of doing this in hardware is using a CORDIC. A general implementation allows the calculation of a variety of transcendental functions (cos/sin/tan) and... square roots depending on how you initialize and operate the CORDIC.
It's an iterative algorithm so to do it in a single cycle you'd unroll the loop into as many iterations as you require for your desired precision and chain the instances together.
Specifically, if you operate the CORDIC in vectoring mode, initialize it with [x, 0], and rotate to 45 degrees, the final [x', y'] output will be a multiplicative constant away, i.e. sqrt(x) = x' * sqrt(2) * K.
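To make the unrolling idea concrete, here is a hedged sketch of the shift-and-add stage chain in circular vectoring mode, which drives y to 0 and leaves a scaled magnitude in x (the square-root setup described above would use the hyperbolic variant instead; cordic_magnitude and the Q16 scaling are illustrative assumptions, not the answer's exact construction):
#include <cstdint>
int32_t cordic_magnitude(int32_t x, int32_t y)  // Q16 fixed-point inputs
{
    for (int i = 0; i < 16; i++) {         // unroll these stages in hardware
        int32_t xs = x >> i, ys = y >> i;  // use pre-update values
        if (y >= 0) { x += ys; y -= xs; }  // rotate toward the x axis
        else        { x -= ys; y += xs; }
    }
    // x now holds K * sqrt(x0^2 + y0^2), with CORDIC gain K ~ 1.6468
    return x;
}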
My version of Spektre's answer, with a variable input bit count so it can be faster on short operands.
const unsigned int isqrt_lut[16] =
{
// m*m
0x40000000,
0x10000000,
0x04000000,
0x01000000,
0x00400000,
0x00100000,
0x00040000,
0x00010000,
0x00004000,
0x00001000,
0x00000400,
0x00000100,
0x00000040,
0x00000010,
0x00000004,
0x00000001,
};
/// Our largest golf ball image is about 74 pixels, so let's round up to a power of 2 and we get 128.
/// 128 squared is 16384, so our largest sqrt has to handle 16383, or 14 bits. Only positive values.
/// ** maxBitsIn is 2 to 32, always an even number **
/// The input value must always be at most (2^maxBitsIn) - 1
unsigned int isqrt(unsigned int xx, int maxBitsIn) {
unsigned int x, m, a0, a1, i;
for (x = 0, a0 = 0, m = 0x01 << (maxBitsIn / 2 - 1), i = 16 - maxBitsIn / 2; m; m >>= 1, i++)
{
a1 = a0 + isqrt_lut[i] + (x << (16 - i));
if (a1 <= xx) {
a0 = a1;
x |= m;
}
}
return x;
}
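For example, for the 14-bit golf ball case from the comments above:
unsigned int r = isqrt(16383, 14);  // -> 127, since 127*127 = 16129 <= 16383 < 128*128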
What is the execution time for a double multiplication on a 16 bit microcontroller with only multiplication hardware support? No FPU.
I know it runs through a sequence of code to calculate it. I'm just not sure how long it takes to run through it.
example
double conversion = 0.03039013;
double distance= 10.23456;
double total = conversion * distance;//cost of this line
Has anyone timed it?
What is the difference between 64-bit and 32-bit float multiplication with respect to time? Is there much to gain from using 32 bits over 64?
I ran two scenarios: 32-bit float computation and 64-bit double computation. The platform was a 16-bit Renesas M16C/28 MCU (this platform has a multiplier, but no floating-point hardware), running at 20 MHz, 1 cycle = 50 ns.
Note: the timing was done in software, so it isn't perfect, but the idea and concept are proven by it.
Scenario 1:
void floatMultiple(void)
{
float a = 123456.1234;
float b = 123456.1234;
float result = 0;
result = a * b;
}
Timing in cycles:
Best case: 305 (15.25 µs)
Worst case: 2033 (101.65 µs)
Scenario 2:
void doubleMultiple(void)
{
double a = 123456.1234;
double b = 123456.1234;
double result = 0;
result = a * b;
}
Using the same numbers on the same system, only changing the type:
Best case: 2356 (117.8 µs)
Worst case: 14567 (728.35 µs)
There is a little overhead in my timing system; I would guess around 100 cycles, due to function calls.
This still shows the significant difference between using a float and a double on a 16-bit MCU: a double takes about 7 times longer (for this platform).
There can be a difference in the assembly code generated to calculate floating-point values on different systems, so the exact ratio will vary with platform and compiler.
int x = n / 3; // <-- make this faster
// for instance
int a = n * 3; // <-- normal integer multiplication
int b = (n << 1) + n; // <-- potentially faster multiplication
The guy who said "leave it to the compiler" was right, but I don't have the "reputation" to mod him up or comment. I asked gcc to compile int test(int a) { return a / 3; } for an ix86 and then disassembled the output. Just for academic interest, what it's doing is roughly multiplying by 0x55555556 and then taking the top 32 bits of the 64 bit result of that. You can demonstrate this to yourself with eg:
$ ruby -e 'puts(60000 * 0x55555556 >> 32)'
20000
$ ruby -e 'puts(72 * 0x55555556 >> 32)'
24
$
The wikipedia page on Montgomery division is hard to read but fortunately the compiler guys have done it so you don't have to.
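If you want to reproduce that trick by hand, here is a hedged sketch (div3_signed is my illustrative name, not anything the compiler emits; the final add corrects negative inputs the way gcc's sequence does, and an arithmetic right shift on signed values is assumed):
#include <cstdint>
int32_t div3_signed(int32_t n)
{
    int32_t hi = (int32_t)(((int64_t)n * 0x55555556LL) >> 32);  // top 32 bits of the product
    return hi + ((uint32_t)n >> 31);  // add 1 when n is negative (truncation fix-up)
}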
This is the fastest option, as the compiler will optimize it if it can, depending on the target processor:
int a;
int b;
a = some value;
b = a / 3;
There is a faster way to do it if you know the ranges of the values: for example, if you are dividing a signed integer by 3 and you know the range of the value to be divided is 0 to 768, then you can multiply it by a factor and shift it right by the corresponding power of 2, where the factor approximates that power of 2 divided by 3.
eg.
Range 0 -> 768
You could use a shift of 11 bits, which multiplies by 2048; you want to divide by 3, so your multiplier should be 2048 / 3 rounded up: 683.
You can now use (x * 683) >> 11, which equals x / 3 exactly for every x from 0 to 2047. (Beware of rounding the multiplier down: (x * 341) >> 10 comes out one too low whenever x is an exact multiple of 3, e.g. (3 * 341) >> 10 is 0, not 1.)
(Make sure the shift is a signed shift if using signed integers, and also make sure it is an actual shift and not a bit roll.)
This will effectively divide the value by 3, and runs at about 1.6 times the speed of a natural divide by 3 on a standard x86 / x64 CPU.
Of course the only reason you can make this optimization when the compiler can't is because the compiler does not know the maximum range of x and therefore cannot make this determination, but you as the programmer can.
Sometimes it may even be beneficial to move the value into a larger type and then do the same thing, i.e. if you have an int of full range you could make it a 64-bit value and then do the multiply and shift instead of dividing by 3.
I had to do this recently to speed up image processing; I needed to find the average of 3 color channels, each color channel with a byte range (0 - 255): red, green and blue.
At first I just simply used:
avg = (r + g + b) / 3;
(So r + g + b has a maximum of 765 and a minimum of 0, because each channel is a byte 0 - 255.)
After millions of iterations the entire operation took 36 milliseconds.
I changed the line to:
avg = (r + g + b) * 341 >> 10;
And that took it down to 22 milliseconds; it's amazing what can be done with a little ingenuity. (This uses the rounded-down multiplier, so the average comes out one low whenever r + g + b is an exact multiple of 3 - acceptable for image averaging, but see above for the exact constant.)
This speed-up occurred in C# even though I had optimisations turned on and was running the program natively, without debugging info and not through the IDE.
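A quick throwaway sanity check of the exact variant over its full range:
#include <cassert>
int main()
{
    for (int x = 0; x <= 2047; x++)
        assert((x * 683) >> 11 == x / 3);  // exact floor division on 0..2047
    return 0;
}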
See How To Divide By 3 for an extended discussion of more efficiently dividing by 3, focused on doing FPGA arithmetic operations.
Also relevant:
Optimizing integer divisions with Multiply Shift in C#
Depending on your platform and depending on your C compiler, a native solution like just using
y = x / 3
can be fast or it can be awfully slow (even if division is done entirely in hardware; if it is done using a DIV instruction, this instruction is about 3 to 4 times slower than a multiplication on modern CPUs). Very good C compilers with optimization flags turned on may optimize this operation, but if you want to be sure, you are better off optimizing it yourself.
For optimization it is important to have integer numbers of a known size. In C, int has no fixed size (it can vary by platform and compiler!), so you are better off using the C99 fixed-size integers. The code below assumes that you want to divide an unsigned 32-bit integer by three and that your C compiler knows about 64-bit integer numbers (NOTE: even on a 32-bit CPU architecture most C compilers can handle 64-bit integers just fine):
static inline uint32_t divby3 (
uint32_t divideMe
) {
return (uint32_t)(((uint64_t)0xAAAAAAABULL * divideMe) >> 33);
}
As crazy as this might sound, the method above indeed does divide by 3. All it needs is a single 64-bit multiplication and a shift (like I said, multiplications might be 3 to 4 times faster than divisions on your CPU). In a 64-bit application this code will be a lot faster than in a 32-bit application (in a 32-bit application, multiplying two 64-bit numbers takes 3 multiplications and 3 additions on 32-bit values); however, it might still be faster than a division on a 32-bit machine.
On the other hand, if your compiler is a very good one and knows the trick of optimizing integer division by a constant (the latest GCC does, I just checked), it will generate the code above anyway (GCC will create exactly this code for "/3" if you enable at least optimization level 1). For other compilers you cannot rely on or expect such tricks, even though this method is very well documented and mentioned everywhere on the Internet.
The problem is that this only works for constant divisors, not variable ones. You always need to know the magic number (here 0xAAAAAAAB) and the correct operations after the multiplication (shifts and/or additions in most cases); both are different depending on the number you want to divide by, and both take too much CPU time to calculate on the fly (that would be slower than hardware division). However, it's easy for a compiler to calculate these at compile time (where a second more or less of compile time hardly plays a role).
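If you don't trust the magic constant, it is cheap to verify exhaustively; a throwaway check (takes a few seconds):
#include <cassert>
#include <cstdint>
int main()
{
    for (uint32_t x = 0; ; x++) {
        assert((uint32_t)(((uint64_t)0xAAAAAAABULL * x) >> 33) == x / 3);
        if (x == 0xFFFFFFFFu) break;  // avoid wrap-around in the loop condition
    }
    return 0;
}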
For 64 bit numbers:
uint64_t divBy3(uint64_t x)
{
return x*12297829382473034411ULL;
}
However, this isn't the truncating integer division you might expect.
It works correctly if the number is divisible by 3, but it returns a huge number if it isn't.
For example, if you run it on 11, it returns 6148914691236517209. This looks like garbage, but it's in fact the correct answer: multiply it by 3 and you get back 11!
If you are looking for truncating division, then just use the / operator. I highly doubt you can get much faster than that.
Theory:
64-bit unsigned arithmetic is arithmetic modulo 2^64.
This means that for each integer which is coprime with the modulus 2^64 (essentially all odd numbers) there exists a multiplicative inverse, which you can multiply by instead of dividing. This magic number can be obtained by solving the equation 3*x + 2^64*y = 1 using the extended Euclidean algorithm.
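A hedged sketch of obtaining such an inverse without the full extended Euclidean algorithm, via Newton iteration on the 2-adic inverse (modinv_pow2_64 is an illustrative name; each step doubles the number of correct low bits):
#include <cstdint>
uint64_t modinv_pow2_64(uint64_t d)  // d must be odd
{
    uint64_t x = d;              // already correct to 3 bits, since d*d == 1 (mod 8)
    for (int i = 0; i < 5; i++)
        x *= 2 - d * x;          // 3 -> 6 -> 12 -> 24 -> 48 -> 96 correct bits
    return x;                    // for d = 3: 0xAAAAAAAAAAAAAAAB = 12297829382473034411
}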
What if you really don't want to multiply or divide? Here is an approximation I just invented. It works because x/3 = x/4 + x/12. But since x/12 = (x/4)/3, we just have to repeat the process until it's good enough.
#include <stdio.h>
int main()
{
int n = 1000;
int a,b;
a = n >> 2;
b = (a >> 2);
a += b;
b = (b >> 2);
a += b;
b = (b >> 2);
a += b;
b = (b >> 2);
a += b;
printf("a=%d\n", a);
}
The result is 330. It could be made more accurate using b = ((b+2)>>2); to account for rounding.
If you are allowed to multiply, just pick a suitable approximation for (1/3), with a power-of-2 divisor. For example, n * (1/3) ~= n * 43 / 128 = (n * 43) >> 7.
This technique is most useful in Indiana.
I don't know if it's faster, but if you want to use a bitwise operator to perform binary division, you can use the shift-and-subtract method described on this page (a C rendering follows the steps below):
Set quotient to 0
Align leftmost digits in dividend and divisor
Repeat:
If that portion of the dividend above the divisor is greater than or equal to the divisor:
Then subtract divisor from that portion of the dividend and
Concatenate 1 to the right hand end of the quotient
Else concatenate 0 to the right hand end of the quotient
Shift the divisor one place right
Until dividend is less than the divisor:
quotient is correct, dividend is remainder
STOP
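A direct C rendering of this recipe (restoring division, equivalently bringing dividend bits down instead of shifting the divisor right; a sketch for unsigned 32-bit operands with a nonzero divisor):
#include <cstdint>
uint32_t div_shift_subtract(uint32_t dividend, uint32_t divisor, uint32_t *remainder)
{
    uint32_t q = 0, r = 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((dividend >> i) & 1);  // bring down the next bit
        q <<= 1;
        if (r >= divisor) {  // the subtract-and-concatenate-1 step
            r -= divisor;
            q |= 1;
        }
    }
    *remainder = r;
    return q;
}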
For really large integer division (e.g. numbers bigger than 64 bits) you can represent your number as an int[] of decimal digits and perform the division quite fast by taking the carried remainder together with the next digit and dividing that by 3; the remainder carries into the next step, and so forth.
eg. for 11004 / 3 you say
11/3 = 3, remainder = 2 (from 11-3*3)
20/3 = 6, remainder = 2 (from 20-6*3)
20/3 = 6, remainder = 2 (from 20-6*3)
24/3 = 8, remainder = 0
hence the result 3668
internal static List<int> Div3(int[] a)
{
int remainder = 0;
var res = new List<int>();
for (int i = 0; i < a.Length; i++)
{
var val = remainder + a[i];
var div = val/3;
remainder = 10*(val%3);
if (div > 9)
{
res.Add(div/10);
res.Add(div%10);
}
else
res.Add(div);
}
if (res[0] == 0) res.RemoveAt(0);
return res;
}
If you really want to, see this article on integer division, but it only has academic merit... it would be interesting to see an application that actually needed, and benefited from, that kind of trick.
Easy computation, at most n iterations where n is your number of bits. Note that the alternating series x/2 - x/4 + x/8 - ... converges to x/3, but each shift rounds the term down, so the result is only approximate and can be off by one; a signed type is also required, since with an unsigned type the negate-and-shift sequence may never reach zero:
int divideby3(int x)
{
    int answer = 0;
    do
    {
        x >>= 1;      // next series term (arithmetic shift)
        answer += x;
        x = -x;       // alternate the sign
    } while (x);
    return answer;
}
A lookup table approach would also be faster in some architectures.
uint8_t DivBy3LU(uint8_t u8Operand)
{
    static const uint8_t ai8Div3[256] = {0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, /* ...continue the pattern up to 85 */};
    return ai8Div3[u8Operand];
}