Which has better performance: multiplication or division?

Which version is faster:
x * 0.5
or
x / 2 ?
Some time ago I took a university course called Computer Systems. From back then I remember that multiplying two values can be achieved with comparatively "simple" logic gates, whereas division is not a "native" operation and requires a sum register that is repeatedly increased by the divisor in a loop and compared to the dividend.
Now I have to optimise an algorithm with a lot of divisions. Unfortunately it's not just dividing by two, so binary shifting is not an option. Will it make a difference to change all divisions to multiplications?
Update:
I have changed my code and didn't notice any difference. You're probably right about compiler optimisations. Since all the answers were great, I've upvoted them all. I chose rahul's answer because of the great link.

Usually division is a lot more expensive than multiplication, but a smart compiler will often convert division by a compile-time constant to a multiplication anyway. If your compiler is not smart enough though, or if there are floating point accuracy issues, then you can always do the optimisation explicitly, e.g. change:
float x = y / 2.5f;
to:
const float k = 1.0f / 2.5f;
...
float x = y * k;
Note that this is most likely a case of premature optimisation - you should only do this kind of thing if you have profiled your code and positively identified division as a performance bottleneck.
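For the curious, here is a rough sketch of the kind of strength reduction compilers perform for integer division by a compile-time constant: unsigned 32-bit division by 3 can be replaced by a 64-bit multiply with a "magic" reciprocal constant followed by a shift. The function name and the exact form below are illustrative only, not what any particular compiler emits:
#include <stdint.h>
#include <stdio.h>

/* unsigned x / 3 via multiply-and-shift; 0xAAAAAAAB is ceil(2^33 / 3) */
static uint32_t div3(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}

int main(void) {
    printf("%u %u\n", div3(100u), 100u / 3u); /* both print 33 */
    return 0;
}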

Division by a compile-time constant that's a power of 2 is quite fast (comparable to multiplication by a compile-time constant) for both integers and floats (it's basically convertible into a bit shift).
For floats, even division by a dynamic power of two is much faster than regular (dynamic or static) division, as it basically turns into a subtraction on the exponent.
In all other cases, division appears to be several times slower than multiplication.
For a dynamic divisor the slowdown factor on my Intel(R) Core(TM) i5 CPU M 430 @ 2.27GHz appears to be about 8; for static ones it is about 2.
The results are from a little benchmark of mine, which I made because I was somewhat curious about this (notice the aberrations at powers of two). In the labels:
ulong -- 64-bit unsigned
1 means a dynamic argument
0 means a statically known argument
The results were generated by filling in the following C template from a bash script:
#include <stdio.h>
#include <stdlib.h>
typedef unsigned long ulong;

int main(int argc, char** argv){
    $TYPE arg = atoi(argv[1]);
    $TYPE i = 0, res = 0;
    for (i = 0; i < $IT; i++)
        res += i $OP $ARG;
    printf($FMT, res);
    return 0;
}
with the $-variables assigned and the resulting program compiled with -O3 and run (dynamic values came from the command line, as is obvious from the C code).
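For example, one concrete instantiation of the template, with $TYPE=ulong, $OP=/ and $ARG=arg (the dynamic-divisor case), would look roughly like the following; the iteration count substituted for $IT here is only an illustrative value:
#include <stdio.h>
#include <stdlib.h>
typedef unsigned long ulong;

int main(int argc, char** argv){
    ulong arg = atoi(argv[1]);           /* divisor comes from the command line */
    ulong i = 0, res = 0;
    for (i = 0; i < 100000000; i++)      /* $IT: illustrative iteration count */
        res += i / arg;                  /* $OP $ARG: dynamic division */
    printf("%lu\n", res);
    return 0;
}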

Well, for a single calculation you will hardly notice any difference, but if you are talking about millions of operations then division is definitely costlier than multiplication. That said, you can always use whichever is the clearest and most readable.
Please refer to this link: Should I use multiplication or division?

That will likely depend on your specific CPU and the types of your arguments. For instance, in your example you're doing a floating-point multiplication but an integer division. (Probably, at least, in most languages I know of that use C syntax.)
If you are doing work in assembler, you can look up the specific instructions you are using and see how long they take.
If you are not doing work in assembler, you probably don't need to care. All modern compilers with optimization will convert operations like this to the most appropriate instructions anyway.
Your big wins on optimization will not be from twiddling the arithmetic like this. Instead, focus on how well you are using your cache. Consider whether there are algorithm changes that might speed things up.

One note to make, if you are looking for numerical stability:
Don't reuse a division's reciprocal for results that require multiple components/coordinates, e.g. when implementing an n-D vector normalize() function; that is, the following will NOT give you a unit-length vector:
V3d v3d(x,y,z);
float l = v3d.length();
float oneOverL = 1.f / l;
v3d.x *= oneOverL;
v3d.y *= oneOverL;
v3d.z *= oneOverL;
assert(1. == v3d.length()); // fails!
.. but this code will..
V3d v3d(x,y,z);
float l = v3d.length();
v3d.x /= l;
v3d.y /= l;
v3d.z /= l;
assert(1. == v3d.length()); // ok!
I guess the problem in the first code excerpt is the additional floating-point rounding: the precomputed reciprocal is itself a rounded value, and that slightly-off scale is then forced onto every component, introducing additional error.
I didn't look into this for very long, so please share your explanation of why this happens. I tested it with x, y and z set to .1f (and with doubles instead of floats).
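For reference, here is a minimal standalone sketch (without the V3d class; variable names are mine) that should reproduce the comparison above by normalizing the same vector both ways and printing the resulting lengths:
#include <cmath>
#include <cstdio>

int main() {
    float x = .1f, y = .1f, z = .1f;
    float l = std::sqrt(x*x + y*y + z*z);

    // normalize via a precomputed (rounded) reciprocal
    float k = 1.f / l;
    float mx = x*k, my = y*k, mz = z*k;

    // normalize via per-component division
    float dx = x/l, dy = y/l, dz = z/l;

    std::printf("multiply: %.9g\n", std::sqrt(mx*mx + my*my + mz*mz));
    std::printf("divide:   %.9g\n", std::sqrt(dx*dx + dy*dy + dz*dz));
}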

Related

How to efficiently vectorize polynomial computation with condition (roofline model)

I want to apply a polynomial of small degree (2-5) to a vector whose length can be between 50 and 3000, and do this as efficiently as possible.
Example: we can take the function (1+x^2)^3 when x>3, and 0 when x<=3.
Such a function would be executed 100k times for vectors of double elements. The size of each vector can be anything between 50 and 3000.
One idea would be to use Eigen:
Eigen::ArrayXd v;
then simply apply a functor:
v.unaryExpr([&](double x) {return x>3 ? std::pow((1+x*x), 3.00) : 0.00;});
Trying both GCC 9 and GCC 10, I saw that this loop is not being vectorized. I vectorized it manually, only to see that the gain was much smaller than I expected (1.5x). I also replaced the conditional with logical AND instructions, basically executing both branches and zeroing out the result when x<=3. I presume that the gain came mostly from the lack of branch misprediction.
Some considerations
There are multiple factors at play. First of all, there are RAW dependencies in my code (using intrinsics), and I am not sure how this affects the computation. I wrote my code with AVX2, so I was expecting a 4x gain. I presume that this plays a role, but I cannot be sure, as the CPU has out-of-order execution. Another problem is that I am unsure whether the performance of the loop I am trying to write is bound by memory bandwidth.
Question
How can I determine whether memory bandwidth or pipeline hazards are limiting this loop? Where can I learn techniques to vectorize it better? Are there good tools for this in Eigen, MSVC or Linux? I am using an AMD CPU as opposed to Intel.
You can fix the GCC missed optimization with -fno-trapping-math, which should really be the default because -ftrapping-math doesn't even fully work. It auto-vectorizes just fine with that option: https://godbolt.org/z/zfKjjq.
#include <stdlib.h>

void foo(double *arr, size_t n) {
    for (size_t i = 0; i < n; i++){
        double &tmp = arr[i];
        double sqrp1 = 1.0 + tmp*tmp;
        tmp = tmp > 3 ? sqrp1*sqrp1*sqrp1 : 0;
    }
}
It's avoiding the multiplies on one side of the ternary because they could raise FP exceptions that the C++ abstract machine wouldn't.
You'd hope that writing it with the cubing outside a ternary should let GCC auto-vectorize, because none of the FP math operations are conditional in the source. But it doesn't actually help: https://godbolt.org/z/c7Ms9G GCC's default -ftrapping-math still decides to branch on the input to avoid all the FP computation, potentially not raising an overflow (to infinity) exception that the C++ abstract machine would have raised. Or invalid if the input was NaN. This is the kind of thing I meant about -ftrapping-math not working. (related: How to force GCC to assume that a floating-point expression is non-negative?)
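For reference, the variant described above (cube computed unconditionally, with only the select depending on the condition) would look something like this; the function name is mine, and as noted, GCC's default -ftrapping-math still refuses to vectorize it:
#include <stdlib.h>

void foo_unconditional(double *arr, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double tmp = arr[i];
        double sqrp1 = 1.0 + tmp*tmp;
        double cube = sqrp1 * sqrp1 * sqrp1;   // FP math is unconditional in the source
        arr[i] = tmp > 3 ? cube : 0;           // only the select is conditional
    }
}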
Clang also has no problem: https://godbolt.org/z/KvM9fh
I'd suggest using clang -O3 -march=native -ffp-contract=fast to get FMAs across statements when FMA is available.
(In this case, -ffp-contract=on is sufficient to contract 1.0 + tmp*tmp within that one expression, but not across statements if you need to avoid that for Kahan summation for example. The clang default is apparently -ffp-contract=off, giving separate mulpd and addpd)
Of course you'll want to avoid std::pow with a small integer exponent. Compilers might not optimize that into just 2 multiplies and instead call a full pow function.

Set bits before index without shift or LUT

Let's say I need to set all bits before a specific bit index. Here are examples with 4 bits:
index(0) = (0x0, 0000)
index(1) = (0x1, 1000)
index(2) = (0x3, 1100)
index(3) = (0x7, 1110)
How can I do this without using shifts or a LUT, but instead using minimal bitwise operations or arithmetic or something similarly efficient?
The constraints are very strange, because you want an efficient solution yet rule out the only two means of doing it properly.
So you basically want to compute x = (2^bit) - 1, which is pretty easy with a bit shift:
x=(1<<bit)-1; // O(1)
With a LUT it is also easy. So how can we attack this without those two?
x=pow(2,bit)-1; //O(?) can be O(1),O(log(n)),O(n)
Well, this is far from efficient; pow also uses bit shifts internally, and some implementations use a LUT as well. The only solutions left are:
approximation
You can use a polynomial, PCA, or any other method, but you need to consider the target range. This is neither very optimal nor robust. It can be O(1), O(log(n)) or O(n), but usually with a very large constant factor.
emulate bit-shift left
You can do that with a loop and addition (see the self-contained sketch after the notes):
int x; for (x=1;bit;bit--) x+=x; x--;
But this runs in O(n). It is still faster than pow, though, unless you have pow2 implemented in HW.
[Notes]
In the complexity formulas n = bit, and all the code is in C++, except the first formula where ^ means power.
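As a self-contained illustration of the add-based shift emulation above (the function name is mine), the loop could be wrapped like this:
// builds 2^bit by repeated doubling, then subtracts 1; O(bit) additions
unsigned maskBefore(unsigned bit) {
    unsigned x = 1;
    for (; bit; bit--) x += x;   // x == 2^bit after the loop
    return x - 1;                // all bits below 'bit' set, e.g. maskBefore(3) == 0x7
}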

Arithmetic Operations using only 32 bit integers

How would you compute the multiplication of two 1024 bit numbers on a microprocessor that is only capable of multiplying 32 bit numbers?
The starting point is to realize that you already know how to do this: in elementary school you were taught how to do arithmetic on single digit numbers, and then given data structures to represent larger numbers (e.g. decimals) and algorithms to compute arithmetic operations (e.g. long division).
If you have a way to multiply two 32-bit numbers to give a 64-bit result (note that unsigned long long is guaranteed to be at least 64 bits), then you can use those same algorithms to do arithmetic in base 2^32.
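For instance, a rough sketch of schoolbook long multiplication in base 2^32, using 32x32 -> 64-bit partial products (the function name and limb layout, least significant limb first, are illustrative, not from any particular bignum library):
#include <stdint.h>
#include <stddef.h>

/* result must have room for na + nb limbs and must not alias a or b */
void mul_limbs(const uint32_t *a, size_t na,
               const uint32_t *b, size_t nb,
               uint32_t *result) {
    for (size_t i = 0; i < na + nb; i++) result[i] = 0;
    for (size_t i = 0; i < na; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < nb; j++) {
            /* 32x32 -> 64-bit product plus previous partial sum plus carry;
               this cannot overflow 64 bits */
            uint64_t t = (uint64_t)a[i] * b[j] + result[i + j] + carry;
            result[i + j] = (uint32_t)t;   /* low 32 bits stay in place */
            carry = t >> 32;               /* high 32 bits carry onward */
        }
        result[i + nb] += (uint32_t)carry; /* propagate the final carry */
    }
}
A 1024-bit number is 32 such limbs, so two of them multiply into a 64-limb (2048-bit) result.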
You'll also need, e.g., an add with carry operation. You can determine the carry when adding two unsigned numbers of the same type by detecting overflow, e.g. as follows:
uint32_t x, y; // set to some value
uint32_t sum = x + y;
uint32_t carry = (sum < x);
(technically, this sort of operation requires that you do unsigned arithmetic: overflow in signed arithmetic is undefined behavior, and optimizers will do surprising things to your code when you least expect it)
(modern processors usually give a way to multiply two 64-bit numbers to give a 128-bit result, but to access it you will have to use compiler extensions like 128-bit types, or you'll have to write inline assembly code. modern processors also have specialized add-with-carry instructions)
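Putting the carry detection above to use, a multi-limb addition could be sketched as follows (again, the function name and little-endian limb layout are mine):
#include <stdint.h>
#include <stddef.h>

/* adds two n-limb numbers, least significant limb first; returns the carry out */
uint32_t add_limbs(const uint32_t *a, const uint32_t *b, uint32_t *sum, size_t n) {
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t s = a[i] + b[i];
        uint32_t c1 = (s < a[i]);     /* carry out of a[i] + b[i] */
        sum[i] = s + carry;
        uint32_t c2 = (sum[i] < s);   /* carry out of adding the incoming carry */
        carry = c1 | c2;              /* at most one of the two can be set */
    }
    return carry;
}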
Now, to do arithmetic efficiently is an immense project; I found it quite instructive to browse through the documentation and source code to gmp, the GNU multiple precision arithmetic library.
Look at any implementation of bigint operations.
Here are a few of my approaches in C++ for fast bignum squaring; some are solely for sqr, but others are usable for multiplication too.
Use 32-bit arithmetic as a building block for 64/128/256/... bit arithmetic; see my 32-bit ALU in x86 C++.
Use long multiplication with digit base 2^32; you can also use Karatsuba this way.

How do I translate this range coding C++ snippet to performant Haskell?

I know enough Haskell to translate the code below, but I don't know much about making it perform well:
typedef unsigned long precision;
typedef unsigned char uc;
const int kSpaceForByte = sizeof(precision) * 8 - 8;
const int kHalfPrec = sizeof(precision) * 8 / 2;
const precision kTop = ((precision)1) << kSpaceForByte;
const precision kBot = ((precision)1) << kHalfPrec;
//This must be called before encoding starts
void RangeCoder::StartEncode(){
    _low = 0;
    _range = (precision) -1;
}
/*
RangeCoder does not concern itself with models of the data.
To encode each symbol, you pass the parameters *cumFreq*, which gives
the cumulative frequency of the possible symbols ordered before this symbol,
*freq*, which gives the frequency of this symbol, and *totFreq*, which gives
the total frequency of all symbols.
This means that you can have different frequency distributions / models for
each encoded symbol, as long as you can restore the same distribution at
this point when decoding.
*/
void RangeCoder::Encode(precision cumFreq, precision freq, precision totFreq){
    assert(cumFreq + freq <= totFreq && freq && totFreq <= kBot);
    _low += cumFreq * (_range /= totFreq);
    _range *= freq;
    while ((_low ^ _low + _range) < kTop or
           _range < kBot and ((_range= -_low & kBot - 1), 1)){
        //the "a or b and (r=..,1)" idiom is a way to assign r only if a is false.
        OutByte(_low >> kSpaceForByte); //output one byte.
        _range <<= sizeof(uc) * 8;
        _low <<= sizeof(uc) * 8;
    }
}
I know, I know: "Write several versions and use criterion to see what works". I don't know enough to know what my options are, though, or how to avoid silly mistakes.
Here are my thoughts so far. One way would be to use the State monad and/or lenses. Another would be to translate the loop and state to explicit recursion. I read somewhere that explicit recursion tends to perform badly on GHC, though. I think using ByteString Builder would be a good way to output each byte. Assuming I run on a 64-bit platform, should I use unboxed Word64 arguments? The compression quality will not decrease significantly if I decrease the precision to 32 bits. Will GHC optimize better for this?
Since this is not a 1-1 mapping, pipes with StateP would lead to very neat code, where I would request arguments one at a time and then let the while-loop respond byte for byte. Unfortunately, when I benchmarked it, the pipe overhead turned out (unsurprisingly) to be quite large. Since each symbol can lead to many byte outputs, it feels a bit like a concatMap with State. Perhaps this would be the idiomatic solution? Concatenating lists of bytes does not sound very fast to me, though. ByteString has a concatMap. Perhaps this is the correct way? EDIT: no it is not, it takes a ByteString as input.
I intend to release the package on Hackage when I'm done, so any advice (or actual code!) you can give will benefit the community :). I plan to use this compression as a base for writing a very memory efficient compressed map.
I read somewhere that explicit recursion tends to perform badly on GHC though.
No. GHC produces slow machine code only for recursion that can't be reduced (or that GHC "doesn't want" to reduce). If the recursion can be unrolled (I don't see any fundamental problem with that in your snippet), it is translated to almost the same machine code as a while-loop in C or C++.
Assuming I run on a 64 bit platform, should I use unboxed Word64 arguments? The compression quality will not decrease significantly if I decrease the precision to 32 bits. Will GHC optimize better for this?
Do you mean Word#? Let GHC deal with it; use boxed types. I've never met a situation where a gain could be achieved only by using unboxed types. Using 32-bit types wouldn't help on a 64-bit platform.
One general rule for optimizing performance with GHC is to avoid data structures where possible. If you can pass pieces of data through function arguments or closures, take the chance.

cuda intrinsic functions sqrtf and powf performance issues

When I convert from powf to __powf it gives me a performance improvement, but if I convert sqrtf to one of __fsqrt_[rn,rz,ru,rd] it slows down. I think they should run at least as fast as sqrtf. What could be the problem?
Regards
If you need to square an integer (or a float, for that matter) then you can just multiply the value by itself, i.e. instead of:
y = powf(x, 2);
use:
y = x * x;
This avoids using an expensive transcendental function (along with its associated function call overhead) and just generates a single multiply instruction in most cases.
The square root probably can't be avoided but you can use fsqrtf rather than sqrtf if you only need single precision - this is typically much faster.
