Time taken in executing % / * + - operations

Recently, I heard that the % operator is costly in terms of time.
So the question is: is there a way to find the remainder faster?
I would also appreciate it if anyone can explain the difference in execution time between the % / * + - operations.

In some cases where you're using power-of-2 divisors you can do better with roll-your-own techniques for calculating remainder, but generally a halfway decent compiler will do the best job possible with variable divisors, or "odd" divisors that don't fit any pattern.
Note that a few CPUs don't even have a multiply operation, and so (on those) multiply is quite slow vs add (at least 64x for a 32-bit multiply). (But a smart compiler may improve on this if the multiplier is a literal.) A slightly larger number do not have a divide operation or have a pretty slow one. (On a CPU with a fast multiplier multiply may only be on the order of 4 times slower than add, but on "normal" hardware it's 16-32 times slower for a 32 bit operation. Divide is inherently 2-4x slower than multiply, but can be much slower on some hardware.)
The remainder operation is rarely implemented in hardware, and normally A % B maps to something along the lines of A - ((A / B) * B) (a few extra operations may be required to assure the proper sign, et al).
(I learned about this stuff while microprogramming the instruction set for the SUMC computer for RCA/NASA back in the early 70s.)
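To make the two remainder strategies above concrete, here is a minimal C sketch. The function names are hypothetical, and the mask trick assumes unsigned operands and a divisor that is an exact power of two; a compiler will normally do this for you when the divisor is a literal.

```c
#include <assert.h>
#include <stdint.h>

/* Remainder for a power-of-two divisor: keep only the low bits.
   Valid only for unsigned x and d = 1, 2, 4, 8, ... */
uint32_t rem_pow2(uint32_t x, uint32_t d)
{
    assert(d != 0 && (d & (d - 1)) == 0);  /* d must be a power of two */
    return x & (d - 1);
}

/* Remainder expressed as A - ((A / B) * B), the general mapping
   used when there is no hardware remainder operation. */
uint32_t rem_via_div(uint32_t a, uint32_t b)
{
    assert(b != 0);
    return a - (a / b) * b;
}
```

For example, rem_pow2(29, 8) and rem_via_div(29, 6) both give 5. Signed operands need extra care, as the answer notes.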

No, the compiler is going to implement % in the most efficient way possible.
In terms of speed, + and - are the fastest (and are equally fast, generally done by the same hardware).
*, /, and % are much slower. Multiplication is basically done by the method you learn in grade school: multiply the first number by every digit in the second number and add the results, with some hacks made possible by binary. As of a few years ago, multiply was 3x slower than add. Division should be similar to multiply. Remainder is similar to division (in fact it generally calculates both at once).
Exact differences depend on the CPU type and exact model. You'd need to look up the latencies in the CPU spec sheets for your particular machine.

Related

Is it still worth using the Quake fast inverse square root algorithm nowadays on x86-64?

Specifically, this is the code I'm talking about:
float InvSqrt(float x) {
    float xhalf = 0.5f*x;
    int i = *(int*)&x; // warning: strict-aliasing UB, use memcpy instead
    i = 0x5f375a86 - (i >> 1);
    x = *(float*)&i; // same
    x = x*(1.5f-xhalf*x*x);
    return x;
}
I forgot where I got this from but it's apparently better and more efficient or precise than the original Quake III algorithm (slightly different magic constant), but it's been more than 2 decades since this algorithm was created, and I just want to know if it's still worth using it in terms of performance, or if there's an instruction that implements it already in modern x86-64 CPUs.
Origins:
See John Carmack's Unusual Fast Inverse Square Root (Quake III)
Modern usefulness: none, obsoleted by SSE1 rsqrtss
Use _mm_rsqrt_ps or _mm_rsqrt_ss to get a very approximate reciprocal-sqrt for 4 floats in parallel, much faster than even a good compiler could do with this code. (The best case would use SSE2 integer shift/add instructions to keep the FP bit pattern in an XMM register, which is probably not how it would actually compile with the type-pun to integer. That type-pun is strict-aliasing UB in C or C++; use memcpy or C++20 std::bit_cast.)
https://www.felixcloutier.com/x86/rsqrtss documents the scalar version of the asm instruction, including the |Relative Error| ≤ 1.5 × 2^-12 guarantee (i.e. about half the mantissa bits are correct). One Newton-Raphson iteration can refine it to within 1ulp of being correct, although still not the 0.5ulp you'd get from actual sqrt. (See Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision.)
rsqrtps performs only slightly slower than a mulps / mulss instruction on most CPUs, like 5 cycle latency, 1/clock throughput. (With a Newton iteration to refine it, more uops.) Latency varies by microarchitecture, as low as 3 cycles in Zen 3, but Intel runs it with about 5c latency since Conroe at least (https://uops.info/).
The integer shift / subtract from the magic number in the Quake InvSqrt similarly provides an even rougher initial guess, and the rest (after type-punning the bit-pattern back to a float) is a Newton-Raphson iteration.
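For reference, the same routine can be written without the strict-aliasing problem by copying the bits with memcpy. This is only a sketch: the name inv_sqrt_approx is hypothetical, it assumes a 32-bit IEEE-754 float and a positive, normal input, and it uses an unsigned integer where the question's code used int.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) for positive, normal x.
   Assumes float is 32-bit IEEE-754 binary32. */
float inv_sqrt_approx(float x)
{
    float xhalf = 0.5f * x;
    uint32_t i;
    memcpy(&i, &x, sizeof i);          /* read the float's bit pattern */
    i = 0x5f375a86u - (i >> 1);        /* rough initial guess from the magic constant */
    memcpy(&x, &i, sizeof x);          /* reinterpret the bits as a float again */
    x = x * (1.5f - xhalf * x * x);    /* one Newton-Raphson iteration */
    return x;
}
```

Compilers typically turn these fixed-size memcpy calls into plain register moves, so the well-defined version should not cost anything extra.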
Compilers will even use rsqrtss for you when compiling sqrt with -ffast-math, depending on context and tuning options. (e.g. modern clang compiling 1.0f/sqrtf(x) with -O3 -ffast-math -march=skylake https://godbolt.org/z/fT86bKesb uses vrsqrtss and 3x vmulss plus an FMA.) Non-reciprocal sqrt is usually not worth it, but rsqrt + refinement avoids a division as well as a sqrt.
Full-precision square root and division themselves are not as slow as they used to be, at least if you use them infrequently compared to mul/add/sub. (e.g. if you can hide the latency, one sqrt every 12 or so other operations might cost about the same, still a single uop instead of multiple for rsqrt + Newton iteration.) See Floating point division vs floating point multiplication
But sqrt and div do compete with each other for throughput so needing to divide by a square root is a nasty case.
So if you have a bad loop over an array that mostly just does sqrt, not mixed with other math operations, that's a use-case for _mm_rsqrt_ps (and a Newton iteration) as a higher throughput approximation than _mm_sqrt_ps.
But if you can combine that pass with something else to increase computational intensity and get more work done overlapped with keeping the div/sqrt unit busy, often it's better to use a real sqrt instruction on its own, since that's still just 1 uop for the front-end to issue, and for the back-end to track and execute, vs. a Newton iteration taking something like 5 uops if FMA is available for reciprocal square root, else more (also if non-reciprocal sqrt is needed).
With Skylake for example having 1 per 3 cycle sqrtps xmm throughput (128-bit vectors), it costs the same as a mul/add/sub/fma operation if you don't do more than one per 6 math operations. (Throughput is worse for 256-bit YMM vectors, 6 cycles.) A Newton iteration would cost more uops, so if uops for port 0/1 are the bottleneck, it's a win to just use sqrt directly. (This is assuming that out-of-order exec can hide the latency, typically when each loop iteration is independent.) This kind of situation is common if you're using a polynomial approximation as part of something like log or exp in a loop.
See also Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision re: performance on modern OoO exec CPUs.
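As a rough illustration of the _mm_rsqrt_ps plus Newton iteration approach discussed above, here is a sketch using SSE1 intrinsics from xmmintrin.h. The function name rsqrt_array is hypothetical, the code is x86-only, and it assumes the element count is a multiple of 4 and all inputs are positive. It uses separate multiply and subtract instructions rather than FMA.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE1 intrinsics; x86 only */

/* out[i] ~= 1/sqrt(in[i]) for n floats.
   Sketch assumption: n is a multiple of 4 and every input is positive. */
void rsqrt_array(const float *in, float *out, size_t n)
{
    assert(n % 4 == 0);
    const __m128 half = _mm_set1_ps(0.5f);
    const __m128 three_halves = _mm_set1_ps(1.5f);
    for (size_t i = 0; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(in + i);
        __m128 y = _mm_rsqrt_ps(x);                  /* rough hardware estimate */
        /* One Newton-Raphson step: y = y * (1.5 - 0.5 * x * y * y) */
        __m128 xyy = _mm_mul_ps(_mm_mul_ps(x, y), y);
        y = _mm_mul_ps(y, _mm_sub_ps(three_halves, _mm_mul_ps(half, xyy)));
        _mm_storeu_ps(out + i, y);
    }
}
```

Whether this beats a plain _mm_sqrt_ps followed by a division depends on the surrounding code, as described above, so it is worth measuring on the target CPU.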

Fastest way to implement floating-point multiplication by a small integer constant

Suppose you are trying to multiply a floating-point number k by a small integer constant n (by small I mean -20 <= n <= 20). The naive way of doing this is converting n to a floating point number (which for the purposes of this question does not count towards the runtime) and executing a floating-point multiply. However, for n = 2, it seems likely that k + k is a faster way of computing it. At what n does the multiply instruction become faster than repeated additions (plus an inversion at the end if n < 0)?
Note that I am not particularly concerned about accuracy here; I am willing to allow unsound optimizations as long as they get roughly the right answer (i.e.: up to 1024 ULP error is probably fine).
I am writing OpenCL code, so I'm interested in the answer to this question in many computational contexts (x86-64, x86-64 + AVX256, GPUs).
I could benchmark this, but since I don't have a particular architecture in mind, I'd prefer a theoretical justification of the choice.
According to AMD's OpenCL optimisation guide for GPUs, section 3.8.1 "Instruction Bandwidths", for single-precision floating point operands, addition, multiplication and 'MAD' (multiply-add) all have a throughput of 5 per cycle on GCN based GPUs. The same is true for 24-bit integers. Only once you move to 32-bit integers are multiplications much more expensive (1/cycle). Int-to-float conversions and vice versa are also comparatively slow (1/cycle), and unless you have a double-precision float capable model (mostly FirePro/Radeon Pro series or Quadro/Tesla from Nvidia) operations on doubles are super slow (<1/cycle). Negation is typically "free" on GPUs - for example GCN has sign flags on instruction operands, so -(a + b) compiles to one instruction after transforming to (-a) + (-b).
Nvidia GPUs tend to be a bit slower at integer operations. For floats it's a similar story to AMD's though: multiplications are just as fast as addition, and if you can combine them into MAD operations, you can double throughput. Intel's GPUs are quite different in other regards, but again they're very fast at FP multiplication and addition.
Basically, it's really hard to beat a GPU at floating-point multiplication, as that's essentially the one thing they're optimised for.
On the CPU it's typically more complicated - Agner Fog's optimisation resources and instruction tables are the place to go for the details. Note though that on many CPUs you'll pay a penalty for interpreting float data as integer and back because ALU and FPU are typically separate. (For example if you wanted to optimise multiplying floats by a power of 2 by performing an integer addition on their exponents. On x86, you can easily do this by operating on SSE or AVX registers using first float instructions, then integer ones, but it's generally not good for performance.)
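For illustration, the exponent trick mentioned in the parenthesis above (multiplying a float by a power of 2 with an integer addition) might look like the sketch below. The name scale_pow2 is hypothetical, and the code assumes IEEE-754 binary32 floats, a normal input, and a result that stays in the normal range. As the answer says, this is generally not a performance win on x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Multiply x by 2^k by adding k to the exponent field.
   Sketch only: x must be normal (not zero, denormal, inf or NaN)
   and the result must also be a normal number. */
float scale_pow2(float x, int k)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    int exponent = (int)((bits >> 23) & 0xFFu);
    assert(exponent != 0 && exponent != 0xFF);        /* input is normal */
    assert(exponent + k > 0 && exponent + k < 0xFF);  /* result stays normal */
    bits += (uint32_t)k << 23;   /* wraps correctly for negative k */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

With these assumptions, scale_pow2(3.0f, 2) yields 12.0f and scale_pow2(-5.0f, -1) yields -2.5f.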

Is it possible to count the number of Set bits in Number in O(1)? [duplicate]

I was asked the above question in an interview, and the interviewer was very certain of the answer, but I am not sure. Can anyone help me here?
Sure. The obvious brute force method is just a big lookup table with one entry for every possible value of the input number. That's not very practical if the number is very big, but is still enough to prove it's possible.
Edit: the notion has been raised that this is complete nonsense, and the same could be said of essentially any algorithm.
To a limited degree, that's a fair statement -- but the limitations are so severe that for most algorithms it remains utterly meaningless.
My original point (at least as well as I remember it) was that population counting is about equivalent to many other operations like addition and subtraction that we normally assume are O(1).
At the hardware level, circuitry for a single-cycle POPCNT instruction is probably easier than for a single-cycle ADD instruction. Just for one example, for any practical size of data word, we can use table lookups on 4-bit chunks in parallel, then add the results from those pieces together. Even using fairly unlikely worst-case assumptions (e.g., separate storage for each of those tables) this would still be easy to implement in a modern CPU -- in fact, it's probably at least somewhat simpler than the single-cycle addition or subtraction mentioned above (see note 1 below).
This is a decided contrast to many other algorithms. For one obvious example, let's consider sorting. For even the most trivial sort most people can imagine -- 2 items, 8 bits apiece, we're already at a 64 kilobyte lookup table to get constant complexity. Long before we can do even a fairly trivial sort (e.g., 100 items) we need a lookup table that contains far more data items than there are atoms in the universe.
Looking at it from the opposite direction, it's certainly true that at some point, essentially nothing is O(1) any more. Let's consider the most trivial operations possible. For an N-bit CPU, bitwise OR is normally implemented as a set of N OR gates in parallel. Unlike addition, there's no interaction between one bit and another, so for any practical size of CPU, this is easy to execute in a single instruction.
Nonetheless, if I specify a bit-wise OR in which each operand is 100 petabits, there's nothing even approaching a practical way to do the job with constant complexity. Using the usual method of parallel OR gates, we end up with (among other things) 300 petabits worth of input and output lines -- a number that completely dwarfs even the number of pins on the largest CPUs.
On reasonable hardware, doing a bitwise OR on 100 petabit operands is going to take a while (not to mention quite a bit of hard drive space). If we increase that to 200 petabit operands, the time is likely to (about) double -- so from that viewpoint, it's an O(N) operation. Obviously enough, the same is going to be true with the other "trivial" operations like addition, subtraction, bit-wise AND, bit-wise XOR, and so on.
Nonetheless, unless you have very specific instructions to say you're going to be dealing with utterly immense operands, you're typically going to treat every one of these as a constant complexity operation. Looked at in these terms, a POPCNT instruction falls about halfway between bit-wise AND/OR/XOR on one hand, and addition/subtraction on the other, in terms of the difficulty to execute in fixed time.
1. You might wonder how it could possibly be simpler than an add when it actually includes an add after doing some other operations. If so, kudos -- it's an excellent question.
The answer is that it's because it only needs a much smaller adder. For example, a 64-bit CPU needs one half-adder and 63 full-adders. In the simple implementation, you carry out the addition bit-wise -- i.e., you add bit-0 of one operand to bit-0 of the other. That generates an output bit, and a carry bit. That carry bit becomes an input to the addition for the next pair of bits. There are some tricks to parallelize that to some degree, but the nature of the beast (so to speak) is bit-serial.
With a POPCNT instruction, we have an addition after doing the individual table lookups, but our result is limited to the size of the input words. Given the same size of inputs (64 bits) our final result can't be any larger than 64. That means we only need a 7-bit adder (enough to hold the values 0 through 64) instead of a 64-bit adder.
Since, as outlined above, addition is basically bit-serial, this means that the addition at the end of the POPCNT instruction is fundamentally a lot faster than a normal add. To be specific, it's logarithmic on the operand size, whereas simple addition is roughly linear on the operand size.
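As a software analogue of the 4-bit table lookup scheme described in this answer, here is a minimal C sketch. The names are hypothetical, and the loop handles the 16 chunks one after another, whereas the hardware described above would look them all up in parallel.

```c
#include <stdint.h>

/* Number of set bits in each 4-bit value 0..15. */
static const uint8_t NIBBLE_BITS[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Count set bits in a 64-bit word using 16 lookups of 4 bits each.
   The trip count is fixed, so the work does not depend on the value. */
unsigned popcount64_table(uint64_t x)
{
    unsigned count = 0;
    for (int i = 0; i < 16; i++) {
        count += NIBBLE_BITS[x & 0xF];  /* look up the low 4 bits */
        x >>= 4;                        /* move to the next chunk */
    }
    return count;
}
```

Because the word size is fixed at 64 bits, the loop always runs exactly 16 times, which is the sense in which this counts as constant time.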
If the bit size is fixed (e.g. natural word size of a 32- or 64-bit machine), you can just iterate over the bits and count them directly in O(1) time (though there are certainly faster ways to do it). For arbitrary precision numbers (BigInt, etc.), the answer must be no.
Some processors can do it in one instruction, obviously for integers of limited size. Look up the POPCNT mnemonic for further details.
For integers of unlimited size obviously you need to read the whole input, so the lower bound is O(n).
The interviewer probably meant the bit counting trick (the first Google result follows): http://www.gamedev.net/topic/547102-bit-counting-trick---new-to-me/
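One widely circulated bit counting trick for 32-bit words is the parallel (SWAR) reduction sketched below. This is offered as an illustration and is not necessarily the exact code on the linked page; the function name is hypothetical.

```c
#include <stdint.h>

/* Count set bits in a 32-bit word with a fixed number of operations. */
unsigned popcount32_swar(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                  /* 2-bit partial counts */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);  /* 4-bit partial counts */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                  /* 8-bit partial counts */
    return (v * 0x01010101u) >> 24;                    /* sum the four bytes */
}
```

In practice, on GCC and Clang the __builtin_popcount family is usually the simpler choice, since the compiler can emit a POPCNT instruction when the target supports it.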

What is the fastest way to do integer division?

Using scheme I have the need to use the following function. (All args are natural numbers [0, inf) )
(define safe-div
  (lambda (num denom safe)
    (if (zero? denom)
        safe
        (div num denom))))
However, this function is called quite often and is not performing well enough (speed wise).
Is there a more efficient way of implementing the desired behavior (integer division of num and denom, return safe value if denom is zero)?
Note: I am using Chez Scheme; however, this is being used in a library that imports rnrs only, not full Chez.
For maximum performance, you need to get as close to the silicon as possible. Adding safety checks like this isn't going to do it, unless they get just-in-time compiled into ultra-efficient machine code by the scheme system.
I see two options. One is to create a native (i.e. foreign) implementation in C (or assembly) and invoke that. That might not be compatible with packaging it as a lambda, but then again, the dynamic nature of lambdas leads to notational efficiency but not necessarily runtime efficiency. (Function pointers excepted, there's a reason lambda expressions are not present in C, despite being many years older.) If you go this route, it might be best to take a step back and see if the larger processing of which safe-div is a part should be taken native. There's little point in speeding up the division at the center of a loop if everything around it is still slow.
Assuming that division by zero is expected to be rare, another approach is to just use div and hope its implementation is fast. Yes, this can lead to division by zero, but when it comes to speed, sometimes it is better to beg forgiveness than to ask permission. In other words, skip the checking before the division and just do it. If it fails, the scheme runtime should catch the division by zero fault and you can install an exception handler for it. This leads to slower code in the exceptional case and faster code in the normal case. Hopefully this tradeoff works out to a performance win.
Lastly, depending on what you are dividing by, it might be faster to multiply by the reciprocal than to perform an actual division. This requires fast reciprocal computation or revising earlier computations to yield a reciprocal directly. Since you are dealing with integers, the reciprocal would be stored in fixed-point, which is essentially 2^32 * 1/denom. Multiply this by num and shift right by 32 bits to get the quotient. This works out to a win because more processors these days have single cycle multiply instructions, but division executes in a loop on the chip, which is much slower. This might be overkill for your needs, but could be useful at some point.
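The fixed-point reciprocal idea can be sketched in C as follows (C rather than Scheme, to show the machine-level arithmetic). The function names are hypothetical. One detail differs from the plain 2^32 * 1/denom described above: this sketch rounds the reciprocal up by one, because a truncated reciprocal can give a quotient that is one too small (for example 3 / 3). To keep the result exact, it restricts both operands to 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Precompute a 32.32 fixed-point reciprocal of d, rounded up. */
uint64_t reciprocal_u16(uint16_t d)
{
    assert(d != 0);
    return (UINT64_C(1) << 32) / d + 1;
}

/* n / d computed with one multiply and one shift.
   Exact for every 16-bit n when recip came from reciprocal_u16(d). */
uint16_t div_by_reciprocal(uint16_t n, uint64_t recip)
{
    return (uint16_t)(((uint64_t)n * recip) >> 32);
}
```

This only pays off when the same denominator is reused for many divisions, since computing the reciprocal itself requires a real division.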

What's better multiplication by 2 or adding the number to itself ? BIGnums

I need some help deciding what is better performance wise.
I'm working with bigints (more than 5 million digits) and most of the computation (if not all) is in the part of doubling the current bigint. So I wanted to know: is it better to multiply every cell (part of the bigint) by 2, then mod it (and you know the rest), or is it better to just add the bigint to itself?
I'm thinking a bit about the ease of implementation too (addition of 2 bigints is more complicated than multiplication by 2), but I'm more concerned about the performance than the size of code or ease of implementation.
Other info:
I'll code it in C++ , I'm fairly familiar with bigints (just never came across this problem).
I don't need any source code or similar; I just need a well-reasoned opinion and an explanation/proof of it. I need to make a good decision from the start, since the project will be fairly large and mostly built around this part, so it depends heavily on what I choose now.
Thanks.
Try bitshifting each bit. That is probably the fastest method. When you bitshift an integer to the left, then you double it (multiply by 2). If you have several long integers in a chain, then you need to store the most significant bit, because after shifting it, it will be gone, and you need to use it as the least significant bit on the next long integer.
This doesn't actually matter a whole lot. Modern 64bit computers can add two integers in the same time it takes to bitshift them (1 clockcycle), so it will take just as long. I suggest you try different methods, and then report back if there is any major time differences. All three methods should be easy to implement, and generating a 5mb number should also be easy, using a random number generator.
To store a 5 million digit integer, you'll need quite a few bits -- 5 million if you were referring to binary digits, or ~17 million bits if those were decimal digits. Let's assume the numbers are stored in a binary representation, and your arithmetic happens in chunks of some size, e.g. 32 bits or 64 bits.
If adding the number to itself, each chunk is added to itself and to the carry from the addition of the previous chunk. Any carry forward is kept for the next chunk. That's a couple of addition operations, and some bookkeeping for tracking the carry.
If multiplying by two by left-shifting, that's one left-shift operation for the multiplication, and one right-shift operation plus an AND with 1 to obtain the carry. Carry bookkeeping is a little simpler.
Superficially, the shift version appears slightly faster. The overall cost of doubling the number, however, is highly influenced by the size of the number. A 17 million bits number exceeds the cpu's L1 cache, and processing time is likely overwhelmed by memory fetch operations. On modern PC hardware, memory fetch is orders of magnitude slower than addition and shifting.
With that, you might want to pick the one that's simpler for you to implement. I'm leaning towards the left-shift version.
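The left-shift version described above might look like this minimal C sketch. It assumes the bigint is stored as a little-endian array of 64-bit chunks; the function name and layout are hypothetical, not taken from any particular bignum library.

```c
#include <stddef.h>
#include <stdint.h>

/* Double a bigint stored as little-endian 64-bit limbs, in place.
   Returns the bit shifted out of the top limb (0 or 1), so the
   caller can grow the number if needed. */
unsigned bigint_double(uint64_t *limbs, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t next_carry = limbs[i] >> 63;  /* top bit moves to the next limb */
        limbs[i] = (limbs[i] << 1) | carry;    /* shift in the previous carry */
        carry = next_carry;
    }
    return (unsigned)carry;
}
```

Since the top bit of a 64-bit limb is the only bit left after shifting right by 63, no separate AND with 1 is needed here. For a number of this size the single pass over memory is likely to dominate the running time, as noted above.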
did you try shifting the bits?
<< multiplies by 2
>> divides by 2
Left bit shifting by one is the same as a multiplication by two !
This link explains the mechanism and gives examples.
int A = 10; //...01010 = 10
int B = A<<1; //..010100 = 20
If it really matters, you need to write all three methods (including bit-shift!), and profile them, on various input. (Use small numbers, large numbers, and random numbers, to avoid biasing the results.)
Sorry for the "Do it yourself" answer, but that's really the best way. No one cares about this result more than you, which just makes you the best person to figure it out.
Well implemented multiplication of BigNums is O(N log(N) log(log(N))). Addition is O(N). Therefore, adding to itself should be faster than multiplying by two. However that's only true if you're multiplying two arbitrary bignums; if your library knows you're multiplying a bignum by a small integer it may be able to optimize to O(N).
As others have noted, bit-shifting is also an option. It should be O(N) as well, but with a smaller constant factor. But that will only work if your bignum library supports bit shifting.
most of the computation (if not all) is in the part of doubling the current bigint
If all your computation is in doubling the number, why don't you just keep a distinct (base-2) scale field? Then just add one to scale, which can just be a plain-old int. This will surely be faster than any manipulation of some-odd million bits.
IOW, use a bigfloat.
random benchmark
use Math::GMP;
use Time::HiRes qw(clock_gettime CLOCK_REALTIME CLOCK_PROCESS_CPUTIME_ID);
my $n = Math::GMP->new(2);
$n = $n ** 1_000_000;
my $m = Math::GMP->new(2);
$m = $m ** 10_000;
my $str;
# Time the decimal stringification of $n as it grows from 1,000,000 to 2,000,000 bits.
for (my $bits = 1_000_000; $bits <= 2_000_000; $bits += 10_000) {
    my $start = clock_gettime(CLOCK_PROCESS_CPUTIME_ID);
    $str = "$n" for (1..3);
    my $stop = clock_gettime(CLOCK_PROCESS_CPUTIME_ID);
    # Print the bit count and the average CPU time of the three conversions.
    print "$bits,@{[($stop-$start)/3]}\n";
    $n = $n * $m;
}
Seems to show that somehow GMP is doing its conversion in O(n) time (where n is the number of bits in the binary number). This may be due to the special case of having a 1 followed by a million (or two) zeros; the GNU MP docs say it should be slower (but still better than O(N^2)).
http://img197.imageshack.us/img197/6527/chartp.png

Resources