Should I use multiplication or division? - performance

Here's a silly fun question:
Let's say we have to perform a simple operation where we need half of the value of a variable. There are typically two ways of doing this:
y = x / 2.0;
// or...
y = x * 0.5;
Assuming we're using the standard operators provided with the language, which one has better performance?
I'm guessing multiplication is typically better so I try to stick to that when I code, but I would like to confirm this.
Although personally I'm interested in the answer for Python 2.4-2.5, feel free to also post an answer for other languages! And if you'd like, feel free to post other fancier ways (like using bitwise shift operators) as well.

Python:
time python -c 'for i in xrange(int(1e8)): t=12341234234.234 / 2.0'
real 0m26.676s
user 0m25.154s
sys 0m0.076s
time python -c 'for i in xrange(int(1e8)): t=12341234234.234 * 0.5'
real 0m17.932s
user 0m16.481s
sys 0m0.048s
multiplication is 33% faster
Lua:
time lua -e 'for i=1,1e8 do t=12341234234.234 / 2.0 end'
real 0m7.956s
user 0m7.332s
sys 0m0.032s
time lua -e 'for i=1,1e8 do t=12341234234.234 * 0.5 end'
real 0m7.997s
user 0m7.516s
sys 0m0.036s
=> no real difference
LuaJIT:
time luajit -O -e 'for i=1,1e8 do t=12341234234.234 / 2.0 end'
real 0m1.921s
user 0m1.668s
sys 0m0.004s
time luajit -O -e 'for i=1,1e8 do t=12341234234.234 * 0.5 end'
real 0m1.843s
user 0m1.676s
sys 0m0.000s
=>it's only 5% faster
conclusions: in Python it's faster to multiply than to divide, but as you get closer to the CPU using more advanced VMs or JITs, the advantage disappears. It's quite possible that a future Python VM would make it irrelevant

Always use whatever is the clearest. Anything else you do is trying to outsmart the compiler. If the compiler is at all intelligent, it will do the best to optimize the result, but nothing can make the next guy not hate you for your crappy bitshifting solution (I love bit manipulation by the way, it's fun. But fun != readable)
Premature optimization is the root of all evil. Always remember the three rules of optimization!
Don't optimize.
If you are an expert, see rule #1
If you are an expert and can justify the need, then use the following procedure:
Code it unoptimized
determine how fast is "Fast enough"--Note which user requirement/story requires that metric.
Write a speed test
Test existing code--If it's fast enough, you're done.
Recode it optimized
Test optimized code. IF it doesn't meet the metric, throw it away and keep the original.
If it meets the test, keep the original code in as comments
Also, doing things like removing inner loops when they aren't required or choosing a linked list over an array for an insertion sort are not optimizations, just programming.

I think this is getting so nitpicky that you would be better off doing whatever makes the code more readable. Unless you perform the operations thousands, if not millions, of times, I doubt anyone will ever notice the difference.
If you really have to make the choice, benchmarking is the only way to go. Find what function(s) are giving you problems, then find out where in the function the problems occur, and fix those sections. However, I still doubt that a single mathematical operation (even one repeated many, many times) would be a cause of any bottleneck.

Multiplication is faster, division is more accurate. You'll lose some precision if your number isn't a power of 2:
y = x / 3.0;
y = x * 0.333333; // how many 3's should there be, and how will the compiler round?
Even if you let the compiler figure out the inverted constant to perfect precision, the answer can still be different.
x = 100.0;
x / 3.0 == x * (1.0/3.0) // is false in the test I just performed
The speed issue is only likely to matter in C/C++ or JIT languages, and even then only if the operation is in a loop at a bottleneck.

If you want to optimize your code but still be clear, try this:
y = x * (1.0 / 2.0);
The compiler should be able to do the divide at compile-time, so you get a multiply at run-time. I would expect the precision to be the same as in the y = x / 2.0 case.
Where this may matter a LOT is in embedded processors where floating-point emulation is required to compute floating-point arithmetic.

Just going to add something for the "other languages" option.
C: Since this is just an academic exercise that really makes no difference, I thought I would contribute something different.
I compiled to assembly with no optimizations and looked at the result.
The code:
int main() {
volatile int a;
volatile int b;
asm("## 5/2\n");
a = 5;
a = a / 2;
asm("## 5*0.5");
b = 5;
b = b * 0.5;
asm("## done");
return a + b;
}
compiled with gcc tdiv.c -O1 -o tdiv.s -S
the division by 2:
movl $5, -4(%ebp)
movl -4(%ebp), %eax
movl %eax, %edx
shrl $31, %edx
addl %edx, %eax
sarl %eax
movl %eax, -4(%ebp)
and the multiplication by 0.5:
movl $5, -8(%ebp)
movl -8(%ebp), %eax
pushl %eax
fildl (%esp)
leal 4(%esp), %esp
fmuls LC0
fnstcw -10(%ebp)
movzwl -10(%ebp), %eax
orw $3072, %ax
movw %ax, -12(%ebp)
fldcw -12(%ebp)
fistpl -16(%ebp)
fldcw -10(%ebp)
movl -16(%ebp), %eax
movl %eax, -8(%ebp)
However, when I changed those ints to doubles (which is what python would probably do), I got this:
division:
flds LC0
fstl -8(%ebp)
fldl -8(%ebp)
flds LC1
fmul %st, %st(1)
fxch %st(1)
fstpl -8(%ebp)
fxch %st(1)
multiplication:
fstpl -16(%ebp)
fldl -16(%ebp)
fmulp %st, %st(1)
fstpl -16(%ebp)
I haven't benchmarked any of this code, but just by examining the code you can see that using integers, division by 2 is shorter than multiplication by 2. Using doubles, multiplication is shorter because the compiler uses the processor's floating point opcodes, which probably run faster (but actually I don't know) than not using them for the same operation. So ultimately this answer has shown that the performance of multiplaction by 0.5 vs. division by 2 depends on the implementation of the language and the platform it runs on. Ultimately the difference is negligible and is something you should virtually never ever worry about, except in terms of readability.
As a side note, you can see that in my program main() returns a + b. When I take the volatile keyword away, you'll never guess what the assembly looks like (excluding the program setup):
## 5/2
## 5*0.5
## done
movl $5, %eax
leave
ret
it did both the division, multiplication, AND addition in a single instruction! Clearly you don't have to worry about this if the optimizer is any kind of respectable.
Sorry for the overly long answer.

Firstly, unless you are working in C or ASSEMBLY, you're probably in a higher level language where memory stalls and general call overheads will absolutely dwarf the difference between multiply and divide to the point of irrelevance. So, just pick what reads better in that case.
If you're talking from a very high level it won't be measurably slower for anything you're likely to use it for. You'll see in other answers, people need to do a million multiply/divides just to measure some sub-millisecond difference between the two.
If you're still curious, from a low level optimisation point of view:
Divide tends to have a significantly longer pipeline than multiply. This means it takes longer to get the result, but if you can keep the processor busy with non-dependent tasks, then it doesn't end up costing you any more than a multiply.
How long the pipeline difference is is completely hardware dependant. Last hardware I used was something like 9 cycles for a FPU multiply and 50 cycles for a FPU divide. Sounds a lot, but then you'd lose 1000 cycles for a memory miss, so that can put things in perspective.
An analogy is putting a pie in a microwave while you watch a TV show. The total time it took you away from the TV show is how long it was to put it in the microwave, and take it out of the microwave. The rest of your time you still watched the TV show. So if the pie took 10 minutes to cook instead of 1 minute, it didn't actually use up any more of your tv watching time.
In practice, if you're going to get to the level of caring about the difference between Multiply and Divide, you need to understand pipelines, cache, branch stalls, out-of-order prediction, and pipeline dependencies. If this doesn't sound like where you were intending to go with this question, then the correct answer is to ignore the difference between the two.
Many (many) years ago it was absolutely critical to avoid divides and always use multiplies, but back then memory hits were less relevant, and divides were much worse. These days I rate readability higher, but if there's no readability difference, I think its a good habit to opt for multiplies.

Write whichever is more clearly states your intent.
After your program works, figure out what's slow, and make that faster.
Don't do it the other way around.

Do whatever you need. Think of your reader first, do not worry about performance until you are sure you have a performance problem.
Let compiler do the performance for you.

Actually there is a good reason that as a general rule of thumb multiplication will be faster than division. Floating point division in hardware is done either with shift and conditional subtract algorithms ("long division" with binary numbers) or - more likely these days - with iterations like Goldschmidt's algorithm. Shift and subtract needs at least one cycle per bit of precision (the iterations are nearly impossible to parallelize as are the shift-and-add of multiplication), and iterative algorithms do at least one multiplication per iteration. In either case, it's highly likely that the division will take more cycles. Of course this does not account for quirks in compilers, data movement, or precision. By and large, though, if you are coding an inner loop in a time sensitive part of a program, writing 0.5 * x or 1.0/2.0 * x rather than x / 2.0 is a reasonable thing to do. The pedantry of "code what's clearest" is absolutely true, but all three of these are so close in readability that the pedantry is in this case just pedantic.

If you are working with integers or non floating point types don't forget your bitshifting operators: << >>
int y = 10;
y = y >> 1;
Console.WriteLine("value halved: " + y);
y = y << 1;
Console.WriteLine("now value doubled: " + y);

Multiplication is usually faster - certainly never slower.
However, if it is not speed critical, write whichever is clearest.

I have always learned that multiplication is more efficient.

Floating-point division is (generally) especially slow, so while floating-point multiplication is also relatively slow, it's probably faster than floating-point division.
But I'm more inclined to answer "it doesn't really matter", unless profiling has shown that division is a bit bottleneck vs. multiplication. I'm guessing, though, that the choice of multiplication vs. division isn't going to have a big performance impact in your application.

This becomes more of a question when you are programming in assembly or perhaps C. I figure that with most modern languages that optimization such as this is being done for me.

Be wary of "guessing multiplication is typically better so I try to stick to that when I code,"
In context of this specific question, better here means "faster". Which is not very useful.
Thinking about speed can be a serious mistake. There are profound error implications in the specific algebraic form of the calculation.
See Floating Point arithmetic with error analysis. See Basic Issues in Floating Point Arithmetic and Error Analysis.
While some floating-point values are exact, most floating point values are an approximation; they are some ideal value plus some error. Every operation applies to the ideal value and the error value.
The biggest problems come from trying to manipulate two nearly-equal numbers. The right-most bits (the error bits) come to dominate the results.
>>> for i in range(7):
... a=1/(10.0**i)
... b=(1/10.0)**i
... print i, a, b, a-b
...
0 1.0 1.0 0.0
1 0.1 0.1 0.0
2 0.01 0.01 -1.73472347598e-18
3 0.001 0.001 -2.16840434497e-19
4 0.0001 0.0001 -1.35525271561e-20
5 1e-05 1e-05 -1.69406589451e-21
6 1e-06 1e-06 -4.23516473627e-22
In this example, you can see that as the values get smaller, the difference between nearly equal numbers create non-zero results where the correct answer is zero.

I've read somewhere that multiplication is more efficient in C/C++; No idea regarding interpreted languages - the difference is probably negligible due to all the other overhead.
Unless it becomes an issue stick with what is more maintainable/readable - I hate it when people tell me this but its so true.

I would suggest multiplication in general, because you don't have to spend the cycles ensuring that your divisor is not 0. This doesn't apply, of course, if your divisor is a constant.

As with posts #24 (multiplication is faster) and #30 - but sometimes they are both just as easy to understand:
1*1e-6F;
1/1e6F;
~ I find them both just as easy to read, and have to repeat them billions of times. So it is useful to know that multiplication is usually faster.

There is a difference, but it is compiler dependent. At first on vs2003 (c++) I got no significant difference for double types (64 bit floating point). However running the tests again on vs2010, I detected a huge difference, up to factor 4 faster for multiplications. Tracking this down, it seems that vs2003 and vs2010 generates different fpu code.
On a Pentium 4, 2.8 GHz, vs2003:
Multiplication: 8.09
Division: 7.97
On a Xeon W3530, vs2003:
Multiplication: 4.68
Division: 4.64
On a Xeon W3530, vs2010:
Multiplication: 5.33
Division: 21.05
It seems that on vs2003 a division in a loop (so the divisor was used multiple times) was translated to a multiplication with the inverse. On vs2010 this optimization is not applied any more (I suppose because there is slightly different result between the two methods). Note also that the cpu performs divisions faster as soon as your numerator is 0.0. I do not know the precise algorithm hardwired in the chip, but maybe it is number dependent.
Edit 18-03-2013: the observation for vs2010

Java android, profiled on Samsung GT-S5830
public void Mutiplication()
{
float a = 1.0f;
for(int i=0; i<1000000; i++)
{
a *= 0.5f;
}
}
public void Division()
{
float a = 1.0f;
for(int i=0; i<1000000; i++)
{
a /= 2.0f;
}
}
Results?
Multiplications(): time/call: 1524.375 ms
Division(): time/call: 1220.003 ms
Division is about 20% faster than multiplication (!)

After such a long and interesting discussion here is my take on this: There is no final answer to this question. As some people pointed out it depends on both, the hardware (cf piotrk and gast128) and the compiler (cf #Javier's tests). If speed is not critical, if your application does not need to process in real-time huge amount of data, you may opt for clarity using a division whereas if processing speed or processor load are an issue, multiplication might be the safest.
Finally, unless you know exactly on what platform your application will be deployed, benchmark is meaningless. And for code clarity, a single comment would do the job!

Here's a silly fun answer:
x / 2.0 is not equivalent to x * 0.5
Let's say you wrote this method on Oct 22, 2008.
double half(double x) => x / 2.0;
Now, 10 years later you learn that you can optimize this piece of code. The method is referenced in hundreds of formulas throughout your application. So you change it, and experience a remarkable 5% performance improvement.
double half(double x) => x * 0.5;
Was it the right decision to change the code? In maths, the two expressions are indeed equivalent. In computer science, that does not always hold true. Please read Minimizing the effect of accuracy problems for more details. If your calculated values are - at some point - compared with other values, you will change the outcome of edge cases. E.g.:
double quantize(double x)
{
if (half(x) > threshold))
return 1;
else
return -1;
}
Bottom line is; once you settle for either of the two, then stick to it!

Well, if we assume that an add/subtrack operation costs 1, then multiply costs 5, and divide costs about 20.

Technically there is no such thing as division, there is just multiplication by inverse elements. For example You never divide by 2, you in fact multiply by 0.5.
'Division' - let's kid ourselves that it exists for a second - is always harder that multiplication because to 'divide' x by y one first needs to compute the value y^{-1} such that y*y^{-1} = 1 and then do the multiplication x*y^{-1}. If you already know y^{-1} then not calculating it from y must be an optimization.

Related

Is it still worth using the Quake fast inverse square root algorithm nowadays on x86-64?

Specifically, this is the code I'm talking about:
float InvSqrt(float x) {
float xhalf = 0.5f*x;
int i = *(int*)&x; // warning: strict-aliasing UB, use memcpy instead
i = 0x5f375a86- (i >> 1);
x = *(float*)&i; // same
x = x*(1.5f-xhalf*x*x);
return x;
}
I forgot where I got this from but it's apparently better and more efficient or precise than the original Quake III algorithm (slightly different magic constant), but it's been more than 2 decades since this algorithm was created, and I just want to know if it's still worth using it in terms of performance, or if there's an instruction that implements it already in modern x86-64 CPUs.
Origins:
See John Carmack's Unusual Fast Inverse Square Root (Quake III)
Modern usefulness: none, obsoleted by SSE1 rsqrtss
Use _mm_rsqrt_ps or ss to get a very approximate reciprocal-sqrt for 4 floats in parallel, much faster than even a good compiler could do with this (using SSE2 integer shift/add instructions to keep the FP bit pattern in an XMM register, which is probably not how it would actually compile with the type-pun to integer. Which is strict-aliasing UB in C or C++; use memcpy or C++20 std::bit_cast.)
https://www.felixcloutier.com/x86/rsqrtss documents the scalar version of the asm instruction, including the |Relative Error| ≤ 1.5 ∗ 2−12 guarantee. (i.e. about half the mantissa bits are correct.) One Newton-Raphson iteration can refine it to within 1ulp of being correct, although still not the 0.5ulp you'd get from actual sqrt. See Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision)
rsqrtps performs only slightly slower than a mulps / mulss instruction on most CPUs, like 5 cycle latency, 1/clock throughput. (With a Newton iteration to refine it, more uops.) Latency various by microarchitecture, as low as 3 uops in Zen 3, but Intel runs it with about 5c latency since Conroe at least (https://uops.info/).
The integer shift / subtract from the magic number in the Quake InvSqrt similarly provides an even rougher initial-guess, and the rest (after type-punning the bit-pattern back to a float is a Newton Raphson iteration.
Compilers will even use rsqrtss for you when compiling sqrt with -ffast-math, depending on context and tuning options. (e.g. modern clang compiling 1.0f/sqrtf(x) with -O3 -ffast-math -march=skylake https://godbolt.org/z/fT86bKesb uses vrsqrtss and 3x vmulss plus an FMA.) Non-reciprocal sqrt is usually not worth it, but rsqrt + refinement avoids a division as well as a sqrt.
Full-precision square root and division themselves are not as slow as they used to be, at least if you use them infrequently compared to mul/add/sub. (e.g. if you can hide the latency, one sqrt every 12 or so other operations might cost about the same, still a single uop instead of multiple for rsqrt + Newton iteration.) See Floating point division vs floating point multiplication
But sqrt and div do compete with each other for throughput so needing to divide by a square root is a nasty case.
So if you have a bad loop over an array that mostly just does sqrt, not mixed with other math operations, that's a use-case for _mm_rsqrt_ps (and a Newton iteration) as a higher throughput approximation than _mm_sqrt_ps
But if you can combine that pass with something else to increase computational intensity and get more work done overlapped with keeping the div/sqrt unit, often it's better to use a real sqrt instruction on its own, since that's still just 1 uop for the front-end to issue, and for the back-end to track and execute. vs. a Newton iteration taking something like 5 uops if FMA is available for reciprocal square root, else more (also if non-reciprocal sqrt is needed).
With Skylake for example having 1 per 3 cycle sqrtps xmm throughput (128-bit vectors), it costs the same as a mul/add/sub/fma operation if you don't do more than one per 6 math operations. (Throughput is worse for 256-bit YMM vectors, 6 cycles.) A Newton iteration would cost more uops, so if uops for port 0/1 are the bottleneck, it's a win to just use sqrt directly. (This is assuming that out-of-order exec can hide the latency, typically when each loop iteration is independent.) This kind of situation is common if you're using a polynomial approximation as part of something like log or exp in a loop.
See also Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision re: performance on modern OoO exec CPUs.

Modulus optimization in a program

I have seen that many people prefer to use in code:
while(i<1000000){
ret+=a[i];
i++;
if(ret >= MOD)
ret -= MOD;
}
instead of making ret%MOD in the final step.
What is the difference between these two and how both these are equal?
How it is making an optimize our code?
Basically you can't tell without trying. There are two possible outcomes (considering my note further down below):
The compiler optimizes the code in some way that both solutions use either a conditional jump or a modulo operation. This does not only depend on how "bright" the compiler is, but it also has to consider the target architecture's available instruction set (but to be honest, it would be odd not having a modulo operation).
The compiler doesn't optimize the code (most probable for non-optimizing debug builds).
The basic difference that - as mentioned already - the solution with the if() will use one conditional jump, which - again depending on your architecture - might slow you down a bit, since the compiler can't prefetch the next instruction without evaluating the jump condition first.
One further note:
Either using a modulo operation or your if() statement actually isn't equal (depending on the actual values), simply due to the fact that ret % MOD would result in the following equal code:
while (ret >= MOD)
ret -= MOD;
Imagine a[i] being bigger than MOD and the new sum being bigger than two times MOD. In that case you'd end up with a ret bigger than MOD, something that won't happen when using modulo.
Let an example :
13 MOD 10
what it actually do is, give you the reminder after dividing 13 by 10.
that is : 13 - (10 * (int)(13/10)) = 13 - ( 10 * 1 ) = 3
so if a[i] <= mod then it will work good. but if a[i] > mod then see, what happens
let a[]= {15,15,15}
mod=7
in first step
ret = 0 + 15
ret = 15 - 7 = 8
2nd step
ret = 8 + 15 = 23
ret = 23 - 7 = 16
3rd step
ret = 16 + 15
ret = 31 - 7 = 24
So your final result is 24, but it should be 3.
you have to do :
while (ret >= MOD)
ret -= MOD;
if you want to use subtraction instead of mod..
And obviously sub is better than mod in respect to time... because mod is really time consuming :(
It is best not to try to optimise code unless you have a performance problem. Then find out where it is actually having the problems
An to answer you question the two are the same - but you need to check with the particular hardware/compiler to check.
The conditional test and subtraction is typically less expensive than a modulus operation, especially if the sum does not frequently exceed MOD. A modulus operation is effectively an integer division instruction, which typically has a latency which is an order of magnitude greater than that of compare/subtract. Having said that, unless the loop is a known performance bottleneck then you should just code for clarity and robustness.
Modulo requires integer division which is usually the slowest integer math operation on a CPU. Long ago before pipelines and branch prediction, this code was probably reliably faster than modulo. Nowadays branches can be very slow so its benefit is far from certain. If the values in a are always much smaller than MOD, it's probably still a win because the branch will be skipped most iterations and the branch predictor will mostly guess right. If they are not smaller, it's uncertain. You would need to benchmark both.
If you can write the program such that MOD is always a power of 2, you could use bit masking which is much faster than either.
If I saw this pattern in code that wasn't 1) from 1978 or 2) accompanied by a comment explaining how the author benchmarked it and found it was faster than modulo on the current compiler, typical user CPU, and a realistic data input, I'd roll my eyes hard.
Yes booth compute the same thing but:
operation % needs integer division which is more time costly then - and if
but on modern parallel machines (mean more pipelines by that not cores)
the CPU do more tasks at once unless they depend on each other or brunching occurs
that is why on modern machines is the % variant usually faster (if stalls the pipelines)
There are still platforms where the -=,if variant is faster
like MCU's so when you know you have just single CPU/MCU pipeline
or have very slow division then use this variant
you should always measure the result ties during optimization process
in your case you want to call just single mod per whole loop so it should be faster but check the later text ...
Compilers
modern compilers optimize code for your target platform and usually detect this and use the right choice
so you should not be consumed by the low level optimization instead of by programing the task functionality
but not all compilers are such for many platforms there are still used older compilers
also in some rare cases the optimizations are preferred to be turned off
because it could destroy specific desired timing, instruction patterns, or even functionality of the task ...
in such cases there is no choice and this knowledge suddenly comes handy
now the differences of your cases from algorithmic side:
while(i<1000000){ ret+=a[i]; i++; if(ret>=MOD) ret-=MOD; }
the sub result is still around modulo MOD
that mean you do not need more bits then used for max(a[i])+MOD*N where N depends on a[i]
if the sum(a[i]) will go to bignums then this will have more speed due to no need to increase sub-result bit-width
while(i<1000000){ ret+=a[i]; i++; } ret%=MOD;
this could overflow if variable ret can not hold the non modulo result
while(i<1000000){ ret+=a[i]; i++; ret%=MOD; }
this is how it should be for bigger non modulo results
if (ret>=MOD) ret-=MOD; is not modulo operation
it is just iteration of it.
more safe is while (ret>=MOD) ret-=MOD;
but if you know that the sub-result is not increasing too much (so it will not overflow in any few iterations) then if is OK
but in that case you should add while or modulo after the loop to ensure correct result

Time taken in executing % / * + - operations

Recently, i heard that % operator is costly in terms of time.
So, the question is that, is there a way to find the remainder faster?
Also your help will be appreciated if anyone can tell the difference in the execution of % / * + - operations.
In some cases where you're using power-of-2 divisors you can do better with roll-your-own techniques for calculating remainder, but generally a halfway decent compiler will do the best job possible with variable divisors, or "odd" divisors that don't fit any pattern.
Note that a few CPUs don't even have a multiply operation, and so (on those) multiply is quite slow vs add (at least 64x for a 32-bit multiply). (But a smart compiler may improve on this if the multiplier is a literal.) A slightly larger number do not have a divide operation or have a pretty slow one. (On a CPU with a fast multiplier multiply may only be on the order of 4 times slower than add, but on "normal" hardware it's 16-32 times slower for a 32 bit operation. Divide is inherently 2-4x slower than multiply, but can be much slower on some hardware.)
The remainder operation is rarely implemented in hardware, and normally A % B maps to something along the lines of A - ((A / B) * B) (a few extra operations may be required to assure the proper sign, et al).
(I learned about this stuff while microprogramming the instruction set for the SUMC computer for RCA/NASA back in the early 70s.)
No, the compiler is going to implement % in the most efficient way possible.
In terms of speed, + and - are the fastest (and are equally fast, generally done by the same hardware).
*, /, and % are much slower. Multiplication is basically done by the method you learn in grade school- multiply the first number by every digit in the second number and add the results. With some hacks made possible by binary. As of a few years ago, multiply was 3x slower than add. Division should be similar to multiply. Remainder is similar to division (in fact it generally calculates both at once).
Exact differences depend on the CPU type and exact model. You'd need to look up the latencies in the CPU spec sheets for your particular machine.

A clever homebrew modulus implementation

I'm programming a PLC with some legacy software (RSLogix 500, don't ask) and it does not natively support a modulus operation, but I need one. I do not have access to: modulus, integer division, local variables, a truncate operation (though I can hack it with rounding). Furthermore, all variables available to me are laid out in tables sorted by data type. Finally, it should work for floating point decimals, for example 12345.678 MOD 10000 = 2345.678.
If we make our equation:
dividend / divisor = integer quotient, remainder
There are two obvious implementations.
Implementation 1:
Perform floating point division: dividend / divisor = decimal quotient. Then hack together a truncation operation so you find the integer quotient. Multiply it by the divisor and find the difference between the dividend and that, which results in the remainder.
I don't like this because it involves a bunch of variables of different types. I can't 'pass' variables to a subroutine, so I just have to allocate some of the global variables located in multiple different variable tables, and it's difficult to follow. Unfortunately, 'difficult to follow' counts, because it needs to be simple enough for a maintenance worker to mess with.
Implementation 2:
Create a loop such that while dividend > divisor divisor = dividend - divisor. This is very clean, but it violates one of the big rules of PLC programming, which is to never use loops, since if someone inadvertently modifies an index counter you could get stuck in an infinite loop and machinery would go crazy or irrecoverably fault. Plus loops are hard for maintenance to troubleshoot. Plus, I don't even have looping instructions, I have to use labels and jumps. Eww.
So I'm wondering if anyone has any clever math hacks or smarter implementations of modulus than either of these. I have access to + - * /, exponents, sqrt, trig functions, log, abs value, and AND/OR/NOT/XOR.
How many bits are you dealing with? You could do something like:
if dividend > 32 * divisor dividend -= 32 * divisor
if dividend > 16 * divisor dividend -= 16 * divisor
if dividend > 8 * divisor dividend -= 8 * divisor
if dividend > 4 * divisor dividend -= 4 * divisor
if dividend > 2 * divisor dividend -= 2 * divisor
if dividend > 1 * divisor dividend -= 1 * divisor
quotient = dividend
Just unroll as many times as there are bits in dividend. Make sure to be careful about those multiplies overflowing. This is just like your #2 except it takes log(n) instead of n iterations, so it is feasible to unroll completely.
If you don't mind overly complicating things and wasting computer time you can calculate modulus with periodic trig functions:
atan(tan(( 12345.678 -5000)*pi/10000))*10000/pi+5000 = 2345.678
Seriously though, subtracting 10000 once or twice (your "implementation 2") is better. The usual algorithms for general floating point modulus require a number of bit-level manipulations that are probably unfeasible for you. See for example http://www.netlib.org/fdlibm/e_fmod.c (The algorithm is simple but the code is complex because of special cases and because it is written for IEEE 754 double precision numbers assuming there is no 64-bit integer type)
This all seems completely overcomplicated. You have an encoder index that rolls over at 10000 and objects rolling along the line whose positions you are tracking at any given point. If you need to forward project stop points or action points along the line, just add however many inches you need and immediately subtract 10000 if your target result is greater than 10000.
Alternatively, or in addition, you always get a new encoder value every PLC scan. In the case where the difference between the current value and last value is negative you can energize a working contact to flag the wrap event and make appropriate corrections for any calculations on that scan. (**or increment a secondary counter as below)
Without knowing more about the actual problem it is hard to suggest a more specific solution but there are certainly better solutions. I don't see a need for MOD here at all. Furthermore, the guys on the floor will thank you for not filling up the machine with obfuscated wizard stuff.
I quote :
Finally, it has to work for floating point decimals, for example
12345.678 MOD 10000 = 2345.678
There is a brilliant function that exists to do this - it's a subtraction. Why does it need to be more complicated than that? If your conveyor line is actually longer than 833 feet then roll a second counter that increments on a primary index roll-over until you've got enough distance to cover the ground you need.
For example, if you need 100000 inches of conveyor memory you can have a secondary counter that rolls over at 10. Primary encoder rollovers can be easily detected as above and you increment the secondary counter each time. Your working encoder position, then, is 10000 times the counter value plus the current encoder value. Work in the extended units only and make the secondary counter roll over at whatever value you require to not lose any parts. The problem, again, then reduces to a simple subtraction (as above).
I use this technique with a planetary geared rotational part holder, for example. I have an encoder that rolls over once per primary rotation while the planetary geared satellite parts (which themselves rotate around a stator gear) require 43 primary rotations to return to an identical starting orientation. With a simple counter that increments (or decrements, depending on direction) at the primary encoder rollover point it gives you a fully absolute measure of where the parts are at. In this case, the secondary counter rolls over at 43.
This would work identically for a linear conveyor with the only difference being that a linear conveyor can go on for an infinite distance. The problem then only needs to be limited by the longest linear path taken by the worst-case part on the line.
With the caveat that I've never used RSLogix, here is the general idea (I've used generic symbols here and my syntax is probably a bit wrong but you should get the idea)
With the above, you end up with a value ENC_EXT which has essentially transformed your encoder from a 10k inch one to a 100k inch one. I don't know if your conveyor can run in reverse, if it can you would need to handle the down count also. If the entire rest of your program only works with the ENC_EXT value then you don't even have to worry about the fact that your encoder only goes to 10k. It now goes to 100k (or whatever you want) and the wraparound can be handled with a subtraction instead of a modulus.
Afterword :
PLCs are first and foremost state machines. The best solutions for PLC programs are usually those that are in harmony with this idea. If your hardware is not sufficient to fully represent the state of the machine then the PLC program should do its best to fill in the gaps for that missing state information with the information it has. The above solution does this - it takes the insufficient 10000 inches of state information and extends it to suit the requirements of the process.
The benefit of this approach is that you now have preserved absolute state information, not just for the conveyor, but also for any parts on the line. You can track them forward and backward for troubleshooting and debugging and you have a much simpler and clearer coordinate system to work with for future extensions. With a modulus calculation you are throwing away state information and trying to solve individual problems in a functional way - this is often not the best way to work with PLCs. You kind of have to forget what you know from other programming languages and work in a different way. PLCs are a different beast and they work best when treated as such.
You can use a subroutine to do exactly what you are talking about. You can tuck the tricky code away so the maintenance techs will never encounter it. It's almost certainly the easiest for you and your maintenance crew to understand.
It's been a while since I used RSLogix500, so I might get a couple of terms wrong, but you'll get the point.
Define a Data File each for your floating points and integers, and give them symbols something along the lines of MOD_F and MOD_N. If you make these intimidating enough, maintenance techs leave them alone, and all you need them for is passing parameters and workspace during your math.
If you really worried about them messing up the data tables, there are ways to protect them, but I have forgotten what they are on a SLC/500.
Next, defined a subroutine, far away numerically from the ones in use now, if possible. Name it something like MODULUS. Again, maintenance guys almost always stay out of SBRs if they sound like programming names.
In the rungs immediately before your JSR instruction, load the variables you want to process into the MOD_N and MOD_F Data Files. Comment these rungs with instructions that they load data for MODULUS SBR. Make the comments clear to anyone with a programming background.
Call your JSR conditionally, only when you need to. Maintenance techs do not bother troubleshooting non-executing logic, so if your JSR is not active, they will rarely look at it.
Now you have your own little walled garden where you can write your loop without maintenance getting involved with it. Only use those Data Files, and don't assume the state of anything but those files is what you expect. In other words, you cannot trust indirect addressing. Indexed addressing is OK, as long as you define the index within your MODULUS JSR. Do not trust any incoming index. It's pretty easy to write a FOR loop with one word from your MOD_N file, a jump and a label. Your whole Implementation #2 should be less than ten rungs or so. I would consider using an expression instruction or something...the one that lets you just type in an expression. Might need a 504 or 505 for that instruction. Works well for combined float/integer math. Check the results though to make sure the rounding doesn't kill you.
After you are done, validate your code, perfectly if possible. If this code ever causes a math overflow and faults the processor, you will never hear the end of it. Run it on a simulator if you have one, with weird values (in case they somehow mess up the loading of the function inputs), and make sure the PLC does not fault.
If you do all that, no one will ever even realize you used regular programming techniques in the PLC, and you will be fine. AS LONG AS IT WORKS.
This is a loop based on the answer by #Keith Randall, but it also maintains the result of the division by substraction. I kept the printf's for clarity.
#include <stdio.h>
#include <limits.h>
#define NBIT (CHAR_BIT * sizeof (unsigned int))
unsigned modulo(unsigned dividend, unsigned divisor)
{
unsigned quotient, bit;
printf("%u / %u:", dividend, divisor);
for (bit = NBIT, quotient=0; bit-- && dividend >= divisor; ) {
if (dividend < (1ul << bit) * divisor) continue;
dividend -= (1ul << bit) * divisor;
quotient += (1ul << bit);
}
printf("%u, %u\n", quotient, dividend);
return dividend; // the remainder *is* the modulo
}
int main(void)
{
modulo( 13,5);
modulo( 33,11);
return 0;
}

Why differ(!=,<>) is faster than equal(=,==)?

I've seen comments on SO saying "<> is faster than =" or "!= faster than ==" in an if() statement.
I'd like to know why is that so. Could you show an example in asm?
Thanks! :)
EDIT:
Source
Here is what he did.
function Check(var MemoryData:Array of byte;MemorySignature:Array of byte;Position:integer):boolean;
var i:byte;
begin
Result := True; //moved at top. Your function always returned 'True'. This is what you wanted?
for i := 0 to Length(MemorySignature) - 1 do //are you sure??? Perhaps you want High(MemorySignature) here...
begin
{!} if MemorySignature[i] <> $FF then //speedup - '<>' evaluates faster than '='
begin
Result:=memorydata[i + position] <> MemorySignature[i]; //speedup.
if not Result then
Break; //added this! - speedup. We already know the result. So, no need to scan till end.
end;
end;
end;
I'd claim that this is flat out wrong except perhaps in very special circumstances. Compilers can refactor one into the other effortlessly (by just switching the if and else cases).
It could have something to do with branch prediction on the CPU. Static branch prediction would predict that a branch simply wouldn't be taken and fetch the next instruction. However, hardly anybody uses that anymore. Other than that, I'd say it's bull because the comparisons should be identical.
I think there's some confusion in your previous question about what the algorithm was that you were trying to implement, and therefore in what the claimed "speedup" purports to do.
Here's some disassembly from Delphi 2007. optimization on. (Note, optimization off changed the code a little, but not in a relevant way.
Unit70.pas.31: for I := 0 to 100 do
004552B5 33C0 xor eax,eax
Unit70.pas.33: if i = j then
004552B7 3B02 cmp eax,[edx]
004552B9 7506 jnz $004552c1
Unit70.pas.34: k := k+1;
004552BB FF05D0DC4500 inc dword ptr [$0045dcd0]
Unit70.pas.35: if i <> j then
004552C1 3B02 cmp eax,[edx]
004552C3 7406 jz $004552cb
Unit70.pas.36: l := l + 1;
004552C5 FF05D4DC4500 inc dword ptr [$0045dcd4]
Unit70.pas.37: end;
004552CB 40 inc eax
Unit70.pas.31: for I := 0 to 100 do
004552CC 83F865 cmp eax,$65
004552CF 75E6 jnz $004552b7
Unit70.pas.38: end;
004552D1 C3 ret
As you can see, the only difference between the two cases is a jz vs. a jnz instruction. These WILL run at the same speed. what's likely to affect things much more is how often the branch is taken, and if the entire loop fits into cache.
For .Net languages
If you look at the IL from the string.op_Equality and string.op_Inequality methods, you will see that both internall call string.Equals.
But the op_Inequality inverts the result. This is two IL-statements more.
I would say they the performance is the same, with maybe a small (very small, very very small) better performance for the == statement. But I believe that the optimizer & JIT compiler will remove this.
Spontaneous though; most other things in your code will affect performance more than the choice between == and != (or = and <> depending on language).
When I ran a test in C# over 1000000 iterations of comparing strings (containing the alphabet, a-z, with the last two letters reversed in one of them), the difference was between 0 an 1 milliseconds.
It has been said before: write code for readability; change into more performant code when it has been established that it will make a difference.
Edit: repeated the same test with byte arrays; same thing; the performance difference is neglectible.
It could also be a result of misinterpretation of an experiment.
Most compilers/optimizers assume a branch is taken by default. If you invert the operator and the if-then-else order, and the branch that is now taken is the ELSE clause, that might cause an additional speed effect in highly calculating code (*)
(*) obviously you need to do a lot of operations for that. But it can matter for the tightest loops in e.g. codecs or image analysis/machine vision where you have 50MByte/s of data to trawl through.
.... and then I even only stoop to this level for the really heavily reusable code. For ordinary business code it is not worth it.
I'd claim this was flat out wrong full stop. The test for equality is always the same as the test for inequality. With string (or complex structure testing), you're always going to break at exactly the same point. Until that break point is reached, then the answer for equality is unknown.
I strongly doubt there is any speed difference. For integral types for example you are getting a CMP instruction and either JZ (Jump if zero) or JNZ (Jump if not zero), depending on whether you used = or ≠. There is no speed difference here and I'd expect that to hold true at higher levels too.
If you can provide a small example that clearly shows a difference, then I'm sure the Stack Overflow community could explain why. However, I think you might have difficulty constructing a clear example. I don't think there will be any performance difference noticeable at any reasonable scale.
Well it could be or it couldn't be, that is the question :-)
The thing is this is highly depending on the programming language you are using.
Since all your statements will eventually end up as instructions to the CPU, the one that uses the least amount of instruction to achieve the result will be the fastest.
For example if you say bits x is equal to bits y, you could use the instruction that does an XOR using both bits as an input, if the result is anything but 0 it is not the same. So how would you know that the result is anything but 0? By using the instruction that returns true if you say input a is bigger than 0.
So this is already 2 instructions you use to do it, but since most CPU's have an instruction that does compare in a single cycle it is a bad example.
The point I am making is still the same, you can't make this generally statements without providing the programming language and the CPU architecture.
This list (assuming it's on x86) of ASM instructions might help:
Jump if greater
Jump on equality
Comparison between two registers
(Disclaimer, I have nothing more than very basic experience with writing assembler so I could be off the mark)
However it obviously depends purely on what assembly instructions the Delphi compiler is producing. Without seeing that output then it's guesswork. I'm going to keep my Donald Knuth quote in as caring about this kind of thing for all but a niche set of applications (games, mobile devices, high performance server apps, safety critical software, missile launchers etc.) is the thing you worry about last in my view.
"We should forget about small
efficiencies, say about 97% of the
time: premature optimization is the
root of all evil."
If you're writing one of those or similar then obviously you do care, but you didn't specify it.
Just guessing, but given you want to preserve the logic, you cannot just replace
if A = B then
with
if A <> B then
To conserve the logic, the original code must have been something like
if not (A = B) then
or
if A <> B then
else
and that may truely be a little bit slower than the test on inequality.

Resources