Speed of Comparison operators - performance

In languages such as... well, just about anything, both the < and <= operators (and their opposites) exist. Which would be faster, and how are they interpreted?
if (x <= y) { blah; }
or
if (x < y + 1) { blah; }

Assuming no compiler optimizations (a big assumption), the first will be faster, as <= is implemented by a single jle instruction, whereas the latter requires an addition followed by a jl instruction.
http://en.wikibooks.org/wiki/X86_Assembly/Control_Flow#Jump_if_Less

I wouldn't worry about this at all as far as performance goes. Using C as an example, on a simple test I ran with GCC 4.5.1 targeting x86 (with -O2), the (x <= y) operation compiled to:
// if (x <= y) {
// printf( "x <= y\n");
// }
//
// `x` is [esp+28]
// `y` is [esp+24]
mov eax, DWORD PTR [esp+24] // load `y` into eax
cmp DWORD PTR [esp+28], eax // compare with `x`
jle L5 // if x <= y, jump to the `true` block
L2:
// ...
ret
L5: // this prints "x <= y\n"
mov DWORD PTR [esp], OFFSET FLAT:LC1
call _puts
jmp L2 // jump back to the code after the `if` statement
and the (x < y + 1) operation compiled to:
// if (x < y + 1) {
// printf( "x < y+1\n");
// }
//
// `x` is [esp+28]
// `y` is [esp+24]
mov eax, DWORD PTR [esp+28] // load x into eax
cmp DWORD PTR [esp+24], eax // compare with y
jl L3 // jump past the true block if (y < x)
mov DWORD PTR [esp], OFFSET FLAT:LC2
call _puts
L3:
So you might have a difference of a jump around a jump or so, but you should really only be concerned about this kind of thing for the odd time when it really is a hot spot. Of course there may be differences between languages, and exactly what happens might depend on the type of the objects being compared. But I'd still not worry about this at all as far as performance is concerned - until it becomes a demonstrated performance issue, which I'd be surprised to see happen more than once or twice in my lifetime.
So, I think the only two reasons to worry about which test to use are:
correctness - of course, this trumps any other consideration
style/readability
While you might not think there's much to the style/readability consideration, I do worry about this a little. In my C and C++ code today, I'd favor using the < operator over <= because I think loops tend to terminate 'better' using a < than a <= test. So, for example:
iterating over an array by index, you should typically use an index < number_of_elements test
iterating over an array using pointers to elements should use a ptr < (array + number_of_elements) test
Actually, even in C I now tend to use a ptr != (array + number_of_elements) test, since I've gotten used to STL iterators where the < relation won't work.
In fact, if I see a <= test in a for loop condition, I take a close look - often there's a bug lurking. I consider it an anti-pattern.
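As a small, hypothetical illustration of those loop idioms (the names array, number_of_elements, and bump_all are mine, not from the question):
#include <cstddef>
#include <list>

void bump_all(int* array, std::size_t number_of_elements, std::list<int>& values) {
    // Index-based loop: the < test stops exactly when the index reaches the count.
    for (std::size_t index = 0; index < number_of_elements; ++index)
        array[index] += 1;

    // Pointer-based loop: != against the one-past-the-end pointer,
    // the same relation STL iterators rely on.
    for (int* ptr = array; ptr != array + number_of_elements; ++ptr)
        *ptr += 1;

    // The same idiom with iterators, where < is not even defined
    // (std::list iterators are not random-access).
    for (auto it = values.begin(); it != values.end(); ++it)
        *it += 1;
}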
Now, I'll grant that a lot of this may not hold for other languages, but I'd be surprised if, when I'm using another language, there's ever a performance issue I have to worry about because I chose to use < over <=.

What data-type?
If y is INT_MAX, then the first expression is true no matter what x is (assuming x is the same or smaller type), while the second expression is always false.
If the answer doesn't need to be right, you can get it even faster.
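A minimal sketch of that corner case, assuming plain int and unsigned variables (my own example; note that the signed form of the second test is undefined behaviour in C and C++, so the unsigned pair is used to show the wrap-around):
#include <climits>
#include <iostream>

int main() {
    int x = 5, y = INT_MAX;
    std::cout << (x <= y) << '\n';       // 1: always true when y holds the maximum value
    // (x < y + 1) would be undefined behaviour here: y + 1 overflows signed int.

    unsigned ux = 5, uy = UINT_MAX;      // unsigned wrap-around is well defined
    std::cout << (ux <= uy) << '\n';     // 1
    std::cout << (ux < uy + 1) << '\n';  // 0: uy + 1 wraps to 0, so the test is always false
}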

Have you considered that those two expressions aren't even equivalent? If x and y are floating-point numbers, they may not give the same result. That is the reason both comparison operators exist.

Prefer the first one.
In some languages with dynamic types, the runtime environment has to figure out what the type of y is and execute the appropriate + operator.

Leaving this as vague as you have makes it an unanswerable question. Performance cannot be evaluated unless you have software and hardware to measure it on - what language? which language implementation? what target CPU architecture? etc.
That being said, both <= and < are often identical performance-wise, because they are logically equivalent to > and >=, just with swapped destinations for the underlying goto's (branch instructions), or swapped logic for the underlying "true/false" evaluation.
If you're programming in C or C++, the compiler may be able to figure out what you're doing, and swap in the faster alternative, anyway.
Write code that is understandable, maintainable, correct, and performant, in that order. For performance, find tools to measure the performance of your whole program, and spend your time wisely. Optimize bottlenecks only until your program is fast enough. Spend the time you save by making better code, or making more cool features :)

Related

What's the benefit of the three-way comparison operator (<=>) in C++20?

I know the syntax of it.
I'm just wondering what the benefit is, or whether it makes sense at all.
Without it, we must code like this:
void func1(int x, int y) {
    if( x > y )
        doSomeThing();
    else if( x < y )
        doSomeElse();
    else
        explosive();
}
With it, we can do like this:
void func1(int x, int y) {
    auto result = x <=> y;
    if( result > 0 )
        doSomeThing();
    else if( result < 0 )
        doSomeElse();
    else
        explosive();
}
Except for returning a comparison result, I can NOT see any benefit of this feature.
Someone says it can make our code more readable, but I don't think so.
It seems obvious that the former example is more readable.
As for returning a result, like this:
int func1(int x, int y) {
    return x <=> y;
}
It looks like we get more readability, but we still need to check the value with another if/else somewhere, e.g. outside of func1.
I can NOT see any benefit of this feature.
Then you are thinking too narrowly about what is actually happening.
Direct use of <=> does not exist to serve the needs of ints. It can be used for comparing them, but ints are not why the functionality exists.
It exists for types that actually have complex comparison logic. Consider std::string. To know if one string is "less than" another, you have to iterate through both strings and compare each character. When you find a non-equivalent one, you have your answer.
Your code, when applied to string, does the comparison twice: once with less than and once with greater than. The problem is this: the first comparison already found the first non-equal character. But the second comparison does not know where that is. So it must start from the very beginning, doing the exact same comparisons the first one did.
That's a lot of repeated work for an answer that the first comparison already computed. In fact, there is a really easy way to compute <=> for strings: subtract the corresponding character in the second string from the one in the first. If the value is zero, they are equal so far. If the result is negative, the first string is less; if it's positive, the first string is greater.
Which is... exactly what <=> returns, isn't it? By using <=>, you get the result of both orderings for the cost of one expensive pass over the data; testing the return value is immaterial next to the cost of that pass.
The more complex your comparison logic, the more you are likely to save with <=> if you need to categorize them into less/greater/equal.
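A rough C++20 sketch of that difference for std::string (onLess, onGreater, and onEqual are placeholder callbacks of my own, not anything from the question):
#include <compare>
#include <string>

void onLess();
void onGreater();
void onEqual();

// Two-comparison version: each relational test may walk the strings again.
void classify_twice(const std::string& a, const std::string& b) {
    if (a < b)          onLess();
    else if (a > b)     onGreater();   // repeats most of the character comparisons
    else                onEqual();
}

// <=> version: one pass over the strings, then two trivial tests on the result.
void classify_once(const std::string& a, const std::string& b) {
    std::strong_ordering cmp = a <=> b;
    if (cmp < 0)        onLess();
    else if (cmp > 0)   onGreater();
    else                onEqual();
}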
It should also be noted that processors often have special opcodes to tell if an integer is negative, zero, or positive. If we look at the x86 assembly for your integer comparison (with a, b, and c standing in for doSomeThing, doSomeElse, and explosive):
func1(int, int):              # @func1(int, int)
        cmp     edi, esi
        jle     .LBB0_1
        jmp     a()@PLT       # TAILCALL
.LBB0_1:
        jge     .LBB0_2
        jmp     b()@PLT       # TAILCALL
.LBB0_2:
        jmp     c()@PLT       # TAILCALL
We can see that it only executes cmp once; the jle and jge instructions use the results of the comparison. The <=> compiles to the same assembly, so the compiler fully understands these as being synonyms.

Why does the pseudocode of _mm_insert_ps calculate %8?

Within the intel intrinsics guide, the pseudocode for the operation of _mm_insert_ps, the following is defined:
FOR j := 0 to 3
    i := j*32
    IF imm8[j%8]
        dst[i+31:i] := 0
    ELSE
        dst[i+31:i] := tmp2[i+31:i]
    FI
ENDFOR
The access into imm8 confuses me: IF imm8[j%8]. As j is within the range 0..3, the modulo 8 part doesn't seem to do anything. Does this maybe signal a convention that I am not aware of? Or is % not "modulo" in this case?
Seems like a pointless modulo.
Intel's documentation for the corresponding asm instruction, insertps, doesn't use any % modulo operations in its pseudocode. It uses ZMASK ← imm8[3:0] and then basically unrolls the part of the pseudocode where the intrinsics guide uses a loop, with checks like
IF (ZMASK[2] = 1) THEN DEST[95:64] ← 00000000H
ELSE DEST[95:64] ← TMP2[95:64]
This is just showing how the low 4 bits of the immediate perform zero-masking on the 4 dword elements of the final result, after the insert of an element from another vector, or a scalar in memory.
(There's no intrinsic for insert directly from memory; you'd need an intrinsic for movss and then hope the compiler folds that load into a memory operand for insertps. With a memory source, imm8[7:6] are ignored, just taking that scalar dword as the element to insert (that's the ELSE COUNT_S←0 in the asm pseudocode), but then everything else works the same, including the zero-masking you're asking about.)
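For illustration, here's a small sketch (my own, not from the guide) of how the low four imm8 bits act as that zero mask. It needs SSE4.1, e.g. compile with -msse4.1:
#include <cstdio>
#include <smmintrin.h>  // SSE4.1: _mm_insert_ps

int main() {
    __m128 a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128 b = _mm_setr_ps(10.0f, 20.0f, 30.0f, 40.0f);

    // imm8 layout: [7:6] source element of b, [5:4] destination element of a,
    //              [3:0] zero mask applied to the result (the bits the question asks about).
    // Here: take b[0] (10.0f), insert it into position 2 of a, then zero element 0.
    constexpr int imm8 = (0 << 6) | (2 << 4) | 0b0001;
    __m128 r = _mm_insert_ps(a, b, imm8);

    float out[4];
    _mm_storeu_ps(out, r);
    std::printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    // expected: 0 2 10 4
}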

Mathematically find the value that is closest to 0

Is there a way to determine mathematically if a value is closer to 0 than another?
For example closerToZero(-2, 3) would return -2.
I tried removing the sign and then comparing the values for the minimum, but then I would end up returning the sign-less version of the initial number.
a and b are IEEE-754 compliant floating-point doubles (js number)
(64 bit => 1 bit sign 11 bit exponent 52 bit fraction)
min (a,b) => b-((a-b)&((a-b)>>52));
result = min(abs(a), abs(b));
// result has the wrong sign ...
The obvious algorithm is to compare the absolute values, and use that to select the original values.
If this absolutely needs to be branchless (e.g. for crypto security), be careful with ? : ternary. It often compiles to branchless asm but that's not guaranteed. (I assume that's why you tagged branch-prediction? If it was just out of performance concerns, the compiler will generally make good decisions.)
In languages with fixed-width 2's complement integers, remember that abs(INT_MIN) overflows a signed result of the same width as the input. In C and C++, abs() is inconveniently designed to return an int, and it's undefined behaviour to call it with the most-negative 2's complement integer on 2's complement systems. On systems with well-defined wrapping signed-int math (like gcc -fwrapv, or maybe Java), signed abs(INT_MIN) would overflow back to INT_MIN, giving wrong results if you do a signed compare, because INT_MIN is maximally far from 0.
Make sure you do an unsigned compare of the abs results so you correctly handle INT_MIN. (Or as @kaya3 suggests, map positive integers to negative, instead of negative to positive.)
Safe C implementation that avoids Undefined Behaviour:
unsigned absu(int x) {
    return x < 0 ? 0U - x : x;
}
int minabs(int a, int b) {
    return absu(a) < absu(b) ? a : b;
}
Note that < vs. <= actually matters in minabs: that decides which one to select if their magnitudes are equal.
0U - x converts x to unsigned before the subtraction from 0, so instead of the signed overflow you'd get for INT_MIN, the arithmetic simply wraps. Converting negative signed-integer types to unsigned is well-defined in C and C++ as modulo reduction (unlike for floats, where an out-of-range conversion is UB, IIRC). On 2's complement machines that means keeping the same bit-pattern unchanged.
This compiles nicely for x86-64 (Godbolt), especially with clang. (GCC avoids cmov even with -march=skylake, ending up with a worse sequence. Except for the final select after doing both absu operations, then it uses cmovbe which is 2 uops instead of 1 for cmovb on Intel CPUs, because it needs to read both ZF and CF flags. If it ended up with the opposite value in EAX already, it could have used cmovb.)
# clang -O3
absu:
mov eax, edi
neg eax # sets flags like sub-from-0
cmovl eax, edi # select on signed less-than condition
ret
minabs:
mov ecx, edi
neg ecx
cmovl ecx, edi # inlined absu(a)
mov eax, esi
mov edx, esi
neg edx
cmovl edx, esi # inlined absu(b)
cmp ecx, edx # compare absu results
cmovb eax, edi # select on unsigned Below condition.
ret
Fully branchless with both GCC and clang, with optimization enabled. It's a safe bet that other ISAs will be the same.
It might auto-vectorize decently, but x86 doesn't have SIMD unsigned integer compares until AVX512. (You can emulate by flipping the high bit to use signed integer pcmpgtd).
For float / double, abs is cheaper and can't overflow: just clear the sign bit, then use that to select the original.
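A minimal sketch of that float/double case (my own naming, echoing the closerToZero from the question; as noted above, the ternary may or may not compile to branchless code):
#include <cmath>

// Return whichever of a or b is closer to zero.
// std::fabs just clears the sign bit, so unlike the integer case there is
// no overflow to worry about.
double closerToZero(double a, double b) {
    return std::fabs(a) < std::fabs(b) ? a : b;   // on a magnitude tie, b wins
}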

If CPU is a binary machine, why is it slow on bit manipulations?

I found that, contrary to their binary / bi-state nature, x86 CPUs are very slow when processing bit-manipulation instructions such as SHR, BT, BTR, ROL and the like.
For example, I've read somewhere that bit shifting / rotating by more than 1 position is considered slow (with high latency, a performance penalty and all that scary stuff). It's even worse when the operands are in memory (isn't memory a bi-state peripheral, too?)
shl eax,1 ;ok
shl eax,7 ;slow?
So what's making them slow? It's kind of ironic that binary machines like CPUs are slow at bit manipulation, when such operations are supposed to be natural to them. It gives the impression that a binary CPU has a hard time shifting bits in place!
EDIT: After taking a second look at the SHL entry in the manual, it does involve some heavy microcode logic!
From Intel's vol.2 manual for shl...
Operation
TemporaryCount = Count & 0x1F;
TemporaryDestination = Destination;
while(TemporaryCount != 0) {
    if(Instruction == SAL || Instruction == SHL) {
        CF = MSB(Destination);
        Destination = Destination << 1;
    }
    else { // instruction is SAR or SHR
        CF = LSB(Destination);
        if(Instruction == SAR)
            Destination = Destination / 2; // signed divide, rounding toward negative infinity
        else // instruction is SHR
            Destination = Destination / 2; // unsigned divide
    }
    TemporaryCount = TemporaryCount - 1;
}
// Determine overflow
if((Count & 0x1F) == 1) {
    if(Instruction == SAL || Instruction == SHL) OF = MSB(Destination) ^ CF;
    else if(Instruction == SAR) OF = 0;
    else OF = MSB(TemporaryDestination); // instruction is SHR
}
else OF = Undefined;
It's unbelievable to see such simple boolean algebra turned into an implementation nightmare.
That is just the pseudo-code for the instruction, specifying exactly what it does. The instruction is not actually implemented like this. In practice, all modern CPUs have barrel shifters or similar, allowing them to shift by arbitrary amounts in a single cycle. See for example Agner Fog's tables where it shows a latency of 1 for almost all bit-fiddling instructions.
A few bit-fiddling instructions are slower, here are some examples:
bt, btr, bts, and btc are slower when used with memory operands because of the (a) read-modify-write operation and (b) the bitstring indexing they do
rcr with a rotate amount of more than 1 is slow because this instruction is almost never needed and thus not optimised
pdep and pext are slightly slower on Intel and much slower on AMD, probably because their implementation is pretty involved and splitting the implementation up (into many simpler operations) makes it easier.
On old processors (say, an 8086), the CPU would take as many cycles as the shift amount was, doing one shift every cycle. This kind of implementation allows the ALU to be used for shifting without any extra hardware, reducing the number of gates needed for the processor. No modern CPU I know of has this performance behaviour.
Just a note.
shl eax,1 ; opcode: d1 e0
shl eax,7 ; opcode: c1 e0 07
are actually different instructions with different opcodes, which are potentially handled by different logic blocks of the ALU. They use the same mnemonic in assembly, and that can be confusing, but from the viewpoint of the CPU they are different instructions with different opcodes and encodings.

My algorithm is too slow

I have an algorithm where, for an integer x and a starting integer i such that 1 < i < x, the next value of i is computed by i = floor(x / i) + (x mod i). This continues until we reach an i that we've already seen.
In JavaScript (though this question is language agnostic):
function f(x, i) {
    var map = {};
    while (!map[i]) {
        map[i] = true;
        i = Math.floor(x / i) + (x % i); // ~~(x / i) is a faster way of flooring
    }
    return i;
}
I can prove that we will eventually reach an i we've already seen, but I'm wondering:
Is there is a more efficient way of computing the next i?
(More importantly) Is there is a way to compute the nth i without running through the loop n times?
Just to clarify - I know there are faster ways than using JS hash maps for that check, and that flooring can be replaced by integer division in other languages. I have made both of those optimizations, but I left them out to try to make the code easier to understand. Sorry for any confusion.
Thanks in advance!
I think the main time eater is the map. It uses some hashing function (probably not a simple one). If the range of i is limited to a reasonable value, it would be better to use a bit/boolean array (or a JavaScript analog of one).
The second is the two divisions. Are floats and integers distinct in JavaScript? It is possible to do a single integer division, getting the modulus with a multiplication and a subtraction (due to the fundamental properties of the integer division/modulo definition):
p = x \\ i
i = p + (x - p * i)
or
i = x - (x \\ i) * (i - 1)
Note: the integer division instruction in most processors calculates both the quotient and the remainder at the same time:
mov eax, 17   // dividend
mov ecx, 3    // divisor
xor edx, edx  // zero the high half
div ecx       // divide the edx:eax pair by ecx
// now eax contains the quotient 5, edx contains the remainder (modulus) 2
If you can use asm in C, or have a function like Delphi's DivMod, you can make the calculation somewhat faster.
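A rough C++ sketch of the single-division idea (the names and the std::vector<char> standing in for the bit/boolean array are mine; it assumes x is small enough that an array of x + 1 flags is acceptable):
#include <cstddef>
#include <vector>

// One integer division per step: the quotient q determines the remainder as
// x - q * i, so the update is i = q + (x - q * i) without a separate % operation.
// (On x86 the div instruction produces both quotient and remainder anyway,
// so a good compiler may already fuse x / i and x % i.)
long long f(long long x, long long i) {
    std::vector<char> seen(static_cast<std::size_t>(x) + 1, 0);  // i stays within [1, x]
    while (!seen[i]) {
        seen[i] = 1;
        long long q = x / i;
        i = q + (x - q * i);
    }
    return i;
}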
