Mathematically find the value that is closest to 0 - algorithm

Is there a way to determine mathematically if a value is closer to 0 than another?
For example closerToZero(-2, 3) would return -2.
I tried removing the sign and then comparing the values for the minimum, but then I would be returning the sign-less version of the initial number rather than the original value.
a and b are IEEE-754 compliant floating-point doubles (js number)
(64 bit => 1 bit sign 11 bit exponent 52 bit fraction)
min (a,b) => b-((a-b)&((a-b)>>52));
result = min(abs(a), abs(b));
// result has the wrong sign ...

The obvious algorithm is to compare the absolute values, and use that to select the original values.
If this absolutely needs to be branchless (e.g. for crypto security), be careful with ? : ternary. It often compiles to branchless asm but that's not guaranteed. (I assume that's why you tagged branch-prediction? If it was just out of performance concerns, the compiler will generally make good decisions.)
In languages with fixed-width 2's complement integers, remember that abs(INT_MIN) overflows a signed result of the same width as the input. In C and C++, abs() is inconveniently designed to return an int, and it's undefined behaviour to call it with the most-negative 2's complement integer. On systems with well-defined wrapping signed-int math (like gcc -fwrapv, or Java), signed abs(INT_MIN) would overflow back to INT_MIN, giving wrong results if you do a signed compare, because INT_MIN is maximally far from 0.
Make sure you do an unsigned compare of the abs results so you correctly handle INT_MIN. (Or as @kaya3 suggests, map positive integers to negative, instead of negative to positive; a sketch of that follows.)
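A sketch of that alternative, mapping both values onto their non-positive "negative absolute value" (the names nabs / minabs_nabs are just illustrative): nabs can never overflow because INT_MIN is its own nabs, and the value closer to zero is the one with the greater nabs.
int nabs(int x) {
    return x < 0 ? x : -x;               // never overflows: nabs(INT_MIN) == INT_MIN
}

int minabs_nabs(int a, int b) {
    return nabs(a) >= nabs(b) ? a : b;   // >= picks 'a' when magnitudes are equal
}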
Safe C implementation that avoids Undefined Behaviour:
unsigned absu(int x) {
    return x < 0 ? 0U - x : x;
}

int minabs(int a, int b) {
    return absu(a) < absu(b) ? a : b;
}
Note that < vs. <= actually matters in minabs: that decides which one to select if their magnitudes are equal.
0U - x converts x to unsigned before the subtract from 0, so any wrap-around there is well-defined modular arithmetic, not signed overflow. Converting negative signed-integer values to unsigned is well-defined in C and C++ as modulo reduction (unlike converting out-of-range floats, which is UB). On 2's complement machines that means the bit-pattern is unchanged.
This compiles nicely for x86-64 (Godbolt), especially with clang. (GCC avoids cmov even with -march=skylake, ending up with a worse sequence. Except for the final select after doing both absu operations, then it uses cmovbe which is 2 uops instead of 1 for cmovb on Intel CPUs, because it needs to read both ZF and CF flags. If it ended up with the opposite value in EAX already, it could have used cmovb.)
# clang -O3
absu:
    mov     eax, edi
    neg     eax             # sets flags like sub-from-0
    cmovl   eax, edi        # select on signed less-than condition
    ret
minabs:
    mov     ecx, edi
    neg     ecx
    cmovl   ecx, edi        # inlined absu(a)
    mov     eax, esi
    mov     edx, esi
    neg     edx
    cmovl   edx, esi        # inlined absu(b)
    cmp     ecx, edx        # compare absu results
    cmovb   eax, edi        # select on unsigned Below condition
    ret
Fully branchless with both GCC and clang, with optimization enabled. It's a safe bet that other ISAs will be the same.
It might auto-vectorize decently, but x86 doesn't have SIMD unsigned integer compares until AVX512. (You can emulate by flipping the high bit to use signed integer pcmpgtd).
For float / double, abs is cheaper and can't overflow: just clear the sign bit, then use that to select the original.
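A minimal sketch of that floating-point version (the function name is mine; NaN handling not considered): fabs just clears the sign bit, and the ternary usually compiles to a branchless compare + select.
#include <math.h>

double closer_to_zero(double a, double b) {
    return fabs(a) <= fabs(b) ? a : b;   // <= picks 'a' when magnitudes are equal
}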

Related

Fast saturating integer conversion?

I'm wondering if there is any fast bit twiddling trick to do a saturating conversion from a 64-bit unsigned value to a 32-bit unsigned value (it would be nice if it is generalized to other widths but that's the main width I care about). Most of the resources I've been able to find googling have been for saturating arithmetic operations.
A saturating conversion would take a 64-bit unsigned value, and return either the value unmodified as a 32-bit value or 2^32-1 if the input value is greater than 2^32-1. Note that this is not what the default C casting truncating behaviour does.
I can imagine doing something like:
Test if upper half has any bit set
If so create a 32-bit mask with all bits set, otherwise create a mask with all bits unset
Bitwise-or lower half with mask
But I don't know how to quickly generate the mask. I tried straightforward branching implementations in Godbolt to see if the compiler would generate a clever branchless implementation for me but no luck.
Implementation example here.
#include <stdint.h>
#include <limits.h>

uint32_t square(uint64_t num) {
    return num > UINT32_MAX ? UINT32_MAX : num;
}
Edit: my mistake, issue was godbolt not set to use optimizations
You don't need to do any fancy bit twiddling trick to do this. The following function should be enough for compilers to generate efficient code:
uint32_t saturate(uint64_t value) {
    return value > UINT32_MAX ? UINT32_MAX : value;
}
This contains a conditional statement, but most common CPUs, like AMD/Intel and Arm ones, have conditional move instructions. So they will test the value for overflowing 32 bits, and based on the test they will replace it with UINT32_MAX, otherwise leave it alone. For example, on 64-bit Arm processors this function will be compiled by GCC to:
saturate:
    mov     x1, 4294967295
    cmp     x0, x1
    csel    x0, x0, x1, ls
    ret
Note that you must enable compiler optimizations to get the above result.
A way to do this without relying on conditional moves is
((-(x >> 32)) | (x << 32)) >> 32
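Wrapped in a function for clarity (a sketch; the name is illustrative): -(x >> 32) is all-ones exactly when the upper half is non-zero, so the OR forces the low half to all-ones; otherwise it leaves the low half unchanged.
#include <stdint.h>

uint32_t saturate_mask(uint64_t x) {
    return (uint32_t)(((-(x >> 32)) | (x << 32)) >> 32);
}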

How is this faster? 52-bit modulo multiply using "FPU trick" faster than inline ASM on x64

I have found that this:
#define mulmod52(a,b,m) (((a * b) - (((uint64_t)(((double)a * (double)b) / (double)m) - 1ULL) * m)) % m)
... is faster than:
static inline uint64_t _mulmod(uint64_t a, uint64_t b, uint64_t n) {
    uint64_t d, dummy;                    /* d will get a*b mod c */
    asm ("mulq %3\n\t"                    /* mul a*b -> rdx:rax */
         "divq %4\n\t"                    /* (a*b)/c -> quot in rax, remainder in rdx */
         : "=a"(dummy), "=&d"(d)          /* output */
         : "a"(a), "rm"(b), "rm"(n)       /* input */
         : "cc"                           /* mulq and divq can set conditions */
    );
    return d;
}
The former is a trick to exploit the FPU to compute modular multiplication of two up to 52 bit numbers. The latter is simple X64 ASM to compute modular multiplication of two 64 bit numbers, and of course it also works just fine for only 52 bits.
The former is faster than the latter by about 5-15% depending on where I test it.
How is this possible given that the FPU trick also involves one integer multiply and one integer divide (modulus) plus additional FPU work? There's something I'm not understanding here. Is it some weird compiler artifact such as the asm inline ruining compiler optimization passes?
On pre-Icelake processors, such as Skylake, there is a big difference between a "full" 128-bit-by-64-bit division and a "half" 64-bit-by-64-bit division (where the upper qword is zero). The full one can take up to nearly 100 cycles (it varies a bit with the value in rdx, but there is a sudden "cliff" as soon as rdx is non-zero, even if it's just 1); the half one is more like 30 to 40-ish cycles depending on the µarch.
The 64-bit floating point division is (for a division) relatively fast at around 14 to 20 cycles depending on the µarch, so even with that and some other even-less-significant overhead thrown in, that's not enough to eat up the roughly 60-cycle advantage that the "half" division has over the "full" division. So the complicated floating point version can come out ahead.
Icelake apparently has an amazing divider that can do a full division in 18 cycles (and a "half" division isn't faster), so the inline asm should be good on Icelake.
On AMD Ryzen, divisions with a non-zero upper qword seem to get slower more gradually as rdx gets higher (less of a "perf cliff").
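For readability, the macro can also be written as an inline function (same algorithm, a sketch that assumes a, b and m all fit in 52 bits, as the macro does):
#include <stdint.h>

static inline uint64_t mulmod52_fn(uint64_t a, uint64_t b, uint64_t m) {
    // Estimate the quotient with a double-precision divide (the FPU part),
    // biased down by 1 as in the macro so it shouldn't exceed the true quotient.
    uint64_t q = (uint64_t)((double)a * (double)b / (double)m) - 1;
    // a*b and q*m may wrap modulo 2^64, but the wrap cancels because the true
    // difference is small; one 64-by-64 ("half") remainder finishes the reduction.
    return (a * b - q * m) % m;
}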

Counting elements "less than x" in an array

Let's say you want to find the first occurrence of a value1 in a sorted array. For small arrays (where things like binary search don't pay off), you can achieve this by simply counting the number of values less than that value: the result is the index you are after.
In x86 you can use adc (add with carry) for an efficient branch-free2 implementation of that approach (with the start pointer in rdi, the length in rsi, and the value to search for in edx):
    xor eax, eax
    lea rdi, [rdi + rsi*4]      ; pointer to end of array = base + length
    neg rsi                     ; we loop from -length to zero
loop:
    cmp [rdi + 4 * rsi], edx
    adc rax, 0                  ; only a single uop on Sandybridge-family even before BDW
    inc rsi
    jnz loop
The answer ends up in rax. If you unroll that (or if you have a fixed, known input size), only the cmp; adc pair of instructions gets repeated, so the overhead approaches 2 simple instructions per comparison (and the sometimes-fused load). (See also: Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?)
However, this only works for unsigned comparisons, where the carry flag holds the result of the comparison. Is there any equivalently efficient sequence for counting signed comparisons? Unfortunately, there doesn't seem to be an "add 1 if less than" instruction: adc, sbb and the carry flag are special in that respect.
I am interested in the general case where the elements have no specific order, and also in the case where the array is sorted, in case the sortedness assumption leads to a simpler or faster implementation.
1 Or, if the value doesn't exist, the first greater value. I.e., this is the so called "lower bound" search.
2 Branch-free approaches necessarily do the same amount of work each time - in this case examining the entire array - so this approach only makes sense when the arrays are small and the cost of a branch misprediction is large relative to the total search time.
PCMPGT + PADDD or PSUBD is probably a really good idea for most CPUs, even for small sizes, maybe with a simple scalar cleanup. Or even just purely scalar, using movd loads, see below.
For scalar integer, avoiding XMM regs, use SETCC to create a 0/1 integer from any flag condition you want. xor-zero a tmp register (potentially outside the loop) and SETCC into the low 8 of that, if you want to use 32 or 64-bit ADD instructions instead of only 8-bit.
cmp/adc reg,0 is basically a peephole optimization for the below / carry-set condition. AFAIK, there is nothing as efficient for signed-compare conditions. At best 3 uops for cmp/setcc/add, vs. 2 for cmp/adc. So unrolling to hide loop overhead is even more important.
See the bottom section of What is the best way to set a register to zero in x86 assembly: xor, mov or and? for more details about how to zero-extend SETCC r/m8 efficiently but without causing partial-register stalls. And see Why doesn't GCC use partial registers? for a reminder of partial-register behaviour across uarches.
Yes, CF is special for a lot of things. It's the only condition flag that has set/clear/complement (stc/clc/cmc) instructions1. There's a reason that bt/bts/etc. instructions set CF, and that shift instructions shift into it. And yes, ADC/SBB can add/sub it directly into another register, unlike any other flag.
OF can be read similarly with ADOX (Intel since Broadwell, AMD since Ryzen), but that still doesn't help us because it's strictly OF, not the SF!=OF signed-less-than condition.
This is typical for most ISAs, not just x86. (AVR and some others can set/clear any condition flag because they have an instruction that takes an immediate bit-position in the status register. But they still only have ADC/SBB for directly adding the carry flag to an integer register.)
ARM 32-bit can do a predicated addlt r0, r0, #1 using any condition-code, including signed less-than, instead of an add-with-carry with immediate 0. ARM does have ADC-immediate which you could use for the C flag here, but not in Thumb mode (where it would be useful to avoid an IT instruction to predicate an ADD), so you'd need a zeroed register.
AArch64 can do a few predicated things, including increment with cinc with arbitrary condition predicates.
But x86 can't. We only have cmovcc and setcc to turn conditions other than CF==1 into integers. (Or with ADOX, for OF==1.)
Footnote 1: Some status flags in EFLAGS like interrupts IF (sti/cli), direction DF (std/cld), and alignment-check (stac/clac) have set/clear instructions, but not the condition flags ZF/SF/OF/PF or the BCD-carry AF.
cmp [rdi + 4 * rsi], edx will un-laminate even on Haswell/Skylake because of the indexed addressing mode, since it doesn't have a read/write destination register (so it's not like add reg, [mem]).
If tuning only for Sandybridge-family, you might as well just increment a pointer and decrement the size counter. Although this does save back-end (unfused-domain) uops for RS-size effects.
In practice you'd want to unroll with a pointer increment.
You mentioned sizes from 0 to 32, so we need to skip the loop if RSI = 0. The code in your question is just a do{}while which doesn't do that. NEG sets flags according to the result, so we can JZ on that. You'd hope that it could macro-fuse because NEG is exactly like SUB from 0, but according to Agner Fog it doesn't on SnB/IvB. So that costs us another uop in the startup if you really do need to handle size=0.
Using integer registers
The standard way to implement integer += (a < b) or any other flag condition is what compilers do (Godbolt):
    xor edx,edx             ; can be hoisted out of a short-running loop, but compilers never do that
                            ; but an interrupt-handler will destroy the rdx=dl status
    cmp/test/whatever       ; flag-setting code here
    setcc dl                ; zero-extended to a full register because of earlier xor-zeroing
    add eax, edx
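For reference, this is the kind of C source that compilers typically turn into that pattern when auto-vectorization is disabled (a sketch; the function name is mine):
#include <stddef.h>

size_t count_lt(const int *arr, size_t n, int key) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (arr[i] < key);    // branchless: cmp / setcc / add
    return count;
}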
Sometimes compilers (especially gcc) will use setcc dl / movzx edx,dl, which puts the MOVZX on the critical path. This is bad for latency, and mov-elimination doesn't work on Intel CPUs when they use (part of) the same register for both operands.
For small arrays, if you don't mind having only an 8-bit counter, you could just use 8-bit add so you don't have to worry about zero-extension inside the loop.
; slower than cmp/adc: 5 uops per iteration so you'll definitely want to unroll.
; requires size<256 or the count will wrap
; use the add eax,edx version if you need to support larger size
count_signed_lt:            ; (int *arr, size_t size, int key)
    xor     eax, eax
    mov     ecx, edx        ; copy key out of edx so dl is free as the SETCC destination
    lea     rdi, [rdi + rsi*4]
    neg     rsi             ; we loop from -length to zero
    jz      .return         ; if(-size == 0) return 0;
    ;; xor edx, edx         ; tmp destination for SETCC
.loop:
    cmp     [rdi + 4 * rsi], ecx
    setl    dl              ; false dependency on old RDX on CPUs other than P6-family
    add     al, dl
    ;; add eax, edx         ; boolean condition zero-extended into RDX if it was xor-zeroed
    inc     rsi
    jnz     .loop
.return:
    ret
Alternatively using CMOV, making the loop-carried dep chain 2 cycles long (or 3 cycles on Intel before Broadwell, where CMOV is 2 uops):
;; 3 uops without any partial-register shenanigans (or 4 because of unlamination)
;; but creates a 2 cycle loop-carried dep chain
    cmp     [rdi + 4 * rsi], edx
    lea     ecx, [rax + 1]  ; tmp = count+1
    cmovl   eax, ecx        ; count = arr[i]<key ? count+1 : count
So at best (with loop unrolling and a pointer-increment allowing cmp to micro-fuse) this takes 3 uops per element instead of 2.
SETCC is a single uop, so this is 5 fused-domain uops inside the loop. That's much worse on Sandybridge/IvyBridge, and still runs at worse than 1 per clock on later SnB-family. (Some ancient CPUs had slow setcc, like Pentium 4, but it's efficient on everything we still care about.)
When unrolling, if you want this to run faster than 1 cmp per clock, you have two choices: use separate registers for each setcc destination, creating multiple dep chains for the false dependencies, or use one xor edx,edx inside the loop to break the loop-carried false dependency into multiple short dep chains that only couple the setcc results of nearby loads (probably coming from the same cache line). You'll also need multiple accumulators because add latency is 1c.
Obviously you'll need to use a pointer-increment so cmp [rdi], edx can micro-fuse with a non-indexed addressing mode, otherwise the cmp/setcc/add is 4 uops total, and that's the pipeline width on Intel CPUs.
There's no partial-register stall from the caller reading EAX after writing AL, even on P6-family, because we xor-zeroed it first. Sandybridge won't rename it separately from RAX because add al,dl is a read-modify-write, and IvB and later never rename AL separately from RAX (only AH/BH/CH/DH). CPUs other than P6 / SnB-family don't do partial-register renaming at all, only partial flags.
The same applies for the version that reads EDX inside the loop. But an interrupt-handler saving/restoring RDX with push/pop would destroy its xor-zeroed status, leading to partial-register stalls every iteration on P6-family. This is catastrophically bad, so that's one reason compilers never hoist the xor-zeroing. They usually don't know if a loop will be long-running or not, and won't take the risk. By hand, you'd probably want to unroll and xor-zero once per unrolled loop body, rather than once per cmp/setcc.
You can use SSE2 or MMX for scalar stuff
Both are baseline on x86-64. Since you're not gaining anything (on SnB-family) from folding the load into the cmp, you might as well use a scalar movd load into an XMM register. MMX has the advantage of smaller code-size, but requires EMMS when you're done. It also allows unaligned memory operands, so it's potentially interesting for simpler auto-vectorization.
Until AVX512, we only have comparison for greater-than available, so it would take an extra movdqa xmm,xmm instruction to do key > arr[i] without destroying key, instead of arr[i] > key. (This is what gcc and clang do when auto-vectorizing).
AVX would be nice, for vpcmpgtd xmm0, xmm1, [rdi] to do key > arr[i], like gcc and clang use with AVX. But that's a 128-bit load, and we want to keep it simple and scalar.
We can decrement key and use (arr[i]<key) = (arr[i] <= key-1) = !(arr[i] > key-1). We can count elements where the array is greater-than key-1, and subtract that from the size. So we can make do with just SSE2 without costing extra instructions.
If key was already the most-negative number (so key-1 would wrap), then no array elements can be less than it. This does introduce a branch before the loop if that case is actually possible.
; signed version of the function in your question
; using the low element of XMM vectors
count_signed_lt:            ; (int *arr, size_t size, int key)
                            ; actually only works for size < 2^32
    dec     edx             ; key-1
    jo      .key_eq_int_min
    movd    xmm2, edx       ; not broadcast, we only use the low element
    movd    xmm1, esi       ; counter = size, decrement toward zero on elements >= key
    ;; pxor xmm1, xmm1      ; counter
    ;; mov eax, esi         ; save original size for a later SUB
    lea     rdi, [rdi + rsi*4]
    neg     rsi             ; we loop from -length to zero
.loop:
    movd    xmm0, [rdi + 4 * rsi]
    pcmpgtd xmm0, xmm2      ; xmm0 = arr[i] gt key-1 = arr[i] >= key = not less-than
    paddd   xmm1, xmm0      ; counter += 0 or -1
    ;; psubd xmm1, xmm0     ; -0 or -(-1) to count upward
    inc     rsi
    jnz     .loop
    movd    eax, xmm1       ; size - count(elements > key-1)
    ret

.key_eq_int_min:
    xor     eax, eax        ; no array elements are less than the most-negative number
    ret
This should be the same speed as your loop on Intel SnB-family CPUs, plus a tiny bit of extra overhead outside the loop. It's 4 fused-domain uops, so it can issue at 1 per clock. A movd load uses a regular load port, and there are at least 2 vector ALU ports that can run PCMPGTD and PADDD.
Oh, but on IvB/SnB the macro-fused inc/jnz requires port 5, while PCMPGTD / PADDD both only run on p1/p5, so port 5 throughput will be a bottleneck. On HSW and later the branch runs on port 6, so we're fine for back-end throughput.
It's worse on AMD CPUs where a memory-operand cmp can use an indexed addressing mode without a penalty. (And on Intel Silvermont, and Core 2 / Nehalem, where memory-source cmp can be a single uop with an indexed addressing mode.)
And on Bulldozer-family, a pair of integer cores share a SIMD unit, so sticking to integer registers could be an even bigger advantage. That's also why int<->XMM movd/movq has higher latency, again hurting this version.
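For reference, here is roughly the same scalar-in-XMM loop written with SSE2 intrinsics (a sketch; the function name is mine, and a compiler won't necessarily emit exactly the asm above):
#include <emmintrin.h>   // SSE2
#include <limits.h>
#include <stddef.h>

int count_signed_lt_sse2(const int *arr, size_t size, int key) {  // size < 2^31
    if (key == INT_MIN)
        return 0;                                    // nothing is < INT_MIN
    __m128i vkeym1 = _mm_cvtsi32_si128(key - 1);     // movd: key-1 in the low element
    __m128i count  = _mm_cvtsi32_si128((int)size);   // start at size, count downward
    for (size_t i = 0; i < size; i++) {
        __m128i elem = _mm_cvtsi32_si128(arr[i]);     // movd scalar load
        __m128i ge   = _mm_cmpgt_epi32(elem, vkeym1); // arr[i] > key-1 == !(arr[i] < key)
        count = _mm_add_epi32(count, ge);             // += 0 or -1
    }
    return _mm_cvtsi128_si32(count);                 // size - count(arr[i] >= key)
}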
Other tricks:
Clang for PowerPC64 (included in the Godbolt link) shows us a neat trick: zero or sign-extend to 64-bit, subtract, and then grab the MSB of the result as a 0/1 integer that you add to counter. PowerPC has excellent bitfield instructions, including rldicl. In this case, it's being used to rotate left by 1, and then zero all bits above that, i.e. extracting the MSB to the bottom of another register. (Note that PowerPC documentation numbers bits with MSB=0, LSB=63 or 31.)
If you don't disable auto-vectorization, it uses Altivec with a vcmpgtsw / vsubuwm loop, which I assume does what you'd expect from the names.
# PowerPC64 clang 9-trunk -O3 -fno-tree-vectorize -fno-unroll-loops -mcpu=power9
# signed int version
# I've added "r" to register names, leaving immediates alone, because clang doesn't have `-mregnames`
    ... setup
.LBB0_2:                        # do {
    lwzu    r5, 4(r6)           # zero-extending load, and update the address register with the effective address, i.e. pre-increment
    extsw   r5, r5              # sign-extend word (to doubleword)
    sub     r5, r5, r4          # 64-bit subtract
    rldicl  r5, r5, 1, 63       # rotate-left doubleword immediate then clear left
    add     r3, r3, r5          # retval += MSB of (int64_t)arr[i] - key
    bdnz    .LBB0_2             # } while(--loop_count);
I think clang could have avoided the extsw inside the loop if it had used an arithmetic (sign-extending) load. The only lwa that updates the address register (saving an increment) seems to be the indexed form lwaux RT, RA, RB, but if clang put 4 in another register it could use it. (There doesn't seem to be a lwau instruction.) Maybe lwaux is slow or maybe it's a missed optimization. I used -mcpu=power9 so even though that instruction is POWER-only, it should be available.
This trick could sort of help for x86, at least for a rolled-up loop.
It takes 4 uops this way per compare, not counting loop overhead. Despite x86's pretty bad bitfield extract capabilities, all we actually need is a logical right-shift to isolate the MSB.
count_signed_lt:            ; (int *arr, size_t size, int key)
    xor     eax, eax
    movsxd  rdx, edx
    lea     rdi, [rdi + rsi*4]
    neg     rsi             ; we loop from -length to zero
.loop:
    movsxd  rcx, dword [rdi + 4 * rsi]   ; 1 uop, pure load
    sub     rcx, rdx        ; (int64_t)arr[i] - key
    shr     rcx, 63         ; extract MSB
    add     eax, ecx        ; count += MSB of (int64_t)arr[i] - key
    inc     rsi
    jnz     .loop
    ret
This doesn't have any false dependencies, but neither does 4-uop xor-zero / cmp / setl / add. The only advantage here is that this is 4 uops even with an indexed addressing mode. Some AMD CPUs may run MOVSXD through an ALU as well as a load port, but Ryzen has the same latency for it as for regular loads.
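In C, that sign-extend / subtract / shift-out-the-MSB trick looks roughly like this (a sketch; the function name is mine):
#include <stddef.h>
#include <stdint.h>

size_t count_lt_msb(const int32_t *arr, size_t n, int32_t key) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        // The widened difference can't overflow, so its MSB is exactly the
        // signed (arr[i] < key) result.
        int64_t diff = (int64_t)arr[i] - key;
        count += (uint64_t)diff >> 63;
    }
    return count;
}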
If you have fewer than 64 iterations, you could do something like this if only throughput matters, not latency. (But you can probably still do better with setl)
.loop:
    movsxd  rcx, dword [rdi + 4 * rsi]   ; 1 uop, pure load
    sub     rcx, rdx        ; (int64_t)arr[i] - key
    shld    rax, rcx, 1     ; 3 cycle latency
    inc rsi / jnz .loop

    popcnt  rax, rax        ; turn the bitmap of compare results into an integer
But the 3-cycle latency of shld makes this a showstopper for most uses, even though it's only a single uop on SnB-family. The rax->rax dependency is loop-carried.
There's a trick to convert a signed comparison to an unsigned comparison and vice versa by toggling the top bit
bool signedLessThan(int a, int b)
{
    // flip the sign bit of both sides, then compare as unsigned
    return ((unsigned)a ^ INT_MIN) < ((unsigned)b ^ INT_MIN); // or add 0x80000000U to both
}
It works because the ordering in 2's complement is still linear after toggling the top bit; the flip just swaps the signed and unsigned halves of the range. So the simplest way may be XORing both values before the comparison:
    xor eax, eax
    xor edx, 0x80000000     ; adjusting the search value
    lea rdi, [rdi + rsi*4]  ; pointer to end of array = base + length
    neg rsi                 ; we loop from -length to zero
loop:
    mov ecx, [rdi + 4 * rsi]
    xor ecx, 0x80000000
    cmp ecx, edx
    adc rax, 0              ; only a single uop on Sandybridge-family even before BDW
    inc rsi
    jnz loop
If you can modify the array then just do the conversion before checking
In ADX there's ADOX, which uses OF as the carry input. Unfortunately signed comparison also needs SF, not only OF, so you can't just use it like this:
    xor ecx, ecx
loop:
    cmp [rdi + 4 * rsi], edx
    adox rax, rcx           ; rcx=0; ADOX is not available with an immediate operand
and you'd have to do some more bit manipulation to correct the result.
In the case the array is guaranteed to be sorted, one could use cmovl with an "immediate" value representing the correct value to add. There are no immediates for cmovl, so you'll have to load them into registers beforehand.
This technique makes sense when unrolled, for example:
; load constants
    mov r11, 1
    mov r12, 2
    mov r13, 3
    mov r14, 4
loop:
    xor ecx, ecx
    cmp [rdi + 0], edx
    cmovl rcx, r11
    cmp [rdi + 4], edx
    cmovl rcx, r12
    cmp [rdi + 8], edx
    cmovl rcx, r13
    cmp [rdi + 12], edx
    cmovl rcx, r14
    add rax, rcx
    ; update rdi, test loop condition, etc
    jcc loop
You have 2 uops per comparison, plus overhead. There is a 4-cycle (BDW and later) dependency chain between the cmovl instructions, but it is not carried.
One disadvantage is that you have to set up the 1,2,3,4 constants outside of the loop. It also doesn't work as well if not unrolled (you need to amortize the add rax, rcx accumulation).
Assuming the array is sorted, you could make separate code branches for positive and negative needles. You will need a branch instruction at the very start, but after that, you can use the same branch-free implementation you would use for unsigned numbers. I hope that is acceptable.
needle >= 0:
go through the array in ascending order
begin by counting every negative array element
proceed with the positive numbers just like you would in the unsigned scenario
needle < 0:
go through the array in descending order
begin by skipping every positive array element
proceed with the negative numbers just like you would in the unsigned scenario
Unfortunately, in this approach you cannot unroll your loops.
An alternative would be to go through the array twice: once with the needle, and again to find the number of positive or negative elements (using a 'needle' that equals the minimum signed integer).
(unsigned) count the elements < needle
(unsigned) count the elements >= 0x80000000
add the results
if needle < 0, subtract the array length from the result
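In C, that algorithm looks roughly like this (a sketch, with both counts fused into one pass; the names are mine):
#include <stddef.h>
#include <stdint.h>

size_t count_lt_signed(const int32_t *arr, size_t len, int32_t needle) {
    uint32_t un = (uint32_t)needle;
    size_t below_needle = 0, negatives = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t e = (uint32_t)arr[i];
        below_needle += (e < un);            // unsigned compare against the needle
        negatives    += (e >= 0x80000000u);  // unsigned compare against INT_MIN
    }
    size_t result = below_needle + negatives;
    if (needle < 0)
        result -= len;                       // subtract the array length
    return result;
}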
There's probably a lot that could be optimized in my code below. I'm quite rusty at this.
; NOTE: no need to initialize eax here!
    lea rdi, [rdi + rsi*4]  ; pointer to end of array = base + length
    neg rsi                 ; we loop from -length to zero
    mov ebx, 80000000h      ; minimum signed integer (need this in the loop too)
    cmp edx, ebx            ; carry set if needle non-negative
    cmc                     ; flip it: carry set if needle negative
    sbb rax, rax            ; -1 if needle negative, otherwise zero
    and rax, rsi            ; -length if needle negative, otherwise zero
loop:
    cmp [rdi + 4 * rsi], edx
    adc rax, 0              ; +1 if element < needle
    cmp [rdi + 4 * rsi], ebx
    cmc
    adc rax, 0              ; +1 if element >= 0x80000000
    inc rsi
    jnz loop

How are Mathematical Equality Operators Handled at the Machine-Code Level

So I wanted to ask a rather existential question today, and it's one that I feel as though most programmers skip over and just accept as something that works, without really asking the question of "how" it works. The question is rather simple: how is the >= operator compiled down to machine code, and what does that machine code look like? Down at the very bottom, it must be a greater than test, mixed with an "is equal" test. But how is this actually implemented? Thinking about it seems rather paradoxical, because at the very bottom there cannot be a > or == test. There needs to be something else. I want to know what this is.
How do computers test for equality and greater than at the fundamental level?
Indeed there is no > or == test as such. Instead, the lowest level comparison in assembler works by binary subtraction.
On x86, the opcode for integer comparisons is CMP. It is really the one instruction to rule them all. How it works is described for example in 80386 Programmer's reference manual:
CMP subtracts the second operand from the first but, unlike the SUB instruction, does not store the result; only the flags are changed.
CMP is typically used in conjunction with conditional jumps and the SETcc instruction. (Refer to Appendix D for the list of signed and unsigned flag tests provided.) If an operand greater than one byte is compared to an immediate byte, the byte value is first sign-extended.
Basically, CMP A, B (In Intel operand ordering) calculates A - B, and then discards the result. However, in an x86 ALU, arithmetic operations set condition flags inside the flag register of the CPU based on the result of the operation. The flags relevant to arithmetic operations are
Bit  Name  Function
 0   CF    Carry Flag    -- Set on high-order bit carry or borrow; cleared otherwise.
 6   ZF    Zero Flag     -- Set if result is zero; cleared otherwise.
 7   SF    Sign Flag     -- Set equal to high-order bit of result (0 if positive,
                            1 if negative).
11   OF    Overflow Flag -- Set if result is too large a positive number or too
                            small a negative number (excluding sign-bit) to fit in
                            the destination operand; cleared otherwise.
For example if the result of calculation is zero, the Zero Flag ZF is set. CMP A, B executes A - B and discards the result. The result of subtraction is 0 iff A == B. Thus the ZF will be set only when the operands are equal, cleared otherwise.
Carry flag CF would be set iff the unsigned subtraction would result in borrow, i.e. A - B would be < 0 if A and B are considered unsigned numbers and A < B.
The sign flag is set whenever the MSB of the result is set, which means the result, interpreted as a signed 2's complement number, is negative. However, consider the 8-bit subtraction 01111111 (127) - 10000000 (-128): the result is 11111111, which interpreted as an 8-bit signed 2's complement number is -1, even though 127 - (-128) should be 255. A signed integer overflow happened. So the sign flag alone doesn't tell which of the signed quantities was greater - the OF overflow flag tells whether a signed overflow happened in the previous arithmetic operation.
Now, depending on the place where this is used, a Byte Set on Condition SETcc or a Jump if Condition is Met Jcc instruction is used to decode the flags and act on them. If the boolean value is used to set a variable, then a clever compiler would use SETcc; Jcc would be a better match for an if...else.
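For contrast, here is a sketch of the branchy shape, where a compiler would typically pick a conditional jump (Jcc) over one of the two sides rather than materializing a 0/1 value with SETcc (the function and strings are purely illustrative):
#include <stdio.h>

void report_sign(int x) {
    if (x >= 0)                   // cmp/test sets the flags, then a Jcc (e.g. js)
        puts("non-negative");     // decides which call executes
    else
        puts("negative");
}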
Now, there are 2 choices for >=: either we want a signed comparison or an unsigned comparison.
int a, b;
bool r1, r2;
unsigned int c, d;
r1 = a >= b; // signed
r2 = c >= d; // unsigned
In Intel assembly the names of conditions for unsigned comparisons use the words above and below; conditions for signed comparisons use the words greater and less. Thus, for r2 the compiler could decide to use Set on Above or Equal, i.e. SETAE, which sets the target byte to 1 if CF=0. For r1 the result would be decoded by SETGE - Set Byte on Greater or Equal, which means SF=OF - i.e. the result of subtraction interpreted as 2's complement is non-negative without overflow, or negative with overflow having happened.
Finally an example:
#include <stdbool.h>

bool gte_unsigned(unsigned int a, unsigned int b) {
    return a >= b;
}
The resulting optimized code on x86-64 Linux is:
    cmp     edi, esi
    setae   al
    ret
Likewise for signed comparison
bool gte_signed(int a, int b) {
    return a >= b;
}
The resulting assembly is
    cmp     edi, esi
    setge   al
    ret
Here's a simple C function:
bool lt_or_eq(int a, int b)
{
    return (a <= b);
}
On x86-64, GCC compiles this to:
        .file   "lt_or_eq.c"
        .text
        .globl  lt_or_eq
        .type   lt_or_eq, @function
lt_or_eq:
.LFB0:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        movl    %edi, -4(%rbp)
        movl    %esi, -8(%rbp)
        movl    -4(%rbp), %eax
        cmpl    -8(%rbp), %eax
        setle   %al
        popq    %rbp
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
.LFE0:
        .size   lt_or_eq, .-lt_or_eq
The important part is the cmpl -8(%rbp), %eax; setle %al; sequence. Basically, it's using the cmp instruction to compare the two arguments numerically and set the status flags based on that comparison. It then uses setle to decide whether to set the %al register to 0 or 1, depending on the state of those flags. The caller gets the return value from the %al register.
First the computer needs to figure out the type of the data. In a language like C this is known at compile time; Python would dispatch to different type-specific tests at run time. Assuming we are coming from a compiled language, and that we know the values being compared are integers, the compiler would make sure the values are in registers and then issue something like:
SUBS r1, r2
BGE target
subtracting the registers and then branching on the condition flags the subtraction set. These instructions are built-in operations on the CPU. (I'm assuming something ARM-like here; there are many variations.)

Speed of Comparison operators

In languages such as... well anything, both operators for < and <= (and their opposites) exist. Which would be faster, and how are they interpreted?
if (x <= y) { blah; }
or
if (x < y + 1) { blah; }
Assuming no compiler optimizations (big assumption), the first will be faster, as <= is implemented by a single jle instruction, whereas the latter requires an addition followed by a jl instruction.
http://en.wikibooks.org/wiki/X86_Assembly/Control_Flow#Jump_if_Less
I wouldn't worry about this at all as far as performance goes. Using C as an example, on a simple test I ran with GCC 4.5.1 targeting x86 (with -O2), the (x <= y) operation compiled to:
// if (x <= y) {
//     printf( "x <= y\n");
// }
//
// `x` is [esp+28]
// `y` is [esp+24]
    mov     eax, DWORD PTR [esp+24]   // load `y` into eax
    cmp     DWORD PTR [esp+28], eax   // compare with `x`
    jle     L5                        // if x <= y, jump to the `true` block
L2:
    // ...
    ret
L5:                                   // this prints "x <= y\n"
    mov     DWORD PTR [esp], OFFSET FLAT:LC1
    call    _puts
    jmp     L2                        // jump back to the code after the `if` statement
and the (x < y + 1) operation compiled to:
// if (x < y + 1) {
//     printf( "x < y+1\n");
// }
//
// `x` is [esp+28]
// `y` is [esp+24]
    mov     eax, DWORD PTR [esp+28]   // load `x` into eax
    cmp     DWORD PTR [esp+24], eax   // compare with `y`
    jl      L3                        // jump past the true block if (y < x)
    mov     DWORD PTR [esp], OFFSET FLAT:LC2
    call    _puts
L3:
So you might have a difference of a jump around a jump or so, but you should really only be concerned about this kind of thing for the odd time where it really is a hot spot. Of course there may be differences between languages, and what exactly happens might depend on the type of objects that are being compared. But I'd still not worry about this at all as far as performance is concerned (until it becomes a demonstrated performance issue - which I'd be surprised if it ever does more than once or twice in my lifetime).
So, I think the only two reasons to worry about which test to use are:
correctness - of course, this trumps any other consideration
style/readability
While you might not think there's much to the style/readability consideration, I do worry about this a little. In my C and C++ code today, I'd favor using the < operator over <= because I think loops tend to terminate 'better' using a < than a <= test. So, for example:
iterating over an array by index, you should typically use an index < number_of_elements test
iterating over an array using pointers to elements should use a ptr < (array + number_of_elements) test
Actually even in C, I now tend to use a ptr != (array + number_of_elements) test, since I've gotten used to STL iterators, where the < relation won't work.
In fact, if I see a <= test in a for loop condition, I take a close look - often there's a bug lurking. I consider it an anti-pattern.
Now, I'll grant that a lot of this may not hold for other languages, but I'd be surprised if, when I'm using another language, there's ever a performance issue I have to worry about because I chose to use < over <=.
What data-type?
If y is INT_MAX, then the first expression is true no matter what x is (assuming x is the same or smaller type), while the second expression is always false.
If the answer doesn't need to be right, you can get it even faster.
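To make the INT_MAX corner case above concrete, it's easiest to show with unsigned arithmetic, where the wrap-around is well-defined (with signed int, y + 1 overflows at INT_MAX, which is undefined behaviour in C):
#include <limits.h>
#include <stdio.h>

int main(void) {
    unsigned x = 5, y = UINT_MAX;
    printf("%d\n", x <= y);       // 1: always true when y is the maximum value
    printf("%d\n", x < y + 1u);   // 0: y + 1 wraps to 0, so this is always false
    return 0;
}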
Have you considered that the two expressions are not equivalent? If x and y are floating-point numbers, they may not give the same result. That is one reason both comparison operators exist.
Prefer the first one.
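A quick illustration of that floating-point difference (the values are arbitrary):
#include <stdio.h>

int main(void) {
    double x = 0.5, y = 0.25;
    printf("%d %d\n", x <= y, x < y + 1);   // prints "0 1": the two tests disagree
    return 0;
}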
In some languages with dynamic types, the runtime has to figure out what the type of y is and dispatch to the appropriate + operator, which makes the second form even more expensive.
Leaving this as vague as you have has caused this to be an unanswerable question. Performance cannot be evaluated unless you have software and hardware to measure - what language? what language implementation? what target CPU architecture? etc.
That being said, both <= and < are often identical performance-wise, because they are logically equivalent to > and >=, just with swapped destinations for the underlying goto's (branch instructions), or swapped logic for the underlying "true/false" evaluation.
If you're programming in C or C++, the compiler may be able to figure out what you're doing, and swap in the faster alternative, anyway.
Write code that is understandable, maintainable, correct, and performant, in that order. For performance, find tools to measure the performance of your whole program, and spend your time wisely. Optimize bottlenecks only until your program is fast enough. Spend the time you save by making better code, or making more cool features :)

Resources