When talking about the performance of ifs, we usually talk about how mispredictions can stall the pipeline. The recommended solutions I see are:
Trust the branch predictor for conditions that usually have one result; or
Avoid branching with a little bit of bit-magic if reasonably possible; or
Conditional moves where possible.
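As a point of reference for the second and third options, here is a minimal C++ sketch of a branch-free select. The name select_branchless is made up for illustration, and nothing guarantees what a given compiler emits for it (it may well produce a conditional move, or even a branch).

```cpp
#include <cstdint>

// Returns a if cond is true, otherwise b, without an if statement.
// Hypothetical helper for illustration only.
std::int32_t select_branchless(bool cond, std::int32_t a, std::int32_t b) {
    // 0 - 1 wraps to all ones when cond is true; 0 - 0 stays zero otherwise.
    const std::uint32_t mask = 0u - static_cast<std::uint32_t>(cond);
    const std::uint32_t ua = static_cast<std::uint32_t>(a);
    const std::uint32_t ub = static_cast<std::uint32_t>(b);
    // Keep the bits of a where the mask is set and the bits of b elsewhere.
    // The conversion back to a signed type is modular on mainstream compilers
    // (and guaranteed to be so from C++20 on).
    return static_cast<std::int32_t>((ua & mask) | (ub & ~mask));
}
```

Calling select_branchless(a > b, a, b) computes the larger of the two values without a source-level branch.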
What I couldn't find was whether or not we can calculate the condition early to help where possible. So, instead of:
... work
if (a > b) {
... more work
}
Do something like this:
bool aGreaterThanB = a > b;
... work
if (aGreaterThanB) {
... more work
}
Could something like this potentially avoid stalls on this conditional altogether (depending on the length of the pipeline and the amount of work we can put between the bool and the if)? It doesn't have to be as I've written it, but is there a way to evaluate conditionals early so the CPU doesn't have to try and predict branches?
Also, if that helps, is it something a compiler is likely to do anyway?
Yes, it can be beneficial to allow the branch condition to be calculated as early as possible, so that any misprediction can be resolved early and the front-end part of the pipeline can start re-filling early. In the best case, the misprediction can be free if there is enough work already in flight to totally hide the front-end bubble.
Unfortunately, on out-of-order CPUs, early has a somewhat subtle definition, so getting the branch to resolve early isn't as simple as just moving lines around in the source: you'll probably have to change the way the condition is calculated.
What doesn't work
Unfortunately, earlier doesn't refer to the position of the condition/branch in the source file, nor does it refer to the positions of the assembly instructions corresponding to the comparison or branch. So at a fundamental level, it mostly [7] doesn't work as in your example.
Even if source level positioning mattered, it probably wouldn't work in your example because:
You've moved the evaluation of the condition up and assigned it to a bool, but it's not the test (the > operator) that can mispredict, it's the subsequent conditional branch: after all, it's a branch misprediction. In your example, the branch is in the same place in both versions: its form has simply changed from if (a > b) to if (aGreaterThanB).
Beyond that, the way you've transformed the code isn't likely to fool most compilers. Optimizing compilers don't emit code line-by-line in the order you've written it, but rather schedule things as they see fit based on the source-level dependencies. Pulling the condition up earlier will likely just be ignored, since compilers will want to put the check where it would naturally go: approximately right before the branch on architectures with a flag register [1].
For example, consider the following two implementations of a simple function, which follow the pattern you suggested. The second function moves the condition up to the top of the function.
int test1(int a, int b) {
    int result = a * b;
    result *= result;
    if (a > b) {
        return result + a;
    }
    return result + b * 3;
}

int test2(int a, int b) {
    bool aGreaterThanB = a > b;
    int result = a * b;
    result *= result;
    if (aGreaterThanB) {
        return result + a;
    }
    return result + b * 3;
}
I checked gcc, clang [2] and MSVC, and all compiled both functions identically (the output differed between compilers, but for each compiler the output was the same for the two functions). For example, compiling test2 with gcc resulted in:
test2(int, int):
mov eax, edi
imul eax, esi
imul eax, eax
cmp edi, esi
jg .L4
lea edi, [rsi+rsi*2]
.L4:
add eax, edi
ret
The cmp instruction corresponds to the a > b condition, and gcc has moved it back down past all the "work" and put it right next to the jg which is the conditional branch.
What does work
So if we know that simple manipulation of the order of operations in the source doesn't work, what does work? As it turns out, anything you can do to move the branch condition "up" in the data flow graph might improve performance by allowing the misprediction to be resolved earlier. I'm not going to get deep into how modern CPUs depend on dataflow, but you can find a brief overview here with pointers to further reading at the end.
Traversing a linked list
Here's a real-world example involving linked-list traversal.
Consider the task of summing all values in a null-terminated linked list which also stores its length [3] as a member of the list head structure. The linked list is implemented as one list_head object and zero or more list nodes (with a single int value payload), defined like so:
struct list_node {
    int value;
    list_node* next;
};

struct list_head {
    int size;
    list_node *first;
};
The canonical search loop would use the node->next == nullptr sentinel in the last node to determine that it has reached the end of the list, like this:
long sum_sentinel(list_head list) {
    int sum = 0;
    for (list_node* cur = list.first; cur; cur = cur->next) {
        sum += cur->value;
    }
    return sum;
}
That's about as simple as you get.
However, this puts the branch that ends the summation (the one that first sees cur == null) at the end of the node-to-node pointer-chasing, which is the longest dependency in the data flow graph. If this branch mispredicts, the resolution of the mispredict will occur "late" and the front-end bubble will add directly to the runtime.
On the other hand, you could do the summation by explicitly counting nodes, like so:
long sum_counter(list_head list) {
    int sum = 0;
    list_node* cur = list.first;
    for (int i = 0; i < list.size; cur = cur->next, i++) {
        sum += cur->value;
    }
    return sum;
}
Comparing this to the sentinel solution, it seems like we have added extra work: we now need to initialize, track and update the count [4]. The key, however, is that this counter dependency chain is very short, so it will "run ahead" of the pointer-chasing work and the misprediction will be detected early, while there is still valid pointer-chasing work remaining to do, possibly with a large improvement in runtime.
Let's actually try this. First we examine the assembly for the two solutions, so we can verify there isn't anything unexpected going on:
<sum_sentinel(list_head)>:
test rsi,rsi
je 1fe <sum_sentinel(list_head)+0x1e>
xor eax,eax
loop:
add eax,DWORD PTR [rsi]
mov rsi,QWORD PTR [rsi+0x8]
test rsi,rsi
jne loop
cdqe
ret
<sum_counter(list_head)>:
test edi,edi
jle 1d0 <sum_counter(list_head)+0x20>
xor edx,edx
xor eax,eax
loop:
add edx,0x1
add eax,DWORD PTR [rsi]
mov rsi,QWORD PTR [rsi+0x8]
cmp edi,edx
jne loop
cdqe
ret
As expected, the sentinel approach is slightly simpler: one less instruction during setup, and one less instruction in the loop [5], but overall the key pointer-chasing and addition steps are identical, and we expect this loop to be dominated by the latency of loading successive node pointers.
Indeed, the loops perform virtually identically when summing short or long lists when the prediction impact is negligible. For long lists the branch prediction impact is automatically small since the single mis-prediction when the end of the list is reached is amortized across many nodes, and the runtime asymptotically reaches almost exactly 4 cycles per node for lists contained in L1, which is what we expect with Intel's best-case 4 cycle load-to-use latency.
For short lists, branch misprediction is negligible if the pattern of lists is predictable: either always the same or cycling with some moderate period (which can be 1000 or more with good prediction!). In this case the time per node can be less than 4 cycles when summing many short lists, since multiple lists can be in flight at once (e.g., if summing an array of lists). In any case, both implementations perform almost identically. For example, when lists always have 5 nodes, the time to sum one list is about 12 cycles with either implementation:
** Running benchmark group Tests written in C++ **
Benchmark                  Cycles   BR_MIS
Linked-list w/ Sentinel     12.19     0.00
Linked-list w/ count        12.40     0.00
Let's add branch misprediction to the mix, by changing the list generation code to create lists with an average length of 5, but with the actual length uniformly distributed in [0, 10]. The summation code is unchanged: only the input differs. The results with random list lengths:
** Running benchmark group Tests written in C++ **
Benchmark                  Cycles   BR_MIS
Linked-list w/ Sentinel     43.87     0.88
Linked-list w/ count        27.48     0.89
The BR_MIS column shows that we get nearly one branch misprediction per list [6], as expected, since the loop exit is unpredictable.
However, the sentinel algorithm now takes ~44 cycles versus the ~27.5 cycles of the count algorithm. The count algorithm is about 16.5 cycles faster. You can play with the list lengths and other factors, and change the absolute timings, but the delta is almost always around 16-17 cycles, which not coincidentally is about the same as the branch misprediction penalty on recent Intel! By resolving the branch condition early, we are avoiding the front end bubble, where nothing would be happening at all.
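The change to the input can be sketched as follows. This is not the benchmark harness that produced the numbers above (that code isn't shown here); make_list, strategies_agree, the std::vector node storage and the seed parameter are all assumptions for illustration. The sketch builds lists whose lengths are uniform in [0, 10] and checks that both summation strategies return the same result; timing is left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

struct list_node {
    int value;
    list_node* next;
};

struct list_head {
    int size;
    list_node* first;
};

// Hypothetical helper: links `len` nodes (each holding the value 1) into a list.
// `storage` owns the nodes and must outlive the returned head.
list_head make_list(std::vector<list_node>& storage, int len) {
    assert(len >= 0);
    storage.assign(static_cast<std::size_t>(len), list_node{1, nullptr});
    for (int i = 0; i + 1 < len; ++i) {
        storage[i].next = &storage[i + 1];
    }
    return list_head{len, len ? storage.data() : nullptr};
}

// Loop exit depends on the pointer-chasing chain.
long sum_sentinel(list_head list) {
    int sum = 0;
    for (list_node* cur = list.first; cur; cur = cur->next) {
        sum += cur->value;
    }
    return sum;
}

// Loop exit depends only on the short counter chain.
long sum_counter(list_head list) {
    int sum = 0;
    list_node* cur = list.first;
    for (int i = 0; i < list.size; cur = cur->next, i++) {
        sum += cur->value;
    }
    return sum;
}

// Sums `lists` random-length lists with both strategies and reports
// whether they always agree (each sum must equal the list length).
bool strategies_agree(int lists, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> length(0, 10);  // inclusive bounds, mean 5
    std::vector<list_node> storage;
    for (int i = 0; i < lists; ++i) {
        const list_head head = make_list(storage, length(rng));
        if (sum_sentinel(head) != head.size || sum_counter(head) != head.size) {
            return false;
        }
    }
    return true;
}
```

Because the length is drawn fresh for every list, the loop-exit branch has no pattern for the predictor to learn, which is what makes the position of that branch in the dependency graph matter.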
Calculating iteration count ahead of time
Another example would be something like a loop which calculates a floating point value, say a Taylor series approximation, where the termination condition depends on some function of the calculated value. This has the same effect as above: the termination condition depends on the slow loop carried dependency, so it is just as slow to resolve as the calculation of the value itself. If the exit is unpredictable, you'll suffer a stall on exit.
If you could change that to calculate the iteration count up front, you could use a decoupled integer counter as the termination condition, avoiding the bubble. Even if the up-front calculation adds some time, it could still provide an overall speedup (and the calculation can run in parallel with the first iterations of the loop anyway, so it may be much less costly than you'd expect by looking at its latency).
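A minimal sketch of the two loop shapes, using the Taylor series for exp(x) as a stand-in. The function names and the cutoffs in terms_needed are assumptions picked for illustration (they are sized for |x| <= 1 and an error of roughly 1e-12), not something taken from a real library.

```cpp
#include <cassert>
#include <cmath>

// Data-dependent exit: the loop condition reads `term`, which sits on the
// slow loop-carried floating point chain.
double exp_taylor_converge(double x, double eps) {
    assert(eps > 0.0);
    double term = 1.0;
    double sum = 1.0;
    for (int n = 1; std::fabs(term) > eps; ++n) {
        term *= x / n;
        sum += term;
    }
    return sum;
}

// Hypothetical up-front bound on the number of terms for |x| <= 1.
// It depends only on x, so it can be evaluated before (or alongside)
// the first iterations of the loop.
int terms_needed(double x) {
    const double ax = std::fabs(x);
    assert(ax <= 1.0);
    if (ax <= 0.125) return 10;
    if (ax <= 0.5) return 14;
    return 18;
}

// Decoupled exit: the loop condition reads only the integer counter.
double exp_taylor_counted(double x) {
    const int terms = terms_needed(x);
    double term = 1.0;
    double sum = 1.0;
    for (int n = 1; n <= terms; ++n) {
        term *= x / n;
        sum += term;
    }
    return sum;
}
```

The counted version may execute a few more multiply-adds than strictly necessary, but its exit branch no longer has to wait for the floating point chain to finish.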
[1] MIPS is an interesting exception here, having no flag registers: test results are stored directly into general purpose registers.
[2] Clang compiled this and many other variants in a branch-free manner, but it is still interesting because you still have the same structure of a test instruction and a conditional move (taking the place of the branch).
[3] Like the C++11 std::list.
[4] As it turns out, on x86, the per-node work is actually very similar between the two approaches because of the way that dec implicitly sets the zero flag, so we don't need an extra test instruction, while the mov used in pointer chasing doesn't, so the counter approach has an extra dec while the sentinel approach has an extra test, making it about a wash.
[5] Although this part is just because gcc didn't manage to transform the incrementing for-loop to a decrementing one to take advantage of dec setting the zero flag, avoiding the cmp. Maybe newer gcc versions do better. See also footnote 4.
[6] I guess this is closer to 0.9 than to 1.0 since perhaps the branch predictors still get the length = 10 case correct, since once you've looped 9 times the next iteration will always exit. A less synthetic/exact distribution wouldn't exhibit that.
[7] I say mostly because in some cases you might save a cycle or two via such source or assembly-level re-orderings. Such things can have a minor effect on the execution order in out-of-order processors: execution order is also affected by assembly order, but only within the constraints of the data-flow graph. See also this comment.
Out-of-order execution is definitely a thing (not just compilers but also even the processor chips themselves can reorder instructions), but it helps more with pipeline stalls caused by data dependencies than those caused by mispredictions.
The benefit in control flow scenarios is somewhat limited by the fact that on most architectures, the conditional branch instructions make their decision only based on the flags register, not based on a general-purpose register. It's hard to set up the flags register far in advance unless the intervening "work" is very unusual, because most instructions change the flags register (on most architectures).
Perhaps identifying the combination of
TST (reg)
J(condition)
could be designed to minimize the stall when (reg) is set far enough in advance. This of course requires a large degree of help from the processor, not just the compiler. And the processor designers are likely to optimize for a more general case of early (out of order) execution of the instruction which sets the flags for the branch, with the resulting flags forwarded through the pipeline, ending the stall early.
The main problem with branch misprediction is not the few cycles it incurs as penalty while flushing younger operations (which is relatively fast), but the fact that it may occur very late along the pipe if there are data dependencies the branch condition has to resolve first.
With branches based on prior calculations, the dependency works just like with other operations. Additionally, the branch passes through prediction very early along the pipe so that the machine can go on fetching and allocating further operations. If the prediction was incorrect (which is more often the case with data-dependent branches, unlike loop controls that usually exhibit more predictable patterns), then the flush would occur only when the dependency was resolved and the prediction was proven to be wrong. The later that happens, the bigger the penalty.
Since out-of-order execution schedules operations as soon as dependency is resolved (assuming no port stress), moving the operation ahead is probably not going to help as it does not change the dependency chain and would not affect the scheduling time too much. The only potential benefit is if you move it far enough up so that the OOO window can see it much earlier, but modern CPUs usually run hundreds of instructions ahead, and hoisting instructions that far without breaking the program is hard. If you're running some loop though, it might be simple to compute the conditions of future iterations ahead, if possible.
None of this is going to change the prediction process which is completely orthogonal, but once the branch reaches the OOO part of the machine, it will get resolved immediately, clear if needed, and incur minimal penalty.
I'm using GCC 4.8.1 to compile C code and I need to detect if underflow occurs in a subtraction on x86/64 architecture. Both operands are UNSIGNED. I know it is very easy in assembly, but I'm wondering if I can do it in C code and have GCC optimize it that way, because I can't find how. This is a heavily used function (or lowlevel, is that the term?) so I need it to be efficient, but GCC seems to be too dumb to recognize this simple operation? I tried so many ways to give it hints in C, but it always uses two registers instead of just a sub and a conditional jump. And to be honest I get annoyed seeing such stupid code written so MANY times (the function is called a lot).
My best approach in C seemed to be the following:
if((a-=b)+b < b) {
// underflow here
}
Basically, subtract b from a, and if result underflows detect it and do some conditional processing (which is unrelated to a's value, for example, it brings an error, etc).
GCC seems too dumb to reduce the above to just a sub and a conditional jump, and believe me I tried so many ways to do it in C code, and tried a lot of command line options (-O3 and -Os included of course). What GCC does is something like this (Intel syntax assembly):
mov rax, rcx ; 'a' is in rcx
sub rcx, rdx ; 'b' is in rdx
cmp rax, rdx ; useless comparison since sub already sets flags
jc underflow
Needless to say the above is stupid, when all it needs is this:
sub rcx, rdx
jc underflow
This is so annoying because GCC does understand that sub modifies flags that way, since if I typecast it into a "int" it will generate the exact above except it uses "js" which is jump with sign, instead of carry, which will not work if the unsigned values difference is high enough to have the high bit set. Nevertheless it shows it is aware of the sub instruction affecting those flags.
Now, maybe I should give up on trying to make GCC optimize this properly and do it with inline assembly which I have no problems with. Unfortunately, this requires "asm goto" because I need a conditional JUMP, and asm goto is not very efficient with an output because it's volatile.
I tried something but I have no idea if it is "safe" to use or not. asm goto can't have outputs for some reason. I do not want to make it flush all registers to memory, that would kill the entire point I'm doing this which is efficiency. But if I use empty asm statements with outputs set to the 'a' variable before and after it, will that work and is it safe? Here's my macro:
#define subchk(a,b,g) { typeof(a) _a=a; \
asm("":"+rm"(_a)::"cc"); \
asm goto("sub %1,%0;jc %l2"::"r,m,r"(_a),"r,r,m"(b):"cc":g); \
asm("":"+rm"(_a)::"cc"); }
and using it like this:
subchk(a,b,underflow)
// normal code with no underflow
// ...
underflow:
// underflow occurred here
It's a bit ugly but it works just fine. On my test scenario, it compiles just FINE without volatile overhead (flushing registers to memory) without generating anything bad, and it seems it works ok, however this is just a limited test, I can't possibly test this everywhere I use this function/macro as I said it is used A LOT, so I'd like to know if someone is knowledgeable, is there something unsafe about the above construct?
Particularly, the value of 'a' is NOT NEEDED if underflow occurs, so with that in mind are there any side effects or unsafe stuff that can happen with my inline asm macro? If not I'll use it without problems till they optimize the compiler so I can replace it back after I guess.
Please don't turn this into a debate about premature optimizations or what not, stay on topic of the question, I'm fully aware of that, so thank you.
I probably miss something obvious, but why isn't this good?
extern void underflow(void) __attribute__((noreturn));
unsigned foo(unsigned a, unsigned b)
{
unsigned r = a - b;
if (r > a)
{
underflow();
}
return r;
}
I have checked, gcc optimizes it to what you want:
foo:
movl %edi, %eax
subl %esi, %eax
jb .L6
rep
ret
.L6:
pushq %rax
call underflow
Of course you can handle underflow however you want, I have just done this to keep the asm simple.
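For comparison, here is a small sketch of the same check written two ways. The function names are made up for illustration. The first version is the portable r > a test from above; the second uses the __builtin_sub_overflow builtin, which exists in GCC 5 and later and in Clang, so it is not an option on the GCC 4.8.1 mentioned in the question. Whether either one compiles down to a bare sub plus jb/jc depends on the compiler version and the surrounding code.

```cpp
#include <cassert>

// Portable check: unsigned subtraction wraps modulo 2^N, so the wrapped
// result is larger than a exactly when b > a.
bool sub_underflows(unsigned a, unsigned b, unsigned* out) {
    assert(out != nullptr);
    *out = a - b;
    return *out > a;
}

// Same contract using the GCC >= 5 / Clang builtin, which stores the
// wrapped result in *out and returns true if the subtraction overflowed.
bool sub_underflows_builtin(unsigned a, unsigned b, unsigned* out) {
    assert(out != nullptr);
    return __builtin_sub_overflow(a, b, out);
}
```

Used as if (sub_underflows(a, b, &a)) { ... }, this updates a in place and branches on the underflow condition, which is the shape the question is after.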
How about the following assembly code (you can wrap it into GCC format):
sub rcx, rdx ; assuming operands are in rcx, rdx
setc al ; capture carry bit into AL (see Intel "setxx" instructions)
; return AL as boolean to compiler
Then you invoke/inline the assembly code, and branch on the resulting boolean.
Have you tested whether this is actually faster? Modern x86 microarchitectures decode assembly instructions into one or more simpler micro-operations. Some of them also do macro-op fusion, in which a pair of adjacent assembly instructions is turned into a single micro-op. In particular, sequences like test %reg, %reg; jcc target are fused, probably because global processor flags are a bane of performance.
If cmp %reg, %reg; jcc target is macro-fused, gcc might use that to get faster code. In my experience, gcc is very good at scheduling and similar low-level optimizations.
I am unable to understand the usefulness of LOOPNE. Even if LOOPNE was not there and only LOOP was there, it would have done the same thing here. Please help me out.
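MOV CX, 80      ; maximum number of characters to read
BACK:           ; loop target (position assumed: start of the read)
MOV AH, 1       ; DOS function 01h: read a character into AL
INT 21H
CMP AL, ' '     ; ZF = 1 if the character is a space
LOOPNE BACK     ; CX = CX - 1, then repeat while CX <> 0 and ZF = 0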
CMP is more or less a SUB instruction without changing the value, which means that it sets flags such as ZF (the zero flag).
LOOPNE has 2 conditions to loop: CX <> 0 (after decrementing it) and ZF = 0
LOOP has 1 condition to loop: CX <> 0 (after decrementing it)
So, a normal LOOP would go through all characters, whereas LOOPNE will go through all characters or stop when a space is encountered, whichever comes first.
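The difference can be modelled in a few lines of C++. This is only a sketch of the control flow: the function names are invented, the input buffer stands in for the INT 21H keyboard read, and the return value is the number of characters consumed.

```cpp
#include <cassert>
#include <cstdint>

// Models MOV CX,count / BACK: read char / CMP AL,' ' / LOOPNE BACK.
int consumed_by_loopne(const char* input, std::uint16_t count) {
    assert(count > 0);  // with CX = 0 the real loop would wrap and run 65536 times
    std::uint16_t cx = count;
    int consumed = 0;
    bool zf = false;
    do {
        const char al = input[consumed++];  // stands in for MOV AH,1 / INT 21H
        zf = (al == ' ');                   // CMP AL, ' ' sets ZF on equality
        --cx;                               // LOOPNE decrements CX without touching flags
    } while (cx != 0 && !zf);               // jump back only if CX <> 0 and ZF = 0
    return consumed;
}

// Same loop with plain LOOP: the comparison result is ignored.
int consumed_by_loop(const char* input, std::uint16_t count) {
    assert(count > 0);
    std::uint16_t cx = count;
    int consumed = 0;
    do {
        ++consumed;  // a character is read, but nothing depends on its value
        (void)input;
        --cx;
    } while (cx != 0);  // jump back only if CX <> 0
    return consumed;
}
```

With the input "ab cd" and a count of 5, the LOOPNE model stops after the third character (the space), while the LOOP model always consumes all 5.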
LOOPNE loops when a comparison fails, and when there is a remaining nonzero iteration count (after decrementing it). This is arguably very convenient for finding an element in a linear list of known length.
There is little use for it in modern x86 CPUs.
The LOOPNE instruction is likely implemented internally in the CPU by microinstructions and is thus roughly equivalent to a sequence like JE done/DEC CX/JNZ back (except that LOOPNE always decrements CX and leaves the flags unchanged).
Because the CPU designers invest vast amounts of effort to optimize compare/branch/register arithmetic, the equivalent instruction sequence is likely, on a highly pipelined CPU, to execute virtually just as fast. It may actually execute slower; you'll only know by timing it. And the fact that you are confused about what it does makes it a source of coding errors.
I presently code the equivalent instruction sequence, because I got bit by a misunderstanding once. I'm not confused about CMP and JNE.
I am taking an assembly course now, and the guy who checks our home assignments is a very pedantic old-school optimization freak. For example he deducts 10% if he sees:
mov ax, 0
instead of:
xor ax,ax
even if it's only used once.
I am not a complete beginner in assembly programming but I'm not an optimization expert, so I need your help in something (might be a very stupid question but I'll ask anyway):
if I need to set a register value to 1 or (-1) is it better to use:
mov ax, 1
or do something like:
xor ax,ax
inc ax
I really need a good grade, so I'm trying to get it as optimized as possible. ( I need to optimize both time and code size)
A quick google for 8086 instructions timings size turned up a listing of instruction timings which seems to have all the timings and sizes for the 8086/8088 through Pentium.
Although you should note that this probably doesn't include code fetch memory bottlenecks which can be very significant, especially on an 8088. This usually makes optimization for code-size a better choice. See here for some details on this.
No doubt you could find official Intel documentation on the web with similar information, such as the "8086/8088 User's Manual: Programmer's and Hardware Reference".
For your specific question, the table below gives a comparison that indicates mov ax, 1 is better (fewer cycles, and the same space):
Instructions    Clock cycles    Bytes
xor ax, ax                 3        2
inc ax                     3        1
(total)                    6        3
mov ax, 1                  4        3
But you might want to talk to your educational institute about this guy. A 10% penalty for a simple thing like that seems quite harsh. You should ask what should be done in the case where you have two possibilities, one faster and one shorter.
Then, once they've admitted that there are different ways to optimise code depending on what you're trying to achieve, tell them that what you're trying to do is optimise for readability and maintainability, and seriously couldn't give a damn about a wasted cycle or byte here or there(1).
Optimisation is something you generally do if and when you have a performance problem, after a piece of code is in a near-complete state - it's almost always wasted effort when the code is still subject to a not-insignificant likelihood of change.
For what it's worth, sub ax,ax appears to be on par with xor ax,ax in terms of clock cycles and size, so maybe you could throw that into the mix next time to cause him some more work.
(1) No, don't really do that, but it's fun to vent occasionally :-)
You're better off with
mov AX,1
on the 8086. If you're tracking register contents, you can possibly do better if you know that, for example, BX already has a 1 in it:
mov AX,BX
or if you know that AH is 0:
mov AL,1
etc.
Depending upon your circumstances, you may be able to get away with ...
sbb ax, ax
The result will either be 0 if the carry flag is not set or -1 if the carry flag is set.
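In C++ terms, the arithmetic behind that instruction looks like the sketch below. The function name and the explicit carry parameter are illustrative stand-ins for the CPU's carry flag.

```cpp
#include <cstdint>

// Models SBB AX, AX: computes AX - AX - CF in 16 bits.
// The starting value of AX cancels out, so only the carry flag matters.
std::uint16_t sbb_self(std::uint16_t ax, bool carry) {
    return static_cast<std::uint16_t>(ax - ax - (carry ? 1 : 0));
}
```

The result is 0x0000 when the carry is clear and 0xFFFF (that is, -1 as a signed 16-bit value) when it is set, regardless of what AX held before.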
However, if the above example is not applicable to your situation, I would recommend the
xor ax, ax
inc ax
method. It should satisfy your professor for size. However, if your processor employs any pipe-lining, I would expect there to be some coupling-like delay between the two instructions (I could very well be wrong on that). If such a coupling exists, the speed could be improved slightly by reordering your instructions slightly to have another instruction between them (one that does not use ax).
Hope this helps.
I would use mov [e]ax, 1 under any circumstances. Its encoding is no longer than the hackier xor sequence, and I'm pretty sure it's faster just about anywhere. 8086 is just weird enough to be the exception, and as that thing is so slow, a micro-optimization like this would make most difference. But anywhere else: executing 2 "easy" instructions will always be slower than executing 1, especially if you consider data hazards and long pipelines. You're trying to read a register in the very next instruction after you modify it, so unless your CPU can bypass the result from stage N of the pipeline (where the xor is executing) to stage N-1 (where the inc is trying to load the register, never mind adding 1 to its value), you're going to have stalls.
Other things to consider: instruction fetch bandwidth (moot for 16-bit code, both are 3 bytes); mov avoids changing flags (more likely to be useful than forcing them all to zero); depending on what values other registers might hold, you could perhaps do lea ax,[bx+1] (also 3 bytes, even in 32-bit code, no effect on flags); as others have said, sbb ax,ax could work too in some circumstances - it's also shorter at 2 bytes.
When faced with these sorts of micro-optimizations you really should measure the alternatives instead of blindly relying even on processor manuals.
P.S. New homework: is xor bx,bx any faster than xor bx,cx (on any processor)?
I've seen comments on SO saying "<> is faster than =" or "!= faster than ==" in an if() statement.
I'd like to know why is that so. Could you show an example in asm?
Thanks! :)
EDIT:
Source
Here is what he did.
function Check(var MemoryData:Array of byte;MemorySignature:Array of byte;Position:integer):boolean;
var i:byte;
begin
  Result := True; //moved at top. Your function always returned 'True'. This is what you wanted?
  for i := 0 to Length(MemorySignature) - 1 do //are you sure??? Perhaps you want High(MemorySignature) here...
  begin
    {!} if MemorySignature[i] <> $FF then //speedup - '<>' evaluates faster than '='
    begin
      Result:=memorydata[i + position] <> MemorySignature[i]; //speedup.
      if not Result then
        Break; //added this! - speedup. We already know the result. So, no need to scan till end.
    end;
  end;
end;
I'd claim that this is flat out wrong except perhaps in very special circumstances. Compilers can refactor one into the other effortlessly (by just switching the if and else cases).
It could have something to do with branch prediction on the CPU. Static branch prediction would predict that a branch simply wouldn't be taken and fetch the next instruction. However, hardly anybody uses that anymore. Other than that, I'd say it's bull because the comparisons should be identical.
I think there's some confusion in your previous question about what the algorithm was that you were trying to implement, and therefore in what the claimed "speedup" purports to do.
Here's some disassembly from Delphi 2007, optimization on. (Note: turning optimization off changed the code a little, but not in a relevant way.)
Unit70.pas.31: for I := 0 to 100 do
004552B5 33C0 xor eax,eax
Unit70.pas.33: if i = j then
004552B7 3B02 cmp eax,[edx]
004552B9 7506 jnz $004552c1
Unit70.pas.34: k := k+1;
004552BB FF05D0DC4500 inc dword ptr [$0045dcd0]
Unit70.pas.35: if i <> j then
004552C1 3B02 cmp eax,[edx]
004552C3 7406 jz $004552cb
Unit70.pas.36: l := l + 1;
004552C5 FF05D4DC4500 inc dword ptr [$0045dcd4]
Unit70.pas.37: end;
004552CB 40 inc eax
Unit70.pas.31: for I := 0 to 100 do
004552CC 83F865 cmp eax,$65
004552CF 75E6 jnz $004552b7
Unit70.pas.38: end;
004552D1 C3 ret
As you can see, the only difference between the two cases is a jz vs. a jnz instruction. These WILL run at the same speed. What's likely to affect things much more is how often the branch is taken, and whether the entire loop fits into cache.
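The same experiment can be sketched in C++ (the struct and function names are invented for illustration, and the generated code will differ by compiler). The two tests are exact complements: on x86 one would typically expect a single compare feeding a branch or flag-setting instruction of opposite sense for each.

```cpp
// Counts how often each comparison holds over i = 0..100,
// mirroring the Delphi loop above.
struct match_counts {
    int equal;      // incremented when i == j
    int not_equal;  // incremented when i != j
};

match_counts count_matches(int j) {
    match_counts c{0, 0};
    for (int i = 0; i <= 100; ++i) {
        if (i == j) {
            c.equal += 1;
        }
        if (i != j) {
            c.not_equal += 1;
        }
    }
    return c;
}
```

For any j the two counters always add up to 101, which is just another way of saying that neither comparison does more work than the other.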
For .Net languages
If you look at the IL from the string.op_Equality and string.op_Inequality methods, you will see that both internally call string.Equals.
But op_Inequality inverts the result. This is two IL statements more.
I would say the performance is the same, with maybe a small (very small, very very small) advantage for the == statement. But I believe that the optimizer & JIT compiler will remove this.
Spontaneous thought: most other things in your code will affect performance more than the choice between == and != (or = and <> depending on language).
When I ran a test in C# over 1000000 iterations of comparing strings (containing the alphabet, a-z, with the last two letters reversed in one of them), the difference was between 0 and 1 milliseconds.
It has been said before: write code for readability; change into more performant code when it has been established that it will make a difference.
Edit: repeated the same test with byte arrays; same thing; the performance difference is negligible.
It could also be a result of misinterpretation of an experiment.
Most compilers/optimizers assume a branch is taken by default. If you invert the operator and the if-then-else order, and the branch that is now taken is the ELSE clause, that might cause an additional speed effect in highly calculating code (*)
(*) obviously you need to do a lot of operations for that. But it can matter for the tightest loops in e.g. codecs or image analysis/machine vision where you have 50MByte/s of data to trawl through.
.... and then I even only stoop to this level for the really heavily reusable code. For ordinary business code it is not worth it.
I'd claim this was flat out wrong full stop. The test for equality is always the same as the test for inequality. With string (or complex structure testing), you're always going to break at exactly the same point. Until that break point is reached, then the answer for equality is unknown.
I strongly doubt there is any speed difference. For integral types for example you are getting a CMP instruction and either JZ (Jump if zero) or JNZ (Jump if not zero), depending on whether you used = or ≠. There is no speed difference here and I'd expect that to hold true at higher levels too.
If you can provide a small example that clearly shows a difference, then I'm sure the Stack Overflow community could explain why. However, I think you might have difficulty constructing a clear example. I don't think there will be any performance difference noticeable at any reasonable scale.
Well it could be or it couldn't be, that is the question :-)
The thing is, this depends heavily on the programming language you are using.
Since all your statements will eventually end up as instructions to the CPU, the one that uses the least amount of instruction to achieve the result will be the fastest.
For example if you say bits x is equal to bits y, you could use the instruction that does an XOR using both bits as an input, if the result is anything but 0 it is not the same. So how would you know that the result is anything but 0? By using the instruction that returns true if you say input a is bigger than 0.
So this is already 2 instructions you use to do it, but since most CPU's have an instruction that does compare in a single cycle it is a bad example.
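The XOR idea from the previous paragraph, written out as a small C++ sketch (the function name is made up for illustration). As noted above, a real compiler would normally just emit a compare instruction for x == y, so this is about the logic rather than about speed.

```cpp
#include <cstdint>

// Two values are equal exactly when XOR-ing them leaves no bit set.
bool equal_via_xor(std::uint32_t x, std::uint32_t y) {
    return (x ^ y) == 0;
}
```

The inequality test is the same XOR followed by the opposite check against zero, so neither direction needs an extra step.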
The point I am making is still the same: you can't make such general statements without specifying the programming language and the CPU architecture.
This list (assuming it's on x86) of ASM instructions might help:
Jump if greater
Jump on equality
Comparison between two registers
(Disclaimer, I have nothing more than very basic experience with writing assembler so I could be off the mark)
However, it obviously depends purely on what assembly instructions the Delphi compiler is producing. Without seeing that output, it's guesswork. I'm going to keep my Donald Knuth quote in, as caring about this kind of thing for all but a niche set of applications (games, mobile devices, high performance server apps, safety critical software, missile launchers etc.) is the thing you worry about last in my view.
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
If you're writing one of those or similar then obviously you do care, but you didn't specify it.
Just guessing, but given you want to preserve the logic, you cannot just replace
if A = B then
with
if A <> B then
To conserve the logic, the original code must have been something like
if not (A = B) then
or
if A <> B then
else
and that may truly be a little bit slower than the test on inequality.