Avoid stalling pipeline by calculating conditional early

Avoid stalling pipeline by calculating conditional early - performance

When talking about the performance of ifs, we usually talk about how mispredictions can stall the pipeline. The recommended solutions I see are:
Trust the branch predictor for conditions that usually have one result; or
Avoid branching with a little bit of bit-magic if reasonably possible; or
Conditional moves where possible.
What I couldn't find was whether or not we can calculate the condition early to help where possible. So, instead of:
... work
if (a > b) {
... more work
}
Do something like this:
bool aGreaterThanB = a > b;
... work
if (aGreaterThanB) {
... more work
}
Could something like this potentially avoid stalls on this conditional altogether (depending on the length of the pipeline and the amount of work we can put between the bool and the if)? It doesn't have to be as I've written it, but is there a way to evaluate conditionals early so the CPU doesn't have to try and predict branches?
Also, if that helps, is it something a compiler is likely to do anyway?

Yes, it can be beneficial to allow the the branch condition to calculated as early as possible, so that any misprediction can be resolved early and the front-end part of the pipeline can start re-filling early. In the best case, the mis-prediction can be free if there is enough work already in-flight to totally hide the front end bubble.
Unfortunately, on out-of-order CPUs, early has a somewhat subtle definition and so getting the branch to resolve early isn't as simple as just moving lines around in the source - you'll probably have to make a change to way the condition is calculated.
What doesn't work
Unfortunately, earlier doesn't refer to the position of the condition/branch in the source file, nor does it refer to the positions of the assembly instructions corresponding to the comparison or branch. So at a fundamental level, it mostly7 doesn't work as in your example.
Even if source level positioning mattered, it probably wouldn't work in your example because:
You've moved the evaluation of the condition up and assigned it to a bool, but it's not the test (the < operator) that can mispredict, it's the subsequent conditional branch: after all, it's a branch misprediction. In your example, the branch is in the same place in both places: its form has simply changed from if (a > b) to if (aGreaterThanB).
Beyond that, the way you've transformed the code isn't likely to fool most compilers. Optimizing compilers don't emit code line-by-line in the order you've written it, but rather schedule things as they see fit based on the source-level dependencies. Pulling the condition up earlier will likely just be ignored, since compilers will want to put the check where it would naturally go: approximately right before the branch on architectures with a flag register.
For example, consider the following two implementations of a simple function, which follow the pattern you suggested. The second function moves the condition up to the top of the function.
int test1(int a, int b) {
int result = a * b;
result *= result;
if (a > b) {
return result + a;
}
return result + b * 3;
}
int test2(int a, int b) {
bool aGreaterThanB = a > b;
int result = a * b;
result *= result;
if (aGreaterThanB) {
return result + a;
}
return result + b * 3;
}
I checked gcc, clang2 and MSVC, and all compiled both functions identically (the output differed between compilers, but for each compiler the output was the same for the two functions). For example, compiling test2 with gcc resulted in:
test2(int, int):
mov eax, edi
imul eax, esi
imul eax, eax
cmp edi, esi
jg .L4
lea edi, [rsi+rsi*2]
.L4:
add eax, edi
ret
The cmp instruction corresponds to the a > b condition, and gcc has moved it back down past all the "work" and put it right next to the jg which is the conditional branch.
What does work
So if we know that simple manipulation of the order of operations in the source doesn't work, what does work? As it turns out, anything you can do move the branch condition "up" in the data flow graph might improve performance by allowing the misprediction to be resolved earlier. I'm not going to get deep into how modern CPUs depend on dataflow, but you can find a brief overview here with pointers to further reading at the end.
Traversing a linked list
Here's a real-world example involving linked-list traversal.
Consider the the task of summing all values a null-terminated linked list which also stores its length1 as a member of the list head structure. The linked list implemented as one list_head object and zero or more list nodes (with a single int value payload), defined like so:
struct list_node {
int value;
list_node* next;
};
struct list_head {
int size;
list_node *first;
};
The canonical search loop would use the node->next == nullptr sentinel in the last node to determine that is has reached the end of the list, like this:
long sum_sentinel(list_head list) {
int sum = 0;
for (list_node* cur = list.first; cur; cur = cur->next) {
sum += cur->value;
}
return sum;
}
That's about as simple as you get.
However, this puts the branch that ends the summation (the one that first cur == null) at the end of the node-to-node pointer-chasing, which is the longest dependency in the data flow graph. If this branch mispredicts, the resolution of the mispredict will occur "late" and the front-end bubble will add directly to the runtime.
On the other hand, you could do the summation by explicitly counting nodes, like so:
long sum_counter(list_head list) {
int sum = 0;
list_node* cur = list.first;
for (int i = 0; i < list.size; cur = cur->next, i++) {
sum += cur->value;
}
return sum;
}
Comparing this to the sentinel solution, it seems like we have added extra work: we now need to initialize, track and decrement the count4. The key, however, is that this decrement dependency chain is very short and so it will "run ahead" of pointer-chasing work and the mis-prediction will occur early while there is still valid remaining pointer chasing work to do, possibly with a large improvement in runtime.
Let's actually try this. First we examine the assembly for the two solutions, so we can verify there isn't anything unexpected going on:
<sum_sentinel(list_head)>:
test rsi,rsi
je 1fe <sum_sentinel(list_head)+0x1e>
xor eax,eax
loop:
add eax,DWORD PTR [rsi]
mov rsi,QWORD PTR [rsi+0x8]
test rsi,rsi
jne loop
cdqe
ret
<sum_counter(list_head)>:
test edi,edi
jle 1d0 <sum_counter(list_head)+0x20>
xor edx,edx
xor eax,eax
loop:
add edx,0x1
add eax,DWORD PTR [rsi]
mov rsi,QWORD PTR [rsi+0x8]
cmp edi,edx
jne loop:
cdqe
ret
As expected, the sentinel approach is slightly simpler: one less instruction during setup, and one less instruction in the loop5, but overall the key pointer chasing and addition steps are identical and we expect this loop to be dominated by the latency of successive node pointers.
Indeed, the loops perform virtually identically when summing short or long lists when the prediction impact is negligible. For long lists the branch prediction impact is automatically small since the single mis-prediction when the end of the list is reached is amortized across many nodes, and the runtime asymptotically reaches almost exactly 4 cycles per node for lists contained in L1, which is what we expect with Intel's best-case 4 cycle load-to-use latency.
For short lists, branch misprediction is neglible if the pattern of lists is predictable: either always the same or cycling with some moderate period (which can be 1000 or more with good prediction!). In this case the time per node can be less than 4 cycles when summing many short lists since multiple lists can be in flight at once (e.g., if summary an array of lists). In any case, both implementations perform almost identically. For example, when lists always have 5 nodes, the time to sum one list is about 12 cycles with either implementation:
** Running benchmark group Tests written in C++ **
Benchmark Cycles BR_MIS
Linked-list w/ Sentinel 12.19 0.00
Linked-list w/ count 12.40 0.00
Let's add branch prediction to the mix, by changing the list generation code to create lists with an average a length of 5, but with actual length uniformly distributed in [0, 10]. The summation code is unchanged: only the input differs. The results with random list lengths:
** Running benchmark group Tests written in C++ **
Benchmark Cycles BR_MIS
Linked-list w/ Sentinel 43.87 0.88
Linked-list w/ count 27.48 0.89
The BR_MIS column shows that we get nearly one branch misprediction per list6, as expected, since the loop exit is unpredictable.
However, the sentinel algorithm now takes ~44 cycles versus the ~27.5 cycles of the count algorithm. The count algorithm is about 16.5 cycles faster. You can play with the list lengths and other factors, and change the absolute timings, but the delta is almost always around 16-17 cycles, which not coincidentally is about the same as the branch misprediction penalty on recent Intel! By resolving the branch condition early, we are avoiding the front end bubble, where nothing would be happening at all.
Calculating iteration count ahead of time
Another example would be something like a loop which calculates a floating point value, say a Taylor series approximation, where the termination condition depends on some function of the calculated value. This has the same effect as above: the termination condition depends on the slow loop carried dependency, so it is just as slow to resolve as the calculation of the value itself. If the exit is unpredictable, you'll suffer a stall on exit.
If you could change that to calculate the iteration count up front, you could use a decoupled integer counter as the termination condition, avoiding the bubble. Even if the up-front calculation adds some time, it could still provide an overall speedup (and the calculation can run in parallel with the first iterations of the loop, anyways, so it may be much less costly what you'd expect by looking at its latency).
1 MIPS is an interesting exception here having no flag registers - test results are stored directly into general purpose registers.
2 Clang compiled this and many other variants in a branch-free manner, but it still interesting because you still have the same structure of a test instruction and a conditional move (taking the place of the branch).
3 Like the C++11 std::list.
4 As it turns out, on x86, the per-node work is actually very similar between the two approaches because of the way that dec implicitly set the zero flag, so we don't need an extra test instruction, while the mov used in pointer chasing doesn't, so the counter approach has an extra dec while the sentinel approach has an extra test, making it about a wash.
5 Although this part is just because gcc didn't manage to transform the incrementing for-loop to a decrementing one to take advantage of dec setting the zero flag, avoiding the cmp. Maybe newer gcc versions do better. See also footnote 4.
6 I guess this is closer to 0.9 than to 1.0 since perhaps the branch predictors still get the length = 10 case correct, since once you've looped 9 times the next iteration will always exit. A less synthetic/exact distribution wouldn't exhibit that.
7 I say mostly because in some cases you might save a cycle or two via such source or assembly-level re-orderings, because such things can have a minor effect on the execution order in out-of-order processors, execution order is also affected by assembly order, but only within the constraints of the data-flow graph. See also this comment.

Out-of-order execution is definitely a thing (not just compilers but also even the processor chips themselves can reorder instructions), but it helps more with pipeline stalls caused by data dependencies than those caused by mispredictions.
The benefit in control flow scenarios is somewhat limited by the fact that on most architectures, the conditional branch instructions make their decision only based on the flags register, not based on a general-purpose register. It's hard to set up the flags register far in advance unless the intervening "work" is very unusual, because most instructions change the flags register (on most architectures).
Perhaps identifying the combination of
TST (reg)
J(condition)
could be designed to minimize the stall when (reg) is set far enough in advance. This of course requires a large degree of help from the processor, not just the compiler. And the processor designers are likely to optimize for a more general case of early (out of order) execution of the instruction which sets the flags for the branch, with the resulting flags forwarded through the pipeline, ending the stall early.

The main problem with branch misprediction is not the few cycles it incurs as penalty while flushing younger operations (which is relatively fast), but the fact that it may occur very late along the pipe if there are data dependencies the branch condition has to resolve first.
With branches based on prior calculations, the dependency works just like with other operations. Additionally, the branch passes through prediction very early along the pipe so that the machine can go on fetching and allocating further operations. If the prediction was incorrect (which is more often the case with data-dependent branches, unlike loop controls that usually exhibit more predictable patterns), than the flush would occur only when the dependency was resolved and the prediction was proven to be wrong. The later that happens, the bigger the penalty.
Since out-of-order execution schedules operations as soon as dependency is resolved (assuming no port stress), moving the operation ahead is probably not going to help as it does not change the dependency chain and would not affect the scheduling time too much. The only potential benefit is if you move it far enough up so that the OOO window can see it much earlier, but modern CPUs usually run hundreds of instructions ahead, and hoisting instructions that far without breaking the program is hard. If you're running some loop though, it might be simple to compute the conditions of future iterations ahead, if possible.
None of this is going to change the prediction process which is completely orthogonal, but once the branch reaches the OOO part of the machine, it will get resolved immediately, clear if needed, and incur minimal penalty.

Related

Theoretically, is comparison between 0 and 255 faster than 0 and 1?

From the point of view of very low level programming, how is performed the comparison between two numbers?
Using one byte, unsigned numbers 0, 1 and 255 are written:
0 -----> 00000000
1 -----> 00000001
255 ---> 11111111
Now, what happens during the comparison between these numbers?
Using my vision as a human having learned basic programming, I could imagine the following algorithm about == implementation:
b = 0
while b < 8:
if first_number[b] != second_number[b]:
return False
b += 1
return True
Basically this is like comparing each bit step by step, and stop before the end if two bits are different.
Thus we note that the comparison stops at the first iteration compared 0 and 255, while it stops at the last if 0 and 1 are compared.
The first comparison would be 8 times faster than the second.
In practice, I doubt that is the case. But is this theoretically true?
If not, how does the computer work?

A comparison between integers is tipically implemented by the cpu as a subtraction, whose result sign contains information about which number is bigger.
While a naive implementation of subtraction executes one bit at a time (because every bit needs to know the carry of the preceding one), tipical implementation use a carry-lookahead circuit that allows the calculation of more result bits at the same time.
So, the answer is: no, every comparison takes almost the same time for every possible input.

Hardware is fundamentally different from the dominant programming paradigms in that all logic gates (or circuits in general) always do their work independently, in parallel, at all times. There is no such thing as "do this, then do that", only "do this here, feed the result into the circuit over there". If there's a circuit on the chip with input A and output B, then the circuit always, continuously, updates B in accordance with the current values of A — regardless of whether the result is needed right now "in the big picture".
Your pseudo code algorithm doesn't even begin to map to logic gates well. Instead, a comparator looks like this in Verilog (ignoring that there's a built-in == operator):
assign not_equal = (a[0] ^ b[0]) | (a[1] ^ b[1]) | ...;
Where each XOR is a separate logic gate and hence works independently from the others. The results are "reduced" with a logical or, i.e. the output is 1 if any of the XORs produces a 1 (this too does some work in parallel, but the critical path is longer than one gate). Furthermore, all these gates exist in silicon regardless of the specific bit values, and the signal has to propagate through about (1 + log w) gates for a w-bit integer. This propagation delay is again independent of the intermediate and final results.
On some CPU families, equality comparison is implemented by subtracting the two numbers and comparing the result to zero (using a circuit as described above), but the same principle applies. An adder/subtracter doesn't get slower or faster depending on the values.
Not to mention that instructions in a CPU can't take less than one clock cycle anyway, so even if the hardware would finish more quickly, the next instruction still wouldn't start until the next tick.
Now, some hardware operations can take a variable amount of time, but that's because they are state machines, i.e. sequential logic. Technically one could implement the moral equivalent of your algorithm with a state machine, but nobody does that, it's harder to implement than the naive, un-optimized combinatorial circuit above, and less efficient to boot.
State machine circuits are circuits with memory: They store their current state and always compute the outputs (depending on the current state) and the next state (depending on current state and inputs) each clock cycle. On some inputs they may go through N states until they produce an output, and N+x on other inputs. ALU operations generally don't do that though. Pipeline stalls, branch mispredictions, and cache misses are common reasons one instruction takes longer than usual in some circumstances. Properly reasoning about these in a way that helps programmers write faster code is hard though: You have to take into account all the tricky and quirks of real hardware, and there's a lot of those. Empirical evidence, i.e. benchmarking a real black box CPU, is vital.

When it gets down to the assembly the cmp instruction is used regardless of the contents of the variables.
So there is no performance difference.

About reducing the branch miss prediciton

I saw a sentence in a paper "Transforming branches into data dependencies to avoid mispredicted branches." (Page 6)
I wonder how to change the code from branches into data dependencies.
This is the paper: http://www.adms-conf.org/p1-SCHLEGEL.pdf
Update: How to transform branches in the binary search?

The basic idea (I would presume) would be to change something like:
if (a>b)
return "A is greater than B";
else
return "A is less than or equal to B";
into:
static char const *strings[] = {
"A is less than or equal to B",
"A is greater than B"
};
return strings[a>b];
For branches in a binary search, let's consider the basic idea of the "normal" binary search, which typically looks (at least vaguely) like this:
while (left < right) {
middle = (left + right)/2;
if (search_for < array[middle])
right = middle;
else
left = middle;
}
We could get rid of most of the branching here in pretty much the same way:
size_t positions[] = {left, right};
while (left < right) {
size_t middle = (left + right)/2;
positions[search_for >= array[middle]] = middle;
}
[For general purpose code use left + (right-left)/2 instead of (left+right)/2.]
We do still have the branching for the loop itself, of course, but we're generally a lot less concerned about that--that branch is extremely amenable to prediction, so even if we did eliminate it, doing so would gain little as a rule.

Removing branches is not always optimal, even (especially) with simple binary conditions like this. I have looked into removing branches in a similar manner in various circumstances before. After running into an instance where code with a conditional branch in a loop ran faster than equivalent, branch-less code, I did some research on processor execution strategies.
I knew that the ARM instruction set had conditional instructions, which could make a conditional branch faster than the kind of branch-less code in the paper, but I was working on an intel (and the branch was not one that cmove could take care of). It turns out that modern CPUs, including intel, will sometimes turn an ordinary instruction into a conditional one: they use eager execution if the two end points of the branch are sufficiently short. That is, the CPU will put both possible paths in the pipeline and execute them both and only keep the correct result once the condition is known. This avoids the possibility of a mis-prediction without having to index an array. See https://en.wikipedia.org/wiki/Speculative_execution#Variants for more information.
It takes a lot of detailed knowledge of a processor to write optimal code for it, even in assembly. This makes it possible for "naive" implementations to sometimes be faster than hand-optimized ones that overlook certain CPU characteristics. The same thing can happen with optimizing compilers: sometimes writing more complicated "optimized" code breaks some of the compiler's optimizations and results in a slower executable than a naive implementation that the compiler can optimize fully.
When in doubt and performance is critical and there is time, it is usually best to try it both ways and see which is faster!

Usefulness of LOOPNE

I am unable to understand the usefulness of LOOPNE. Even if LOOPNE was not there and only LOOP was there, it would have done the same thing here. Please help me out.
MOV CX, 80
MOV AH,1
INT 21H
CMP AL, ' '
LOOPNE BACK

CMP is more or less a SUB instruction without changing the value, which means that it sets flags such as ZF (the zero flag).
LOOPNE has 2 conditions to loop: cx > 0 and ZF = 0
LOOP has 1 condition to loop: cx > 0
So, a normal LOOP would go through all characters, whereas LOOPNE will go through all characters, or until a space is encountered. Whichever comes first

LOOPNE loops when a comparison fails, and when there is a remaining nonzero iteration count (after decrementing it). This is arguably very convenient for finding an element in a linear list of known length.
There is little use for it in modern x86 CPUs.
The LOOPNE instruction is likely implemented internally in the CPU by microinstructions and thus effectively equivalent to JNE/DEC CX/JNE.
Because the CPU designers invest vast amounts of effort to optimize compare/branch/register arithmetic, the equivalent instruction sequence is likely, on a highly pipelined CPU, to execute virtually just as fast. It may actually execute slower; you'll only know by timing it. And the fact that you are confused about what it does makes it a source of coding errors.
I presently code the equivalent instruction sequence, because I got bit by a misunderstanding once. I'm not confused about CMP and JNE.

When, if ever, is loop unrolling still useful?

I've been trying to optimize some extremely performance-critical code (a quick sort algorithm that's being called millions and millions of times inside a monte carlo simulation) by loop unrolling. Here's the inner loop I'm trying to speed up:
// Search for elements to swap.
while(myArray[++index1] < pivot) {}
while(pivot < myArray[--index2]) {}
I tried unrolling to something like:
while(true) {
if(myArray[++index1] < pivot) break;
if(myArray[++index1] < pivot) break;
// More unrolling
}
while(true) {
if(pivot < myArray[--index2]) break;
if(pivot < myArray[--index2]) break;
// More unrolling
}
This made absolutely no difference so I changed it back to the more readable form. I've had similar experiences other times I've tried loop unrolling. Given the quality of branch predictors on modern hardware, when, if ever, is loop unrolling still a useful optimization?

Loop unrolling makes sense if you can break dependency chains. This gives a out of order or super-scalar CPU the possibility to schedule things better and thus run faster.
A simple example:
for (int i=0; i<n; i++)
{
sum += data[i];
}
Here the dependency chain of the arguments is very short. If you get a stall because you have a cache-miss on the data-array the cpu cannot do anything but to wait.
On the other hand this code:
for (int i=0; i<n-3; i+=4) // note the n-3 bound for starting i + 0..3
{
sum1 += data[i+0];
sum2 += data[i+1];
sum3 += data[i+2];
sum4 += data[i+3];
}
sum = sum1 + sum2 + sum3 + sum4;
// if n%4 != 0, handle final 0..3 elements with a rolled up loop or whatever
could run faster. If you get a cache miss or other stall in one calculation there are still three other dependency chains that don't depend on the stall. A out of order CPU can execute these in parallel.
(See Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators) for an in-depth look at how register-renaming helps CPUs find that parallelism, and an in depth look at the details for FP dot-product on modern x86-64 CPUs with their throughput vs. latency characteristics for pipelined floating-point SIMD FMA ALUs. Hiding latency of FP addition or FMA is a major benefit to multiple accumulators, since latencies are longer than integer but SIMD throughput is often similar.)

Those wouldn't make any difference because you're doing the same number of comparisons. Here's a better example. Instead of:
for (int i=0; i<200; i++) {
doStuff();
}
write:
for (int i=0; i<50; i++) {
doStuff();
doStuff();
doStuff();
doStuff();
}
Even then it almost certainly won't matter but you are now doing 50 comparisons instead of 200 (imagine the comparison is more complex).
Manual loop unrolling in general is largely an artifact of history however. It's another of the growing list of things that a good compiler will do for you when it matters. For example, most people don't bother to write x <<= 1 or x += x instead of x *= 2. You just write x *= 2 and the compiler will optimize it for you to whatever is best.
Basically there's increasingly less need to second-guess your compiler.

Regardless of branch prediction on modern hardware, most compilers do loop unrolling for you anyway.
It would be worthwhile finding out how much optimizations your compiler does for you.
I found Felix von Leitner's presentation very enlightening on the subject. I recommend you read it. Summary: Modern compilers are VERY clever, so hand optimizations are almost never effective.

As far as I understand it, modern compilers already unroll loops where appropriate - an example being gcc, if passed the optimisation flags it the manual says it will:
Unroll loops whose number of
iterations can be determined at
compile time or upon entry to the
loop.
So, in practice it's likely that your compiler will do the trivial cases for you. It's up to you therefore to make sure that as many as possible of your loops are easy for the compiler to determine how many iterations will be needed.

Loop unrolling, whether it's hand unrolling or compiler unrolling, can often be counter-productive, particularly with more recent x86 CPUs (Core 2, Core i7). Bottom line: benchmark your code with and without loop unrolling on whatever CPUs you plan to deploy this code on.

Trying without knowing is not the way to do it.
Does this sort take a high percentage of overall time?
All loop unrolling does is reduce the loop overhead of incrementing/decrementing, comparing for the stop condition, and jumping. If what you're doing in the loop takes more instruction cycles than the loop overhead itself, you're not going to see much improvement percentage-wise.
Here's an example of how to get maximum performance.

Loop unrolling can be helpful in specific cases. The only gain isn't skipping some tests!
It can for instance allow scalar replacement, efficient insertion of software prefetching... You would be surprised actually how useful it can be (you can easily get 10% speedup on most loops even with -O3) by aggressively unrolling.
As it was said before though, it depends a lot on the loop and the compiler and experiment is necessary. It's hard to make a rule (or the compiler heuristic for unrolling would be perfect)

Loop unrolling entirely depends on your problem size. It is entirely dependent on your algorithm being able to reduce the size into smaller groups of work. What you did above does not look like that. I am not sure if a monte carlo simulation can even be unrolled.
I good scenario for loop unrolling would be rotating an image. Since you could rotate separate groups of work. To get this to work you would have to reduce the number of iterations.

Loop unrolling is still useful if there are a lot of local variables both in and with the loop. To reuse those registers more instead of saving one for the loop index.
In your example, you use small amount of local variables, not overusing the registers.
Comparison (to loop end) are also a major drawback if the comparison is heavy (i.e non-test instruction), especially if it depends on an external function.
Loop unrolling helps increasing the CPU's awareness for branch prediction as well, but those occur anyway.

Why differ(!=,<>) is faster than equal(=,==)?

I've seen comments on SO saying "<> is faster than =" or "!= faster than ==" in an if() statement.
I'd like to know why is that so. Could you show an example in asm?
Thanks! :)
EDIT:
Source
Here is what he did.
function Check(var MemoryData:Array of byte;MemorySignature:Array of byte;Position:integer):boolean;
var i:byte;
begin
Result := True; //moved at top. Your function always returned 'True'. This is what you wanted?
for i := 0 to Length(MemorySignature) - 1 do //are you sure??? Perhaps you want High(MemorySignature) here...
begin
{!} if MemorySignature[i] <> $FF then //speedup - '<>' evaluates faster than '='
begin
Result:=memorydata[i + position] <> MemorySignature[i]; //speedup.
if not Result then
Break; //added this! - speedup. We already know the result. So, no need to scan till end.
end;
end;
end;

I'd claim that this is flat out wrong except perhaps in very special circumstances. Compilers can refactor one into the other effortlessly (by just switching the if and else cases).

It could have something to do with branch prediction on the CPU. Static branch prediction would predict that a branch simply wouldn't be taken and fetch the next instruction. However, hardly anybody uses that anymore. Other than that, I'd say it's bull because the comparisons should be identical.

I think there's some confusion in your previous question about what the algorithm was that you were trying to implement, and therefore in what the claimed "speedup" purports to do.
Here's some disassembly from Delphi 2007. optimization on. (Note, optimization off changed the code a little, but not in a relevant way.
Unit70.pas.31: for I := 0 to 100 do
004552B5 33C0 xor eax,eax
Unit70.pas.33: if i = j then
004552B7 3B02 cmp eax,[edx]
004552B9 7506 jnz $004552c1
Unit70.pas.34: k := k+1;
004552BB FF05D0DC4500 inc dword ptr [$0045dcd0]
Unit70.pas.35: if i <> j then
004552C1 3B02 cmp eax,[edx]
004552C3 7406 jz $004552cb
Unit70.pas.36: l := l + 1;
004552C5 FF05D4DC4500 inc dword ptr [$0045dcd4]
Unit70.pas.37: end;
004552CB 40 inc eax
Unit70.pas.31: for I := 0 to 100 do
004552CC 83F865 cmp eax,$65
004552CF 75E6 jnz $004552b7
Unit70.pas.38: end;
004552D1 C3 ret
As you can see, the only difference between the two cases is a jz vs. a jnz instruction. These WILL run at the same speed. what's likely to affect things much more is how often the branch is taken, and if the entire loop fits into cache.

For .Net languages
If you look at the IL from the string.op_Equality and string.op_Inequality methods, you will see that both internall call string.Equals.
But the op_Inequality inverts the result. This is two IL-statements more.
I would say they the performance is the same, with maybe a small (very small, very very small) better performance for the == statement. But I believe that the optimizer & JIT compiler will remove this.

Spontaneous though; most other things in your code will affect performance more than the choice between == and != (or = and <> depending on language).
When I ran a test in C# over 1000000 iterations of comparing strings (containing the alphabet, a-z, with the last two letters reversed in one of them), the difference was between 0 an 1 milliseconds.
It has been said before: write code for readability; change into more performant code when it has been established that it will make a difference.
Edit: repeated the same test with byte arrays; same thing; the performance difference is neglectible.

It could also be a result of misinterpretation of an experiment.
Most compilers/optimizers assume a branch is taken by default. If you invert the operator and the if-then-else order, and the branch that is now taken is the ELSE clause, that might cause an additional speed effect in highly calculating code (*)
(*) obviously you need to do a lot of operations for that. But it can matter for the tightest loops in e.g. codecs or image analysis/machine vision where you have 50MByte/s of data to trawl through.
.... and then I even only stoop to this level for the really heavily reusable code. For ordinary business code it is not worth it.

I'd claim this was flat out wrong full stop. The test for equality is always the same as the test for inequality. With string (or complex structure testing), you're always going to break at exactly the same point. Until that break point is reached, then the answer for equality is unknown.

I strongly doubt there is any speed difference. For integral types for example you are getting a CMP instruction and either JZ (Jump if zero) or JNZ (Jump if not zero), depending on whether you used = or ≠. There is no speed difference here and I'd expect that to hold true at higher levels too.

If you can provide a small example that clearly shows a difference, then I'm sure the Stack Overflow community could explain why. However, I think you might have difficulty constructing a clear example. I don't think there will be any performance difference noticeable at any reasonable scale.

Well it could be or it couldn't be, that is the question :-)
The thing is this is highly depending on the programming language you are using.
Since all your statements will eventually end up as instructions to the CPU, the one that uses the least amount of instruction to achieve the result will be the fastest.
For example if you say bits x is equal to bits y, you could use the instruction that does an XOR using both bits as an input, if the result is anything but 0 it is not the same. So how would you know that the result is anything but 0? By using the instruction that returns true if you say input a is bigger than 0.
So this is already 2 instructions you use to do it, but since most CPU's have an instruction that does compare in a single cycle it is a bad example.
The point I am making is still the same, you can't make this generally statements without providing the programming language and the CPU architecture.

This list (assuming it's on x86) of ASM instructions might help:
Jump if greater
Jump on equality
Comparison between two registers
(Disclaimer, I have nothing more than very basic experience with writing assembler so I could be off the mark)
However it obviously depends purely on what assembly instructions the Delphi compiler is producing. Without seeing that output then it's guesswork. I'm going to keep my Donald Knuth quote in as caring about this kind of thing for all but a niche set of applications (games, mobile devices, high performance server apps, safety critical software, missile launchers etc.) is the thing you worry about last in my view.
"We should forget about small
efficiencies, say about 97% of the
time: premature optimization is the
root of all evil."
If you're writing one of those or similar then obviously you do care, but you didn't specify it.

Just guessing, but given you want to preserve the logic, you cannot just replace
if A = B then
with
if A <> B then
To conserve the logic, the original code must have been something like
if not (A = B) then
or
if A <> B then
else
and that may truely be a little bit slower than the test on inequality.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio