Here is a benchmark:
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn benchmark_or(repetitions: usize, mut increment: usize) -> usize {
    let mut batch = 0;
    for _ in 0..repetitions {
        increment |= 1;
        batch += increment | 1;
    }
    batch
}

fn benchmark_xor(repetitions: usize, mut increment: usize) -> usize {
    let mut batch = 0;
    for _ in 0..repetitions {
        increment ^= 1;
        batch += increment | 1;
    }
    batch
}

fn benchmark(c: &mut Criterion) {
    let increment = 1;
    let repetitions = 1000;
    c.bench_function("Increment Or", |b| {
        b.iter(|| black_box(benchmark_or(repetitions, increment)))
    });
    c.bench_function("Increment Xor", |b| {
        b.iter(|| black_box(benchmark_xor(repetitions, increment)))
    });
}

criterion_group!(benches, benchmark);
criterion_main!(benches);
The results are:
Increment Or time: [271.02 ns 271.14 ns 271.28 ns]
Increment Xor time: [79.656 ns 79.761 ns 79.885 ns]
I get the same result if I replace OR with AND.
It's quite confusing, as the OR bench compiles to:
.LBB0_5:
or edi, 1
add eax, edi
add rcx, -1
jne .LBB0_5
And the xor bench compiles to basically the same instructions plus two additional ones:
.LBB1_6:
xor edx, 1
or edi, 1
add eax, edi
mov edi, edx
add rcx, -1
jne .LBB1_6
Full Assembly code
Why is the difference so large?
This part of the function that uses XOR, which you quoted:
.LBB1_6:
xor rdx, 1
or rsi, 1
add rax, rsi
mov rsi, rdx
add rcx, -1
jne .LBB1_6
Is only the "tail end" of an unrolled loop. The "meat" (the part that actually runs a lot) is this:
.LBB1_9:
add rax, rdx
add rdi, 4
jne .LBB1_9
rdx is set up to be 4 times increment - in a way that I would describe as "only a compiler could be this stupid", but it only happens outside the loop so it's not a complete disaster. The loop counter is advanced by 4 in every iteration (starting negative and counting up to zero, which is clever, redeeming the compiler somewhat).
This loop could be executed at 1 iteration per cycle, translating to 4 iterations of the source-loop per cycle.
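As a rough C sketch of my reading of that hot loop (hypothetical names, not the compiler's actual output), each iteration does one add that covers 4 source-loop iterations:

/* batch ~ rax, step4 ~ rdx (set up once, outside the loop, as 4 times the
   per-iteration addend), n4 ~ the count rdi is initialised from (assumed
   positive and a multiple of 4, as the code around the asm guarantees). */
static unsigned long xor_hot_loop(unsigned long batch, unsigned long step4, long n4) {
    long i = -n4;              /* rdi starts negative and counts up to zero */
    do {
        batch += step4;        /* add rax, rdx */
        i += 4;                /* add rdi, 4 */
    } while (i != 0);          /* jne .LBB1_9 */
    return batch;
}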
The loop in the function that uses OR is also unrolled, this is the actual "meat" of that function:
.LBB0_8:
or rsi, 1
lea rax, [rax + 2*rsi]
lea rax, [rax + 2*rsi]
lea rax, [rax + 2*rsi]
lea rax, [rax + 2*rsi]
add rdi, 8
jne .LBB0_8
It's unrolled by 8, which might have been nice, but chaining lea 4 times like that really takes "only a compiler could be this stupid" to the next level. The serial dependency through the leas costs at least 4 cycles per iteration, translating to 2 iterations of the source-loop per cycle.
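Written out as C (again only a sketch of the dependency structure, with hypothetical names acc and step), the problem is that every update needs the result of the previous one:

/* acc ~ rax, step ~ rsi after the "or rsi, 1". Each statement maps to one lea;
   the four form a serial chain of at least 4 cycles per unrolled iteration,
   even though together they only add 8*step. */
static unsigned long or_hot_body(unsigned long acc, unsigned long step) {
    acc = acc + 2 * step;      /* lea rax, [rax + 2*rsi]  (#1) */
    acc = acc + 2 * step;      /* #2, waits for #1 */
    acc = acc + 2 * step;      /* #3, waits for #2 */
    acc = acc + 2 * step;      /* #4, waits for #3 */
    return acc;
}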
That explains a 2x difference in performance (in favour of the XOR version), not quite your measured 3.4x difference, so further analysis could be done.
I am currently learning Rust, and as a first exercise I wanted to implement a function that computes the nth fibonacci number:
fn main() {
    for i in 0..48 {
        println!("{}: {}", i, fibonacci(i));
    }
}

fn fibonacci(n: u32) -> u32 {
    match n {
        0 => 0,
        1 => 1,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}
I run it as:
$ time cargo run --release
real 0m15.380s
user 0m15.362s
sys 0m0.014s
As an exercise, I also implemented the same algorithm in C++. I was expecting a similar performance, but the C++ code runs in 80% of the time:
#include <iostream>

unsigned int fibonacci(unsigned int n);

int main(int argc, char* argv[]) {
    for (unsigned int i = 0; i < 48; ++i) {
        std::cout << i << ": " << fibonacci(i) << '\n';
    }
    return 0;
}

unsigned int fibonacci(unsigned int n) {
    if (n == 0) {
        return 0;
    } else if (n == 1) {
        return 1;
    } else {
        return fibonacci(n - 1) + fibonacci(n - 2);
    }
}
Compiled as:
$ g++ test.cpp -o test.exe -O2
And running:
$ time ./test.exe
real 0m12.127s
user 0m12.124s
sys 0m0.000s
Why do I see such a difference in performance? I am not interested in calculating Fibonacci numbers faster in Rust (with a different algorithm); I am only interested in where the difference comes from. This is just an exercise in my progress as I learn Rust.
TL;DR: It's not Rust vs C++, it's LLVM (Clang) vs GCC.
Different optimizers optimize the code differently, and in this case GCC produces larger but faster code.
This can be verified using godbolt.
Here is the Rust version, compiled first with GCC (via rustgcc-master):
example::fibonacci:
push r15
push r14
push r13
push r12
push rbp
xor ebp, ebp
push rbx
mov ebx, edi
sub rsp, 24
.L2:
test ebx, ebx
je .L1
cmp ebx, 1
je .L4
lea r12d, -1[rbx]
xor r13d, r13d
.L19:
cmp r12d, 1
je .L6
lea r14d, -1[r12]
xor r15d, r15d
.L16:
cmp r14d, 1
je .L8
lea edx, -1[r14]
xor ecx, ecx
.L13:
cmp edx, 1
je .L10
lea edi, -1[rdx]
mov DWORD PTR 12[rsp], ecx
mov DWORD PTR 8[rsp], edx
call example::fibonacci.localalias
mov ecx, DWORD PTR 12[rsp]
mov edx, DWORD PTR 8[rsp]
add ecx, eax
sub edx, 2
jne .L13
.L14:
add r15d, ecx
sub r14d, 2
je .L17
jmp .L16
.L4:
add ebp, 1
.L1:
add rsp, 24
mov eax, ebp
pop rbx
pop rbp
pop r12
pop r13
pop r14
pop r15
ret
.L6:
add r13d, 1
.L20:
sub ebx, 2
add ebp, r13d
jmp .L2
.L8:
add r15d, 1
.L17:
add r13d, r15d
sub r12d, 2
je .L20
jmp .L19
.L10:
add ecx, 1
jmp .L14
And with LLVM (via rustc):
example::fibonacci:
push rbp
push r14
push rbx
mov ebx, edi
xor ebp, ebp
mov r14, qword ptr [rip + example::fibonacci#GOTPCREL]
cmp ebx, 2
jb .LBB0_3
.LBB0_2:
lea edi, [rbx - 1]
call r14
add ebp, eax
add ebx, -2
cmp ebx, 2
jae .LBB0_2
.LBB0_3:
add ebx, ebp
mov eax, ebx
pop rbx
pop r14
pop rbp
ret
We can see that LLVM produces a naive version -- calling the function in each iteration of the loop -- while GCC partially unrolls the recursion by inlining some calls. This results in a smaller number of calls in the case of GCC, and at about 5ns of overhead per function call, it's significant enough.
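Roughly speaking (a C sketch of the control flow I read out of the asm above, with my own naming; it is not either compiler's actual output), LLVM turns the fibonacci(n - 2) recursion into a loop but still emits a real call for fibonacci(n - 1) on every iteration, while GCC additionally inlines that remaining call several levels deep before calling:

/* The shape of the LLVM output: one genuine recursive call per loop iteration. */
unsigned int fib_llvm_shape(unsigned int n) {
    unsigned int acc = 0;
    while (n >= 2) {
        acc += fib_llvm_shape(n - 1);  /* still a real call */
        n -= 2;                        /* the fib(n - 2) branch became this loop */
    }
    return acc + n;                    /* n is 0 or 1 here */
}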
We can do the same exercise with the C++ version, compiling it with both Clang (LLVM) and GCC, and note that the results are much the same.
So, as announced, it's an LLVM vs GCC difference, not a language one.
Incidentally, the fact that optimizers may produce such widely different results is a reason why I am quite excited by the progress of the rustc_codegen_gcc initiative (dubbed rustgcc-master on godbolt), which aims at plugging a GCC backend into the rustc frontend: once complete, anyone will be able to switch to the better optimizer for their own workload.
example of computed goto:
...
GO TO ( 10, 20, 30, 40 ), N
...
10 CONTINUE
...
20 CONTINUE
...
30 CONTINUE
...
40 CONTINUE
If N equals one, then go to 10.
If N equals two, then go to 20.
If N equals three, then go to 30.
If N equals four, then go to 40.
What code does the compiler generate for such a goto in the final stage of compilation?
The most common way of compiling computed goto is a static jump table and an indirect branch instruction. For example (without -fPIC):
int test(int num) {
    const void * const labels[] = {&&a, &&b, &&cl};
    goto *labels[num];
a:  return 1;
b:  return 2;
cl: return 3;
}
Is going to be compiled as:
test(int): # #test(int)
movsxd rax, edi
jmp qword ptr [8*rax + .L__const.test(int).labels]
.Ltmp0: # Block address taken
mov eax, 1
ret
.Ltmp1: # Block address taken
mov eax, 3
ret
.Ltmp2: # Block address taken
mov eax, 2
ret
.L__const.test(int).labels:
.quad .Ltmp0
.quad .Ltmp2
.quad .Ltmp1
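Note that the compiler laid the blocks out in a different order (.Ltmp1 returns 3, .Ltmp2 returns 2), but the table entries are emitted so that indexing still matches the source labels. A small hypothetical caller to check it:

#include <stdio.h>

int test(int num);   /* the computed-goto function above */

int main(void) {
    /* num = 1 loads the second table entry, which points at the b: label */
    printf("%d\n", test(1));   /* prints 2 */
    return 0;
}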
So we have this "safes" challenge in assembly: you need to create safes, and keys that will break them and end their infinite loops.
Here's an example for a safe:
loopy:
mov ax, [1900]
cmp ax,1234
jne loopy
and a key:
loopy2:
mov ax, 1234
mov [1900],ax
jmp loopy2
So I have a safe and a key, and I don't understand why it doesn't work:
here's my safe:
org 100h
mySafe:
mov dx,5
mov ax, [5768h]
mov bx,7
mov word [180h],2
mul word [180h]
mov [180h],bx
push ax
dec bx
mov cx,dx
mov ax,dx
loopy1:
add bx,ax
loop loopy1
dec bx
pop ax
add ax,bx
mul word [180h]
cmp ax,350
jne mySafe
And here's my key:
org 100h
loopy:
mov word [5768h],10
jmp loopy
ret
The right answer to break the loop should be 10, and it works when I put it into the safe directly; somehow with the key it doesn't work and I can't figure out why.
(the "word" is needed for NASM)
The value in dx used as the counter for the loop instruction comes from the first mul instruction.
This multiplication is just doubling the key, so dx is either 0 or 1 (an easy way to see this is to think of the multiplication as a left shift by one or by remembering that the sum of two n-bit numbers has at most n+1 bits)
If dx is zero, the whole loopy1 block does nothing (as dx also sets ax), and the value in ax at the end of the safe is 7*(5 + 2k), where k is the key (see the commented code below).
It is then easy to see that 350 = 7*(5 + 2k) => 2k = 45 has no solution, since 2k is even. Therefore no key for which dx is zero can unlock the safe.
A key has dx 0 iff its value is less than 32768 (again, this is easy to see when thinking of the multiplication as a left shift by one).
Corollary: 10 cannot be a solution.
safe:
mov dx,5
mov ax, [k] ;ax = k (key)
mov bx,7
mov word [aux],2
mul word [aux] ;dx = 0 ax = 2k
mov [aux],bx ;aux = 7
push ax ;ax = 2k
dec bx ;bx = 6
dec bx ;bx = 5
pop ax ;ax = 2k
add ax,bx ;ax = 5 + 2k
mul word [aux] ;ax = 7*(5 +2k)
cmp ax,350
ret
If there is a key that unlocks the safe then it must be greater or equal to 32768 so that dx is 1 after the first multiplication.
With this condition, the value in ax at the end of the safe can be written as 7*(6 + (2k & 0xffff)) = 350 => 2k & 0xffff = 44 => k & 0x7fff = 22.
Adding the condition stated at the very beginning of this section, the final value for k is 32768 + 22 = 32790 or 0x8016 in hex.
I've leaped over quite a few logical steps in manipulating the equation and forming the result, but, again, thinking of 2k as a shift may help visualize them.
Corollary: Due to the algebraic structure involved, this is the only solution.
safe:
mov dx,5
mov ax, [k] ;ax = k
mov bx,7
mov word [aux],2
mul word [aux] ;dx:ax = 2k
mov [aux],bx ;[aux] = 7
push ax ;dx = 1 ax = 2k & 0xffff
dec bx ;bx = 6
mov cx,dx ;cx = 1
mov ax,dx ;ax = 1
loopy1:
add bx,ax ;bx = 6 + 1
dec cx
jnz loopy1
dec bx ;bx = 6
pop ax ;ax = 2k & 0xffff
add ax,bx ;ax = 6 + (2k & 0xffff)
mul word [aux] ;ax = 7*(6 + (2k & 0xffff))
cmp ax,350
ret
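To double-check the algebra numerically, here is a small C model of the safe's arithmetic (my own sketch: 16-bit registers are modelled with uint16_t, and the key is passed in directly instead of being read from [5768h]):

#include <assert.h>
#include <stdint.h>

static uint16_t safe_ax(uint16_t k) {
    uint32_t prod = (uint32_t)k * 2;      /* mul word [aux] with aux = 2 */
    uint16_t dx = (uint16_t)(prod >> 16); /* high half: 0 or 1 */
    uint16_t ax = (uint16_t)prod;         /* low half: 2k & 0xffff */
    uint16_t bx = 7 - 1;                  /* mov bx,7 ; dec bx */
    uint16_t cx = dx;                     /* mov cx,dx ; mov ax,dx */
    do { bx += dx; } while (--cx != 0);   /* loopy1: add bx,ax ; loop loopy1 */
    bx -= 1;                              /* dec bx */
    return (uint16_t)((ax + bx) * 7);     /* pop ax ; add ax,bx ; mul word [aux] (aux = 7) */
}

int main(void) {
    assert(safe_ax(32790) == 350);        /* 0x8016 unlocks the safe */
    assert(safe_ax(10) != 350);           /* 10 does not */
    return 0;
}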
Considering that you have a mov dx, 5 before the first multiplication, did you (or the author of the safe) forget that mul affects dx?
If you wrap the first mul in push dx / pop dx (or just move mov dx, 5 after it), you would get, at the end of the safe, a value in ax equal to 7*(30 + 2k), which implies k = 10 indeed.
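(Check: 7*(30 + 2*10) = 7*50 = 350, exactly the value the cmp ax,350 tests for.)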
Debugging my code in VS2015, I get to the end of the program. The registers are as they should be; however, call ExitProcess, or any variation of it, causes an "Access violation writing location 0x00000004." I am utilizing Irvine32.inc from Kip Irvine's book. I have tried using call DumpRegs, but that too throws the error.
I have tried other variations of call ExitProcess, such as exit and invoke ExitProcess,0, which did not work either, throwing the same error. Before, when I used the same format, the code worked fine. The only difference between this code and the last one is the use of the general-purpose registers.
include Irvine32.inc
.data
;ary dword 100, -30, 25, 14, 35, -92, 82, 134, 193, 99, 0
ary dword -24, 1, -5, 30, 35, 81, 94, 143, 0
.code
main PROC
;ESI will be used for the array
;EDI will be used for the array value
;ESP will be used for the array counting
;EAX will be used for the accumulating sum
;EBX will be used for the average
;ECX will be used for the remainder of avg
;EBP will be used for calculating remaining sum
mov eax,0 ;Set EAX register to 0
mov ebx,0 ;Set EBX register to 0
mov esp,0 ;Set ESP register to 0
mov esi,OFFSET ary ;Set ESI register to array
sum: mov edi,[esi] ;Set value to array value
cmp edi,0 ;Check value to temination value 0
je finsum ;If equal, jump to finsum
add esp,1 ;Add 1 to array count
add eax,edi ;Add value to sum
add esi,4 ;Increment to next address in array
jmp sum ;Loop back to sum array
finsum: mov ebp,eax ;Set remaining sum to the sum
cmp ebp,0 ;Compare rem sum to 0
je finavg ;Jump to finavg if sum is 0
cmp ebp,esp ;Check sum to array count
jl finavg ;Jump to finavg if sum is less than array count
avg: add ebx,1 ;Add to average
sub ebp,esp ;Subtract array count from rem sum
cmp ebp,esp ;Compare rem sum to array count
jge avg ;Jump to avg if rem sum is >= to ary count
finavg: mov ecx,ebp ;Set rem sum to remainder of avg
call ExitProcess
main ENDP
END MAIN
Registers before call ExitProcess
EAX = 00000163 EBX = 0000002C ECX = 00000003 EDX = 00401055
ESI = 004068C0 EDI = 00000000 EIP = 0040366B ESP = 00000008
EBP = 00000003 EFL = 00000293
OV = 0 UP = 0 EI = 1 PL = 1 ZR = 0 AC = 1 PE = 0 CY = 1
mov esp,0 sets the stack pointer to 0. Any stack instructions like push/pop or call/ret will crash after you do that.
Pick a different register for your array-count temporary, not the stack pointer! You have 7 other choices, looks like you still have EDX unused.
In the normal calling convention, only EAX, ECX, and EDX are call-clobbered (so you can use them without preserving the caller's value). But you're calling ExitProcess instead of returning from main, so you can destroy all the registers. But ESP has to be valid when you call.
call works by pushing a return address onto the stack, like sub esp,4 / mov [esp], next_instruction / jmp ExitProcess. See https://www.felixcloutier.com/x86/CALL.html. As your register-dump shows, ESP=8 before the call, which is why it's trying to store to absolute address 4.
Your code has 2 sections: looping over the array and then finding the average. You can reuse a register for different things in the 2 sections, often vastly reducing register pressure. (i.e. you don't run out of registers.)
Using implicit-length arrays (terminated by a sentinel element like 0) is unusual outside of strings. It's much more common to pass a function a pointer + length, instead of just a pointer.
But anyway, you have an implicit-length array so you have to find its length and remember that when calculating the average. Instead of incrementing a size counter inside the loop, you can calculate it from the pointer you're also incrementing. (Or use the counter as an array index like ary[ecx*4], but pointer-increments are often more efficient.)
Here's what an efficient (scalar) implementation might look like. (With SSE2 for SIMD you could add 4 elements with one instruction...)
It only uses 3 registers total. I could have used ECX instead of ESI (so main could ret without having destroyed any of the registers the caller expected it to preserve, only EAX, ECX, and EDX), but I kept ESI for consistency with your version.
.data
;ary dword 100, -30, 25, 14, 35, -92, 82, 134, 193, 99, 0
ary dword -24, 1, -5, 30, 35, 81, 94, 143, 0
.code
main PROC
;; inputs: static ary of signed dword integers
;; outputs: EAX = array average, EDX = remainder of sum/size
;; ESI = array count (in elements)
;; clobbers: none (other than the outputs)
; EAX = sum accumulator
; ESI = array pointer
; EDX = array element temporary
xor eax, eax ; sum = 0
mov esi, OFFSET ary ; incrementing a pointer is usually efficient, vs. ary[ecx*4] inside a loop or something. So this is good.
sumloop: ; do {
mov edx, [esi]
add esi, 4 ; advance the pointer (p++)
add eax, edx ; sum += *p++ without checking for 0, because + 0 is a no-op
test edx, edx ; sets FLAGS the same as cmp edx,0
jnz sumloop ; }while(array element != 0);
;;; fall through if the element is 0.
;;; esi points to one past the terminator, i.e. two past the last real element we want to count for the average
sub esi, OFFSET ary + 4 ; (end+4) - (start+4) = array size in bytes
shr esi, 2 ; esi = array length = (end-start)/element_size
cdq ; sign-extend sum into EDX:EAX as an input for idiv
idiv esi ; EAX = sum/length EDX = sum%length
call ExitProcess
main ENDP
I used x86's hardware division instruction, instead of a subtraction loop. Your repeated-subtraction loop looked pretty complicated, but manual signed division can be tricky. I don't see where you're handling the possibility of the sum being negative. If your array had a negative sum, repeated subtraction would make it grow until it overflowed. Or in your case, you're breaking out of the loop if sum < count, which will be true on the first iteration for a negative sum.
Note that comments like Set EAX register to 0 are useless. We already know that from reading mov eax,0. sum = 0 describes the semantic meaning, not the architectural effect. There are some tricky x86 instructions where it does make sense to comment about what it even does in this specific case, but mov isn't one of them.
If you just wanted to do repeated subtraction with the assumption that sum is non-negative to start with, it's as simple as this:
;; UNSIGNED division (or signed with non-negative dividend and positive divisor)
; Inputs: sum(dividend) in EAX, count(divisor) in ECX
; Outputs: quotient in EDX, remainder in EAX (reverse of the DIV instruction)
xor edx, edx ; quotient counter = 0
cmp eax, ecx
jb subloop_end ; the quotient = 0 case
repeat_subtraction: ; do {
inc edx ; quotient++
sub eax, ecx ; dividend -= divisor
cmp eax, ecx
jae repeat_subtraction ; while( dividend >= divisor );
; fall through when eax < ecx (unsigned), leaving EAX = remainder
subloop_end:
Notice how checking for special cases before entering the loop lets us simplify it. See also Why are loops always compiled into "do...while" style (tail jump)?
sub eax, ecx and cmp eax, ecx in the same loop seems redundant: we could just use sub to set flags, and correct for the overshoot.
xor edx, edx ; quotient counter = 0
cmp eax, ecx
jb division_done ; the quotient = 0 case
repeat_subtraction: ; do {
inc edx ; quotient++
sub eax, ecx ; dividend -= divisor
jnc repeat_subtraction ; while( dividend -= divisor doesn't wrap (carry) );
add eax, ecx ; correct for the overshoot
dec edx
division_done:
(But this isn't actually faster in most cases on most modern x86 CPUs; they can run the inc, cmp, and sub in parallel even if the inputs weren't the same. This would maybe help on AMD Bulldozer-family where the integer cores are pretty narrow.)
Obviously repeated subtraction is total garbage for performance with large numbers. It is possible to implement better algorithms, like one-bit-at-a-time long-division, but the idiv instruction is going to be faster for anything except the case where you know the quotient is 0 or 1, so it takes at most 1 subtraction. (div/idiv is pretty slow compared to any other integer operation, but the dedicated hardware is much faster than looping.)
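For reference, a minimal C sketch of that one-bit-at-a-time (shift-and-subtract) long division, my own illustration rather than anything from the book: at most one subtraction per bit of the dividend instead of one per unit of the quotient.

#include <stdint.h>

/* Unsigned 32-bit shift-and-subtract long division. Assumes divisor != 0. */
static uint32_t long_div(uint32_t dividend, uint32_t divisor, uint32_t *remainder) {
    uint32_t quotient = 0, rem = 0;
    for (int bit = 31; bit >= 0; bit--) {
        rem = (rem << 1) | ((dividend >> bit) & 1);  /* bring down the next bit */
        if (rem >= divisor) {                        /* at most one subtraction per bit */
            rem -= divisor;
            quotient |= 1u << bit;
        }
    }
    *remainder = rem;
    return quotient;
}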
If you do need to implement signed division manually, normally you record the signs, take the unsigned absolute value, then do unsigned division.
e.g. xor eax, ecx / sets dl gives you dl=0 if EAX and ECX had the same sign, or 1 if they were different (and thus the quotient will be negative). (SF is set according to the sign bit of the result, and XOR produces 1 for different inputs, 0 for same inputs.)
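A rough C sketch of that recipe (illustration only; it ignores the INT32_MIN edge case): record whether the input signs differ, divide the absolute values as unsigned, then negate the quotient if needed.

#include <stdint.h>

/* Signed division built on unsigned division; truncates toward zero like idiv.
   Assumes divisor != 0 and the quotient fits in int32_t. */
static int32_t signed_div(int32_t dividend, int32_t divisor) {
    int negative = (dividend ^ divisor) < 0;              /* signs differ => negative quotient */
    uint32_t ua = dividend < 0 ? 0u - (uint32_t)dividend : (uint32_t)dividend;
    uint32_t ub = divisor  < 0 ? 0u - (uint32_t)divisor  : (uint32_t)divisor;
    uint32_t q  = ua / ub;                                /* unsigned divide (div, or a loop) */
    return negative ? -(int32_t)q : (int32_t)q;
}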
Very specific optimisation task.
I have 3 arrays:
const char* inputTape
const int* inputOffset, organised in a group of four
char* outputTapeoutput
which i must assemble output tape from input, according to following 5 operations:
int selectorOffset = inputOffset[4*i];
char selectorValue = inputTape[selectorOffset];
int outputOffset = inputOffset[4*i+1+selectorValue];
char outputValue = inputTape[outputOffset];
outputTape[i] = outputValue; // store byte
and then advance the counter.
All iterations are independent of each other and could all be done in parallel. The format of inputOffset could be changed, as long as the same input produces the same output.
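For clarity, here is the scalar loop written out as a self-contained C function (my own wrapper; the count parameter and exact pointer types are assumptions):

#include <stddef.h>

/* One output byte per iteration: the first offset of the group selects a byte
   from the input tape, and that byte selects which of the next three offsets
   is used to fetch the byte that is actually stored. */
void assemble(const char *inputTape, const int *inputOffset,
              char *outputTape, size_t count) {
    for (size_t i = 0; i < count; i++) {
        int  selectorOffset = inputOffset[4 * i];
        char selectorValue  = inputTape[selectorOffset];
        int  outputOffset   = inputOffset[4 * i + 1 + selectorValue];
        outputTape[i]       = inputTape[outputOffset];   /* store byte */
    }
}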
OpenCL on a GPU fails on this algorithm (it works at the same speed as the CPU, or even slower).
In assembly, the best I got is 5 mov, 1 lea, and 1 dec per iteration. Upd:
thanks to Peter Cordes' little hint
loop_start:
mov eax,dword ptr [rdx-10h] ; selector offset
movzx r10d,byte ptr [rax+r8] ; selector value
mov eax,dword ptr [rdx+r10*4-0Ch] ; output offset
movzx r10d,byte ptr [r8+rax] ; output value
mov byte ptr [r9+rcx-1],r10b ; store to outputTape
lea rdx, [rdx-10h] ; pointer to inputOffset for current
dec ecx ; loop counter, sets zero flag if (ecx == 0)
jne loop_start ; continue looping while non zero iterations left: ( ecx != 0 )
How could I optimise this for SSE/AVX operations? I am stumped...
UPD: better to see it than to hear it..