Why is it necessary to use edi constraint in this inline assembly? - gcc

centos 6.5 64bit vps, 500MB ram gcc 4.8.2
I have the following function that works only if I use edi as the constraint to hold the string pointer. If I try to use any other register or constraintg or q etc, it segfaults.
BUT this problem only occurs when both link time optimization and o3 are used together. If o2 it's fine. If I don't use -flto, it's fine. But both together then the only register I can use that doesn't crash is edi
gcc -flto
CFLAGS=-I. -flto -std=gnu11 -msse4.2 -fno-builtin-printf -Wall -Winline -Wstrict-aliasing -g -pg -O3 -lrt -lpthread
It seems like there might be some sort of register clobbering going on or something else. I'm really at a loss to understand why and how to fix this. Another interesting aspect is the generated assembly puts rdi into rdx before using the pointer but if I try to use either register as the input constraint... it segfaults! If it fails under aggressive compiling options it suggests to me either the compiler is stuffing up somehow, or more likely I'm doing something wrong.
char *sse4_strCRLF(char *str)
{
__m128i M = _mm_set1_epi8(13);
char *res;
__asm__ __volatile__(
"xor %0,%0\n\t"
"sub $1, %1\n\t"
"1:" "sub $15,%1\n\t"
".align 16\n\t"
"2:" "add $16, %1\n\t"
"pcmpistri $0x08,(%1),%2\n\t"
"ja 2b\n\t"
"jnc 2f\n\t"
"cmpb $10,1(%1,%%rcx)\n\t"
"jne 1b\n\t"
"add %%rcx,%1\n\t"
"mov %1,%0\n\t"
"2:"
:"=q"(res)
:"edi"(str),"x"(M) //<-- if use anything except edi, it segfaults
:"rcx"
);
return (char*) res;
}
Disassembled output:
00000000000002e0 <sse4_strCRLF>:
2e0: 55 push rbp
2e1: 48 89 e5 mov rbp,rsp
2e4: e8 00 00 00 00 call 2e9 <sse4_strCRLF+0x9>
2e9: 66 0f 6f 05 00 00 00 00 movdqa xmm0,[rip+0x0] # 2f1 <sse4_strCRLF+0x11>
2f1: 48 89 fa mov rdx,rdi //<--- puts rdi into rdx!
2f4: 48 31 c0 xor rax,rax
2f7: 48 83 ea 01 sub rdx,0x1
2fb: 48 83 ea 0f sub rdx,0xf
2ff: 90 nop
300: 48 83 c2 10 add rdx,0x10
304: 66 0f 3a 63 02 08 pcmpistri xmm0,[rdx],0x8
30a: 77 f4 ja 300 <sse4_strCRLF+0x20>
30c: 73 0d jae 31b <sse4_strCRLF+0x3b>
30e: 80 7c 0a 01 0a cmp byte[rdx+rcx*1+0x1],0xa
313: 75 e6 jne 2fb <sse4_strCRLF+0x1b>
315: 48 01 ca add rdx,rcx
318: 48 89 d0 mov rax,rdx
31b: 5d pop rbp
31c: c3 ret

#David Wohlferd gave me the answer. It was 2 dumb mistakes I was making due to ignorance and assumptions. The below code is modified such that the input variable char pointer is not modified by the routine. It's copied into a register and that register is used. Also I was mistakenly thinking I could directly specify a particular register as opposed to a b etc.
gcc still seems to be fussy about what constraints I use. e.g. If I use a and b for res and str respectively, it compiles fine but segfaults on running. But using S and D seems to work fine.
#David Wohlferd, I'd like to credit you as the answerer but I don't think I can do that to a comment.
char *sse4_strCRLF(char *str)
{
__m128i M = _mm_set1_epi8(13);
char *res;
__asm__ __volatile__(
"xor %0,%0\n\t"
"mov %1,%%rdx\n\t"
"sub $1,%%rdx\n\t"
"1:" "sub $15,%%rdx\n\t"
".align 16\n\t"
"2:" "add $16, %%rdx\n\t"
"pcmpistri $0x08,(%%rdx),%2\n\t"
"ja 2b\n\t"
"jnc 2f\n\t"
"cmpb $10,1(%%rdx,%%rcx)\n\t"
"jne 1b\n\t"
"add %%rcx,%%rdx\n\t"
"mov %%rdx,%0\n\t"
"2:"
:"=S"(res)
:"D"(str),"x"(M)
:"rcx","rdx"
);
return (char*) res;
}

Related

instruction repeated twice when decoded into machine language,

Am basically learning how to make my own instruction in the X86 architecture, but to do that I am understanding how they are decoded and and interpreted to a low level language,
By taking an example of a simple mov instruction and using the .byte notation I wanted to understand in detail as to how instructions are decoded,
My simple code is as follows:
#include <stdio.h>
#include <iostream>
int main(int argc, char const *argv[])
{
int x{5};
int y{0};
// mov %%eax, %0
asm (".byte 0x8b,0x45,0xf8\n\t" //mov %1, eax
".byte 0x89, 0xC0\n\t"
: "=r" (y)
: "r" (x)
);
printf ("dst value : %d\n", y);
return 0;
}
and when I use objdump to analyze how it is broken down to machine language, i get the following output:
000000000000078a <main>:
78a: 55 push %ebp
78b: 48 dec %eax
78c: 89 e5 mov %esp,%ebp
78e: 48 dec %eax
78f: 83 ec 20 sub $0x20,%esp
792: 89 7d ec mov %edi,-0x14(%ebp)
795: 48 dec %eax
796: 89 75 e0 mov %esi,-0x20(%ebp)
799: c7 45 f8 05 00 00 00 movl $0x5,-0x8(%ebp)
7a0: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp)
7a7: 8b 45 f8 mov -0x8(%ebp),%eax
7aa: 8b 45 f8 mov -0x8(%ebp),%eax
7ad: 89 c0 mov %eax,%eax
7af: 89 45 fc mov %eax,-0x4(%ebp)
7b2: 8b 45 fc mov -0x4(%ebp),%eax
7b5: 89 c6 mov %eax,%esi
7b7: 48 dec %eax
7b8: 8d 3d f7 00 00 00 lea 0xf7,%edi
7be: b8 00 00 00 00 mov $0x0,%eax
7c3: e8 78 fe ff ff call 640 <printf#plt>
7c8: b8 00 00 00 00 mov $0x0,%eax
7cd: c9 leave
7ce: c3 ret
With regard to this output of objdump why is the instruction 7aa: 8b 45 f8 mov -0x8(%ebp),%eax repeated twice, any reason behind it or am I doing something wrong while using the .byte notation?
One of those is compiler-generated, because you asked GCC to have the input in its choice of register for you. That's what "r"(x) means. And you compiled with optimization disabled (the default -O0) so it actually stored x to memory and then reloaded it before your asm statement.
Your code has no business assuming anything about the contents of memory or where EBP points.
Since you're using 89 c0 mov %eax,%eax, the only safe constraints for your asm statement are "a" explicit-register constraints for input and output, forcing the compiler to pick that. If you compile with optimization enabled, your code totally breaks because you lied to the compiler about what your code actually does.
// constraints that match your manually-encoded instruction
asm (".byte 0x89, 0xC0\n\t"
: "=a" (y)
: "a" (x)
);
There's no constraint to force GCC to pick a certain addressing mode for a "m" source or "=m" dest operand so you need to ask for inputs/outputs in specific registers.
If you want to encode your own mov instructions differently from standard mov, see which MOV instructions in the x86 are not used or the least used, and can be used for a custom MOV extension - you might want to use a prefix in front of regular mov opcodes so you can let the assembler encode registers and addressing modes for you, like .byte something; mov %1, %0.
Look at the compiler-generate asm output (gcc -S, not disassembly of the .o or executable). Then you can see which instructions come from the asm statement and which are emitted by GCC.
If you don't explicitly reference some operands in the asm template but still want to see what the compiler picked, you can use them in asm comments like this:
asm (".byte 0x8b,0x45,0xf8 # 0 = %0 1 = %1 \n\t"
".byte 0x89, 0xC0\n\t"
: "=r" (y)
: "r" (x)
);
and gcc will fill it in for you so you can see what operands it expects you to be reading and writing. (Godbolt with g++ -m32 -O3). I put your code in void foo(){} instead of main because GCC -m32 thinks it needs to re-align the stack at the top of main. This makes the code a lot harder to follow.
# gcc-9.2 -O3 -m32 -fverbose-asm
.LC0:
.string "dst value : %d\n"
foo():
subl $20, %esp #,
movl $5, %eax #, tmp84
## Notice that GCC hasn't set up EBP at all before it runs your asm,
## and hasn't stored x in memory.
## It only put it in a register like you asked it to.
.byte 0x8b,0x45,0xf8 # 0 = %eax 1 = %eax # y, tmp84
.byte 0x89, 0xC0
pushl %eax # y
pushl $.LC0 #
call printf #
addl $28, %esp #,
ret
Also note that if you were compiling as 64-bit, it would probably pick %esi as a register because printf will want its 2nd arg there. So the "a" instead of "r" constraint would actually matter.
You could get 32-bit GCC to use a different register if you were assigning to a variable that has to survive across a function call; then GCC would pick a call-preserved reg like EBX instead of EAX.

Why do I find some never called instructions nopl, nopw after ret or jmp in GCC compiled code? [duplicate]

I've been working with C for a short while and very recently started to get into ASM. When I compile a program:
int main(void)
{
int a = 0;
a += 1;
return 0;
}
The objdump disassembly has the code, but nops after the ret:
...
08048394 <main>:
8048394: 55 push %ebp
8048395: 89 e5 mov %esp,%ebp
8048397: 83 ec 10 sub $0x10,%esp
804839a: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp)
80483a1: 83 45 fc 01 addl $0x1,-0x4(%ebp)
80483a5: b8 00 00 00 00 mov $0x0,%eax
80483aa: c9 leave
80483ab: c3 ret
80483ac: 90 nop
80483ad: 90 nop
80483ae: 90 nop
80483af: 90 nop
...
From what I learned nops do nothing, and since after ret wouldn't even be executed.
My question is: why bother? Couldn't ELF(linux-x86) work with a .text section(+main) of any size?
I'd appreciate any help, just trying to learn.
First of all, gcc doesn't always do this. The padding is controlled by -falign-functions, which is automatically turned on by -O2 and -O3:
-falign-functions
-falign-functions=n
Align the start of functions to the next power-of-two greater than n, skipping up to n bytes. For instance,
-falign-functions=32 aligns functions to the next 32-byte boundary, but -falign-functions=24 would align to the next 32-byte boundary only
if this can be done by skipping 23 bytes or less.
-fno-align-functions and -falign-functions=1 are equivalent and mean that functions will not be aligned.
Some assemblers only support this flag when n is a power of two; in
that case, it is rounded up.
If n is not specified or is zero, use a machine-dependent default.
Enabled at levels -O2, -O3.
There could be multiple reasons for doing this, but the main one on x86 is probably this:
Most processors fetch instructions in aligned 16-byte or 32-byte blocks. It can be
advantageous to align critical loop entries and subroutine entries by 16 in order to minimize
the number of 16-byte boundaries in the code. Alternatively, make sure that there is no 16-byte boundary in the first few instructions after a critical loop entry or subroutine entry.
(Quoted from "Optimizing subroutines in assembly
language" by Agner Fog.)
edit: Here is an example that demonstrates the padding:
// align.c
int f(void) { return 0; }
int g(void) { return 0; }
When compiled using gcc 4.4.5 with default settings, I get:
align.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: c9 leaveq
a: c3 retq
000000000000000b <g>:
b: 55 push %rbp
c: 48 89 e5 mov %rsp,%rbp
f: b8 00 00 00 00 mov $0x0,%eax
14: c9 leaveq
15: c3 retq
Specifying -falign-functions gives:
align.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: c9 leaveq
a: c3 retq
b: eb 03 jmp 10 <g>
d: 90 nop
e: 90 nop
f: 90 nop
0000000000000010 <g>:
10: 55 push %rbp
11: 48 89 e5 mov %rsp,%rbp
14: b8 00 00 00 00 mov $0x0,%eax
19: c9 leaveq
1a: c3 retq
This is done to align the next function by 8, 16 or 32-byte boundary.
From “Optimizing subroutines in assembly language” by A.Fog:
11.5 Alignment of code
Most microprocessors fetch code in aligned 16-byte or 32-byte blocks. If an importantsubroutine entry or jump label happens to be near the end of a 16-byte block then themicroprocessor will only get a few useful bytes of code when fetching that block of code. Itmay have to fetch the next 16 bytes too before it can decode the first instructions after thelabel. This can be avoided by aligning important subroutine entries and loop entries by 16.
[...]
Aligning a subroutine entry is as simple as putting as many
NOP
's as needed before thesubroutine entry to make the address divisible by 8, 16, 32 or 64, as desired.
As far as I remember, instructions are pipelined in cpu and different cpu blocks (loader, decoder and such) process subsequent instructions. When RET instructions is being executed, few next instructions are already loaded into cpu pipeline. It's a guess, but you can start digging here and if you find out (maybe the specific number of NOPs that are safe, share your findings please.

Extra bytes at the end of a DOS .COM file, compiled with GCC

I have the following C source file, with some asm blocks that implement a print and exit routine by calling DOS system calls.
__asm__(
".code16gcc;"
"call dosmain;"
"mov $0x4C, %AH;"
"int $0x21;"
);
void print(char *str)
{
__asm__(
"mov $0x09, %%ah;"
"int $0x21;"
: // no output
: "d"(str)
: "ah"
);
}
void dosmain()
{
// DOS system call expects strings to be terminated by $.
print("Hello world$");
}
The linker script file and the build script file are as such,
OUTPUT_FORMAT(binary)
SECTIONS
{
. = 0x0100;
.text :
{
*(.text);
}
.data :
{
*(.data);
*(.bss);
*(.rodata);
}
_heap = ALIGN(4);
}
gcc -fno-pie -Os -nostdlib -ffreestanding -m16 -march=i386 \
-Wl,--nmagic,--script=simple_dos.ld simple_dos.c -o simple_dos.com
I am used to building .COM files in assembly, and I am aware of the structure of a dos file. However in case of the .COM file generated using GCC, I am getting some extra bytes at the end and I cannot figure out why. (The bytes that are inside the shaded area and the box below is what is expected everything else is unaccounted).
[]
My hunch is that these are some static storage used by GCC. I thought this could be due to the string in the program. I have thus commented the line print("Hello world$"); but the extra bytes still reside. It will be of great help if someone knows what is going on and tell how to prevent GCC inserting these bytes in the output.
Source code is available here: Github
PS: Object file also contains these extra bytes.
Since you are using a native compiler and not an i686(or i386) cross compiler you can get a fair amount of extra information. It is rather dependent on the compiler configurations. I would recommend doing the following to remove unwanted code generation and sections:
Use GCC option -fno-asynchronous-unwind-tables to eliminate any .eh_frame sections. This is the cause of the unwanted data appended at the end of your DOS COM program in this case.
Use GCC option -static to build without relocations to avoid any form of dynamic linking.
Have GCC pass the --build-id=none option to the linker with -Wl to avoid unnecessarily generating any .note.gnu.build-id sections.
Modify the linker script to DISCARD any .comment sections.
Your build command could look like:
gcc -fno-pie -static -Os -nostdlib -fno-asynchronous-unwind-tables -ffreestanding \
-m16 -march=i386 -Wl,--build-id=none,--nmagic,--script=simple_dos.ld simple_dos.c \
-o simple_dos.com
I would modify your linker script to look like:
OUTPUT_FORMAT(binary)
SECTIONS
{
. = 0x0100;
.text :
{
*(.text*);
}
.data :
{
*(.data);
*(.rodata*);
*(.bss);
*(COMMON)
}
_heap = ALIGN(4);
/DISCARD/ : { *(.comment); }
}
Besides adding a /DISCARD/ directive to eliminate any .comment sections I also add *(COMMON) along side .bss. Both are BSS sections. I have also moved them after the data sections as they won't take up space in the .COM file if they appear after the other sections. I also changed *(.rodata); to *(.rodata*); and *(.text); to *(.text*); because GCC can generate section names that begin with .rodata and .text but have different suffixes on them.
Inline Assembly
Not related to the problem you asked about, but is important. In this inline assembly:
__asm__(
"mov $0x09, %%ah;"
"int $0x21;"
: // no output
: "d"(str)
: "ah"
);
Int 21h/AH=9h also clobbers AL. You should use ax as the clobber.
Since you are passing the address of an array through a register you will also want to add a memory clobber so that the compiler realizes the entire array into memory before your inline assembly is emitted. The constraint "d"(str) only tells the compiler that you will be using the pointer as input, not what the pointer points at.
Likely if you compiled with optimisations at -O3 you'd probably discover the following version of the program doesn't even have your string "Hello world$" in it because of this bug:
__asm__(
".code16gcc;"
"call dosmain;"
"mov $0x4C, %AH;"
"int $0x21;"
);
void print(char *str)
{
__asm__(
"mov $0x09, %%ah;"
"int $0x21;"
: // no output
: "d"(str)
: "ax");
}
void dosmain()
{
char hello[] = "Hello world$";
print(hello);
}
The generated code for dosmain allocated space on the stack for the string but never put the string on the stack before printing the string:
00000100 <print-0xc>:
100: 66 e8 12 00 00 00 calll 118 <dosmain>
106: b4 4c mov $0x4c,%ah
108: cd 21 int $0x21
10a: 66 90 xchg %eax,%eax
0000010c <print>:
10c: 67 66 8b 54 24 04 mov 0x4(%esp),%edx
112: b4 09 mov $0x9,%ah
114: cd 21 int $0x21
116: 66 c3 retl
00000118 <dosmain>:
118: 66 83 ec 10 sub $0x10,%esp
11c: 67 66 8d 54 24 03 lea 0x3(%esp),%edx
122: b4 09 mov $0x9,%ah
124: cd 21 int $0x21
126: 66 83 c4 10 add $0x10,%esp
12a: 66 c3 retl
If you change the inline assembly to include a "memory" clobber like this:
void print(char *str)
{
__asm__(
"mov $0x09, %%ah;"
"int $0x21;"
: // no output
: "d"(str)
: "ax", "memory");
}
The generated code may look similar to this:
00000100 <print-0xc>:
100: 66 e8 12 00 00 00 calll 118 <dosmain>
106: b4 4c mov $0x4c,%ah
108: cd 21 int $0x21
10a: 66 90 xchg %eax,%eax
0000010c <print>:
10c: 67 66 8b 54 24 04 mov 0x4(%esp),%edx
112: b4 09 mov $0x9,%ah
114: cd 21 int $0x21
116: 66 c3 retl
00000118 <dosmain>:
118: 66 57 push %edi
11a: 66 56 push %esi
11c: 66 83 ec 10 sub $0x10,%esp
120: 67 66 8d 7c 24 03 lea 0x3(%esp),%edi
126: 66 be 48 01 00 00 mov $0x148,%esi
12c: 66 b9 0d 00 00 00 mov $0xd,%ecx
132: f3 a4 rep movsb %ds:(%si),%es:(%di)
134: 67 66 8d 54 24 03 lea 0x3(%esp),%edx
13a: b4 09 mov $0x9,%ah
13c: cd 21 int $0x21
13e: 66 83 c4 10 add $0x10,%esp
142: 66 5e pop %esi
144: 66 5f pop %edi
146: 66 c3 retl
Disassembly of section .rodata.str1.1:
00000148 <_heap-0x10>:
148: 48 dec %ax
149: 65 6c gs insb (%dx),%es:(%di)
14b: 6c insb (%dx),%es:(%di)
14c: 6f outsw %ds:(%si),(%dx)
14d: 20 77 6f and %dh,0x6f(%bx)
150: 72 6c jb 1be <_heap+0x66>
152: 64 24 00 fs and $0x0,%al
An alternate version of the inline assembly that passes the sub function 9 via an a constraint using a variable and marks it as an input/output with + (since the return value of AX gets clobbered) could be done this way:
void print(char *str)
{
unsigned short int write_fun = (0x09<<8) | 0x00;
__asm__ __volatile__ (
"int $0x21;"
: "+a"(write_fun)
: "d"(str)
: "memory"
);
}
Recommendation: Don't use GCC for 16-bit code generation. The inline assembly is difficult to get right and you will probably be using a fair amount of it for low level routines. You could look at Smaller C, Bruce's C compiler, or Openwatcom C as alternatives. All of them can generate DOS COM programs.
The extra data is likely DWARF unwind information. You can stop GCC from generating it with the -fno-asynchronous-unwind-tables option.
You can also have the GNU linker discard the unwind information by adding the following to SECTIONS directive of your linker script:
/DISCARD/ :
{
*(.eh_frame)
}
Also note that generated COM file will be one byte bigger than you expect because of the null byte at the end of the string.

Why does the assembly encoding of objdump vary?

I was reading this article about Position Independent Code and I encountered this assembly listing of a function.
0000043c <ml_func>:
43c: 55 push ebp
43d: 89 e5 mov ebp,esp
43f: e8 16 00 00 00 call 45a <__i686.get_pc_thunk.cx>
444: 81 c1 b0 1b 00 00 add ecx,0x1bb0
44a: 8b 81 f0 ff ff ff mov eax,DWORD PTR [ecx-0x10]
450: 8b 00 mov eax,DWORD PTR [eax]
452: 03 45 08 add eax,DWORD PTR [ebp+0x8]
455: 03 45 0c add eax,DWORD PTR [ebp+0xc]
458: 5d pop ebp
459: c3 ret
0000045a <__i686.get_pc_thunk.cx>:
45a: 8b 0c 24 mov ecx,DWORD PTR [esp]
45d: c3 ret
However, on my machine (gcc-7.3.0, Ubuntu 18.04 x86_64), I got slightly different result below:
0000044d <ml_func>:
44d: 55 push %ebp
44e: 89 e5 mov %esp,%ebp
450: e8 29 00 00 00 call 47e <__x86.get_pc_thunk.ax>
455: 05 ab 1b 00 00 add $0x1bab,%eax
45a: 8b 90 f0 ff ff ff mov -0x10(%eax),%edx
460: 8b 0a mov (%edx),%ecx
462: 8b 55 08 mov 0x8(%ebp),%edx
465: 01 d1 add %edx,%ecx
467: 8b 90 f0 ff ff ff mov -0x10(%eax),%edx
46d: 89 0a mov %ecx,(%edx)
46f: 8b 80 f0 ff ff ff mov -0x10(%eax),%eax
475: 8b 10 mov (%eax),%edx
477: 8b 45 0c mov 0xc(%ebp),%eax
47a: 01 d0 add %edx,%eax
47c: 5d pop %ebp
47d: c3 ret
The main difference I found was that the semantic of mov instruction. In the upper listing, mov ebp,esp actually moves esp to ebp, while in the lower listing, mov %esp,%ebp does the same thing, but the order of operands are different.
This is quite confusing, even when I have to code hand-written assembly. To summarize, my questions are (1) why I got different assembly representations for the same instructions and (2) which one I should use, when writing assembly code (e.g. with __asm(:::);)
obdjump defaults to -Matt AT&T syntax (like your 2nd code block). See att vs. intel-syntax. The tag wikis have some info about the syntax differences: https://stackoverflow.com/tags/att/info vs. https://stackoverflow.com/tags/intel-syntax/info
Either syntax has the same limitations, imposed by what the machine itself can do, and what's encodeable in machine code. They're just different ways of expressing that in text.
Use objdump -d -Mintel for Intel syntax. I use alias disas='objdump -drwC -Mintel' in my .bashrc, so I can disas foo.o and get the format I want, with relocations printed (important for making sense of a non-linked .o), without line-wrapping for long instructions, and with C++ symbol names demangled.
In inline asm, you can use either syntax, as long as it matches what the compiler is expecting. The default is AT&T, and that's what I'd recommend using for compatibility with clang. Maybe there's a way, but clang doesn't work the same way as GCC with -masm=intel.
Also, AT&T is basically standard for GNU C inline asm on x86, and it means you don't need special build options for your code to work.
But you can use gcc -masm=intel to compile source files that use Intel syntax in their asm statements. This is fine for your own use if you don't care about clang.
If you're writing code for a header, you can make it portable between AT&T and Intel syntax using dialect alternatives, at least for GCC:
static inline
void atomic_inc(volatile int *p) {
// use __asm__ instead of asm in headers, so it works even with -std=c11 instead of gnu11
__asm__("lock {addl $1, %0 | add %0, 1}": "+m"(*p));
// TODO: flag output for return value?
// maybe doesn't need to be asm volatile; compilers know that modifying pointed-to memory is a visible side-effect unless it's a local that fully optimizes away.
// If you want this to work as a memory barrier, use a `"memory"` clobber to stop compile-time memory reordering. The lock prefix provides a runtime full barrier
}
source+asm outputs for gcc/clang on the Godbolt compiler explorer.
With g++ -O3 (default or -masm=att), we get
atomic_inc(int volatile*):
lock addl $1, (%rdi) # operand-size is from my explicit addl suffix
ret
With g++ -O3 -masm=intel, we get
atomic_inc(int volatile*):
lock add DWORD PTR [rdi], 1 # operand-size came from the %0 expansion
ret
clang works with the AT&T version, but fails with -masm=intel (or the -mllvm --x86-asm-syntax=intel which that implies), because that apparently only applies to code emitted by LLVM, not for how the front-end fills in the asm template.
The clang error message is:
<source>:4:13: error: unknown use of instruction mnemonic without a size suffix
__asm__("lock {addl $1, %0 | add %0, 1}": "+m"(*p));
^
<inline asm>:1:2: note: instantiated into assembly here
lock add (%rdi), 1
^
1 error generated.
It picked the "Intel" syntax alternative, but still filled in the template with an AT&T memory operand.

Understanding GCC's alloca() alignment and seemingly missed optimization

Consider the following toy example that allocates memory on the stack by means of the alloca() function:
#include <alloca.h>
void foo() {
volatile int *p = alloca(4);
*p = 7;
}
Compiling the function above using gcc 8.2 with -O3 results in the following assembly code:
foo:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
leaq 15(%rsp), %rax
andq $-16, %rax
movl $7, (%rax)
leave
ret
Honestly, I would have expected a more compact assembly code.
16-byte alignment for allocated memory
The instruction andq $-16, %rax in the code above results in rax containing the (only) 16-byte-aligned address between the addresses rsp and rsp + 15 (both inclusive).
This alignment enforcement is the first thing I don't understand: Why does alloca() align the allocated memory to a 16-byte boundary?
Possible missed optimization?
Let's consider anyway that we want the memory allocated by alloca() to be 16-byte aligned. Even so, in the assembly code above, keeping in mind that GCC assumes the stack to be aligned to a 16-byte boundary at the moment of performing the function call (i.e., call foo), if we pay attention to the status of the stack inside foo() just after pushing the rbp register:
Size Stack RSP mod 16 Description
-----------------------------------------------------------------------------------
------------------
| . |
| . |
| . |
------------------........0 at "call foo" (stack 16-byte aligned)
8 bytes | return address |
------------------........8 at foo entry
8 bytes | saved RBP |
------------------........0 <----- RSP is 16-byte aligned!!!
I think that by taking advantage of the red zone (i.e., no need to modify rsp) and the fact that rsp already contains a 16-byte aligned address, the following code could be used instead:
foo:
pushq %rbp
movq %rsp, %rbp
movl $7, -16(%rbp)
leave
ret
The address contained in the register rbp is 16-byte aligned, therefore rbp - 16 will also be aligned to a 16-byte boundary.
Even better, the creation of the new stack frame can be optimized away, since rsp is not modified:
foo:
movl $7, -8(%rsp)
ret
Is this just a missed optimization or I am missing something else here?
This is (partially) missed optimization in gcc. Clang does it as expected.
I said partially because if you know you will be using gcc you can use builtin functions (use conditional compilation for gcc and other compilers to have portable code).
__builtin_alloca_with_align is your friend ;)
Here is an example (changed so the compiler will not reduce function call to single ret):
#include <alloca.h>
volatile int* p;
void foo()
{
p = alloca(4) ;
*p = 7;
}
void zoo()
{
// aligment is 16 bits, not bytes
p = __builtin_alloca_with_align(4,16) ;
*p = 7;
}
int main()
{
foo();
zoo();
}
Disassembled code (with objdump -d -w --insn-width=12 -M intel)
Clang will produce the following code (clang -O3 test.c) - both functions look alike
0000000000400480 <foo>:
400480: 48 8d 44 24 f8 lea rax,[rsp-0x8]
400485: 48 89 05 a4 0b 20 00 mov QWORD PTR [rip+0x200ba4],rax # 601030 <p>
40048c: c7 44 24 f8 07 00 00 00 mov DWORD PTR [rsp-0x8],0x7
400494: c3 ret
00000000004004a0 <zoo>:
4004a0: 48 8d 44 24 fc lea rax,[rsp-0x4]
4004a5: 48 89 05 84 0b 20 00 mov QWORD PTR [rip+0x200b84],rax # 601030 <p>
4004ac: c7 44 24 fc 07 00 00 00 mov DWORD PTR [rsp-0x4],0x7
4004b4: c3 ret
GCC this one (gcc -g -O3 -fno-stack-protector)
0000000000000620 <foo>:
620: 55 push rbp
621: 48 89 e5 mov rbp,rsp
624: 48 83 ec 20 sub rsp,0x20
628: 48 8d 44 24 0f lea rax,[rsp+0xf]
62d: 48 83 e0 f0 and rax,0xfffffffffffffff0
631: 48 89 05 e0 09 20 00 mov QWORD PTR [rip+0x2009e0],rax # 201018 <p>
638: c7 00 07 00 00 00 mov DWORD PTR [rax],0x7
63e: c9 leave
63f: c3 ret
0000000000000640 <zoo>:
640: 48 8d 44 24 fc lea rax,[rsp-0x4]
645: c7 44 24 fc 07 00 00 00 mov DWORD PTR [rsp-0x4],0x7
64d: 48 89 05 c4 09 20 00 mov QWORD PTR [rip+0x2009c4],rax # 201018 <p>
654: c3 ret
As you can see zoo now looks like expected and similar to clang code.
The x86-64 System V ABI requires VLAs (C99 Variable Length Arrays) to be 16-byte aligned, same for automatic / static arrays that are >= 16 bytes.
It looks like gcc is treating alloca as a VLA, and failing to do constant-propagation into an alloca that only runs once per function call. (Or that it internally uses alloca for VLAs.)
A generic alloca / VLA can't use the red-zone, in case the runtime value is larger than 128 bytes. GCC also makes a stack frame with RBP instead of saving the allocation size and doing an add rsp, rdx later.
So the asm looks exactly like what it would if the size was a function arg or other runtime variable instead of a constant. That's what led me to this conclusion.
Also alignof(maxalign_t) == 16 , but alloca and malloc can satisfy the requirement to return memory usable for any object without 16-byte alignment for objects smaller than 16 bytes. None of the standard types have alignment requirements wider than their size in x86-64 SysV.
You're right, it should be able to optimize it to this:
void foo() {
alignas(16) int dummy[1];
volatile int *p = dummy; // alloca(4)
*p = 7;
}
and compile it to the movl $7, -8(%rsp) ; ret you suggested.
The alignas(16) might be optional here for alloca.
If you really need gcc to emit better code when constant propagation makes the arg to alloca a compile-time constant, you could consider simply using a VLA in the first place. GNU C++ supports C99-style VLAs in C++ mode, but ISO C++ (and MSVC) don't.
Or possibly use if(__builtin_constant_p(size)) { VLA version } else { alloca version }, but scoping of VLAs means you can't return a VLA from the scope of an if that detects that we're being inlined with a compile-time constant size. So you'd have to duplicate the code that needs the pointer.

Resources