I was reading this article about Position Independent Code and I encountered this assembly listing of a function.
0000043c <ml_func>:
43c: 55 push ebp
43d: 89 e5 mov ebp,esp
43f: e8 16 00 00 00 call 45a <__i686.get_pc_thunk.cx>
444: 81 c1 b0 1b 00 00 add ecx,0x1bb0
44a: 8b 81 f0 ff ff ff mov eax,DWORD PTR [ecx-0x10]
450: 8b 00 mov eax,DWORD PTR [eax]
452: 03 45 08 add eax,DWORD PTR [ebp+0x8]
455: 03 45 0c add eax,DWORD PTR [ebp+0xc]
458: 5d pop ebp
459: c3 ret
0000045a <__i686.get_pc_thunk.cx>:
45a: 8b 0c 24 mov ecx,DWORD PTR [esp]
45d: c3 ret
However, on my machine (gcc-7.3.0, Ubuntu 18.04 x86_64), I got slightly different result below:
0000044d <ml_func>:
44d: 55 push %ebp
44e: 89 e5 mov %esp,%ebp
450: e8 29 00 00 00 call 47e <__x86.get_pc_thunk.ax>
455: 05 ab 1b 00 00 add $0x1bab,%eax
45a: 8b 90 f0 ff ff ff mov -0x10(%eax),%edx
460: 8b 0a mov (%edx),%ecx
462: 8b 55 08 mov 0x8(%ebp),%edx
465: 01 d1 add %edx,%ecx
467: 8b 90 f0 ff ff ff mov -0x10(%eax),%edx
46d: 89 0a mov %ecx,(%edx)
46f: 8b 80 f0 ff ff ff mov -0x10(%eax),%eax
475: 8b 10 mov (%eax),%edx
477: 8b 45 0c mov 0xc(%ebp),%eax
47a: 01 d0 add %edx,%eax
47c: 5d pop %ebp
47d: c3 ret
The main difference I found was that the semantic of mov instruction. In the upper listing, mov ebp,esp actually moves esp to ebp, while in the lower listing, mov %esp,%ebp does the same thing, but the order of operands are different.
This is quite confusing, even when I have to code hand-written assembly. To summarize, my questions are (1) why I got different assembly representations for the same instructions and (2) which one I should use, when writing assembly code (e.g. with __asm(:::);)
obdjump defaults to -Matt AT&T syntax (like your 2nd code block). See att vs. intel-syntax. The tag wikis have some info about the syntax differences: https://stackoverflow.com/tags/att/info vs. https://stackoverflow.com/tags/intel-syntax/info
Either syntax has the same limitations, imposed by what the machine itself can do, and what's encodeable in machine code. They're just different ways of expressing that in text.
Use objdump -d -Mintel for Intel syntax. I use alias disas='objdump -drwC -Mintel' in my .bashrc, so I can disas foo.o and get the format I want, with relocations printed (important for making sense of a non-linked .o), without line-wrapping for long instructions, and with C++ symbol names demangled.
In inline asm, you can use either syntax, as long as it matches what the compiler is expecting. The default is AT&T, and that's what I'd recommend using for compatibility with clang. Maybe there's a way, but clang doesn't work the same way as GCC with -masm=intel.
Also, AT&T is basically standard for GNU C inline asm on x86, and it means you don't need special build options for your code to work.
But you can use gcc -masm=intel to compile source files that use Intel syntax in their asm statements. This is fine for your own use if you don't care about clang.
If you're writing code for a header, you can make it portable between AT&T and Intel syntax using dialect alternatives, at least for GCC:
static inline
void atomic_inc(volatile int *p) {
// use __asm__ instead of asm in headers, so it works even with -std=c11 instead of gnu11
__asm__("lock {addl $1, %0 | add %0, 1}": "+m"(*p));
// TODO: flag output for return value?
// maybe doesn't need to be asm volatile; compilers know that modifying pointed-to memory is a visible side-effect unless it's a local that fully optimizes away.
// If you want this to work as a memory barrier, use a `"memory"` clobber to stop compile-time memory reordering. The lock prefix provides a runtime full barrier
}
source+asm outputs for gcc/clang on the Godbolt compiler explorer.
With g++ -O3 (default or -masm=att), we get
atomic_inc(int volatile*):
lock addl $1, (%rdi) # operand-size is from my explicit addl suffix
ret
With g++ -O3 -masm=intel, we get
atomic_inc(int volatile*):
lock add DWORD PTR [rdi], 1 # operand-size came from the %0 expansion
ret
clang works with the AT&T version, but fails with -masm=intel (or the -mllvm --x86-asm-syntax=intel which that implies), because that apparently only applies to code emitted by LLVM, not for how the front-end fills in the asm template.
The clang error message is:
<source>:4:13: error: unknown use of instruction mnemonic without a size suffix
__asm__("lock {addl $1, %0 | add %0, 1}": "+m"(*p));
^
<inline asm>:1:2: note: instantiated into assembly here
lock add (%rdi), 1
^
1 error generated.
It picked the "Intel" syntax alternative, but still filled in the template with an AT&T memory operand.
Related
Am basically learning how to make my own instruction in the X86 architecture, but to do that I am understanding how they are decoded and and interpreted to a low level language,
By taking an example of a simple mov instruction and using the .byte notation I wanted to understand in detail as to how instructions are decoded,
My simple code is as follows:
#include <stdio.h>
#include <iostream>
int main(int argc, char const *argv[])
{
int x{5};
int y{0};
// mov %%eax, %0
asm (".byte 0x8b,0x45,0xf8\n\t" //mov %1, eax
".byte 0x89, 0xC0\n\t"
: "=r" (y)
: "r" (x)
);
printf ("dst value : %d\n", y);
return 0;
}
and when I use objdump to analyze how it is broken down to machine language, i get the following output:
000000000000078a <main>:
78a: 55 push %ebp
78b: 48 dec %eax
78c: 89 e5 mov %esp,%ebp
78e: 48 dec %eax
78f: 83 ec 20 sub $0x20,%esp
792: 89 7d ec mov %edi,-0x14(%ebp)
795: 48 dec %eax
796: 89 75 e0 mov %esi,-0x20(%ebp)
799: c7 45 f8 05 00 00 00 movl $0x5,-0x8(%ebp)
7a0: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp)
7a7: 8b 45 f8 mov -0x8(%ebp),%eax
7aa: 8b 45 f8 mov -0x8(%ebp),%eax
7ad: 89 c0 mov %eax,%eax
7af: 89 45 fc mov %eax,-0x4(%ebp)
7b2: 8b 45 fc mov -0x4(%ebp),%eax
7b5: 89 c6 mov %eax,%esi
7b7: 48 dec %eax
7b8: 8d 3d f7 00 00 00 lea 0xf7,%edi
7be: b8 00 00 00 00 mov $0x0,%eax
7c3: e8 78 fe ff ff call 640 <printf#plt>
7c8: b8 00 00 00 00 mov $0x0,%eax
7cd: c9 leave
7ce: c3 ret
With regard to this output of objdump why is the instruction 7aa: 8b 45 f8 mov -0x8(%ebp),%eax repeated twice, any reason behind it or am I doing something wrong while using the .byte notation?
One of those is compiler-generated, because you asked GCC to have the input in its choice of register for you. That's what "r"(x) means. And you compiled with optimization disabled (the default -O0) so it actually stored x to memory and then reloaded it before your asm statement.
Your code has no business assuming anything about the contents of memory or where EBP points.
Since you're using 89 c0 mov %eax,%eax, the only safe constraints for your asm statement are "a" explicit-register constraints for input and output, forcing the compiler to pick that. If you compile with optimization enabled, your code totally breaks because you lied to the compiler about what your code actually does.
// constraints that match your manually-encoded instruction
asm (".byte 0x89, 0xC0\n\t"
: "=a" (y)
: "a" (x)
);
There's no constraint to force GCC to pick a certain addressing mode for a "m" source or "=m" dest operand so you need to ask for inputs/outputs in specific registers.
If you want to encode your own mov instructions differently from standard mov, see which MOV instructions in the x86 are not used or the least used, and can be used for a custom MOV extension - you might want to use a prefix in front of regular mov opcodes so you can let the assembler encode registers and addressing modes for you, like .byte something; mov %1, %0.
Look at the compiler-generate asm output (gcc -S, not disassembly of the .o or executable). Then you can see which instructions come from the asm statement and which are emitted by GCC.
If you don't explicitly reference some operands in the asm template but still want to see what the compiler picked, you can use them in asm comments like this:
asm (".byte 0x8b,0x45,0xf8 # 0 = %0 1 = %1 \n\t"
".byte 0x89, 0xC0\n\t"
: "=r" (y)
: "r" (x)
);
and gcc will fill it in for you so you can see what operands it expects you to be reading and writing. (Godbolt with g++ -m32 -O3). I put your code in void foo(){} instead of main because GCC -m32 thinks it needs to re-align the stack at the top of main. This makes the code a lot harder to follow.
# gcc-9.2 -O3 -m32 -fverbose-asm
.LC0:
.string "dst value : %d\n"
foo():
subl $20, %esp #,
movl $5, %eax #, tmp84
## Notice that GCC hasn't set up EBP at all before it runs your asm,
## and hasn't stored x in memory.
## It only put it in a register like you asked it to.
.byte 0x8b,0x45,0xf8 # 0 = %eax 1 = %eax # y, tmp84
.byte 0x89, 0xC0
pushl %eax # y
pushl $.LC0 #
call printf #
addl $28, %esp #,
ret
Also note that if you were compiling as 64-bit, it would probably pick %esi as a register because printf will want its 2nd arg there. So the "a" instead of "r" constraint would actually matter.
You could get 32-bit GCC to use a different register if you were assigning to a variable that has to survive across a function call; then GCC would pick a call-preserved reg like EBX instead of EAX.
I've been working with C for a short while and very recently started to get into ASM. When I compile a program:
int main(void)
{
int a = 0;
a += 1;
return 0;
}
The objdump disassembly has the code, but nops after the ret:
...
08048394 <main>:
8048394: 55 push %ebp
8048395: 89 e5 mov %esp,%ebp
8048397: 83 ec 10 sub $0x10,%esp
804839a: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp)
80483a1: 83 45 fc 01 addl $0x1,-0x4(%ebp)
80483a5: b8 00 00 00 00 mov $0x0,%eax
80483aa: c9 leave
80483ab: c3 ret
80483ac: 90 nop
80483ad: 90 nop
80483ae: 90 nop
80483af: 90 nop
...
From what I learned nops do nothing, and since after ret wouldn't even be executed.
My question is: why bother? Couldn't ELF(linux-x86) work with a .text section(+main) of any size?
I'd appreciate any help, just trying to learn.
First of all, gcc doesn't always do this. The padding is controlled by -falign-functions, which is automatically turned on by -O2 and -O3:
-falign-functions
-falign-functions=n
Align the start of functions to the next power-of-two greater than n, skipping up to n bytes. For instance,
-falign-functions=32 aligns functions to the next 32-byte boundary, but -falign-functions=24 would align to the next 32-byte boundary only
if this can be done by skipping 23 bytes or less.
-fno-align-functions and -falign-functions=1 are equivalent and mean that functions will not be aligned.
Some assemblers only support this flag when n is a power of two; in
that case, it is rounded up.
If n is not specified or is zero, use a machine-dependent default.
Enabled at levels -O2, -O3.
There could be multiple reasons for doing this, but the main one on x86 is probably this:
Most processors fetch instructions in aligned 16-byte or 32-byte blocks. It can be
advantageous to align critical loop entries and subroutine entries by 16 in order to minimize
the number of 16-byte boundaries in the code. Alternatively, make sure that there is no 16-byte boundary in the first few instructions after a critical loop entry or subroutine entry.
(Quoted from "Optimizing subroutines in assembly
language" by Agner Fog.)
edit: Here is an example that demonstrates the padding:
// align.c
int f(void) { return 0; }
int g(void) { return 0; }
When compiled using gcc 4.4.5 with default settings, I get:
align.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: c9 leaveq
a: c3 retq
000000000000000b <g>:
b: 55 push %rbp
c: 48 89 e5 mov %rsp,%rbp
f: b8 00 00 00 00 mov $0x0,%eax
14: c9 leaveq
15: c3 retq
Specifying -falign-functions gives:
align.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: b8 00 00 00 00 mov $0x0,%eax
9: c9 leaveq
a: c3 retq
b: eb 03 jmp 10 <g>
d: 90 nop
e: 90 nop
f: 90 nop
0000000000000010 <g>:
10: 55 push %rbp
11: 48 89 e5 mov %rsp,%rbp
14: b8 00 00 00 00 mov $0x0,%eax
19: c9 leaveq
1a: c3 retq
This is done to align the next function by 8, 16 or 32-byte boundary.
From “Optimizing subroutines in assembly language” by A.Fog:
11.5 Alignment of code
Most microprocessors fetch code in aligned 16-byte or 32-byte blocks. If an importantsubroutine entry or jump label happens to be near the end of a 16-byte block then themicroprocessor will only get a few useful bytes of code when fetching that block of code. Itmay have to fetch the next 16 bytes too before it can decode the first instructions after thelabel. This can be avoided by aligning important subroutine entries and loop entries by 16.
[...]
Aligning a subroutine entry is as simple as putting as many
NOP
's as needed before thesubroutine entry to make the address divisible by 8, 16, 32 or 64, as desired.
As far as I remember, instructions are pipelined in cpu and different cpu blocks (loader, decoder and such) process subsequent instructions. When RET instructions is being executed, few next instructions are already loaded into cpu pipeline. It's a guess, but you can start digging here and if you find out (maybe the specific number of NOPs that are safe, share your findings please.
Consider the following toy example that allocates memory on the stack by means of the alloca() function:
#include <alloca.h>
void foo() {
volatile int *p = alloca(4);
*p = 7;
}
Compiling the function above using gcc 8.2 with -O3 results in the following assembly code:
foo:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
leaq 15(%rsp), %rax
andq $-16, %rax
movl $7, (%rax)
leave
ret
Honestly, I would have expected a more compact assembly code.
16-byte alignment for allocated memory
The instruction andq $-16, %rax in the code above results in rax containing the (only) 16-byte-aligned address between the addresses rsp and rsp + 15 (both inclusive).
This alignment enforcement is the first thing I don't understand: Why does alloca() align the allocated memory to a 16-byte boundary?
Possible missed optimization?
Let's consider anyway that we want the memory allocated by alloca() to be 16-byte aligned. Even so, in the assembly code above, keeping in mind that GCC assumes the stack to be aligned to a 16-byte boundary at the moment of performing the function call (i.e., call foo), if we pay attention to the status of the stack inside foo() just after pushing the rbp register:
Size Stack RSP mod 16 Description
-----------------------------------------------------------------------------------
------------------
| . |
| . |
| . |
------------------........0 at "call foo" (stack 16-byte aligned)
8 bytes | return address |
------------------........8 at foo entry
8 bytes | saved RBP |
------------------........0 <----- RSP is 16-byte aligned!!!
I think that by taking advantage of the red zone (i.e., no need to modify rsp) and the fact that rsp already contains a 16-byte aligned address, the following code could be used instead:
foo:
pushq %rbp
movq %rsp, %rbp
movl $7, -16(%rbp)
leave
ret
The address contained in the register rbp is 16-byte aligned, therefore rbp - 16 will also be aligned to a 16-byte boundary.
Even better, the creation of the new stack frame can be optimized away, since rsp is not modified:
foo:
movl $7, -8(%rsp)
ret
Is this just a missed optimization or I am missing something else here?
This is (partially) missed optimization in gcc. Clang does it as expected.
I said partially because if you know you will be using gcc you can use builtin functions (use conditional compilation for gcc and other compilers to have portable code).
__builtin_alloca_with_align is your friend ;)
Here is an example (changed so the compiler will not reduce function call to single ret):
#include <alloca.h>
volatile int* p;
void foo()
{
p = alloca(4) ;
*p = 7;
}
void zoo()
{
// aligment is 16 bits, not bytes
p = __builtin_alloca_with_align(4,16) ;
*p = 7;
}
int main()
{
foo();
zoo();
}
Disassembled code (with objdump -d -w --insn-width=12 -M intel)
Clang will produce the following code (clang -O3 test.c) - both functions look alike
0000000000400480 <foo>:
400480: 48 8d 44 24 f8 lea rax,[rsp-0x8]
400485: 48 89 05 a4 0b 20 00 mov QWORD PTR [rip+0x200ba4],rax # 601030 <p>
40048c: c7 44 24 f8 07 00 00 00 mov DWORD PTR [rsp-0x8],0x7
400494: c3 ret
00000000004004a0 <zoo>:
4004a0: 48 8d 44 24 fc lea rax,[rsp-0x4]
4004a5: 48 89 05 84 0b 20 00 mov QWORD PTR [rip+0x200b84],rax # 601030 <p>
4004ac: c7 44 24 fc 07 00 00 00 mov DWORD PTR [rsp-0x4],0x7
4004b4: c3 ret
GCC this one (gcc -g -O3 -fno-stack-protector)
0000000000000620 <foo>:
620: 55 push rbp
621: 48 89 e5 mov rbp,rsp
624: 48 83 ec 20 sub rsp,0x20
628: 48 8d 44 24 0f lea rax,[rsp+0xf]
62d: 48 83 e0 f0 and rax,0xfffffffffffffff0
631: 48 89 05 e0 09 20 00 mov QWORD PTR [rip+0x2009e0],rax # 201018 <p>
638: c7 00 07 00 00 00 mov DWORD PTR [rax],0x7
63e: c9 leave
63f: c3 ret
0000000000000640 <zoo>:
640: 48 8d 44 24 fc lea rax,[rsp-0x4]
645: c7 44 24 fc 07 00 00 00 mov DWORD PTR [rsp-0x4],0x7
64d: 48 89 05 c4 09 20 00 mov QWORD PTR [rip+0x2009c4],rax # 201018 <p>
654: c3 ret
As you can see zoo now looks like expected and similar to clang code.
The x86-64 System V ABI requires VLAs (C99 Variable Length Arrays) to be 16-byte aligned, same for automatic / static arrays that are >= 16 bytes.
It looks like gcc is treating alloca as a VLA, and failing to do constant-propagation into an alloca that only runs once per function call. (Or that it internally uses alloca for VLAs.)
A generic alloca / VLA can't use the red-zone, in case the runtime value is larger than 128 bytes. GCC also makes a stack frame with RBP instead of saving the allocation size and doing an add rsp, rdx later.
So the asm looks exactly like what it would if the size was a function arg or other runtime variable instead of a constant. That's what led me to this conclusion.
Also alignof(maxalign_t) == 16 , but alloca and malloc can satisfy the requirement to return memory usable for any object without 16-byte alignment for objects smaller than 16 bytes. None of the standard types have alignment requirements wider than their size in x86-64 SysV.
You're right, it should be able to optimize it to this:
void foo() {
alignas(16) int dummy[1];
volatile int *p = dummy; // alloca(4)
*p = 7;
}
and compile it to the movl $7, -8(%rsp) ; ret you suggested.
The alignas(16) might be optional here for alloca.
If you really need gcc to emit better code when constant propagation makes the arg to alloca a compile-time constant, you could consider simply using a VLA in the first place. GNU C++ supports C99-style VLAs in C++ mode, but ISO C++ (and MSVC) don't.
Or possibly use if(__builtin_constant_p(size)) { VLA version } else { alloca version }, but scoping of VLAs means you can't return a VLA from the scope of an if that detects that we're being inlined with a compile-time constant size. So you'd have to duplicate the code that needs the pointer.
This question is for Intel x86 assembly experts to answer. Thanks for your effort in advance!
Problem Specification
I am analysing a binary file, which match Mach-O 64-bit x86 assembly. I am currently using MacOS 64 OS. The assembly comes from objdump.
The problem is that when I am learning assembly, I can see variable name "$xxx", I can see string value in ascii and I can also see the callee name like "call _printf"
But in this assembly, I can get nothing above:
no main function:
Disassembly of section __TEXT,__text:
__text:
100000c90: 55 pushq %rbp
100000c91: 48 89 e5 movq %rsp, %rbp
100000c94: 48 83 ec 10 subq $16, %rsp
100000c98: 48 8d 3d bf 02 00 00 leaq 703(%rip), %rdi
100000c9f: b0 00 movb $0, %al
100000ca1: e8 68 02 00 00 callq 616
100000ca6: 89 45 fc movl %eax, -4(%rbp)
100000ca9: 48 83 c4 10 addq $16, %rsp
100000cad: 5d popq %rbp
100000cae: c3 retq
100000caf: 90 nop
100000cb0: 55 pushq %rbp
...
The above is codes frame will be executed, but I have no idea where it is executed.
Also, I newbie of AT&T assemble. Hence, could you tell me what is the meaning of instruction:
0000000100000c90 pushq %rbp
0000000100000c98 leaq 0x2bf(%rip), %rdi ## literal pool for: "xxxx\n"
...
0000000100000cd0 callq 0x100000c90
Is it a loop? I am not sure but it seems to be. And why we they use %rip and %rdi register. In intel x86 I know that EIP represents current caller address, but I don't understand the meaning here.
call integer:
No matter what call convention they used, I had never seen code pattern like "call 616":
"100000cd0: e8 bb ff ff ff callq -69 <__mh_execute_header+C90>"
After ret:
Ret in intel x86, means delete stack frame and return control flow to caller. It should be an independent function. However, after this, we can see codes like
100000cae: c3 retq
100000caf: 90 nop
/* new function call */
100000cb0: 55 pushq %rbp
...
It is ridiculous!
ASCII string lost:
I have already viewed the binary in Hexadecimal format, and recognise some ascii string before reverse it to asm file.
However, in this file no ascii string occurrences!
Total architecture review:
Disassembly of section __TEXT,__text:
__text:
from address 10000c90 to 100000ef6 of 145 lines
Disassembly of section __TEXT,__stubs:
__stubs:
from address 100000efc to 100000f14 of 5 lines asm codes:
100000efc: ff 25 16 01 00 00 jmp qword ptr [rip + 278]
100000f02: ff 25 18 01 00 00 jmp qword ptr [rip + 280]
100000f08: ff 25 1a 01 00 00 jmp qword ptr [rip + 282]
100000f0e: ff 25 1c 01 00 00 jmp qword ptr [rip + 284]
100000f14: ff 25 1e 01 00 00 jmp qword ptr [rip + 286]
Disassembly of section __TEXT,__stub_helper:
__stub_helper:
...
Disassembly of section __TEXT,__cstring:
__cstring:
...
Disassembly of section __TEXT,__unwind_info:
__unwind_info:
...
Disassembly of section __DATA,__nl_symbol_ptr:
__nl_symbol_ptr:
...
Disassembly of section __DATA,__got:
__got:
...
Disassembly of section __DATA,__la_symbol_ptr:
__la_symbol_ptr:
...
Disassembly of section __DATA,__data:
__data:
...
Since it might be a virus, I cannot execute it. How should I analyse it ?
Update on May 21
I have already identified where is the output, and if I totally understand the data flow pipeline represented in this programme, I might be able to figure out the possible solutions.
I am appreciated if someone can give me the detailed explanation. Thank you !
Update on May 22
I installed a MacOS in VirtualBox and after chmod privileges , I executed the programme but nothing special except for two lines of output happened. And the result hiding in the binary file.
You don't need a main if you are not using C. The binary header contains the entry point address.
Nothing special about call 616, it's just that you don't have (all) symbols. It's somewhat strange that objdump didn't calculate the address for you, but it should be 0x100000ca6+616.
Not sure what you find ridiculous there. One function ends, another starts.
That's not a question. Yes, you can create strings at runtime so you won't have them in the image. Possibly they are encrypted.
centos 6.5 64bit vps, 500MB ram gcc 4.8.2
I have the following function that works only if I use edi as the constraint to hold the string pointer. If I try to use any other register or constraintg or q etc, it segfaults.
BUT this problem only occurs when both link time optimization and o3 are used together. If o2 it's fine. If I don't use -flto, it's fine. But both together then the only register I can use that doesn't crash is edi
gcc -flto
CFLAGS=-I. -flto -std=gnu11 -msse4.2 -fno-builtin-printf -Wall -Winline -Wstrict-aliasing -g -pg -O3 -lrt -lpthread
It seems like there might be some sort of register clobbering going on or something else. I'm really at a loss to understand why and how to fix this. Another interesting aspect is the generated assembly puts rdi into rdx before using the pointer but if I try to use either register as the input constraint... it segfaults! If it fails under aggressive compiling options it suggests to me either the compiler is stuffing up somehow, or more likely I'm doing something wrong.
char *sse4_strCRLF(char *str)
{
__m128i M = _mm_set1_epi8(13);
char *res;
__asm__ __volatile__(
"xor %0,%0\n\t"
"sub $1, %1\n\t"
"1:" "sub $15,%1\n\t"
".align 16\n\t"
"2:" "add $16, %1\n\t"
"pcmpistri $0x08,(%1),%2\n\t"
"ja 2b\n\t"
"jnc 2f\n\t"
"cmpb $10,1(%1,%%rcx)\n\t"
"jne 1b\n\t"
"add %%rcx,%1\n\t"
"mov %1,%0\n\t"
"2:"
:"=q"(res)
:"edi"(str),"x"(M) //<-- if use anything except edi, it segfaults
:"rcx"
);
return (char*) res;
}
Disassembled output:
00000000000002e0 <sse4_strCRLF>:
2e0: 55 push rbp
2e1: 48 89 e5 mov rbp,rsp
2e4: e8 00 00 00 00 call 2e9 <sse4_strCRLF+0x9>
2e9: 66 0f 6f 05 00 00 00 00 movdqa xmm0,[rip+0x0] # 2f1 <sse4_strCRLF+0x11>
2f1: 48 89 fa mov rdx,rdi //<--- puts rdi into rdx!
2f4: 48 31 c0 xor rax,rax
2f7: 48 83 ea 01 sub rdx,0x1
2fb: 48 83 ea 0f sub rdx,0xf
2ff: 90 nop
300: 48 83 c2 10 add rdx,0x10
304: 66 0f 3a 63 02 08 pcmpistri xmm0,[rdx],0x8
30a: 77 f4 ja 300 <sse4_strCRLF+0x20>
30c: 73 0d jae 31b <sse4_strCRLF+0x3b>
30e: 80 7c 0a 01 0a cmp byte[rdx+rcx*1+0x1],0xa
313: 75 e6 jne 2fb <sse4_strCRLF+0x1b>
315: 48 01 ca add rdx,rcx
318: 48 89 d0 mov rax,rdx
31b: 5d pop rbp
31c: c3 ret
#David Wohlferd gave me the answer. It was 2 dumb mistakes I was making due to ignorance and assumptions. The below code is modified such that the input variable char pointer is not modified by the routine. It's copied into a register and that register is used. Also I was mistakenly thinking I could directly specify a particular register as opposed to a b etc.
gcc still seems to be fussy about what constraints I use. e.g. If I use a and b for res and str respectively, it compiles fine but segfaults on running. But using S and D seems to work fine.
#David Wohlferd, I'd like to credit you as the answerer but I don't think I can do that to a comment.
char *sse4_strCRLF(char *str)
{
__m128i M = _mm_set1_epi8(13);
char *res;
__asm__ __volatile__(
"xor %0,%0\n\t"
"mov %1,%%rdx\n\t"
"sub $1,%%rdx\n\t"
"1:" "sub $15,%%rdx\n\t"
".align 16\n\t"
"2:" "add $16, %%rdx\n\t"
"pcmpistri $0x08,(%%rdx),%2\n\t"
"ja 2b\n\t"
"jnc 2f\n\t"
"cmpb $10,1(%%rdx,%%rcx)\n\t"
"jne 1b\n\t"
"add %%rcx,%%rdx\n\t"
"mov %%rdx,%0\n\t"
"2:"
:"=S"(res)
:"D"(str),"x"(M)
:"rcx","rdx"
);
return (char*) res;
}