instruction repeated twice when decoded into machine language, - gcc

Am basically learning how to make my own instruction in the X86 architecture, but to do that I am understanding how they are decoded and and interpreted to a low level language,
By taking an example of a simple mov instruction and using the .byte notation I wanted to understand in detail as to how instructions are decoded,
My simple code is as follows:
#include <stdio.h>
#include <iostream>
int main(int argc, char const *argv[])
{
int x{5};
int y{0};
// mov %%eax, %0
asm (".byte 0x8b,0x45,0xf8\n\t" //mov %1, eax
".byte 0x89, 0xC0\n\t"
: "=r" (y)
: "r" (x)
);
printf ("dst value : %d\n", y);
return 0;
}
and when I use objdump to analyze how it is broken down to machine language, i get the following output:
000000000000078a <main>:
78a: 55 push %ebp
78b: 48 dec %eax
78c: 89 e5 mov %esp,%ebp
78e: 48 dec %eax
78f: 83 ec 20 sub $0x20,%esp
792: 89 7d ec mov %edi,-0x14(%ebp)
795: 48 dec %eax
796: 89 75 e0 mov %esi,-0x20(%ebp)
799: c7 45 f8 05 00 00 00 movl $0x5,-0x8(%ebp)
7a0: c7 45 fc 00 00 00 00 movl $0x0,-0x4(%ebp)
7a7: 8b 45 f8 mov -0x8(%ebp),%eax
7aa: 8b 45 f8 mov -0x8(%ebp),%eax
7ad: 89 c0 mov %eax,%eax
7af: 89 45 fc mov %eax,-0x4(%ebp)
7b2: 8b 45 fc mov -0x4(%ebp),%eax
7b5: 89 c6 mov %eax,%esi
7b7: 48 dec %eax
7b8: 8d 3d f7 00 00 00 lea 0xf7,%edi
7be: b8 00 00 00 00 mov $0x0,%eax
7c3: e8 78 fe ff ff call 640 <printf#plt>
7c8: b8 00 00 00 00 mov $0x0,%eax
7cd: c9 leave
7ce: c3 ret
With regard to this output of objdump why is the instruction 7aa: 8b 45 f8 mov -0x8(%ebp),%eax repeated twice, any reason behind it or am I doing something wrong while using the .byte notation?

One of those is compiler-generated, because you asked GCC to have the input in its choice of register for you. That's what "r"(x) means. And you compiled with optimization disabled (the default -O0) so it actually stored x to memory and then reloaded it before your asm statement.
Your code has no business assuming anything about the contents of memory or where EBP points.
Since you're using 89 c0 mov %eax,%eax, the only safe constraints for your asm statement are "a" explicit-register constraints for input and output, forcing the compiler to pick that. If you compile with optimization enabled, your code totally breaks because you lied to the compiler about what your code actually does.
// constraints that match your manually-encoded instruction
asm (".byte 0x89, 0xC0\n\t"
: "=a" (y)
: "a" (x)
);
There's no constraint to force GCC to pick a certain addressing mode for a "m" source or "=m" dest operand so you need to ask for inputs/outputs in specific registers.
If you want to encode your own mov instructions differently from standard mov, see which MOV instructions in the x86 are not used or the least used, and can be used for a custom MOV extension - you might want to use a prefix in front of regular mov opcodes so you can let the assembler encode registers and addressing modes for you, like .byte something; mov %1, %0.
Look at the compiler-generate asm output (gcc -S, not disassembly of the .o or executable). Then you can see which instructions come from the asm statement and which are emitted by GCC.
If you don't explicitly reference some operands in the asm template but still want to see what the compiler picked, you can use them in asm comments like this:
asm (".byte 0x8b,0x45,0xf8 # 0 = %0 1 = %1 \n\t"
".byte 0x89, 0xC0\n\t"
: "=r" (y)
: "r" (x)
);
and gcc will fill it in for you so you can see what operands it expects you to be reading and writing. (Godbolt with g++ -m32 -O3). I put your code in void foo(){} instead of main because GCC -m32 thinks it needs to re-align the stack at the top of main. This makes the code a lot harder to follow.
# gcc-9.2 -O3 -m32 -fverbose-asm
.LC0:
.string "dst value : %d\n"
foo():
subl $20, %esp #,
movl $5, %eax #, tmp84
## Notice that GCC hasn't set up EBP at all before it runs your asm,
## and hasn't stored x in memory.
## It only put it in a register like you asked it to.
.byte 0x8b,0x45,0xf8 # 0 = %eax 1 = %eax # y, tmp84
.byte 0x89, 0xC0
pushl %eax # y
pushl $.LC0 #
call printf #
addl $28, %esp #,
ret
Also note that if you were compiling as 64-bit, it would probably pick %esi as a register because printf will want its 2nd arg there. So the "a" instead of "r" constraint would actually matter.
You could get 32-bit GCC to use a different register if you were assigning to a variable that has to survive across a function call; then GCC would pick a call-preserved reg like EBX instead of EAX.

Related

Why does the assembly encoding of objdump vary?

I was reading this article about Position Independent Code and I encountered this assembly listing of a function.
0000043c <ml_func>:
43c: 55 push ebp
43d: 89 e5 mov ebp,esp
43f: e8 16 00 00 00 call 45a <__i686.get_pc_thunk.cx>
444: 81 c1 b0 1b 00 00 add ecx,0x1bb0
44a: 8b 81 f0 ff ff ff mov eax,DWORD PTR [ecx-0x10]
450: 8b 00 mov eax,DWORD PTR [eax]
452: 03 45 08 add eax,DWORD PTR [ebp+0x8]
455: 03 45 0c add eax,DWORD PTR [ebp+0xc]
458: 5d pop ebp
459: c3 ret
0000045a <__i686.get_pc_thunk.cx>:
45a: 8b 0c 24 mov ecx,DWORD PTR [esp]
45d: c3 ret
However, on my machine (gcc-7.3.0, Ubuntu 18.04 x86_64), I got slightly different result below:
0000044d <ml_func>:
44d: 55 push %ebp
44e: 89 e5 mov %esp,%ebp
450: e8 29 00 00 00 call 47e <__x86.get_pc_thunk.ax>
455: 05 ab 1b 00 00 add $0x1bab,%eax
45a: 8b 90 f0 ff ff ff mov -0x10(%eax),%edx
460: 8b 0a mov (%edx),%ecx
462: 8b 55 08 mov 0x8(%ebp),%edx
465: 01 d1 add %edx,%ecx
467: 8b 90 f0 ff ff ff mov -0x10(%eax),%edx
46d: 89 0a mov %ecx,(%edx)
46f: 8b 80 f0 ff ff ff mov -0x10(%eax),%eax
475: 8b 10 mov (%eax),%edx
477: 8b 45 0c mov 0xc(%ebp),%eax
47a: 01 d0 add %edx,%eax
47c: 5d pop %ebp
47d: c3 ret
The main difference I found was that the semantic of mov instruction. In the upper listing, mov ebp,esp actually moves esp to ebp, while in the lower listing, mov %esp,%ebp does the same thing, but the order of operands are different.
This is quite confusing, even when I have to code hand-written assembly. To summarize, my questions are (1) why I got different assembly representations for the same instructions and (2) which one I should use, when writing assembly code (e.g. with __asm(:::);)
obdjump defaults to -Matt AT&T syntax (like your 2nd code block). See att vs. intel-syntax. The tag wikis have some info about the syntax differences: https://stackoverflow.com/tags/att/info vs. https://stackoverflow.com/tags/intel-syntax/info
Either syntax has the same limitations, imposed by what the machine itself can do, and what's encodeable in machine code. They're just different ways of expressing that in text.
Use objdump -d -Mintel for Intel syntax. I use alias disas='objdump -drwC -Mintel' in my .bashrc, so I can disas foo.o and get the format I want, with relocations printed (important for making sense of a non-linked .o), without line-wrapping for long instructions, and with C++ symbol names demangled.
In inline asm, you can use either syntax, as long as it matches what the compiler is expecting. The default is AT&T, and that's what I'd recommend using for compatibility with clang. Maybe there's a way, but clang doesn't work the same way as GCC with -masm=intel.
Also, AT&T is basically standard for GNU C inline asm on x86, and it means you don't need special build options for your code to work.
But you can use gcc -masm=intel to compile source files that use Intel syntax in their asm statements. This is fine for your own use if you don't care about clang.
If you're writing code for a header, you can make it portable between AT&T and Intel syntax using dialect alternatives, at least for GCC:
static inline
void atomic_inc(volatile int *p) {
// use __asm__ instead of asm in headers, so it works even with -std=c11 instead of gnu11
__asm__("lock {addl $1, %0 | add %0, 1}": "+m"(*p));
// TODO: flag output for return value?
// maybe doesn't need to be asm volatile; compilers know that modifying pointed-to memory is a visible side-effect unless it's a local that fully optimizes away.
// If you want this to work as a memory barrier, use a `"memory"` clobber to stop compile-time memory reordering. The lock prefix provides a runtime full barrier
}
source+asm outputs for gcc/clang on the Godbolt compiler explorer.
With g++ -O3 (default or -masm=att), we get
atomic_inc(int volatile*):
lock addl $1, (%rdi) # operand-size is from my explicit addl suffix
ret
With g++ -O3 -masm=intel, we get
atomic_inc(int volatile*):
lock add DWORD PTR [rdi], 1 # operand-size came from the %0 expansion
ret
clang works with the AT&T version, but fails with -masm=intel (or the -mllvm --x86-asm-syntax=intel which that implies), because that apparently only applies to code emitted by LLVM, not for how the front-end fills in the asm template.
The clang error message is:
<source>:4:13: error: unknown use of instruction mnemonic without a size suffix
__asm__("lock {addl $1, %0 | add %0, 1}": "+m"(*p));
^
<inline asm>:1:2: note: instantiated into assembly here
lock add (%rdi), 1
^
1 error generated.
It picked the "Intel" syntax alternative, but still filled in the template with an AT&T memory operand.

Understanding GCC's alloca() alignment and seemingly missed optimization

Consider the following toy example that allocates memory on the stack by means of the alloca() function:
#include <alloca.h>
void foo() {
volatile int *p = alloca(4);
*p = 7;
}
Compiling the function above using gcc 8.2 with -O3 results in the following assembly code:
foo:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
leaq 15(%rsp), %rax
andq $-16, %rax
movl $7, (%rax)
leave
ret
Honestly, I would have expected a more compact assembly code.
16-byte alignment for allocated memory
The instruction andq $-16, %rax in the code above results in rax containing the (only) 16-byte-aligned address between the addresses rsp and rsp + 15 (both inclusive).
This alignment enforcement is the first thing I don't understand: Why does alloca() align the allocated memory to a 16-byte boundary?
Possible missed optimization?
Let's consider anyway that we want the memory allocated by alloca() to be 16-byte aligned. Even so, in the assembly code above, keeping in mind that GCC assumes the stack to be aligned to a 16-byte boundary at the moment of performing the function call (i.e., call foo), if we pay attention to the status of the stack inside foo() just after pushing the rbp register:
Size Stack RSP mod 16 Description
-----------------------------------------------------------------------------------
------------------
| . |
| . |
| . |
------------------........0 at "call foo" (stack 16-byte aligned)
8 bytes | return address |
------------------........8 at foo entry
8 bytes | saved RBP |
------------------........0 <----- RSP is 16-byte aligned!!!
I think that by taking advantage of the red zone (i.e., no need to modify rsp) and the fact that rsp already contains a 16-byte aligned address, the following code could be used instead:
foo:
pushq %rbp
movq %rsp, %rbp
movl $7, -16(%rbp)
leave
ret
The address contained in the register rbp is 16-byte aligned, therefore rbp - 16 will also be aligned to a 16-byte boundary.
Even better, the creation of the new stack frame can be optimized away, since rsp is not modified:
foo:
movl $7, -8(%rsp)
ret
Is this just a missed optimization or I am missing something else here?
This is (partially) missed optimization in gcc. Clang does it as expected.
I said partially because if you know you will be using gcc you can use builtin functions (use conditional compilation for gcc and other compilers to have portable code).
__builtin_alloca_with_align is your friend ;)
Here is an example (changed so the compiler will not reduce function call to single ret):
#include <alloca.h>
volatile int* p;
void foo()
{
p = alloca(4) ;
*p = 7;
}
void zoo()
{
// aligment is 16 bits, not bytes
p = __builtin_alloca_with_align(4,16) ;
*p = 7;
}
int main()
{
foo();
zoo();
}
Disassembled code (with objdump -d -w --insn-width=12 -M intel)
Clang will produce the following code (clang -O3 test.c) - both functions look alike
0000000000400480 <foo>:
400480: 48 8d 44 24 f8 lea rax,[rsp-0x8]
400485: 48 89 05 a4 0b 20 00 mov QWORD PTR [rip+0x200ba4],rax # 601030 <p>
40048c: c7 44 24 f8 07 00 00 00 mov DWORD PTR [rsp-0x8],0x7
400494: c3 ret
00000000004004a0 <zoo>:
4004a0: 48 8d 44 24 fc lea rax,[rsp-0x4]
4004a5: 48 89 05 84 0b 20 00 mov QWORD PTR [rip+0x200b84],rax # 601030 <p>
4004ac: c7 44 24 fc 07 00 00 00 mov DWORD PTR [rsp-0x4],0x7
4004b4: c3 ret
GCC this one (gcc -g -O3 -fno-stack-protector)
0000000000000620 <foo>:
620: 55 push rbp
621: 48 89 e5 mov rbp,rsp
624: 48 83 ec 20 sub rsp,0x20
628: 48 8d 44 24 0f lea rax,[rsp+0xf]
62d: 48 83 e0 f0 and rax,0xfffffffffffffff0
631: 48 89 05 e0 09 20 00 mov QWORD PTR [rip+0x2009e0],rax # 201018 <p>
638: c7 00 07 00 00 00 mov DWORD PTR [rax],0x7
63e: c9 leave
63f: c3 ret
0000000000000640 <zoo>:
640: 48 8d 44 24 fc lea rax,[rsp-0x4]
645: c7 44 24 fc 07 00 00 00 mov DWORD PTR [rsp-0x4],0x7
64d: 48 89 05 c4 09 20 00 mov QWORD PTR [rip+0x2009c4],rax # 201018 <p>
654: c3 ret
As you can see zoo now looks like expected and similar to clang code.
The x86-64 System V ABI requires VLAs (C99 Variable Length Arrays) to be 16-byte aligned, same for automatic / static arrays that are >= 16 bytes.
It looks like gcc is treating alloca as a VLA, and failing to do constant-propagation into an alloca that only runs once per function call. (Or that it internally uses alloca for VLAs.)
A generic alloca / VLA can't use the red-zone, in case the runtime value is larger than 128 bytes. GCC also makes a stack frame with RBP instead of saving the allocation size and doing an add rsp, rdx later.
So the asm looks exactly like what it would if the size was a function arg or other runtime variable instead of a constant. That's what led me to this conclusion.
Also alignof(maxalign_t) == 16 , but alloca and malloc can satisfy the requirement to return memory usable for any object without 16-byte alignment for objects smaller than 16 bytes. None of the standard types have alignment requirements wider than their size in x86-64 SysV.
You're right, it should be able to optimize it to this:
void foo() {
alignas(16) int dummy[1];
volatile int *p = dummy; // alloca(4)
*p = 7;
}
and compile it to the movl $7, -8(%rsp) ; ret you suggested.
The alignas(16) might be optional here for alloca.
If you really need gcc to emit better code when constant propagation makes the arg to alloca a compile-time constant, you could consider simply using a VLA in the first place. GNU C++ supports C99-style VLAs in C++ mode, but ISO C++ (and MSVC) don't.
Or possibly use if(__builtin_constant_p(size)) { VLA version } else { alloca version }, but scoping of VLAs means you can't return a VLA from the scope of an if that detects that we're being inlined with a compile-time constant size. So you'd have to duplicate the code that needs the pointer.

Weird SSE assembler instructions for double negation

GCC and Clang compilers seem to employ some dark magic. The C code just negates the value of a double, but the assembler instructions involve bit-wise XOR and the instruction pointer. Can somebody explain what is happening and why is it an optimal solution. Thank you.
Contents of test.c:
void function(double *a, double *b) {
*a = -(*b); // This line.
}
The resulting assembler instructions:
(gcc)
0000000000000000 <function>:
0: f2 0f 10 06 movsd xmm0,QWORD PTR [rsi]
4: 66 0f 57 05 00 00 00 xorpd xmm0,XMMWORD PTR [rip+0x0] # c <function+0xc>
b: 00
c: f2 0f 11 07 movsd QWORD PTR [rdi],xmm0
10: c3 ret
(clang)
0000000000000000 <function>:
0: f2 0f 10 06 movsd xmm0,QWORD PTR [rsi]
4: 0f 57 05 00 00 00 00 xorps xmm0,XMMWORD PTR [rip+0x0] # b <function+0xb>
b: 0f 13 07 movlps QWORD PTR [rdi],xmm0
e: c3 ret
The assembler instruction at address 0x4 represents "This line", however I can't understand how it works. The xorpd/xorps instructions are supposed to be bit-wise XOR and PTR [rip] is the instruction pointer.
I suspect that at the moment of execution rip is pointing somewhere near the 0f 57 05 00 00 00 0f strip of bytes, but I can't quite figure out, how is this working and why do both compilers choose this approach.
P.S. I should point out that this is compiled using -O3
for me the output of gcc with the -S -O3 options for the same code is:
.file "test.c"
.text
.p2align 4,,15
.globl function
.type function, #function
function:
.LFB0:
.cfi_startproc
movsd (%rsi), %xmm0
xorpd .LC0(%rip), %xmm0
movsd %xmm0, (%rdi)
ret
.cfi_endproc
.LFE0:
.size function, .-function
.section .rodata.cst16,"aM",#progbits,16
.align 16
.LC0:
.long 0
.long -2147483648
.long 0
.long 0
.ident "GCC: (Ubuntu 6.3.0-12ubuntu2) 6.3.0 20170406"
.section .note.GNU-stack,"",#progbits
here the xorpd instruction uses instruction pointer relative addressing with the offset which points to .LC0 label with the 64 bit value 0x8000000000000000(the 63rd bit is set to one).
.LC0:
.long 0
.long -2147483648
if your compiler was big endian these lines where swaped.
xoring the double value with 0x8000000000000000 sets the sign bit(which is the 63rd bit) to one for a negative value.
clang uses xorps instruction for the same manner this xors the first 32bit of the double value.
if you run object dump with -r option it will show you the relocations that should be done on the program before running it.
objdump -d test.o -r
test.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <function>:
0: f2 0f 10 06 movsd (%rsi),%xmm0
4: 66 0f 57 05 00 00 00 xorpd 0x0(%rip),%xmm0 # c <function+0xc>
b: 00
8: R_X86_64_PC32 .LC0-0x4
c: f2 0f 11 07 movsd %xmm0,(%rdi)
10: c3 retq
Disassembly of section .text.startup:
0000000000000000 <main>:
0: 31 c0 xor %eax,%eax
2: c3 retq
here at <function + 0xb> we have a relocation of type R_X86_64_PC32.
PS: I'm using gcc 6.3.0
xorps xmm0,XMMWORD PTR [rip+0x0]
Any part of an instruction surrounded by [] is an indirect reference to memory.
In this case a reference to the memory at address RIP+0
(I doubt it is actually RIP+0, you might have edited the actual offset)
The X64 instruction set adds instruction pointer relative addressing. This means you can have (usually read-only) data in your program that you can address easily even if the program is moved around in memory.
A XOR xmm0,Y inverts all bits in xmm0 that are set in Y.
Negation involves inverting the sign bit, so that's why xor is used. Specifically xorpd/s because we are dealing with double resp. single floats.

Reversed Mach-O 64-bit x86 Assembly analysis

This question is for Intel x86 assembly experts to answer. Thanks for your effort in advance!
Problem Specification
I am analysing a binary file, which match Mach-O 64-bit x86 assembly. I am currently using MacOS 64 OS. The assembly comes from objdump.
The problem is that when I am learning assembly, I can see variable name "$xxx", I can see string value in ascii and I can also see the callee name like "call _printf"
But in this assembly, I can get nothing above:
no main function:
Disassembly of section __TEXT,__text:
__text:
100000c90: 55 pushq %rbp
100000c91: 48 89 e5 movq %rsp, %rbp
100000c94: 48 83 ec 10 subq $16, %rsp
100000c98: 48 8d 3d bf 02 00 00 leaq 703(%rip), %rdi
100000c9f: b0 00 movb $0, %al
100000ca1: e8 68 02 00 00 callq 616
100000ca6: 89 45 fc movl %eax, -4(%rbp)
100000ca9: 48 83 c4 10 addq $16, %rsp
100000cad: 5d popq %rbp
100000cae: c3 retq
100000caf: 90 nop
100000cb0: 55 pushq %rbp
...
The above is codes frame will be executed, but I have no idea where it is executed.
Also, I newbie of AT&T assemble. Hence, could you tell me what is the meaning of instruction:
0000000100000c90 pushq %rbp
0000000100000c98 leaq 0x2bf(%rip), %rdi ## literal pool for: "xxxx\n"
...
0000000100000cd0 callq 0x100000c90
Is it a loop? I am not sure but it seems to be. And why we they use %rip and %rdi register. In intel x86 I know that EIP represents current caller address, but I don't understand the meaning here.
call integer:
No matter what call convention they used, I had never seen code pattern like "call 616":
"100000cd0: e8 bb ff ff ff callq -69 <__mh_execute_header+C90>"
After ret:
Ret in intel x86, means delete stack frame and return control flow to caller. It should be an independent function. However, after this, we can see codes like
100000cae: c3 retq
100000caf: 90 nop
/* new function call */
100000cb0: 55 pushq %rbp
...
It is ridiculous!
ASCII string lost:
I have already viewed the binary in Hexadecimal format, and recognise some ascii string before reverse it to asm file.
However, in this file no ascii string occurrences!
Total architecture review:
Disassembly of section __TEXT,__text:
__text:
from address 10000c90 to 100000ef6 of 145 lines
Disassembly of section __TEXT,__stubs:
__stubs:
from address 100000efc to 100000f14 of 5 lines asm codes:
100000efc: ff 25 16 01 00 00 jmp qword ptr [rip + 278]
100000f02: ff 25 18 01 00 00 jmp qword ptr [rip + 280]
100000f08: ff 25 1a 01 00 00 jmp qword ptr [rip + 282]
100000f0e: ff 25 1c 01 00 00 jmp qword ptr [rip + 284]
100000f14: ff 25 1e 01 00 00 jmp qword ptr [rip + 286]
Disassembly of section __TEXT,__stub_helper:
__stub_helper:
...
Disassembly of section __TEXT,__cstring:
__cstring:
...
Disassembly of section __TEXT,__unwind_info:
__unwind_info:
...
Disassembly of section __DATA,__nl_symbol_ptr:
__nl_symbol_ptr:
...
Disassembly of section __DATA,__got:
__got:
...
Disassembly of section __DATA,__la_symbol_ptr:
__la_symbol_ptr:
...
Disassembly of section __DATA,__data:
__data:
...
Since it might be a virus, I cannot execute it. How should I analyse it ?
Update on May 21
I have already identified where is the output, and if I totally understand the data flow pipeline represented in this programme, I might be able to figure out the possible solutions.
I am appreciated if someone can give me the detailed explanation. Thank you !
Update on May 22
I installed a MacOS in VirtualBox and after chmod privileges , I executed the programme but nothing special except for two lines of output happened. And the result hiding in the binary file.
You don't need a main if you are not using C. The binary header contains the entry point address.
Nothing special about call 616, it's just that you don't have (all) symbols. It's somewhat strange that objdump didn't calculate the address for you, but it should be 0x100000ca6+616.
Not sure what you find ridiculous there. One function ends, another starts.
That's not a question. Yes, you can create strings at runtime so you won't have them in the image. Possibly they are encrypted.

Why is it necessary to use edi constraint in this inline assembly?

centos 6.5 64bit vps, 500MB ram gcc 4.8.2
I have the following function that works only if I use edi as the constraint to hold the string pointer. If I try to use any other register or constraintg or q etc, it segfaults.
BUT this problem only occurs when both link time optimization and o3 are used together. If o2 it's fine. If I don't use -flto, it's fine. But both together then the only register I can use that doesn't crash is edi
gcc -flto
CFLAGS=-I. -flto -std=gnu11 -msse4.2 -fno-builtin-printf -Wall -Winline -Wstrict-aliasing -g -pg -O3 -lrt -lpthread
It seems like there might be some sort of register clobbering going on or something else. I'm really at a loss to understand why and how to fix this. Another interesting aspect is the generated assembly puts rdi into rdx before using the pointer but if I try to use either register as the input constraint... it segfaults! If it fails under aggressive compiling options it suggests to me either the compiler is stuffing up somehow, or more likely I'm doing something wrong.
char *sse4_strCRLF(char *str)
{
__m128i M = _mm_set1_epi8(13);
char *res;
__asm__ __volatile__(
"xor %0,%0\n\t"
"sub $1, %1\n\t"
"1:" "sub $15,%1\n\t"
".align 16\n\t"
"2:" "add $16, %1\n\t"
"pcmpistri $0x08,(%1),%2\n\t"
"ja 2b\n\t"
"jnc 2f\n\t"
"cmpb $10,1(%1,%%rcx)\n\t"
"jne 1b\n\t"
"add %%rcx,%1\n\t"
"mov %1,%0\n\t"
"2:"
:"=q"(res)
:"edi"(str),"x"(M) //<-- if use anything except edi, it segfaults
:"rcx"
);
return (char*) res;
}
Disassembled output:
00000000000002e0 <sse4_strCRLF>:
2e0: 55 push rbp
2e1: 48 89 e5 mov rbp,rsp
2e4: e8 00 00 00 00 call 2e9 <sse4_strCRLF+0x9>
2e9: 66 0f 6f 05 00 00 00 00 movdqa xmm0,[rip+0x0] # 2f1 <sse4_strCRLF+0x11>
2f1: 48 89 fa mov rdx,rdi //<--- puts rdi into rdx!
2f4: 48 31 c0 xor rax,rax
2f7: 48 83 ea 01 sub rdx,0x1
2fb: 48 83 ea 0f sub rdx,0xf
2ff: 90 nop
300: 48 83 c2 10 add rdx,0x10
304: 66 0f 3a 63 02 08 pcmpistri xmm0,[rdx],0x8
30a: 77 f4 ja 300 <sse4_strCRLF+0x20>
30c: 73 0d jae 31b <sse4_strCRLF+0x3b>
30e: 80 7c 0a 01 0a cmp byte[rdx+rcx*1+0x1],0xa
313: 75 e6 jne 2fb <sse4_strCRLF+0x1b>
315: 48 01 ca add rdx,rcx
318: 48 89 d0 mov rax,rdx
31b: 5d pop rbp
31c: c3 ret
#David Wohlferd gave me the answer. It was 2 dumb mistakes I was making due to ignorance and assumptions. The below code is modified such that the input variable char pointer is not modified by the routine. It's copied into a register and that register is used. Also I was mistakenly thinking I could directly specify a particular register as opposed to a b etc.
gcc still seems to be fussy about what constraints I use. e.g. If I use a and b for res and str respectively, it compiles fine but segfaults on running. But using S and D seems to work fine.
#David Wohlferd, I'd like to credit you as the answerer but I don't think I can do that to a comment.
char *sse4_strCRLF(char *str)
{
__m128i M = _mm_set1_epi8(13);
char *res;
__asm__ __volatile__(
"xor %0,%0\n\t"
"mov %1,%%rdx\n\t"
"sub $1,%%rdx\n\t"
"1:" "sub $15,%%rdx\n\t"
".align 16\n\t"
"2:" "add $16, %%rdx\n\t"
"pcmpistri $0x08,(%%rdx),%2\n\t"
"ja 2b\n\t"
"jnc 2f\n\t"
"cmpb $10,1(%%rdx,%%rcx)\n\t"
"jne 1b\n\t"
"add %%rcx,%%rdx\n\t"
"mov %%rdx,%0\n\t"
"2:"
:"=S"(res)
:"D"(str),"x"(M)
:"rcx","rdx"
);
return (char*) res;
}

Resources