Get line numbers from a kernel call trace - debugging

I'm trying to debug what appears to be a completion queue issue:
Apr 14 18:39:15 ST2035 kernel: Call Trace:
Apr 14 18:39:15 ST2035 kernel: [<ffffffff8049b295>] schedule_timeout+0x1e/0xad
Apr 14 18:39:15 ST2035 kernel: [<ffffffff8049a81c>] wait_for_common+0xd5/0x13c
Apr 14 18:39:15 ST2035 kernel: [<ffffffffa01ca32b>]
ib_unregister_mad_agent+0x376/0x4c9 [ib_mad]
Apr 14 18:39:16 ST2035 kernel: [<ffffffffa03058f4>] ib_umad_close+0xbd/0xfd
Is it possible to turn those hex numbers into something close to line numbers?

Not exactly, but if you have a vmlinux image built with debugging info (e.g., on RHEL you should be able to install the kernel-debuginfo package, or similarly named), you can get close. So, assuming you have that vmlinux file available, do the following:
objdump -S vmlinux
This will try its hardest to match the object code to individual lines of source code.
e.g. for the following C code:
#include <stdio.h>
main() {
int a = 1;
int b = 2;
// This is a comment
printf("This is the print line %d\n", b);
}
compiled with cc -g test.c
and then running objdump -S on the resulting executable, I get a large output describing the various parts of the executable, including the following section:
00000000004004cc <main>:
#include <stdio.h>
main() {
4004cc: 55 push %rbp
4004cd: 48 89 e5 mov %rsp,%rbp
4004d0: 48 83 ec 20 sub $0x20,%rsp
int a = 1;
4004d4: c7 45 f8 01 00 00 00 movl $0x1,-0x8(%rbp)
int b = 2;
4004db: c7 45 fc 02 00 00 00 movl $0x2,-0x4(%rbp)
// This is a comment
printf("This is the print line %d\n", b);
4004e2: 8b 75 fc mov -0x4(%rbp),%esi
4004e5: bf ec 05 40 00 mov $0x4005ec,%edi
4004ea: b8 00 00 00 00 mov $0x0,%eax
4004ef: e8 cc fe ff ff callq 4003c0 <printf@plt>
}
You can match the addresses of the object code in the first column against the addresses in your stack trace. Combine that with the line number info interleaved in the assembly output... and you're there.
Now keep in mind that this will not always succeed 100%, because the kernel is normally compiled at the -O2 optimization level and the compiler will have done a lot of code re-ordering etc. But if you are familiar with the code that you're trying to debug and have some comfort with deciphering the assembly of the platform you're working on... you should be able to pin down most of your crashes etc.
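As a shortcut, addr2line (part of binutils) can map an address in a debug-info vmlinux straight to a file and line; a sketch of the usual invocation (for addresses inside a module such as ib_mad you would need the module's .ko built with debug info rather than vmlinux, plus its load offset):
addr2line -e vmlinux -f -i ffffffff8049b295
Here -f prints the enclosing function name and -i also walks through any inlined frames.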

You can use scripts/decode_stacktrace.sh from the kernel source:
decode_stacktrace.sh current_vmlinux_file < call_trace_text_file
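If the trace is still in the kernel log, you can pipe it straight in; a sketch, assuming you run the script from the kernel source tree and have a matching vmlinux with debug info:
dmesg | ./scripts/decode_stacktrace.sh /path/to/vmlinux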

Related

Why is gcc using two stores (`MOV %reg, (mem)`) instead of just one?

Compiling the following with gcc 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04):
if(n >= m)
{
n = 0;
}
The code output looks like this:
23bd5: 48 3b 93 10 03 00 00 cmp 0x310(%rbx),%rdx
23bdc: 48 89 93 20 03 00 00 mov %rdx,0x320(%rbx)
23be3: 72 0d jb 23bf2
23be5: 48 c7 83 20 03 00 00 movq $0x0,0x320(%rbx)
23bec: 00 00 00 00
23bf0: 31 d2 xor %edx,%edx
23bf2: ....
I'm wondering why the compiler decides to use what looks like a rather expensive set of instructions. I would have used the following instead:
cmp 0x310(%rbx), %rdx
jb .L1
xor %edx, %edx
.L1:
mov %rdx, 0x320(%rbx)
Could it be because the XOR takes so much time that it would stall before the store? Or is it that the first store doesn't really happen if the processor detects the second store? Or is it that the second store is nearly instant because it will use the L1 cache (presumably)?
(that being said, the store could happen later as other instructions coming after could safely be moved before that store).

Extra bytes at the end of a DOS .COM file, compiled with GCC

I have the following C source file, with some asm blocks that implement a print and exit routine by calling DOS system calls.
__asm__(
".code16gcc;"
"call dosmain;"
"mov $0x4C, %AH;"
"int $0x21;"
);
void print(char *str)
{
__asm__(
"mov $0x09, %%ah;"
"int $0x21;"
: // no output
: "d"(str)
: "ah"
);
}
void dosmain()
{
// DOS system call expects strings to be terminated by $.
print("Hello world$");
}
The linker script file and the build script file are as such,
OUTPUT_FORMAT(binary)
SECTIONS
{
. = 0x0100;
.text :
{
*(.text);
}
.data :
{
*(.data);
*(.bss);
*(.rodata);
}
_heap = ALIGN(4);
}
gcc -fno-pie -Os -nostdlib -ffreestanding -m16 -march=i386 \
-Wl,--nmagic,--script=simple_dos.ld simple_dos.c -o simple_dos.com
I am used to building .COM files in assembly, and I am aware of the structure of a DOS file. However, in the case of the .COM file generated using GCC, I am getting some extra bytes at the end and I cannot figure out why. (The bytes inside the shaded area and the box below are what is expected; everything else is unaccounted for.)
(hex dump screenshot of the generated .COM file omitted)
My hunch is that these are some static storage used by GCC. I thought this could be due to the string in the program, so I commented out the line print("Hello world$");, but the extra bytes are still there. It would be of great help if someone knows what is going on and can tell me how to prevent GCC from inserting these bytes in the output.
Source code is available here: Github
PS: Object file also contains these extra bytes.
Since you are using a native compiler and not an i686 (or i386) cross compiler, you can get a fair amount of extra information in the output; it is rather dependent on how the compiler is configured. I would recommend doing the following to remove unwanted code generation and sections:
Use GCC option -fno-asynchronous-unwind-tables to eliminate any .eh_frame sections. This is the cause of the unwanted data appended at the end of your DOS COM program in this case.
Use GCC option -static to build without relocations to avoid any form of dynamic linking.
Have GCC pass the --build-id=none option to the linker with -Wl to avoid unnecessarily generating any .note.gnu.build-id sections.
Modify the linker script to DISCARD any .comment sections.
Your build command could look like:
gcc -fno-pie -static -Os -nostdlib -fno-asynchronous-unwind-tables -ffreestanding \
-m16 -march=i386 -Wl,--build-id=none,--nmagic,--script=simple_dos.ld simple_dos.c \
-o simple_dos.com
I would modify your linker script to look like:
OUTPUT_FORMAT(binary)
SECTIONS
{
. = 0x0100;
.text :
{
*(.text*);
}
.data :
{
*(.data);
*(.rodata*);
*(.bss);
*(COMMON)
}
_heap = ALIGN(4);
/DISCARD/ : { *(.comment); }
}
Besides adding a /DISCARD/ directive to eliminate any .comment sections, I also add *(COMMON) alongside .bss; both are BSS-style sections. I have also moved them after the data sections, as they won't take up space in the .COM file if they appear after the other sections. I also changed *(.rodata); to *(.rodata*); and *(.text); to *(.text*); because GCC can generate section names that begin with .rodata and .text but have different suffixes on them.
Inline Assembly
Not related to the problem you asked about, but is important. In this inline assembly:
__asm__(
"mov $0x09, %%ah;"
"int $0x21;"
: // no output
: "d"(str)
: "ah"
);
Int 21h/AH=9h also clobbers AL. You should use ax as the clobber.
Since you are passing the address of an array through a register, you will also want to add a "memory" clobber so that the compiler actually materializes the entire array in memory before your inline assembly is emitted. The constraint "d"(str) only tells the compiler that you will be using the pointer as input, not what the pointer points at.
If you compiled with optimisations at -O3, you'd likely discover that the following version of the program doesn't even have your string "Hello world$" in it, because of this bug:
__asm__(
".code16gcc;"
"call dosmain;"
"mov $0x4C, %AH;"
"int $0x21;"
);
void print(char *str)
{
__asm__(
"mov $0x09, %%ah;"
"int $0x21;"
: // no output
: "d"(str)
: "ax");
}
void dosmain()
{
char hello[] = "Hello world$";
print(hello);
}
The generated code for dosmain allocates space on the stack for the string but never copies the string onto the stack before printing it:
00000100 <print-0xc>:
100: 66 e8 12 00 00 00 calll 118 <dosmain>
106: b4 4c mov $0x4c,%ah
108: cd 21 int $0x21
10a: 66 90 xchg %eax,%eax
0000010c <print>:
10c: 67 66 8b 54 24 04 mov 0x4(%esp),%edx
112: b4 09 mov $0x9,%ah
114: cd 21 int $0x21
116: 66 c3 retl
00000118 <dosmain>:
118: 66 83 ec 10 sub $0x10,%esp
11c: 67 66 8d 54 24 03 lea 0x3(%esp),%edx
122: b4 09 mov $0x9,%ah
124: cd 21 int $0x21
126: 66 83 c4 10 add $0x10,%esp
12a: 66 c3 retl
If you change the inline assembly to include a "memory" clobber like this:
void print(char *str)
{
__asm__(
"mov $0x09, %%ah;"
"int $0x21;"
: // no output
: "d"(str)
: "ax", "memory");
}
The generated code may look similar to this:
00000100 <print-0xc>:
100: 66 e8 12 00 00 00 calll 118 <dosmain>
106: b4 4c mov $0x4c,%ah
108: cd 21 int $0x21
10a: 66 90 xchg %eax,%eax
0000010c <print>:
10c: 67 66 8b 54 24 04 mov 0x4(%esp),%edx
112: b4 09 mov $0x9,%ah
114: cd 21 int $0x21
116: 66 c3 retl
00000118 <dosmain>:
118: 66 57 push %edi
11a: 66 56 push %esi
11c: 66 83 ec 10 sub $0x10,%esp
120: 67 66 8d 7c 24 03 lea 0x3(%esp),%edi
126: 66 be 48 01 00 00 mov $0x148,%esi
12c: 66 b9 0d 00 00 00 mov $0xd,%ecx
132: f3 a4 rep movsb %ds:(%si),%es:(%di)
134: 67 66 8d 54 24 03 lea 0x3(%esp),%edx
13a: b4 09 mov $0x9,%ah
13c: cd 21 int $0x21
13e: 66 83 c4 10 add $0x10,%esp
142: 66 5e pop %esi
144: 66 5f pop %edi
146: 66 c3 retl
Disassembly of section .rodata.str1.1:
00000148 <_heap-0x10>:
148: 48 dec %ax
149: 65 6c gs insb (%dx),%es:(%di)
14b: 6c insb (%dx),%es:(%di)
14c: 6f outsw %ds:(%si),(%dx)
14d: 20 77 6f and %dh,0x6f(%bx)
150: 72 6c jb 1be <_heap+0x66>
152: 64 24 00 fs and $0x0,%al
An alternate version of the inline assembly that passes subfunction 9 (in AH) via an "a" constraint using a variable, and marks it as an input/output with + (since AX gets clobbered on return), could be done this way:
void print(char *str)
{
unsigned short int write_fun = (0x09<<8) | 0x00;
__asm__ __volatile__ (
"int $0x21;"
: "+a"(write_fun)
: "d"(str)
: "memory"
);
}
Recommendation: Don't use GCC for 16-bit code generation. The inline assembly is difficult to get right and you will probably be using a fair amount of it for low-level routines. You could look at Smaller C, Bruce's C compiler, or OpenWatcom C as alternatives. All of them can generate DOS COM programs.
The extra data is likely DWARF unwind information. You can stop GCC from generating it with the -fno-asynchronous-unwind-tables option.
You can also have the GNU linker discard the unwind information by adding the following to SECTIONS directive of your linker script:
/DISCARD/ :
{
*(.eh_frame)
}
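If you want to confirm where the extra bytes come from, listing the section headers of the intermediate object file is a quick check; a sketch, assuming you compile to an object file named simple_dos.o first:
gcc -c -fno-pie -Os -nostdlib -ffreestanding -m16 -march=i386 simple_dos.c -o simple_dos.o
objdump -h simple_dos.o
Without -fno-asynchronous-unwind-tables you should see an .eh_frame section in that listing; with the option it should disappear.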
Also note that the generated COM file will be one byte bigger than you expect because of the null terminator at the end of the string.

Compiling MIPS to use branch instead of jump

Having the following very simple C program:
#include <stdio.h>
#include <stdlib.h>
int main()
{
char *buffer = (char*)malloc(20);
}
And compiling it with mips-linux-gnu-gcc, it looks like the call is compiled to the following instructions:
.text:004007EC 24 04 00 14 li $a0, 0x14
.text:004007F0 8F 82 80 50 la $v0, malloc # Load Address
.text:004007F4 00 40 C8 25 move $t9, $v0
.text:004007F8 03 20 F8 09 jalr $t9 ; malloc # Jump And Link Register
.text:004007FC 00 00 00 00 nop
The full command line of the compilation is:
mips-linux-gnu-gcc my_malloc.c -o my_malloc.so
However, I would like the function calls to compile to just normal branch instructions:
jal malloc
li $a0, 0x14
Does someone know how to achieve this result?
You need to tell the compiler to use the PLT for calls, using the -mplt option. This requires PLT support in the rest of the toolchain.
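A minimal sketch of what that could look like, reusing the question's own command (assuming the toolchain was built with PLT support; exact behaviour depends on how it was configured):
mips-linux-gnu-gcc -mplt my_malloc.c -o my_malloc.so
With the PLT in use, the call should be emitted as a direct jal to malloc's PLT stub rather than loading the address from the GOT and calling through jalr.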

Why does loop alignment on 32 byte make code faster?

Look at this code:
one.cpp:
bool test(int a, int b, int c, int d);
int main() {
volatile int va = 1;
volatile int vb = 2;
volatile int vc = 3;
volatile int vd = 4;
int a = va;
int b = vb;
int c = vc;
int d = vd;
int s = 0;
__asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
__asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
__asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
__asm__("nop"); __asm__("nop"); __asm__("nop"); __asm__("nop");
for (int i=0; i<2000000000; i++) {
s += test(a, b, c, d);
}
return s;
}
two.cpp:
bool test(int a, int b, int c, int d) {
// return a == d || b == d || c == d;
return false;
}
There are 16 nops in one.cpp. You can comment/decomment them to change alignment of the loop's entry point between 16 and 32. I've compiled them with g++ one.cpp two.cpp -O3 -mtune=native.
Here are my questions:
the 32-aligned version is faster than the 16-aligned version. On Sandy Bridge, the difference is 20%; on Haswell, 8%. Why the difference?
with the 32-aligned version, the code runs at the same speed on Sandy Bridge, no matter which return statement is in two.cpp. I thought the return false version should be faster at least a little bit. But no, exactly the same speed!
If I remove the volatiles from one.cpp, the code becomes slower (Haswell: before: ~2.17 sec, after: ~2.38 sec). Why is that? And this only happens when the loop is aligned to 32.
The fact that the 32-aligned version is faster is strange to me, because the Intel® 64 and IA-32 Architectures Optimization Reference Manual says (page 3-9):
Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16-byte aligned.
Another little question: are there any tricks to make only this loop 32-aligned (so the rest of the code could keep using 16-byte alignment)?
Note: I've tried compilers gcc 6, gcc 7 and clang 3.9, same results.
Here's the code with volatile (the code is the same for 16/32 aligned, just the addresses differ):
0000000000000560 <main>:
560: 41 57 push r15
562: 41 56 push r14
564: 41 55 push r13
566: 41 54 push r12
568: 55 push rbp
569: 31 ed xor ebp,ebp
56b: 53 push rbx
56c: bb 00 94 35 77 mov ebx,0x77359400
571: 48 83 ec 18 sub rsp,0x18
575: c7 04 24 01 00 00 00 mov DWORD PTR [rsp],0x1
57c: c7 44 24 04 02 00 00 mov DWORD PTR [rsp+0x4],0x2
583: 00
584: c7 44 24 08 03 00 00 mov DWORD PTR [rsp+0x8],0x3
58b: 00
58c: c7 44 24 0c 04 00 00 mov DWORD PTR [rsp+0xc],0x4
593: 00
594: 44 8b 3c 24 mov r15d,DWORD PTR [rsp]
598: 44 8b 74 24 04 mov r14d,DWORD PTR [rsp+0x4]
59d: 44 8b 6c 24 08 mov r13d,DWORD PTR [rsp+0x8]
5a2: 44 8b 64 24 0c mov r12d,DWORD PTR [rsp+0xc]
5a7: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
5ac: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
5b3: 00 00 00
5b6: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
5bd: 00 00 00
5c0: 44 89 e1 mov ecx,r12d
5c3: 44 89 ea mov edx,r13d
5c6: 44 89 f6 mov esi,r14d
5c9: 44 89 ff mov edi,r15d
5cc: e8 4f 01 00 00 call 720 <test(int, int, int, int)>
5d1: 0f b6 c0 movzx eax,al
5d4: 01 c5 add ebp,eax
5d6: 83 eb 01 sub ebx,0x1
5d9: 75 e5 jne 5c0 <main+0x60>
5db: 48 83 c4 18 add rsp,0x18
5df: 89 e8 mov eax,ebp
5e1: 5b pop rbx
5e2: 5d pop rbp
5e3: 41 5c pop r12
5e5: 41 5d pop r13
5e7: 41 5e pop r14
5e9: 41 5f pop r15
5eb: c3 ret
5ec: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
Without volatile:
0000000000000560 <main>:
560: 55 push rbp
561: 31 ed xor ebp,ebp
563: 53 push rbx
564: bb 00 94 35 77 mov ebx,0x77359400
569: 48 83 ec 08 sub rsp,0x8
56d: 66 0f 1f 84 00 00 00 nop WORD PTR [rax+rax*1+0x0]
574: 00 00
576: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax*1+0x0]
57d: 00 00 00
580: b9 04 00 00 00 mov ecx,0x4
585: ba 03 00 00 00 mov edx,0x3
58a: be 02 00 00 00 mov esi,0x2
58f: bf 01 00 00 00 mov edi,0x1
594: e8 47 01 00 00 call 6e0 <test(int, int, int, int)>
599: 0f b6 c0 movzx eax,al
59c: 01 c5 add ebp,eax
59e: 83 eb 01 sub ebx,0x1
5a1: 75 dd jne 580 <main+0x20>
5a3: 48 83 c4 08 add rsp,0x8
5a7: 89 e8 mov eax,ebp
5a9: 5b pop rbx
5aa: 5d pop rbp
5ab: c3 ret
5ac: 0f 1f 40 00 nop DWORD PTR [rax+0x0]
This doesn't answer point 2 (return a == d || b == d || c == d; being the same speed as return false). That's still a maybe-interesting question, since that must compile to multiple uop-cache lines of instructions.
The fact that the 32-aligned version is faster is strange to me, because [Intel's manual says to align to 16]
That optimization-guide advice is a very general guideline, and definitely doesn't mean that larger never helps. Usually it doesn't, and padding to 32 would be more likely to hurt than help. (I-cache misses, ITLB misses, and more code bytes to load from disk).
In fact, 16B alignment is rarely necessary, especially on CPUs with a uop cache. For a small loop that can run from the loop buffer, alignment is usually totally irrelevant.
(Skylake microcode updates disabled the loop buffer to work around a partial-register AH-merging bug, SKL150. This creates problems for tiny loops that span a 32-byte boundary, only running one iteration per 2 clocks, instead of the one iteration per 1.5 clocks you might get from a 6 uop loop on Haswell, or on SKL with older microcode. The LSD is not re-enabled until Ice Lake; it stays disabled in Kaby/Coffee/Comet Lake, which are the same microarchitecture as SKL/SKX.)
Another SKL erratum workaround created another worse code-alignment pothole: How can I mitigate the impact of the Intel jcc erratum on gcc?
16B is still not bad as a broad recommendation, but it doesn't tell you everything you need to know to understand one specific case on a couple of specific CPUs.
Compilers usually default to aligning loop branches and function entry-points, but usually don't align other branch targets. The cost of executing a NOP (and code bloat) is often larger than the likely cost of an unaligned non-loop branch target.
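(As an aside on the "only this loop" sub-question: with GCC/g++ you can tune those defaults through alignment flags, e.g. the line below, but they apply to every function and loop in the compilation, not to a single loop:)
g++ one.cpp two.cpp -O3 -mtune=native -falign-functions=32 -falign-loops=32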
Code alignment has some direct and some indirect effects. The direct effects include the uop cache on Intel SnB-family. For example, see Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs.
Another section of Intel's optimization manual goes into some detail about how the uop cache works:
2.3.2.2 Decoded ICache:
All micro-ops in a Way (uop cache line) represent instructions which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region. (I think this means an instruction that extends past the boundary goes in the uop cache for the block containing its start, rather than its end. Spanning instructions have to go somewhere, and the branch target address that would run the instruction is the start of the insn, so it's most useful to put it in a line for that block.)
A multi micro-op instruction cannot be split across Ways.
An instruction which turns on the MSROM consumes an entire Way.
Up to two branches are allowed per Way.
A pair of macro-fused instructions is kept as one micro-op.
See also Agner Fog's microarch guide. He adds:
An unconditional jump or call always ends a μop cache line
lots of other stuff that probably isn't relevant here.
He also notes that if your code doesn't fit in the uop cache, it can't run from the loop buffer.
The indirect effects of alignment include:
larger/smaller code-size (L1I cache misses, TLB). Not relevant for your test
which branches alias each other in the BTB (Branch Target Buffer).
If I remove volatiles from one.cpp, code becomes slower. Why is that?
The larger instructions push the last instruction in the loop across a 32B boundary:
59e: 83 eb 01 sub ebx,0x1
5a1: 75 dd jne 580 <main+0x20>
So if you aren't running from the loop buffer (LSD), then without volatile one of the uop-cache fetch cycles gets only 1 uop.
If sub/jne macro-fuses, this might not apply. And I think only crossing a 64B boundary would break macro-fusion.
Also, those aren't real addresses. Have you checked what the addresses are after linking? There could be a 64B boundary there after linking, if the text section has less than 64B alignment.
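A quick way to check is to disassemble the linked binary rather than the object file; a sketch, assuming the default g++ output name:
objdump -d ./a.out | less
and then look at where the loop's sub/jne pair falls relative to 32B and 64B boundaries.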
Also related to 32-byte boundaries, the JCC erratum disables the uop cache for blocks where a branch (including macro-fused ALU+JCC) includes the last byte of the line, on Skylake CPUs.
How can I mitigate the impact of the Intel jcc erratum on gcc?
Sorry, I haven't actually tested this to say more about this specific case. The point is, when you bottleneck on the front-end from stuff like having a call/ret inside a tight loop, alignment becomes important and can get extremely complex. Whether all the following instructions cross a boundary or not is affected. Do not expect it to be simple. If you've read my other answers, you'll know I'm not usually the kind of person to say "it's too complicated to fully explain", but alignment can be that way.
See also Code alignment in one object file is affecting the performance of a function in another object file
In your case, make sure tiny functions inline. Use link-time optimization if your code-base has any important tiny functions in separate .c files instead of in a .h where they can inline. Or change your code to put them in a .h.
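For this particular pair of files, that could look like the following (reusing the question's own compile command with LTO added), which should let test() from two.cpp be inlined into the loop in main():
g++ one.cpp two.cpp -O3 -mtune=native -flto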

Why is it necessary to use edi constraint in this inline assembly?

centos 6.5 64bit vps, 500MB ram gcc 4.8.2
I have the following function that works only if I use edi as the constraint to hold the string pointer. If I try to use any other register or constraint, g or q etc., it segfaults.
BUT this problem only occurs when both link-time optimization and -O3 are used together. With -O2 it's fine. If I don't use -flto, it's fine. But with both together, the only register I can use that doesn't crash is edi.
gcc -flto
CFLAGS=-I. -flto -std=gnu11 -msse4.2 -fno-builtin-printf -Wall -Winline -Wstrict-aliasing -g -pg -O3 -lrt -lpthread
It seems like there might be some sort of register clobbering going on, or something else. I'm really at a loss to understand why and how to fix this. Another interesting aspect is that the generated assembly puts rdi into rdx before using the pointer, but if I try to use either register as the input constraint... it segfaults! If it fails under aggressive compilation options, it suggests to me either the compiler is stuffing up somehow, or more likely I'm doing something wrong.
char *sse4_strCRLF(char *str)
{
__m128i M = _mm_set1_epi8(13);
char *res;
__asm__ __volatile__(
"xor %0,%0\n\t"
"sub $1, %1\n\t"
"1:" "sub $15,%1\n\t"
".align 16\n\t"
"2:" "add $16, %1\n\t"
"pcmpistri $0x08,(%1),%2\n\t"
"ja 2b\n\t"
"jnc 2f\n\t"
"cmpb $10,1(%1,%%rcx)\n\t"
"jne 1b\n\t"
"add %%rcx,%1\n\t"
"mov %1,%0\n\t"
"2:"
:"=q"(res)
:"edi"(str),"x"(M) //<-- if use anything except edi, it segfaults
:"rcx"
);
return (char*) res;
}
Disassembled output:
00000000000002e0 <sse4_strCRLF>:
2e0: 55 push rbp
2e1: 48 89 e5 mov rbp,rsp
2e4: e8 00 00 00 00 call 2e9 <sse4_strCRLF+0x9>
2e9: 66 0f 6f 05 00 00 00 00 movdqa xmm0,[rip+0x0] # 2f1 <sse4_strCRLF+0x11>
2f1: 48 89 fa mov rdx,rdi //<--- puts rdi into rdx!
2f4: 48 31 c0 xor rax,rax
2f7: 48 83 ea 01 sub rdx,0x1
2fb: 48 83 ea 0f sub rdx,0xf
2ff: 90 nop
300: 48 83 c2 10 add rdx,0x10
304: 66 0f 3a 63 02 08 pcmpistri xmm0,[rdx],0x8
30a: 77 f4 ja 300 <sse4_strCRLF+0x20>
30c: 73 0d jae 31b <sse4_strCRLF+0x3b>
30e: 80 7c 0a 01 0a cmp byte[rdx+rcx*1+0x1],0xa
313: 75 e6 jne 2fb <sse4_strCRLF+0x1b>
315: 48 01 ca add rdx,rcx
318: 48 89 d0 mov rax,rdx
31b: 5d pop rbp
31c: c3 ret
@David Wohlferd gave me the answer. It was 2 dumb mistakes I was making due to ignorance and assumptions. The code below is modified so that the input char pointer is not modified by the routine: it's copied into a register and that register is used instead. Also, I was mistakenly thinking I could directly name a particular register (like edi) in the constraint string, as opposed to using the constraint letters a, b, etc.
gcc still seems to be fussy about which constraints I use; e.g. if I use a and b for res and str respectively, it compiles fine but segfaults when run. But using S and D seems to work fine.
@David Wohlferd, I'd like to credit you as the answerer but I don't think I can do that to a comment.
char *sse4_strCRLF(char *str)
{
__m128i M = _mm_set1_epi8(13);
char *res;
__asm__ __volatile__(
"xor %0,%0\n\t"
"mov %1,%%rdx\n\t"
"sub $1,%%rdx\n\t"
"1:" "sub $15,%%rdx\n\t"
".align 16\n\t"
"2:" "add $16, %%rdx\n\t"
"pcmpistri $0x08,(%%rdx),%2\n\t"
"ja 2b\n\t"
"jnc 2f\n\t"
"cmpb $10,1(%%rdx,%%rcx)\n\t"
"jne 1b\n\t"
"add %%rcx,%%rdx\n\t"
"mov %%rdx,%0\n\t"
"2:"
:"=S"(res)
:"D"(str),"x"(M)
:"rcx","rdx"
);
return (char*) res;
}

Resources