In Clang/LLVM x86-64 inline assembly, how do I say I clobbered the x87/media state? - gcc

I'm writing some x86-64 inline assembly that might affect the floating point and media (SSE, MMX, etc.) state, but I don't feel like saving and restoring the state myself. Does Clang/LLVM have a clobber constraint for that?
(I'm not too familiar with the x86-64 architecture or inline assembly, so it was hard to know what to search for. More details in case this is an XY problem: I'm working on a simple coroutine library in Rust. When we switch tasks, we need to store the old CPU state and load the new state, and I'd like to write as little assembly as possible. My guess is that letting the compiler take care of saving and restoring state is the simplest way to do that.)

If your coroutine looks like an opaque (non-inline) function call, the compiler will already assume the FP state is clobbered (except for control regs like MXCSR and the x87 control word (rounding mode)), because all the FP regs are call-clobbered in the normal function calling convention.
Except for Windows, where xmm6..15 are call-preserved.
Also beware that if you're putting a call inside inline asm, there's no way to tell the compiler that your asm clobbers the red zone (128 bytes below RSP in the x86-64 System V ABI). You could compile that file with -mno-redzone or use add rsp, -128 before call to skip over the red-zone that belongs to the compiler-generated code.
To declare clobbers on the FP state, you have to name all the registers separately.
"xmm0", "xmm1", ..., "xmm15" (clobbering xmm0 counts as clobbering ymm0/zmm0).
For good measure you should also name "mm0", ..., "mm7" as well (MMX), in case your code inlines into some legacy code using MMX intrinsics.
To clobber the x87 stack as well, "st" is how you refer to st(0) in the clobber list. The rest of the registers have their normal names for GAS syntax, "st(1)", ..., "st(7)".
https://stackoverflow.com/questions/39728398/how-to-specify-clobbered-bottom-of-the-x87-fpu-stack-with-extended-gcc-assembly
You never know, it is possible to compile with clang -mfpmath=387, or to use 387 via long double.
(Hopefully no code uses -mfpmath=387 in 64-bit mode and MMX intrinsics at the same time; the following test-case looks slightly broken with gcc in that case.)
#include <immintrin.h>
float gvar;
int testclobber(float f, char *p)
{
int arg1 = 1, arg2 = 2;
f += gvar; // with -mno-sse, this will be in an x87 register
__m64 mmx_var = *(const __m64*)p; // MMX
mmx_var = _mm_unpacklo_pi8(mmx_var, mmx_var);
// x86-64 System V calling convention
unsigned long long retval;
asm volatile ("add $-128, %%rsp \n\t" // skip red zone. -128 fits in an imm8
"call whatever \n\t"
"sub $-128, %%rsp \n\t"
// FIXME should probably align the stack in here somewhere
: "=a"(retval) // returns in RAX
: "D" (arg1), "S" (arg2) // input args in registers
: "rcx", "rdx", "r8", "r9", "r10", "r11" // call-clobbered integer regs
// call clobbered FP regs, *NOT* including MXCSR
, "mm0", "mm1", "mm2", "mm3", "mm4", "mm5", "mm6", "mm7" // MMX
, "st", "st(1)", "st(2)", "st(3)", "st(4)", "st(5)", "st(6)", "st(7)" // x87
// SSE/AVX: clobbering any results in a redundant vzeroupper with gcc?
, "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7"
, "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14", "xmm15"
#ifdef __AVX512F__
, "zmm16", "zmm17", "zmm18", "zmm19", "zmm20", "zmm21", "zmm22", "zmm23"
, "zmm24", "zmm25", "zmm26", "zmm27", "zmm28", "zmm29", "zmm30", "zmm31"
, "k0", "k1", "k2", "k3", "k4", "k5", "k6", "k7"
#endif
#ifdef __MPX__
, "bnd0", "bnd1", "bnd2", "bnd3"
#endif
, "memory" // reads/writes of globals and pointed-to data can't reorder across the asm (at compile time; runtime StoreLoad reordering is still a thing)
);
// Use the MMX var after the asm: compiler has to spill/reload the reg it was in
*(__m64*)p = mmx_var;
_mm_empty(); // emms
gvar = f; // memory clobber prevents hoisting this ahead of the asm.
return retval;
}
source + asm on the Godbolt compiler explorer
By commenting out one of the lines of clobbers, we can see that the spill/reload goes away in the asm. e.g. commenting out the x87 st .. st(7) clobbers makes code that leaves f + gvar in st0, for just a fst dword [gvar] after the call.
Similarly, commenting out the mm0 line lets gcc and clang keep mmx_var in mm0 across the call. Since the ABI requires that the FPU is in x87 mode, not MMX, on call / ret, the clobber isn't really sufficient: the compiler will spill/reload around the asm, but it won't insert an emms for us. But by the same token, it would be an error for a function using MMX to call your co-routine without doing _mm_empty() first, so maybe this isn't a real problem.
I haven't experimented with __m256 variables to see if it inserts a vzeroupper before the asm, to avoid possible SSE/AVX slowdowns.
If we comment the xmm8..15 line, we see the version that isn't using x87 for float keeps it in xmm8, because now it thinks it has some non-clobbered xmm regs. If we comment both sets of lines, it assumes xmm0 lives across the asm, so this works as a test of the clobbers.
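The effect of the clobber list can also be seen in a much smaller function. The following is only a sketch, assuming GCC or Clang on x86-64; add_across_clobber is a hypothetical name, and the asm statement is deliberately empty so that only the clobbers matter.

```c
/* Sketch, x86-64 GCC/Clang only: an empty asm statement that declares the
   sixteen baseline xmm registers clobbered. add_across_clobber is a
   hypothetical name used only for this illustration. */
float add_across_clobber(float a, float b)
{
    float sum = a + b;          /* normally computed in an xmm register */
#if defined(__x86_64__)
    __asm__ volatile (""        /* no instructions; only the clobbers matter */
        :                       /* no outputs */
        :                       /* no inputs */
        : "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7",
          "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14", "xmm15");
#endif
    /* sum is still valid here: the compiler may not keep it in xmm0..15
       across the asm, so it has to live somewhere else (typically a
       stack slot) while the asm statement "runs". */
    return sum;
}
```

Compiling this with optimization and removing the clobber list should make the spill and reload of sum disappear, the same experiment as with the larger test case.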
asm output with all clobbers in place
It saves/restores RBX (to hold the pointer arg across the asm statement), which happens to re-align the stack by 16. That's another problem with using call from inline asm: I don't think alignment of RSP is guaranteed.
# from clang7.0 -march=skylake-avx512 -mmpx
testclobber: # #testclobber
push rbx
vaddss xmm0, xmm0, dword ptr [rip + gvar]
vmovss dword ptr [rsp - 12], xmm0 # 4-byte Spill (because of xmm0..15 clobber)
mov rbx, rdi # save pointer for after asm
movq mm0, qword ptr [rdi]
punpcklbw mm0, mm0 # mm0 = mm0[0,0,1,1,2,2,3,3]
movq qword ptr [rsp - 8], mm0 # 8-byte Spill (because of mm0..7 clobber)
mov edi, 1
mov esi, 2
add rsp, -128
call whatever
sub rsp, -128
movq mm0, qword ptr [rsp - 8] # 8-byte Reload
movq qword ptr [rbx], mm0
emms # note this didn't happen before call
vmovss xmm0, dword ptr [rsp - 12] # 4-byte Reload
vmovss dword ptr [rip + gvar], xmm0
pop rbx
ret
Notice that because of the "memory" clobber in the asm statement, *p and gvar are read before the asm, but written after. Without that, the optimizer could sink the load or hoist the store so no local variable was live across the asm statement. But now the optimizer needs to assume that the asm statement itself might read the old value of gvar and/or modify it. (And assume that p points to memory that's also globally accessible somehow, because we didn't use __restrict.)
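The "memory" clobber on its own can be isolated in a few lines. This is a minimal sketch rather than part of the test case above; shared_counter and bump_around_barrier are hypothetical names, with shared_counter standing in for gvar.

```c
/* Hypothetical global standing in for gvar. */
int shared_counter;

int bump_around_barrier(int v)
{
    shared_counter = v;                  /* store has to be emitted before the asm */
    __asm__ volatile ("" ::: "memory");  /* compile-time barrier, no instructions */
    return shared_counter + 1;           /* global is reloaded after the asm */
}
```

Without the asm statement the compiler would be free to return v + 1 directly and never reload the global. As in the larger example, this only constrains compile-time reordering, not what the CPU does at run time.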

Related

Efficient type punning without undefined behavior

Say I'm working on a library called libModern. This library uses a legacy C library, called libLegacy, as an implementation strategy. libLegacy's interface looks like this:
typedef uint32_t LegacyFlags;
struct LegacyFoo {
uint32_t x;
uint32_t y;
LegacyFlags flags;
// more data
};
struct LegacyBar {
LegacyFoo foo;
float a;
// more data
};
void legacy_input(LegacyBar const* s); // Does something with s
void legacy_output(LegacyBar* s); // Stores data in s
libModern shouldn't expose libLegacy's types to its users for various reasons, among them:
libLegacy is an implementation detail that shouldn't be leaked. Future versions of libModern might choose to use another library instead of libLegacy.
libLegacy uses hard-to-use, easy-to-misuse types that shouldn't be part of any user-facing API.
The textbook way to deal with this situation is the pimpl idiom: libModern would provide a wrapper type that internally has a pointer to the legacy data. However, this is not possible here, since libModern cannot allocate dynamic memory. Generally, its goal is not to add a lot of overhead.
Therefore, libModern defines its own types that are layout-compatible with the legacy types, yet have a better interface. In this example it is using a strong enum instead of a plain uint32_t for flags:
enum class ModernFlags : std::uint32_t
{
first_flag = 0,
second_flag = 1,
};
struct ModernFoo {
std::uint32_t x;
std::uint32_t y;
ModernFlags flags;
// More data
};
struct ModernBar {
ModernFoo foo;
float a;
// more data
};
Now the question is: How can libModern convert between the legacy and the modern types without much overhead? I know of 3 options:
reinterpret_cast. This is undefined behavior, but in practice produces perfect assembly. I want to avoid this, since I cannot rely on this still working tomorrow or on another compiler.
std::memcpy. In simple cases this generates the same optimal assembly, but in any non-trivial case this adds significant overhead.
C++20's std::bit_cast. In my tests, at best it produces exactly the same code as memcpy. In some cases it's worse.
This is a comparison of the 3 ways to interface with libLegacy:
Interfacing with legacy_input()
Using reinterpret_cast:
void input_ub(ModernBar const& s) noexcept {
legacy_input(reinterpret_cast<LegacyBar const*>(&s));
}
Assembly:
input_ub(ModernBar const&):
jmp legacy_input
This is perfect codegen, but it invokes UB.
Using memcpy:
void input_memcpy(ModernBar const& s) noexcept {
LegacyBar ls;
std::memcpy(&ls, &s, sizeof(ls));
legacy_input(&ls);
}
Assembly:
input_memcpy(ModernBar const&):
sub rsp, 24
movdqu xmm0, XMMWORD PTR [rdi]
mov rdi, rsp
movaps XMMWORD PTR [rsp], xmm0
call legacy_input
add rsp, 24
ret
Significantly worse.
Using bit_cast:
void input_bit_cast(ModernBar const& s) noexcept {
LegacyBar ls = std::bit_cast<LegacyBar>(s);
legacy_input(&ls);
}
Assembly:
input_bit_cast(ModernBar const&):
sub rsp, 40
movdqu xmm0, XMMWORD PTR [rdi]
mov rdi, rsp
movaps XMMWORD PTR [rsp+16], xmm0
mov rax, QWORD PTR [rsp+16]
mov QWORD PTR [rsp], rax
mov rax, QWORD PTR [rsp+24]
mov QWORD PTR [rsp+8], rax
call legacy_input
add rsp, 40
ret
And I have no idea what's going on here.
Interfacing with legacy_output()
Using reinterpret_cast:
auto output_ub() noexcept -> ModernBar {
ModernBar s;
legacy_output(reinterpret_cast<LegacyBar*>(&s));
return s;
}
Assembly:
output_ub():
sub rsp, 56
lea rdi, [rsp+16]
call legacy_output
mov rax, QWORD PTR [rsp+16]
mov rdx, QWORD PTR [rsp+24]
add rsp, 56
ret
Using memcpy:
auto output_memcpy() noexcept -> ModernBar {
LegacyBar ls;
legacy_output(&ls);
ModernBar s;
std::memcpy(&s, &ls, sizeof(ls));
return s;
}
Assembly:
output_memcpy():
sub rsp, 56
lea rdi, [rsp+16]
call legacy_output
mov rax, QWORD PTR [rsp+16]
mov rdx, QWORD PTR [rsp+24]
add rsp, 56
ret
Using bit_cast:
auto output_bit_cast() noexcept -> ModernBar {
LegacyBar ls;
legacy_output(&ls);
return std::bit_cast<ModernBar>(ls);
}
Assembly:
output_bit_cast():
sub rsp, 72
lea rdi, [rsp+16]
call legacy_output
movdqa xmm0, XMMWORD PTR [rsp+16]
movaps XMMWORD PTR [rsp+48], xmm0
mov rax, QWORD PTR [rsp+48]
mov QWORD PTR [rsp+32], rax
mov rax, QWORD PTR [rsp+56]
mov QWORD PTR [rsp+40], rax
mov rax, QWORD PTR [rsp+32]
mov rdx, QWORD PTR [rsp+40]
add rsp, 72
ret
Here you can find the entire example on Compiler Explorer.
I also noted that the codegen varies significantly depending on the exact definition of the structs (i.e. order, amount & type of members). But the UB version of the code is consistently better or at least as good as the other two versions.
Now my questions are:
How come the codegen varies so dramatically? It makes me wonder if I'm missing something important.
Is there something I can do to guide the compiler to generate better code without invoking UB?
Are there other standard-conformant ways that generate better code?
In your compiler explorer link, Clang produces the same code for all output cases. I don't know what problem GCC has with std::bit_cast in that situation.
For the input case, the three functions cannot produce the same code, because they have different semantics.
With input_ub, the call to legacy_input may be modifying the caller's object. This cannot be the case in the other two versions. Therefore the compiler cannot optimize away the copies, not knowing how legacy_input behaves.
If you pass by-value to the input functions, then all three versions produce the same code at least with Clang in your compiler explorer link.
To reproduce the code generated by the original input_ub you need to keep passing the address of the caller's object to legacy_input.
If legacy_input is an extern C function, then I don't think the standards specify how the object models of the two languages are supposed to interact in this call. So, for the purpose of the language-lawyer tag, I will assume that legacy_input is an ordinary C++ function.
The problem in passing the address of &s directly is that there is generally no LegacyBar object at the same address that is pointer-interconvertible with the ModernBar object. So if legacy_input tries to access LegacyBar members through the pointer, that would be UB.
Theoretically you could create a LegacyBar object at the required address, reusing the object representation of the ModernBar object. However, since the caller presumably will expect there to still be a ModernBar object after the call, you then need to recreate a ModernBar object in the storage by the same procedure.
Unfortunately though, you are not always allowed to reuse storage in this way. For example if the passed reference refers to a const complete object, that would be UB, and there are other requirements. The problem is also whether the caller's references to the old object will refer to the new object, meaning whether the two ModernBar objects are transparently replaceable. This would also not always be the case.
So in general I don't think you can achieve the same code generation without undefined behavior if you don't put additional constraints on the references passed to the function.
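For comparison, the copy-based approach that stays within defined behavior might be sketched as follows. This is written in C with stand-in types to keep it self-contained: every name ending in Sketch, as well as legacy_sum_stub and input_by_copy, is hypothetical and not part of libLegacy or libModern. The cost is exactly the extra copy discussed above.

```c
#include <stdint.h>
#include <string.h>

/* Stand-ins for the legacy and modern types; same member order and types. */
typedef struct { uint32_t x, y, flags; } LegacyFooSketch;
typedef struct { LegacyFooSketch foo; float a; } LegacyBarSketch;
typedef struct { uint32_t x, y, flags; } ModernFooSketch;
typedef struct { ModernFooSketch foo; float a; } ModernBarSketch;

/* Cheap compile-time check that a byte-wise copy is at least size-correct. */
_Static_assert(sizeof(ModernBarSketch) == sizeof(LegacyBarSketch),
               "modern and legacy layouts must have the same size");

/* Hypothetical stand-in for legacy_input(): only reads through the pointer. */
static uint32_t legacy_sum_stub(const LegacyBarSketch *s)
{
    return s->foo.x + s->foo.y + s->foo.flags;
}

/* Copy into a real legacy object, then hand its address to the legacy code.
   The callee can never modify the caller's modern object. */
uint32_t input_by_copy(const ModernBarSketch *s)
{
    LegacyBarSketch ls;
    memcpy(&ls, s, sizeof ls);
    return legacy_sum_stub(&ls);
}
```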
Most non-MSVC compilers support an attribute called __may_alias__ that you can use
struct ModernFoo {
std::uint32_t x;
std::uint32_t y;
ModernFlags flags;
// More data
} __attribute__((__may_alias__));
struct ModernBar {
ModernFoo foo;
float a;
// more data
} __attribute__((__may_alias__));
Of course some optimizations can't be done when aliasing is allowed, so use it only if performance is acceptable
Godbolt link
Programs which would ever have any reason to access storage as multiple types should be processed using -fno-strict-aliasing or equivalent on any compiler that doesn't limit type-based aliasing assumptions around places where a pointer or lvalue of one type is converted to another, even if the program uses only corner-case behaviors mandated by the Standard. Using such a compiler flag will guarantee that one won't have type-based-aliasing problems, while jumping through hoops to use only standard-mandated corner cases won't. Both clang and gcc are sometimes prone to both:
have one phase of optimization change code whose behavior would be mandated by the Standard into code whose behavior isn't mandated by the Standard but would be equivalent in the absence of further optimization, but then
have a later phase of optimization further transform the code in a manner that would have been allowable for the version of the code produced by #1 but not for the code as it was originally written.
If using -fno-strict-aliasing on straightforwardly-written source code yields machine code whose performance is acceptable, that's a safer approach than trying to jump through hoops to satisfy constraints that the Standard allows compilers to impose in cases where doing so would allow them to be more useful [or--for poor quality compilers--in cases where doing so would make them less useful].
You could create a union with a private member to restrict access to the legacy representation:
union UnionBar {
struct {
ModernFoo foo;
float a;
};
private:
LegacyBar legacy;
friend LegacyBar const* to_legacy_const(UnionBar const& s) noexcept;
friend LegacyBar* to_legacy(UnionBar& s) noexcept;
};
LegacyBar const* to_legacy_const(UnionBar const& s) noexcept {
return &s.legacy;
}
LegacyBar* to_legacy(UnionBar& s) noexcept {
return &s.legacy;
}
void input_union(UnionBar const& s) noexcept {
legacy_input(to_legacy_const(s));
}
auto output_union() noexcept -> UnionBar {
UnionBar s;
legacy_output(to_legacy(s));
return s;
}
The input/output functions are compiled to the same code as the reinterpret_cast-versions (using gcc/clang):
input_union(UnionBar const&):
jmp legacy_input
and
output_union():
sub rsp, 56
lea rdi, [rsp+16]
call legacy_output
mov rax, QWORD PTR [rsp+16]
mov rdx, QWORD PTR [rsp+24]
add rsp, 56
ret
Note that this uses anonymous structs and requires you to include the legacy implementation, which you mentioned you do not want. Also, I'm missing the experience to be fully confident that there's no hidden UB, so it would be great if someone else would comment on that :)

How get EIP from x86 inline assembly by gcc

I want to get the value of EIP from the following code, but the compilation does not pass
Command :
gcc -o xxx x86_inline_asm.c -m32 && ./xxx
File content of x86_inline_asm.c:
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
int main()
{
unsigned int eip_val;
__asm__("mov %0,%%eip":"=r"(eip_val));
return 0;
}
How to use the inline assembly to get the value of EIP, and it can be compiled successfully under x86.
How to modify the code and use the command to complete it?
This sounds unlikely to be useful (vs. just taking the address of the whole function like void *tmp = main), but it is possible.
Just get a label address, or use . (the address of the current line), and let the linker worry about getting the right immediate into the machine code. So you're not architecturally reading EIP, just reading the value it currently has from an immediate.
asm volatile("mov $., %0" : "=r"(address_of_mov_instruction) );
AT&T syntax is mov src, dst, so what you wrote would be a jump if it assembled.
Architecturally, EIP = the end of an instruction while it's executing, so arguably you should do this instead:
asm volatile(
"mov $1f, %0 \n\t" // reference label 1 forward
"1:" // GAS local label
: "=r"(address_after_mov)
);
I'm using asm volatile in case this asm statement gets duplicated multiple times inside the same function by inlining or something. If you want each case to get a different address, it has to be volatile. Otherwise the compiler can assume that all instances of this asm statement produce the same output. Normally that will be fine.
Architecturally, in 32-bit mode you don't have RIP-relative addressing for LEA, so the only good way to actually read EIP is call / pop (see the related question "Reading program counter directly"). EIP is not a general-purpose register, so you can't just use it as the source or destination of a mov or any other instruction.
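A call / pop sequence in GNU C inline asm might look like the sketch below. read_pc is a hypothetical helper name; the x86-64 branch uses a RIP-relative lea instead, and the last branch is only a stand-in so the code still compiles on other targets.

```c
#include <stdint.h>

/* Hypothetical helper: returns an address inside (or, in the fallback,
   at the start of) this function. */
uintptr_t read_pc(void)
{
    uintptr_t pc;
#if defined(__i386__)
    /* call pushes its return address, which is the address of label 1
       (the pop instruction); pop then moves it into the output register. */
    __asm__ volatile ("call 1f\n"
                      "1:\tpop %0"
                      : "=r"(pc));
#elif defined(__x86_64__)
    /* RIP-relative LEA: yields the address of the instruction after the lea. */
    __asm__ volatile ("lea 0(%%rip), %0" : "=r"(pc));
#else
    /* Not x86: use the function's own address as a placeholder. */
    pc = (uintptr_t)&read_pc;
#endif
    return pc;
}
```

The push done by call is safe here because the i386 System V ABI has no red zone; in 64-bit code the call / pop form would write below RSP, so the lea form is used there.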
But really you don't need inline asm for this at all.
Is it possible to store the address of a label in a variable and use goto to jump to it? shows how to use the GNU C extension where &&label takes its address.
int foo;
void *addr_inside_function() {
foo++;
lab1: ; // labels only go on statements, not declarations
void *tmp = &&lab1;
foo++;
return tmp;
}
There's nothing you can safely do with this address outside the function; I returned it just as an example to make the compiler put a label in the asm and see what happens. Without a goto to that label, it can still optimize the function pretty aggressively, but you might find it useful as an input for an asm goto(...) somewhere else in the function.
But anyway, it compiles on Godbolt to this asm
# gcc -O3 -m32
addr_inside_function:
.L2:
addl $2, foo
movl $.L2, %eax
ret
#clang -O3 -m32
addr_inside_function:
movl foo, %eax
leal 1(%eax), %ecx
movl %ecx, foo
.Ltmp0: # Block address taken
addl $2, %eax
movl %eax, foo
movl $.Ltmp0, %eax # retval = label address
retl
So clang loads the global, computes foo+1 and stores it, then after the label computes foo+2 and stores that. (Instead of loading twice). So you still can't usefully jump to the label from anywhere, because it depends on having foo's old value in eax, and on the desired behaviour being to store foo+2.
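Where label addresses are useful is inside the function that defines them. A minimal sketch using the same GNU C extension together with computed goto (branch_on_sign is a hypothetical name):

```c
/* GNU C "labels as values": the label addresses never leave the function
   that defines them, so the compiler knows every possible jump target. */
int branch_on_sign(int x)
{
    static void *const targets[] = { &&negative, &&non_negative };

    goto *targets[x >= 0];      /* index 0 for x < 0, index 1 otherwise */

negative:
    return -1;
non_negative:
    return 1;
}
```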
I don't know gcc inline assembly syntax for this, but for masm:
call next0
next0: pop eax ;eax = eip for this line
In the case of Masm, $ represents the current location, and since call is a 5 byte instruction, an alternative syntax without a label would be:
call $+5
pop eax

Cannot modify data segment register. When tried General Protection Error is thrown

I have been trying to create an ISR handler following this tutorial by James Molloy but I got stuck. Whenever I throw a software interrupt, the general purpose registers and the data segment register are pushed onto the stack along with the values automatically pushed by the CPU. Then the data segment is changed to the value of 0x10 (Kernel Data Segment Descriptor) so the privilege levels are changed. Then after the handler returns those values are popped. But whenever the value in ds is changed a GPE is thrown with the error code 0x2544 and after a few seconds the VM restarts. (linker and compiler i386-elf-gcc, assembler nasm)
I tried placing hlt instructions in between instructions to locate which instruction was throwing the GPE. After that I was able to find out that the culprit was the `mov ds,ax' instruction. I tried various things, from removing the stack which was initialized by the bootstrap code to deleting the privilege changing parts of the code. The only way I can return from the common stub is to remove the parts of my code which change the privilege levels, but as I want to move towards user mode I still want them to stay.
Here is my common stub:
isr_common_stub:
pusha ; Pushes edi,esi,ebp,esp,ebx,edx,ecx,eax
xor eax,eax
mov ax, ds ; Lower 16-bits of eax = ds.
push eax ; save the data segment descriptor
mov ax, 0x10 ; load the kernel data segment descriptor
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax
call isr_handler
xor eax,eax
pop eax
mov ds, ax ; This is the instruction everything fails;
mov es, ax
mov fs, ax
mov gs, ax
popa
iret
My ISR handler macros:
extern isr_handler
%macro ISR_NOERRCODE 1
global isr%1 ; %1 accesses the first parameter.
isr%1:
cli
push byte 0
push %1
jmp isr_common_stub
%endmacro
%macro ISR_ERRCODE 1
global isr%1
isr%1:
cli
push byte %1
jmp isr_common_stub
%endmacro
ISR_NOERRCODE 0
ISR_NOERRCODE 1
ISR_NOERRCODE 2
ISR_NOERRCODE 3
...
My C handler which results in "Received interrupt: 0xD err. code 0x2544"
#include <stdio.h>
#include <isr.h>
#include <tty.h>
void isr_handler(registers_t regs) {
printf("ds: %x \n" ,regs.ds);
printf("Received interrupt: %x with err. code: %x \n", regs.int_no, regs.err_code);
}
And my main function:
void kmain(struct multiboot *mboot_ptr) {
descinit(); // Sets up IDT and GDT
ttyinit(TTY0); // Sets up the VGA Framebuffer
asm volatile ("int $0x1"); // Triggers a software interrupt
printf("Wow"); // After that its supposed to print this
}
As you can see the code was supposed to output,
ds: 0x10
Received interrupt: 0x1 with err. code: 0
but results in,
...
ds: 0x10
Received interrupt: 0xD with err. code: 0x2544
ds: 0x10
Received interrupt: 0xD with err. code: 0x2544
...
Which goes on until the VM restarts itself.
What am I doing wrong?
The code isn't complete but I'm going to guess what you are seeing is a result of a well known bug in James Molloy's OSDev tutorial. The OSDev community has compiled a list of known bugs in an errata list. I recommend reviewing and fixing all the bugs mentioned there. Specifically in this case I believe the bug that is causing problems is this one:
Problem: Interrupt handlers corrupt interrupted state
This article previously told you to know the ABI. If you do you will
see a huge problem in the interrupt.s suggested by the tutorial: It
breaks the ABI for structure passing! It creates an instance of the
struct registers on the stack and then passes it by value to the
isr_handler function and then assumes the structure is intact
afterwards. However, the function parameters on the stack belongs to
the function and it is allowed to trash these values as it sees fit
(if you need to know whether the compiler actually does this, you are
thinking the wrong way, but it actually does). There are two ways
around this. The most practical method is to pass the structure as a
pointer instead, which allows you to explicitly edit the register
state when needed - very useful for system calls, without having the
compiler randomly doing it for you. The compiler can still edit the
pointer on the stack when it's not specifically needed. The second
option is to make another copy the structure and pass that
The problem is that the 32-bit System V ABI doesn't guarantee that data passed by value will be unmodified on the stack! The compiler is free to reuse that memory for whatever purposes it chooses. The compiler probably generated code that trashed the area on the stack where DS is stored. When DS was set with the bogus value it crashed. What you should be doing is passing by reference rather than value. I'd recommend these code changes in the assembly code:
irq_common_stub:
pusha
mov ax, ds
push eax
mov ax, 0x10 ;0x10
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax
push esp ; At this point ESP is a pointer to where GS (and the rest
; of the interrupt handler state resides)
; Push ESP as 1st parameter as it's a
; pointer to a registers_t
call irq_handler
pop ebx ; Remove the saved ESP on the stack. Efficient to just pop it
; into any register. You could have done: add esp, 4 as well
pop ebx
mov ds, bx
mov es, bx
mov fs, bx
mov gs, bx
popa
add esp, 8
sti
iret
And then modify irq_handler to use registers_t *regs instead of registers_t regs :
void irq_handler(registers_t *regs) {
if (regs->int_no >= 40) port_byte_out(0xA0, 0x20);
port_byte_out(0x20, 0x20);
if (interrupt_handlers[regs->int_no] != 0) {
interrupt_handlers[regs->int_no](*regs);
}
else
{
klog("ISR: Unhandled IRQ%u!\n", regs->int_no);
}
}
I'd actually recommend that each interrupt handler take a pointer to registers_t to avoid unnecessary copying. If your interrupt handlers and the interrupt_handlers array used functions that took registers_t * as the parameter (instead of registers_t), then you'd modify the code:
interrupt_handlers[regs->int_no](*regs);
to be:
interrupt_handlers[regs->int_no](regs);
Important: You have to make these same type of changes for your ISR handlers as well. Both the IRQ and ISR handlers and associated code have this same problem.
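To make the pointer-based convention concrete, here is a small sketch that can be compiled and tested on a host machine. The struct layout is an assumption derived from the push order in the stub shown earlier; registers_sketch_t and isr_handler_sketch are hypothetical names, and the "system call" is invented purely for illustration.

```c
#include <stdint.h>

/* Assumed layout, lowest address first, matching the stub's push order:
   saved DS, then the pusha block, then the two words pushed by the ISR
   macros, then what the CPU pushed (useresp/ss only on a privilege change). */
typedef struct registers_sketch {
    uint32_t ds;
    uint32_t edi, esi, ebp, esp, ebx, edx, ecx, eax;
    uint32_t int_no, err_code;
    uint32_t eip, cs, eflags, useresp, ss;
} registers_sketch_t;

/* Hypothetical handler: because it receives a pointer, any change it makes
   is an explicit edit of the saved state, and nothing else on the stub's
   stack frame belongs to the callee. */
void isr_handler_sketch(registers_sketch_t *regs)
{
    if (regs->int_no == 0x80) {
        /* Invented system call: return ebx + ecx in eax. */
        regs->eax = regs->ebx + regs->ecx;
    }
}
```

With this signature the saved ds that the stub pops after the call is only changed if the handler deliberately writes to regs->ds.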

GCC Jump Table initialization code generating movsxd and add?

When I compile a switch statement with optimization in GCC, it sets up a jump table like this,
(fcn) sym.foo 148
sym.foo (unsigned int arg1);
; arg unsigned int arg1 # rdi
0x000006e0 83ff06 cmp edi, 6 ; arg1
0x000006e3 0f87a7000000 ja case.default.0x790
0x000006e9 488d156c0100. lea rdx, [0x0000085c]
0x000006f0 89ff mov edi, edi
0x000006f2 4883ec08 sub rsp, 8
0x000006f6 486304ba movsxd rax, dword [rdx + rdi*4]
0x000006fa 4801d0 add rax, rdx ; '('
;-- switch.0x000006fd:
0x000006fd ffe0 jmp rax ; switch table (7 cases) at 0x85c
Is the MOVSXD and ADD the best way to do that,
movsxd rax, dword [rdx + rdi*4]
add rax, rdx
Isn't that the same as using LEA with displacement
lea rax, [rdx + rdi*4 + rdx]
It occurs to me that I probably don't understand what's going on here. RDX seems to be the start of the jump table. RDI is the incoming argument to the switch statement. Why are we adding RDX twice though?
This is the switch statement I was compiling with -O3,
int foo (int x) {
switch(x) {
//case 0: puts("\nzero"); break;
case 1: puts("\none"); break;
case 2: puts("\ntwo"); break;
case 3: puts("\nthree"); break;
case 4: puts("\nfour"); break;
case 5: puts("\nfive"); break;
case 6: puts("\nsix"); break;
}
return 0;
}
GCC is using relative displacements in its jump table (relative to the base of the table), instead of absolute addresses. So the jump table itself is position-independent, and doesn't need fixups when it's relocated, e.g. as part of loading a PIE executable or a PIC shared library.
If you compile with -fno-pie -no-pie, gcc might choose to use a table of jump targets with jmp [table + rdi*8]
Targets like x86-64 Linux do support runtime data fixups, so a simple jump table would be possible. But some targets don't support fixups at all, which is why gcc -fPIC / -fpie avoids it entirely. This potential optimization is gcc bug 84011. See discussion there for more.
It's unfortunate gcc is using a jump table instead of realizing that the only difference between each case is the data, not code. So really it just needs a table lookup of string pointers. (Which could be done with relative displacements if it wanted to.)
That's a separate missed optimization, which I reported as bug 85585. (That reminds me, I have a followup to that half-written which I should finish and post.)
Is the MOVSXD and ADD the best way to do that,
It could be done with just an add with a qword memory operand. Of course the downside is that it makes the table twice as big.
Isn't that the same as using LEA with displacement
No, lea does not access memory.
Why are we adding RDX twice though?
The first time it is used as the base of the table to index into it. The table holds addresses relative to itself, so adding RDX to the value from the table creates an absolute address.
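The same "load a 32-bit offset, then add the base" pattern can be written out in C with data instead of code addresses. This is only an analogy, not what GCC emits: names, rel and name_for are hypothetical, and the offsets are relative to the start of the string blob rather than to the table itself.

```c
#include <stddef.h>
#include <stdint.h>

/* One blob of NUL-terminated strings. */
static const char names[] = "zero\0" "one\0" "two\0" "three";

/* 32-bit offsets relative to the start of names; no absolute addresses,
   so this table needs no fixups when the image is relocated. */
static const int32_t rel[] = { 0, 5, 9, 13 };

const char *name_for(unsigned i)
{
    if (i >= sizeof rel / sizeof rel[0])
        return NULL;            /* like the cmp / ja to the default case */
    /* like movsxd rax, dword [rdx + rdi*4] followed by add rax, rdx */
    return names + rel[i];
}
```

A table of const char * would be twice as large on x86-64 and would need a runtime relocation per entry in position-independent code.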
By the way this could easily be improved:
mov edi, edi ; truncate rdi to 32bit
A self-mov cannot be mov-eliminated on current architectures, so it would be better to mov to some other register.

Assembly inline AT&T Type mismatch

I'm learning assembly and I found nothing that helps me do this. Is it even possible? I can't make this work.
I want this code to take the "b" value, put it in %eax and then move the content of %eax in my output and print that ASCII character, "0" in this case.
char a;
int b=48;
__asm__ (
//Here's the "Error: operand type mismatch for `mov'
"movl %0, %%eax;"
"movl %%eax, %1;"
:"=r"(a)
:"r" (b)
:"%eax"
);
printf("%c\n",a);
The instruction responsible for the error is this one:
movl %0, %%eax
So, in order to figure out why it's causing an error, we need to understand what it says. It's a 32-bit MOV instruction (the l suffix in AT&T syntax means "long", aka DWORD). The destination operand is the 32-bit EAX register. The source operand is the first input/output operand, a. In other words, this:
"=r"(a)
which says that char a; is to be used as an output-only register.
As such, what the inline assembler wants to do is to generate code like the following:
movl %dl, %eax
(assuming, for the sake of argument that a is allocated in the dl register, but it could just as easily have been allocated in any of the 8-bit registers). The problem is, that code is invalid because there is an operand size mismatch. The source operand and destination operand are different sizes: one is 32 bits while the other is 8 bits. This cannot work.
A workaround is the movzx/movsx instructions (introduced with the 80386) which move an 8 (or 16) bit source operand into a 32-bit destination operand, either with zero extension or sign extension, respectively. In AT&T syntax, the form that moves an 8-bit source into a 32-bit destination would be movzbl (for zero extension, used with unsigned values) or movsbl (for sign extension, used with signed values).
But wait—this is the wrong workaround. Your code is invalid for another reason: a is uninitialized! And not only is a uninitialized, but you've told the inline assembler via the output constraints it is an output-only operand (the = sign)! So you can't read from it—you can only store into it.
You have your operand notation backwards. What you really wanted was something like the following:
__asm__(
"movl %1, %%eax;"
"movl %%eax, %0;"
: "=r"(a)
: "r" (b)
: "%eax"
);
Of course, that's still going to give you an operand size mismatch, but it's now on the second assembly instruction. What this is telling the inline assembler to emit is the following code:
movl $48, %edx
movl %edx, %eax
movl %eax, %dl
which is invalid because a 32-bit source (%eax) cannot be moved into an 8-bit destination (%dl). And you can't fix this with movzx/movsx, because that is used to extend, not truncate. The way to write this would be the following:
movl $48, %edx
movl %edx, %eax
movb %al, %dl
where the last instruction is an 8-bit move, from an 8-bit source register to an 8-bit destination register.
In inline assembly, this would be written as:
__asm__(
"movl %1, %%eax;"
"movb %%al, %0;"
: "=r"(a)
: "r" (b)
: "%eax"
);
However, this is not the correct way to use inline assembly. You've manually hard-coded the EAX register inside of the inline assembly block, which means that you had to clobber it. The problem with this is that it ties the compiler's hands behind its back when it comes to register allocation. What you're supposed to do is put everything that goes into and out of the inline assembly block in the input and output operands. This lets the compiler handle all register allocation in the most optimal way possible. The code should look as follows:
char a;
int b = 48;
int temp;
__asm__(
"movl %2, %0\n\t"
"movb %b0, %1"
: "=r"(temp),
"=r"(a)
: "r" (b)
:
);
A lot of changes happened here:
I introduced another temporary variable (appropriately named temp) and added it to the output-only operands list. This causes the compiler to allocate a register for it automatically, which we then use inside of the asm block.
Now that we're letting the compiler do the register allocation, we don't need a clobber list, so that's left empty.
The b modifier is needed on the source operand for the movb instruction to ensure that the byte-sized portion of that register is used, rather than the entire 32-bit register.
Instead of using semicolons at the end of each asm instruction, I used \n\t (except on the last one). This is what is recommended for use in inline assembly blocks, and it gets you nicer assembly output listings because it matches what the compiler does internally.
Even better would be to introduce symbolic names for the operands, making the code more readable:
char a;
int b = 48;
int temp;
__asm__(
"movl %[input], %[temp]\n\t"
"movb %b[temp], %[dest]"
: [temp] "=r"(temp),
[dest] "=r"(a)
: [input] "r" (b)
:
);
And, at this point, if you hadn't noticed already, you'd see that this code is enormously silly. You don't need all those temporaries and register-register shuffling. You can just do:
movl $48, %eax
and the value 48 is already in al, since al is the low 8 bits of the 32-bit register eax.
Or, you can do:
movb $48, %al
which is just an 8-bit move of the value 48 explicitly into the 8-bit register al.
But, in fact, if you're calling printf, the argument must be passed as an int (not a char, since it's a variadic function), so you definitely want:
movl $48, %eax
When you start using inline assembly, the compiler can't easily optimize through it, so you get inefficient code. All you really needed was:
int a = 48;
printf("%c\n",a);
Which produces the following assembly code:
pushl $48
pushl $AddressOfFormatString
call printf
addl $8, %esp
or, equivalently:
movl $48, %eax
pushl %eax
pushl $AddressOfFormatString
call printf
addl $8, %esp
Now, I imagine you're saying to yourself something like: "Yes, but if I do that, then I'm not using inline assembly!" To which my response is: exactly. You don't need inline assembly here, and in fact, you should not be using it, because it just causes problems. It's more difficult to write and leads to inefficient code generation.
If you want to learn assembly language programming, get an assembler and use that—not a C compiler's inline assembler. NASM is a popular and excellent choice, as is YASM. If you want to stick with using the Gnu assembler so you can stick with this tortuous AT&T syntax, then run as.
Since a is defined as a character (char a;), :"=r"(a) will assign an 8-bit register. The 32-bit register EAX cannot be loaded from an 8-bit register with movl: movl %dl, %eax (movl %0, %%eax) will cause this error. There are the sign extend and zero extend instructions movsx and movzx (Intel syntax), in AT&T syntax movs... and movz..., for this purpose.
Change
movl %0, %%eax;
to
movzbl %0, %%eax;
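For reference, a widening move with the byte as an input operand and the 32-bit value as the output might look like the sketch below. widen_byte is a hypothetical name; the "q" constraint asks for a register that has an 8-bit name, which matters in 32-bit code.

```c
/* Zero-extend an 8-bit value to 32 bits with movzbl (x86 only). */
int widen_byte(unsigned char c)
{
    int out;
#if defined(__i386__) || defined(__x86_64__)
    __asm__ ("movzbl %1, %0"    /* 8-bit source, 32-bit destination */
             : "=r"(out)        /* %0: output, any 32-bit register */
             : "q"(c));         /* %1: input in a byte-addressable register */
#else
    out = c;                    /* plain C fallback on other targets */
#endif
    return out;
}
```

No hard-coded register and no clobber list are needed, and the compiler would generate the same zero extension from a plain int out = c;.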

Resources