Branching to a C symbol from Thumb inline assembly - gcc

I'm on a Cortex-M0+ device (Thumb only) and I'm trying to dynamically generate some code in ram and then jump to it, like so:
uint16_t code_buf[18];
...
void jump() {
register volatile uint32_t* PASET asm("r0") = &(PA->OUTSET.reg);
register volatile uint32_t* PACLR asm("r1") = &(PA->OUTCLR.reg);
register uint32_t set asm("r2") = startset;
register uint32_t cl0 asm("r3") = clears[0];
register uint32_t cl1 asm("r4") = clears[1];
register uint32_t cl2 asm("r5") = clears[2];
register uint32_t cl3 asm("r6") = clears[3];
register uint32_t dl0 asm("r8") = delays[0];
register uint32_t dl1 asm("r9") = delays[1];
register uint32_t dl2 asm("r10") = delays[2];
register uint32_t dl3 asm("r11") = delays[3];
asm volatile (
"bl code_buf\n"
: [set]"+r" (set) : [PASET]"r" (PASET), [PACLR]"r" (PACLR), [cl0]"r" (cl0), [cl1]"r" (cl1), [cl2]"r" (cl2), [cl3]"r" (cl3), [dl0]"r" (dl0), [dl1]"r" (dl1), [dl2]"r" (dl2), [dl3]"r" (dl3) : "lr"
);
}
The code in code_buf will use the arguments passed via registers (that's why I'm forcing specific registers).
This code compiles fine, but when I look at the disassembly the branch instruction has been changed to
a14: f004 ebb0 blx 0x5178
Which would try to switch the cpu to ARM mode and cause a HardFault. Is there a way to force the assembler to keep the branch as a simple bl?

So it turns out that the toolchain I was using (gcc 4.8) is buggy and makes two errors: it interprets code_buf as an ARM address, and it produces a bogus blx label, which isn't even legal on a Cortex-M0+. I updated to 6.3.1 and the inline asm was assembled as bl label, as it was supposed to be.

From section 4.1.1 of the ARMv6-M Architecture Reference Manual:
Thumb interworking is held as bit [0] of an interworking address.
Interworking addresses are used in the following instructions: BX,
BLX, or POP that loads the PC.
ARMv6-M only supports the Thumb
instruction Execution state, therefore the value of address bit [0]
must be 1 in interworking instructions, otherwise a fault occurs. All
instructions ignore bit [0] and write bits [31:1]:’0’ when updating
the PC.
The target of your branch, code_buf, will be word-aligned (possibly double-word aligned) so bit 0 will be clear in its address. The key is to ensure that bit 0 is set before you branch, and then even if the toolchain selects an interworking instruction you'll remain in thumb mode.
I don't have a development environment in front of me to test this, but I would suggest casting to a pointer-to-single-byte type and using pointer arithmetic to set bit 0:
uint8_t *thumb_target = ((uint8_t *)code_buf) + 1;
asm volatile (
"bl thumb_target\n"
: [set]"+r" (set) : [PASET]"r" (PASET), [PACLR]"r" (PACLR), [cl0]"r" (cl0), [cl1]"r" (cl1), [cl2]"r" (cl2), [cl3]"r" (cl3), [dl0]"r" (dl0), [dl1]"r" (dl1), [dl2]"r" (dl2), [dl3]"r" (dl3) : "lr"
);
Edit: The above doesn't work, as Peter Cordes points out, because a local variable can't be used in inline ASM in this context. Not being well-versed in gcc's inline ASM, I won't attempt to fix it.
I have now had a chance to test the supplied code though, and gcc 7.2.1 with -S -mtune=cortex-m0plus -fomit-frame-pointer generates a BL not a BLX.
Edit 2: The documentation (section A6.7.14) suggests that only the register-target version of BLX is present in the ARMv6-M architecture (this is in common with the ARMv7 devices I'm most familiar with) and so it looks to me as if the fault is caused not by an attempt to switch to ARM mode but by an illegal instruction. Is your compiler correctly configured?
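The bit-0 rule quoted from the manual can be sketched in plain C. This is a minimal sketch, not tested on hardware: thumb_entry and is_thumb_entry are hypothetical helper names, and the indirect call mentioned in the comment is an assumption about how one might reach the buffer without naming the symbol in a bl instruction.

```c
#include <assert.h>
#include <stdint.h>

uint16_t code_buf[18];  /* same buffer as in the question */

/* Hypothetical helper: turn the address of Thumb code into an
 * interworking address by setting bit 0. */
static uintptr_t thumb_entry(const void *code)
{
    uintptr_t addr = (uintptr_t)code;
    assert((addr & 1u) == 0);  /* Thumb instructions are halfword-aligned */
    return addr | 1u;
}

/* Hypothetical helper: does this interworking address select Thumb state? */
static int is_thumb_entry(uintptr_t addr)
{
    return (addr & 1u) != 0;
}

/* Assumed usage on the target (not exercised here): an indirect call such as
 *     ((void (*)(void))thumb_entry(code_buf))();
 * goes through a register, so the address in that register must have bit 0
 * set. Note that a plain C call like this does not by itself keep the
 * forced r0-r11 register arguments from the question. */
```

On a host machine the helpers only demonstrate the address arithmetic; whether the indirect call fits the register-passing scheme in the question is left open.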

IDK why your assembler would be changing bl into blx. Mine doesn't, using arm-none-eabi-gcc 7.3.0 on Arch Linux. arm-none-eabi-as --version shows Binutils 2.30.
unsigned short code_buf[18];
void jump() {
asm("bl code_buf");
asm("blx code_buf"); // still assembles to BL, not BLX
// asm("blx jump");
// asm("bl jump");
}
compiled with arm-none-eabi-gcc -O2 -nostdlib arm-bl.c -mcpu=cortex-m0plus -mthumb (I made a linked executable with -nostdlib so I could see actual branch displacements, not placeholders).
Disassembling with arm-none-eabi-objdump -d a.out shows
00008000 <jump>:
8000: f010 f804 bl 1800c <__data_start>
8004: f010 f802 bl 1800c <__data_start>
8008: 4770 bx lr
800a: 46c0 nop ; (mov r8, r8)
Your f004 ebb0 may be a Thumb2 encoding for BLX. I don't know why you're getting it.
The Thumb encoding for bl is documented in section 5.19 of this ARM7TDMI ISA manual ("long branch with link"), but that manual doesn't mention a Thumb encoding for blx at all (because it's only Thumb, not Thumb 2). The Thumb bl encoding stores the branch displacement right-shifted by 1 (i.e. without the low bit), and always stays in Thumb mode.
It's actually two separate instructions: one which adds the high 11 bits of the displacement (shifted into place) to the PC and puts the result in LR, and another which branches and updates LR to the return address. (This 2-instruction hack allows Thumb1 to work without Thumb2 32-bit instructions). Both instructions start with f, so your disassembly shows that you got something else; the first 16-bit chunk of f004 ebb0 is the LR setup, but ebb0 doesn't match any Thumb 1 instruction.
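To make the two-halfword scheme concrete, here is a small decoder sketch in C. It follows the ARM7TDMI "long branch with link" format (an 11-bit offset field in each halfword) and additionally recognises the 11101 suffix that later architectures use for the immediate form of BLX. The names bl_decode_t and decode_thumb_bl are made up for illustration, and the sketch ignores the extra J1/J2 bits that Thumb-2 defines, so it only covers displacements that fit the classic encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Result of decoding a two-halfword Thumb branch-with-link pair. */
typedef struct {
    bool valid;       /* false if the halfwords are not a BL/BLX pair */
    bool is_blx;      /* true for the BLX (switch-to-ARM) suffix */
    uint32_t target;  /* branch target address */
} bl_decode_t;

/* Hypothetical helper: decode the pair located at address addr. */
static bl_decode_t decode_thumb_bl(uint32_t addr, uint16_t first, uint16_t second)
{
    bl_decode_t r = { false, false, 0 };
    unsigned suffix = second >> 11;

    assert((addr & 1u) == 0);               /* instructions are halfword-aligned */
    if ((first >> 11) != 0x1Eu)             /* 11110: prefix, sets up LR */
        return r;
    if (suffix != 0x1Fu && suffix != 0x1Du) /* 11111: BL, 11101: BLX */
        return r;

    int32_t hi = first & 0x7FF;             /* upper offset field, signed */
    if (hi & 0x400)
        hi -= 0x800;

    /* Prefix: LR = PC + (hi << 12), where PC reads as addr + 4. */
    uint32_t lr = addr + 4u + (uint32_t)hi * 4096u;
    /* Suffix: target = LR + (lo << 1). */
    r.target = lr + ((uint32_t)(second & 0x7FFu) << 1);
    r.is_blx = (suffix == 0x1Du);
    if (r.is_blx)
        r.target &= ~(uint32_t)3;           /* ARM-state targets are word-aligned */
    r.valid = true;
    return r;
}
```

Feeding it the halfwords from the objdump listing above (f010 f804 at 0x8000) gives 0x1800c, and the f004 ebb0 pair from the question (at 0xa14) decodes as a BLX to 0x5178, matching the disassembly shown there.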
Possibly asm("bl code_buf+1" : ...); or blx code_buf+1 could work, if the +1 convinces the assembler to treat it as a Thumb target. But you might need to use asm to get a .thumb_func directive applied to code_buf somehow to keep your assembler happy.

Related

QEMU for AArch64: why does execution get stuck at "ldr q1, [x0]"?

I have this simple C code:
#include "uart.h"
#include <string.h>
char x[32];
__attribute__((noinline))
void foo(void)
{
strcpy(x, "xxxxxxxxxxxxxxxxxxxxxxxx");
}
int main(void)
{
uart_puts("xxx\n");
foo();
uart_puts("yyy\n");
}
compiled as:
$ aarch64-none-elf-gcc t78.c -mcpu=cortex-a57 -Wall -Wextra -g -O2 -c -std=c11 \
&& aarch64-none-elf-ld -T linker.ld t78.o boot.o uart.o -o kernel.elf
and executed as:
$ qemu-system-aarch64.exe -machine virt -cpu cortex-a57 -nographic -kernel kernel.elf
prints:
xxx
Why is yyy not printed?
By reducing the issue I've found that:
for strcpy, GCC generated inline code rather than a call to strcpy (see below);
ldr q1, [x0] causes yyy to not be printed.
Here is the generated code of foo:
foo:
.LFB0:
.file 1 "t78.c"
.loc 1 6 1 view -0
.cfi_startproc
.loc 1 7 5 view .LVU1
adrp x0, .LC0
add x0, x0, :lo12:.LC0
adrp x1, .LANCHOR0
add x2, x1, :lo12:.LANCHOR0
ldr q1, [x0] <<== root cause
ldr q0, [x0, 9]
str q1, [x1, #:lo12:.LANCHOR0]
str q0, [x2, 9]
.loc 1 8 1 is_stmt 0 view .LVU2
ret
If I put ret before ldr q1, [x0], yyy is printed (as expected).
The question: why does ldr q1, [x0] cause yyy not to be printed?
Tool versions:
$ aarch64-none-elf-gcc --version
aarch64-none-elf-gcc.exe (Arm GNU Toolchain 12.2.Rel1 (Build arm-12.24)) 12.2.1 20221205
$ qemu-system-aarch64 --version
QEMU emulator version 7.2.0 (v7.2.0-11948-ge6523b71fc-dirty)
The ldr q1, [x0] instruction is taking an exception because it accesses a floating-point/SIMD register but your startup code does not enable the FPU. The compiler is assuming that it can generate code that uses the FPU, so to meet that assumption one of the things your startup code must do is enable the FPU, via at least CPACR_EL1, and possibly other registers if EL2 or EL3 are enabled.
Alternatively, you could tell the compiler not to emit code that uses the FPU. The Linux kernel takes this approach, using the -mgeneral-regs-only option.
Real hardware probably has more strict requirements for what you need to do to configure the CPU to be able to run C code; QEMU is quite lenient. For instance the architecture defines that the reset value of many system registers is UNKNOWN, though QEMU usually resets them to zero. A robust startup sequence will explicitly set bits in registers like SCTLR_EL1.
You may also need to watch out for whether your compiler and your startup code agree about whether the compiler generated code is allowed to emit unaligned accesses -- if the MMU is not enabled then all memory accesses are treated as of type Device, which means they must be aligned (regardless of SCTLR_EL1.A). So you either need to make sure your compiler doesn't try to emit unaligned loads and stores, or else turn on the MMU and set SCTLR_EL1.A to 0.
You could improve your ability to debug this sort of "exception in early bootup" by installing some exception vectors which do something helpful when an unexpected exception occurs. The ideal is to be able to print registers, especially ELR_EL1 and ESR_EL1, which tell you where and why the exception occurred; printing in early bootup can be tricky, though. An easy compromise is to at least catch the exception and loop; you can then use gdb to see what the CPU state is.
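As a rough illustration of what an exception handler could do with ESR_EL1 once it has been read, the sketch below extracts the exception class (EC) field in C. The helper names are hypothetical and the sample values are illustrative only; the handler itself, and the mrs instruction that reads the register, are not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Exception class for a trapped SIMD/floating-point access. */
#define ESR_EC_FP_SIMD_TRAP 0x07u

/* Extract the EC (exception class) field, bits [31:26] of ESR_EL1. */
static unsigned esr_ec(uint64_t esr)
{
    return (unsigned)((esr >> 26) & 0x3Fu);
}

/* True if the exception was a trapped FP/SIMD access, such as
 * ldr q1, [x0] executed while the FPU is still disabled. */
static int esr_is_fp_trap(uint64_t esr)
{
    return esr_ec(esr) == ESR_EC_FP_SIMD_TRAP;
}

/* Illustrative values only: the first has EC = 0x07, the second has
 * EC = 0x25 (a data abort), which is not an FP/SIMD trap. */
void esr_example(void)
{
    assert(esr_is_fp_trap(0x1FE00000u));
    assert(!esr_is_fp_trap(0x96000000u));
}
```

A handler that prints or stores esr_ec(esr) alongside ELR_EL1 is usually enough to tell an FP/SIMD trap apart from an alignment or translation fault.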
An addition to the answer by Peter Maydell.
Here is the code that enables FPU (found here):
mrs x1, cpacr_el1
mov x0, #(3 << 20)
orr x0, x1, x0
msr cpacr_el1, x0
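The constant 3 << 20 in the snippet above is the FPEN field of CPACR_EL1 (bits [21:20]). The same bit manipulation can be written in C as a sketch; the macro and function names are invented for illustration, and the actual register read and write still need mrs/msr (an isb after the write is commonly added, which is an assumption beyond the snippet above).

```c
#include <assert.h>
#include <stdint.h>

#define CPACR_EL1_FPEN_SHIFT 20
#define CPACR_EL1_FPEN_MASK  (UINT64_C(3) << CPACR_EL1_FPEN_SHIFT)

/* Return the CPACR_EL1 value with FP/SIMD access untrapped at EL0 and EL1. */
uint64_t cpacr_enable_fp(uint64_t cpacr)
{
    return cpacr | CPACR_EL1_FPEN_MASK;
}

/* True if both FPEN bits are set, i.e. no FP/SIMD trapping at EL0/EL1. */
int cpacr_fp_enabled(uint64_t cpacr)
{
    return (cpacr & CPACR_EL1_FPEN_MASK) == CPACR_EL1_FPEN_MASK;
}

/* Starting from an all-zero register, the result equals 3 << 20. */
void cpacr_example(void)
{
    assert(cpacr_enable_fp(0) == UINT64_C(0x300000));
}
```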

Including header file in assembly file

I am trying to include a header file containing a macro into my main assembly file, but the compilation fails.
Below is my main.S file
#include "common.h"
BEGIN
mov $0x0E40, %ax
int $0x10
hlt
Below is my common.h file :
.macro BEGIN
LOCAL after_locals
.code16
cli
ljmp $0, $1f
1:
xor %ax, %ax
/* We must zero %ds for any data access. */
mov %ax, %ds
mov %ax, %es
mov %ax, %fs
mov %ax, %gs
mov %ax, %bp
/* Automatically disables interrupts until the end of the next instruction. */
mov %ax, %ss
/* We should set SP because BIOS calls may depend on that. TODO confirm. */
mov %bp, %sp
/* Store the initial dl to load stage 2 later on. */
mov %dl, initial_dl
jmp after_locals
initial_dl: .byte 0
after_locals:
.endm
Both files are in same directory. When I do the compilation :
$ as --32 -o main.o main.S
main.S: Assembler messages:
main.S:2: Error: no such instruction: `begin'
What am I missing? I did a little research and found this answer on SO, but it's not helpful. Please help.
$ as --32 -o main.o main.S
as is just an assembler, it translates assembly source to object code. It does not run the C preprocessor which is supposed to expand #include.
(# is the comment character in GAS syntax for x86 so the line is treated as a comment if it's seen by the assembler instead of replaced by CPP)
What you can do:
1. Use gcc to assemble, with an appropriate file suffix (.S or .sx); it will run the C preprocessor before running the assembler.
   - Add -v to see what commands gcc is invoking.
   - If your source has a different suffix, you can use -x assembler-with-cpp source.asm.
   - If you want to see the intermediate result after preprocessing, add -save-temps. This will write a .s file with the preprocessed source.
   - If you want to pass a command-line option down to as, you can use, for example, -Wa,--32. However, it is better to use options that the compiler driver understands, like -m32 or -m16 in the present case. The driver knows about such options; for example, it will also apply the appropriate options when linking, provided you are linking with gcc -m32 ... as noted below.
2. Use a .include assembler directive, which is handled by the assembler itself, not the C preprocessor.
Note: In case 1. adding include search paths by means of -I path might not work as expected: The compiler driver (gcc in this case) will add -I path only to the assembler's command line if it knows that it's the GNU assembler. You can tell this when the compiler is configured by configure flag --with-gnu-as.
Note: Something similar applies to linking. You probably do not want to call the linker (ld) by hand unless you're making a static executable or flat binary; use gcc or g++ instead if you're making a normal executable to run on the host system. It will add many options needed for linking, like multilib paths, search paths, etc., which you do not want to fiddle with by hand.
(int $0x10 is a 16-bit BIOS call, though, which won't work under a modern mainstream OS, only DOS or a legacy BIOS bootloader.)
If your header file is just assembly, include it with the .include "file" directive in main.S. Note that this inserts the code at the location where it is included.

How does the `asm()` function work in the C language?

I am learning operating system development and am a beginner. I would like to build my system for the real-mode environment, which is a 16-bit environment, using the C language.
In C, I used asm() to switch the code to 16-bit as follows:
asm(".code16");
which tells GCC to generate 16-bit executables (not exactly, though).
Question:
Suppose I have two header files head1.h and head2.h and a main.c file. The contents of main.c file are as follows:
asm(".code16");
#include<head1.h>
#include<head2.h>
int main(){
return 0;
}
Now, since I started my code with the directive to generate a 16-bit executable and then included head1.h and head2.h, do I need to do the same in every header file that I create, or is it sufficient to add the line asm(".code16"); once?
OS: Ubuntu
Compiler: Gnu CC
To answer your question: It suffices for the asm block to be present at the beginning of the translation unit.
So putting it once at the beginning will do.
But you can do better: you can avoid it altogether and use the -m16 command line option (available from 5.2.0) instead.
The effect of -m16 and .code16 is to make 32-bit code executable in real mode; it is not to produce real-mode code.
Look
16.c
int main()
{
return 4;
}
Extracting the raw .text segment
>gcc -c -m16 16.c
>objcopy -j .text -O binary 16.o 16.bin
>ndisasm 16.bin
we get
00000000 6655 push ebp
00000002 6689E5 mov ebp,esp
00000005 6683E4F0 and esp,byte -0x10
00000009 66E800000000 call dword 0xf
0000000F 66B804000000 mov eax,0x4
00000015 66C9 o32 leave
00000017 66C3 o32 ret
Which is just 32-bit code filled with operand size prefixes.
On a real pre-386 machine this won't work as the 66h opcode is UD.
There are old 16-bit compilers, like Turbo C [1], that handle real-mode applications properly.
Alternatively, switch to protected mode as soon as possible or consider using UEFI.
[1] It is available online. This compiler is as old as me!
There is no need to add asm(".code16") to either head1.h or head2.h.
The main reason is how the C preprocessor works: it pastes the contents of head1.h and head2.h into main.c at the point of each #include.
Please check How `#include' Works for further information.
Hope it helps!
Best regards,
Miguel Ángel

How to set gcc or clang to use Intel syntax permanently for inline asm() statements?

I have the following code, which compiles fine with the gcc command gcc ./example.c. The program itself calls the function "add_two", which simply adds two integers. To use the Intel syntax within the extended assembly instructions I need to switch first to Intel and then back to AT&T. According to the gcc documentation it is possible to switch to Intel syntax entirely by using gcc -masm=intel ./example.c.
Whenever I try to compile it with the switch -masm=intel it won't compile, and I don't understand why. I already tried deleting the .intel_syntax directive, but it still doesn't compile.
#include <stdio.h>
int add_two(int, int);
int main(){
int src = 3;
int dst = 5;
printf("summe = %d \n", add_two(src, dst));
return 0;
}
int add_two(int src, int dst){
int sum;
asm (
".intel_syntax;" //switch to intel syntax
"mov %0, %1;"
"add %0, %2;"
".att_syntax;" //switch to at&t syntax
: "=r" (sum) //output
: "r" (src), "r" (dst) //input
);
return sum;
}
The error message by compiling the above mentioned program with gcc -masm=intel ./example.c is:
/tmp/ccEQGI4U.s: Assembler messages:
/tmp/ccEQGI4U.s:55: Error: junk `PTR [rbp-4]' after expression
/tmp/ccEQGI4U.s:55: Error: too many memory references for `mov'
/tmp/ccEQGI4U.s:56: Error: too many memory references for `mov'
Use -masm=intel and don't use any .att_syntax directives in your inline asm. This works with GCC and I think ICC, and with any constraints you use. Other methods don't. (See Can I use Intel syntax of x86 assembly with GCC? for a simple answer saying that; this answer explores exactly what goes wrong, including with clang 13 and earlier.)
That also works in clang 14 and later. (Which isn't released yet but the patch is part of current trunk; see https://reviews.llvm.org/D113707).
Clang 13 and earlier would always use AT&T syntax for inline asm, both in substituting operands and in assembling as op src, dst. But even worse, clang -masm=intel would do that even when taking the Intel side of an asm template using dialect-alternatives like asm ("add {att | intel}" : ... )!
clang -masm=intel did still control how it printed asm after its built-in assembler turned an asm() statement into some internal representation of the instruction. e.g. Godbolt showing clang13 -masm=intel turning add %0, 1 as add dword ptr [1], eax, but clang trunk producing add eax, 1.
Some of the rest of this answer talking about clang hasn't been updated for this new clang patch.
Clang does support Intel-syntax inside MSVC-style asm-blocks, but that's terrible (no constraints so inputs / outputs have to go through memory).
If you were hard-coding register names with clang, -masm=intel would be usable (or the equivalent -mllvm --x86-asm-syntax=intel). But it chokes on mov %eax, 5 in Intel-syntax mode so you can't let %0 expand to an AT&T-syntax register name.
-masm=intel makes the compiler use .intel_syntax noprefix at the top of its asm output file, and use Intel-syntax when generating asm from C outside your inline-asm statement. Using .att_syntax at the bottom of your asm template breaks the compiler's asm, hence the error messages like PTR [rbp-4] looking like junk to the assembler (which is expecting AT&T syntax).
The "too many memory references for mov" error is because in AT&T syntax, mov eax, ebx is a mov from a memory operand (with symbol name eax) to a memory operand (with symbol name ebx).
Some people suggest using .intel_syntax noprefix and .att_syntax prefix around your asm template. That can sometimes work but it's problematic. And incompatible with the preferred method of -masm=intel.
Problems with the "sandwich" method:
When the compiler substitutes operands into your asm template, it will do so according to -masm=. This will always break for memory operands (the addressing-mode syntax is completely different).
It will also break with clang even for registers. Clang's built-in assembler does not accept %eax as a register name in Intel-syntax mode, and doesn't accept .intel_syntax prefix (as opposed to the noprefix that's usually used with Intel-syntax).
Consider this function:
int foo(int x) {
asm(".intel_syntax noprefix \n\t"
"add %0, 1 \n\t"
".att_syntax"
: "+r"(x)
);
return x;
}
It assembles as follows with GCC (Godbolt):
movl %edi, %eax
.intel_syntax noprefix
add %eax, 1 # AT&T register name in Intel syntax
.att_syntax
The sandwich method depends on GAS accepting %eax as a register name even in Intel-syntax mode. GAS from GNU Binutils does, but clang's built-in assembler doesn't.
On a Mac, even using real GCC the asm output has to assemble with an as that's based on clang, not GNU Binutils.
Using clang on that source code complains:
<source>:2:35: error: unknown token in expression
asm(".intel_syntax noprefix \n\t"
^
<inline asm>:2:6: note: instantiated into assembly here
add %eax, 1
^
(The first line of the error message didn't handle the multi-line string literal very well. If you use ; instead of \n\t and put everything on one line the clang error message works better but the source is a mess.)
I didn't check what happens with "ri" constraints when the compiler picks an immediate; it will still decorate it with $ but IDK if GAS silently ignores that, too, in Intel syntax mode.
PS: your asm statement has a bug: you forgot an early-clobber on your output operand so nothing is stopping the compiler from picking the same register for the %0 output and the %2 input that you don't read until the 2nd instruction. Then mov will destroy an input.
But using mov as the first or last instruction of an asm-template is usually also a missed-optimization bug. In this case you can and should just use lea %0, [%1 + %2] to let the compiler add with the result written to a 3rd register, non-destructively. Or just wrap the add instruction (using a "+r" operand and an "r", and let the compiler worry about data movement.) If it had to load the value from memory anyway, it can put it in the right register so no mov is needed.
PS: it's possible to write inline asm that works with -masm=intel or att, using GNU C inline asm dialect alternatives. e.g.
void atomic_inc(int *p) {
asm( "lock add{l $1, %0 | %0, 1}"
: "+m" (*p)
:: "memory"
);
}
compiles with gcc -O2 (-masm=att is the default) to
atomic_inc(int*):
lock addl $1, (%rdi)
ret
Or with -masm=intel to:
atomic_inc(int*):
lock add DWORD PTR [rdi], 1
ret
Notice that the l suffix is required for AT&T, and the dword ptr is required for intel, because memory, immediate doesn't imply an operand-size. And that the compiler filled in valid addressing-mode syntax for both cases.
This works with clang, but only the AT&T version ever gets used.
Note that -masm= also affects the default inline assembler syntax:
Output assembly instructions using selected dialect. Also affects
which dialect is used for basic "asm" and extended "asm". Supported
choices (in dialect order) are att or intel. The default is att.
Darwin does not support intel.
That means that your first .intel_syntax directive is superfluous and the final .att_syntax is wrong because your GCC call compiles C to Intel assembler code.
IOW, either stick to -masm=intel or sandwich your inline Intel assembler code sections between .intel_syntax noprefix and .att_syntax prefix directives - but don't do both.
Note that the sandwich method isn't compatible with all inline assembler constraints - e.g. a constraint that involves m (i.e. memory operand) would insert an operand in ATT syntax which would yield an error like 'Error: junk (%rbp) after expression'. In those cases you have to use -masm=intel.
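Putting the points above together (the early-clobber fix and dialect alternatives), the add_two function from the question could be sketched as follows so that it assembles with either -masm=att or -masm=intel. This is one possible rewrite, not the only one; the plain-C fallback branch is an addition so the file also builds on non-x86 targets.

```c
#include <assert.h>

/* add_two written with GNU C dialect alternatives: {att | intel}. */
int add_two(int src, int dst)
{
    int sum;
#if defined(__i386__) || defined(__x86_64__)
    __asm__ ("mov{l %1, %0 | %0, %1}\n\t"
             "add{l %2, %0 | %0, %2}"
             : "=&r" (sum)             /* early-clobber: written before %2 is read */
             : "r" (src), "r" (dst)
             : "cc");
#else
    sum = src + dst;                   /* fallback for non-x86 builds */
#endif
    return sum;
}

/* Same inputs as main() in the question. */
void add_two_example(void)
{
    assert(add_two(3, 5) == 8);
}
```

As noted earlier, a single lea or a lone add with a "+r" operand would avoid the mov entirely; the two-instruction form is kept here only to stay close to the original code.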

GNU assembler for MIPS: how to emit sync_* instructions?

MIPS32 ISA defines the following format for the sync instruction:
SYNC (stype = 0 implied)
SYNC stype
here, stype may be SYNC_WMB (SYNC 4), SYNC_MB (SYNC 16), etc.
In inline assembler, I may use default sync: __asm__ volatile ("sync" ::);.
But, if I write something like __asm__ volatile ("sync 0x10" ::), it doesn't compile:
Error: illegal operands 'sync 0x10'
The same happens if I pass the -mips32r2 option to gcc.
So, the question is: how to use the SYNC_* (SYNC_WMB, SYNC_MB, SYNC_ACQUIRE, ...) instructions from GCC inline assembly?
I suspect that your binutils are too old - it looks like support for this was only added in version 2.20.
As a workaround, if you can't upgrade your binutils easily, you could construct the opcode by hand.
sync is an opcode 0 instruction with a function code (bits 5..0) of 0xf, and this form of it encodes the sync type in the shift amount field (bits 10..6). So, e.g. for sync 0x10:
__asm__ volatile(".word (0x0000000f | (0x10 << 6))");
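The hand-built encoding above generalises to any sync type. A small sketch in C that computes the instruction word (mips_sync_word is a hypothetical helper name; it only computes the value and does not emit the instruction):

```c
#include <assert.h>
#include <stdint.h>

#define MIPS_SYNC_FUNCT 0x0Fu  /* function code in bits 5..0 */

/* Build the 32-bit encoding of "sync stype": opcode 0, function 0xf,
 * with stype placed in the shift-amount field (bits 10..6). */
uint32_t mips_sync_word(unsigned stype)
{
    assert(stype <= 0x1Fu);    /* the field is 5 bits wide */
    return MIPS_SYNC_FUNCT | ((uint32_t)stype << 6);
}
```

For stype 0x10 this yields 0x0000040f, the same value as the .word expression above; plain sync (stype 0) is 0x0000000f.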

Resources