QEMU for AArch64: why does execution get stuck at "ldr q1, [x0]"? - gcc

I have this simple C code:
#include "uart.h"
#include <string.h>
char x[32];
__attribute__((noinline))
void foo(void)
{
strcpy(x, "xxxxxxxxxxxxxxxxxxxxxxxx");
}
int main(void)
{
uart_puts("xxx\n");
foo();
uart_puts("yyy\n");
}
compiled as:
$ aarch64-none-elf-gcc t78.c -mcpu=cortex-a57 -Wall -Wextra -g -O2 -c -std=c11 \
&& aarch64-none-elf-ld -T linker.ld t78.o boot.o uart.o -o kernel.elf
and executed as:
$ qemu-system-aarch64.exe -machine virt -cpu cortex-a57 -nographic -kernel kernel.elf
prints:
xxx
Why is yyy not printed?
By reducing the issue I've found that:
for strcpy, GCC generated inline code rather than a "call strcpy" (see below)
ldr q1, [x0] is the instruction that causes yyy to not be printed.
Here is the generated code of foo:
foo:
.LFB0:
.file 1 "t78.c"
.loc 1 6 1 view -0
.cfi_startproc
.loc 1 7 5 view .LVU1
adrp x0, .LC0
add x0, x0, :lo12:.LC0
adrp x1, .LANCHOR0
add x2, x1, :lo12:.LANCHOR0
ldr q1, [x0] <<== root cause
ldr q0, [x0, 9]
str q1, [x1, #:lo12:.LANCHOR0]
str q0, [x2, 9]
.loc 1 8 1 is_stmt 0 view .LVU2
ret
If I put a ret before ldr q1, [x0], then yyy is printed (as expected).
The question: why does ldr q1, [x0] cause yyy to not be printed?
Tool versions:
$ aarch64-none-elf-gcc --version
aarch64-none-elf-gcc.exe (Arm GNU Toolchain 12.2.Rel1 (Build arm-12.24)) 12.2.1 20221205
$ qemu-system-aarch64 --version
QEMU emulator version 7.2.0 (v7.2.0-11948-ge6523b71fc-dirty)

The ldr q1, [x0] instruction is taking an exception because it accesses a floating-point/SIMD register but your startup code does not enable the FPU. The compiler is assuming that it can generate code that uses the FPU, so to meet that assumption one of the things your startup code must do is enable the FPU, via at least CPACR_EL1, and possibly other registers if EL2 or EL3 are enabled.
Alternatively, you could tell the compiler not to emit code that uses the FPU. The Linux kernel takes this approach, using the -mgeneral-regs-only option.
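For example, taking the compile command from the question and adding that one flag should be enough:
$ aarch64-none-elf-gcc t78.c -mcpu=cortex-a57 -mgeneral-regs-only -Wall -Wextra -g -O2 -c -std=c11
With -mgeneral-regs-only, GCC has to expand the strcpy with general-purpose-register loads and stores (or an actual call to strcpy) instead of the q-register accesses shown above.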
Real hardware probably has more strict requirements for what you need to do to configure the CPU to be able to run C code; QEMU is quite lenient. For instance the architecture defines that the reset value of many system registers is UNKNOWN, though QEMU usually resets them to zero. A robust startup sequence will explicitly set bits in registers like SCTLR_EL1.
You may also need to watch out for whether your compiler and your startup code agree on whether compiler-generated code is allowed to emit unaligned accesses -- if the MMU is not enabled then all memory accesses are treated as of type Device, which means they must be aligned (regardless of SCTLR_EL1.A). So you either need to make sure your compiler doesn't emit unaligned loads and stores, or else turn on the MMU and set SCTLR_EL1.A to 0.
You could improve your ability to debug this sort of "exception in early bootup" by installing some exception vectors which do something helpful when an unexpected exception occurs. The ideal is to be able to print registers, especially ELR_EL1 and ESR_EL1, which tell you where and why the exception occurred; printing in early bootup can be tricky, though. An easy compromise is to at least catch the exception and loop; you can then use gdb to see what the CPU state is.
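For illustration, a minimal "catch and loop" vector table might look like this; it is a sketch only (the label names are invented, and this is not code from the question's boot.o):
// AArch64 vector tables are 2KB-aligned, with sixteen 128-byte slots.
.align 11
vectors:
.rept 16
b hang // park the CPU on any exception
.align 7 // advance to the next 128-byte slot
.endr
hang:
mrs x0, esr_el1 // why the exception happened
mrs x1, elr_el1 // where it happened
b hang // spin so gdb can inspect the registers
and the startup code would install it early with:
adr x0, vectors
msr vbar_el1, x0
isb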

An addition to the answer by Peter Maydell.
Here is the code that enables the FPU (found here):
mrs x1, cpacr_el1 // read the current CPACR_EL1 value
mov x0, #(3 << 20) // FPEN, bits [21:20] = 0b11: don't trap FP/SIMD at EL1/EL0
orr x0, x1, x0
msr cpacr_el1, x0 // write it back
isb // synchronize so following FP/SIMD instructions see the change

Related

Porting JonesForth to macOS v10.15 (Catalina)

I'm trying to make JonesForth run on a recent MacBook out of the box, just using Mac tools.
I started to convert everything to 64 bits and to adapt to the Mac assembler syntax.
I got things to assemble, but I immediately ran into a curious segmentation fault:
/* NEXT macro. */
.macro NEXT
lodsq
jmpq *(%rax)
.endm
...
/* Assembler entry point. */
.text
.globl start
.balign 16
start:
cld
mov %rsp,var_SZ(%rip) // Save the initial data stack pointer in FORTH variable S0.
mov return_stack_top(%rip),%rbp // Initialise the return stack.
//call set_up_data_segment
mov cold_start(%rip),%rsi // Initialise interpreter.
NEXT // Run interpreter!
.const
cold_start: // High-level code without a codeword.
.quad QUIT
QUIT is defined like this via macro defword:
.macro defword
.const_data
.balign 8
.globl name_$3
name_$3 :
.quad $4 // Link
.byte $2+$1 // Flags + length byte
.ascii $0 // The name
.balign 8 // Padding to next eight-byte boundary
.globl $3
$3 :
.quad DOCOL // Codeword - the interpreter
// list of word pointers follow
.endm
// QUIT must not return (ie. must not call EXIT).
defword "QUIT",4,,QUIT,name_TELL
.quad RZ,RSPSTORE // R0 RSP!, clear the return stack
.quad INTERPRET // Interpret the next word
.quad BRANCH,-16 // And loop (indefinitely)
...more code
When I run this, I get a segmentation fault the first time in the NEXT macro:
(lldb) run
There is a running process, kill it and restart?: [Y/n] y
Process 83000 exited with status = 9 (0x00000009)
Process 83042 launched: '/Users/klapauciusisgreat/jonesforth64/jonesforth' (x86_64)
Process 83042 stopped
* thread #1, stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
frame #0: 0x0000000100000698 jonesforth`start + 24
jonesforth`start:
-> 0x100000698 <+24>: jmpq *(%rax)
0x10000069a <+26>: nopw (%rax,%rax)
jonesforth`code_DROP:
0x1000006a0 <+0>: popq %rax
0x1000006a1 <+1>: lodsq (%rsi), %rax
Target 0: (jonesforth) stopped.
rax does point to what I think is the dereferenced address, DOCOL:
(lldb) register read
General Purpose Registers:
rax = 0x0000000100000660 jonesforth`DOCOL
So one mystery is:
Why does RAX point to DOCOL instead of QUIT? My guess is that the instruction was halfway executed and the result of the indirection was stored in rax. What are some good pointers to documentation?
Why the segmentation fault?
I commented out the segment setup code in the original that called brk to set up a data segment. Another [implementation] also did not call it at all, so I thought I could just as well ignore this. Is there any magic on how to set up segment permissions with syscalls in a 64-bit binary on Catalina? The make command is pretty much the standard JonesForth one:
jonesforth: jonesforth.S
gcc -nostdlib -g -static $(BUILD_ID_NONE) -o $@ $<
P.S.: Yes, I can get JonesForth to work perfectly in Docker images, but that's beside the point. I really want it to work in 64-bit on Catalina, out of the box.
The original code had something like
mov $cold_start,%rsi
And the Apple assembler complains about not being able to use 32-bit immediate addressing in 64-bit binaries.
So I tried
mov $cold_start(%rip),%rsi
but that also doesn't work.
So I tried
mov cold_start(%rip),%rsi
which assembles, but of course it dereferences cold_start, which is not what I need.
The correct way of doing this is apparently
lea cold_start(%rip),%rsi
This seems to work as intended.
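To spell out the difference (my comments), which also explains the DOCOL mystery above: the mov loaded the quad stored at cold_start, i.e. the address of QUIT, so lodsq in NEXT then fetched QUIT's codeword DOCOL into %rax, one level of indirection too far.
mov cold_start(%rip),%rsi // loads the quad STORED AT cold_start (&QUIT) into %rsi
lea cold_start(%rip),%rsi // loads the ADDRESS OF cold_start itself, which is what NEXT expects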

Branching to a C symbol from Thumb inline assembly

I'm on a Cortex-M0+ device (Thumb only) and I'm trying to dynamically generate some code in RAM and then jump to it, like so:
uint16_t code_buf[18];
...
void jump() {
register volatile uint32_t* PASET asm("r0") = &(PA->OUTSET.reg);
register volatile uint32_t* PACLR asm("r1") = &(PA->OUTCLR.reg);
register uint32_t set asm("r2") = startset;
register uint32_t cl0 asm("r3") = clears[0];
register uint32_t cl1 asm("r4") = clears[1];
register uint32_t cl2 asm("r5") = clears[2];
register uint32_t cl3 asm("r6") = clears[3];
register uint32_t dl0 asm("r8") = delays[0];
register uint32_t dl1 asm("r9") = delays[1];
register uint32_t dl2 asm("r10") = delays[2];
register uint32_t dl3 asm("r11") = delays[3];
asm volatile (
"bl code_buf\n"
: [set]"+r" (set) : [PASET]"r" (PASET), [PACLR]"r" (PACLR), [cl0]"r" (cl0), [cl1]"r" (cl1), [cl2]"r" (cl2), [cl3]"r" (cl3), [dl0]"r" (dl0), [dl1]"r" (dl1), [dl2]"r" (dl2), [dl3]"r" (dl3) : "lr"
);
}
The code in code_buf will use the arguments passed via registers (that's why I'm forcing specific registers).
This code compiles fine, but when I look at the disassembly the branch instruction has been changed to
a14: f004 ebb0 blx 0x5178
This would try to switch the CPU to ARM mode and cause a HardFault. Is there a way to force the assembler to keep the branch as a simple bl?
So it turns out that the toolchain I was using (gcc 4.8) is buggy and makes two errors: it interprets code_buf as an ARM address, and it produces a bogus blx <label>, which isn't even legal on a Cortex-M0+. I updated it to 6.3.1 and the inline asm was assembled to a bl <label> as it was supposed to be.
From section 4.1.1 of the ARMv6-M Architecture Reference Manual:
Thumb interworking is held as bit [0] of an interworking address. Interworking addresses are used in the following instructions: BX, BLX, or POP that loads the PC.
ARMv6-M only supports the Thumb instruction Execution state, therefore the value of address bit [0] must be 1 in interworking instructions, otherwise a fault occurs. All instructions ignore bit [0] and write bits [31:1]:'0' when updating the PC.
The target of your branch, code_buf, will be word-aligned (possibly double-word aligned) so bit 0 will be clear in its address. The key is to ensure that bit 0 is set before you branch, and then even if the toolchain selects an interworking instruction you'll remain in thumb mode.
I don't have a development environment in front of me to test this, but I would suggest casting to a pointer-to-single-byte type and using pointer arithmetic to set bit 0:
uint8_t *thumb_target = ((uint8_t *)code_buf) + 1;
asm volatile (
"bl thumb_target\n"
: [set]"+r" (set) : [PASET]"r" (PASET), [PACLR]"r" (PACLR), [cl0]"r" (cl0), [cl1]"r" (cl1), [cl2]"r" (cl2), [cl3]"r" (cl3), [dl0]"r" (dl0), [dl1]"r" (dl1), [dl2]"r" (dl2), [dl3]"r" (dl3) : "lr"
);
Edit: The above doesn't work, as Peter Cordes points out, because a local variable can't be used in inline ASM in this context. Not being well-versed in gcc's inline ASM, I won't attempt to fix it.
I have now had a chance to test the supplied code though, and gcc 7.2.1 with -S -mtune=cortex-m0plus -fomit-frame-pointer generates a BL not a BLX.
Edit 2: The documentation (section A6.7.14) suggests that only the register-target version of BLX is present in the ARMv6-M architecture (this is in common with the ARMv7 devices I'm most familiar with) and so it looks to me as if the fault is caused not by an attempt to switch to ARM mode but by an illegal instruction. Is your compiler correctly configured?
IDK why your assembler would be changing bl into blx. Mine doesn't, using arm-none-eabi-gcc 7.3.0 on Arch Linux. arm-none-eabi-as --version shows Binutils 2.30.
unsigned short code_buf[18];
void jump() {
asm("bl code_buf");
asm("blx code_buf"); // still assembles to BL, not BLX
// asm("blx jump");
// asm("bl jump");
}
compiled with arm-none-eabi-gcc -O2 -nostdlib arm-bl.c -mcpu=cortex-m0plus -mthumb (I made a linked executable with -nostdlib so I could see actual branch displacements, not placeholders).
Disassembling with arm-none-eabi-objdump -d a.out shows
00008000 <jump>:
8000: f010 f804 bl 1800c <__data_start>
8004: f010 f802 bl 1800c <__data_start>
8008: 4770 bx lr
800a: 46c0 nop ; (mov r8, r8)
Your f004 ebb0 may be a Thumb2 encoding for BLX. I don't know why you're getting it.
The Thumb encoding for bl is documented in section 5.19 of this ARM7TDMI ISA manual ("long branch with link"), but that manual doesn't mention a Thumb encoding for blx at all (because it's only Thumb, not Thumb 2). The Thumb bl encoding stores the branch displacement right-shifted by 1 (i.e. without the low bit), and always stays in Thumb mode.
It's actually two separate instructions; one which puts the high 12 bits of the displacement into LR, and another which branches and updates LR to the return address. (This 2-instruction hack allows Thumb1 to work without Thumb2 32-bit instructions). Both instructions start with f, so your disassembly shows that you got something else; the first 16-bit chunk of f004 ebb0 is the LR setup, but ebb0 doesn't match any Thumb 1 instruction.
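As a cross-check on that two-instruction encoding, take the bl 1800c at address 0x8000 in the listing above (my arithmetic): the first halfword f010 contributes high offset bits 0x010, so LR = 0x8000 + 4 + (0x010 << 12) = 0x18004; the second halfword f804 contributes low bits 0x004, so the target is 0x18004 + (0x004 << 1) = 0x1800c, exactly what objdump printed.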
Possibly asm("bl code_buf+1" : ...); or blx code_buf+1 could work, if the +1 convinces the assembler to treat it as a Thumb target. But you might need to use asm to get a .thumb_func directive applied to code_buf somehow to keep your assembler happy.

Usage of as + gcc vs gcc only for ARM assembly

I am using as and gcc to assemble and create executables of ARM assembly programs, as recommended by this tutorial, as follows:
Given an assembly source file, program.s, I run:
as -o program.o program.s
Then:
gcc -o program program.o
However, running gcc on the assembly source directly like so:
gcc -o program program.s
yields the same result.
Does gcc call as behind the scenes? Is there any reason at all to use both as and gcc given that gcc alone is able to produce an executable from source?
I am running this on a Raspberry Pi 3, Raspbian Jessie (a Debian derivative), gcc 4.9.2, as 2.25.
GCC calls all kinds of things behind the scenes; not just as but ld as well. This is pretty easy to instrument if you want to prove it (replace the as and ld binaries that GCC actually finds with ones that, say, print out their command line, then run GCC and see that they get called).
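A less invasive check is to ask the driver itself: -v makes GCC print each subprogram as it runs it, and -### prints the commands without executing them. Trimmed, illustrative output:
$ gcc -v -o program program.s
...
.../as -o /tmp/cc....o program.s
.../collect2 ... /tmp/cc....o ...
(collect2 is the wrapper through which GCC invokes ld.)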
When you use GCC as an assembler it can also run the source through the C preprocessor (that is what the .S suffix, as opposed to .s, conventionally requests), so you can do some fairly disgusting things like this:
start.s
//this is a comment
#this is a comment
#define FOO BAR
.globl _start
_start:
mov sp,#0x80000
bl hello
b .
.globl world
world:
bx lr
And to see more of what is going on, here are other files:
so.h
unsigned int world ( unsigned int, unsigned int );
#define FIVE 5
#define SIX 6
so.c
#include "so.h"
unsigned int hello ( void )
{
unsigned int a,b,c;
a=FIVE;
b=SIX;
c=world(a,b);
return(c+1);
}
build
arm-none-eabi-gcc -save-temps -nostdlib -nostartfiles -ffreestanding -O2 start.s so.c -o so.elf
arm-none-eabi-objdump -D so.elf
producing
00008000 <_start>:
8000: e3a0d702 mov sp, #524288 ; 0x80000
8004: eb000001 bl 8010 <hello>
8008: eafffffe b 8008 <_start+0x8>
0000800c <world>:
800c: e12fff1e bx lr
00008010 <hello>:
8010: e92d4010 push {r4, lr}
8014: e3a01006 mov r1, #6
8018: e3a00005 mov r0, #5
801c: ebfffffa bl 800c <world>
8020: e8bd4010 pop {r4, lr}
8024: e2800001 add r0, r0, #1
8028: e12fff1e bx lr
being a very simple project. Here is so.i after the pre-processor, which goes and gets the include files and replaces the defines:
# 1 "so.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "so.c"
# 1 "so.h" 1
unsigned int world ( unsigned int, unsigned int );
# 4 "so.c" 2
unsigned int hello ( void )
{
unsigned int a,b,c;
a=5;
b=6;
c=world(a,b);
return(c+1);
}
Then GCC calls the actual compiler (whose program name is not GCC).
That produces so.s:
.cpu arm7tdmi
.eabi_attribute 20, 1
.eabi_attribute 21, 1
.eabi_attribute 23, 3
.eabi_attribute 24, 1
.eabi_attribute 25, 1
.eabi_attribute 26, 1
.eabi_attribute 30, 2
.eabi_attribute 34, 0
.eabi_attribute 18, 4
.file "so.c"
.text
.align 2
.global hello
.syntax unified
.arm
.fpu softvfp
.type hello, %function
hello:
# Function supports interworking.
# args = 0, pretend = 0, frame = 0
# frame_needed = 0, uses_anonymous_args = 0
push {r4, lr}
mov r1, #6
mov r0, #5
bl world
pop {r4, lr}
add r0, r0, #1
bx lr
.size hello, .-hello
.ident "GCC: (GNU) 6.3.0"
Which is then fed to the assembler to make so.o. Then the linker is called to turn these into so.elf.
Now, you can do most of these calls directly. That doesn't mean these programs don't in turn call other programs: GCC still calls one or more programs to actually do the compile.
arm-none-eabi-as start.s -o start.o
arm-none-eabi-gcc -O2 -S so.c
arm-none-eabi-as so.s -o so.o
arm-none-eabi-ld start.o so.o -o so.elf
arm-none-eabi-objdump -D so.elf
Giving the same result:
00008000 <_start>:
8000: e3a0d702 mov sp, #524288 ; 0x80000
8004: eb000001 bl 8010 <hello>
8008: eafffffe b 8008 <_start+0x8>
0000800c <world>:
800c: e12fff1e bx lr
00008010 <hello>:
8010: e92d4010 push {r4, lr}
8014: e3a01006 mov r1, #6
8018: e3a00005 mov r0, #5
801c: ebfffffa bl 800c <world>
8020: e8bd4010 pop {r4, lr}
8024: e2800001 add r0, r0, #1
8028: e12fff1e bx lr
Using -S with GCC does feel a bit wrong. Using it like this instead feels more natural:
arm-none-eabi-gcc -O2 -c so.c -o so.o
Note that we didn't provide a linker script; the toolchain has a default one. We can control that, and depending on what this is being aimed at, perhaps we should.
I am not happy to see that the new/current version of as is tolerant of C comments, etc. It didn't use to be that way; it must be a new thing with a recent release.
Thus the term "toolchain": it is a number of tools chained together, one feeding the next in order.
Not all compilers take the assembly language step. Some compile to intermediate code, and then another tool turns that compiler-specific intermediate code into assembly language, and then some assembler is called. (GCC's intermediate code lives inside tables within the compile step, whereas with clang/llvm you can ask it to emit that intermediate code and then go from there to assembly language for one of the targets.)
Some compilers go straight to machine code and don't stop at assembly language. This is likely one of those "climb the mountain just because it is there" things vs "go around". Like writing an operating system purely in assembly language.
For any decent-sized project and a tool that can support it, you are going to have a linker, and an assembler is the first tool you make to support a target that is new to you. The processor (chip or IP or both) vendor is going to have an assembler and then other tools available as well.
Try compiling even the above simple C program by hand using assembly language. Then try it again without using assembly language, by hand, using just machine code. You will find that using assembly language as an intermediate step is far more sane for compiler developers, along with the fact that it has been done this way forever, which is also a good reason to keep doing it this way.
If you wander about in the gnu toolchain directory you are using, you may find programs like cc1:
./libexec/gcc/arm-none-eabi/6.3.0/cc1 --help
The following options are specific to just the language Ada:
None found. Use --help=Ada to show *all* the options supported by the Ada front-end.
The following options are specific to just the language AdaSCIL:
None found. Use --help=AdaSCIL to show *all* the options supported by the AdaSCIL front-end.
The following options are specific to just the language AdaWhy:
None found. Use --help=AdaWhy to show *all* the options supported by the AdaWhy front-end.
The following options are specific to just the language C:
None found. Use --help=C to show *all* the options supported by the C front-end.
The following options are specific to just the language C++:
-Wplacement-new
-Wplacement-new= 0xffffffff
The following options are specific to just the language Fortran:
Now if you run that cc1 program against the so.i file you saved with -save-temps, you get so.s the assembly language file.
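For example, something along these lines (illustrative; the cc1 path and default options vary by toolchain build):
$ ./libexec/gcc/arm-none-eabi/6.3.0/cc1 -O2 so.i -o so.s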
You could probably continue to dig into the directory or the gnu tools sources to find even more goodies.
Note this question has been asked before many times here at Stack Overflow in various ways.
Also note that main() isn't anything special, as I have demonstrated. In some compilers it might be, but I can make programs that don't require that function name.

gdb ignores breakpoint in Qemu bootloader

I am trying to step through the simple bootloader shown in this tutorial: http://mikeos.berlios.de/write-your-own-os.html - so I can use the Qemu monitor to inspect the general registers for educational purposes.
Even though I am able to connect Qemu and gdb, and the breakpoint is set at the beginning of the bootloader (0x7c0), after hitting "c" in gdb the code just runs all the way to the end.
I have read that kvm may "confuse" gdb with virtual memory addresses, so I disabled it. This didn't work.
I also read (Debugging bootloader with gdb in qemu) that things worked when debugging the FreeDOS boot after compiling gdb from HEAD. Instead of recompiling gdb, I tried debugging the FreeDOS boot - it worked!
So I do believe my problem is actually in getting the tutorial's bootloader to go through step-by-step execution.
Other things I tried (none of them worked):
Use dozens of "si" before inserting the breakpoint
Try different breakpoint addresses
Use the -singlestep option of qemu
Here is my qemu command line:
qemu-system-i386 -fda disquete.img -boot a -s -S -monitor stdio
Here is my command sequence inside gdb:
(gdb) target remote localhost:1234
(gdb) set architecture i8086
(gdb) br *0x7c0
Then I hit "c" and it just passes the breakpoint all the way.
Versions:
$ uname -a
Linux Brod 3.8.0-30-generic #44-Ubuntu SMP Thu Aug 22 20:52:24 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
$ gdb --version
GNU gdb (GDB) 7.5.91.20130417-cvs-ubuntu
$ qemu --version
QEMU emulator version 1.4.0 (Debian 1.4.0+dfsg-1expubuntu4), Copyright (c) 2003-2008 Fabrice Bellard
As I am able to step through the Freedos boot, I do believe my setup is fine and I must be failing within some conceptual misunderstanding of the boot process for the bootloader tutorial I mentioned in the beginning of this post.
All help is welcome!
Because of hardware virtualization, it may be necessary to use a hardware breakpoint:
(gdb) hbreak *0x7c00
Also watch out for the correct architecture in gdb, even when using a 64-bit CPU (or kvm): The bootloader needs (gdb) set architecture i8086 as the CPU is still in real mode.
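Putting the two answers together, a session that stops at the first bootloader instruction looks like this (the qemu invocation is the one from the question):
(gdb) target remote localhost:1234
(gdb) set architecture i8086
(gdb) hbreak *0x7c00
(gdb) c
Note the address: the BIOS loads the boot sector at physical address 0x7c00, so that is where the breakpoint belongs; 0x7c0 is the real-mode segment value, not the linear address.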
I was actually able to debug the sample bootloader I took from mikeos.berlios.de/write-your-own-os.html after rewriting it to specifically load at 0x7c00. My sources of information (other than the contributions here) were:
http://en.wikibooks.org/wiki/X86_Assembly/Bootloaders
http://viralpatel.net/taj/tutorial/hello_world_bootloader.php
The final code is this:
[BITS 16] ; Tells nasm to build 16-bit code
[ORG 0x7C00] ; The address at which the code will be loaded
start:
mov ax, 0 ; Reserves 4Kbytes after the bootloader
add ax, 288 ; (4096 + 512)/ 16 bytes per paragraph
mov ss, ax
mov sp, 4096
mov ax, 0 ; Sets the data segment
mov ds, ax
mov si, texto ; Sets the text position
call imprime ; Calls the printing routine
jmp $ ; Infinite loop
texto db 'It works! :-D', 0
imprime: ; Prints the text on screen
mov ah, 0Eh ; int 10h - printing function
.repeat:
lodsb ; Grabs one char
cmp al, 0
je .done ; If char is zero, ends
int 10h ; Else prints char
jmp .repeat
.done:
ret
times 510-($-$$) db 0 ; Fills the remaining boot sector with 0s
dw 0xAA55 ; Standard boot signature
Now I can step through the program and see the registers changing.
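For reference, the bootloader is assembled as a flat binary and run with the same qemu flags as before (the source file name here is mine):
$ nasm -f bin bootloader.asm -o disquete.img
$ qemu-system-i386 -fda disquete.img -boot a -s -S -monitor stdio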

GCC to emit ARM idiv instructions

How can I instruct gcc to emit idiv (integer division: udiv and sdiv) instructions for ARM application processors?
So far the only way I have come up with is to use -mcpu=cortex-a15 with gcc 4.7.
$cat idiv.c
int test_idiv(int a, int b) {
return a / b;
}
On gcc 4.7 (bundled with Android NDK r8e)
$gcc -O2 -mcpu=cortex-a15 -c idiv.c
$objdump -S idiv.o
00000000 <test_idiv>:
0: e710f110 sdiv r0, r0, r1
4: e12fff1e bx lr
If you add -march=armv7-a next to -mcpu=cortex-a15, it even gives idiv.c:1:0: warning: switch -mcpu=cortex-a15 conflicts with -march=armv7-a switch [enabled by default] and doesn't emit the idiv instruction:
$gcc -O2 -mcpu=cortex-a15 -march=armv7-a -c idiv.c
idiv.c:1:0: warning: switch -mcpu=cortex-a15 conflicts with -march=armv7-a switch [enabled by default]
$objdump -S idiv.o
00000000 <test_idiv>:
0: e92d4008 push {r3, lr}
4: ebfffffe bl 0 <__aeabi_idiv>
8: e8bd8008 pop {r3, pc}
gcc 4.6 (bundled with Android NDK r8e) doesn't emit idiv instructions at all, but it recognizes -mcpu=cortex-a15 and also doesn't complain about the -mcpu=cortex-a15 -march=armv7-a combination.
Afaik idiv is optional on armv7, so there should be a cleaner way to instruct gcc to emit them but how?
If the instruction is not in the machine descriptions, then I doubt that gcc will emit it (see Note1).
You can always use inline assembler to get the instruction if the compiler does not support it (see Note2). Since this opcode is fairly rare/machine-specific, there has probably not been much effort to get it into the gcc source. Note that there are both arch and tune/cpu flags: tune/cpu is for a more specific machine, whereas arch is supposed to allow all machines in that architecture. This opcode seems to break that rule, if I understand correctly.
For gcc 4.6.2, it looks like thumb2 and cortex-r4 are the cues to use these instructions, and as you have noted, with gcc 4.7.2 cortex-a15 seems to have been added as well. With gcc 4.7.2, the thumb2.md file no longer has udiv/sdiv; however, they might be included somewhere else, as I am not 100% familiar with the machine description language. It also seems that cortex-a7, cortex-a15, and cortex-r5 may enable these instructions with 4.7.2 (see Note3).
This doesn't answer the question directly, but it does give some information/paths to get to the answer. You can compile the module with -mcpu=cortex-r4, although this may produce linker issues. Also, there is int my_idiv(int a, int b) __attribute__ ((__target__ ("arch=cortex-r4")));, which lets you specify the machine description used by the code generator on a per-function basis. I haven't used any of these myself; they are only possibilities to try. Generally you don't want to keep the wrong machine setting, as it could generate sub-optimal (and possibly illegal) opcodes. You will have to experiment and maybe then provide the real answer.
Note1: This is for a stock gcc 4.6.2 and 4.7.2. I don't know if your Android compiler has patches.
gcc-4.6.2/gcc/config/arm$ grep [ius]div *.md
arm.md: "...,sdiv,udiv,other"
cortex-r4.md:;; We guess that division of A/B using sdiv or udiv, on average,
cortex-r4.md:;; This gives a latency of nine for udiv and ten for sdiv.
cortex-r4.md:(define_insn_reservation "cortex_r4_udiv" 9
cortex-r4.md: (eq_attr "insn" "udiv"))
cortex-r4.md:(define_insn_reservation "cortex_r4_sdiv" 10
cortex-r4.md: (eq_attr "insn" "sdiv"))
thumb2.md: "sdiv%?\t%0, %1, %2"
thumb2.md: (set_attr "insn" "sdiv")]
thumb2.md:(define_insn "udivsi3"
thumb2.md: (udiv:SI (match_operand:SI 1 "s_register_operand" "r")
thumb2.md: "udiv%?\t%0, %1, %2"
thumb2.md: (set_attr "insn" "udiv")]
gcc-4.7.2/gcc/config/arm$ grep -i [ius]div *.md
arm.md: "...,sdiv,udiv,other"
arm.md: "TARGET_IDIV"
arm.md: "sdiv%?\t%0, %1, %2"
arm.md: (set_attr "insn" "sdiv")]
arm.md:(define_insn "udivsi3"
arm.md: (udiv:SI (match_operand:SI 1 "s_register_operand" "r")
arm.md: "TARGET_IDIV"
arm.md: "udiv%?\t%0, %1, %2"
arm.md: (set_attr "insn" "udiv")]
cortex-a15.md:(define_insn_reservation "cortex_a15_udiv" 9
cortex-a15.md: (eq_attr "insn" "udiv"))
cortex-a15.md:(define_insn_reservation "cortex_a15_sdiv" 10
cortex-a15.md: (eq_attr "insn" "sdiv"))
cortex-r4.md:;; We guess that division of A/B using sdiv or udiv, on average,
cortex-r4.md:;; This gives a latency of nine for udiv and ten for sdiv.
cortex-r4.md:(define_insn_reservation "cortex_r4_udiv" 9
cortex-r4.md: (eq_attr "insn" "udiv"))
cortex-r4.md:(define_insn_reservation "cortex_r4_sdiv" 10
cortex-r4.md: (eq_attr "insn" "sdiv"))
Note2: See "pre-processor as Assembler" if gcc is passing options to gas that prevent use of the udiv/sdiv instructions. For example, you can use asm(" .long <opcode>\n"); where <opcode> is the output of some token-pasting, stringifying register-encoding macro. Also, you can annotate your assembler to specify changes in the machine, so you can temporarily lie and say you have a cortex-r4, etc.
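As a concrete sketch of that raw-encoding trick (my code, untested): the objdump output at the top of the question shows that sdiv r0, r0, r1 encodes to 0xe710f110 in ARM mode, so the instruction can be emitted by its encoding alone:
int my_sdiv(int a, int b) /* hypothetical helper */
{
register int r0 asm("r0") = a; /* dividend pinned to r0 */
register int r1 asm("r1") = b; /* divisor pinned to r1 */
/* .long emits the raw ARM-mode encoding of sdiv r0, r0, r1 */
asm(".long 0xe710f110" : "+r" (r0) : "r" (r1));
return r0;
}
This only works when the surrounding function is compiled as ARM (not Thumb) code, since the encoding is mode-specific.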
Note3:
gcc-4.7.2/gcc/config/arm$ grep -E 'TARGET_IDIV|arm_arch_arm_hwdiv|FL_ARM_DIV' *
arm.c:#define FL_ARM_DIV (1 << 23) /* Hardware divide (ARM mode). */
arm.c:int arm_arch_arm_hwdiv;
arm.c: arm_arch_arm_hwdiv = (insn_flags & FL_ARM_DIV) != 0;
arm-cores.def:ARM_CORE("cortex-a7", cortexa7, 7A, ... FL_ARM_DIV
arm-cores.def:ARM_CORE("cortex-a15", cortexa15, 7A, ... FL_ARM_DIV
arm-cores.def:ARM_CORE("cortex-r5", cortexr5, 7R, ... FL_ARM_DIV
arm.h: if (TARGET_IDIV) \
arm.h:#define TARGET_IDIV ((TARGET_ARM && arm_arch_arm_hwdiv) \
arm.h:extern int arm_arch_arm_hwdiv;
arm.md: "TARGET_IDIV"
arm.md: "TARGET_IDIV"